Chapter 3

Indexing Text

Abstract

Data has no value if it cannot be sensibly examined. In past centuries, the index has been the key to searching and retrieving text. Today, it is tempting to think that the index is obsolete, being replaced by the "find" box that pops onto the screen when we press the "Ctrl-F" keys. This is not the case; simple "find" searches cannot cope with the variations and complexities of textual information. A thoughtful index is a reconceptualization of the document that permits rapid retrieval of terms that are related to the search term. An index, aided by proper annotation of data, permits us to understand data in ways that were not anticipated when the original content was collected. With the use of computers, multiple indexes, designed for different purposes, can be created for a single document or data set. As data accrues, indexes can be updated. When data sets are combined, their respective indexes can be merged. A good way of thinking about indexes is that the document contains all of the complexity; the index contains all of the simplicity. Data scientists who understand how to create and use indexes will be in the best position to search, retrieve, and analyze textual data.

Keywords

Index; Concordance; Autocoding; Autoencoding; Page rank; Search; Retrieval

3.1 How Data Scientists Use Indexes

If I had known what it would be like to have it all, I might have been willing to settle for less.

Lily Tomlin

Have computers brought death to the once-familiar index? So I've been told. A friend of mine once suggested that the ubiquitous "search" box has made the index obsolete. If you're reading almost any kind of text on your computer screen, pushing the "Ctrl-F" keys will produce a "search" box that prompts for a word or term, and highlights all the occurrences in the text. Why would anyone go to the trouble of visiting an index, if they can find any term, on the spot, and instantly?

The index may be disparaged today, but there was a time when indexes were considered to be an essential component of every serious text1 (See Glossary item, Indexes). In Wheeler's 1920 book, "Indexing: Principles, Rules and Examples," the author wrote, "The importance of book indexes is so widely recognized and the want of them so frequently deplored that no argument in their favor seems necessary."2 Wheeler was right to believe that indexes are important, but he was wrong to suggest that no argument in their favor is necessary.

Instructional books are not lists of facts, and novels are not the print versions of movies. Every book, regardless of its topic, is a representation of the mind of the author. Every sentence tells us something about how the author interprets reality, and anyone who takes the time to read a single-author book is influenced by the author's conjured worldview. A single-author book is the closest thing we have to a Vulcan mind meld. The index organizes the author's message by reimagining the text. Great indexes provide a way of seeing the world created by the book, often in a manner that was unintended by the book's author. Ultimately, an index gives us an opportunity to grow beyond the text, discovering relationships among concepts that were missed by the author. By all rights, indexers should be given published credit for their creative products, much like authors are given credit for their works.3

Here are a few of the specific strengths of an index that cannot be duplicated by "find" operations on terms entered into a query box.4

1. An index can be read, as a stand-alone document, to acquire a quick view of the book's contents.5

2. When you do a "find" search in a query box, your search may come up empty if there is nothing in the text that matches your query. This can be very frustrating if you know that the text covers the topic entered into the query box. Indexes avoid the problem of fruitless searches. By browsing the index, you can find the term you need, without foreknowledge of its exact wording within the text. When you find a term in the index, you may also find closely related terms, subindexed under your search term, or alphabetically indexed above or below your search term.

3. Searches on computerized indexes are nearly instantaneous, because the index is precompiled. Even when the text is massive (eg, gigabytes, terabytes), information retrieval via an index will be nearly instantaneous.

4. Indexes can be tied to a classification or other specialized nomenclature. Doing so permits the analyst to know the relationships among different topics within the index, and within the text6 (See Glossary items, Nomenclature, Terminology, Vocabulary, Classification, Indexes vs. classifications).

5. Many indexes are cross-indexed, providing a set of relationships among different terms that a clever data analyst might find useful.

6. Indexes can be merged. If the location entries for index terms are annotated with some identifier for the source text, then searches on a merged index will yield locators that point to specific locations from all of the sources.

7. Indexes can be embedded directly in the text.7 Whereas conventional indexes contain locators to the text, embedded indexes are built into the locations where the index term is found in the text, with each location listing other locations where the term can be found. These onsite lists of terms can be hidden from the viewer with formatting instructions (eg, pop-up link tags in the case of HTML). Programmers can reconstitute conventional indexes from location-embedded tags, as required.

8. Indexes can be created to satisfy a particular goal; and the process of creating a made-to-order index can be repeated again and again. For example, if you have a massive or complex data resource devoted to ornithology, and you have an interest in the geographic location of species, you might want to create an index specifically keyed to localities, or you might want to add a locality subentry for every indexed bird name in your original index. Such indexes can be constructed as add-ons, when needed.

9. Indexes can be updated. If terminology or classifications change, there is nothing stopping you from re-building the index with an updated specification, without modifying your source data.

10. Indexes are created after the database has been created. In some cases, the data manager does not envision the full potential of a data resource until after it is built. The index can be designed to encourage novel uses for the data resource.

11. Indexes can occasionally substitute for the original text. A telephone book is an example of an index that serves its purpose without being attached to a related data source (eg, caller logs, switching diagrams).

You'll notice that the majority of the listed properties of indexes were impossible to achieve before the advent of computers. Today, a clever data scientist can prepare innovative and powerful indexes, if she has the following three ingredients:

1. The ability to write simple indexing scripts. Examples of short but powerful scripts for building indexes are provided in this chapter.

2. Nomenclatures with which to collect and organize terms from the text. The search for a specific index term can be greatly enhanced by expanding the search to include all terms that are synonymous with the search term. Nomenclatures organize equivalent terms under a canonical concept.

3. Access to data resources whose indexes can be sensibly merged. In the past several decades, data scientists have gained access to large, public, electronic data resources; perfect for indexing.

Data scientists should think of indexes as a type of data object. As such, an index is programmable, meaning that a savvy programmer can add all manner of functionality to indexes. For example, consider the index created for the Google search engine. The Google index uses a computational method known as PageRank. The rank of a page is determined by two scores: the relevancy of the page to the query phrase; and the importance of the page. The relevancy of the page is determined by factors such as how closely the page matches the query phrase, and whether the content of the page is focused on the subject of the query. The importance of the page is determined by how many Web pages link to and from the page, and the importance of the Web pages involved in the linkages. It is easy to see that the methods for scoring relevance and importance are subject to algorithmic variances, particularly with respect to the choice of measures (ie, the way in which a page's focus on a particular topic is quantified), and the weights applied to each measurement. The reason that PageRank query responses are returned rapidly is that the score of a page's importance is precomputed, and stored in an index, along with the page's Web address. Word matches from the query phrase to Web pages are quickly assembled using an index consisting of words, the pages containing the words, and the locations of the words in the pages.8 The success of PageRank, as employed by Google, is legendary (See Glossary item, Page Rank).9
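To make the idea concrete, here is a minimal Python sketch of a precomputed index that pairs an inverted word index with a per-page importance score. The page names, page texts, importance scores, and scoring formula are all invented for illustration; this is not Google's algorithm, only a demonstration of why precomputed scores make query responses fast.

# A precomputed inverted index plus a precomputed "importance" score for each
# page. At query time, only fast dictionary look-ups are needed; nothing in
# the pages themselves is rescanned or rescored.
pages = {
    "page_A": "the quick brown fox jumps over the lazy dog",
    "page_B": "the lazy dog sleeps all day",
    "page_C": "quick tips for lazy programmers",
}
importance = {"page_A": 0.5, "page_B": 0.2, "page_C": 0.9}   # invented scores

# Build the index: word -> {page: [word positions]}
index = {}
for page, text in pages.items():
    for position, word in enumerate(text.split()):
        index.setdefault(word, {}).setdefault(page, []).append(position)

def search(query):
    # keep only pages containing every query word
    words = query.lower().split()
    candidates = None
    for word in words:
        found = set(index.get(word, {}))
        candidates = found if candidates is None else candidates & found
    # toy ranking: occurrence count weighted by the precomputed page importance
    results = []
    for page in candidates or []:
        relevance = sum(len(index[word][page]) for word in words)
        results.append((relevance * importance[page], page))
    return sorted(results, reverse=True)

print(search("lazy dog"))   # page_A and page_B, highest score first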

An international standard (ISO 999) describes the fundamental principles of indexing (See Glossary item, ISO). Aside from following a few rules for arranging headings and subheadings, the field of computerized indexing is one of the most open and active fields in the data sciences. Indexes should be accepted as a useful device for understanding and simplifying data.

3.2 Concordances and Indexed Lists

From a programmer's point of view, the user is a peripheral that types when you issue a read request.

P. Williams

The easiest type of index to build is the concordance: a list of all the different words contained in a text, with the locations in the text where each word appears (See Glossary item, Concordance). The next easiest index to build is the extracted term index, wherein all of the terms extracted from the text are built into a list, along with the locations of each term. This section will describe computational methods for building both types of basic indexes.

Here, in a short Perl script, concord_gettysbu.pl, are instructions for building the concordance for the Gettysburg address:

#!/usr/local/bin/perl

open (TEXT, "gettysbu.txt");

open(OUT, ">concordance.txt");

undef($/);

$line = <TEXT>;

foreach $word (split(/[\s]+/,$line))

 {

 $word_location = $word_location + 1;

 $locations{$word} = $locations{$word} . ",$word_location"; 

 }

foreach $word (sort keys %locations)

 {

 $locations{$word} =~ s/^[ ,]+//o;

 print OUT "$word $locations{$word}\n";

 }

exit;

The script parses the words of the text, and collects the locations where each word occurs (as the distance, in words, from the start of the file), and produces an ASCIIbetical listing of the words, with their respective locations. The concordance is deposited in the file “concordance.txt.” We will be using this file in two more scripts that follow in this section. The first few lines of output are shown:

But 102

Four 1

God 240

It 91,160,186

Now 31

The 118,139

We 55,65

a 14,36,59,70,76,104,243

above 131

add 136

advanced. 185

ago 6

all 26

altogether 93

A concordance script, in Python:

#!/usr/local/bin/python

import re

import string

sentence_list = []

word_list = []

word_dict = {}

format_list = []

count = 0

in_text = open('gettysbu.txt', "r")

in_text_string = in_text.read()

in_text_string = in_text_string.replace("\n"," ")

in_text_string = re.sub(r' +', ' ', in_text_string)

sentence_list = re.split(r'[.!?] +(?=[A-Z])',in_text_string)

for sentence in sentence_list:

    count = count + 1

    sentence = string.lower(sentence)

    word_list = sentence.split()

    for word in word_list:

        if word_dict.has_key(word):

            word_dict[word] = word_dict[word] + ',' + str(count)

        else:

            word_dict[word] = str(count)

keylist = word_dict.keys()

keylist.sort()

for key in keylist:

 print key, word_dict[key]

exit

A concordance script, in Ruby10:

#!/usr/local/bin/ruby

f = File.open("gettysbu.txt")

wordplace = Hash.new(""); wordarray = Array.new

f.each do

 |line|

 line.downcase!

 line.gsub!(/[ˆa-z]/," ")

 wordarray = line.split.uniq

 next if wordarray == []

 wordarray.each{|word| wordplace[word] = "#{wordplace[word]} #{f.lineno}"}

 wordarray = []

end

wordplace.keys.sort.each{|key| puts "#{key} #{wordplace[key]}"}

exit

At this point, building a concordance may appear to be an easy, but somewhat pointless exercise. Does the concordance provide any functionality beyond that provided by the ubiquitous "search" box? There are five very useful properties of concordances that you might not have anticipated.

1. You can always reconstruct the original text from the concordance. Hence, after you've built your concordance, you can discard the original text.

2. You can merge concordances without forfeiting your ability to reconstruct the original texts, just as long as you tag locations with some character sequence that identifies the text of origin.

3. You can use a concordance to search for the locations where multi-word terms appear.

4. You can use the concordance to retrieve the sentences and paragraphs in which a search word or a search term appears, in the absence of the original text. The concordance alone can reconstruct and retrieve the appropriate segments of text, on-the-fly, thus bypassing the need to search the original text.

5. A concordance provides a profile of the book, and can be used to compute a similarity score among different books.

Perhaps most amazing is that all five of these useful properties of concordances can be achieved with minor modifications to one of the trivial scripts, vide supra, that build the concordance.

There's insufficient room to explore and demonstrate all five properties of concordances, but let's examine a script, concord_reverse.pl, that reconstructs the original text, from a concordance.

#!/usr/local/bin/perl

open (TEXT, "concordance.txt")||die;

$line = " ";

while ($line ne "")

 {

 $line = <TEXT>;

 $line =~ s/\n//o;

 $line =~ / /;

 $location_word = $`;

 @location_array = split(/,/,$');

 foreach $location (@location_array)

 {

 $concordance_hash{$location} = $location_word;

 }

 }

$n = 1;

while (exists($concordance_hash{$n}))

 {

 print $concordance_hash{$n} . " ";

 $n = $n + 1;

 }

exit;

Here are the first few lines of output:

Four score and seven years ago our fathers brought forth on this continent a new

nation, conceived in liberty and dedicated to the proposition that all men are

created equal. Now we are engaged in a great civil war, testing whether that 

nation or any nation so conceived and so dedicated can long endure. We are met on 

a great battlefield of that war. We have come to dedicate a portion of that field

Had we wanted to write a script that produces a merged concordance for multiple books, we could have simply written a loop that repeated the concordance-building process for each text. Within the loop, we would have tagged each word location with a short notation indicating the particular source book. For example, locations from the Gettysburg address could have been prepended with "G:" and locations from the Bible might have been prepended with a "B:"
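Here is a minimal Python sketch of such a merged concordance. The source tags follow the example just given ("G:" for the Gettysburg address, "B:" for the Bible); the file name bible.txt is an assumption, chosen only for illustration.

# Merge the concordances of two texts, tagging every word location with a
# prefix that identifies its source ("G:" or "B:").
sources = {"G": "gettysbu.txt", "B": "bible.txt"}
merged = {}
for tag, filename in sources.items():
    with open(filename) as infile:
        for location, word in enumerate(infile.read().split(), start=1):
            merged.setdefault(word, []).append(tag + ":" + str(location))

with open("merged_concordance.txt", "w") as outfile:
    for word in sorted(merged):
        outfile.write(word + " " + ",".join(merged[word]) + "\n")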

Here is a Perl script, proximity_reconstruction.pl, that prints out a sequence of words flanking the word "who" in every location of the Gettysburg address where the word "who" appears, working exclusively from a concordance.

#!/usr/local/bin/perl

open (TEXT, "concordance.txt")||die;

$line = " ";

while ($line ne "")

 {

 $line = <TEXT>;

 $line =~ s/\n//o;

 $line =~ / /;

 $location_word = $`;

 @location_array = split(/,/,$');

 $word_hash{$location_word} = [@location_array];

 foreach $location (@location_array)

 {

 $concordance_hash{$location} = $location_word;

 }

 }

@word_locations_array = @{$word_hash{"who"}};

foreach $center_place (@word_locations_array)

 {

 $n = 1;

 print "Match.. ";

 while ($n < 11)

 {

 $location = $n + $center_place -5;

 print $concordance_hash{$location} . " ";

 $n = $n + 1;

 }

 print "\n";

 }

exit;

Here is the output of the proximity_reconstruction.pl script, which searches for words in the proximity of "who" appearing in the Gettysburg address.

c:\ftp>proximity_reconstruction.pl

Match.. final resting-place for those who here gave their lives that

Match.. men, living and dead who struggled here have consecrated it

Match.. unfinished work which they who fought here have thus far

Notice that the chosen search word, "who," sits in the center of each line.

Using the provided script as a template, you should be able to write your own scripts, in Perl, Python, or Ruby, that can instantly locate any chosen single word or multi-word terms, producing every sentence or paragraph in which any search term is contained. Because concordances are precompiled, with the locations of every word, your search and retrieval scripts will run much faster than scripts that rely on Regex matches conducted over the original text (See Glossary item, Regex). Speed enhancements will be most noticeable for large text files.
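As a starting point, here is a minimal Python sketch that locates a multi-word term using nothing but the precompiled concordance; it assumes the concordance.txt format produced earlier in this section (a word, followed by a comma-separated list of word locations). A term is present wherever its words occupy consecutive locations.

# Multi-word term look-up performed entirely against a precompiled concordance.
def load_concordance(filename="concordance.txt"):
    concordance = {}
    with open(filename) as infile:
        for line in infile:
            word, locations = line.split()
            concordance[word] = [int(loc) for loc in locations.split(",")]
    return concordance

def find_term(term, concordance):
    # a term occurs wherever its words occupy consecutive word locations
    words = term.split()
    hits = []
    for start in concordance.get(words[0], []):
        if all((start + offset) in concordance.get(word, [])
               for offset, word in enumerate(words[1:], start=1)):
            hits.append(start)
    return hits

concordance = load_concordance()
print(find_term("gave their lives", concordance))   # starting word locations of the phrase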

Aside from reconstructing the full text from the concordance, a concordance provides a fairly good idea of the following information:

1. The topic of the text, based on the words appearing in the concordance. For example, a text having "begat" and "anointed" and "thy" is most likely to be the Old Testament.

2. The complexity of the language. A complex or scholarly text will have a larger vocabulary than a romance novel.

3. A precise idea of the length of the text, achieved by adding all of the occurrences of each of the words in the concordance (see the short sketch following this list).

4. The co-locations among words (ie, which words often precede or follow one another).

5. The care with which the text was prepared, achieved by counting the misspelled words.
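As an illustration of items 2 and 3, here is a minimal Python sketch that reads the vocabulary size and the total word count of a text directly from its concordance, without consulting the original text. It assumes the concordance.txt format produced earlier in this section.

# Vocabulary size and total text length, computed directly from a concordance.
vocabulary_size = 0
total_words = 0
with open("concordance.txt") as infile:
    for line in infile:
        word, locations = line.split()
        vocabulary_size = vocabulary_size + 1
        total_words = total_words + len(locations.split(","))

print("Vocabulary size: " + str(vocabulary_size))
print("Total words in text: " + str(total_words))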

3.3 Term Extraction and Simple Indexes

I was working on the proof of one of my poems all the morning, and took out a comma. In the afternoon I put it back again.

Oscar Wilde

Terms are phrases, most often noun phrases, or sometimes individual words, that have a precise meaning within a knowledge domain. Indexing software extracts and organizes the terms included in a text, providing all the text locations where each of the collected terms occurs.

Indexing can be a difficult process, requiring creativity. A good indexer will always attempt to answer the following four questions, before beginning a new project.

1. How do I find the relevant terms in the text? It is easy to collect the words in a text, but the concept of a term is highly subjective.

2. Which terms should be excluded? Perhaps the index should be parsimonious, including only the most important terms. After it is decided which terms are important and which terms are not, how shall we handle important terms that appear hundreds of times throughout the text? Is it realistic to think that anyone would use an index wherein a single term may appear at hundreds of locations in the text? Which locations, if any, of commonly occurring terms can be omitted from the index?

3. How should I organize the terms in the index? Which terms need to be cross-indexed, or subindexed under other terms?

4. For whom am I creating the index, and for what purposes? An index for an ebook will be used quite differently than an index for a printed book. An index for a terabyte corpus of text will be used differently than an index for an ebook.

Putting aside epistemology for the moment, let us examine the mechanics of extracting terms of a particular type or pattern, and annotating their locations. In Open Source Tools for Chapter 2, we demonstrated how regular expressions could be used to extract a list of proper names (ie, given name plus family name), from a public domain book, downloaded from Project Gutenberg (See Glossary item, Project Gutenberg). We will use the same book here, under filename "english_lit.txt." The short Perl script, namesget2.pl, extracts proper names and adds their byte locations (ie, the number of bytes counting from the start of the text).

#!/usr/local/bin/perl

undef($/);

open (TEXT, "english_lit.txt");

$line = <TEXT>;

while ($line =~ /[A-Z][a-z]+[\s]{1}[A-Z][a-z]+/g)

 {

 $name = $&;

 $name_place = length($`);

 $name =~ s/\n/ /;

 next if ($name !~ /[A-Z][a-z]+/);

 $namelist{$name} = "$namelist{$name}, $name_place";

 }

@namelist = sort(keys(%namelist));

foreach $name (@namelist)

 {

 print "$name $namelist{$name}\n";

 }

exit;

The first two dozen lines of output, from the Perl script namesget2.pl, are shown:

Abbess Hilda , 75574, 75713

About Project , 1302966

Abraham Cowley , 409655

Abraham Cowper , 1220412

Abt Vogler , 997769, 1003750, 1008995

Academy Classics , 1212052

Adam Bede , 1073581

Adam Smith , 628914, 785572, 932577

Adelaide Procter , 1026546

Adelaide Witham , 1186897

Ado About , 315615

After Alfred , 91274

Again Beowulf , 41291, 42464

Albion Series , 96018

Aldine Edition , 942820

Aldine Poets , 211090

Aldine Series , 495306

Alexander Pope , 12315, 751310

Alexander Selkirk , 674969, 730407

Alfred Lord , 1189113

Alfred Tennyson , 13458, 1189448

Algernon Charles , 1024420, 1196870

Alice Brown , 1199231

The Perl script for extracting proper names from a text file looks for a specific pattern, indicating the likely location of a proper name.

while ($line =~ /[A-Z][a-z]+[\s]{1}[A-Z][a-z]+/g)

In this case, the script extracts pairs of words in which each word begins with an uppercase letter. A byproduct of the pattern match is the string preceding the match (designated as the special Perl variable, "$`"). The length of the string contained in the "$`" variable is the byte location of the pattern match (ie, the byte location of the term that we hope will be a proper name).

Note that we prefer, in this case, to use the byte location of names, rather than page location. The concept of the page number is fading fast into obsolescence, as modern e-readers allow users to customize the appearance of text on their screens (eg, font size, font type, page width, image size). In the case of the e-reader, the page number has no absolute value. The story is not much different for printed books. With print-on-demand services, one book may be published with variable pagination, depending on the print settings selected. Basically, page numbers are a concept that adds complexity, without adding much useful information. Nonetheless, if indexes with page numbers are desired, it is simple enough to create a look-up table wherein ranges of character locations are mapped to page numbers. With such a table, an index annotated with character locations (ie, byte locations) could be transformed into an index annotated with page numbers, with just a few lines of code.
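Here is a minimal Python sketch of such a look-up table, using a sorted list of page-starting byte offsets; the page boundaries shown are invented for illustration.

# A look-up table that maps byte locations to page numbers. The page_starts
# list (the byte offset at which each page begins) is invented for this sketch.
import bisect

page_starts = [0, 2100, 4300, 6450, 8800]

def page_of(byte_location):
    return bisect.bisect_right(page_starts, byte_location)   # pages numbered from 1

print(page_of(75574))   # a byte location from the index, mapped to its page number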

In most cases, good indexes cannot be built by looking for specific patterns, such as we might find for proper names. The indexer must be very clever. There are several available methods for finding and extracting index terms from a corpus of text,11 but no method is as simple, fast, and scalable as the "stop word" method12 (See Glossary items, Term extraction algorithm, Scalable). The "stop word" method operates under the empiric observation that all text is composed of words and phrases that represent specific concepts, connected by high-frequency words of minimal information content.4

Consider the first sentence from James Joyce's "Ulysses": "Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed."

The terms in the sentence are the noun phrases, the literary equivalents of data objects: Buck Mulligan, stairhead, bowl of lather, mirror, razor. The remainder of the sentence consists of words that describe data objects (ie, metadata relating the reader to the data object or relating data objects to one another). The ways that noun phrases relate to one another are somewhat limited. If we simply deleted all of the relational words from a sentence, we would be left with a list of noun phrases, and the noun phrases, with their locations, could populate our index.

Here are a few examples:

"The diagnosis is chronic viral hepatitis."

"An epidural hemorrhage can occur after a lucid interval."

Let's eliminate sequences of relational words and replace them with a delimiter sequence, in this case "***."

"*** diagnosis *** chronic viral hepatitis."

"*** epidural hemorrhage *** lucid interval.”

The indexable terms are the word sequences that remain when the delimiter sequences are removed:

diagnosis

chronic viral hepatitis

epidural hemorrhage

lucid interval

Let's write a simple Perl script that removes common relational terms from a text, extracting the remaining word sequences. All we need is a list of commonly occurring relational terms, and a public domain text file. For our sample text, we will use the same text file, english_lit.txt, that we used earlier in this section, and in Open Source Tools for Chapter 2. Our list of common words is sometimes called a stop word list or a barrier word list, because it contains the words that demarcate the beginnings and endings of indexable terms (See Glossary item, Stop words). The Perl script, name_index.pl, includes our stop word list as an array variable.

#!/usr/local/bin/perl

@stoparray = qw(a about above according across actual actually added after afterwards again against ahead all almost alone along already also although always am among amongst amount an and another any anyhow anyone anything anyway anywhere approximately are arising around as at award away back be became because become becomes becoming been before beforehand behind being below beside besides best better between beyond birthday both bottom but by call can cannot can't certain come comes coming completely computer concerning consider considered considering consisting could couldn't cry describe despite detail did discussion do does doesn't doing done down due during each eight either eleven else elsewhere empty enough especially even ever every everyone everything everywhere except few fifteen fifty fill find fire first five followed following for former formerly forty forward found four from front full further get give given giving go had hardly has hasn't have haven't having he hence her here hereafter hereby herein hereupon hers herself him himself his honor how however hundred i if in indeed inside instead interest into is isn't it items its itself just keep largely last later latter least less let lets like likely little look looks made mainly make makes making many may maybe me meantime meanwhile meet meets might million mine miss more moreover most mostly move mr mrs much must my myself name namely near nearly neither never nevertheless next nine ninety no nobody none nonetheless nor not nothing now nowhere obtain obtained of off often on once one only onto or other others otherwise our ours ourselves out outside over overall own part per perhaps please possible possibly previously put quite rather really recent recently regarding reprinted resulted resulting same see seem seemed seeming seems seen serious seven seventy several she should shouldn't show showed shown shows side significant significantly since sincere six sixty so so-called some somehow someone something sometime sometimes somewhere still stop studies study such suggest system take taken takes taking ten than that the their theirs them themselves then there thereafter thereby therefore therein thereupon these they thick thin third thirty this those though thousand three through throughout thru thus to together too top toward towards trillion twelve twenty two under undergoing unless unlike unlikely until up upon upward us use used using various versus very via was way ways we well were weren't what whatever whats when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whomever whos whose why will with within without would wouldn't yes yet you your yours yourself yourselves );

open (TEXT, "english_lit.txt");

open (OUT, ">phrase_list.txt");

undef($/);

$whole_book = <TEXT>;

$whole_book =~ s/[\n]/ /g;

foreach $stopword (@stoparray)

 {

 $whole_book =~ s/-/ /g;

 $whole_book =~ s/ +$stopword +/ vvvvv /ig;

  }

@sentence_array = split(/[a-z]+\. +[A-Z]/, $whole_book);

foreach $sentence (@sentence_array)

 {

 $sentence = lc($sentence);

 push(@phrasearray, split(/ *vvvvv */, $sentence)); 

 }

@phrasearray = grep($_ ne $prev && (($prev) = $_), sort(@phrasearray));

print OUT join("\n", @phrasearray);

exit;

The output file contains over 40,000 terms, extracted from a plain-text book of English literature, 1.3 megabytes in length. Here's a short sampling from the output file, phrase_list.txt:

shelley's "adonais,"

shelley's _adonais_.

shelley's characters

shelley's crude revolutionary doctrines.

shelley's father,

shelley's greater mood.

shelley's influence

shelley's italian life.

shelley's longer poems. _adonais_

shelley's miscellaneous works,

shelley's poem

shelley's poem _adonais_,

shelley's poetry.

shelley's poetry?

shelley's revolt

shelley's revolutionary enthusiasm,

shelley's works.

.

.

.

short descriptive passages

short descriptive poems

short english abstracts.

short essays

short halves, separated

short hymns

short miscellaneous poems

short period

short poem reflecting

short poems

short poems expressing

short poems suggested

short poems, edited

short space

short span

short stanzas,

short stories

short story teller

short story,

short sword

This simple algorithm, or something much like it, is a fast and efficient method to build a collection of index terms. The working part of the script, that finds and extracts index terms, uses under a dozen lines of code. Notice that the algorithm breaks the text into sentences, before substituting a delimiter (ie, "vvvvv") for the stop word sequences. This is done to avoid the extraction of sequences that overlap sentences.

Here is a Python script, extractor.py, that will extract term phrases from any file that is composed of a list of sentences (ie, a text file with one sentence per line of file).12 The stop words are expected to reside in the file, stop.txt, with one stop word per line of file.

#!/usr/local/bin/python

import re, string

item_list = []

stopfile = open("stop.txt",'r')

stop_list = stopfile.readlines()

stopfile.close()

in_text = open('list_of_sentences.txt', "r")

count = 0

for line in in_text:

 count = count + 1

 for stopword in stop_list:

  stopword = re.sub(r'\n', '', stopword)

  line = re.sub(r' +' + stopword + r' +', '\n', line)

 item_list.extend(line.split("\n"))

item_list = sorted(set(item_list))

out_text = open('phrases.txt', "w")

for item in item_list:

 print>>out_text, item

exit

Here is an equivalent Ruby script, extractor.rb.12

#!/usr/local/bin/ruby

phrase_array = []

stoparray = IO.read("stop.txt").split(/\n/)

sentence_array = IO.read("list_of_sentences.txt").split(/\n/)

out_text = File.open("phrases.txt", "w")

sentence_array.each do

 |sentence|

 stoparray.each do

 |stopword|

 sentence.gsub!(/ +#{stopword} +/, "\n") if sentence.include? stopword

 end

 phrase_array = phrase_array + sentence.split(/\n/)

end

out_text.puts phrase_array.sort.uniq

exit

The output is an alphabetic file of the phrases that might appear in a book's index.

3.4 Autoencoding and Indexing with Nomenclatures

The beginning of wisdom is to call things by their right names.

Chinese proverb

Nomenclatures are listings of terms that cover all of the concepts in a knowledge domain (See Glossary items, Nomenclature, Thesaurus, Vocabulary, Dictionary). A nomenclature is different from a dictionary for three reasons: (1) the nomenclature terms are not annotated with definitions, (2) nomenclature terms may be multi-word, and (3) the terms in the nomenclature are limited to the scope of the selected knowledge domain. In addition, most nomenclatures group synonyms under a group code. For example, a food nomenclature might collect submarine, hoagie, po' boy, grinder, hero, and torpedo under an alphanumeric code such as "F63958." The canonical concepts listed in a nomenclature are typically organized into a hierarchical classification.13,14

Nomenclatures simplify textual documents by uniting synonymous terms under a common code. Extending the example of the submarine sandwich, you can imagine that if a text document were to attach the code "F63958" to every textual occurrence of any of the synonyms of submarine sandwich, then it would be a simple matter to write a script that retrieved every paragraph in which "submarine sandwich" occurred, as well as every paragraph in which any and all of its synonymous terms occurred.
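Here is a minimal Python sketch of that kind of retrieval. The code "F63958" and the synonym list follow the example above; the file of cooking magazine text, cooking_magazines.txt, is an assumption, and paragraphs are assumed to be separated by blank lines.

# Retrieve every paragraph that mentions the concept, whether by its
# nomenclature code or by any of its synonymous terms.
import re

code = "F63958"
synonyms = ["submarine sandwich", "hoagie", "po' boy", "grinder", "hero", "torpedo"]

with open("cooking_magazines.txt") as infile:
    paragraphs = infile.read().split("\n\n")   # assumes blank-line paragraph breaks

pattern = re.compile("|".join(re.escape(term) for term in [code] + synonyms),
                     re.IGNORECASE)
for paragraph in paragraphs:
    if pattern.search(paragraph):
        print(paragraph + "\n")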

When you use the "find" or "search" dialog boxes available to word processors and e-readers, the search routine locates every occurrence of your search term in the text that you are reading. If you are lucky, the "find" box will support either a Regex or a "wildcard" search, retrieving every location matching a particular pattern. Under no circumstances will a "find" box support a search that retrieves all of the occurrences of every word or term that is synonymous with your query. That is the job of an index.

You may be thinking that a clever person, fully aware of what she is looking for, will know the synonyms for a search term and will simply repeat her "find" box query with each alternate term, or, if the "find" box permits, will execute an OR command, listing all the synonyms that apply, in a single query. In practice, this never happens. Individuals are seldom equipped with the determination, patience, and expertise to formulate on-the-spot, comprehensive queries. A well-crafted nomenclature will contain synonymous terms that are unlikely to be anticipated by any individual. As an example, consider the Developmental Lineage Classification and Taxonomy of Neoplasms; this taxonomy contains 120 synonyms listed for "liver cancer."13-17 Here are just a few of 120 synonymous or plesionymous terms collected under the nomenclature code, "C3099000":

adenoca of the liver

adenocarcinoma arising in liver

adenocarcinoma involving liver

liver with cancer

liver carcinoma

carcinoma arising in liver cells

hcc - hepatocellular carcinoma

primary liver carcinoma

hepatic carcinoma

hepatoma

hepatocarcinoma

liver cell carcinoma

Nomenclatures play an important role in data simplification by providing the synonymy required for comprehensive indexing and retrieval of nomenclature concepts. The process of attaching a nomenclature code to a fragment of text is called "coding," an act not to be confused with the programmer's use of "coding" to mean developing a software program. In the medical field, coding has become something of an industry. Healthcare providers hire armies of "coders" to attach disease codes and billing codes to the medical reports and transactions prepared for electronic medical records. Mistakes in coding can have dire consequences. In 2009, the Department of Veterans Affairs sent out hundreds of letters to veterans with the devastating news that they had contracted Amyotrophic Lateral Sclerosis, also known as Lou Gehrig's disease, a fatal degenerative neurologic condition. About 600 of the recipients did not, in fact, have the disease. The VA retracted these letters, attributing the confusion to a coding error.18 Coding text is difficult. Human coders are inconsistent, idiosyncratic, and prone to error. Coding accuracy for humans seems to fall in the range of 85% to 90%.19

Try as they might, human coders cannot keep up with the terabytes of data produced each week by modern information systems. Consequently, there is a great need for fast and accurate software capable of automatic coding, alternatively called autocoding or autoencoding (See Glossary items, Coding, Autoencoding). Autocoding algorithms involve parsing text, word by word, looking for exact matches between runs of words and entries in a nomenclature.20,21 When a match occurs, the words in the text that matched the nomenclature term are assigned the nomenclature code that corresponds to the matched term.

Here is one possible algorithmic strategy for autocoding the sentence: "Margins positive malignant melanoma." For this example, you would be using a nomenclature that lists all of the tumors that occur in humans. Let us assume that the terms "malignant melanoma" and "melanoma" are included in the nomenclature. They are both assigned the same code, for example "Q5673013," because the people who wrote the nomenclature considered both terms to be biologically equivalent.4

Let's pretend that we are computers, tasked with autocoding the diagnostic sentence, "Margins positive malignant melanoma":

1. Begin parsing the sentence, one word at a time. The first word is "Margins." You check against the nomenclature, and find that "margins" is not a term listed in the nomenclature. Save the word "margins." We'll use it in step 2.

2. You go to the second word, "positive" and find no matches in the nomenclature. You retrieve the former word "margins" and check to see if there is a 2-word term, "margins positive" listed in the nomenclature. There is not. Save "margins" and "positive" and continue.

3. You go to the next word, "malignant." There is no match in the nomenclature. You check to determine whether the 2-word term "positive malignant" and the 3-word term "margins positive malignant" are in the nomenclature. They are not.

4. You go to the next word, "melanoma." You check and find that melanoma is in the nomenclature. You check against the two-word term "malignant melanoma," the three-word term "positive malignant melanoma," and the four-word term "margins positive malignant melanoma." There is a match for "malignant melanoma" but it yields the same code as the code for "melanoma."

5. The autocoder appends the code, "Q5673013" to the sentence, and proceeds to the next sentence, where it repeats the algorithm.

The algorithm seems like a lot of work, requiring many comparisons, but it is actually quite efficient. A complete nomenclature, with each nomenclature term paired with its code, can be held in a single variable, in rapid access memory (See Glossary item, Variable). Look-ups to determine whether a word or phrase is included in the nomenclature are fast. As it happens, there are methods that will speed things along much faster than our sample algorithm. My own previously published method can process text at a rate more than 1000-fold faster than natural language methods.20

Implementations of a very fast autocoder are available in my previously published manuscripts and books.20-22,10,12 Complete implementations are too large to include here, but the block containing the basic parsing and matching algorithm can be expressed in a few lines of code.

The Ruby code snippet for extracting terms in a sentence that match a nomenclature item, and producing its nomenclature code (ie, autocoding the terms in a text), is shown:

absfile.each do

 |sentence|

 sentence.chomp!

 outfile.puts " Abstract title…" + sentence.capitalize + "."

 sentence_array = sentence.split

 length = sentence_array.size

 length.times do

 (1..sentence_array.size).each do

 |place_length|

 phrase = sentence_array.slice(0,place_length).join(" ")

 if literalhash.has_key?(phrase)

 outfile.puts "Term…" + phrase.capitalize + " " + literalhash[phrase]

 end 

 end

 sentence_array.shift

 end

end

Here is the equivalent Python code:

for line in absfile:

    sentence = line

    sentence = sentence.rstrip()

    sentence_array = sentence.split(" ")

    length = len(sentence_array)

    for i in range(length):

        for place_length in range(len(sentence_array)):

            last_element = place_length + 1

            phrase = ' '.join(sentence_array[0:last_element])

            if literalhash.has_key(phrase):

                print>>outfile,"Term…" + phrase + " " + literalhash[phrase]

        sentence_array.pop(0)

exit

Here's the equivalent Perl code:

$line = " ";

while($line ne "")

 {

 $line = <ABSFILE>;

 $sentence = $line;

 $sentence =~ s/\n//o;

 @sentence_array = split(/ /,$sentence);

 $cycles = scalar(@sentence_array);

 for($n=0;$n<$cycles;$n++)

 {

 for($i=0;$i<scalar(@sentence_array);$i++) 

 {

 @part_array = @sentence_array[0..$i];

 $phrase = join(" ", @part_array);

 if (exists($literalhash{$phrase}))

 {

 print OUTFILE "Term…" . ucfirst($phrase) . " " . $literalhash{$phrase} . "\n";

 }

 }

 shift(@sentence_array);

 }

 }

exit;

In all three implementations, the first order of business (not shown) involves building the associative array, or dictionary object, for the nomenclature (See Glossary items, Associative array, Hash). This data structure is assigned the variable "literalhash," or "$literalhash" in Perl (See Glossary item, Metasyntactic variable). The "literalhash" object consists of key/value pairs, wherein every unique nomenclature term is a key matched with its nomenclature code value. Once the dictionary object is prepared, a simple word-by-word parsing routine, vide supra, matches nomenclature phrases encountered in a body of text with their nomenclature codes.
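As an illustration, here is a minimal Python sketch of that preparatory step, building a "literalhash" dictionary from a nomenclature file. The file name and its format (one term and its code per line, separated by a tab) are assumptions made for the sketch; real nomenclature files come in many formats.

# Build the "literalhash" dictionary from a nomenclature file. The assumed
# format is one term-code pair per line, separated by a tab, for example:
# hepatocellular carcinoma<TAB>C3099000
literalhash = {}
with open("nomenclature.txt") as infile:
    for line in infile:
        term, nomenclature_code = line.rstrip("\n").split("\t")
        literalhash[term.lower()] = nomenclature_code

# The parsing loops shown above can now test each candidate phrase against
# literalhash and retrieve its code with literalhash[phrase].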

Autocoding projects never end. Once you think you've finished annotating every term with its correct nomenclature code, thus ensuring that synonymous and plesionymous terms can be found and retrieved together, you learn that the whole job must be repeated.20,23 Why is autocoding a Sisyphean task? Several reasons:

(1) Nomenclatures keep changing: Newer versions of nomenclatures may have a new set of codes that are incompatible with the old set of codes. It happens all the time. In the medical field newer versions of the International Classification of Diseases are not fully compatible with older versions, necessitating double coding (eg, applying new and old versions of the International Classification of Diseases codes to the U.S. mortality tables). Nomenclature revisions are a common consequence of terminology expansions. As a nomenclature increases in size and scope, preexisting relationships among concepts may not always accommodate newer concepts introduced into later versions of the same nomenclature.24 For example, an automotive nomenclature from the 1980s would list the DeLorean as a manufactured car, but a current nomenclature might list it as a discontinued model, or a classic, or an antique. Needless to say, nomenclatures are overhauled when two or more nomenclatures merge.

(2) Data resources often need to switch nomenclatures: When old information systems are replaced with new systems, vendors of the newer system may impose a different nomenclature on their data. The codes preserved in the legacy system become unusable, unless they can be directly mapped to codes in the new system (See Glossary item, Nomenclature mapping). Because concepts and terms among nomenclatures will differ, even when both nomenclatures cover the same knowledge domain, mapping projects are seldom successful. In most cases, the legacy data is either ignored, or it is recoded with the new nomenclature.

(3) Data resources may need to be multi-coded: Using several different nomenclatures, at once, to satisfy the different needs of different data users.

Is there a way to simply avoid coding, and recoding, without sacrificing the advantages of nomenclature-based data retrieval? Yes, there is a way.25 Textual data can be interrogated, retrieving all the data that is relevant to a search term or its synonymous terms, using any nomenclature, and without bothering to preannotate the data source with nomenclature codes. Here is how it is done:

1. Pick a nomenclature, any nomenclature, that covers the knowledge domain of the text that you will be searching. If you like, you can choose more than one nomenclature. It makes no difference. For the sake of this exercise, the text will be a collection of magazines that focus on cooking. We'll choose a nomenclature that covers types of food items.

2. Prepare a concordance from the text that you'll be searching (as previously described in Section 3.2).

3. Select a demonstration search term. We'll pick “submarine sandwich.”

4. Retrieve all synonyms for “submarine sandwich.” The nomenclature groups equivalent terms together, so finding the synonyms and plesionyms for “submarine sandwich” is instantaneous. The synonym list found in the nomenclature is: “submarine sandwich, hoagie, po' boy, grinder, hero, and torpedo.”

5. Loop through the list of synonyms, and use the concordance to give you every location where the first word of the nomenclature item (eg, “submarine” in “submarine sandwich”) appears in the text. If the item consists of more than one word, as in the case of “submarine sandwich,” use the concordance to exclude locations where the word “submarine” is not immediately followed by the word “sandwich.” For example, if “submarine” is found at word location 2741, then we would need to find the word “sandwich” at location 2742. Failing that, location 2741 would be deleted from the list of locations where the term “submarine sandwich” is found. Repeat this process for every synonym of “submarine sandwich.”

6. At the end of the process, you will have retrieved the location of every term in your text corpus that matches any of the nomenclature's synonyms for “submarine sandwich.”

You may be thinking that there are a lot of steps to this process and that the time to complete a nomenclature search must surely exceed the length of time for a simple look-up using your favorite application's search utility. Actually, this is not the case. On-the-fly nomenclature searches are extremely fast because every step involves locating ordered items in pre-compiled lists (ie, the pre-compiled nomenclature list and the pre-compiled concordance). The algorithms for searches on ordered lists are much faster than searches conducted by parsing through every character of a long text, while hunting for a word pattern match. Using any nomenclature of your choice, you can find all matches to all synonymous terms from a body of text of any size, with near-instantaneous speed. Public domain Perl code for the "on-the-fly autocoding" script, along with a full description of the algorithm, is available in an open access publication (See Glossary item, Public domain).25
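Here is a minimal Python sketch of the on-the-fly method, reusing the consecutive-location trick described in step 5. The synonym list follows the submarine sandwich example; the concordance here is built from a single invented sentence, standing in for the concordance of a full text corpus.

# On-the-fly synonym search against a concordance: no codes are attached to
# the text; the nomenclature's synonym list is expanded at search time.
def term_locations(term, concordance):
    # a term occurs wherever its words occupy consecutive word locations (step 5)
    words = term.lower().split()
    hits = []
    for start in concordance.get(words[0], []):
        if all((start + n) in concordance.get(w, [])
               for n, w in enumerate(words[1:], start=1)):
            hits.append(start)
    return hits

def synonym_search(synonyms, concordance):
    # steps 5 and 6: loop through the synonyms and pool their locations
    return dict((term, term_locations(term, concordance)) for term in synonyms)

# toy concordance built from a single invented sentence
text = "the shop sells a hoagie and a submarine sandwich every day"
concordance = {}
for location, word in enumerate(text.split(), start=1):
    concordance.setdefault(word, []).append(location)

synonyms = ["submarine sandwich", "hoagie", "po' boy", "grinder", "hero", "torpedo"]
print(synonym_search(synonyms, concordance))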

3.5 Computational Operations on Indexes

She was incapable of saying please, incapable of saying thank you and incapable of saying sorry, all the while creating a surge in the demand for these expressions.

Edward St. Aubyn, in his book, "At Last"

In Section 3.1, "How Data Scientists Use Indexes," the claim was made that indexes are programmable objects, and can be used for purposes that were unimaginable prior to the advent of computers (See Glossary item, Burrows-Wheeler transform). In this section, we will see that an index of terms can be used in an encryption protocol. Encryption protocols are often specified as a sequence of mutually accepted actions between two entities (See Glossary item, Encryption). For our example, let us pretend that the two entities are Alice and Bob, and that they need to negotiate a confidential exchange of information.26 A generalized confidentiality problem can be presented as a negotiation protocol between Alice and Bob.26

Bob has a file containing the medical records of millions of patients. Alice has secret software that can annotate Bob's file, enhancing its value many-fold. Alice won't give Bob her secret algorithm, but is willing to demonstrate the algorithm if Bob gives her his database. Bob won't give Alice the database, but he can give her little snippets of the database containing insufficient information to match patients with records.

Bob prepares an algorithm that transforms his file into two pieces. Piece 1 is a file that contains all of the phrases (ie, extracted terms) from the original file with each phrase attached to its one-way hash value (See Glossary item, One-way hash). We will be learning more about one-way hashes in Chapter 5. For now, all you need to know is that a one-way hash value is a fixed-length pseudorandom sequence of characters computed on, in this case, fragments of text. The one-way hash has two very important properties: (1) a given phrase will always yield the same one-way hash value when operated on by the one-way hash algorithm and (2) there is no feasible way to determine the phrase by inspecting or manipulating the hash value. This second property holds true even if the hashing algorithm is known. Bob will give Alice Piece 1.

Piece 2 is a file wherein each phrase from the original file is replaced by its one-way hash value. High frequency words (ie, stop words) are left in place in Piece 2. Piece 2 and Piece 1 are used to reconstruct the original text or an annotated version of the original text, using Alice's modifications to Piece 1. The reconstruction algorithm simply steps through all the character strings found in Piece 2. When it encounters a hash-value, the algorithm looks at the list of hash-values in Piece 1 and substitutes the phrase associated with the hash-value back into the Piece 2 file. This continues until the end of Piece 2 is reached, at which time the Piece 2 file has been restored as the original file (plus any annotations that Alice may have added to the terms in Piece 1). This completes the confidential negotiation.

The following is an example of a single line of Bob's text that has been converted into two pieces according to the described algorithm.

Here is Bob's original text, which Bob does not want Alice to see.

"they suggested that the manifestations were as severe in the mother as in the sons and that this suggested autosomal dominant inheritance."

Here is Bob's Piece 1, prepared by extracting the phrases from the original text, and producing the one-way hash values for each of the extracted phrases.

684327ec3b2f020aa3099edb177d3794 => suggested autosomal dominant inheritance

3c188dace2e7977fd6333e4d8010e181 => mother

8c81b4aaf9c2009666d532da3b19d5f8 => manifestations

db277da2e82a4cb7e9b37c8b0c7f66f0 => suggested

e183376eb9cc9a301952c05b5e4e84e3 => sons

22cf107be97ab08b33a62db68b4a390d => severe

Here is Bob's piece 2, created by substituting phrases in the original text with their one-way hash values, leaving the stop words in place. Bob keeps the original text, and piece 2, and does not send either to Alice (who must work exclusively from piece 1).

they db277da2e82a4cb7e9b37c8b0c7f66f0 that the 8c81b4aaf9c2009666d532da3b19d5f8 were as 22cf107be97ab08b33a62db68b4a390d in the 3c188dace2e7977fd6333e4d8010e181 as in the e183376eb9cc9a301952c05b5e4e84e3 and that this 684327ec3b2f020aa3099edb177d3794.
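Here is a minimal Python sketch of how the two pieces might be generated and, later, recombined. It uses an MD5 one-way hash (consistent with the 32-character hash values shown above) and a deliberately truncated stop word list; a real implementation would use a full stop word list and whatever one-way hash function the two parties agree upon.

# Generate Piece 1 and Piece 2 from a line of text, then reconstruct the text.
import hashlib, re

stop_words = {"they", "that", "the", "were", "as", "in", "and", "this"}  # truncated list

def split_into_pieces(text):
    piece1 = {}          # one-way hash -> phrase (this listing goes to Alice)
    piece2_tokens = []   # hash values and stop words, in original order (Bob keeps this)
    phrase_words = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in stop_words:
            if phrase_words:
                phrase = " ".join(phrase_words)
                hash_value = hashlib.md5(phrase.encode()).hexdigest()
                piece1[hash_value] = phrase
                piece2_tokens.append(hash_value)
                phrase_words = []
            piece2_tokens.append(word)
        else:
            phrase_words.append(word)
    if phrase_words:   # flush a phrase left over at the end of the text
        phrase = " ".join(phrase_words)
        hash_value = hashlib.md5(phrase.encode()).hexdigest()
        piece1[hash_value] = phrase
        piece2_tokens.append(hash_value)
    return piece1, " ".join(piece2_tokens)

def reconstruct(piece1, piece2):
    # replace each hash value with its phrase; stop words pass through unchanged
    return " ".join(piece1.get(token, token) for token in piece2.split())

line = ("they suggested that the manifestations were as severe in the mother "
        "as in the sons and that this suggested autosomal dominant inheritance")
piece1, piece2 = split_into_pieces(line)
print(piece2)                        # hash values with the stop words left in place
print(reconstruct(piece1, piece2))   # recovers the original wording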

Properties of Piece 1 and Piece 2

Piece 1 (the listing of phrases and their one-way hashes)

1. Contains no information on the frequency of occurrence of the phrases found in the original text (because recurring phrases map to the same hash code and appear as a single entry in Piece 1).

2. Contains no information on the order or locations of the phrases found in the original text.

3. Contains all the concepts found in the original text. Stop words are a popular method of parsing text into concepts.

4. Bob can destroy Piece 1 and recreate it at any time, from the original text.

5. Alice can use the phrases in Piece 1 to transform, annotate, or search the concepts found in the original file.

6. Alice can transfer Piece 1 to a third party without violating Bob's confidentiality.

7. Alice can keep Piece 1 and add it to her database of Piece 1 files collected from all of her clients.

8. Piece 1 is not necessarily unique. Different original files may yield the same Piece 1 (if they're composed of the same phrases). Therefore Piece 1 cannot be used to authenticate the original file used to produce Piece 1 (See Glossary item, Authentication).

Properties of Piece 2

1. Contains no information that can be used to connect any private information to any particular data record.

2. Contains nothing but hash values of phrases and stop words, in their correct order of occurrence in the original text.

3. Anyone obtaining Piece 1 and Piece 2 can reconstruct the original text.

4. The original text can be reconstructed from Piece 2, and any file into which Piece 1 has been merged. There is no necessity to preserve Piece 1 in its original form.

5. Bob can lose or destroy Piece 2, and recreate it later from the original file, using the same algorithm.

If Alice had Piece 1 and Piece 2 she could simply use Piece 1 to find the text phrases that match the hash-values in Piece 2. Substituting the phrases back into Piece 2 will recreate Bob's original line of text. Bob must ensure that Alice never obtains Piece 2.

Alice uses her software (which is her secret) to annotate phrases from Piece 1. Presumably, Alice's software does something that enhances the value of the phrases. Such enhancements might involve annotating each phrase with a nomenclature code, a link to a database, an image file, or a location where related information is stored. Alice substitutes the transformed text (or simply appends the transformed text) for each phrase back into Piece 1, co-locating it with the original one-way hash number associated with the phrase.

Let's pretend that Alice has an autocoder that provides a standard nomenclature code to medical phrases that occur in text. Alice's software transforms the original phrases from Piece 1, preserving the original hash values, and appending a nomenclature code to every phrase that matches a nomenclature term. Here is the file, basically a modification of Piece 1, produced by Alice's software.

684327ec3b2f020aa3099edb177d3794 => suggested (autosomal dominant inheritance=C0443147)

3c188dace2e7977fd6333e4d8010e181 => (mother=C0026591)

8c81b4aaf9c2009666d532da3b19d5f8 => manifestations

db277da2e82a4cb7e9b37c8b0c7f66f0 => suggested

e183376eb9cc9a301952c05b5e4e84e3 => (son=C0037683)

22cf107be97ab08b33a62db68b4a390d => (severe=C0205082)

Alice returns the modified piece 1 (ie, the coded phrase list) to Bob. Bob now takes the transformed Piece 1 and substitutes the transformed phrases for each occurrence of the hash values occurring in Piece 2 (which he has saved for this very purpose).

The reconstructed sentence is now:

they suggested that the manifestations were as (severe=C0205082) in the (mother=C0026591) as in the (son=C0037683) and that this suggested (autosomal dominant inheritance=C0443147)

The original sentence has been annotated with nomenclature codes. This was all accomplished without sharing confidential information that might have been contained in the text. Bob never had access to Alice's software. Alice never had the opportunity to see Bob's original text.

The negotiation between Bob and Alice need not be based on the exchange of text. The same negotiation would apply to any set of data elements that can be transformed or annotated. The protocol has practical value in instances when the sender and receiver each have something to hide: the contents of the original data in the case of the sender; and the secret software, in the case of the receiver.

Data scientists who are reluctant to share their data, based on confidentiality or privacy issues, should know that there are a variety of protocols that permit data to be safely shared, without breaching the secrecy of information contained in the data files. The protocol discussed here, developed by the author, is just one of many widely available negotiation protocols.27 Encryption protocols will be discussed in greater detail in Chapter 5 (See Glossary items, Deidentification, Reidentification, Scrubbing, Data scrubbing, Deidentification versus anonymization).

Open Source Tools

Perl: The only language that looks the same before and after RSA encryption.

Keith Bostic

Word Lists

Word lists, for just about any written language for which there is an electronic literature, are easy to create. Here is a short Python script, words.py, that prompts the user to enter a line of text. The script drops the line to lowercase, removes the carriage return at the end of the line, parses the result into an alphabetized list, removes duplicate terms from the list, and prints out the list, with one term assigned to each line of output. This words.py script can be easily modified to create word lists from plain-text files (See Glossary item, Metasyntactic variable).

#!/usr/local/bin/python
import sys, re, string
print "Enter a line of text to be parsed into a word list"
line = sys.stdin.readline()
line = string.lower(line)
line = string.rstrip(line)
linearray = sorted(set(re.split(r' +', line)))
for i in range(0, len(linearray)):
    print(linearray[i])
exit

Here is a sample of the output, when the input is the first line of Joyce's Finnegans Wake:

c:ftp>words.py

Enter a line of text to be parsed into a word list

a way a lone a last a loved a long the riverrun, past Eve and Adam's, from swerv

e of shore to bend of bay, brings us by a commodius vicus

a

adam's,

and

bay,

bend

brings

by

commodius

eve

from

last

lone

long

loved

of

past

riverrun,

shore

swerve

the

to

us

vicus

way

Here is a nearly equivalent Perl script, words.pl, that creates a wordlist from a file. In this case, the chosen file happens to be "gettysbu.txt," containing the full text of the Gettysburg Address. We could have included the name of any plain-text file.

#!/usr/local/bin/perl
open(TEXT, "gettysbu.txt");
undef($/);                      #set slurp mode, to read the whole file at once
$var = lc(<TEXT>);
$var =~ s/\n/ /g;               #convert line breaks to spaces
$var =~ s/'s//g;
$var =~ tr/a-zA-Z'- //cd;       #delete everything except letters, apostrophes, hyphens, and spaces
@words = sort(split(/ +/, $var));
@words = grep($_ ne $prev && (($prev) = $_), @words);   #drop duplicates from the sorted list
print (join("\n",@words));
exit;

The words.pl script was designed for speed. You'll notice that it slurps the entire contents of a file into a string variable. If we were dealing with a very large file that exceeded the functional RAM memory limits of our computer, we could modify the script to parse through the file line-by-line.
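
Here is one way the same line-by-line strategy might look in Python; a minimal sketch that assumes the Gettysburg Address sits in a plain-text file named gettysbu.txt in the current directory:

import re

wordset = set()
with open("gettysbu.txt", "r") as in_file:
    for line in in_file:                         # read one line at a time; nothing is slurped
        line = line.lower().replace("'s", "")
        words = re.findall(r"[a-z'\-]+", line)   # keep letters, apostrophes, and hyphens
        wordset.update(words)
for word in sorted(wordset):
    print(word)

Because duplicate words are discarded as they are encountered, the word set stays small even when the input file is enormous.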

Aside from word lists you create for yourself, there is a wide variety of specialized knowledge domain nomenclatures available to the public.28,29,14,30–32 Linux distributions often bundle a wordlist, under filename "words," that is useful for parsing and natural language processing applications. A copy of the Linux wordlist is available at:

http://www.cs.duke.edu/~ola/ap/linuxwords

Curated lists of terms, either generalized, or restricted to a specific knowledge domain, are indispensable for developing a variety of applications (eg, spell-checkers, natural language processors, machine translation, coding by term, indexing). Personally, I have spent an inexcusable amount of time creating my own lists, when no equivalent public domain resources were available.

Doublet Lists

Doublet lists (lists of two-word terms that occur in common usage or in a body of text) are a highly underutilized resource. The special value of doublets is that single word terms tend to have multiple meanings, while doublets tend to have specific meaning.

Here are a few examples:

The word "rose" can mean the past tense of rise, or the flower. The doublet "rose garden" refers specifically to a place where the rose flower grows.

The word "lead" can mean a verb form of the infinitive, "to lead," or it can refer to the metal. The term "lead paint" has a different meaning than "lead violinist." Furthermore, every multiword term of length greater than two can be constructed with overlapping doublets, with each doublet having a specific meaning.

For example, "Lincoln Continental convertible" = "Lincoln Continental" + "Continental convertible." The three words, "Lincoln, "Continental," and "convertible" all have different meanings, under different circumstances. But the two doublets, "Lincoln Continental" and "Continental Convertible" would be unusual to encounter on their own, and have unique meanings.

Perusal of any nomenclature will reveal that most of the terms included in nomenclatures consist of two or more words. This is because single word terms often lack specificity. For example, in a nomenclature of recipes, you might expect to find "Eggplant Parmesan," but you may be disappointed if you look for "Eggplant" or "Parmesan." In a taxonomy of neoplasms, available at: http://www.julesberman.info/figs/neocl_f.htm, containing over 120,000 terms, only a few hundred are single word terms.33

Lists of doublets, collected from a corpus of text, or from a nomenclature, have a variety of uses in data simplification projects.33,20,25 We will show examples in Section 5.4.

For now, you should know that compiling doublet lists, from any corpus of text, is extremely easy.

Here is a Perl script, doublet_maker.pl, that creates a list of alphabetized doublets occurring in any text file of your choice (filename.txt in this example):

#!/usr/local/bin/perl
open(TEXT,"filename.txt")||die"cannot";
open(OUT,">doublets.txt")||die"cannot";
undef($/);                                  #slurp the entire file into a string
$var = <TEXT>;
$var =~ s/\n/ /g;
$var =~ s/'s//g;
$var =~ tr/a-zA-Z'- //cd;
@words = split(/ +/, $var);
foreach $thing (@words)
  {
  $doublet = "$oldthing $thing";
  if ($doublet =~ /^[a-z]+ [a-z]+$/)        #keep only doublets composed of two lowercase words
    {
    $doublethash{$doublet}="";
    }
  $oldthing = $thing;
  }
close TEXT;
@wordarray = sort(keys(%doublethash));
print OUT join("\n",@wordarray);
close OUT;
exit;

Here is an equivalent Python script, doublet_maker.py:

#!/usr/local/bin/python
import re
in_file = open('filename.txt', "r")
out_file = open('doubs.txt', "w")
doubhash = {}
for line in in_file:
    line = line.lower()
    # replace punctuation and digits with spaces
    line = re.sub(r'[.,<>?/;:"\[\]{}|=+\-_()*&^%$#@!`~0-9]', ' ', line)
    hoparray = line.split()
    hoparray.append(" ")
    for i in range(len(hoparray) - 1):
        doublet = hoparray[i] + " " + hoparray[i + 1]
        if doublet in doubhash:
            continue
        doubhash_match = re.search(r'[a-z]+ [a-z]+', doublet)
        if doubhash_match:
            doubhash[doublet] = ""
for keys, values in sorted(doubhash.items()):
    out_file.write(keys + '\n')
out_file.close()
exit

Here is an equivalent Ruby script, doublet_maker.rb that creates a doublet list from file filename.txt:

#!/usr/local/bin/ruby
intext = File.open("filename.txt", "r")
outtext = File.open("doubs.txt", "w")
doubhash = Hash.new(0)
line_array = Array.new(0)
while record = intext.gets
  oldword = ""
  line_array = record.chomp.strip.split(/\s+/)
  line_array.each do |word|
    doublet = [oldword, word].join(" ")
    oldword = word
    next unless (doublet =~ /^[a-z]+\s[a-z]+$/)
    doubhash[doublet] = ""
  end
end
doubhash.each {|k,v| outtext.puts k}
exit

I have deposited a public domain doublet list, available for download at: http://www.julesberman.info/doublets.htm

The first few lines of the list are shown:

a bachelor

a background

a bacteremia

a bacteria

a bacterial

a bacterium

a bad

a balance

a balanced

a banana

Ngram Lists

Ngrams are subsequences of text, of length n words. A complete collection of ngrams consists of all of the possible ordered subsequences of words in a text. Because sentences are the basic units of statements and ideas, when we speak of ngrams, we are confining ourselves to ngrams of sentences. Let's examine all the ngrams for the sentence, "Ngrams are ordered word sequences."

Ngrams (1-gram)

are (1-gram)

ordered (1-gram)

word (1-gram)

sequences (1-gram)

Ngrams are (2-gram)

are ordered (2-gram)

ordered word (2-gram)

word sequences (2-gram)

Ngrams are ordered (3-gram)

are ordered word (3-gram)

ordered word sequences (3-gram)

Ngrams are ordered word (4-gram)

are ordered word sequences (4-gram)

Ngrams are ordered word sequences (5-gram)

Google has collected ngrams from scanned literature dating back to 1500. The public can enter their own ngrams into Google's ngram viewer, and receive a graph of the published occurrences of the phrase, through time.9

For example, we can use Google's Ngram viewer to visualize the frequency of occurrence of the single word, "photon" (Fig. 3.1).

Figure 3.1 Google Ngram for the word "photon," from a corpus of literature covering the years 1900 to 2000. Notice that the first appearance of the term "photon" closely corresponds to its discovery, in the second decade of the 20th century. Source: Google Ngram viewer, with permission from Google.

The result fits into an historical narrative. The name "photon" comes from the Greek word for light. The word seems to have been used first in 1916, and is credited to Leonard T. Troland. When we chart the appearance of "photon" in published literature, we see that it does not appear until about 1920, when it rapidly entered common usage.

We can use the Ngram viewer to find trends (eg, peaks, valleys, and periodicities) in data. Consider the Google Ngram Viewer results for the two-word ngram, "yellow fever."

We see that the term "yellow fever" (a mosquito-transmitted hepatitis) appeared in the literature beginning about 1800, with several subsequent peaks. The dates of the peaks correspond roughly to yellow fever outbreaks in Philadelphia (the epidemic of 1793) and New Orleans (the epidemic of 1853), to the U.S. construction effort on the Panama Canal (1904–14), and to well-documented WWII Pacific outbreaks (about 1942). Following the 1942 epidemic, an effective vaccine became available, and the incidence of yellow fever, as well as the literature occurrences of the "yellow fever" ngram, dropped precipitously. In this case, a simple review of ngram frequencies provides an accurate chart of historic yellow fever outbreaks (Fig. 3.2).4,9

Figure 3.2 Google Ngram for the phrase "yellow fever," counting occurrences of the term in a large corpus, from the years 1700 to 2000. Peaks roughly correspond to yellow fever epidemics. Source: Google Ngram viewer, with permission from Google.

Google's own ngram viewer supports simple lookups of term frequencies. For more advanced analyses (eg, finding co-occurrences of all ngrams against all other ngrams), data scientists can download the ngram data files, available at no cost from Google, and write their own programs, suited to their repurposing goals.

Here is a short Perl script that will take a sentence and produce a list of all the contained ngrams. This short script can easily be adapted to parse large collections of sentences, and to remove punctuation.

The Perl script, ngram_list.pl:

#!/usr/local/bin/perl
$text = "ngrams are ordered word sequences";
@text_list = split(" ", $text);
while(scalar(@text_list) != 0)
  {
  push(@parts_list, join(" ", @text_list));
  shift(@text_list);
  }
foreach $part (@parts_list)
  {
  $previous = "";
  @word_list = split(" ", $part);
  while(scalar(@word_list) != 0)
    {
    $ngram_list{join(" ", @word_list)} = "";
    $first_word = shift(@word_list);
    $ngram_list{$first_word} = "";
    $previous = $previous . " " . $first_word;
    $previous =~ s/^ //o;
    $ngram_list{$previous} = "";
    }
  }
print(join("\n", sort(keys(%ngram_list))));
exit;

Here is the output of the ngram_list.pl script:

c:ftp>ngram_list.pl

are

are ordered

are ordered word

are ordered word sequences

ngrams

ngrams are

ngrams are ordered

ngrams are ordered word

ngrams are ordered word sequences

ordered

ordered word

ordered word sequences

sequences

word

word sequences

The ngram_list.pl script can be easily modified to parse through all the sentences of any text, regardless of length, building the list of ngrams as it proceeds.
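
For readers who prefer Python, here is a rough counterpart, sketched under the simplifying assumption that sentences can be approximated by splitting the text at sentence-ending punctuation; the sample string stands in for text that would ordinarily be read from a file:

import re

def sentence_ngrams(sentence):
    # collect every ngram (contiguous word subsequence) of one sentence
    words = sentence.split()
    ngrams = set()
    for start in range(len(words)):
        for stop in range(start + 1, len(words) + 1):
            ngrams.add(" ".join(words[start:stop]))
    return ngrams

text = "ngrams are ordered word sequences. sentences are the basic units."
all_ngrams = set()
for sentence in re.split(r'[.!?]+', text.lower()):
    all_ngrams.update(sentence_ngrams(sentence))
for ngram in sorted(all_ngrams):
    print(ngram)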

Glossary

ANSI The American National Standards Institute (ANSI) accredits standards developing organizations to create American National Standards (ANS). A so-called ANSI standard is produced when an ANSI-accredited standards development organization follows ANSI procedures and receives confirmation, from ANSI, that all the procedures were followed. ANSI coordinates efforts to gain international standards certification from the ISO (International Standards Organization) or the IEC (International Electrotechnical Commission). ANSI works with hundreds of ANSI-accredited standards developers.

Associative array A data structure consisting of an unordered list of key/value data pairs. Also known as hash, hash table, map, symbol table, dictionary, or dictionary array. The proliferation of synonyms suggests that associative arrays, or their computational equivalents, have great utility. Associative arrays are used in Perl, Python, Ruby and most modern programming languages. Here is an example in which an associative array (ie, a member of Class Hash) is created in Ruby.

#!/usr/local/bin/ruby
my_hash = Hash.new
my_hash["C05"] = "Albumin"
my_hash["C39"] = "Choline"
my_hash.each {|key,value|
  STDOUT.print(key, " --- ", value, "\n")}
exit

The first line of the script creates a new associative array, named my_hash. The next two lines create two key/value elements for the associative array (C05/Albumin and C39/Choline). The next line instructs Ruby to print out the elements in the my_hash associative array. Here is the output of the short Ruby script:

Output:
C05 --- Albumin
C39 --- Choline

Authentication A process for determining if the data object that is received (eg, document, file, image) is the data object that was intended to be received. The simplest authentication protocol involves one-way hash operations on the data that needs to be authenticated. Suppose you happen to know that a certain file, named z.txt, will be arriving via email, and that this file has an MD5 hash of "uF7pBPGgxKtabA/2zYlscQ==." You receive the z.txt file, and you perform an MD5 one-way hash operation on the file, as shown here:

#!/usr/bin/python
import base64
import md5
md5_object = md5.new()
sample_file = open ("z.txt", "rb")
string = sample_file.read()
sample_file.close()
md5_object.update(string)
md5_string = md5_object.digest()
print(base64.encodestring(md5_string))
exit

Let's assume that the output of the MD5 hash operation, performed on the z.txt file, is "uF7pBPGgxKtabA/2zYlscQ==." This would tell us that the received z.txt file is authentic (ie, it is the file that you were intended to receive); because no other file has the same MD5 hash. Additional implementations of one-way hashes are described in Open Source Tools for Chapter 5. The authentication process, in this example, does not tell you who sent the file, the time that the file was created, or anything about the validity of the contents of the file. These would require a protocol that included signature, timestamp, and data validation, in addition to authentication. In common usage, authentication protocols often include entity authentication (ie, some method by which the entity sending the file is verified). Consequently, authentication protocols are often confused with signature verification protocols. An ancient historical example serves to distinguish the concepts of authentication protocols and signature protocols. Since earliest recorded history, fingerprints were used as a method of authentication. When a scholar or artisan produced a product, he would press his thumb into the clay tablet, or the pot, or the wax seal closing a document. Anyone doubting the authenticity of the pot could ask the artisan for a thumbprint. If the new thumbprint matched the thumbprint on the tablet, pot, or document, then all knew that the person creating the new thumbprint and the person who had put his thumbprint into the object were the same individual. Of course, this was not proof that the object was the creation of the person with the matching thumbprint. For all anyone knew, there may have been a hundred different pottery artisans, with one person pressing his thumb into every pot produced. You might argue that the thumbprint served as the signature of the artisan. In practical terms, no. The thumbprint, by itself, does not tell you whose print was used. Thumbprints could not be read, at least not in the same way as a written signature. The ancients needed to compare the pot's thumbprint against the thumbprint of the living person who made the print. When the person died, civilization was left with a bunch of pots with the same thumbprint, but without any certain way of knowing whose thumb produced them. In essence, because there was no ancient database that permanently associated thumbprints with individuals, the process of establishing the identity of the pot-maker became very difficult once the artisan died. A good signature protocol permanently binds an authentication code to a unique entity (eg, a person). Today, we can find a fingerprint at the scene of a crime; we can find a matching signature in a database; and we can link the fingerprint to one individual. Hence, in modern times, fingerprints are true "digital" signatures, no pun intended. Modern uses of fingerprints include keying (eg, opening locked devices based on an authenticated fingerprint), tracking (eg, establishing the path and whereabouts of an individual by following a trail of fingerprints or other identifiers), and body part identification (ie, identifying the remains of individuals recovered from mass graves or from the sites of catastrophic events based on fingerprint matches). Over the past decade, flaws in the vaunted process of fingerprint identification have been documented, and the improvement of the science of identification is an active area of investigation.34 See HMAC. See Digital signature.

Autocoding When nomenclature coding is done automatically, by a computer program, the process is known as "autocoding" or "autoencoding." See Coding. See Nomenclature. See Autoencoding.

Autoencoding Synonym for autocoding. See Autocoding.

Autovivification In programming, autovivification is a feature of some programming languages wherein a variable or data structure seemingly brings itself into life, without definition or declaration, at the moment when its name first appears in a program. The programming language automatically registers the variable and endows it with a set of properties consistent with its type, as determined by the context, within the program. Perl supports autovivification. Python and Ruby, under most circumstances, do not. In the case of Ruby, new class objects (ie, instances of the class) are formally declared and created, by sending the "new" method to the class assigned to the newly declared object. See Reification.

Blended class Also known as class noise, subsumes the more familiar, but less precise term, "Labeling error." Blended class refers to inaccuracies (eg, misleading results) introduced in the analysis of data due to errors in class assignments (ie, assigning a data object to class A when the object should have been assigned to class B). If you are testing the effectiveness of an antibiotic on a class of people with bacterial pneumonia, the accuracy of your results will be forfeit when your study population includes subjects with viral pneumonia, or smoking-related lung damage. Errors induced by blending classes are often overlooked by data analysts who incorrectly assume that the experiment was designed to ensure that each data group is composed of a uniform and representative population. A common source of class blending occurs when the classification upon which the experiment is designed is itself blended. For example, imagine that you are a cancer researcher and you want to perform a study of patients with malignant fibrous histiocytomas (MFH), comparing the clinical course of these patients with the clinical course of patients who have other types of tumors. Let's imagine that the class of tumors known as MFH does not actually exist; that it is a grab-bag term erroneously assigned to a variety of other tumors that happened to look similar to one another. This being the case, it would be impossible to produce any valid results based on a study of patients diagnosed as MFH. The results would be a biased and irreproducible cacophony of data collected across different, and undetermined, classes of tumors. Believe it or not, this specific example, of the blended MFH class of tumors, is selected from the real-life annals of tumor biology.35,36 The literature is rife with research of dubious quality, based on poorly designed classifications and blended classes. A detailed discussion of this topic is found in Section 6.5, "Properties that Cross Multiple Classes." One caveat. Efforts to eliminate class blending can be counterproductive if undertaken with excess zeal. For example, in an effort to reduce class blending, a researcher may choose groups of subjects who are uniform with respect to every known observable property. For example, suppose you want to actually compare apples with oranges. To avoid class blending, you might want to make very sure that your apples do not include any kumquats, or persimmons. You should be certain that your oranges do not include any limes or grapefruits. Imagine that you go even further, choosing only apples and oranges of one variety (eg, Macintosh apples and navel oranges), size (eg 10 cm), and origin (eg, California). How will your comparisons apply to the varieties of apples and oranges that you have excluded from your study? You may actually reach conclusions that are invalid and irreproducible for more generalized populations within each class. In this case, you have succeeded in eliminating class blending, at the expense of losing representative populations of the classes. See Simpson's paradox.

Burrows-Wheeler transform Abbreviated as BWT, the Burrows-Wheeler transform produces a compressed version of an original file, along with a concordance to the contents of the file. Using a reverse BWT, you can reconstruct the original file, or you can find any portion of a file preceding or succeeding any location in the file. The BWT transformation is an amazing example of simplification, applied to informatics. A detailed discussion of the BWT is found in Open Source Tools for Chapter 8. See Concordance.

Check digit A checksum that produces a single digit as output is referred to as a check digit. Some of the common identification codes in use today, such as ISBN numbers for books, come with a built-in check digit. Of course, when using a single digit as a check value, you can expect that some transmitted errors will escape the check, but the check digit is useful in systems wherein occasional mistakes are tolerated; or wherein the purpose of the check digit is to find a specific type of error (eg, an error produced by a substitution in a single character or digit), and wherein the check digit itself is rarely transmitted in error. See Checksum.
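
As a concrete illustration, here is a short Python sketch of the ISBN-10 check digit calculation, using one common formulation in which each of the first nine digits is weighted by its position and the weighted sum is taken modulo 11; the sample number is a published ISBN, used here only to check the arithmetic:

def isbn10_check_digit(first_nine_digits):
    # weighted sum of the first nine digits, positions 1 through 9
    total = sum(int(digit) * position
                for position, digit in enumerate(first_nine_digits, start=1))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)

print(isbn10_check_digit("030640615"))   # prints 2, the final digit of ISBN 0-306-40615-2

A substitution in any single digit changes the weighted sum, and therefore the check digit, which is exactly the class of error the scheme is designed to catch.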

Checksum An outdated term that is sometimes used synonymously with one-way hash or message digest. Checksums are performed on a string, block, or file yielding a short alphanumeric string intended to be specific for the input data. Ideally, if a single bit were to change, anywhere within the input file, then the checksum for the input file would change drastically. Checksums, as the name implies, involve summing values (typically weighted character values) to produce a sequence that can be calculated on a file before and after transmission. Most of the errors that were commonly introduced by poor transmission could be detected with checksums. Today, the old checksum algorithms have been largely replaced with one-way hash algorithms. A checksum that produces a single digit as output is referred to as a check digit. See Check digit. See One-way hash. See Message digest. See HMAC.

Child class The direct or first generation subclass of a class. Sometimes referred to as the daughter class or, less precisely, as the subclass. See Parent class. See Classification.

Class A class is a group of objects that share a set of properties that define the class and that distinguish the members of the class from members of other classes. The word "class," lowercase, is used as a general term. The word "Class," uppercase, followed by an uppercase noun (eg Class Animalia), represents a specific class within a formal classification. See Classification.

Classification A system in which every object in a knowledge domain is assigned to a class within a hierarchy of classes. The properties of superclasses are inherited by the subclasses. Every class has one immediate superclass (ie, parent class), although a parent class may have more than one immediate subclass (ie, child class). Objects do not change their class assignment in a classification, unless there was a mistake in the assignment. For example, a rabbit is always a rabbit, and does not change into a tiger. Classifications can be thought of as the simplest and most restrictive type of ontology, and serve to reduce the complexity of a knowledge domain.37 Classifications can be easily modeled in an object-oriented programming language and are nonchaotic (ie, calculations performed on the members and classes of a classification should yield the same output, each time the calculation is performed). A classification should be distinguished from an ontology. In an ontology, a class may have more than one parent class and an object may be a member of more than one class. A classification can be considered a special type of ontology wherein each class is limited to a single parent class and each object has membership in one, and only one, class. See Nomenclature. See Thesaurus. See Vocabulary. See Classification. See Dictionary. See Terminology. See Ontology. See Parent class. See Child class. See Superclass. See Unclassifiable objects.

Coding The term "coding" has several very different meanings, depending on which branch of science influences your thinking. For programmers, coding means writing the code that constitutes a computer program. For cryptographers, coding is synonymous with encrypting (ie, using a cipher to encode a message). For medics, coding is calling an emergency team to handle a patient in extremis. For informaticians and library scientists, coding involves assigning an alphanumeric identifier, representing a concept listed in a nomenclature, to a term. For example, a surgical pathology report may include the diagnosis, "Adenocarcinoma of prostate." A nomenclature may assign a code C4863000 that uniquely identifies the concept "Adenocarcinoma." Coding the report may involve annotating every occurrence of the word "Adenocarcinoma" with the "C4863000" identifier. For a detailed explanation of coding, and its importance for searching and retrieving data, see the full discussion in Section 3.4. See Autocoding. See Nomenclature.

Concordance A concordance is an index consisting of every word in the text, along with every location wherein each word can be found. It is computationally trivial to reconstruct the original text from the concordance. Before the advent of computers, concordances fell into the province of religious scholars, who painstakingly recorded the locations of all the words appearing in the Bible, ancient scrolls, and any texts whose words were considered to be divinely inspired. Today, a concordance for a Bible-length book can be constructed in about a second. Furthermore, the original text can be reconstructed from the concordance, in about the same time.
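
Here is a minimal Python sketch showing both directions of that claim: it builds a concordance from a short sample sentence (each word mapped to the positions at which it occurs) and then regenerates the original word sequence from the concordance alone:

text = "the quick brown fox jumps over the lazy dog"
words = text.split()

# build the concordance: word -> list of positions where the word occurs
concordance = {}
for position, word in enumerate(words):
    concordance.setdefault(word, []).append(position)
print(concordance)

# reconstruct the original text from the concordance
slots = {}
for word, positions in concordance.items():
    for position in positions:
        slots[position] = word
reconstructed = " ".join(slots[i] for i in range(len(slots)))
print(reconstructed == text)   # True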

Data scrubbing A term that is very similar to data deidentification and is sometimes used improperly as a synonym for data deidentification. Data scrubbing refers to the removal, from data records, of information that is considered unwanted. This may include identifiers, private information, or any incriminating or otherwise objectionable language contained in data records, as well as any information deemed irrelevant to the purpose served by the record. See Deidentification.

Deidentification The process of removing all of the links in a data record that can connect the information in the record to an individual. This usually includes the record identifier, demographic information (eg, place of birth), personal information (eg, birthdate), biometrics (eg, fingerprints), and so on. The process of deidentification will vary based on the type of records examined. Deidentifying protocols exist wherein deidentified records can be reidentified, when necessary. See Reidentification. See Data scrubbing.

Deidentification versus anonymization Anonymization is a process whereby all the links between an individual and the individual's data record are irreversibly removed. The difference between anonymization and deidentification is that anonymization is irreversible. Because anonymization is irreversible, the opportunities for verifying the quality of data are limited. For example, if someone suspects that samples have been switched in a data set, thus putting the results of the study into doubt, an anonymized set of data would afford no opportunity to resolve the problem by reidentifying the original samples. See Reidentification.

Dictionary A terminology or word list accompanied by a definition for each item. See Nomenclature. See Vocabulary. See Terminology.

Digital signature As it is used in the field of data privacy, a digital signature is an alphanumeric sequence that could only have been produced by a private key owned by one particular person. Operationally, a message digest (eg, a one-way hash value) is produced from the document that is to be signed. The person "signing" the document encrypts the message digest using her private key, and submits the document and the encrypted message digest to the person who intends to verify that the document has been signed. This person decrypts the encrypted message digest with her public key (ie, the public key complement to the private key) to produce the original one-way hash value. Next, a one-way hash is performed on the received document. If the resulting one-way hash is the same as the decrypted one-way hash, then several statements hold true: the document received is the same document as the document that had been "signed." The signer of the document had access to the private key that complemented the public key that was used to decrypt the encrypted one-way hash. The assumption here is that the signer was the only individual with access to the private key. Digital signature protocols, in general, have a private method for encrypting a hash, and a public method for verifying the signature. Such protocols operate under the assumption that only one person can encrypt the hash for the message, and that the name of that person is known; hence, the protocol establishes a verified signature. It should be emphasized that a digital signature is quite different from a written signature; the latter usually indicates that the signer wrote the document or somehow attests to the veracity of the document. The digital signature merely indicates that the document was received from a particular person, contingent on the assumption that the private key was available only to that person. To understand how a digital signature protocol may be maliciously deployed, imagine the following scenario: I contact you and tell you that I am Elvis Presley and would like you to have a copy of my public key plus a file that I have encrypted using my private key. You receive the file and the public key; and you use the public key to decrypt the file. You conclude that the file was indeed sent by Elvis Presley. You read the decrypted file and learn that Elvis advises you to invest all your money in a company that manufactures concrete guitars; which, of course, you do. Elvis knows guitars. The problem here is that the signature was valid, but the valid signature was not authentic. See Authentication.

Encryption A common definition of encryption involves an algorithm that takes some text or data and transforms it, bit-by-bit, into an output that cannot be interpreted (ie, from which the contents of the source file cannot be determined). Encryption comes with the implied understanding that there exists some reverse transform that can be applied to the encrypted data, to reconstitute the original source. As used herein, the definition of encryption is expanded to include any protocols by which files can be shared, in such a way that only the intended recipients can make sense of the received documents. This would include protocols that divide files into pieces that can only be reassembled into the original file using a password. Encryption would also include protocols that alter parts of a file while retaining the original text in other parts of the file. As described in Chapter 5, there are instances when some data in a file should be shared, while only specific parts need to be encrypted. The protocols that accomplish these kinds of file transformations need not always employ classic encryption algorithms. See Winnowing and chaffing.

HMAC Hashed Message Authentication Code. When a one-way hash is employed in an authentication protocol, it is often referred to as an HMAC. See One-way hash. See Message digest. See Checksum.

Hash A hash, also known as associative array and as dictionary, is a data structure comprising an unordered list of key/value pairs. The term "hash" must be distinguished from the unrelated term, "One-way hash." See One-way hash.

ISO International Standards Organization. The ISO is a nongovernmental organization that develops international standards (eg, ISO-11179 for metadata and ISO-8601 for date and time). See ANSI.

Index An index is an ordered collection of words, phrases, concepts, or subsets of classes of information (eg, geographic names, names of persons, dates of events), linked to the locations where they occur in the text. The terms in an index are selected and ordered based on the indexer's conceptualization of the text, and of the utility of the text to the intended readers. Furthermore, the index is seldom, if ever, created by the author(s) of the text. Hence, the index is a reconceptualization of the original text, in tabular form, comprising a new, creative, work.5 See Indexes.

Indexes Every writer must search deeply into his or her soul to find the correct plural form of "index." Is it "indexes" or is it "indices?" Latinists insist that "indices" is the proper and exclusive plural form. Grammarians agree, reserving "indexes" for the third person singular verb form; "The student indexes his thesis." Nonetheless, popular usage of the plural of "index," referring to the section at the end of a book, is almost always "indexes," the form used herein. See Index.

Indexes versus classifications Indexes and classifications both help us to expand and simplify our perception of a subject or a knowledge domain. The key difference between the two concepts is that indexes are methods of searching and retrieving data objects; whereas classifications are methods of describing the relationships among data objects.

Instance An instance is a specific example of an object that is not itself a class or group of objects. For example, Tony the Tiger is an instance of the tiger species. Tony the Tiger is a unique animal and is not itself a group of animals or a class of animals. The terms instance, instance object, and object are sometimes used interchangeably, but the special value of the "instance" concept, in a system wherein everything is an object, is that it distinguishes members of classes (ie, the instances) from the classes to which they belong.

Message digest Within the context of this book, "message digest," "digest," "HMAC," and "one-way hash" are equivalent terms. See One-way hash. See HMAC.

Metasyntactic variable A variable name that imparts no specific meaning, such as x, n, foo, or foobar. Dummy variables are often used in iterating loops. For example:

for($i=0;$i<1000;$i++) 

Good form dictates against the liberal use of metasyntactic variables. In most cases, programmers should create variable names that describe the purpose of the variable (eg, time_of_day, column_sum, current_line_from_file).

Multiclass classification A misnomer imported from the field of machine learning, indicating the assignment of an instance to more than one class. Classifications, as defined in this book, impose one-class classification (ie, an instance can be assigned to one and only one class). It is tempting to think that a ball should be included in class "toy" and in class "spheroids," but multiclass assignments create unnecessary classes of inscrutable provenance, and taxonomies of enormous size, consisting largely of replicate items. See Multiclass inheritance. See Taxonomy.

Multiclass inheritance In ontologies, multiclass inheritance occurs when a child class has more than one parent class. For example, a member of Class House may have two different parent classes: Class Shelter, and Class Property. Multiclass inheritance is generally permitted in ontologies but is forbidden in one type of restrictive ontology, known as a classification. See Classification. See Parent class. See Multiclass classification.

Namespace A namespace is the realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata tag "date" may be used to signify a calendar date, or the fruit, or the social engagement. To avoid confusion, metadata terms are assigned a prefix that is associated with a Web document that defines the term (ie, establishes the tag's namespace). In practical terms, a tag that can have different descriptive meanings in different contexts is provided with a prefix that links to a web document wherein the meaning of the tag, as it applies in the XML document, is specified.

Nomenclature A nomenclature is a listing of terms that cover all of the concepts in a knowledge domain. A nomenclature is different from a dictionary for three reasons: (1) the nomenclature terms are not annotated with definitions, (2) nomenclature terms may be multi-word, and (3) the terms in the nomenclature are limited to the scope of the selected knowledge domain. In addition, most nomenclatures group synonyms under a group code. For example, a food nomenclature might collect submarine, hoagie, po' boy, grinder, hero, and torpedo under an alphanumeric code such as "F63958." Nomenclatures simplify textual documents by uniting synonymous terms under a common code. Documents that have been coded with the same nomenclature can be integrated with other documents that have been similarly coded, and queries conducted over such documents will yield the same results, regardless of which term is entered (ie, a search for either hoagie or po' boy will retrieve the same information, if both terms have been annotated with the synonym code, "F63958"). Optimally, the canonical concepts listed in the nomenclature are organized into a hierarchical classification.13,14 See Coding. See Autocoding.
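
In code, the synonym groups of a nomenclature behave like a simple lookup table; here is a tiny Python sketch using the sandwich synonyms and the invented code number from the entry above:

# every synonym maps to the same (invented) nomenclature code
sandwich_nomenclature = {
    "submarine": "F63958", "hoagie": "F63958", "po' boy": "F63958",
    "grinder": "F63958", "hero": "F63958", "torpedo": "F63958",
}
# a query keyed on the code retrieves the same records for either synonym
print(sandwich_nomenclature["hoagie"] == sandwich_nomenclature["po' boy"])   # True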

Nomenclature mapping Specialized nomenclatures employ specific names for concepts that are included in other nomenclatures, under other names. For example, the term that pathologists use for a certain benign fibrous tumor of the skin is "fibrous histiocytoma," a term spurned by dermatologists, who prefer to use "dermatofibroma" to describe the same tumor. As another horrifying example, the names for the physiologic responses caused by a reversible cerebral vasoconstrictive event include: thunderclap headache, Call-Fleming syndrome, benign angiopathy of the central nervous system, postpartum angiopathy, migrainous vasospasm, and migraine angiitis. The choice of term will vary depending on the medical specialty of the physician (eg, neurologist, rheumatologist, obstetrician). To mitigate the discord among specialty nomenclatures, lexicographers may undertake a harmonization project, in which nomenclatures with overlapping concepts are mapped to one another.

Nonatomicity Nonatomicity is the assignment of a collection of objects to a single, composite object that cannot be further simplified or sensibly deconstructed. For example, the human body is composed of trillions of individual cells, each of which lives for some length of time, and then dies. Many of the cells in the body are capable of dividing to produce more cells. In many cases, the cells of the body that are capable of dividing can be cultured and grown in plastic containers, much like bacteria can be cultured and grown in Petri dishes. If the human body is composed of individual cells, why do we habitually think of each human as a single living entity? Why don't we think of humans as bags of individual cells? Perhaps the reason stems from the coordinated responses of cells. When someone steps on the cells of your toe, the cells in your brain sense pain, the cells in your mouth and vocal cords say ouch, and an army of inflammatory cells rushes to the scene of the crime. The cells in your toe are not capable of registering an actionable complaint, without a great deal of assistance. Another reason that organisms, composed of trillions of living cells, are generally considered to have nonatomicity, probably relates to the "species" concept in biology. Every cell in an organism descended from the same zygote, and every zygote in every member of the same species descended from the same ancestral organism. Hence, there seems to be little benefit to assigning unique entity status to the individual cells that compose organisms, when the class structure for organisms is based on descent through zygotes. See Species.

One-way hash A one-way hash is an algorithm that transforms one string into another string (a fixed-length sequence of seemingly random characters) in such a way that the original string cannot be calculated by operations on the one-way hash value (ie, the calculation is one-way only). One-way hash values can be calculated for any string, including a person's name, a document, or an image. For any given input string, the resultant one-way hash will always be the same. If a single byte of the input string is modified, the resulting one-way hash will be changed, and will have a totally different sequence than the one-way hash sequence calculated for the unmodified string. Most modern programming languages have several methods for generating one-way hash values. Here is a short Ruby script that generates a one-way hash value for a file:

#!/usr/local/bin/ruby
require 'digest/md5'
file_contents = File.new("simplify.txt").binmode
hash_string = Digest::MD5.base64digest(file_contents.read)
puts hash_string
exit

Here is the one-way hash value for the file, simplify.txt, using the md5 algorithm:

0CfZez7L1A6WFcT+oxMh+g== 

If we copy our example file to another file, with an alternate filename, the md5 algorithm will generate the same hash value. Likewise, if we generate a one-way hash value, using the md5 algorithm implemented in some other language, such as Python or Perl, the outputs will be identical. One-way hash values can be designed to produce long fixed-length output strings (eg, 256 bits in length). When the output of a one-way hash algorithm is very long, the chance of a hash string collision (ie, the occurrence of two different input strings generating the same one-way hash output value) is negligible. Clever variations on one-way hash algorithms have been repurposed as identifier systems.3841 Examples of one-way hash implementations in Perl and Python are found in Open Source Tools for Chapter 5, "Encryption." See HMAC. See Message digest. See Checksum.

Ontology An ontology is a collection of classes and their relationships to one another. Ontologies are usually rule-based systems (ie, membership in a class is determined by one or more class rules). Two important features distinguish ontologies from classifications. Ontologies permit classes to have more than one parent class and more than one child class. For example, the class of automobiles may be a direct subclass of "motorized devices" and a direct subclass of "mechanized transporters." In addition, an instance of a class can be an instance of any number of additional classes. For example, a Lamborghini may be a member of class "automobiles" and of class "luxury items." This means that the lineage of an instance in an ontology can be highly complex, with a single instance occurring in multiple classes, and with many connections between classes. Because recursive relations are permitted, it is possible to build an ontology wherein a class is both an ancestor class and a descendant class of itself. A classification is a highly restrained ontology wherein instances can belong to only one class, and each class may have only one direct parent class. Because classifications have an enforced linear hierarchy, they can be easily modeled, and the lineage of any instance can be traced unambiguously. See Classification. See Multiclass classification. See Multiclass inheritance.

Page Rank Page Rank, alternately PageRank, is a computational method popularized by Google that searches through an index of every Web page, to produce an ordered set of Web pages whose content can be matched against a query phrase. The rank of a page is determined by two scores: the relevancy of the page to the query phrase; and the importance of the page. The relevancy of the page is determined by factors such as how closely the page matches the query phrase, and whether the content of the page is focused on the subject of the query. The importance of the page is determined by how many Web pages link to and from the page, and the importance of the Web pages involved in the linkages. It is easy to see that the methods for scoring relevance and importance are subject to many algorithmic variations, particularly with respect to the choice of measures (ie, the way in which a page's focus on a particular topic is quantified), and the weights applied to each measurement. The reason that Page Rank queries are fast is that the score of a page's importance is pre-computed, and stored with the page's Web address. Word matches from the query phrase to Web pages are quickly assembled using a pre-computed index of words, the pages containing the words, and the locations of the words in the pages.8 The success of Page Rank, as employed by Google, is legend. Page ranking is an example of Object ranking, a computational method for ranking data objects. Object ranking involves providing objects with a quantitative score that provides some clue to the relevance or the popularity of an object.
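
Google's production system is vastly more elaborate, but the idea of a pre-computed, link-based importance score can be illustrated in a few lines of Python. This sketch runs the textbook power-iteration formula, with a damping factor of 0.85, over a tiny invented four-page web; the page names and links are arbitrary:

# links[page] = the pages that this page links to (an invented four-page web)
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
damping = 0.85
pages = list(links)
rank = {page: 1.0 / len(pages) for page in pages}

for _ in range(50):                  # iterate until the ranks settle
    new_rank = {}
    for page in pages:
        inbound = sum(rank[q] / len(links[q]) for q in pages if page in links[q])
        new_rank[page] = (1 - damping) / len(pages) + damping * inbound
    rank = new_rank

for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))   # "C" draws the most inbound links and ranks highest

In a real search engine, scores of this kind would be computed ahead of time and stored alongside the word index, so that query-time work is limited to matching and sorting.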

Parent class The immediate ancestor, or the next-higher class (ie, the direct superclass) of a class. For example, in the classification of living organisms, Class Vertebrata is the parent class of Class Gnathostomata. Class Gnathostomata is the parent class of Class Teleostomi. In a classification, which imposes single class inheritance, each child class has exactly one parent class; whereas one parent class may have several different child classes. Furthermore, some classes, in particular, the bottom class in the lineage, have no child classes (ie, a class need not always be a superclass of other classes). A class can be defined by its properties, its membership (ie, the instances that belong to the class), and by the name of its parent class. When we list all of the classes in a classification, in any order, we can always reconstruct the complete class lineages, with their correct branchings, if we know the name of each class's parent class. See Instance. See Child class. See Superclass.

Project Gutenberg An organization that has converted nearly 50,000 books into freely available ebooks. Most of the Project Gutenberg ebooks were prepared from works published prior to 1923, for which copyright protections have expired. Such books fall into the public domain. More information is available at: www.gutenberg.org. See Public domain.

Public domain Data that is not owned by an entity. Public domain materials include documents whose copyright terms have expired, materials produced by the federal government, materials that contain no creative content (ie, materials that cannot be copyrighted), or materials donated to the public domain by the entity that holds copyright. Public domain data can be accessed, copied, and redistributed without violating piracy laws. It is important to note that plagiarism laws and rules of ethics apply to public domain data. You must properly attribute authorship to public domain documents. If you purposely fail to attribute authorship or if you purposefully and falsely attribute authorship to the wrong person (eg, yourself), then this is unethical, and an act of plagiarism.

RDF Schema Resource Description Framework Schema (RDFS). A document containing a list of classes, their definitions, and the names of the parent class(es) for each class. In an RDF Schema, the list of classes is typically followed by a list of properties that apply to one or more classes in the Schema. To be useful, RDF Schemas are posted on the Internet, as a Web page, with a unique Web address. Anyone can incorporate the classes and properties of a public RDF Schema into their own RDF documents (public or private) by linking named classes and properties, in their RDF document, to the web address of the RDF Schema where the classes and properties are defined. See Namespace. See RDFS.

RDFS Same as RDF Schema.

Regex Shortened form of "Regular Expression." See Regular expression.

Regular expression Short form, Regex. A regular expression is a widely used syntax for specifying character patterns. Most programming languages and many word processing applications use regular expressions for describing character patterns that can be matched against character strings. A detailed description of regular expressions is found in Open Source Tools for Chapter 2.

Reidentification A term casually applied to any instance whereby information can be linked to a specific person, after the links between the information and the person associated with the information have been removed. Used this way, the term reidentification connotes an insufficient deidentification process. In the health care industry, the term "reidentification" means something else entirely. In the U.S., regulations define "reidentification" under the "Standards for Privacy of Individually Identifiable Health Information.”42 Therein, reidentification is a legally sanctioned process whereby deidentified records can be linked back to their human subjects, under circumstances deemed legitimate and compelling, by a privacy board. Reidentification is typically accomplished via the use of a confidential list of links between human subject names and deidentified records, held by a trusted party. In the healthcare realm, when a human subject is identified through fraud, trickery, or through the deliberate use of computational methods to break the confidentiality of insufficiently deidentified records (ie, hacking), the term "reidentification" would not apply.4

Reification A programming term that has some similarity to "autovivification." In either case, an abstract piece of a program brings itself into life, at the moment when it is assigned a name. Whereas autovivification generally applies to variables and data structures, reification generally applies to blocks of code, methods, and data objects. When a named block of code becomes reified, it can be invoked anywhere within the program, by its name. See Autovivification.

Scalable Software is scalable if it operates smoothly, whether the data is small or large. Software programs that operate by slurping all data into a RAM variable (ie, a data holder in RAM memory) are not scalable, because such programs will eventually encounter a quantity of data that is too large to store in RAM. As a rule of thumb, programs that process text at speeds less than a megabyte per second are not scalable, as they cannot cope, in a reasonable time frame, with quantities of data in the gigabyte and higher range.

Scrubbing Data scrubbing is a lot like any other kind of scrubbing. The purpose is to get rid of the dirt and to leave behind a clean product. As an example, when medical records are scrubbed, the most important component to remove is usually patient names and identifiers (eg, social security numbers), any other information present in the text that may help determine the identity of the patient (eg, address, date of birth, eye color, tattoo descriptions), and any information that is not relevant to the intended purpose of the record (eg, complaints directed to the hospital staff, television channel preferences, income). There are two general approaches to scrubbing algorithms: (1) developing numerous routines that find and delete data that needs to be scrubbed and (2) developing routines that find the information needed in a research study, and deleting all other data from the records. Data scrubbing is discussed in depth in Section 5.4.

Simpson's paradox Occurs when a correlation that holds in two different data sets is reversed if the data sets are combined. For example, baseball player A may have a higher batting average than player B for each of two seasons, but when the data for the two seasons are combined, player B may have the higher 2-season average. Simpson's paradox is just one example of unexpected changes in outcome when variables are unknowingly hidden or blended.43
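
The arithmetic of the reversal is easy to reproduce; here is a tiny Python sketch with invented batting numbers chosen to exhibit the effect:

# invented batting records: hits / at-bats
#             season 1         season 2         combined
# player A:    4 / 10 (.400)   25 / 100 (.250)  29 / 110 (.264)
# player B:   35 / 100 (.350)   2 / 10 (.200)   37 / 110 (.336)
print(4 / 10.0 > 35 / 100.0)                  # True:  A beats B in season 1
print(25 / 100.0 > 2 / 10.0)                  # True:  A beats B in season 2
print((4 + 25) / 110.0 > (35 + 2) / 110.0)    # False: B wins when the seasons are pooled

The reversal happens because each player compiled most of his at-bats in a different season; the hidden variable is the uneven distribution of at-bats across the two seasons.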

Species Species is the bottom-most class of any classification or ontology. Because the species class contains the individual objects of the classification, it is the only class which is not abstract. The special significance of the species class is best exemplified in the classification of living organisms. Every species of organism contains individuals that share a common ancestral relationship to one another. When we look at a group of squirrels, we know that each squirrel in the group has its own unique personality, its own unique genes (ie, genotype), and its own unique set of physical features (ie, phenotype). Moreover, although the DNA sequences of individual squirrels are unique, we assume that there is a commonality to the genome of squirrels that distinguishes it from the genome of every other species. If we use the modern definition of species as an evolving gene pool, we see that the species can be thought of as a biological life form, with substance (a population of propagating genes), and a function (evolving to produce new species).4446 Put simply, species speciate; individuals do not. As a corollary, species evolve; individuals simply propagate. Hence, the species class is a separable biological unit with form and function. We, as individuals, are focused on the lives of individual things, and we must be reminded of the role of species in biological and nonbiological classifications. The concept of species is discussed in greater detail in Section 6.4. See Blended class. See Nonatomicity.

Stop words High frequency words such as "the, and, an, but, if," that tend to delineate phrases or terms in text. Also called "barrier words." An example of the use of stop words in text processing is provided in Section 3.2.

Superclass Any of the ancestral classes of a subclass. For example, in the classification of living organisms, the class of vertebrates is a superclass of the class of mammals. The immediate superclass of a class is its parent class. In common parlance, when we speak of the superclass of a class, we are usually referring to its parent class. See Parent class.

Taxonomic order In biological taxonomy, the hierarchical lineage of organisms is divided into a descending list of named ranks: Kingdom, Phylum (Division), Class, Order, Family, Genus, and Species. As we have learned more and more about the classes of organisms, modern taxonomists have added additional ranks to the classification (eg, supraphylum, subphylum, suborder, infraclass, etc.). Was this really necessary? All of this taxonomic complexity could be averted by dropping named ranks and simply referring to every class as "Class." Modern specifications for class hierarchies (eg, RDF Schema) encapsulate each class with the name of its superclass. When every object yields its class and superclass, it is possible to trace any object's class lineage. For example, in the classification of living organisms, if you know the name of the parent for each class, you can write a simple script that generates the complete ancestral lineage for every class and species within the classification.12 See Class. See Taxonomy. See RDF Schema. See Species.
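
Here is a minimal Python sketch of such a script, using a small child-to-parent dictionary built from the vertebrate classes mentioned in the Parent class entry; a full classification would simply supply a larger table:

# each class points to the name of its one parent class (the root points to None)
parent_of = {
    "Vertebrata": None,
    "Gnathostomata": "Vertebrata",
    "Teleostomi": "Gnathostomata",
}

def lineage(class_name):
    # walk up the parent links until the root of the classification is reached
    ancestry = [class_name]
    while parent_of[class_name] is not None:
        class_name = parent_of[class_name]
        ancestry.append(class_name)
    return ancestry

print(" < ".join(lineage("Teleostomi")))
# Teleostomi < Gnathostomata < Vertebrata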

Taxonomy A taxonomy is the collection of named instances (class members) in a classification or an ontology. When you see a schematic showing class relationships, with individual classes represented by geometric shapes and the relationships represented by arrows or connecting lines between the classes, then you are essentially looking at the structure of a classification, minus the taxonomy. You can think of building a taxonomy as the act of pouring all of the names of all of the instances into their proper classes. A taxonomy is similar to a nomenclature; the difference is that in a taxonomy, every named instance must have an assigned class. See Taxonomic order.

Term extraction algorithm Terms are phrases, most often noun phrases, and sometimes individual words, that have a precise meaning within a knowledge domain. For example, "software validation," "RDF triple," and "World Wide Telescope" are examples of terms that might appear in the index or the glossary of this book. The most useful terms might appear up to a dozen times in the text, but when they occur on every page, their value as a searchable item is diminished; there are just too many instances of the term to be of practical value. Hence, terms are sometimes described as noun phrases that have low-frequency and high information content. Various algorithms are available to extract candidate terms from textual documents. The computer-generated list of candidate terms can be examined by a curator who determines whether they should be included in the index created for the document from which they were extracted. The curator may also compare the extracted candidate terms against a standard nomenclature, to determine whether the candidate terms should be added to the nomenclature.4 Examples of term extraction algorithms are provided in Section 3.2.

Terminology The collection of words and terms used in some particular discipline, field, or knowledge domain. Nearly synonymous with vocabulary and with nomenclature. Vocabularies, unlike terminologies, are not confined to the terms used in a particular field. Nomenclatures, unlike terminologies, usually aggregate equivalent terms under a canonical synonym.

Thesaurus A vocabulary that groups together synonymous terms. A thesaurus is very similar to a nomenclature. There are two minor differences. Nomenclatures include multi-word terms, whereas a thesaurus is typically composed of one-word terms. In addition, nomenclatures are typically restricted to a well-defined topic or knowledge domain (eg, names of stars, infectious diseases, etc.). See Nomenclature. See Vocabulary. See Classification. See Dictionary. See Terminology. See Ontology.

Unclassifiable objects Classifications create a class for every object, and taxonomies assign each and every object to its correct class. This means that a classification is not permitted to contain unclassified objects, a condition that puts fussy taxonomists in an untenable position. Suppose you have an object, and you simply do not know enough about the object to confidently assign it to a class. Or suppose you have an object that seems to fit more than one class, and you cannot decide which class is the correct class. What do you do? Historically, scientists have resorted to creating a "miscellaneous" class into which otherwise unclassifiable objects are given a temporary home, until more suitable accommodations can be provided. I have spoken with numerous data managers, and everyone seems to be of a mind that "miscellaneous" classes, created as a stopgap measure, serve a useful purpose. Not so. Historically, the promiscuous application of "miscellaneous" classes has proven to be a huge impediment to the advancement of science. In the classification of living organisms, the class of protozoans stands as a case in point. Ernst Haeckel, a leading biological taxonomist in his time, created the Kingdom Protista (ie, protozoans) in 1866 to accommodate a wide variety of simple organisms with superficial commonalities. Haeckel himself understood that the protists were a blended class that included unrelated organisms, but he believed that further study would resolve the confusion. In a sense, he was right, but the process took much longer than he had anticipated, occupying generations of taxonomists over the following 150 years. Today, Kingdom Protista no longer exists; its members have been reassigned to other classes. Nonetheless, textbooks of microbiology still describe the protozoans, just as though this name continued to occupy a legitimate place among terrestrial organisms. In the meantime, therapeutic opportunities for eradicating so-called protozoal infections, using class-targeted agents, have no doubt been missed.47 You might think that the creation of a class of living organisms with no established scientific relation to the real world was a rare and ancient event in the annals of biology, having little or no chance of being repeated. Not so. A special pseudoclass of fungi, the deuteromycetes (spelled with a lowercase "d," signifying its questionable validity as a true biologic class), has been created to hold fungi of indeterminate speciation. At present, there are several thousand such fungi, sitting in a taxonomic limbo, waiting to be placed into a definitive taxonomic class.48,47 See Blended class.

Variable In algebra, a variable is a quantity in an equation that can change, as opposed to a constant quantity, which cannot change. In computer science, a variable can be perceived as a container that can be assigned a value. If you assign the integer 7 to a container named "x," then "x" equals 7 until you reassign some other value to the container (ie, variables are mutable). In some computer languages, when you issue a command assigning a value to a new (undeclared) variable, the variable automatically comes into existence to accept the assignment. The process whereby an object comes into existence because its existence was implied by an action (such as value assignment) is called reification. See Reification. See Autovivification.
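A minimal Python sketch of these points follows; the names are illustrative. Python variables come into existence when first assigned, and nested containers can be made to spring into existence on demand (an approximation of autovivification) with collections.defaultdict.

from collections import defaultdict

x = 7           # the variable "x" comes into existence upon assignment
x = "seven"     # variables are mutable; "x" now holds a string instead of an integer

# Autovivification-like behavior: the inner dictionary is created on first use.
tree = defaultdict(dict)
tree["mammals"]["squirrel"] = 1

print(x, dict(tree))   # Output: seven {'mammals': {'squirrel': 1}}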

Vocabulary A comprehensive collection of words and their associated meanings. In some quarters, "vocabulary" and "nomenclature" are used interchangeably, but they are different from one another. Nomenclatures typically focus on terms confined to one knowledge domain. Nomenclatures typically do not contain definitions for the contained terms. Nomenclatures typically group terms by synonymy. Lastly, nomenclatures include multi-word terms. Vocabularies are collections of single words, culled from multiple knowledge domains, with their definitions, and assembled in alphabetic order. See Nomenclature. See Thesaurus. See Taxonomy. See Dictionary. See Terminology.

Winnowing and chaffing Better known to contrarians as chaffing and winnowing. A protocol invented by Ronald Rivest for securing messages against eavesdroppers, without technically employing encryption.49 As used in this book, the winnowing and chaffing protocol would be considered a type of encryption. A detailed discussion of winnowing and chaffing is found in the Open Source Tools section of Chapter 8. See Encryption.

References

1 Wheatley H.B. How to make an index. London: Elliott Stock; 1902.

2 Wheeler M.T. Indexing: principles, rules and examples. 3rd ed. Albany, NY: New York State Library; 1920.

3 Wallis E., Lavell C. Naming the indexer: where credit is due. Indexer. 1995;19:266–268.

4 Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Burlington, MA: Morgan Kaufmann; 2013.

5 Mallon T. The best part of every book comes last. The New York Times. 1991.

6 Shah N.H., Jonquet C., Chiang A.P., Butte A.J., Chen R., Musen M.A. Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinform. 2009;10(Suppl. 2):S1.

7 Lamb J. Embedded indexing. IEEE Trans Ind Electron. 2005;24:206–209.

8 Brin S., Page L. The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst. 1998;33:107–117.

9 Berman J.J. Repurposing legacy data: innovative case studies. Burlington, MA: Elsevier, Morgan Kaufmann Imprint; 2015.

10 Berman J.J. Ruby programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2008.

11 Krauthammer M., Nenadic G. Term identification in the biomedical literature. J Biomed Inform. 2004;37:512–526.

12 Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in perl, python, and ruby. Boca Raton, FL: Chapman and Hall; 2010.

13 Berman J.J. Tumor classification: molecular analysis meets Aristotle. BMC Cancer. 2004;4:10. Available from: http://www.biomedcentral.com/1471-2407/4/10 [accessed on 01.01.15].

14 Berman J.J. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer. 2004;4:88. Available from: http://www.biomedcentral.com/1471-2407/4/88 [accessed on 01.01.15].

15 Berman J.J. Modern classification of neoplasms: reconciling differences between morphologic and molecular approaches. BMC Cancer. 2005;5:100. Available from: http://www.biomedcentral.com/1471-2407/5/100 [accessed on 01.01.15].

16 Berman J.J. Neoplasms: principles of development and diversity. Sudbury, MA: Jones & Bartlett; 2009.

17 Berman J.J. Precancer: the beginning and the end of cancer. Sudbury, MA: Jones and Bartlett; 2010.

18 Hayes A. VA to apologize for mistaken Lou Gehrig's disease notices. CNN. 2009. Available from: http://www.cnn.com/2009/POLITICS/08/26/veterans.letters.disease [accessed on 04.09.12].

19 Hall P.A., Lemoine N.R. Comparison of manual data coding errors in 2 hospitals. J Clin Pathol. 1986;39:622–626.

20 Berman J.J. Doublet method for very fast autocoding. BMC Med Inform Decis Mak. 2004;4:16.

21 Moore G.W., Berman J.J. Performance analysis of manual and automated systematized nomenclature of medicine (Snomed) coding. Am J Clin Pathol. 1994;101:253–256.

22 Berman J.J. Resources for comparing the speed and performance of medical autocoders. BMC Med Inform Decis Mak. 2004;4:8.

23 Berman J.J., Moore G.W. SNOMED-encoded surgical pathology databases: a tool for epidemiologic investigation. Mod Pathol. 1996;9:944–950.

24 Campbell J.R., Carpenter P., Sneiderman C., Cohn S., Chute C.G., Warren J. Phase II evaluation of clinical coding schemes completeness, taxonomy, mapping, definitions, and clarity. J Am Med Inform Assoc. 1997;4:238–250.

25 Berman J.J. Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching. In Silico Biol. 2005;5:0029. Available from: http://www.bioinfo.de/isb/2005/05/0029/ [accessed on 06.09.15].

26 Berman J.J. Threshold protocol for the exchange of confidential medical data. BMC Med Res Methodol. 2002;2:12. Available from: http://www.biomedcentral.com/1471-2288/2/12 [accessed on 27.08.15].

27 Schneier B. Applied cryptography: protocols, algorithms and source code in C. New York, NY: Wiley; 1994.

28 Medical Subject Headings. U.S. National Library of Medicine. Available from: https://www.nlm.nih.gov/mesh/filelist.html [accessed on 29.07.15].

29 Berman J.J. A tool for sharing annotated research data: the “category 0” UMLS (unified medical language system) vocabularies. BMC Med Inform Decis Mak. 2003;3:6.

30 Hayes C.F., O'Connor J.C. English-Esperanto dictionary. London: Review of Reviews Office; 1906. Available from: http://www.gutenberg.org/ebooks/16967 [accessed on 29.07.15].

31 Sioutos N., de Coronado S., Haber M.W., Hartel F.W., Shaiu W.L., Wright L.W. NCI thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007;40:30–43.

32 NCI Thesaurus. Bethesda, MD: National Cancer Institute, U.S. National Institutes of Health. Available from: ftp://ftp1.nci.nih.gov/pub/cacore/EVS/NCI_Thesaurus/ [accessed on 29.07.15].

33 Berman J.J. Automatic extraction of candidate nomenclature terms using the doublet method. BMC Med Inform Decis Mak. 2005;5:35.

34 A Review of the FBI's Handling of the Brandon Mayfield Case. U.S. Department of Justice, Office of the Inspector General, Oversight and Review Division; 2006.

35 Al-Agha O.M., Igbokwe A.A. Malignant fibrous histiocytoma: between the past and the present. Arch Pathol Lab Med. 2008;132:1030–1035.

36 Nakayama R., Nemoto T., Takahashi H., Ohta T., Kawai A., Seki K., et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Mod Pathol. 2007;20:749–759.

37 Patil N., Berno A.J., Hinds D.A., Barrett W.A., Doshi J.M., Hacker C.R., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001;294:1719–1723.

38 Faldum A., Pommerening K. An optimal code for patient identifiers. Comput Methods Prog Biomed. 2005;79:81–88.

39 Rivest R. Request for Comments: 1321, The MD5 Message-Digest Algorithm. Network Working Group. Available from: https://www.ietf.org/rfc/rfc1321.txt [accessed on 01.01.15].

40 Bouzelat H., Quantin C., Dusserre L. Extraction and anonymity protocol of medical file. Proc AMIA Annu Fall Symp. 1996;1996:323–327.

41 Quantin C.H., Bouzelat F.A., Allaert A.M., Benhamiche J., Faivre J., Dusserre L. Automatic record hash coding and linkage for epidemiological followup data confidentiality. Methods Inf Med. 1998;37:271–277.

42 Department of Health and Human Services. 45 CFR (Code of Federal Regulations), parts 160 through 164. Standards for privacy of individually identifiable health information (final rule). Fed Regist. 2000;65(250):82461–82510.

43 Tu Y., Gunnell D., Gilthorpe M.S. Simpson's paradox, Lord's paradox, and suppression effects are the same phenomenon — the reversal paradox. Emerg Themes Epidemiol. 2008;5:2.

44 DeQueiroz K. Ernst Mayr and the modern concept of species. Proc Natl Acad Sci U S A. 2005;102(Suppl. 1):6600–6607.

45 DeQueiroz K. Species concepts and species delimitation. Syst Biol. 2007;56:879–886.

46 Mayden R.L. Consilience and a hierarchy of species concepts: advances toward closure on the species puzzle. J Nematol. 1999;31(2):95–116.

47 Berman J.J. Taxonomic guide to infectious diseases: understanding the biologic classes of pathogenic organisms. Waltham, MA: Academic Press; 2012.

48 Guarro J., Gene J., Stchigel A.M. Developments in fungal taxonomy. Clin Microbiol Rev. 1999;12:454–500.

49 Rivest R.L. Chaffing and winnowing: confidentiality without encryption. MIT Lab for Computer Science; 1998 (rev. April 24, 1998).
