Chapter 2

Structuring Text

Abstract

Most of the data on the Web today is unstructured text, produced by individuals trying their best to communicate with one another. Data simplification often begins with textual data. This chapter provides readers with tools and strategies for imposing some basic structure on free-text.

Keywords

Free-text; ASCII; Asciibetical order; Alphabetization; Sentence parsing; Abbreviations; Acronyms; Metadata; XML; HTML; Markup languages

2.1 The Meaninglessness of Free Text

I've had a perfectly wonderful evening. But this wasn't it.

Groucho Marx

English is such a ridiculous language that an objective observer might guess that it was designed for the purpose of impeding communication. As someone who has dabbled in machine translation, I find the following points particularly irritating:

1. Homonyms. One word can have dozens of meanings, depending on context. Unfortunately, you cannot really depend on context, particularly when writing machine translation software (see Glossary item, Machine translation). Here are a few egregious examples:

Both the martini and the bar patron were drunk.

The bandage was wound around a wound.

He farmed to produce produce.

Present the present in the present time.

Don't object to the data object.

Teach a sow to sow seed.

Wind the sail before the wind comes.

Homonymic misunderstandings are the root cause of all puns, the lowest and most despised form of humor.

2. Janus sentences. A single word can have opposite meanings, or opposite words may have equivalent meanings, depending on idiomatic context. For example, if you were asked "Have you been here?" and you have not, you might answer, "No, I haven't." If I were to negate the same question and ask, "Haven't you been here?," you would likewise answer "No, I haven't." Same answer, opposite question.

If you were told that the light is out in your room, then you would know that the light was not working. If you were told that the stars are out tonight, then you would know that the stars are shining, as expected.

Antonyms become synonyms, as if by whim. If I were to cash in my chips, I would collect the money owed me. If I cashed out my chips, the meaning does not change. In or out, it's all the same. Contrariwise, overlook and oversee should, logically, be synonyms; but they are antonyms.

3. Word meanings change with case. As examples, Nice and nice, Polish and polish, Herb and herb, August and august. Likewise, all abbreviations whose letters form a legitimate word will mean one thing when appearing in uppercase (eg, AIDS, the disease; US, the country; OR, the state) and another thing when appearing in lowercase (eg, "aids," to help; "us," the first person plural; and "or," the option).

4. Noncompositionality of words. The meaning of individual words cannot be reliably deduced by analyzing root parts. For example, there is neither pine nor apple in pineapple, no egg in eggplant, and hamburgers are made from beef, not ham. You can assume that a lover will love, but you cannot assume that a finger will "fing." Vegetarians will eat vegetables, but humanitarians will not eat humans.1

5. Unrestricted extensibility. Sentences are extensible, but comprehension is not. For example, the following four sentences are examples of proper English.

The cat died.

The cat the dog chased died.

The cat the dog the rat bit chased died.

The cat the dog the rat the elephant admired bit chased died.

If we think about these sentences long enough, we might conclude that the following assertions apply:

The elephant admired the rat.

The rat bit the dog.

The dog chased the cat.

The cat died.

6. Reifications. The most unforgivable flaw in English is the common usage of reification: the process whereby the subject of a sentence is inferred without actually being named (see Glossary item, Reification). Reification is accomplished with pronouns and other subject references.1

Here is an example:

"It never rains here."

The word "it" seems to be the subject of the sentence; but what is "it" really? "It" seems to be the thing that never rains at a particular location specified as "here" (wherever "here" is). What would be the noun word for which "it" is the pronoun?

The sentence "It never rains here" is meaningless because there is no way of determining the subject of the sentence (ie, the object to which the sentence applies).

Let's look at another example of reification; this one taken from a newspaper article.1

"After her husband disappeared on a 1944 recon mission over Southern France, Antoine de Saint-Exupery's widow sat down and wrote this memoir of their dramatic marriage."

There are two reified persons in the sentence: "her husband," and "Antoine de Saint-Exupery's widow." In the first phrase, "her husband" is a relationship (ie, "husband") established for a pronoun (ie, "her") referenced to the person in the second phrase. The person in the second phrase is reified by a relationship to Saint-Exupery (ie, "widow"), who just happens to be the reification of the person in the first phrase (ie, "Saint-Exupery is her husband").

A final example is:

"Do you know who I am?"

There are no identifiable individuals; everyone is reified and reduced to an unspecified pronoun ("you," "I"). Though there are just a few words in the sentence, half of them are superfluous. The words "Do," "who," and "am" are merely fluff, with no informational purpose. In an object-oriented world, the question would be transformed into an assertion, "You know me," and the assertion would be sent an object method, "true?".1 We are jumping ahead. Objects, assertions, and methods will be discussed in later chapters.

We write self-referential reifying sentences every time we use a pronoun. Strictly speaking, such sentences are meaningless and cannot be sensibly evaluated by software programs (see Glossary item, Meaning). The subjects of the sentence are not properly identified, and the references to the subjects are ambiguous.

A Hungarian friend once told me that "English is a language that, if you sit down and learn it all by yourself, you will end up speaking a language that nobody else on earth speaks." By this, he meant that English makes no sense. The meaning of the English language comes from common usage, not through semantic logic.

If English is meaningless, then how can we produce software that accurately translates English to other languages? As you might guess, idiomatic and complex sentences are difficult to translate. Machine translation of free-text is greatly facilitated when text is written as simple, short, declarative sentences. Over the years, there have been several attempts at simplifying English, most notably "Controlled English," and "Basic English."

Controlled English is a disciplined approach to sentence construction that avoids some of the intrinsic flaws in written language. Pogson, in 1988, formalized the rules for Controlled English. Pogson's key points are as follows2,3:

1. Each word in the text may convey only one meaning and context. If "iris" is an anatomic part of the eye, then it cannot be used as a flower. If "report" is used as a noun, then it should not be used as a verb elsewhere in the text.

2. For each meaning, only one term may be used (eg, if you use the term "chair," then you should not use the term "seat" to describe the same piece of furniture, elsewhere in the text).

C.K. Ogden, in 1930, introduced Basic English, wherein words are constrained to a list of about 850 words that have clear definitions and that convey more than 90% of the concepts commonly described in narrative text.4 Basic English was championed by some very influential people, including Winston Churchill. Numerous books have been "translated" into Basic English, including the Bible.3

Most recently, a type of simplified English has been developed for Wikipedia readers who do not speak English as their first language.5 Simplified versions of many Wikipedia pages have been composed, thus extending the reach of this remarkable Web resource.

Although Controlled English, Basic English, and Simplified English all serve to reduce the complexity of language, they could go much further toward making sentences easy for computers to parse. Here are a few additional suggestions for creating sentences that can be understood by humans and by computers3:

1. Sentences should be short and declarative, with an obvious subject, predicate, and object. The shorter the sentence, the lower the likelihood of misinterpretation.

2. Negations should include the word "not" and double negations should never be used. Most importantly, negations should not be expressed as positive assertions. "John is absent" is a positive statement that has the same meaning as "John is not present." The former requires the computer to understand the meaning of "absence," whereas the latter only requires the computer to recognize a common negation operator ("not").

3. Abbreviations and acronyms should be avoided, wherever feasible. If an abbreviation must be used, then it should appear in all uppercase letters, without periods. Abbreviations can be pluralized by adding a lowercase "s."

4. Natural language parsers must know where one sentence ends and the next begins. The "period" alone cannot serve as a reliable sentence delimiter. Wanton periods appear in honorifics, abbreviations, quantitative values, and Web addresses (eg, Mr., Ph.D., U.S., $5.15, gov.com). Using the period as the sentence terminator would result in the abrupt separation and loss of terms that would otherwise be connected. Consider using a period, exclamation point, or question mark followed by a double space. If a sentence naturally ends at a carriage return, two spaces should be added after the period, as a buffer. The consistent use of two spaces after a period runs counter to the preference of the printing industry, where single-space sentence delimiters currently prevail. Regardless, the consistent inclusion of double spaces between sentences greatly simplifies computer parsing, and should be used when circumstances permit (see Glossary item, Monospaced font).

5. Reifications should be abandoned, whenever feasible. Use "Rain seldom occurs at this location," rather than "It doesn't rain here much."

We labor under the delusion that the sentences we write have specific, unambiguous meaning. This is simply not true. Most sentences are ambiguous. In Chapter 6, we will discuss how to create meaningful sentences.

2.2 Sorting Text, the Impossible Dream

Consistency is the last refuge of the unimaginative.

Oscar Wilde

The world loves an alphabetically sorted list. A quick glance at any alphabetized list of words or terms from a text document always reveals a great deal about the content. If you've scanned the first 100 words on an alphabetized list, and you're still looking at words that begin with "ab," then you can infer that the document is long and that its vocabulary is rich.

For programmers, the greatest value of alphabetized lists comes with fast search and retrieval algorithms. As it happens, it is computationally trivial to find any word in an alphabetically sorted list. Surprisingly, increasing the length of the list does not appreciably increase the length of time required to locate the item; hence, searches on alphabetized lists are virtually instantaneous. If you were a computer, here is how you would search an alphabetized list:

1. Go to the center of the list and take a look at the word at that location.

2. Ask whether the word you are searching has an alphabetic rank less than the word you've just plucked from the center of the list. If so, then you know that your search-word is located somewhere in the first half of the list. Otherwise, your search-word must belong in the second half of the list. Hence, you've reduced the length of the list you must search by one-half.

3. Repeat steps 1 and 2, halving the length of the search-list with each iteration, until you come to the word in the list that matches your search-word.

Imagine that your list of alphabetized words contained a billion words (ie, about 2^30 words). If the words were held in an alphabetized list, then you would be able to find your word in just 30 repetitions of steps 1 and 2. This is so because every repetition of the algorithm reduces the length of the search by a factor of 2.
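To make the halving strategy concrete, here is a minimal Python sketch of the search; the seven-word list and the search term are invented for illustration:

def find_word(sorted_list, search_word):
    # Repeatedly halve the portion of the list that could contain the search-word
    low, high = 0, len(sorted_list) - 1
    while low <= high:
        middle = (low + high) // 2              # look at the word at the center of the interval
        if sorted_list[middle] == search_word:
            return middle                       # found: return the word's position
        elif search_word < sorted_list[middle]:
            high = middle - 1                   # search-word ranks lower: keep the first half
        else:
            low = middle + 1                    # search-word ranks higher: keep the second half
    return -1                                   # the word is not in the list

word_list = ["apple", "bread", "cat", "dog", "egret", "fox", "grape"]
print(find_word(word_list, "egret"))            # prints 4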

Alphabetized lists of words enhance our ability to quickly merge lists to produce an aggregate alphabetized list. The merge is performed by item-by-item alphabetic comparisons, until one of the two lists is exhausted. If the same two lists were merged, without presorting each list, then producing an alphabetized list would require a new sort on a list whose length was the sum of the two lists being merged.
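Here is a similarly minimal Python sketch of the merge; the two short presorted lists are invented for illustration:

def merge_sorted(list_a, list_b):
    # Compare the front items of the two presorted lists, always appending the lesser item
    merged = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] <= list_b[j]:
            merged.append(list_a[i])
            i += 1
        else:
            merged.append(list_b[j])
            j += 1
    merged.extend(list_a[i:])    # one list is exhausted; append whatever remains of the other
    merged.extend(list_b[j:])
    return merged

print(merge_sorted(["ant", "cow", "emu"], ["bee", "dog", "fly"]))
# ['ant', 'bee', 'cow', 'dog', 'emu', 'fly']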

Sorting has such great importance to computer scientists that every introductory programming text contains a section or chapter devoted to sorting routines. Most modern programming languages come with a built-in sort command that will arrange lists of items into alphanumeric order.

On the face of it, alphanumeric sorting is one of the simplest and most useful computational functions provided by modern computers. It is a shame that there exists no satisfactory set of rules that define the product of the sort. In point of fact, there is no guarantee that any two operating systems, provided with the same list of words or numbers, will produce identical sorted lists. Furthermore, it is logically impossible to create a set of sorting rules that will produce a sensible and helpful sorted list, under all possible sets of circumstances. Put simply, sorting is a process that cannot be done correctly.

Just for a taste of how idiosyncratic sorting can be, here are examples of how a few filenames in a directory would be sorted in Linux, DOS, and by a text editor with a sorting feature.

Unix directory sort:

1. X_HIST.PL

2. xlxs_spreadsheet.pdf

3. XMLVOCAB.PL

4. XOXOHWRD.ZBK

5. xy.TXT

6. XY_RAND.PL

7. xyrite.TXT

DOS directory sort:

1. xlxs_spreadsheet.pdf

2. XMLVOCAB.PL

3. XOXOHWRD.ZBK

4. xy.TXT

5. xyrite.TXT

6. XY_RAND.PL

7. X_HIST.PL

Wordprocessor directory sort:

XMLVOCAB.PL

XOXOHWRD.ZBK

XY_RAND.PL

X_HIST.PL

xlxs_spreadsheet.pdf

xy.TXT

xyrite.TXT

No two sorts, for the same list of filenames, were equivalent. How so?

Whenever we sort words or phrases (ie, sequences of words), we must make decisions about the positional hierarchy of characters, the relevancy of characters, and the equivalences among characters and sequences of characters. Whichever decisions we make may conflict with decisions that others have made. Furthermore, the decisions that we make will prove to be wrong decisions, under unexpected conditions.

Before moving forward, we should ask ourselves, "Why does the hierarchical positioning of characters matter?" We must understand that algorithms that search and retrieve words and phrases from sorted lists will fail if the list is sorted by a method that differs from the method the search algorithm uses to decide which of two words or phrases ranks higher in the list. For example, in step 2 of the search algorithm described previously, we compare our search term with the term found at the middle of the alphabetized list. If our search term alphabetically precedes the term located at the middle of the list, we will devote the next iteration of the search to items in the first half of the list. If the list had been sorted in a manner that put the search term in the second half of the list (ie, if the sorted list used a different method for establishing alphabetic precedence), then the search term will not be found and retrieved, using our search algorithm.
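The failure is easy to demonstrate with a variant of the binary search sketched earlier; in this hedged example, the four-name list is sorted ASCIIbetically (uppercase before lowercase), but the search compares terms case-insensitively, so a term that is plainly present in the list is never found:

def case_blind_search(sorted_list, search_word):
    # Binary search that (wrongly, for this list) compares terms case-insensitively
    low, high = 0, len(sorted_list) - 1
    while low <= high:
        middle = (low + high) // 2
        if sorted_list[middle].lower() == search_word.lower():
            return middle
        elif search_word.lower() < sorted_list[middle].lower():
            high = middle - 1
        else:
            low = middle + 1
    return -1

ascii_sorted = ["Oar", "Wilson", "tilson", "wilson"]   # ASCIIbetical order
print(case_blind_search(ascii_sorted, "tilson"))       # prints -1, although "tilson" sits at position 2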

The first order of business in any sorting routine is to determine whether the sort will be done word-for-word or letter-by-letter. In a word-for-word sort, New Brunswick comes before Newark, NJ. This is because "New" precedes "Newark." The word-for-word sort is sometimes referred to as the nothing-before-something sort. In this case, the "nothing" following "New" precedes the "ark" in "Newark." If we had done a letter-by-letter sort, Newark, NJ would precede New Brunswick, because the letter that follows "New" in "Newark" is an "a," which precedes the letter that follows "New" in "New Brunswick" (ie, "B").
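The two orderings can be emulated by choosing different sort keys, as in the following hedged Python sketch; the word-for-word key compares lists of words (so that "nothing" sorts before "something"), while the letter-by-letter key simply ignores the spaces:

terms = ["Newark, NJ", "New Brunswick", "Newfoundland"]

# Word-for-word: compare one word at a time ("nothing-before-something")
word_for_word = sorted(terms, key=lambda term: term.lower().split())

# Letter-by-letter: remove the spaces and compare character by character
letter_by_letter = sorted(terms, key=lambda term: term.lower().replace(" ", ""))

print(word_for_word)      # ['New Brunswick', 'Newark, NJ', 'Newfoundland']
print(letter_by_letter)   # ['Newark, NJ', 'New Brunswick', 'Newfoundland']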

The next order of business in sorting routines is to determine how to deal with uppercase and lowercase letters. Should the uppercase letters precede the lowercase letters? How about punctuation? Should we delete punctuation marks prior to the sort, should we replace punctuation marks with a space, or should we leave them untouched? Should we treat punctuation marks that occur inside a word (eg, "isn't") differently from punctuation marks that occur outside words (eg, "The end.")? What should we do with punctuation marks that occur both inside and outside of a word (eg, "U.S.A.")? Should "P value" reside next to "P-value," or should "Page" be sandwiched in between?

How do we alphabetize names that contain an umlauted character? Do you pretend the umlaut isn't there, and put it in alphabetic order with the plain characters? The same problem applies to every special character (eg, "üéâäàèïÇÉúê").

How do we alphabetize single characters created by combining two characters (eg, "æ")? Should we render Cæsar as "Caesar"?

How do we handle surnames preceded by modifiers? Do you alphabetize de Broglie under "D" or under "d" or under "B"? If you choose B, then what do you do with the concatenated form of the name, "deBroglie"?

You are probably thinking that now is the time to consult some official guidelines. The National Information Standards Organization has published an informative tract, titled "Guidelines for Alphabetical Arrangement of Letters and Sorting of Numerals and Other Symbols."6 These guidelines describe several sorting methods: word-for-word, letter-by-letter, ASCIIbetical, and modified ASCIIbetical. The Guidelines refer us to various sets of alphabetization rules: American Library Association, British Standards Institution, Chicago Manual of Style, and Library of Congress. None of these guidelines are compatible with one another, and all are subject to revision.

For the computer savvy, the choice is obvious: use ASCIIbetical sorting. ASCIIbetical sorting is based on assigning every character to its ASCII value, and sorting character by character (including punctuation, spaces and nonprintable characters). This type of sorting is by far the easiest way to arrange any sequence alphanumerically, and can be accomplished in a manner that produces an identically sorted output in any operating system or in any programming environment (Fig. 2.1).

Figure 2.1 The ASCII chart. Notice that the familiar typewriter (keyboard) characters fall within the first 126 ASCII values. Notice also that the lowercase letters have higher ASCII values than the uppercase letters, and that the punctuation marks are scattered in three areas: below the uppercase letters, between the uppercase and lowercase letters, and above the lowercase letters.

ASCII sorts have their limitations, listing terms in a way that you might not prefer. For example, consider the following Perl script, sortlet.pl, which sorts a set of words and phrases by their ASCII values:

#!/usr/local/bin/perl

@word_array = ("MacIntire", "Macadam", "wilson", "tilson", "Wilson",

"I cannot go", "I can not go", "I can also go", "I candle maker",

"O'Brien", "OBrien", "O'Brien's", "O'Briens", "OBrien's", "Oar", "O'Brienesque");

@word_array = sort (@word_array);

print join("\n", @word_array);

exit;

Here is the output of the sortlet.pl script:

c:\ftp\pl>perl sortlet.pl

I can also go

I can not go

I candle maker

I cannot go

MacIntire

Macadam

O'Brien

O'Brien's

O'Brienesque

O'Briens

OBrien

OBrien's

Oar

Wilson

tilson

wilson

Notice that Macadam follows MacIntire, tilson is sandwiched between Wilson and wilson, and Oar follows O'Brien.

It gets much worse. Consider the following six characters, and their ASCII values:

Ä (ASCII 142)

ä (ASCII 132)

A (ASCII 65)

a (ASCII 97)

á (ASCII 160)

å (ASCII 134)

The numeric ASCII values for the variant forms of "A" are scattered, ranging from 65 to 160. This means that an ASCII sort based on ASCII values of characters will place words containing different versions of "A" in widely scattered locations in the sorted output list. The scattering of specially accented letters would apply to virtually every specially accented character.

Here's an absolutely impossible sorting job for you to sink your teeth into. HTML is a pure-ASCII format that recognizes most of the printable ASCII characters, rendering them much like they would be rendered in any text editor. Several of the printable ASCII characters are ignored by browsers, most notably the angle brackets, "<" and ">". The reason for this is that the angle brackets are used as formatting symbols for tags. If they were rendered as characters, every embedded tag would be visible in the browser. Whenever you need to render an angle bracket in an HTML page, you must use special encodings, specifically "&lt;" for the left bracket (ie, the bracket that looks like a less-than symbol), and "&gt;" for the right bracket (ie, the bracket that looks like a greater-than symbol). By substituting a four-character encoding for a single ASCII character, it becomes impossible to produce a sensible character-by-character sort on HTML character strings.

The situation is made even more impossible (if such a thing exists) by the use of alternative sequences of characters that are meant to represent single ASCII characters. For example, consider the following HTML document, and its browser rendering.

<html><head></head>

<h2>

<br>&#224; &agrave;

<br>&#225; &aacute;

<br>&#226; &acirc;

<br>&#227; &atilde;

<br>&#228; &auml;

<br>&#229; &aring;

<br>&#230; &aelig;

<br>&#231; &ccedil;

<br>&#232; &egrave;

<br>&#233; &eacute;

<br>&#234; &ecirc;

<br>&#235; &euml;

<br>&#236; &igrave;

<br>&#237; &iacute;

<br>&#238; &icirc;

<br>&#239; &iuml;

</h2>

</html>

There is simply no way to order a word sensibly when the characters in the word can be represented by any one of several encodings (Fig. 2.2).

Figure 2.2 A Web browser rendition of each of the characters represented in the preceding HTML document. Notice that each pair of encodings, listed line-by-line in the HTML document, is represented by a character that is not included in the standard keyboard.
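A partial remedy is to decode the entity encodings back to single characters before sorting. Here is a minimal sketch, assuming Python 3, whose standard html module provides an unescape function; decoding restores one character per encoded sequence, although the decoded accented characters will still sort after the unaccented letters:

import html

strings = ["&lt;tag&gt;", "cafe", "caf&eacute;", "&#224; la mode"]

# Decode named and numeric entity encodings, then sort on the decoded forms
for item in sorted(strings, key=html.unescape):
    print(item, "->", html.unescape(item))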

Textual data would be simple to sort if every character in a data file consisted exclusively of so-called printable ASCII. Here are the printable ASCII characters:

!"#$%&'()*+,-./0123456789:;<=>

?@ABCDEFGHIJKLMNOPQRSTUVWXYZ

[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

The so-called printable ASCII characters are the subset of printable characters that appear on the standard computer keyboard. Technically, the list of printable characters includes at least a dozen special characters (eg, accented letters, diacriticals, common symbols). As used here and elsewhere, the printable characters are the keyboard characters listed above.

If the printable ASCII characters are what you truly desire, any corpus of text can be converted into printable ASCII with one line of Perl code that I refer to as my Procrustean translator. Procrustes was a mythical Greek giant who was a thief and a murderer. He would capture travelers and tie them to an iron bed. If they were longer than the bed, he would hack off their limbs until they fit.

Here is the Perl script, unprintable.pl, which replaces nonprintable characters with a space and leaves everything else unchanged:

#!/usr/local/bin/perl

$var = "μ½¼ßüéâäàèïÇÉúêæ That's all folks!";

$var =~ tr/\12\15\40-\176/ /c;   # the /c complement converts every character outside newline, carriage return, and the printable range (octal 40 to 176) into a space

print $var;

exit;

Output of the Perl script, unprintable.pl:

c:\ftp>unprintable.pl

 That's all folks!

The so-called unprintable letters inside the input variable, "μ½¼ßüéâäàèïÇÉúêæ" were replaced by spaces, and the printable characters, "That's all folks!" were left intact. The Procrustean translator lacks subtlety, but it gets the job done.

Here is a nearly equivalent Python script, printable.py:

#!/usr/local/bin/python

# -*- coding: iso-8859-15 -*-

import string

in_string = "prinüéêçâäàtable"

out_string = filter(lambda x: x in string.printable, in_string)

print out_string

exit

Notice that printable.py contains a specialized line that informs the Python interpreter that the script is coded in the iso-8859-15 character set, a variant of ASCII that replaces some of the nonalphabetic ASCII characters with Europeanized nonkeyboard alphabetic characters.

Here is the output of printable.py

c:\ftp>printable.py

printable

A less brutal script would attempt to map common nonkeyboard characters to their closest printable versions. For example, ü would be replaced by u, é by e, â by a, and Ç by c. Some nonprintable characters can be replaced by words (eg, μ by microns, ½ by one-half, ß by beta). Any such program would need to be updated from time to time, to cope with evolving character standards, such as UTF (see Glossary item, UTF).

Here is a short Perl script, non_keyboard.pl, that strips nonkeyboard characters, with the exception of about two dozen commonly occurring nonkeyboard characters that can be mapped to recognizably near-equivalent printable characters or printable words.

#!/usr/local/bin/perl

$var = "μ½¼ßüéâäàèïÇÉúêæ";

$var =~ s/×/ x /g;

$var =~ s/–/-/g;

$var =~ s/—/-/g;

$var =~ s/“/"/g;

$var =~ s/”/"/g;

$var =~ s/’/'/g;

$var =~ s/‘/'/g;

$var =~ s/μ/ microns /g;

$var =~ s/½/ one-half /g;

$var =~ s/¼/ one-quarter /g;

$var =~ s/ß/ beta /g;

$var =~ s/æ/ae/g;

$var =~ tr/üéâäàèïÇÉúê/ueaaaeiCEue/;

$var =~ tr/\12\15\40-\176/ /c;

print "The input, \"μ½¼ßüéâäàèïÇÉúêæ\" has been modified to \"$var\"";

exit;

Here is the output of the non_keyboard.pl script.

Notice that the input string appearing in the non_keyboard.pl Perl script is "μ½¼ßüéâäàèïÇÉúêæ," while the rendition of the same input string, on the monitor, is a collection of straight-angled rods and several Greek letters. This happens because the monitor translates ASCII values to a different set of display characters than does the text editor used to produce the non_keyboard.pl Perl script. This is one more example of the unwanted and perplexing consequences that occur when employing nonprintable characters within textual data sets (Fig. 2.3).

Figure 2.3 The output of the Perl script non_keyboard.pl. The nonprintable input string has been translated to the printable output string, "microns one-half one-quarter beta ueaaaeiCEueae."

Of course, this short script can be modified to translate any quantity of text files containing nonkeyboard characters into text files composed exclusively of keyboard characters. If you need to sort text ASCIIbetically, then you should seriously consider running your files through a script that enforces pure printable-ASCII content.
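For readers who prefer Python, a similar normalization can be sketched with the standard unicodedata module; this is an alternative approach rather than a translation of the Perl script, and it works by decomposing each accented character into a base letter plus a combining mark, keeping the base letter, and replacing anything that still falls outside the printable ASCII range with a space:

import unicodedata

def to_printable_ascii(text):
    decomposed = unicodedata.normalize("NFKD", text)                    # split accented characters into letter + accent
    kept = [ch for ch in decomposed if not unicodedata.combining(ch)]   # drop the accent marks
    return "".join(ch if " " <= ch <= "~" else " " for ch in kept)      # space out anything still non-ASCII

print(to_printable_ascii(u"üéâäàèïÇÉúê That's all folks!"))
# ueaaaeiCEue That's all folks!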

2.3 Sentence Parsing

We'd all have been better off to have read half as many books. Twice.

Dick Cavett

Sentences are the units of language that contain complete thoughts and meaningful assertions. Hence, you might expect that programming languages would be designed to parse through text files sentence-by-sentence. This is not the case. Programming languages parse through text documents line-by-line (see Glossary item, Line). For the data scientist who is trying to make sense of plain-text data, information extracted from lines of text can be deceiving. For example, consider these two lines:

By all means, pull the emergency cord

if you want us all to die!

Here, we have one sentence, with a newline break after the word "cord." If we had constructed a program that parsed and interpreted text line-by-line, we might have concluded that pulling the emergency cord is a good thing. If we had constructed a program that parsed and interpreted text sentence-by-sentence, we might have averted catastrophe.

The process of structuring text begins with extracting sentences. Here is a short Perl script, sentence.pl, that operates on a stanza of verse from Lewis Carroll's poem, Jabberwocky.

#!/usr/local/bin/perl

$all_text =

"And, has thou slain the Jabberwock? Come

to my arms, my beamish boy! O frabjous

day! Callooh! Callay! He chortled in his

joy. Lewis Carroll, excerpted from

Jabberwocky";

$all_text =~ s/\n/ /g;

$all_text =~ s/([^A-Z]+[.!?][ ]{1,3})([A-Z])/$1\n$2/g;

print $all_text;

exit;

Here is the output of the sentence.pl script. Notice that the lines of the original text have been assembled as individual sentences.

c:\ftp\pl>sentence.pl

And, has thou slain the Jabberwock?

Come to my arms, my beamish boy!

O frabjous day!

Callooh!

Callay!

He chortled in his joy.

Lewis Carroll, excerpted from Jabberwocky

The script loads the text into a single string variable and removes the newline characters (ie, the line breaks), replacing them with a space. Next, the script inserts a line break character at patterns that are likely to occur at the ends of sentences. Specifically, the pattern match searches for a sequence of lowercase letters, followed by a sentence delimiting character (ie, a period, a question mark or an exclamation mark), followed by one to three spaces, followed by an uppercase letter. We will use sentence parsing routines in scripts that appear in later chapters. For now, here is the equivalent script, sentence.py, in Python:

#!/usr/local/bin/python

import re

all_text = """And, has thou slain the Jabberwock? Come
to my arms, my beamish boy! O frabjous
day! Callooh! Callay! He chortled in his
joy. Lewis Carroll, excerpted from
Jabberwocky"""

all_text = all_text.replace("\n", " ")  # replace the line breaks with spaces, as in the Perl version

sentence_list = re.split(r'[.!?] +(?=[A-Z])', all_text)

print '\n'.join(sentence_list)

exit

Here is the equivalent script, sentence.rb, in Ruby:

#!/usr/local/bin/ruby

all_text =

"And, has thou slain the Jabberwock? Come

to my arms, my beamish boy! O frabjous

day! Callooh! Callay! He chortled in his

joy. Lewis Carroll, excerpted from

Jabberwocky";

all_text.gsub(/\n/, " ").split(/[.!?] +(?=[A-Z])/).each {|phrase| puts phrase}

exit

After a textual document is parsed into sentences, the next step often involves feeding the sentences to a natural language processor (see Glossary items, Natural language processing, Machine learning, Dark data). Each word in a sentence is assigned a grammatical token (eg, A = adjective, D = determiner, N = noun, P = preposition, V = main verb). A determiner is a word, such as "a" or "the," that specifies the noun.7

Consider the sentence, "The quick brown fox jumped over the lazy dog." This sentence can be grammatically tokenized as:

the::D

quick::A

brown::A

fox::N

jumped::V

over::P

the::D

lazy::A

dog::N

We can express the sentence as a sequence of its tokens listed in the order of occurrence in the sentence: DAANVPDAN. This does not seem like much of a breakthrough, but imagine having a large collection of such token sequences representing every sentence from a large text corpus. With such a data set, we could begin to understand the rules of sentence structure. Commonly recurring sequences, like DAANVPDAN, might be assumed to be proper sentences. Sequences that occur uniquely in a large text corpus are probably poorly constructed sentences. Before long, we might find ourselves constructing logic rules that reduce the complexity of sentences by dropping subsequences which, when removed, yield a sequence that occurs more commonly than the original sequence. For example, our table of sequences might indicate that we can convert DAANVPDAN into NVPAN (ie, "Fox jumped over lazy dog"), without sacrificing too much of the meaning from the original sentence and preserving a grammatical sequence that occurs commonly in the text corpus.
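To make the idea concrete, here is a small Python sketch; the one-letter token codes follow the scheme above, but the tiny lexicon and the three-sentence corpus are invented for illustration, and a real system would rely on a trained part-of-speech tagger:

from collections import Counter

# Toy lexicon mapping words to grammatical tokens
# (D = determiner, A = adjective, N = noun, V = main verb, P = preposition)
lexicon = {"the": "D", "a": "D", "quick": "A", "brown": "A", "lazy": "A",
           "fox": "N", "dog": "N", "cat": "N",
           "jumped": "V", "chased": "V", "over": "P"}

def token_sequence(sentence):
    # Strip terminal punctuation, lowercase the words, and translate each word into its token
    words = sentence.rstrip(".!?").lower().split()
    return "".join(lexicon.get(word, "?") for word in words)

corpus = ["The quick brown fox jumped over the lazy dog.",
          "A lazy brown cat jumped over the quick dog.",
          "The dog chased the cat."]

print(Counter(token_sequence(sentence) for sentence in corpus))
# Counter({'DAANVPDAN': 2, 'DNVDN': 1})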

This short example serves as an overly simplistic introduction to natural language processing. We can begin to imagine that the grammatical rules of a language can be represented by sequences of tokens that can be translated into words or phrases from a second language, and reordered according to grammatical rules appropriate to the target language. Many natural language processing projects involve transforming text into a new form, with desirable properties (eg, other languages, an index, a collection of names, a new text with words and phrases replaced with canonical forms extracted from a nomenclature).7

2.4 Abbreviations

A synonym is a word you use when you can't spell the other one.

Baltasar Gracian

People confuse shortening with simplifying; a terrible mistake. In point of fact, next to reifying pronouns, abbreviations are the most vexing cause of complex and meaningless language. Before we tackle the complexities of abbreviations, let's define our terms. An abbreviation is a shortened form of a word or term. An acronym is an abbreviation composed of letters extracted from the words of a multi-word term. There are two major types of abbreviations: universal/permanent and local/ephemeral. The universal/permanent abbreviations are recognized everywhere and have been used for decades (eg, USA, DNA, UK). Some of the universal/permanent abbreviations ascend to the status of words whose long-forms have been abandoned. For example, we use laser as a word. Few who use the term know that "laser" is an acronym for "light amplification by stimulated emission of radiation." Likewise, we use "AIDS" as a word, forgetting that it is an acronym for "acquired immune deficiency syndrome." The acronym is inaccurate, as AIDS is not a primary immunodeficiency disease; it is a viral disease for which immunodeficiency is a secondary complication. Local/ephemeral abbreviations are created for terms that are repeated within a particular document or a particular class of documents. Synonyms and plesionyms (ie, near-synonyms) allow authors to represent a single concept using alternate terms (see Glossary item, Plesionymy).8

Abbreviations make textual data complex, for three principal reasons:

1. No rules exist with which abbreviations can be logically expanded to their full-length form.

2. A single abbreviation may mean different things to different individuals, or to the same individual at different times.

3. A single term may have multiple different abbreviations. (In medicine, angioimmunoblastic lymphadenopathy can be abbreviated as ABL, AIL, or AIML.) Conversely, a single abbreviation may map to many different expansions; these are the so-called polysemous abbreviations (see Glossary item, Polysemy). In the medical literature, a single abbreviation may have dozens of different expansions.8

Some of the worst abbreviations fall into one of the following categories:

Abbreviations that are neither acronyms nor shortened forms of expansions. For example, the short form of "diagnosis" is "dx," although no "x" is contained therein. The same applies to the "x" in "tx," the abbreviation for "therapy," but not the "X" in "TX" that stands for Texas. For that matter, the short form of "times" is an "x," relating to the notation for the multiplication operator. Roman numerals I, V, X, L, and M are abbreviations for words assigned to numbers, but they are not characters included in the expanded words (eg, there is no "I" in "one"). EKG is the abbreviation for electrocardiogram, a word totally bereft of any "K." The "K" comes from the German orthography. There is no letter "q" in subcutaneous, but the abbreviation for the word is sometimes "subq;" never "subc." What form of alchemy converts ethanol to its common abbreviation, "EtOH?"

Mixed-form abbreviations. In medical lingo, "DSV" represents the dermatome of the fifth (V) sacral nerve. Here a preposition, an article, and a noun (of, the, nerve) have all been unceremoniously excluded from the abbreviation; the order of the acronym components has been transposed (dermatome sacral fifth); an ordinal has been changed to a cardinal (fifth changed to five); and the cardinal has been shortened to its Roman numeral equivalent (V).

Prepositions and articles arbitrarily retained in an acronym. When creating an abbreviation, should we retain or abandon prepositions? Many acronyms exclude prepositions and articles. USA is the acronym for United States of America; the "of" is ignored. DOB (Date Of Birth) remembers the "of."

Single expansions with multiple abbreviations. Just as abbreviations can map to many different expansions, the reverse can occur. For instance, high-grade squamous intraepithelial lesion can be abbreviated as HGSIL or HSIL. Xanthogranulomatous pyelonephritis can be abbreviated as xgp or xgpn.

Recursive abbreviations. The following example illustrates the horror of recursive abbreviations. The term SMETE is the abbreviation for the phrase "science, math, engineering, and technology education." NSDL is a real-life abbreviation for "National SMETE digital Library community." To fully expand the term (ie, to provide meaning to the abbreviation), you must recursively expand the embedded abbreviation, to produce "National science, math, engineering, and technology education digital Library community."

Stupid or purposefully nonsensical abbreviations. The term GNU (GNU's Not Unix) is a recursive acronym. Fully expanded, this acronym is of infinite length. Although the N and the U expand to words ("Not Unix"), the letter G is simply inscrutable. Another example of an inexplicable abbreviation is PT-LPD (post-transplantation lymphoproliferative disorders). The only logical location for a hyphen would be smack between the letters P and T. Is the hyphen situated between the T and the L for the sole purpose of irritating us?

Abbreviations that change from place to place. Americans sometimes forget that most English-speaking countries use British English. For example, an esophagus in New York is an oesophagus in London. Hence TOF makes no sense as an abbreviation of tracheo-esophageal fistula here in the U.S., but this abbreviation makes perfect sense to physicians in England, where a patient may have a tracheo-oesophageal fistula. The term GERD (representing the phrase gastroesophageal reflux disease) makes perfect sense to Americans, but it must be confusing in Britain, where the esophagus is not an organ.

Abbreviations masquerading as words. Our greatest vitriol is reserved for abbreviations that look just like common words. Some of the worst offenders come from the medical lexicon: axillary node dissection (AND), acute lymphocytic leukemia (ALL), Bornholm eye disease (BED), and expired air resuscitation (EAR). Such acronyms complicate the computational task of confidently translating common words. Acronyms commonly appear as uppercase strings, but a review of a text corpus of medical notes has shown that words could not be consistently distinguished from homonymous word-acronyms.9

Fatal abbreviations. Fatal abbreviations are those which can kill individuals if they are interpreted incorrectly. They all seem to originate in the world of medicine:

MVR, which can be expanded to any of: mitral valve regurgitation, mitral valve repair, or mitral valve replacement;

LLL, which can be expanded to any of: left lower lid, left lower lip, or left lower lung;

DOA, which can be expanded to any of: dead on arrival, date of arrival, date of admission, or drug of abuse.

Is a fear of abbreviations rational, or does this fear emanate from an overactive imagination? In 2004, the Joint Commission on Accreditation of Healthcare Organizations, a stalwart institution not known to be squeamish, issued an announcement that, henceforth, a list of specified abbreviations should be excluded from medical records.

Examples of forbidden abbreviations are:

IU (International Unit), mistaken for IV (intravenous) or 10 (ten).

Q.D., Q.O.D. (Latin abbreviation for once daily and every other day), mistaken for each other.

Trailing zero (X.0 mg) or a lack of a leading zero (.X mg), in which cases the decimal point may be missed. Never write a zero by itself after a decimal point (X mg), and always use a zero before a decimal point (0.X mg).

MS, MSO4, MgSO4, all of which can be confused with one another and with morphine sulfate or magnesium sulfate. Write "morphine sulfate" or "magnesium sulfate."

Abbreviations on the hospital watch list were:

μg (for microgram), mistaken for mg (milligrams), resulting in a 1000-fold dosing overdose.

h.s., which can mean either half-strength or the Latin abbreviation for bedtime, and which may be mistaken for q.h.s. (taken every hour). All can result in a dosing error.

T.I.W. (for three times a week), mistaken for three times a day or twice weekly, resulting in an overdose.

The list of abbreviations that can kill, in the medical setting, is quite lengthy. Fatal abbreviations probably devolved through imprecise, inconsistent, or idiosyncratic uses of an abbreviation by the busy hospital staff who enter notes and orders into patient charts. For any knowledge domain, the potentially fatal abbreviations are the most important to catch.

Nobody has ever found an accurate way of disambiguating and translating abbreviations.8 There are, however, a few simple suggestions, based on years of exasperating experience, that might save you time and energy.

1. Disallow the use of abbreviations, whenever possible. Abbreviations never enhance the value of information. The time saved by using an abbreviation is far exceeded by the time spent attempting to deduce its correct meaning.

2. When writing software applications that find and expand abbreviations, the output should list every known expansion of the abbreviation. For example, the abbreviation "ppp," appearing in a medical report, should have all of these expansions inserted into the text, as annotations: pancreatic polypeptide, palatopharyngoplasty, palmoplantar pustulosis, pentose phosphate pathway, platelet poor plasma, primary proliferative polycythaemia, or primary proliferative polycythemia. Leave it up to the knowledge domain experts to disambiguate the results.
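Here is a hedged sketch of the second suggestion; the single-entry expansion table is hypothetical, and a production system would draw its expansions from a curated nomenclature:

import re

# Hypothetical table mapping an abbreviation to every known expansion
expansion_table = {
    "ppp": ["pancreatic polypeptide", "palatopharyngoplasty", "palmoplantar pustulosis",
            "pentose phosphate pathway", "platelet poor plasma",
            "primary proliferative polycythaemia", "primary proliferative polycythemia"]}

def annotate(text, table):
    # Follow each recognized abbreviation with a bracketed list of all of its expansions,
    # leaving the disambiguation to the knowledge domain experts
    def add_annotation(match):
        expansions = table.get(match.group(0).lower())
        if expansions is None:
            return match.group(0)
        return match.group(0) + " [" + "; ".join(expansions) + "]"
    return re.sub(r"[A-Za-z]+", add_annotation, text)

print(annotate("Serum PPP was elevated.", expansion_table))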

2.5 Annotation and the Simple Science of Metadata

All parts should go together without forcing. You must remember that the parts you are reassembling were disassembled by you. Therefore, if you can't get them together again, there must be a reason. By all means, do not use a hammer.

IBM Manual, 1925

Free text is most useful when it is marked with tags that describe the text. Tags can also provide marked text with functionality (eg, linkage to a Web page, linkage to some other section of text, linkage to an image, activation of a script). By marking text with tags, it is possible to transform a document into an ersatz database that can be integrated with other documents and queried. This section describes the fundamentals of markup, particularly HTML and XML.

HTML (HyperText Markup Language) is a collection of format descriptors for text and data. All Web browsers are designed to interpret embedded HTML metadata tags and to display the enclosed data (see Glossary item, Metadata). Tags tell browsers how they should display the tagged object (the object described with metadata).

For example:

<font size=72>This book, <i>Data Simplification</i>, is <b>way cool.</b></font>

The <font size=72> tag provides the Web browser with the size of the text that needs to be displayed. The </font> tag marks the end of the text to which the font size instruction applies. Similarly, within the enclosed text, an <i> tag indicates the start of italicized text, and a <b> tag indicates the start of bolded text (Fig. 2.4).

Figure 2.4 A sentence, with HTML mark-up, as it is displayed in a Web browser.

Before there was HTML, there were file formats that contained embedded tags indicating how the text should be displayed by the word processing application. In the 1980s, XyWrite files were popular among professional writers, because they contained embedded formatting instructions that were accepted by the publishing industry. The commercial XyWrite word processing application was fast and could handle large files. What XyWrite and its competitors could not do, and what HTML provided, was markup instructions for linking parts of one document with parts of another document, and for retrieving specified objects (eg, other Web pages, image files, external programs) from some selected location on the Internet. Internet locations are known as URLs (Uniform Resource Locators). When your browser encounters a link tag, it sends a request to have the object at the URL delivered to the browser. The request and response are negotiated using HTTP (HyperText Transfer Protocol). Your browser formats the Web page, and assembles the linked external data objects specified by the HTML tags found in the Web page document (see Glossary item, Data object).

The strength of HTML is its remarkable simplicity. A few dozen HTML tags are all that are needed to create glorious Web pages, with active links to any location in the Web universe.

HTML is referred to as a specification, because it specifies formatting information for the browser (see Glossary items, Specification, Specification versus standard). It is the browser that produces a visual output that conforms to the specified instructions embedded in textual metadata.

While HTML tells Web browsers how to format Web pages (ie, what Web pages should look like when they are displayed in a browser), HTML tells us nothing about the data being displayed. XML (eXtensible Markup Language) is a markup protocol that binds data elements to metadata, the descriptors of the data (see Glossary item, Protocol).10,11 Surprisingly, this simple relationship between data and the metadata that describes it is the most powerful innovation in data organization since the invention of the book. Seldom does a technology emerge with all the techniques required for its success, but this seems to be the case for XML.

In XML, data descriptors (known as XML tags) enclose the data they describe with angle-brackets, much as HTML encloses text with formatting instructions.

<date>June 16, 1904</date>

The tag, <date>, and its end-tag, </date>, enclose a data element, which in this case is the unabbreviated month, beginning with an uppercase letter and followed by lowercase letters, followed by a space, followed by a two-digit numeric for the day of the month, followed by a comma and space, followed by the 4-digit year. The XML tag could have been defined in a separate document detailing the data format of the data element described by the XML tag. ISO-11179 is a standard that explains how to specify the properties of tags (see Glossary item, ISO-11179).

If we had chosen to, we could have broken the <date> tag into its constituent parts.

<date>

<month>June</month>

<day>16</day>

<year>1904</year>

</date>
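As a brief, hedged illustration of how the tagged parts can be recovered programmatically, here is a sketch that applies Python's standard xml.etree module to the snippet above:

import xml.etree.ElementTree as ET

xml_snippet = "<date><month>June</month><day>16</day><year>1904</year></date>"

# A well-formed snippet parses into a tree of metadata tags enclosing data elements;
# a malformed snippet would raise a ParseError instead
root = ET.fromstring(xml_snippet)
for element in root:
    print(element.tag, ":", element.text)
# month : June
# day : 16
# year : 1904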

Six properties of XML explain its extraordinary utility.10,11 These are:

1. Enforced and defined structure (XML rules and schema): An XML file is well-formed if it conforms to the basic rules for XML file construction recommended by the W3C (Worldwide Web Consortium). This means that it must be a plain-text file, with a header indicating that it is an XML file, and must enclose data elements with metadata tags that declare the start and end of the data element. The tags must conform to certain rules (eg, alphanumeric strings without intervening spaces) and must also obey the rules for nesting data elements.10,12 A metadata/data pair may be contained within another metadata/data pair (so-called nesting), but a metadata/data pair cannot straddle another metadata/data pair. Most browsers will parse XML files, rejecting files that are not well-formed. The ability to ensure that every XML file conforms to basic rules of metadata tagging and nesting makes it possible to extract XML files as sensible data structures.

2. Reserved namespaces: Namespaces preserve the intended meaning of tags whose meanings might otherwise change from Web page to Web page (see Glossary item, Namespace). When you encounter the XML tag <date>, would you know whether the tag referred to a calendar date, or the fruit known as date, or the social encounter known as date? A namespace is the realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata tag "date" may be used to signify a calendar date, or the fruit, or the social engagement. To avoid confusion, metadata terms are assigned a prefix that is associated with a Web document that defines the term (ie, establishes the tag's namespace). For example, an XML page might contain three "date" tags, each prefixed with a code that identifies the namespace that defines the intended meaning for the date tag.

<calendar:date>June 16, 1904</calendar:date>

<agriculture:date>Thoory</agriculture:date>

<social:date>Pyramus and Thisbe</social:date>

At the top of the XML document, you would expect to find links to the three URL locations (ie, Web addresses) where the namespaces appearing in the XML snippet (ie, "calendar:," "agriculture:," and "social:") can be found. If you followed the links to these three namespaces, you would find the definition of "date" used within each respective namespace (see Glossary item, URL). A short parsing sketch that preserves these namespace distinctions appears at the end of this list.

3. Linking data via the internet: XML comes with specifications for linking XML documents with other XML documents, or with any external file that has a specific identifier or Web location (see Glossary items, URN, URL). This means that there is a logical and standard method for linking any XML document or any part of any XML document, including individual data elements, to any other uniquely identified resource (eg, Web page).

4. Logic and meaning: Although the technical methodologies associated with XML can be daunting, the most difficult issues always relate to the meaning of things. A variety of formal approaches have been proposed to reach the level of meaning within the context of XML. The simplest of these is the Resource Description Framework (see Glossary item, RDF). The importance of the RDF model is that it binds data and metadata to a unique object with a Web location. Consistent use of the RDF model assures that data anywhere on the Web can always be connected through unique objects using RDF descriptions. The association of described data with a unique object confers meaning and greatly advances our ability to integrate data over the internet. RDF will be discussed in much greater detail in Open Source Tools for Chapter 6.

5. Self-awareness: Because XML can be used to describe anything, it can certainly be used to describe a query related to an XML page. Furthermore, it can be used to describe protocols for transferring data, performing Web services, or describing the programmer interface to databases. It can describe the rules for interoperability for any data process, including peer-to-peer data sharing. When an XML file is capable of displaying autonomous behavior, composing queries, merging replies and transforming its own content, it is usually referred to as a software agent.

6. Formal metadata: The International Organization for Standardization has created a standard way of defining metadata tags: the ISO-11179 specification (see Glossary items, ISO, ANSI, American National Standards Institute).
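Returning to the namespace example in item 2, here is a hedged sketch; the namespace URLs are invented placeholders, and Python's xml.etree module expands each prefixed tag into its full namespace form, so the three kinds of "date" remain distinct:

import xml.etree.ElementTree as ET

# The xmlns declarations bind each prefix to a (placeholder) namespace URL
xml_doc = """<dates xmlns:calendar="http://example.com/calendar"
                    xmlns:agriculture="http://example.com/agriculture"
                    xmlns:social="http://example.com/social">
  <calendar:date>June 16, 1904</calendar:date>
  <agriculture:date>Thoory</agriculture:date>
  <social:date>Pyramus and Thisbe</social:date>
</dates>"""

root = ET.fromstring(xml_doc)
for element in root:
    print(element.tag, "->", element.text)
# {http://example.com/calendar}date -> June 16, 1904
# {http://example.com/agriculture}date -> Thoory
# {http://example.com/social}date -> Pyramus and Thisbe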

2.6 Specifications Good, Standards Bad

The nice thing about standards is that you have so many to choose from.

Andrew S. Tanenbaum

Data standards are the false gods of informatics. They promise miracles, but they can't deliver. The biggest drawback of standards is that they change all the time. If you take the time to read some of the computer literature from the 1970s or 1980s, you will come across the names of standards that have long since fallen into well-deserved obscurity. You may find that the computer literature from the 1970s is nearly impossible to read with any level of comprehension, due to the large number of now-obsolete standards-related acronyms scattered through every page. Today's eternal standard is tomorrow's indecipherable gibberish.7

The Open Systems Interconnection (OSI) was an internet protocol created in 1977 with approval from the International Organization for Standardization. It has been supplanted by TCP/IP, the protocol that everyone uses today. A handful of programming languages has been recognized as standards by the American National Standards Institute. These include Basic, C, Ada, and Mumps. Basic and C are still popular languages. Ada, recommended by the Federal Government back in 1995 for all high-performance software applications, is virtually forgotten.13 Mumps is still in use, particularly in hospital information systems, but it changed its name to M, lost its allure to a new generation of programmers, and now comes in various implementations that may not strictly conform to the original standard.

As a standard matures, it often becomes hopelessly complex. As the complexity becomes unmanageable, those who profess to use the standard may develop their own idiosyncratic implementations. Organizations that produce standards seldom provide a mechanism to ensure that the standard is implemented correctly. Standards have long been plagued by noncompliance or (more frequently) under-compliance. Over time, so-called standard-compliant systems tend to become incompatible with one another. The net result is that legacy data, purported to conform to a standard format, is no longer understandable (see Glossary items, Legacy data, Dark data).

Malcolm Duncan has posted an insightful and funny essay entitled "The Chocolate Teapot (version 2.3)."14 In this essay, he shows how new versions of standards may unintentionally alter the meanings of classes of terms contained in earlier versions, making it impossible to compare or sensibly aggregate and interpret terms and concepts contained in any of the versions.7

Suppose you have a cooking-ware terminology with a "teapot" item. Version 1 of the nomenclature may list only one teapot material, porcelain, and only two permissible teapot colors, blue or white. Version 2 of the terminology might accommodate two teapot subtypes: blue teapot and white teapot (ie, in version 2, blue and white are subtypes of teapot, not colors of teapot). If a teapot were neither blue nor white, it would be coded under the parent term, "teapot." Suppose version 3 accommodates some new additions to the teapot pantheon: chocolate teapot, ornamental teapot, china teapot, and industrial teapot. Now the teapot world is shaken by a tempest of monumental proportions. The white and the blue teapots had been implicitly considered to be made of porcelain, like all china teapots. How does one deal with a white teapot that is not porcelain or a porcelain teapot that is a china teapot? If we had previously assumed that a teapot was an item in which tea is made, how do we adjust, conceptually, to the new term "ornamental teapot?" If the teapot is ornamental, then it has no tea-making functionality, and if it cannot be used to make tea, how can it be a teapot? Must we change our concept of the teapot to include anything that looks like a teapot? If so, how can we deal with the new term "industrial teapot," which is likely to be a big stainless steel vat that has more in common, structurally, with a microbrewery fermenter than with an ornamental teapot? What is the meaning of a chocolate teapot? Is it something made of chocolate, is it chocolate-colored, or does it brew chocolate-flavored tea? Suddenly we have lost the ability to map terms in version 3 to terms in versions 1 and 2 (see Glossary item, Nomenclature mapping). We no longer understand the classes of objects (ie, teapots) in the various versions of our cookware nomenclature. We cannot unambiguously attach nomenclature terms to objects in our data collection (eg, blue china teapot). We no longer have a precise definition of a teapot or of the subtypes of teapot.

Regarding versioning, it is a very good rule of thumb that when you encounter a standard whose name includes a version number (eg, International Classification of Diseases-10 or Diagnostic and Statistical Manual of Mental Disorders-5), you can be certain that the standard is unstable, and must be continually revised. Some continuously revised standards cling tenaciously to life, when they really deserve to die. In some cases, a poor standard is kept alive indefinitely by influential leaders in their fields, or by entities who have an economic stake in perpetuating the standard.

Raymond Kammer, then Director of the U.S. National Institute of Standards and Technology, understood the downside of standards. In a year 2000 government report, he wrote that "the consequences of standards can be negative. For example, companies and nations can use standards to disadvantage competitors. Embodied in national regulations, standards can be crafted to impede export access, sometimes necessitating excessive testing and even redesigns of products. A 1999 survey by the National Association of Manufacturers reported that about half of U.S. small manufacturers find international standards or product certification requirements to be barriers to trade. And according to the Transatlantic Business Dialogue, differing requirements add more than 10% to the cost of car design and development."15

As it happens, data standards are seldom, if ever, implemented properly. In some cases, the standards are simply too complex to comprehend. Try as they might, implementers of a complex standard inevitably produce somewhat idiosyncratic renditions. Consequently, no two implementations of a complex data standard are equivalent to one another. In many cases, corporations and government agencies will purposefully veer from the standard to accommodate some local exigency. In some cases, a corporation may find it prudent to include nonstandard embellishments to a standard to create products or functionalities that cannot be easily reproduced by their competitors. In such cases, customers accustomed to a particular manufacturer's rendition of a standard may find it impossible to switch providers (see Glossary item, Lock in).

The process of developing new standards is costly. Interested parties must send representatives to many meetings. In the case of international standards, meetings occur in locations throughout the planet. Someone must pay for the expertise required to develop the standard, improve drafts, and vet the final version. Standards development agencies become involved in the process, and the final product must be approved by one of the agencies that confer final approval. After a standard is approved, it must be accepted by its intended community of users. Educating a community in the use of a standard is another expense. In some cases, an approved standard never gains traction. Because standards cost a great deal of money to develop, it is only natural that corporate sponsors play a major role in the development and deployment of new standards. Software vendors are clever and have learned to benefit from the standards-making process. In some cases, members of a standards committee may knowingly insert a fragment of their own patented property into the standard. After the standard is released and implemented, in many different vendor systems, the patent holder rises to assert the hidden patent. In this case, all those who implemented the standard may find themselves required to pay a royalty for the use of intellectual property sequestered within the standard (see Glossary items, Patent farming, Intellectual property).1

Savvy standards committees take measures to reduce patent farming. Such measures may take the form of agreements, signed by all members of the standards committee, to refrain from asserting patent claims on the standard. There are several ways to circumvent and undermine these agreements. If a corporation holds patents on components of a standard, the corporation can sell its patents to a third party. The third party would be a so-called patent holding company that buys patents in selected technologies with the intention of eventually asserting patents over an array of related activities.16 If the patent holder asserts the patent, the corporation might profit from patent farming, through its sale of the patent, without actually breaking the agreement (see Glossary item, Patent farming).

Corporations can profit from standards indirectly, by obtaining patents on the uses of the standard; not on the standard itself. For example, an open standard may have been created that can be obtained at no cost, and that is popular among its intended users, and that contains no hidden intellectual property (see Glossary items, Open standard, Intellectual property). An interested corporation or individual may discover a novel use for the standard. The corporation or individual can patent the use of the standard, without needing to patent the standard itself. The patent holder will have the legal right to assert the patent over anyone who uses the standard for the purpose claimed by the patent. This patent protection will apply even when the standard is free and open.1

Despite the problems inherent in standards, government committees cling to standards as the best way to share data. The perception is that in the absence of standards, the necessary activities of data sharing, data verification, data analysis, and any meaningful validation of the conclusions will be impossible to achieve.17 This long-held perception may not be true. Data standards, intended to simplify our ability to understand and share data, may have increased the complexity of data science. As each new standard is born, our ability to understand our data seems to diminish. Luckily, many of the problems produced by the proliferation of data standards can be avoided by switching to a data annotation technique broadly known as "specification." Although the terms "specification" and "standard" are used interchangeably, by the incognoscenti, the two terms are quite different from one another. A specification is a formal way of describing data. A standard is a set of requirements, created by a standards development organization, that comprise a predetermined content and format for a set of data (see Glossary items, Specification, Specification versus standard).

A specification is an accepted method for describing objects (physical objects such as nuts and bolts; or symbolic objects, such as numbers; or concepts expressed as text). In general, specifications do not require explicit items of information (ie, they do not impose restrictions on the content that is included in or excluded from documents), and specifications do not impose any order of appearance of the data contained in the document (ie, you can mix up and rearrange the data records in a specification if you like). Specifications are not generally certified by a standards organization. Examples of specifications are RDF (Resource Description Framework), produced by the W3C (World Wide Web Consortium), and TCP/IP (Transmission Control Protocol/Internet Protocol), maintained by the Internet Engineering Task Force. The most widely implemented specifications are simple; thus, easily adaptable.

Specifications, unlike standards, do not tell you what data must be included in a document. A specification simply provides a uniform way of representing the information you choose to include. Some of the most useful and popular specifications are XML, RDF, Notation 3, and Turtle (see Glossary items, XML, RDF, Notation 3, Turtle). Specifications are not typically certified by a standards organization; they are developed by special interest groups, and their legitimacy depends on their popularity.

Files that comply with a specification can be parsed and manipulated by generalized software designed to parse the markup language of the specification (eg, XML, RDF) and to organize the data into data structures defined within the file.
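
As a quick illustration of this point, here is a short Python sketch (not taken from any particular specification document; the tag names and data values are hypothetical) that uses a general-purpose XML parser to read a small XML string and recover its metadata/data pairs. Any well-formed XML file can be handled in the same way, regardless of its content.

#!/usr/local/bin/python

import xml.etree.ElementTree as ET

xml_string = "<patients><patient><name>Claude Funston</name><height_in_inches>71</height_in_inches></patient></patients>"

root = ET.fromstring(xml_string)

for patient in root.findall("patient"):

    print(patient.find("name").text + ", " + patient.find("height_in_inches").text + " inches")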

Specifications serve most of the purposes of a standard, plus providing many important functions that standards typically lack (eg, full data description, data exchange across diverse types of data sets, data merging, and semantic logic) (see Glossary item, Semantics). Data specifications spare us most of the heavy baggage that comes with a standard, which includes: limited flexibility to include changing diverse data objects, locked-in data descriptors, licensing and other intellectual property issues, competition among standards that compete within the same data domain, and bureaucratic overhead (see Glossary item, Intellectual property).7

Most importantly, specifications make standards fungible. A good specification can be ported into a data standard, and a reasonably good data standard can be ported into a specification. For example, there are dozens of image formats (eg, jpeg, png, gif, tiff). Although many of these formats did not arise through a formal standards development process, they are used by billions of individuals and have achieved the status of de facto standards. For most of us, the selection of any particular image format is inconsequential. Data scientists have access to robust image software that will convert images from one format to another.
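
As a minimal sketch of such a conversion (assuming that the open source Pillow imaging library is installed for Python, and using hypothetical file names), the following lines read an image in one format and write it out in another:

#!/usr/local/bin/python

from PIL import Image

img = Image.open("diagram.png")

img.convert("RGB").save("diagram.jpg")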

The most common mistake committed by data scientists is to convert legacy data (ie, old data incompatible with the data held in a newer information system) directly into a contemporary standard, and to use analytic software that is designed to operate exclusively upon the chosen standard (see Glossary item, Legacy data). Doing so only serves to perpetuate your legacy-related frustrations. You can be certain that your data standard and your software application will be unsuitable for the next generation of data scientists. It makes much more sense to port legacy data into a general specification, from which data can be ported to any current or future data standard.

Open Source Tools

Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems.

Jamie Zawinski

ASCII

ASCII is the American Standard Code for Information Interchange, ISO-14962-1997. The ASCII standard is a way of assigning specific 8-bit strings (a string of 0s and 1s of length 8) to the alphanumeric characters and punctuation. Uppercase letters are assigned a different string of 0s and 1s than their corresponding lowercase letters. There are 256 ways of combining 0s and 1s in strings of length 8, and this means there are 256 different ASCII characters.

The familiar keyboard keys produce ASCII characters that happen to occupy ASCII values under 128. Hence, alphanumerics and common punctuation are represented as 8 bits, with the first bit, "0," serving as padding; and keyboard characters are commonly referred to as 7-bit ASCII. Files composed of common keyboard characters are commonly referred to as plain-text files or as 7-bit ASCII files. Applications that display and edit unformatted, 7-bit ASCII files are referred to as text editors. For some uses, the 256 ASCII character limit is too constraining. There are many languages in the world, with their own alphabets or with their own versions of the Romanized alphabet. Consequently, a new character code (Unicode) has been designed as an expansion of ASCII. To maintain facile software conversion from ASCII to Unicode, ASCII is embedded in the Unicode standard (see Glossary items, Text editor, Plain-text).
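
As a small illustrative sketch (the file name is hypothetical), the following Python lines test whether a file confines itself to the 7-bit range, by checking that every byte value falls under 128:

#!/usr/local/bin/python

in_file = open("myfile.txt", "rb")

data = in_file.read()

in_file.close()

if all(byte < 128 for byte in bytearray(data)):

    print("7-bit ASCII (plain-text) file")

else:

    print("file contains byte values of 128 or higher")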

Regular Expressions

Every graduate student working through any of the fields of science anticipates a future life involved with the deep analysis of data. This dream can come true, to some extent, but the majority of every scientist's time is devoted to the dirty job of structuring data into a useful format. I feel justified in referring to this activity as "dirty," because most of the terms that describe this activity convey the notion that data needs to be cleansed or rehabilitated prior to use (see Glossary items, Data cleaning, Data flattening, Data fusion, Data merging, Data mining, Data munging, Data reduction, Data scraping, Data scrubbing, Data wrangling). The common denominator for data restructuring is pattern matching: finding a pattern in the data that requires substitution, transformation, or relocation. Hence, every serious data scientist must master regular expressions, the universal syntax for expressing string patterns.

Regular expressions, commonly referred to as Regex, constitute the standard pattern-matching syntax used in most modern programming languages, including Perl, Python, and Ruby. Some word processing applications, such as OpenOffice, also support Regex string searches.

Here are the basic pattern notations used in Regex:

 g Match globally, (find all occurrences).

 i Do case-insensitive pattern matching.

 m Treat string as multiple lines.

 o Compile pattern only once.

 s Treat string as single line.

 x Use extended regular expressions.

 ^ Match the beginning of the line.

 . Match any character (except newline).

 $ Match the end of the line (or before newline at the end).

 | Alternation.

  ()  Grouping.

  []  Character class.

 * Match 0 or more times.

 + Match 1 or more times.

 ? Match 1 or 0 times.

 {n} Match exactly n times.

 {n,} Match at least n times.

 {n,m} Match at least n but not more than m times.

 \n  Match a newline (LF, NL).

 \W  Match a non-word character.

 \s  Match a whitespace character.

 \S  Match a non-whitespace character.

 \d  Match a digit character.

 \D  Match a non-digit character.

Perl, Ruby, and Python each have the equivalent of a substitution operator that looks for a specific pattern match, within a string, and, finding a match, substitutes another string at the location of the matching pattern.

In Perl, the substitution operator syntax looks like this:

 $string =~ s/<pattern that you match>/<replacement pattern>/options;

Here is a Perl snippet demonstrating Regex substitution:

 $string =~ s/\n+/ /g;

 $string =~ s/([^A-Z]+\.[ ]{1,2})([A-Z])/$1\n$2/g;

This short snippet uses Perl's substitution operator on a string, to do the following:

1. The first command of the snippet substitutes a space character for every occurrence in the string of one or more carriage returns. By removing all of the carriage returns in the string, the command produces an output string that occupies a single line.

2. The second command looks through the string for a succession of characters that are not uppercase letters of the alphabet. If such a substring is followed by a period, followed by one or two spaces, followed by an uppercase letter of the alphabet, it will make a substitution.

3. The substitution consists of the text captured by the first parenthesized group in the search pattern (ie, the string of characters that are not uppercase letters of the alphabet, followed by a period, followed by one or two spaces), followed by a carriage return, followed by the uppercase letter captured by the second parenthesized group.

4. The second command will continue searching the string for additional pattern matches, making appropriate substitutions for every matching pattern, until it exhausts the original string. The string to be searched can comfortably hold the length of an entire book; string length is limited only by your computer's RAM.

What did this complex and seemingly pointless task accomplish? Roughly, the two-line snippet is a sentence parser that transforms a plain-text string into a series of lines, each line composed of a sentence. If you do not immediately see how this is accomplished, here is a hint. Nearly every sentence ends with a non-uppercase character, followed by a period, followed by one or two spaces, followed by the first letter of the next sentence, an uppercase character.

For the uninitiated, Regex can be daunting. The florid inscrutability of Regex expressions, found in virtually every Perl script, contributed greatly to the somewhat cynical assessment, attributed to Keith Bostic, that Perl looks the same before and after encryption (see Glossary item, Encryption). Nonetheless, Regex is an absolute necessity for any data scientist who needs to extract or transform data (ie, every data scientist). Regular expression patterns are identical from language to language. However, the syntax of commands that operate with regular expressions (eg, string substitutions, extractions, and other actions that occur at the location of a pattern match) will vary somewhat among the different programming languages that support Regex.
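
To illustrate the point, here is a rough Python rendition of the same two-step sentence parser (a sketch, with a made-up input string); the patterns are unchanged, but the substitution command is Python's re.sub function rather than Perl's s/// operator.

#!/usr/local/bin/python

import re

text = "I like Perl. I also like Python. Ruby is fine, too."

text = re.sub(r'\n+', ' ', text)

text = re.sub(r'([^A-Z]+\.[ ]{1,2})([A-Z])', r'\1\n\2', text)

print(text)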

As another example of the power of Regex, here is a short Perl script that parses through a plain-text book, extracting any text that matches a pattern commonly encountered by given name followed by a family name.

#!/usr/local/bin/perl

undef($/);

open (TEXT, "english_lit.txt"); 

open (OUT, ">english_lit_names.txt");

$line = <TEXT>;

close TEXT;

while ($line =~ /[A-Z][a-z]+[ ]{1}[A-Z][a-z]+/g)

 {

 $name = $&;

 $name =~ s/\n/ /;

 next if ($name =~ /^The/);

 next if ($name =~ /^In/);

 next if ($name =~ /^Of/);

 next if ($name !~ /[A-Z][a-z]+/);

 $namelist{$name} = "";

 }

while ($line =~ /[A-Z][a-z]+[, ]+[A-Z][,. ]*[A-Z]?[,. ]*[A-Z]?[,. ]*/g)

 {

 $name = $&;

 next if ($name =~ /^The/);

 next if ($name =~ /^In/);

 $name =~ s/[,.\n]//g;

 next if ($name =~ / {3,}/);

 $name =~ s/ +$//;

 next if ($name !~ /[A-Z][a-z]+/);

 $namelist{$name} = "";

 }

print OUT join("\n", sort(keys(%namelist)));

close OUT;

system 'notepad english_lit_names.txt';

exit;

This short script will parse a full-length book almost instantly, producing a long list of alphabetized two-word terms, most of which are legitimate names:

Short sample of output

Abbess Hilda

About Project

Abraham Cowley

Abraham Cowper

Abt Vogler

Academy Classics

Adam Bede

Adam Smith

Adelaide Procter

Adelaide Witham

Ado About

After Alfred

Again Beowulf

Albion Series

Aldine Edition

Aldine Poets

Aldine Series

Alexander Pope

Alexander Selkirk

Alfred Lord

Alfred Tennyson

Algernon Charles

Alice Brown

All Delight

Alloway Kirk

Although Bacon

American Book

American Indians

American Revolution

American Taxation

Amerigo Vespucci

Among Browning

Among Coleridge

Among My

Among Ruskin

Amos Barton

You can see that the script is not perfect. The output includes two-word strings that are not names (eg, Ado About, Among My), and it undoubtedly excludes legitimate names (eg, ee cummings, Madonna). Still, it's a start. In Section 3.3 of Chapter 3, we will see how we can add a few lines of code and transform this script into an indexing application.
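
For readers who prefer Python, here is a sketch of the first name-matching pattern, rendered with Python's re module (an approximation, not a line-for-line translation of the Perl script; the input file is the same hypothetical english_lit.txt):

#!/usr/local/bin/python

import re

in_file = open("english_lit.txt")

text = in_file.read()

in_file.close()

candidates = set(re.findall(r'[A-Z][a-z]+ [A-Z][a-z]+', text))

names = sorted(c for c in candidates if not c.startswith(("The ", "In ", "Of ")))

print("\n".join(names))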

Format Commands

Format commands are short routines, built into many programming languages, that display variables as neat columns of padded or justified words and numbers. Format commands have equivalent functionality in most higher-level programming languages (eg, Python, Ruby, Perl, C, Java). The generic name of the format command is usually "printf." The syntax of the printf command involves assigning a field specifier for each field or column, followed by the list of the variables that will fill the specified fields. Each field specifier consists of a percent sign, followed by an integer that specifies the size of the column, followed by a dot, followed by an integer specifying the maximum size of the element to be placed in the field, followed by a letter indicating the type of element (eg, string, decimal integer, etc.). The following list of specifiers constitutes the most common instructions used in printf statements.18

Here is a list of printf field specifiers.

%% a percent sign

%c a character with the given number

%s a string

%d a signed integer, in decimal

%u an unsigned integer, in decimal

%o an unsigned integer, in octal

%x an unsigned integer, in hexadecimal

%e a floating-point number, in scientific notation

%f a floating-point number, in fixed decimal notation

%g a floating-point number, in %e or %f notation

Here is an example of a Perl one-liner, launched from Perl's debugger environment (see Open Source Tools for Chapter 1, Interactive Line Interpreters for Perl, Python, and Ruby).

DB<1> printf("%-10.10s %0.1u %7.6u %4.3u\n", "hello", 3, 28, 15);

The one-liner Perl command saves 10 spaces for a string, printing the string from left to right, beginning with the first space. Following the reserved 10 spaces is a single-digit unsigned integer, then a field of 7 spaces holding a 6-digit unsigned integer, then a field of 4 spaces holding a 3-digit unsigned integer, then a carriage-return character.

Here is the output, from the Perl debugger:

hello      3  000028  015

Strictly speaking, Python has no printf function. It uses the % operator instead, but it serves the same purpose and uses an equivalent syntax.

Here is a Python one-liner launched from the Python interpreter:

>>> "%-20.20s %8.06d" % ("hello", 35)

The Python code reserves 20 spaces for a string of characters, printed beginning at the first space. Following that, 8 spaces are reserved for an integer, printed as 6 digits, padded with zeros if the supplied integer is smaller than 6 digits in length. Here is the output:

'hello                  000035'

Here is an example of the printf command, used in the Ruby script, printf_ruby.rb:

#!/usr/local/bin/ruby

freq = Hash.new(0)

my_string = "Peter Piper picked a peck of pickled peppers.

  A peck of pickled peppers Peter Piper picked.

  If Peter Piper picked a peck of pickled peppers,

  Where is the peck of pickled peppers that Peter 

  Piper picked?"

my_string.downcase.scan(/\w+/){|word| freq[word] = freq[word]+1}

freq.sort_by {|k, v| v}.reverse.each {|k,v| printf "%-10.10s %0.2u\n", k, v}

exit

Here is the output of the printf_ruby.rb script.

c:\ftp>printf_ruby.rb

peter      04

piper      04

picked     04

peck       04

of         04

peppers    04

pickled    04

a          03

if         01

where      01

is         01

the        01

that       01

In this example, the printf command tells the interpreter of the programming language to expect a character string followed by an integer; and that the character string should start at the left-most space, padded out to fill 10 spaces, to be followed immediately by a 2 digit integer, indicating the number of occurrences of the word, and a newline character. The comma-separated parameters that follow (ie, k and v) supply the interpreter with the string and the number that will be used in the execution of the printf command.

Converting Nonprintable Files to Plain-Text

Removing all the nonprintable characters from a file is a blunt and irreversible tactic. Sometimes, it suffices to transform a file into printable characters, from which you can reconstitute the original file, if needed. The Base64 algorithm reads binary files 6 bits at a time (instead of the 8-bit read for ASCII streams). Whereas an 8-bit chunk corresponds to a base 256 number (ie, 2 to the 8th power), a 6-bit chunk corresponds to a base 64 number (ie, 2 to the 6th power). The 6-bit chunks are assigned to a range of printable ASCII characters. Binary files that are transformed to Base64 can be reverse-transformed back to the original file.
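
To see why three 8-bit bytes always yield four printable characters, here is a small Python illustration (a sketch; the three-byte sample is arbitrary) that regroups 24 bits into four 6-bit chunks, looks each chunk up in the 64-character Base64 alphabet, and then confirms the result with Python's base64 module:

#!/usr/local/bin/python

import base64

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

three_bytes = bytearray(b"Man")

bits = "".join(format(byte, "08b") for byte in three_bytes)

chunks = [bits[i:i + 6] for i in range(0, 24, 6)]

print("".join(alphabet[int(chunk, 2)] for chunk in chunks))

print(base64.b64encode(bytes(three_bytes)).decode())

Both print statements produce the same four characters, "TWFu."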

Here is a short Ruby script, base64_ruby.rb, that takes a sentence, encodes it to Base64, and reverses the process to produce the original text.

#!/usr/local/bin/ruby

require 'base64'

data = "Time flies like an arrow. Fruit flies like a banana."

base64_transformed_data = Base64.encode64(data)

puts base64_transformed_data

base64_reverse = Base64.decode64(base64_transformed_data)

puts base64_reverse

exit

Output:

c:\ftp>base64_ruby.rb

VGltZSBmbGllcyBsaWtlIGFuIGFycm93LiAgRnJ1aXQgZmxpZXMgbGlrZSBh

IGJhbmFuYS4=

Time flies like an arrow. Fruit flies like a banana.

Here is a Perl script, base64_perl, that will take any file, ASCII or binary, and produce a printable Base64 translation. In the example below, the script encodes the Gettysburg address into Base64 text.

#!/usr/local/bin/perl

use MIME::Base64;

open (TEXT,"gettysbu.txt");

binmode TEXT;

$/ = undef;

$string = <TEXT>;

close TEXT;

$encoded = encode_base64($string);

print $encoded;

$decoded = decode_base64($encoded);

print " $decoded";

exit;

Here are the first few lines of output from the base64_perl script:

c:\ftp>base64_perl.pl

Rm91ciBzY29yZSBhbmQgc2V2ZW4geWVhcnMgYWdvIG91ciBmYXRoZXJzIGJyb3VnaHQgZm9ydGgg

b24gdGhpcw0KY29udGluZW50IGEgbmV3IG5hdGlvbiwgY29uY2VpdmVkIGluIGxpYmVydHkgYW5k

Here is the equivalent script, in Python19:

#!/usr/local/bin/python

import base64

sample_file = open ("gettysbu.txt", "rb")

string = sample_file.read()

sample_file.close()

print base64.encodestring(string)

print base64.decodestring(base64.encodestring(string))

exit

Files in the Adobe Acrobat Portable Document Format (ie, the ubiquitous PDF files) are unsuitable for text processing algorithms; they contain obscure formatting instructions, in a format that is not plain ASCII. Version 11 of Adobe Reader has a built-in pdf-to-text conversion tool, but the Adobe tool does not provide the power and flexibility of a utility that can be called from a script. A script that contains a system call to a utility can convert input files from multiple storage locations, and seamlessly integrate the returned output into subsequent script commands (see Glossary item, System call).

For rugged data simplification efforts, you may want to use the Xpdf utility. Xpdf is open source software that includes a PDF text extractor (pdftotext.exe) and a number of related utilities that operate on PDF files. Xpdf runs under multiple operating systems. The suite of Xpdf executables can be downloaded at: http://www.foolabs.com/xpdf/download.html

Once downloaded, the pdftotext utility can be called from the subdirectory in which it resides.

c:\ftp\xpdf>pdftotext zerolet.pdf

Here is a short Perl script that produces a text file for each of the .pdf files in a subdirectory. The script is intended to be launched from the same directory in which the pdftotext.exe file resides.

#!/usr/local/bin/perl

opendir(XPDF_SUBDIR, ".") || die ("Unable to open directory");

@from_files = readdir(XPDF_SUBDIR);

closedir(XPDF_SUBDIR);

foreach $filename (@from_files)

 {

 if ($filename =~ /\.pdf/)

  {

  system("pdftotext.exe $filename");

  }

 }

exit;

Here is the equivalent Python script:

#!/usr/local/bin/python

import os, re, string

filelist = os.listdir(".")

for file in filelist:

 if ".pdf" in file: 

  command = "pdftotext.exe " + file 

  os.system(command);

exit

Here is the equivalent Ruby script:

#!/usr/local/bin/ruby

filelist = Dir.glob("*.pdf")

filelist.each do

 |filename|

 system("pdftotext.exe #{filename}")

 end

exit

Dublin Core

The specification for general file descriptors is the Dublin Core.20 This set of about 15 common data elements, developed by a committee of librarians, specifies the header information in electronic files and documents. The Dublin Core elements include such information as the date that the file was created, the name of the entity that created the file, and a general comment on the contents of the file. The Dublin Core elements aid in indexing and retrieving electronic files and should be included in every electronic document. The Dublin Core metadata specification is found at: http://dublincore.org/documents/dces/

Some of the most useful Dublin Core elements are1:

Contributor the entity that contributes to the document

Coverage the general area of information covered in the document

Creator entity primarily responsible for creating the document

Date a time associated with an event relevant to the document

Description description of the document

Format file format

Identifier a character string that uniquely and unambiguously identifies the document

Language the language of the document

Publisher the entity that makes the resource available

Relation a pointer to another, related document, typically the identifier of the related document

Rights the property rights that apply to the document

Source an identifier linking to another document from which the current document was derived

Subject the topic of the document

Title title of the document

Type genre of the document

An XML syntax for expressing the Dublin Core elements is available.21,22
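
As a short, hypothetical example of how such a header might look (the element values below are invented, and the "dc:" prefix follows the customary Dublin Core namespace convention), a few lines of Python suffice to write a minimal Dublin Core block to a file:

#!/usr/local/bin/python

header = '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">\n'

header += '  <dc:title>Sample Document</dc:title>\n'

header += '  <dc:creator>J. Q. Public</dc:creator>\n'

header += '  <dc:date>2015-06-01</dc:date>\n'

header += '  <dc:format>text/plain</dc:format>\n'

header += '</metadata>\n'

out_file = open("dublin_core_header.xml", "w")

out_file.write(header)

out_file.close()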

Glossary

ANSI The American National Standards Institute (ANSI) accredits standards developing organizations to create American National Standards (ANS). A so-called ANSI standard is produced when an ANSI-accredited standards development organization follows ANSI procedures and receives confirmation from ANSI that all the procedures were followed. ANSI coordinates efforts to gain international standards certification from the ISO (International Standards Organization) or the IEC (International Electrotechnical Commission). ANSI works with hundreds of ANSI-accredited standards developers.

ASCII ASCII is the American Standard Code for Information Interchange, ISO-14962-1997. The ASCII standard is a way of assigning specific 8-bit strings (a string of 0s and 1s of length 8) to the alphanumeric characters and punctuation. Uppercase letters are assigned a different string of 0s and 1s than their matching lowercase letters. There are 256 ways of combining 0s and 1s in strings of length 8. This means that there are 256 different ASCII characters, and every ASCII character can be assigned a number-equivalent, in the range of 0 to 255. The familiar keyboard keys produce ASCII characters that happen to occupy ASCII values under 128. Hence, alphanumerics and common punctuation are represented as 8-bits, with the first bit, "0," serving as padding. Hence, keyboard characters are commonly referred to as 7-bit ASCII. These are the classic ASCII characters:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Files composed exclusively of common keyboard characters are commonly referred to as plain-text files or as 7-bit ASCII files. See Text editor. See Plain-text. See ISO.

American National Standards Institute See ANSI.

Autovivification In programming, autovivification is a feature of some programming languages wherein a variable or data structure seemingly brings itself into life, without definition or declaration, at the moment when its name first appears in a program. The programming language automatically registers the variable and endows it with a set of properties consistent with its type, as determined by the context, within the program. Perl supports autovivification. Python and Ruby, under most circumstances, do not. In the case of Ruby, new class objects (ie, instances of the class) are formally declared and created, by sending the "new" method to the class assigned to the newly declared object. See Reification.

Dark data Unstructured and ignored legacy data, presumed to account for most of the data in the "infoverse." The term gets its name from "dark matter" which is the invisible stuff that accounts for most of the gravitational attraction in the physical universe.

Data cleaning More correctly, data cleansing; synonymous with data fixing or data correcting. Data cleaning is the process by which errors, spurious anomalies, and missing values are somehow handled. The options for data cleaning are: correcting the error, deleting the error, leaving the error unchanged, or imputing a value.23 Data cleaning should not be confused with data scrubbing. See Data scrubbing.

Data flattening In the field of informatics, data flattening is a popular but ultimately counterproductive method of data organization and data reduction. Data flattening involves removing data annotations that are deemed unnecessary for the intended purposes of the data (eg, timestamps, field designators, identifiers for individual data objects referenced within a document). Data flattening makes it difficult or impossible to verify data or to discover relationships among data objects. A detailed discussion of the topic is found in Section 4.5, "Reducing Data." See Verification. See Data repurposing. See Pseudosimplification.

Data fusion Data fusion is very closely related to data integration. The subtle difference between the two concepts lies in the end result. Data fusion creates a new and accurate set of data representing the combined data sources. Data integration is an on-the-fly usage of data pulled from different domains and, as such, does not yield a residual fused set of data.

Data merging A nonspecific term that includes data fusion, data integration, and any methods that facilitate the accrual of data derived from multiple sources. See Data fusion.

Data mining Alternate form, datamining. The term "data mining" is closely related to "data repurposing" and both endeavors employ many of the same techniques. Accordingly, the same data scientists engaged in data mining efforts are likely to be involved in data repurposing efforts. In data mining, the data, and the expected purpose of the data, are typically provided to the data miner, often in a form suited for analysis. In data repurposing projects, the data scientists are expected to find unintended purposes for the data, and the job typically involves transforming the data into a form suitable for its new purpose.

Data munging Refers to a multitude of tasks involved in preparing data for some intended purpose (eg, data cleaning, data scrubbing, data transformation). Synonymous with data wrangling.

Data object A data object is whatever is being described by the data. For example, if the data is "6 feet tall," then the data object is the person or thing to which "6 feet tall" applies. Minimally, a data object is a metadata/data pair, assigned to a unique identifier (ie, a triple). In practice, the most common data objects are simple data records, corresponding to a row in a spreadsheet or a line in a flat-file. Data objects in object-oriented programming languages typically encapsulate several items of data, including an object name, an object unique identifier, multiple data/metadata pairs, and the name of the object's class. See Triple. See Identifier. See Metadata.

Data reduction When a very large data set is analyzed, it may be impractical or counterproductive to work with every element of the collected data. In such cases, the data analyst may choose to eliminate some of the data, or develop methods whereby the data is approximated. Some data scientists reserve the term "data reduction" for methods that reduce the dimensionality of multivariate data sets.

Data repurposing Involves using old data in new ways, that were not foreseen by the people who originally collected the data. Data repurposing comes in the following categories: (1) Using the preexisting data to ask and answer questions that were not contemplated by the people who designed and collected the data; (2) Combining preexisting data with additional data, of the same kind, to produce aggregate data that suits a new set of questions that could not have been answered with any one of the component data sources; (3) Reanalyzing data to validate assertions, theories, or conclusions drawn from the original studies; (4) Reanalyzing the original data set using alternate or improved methods to attain outcomes of greater precision or reliability than the outcomes produced in the original analysis; (5) Integrating heterogeneous data sets (ie, data sets with seemingly unrelated types of information), for the purpose of answering questions or developing concepts that span diverse scientific disciplines; (6) Finding subsets in a population once thought to be homogeneous; (7) Seeking new relationships among data objects; (8) Creating, on-the-fly, novel data sets through data file linkages; (9) Creating new concepts or ways of thinking about old concepts, based on a re-examination of data; (10) Fine-tuning existing data models; and (11) Starting over and remodeling systems.7 See Heterogeneous data.

Data scraping Pulling together desired sections of a data set or text, using software.

Data scrubbing A term that is very similar to data deidentification and is sometimes used improperly as a synonym for data deidentification. Data scrubbing refers to the removal, from data records, of information that is considered unwanted. This may include identifiers, private information, or any incriminating or otherwise objectionable language contained in data records, as well as any information deemed irrelevant to the purpose served by the record. See Deidentification.

Data wrangling Jargon referring to a multitude of tasks involved in preparing data for eventual analysis. Synonymous with data munging.24

Deidentification The process of removing all of the links in a data record that can connect the information in the record to an individual. This usually includes the record identifier, demographic information (eg, place of birth), personal information (eg, birthdate), biometrics (eg, fingerprints), and so on. The process of deidentification will vary based on the type of records examined. Deidentifying protocols exist wherein deidentified records can be reidentified, when necessary. See Reidentification. See Data scrubbing.

Encryption A common definition of encryption involves an algorithm that takes some text or data and transforms it, bit-by-bit, into an output that cannot be interpreted (ie, from which the contents of the source file cannot be determined). Encryption comes with the implied understanding that there exists some reverse transformation that can be applied to the encrypted data, to reconstitute the original source. As used herein, the definition of encryption is expanded to include any protocols by which files can be shared, in such a way that only the intended recipients can make sense of the received documents. This would include protocols that divide files into pieces that can only be reassembled into the original file using a password. Encryption would also include protocols that alter parts of a file while retaining the original text in other parts of the file. As described in Chapter 5, there are instances when some data in a file should be shared, while only specific parts need to be encrypted. The protocols that accomplish these kinds of file transformations need not always employ classic encryption algorithms. See Winnowing and chaffing.

HTML Hyper Text Markup Language is an ASCII-based set of formatting instructions for Web pages. HTML formatting instructions, known as tags, are embedded in the document, and double-bracketed, indicating the start point and end points for instruction. Here is an example of an HTML tag instructing the Web browser to display the word "Hello" in italics: <i>Hello</i>. All Web browsers conforming to the HTML specification must contain software routines that recognize and implement the HTML instructions embedded within Web documents. In addition to formatting instructions, HTML also includes linkage instructions, in which the Web browsers must retrieve and display a listed Web page, or a Web resource, such as an image. The protocol whereby Web browsers, following HTML instructions, retrieve Web pages from other internet sites, is known as HTTP (HyperText Transfer Protocol).

Heterogeneous data Two sets of data are considered heterogeneous when they are dissimilar to one another, with regard to content, purpose, format, organization, or annotations. One of the purposes of data science is to discover relationships among heterogeneous data sources. For example, epidemiologic data sets may be of service to molecular biologists who have gene sequence data on diverse human populations. The epidemiologic data is likely to contain different types of data values, annotated and formatted in a manner different from the data and annotations in a gene sequence database. The two types of related data, epidemiologic and genetic, have dissimilar content; hence they are heterogeneous to one another.

ISO International Standards Organization. The ISO is a nongovernmental organization that develops international standards (eg, ISO-11179 for metadata and ISO-8601 for date and time). See ANSI.

ISO-11179 The ISO standard for defining metadata, such as XML tags. The standard requires that the definitions for metadata used in XML (the so-called tags) be accessible and should include the following information for each tag: Name (the label assigned to the tag), Identifier (the unique identifier assigned to the tag), Version (the version of the tag), Registration Authority (the entity authorized to register the tag), Language (the language in which the tag is specified), Definition (a statement that clearly represents the concept and essential nature of the tag), Obligation (indicating whether the tag is required), Datatype (indicating the type of data that can be represented in the value enclosed by the tag), Maximum Occurrence (indicating any limit to the number of times that the tag appears in a document), and Comment (a remark describing how the tag might be used).25 See ISO.

Identification The process of providing a data object with an identifier, or the process of distinguishing one data object from all other data objects on the basis of its associated identifier. See Identifier.

Identifier A string that is associated with a particular thing (eg, person, document, transaction, data object), and not associated with any other thing.26 Object identification usually involves permanently assigning a seemingly random sequence of numeric digits (0-9) and alphabet characters (a-z and A-Z) to a data object. A data object can be a specific piece of data (eg, a data record), or an abstraction, such as a class of objects or a number or a string or a variable. See Identification.

Intellectual property Data, software, algorithms, and applications that are created by an entity capable of ownership (eg, humans, corporations, universities). The owner entity holds rights over the manner in which the intellectual property can be used and distributed. Protections for intellectual property may come in the form of copyrights, patents, and laws that apply to theft. Copyright applies to published information. Patents apply to novel processes and inventions. Certain types of intellectual property can only be protected by being secretive. For example, magic tricks cannot be copyrighted or patented; this is why magicians guard their intellectual property against theft. Intellectual property can be sold outright, or used under a legal agreement (eg, license, contract, transfer agreement, royalty, usage fee, and so on). Intellectual property can also be shared freely, while retaining ownership (eg, open source license, GNU license, FOSS license, Creative Commons license).

Introspection A method by which data objects can be interrogated to yield information about themselves (eg, properties, values, and class membership). Through introspection, the relationships among the data objects can be examined. Introspective methods are built into object-oriented languages. The data provided by introspection can be applied, at run-time, to modify a script's operation; a technique known as reflection. Specifically, any properties, methods, and encapsulated data of a data object can be used in the script to modify the script's run-time behavior. See Reflection.

Legacy data Data collected by an information system that has been replaced by a newer system, and which cannot be immediately integrated into the newer system's database. For example, hospitals regularly replace their hospital information systems with new systems that promise greater efficiencies, expanded services, or improved interoperability with other information systems. In many cases, the new system cannot readily integrate the data collected from the older system. The previously collected data becomes a legacy to the new system. In many cases, legacy data is simply "stored" for some arbitrary period of time, in case someone actually needs to retrieve any of the legacy data. After a decade or so, the hospital finds itself without any staff members who are capable of locating the storage site of the legacy data, or moving the data into a modern operating system, or interpreting the stored data, or retrieving appropriate data records, or producing a usable query output.

Line A line in a nonbinary file is a sequence of characters that terminate with an end-of-line character. The end-of-line character may differ among operating systems. For example, the DOS end-of-line character is ASCII 13 (ie, the carriage return character) followed by ASCII 10 (ie, the line feed character), simulating the new line movement in manual typewriters. The Linux end-of-line character is ASCII 10 (ie, the line feed character only). When programming in Perl, Python, or Ruby, the newline character is represented by "\n" regardless of which operating system or file system is used. For most purposes, use of "\n" seamlessly compensates for discrepancies among operating systems with regard to their preferences for end-of-line characters. Binary files, such as image files or telemetry files, have no designated end-of-line characters. When a file is opened as a binary file, any end-of-line characters that happen to be included in the file are simply ignored as such, by the operating system.

Lock in Also appearing as lock-in and, more specifically, as vendor lock-in. Describes the situation when a data manager is dependent on a single manufacturer, supplier, software application, standard, or operational protocol and cannot use alternate sources without violating a license, incurring substantial costs, or suffering great inconvenience. One of the most important precepts of data simplification is to avoid vendor lock-in, whenever possible. Aside from vendor lock-in, data scientists should understand that user communities participate complicitly in efforts to lock in inferior software and standards. After a user has committed time, money, and resources to a particular methodology, or has built professional ties with a community that supports the methodology, he or she will fight tooth and nail to preserve the status quo. It can be very difficult for new methods to replace entrenched methods, and this could be described as user lock-in.

Machine learning Refers to computer systems and software applications that learn or improve as new data is acquired. Examples would include language translation software that improves in accuracy as additional language data is added to the system, and predictive software that improves as more examples are obtained. Machine learning can be applied to search engines, optical character recognition software, speech recognition software, vision software, and neural networks. Machine learning systems are likely to use training data sets and test data sets.

Machine translation Ultimately, the job of machine translation is to translate text from one language into another language. The process of machine translation begins with extracting sentences from text, parsing the words of the sentence into grammatical parts, and arranging the grammatical parts into an order that imposes logical sense on the sentence. Once this is done, each of the parts can be translated by a dictionary that finds equivalent terms in a foreign language, then reassembled as a foreign language sentence by applying grammatical positioning rules appropriate for the target language. Because these steps apply the natural rules for sentence constructions in a foreign language, the process is often referred to as natural language machine translation. It is important to note that nowhere in the process of machine translation is it necessary to find meaning in the source text, or to produce meaning in the output. Machine translation algorithms preserve ambiguities, without attempting to impose a meaningful result.

Meaning In informatics, meaning is achieved when described data is bound to a unique identifier of a data object. "Claude Funston's height is 5 feet 11 inches," comes pretty close to being a meaningful statement. The statement contains data (5 feet 11 inches), and the data is described (height). The described data belongs to a unique object (Claude Funston). Ideally, the name "Claude Funston" should be provided with a unique identifier, to distinguish one instance of Claude Funston from all the other persons who are named Claude Funston. The statement would also benefit from a formal system that ensures that the metadata makes sense (eg, What exactly is height, and does Claude Funston fall into a class of objects for which height is a property?) and that the data is appropriate (eg, Is 5 feet 11 inches an allowable measure of a person's height?). A statement with meaning does not need to be a true statement (eg, The height of Claude Funston was not 5 feet 11 inches when Claude Funston was an infant). See Semantics. See Triple. See RDF.

Metadata The data that describes data. For example, a data element (also known as data point) may consist of the number, "6." The metadata for the data may be the words "Height, in feet." A data element is useless without its metadata, and metadata is useless unless it adequately describes a data element. In XML, the metadata/data annotation comes in the form <metadata tag>data<end of metadata tag> and might look something like:

<weight_in_pounds>150</weight_in_pounds>

In spreadsheets, the data elements are the cells of the spreadsheet. The column headers are the metadata that describe the data values in the column's cells, and the row headers are the record numbers that uniquely identify each record (ie, each row of cells). See XML.

Microarray Also known as gene chip, gene expression array, DNA microarray, or DNA chips. These consist of thousands of small samples of chosen DNA sequences arrayed onto a block of support material (usually, a glass slide). When the array is incubated with a mixture of DNA sequences prepared from cell samples, hybridization will occur between molecules on the array and single stranded complementary (ie, identically sequenced) molecules present in the cell sample. The greater the concentration of complementary molecules in the cell sample, the greater the number of fluorescently tagged hybridized molecules in the array. A specialized instrument prepares an image of the array, and quantifies the fluorescence in each array spot. Spots with high fluorescence indicate relatively large quantities of DNA in the cell sample that match the specific sequence of DNA in the array spot. The data comprising all the fluorescent intensity measurements for every spot in the array produces a gene profile characteristic of the cell sample.

Monospaced font Alternate terms are fixed-pitch, fixed-width, or nonproportional font. Most modern fonts have adjustable spacing, to make the best use of the distance between successive letters and to produce a pleasing look to the printed words. For example, "Te" printed in proportional font might push the "e" under the roof of the top bar of the "T." You will never see the "e" snuggled under the bar of the "T" in a monospaced font. Hence, when using proportional font, the spacings inserted into text will have variable presentations, when displayed on the monitor or when printed on paper. Programmers should use monospaced fonts when composing software; think of the indentations in Python and Ruby and the use of spaces as padding in quoted text and formatting commands. Examples of monospaced fonts include: Courier, Courier New, Lucida Console, Monaco, and Consolas.

Namespace A namespace is the realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata tag "date" may be used to signify a calendar date, or the fruit, or the social engagement. To avoid confusion, metadata terms are assigned a prefix that is associated with a Web document that defines the term (ie, establishes the tag's namespace). In practical terms, a tag that can have different descriptive meanings in different contexts is provided with a prefix that links to a Web document wherein the meaning of the tag, as it applies in the XML document, is specified. An example of namespace syntax is provided in Section 2.5.

Natural language processing A field broadly concerned with how computers interpret human language (ie, machine translation). At its simplest level, this may involve parsing through text, and organizing the grammatical units of individual sentences (ie, tokenization). The grammatical units can be trimmed, reorganized, matched against concept equivalents in a nomenclature or a foreign language dictionary, and reassembled as a simplified, grammatically uniform, output or as a translation into another language.

Nomenclature mapping Specialized nomenclatures employ specific names for concepts that are included in other nomenclatures, under other names. For example, medical specialists often preserve their favored names for concepts that cross into different fields of medicine. The term that pathologists use for a certain benign fibrous tumor of the skin is "fibrous histiocytoma," a term spurned by dermatologists, who prefer to use "dermatofibroma" to describe the same tumor. As another horrifying example, the names for the physiologic responses caused by a reversible cerebral vasoconstrictive event include: thunderclap headache, Call-Fleming syndrome, benign angiopathy of the central nervous system, postpartum angiopathy, migrainous vasospasm, and migraine angiitis. The choice of term will vary depending on the medical specialty of the physician (eg, neurologist, rheumatologist, obstetrician). To mitigate the discord among specialty nomenclatures, lexicographers may undertake a harmonization project, in which nomenclatures with overlapping concepts are mapped to one another.

Notation 3 Also called n3. A syntax for expressing assertions as triples (unique subject + metadata + data). Notation 3 expresses the same information as the more formal RDF syntax, but n3 is easier for humans to read.27 RDF and n3 are interconvertible, and either one can be parsed and equivalently tokenized (ie, broken into elements that can be reorganized in a different format, such as a database record). See RDF. See Triple.

Open standard It is disappointing to admit that many of the standards that apply to data and software are neither free nor open. Standards developing organizations occasionally require the members of their communities to purchase a license as a condition for obtaining, viewing, or using the standard. Such licenses may include encumbrances that impose strict limits on the way that the standard is used, distributed, modified, shared, or transmitted. Restrictions on the distribution of the standard may extend to any and all materials annotated with elements of the standard, and may apply to future versions of the standard. The concept of an open standard, as the name suggests, provides certain liberties, but many standards, even open standards, are encumbered with restrictions that users must abide.

Patent farming The practice of hiding intellectual property within a standard, or method, or device is known as patent farming or patent ambushing.28 After a standard is adopted into general use, the patent farmer announces the presence of his or her hidden patented material and presses for royalties. The patent farmer plants seeds in the standard and harvests his crop when the standard has grown to maturity; a rustic metaphor for some highly sophisticated and cynical behavior.

Plain-text Plain-text refers to character strings or files that are composed of the characters accessible to a typewriter keyboard. These files typically have a ".txt" suffix to their names. Plain-text files are sometimes referred to as 7-bit ASCII files because all of the familiar keyboard characters have ASCII values under 128 (ie, can be designated in binary with just seven 0s and 1s). In practice, plain-text files exclude 7-bit ASCII symbols that do not code for familiar keyboard characters. To further confuse the issue, plain-text files may contain ASCII characters above 7 bits (ie, characters from 128 to 255) that represent characters that are printable on computer monitors, such as accented letters. See ASCII.

Plesionymy Plesionyms are nearly synonymous words, or pairs of words that are sometimes synonymous; other times not. For example, the noun forms of "smell" and "odor" are synonymous. As a verb, "smell" does the job, but "odor" comes up short. You can smell a fish, but you cannot odor a fish. Smell and odor are plesionyms to one another. Plesionymy is another challenge for machine translators.

Polysemy In polysemy, a single word, character string or phrase has more than one meaning. Polysemy can be thought of as the opposite of synonymy, wherein different words mean the same thing. Polysemy is a particularly vexing problem in the realm of medical abbreviations. A single acronym may have literally dozens of possible expansions.8 For example, ms is the abbreviation for manuscript, mass spectrometry, mental status, millisecond, mitral stenosis, morphine sulfate, multiple sclerosis, or the female honorific.

Protocol A set of instructions, policies, or fully-described procedures for accomplishing a service, operation, or task. Data is generated and collected according to protocols. There are protocols for conducting experiments, and there are protocols for measuring the results. There are protocols for choosing the human subjects included in a clinical trial, and there are protocols for interacting with the human subjects during the course of the trial. All network communications are conducted via protocols; the Internet operates under a protocol (TCP-IP, Transmission Control Protocol-Internet Protocol). One of the jobs of the modern data scientist is to create and curate protocols.

Pseudosimplification Refers to any method intended to simplify data that reduces the quality of the data, interferes with introspection, modifies the data without leaving a clear trail connecting the original data to the modified data, eliminates timestamp data, transforms the data in a manner that defies reverse transformation, or that unintentionally increases the complexity of data. See Data flattening.

RDF Resource Description Framework (RDF) is a syntax in XML notation that formally expresses assertions as triples. The RDF triple consists of a uniquely identified subject plus a metadata descriptor for the data plus a data element. Triples are necessary and sufficient to create statements that convey meaning. Triples can be aggregated with other triples from the same data set or from other data sets, so long as each triple pertains to a unique subject that is identified equivalently through the data sets. Enormous data sets of RDF triples can be merged or functionally integrated with other massive or complex data resources. For a detailed discussion see Open Source Tools for Chapter 6, Syntax for Triples. See Notation 3. See Semantics. See Triple. See XML.

Reflection A programming technique wherein a computer program will modify itself, at run-time, based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection (ie, object introspection). If the information indicates that the data object belongs to a particular class of objects, then the program may call a method appropriate for the class. The program executes in a manner determined by descriptive information obtained during run-time; metaphorically reflecting upon the purpose of its computational task. See Introspection.
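
The following Python sketch (the classes and the collection are hypothetical) shows the general idea: at run-time, the program inspects each object and chooses a method appropriate to what the inspection reveals.

  # Reflection sketch: behavior is chosen at run-time, based on what
  # introspection reveals about each object in the collection.
  class TextRecord:
      def __init__(self, text):
          self.text = text
      def word_count(self):
          return len(self.text.split())

  class NumericRecord:
      def __init__(self, values):
          self.values = values

  collection = [TextRecord("four score and seven"), NumericRecord([1, 2, 3])]
  for item in collection:
      if hasattr(item, "word_count"):          # object introspection
          print(type(item).__name__, "words:", item.word_count())
      else:
          print(type(item).__name__, "attributes:", list(vars(item)))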

Reidentification A term casually applied to any instance whereby information can be linked to a specific person, after the links between the information and the person associated with the information have been removed. Used this way, the term reidentification connotes an insufficient deidentification process. In the health care industry, the term "reidentification" means something else entirely. In the U.S., regulations define "reidentification" under the "Standards for Privacy of Individually Identifiable Health Information."29 Therein, reidentification is a legally sanctioned process whereby deidentified records can be linked back to their human subjects, under circumstances deemed legitimate and compelling, by a privacy board. Reidentification is typically accomplished via the use of a confidential list of links between human subject names and deidentified records, held by a trusted party. In the healthcare realm, when a human subject is identified through fraud, trickery, or through the deliberate use of computational methods to break the confidentiality of insufficiently deidentified records (ie, hacking), the term "reidentification" would not apply.1

Reification A programming term that has some similarity to "autovivification." In either case, an abstract piece of a program brings itself to life, at the moment when it is assigned a name. Whereas autovivification generally applies to variables and data structures, reification generally applies to blocks of code, methods, and data objects. When a named block of code becomes reified, it can be invoked anywhere within the program, by its name. See Autovivification.

Reproducibility Reproducibility is achieved when repeat studies produce the same results, over and over. Reproducibility is closely related to validation, which is achieved when you draw the same conclusions, from the data, over and over again. Implicit in the concept of "reproducibility" is that the original research must somehow convey the means by which the study can be reproduced. This usually requires the careful recording of methods, algorithms, and materials. In some cases, reproducibility requires access to the data produced in the original studies. If there is no feasible way for scientists to undertake a reconstruction of the original study, or if the results obtained in the original study cannot be obtained in subsequent attempts, then the study is irreproducible. If the work is reproduced, but the results and the conclusions cannot be repeated, then the study is considered invalid. See Validation. See Verification.

Semantics The study of meaning (Greek root, semantikos, significant meaning). In the context of data science, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (ie, making triples), combining assertions about data objects (ie, merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but most definitions for semantics are functionally equivalent to the definition offered here. Most language is unstructured and meaningless. Consider the assertion: Sam is tired. This is an adequately structured sentence, with a subject, verb, and object. But what is the meaning of the sentence? There are a lot of people named Sam. Which Sam is being referred to in this sentence? What does it mean to say that Sam is tired? Is "tiredness" a constitutive property of Sam, or does it only apply to specific moments? If so, for what moment in time is the assertion "Sam is tired" actually true? To a computer, meaning comes from assertions that have a specific, identified subject associated with some sensible piece of fully described data (metadata coupled with the data it describes). See Triple. See RDF.

Specification A specification is a formal method for describing objects (physical objects such as nuts and bolts or symbolic objects, such as numbers, or concepts expressed as text). In general, specifications do not require the inclusion of specific items of information (ie, they do not impose restrictions on the content that is included in or excluded from documents), and specifications do not impose any order of appearance of the data contained in the document (ie, you can mix up and rearrange specified objects, if you like). Specifications are not generally certified by a standards organization. They are generally produced by special interest organizations, and the legitimacy of a specification depends on its popularity. Examples of specifications are RDF (Resource Description Framework), produced by the W3C (World Wide Web Consortium), and TCP/IP (Transmission Control Protocol/Internet Protocol), maintained by the Internet Engineering Task Force. The most widely implemented specifications are simple and easily implemented. See Specification versus standard.

Specification versus standard Data standards, in general, tell you what must be included in a conforming document, and, in most cases, dictate the format of the final document. In many instances, standards bar inclusion of any data that is not included in the standard (eg, you should not include astronomical data in a clinical X-ray report). Specifications simply provide a formal way for describing the data that you choose to include in your document. XML and RDF, a semantic dialect of XML, are examples of specifications. They both tell you how data should be represented, but neither tells you what data to include or how your document or data set should appear. Files that comply with a standard are rigidly organized and can be easily parsed and manipulated by software specifically designed to adhere to the standard. Files that comply with a specification are typically self-describing documents that contain within themselves all the information necessary for a human or a computer to derive meaning from the file contents. In theory, files that comply with a specification can be parsed and manipulated by generalized software designed to parse the markup language of the specification (eg, XML, RDF) and to organize the data into data structures defined within the file. The relative strengths and weaknesses of standards and specifications are discussed in Section 2.6, "Specifications Good, Standards Bad." See XML. See RDF.

System call Refers to a command, within a running script, that calls the operating system into action, momentarily bypassing the programming interpreter for the script. A system call can do essentially anything the operating system can do via a command line.
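
For example, a Python script can hand a single command to the operating system through the standard subprocess module; this is a minimal sketch, and the directory-listing command shown ("ls") applies to Unix-like systems, while Windows uses "dir".

  import subprocess

  # System call: the running script asks the operating system to list
  # the current directory, then resumes normal execution.
  result = subprocess.run(["ls"], capture_output=True, text=True)
  print(result.stdout)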

Text editor A text editor (also called ASCII editor) is a software program designed to display simple, unformatted text files. Text editors differ from word processing software applications that produce files with formatting information embedded within the file. Text editors, unlike word processors, can open large files (in excess of 100 megabytes) very quickly. They also allow you to move around the file with ease. Examples of free and open source text editors are Emacs and vi. See ASCII.

Triple In computer semantics, a triple is an identified data object associated with a data element and the description of the data element. In the computer science literature, the syntax for the triple is commonly described as: subject, predicate, object, wherein the subject is an identifier, the predicate is the description of the object, and the object is the data. The definition of triple, using grammatical terms, can be off-putting to the data scientist, who may think in terms of spreadsheet entries: a key that identifies the line record, a column header containing the metadata description of the data, and a cell that contains the data. In this book, the three components of a triple are described as: (1) the identifier for the data object, (2) the metadata that describes the data, and (3) the data itself. In theory, all data sets, databases, and spreadsheets can be constructed or deconstructed as collections of triples. See Introspection. See Data object. See Semantics. See RDF. See Meaning.
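
A minimal Python sketch, representing each triple as the three components described above (the identifier and the data values are hypothetical):

  # Each triple: (identifier of the data object, metadata, data)
  triples = [
      ("object_0001", "name", "Conrad Nervig"),
      ("object_0001", "record_year", "2015"),
  ]
  for identifier, metadata, data in triples:
      print(identifier, "|", metadata, "|", data)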

Turtle Another syntax for expressing triples. From RDF came a simplified syntax for triples, known as Notation 3 or n3.30 From n3 came Turtle, thought to fit more closely with RDF. From Turtle came an even more simplified form, known as N-Triples. See RDF. See Notation 3.
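
As a sketch, a single assertion might be written in Turtle, and equivalently in N-Triples, as follows (the example.org identifiers are hypothetical placeholders):

  # Turtle
  @prefix ex: <http://example.org/> .
  ex:object_0001 ex:name "Conrad Nervig" .

  # N-Triples
  <http://example.org/object_0001> <http://example.org/name> "Conrad Nervig" .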

URL Uniform Resource Locator. The Web is a collection of resources, each having a unique address, the URL. When you click on a link that specifies a URL, your browser fetches the page located at the unique location specified in the URL name. If the Web were designed otherwise (ie, if several different Web pages had the same Web address, or if one Web address were located at several different locations), then the Web could not function with any reliability.

URN Uniform Resource Name. Whereas the URL identifies objects based on the object's unique location on the Web, the URN is a system of object identifiers that are location-independent. In the URN system, data objects are provided with identifiers, and the identifiers are registered with, and subsumed by, the URN. For example:

urn:isbn-13:9780128028827 

Refers to the unique book, "Repurposing Legacy Data: Innovative Case Studies," by Jules Berman.

urn:uuid:e29d0078-f7f6-11e4-8ef1-e808e19e18e5 

Refers to a data object tied to the UUID identifier e29d0078-f7f6-11e4-8ef1-e808e19e18e5. In theory, if every data object were assigned a registered URN, and if the system were implemented as intended, the entire universe of information could be tracked and searched. See URL. See UUID.

UTF Any one of several Unicode Transformation Formats that accommodate a larger set of character representations than ASCII. The character sets covered by the UTF standards include diverse alphabets (ie, characters other than the 26 Latinized letters used in English). See ASCII.
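
For example, in Python, a string containing an accented character can be encoded into UTF-8 bytes and decoded back again; the accented letter occupies two bytes, while the plain ASCII letters keep their familiar single-byte values.

  # UTF-8 encoding of a word containing an accented (non-ASCII) letter
  word = "café"
  encoded = word.encode("utf-8")
  print(encoded)                    # b'caf\xc3\xa9'
  print(encoded.decode("utf-8"))    # café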

UUID (Universally Unique IDentifier) is a protocol for assigning unique identifiers to data objects, without using a central registry. UUIDs were originally used in the Apollo Network Computing System.31 Most modern programming languages have modules for generating UUIDs. See Identifier.
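
For example, Python's standard uuid module generates identifiers on demand, with no central registry involved:

  import uuid

  # Each call returns a new 128-bit identifier; the value differs every time.
  identifier = uuid.uuid4()
  print(identifier)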

Validation Validation is the process that checks whether the conclusions drawn from data analysis are correct.32 Validation usually starts with repeating the same analysis of the same data, using the methods that were originally recommended. Obviously, if a different set of conclusions is drawn from the same data and methods, the original conclusions cannot be validated. Validation may involve applying a different set of analytic methods to the same data, to determine if the conclusions are consistent. It is always reassuring to know that conclusions are repeatable, with different analytic methods. In prior eras, experiments were validated by repeating the entire experiment, thus producing a new set of observations for analysis. Many of today's scientific experiments are far too complex and costly to repeat. In such cases, validation requires access to the complete collection of the original data, and to the detailed protocols under which the data was generated. One of the most useful methods of data validation involves testing new hypotheses, based on the assumed validity of the original conclusions. For example, if you were to accept Darwin's analysis of barnacle data, leading to his theory of evolution, then you would expect to find a chronologic history of fossils in ascending layers of shale. This was the case; thus, paleontologists studying the Burgess shale reserves provided some validation to Darwin's conclusions. Validation should not be mistaken for proof. Nonetheless, the repeatability of conclusions, over time, with the same or different sets of data, and the demonstration of consistency with related observations, is about all that we can hope for in this imperfect world. See Verification. See Reproducibility.

Verification The process by which data is checked to determine whether the data was obtained properly (ie, according to approved protocols), and that the data accurately measured what it was intended to measure, on the correct specimens, and that all steps in data processing were done using well-documented protocols. Verification often requires a level of expertise that is at least as high as the expertise of the individuals who produced the data.32 Data verification requires a full understanding of the many steps involved in data collection and can be a time-consuming task. In one celebrated case, in which two statisticians reviewed a microarray study performed at Duke University, the time devoted to their verification effort was reported to be 2000 hours.33 To put this statement in perspective, the official work-year, according to the U.S. Office of Personnel Management, is 2087 hours. Verification is different from validation. Verification is performed on data; validation is done on the results of data analysis. See Validation. See Microarray. See Introspection.

Winnowing and chaffing Better known to contrarians as chaffing and winnowing. A protocol invented by Ronald Rivest for securing messages against eavesdroppers, without technically employing encryption.34 As used in this book, the winnowing and chaffing protocol would be considered a type of encryption. A detailed discussion of winnowing and chaffing is found in Open Source Tools for Chapter 8. See Encryption.

XML Acronym for eXtensible Markup Language, a syntax for marking data values with descriptors (ie, metadata). The descriptors are commonly known as tags. In XML, every data value is enclosed by a start-tag, containing the descriptor and indicating that a value will follow, and an end-tag, containing the same descriptor and indicating that a value preceded the tag. For example: <name>Conrad Nervig</name>. The enclosing angle brackets, "<>," and the end-tag marker, "/," are hallmarks of HTML and XML markup. This simple but powerful relationship between metadata and data allows us to employ metadata/data pairs as though each were a miniature database. The semantic value of XML becomes apparent when we bind a metadata/data pair to a unique object, forming a so-called triple. See Triple. See Meaning. See Semantics. See HTML.
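
As a short sketch, Python's standard xml.etree.ElementTree module can build and print the metadata/data pair shown above:

  import xml.etree.ElementTree as ET

  # Build the metadata/data pair <name>Conrad Nervig</name>
  element = ET.Element("name")
  element.text = "Conrad Nervig"
  print(ET.tostring(element, encoding="unicode"))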

References

1 Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Burlington, MA: Morgan Kaufmann; 2013.

2 Pogson G. Controlled English: enlightenment through constraint. Lang Technol. 1988;6:22–25.

3 Berman J.J. Biomedical informatics. Sudbury, MA: Jones and Bartlett; 2007.

4 Ogden C.K. Basic English: a general introduction with rules and grammar. London: Paul Treber; 1930.

5 Coster W., Kauchak D. Simple English Wikipedia: a new text simplification task. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies; 2011 Available at: http://www.aclweb.org/anthology/P11-2117 [accessed 07.09.15].

6 Weller S., Cajigas I., Morrell J., Obie C., Steel G., Gould S.J., et al. Alternative splicing suggests extended function of PEX26 in peroxisome biogenesis. Am J Hum Genet. 2005;76(6):987–1007.

7 Berman J.J. Repurposing legacy data: innovative case studies. Burlington, MA: Elsevier, Morgan Kaufmann imprint; 2015.

8 Berman J.J. Pathology abbreviated: a long review of short terms. Arch Pathol Lab Med. 2004;128:347–352.

9 Nadkarni P., Chen R., Brandt C. UMLS concept indexing for production databases. JAMIA. 2001;8:80–91.

10 Berman J.J., Bhatia K. Biomedical data integration: using XML to link clinical and research datasets. Expert Rev Mol Diagn. 2005;5:329–336.

11 Berman J.J. Pathology data integration with eXtensible Markup Language. Hum Pathol. 2005;36:139–145.

12 Ahmed K., Ayers D., Birbeck M., Cousins J., Dodds D., Lubell J., et al. Professional XML meta data. Birmingham, UK: Wrox; 2001.

13 FIPS PUB 119-1. Federal Information Processing Standards Publication 119-1, March 13, 1995 (supersedes FIPS PUB 119, November 8, 1985). Announcing the Standard for ADA. Available from: http://www.itl.nist.gov/fipspubs/fip119-1.htm [accessed 26.08.12].

14 Duncan M. Terminology version control discussion paper: the chocolate teapot. Medical Object Oriented Software Ltd.; 2009 Available from: http://www.mrtablet.demon.co.uk/chocolate_teapot_lite.htm [accessed 30.08.12].

15 Kammer R.G. The role of standards in today's society and in the future. Statement of Raymond G. Kammer, Director, National Institute of Standards and Technology, Technology Administration, Department of Commerce, before the House Committee on Science, Subcommittee on Technology, September 13, 2000.

16 Cahr D., Kalina I. Of pacs and trolls: how the patent wars may be coming to a hospital near you. ABA Health Law. 2006;19:15–20.

17 National Committee on Vital and Health Statistics. Report to the Secretary of the U.S. Department of Health and Human Services on Uniform Data Standards for Patient Medical Record Information. July 6, 2000. Available from: http://www.ncvhs.hhs.gov/hipaa000706.pdf.

18 Berman J.J. Ruby programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2008.

19 Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

20 Kunze J. Encoding Dublin Core metadata in HTML. Dublin Core Metadata Initiative. Network working group request for comments 2731. The Internet Society. Available at: http://www.ietf.org/rfc/rfc2731.txt; 1999 [accessed 01.08.15].

21 Dublin Core Metadata Initiative. Available from: http://dublincore.org; 2015 [accessed 14.11.15].

22 Dublin Core Metadata Element Set, Version 1.1: Reference description. Available from: http://dublincore.org/documents/1999/07/02/dces/ [accessed 14.11.15].

23 Van den Broeck J., Cunningham S.A., Eeckels R., Herbst K. Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med. 2005;2:e267.

24 Lohr S. For big-data scientists, "janitor work" is key hurdle to insights. The New York Times; 2014.

25 ISO/IEC 11179, Information technology—metadata registries. ISO/IEC JTC1 SC32 WG2 Development/Maintenance. Available from: http://metadata-standards.org/11179/ [accessed 02.11.14].

26 Paskin N. Identifier interoperability: a report on two recent ISO activities. D-Lib Mag. 2006;12:1–23.

27 Berman J.J., Moore G.W. Implementing an RDF Schema for pathology images. 2007 Available from: http://www.julesberman.info/spec2img.htm [accessed 01.01.15].

28 Gates S. Qualcomm v. Broadcom — The federal circuit weighs in on "patent ambushes". 2008 Available from: http://www.mofo.com/qualcomm-v-broadcom—the-federal-circuit-weighs-in-on-patent-ambushes-12-05-2008 [accessed 22.01.13].

29 Department of Health and Human Services. 45 CFR (Code of Federal Regulations), Parts 160 through 164. Standards for privacy of individually identifiable health information (final rule). Fed Regist. 2000;65(250):82461–82510.

30 Primer: Getting into RDF & Semantic Web using N3. Available from: http://www.w3.org/2000/10/swap/Primer.html [accessed 17.09.15].

31 Leach P., Mealling M., Salz R. A universally unique identifier (UUID) URN namespace. Network Working Group, Request for Comment 4122, Standards Track. Available from: http://www.ietf.org/rfc/rfc4122.txt [accessed 01.01.15].

32 Committee on Mathematical Foundations of Verification, Validation, and Uncertainty Quantification, Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, National Research Council. Assessing the reliability of complex models: mathematical and statistical foundations of verification, validation, and uncertainty quantification. National Academy Press; 2012 Available from: http://www.nap.edu/catalog.php?record_id=13395 [accessed 01.01.15].

33 Misconduct in science: an array of errors. The Economist. September 10, 2011.

34 Rivest R.L. Chaffing and winnowing: confidentiality without encryption. MIT Lab for Computer Science; March 18, 1998 (rev. April 24, 1998).
