Chapter 7

Object-Oriented Data

Abstract

It is sometimes said that the role of the data scientist has less to do with asking questions of data and more to do with trying to understand what the data is trying to say. The self-explanatory quality of well-constructed data is known as introspection. Modern programming languages, particularly object-oriented programming languages, can use introspective data (ie, data provided by data objects) to modify the execution of a program, at run-time; an elegant process known as reflection. Using introspection and reflection, programs can integrate data objects with related data objects. The successful coordinated implementation of introspection, reflection, and integration, is one of the most important achievements in the field of computer science. The purpose of this chapter is to show how we can understand data, using a few elegant computational principles, applied to simplified data.

Keywords

Introspection; Reflection; Integration; Polymorphism; Encapsulation; Inheritance; Abstraction

7.1 The Importance of Self-Explaining Data

Looking at code you wrote more than two weeks ago is like looking at code you are seeing for the first time.

Dan Hurvitz

Data scientists use data for purposes that were unintended or unimagined by the people who prepared the data. The data that is being analyzed today may have been collected decades, centuries, or millennia in the past. If we hope to use today's data for tomorrow's purposes, we need to prepare our data in a manner that preserves meaning. Here are just a few examples.

Following the first Apollo mission to the moon (Apollo 11, July 20, 1969), the five subsequent Apollo missions left behind recording instruments on the lunar surface. The collective set of downlinked data received from these instruments is known as the Apollo Lunar Surface Experiments Package (ALSEP). More than 11,000 data tapes were recorded.1

While the Apollo program was active, control and use of the tapes, as well as the responsibility to safely archive the tapes, was distributed among various agencies and institutions. When the Apollo mission ended, funds were low, and a portion of the data that had been distributed to various investigators and agencies was never sent to the official archives.2 It should come as no surprise that, at the present time, about half of the ALSEP tapes are missing; their whereabouts uncertain. Of the available tapes, much of the data is difficult to access, due to the use of abandoned data media (ie, 7- and 9-track tapes) and obsolete data formats2 (see Glossary item, Abandonware).

Available ALSEP data, when converted into a modern data format, have proven to be a valuable asset when reanalyzed with modern analytic tools. For example, the first analyses of ALSEP's seismic data, conducted 35 years ago, indicated that about 1300 deep moonquakes had occurred during the period when the data was being downlinked. The field of seismic analysis has advanced in the interim. A reanalysis of the same data, using modern techniques, has produced an upward revision of the first estimate, to about 7000 deep moonquakes.2

Today, there is a renewed push to find, collect, and archive the missing ALSEP data. Why is there a sudden urgency to finish a chore that should have been completed decades ago? Simply put, the tapes must be restored before the last of the original investigators, who alone understand the scope and organization of the data, vanish into retirement or death.

In the 1980s, the PETRA collider conducted a number of so-called atom smashing experiments designed to measure the force required to bind together quarks and gluons, the fundamental components of protons and neutrons.1 In 1986, the PETRA collider was decommissioned and replaced with colliders that operated at higher energy levels. Several decades passed, and advances in physics raised questions that could only be answered with observations on low-energy collisions; the kind of observations collected by PETRA and omitted by present-day colliders.3

An archeological effort to retrieve and repurpose the 1980s data was spearheaded by Siegfried Bethke, one of the original scientists in PETRA's JADE project.4 In the period following the decommissioning of PETRA, the original data had been dispersed to various laboratories (see Glossary item, Data archeology). Some of the JADE data was simply lost, and none of the data was collected in a format or a medium that was directly accessible.1

The repurposing project was divided into three tasks, involving three teams of scientists (see Glossary item, Data repurposing). One team rescued the data from archived tapes and transferred the data into a modern medium and format. The second team improved the original JADE software, fitting it to modern computer platforms. By applying new software, using updated Monte Carlo simulations, the second team generated a new set of data files (see Glossary item, Monte Carlo simulation). The third team reanalyzed the regenerated data using modern methods and improved calculations.

The project culminated in the production of numerous scientific contributions that could not have been achieved without the old JADE data. Success was credited, at least in part, to the participation of some of the same individuals who collected the original data.

On the Yucatan peninsula, concentrated within a geographic area that today encompasses the southeastern tip of Mexico, plus Belize and Guatemala, a great civilization flourished. The Mayan civilization seems to have begun about 2000 BCE, reaching its peak in the so-called classic period (AD 250–900). Abruptly, about AD 900, the great Mayan cities were abandoned, and the Mayan civilization entered a period of decline. Soon after the Spanish colonization of the peninsula, in the 16th century, the Mayans were subjected to a deliberate effort to erase any trace of their heritage. By the dawn of the 20th century, the great achievements of the Mayan civilization were forgotten, its cities and temples were thoroughly overgrown by jungle, its books had been destroyed, and no humans on the planet could decipher the enduring stone glyph tablets strewn through the Yucatan peninsula.

Over a period of several centuries, generations of archeologists, linguists, and epigraphers devoted their careers to decoding the Mayan glyphs. To succeed, they depended on the glyphs to provide some initial information to help explain their meaning. Luckily, the glyphs were created with a set of features that are essential for extracting meaning from data. The ancient Mayans provided unique, identified objects (eg, name of king and name of city), with an accurate timestamp (ie, date) on all glyph entries. The Mayans had a sophisticated calendar, highly accurate timekeeping methods, and their data was encoded in a sophisticated number system that included the concept of zero. Furthermore, their recorded data was annotated with metadata (ie, descriptions of the quantitative data). The careful recording of data as uniquely identified records, with accurate dates and helpful metadata, was the key to the first two breakthrough discoveries. In 1832, Constantine Rafinesque decoded the Mayan number system. In 1880, Forstemann, using Rafinesque's techniques to decode the numbers that appeared in a Mayan text, deduced how the Mayans recorded the passage of time, and how they used numbers to predict astronomic events. After these discoveries were made, the Mayan glyphs incrementally yielded their secrets. By 1981, the ancient Mayan scripts were essentially decoded.

What is the moral of these three examples of data repurposing (ie, the ALSEP lunar surface measurements, the PETRA collider data, and the Mayan glyphs)? The moral seems to be that if you do not want to spend decades or centuries trying to understand your data, you had better give some thought to the way you prepare your data.5 Analysis of numerous data reconstruction efforts indicates that good historical data has the following properties1:

1. Data that is immutable (see Glossary item, Immutability)

2. Data that is persistent (see Glossary item, Persistence)

3. Data that establishes the unique identity of records (ie, data objects)

4. Data that accrues over time, documenting the moments when data objects are obtained (ie, timestamped data)

5. Data that is described (ie, annotated with metadata)

6. Data that assigns data objects to classes of information

7. Data that provides a structural hierarchy for classes

8. Data that explains itself (ie, introspective data)

7.2 Introspection and Reflection

The difference between theory and practice is that in theory, there is no difference between theory and practice.

Richard Moore

Introspection is a term borrowed from object-oriented programming, not often found in the informatics literature. It refers to the ability of data objects to describe themselves when interrogated. With introspection, data scientists can determine the content of data objects and the hierarchical organization of data objects within complex data sets. Introspection allows users to see how different data objects are related to one another. This section describes how introspection is achieved, drawing examples from a simplified set of data, composed of triples.

To illustrate, let us see how Ruby, a popular object-oriented programming language, implements introspection (see Glossary items, Object-oriented programming, Class-oriented programming).1

In Ruby, we can create a new object, "x", and assign it a string, such as "hello world".

x = "hello world"

Because the data object, "x", contains a string, Ruby knows that x belongs to the String class of objects. If we send the "class" method to the object, "x", Ruby will return a message indicating that "x" belongs to class String.

x.class yields String

In Ruby, every object is automatically given an identifier (ie, character string that is unique for the object). If we send the object the method "object_id", Ruby will tell us its assigned identifier.

x.object_id yields 22502910

Ruby tells us that the unique object identifier assigned to the object "x" is 22502910.

In Ruby, should we need to learn the contents of "x", we can send the "inspect" method to the object. Should we need to know the methods that are available to the object, we can send the "methods" method to the object. All modern object-oriented languages support syntactic equivalents of these basic introspective tools (see Glossary item, Syntax).
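These introspective calls can be tried in a few lines of Ruby (a minimal sketch; the integer printed by "object_id" will differ from run to run):

```ruby
x = "hello world"

puts(x.class)                       # String
puts(x.object_id)                   # a unique integer identifier; value varies per run
puts(x.inspect)                     # "hello world" (a printable view of the object's contents)
puts(x.methods.include?(:upcase))   # true; :upcase is among the methods x can receive
```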

An important by-product of introspection is reflection. Reflection is a programming technique wherein a computer program will modify itself, at run-time, based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection. If the information indicates that the data object belongs to a particular class of objects, the program might call a method appropriate for the class. The program executes in a manner determined by information obtained during run-time; metaphorically reflecting upon the purpose of its computational task. Detailed information about every piece of data in a data set (eg, the identifier associated with the data object, the class of objects to which the data object belongs, the metadata and the data values that are associated with the data object), permit data scientists to integrate, relate, and repurpose individual data objects collected from any data source or sources, even those sources that are dispersed over network servers (see Glossary items, Data fusion, Data integration, Data merging, Metadata, Software agent, Reflection).
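As a minimal sketch of reflection, a Ruby script can interrogate each object in a collection for its class, at run-time, and choose its action accordingly (the collection and the actions here are invented for illustration):

```ruby
# Run-time dispatch: the program inspects each object's class (introspection)
# and modifies its behavior based on what it learns (reflection).
collection = ["hello world", 42, [1, 2, 3]]

collection.each do |item|
  case item.class.to_s    # ask the object for its class at run-time
  when "String"  then puts("String of length #{item.length}")
  when "Integer" then puts("Integer with successor #{item + 1}")
  when "Array"   then puts("Array holding #{item.size} elements")
  end
end
```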

It is worth remembering that data analysis always involves understanding the relationships among data objects. Algorithms, computers, and programming languages are simply tools that help the data analyst achieve a cognitive breakthrough. Putting tools aside, we can see that if we construct our data properly, the data will provide introspection. For successful data analysis, having object-oriented programming languages is less important than having object-oriented data.

7.3 Object-Oriented Data Objects

The ignoramus is a leaf who doesn't know he is part of a tree.

attributed to Michael Crichton

For programmers, the greatest benefit of preparing your data as classified triples (ie, collections of triples in which identified objects are assigned to classes) is that all of the computational benefits of object-oriented programming automatically convey to your data. To better understand why classified triples bestow the same benefits as does the object-oriented programming paradigm, let us first look at the four intrinsic features of every object-oriented programming language (see Glossary items, Inheritance, Encapsulation, Abstraction, Polymorphism):

1. Inheritance. Data objects (ie, classes and the instances that belong to classes) inherit the methods (eg, functions and subroutines) of their ancestral classes. In object-oriented programming, we can send a method request to an object, and if the method happens to be a class method belonging to any of the ancestor classes of the object, the object-oriented programming environment will find the method (by searching through the ancestral classes), and will compute a result, by feeding parameters provided by the object through its method request. Surprisingly, a list of triples provides the equivalent functionality. Suppose we have a huge database consisting of triples. We can associate a unique object with a method, and search through the triple database for the class to which the object belongs; or to any of the object's ancestral classes. Once we find the ancestral classes, we can search for triples associated with the class to see if any of the triples assert the method. If so, we can search through the triples associated with the method until we find the code associated with the method. Once we've found the code, we can execute the code, in the language for which the code is written. The process seems awkward, but it is not; computers excel at searching through hierarchical data structures.

2. Encapsulation. Encapsulation happens when a data object contains its associated data. In object-oriented programming, methods exist whereby the data that is encapsulated in a unique object can be interrogated and examined (ie, introspected). Lists of triples are assertions about the data that are associated with a unique identifier. As such, every triple encapsulates a datum. It is easy to collect all of the triples associated with a unique identifier (ie, all the triples of a data object) and this collection will contain the totality of the unique object's data. Thus, triples encapsulate all the data associated with a data object.

3. Abstraction. In the context of object-oriented programming, abstraction is a technique whereby a method is reduced (ie, simplified) to a generalized form that is applicable to a wide range of objects, but for which the specific characteristics of the object receiving the method may be used to return a result that is suited to the particular object. When a method is sent to a triple, its output will depend on the properties of the triple, because the method receives all of the data encapsulated in the triple, and any data parameters included in the method request. The following little story may help clarify the situation. A salesman, walking along a road, passes a farmer. The salesman asks, "How long of a walk is it to town?" The farmer replies, "Keep moving." The salesman says, "Why won't you answer my question?" The farmer repeats, "Keep moving." The salesman shrugs and continues along his way. When the salesman has walked about 20 yards, the farmer shouts, "For you, 45 minutes." To the farmer, the method for computing the length of time required to reach the next town was abstracted. His answer depended on who was doing the asking and required some specific input data (collected when the salesman walked away from the farm) before an answer could be calculated.

4. Polymorphism. Methods sent to object receivers have a response determined by the class of the receiving object. Hence, different objects, from different classes, receiving a call to a method of the same name, will respond differently. For example, suppose you have a method named "divide" and you send the method (ie, issue a command to execute the method) to an object of Class Bacteria and an object of Class Numerics. The bacteria object, receiving the divide method, will try to execute it by looking for the "divide" method somewhere in its class lineage. For a bacteria, the "divide" method may involve making a copy of the bacteria object and incrementing the number of bacteria in the population. The numeric object, receiving the "divide" method, will look for the method in its class lineage and will probably find some method that provides instructions for arithmetic division. Hence, the behavior of an object, in response to a received method, will be appropriate for the class of the object. The same holds true for collections of triples. A triple will only have access to the methods of its ancestral classes, and an object's ancestral classes and their methods are all described as triples. Hence, a method belonging to an unrelated class that happens to have the same method name as a method belonging to an object's ancestral class, will be inaccessible to the object.
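The hierarchy-climbing search described in item 1 can be sketched in a few lines of Ruby. The triples below are hypothetical, reduced to a nested hash for brevity; a real triplestore would be searched the same way:

```ruby
# Hypothetical triples, keyed as object => { metadata => data }.
# The Eutheria class asserts a "speak" method; its descendants inherit it.
triples = {
  "lassie"   => { "instance_of" => "Canis" },
  "Canis"    => { "subclass_of" => "Eutheria" },
  "Eutheria" => { "subclass_of" => "Craniata", "method" => "speak" },
  "Craniata" => { "subclass_of" => "Class" }
}

# Climb the ancestral classes of an object until a class asserting
# the requested method is found (or the root of the hierarchy is passed).
def find_method_class(triples, object, method_name)
  klass = triples[object]["instance_of"]
  while klass
    entry = triples[klass]
    return klass if entry && entry["method"] == method_name
    klass = entry && entry["subclass_of"]
  end
  nil
end

puts(find_method_class(triples, "lassie", "speak"))  # Eutheria
```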

The following example, provided in the Ruby script, lineage.rb, demonstrates how simple it is to create a new classification, and to endow the classification with inheritance, encapsulation, abstraction, and polymorphism. Readers who cannot program in Ruby will still benefit from reviewing this short example; the principles will apply to every object-oriented programming language and to every data set composed of triples.

#!/usr/bin/ruby
class Craniata
  def brag
    puts("I have a well-developed brain")
  end
  def myself
    puts("I am a member of Class " + self.class.to_s)
  end
end
class Gnathostomata < Craniata
  def speak
    puts("I have a jaw")
  end
end
class Teleostomi < Gnathostomata
end
class Mammalia < Teleostomi
end
class Theria < Mammalia
end
class Eutheria < Theria
end
class Canis < Eutheria
  def speak
    puts("Bow wow")
  end
end
class Primates < Eutheria
  def speak
    puts("Huf hufff")
  end
end
puts("Lassie")
Lassie = Canis.new
Lassie.speak
Lassie.brag
Lassie.myself
puts()
puts("George_of_the_jungle")
George_of_the_jungle = Primates.new
George_of_the_jungle.speak
George_of_the_jungle.brag
George_of_the_jungle.myself
puts()
puts(Primates.method(:new).owner)
puts(Canis.method(:new).owner)
exit

Here is the output of the lineage.rb script:

c:\ftp>lineage.rb
Lassie
Bow wow
I have a well-developed brain
I am a member of Class Canis

George_of_the_jungle
Huf hufff
I have a well-developed brain
I am a member of Class Primates

Class
Class

In brief, the script creates two new objects: a new and unique member of Class Canis, named Lassie, and a new and unique member of Class Primates, named George_of_the_jungle. Lassie and George_of_the_jungle are both sent three methods: speak, brag, and myself. When Lassie is instructed to speak, she says "Bow wow." When George_of_the_jungle is instructed to speak, he says "Huf hufff." When Lassie and George_of_the_jungle are instructed to brag, they both say, "I have a well-developed brain." When Lassie is sent the "myself" method, she replies, "I am a member of Class Canis." When George_of_the_jungle is sent the "myself" method, he replies, "I am a member of Class Primates."

The special features of inheritance, abstraction, encapsulation, and polymorphism cannot be appreciated without reviewing how the methods employed by the script were implemented.

We created two unique objects, using the "new" method.

Lassie = Canis.new
George_of_the_jungle = Primates.new

These two lines of code created two new object instances: Lassie, in Class Canis, and George_of_the_jungle, in Class Primates. This was accomplished by sending the "new" method to each class and providing a name for the new instances. If you skim back to the top of the script, containing the class declarations and method definitions, you will notice that there is no "new" method described for any of the classes. The reason we can call the "new" method, without defining it in our code, is that Ruby has a top-level class Class that contains an abstract "new" method that is inherited by any Ruby class that we create.

Look at the two lines of code at the bottom of the script:

puts(Primates.method(:new).owner)
puts(Canis.method(:new).owner)

This code tells Ruby to print out the class that owns the "new" method used by the Primate class and the "new" method used by the Canis class. In either case, the output indicates that the "new" method belongs to class Class.

Class
Class

All classes in Ruby are descendants of class Class, and as such, they all inherit the abstract, or general method, "new," that creates new instances of classes.
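Ruby can display such lineages directly: sending the built-in "ancestors" method to any class returns its chain of ancestral classes at run-time. In this sketch, two of the lineage.rb classes are redeclared so that the snippet stands on its own:

```ruby
# Every user-defined class descends from Object, and every class is itself
# an instance of class Class; "ancestors" exposes the inheritance chain.
class Craniata
end
class Gnathostomata < Craniata
end

puts(Gnathostomata.ancestors.inspect)  # includes Craniata, Object, BasicObject
puts(Gnathostomata.class)              # Class
```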

Let's look at three lines of code:

Lassie.speak
Lassie.brag
Lassie.myself

The first line sends the "speak" method to the Lassie object. Ruby finds the "speak" method in the object's Canis class and prints out "Bow wow." The second line sends the "brag" method to the Lassie object and hunts for the class that owns "brag." In this case, Ruby must search up the class hierarchy until it reaches Class Craniata, where it finds and executes the "brag" method. The same thing happens when we send the Lassie object the "myself" method, also found in Class Craniata, as shown:

def myself
  puts("I am a member of Class " + self.class.to_s)
end

In this case, the "myself" method calls upon the Lassie object to inspect itself and to yield its class name as a string. The "myself" method requires the Lassie object to encapsulate its own class assignment.

So far, we have seen examples of abstraction (ie, the "new" method), inheritance (ie, the "brag" method), and encapsulation (ie, the "myself" method). How does the lineage.rb script demonstrate polymorphism? Notice that a "speak" method is contained in class Gnathostomata, class Canis, and class Primates. The three "speak" methods are different from one another, as shown:

class Gnathostomata < Craniata
  def speak
    puts("I have a jaw")
  end
end

class Canis < Eutheria
  def speak
    puts("Bow wow")
  end
end

class Primates < Eutheria
  def speak
    puts("Huf hufff")
  end
end

When we send the "speak" method to a member of class Primates, Ruby finds and executes the "speak" method for the Primates class. Likewise, when we send the "speak" method to a member of class Canis, Ruby finds and executes the "speak" method for the Canis class. Had there been no "speak" method in either of these classes, Ruby would have traveled up the class hierarchy until it found the "speak" method in the Gnathostomata class. In these cases, the "speak" method produces different outputs, depending on the class in which it applies, an example of polymorphism.

At this point, the reader must be wondering why she is being subjected to a lesson in Ruby object-oriented programming. As it happens, the same principles of object-oriented programming apply to every object-oriented language.6 We will see in the next section how the benefits of object-oriented programming extend to triplestore databases, the simplest and most fundamental way of expressing meaning, with data.

7.4 Working with Object-Oriented Data

The unexamined life is not worth living.

Socrates

Enormous benefits follow when data objects are expressed as triples and assigned to defined classes. All of the attributes of object-oriented programming languages (ie, inheritance, encapsulation, abstraction, and polymorphism) are available to well-organized collections of triples. Furthermore, desirable features in any set of data, including integration, interoperability, portability, and introspection are available to data scientists who analyze triplestore data. Last but not least, triples are easy to understand: a unique identifier followed by a metadata/data pair comprise the simple totality of a triple.

This section illustrates everything we've learned about classifications, triples, object-oriented data, and introspection, using a simple triplestore data set.

Here is the triplestore, as the plain-text file, triple.txt:

9f0ebdf2^^object_name^^Class
9f0ebdf2^^property^^subclass_of
9f0ebdf2^^property^^property
9f0ebdf2^^property^^definition
9f0ebdf2^^property^^object_name
9f0ebdf2^^property^^instance_of
9f0ebdf2^^subclass_of^^Class
9f0ebdf2^^instance_of^^Class
701cb7ed^^object_name^^Property
701cb7ed^^subclass_of^^Class
701cb7ed^^definition^^the metadata class
77cb79d5^^object_name^^instance_of
77cb79d5^^instance_of^^Property
77cb79d5^^definition^^the name of the class to which the object is an instance
a03fbc3b^^object_name^^object_name
a03fbc3b^^instance_of^^Property
a03fbc3b^^definition^^word equivalent of its predicate identifying sequence
de0e5aa1^^object_name^^subclass_of
de0e5aa1^^instance_of^^Property
de0e5aa1^^definition^^the name of the parent class of the referred object
4b675067^^object_name^^property
4b675067^^instance_of^^Property
4b675067^^definition^^an identifier for a class property
c37529c5^^object_name^^definition
c37529c5^^instance_of^^Property
c37529c5^^definition^^the meaning of the referred object
a29c59c0^^object_name^^dob
a29c59c0^^instance_of^^Property
a29c59c0^^definition^^date of birth, as Day, Month, Year
a34a1e35^^object_name^^glucose_at_time
a34a1e35^^instance_of^^Property
a34a1e35^^definition^^glucose level in mg/dL at time drawn (GMT)
03cc6948^^object_name^^Organism
03cc6948^^subclass_of^^Class
7d7ff42b^^object_name^^Hominidae
7d7ff42b^^subclass_of^^Organism
7d7ff42b^^property^^dob
a0ce8ec6^^object_name^^Homo
a0ce8ec6^^subclass_of^^Hominidae
a0ce8ec6^^property^^glucose_at_time
a1648579^^object_name^^Homo sapiens
a1648579^^subclass_of^^Homo
98495efc^^object_name^^Andy Muzeack
98495efc^^instance_of^^Homo sapiens
98495efc^^dob^^1 January, 2001
98495efc^^glucose_at_time^^87, 02-12-2014 17:33:09

Perusal of the triples provides the following observations:

1. Each triple consists of three character sequences, separated by a double-caret. The first character sequence is the object identifier. The second is the metadata and the third is the value. For example:

7d7ff42b^^subclass_of^^Organism

The individual parts of the triple are:

7d7ff42b is the identifier
subclass_of is the metadata
Organism is the data

Notice that these triples are expressed in a format different from RDF, or Notation3, or Turtle. Do we care? No. We know that with a few lines of code, we could convert our triplestore into any alternate format we might prefer. Furthermore, our triplestore could be converted into a spreadsheet, in which the identifiers are record keys, the metadata are column headings, and the data occupy cells. We could also port our triples into a database, if we so desired.

2. Using triples, we have defined various classes and properties. For example:

03cc6948^^object_name^^Organism
03cc6948^^subclass_of^^Class

With one triple, we create a new object, with the name Organism, and we associate it with a unique identifier (03cc6948). With another triple, we establish that the Organism object is a class that happens to be the child class of the root class, Class. Because Organism is a subclass of Class, it will inherit all of the properties of its parent class.

Let's skip down to the bottom of the file:

98495efc^^object_name^^Andy Muzeack
98495efc^^instance_of^^Homo sapiens
98495efc^^dob^^1 January, 2001
98495efc^^glucose_at_time^^87, 02-12-2014 17:33:09

Here we create a few triples that provide information about a person named Andy Muzeack. First, we assign a unique identifier to our new object, named Andy Muzeack. We learn, from the next triple, that Andy Muzeack is a member of class Homo sapiens. As such, we infer that Andy Muzeack inherits all the properties contained in class Homo (the parent class of class Homo sapiens) and in all the ancestors of class Homo, leading to the top, or root ancestor, class Class. We learn that Andy Muzeack has a "dob" of January 1, 2001. By ascending the list of triples, we learn that "dob" is a property, with a unique identifier (a29c59c0), and a definition, "date of birth, as Day, Month, Year." Finally, we learn that Andy Muzeack has a glucose_at_time of "87, 02-12-2014 17:33:09." Elsewhere in the triplestore, we find that the "glucose_at_time" metadata is defined as the glucose level in mg/dL at the time drawn, in Greenwich Mean Time.

If we wished, we could simply concatenate our triplestore with other triplestores that contain triples relevant to Andy Muzeack. It would not make any difference how the triples are ordered. If Andy Muzeack's identifier is reconcilable, and the metadata is defined, and each triple is assigned to a class, then we will be able to fully understand and analyze the data held in the triplestore (see Glossary item, Reconciliation).
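The spreadsheet conversion mentioned earlier (identifiers as record keys, metadata as column headings) takes only a few lines of Ruby. In this sketch, three of Andy Muzeack's triples are embedded directly in the script rather than read from triple.txt:

```ruby
# Convert "identifier^^metadata^^data" triples into one tab-delimited,
# spreadsheet-style record per identifier.
triples = <<~DATA
  98495efc^^object_name^^Andy Muzeack
  98495efc^^dob^^1 January, 2001
  98495efc^^glucose_at_time^^87, 02-12-2014 17:33:09
DATA

records = Hash.new { |h, k| h[k] = {} }
triples.each_line do |line|
  id, meta, value = line.chomp.split("^^")
  records[id][meta] = value
end

records.each do |id, fields|
  puts(([id] + fields.values).join("\t"))  # one row: key, then column values
end
```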

Of course, when we have millions and billions of triples, we could not perform our analyses by reading through the file. We would need scripts and/or a database application.

Let's write our own scripts that tell us something about the objects in our triplestore.

Here is a short Perl script, class_prop.pl, that traverses the triple.txt file, and lists the contained properties.

#!/usr/local/bin/perl
open(TEXT, "triple.txt");
$line = " ";
$object_name = "object_name";
$instance = "instance_of";
$property_class = "Property";
while ($line ne "")
  {
  $line = <TEXT>;
  $line =~ s/\n//o;
  @three = split(/\^\^/, $line) if ($line ne "");
  $triple{$three[0]}{$three[1]}{$three[2]} = "";
  }
for $identifier (keys %triple)
  {
  if (exists($triple{$identifier}{$instance}{$property_class}))
    {
    @property_names = keys (%{$triple{$identifier}{$object_name}});
    print "$property_names[0] is an instance of Class Property\n";
    }
  }
exit;

Here is the output of the class_prop.pl script:

subclass_of is an instance of Class Property
instance_of is an instance of Class Property
definition is an instance of Class Property
object_name is an instance of Class Property
glucose_at_time is an instance of Class Property
property is an instance of Class Property
dob is an instance of Class Property

Here is a simple Perl script, parent.pl, that will tell us the parent class of any class entered on the command line.

#!/usr/local/bin/perl
open(TEXT, "triple.txt");
$line = " ";
$subclass = "subclass_of";
$object_name = "object_name";
$class_identifier = "";
$class = $ARGV[0];
while ($line ne "")
  {
  $line = <TEXT>;
  $line =~ s/\n//o;
  @three = split(/\^\^/, $line) if ($line ne "");
  $triple{$three[0]}{$three[1]}{$three[2]} = "";
  }
for $identifier (keys %triple)
  {
  if (exists($triple{$identifier}{$object_name}{$class}))
    {
    $class_identifier = $identifier;
    last;
    }
  }
@parent_array = keys (%{$triple{$class_identifier}{$subclass}});
print "$class is a subclass of $parent_array[0]\n";
exit;

Here is the output of parent.pl, for three different input classes.

c:\ftp>parent.pl "Homo sapiens"
Homo sapiens is a subclass of Homo

c:\ftp>parent.pl "Homo"
Homo is a subclass of Hominidae

c:\ftp>parent.pl "Property"
Property is a subclass of Class

These short Perl scripts demonstrate how simple it is to analyze triplestore data using object-oriented techniques.
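Ruby serves just as well as Perl for this kind of work. As a sketch of the next logical step, the following script follows the subclass_of links repeatedly, reconstructing the full ancestry of a class; a few of the triples from triple.txt are embedded in the script so that it stands on its own:

```ruby
# Walk the subclass_of relations upward to recover a class's full lineage.
triples = <<~DATA
  03cc6948^^object_name^^Organism
  03cc6948^^subclass_of^^Class
  7d7ff42b^^object_name^^Hominidae
  7d7ff42b^^subclass_of^^Organism
  a0ce8ec6^^object_name^^Homo
  a0ce8ec6^^subclass_of^^Hominidae
  a1648579^^object_name^^Homo sapiens
  a1648579^^subclass_of^^Homo
DATA

names = {}    # identifier => object_name
parents = {}  # identifier => name of parent class
triples.each_line do |line|
  id, meta, value = line.chomp.split("^^")
  names[id] = value if meta == "object_name"
  parents[id] = value if meta == "subclass_of"
end

# Climb from a class name to the root, collecting ancestral class names.
def lineage(name, names, parents)
  chain = [name]
  loop do
    id = names.key(chain.last)   # find the identifier for the current class
    break unless id && parents[id]
    chain << parents[id]
  end
  chain
end

puts(lineage("Homo sapiens", names, parents).join(" < "))
# Homo sapiens < Homo < Hominidae < Organism < Class
```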

Open Source Tools

Persistent Data

A file that big?

It might be very useful.

But now it is gone.

Haiku by David J. Liszewski

Your scripts create data objects, and the data objects hold data. Sometimes, these data objects are transient, existing only during a block or subroutine. At other times, the data objects produced by scripts represent prodigious amounts of data, resulting from complex and time-consuming calculations. What happens to these data structures when the script finishes executing? Ordinarily, when a script stops, all the data produced by the script simply vanishes.

Persistence is the ability of data to outlive the program that produced it. The methods by which we create persistent data are sometimes referred to as marshalling or serializing. Some of the language-specific methods are called by such colorful names as data dumping, pickling, freezing/thawing, and storable/retrieve (see Glossary items, Serializing, Marshalling, Persistence).

Data persistence can be ranked by level of sophistication. At the bottom is the exportation of data to a simple flat-file, wherein records are each one line in length, and each line of the record consists of a record key, followed by a list of record attributes. The simple spreadsheet stores data as tab delimited or comma separated line records. Flat-files can contain a limitless number of line records, but spreadsheets are limited by the number of records they can import and manage. Scripts can be written that parse through flat-files line by line (ie, record by record), selecting data as they go. Software programs that write data to flat-files achieve a crude but serviceable type of data persistence.
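As a minimal illustration of this lowest level of persistence, the following Python sketch writes a few records to a tab-delimited flat-file and parses them back, record by record. The file name, record keys, and attributes are hypothetical:

```python
import os
import tempfile

# Hypothetical records: a key followed by a list of attributes
records = {"rec_0001": ("Fred Mertz", "Neighbor"),
           "rec_0002": ("Lucy Ricardo", "Star")}

path = os.path.join(tempfile.gettempdir(), "flatfile.txt")

# Write: one line per record; the record key first, then tab-delimited attributes
with open(path, "w") as f:
    for key, attributes in records.items():
        f.write(key + "\t" + "\t".join(attributes) + "\n")

# Read: parse each line back into a record key and its attribute list
recovered = {}
with open(path) as f:
    for line in f:
        key, *attributes = line.rstrip("\n").split("\t")
        recovered[key] = tuple(attributes)

print(recovered["rec_0002"])  # ('Lucy Ricardo', 'Star')
```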

A middle-level technique for creating persistent data is the venerable database. If nothing else, databases are made to create, store, and retrieve data records. Scripts that have access to a database can achieve persistence by creating database records that accommodate data objects. When the script ends, the database persists, and the data objects can be fetched and reconstructed for use in future scripts.

Perhaps the highest level of data persistence is achieved when complex data objects are saved in toto. Flat-files and databases may not be suited to storing complex data objects that hold encapsulated data values. Most languages provide built-in methods for storing complex objects, and a number of languages designed to describe complex forms of data have been developed. Data description languages, such as YAML (YAML Ain't Markup Language) and JSON (JavaScript Object Notation), can be adopted by any programming language.
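For example, Python's standard json module can serialize a nested data structure to a language-neutral text string, from which an equivalent structure can be restored by any language equipped with a JSON parser. This sketch uses invented sample data:

```python
import json

# A nested structure: string, integer, array, and an inner dictionary
data = {"number": 42,
        "string": "This is a string",
        "array": list(range(1, 11)),
        "hash": {"apple": "red", "banana": "yellow"}}

text = json.dumps(data)        # serialize to a JSON text string
restored = json.loads(text)    # parse the string back into Python objects

print(restored == data)  # True
```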

Data persistence is essential to data simplification. Without data persistence, all data created by scripts is volatile, obliging data scientists to waste time recreating data that has ceased to exist. Essential tasks such as script debugging and data verification become impossible. It is worthwhile reviewing some of the techniques for data persistence that are readily accessible to Perl, Python, and Ruby programmers.

Perl will dump any data structure into a persistent, external file for later use. Here, the Perl script, data_dump.pl, creates a complex associative array, "%hash", which nests within itself a string, an integer, an array, and another associative array (see Glossary item, Associative array). This complex data structure is dumped into a persistent structure (ie, an external file named dump_struct).

#!/usr/local/bin/perl

use Data::Dump qw(dump);

%hash = (

 number => 42,

 string => 'This is a string',

 array => [ 1 .. 10 ],

 hash => { apple => 'red', banana => 'yellow'},);

open(OUT, ">dump_struct");

print OUT dump \%hash;

exit;

The Perl script, data_slurp.pl, picks up the external file, "dump_struct", created by the data_dump.pl script, and loads it into a variable.

#!/usr/local/bin/perl

use Data::Dump qw(dump);

open(IN, "dump_struct");

undef($/);

$data = eval <IN>;

close(IN);

dump $data;

exit;

Here is the output of the data_slurp.pl script, in which the contents in the variable "$data" are dumped onto the output screen:

c:\ftp>data_slurp.pl

{

 array => [1 .. 10],

 hash => { apple => "red", banana => "yellow" },

 number => 42,

 string => "This is a string",

}

Python pickles its data. Here, the Python script, pickle_up.py, pickles a string variable:

#!/usr/bin/python

import pickle

pumpkin_color = "orange"

pickle.dump( pumpkin_color, open( "save.p", "wb" ) )

exit

The Python script, pickle_down.py, loads the pickle file, "save.p" and prints it to the screen.

#!/usr/bin/python

import pickle

pumpkin_color = pickle.load( open( "save.p", "rb" ) )

print(pumpkin_color)

exit

The output of the pickle_down.py script is shown here:

c:\ftp\py>pickle_down.py

orange

Where Python pickles, Ruby marshals. In Ruby, whole objects, with their encapsulated data, are marshalled into an external file and demarshalled at will. Here is a short Ruby script, object_marshal.rb, that creates a new class, "Shoestring", a new class object, "loafer", and marshals the new object into a persistent file, "output_file.per".

#!/usr/bin/ruby

class Shoestring < String

 def initialize

 @object_uuid = (`c:/cygwin64/bin/uuidgen.exe`).chomp

 end

 def object_uuid

 print @object_uuid

 end

end

loafer = Shoestring.new

output = File.open("output_file.per", "wb")

output.write(Marshal::dump(loafer))

exit

The script produces no output other than the binary file, "output_file.per". Notice that when we created the object, loafer, we included a method that encapsulates within the object a full uuid identifier, courtesy of cygwin's bundled utility, "uuidgen.exe".

We can demarshal the persistent "output_file.per" file, using the ruby script, object_demarshal.rb:

#!/usr/bin/ruby

class Shoestring < String

 def initialize

 @object_uuid = `c:/cygwin64/bin/uuidgen.exe`.chomp

 end

 def object_uuid

 print @object_uuid

 end

end

array = []

$/=" "

out = File.open("output_file.per", "rb").each do

 |object|

 array << Marshal::load(object)

 array.each do

 |object|

 puts object.object_uuid

 puts object.class

 puts object.class.superclass

 end

end

exit

The Ruby script, object_demarshal.rb, pulls the data object from the persistent file, "output_file.per" and directs Ruby to list the uuid for the object, the class of the object, and the superclass of the object.

c:\ftp>object_demarshal.rb

c2ace515-534f-411c-9d7c-5aef60f8c72a

Shoestring

String
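Python's pickle module offers much the same facility for whole objects: a class instance, together with its encapsulated data, can be serialized and later restored intact. Here is a rough Python parallel to the Ruby marshalling scripts; it is a sketch only, with the class and its uuid attribute modeled loosely on the Shoestring example:

```python
import pickle
import uuid

class Shoestring(str):
    """A toy subclass of str whose instances encapsulate a uuid identifier."""
    def __init__(self, *args):
        self.object_uuid = str(uuid.uuid4())

loafer = Shoestring()
blob = pickle.dumps(loafer)    # serialize the whole object, uuid and all

restored = pickle.loads(blob)  # rebuild an equivalent object
print(restored.object_uuid)              # same uuid as the original object
print(type(restored).__name__)           # Shoestring
print(type(restored).__base__.__name__)  # str
```

As with the Ruby version, the restored object reports its identifier, its class, and its superclass; pickle preserves the instance's attributes without rerunning the constructor.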

Perl, Python, and Ruby all have access to external database modules that can build database objects that exist as external files that persist after the script has executed. These database objects can be called from any script, with the contained data accessed quickly, with a simple command syntax.7

Here is a Perl script, lucy.pl, that creates an associative array and ties it to an external database file, using the SDBM_file (Simple Database Management File) module.

#!/usr/local/bin/perl

use Fcntl;

use SDBM_File;

tie %lucy_hash, "SDBM_File", 'lucy', O_RDWR|O_CREAT|O_EXCL, 0644;

$lucy_hash{"Fred Mertz"} = "Neighbor";

$lucy_hash{"Ethel Mertz"} = "Neighbor";

$lucy_hash{"Lucy Ricardo"} = "Star";

$lucy_hash{"Ricky Ricardo"} = "Band leader";

untie %lucy_hash;

exit;

The lucy.pl script produces a persistent, external file from which any Perl script can access the associative array created in the prior script. If we look in the directory from which the lucy.pl script was launched, we will find two new files, lucy.dir and lucy.pag. These are the persistent files that will substitute for the %lucy_hash associative array when invoked within other Perl scripts.

Here is a short Perl script, lucy_untie.pl, that extracts the persistent %lucy_hash associative array from the SDBM file in which it is stored:

#!/usr/local/bin/perl

use Fcntl;

use SDBM_File;

tie %lucy_hash, "SDBM_File", 'lucy', O_RDWR, 0644;

while(($key, $value) = each (%lucy_hash))

 {

 print "$key => $value ";

 }

untie %lucy_hash;

exit;

Here is the output of the lucy_untie.pl script:

c:\ftp>lucy_untie.pl

Fred Mertz => Neighbor

Ethel Mertz => Neighbor

Lucy Ricardo => Star

Ricky Ricardo => Band leader

Here is the Python script, lucy.py, that creates a tiny external database.

#!/usr/local/bin/python

import dumbdbm

lucy_hash = dumbdbm.open('lucy', 'c')

lucy_hash["Fred Mertz"] = "Neighbor"

lucy_hash["Ethel Mertz"] = "Neighbor"

lucy_hash["Lucy Ricardo"] = "Star"

lucy_hash["Ricky Ricardo"] = "Band leader"

lucy_hash.close()

exit

Here is the Python script, lucy_untie.py, that reads all of the key, value pairs held in the persistent database created for the lucy_hash dictionary object.

#!/usr/local/bin/python

import dumbdbm

lucy_hash = dumbdbm.open('lucy')

for character in lucy_hash.keys():

 print character, lucy_hash[character]

lucy_hash.close()

exit

Here is the output produced by the Python script, lucy_untie.py.

c:\ftp>lucy_untie.py

Fred Mertz Neighbor

Ethel Mertz Neighbor

Lucy Ricardo Star

Ricky Ricardo Band leader

Ruby can also hold data in a persistent database, using the gdbm module. If you do not have the gdbm (GNU database manager) module installed in your Ruby distribution, you can install it as a Ruby GEM, using the following command line, from the system prompt:

c:\>gem install gdbm

The Ruby script, lucy.rb, creates an external database file, lucy.db:

#!/usr/local/bin/ruby

require 'gdbm'

lucy_hash = GDBM.new("lucy.db")

lucy_hash["Fred Mertz"] = "Neighbor"

lucy_hash["Ethel Mertz"] = "Neighbor"

lucy_hash["Lucy Ricardo"] = "Star"

lucy_hash["Ricky Ricardo"] = "Band leader"

lucy_hash.close

exit

The Ruby script, lucy_untie.rb, reads the associative array stored in the persistent database, lucy.db:

#!/usr/local/bin/ruby

require 'gdbm'

gdbm = GDBM.new("lucy.db")

gdbm.each_pair do |name, role|

 print "#{name}: #{role} "

end

gdbm.close

exit

The output from the lucy_untie.rb script is:

c:\ftp>lucy_untie.rb

Ethel Mertz: Neighbor

Lucy Ricardo: Star

Ricky Ricardo: Band leader

Fred Mertz: Neighbor

Persistence is a simple and fundamental process ensuring that data created in your scripts can be recalled by yourself or by others who need to verify your results. Regardless of the programming language you use, or the data structures you prefer, you will need to familiarize yourself with at least one data persistence technique.

SQLite Databases

For industrial strength persistence, providing storage for millions or billions of data objects, database applications are a good choice. SQL (Structured Query Language, pronounced like "sequel") is a specialized language used to query relational databases. SQL allows programmers to connect with large, complex server-based network databases. A high level of expertise is needed to install and implement the software that creates server-based relational databases responding to multiuser client-based SQL queries. Fortunately, Perl, Ruby, and Python all have easy access to SQLite, a free and widely available spin-off of SQL.7 The source code for SQLite is public domain (see Glossary item, Public domain).

SQLite is bundled into the newer distributions of Python, and can be called from Python scripts with an "import sqlite3" command. Here is a Python script, sqlite6.py, that reads a very short dictionary into an SQL database.

#!/usr/local/bin/python

import sqlite3

from sqlite3 import dbapi2 as sqlite

import string, re, os

mesh_hash = {}

entry = ()

mesh_hash["Fred Mertz"] = "Neighbor"

mesh_hash["Ethel Mertz"] = "Neighbor"

mesh_hash["Lucy Ricardo"] = "Star"

mesh_hash["Ricky Ricardo"] = "Band leader"

con=sqlite.connect('test1.db')

cur=con.cursor()

cur.executescript("""

 create table mesh

 (

    name varchar (64),

    term varchar(64)

 );

 """)

for key, value in mesh_hash.iteritems():

 entry = (key, value)

 cur.execute("insert into mesh (name, term) values (?, ?)", entry)

con.commit()

exit

Once created, entries in the SQL database file, test1.db, can be retrieved, as shown in the Python script, sqlite6_read.py:

#!/usr/local/bin/python

import sqlite3

from sqlite3 import dbapi2 as sqlite

import string, re, os

con=sqlite.connect('test1.db')

cur=con.cursor()

cur.execute("select * from mesh")

for row in cur:

 print row[0], row[1]

exit

Here is the output of the sqlite6_read.py script:

c:\ftp>sqlite6_read.py

Fred Mertz Neighbor

Ethel Mertz Neighbor

Lucy Ricardo Star

Ricky Ricardo Band leader

SQLite comes bundled in several of the newer Perl distributions (eg, Strawberry Perl) and in some distributions of Cygwin, and it is available via CPAN (see Glossary item, CPAN). A sample Perl script, perl_sqlite_in.pl, creating SQLite database records for a small associative array, is shown:

#!/usr/local/bin/perl

use DBI;

$mesh_hash{"Fred Mertz"} = "Neighbor";

$mesh_hash{"Ethel Mertz"} = "Neighbor";

$mesh_hash{"Lucy Ricardo"} = "Star";

$mesh_hash{"Ricky Ricardo"} = "Band leader";

my $dbh = DBI->connect("dbi:SQLite:dbname=dbfile","","");

my $sth = $dbh->prepare("CREATE TABLE mesh (number VARCHAR(64), term VARCHAR(64))");

$sth->execute;

$sth = $dbh->prepare("INSERT INTO mesh (number,term) VALUES(?,?)");

$dbh->do( "BEGIN TRANSACTION");

while ((my $key, my $value) = each(%mesh_hash))

 {

 $sth->execute( $key, $value );

 }

$dbh->do( "COMMIT" );

exit;

The Perl script, perl_sqlite_out.pl, retrieves the records created by the perl_sqlite_in.pl script:

#!/usr/local/bin/perl

use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=dbfile","","");

$sth = $dbh->prepare("SELECT number, term FROM mesh");

$sth->execute;

while (@row = $sth->fetchrow_array())

 {

 print "@row ";

 }

exit;

Here is the output of perl_sqlite_out.pl:

c:\ftp>perl_sqlite_out.pl

Ricky Ricardo Band leader

Lucy Ricardo Star

Fred Mertz Neighbor

Ethel Mertz Neighbor

Ruby users must first install SQLite on their computer, and then install the Ruby interface to SQLite, available as a Ruby gem (see Glossary item, Ruby gem), as shown:

c:\>gem install sqlite3

The Ruby script, ruby_sqlite_in.rb, calls the installed sqlite3 interface Gem, and creates an SQLite database:

#!/usr/local/bin/ruby

require 'sqlite3'

db = SQLite3::Database.new( "test.db" )

db_hash = Hash.new()

db_hash["Fred Mertz"] = "Neighbor"

db_hash["Ethel Mertz"] = "Neighbor"

db_hash["Lucy Ricardo"] = "Star"

db_hash["Ricky Ricardo"] = "Band leader"

sql = <<SQL

 create table mesh (

   a varchar2(64),

   b varchar2(64)

  );

SQL

db.execute_batch( sql )

db.transaction

db_hash.each {|k,v| db.execute("insert into mesh values (?,?)", k,v)}

db.commit

exit

The resulting database is an external file, named "test.db". The data in the external file can be read out, using the ruby_sqlite_out.rb script:

#!/usr/local/bin/ruby

require 'sqlite3'

db = SQLite3::Database.new( "test.db" )

db.execute("select * from mesh") do

 |row|

 puts row[0] + " " + row[1]

end

exit

Here is the familiar output:

c:\ftp>ruby_sqlite_out.rb

Fred Mertz Neighbor

Ethel Mertz Neighbor

Lucy Ricardo Star

Ricky Ricardo Band leader

Databases, such as SQLite, are a great way to achieve data persistence, if you are adept at programming in SQL, and if you need to store millions of simple data objects. Otherwise, persistence methods that are native to your favorite programming language provide a simpler, more flexible option.

Glossary

Abandonware Software that is abandoned (eg, no longer updated, supported, distributed, or sold) after its economic value is depleted. In academic circles, the term is often applied to software that is developed under a research grant. When the grant expires, so does the software. Most of the software in existence today is abandonware.

Abstraction In the context of object-oriented programming, abstraction is a technique whereby a method is simplified to a generalized form that is applicable to a wide range of objects, but for which the specific characteristics of the object receiving the method may be used to return a result that is suited to the object. Abstraction, along with polymorphism, encapsulation, and inheritance, are essential features of object-oriented programming languages. See Polymorphism. See Inheritance. See Encapsulation. See Object-oriented programming language.

Associative array A data structure consisting of an unordered list of key/value data pairs. Also known as hash, hash table, map, symbol table, dictionary, or dictionary array. The proliferation of synonyms suggests that associative arrays, or their computational equivalents, have great utility. Associative arrays are used in Perl, Python, Ruby and most modern programming languages. Here is an example in which an associative array (ie, a member of Class Hash) is created in Ruby. The first line of the script creates a new associative array, named my_hash. The next two lines create two key/value elements for the associative array (C05/Albumin and C39/Choline). The next line instructs ruby to print out the elements in the my_hash associative array. Here is the output of the short ruby script.

#!/usr/local/bin/ruby
my_hash = Hash.new
my_hash["C05"] = "Albumin"
my_hash["C39"] = "Choline"
my_hash.each {|key,value| STDOUT.print(key, " --- ", value, "\n")}
exit
Output: C05 --- Albumin
C39 --- Choline
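For comparison, the equivalent structure in Python is the dictionary; this sketch parallels the Ruby example:

```python
# A Python dictionary: the same associative-array structure as the Ruby Hash
my_hash = {}
my_hash["C05"] = "Albumin"
my_hash["C39"] = "Choline"
for key, value in my_hash.items():
    print(key, "---", value)
# C05 --- Albumin
# C39 --- Choline
```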

Autocoding When nomenclature coding is done automatically, by a computer program, the process is known as "autocoding" or "autoencoding." See Coding. See Nomenclature. See Autoencoding.

Autoencoding Synonym for autocoding. See Autocoding.

Blended class Also known as class noise, subsumes the more familiar, but less precise term, "labeling error." Blended class refers to inaccuracies (eg, misleading results) introduced in the analysis of data due to errors in class assignments (ie, assigning a data object to class A when the object should have been assigned to class B). If you are testing the effectiveness of an antibiotic on a class of people with bacterial pneumonia, the accuracy of your results will be forfeit when your study population includes subjects with viral pneumonia, or smoking-related lung damage. Errors induced by blending classes are often overlooked by data analysts who incorrectly assume that the experiment was designed to ensure that each data group is composed of a uniform and representative population. A common source of class blending occurs when the classification upon which the experiment is designed is itself blended. For example, imagine that you are a cancer researcher and you want to perform a study of patients with malignant fibrous histiocytomas (MFH), comparing the clinical course of these patients with the clinical course of patients who have other types of tumors. Let's imagine that the class of tumors known as MFH does not actually exist; that it is a grab-bag term erroneously assigned to a variety of other tumors that happened to look similar to one another. This being the case, it would be impossible to produce any valid results based on a study of patients diagnosed as having MFH. The results would be a biased and irreproducible cacophony of data collected across different, and undetermined, classes of tumors. Believe it or not, this specific example, of the blended MFH class of tumors, is selected from the real-life annals of tumor biology.8,9 The literature is rife with research of dubious quality, based on poorly designed classifications and blended classes. A detailed discussion of this topic is found in Section 6.5, Properties that Cross Multiple Classes.
One caveat. Efforts to eliminate class blending can be counterproductive if undertaken with excess zeal. For example, in an effort to reduce class blending, a researcher may choose groups of subjects who are uniform with respect to every known observable property. Suppose you want to actually compare apples with oranges. To avoid class blending, you might want to make very sure that your apples do not include any kumquats or persimmons. You should be certain that your oranges do not include any limes or grapefruits. Imagine that you go even further, choosing only apples and oranges of one variety (eg, Macintosh apples and navel oranges), size (eg, 10 cm), and origin (eg, California). How will your comparisons apply to the varieties of apples and oranges that you have excluded from your study? You may actually reach conclusions that are invalid and irreproducible for more generalized populations within each class. In this case, you have succeeded in eliminating class blending, at the expense of losing representative populations of the classes. See Simpson's paradox.

CPAN The Comprehensive Perl Archive Network, known as CPAN, holds nearly 154,000 Perl packages, with over 12,000 contributors. These packages greatly extend the functionality of Perl, and include virtually every type of Perl method imaginable (eg, math, statistics, communications, plotting, numerical analyses). Any CPAN Perl package can be easily downloaded and automatically installed on your computer's Perl directory when you use the CPAN installer. For instructions, see Open Source Tools for Chapter 1. You can search the multitude of Perl modules to your heart's content at: https://metacpan.org/.

Child class The direct or first generation subclass of a class. Sometimes referred to as the daughter class or, less precisely, as the subclass. See Parent class. See Classification.

Class A class is a group of objects that share a set of properties that define the class and that distinguish the members of the class from members of other classes. The word "class," lowercase, is used as a general term. The word "Class," uppercase, followed by an uppercase noun (eg, Class Animalia), represents a specific class within a formal classification. See Classification.

Class-oriented programming A type of object-oriented programming for which all object instances and all object methods must belong to a class. Hence, in a class-oriented programming language, any new methods and instances that do not sensibly fall within an existing class must be accommodated with a newly created subclass. All invocations of methods, even those sent directly to a class instance, are automatically delivered to the class containing the instance. Class-oriented programming languages embody a specified representation of the real world in which all objects reside within defined classes. Important features such as method inheritance (through class lineage), and introspection (through object and class identifiers) can be very simply implemented in class-oriented programming languages. Powerful scripts can be written with just a few short lines of code, using class-oriented programming languages, by invoking the names of methods inherited by data objects assigned to classes. More importantly, class-oriented languages provide an easy way to discover and test relationships among objects. Ruby and Python are examples of two object-oriented languages that could support a pure class-oriented approach to programming, by deliberately assigning all objects and methods to a hierarchical class system. Of the two languages, Ruby seems to be better suited to a pure class-oriented approach, as it comes with a built-in class system that is intended to accommodate additional subclassing. Nonetheless, both languages give programmers the flexibility to either permit or to circumvent a purely class-oriented approach. Perhaps Smalltalk is the language which comes closest to being a purely class-oriented language.10 As with every technical advance, there are some pitfalls that users should understand. See Inheritance. See Introspection. See Class. See Instance. See Data object. See Object relationships.

Classification A system in which every object in a knowledge domain is assigned to a class within a hierarchy of classes. The properties of superclasses are inherited by the subclasses. Every class has one immediate superclass (ie, parent class), although a parent class may have more than one immediate subclass (ie, child class). Objects do not change their class assignment in a classification, unless there was a mistake in the assignment. For example, a rabbit is always a rabbit, and does not change into a tiger. Classifications can be thought of as the simplest and most restrictive type of ontology, and serve to reduce the complexity of a knowledge domain.11 Classifications can be easily modeled in an object-oriented programming language and are nonchaotic (ie, calculations performed on the members and classes of a classification should yield the same output, each time the calculation is performed). A classification should be distinguished from an ontology. In an ontology, a class may have more than one parent class and an object may be a member of more than one class. A classification can be considered a special type of ontology wherein each class is limited to a single parent class and each object has membership in one and only one class. See Nomenclature. See Thesaurus. See Vocabulary. See Dictionary. See Terminology. See Ontology. See Parent class. See Child class. See Superclass. See Unclassifiable objects.

Coding The term "coding" has three very different meanings; depending on which branch of science influences your thinking. For programmers, coding means writing the code that constitutes a computer programmer. For cryptographers, coding is synonymous with encrypting (ie, using a cipher to encode a message). For medics, coding is calling an emergency team to handle a patient in extremis. For informaticians and library scientists, coding involves assigning an alphanumeric identifier, representing a concept listed in a nomenclature, to a term. For example, a surgical pathology report may include the diagnosis, Adenocarcinoma of prostate. A nomenclature may assign a code C4863000 that uniquely identifies the concept "Adenocarcinoma." Coding the report may involve annotating every occurrence of the work "Adenocarcinoma" with the "C4863000" identifier. For a detailed explanation of coding, and its importance for searching and retrieving data, see the full discussion in Section 3.4, Autoencoding and Indexing with Nomenclatures. See Autocoding. See Nomenclature.

Data archeology The process of recovering information held in abandoned or unpopular physical storage devices, or packaged in formats that are no longer widely recognized, and hence unsupported by most software applications. The definition encompasses truly ancient data, such as cuneiform inscriptions stored on clay tablets c.3300 BCE, and digital data stored on 5.25-in. floppy disks in XyWrite word processor format, c.1994.

Data fusion Data fusion is very closely related to data integration. The subtle difference between the two concepts lies in the end result. Data fusion creates a new and accurate set of data representing the combined data sources. Data integration is an on-the-fly usage of data pulled from different domains and, as such, does not yield a residual fused set of data.

Data integration The process of drawing data from different sources and knowledge domains in a manner that uses and preserves the identities of data objects and the relationships among the different data objects. The term "integration" should not be confused with a closely related term, "interoperability." An easy way to remember the difference is to note that integration applies to data; interoperability applies to software.

Data merging A nonspecific term that includes data fusion, data integration, and any methods that facilitate the accrual of data derived from multiple sources. See Data fusion. See Data Integration.

Data object A data object is whatever is being described by the data. For example, if the data is "6-feet tall," then the data object is the person or thing to which "6-feet tall" applies. Minimally, a data object is a metadata/data pair, assigned to a unique identifier (ie, a triple). In practice, the most common data objects are simple data records, corresponding to a row in a spreadsheet or a line in a flat-file. Data objects in object-oriented programming languages typically encapsulate several items of data, including an object name, an object unique identifier, multiple data/metadata pairs, and the name of the object's class. See Triple. See Identifier. See Metadata.
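As a sketch, a minimal data object of this kind can be modeled in Python as a collection of triples that share one unique identifier; the metadata and data values here are invented for illustration:

```python
import uuid

object_id = str(uuid.uuid4())  # the data object's unique identifier

# A data object as a list of triples: (identifier, metadata, data)
data_object = [
    (object_id, "object_name", "Fred Mertz"),
    (object_id, "height", "6 feet"),
    (object_id, "instance_of", "Class Person"),
]

# Every triple belonging to the object carries the same identifier
print(all(identifier == object_id
          for identifier, _, _ in data_object))  # True
```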

Data repurposing Involves using old data in new ways, that were not foreseen by the people who originally collected the data. Data repurposing comes in the following categories: (1) Using the preexisting data to ask and answer questions that were not contemplated by the people who designed and collected the data; (2) Combining preexisting data with additional data, of the same kind, to produce aggregate data that suits a new set of questions that could not have been answered with any one of the component data sources; (3) Reanalyzing data to validate assertions, theories, or conclusions drawn from the original studies; (4) Reanalyzing the original data set using alternate or improved methods to attain outcomes of greater precision or reliability than the outcomes produced in the original analysis; (5) Integrating heterogeneous data sets (ie, data sets with seemingly unrelated types of information), for the purpose an answering questions or developing concepts that span diverse scientific disciplines; (6) Finding subsets in a population once thought to be homogeneous; (7) Seeking new relationships among data objects; (8) Creating, on-the-fly, novel data sets through data file linkages; (9) Creating new concepts or ways of thinking about old concepts, based on a re-examination of data; (10) Fine-tuning existing data models; and (11) Starting over and remodeling systems.1 See Heterogeneous data.

Database A software application designed specifically to create and retrieve large numbers of data records (eg, millions or billions). The data records of a database are persistent, meaning that the application can be turned off, then on, and all the collected data will be available to the user (see Open Source Tools).

Dictionary A terminology or word list accompanied by a definition for each item. See Nomenclature. See Vocabulary. See Terminology.

Encapsulation The concept, from object-oriented programming, that a data object contains its associated data. Encapsulation is tightly linked to the concept of introspection, the process of accessing the data encapsulated within a data object. Encapsulation, Inheritance, and Polymorphism are available features of all object-oriented languages. See Inheritance. See Polymorphism.
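A minimal Python sketch of encapsulation, using a hypothetical class and invented values: the object contains its associated data, and introspection retrieves that data from the object itself:

```python
class Patient:
    """A toy class whose instances encapsulate their own data."""
    def __init__(self, name, glucose):
        self.name = name        # encapsulated data
        self.glucose = glucose  # encapsulated data

p = Patient("Fred Mertz", 95)

# Introspection: ask the object for its encapsulated data
print(vars(p))             # {'name': 'Fred Mertz', 'glucose': 95}
print(getattr(p, "name"))  # Fred Mertz
```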

HTML HyperText Markup Language is an ASCII-based set of formatting instructions for web pages. HTML formatting instructions, known as tags, are embedded in the document, enclosed in angle brackets, with paired opening and closing tags marking the start and end points of each instruction. Here is an example of an HTML tag instructing the web browser to display the word "Hello" in italics: <i>Hello</i>. All web browsers conforming to the HTML specification must contain software routines that recognize and implement the HTML instructions embedded within web documents. In addition to formatting instructions, HTML also includes linkage instructions, in which the web browsers must retrieve and display a listed web page, or a web resource, such as an image. The protocol whereby web browsers, following HTML instructions, retrieve web pages from other internet sites, is known as HTTP (HyperText Transfer Protocol).

Heterogeneous data Two sets of data are considered heterogeneous when they are dissimilar to one another, with regard to content, purpose, format, organization, or annotations. One of the purposes of data science is to discover relationships among heterogeneous data sources. For example, epidemiologic data sets may be of service to molecular biologists who have gene sequence data on diverse human populations. The epidemiologic data is likely to contain different types of data values, annotated and formatted in a manner different from the data and annotations in a gene sequence database. The two types of related data, epidemiologic and genetic, have dissimilar content; hence they are heterogeneous to one another.

Identification The process of providing a data object with an identifier, or the process of distinguishing one data object from all other data objects on the basis of its associated identifier. See Identifier.

Identifier A string that is associated with a particular thing (eg, person, document, transaction, data object), and not associated with any other thing.12 Object identification usually involves permanently assigning a seemingly random sequence of numeric digits (0–9) and alphabet characters (a–z and A–Z) to a data object. A data object can be a specific piece of data (eg, a data record), or an abstraction, such as a class of objects or a number or a string or a variable. See Identification.

Immutability Permanent data that cannot be modified is said to be immutable. At first thought, it would seem that immutability is a ridiculous and impossible constraint. In the real world, mistakes are made, information changes, and the methods for describing information change. This is all true, but the astute data manager knows how to accrue information into data objects without changing the preexisting data. In practice, immutability is maintained by timestamping all data and storing annotated data values with any and all subsequent timestamped modifications. For a detailed explanation, see Section 5.6, Timestamps, Signatures, and Event Identifiers.

Inheritance In object-oriented languages, data objects (ie, classes and object instances of a class) inherit the methods (eg, functions and subroutines) created for the ancestral classes in their lineage. See Abstraction. See Polymorphism. See Encapsulation.

Instance An instance is a specific example of an object that is not itself a class or group of objects. For example, Tony the Tiger is an instance of the tiger species. Tony the Tiger is a unique animal and is not itself a group of animals or a class of animals. The terms instance, instance object, and object are sometimes used interchangeably, but the special value of the "instance" concept, in a system wherein everything is an object, is that it distinguishes members of classes (ie, the instances) from the classes to which they belong.

Introspection A method by which data objects can be interrogated to yield information about themselves (eg, properties, values, and class membership). Through introspection, the relationships among the data objects can be examined. Introspective methods are built into object-oriented languages. The data provided by introspection can be applied, at run-time, to modify a script's operation; a technique known as reflection. Specifically, any properties, methods, and encapsulated data of a data object can be used in the script to modify the script's run-time behavior. See Reflection.
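
To make the idea concrete, here is a minimal sketch of introspection in Python; the Patient class and its property are invented for this example:

```python
class Patient:
    """A simple data object carrying one item of encapsulated data."""
    def __init__(self, name):
        self.name = name

obj = Patient("Claude Funston")

# The object can be interrogated about itself at run-time:
print(type(obj).__name__)        # class membership: Patient
print(isinstance(obj, Patient))  # class test: True
print(vars(obj))                 # encapsulated data: {'name': 'Claude Funston'}
```

The same built-in calls (type, isinstance, vars, dir) work on any Python object, which is what makes introspection a general-purpose tool rather than a per-class feature.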

Marshalling Marshalling, like serializing, is a method for achieving data persistence (ie, saving variables and other data structures produced in a program, after the program has stopped running). Marshalling methods preserve data objects, with their encapsulated data and data structures. See Persistence. See Serializing.

Meaning In informatics, meaning is achieved when described data is bound to a unique identifier of a data object. "Claude Funston's height is 5 feet 11 inches," comes pretty close to being a meaningful statement. The statement contains data (5 feet 11 inches), and the data is described (height). The described data belongs to a unique object (Claude Funston). Ideally, the name "Claude Funston" should be provided with a unique identifier, to distinguish one instance of Claude Funston from all the other persons who are named Claude Funston. The statement would also benefit from a formal system that ensures that the metadata makes sense (eg, What exactly is height, and does Claude Funston fall into a class of objects for which height is a property?) and that the data is appropriate (eg, Is 5 feet 11 inches an allowable measure of a person's height?). A statement with meaning does not need to be a true statement (eg, The height of Claude Funston was not 5 feet 11 inches when Claude Funston was an infant). See Semantics. See Triple. See RDF.

Metadata The data that describes data. For example, a data element (also known as data point) may consist of the number, "6." The metadata for the data may be the words "Height, in feet." A data element is useless without its metadata, and metadata is useless unless it adequately describes a data element. In XML, the metadata/data annotation comes in the form <metadata tag>data</metadata tag> and might look something like:

<weight_in_pounds>150</weight_in_pounds>

In spreadsheets, the data elements are the cells of the spreadsheet. The column headers are the metadata that describe the data values in the column's cells, and the row headers are the record numbers that uniquely identify each record (ie, each row of cells). See XML.

Monte Carlo simulation Monte Carlo simulations were introduced in 1946 by John von Neumann, Stan Ulam and Nick Metropolis.13 For this technique, the computer generates random numbers and uses the resultant values to simulate repeated trials of a probabilistic event. Monte Carlo simulations can easily simulate various processes (eg, Markov models and Poisson processes) and can be used to solve a wide range of problems, discussed in detail in Section 8.2. The Achilles heel of the Monte Carlo simulation, when applied to enormous sets of data, is that so-called random number generators may introduce periodic (nonrandom) repeats over large stretches of data.14 What you thought was a fine Monte Carlo simulation, based on small data test cases, may produce misleading results for large data sets. The wise data analyst will avail himself of the best possible random number generator, and will test his outputs for randomness (see Open Source Tools for Chapter 5, Pseudorandom number generators). Various tests of randomness are available.15,16
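
As a minimal illustration of the technique, the following Python sketch estimates pi by simulating random dart throws at a unit square; the function name and seed are arbitrary choices for this example:

```python
import random

def estimate_pi(trials, seed=0):
    """Fraction of random points in the unit square that land
    inside the quarter circle, scaled up to estimate pi."""
    rng = random.Random(seed)  # a seeded pseudorandom number generator
    hits = sum(1 for _ in range(trials)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / trials

print(estimate_pi(100000))  # approximately 3.14
```

Note that the quality of the estimate rests entirely on the quality of the pseudorandom number generator, which is exactly the Achilles heel described above.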

Multiclass classification A misnomer imported from the field of machine learning, and indicating the assignment of an instance to more than one class. Classifications, as defined in this book, impose one-class classification (ie, an instance can be assigned to one and only one class). It is tempting to think that a ball should be included in class "toy" and in class "spheroids," but multiclass assignments create unnecessary classes of inscrutable provenance, and taxonomies of enormous size, consisting largely of replicate items. See Multiclass inheritance. See Taxonomy.

Multiclass inheritance In ontologies, multiclass inheritance occurs when a child class has more than one parent class. For example, a member of Class House may have two different parent classes: Class Shelter and Class Property. Multiclass inheritance is generally permitted in ontologies but is forbidden in one type of restrictive ontology, known as a classification. See Classification. See Parent class. See Multiclass classification.
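
The House example can be sketched in Python, which permits multiclass (multiple) inheritance; the classes and methods here are hypothetical:

```python
class Shelter:
    def protects_from_rain(self):
        return True

class Property:
    def is_taxable(self):
        return True

class House(Shelter, Property):  # two parent classes
    pass

h = House()
# House inherits methods from both of its parent classes:
print(h.protects_from_rain(), h.is_taxable())  # True True
```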

Nomenclature A nomenclature is a listing of terms that cover all of the concepts in a knowledge domain. A nomenclature is different from a dictionary for three reasons: (1) the nomenclature terms are not annotated with definitions, (2) nomenclature terms may be multiword, and (3) the terms in the nomenclature are limited to the scope of the selected knowledge domain. In addition, most nomenclatures group synonyms under a group code. For example, a food nomenclature might collect submarine, hoagie, po' boy, grinder, hero, and torpedo under an alphanumeric code such as "F63958." Nomenclatures simplify textual documents by uniting synonymous terms under a common code. Documents that have been coded with the same nomenclature can be integrated with other documents that have been similarly coded, and queries conducted over such documents will yield the same results, regardless of which term is entered (ie, a search for either hoagie, or po' boy will retrieve the same information, if both terms have been annotated with the synonym code, "F63958"). Optimally, the canonical concepts listed in the nomenclature are organized into a hierarchical classification.17,18 See Coding. See Autocoding.

Notation 3 Also called n3. A syntax for expressing assertions as triples (unique subject + metadata + data). Notation 3 expresses the same information as the more formal RDF syntax, but n3 is easier for humans to read.19 RDF and n3 are interconvertible, and either one can be parsed and equivalently tokenized (ie, broken into elements that can be reorganized in a different format, such as a database record). See RDF. See Triple.
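
For illustration, here is a single hypothetical assertion expressed in n3; the prefix, identifier, and values are invented for this sketch:

```
@prefix ex: <http://example.org/terms#> .

ex:patient_35987  ex:height_in_inches  "71" .
```

The three elements on the assertion line are the triple: the unique subject (ex:patient_35987), the metadata (ex:height_in_inches), and the data ("71").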

Object relationships We are raised to believe that science explains how the universe, and everything in it, works. Engineering and the other applied sciences use scientific explanations to create things, for the betterment of our world. This is a lovely way to think about the roles played by scientists and engineers, but it is not completely accurate. For the most part, we cannot understand very much about the universe. Nobody understands the true nature of gravity, or mass, or light, or magnetism, or atoms, or thought. We do know a great deal about the relationships between gravity and mass, mass and energy, energy and light, light and magnetism, atoms and mass, thought and neurons, and so on. Karl Pearson, a 19th-century statistician and philosopher, wrote that "All science is description and not explanation." Pearson was admitting that we can describe relationships, but we cannot explain why those relationships are true. Here is an example of a mathematical relationship that we know to be true, but which defies our understanding. The constant pi is the ratio of the circumference of a circle to its diameter. Furthermore, pi figures into the Gaussian statistical distribution (ie, that describes how a normal population is spread). How is it possible that a number that determines the distribution of a population can also determine the diameter of a circle?20 The relationships are provable and undeniable, but the full meaning of pi is beyond our grasp. In essence, all of science can be reduced to understanding object relationships.

Object-oriented programming In object-oriented programming, all data objects must belong to one of the classes built into the language or to a class created by the programmer. Class methods are subroutines that belong to a class or instance. The members of a class have access to all of the class methods. There is a hierarchy of classes (with superclasses and subclasses). A data object can access any method from any superclass of its class. All object-oriented programming languages operate under this general strategy. The two most important differences among the object-oriented programming languages relate to syntax (ie, the required style in which data objects call their available methods) and content (the built-in classes and methods available to objects). Various esoteric issues, such as types of polymorphism allowed by the language, support for multiparental inheritance, and non-Boolean logic operations may influence which language is best suited for a specific project. See Data object. See Class-oriented programming.
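
The general strategy can be sketched in Python; the class names below are invented for this example:

```python
class Animal:                # a superclass
    def breathe(self):
        return "breathing"

class Mammal(Animal):        # a subclass of Animal
    def nurse(self):
        return "nursing"

class Dog(Mammal):           # a subclass of Mammal
    pass

rex = Dog()
# A data object can access any method from any superclass of its class:
print(rex.breathe(), rex.nurse())  # breathing nursing
```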

Ontology An ontology is a collection of classes and their relationships to one another. Ontologies are usually rule-based systems (ie, membership in a class is determined by one or more class rules). Two important features distinguish ontologies from classifications. Ontologies permit classes to have more than one parent class and more than one child class. For example, the class of automobiles may be a direct subclass of "motorized devices" and a direct subclass of "mechanized transporters." In addition, an instance of a class can be an instance of any number of additional classes. For example, a Lamborghini may be a member of class "automobiles" and of class "luxury items." This means that the lineage of an instance in an ontology can be highly complex, with a single instance occurring in multiple classes, and with many connections between classes. Because recursive relations are permitted, it is possible to build an ontology wherein a class is both an ancestor class and a descendant class of itself. A classification is a highly restrained ontology wherein instances can belong to only one class, and each class may have only one direct parent class. Because classifications have an enforced linear hierarchy, they can be easily modeled, and the lineage of any instance can be traced unambiguously. See Classification. See Multiclass classification. See Multiclass inheritance.

Parent class The immediate ancestor, or the next-higher class (ie, the direct superclass) of a class. For example, in the classification of living organisms, Class Vertebrata is the parent class of Class Gnathostomata. Class Gnathostomata is the parent class of Class Teleostomi. In a classification, which imposes single class inheritance, each child class has exactly one parent class; whereas one parent class may have several different child classes. Furthermore, some classes, in particular the bottom class in the lineage, have no child classes (ie, a class need not always be a superclass of other classes). A class can be defined by its properties, its membership (ie, the instances that belong to the class), and by the name of its parent class. When we list all of the classes in a classification, in any order, we can always reconstruct the complete class hierarchy, with its correct lineages and branchings, if we know the name of each class's parent class. See Instance. See Child class. See Superclass.

Persistence Persistence is the ability of data to remain available in memory or storage after the program in which the data was created has stopped executing. Databases are designed to achieve persistence. When the database application is turned off, the data remains available to the database application when it is restarted at some later time. See Database. See Marshalling. See Serializing.

Polymorphism Polymorphism is one of the constitutive properties of an object-oriented language (along with inheritance, encapsulation, and abstraction). Methods sent to object receivers have a response determined by the class of the receiving object. Hence, different objects, from different classes, receiving a call to a method of the same name, will respond differently. For example, suppose you have a method named "divide" and you send the method (ie, issue a command to execute the method) to an object of Class Bacteria and an object of Class Numerics. The Bacteria object, receiving the divide method, will try to execute it by looking for the "divide" method somewhere in its class lineage. Being a bacterium, the "divide" method may involve making a copy of the bacterium (ie, reproducing) and incrementing the number of bacteria in the population. The numeric object, receiving the "divide" method, will look for the "divide" method in its class lineage and will probably find some method that provides instructions for arithmetic division. Hence, the behavior of the class object, in response to a received method, will be appropriate for the class of the object. See Inheritance. See Encapsulation. See Abstraction.
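
The bacteria-versus-number example might be sketched in Python as follows; both classes are, of course, invented:

```python
class Bacteria:
    def __init__(self):
        self.population = 1
    def divide(self):
        # For a bacterium, "divide" means reproduce
        self.population *= 2
        return self.population

class Numeric:
    def __init__(self, value):
        self.value = value
    def divide(self, divisor):
        # For a number, "divide" means arithmetic division
        return self.value / divisor

print(Bacteria().divide())    # 2 (the population has doubled)
print(Numeric(10).divide(2))  # 5.0
```

The same method name, "divide," produces class-appropriate behavior in each receiver.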

Public domain Data that is not owned by an entity. Public domain materials include documents whose copyright terms have expired, materials produced by the federal government, materials that contain no creative content (ie, materials that cannot be copyrighted), or materials donated to the public domain by the entity that holds copyright. Public domain data can be accessed, copied, and redistributed without violating piracy laws. It is important to note that plagiarism laws and rules of ethics apply to public domain data. You must properly attribute authorship to public domain documents. If you purposely fail to attribute authorship or if you purposefully and falsely attribute authorship to the wrong person (eg, yourself), then this is unethical, and an act of plagiarism.

RDF Resource Description Framework (RDF) is a syntax in XML notation that formally expresses assertions as triples. The RDF triple consists of a uniquely identified subject plus a metadata descriptor for the data plus a data element. Triples are necessary and sufficient to create statements that convey meaning. Triples can be aggregated with other triples from the same data set or from other data sets, so long as each triple pertains to a unique subject that is identified equivalently through the data sets. Enormous data sets of RDF triples can be merged or functionally integrated with other massive or complex data resources. For a detailed discussion see Open Source Tools for Chapter 6, Syntax for Triples. See Notation 3. See Semantics. See Triple. See XML.

RDF Schema Resource Description Framework Schema (RDFS). A document containing a list of classes, their definitions, and the names of the parent class(es) for each class. In an RDF Schema, the list of classes is typically followed by a list of properties that apply to one or more classes in the Schema. To be useful, RDF Schemas are posted on the Internet, as a Web page, with a unique Web address. Anyone can incorporate the classes and properties of a public RDF Schema into their own RDF documents (public or private) by linking named classes and properties, in their RDF document, to the web address of the RDF Schema where the classes and properties are defined. See RDFS.

Reconciliation Usually refers to identifiers, and involves verifying that an object assigned a particular identifier in one information system has been provided the same identifier in some other system. For example, if I am assigned identifier 967bc9e7-fea0-4b09-92e7-d9327c405d78 in a legacy record system, I should like to be assigned the same identifier in the new record system. If that were the case, my records in both systems could be combined. If I am assigned an identifier in one system that is different from my assigned identifier in another system, then the two identifiers must be reconciled to determine that they both refer to the same unique data object (ie, me). This may involve creating a link between the two identifiers, or a new triple that establishes the equivalence of the two identifiers. Despite claims to the contrary, there is no possible way by which information systems with poor identifier systems can be sensibly reconciled. Consider this example. A hospital has two separate registry systems: one for dermatology cases and another for psychiatry cases. The hospital would like to merge records from the two services. Because of sloppy identifier practices, a sample patient has been registered 10 times in the dermatology system, and 6 times in the psychiatry system, each time with different addresses, social security numbers, birthdates, and spellings of the name. A reconciliation algorithm is applied, and one of the identifiers from the dermatology service is matched positively against one of the records from the psychiatry service. Performance studies on the algorithm indicate that the merged records have a 99.8% chance of belonging to the same patient. So what? Though the two merged identifiers correctly point to the same patient, there are 14 (9 + 5) residual identifiers for the patient still unmatched. The patient's merged record will not contain his complete clinical history.
Furthermore, in this hypothetical instance, analyses of patient population data will mistakenly attribute one patient's clinical findings to as many as 15 different patients, and the set of 15 records in the corrupted de-identified dataset may contain mixed-in information from an indeterminate number of additional patients! If the preceding analysis seems harsh, consider these words, from the Healthcare Information and Management Systems Society, "A local system with a poorly maintained or "dirty" master person index (MPI) will only proliferate and contaminate all of the other systems to which it links."21 See Social Security Number.

Reflection A programming technique wherein a computer program will modify itself, at run-time, based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection (ie, object introspection). If the information indicates that the data object belongs to a particular class of objects, then the program may call a method appropriate for the class. The program executes in a manner determined by descriptive information obtained during run-time; metaphorically reflecting upon the purpose of its computational task. See Introspection.
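
A minimal Python sketch of this technique, iterating over a mixed collection and choosing behavior based on what introspection reveals; all classes here are hypothetical:

```python
class DermatologyRecord:
    def summarize(self):
        return "skin exam"

class PsychiatryRecord:
    def summarize(self):
        return "mental status exam"

records = [DermatologyRecord(), PsychiatryRecord()]

for rec in records:
    # Introspection: ask the object for its class
    cls_name = type(rec).__name__
    # Reflection: look up a method by name at run-time and invoke it
    result = getattr(rec, "summarize")()
    print(cls_name, "->", result)
```

The loop body never names a class explicitly; its behavior is determined entirely by information acquired at run-time.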

Registrars and human authentication The experiences of registrars in U.S. hospitals serve as cautionary instruction. Hospital registrars commit a disastrous mistake when they assume that all patients wish to comply with the registration process. A patient may be highly motivated to provide false information to a registrar, or to acquire several different registration identifiers, or to seek a false registration under another person's identity (ie, commit fraud), or to forego the registration process entirely. In addition, it is a mistake to believe that honest patients are able to fully comply with the registration process. Language barriers, cultural barriers, poor memory, poor spelling, and a host of errors and misunderstandings can lead to duplicative or otherwise erroneous identifiers. It is the job of the registrar to follow hospital policies that overcome these difficulties. Registration should be conducted by a trained registrar who is well-versed in the registration policies established by the institution. Registrars may require patients to provide a full legal name, any prior held names (eg, maiden name), date of birth, and a government issue photo id card (eg, driver's license or photo id card issued by the department of motor vehicles). To be thorough, registration should require a biometric identifier (eg, fingerprints, retina scan, iris scan, voice recording, photograph). If you accept the premise that hospitals have the responsibility of knowing who it is that they are treating, then obtaining a sample of DNA from every patient, at the time of registration, is reasonable. The DNA can be used to create a unique patient profile from a chosen set of informative loci; a procedure used by the CODIS system developed for law enforcement agencies. The registrar should document any distinguishing and permanent physical features that are plainly visible (eg, scars, eye color, colobomas, tattoos). 
Neonatal and pediatric identifiers pose a special set of problems for registrars. When an individual who was born in a hospital, and provided with an identifier at birth, returns as an adult, he or she should be assigned the same identifier that was issued in the remote past. Every patient who comes for registration should be matched against a database of biometric data that does not change from birth to death (eg, fingerprints, DNA). See Social Security Number.

Ruby gem In Ruby, gems are external modules available for download from an internet server. The Ruby gem installation module comes bundled in Ruby distribution packages. Gem installations are simple, usually consisting of commands in the form, "gem install name_of_gem" invoked at the system prompt. After a gem has been installed, scripts access the gem with a "require" statement, equivalent to an "import" statement in Python or the "use" statement in Perl.

Semantics The study of meaning (Greek root, semantikos, significant meaning). In the context of data science, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (ie, making triples), combining assertions about data objects (ie, merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but most definitions for semantics are functionally equivalent to the definition offered here. Most language is unstructured and meaningless. Consider the assertion: Sam is tired. This is an adequately structured sentence with a subject, verb, and object. But what is the meaning of the sentence? There are a lot of people named Sam. Which Sam is being referred to in this sentence? What does it mean to say that Sam is tired? Is "tiredness" a constitutive property of Sam, or does it only apply to specific moments? If so, for what moment in time is the assertion, "Sam is tired" actually true? To a computer, meaning comes from assertions that have a specific, identified subject associated with some sensible piece of fully described data (metadata coupled with the data it describes). See Triple. See RDF.

Serializing Serializing is a plesionym (ie, near-synonym) for marshalling, and is a method for taking data produced within a script or program, and preserving it in an external file that can be saved when the program stops, and quickly reconstituted as needed, in the same program or in different programs. The difference, in terms of common usage, between serialization and marshalling is that serialization usually involves capturing parameters (ie, particular pieces of information), while marshalling preserves all of the specifics of a data object, including its structure, content, and code content. As you might imagine, the meaning of the terms might change depending on the programming language and the intent of the serializing and marshalling methods. See Persistence. See Marshalling.
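
A short Python sketch of serializing, using the standard pickle module; the data values are invented:

```python
import pickle

data = {"name": "Claude Funston", "height_in_inches": 71}

# Serialize: capture the data structure as a byte stream
# (in practice, the bytes would be written to an external file)
saved = pickle.dumps(data)

# Later, in the same or a different program: reconstitute the object
restored = pickle.loads(saved)
print(restored == data)  # True
```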

Simpson's paradox Occurs when a correlation that holds in two different data sets is reversed if the data sets are combined. For example, baseball player A may have a higher batting average than player B for each of two seasons, but when the data for the two seasons are combined, player B may have the higher 2-season average. Simpson's paradox is just one example of unexpected changes in outcome when variables are unknowingly hidden or blended.22
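
The batting-average reversal can be demonstrated with invented numbers:

```python
# (hits, at-bats) for two hypothetical seasons
player_a = [(4, 10), (25, 100)]   # .400, then .250
player_b = [(35, 100), (2, 10)]   # .350, then .200

def average(hits, at_bats):
    return hits / at_bats

# Player A out-hits player B in each season taken separately...
assert average(*player_a[0]) > average(*player_b[0])
assert average(*player_a[1]) > average(*player_b[1])

# ...yet player B has the higher combined average
a_all = sum(h for h, ab in player_a) / sum(ab for h, ab in player_a)
b_all = sum(h for h, ab in player_b) / sum(ab for h, ab in player_b)
print(round(a_all, 3), round(b_all, 3))  # 0.264 0.336
```

The hidden variable here is the very different number of at-bats each player accumulated in each season.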

Social Security Number The common strategy, in the U.S., of employing social security numbers as identifiers is often counterproductive, owing to entry error, mistaken memory, or the intention to deceive. Efforts to reduce errors by requiring individuals to produce their original social security cards puts an unreasonable burden on honest individuals, who rarely carry their cards, and provides an advantage to dishonest individuals, who can easily forge social security cards. Institutions that compel patients to provide a social security number have dubious legal standing. The social security number was originally intended as a device for validating a person's standing in the social security system. More recently, the purpose of the social security number has been expanded to track taxable transactions (ie, bank accounts, salaries). Other uses of the social security number are not protected by law. The Social Security Act (Section 208 of Title 42 U.S. Code 408) prohibits most entities from compelling anyone to divulge his/her social security number. Legislation or judicial action may one day stop institutions from compelling individuals to divulge their social security numbers as a condition for providing services. Prudent and forward-thinking institutions will limit their reliance on social security numbers as personal identifiers. See Registrars and human authentication.

Software agent A computer program that operates with autonomy, taking cues from the data and adjusting its behavior to perform tasks not specifically written into the program. The Internet supports software agents through the implementation of identifiers (ie, URLs) and data accessibility (ie, through TCP/IP data transport protocols). With the advent of RDF, data can be "understood," as each identified object in RDF is associated with pairs of metadata and data, and placed into defined classes. Once uniquely identified data objects are assigned into classes, it becomes relatively simple to write software agents whose activities are determined by logical inferences on retrieved data objects. See RDF. See Reflection. See URL.

Species Species is the bottom-most class of any classification or ontology. Because the species class contains the individual objects of the classification, it is the only class which is not abstract. The special significance of the species class is best exemplified in the classification of living organisms. Every species of organism contains individuals that share a common ancestral relationship to one another. When we look at a group of squirrels, we know that each squirrel in the group has its own unique personality, its own unique genes (ie, genotype), and its own unique set of physical features (ie, phenotype). Moreover, although the DNA sequences of individual squirrels are unique, we assume that there is a commonality to the genome of squirrels that distinguishes it from the genome of every other species. If we use the modern definition of species as an evolving gene pool, we see that the species can be thought of as a biological life form, with substance (a population of propagating genes), and a function (evolving to produce new species).23–25 Put simply, species speciate; individuals do not. As a corollary, species evolve; individuals simply propagate. Hence, the species class is a separable biological unit with form and function. We, as individuals, are focused on the lives of individual things, and we must be reminded of the role of species in biological and nonbiological classifications. The concept of species is discussed in greater detail in Section 6.4. See Blended class.

Superclass Any of the ancestral classes of a subclass. For example, in the classification of living organisms, the class of vertebrates is a superclass of the class of mammals. The immediate superclass of a class is its parent class. In common parlance, when we speak of the superclass of a class, we are usually referring to its parent class. See Parent class.

Syntax Syntax is the standard form or structure of a statement. What we know as English grammar is equivalent to the syntax for the English language. Charles Mead succinctly summarized the difference between syntax and semantics: "Syntax is structure; semantics is meaning."26 See Semantics.

Taxonomic order In biological taxonomy, the hierarchical lineage of organisms is divided into a descending list of named ranks: Kingdom, Phylum (Division), Class, Order, Family, Genus, and Species. As we have learned more and more about the classes of organisms, modern taxonomists have added additional ranks to the classification (eg, supraphylum, subphylum, suborder, infraclass, etc.). Was this really necessary? All of this taxonomic complexity could be averted by dropping named ranks and simply referring to every class as "Class." Modern specifications for class hierarchies (eg, RDF Schema) encapsulate each class with the name of its superclass. When every object yields its class and superclass, it is possible to trace any object's class lineage. For example, in the classification of living organisms, if you know the name of the parent for each class, you can write a simple script that generates the complete ancestral lineage for every class and species within the classification.7 See Class. See Taxonomy. See RDF Schema. See Species.
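
Such a lineage-tracing script is easy to sketch in Python; the parent table below is a small fragment, using class names from the vertebrate lineage:

```python
# Each class is encapsulated with the name of its parent class
parent = {
    "Craniata":      "Chordata",
    "Vertebrata":    "Craniata",
    "Gnathostomata": "Vertebrata",
    "Teleostomi":    "Gnathostomata",
}

def lineage(cls):
    """Walk the parent pointers up to the root of the hierarchy."""
    trail = [cls]
    while trail[-1] in parent:
        trail.append(parent[trail[-1]])
    return trail

print(lineage("Teleostomi"))
# ['Teleostomi', 'Gnathostomata', 'Vertebrata', 'Craniata', 'Chordata']
```

No named ranks are needed; the parent pointer alone reconstructs the full ancestral lineage.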

Taxonomy A taxonomy is the collection of named instances (class members) in a classification or an ontology. When you see a schematic showing class relationships, with individual classes represented by geometric shapes and the relationships represented by arrows or connecting lines between the classes, then you are essentially looking at the structure of a classification, minus the taxonomy. You can think of building a taxonomy as the act of pouring all of the names of all of the instances into their proper classes. A taxonomy is similar to a nomenclature; the difference is that in a taxonomy, every named instance must have an assigned class. See Taxonomic order.

Terminology The collection of words and terms used in some particular discipline, field, or knowledge domain. Nearly synonymous with vocabulary and with nomenclature. Vocabularies, unlike terminologies, are not confined to the terms used in a particular field. Nomenclatures, unlike terminologies, usually aggregate equivalent terms under a canonical synonym.

Thesaurus A vocabulary that groups together synonymous terms. A thesaurus is very similar to a nomenclature. There are two minor differences. Nomenclatures include multiword terms; whereas a thesaurus is typically composed of one-word terms. In addition, nomenclatures are typically restricted to a well-defined topic or knowledge domain (eg, names of stars, infectious diseases, etc.). See Nomenclature. See Vocabulary. See Classification. See Dictionary. See Terminology. See Ontology.

Triple In computer semantics, a triple is an identified data object associated with a data element and the description of the data element. In the computer science literature, the syntax for the triple is commonly described as: subject, predicate, object, wherein the subject is an identifier, the predicate is the description of the object, and the object is the data. The definition of triple, using grammatical terms, can be off-putting to the data scientist, who may think in terms of spreadsheet entries: a key that identifies the line record, a column header containing the metadata description of the data, and a cell that contains the data. In this book, the three components of a triple are described as: (1) the identifier for the data object, (2) the metadata that describes the data, and (3) the data itself. In theory, all data sets, databases, and spreadsheets can be constructed or deconstructed as collections of triples. See Introspection. See Data object. See Semantics. See RDF. See Meaning.
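The spreadsheet analogy above can be made concrete: a table deconstructs into triples, and the triples reconstruct the table. A minimal sketch, using hypothetical records:

```python
# Hypothetical records: each row has an identifier (the key) and
# metadata/data pairs (the column headers and cell values).
records = {
    "obj_001": {"name": "Conrad Nervig", "occupation": "umpire"},
    "obj_002": {"name": "Ada Lovelace", "occupation": "mathematician"},
}

# Deconstruct the records into a flat list of
# (identifier, metadata, data) triples.
triples = [(obj_id, meta, value)
           for obj_id, row in records.items()
           for meta, value in row.items()]

for t in triples:
    print(t)

# Reconstruct the original records from the triples alone.
rebuilt = {}
for obj_id, meta, value in triples:
    rebuilt.setdefault(obj_id, {})[meta] = value
assert rebuilt == records
```

The round trip illustrates the claim in the entry: nothing in the original table is lost when it is expressed as a collection of triples.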

URL Uniform Resource Locator. The Web is a collection of resources, each having a unique address, the URL. When you click on a link that specifies a URL, your browser fetches the page located at the unique location specified in the URL name. If the Web were designed otherwise (ie, if several different web pages had the same web address, or if one web address were located at several different locations), then the Web could not function with any reliability.

Unclassifiable objects Classifications create a class for every object, and taxonomies assign each and every object to its correct class. This means that a classification is not permitted to contain unclassified objects; a condition that puts fussy taxonomists in an untenable position. Suppose you have an object, and you simply do not know enough about the object to confidently assign it to a class. Or, suppose you have an object that seems to fit more than one class, and you can't decide which class is the correct class. What do you do? Historically, scientists have resorted to creating a "miscellaneous" class into which otherwise unclassifiable objects are given a temporary home, until more suitable accommodations can be provided. I have spoken with numerous data managers, and everyone seems to be of a mind that "miscellaneous" classes, created as a stopgap measure, serve a useful purpose. Not so. Historically, the promiscuous application of "miscellaneous" classes has proven to be a huge impediment to the advancement of science. In the case of the classification of living organisms, the class of protozoans stands as a case in point. Ernst Haeckel, a leading biological taxonomist in his time, created the Kingdom Protista (ie, protozoans) in 1866 to accommodate a wide variety of simple organisms with superficial commonalities. Haeckel himself understood that the protists were a blended class that included unrelated organisms, but he believed that further study would resolve the confusion. In a sense, he was right, but the process took much longer than he had anticipated; occupying generations of taxonomists over the following 150 years. Today, Kingdom Protista no longer exists. Its members have been reassigned to other classes. Nonetheless, textbooks of microbiology still describe the protozoans, just as though this name continued to occupy a legitimate place among terrestrial organisms.
In the meantime, therapeutic opportunities for eradicating so-called protozoal infections, using class-targeted agents, have no doubt been missed.27 You might think that the creation of a class of living organisms, with no established scientific relation to the real world, was a rare and ancient event in the annals of biology, having little or no chance of being repeated. Not so. A special pseudoclass of fungi, deuteromycetes (spelled with a lowercase "d," signifying its questionable validity as a true biologic class) has been created to hold fungi of indeterminate speciation. At present, there are several thousand such fungi, sitting in a taxonomic limbo, waiting to be placed into a definitive taxonomic class.28,27 See Blended class.

Vocabulary A comprehensive collection of words and their associated meanings. In some quarters, "vocabulary" and "nomenclature" are used interchangeably, but they are different from one another. Nomenclatures typically focus on terms confined to one knowledge domain. Nomenclatures typically do not contain definitions for the contained terms. Nomenclatures typically group terms by synonymy. Lastly, nomenclatures include multiword terms. Vocabularies are collections of single words, culled from multiple knowledge domains, with their definitions, and assembled in alphabetic order. See Nomenclature. See Thesaurus. See Taxonomy. See Dictionary. See Terminology.

XML Acronym for eXtensible Markup Language, a syntax for marking data values with descriptors (ie, metadata). The descriptors are commonly known as tags. In XML, every data value is enclosed by a start-tag, containing the descriptor and indicating that a value will follow, and an end-tag, containing the same descriptor and indicating that a value preceded the tag. For example: <name>Conrad Nervig</name>. The enclosing angle brackets, "<>", and the end-tag marker, "/", are hallmarks of HTML and XML markup. This simple but powerful relationship between metadata and data allows us to employ metadata/data pairs as though each were a miniature database. The semantic value of XML becomes apparent when we bind a metadata/data pair to a unique object, forming a so-called triple. See Triple. See Meaning. See Semantics. See HTML.
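The binding of a metadata/data pair to a unique object can be sketched with Python's standard library XML parser. The identifier attribute and values below are hypothetical:

```python
# A minimal sketch: an XML metadata/data pair, bound to an object
# identifier, yields an (identifier, metadata, data) triple.
import xml.etree.ElementTree as ET

xml_fragment = """
<object id="obj_001">
  <name>Conrad Nervig</name>
</object>
"""

root = ET.fromstring(xml_fragment)
for element in root:
    # element.tag is the metadata (the descriptor in the tag);
    # element.text is the data it encloses.
    triple = (root.attrib["id"], element.tag, element.text)
    print(triple)
```

Here the start-tag/end-tag pair supplies the metadata and data, and the enclosing object's identifier supplies the third member of the triple.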

References

1 Berman J.J. Repurposing legacy data: innovative case studies. Burlington, MA: Morgan Kaufmann; 2015.

2 Solar System Exploration Research Virtual Institute. NASA. Recovering the missing ALSEP data. Available from: http://sservi.nasa.gov/articles/recovering-the-missing-alsep-data/ [accessed 13.10.14].

3 Curry A. Rescue of old data offers lesson for particle physicists. Science. 2011;331:694–695.

4 Biebel O., Movilla Fernandez P.A., Bethke S., The JADE Collaboration. C-parameter and jet broadening at PETRA energies. Phys Lett. 1999;B459:326–334.

5 Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Burlington, MA: Morgan Kaufmann; 2013.

6 Conway D. Object oriented Perl: a comprehensive guide to concepts and programming techniques. Sebastopol, CA: O'Reilly; 2000.

7 Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

8 Al-Agha O.M., Igbokwe A.A. Malignant fibrous histiocytoma: between the past and the present. Arch Pathol Lab Med. 2008;132:1030–1035.

9 Nakayama R., Nemoto T., Takahashi H., Ohta T., Kawai A., Seki K., et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Mod Pathol. 2007;20:749–759.

10 Goldberg A., Robson D., Harrison M.A. Smalltalk-80: the language and its implementation. Boston, MA: Addison-Wesley; 1983.

11 Patil N., Berno A.J., Hinds D.A., Barrett W.A., Doshi J.M., Hacker C.R., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001;294:1719–1723.

12 Paskin N. Identifier interoperability: a report on two recent ISO activities. D-Lib Mag. 2006;12:1–23.

13 Cipra B.A. The best of the 20th century: editors name top 10 algorithms. SIAM News. 2000;33(4).

14 Sainani K. Error: what biomedical computing can learn from its mistakes. Biomed Comput Rev. 2011;Fall:12–19.

15 Marsaglia G., Tsang W.W. Some difficult-to-pass tests of randomness. J Stat Softw. 2002;7:1–8 Available from: http://www.jstatsoft.org/v07/i03/paper [accessed 25.09.12].

16 Knuth D.E. Art of computer programming, volume 2: seminumerical algorithms. 3rd ed. Boston: Addison-Wesley; 1997.

17 Berman J.J. Tumor classification: molecular analysis meets Aristotle. BMC Cancer. 2004;4:10 Available from: http://www.biomedcentral.com/1471-2407/4/10 [accessed 01.01.15].

18 Berman J.J. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer. 2004;4:88 Available from: http://www.biomedcentral.com/1471-2407/4/88 [accessed 01.01.15].

19 Berman JJ, Moore GW. Implementing an RDF Schema for pathology images. Available from: http://www.julesberman.info/spec2img.htm; 2007 [accessed 01.01.15].

20 Wigner E. The unreasonable effectiveness of mathematics in the natural sciences. Communications in Pure and Applied Mathematics. New York: John Wiley and Sons; 1960;vol. 13.

21 Patient Identity Integrity. A White Paper by the HIMSS Patient Identity Integrity Work Group. Available from: http://www.himss.org/content/files/PrivacySecurity/PIIWhitePaper.pdf; 2009 [accessed 19.09.12].

22 Tu Y., Gunnell D., Gilthorpe M.S. Simpson's paradox, Lord's paradox, and suppression effects are the same phenomenon: the reversal paradox. Emerg Themes Epidemiol. 2008;5:2.

23 DeQueiroz K. Ernst Mayr and the modern concept of species. PNAS. 2005;102(Suppl. 1):6600–6607.

24 DeQueiroz K. Species concepts and species delimitation. Syst Biol. 2007;56:879–886.

25 Mayden R.L. Consilience and a hierarchy of species concepts: advances toward closure on the species puzzle. J Nematol. 1999;31(2):95–116.

26 Mead C.N. Data interchange standards in healthcare IT — computable semantic interoperability: now possible but still difficult, do we really need a better mousetrap? J Healthc Inf Manag. 2006;20:71–78.

27 Berman J.J. Taxonomic guide to infectious diseases: understanding the biologic classes of pathogenic organisms. Waltham: Academic Press; 2012.

28 Guarro J., Gene J., Stchigel A.M. Developments in fungal taxonomy. Clin Microbiol Rev. 1999;12:454–500.
