Chapter 6

Giving Meaning to Data

Abstract

Masters of data simplification understand the basic concepts of data meaning (ie, semantics), data objects, ontologies, and triples. Knowledge of these simple concepts permits us to find the relationships among data objects. The purpose of this chapter is to provide a thoughtful and useful discussion of the meaning of data, accompanied by examples demonstrating its practical utility.

Keywords

Ontology; Classification; Triple; Meaning; Semantics; Data objects

6.1 Meaning and Triples

Increased complexity needs simplified design

Harold Jarche¹

Data, by itself, has no meaning. It is the job of the data scientist to assign meaning to data, and this is done with data objects, triples, and classifications (see Glossary item, Data object). Our most familiar data constructions (eg, spreadsheets, relational databases, flat-file records) convey meaning through triples and data objects; we just don't perceive them as such (see Glossary items, Spreadsheet, Database, Flat-file). The purpose of this section is to define "meaning," "data object," and "triple" and to show how all well-designed data can be reconstructed as collections of triples. Later sections will explain how triples are used to integrate and aggregate data and to express the relationships among data objects.

The three conditions for a meaningful assertion are²:

1. There is a specific data object about which the statement is made.

2. There is data that pertains to the specified object.

3. There is metadata that describes the data (that pertains to the specific object).

Simply put, assertions have meaning whenever a pair of metadata and data (the descriptor for the data and the data itself) is assigned to a specific object. In the informatics field, assertions come in the form of so-called triples, consisting of the object, then the metadata, and then the data.

Here are some examples of triples, as they might occur in a medical dataset:

"Jules Berman" "blood glucose level" "85"

"Mary Smith" "blood glucose level" "90"

"Samuel Rice" "blood glucose level" "200"

"Jules Berman" "eye color" "brown"

"Mary Smith" "eye color" "blue"

"Samuel Rice" "eye color" "green"

Here are a few triples, as they might occur in a haberdasher's dataset:

"Juan Valdez" "hat size" "8"

"Jules Berman" "hat size" "9"

"Homer Simpson" "hat size" "9"

"Homer Simpson" "hat_type" "bowler"

We can combine the triples from a medical dataset and a haberdasher's dataset that apply to a common object:

"Jules Berman" "blood glucose level" "85"

"Jules Berman" "eye color" "brown"

"Jules Berman" "hat size" "9"

Triples can port their meaning between different databases because they bind described data to an object. The portability of triples permits us to achieve data integration of heterogeneous types of data, and facilitates the design of software agents that conduct queries over multiple databases. Data integration involves merging related data objects, across diverse data sets. As it happens, if data supports introspection and data is organized as meaningful assertions (ie, as identified triples), then data integration is implicit (ie, an intrinsic property of the data) (see Glossary item, Introspection). In essence, data integration is awarded to data scientists who apply data simplification techniques.
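The combination shown above can be sketched in a few lines of Python. This is only an illustration, assuming that each triple is held as a tuple of three strings; the variable names (medical, haberdasher) and the helper triples_for are hypothetical and do not come from any particular triplestore product.

```python
# Each triple is (object, metadata, data), as in the examples above.
medical = [
    ("Jules Berman", "blood glucose level", "85"),
    ("Mary Smith", "blood glucose level", "90"),
    ("Samuel Rice", "blood glucose level", "200"),
    ("Jules Berman", "eye color", "brown"),
    ("Mary Smith", "eye color", "blue"),
    ("Samuel Rice", "eye color", "green"),
]
haberdasher = [
    ("Juan Valdez", "hat size", "8"),
    ("Jules Berman", "hat size", "9"),
    ("Homer Simpson", "hat size", "9"),
    ("Homer Simpson", "hat_type", "bowler"),
]


def triples_for(obj, *datasets):
    """Collect every triple about obj, in dataset order (hypothetical helper)."""
    return [t for dataset in datasets for t in dataset if t[0] == obj]


combined = triples_for("Jules Berman", medical, haberdasher)
for triple in combined:
    print(triple)
```

Because every triple carries its own object and its own descriptor, no schema mapping between the two datasets is needed; the selection is a simple filter on the first element.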

In the previous examples, our sample identifiers were names of people; specifically the given name followed by a space followed by the surname. Names make poor identifiers because they are not unique. There is no process that ensures that no two individuals have the same name. As discussed in Section 5.1, modern informatics systems use unique alphanumeric character sequences to identify people and other data objects.

The unique identifier of the subject (ie, the data object) and the subject's class (ie, the name of the class to which the data object belongs) are the keys by which all of the information about the subject can be collected. If you know the identifier for a data object, you can collect all of the information associated with the object, regardless of its location in the data resource. If other data resources use the same identifier for the data object, you can integrate all of the data associated with the data object, regardless of its location in external resources. Furthermore, if you know the class that holds a data object, you can infer that the data object has all of the properties of its class, along with all the properties of its ancestral classes. Consider the following example.³

Triples in resource #1

75898039563441 name G. Willikers

75898039563441 gender male

Triples in resource #2

75898039563441 age 35

75898039563441 is_a_class_member cowboy

94590439540089 name Hopalong Tagalong

94590439540089 is_a_class_member cowboy

Merged triples from resource #1 and #2

75898039563441 name G. Willikers

75898039563441 gender male

75898039563441 is_a_class_member cowboy

75898039563441 age 35

94590439540089 name Hopalong Tagalong

94590439540089 is_a_class_member cowboy

The merge of two triplestore resources combines data related to identifier 75898039563441 from both resources (see Glossary item, Triplestore). We now know a few things about this data object that we did not know before the merge. The merge tells us that the two data objects identified as 75898039563441 and 94590439540089 are both members of class cowboy. We now have two instance members from the same class, and this gives us information related to the types of instances contained in the class, and their properties. The consistent application of standard methods for object identification and for class assignments enhances our ability to understand our data (see Glossary items, Identification, Reconciliation).
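The merge just described can be approximated with ordinary Python sets. The sketch below assumes that each resource is a set of (identifier, metadata, data) tuples and, as a simplification, that an identifier carries at most one value per metadata tag; the index names by_identifier and members_of are hypothetical.

```python
from collections import defaultdict

resource_1 = {
    ("75898039563441", "name", "G. Willikers"),
    ("75898039563441", "gender", "male"),
}
resource_2 = {
    ("75898039563441", "age", "35"),
    ("75898039563441", "is_a_class_member", "cowboy"),
    ("94590439540089", "name", "Hopalong Tagalong"),
    ("94590439540089", "is_a_class_member", "cowboy"),
}

# Merging is a set union; identical triples would collapse into one.
merged = resource_1 | resource_2

# Index the merged triples by identifier, and class members by class.
by_identifier = defaultdict(dict)
members_of = defaultdict(set)
for identifier, metadata, data in merged:
    by_identifier[identifier][metadata] = data
    if metadata == "is_a_class_member":
        members_of[data].add(identifier)

print(by_identifier["75898039563441"])
print(sorted(members_of["cowboy"]))
```

After the merge, everything known about object 75898039563441 is gathered under one key, and the two instances of class cowboy can be listed together.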

The concept of namespaces was introduced in Section 2.5, "Annotation and the Simple Science of Metadata." The value of namespaces becomes very apparent when merging triplestores. Using namespaces, a single data object residing in several triplestores can be associated with assertions (ie, object-metadata-data triples) that include descriptors of the same name, without losing the intended sense of the assertions. Here is another example wherein two resources are merged:

Triples in resource #1

29847575938125 calendar:date February 4, 1986

83654560466294 calendar:date June 16, 1904

Triples in resource #2

57839109275632 social:date Jack and Jill

83654560466294 social:date Pyramus and Thisbe

Merged triples from resource #1 and #2

29847575938125 calendar:date February 4, 1986

57839109275632 social:date Jack and Jill

83654560466294 social:date Pyramus and Thisbe

83654560466294 calendar:date June 16, 1904

There you have it. The object identified as 83654560466294 is associated with a "date" metadata tag in both resources. When the resources are merged, the unambiguous meaning of the metadata tag is conveyed through the appended namespaces (ie, social: and calendar:).
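To see what the namespace prefix buys us, the following Python sketch merges the same two resources twice: once with the prefixes kept, and once with the prefixes stripped. The function merge and its keep_namespace parameter are hypothetical names chosen for this illustration.

```python
resource_1 = [
    ("29847575938125", "calendar:date", "February 4, 1986"),
    ("83654560466294", "calendar:date", "June 16, 1904"),
]
resource_2 = [
    ("57839109275632", "social:date", "Jack and Jill"),
    ("83654560466294", "social:date", "Pyramus and Thisbe"),
]


def merge(triples, keep_namespace=True):
    """Build {identifier: {tag: data}}; optionally strip the namespace prefix."""
    store = {}
    for identifier, tag, data in triples:
        if not keep_namespace:
            tag = tag.split(":", 1)[1]  # "calendar:date" becomes "date"
        store.setdefault(identifier, {})[tag] = data
    return store


with_ns = merge(resource_1 + resource_2)
without_ns = merge(resource_1 + resource_2, keep_namespace=False)

print(with_ns["83654560466294"])     # both assertions survive
print(without_ns["83654560466294"])  # one "date" overwrites the other
```

With the prefixes in place, object 83654560466294 keeps both of its dates. Without them, the two tags collide and the calendar date is silently lost.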

Triples are the basic commodities of information science. Every triple represents a meaningful assertion, and collections of triples can be automatically integrated with other triples. As such, all the triples that share the same identifier can be collected to yield all the available information that pertains to the unique object. Furthermore, all the triples that pertain to all the members of a class of objects can be combined to provide information about the class, and to find relationships among different classes of objects. Without elaborating, the ability to find relationships among classes of objects is the chief goal of all scientific research (see Glossary items, Science, Generalization).

Triples are the basic informational unit employed by RDF (Resource Description Framework), and RDF is the syntax of the so-called semantic web.⁴,⁵ It will come as no surprise that numerous databases have been designed to create, store, and retrieve triples. These databases are usually referred to as triplestores, or NoSQL databases, and they all operate by assigning identifiers (ie, unique record designators) to metadata/data pairs. Such databases can hold billions or trillions of key/value pairs or triples.

At the current time, software development for triplestore databases is in a state of volatility. Triplestore databases are dropping in and out of existence, changing their names, being incorporated into other systems, or being redesigned from the ground up. At the risk of showing my own personal bias, as an unapologetic Mumps fan, I would suggest that readers may want to investigate the value of using native Mumps as a triplestore database. Mumps, also known as the M programming language, is one of a small handful of ANSI-standard (American National Standards Institute) languages. It was developed in the 1960s and is still in use, primarily in hospital information systems and large production facilities. Versions of Mumps are available as open source, free distributions.⁶,⁷ The Mumps installation process can be challenging for those who are unfamiliar with the Mumps environment. Stalwarts may find that Mumps has native features that render it suitable for storing triples and exploring their relationships.⁸

6.2 Driving Down Complexity with Classifications

…individuals do not belong in the same taxon because they are similar, but they are similar because they belong to the same taxon.

Simpson GG. Principles of Animal Taxonomy, Columbia University Press, New York, 1961.

Classifications are collections of objects grouped into classes, designed to conform to a few restrictive rules, namely:

Rule 1 Each object instance belongs to one and only one class (see Glossary item, Instance).

Rule 2 Each class has one and only one parent class. Each class may be the parent class for any number (ie, zero or more) of child classes (see Glossary items, Parent class, Child class, Superclass, Subclass).

Rule 3 Instances and classes are nontransitive (ie, instances cannot change their class; classes cannot change their ancestors or descendants).

Rule 4 The properties of the parent class are inherited by the child class. By inductive inference, each class inherits all of the class properties of every class in its ancestral lineage.
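The four rules above can be sketched as a pair of lookup tables. This is a minimal illustration, not a complete classification: the class names, the listed properties, and the instance label specimen_001 are all chosen for the example. Each class maps to exactly one parent (Rule 2), each instance maps to exactly one class (Rule 1), and properties are gathered by walking up the lineage (Rule 4).

```python
# Rule 2: every class has exactly one parent (None marks the root).
parent_of = {
    "Eukaryote": None,
    "Animal": "Eukaryote",
    "Mammal": "Animal",
    "Rodent": "Mammal",
}

# Properties declared directly on a class (illustrative, not exhaustive).
own_properties = {
    "Eukaryote": {"has nuclei"},
    "Mammal": {"has hair"},
    "Rodent": {"has gnawing teeth"},
}

# Rule 1: every instance belongs to exactly one class.
instance_class = {"specimen_001": "Rodent"}


def lineage(cls):
    """Return cls followed by its ancestors, up to the root."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = parent_of[cls]
    return chain


def inherited_properties(cls):
    """Rule 4: union of properties over the whole ancestral lineage."""
    props = set()
    for ancestor in lineage(cls):
        props |= own_properties.get(ancestor, set())
    return props


print(lineage("Rodent"))
print(sorted(inherited_properties(instance_class["specimen_001"])))
```

Rule 3 is not enforced by the code; it is a discipline imposed on whoever maintains the two tables.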

The rules for constructing classifications seem obvious and simplistic. Surprisingly, the task of building a logical and self-consistent classification is extremely difficult. Most classifications are rife with logical inconsistencies and paradoxes. Let's look at a few examples.

In 1975, while touring the Bethesda, Maryland campus of the National Institutes of Health, I was informed that their Building 10 was the largest all-brick building in the world, providing a home to over 7 million bricks. Soon thereafter, an ambitious construction project was undertaken to greatly expand the size of Building 10. When the work was finished, Building 10 was no longer the largest all-brick building in the world. What happened? The builders used material other than brick, and Building 10 lost its classification as an all-brick building.

This poses something of a paradox; objects in a classification are not permitted to move about from one class to another. An object assigned to a class must stay in its class (ie, the nontransitive property of classifications). Apparent paradoxes that plague any formal conceptualization of classifications are not difficult to find. Let's look at a few more examples.

Consider the geometric class of ellipses; planar objects in which the sum of the distances to two focal points is constant. Class Circle is a child of Class Ellipse, for which the two focal points of instance members occupy the same position, in the center, producing a radius of constant size. Imagine that Class Ellipse is provided with a class method called "stretch," in which the foci are moved farther apart, thus producing flatter objects. When the parent class "stretch" method is applied to members of the Class Circle (as per Rule 4), the circle stops being a circle and becomes an ordinary ellipse. Hence the inherited "stretch" method forces members of Class Circle to transition out of their assigned class, violating Rule 3.
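The circle paradox is easy to reproduce. The Python sketch below is one possible rendering, assuming an ellipse is stored as two foci plus the constant sum of distances; the attribute names and the foci_coincide check are hypothetical conveniences.

```python
class Ellipse:
    def __init__(self, focus_1, focus_2, distance_sum):
        self.focus_1 = focus_1
        self.focus_2 = focus_2
        self.distance_sum = distance_sum

    def stretch(self, amount):
        """Move the foci farther apart along the x axis."""
        self.focus_1 = (self.focus_1[0] - amount, self.focus_1[1])
        self.focus_2 = (self.focus_2[0] + amount, self.focus_2[1])

    def foci_coincide(self):
        return self.focus_1 == self.focus_2


class Circle(Ellipse):
    def __init__(self, center, radius):
        # A circle is an ellipse whose two foci occupy the same point.
        super().__init__(center, center, 2 * radius)


shape = Circle(center=(0.0, 0.0), radius=1.0)
before = shape.foci_coincide()
shape.stretch(0.5)  # inherited from Ellipse, as per Rule 4
after = shape.foci_coincide()
print(type(shape).__name__, before, after)
```

The object still reports its class as Circle, yet its foci no longer coincide. The language is content; the classification is not.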

Let's look at the "Bag" class of objects. A "Bag" is a collection of objects, and the Class Bag is included in most object-oriented programming languages. A "Set" is also a collection of objects (ie, a subclass of Bag), with the special feature that duplicate instances are not permitted. For example, if Kansas is a member of the set of United States, then you cannot add a second state named "Kansas" to the set. If Class Bag were to have an "increment" method, that added "1" to the total count of objects in the bag, whenever an object is added to Class Bag, then the "increment" method would be inherited by all of the subclasses of Class Bag, including Class Set. But Class Set cannot increase in size when duplicate items are added. Hence, inheritance creates a paradox in the Class Set (see Glossary item, Inheritance).
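Here is a sketch of the Bag and Set conflict in Python. The class names, and the split of the work into store, increment, and add methods, are assumptions made for the illustration; the subclass is called UniqueSet to avoid shadowing Python's built-in set.

```python
class Bag:
    """A collection that allows duplicates and keeps a running count."""

    def __init__(self):
        self.items = []
        self.count = 0

    def increment(self):
        self.count += 1

    def store(self, item):
        self.items.append(item)

    def add(self, item):
        self.store(item)
        self.increment()


class UniqueSet(Bag):
    """A Bag subclass that refuses to store duplicates."""

    def store(self, item):
        if item not in self.items:
            self.items.append(item)


states = UniqueSet()
states.add("Kansas")
states.add("Kansas")  # duplicate: not stored, but still counted
print(states.count, len(states.items))
```

The inherited bookkeeping reports two objects while the collection holds one. The subclass has broken a promise made by its parent.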

SUMO is an upper-level ontology, designed to include general classes of objects that other ontologies can refer to as their superclasses (see Glossary items, SUMO, Superclass). We will learn about ontologies in Section 6.3. For now, it suffices to note that ontologies are just classifications that ignore the second restriction. That is to say, in an ontology, a class may have more than one parent class. Hence, all classifications are ontologies, but not all ontologies are classifications. SUMO permits multiple class inheritance. For example, in SUMO, the class of humans is assigned to two different parent classes: Class Hominid and Class CognitiveAgent. "HumanCorpse", another SUMO class, is defined in SUMO as "A dead thing which was formerly a Human." Human corpse is a subclass of Class OrganicObject; not of Class Human. This means that a human, once it ceases to live, transits to a class that is not directly related to the class of humans. Basically, a member of Class Human, in the SUMO ontology, will change its class and its ancestral lineage, at different timestamped moments.

One last dalliance. Consider these two classes from the SUMO ontology, both of which happen to be subclasses of Class Substance.

Subclass NaturalSubstance

Subclass SyntheticSubstance

It would seem that these two subclasses are mutually exclusive. However, diamonds occur naturally, and diamonds can be synthesized. Hence, diamond belongs to Subclass NaturalSubstance and to Subclass SyntheticSubstance. The ontology creates two mutually exclusive classes that contain some of the same objects; and this is a problem. We cannot create sensible inference rules for objects that occupy mutually exclusive classes.

How does a data scientist deal with class objects that disappear from their assigned class and reappear elsewhere? In the examples discussed here, we saw the following:

1. Building 10 at NIH was defined as the largest all-brick building in the world. Strictly speaking, Building 10 was a structure, and it had a certain weight and dimensions, and it was constructed of brick. "Brick" is an attribute or property of buildings, and properties cannot form the basis of a class of building, if they are not a constant feature shared by all members of the class (ie, some buildings have bricks; others do not). Had we not conceptualized an "all-brick" class of building, we would have avoided any confusion. We will see later in this chapter (Section 6.5, "Properties That Cross Multiple Classes") how to distinguish properties from classes.

2. Class Circle qualified as a member of Class Ellipse, because a circle can be imagined as an ellipse whose two focal points happen to occupy the same location. Had we defined Class Ellipse to specify that class members must have two separate focal points, we could have excluded circles from class Ellipse. Hence, we could have safely included the stretch method in Class Ellipse without creating a paradox.

3. Class Set was made a subclass of Class Bag, but the increment method of Class Bag could not apply to Class Set. We created Class Set without taking into account the basic properties of Class Bag, which must apply to all its subclasses. Perhaps it would have been better if Class Set and Class Bag were created as children of Class Collection; each with its own set of properties.

4. Class HumanCorpse was not created as a subclass of Class Human. This was a mistake, as all humans will eventually die. If we were to create two classes, one called Class Living Human and one called Class Deceased Human, we would certainly cover all possible human states of being, but we would be creating a situation where members of a class are forced to transition out of their class and into another (violating Rule 3). The solution, in this case, is simple. Life and death are properties of organisms, and all organisms can and will have both properties, but never at the same time. Assign organisms the properties of life and of death, and stop there.

5. At first glance, the concepts "NaturalSubstance" and "SyntheticSubstance" would appear to be subclasses of "Substance." Are they really? Would it not be better to think that being "natural" or being "synthetic" are just properties of substances; not types of substances? If we agree that diamonds are members of Class Substance, we can say that any specific diamond may have occurred naturally or through synthesis. We can eliminate two subclasses (ie, "NaturalSubstance" and "SyntheticSubstance") and replace them with two properties of Class Substance: synthetic and natural. By assigning properties to a class of objects, we simplify the ontology (by reducing the number of subclasses), and we eliminate problems created when a class member belongs to two mutually exclusive subclasses. We will discuss the role of properties in classifications in Section 6.5.
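The remedy suggested in the last item, replacing subclasses with properties, might look like this in Python. The sketch assumes a single Class Substance with an origin attribute; the attribute name and its two permitted values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substance:
    name: str
    origin: str  # a property of the instance: "natural" or "synthetic"

    def __post_init__(self):
        if self.origin not in ("natural", "synthetic"):
            raise ValueError(f"unknown origin: {self.origin!r}")


mined = Substance("diamond", "natural")
grown = Substance("diamond", "synthetic")

# Both diamonds sit in the same class; only a property differs.
print(type(mined) is type(grown), mined.origin, grown.origin)
```

No instance is ever asked to occupy two mutually exclusive classes, and a change in what we know about a diamond's origin would alter a property, not a class assignment.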

When using classifications, it is important to distinguish a classification system from an identification system. An identification system puts a data object into its correct slot within an existing classification. In the realm of medicine, when a doctor renders a diagnosis on a patient's diseases, she is not classifying the disease; she is finding the correct slot within the preexisting classification of diseases that holds her patient's diagnosis.³

When creating new classes, it is important to distinguish two important concepts: object relationships and object similarities. Relationships are the fundamental properties of an object that account for the object's behavior and interactions with other objects. Mathematical equations establish relationships among the variables of the equation. For example, force is related to mass by acceleration (ie, force equals mass times acceleration). An object is a member of a particular class if it has a relationship to all of the other members of the class (eg, all rodents have gnawing teeth; all eukaryotic organisms have nuclei). Similarities are features or properties that two objects have in common. Related objects tend to be similar to one another, but these similarities occur as the consequence of their relationships; not vice versa. For example, you are related to your father, and you probably have many similarities to your father. The reason that you share similarities to your father is that you are related to him; you are not related to your father because you are similar to him.

Here is a specific example that demonstrates the difference between a similarity and a relationship. You look up at the clouds, and you begin to see the shape of a lion. The cloud has a tail, like a lion's tail, and a fluffy head, like a lion's mane. With a little imagination, the mouth of the lion seems to roar down from the sky. You have succeeded in finding similarities between the cloud and a lion. If you look at a cloud and you imagine a tea kettle producing a head of steam, and you recognize that the physical forces that create a cloud and the physical forces that produce steam from a heated kettle are the same, then you have found a relationship. Without science-based relationships, reality makes no sense.⁹

Currently, data scientists have at their disposal a variety of mathematical algorithms that cluster objects by similarity (see Glossary items, K-nearest neighbor algorithm, Predictive analytics, Support vector machine, SVM, Neural network, Normalized compression distance). Such algorithms are referred to as classifiers; an inaccurate and misleading name. These algorithms can take a data set consisting of data objects, and their features (ie, the data that describes the objects), and produce a hierarchical distribution of data clusters, simulating a complete classification. Such algorithms are easily fooled by data objects that share highly specific or specialized features. In the case of the classification of living organisms, two unrelated species may independently acquire identical or similar traits through adaptation; not through inheritance from a shared ancestor. Examples are: the wing of a bat and the wing of a bird; the opposable thumb of opossums and of primates; the beak of a platypus and the beak of a bird. Unrelated species frequently converge upon similar morphologic solutions to common environmental conditions or shared physiological imperatives. Algorithms that cluster organisms based on similarity will mistakenly group divergent organisms under the same species (see Glossary items, Nongeneralizable predictor, Nonphylogenetic signal).
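A toy example shows how easily similarity misleads. The Python sketch below scores pairs of animals by the Jaccard index (shared traits divided by total traits). The trait lists are hand-picked and deliberately superficial, purely for illustration; they are not drawn from any real dataset, and the function names are hypothetical.

```python
# Deliberately superficial, hand-picked traits (illustrative only).
traits = {
    "bat": {"wings", "flies", "hair"},
    "bird": {"wings", "flies", "feathers"},
    "mouse": {"four legs", "walks", "hair"},
}


def jaccard(a, b):
    """Shared traits divided by all traits seen in either set."""
    return len(a & b) / len(a | b)


def most_similar(name):
    """Return the other animal with the highest Jaccard score."""
    others = [n for n in traits if n != name]
    return max(others, key=lambda n: jaccard(traits[name], traits[n]))


print(most_similar("bat"))
print(jaccard(traits["bat"], traits["bird"]))
print(jaccard(traits["bat"], traits["mouse"]))
```

On these features the bat scores 0.5 against the bird and 0.2 against the mouse, so a similarity-based grouping pairs the bat with the bird, even though the bat and the mouse are both mammals. The score measures convergence on flight, not descent.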

It is sometimes assumed that classification algorithms (ie, clustering objects into classes based on finding similarities among data objects) will improve when we acquire whole-genome sequence data for many different species (see Glossary item, Phenetics). Not so.¹⁰ Imagine an experiment wherein you take DNA samples from every organism you encounter: bacterial colonies cultured from a river, unicellular nonbacterial organisms found in a pond, small multicellular organisms found in soil, crawling creatures dwelling under rocks, and so on. You own a powerful sequencing machine, that produces the full-length sequence for each sampled organism, and you have a powerful computer that sorts and clusters every sequence. At the end, the computer prints out a huge graph, wherein groups of organisms with the greatest sequence similarities are clustered together. You may think you've created a useful classification, but you haven't really, because you don't know anything about the properties of your clusters. You don't know whether each cluster represents a species, or a class (a collection of related species), or whether a cluster may be contaminated by organisms that share some of the same gene sequences, but are phylogenetically unrelated (ie, the sequence similarities result from chance or from convergence, but not by descent from a common ancestor). The sequences do not tell you very much about the biological properties of specific organisms, and you cannot infer which biological properties characterize the classes of clustered organisms. You have no certain knowledge whether the members of any given cluster of organisms can be characterized by any particular gene sequence (ie, you do not know a characterizing gene sequence that applies to every member of a class, and to no members of other classes). You do not know the genus or species names of the organisms included in the clusters, because you began your experiment without a presumptive taxonomy (see Glossary item, Taxonomy). It is hard to begin something, if there is no beginning from which to start. Old-fashioned taxonomy is the beginning from which modern computational methods must build.

When creating a classification, it is essential to remember that the members of classes may be highly similar to one another, but their similarities result from their membership in the same class. Similarity alone can never account for class inclusion, and computational approaches to classification, based entirely on sequence similarity, have limited value.

Biologists are continually engaged in an intellectual battle over the classification of living organisms. The stakes are high. When unrelated organisms are mixed together in the same class, and when related organisms are separated into unrelated classes, the value of the classification is lost, perhaps forever (see Glossary item, Blended class). Without an accurate classification of living organisms, it would be impossible to make significant progress in the diagnosis, prevention, or treatment of infectious diseases.

6.3 Driving Up Complexity With Ontologies

More than any other time in history, mankind faces a crossroads. One path leads to despair and utter hopelessness. The other, to total extinction. Let us pray we have the wisdom to choose correctly.

Woody Allen¹¹

Ontologies are classifications for which the "one child class -> one parent class" restraint is lifted. Today's data scientists have largely abandoned classifications in favor of ontologies. There are several reasons why this is so, one being that a "classification" is misconstrued to be the product of a "classifier algorithm." This is not the case. Classifier algorithms organize data objects by similarity, and, as we have seen, this is a fundamentally different process than organizing data objects by relationships (see Glossary items, Classifier, Predictive analytics). More importantly, the popularity of ontologies comes from the perception that ontologies are more modern and computer-friendly than classifications. Classifications were created and implemented at a time when scientists did not have powerful computers that were capable of handling the complexities of ontologies. For example, the classification of all living organisms on earth was created over a period of two millennia. Several million species have been assigned to date to this classification. It is currently estimated that we will need to add another 10–50 million species before we come close to completing the taxonomy of living organisms. Prior generations of scientists could cope with a simple classification, wherein each class of organisms falls under a single superclass; they could not hope to cope with a complex ontology of organisms. In an ontology, the species class, "Horse," might be a child class of Equus, a zoologic term; as well as a subclass of "racing animals" and "farm animals," and "four-legged animals" (see Glossary items, Child class, Parent class). Likewise, the class "book" might be a subclass of "works of literature," as well as a subclass of "wood-pulp materials," and "inked products." Naturalists working in the precomputer age simply could not keep track of the class relationships built into ontologies.³

The advent of powerful and accessible computers has spawned a new generation of computer scientists who have developed powerful methods for building complex ontologies. It is the goal of these computer scientists to analyze data in a manner that allows us to find and understand relationships among data objects.

The question confronting data scientists is, "Should I model my data as a classification, wherein every class has one direct parent class; or should I model my data as an ontology, wherein classes may have multiparental inheritance?" This question lies at the heart of several related fields: database management, computational informatics, object-oriented programming, semantics, and artificial intelligence. Computer scientists are choosing sides, often without acknowledging the problem or fully understanding the stakes. For example, when a programmer builds object libraries in the Python or the Perl programming languages, he is choosing to program in a permissive environment that supports multiclass object inheritance (see Glossary items, Multiclass inheritance, Multiclass classification). In Python and Perl, any object can have as many parent classes as the programmer prefers. When a programmer chooses to program in the Ruby programming language, he shuts the door on multiclass inheritance. A Ruby object can have only one direct parent class. Most programmers are totally unaware of the liberties and restrictions imposed by their choice of programming language, until they start to construct their own object libraries, or until they begin to use class libraries prepared by another programmer.³
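As a small illustration of the permissive case, here is how multiparental inheritance looks in Python, borrowing the Horse example from earlier in this section. The class names are hypothetical and the classes are left empty on purpose; the point is only that the language accepts several direct parents.

```python
class Equus:
    """Zoologic parent class."""


class RacingAnimal:
    """Parent class based on use."""


class FarmAnimal:
    """Parent class based on habitat."""


# Python accepts any number of direct parent classes.
class Horse(Equus, RacingAnimal, FarmAnimal):
    pass


print([base.__name__ for base in Horse.__bases__])
```

The class definition alone commits the programmer to the ontology model; nothing in the syntax warns that a modeling decision has been made.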

In object-oriented programming, the programming language provides a syntax whereby a named method is "sent" to data objects, and a result is calculated. The named methods are short programs contained in a library of methods created for a class. For example, a "close" method, written for file objects, typically shuts a file so that it cannot be accessed for read or write operations. In object-oriented languages, a "close" method is sent to an instance of class "File" when the programmer wants to prohibit access to the file. The programming language, upon receiving the "close" method, will look for a method named "close" somewhere in the library of methods prepared for the "File" class. If it finds the "close" method in the "File" class library, it will apply the method to the object to which the method was sent. In simplest terms, the file is closed.³

If the "close" method were not found among the available methods for the "File" class library, the programming language would automatically look for the "close" method in the parent class of the "File" class. In some languages, the parent class of the "File" class is the "Input/Output" class. If there were a "close" method in the "Input/Output" class, then the "close" method contained in the "Input/Output" class would be sent to the "File" Object. If not, the process of looking for a "close" method would be repeated for the parent class of the "Input/Output" class. You get the idea. Object-oriented languages search for methods by moving up the lineage of ancestral classes for the object instance that receives the method.
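The upward search can be observed directly in Python. In the sketch below, the File and InputOutput classes are hypothetical stand-ins written for this example (they are not Python's built-in file or io classes), and find_method is a helper invented to report where a method was found.

```python
class InputOutput:
    def close(self):
        return "closed by InputOutput"


class File(InputOutput):
    pass  # no close method of its own


def find_method(obj, name):
    """Walk up the ancestral lineage; report which class supplies name."""
    for cls in type(obj).__mro__:
        if name in vars(cls):
            return cls.__name__
    return None


f = File()
print(find_method(f, "close"))
print(f.close())
```

The search starts at File, fails, and succeeds one step up at InputOutput. Asking for a name that no ancestor defines makes find_method return None, just as the language itself would eventually give up and raise an error.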

In object-oriented programming, every data object is assigned membership to a class of related objects. Once a data object has been assigned to a class, the object has access to all of the methods available to the class in which it holds membership, and to all of the methods in all the ancestral classes. This is the beauty of object-oriented programming. If the object-oriented programming language is constrained to single parental inheritance (eg, the Ruby programming language), then the methods available to the programmer are restricted to a tight lineage. When the object-oriented language permits multiparental inheritance (eg, Perl and Python programming languages), a data object can have many different ancestral classes crisscrossing the class libraries.

Freedom always has its price. Imagine what happens in a multiparental object-oriented programming language when a method is sent to a data object, and the data object's class library does not contain the method. The programming language will look for the named method in the library belonging to a parent class. Which parent class library should be searched? Suppose the object has two parent classes, and each of those two parent classes has a method of the same name in their respective class libraries? The functionality of the method will change depending on its class membership (ie, a "close" method may have a different function within Class File than it may have within Class Transactions or Class Boxes). Unless the programmer knows the language's search rules and the full ancestry of every class involved, there is no way to determine how a search for a named method will traverse its ancestral class libraries; hence, the output of a software program written in an object-oriented language that permits multiclass inheritance can be very hard to predict.³
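To make the ambiguity concrete, here is a small Python sketch with two parent classes that each define "close." The class names are hypothetical. In Python specifically, the search follows the class's method resolution order, which depends on the order in which the parent classes are listed in the class definition; a programmer who does not know that order cannot tell which method will run.

```python
class File:
    def close(self):
        return "file closed"


class Transaction:
    def close(self):
        return "transaction settled"


class LoggedFile(File, Transaction):
    pass


class AuditedFile(Transaction, File):
    pass


print(LoggedFile().close())
print(AuditedFile().close())
print([c.__name__ for c in LoggedFile.__mro__])
```

The two subclasses have the same parents and no code of their own, yet the same message produces different behavior. Swapping two names in a class declaration, perhaps in a library written by someone else, silently changes what "close" means.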

The rules by which ontologies assign class relationships can lead to absurd outcomes. When there are no restraining inheritance rules, a class may be an ancestor of a child class that is an ancestor of its parent class (eg, a single class might be a grandfather and a grandson to the same class). An instance of a class might be an instance of two classes, at once. The combinatorics and the recursive options can become computationally difficult or impossible.

Those who use ontologies that have multiclass inheritance will readily acknowledge that they have created a system that is complex and unpredictable. The ontology expert justifies his complex and unpredictable model on the certainty that reality itself is complex and unpredictable. A faithful model of reality cannot be created with a simple-minded classification. Computational ontologists believe that with time and effort, modern approaches to complex systems will isolate and eliminate computational impedimenta, these being the kinds of problems that computer scientists are trained to solve. For example, recursion within an ontology can be avoided if the ontology is acyclic (ie, class relationships are not permitted to cycle back onto themselves). For every problem created by an ontology, an adept computer scientist will find a solution. Basically, ontologists believe that the task of organizing and understanding information no longer resides within the ancient realm of classification.³
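As a minimal sketch of what enforcing an acyclic ontology might involve, the following Python snippet checks whether any class is its own ancestor. The class names and the "parents" table are invented for illustration, and the depth-first search shown here is only one of several ways to perform the check.

```python
# Hypothetical ontology: each class maps to a list of its parent classes.
parents = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A"],        # C's parent is A, closing the loop A -> B -> C -> A
    "D": ["A", "B"],   # multiclass inheritance
}


def has_cycle(parents):
    """Return True if any class turns out to be its own ancestor."""
    done = set()       # classes whose ancestry has been fully checked

    def visit(cls, path):
        if cls in path:
            return True            # we climbed back to a class already on this path
        if cls in done:
            return False
        path.add(cls)
        found = any(visit(p, path) for p in parents.get(cls, []))
        path.discard(cls)
        done.add(cls)
        return found

    return any(visit(cls, set()) for cls in parents)


before = has_cycle(parents)
parents["C"] = []                  # remove the offending parent link
after = has_cycle(parents)
print(before, after)
```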

For those nonprogrammers who believe in the supremacy of classifications over ontologies, their faith has nothing to do with the computational dilemmas incurred with multiclass parental inheritance. They base their faith on epistemological grounds; on the nature of objects. They hold that an object can only be one thing. You cannot pretend that one thing is really two or more things, simply because you insist that it is so. One thing can only belong to one class. One class can only have one ancestor class; otherwise, it would have a dual nature. Assigning more than one parental class to an object is a sign that you have failed to grasp the essential nature of the object. The classification expert believes that ontologies do not accurately represent reality.³

At the heart of classical classification is the notion that everything in the universe has an essence that makes it one particular thing, and nothing else. This belief is justified for many different kinds of systems. When an engineer builds a radio, he knows that he can assign names to components, and these components can be relied upon to behave in a manner that is characteristic of their type. A capacitor will behave like a capacitor, and a resistor will behave like a resistor. The engineer need not worry that the capacitor will behave like a semiconductor or an integrated circuit.

What is true for the radio engineer may not hold true for the data scientist. In many complex systems, the object changes its function depending on circumstances. For example, cancer researchers discovered a protein that plays a very important role in the development of cancer. This protein, p53, was considered to be the primary cellular driver for human malignancy. When p53 mutated, cellular regulation was disrupted, and cells proceeded down a slippery path leading to cancer. In the past few decades, as more information was obtained, cancer researchers have learned that p53 is just one of many proteins that play some role in carcinogenesis, but the role changes depending on the species, tissue type, cellular microenvironment, genetic background of the cell, and many other factors. Under one set of circumstances, p53 may play a role in DNA repair; under another set of circumstances, p53 may cause cells to arrest the growth cycle.¹²,¹³ It is difficult to classify a protein that changes its primary function based on its biological context.

Steeped as I am in the ancient art of classification, I am impressed, but not convinced, by arguments on both sides of the ontology/classification debate. Purists will argue that the complexity of the ontology must faithfully match the complexity of the data domain. As a matter of practicality, complex ontologies are difficult to implement in large and complex data projects.

Without stating a preference for single-class inheritance (classifications) or multiclass inheritance (ontologies), I would suggest that when modeling a complex system, you should always strive to design a model that is as simple as possible (see Glossary item, KISS). The practical ontologist may need to settle for a simplified approximation of the truth. Regardless of your personal preference, you should learn to recognize when an ontology has become too complex.

Here are the danger signs of an overly complex ontology³:

1. Nobody, even the designers, fully understands the ontology model.¹⁴

2. You realize that the ontology makes no sense. The solutions obtained by data analysts are impossible, or they contradict observations. Tinkering with the ontology doesn't help matters.

3. For a given problem, no two data analysts seem able to formulate the query the same way, and no two query results are ever equivalent.

4. The ontology lacks modularity. It is impossible to remove a set of classes within the ontology without collapsing its structure. When anything goes wrong, the entire ontology must be fixed or redesigned, from scratch.

5. The ontology cannot fit under a higher-level ontology or over a lower-level ontology.

6. The ontology cannot be debugged when errors are detected.

7. Errors occur without anyone knowing that the error has occurred.

8. You realize, to your horror, that your ontology has violated the cardinal rule of data simplification, by increasing the complexity of your data.

6.4 The Unreasonable Effectiveness of Classifications

I visited the Sage of reverend fame

And thoughtful left more burden'd than I came.

I went- - and ere I left his humble door

The busy World had quite forgot his name.

Ecclesiastes

In 1960, Eugene Wigner wrote a fascinating essay entitled, "The Unreasonable Effectiveness of Mathematics in the Natural Sciences." The thesis of this essay is that the most fundamental and abstract concepts in mathematics seem to play important roles in almost every aspect of the sciences.¹⁵ For myself, the most unreasonably effective equation in mathematics is Euler's identity (Fig. 6.1).

Figure 6.1 Euler's identity, considered one of the most beautiful equations in mathematics, demonstrates the relationships among five of the most important quantities in the mathematical universe: i, pi, 1, 0, and e (see Glossary item, Beauty).

Euler's identity is a special case of Euler's formula, from which a novice mathematician can quickly come to grasp DeMoivre's theorem, Gaussian distributions, Fourier analysis, signal analysis, and combinatorics. The fundamental quantities in mathematics seem to lead forward and backward to one another. This is true for mathematicians working in the fields of geometry, number theory, or probability. It applies also to the worldly fields of physics, statistics, and digital signal processing. The constants in Euler's identity, i, 0, 1, pi, and e, are abstractions that have somehow come to preside over our physical reality. It seems that the universe is organized by a few numbers that only exist in our imaginations.
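For readers who want to see the step from Euler's formula to Euler's identity written out, here is a short sketch in LaTeX notation. The first line sets x equal to pi in Euler's formula; the second line shows how DeMoivre's theorem follows for any integer n.

```latex
e^{ix} = \cos x + i \sin x
\quad\Longrightarrow\quad
e^{i\pi} = \cos \pi + i \sin \pi = -1
\quad\Longrightarrow\quad
e^{i\pi} + 1 = 0

\left( \cos x + i \sin x \right)^{n} = \left( e^{ix} \right)^{n} = e^{inx} = \cos nx + i \sin nx
```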

Just as everything in mathematics and physics seems to be related to a few abstract fundamentals, everything in the realm of data science seems to be related to a few classes of data objects. As an example, let me pick the least consequential subject that we can imagine: stamp collecting. Imagine that you have a great deal of time on your hands, and have chosen to while away the hours absorbed in your stamp collection. You have spent the past decade building a database of "Postage Stamps of the World," with each stamp annotated with a unique identifier and a list of attributes (eg, date of issue, ink compounds used, paper composition and manufacturer, text of stamp, font style of text, price of stamp, image of stamp). You have taken pains to assign each stamp to a class, based on the country of origin. What might you learn from such a tedious classification?

By graphing the price of stamps, country-by-country, you can determine which countries must have endured times of hyperinflation, reflected as huge rises in the cost of postage. By matching the human faces featured on stamps, by country, you can determine which cultures value scientific achievement (eg, by featuring Nobel laureates), which countries value entertainment (eg, by featuring musicians), and which countries value armed conflict (eg, by featuring generals and war heroes). You can link countries to various political and social persuasions by studying the slogans that appear on stamps (eg, calls to war, pleas for conservation, faith-based icons). Animals featured on stamps tell you something about local fauna. By examining the production levels of postage stamps within a country (ie, the number of postage stamps printed per capita), you gather that certain countries have been using postage stamps as legal tender, in lieu of minted bills. If stamps from multiple countries have the same basic design, then you can infer that these countries are too small to design and print their own stamps, and have opted to use the same manufacturer. With no additional effort on your part, it would seem that your postage stamp collection is fully integrated into other databases covering geographic regions, economies, populations, and cultures. By some miracle, your annotated stamp collection serves as a window to the world, providing an unbiased, and hyperconnected view of reality.

Ernest Rutherford (1871–1937) famously said, "All science is either physics or stamp collecting." Needless to add, Rutherford was a physicist. Rutherford's opinion echoed a sentiment, common in his heyday, that quantitative experiments advance science and engineering and broaden our understanding of the universe. Endeavors that dwell on description (eg, anatomy, zoology, and all of the so-called natural sciences) barely deserved the attention of serious scientists. It is this contempt for nonquantitative sciences, so prevalent during my own formative years, that dissuaded me, at first, from following a career in biology. It was not until after I had my undergraduate degree in mathematics that I began to think seriously about the role of classification in data analysis.

What follows here is a listing of six properties of classifications that benefit data scientists. You will notice that all of the examples for these properties are taken from the field of biology. Some readers will be put off by the emphasis on biological classes, but it cannot be helped. Almost every well-documented lesson in classification-building has come from studying the mistakes and the successes in the construction of the classification of living organisms. If you are serious about data simplification, you need to know something about taxonomy (see Glossary item, Taxonomy).¹⁶

1. Classifications drive down complexity

Let us take a look at the most formidable classification ever designed by humans. The classification of living organisms has been a work-in-progress for more than two millennia. This classification, sometimes called the tree-of-life, has become the grand unifying theory of all the natural sciences, including such diverse fields as genetics, geology, paleontology, microbiology, and evolution. For the past 150 years, any school child could glance at a schematic depicting the classification and gain a near instantaneous understanding of the organization of all living organisms (Fig. 6.2).

Figure 6.2 Modern classification of living organisms, a simple schema indicating the hierarchical relationships among the major classes of living organisms.¹⁷ Wikipedia, public domain.

Classifications drive down the complexity of their data domain, because every instance in the domain is assigned to an exclusive class, and every class is related to the other classes through a simple hierarchy (see Glossary item, Unclassifiable objects). By creating the classification of organisms, we eliminate the burden of specifying the relationships among instances (ie, individual organisms). If oak belongs to Class Angiosperm, and birch belongs to class Angiosperm, then oak and birch are related by class. We need not specify pairwise relationships among every member of class Angiosperm. Likewise, if Class Angiosperm descends from class Plantae, then oak and birch both descend from Class Plantae. Furthermore, oak and birch both enjoy all of the class properties of Class Angiosperm and Class Plantae. Life is great!
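The oak and birch example can be sketched in Python. The class definitions and the two class properties ("kingdom" and "produces_flowers") are hypothetical names chosen for illustration. Notice that no relationship between oak and birch is ever declared; it falls out of their class membership.

```python
# Hypothetical class definitions mirroring the example in the text.
class Plantae:
    kingdom = "plant"             # a class property shared by every descendant class


class Angiosperm(Plantae):
    produces_flowers = True       # a class property of Angiosperm


oak = Angiosperm()
birch = Angiosperm()

# The relationship between oak and birch follows from class membership alone.
related = type(oak) is type(birch)
both_plants = isinstance(oak, Plantae) and isinstance(birch, Plantae)
print(related, both_plants, oak.produces_flowers, birch.kingdom)
```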

No matter how large, a classification can be absorbed and understood by the human mind; a statement that seldom applies to ontologies. Because ontologies permit multiparental inheritance, the complexity of an ontology can easily exceed human comprehension.

2. Classifications are nonchaotic and computable

Classifications have a linear ascension through a hierarchy. The parental classes of any instance of the classification can be traced as a simple, nonbranched, and nonrecursive, ordered, and uninterrupted list of ancestral classes.

In a prior work,¹⁸ I described how a large, publicly available, taxonomy data file could be instantly queried to retrieve any listed organism, and to compute its complete class lineage, back to the "root" class, the primordial origin of the classification of living organisms.¹⁸ Basically, the trick to climbing backwards up the class lineage involves building two dictionary objects, also known as associative arrays. One dictionary object (which we will be calling "namehash") is composed of key/value pairs wherein each key is the identifier code of a class (in the nomenclature of the taxonomy data file), and each value is its name or label. The second dictionary object (which we'll be calling "parenthash") is composed of key/value pairs wherein each key is the identifier code of a class, and each value is the identifier code of the parent class. The snippet that prints the lineage for any class within the classification of living organisms is shown for Perl, Python, and Ruby:

In Perl:

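# Print each class name, then climb to its parent; stop at "root"
# or at an identifier that has no entry in %namehash.
while (defined $id_name && exists $namehash{$id_name} && $namehash{$id_name} ne "root")
  {
  print OUT "$namehash{$id_name} ";
  $id_name = $parenthash{$id_name};
  }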

In Python:

# Python 3 syntax; climbs at most 30 levels of the lineage.
for i in range(30):
    if id_name in namehash:
        print(namehash[id_name], file=outtext)
    if id_name in parenthash:
        id_name = parenthash[id_name]

In Ruby:

(1..30).each do
  outtext.puts(namehash[id_name])
  id_name = parenthash[id_name]
  break if namehash[id_name].nil?
end

The parts of the script that build the dictionary objects are left as an exercise for the reader. As an example of the script's output, here is the lineage for the domestic horse (Equus caballus), calculated from the classification of living organisms:

Equus caballus

Equus subg. Equus

Equus

Equidae

Perissodactyla

Laurasiatheria

Eutheria

Theria

Mammalia

Amniota

Tetrapoda

Sarcopterygii

Euteleostomi

Teleostomi

Gnathostomata

Vertebrata

Craniata

Chordata

Deuterostomia

Coelomata

Bilateria

Eumetazoa

Metazoa

Fungi/Metazoa group

Eukaryota

Cellular organisms

The words in this zoological lineage may seem strange to laypersons, but taxonomists who view this lineage instantly grasp the place of domestic horses in the classification of all living organisms.
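For readers who would like a starting point for building the two dictionary objects, here is a minimal Python sketch. It assumes that the taxonomy has already been read into a list of records; the identifier codes ("c1" through "c7") are invented for illustration, and the lineage is heavily abridged. A real taxonomy data file has its own codes and its own layout, which this sketch does not attempt to reproduce.

```python
# Hypothetical records: (identifier code, class name, parent identifier code).
# The codes are invented for illustration, and most classes are omitted.
records = [
    ("c1", "root", None),
    ("c2", "Eukaryota", "c1"),
    ("c3", "Metazoa", "c2"),
    ("c4", "Mammalia", "c3"),
    ("c5", "Equidae", "c4"),
    ("c6", "Equus", "c5"),
    ("c7", "Equus caballus", "c6"),
]

namehash = {}     # identifier code -> class name
parenthash = {}   # identifier code -> identifier code of the parent class
for code, name, parent_code in records:
    namehash[code] = name
    parenthash[code] = parent_code


def lineage(id_name):
    """List class names from id_name up to, but not including, the root."""
    names = []
    while id_name in namehash and namehash[id_name] != "root":
        names.append(namehash[id_name])
        id_name = parenthash[id_name]
    return names


horse_lineage = lineage("c7")
print("\n".join(horse_lineage))
```

Because each class has exactly one parent, the climb is a simple loop; no branching and no choice among parents is ever required.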

3. Classifications are self-correcting

New information causes us to reconsider the assumptions upon which a classification is built.¹⁹ For example, an unfortunate error in the early classifications of living organisms involved placing fungi in the plant kingdom. On a superficial level, fungi and plants seem similar to one another. Both classes of organisms live in soil, and they both emerge from the ground, to produce stationary growths. Much like plants, fungal mushrooms can be picked, cooked, and served as a side dish.

In 1811, an interesting substance named chitin was extracted from mushrooms. In 1830, the same substance was extracted from insects. Over the following century, as we learned more about the chemistry of chitin and the cellular constituents of animal cells and plant cells, it gradually dawned on taxonomists that fungi were quite different from plants and had many similarities with animal cells. Most obviously, chitin, a constituent of fungi and insects, is absent from plant cells. Furthermore, cellulose, a constituent of plant cells, is absent from fungi and insects. These observations should have told us that fungi probably do not belong in the same class as plants.

We now know that fungi descended from a flagellated organism and belong to a large class of organisms known as the opisthokonts (organisms with one flagellum positioned at the posterior of cells). The opisthokonts include the class of metazoans (ie, animals); hence, we humans are much more closely related to fungi than to plants. To be fair, it was not easy to determine that the fungi are true opisthokonts. Most modern fungi lost their posterior flagellum somewhere along their evolutionary road. It seems that when fungi changed their habitat from water to soil, they lost their tails. The chytrids, a somewhat primitive group of fungi that never abandoned their aquatic lifestyle, have retained their posterior flagellum, using the tail to propel themselves through water. The retention of the posterior flagellum among the chytrid fungi leaves no doubt that fungi are opisthokonts, not plants.

The misconception that fungi are plants persists to this day. Academic mycologists (ie, fungal experts) are employed by botany departments, and fungal taxonomy is subsumed under the International Code of Botanical Nomenclature (ICBN).²⁰ It is not unusual to see modern textbooks that list the fungi among the flowers. Tsk Tsk.

The reassignment of the fungi to the Class Opisthokonta, an ancestral class of humans, raised an interesting question. If we humans are descended from Class Opisthokonta, just like the fungi, then where is our posterior flagellum? As it happens, human spermatozoa are propelled through body fluids by a posterior flagellum. Our posterior flagellum was too handy a tool for evolution to discard entirely (see Glossary item, Negative classifier).

When a class is assigned a wrong position in a classification, it can be moved, along with all its descendant classes, to its new position, under its rightful parent class. Basically, the correction involves erasing one line and constructing another. Simple! The correction of class assignments is nearly impossible in the case of complex ontologies. Because a class may have many different parent classes, a correction can disrupt the class relationships among many different classes and their subclasses. Basically, ontologies are too entangled to permit facile modifications in their structures (see Glossary item, Unstable taxonomy).
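The one-line correction can be sketched in Python, using a deliberately simplified and hypothetical parent table built from the classes discussed in this section. Changing a single entry moves Class Fungi, and every class beneath it, to the new position.

```python
# A deliberately simplified, hypothetical parent table (child -> parent).
parenthash = {
    "Eukaryota": None,
    "Plantae": "Eukaryota",
    "Opisthokonta": "Eukaryota",
    "Metazoa": "Opisthokonta",
    "Fungi": "Plantae",            # the historical error
    "Chytrids": "Fungi",
}


def lineage(name):
    """Return the class followed by its ancestors, nearest ancestor first."""
    names = []
    while name is not None:
        names.append(name)
        name = parenthash[name]
    return names


before = lineage("Chytrids")
parenthash["Fungi"] = "Opisthokonta"   # erase one line, construct another
after = lineage("Chytrids")
print(before)
print(after)
```

The entry for the chytrids was never touched, yet their lineage now passes through Opisthokonta rather than Plantae.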

4. Classifications are self-converging

Two 20th century discoveries have greatly influenced the modern construction of the classification of living organisms: the 1909 discovery of the Burgess shale, by Walcott; and the 1961 discovery of the genetic code, by Nirenberg and Khorana. The Burgess shale provided taxonomists with an opportunity to determine the ordered epochs in which classes of organisms came into existence and out of existence. The discovery of the genetic code led to the sequencing of nucleic acids in the genes of various organisms. This data revealed the divergence of shared genes among related organisms, and added greatly to our understanding of every class of organism living on earth today.

When we look at Ernst Haeckel's classification of living organisms, as he understood it, in 1866, we learn that pre-Darwinian biologists had produced a classification that is very similar to our modern classification of organisms. If Haeckel were alive today, he would have no trouble adjusting to modern taxonomy (Fig. 6.3).

Figure 6.3 Ernst Haeckel's rendition of the classification of living organisms, c.1866. Wikipedia, public domain.

How did pre-Darwinian taxonomists arrive so close to our modern taxonomy, without the benefit of the principles of evolution, modern paleontological discoveries, or molecular biology? For example, how was it possible for Aristotle to know, two thousand years ago, that a dolphin is a mammal, and not a fish? Aristotle studied the anatomy and the developmental biology of many different types of animals. One class of animals was distinguished by a gestational period in which a developing embryo is nourished by a placenta, and the offspring are delivered into the world as formed, but small versions of the adult animals (ie, not as eggs or larvae), and in which the newborn animals feed from milk excreted from nipples, overlying specialized glandular organs in the mother (mammae). Aristotle knew that these were features that specifically characterized one group of animals and distinguished this group from all the other groups of animals. He also knew that dolphins had all these features; fish did not. From these observations, he correctly reasoned that dolphins were a type of mammal, not a type of fish. Aristotle was ridiculed by his contemporaries for whom it was obvious that a dolphin is a fish. Unlike Aristotle, his contemporaries had based their classification on similarities, not on relationships.⁹

Whether a classification of living organisms is based on anatomic features that characterize classes of organisms, or on orthologic gene sequences (ie, gene sequences found in different species, inherited from a common ancestor), or on ancestral descent observed in rock strata, the resulting classification is nearly identical. Perhaps the greatest virtue of a classification is that regardless of the methods used to construct a classification, any two valid versions will be equivalent to one another, if the classifications represent the same reality.

5. Classifications are multidisciplinary hypothesis-generating machines

Because a classification embodies the relationships among its members (ie, its classes and instances), we can search for relationships that may extend to members of other classifications, based on shared attributes. Such tentative relationships across classifications are equivalent to cross-disciplinary hypotheses. Let's look at a few examples.

They say that coal is a nonrenewable resource. Once we've found and consumed all of the coal that lies in the ground, no additional coal will be forthcoming. Why is this the case? Why isn't the planet producing coal today, much the same way as it produced coal hundreds of millions of years ago? The answer to this question comes, surprisingly, from the classification of living organisms. It seems that about 450 million years ago, plants began to grow on land, vertically. Erect plant growth requires the structural support provided by cellulose.

The problem that faced primeval forest organisms, early in the evolution of woody plants, was the digestion of cellulose. As it happens, cellulose, the most abundant organic compound on earth, is very difficult to digest. The early woody plants had made an evolutionary breakthrough when they first began to synthesize cellulose. The cellulose molecule, when it appeared in the earliest woody plants, was a novelty that no organism could digest; hence, early trees could not decompose. When they died, they stayed in place, like carbonized mummies. Under the weight of soil, they were slowly compressed into coal.

The age of coal lasted from the time that cellulose first appeared in plants, until the time that terrestrial organisms acquired the ability to digest cellulose. Eventually, some bacteria, fungi and other simple eukaryotes began to eat dead trees, thus heralding the end of the coal age. To this day, most organisms cannot digest cellulose. Insects, ruminants, and other animals that derive energy from cellulose outsource the job to fungal or bacterial symbiotes.

The point of this story is that the energy industry, geologists, earth scientists, ecologists, and evolutionary biologists all learned important lessons when a modern-day problem related to the sustainability of energy resources was approached as an exercise in evolutionary taxonomy.

Let's try another example.

The tip of Mount Everest is the highest point on earth, with an elevation of 29,029 feet. One might expect Mount Everest to have had a volcanic origin, wherein convulsions of earth heaped forth magma that solidified into the majestic Himalayan Mountains. If that were the case, the Himalayan Mountains, Everest included, would be composed of igneous rock such as basalt, granite, gabbro, and so on. This is definitely not the case.

The Himalayas are largely composed of limestone, a sedimentary rock that is formed at the bottom of oceans, from ocean salts and dead marine life, compressed by the weight of miles of water. The limestone at the summit of Everest contains the skeletons of long-extinct classes of marine organism (eg, trilobites), and ancient species of currently extant classes of organisms (eg, small crustaceans). The fossil-bearing limestone at the top of Mount Everest is similar to the fossil-bearing limestone found in flat ranges of sedimentary rock located in sites where oceans held sway, many millions of years ago (Fig. 6.4).

Figure 6.4 Fossiliferous limestone, packed with the skeletons of ancient pelagic organisms. Wikipedia, donated into the public domain by its author, Jim Stuby (Jstuby).

Why is Mount Everest built from ancient limestone? There can only be one answer. Everest was built by a force that pushed the ocean bottom up and up and up, until a mountain of limestone poked through the clouds, to stand dazed and naked in its own glory. What force could have caused the ocean bottom to rise up? Continental drift delivered the Indian subcontinent to Asia, about 40 or 50 million years ago. As the subcontinent closed in on Asia, a body of water, the Tethys Sea, got in its way, and the sea-bottom was squeezed backwards and upwards as the inevitable collision proceeded. Thus were born the Himalayan Mountains.

The theory of continental drift owes its existence to a set of observations and clever deductions. One of those deductions was based on understanding the timeline of evolution, which in turn was based on observations of fossils in limestone strata, which in turn was the basis for the classification of terrestrial organisms. An understanding of the classification of living organisms helped generate the continental drift hypothesis, thus providing a plausible explanation for the existence of the Indian subcontinent, and of long-dead marine life now reposing on the peak of Mt. Everest.

The classification of living organisms is directly or indirectly responsible for nearly every advance in biology, genetics, and medicine in the past half-century and for many of the advances in geology, meteorology, anthropology, and agriculture. If, as some would suggest, everything on our planet relates to every other thing, then surely the classification of living organisms is the place to learn about those relationships.

6. Classifications create our reality

In the Disney retelling of a classic fairy tale, a human-made abstraction, a puppet named Pinocchio survives a series of perils and emerges as a real live boy. It seems farfetched that an abstraction could become a living biological organism, but it happens. In point of fact, the transformation of an abstract idea into a living entity is one of the most important scientific advancements of the past half-century. For the most part, this miracle of science has gone unheralded. Nonetheless, if you think very deeply about the meaning of classifications, and if you can appreciate the role played by abstractions in the governance of our physical universe, you will appreciate the profound implications of the following story. We shall see that a human-made abstraction, that we name "species," has survived a series of perils, and has emerged as a real live biological entity.

In the classification of living terrestrial organisms, the bottom classes are known as "species." There is a species class for all the horses and another species class for all the squirrels, and so on. Speculation has it that there are 50–100 million different species of organisms on planet Earth. We humans have assigned names to a few million species, a small fraction of the total.

It has been argued that nature produces individuals, not species; the concept of species being a mere figment of the human imagination, created for the convenience of taxonomists who need to group similar organisms. Biologists can collect feature data such as gene sequences, geographic habitat, diet, size, mating rituals, hair color, shape of the skull and so on, for a variety of different animals. After some analysis, perhaps performed with the aid of a computer, we could cluster animals based on their similarities, and we could assign the clusters names, and the names of our clusters would be our species. The arbitrariness of species creation comes from the various ways we might select the features to be measured in our data sets, the choice of weights assigned to the different features (eg, should we give more weight to gene sequence than to length of gestation?), and to our choice of algorithm for assigning organisms to groups (see Glossary items, K-means algorithm, Phenetics).
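The arbitrariness described above can be sketched in Python. The three animals and their feature scores are invented, unitless values, and the snippet merely finds the most similar pair under a weighted distance; it is a simplified stand-in for a clustering step, not an implementation of the K-means algorithm. Changing nothing but the weights changes which animals are grouped together.

```python
from itertools import combinations

# Invented, unitless feature scores: (gene-sequence score, gestation score).
animals = {
    "animal_1": (0.0, 0.0),
    "animal_2": (0.1, 1.0),
    "animal_3": (1.0, 0.1),
}


def closest_pair(weights):
    """Return the pair of animals with the smallest weighted squared distance."""
    def distance(a, b):
        return sum(w * (x - y) ** 2
                   for w, x, y in zip(weights, animals[a], animals[b]))

    return min(combinations(sorted(animals), 2), key=lambda pair: distance(*pair))


gene_heavy = closest_pair((1.0, 0.01))        # gene sequence dominates
gestation_heavy = closest_pair((0.01, 1.0))   # length of gestation dominates
print(gene_heavy)
print(gestation_heavy)
```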

For myself, and for many other scientists who use classification, there can be no human arbitrariness in the assignment of species.²¹ A species is a fundamental building block of the natural world, no less substantial than the concept of a galaxy to astronomers or the number "e" to mathematicians.

The modern definition of species is "an evolving gene pool." As such, species have four properties that prove that they are biological entities.

1. Unique definition. Until recently, biologists could not agree on a definition of species. There were dozens of definitions to choose from, depending on which field of science you studied. Molecular biologists defined species by gene sequence. Zoologists defined species by mating exclusivity. Ecologists defined species by habitat constraints. The current definition equating species with an evolving gene pool serves as a great unifying theory for biologists.

2. The class "species" has a biological function that is not available to individual members of the species; namely, speciation. Species propagate, and when they do, they produce new species. Species are the only biological entities that can produce new species.

3. Species evolve. Individuals cannot evolve on their own. Evolution requires a gene pool; something that species have and individuals do not.

4. Species bear a biological relationship to individual organisms. Just as species are defined as evolving gene pools, individual organisms can be defined as a set of propagating genes living within a cellular husk. Hence, the individual organism has a genome taken from the pool of genes available to its species.

The classification of living organisms has worked a true miracle, by breathing life into the concept of species, thus expanding reality.

6.5 Properties That Cross Multiple Classes

Our greatest responsibility is to be good ancestors

Jonas Salk

Perhaps the most common mistake among builders of ontologies and classifications is to confuse a subclass with a class property. This property/class confusion is not limited to ontology builders; it is a pitfall encountered by virtually every programmer who uses object-oriented languages and every data scientist who uses RDF semantics (see Glossary items Property, Mixins, RDF Schema).

To help understand the difference between a property and a class, consider the following question: "Is a leg a subclass of the class of humans?" If you answer that a leg is most definitely a subclass of the class of humans, then you probably are basing your opinion on the following line of reasoning:

Human embryos develop to produce a leg (actually, 2);

hence every human, under normal circumstances, has a leg,

and the leg, being a component of humans, is a subclass of humans.

Furthermore, a leg, being a solid object, is not an abstraction, like a property. Hence, a leg must be a solid object, and a part of humans;

Hence, a leg is a member of a subclass of the class of humans.

No, no, no! Most students new to the subject of classification will answer, incorrectly, that a leg is a subclass of humans. First off, a subclass of a class must qualify as a member of the class. A leg is not a type of human; hence, a leg is not a subclass of class humans. If "leg" is not a subclass of class humans, then is "leg" a property of class humans? Not really. A leg is a data object that can be described by a property of class humans. For example, we can assert the following triple:

Batman has_component leg

Batman is an instance of class human. As an instance of class human, Batman is endowed with the properties and relationships of class human, one of which is the ability to have things (ie, the "has_component" property).²² The "has_component" property may be defined to include the known body parts of humans, and this list of known body parts would include "leg." Annotating our triple, we might write:

Batman (a unique instance of Class Human),

has_component (a metadata property of Class Human)

leg (the data described by the "has_component" property).
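The same distinction can be sketched in an object-oriented language. The following Python fragment is only an illustration, not part of the original discussion; the class names Human and Leg and the list attribute has_component are assumptions chosen to mirror the triple above.

```python
# Hypothetical classes, introduced only to illustrate the class/property distinction.
class Human:
    """A class; each instance may be described by the has_component property."""
    def __init__(self, name):
        self.name = name
        self.has_component = []   # a property of Class Human


class Leg:
    """A data object in its own right; it is not a type of Human."""


batman = Human("Batman")            # Batman is an instance of Class Human
batman.has_component.append("leg")  # "leg" is data described by the property

# The triple: identified object, metadata, data
triple = (batman.name, "has_component", batman.has_component[0])
print(triple)

# A subclass must qualify as a member of its parent class; Leg does not.
print("Is Leg a subclass of Human?", issubclass(Leg, Human))
```

In this sketch, "leg" appears only as a value attached to an instance through a property; nothing in the class hierarchy makes a leg a kind of human.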

At this point, you may be thinking that issues related to class creation and to property assignment are just too pedantic to ponder. Consider this. Abraham Lincoln was once asked, "If you call a tail a leg, then how many legs does a horse have?" Lincoln replied, "Four, because calling a tail a leg doesn't make it one." Lincoln understood that the process of classification is neither hypothetical nor abstract. Let me try to explain the importance of understanding how classifications are organized, with the following story from my own life.

In the last decades of the 20th century, there was general agreement that the war on cancer had failed to discover drugs that were effective against advanced cases of any of the common cancers of humans (eg, lung, colon, breast, prostate). The U.S. cancer death rate was rising despite the best efforts of the U.S. National Cancer Institute. A new approach seemed warranted, and there was a growing consensus for a divide-and-conquer strategy in which treatments would be developed for biologically distinct subclasses of cancers. The idea seemed reasonable, but how would cancer researchers subclassify cancers? Several ideas were considered, but one idea gained traction. Cancers would be subclassified by their clinical stage.

Some background is needed regarding the meaning of cancer stage. When a cancer is diagnosed, a determination is made regarding its size, the extent of local invasion, whether tumor deposits can be found in regional or distant lymph nodes, and whether the cancer has metastasized to distant organs. As an example, a stage 3 lung cancer is one that has spread to lymph nodes. Stage 3 lung cancers are further divided into Stage 3A tumors, which have spread to lymph nodes confined to the same side of the body as the primary cancer; and Stage 3B tumors, where the tumor is of any size and has spread to distant lymph nodes and invaded chest structures other than the lungs, such as the heart or esophagus. Because staging requires an accurate, well-documented assessment of the extent of spread of the tumor at the time of diagnosis, the staging process necessitates the professional services of radiologists, pathologists, oncologists, surgeons, and nurses. Furthermore, the information upon which the staging was determined would need to be reviewed by a set of experts, to verify the accuracy of the original reports.

When plans for subclassifying tumors by stage were being discussed, I indicated that the project made very little sense, from my point of view as an information scientist. The stage of a tumor is a class property; not a true class. I argued that each stage 3 cancer grew from a much smaller cancer (ie, progressed through stages over time). Given the natural course of tumor growth and metastasis, most stage 3 cancers will progress into stage 4 cancers. True classes are not transitive. The progression of cancer through stages implies that stages could not be used to classify cancers! Furthermore, a collection of patients with stage 3 cancers will include some with fast-growing cancers and others with slow-growing cancers; some cancers prone to distant metastasis and other cancers unlikely to spread to distant organs. I argued that it made no sense to lump cancers with widely different biological properties into the same class on the basis of one shared property (ie, stage) observed at one particular moment in time.

Try as I might, I failed to dissuade my colleagues from their declared course of action. The results, as you might assume, were mixed. Treatments designed for particular stages of tumor growth were effective in some individuals, ineffective in others. Molecular profiles of staged cancers showed that within any stage, some tumors shared similar profiles; other tumors did not. I would hazard to guess that every set of patients with same-stage cancer consists of a blend of undefined subclasses. Because the trials were not designed to treat a true biological class of tumors, they were predestined to produce mixed results; an artifact of class noise.

Class noise refers to inaccuracies (eg, misleading results) introduced in the analysis of classified data due to errors in class assignments (eg, assigning a data object to class A when the object should have been assigned to class B). If you are testing the effectiveness of an antibiotic on a class of people with bacterial pneumonia, the accuracy of your results will be jeopardized if your study population includes subjects with viral pneumonia, or smoking-related lung damage (see Glossary items, Blended class, Simpson's paradox).
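A small arithmetic sketch shows how a blended class dilutes an observed effect. The response rates and group sizes below are invented for illustration only; they are not taken from any study.

```python
def observed_response_rate(groups):
    """groups: list of (number_of_subjects, true_response_rate) pairs.

    Returns the response rate an analyst would measure if all of the
    groups were treated as a single class.
    """
    total = sum(n for n, _ in groups)
    responders = sum(n * rate for n, rate in groups)
    return responders / total


# Assumed, purely illustrative numbers:
# the antibiotic helps 90% of bacterial pneumonia cases,
# while 10% of viral pneumonia cases improve regardless of treatment.
pure_class = [(100, 0.90)]                # bacterial pneumonia only
blended_class = [(50, 0.90), (50, 0.10)]  # half of the subjects misassigned

print("Pure class:   ", observed_response_rate(pure_class))
print("Blended class:", observed_response_rate(blended_class))
```

Under these assumptions, the measured effectiveness of the drug falls sharply, although the drug itself has not changed; only the class assignments have.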

Some of the most promising technologies have yielded little or no scientific advancements simply because their practitioners have paid insufficient attention to the distinctions between classes and properties. The past half-century has seen incredible advances in the field of brain imaging, including the introduction of computed tomography and nuclear magnetic resonance imaging. Scientists can now determine the brain areas that are selectively activated for specific physiologic functions. These imaging techniques include: positron emission tomography, functional magnetic resonance imaging, multichannel electroencephalography, magnetoencephalography, near infrared spectroscopic imaging, and single photon emission computed tomography. With all of these available technologies, you would naturally expect that neuroscientists would be able to correlate psychiatric conditions with abnormalities in function, mapped to specific areas of the brain. Indeed, the brain research literature has seen hundreds, if not thousands, of early studies purporting to find associations that link brain anatomy to psychiatric diseases. Alas, none of these early findings has been validated. Excluding degenerative brain conditions (eg, Alzheimer's disease, Parkinson's disease), there is, at the present time, no known psychiatric condition that can be consistently associated with a specific functional brain deficit or anatomic abnormality.²³ The reason for the complete lack of validation for what seemed to be a highly promising field of research, pursued by an army of top scientists, is a deep and disturbing mystery.

In 2013, a new version of the Diagnostic and Statistical Manual of Mental Disorders (DSM) was released. The DSM is the standard classification of psychiatric disorders, and is used by psychiatrists and other healthcare professionals worldwide. The new version was long in coming, following its previous version by 20 years. Spoiling the fanfare for the much-anticipated update was a chorus of loud detractors, who included among their ranks a host of influential and respected neuroscientists. Their complaint was that the DSM classifies diagnostic entities based on collections of symptoms; not on biological principles. For every diagnostic entity in the DSM, all persons who share the same collection of symptoms will, in most cases, be assigned the same diagnosis; even when the biological causes of the symptoms are unknown or unrelated.

When individuals with unrelated diseases are studied together, simply because they have some symptoms in common, the results of the study are unlikely to have any validity.¹⁹ Dr. Thomas Insel, a former Director of the National Institute of Mental Health, was quoted as saying, "As long as the research community takes the DSM to be a bible, we'll never make progress."²⁴ Apparently, the creators of the DSM do not understand the distinction between a property and a class.⁹

Returning to the definition of classification, we notice that there is nothing to say that a property for one class cannot also be a property for some unrelated class. A single property can belong to multiple unrelated classes, and to every class that descends from a class in which the property is defined (see Glossary item, Nonphylogenetic property). Knowing this, we can search classified data sets for properties shared by unrelated classes, and we can compare how values of properties vary among the different classes.

Object-oriented programming languages make good use of properties that cross unrelated classes. In object-oriented programming, methods are a type of property (see Glossary item, Method). A method can be assigned to a class (ie, a class method), or it can be placed into the repertoire of one or more classes without being assigned as a class method; sort of a freelance subroutine. Methods that apply to unrelated classes are called Mixins. Mixins enable programmers to extend the functionality of classes, without producing replicate class methods, a technique called compositional programming or layering.

Just as Mixins permit unrelated classes to have identical methods, properties permit unrelated classes to have identical features. In Section 7.4, we will see the semantic approach to assigning properties to data objects (including instances and classes), and how this simple technique, when thoughtfully applied, eliminates the hazards of confusing classes with properties.
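As a minimal sketch of the idea, here is how a Mixin might look in Python, where Mixins are usually written as small classes whose methods are borrowed by otherwise unrelated classes. The names DescribeMixin, Species, and Galaxy are hypothetical, chosen only for this illustration.

```python
class DescribeMixin:
    """A freelance method, available to any class that includes the Mixin."""
    def describe(self):
        return f"{type(self).__name__}: {self.name}"


class Species(DescribeMixin):
    def __init__(self, name):
        self.name = name


class Galaxy(DescribeMixin):
    def __init__(self, name):
        self.name = name


# Two unrelated classes now share an identical method,
# without either class being an ancestor of the other.
print(Species("Homo sapiens").describe())
print(Galaxy("Andromeda").describe())
print("Is Species a subclass of Galaxy?", issubclass(Species, Galaxy))
```

The describe method is written once and behaves as a shared property of both classes; the class hierarchy itself is left untouched.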

Open Source Tools

Perl: The only language that looks the same before and after RSA encryption.

Keith Bostic

Syntax for Triples

Good specifications will always improve programmer productivity far better than any programming tool or technique.

Milt Bryce

RDF is a specialized XML syntax for creating computer-parsable files consisting of triples. The subject of an RDF triple is invoked with the rdf:about attribute. Following the subject is a metadata/data pair.

Let us create an RDF triple whose subject is the jpeg image file specified as: http://www.the_url_here.org/ldip/ldip2103.jpg. The metadata is <dc:title> and the data value is "Normal Lung".

<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
</rdf:Description>

An example of three triples in RDF syntax is:

<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
</rdf:Description>

<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:creator>Bill Moore</dc:creator>
</rdf:Description>

<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:date>2006-06-28</dc:date>
</rdf:Description>

RDF permits you to collapse multiple triples that apply to a single subject. The following rdf:Description statement is equivalent to the three prior triples:

<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>
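The collapsing step is mechanical, and can be sketched in a few lines of Python. The function name collapse and the list-of-tuples representation are assumptions made for this illustration; the sketch simply groups metadata/data pairs by subject and writes one rdf:Description element per subject.

```python
from xml.sax.saxutils import escape, quoteattr

# The three triples from the text, as (subject, metadata, data) tuples
subject = "http://www.the_url_here.org/ldip/ldip2103.jpg"
triples = [
    (subject, "dc:title", "Normal Lung"),
    (subject, "dc:creator", "Bill Moore"),
    (subject, "dc:date", "2006-06-28"),
]


def collapse(triples):
    """Group triples by subject; emit one rdf:Description per subject."""
    grouped = {}
    for subj, metadata, data in triples:
        grouped.setdefault(subj, []).append((metadata, data))
    blocks = []
    for subj, pairs in grouped.items():
        lines = [f"<rdf:Description rdf:about={quoteattr(subj)}>"]
        lines += [f"  <{m}>{escape(d)}</{m}>" for m, d in pairs]
        lines.append("</rdf:Description>")
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


print(collapse(triples))
```

The output is a fragment only; a complete RDF document would still need the enclosing rdf:RDF element and its namespace declarations.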

An example of a short but well-formed RDF image specification document is:

<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.the_url_here.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>
</rdf:RDF>

The first line tells you that the document is XML. The second line tells you that the XML document is an RDF resource. The third and fourth lines are the namespace documents that are referenced within the document (see Glossary item, Namespace). Following that is the RDF statement that we have already seen.

If you think that RDF syntax creates an awful lot of text to represent a small amount of data, then your thoughts echo those of every data scientist who has trod your path. No sooner was RDF developed than a succession of simplified syntaxes, designed to convey the equivalent information, followed.

From RDF came a simplified syntax for triples, known as Notation 3 or n3.²⁵ From n3 came another syntactic form, thought to fit more closely to RDF, known as Turtle. From Turtle came an even more simplified form, known as N-Triples. All of these metamorphoses tell us something about the limitations and the values of syntactic rules for specifications:

1. It is impossible to write a syntax that pleases everyone.

2. Never assume that universally adopted syntaxes are permanent; data scientists are notoriously fickle.

3. Never assume that a syntax that you believe to be universally adopted is as popular as you may have been led to believe.

4. Well-designed specifications are fungible. You can always write a short script that will transform your data into any syntax you prefer.

In this book, we will use a syntax that closely resembles N-triples. N-triples are much easier to enter into a text file than RDF, and much easier to read than RDF. N-triples obey the one triple per line rule. If you know the number of lines in your triple file, you know the number of triples.

Let's look at a very short file, image.n3, containing a few triples that describe a medical image that has been archived in a hospital's electronic records system:

@prefix : <http://www.someplace.org/image_schema.rdf#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
:Baltimore_Hospital_Center rdf:type "Hospital".
:Baltimore_Hospital_Center_4357 rdf:type "Unique_medical_identifier".
:Baltimore_Hospital_Center_4357 :patient_name "Sam_Someone".
:Baltimore_Hospital_Center_4357 :surgical_pathology_specimen "S3456_2001".
:S_3456_2001 rdf:type "Surgical_pathology_specimen".
:S_3456_2001 :image <https://baltohosp.org/pathology/y49w3p2.jpg>.
:S_3456_2001 :log_in_date "2001-08-15".
:S_3456_2001 :clinical_history "30_years_oral_tobacco_use".
<https://baltohosp.org/pathology/y49w3p2.jpg> rdf:type "Medical_image".
<https://baltohosp.org/pathology/y49w3p2.jpg> :specimen "2".
<https://baltohosp.org/pathology/y49w3p2.jpg> :block "3".
<https://baltohosp.org/pathology/y49w3p2.jpg> :format "jpeg".
<https://baltohosp.org/pathology/y49w3p2.jpg> :width "524_pixels".
<https://baltohosp.org/pathology/y49w3p2.jpg> :height "429_pixels".

The file is fairly easy to read. Let's look at one triple:

<https://baltohosp.org/pathology/y49w3p2.jpg> rdf:type "Medical_image".

Here, the identified subject of the triple is the URL: <https://baltohosp.org/pathology/y49w3p2.jpg>. The data pertaining to the subject of the triple is "Medical_image". The metadata describing the data is "rdf:type". The triple tells us that the identified subject is a type of medical image. Notice that, like any complete sentence, the triple ends with a period.

You might be wondering about the "rdf:" prefix for the "type" metadata tag. The "rdf:" prefix tells us that we can look at the top of the file, where the prefix namespaces are located, to learn that the prefix "rdf:" is defined in a web document, "http://www.w3.org/1999/02/22-rdf-syntax-ns#". Rather than attaching the full web address of that document to every triple that invokes metadata described in the document, we use a prefix of our own choice, "rdf:".

You will notice that some triples use an even simpler prefix, ":". At the top of the file, we see that the ":" prefix is assigned to the namespace defined at the web site, "<http://www.someplace.org/image_schema.rdf#>". Don't try to find this web site; it is fictitious.
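To make the role of the prefixes concrete, here is a minimal Python sketch that reads a few lines in the style of image.n3 and expands each prefixed name into its full form. It is not a general N3 or Turtle parser; the regular expressions and the function names expand and parse are assumptions that hold only for the simple one-triple-per-line layout used in this chapter.

```python
import re

# A few lines copied from image.n3
sample = '''@prefix : <http://www.someplace.org/image_schema.rdf#>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
:S_3456_2001 rdf:type "Surgical_pathology_specimen".
:S_3456_2001 :log_in_date "2001-08-15".
<https://baltohosp.org/pathology/y49w3p2.jpg> rdf:type "Medical_image".
'''

PREFIX = re.compile(r'@prefix\s+(\w*):\s*<([^>]*)>\s*\.')
TRIPLE = re.compile(r'(\S+)\s+(\S+)\s+(.+?)\s*\.$')


def expand(term, prefixes):
    """Return the full form of a URL, a quoted literal, or a prefixed name."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]                 # already a full URL
    if term.startswith('"'):
        return term.strip('"')            # literal data
    prefix, _, local = term.partition(":")
    return prefixes[prefix] + local       # prefixed name


def parse(text):
    """Collect the prefix declarations, then expand every triple."""
    prefixes, triples = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = PREFIX.match(line)
        if match:
            prefixes[match.group(1)] = match.group(2)
            continue
        match = TRIPLE.match(line)
        if match:
            triples.append(tuple(expand(t, prefixes) for t in match.groups()))
    return triples


for subject, metadata, data in parse(sample):
    print(subject, metadata, data, sep="\n  ")
```

Because the file obeys the one-triple-per-line rule, the number of triples returned equals the number of lines that are not prefix declarations.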

RDF Schema

When a triple indicates that an object is a member of a certain class of objects, there must be some document that defines the class and all of its class relationships. In theory, the document that contains the triple could contain triples that define classes to which the objects in the document are assigned. Doing so would be tedious and counterproductive, because it would require a repetition of pertinent class definitions for every file in which triples are assigned the class. It makes much more sense to have accessible documents that contain class definitions that could be linked from any and all collections of triples assigned to classes. Insofar as classes will be defined, in part, by attributes (ie, features of objects) shared among class instances, it would be useful to include definitions of class properties within the same document that defines classes.

An RDF Schema is a document that lists the classes and the properties that pertain to triples residing in other documents. The properties of an RDF Schema are the metadata descriptors appearing in triples whose objects are instances of the classes listed in the RDF schema. Elements in an RDF schema may be subclasses of elements in other RDF schemas.

As the name implies, an RDF Schema can be written in formal RDF syntax. In practice, many of the so-called RDF Schema documents found on the web are prepared in alternate formats. They are nominally RDF syntax because they create a namespace for classes and properties referred to by triples listed in RDF documents.

Here is a short RDF schema, written as Turtle triples, and held in a fictitious web site, "http://www.fictitious_site.org/schemas/life#"

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.fictitious_site.org/schemas/life#> .
:Homo rdf:type rdfs:Class.
:HomoSapiens rdf:type rdfs:Class;
    rdfs:subClassOf :Homo.

Turtle triples have a somewhat different syntax than N-triples or N3 triples. As you can see, the Turtle triple resembles RDF syntax in form, allowing for nested metadata/data pairs assigned to the same object. Turtle triples use less verbiage than RDF, yet convey equivalent information. In this minimalist RDF Schema, we specify two classes that would normally be included in the much larger classification of living organisms: Homo and HomoSapiens.

A triple that refers to our "http://www.fictitious_site.org/schemas/life#" Schema might look something like this:

:Batman rdf:type <http://www.fictitious_site.org/schemas/life#HomoSapiens>.

The triple asserts that Batman is an instance of Homo Sapiens. The data "HomoSapiens" links us to the RDF Schema, which in turn tells us that HomoSapiens is a class and is the subclass of Class Homo.
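The inference described here, from an instance to its class and then upward through the schema, can be sketched in Python. The two dictionaries below are a hypothetical stand-in for the schema and the instance triple; a real application would fill them by parsing the documents.

```python
# Hypothetical in-memory stand-ins for the schema and the instance triple
parent_of = {"HomoSapiens": "Homo"}    # class -> parent class (rdfs:subClassOf)
class_of = {"Batman": "HomoSapiens"}   # instance -> assigned class


def ancestors(cls):
    """Follow subClassOf links upward until a class with no listed parent."""
    lineage = []
    while cls in parent_of:
        cls = parent_of[cls]
        lineage.append(cls)
    return lineage


def classes_of(instance):
    """The assigned class of an instance, followed by every ancestor class."""
    cls = class_of[instance]
    return [cls] + ancestors(cls)


print(classes_of("Batman"))
```

Nothing in the triple itself mentions Class Homo; the membership follows from the schema, which is the point of keeping class definitions in a shared document.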

RDF Parsers

RDF documents can be a pain to create, but they are very easy to parse. Even in instances when an RDF file is composed of an off-kilter variant of RDF, it is usually quite easy to write a short script that will parse through the file, extracting triples, and using the components of the triples to serve the programmer's goals. Such goals may include: counting occurrences of items in a class, finding properties that apply to specific subsets of items in specific classes, or merging triples extracted from various triplestore databases (see Glossary item, Triplestore).

The rdflib package is an RDF parser written for Python. Insofar as rdflib is not bundled in the standard Python distribution, Python users can download the package via the Python Package Index at: https://pypi.python.org/pypi/rdflib

Easier yet, the rdflib package can be installed from the command line, using the pip installer for Python, discussed in Open Source Tools for Chapter 1, and shown again here:

c:\ftp>pip install rdflib

To demonstrate the parsing method, let's start with an RDF file, that we will call rdf_example.xml:

<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http://www.recshop.fake/cd#">
<rdf:Description
rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
<cd:artist>Bob Dylan</cd:artist>
<cd:country>USA</cd:country>
<cd:company>Columbia</cd:company>
<cd:price>10.90</cd:price>
<cd:year>1985</cd:year>
</rdf:Description>
<rdf:Description
rdf:about="http://www.recshop.fake/cd/Hide your heart">
<cd:artist>Bonnie Tyler</cd:artist>
<cd:country>UK</cd:country>
<cd:company>CBS Records</cd:company>
<cd:price>9.90</cd:price>
<cd:year>1988</cd:year>
</rdf:Description>
</rdf:RDF>

The Python script, rdf_parse.py, imports the rdflib package and extracts each triple, and divides the triple into three lines:

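#!/usr/bin/python
import rdflib

g = rdflib.Graph()
# parse() replaces the load() call used with older rdflib releases
g.parse('rdf_example.xml', format='xml')
# iterating over the graph yields one (subject, predicate, object) triple at a time
for subject, predicate, obj in g:
    print("Identified subject -", subject)
    print("Metadata -", predicate)
    print("Data -", obj)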

Here are a few lines of output generated by the rdf_parse.py script:

c:\ftp>rdf_parse.py

No handlers could be found for logger "rdflib.term"

Identified subject - http://www.recshop.fake/cd/Hide your heart

Metadata - http://www.recshop.fake/cd#artist

Data - Bonnie Tyler

Identified subject - http://www.recshop.fake/cd/Hide your heart

Metadata - http://www.recshop.fake/cd#year

Data - 1988

Identified subject - http://www.recshop.fake/cd/Hide your heart

Metadata - http://www.recshop.fake/cd#company

Data - CBS Records

Identified subject - http://www.recshop.fake/cd/Empire Burlesque

Metadata - http://www.recshop.fake/cd#price

Data - 10.90

Identified subject - http://www.recshop.fake/cd/Empire Burlesque

Metadata - http://www.recshop.fake/cd#country

Data - USA

It is worth repeating that good specifications are fungible. One can be transformed into another. Just as it is possible to convert RDF to n3 triples, it is possible to convert n3 triples to RDF.
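As a minimal sketch of the first direction, from RDF to one-triple-per-line text, the following Python fragment uses only the standard library's ElementTree module rather than rdflib. It assumes the flat rdf:Description layout of rdf_example.xml (an abbreviated copy is embedded below), and the function name rdf_to_lines is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# An abbreviated copy of rdf_example.xml
rdf_xml = '''<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:cd="http://www.recshop.fake/cd#">
  <rdf:Description rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
    <cd:artist>Bob Dylan</cd:artist>
    <cd:year>1985</cd:year>
  </rdf:Description>
</rdf:RDF>'''


def rdf_to_lines(xml_text):
    """Write each metadata/data pair of each rdf:Description as one line."""
    root = ET.fromstring(xml_text)
    lines = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for child in description:
            # ElementTree stores a qualified tag as "{namespace}localname"
            namespace, local = child.tag[1:].split("}")
            data = (child.text or "").strip()
            lines.append(f'<{subject}> <{namespace}{local}> "{data}" .')
    return lines


for line in rdf_to_lines(rdf_xml):
    print(line)
```

The lines produced resemble N-triples, but the sketch does no validation; for example, it passes along the space in the subject's web address unchanged.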

Here is a Perl script, RDF_n3.pl, that converts the image.n3 file, composed as Notation 3, into long RDF syntax. You will need to pre-install Perl's RDF::Notation3 module, available from CPAN.²⁶

#!/usr/local/bin/perl

use RDF::Notation3::Triples;

$path = "image.n3";

$rdf = RDF::Notation3::Triples->new();

$rdf->parse_file($path);

$triples = $rdf->get_triples;

use RDF::Notation3::XML;

$rdf = RDF::Notation3::XML->new();

$rdf->parse_file($path);

$string = $rdf->get_string;

print $string;

exit;

The RDF_n3.pl script operates on the image.n3 file, a list of N3 triples shown earlier in the Open Source Tools for this chapter. Here is the truncated output, in RDF format:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description
rdf:about="http://www.pathologyinformatics.org/image_schema.rdf#Baltimore_Hospital_Center">
<rdf:type
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">Hospital</rdf:type>
</rdf:Description>
<rdf:Description
rdf:about="http://www.pathologyinformatics.org/image_schema.rdf#Baltimore_Hospital_Center_4357">
<rdf:type
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">Unique_medical_identifier</rdf:type>
</rdf:Description>
<rdf:Description
rdf:about="http://www.pathologyinformatics.org/image_schema.rdf#Baltimore_Hospital_Center_4357">
<patient_name

Visualizing Class Relationships

When working with classifications or ontologies, it is useful to have an image that represents the relationships among the classes. GraphViz is an open source software utility that produces graphic representations of object relationships (see Glossary item, Object relationships).

GraphViz can be downloaded from: http://www.graphviz.org/

GraphViz comes with a set of applications that generate graphs of various styles. Here is an example of a Graphviz dot file, number.dot, constructed in Graphviz syntax.²⁷ Aside from a few lines that provide instructions for line length and graph size, the dot file is a list of classes and their child classes.

digraph G {

size="7,7";

Object -> Numeric;

Numeric -> Integer;

Numeric -> Float;

Integer -> Fixnum

Integer -> Bignum

}

After the Graphviz exe file (version graphviz-2.14.1.exe, on my computer) is installed, you can launch the various Graphviz methods as command lines from its working directory, or through a system call from within a script (see Glossary item, Exe file).

c:\ftp\dot>dot -Tpng number.dot -o number.png

The command line tells Graphviz to use the dot method to produce a rendering of the number.dot text file, saved as an image file, with filename number.png. The output file contains a class hierarchy, beginning with the highest class and branching until it reaches the lowest descendant class (Fig. 6.5).

Figure 6.5 A class hierarchy, described by the number.dot file and converted to a visual file, using Graphviz.

With a glance, we see that the highest class is Class Object. Class Object has one child class, Class Numeric. Numeric has two child classes, Class Integer and Class Float. Class Integer has two child classes, Class Fixnum and Class Bignum. You might argue that a graphic representation of classes was unnecessary; the textual listing of class relationships was all that you needed. Maybe so, but when the class structure becomes complex, graphic visualization can greatly simplify your understanding of the relationships among classes.

Here is a visualization of a classification of human neoplasms (Fig. 6.6). It was produced by Graphviz, from a .dot file containing a list of classes and their subclasses, and rendered with the "twopi" method, shown:

c:\ftp>twopi -Tpng neoplasms.dot -o neoplasms_classes.png

Figure 6.6 A visualization of relationships in a classification of tumors. The image was rendered with the Graphviz utility, using the twopi method, which produced a radial classification, with the root class in the center.

We can look at the graphic version of the classification and quickly make the following observations:

1. The root class (ie, the ancestor to every class) is Class Neoplasm. The Graphviz utility helped us find the root class, by placing it in the center of the visualization.

2. Every class is connected to other classes. There are no classes sitting out in space, unrelated to other classes.

3. Every class that has a parent class has exactly one parent class.

4. There are no recursive branches to the graph (eg, the ancestor of a class cannot also be a descendant of the class).

If we had only the textual listing of class relationships, without benefit of a graphic visualization, it would be very difficult for a human to verify, at a glance, the internal logic of the classification.

With a few tweaks to the neoplasms.dot Graphviz file, we can create a nonsensical graphic visualization (Fig. 6.7):

Figure 6.7 A corrupted classification, that might qualify as a valid ontology.

Notice that one cluster of classes is unconnected to the other, indicating that class Endoderm/Ectoderm has no parent classes. Elsewhere, Class Mesoderm is both child and parent to Class Neoplasm. Class Melanocytic and Class Molar are each the child class to two different parent classes. At a glance, we have determined that the classification is highly flawed. The visualization simplified the relationships among classes, and allowed us to see where the classification went wrong. Had we only looked at the textual listing of classes and subclasses, we may have missed some or all of the logical flaws in our classification.
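The same checks that the eye performs on the picture can be sketched as a short Python function operating on a list of parent/child pairs, such as the lines of a .dot file. The function name check_classification and the abbreviated "corrupted" list are assumptions made for illustration; the corrupted list borrows a few class names from the description above and is not the actual file behind Fig. 6.7.

```python
def check_classification(edges):
    """edges: list of (parent, child) pairs. Returns a list of problems found."""
    parents, nodes = {}, set()
    for parent, child in edges:
        nodes.update((parent, child))
        parents.setdefault(child, set()).add(parent)

    problems = []
    # A classification has exactly one root (a class with no parent).
    roots = sorted(n for n in nodes if n not in parents)
    if len(roots) != 1:
        problems.append(f"expected exactly one root class, found {roots}")
    # Every class that has a parent has exactly one parent.
    for child, found in sorted(parents.items()):
        if len(found) > 1:
            problems.append(f"{child} has multiple parents: {sorted(found)}")
    # No class may be its own ancestor: walk upward from each class.
    for node in sorted(nodes):
        seen, frontier = set(), [node]
        while frontier:
            current = frontier.pop()
            for p in parents.get(current, ()):
                if p == node:
                    problems.append(f"{node} is its own ancestor")
                    frontier = []
                    break
                if p not in seen:
                    seen.add(p)
                    frontier.append(p)
    return problems


# The class hierarchy of number.dot
numbers = [("Object", "Numeric"), ("Numeric", "Integer"), ("Numeric", "Float"),
           ("Integer", "Fixnum"), ("Integer", "Bignum")]

# An abbreviated, hypothetical corruption in the spirit of Fig. 6.7
corrupted = [("Neoplasm", "Mesoderm"), ("Mesoderm", "Neoplasm"),
             ("Neoplasm", "Molar"), ("Trophectoderm", "Molar")]

print(check_classification(numbers))
for problem in check_classification(corrupted):
    print(problem)
```

A script of this kind does not replace the visualization, but it can flag multiple parents and circular ancestry in files too large to inspect by eye.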

At this point, you might be thinking that visualizations of class relationships are nice, but who has the time and energy to create the long list of classes and subclasses, in Graphviz syntax, that are the input files for the Graphviz methods?

Now comes one of the great payoffs of data specifications. You must remember that good data specifications are fungible. A modestly adept programmer can transform a specification into whatever format is necessary to do a particular job. In this case, the classification of neoplasms had been specified as an RDF Schema. An RDF Schema is a document that includes the definitions of classes and properties, with each class provided with the name of its parent class and each property provided with its domain (ie, the classes to which the property applies). Because class relationships in an RDF Schema are specified, it is easy to transform an RDF Schema into a .dot file suitable for Graphviz.

Here is a short RDF parsing script, dot.pl, written in Perl that takes an RDF Schema (contained in the plain-text file, schema.txt) and produces a Graphviz .dot file, named schema.dot.

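#!/usr/bin/perl
open (TEXT, "schema.txt") or die "Cannot open schema.txt: $!";
open (OUT, ">schema.dot") or die "Cannot open schema.dot: $!";
# read the schema one class definition at a time
$/ = "</rdfs:Class>";
print OUT "digraph G {\n";
print OUT "size=\"15,15\";\n";
print OUT "ranksep=\"2.00\";\n";
$line = " ";
while ($line ne "")
  {
  $line = <TEXT>;
  last if ($line !~ /<rdfs:/);
  # the parent class is named in the resource attribute
  if ($line =~ /:resource="[a-z0-9:\/\_.-]*#([a-z\_]+)"/i)
    {
    $father = $1;
    }
  # the class being defined is named in the rdf:ID attribute
  if ($line =~ /rdf:ID="([a-z\_]+)"/i)
    {
    $child = $1;
    }
  print OUT "$father -> $child;\n";
  print "$father -> $child;\n";
  }
print OUT "}";
exit;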

The first 15 lines of output of the dot.pl script:

digraph G {

size="15,15";

ranksep="2.00";

Class -> Tumor_classification;

Tumor_classification -> Neoplasm;

Tumor_classification -> Unclassified;

Neural_tube -> Neural_tube_parenchyma;

Mesoderm -> Sub_coelomic;

Neoplasm -> Endoderm_or_ectoderm;

Unclassified -> Syndrome;

Neoplasm -> Neural_crest;

Neoplasm -> Germ_cell;

Neoplasm -> Pluripotent_non_germ_cell;

Sub_coelomic -> Sub_coelomic_gonadal;

Trophectoderm -> Molar;

The full schema.dot file, not shown, is suitable for use as an input file for the Graphviz utility.
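For readers who prefer Python, the same transformation can be sketched as follows. The schema fragment embedded in the script is hypothetical; its layout (an rdf:ID attribute naming the class and an rdf:resource attribute naming the parent) is an assumption inferred from the patterns that dot.pl matches, and the function name schema_to_dot is invented for this example.

```python
import re

# Hypothetical schema fragment, laid out the way dot.pl expects
schema_text = '''
<rdfs:Class rdf:ID="Tumor_classification">
  <rdfs:subClassOf rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Neoplasm">
  <rdfs:subClassOf rdf:resource="#Tumor_classification"/>
</rdfs:Class>
'''


def schema_to_dot(text):
    """Return Graphviz dot text with one 'parent -> child;' line per class."""
    lines = ["digraph G {", 'size="15,15";', 'ranksep="2.00";']
    # each record runs from an opening rdfs:Class tag to its closing tag
    for record in re.findall(r"<rdfs:Class\b.*?</rdfs:Class>", text, flags=re.S):
        child = re.search(r'rdf:ID="([A-Za-z_]+)"', record)
        parent = re.search(r'rdf:resource="[^"#]*#([A-Za-z_]+)"', record)
        if child and parent:
            lines.append(f"{parent.group(1)} -> {child.group(1)};")
    lines.append("}")
    return "\n".join(lines)


print(schema_to_dot(schema_text))
```

Unlike dot.pl, this sketch skips any class record that lacks either attribute, rather than reusing the value found in the previous record.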

Glossary

Abstraction In the context of object-oriented programming, abstraction is a technique whereby a method is simplified to a generalized form that is applicable to a wide range of objects, but for which the specific characteristics of the object receiving the method may be used to return a result that is suited to the object. Abstraction, along with polymorphism, encapsulation, and inheritance, are essential features of object-oriented programming languages. See Polymorphism. See Inheritance. See Encapsulation.

Annotation Annotation involves associating data with additional data to provide description, disambiguation (eg, adding identifiers to distinguish the data from other data), links to related data, or timestamps to mark when the data was created. One of the most important functions of annotation is to provide data elements with metadata, facilitating our ability to find relationships among different data objects. Annotation is vital for effective search and retrieval of large and complex sets of data.

Autocoding When nomenclature coding is done automatically, by a computer program, the process is known as "autocoding" or "autoencoding." See Coding. See Nomenclature. See Autoencoding.

Autoencoding Synonym for autocoding. See Autocoding.

Beauty To mathematicians, beauty and simplicity are virtually synonymous, both conveying the idea that someone has managed to produce something of great meaning or value from a minimum of material. Euler's identity, relating e, i, pi, 0, and 1 in a simple equation, is held as an example of beauty in mathematics. When writing this book, I was tempted to give it the title, "The Beauty of Data," but I feared that a reductionist flourish, equating data simplification with beauty, was just too obscure.

Blended class Also known as class noise, subsumes the more familiar, but less precise term, "Labeling error." Blended class refers to inaccuracies (eg, misleading results) introduced in the analysis of data due to errors in class assignments (ie, assigning a data object to class A when the object should have been assigned to class B). If you are testing the effectiveness of an antibiotic on a class of people with bacterial pneumonia, the accuracy of your results will be forfeit when your study population includes subjects with viral pneumonia, or smoking-related lung damage. Errors induced by blending classes are often overlooked by data analysts who incorrectly assume that the experiment was designed to ensure that each data group is composed of a uniform population. A common source of class blending occurs when the classification upon which the experiment is designed is itself blended. For example, imagine that you are a cancer researcher and you want to perform a study of patients with malignant fibrous histiocytomas (MFH), comparing the clinical course of these patients with the clinical course of patients who have other types of tumors. Let's imagine that the class of tumors known as MFH does not actually exist; that it is a grab-bag term erroneously assigned to a variety of other tumors that happened to look similar to one another. This being the case, it would be impossible to produce any valid results based on a study of patients diagnosed as having MFH. The results would be a biased and irreproducible cacophony of data collected across different, and undetermined, classes of tumors. Believe it or not, this specific example, of the blended MFH class of tumors, is selected from the real-life annals of tumor biology.²⁸,²⁹ The literature is rife with research of dubious quality, based on poorly designed classifications and blended classes. A detailed discussion of this topic is found in Section 6.5, Properties that Cross Multiple Classes. One caveat. 
Efforts to eliminate class blending can be counterproductive if undertaken with excess zeal. For example, in an effort to reduce class blending, a researcher may choose groups of subjects who are uniform with respect to every known observable property. For example, suppose you want to actually compare apples with oranges. To avoid class blending, you might want to make very sure that your apples do not include any cumquats, or persimmons. You should be certain that your oranges do not include any limes or grapefruits. Imagine that you go even further, choosing only apples and oranges of one variety (eg, Macintosh apples and navel oranges), size (eg, 10 cm), and origin (eg, California). How will your comparisons apply to the varieties of apples and oranges that you have excluded from your study? You may actually reach conclusions that are invalid and irreproducible for more generalized populations within each class. In this case, you have succeeded in eliminating class blending, while losing representative populations of the classes. See Simpson's paradox.

Bootstrapping The act of self-creation, from nothing. The term derives from the ludicrous stunt of pulling oneself up by one's own bootstraps. Its shortened form, "booting," refers to the startup process in computers in which the operating system is somehow activated by the operating system itself, which has not yet been activated. The absurd and somewhat surrealistic quality of bootstrapping protocols serves as one of the most mysterious and fascinating areas of science. As it happens, bootstrapping processes lie at the heart of some of the most powerful techniques in data simplification (eg, classification, object-oriented programming, resampling statistics, and Monte Carlo simulations). It is worth taking the time to explore the philosophical and the pragmatic aspects of bootstrapping. Starting from the beginning, how was the universe created? For believers, the universe was created by an all-powerful deity. If this were so, then how was the all-powerful deity created? Was the deity self-created, or did the deity simply bypass the act of creation altogether? The answers to these questions are left as an exercise for the reader, but we can all agree that there had to be some kind of bootstrapping process, if something was created from nothing. Otherwise, there would be no universe, and this book would be much shorter than it is. Getting back to our computers, how is it possible for any computer to boot its operating system, when we know that the process of managing the startup process is one of the most important functions of the fully operational operating system? Basically, at startup, the operating system is nonfunctional. A few primitive instructions hardwired into the computer's processors are sufficient to call forth a somewhat more complex process from memory, and this newly activated process calls forth other processes, until the operating system is eventually up and running. 
The cascading rebirth of active processes takes time, and explains why booting your computer may seem to be a ridiculously slow process. What is the relationship between bootstrapping and classification? The ontologist creates a classification based on a worldview in which objects hold specific relationships with other objects. Hence, the ontologist's perception of the world is based on preexisting knowledge of the classification of things; which presupposes that the classification already exists. Essentially, you cannot build a classification without first having the classification. How does an ontologist bootstrap a classification into existence? She may begin with a small assumption that seems, to the best of her knowledge, unassailable. In the case of the classification of living organisms, she may assume that the first organisms were primitive, consisting of a few self-replicating molecules and some physiologic actions, confined to a small space, capable of hosting a self-sustaining system. Primitive viruses and prokaryotes (ie, bacteria) may have started the ball rolling. This first assumption might lead to observations and deductions, which eventually yield the classification of living organisms that we know today. Every thoughtful ontologist will admit that a classification is, at its best, a hypothesis-generating machine; not a factual representation of reality. We use the classification to create new hypotheses about the world and about the classification itself. The process of testing hypotheses may reveal that the classification is flawed; that our early assumptions were incorrect. More often, testing hypotheses will reassure us that our assumptions were consistent with new observations, adding to our understanding of the relations between the classes and instances within the classification.

Child class The direct or first generation subclass of a class. Sometimes referred to as the daughter class or, less precisely, as the subclass. See Parent class. See Classification.

Class A class is a group of objects that share a set of properties that define the class and that distinguish the members of the class from members of other classes. The word "class," lowercase, is used as a general term. The word "Class," uppercase, followed by an uppercase noun (eg, Class Animalia) represents a specific class within a formal classification. See Classification.

Classification A system in which every object in a knowledge domain is assigned to a class within a hierarchy of classes. The properties of superclasses are inherited by the subclasses. Every class has one immediate superclass (ie, parent class) although a parent class may have more than one immediate subclass (ie, child class). Objects do not change their class assignment in a classification, unless there was a mistake in the assignment. For example, a rabbit is always a rabbit, and does not change into a tiger. Classifications can be thought of as the simplest and most restrictive type of ontology, and serve to reduce the complexity of a knowledge domain.³⁰ Classifications can be easily modeled in an object-oriented programming language and are nonchaotic (ie, calculations performed on the members and classes of a classification should yield the same output, each time the calculation is performed). A classification should be distinguished from an ontology. In an ontology, a class may have more than one parent class and an object may be a member of more than one class. A classification can be considered a special type of ontology wherein each class is limited to a single parent class and each object has membership in one and only one class. See Nomenclature. See Thesaurus. See Vocabulary. See Dictionary. See Terminology. See Ontology. See Parent class. See Child class. See Superclass. See Unclassifiable objects.
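The single-parent rule can be illustrated with a short Python sketch. It assumes, purely for illustration, that the classification is stored as a dictionary mapping each class to its one parent; the names `parent_of` and `lineage` are hypothetical, and the class names are borrowed from the schema.dot listing earlier in the chapter. Because a dictionary key can hold only one value, a class cannot acquire a second parent.

```python
# Each class maps to its single parent class; the root maps to None.
parent_of = {
    "Class": None,
    "Tumor_classification": "Class",
    "Neoplasm": "Tumor_classification",
    "Unclassified": "Tumor_classification",
    "Germ_cell": "Neoplasm",
}


def lineage(cls):
    """Return the chain of classes from cls up to the root."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = parent_of[cls]
    return chain


print(lineage("Germ_cell"))
```

Since every class has exactly one parent, the lineage of any class is a single unbranched path to the root, which is what makes inherited properties easy to compute.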

Classifier As used herein, refers to algorithms that assign a class (from a preexisting classification) to an object whose class is unknown.³¹ It is unfortunate that the term classifier, as used by data scientists, is often misapplied to the practice of classifying, in the context of building a classification. Classifier algorithms cannot be used to build a classification, as they assign class membership by similarity to other members of the class; not by relationships. For example, a classifier algorithm might assign a terrier to the same class as a house cat because both animals have many phenotypic features in common (eg, similar size and weight, presence of a furry tail, four legs, tendency to snuggle in a lap). A terrier is dissimilar to a wolf, and a house cat is dissimilar to a lion, but the terrier and the wolf are directly related to one another; as are the housecat and the lion. For the purposes of creating a classification, relationships are all that are important. Similarities, when they occur, arise as a consequence of relationships; not the other way around. At best, classifier algorithms provide a clue to classification, by sorting objects into groups that may contain related individuals. Like clustering techniques, classifier algorithms are computationally intensive when the dimension is high, and can produce misleading results when the attributes are noisy (ie, contain randomly distributed attribute values) or noninformative (ie, unrelated to correct class assignment). See K-nearest neighbor algorithm. See Predictive analytics. See Support vector machine.

Coding The term "coding" has four very different meanings, depending on which branch of science influences your thinking. For programmers, coding means writing the code that constitutes a computer program. For cryptographers, coding is synonymous with encrypting (ie, using a cipher to encode a message). For medics, coding is calling an emergency team to handle a patient in extremis. For informaticians and library scientists, coding involves assigning an alphanumeric identifier, representing a concept listed in a nomenclature, to a term. For example, a surgical pathology report may include the diagnosis, "Adenocarcinoma of prostate." A nomenclature may assign a code C4863000 that uniquely identifies the concept "Adenocarcinoma." Coding the report may involve annotating every occurrence of the word "Adenocarcinoma" with the "C4863000" identifier. For a detailed explanation of coding, and its importance for searching and retrieving data, see the full discussion in Section 3.4, "Autoencoding and Indexing with Nomenclatures." See Autocoding. See Nomenclature.

Data object A data object is whatever is being described by the data. For example, if the data is "6 feet tall," then the data object is the person or thing to which "6 feet tall" applies. Minimally, a data object is a metadata/data pair, assigned to a unique identifier (ie, a triple). In practice, the most common data objects are simple data records, corresponding to a row in a spreadsheet or a line in a flat-file. Data objects in object-oriented programming languages typically encapsulate several items of data, including an object name, an object unique identifier, multiple data/metadata pairs, and the name of the object's class. See Triple. See Identifier. See Metadata.
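As a minimal sketch of this idea in Python, triples can be held as (identifier, metadata, data) tuples and gathered into data objects by their shared identifier. The identifiers, metadata names, and the function `collect_objects` are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Each triple: (unique identifier, metadata, data). All values are made up.
triples = [
    ("obj_0001", "height", "6 feet tall"),
    ("obj_0001", "class", "Person"),
    ("obj_0002", "height", "3 feet tall"),
]


def collect_objects(triples):
    """Group metadata/data pairs under the identifier they belong to."""
    objects = defaultdict(dict)
    for identifier, metadata, data in triples:
        objects[identifier][metadata] = data
    return dict(objects)


objects = collect_objects(triples)
print(objects["obj_0001"])
```

Each resulting dictionary plays the role of a simple data record: one identifier with several metadata/data pairs attached to it.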

Database A software application designed specifically to create and retrieve large numbers of data records (eg, millions or billions). The data records of a database are persistent, meaning that the application can be turned off, then on, and all the collected data will be available to the user (see Open Source Tools for Chapter 7).

Dictionary A terminology or word list accompanied by a definition for each item. See Nomenclature. See Vocabulary. See Terminology.

Encapsulation The concept, from object-oriented programming, that a data object contains its associated data. Encapsulation is tightly linked to the concept of introspection, the process of accessing the data encapsulated within a data object. Encapsulation, Inheritance, and Polymorphism are available features of all object-oriented languages. See Inheritance. See Polymorphism.

Exe file A file with the filename suffix ".exe". In common parlance, filenames with the ".exe" suffix are executable code. See Executable file.

Executable file A file that contains compiled computer code that can be read directly by the computer's CPU, without interpretation by a programming language. A compiler for a language such as C will compile C code into executables. Scripting languages, such as Perl, Python, and Ruby, interpret plain-text scripts and send instructions to a run-time engine, for execution. Because executable files eliminate the interpretation step, they typically run faster than plain-text scripts. See Exe file.

Flat-file A file consisting of data records, usually with one record per file line. The individual fields of the record are typically separated by a marking character, such as "|" or "^". Flat-files are usually plain-text.

Generalization Generalization is the process of extending relationships from individual objects to classes of objects. For example, when Isaac Newton observed the physical laws that applied to apples falling to the ground, he found a way to relate the acceleration of an object to its mass and to the acceleration of gravity. His apple-centric observations applied to all objects and could be used to predict the orbit of the moon around the earth, or the orbit of the earth around the sun. Newton generalized from the specific to the universal. Similarly, Darwin's observations on barnacles could be generalized to yield the theory of evolution, thus explaining the development of all terrestrial organisms. Science would be of little value if observed relationships among objects could not be generalized to classes of objects. See Science.

HTML HyperText Markup Language is an ASCII-based set of formatting instructions for web pages. HTML formatting instructions, known as tags, are embedded in the document and enclosed in angle brackets, with paired tags indicating the start point and end point of each instruction. Here is an example of an HTML tag instructing the web browser to display the word "Hello" in italics: <i>Hello</i>. All web browsers conforming to the HTML specification must contain software routines that recognize and implement the HTML instructions embedded within web documents. In addition to formatting instructions, HTML also includes linkage instructions, in which the web browsers must retrieve and display a listed web page, or a web resource, such as an image. The protocol whereby web browsers, following HTML instructions, retrieve web pages from other internet sites is known as HTTP (HyperText Transfer Protocol).

Identification The process of providing a data object with an identifier, or the process of distinguishing one data object from all other data objects on the basis of its associated identifier. See Identifier.

Identifier A string that is associated with a particular thing (eg, person, document, transaction, data object), and not associated with any other thing.³² Object identification usually involves permanently assigning a seemingly random sequence of numeric digits (0–9) and alphabet characters (a–z and A–Z) to a data object. A data object can be a specific piece of data (eg, a data record), or an abstraction, such as a class of objects or a number or a string or a variable. See Identification.

Inheritance In object-oriented languages, data objects (ie, classes and object instances of a class) inherit the methods (eg, functions and subroutines) created for the ancestral classes in their lineage. See Abstraction. See Polymorphism. See Encapsulation.

Instance An instance is a specific example of an object that is not itself a class or group of objects. For example, Tony the Tiger is an instance of the tiger species. Tony the Tiger is a unique animal and is not itself a group of animals or a class of animals. The terms instance, instance object, and object are sometimes used interchangeably, but the special value of the "instance" concept, in a system wherein everything is an object, is that it distinguishes members of classes (ie, the instances) from the classes to which they belong.

Introspection A method by which data objects can be interrogated to yield information about themselves (eg, properties, values, and class membership). Through introspection, the relationships among the data objects can be examined. Introspective methods are built into object-oriented languages. The data provided by introspection can be applied, at run-time, to modify a script's operation; a technique known as reflection. Specifically, any properties, methods, and encapsulated data of a data object can be used in the script to modify the script's run-time behavior. See Reflection.
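A brief Python sketch may make introspection and reflection concrete. The class `Tiger` and the helper `call_if_present` are hypothetical, written only for this illustration; `type`, `vars`, `isinstance`, `getattr`, and `callable` are standard Python built-ins.

```python
class Tiger:
    """Hypothetical class used only for this illustration."""

    def __init__(self, name):
        self.name = name

    def roar(self):
        return f"{self.name} roars"


tony = Tiger("Tony")

# Introspection: ask the object about itself.
class_name = type(tony).__name__    # name of the object's class
encapsulated = vars(tony)           # the data held inside the object
is_tiger = isinstance(tony, Tiger)  # class membership


def call_if_present(obj, method_name):
    """Reflection: use introspected information to decide what to run."""
    method = getattr(obj, method_name, None)
    return method() if callable(method) else None


print(class_name, encapsulated, is_tiger)
print(call_if_present(tony, "roar"))
print(call_if_present(tony, "purr"))
```

The last two calls show the script changing its behavior at run-time according to what the object reports about itself: the method is invoked when it exists and skipped when it does not.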

K-means algorithm The k-means algorithm assigns any number of data objects to one of k clusters, where k is selected by the individual who implements the algorithm.³¹ Here is how the algorithm works for sets of quantitative data: (1) The program randomly chooses k objects from the collection of objects to be clustered. We'll call each of these k objects a focus. (2) For every object in the collection, the distance between the object and each of the k randomly chosen objects (chosen in step 1) is computed. (3) A round of k clusters is computed by assigning every object to its nearest focus. (4) The centroid focus for each of the k clusters is calculated. The centroid is the point that is closest to all of the objects within the cluster. Another way of saying this is that if you sum the squared distances between the centroid and all of the objects in the cluster, this sum will be smaller than the sum of squared distances from any other point in space. (5) Steps 2, 3, and 4 are repeated, using the k centroid foci as the points for which all distances are computed. (6) Step 5 is repeated until the k centroid foci converge on a nonchanging set of k centroid foci (or until the program slows to an interminable crawl). There are serious drawbacks to the algorithm: The final set of clusters will sometimes depend on the initial, random choice of k data objects. This means that multiple runs of the algorithm may produce different outcomes. The algorithm is not guaranteed to succeed. Sometimes, the algorithm does not converge to a final, stable set of clusters. When the dimensionality is very high, the distances between data objects (ie, the square root of the sum of squares of the measured differences between corresponding attributes of two objects) can be ridiculously large and of no practical meaning. Computations may bog down, cease altogether, or produce meaningless results. 
In this case, the only recourse may require eliminating some of the attributes (ie, reducing dimensionality of the data objects). Subspace clustering is a method wherein clusters are found for computationally manageable subsets of attributes. If useful clusters are found using this method, additional attributes can be added to the mix to see if the clustering can be improved. The clustering algorithm may succeed, producing a set of clusters of similar objects, but the clusters may have no practical value, omitting essential relationships among the objects. The k-means algorithm should not be confused with the k-nearest neighbor algorithm.
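The six steps described in this entry can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not a production implementation: the function name `kmeans`, the `max_rounds` cap, the fixed random seed, and the sample points are all hypothetical, and the centroid is taken to be the coordinate-wise mean of the cluster members.

```python
import math
import random


def kmeans(points, k, max_rounds=100, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Returns (foci, clusters), where clusters[i] lists the points
    assigned to foci[i].
    """
    rng = random.Random(seed)
    # Step 1: choose k objects at random as the initial foci.
    foci = rng.sample(points, k)
    for _ in range(max_rounds):
        # Steps 2 and 3: assign every object to its nearest focus.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, foci[i]))
            clusters[nearest].append(p)
        # Step 4: compute the centroid (coordinate-wise mean) of each cluster.
        new_foci = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_foci.append(mean)
            else:
                new_foci.append(foci[i])  # keep the old focus if a cluster is empty
        # Steps 5 and 6: repeat until the foci stop changing.
        if new_foci == foci:
            break
        foci = new_foci
    return foci, clusters


# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
foci, clusters = kmeans(points, k=2)
print(clusters)
```

Changing the `seed` argument changes the initial foci, which is a convenient way to observe the dependence on the starting choice that the entry warns about; with data less cleanly separated than this toy set, different seeds may yield different final clusters.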

K-nearest neighbor algorithm The k-nearest neighbor algorithm is a simple and popular classifier algorithm. From a collection of data objects whose class is known, the algorithm computes the distances from the object of unknown class to the objects of known class. This involves a distance measurement from the feature set of the object of unknown class to every object of known class (the training set). After the distances are computed, the k classed objects with the smallest distance to the object of unknown class are collected. The most common class (ie, the class with the most objects) among the nearest k classed objects is assigned to the object of unknown class. If the chosen value of k is 1, then the object of unknown class is assigned the class of its closest classed object (ie, the nearest neighbor).
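The procedure can be sketched in a few lines of Python. The function name `knn_classify`, the class labels, and the feature values are hypothetical, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter


def knn_classify(unknown, classed_objects, k=3):
    """Assign a class to `unknown` from its k nearest classed objects.

    classed_objects is a list of (features, class_label) pairs.
    """
    # Sort the classed objects by distance to the object of unknown class.
    by_distance = sorted(
        classed_objects, key=lambda item: math.dist(unknown, item[0])
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common class among the k nearest objects wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up objects of known class, each with two numeric features.
known = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]
print(knn_classify((2, 2), known, k=3))
print(knn_classify((7, 7), known, k=1))
```

Note that the function assigns class by similarity alone, which is why, as the Classifier entry points out, such algorithms cannot by themselves build a classification.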

KISS Acronym for Keep It Simple Stupid. The motto applies to almost any area of life; nothing should be made more complex than necessary. As it happens, much of what we encounter, as data scientists, comes to us in a complex form (ie, nothing to keep simple). A more realistic acronym is MISS (Make It Simple Stupid).

Machine learning Refers to computer systems and software applications that learn or improve as new data is acquired. Examples would include language translation software that improves in accuracy as additional language data is added to the system, and predictive software that improves as more examples are obtained. Machine learning can be applied to search engines, optical character recognition software, speech recognition software, vision software, and neural networks. Machine learning systems are likely to use training data sets and test data sets.

Meaning In informatics, meaning is achieved when described data is bound to a unique identifier of a data object. "Claude Funston's height is five feet eleven inches," comes pretty close to being a meaningful statement. The statement contains data (five feet eleven inches), and the data is described (height). The described data belongs to a unique object (Claude Funston). Ideally, the name "Claude Funston" should be provided with a unique identifier, to distinguish one instance of Claude Funston from all the other persons who are named Claude Funston. The statement would also benefit from a formal system that ensures that the metadata makes sense (eg, What exactly is height, and does Claude Funston fall into a class of objects for which height is a property?) and that the data is appropriate (eg, Is 5 feet 11 inches an allowable measure of a person's height?). A statement with meaning does not need to be a true statement (eg, The height of Claude Funston was not 5 feet 11 inches when Claude Funston was an infant). See Semantics. See Triple. See RDF.

Metadata The data that describes data. For example, a data element (also known as data point) may consist of the number, "6." The metadata for the data may be the words "Height, in feet." A data element is useless without its metadata, and metadata is useless unless it adequately describes a data element. In XML, the metadata/data annotation comes in the form <metadata tag>data<end of metadata tag> and might look something like:

<weight_in_pounds>150</weight_in_pounds>

In spreadsheets, the data elements are the cells of the spreadsheet. The column headers are the metadata that describe the data values in the column's cells, and the row headers are the record numbers that uniquely identify each record (ie, each row of cells). See XML.
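As a small illustration, the metadata and the data in an XML element of this kind can be separated with Python's standard `xml.etree.ElementTree` module. The variable names are hypothetical; the element is the weight example from this entry.

```python
import xml.etree.ElementTree as ET

# Parse a single metadata/data pair written as an XML element.
element = ET.fromstring("<weight_in_pounds>150</weight_in_pounds>")

metadata = element.tag   # the descriptor: "weight_in_pounds"
data = element.text      # the described value: "150"

print(metadata, data)
```

The pair still lacks an identified data object, so on its own it is not yet a meaningful assertion in the sense used in this chapter.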

Method Roughly equivalent to functions, subroutines, or code blocks. In object-oriented languages, a method is a subroutine available to an object (class or instance). In Ruby and Python, instance methods are declared with a "def" declaration followed by the name of the method, in lowercase. Here is an example, in Ruby, of a "hello" method written for the Salutations class.

class Salutations
  def hello
    puts "hello there"
  end
end

Mixins Mixins are a technique for including modules within a class to extend the functionality of the class. The power of the mixin is that methods can be inserted into unrelated classes. In practice, mixin methods are generally useful functions that are not related to the fundamental and defining methods for a class. A good way to think about nondefining methods included in unrelated classes is that "mixins" are to object-oriented programming languages what "properties" are to classifications. A single property may apply to multiple, unrelated classes. Mixins are available in both Python and Ruby. See RDF Schema. See Property.
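In Python, a mixin is commonly written as a small class that is listed among the parent classes of otherwise unrelated classes. The sketch below assumes a hypothetical `JsonMixin` that adds one general-purpose method; the classes `Tiger` and `Invoice` are likewise invented for illustration.

```python
import json


class JsonMixin:
    """Hypothetical mixin: adds a to_json method to any class that uses it."""

    def to_json(self):
        # Serialize the object's encapsulated data as a JSON string.
        return json.dumps(vars(self), sort_keys=True)


class Tiger(JsonMixin):
    def __init__(self, name):
        self.name = name


class Invoice(JsonMixin):
    def __init__(self, total):
        self.total = total


# Two unrelated classes share the same nondefining method.
print(Tiger("Tony").to_json())
print(Invoice(12).to_json())
```

The `to_json` method says nothing about what makes a tiger a tiger or an invoice an invoice, which is the sense in which a mixin resembles a property that crosses unrelated classes.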

Multiclass classification A misnomer imported from the field of machine translation, and indicating the assignment of an instance to more than one class. Classifications, as defined in this book, impose one-class classification (ie, an instance can be assigned to one and only one class). It is tempting to think that a ball should be included in class "toy" and in class "spheroids," but multiclass assignments create unnecessary classes of inscrutable provenance, and taxonomies of enormous size, consisting largely of replicate items. See Multiclass inheritance. See Taxonomy.

Multiclass inheritance In ontologies, multiclass inheritance occurs when a child class has more than one parent class. For example, a member of Class House may have two different parent classes: Class Shelter, and Class Property. Multiclass inheritance is generally permitted in ontologies but is forbidden in one type of restrictive ontology, known as a classification. See Classification. See Parent class. See Multiclass classification.

Namespace A namespace is the realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata tag "date" may be used to signify a calendar date, or the fruit, or the social engagement. To avoid confusion, metadata terms are assigned a prefix that is associated with a web document that defines the term (ie, establishes the tag's namespace). In practical terms, a tag that can have different descriptive meanings in different contexts is provided with a prefix that links to a web document wherein the meaning of the tag, as it applies in the XML document is specified.

Negative classifier One of the most common mistakes committed by ontologists involves classification by negative attribute. A negative classifier is a feature whose absence is used to define a class. An example is found in the Collembola, popularly known as springtails, a ubiquitous member of Class Hexapoda, and readily found under just about any rock. These organisms look like fleas (same size, same shape) and were formerly misclassified among the class of true fleas (Class Siphonaptera). Like fleas, springtails are wingless, and it was assumed that springtails, like fleas, lost their wings somewhere in evolution's murky past. However, true fleas lost their wings when they became parasitic. Springtails never had wings, an important taxonomic distinction separating springtails from fleas. Today, springtails (Collembola) are assigned to Class Entognatha, a separate subclass of Class Hexapoda. Alternately, taxonomists may be deceived by a feature whose absence is falsely conceived to be a fundamental property of a class of organisms. For example, all species of Class Fungi were believed to have a characteristic absence of a flagellum. Based on the absence of a flagellum, the fungi were excluded from Class Opisthokonta and were put in Class Plantae, which they superficially resembled. However, the chytrids, which have a flagellum, were recently shown to be a primitive member of Class Fungi. This finding places fungi among the true descendants of Class Opisthokonta (from which Class Animalia descended). This means that fungi are much more closely related to people than to plants, a shocking revelation!

Neural network A dynamic system in which outputs are calculated by a summation of weighted functions operating on inputs. The weights for the individual functions are determined by a learning process, simulating the learning process hypothesized for human neurons. In the computer model, individual functions that contribute to a correct output (based on the training data) have their weights increased (strengthening their influence on the calculated output). Over the past ten or fifteen years, neural networks have lost some favor in the artificial intelligence community. They can become computationally complex for very large sets of multidimensional input data. More importantly, complex neural networks cannot be understood or explained by humans, endowing these systems with a "magical" quality that some scientists find unacceptable. See Nongeneralizable predictor. See Overfitting. See Machine learning.
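The idea of weights adjusted by a learning process can be illustrated with a single artificial neuron trained by the perceptron learning rule. This is a deliberately simple stand-in, far smaller than the networks this entry describes; the function names, the learning rate, the number of epochs, and the choice of the logical AND function as training data are all assumptions made for the example.

```python
def train_perceptron(samples, epochs=20, rate=1):
    """Train one neuron; samples is a list of ((x1, x2), target) pairs."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            total = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if total > 0 else 0
            error = target - output
            # Shift each weight in the direction that reduces the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Output 1 when the weighted sum of the inputs exceeds zero."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


# Training data for logical AND: output 1 only when both inputs are 1.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
for inputs, target in and_samples:
    print(inputs, predict(weights, bias, inputs), target)
```

With one neuron and two weights, the learned values can be read and explained directly; that transparency is what is lost as networks grow in size and depth.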

Nomenclature A nomenclature is a listing of terms that cover all of the concepts in a knowledge domain. A nomenclature is different from a dictionary for three reasons: (1) the nomenclature terms are not annotated with definitions, (2) nomenclature terms may be multi-word, and (3) the terms in the nomenclature are limited to the scope of the selected knowledge domain. In addition, most nomenclatures group synonyms under a group code. For example, a food nomenclature might collect submarine, hoagie, po' boy, grinder, hero, and torpedo under an alphanumeric code such as "F63958". Nomenclatures simplify textual documents by uniting synonymous terms under a common code. Documents that have been coded with the same nomenclature can be integrated with other documents that have been similarly coded, and queries conducted over such documents will yield the same results, regardless of which term is entered (ie, a search for either hoagie, or po' boy will retrieve the same information, if both terms have been annotated with the synonym code, "F63958"). Optimally, the canonical concepts listed in the nomenclature are organized into a hierarchical classification.³³,³⁴ See Coding. See Autocoding.
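The effect of grouping synonyms under one code can be sketched in Python. The dictionary `nomenclature` and the function `codes_for` are hypothetical; the terms and the code "F63958" are the food example from this entry.

```python
import re

# Every synonym maps to the same group code.
synonyms = ["submarine", "hoagie", "po' boy", "grinder", "hero", "torpedo"]
nomenclature = {term: "F63958" for term in synonyms}


def codes_for(text):
    """Return the set of nomenclature codes for terms found in the text."""
    found = set()
    lowered = text.lower()
    for term, code in nomenclature.items():
        # Match whole terms only, so that "hero" does not match "heroic".
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(code)
    return found


print(codes_for("I ordered a hoagie."))
print(codes_for("I ordered a po' boy."))
print(codes_for("I ordered soup."))
```

Because both sentences receive the same code, a query on the code retrieves either document regardless of which synonym its author happened to use.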

Nonatomicity Nonatomicity is the assignment of a collection of objects to a single, composite object that cannot be further simplified or sensibly deconstructed. For example, the human body is composed of trillions of individual cells, each of which lives for some length of time, and then dies. Many of the cells in the body are capable of dividing to produce more cells. In many cases, the cells of the body that are capable of dividing can be cultured and grown in plastic containers, much like bacteria can be cultured and grown in Petri dishes. If the human body is composed of individual cells, why do we habitually think of each human as a single living entity? Why don't we think of humans as bags of individual cells? Perhaps the reason stems from the coordinated responses of cells. When someone steps on the cells of your toe, the cells in your brain sense pain, the cells in your mouth and vocal cords say ouch, and an army of inflammatory cells rush to the scene of the crime. The cells in your toe are not capable of registering an actionable complaint, without a great deal of assistance. The reason that organisms, composed of trillions of living cells, are generally considered to have nonatomicity, also relates to the "species" concept in biology. Every cell in an organism descended from the same zygote, and every zygote in every member of the same species descended from the same ancestral organism. Hence, there seems to be little benefit to assigning unique entity status to the individual cells that compose organisms, when the class structure for organisms is based on descent through zygotes. See Species.

Nongeneralizable predictor Sometimes data analysis can yield results that are true, but nongeneralizable (ie, irrelevant to everything outside the set of data objects under study). The most useful scientific findings are generalizable (eg, the laws of physics operate on the planet Jupiter or the star Alpha Centauri much as they do on earth). Many of the most popular analytic methods are not generalizable because they produce predictions that only apply to highly restricted sets of data; or the predictions are not explainable by any underlying theory that relates input data with the calculated predictions. Data analysis is incomplete until a comprehensible, generalizable and testable theory for the predictive method is developed.

Nonphylogenetic property Properties that do not hold true for a class; hence, cannot be used by ontologists to create a classification. For example, we do not classify animals by height or weight because animals of greatly different heights and weights may occupy the same biological class. Similarly, animals within a class may have widely ranging geographic habitats; hence, we cannot classify animals by locality. Case in point: penguins can be found virtually anywhere in the southern hemisphere, including hot and cold climates. Hence, we cannot classify penguins as animals that live in Antarctica or that prefer a cold climate. Scientists commonly encounter properties, once thought to be class-specific, that prove to be uninformative, for classification purposes. For many decades, all bacteria were assumed to be small; much smaller than animal cells. However, the bacterium Epulopiscium fishelsoni grows to about 600 microns by 80 microns, much larger than the typical animal epithelial cell (about 35 microns in diameter).³⁵ Thiomargarita namibiensis, an ocean-dwelling bacterium, can reach a size of 0.75 millimeter, visible to the unaided eye. What do these admittedly obscure facts teach us about the art of classification? Superficial properties, such as size, seldom inform us how to classify objects. The ontologist must think very deeply to find the essential defining features of classes.

Nonphylogenetic signal DNA sequences that cannot yield any useful conclusions related to the evolutionary pathways. Because DNA mutations arise stochastically over time (ie, at random locations in the gene, and at random times), two organisms having different ancestors may, by chance alone, achieve the same sequence in a chosen stretch of DNA. When gene sequence data is analyzed, and two organisms share the same sequence in a stretch of DNA, it can be tempting to infer that the two organisms belong to the same class (ie, that they inherited the identical sequence from a common ancestor). This inference is not necessarily correct. When mathematical phylogeneticists began modeling inferences for gene data sets, they assumed that most of the class assignment errors based on DNA sequence similarity would occur when the branches between sister taxa were long (ie, when a long time elapsed between evolutionary divergences, allowing for many random substitutions in base pairs). They called this phenomenon, wherein nonsister taxa were assigned the same ancient ancestor class, "long branch attraction." In practice, errors of this type can occur whether the branches are long, or short, or in-between. The term "nonphylogenetic signal" refers to just about any pitfall in phylogenetic grouping due to gene similarities acquired through any mechanism other than inheritance from a shared ancestor. This would include random mutational and adaptive convergence.³⁶

Normalized compression distance String compression algorithms (eg, zip, gzip, bzip2) should yield better compression from a concatenation of two similar strings than from a concatenation of two highly dissimilar strings. The reason is that the same string patterns that are employed to compress a string (ie, repeated runs of a particular pattern) are likely to be found in another, similar string. If two strings are completely dissimilar, then the compression algorithm would fail to find shared repeated patterns that enhance compressibility. The normalized compression distance is a similarity measure based on the enhanced compressibility of concatenated strings of high similarity.³⁷ A full discussion, with examples, is found in the Open Source Tools section of Chapter 4.
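As a rough sketch of the idea, the following Python fragment computes a normalized compression distance with the standard library's zlib module. It assumes the commonly used formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size in bytes; the function names and the sample strings are hypothetical, and the treatment in Chapter 4 may differ in detail.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Size, in bytes, of the zlib-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance; lower values suggest greater similarity."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical test strings: two near-identical texts and one block of noise.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = text_a.replace(b"dog", b"cat")
rng = random.Random(0)  # fixed seed so the noise is reproducible
noise = bytes(rng.randrange(256) for _ in range(len(text_a)))

print("similar pair:   ", ncd(text_a, text_b))
print("dissimilar pair:", ncd(text_a, noise))
```

The similar pair should score much closer to zero than the dissimilar pair, because the second string adds little that the compressor has not already seen.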

Notation 3 Also called n3. A syntax for expressing assertions as triples (unique subject + metadata + data). Notation 3 expresses the same information as the more formal RDF syntax, but n3 is easier for humans to read.³⁸ RDF and n3 are interconvertible, and either one can be parsed and equivalently tokenized (ie, broken into elements that can be re-organized in a different format, such as a database record). See RDF. See Triple.

Object relationships We are raised to believe that science explains how the universe, and everything in it, works. Engineering and the other applied sciences use scientific explanations to create things, for the betterment of our world. This is a lovely way to think about the roles played by scientists and engineers, but it is not completely accurate. For the most part, we cannot understand very much about the universe. Nobody understands the true nature of gravity, or mass, or light, or magnetism, or atoms, or thought. We do know a great deal about the relationships between gravity and mass, mass and energy, energy and light, light and magnetism, atoms and mass, thought and neurons, and so on. Karl Pearson, a 19th century statistician and philosopher, wrote that "All science is description and not explanation." Pearson was admitting that we can describe relationships, but we cannot explain why those relationships are true. Here is an example of a mathematical relationship that we know to be true, but which defies our understanding. The constant pi is the ratio of the circumference of a circle to its diameter. Furthermore, pi figures into the Gaussian statistical distribution (ie, that describes how a normal population is spread). How is it possible that a number that determines the distribution of a population can also determine the diameter of a circle?¹⁵ The relationships are provable and undeniable, but the full meaning of pi is beyond our grasp. In essence, all of science can be reduced to understanding object relationships.

Ontology An ontology is a collection of classes and their relationships to one another. Ontologies are usually rule-based systems (ie, membership in a class is determined by one or more class rules). Two important features distinguish ontologies from classifications. Ontologies permit classes to have more than one parent class and more than one child class. For example, the class of automobiles may be a direct subclass of "motorized devices" and a direct subclass of "mechanized transporters." In addition, an instance of a class can be an instance of any number of additional classes. For example, a Lamborghini may be a member of class "automobiles" and of class "luxury items." This means that the lineage of an instance in an ontology can be highly complex, with a single instance occurring in multiple classes, and with many connections between classes. Because recursive relations are permitted, it is possible to build an ontology wherein a class is both an ancestor class and a descendant class of itself. A classification is a highly restrained ontology wherein instances can belong to only one class, and each class may have only one direct parent class. Because classifications have an enforced linear hierarchy, they can be easily modeled, and the lineage of any instance can be traced unambiguously. See Classification. See Multiclass classification. See Multiclass inheritance.

Overfitting Overfitting occurs when a formula describes a set of data very closely, but does not lead to any sensible explanation for the behavior of the data, and does not predict the behavior of comparable data sets. In the case of overfitting, the formula is said to describe the noise of the system, rather than the characteristic behavior of the system. Overfitting occurs frequently with models that perform iterative approximations on training data, coming closer and closer to the training data set with each iteration. Neural networks are an example of a data modeling strategy that is prone to overfitting.³

Parent class The immediate ancestor, or the next-higher class (ie, the direct superclass) of a class. For example, in the classification of living organisms, Class Vertebrata is the parent class of Class Gnathostomata. Class Gnathostomata is the parent class of Class Teleostomi. In a classification, which imposes single class inheritance, each child class has exactly one parent class; whereas one parent class may have several different child classes. Furthermore, some classes, in particular the bottom class in the lineage, have no child classes (ie, a class need not always be a superclass of other classes). A class can be defined by its properties, its membership (ie, the instances that belong to the class), and by the name of its parent class. Given a list of all of the classes in a classification, in any order, we can always reconstruct the complete class hierarchy, with its correct lineages and branchings, if we know the name of each class's parent class. See Instance. See Child class. See Superclass.
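The claim that a lineage can be rebuilt from parent names alone can be illustrated with a short Python sketch. The three class names come from this entry; the dictionary layout, the `lineage` function, and the decision to stop the lineage at Vertebrata are assumptions made for illustration.

```python
# Child class -> parent class, listed in no particular order.
# The lineage is truncated at Vertebrata purely for illustration.
PARENT_OF = {
    "Teleostomi": "Gnathostomata",
    "Vertebrata": None,
    "Gnathostomata": "Vertebrata",
}

def lineage(class_name, parent_of):
    """Walk from a class up through its ancestors, returning the full chain."""
    chain = []
    seen = set()
    current = class_name
    while current is not None:
        if current in seen:
            # A classification is not permitted to contain cycles.
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = parent_of[current]
    return chain

print(lineage("Teleostomi", PARENT_OF))
# prints ['Teleostomi', 'Gnathostomata', 'Vertebrata']
```

Because each class has exactly one parent, the walk upward never has to choose between branches.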

Phenetics The classification of organisms by feature similarity, rather than through relationships. Starting with a set of feature data on a collection of organisms, you can write a computer program that will cluster the organisms into classes, according to their similarities. In theory, one computer program, executing over a large dataset containing measurements for every earthly organism, could create a complete biological classification. The status of a species is thereby reduced from a fundamental biological entity, to a mathematical construction. There is a host of problems consequent to computational methods for classification. First, there are many different mathematical algorithms that cluster objects by similarity. Depending on the chosen algorithm, the assignment of organisms to one species or another would change. Secondly, mathematical algorithms do not cope well with species convergence. Convergence occurs when two species independently acquire identical or similar traits through adaptation; not through inheritance from a shared ancestor. Examples are: the wing of a bat and the wing of a bird; the opposable thumb of opossums and of primates; the beak of a platypus and the beak of a bird. Unrelated species frequently converge upon similar morphologic adaptations to common environmental conditions or shared physiological imperatives. Algorithms that cluster organisms based on similarity are likely to group divergent organisms under the same species. It is often assumed that computational classification, based on morphologic feature similarities, will improve when we acquire whole-genome sequence data for many different species. Imagine an experiment wherein you take DNA samples from every organism you encounter: bacterial colonies cultured from a river, unicellular nonbacterial organisms found in a pond, small multicellular organisms found in soil, crawling creatures dwelling under rocks, and so on. 
You own a powerful sequencing machine, that produces the full-length sequence for each sampled organism, and you have a powerful computer that sorts and clusters every sequence. At the end, the computer prints out a huge graph, wherein all the samples are ordered and groups with the greatest sequence similarities are clustered together. You may think you've created a useful classification, but you haven't really, because you don't know anything about the organisms that are clustered together. You don't know whether each cluster represents a species, or a class (a collection of related species), or whether a cluster may be contaminated by organisms that share some of the same gene sequences, but are phylogenetically unrelated (ie, the sequence similarities result from chance or from convergence, but not by descent from a common ancestor). The sequences do not tell you very much about the biological properties of specific organisms, and you cannot infer which biological properties characterize the classes of clustered organisms. You have no certain knowledge whether the members of any given cluster of organisms can be characterized by any particular gene sequence (ie, you do not know the characterizing gene sequences for classes of organisms). You do not know the genus or species names of the organisms included in the clusters, because you began your experiment without a presumptive taxonomy. Basically, you simply know what you knew before you started; that individual organisms have unique gene sequences that can be grouped by sequence similarity. Taxonomists, who have long held that a species is a natural unit of biological life, and that the nature of a species is revealed through the intellectual process of building a consistent taxonomy,²¹ are opposed to the process of phenetics-based classification.²¹ See Blended class. See Bootstrapping.

Polymorphism Polymorphism is one of the constitutive properties of an object-oriented language (along with inheritance, encapsulation, and abstraction). Methods sent to object receivers have a response determined by the class of the receiving object. Hence, different objects, from different classes, receiving a call to a method of the same name, will respond differently. For example, suppose you have a method named "divide" and you send the method (ie, issue a command to execute the method) to an object of Class Bacteria and an object of Class Numerics. The Bacteria, receiving the divide method, will try to execute by looking for the "divide" method somewhere in its class lineage. Being bacteria, the "divide" method may involve making a copy of the bacteria (ie, reproducing) and incrementing the number of bacteria in the population. The numeric object, receiving the "divide" method, will look for the "divide" method in its class lineage and will probably find some method that provides instructions for arithmetic division. Hence, the behavior of the class object, to a received method, will be appropriate for the class of the object. See Inheritance. See Encapsulation. See Abstraction.
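The "divide" example in this entry might look like the following in Python. This is only a sketch: the class bodies, the starting population, and the default divisor of 2 are assumptions chosen to keep the example short.

```python
class Bacteria:
    """A hypothetical population of bacteria."""
    def __init__(self, population):
        self.population = population

    def divide(self):
        # For bacteria, dividing means reproducing: one more cell in the population.
        self.population += 1
        return self.population

class Numerics:
    """A hypothetical wrapper around a number."""
    def __init__(self, value):
        self.value = value

    def divide(self, divisor=2):
        # For numbers, dividing means arithmetic division.
        return self.value / divisor

# The same method name is sent to objects of two different classes.
for obj in (Bacteria(1), Numerics(10)):
    print(type(obj).__name__, obj.divide())
# prints "Bacteria 2" and then "Numerics 5.0"
```

The caller does not need to know which class it is addressing; each object finds the "divide" method appropriate to its own class.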

Predictive analytics A collection of techniques that have achieved great popularity and influence in the marketing industry. These are: recommenders, classifiers, and clustering algorithms.³⁹ Although these techniques can be used for purposes other than business, they are typically described using terms favored by marketers: recommenders (eg, predicting which products a person might prefer to buy), profile clustering (eg, grouping individuals into marketing clusters based on the similarity of their profiles), and product classifiers (eg, assigning a product or individual to a prediction category, based on a set of features). See Classifier. See Recommender.

Property Property, in the context of computational semantics, is a quantitative or qualitative feature of an object. In the case of spreadsheets, the column heads are all properties. In a classification, every class contains a set of properties that might apply to every member of the class (eg, male cardinals have the "red feather" property). Furthermore, instances may have their own set of properties, separate from the class. For example, the cardinal that I watch in my back yard seems to enjoy eating safflower seeds and cavorting in our bird bath, but I'm not sure that all cardinals share the same pleasures. From the standpoint of classifications, it is crucial to understand that a property may apply to multiple classes that are not directly related to one another. For example, insects, birds, and bats are not closely related classes of animals, but they all share the amazing property of flight. It is the ability to assign a single property to multiple classes that liberates classifications from the restraints imposed by the one-class/one-parent dictum. Although a class can have no more than one parent class, a class can share properties, in the form of data types and data methods, with unrelated classes. For example, Class File may be unrelated to Class Integer, but both classes may share a "print" method or the same "store" method. In object-oriented programming, the assignment of shared methods to multiple classes is accomplished through mixins. See Mixins.
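The File and Integer example can be sketched in Python as follows. Note the assumption involved: Python has no dedicated mixin construct, so the shared method is supplied by a small helper class that both classes list as a base, whereas a language such as Ruby would include a module instead. The names `PrintMixin`, `print_me`, and `describe` are hypothetical.

```python
class PrintMixin:
    """Supplies a shared 'print' behavior to any class that defines describe()."""
    def print_me(self):
        return f"{type(self).__name__}: {self.describe()}"

class File(PrintMixin):
    def __init__(self, path):
        self.path = path

    def describe(self):
        return self.path

class Integer(PrintMixin):
    def __init__(self, value):
        self.value = value

    def describe(self):
        return str(self.value)

# Two otherwise unrelated classes respond to the same shared method.
print(File("notes.txt").print_me())
print(Integer(42).print_me())
# prints "File: notes.txt" and then "Integer: 42"
```

Neither class is an ancestor of the other; the only thing they have in common is the method they both acquired from the mixin.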

RDF Resource Description Framework (RDF) is a syntax in XML notation that formally expresses assertions as triples. The RDF triple consists of a uniquely identified subject plus a metadata descriptor for the data plus a data element. Triples are necessary and sufficient to create statements that convey meaning. Triples can be aggregated with other triples from the same data set or from other data sets, so long as each triple pertains to a unique subject that is identified equivalently through the data sets. Enormous data sets of RDF triples can be merged or functionally integrated with other massive or complex data resources. For a detailed discussion see Open Source Tools, "Syntax for Triples." See Notation 3. See Semantics. See Triple. See XML.

RDF Schema Resource Description Framework Schema (RDFS). A document containing a list of classes, their definitions, and the names of the parent class(es) for each class. In an RDF Schema, the list of classes is typically followed by a list of properties that apply to one or more classes in the Schema. To be useful, RDF Schemas are posted on the Internet, as a Web page, with a unique Web address. Anyone can incorporate the classes and properties of a public RDF Schema into their own RDF documents (public or private) by linking named classes and properties, in their RDF document, to the web address of the RDF Schema where the classes and properties are defined. See Namespace. See RDFS.

RDFS Same as RDF Schema.

Recommender A collection of methods for predicting the preferences of individuals. Recommender methods often rely on one or two simple assumptions: (1) If an individual expresses a preference for a certain type of product, and the individual encounters a new product that is similar to a previously preferred product, then he is likely to prefer the new product; (2) If an individual expresses preferences that are similar to the preferences expressed by a cluster of individuals, and if the members of the cluster prefer a product that the individual has not yet encountered, then the individual will most likely prefer the product. See Predictive analytics. See Classifier.

Reconciliation Usually refers to identifiers, and involves verifying that an object that is assigned a particular identifier in one information system has been provided the same identifier in some other system. For example, if I am assigned identifier 967bc9e7-fea0-4b09-92e7-d9327c405d78 in a legacy record system, I should like to be assigned the same identifier in the new record system. If that were the case, my records in both systems could be combined. If I am assigned an identifier in one system that is different from my assigned identifier in another system, then the two identifiers must be reconciled to determine that they both refer to the same unique data object (ie, me). This may involve creating a link between the two identifiers, or a new triple that establishes the equivalence of the two identifiers. Despite claims to the contrary, there is no possible way by which information systems with poor identifier systems can be sensibly reconciled. Consider this example. A hospital has two separate registry systems: one for dermatology cases and another for psychiatry cases. The hospital would like to merge records from the two services. Because of sloppy identifier practices, a sample patient has been registered 10 times in the dermatology system, and 6 times in the psychiatry system, each time with different addresses, social security numbers, birthdates, and spellings of the name. A reconciliation algorithm is applied, and one of the identifiers from the dermatology service is matched positively against one of the records from the psychiatry service. Performance studies on the algorithm indicate that the merged records have a 99.8% chance of belonging to the same patient. So what? Though the two merged identifiers correctly point to the same patient, there are 14 (9 + 5) residual identifiers for the patient still unmatched. The patient's merged record will not contain his complete clinical history. 
Furthermore, in this hypothetical instance, analyses of patient population data will mistakenly attribute one patient's clinical findings to as many as 15 different patients, and the set of 15 records in the corrupted de-identified dataset may contain mixed-in information from an indeterminate number of additional patients! If the preceding analysis seems harsh, consider these words, from the Healthcare Information and Management Systems Society, "A local system with a poorly maintained or "dirty" master person index (MPI) will only proliferate and contaminate all of the other systems to which it links."⁴⁰ See Social Security Number.

Reflection A programming technique wherein a computer program will modify itself, at run-time, based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection (ie, object introspection). If the information indicates that the data object belongs to a particular class of objects, then the program may call a method appropriate for the class. The program executes in a manner determined by descriptive information obtained during run-time; metaphorically reflecting upon the purpose of its computational task. See Introspection.
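A minimal Python sketch of the loop described in this entry follows. The two classes, their methods, and the `process` function are hypothetical; the built-in functions `type`, `getattr`, and `callable` are used here to perform the introspection.

```python
import math

class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2

class Label:
    def __init__(self, text):
        self.text = text

    def render(self):
        return self.text.upper()

def process(obj):
    """Inspect an object at run-time and call whichever known method it offers."""
    class_name = type(obj).__name__           # introspection: what class is this?
    for method_name in ("area", "render"):    # methods this program knows how to use
        method = getattr(obj, method_name, None)
        if callable(method):
            # reflection: behavior is chosen from what introspection revealed
            return class_name, method_name, method()
    return class_name, None, None

for item in (Circle(1.0), Label("hello"), 3.5):
    print(process(item))
```

The program does not know in advance which objects it will meet; it decides what to do with each one only after examining it.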

Registrars and human authentication The experiences of registrars in U.S. hospitals serve as cautionary instruction. Hospital registrars commit a disastrous mistake when they assume that all patients wish to comply with the registration process. A patient may be highly motivated to provide false information to a registrar, or to acquire several different registration identifiers, or to seek a false registration under another person's identity (ie, commit fraud), or to forego the registration process entirely. In addition, it is a mistake to believe that honest patients are able to fully comply with the registration process. Language barriers, cultural barriers, poor memory, poor spelling, and a host of errors and misunderstandings can lead to duplicative or otherwise erroneous identifiers. It is the job of the registrar to follow hospital policies that overcome these difficulties. Registration should be conducted by a trained registrar who is well-versed in the registration policies established by the institution. Registrars may require patients to provide a full legal name, any prior held names (eg, maiden name), date of birth, and a government issue photo id card (eg, driver's license or photo id card issued by the department of motor vehicles). Ideally, registration should require a biometric identifier (eg, fingerprints, retina scan, iris scan, voice recording, photograph). If you accept the premise that hospitals have the responsibility of knowing who it is that they are treating, then obtaining a sample of DNA from every patient, at the time of registration, is reasonable. The DNA can be used to create a unique patient profile from a chosen set of informative loci; a procedure used by the CODIS system developed for law enforcement agencies. The registrar should document any distinguishing and permanent physical features that are plainly visible (eg, scars, eye color, colobomas, tattoos). 
Neonatal and pediatric identifiers pose a special set of problems for registrars. When an individual who was born in a hospital, and provided with an identifier at birth, returns as an adult, he or she should be assigned the same identifier that was issued in the remote past. Every patient who comes for registration should be matched against a database of biometric data that does not change from birth to death (eg, fingerprints, DNA). See Social Security Number.

SUMO Knowing that ontologies reach into higher ontologies, ontologists have endeavored to create upper-level ontologies to accommodate general classes of objects, under which the lower ontologies may take their place. One such ontology is SUMO, the Suggested Upper Merged Ontology, created by a group of talented ontologists.⁴¹ SUMO is owned by IEEE (Institute of Electrical and Electronics Engineers), and is freely available, subject to a usage license.⁴²

SVM See Support vector machine.

Science Of course, there are many different definitions of science, and inquisitive students should be encouraged to find a conceptualization of science that suits their own intellectual development. For me, science is all about finding general relationships among objects. In the so-called physical sciences, the most important relationships are expressed as mathematical equations (eg, the relationship between force, mass and acceleration; the relationship between voltage, current and resistance). In the so-called natural sciences, relationships are often expressed through classifications (eg, the classification of living organisms). Scientific advancement is the discovery of new relationships or the discovery of a generalization that applies to objects hitherto confined within disparate scientific realms (eg, evolutionary theory arising from observations of organisms and geologic strata). Engineering would be the area of science wherein scientific relationships are exploited to build new technology. See Generalization.

Semantics The study of meaning (Greek root, semantikos, significant meaning). In the context of data science, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (ie, making triples), combining assertions about data objects (ie, merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but most definitions for semantics are functionally equivalent to the definition offered here. Much of any language is unstructured and meaningless. Consider the assertion: Sam is tired. This is an adequately structured sentence with a subject verb and object. But what is the meaning of the sentence? There are a lot of people named Sam. Which Sam is being referred to in this sentence? What does it mean to say that Sam is tired? Is "tiredness" a constitutive property of Sam, or does it only apply to specific moments? If so, for what moment in time is the assertion, "Sam is tired" actually true? To a computer, meaning comes from assertions that have a specific, identified subject associated with some sensible piece of fully described data (metadata coupled with the data it describes). See Triple. See RDF.
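The three activities named in this entry (making triples, merging triples, and assigning objects to classes) can be sketched with plain Python tuples. Everything here is hypothetical: the identifiers, the metadata tags such as `is_a`, and the timestamp are invented for illustration and do not follow any particular RDF vocabulary.

```python
from collections import defaultdict

# Each triple is (identified object, metadata, data).
set_a = {
    ("person:0001", "name", "Sam"),
    ("person:0001", "is_a", "Patient"),   # assigns the object to a class
}
set_b = {
    ("person:0001", "reported_tired_at", "2015-06-01T09:00"),
    ("person:0002", "name", "Sam"),       # a different Sam, with a different identifier
}

def merge(*triple_sets):
    """Combine any number of triple collections into one set."""
    merged = set()
    for triples in triple_sets:
        merged |= triples
    return merged

def by_subject(triples):
    """Group the metadata/data pairs under the object they describe."""
    grouped = defaultdict(list)
    for subject, metadata, data in sorted(triples):
        grouped[subject].append((metadata, data))
    return dict(grouped)

for subject, pairs in by_subject(merge(set_a, set_b)).items():
    print(subject, pairs)
```

The unique identifier, not the name, decides which assertions belong together, which is why the two people named Sam are never confused.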

Simpson's paradox Occurs when a correlation that holds in two different data sets is reversed if the data sets are combined. For example, baseball player A may have a higher batting average than player B for each of two seasons, but when the data for the two seasons are combined, player B may have the higher 2-season average. Simpson's paradox is just one example of unexpected changes in outcome when variables are unknowingly hidden or blended.⁴³
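The batting-average reversal can be checked with simple arithmetic. The hit and at-bat counts below are illustrative numbers chosen to produce the effect; the variable names are hypothetical.

```python
# (hits, at-bats) for each player in each of two seasons; illustrative numbers.
player_a = [(104, 411), (45, 140)]
player_b = [(12, 48), (183, 582)]

def average(hits, at_bats):
    return hits / at_bats

def combined_average(seasons):
    total_hits = sum(hits for hits, _ in seasons)
    total_at_bats = sum(at_bats for _, at_bats in seasons)
    return total_hits / total_at_bats

for season, (a, b) in enumerate(zip(player_a, player_b), start=1):
    print(f"season {season}: A = {average(*a):.3f}, B = {average(*b):.3f}")
print(f"combined: A = {combined_average(player_a):.3f}, "
      f"B = {combined_average(player_b):.3f}")
```

Player A leads in each season, yet player B leads overall, because most of B's at-bats fall in the season when both players hit well, while most of A's fall in the season when both hit poorly.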

Social Security Number The common strategy, in the U.S., of employing social security numbers as identifiers is often counterproductive, owing to entry error, mistaken memory, or the intention to deceive. Efforts to reduce errors by requiring individuals to produce their original social security cards put an unreasonable burden on honest individuals, who rarely carry their cards, and provide an advantage to dishonest individuals, who can easily forge social security cards. Institutions that compel patients to provide a social security number have dubious legal standing. The social security number was originally intended as a device for validating a person's standing in the social security system. More recently, the purpose of the social security number has been expanded to track taxable transactions (ie, bank accounts, salaries). Other uses of the social security number are not protected by law. The Social Security Act (Section 208 of Title 42 U.S. Code 408) prohibits most entities from compelling anyone to divulge his/her social security number. Legislation or judicial action may one day stop healthcare institutions from compelling patients to divulge their social security numbers as a condition for providing medical care. Prudent and forward-thinking institutions will limit their reliance on social security numbers as personal identifiers. See Registrars and human authentication.

Species Species is the bottom-most class of any classification or ontology. Because the species class contains the individual objects of the classification, it is the only class which is not abstract. The special significance of the species class is best exemplified in the classification of living organisms. Every species of organism contains individuals that share a common ancestral relationship to one another. When we look at a group of squirrels, we know that each squirrel in the group has its own unique personality, its own unique genes (ie, genotype), and its own unique set of physical features (ie, phenotype). Moreover, although the DNA sequences of individual squirrels are unique, we assume that there is a commonality to the genome of squirrels that distinguishes it from the genome of every other species. If we use the modern definition of species as an evolving gene pool, we see that the species can be thought of as a biological life form, with substance (a population of propagating genes), and a function (evolving to produce new species).²¹,⁴⁴,⁴⁵ Put simply, species speciate; individuals do not. As a corollary, species evolve; individuals simply propagate. Hence, the species class is a separable biological unit with form and function. We, as individuals, are focused on the lives of individual things, and we must be reminded of the role of species in biological and nonbiological classifications. The concept of species is discussed in greater detail in Section 6.4. See Blended class. See Nonatomicity.

Spreadsheet Spreadsheets are data arrays consisting of records (the rows), with each record containing data attributes (the columns). Spreadsheet applications permit the user to search records, columns, and cells (ie, the data points corresponding to a specific record and a specific column). Spreadsheets support statistical and mathematical functions operating on the elements of the spreadsheet (ie, records, columns, cells). Perhaps most importantly, spreadsheets offer a wide range of easily implemented graphing features. Quite a few data scientists perform virtually all of their work using a favorite spreadsheet application. Spreadsheets have limited utility when dealing with large data (eg, gigabytes or terabytes of data), complex data (eg, images, waveforms, text), and they do not easily support classified data (eg, data objects that belong to classes within a lineage of classes). Additionally, spreadsheets do not support the kinds of methods and data structures (eg, if statements, access to external modules, system calls, network interactions, reflection, complex data structures) that are supported in modern programming languages.

Subclass A class in which every member descends from some higher class (ie, a superclass) within the class hierarchy. Members of a subclass have properties specific to the subclass. As every member of a subclass is also a member of the superclass, the members of a subclass inherit the properties and methods of the ancestral classes. For example, all mammals have mammary glands because mammary glands are a defining property of the mammal class. In addition, all mammals have vertebrae because the class of mammals is a subclass of the class of vertebrates. A subclass is the immediate child class of its parent class. See Child class. See Parent class.

Superclass Any of the ancestral classes of a subclass. For example, in the classification of living organisms, the class of vertebrates is a superclass of the class of mammals. The immediate superclass of a class is its parent class. In common parlance, when we speak of the superclass of a class, we are usually referring to its parent class. See Parent class.

Support vector machine A machine learning classifying algorithm. The method starts with a training set consisting of two classes of objects as input. The support vector machine computes a hyperplane, in a multidimensional space, that separates objects of the two classes. The dimension of the hyperspace is determined by the number of dimensions or attributes associated with the objects. Additional objects (ie, test set objects) are assigned membership in one class or the other, depending on which side of the hyperplane they reside.
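
The final step of the method, assigning a test object to one class or the other according to the side of the hyperplane on which it falls, can be sketched as follows. This sketch does not train anything: the hyperplane (weights and bias) is a hypothetical, hand-picked example, whereas a real support vector machine would compute the hyperplane from the training set. The class labels are likewise placeholders.

```python
def side_of_hyperplane(weights, bias, point):
    """Classify a point by the sign of (weights . point + bias).

    The hyperplane is the set of points where the score equals zero;
    objects on one side score positive, objects on the other side negative.
    """
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return "class A" if score >= 0 else "class B"


# Hypothetical hyperplane in a 2-dimensional space (two attributes per object).
weights = [1.0, -1.0]
bias = 0.0

# Hypothetical test set objects, each described by two attribute values.
test_objects = [(3.0, 1.0), (1.0, 4.0)]

labels = [side_of_hyperplane(weights, bias, obj) for obj in test_objects]
```

The number of entries in weights matches the number of attributes per object, which is what determines the dimension of the space in which the hyperplane sits.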

Taxonomic order In biological taxonomy, the hierarchical lineage of organisms is divided into a descending list of named orders: Kingdom, Phylum (Division), Class, Order, Family, Genus, and Species. As we have learned more and more about the classes of organisms, modern taxonomists have added ranks to the classification (eg, supraphylum, subphylum, suborder, infraclass, etc.). Was this really necessary? All of this taxonomic complexity could be averted by dropping named ranks and simply referring to every class as "Class." Modern specifications for class hierarchies (eg, RDF Schema) encapsulate each class with the name of its superclass. When every object yields its class and superclass, it is possible to trace any object's class lineage. For example, in the classification of living organisms, if you know the name of the parent for each class, you can write a simple script that generates the complete ancestral lineage for every class and species within the classification.¹⁸ See Class. See Taxonomy. See RDF Schema. See Species.
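
A minimal sketch of such a lineage-tracing script is shown here, assuming that each class is stored alongside the name of its parent class. The parent_of table is a deliberately abbreviated, illustrative excerpt that omits many intermediate classes; it is not drawn from any particular data file, and the function name lineage is likewise hypothetical.

```python
# Each class name maps to the name of its parent class.
# Abbreviated, illustrative table; many intermediate classes are omitted.
parent_of = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Animalia",
}


def lineage(name, parent_of):
    """Return the list of classes from `name` up to the root class."""
    chain = [name]
    seen = {name}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:
            # Guard against a malformed table in which a class is its own ancestor.
            raise ValueError(f"cycle detected at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


# Generate the complete ancestral lineage for every class in the table.
all_lineages = {name: lineage(name, parent_of) for name in parent_of}
```

No rank names (Kingdom, Phylum, and so on) are needed; knowing the parent of each class is sufficient to reconstruct the full lineage.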

Taxonomy A taxonomy is the collection of named instances (class members) in a classification or an ontology. When you see a schematic showing class relationships, with individual classes represented by geometric shapes and the relationships represented by arrows or connecting lines between the classes, then you are essentially looking at the structure of a classification, minus the taxonomy. You can think of building a taxonomy as the act of pouring all of the names of all of the instances into their proper classes. A taxonomy is similar to a nomenclature; the difference is that in a taxonomy, every named instance must have an assigned class. See Taxonomic order.

Terminology The collection of words and terms used in some particular discipline, field, or knowledge domain. It is nearly synonymous with vocabulary and with nomenclature. Vocabularies, unlike terminologies, are not confined to the terms used in a particular field. Nomenclatures, unlike terminologies, usually aggregate equivalent terms under a canonical synonym.

Thesaurus A vocabulary that groups together synonymous terms. A thesaurus is very similar to a nomenclature. There are two minor differences. Nomenclatures include multi-word terms, whereas a thesaurus is typically composed of one-word terms. In addition, nomenclatures are typically restricted to a well-defined topic or knowledge domain (eg, names of stars, infectious diseases, etc.). See Nomenclature. See Vocabulary. See Classification. See Dictionary. See Terminology. See Ontology.

Triple In computer semantics, a triple is an identified data object associated with a data element and the description of the data element. In the computer science literature, the syntax for the triple is commonly described as "subject, predicate, object," wherein the subject is an identifier, the predicate is the description of the object, and the object is the data. The definition of triple, using grammatical terms, can be off-putting to the data scientist, who may think in terms of spreadsheet entries: a key that identifies the line record, a column header containing the metadata description of the data, and a cell that contains the data. In this book, the three components of a triple are described as: (1) the identifier for the data object, (2) the metadata that describes the data, and (3) the data itself. In theory, all data sets, databases, and spreadsheets can be constructed or deconstructed as collections of triples. See Introspection. See Data object. See Semantics. See RDF. See Meaning.
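
The spreadsheet analogy can be made concrete with a short sketch that deconstructs spreadsheet-like records into triples. The records, the key column name ("id"), the identifiers, and the ages are all hypothetical values invented for illustration.

```python
# A tiny spreadsheet: each dict is one record (row), each key a column header.
# All values here are hypothetical.
rows = [
    {"id": "rec001", "name": "Conrad Nervig", "age": "47"},
    {"id": "rec002", "name": "Claude Funston", "age": "35"},
]


def rows_to_triples(rows, key="id"):
    """Deconstruct records into (identifier, metadata, data) triples.

    The key column supplies the identifier for the data object, each
    remaining column header is the metadata, and each cell is the data.
    """
    triples = []
    for row in rows:
        identifier = row[key]
        for metadata, data in row.items():
            if metadata != key:
                triples.append((identifier, metadata, data))
    return triples


triples = rows_to_triples(rows)
```

Each cell of the spreadsheet yields exactly one triple, so two records with two data columns apiece produce four triples.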

Triplestore A list or database composed entirely of triples (ie, statements consisting of an item identifier plus the metadata describing the item plus an item of data). The triples in a triplestore need not be saved in any particular order, and any triplestore can be merged with any other triplestore; the basic semantic meaning of the contained triples is unaffected. See Triple.
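
The claims that order does not matter and that any two triplestores can be merged can be illustrated with Python sets, which are unordered collections. The identifiers and values are hypothetical, and a set is only one of many possible ways to hold triples.

```python
# Two small triplestores; each triple is (identifier, metadata, data).
# All identifiers and values are hypothetical.
store_a = {
    ("obj001", "name", "Conrad Nervig"),
    ("obj001", "age", "47"),
}
store_b = {
    ("obj001", "occupation", "pilot"),
    ("obj002", "name", "Claude Funston"),
}

# Merging is a simple union; the result is the same in either direction.
merged = store_a | store_b


def data_for(store, identifier):
    """Collect every metadata/data pair bound to one data object."""
    return {(metadata, data) for (ident, metadata, data) in store if ident == identifier}


# After the merge, everything asserted about obj001 in either store is available.
about_obj001 = data_for(merged, "obj001")
```

Because every triple carries its own identifier, the assertions about a data object can be collected from the merged store no matter which source contributed them.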

Unclassifiable objects Classifications create a class for every object and taxonomies assign each and every object to its correct class. This means that a classification is not permitted to contain unclassified objects, a condition that puts fussy taxonomists in an untenable position. Suppose you have an object, and you simply do not know enough about the object to confidently assign it to a class. Or, suppose you have an object that seems to fit more than one class, and you can't decide which class is the correct class. What do you do? Historically, scientists have resorted to creating a "miscellaneous" class into which otherwise unclassifiable objects are given a temporary home, until more suitable accommodations can be provided. I have spoken with numerous data managers, and everyone seems to be of a mind that "miscellaneous" classes, created as a stopgap measure, serve a useful purpose. Not so. Historically, the promiscuous application of "miscellaneous" classes has proven to be a huge impediment to the advancement of science. In the case of the classification of living organisms, the class of protozoans stands as a case in point. Ernst Haeckel, a leading biological taxonomist in his time, created the Kingdom Protista (ie, protozoans), in 1866, to accommodate a wide variety of simple organisms with superficial commonalities. Haeckel himself understood that the protists were a blended class that included unrelated organisms, but he believed that further study would resolve the confusion. In a sense, he was right, but the process took much longer than he had anticipated, occupying generations of taxonomists over the following 150 years. Today, Kingdom Protista no longer exists. Its members have been reassigned to other classes. Nonetheless, textbooks of microbiology still describe the protozoans, just as though this name continued to occupy a legitimate place among terrestrial organisms. In the meantime, therapeutic opportunities for eradicating so-called protozoal infections, using class-targeted agents, have no doubt been missed.¹⁶ You might think that the creation of a class of living organisms, with no established scientific relation to the real world, was a rare and ancient event in the annals of biology, having little or no chance of being repeated. Not so. A special pseudoclass of fungi, deuteromycetes (spelled with a lowercase "d," signifying its questionable validity as a true biologic class), has been created to hold fungi of indeterminate speciation. At present, there are several thousand such fungi, sitting in a taxonomic limbo, waiting to be placed into a definitive taxonomic class.²⁰,¹⁶ See Blended class.

Unstable taxonomy A taxonomy that continuously changes over time. Examples abound from the classification of living organisms. You might expect that a named species would keep its name forever, and would never change its assigned class. Not so. For example, Class Fungi has recently undergone profound changes, with the exclusion of myxomycetes (slime molds) and oomycetes (water molds), and the acquisition of Class Microsporidia (formerly classed as a protozoan). The instability of fungal taxonomy impacts negatively on the practice of clinical mycology. When the name of a fungus changes, so must the name of the associated disease. Consider "Allescheria boydii." Individuals infected with this organism were said to suffer from the disease known as allescheriasis. When the organism's name was changed to Petriellidium boydii, the disease name was changed to petriellidosis. When the fungal name was changed, once more, to Pseudallescheria boydii, the disease name was changed to pseudallescheriasis.²⁰ All three names appear in the literature, past and present, thus hindering attempts to annotate the medical literature.¹⁶ See Autocoding. See Annotation. See Unclassifiable objects.

Vocabulary A comprehensive collection of words and their associated meanings. In some quarters, "vocabulary" and "nomenclature" are used interchangeably, but they are different from one another. Nomenclatures typically focus on terms confined to one knowledge domain. Nomenclatures typically do not contain definitions for the contained terms. Nomenclatures typically group terms by synonymy. Lastly, nomenclatures include multi-word terms. Vocabularies are collections of single words, culled from multiple knowledge domains, with their definitions, and assembled in alphabetic order. See Nomenclature. See Thesaurus. See Taxonomy. See Dictionary. See Terminology.

XML Acronym for eXtensible Markup Language, a syntax for marking data values with descriptors (ie, metadata). The descriptors are commonly known as tags. In XML, every data value is enclosed by a start-tag, containing the descriptor and indicating that a value will follow, and an end-tag, containing the same descriptor and indicating that a value preceded the tag. For example: <name>Conrad Nervig</name>. The enclosing angle brackets, "<>", and the end-tag marker, "/", are hallmarks of HTML and XML markup. This simple but powerful relationship between metadata and data allows us to employ metadata/data pairs as though each were a miniature database. The semantic value of XML becomes apparent when we bind a metadata/data pair to a unique object, forming a so-called triple. See Triple. See Meaning. See Semantics. See HTML.
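
As a brief sketch of these ideas, the XML fragment above can be parsed with Python's standard xml.etree.ElementTree module to recover the metadata/data pair, which is then bound to an object identifier to form a triple. The identifier "obj-0001" is a hypothetical value invented for the example.

```python
import xml.etree.ElementTree as ET

# The XML fragment from the text: a data value enclosed by a start-tag and an end-tag.
fragment = "<name>Conrad Nervig</name>"

# Parse the fragment; the tag is the metadata, the enclosed text is the data.
element = ET.fromstring(fragment)
metadata_data_pair = (element.tag, element.text)

# Bind the pair to a unique object identifier (hypothetical) to form a triple.
object_id = "obj-0001"
triple = (object_id,) + metadata_data_pair
```

Without the identifier, the pair tells us only that some name is "Conrad Nervig"; with it, the assertion is attached to a specific data object.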
