Chapter 1

The Simple Life

Abstract

The introduction discusses the insurmountable analytic obstacles created by collections of complex and inscrutable data. As it happens, this problem is not new. The natural history of civilization always seems to lead to a point where science and society become too bloated and burdensome to sustain further progress. A crucial point is reached when civilizations opt for simplification or nullification. This chapter reviews some of the most important simplification concepts, developed over the history of mankind, that have permitted civilization to attain its current state of activity. The chapter comes with a warning: simplify or stagnate.

Keywords

Simplification tools; Historical advances; Simplicity in civilizations; Complexity; Classifications; MIDI; POV-Ray; HTML; XML; Neuroscience

1.1 Simplification Drives Scientific Progress

Make everything as simple as possible, but not simpler.

Albert Einstein (see Glossary item, Occam's razor)

Advances in civilization have been marked by increasing complexity. To a great extent, modern complexity followed from the invention of books, which allowed us to build upon knowledge deposited by long-deceased individuals.

Because it is easy to admire complexity, it can be difficult to appreciate its opposite: simplification. Few of us want to revert to a simple, prehistoric lifestyle, devoid of the benefits of engines, electricity, automobiles, airplanes, mass production of food and manufactured items, and medical technology. Nonetheless, a thoughtful review of human history indicates that some of our greatest scientific advances involved simplifying complex activities (see Glossary item, Science). Here are just a few examples:

1. Nouns and names. By assigning specific names to individuals (eg, Turok Son of Stone, Hagar the Horrible), ancient humans created a type of shorthand for complex objects, thus releasing themselves from the task of providing repeated, detailed descriptions of the persons to whom we refer.

2. Classifications. Terms that apply to classes of things simplified our ability to communicate abstract concepts. The earliest classes may have been the names of species (eg, antelope) or families (eg, birds). In either case, class abstractions alleviated the need for naming every bird in a flock (see Glossary items, Abstraction, Species, Systematics, Taxonomy, and Classification).

3. Numerals. Early humans must have known that counting on fingers and toes can be confusing. Numbers simplified counting, and greatly extended the maximum conceivable value of a tally. Without an expandable set of integers, communicating "how much" and "how many" must have been an exasperating experience.

4. Glyphs, runes, stone tablets, and papyrus. Written language, and the media for preserving thoughts, relieved humans from committing everything to memory. The practice of writing things down simplified the task of recordkeeping and allowed ancient humans to create records that outlived the record-keepers (see Glossary item, Persistence).

5. Libraries. Organized texts (ie, books) and organized collections of texts (ie, libraries) simplified the accrual of knowledge across generations. Before there were books and libraries, early religions relied on the oral transmission of traditions and laws, an unreliable practice that invited impish tampering. The popularization of books marked the demise of oral traditions and the birth of written laws that could be copied, examined, discussed, and sometimes discarded.

6. Mathematics. Symbolic logic permitted ancient man to understand the real world through abstractions. For example, the number 2, a mathematical abstraction with no physical meaning, can apply to any type of object (eg, 2 chickens, 2 rocks, or 2 universes). Mathematics freed us from the tedious complexities of the physical realm, and introduced humans to a new world, ruled by a few simple axioms.

The list of ancient simplifications can go on and on. In modern times, simplifications have sparked new scientific paradigms and rejuvenated moribund disciplines. In the information sciences, HTML, a new and very simple method for formatting text and linking web documents and other data objects across the Internet, has revolutionized communications and data sharing. Likewise, XML has revolutionized our ability to annotate, understand, and merge data objects. The rudiments of HTML and XML can be taught in a few minutes (see Glossary items, HTML, XML, Data object).
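For instance, a minimal, hypothetical XML fragment looks like the following (the tag names are invented for this example):

<organism>
  <common_name>house wren</common_name>
  <species>Troglodytes aedon</species>
</organism>

Each data value is enclosed by a start tag and a matching end tag that describes the value, and tagged elements may nest within one another; that is very nearly the whole syntax.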

In the computer sciences, language compilers have greatly reduced the complexity of programming. Object-oriented programming languages have simplified programming even further. Modern programmers can be much more productive than their counterparts who worked just a few decades ago. Likewise, Monte Carlo and resampling methods have greatly simplified statistics, enabling general scientists to model complex systems with ease (see Sections 8.2 and 8.3 of Chapter 8). More recently, MapReduce has simplified calculations by dividing large and complex problems into simple problems, for distribution to multiple computers (see Glossary item, MapReduce).
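The gist of MapReduce can be conveyed in a few lines. The following toy Python sketch (not any particular MapReduce framework) splits a word-count problem into a map step, whose pieces could be farmed out to separate computers, and a reduce step that merges the partial tallies:

from collections import Counter

def map_step(chunk_of_text):
    # each worker tallies the words in its own chunk
    return Counter(chunk_of_text.split())

def reduce_step(partial_counts):
    # merge the tallies returned by all of the workers
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

chunks = ["apple pear apple", "pear plum", "apple plum plum"]
print(reduce_step(map_step(chunk) for chunk in chunks))
# Counter({'apple': 3, 'plum': 3, 'pear': 2})

In a real MapReduce framework, the map results would be computed on many machines in parallel; the toy preserves only the logic of the division.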

The methods for sequencing DNA are much simpler today than they were a few decades ago, and projects that once required the combined efforts of multiple laboratories over several years can now be accomplished in a matter of days or hours, within a single laboratory.

Physical laws and formulas simplify the way we understand the relationships among objects (eg, matter, energy, electricity, magnetism, and particles). Without access to simple laws and formulas, we could not have created complex products of technology (ie, computers, smartphones, and jet planes).

1.2 The Human Mind is a Simplifying Machine

Science is in reality a classification and analysis of the contents of the mind.

Karl Pearson in The Grammar of Science, 19001

The unrestricted experience of reality is complex and chaotic. If we were to simply record all the things and events that we see when we take a walk on a city street or a country road, we would be overwhelmed by the magnitude and complexity of the collected data: images of trees, leaves, bark, clouds, buildings, bricks, stones, dirt, faces, insects, heat, cold, wind, barometric pressure, color, shades, sounds, loudness, harmonics, sizes and positions of things, relationships in space between the positions of different objects, movements, interactions, changes in shape, emotional responses, to name just a few.2

We fool ourselves into thinking that we can gaze at the world and see what is to be seen. In fact, what really happens is that light received by retinal receptors is processed by many neurons, in many pathways, and our brain creates a representation of the data that we like to call consciousness. The ease with which we can be fooled by optical illusions serves to show that we only "see" what our brains tell us to see; not what is really there. Vision is somewhat like sitting in a darkened theater and watching a Hollywood extravaganza, complete with special effects and stage props. Dreams are an example of pseudo-visual productions directed by our subconscious brains, possibly as an antidote to nocturnal boredom.

Life, as we experience it, is just too weird to go unchecked. We maintain our sanity by classifying complex data into simple categories of things that have defined properties and roles. In this manner, we can ignore the details and concentrate on patterns. When we walk down the street, we see buildings. We know that the buildings are composed of individual bricks, panes of glass, and girders of steel; but we do not take the time to inventory all the pieces of the puzzle. We humans classify data instinctively, and much of our perception of the world derives from the classes of objects that we have invented for ourselves.

What we perceive is dependent upon the choices we make, as we classify our world. If we classify animals as living creatures, just like ourselves, with many of the same emotional and cognitive features as we have, then we might be more likely to treat animals much the same way as we treat our fellow humans. If we classify animals as a type of food, then our relationships with animals might be something different. If we classify 3-week-old human embryos as young humans, then our views on abortion might be quite different from someone who classifies 3-week-old human embryos as small clusters of cells without organs or a nervous system.

Classification is heavy stuff. We simplify our world through classification, but, in so doing, we create our personal realities. For this reason, the process of classification should be done correctly. If I had to choose an area of science that was neglected during my early education, it would be ontology, the science of classification. Every data scientist should understand that there is a logic to creating a classification and that a poor data classification can ruin the most earnest efforts to analyze complex data. Chapter 6 is devoted to the concepts of meaning and classification, as it applies to the data sciences (see Glossary items, Classification, Ontology).

1.3 Simplification in Nature

The dream of every cell, to become two cells!

François Jacob

Is our universe complex? Maybe so, but it is very clear that forces have been at work to simplify our reality. Despite the enormity of our universe, there seem to be just a few types of cosmological bodies (eg, stars, planets, black holes, gases, debris, and dark matter). These bodies assemble into galaxies, and our galaxies display a relatively narrow range of shapes and sizes. We see that simple and stable systems can emerge from chaos and complexity (see Glossary item, System). Because stable systems, by definition, persist longer than unstable systems, the stable systems will eventually outnumber the unstable, chaotic systems. As time progresses, simplicity trumps complexity.

What is true on the grand scale seems to apply to the small scale of things. Despite the enormous number of protons, neutrons, and electrons available to form complex elements, we find that there are just 98 naturally occurring elements. Under extreme laboratory conditions, we can produce about 20 additional elements that exist briefly before decaying. Spectrographic surveys of space indicate that the universe accommodates no elements other than those natural elements encountered in our own solar system.

Why is the periodic table of elements such a simple and short piece of work? Why can we not make heavier and heavier atoms, without restraint? As it happens, the nature of the physical realm is highly restrictive. In the simple Bohr model of the atom, the innermost electron of hydrogen, in its ground state (ie, lowest energy state), moves at a bit less than 1% of the speed of light. The number of electrons orbiting the nucleus of an element equals the number of protons in the nucleus, and the pull of a more highly charged nucleus forces the innermost electrons to move faster; in the Bohr model, the innermost electron of an element with atomic number Z moves at roughly Z/137 of the speed of light. As elements grow heavier, their innermost electrons must move faster and faster, and at roughly element 137 the model would require an electron moving at the speed of light itself. Electrons cannot reach the speed of light; hence, there is a strict limit to the nuclear charge that can hold a full retinue of electrons; hence, there is a strict limit to the number of elements; hence, the periodic table is simple and short.
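As a back-of-the-envelope sketch, assuming nothing beyond the idealized Bohr model just described, the innermost electron's speed, as a fraction of the speed of light, is simply the atomic number multiplied by the fine-structure constant (about 1/137):

FINE_STRUCTURE = 1 / 137.036   # approximate fine-structure constant

def innermost_speed_fraction(atomic_number):
    # Bohr-model speed of the innermost (1s) electron, as a fraction of c
    return atomic_number * FINE_STRUCTURE

for z in (1, 26, 92, 137):     # hydrogen, iron, uranium, the theoretical limit
    print(z, round(innermost_speed_fraction(z), 3))
# prints approximately: 1 0.007, 26 0.19, 92 0.671, 137 1.0

Real atoms are messier than the Bohr model, of course; relativistic corrections, rather than a hard wall, set in near the end of the table. The sketch only illustrates the scaling argument made above.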

Is the realm of living organisms more complex than the physical realm? There are estimated to be somewhere between 10 million and 100 million species of living organisms on planet Earth. At first blush, the profusion of species would suggest incomprehensible complexity. Nonetheless, all of the species of living organisms can be listed under a simple classification composed of some dozens of classes. A high school student, perusing the classification of living organisms, can acquire a fair knowledge of the evolutionary development of life on earth. A skeptic might remark that the classes of living organisms are artefactual and arbitrary. What does it mean to say that there are 100 classes of living organisms, when we might have assigned species to 1000 classes or 1,000,000 classes, had we chosen to do so? As it happens, the true classes of organisms are discoveries, not inventions. A species is a biological unit, much like a single organism is a biological unit (Glossary item, Species). A strong argument can be made that the number of classes of organisms is fixed. It is the job of biologists to correctly assign organisms to their proper species, and to provide a hierarchy of classes that exactly accommodates the lineage produced by the evolution of life on this planet (see Glossary items, Classification system versus identification system, Cladistics, Clade, Monophyletic class, and Taxonomic order). Essentially, the classification of living organisms is an exercise in discovering the simplicity in nature. Because the classification of living organisms is the oldest and most closely examined classification known to science, data analysts are well-advised to learn from past failures and triumphs resulting from this grand, communal endeavor. In Sections 6.2 through 6.4 of Chapter 6, we will return to the process whereby simple classifications are built to model complex systems.

Aside from the simple manner in which all organisms can be classified, it seems obvious enough that life is complex. Furthermore, the complexity found in a highly evolved organism, such as a human, is obviously much greater than the complexity of the first organisms arising on earth. Hence, it seems safe to conclude that among living organisms, complexity has continually increased throughout the history of the evolution of life on our planet. Well, yes and no. The complexity of living organisms has increased over time, but simplification has occurred in tandem with complexification. For example, all living organisms contain DNA, and the DNA in organisms obeys a simple coding system. A biologist who works with human DNA one year may switch over to a study of fruit flies or corn the next year, without bothering to acquire new equipment. The protein, DNA, and RNA motifs in our cells are simple variations of a relatively small number of motifs and subunits developed long ago and shared by most living organisms.3 A biologist who studies fruit flies and humans will learn that the same genes that control the embryonic development in humans also control the embryonic development of fruit flies and mice.4 In some cases, a human gene can substitute for an insect gene, with little or no difference in outcome. Basically, what we see as the limitless complexity of nature is accounted for by variations on a few simple themes.

When we look at all the different types of animals in the world, we focus on diversity. We forget that there are just a few dozen body plans available to animals, and all of these plans were developed about half a billion years ago, in the Cambrian period.5 Since then, nothing much has happened. It is as though we have spent a half billion years recovering from a wave of complexification. Today, we look at a horse and its rider, and we think that we are looking at two totally unrelated animals. Not so. Humans and horses have the same body plan, the same skeleton, and the same bones.6 The differences lie in bone size and shape, attachment facets, and a few involutionary adjustments (eg, loss of a tail). A visiting Martian, with no appreciation of the subtleties, might fail to distinguish any one mammal from another.

Evolution simplifies organisms by conserving great evolutionary achievements among descendant classes. For example, the evolutionary development of photosynthetic oxygenation occurred once only; all organisms capable of photosynthetic oxygenation presumably use the same basic equipment inherited from one lucky cyanobacterium (see Glossary item, Chloroplast evolution). Likewise, about one billion years ago, the first eukaryotes appeared as one-celled organisms containing a nucleus and at least one mitochondrion. Nature never needed to reinvent the mitochondrion. The mitochondria that exist today, throughout the kingdom of eukaryotes, apparently descended from one early ancestor. So it goes for ribosome-associated proteins that translate ribonucleic acids into protein, and hox genes that orchestrate embryonic development.4 The genomes of living cells, containing upwards of billions of nucleotides, seem hopelessly complex, but nature reduces the complexity and size of genomes when conditions favor a simpler and shorter set of genes (see Glossary item, Obligate intracellular organism).7,8 Henner Brinkmann and Herve Philippe summarized the situation nicely: "In multiple cases evolution has proceeded via secondary simplification of a complex ancestor, instead of the constant march towards rising complexity generally assumed."9

1.4 The Complexity Barrier

Nobody goes there anymore. It's too crowded.

Yogi Berra

It seems that many scientific findings, particularly those findings based on analyses of large and complex data sets, are yielding irreproducible results. We find that we cannot depend on the data that we depend on. If you don't believe me, consider these shocking headlines:

1. "Unreliable research: Trouble at the lab."10 The Economist, in 2013 ran an article examining flawed biomedical research. The magazine article referred to an NIH official who indicated that "researchers would find it hard to reproduce at least three-quarters of all published biomedical findings, the public part of the process seems to have failed." The article described a study conducted at the pharmaceutical company Amgen, wherein 53 landmark studies were repeated. The Amgen scientists were successful at reproducing the results of only 6 of the 53 studies. Another group, at Bayer HealthCare, repeated 63 studies. The Bayer group succeeded in reproducing the results of only one-fourth of the original studies.

2. "A decade of reversal: an analysis of 146 contradicted medical practices."11 The authors reviewed 363 journal articles, reexamining established standards of medical care. Among these articles were 146 manuscripts (40.2%) claiming that an existing standard of care had no clinical value.

3. "Cancer fight: unclear tests for new drug."12 This New York Times article examined whether a common test performed on breast cancer tissue (Her2) was repeatable. It was shown that for patients who tested positive for Her2, a repeat test indicated that 20% of the original positive assays were actually negative (ie, falsely positive on the initial test).12

4. "Reproducibility crisis: blame it on the antibodies."13 Biomarker developers are finding that they cannot rely on different batches of a reagent to react in a consistent manner, from test to test. Hence, laboratory analytic methods, developed using a controlled set of reagents, may not have any diagnostic value when applied by other laboratories, using different sets of the same analytes.13

5. "Why most published research findings are false."14 Modern scientists often search for small effect sizes, using a wide range of available analytic techniques, and a flexible interpretation of outcome results. The manuscript's author found that research conclusions are more likely to be false than true.14,15

6. "We found only one-third of published psychology research is reliable — now what?"16 The manuscript authors suggest that the results of a first study should be considered preliminary and tentative. Conclusions have no value until they are independently validated.

Anyone who attempts to stay current in the sciences soon learns that much of the published literature is irreproducible17; and that almost anything published today might be retracted tomorrow. This appalling truth applies to some of the most respected and trusted laboratories in the world.18-25 Those of us who have been involved in assessing the rate of progress in disease research are painfully aware of the numerous reports indicating a general slowdown in medical progress.26-33

For the optimists, it is tempting to assume that the problems that we may be experiencing today are par for the course, and temporary. It is the nature of science to stall for a while and lurch forward in sudden fits. Errors and retractions will always be with us as long as humans are involved in the scientific process.

For the pessimists, such as myself, there seems to be something going on that is really new and different; a game changer. This game changer is the "complexity barrier," a term credited to Boris Beizer, who used it to describe the impossibility of managing increasingly complex software products.34 The complexity barrier, known also as the complexity ceiling, applies to virtually every modern area of science and engineering.35,36

Some of the mistakes that lead to erroneous conclusions in data-intensive research are well-known, and include the following:

1. Errors in sample selection, labeling, and measurement.37-39 For example, modern biomedical data is high-volume (eg, gigabytes and larger), heterogeneous (ie, derived from diverse sources), private (ie, measured on human subjects), and multidimensional (eg, containing thousands of different measurements for each data record). The complexities of handling such data correctly are daunting40 (see Glossary items, Curse of dimensionality, Dimensionality).

2. Misinterpretation of the data41,14,42,31,43-45

3. Data hiding and data obfuscation46,47

4. Unverified and unvalidated data48-51,43,52

5. Outright fraud47,25,53

When errors occur in complex data analyses, they are notoriously difficult to discover.48

Aside from human error, intrinsic properties of complex systems may thwart our best attempts at analysis. For example, when complex systems are perturbed from their normal, steady-state activities, the rules that govern the system's behavior become unpredictable.54 Much of the well-managed complexity of the world is found in machines built with precision parts having known functionality. For example, when an engineer designs a radio, she knows that she can assign names to the components, and that these components can be relied upon to behave in a manner that is characteristic of their type. A capacitor will behave like a capacitor, and a resistor will behave like a resistor. The engineer need not worry that the capacitor will behave like a semiconductor or an integrated circuit. The engineer knows that the function of a machine's component will never change; but the biologist operates in a world wherein components change their functions, from organism to organism, cell to cell, and moment to moment. As an example, cancer researchers discovered an important protein that plays a role in the development of cancer. This protein, p53, was considered to be the primary cellular driver for human malignancy. When p53 mutated, cellular regulation was disrupted, and cells proceeded down a slippery path leading to cancer. In the past few decades, as more information was obtained, cancer researchers have learned that p53 is just one of many proteins that play some role in carcinogenesis, and that the role played by p53 changes depending on the species, tissue type, cellular microenvironment, genetic background of the cell, and many other factors. Under one set of circumstances, p53 may modify DNA repair; under another set of circumstances, p53 may cause cells to arrest the growth cycle.55,56 It is difficult to predict the biological effect of a protein that changes its primary function based on prevailing cellular conditions.

At the heart of all data analysis is the assumption that systems have a behavior that can be described with a formula or a law, or that can lead to results that are repeatable and to conclusions that can be validated. We are now learning that our assumptions may have been wrong, and that our best efforts at data analysis may be irreproducible.

Complexity seems to be the root cause of many failures in software systems; and the costs of such failures run very high. It is common for large, academic medical centers to purchase information systems that cost in excess of $500 million. Despite the enormous investment, failures are not uncommon.57-59 About three-quarters of hospital information systems are failures.60 Furthermore, successfully implemented electronic health record systems do not always improve patient outcomes.61 Based on a study of the kinds of failures that account for patient safety errors in hospitals, it has been suggested that hospital information systems will not greatly reduce safety-related incidents.62 Clinical decision support systems, built into electronic health record systems, have not had much impact on physicians' practices.63 These systems tend to be too complex for the hospital staff to master and are not well-utilized.64

It is believed that the majority of information technology projects fail, and that failure is positively correlated with the size and cost of the projects.65 We know that public projects costing hundreds of billions of dollars have failed quietly, without raising much attention.66,67 Projects that are characterized by large size, high complexity, and novel technology aggravate any deficiencies in management, personnel, or process practices.65,68,35,36,69

In 2004, the National Cancer Institute launched an ambitious project, known as the Cancer Biomedical Informatics Grid, abbreviated CaBIG™, aimed at developing standards for annotating and sharing biomedical data, and tools for data analysis (see Glossary items, Standard, Grid, and Data sharing).64 For a time, the project received generous support from academia and industry. In 2006, the Cancer Biomedical Informatics Grid was selected as a laureate in ComputerWorld's honors program.70 ComputerWorld described the project as "Effectively forming a World Wide Web of cancer research," with "promises to speed progress in all aspects of cancer research and care." The great promises of the project came with a hefty price tag. By 2010, the National Cancer Institute had sunk at least 350 million dollars into the effort.71 Though the project was ambitious, there were rumblings in the cancer informatics community that very little had been achieved. In view of past and projected costs, an ad hoc committee was assigned to review the program. In a report issued to the public in 2011, the committee found that the project had major deficiencies and suggested a year-long funding moratorium.71 Soon thereafter, the project leader left the National Cancer Institute, and the Cancer Biomedical Informatics Grid was unceremoniously terminated.72

After CaBIG™ was terminated, Barry Smith, a big thinker in the rather small field of ontology, wrote an editorial entitled "CaBIG™ Has Another Fundamental Problem: It Relies on 'Incoherent' Messaging Standard" (see Glossary item, Ontology).73 In his editorial, Smith suggested that HL7, a data model specification used by CaBIG™, could not possibly work, and that it had proven itself a failure for those people who actually tried to implement the specification and use it for its intended purposes.73

At about the same time that CaBIG™ was being terminated, a major project in the United Kingdom was also scuttled. The United Kingdom's National Health Service had embarked on a major overhaul of its information systems, with the goal of system-wide interoperability and data integration (see Glossary items, Interoperability, Integration). After investing $17 billion, the project was ditched when members of Parliament called the effort "unworkable."74-76 This failed program had been called "the world's biggest civil information technology program."74 Back in 2001, a report published by the NHS Information Authority cited fundamental flaws in HL7.77 The project was also hampered by intrinsic difficulties in establishing a workable identifier system (to be discussed further in Sections 5.1 and 5.2 of Chapter 5). There are generally multiple problems that, together, account for the failure of a complex system.78-80

Science and society may have reached a complexity barrier beyond which nothing can be analyzed and understood with any confidence. In light of the irreproducibility of complex data analyses, it seems prudent to make the following two recommendations:

1. Simplify your complex data, before you attempt analysis.

2. Assume that the first analysis of primary data is tentative and often wrong. The most important purpose of data analysis is to lay the groundwork for data reanalysis (see Glossary items, Primary data, Secondary data, Conclusions).

1.5 Getting Ready

What is needed is not only people with a good background in a particular field, but also people capable of making a connection between item 1 and item 2, which might not ordinarily seem connected.

Isaac Asimov81

Like many individuals, I am no cook. Nonetheless, I can prepare a few dishes when the need arises: scrambled eggs, oatmeal, and spaghetti. In a pinch, I'll open a can of tuna, or baked beans. My wife insists that such activities do not qualify as cooking, but I maintain that the fundamental skills, such as heating, boiling, mixing, and measuring, are all there. It's cooking if I can eat it.

If you are planning to have a data-centric career, then you must learn to cook up your own scripts. Otherwise, you will be limited by the built-in functionalities provided by software applications. To make a substantive contribution to your field, you must have the ability to organize and analyze diverse types of data, using algorithms collected from a variety of different scientific disciplines. Creativity often involves finding relationships that were missed by your peers, but you will never find those relationships if you lock yourself in a specialized software application. Programming languages free scientists to follow their creative instincts.

Some of our most common and indispensable computational tasks are ridiculously easy to achieve, in any programming environment. We do not ask a master chef to fill a glass of water at the sink. Why would we seek the services of a professional programmer when we need to alphabetically sort a list, or find records in a data set that match a query string, or annotate a collection of files with a name and date (see Glossary items, Query, String)? The bulk of the work involved in data analysis projects will require skills in data organization, data curation, data annotation, data merging, data transforming, and a host of computationally simple techniques that you should learn to do for yourself.64,40 (see Glossary items, Curator, Transform).
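To give a sense of how little code these chores require, here is a minimal Python sketch; the data file name and query string are hypothetical:

# alphabetically sort a list
authors = ["pearson", "asimov", "berra", "jacob"]
print(sorted(authors))

# find records in a data set that match a query string
query = "melanoma"
with open("records.txt") as infile:   # hypothetical file, one record per line
    hits = [line for line in infile if query in line]
print(len(hits), "matching records")

Neither task deserves a specialist's attention; both are one-liners in any scripting language.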

There are hundreds of fine programming languages available; each with its own strengths and weaknesses. If you intend to confine your programming efforts to simple tasks, such as basic arithmetic operations, simple descriptive statistics, and search and replace methods, then you may wish to avoid the languages preferred by professional programmers (eg, C and Java). Furthermore, advanced GUI (Graphic User Interface) languages, such as Visual Basic, require a level of programmatic overhead that you will not need. Specialty languages, such as R, for statistics, may come in handy, but they are not essential for every data scientist.82 Some tasks should be left to specialists. Perl, Python, and Ruby are powerful, no-cost programming languages, with versions available for most popular operating systems. Instructional online tutorials, as well as a rich literature of print books, provide nonprogrammers with the skills they need to write the programs that they will use as data scientists.40,83-85

If you hope to simplify your data, it is advisable to begin your projects by eyeballing your data files. For this, you will need a simple text editor and some data visualization tools. The easiest way to review a large data file is to open it and browse; much the same way that you might browse the contents of books in a library. Most word processing software applications are not suited to opening and browsing large files, exceeding about 50 megabytes in length. Text editors, unlike word processors, are designed to perform simple tasks on large, plain-text files (ie, unformatted text, also known as ASCII text) on the order of a gigabyte in length (see Glossary items, ASCII, American Standard Code for Information Interchange). There are many freely available text editors that can quickly open and search large files. Two popular text editors are Emacs and vi. Downloadable versions are available for Linux, Windows, and Macintosh systems. Text editors are useful for composing computer programs, which are always written in plain-text (see Glossary item, Plain-text). Data scientists will find it useful to acquire facility with a fast and simple text editor (see Glossary items, Data science, Data scientist).

For data visualization, Gnuplot or Matplotlib should suffice. Gnuplot is an easy-to-use general data visualization and analysis tool that is available at no cost.86 If you prefer to program in Python, then Matplotlib is a good choice. Similar to Gnuplot in functionality, Matplotlib is designed to work smoothly within Python scripts, on output data provided by Python's numpy and scipy modules (see Glossary items, Numpy, Scipy). Gnuplot and Matplotlib support a remarkable range of data analysis options (see Open Source Tools for Chapter 4).87
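As a minimal sketch of Matplotlib in action (the data values here are invented for illustration):

import matplotlib.pyplot as plt

x_values = [1, 2, 3, 4, 5]
y_values = [2.1, 3.9, 9.2, 15.8, 25.3]    # invented measurements

plt.plot(x_values, y_values, marker="o")  # a quick first look at the data
plt.xlabel("dose")
plt.ylabel("response")
plt.savefig("quicklook.png")              # writes the plot to an image file

A handful of lines like these, run early and often, will catch many data problems before any formal analysis begins.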

As a preliminary step, it is important to know whether your data is comprehensive (ie, containing all of the data relevant to an intended purpose), representative (ie, containing a useful number of data objects of every type of data included in the data set), reasonably organized (eg, with identifiers for individual records and with a consistent set of features associated with each data record), and adequately annotated (eg, timestamped appropriately, and accompanied with descriptions of the data elements) (see Glossary items, Timestamp, Time, Trusted timestamp, Universal and timeless).
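None of these preliminary checks requires elaborate software. Here is a minimal Python sketch for one of them, reasonably organized records; the file name and the tab delimiter are assumptions to be adapted to your own data:

from collections import Counter

field_counts = Counter()
with open("mydata.txt") as infile:        # hypothetical tab-delimited data file
    for record in infile:
        fields = record.rstrip("\n").split("\t")
        field_counts[len(fields)] += 1    # tally the number of fields per record

print(sum(field_counts.values()), "records")
print(field_counts)  # a consistently organized file yields a single field count

If the tally shows several different field counts, the file is not consistently organized, and that is worth knowing before any analysis begins.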

The outlook for data scientists who choose to be creative with their data has never been better. You can stop obsessing over your choice of operating system and programming language; modern scripting languages provide cross-platform compatibility. You can forget about buying expensive software applications; nearly everything you need is available at no cost. Feel free to think in terms of simple utilities (eg, command-line programs or specialized modules) that will implement specific algorithms, as required (see Glossary item, Algorithm). Write your own short scripts designed to perform one particular computational task quickly, using a minimal amount of code, and without the overhead of a graphic user interface. A few hours of effort will start you on your way towards data independence.

By far, the most important asset of any data analyst is her brain. A set of personal attributes that include critical thinking, an inquisitive mind, and the patience to devote hundreds of hours to reviewing data is certain to come in handy. Expertise in analytic algorithms is a valuable, yet sometimes overrated, skill (see Glossary items, New analytic method, Method). Most data analysis projects require the ability to understand the data, and this can often be accomplished with simple data visualization tools. The application of rigorous mathematical and statistical algorithms typically comes at the end of the project, after the key relationships among data objects have been discovered (see Glossary items, Object, Data object). It is important to remember that if your old data is verified, organized, annotated, and preserved, the analytic process can be repeated and improved. In most cases, the first choice of analytic method is not the best choice. No single analytic method is critical when the data analyst has the opportunity to repeat the work, applying many different methods, all the while attaining a better understanding of the data.87

Open Source Tools

Today, most software exists, not to solve a problem, but to interface with other software.

I. O. Angell

Perl

Perl comes bundled with Linux distributions. Versions of the Perl interpreter are available for most operating systems.

For Windows users, the Strawberry Perl distribution comes bundled with a CPAN installer, a C compiler (gcc), and a wealth of bundled modules (see Glossary item, CPAN). Strawberry Perl is a compilation of open source software components, available as 32-bit or 64-bit binary versions (see Glossary items, Binary data, Exe file). Strawberry Perl is available at: http://strawberryperl.com/

Further information is available at: https://www.perl.org/get.html

General instruction for Perl programming is available in my book, "Perl Programming for Medicine and Biology."83

Python

Like Perl, Python comes bundled on Linux distributions. Python installations for various operating systems are available at: https://www.python.org/downloads/

A Python download site for Windows is: https://www.python.org/downloads/windows/

Students and professionals in healthcare and related sciences may enjoy my book, "Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby."84

Ruby

Ruby was built to have many of the syntactic features of Perl but with the native object orientation of Smalltalk.

Ruby installations for various operating systems can be downloaded from: https://www.ruby-lang.org/en/downloads/

A Windows installer for Ruby is found at: http://rubyinstaller.org/

An excellent Ruby book is "Programming Ruby: The Pragmatic Programmer's Guide," by Dave Thomas.88 Students and professionals in healthcare and related sciences may enjoy my book, "Ruby Programming for Medicine and Biology."85

Text Editors

When I encounter a large data file, in plain ASCII format, the first thing I do is open the file and take a look at its content (see Glossary items, ASCII, Plain-text). Unless the file is small (ie, under about 50 megabytes), most commercial word processors fail at this task. You will want to use an editor designed to work with large ASCII files (see Glossary item, Text editor). Two of the more popular, freely available editors are Emacs and vi (also available under the name vim). Downloadable versions are available for Linux, Windows, and Macintosh systems. On most computers, these editors will open files of several hundred megabytes, at least.64

vim download site: http://www.vim.org/download.php

Emacs download site: https://www.gnu.org/software/emacs/

For those with 64-bit Windows, an emacs version is available at: http://sourceforge.net/projects/emacsbinw64/?source=typ_redirect

One of the advantages of Emacs is its automatic syntax coloring of source code. Perl, Python, and Ruby scripts are all displayed with colored fonts distinguishing commands, comments, quoted text, and conditionals.

OpenOffice

OpenOffice is a free word processing application that also displays and edits PowerPoint-type files and Excel-type files. It can also produce mathematical symbols, equations, and notations, providing much of the functionality of LaTeX (see Glossary item, LaTeX). OpenOffice is available at: https://www.openoffice.org/download/

LibreOffice

LibreOffice is a spin-off project from OpenOffice, intended to offer equivalent functionality plus greater extensibility. LibreOffice is available at: https://www.libreoffice.org/

Command Line Utilities

The "command line" is a one-line instruction, sent to the operating system, and entered at the operating system prompt. The command line is an important feature of most operating systems.89 The term "command line utility" refers to a utility that can be launched via a command line instruction. By convention, command line utilities permit the user to include arguments (ie, required or optional parameters that modify the behavior of the utility, added to the command line in no particular order).

To enter command line arguments, you must have a command prompt (also called the "shell prompt" or just "the shell" in Unix-like operating systems). In the Windows operating system, the "Command Prompt" application provides the functionality of the old "DOS" screen. The "Command Prompt" application sits inconspicuously among the minor Windows system programs. Once you find Command Prompt among the list of available programs in your system, it would be wise to move its icon to a place of honor on your desktop, for easy access. Entering the word "help" at the command prompt produces a long list of the available DOS commands (Fig. 1.1).

Figure 1.1 At the DOS prompt, which happens to be set to the c:\ftp subdirectory in this example, the "help" command displays several screens of commands, any of which can be asserted from the command line.

The most common use of the Command prompt is to execute DOS commands (eg, dir, cd, type, copy, ren, rd), familiar to anyone who has used DOS-based computers, prior to the advent of Windows (Fig. 1.2).

Figure 1.2 The DOS prompt window, displaying the DOS prompt (ie, c:\>), a DOS command (ie, dir), and the screen dump exhibiting the results of the dir command, listing the current directory contents of the author's home computer.

The command line gains power when it invokes applications and utilities. Here is an example of a command line instruction that calls an ImageMagick utility (Fig. 1.3).

Figure 1.3 A single command line, albeit a lengthy one, that calls ImageMagick's "convert" utility to create a gray gradient background to the words "Hello, World."

The first word following the system prompt is "convert," the name of an ImageMagick utility (see Glossary item, ImageMagick). The full name of the utility is convert.exe, but the operating system is programmed to know that "convert" represents an executable file within the current directory, or in the list of files in its PATH (ie, the operating system's go-to list of executable files) (see Glossary items, Exe file, Executable file). The convert utility expects a list of parameters and the name of an output file.

The resulting image is shown (Fig. 1.4):

Figure 1.4 The output file, hello.png, produced from the command line displayed in Fig. 1.3.

How did I know the proper command line syntax for ImageMagick's convert utility? The ImageMagick download, widely available from multiple locations on the Internet, comes with detailed documentation. This documentation indicates that ImageMagick contains "convert," among many other utilities, and provides instructions for implementing these utilities via the command line. The following command line requests instructions, via the "-help" argument, for using the montage utility.

c:\>montage -help

There are many thousands of publicly available command line utilities. Like anything on the Internet, utilities may not always work as advertised, and it may be impossible to know when a utility has been corrupted with malicious code. It is best to download only trusted utilities, from trusted websites. In this book, I try to confine my suggested utilities to tested and popular resources (see Glossary item, Data resource).

Cygwin, Linux Emulation for Windows

Three decades ago, every computer user was faced with deciding which operating system to use (eg, Unix, DOS, Macintosh, Amiga) and which programming language to learn (eg, C, Basic, Fortran). The choice would lock the user into a selection of hardware and a style of programming that could not be easily reversed. Over the next two decades, the importance of the decision did not diminish, but the proliferation of programming languages made the decision much more difficult. The computer magazine literature from the 90s was crammed with titles devoted to one particular operating system or one particular programming language.

Everything is so much simpler today. With a little effort, computer users can enjoy the particular benefits offered by most operating systems and most programming languages, from their home computer. For myself, I use Windows as my default operating system, because it comes preinstalled on most computers sold in the United States. One of the very first things I do after booting a new computer is to install Cygwin, a Linux-like interface for Windows.

Cygwin, and supporting documentation, can be downloaded from: http://www.cygwin.com/

Cygwin opens in a window that produces a shell prompt (equivalent to the Windows C prompt) from which Unix programs can be launched. For myself, I use Cygwin primarily as a source of Linux utilities, of which there are hundreds. In addition, Cygwin comes bundled with some extremely useful applications, such as Perl, Python, and Gnuplot. Cygwin distributions containing Ruby are also available (Fig. 1.5).

f01-05-9780128037812
Figure 1.5 The Cygwin screen, emulating a Linux system, under the Windows operating system.

Windows users are not restricted to launching Linux applications from the Cygwin shell prompt. A command line from the Windows C prompt will launch Cygwin utilities. Likewise, a system call from a Perl, Python, or Ruby script can make interactive calls to Cygwin applications (see Glossary item, System call).

Here are a few examples:

c:\cygwin64\bin>wc c:\ftp\simplify.txt

output:

10718 123383 849223 c:\ftp\simplify.txt

Here, we begin at the C prompt for the subdirectory in which the Cygwin command line utilities are stored (ie, c:\cygwin64\bin>). We enter the Linux command, wc, followed by the path/filename of the interrogated plain-text file. The Linux "wc" command is a word counter. The three numbers returned, "10718 123383 849223", are the number of lines in the file, the number of words in the file, and the number of bytes in the file, respectively.

Data scientists eventually learn that there are some tasks that are best left to Linux. Having Cygwin installed on your Windows system will make life easier for you, and for your collaborators, who may prefer to work in Linux.

DOS Batch Scripts

When you have a succession of DOS commands that you use routinely, it is often convenient to collect them all in a plain-text file that will run your commands in tandem. Such files are called batch files and can be created in any text editor by simply listing the commands, line-by-line, and appending the suffix ".bat" to the named file that holds the commands.

The simplest batch files are often the most useful. As an example, you may have a collection of working files that you would like to back up, at the end of each work day, onto an external thumb drive. Assuming the thumb drive is assigned the f: drive, just list your copy commands in a batch file, mount your thumb drive, and invoke the name of the batch file at the command prompt.

Here is an example of a batch file, fback.bat, that copies 10 of my work-in-progress files to a thumb drive.

copy diener.txt f:

copy simplify.txt f:

copy re-ana.txt f:

copy phenocop.txt f:

copy mystery.txt f:

copy disaster.txt f:

copy factnote.txt f:

copy perlbig.txt f:

copy create.txt f:

copy exploreo.txt f:

To invoke the batch file, just enter its name from the subdirectory in which it resides. For example:

c:\ftp>fback.bat

Batch files will initiate any action that would otherwise be launched from the command line. This includes calling utilities. For example, the magick.bat batch file, vide infra, launches ImageMagick's "convert" application and applies contrast twice, yielding a modified file named results.png. The current directory is then changed to the location wherein the Cygwin executables reside. From this location, Cygwin's version of ImageMagick's imdisplay (image display) application is launched, to display the "results.png" image file, at which time the directory is changed back to its start location. Notice that the magick.bat batch file invokes a Linux application, contained in the Cygwin application, directly from a DOS line; a neat trick. Of course, this batch file requires ImageMagick and Cygwin to be installed on your computer.

convert c:\ftp\eqn.jpg -contrast -contrast c:\ftp\results.png

cd c:\cygwin64\home\E-Rock

imdisplay c:\ftp\results.png

cd c:\ftp

In the prior example, the name of the image file used in the batch file was preassigned (ie, so-called hard-coded into the script). For most practical applications, you won't know which files to modify until you are ready to launch the batch file. Batch files have a simple syntax for substituting variable names, into a batch command, at the moment when the batch file is launched. Strings separated by spaces appearing after the name of the batch file are assigned to numbered variables that are passed to the batch file.

Here is the command line that launches the showpic.bat batch file:

c:\cygwin64\home\E-Rock>showpic.bat eqn.jpg

Following the name of the batch file, there is a parameter, "eqn.jpg", which will be passed implicitly to the batch file, as the numbered variable, "%1." We can see how the passed parameter is used, when we inspect the code for the showpic.bat batch file.

imdisplay %1

The batch file loads the provided filename into the variable, %1, and displays the eqn.jpg image.

Batch files can include conditional statements and loop iterators. The following batch file, a.bat, contains a conditional statement and a go-to command. If the user supplies a parameter (a path/filename in this case) when the batch file is invoked, an application (which happens to be "a.exe" here; you would need to substitute your own exe file) is launched, using the provided filename. If no filename is provided, the application is launched using "c:ftpperlbig.txt" as the default parameter. Here is the a.bat batch file:

cd C:\ftp\back\aurora

if not "%1" == "" goto start

a.exe c:\ftp\perlbig.txt

goto end

:start

a.exe c:\ftp\%1

:end

cd c:\ftp

exit

Experienced programmers simplify their lives by composing short batch files that automate routine processes.

Linux Bash Scripts

The bash script, also known as the shell script, is the Unix equivalent of the DOS batch script; a batch of shell commands collected into a short plain-text file, and executed by invoking the name of the file, from the command prompt. The name bash is an acronym for "Bourne-again shell," a pun on the earlier Bourne shell that it replaced. There are thousands of open source and freely available bash files that users will find helpful.

One of the peculiarities of Linux bash files is that prior to launch, downloaded bash files will, in most cases, need to have their file permissions reset. At the Linux shell prompt, the "chmod +x" command, shown here, changes the mode of the file (mybashfile in this instance) to make it executable by the user.

$ chmod +x mybashfile

Here is an example demonstrating the typical syntax for launching a bash file from the Cygwin shell:

$ mybashfile infile.gif out.gif -s 5 -d 5 -c 5 -g 1 -p 2 -b white

Typically, the name of the bash file, in this case "mybashfile", is followed by the name of an existing file on which the bash file will operate ("infile.gif" here), followed by the name of an output file ("out.gif" here), followed by a list of options and parameters that modify the behavior of the bash file. The options are preceded by a dash (-s, -d, -c, -g, -p, -b) and each option is immediately followed by a space, and then an input parameter, also called the input argument. How do we know what options to use and what arguments to supply? By convention, bash files include a help screen that can be invoked with the name of the bash file followed by "-h" or "-help":

$ mybashfile -h

As a curious subpopulation of humanity, Linux devotees are eager to spend any amount of time, no matter how large, writing applications intended to save any amount of time, no matter how small. Consequently, when it comes to parsimonious approaches to data simplification, nothing beats Linux utilities. Linux programmers have prepared, through the decades, an amazing library of bash scripts, capable of performing a wide range of useful tasks.90,91 A great source of hundreds of brilliant and useful image utilities, available as bash files, is found at Fred Weinhaus' site: http://www.fmwconcepts.com/imagemagick/plot3D/index.php

Interactive Line Interpreters

Interpreted languages, such as Python, Ruby, and Perl, permit scripts to be run line-by-line; whereas executable files (eg, C programs) are compiled and executed in toto, as though the entire program consisted of a single, long line of code. Because scripts are interpreted line-by-line, they can be developed via line interpreters, permitting programmers to test each line of code, as the script is being written.

Python, Ruby, and Perl provide line interpreters for programmers who take pride in crafting perfect lines of code. In the following examples, a line of code prints the sum of 5 + 5.

The line interpreter for Python is invoked by typing "python" at the system prompt:

c:\>python

Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:06:53) [MSC v.1600 64 bit (AMD64)] on win32

Type "help", "copyright", "credits", or "license" for more information.

>>> print(5 + 5)

10

>>>

The line interpreter for Ruby is invoked by typing "irb" at the system prompt:

c:\>irb

irb(main):001:0> puts(5+5)

10

=> nil

The Ruby line interpreter keeps track of the line number of the code.

Perl will interpret a line of code by invoking "perl" followed by an argument composed of the evaluate switch (-e) and the command.

c:>perl -e "print 5+5";

10

Perl users may choose to interpret code lines using the Perl debugger. The debugger is activated with one of Perl's famously obscure, syntactically abbreviated operating system command lines, from the system prompt:

c:\ftp>perl -d -e 1

The command line activates a line-by-line interface to the Perl interpreter as shown:

c:\ftp>perl -d -e 1

Loading DB routines from perl5db.pl version 1.33

Editor support available.

Enter h or `h h' for help, or `perldoc perldebug' for more help.

main::(-e:1): 1

DB<1> print(gmtime());

3937203411501220

DB<2> print(1+6);

7

DB<3> print(4/0);

Illegal division by zero at (eval 11)[C:/Perl64/lib/perl5db.pl:640] line 2.

Like the Ruby "irb" line interpreter, the Perl debugger keeps track of the line number.

Package Installers

Perl, Python, and Ruby have a seemingly endless number of open source packages that extend the functionality of the standard distribution of each language. All three languages provide package managers that will download any package you desire, from an archive site on the web, and install the language-specific package on your computer. Once installed, the packages can be called into service in scripts.

Currently, there are over 64,000 Python packages available at the Python Package Index repository at: https://pypi.python.org/pypi

The installer for Python is pip. Pip comes preinstalled in Python 2.7.9 and later (Python version 2 series) and Python 3.4 and later (Python 3 series). A usage example for pip is:

$ pip install rdflib (at the Linux shell prompt)

c:\>pip install rdflib (at the Windows command prompt)

Currently, the Comprehensive Perl Archive Network (CPAN) has nearly 154,000 Perl modules, with over 12,000 contributors. You can search the multitude of Perl modules to your heart's content, at: https://metacpan.org/

The CPAN modules are available for installation through the CPAN installer, included in the newer distributions of Perl. For Windows users, a CPAN installer is included in the Strawberry Perl installation (vide supra).

At the C prompt, enter "cpan"

c:\>cpan

Perl then produces a "cpan" prompt from which you can enter an "install" command, followed by the name of the CPAN module you would like to install (the Chart::Gnuplot module in this example).

cpan> install Chart::Gnuplot

Public installation modules for Ruby are known as Gems. There are currently nearly 7000 Gems available from: https://rubygems.org/

The Gem installer comes standard with Ruby. A usage example, for installing the sqlite3 gem, is:

c:\>gem install sqlite3-ruby -v=1.2.3

System Calls

A system call is a command line, inserted into a software program, that interrupts the script while the operating system executes the command line. Immediately afterward, the script resumes at the next line of code. Any utility that runs from the command line can be embedded in any scripting language that supports system calls, and this includes all of the languages discussed in this book.

Here are the properties of system calls that make them useful to programmers:

1. System calls can be inserted into iterative loops (eg, while loops, for loops) so that they can be repeated any number of times, on collections of files, or data elements (see Glossary item, Iterator).

2. Variables that are generated at run-time (ie, during the execution of the script) can be included as arguments added to the system call.

3. The results of the system call can be returned to the script, and used as variables (see the sketch below).

4. System calls can utilize any operating system command and any program that would normally be invoked through a command line, including external scripts written in other programming languages. Hence, a system call can initiate an external script written in an alternate programming language, composed at run-time within the original script, using variables generated in the original script, and capturing the output from the external script for use in the original script (see Glossary item, Metaprogramming)!

System calls enhance the power of any programming language by providing access to a countless number of external methods and by participating in iterated actions using variables created at run-time.
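
To illustrate the third property, here is a minimal Python sketch that captures the output of a system call in a script variable. The sketch uses Python's standard subprocess module, and assumes a Windows system on which the "dir" command is available:

#!/usr/local/bin/python
import subprocess
#run the command and capture its output as a variable,
#instead of letting it print to the screen
listing = subprocess.check_output("dir", shell=True)
print(listing)
exit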

How does the system call help with the task of data simplification? Data simplification is very often focused on uniformity and reproducibility. If you have 100,000 images, data simplification might involve calling ImageMagick to resize every image to the same height and width (see Glossary item, ImageMagick). If you need to convert spreadsheet data to a set of triples, then you might need to provide a UUID string to every triple in the database all at once (see Glossary items, UUID, Universally Unique IDentifier, Triple). If you are working on a Ruby project and you need to assert one of Python's numpy methods on every data file in a large collection of data files, then you might want to create a short Python file that can be accessed, via a system call, from your Ruby script (see Glossary item, Numpy).

Once you get the hang of including system calls in your scripts, you will probably use them in most of your data simplification tasks. It's important to know how system calls can be used to great advantage in Perl, Python, and Ruby. A few examples follow.

The following short Perl script makes a system call, consisting of the DOS "dir" command:

#!/usr/bin/perl

system("dir");

exit;

The "dir" command, launched as a system call, displays the files in the current directory. Here is the equivalent script, in Python:

#!/usr/local/bin/python

import os

os.system("dir")

exit

Notice that system calls in Python require the importation of the os (operating system) module into the script.

Here is an example of a Ruby system call, to ImageMagick's "Identify" utility. The system call instructs the "Identify" utility to provide a verbose description of the image file, 3320_out.jpg, and to pipe the output into the text file, myimage.txt.

#!/usr/bin/ruby

system("Identify -verbose c:/ftp/3320_out.jpg > myimage.txt")

exit

Here is an example of a Perl system call, to ImageMagick's "convert" utility, that incorporates a Perl variable ($file, in this case) that is passed to the system call. If you run this script on your own computer, please note that the system( ) command must appear as a single line of code. The system line was broken here, into two lines, to accommodate the printed page.

#!/usr/local/bin/perl

$file = "try2.gif";

system("convert -size 350x40 xc:lightgray -font Arial -pointsize 32 -fill black

-gravity north -annotate + 0+0 "Hello, World" $file");

exit;

The following Python script opens the current directory and parses through every filename, looking for jpeg image files. When a jpeg file is encountered, the script makes a system call to ImageMagick, instructing ImageMagick's "convert" utility to copy the jpeg file to the thumb drive (designated as the f: drive), in the form of a grayscale image. If you try this script at home, be advised that it requires a thumb drive, mounted as the "f:" drive.

#!/usr/local/bin/python

import os

filelist = os.listdir(".")

for file in filelist:

 if ".jpg" in file:

  img_in = file

  img_out = "f:/" + file

  command = "convert " + img_in + " -set colorspace Gray -separate -average " + img_out

  os.system(command)

exit

Let's look at a Ruby script that calls a Perl script, a Python script, and another Ruby script, from within one Ruby script.

Here are the Perl, Python, and Ruby scripts that will be called from within a Ruby script:

hi.py

#!/usr/local/bin/python

print("Hi, I'm a Python script")

exit

hi.pl

#!/usr/local/bin/perl

print "Hi, I'm a Perl script\n";

exit;

hi.rb

#!/usr/local/bin/ruby

puts "Hi, I'm a Ruby script"

exit

Here is the Ruby script, call_everyone.rb, that calls external scripts, written in Python, Perl, and Ruby:

#!/usr/local/bin/ruby

system("python hi.py")

system("perl hi.pl")

system("ruby hi.rb")

exit

Here is the output of the Ruby script, call_everyone.rb:

c:\ftp>call_everyone.rb

Hi, I'm a Python script

Hi, I'm a Perl script

Hi, I'm a Ruby script

If you have some facility with a variety of language-specific methods and utilities, you can deploy them all from within your favorite scripting language.
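
To see the fourth property in action, here is a minimal Python sketch in which a script composes a second script at run-time, using a run-time variable, executes it with a system call, and captures its output. The filename, hello_once.py, is invented for this example:

#!/usr/local/bin/python
import subprocess
name = "world"
#a variable generated at run-time, inserted into a new script
with open("hello_once.py", "w") as script:
    script.write("print('Hello, " + name + "')")
#execute the newly composed script and capture its output
output = subprocess.check_output("python hello_once.py", shell=True)
print(output)
exit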

Glossary

ANSI The American National Standards Institute (ANSI) accredits standards developing organizations to create American National Standards (ANS). A so-called ANSI standard is produced when an ANSI-accredited standards development organization follows ANSI procedures and receives confirmation, from ANSI, that all the procedures were followed. ANSI coordinates efforts to gain international standards certification from the ISO (International Standards Organization) or the IEC (International Electrotechnical Commission). ANSI works with hundreds of ANSI-accredited standards developers.

ASCII ASCII is the American Standard Code for Information Interchange, ISO 14962-1997. The ASCII standard is a way of assigning specific 8-bit strings (a string of 0s and 1s of length 8) to the alphanumeric characters and punctuation. Uppercase letters are assigned a different string of 0s and 1s than their matching lowercase letters. There are 256 ways of combining 0s and 1s in strings of length 8. This means that there are 256 different ASCII characters, and every ASCII character can be assigned a number-equivalent, in the range of 0 to 255. The familiar keyboard keys produce ASCII characters that happen to occupy ASCII values under 128. Hence, alphanumerics and common punctuation are represented as 8 bits, with the first bit, "0," serving as padding. For this reason, keyboard characters are commonly referred to as 7-bit ASCII. These are the classic printable ASCII characters:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ
[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

Files composed exclusively of common keyboard characters are commonly referred to as plain-text files or as 7-bit ASCII files. See Text editor. See Plain-text. See ISO.

Abstraction In the context of object-oriented programming, abstraction is a technique whereby a method is simplified to a generalized form that is applicable to a wide range of objects, but for which the specific characteristics of the object receiving the method may be used to return a result that is suited to the object. Abstraction, along with polymorphism, encapsulation, and inheritance, are essential features of object-oriented programming languages. See Polymorphism. See Inheritance. See Encapsulation.

Algorithm An algorithm is a logical sequence of steps that lead to a desired computational result. Algorithms serve the same function in the computer world as production processes serve in the manufacturing world. Fundamental algorithms can be linked to one another, to create new algorithms. Algorithms are the most important intellectual capital in computer science. In the past half-century, many brilliant algorithms have been developed.92,93

American Standard Code for Information Interchange See ASCII.

Autocoding When nomenclature coding is done automatically, by a computer program, the process is known as "autocoding" or "autoencoding." See Coding. See Nomenclature. See Autoencoding.

Autoencoding Synonym for autocoding. See Autocoding.

Binary data Computer scientists say that there are 10 types of people: those who think in terms of binary numbers, and those who do not. Pause for laughter and continue. All digital information is coded as binary data. Strings of 0s and 1s are the fundamental units of electronic information. Nonetheless, some data is more binary than other data. In text files, 8-bit sequences are equivalent to decimals in the range of 0 to 255, and these decimal numbers are mapped as characters, as determined by the ASCII standard. In several raster image formats (ie, formats consisting of rows and columns of pixel data), 24-bit pixel values are chopped into red, green, and blue values of 8-bits each. Files containing various types of data (eg, sound, movies, telemetry, formatted text documents) all have some kind of low-level software that takes strings of 0s and 1s and converts them into data that has some particular meaning for a particular use. So-called plain-text files, including html files and xml files, are distinguished from binary data files and referred to as plain-text or ASCII files. Most computer languages have an option wherein files can be opened as "binary," meaning that the 0s and 1s are available to the programmer, without the intervening translation into characters or stylized data. See ASCII.

Blended class Also known as class noise, subsumes the more familiar, but less precise term, "Labeling error." Blended class refers to inaccuracies (eg, misleading results) introduced in the analysis of data due to errors in class assignments (ie, assigning a data object to class A when the object should have been assigned to class B). If you are testing the effectiveness of an antibiotic on a class of people with bacterial pneumonia, the accuracy of your results will be forfeit when your study population includes subjects with viral pneumonia, or smoking-related lung damage. Errors induced by blending classes are often overlooked by data analysts who incorrectly assume that the experiment was designed to ensure that each data group is composed of a uniform population. A common source of class blending occurs when the classification upon which the experiment is designed is itself blended. For example, imagine that you are a cancer researcher and you want to perform a study of patients with malignant fibrous histiocytomas (MFH), comparing the clinical course of these patients with the clinical course of patients who have other types of tumors. Let's imagine that the class of tumors known as MFH does not actually exist; that it is a grab-bag term erroneously assigned to a variety of other tumors that happened to look similar to one another. This being the case, it would be impossible to produce any valid results based on a study of patients diagnosed as having MFH. The results would be a biased and irreproducible cacophony of data collected across different, and undetermined, classes of tumors. Believe it or not, this specific example, of the blended MFH class of tumors, is selected from the real-life annals of tumor biology.94,95 The literature is rife with research of dubious quality, based on poorly designed classifications and blended classes. A detailed discussion of this topic is found in Section 6.5 of Chapter 6, Properties that Cross Multiple Classes. One caveat. Efforts to eliminate class blending can be counterproductive if undertaken with excess zeal. For example, in an effort to reduce class blending, a researcher may choose groups of subjects who are uniform with respect to every known observable property. For example, suppose you want to actually compare apples with oranges. To avoid class blending, you might want to make very sure that your apples do not include any kumquats, or persimmons. You should be certain that your oranges do not include any limes or grapefruits. Imagine that you go even further, choosing only apples and oranges of one variety (eg, Macintosh apples and navel oranges), size (eg, 10 cm), and origin (eg, California). How will your comparisons apply to the varieties of apples and oranges that you have excluded from your study? You may actually reach conclusions that are invalid and irreproducible for more generalized populations within each class. In this case, you have succeeded in eliminating class blending, while losing representative populations of the classes. See Simpson's paradox.

CPAN The Comprehensive Perl Archive Network, known as CPAN, has nearly 154,000 Perl packages, with over 12,000 contributors. These packages greatly extend the functionality of Perl, and include virtually every type of Perl method imaginable (eg, math, statistics, communications, plotting, and numerical analyses). Any CPAN Perl package can be easily downloaded and automatically installed on your computer's Perl directory when you use the CPAN installer. For instructions, see Open Source Tools. You can search the multitude of Perl modules to your heart's content at: https://metacpan.org/.

Check digit A checksum that produces a single digit as output is referred to as a check digit. Some of the common identification codes in use today, such as ISBN numbers for books, come with a built-in check digit. Of course, when using a single digit as a check value, you can expect that some transmitted errors will escape the check, but the check digit is useful in systems wherein occasional mistakes are tolerated; or wherein the purpose of the check digit is to find a specific type of error (eg, an error produced by a substitution in a single character or digit), and wherein the check digit itself is rarely transmitted in error. See Checksum.
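
To make the idea concrete, here is a minimal Python sketch of a check digit computed as a position-weighted sum, modulo 10. The scheme is invented for this demonstration; it is not the actual ISBN algorithm:

def check_digit(code):
    #weight each digit by its position, sum the products,
    #and keep only the last digit of the total
    total = sum((position + 1) * int(digit) for position, digit in enumerate(code))
    return total % 10
print(check_digit("750642"))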

Checksum An outdated term that is sometimes used synonymously with one-way hash or message digest. Checksums are performed on a string, block, or file, yielding a short alphanumeric string intended to be specific for the input data. Ideally, if a single bit were to change, anywhere within the input file, then the checksum for the input file would change drastically. Checksums, as the name implies, involve summing values (typically weighted character values), to produce a sequence that can be calculated on a file before and after transmission. Most of the errors that were commonly introduced by poor transmission could be detected with checksums. Today, the old checksum algorithms have been largely replaced with one-way hash algorithms. A checksum that produces a single digit as output is referred to as a check digit. See Check digit. See Message digest. See HMAC.
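
For a feel of how the modern replacements behave, here is a minimal Python sketch using the standard hashlib module; changing a single character of the input changes the digest drastically:

import hashlib
#two inputs differing by a single character
print(hashlib.md5(b"Hello, World").hexdigest())
print(hashlib.md5(b"Hello, world").hexdigest())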

Child class The direct or first generation subclass of a class. Sometimes referred to as the daughter class or, less precisely, as the subclass. See Parent class. See Classification.

Chloroplast evolution Chloroplasts are the organelles (little membrane-wrapped replicating structures within cells) that produce glucose and oxygen, via a process that can be loosely described as: carbon dioxide + water + light energy → carbohydrate + oxygen. Chloroplasts are found in plants. Aside from photosynthesis occurring in plants, we can also observe photosynthesis in cyanobacteria. Photosynthesis by cyanobacteria is thought to account for the conversion of our atmosphere from an anoxic environment to an oxygen-rich environment. Did photosynthesis, a complex chemical pathway, evolve twice in terrestrial history; once in cyanobacteria and once again in primitive plants? Present thinking on the subject holds that the evolution of photosynthesis occurred only once, in the distant past, and all photosynthesis ever since, in cyanobacteria and in plants, arose from this one event. It is presumed that plants acquired photosynthesis when they engulfed photosynthesizing cyanobacteria that evolved into self-replicating chloroplasts. This startling conclusion is based on a simple observation: chloroplasts, unlike other plant organelles, are wrapped by two membrane layers. One layer is believed to have been contributed by the captured cyanobacteria, and one layer was presumably contributed by the ancient plant cell as it wrapped the cyanobacteria in its own cell membrane. Whether a complex pathway, such as photosynthesis, can re-evolve in other organisms, is a matter of conjecture.96

Clade A class, plus all of its descendant classes. A clade should be distinguished from a lineage, the latter being a class and its ascendant classes. Because a class can have more than one child class, a pictogram of a clade will often look like a branching tree. In a classification, where each class is restricted to having one parent class, ascending lineages (from the Latin linea, "a line"), are represented as a nonbranching line of ancestors, leading to the root (ie, top class) of the classification. Of course, in an ontology, where there are no restrictions on the number of parents a class may have, pictograms of clades and lineages will tend to be florid.

Cladistics The technique of producing a hierarchy of clades, wherein each clade is a monophyletic class. See Monophyletic class.

Class A class is a group of objects that share a set of properties that define the class and that distinguish the members of the class from members of other classes. The word "class," lowercase, is used as a general term. The word "Class," uppercase, followed by an uppercase noun (eg, Class Animalia), represents a specific class within a formal classification. See Classification.

Classification A system in which every object in a knowledge domain is assigned to a class within a hierarchy of classes. The properties of classes are inherited by their subclasses. Every class has one immediate superclass (ie, parent class), although a parent class may have more than one immediate subclass (ie, child class). Objects do not change their class assignment in a classification, unless there was a mistake in the assignment. For example, a rabbit is always a rabbit, and does not change into a tiger. Classifications can be thought of as the simplest and most restrictive type of ontology, and serve to reduce the complexity of a knowledge domain.3 Classifications can be easily modeled in an object-oriented programming language and are nonchaotic (ie, calculations performed on the members and classes of a classification should yield the same output, each time the calculation is performed). A classification should be distinguished from an ontology. In an ontology, a class may have more than one parent class and an object may be a member of more than one class. A classification can be considered a special type of ontology wherein each class is limited to a single parent class and each object has membership in one, and only one, class. See Nomenclature. See Thesaurus. See Vocabulary. See Dictionary. See Terminology. See Ontology. See Parent class. See Child class. See Superclass. See Unclassifiable objects.
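
Because each class has exactly one parent class, a classification can be modeled directly with single-inheritance class declarations. Here is a minimal Python sketch, using a fragment of the classification of living organisms:

#each class is declared with the name of its one parent class
class Animalia(object): pass
class Chordata(Animalia): pass
class Mammalia(Chordata): pass
#the complete, unambiguous lineage can be recovered at any time
print(Mammalia.__mro__)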

Classification system versus identification system It is important to distinguish a classification system from an identification system. An identification system matches an individual organism with its assigned object name (or species name, in the case of the classification of living organisms). Identification is based on finding several features that, taken together, can help determine the name of an organism. For example, if you have a list of characteristic features: large, hairy, strong, African, jungle-dwelling, knuckle-walking; you might correctly identify the organism as a gorilla. These identifiers are different from the phylogenetic features that were used to classify gorillas within the hierarchy of organisms (Animalia: Chordata: Mammalia: Primates: Hominidae: Homininae: Gorillini: Gorilla). Specifically, you can identify an animal as a gorilla without knowing that a gorilla is a type of mammal. You can classify a gorilla as a member of Class Gorillini without knowing that a gorilla happens to be large. One of the most common mistakes in science is to confuse an identification system with a classification system. The former simply provides a handy way to associate an object with a name; the latter is a system of relationships among objects.

Coding The term "coding" has several very different meanings, depending on which branch of science influences your thinking. For programmers, coding means writing the code that constitutes a computer program. For cryptographers, coding is synonymous with encrypting (ie, using a cipher to encode a message). For medics, coding is calling an emergency team to handle a patient in extremis. For informaticians and library scientists, coding involves assigning an alphanumeric identifier, representing a concept listed in a nomenclature, to a term. For example, a surgical pathology report may include the diagnosis, "Adenocarcinoma of prostate." A nomenclature may assign a code C4863000 that uniquely identifies the concept "Adenocarcinoma." Coding the report may involve annotating every occurrence of the word "Adenocarcinoma" with the "C4863000" identifier. For a detailed explanation of nomenclature coding, and its importance for searching and retrieving data, see the full discussion in Section 3.4 of Chapter 3, Autoencoding and Indexing with Nomenclatures. See Autocoding. See Nomenclature.

Conclusions Conclusions are the interpretations made by studying the results of an experiment or a set of observations. The term "results" should never be used interchangeably with the term "conclusions." Remember, results are verified. Conclusions are validated.97 See Verification. See Validation. See Results.

Curator The word "curator" derives from the Latin "curatus" (taken care of), the same root as "curative," indicating that curators "take care of" things. A data curator collects, annotates, indexes, updates, archives, searches, retrieves, and distributes data. Curator is another of those somewhat arcane terms (eg, indexer, data archivist, and lexicographer) that are being rejuvenated in the new millennium. It seems that if we want to enjoy the benefits of a data-centric world, we will need the assistance of curators, trained in data organization.

Curse of dimensionality As the number of attributes for a data object increases, the multidimensional space becomes sparsely populated, and the distances between any two objects, even the two closest neighbors, become absurdly large. When you have thousands of dimensions (eg, data values in a data record, cells in the rows of a spreadsheet), the space that holds the objects is so large that distances between data objects become difficult or impossible to compute, and most computational algorithms become useless.
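
Here is a minimal Python sketch of the phenomenon: for 1000 random points, the ratio of the nearest distance to the farthest distance (measured from the origin) creeps toward 1 as the number of dimensions grows, so that the closest neighbor is scarcely closer than the farthest:

import random, math
for dimensions in [2, 10, 100, 1000]:
    distances = []
    for point in range(1000):
        coordinates = [random.random() for axis in range(dimensions)]
        distances.append(math.sqrt(sum(value * value for value in coordinates)))
    #as dimensions increase, this ratio creeps toward 1
    print(dimensions, min(distances) / max(distances))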

Data object A data object is whatever is being described by the data. For example, if the data is "6-feet tall," then the data object is the person or thing to which "6-feet tall" applies. Minimally, a data object is a metadata/data pair, assigned to a unique identifier (ie, a triple). In practice, the most common data objects are simple data records, corresponding to a row in a spreadsheet or a line in a flat-file. Data objects in object-oriented programming languages typically encapsulate several items of data, including an object name, an object unique identifier, multiple data/metadata pairs, and the name of the object's class. See Triple. See Identifier. See Metadata.

Data resource A collection of data made available for data retrieval. The data can be distributed over servers located anywhere on earth or in space. The resource can be static (ie, having a fixed set of data), or in flux. Plesionyms for data resource are: data warehouse, data repository, data archive, data store.

Data science A vague term encompassing all aspects of data collection, organization, archiving, distribution, and analysis. The term has been used to subsume the closely related fields of informatics, statistics, data analysis, programming, and computer science.

Data scientist Anyone who practices data science and who has some expertise in a field subsumed by data science (ie, informatics, statistics, data analysis, programming, and computer science).

Data sharing Providing one's own data to another person or entity. This process may involve free or purchased data, and it may be done willingly, or under coercion, as in compliance with regulations, laws, or court orders.

Database A software application designed specifically to create and retrieve large numbers of data records (eg, millions or billions). The data records of a database are persistent, meaning that the application can be turned off, then on, and all the collected data will be available to the user (see Open Source Tools for Chapter 7).

Dictionary A terminology or word list accompanied by a definition for each item. See Nomenclature. See Vocabulary. See Terminology.

Dimensionality The dimensionality of a data object consists of the number of attributes that describe the object. Depending on the design and content of the data structure that contains the data object (ie, database, array, list of records, object instance, etc.), the attributes will be called by different names, such as field, variable, parameter, feature, or property. Data objects with high dimensionality create computational challenges, and data analysts typically reduce the dimensionality of data objects wherever possible.

Dublin Core metadata The Dublin Core metadata is a set of tags that describe the content of an electronic file. These tags were developed by a group of librarians who met in Dublin, Ohio. Syntactical usage of the Dublin Core elements is described in Open Source Tools for Chapter 2.

Encapsulation The concept, from object-oriented programming, that a data object contains its associated data. Encapsulation is tightly linked to the concept of introspection, the process of accessing the data encapsulated within a data object. Encapsulation, Inheritance, and Polymorphism are available features of all object-oriented languages. See Inheritance. See Polymorphism.

Exe file A file with the filename suffix ".exe." In common parlance, filenames with the ".exe" suffix are executable code. See Executable file.

Executable file A file that contains compiled computer code that can be executed directly by the computer's CPU, without interpretation by a programming language. A language such as C will compile C code into executables. Scripting languages, such as Perl, Python, and Ruby, interpret plain-text scripts and send instructions to a run-time engine, for execution. Because executable files eliminate the interpretation step, they typically run faster than plain-text scripts. See Exe file.

Gaussian copula function A formerly praised and currently vilified formula, developed for Wall Street, that calculates the risk of default correlation (ie, the likelihood of two investment vehicles defaulting together). The formula uses the current market value of the vehicles, without factoring in historical data. The formula is easy to implement, and was a favored model for calculating risk in the securitization market. In about 2008, the Gaussian copula function stopped working; soon thereafter came the 2008 global market collapse. In some circles, the Gaussian copula function is blamed for the disaster.98

Generalization Generalization is the process of extending relationships from individual objects to classes of objects. For example, when Isaac Newton observed the physical laws that applied to apples falling to the ground, he found a way to relate the acceleration of an object to its mass and to the force of gravity. His apple-centric observations applied to all objects and could be used to predict the orbit of the moon around the earth, or the orbit of the earth around the sun. Newton generalized from the specific to the universal. Similarly, Darwin's observations on barnacles could be generalized to yield the theory of evolution, thus explaining the development of all terrestrial organisms. Science would be of little value if observed relationships among objects could not be generalized to classes of objects. See Science.

Grid A collection of computers and computer resources that are coordinated to provide a desired functionality. The Grid is the intellectual predecessor of cloud computing. Cloud computing is less physically and administratively restricted than Grid computing.

HMAC Hashed Message Authentication Code. When a one-way hash is employed in an authentication protocol, it is often referred to as an HMAC. See Message digest. See Checksum.

HTML HyperText Markup Language is an ASCII-based set of formatting instructions for web pages. HTML formatting instructions, known as tags, are embedded in the document, and double-bracketed, indicating the start point and end points for instruction. Here is an example of an HTML tag instructing the web browser to display the word "Hello" in italics: <i> Hello </i>. All web browsers conforming to the HTML specification must contain software routines that recognize and implement the HTML instructions embedded within web documents. In addition to formatting instructions, HTML also includes linkage instructions, in which the web browsers must retrieve and display a listed web page, or a web resource, such as an image. The protocol whereby web browsers, following HTML instructions, retrieve web pages from other Internet sites, is known as HTTP (HyperText Transfer Protocol).

ISO International Standards Organization. The ISO is a nongovernmental organization that develops international standards (eg, ISO-11179 for metadata and ISO-8601 for date and time). See ANSI.

Identification The process of providing a data object with an identifier, or the process of distinguishing one data object from all other data objects on the basis of its associated identifier. See Identifier.

Identifier A string that is associated with a particular thing (eg, person, document, transaction, data object), and not associated with any other thing.99 Object identification usually involves permanently assigning a seemingly random sequence of numeric digits (0–9) and alphabet characters (a–z and A–Z) to a data object. A data object can be a specific piece of data (eg, a data record), or an abstraction, such as a class of objects or a number, a string, or a variable. See Identification.
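
In practice, programming languages provide ready-made sources of identifiers. Here is a minimal Python sketch, using the standard uuid module to produce a character string that is, for all practical purposes, unique:

import uuid
#a fresh 36-character identifier at every invocation
identifier = str(uuid.uuid4())
print(identifier)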

ImageMagick An open source utility that supports a huge selection of robust and sophisticated image editing methods. There is a detailed discussion of ImageMagick in Open Source Tools for Chapter 4.

Inheritance In object-oriented languages, data objects (ie, classes and object instances of a class) inherit the methods (eg, functions and subroutines) created for the ancestral classes in their lineage. See Abstraction. See Polymorphism. See Encapsulation.

Instance An instance is a specific example of an object that is not itself a class or group of objects. For example, Tony the Tiger is an instance of the tiger species. Tony the Tiger is a unique animal and is not itself a group of animals or a class of animals. The terms instance, instance object, and object are sometimes used interchangeably, but the special value of the "instance" concept, in a system wherein everything is an object, is that it distinguishes members of classes (ie, the instances) from the classes to which they belong.

Integration Integration, in the computer sciences, involves relating diverse data extracted from different data sources. Data merging is a type of data integration, wherein various sources of data are combined in a manner that preserves meaning and value. The terms "integration" and "interoperability" are sometimes confused with one another. An easy way of thinking about these terms is that integration applies to data, while interoperability applies to software.

Interoperability It is desirable and often necessary to create software that operates with other software, regardless of differences in operating systems and programming language. There are a wide variety of methods by which this can be achieved. The topic of software interoperability has become complex, but it remains a fundamental issue to all attempts to share analytic methodologies. The terms "integration" and "interoperability" should not be confused with one another. An easy way of thinking about these terms is that integration applies to data, while interoperability applies to software.40

Introspection A method by which data objects can be interrogated to yield information about themselves (eg, properties, values, and class membership). Through introspection, the relationships among the data objects can be examined. Introspective methods are built into object-oriented languages. The data provided by introspection can be applied, at run-time, to modify a script's operation; a technique known as reflection. Specifically, any properties, methods, and encapsulated data of a data object can be used in the script to modify the script's run-time behavior. See Reflection.
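
Here is a minimal Python sketch of introspection, in which a data object is interrogated, at run-time, for its class membership and its available methods:

data_object = "hello"
#ask the object for its class
print(type(data_object))
#ask whether the object is a member of the str class
print(isinstance(data_object, str))
#ask for a few of the methods the object will respond to
print(dir(data_object)[0:5])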

Iterator Iterators are simple programming shortcuts that call functions that operate on consecutive members of a data structure, such as a list, or a block of code. Typically, complex iterators can be expressed in a single line of code. Perl, Python, and Ruby all have iterator methods.
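
For example, here is a minimal Python sketch in which a one-line iterator (a list comprehension) operates on every member of a list:

#square every member of the list, in a single line of code
squares = [number * number for number in [1, 2, 3, 4, 5]]
print(squares)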

LaTeX LaTeX is a document preparation system that operates in conjunction with TeX, a typesetting system. Because LaTeX is designed to accommodate scientific and mathematical notations (eg, exponents, integrals, and summations), LaTeX is used widely by publishers of scientific text. LaTeX is discussed in the Open Source Tools section of Chapter 4.

MapReduce A method by which computationally intensive problems can be processed on multiple computers, in parallel. The method can be divided into a mapping step and a reducing step. In the mapping step, a master computer divides a problem into smaller problems that are distributed to other computers. In the reducing step, the master computer collects the output from the other computers. MapReduce is intended for large and complex data resources, holding petabytes of data.

Marshaling Marshaling, like serializing, is a method for achieving data persistence (ie, saving variables and other data structures produced in a program, after the program has stopped running). Marshaling methods preserve data objects, with their encapsulated data and data structures. See Persistence. See Serializing.

Meaning In informatics, meaning is achieved when described data is bound to a unique identifier of a data object. The assertion that "Claude Funston's height is five feet eleven inches," comes pretty close to being a meaningful statement. The statement contains data (five feet eleven inches), and the data is described (height). The described data belongs to a unique object (Claude Funston). Ideally, the name "Claude Funston" should be provided with a unique identifier, to distinguish one instance of Claude Funston from all the other persons who are named Claude Funston. The statement would also benefit from a formal system that ensures that the metadata makes sense (eg, What exactly is height, and does Claude Funston fall into a class of objects for which height is a property?) and that the data is appropriate (eg, Is 5 feet 11 inches an allowable measure of a person's height?). A statement with meaning does not need to be a true statement (eg, The height of Claude Funston was not 5 feet 11 inches when Claude Funston was an infant). See Semantics. See Triple. See RDF.

Message digest Within the context of this book, "message digest," "digest," "HMAC," and "one-way hash" are equivalent terms. See HMAC.

Metadata The data that describes data. For example, a data element (also known as data point) may consist of the number, "6." The metadata for the data may be the words "Height, in feet." A data element is useless without its metadata, and metadata is useless unless it adequately describes a data element. In XML, the metadata/data annotation comes in the form <metadata tag> data <end of metadata tag> and might look something like:

< weight_in_pounds>150 </weight_in_pounds >

In spreadsheets, the data elements are the cells of the spreadsheet. The column headers are the metadata that describe the data values in the columns’ cells, and the row headers are the record numbers that uniquely identify each record (ie, each row of cells). See XML.

Metaprogramming A metaprogram is a program that creates or modifies other programs. Metaprogramming is a powerful feature found in languages that are modifiable at runtime. Perl, Python, and Ruby are all metaprogramming languages. There are several techniques that facilitate metaprogramming features, including introspection and reflection. See Reflection. See Introspection.

Method Roughly equivalent to functions, subroutines, or code blocks. In object-oriented languages, a method is a subroutine available to an object (class or instance). In Ruby and Python, instance methods are declared with a "def" declaration followed by the name of the method, in lowercase. Here is an example, in Ruby, for the "hello" method, written for the Salutations class.

class Salutations
  def hello
    puts "hello there"
  end
end

Microarray Also known as gene chip, gene expression array, DNA microarray, or DNA chips. These consist of thousands of small samples of chosen DNA sequences arrayed onto a block of support material (such as a glass slide). When the array is incubated with a mixture of DNA sequences prepared from cell samples, hybridization will occur between molecules on the array and single-stranded complementary molecules present in the cell sample. The greater the concentration of complementary molecules in the cell sample, the greater the number of fluorescently tagged hybridized molecules in the array. A specialized instrument prepares an image of the array, and quantifies the fluorescence in each array spot. Spots with high fluorescence indicate relatively large quantities of DNA in the cell sample that match the specific sequence of DNA in the array spot. The data comprising all the fluorescent intensity measurements for every spot in the array produces a gene profile characteristic of the cell sample.

Monophyletic class A class of organisms that includes a parent organism and all its descendants, while excluding any organisms that did not descend from the parent. If a subclass of a parent class omits any of the descendants of the parent class, then the parent class is said to be paraphyletic. If a subclass of a parent class includes organisms that did not descend from the parent, then the parent class is polyphyletic. A class can be paraphyletic and polyphyletic, if it excludes organisms that were descendants of the parent and if it includes organisms that did not descend from the parent. The goal of cladistics is to create a hierarchical classification that consists exclusively of monophyletic classes (ie, no paraphyly, no polyphyly). See Cladistics. See Clade.

Multiclass classification A misnomer imported from the field of machine learning, indicating the assignment of an instance to more than one class. Classifications, as defined in this book, impose one-class classification (ie, an instance can be assigned to one and only one class). It is tempting to think that a ball should be included in class "toy" and in class "spheroids," but multiclass assignments create unnecessary classes of inscrutable provenance, and taxonomies of enormous size, consisting largely of replicate items. See Multiclass inheritance. See Taxonomy.

Multiclass inheritance In ontologies, multiclass inheritance occurs when a child class has more than one parent class. For example, a member of Class House may have two different parent classes: Class Shelter, and Class Property. Multiclass inheritance is generally permitted in ontologies but is forbidden in one type of restrictive ontology, known as a classification. See Classification. See Parent class. See Multiclass classification.

Namespace A namespace is the realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata tag "date" may be used to signify a calendar date, or the fruit, or the social engagement. To avoid confusion, metadata terms are assigned a prefix that is associated with a Web document that defines the term (ie, establishes the tag's namespace). In practical terms, a tag that can have different descriptive meanings in different contexts is provided with a prefix that links to a web document wherein the meaning of the tag, as it applies in the XML document is specified.

New analytic method The chief feature that distinguishes a new analytic method from an old analytic method is that we have not yet learned the limitations and faults associated with the new method. Although it is true that all great analytic methods were, at one time, new methods, it is also true that all terrible analytic methods were, at one time, new methods. Historically, there have been many more failed methods than successful methods. Hence, most new analytic methods are bad methods. See Gaussian copula function.

Nomenclature A nomenclature is a listing of terms that cover all of the concepts in a knowledge domain. A nomenclature is different from a dictionary for three reasons: (1) the nomenclature terms are not annotated with definitions, (2) nomenclature terms may be multi-word, and (3) the terms in the nomenclature are limited to the scope of the selected knowledge domain. In addition, most nomenclatures group synonyms under a group code. For example, a food nomenclature might collect submarine, hoagie, po' boy, grinder, hero, and torpedo under an alphanumeric code such as "F63958." Nomenclatures simplify textual documents by uniting synonymous terms under a common code. Documents that have been coded with the same nomenclature can be integrated with other documents that have been similarly coded, and queries conducted over such documents will yield the same results, regardless of which term is entered (ie, a search for either hoagie or po' boy will retrieve the same information, if both terms have been annotated with the synonym code, "F63958"). Optimally, the canonical concepts listed in the nomenclature are organized into a hierarchical classification.100,101 See Coding. See Autocoding.
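
Here is a minimal Python sketch of the sandwich example, mapping synonymous terms to a single invented group code, so that a query on any synonym retrieves the same code:

nomenclature = {"submarine": "F63958", "hoagie": "F63958",
                "po' boy": "F63958", "grinder": "F63958",
                "hero": "F63958", "torpedo": "F63958"}
#any synonym yields the same group code
print(nomenclature["hoagie"])
print(nomenclature["torpedo"])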

Nonatomicity Nonatomicity is the assignment of a collection of objects to a single composite object that cannot be further simplified or sensibly deconstructed. For example, the human body is composed of trillions of individual cells, each of which lives for some length of time, and then dies. Many of the cells in the body are capable of dividing to produce more cells. In many cases, the cells of the body that are capable of dividing can be cultured and grown in plastic containers, much like bacteria can be cultured and grown in Petri dishes. If the human body is composed of individual cells, why do we habitually think of each human as a single living entity? Why don't we think of humans as bags of individual cells? Perhaps the reason stems from the coordinated responses of cells. When someone steps on the cells of your toe, the cells in your brain sense pain, the cells in your mouth and vocal cords say "ouch," and an army of inflammatory cells rush to the scene of the crime. The cells in your toe are not capable of registering an actionable complaint without a great deal of assistance. The reason that organisms, composed of trillions of living cells, are generally considered to have nonatomicity probably relates to the "species" concept in biology. Every cell in an organism descended from the same zygote, and every zygote in every member of the same species descended from the same ancestral organism. Hence, there seems to be little benefit to assigning unique entity status to the individual cells that compose organisms, when the class structure for organisms is based on descent through zygotes. See Species.

Notation 3 Also called n3. A syntax for expressing assertions as triples (unique subject + metadata + data). Notation 3 expresses the same information as the more formal RDF syntax, but n3 is easier for humans to read.102 RDF and N3 are interconvertible, and either one can be parsed and equivalently tokenized (ie, broken into elements that can be reorganized in a different format, such as a database record). See RDF. See Triple.

Numpy Also known as Numerical Python, numpy is an open source extension of Python that supports matrix operations and a variety of other mathematical functions. Examples of Python scripts containing numpy methods are found in Chapter 4.

Object See Data object.

Obligate intracellular organism An obligate intracellular organism can only reproduce within a host cell. Obligate intracellular organisms live off the bounty of the host cell. Thus, it would be redundant for such organisms to maintain all of the complex metabolic processes that a free-living organism must synthesize and maintain. Consequently, obligate intracellular organisms adopt a simplified cellular anatomy, often dispensing with much of the genome, much of the cytoplasm, and most of the organelles that were present in their ancestral classes, prior to their switch to intracellular (ie, parasitic) living.

Occam's razor Also known as lex parsimoniae (law of parsimony). Asserts that the least complex solution is, more often than not, the correct solution. See Parsimony. See Solomonoff's theory of inductive inference.

Ontology An ontology is a collection of classes and their relationships to one another. Ontologies are usually rule-based systems (ie, membership in a class is determined by one or more class rules). Two important features distinguish ontologies from classifications. Ontologies permit classes to have more than one parent class. For example, the class of automobiles may be a direct subclass of "motorized devices" and a direct subclass of "mechanized transporters." In addition, an instance of a class can be an instance of any number of additional classes. For example, a Lamborghini may be a member of class "automobiles" and of class "luxury items." This means that the lineage of an instance in an ontology can be highly complex, with a single instance occurring in multiple classes, and with many connections between classes. Because recursive relations are permitted, it is possible to build an ontology wherein a class is both an ancestor class and a descendant class of itself. A classification is a highly restrained ontology wherein instances can belong to only one class, and each class may have only one parent class. Because classifications have an enforced linear hierarchy, they can be easily modeled, and the lineage of any instance can be traced unambiguously. See Classification. See Multiclass classification. See Multiclass inheritance.

Parent class The immediate ancestor, or the next-higher class (ie, the direct superclass) of a class. For example, in the classification of living organisms, Class Vertebrata is the parent class of Class Gnathostomata. Class Gnathostomata is the parent class of Class Teleostomi. In a classification, which imposes single class inheritance, each child class has exactly one parent class; whereas one parent class may have several different child classes. Furthermore, some classes, in particular the bottom class in the lineage, have no child classes (ie, a class need not always be a superclass of other classes). A class can be defined by its properties, its membership (ie, the instances that belong to the class), and by the name of its parent class. When we list all of the classes in a classification, in any order, we can always reconstruct the complete class hierarchy, with its correct lineages and branchings, if we know the name of each class's parent class. See Instance. See Child class. See Superclass.

Parsimony Use of the smallest possible collection of resources in solving a problem or achieving a goal. See Occam's razor. See Solomonoff's theory of inductive inference.

Persistence Persistence is the ability of data to remain available in memory or storage after the program in which the data was created has stopped executing. Databases are designed to achieve persistence. When the database application is turned off, the data remains available to the database application when it is restarted at some later time. See Database. See Marshaling. See Serializing.

Plain-text Plain-text refers to character strings or files that are composed of the characters accessible to a typewriter keyboard. These files typically have a ".txt" suffix to their names. Plain-text files are sometimes referred to as 7-bit ASCII files because all of the familiar keyboard characters have ASCII values under 128 (ie, can be designated in binary with just seven 0s and 1s). In practice, plain-text files exclude 7-bit ASCII symbols that do not code for familiar keyboard characters. To further confuse the issue, plain-text files may contain ASCII characters above 7 bits (ie, characters from 128 to 255) that represent characters that are printable on computer monitors, such as accented letters. See ASCII.

Polymorphism Polymorphism is one of the constitutive properties of an object-oriented language (along with inheritance, encapsulation, and abstraction). Methods sent to object receivers have a response determined by the class of the receiving object. Hence, different objects, from different classes, receiving a call to a method of the same name, will respond differently. For example, suppose you have a method named "divide" and you send the method (ie, issue a command to execute the method) to an object of Class Bacteria and an object of Class Numerics. The bacteria object, receiving the divide method, will try to execute it by looking for the "divide" method somewhere in its class lineage. For a bacterium, the "divide" method may involve making a copy of the bacterium (ie, reproducing) and incrementing the number of bacteria in the population. The numeric object, receiving the "divide" method, will look for the "divide" method in its class lineage and will probably find some method that provides instructions for arithmetic division. Hence, the response of an object to a received method will be appropriate for the class of the object. See Inheritance. See Encapsulation. See Abstraction.
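
Here is a minimal Python sketch of the bacteria-and-numerics example; the class names and return values are invented for the demonstration:

class Bacteria(object):
    def divide(self):
        return "two bacteria"   #for a bacterium, division is reproduction
class Numeric(object):
    def divide(self):
        return "a quotient"     #for a number, division is arithmetic
#the same method name elicits a class-appropriate response
for data_object in [Bacteria(), Numeric()]:
    print(data_object.divide())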

Primary data The original set of data collected to serve a particular purpose or to answer a particular set of questions, and intended for use by the same individuals who collected the data. See Secondary data.

Query One of the first sticking points in any discussion of heterogeneous database queries is the definition of "query." Informaticians use the term "query" to mean a request for records that match a specific set of data element features (eg, name, age, etc.) Ontologists think of a query as a question that matches the competence of the ontology (ie, a question for which the ontology can infer an answer). Often, a query is a parameter or set of parameters that is matched against properties or rules that apply to objects in the ontology. For example, "weight" is a property of physical objects, and this property may fall into the domain of several named classes in an ontology. The query may ask for the names of the classes of objects that have the "weight" property and the numbers of object instances in each class. A query might select several of these classes (eg, including dogs and cats, but excluding microwave ovens), along with the data object instances whose weights fall within a specified range (eg, 20 to 30 pounds).

RDF Resource Description Framework (RDF) is a syntax in XML notation that formally expresses assertions as triples. The RDF triple consists of a uniquely identified subject plus a metadata descriptor for the data plus a data element. Triples are necessary and sufficient to create statements that convey meaning. Triples can be aggregated with other triples from the same data set or from other data sets, so long as each triple pertains to a unique subject that is identified equivalently through the data sets. Enormous data sets of RDF triples can be merged or functionally integrated with other massive or complex data resources. For a detailed discussion, see Open Source Tools for Chapter 6, "Syntax for triples." See Notation 3. See Semantics. See Triple. See XML.

RDF Schema Resource Description Framework Schema (RDFS). A document containing a list of classes, their definitions, and the names of the parent class(es) for each class. In an RDF Schema, the list of classes is typically followed by a list of properties that apply to one or more classes in the Schema. To be useful, RDF Schemas are posted on the Internet, as a Web page, with a unique Web address. Anyone can incorporate the classes and properties of a public RDF Schema into their own RDF documents (public or private) by linking named classes and properties, in their RDF document, to the web address of the RDF Schema where the classes and properties are defined. See Namespace. See RDFS.

RDFS Same as RDF Schema.

Reflection A programming technique wherein a computer program will modify itself, at run-time, based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection (ie, object introspection). If the information indicates that the data object belongs to a particular class of objects, then the program may call a method appropriate for the class. The program executes in a manner determined by descriptive information obtained during run-time; metaphorically reflecting upon the purpose of its computational task. See Introspection.

Reproducibility Reproducibility is achieved when repeat studies produce the same results, over and over. Reproducibility is closely related to validation, which is achieved when you draw the same conclusions, from the data, over and over again. Implicit in the concept of "reproducibility" is that the original research must somehow convey the means by which the study can be reproduced. This usually requires the careful recording of methods, algorithms, and materials. In some cases, reproducibility requires access to the data produced in the original studies. If there is no feasible way for scientists to undertake a reconstruction of the original study, or if the results obtained in the original study cannot be obtained in subsequent attempts, then the study is irreproducible. If the work is reproduced, but the results and the conclusions cannot be repeated, then the study is considered invalid. See Validation. See Verification.

Results The term "results" is often mistaken for the term "conclusions." Interchanging the two concepts is a source of confusion among data scientists. In the strictest sense, "results" consist of the full set of experimental data collected by measurements. In practice, "results" are provided as a small subset of data distilled from the raw, original data. In a typical journal article, selected data subsets are packaged as a chart or graph that emphasizes some point of interest. Hence, the term "results" may refer, erroneously, to subsets of the original data, or to visual graphics intended to summarize the original data. Conclusions are the inferences drawn from the results. Results are verified; conclusions are validated. The data that is generated from the original data should not be confused with "secondary" data. See Secondary data. See Conclusions. See Verification. See Validation.

Science Of course, there are many different definitions of science, and inquisitive students should be encouraged to find a conceptualization of science that suits their own intellectual development. For me, science is all about finding general relationships among objects. In the so-called physical sciences, the most important relationships are expressed as mathematical equations (eg, the relationship between force, mass and acceleration; the relationship between voltage, current and resistance). In the so-called natural sciences, relationships are often expressed through classifications (eg, the classification of living organisms). Scientific advancement is the discovery of new relationships or the discovery of a generalization that applies to objects hitherto confined within disparate scientific realms (eg, evolutionary theory arising from observations of organisms and geologic strata). Engineering would be the area of science wherein scientific relationships are exploited to build new technology. See Generalization.

Scipy Scipy, like numpy, is an open source extension to Python.103 It includes many useful mathematical routines commonly used by scientists, including: integration, interpolation, Fourier transforms, signal processing, linear algebra, and statistics. Examples and discussion are provided in Open Source Tools for Chapter 4.
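
As a minimal illustration (my own sketch, not one of the book's examples; it assumes scipy and numpy are installed):

import numpy as np
from scipy import integrate, interpolate

# Numerical integration: the area under sin(x) from 0 to pi (exactly 2).
area, error = integrate.quad(np.sin, 0, np.pi)
print(area)

# Linear interpolation through a handful of sampled points.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2
f = interpolate.interp1d(x, y)
print(f(1.5))  # estimate of 1.5 squared, read from the sampled curve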

Secondary data Data collected by someone else. Much of the data analysis performed today is done on secondary data.104 Most verification and validation studies depend upon access to high-quality secondary data. Because secondary data is prepared by someone else, who cannot anticipate how you will use the data, it is important to provide secondary data that is simple and introspective. See Introspection. See Primary data.

Semantics The study of meaning (Greek root, semantikos, significant meaning). In the context of data science, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (ie, making triples), combining assertions about data objects (ie, merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but most definitions for semantics are functionally equivalent to the definition offered here. Most language is unstructured and meaningless. Consider the assertion: Sam is tired. This is an adequately structured sentence with a subject, a verb, and an object. But what is the meaning of the sentence? There are a lot of people named Sam. Which Sam is being referred to in this sentence? What does it mean to say that Sam is tired? Is "tiredness" a constitutive property of Sam, or does it apply only to specific moments? If the latter, for what moment in time is the assertion, "Sam is tired," actually true? To a computer, meaning comes from assertions that have a specific, identified subject associated with some sensible piece of fully described data (metadata coupled with the data it describes). As you may suspect, virtually all data contained in databases does not fully qualify as "meaningful." See Triple. See RDF.

Serializing Serializing is a plesionym (ie, near-synonym) for marshaling: a method for taking data produced within a script or program and preserving it in an external file that can be saved when the program stops, and quickly reconstituted as needed, in the same program or in different programs. The difference, in terms of common usage, between serialization and marshaling is that serialization usually involves capturing parameters (ie, particular pieces of information), while marshaling preserves all of the specifics of a data object, including its structure, content, and code. As you might imagine, the meaning of these terms can change depending on the programming language and the intent of the serializing and marshaling methods. See Persistence. See Marshaling.
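
A rough Python analogy (mine, and deliberately simplified): the json module serializes selected parameter values in human-readable form, while the pickle module marshals a whole Python object, structure included.

import json
import pickle

settings = {"threshold": 0.05, "iterations": 1000}  # invented parameters

# Serializing: preserve selected parameter values in an external file.
with open("settings.json", "w") as f:
    json.dump(settings, f)

# Marshaling: preserve the entire Python object, in binary form.
with open("settings.pkl", "wb") as f:
    pickle.dump(settings, f)

# Either file can be reconstituted later, in this or another program.
with open("settings.json") as f:
    print(json.load(f))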

Simpson's paradox Occurs when a correlation that holds in two different data sets is reversed if the data sets are combined. For example, baseball player A may have a higher batting average than player B for each of two seasons, but when the data for the two seasons are combined, player B may have the higher 2-season average. Simpson's paradox is just one example of unexpected changes in outcome when variables are unknowingly hidden or blended.105
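
The reversal is easy to verify with invented numbers, as in this short Python sketch (the batting records are hypothetical):

# Hypothetical (hits, at-bats) pairs for two seasons.
player_a = [(4, 10), (25, 100)]   # averages .400, then .250
player_b = [(35, 90), (2, 10)]    # averages .389, then .200

for season in (0, 1):             # player A wins each season...
    a_hits, a_bats = player_a[season]
    b_hits, b_bats = player_b[season]
    print(a_hits / a_bats, ">", b_hits / b_bats)

# ...but player B wins when the two seasons are combined.
print(sum(h for h, ab in player_a) / sum(ab for h, ab in player_a))  # .264
print(sum(h for h, ab in player_b) / sum(ab for h, ab in player_b))  # .370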

Solomonoff's theory of inductive inference Solomonoff's theory of inductive inference is Occam's razor, as applied to mathematics and computer science. The shortest, most computable functions that describe or predict data are the correct functions, assuming that all competing functions describe the existing data equally well. See Occam's razor. See Parsimony.

Species Species is the bottom-most class of any classification or ontology. Because the species class contains the individual objects of the classification, it is the only class which is not abstract. The special significance of the species class is best exemplified in the classification of living organisms. Every species of organism contains individuals that share a common ancestral relationship to one another. When we look at a group of squirrels, we know that each squirrel in the group has its own unique personality, its own unique genes (ie, genotype), and its own unique set of physical features (ie, phenotype). Moreover, although the DNA sequences of individual squirrels are unique, we assume that there is a commonality to the genome of squirrels that distinguishes it from the genome of every other species. If we use the modern definition of species as an evolving gene pool, we see that the species can be thought of as a biological life form, with substance (a population of propagating genes), and a function (evolving to produce new species).106–108 Put simply, species speciate; individuals do not. As a corollary, species evolve; individuals simply propagate. Hence, the species class is a separable biological unit with form and function. We, as individuals, are focused on the lives of individual things, and we must be reminded of the role of species in biological and nonbiological classifications. The concept of species is discussed in greater detail in Section 6.4 of Chapter 6. See Blended class. See Nonatomicity.

Specification A specification is a formal method for describing objects (physical objects such as nuts and bolts or symbolic objects, such as numbers, or concepts expressed as text). In general, specifications do not require the inclusion of specific items of information (ie, they do not impose restrictions on the content that is included in or excluded from documents), and specifications do not impose any order of appearance of the data contained in the document (ie, you can mix up and rearrange specified objects, if you like). Specifications are not generally certified by a standards organization. They are typically produced by special interest organizations, and the legitimacy of a specification depends on its popularity. Examples of specifications are RDF (Resource Description Framework) produced by the W3C (World Wide Web Consortium), and TCP/IP (Transfer Control Protocol/Internet Protocol), maintained by the Internet Engineering Task Force. The most widely implemented specifications are simple and easily implemented. See Specification versus standard.

Specification versus standard Data standards, in general, tell you what must be included in a conforming document and, in most cases, dictate the format of the final document. In many instances, standards bar inclusion of any data that is not included in the standard (eg, you should not include astronomical data in a clinical x-ray report). Specifications simply provide a formal way for describing the data that you choose to include in your document. XML and RDF, a semantic dialect of XML, are examples of specifications. They both tell you how data should be represented, but neither tells you what data to include, or how your document or data set should appear. Files that comply with a standard are rigidly organized and can be easily parsed and manipulated by software specifically designed to adhere to the standard. Files that comply with a specification are typically self-describing documents that contain within themselves all the information necessary for a human or a computer to derive meaning from the file contents. In theory, files that comply with a specification can be parsed and manipulated by generalized software designed to parse the markup language of the specification (eg, XML, RDF) and to organize the data into data structures defined within the file. The relative strengths and weaknesses of standards and specifications are discussed in Section 2.6 of Chapter 2, "Specifications Good, Standards Bad." See XML. See RDF.

Standard A standard is a set of rules for doing a particular task or expressing a particular kind of information. The purpose of standards is to ensure that all objects that meet the standard have certain physical or informational features in common, thus facilitating interchange, reproducibility, and interoperability, and reducing the costs of operation. In the case of standards for data and information, standards typically dictate what data is to be included, how that data should be expressed and arranged, and what data is to be excluded. Standards are developed by any of hundreds of standards developing agencies, but there are only a few international agencies that bestow approval of standards. See ISO. See Specification. See Specification versus standard.

String A string is a sequence of characters (ie, letters, numbers, punctuation). For example, this book is a long string. The complete sequence of the human genome (3 billion characters, with each character an A, T, G, or C) is an even longer string. Every subsequence of a string is also a string. A great many clever algorithms for searching, retrieving, comparing, compressing, and otherwise analyzing strings have been published.109

Superclass Any of the ancestral classes of a subclass. For example, in the classification of living organisms, the class of vertebrates is a superclass of the class of mammals. The immediate superclass of a class is its parent class. In common parlance, when we speak of the superclass of a class, we are usually referring to its parent class. See Parent class.

System A set of objects whose interactions produce all of the observable properties, behaviors and events that we choose to observe. Basically, then, a system is whatever we decide to look at (eg, the brain, the cosmos, a cell, a habitat). The assumption is that the objects that we are not looking at (ie, the objects excluded from the system), have little or no effect on the objects within the system. Of course, this assumption will not always be true, but we do our best.

System call Refers to a command, within a running script, that calls the operating system into action, momentarily bypassing the programming interpreter for the script. A system call can do essentially anything the operating system can do via a command line.
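
For example, in Python (a sketch; the "ls -l" command is a Unix example, and Windows users would substitute an equivalent command):

import subprocess

# Hand a command line to the operating system and capture its output.
result = subprocess.run(["ls", "-l"], capture_output=True, text=True)
print(result.stdout)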

Systematics The term "systematics" is, by tradition, reserved for the field of biology that deals with taxonomy (ie, the listing of the distinct types of organisms) and with classification (ie, the classes of organisms and their relationships to one another). There is no reason why biologists should lay exclusive claim to the field of systematics. As used herein, systematics equals taxonomics plus classification, and this term applies just as strongly to stamp collecting, marketing, operations research, and object-oriented programming as it does to the field of biology.

Taxonomic order In biological taxonomy, the hierarchical lineage of organisms is divided into a descending list of named orders: Kingdom, Phylum (Division), Class, Order, Family, Genus, and Species. As we have learned more and more about the classes of organisms, modern taxonomists have added additional ranks to the classification (eg, supraphylum, subphylum, suborder, infraclass, etc.). Was this really necessary? All of this taxonomic complexity could be averted by dropping named ranks and simply referring to every class as "Class." Modern specifications for class hierarchies (eg, RDF Schema) encapsulate each class with the name of its superclass. When every object yields its class and superclass, it is possible to trace any object's class lineage. For example, in the classification of living organisms, if you know the name of the parent for each class, you can write a simple script that generates the complete ancestral lineage for every class and species within the classification.84 See Class. See Taxonomy. See RDF Schema. See Species.
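
Such a script might look like the following Python sketch (the child-to-parent table is a toy fragment; a real classification would supply thousands of entries):

# Toy child-to-parent table: each class is encapsulated with the
# name of its superclass.
parent = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Animalia",
}

def lineage(name):
    # Trace a class or species up through its ancestral classes.
    trail = [name]
    while name in parent:
        name = parent[name]
        trail.append(name)
    return trail

print(" -> ".join(lineage("Homo sapiens")))
# Homo sapiens -> Homo -> Hominidae -> Primates -> Mammalia -> Chordata -> Animalia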

Taxonomy A taxonomy is the collection of named instances (class members) in a classification or an ontology. When you see a schematic showing class relationships, with individual classes represented by geometric shapes and the relationships represented by arrows or connecting lines between the classes, then you are essentially looking at the structure of a classification, minus the taxonomy. You can think of building a taxonomy as the act of pouring all of the names of all of the instances into their proper classes. A taxonomy is similar to a nomenclature; the difference is that in a taxonomy, every named instance must have an assigned class. See Taxonomic order.

Terminology The collection of words and terms used in some particular discipline, field, or knowledge domain. Nearly synonymous with vocabulary and with nomenclature. Vocabularies, unlike terminologies, are not confined to the terms used in a particular field. Nomenclatures, unlike terminologies, usually aggregate equivalent terms under a canonical synonym.

Text editor A text editor (also called ASCII editor) is a software program designed to display simple, unformatted text files. Text editors differ from word processing software applications that produce files with formatting information embedded within the file. Text editors, unlike word processors, can open large files (in excess of 100 megabytes), very quickly. They also allow you to move around the file with ease. Examples of free and open source text editors are Emacs and vi. See ASCII.

Thesaurus A vocabulary that groups together synonymous terms. A thesaurus is very similar to a nomenclature. There are two minor differences. Nomenclatures include multi-word terms; whereas a thesaurus is typically composed of one-word terms. In addition, nomenclatures are typically restricted to a well-defined topic or knowledge domain (eg, names of stars, infectious diseases, etc.). See Nomenclature. See Vocabulary. See Classification. See Dictionary. See Terminology. See Ontology.

Time A large portion of data analysis is concerned, in one way or another, with the times that events occur, the times that observations are made, or the times that signals are sampled. Here are three examples demonstrating why this is so: (1) most scientific and predictive assertions relate how variables change with respect to one another, over time; (2) a single data object may have many different data values, over time, and only timing data will tell us how to distinguish one observation from another; and (3) computer transactions are tracked in logs, and logs are composed of time-annotated descriptions of the transactions. Data objects often lose their significance if they are not associated with an accurate time measurement. Because accurate time data is easily captured by modern computers, there is no reason why data elements should not be annotated with the time at which they are made. See Timestamp. See Trusted timestamp.

Timestamp Many data objects are temporal events and all temporal events must be given a timestamp indicating the time that the event occurred, using a standard measurement for time. The timestamp must be accurate, persistent, and immutable. The Unix epoch time (equivalent to the Posix epoch time) is available for most operating systems and consists of the number of seconds that have elapsed since January 1, 1970, midnight, Greenwich Mean Time. The Unix epoch time can easily be converted into any other standard representation of time. The duration of any event can be easily calculated by subtracting the beginning time from the ending time. Because the timing of events can be maliciously altered, scrupulous data managers may choose to employ a trusted timestamp protocol by which a timestamp can be verified. See Trusted timestamp.
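
In Python, for instance, the Unix epoch time is returned by time.time(), and a duration is just the difference of two such values:

import time

start = time.time()   # seconds elapsed since January 1, 1970, midnight GMT
time.sleep(1)         # the "event" being timed
stop = time.time()

print(stop - start)                       # duration, by simple subtraction
print(time.asctime(time.gmtime(stop)))    # converted to a readable form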

Transform A transform is a mathematical operation that takes a function or a time series (eg, values obtained at intervals of time) and transforms it into something else. An inverse transform takes the transform function and produces the original function. Transforms are useful when there are operations that can be more easily performed on the transformed function than on the original function.
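
As a small illustration (mine, not drawn from the text), the discrete Fourier transform provided by numpy: the inverse transform recovers the original series, so nothing is lost by working in the transformed domain.

import numpy as np

# A short time series, sampled at regular intervals.
series = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0])

transformed = np.fft.fft(series)       # the forward transform
recovered = np.fft.ifft(transformed)   # the inverse transform

print(np.allclose(series, recovered.real))  # True: the original is recovered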

Triple In computer semantics, a triple is an identified data object associated with a data element and the description of the data element. In the computer science literature, the syntax for the triple is commonly described as: subject, predicate, object, wherein the subject is an identifier, the predicate is the description of the object, and the object is the data. The definition of triple, using grammatical terms, can be off-putting to the data scientist, who may think in terms of spreadsheet entries: a key that identifies the line record, a column header containing the metadata description of the data, and a cell that contains the data. In this book, the three components of a triple are described as: (1) the identifier for the data object, (2) the metadata that describes the data, and (3) the data itself. In theory, all data sets, databases, and spreadsheets can be constructed or deconstructed as collections of triples. See Introspection. See Data object. See Semantics. See RDF. See Meaning.
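
A minimal sketch in Python (the identifier and the data values are invented): each triple binds a data object's identifier to a metadata descriptor and the data it describes.

# Three triples describing one data object:
# (identifier, metadata, data)
triples = [
    ("object_3f2a", "name", "Conrad Nervig"),
    ("object_3f2a", "weight_in_kg", 74),
    ("object_3f2a", "date_of_birth", "1889-05-12"),
]

# The equivalent of a spreadsheet row, reconstructed from the triples.
for identifier, metadata, data in triples:
    print(identifier, metadata, data)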

Trusted timestamp It is sometimes necessary to establish, beyond doubt, that a timestamp is accurate and has not been modified. Through the centuries, a great many protocols have been devised to prove that a timestamp is trustworthy. One of the simplest methods, employed in the late twentieth century, involved creating a digest of a document (eg, a concatenated sequence consisting of the first letter of each line in the document) and sending the sequence to a newspaper for publication in the "Classifieds" section. After publication of the newspaper, anyone in possession of the original document could extract the same sequence from the document, thus proving that the document had existed on the date that the sequence appeared in the newspaper's classified advertisements. Near the end of the twentieth century, one-way hash values became the sequences of choice for trusted timestamp protocols. Today, newspapers are seldom employed to establish trust in timestamps. More commonly, a message digest of a confidential document is sent to a timestamp authority that adds a date to the digest and returns a message, encrypted with the timestamp authority's private key, containing the original one-way hash plus the trusted date. The received message can be decrypted with the timestamp authority's public key, to reveal the date/time and the message digest that is unique for the original document. It seems like the modern trusted timestamp protocol is a lot of work, but those who use these services can quickly and automatically process huge batches of documents. See Message digest.
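
The message digest step is simple to demonstrate in Python (the document below is, of course, hypothetical):

import hashlib

document = b"Confidential draft, final version."  # hypothetical document

# The one-way hash (message digest) is what gets sent to the timestamp
# authority; the document itself never leaves your hands.
digest = hashlib.sha256(document).hexdigest()
print(digest)

# Anyone holding the same document can later recompute the digest and
# compare it with the digest that the authority timestamped.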

UUID UUID (Universally Unique IDentifier) is a protocol for assigning unique identifiers to data objects, without using a central registry. UUIDs were originally used in the Apollo Network Computing System.110 Most modern programming languages have modules for generating UUIDs. See Identifier.
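
In Python, for example, the standard uuid module generates an identifier without consulting any central registry:

import uuid

identifier = uuid.uuid4()  # a fresh identifier built from 128 random bits
print(identifier)          # a 36-character string, unique for all practical purposes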

Unclassifiable objects Classifications create a class for every object and taxonomies assign each and every object to its correct class. This means that a classification is not permitted to contain unclassified objects; a condition that puts fussy taxonomists in an untenable position. Suppose you have an object, and you simply do not know enough about the object to confidently assign it to a class. Or, suppose you have an object that seems to fit more than one class, and you can't decide which class is the correct class. What do you do? Historically, scientists have resorted to creating a "miscellaneous" class into which otherwise unclassifiable objects are given a temporary home, until more suitable accommodations can be provided. I have spoken with numerous data managers, and everyone seems to be of a mind that "miscellaneous" classes, created as a stopgap measure, serve a useful purpose. Not so. Historically, the promiscuous application of "miscellaneous" classes has proven to be a huge impediment to the advancement of science. In the case of the classification of living organisms, the class of protozoans stands as a case in point. Ernst Haeckel, a leading biological taxonomist in his time, created the Kingdom Protista (ie, protozoans), in 1866, to accommodate a wide variety of simple organisms with superficial commonalities. Haeckel himself understood that the protists were a blended class that included unrelated organisms, but he believed that further study would resolve the confusion. In a sense, he was right, but the process took much longer than he had anticipated; occupying generations of taxonomists over the following 150 years. Today, Kingdom Protista no longer exists. Its members have been reassigned to classes among the animals, plants, and fungi. Nonetheless, textbooks of microbiology still describe the protozoans, just as though this name continued to occupy a legitimate place among terrestrial organisms. In the meantime, therapeutic opportunities for eradicating so-called protozoal infections, using class-targeted agents, have no doubt been missed.111 You might think that the creation of a class of living organisms, with no established scientific relation to the real world, was a rare and ancient event in the annals of biology, having little or no chance of being repeated. Not so. A special pseudo-class of fungi, deuteromycetes (spelled with a lowercase "d," signifying its questionable validity as a true biologic class), has been created to hold fungi of indeterminate speciation. At present, there are several thousand such fungi, sitting in a taxonomic limbo, waiting to be placed into a definitive taxonomic class.111,112 See Blended class.

Universal and timeless Wherein a set of data or methods can be understood and utilized by anyone, from any discipline, at any time. It's a tall order, but a worthy goal. Much of the data collected over the centuries of recorded history is of little value because it was never adequately described when it was specified (eg, unknown time of recording, unknown source, unfamiliar measurements, unwritten protocols). Efforts to resuscitate large collections of painstakingly collected data are often abandoned simply because there is no way of verifying, or even understanding, the original data.87 Data scientists who want their data to serve for posterity should use simple specifications, and should include general document annotations such as the Dublin Core. The importance of creating permanent data is discussed in Section 8.5 of Chapter 8, Data Permanence and Data Immutability. See Dublin Core metadata.

Universally Unique IDentifier See UUID.

Validation Validation is the process that checks whether the conclusions drawn from data analysis are correct.97 Validation usually starts with repeating the same analysis of the same data, using the methods that were originally recommended. Obviously, if a different set of conclusions is drawn from the same data and methods, the original conclusions cannot be validated. Validation may involve applying a different set of analytic methods to the same data, to determine if the conclusions are consistent. It is always reassuring to know that conclusions are repeatable, with different analytic methods. In prior eras, experiments were validated by repeating the entire experiment, thus producing a new set of observations for analysis. Many of today's scientific experiments are far too complex and costly to repeat. In such cases, validation requires access to the complete collection of the original data, and to the detailed protocols under which the data was generated. One of the most useful methods of data validation involves testing new hypotheses, based on the assumed validity of the original conclusions. For example, if you were to accept Darwin's analysis of barnacle data leading to his theory of evolution, then you would expect to find a chronologic history of fossils in ascending layers of shale. This was the case; thus, paleontologists studying the Burgess shale provided some validation to Darwin's conclusions. Validation should not be mistaken for proof. Nonetheless, the repeatability of conclusions, over time, with the same or different sets of data, and the demonstration of consistency with related observations, is about all that we can hope for in this imperfect world. See Verification. See Reproducibility.

Verification The process by which data is checked to determine whether the data was obtained properly (ie, according to approved protocols), and that the data accurately measured what it was intended to measure. Verification often requires a level of expertise that is at least as high as the expertise of the individuals who produced the data.97 Data verification requires a full understanding of the many steps involved in data collection and can be a time-consuming task. In one celebrated case, in which two statisticians reviewed a microarray study performed at Duke University, the number of hours devoted to their verification effort was reported to be 2000 hours.48 To put this statement in perspective, the official work-year, according to the U.S. Office of Personnel Management, is 2087 hours. Verification is different from validation. Verification is performed on data; validation is done on the results of data analysis. See Validation. See Microarray. See Introspection.

Vocabulary A comprehensive collection of words and their associated meanings. In some quarters, "vocabulary" and "nomenclature" are used interchangeably, but they are different from one another. Nomenclatures typically focus on terms confined to one knowledge domain. Nomenclatures typically do not contain definitions for the contained terms. Nomenclatures typically group terms by synonymy. Lastly, nomenclatures include multi-word terms. Vocabularies are collections of single words, culled from multiple knowledge domains, with their definitions, and assembled in alphabetic order. See Nomenclature. See Thesaurus. See Taxonomy. See Dictionary. See Terminology.

XML Acronym for eXtensible Markup Language, a syntax for marking data values with descriptors (ie, metadata). The descriptors are commonly known as tags. In XML, every data value is enclosed by a start-tag, containing the descriptor and indicating that a value will follow, and an end-tag, containing the same descriptor and indicating that a value preceded the tag. For example: <name> Conrad Nervig </name>. The enclosing angle brackets, "<>", and the end-tag marker, "/", are hallmarks of HTML and XML markup. This simple but powerful relationship between metadata and data allows us to employ metadata/data pairs as though each were a miniature database. The semantic value of XML becomes apparent when we bind a metadata/data pair to a unique object, forming a so-called triple. See Triple. See Meaning. See Semantics. See HTML.
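
A small Python sketch (using the standard library and an invented two-record document) shows how easily the metadata/data pairs can be recovered:

import xml.etree.ElementTree as ET

# A tiny, invented XML document; every value is enclosed by tags.
doc = "<roster><name>Conrad Nervig</name><name>Ada Lovelace</name></roster>"

root = ET.fromstring(doc)
for element in root:
    print(element.tag, ":", element.text)  # metadata : data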

References

1 Pearson K. The grammar of science. London: Adam and Black; 1900.

2 Berman J.J. Neoplasms: principles of development and diversity. Sudbury: Jones & Bartlett; 2009.

3 Patil N., Berno A.J., Hinds D.A., Barrett W.A., Doshi J.M., Hacker C.R., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001;294:1719–1723.

4 Ramos O.M., Barker D., Ferrier D.E.K. Ghost loci imply hox and parahox existence in the last common ancestor of animals. Curr Biol. 2012;22:1951–1956.

5 Bromham L. What can DNA tell us about the Cambrian explosion? Integr Comb Biol. 2003;43:148–156.

6 Dawkins R. The greatest show on Earth: the evidence for evolution paperback. New York: Free Press; 2009.

7 Feltman R. The mysterious genes of carnivorous bladderwort reveal themselves. The Washington Post, February 23, 2015.

8 Ball P. Smallest genome clocks in at 182 genes. Nature. 2006.

9 Brinkmann H., Philippe H. The diversity of eukaryotes and the root of the eukaryotic tree. Adv Exp Med Biol. 2007;607:20–37.

10 Unreliable research: trouble at the lab. The Economist, October 19, 2013.

11 Prasad V., Vandross A., Toomey C., Cheung M., Rho J., Quinn S., et al. A decade of reversal: an analysis of 146 contradicted medical practices. Mayo Clin Proc. 2013;88:790–798.

12 Kolata G. Cancer fight: unclear tests for new drug. The New York Times, April 19, 2010.

13 Baker M. Reproducibility crisis: blame it on the antibodies. Nature. 2015;521:274–276.

14 Ioannidis J.P. Why most published research findings are false. PLoS Med. 2005;2.

15 Labos C. It ain't necessarily so: why much of the medical Literature is wrong. Medscape News and Perspectives, September 09, 2014.

16 Gilbert E, Strohminger N. We found only one-third of published psychology research is reliable — now what? The Conversation, August 27, 2015. Available at: http://theconversation.com/we-found-only-one-third-of-published-psychology-research-is-reliable-now-what-46596 [accessed 27.08.15].

17 Naik G. Scientists' elusive goal: reproducing study results. Wall Street J, December 2, 2011.

18 Zimmer C. A sharp rise in retractions prompts calls for reform. The New York Times, April 16, 2012.

19 Altman LK. Falsified data found in gene studies. The New York Times, October 30, 1996.

20 Weaver D., Albanese C., Costantini F., Baltimore D. Retraction: altered repertoire of endogenous immunoglobulin gene expression in transgenic mice containing a rearranged mu heavy chain gene. Cell. 1991;65:536.

21 Chang K. Nobel winner in physiology retracts two papers. The New York Times, September 23, 2010.

22 Fourth paper retracted at Potti's request. The Chronicle, March 3, 2011.

23 Whoriskey P. Doubts about Johns Hopkins research have gone unanswered, scientist says. The Washington Post, March 11, 2013.

24 Lin Y.Y., Kiihl S., Suhail Y., Liu S.Y., Chou Y.H., Kuang Z., et al. Retraction: functional dissection of lysine deacetylases reveals that HDAC1 and p300 regulate AMPK. Nature. 2013;482:251–255.

25 Shafer SL. Letter: To our readers. Anesthesia and Analgesia, February 20, 2009.

26 Innovation or stagnation: challenge and opportunity on the critical path to new medical products. U.S. Department of Health and Human Services, Food and Drug Administration; 2004.

27 Hurley D. Why are so few blockbuster drugs invented today? The New York Times, November 13, 2014.

28 Angell M. The truth about the drug companies. The New York Review of Books, vol. 51, July 15, 2004.

29 Crossing the quality chasm: a new health system for the 21st century. Quality of Health Care in America Committee, editors. Institute of Medicine, Washington, DC; 2001.

30 Wurtman R.J., Bettiker R.L. The slowing of treatment discovery, 1965–1995. Nat Med. 1996;2:5–6.

31 Ioannidis J.P. Microarrays and molecular research: noise discovery? Lancet. 2005;365:454–455.

32 Weigelt B., Reis-Filho J.S. Molecular profiling currently offers no more than tumour morphology and basic immunohistochemistry. Breast Cancer Res. 2010;12:S5.

33 Personalised medicines: hopes and realities. The Royal Society, London; 2005. Available from: https://royalsociety.org/~/media/Royal_Society_Content/policy/publications/2005/9631.pdf [accessed 01.01.15].

34 Beizer B. Software testing techniques. 2nd ed. Hoboken, NJ: Van Nostrand Reinhold; 1990.

35 Vlasic B. Toyota's slow awakening to a deadly problem. The New York Times, February 1, 2010.

36 Lanier J. The complexity ceiling. In: Brockman J., ed. The next fifty years: science in the first half of the twenty-first century. New York: Vintage; 2002:216–229.

37 Bandelt H., Salas A. Contamination and sample mix-up can best explain some patterns of mtDNA instabilities in buccal cells and oral squamous cell carcinoma. BMC Cancer. 2009;9:113.

38 Knight J. Agony for researchers as mix-up forces retraction of ecstasy study. Nature. 2003;425:109.

39 Gerlinger M., Rowan A.J., Horswell S., Larkin J., Endesfelder D., Gronroos E., et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med. 2012;366:883–892.

40 Berman J.J. Biomedical Informatics. Sudbury, MA: Jones and Bartlett; 2007.

41 Ioannidis J.P. Is molecular profiling ready for use in clinical decision making? Oncologist. 2007;12:301–311.

42 Ioannidis J.P. Some main problems eroding the credibility and relevance of randomized trials. Bull NYU Hosp Jt Dis. 2008;66:135–139.

43 Ioannidis J.P., Panagiotou O.A. Comparison of effect sizes associated with biomarkers reported in highly cited individual articles and in subsequent meta-analyses. JAMA. 2011;305:2200–2210.

44 Ioannidis J.P. Excess significance bias in the literature on brain volume abnormalities. Arch Gen Psychiatry. 2011;68:773–780.

45 Pocock S.J., Collier T.J., Dandreo K.J., deStavola B.L., Goldman M.B., Kalish L.A., et al. Issues in the reporting of epidemiological studies: a survey of recent practice. BMJ. 2004;329:883.

46 Harris G. Diabetes drug maker hid test data, files indicate. The New York Times, July 12, 2010.

47 Berman JJ. Machiavelli's Laboratory. Amazon Digital Services, Inc.; 2010.

48 Misconduct in science: an array of errors. The Economist. September 10, 2011.

49 Begley S. In cancer science, many 'discoveries' don't hold up. Reuters, March 28, 2012.

50 Abu-Asab M.S., Chaouchi M., Alesci S., Galli S., Laassri M., Cheema A.K., et al. Biomarkers in the age of omics: time for a systems biology approach. OMICS. 2011;15:105–112.

51 Moyer V.A. Screening for prostate cancer: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2012;157(2):120–134.

52 How science goes wrong. The Economist, October 19, 2013.

53 Martin B. Scientific fraud and the power structure of science. Prometheus. 1992;10:83–98.

54 Rosen J.M., Jordan C.T. The increasing complexity of the cancer stem cell paradigm. Science. 2009;324:1670–1673.

55 Madar S., Goldstein I., Rotter V. Did experimental biology die? Lessons from 30 years of p53 research. Cancer Res. 2009;69:6378–6380.

56 Zilfou J.T., Lowe S.W. Tumor Suppressive Functions of p53. Cold Spring Harb Perspect Biol. 2009;00:a001883.

57 Lohr S. Google to end health records service after it fails to attract users. The New York Times, June 24, 2011.

58 Schwartz E. Shopping for health software, some doctors get buyer's remorse. The Huffington Post Investigative Fund, January 29, 2010.

59 Heeks R, Mundy D, Salazar A. Why health care information systems succeed or fail. Institute for Development Policy and Management, University of Manchester, June 1999. Available from: http://www.sed.manchester.ac.uk/idpm/research/publications/wp/igovernment/igov_wp09.htm [accessed 12.07.12].

60 Littlejohns P., Wyatt J.C., Garvican L. Evaluating computerised health information systems: hard lessons still to be learnt. Br Med J. 2003;326:860–863.

61 Linder J.A., Ma J., Bates D.W., Middleton B., Stafford R.S. Electronic health record use and the quality of ambulatory care in the United States. Arch Intern Med. 2007;167:1400–1405.

62 Patient Safety in American Hospitals. HealthGrades, July 2004. Available from: http://www.healthgrades.com/media/english/pdf/hg_patient_safety_study_final.pdf [accessed 09.09.12].

63 Gill J.M., Mainous A.G., Koopman R.J., Player M.S., Everett C.J., Chen Y.X., et al. Impact of EHR-based clinical decision support on adherence to guidelines for patients on NSAIDs: a randomized controlled trial. Ann Fam Med. 2011;9:22–30.

64 Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Burlington, MA: Morgan Kaufmann; 2013.

65 Kappelman L.A., McKeeman R., Lixuan Zhang L. Early warning signs of IT project failure: the dominant dozen. Inf Syst Manag. 2006;23:31–36.

66 Arquilla J. The Pentagon's biggest boondoggles. The New York Times (Opinion Pages), March 12, 2011.

67 Van Pelt M. IT governance in federal project management. George Mason University, Fairfax, Virginia, December 8, 2009.

68 Brooks F.P. No silver bullet: essence and accidents of software engineering. Computer. 1987;20:10–19.

69 Basili V.R., Perricone B.T. Software errors and complexity: an empirical investigation. Commun ACM. 1984;27:556–563.

70 The ComputerWorld honors program case study. Available from: http://www.cwhonors.org/case_studies/NationalCancerInstitute.pdf [accessed 31.08.12].

71 An assessment of the impact of the NCI cancer Biomedical Informatics Grid (caBIG). Report of the Board of Scientific Advisors Ad Hoc Working Group, National Cancer Institute, March, 2011.

72 Komatsoulis GA. Program announcement to the CaBIG community. National Cancer Institute. https://cabig.nci.nih.gov/program_announcement [accessed 31.08.12].

73 Smith B. caBIG has another fundamental problem: it relies on "incoherent" messaging standard. Cancer Lett. 2011;37(16).

74 Lohr S. Lessons from Britain's health information technology fiasco. The New York Times, September 27, 2011.

75 Dismantling the NHS national programme for IT. Department of Health Media Centre Press Release, September 22, 2011. Available from: http://mediacentre.dh.gov.uk/2011/09/22/dismantling-the-nhs-national-programme-for-it/ [accessed 12.06.12].

76 Whittaker Z. UK's delayed national health IT programme officially scrapped. ZDNet, September 22, 2011.

77 Robinson D, Frosdick P, Briscoe E. HL7 Version 3: an impact assessment. NHS Information Authority, March 23, 2001.

78 Leveson N.G. A new approach to system safety engineering. Self-published ebook; 2002.

79 Leveson N. Medical devices: the Therac-25. In: Leveson N., ed. Appendix A in ‘Safeware: system safety and computers’. Reading: Addison-Wesley; 1995.

80 Leveson NG. Engineering a safer world. System safety for the 21st century. Self-published book, 2009. Available from: http://sunnyday.mit.edu/book2.pdf [accessed 12.10.09].

81 Asimov I. Isaac Asimov Mulls "How do people get new ideas?" MIT Technology Review, October 20, 2014.

82 Lewis P.D. R for medicine and biology. Sudbury: Jones and Bartlett Publishers; 2009.

83 Berman J.J. Perl programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2007.

84 Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

85 Berman J.J. Ruby programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2008.

86 Janert P.K. Gnuplot in action: understanding data with graphs. Manning; 2009.

87 Berman J.J. Repurposing legacy data: innovative case studies. Elsevier/Morgan Kaufmann imprint; 2015.

88 Thomas D. Programming Ruby 1.9 & 2.0: the pragmatic programmers' guide. 4th ed. Frisco, TX: Pragmatic Bookshelf; 2013.

89 Stephenson N. In the beginning…was the command line. New York: William Morrow Paperbacks; 1999.

90 Robbins A., Beebe N.H.F. Classic shell scripting: hidden commands that unlock the power of Unix. O'Reilly Media; 2005.

91 Newham C. Learning the bash shell: unix shell programming. O'Reilly Media; 2005.

92 Cipra B.A. The best of the 20th century: editors name top 10 algorithms. SIAM News. 2000;33(4).

93 Wu X., Kumar V., Quinlan J.R., Ghosh J., Yang Q., Motoda H., et al. Top 10 algorithms in data mining. Knowl Inf Syst. 2008;14:1–37.

94 Al-Agha O.M., Igbokwe A.A. Malignant fibrous histiocytoma: between the past and the present. Arch Pathol Lab Med. 2008;132:1030–1035.

95 Nakayama R., Nemoto T., Takahashi H., Ohta T., Kawai A., Seki K., et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Mod Pathol. 2007;20:749–759.

96 Nowack E.C., Melkonian M., Glockner G. Chromatophore genome sequence of paulinella sheds light on acquisition of photosynthesis by eukaryotes. Curr Biol. 2008;18:410–418.

97 Committee on mathematical foundations of verification, validation, and uncertainty quantification; Board on mathematical sciences and their applications, Division on engineering and physical sciences, National research council. Assessing the reliability of complex models: mathematical and statistical foundations of verification, validation, and uncertainty quantification. National Academy Press; 2012. Available from: http://www.nap.edu/catalog.php?record_id=13395 [accessed 01.01.15].

98 Salmon F. Recipe for disaster: the formula that killed Wall Street. Wired Magazine 17:03, February 23, 2009.

99 Paskin N. Identifier interoperability: a report on two recent ISO activities. D-Lib Mag. 2006;12:1–23.

100 Berman JJ. Tumor classification: molecular analysis meets Aristotle. BMC Cancer 4:10, 2004. Available from: http://www.biomedcentral.com/1471-2407/4/10 [accessed 01.01.15].

101 Berman J.J. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer. 2004;4:88 http://www.biomedcentral.com/1471-2407/4/88 [accessed 01.01.15].

102 Berman JJ, Moore GW. Implementing an RDF Schema for pathology images 2007. Available from: http://www.julesberman.info/spec2img.htm [accessed 01.01.15].

103 SciPy reference guide, Release 0.7. Written by the SciPy community, December 07, 2008.

104 Smith A.K., Ayanian J.Z., Covinsky K.E., Landon B.E., McCarthy E.P., Wee C.C., et al. Conducting high-value secondary dataset analysis: an introductory guide and resources. J Gen Intern Med. 2011;26:920–929.

105 Tu Y., Gunnell D., Gilthorpe M.S. Simpson's paradox, lord's paradox, and suppression effects are the same phenomenon — the reversal paradox. Emerg Themes Epidemiol. 2008;5:2.

106 DeQueiroz K. Ernst Mayr and the modern concept of species. Proc Natl Acad Sci U S A. 2005;102(Suppl. 1):6600–6607.

107 DeQueiroz K. Species concepts and species delimitation. Syst Biol. 2007;56:879–886.

108 Mayden R.L. Consilience and a hierarchy of species concepts: advances toward closure on the species puzzle. J Nematol. 1999;31(2):95–116.

109 Gusfield D. Algorithms on strings, trees and sequences. Cambridge University Press; 1997.

110 Leach P, Mealling M, Salz R. A universally unique identifier (UUID) URN namespace. Network Working Group, Request for Comment 4122, Standards Track. Available from: http://www.ietf.org/rfc/rfc4122.txt [accessed 01.01.15].

111 Berman J.J. Taxonomic guide to infectious diseases: understanding the biologic classes of pathogenic organisms. Waltham: Academic Press; 2012.

112 Guarro J., Gene J., Stchigel A.M. Developments in fungal taxonomy. Clin Microbiol Rev. 1999;12:454–500.
