Preface

Abstract

The purpose of the preface is to elevate data simplification to the level of a discipline within the general field of data science. Readers will learn that data simplification is at least as important as data analysis. Inadequate analyses can be reviewed and improved if the data is well-annotated and the data records are simple. Analytic tools are of little help if the data is complex and inscrutable. The preface will provide a very short introduction and take-home summary for the 8 book chapters, explaining how the methods described in each chapter fulfill a necessary component of data simplification. The following points will be developed: (1) Large and complex data cannot be explored unless data is simplified; (2) Data simplification is not simple; there are principles, methods, and tools that must be studied and mastered; (3) Data simplification tools become data discovery tools, in the hands of creative individuals; and (4) Learning the methods and tools of data simplification is a great career move.

Keywords

Book organization, Readership, Justification for data simplification, Complex data, Repurposing, Limits of technology, Comprehensible records, Validation

Order and simplification are the first steps toward the mastery of a subject.

Thomas Mann

Complex data is difficult to understand and analyze. Large data projects, using complex sets of data, are likely to fail; furthermore, the more money spent on a data project, the greater the likelihood of failure.19 What is true for data projects is also true in the experimental sciences; large and complex projects are often unrepeatable.10,28 Basically, complexity is something that humans have never mastered. As a species, we work much better when things are made simple.

Intelligent data scientists soon learn that it is nearly impossible to conduct competent and reproducible analyses of highly complex data. Inevitably, something always goes wrong. This book was written to provide a set of general principles and methods for data simplification. Here are the points that establish the conceptual framework of Data Simplification: Taming Information with Open Source Tools:

(1) The first step in data analysis is data simplification. Most modern data scientists can expect that the majority of their time will be spent collecting, organizing, annotating, and simplifying their data, preparatory to analysis. A relatively small fraction of their time will be spent directly analyzing the data. When data has been simplified, successful analysis can proceed using standard computational methods, on standard computers.

(2) Results obtained from unsimplified projects are nearly always irreproducible. Hence, the results of analyses on complex data cannot be validated. Conclusions that cannot be validated have no scientific value.

(3) Data simplification is not simple. There is something self-defeating about the term, "data simplification." The term seems to imply a dumbing down process wherein naturally complex concepts are presented in a manner that is palatable to marginally competent scientists. Nothing could be further from the truth. Creating overly complex data has always been the default option for lazy-minded or cavalier scientists who lacked the will or the talent to produce a simple, well-organized, and well-annotated collection of data. The act of data simplification will always be one of the most challenging tasks facing data scientists, often requiring talents drawn from multiple disciplines. The sad truth is that there are very few data professionals who are competent to perform data simplification; and fewer still educators who can adequately teach the subject.

(4) No single software application will solve your data simplification needs.29 Applications that claim to do everything for the user are, in most instances, applications that require the user to do everything for the application. The most useful software solutions often come in the form of open source utilities, designed to perform one method very well and very fast. In this book, dozens of freely available utilities are demonstrated.

(5) Data simplification tools are data discovery tools, in the hands of creative individuals. The act of data simplification always gives the scientist a better understanding of the meaning of the data. Data that has been organized and annotated competently provides us with new questions, new hypotheses, and new approaches to problem solving. Data that is complex only provides headaches.

(6) Data simplification is a prerequisite for data preservation. Data that has not been simplified has no useful shelf life. After a data project has ended, nobody will be able to understand what was done. This means no future projects will build upon the original data, or find new purposes for the data. Moreover, conclusions drawn from the original data will never be verified or validated. This means that when you do not simplify your data, your colleagues will not accept your conclusions. Those who understand the principles and practice of data simplification will produce credible data that can be validated and repurposed (see Glossary items, Validation, Data repurposing, Data Quality Act).

(7) Data simplification saves money. Data simplification often involves developing general solutions that apply to classes of data. By eliminating the cost of using made-to-order proprietary software, data scientists can increase their productivity and reduce their expenses.

(8) Learning the methods and tools of data simplification is a great career move. Data simplification is the next big thing in the data sciences. The most thoughtful employers understand that it's not always about keeping it simple. More often, it's about making it simple (see Glossary item, KISS).

(9) Data scientists should have familiarity with more than one programming language. Although one high-level language has much the same functionality as another, each language may have particular advantages in different situations. For example, a programmer may prefer Perl when her tasks involve text parsing and string pattern matches. Another programmer might prefer Python if she requires a variety of numeric or analytic functions and a smooth interface to a graphing tool. Programmers who work with classes of data objects, or who need to model new classifications, might prefer the elegant syntax and rich class libraries available in Ruby (see Glossary item, Syntax). Books that draw on a single programming language run the risk of limiting the problem-solving options of their readers. Although there are many high-quality programming languages, I have chosen Perl, Python, and Ruby as the demonstration languages for this book. Each of these popular languages is free, open source, and can be installed easily and quickly on virtually any operating system. By offering solutions in several different programming languages, this book may serve as a sort of Rosetta stone for data scientists who must work with data structures produced in different programming environments.

Organization of this book

Chapter 1, The Simple Life, explores the thesis that complexity is the rate-limiting factor in human development. The greatest advances in human civilization and the most dramatic evolutionary improvements in all living organisms have followed the acquisition of methods that reduce or eliminate complexity.

Chapter 2, Structuring Text, reminds us that most of the data on the Web today is unstructured text, produced by individuals, trying their best to communicate with one another. Data simplification often begins with textual data. This chapter provides readers with tools and strategies for imposing some basic structure on free-text.

Chapter 3, Indexing Text, describes the often undervalued benefits of indexes. An index, aided by proper annotation of data, permits us to understand data in ways that were not anticipated when the original content was collected. With the use of computers, multiple indexes designed for different purposes can be created for a single document or data set. As data accrues, indexes can be updated. When data sets are combined, their respective indexes can be merged. A good way of thinking about indexes is that the document contains all of the complexity; the index contains all of the simplicity. Data scientists who understand how to create and use indexes will be in the best position to search, retrieve, and analyze textual data. Methods are provided for automatically creating customized indexes designed for specific analytic pursuits and for binding index terms to standard nomenclatures.
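
As a foretaste of the chapter, here is a minimal Python sketch of index building, listing each word in a document alongside the lines on which it occurs. It is not one of the chapter's scripts, and the input file, corpus.txt, is hypothetical:

#!/usr/local/bin/python
# A minimal index-building sketch; "corpus.txt" is a hypothetical
# input file. The output lists each word with its line numbers.
import re
from collections import defaultdict
index = defaultdict(list)
with open("corpus.txt") as infile:
    for line_number, line in enumerate(infile, start=1):
        for word in sorted(set(re.findall(r"[a-z]+", line.lower()))):
            index[word].append(line_number)
for word in sorted(index):
    print(word, index[word])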

Chapter 4, Understanding Your Data, describes how data can be quickly assessed, prior to formal quantitative analysis, to develop some insight into what the data means. A few simple visualization tricks and simple statistical descriptors can greatly enhance a data scientist's understanding of complex and large data sets. Various types of data objects, such as text files, images, and time-series data, can be profiled with a summary signature that captures the key features that contribute to the behavior and content of the data object. Such profiles can be used to find relationships among different data objects, or to determine when data objects are not closely related to one another.
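
By way of illustration, the following minimal Python sketch computes a small summary signature for a set of measurements; the data values are invented for the occasion:

#!/usr/local/bin/python
# A summary signature for a small set of measurements; the data
# values here are invented for illustration.
import statistics
data = [4.1, 4.5, 3.9, 5.2, 4.8, 4.4, 4.0, 5.1]
signature = {"count": len(data),
             "minimum": min(data),
             "maximum": max(data),
             "mean": round(statistics.mean(data), 3),
             "stdev": round(statistics.stdev(data), 3)}
print(signature)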

Chapter 5, Identifying and Deidentifying Data, tackles one of the most under-appreciated and least understood issues in data science. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects, and that links together all of the information that has been or will be associated with the identified data object. The method of identification and the selection of objects and classes to be identified relates fundamentally to the organizational model of complex data. If the simplifying step of data identification is ignored or implemented improperly, data cannot be shared, and conclusions drawn from the data cannot be believed. All well-designed information systems are, at their heart, identification systems: ways of naming data objects so that they can be retrieved. Only well-identified data can be usefully deidentified. This chapter discusses methods for identifying data and deidentifying data.
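
As a simple illustration of the deidentification side of this topic, the following Python sketch replaces a name with a reproducible pseudonymous identifier, using a one-way hash. The record name and the secret phrase are hypothetical, and the sketch is not a complete deidentification protocol:

#!/usr/local/bin/python
# Pseudonymization with a one-way hash; the record name and the
# secret phrase are hypothetical.
import hashlib
name = "Claude Funston"
secret = "my_secret_phrase"
pseudonym = hashlib.sha256((secret + name).encode()).hexdigest()
print(pseudonym)   # the same input always yields the same identifier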

Chapter 6, Giving Meaning to Data, explores the meaning of meaning, as it applies to computer science. We shall learn that data, by itself, has no meaning. It is the job of the data scientist to assign meaning to data, and this is done with data objects, triples, and classifications (see Glossary items, Data object, Triple, Classification, Ontology). Unfortunately, coursework in the information sciences often omits discussion of the critical issue of "data meaning"; advancing from data collection to data analysis without stopping to design data objects whose relationships to other data objects are defined and discoverable. In this chapter, readers will learn how to prepare and classify meaningful data.

Chapter 7, Object-Oriented Data, shows how we can understand data, using a few elegant computational principles. Modern programming languages, particularly object-oriented programming languages, use introspective data (ie, the data with which data objects describe themselves) to modify the execution of a program at run-time; an elegant process known as reflection. Using introspection and reflection, programs can integrate data objects with related data objects. The implementations of introspection, reflection, and integration are among the most important achievements in the field of computer science.
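
A few lines of Python suffice to convey the flavor of introspection and reflection; the classes in this sketch are invented for the occasion:

#!/usr/local/bin/python
# Object introspection and reflection; the classes are invented.
class Animal:
    def __init__(self, name):
        self.name = name

class Rabbit(Animal):
    def speak(self):
        return self.name + " has nothing to say"

peter = Rabbit("Peter")
print(type(peter).__name__)        # Rabbit: the object names its class
print(isinstance(peter, Animal))   # True: class membership, by inheritance
print(peter.__dict__)              # {'name': 'Peter'}: encapsulated data
if hasattr(peter, "speak"):        # reflection: run-time behavior depends
    print(getattr(peter, "speak")())   # on what the object reports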

Chapter 8, Problem Simplification, demonstrates that it is just as important to simplify problems as it is to simplify data. This final chapter provides simple but powerful methods for analyzing data, without resorting to advanced mathematical techniques. The use of random number generators to simulate the behavior of systems, and the application of Monte Carlo, resampling, and permutative methods to a wide variety of common problems in data analysis, will be discussed. The importance of data reanalysis, following preliminary analysis, is emphasized.
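
As a preview of the Monte Carlo approach, here is a minimal Python sketch that estimates the value of pi from a million random points scattered in the unit square:

#!/usr/local/bin/python
# A Monte Carlo estimate of pi: count the fraction of random points
# in the unit square that falls within the quarter-circle.
import random
random.seed(42)                    # a fixed seed makes the run repeatable
trials = 1000000
hits = sum(1 for i in range(trials)
           if random.random()**2 + random.random()**2 <= 1.0)
print("pi is approximately", 4.0 * hits / trials)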

Chapter Organization

Every chapter has a section in which free utilities are listed, and their functions, in the service of data simplification, are described. The utilities listed in this book confer superpowers on the user; nothing less. Like a magical incantation, or a bite from a radioactive spider, software utilities endow sedentary data scientists with the power to create elaborate graphs with the click of a mouse, perform millions of mathematical calculations in the blink of an eye, and magically unlock the deepest secrets held in a tangled web of chaotic data. Happily, every data scientist can enjoy a nearly limitless collection of no-cost, open source programming languages, utilities, specifications, and undifferentiated software that can help them perform all of the tasks involved in data simplification. Bearing in mind that today's exciting innovation is tomorrow's abandonware, I've confined my recommendations to tools whose popularity has survived the test of time (see Glossary items, Utility, Open source, Free software, Free software license, FOSS, Specification, Undifferentiated software, GNU software license, Abandonware).

Every chapter has its own glossary. Traditionally, glossaries are intended to define technical terms encountered in the text. Definitions, though helpful, seldom explain the purpose served by a term or concept. In this book, the glossaries are designed to add depth and perspective to concepts that were touched upon in the chapters. Readers will find that the glossaries can be enjoyed as stand-alone documents, worthy of reading from beginning to end.

Every chapter has its own Reference section. Serious readers will want to obtain and study the primary sources. With very few exceptions, the articles cited in the reference sections are available for public download from the Web. A cursory review of the citations should indicate that the articles are chosen from many different scientific disciplines, reinforcing the premise that science, at its most creative, is always multidisciplinary.

How to Read this Book

Readers will almost certainly be overwhelmed by the multitude of algorithms and code snippets included in this book. Please keep in mind that there is plenty of time to master methodology. It might be best to absorb the fundamental concepts discussed in each chapter. For your first reading, it is sufficient to understand how various methods may fit into a general approach to simplifying data. When you need to use the tools or the scripts from the book, please do not try to retype snippets of code; you'll only introduce errors. Use the free downloadable file, containing all of the code from the book, at the publisher's web site (http://booksite.elsevier.com/9780128037812/) (see Glossary item, Script).

Nota Bene

Readers will notice that many of the examples of classifications are drawn from the classification of living organisms. The reason for this is that the classification of living organisms is the largest, most mature, and best classification in existence. It was developed with the most intense efforts of thousands of scientists, over hundreds of years. Almost every imaginable design flaw and conceptual error that can be produced by the human mind has been committed and corrected in the development of this classification. Hence, this classification offers a seemingly endless number of examples of what to avoid and what to seek when designing, or implementing, a classification. Anyone who has any serious interest in data organization, regardless of his or her chosen scientific discipline, must endure some exposure to the history and the current status of the classification of living organisms.

The book contains many scripts, and snippets of scripts, written in Perl, Python, or Ruby. The book also contains command line calls to numerous open source utilities. These code samples should be operable on Windows and Linux systems and on any operating system that supports scripts written in Perl, Python, or Ruby. Although all of the code was thoroughly checked to ensure that it functioned properly, readers may encounter occasional malfunctions and nonfunctions due to inconsistencies and incompatibilities between versions of operating systems and/or versions of programming languages; or due to unavailability of utilities and language-specific modules; or due to the author's inadvertent coding errors; or due to typographic mistakes; or due to idiosyncrasies among individual computers. As an example of the last item, readers should know that whenever a command line in this book begins with the DOS prompt: "c:\ftp>", this represents a drive location on the author's computer. Readers must substitute the prompt that applies to their own computer in all these instances.

This book is a work of literature; nothing more. Neither the author nor the publisher is responsible for code that does not work to the reader's satisfaction. The provided code is not intended to serve as functional applications. It is best to think of the code in this book as instructional text intended to demonstrate various options and approaches that might be applied to data-related projects. To reduce the number of errors introduced when code is transcribed by hand, all of the coding examples from the book can be downloaded from the publisher's web site for Data Simplification: Taming Information with Open Source Tools.

Glossary

Abandonware Software that is abandoned (eg, no longer updated, supported, distributed, or sold) after its economic value is depleted. In academic circles, the term is often applied to software that is developed under a research grant. When the grant expires, so does the software. Most of the software in existence today is abandonware.

Autocoding When nomenclature coding is done automatically, by a computer program, the process is known as "autocoding" or "autoencoding." See Coding. See Nomenclature. See Autoencoding.

Autoencoding Synonym for autocoding. See Autocoding.

Blended Class Also known as class noise, subsumes the more familiar, but less precise term, "labeling error." Blended class refers to inaccuracies (eg, misleading results) introduced in the analysis of data due to errors in class assignments (ie, assigning a data object to class A when the object should have been assigned to class B). If you are testing the effectiveness of an antibiotic on a class of people with bacterial pneumonia, the accuracy of your results will be forfeit when your study population includes subjects with viral pneumonia, or smoking-related lung damage. Errors induced by blending classes are often overlooked by data analysts who incorrectly assume that the experiment was designed to ensure that each data group is composed of a uniform and representative population. A common source of class blending occurs when the classification upon which the experiment is designed is itself blended. For example, imagine that you are a cancer researcher and you want to perform a study of patients with malignant fibrous histiocytomas (MFH), comparing the clinical course of these patients with the clinical course of patients who have other types of tumors. Let's imagine that the class of tumors known as MFH does not actually exist; that it is a grab-bag term erroneously assigned to a variety of other tumors that happened to look similar to one another. This being the case, it would be impossible to produce any valid results based on a study of patients diagnosed as MFH. The results would be a biased and irreproducible cacophony of data collected across different, and undetermined, classes of tumors. Believe it or not, this specific example, of the blended MFH class of tumors, is selected from the real-life annals of tumor biology.30,31 The literature is rife with research of dubious quality, based on poorly designed classifications and blended classes. A detailed discussion of this topic is found in Section 6.5, Properties that Cross Multiple Classes. One caveat: efforts to eliminate class blending can be counterproductive if undertaken with excess zeal. In an effort to reduce class blending, a researcher may choose groups of subjects who are uniform with respect to every known observable property. For example, suppose you want to compare apples with oranges. To avoid class blending, you might want to make very sure that your apples do not include any kumquats or persimmons, and that your oranges do not include any limes or grapefruits. Imagine that you go even further, choosing only apples and oranges of one variety (eg, Macintosh apples and navel oranges), size (eg, 10 cm), and origin (eg, California). How will your comparisons apply to the varieties of apples and oranges that you have excluded from your study? You may reach conclusions that are invalid and irreproducible for more generalized populations within each class. In this case, you have succeeded in eliminating class blending, at the expense of losing representative populations of the classes. See Simpson's paradox.

Child Class The direct or first generation subclass of a class. Sometimes referred to as the daughter class or, less precisely, as the subclass. See Parent class. See Classification.

Class A class is a group of objects that share a set of properties that define the class and that distinguish the members of the class from members of other classes. The word "class," lowercase, is used as a general term. The word "Class," uppercase, followed by an uppercase noun (eg, Class Animalia), represents a specific class within a formal classification. See Classification.

Classification A system in which every object in a knowledge domain is assigned to a class within a hierarchy of classes. The properties of superclasses are inherited by the subclasses. Every class has one immediate superclass (ie, parent class), although a parent class may have more than one immediate subclass (ie, child class). Objects do not change their class assignment in a classification, unless there was a mistake in the assignment. For example, a rabbit is always a rabbit, and does not change into a tiger. Classifications can be thought of as the simplest and most restrictive type of ontology, and serve to reduce the complexity of a knowledge domain.32 Classifications can be easily modeled in an object-oriented programming language and are nonchaotic (ie, calculations performed on the members and classes of a classification should yield the same output, each time the calculation is performed). A classification should be distinguished from an ontology. In an ontology, a class may have more than one parent class and an object may be a member of more than one class. A classification can be considered a special type of ontology wherein each class is limited to a single parent class and each object has membership in one and only one class. See Nomenclature. See Thesaurus. See Vocabulary. See Dictionary. See Terminology. See Ontology. See Parent class. See Child class. See Superclass. See Unclassifiable objects.

Coding The term "coding" has several very different meanings, depending on which branch of science influences your thinking. For programmers, coding means writing the code that constitutes a computer program. For cryptographers, coding is synonymous with encrypting (ie, using a cipher to encode a message). For medics, coding is calling an emergency team to handle a patient in extremis. For informaticians and library scientists, coding involves assigning an alphanumeric identifier, representing a concept listed in a nomenclature, to a term. For example, a surgical pathology report may include the diagnosis, "Adenocarcinoma of prostate." A nomenclature may assign a code C4863000 that uniquely identifies the concept "Adenocarcinoma." Coding the report may involve annotating every occurrence of the word "Adenocarcinoma" with the "C4863000" identifier. For a detailed explanation of coding, and its importance for searching and retrieving data, see the full discussion in Section 3.4, "Autoencoding and Indexing with Nomenclatures." See Autocoding. See Nomenclature.

Command Line Instructions to the operating system, that can be directly entered as a line of text from the system prompt (eg, the so-called C prompt, "c:>", in Windows and DOS operating systems; the so-called shell prompt, "$", in Linux-like systems).

Command Line Utility Programs lacking graphic user interfaces that are executed via command line instructions. The instructions for a utility are typically couched as a series of arguments, on the command line, following the name of the executable file that contains the utility. See Utility.

Data Quality Act In the U.S., the data upon which public policy is based must have quality and must be available for review by the public. Simply put, public policy must be based on verifiable data. The Data Quality Act of 2002 requires the Office of Management and Budget to develop government-wide standards for data quality.33

Data Object A data object is whatever is being described by the data. For example, if the data is "6 feet tall," then the data object is the person or thing to which "6 feet tall" applies. Minimally, a data object is a metadata/data pair, assigned to a unique identifier (ie, a triple). In practice, the most common data objects are simple data records, corresponding to a row in a spreadsheet or a line in a flat-file. Data objects in object-oriented programming languages typically encapsulate several items of data, including an object name, an object unique identifier, multiple data/metadata pairs, and the name of the object's class. See Triple. See Identifier. See Metadata.

Data Repurposing Involves using old data in new ways that were not foreseen by the people who originally collected the data. Data repurposing comes in the following categories: (1) Using the preexisting data to ask and answer questions that were not contemplated by the people who designed and collected the data; (2) Combining preexisting data with additional data, of the same kind, to produce aggregate data that suits a new set of questions that could not have been answered with any one of the component data sources; (3) Reanalyzing data to validate assertions, theories, or conclusions drawn from the original studies; (4) Reanalyzing the original data set using alternate or improved methods to attain outcomes of greater precision or reliability than the outcomes produced in the original analysis; (5) Integrating heterogeneous data sets (ie, data sets with seemingly unrelated types of information), for the purpose of answering questions or developing concepts that span diverse scientific disciplines; (6) Finding subsets in a population once thought to be homogeneous; (7) Seeking new relationships among data objects; (8) Creating on-the-fly, novel data sets through data file linkages; (9) Creating new concepts or ways of thinking about old concepts, based on a reexamination of data; (10) Fine-tuning existing data models; and (11) Starting over and remodeling systems.34 See Heterogeneous data.

Dictionary A terminology or word list accompanied by a definition for each item. See Nomenclature. See Vocabulary. See Terminology.

Exe File A file with the filename suffix ".exe". In common parlance, filenames with the ".exe" suffix are executable code. See Executable file.

Executable File A file that contains compiled computer code that can be read directly by the computer's CPU, without interpretation by a programming language. A language such as C will compile C code into executables. Scripting languages, such as Perl, Python, and Ruby, interpret plain-text scripts and send instructions to a run-time engine, for execution. Because executable files eliminate the interpretation step, they typically run faster than plain-text scripts. See Exe file.

FOSS Free and open source software. Equivalent to FLOSS (Free Libre Open Source Software), an abbreviation that trades redundancy for international appeal. See Free Software Movement versus Open Source Initiative.

Free Software Movement Versus Open Source Initiative Beyond trivial semantics, the difference between free software and open source software relates to the essential feature necessary for "open source" software (ie, access to the source code) and to the different distribution licenses of free software and open source software. Sticklers insist that free software always comes with permission to modify and redistribute software in a prescribed manner as discussed in the software license; a permission not always granted in open source software. In practice, there is very little difference between free software and open source software. Richard Stallman has written an essay that summarizes the two different approaches to creating free software and open source software.35

Free Software The concept of free software, as popularized by the Free Software Foundation, refers to software that can be used freely, without restriction. The term "free" does not necessarily relate to the actual cost of the software.

Free Software License Virtually all free software is distributed under a license that assigns copyright to the software creator and protects the creator from damages that might result from using the software. Software sponsored by the Free Software Foundation, and much of the software described as either free software or open source software is distributed under one of the GNU software licenses. See GNU software license.

GNU Software License The GNU organization publishes several licenses, used for software produced by GNU and by anyone who would like to distribute their software under the terms of the GNU license. GNU licenses are referred to as copyleft licenses, because they primarily serve the software users, rather than the software creators. One of the GNU licenses, the General Public License, covers most software applications. The GNU Lesser General Public License, formerly known as the GNU Library General Public License, is intended for use with software libraries or unified collections of files comprising a complex application, language, or other body of work.36

HTML HyperText Markup Language is an ASCII-based set of formatting instructions for web pages. HTML formatting instructions, known as tags, are embedded in the document, enclosed in angle brackets, indicating the start point and end point of each instruction. Here is an example of an HTML tag instructing the web browser to display the word "Hello" in italics: <i>Hello</i>. All web browsers conforming to the HTML specification must contain software routines that recognize and implement the HTML instructions embedded within web documents. In addition to formatting instructions, HTML also includes linkage instructions, in which the web browsers must retrieve and display a listed web page, or a web resource, such as an image. The protocol whereby web browsers, following HTML instructions, retrieve web pages from other internet sites, is known as HTTP (HyperText Transfer Protocol).

Halting a Script The most common problem encountered by programmers is nonexecution (ie, your script will not run). But another problem, which can be much worse, occurs when your program never wants to stop. This can occur when a block or loop has no exit condition. For most scripting languages, when you notice that a script seems to be running too long, and you want to exit, just press the ctrl-break keys. The script should eventually stop and return your system prompt.
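
The remedy, of course, is to give every loop an explicit exit condition. Here is a trivial Python illustration; remove the counter test and the loop will never halt on its own:

#!/usr/local/bin/python
# A loop with an explicit exit condition; without the counter test,
# this "while True" loop would run forever.
count = 0
while True:
    count += 1
    if count >= 1000000:     # the exit condition
        break
print("finished after", count, "iterations")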

Heterogeneous Data Two sets of data are considered heterogeneous when they are dissimilar to one another, with regard to content, purpose, format, organization, or annotations. One of the purposes of data science is to discover relationships among heterogeneous data sources. For example, epidemiologic data sets may be of service to molecular biologists who have gene sequence data on diverse human populations. The epidemiologic data is likely to contain different types of data values, annotated and formatted in a manner different from the data and annotations in a gene sequence database. The two types of related data, epidemiologic and genetic, have dissimilar content; hence they are heterogeneous to one another.

Identification The process of providing a data object with an identifier, or the process of distinguishing one data object from all other data objects on the basis of its associated identifier. See Identifier.

Identifier A string that is associated with a particular thing (eg, person, document, transaction, data object), and not associated with any other thing.37 Object identification usually involves permanently assigning a seemingly random sequence of numeric digits (0-9) and alphabet characters (a-z and A-Z) to a data object. A data object can be a specific piece of data (eg, a data record), or an abstraction, such as a class of objects or a number or a string or a variable. See Identification.
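
For illustration, such an identifier can be minted in a line or two of Python, using the standard uuid module; this sketch simply prints a randomly generated identifier:

#!/usr/local/bin/python
# Creating a unique object identifier with Python's uuid module.
import uuid
identifier = str(uuid.uuid4())
print(identifier)    # a random 36-character identifier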

Instance An instance is a specific example of an object that is not itself a class or group of objects. For example, Tony the Tiger is an instance of the tiger species. Tony the Tiger is a unique animal and is not itself a group of animals or a class of animals. The terms instance, instance object, and object are sometimes used interchangeably, but the special value of the "instance" concept, in a system wherein everything is an object, is that it distinguishes members of classes (ie, the instances) from the classes to which they belong.

Introspection A method by which data objects can be interrogated to yield information about themselves (eg, properties, values, and class membership). Through introspection, the relationships among the data objects can be examined. Introspective methods are built into object-oriented languages. The data provided by introspection can be applied, at run-time, to modify a script's operation; a technique known as reflection. Specifically, any properties, methods, and encapsulated data of a data object can be used in the script to modify the script's run-time behavior. See Reflection.

KISS Acronym for Keep It Simple Stupid. The motto applies to almost any area of life; nothing should be made more complex than necessary. As it happens, much of what we encounter, as data scientists, comes to us in a complex form (ie, nothing to keep simple). A more realistic acronym is MISS (Make It Simple Stupid).

Meaning In informatics, meaning is achieved when described data is bound to a unique identifier of a data object. "Claude Funston's height is five feet eleven inches," comes pretty close to being a meaningful statement. The statement contains data (five feet eleven inches), and the data is described (height). The described data belongs to a unique object (Claude Funston). Ideally, the name "Claude Funston" should be provided with a unique identifier, to distinguish one instance of Claude Funston from all the other persons who are named Claude Funston. The statement would also benefit from a formal system that ensures that the metadata makes sense (eg, What exactly is height, and does Claude Funston fall into a class of objects for which height is a property?) and that the data is appropriate (eg, Is 5 feet 11 inches an allowable measure of a person's height?). A statement with meaning does not need to be a true statement (eg, The height of Claude Funston was not 5 feet 11 inches when Claude Funston was an infant). See Semantics. See Triple. See RDF.

Metadata The data that describes data. For example, a data element (also known as data point) may consist of the number, "6." The metadata for the data may be the words "Height, in feet." A data element is useless without its metadata, and metadata is useless unless it adequately describes a data element. In XML, the metadata/data annotation comes in the form <metadata tag>data</metadata tag> and might look something like:

<weight_in_pounds>150</weight_in_pounds>

In spreadsheets, the data elements are the cells of the spreadsheet. The column headers are the metadata that describe the data values in the column's cells, and the row headers are the record numbers that uniquely identify each record (ie, each row of cells). See XML.

Microarray Also known as gene chip, gene expression array, DNA microarray, or DNA chip. These consist of thousands of small samples of chosen DNA sequences arrayed onto a block of support material (usually, a glass slide). When the array is incubated with a mixture of DNA sequences prepared from cell samples, hybridization will occur between molecules on the array and single-stranded complementary molecules present in the cell sample. The greater the concentration of complementary molecules in the cell sample, the greater the number of fluorescently tagged hybridized molecules in the array. A specialized instrument prepares an image of the array, and quantifies the fluorescence in each array spot. Spots with high fluorescence indicate relatively large quantities of DNA in the cell sample that match the specific sequence of DNA in the array spot. The data comprising all the fluorescent intensity measurements for every spot in the array produces a gene profile characteristic of the cell sample.

Multiclass Classification A misnomer imported from the field of machine learning, indicating the assignment of an instance to more than one class. Classifications, as defined in this book, impose one-class classification (ie, an instance can be assigned to one and only one class). It is tempting to think that a ball should be included in class "toy" and in class "spheroids," but multiclass assignments create unnecessary classes of inscrutable provenance, and taxonomies of enormous size, consisting largely of replicate items. See Multiclass inheritance. See Taxonomy.

Multiclass Inheritance In ontologies, multiclass inheritance occurs when a child class has more than one parent class. For example, a member of Class House may have two different parent classes: Class Shelter, and Class Property. Multiclass inheritance is generally permitted in ontologies but is forbidden in one type of restrictive ontology, known as a classification. See Classification. See Parent class. See Multiclass classification.

Namespace A namespace is the realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata tag "date" may be used to signify a calendar date, the fruit, or a social engagement. To avoid confusion, metadata terms are assigned a prefix that is associated with a Web document that defines the term (ie, establishes the tag's namespace). In practical terms, a tag that can have different descriptive meanings in different contexts is provided with a prefix that links to a web document wherein the meaning of the tag, as it applies in the XML document, is specified.

Nomenclature A nomenclature is a listing of terms that cover all of the concepts in a knowledge domain. A nomenclature is different from a dictionary for three reasons: (1) the nomenclature terms are not annotated with definitions, (2) nomenclature terms may be multi-word, and (3) the terms in the nomenclature are limited to the scope of the selected knowledge domain. In addition, most nomenclatures group synonyms under a group code. For example, a food nomenclature might collect submarine, hoagie, po' boy, grinder, hero, and torpedo under an alphanumeric code such as "F63958". Nomenclatures simplify textual documents by uniting synonymous terms under a common code. Documents that have been coded with the same nomenclature can be integrated with other documents that have been similarly coded, and queries conducted over such documents will yield the same results, regardless of which term is entered (ie, a search for either hoagie, or po' boy will retrieve the same information, if both terms have been annotated with the synonym code, "F63958"). Optimally, the canonical concepts listed in the nomenclature are organized into a hierarchical classification.38,39 See Coding. See Autocoding.
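
The synonym-to-code mapping can be conveyed in a minimal Python sketch, reusing the sandwich example above; the code "F63958" is, of course, invented:

#!/usr/local/bin/python
# Synonymous terms united under one nomenclature code; the code
# "F63958" follows the invented example in this glossary entry.
nomenclature = {"submarine": "F63958", "hoagie": "F63958",
                "po' boy": "F63958", "grinder": "F63958",
                "hero": "F63958", "torpedo": "F63958"}
print(nomenclature["hoagie"])     # F63958
print(nomenclature["po' boy"])    # F63958, the same concept code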

Nonatomicity Nonatomicity is the assignment of a collection of objects to a single composite object that cannot be further simplified or sensibly deconstructed. For example, the human body is composed of trillions of individual cells, each of which lives for some length of time, and then dies. Many of the cells in the body are capable of dividing to produce more cells. In many cases, the cells of the body that are capable of dividing can be cultured and grown in plastic containers, much like bacteria can be cultured and grown in Petri dishes. If the human body is composed of individual cells, why do we habitually think of each human as a single living entity? Why don't we think of humans as bags of individual cells? Perhaps the reason stems from the coordinated responses of cells. When someone steps on the cells of your toe, the cells in your brain sense pain, the cells in your mouth and vocal cords say ouch, and an army of inflammatory cells rush to the scene of the crime. The cells in your toe are not capable of registering an actionable complaint, without a great deal of assistance. Another reason that organisms, composed of trillions of living cells, are generally considered to have nonatomicity, probably relates to the "species" concept in biology. Every cell in an organism descended from the same zygote, and every zygote in every member of the same species descended from the same ancestral organism. Hence, there seems to be little benefit to assigning unique entity status to the individual cells that compose organisms, when the class structure for organisms is based on descent through zygotes. See Species.

Notation 3 Also called n3. A syntax for expressing assertions as triples (unique subject + metadata + data). Notation 3 expresses the same information as the more formal RDF syntax, but n3 is easier for humans to read.40 RDF and n3 are interconvertible, and either one can be parsed and equivalently tokenized (ie, broken into elements that can be reorganized in a different format, such as a database record). See RDF. See Triple.

Ontology An ontology is a collection of classes and their relationships to one another. Ontologies are usually rule-based systems (ie, membership in a class is determined by one or more class rules). Two important features distinguish ontologies from classifications. Ontologies permit classes to have more than one parent class. For example, the class of automobiles may be a direct subclass of "motorized devices" and a direct subclass of "mechanized transporters." In addition, an instance of a class can be an instance of any number of additional classes. For example, a Lamborghini may be a member of class "automobiles" and of class "luxury items." This means that the lineage of an instance in an ontology can be highly complex, with a single instance occurring in multiple classes, and with many connections between classes. Because recursive relations are permitted, it is possible to build an ontology wherein a class is both an ancestor class and a descendant class of itself. A classification is a highly restrained ontology wherein instances can belong to only one class, and each class may have only one parent class. Because classifications have an enforced linear hierarchy, they can be easily modeled, and the lineage of any instance can be traced unambiguously. See Classification. See Multiclass classification. See Multiclass inheritance.

Open Access A document is open access if its complete contents are available to the public. Open access applies to documents in the same manner as open source applies to software.

Open Source Software is open source if the source code is available to anyone who has access to the software. See Open source movement. See Open access.

Open Source Movement Open source software is software for which the source code is available. The Open Source Software movement is an offspring of the Free Software movement. Although a good deal of free software is no-cost software, the intended meaning of the term "free" is that the software can be used without restrictions. The Open Source Initiative posts an open source definition and a list of approved licenses that can be attached to open source products.

Parent Class The immediate ancestor, or the next-higher class (ie, the direct superclass) of a class. For example, in the classification of living organisms, Class Vertebrata is the parent class of Class Gnathostomata. Class Gnathostomata is the parent class of Class Teleostomi. In a classification, which imposes single class inheritance, each child class has exactly one parent class; whereas one parent class may have several different child classes. Furthermore, some classes, in particular the bottom class in the lineage, have no child classes (ie, a class need not always be a superclass of other classes). A class can be defined by its properties, its membership (ie, the instances that belong to the class), and by the name of its parent class. When we list all of the classes in a classification, in any order, we can always reconstruct the complete class hierarchy, with its correct lineages and branchings, if we know the name of each class's parent class. See Instance. See Child class. See Superclass.

RDF Resource Description Framework (RDF) is a syntax in XML notation that formally expresses assertions as triples. The RDF triple consists of a uniquely identified subject plus a metadata descriptor for the data plus a data element. Triples are necessary and sufficient to create statements that convey meaning. Triples can be aggregated with other triples from the same data set or from other data sets, so long as each triple pertains to a unique subject that is identified equivalently through the data sets. Enormous data sets of RDF triples can be merged or functionally integrated with other massive or complex data resources. For a detailed discussion see Open Source Tools for Chapter 6, "Syntax for Triples." See Notation 3. See Semantics. See Triple. See XML.

RDF Schema Resource Description Framework Schema (RDFS). A document containing a list of classes, their definitions, and the names of the parent class(es) for each class. In an RDF Schema, the list of classes is typically followed by a list of properties that apply to one or more classes in the Schema. To be useful, RDF Schemas are posted on the Internet, as a Web page, with a unique Web address. Anyone can incorporate the classes and properties of a public RDF Schema into their own RDF documents (public or private) by linking named classes and properties, in their RDF document, to the web address of the RDF Schema where the classes and properties are defined. See Namespace. See RDFS.

RDFS Same as RDF Schema.

Reflection A programming technique wherein a computer program will modify itself, at run-time, based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection (ie, object introspection). If the information indicates that the data object belongs to a particular class of objects, then the program may call a method appropriate for the class. The program executes in a manner determined by descriptive information obtained during run-time; metaphorically reflecting upon the purpose of its computational task. See Introspection.

Reproducibility Reproducibility is achieved when repeat studies produce the same results, over and over. Reproducibility is closely related to validation, which is achieved when you draw the same conclusions, from the data, over and over again. Implicit in the concept of "reproducibility" is that the original research must somehow convey the means by which the study can be reproduced. This usually requires the careful recording of methods, algorithms, and materials. In some cases, reproducibility requires access to the data produced in the original studies. If there is no feasible way for scientists to undertake a reconstruction of the original study, or if the results obtained in the original study cannot be obtained in subsequent attempts, then the study is irreproducible. If the work is reproduced, but the results and the conclusions cannot be repeated, then the study is considered invalid. See Validation. See Verification.

Script A script is a program that is written in plain-text, in a syntax appropriate for a particular programming language, that must be parsed by that language's interpreter before it can be executed. Scripts tend to run a bit slower than executable files, but they have the advantage that they can be understood by anyone who is familiar with the script's programming language. Scripts can be identified by the so-called shebang line at the top of the script. See Shebang. See Executable file. See Halting a script.

Semantics The study of meaning (Greek root, semantikos, significant meaning). In the context of data science, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (ie, making triples), combining assertions about data objects (ie, merging triples), and assigning data objects to classes; hence relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but most definitions for semantics are functionally equivalent to the definition offered here. Most language is unstructured and meaningless. Consider the assertion: Sam is tired. This is an adequately structured sentence with a subject, verb, and object. But what is the meaning of the sentence? There are a lot of people named Sam. Which Sam is being referred to in this sentence? What does it mean to say that Sam is tired? Is "tiredness" a constitutive property of Sam, or does it only apply to specific moments? If so, for what moment in time is the assertion, "Sam is tired" actually true? To a computer, meaning comes from assertions that have a specific, identified subject associated with some sensible piece of fully described data (metadata coupled with the data it describes). See Triple. See RDF.

Shebang Standard scripts written in Ruby, Perl, or Python all begin with a shebang, a colloquialism referring to the concatenation of the pound character, "#" (known by the musically-inclined as a SHarp character), followed by an exclamation sign, "!" (connoting a surprise or a bang). A typical shebang line (ie, top line) for Perl, Ruby, Python, and Bash scripts is:

#!/usr/local/bin/perl
#!/usr/local/bin/ruby
#!/usr/local/bin/python
#!/usr/local/bin/sh

In each case, the shebang is followed by the directory path to the language's interpreter, and this is traditionally followed by optional programming arguments specific to each language. The shebang line, though essential in some Unix-like systems, is unnecessary in the Windows operating system. In this book, I use the shebang line to indicate, at a glance, the language in which a script is composed.

Simpson's Paradox Occurs when a correlation that holds in two different data sets is reversed if the data sets are combined. For example, baseball player A may have a higher batting average than player B for each of two seasons, but when the data for the two seasons are combined, player B may have the higher 2-season average. Simpson's paradox is just one example of unexpected changes in outcome when variables are unknowingly hidden or blended.41
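
The paradox is easy to demonstrate with invented numbers, as in this minimal Python sketch; the batting records are fabricated for illustration:

#!/usr/local/bin/python
# Simpson's paradox, with invented batting records: player A leads
# in each season, yet trails when the two seasons are combined.
player_a = [(4, 10), (25, 100)]    # (hits, at-bats) for each season
player_b = [(35, 100), (2, 10)]
for season in (0, 1):
    a = player_a[season][0] / player_a[season][1]
    b = player_b[season][0] / player_b[season][1]
    print("season", season + 1, ":", round(a, 3), "vs", round(b, 3))
a_all = sum(h for h, ab in player_a) / sum(ab for h, ab in player_a)
b_all = sum(h for h, ab in player_b) / sum(ab for h, ab in player_b)
print("combined :", round(a_all, 3), "vs", round(b_all, 3))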

Species Species is the bottom-most class of any classification or ontology. Because the species class contains the individual objects of the classification, it is the only class which is not abstract. The special significance of the species class is best exemplified in the classification of living organisms. Every species of organism contains individuals that share a common ancestral relationship to one another. When we look at a group of squirrels, we know that each squirrel in the group has its own unique personality, its own unique genes (ie, genotype), and its own unique set of physical features (ie, phenotype). Moreover, although the DNA sequences of individual squirrels are unique, we assume that there is a commonality to the genome of squirrels that distinguishes it from the genome of every other species. If we use the modern definition of species as an evolving gene pool, we see that the species can be thought of as a biological life form, with substance (a population of propagating genes), and a function (evolving to produce new species).42-44 Put simply, species speciate; individuals do not. As a corollary, species evolve; individuals simply propagate. Hence, the species class is a separable biological unit with form and function. We, as individuals, are focused on the lives of individual things, and we must be reminded of the role of species in biological and nonbiological classifications. The concept of species is discussed in greater detail in Section 6.4. See Blended class. See Nonatomicity.

Specification A specification is a formal method for describing objects (physical objects such as nuts and bolts or symbolic objects, such as numbers, or concepts expressed as text). In general, specifications do not require the inclusion of specific items of information (ie, they do not impose restrictions on the content that is included in or excluded from documents), and specifications do not impose any order of appearance of the data contained in the document (ie, you can mix up and rearrange specified objects, if you like). Specifications are not generally certified by a standards organization. They are generally produced by special interest organizations, and the legitimacy of a specification depends on its popularity. Examples of specifications are RDF (Resource Description Framework) produced by the W3C (World Wide Web Consortium), and TCP/IP (Transfer Control Protocol/Internet Protocol), maintained by the Internet Engineering Task Force. The most widely implemented specifications are simple and easily implemented. See Specification versus standard.

Specification Versus Standard Data standards, in general, tell you what must be included in a conforming document, and, in most cases, dictate the format of the final document. In many instances, standards bar inclusion of any data that is not included in the standard (eg, you should not include astronomical data in a clinical x-ray report). Specifications simply provide a formal way for describing the data that you choose to include in your document. XML and RDF, a semantic dialect of XML, are examples of specifications. They both tell you how data should be represented, but neither tells you what data to include, or how your document or data set should appear. Files that comply with a standard are rigidly organized and can be easily parsed and manipulated by software specifically designed to adhere to the standard. Files that comply with a specification are typically self-describing documents that contain within themselves all the information necessary for a human or a computer to derive meaning from the file contents. In theory, files that comply with a specification can be parsed and manipulated by generalized software designed to parse the markup language of the specification (eg, XML, RDF) and to organize the data into data structures defined within the file. The relative strengths and weaknesses of standards and specifications are discussed in Section 2.6, "Specifications Good, Standards Bad." See XML. See RDF.

Superclass Any of the ancestral classes of a subclass. For example, in the classification of living organisms, the class of vertebrates is a superclass of the class of mammals. The immediate superclass of a class is its parent class. In common parlance, when we speak of the superclass of a class, we are usually referring to its parent class. See Parent class.

Syntax Syntax is the standard form or structure of a statement. What we know as English grammar is equivalent to the syntax for the English language. Charles Mead succinctly summarized the difference between syntax and semantics: "Syntax is structure; semantics is meaning."45 See Semantics.

Taxonomic Order In biological taxonomy, the hierarchical lineage of organisms is divided into a descending list of named ranks: Kingdom, Phylum (Division), Class, Order, Family, Genus, and Species. As we have learned more and more about the classes of organisms, modern taxonomists have added additional ranks to the classification (eg, supraphylum, subphylum, suborder, infraclass, etc.). Was this really necessary? All of this taxonomic complexity could be averted by dropping named ranks and simply referring to every class as "Class." Modern specifications for class hierarchies (eg, RDF Schema) encapsulate each class with the name of its superclass. When every object yields its class and superclass, it is possible to trace any object's class lineage. For example, in the classification of living organisms, if you know the name of the parent for each class, you can write a simple script that generates the complete ancestral lineage for every class and species within the classification.46 See Class. See Taxonomy. See RDF Schema. See Species.
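
Such a script need only follow parent pointers upward. Here is a minimal sketch in Python, assuming a small, hypothetical fragment of the classification in which each class is stored with the name of its parent class; looping over every key in the dictionary would generate the full set of lineages.

    # Each class is stored with the name of its parent class (superclass),
    # in the manner of RDF Schema. Hypothetical fragment for illustration.
    parent = {
        "Homo sapiens": "Homo",
        "Homo": "Hominidae",
        "Hominidae": "Primates",
        "Primates": "Mammalia",
        "Mammalia": "Chordata",
        "Chordata": "Animalia",
    }

    def lineage(name):
        # Climb from a class or species to the root of the classification.
        trail = [name]
        while trail[-1] in parent:
            trail.append(parent[trail[-1]])
        return trail

    print(" -> ".join(lineage("Homo sapiens")))
    # Homo sapiens -> Homo -> Hominidae -> Primates -> Mammalia -> Chordata -> Animalia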

Taxonomy A taxonomy is the collection of named instances (class members) in a classification or an ontology. When you see a schematic showing class relationships, with individual classes represented by geometric shapes and the relationships represented by arrows or connecting lines between the classes, then you are essentially looking at the structure of a classification, minus the taxonomy. You can think of building a taxonomy as the act of pouring all of the names of all of the instances into their proper classes. A taxonomy is similar to a nomenclature; the difference is that in a taxonomy, every named instance must have an assigned class. See Taxonomic order.

Terminology The collection of words and terms used in some particular discipline, field, or knowledge domain. Nearly synonymous with vocabulary and with nomenclature. Vocabularies, unlike terminologies, are not confined to the terms used in a particular field. Nomenclatures, unlike terminologies, usually aggregate equivalent terms under a canonical synonym.

Thesaurus A vocabulary that groups together synonymous terms. A thesaurus is very similar to a nomenclature. There are two minor differences. Nomenclatures include multi-word terms, whereas a thesaurus is typically composed of one-word terms. In addition, nomenclatures are typically restricted to a well-defined topic or knowledge domain (eg, names of stars, infectious diseases, etc.). See Nomenclature. See Vocabulary. See Classification. See Dictionary. See Terminology. See Ontology.

Triple In computer semantics, a triple is an identified data object associated with a data element and the description of the data element. In the computer science literature, the syntax for the triple is commonly described as: subject, predicate, object, wherein the subject is an identifier, the predicate is the description of the object, and the object is the data. The definition of triple, using grammatical terms, can be off-putting to the data scientist, who may think in terms of spreadsheet entries: a key that identifies the line record, a column header containing the metadata description of the data, and a cell that contains the data. In this book, the three components of a triple are described as: (1) the identifier for the data object, (2) the metadata that describes the data, and (3) the data itself. In theory, all data sets, databases, and spreadsheets can be constructed or deconstructed as collections of triples. See Introspection. See Data object. See Semantics. See RDF. See Meaning.
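
As a minimal sketch of this equivalence, a spreadsheet row can be deconstructed into triples (the record key, column headers, and cell values below are hypothetical):

    # A hypothetical spreadsheet row: a key that identifies the record,
    # column headers (the metadata), and cells (the data).
    record_key = "patient_7512"
    headers = ["name", "age", "diagnosis"]
    cells = ["Conrad Nervig", "44", "measles"]

    # Deconstruct the row into triples: (identifier, metadata, data).
    triples = [(record_key, meta, data) for meta, data in zip(headers, cells)]
    for triple in triples:
        print(triple)
    # ('patient_7512', 'name', 'Conrad Nervig')
    # ('patient_7512', 'age', '44')
    # ('patient_7512', 'diagnosis', 'measles')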

Unclassifiable Objects Classifications create a class for every object, and taxonomies assign each and every object to its correct class. This means that a classification is not permitted to contain unclassified objects; a condition that puts fussy taxonomists in an untenable position. Suppose you have an object, and you simply do not know enough about the object to confidently assign it to a class. Or suppose you have an object that seems to fit more than one class, and you can't decide which class is the correct class. What do you do? Historically, scientists have resorted to creating a "miscellaneous" class into which otherwise unclassifiable objects are given a temporary home, until more suitable accommodations can be provided. I have spoken with numerous data managers, and everyone seems to be of a mind that "miscellaneous" classes, created as a stopgap measure, serve a useful purpose. Not so. Historically, the promiscuous application of "miscellaneous" classes has proven to be a huge impediment to the advancement of science. In the classification of living organisms, the class of protozoans stands as a case in point. Ernst Haeckel, a leading biological taxonomist in his time, created the Kingdom Protista (ie, protozoans) in 1866 to accommodate a wide variety of simple organisms with superficial commonalities. Haeckel himself understood that the protists were a blended class that included unrelated organisms, but he believed that further study would resolve the confusion. In a sense, he was right, but the process took much longer than he had anticipated, occupying generations of taxonomists over the following 150 years. Today, Kingdom Protista no longer exists. Its members have been reassigned to other classes. Nonetheless, textbooks of microbiology still describe the protozoans, just as though this name continued to occupy a legitimate place among terrestrial organisms. In the meantime, therapeutic opportunities for eradicating so-called protozoal infections, using class-targeted agents, have no doubt been missed.47 You might think that the creation of a class of living organisms, with no established scientific relation to the real world, was a rare and ancient event in the annals of biology, having little or no chance of being repeated. Not so. A special pseudoclass of fungi, deuteromycetes (spelled with a lowercase "d," signifying its questionable validity as a true biologic class), has been created to hold fungi of indeterminate speciation. At present, there are several thousand such fungi, sitting in a taxonomic limbo, waiting to be placed into a definitive taxonomic class.47,48 See Blended class.

Undifferentiated Software Intellectual property disputes have driven developers to divide software into two categories: undifferentiated software and differentiated software. Undifferentiated software comprises the fundamental algorithms that everyone uses whenever they develop a new software application. It is in nobody's interest to assign patents to basic algorithms and their implementations. Nobody wants to devote their careers to prosecuting or defending tenuous legal claims over the ownership of the fundamental building blocks of computer science. Differentiated software comprises customized applications that are sufficiently new and different from any preceding product that patent protection would be reasonable.

Utility In the context of software, a utility is an application that is dedicated to performing one specific task very well and very fast. In most instances, utilities are short programs, often running from the command line, and thus lacking any graphic user interface. Many utilities are available at no cost, with open source code. In general, simple utilities are preferable to multipurpose software applications.29 Remember, an application that claims to do everything for the user is, most often, an application that requires the user to do everything for the application. See Command line. See Command line utility.

Validation Validation is the process that checks whether the conclusions drawn from data analysis are correct.49 Validation usually starts with repeating the same analysis of the same data, using the methods that were originally described. Obviously, if a different set of conclusions is drawn from the same data and methods, the original conclusions cannot be validated. Validation may involve applying a different set of analytic methods to the same data, to determine whether the conclusions are consistent. It is always reassuring to know that conclusions are repeatable with different analytic methods. In prior eras, experiments were validated by repeating the entire experiment, thus producing a new set of observations for analysis. Many of today's scientific experiments are far too complex and costly to repeat. In such cases, validation requires access to the complete collection of the original data, and to the detailed protocols under which the data was generated. One of the most useful methods of data validation involves testing new hypotheses based on the assumed validity of the original conclusions. For example, if you were to accept Darwin's analysis of barnacle data, leading to his theory of evolution, then you would expect to find a chronologic history of fossils in ascending layers of shale. This was the case; thus, paleontologists studying the Burgess shale reserves provided some validation of Darwin's conclusions. Validation should not be mistaken for proof. Nonetheless, the repeatability of conclusions, over time, with the same or different sets of data, and the demonstration of consistency with related observations, is about all that we can hope for in this imperfect world. See Verification. See Reproducibility.

Verification The process by which data is checked to determine whether the data was obtained properly (ie, according to approved protocols), and that the data accurately measured what it was intended to measure, on the correct specimens, and that all steps in data processing were done using well-documented protocols. Verification often requires a level of expertise that is at least as high as the expertise of the individuals who produced the data.49 Data verification requires a full understanding of the many steps involved in data collection and can be a time-consuming task. In one celebrated case, in which two statisticians reviewed a microarray study performed at Duke University, the time devoted to their verification effort was reported to be 2000 hours.50 To put this statement in perspective, the official work-year, according to the U.S. Office of Personnel Management, is 2087 hours. Verification is different from validation. Verification is performed on data; validation is done on the results of data analysis. See Validation. See Microarray. See Introspection.

Vocabulary A comprehensive collection of words and their associated meanings. In some quarters, "vocabulary" and "nomenclature" are used interchangeably, but they are different from one another. Nomenclatures typically focus on terms confined to one knowledge domain. Nomenclatures typically do not contain definitions for the contained terms. Nomenclatures typically group terms by synonymy. Lastly, nomenclatures include multi-word terms. Vocabularies are collections of single words, culled from multiple knowledge domains, with their definitions, and assembled in alphabetic order. See Nomenclature. See Thesaurus. See Taxonomy. See Dictionary. See Terminology.

XML Acronym for eXtensible Markup Language, a syntax for marking data values with descriptors (ie, metadata). The descriptors are commonly known as tags. In XML, every data value is enclosed by a start-tag, containing the descriptor and indicating that a value will follow, and an end-tag, containing the same descriptor and indicating that a value preceded the tag. For example: <name>Conrad Nervig</name>. The enclosing angle brackets, "<>", and the end-tag marker, "/", are hallmarks of HTML and XML markup. This simple but powerful relationship between metadata and data allows us to employ metadata/data pairs as though each were a miniature database. The semantic value of XML becomes apparent when we bind a metadata/data pair to a unique object, forming a so-called triple. See Triple. See Meaning. See Semantics. See HTML.
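
The following short sketch, using Python's standard library, parses the example tag above into a metadata/data pair and binds it to an object identifier (the identifier is a hypothetical addition, for illustration) to form a triple:

    import xml.etree.ElementTree as ET

    # Parse the example tag; the tag name is the metadata, and the
    # enclosed text is the data.
    element = ET.fromstring("<name>Conrad Nervig</name>")
    pair = (element.tag, element.text)  # ('name', 'Conrad Nervig')

    # Binding the metadata/data pair to a unique object identifier
    # (hypothetical here) yields a triple.
    triple = ("object_291", element.tag, element.text)
    print(triple)  # ('object_291', 'name', 'Conrad Nervig')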

References

1 Kappelman L.A., McKeeman R., Zhang L. Early warning signs of IT project failure: the dominant dozen. Information Systems Management. 2006;23:31–36.

2 Arquilla J. The Pentagon's biggest boondoggles. The New York Times (Opinion Pages). March 12, 2011.

3 Lohr S. Lessons from Britain's Health Information Technology Fiasco. The New York Times. September 27, 2011.

4 Department of Health Media Centre. Dismantling the NHS National Programme for IT. Press Release. September 22, 2011. Available from: http://mediacentre.dh.gov.uk/2011/09/22/dismantling-the-nhs-national-programme-for-it/ [accessed 12.06.2012].

5 Whittaker Z. UK's delayed national health IT programme officially scrapped. ZDNet. September 22, 2011.

6 Lohr S. Google to end health records service after it fails to attract users. The New York Times. Jun 24, 2011.

7 Schwartz E. Shopping for health software, some doctors get buyer's remorse. The Huffington Post Investigative Fund. January 29, 2010.

8 Heeks R., Mundy D., Salazar A. Why health care information systems succeed or fail. Manchester: Institute for Development Policy and Management, University of Manchester; June 1999. Available from: http://www.sed.manchester.ac.uk/idpm/research/publications/wp/igovernment/igov_wp09.htm [accessed 12.07.2012].

9 Beizer B. Software testing techniques. 2nd ed. Hoboken, NJ: Van Nostrand Reinhold; 1990.

10 Unreliable Research. Trouble at the lab. The Economist. October 19, 2013.

11 Kolata G. Cancer fight: unclear tests for new drug. The New York Times. April 19, 2010.

12 Ioannidis J.P. Why most published research findings are false. PLoS Med. 2005;2:e124.

13 Baker M. Reproducibility crisis: blame it on the antibodies. Nature. 2015;521:274–276.

14 Naik G. Scientists' elusive goal: reproducing study results. Wall Street Journal. December 2, 2011.

15 Innovation or Stagnation. Challenge and opportunity on the critical path to new medical products. Silver Spring, MD: U.S. Department of Health and Human Services, Food and Drug Administration; 2004.

16 Hurley D. Why are so few blockbuster drugs invented today? The New York Times. November 13, 2014.

17 Angell M. The truth about the drug companies. The New York Review of Books. July 15, 2004;vol. 51.

18 Crossing the Quality Chasm. A new health system for the 21st century. Washington, DC: Quality of Health Care in America Committee, Institute of Medicine; 2001.

19 Wurtman R.J., Bettiker R.L. The slowing of treatment discovery, 1965–1995. Nat Med. 1996;2:5–6.

20 Ioannidis J.P. Microarrays and molecular research: noise discovery? Lancet. 2005;365:454–455.

21 Weigelt B., Reis-Filho J.S. Molecular profiling currently offers no more than tumour morphology and basic immunohistochemistry. Breast Cancer Res. 2010;12:S5.

22 The Royal Society. Personalised medicines: hopes and realities. London: The Royal Society; 2005. Available from: https://royalsociety.org/~/media/Royal_Society_Content/policy/publications/2005/9631.pdf [accessed 01.01.2015].

23 Vlasic B. Toyota's slow awakening to a deadly problem. The New York Times. February 1, 2010.

24 Lanier J. The complexity ceiling. In: Brockman J., ed. The next fifty years: science in the first half of the twenty-first century. New York: Vintage; 2002:216–229.

25 Ecker J.R., Bickmore W.A., Barroso I., Pritchard J.K., Gilad Y., Segal E. Genomics: ENCODE explained. Nature. 2012;489:52–55.

26 Rosen J.M., Jordan C.T. The increasing complexity of the cancer stem cell paradigm. Science. 2009;324:1670–1673.

27 Labos C. It ain't necessarily so: why much of the medical literature is wrong. Medscape News and Perspectives. September 09, 2014.

28 Gilbert E., Strohminger N. We found only one-third of published psychology research is reliable — now what? The Conversation. August 27, 2015. Available from: http://theconversation.com/we-found-only-one-third-of-published-psychology-research-is-reliable-now-what-46596 [accessed 27.08.2015].

29 Brooks F.P. No silver bullet: essence and accidents of software engineering. Computer. 1987;20:10–19.

30 Al-Agha O.M., Igbokwe A.A. Malignant fibrous histiocytoma: between the past and the present. Arch Pathol Lab Med. 2008;132:1030–1035.

31 Nakayama R., Nemoto T., Takahashi H., Ohta T., Kawai A., Seki K., et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Mod Pathol. 2007;20:749–759.

32 Patil N., Berno A.J., Hinds D.A., Barrett W.A., Doshi J.M., Hacker C.R., et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science. 2001;294:1719–1723.

33 Data Quality Act. 67 Fed. Reg. 8,452, February 22, 2002, addition to FY 2001 Consolidated Appropriations Act (Pub. L. No. 106-554, codified at 44 U.S.C. 3516).

34 Berman J.J. Repurposing legacy data: innovative case studies. United States: Elsevier; 2015.

35 Stallman R. Why "Free Software" is better than "Open Source". Boston: Free Software Foundation; 2015. Available from: http://www.gnu.org/philosophy/free-software-for-freedom.html [accessed 14.11.2015].

36 What is copyleft? Available from: http://www.gnu.org/copyleft/ [accessed 31.08.2015].

37 Paskin N. Identifier interoperability: a report on two recent ISO activities. D-Lib Mag. 2006;12:1–23.

38 Berman J.J. Tumor classification: molecular analysis meets Aristotle. BMC Cancer. 2004;4:10. Available from: http://www.biomedcentral.com/1471-2407/4/10 [accessed 01.01.2015].

39 Berman J.J. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer. 2004;4:88. Available from: http://www.biomedcentral.com/1471-2407/4/88 [accessed 01.01.2015].

40 Berman J.J., Moore G.W. Implementing an RDF schema for pathology images. Pittsburgh, PA: Pathology Informatics; 2007. Available from: http://www.julesberman.info/spec2img.htm [accessed 01.01.2015].

41 Tu Y., Gunnell D., Gilthorpe M.S. Simpson's Paradox, Lord's Paradox, and suppression effects are the same phenomenon — the reversal paradox. Emerg Themes Epidemiol. 2008;5:2.

42 DeQueiroz K. Ernst Mayr and the modern concept of species. Proc Natl Acad Sci U S A. 2005;102(Suppl. 1):6600–6607.

43 DeQueiroz K. Species concepts and species delimitation. Syst Biol. 2007;56:879–886.

44 Mayden R.L. Consilience and a hierarchy of species concepts: advances toward closure on the species puzzle. J Nematol. 1999;31(2):95–116.

45 Mead C.N. Data interchange standards in healthcare IT — computable semantic interoperability: now possible but still difficult, do we really need a better mousetrap? J Healthc Inf Manag. 2006;20:71–78.

46 Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

47 Berman J.J. Taxonomic guide to infectious diseases: understanding the biologic classes of pathogenic organisms. Waltham: Academic Press; 2012.

48 Guarro J., Gene J., Stchigel A.M. Developments in fungal taxonomy. Clin Microbiol Rev. 1999;12:454–500.

49 Committee on Mathematical Foundations of Verification, Validation, and Uncertainty Quantification; Board on Mathematical Sciences and Their Applications; Division on Engineering and Physical Sciences; National Research Council. Assessing the reliability of complex models: mathematical and statistical foundations of verification, validation, and uncertainty quantification. Washington, DC: National Academies Press; 2012. Available from: http://www.nap.edu/catalog.php?record_id=13395 [accessed 01.01.2015].

50 The Economist. Misconduct in science: an array of errors. The Economist. September 10, 2011.
