Chapter 5

Identifying and Deidentifying Data

Abstract

Data identification is one of the most under-appreciated and least understood issues in data science. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects and that links together all of the information that has been or will be associated with the identified data object. The method of identification and the selection of objects and classes to be identified relates fundamentally to the organizational model of complex data. If the simplifying step of data identification is ignored or implemented improperly, data cannot be shared, and conclusions drawn from the data cannot be believed. All well-designed information systems are, at their heart, identification systems: ways of naming data objects so that they can be retrieved. The concept of data identification is of such overriding importance that simplified data sets should be envisioned as collections of unique identifiers to which data is attached. Once data objects have been properly identified, they can be deidentified and, under some circumstances, reidentified. The ability to deidentify data objects confers enormous advantages when issues of confidentiality, privacy, and intellectual property emerge. Data reidentification will be discussed in the context of error detection, error correction, and data validation. This chapter discusses methods for identifying data and explains the catastrophic consequences of inadequate identification.

Keywords

Identifiers; Data identification; Reidentification; Deidentification; Unique identifier; UUID; Digital signature; Message digest

Intellectuals solve problems; geniuses prevent them.

Albert Einstein

5.1 Unique Identifiers

I always wanted to be somebody, but now I realize I should have been more specific.

Lily Tomlin

An object identifier is anything associated with the object that persists throughout the life of the object and that is unique to the object (ie, does not belong to any other object). Everyone is familiar with biometric identifiers, such as fingerprints, iris patterns, and genome sequences. In the case of data objects, the identifier usually refers to a randomly chosen long sequence of numbers and letters that is permanently assigned to the object and that is never assigned to any other data object (see Glossary item, Data object).

An identifier system is a set of data-related protocols that satisfy the following conditions: (1) Completeness (ie, every unique object has an identifier); (2) Uniqueness (ie, each identifier is a unique sequence); (3) Exclusivity (ie, each identifier is assigned to only one unique object and to no other object, ever); (4) Authenticity (ie, objects that receive identification can be verified as the objects that they are intended to be); (5) Aggregation (ie, all information associated with an identifier can be collected); and (6) Permanence (ie, an identifier is never deleted).

Uniqueness is a very strange concept, especially when applied to the realm of data. For example, if I refer to the number 1, then I am referring to a unique number among other numbers (ie, there is only one number 1). Yet the number 1 may apply to many different things (eg, 1 left shoe, 1 umbrella, 1 prime number between 2 and 5). The number 1 makes very little sense to us until we know something about what it measures (eg, left shoe) and the object to which the measurement applies (eg, shoe_id_#840354751) (see Glossary item, Uniqueness).1

We refer to uniquely assigned computer-generated character strings as "identifiers". As such, computer-generated identifiers are abstract constructs that do not need to embody any of the natural properties of the object. A long (ie, 200 character length) character string consisting of randomly chosen numeric and alphabetic characters is an excellent identifier because the chance that two objects will be assigned the same string is essentially zero. When we need to establish the uniqueness of some object, such as a shoe or a data record, we bind the object to a contrived identifier.
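
For readers who want to see how such a string might be composed, here is a minimal Python sketch; the 200-character length and the alphanumeric alphabet are arbitrary choices:

#!/usr/local/bin/python
import random, string
# draw characters from the operating system's randomness source
rng = random.SystemRandom()
alphabet = string.ascii_letters + string.digits
# compose a 200-character random alphanumeric identifier
identifier = "".join(rng.choice(alphabet) for i in range(200))
print(identifier)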

Jumping ahead just a bit, if we say "part number 32027563 weighs 1 pound," then we are dealing with a meaningful assertion. The assertion tells us three things: (1) that there is a unique thing, known as part number 32027563, (2) the unique thing has a weight, and (3) the weight has a measurement of 1 pound. The phrase "weighs 1 pound" has no meaning until it is associated with a unique object (ie, part number 32027563). The assertion that "part number 32027563 weighs 1 pound" is a "triple", the embodiment of meaning in the field of computational semantics. A triple consists of a unique identified object matched to a pair of data and metadata (ie, a data element and the description of the data element). Information experts use formal syntax to express triples as data structures (see Glossary items, Meaning, RDF, RDF Schema, RDF ontology, Notation 3). We will revisit the concept of triples in Section 6.1, "Meaning and Triples".

Returning to the issue of object identification, there are various methods for generating and assigning unique identifiers to data objects,25 (see Glossary items, Identification, Identifier, One-way hash, UUID, URN, URL). Some identification systems assign a group prefix to an identifier sequence that is unique for the members of the group. For example, a prefix for a research institute may be attached to every data object generated within the institute. If the prefix is registered in a public repository, data from the institute can be merged with data from other institutes, and the institutional source of the data object can always be determined. The value of prefixes and other reserved namespace designations can be undermined when implemented thoughtlessly (see Glossary item, Namespace).1

Identifiers are data simplifiers, when implemented properly, because they allow us to collect all of the data associated with a unique object while ensuring that we exclude the data that should be associated with some other object. As an example of the utility of personal identifier systems, please refer to the discussion of National Patient Identifiers in the Glossary (see Glossary item, National Patient Identifier).

Universally Unique Identifier (UUID) is an example of one type of algorithm that creates collision-free identifiers that can be generated on command at the moment when new objects are created (ie, during the run-time of a software application). Linux systems have a built-in UUID utility, "uuidgen", that can be called from the system prompt. (In the examples below, produced under Cygwin, a Linux-like environment for Windows, the utility appears as "uuidgen.exe".)

Here are a few examples of output values generated by the "uuidgen.exe" utility:

$ uuidgen.exe

312e60c9-3d00-4e3f-a013-0d6cb1c9a9fe

$ uuidgen.exe

822df73c-8e54-45b5-9632-e2676d178664

$ uuidgen.exe

8f8633e1-8161-4364-9e98-fdf37205df2f

$ uuidgen.exe

83951b71-1e5e-4c56-bd28-c0c45f52cb8a

$ uuidgen -t

e6325fb6-5c65-11e5-b0e1-0ceee6e0b993

$ uuidgen -r

5d74e36a-4ccb-42f7-9223-84eed03291f9

Notice that each of the final two examples has a parameter added to the "uuidgen" command (ie, "-t" and "-r"). There are several versions of the UUID algorithm available. The "-t" parameter instructs the utility to produce a UUID based on the time (measured in 100-nanosecond intervals elapsed since the first second of October 15, 1582, the start of the Gregorian calendar). The "-r" parameter instructs the utility to produce a UUID based on the generation of a pseudorandom number. In either case, the UUID utility produces a fixed-length character string suitable as an object identifier. The UUID utility is trusted and widely used by computer scientists. Independent-minded readers can easily design their own unique object identifiers, using pseudorandom number generators or one-way hash generators (see Open Source Tools for this chapter, "UUID", "Pseudorandom number generators", and "One-way hash implementations").
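
Scripting languages offer the same convenience. As a minimal sketch, Python's standard uuid module provides uuid1(), which is time-based, and uuid4(), which is based on a random number, roughly paralleling the "-t" and "-r" parameters of the uuidgen utility:

#!/usr/local/bin/python
import uuid
print(uuid.uuid1())   # time-based UUID, analogous to "uuidgen -t"
print(uuid.uuid4())   # random UUID, analogous to "uuidgen -r"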

In theory, identifier systems are incredibly easy to implement. Here is exactly how it is done (a short code sketch follows the list):

1. Generate a unique character sequence, such as a UUID or a long random number.

2. Assign a unique character sequence (ie, identifier) to each new object at the moment that the object is created. In the case of a hospital, a patient is created at the moment he or she is registered into the hospital information system. In the case of a bank, a customer is created at the moment that he or she is provided with an account number. In the case of an object-oriented programming language, such as Ruby, this would be the moment when the "new" method is sent to a class object, instructing the class object to create a class instance.

3. Preserve the identifier number and bind it to the object. In practical terms, this means that whenever the data object accrues new data, the new data is assigned to the identifier number. In the case of a hospital system, this would mean that all of the lab tests, billable clinical transactions, pharmacy orders, and so on are linked to the patient's unique identifier number, as a service provided by the hospital information system. In the case of a banking system, this would mean that all of the customer's deposits and withdrawals and balances are attached to the customer's unique account number.
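
Here is that sketch: a minimal Python illustration of the three steps, in which a plain dictionary stands in for the information system (the dictionary, the function names, and the sample data are hypothetical):

#!/usr/local/bin/python
import uuid
registry = {}   # binds each identifier to the data that accrues to its object
def create_object():
    # steps 1 and 2: generate a unique sequence and assign it at creation
    identifier = str(uuid.uuid4())
    registry[identifier] = []
    return identifier
def add_data(identifier, metadata, data):
    # step 3: all new data is bound to the object's permanent identifier
    registry[identifier].append((metadata, data))
patient = create_object()
add_data(patient, "glucose", "87")
add_data(patient, "gender", "male")
print(patient, registry[patient])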

As it happens, nothing is ever as simple as it ought to be. In systems that employ long-sequence generators to produce unique identifiers, the most common problem is the indiscriminate assignment of additional unique identifiers to the same data object, thus nullifying the potential benefits of the unique identifier system.

Let's look at an example wherein multiple identifiers are redundantly assigned to the same image, corrupting the identifier system. In Section 4.3, we discussed image headers and provided examples wherein the ImageMagick "identify" utility could extract the textual information included in the image header. One of the header properties created, inserted, and extracted by ImageMagick's "identify" is an image-specific unique string.

When ImageMagick is installed on our computer, we can extract any image's unique string using the "identify" utility and the "-format" attribute on the following command line:

c:\ftp>identify -verbose -format "%#" eqn.jpg

Here, the image we are examining is "eqn.jpg". The "%#" character string is ImageMagick's special syntax indicating that we would like to extract the image identifier. The output is shown.

219e41b4c761e4bb04fbd67f71cc84cd6ae53a26639d4bf33155a5f62ee36e33

We can repeat the command line whenever we like, and the same image-specific unique sequence of characters will be produced.

Using ImageMagick, we can insert text into the "comment" section of the header, using the "-set" attribute. Let's add the text, "I'm modifying myself":

c:\ftp>convert eqn.jpg -set comment "I'm modifying myself" eqn.jpg

Now, let's extract the comment that we just added to satisfy ourselves that the "-set" attribute operated as we had hoped. We do this using the "-format" attribute and the "%c" character string, which is ImageMagick's syntax for extracting the comment section of the header.

c:\ftp>identify -verbose -format "%c" eqn.jpg

The output of the command line is:

I'm modifying myself

Now, let's run once more the command line that produces the character string that is unique for the eqn.jpg image file:

c:\ftp>identify -verbose -format "%#" eqn.jpg

The output is:

cb448260d6eeeb2e9f2dcb929fa421b474021584e266d486a6190067a278639f

What just happened? Why has the unique character string specific for the eqn.jpg image changed? Has our small modification of the file, which consisted of adding a text comment to the image header, resulted in the production of a new image object worthy of a new unique identifier?

Before answering these very important questions, let's pose two gedanken questions (see Glossary item, Gedanken). Imagine you have a tree. This tree, like every living organism, is unique. It has a unique history, a unique location, and a unique genome (ie, a unique sequence of nucleotides composing its genetic material). In 10 years, its leaves drop off and are replaced 10 times. Its trunk expands in size and its height increases. In the 10 years of its existence, has the identity of the tree changed?

You would probably agree that the tree has changed, but that it has maintained its identity (ie, it is still the same tree).

In informatics, a newly created object is given an identifier and this identifier is immutable (ie, cannot be changed), regardless of how the object is modified. In the case of the unique string assigned to an image by ImageMagick, the string serves as an authenticator, not as an identifier (see Glossary item, Authentication). When the image is modified, a new unique string is created. By comparing the so-called identifier string in copies of the image file, we can determine whether any modifications have been made; that is to say, we can authenticate the file.

Getting back to the image file in our example, when we modified the image by inserting a text comment, ImageMagick produced a new unique string for the image. The identity of the image had not changed, but the image was different from the original image (ie, no longer authentic). It seems that the string that we thought to be an identifier string was actually an authenticator string.

If we want an image to have a unique identifier that does not change when the image is modified, we must create our own identifier that persists when the image is modified.

Here is a short Python script, image_id.py, that uses Python's standard UUID method to create an identifier and insert it into the comment section of our image, flanking the identifier with XML tags (see Glossary item, XML).

#!/usr/local/bin/python
import sys, os, uuid
# create a UUID identifier, flanked by XML start and end tags
my_id = "<image_id>" + str(uuid.uuid4()) + "</image_id>"
# insert the identifier into the comment section of the image header
in_command = "convert eqn.jpg -set comment \"" + my_id + "\" eqn.jpg"
os.system(in_command)
# command that extracts the comment section (ie, our identifier)
out_command = "identify -verbose -format \"%c\" eqn.jpg"
print("\nHere's the unique identifier:")
os.system(out_command)
print("\nHere's the unique authenticator:")
os.system("identify -verbose -format \"%#\" eqn.jpg")
# modify the image by resizing it
os.system("convert eqn.jpg -resize 325x500! eqn.jpg")
print("\nHere's the new authenticator:")
os.system("identify -verbose -format \"%#\" eqn.jpg")
print("\nHere's the unique identifier:")
os.system(out_command)
exit

Here is the output of the image_id.py script:

Here's the unique identifier:

<image_id>c94f679f-7acd-4216-a464-eb051ab57547</image_id>

Here's the unique authenticator:

3529d28f97661b401d9ce6d9925a2dadb46c26b7350d94fff5585d7860886781

Here's the new authenticator:

7b45485ca7fca87f5b78e87b9392946b3e1895dab362d2ca5b13a0e3bc136e48

Here's the unique identifier:

<image_id>c94f679f-7acd-4216-a464-eb051ab57547</image_id>

What did the script do, and what does it teach us?

In three lines of code, the script produced a UUID identifier for the image, flanked by the start and end tags <image_id> and </image_id>, and inserted it into the comment section of the image:

my_id = "<image_id>" + str(uuid.uuid4()) + "</image_id>"

in_command = "convert eqn.jpg -set comment "" + my_id + "" eqn.jpg"

os.system(in_command)

Next, the script displayed the image identifier as well as the built-in unique sequence that we now can call ImageMagick's authenticator sequence:

Here's the unique identifier:

<image_id>c94f679f-7acd-4216-a464-eb051ab57547</image_id>

Here's the unique authenticator:

3529d28f97661b401d9ce6d9925a2dadb46c26b7350d94fff5585d7860886781

Following this, the script modified the image by resizing. Then, the script once more produced the authenticator sequence and the identifier sequence:

Here's the new authenticator:

7b45485ca7fca87f5b78e87b9392946b3e1895dab362d2ca5b13a0e3bc136e48

Here's the unique identifier:

<image_id>c94f679f-7acd-4216-a464-eb051ab57547</image_id>

We see that the identifier sequence is unchanged when the image is resized (as it should be), and the authenticator sequence, following the resizing of our image, is a totally new sequence (as it should be).

If you have followed the logic of this section, you are prepared for the following exercise adapted from Zen Buddhism: Imagine you have a hammer. Over the years, you have replaced its head twice and its handle three times. In this case, with nothing remaining of the original hammer, is it still the same hammer? The informatician would answer that it is the same hammer, but it can no longer be authenticated (ie, it is what it is, though it has changed).

5.2 Poor Identifiers, Horrific Consequences

Anticipatory plagiarism occurs when someone steals your original idea and publishes it 100 years before you were born.

Robert Merton

All information systems, all databases, and all good collections of data are best envisioned as identifier systems to which data (belonging to the identifier) can be added over time.

If the system is corrupted (eg, multiple identifiers for the same object; data belonging to one object incorrectly attached to other objects), then the system has no value. You can't trust any of the individual records, and you can't trust any of the analyses performed on collections of records. Furthermore, if the data from a corrupted system is merged with the data from other systems, then all analyses performed on the aggregated data becomes unreliable and useless. This holds true even when every other contributor to the system shares reliable data.

Without proper identifiers, the following may occur: Data values can be assigned to the wrong data objects; data objects can be replicated under different identifiers, with each replicant having an incomplete data record (ie, an incomplete set of data values); the total number of data objects cannot be determined; data sets cannot be verified; and the results of data set analyses will not be valid.

In the past, individuals were identified by their names. When dealing with large numbers of names, it becomes obvious, almost immediately, that personal names are woefully inadequate. Aside from the obvious fact that they are not unique (eg, surnames such as Smith, Zhang, Garcia, Lo, and given names such as John and Susan), one name can have multiple representations. The sources for these variations are many. Here is a partial listing4:

1. Modifiers to the surname (eg, du Bois, DuBois, Du Bois, Dubois, Laplace, La Place, van de Wilde, Van DeWilde, etc.).

2. Accents that may or may not be transcribed onto records (eg, acute accent, cedilla, diacritical comma, palatalized mark, hyphen, diphthong, umlaut, circumflex, and a host of obscure markings).

3. Special typographic characters (the combined "ae").

4. Multiple middle names for an individual that may not always be consistently transcribed onto records (eg, individuals who replace their first name with their middle name for common usage while retaining the first name for legal documents).

5. Latinized and other versions of a single name (eg, Carl Linnaeus, Carl von Linne, Carolus Linnaeus, Carolus a Linne).

6. Hyphenated names that are confused with first and middle names (eg, Jean-Jacques Rousseau or Jean Jacques Rousseau; Louis-Victor-Pierre-Raymond, Seventh duc de Broglie or Louis Victor Pierre Raymond, Seventh duc de Broglie).

7. Cultural variations in name order that are mistakenly rearranged when transcribed onto records. Many cultures do not adhere to the Western European name order (eg, given name, middle name, surname).

8. Name changes through marriage, legal action, aliasing, pseudonymous posing, or insouciant whim.

I have had numerous conversations with intelligent professionals who are tasked with the responsibility of assigning identifiers to individuals. At some point in every conversation, they will find it necessary to explain that although an individual's name cannot serve as an identifier, the combination of name plus date of birth provides accurate identification in almost every instance. They sometimes get carried away, insisting that the combination of name plus date of birth plus Social Security number provides perfect identification, as no two people will share all three identifiers: same name, same date of birth, and same Social Security number (see Glossary item, Social Security Number). This is simply wrong. Let us see what happens when we create identifiers from the name plus birthdate.

Consider this example: Mary Jessica Meagher, born June 7, 1912, decided to open a separate bank account at each of 10 different banks. Some of the banks had application forms, which she filled out accurately. Other banks registered her account through a teller, who asked her a series of questions and immediately transcribed her answers directly into a computer terminal. Ms. Meagher could not see the computer screen and therefore could not review the entries for accuracy.

Here are the entries for her name plus date of birth4:

1. Marie Jessica Meagher, June 7, 1912 (the teller mistook Marie for Mary).

2. Mary J. Meagher, June 7, 1912 (the form requested a middle initial, not a full name).

3. Mary Jessica Magher, June 7, 1912 (the teller misspelled the surname).

4. Mary Jessica Meagher, Jan. 7, 1912 (the birth month was constrained on the form to three letters; June was entered on the form but was transcribed as Jan.).

5. Mary Jessica Meagher, 6/7/12 (the form provided spaces for the final two digits of the birth year. Through the miracle of bank registration, Mary, born in 1912, was reborn a century later).

6. Mary Jessica Meagher, 7/6/12 (the form asked for day, month, and year, in that order, as is common in Europe).

7. Mary Jessica Meagher, June 1, 1912 (on the form, a 7 was mistaken for a 1).

8. Mary Jessie Meagher, June 7, 1912 (Mary, as a child, was called by the informal form of her middle name, which she provided to the teller).

9. Mary Jesse Meagher, June 7, 1912 (Mary, as a child, was called by the informal form of her middle name, which she provided to the teller, and which the teller entered as the male variant of the name).

10. Marie Jesse Mahrer, 1/1/12 (an underzealous clerk combined all of the mistakes on the form and the computer transcript and added a new orthographic variant of the surname).

For each of these 10 examples, a unique individual (Mary Jessica Meagher) would be assigned a different identifier at each of 10 banks. Had Mary re-registered at one bank 10 times, the results might well have been the same.

If you toss the Social Security number into the mix (name + birth date + Social Security number) the problem is compounded. The Social Security number for an individual is anything but unique. Few of us carry our original Social Security cards. Our number changes due to false memory ("You mean I've been wrong all these years?"), data entry errors ("Character tranpsositoins, I mean transpositions, are very common"), intention to deceive ("I don't want to give those people my real number"), desperation ("I don't have a number, so I'll invent one"), or impersonation ("I don't have health insurance, so I'll use my friend's Social Security number"). Efforts to reduce errors by requiring patients to produce their Social Security cards have not been entirely beneficial.

Beginning in the late 1930s, the E. H. Ferree Company, a manufacturer of wallets, promoted their products' card pockets by including a sample Social Security card with each wallet sold. The display card had the Social Security number of one of their employees. Many people found it convenient to use the card as their own Social Security number. Over time, the wallet display number was claimed by over 40,000 people. Today, few institutions require individuals to prove their identity by showing their original Social Security card. Doing so puts an unreasonable burden on the honest patient (who does not happen to carry his/her card) and provides an advantage to criminals (who can easily forge a card).4

Entities that compel individuals to provide a Social Security number have dubious legal standing. The Social Security number was originally intended as a device for validating a person's standing in the Social Security system. More recently, the purpose of the Social Security number has been expanded to track taxable transactions (ie, bank accounts, salaries). Other uses of the Social Security number are not protected by law. The Social Security Act (Section 208, codified at Title 42, U.S. Code, Section 408) prohibits most entities from compelling anyone to divulge his/her Social Security number.4

Considering the unreliability of Social Security numbers in most transactional settings, and considering the tenuous legitimacy of requiring individuals to divulge their Social Security numbers, a prudently designed personal identifier system will limit its reliance on these numbers.

Let's examine another imperfect identifier system: Peter Kuzmak, an information specialist who works with hospital images, made an interesting observation concerning the non-uniqueness of identifiers that were thought to be unique.6 Hospitals that use the DICOM (Digital Imaging and Communications in Medicine) image standard assign a unique object identifier to each image. Each identifier comes with a prefix consisting of a permanent registered code for the institution and the department, along with a suffix consisting of a number generated for an image at the moment the image is created.1

A hospital may assign consecutive numbers to its images, appending these numbers to an object identifier that is unique for the institution and for the department within the institution. For example, the first image created with a CT-scanner might be assigned an identifier consisting of the assigned code for institution and department, followed by a separator such as a hyphen, followed by "1".

In a worst-case scenario, different instruments may assign consecutive numbers to images, independently of one another. This means that the CT-scanner in room A may be creating the same identifier (ie, institution/department prefix + image number) as the CT-scanner in Room B for images on different patients. This problem could be remedied by constraining each CT-scanner to avoid using numbers assigned by any other CT-scanner.

When image counting is done properly, and the scanners are constrained to assign unique numbers (not previously assigned by other scanners in the same institution), then each image may indeed have a unique identifier (institution/department prefix + image number). Nonetheless, problems will arise when departments or institutions merge. Each of these shifts produces a change in the prefix for the institution and department. If a consecutive numbering system is used, then you can expect to create duplicate identifiers when institutional prefixes are replaced. In this case, the old records in both of the merging institutions may be assigned the same prefix and will contain replicates among the consecutively numbered suffixes (eg, image 1, image 2, etc.).

Yet another problem may occur if one unique object is provided with multiple different unique identifiers. For example, a software application may add its own unique identifier to an image that had been previously assigned a unique identifier by the radiology department. Assigning a second identifier insulates the software vendor from bad identifiers that may have been produced in the referring hospital. In so doing, the image now has two different unique identifiers. At this point, which identifier should be used to attach the various data and metadata annotations that accrue to the image over time? By redundantly layering unique identifiers onto a data object, the software vendor defeats the intended purpose of identifying the image (ie, to unambiguously connect a data object with its data).1

How can we avoid common design errors in identification systems? Let's look once more at the implementation steps listed in Section 5.1:

1. Generate a unique character sequence. You cannot compose an adequate identifier by concatenating a sequence of inadequate identifiers (eg, name + Social Security number + birthdate). A UUID or a long fixed-length sequence of randomly generated alphanumeric characters will usually suffice.

2. Assign a unique character sequence (ie, identifier) to each new object at the moment that the object is created. Once the object has been created, it should already have its identifier. This means that you should never assign an identifier to a pre-existing data object. As a practical example, first-time patients seeking care at a hospital should be registered into the hospital system by a qualified registrar, who oversees the assignment of a permanent identification number to the newly created patient object (see Glossary item, Registrars and human authentication).

3. Preserve the unique identifier number and bind it to the unique object. You must never interrupt the linkage between a data object and its identifier. The enduring linkage between data object and identifier ensures that you can retrieve all of the information that accrues to a data object. As a practical example, if an individual enters a hospital and indicates that he or she is a first-time patient, there must be some mechanism to determine that the patient is indeed a new patient (ie, to exclude the possibility that the patient had already been registered). There must also be in place a system to verify that a patient is who they claim to be. If these safeguards are not in place, then the bindings of unique identifier to unique patient are corrupted and the system fails.

Let's end this sobering section with some humor. As the story is told, a tourist, enthralled by his visit to a holy reliquary, is excited to see that one of the skulls is labeled "St. Peter". Several bones later, the tourist notices another skull, also labeled "St. Peter". The tourist asks the curator to explain why there are two skulls with the same name. Without missing a beat the curator says, "Oh, the first skull was St. Peter when he was an adult. The second skull was St. Peter when he was a child."

The moral here is: "Never give a data object more than one opportunity to be unique."

5.3 Deidentifiers and Reidentifiers

Two lawyers can make an excellent living in a town that can't support one.

Anonymous

Imagine, for a moment, that you are a data analyst who is tasked with analyzing cancer genes. You find a large public database consisting of DNA sequences obtained from human cancer specimens. The tissues have been deidentified to protect the privacy of the individuals from whom the tissue samples were taken. All of this data is available for download. As the gigabytes stream into your computer, you think you have arrived in heaven. Before too long, you have completed your analysis on tissues from dozens of lung cancers. You draw a conclusion that most lung cancers express a particular genetic variation, and you suggest in your paper that this variation is a new diagnostic marker. You begin the process of seeking patent protection for your discovery. You believe that you are on a fast track leading to fame and fortune. At lunch, in the University cafeteria, one of your colleagues poses the following question, "If all the tissues in the public data set were deidentified, and you studied 36 samples, how would you know whether your 36 samples came from 36 different tumors in 36 patients or whether they represented 36 samples taken from one tumor in one patient?" A moment of sheer panic follows. If all the samples came from a single patient, then your research is only relevant to one person's tumor. In that case, you would expect a variation in the genome of one tumor sample to show up in every sample. Your finding would have no particular relevance to lung tumors, in general. Frantically, you contact the tissue resource from which you had obtained your data. They confess that because the tissues are deidentified, they cannot resolve your dilemma. You happen to know the director of the tissue resource, and you explain your problem to her. She indicates that she cannot answer your question, but she happens to know that all of the lung cancer specimens were contributed by a single laboratory. She puts you in touch with the principal investigator at the lab. He remembers that his lab contributed tissues, and that all their tissues were samples of a single large tumor. Your greatest fears are confirmed. Your findings have no general relevance, and no scientific value.

What is the moral of this story? Surprising as it may seem, all data, even deidentified data, must be uniquely identified if it is to be of scientific value.

To maintain confidentiality and to protect privacy, data must be disconnected from the individual, but can this be accomplished without sacrificing unique identifiers? Yes, if we have a clear understanding of our goals.

The term "identified data" is a concept central to modern data science and must be distinguished from "data that is linked to an identified individual," a concept that has legal and ethical importance. In the privacy realm, "data that is linked to an identified individual" is shortened to "identified data", and this indulgence has caused no end of confusion. All good data must be identified. If the data isn't identified, then there is no way of aggregating data that pertains to an identifier, and there is no way of distinguishing one data assertion from another (eg, one observation on 10 samples versus 10 observations on one sample).7 It is absolutely crucial to understand and accept that the identity of data is not equivalent to the identity of the individual to whom the data applies. In particular, we can remove the links to the identity of individuals without removing data identifiers. This subtle point accounts for much of the rancor in the field of data privacy (see Glossary items, Deidentification, Deidentification versus anonymization, Reidentification).

In Section 6.1, we will be revisiting data triples and their importance in simplifying data (see Glossary item, Triple). For the moment, it suffices to say that all data can be stored as data triples that consist of a data identifier followed by a metadata tag followed by the data described by the metadata tag.

Here is a set of triples collected from other chapters in this book.

75898039563441 name G. Willikers

75898039563441 gender male

75898039563441 is_a_class_member cowboy

75898039563441 age 35

94590439540089 name Hopalong Tagalong

94590439540089 is_a_class_member cowboy

29847575938125 calendar:date February 4, 1986

57839109275632 social:date Jack and Jill

83654560466294 social:date Pyramus and Thisbe

83654560466294 calendar:date June 16, 1904

98495efc object_name Andy Muzeack

98495efc instance_of Homo sapiens

98495efc dob 1 January, 2001

98495efc glucose_at_time 87, 02-12-2014 17:33:09

In the next chapter, we will be creating a database of triples. For now, we can see that triples are assertions about data objects that are identified by a random alphanumeric character string. The triples can be in any order we like, because we can write software that aggregates or retrieves triples by identifier, metadata, or data. For example, we could write software that aggregates all the data pertaining to a particular data object.
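
As a minimal sketch of such software, the following Python fragment parses a few of the triples shown above and aggregates every assertion that pertains to one chosen identifier:

#!/usr/local/bin/python
triples = ["75898039563441 name G. Willikers",
           "75898039563441 gender male",
           "94590439540089 name Hopalong Tagalong",
           "75898039563441 age 35"]
# collect all of the data pertaining to one data object
for triple in triples:
    identifier, metadata, data = triple.split(" ", 2)
    if identifier == "75898039563441":
        print(metadata, data)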

Notice that we can do a fair job of delinking the data that identifies individuals simply by removing triples that contain names of persons or other private information (eg, age, date of birth, or social engagements), as shown below:

75898039563441 gender male

75898039563441 is_a_class_member cowboy

94590439540089 is_a_class_member cowboy

29847575938125 calendar:date February 4, 1986

83654560466294 calendar:date June 16, 1904

98495efc instance_of Homo sapiens

98495efc glucose_at_time 87, 02-12-2014 17:33:09

This residual set of triples is disembodied and does not link to data that might identify the individual to which the data applies. Nonetheless, all of the remaining triples are fully identified in the sense that the data object identifiers are retained (eg, 75898039563441, 94590439540089, 98495efc). If we had a triplestore containing billions of triples, we could have written a short script that would have removed any triples that contained names of individuals or private information, thus yielding a set of identified triples that do not link to an identified individual (see Glossary item, Triplestore).
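
Here is a short Python sketch of the kind of delinking script just described; the file name and the list of private metadata tags are hypothetical and would be chosen by the data manager:

#!/usr/local/bin/python
private_tags = ["name", "age", "dob", "object_name", "social:date"]
# retain only the triples whose metadata tags are not private
with open("triples.txt") as triple_file:
    for line in triple_file:
        if not line.strip():
            continue
        identifier, metadata, data = line.rstrip().split(" ", 2)
        if metadata not in private_tags:
            print(identifier, metadata, data)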

When you think about it, researchers almost never care about which particular individual is associated with which particular piece of data (see Glossary item, Data versus datum). The purpose of science is to draw generalizations from populations (see Glossary items, Science, Generalization). Scientists are perfectly happy using data from which the links to individuals have been removed, just so long as they can distinguish one observation from another.

Can we stop here? Have we protected the confidentiality and privacy of individuals by removing triples that can link a data object to the person to whom the data applies (see Glossary item, Privacy versus confidentiality)? Probably not. During the course of the study, new data collected on the data objects may have been generated, and this data could be damaging to individuals if it were discovered. As an example, let's add a few triples to data object 75898039563441.

75898039563441 gender male

75898039563441 is_a_class_member cowboy

75898039563441 mental_status borderline personality disorder

75898039563441 criminal_record multiple assaults

75898039563441 chief_complaint kicked in head by cow

You can imagine that data object 75898039563441 might prefer that the data produced in the scientific study (eg, borderline personality disorder, history of assaults, cow-induced head injury) remain confidential. The problem is that a maliciously inclined individual might gain access to the original database, wherein the data object 75898039563441 is clearly linked to the name "G. Willikers". In this case, one possible solution might involve replacing the original object identifier with a new identifier that cannot be linked to data held in the original data file.

We might find it prudent to substitute a new object identifier for each object identifier in our "safe" set of triples:

82030201856150 gender male

82030201856150 is_a_class_member cowboy

44934938405062 is_a_class_member cowboy

65840231656302 calendar:date February 4, 1986

76206674367326 calendar:date June 16, 1904

7392g2s1 instance_of Homo sapiens

7392g2s1 glucose_at_time 87, 02-12-2014 17:33:09

By assigning new identifiers, we have essentially deidentified our original identifiers. Now, it would be impossible to link any of the experimental data in our set of triples to the set of triples that have their original identifiers. Unfortunately, in our quest to protect the names of individuals, we have created a new problem. Suppose an occasion arises when we need to establish the identity of an individual involved in a scientific study. For example, a clinical test on a deidentified blood sample might reveal that the sample contains a marker for a type of cancer that is curable when treated with an experimental drug. As another example, suppose that a virus has been isolated from a sample, indicating that the person who contributed the sample must be immediately quarantined and treated. More commonly, we can imagine that some unaccountably confusing result has prompted us to check the original dataset to determine if there was a systemic error in the collected samples. In any case, there may be legitimate reasons for reidentifying the original individuals from whom the samples were procured.

Reidentification is a term casually applied to any instance whereby information can be linked to a specific person, after the links between the information and the person associated with the information were removed. Used this way, the term reidentification connotes an insufficient deidentification process. In the U.S., privacy protection regulations define reidentification as a legally valid process whereby deidentified records can be linked back to their human subjects, under circumstances deemed compelling by a privacy board.8

Reidentification that is approved by a privacy board is typically accomplished via a confidential list of links between human subject names and deidentified records, held by a trusted party.4

In our example, a list linking the original data identifiers with the replacement data identifiers is created, as shown:

75898039563441 -> 82030201856150

94590439540089 -> 44934938405062

29847575938125 -> 65840231656302

83654560466294 -> 76206674367326

98495efc -> 7392g2s1

The list is saved by a trusted agent, and all or part of the list could be recalled by order of an institutional privacy board, if necessary.
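
Here is a minimal Python sketch of the substitution step; the file names are hypothetical, and the link list written to "reid_map.txt" is the confidential file that would be held by the trusted agent:

#!/usr/local/bin/python
import uuid
link_list = {}   # original identifier -> replacement identifier
map_file = open("reid_map.txt", "w")
with open("safe_triples.txt") as triple_file:
    for line in triple_file:
        if not line.strip():
            continue
        old_id, assertion = line.rstrip().split(" ", 1)
        if old_id not in link_list:
            # assign a fresh identifier and record the link
            link_list[old_id] = str(uuid.uuid4())
            map_file.write(old_id + " -> " + link_list[old_id] + "\n")
        print(link_list[old_id], assertion)
map_file.close()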

This section covered three basic principles of deidentifiers that every data scientist must understand:

1. Every identifier can also serve as a deidentifier.

2. Data cannot be satisfactorily deidentified until it has been successfully identified.

3. Identifiers can be used to reidentify deidentified data, when necessary.

5.4 Data Scrubbing

My most important piece of advice to all you would-be writers: When you write, try to leave out all the parts readers skip.

Elmore Leonard, from "Elmore Leonard's 10 Rules of Writing"

Data scrubbing is sometimes used as a synonym for deidentification. This is wrong. It is best to think of data scrubbing as a process that begins where deidentification ends. A data scrubber will remove unwanted information from a data record, including information of a personal nature, and any information that is not directly related to the purpose of the data record. For example, in the case of a hospital record, a data scrubber might remove: the names of physicians who treated the patient; the names of hospitals or medical insurance agencies; addresses; dates; and any textual comments that are inappropriate, incriminating, or irrelevant.4

There is a concept known as "minimal necessary" that applies to shared confidential data.8 It holds that when confidential records are shared, only the minimum necessary information should be released. Any information not directly relevant to the intended purposes of the data analyst should be withheld. The process of data scrubbing gives data managers the opportunity to render a scrubbed record that is free of information that would link the record to its subject and free of extraneous information that the data analyst does not actually require.

There are many approaches to data scrubbing; most require the data manager to develop an exception list of items to be excluded from shared records (eg, names of people, cities, locations, phone numbers, email addresses, and so on). The scrubbing application moves through the records, extracting unnecessary information along the way. The end product is cleaned, but not sterilized. Though many undesired items can be successfully removed, this approach never produces a perfectly scrubbed set of data. In a large and complex data resource, it is simply impossible for the data manager to anticipate every objectionable item and to include it in an exception list.

There is, however, a method whereby data records can be cleaned without error. This method involves producing a list of "safe phrases" acceptable for inclusion in a scrubbed and deidentified data set. Any data that is not in the list of "safe phrases" is automatically deleted. What remains is the scrubbed data. This method can be described as a reverse scrubbing method. Everything in the data set is deleted by default, unless it is an approved "exception". This method of scrubbing is very fast and can produce an error-free deidentified and scrubbed output.4,5,9,10

I have written a very fast scrubber that uses word doublets as its "safe phrases" list. The algorithm is simple. The text to be scrubbed is parsed word by word. All doublets that match "safe phrases" (ie, entries in the approved doublet list) are retained. The approved doublet list can be extracted from a nomenclature that comprehensively covers the knowledge domain appropriate for the research that is being conducted. For example, if I wanted to scrub medical records to search for the occurrences of diseases, then I might create a doublet list from a nomenclature of diseases. I would need to add general grammatical doublets that might not appear in a nomenclature, such as "to be, is not, for the, when the, to do", and so on. The grammatical doublets can be extracted from virtually any general text (eg, "Treasure Island," "Wuthering Heights," or "Data Simplification: Taming Information with Open Source Tools"). The combined list of "safe phrases" might have a length of about 100,000 doublets.

The Ruby script, scrubit.rb, uses an external doublet list, "doubdb.txt", and scrubs a single line of input text. Words removed from the input line are replaced by asterisks.11

#!/usr/local/bin/ruby
# load the approved ("safe") doublets into a hash for fast lookup
doub_file = File.open("c:/ftp/doubdb.txt", "r")
doub_hash = {}
doub_file.each_line{|line| line.chomp!; doub_hash[line] = " "}
doub_file.close
puts "What would you like to scrub?"
linearray = gets.chomp.downcase.split
arraysize = linearray.length - 2
lastword = "*"
for arrayindex in (0 .. arraysize)
  # examine each consecutive pair of words in the input line
  doublet = linearray[arrayindex] + " " + linearray[arrayindex+1]
  if doub_hash.key?(doublet)
    # approved doublet: print its first word and hold its second word
    print " " + linearray[arrayindex]
    lastword = " " + linearray[arrayindex+1]
  else
    # unapproved doublet: the held word is replaced by an asterisk
    print lastword
    lastword = " *"
  end
  # print the held word when the end of the line is reached
  if arrayindex == arraysize
    print lastword
  end
end
exit

Examples of scrubbed text output are shown here11:

Basal cell carcinoma, margins involved

Scrubbed text…. basal cell carcinoma margins involved

Mr Brown has a basal cell carcinoma

Scrubbed text…. * * has a basal cell carcinoma

Mr. Brown was born on Tuesday, March 14, 1985

Scrubbed text…. * * * * * * * * *

The doctor killed the patient

Scrubbed text…. * * * * *

Reviewing the output, we see that private or objectionable words and phrases, such as "Mr. Brown", "Tuesday, March 14, 1985", and "The doctor killed the patient", were all removed.

The simple script has two main strengths:

1. The output is guaranteed to exclude all phrases other than those composed of doublets from the approved list.

2. The script is fast, executing thousands of times faster than rule-based scrubbing methods.5,10,12

Readers who would like to modify this script to scrub their own textual records can build their own "safe phrase" list using methods provided in Open Source Tools for Chapter 3, "Doublet lists."

Whether confidential data can be adequately deidentified and scrubbed is a highly controversial subject.13 James Joyce is credited with saying that "there are two sides to every argument; unfortunately, I can only occupy one of them." As for myself, I have sided with the data scientists who believe that any text can be rendered deidentified and free of objectionable language, with the use of short scripts that employ simple algorithms. The method provided here removes everything from text, with the exception of pre-approved "safe" doublets. Hence, its sensitivity is only limited by the choice of safe phrases included in the doublet list. It is regrettable that scientists today refuse to share their data on the grounds of patient confidentiality. As an example, here is a quotation from a paper published in 2015 by NIH authors in the Public Library of Science (PLOS): "Data Availability Statement: All data underlying the findings are available for general research use to applicants whose data access request is approved by the National Cancer Institute dbGaP Data Access Committee (dbGaP accession number phs000720). Because of confidentiality issues associated with human subject data, they cannot be made available without restriction."14 Tsk tsk tsk. Confidential data can, in fact, be rendered harmless through computational methods. The negative consequences caused by restricting public access to scientific data are discussed in Section 8.4, "Verification, Validation, and Reanalysis."

5.5 Data Encryption and Authentication

If you think technology can solve your security problems, then you don't understand the problems and you don't understand the technology.

Bruce Schneier

On May 3, 2006, a laptop computer was stolen from the home of a U.S. Veterans Affairs data analyst. On the computer and its external drive were the names, birthdates, and Social Security numbers of 26.5 million soldiers and veterans. By the end of June, the laptop was recovered. Fortunately, there was no reason to believe that the contained data had ever been accessed. Nonetheless, the 26.5 million potential victims of identity theft suffered sufficient emotional distress to justify launching a class action suit against the Veterans Administration. Three years later, the Veterans Administration agreed to pay a lump sum of $20 million to the plaintiffs.15

The episode opens a flood of questions:

1. Is it customary for employees to bring confidential information home? Apparently, government staff just can't help themselves. The problem extends to the top agent in the top security agency in the U.S. While he was the CIA Director, John Deutch breached his own security protocols by bringing sensitive CIA information to an unclassified computer at his home.16

2. Is confidential information typically bundled into a neat, no-nonsense file from which all of the information pertaining to millions of individuals can be downloaded? Apparently, all the high-tech jargon thrown around concerning encryption algorithms and security protocols just never trickles down to front-line staff.

3. Is there any way of really knowing when a confidential file has been stolen? The thing about electronic data is that it can be copied perfectly and in secret. A database with millions of records can be downloaded in a few moments, without the victim knowing that the theft has occurred.

At the U.S. National Institutes of Health (NIH), I was involved in a project in which sensitive files were to be shared with university-based investigators. The investigators were approved for viewing the data contained in the files, but we were concerned that once the data left our hands, the files could be stolen in transit or from the investigator's site. My suggestion was to devise an encryption protocol, whereby the files would leave the NIH encrypted and remain encrypted at the investigator's site. The investigator could open and view the files, using a key that we would provide. Additional data would be sent with new keys, ensuring that if a key fell into the wrong hands, the damage would be limited to one version of one file. I also suggested that the files could be easily scrubbed and deidentified, so that the data in the files contained no private information and nothing to link records to individual persons. I was satisfied that the system, when implemented, would render the data harmless beyond any reasonable concern. My assurances did not allay the fear that someone with sufficient resources might conceivably gain access to some fragment of data that they were not entitled to see. The fact that the data had been thoroughly scrubbed, and had no monetary value, was deemed irrelevant. The solution that was finally adopted was simple, but worse than useless. To my consternation, the decision was made to withhold the data, until such time as a perfect system for safe data exchange could be implemented.

Let's be practical. Nearly everyone I know has confidential information on their computers. Often, this information resides in a few very private files. If those files fell into the hands of the wrong people, the results would be calamitous. For myself, I encrypt my sensitive files. When I need to work with those files, I decrypt them. When I'm finished working with them, I encrypt them again. These files are important to me, so I keep copies of the encrypted files on thumb drives and on an external server. I don't care if my thumb drives are lost or stolen. I don't care if a hacker gets access to the server that stores my files. The files are encrypted, and only I know how to decrypt them.

Anyone in the data sciences will tell you that it is important to encrypt your data files, particularly when you are transferring files via the Internet. Very few data scientists follow their own advice. Scientists, despite what you may believe, are not a particularly disciplined group of individuals. Few scientists get into the habit of encrypting their files. Perhaps they perceive the process as being too complex.

Here are the general desirable features for simple encryption systems that will satisfy most needs. If you have really important data, the kind that could hurt yourself or others if the data were to fall into the wrong hands, then you should totally disregard the advice that follows.

1. Save yourself a lot of grief by settling for a level of security that is appropriate and reasonable for your own needs. Don't use a bank vault when a padlock will suffice.

2. Avail yourself of no-cost solutions. Some of the finest encryption algorithms and their implementations are publicly available. For example, AES (Advanced Encryption Standard) was established by the U.S. National Institute of Standards and Technology and has been adopted by the U.S. Government and by organizations throughout the world. Methods for using AES and other encryption protocols are found in Open Source Tools for this chapter, "Encryption and decryption with OpenSSL."

3. The likelihood that you will lose your passwords is much higher than the likelihood that someone will steal your passwords. Develop a good system for passkey management that is suited to your own needs.

4. The security of the system should not be based on hiding your encrypted files or keeping the encryption algorithm secret. The greatest value of modern encryption protocols is that it makes no difference whether anyone steals or copies your encrypted files or learns your encryption algorithm.

5. File encryption and decryption should be computationally fast. Fast, open source protocols are readily available.

6. File encryption should be done automatically, as part of some computer routine (eg, a backup routine), or as a cron job (ie, a process that occurs at a predetermined time).

7. You need not be a stickler for protocol. You can use the same passkey over and over again if your level of concern for intruders is low, or if you do not value the confidentiality of your data very highly.

8. You should be able to batch-encrypt and batch-decrypt any number of files all at once (ie, from a command loop within a script), and you should be able to combine encryption with other file maintenance activities. For example, you should be able to implement a simple script that loops through every file in a directory or a directory tree (ie, all the files in all of the subdirectories under the directory) all at once, adding file header and metadata information into the file, scrubbing data as appropriate, calculating a one-way hash (ie, message digest) of the finished file, and producing an encrypted file output (a sketch of such a routine follows this list).

9. You should never implement an encryption system that is more complex than you can understand.17
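
As a sketch of the batch routine described in item 8, the following Python fragment walks a directory tree, computes a one-way hash (ie, a message digest) of every file, and calls OpenSSL to produce an encrypted copy; the directory name and the passkey are hypothetical:

#!/usr/local/bin/python
import os, hashlib
for dirpath, dirnames, filenames in os.walk("c:/ftp/data"):
    for filename in filenames:
        in_file = os.path.join(dirpath, filename)
        # compute the one-way hash (message digest) of the file
        digest = hashlib.sha256(open(in_file, "rb").read()).hexdigest()
        print(in_file, digest)
        # call OpenSSL to produce the encrypted output file
        cmdstring = "openssl aes128 -in \"" + in_file + "\" -out \"" + in_file + ".enc\" -pass pass:z4u7w28"
        os.system(cmdstring)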

Your data may be important to you and to a few of your colleagues, but the remainder of the world looks upon your output with glazed eyes. If you are the type of person who would protect your valuables with a padlock rather than a safety deposit box, then you should probably be thinking less about encryption strength and more about encryption interoperability. Ask yourself: Will the protocols that I use today be widely available, platform-independent, and vendor-independent protocols 5, 10, or 50 years from now? Will I always be able to decrypt my encrypted files?

For the encryptically lazy, a simple ROT13 protocol may be all you need. Rot_13 is a protocol that shifts each letter of the input text halfway around the alphabet. For example, the letter "a", ASCII value 97, is replaced by the letter "n", ASCII value 110 (see Open Source Tools for Chapter 2, ASCII).

#!/usr/bin/python
import codecs
# encode a string with rot_13; encoding a second time restores the original
print('abCdeFgHijKlM')
output = codecs.encode('abCdeFgHijKlM', 'rot_13')
print(output)
print(codecs.encode(output, 'rot_13'))
print('hello world')
output = codecs.encode('hello world', 'rot_13')
print(output)
print(codecs.encode(output, 'rot_13'))
exit

Here is the output of the rot_13.py script. Each input string is printed, followed by its Rot_13 encryption, followed by the Rot_13 encryption of that encryption, which restores the original string:

c:\ftp>rot_13.py

abCdeFgHijKlM

noPqrStUvwXyZ

abCdeFgHijKlM

hello world

uryyb jbeyq

hello world

What happened here? The string "uryyb jbeyq" is the result of rot_13 operating on the input string "hello world". Notice that the space character is untouched by the algorithm. The Python implementation of rot_13 only operates on the 26 letters of the alphabet (ie, a-z and the uppercase equivalents, A-Z). All other characters are passed through unchanged.

The next line of output is "hello world", and this represents the rot_13 algorithm operating on the encoded string, "uryyb jbeyq". You can see that repeating the Rot_13 returns the original string.

Using Perl's transliteration operator, "tr", any string can be converted to ROT13 in one line of code.

#!/usr/local/bin/perl

$text = "abCdeFgHijKlM and hello world";

$text =~ tr/A-Za-z/N-ZA-Mn-za-m/;

print $text;

print " ";

$text =~ tr/A-Za-z/N-ZA-Mn-za-m/;

print $text;

exit;

Here's the output of the rot_13.pl script:

c:\ftp>rot_13.pl

noPqrStUvwXyZ naq uryyb jbeyq

abCdeFgHijKlM and hello world

Of course, you will want to encrypt multiple whole files all at once. Here is a Python script, rot_13_dir.py, that will encrypt all the text files in a directory. Only the 26 letters of the alphabet (upper and lower case) are transformed; every other character (eg, digits, spaces, underscores, punctuation) is written to the output unchanged.

#!/usr/bin/python
import sys, os, re, codecs
current_directory = os.getcwd()
filelist = os.listdir(current_directory)
pattern = re.compile("\.txt$")
ascii_characters = re.compile("[a-zA-Z0-9 _]")
for in_file_name in filelist:
    if pattern.search(in_file_name):
        out_file_name = pattern.sub('.rot', in_file_name)
        print(out_file_name)
        out_file_holder = open(out_file_name, 'w')
        with open(in_file_name) as in_file_holder:
            while True:
                character = in_file_holder.read(1)
                if not character:
                    break
                if not ascii_characters.search(character):
                    out_file_holder.write(character)
                else:
                    rot_13_of_character = codecs.encode(character, 'rot_13')
                    out_file_holder.write(rot_13_of_character)
    else:
        continue
exit

For serious encryption, you will want to step up to OpenSSL. OpenSSL is an open-source collection of encryption protocols and message digest protocols (ie, protocols that yield one-way hashes). Encryption algorithms available through OpenSSL include RSA, DES, and AES. With system calls to OpenSSL, your scripts can encrypt or decrypt thousands of files, all at once. Here is a simple Python script, aes.py, that encrypts a list of text files with the AES standard encryption algorithm (see Glossary item, AES).

#!/usr/local/bin/python
import sys, os, re
filelist = ['diener.txt', 'simplify.txt', 're-ana.txt', 'phenocop.txt', 'mystery.txt', 'disaster.txt', 'factnote.txt', 'perlbig.txt', 'referen.txt', 'create.txt', 'exploreo.txt']
pattern = re.compile("txt")
for filename in filelist:
    out_filename = pattern.sub('enc', filename)
    out_filename = "f:\\" + out_filename
    print(out_filename)
    cmdstring = "openssl aes128 -in " + filename + " -out " + out_filename + " -pass pass:z4u7w28"
    os.system(cmdstring)
exit

In Open Source Tools for this chapter, the OpenSSL installation, documentation, and protocols are described, and implementations in Perl, Python, and Ruby are demonstrated.

5.6 Timestamps, Signatures, and Event Identifiers

Time is what keeps everything from happening at once.

Ray Cummings in his 1922 novel, "The Girl in the Golden Atom"

Consider the following assertions:

Alexander Goodboy, 34 inches height

Alexander Goodboy, 42 inches height

Alexander Goodboy, 46 inches height

Alexander Goodboy, 52 inches height

At first glance, these assertions seem contradictory. How can Alexander Goodboy be 34, 42, 46, and 52 inches tall? The confusion is lifted when we add some timing information to the assertions:

Alexander Goodboy, age 3 years, 34 inches height

Alexander Goodboy, age 5 years, 42 inches height

Alexander Goodboy, age 7 years, 46 inches height

Alexander Goodboy, age 9 years, 52 inches height

All events, measurements, and transactions occur at a particular time, and it is essential to annotate data objects with their moment of creation and with every moment when additional data is added to the data object (ie, event times).18 It is best to think of data objects as chronicles of a temporal sequence of immutable versions of the object (see Glossary item, Immutability). In the case of Alexander Goodboy, the boy changes in height as he grows, but each annotated version of Alexander Goodboy (ie, Alexander Goodboy, age 3 years, height 34 inches) is eternal and immutable.

Timestamps, when used consistently, achieve the impossible. They allow data managers to modify or correct data without violating data immutability (ie, without tampering with history and without hiding the truth). How might this be done?

Data object -> Newspaper headlines:

"Dewey Defeats Truman" timestamp: November 3,1948, 6:00 AM

"Dewey Defeats Truman" (modification) timestamp: November 3,1948, 10:00 AM

"Truman Defeats Dewey" timestamp: November 3, 1948, 10:01 AM

The Chicago Daily Tribune ran an infamous banner declaring Thomas E. Dewey as victor over Harry S. Truman in the 1948 U.S. presidential election. History is immutable, and their error will live forever. Nonetheless, Dewey did not defeat Truman. To restore order to the universe, we need to do the following:

1. Timestamp the original (erroneous) headline.

2. Indicate that a modification was made and timestamp the event.

3. Produce corrected data and timestamp the corrected data.

4. Save all data assertions forever (original, modification, and new).

5. Ensure that the data assertions support introspection (discussed in Section 7.2, Introspection and Reflection).
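
To make the protocol concrete, here is a minimal Python sketch of the append-only strategy; the filename and the assertions are invented for the example. Every assertion is written with its own epoch timestamp, and nothing is ever erased or overwritten:

#!/usr/bin/python
import time
# append-only record: each assertion gets its own timestamp, and nothing is erased
logfile = open("headline_log.txt", "a")
for assertion in ['"Dewey Defeats Truman"', '(modification)', '"Truman Defeats Dewey"']:
    logfile.write(str(time.time()) + " " + assertion + "\n")
logfile.close()
exit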

Timestamping is nothing new. Ancient scribes were fastidious timestampers. It would be an unusual Sumerian, Egyptian, or Mayan document that lacked an inscribed date. In contrast, it is easy to find modern, web-based news reports that lack any clue to the date that the web page was created. Likewise, it is a shameful fact that most spreadsheet data lacks timestamps for individual data cells. Data sets that lack timestamps, unique identifiers, and metadata have limited value to anyone other than the individual who created the data and who happens to have personal knowledge of how the data was created and what it means.

Fortunately, all computers have an internal clock. This means that all computer events can be timestamped. Most programming languages have a method for generating the epoch time, ie, the number of seconds that have elapsed since a particular moment in time. On most systems, the epoch is the first second of January 1, 1970. Perl, Python, and Ruby have methods for producing epoch time. For trivia's sake, we must observe that the UUID timestamp is measured from an earlier epoch: the time elapsed (counted in 100-nanosecond intervals) since the first second of Friday, October 15, 1582 (see Section 5.1, "Unique Identifiers"). This moment marks the beginning of the Gregorian calendar. The end of the Julian calendar occurred on October 4, 1582. The 10 dates intervening, from the end of the Julian calendar to the start of the Gregorian calendar, are lost somewhere in time and space.

Here is a Perl command line generating epoch time:

c:\ftp>perl -e "print(time())"

1442353564

From Python's interactive environment:

c:\ftp>python

>>> import time

>>> print(time.time())

1442353742.456994

From Ruby's interactive environment:

c:\ftp>irb

irb(main):001:0> "%10.9f" % Time.now.to_f

=> "1442354071.895107031"

Perl has a built-in gmtime (Greenwich Mean Time) function that produces an array of time-related values, which can be parsed and formatted in any desired format.

Here is the Perl script, gmt.pl, that generates the date in the American style (ie, month, day, year):

#!/usr/bin/perl

($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime();

$year = $year + 1900;

$mon = $mon + 1;

print "Americanized GMT date is: $mon/$mday/$year ";

exit;

Here is the output of the gmt.pl script:

c:\ftp>gmt.pl

Americanized GMT date is: 9/16/2015

It is very important to understand that country-specific styles for representing the date are a nightmare for data scientists. As an example, consider "2/4/97". This date signifies February 4, 1997 in America, and April 2, 1997 in Great Britain and much of the rest of the world. From the string alone, there is basically no way of knowing with certainty which date is intended.
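
The ambiguity is easy to demonstrate. In the following Python fragment, the same string parses, without complaint, into two different calendar dates, depending on which format the programmer happens to assume:

#!/usr/bin/python
from datetime import datetime
# the same string yields two different dates, depending on the assumed convention
print(datetime.strptime("2/4/97", "%m/%d/%y"))  # 1997-02-04 00:00:00 (American)
print(datetime.strptime("2/4/97", "%d/%m/%y"))  # 1997-04-02 00:00:00 (British)
exit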

It is not surprising that an international standard, the ISO-8601, has been created for representing date and time.19

The international format for date and time is: YYYY-MM-DD hh:mm:ss

The value "hh" is the number of complete hours that have passed since midnight. The upper value of hh is 24 (midnight). If hh = 24, then the minute and second values must be zero.

An example of an ISO-8601-compliant date and time is:

1995-02-04 22:45:00

An alternate form, likewise ISO-8601-compliant, is:

1995-02-04T22:45:00Z

In the alternate form, a "T" replaces the space left between the date and the time, indicating that time follows date. A "Z" is appended to the string indicating that the time and date are computed for UTC (Coordinated Universal Time, formerly known as Greenwich Mean Time, and popularly known as Zulu time, hence the "Z").

Here is a short Perl script, format_time.pl, that produces the date and time, in American style, and in ISO-8601-compliant forms:

#!/usr/bin/perl

($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = gmtime();

$year = $year + 1900;

$mon = substr(("000" . ($mon+1)), -2, 2);

$mday = substr(("000" . $mday), -2, 2);

$hour = substr(("000" . $hour), -2, 2);

$min = substr(("000" . $min), -2, 2);

$sec = substr(("000" . $sec), -2, 2);

print "Americanized time is: $mday/$wday/$year ";

print "ISO8601 time is:$year-$mon-$mday $hour:$min:$sec ";

print "ISO8601 time is:$year-$mon-${mday}T$hour:$min:${sec}Z (alternate form)";

exit;

Here is the output of the format_time.pl script.


c:\ftp>format_time.pl

Americanized time is: 09/16/2015

ISO8601 time is:2015-09-16 12:31:41

ISO8601 time is:2015-09-16T12:31:41Z (alternate form)

Here is a Python script, format_time.py, that generates the date and time, compliant with ISO-8601.

#!/usr/bin/python

import time, datetime

timenow = time.time()

print(datetime.datetime.fromtimestamp(timenow).strftime('%Y-%m-%d %H:%M:%S'))

exit

Here is the output of the format_time.py script:

c:\ftp>format_time.py

2015-09-16 07:44:09

It is sometimes necessary to establish, beyond doubt, that a timestamp is accurate and has not been modified. Through the centuries, a great many protocols have been devised to prove that a timestamp is trustworthy. A popular method employed in the 20th century involved creating some inscrutable text extracted from a document (eg, the character sequence consisting of the first letter of each line in the document) and sending the sequence to a newspaper for publication in the classifieds section. Anyone in possession of the document could generate the same sequence from the document and find that it had been published in the newspaper on the date specified. Hence, the document must have existed on the day of publication. As more computer-savvy individuals became familiar with methods for producing one-way hashes from documents, these hashes became the digest sequences of choice for trusted timestamp protocols (see Glossary item, Message digest). Today, newspapers are seldom used to establish trusted timestamps. More commonly, a message digest of a confidential document is sent to a timestamp authority that adds a date to the digest and returns a message, encrypted with the timestamp authority's private key, containing the original one-way hash plus the trusted date. The received message can be decrypted with the timestamp authority's public key to reveal the date/time and the message digest that is unique for the original document. The trusted timestamp process may seem like a lot of work, but regular users of these services can routinely process hundreds of documents in seconds.
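
Here is a minimal client-side sketch of the idea, in Python; the filename is invented for the example. The script computes a one-way hash of a document and pairs the hash with the current time. In a real protocol, the pairing would be performed and signed by the timestamp authority, using the authority's private key; the document itself never leaves your computer:

#!/usr/bin/python
import hashlib, time
# only the message digest, never the confidential document, is sent to the authority
digest = hashlib.sha256(open("confidential.txt", "rb").read()).hexdigest()
# a timestamp authority would bind the digest to a trusted date and sign the result
print(digest + " " + time.asctime())
exit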

Open Source Tools

It's better to wait for a productive programmer to become available than it is to wait for the first available programmer to become productive.

Steve McConnell

Pseudorandom Number Generators

It is not easy for computers to produce an endless collection of random numbers. Eventually, algorithms will cycle through their available variations and begin to repeat themselves, producing the same set of "random" numbers, in the same order; a phenomenon referred to as the generator's period. Because algorithms that produce seemingly random numbers are imperfect, they are known as pseudorandom number generators.

The Mersenne Twister algorithm is used as the default in most current programming languages to generate pseudorandom numbers. Languages using the Mersenne Twister include R, Python, Ruby, the GNU Multiple Precision Arithmetic Library, and the GNU Scientific Library. An implementation of the Mersenne Twister is available for Perl as an external module; specifically, Math::Random::MT::Perl. The Mersenne Twister has an extremely long period, and it performs well on most of the tests that mathematicians have devised to test randomness.
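
Pseudorandomness has a practical benefit: a generator that is seeded with the same value will reproduce exactly the same sequence, which makes "random" analyses repeatable. Here is a short Python demonstration:

#!/usr/bin/python
import random
# seeding the Mersenne Twister makes the "random" sequence reproducible
random.seed(42)
first_run = [random.random() for i in range(3)]
random.seed(42)
second_run = [random.random() for i in range(3)]
print(first_run == second_run)  # True; the same seed yields the same sequence
exit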

Just to demonstrate how simple it is to deploy random numbers, here is a one-line Ruby command, entered on the interactive "irb(main):001:0>" prompt, that calls a pseudorandom number generator, rand(), 10 times.

irb(main):001:0> (1..10).each{puts rand()}

output:

0.6210719721375545

0.8275281308969118

0.5221605121682973

0.4579032986235061

0.3897775291626894

0.1859092284180266

0.9087949176336569

0.44303624386264195

0.514384506264992

0.037523700988150055

An equivalent Python snippet is:

import random
for iterations in range(10):
    print(random.uniform(0, 1))

An equivalent Perl snippet is:

for (0..9)
  {
  print rand() . "\n";
  }

Here is a short Python script, random_filenames.py, that uses a pseudorandom generator to produce a filename composed of random characters:

#!/usr/bin/python
import random
# 12 random lowercase letters: the first 8 for the root name, 3 of the rest for the extension
filename = [chr(random.randint(97, 122)) for i in range(12)]
print(''.join(filename[0:8]) + "." + ''.join(filename[9:12]))
exit

Here is the output of the random_filenames.py script:

c:\ftp\py>random_filenames.py

tjqimddr.mjb

Random number generators are among the more useful programming utilities available to programmers. With random number generators, we can create unique identifiers, perform resampling statistics (see Sections 8.2 and 8.3), and produce reliable non-analytic solutions to complex problems in the field of probability and combinatorics (see Glossary item, Combinatorics).
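
As a small taste of resampling, here is a hedged Python sketch that estimates an interval for the mean of a data set by repeatedly resampling, with replacement, from the data; the data values are invented for the example:

#!/usr/bin/python
import random
# bootstrap sketch: resample the data with replacement and collect the means
data = [2, 4, 4, 5, 7, 9, 12, 12, 13, 21]
means = []
for trial in range(1000):
    resample = [random.choice(data) for i in range(len(data))]
    means.append(sum(resample) / float(len(resample)))
means.sort()
print(str(means[25]) + " " + str(means[975]))  # a rough 95% interval for the mean
exit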

Just for fun, let's look at a short script, random_string.py, that uses Python's random number generator to test the assertion that random strings are incompressible (eg, cannot be compressed with gzip, a popular open-source compression utility).

First, let's build a file, composed of 160,000 randomly chosen ASCII characters, using a few lines of Python.

#!/usr/local/bin/python
from random import randint
outfile = open("random_1.raw", "wb")
for n in range(160000):
    c = chr(randint(0, 255))
    outfile.write(c)
exit

The output file is random_1.raw, and its size is 160,000 bytes. Now, let's try to compress this file using the gzip utility. This produces the gzipped file, random_1.gz, of size 160,077 bytes.

Not only did the gzip compression utility fail to reduce the size of random_1.raw, but the resulting file is actually slightly larger than the original. The reason is that compression utilities reduce files by replacing repetitive sequences with shorter sequences. A random file lacks highly repetitive sequences and cannot be condensed (see Glossary item, Randomness).
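
You can verify this behavior from within Python, using the zlib module, which implements the same compression method as gzip:

#!/usr/bin/python
import zlib
# a random file should not shrink under compression
data = open("random_1.raw", "rb").read()
print(len(data))                 # 160000 bytes, uncompressed
print(len(zlib.compress(data)))  # no smaller than the input
exit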

UUID

The UUID (universally unique identifier) is a protocol for creating unique strings that can be used as permanent identifiers for data objects.2 The idea here is that you create a UUID for a data object, and you take appropriate steps to permanently associate the data object with its assigned UUID. The most important benefit of the UUID is that the protocol can be implemented without the administrative overhead of a central registry. Maintaining the object-identifier association, after it is created, is the job of the data creator.

A typical UUID might look like this5:

4c108407-0570-4afb-9463-2831bcc6e4a4

Like most standard protocols, the UUID has been subject to modifications. Within the "standard", a UUID can be created with a pseudorandom number generator, or with timing information (eg, the moment when the UUID was generated), or with a one-way hash value produced from data contained in the data object (see Glossary items, Timestamp, One-way hash). One implementation of UUID attaches a timestamp to the random number. Implementers of UUIDs must keep in mind that uniqueness is achieved in practice but not in theory. It is possible, though remarkably unlikely, that two data objects could be assigned the same UUID. The mathematics of UUID collisions has been studied, and the chances of double assignments are incredibly remote. Nonetheless, the use of a timestamp, indicating the moment when the UUID was assigned to the data object, decreases the chance of a collision even further. The likelihood that the same 128-bit pseudorandom numbers might be assigned to different data objects at the exact same moment in time is essentially zero.

Because UUIDs have become popular, there are now many simple and convenient ways of creating UUIDs as needed. For Linux users, uuidgen is a built-in command line utility. The same utility, as uuidgen.exe, is available to Windows users via its inclusion in the Cygwin distribution (Open Source Tools for Chapter 1).

c:\cygwin64\bin>uuidgen.exe

The command line generates an output that you can attach to a data object (see Glossary item, Data object):

9ee64643-2ff2-4cd1-ad31-ab59f933a276

You can access uuidgen.exe through a system call from a Perl, Python, or Ruby script, directed at the utility residing in its Cygwin subdirectory:

#!/usr/bin/perl

system("c:\cygwin64\bin\uuidgen.exe");

exit;

A UUID module is included in the standard Python distribution and can be called directly from a script. In this case, the UUID protocol chosen is UUID-4, which generates a UUID using a random number generator.

#!/usr/local/bin/python

import uuid

print(uuid.uuid4())

exit
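
The same uuid module implements the other variants mentioned above. In the following sketch, uuid1() incorporates the timestamp and the computer's network address, and uuid5() derives the UUID from a one-way hash of a name within a namespace; the domain name is invented for the example:

#!/usr/local/bin/python
import uuid
print(uuid.uuid1())  # timestamp plus node address
print(uuid.uuid5(uuid.NAMESPACE_DNS, "example.org"))  # one-way hash of a name
exit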

In Ruby, UUIDs can be generated using the GUID module, available as a downloadable gem, and installed into your computer's Ruby environment using the following command (see Glossary item, Ruby gem)5:

gem install guid

Easier yet, if you have Ruby 1.9 or later, you can create UUIDs with the built-in SecureRandom module:

#!/usr/bin/ruby

require 'securerandom'

puts SecureRandom.uuid

exit

Encryption and Decryption with OpenSSL

OpenSSL is an open-source collection of encryption protocols and message digest protocols (ie, protocols that yield one-way hashes). OpenSSL comes with an Apache-style open-source license. This useful set of utilities, with implementations for various operating systems, is available at no cost from: https://www.openssl.org/related/binaries.html

Encryption algorithms and suites of cipher strategies available through OpenSSL include: RSA, DH (numerous protocols), DSS, ECDH, TLS, AES (including 128 and 256 bit keys), CAMELLIA, DES (including triple DES), RC4, IDEA, SEED, PSK, and numerous GOST protocols (see Glossary item, AES). In addition, implementations of popular one-way hash algorithms are provided (ie, MD5 and SHA, including SHA384).

For Windows users, the OpenSSL download contains three files that are necessary for file encryption: openssl.exe, ssleay32.dll, and libeay32.dll. If these three files are located in your current directory, you can encrypt any file, directly from the command prompt:

c:\>openssl aes128 -in public.txt -out secret.aes -pass pass:z4u7w28

The command line provides a password, "z4u7w28", to the aes128 encryption algorithm, which takes the file public.txt and produces an encrypted output file, secret.aes.

The same command line could have been launched from a Perl script:

#!/usr/bin/perl

system("openssl aes128 -in public.txt -out secret.aes -pass pass:z4u7w28");

exit;

Here's a Perl script that changes the directory to the location of the OpenSSL program suite on my computer. The script encrypts the public.txt file, using the somewhat outmoded Data Encryption Standard (DES) with password "test123".

#!/usr/bin/perl

chdir "c:\ftp\openssl-1.0.1l-x64\_86-win64";

system("openssl des -in c:\ftp\public.txt -out c:\ftp\secret.des -pass pass:test123");

exit;

Of course, once you've encrypted a file, you will need a decryption method. Here's a short Perl script that decrypts secret.aes, the result of encrypting public.txt with the AES algorithm:

#!/usr/bin/perl

system("openssl aes128 -d -in secret.aes -out decrypted.txt -pass pass:z4u7w28");

exit;

We see that decryption involves inserting the "-d" option into the command line. AES is an example of a symmetric encryption algorithm, which means that the encryption password and the decryption password are identical.

Decryption works much the same way for the DES algorithm. In the following script, we change the directory to the location where the OpenSSL suite resides:

#!/usr/bin/perl

chdir "c:\ftp\openssl-1.0.1l-x64\_86-win64";

system("openssl des -d -in c:\ftp\secret.des -out c:\ftp\secret.txt -pass pass:test123");

exit;

Encrypting and decrypting individual strings, files, groups of files, and directory contents is extremely simple and can provide a level of security that is likely to be commensurate with your personal needs.

One-Way Hash Implementations

One-way hashes are mathematical algorithms that operate on a string of any size, including large binary files, to produce a short, fixed-length, seemingly random sequence of characters that is specific for the input string. If a single byte is changed anywhere in the input string, the resulting one-way hash output string will be radically altered. Furthermore, the original string cannot be computed from the one-way hash, even when the algorithm that produced the one-way hash is known. One-way hash protocols have many practical uses in the field of information science (see Glossary items Checksum, HMAC, Digest, Message digest, Check digit, Authentication). It is very easy to implement one-way hashes, and most programming languages and operating systems come bundled with one or more implementations of one-way hash algorithms. The two most popular one-way hash algorithms are md5 (message digest version 5) and SHA (Secure Hash Algorithm).
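
Python's hashlib module demonstrates these properties in a few lines. Notice that the two input strings below differ by a single character, yet their hash values will bear no resemblance to one another:

#!/usr/bin/python
import hashlib
# a single-character change in the input produces a radically different digest
print(hashlib.md5(b"hello world").hexdigest())
print(hashlib.md5(b"hello worle").hexdigest())
exit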

Here we use Cygwin's own md5sum.exe utility on the command line to produce a one-way hash for an image file, named dash.png:

c:\ftp>c:\cygwin64\bin\md5sum.exe dash.png

Here is the output:

db50dc33800904ab5f4ac90597d7b4ea *dash.png

We could call the same command line from a Python script:

#!/usr/local/bin/python

import sys, os

os.system("c:/cygwin64/bin/md5sum.exe dash.png")

exit

From a Ruby script:

#!/usr/local/bin/ruby

system("c:/cygwin64/bin/md5sum.exe dash.png")

exit

From a Perl script:

#!/usr/local/bin/perl

system("c:/cygwin64/bin/md5sum.exe dash.png");

exit;

The output will always be the same, so long as the input file, dash.png, does not change:

db50dc33800904ab5f4ac90597d7b4ea *dash.png

OpenSSL contains several one-way hash implementations, including both md5 and several variants of SHA. Here is a system call to OpenSSL from Perl using a text file public.txt as input, using SHA as the algorithm, and sending the resulting hash to the file, hash.sha.

#!/usr/bin/perl

system("openssl dgst -sha public.txt > hash.sha");

exit;

The output is a file, hash.sha. Here are the contents of the hash.sha file:

SHA(public.txt)= af2f12663145770ac0cbd260e69675af6ac26417

Here is an equivalent script, in Ruby:

#!/usr/local/bin/ruby

system("openssl dgst -sha public.txt > hash.sha")

exit

Here is the output of the Ruby script:

SHA(public.txt)= af2f12663145770ac0cbd260e69675af6ac26417

Once more, notice that the output is the same whether we use Ruby or Perl, because both scripts produce a hash using the same SHA algorithm on the same file. We could have called the SHA algorithm from Python, generating the same output.
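
For the sake of completeness, here is the equivalent Python script:

#!/usr/local/bin/python
import os
os.system("openssl dgst -sha public.txt > hash.sha")
exit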

Here is a Python script, hash_dir.py, that produces an SHA hash for every file in a subdirectory, putting the collection of hash values into a file:

#!/usr/local/bin/python
import os
filelist = os.listdir(".")
outfile = open("hashes_collect.txt", "w")
for filename in filelist:
    cmdstring = "openssl dgst -sha %s" % (filename)
    cmdstring = cmdstring + " > hashes.sha"
    os.system(cmdstring)
    infile = open("hashes.sha", "r")
    getline = infile.readline()
    infile.close()
    outfile.write(getline)
outfile.close()
exit

The output file, hashes_collect.txt, has the following contents, listing the one-way hash value for each file that happened to reside in my current directory:

SHA(bunzip2.TXT)= 0547d31d7c675ae2239067611e6309dc8cb7e7db

SHA(googlez.TXT)= f96cea198ad1dd5617ac084a3d92c6107708c0ef

SHA(gunzipe.TXT)= d275bd87933f2322a97c70035aa2aa5f4c6088ac

SHA(gzip.TXT)= 0b323cb4555c8996c8100f8ad8259eec2538821b

SHA(hashes.sha)= f96cea198ad1dd5617ac084a3d92c6107708c0ef

SHA(hashes_collect.txt)= f96cea198ad1dd5617ac084a3d92c6107708c0ef

SHA(hash_dir.py)= 1db37524e54de40ff723fbc7e3ba20b18e651d48

SHA(JHSPZIP.TXT)= 32424d0d1fe75bedd5680205fbbc63bff4bb783a

SHA(libeay32.dll)= 647fae7916e8c4c45d0002fd6d2fc9c6877de085

SHA(mortzip.TXT)= 3168f52511c8289db7657b637156063c0e8c5646

SHA(Nonlinear_Science_FAQ.txt)= 6316b8531712ca2f0b192c1f662dbde446f958d9

SHA(openssl.exe)= 1cf6af2d3f720e0959d0ce49d6e4dad7a58092e8

SHA(pesonalized_blog.txt)= 25d208163e7924b8c10c7e9732d29383d61a22f1

SHA(ssleay32.dll)= 4889930b67feef5765d3aef3d1752db10ebced8f

Such files, containing lists of one-way hashes, can be used as authentication records (see Glossary item, Authentication).
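
Verification is the mirror image of record-keeping: recompute the digest of a received file and compare it against the recorded line. Here is a sketch in Python, using one of the files and hash values listed above:

#!/usr/local/bin/python
import os
# recompute the digest and compare it to the recorded authentication line
os.system("openssl dgst -sha bunzip2.TXT > check.sha")
recorded = "SHA(bunzip2.TXT)= 0547d31d7c675ae2239067611e6309dc8cb7e7db\n"
recomputed = open("check.sha", "r").readline()
print(recorded == recomputed)  # True if the file is unaltered
exit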

Steganography

You look at them every day; the ones that others create and the ones that you create to share with your friends or with the world. They're part of your life, and you would feel a deep sense of loss if you lost them. I'm referring to high-resolution digital images. We love them, but we give them more credit than they deserve. When you download a 16-megapixel image of your sister's lasagna, you can be certain that most of the pixel information is padded with so-called empty resolution, ie, pixel precision that is probably inaccurate and that certainly exceeds the eye's ability to meaningfully resolve. Most images in the megabyte size range can safely be reduced to the kilobyte size range without loss of visual information. Steganography is an encryption technique that takes advantage of the empty precision in pixel data by inserting secret text messages into otherwise useless data bits.

Steganography is one of several general techniques in which a message is hidden within another digital object. Steganography has been around for centuries, and was described as early as AD 1500 by Trithemius.20 Watermarking is closely related to steganography. Digital watermarking is a way of secretly insinuating the name of the owner or creator of a digital object into the object, as a mechanism of rights management (see Glossary item, Watermarking).21

Steghide is an open-source GNU license utility that invisibly embeds data in image or audio files. Windows and Linux versions are available for download from SourceForge at: http://steghide.sourceforge.net/download.php.

A Steghide manual is available at: http://steghide.sourceforge.net/documentation/manpage.php.

On my computer, the Steghide executables happen to be stored in the c:\ftp\steghide\steghide subdirectory. Hence, in the following code, this subdirectory will be the launch path.

Here is an example of a command line invocation of Steghide. Your chosen password can be inserted directly into the command line:

c:\ftp\steghide\steghide>steghide embed -cf c:\ftp\simplify\berman_author_photo.jpg -ef c:\ftp\berman_author_bio.txt -p hideme

The command line was launched from the subdirectory that holds the Steghide executable files on my computer. The command instructs Steghide to embed the text file berman_author_bio.txt into the image file berman_author_photo.jpg under the password "hideme".

That's all there is to it. The image file, containing a photo of myself, now contains an embedded text file with my short biography. I no longer need to keep track of both files. I can generate my biography file from my image file, but I've got to remember the password.
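
Recovering the hidden file is just as easy. The extraction command might look like the following, with -sf naming the stego file and -xf naming the output file (check the Steghide manual for the options supported by your version):

c:\ftp\steghide\steghide>steghide extract -sf c:\ftp\simplify\berman_author_photo.jpg -xf c:\ftp\berman_author_bio.txt -p hideme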

I could have called Steghide from a script. Here is an example of an equivalent Python script that invokes Steghide from a system call.

#!/usr/local/bin/python

import os

command_string = "steghide embed -cf c:/ftp/simplify/berman_author_photo.jpg -ef c:/ftp/berman_author_bio.txt -p hideme"

os.system(command_string)

exit

You can see how powerful this method can be. With a bit of tweaking, you can write a short script that uses the Steghide utility to embed a hidden text message in thousands of images, all at once. Anyone viewing those images would have no idea that they contained a hidden message unless you told them so.

Glossary

AES The Advanced Encryption Standard (AES) is the cryptographic standard endorsed by the U.S. government as a replacement for the old government standard, Data Encryption Standard (DES). In 2001, AES was chosen from among many different encryption protocols submitted in a cryptographic contest conducted by the U.S. National Institute of Standards and Technology. AES is also known as Rijndael, after its developer. It is a symmetric encryption standard, meaning that the same password used for encryption is also used for decryption. See Symmetric key.

Authentication A process for determining if the data object that is received (eg, document, file, image) is the data object that was intended to be received. The simplest authentication protocol involves one-way hash operations on the data that needs to be authenticated. Suppose you happen to know that a certain file named z.txt will be arriving via email and that this file has an MD5 hash of "uF7pBPGgxKtabA/2zYlscQ==". You receive the z.txt, and you perform an MD5 one-way hash operation on the file, as shown here:

#!/usr/bin/python
import base64
import hashlib  # hashlib replaces the long-deprecated md5 module
md5_object = hashlib.md5()
sample_file = open("z.txt", "rb")
string = sample_file.read()
sample_file.close()
md5_object.update(string)
md5_string = md5_object.digest()
print(base64.encodestring(md5_string))
exit

Let's assume that the output of the MD5 hash operation, performed on the z.txt file, is "uF7pBPGgxKtabA/2zYlscQ==". This would tell us that the received z.txt file is authentic (ie, it is the file that you were intended to receive), because any file tampering would have changed the MD5 hash. Additional implementations of one-way hashes are described in Open Source Tools for this chapter. The authentication process in this example does not tell you who sent the file, the time that the file was created, or anything about the validity of the contents of the file. These would require a protocol that included signature, timestamp, and data validation, in addition to authentication. In common usage, authentication protocols often include entity authentication (ie, some method by which the entity sending the file is verified). Consequently, authentication protocols are often confused with signature verification protocols. An ancient historical example serves to distinguish the concepts of authentication protocols and signature protocols. Since earliest recorded history, fingerprints were used as a method of authentication. When a scholar or artisan produced a product, he would press his thumb into the clay tablet, or the pot, or the wax seal closing a document. Anyone doubting the authenticity of the pot could ask the artisan for a thumbprint. If the new thumbprint matched the thumbprint on the tablet, pot, or document, then all knew that the person creating the new thumbprint and the person who had put his thumbprint into the object were the same individual. Hence, ancient pots were authenticated. Of course, this was not proof that the object was the creation of the person with the matching thumbprint. For all anyone knew, there may have been a hundred different pottery artisans, with one person pressing his thumb into every pot produced. You might argue that the thumbprint served as the signature of the artisan. In practical terms, no. The thumbprint by itself does not tell you whose print was used. Thumbprints could not be read, at least not in the same way as a written signature. The ancients needed to compare the pot's thumbprint against the thumbprint of the living person who made the print. When the person died, civilization was left with a bunch of pots with the same thumbprint, but without any certain way of knowing whose thumb produced them. In essence, because there was no ancient database that permanently associated thumbprints with individuals, the process of establishing the identity of the pot maker became very difficult once the artisan died. A good signature protocol permanently binds an authentication code to a unique entity (eg, a person). Today, we can find a fingerprint at the scene of a crime; we can find a matching signature in a database and link the fingerprint to one individual. Hence, in modern times, fingerprints are true "digital" signatures, no pun intended. Modern uses of fingerprints include keying (eg, opening locked devices based on an authenticated fingerprint), tracking (eg, establishing the path and whereabouts of an individual by following a trail of fingerprints or other identifiers), and body part identification (ie, identifying the remains of individuals recovered from mass graves or from the sites of catastrophic events based on fingerprint matches). Over the past decade, flaws in the vaunted process of fingerprint identification have been documented, and the improvement of the science of identification is an active area of investigation.22 See HMAC. 
See Digital signature.

Check digit A checksum that produces a single digit as output is referred to as a check digit. Some of the common identification codes in use today, such as ISBN numbers for books, come with a built-in check digit. Of course, when using a single digit as a check value, you can expect that some transmitted errors will escape the check, but the check digit is useful in systems wherein occasional mistakes are tolerated, wherein the purpose of the check digit is to find a specific type of error (eg, an error produced by a substitution in a single character or digit), and wherein the check digit itself is rarely transmitted in error. See Checksum.

Checksum An outdated term that is sometimes used synonymously with one-way hash or message digest. Checksums are performed on a string, block, or file yielding a short alphanumeric string intended to be specific for the input data. Ideally, if a single bit were to change anywhere within the input file, then the checksum for the input file would change drastically. Checksums, as the name implies, involve summing values (ie, typically weighted character values), to produce a sequence that can be calculated on a file before and after transmission. Most of the errors that were commonly introduced by poor transmission could be detected with checksums. Today, the old checksum algorithms have been largely replaced with one-way hash algorithms. A checksum that produces a single digit as output is referred to as a check digit. See Check digit. See One-way hash. See Message digest. See HMAC.

Combinatorics The analysis of complex data often involves combinatorics, ie, the evaluation, on some numeric level, of combinations of things. Often, combinatorics involves pairwise comparisons of all possible combinations of items. When the number of comparisons becomes large, as is the case with virtually all combinatoric problems involving large data sets, the computational effort becomes massive. For this reason, combinatorics research has become a subspecialty in applied mathematics and data science. There are four "hot" areas in combinatorics. The first involves building increasingly powerful computers capable of solving complex combinatoric problems. The second involves developing methods whereby combinatoric problems can be broken into smaller problems that can be distributed to many computers in order to provide relatively fast solutions to problems that could not otherwise be solved in any reasonable length of time. The third area of research involves developing new algorithms for solving combinatoric problems quickly and efficiently. The fourth area, perhaps the most promising area, involves developing innovative non-combinatoric solutions for traditionally combinatoric problems — a golden opportunity for experts in the field of data simplification.

Data object A data object is whatever is being described by the data. For example, if the data is "6 feet tall", then the data object is the person or thing to which "6 feet tall" applies. Minimally, a data object is a metadata/data pair, assigned to a unique identifier (ie, a triple). In practice, the most common data objects are simple data records, corresponding to a row in a spreadsheet or a line in a flat file. Data objects in object-oriented programming languages typically encapsulate several items of data, including an object name, an object unique identifier, multiple data/metadata pairs, and the name of the object's class. See Triple. See Identifier. See Metadata.

Data scrubbing A term that is very similar to data deidentification and is sometimes used improperly as a synonym for data deidentification. Data scrubbing refers to the removal of information from data records that is considered unwanted. This may include identifiers, private information, or any incriminating or otherwise objectionable language contained in data records, as well as any information deemed irrelevant to the purpose served by the record. See Deidentification.

Data versus datum The singular form of data is datum, but the word "datum" has virtually disappeared from the computer science literature. The word "data" has assumed both a singular and plural form. In its singular form, it is a collective noun that refers to a single aggregation of many data points. Hence, current usage would be "The data is enormous," rather than "These data are enormous."

Deidentification The process of removing all of the links in a data record that can connect the information in the record to an individual. This usually includes the record identifier, demographic information (eg, place of birth), personal information (eg, birthdate), biometrics (eg, fingerprints), and so on. The process of deidentification will vary based on the type of records examined. Deidentifying protocols exist wherein deidentified records can be reidentified, when necessary. See Reidentification. See Data scrubbing.

Deidentification versus anonymization Anonymization is a process by which all the links between an individual and the individual's data record are irreversibly removed. The difference between anonymization and deidentification is that anonymization is irreversible. Because anonymization is irreversible, the opportunities for verifying the quality of data are limited. For example, if someone suspects that samples have been switched in a data set, thus putting the results of the study into doubt, an anonymized set of data would afford no opportunity to resolve the problem by reidentifying the original samples. See Reidentification.

Digest As used herein, "digest" is equivalent to a one-way hash algorithm. The word "digest" also refers to the output string produced by a one-way hash algorithm. See Checksum. See One-way hash. See HMAC.

Digital signature As it is used in the field of data privacy, a digital signature is an alphanumeric sequence that could only have been produced by a private key owned by one particular person. Operationally, a message digest (eg, a one-way hash value) is produced from the document that is to be signed. The person "signing" the document encrypts the message digest using his or her private key and submits the document and the encrypted message digest to the person who intends to verify that the document has been signed. This person decrypts the encrypted message digest with the signer's public key (ie, the public key that complements the signer's private key) to produce the original one-way hash value. Next, a one-way hash is performed on the received document. If the resulting one-way hash is the same as the decrypted one-way hash, then several statements hold true: The document received is the same document as the document that had been "signed". The signer of the document had access to the private key that complemented the public key that was used to decrypt the encrypted one-way hash. The assumption here is that the signer was the only individual with access to the private key. Digital signature protocols, in general, have a private method for encrypting a hash and a public method for verifying the signature. Such protocols operate under the assumption that only one person can encrypt the hash for the message and that the name of that person is known; hence, the protocol establishes a verified signature. It should be emphasized that a digital signature is quite different from a written signature; the latter usually indicates that the signer wrote the document or somehow attests to the veracity of the document. The digital signature merely indicates that the document was received from a particular person, contingent on the assumption that the private key was available only to that person. To understand how a digital signature protocol may be maliciously deployed, imagine the following scenario: I contact you and tell you that I am Elvis Presley and would like you to have a copy of my public key plus a file that I have encrypted using my private key. You receive the file and the public key, and you use the public key to decrypt the file. You conclude that the file was indeed sent by Elvis Presley. You read the decrypted file and learn that Elvis advises you to invest all your money in a company that manufactures concrete guitars; which, of course, you do, because Elvis knows guitars. The problem here is that the signature was valid, but the valid signature was not authentic. See Authentication.

Gedanken Gedanken, German for "thoughts," refers to conceptual exercises that clarify a scientific question. In general, gedanken problems are esoteric and not suited to experimental validation. Einstein was fond of using gedanken experiments to develop his breakthroughs in theoretical physics.

Generalization Generalization is the process of extending relationships from individual objects to classes of objects. For example, when Isaac Newton observed the physical laws that applied to apples falling to the ground, he found a way to relate the acceleration of an object to its mass and to the force of gravity. His apple-centric observations applied to all objects and could be used to predict the orbit of the moon around the earth or the orbit of the earth around the sun. Newton generalized from the specific to the universal. Similarly, Darwin's observations on barnacles could be generalized to yield the theory of evolution, thus explaining the development of all terrestrial organisms. Science would be of little value if observed relationships among objects could not be generalized to classes of objects. See Science.

HMAC Hashed Message Authentication Code. When a one-way hash is employed in an authentication protocol, it is often referred to as an HMAC. See One-way hash. See Message digest. See Checksum.

HTML HyperText Markup Language is an ASCII-based set of formatting instructions for web pages. HTML formatting instructions, known as tags, are embedded in the document and double-bracketed, indicating the start point and end point for each instruction. Here is an example of an HTML tag instructing the web browser to display the word "Hello" in italics: <i>Hello</i>. All web browsers conforming to the HTML specification must contain software routines that recognize and implement the HTML instructions embedded within web documents. In addition to formatting instructions, HTML also includes linkage instructions, in which the web browsers must retrieve and display a listed web page, or a web resource, such as an image. The protocol whereby web browsers, following HTML instructions, retrieve web pages from other Internet sites, is known as HTTP (HyperText Transfer Protocol).

Identification The process of providing a data object with an identifier, or the process of distinguishing one data object from all other data objects on the basis of its associated identifier. See Identifier.

Identifier A string that is associated with a particular thing (eg, person, document, transaction, data object), and not associated with any other thing.23 Object identification usually involves permanently assigning a seemingly random sequence of numeric digits (0–9) and alphabet characters (a-z and A-Z) to a data object. A data object can be a specific piece of data (eg, a data record) or an abstraction, such as a class of objects or a number or a string or a variable. See Identification.

Immutability Permanent data that cannot be modified is said to be immutable. At first thought, it would seem that immutability is a ridiculous and impossible constraint. In the real world, mistakes are made, information changes, and the methods for describing information changes. This is all true, but the astute data manager knows how to accrue information into data objects without changing the pre-existing data. In practice, immutability is maintained by time-stamping all data and storing annotated data values with any and all subsequent time-stamped modifications. For a detailed explanation, see Section 5.6, "Timestamps, Signatures, and Event Identifiers."

Intellectual property Data, software, algorithms, and applications that are created by an entity capable of ownership (eg, humans, corporations, universities). The owner entity holds rights over the manner in which the intellectual property can be used and distributed. Protections for intellectual property may come in the form of copyrights, patents, and laws that apply to theft. Copyright applies to published information. Patents apply to novel processes and inventions. Certain types of intellectual property can only be protected by being secretive. For example, magic tricks cannot be copyrighted or patented; this is why magicians guard their intellectual property against theft. Intellectual property can be sold outright or used under a legal agreement (eg, license, contract, transfer agreement, royalty, usage fee, and so on). Intellectual property can also be shared freely, while retaining ownership (eg, open-source license, GNU license, FOSS license, Creative Commons license).

Introspection A method by which data objects can be interrogated to yield information about themselves (eg, properties, values, and class membership). Through introspection, the relationships among the data objects can be examined. Introspective methods are built into object-oriented languages. The data provided by introspection can be applied at run-time to modify a script's operation, a technique known as reflection. Specifically, any properties, methods, and encapsulated data of a data object can be used in the script to modify the script's run-time behavior. See Reflection.

Meaning In informatics, meaning is achieved when described data is bound to a unique identifier of a data object. "Claude Funston's height is 5 feet 11 inches," comes pretty close to being a meaningful statement. The statement contains data (5 feet 11 inches), and the data is described (height). The described data belongs to a unique object (Claude Funston). Ideally, the name "Claude Funston" should be provided with a unique identifier to distinguish one instance of Claude Funston from all the other persons who are named Claude Funston. The statement would also benefit from a formal system that ensures that the metadata makes sense (eg, What exactly is height, and does Claude Funston fall into a class of objects for which height is a property?) and that the data is appropriate (eg, Is 5 feet 11 inches an allowable measure of a person's height?). A statement with meaning does not need to be a true statement (eg, The height of Claude Funston was not 5 feet 11 inches when Claude Funston was an infant). See Semantics. See Triple. See RDF.

Message digest Within the context of this book, "message digest", "digest", "HMAC", and "one-way hash" are equivalent terms. See One-way hash. See HMAC.

Metadata The data that describes data. For example, a data element (also known as data point) may consist of the number "6". The metadata for the data may be the words "Height, in feet". A data element is useless without its metadata, and metadata is useless unless it adequately describes a data element. In XML, the metadata/data annotation comes in the form <metadata_tag>data</metadata_tag> and might look something like:

<weight_in_pounds>150</weight_in_pounds>

In spreadsheets, the data elements are the cells of the spreadsheet. The column headers are the metadata that describe the data values in the column's cells, and the row headers are the record numbers that uniquely identify each record (ie, each row of cells). See XML.

Namespace A namespace is the realm in which a metadata tag applies. The purpose of a namespace is to distinguish metadata tags that have the same name, but a different meaning. For example, within a single XML file, the metadata tag "date" may be used to signify a calendar date, or the fruit, or the social engagement. To avoid confusion, metadata terms are assigned a prefix that is associated with a web document that defines the term (ie, establishes the tag's namespace). In practical terms, a tag that can have different descriptive meanings in different contexts is provided with a prefix that links to a web document wherein the meaning of the tag, as it applies in the XML document, is specified. An example of namespace syntax is provided in Section 2.5.

National Patient Identifier Many countries employ a National Patient Identifier (NPI) system. In these cases, when a citizen receives treatment at any medical facility in the country, the transaction is recorded under the same permanent and unique identifier. Doing so enables the data collected on individuals from multiple hospitals to be merged. Hence, physicians can retrieve patient data that was collected anywhere in the nation. In countries with NPIs, data scientists have access to complete patient records and can perform health care studies that would be impossible to perform in countries that lack NPI systems. In the U.S., where a system of NPIs has not been adopted, there is a perception that such a system would constitute an invasion of privacy and would harm citizens. See Reconciliation.

Notation 3 Also called n3. A syntax for expressing assertions as triples (unique subject + metadata + data). Notation 3 expresses the same information as the more formal RDF syntax, but n3 is easier for humans to read.24 RDF and n3 are interconvertible and either one can be parsed and equivalently tokenized (ie, broken into elements that can be re-organized in a different format, such as a database record). See RDF. See Triple.

One-way hash A one-way hash is an algorithm that transforms one string into another string (a fixed-length sequence of seemingly random characters) in such a way that the original string cannot be calculated by operations on the one-way hash value (ie, the calculation is one-way only). One-way hash values can be calculated for any string, including a person's name, a document, or an image. For any given input string, the resultant one-way hash will always be the same. If a single byte of the input string is modified, the resulting one-way hash will be changed and will have a totally different sequence than the one-way hash sequence calculated for the unmodified string. Most modern programming languages have several methods for generating one-way hash values. Here is a short Ruby script that generates a one-way hash value for a file:

#!/usr/local/bin/ruby
require 'digest/md5'
file_contents = File.new("simplify.txt").binmode
hash_string = Digest::MD5.base64digest(file_contents.read)
puts hash_string
exit

Here is the one-way hash value for the file "simplify.txt" using the md5 algorithm:

0CfZez7L1A6WFcT+oxMh+g==

If we copy our example file to another file with an alternate filename, the md5 algorithm will generate the same hash value. Likewise, if we generate a one-way hash value, using the md5 algorithm implemented in some other language, such as Python or Perl, the outputs will be identical. One-way hash values can be designed to produce long fixed-length output strings (eg, 256 bits in length). When the output of a one-way hash algorithm is very long, the chance of a hash string collision (ie, the occurrence of two different input strings generating the same one-way hash output value) is negligible. Clever variations on one-way hash algorithms have been repurposed as identifier systems.25-28 Examples of one-way hash implementations in Perl and Python are found in Open Source Tools for this chapter, "One-Way Hash Implementations." See HMAC. See Message digest. See Checksum.

Privacy versus confidentiality The concepts of confidentiality and of privacy are often confused, and it is useful to clarify their separate meanings. Confidentiality is the process of keeping a secret with which you have been entrusted. You break confidentiality if you reveal the secret to another person. You violate privacy when you use the secret to annoy the person whose confidential information was acquired. If you give me your unlisted telephone number in confidence, then I am expected to protect this confidentiality by never revealing the number to other persons. I may also be expected to protect your privacy by never using the telephone number to call you at all hours of the day and night. In this case, the same information object (unlisted telephone number) is encumbered by separable confidentiality and privacy obligations.

RDF Resource Description Framework (RDF) is a syntax in XML notation that formally expresses assertions as triples. The RDF triple consists of a uniquely identified subject plus a metadata descriptor for the data plus a data element. Triples are necessary and sufficient to create statements that convey meaning. Triples can be aggregated with other triples from the same data set or from other data sets, so long as each triple pertains to a unique subject that is identified equivalently through the data sets. Enormous data sets of RDF triples can be merged or functionally integrated with other massive or complex data resources. For a detailed discussion see Open Source Tools for Chapter 6, "Syntax for triples." See Notation 3. See Semantics. See Triple. See XML.

RDF Schema Resource Description Framework Schema (RDFS). A document containing a list of classes, their definitions, and the names of the parent class(es) for each class. In an RDF Schema, the list of classes is typically followed by a list of properties that apply to one or more classes in the Schema. To be useful, RDF Schemas are posted on the Internet, as a web page, with a unique web address. Anyone can incorporate the classes and properties of a public RDF Schema into their own RDF documents (public or private) by linking named classes and properties, in their RDF document, to the web address of the RDF Schema where the classes and properties are defined. See Namespace. See RDFS.

RDF ontology A term that, in common usage, refers to the class definitions and relationships included in an RDF Schema document. The classes in an RDF Schema need not comprise a complete ontology. In fact, a complete ontology could be distributed over multiple RDF Schema documents. See RDF Schema.

RDFS Same as RDF Schema.

Randomness Various tests of randomness are available.29 One of the easiest to implement takes advantage of the property that random strings are incompressible. If a character string, a series of numbers, or a column of data cannot be compressed by gzip, then it is pretty safe to conclude that the data is randomly distributed and without any informational value.

Reconciliation Usually refers to identifiers, and involves verifying that an object that is assigned a particular identifier in one information system has been provided the same identifier in some other system. For example, if I am assigned identifier 967bc9e7-fea0-4b09-92e7-d9327c405d78 in a legacy record system, I should like to be assigned the same identifier in the new record system. If that were the case, my records in both systems could be combined. If I am assigned an identifier in one system that is different from my assigned identifier in another system, then the two identifiers must be reconciled to determine that they both refer to the same unique data object (ie, me). This may involve creating a link between the two identifiers or a new triple that establishes the equivalence of the two identifiers. Despite claims to the contrary, there is no possible way by which information systems with poor identifier systems can be sensibly reconciled. Consider this example: A hospital has two separate registry systems, one for dermatology cases and another for psychiatry cases. The hospital would like to merge records from the two services. Because of sloppy identifier practices, a sample patient has been registered 10 times in the dermatology system and 6 times in the psychiatry system, each time with different addresses, Social Security numbers, birthdates, and spellings of the patient's name. A reconciliation algorithm is applied, and one of the identifiers from the dermatology service is matched positively against one of the records from the psychiatry service. Performance studies on the algorithm indicate that the merged records have a 99.8% chance of belonging to the same patient. So what? Though the two merged identifiers correctly point to the same patient, there are 14 (9 + 5) residual identifiers for the patient still unmatched. The patient's merged record will not contain his complete clinical history. Furthermore, in this hypothetical instance, analyses of patient population data will mistakenly attribute one patient's clinical findings to as many as 15 different patients, and the set of 15 records in the corrupted deidentified data set may contain mixed-in information from an indeterminate number of additional patients! If the preceding analysis seems harsh, consider these words from the Healthcare Information and Management Systems Society: "A local system with a poorly maintained or 'dirty' master person index (MPI) will only proliferate and contaminate all of the other systems to which it links."30 See Social Security Number.

Reflection A programming technique wherein a computer program will modify itself at run-time based on information it acquires through introspection. For example, a computer program may iterate over a collection of data objects, examining the self-descriptive information for each object in the collection (ie, object introspection). If the information indicates that the data object belongs to a particular class of objects, then the program may call a method appropriate for the class. The program executes in a manner determined by descriptive information obtained during run-time, metaphorically reflecting upon the purpose of its computational task. See Introspection.
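
A minimal sketch of reflection in Python; the classes, methods, and dispatch table are hypothetical:

  class Fish:
      def swim(self):
          return "swims"

  class Bird:
      def fly(self):
          return "flies"

  # Map each class name to the method appropriate for that class
  dispatch = {"Fish": "swim", "Bird": "fly"}

  for obj in [Fish(), Bird(), Fish()]:
      cls = type(obj).__name__              # introspection: discover the class at run-time
      method = getattr(obj, dispatch[cls])  # reflection: select behavior based on what was learned
      print(cls, method())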

Registrars and Human Authentication The experiences of registrars in U.S. hospitals serve as cautionary instruction. Hospital registrars commit a disastrous mistake when they assume that all patients wish to comply with the registration process. A patient may be highly motivated to provide false information to a registrar, acquire several different registration identifiers, seek a false registration under another person's identity (ie, commit fraud), or forego the registration process entirely. In addition, it is a mistake to believe that honest patients are able to fully comply with the registration process. Language barriers, cultural barriers, poor memory, poor spelling, and a host of errors and misunderstandings can lead to duplicative or otherwise erroneous identifiers. It is the job of the registrar to follow hospital policies that overcome these difficulties. Registration should be conducted by a trained registrar who is well-versed in the registration policies established by the institution. Registrars may require patients to provide a full legal name, any previously held names (eg, maiden name), a date of birth, and a government-issued photo ID (eg, a driver's license or an ID card issued by the Department of Motor Vehicles). To be thorough, registration should require a biometric identifier (eg, fingerprints, retina scan, iris scan, voice recording, and/or photograph). If you accept the premise that hospitals have the responsibility of knowing whom they are treating, then obtaining a sample of DNA from every patient at the time of registration is reasonable. The DNA can be used to create a unique patient profile from a chosen set of informative loci, a procedure used by the CODIS system developed for law enforcement agencies. The registrar should document any distinguishing and permanent physical features that are plainly visible (eg, scars, eye color, colobomas, or tattoos). Neonatal and pediatric identifiers pose a special set of problems for registrars. When an individual who was born in a hospital and provided with an identifier returns as an adult, he or she should be assigned the same identifier that was issued in the remote past. Accordingly, every patient who comes for registration should be matched against a database of biometric data that does not change from birth to death (eg, fingerprints, DNA). See Social Security Number.

Reidentification A term casually applied to any instance whereby information can be linked to a specific person after the links between the information and the person associated with the information have been removed. Used this way, the term reidentification connotes an insufficient deidentification process. In the health care industry, the term "reidentification" means something else entirely. In the U.S., regulations define "reidentification" under the "Standards for Privacy of Individually Identifiable Health Information."8 Therein, reidentification is a legally sanctioned process whereby deidentified records can be linked back to their human subjects, under circumstances deemed legitimate and compelling, by a privacy board. Reidentification is typically accomplished via the use of a confidential list of links between human subject names and deidentified records held by a trusted party. In the health care realm, when a human subject is identified through fraud, trickery, or through the deliberate use of computational methods to break the confidentiality of insufficiently deidentified records (ie, hacking), the term "reidentification" would not apply.4

Ruby gem In Ruby, gems are external modules available for download from an Internet server. The Ruby gem installation module comes bundled in Ruby distribution packages. Gem installations are simple, usually consisting of commands in the form, "gem install name_of_gem" invoked at the system prompt. After a gem has been installed, scripts access the gem with a "require" statement, equivalent to an "import" statement in Python or the "use" statement in Perl.

Science Of course, there are many different definitions of science, and inquisitive students should be encouraged to find a conceptualization of science that suits their own intellectual development. For me, science is all about finding general relationships among objects. In the so-called physical sciences, the most important relationships are expressed as mathematical equations (eg, the relationship between force, mass, and acceleration; the relationship between voltage, current and resistance). In the so-called natural sciences, relationships are often expressed through classifications (eg, the classification of living organisms). Scientific advancement is the discovery of new relationships or the discovery of a generalization that applies to objects hitherto confined within disparate scientific realms (eg, evolutionary theory arising from observations of organisms and geologic strata). Engineering would be the area of science wherein scientific relationships are exploited to build new technology. See Generalization.

Semantics The study of meaning (Greek root, semantikos, significant meaning). In the context of data science, semantics is the technique of creating meaningful assertions about data objects. A meaningful assertion, as used here, is a triple consisting of an identified data object, a data value, and a descriptor for the data value. In practical terms, semantics involves making assertions about data objects (ie, making triples), combining assertions about data objects (ie, merging triples), and assigning data objects to classes, thereby relating triples to other triples. As a word of warning, few informaticians would define semantics in these terms, but most definitions for semantics are functionally equivalent to the definition offered here. Most language is unstructured and meaningless. Consider the assertion: Sam is tired. This is an adequately structured sentence, but what is its meaning? There are many people named Sam. Which Sam is being referred to in this sentence? What does it mean to say that Sam is tired? Is "tiredness" a constitutive property of Sam, or does it apply only to specific moments? If the latter, for what moment in time is the assertion "Sam is tired" actually true? To a computer, meaning comes from assertions that have a specific, identified subject associated with some sensible piece of fully described data (metadata coupled with the data it describes). See Triple. See RDF.
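
A sketch of how "Sam is tired" might be recast as computable triples; the identifier and the metadata descriptors are hypothetical:

  sam = "f3a0d2c1-9b8e-4d47-a6b2-5f0e9c1d2e3f"   # a unique identifier resolves "which Sam?"

  triples = [
      (sam, "legal_name", "Samuel T. Crane"),
      (sam, "alertness_state", "tired"),
      (sam, "alertness_state_timestamp", "1433175600"),  # epoch seconds: when the assertion held true
  ]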

Social Security Number The common strategy in the U.S. of employing Social Security numbers as identifiers is often counterproductive, owing to entry error, mistaken memory, or the intention to deceive. Efforts to reduce errors by requiring individuals to produce their original Social Security cards put an unreasonable burden on honest individuals, who rarely carry their cards, and provide an advantage to dishonest individuals, who can easily forge Social Security cards. Institutions that compel individuals to provide a Social Security number have dubious legal standing. The Social Security number was originally intended as a device for validating a person's standing in the Social Security system. More recently, the purpose of the Social Security number has been expanded to track taxable transactions (ie, bank accounts, salaries). Other uses of the Social Security number are not protected by law. The Social Security Act (Section 208, codified at Title 42, Section 408 of the U.S. Code) prohibits most entities from compelling anyone to divulge his/her Social Security number. Legislation or judicial action may one day stop institutions from compelling individuals to divulge their Social Security numbers as a condition for providing services. Prudent and forward-thinking institutions will limit their reliance on Social Security numbers as personal identifiers. See Registrars and human authentication.

Symmetric key A key (ie, a password) that can be used to encrypt and decrypt the same file. AES is an encryption/decryption algorithm that employs a symmetric key. See AES.
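
A minimal sketch of symmetric AES encryption in Python, assuming a recent version of the third-party cryptography package has been installed:

  import os
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  key = os.urandom(32)   # a 256-bit symmetric key: the SAME key encrypts and decrypts
  iv = os.urandom(16)    # initialization vector

  encryptor = Cipher(algorithms.AES(key), modes.CFB(iv)).encryptor()
  ciphertext = encryptor.update(b"confidential record") + encryptor.finalize()

  decryptor = Cipher(algorithms.AES(key), modes.CFB(iv)).decryptor()
  recovered = decryptor.update(ciphertext) + decryptor.finalize()
  assert recovered == b"confidential record"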

Timestamp Many data objects are temporal events, and all temporal events must be given a timestamp indicating the time that the event occurred, using a standard measurement for time. The timestamp must be accurate, persistent, and immutable. The Unix epoch time (equivalent to the POSIX epoch time) is available on most operating systems and consists of the number of seconds that have elapsed since midnight, January 1, 1970, Greenwich Mean Time. The Unix epoch time can easily be converted into any other standard representation of time. The duration of any event can be calculated by subtracting the beginning time from the ending time. Because the timing of events can be maliciously altered, scrupulous data managers may choose to employ a trusted timestamp protocol by which a timestamp can be verified.
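
A minimal sketch in Python:

  import time
  from datetime import datetime, timezone

  start = time.time()      # seconds elapsed since the Unix epoch
  # ... the event being timed occurs here ...
  end = time.time()
  duration = end - start   # duration of the event, in seconds

  # Epoch time converts readily to other standard representations
  print(datetime.fromtimestamp(start, tz=timezone.utc).isoformat())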

Triple In computer semantics, a triple is an identified data object associated with a data element and the description of the data element. In the computer science literature, the syntax for the triple is commonly described as "subject, predicate, object," wherein the subject is an identifier, the predicate is the description of the object, and the object is the data. The definition of triple, using grammatical terms, can be off-putting to the data scientist, who may think in terms of spreadsheet entries: a key that identifies the line record, a column header containing the metadata description of the data, and a cell that contains the data. In this book, the three components of a triple are described as: (1) the identifier for the data object, (2) the metadata that describes the data, and (3) the data itself. In theory, all data sets, databases, and spreadsheets can be constructed or deconstructed as collections of triples. See Introspection. See Data object. See Semantics. See RDF. See Meaning.
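
A sketch of the spreadsheet analogy in Python, with hypothetical values:

  # One spreadsheet row: a key, column headers (metadata), and cells (data)
  row_key = "ca05cb5a-141a-4cf3-a52a-d211038b1f49"   # hypothetical identifier
  headers = ["name", "occupation", "age"]
  cells = ["Conrad Nervig", "editor", "52"]

  # The same record, deconstructed into triples
  triples = [(row_key, metadata, data) for metadata, data in zip(headers, cells)]
  # [('ca05...', 'name', 'Conrad Nervig'), ('ca05...', 'occupation', 'editor'), ...]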

Triplestore A list or database composed entirely of triples (statements consisting of an item identifier plus the metadata describing the item plus an item of data). The triples in a triplestore need not be saved in any particular order, and any triplestore can be merged with any other triplestore; the basic semantic meaning of the contained triples is unaffected. See Triple.
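
Because order is irrelevant, merging two triplestores can be as simple as taking a set union. A minimal sketch in Python, with hypothetical triples:

  store_a = {("id_1", "name", "Alice Smith"), ("id_1", "age", "34")}
  store_b = {("id_2", "name", "Bob Jones"), ("id_1", "age", "34")}

  merged = store_a | store_b   # set union; duplicate triples collapse, semantics unaffected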

URL Uniform Resource Locator. The web is a collection of resources, each having a unique address, the URL. When you click on a link that specifies a URL, your browser fetches the page located at the unique location specified by the URL. If the web were designed otherwise (ie, if several different web pages had the same web address, or if one web address pointed to several different locations), then the web could not function with any reliability.

URN Uniform Resource Name. Whereas the URL identifies objects based on the object's unique location on the web, the URN is a system of object identifiers that are location-independent. In the URN system, data objects are provided with identifiers, and the identifiers are registered with, and subsumed by, the URN system. For example:

urn:isbn:9780128028827

Refers to the unique book Repurposing Legacy Data: Innovative Case Studies by Jules Berman

urn:uuid:e29d0078-f7f6-11e4-8ef1-e808e19e18e5

Refers to a data object tied to the UUID identifier e29d0078-f7f6-11e4-8ef1-e808e19e18e5. In theory, if every data object were assigned a registered URN and if the system were implemented as intended, the entire universe of information could be tracked and searched. See URL. See UUID.

UUID (Universally Unique Identifier) is a protocol for assigning unique identifiers to data objects without using a central registry. UUIDs were originally used in the Apollo Network Computing System.2 Most modern programming languages have modules for generating UUIDs. See Identifier.
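
For example, Python's standard uuid module generates identifiers without consulting any central registry:

  import uuid

  new_id = str(uuid.uuid4())   # a random (version 4) UUID
  print(new_id)                # vanishingly unlikely ever to be generated again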

Uniqueness Uniqueness is the quality of being demonstrably different from every other thing in the universe. For data scientists, uniqueness is achieved when a data object is bound to a unique identifier (ie, a string of alphanumeric characters) that has not and will never be assigned to any other object. Interestingly, uniqueness can apply to classes of objects that happen to contain non-unique members and to two or more indistinguishable objects, if they are assigned unique identifiers (eg, unique product numbers stamped into identical auto parts).

Watermarking Watermarking, a type of steganography, is a method for insinuating the name of the owner or creator of a digital object within the object for the purpose of asserting intellectual property. See Intellectual property.

XML Acronym for Extensible Markup Language, a syntax for marking data values with descriptors (ie, metadata). The descriptors are commonly known as tags. In XML, every data value is enclosed by a start tag, containing the descriptor and indicating that a value will follow, and an end tag, containing the same descriptor and indicating that a value preceded the tag. For example: <name>Conrad Nervig</name>. The enclosing angle brackets, "<>", and the end-tag marker, "/", are hallmarks of HTML and XML markup. This simple but powerful relationship between metadata and data allows us to employ metadata/data pairs as though each were a miniature database. The semantic value of XML becomes apparent when we bind a metadata/data pair to a unique object, forming a so-called triple. See Triple. See Meaning. See Semantics. See HTML.
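
A minimal sketch in Python, using the standard xml.etree module to recover the metadata/data pair:

  import xml.etree.ElementTree as ET

  element = ET.fromstring("<name>Conrad Nervig</name>")
  print(element.tag, "->", element.text)   # name -> Conrad Nervig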

References

1 Berman J.J. Repurposing legacy data: innovative case studies. Burlington, MA: Elsevier, Morgan Kaufmann imprint; 2015.

2 Leach P., Mealling M., Salz R. A universally unique identifier (UUID) URN namespace. Network Working Group, Request for Comments 4122, Standards Track. Available from: http://www.ietf.org/rfc/rfc4122.txt [accessed 01.01.15].

3 Mealling M. RFC 3061. A URN namespace of object identifiers. Network Working Group, 2001. Available from: https://www.ietf.org/rfc/rfc3061.txt [accessed 01.01.15].

4 Berman J.J. Principles of big data: preparing, sharing, and analyzing complex information. Burlington, MA: Morgan Kaufmann; 2013.

5 Berman J.J. Methods in medical informatics: fundamentals of healthcare programming in Perl, Python, and Ruby. Boca Raton: Chapman and Hall; 2010.

6 Kuzmak P., Casertano A., Carozza D., Dayhoff R., Campbell K. Solving the problem of duplicate medical device unique identifiers. In: High Confidence Medical Device Software and Systems (HCMDSS) Workshop, Philadelphia, PA, June 2–3, 2005. Available from: http://www.cis.upenn.edu/hcmdss/Papers/submissions/ [accessed 26.08.12].

7 Committee on A Framework for Developing a New Taxonomy of Disease, Board on Life Sciences, Division on Earth and Life Studies, National Research Council of the National Academies. Toward precision medicine: building a knowledge network for biomedical research and a new taxonomy of disease. Washington, DC: The National Academies Press; 2011.

8 Department of Health and Human Services. 45 CFR (Code of Federal Regulations), Parts 160 through 164. Standards for privacy of individually identifiable health information (final rule). Fed Regist. 2000;65(250):82461–82510.

9 Berman J.J. Concept-match medical data scrubbing: how pathology datasets can be used in research. Arch Pathol Lab Med. 2003;127:680–686.

10 Berman J.J. Comparing de-identification methods. Available from: http://www.biomedcentral.com/1472-6947/6/12/comments/comments.htm; 2006 [accessed 01.01.15].

11 Berman J.J. Ruby programming for medicine and biology. Sudbury, MA: Jones and Bartlett; 2008.

12 Berman J.J. Doublet method for very fast autocoding. BMC Med Inform Decis Mak. 2004;4:16.

13 Rothstein M.A. Is deidentification sufficient to protect health privacy in research? Am J Bioeth. 2010;10:3–11.

14 Chen L., Shern J.F., Wei J.S., Yohe M.E., Song Y.K., Hurd L., et al. Clonality and evolutionary history of rhabdomyosarcoma. PLoS Genet. 2015;11.

15 Frieden T. VA will pay $20 million to settle lawsuit over stolen laptop's data. CNN. 2009.

16 Powers T. Computer security: the whiz kid vs. the old boys. The New York Times; 2000.

17 Schneier B. A plea for simplicity: you can't secure what you don't understand. Information Security. Available from: http://www.schneier.com/essay-018.html; 1999 [accessed 01.07.15].

18 Reed D.P. Naming and synchronization in a decentralized computer system. Doctoral Thesis, MIT; 1978.

19 Klyne G., Newman C. Date and time on the Internet: timestamps. Network Working Group, Request for Comments 3339. Available from: http://tools.ietf.org/html/rfc3339 [accessed 15.09.15].

20 Trithemius J. Steganographia (Secret Writing); 1500.

21 Berman J.J. Biomedical informatics. Sudbury, MA: Jones and Bartlett; 2007.

22 A review of the FBI's handling of the Brandon Mayfield case. U.S. Department of Justice, Office of the Inspector General, Oversight and Review Division; 2006.

23 Paskin N. Identifier interoperability: a report on two recent ISO activities. D-Lib Mag. 2006;12:1–23.

24 Berman J.J., Moore G.W. Implementing an RDF Schema for pathology images; 2007. Available from: http://www.julesberman.info/spec2img.htm [accessed 01.01.15].

25 Faldum A., Pommerening K. An optimal code for patient identifiers. Comput Methods Programs Biomed. 2005;79:81–88.

26 Rivest R. The MD5 message-digest algorithm. Network Working Group, Request for Comments 1321. Available from: https://www.ietf.org/rfc/rfc1321.txt [accessed 01.01.15].

27 Bouzelat H., Quantin C., Dusserre L. Extraction and anonymity protocol of medical file. Proc AMIA Annu Fall Symp. 1996;323–327.

28 Quantin C., Bouzelat H., Allaert F.A., Benhamiche A.M., Faivre J., Dusserre L. Automatic record hash coding and linkage for epidemiological follow-up data confidentiality. Methods Inf Med. 1998;37:271–277.

29 Marsaglia G., Tsang W.W. Some difficult-to-pass tests of randomness. J Stat Softw. 2002;7:1–8. Available from: http://www.jstatsoft.org/v07/i03/paper [accessed 25.09.12].

30 Patient Identity Integrity. A white paper by the HIMSS Patient Identity Integrity Work Group. Available from: http://www.himss.org/content/files/PrivacySecurity/PIIWhitePaper.pdf; 2009 [accessed 19.09.12].
