Chapter 4.7

Taxonomies

Abstract

There are different definitions of big data. The definition used here is that big data encompasses a lot of data, is based on inexpensive storage, manages data by the “Roman census” method, and stores data in an unstructured format. There are two major types of big data—repetitive big data and nonrepetitive big data. Only a small fraction of repetitive big data has business value, whereas almost all of nonrepetitive big data has business value. In order to achieve business value, the context of data in big data must be determined. Contextualization of repetitive big data is easily achieved. But contextualization of nonrepetitive data is done by means of textual disambiguation.

Keywords

Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data; Contextualization; Textual disambiguation

Taxonomies are classifications of information. Taxonomies play a large and important role in the disambiguation of narrative information. Fig. 4.7.1 shows that taxonomies are to unstructured data what the data model is to structured data.

Fig. 4.7.1 Taxonomies—one of the keys to unlocking unstructured data.

Data Models/Taxonomies

The data model classically has played the role of serving as a map—an intellectual guideline—to the understanding and management of data in the structured environment. The taxonomy plays the same role in the unstructured textual environment. While not perfectly equivalent to each other, the taxonomy serves much the same purpose as the data model.

The classification of information developed in this book contains one confusing anomaly in the world of unstructured data. That anomaly must be explained, because it is important to understanding the role and function of taxonomies.

Consider the classification of data shown in Fig. 4.7.2.

Fig. 4.7.2 Creating confusion—the fact that there is repetitive nonrepetitive data.

Fig. 4.7.2 shows that there is unstructured data. Unstructured data is subclassified into repetitive and nonrepetitive unstructured data. Then, beneath nonrepetitive data, there is a further subclassification into repetitive and nonrepetitive data. Under this classification scheme, then, there is nonrepetitive data that is itself either repetitive or nonrepetitive. This is confusing (apologies!), but it is not a mistake.

In order to explain this anomaly and explain why it is important, consider the following real example.

In general, unstructured data can be considered to be repetitive or nonrepetitive. Repetitive unstructured data are unstructured data whose content and structure are highly repetitive. Into this classification fall clickstream data, analog data, metering data, and so forth. Into the other classification, nonrepetitive unstructured data, fall written and spoken narrative data: e-mails, call center data, customer feedback, contracts, and a whole host of other narrative data.

Now, consider that within the classification of narrative data, there is a further subclassification. Written data can be nonrepetitive or repetitive. For example, lawyers who write contracts use what is called “boilerplate.” A boilerplate contract is a contract where the primary body of the contract is predetermined. The lawyer only fills in a few details, such as the name, address, and social security number of the recipient of the contract. There may be a few other terms that are negotiated, but at the end of the day, boilerplate contracts are very, very similar.

This then is an example of a repetitive nonrepetitive occurrence of data. The contract is nonrepetitive because it is in narrative form. But it is repetitive because it is essentially boilerplate.

The reason for making the distinction between nonrepetitive nonrepetitive text and nonrepetitive repetitive text is that taxonomies apply to nonrepetitive nonrepetitive text. Some examples are needed here to explain this anomaly.

Applicability of Taxonomies

Taxonomies are most applicable to text such as e-mails, call center information, conversations, and other free-form narrative text. In free-form text, words must be classified using only the context that the taxonomy supplies. As an example, suppose the term “ice cream” is encountered in an e-mail. Ice cream belongs in the taxonomy of “dessert,” so it can be inferred that the e-mail is about food, meals, and desserts. Another e-mail mentions cake. Cake too is a dessert. So, the e-mails are related to each other, even though the terms “ice cream” and “cake” are very different. Using taxonomic classification in free-form text is very useful for understanding the text.
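The dessert example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping TERM_TO_CATEGORY, the function categories_in, and the two sample e-mails are invented here, and the naive substring matching stands in for whatever matching a real textual disambiguation product performs.

```python
# Hypothetical taxonomy: each term maps to the category it belongs to.
TERM_TO_CATEGORY = {
    "ice cream": "dessert",
    "cake": "dessert",
    "sorbet": "dessert",
}

def categories_in(text):
    """Return the set of taxonomy categories whose terms appear in the text.

    Uses naive, case-insensitive substring matching (an assumption made
    for brevity, not a recommendation).
    """
    lowered = text.lower()
    return {cat for term, cat in TERM_TO_CATEGORY.items() if term in lowered}

# Invented sample e-mails that share no dessert word with each other.
email_a = "We served ice cream after the meeting."
email_b = "Could you bring a cake on Friday?"

# The e-mails are related through the category, not through the words.
shared = categories_in(email_a) & categories_in(email_b)
print(shared)  # {'dessert'}
```

The two e-mails have no dessert term in common, yet both resolve to the same category, which is what allows them to be related to each other.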

However, suppose you have a boilerplate contract. Suppose the contract is for the purchase of apples. The term “apples” appears in every contract as part of the boilerplate. Certainly, an apple is a fruit. But the fact that an apple is classified as a fruit holds in every instance of the contract, and there are many instances of the contract. Therefore, using a taxonomy to classify “apple” is not terribly useful in boilerplate data because the classification occurs repeatedly and adds very little to the understanding of the text.

For this reason, taxonomies are not very useful or applicable to boilerplate contracts and other places where there is repetitive narrative text.

The previous discussion is very difficult to explain. It is hoped that the examples make it clear what is being said.

What Is a Taxonomy?

So what is a taxonomy? In its simplest form, a taxonomy is simply a list of words that provides a classification of some larger topic. Fig. 4.7.3 shows some simple taxonomies.

Fig. 4.7.3 Some simple taxonomies.

In Fig. 4.7.3, it is seen that a car can be a Honda, Porsche, Volkswagen, and so forth. Or a German product may be sausage, beer, a Porsche, software (such as SAP), and so forth.

Of course, there are many other ways to classify these items. A car may be a sedan, an SUV, a sports car, and so forth. Or American products may be a hamburger, software, movies, corn, wheat, and so forth.
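In its simplest form, then, a taxonomy can be held as nothing more than a named list of words. The following Python sketch uses the car and German product lists mentioned above; the names TAXONOMIES and classifications_of are invented for illustration and are not taken from any product.

```python
# Each taxonomy is a classification name with a plain list of member words.
TAXONOMIES = {
    "car": ["Honda", "Porsche", "Volkswagen"],
    "German product": ["sausage", "beer", "Porsche", "SAP"],
}

def classifications_of(term):
    """List every taxonomy in which the term appears, in sorted order."""
    return sorted(name for name, members in TAXONOMIES.items() if term in members)

# The same word can legitimately belong to more than one taxonomy.
print(classifications_of("Porsche"))  # ['German product', 'car']
```

The point of the sketch is that one word may be classified in several ways, so which taxonomies are loaded determines what the text is understood to be about.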

There is indeed an almost infinite number of taxonomies. Taxonomies are applied to nonrepetitive unstructured data on the basis of applicability. For example, an automaker may use taxonomies relating to engineering and manufacturing. Or an accounting firm may choose taxonomies that apply to taxes and to the rules of accounting. Or a retailer may choose taxonomies that relate to products and sales.

Conversely, it would be very unusual for an engineering firm to use a taxonomy relating to religion or lawmaking. Or it would be unusual for a construction firm to have an interest in taxonomies about ethnicity.

Related to a taxonomy is an ontology. Fig. 4.7.4 depicts an ontology.

Fig. 4.7.4 An ontology.

A simple definition of an ontology is that an ontology is a taxonomy where there are interrelationships of the elements within the taxonomy.
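One way to picture the difference is to add relationships to the simple list representation of a taxonomy. The Python sketch below is an assumption made for illustration only: the structure ONTOLOGY, the relation name served_with, and the function related_to are all invented, and they do not reproduce the contents of Fig. 4.7.4.

```python
# A taxonomy (category plus members) extended with relationships
# between its elements. Each relationship is a
# (subject, relation, object) triple; the triple below is invented.
ONTOLOGY = {
    "category": "German product",
    "members": ["sausage", "beer", "Porsche"],
    "relationships": [("beer", "served_with", "sausage")],
}

def related_to(ontology, term):
    """Return (relation, other_term) pairs for triples whose subject is term."""
    return [
        (relation, obj)
        for subject, relation, obj in ontology["relationships"]
        if subject == term
    ]

print(related_to(ONTOLOGY, "beer"))  # [('served_with', 'sausage')]
```

Without the relationships entry, the structure is just a taxonomy; with it, the elements of the taxonomy are interrelated.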

As a rule, either taxonomies or ontologies (or both) can be used when creating the foundation for textual disambiguation of nonrepetitive unstructured data.

Taxonomies in Multiple Languages

One of the issues relating to taxonomies is that taxonomies can exist in multiple languages, as Fig. 4.7.5 shows.

Fig. 4.7.5 Taxonomies—in multiple languages.

Commercial or Private Taxonomies?

A related issue is whether to use commercially created taxonomies or to use individually created taxonomies when doing textual disambiguation. One of the major advantages of a commercially created taxonomy is that it is normally created and supported in multiple languages, so it can be carried easily and automatically from one language to another. With a commercially created taxonomy, you can read a document in one language and create the associated analytic database in a different language.
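The idea of reading in one language and writing the analytic output in another can be sketched as follows. Everything here is hypothetical: the table MULTILINGUAL, the function classify, and the Spanish sample sentence are invented for illustration, and a commercial taxonomy would be far larger and would not necessarily be organized this way.

```python
# Hypothetical multilingual taxonomy: a Spanish term mapped to its
# category label in each supported output language.
MULTILINGUAL = {
    "helado": {"es": "postre", "en": "dessert"},  # ice cream
    "pastel": {"es": "postre", "en": "dessert"},  # cake
}

def classify(words, output_lang):
    """Return (term, category) rows with the category in output_lang."""
    return [(w, MULTILINGUAL[w][output_lang]) for w in words if w in MULTILINGUAL]

# Read Spanish text, produce rows classified in English.
rows = classify("compramos helado y pastel".split(), output_lang="en")
print(rows)  # [('helado', 'dessert'), ('pastel', 'dessert')]
```

The source terms stay in the language of the document, while the classification that lands in the analytic database is in the language the analyst chooses.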

But the largest advantage of using a commercially created taxonomy is that it does not require a large investment in the creation of the taxonomy. If an organization decides to manually create its own taxonomies, the organization is inviting a disaster because of its inability to estimate how much effort is required to actually build and maintain the taxonomies that it needs.

Dynamics of Taxonomies and Textual Disambiguation

The dynamics of how a taxonomy interacts with textual disambiguation are illustrated in the simple example seen in Fig. 4.7.6.

Fig. 4.7.6 The application of a taxonomy to raw text.

In Fig. 4.7.6, raw text is shown. The raw text is passed against two taxonomies: one for car and another for motor thoroughfare. The output shows that where the word “Porsche” is encountered, it is recognized to be part of the taxonomy for car. The word “Porsche” is changed to the expression “Porsche/car” in the output. The same processing occurs for “Volkswagen” and “Honda.”

Using the taxonomy for thoroughfare, the term “highway” is seen to be a form of “road.” The output for “highway” is written out as “highway/road.”

The example in the figure is very simple. But the example serves to illustrate the dynamics of how the taxonomy is used to interact with raw text inside the textual disambiguation process. In reality, the actual uses of taxonomies are usually much more sophisticated and elaborated than this simple example.

The use of taxonomies as described here is the key to opening the door to sophisticated analysis of text.

Note that once the processed text has been written out, the analyst can create a query on “car” and find all mentions of any type of car. Also note that the term “car” appears nowhere in the raw text. This is just a glimpse at the value added by taxonomies when taxonomies are applied to text.
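The processing just described can be approximated in a short Python sketch. It is a simplification under stated assumptions: the sample sentence is invented (the raw text of Fig. 4.7.6 is not reproduced here), the names WORD_PAIRS, annotate, and mentions are hypothetical, and matching is done one word at a time, which a real textual disambiguation process would go well beyond.

```python
import re

# Taxonomy entries as (word -> classification), following the chapter's example.
WORD_PAIRS = {
    "porsche": "car",
    "volkswagen": "car",
    "honda": "car",
    "highway": "road",
}

def annotate(raw_text):
    """Rewrite each recognized word as 'word/classification'."""
    def tag(match):
        word = match.group(0)
        category = WORD_PAIRS.get(word.lower())
        return f"{word}/{category}" if category else word
    return re.sub(r"[A-Za-z]+", tag, raw_text)

def mentions(annotated_text, category):
    """Find every original word that was tagged with the given classification."""
    return re.findall(rf"([A-Za-z]+)/{re.escape(category)}\b", annotated_text)

raw = "She drove her Porsche down the highway past a Honda."  # invented sample
out = annotate(raw)
print(out)
# She drove her Porsche/car down the highway/road past a Honda/car.
print(mentions(out, "car"))  # ['Porsche', 'Honda']
```

A query on "car" now finds both vehicles, even though the word "car" never occurs in the raw sentence.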

The ability to classify data externally is extremely useful when disambiguating nonrepetitive unstructured data.

Taxonomies and Textual Disambiguation—Separate Technologies

Taxonomies—the gathering, classification, and maintenance of the taxonomy—require their own care and handling. Usually, it makes sense to build and manage the taxonomy external to the technology for textual disambiguation. Fig. 4.7.7 shows that arrangement.

Fig. 4.7.7 Taxonomies serve as input to textual ETL.

There are many reasons for separating the building and management of taxonomies from textual disambiguation. But the primary reason is that textual disambiguation is complex enough without adding the further complexity of building and managing taxonomies to the process.

Another way to explain the differences between the two processes is to look at the representation of taxonomies in the different technologies. In the world of taxonomy management, taxonomies require a robust and complex representation. But in the world of textual disambiguation, taxonomies are represented as a series of word pairs.

Fig. 4.7.8 shows this distinct difference between the two technologies.

Fig. 4.7.8 The output from processing taxonomies is a word pair specification.
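The handoff between the two technologies can be sketched as a flattening step: a richer, nested representation goes in, and a flat series of word pairs comes out. The Python below is a minimal sketch assuming a nested dictionary as the "robust" form; the names TAXONOMY_TREE and to_word_pairs are invented, and real taxonomy management tools hold far more than this.

```python
# Hypothetical nested taxonomy, standing in for the richer representation
# kept by a taxonomy management tool.
TAXONOMY_TREE = {
    "thoroughfare": {"road": ["highway"]},
    "car": ["Porsche", "Volkswagen", "Honda"],
}

def to_word_pairs(node, parent=None):
    """Flatten a nested taxonomy into (word, classification) pairs."""
    pairs = []
    if isinstance(node, dict):
        for name, child in node.items():
            if parent is not None:
                # An intermediate level is itself a word under its parent.
                pairs.append((name, parent))
            pairs.extend(to_word_pairs(child, name))
    else:
        # A list of leaf words directly under the parent classification.
        pairs.extend((word, parent) for word in node)
    return pairs

for word, classification in to_word_pairs(TAXONOMY_TREE):
    print(f"{word} -> {classification}")
```

Only the flat pairs need to be passed to textual disambiguation; the nesting, and whatever else the management tool tracks, stays on the taxonomy side.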

Different Types of Taxonomies

An interesting point about taxonomies is that taxonomies themselves can be classified in many ways. Stated differently, there are many different ways to create the lists and classifications that make up taxonomies. Some taxonomies are made up of words that are synonyms. Other taxonomies are simply a list of words that happen to be gathered together. Other taxonomies are categories of words and so forth.

Fig. 4.7.9 shows that there are many different kinds of taxonomies.

Fig. 4.7.9 Different kinds of taxonomies.

Taxonomies—Maintenance Over Time

A final observation about taxonomies is that over time, taxonomies require maintenance. Taxonomies require maintenance because language is constantly changing. For example, in the year 2000, if you referred to a “blog,” almost no one would have known what you were talking about. But 10 years later, the term “blog” was in common use.

Over time, language and terms change. And as language and terms change, the taxonomies that track those changes must be brought up to date.

Fig. 4.7.10 shows that over time, taxonomies require periodic maintenance.

Fig. 4.7.10 Over time taxonomies require periodic maintenance.