Chapter 1.3

The “Great Divide”

Abstract

Corporate data encompass everything found in the corporation in the way of data. The most basic division of corporate data is into structured data and unstructured data. As a rule, there is much more unstructured data than structured data. Unstructured data have two basic divisions: repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form than repetitive big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the “great divide.” The divide is so large that many professionals are not even aware that it exists. As a rule, nonrepetitive big data has MUCH greater business value than repetitive big data.

Keywords

Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data

Classifying Corporate Data

Corporate data can be classified in many different ways. One of the major classifications is structured versus unstructured data. And unstructured data can be further broken into two categories: repetitive unstructured data and nonrepetitive unstructured data. This division of data is shown in Fig. 1.3.1.

Fig. 1.3.1
Fig. 1.3.1 The great divide.

Repetitive unstructured data are data that occur very often and whose records are almost identical in terms of structure and content. There are many examples of repetitive unstructured data—telephone call records, metered data, analog data, and so forth.

Nonrepetitive unstructured data are data that consist of records of data where the records are not similar, in terms of either structure or content. There are many examples of nonrepetitive unstructured data—e-mails, call center conversations, warranty claims, and so forth.
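The contrast can be sketched in a few lines of code (the record layouts below are invented for illustration): repetitive records share one structure, while nonrepetitive records share none.

```python
# Illustrative only: sample records invented to show the structural contrast.
call_records = [  # repetitive unstructured data: near-identical structure
    {"from": "555-0101", "to": "555-0202", "seconds": 314},
    {"from": "555-0303", "to": "555-0404", "seconds": 27},
]

emails = [  # nonrepetitive unstructured data: no shared structure
    "Hi Bob, the Q3 numbers look wrong, can we talk?",
    "RE: warranty claim #4471 -- the compressor failed again after repair.",
]

def is_repetitive(records):
    """A record set is 'repetitive' here if every record has the same fields."""
    shapes = {tuple(sorted(r)) if isinstance(r, dict) else "free text" for r in records}
    return len(shapes) == 1 and "free text" not in shapes

print(is_repetitive(call_records))  # True
print(is_repetitive(emails))        # False
```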

The “Great Divide”

Between the two types of unstructured data is what can be termed the “great divide.”

The “great divide” is the demarcation between repetitive and nonrepetitive records, as seen in the figure. At first glance, it does not appear that there should be a massive difference between repetitive unstructured records and nonrepetitive unstructured records of data. But such is not the case at all. There is indeed a HUGE difference between repetitive unstructured data and nonrepetitive unstructured data.

The primary distinction between the two types of unstructured data is one of focus: with repetitive unstructured data, the focus is on the management of data in the Hadoop/big data environment, whereas with nonrepetitive unstructured data, the focus is on the textual disambiguation of the data. And as shall be seen, this difference in focus makes a huge difference in how the data are perceived, how the data are used, and how the data are managed.

This difference—the “great divide”—is shown in Fig. 1.3.2.

Fig. 1.3.2
Fig. 1.3.2 Different types of unstructured data.

It is seen then that there is a very different focus between the two types of unstructured data.

Repetitive Unstructured Data

Repetitive unstructured data are said to be “Hadoop” centric. Being “Hadoop” centric means that the processing of repetitive unstructured data revolves around managing the Hadoop/big data environment. The centricity of repetitive unstructured data is seen in Fig. 1.3.3.

Fig. 1.3.3
Fig. 1.3.3 Hadoop centric unstructured data.

The center of the Hadoop environment, naturally enough, is Hadoop. Hadoop is one of the technologies by which very large amounts of data can be managed. Hadoop is at the center of what is known as “big data” and is one of the primary storage mechanisms for big data. The essential characteristics of Hadoop are that Hadoop

  • is capable of managing very large volumes of data,
  • manages data on less expensive storage,
  • manages data by the “Roman census” method,
  • stores data in an unstructured manner.

Because of these operating characteristics of Hadoop, very large volumes of data can be managed. Hadoop is capable of managing volumes of data significantly larger than standard relational database management systems.
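The “Roman census” method, in which the processing is sent to the data rather than the data being shipped to the processing, can be sketched as follows (a toy word count over partitioned data; the names and partitions are illustrative, not Hadoop's actual API):

```python
from collections import Counter

# Toy version of the "Roman census" method: each node counts its own
# partition locally (the processing goes to the data), and only the small
# per-node tallies travel to a central point for the final merge.
partitions = [
    ["error", "ok", "error"],   # data resident on node 1
    ["ok", "ok", "error"],      # data resident on node 2
]

def local_count(partition):
    """Runs where the data live; returns a small summary, not the data."""
    return Counter(partition)

def merge(tallies):
    """Central step: combine the small per-node summaries."""
    total = Counter()
    for t in tallies:
        total += t
    return total

result = merge(local_count(p) for p in partitions)
print(dict(result))  # {'error': 3, 'ok': 3}
```

The design point is that only the small tallies cross the network; the raw data never move, which is what makes very large volumes manageable.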

The big data technology of Hadoop is depicted in Fig. 1.3.4.

Fig. 1.3.4
Fig. 1.3.4 Hadoop.

But Hadoop/big data is a raw technology. In order to be useful, Hadoop/big data requires its own unique infrastructure.

The technologies that surround Hadoop/big data serve to manage the data and to access and analyze the data found in Hadoop. The infrastructure services that surround Hadoop are seen in Fig. 1.3.5.

Fig. 1.3.5
Fig. 1.3.5 Services needed by big data.

The services that surround Hadoop/big data are familiar to anyone who has ever used a standard DBMS. The difference is that in a standard DBMS, the services are found in the DBMS itself, while in Hadoop, many of the services have to be provided externally. A second major difference is that throughout the Hadoop/big data environment, there is the need to service huge volumes of data. The developer in the Hadoop/big data environment must be prepared to manage and handle extremely large volumes of data. This means that many infrastructure tasks can be handled only in the Hadoop/big data environment itself.

Indeed, the Hadoop environment is permeated by the need to handle extraordinarily large amounts of data. The need to handle large amounts of data, indeed almost unlimited amounts of data, is seen in Fig. 1.3.6.

Fig. 1.3.6
Fig. 1.3.6 An infinite amount of data.

There is then an emphasis on doing the normal tasks of data management in the Hadoop environment where the process must be able to handle very large amounts of data.

Nonrepetitive Unstructured Data

The emphasis in the nonrepetitive unstructured environment is quite different than the emphasis on the management of the Hadoop big data technology. In the nonrepetitive unstructured environment, there is an emphasis on “textual disambiguation” (or on “textual ETL”). This emphasis is shown in Fig. 1.3.7.

Fig. 1.3.7
Fig. 1.3.7 Textual disambiguation centric unstructured data.

Textual disambiguation is the process of taking nonrepetitive unstructured data and manipulating them into a format that can be analyzed by standard analytic software. There are many facets to textual disambiguation, but perhaps the most important functionality is what can be called “contextualization.” Contextualization is the process by which text is read and analyzed and the context of the text is derived. Once the context of the text is derived, the text is reformatted into a standard database format where it can be read and analyzed by standard “business intelligence” software.
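A minimal sketch of contextualization, assuming invented patterns and field names rather than any actual textual ETL rule set: free-form text goes in, and rows with named context come out, ready for a standard database table.

```python
import re

# Sketch only: the claim text, patterns, and context names below are
# invented for illustration of contextualization, not real ETL rules.
claim = "Claim 4471: compressor failed on 2019-03-02, replaced under warranty."

rules = {
    "claim_id": r"Claim (\d+)",
    "part":     r"(compressor|motor|fan)",
    "date":     r"(\d{4}-\d{2}-\d{2})",
}

def contextualize(text, rules):
    """Apply each rule to the text; emit one (context, value) row per match."""
    rows = []
    for context, pattern in rules.items():
        m = re.search(pattern, text)
        if m:
            rows.append({"context": context, "value": m.group(1)})
    return rows

for row in contextualize(claim, rules):
    print(row)
```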

The process of textual disambiguation is shown in Fig. 1.3.8.

Fig. 1.3.8
Fig. 1.3.8 From unstructured to structured data.

There are many facets to textual disambiguation. Textual disambiguation is completely free from the limitations of natural language processing (NLP). In textual disambiguation, there is a multifaceted approach to the identification and derivation of context.

Some of the techniques used to derive context include the following:

  • The integration of external taxonomies and ontologies
  • Proximity analysis
  • Homographic resolution
  • Subdocument processing
  • Associative text resolution
  • Acronym resolution
  • Simple stop word processing
  • Simple word stemming
  • Inline pattern recognition

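Three of the simpler techniques on the list, stop word processing, word stemming, and acronym resolution, can be sketched as follows (the word lists and acronym table are invented examples, not a shipped vocabulary):

```python
# Illustrative sketches of three listed techniques; the stop word list and
# acronym table are invented examples for demonstration.
STOP_WORDS = {"the", "a", "an", "of", "and"}
ACRONYMS = {"HR": "human resources", "PO": "purchase order"}

def remove_stop_words(tokens):
    """Simple stop word processing: drop words that carry no context."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(token):
    """Simple word stemming by crude suffix stripping."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def resolve_acronyms(tokens):
    """Acronym resolution: expand known acronyms to their full terms."""
    return [ACRONYMS.get(t, t) for t in tokens]

tokens = "The HR department processed a PO".split()
tokens = remove_stop_words(tokens)
tokens = resolve_acronyms(tokens)
print([stem(t) for t in tokens])
```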
In truth, there are many more facets to the process of textual disambiguation than those shown. Some of the more important facets of textual disambiguation are shown in Fig. 1.3.9.

Fig. 1.3.9
Fig. 1.3.9 Some of the services needed to turn unstructured into structured data.

There is a concern regarding the volume of data that textual disambiguation can manage. But the volume of data that can be processed is secondary to the transformation of the data that occurs along the way. Simply stated, it doesn’t matter how fast you can process data if you cannot understand what it is that you are processing. The fact that textual disambiguation is dominated by transformation is depicted in Fig. 1.3.10.

Fig. 1.3.10
Fig. 1.3.10 Transformation.

There is then a completely different emphasis on the processing that occurs in the repetitive unstructured world versus the processing that occurs in the nonrepetitive unstructured world.

Different Worlds

This difference is seen in Fig. 1.3.11.

Fig. 1.3.11
Fig. 1.3.11 Transforming big data.

Part of the reason for the difference between repetitive unstructured data and nonrepetitive unstructured data lies in the data themselves. With repetitive unstructured data, there is not much need to discover the context of the data. The data occur so frequently and so repeatedly that their context is fairly obvious or fairly easy to ascertain. In addition, there typically is not much contextual data to begin with when it comes to repetitive unstructured data. Therefore, the emphasis is almost entirely on the need to manage volumes of data.

But with nonrepetitive unstructured data, there is a great need to derive the context of the data. Before the data can be used analytically, the data need to be contextualized. And with nonrepetitive unstructured data, deriving the context of the data is a very complex thing to do. For sure, there is a need to manage volumes of data when it comes to nonrepetitive unstructured data. But the primary need is the need to contextualize the data in the first place.

For these reasons, there is a “great divide” when it comes to managing and dealing with the different forms of unstructured data.
