Chapter 4.4

Unstructured Data

Abstract

There are different definitions of big data. The definition used here is that big data encompasses a lot of data, is based on inexpensive storage, manages data by the “Roman census” method, and stores data in an unstructured format. There are two major types of big data—repetitive big data and nonrepetitive big data. Only a small fraction of repetitive big data has business value, whereas almost all of nonrepetitive big data has business value. In order to achieve business value, the context of data in big data must be determined. Contextualization of repetitive big data is easily achieved. But contextualization of nonrepetitive data is done by means of textual disambiguation.

Keywords

Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data; Contextualization; Textual disambiguation

It is estimated that over 80% of the data in the corporation are unstructured information. There are many different forms of unstructured information. There is video. There is audio. There are images. But far and away the most interesting and useful form of unstructured data is textual information.

Textual Information—Everywhere

Textual information is found everywhere in the corporation. Text is found in contracts, in e-mail, in reports, in memoranda, in human resource evaluations, and so forth. In a word, textual information makes up the fabric of corporate life, and that is true for every corporation.

Unstructured information can be broken into two major categories—repetitive unstructured data and nonrepetitive unstructured data. Fig. 4.4.1 shows the categories that describe all corporate data.

Fig. 4.4.1 Unstructured data can be repetitive or nonrepetitive.

Decisions Based on Structured Data

The vast majority of corporate decisions are made based on structured data. There are several reasons for this. The primary reason is that structured information is easy to automate. Structured data fit naturally on standard database technology. And once in a database, the data can easily be analyzed inside the corporation. It is easy to read and analyze 100,000 records of structured information. There are plenty of analytic tools that can handle the analysis of standard database records.
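
To make the point concrete, here is a minimal sketch of reading and summarizing 100,000 structured records with standard database technology. The `sales` table, its columns, and the randomly generated values are assumptions made purely for illustration; they do not come from any real system.

```python
import random
import sqlite3


def build_sales_table(n_rows: int = 100_000, seed: int = 0) -> sqlite3.Connection:
    """Load n_rows uniform records into an in-memory table (hypothetical schema)."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    regions = ["north", "south", "east", "west"]
    rows = ((rng.choice(regions), rng.randint(1, 500)) for _ in range(n_rows))
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def totals_by_region(conn: sqlite3.Connection) -> list:
    """Summarize every record with a single declarative query."""
    cursor = conn.execute(
        "SELECT region, COUNT(*), SUM(amount) "
        "FROM sales GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


conn = build_sales_table()
summary = totals_by_region(conn)
for region, count, total in summary:
    print(region, count, total)
```

Because every record has the same shape, one short query is enough to analyze all of them. Nothing comparable exists out of the box for a pile of e-mails or contracts.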

Fig. 4.4.2 shows that most corporate decisions are made based on structured data.

Fig. 4.4.2 In most organizations the vast majority of decisions are made on the basis of structured data.

Despite the fact that most corporate decisions are made on the basis of structured information, there is a wealth of untapped potential in the unstructured information of the corporation. The challenge then is unlocking that potential.

The Business Value Proposition

Fig. 4.4.3 shows that the business value proposition differs for the different types of unstructured data. Repetitive unstructured data can have business value. But that business value is hard to find and hard to unlock. And in many cases, there simply is no business value whatsoever in repetitive unstructured data.

Fig. 4.4.3 Business value varies depending on whether data is repetitive or nonrepetitive.

It is in nonrepetitive unstructured data, however, that the huge business value lies. There are many, many cases where the business value in nonrepetitive unstructured data is very high.

Some of the more obvious cases where there is business value in nonrepetitive unstructured data include the following:

  • E-mails, where customers express their opinions
  • Call center information, where customers have a direct line to the corporation
  • Corporate contracts, where corporate obligations are disclosed
  • Warranty claims, where the manufacturer can find out where the weak points of the manufacturing process are
  • Insurance claims, where the insurance company can assess where profitable business lies
  • Marketing analysis, where direct customer feedback can be analyzed

These cases represent merely the tip of the iceberg for finding and using nonrepetitive unstructured information.

Repetitive and Nonrepetitive Unstructured Information

Fig. 4.4.4 illustrates the stark differences between the repetitive and the nonrepetitive unstructured environments.

Fig. 4.4.4 Representations of repetitive and nonrepetitive data.

As has been discussed in conversations on the “great divide,” there are many differences between the repetitive and the nonrepetitive environments. But perhaps the most relevant difference between the two environments is the ease with which analytic processing can be done.

Fig. 4.4.5 shows that analytic processing is quite easy to do when it comes to working with repetitive unstructured data. But when it comes to doing analysis on nonrepetitive unstructured data, analysis is awkward and difficult to do.

Fig. 4.4.5 Analysis on nonrepetitive data is like fitting a square peg in a round hole.

Ease of Analysis

Fig. 4.4.5 shows that analysis in the repetitive unstructured environment is as easy as putting a square peg in a square hole whereas analysis in the nonrepetitive unstructured environment is as awkward and as difficult as placing a square peg in a round hole.

There are lots of reasons for this major difference between repetitive and nonrepetitive unstructured data. Repetitive unstructured data are easy to analyze because of the following:

  • The records are uniform in shape.
  • The records are usually small and compact.
  • The records are easy to parse because the contextual information in the record is easy to find.

Pretty much the opposite is true of the nonrepetitive unstructured records. Nonrepetitive unstructured records are the following:

  • Very nonuniform in shape.
  • Sometimes small, sometimes large, and sometimes very large.
  • Quite difficult to parse, because the records are made up of text, and text requires an entirely different approach than simple parsing.
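
The contrast between the two lists can be sketched in a few lines of code. The comma-separated call record layout, the field names, and the sample values below are hypothetical, chosen only to show what "uniform in shape" buys the analyst; the e-mail sentence is likewise invented.

```python
from typing import NamedTuple


class CallRecord(NamedTuple):
    """Assumed fixed layout: timestamp, caller, callee, duration in seconds."""
    timestamp: str
    caller: str
    callee: str
    seconds: int


def parse_call_record(line: str) -> CallRecord:
    # Every field is found by position, so one split is the entire parser.
    timestamp, caller, callee, seconds = line.strip().split(",")
    return CallRecord(timestamp, caller, callee, int(seconds))


# Repetitive records: same shape, small, context known in advance.
repetitive = [
    "2024-01-05T09:00:00,555-0101,555-0199,62",
    "2024-01-05T09:01:30,555-0102,555-0147,305",
]
records = [parse_call_record(line) for line in repetitive]
total_seconds = sum(record.seconds for record in records)

# Nonrepetitive record: free text has no fixed positions to split on.
nonrepetitive = (
    "Hi, the dishwasher I bought in May leaks again, "
    "and nobody has called me back."
)
try:
    parse_call_record(nonrepetitive)
    parsed_free_text = True
except ValueError:
    # The positional parser finds the wrong number of fields and gives up.
    parsed_free_text = False

print(total_seconds, parsed_free_text)
```

The positional parser handles any number of call records, yet it fails on a single sentence of customer text. Even if the sentence happened to contain the right number of commas, the resulting "fields" would mean nothing.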

There are probably more differences between these two types of data. But these differences alone warrant the recognition of the “great divide” between the types of unstructured data.

So, what is so difficult about going in and working with text? Fig. 4.4.6 shows some typical text.

Fig. 4.4.6 Some typical text.

There are many reasons why text is so difficult to work with.

First off, there is the discussion of whether text is actually unstructured at all. An English teacher might argue that text is anything but unstructured. There are rules that govern the structure of all text. Some of the rules include the following:

  • Spelling
  • Punctuation
  • Grammar
  • Proper sentence construction

It cannot be argued that there are no rules that govern the creation of proper text. But those rules are so complex that the rules are not obvious and apparent to the computer. From the computer's perspective, text is unstructured simply because the computer cannot understand all the rules of proper textual construction.

Contextualization

There are many parts of text that must be managed if text is to be turned into a form that is useful to the computer. But easily the most important and most complex aspect of text that must be mastered is finding and determining its context. Stated differently, if you do not understand the context of text, you cannot use text for any form of useful decision-making.

Contextualization of text then is the single largest challenge facing the analyst who wishes to use nonrepetitive unstructured text in the decision-making process.

Fig. 4.4.7 shows an example of the importance of understanding context.

Fig. 4.4.7 Text makes no sense without understanding context.

Two gentlemen are standing on a corner, and as a young lady passes by, one gentleman says to the other, “She's hot.”

Now, what is being said here?

One interpretation is that the gentleman finds the young lady to be attractive and he would like to have a date with her.

Another interpretation is that it is Houston, Texas, on a July day and it is 98 degrees and 100% humidity. The lady is wet from pouring sweat. She's hot.

Another interpretation is that the two gentlemen are in a hospital and they are doctors. One doctor has just taken the lady's temperature, and she has a temperature of 104 degrees. She is burning up with fever, and she's hot.

These then are three very different meanings of the words—“She's hot.” Trying to use and interpret these words without understanding the context could lead to disaster and embarrassment.

The need to find and understand context is hardly limited to the words—“She's hot.” The need to find and understand context is true for all words.

The largest challenge facing the analyst who wishes to make sense of nonrepetitive unstructured data then is that of understanding how to contextualize text.

It is noteworthy that there are other challenges as well. As important as contextualization is, it is hardly the only challenge when it comes to doing analysis.

Fig. 4.4.8 shows that finding context in nonrepetitive unstructured data is a major challenge.

Fig. 4.4.8 Finding context.

Some Approaches to Contextualization

The notion that finding context in nonrepetitive unstructured data is a challenge is not a new idea. Indeed, people have been attempting to contextualize text for a long time. The earliest attempt to contextualize text is a technology called “NLP.” NLP stands for natural language processing (or sometimes “natural language programming”).

NLP has been around a long time and has met with modest success. There are several inherent limitations to NLP. The first limitation is that NLP makes the assumption that context of text can be derived from text itself. The problem is that only a small amount of context comes from text itself. In the case of the two gentlemen standing around and saying—“She's hot”—the vast majority of the context comes from external sources, not textual sources. Is the lady young and attractive? Is it Houston, Texas, in the summertime? Is the conversation taking place in a hospital? All of these circumstances that provide context are external to the words that are being spoken.
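
The point that context lives outside the words can be shown with a toy sketch. This is not how NLP or textual disambiguation is implemented. The `Utterance` class, the `setting` field, and the lookup table are all hypothetical, invented here to show that the same text yields different meanings only when external context is supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    text: str
    setting: str  # external context, recorded outside the words themselves


# Hypothetical lookup: (normalized phrase, setting) -> interpretation.
INTERPRETATIONS = {
    ("she's hot", "street corner"): "the speaker finds her attractive",
    ("she's hot", "houston in july"): "she is overheated by the weather",
    ("she's hot", "hospital"): "she has a fever",
}


def interpret(utterance: Utterance) -> str:
    """Return a meaning only when the external setting is known."""
    key = (utterance.text.lower().rstrip(".!"), utterance.setting.lower())
    return INTERPRETATIONS.get(key, "ambiguous: no usable context")


for setting in ("street corner", "Houston in July", "hospital", "unknown"):
    print(setting, "->", interpret(Utterance("She's hot.", setting)))
```

The `text` field is identical in every call. Only the `setting`, which no amount of parsing of the words themselves could recover, changes the result.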

The second limitation of NLP is that NLP does not account for emphasis. Suppose the words are spoken—“I love you.” How are these words to be interpreted?

If you say “I love you” where the emphasis is on “I,” the meaning is that it is me and not someone else who loves you. If the emphasis is on the word “love,” the meaning is that the emotion I feel is strong, one of love. I do not merely like you; I actually love you. If the emphasis is on the word “you,” the meaning is that it is you and not someone else that I love.

So, the same words can have very different meanings based on the way the words are said.

But there is a very different reason why NLP has had a hard time showing concrete results. That reason is that NLP—in order to be implemented effectively—must understand the logic behind words. The problem is that the English language has evolved over many years and many circumstances, and at the end of the day, the logic behind the English language is very complex. Trying to map out the logic of the English language is very difficult to do. It is tortuous.

For these reasons (and probably more), NLP has met with only modest success.

A much more practical approach is that of textual disambiguation.

Fig. 4.4.9 shows the two approaches toward contextualization of text.

Fig. 4.4.9 NLP does not do a good job of finding and managing context of text.

In later chapters, much more will be said about textual disambiguation.

MapReduce

Another approach to contextualization that is found in big data is that of a technology called MapReduce. Fig. 4.4.10 shows MapReduce.

Fig. 4.4.10 MapReduce can be used to address text.

MapReduce is a programming model for the technician that can be used to do all sorts of useful things in big data. However, the number of lines of code that must be written and maintained and the sheer complexity of contextualizing nonrepetitive unstructured data limit the usefulness of MapReduce for that purpose.
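
For readers who have not seen the pattern, the following is a minimal, single-process sketch of the map, shuffle, and reduce steps, applied to the classic task of counting words. It assumes nothing about any particular big data platform, and the function names and sample documents are illustrative only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for raw in document.lower().split():
        word = raw.strip(".,!?;:\"'")
        if word:
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["She's hot.", "It is hot in Houston."])
print(counts)
```

Even this small example needs three cooperating functions, and all it establishes is how often each word occurs. It says nothing about which meaning of "hot" is intended, which is where the real coding effort for contextualization would begin.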

Manual Analysis

There is one other time-honored approach to analyzing nonrepetitive unstructured data. That approach is to do things manually. Fig. 4.4.11 shows that nonrepetitive unstructured data can be analyzed manually.

Fig. 4.4.11 Manual analysis is appealing for small, one-time-only projects.

The great appeal of doing analysis manually is that no infrastructure is required. The only thing that is required is a human being who is capable of reading and analyzing information. So, a person can start analyzing nonrepetitive unstructured information right away.

The great drawback of doing analysis like this manually is that the human brain can only absorb so much information. There is no contest between the amount of information a computer can absorb and digest versus what a human can absorb and digest.

Fig. 4.4.12 shows that when it comes to reading and storing information in a database, a computer far outstrips even the brightest of human beings.

Fig. 4.4.12 In order to do analytical processing, text needs to be placed in a database.

It simply is no contest.
