There are different definitions of big data. The definition used here is that big data encompasses very large volumes of data, is based on inexpensive storage, manages data by the “Roman census” method, and stores data in an unstructured format. There are two major types of big data: repetitive big data and nonrepetitive big data. Only a small fraction of repetitive big data has business value, whereas almost all nonrepetitive big data has business value. In order to achieve business value, the context of the data in big data must be determined. Contextualization of repetitive big data is easily achieved, but contextualization of nonrepetitive data must be done by means of textual disambiguation.
Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data; Contextualization; Textual disambiguation
It is estimated that over 80% of the data in the corporation is unstructured. There are many different forms of unstructured information. There is video. There is audio. There are images. But far and away the most interesting and useful form of unstructured data is textual information.
Textual information is found everywhere in the corporation. Text is found in contracts, in e-mail, in reports, in memoranda, in human resource evaluations, and so forth. In a word, textual information makes up the fabric of corporate life, and that is true for every corporation.
Unstructured information can be broken into two major categories—repetitive unstructured data and nonrepetitive unstructured data. Fig. 4.4.1 shows the categories that describe all corporate data.
The vast majority of corporate decisions are made based on structured data, and there are several reasons for this. The primary reason is that structured information is easy to automate. Structured data fit naturally and normally on standard database technology, and once on database technology, the data can easily be analyzed inside the corporation. It is easy to read and analyze 100,000 records of structured information, and there are plenty of analytic tools that can handle the analysis of standard database records.
Fig. 4.4.2 shows that most corporate decisions are made based on structured data.
Despite the fact that most corporate decisions are made on the basis of structured information, there is a wealth of untapped potential in the unstructured information of the corporation. The challenge then is unlocking that potential.
Fig. 4.4.3 shows that there is a different business value proposition for the different types of unstructured data. Repetitive unstructured data can have business value, but that value is hard to find and hard to unlock, and in many cases, there simply is no business value whatsoever in repetitive unstructured data.
It is in nonrepetitive unstructured data, however, that there is huge business value. There are many cases where the business value in nonrepetitive unstructured data is very high.
Some of the more obvious cases where there is business value in nonrepetitive unstructured data include contracts, e-mail, and human resource evaluations. These cases represent merely the most obvious tip of the iceberg for finding and using nonrepetitive unstructured information.
Fig. 4.4.4 illustrates the visceral differences between the repetitive and the nonrepetitive unstructured environments.
As has been discussed in conversations on the “great divide,” there are many differences between the repetitive and the nonrepetitive environments. But perhaps the most relevant difference between the two environments is the ease with which analytic processing can be done.
Fig. 4.4.5 shows that analytic processing is quite easy to do when it comes to working with repetitive unstructured data: it is as easy as putting a square peg in a square hole. Analysis of nonrepetitive unstructured data, by contrast, is as awkward and as difficult as placing a square peg in a round hole.
There are lots of reasons for this major difference between repetitive and nonrepetitive unstructured data. Repetitive unstructured data are easy to analyze because the records occur over and over in the same structure, so the same processing applies to every record. Pretty much the opposite is true of nonrepetitive unstructured records: each record differs from the next in structure and content, so no single processing pattern fits them all.
There are probably more differences between these two types of data. But these differences alone warrant the recognition of the “great divide” between the types of unstructured data.
So, what is so difficult about going in and working with text? Fig. 4.4.6 shows some typical text.
There are many reasons why text is so difficult to work with.
First off, there is the question of whether text is actually unstructured at all. An English teacher might argue that text is anything but unstructured. There are rules that govern the structure of all text, among them the rules of spelling, grammar, punctuation, and syntax.
There is no arguing that rules govern the creation of proper text. But those rules are so complex that they are not obvious or apparent to a computer. From the computer's perspective, text is unstructured simply because the computer cannot understand all the rules of proper textual construction.
There are many aspects of text that must be managed if text is to be turned into a form that is useful to the computer. But by far the most important and the most complex aspect of text that must be mastered is finding and determining the context of text. Stated differently, if you do not understand the context of text, you cannot use text for any form of useful decision-making.
Contextualization of text then is the single largest challenge facing the analyst who wishes to use nonrepetitive unstructured text in the decision-making process.
Fig. 4.4.7 shows an example of the importance of understanding context.
Two gentlemen are standing on a corner, and one gentleman says to the other as a young lady passes by—“She's hot.”
Now, what is being said here?
One interpretation is that the gentleman finds the young lady to be attractive and he would like to have a date with her.
Another interpretation is that it is Houston, Texas, on a July day, and it is 98 degrees with 100% humidity. The lady is drenched in sweat. She's hot.
Another interpretation is that the two gentlemen are in a hospital and they are doctors. One doctor has just taken the lady's temperature, and she has a temperature of 104 degrees. She is burning up with fever, and she's hot.
These then are three very different meanings of the words—“She's hot.” Trying to use and interpret these words without understanding the context could lead to disaster and embarrassment.
The need to find and understand context is hardly limited to the words—“She's hot.” The need to find and understand context is true for all words.
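The role that external context plays in resolving meaning can be illustrated with a minimal sketch. The code below is hypothetical (the function and context labels are invented for illustration); it shows only that when the words themselves are held constant, the interpretation must come from information supplied from outside the text.

```python
# Hypothetical sketch: the same phrase resolves to different meanings
# depending on external context supplied alongside the words.

def interpret(phrase, context):
    """Resolve the meaning of a phrase from external context,
    not from the words alone."""
    if phrase != "She's hot":
        return "unknown phrase"
    meanings = {
        "street corner": "the speaker finds the lady attractive",
        "houston in july": "the lady is overheated from the weather",
        "hospital": "the lady is running a high fever",
    }
    return meanings.get(context, "meaning cannot be resolved")

print(interpret("She's hot", "hospital"))
# -> the lady is running a high fever
print(interpret("She's hot", "street corner"))
# -> the speaker finds the lady attractive
```

The point of the sketch is that the lookup key is the context, not the phrase: strip the context away and the program, like the listener, has no way to choose among the three readings.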
The largest challenge facing the analyst who wishes to make sense of nonrepetitive unstructured data then is that of understanding how to contextualize text.
It is noteworthy that there are other challenges as well. As important as contextualization is, it is hardly the only challenge when it comes to doing analysis.
Fig. 4.4.8 shows that finding context in nonrepetitive unstructured data is a major challenge.
The notion that finding context in nonrepetitive unstructured data is a challenge is not a new idea. Indeed, people have been attempting to contextualize text for a long time. The earliest attempt at contextualizing text is a technology called “NLP,” which stands for natural language processing (or sometimes “natural language programming”).
NLP has been around a long time and has met with modest success. There are several inherent limitations to NLP. The first limitation is that NLP assumes that the context of text can be derived from the text itself. The problem is that only a small amount of context comes from the text itself. In the case of the two gentlemen standing around and saying “She's hot,” the vast majority of the context comes from external sources, not textual sources. Is the lady young and attractive? Is it Houston, Texas, in the summertime? Is the conversation taking place in a hospital? All of these circumstances that provide context are external to the words that are being spoken.
The second limitation of NLP is that NLP does not account for emphasis. Suppose the words are spoken—“I love you.” How are these words to be interpreted?
If you say “I love you” where the emphasis is on “I,” the meaning is that it is me and not someone else who loves you. If the emphasis is on the word “love,” the meaning is that the emotion I feel is strong, one of love. I don’t like you—I actually love you. If the emphasis is on the word “you,” the meaning is that it is you and not someone else that I love.
So, the same words can have very different meaning based on the way the words are said.
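The emphasis problem can be made concrete with another hypothetical sketch. Written text carries no marker for spoken stress, so a program handed only the string "I love you" cannot distinguish the readings described above; the stressed word has to arrive as a separate, external input (the function and its interpretation table below are invented for illustration).

```python
# Hypothetical sketch: the written sentence is identical in every case;
# only the externally supplied stressed word selects the reading.

def reading(stressed_word):
    interpretations = {
        "I": "it is I, and not someone else, who loves you",
        "love": "the emotion I feel is strong: love, not mere liking",
        "you": "it is you, and not someone else, whom I love",
    }
    return interpretations.get(stressed_word, "emphasis unknown")

for stress in ("I", "love", "you"):
    print(stress, "->", reading(stress))
```

Note that nothing in the sentence itself appears in the lookup: the dimension NLP needs is simply absent from the text.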
But there is a very different reason why NLP has had a hard time showing concrete results. That reason is that NLP—in order to be implemented effectively—must understand the logic behind words. The problem is that the English language has evolved over many years and many circumstances, and at the end of the day, the logic behind the English language is very complex. Trying to map out the logic of the English language is very difficult to do. It is tortuous.
For these reasons (and probably more), NLP has met with only modest success.
A much more practical approach is that of textual disambiguation.
Fig. 4.4.9 shows the two approaches toward contextualization of text.
In later chapters, much more will be said about textual disambiguation.
Another approach to contextualization that is found in big data is that of a technology called MapReduce. Fig. 4.4.10 shows MapReduce.
MapReduce is a programming model for the technician that can be used to do all sorts of useful things in big data. However, the number of lines of code that must be written and maintained, together with the sheer complexity of contextualizing nonrepetitive unstructured data, limits the usefulness of MapReduce for the purpose of contextualizing nonrepetitive unstructured data.
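The shape of the MapReduce model can be sketched in a few lines of plain Python. This is a single-machine illustration only (real MapReduce jobs run on frameworks such as Hadoop across many machines); the classic word-count example below shows why the model excels at repetitive operations like counting, while deriving context would demand far more logic than this.

```python
# Minimal single-machine word-count sketch of the MapReduce model.
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the counts for each distinct word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["the contract was signed", "the contract expired"]
print(reduce_phase(map_phase(docs)))
# -> {'the': 2, 'contract': 2, 'was': 1, 'signed': 1, 'expired': 1}
```

Counting words fits the map-then-reduce pattern exactly; determining that "contract" here refers to a legal agreement rather than, say, a disease contracted, does not, and that is the limitation described above.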
There is one other time-honored approach to analyzing nonrepetitive unstructured data. That approach is to do things manually. Fig. 4.4.11 shows that nonrepetitive unstructured data can be analyzed manually.
The great appeal of doing analysis manually is that no infrastructure is required. The only thing that is required is a human being who is capable of reading and analyzing information. So, a person can start doing analysis of nonrepetitive unstructured information right away.
The great drawback of doing analysis manually is that the human brain can only absorb so much information. There is no contest between the amount of information a computer can absorb and digest and what a human can absorb and digest.
Fig. 4.4.12 shows that when it comes to reading and storing information in a database, a computer far outstrips even the brightest of human beings.
It simply is no contest.