
Chapter 8
Textual Data Pond

The third kind of data collection is the textual data pond. There is much textual data in the corporate world. Unfortunately, very little textual data is ever converted to a state where it’s fit for analysis or is ever used as a basis for decision making. Yet there is a tremendous amount of textual information that has a lot of potential. There is no reason why textual data that finds its way to the textual data pond cannot be used for analytical processing.

Uniform Data and the Computer

The reason why textual data has such a hard time being used for corporate decision making is that the computer requires data to be served up in a uniform manner. The computer is good at reading one record, processing it, and then reading another record that is in the same format as the previous one. The system thrives on repetitive processing. When the computer has to change its mindset with every record, it struggles. And with text, every word must be treated as a completely new universe.

For this reason, text has been treated in a very superficial manner within the bounds of computerized processing. Computer technology and text (i.e., narration) are the digital equivalent of oil and water. They just don’t mix well.

Valuable Text

Some of the many places where text contains valuable information for making managerial decisions include:

  • Corporate contracts
  • Corporate call center conversations
  • Customer feedback
  • Medical records
  • Insurance claims
  • Human resource records
  • Insurance policies
  • Loan applications
  • Corporate memos
  • And many other places.

However, most corporations collect their text, put it in a file, and never look at or analyze it again. The info just sits in a file and collects dust.

There is a good reason why corporations don’t look at text: there’s so much of it. If a person were to sit down and read a large collection of text, the person would be able to recall only a small fraction of what was read. The human brain is just not a good processor of large amounts of textual data.

Textual Disambiguation

A profound technology called “textual disambiguation” has changed the ability of text to be used for decision making. That technology is used for reading and analyzing text and then transforming text into a standard database format, with the context of the text identified in the database format.

Most corporations have not yet discovered textual disambiguation. That’s why most text arrives in the textual data pond in a state of raw text. Sometimes text arrives as formal language, informal notes, slang, vulgarities or even other languages.

The most common text forms are emails, tweets and other social media, but the data can also arrive via physical reading technology such as OCR (optical character recognition) or voice transcription. However it arrives at the textual data pond, documents and text are usually still in the form of unstructured (to a computer) narration.

Text Sent to the Data Pond

Fig 8.1 shows documents that have been captured and have been sent to the textual data pond.


Fig 8.1 Sending documents to the textual data pond

If the corporation attempts to read and make sense of the text in its raw, narrative state, it will find that only a very superficial analysis can be done. If the corporation is serious about making use of the textual data pond, it is mandatory to pass the raw text through textual disambiguation.

Note that textual disambiguation is merely another form of transforming and conditioning data. The need to condition and transform data is seen in both the analog data pond and the application data pond. However, textual disambiguation is very different from data reduction or integration of application data.

So it is not unusual that textual data in the pond needs to go through its own conditioning and transformation process. What is noteworthy is that the different processes used for conditioning and transforming the data ponds are completely different from each other. There is very, very little overlap (if any) among the different techniques used to condition and transform data in the different data ponds. Fig 8.2 shows the need for textual disambiguation in the textual data pond.


Fig 8.2 Applying textual disambiguation in the textual data pond

Output of Textual Disambiguation

The net effect of textual disambiguation is the ability to store text in a standard, uniformly structured database and to store the text along with its context. Once text is restructured into that format, the text can be read and analyzed by standard analytical processors.

In order to store the text in a standard database format, it is necessary to organize the text into records. Each record holds the text that has been processed, along with its context, the byte offset of the text within the document, and the name of the document. In order to visualize how this might look, consider the following example in Fig 8.3.


Fig 8.3 Disambiguating text example

Here, a lease has been made between an individual and a corporation. The text defines the terms of the lease. The lease has been read and passed through textual disambiguation. Once processed, the text has been reduced to a database format. In the database format are the identification of the document, the byte address of the text that has been captured, the text itself, and the context of the text.
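To make that record structure concrete, the following is a minimal sketch of what one disambiguated record might look like. The field names and sample values are hypothetical and are used purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class DisambiguatedRecord:
    """One unit of text after textual disambiguation (hypothetical field names)."""
    document_name: str  # the source document the text came from
    byte_offset: int    # where the captured text begins within that document
    text: str           # the text that was captured
    context: str        # the context assigned by textual disambiguation

# A record of the kind that might result from processing the lease in Fig 8.3
record = DisambiguatedRecord(
    document_name="lease-2017-0418.txt",
    byte_offset=1045,
    text="Bill Inmon",
    context="leaseholder",
)
```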

Once the text has been reduced to the form of a database and once the context of the text has been determined, the text can then be read by a computer and processed analytically. It is interesting to examine the functions that textual disambiguation performs in the act of disambiguating text.

Inherent Complexity

Language is inherently complex, so it is no surprise that textual disambiguation is quite complex as well. Indeed, there are over 90 different functions that algorithmically define the inner workings of textual disambiguation. Some (but by no means all!) of the more interesting functions are described here, with a simple illustrative sketch following the list:

  • Inline contextualization. Inline contextualization is the process of identifying text and its context by examining the words that surround it. For example, given the text …signed by Bill Inmon, leaseholder…, the leaseholder is identified as “Bill Inmon.” Inline contextualization only works on text that has predictable occurrences of data, such as a contract.
  • Proximity. Words in proximity to each other have different meanings than words not in proximity to each other. Given the text …Denver Broncos won the Super Bowl…, the words Denver Broncos are taken to mean a professional football team. Proximity analysis works on words in any order.
  • Alternate spelling. In England, the word color is spelled colour. Alternate spelling analysis applies to many kinds of terms.
  • Homographic resolution. In many cases the interpretation of a word or acronym is shaped by the understanding of who wrote the term. A cardiologist interprets “ha” as heart attack. An endocrinologist interprets “ha” as hepatitis A, while a general practitioner interprets “ha” as headache, and so forth. Homographic resolution is a sophisticated form of alternate spelling.
  • Acronym resolution. In the military, AWOL means absent without leave. Acronym resolution is a form of alternate spelling.
  • Custom variable recognition. In the US, the digits 999 999 9999 are interpreted to mean a telephone number. Corporations have many variables which are recognizable by the structure of the variable itself.
  • Taxonomy resolution. When a document refers to a Volkswagen or a Honda, it is referring to a car. Taxonomy resolution is the single most important function of textual disambiguation.
  • Date standardization. July 5, 1999 is the same thing as 1999/07/05. Date standardization is very common and is very useful.
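As a rough illustration of how a few of these functions might work, consider the sketch below. The lookup tables and patterns are hypothetical and far smaller than anything used in practice; this is a sketch of the idea, not an implementation of any particular product.

```python
import re
from datetime import datetime

# Tiny, hypothetical lookup tables -- real systems hold thousands of entries.
ACRONYMS = {"AWOL": "absent without leave"}
ALTERNATE_SPELLINGS = {"colour": "color"}
CAR_TAXONOMY = {"honda", "porsche", "volkswagen", "ford", "toyota"}
PHONE_PATTERN = re.compile(r"\b\d{3} \d{3} \d{4}\b")  # custom variable: US phone number

def standardize_date(raw: str) -> str:
    """Date standardization: 'July 5, 1999' -> '1999/07/05'."""
    return datetime.strptime(raw, "%B %d, %Y").strftime("%Y/%m/%d")

def resolve_word(word: str) -> list:
    """Apply acronym resolution, alternate spelling, and taxonomy resolution to one word."""
    contexts = []
    if word.upper() in ACRONYMS:
        contexts.append(ACRONYMS[word.upper()])
    if word.lower() in ALTERNATE_SPELLINGS:
        contexts.append("spelling: " + ALTERNATE_SPELLINGS[word.lower()])
    if word.lower() in CAR_TAXONOMY:
        contexts.append("car")
    return contexts

print(standardize_date("July 5, 1999"))            # 1999/07/05
print(resolve_word("Honda"))                       # ['car']
print(PHONE_PATTERN.findall("call 303 555 1234"))  # ['303 555 1234']
```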

This short list of functions merely reflects some of the more prominent functions of textual disambiguation. There are many more functions that need to be accomplished by textual disambiguation in order for text to be reduced to the form of a database.

It is noteworthy that merely processing text is not enough to do analytic processing. In order to do effective analytic processing, it is necessary to identify and to process context as well. And context of text is much more difficult to handle than the text itself.

Textual Disambiguation Functionality

Fig 8.4 shows some of the functions of textual disambiguation.


Fig 8.4 Disambiguating text functions

Taxonomies and Ontologies

Each data pond has a target that allows the data in the pond to relate to the business of the organization. In the application data pond, that target is the corporate data model. But the corporate data model does not relate well to the world of text. Instead, in text there are taxonomies and ontologies.

Taxonomies are classifications of terms. There are many, many taxonomies in the world. As two simple examples of taxonomies, consider the following:

Car
  • Honda
  • Porsche
  • Volkswagen
  • Ford
  • Toyota

Or

Tree
  • Elm
  • Pine
  • Fir
  • Oak
  • Walnut

A taxonomy, then, is nothing but a classification of terms. An ontology is a grouping of related taxonomies. As an example of an ontology, consider the following two taxonomies:

Country
  • USA
  • Canada
  • Mexico
  • Australia
  • South Africa

And

USA
  • Texas
  • New Mexico
  • Arizona
  • Colorado

The relationship between these two taxonomies is that the USA is made up of states. Together the two taxonomies form an ontology.
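One simple way to picture this in data terms is as classifications mapped onto their members, with an ontology grouping the related classifications together. The sketch below is purely illustrative; the names and structure are assumptions, not a prescribed representation.

```python
# Taxonomies as simple classification-to-members mappings (illustrative only)
country_taxonomy = {"country": ["USA", "Canada", "Mexico", "Australia", "South Africa"]}
usa_taxonomy = {"USA": ["Texas", "New Mexico", "Arizona", "Colorado"]}

# An ontology as a grouping of related taxonomies:
# "USA" is a member of the country taxonomy and is itself a classification of states.
geography_ontology = {**country_taxonomy, **usa_taxonomy}

print(geography_ontology["country"])  # ['USA', 'Canada', ...]
print(geography_ontology["USA"])      # ['Texas', 'New Mexico', ...]
```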

There are an almost infinite number of taxonomies (and ontologies). Taxonomies and ontologies form the target foundation for the textual data pond as seen in Fig 8.5.


Fig 8.5 Leveraging taxonomies and ontologies in the textual data pond

Value of Text and Context

The value of having both text and context can be illustrated very easily. Suppose you have some text about people. You have different names in your text. You have Joe and you have Susan and you have Mike and you have Terry. Now suppose you want to find a Joe who is an army officer. If you do a query against all Joes, you get bartenders, convicts, newborn babies and airplane pilots. But if you have information about all the Joes that have been identified by context, say government employees, you can now make a query and find the Joe (or Joes) who are army officers.

Context allows you to qualify exactly what you are looking for, and to the business analyst that is a necessary condition.

Fig 8.6 depicts the usage of context in a query format.


Fig 8.6 Applying context in the textual data pond

In this case a query would be made that says:

Find all occurrences of “Joe” where context = army officer.

The result of the query would be references to all people named Joe who are army officers.
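A minimal sketch of such a query against the disambiguated records is shown below. The records here are invented, and each one simply carries (document name, byte offset, text, context) as described earlier.

```python
# Hypothetical disambiguated records: (document_name, byte_offset, text, context)
records = [
    ("personnel-001.txt", 220, "Joe", "army officer"),
    ("personnel-072.txt", 318, "Joe", "bartender"),
    ("personnel-145.txt", 101, "Susan", "airplane pilot"),
]

def find_by_text_and_context(records, text, context):
    """Find all occurrences of the given text where the context matches."""
    return [r for r in records if r[2] == text and r[3] == context]

# Find all occurrences of "Joe" where context = army officer
print(find_by_text_and_context(records, "Joe", "army officer"))
# [('personnel-001.txt', 220, 'Joe', 'army officer')]
```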

Tracing Text Back to the Source

In case there is a question as to the validity or accuracy of the query, any reference to text can always be traced easily and quickly back to the originating source.

The reason why it is easy to trace a reference back to the originating source is that when textual disambiguation is done, the byte offset within the document and the name of the document are stored with each reference. Therefore, whenever you have a question about the work that has been done by textual disambiguation, you can always go back to the original document and verify that disambiguation was performed correctly.
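Because each reference carries the document name and the byte offset, tracing back amounts to opening the source document and seeking to that offset. A minimal sketch, with a hypothetical file name:

```python
def trace_back(document_name: str, byte_offset: int, length: int) -> str:
    """Return the original text found at the given byte offset of the source document."""
    with open(document_name, "rb") as source:
        source.seek(byte_offset)
        return source.read(length).decode("utf-8", errors="replace")

# Verify that the captured text really appears at that spot in the original lease
# original = trace_back("lease-2017-0418.txt", 1045, len("Bill Inmon"))
```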

Mechanics of Disambiguation

As an example of the mechanics of disambiguation, consider the taxonomy developed to identify sentiment. Sentiment occurs in many places – in tweets, in emails, in documents, and so forth. It is often quite useful to gauge tone in the message. The way tone is evaluated is through the usage of a sentiment taxonomy. Fig 8.7 shows a simple taxonomy that can be used to identify sentiment in text.


Fig 8.7 Applying sentiment analysis in the textual data pond

In reality, a sentiment taxonomy would be much more involved than that shown in the figure. The simple taxonomy above is merely for the purposes of illustration.

Textual disambiguation reads the raw text and then matches the contents of the taxonomy against the raw text that is being analyzed. When a word is discovered that matches a word in the taxonomy, the inference is made that the message has an expression of sentiment. In such a fashion, a document can be analyzed and the tone of the document gauged.
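A minimal sketch of that matching step is shown below, using a tiny, invented sentiment taxonomy. A production taxonomy would be far larger and the weighting far more nuanced.

```python
# Tiny hypothetical sentiment taxonomy -- illustration only
SENTIMENT_TAXONOMY = {
    "positive": {"great", "delicious", "friendly", "clean"},
    "negative": {"slow", "salty", "rude", "dirty"},
}

def gauge_tone(message: str) -> dict:
    """Match taxonomy words against a raw message and count expressions of sentiment."""
    words = [w.strip(".,!?") for w in message.lower().split()]
    return {
        tone: sum(1 for w in words if w in vocabulary)
        for tone, vocabulary in SENTIMENT_TAXONOMY.items()
    }

print(gauge_tone("The waiter was slow and the soup was salty, but the staff were friendly."))
# {'positive': 1, 'negative': 2}
```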

Once the tone of the document is weighted and placed into a database, then multiple messages can be analyzed by the computer using standard analytical and standard visualization technology.

Analyzing the Database

Once a database has been created, the computer can perform the heavy lifting of analysis. As an example, suppose there is a restaurant chain receiving feedback from its customers. Many customers send messages on a daily basis.

The messages cover a wide spectrum of topics. Some discuss the menu. One item was too salty. Another was too hot. Another came in portions that were too small. Others discuss the waiter/waitress. The waiter was slow. The waiter had a bad attitude. The waitress was very nice. Some topics dealt with cleanliness. The floor was wet. The table was not wiped. The lights were too dim. Other topics were about almost anything you could imagine – the parking lot, the restroom, the vending machines, and so forth.

In a month’s time, the restaurant chain receives over 100,000 messages from its customers. There simply are too many messages for any person to read and assimilate the information contained in them. Yet feedback from the customer is critical to the customer experience. And the happiness of the customer is the key to customer loyalty and repeat business. It greatly behooves the restaurant chain to listen to its customers.

So the restaurant chain decides to run its customer feedback through textual disambiguation. After the 100,000 monthly messages are read, a database is created. The database is then read into standard analytical software, which allows the chain to send canned, but still personalized, automated responses.
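As a hint of the kind of aggregation such analytical software might perform, the sketch below tallies invented, already classified messages by category and tone; the categories and data are hypothetical.

```python
from collections import Counter

# Hypothetical output of textual disambiguation: one (category, tone) pair per message
classified_messages = [
    ("service", "negative"),
    ("on menu", "negative"),
    ("ambience", "positive"),
    ("service", "positive"),
    ("on menu", "negative"),
]

# Count positive and negative comments per category, ready for visualization
tallies = Counter(classified_messages)
for (category, tone), count in sorted(tallies.items()):
    print(f"{category:10s} {tone:9s} {count}")
```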

Visualizing the Results

A visualization is produced that looks like that seen in Fig 8.8.


Fig 8.8 Visualizing feedback in the textual data pond

The comments are divided into several categories:

  • Off menu items – non entrée items on the menu
  • On menu items – the entrees served by the restaurant chain
  • Service – comments about the waiters, waitresses, cashier, manager, etc.
  • Price – the cost of food
  • Curb – comments about the exterior of the restaurant and service outside the restaurant
  • Ambience – how clean the restaurant is, what atmosphere the restaurant has
  • Promotions – what comments there are about the promotions done by the restaurant.

A word of wisdom is needed on the interpretation of comment sentiment. At first glance it appears there are a lot of negative comments. But experience has shown that people are more inclined to message a restaurant when they have a negative experience. When a person goes to a restaurant and has a pleasant experience, there is rarely feedback for the restaurant. Therefore a ratio of 85:15 of negative to positive comments is the normal expectation.

If a restaurant is getting more than 85% negative comments, then something is wrong. If it is getting less than 85% negative comments, that branch is doing something right.

Looking at the comments and the sentiment expressed in Fig 8.8 shows some surprising results. One is that there are almost no comments at all about price. This is an indication to management that it may not be charging enough for its food. The same sort of observation can be made about promotions. There simply are no comments about the promotions the restaurant is doing. This implies that the promotions are ineffective. The message that the restaurant chain is not charging enough and that it ought to be doing more effective promotions is really important to the management of the chain.

In Summary

The textual data pond is the place where text resides. In order to be effective, text must go through a transformation and conditioning process. The transformation and conditioning process is called textual disambiguation.

The net result of textual disambiguation is the creation of text in a standard database format where both text and context have been identified. Fig 8.9 shows the textual data pond.


Fig 8.9 Analyzing text in the textual data pond
