Chapter 17.1

Managing Text

Abstract

In most organizations, text forms the basis of the majority of corporate data. Yet many corporations do little or nothing with their text. For many years, there were technological reasons why text was so difficult to handle. But in today's world, text is easily manageable. Organizations find that there is a wealth of value to be attained by addressing and employing the text within the corporate walls.

Keywords

Text; DBMS; NLP; Stemming; Soundex; Taxonomy; Blob; Stop word; Context; Textual ETL; In-line contextualization; Postprocessing; Preprocessing

Text is Wednesday's child of technology. It has been forgotten and abandoned, to the point that organizations act as if they don't have any text, much less text that contains important data. Yet in most corporations, some of the most important information is bound up in text.

For years, it was not possible to read text automatically and use it in the decision-making process. But that has changed. Today, it is possible to read text and to include it in standard databases. In doing so, text has become an important source of data in the corporate decision-making process.

The Challenge of Text

There are many valid reasons why text is so difficult to work with and manage. The primary reason is that text does not fit well into a standard database management system. Stated differently, the fit between text and a database management system is awkward at best and a total mismatch at worst.

A standard database management system requires data to be tightly structured. The DBMS requires that fields of data be uniform in size, that attributes be definable, and that keys be readily available in order to store the data. The very essence of a DBMS is uniformity in the units of data held in the database. Text meets none of those requirements.

DBMS requirements are rigid. DBMS requirements are nonnegotiable. You either arrange data the way the DBMS wants you to or you don’t use a database.

And text is free-form. No one tells the author of words or the speaker of words what to say. The very essence of communication with language is the freedom to express oneself as one desires. Every person expresses things differently.

Fig. 17.1.1 shows that the nonuniform nature of text does not fit with a standard database management system.

Fig. 17.1.1
Fig. 17.1.1 Fitting text into a standard database is an awkward thing, at best.

The misfit between a DBMS and text has been noticed for a long time. Indeed, a series of solutions (or attempts at solutions) has evolved over the years, a progression that tries to address the different problems that arise when trying to place text into a database.

Fig. 17.1.2 depicts the evolution.

Fig. 17.1.2
Fig. 17.1.2 The evolution of textual integration into database technology.

The first attempt to manage text was to create a standard field definition and stuff text into it. Structural definitions such as text field char(1000) were created. While it is possible to place text in a field such as the one described, there were plenty of problems. Some text entries were much shorter than 1000 bytes (thus wasting space), and some text entries were much larger than 1000 bytes (creating a complexity). From a size standpoint alone, merely defining a field of data was not an effective solution. The length of the field was always either too long or too short (or both).
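
To make the size problem concrete, here is a minimal sketch in Python, illustrative only and not tied to any particular DBMS, of what happens when free-form text is forced into a hypothetical fixed char(1000) field:

```python
# Illustrative sketch only: forcing free-form text into a hypothetical
# fixed-length char(1000) field. Not tied to any particular DBMS.
FIELD_LENGTH = 1000

def store_in_fixed_field(text: str) -> str:
    """Fit free-form text into a fixed-length field."""
    if len(text) > FIELD_LENGTH:
        return text[:FIELD_LENGTH]   # too long: the tail is silently lost
    return text.ljust(FIELD_LENGTH)  # too short: the rest is wasted padding

note = "Customer called to complain about a late shipment."
stored = store_in_fixed_field(note)
print(len(note), len(stored))  # 50 1000 -- 950 bytes wasted on padding
```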

The next approach by the DBMS vendors was to allow a field called a "blob" to be defined. The blob would accept any length of text, thus solving the problem of defining the length of a field properly. But placing text into a blob solved only that one problem. Once text was placed into a blob, nothing of real substance could be done with it other than storing it. Trying to do any meaningful analysis on text inside a blob was extremely difficult.

The next step in dealing with text destined for a database was to employ the practice of "stemming." Stemming is the practice of identifying words that are related at their root stem. For example, the word move is related to the words moving, mover, moved, moves, and so forth. The word move is the stem of the other words. Stemming was the first real step toward the systematic analysis of words. However, it was an interesting exercise with little practical use on its own.
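
The idea can be illustrated with a deliberately naive suffix-stripping sketch. Real systems use published algorithms such as Porter stemming, but the principle of reducing related words to a common root is the same:

```python
# A naive suffix-stripping stemmer -- a sketch of the idea only.
# Real systems use algorithms such as Porter stemming.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")

def stem(word: str) -> str:
    """Reduce a word to a crude root stem by stripping common suffixes."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["move", "moving", "mover", "moved", "moves"]])
# ['move', 'mov', 'mov', 'mov', 'mov']
# (a real stemmer would map 'move' to the same stem as the others)
```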

Along with stemming came the practice of soundex. In soundex, words are encoded and classified according to their sound. Like stemming, soundex had few practical applications on its own. However, both stemming and soundex were the first steps toward dealing with text systematically.
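
Soundex itself is a well-known algorithm: keep the first letter, map the remaining consonants to digit classes, collapse adjacent duplicates, and pad the result to four characters. A minimal Python rendering of the classic rules:

```python
def soundex(word: str) -> str:
    """Encode a word with the classic Soundex rules: keep the first
    letter, map consonants to digit classes, collapse runs, pad to 4."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue                 # h and w do not break a run of one code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- same sound, same code
```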

The next step was the practice of identifying and removing stop words. Stop words are words that are needed for proper communication but are extraneous to the meaning of what is being said. Typical stop words are words such as "a," "and," "the," and "to."

In a way, stop word removal was the first significant practical step in dealing with text. Stop word removal erased words that "got in the way" and removed unnecessary text from further consideration.
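
A minimal sketch of stop word removal follows. The stop word list here is illustrative; real lists run to hundreds of words:

```python
# A small illustrative stop word list -- real lists run to hundreds of words.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "it"}

def remove_stop_words(text: str) -> list[str]:
    """Return the words of the text with stop words removed."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The shipment is going to arrive in the morning"))
# ['shipment', 'going', 'arrive', 'morning']
```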

After stop word removal came tagging. Tagging is the practice of examining a document and finding and identifying desired words within it. Tagging words inside a document is an effective way to start to understand what is inside the document. However, tagging has several drawbacks. The first is that in order to tag a document, you have to know what words you are looking for before you ever do the tagging. This presupposes that you know what a person is going to say before they say it, and in most circumstances, that is a fallacious assumption. The second drawback is that there is a lot more to understand about text than the mere identification of words.

Nevertheless, tagging was a real step forward in the management of text.
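
A sketch of tagging is shown below; it also makes the first drawback visible, since the tag list must exist before any document is read. The tags themselves are hypothetical:

```python
# The tag list must be known in advance -- the drawback noted above.
TAGS = {"payment", "refund", "complaint", "cancel"}

def tag_document(document: str) -> set[str]:
    """Return the tags whose words were found in the document."""
    words = set(document.lower().split())
    return TAGS & words

doc = "Customer demanded a refund and threatened to cancel the account"
print(tag_document(doc))  # {'refund', 'cancel'} (order may vary)
```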

The next step in the progression of putting text into a database was the use of taxonomies to analyze sentences. Taxonomic resolution occurs when a taxonomy is created and then matched against the raw text. In matching the text, words can be classified. In many regards, the use of taxonomies was the secret that began to unlock the process of textual analysis. There are MANY things that can be done with text when the text is matched against a taxonomy.
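
A minimal sketch of taxonomic resolution, using a tiny hypothetical taxonomy (real taxonomies may hold many thousands of entries):

```python
# A tiny illustrative taxonomy mapping raw words to classifications.
TAXONOMY = {
    "honda":   "car",
    "porsche": "car",
    "tylenol": "medication",
    "aspirin": "medication",
}

def classify(text: str) -> list[tuple[str, str]]:
    """Match raw text against the taxonomy, classifying each word found."""
    return [(w, TAXONOMY[w]) for w in text.lower().split() if w in TAXONOMY]

print(classify("She traded the Honda for a Porsche"))
# [('honda', 'car'), ('porsche', 'car')]
```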

Following taxonomic analysis, there came NLP—natural language processing. Natural language processing took all the previous techniques and built on them in order to produce an effective way to examine and analyze text.

In the final phase of the evolution, there is textual ETL (or textual disambiguation). Textual ETL does everything that NLP does and adds a lot of other functionality. The emphasis of textual ETL is on the identification of the context of text, not the text itself. In addition, textual ETL specifically builds databases. And textual ETL also does in-line contextualization.

Today, with textual ETL, you can read text and turn it into useful databases. Once you have constructed the databases, you can then use standard visualization tools to analyze the data.

The Challenge of Context

The first and biggest problem with trying to incorporate text into a database environment is that text does not fit comfortably inside a database. But that is not the only problem. The second major problem is that in order to deal with text, you have to deal with context as well. Stated differently, dealing with text is one problem. Dealing with the context of text is an entirely different problem. But in order to put text meaningfully into an environment where it can be analyzed, you MUST deal with both text and context.

Fig. 17.1.3 shows that text and context must be considered.

Fig. 17.1.3
Fig. 17.1.3 Both text and context must be taken into account.

So, why is context so difficult to deal with? Consider the word “ship.” When you read a sentence and you see the word ship, what do you think of? Do you think of a large boat on the ocean? Do you think of an airliner? Do you think of a package that needs to be sent somewhere? Do you think of soldiers that are about to be transported somewhere? Do you think of someone being fired? Do you think of something else? The truth is that the word ship can mean lots of things, most of which are very different from each other.

The way you know what is meant by “ship” is to understand the context in which the word is used. Stated differently, the context of a word is usually EXTERNAL to the word. And the external nature of context is true of EVERY word and EVERY conversation.

And that is the hard part of understanding context. Context exists EXTERNAL to the words it applies to (for the most part). Occasionally, context is found within the sentences themselves. But far and away, the more common case is for context to be found external to the words being analyzed.
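
The point can be sketched in code. The clue words below are hypothetical; the sketch resolves "ship" only when the context happens to sit inside the sentence, and fails (as real sentences often do) when the context is external:

```python
# Hypothetical context clues: words that, when found near "ship",
# suggest which meaning is intended. A sketch of the idea only.
CONTEXT_CLUES = {
    "ocean":    "vessel",  "harbor":  "vessel",
    "package":  "send",    "address": "send",
    "soldiers": "transport",
    "fired":    "dismiss",
}

def resolve_ship(sentence: str) -> str:
    """Guess the meaning of 'ship' from the words that surround it."""
    for word in sentence.lower().split():
        if word in CONTEXT_CLUES:
            return CONTEXT_CLUES[word]
    return "unknown -- the context is external to this sentence"

print(resolve_ship("Please ship the package to this address"))  # send
print(resolve_ship("They decided to ship him"))                 # unknown ...
```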

Fig. 17.1.4 shows that context exists external to the words that are being analyzed.

Fig. 17.1.4
Fig. 17.1.4 Nearly all context exists outside the text itself.

That is why handling context is 90% of the work done by textual ETL in reading and preparing text for inclusion in a database.

Despite the fact that context is so difficult to identify and manage, it is MANDATORY that context be included with EVERY word put into a database. If a word were put by itself into a database, the word would be naked. A word without context would be lost and almost useless for the purpose of analysis.

Textual ETL then is the technology that allows text to be read and meaningfully placed into a database. Textual ETL ALWAYS—in every case—considers both the word and its context.

Fig. 17.1.5 shows textual ETL.

Fig. 17.1.5
Fig. 17.1.5 Textual ETL.

Textual ETL reads raw text, taxonomies, and other inputs and determines what text is important and how the text is to be processed. The output is a standard database.
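
A highly simplified sketch of that flow follows, using Python's standard sqlite3 module and illustrative stop word and taxonomy tables. Commercial textual ETL products do far more than this:

```python
import sqlite3

STOP_WORDS = {"a", "an", "and", "the", "to", "of"}          # illustrative
TAXONOMY = {"honda": "car", "tylenol": "medication"}        # illustrative

def textual_etl(documents: dict[str, str], db_path: str = ":memory:"):
    """A minimal sketch of textual ETL: read raw text, apply a taxonomy,
    and write word/context rows into a standard relational database."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS textual_data (
                       doc_id TEXT, position INTEGER,
                       word TEXT, context TEXT)""")
    for doc_id, text in documents.items():
        position = 0
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue                      # stop words are discarded
            context = TAXONOMY.get(word, "")  # classification, if any
            con.execute("INSERT INTO textual_data VALUES (?, ?, ?, ?)",
                        (doc_id, position, word, context))
            position += 1
    con.commit()
    return con

con = textual_etl({"note1": "The patient stopped taking the Tylenol"})
print(con.execute("SELECT * FROM textual_data WHERE context != ''").fetchall())
# [('note1', 3, 'tylenol', 'medication')]
```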

The Processing Components of Textual ETL

From a processing standpoint, there are two major processing sections of textual ETL. There is document fracturing, and there is named value processing (sometimes called “in-line contextualization”).

Fig. 17.1.6 shows these two major divisions of the processing that occur within textual ETL.

Fig. 17.1.6
Fig. 17.1.6 The components of textual ETL.

In document fracturing, a document is processed in such a way that, after processing, it remains in a recognizable state. In named value processing, the document is processed, but the document itself is not recognizable at the end of processing.
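
A sketch of named value processing under the assumption of "name: value" patterns, where the context travels in line with the value; the pattern and field names here are illustrative:

```python
import re

def named_value_processing(document: str) -> dict[str, str]:
    """A sketch of in-line contextualization: patterns of the form
    'name: value' carry their context in line, so the values can be
    pulled out directly. The document itself is not preserved."""
    return dict(re.findall(r"(\w[\w ]*?):\s*([^\n;]+)", document))

record = "Patient: John Doe\nBlood pressure: 120/80\nDose: 500 mg"
print(named_value_processing(record))
# {'Patient': 'John Doe', 'Blood pressure': '120/80', 'Dose': '500 mg'}
```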

Secondary Analysis

Textual ETL is really only the first step in the analysis of text. It produces a simple file that gathers the information and contextualizes it. However, to do true textual analysis, further processing is necessary.

Fig. 17.1.7 shows that the output from textual ETL goes through a secondary analysis. Typical secondary processing includes such activities as sentiment analysis, medical record analysis and reconstruction, call center analysis, and other types of analysis.

Fig. 17.1.7
Fig. 17.1.7 Processing is a multistep activity.

For example, sentiment analysis includes such activities as scope of inference analysis, connector analysis, and predicate location.
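
As a rough illustration of connector analysis, the sketch below narrows the scope of inference to the clause after a connector such as "but." The word lists are illustrative, and real sentiment analysis is far more involved:

```python
# A sketch of connector analysis: a connector such as "but" limits
# the scope of inference of what came before it. Word lists are illustrative.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"terrible", "bad", "hate", "broke"}

def sentiment(comment: str) -> str:
    """Score only the clause after the last connector, if one appears."""
    words = comment.lower().replace(",", " ").split()
    for conn in ("but", "however", "although"):
        if conn in words:
            words = words[words.index(conn) + 1:]  # scope shifts past the connector
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The service was great, but the product broke"))  # negative
```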

Fig. 17.1.8 shows the secondary processing that occurs after textual ETL is done.

Fig. 17.1.8
Fig. 17.1.8 After the databases are built, the data can be visualized.

Visualization

After secondary analysis is done, the results can then be visualized. The output of the secondary analysis is a simple database that can easily be used by an analytic visualizer.

Fig. 17.1.8 shows that visualization is the best way to show the results of analytic processing.

The visualization that is done can be customized to suit the application. There are MANY ways the analyst can shape the visualization. Fig. 17.1.9 shows some of the typical parameters that can be used to shape the visualization.

Fig. 17.1.9
Fig. 17.1.9 Some of the ways to analyze text.

Merging Text-Based Data and Structured Data

The primary value in being able to meaningfully put textual data into a database is to analyze the data. And there is great value in being able to do just that. However, there are some other really important benefits. One of those benefits is the ability to intermix textual data and standard structured data.

Fig. 17.1.10 shows that once textual data are captured inside a standard database, they can be freely mixed with classical structured data.

Fig. 17.1.10
Fig. 17.1.10 Intermixing data.

And once text-based data and classical structured data can be mixed together, they can be analyzed together. Fig. 17.1.11 shows the ability of the analyst to investigate intermixed data.

Fig. 17.1.11
Fig. 17.1.11 Analytical possibilities.

The mixing of the two types of data together sets the stage for analytic possibilities that were once only a dream of the analyst.
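
As a hedged illustration of what that intermixing can look like, the following sketch joins a hypothetical structured orders table with text-based rows of the kind textual ETL produces, using Python's standard sqlite3 module:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Classical structured data: a hypothetical transaction table.
con.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
con.execute("INSERT INTO orders VALUES ('A1', 250.0), ('A2', 75.0)")
# Text-based data of the kind textual ETL produces from call center notes.
con.execute("CREATE TABLE call_notes (order_id TEXT, word TEXT, context TEXT)")
con.execute("INSERT INTO call_notes VALUES ('A1', 'refund', 'complaint')")

# Intermixing the two: which complaints involve large orders?
rows = con.execute("""
    SELECT o.order_id, o.amount, n.word
    FROM orders o JOIN call_notes n ON o.order_id = n.order_id
    WHERE n.context = 'complaint' AND o.amount > 100
""").fetchall()
print(rows)  # [('A1', 250.0, 'refund')]
```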
