Chapter 4.6

Textual Disambiguation

Abstract

There are different definitions of big data. The definition used here is that big data encompasses a large volume of data, is based on inexpensive storage, manages data by the “Roman census” method, and stores data in an unstructured format. There are two major types of big data—repetitive big data and nonrepetitive big data. Only a small fraction of repetitive big data has business value, whereas almost all nonrepetitive big data has business value. In order to achieve business value, the context of data in big data must be determined. Contextualization of repetitive big data is easily achieved, but contextualization of nonrepetitive data must be done by means of textual disambiguation.

Keywords

Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data; Contextualization; Textual disambiguation

The process of contextualizing nonrepetitive unstructured data is accomplished by technology known as “textual disambiguation” (or “textual ETL”). Textual disambiguation has an analogous process in structured processing known as “ETL”—“extract/transform/load.” The difference between the two is that ETL transforms data from old legacy systems, whereas textual ETL transforms text. At a very high level, they are analogous, but in terms of the actual details of processing, they are very different.

From Narrative Into an Analytical Database

The purpose of textual disambiguation is to read raw text—narrative—and to turn that text into an analytic database. Fig. 4.6.1 shows the general flow of data in textual disambiguation.

Fig. 4.6.1 Transformation of text into a standard database.

Once raw text is transformed, it arrives in the analytic database in a normalized form. The analytic database looks like any other analytic database. Typically, the analytic data are “normalized,” where there is a unique key with dependent elements of data. The analytic database can be joined with other analytic databases to achieve the effect of being able to analyze structured data and unstructured data in the same query.

Each element in the analytic database can be tied back directly to the originating source document. This feature is needed if there ever is any question as to the accuracy of the processing that has occurred in textual disambiguation. In addition, if there ever is any question as to the context of the data found in the analytic database, it can be easily and quickly verified.

Note that the originating source document is not touched or altered in any way.
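As a concrete illustration of what a normalized analytic record might look like, the following is a minimal sketch in Python. The field names (doc_id, byte_offset, word, context) are illustrative assumptions, not the schema of any particular textual ETL product; the point is that a unique key identifies each element and ties it back to the untouched source document.

```python
from dataclasses import dataclass

@dataclass
class AnalyticRow:
    """One normalized row of the analytic database.

    The combination of doc_id and byte_offset acts as a unique key;
    the remaining fields are the dependent data elements. The field
    names here are illustrative assumptions, not a vendor schema.
    """
    doc_id: str       # identifies the originating source document
    byte_offset: int  # position of the word in the source document
    word: str         # the word or phrase that was recognized
    context: str      # the context assigned by textual disambiguation

# Example: two rows derived from a single (invented) contract document.
# Because each row carries doc_id and byte_offset, every element can be
# tied back to the untouched source document for verification.
rows = [
    AnalyticRow("contract-001.txt", 1042, "lessee", "party to contract"),
    AnalyticRow("contract-001.txt", 2310, "royalty", "payment term"),
]

for r in rows:
    print(r)
```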

Fig. 4.6.2 shows that each element of data in the analytic database can be tied back to the originating source.

Fig. 4.6.2 Tying the text to the database.

Input Into Textual Disambiguation

The input into textual disambiguation comes from many different places. The most obvious source of input is the electronic text of the document that is to be disambiguated. Another important source of input is taxonomies. Taxonomies are essential to the process of disambiguation; there will be an entire chapter on taxonomies. In addition, there are many other types of parameters, depending on the document being disambiguated.
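As a rough illustration of the role a taxonomy plays as input, the sketch below treats a taxonomy as a simple word-to-classification lookup. The categories and words are invented for illustration; real taxonomies are far larger and are treated in their own chapter.

```python
# A taxonomy maps raw words to the classifications they roll up to.
# The entries below are invented examples; production taxonomies
# contain thousands of terms.
car_taxonomy = {
    "honda": "car",
    "toyota": "car",
    "ford": "car",
    "car": "vehicle",
    "truck": "vehicle",
}

def classify(word: str) -> str | None:
    """Return the classification of a word, if the taxonomy knows it."""
    return car_taxonomy.get(word.lower())

print(classify("Honda"))  # -> car
print(classify("piano"))  # -> None (not in this taxonomy)
```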

Fig. 4.6.3 shows some of the typical input into the process of textual disambiguation.

Fig. 4.6.3 Raw text, taxonomies and other parameters are input into textual ETL.

Mapping

In order to execute textual disambiguation, it is necessary to “map” a document to the appropriate parameters that can be specified inside textual disambiguation. The mapping directs textual disambiguation as to how the document needs to be interpreted. The mapping process is akin to the process of designing how a system will operate. Each type of document has its own mapping process.

The mapping parameters are specified, and upon completion of the mapping process, a document can then be processed. All documents of the same type can be served by the same mapping. For example, there may be one mapping for oil and gas contracts, another for human resource resume management, and another for call center analysis.
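A mapping can be pictured as a bundle of parameters that tells textual disambiguation how to interpret one type of document. The structure below is a hypothetical sketch; the parameter names are assumptions made for illustration, not the actual parameters of any product.

```python
# A hypothetical mapping for one document type. Each document type
# (oil and gas contracts, resumes, call center notes, ...) would get
# its own mapping; all documents of that type share it.
oil_and_gas_mapping = {
    "document_type": "oil_and_gas_contract",
    "stop_words": ["a", "an", "the", "of", "to"],
    "acronyms": {"BOPD": "barrels of oil per day"},
    "taxonomies": ["petroleum_terms", "legal_terms"],
    # Delimiters used for inline (named value) contextualization;
    # the delimiter strings are invented examples.
    "named_values": [
        {"name": "lease_number", "begin": "Lease No.", "end": "\n"},
    ],
}
```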

Fig. 4.6.4 shows the mapping process.

Fig. 4.6.4 Mapping.

In almost every case, the mapping process is done in an iterative manner. The first mapping of a document is created. A few documents are processed, and the analyst sees the results. The analyst decides to make a few changes and reruns the document through textual disambiguation with the new mapping specifications. The process of gradually refining the mapping continues until the analyst is satisfied.

The iterative approach to the creation of a mapping is used because documents are notoriously complex and there are many nuances to a document that are not immediately apparent. For even an experienced analyst, the creation of the mapping is an iterative process.

Because of the iterative nature of the creation of the mapping, it NEVER makes sense to create a mapping and then process thousands of documents using the initial mapping. Such a practice is wasteful because it is almost guaranteed that the initial mapping will need to be refined.

Fig. 4.6.5 shows the iterative nature of the mapping process.

Fig. 4.6.5 Iterative development.

Input/Output

The input to the process of textual disambiguation is electronic text. There are MANY forms of electronic text. Indeed, electronic text can come from almost anywhere. The electronic text can be in the form of proper language, slang, shorthand, comments, database entries, and many other forms. Textual disambiguation needs to be able to handle all the forms of electronic text. In addition, electronic text can be in different languages.

Textual disambiguation can handle nonelectronic text after the nonelectronic text passes through an automated capture mechanism such as optical character recognition (OCR) processing.

The output of textual disambiguation can take many forms. The output is created in a “flat file format.” As such, it can be sent to any standard DBMS or to Hadoop.

Fig. 4.6.6 shows the types of output that can be created from textual disambiguation.

Fig. 4.6.6 Input and output passing through textual ETL.

The output from textual disambiguation is placed into a work table area. From the work table area, the data can be loaded into a standard DBMS using the load utility of the DBMS.
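The handoff can be pictured with a minimal sketch: textual ETL writes delimited rows to a flat file in the work table area, and the DBMS load utility bulk-loads them. PostgreSQL's COPY command is used here purely as an example of such a load utility, and the column layout is an assumption carried over from the earlier sketch.

```python
import csv

# Step 1: textual ETL writes its output as a flat file in a work area.
# The column layout is an illustrative assumption.
with open("work_area.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["doc_id", "byte_offset", "word", "context"])
    writer.writerow(["contract-001.txt", 1042, "lessee", "party to contract"])

# Step 2: a standard DBMS load utility picks the flat file up.
# PostgreSQL's COPY is shown purely as an example of such a utility;
# the table name and path are hypothetical.
load_command = """
COPY analytic_db FROM '/work_area/work_area.csv'
WITH (FORMAT csv, HEADER true);
"""
print(load_command)
```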

Fig. 4.6.7 shows that data are passed to the DBMS load utility from the work area created and managed by textual disambiguation.

Fig. 4.6.7 A load utility.

Document Fracturing/Named Value Processing

There are many features to the actual processing done by textual disambiguation. But there are two primary paths of processing a document. These paths are called document fracturing and named value processing.

Document fracturing is the process by which a document is processed—word by word—doing such processing as stop word processing, alternate spelling and acronym resolution, and homographic resolution. The effect of document fracturing is that upon processing, the document still has a recognizable shape, albeit in a modified form. For all practical purposes, it appears as if the document has been fractured.
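The following is a minimal sketch of a fracturing pass: stop words are removed, alternate spellings are standardized, and acronyms are resolved, word by word, so the document keeps a recognizable shape in modified form. The word lists are invented for illustration, and homographic resolution, which depends on surrounding context, is omitted.

```python
# Illustrative parameter sets; real mappings are far larger.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "for"}
SPELLINGS = {"colour": "color", "cheque": "check"}   # alternate spellings
ACRONYMS = {"hr": "human resources", "po": "purchase order"}

def fracture(text: str) -> list[str]:
    """Process a document word by word, as in document fracturing.

    Homographic resolution (choosing which meaning of a word applies)
    depends on surrounding context and is omitted from this sketch.
    """
    out = []
    for raw in text.split():
        word = raw.strip(".,;:!?").lower()
        if word in STOP_WORDS:
            continue                      # stop word processing
        word = SPELLINGS.get(word, word)  # alternate spelling resolution
        word = ACRONYMS.get(word, word)   # acronym resolution
        out.append(word)
    return out

print(fracture("The HR department issued a PO for the colour printer."))
# -> ['human resources', 'department', 'issued', 'purchase order',
#     'color', 'printer']
```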

The second major type of processing is named value processing. Named value processing occurs when inline contextualization needs to be done. Inline contextualization applies where the text is repetitive, as sometimes happens. When text is repetitive, it can be processed by looking for unique beginning and ending delimiters.
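Inline contextualization can be sketched with a pattern that captures whatever lies between a unique beginning delimiter and a unique ending delimiter. The delimiters and the field name below are hypothetical.

```python
import re

# Repetitive text, such as a boilerplate contract clause, where the
# same delimiters reliably surround the value of interest.
text = "Lease No. 4471-B\nOperator: Acme Oil\nLease No. 5512-A\n"

# Hypothetical delimiters: the value begins after "Lease No. " and
# ends at the newline.
pattern = re.compile(r"Lease No\. (.+?)\n")

# Each match becomes a named value in the analytic database.
for match in pattern.finditer(text):
    print({"name": "lease_number", "value": match.group(1)})
# -> {'name': 'lease_number', 'value': '4471-B'}
#    {'name': 'lease_number', 'value': '5512-A'}
```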

There are other types of processing that can be done by textual disambiguation, but document fracturing and named value processing are the two primary analytic processing paths.

Fig. 4.6.8 depicts the two primary forms of processing that occur in textual disambiguation.

Fig. 4.6.8 The two main processing components of textual ETL.

Preprocessing a Document

On occasion, it is necessary to preprocess a document. Sometimes, the text of a document cannot be processed in a standard fashion by textual disambiguation. In these circumstances, it is necessary to pass the text through a preprocessor, where the text can be edited to the point that it can be processed in a normal manner by textual disambiguation.

As a rule, you don’t want to preprocess text unless you absolutely have to. The reason why you don’t want to have to preprocess text is that by preprocessing text, you automatically double (or more!) the machine cycles that are required to process the text.
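The sketch below shows the kind of editing a preprocessor might do, using invented cleanup rules that normalize line endings and strip unprintable characters. Note that it is a full extra pass over the text, which is why preprocessing at least doubles the machine cycles.

```python
def preprocess(raw: str) -> str:
    """One full extra pass over the text.

    The cleanup rules are invented examples of the kind of editing a
    preprocessor might perform before the standard pass.
    """
    text = raw.replace("\r\n", "\n")  # normalize line endings
    text = "".join(ch for ch in text
                   if ch.isprintable() or ch == "\n")  # drop control chars
    return text

cleaned = preprocess("Invoice\r\n\x00Total: 100")
print(repr(cleaned))  # -> 'Invoice\nTotal: 100'
```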

Fig. 4.6.9 shows that—if necessary—electronic text can be preprocessed.

Fig. 4.6.9 Preprocessing text.

E-mails—A Special Case

E-mails are a special case of nonrepetitive unstructured data. E-mails are special because everybody has them and because there are so many of them. Another reason why e-mails are special is that they carry with them an enormous amount of system overhead that is useful to the system and to no one else. Also, e-mails carry a lot of valuable information about customers' attitudes and activities.

It is possible to simply send e-mails into textual disambiguation. But such an exercise is fruitless because of the spam and blather found in e-mails. Spam is nonbusiness-related information that is generated outside the corporation. Blather is internally generated correspondence that is not business-related; for example, blather includes the jokes that are sent throughout the corporation.

In order to use textual disambiguation effectively, the spam, blather, and system information need to be filtered out. Otherwise, the system becomes overwhelmed with meaningless information.
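A crude sketch of such a filter follows. Messages matching spam or blather heuristics are discarded before textual disambiguation ever sees them; the heuristics and the domain name shown are deliberately simplistic assumptions, not a recommended rule set.

```python
# Deliberately simplistic heuristics, for illustration only.
BLATHER_MARKERS = {"joke", "fantasy football", "bake sale"}
INTERNAL_DOMAIN = "corp.example.com"  # hypothetical corporate domain

def keep_email(sender: str, subject: str, body: str) -> bool:
    """Return True if an e-mail should be passed on to textual ETL."""
    text = (subject + " " + body).lower()
    external = not sender.endswith("@" + INTERNAL_DOMAIN)
    if external and "unsubscribe" in text:
        return False                      # crude spam heuristic
    if any(marker in text for marker in BLATHER_MARKERS):
        return False                      # internally generated blather
    return True

print(keep_email("bob@corp.example.com", "Q3 contract terms", "..."))  # True
print(keep_email("bob@corp.example.com", "Friday joke", "..."))        # False
```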

Fig. 4.6.10 shows that there is a filter to remove unnecessary information from the stream of e-mails before the e-mails are processed by textual disambiguation.

Fig. 4.6.10 Filtering emails.

Spreadsheets

Another special case is the case of spreadsheets. Spreadsheets are ubiquitous. Sometimes, the information on the spreadsheet is purely numerical. But on other occasions, there is character-based information on a spreadsheet. As a rule, textual disambiguation does not process numerical information from a spreadsheet. That is because there are no metadata to accurately describe numeric values on a spreadsheet. (Note: there is formulaic information for the numbers found on a spreadsheet, but the spreadsheet formulas are almost worthless as metadata descriptions of the meaning of the numbers.) For this reason, the only data found on a spreadsheet that make their way into textual ETL are the character-based descriptive data.

To this end, there is an interface that allows the useful data on the spreadsheet to be reformatted into a working database. From the working database, the data are then sent into textual disambiguation, as seen in Fig. 4.6.11.

Fig. 4.6.11 Reformatting spreadsheet data.
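The interface can be sketched as a pass that keeps only the character-based cells and writes them, with their coordinates, into a working database. In the sketch below, the rows are assumed to have already been pulled out of a spreadsheet as Python lists; a real interface would read the spreadsheet file itself.

```python
# Rows as they might be pulled from a spreadsheet; numeric cells are
# skipped because there are no metadata describing what they mean.
sheet = [
    ["Customer", "Region", "Q1 Sales"],
    ["Acme Oil", "West", 120000],
    ["Bolt Inc", "East", 95000],
]

working_db = []  # the working database fed into textual ETL
for row_num, row in enumerate(sheet):
    for col_num, cell in enumerate(row):
        if isinstance(cell, str):  # keep character-based cells only
            working_db.append(
                {"row": row_num, "col": col_num, "text": cell}
            )

for record in working_db:
    print(record)
```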

Report Decompilation

Most textual information is found in the form of a document. And when text is on a document, it is processed linearly by textual disambiguation. Fig. 4.6.12 shows that textual disambiguation operates in a linear fashion.

Fig. 4.6.12 Linear processing of text.

But text on a document is not the only form of nonrepetitive unstructured data. Another common form of nonrepetitive unstructured data is that of a table. Tables are found everywhere—in bank statements, in research papers, in corporate invoices, and so forth.

On some occasions, it is necessary to read the table in as input, just as text is read in on a document. To this end, a specialized form of textual disambiguation is required. This form of textual disambiguation is called report decompilation.

In report decompilation, the contents of the report are handled very differently from the contents of text. The reason is that in a report, the information cannot be handled in a linear format.

Fig. 4.6.13 shows that there are different elements of a report that must be brought together in a normalized format. The problem is that those elements appear in a decidedly nonlinear format.

Fig. 4.6.13 An entirely different form of textual disambiguation.

Therefore, an entirely different form of textual disambiguation is required.
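As an illustration of why linear processing fails here, the sketch below decompiles a miniature bank statement: the account number sits in a header while the amounts sit in line items, and each normalized output row must combine elements from those different, nonlinear places on the report. The layout and field names are invented.

```python
import re

# A miniature report: the account number is in the header, the
# transactions are in the body, and the two must be joined.
report = """ACCOUNT 1234-5678            STATEMENT MAR 2024
03/02  DEPOSIT          500.00
03/15  WITHDRAWAL       120.00
"""

# Pull the header element from one place on the report.
account = re.search(r"ACCOUNT (\S+)", report).group(1)

# Each output row is normalized: it repeats the header element
# (account) alongside the line-item elements (date, type, amount).
rows = []
for date, kind, amount in re.findall(
        r"(\d\d/\d\d)\s+(\w+)\s+([\d.]+)", report):
    rows.append({"account": account, "date": date,
                 "type": kind, "amount": float(amount)})

for r in rows:
    print(r)
# -> {'account': '1234-5678', 'date': '03/02', 'type': 'DEPOSIT', ...}
```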

Fig. 4.6.14 shows that reports can be sent to report decompilation for reduction to a normalized format.

Fig. 4.6.14 Report decompilation.

The end result of report decompilation is exactly the same as the end result of textual disambiguation. But the processing and the logic that arrive at the end result are very different in content and substance.
