Chapter 10.2

Mapping

Abstract

Nonrepetitive analytics begins with the contextualization of the nonrepetitive data. Unlike repetitive data, the context of nonrepetitive data is difficult to determine. The context of nonrepetitive big data is determined by textual disambiguation. In textual disambiguation, there are algorithms that relate to stop word resolution, stemming, homographic resolution, inline contextualization, taxonomy/ontology resolution, custom variable resolution, acronym resolution, and so forth. Nonrepetitive analytics is very relevant to business value. Some typical forms of nonrepetitive analytics include the analysis of medical records, warranty analysis, insurance claim analysis, and call center analysis.

Keywords

Nonrepetitive data; Textual disambiguation; Stemming; Stop word processing; Homographic Resolution; Taxonomic resolution; Custom variable resolution; Acronym resolution; Inline contextualization

Mapping is the process of defining the specifications for how a document is to be processed by textual ETL. There is a separate mapping for each type of document to be processed. One of the nice features of textual ETL is that the analyst can build on previous mappings when it comes time to build a new one. On many occasions, one mapping will be very similar to another. It is not necessary for the analyst to create a new mapping from scratch if a similar mapping already exists.

At first glance, creating mappings is a bewildering process. It is like an airline pilot at the controls of an airplane. There are many control panels and many switches and buttons. To the uninitiated, flying an airplane seems an almost monumental task.

However, once an organized approach is taken, learning to do mapping is a straightforward process.

Fig. 10.2.1 shows the questions the analyst needs to ask during the mapping process.

Fig. 10.2.1 The process of mapping.

Most of the questions are straightforward, but a few deserve an explanation.

The first observation is that there is a difference between repetitive and nonrepetitive records of data on the one hand and repetitive text on the other. It is true that the words repetitive and nonrepetitive appear in both senses in this book, but they do not mean the same thing at all.

Repetitive records of data refer to records that appear repeatedly and are very similar in structure and even in content. Nonrepetitive records are records with little or no repetition in structure or content from one record to the next.

But repetitive text is something entirely different. Repetitive text refers to text appearing the same way or in a very similar way across more than one document. A simple example of repetitive text is the boilerplate contract. In a boilerplate contract, a lawyer has taken a basic contract and added a few words to it. The same contract appears over and over again in a repetitive manner. Another example of repetitive text is a blood pressure reading. A blood pressure reading is written as “bp 124/68.” The first number is the systolic reading, and the second number is the diastolic reading. When one encounters “bp 176/98,” one knows exactly what is meant by the text. The text is repetitive.
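Because repetitive text follows a fixed pattern, it can be picked up mechanically. The sketch below, assuming the “bp 124/68” convention described above, shows one way such a pattern could be recognized; the pattern and function names are illustrative and not part of any actual textual ETL product.

```python
import re

# Repetitive text such as "bp 124/68" follows a fixed pattern,
# so a simple regular expression can recognize it.
BP_PATTERN = re.compile(r"bp\s+(\d{2,3})/(\d{2,3})")

def extract_blood_pressure(text):
    """Return (systolic, diastolic) pairs found in free-form text."""
    return [(int(m.group(1)), int(m.group(2)))
            for m in BP_PATTERN.finditer(text)]
```

Applied to a medical note such as “patient stable, bp 124/68 at rest,” this would yield the pair (124, 68) along with its context.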

Of course, you can use as many techniques and specifications as are applicable. You can use taxonomies, inline contextualization, and custom formatting all at once. Or you can use only taxonomy processing or only inline contextualization. The data and what you want to do with the data dictate how you will choose to do what is needed.
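To make this concrete, a mapping that combines several techniques might look like the following sketch. The structure and key names here are invented for illustration; a real textual ETL tool has its own specification format.

```python
# Hypothetical mapping specification for one document type, combining
# stop word processing, taxonomy resolution, inline contextualization,
# and a custom format -- all key names are illustrative.
mapping_spec = {
    "document_type": "insurance_claim",
    "stop_words": ["a", "an", "and", "the", "of"],
    "taxonomies": ["auto_parts", "injury_types"],
    "inline_contextualization": [
        {"begin": "bp ", "end": " ", "variable": "blood_pressure"},
    ],
    "custom_formats": [
        {"pattern": "999-999-9999", "variable": "telephone_number001"},
    ],
}
```

Nothing requires all of these sections to be present; a mapping that used only the taxonomy section would be equally valid.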

One of the issues is choosing names for variables. For example, when you create a custom format, you choose a name for the variable. Suppose you wanted to pick up a telephone number. You could use a specification of “999-999-9999.” You need to name the variable that is created in a meaningful manner. The variable name becomes the context.

For example, for a telephone number, the name “variable001” would be a terrible name. No one would know what you meant when they encountered “variable001.” Instead, a name like “telephone_number001” is much more appropriate. When a person reads “telephone_number001,” it is immediately obvious what is meant.

The definition of a mapping is meant to be done in an iterative manner. It is HIGHLY unlikely that you will create a mapping and that the first mapping you create becomes the final mapping. It is MUCH MORE likely that you will create a mapping, run the mapping against the document, then go back, and make adjustments to the mapping. Documents are complex, and language is complex. There are plenty of nuances in language that people take for granted. Therefore, it is unrealistic to think that you will create the perfect mapping the first time you create one. It just doesn’t happen with even the most experienced people.

Textual ETL often has multiple ways to handle the same interpretation. In many cases, the mapper will be able to accomplish the same results in more than one way. There is no right way or wrong way to do something in textual ETL. You can choose whatever way makes the most sense to you.

Textual ETL is sensitive to resource consumption. In general, textual ETL operates in an efficient manner. The only things to be avoided are the following:

  • Looking for more than four or five proximity variables. It is possible to swamp textual ETL by looking for many proximity variables.
  • Looking for many homographs. It is possible to swamp textual ETL by looking for more than four or five homograph resolutions.
  • Taxonomy processing. Loading more than 10,000 words in a taxonomy can slow the system down.
  • Date standardization. Date standardization causes the system to use many resources. Do not use date standardization unless you really need to use it.