Chapter 10.1

Nonrepetitive Data

Abstract

Nonrepetitive analytics begins with the contextualization of the nonrepetitive data. Unlike repetitive data, the context of nonrepetitive data is difficult to determine. The context of nonrepetitive big data is determined by textual disambiguation. In textual disambiguation, there are algorithms that relate to stop word resolution, stemming, homographic resolution, inline contextualization, taxonomy/ontology resolution, custom variable resolution, acronym resolution, and so forth. Nonrepetitive analytics is very relevant to business value. Some typical forms of nonrepetitive analytics include the analysis of medical records, warranty analysis, insurance claim analysis, and call center analysis.

Keywords

Nonrepetitive data; Textual disambiguation; Stemming; Stop word processing; Homographic resolution; Taxonomic resolution; Custom variable resolution; Acronym resolution; Inline contextualization

There are two types of data that reside in the big data environment—repetitive data and nonrepetitive data. Repetitive data are relatively easy to handle because of the repetitive nature of the structure of the data. But nonrepetitive data are anything but easy to handle because every unit of data in the nonrepetitive environment must be individually interpreted before it can be used for analytic processing.

Fig. 10.1.1 shows a representation of nonrepetitive data as they reside in a raw state in the big data environment.

Fig. 10.1.1
Fig. 10.1.1 Nonrepetitive data.

The nonrepetitive data found in big data are called “nonrepetitive” because each unit of data is unique. Fig. 10.1.2 shows that each unit of data in the nonrepetitive environment is different from the preceding unit of data.

Fig. 10.1.2
Fig. 10.1.2 Great differences.

There are many examples of nonrepetitive data in the big data environment. Some of the examples include the following:

  • E-mail data
  • Call center data
  • Corporate contracts
  • Warranty claims
  • Insurance claims

It is possible for two units of nonrepetitive data to actually be the same. Fig. 10.1.3 shows this possibility.

Fig. 10.1.3
Fig. 10.1.3 The only similarities are accidental.

As an example of two units of nonrepetitive data being the same, suppose there are two e-mails that contain one word—the word “yes.” In this case, the e-mails are identical. But the fact that they are identical is merely an act of randomness.

In general, when text finds its way into the big data environment, the units of data stored in big data are nonrepetitive.

One approach to processing nonrepetitive data is to use a search technology. While search technology accomplishes the task of scanning the data, it leaves a lot to be desired. Its two primary shortcomings are that searching does not leave behind a database that can subsequently be used for analytic purposes and that it does not examine or provide the context of the text being analyzed. And there are other limitations of search technology as well.

In order to do extensive analytic processing against nonrepetitive data, it is necessary to read the nonrepetitive data and to turn the nonrepetitive data into a standard database format. Sometimes, this process is said to take unstructured data and turn them into structured data. That indeed is a good description of what occurs.

The process of reading nonrepetitive data and turning them into a database is called “textual disambiguation” or “textual ETL.” Textual disambiguation is—of necessity—a complex process because the language it processes is complex. There is no getting around the fact that processing text is a complex process.

The result of processing nonrepetitive data in big data with textual disambiguation is the creation of a standard database. Once data are put into the form of a standard database, they can then be analyzed using standard analytic technology.

The mechanics of textual disambiguation are shown in Fig. 10.1.4.

Fig. 10.1.4
Fig. 10.1.4 The mechanics of textual disambiguation.

The general flow of processing in textual ETL is as follows. The first step is to find and read the data. Normally, this step is straightforward. But occasionally, the data have to be "untangled" in order for further processing to continue. In some cases, the data reside on a unit-by-unit basis. This is the "normal" (or easy) case. But in other cases, the units of data are combined into a single document, and the units must be isolated within the document in order to be processed.

The second step is to examine the unit of data and determine what data need to be processed. In some cases, all the data need to be processed. In other cases, only certain data need to be processed. In general, this step is very straightforward.

The third step is to “parse” the nonrepetitive data. The word “parse” is a little misleading because it is in this step that the system applies great amounts of logic. The word “parsing” implies a straightforward process, and the logic that occurs here is anything but straightforward. The remainder of this chapter discusses the logic that occurs here.

After the nonrepetitive data have been “parsed,” the attributes of data, the keys of data, and the records of data are identified.

Once the keys, attributes, and records are identified, it is a straightforward process to turn the data into a standard database record.

That then is what takes place in textual disambiguation.

The heart of textual disambiguation is the logic of processing that occurs when nonrepetitive data are analyzed and turned into keys, attributes, and records.

The activities of logic that occur here can be roughly classified into several categories. Fig. 10.1.5 shows those categories.

Fig. 10.1.5
Fig. 10.1.5 The different types of textual disambiguation.

The basic activities of logic applied by textual disambiguation include the following:

  • Contextualization, where the context of data is identified and captured
  • Standardization, where certain types of text are standardized
  • Basic editing, where basic editing of text occurs

Indeed, there are other functions of textual disambiguation, but these three classifications of activities encompass most of the important processing that occurs.

The remainder of this chapter will be an explanation of logic that is found in textual disambiguation.

Inline Contextualization

One form of contextualization is called "inline contextualization" (or sometimes "named value" processing). Inline contextualization applies only where the text is repetitive and predictable. In many cases, there is no such predictability, so inline contextualization cannot be used.

Inline contextualization is the process of inferring the context of a word or phrase by looking at the text immediately preceding and immediately following the word or phrase. As a simple example of inline contextualization, consider the raw text “2. This is a PAID-UP LEASE.”

The context name would be contract type. The beginning delimiter would be “2. This is a” and the ending delimiter would be “.” The system would produce an entry into the analytic database that would look like the following:

  • Document name, byte, context—contract type, value—PAID-UP LEASE

Fig. 10.1.6 shows the activity the system does in processing raw text to determine inline contextualization.

Fig. 10.1.6
Fig. 10.1.6 Finding beginning and ending delimiters.

Note that the beginning delimiter must be unique. If you were to specify "is a" as a beginning delimiter, then every occurrence of the term "is a" would qualify. And there may be many places where the term "is a" appears that do not indicate inline contextualization.

Also, note that the ending delimiter must be specified exactly. In this case, if the term does not end in a ".", the system will not consider the entry to be a hit.

Because the ending delimiter must be specified accurately, the analyst also specifies a maximum character count. The maximum character count tells the system how far to search to determine whether the ending delimiter has been found.

On occasion, the analyst wants the inline contextualization search to end on a special character. In this case, the analyst specifies the special character that is needed.
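
To make the mechanics concrete, the following sketch (in Python, with hypothetical names; it is not the actual textual ETL implementation) scans raw text for a beginning delimiter, searches no farther than the maximum character count for the ending delimiter, and emits an entry of the form document name, byte, context, value.

  def inline_contextualize(doc_name, raw_text, context_name,
                           begin_delim, end_delim, max_chars):
      # Capture the text between begin_delim and end_delim, but give up if
      # end_delim is not found within max_chars characters.
      entries = []
      start = 0
      while True:
          hit = raw_text.find(begin_delim, start)
          if hit == -1:
              break
          value_start = hit + len(begin_delim)
          window = raw_text[value_start:value_start + max_chars]
          end = window.find(end_delim)
          if end != -1:                      # ending delimiter found within range
              value = window[:end].strip()
              entries.append((doc_name, hit, context_name, value))
          start = value_start
      return entries

  # The example above: "2. This is a PAID-UP LEASE."
  print(inline_contextualize("lease.doc", "2. This is a PAID-UP LEASE.",
                             "contract type", "2. This is a", ".", max_chars=50))
  # -> [('lease.doc', 0, 'contract type', 'PAID-UP LEASE')]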

Taxonomy/Ontology Processing

Another powerful way to specify context is through the usage of taxonomies and ontologies.

There are many important things that taxonomies do for contextualization. The first is applicability. Whereas inline contextualization requires repetitive and predictable occurrences of text, taxonomies have no such requirement. Taxonomies are applicable just about everywhere. A second valuable feature of taxonomies is that they can be applied externally. This means that in choosing the taxonomy to be applied, the analyst can greatly influence the interpretation of the raw text.

For example, suppose the analyst were going to apply a taxonomy to the phrase "President Ford drove a Ford." If the interpretation the analyst wished to infer were about cars, then the analyst would choose one or more taxonomies that would allow "Ford" to be interpreted as an automobile. But if the analyst were to choose a taxonomy relating to the history of the presidents of the United States, then the term "Ford" would be interpreted as a former president of the United States.

The analyst then has great power in applying the correct taxonomy to the raw text that is to be processed.

The mechanics of how a taxonomy is processed against raw text are shown in Fig. 10.1.7.

Fig. 10.1.7
Fig. 10.1.7 Processing a taxonomy against raw text.

As a simple example of the application of a taxonomy to raw text, consider the following example.

Raw text—“…she drove her Honda into the garage….” The simple taxonomy used looks like the following:

  • Car
    • Porsche
    • Honda
    • Toyota
    • Ford
    • Kia
    • Volkswagen

When the taxonomy is passed against the raw text, the results look like the following:

  • Document name, byte, context—car, value—Honda

In order to accommodate other processing, on some occasions, it is useful to create a second entry:

  • Document name, byte, context—car, value—car

The reason why it is sometimes useful to produce a second entry into the analytic database is that, on occasion, you want to process all the values and you want the context itself to be processed as a value. That is why, on occasion, the system produces two entries into the analytic database.

Note that textual ETL operates on taxonomies/ontologies as if they were simple word pairs. In fact, taxonomies and ontologies are much more complex than simple word pairs. But even the most sophisticated taxonomy can be decomposed into a series of simple word pairs.
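
As an illustrative sketch (Python, with hypothetical names rather than the actual textual ETL internals), a taxonomy decomposed into simple word pairs can be passed against raw text, optionally producing the second context-as-value entry described above:

  import re

  # The car taxonomy above, flattened into simple (word, classification) pairs.
  CAR_TAXONOMY = {"porsche": "car", "honda": "car", "toyota": "car",
                  "ford": "car", "kia": "car", "volkswagen": "car"}

  def apply_taxonomy(doc_name, raw_text, taxonomy, context_as_value=True):
      entries = []
      for m in re.finditer(r"[A-Za-z]+", raw_text):        # each word and its byte offset
          context = taxonomy.get(m.group().lower())
          if context:
              entries.append((doc_name, m.start(), context, m.group()))    # value: Honda
              if context_as_value:
                  entries.append((doc_name, m.start(), context, context))  # value: car
      return entries

  print(apply_taxonomy("trip.doc", "…she drove her Honda into the garage…", CAR_TAXONOMY))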

In general, the usage of taxonomies as a form of contextualization is the most powerful tool the analyst has in determining the context of raw text.

Custom Variables

Another very useful form of contextualization is the identification and creation of what can be termed "custom variables." Almost every organization has custom variables. A custom variable is a word or phrase that is recognizable entirely from its format. As a simple example, a manufacturer may have its part numbers in the form "AK-876-uy." Generically, the form of the part number would be "CC-999-cc." In this case, "C" indicates a capital letter, "-" indicates the literal "-", "9" indicates any numeric digit, and "c" indicates a lowercase letter.

By looking at the format of a word or phrase, the analyst can tell immediately the context of the variable.

Fig. 10.1.8 shows how raw text is processed using custom variables.

Fig. 10.1.8
Fig. 10.1.8 Custom variable format processing.

As an example of the use of custom variables, consider the following raw text:

…I want to order two more cases of TR-0987-BY to be delivered on…

Upon processing the raw text, the following entry would be created inside the analytic database:

  • Doc name, byte, context—part number, value—TR-0987-BY

Note that a few custom variables are in common use. One (in the United States) is 999-999-9999, the generic pattern for a telephone number. Another is 999-99-9999, the generic pattern for a Social Security number.

The analyst can create whatever pattern he/she wishes for processing against the raw text. The only "gotcha" is that occasionally more than one type of variable has the same format as another. In this case, there will be confusion in trying to use custom variables.
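
A minimal sketch of this format matching, assuming the generic pattern notation described above (this is illustrative Python, not the product's implementation):

  import re

  def pattern_to_regex(generic_pattern):
      # 'C' -> capital letter, 'c' -> lowercase letter, '9' -> digit, anything else literal.
      parts = []
      for ch in generic_pattern:
          if ch == "C":
              parts.append("[A-Z]")
          elif ch == "c":
              parts.append("[a-z]")
          elif ch == "9":
              parts.append("[0-9]")
          else:
              parts.append(re.escape(ch))
      return re.compile("".join(parts))

  def find_custom_variables(doc_name, raw_text, context_name, generic_pattern):
      regex = pattern_to_regex(generic_pattern)
      return [(doc_name, m.start(), context_name, m.group())
              for m in regex.finditer(raw_text)]

  # Part numbers such as "TR-0987-BY" have the generic form "CC-9999-CC".
  text = "…I want to order two more cases of TR-0987-BY to be delivered on…"
  print(find_custom_variables("order.doc", text, "part number", "CC-9999-CC"))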

Homographic Resolution

A powerful form of contextualization is that known as "homographic resolution." In order to understand homographic resolution, consider the following (very real) example. Some doctors are trying to interpret doctors' notes. The term "ha" gives the doctors a problem. When a cardiologist writes "ha," the cardiologist refers to "heart attack." When an endocrinologist writes "ha," the endocrinologist refers to "hepatitis A." When a general practitioner writes "ha," the general practitioner refers to "headache."

In order to create a proper analytic database, the term "ha" must be interpreted properly. If the term "ha" is not interpreted properly, then people who have had heart attacks, hepatitis A, and headaches will all be mixed together, and that surely will produce a faulty analysis.

There are several elements to homographic resolution. The first element is the homograph itself; in this case, the homograph is "ha." The second element is the homograph class; the homograph class in this case includes cardiologist, endocrinologist, and general practitioner. The third element is the homographic resolution itself: for cardiologists, "ha" means "heart attack"; for endocrinologists, "ha" means "hepatitis A"; and for general practitioners, "ha" means "headache."

The fourth element of homographic resolution is that each of the homographic classes must have typical words assigned to the class. For example, a cardiologist may be associated with words like “aorta,” “stent,” “bypass,” and “valve.”

There are then four elements to homographic resolution:

  • The homograph
  • The homograph class
  • The homograph resolution
  • Words associated with the homograph class

Fig. 10.1.9 shows how homographic processing is done against raw text.

Fig. 10.1.9
Fig. 10.1.9 Homographic processing.

Suppose the raw text looks as follows—“…120/68, 168 lbs, ha, 72 bpm, f, 38,…”

Upon processing the raw text, the entry into the database might look like the following:

  • Document name, byte, context—headache, value—ha

Care must be taken with the specification of homographs. The underlying work done by the system to resolve the homograph is considerable. So, system overhead is a concern.

In addition, the analyst can specify a default homographic class should none of the homographic classes be qualified. In this case, the system will default to the homograph class specified by the analyst.
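
The following sketch pulls the four elements together. The word lists for the endocrinologist and general practitioner classes are invented for illustration, as are the function and document names; it is a sketch of the technique, not the actual textual ETL logic.

  HOMOGRAPH = "ha"
  CLASS_WORDS = {                                # words assigned to each homograph class
      "cardiologist": {"aorta", "stent", "bypass", "valve"},
      "endocrinologist": {"insulin", "thyroid", "glucose"},        # illustrative
      "general practitioner": {"fever", "cough", "bpm"},           # illustrative
  }
  RESOLUTION = {"cardiologist": "heart attack",
                "endocrinologist": "hepatitis A",
                "general practitioner": "headache"}
  DEFAULT_CLASS = "general practitioner"         # default class specified by the analyst

  def resolve_homograph(doc_name, raw_text):
      words = set(raw_text.lower().replace(",", " ").split())
      doc_class = DEFAULT_CLASS
      for cls, markers in CLASS_WORDS.items():   # qualify the class by associated words
          if words & markers:
              doc_class = cls
              break
      pos = raw_text.lower().find(HOMOGRAPH)     # simplistic; real matching is word-bounded
      return [(doc_name, pos, RESOLUTION[doc_class], HOMOGRAPH)] if pos != -1 else []

  print(resolve_homograph("notes.doc", "…120/68, 168 lbs, ha, 72 bpm, f, 38,…"))
  # -> [('notes.doc', 18, 'headache', 'ha')]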

Acronym Resolution

A related form of resolution is that of acronym resolution. Acronyms are found everywhere in raw text. Acronyms are a standard part of communication. Furthermore, acronyms tend to be clustered around some subject area. There are IBM acronyms. There are military acronyms. There are IMS acronyms. There are chemical acronyms. There are Microsoft acronyms and so forth.

In order to clearly understand a communication, it is advisable to resolve acronyms.

Textual ETL is equipped to resolve acronyms. When textual ETL reads raw text and spots an acronym, textual ETL replaces the acronym with the literal value.

Fig. 10.1.10 shows the dynamics of how textual ETL reads raw text and resolves an acronym when it is found.

Fig. 10.1.10
Fig. 10.1.10 Processing an acronym.

As an example of how acronym resolution works, suppose there was the following text:

Sgt Mullaney was AWOL as of 10:30 p.m. on Dec 25…

The following entry would be placed in the analytic database:

  • Document name, byte, context—absent without official leave, value—AWOL

Textual ETL organizes the terms of resolution by category class. Of course, the terms of resolution can be customized as they are loaded into the system.
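
A minimal sketch of acronym resolution, assuming a small dictionary of military terms organized under one category class (the names and dictionary are hypothetical; the real terms of resolution are loaded and customized in the product):

  MILITARY_ACRONYMS = {"AWOL": "absent without official leave",
                       "MIA": "missing in action"}

  def resolve_acronyms(doc_name, raw_text, acronyms):
      # Spot each acronym and record its literal value as the context of the entry.
      entries = []
      for acronym, literal in acronyms.items():
          pos = raw_text.find(acronym)
          if pos != -1:
              entries.append((doc_name, pos, literal, acronym))
      return entries

  text = "Sgt Mullaney was AWOL as of 10:30 p.m. on Dec 25…"
  print(resolve_acronyms("report.doc", text, MILITARY_ACRONYMS))
  # -> [('report.doc', 17, 'absent without official leave', 'AWOL')]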

Negation Analysis

On occasion, text will state that something did not happen, as opposed to stating that something happened. If standard contextualization is used, there will be a reference to something that did not happen. Therefore, when a negation is stated in text, the negation needs to be recognized by textual ETL.

For example, if a report says “…John Jones did not have a heart attack…,” there does not need to be a reference to John Jones having a heart attack. Instead, there needs to be a reference to the fact that John did NOT have a heart attack.

There are actually many different ways that negation analysis can be done by textual ETL. The simplest way is to create a taxonomy of negative terms—“none, not, hardly, no,…”—and keep track of the negations that have occurred. Then, if a negative term has occurred in conjunction with another term in the same sentence, the inference is made that something did not happen.

Fig. 10.1.11 shows how raw text can be treated to create one form of negation analysis.

Fig. 10.1.11
Fig. 10.1.11 Negation analysis.

As an example of negation analysis, consider the raw text “…John Jones did not have a heart attack….”

The data that would be generated would look like the following:

  • Document name, byte, context—negation, value—no
  • Document name, byte, context—condition, value—heart attack

Care must be taken with negation analysis because not all forms of negation are easily handled. The good news is that most forms of negation in language are straightforward and are easily handled. The bad news is that some forms of negation require elaborate techniques for textual ETL management.
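
A sketch of the simplest approach described above, using a small taxonomy of negative terms and a condition taxonomy (both lists and all names are illustrative):

  NEGATION_TERMS = {"no", "not", "none", "hardly", "never"}
  CONDITION_TAXONOMY = {"heart attack": "condition", "hepatitis a": "condition"}

  def negation_analysis(doc_name, sentence):
      entries = []
      lowered = sentence.lower()
      # A negative term occurring in the same sentence as a condition implies negation.
      if any(f" {term} " in f" {lowered} " for term in NEGATION_TERMS):
          entries.append((doc_name, 0, "negation", "no"))
      for phrase, context in CONDITION_TAXONOMY.items():
          pos = lowered.find(phrase)
          if pos != -1:
              entries.append((doc_name, pos, context, phrase))
      return entries

  print(negation_analysis("report.doc", "John Jones did not have a heart attack"))
  # -> [('report.doc', 0, 'negation', 'no'),
  #     ('report.doc', 26, 'condition', 'heart attack')]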

Numeric Tagging

Another useful form of contextualization is that of numeric tagging. It is normal for a document to have multiple numeric values on the document. It is also normal for one numeric value to mean one thing and another numeric value to mean something else.

For example, a document may have the following:

  • Payment amount
  • Late fee charge
  • Interest amount
  • Payoff amount
  • And so forth

It is most helpful to the analyst who will be analyzing the document to “tag” the different numeric values. In doing so, the analyst can simply refer to the numeric value by its meaning. This makes the analysis of documents that contain multiple numeric values quite convenient. (Stated differently, if the tagging is not done at the time of textual ETL processing, the analyst accessing and using the document will have to do the analysis at the time the document is being analyzed, which is a time-consuming and tedious process. It is much simpler to tag a numeric value at the moment of textual ETL processing.)

Fig. 10.1.12 shows how raw text is read and how tags are created for numeric values.

Fig. 10.1.12
Fig. 10.1.12 Tagging numerical values.

As an example of how textual ETL might read a document and tag a numeric value, consider the following raw text:

  • Raw text—"…Invoice amount—$813.97,…"

The data placed onto the analytic database would look like the following:

  • Document name, byte, context—invoice amount, value—813.97
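
A sketch of numeric tagging, assuming the tags of interest are supplied by the analyst and that each numeric value follows its label in the text (illustrative Python, hypothetical names):

  import re

  NUMERIC_TAGS = ["invoice amount", "payment amount", "late fee charge",
                  "interest amount", "payoff amount"]

  def tag_numeric_values(doc_name, raw_text):
      entries = []
      for tag in NUMERIC_TAGS:
          # The tag, followed by punctuation such as a dash or colon, "$", and the number.
          pattern = re.compile(re.escape(tag) + r"\W*\$?([\d,]+\.?\d*)", re.IGNORECASE)
          for m in pattern.finditer(raw_text):
              entries.append((doc_name, m.start(), tag, m.group(1).replace(",", "")))
      return entries

  print(tag_numeric_values("invoice.doc", "…Invoice amount—$813.97,…"))
  # -> [('invoice.doc', 1, 'invoice amount', '813.97')]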

Date Tagging

Date tagging operates on the same basis as numeric tagging. The only difference is that date tagging operates on dates rather than numeric values.

Date Standardization

Date standardization comes in useful when there are multiple documents that have to be managed or when a single document requires analysis based on date. The problem with date is that it can be formatted in so many ways. Some common ways that date can be formatted include the following:

  • May 13, 2014
  • 23rd of June, 2015
  • 2001/05/28
  • 14/14/09

While a human being can read these forms of a date and understand what is meant, a computer cannot.

Date standardization by textual ETL reads the data, recognizes a date, recognizes what date value is being represented in the text, and converts the date value into a standard value. The standard value is then stored in the analytic database.

Fig. 10.1.13 shows how textual ETL reads raw text and converts date values into standardized values.

Fig. 10.1.13
Fig. 10.1.13 Converting dates into a standardized format.

As an example of the processing done by textual ETL against raw text, consider the following raw text:

…she married on July 15, 2015 at a small church in Southern Colorado….

The database reference generated for the analytic database would look like the following:

  • Document name, byte, context—date value, value—20150715
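
A sketch of the standardization step, assuming a handful of recognized formats (Python's strptime patterns stand in here for the product's date recognition logic):

  from datetime import datetime

  DATE_FORMATS = ["%B %d, %Y",      # July 15, 2015
                  "%Y/%m/%d",       # 2001/05/28
                  "%m/%d/%y",       # 12/25/09
                  "%d %B %Y"]       # 23 June 2015

  def standardize_date(date_text):
      # Try each known format; return the standard value YYYYMMDD if one fits.
      for fmt in DATE_FORMATS:
          try:
              return datetime.strptime(date_text.strip(), fmt).strftime("%Y%m%d")
          except ValueError:
              continue
      return None                    # not recognizable as a date in any known format

  print(standardize_date("July 15, 2015"))   # -> 20150715
  print(standardize_date("2001/05/28"))      # -> 20010528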

List Processing

Occasionally, text contains a list. And occasionally, the list needs to be processed as a list, rather than as a sequential string of text.

Textual ETL can recognize and process a list if asked to do so.

Fig. 10.1.14 shows how raw text is read and processed into a recognizable list in textual ETL.

Fig. 10.1.14
Fig. 10.1.14 List processing.

Consider the raw text:

  • “Recipe ingredients:
    • 1—Rice
    • 2—Salt
    • 3—Paprika
    • 4—Onions
    • …………………”

Textual ETL could read the list and process it as follows:

  • Document name, byte, context—list recipe element 1, value—rice
  • Document name, byte, context—list recipe element 2, value—salt
  • Document name, byte, context—list recipe element 3, value—paprika
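
A sketch of list recognition, assuming the list elements appear as a number, a dash, and a value (the names are hypothetical):

  import re

  def process_list(doc_name, list_name, raw_text):
      # Turn a numbered list such as "1—Rice 2—Salt" into one entry per list element.
      entries = []
      for m in re.finditer(r"(\d+)\s*[—-]\s*([A-Za-z ]+)", raw_text):
          context = f"list {list_name} element {m.group(1)}"
          entries.append((doc_name, m.start(), context, m.group(2).strip().lower()))
      return entries

  text = "Recipe ingredients: 1—Rice 2—Salt 3—Paprika 4—Onions"
  print(process_list("recipe.doc", "recipe", text))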

Associative Word Processing

Occasionally, there are documents that are repetitive in structure but not in terms of words or content. In cases like these, it may be necessary to use a feature of textual ETL called associative word processing.

In associative word processing, an elaborate definitional structure of data is created; then, the words inside the structure are defined according to a common meaning of words.

Fig. 10.1.15 depicts associative word processing.

Fig. 10.1.15
Fig. 10.1.15 Associative word processing.

As an example of associative word processing, consider the following raw text:

Contract ABC, requirement section, required conferences—every two weeks,…

The output to the analytic database might look like the following:

  • Document name, byte, context—scheduled meeting, value—required conference
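
As a minimal sketch (with an invented second phrase to show the idea of a common meaning), the definitional structure can be thought of as a mapping from the phrases found in the documents to the meaning the analyst has assigned them:

  COMMON_MEANINGS = {"required conference": "scheduled meeting",
                     "mandatory meeting": "scheduled meeting"}     # second entry is invented

  def associative_words(doc_name, raw_text):
      entries = []
      lowered = raw_text.lower()
      for phrase, meaning in COMMON_MEANINGS.items():
          pos = lowered.find(phrase)
          if pos != -1:
              entries.append((doc_name, pos, meaning, phrase))
      return entries

  text = "Contract ABC, requirement section, required conferences—every two weeks,…"
  print(associative_words("contract_abc.doc", text))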

Stop Word Processing

Perhaps the most straightforward processing done in textual ETL is stop word processing. Stop words are words that are necessary for proper grammar but are not useful or necessary for understanding the meaning of what is being said. Typical English stop words are "a," "and," "the," "is," "that," "what," "for," "to," "by," and so forth. Typical stop words in Spanish include "el," "la," "es," "de," "que," and "y." All Latin-based languages have stop words.

In doing textual ETL processing, stop words are removed.

The analyst has the opportunity to customize the stop word list that is shipped with the product.

Removing unnecessary stop words has the effect of reducing the overhead of processing raw text with textual ETL.

Fig. 10.1.16 shows raw text that is being processed for stop words by textual ETL.

Fig. 10.1.16
Fig. 10.1.16 Stop word processing.

In order to envision how stop word processing works, consider the following raw text:

…he walked up the steps, looking to make sure he carried the bag properly…

After stop words are removed, the resulting raw text would look like the following:

…walked steps looking carried bag…
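
A minimal sketch of stop word removal; the stop word list here is illustrative, whereas the product ships a customizable list:

  STOP_WORDS = {"a", "and", "the", "is", "that", "what", "for", "to", "by",
                "he", "she", "up", "make", "sure"}                 # illustrative list

  def remove_stop_words(raw_text):
      words = raw_text.replace(",", " ").split()
      return " ".join(w for w in words if w.lower() not in STOP_WORDS)

  print(remove_stop_words("he walked up the steps, looking to make sure he carried the bag properly"))
  # -> "walked steps looking carried bag properly"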

Word Stemming

Another sometimes useful editing feature of textual ETL is stemming. Words in Latin-based languages have word stems, and there are usually many forms of the same word. Consider the stem "mov." The different forms of the word stem mov include move, mover, moves, moving, and moved. Note that the stem itself may or may not be an actual word.

Oftentimes, it is useful to make associations of text that uses the same word stems. It is easy to reduce a word down to its word stem in textual ETL, as seen in Fig. 10.1.17.

Fig. 10.1.17
Fig. 10.1.17 Word stemming.

In order to see how textual ETL processes word stems, consider the following raw text:

…she walked her dog to the park….

The resulting database entry would look like the following:

  • Document name, byte, stem—walk, value—walked
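
A crude suffix-stripping sketch (real stemmers such as Porter's algorithm are considerably more careful; the names here are hypothetical):

  SUFFIXES = ["ing", "ed", "er", "es", "s"]

  def stem(word):
      # Strip the first matching suffix, keeping at least three characters of stem.
      for suffix in SUFFIXES:
          if word.endswith(suffix) and len(word) - len(suffix) >= 3:
              return word[:-len(suffix)]
      return word

  def stem_entry(doc_name, raw_text, word):
      return (doc_name, raw_text.find(word), stem(word.lower()), word)

  print(stem_entry("walk.doc", "…she walked her dog to the park…", "walked"))
  # -> ('walk.doc', 5, 'walk', 'walked')
  print(stem("moving"))                      # -> "mov"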

Document Metadata

On occasion, it is useful to create an index of the documents that are being managed by the organization. The index can be created by itself, or it can be created in conjunction with all the other features available in textual ETL. There are business justifications for both types of design.

Typical contents for a document index include the following:

  • Date document created
  • Date document last accessed
  • Date document last updated
  • Document created by
  • Document length
  • Document title or name

Fig. 10.1.18 shows that document metadata can be created by textual ETL.

Fig. 10.1.18
Fig. 10.1.18 Processing document metadata.

Suppose an organization has a contract document. Running textual ETL against the contract document can produce the following entry into the analytic database:

  • Document name, byte, document title—Jones Contract, July 30, 1995, 32651 bytes, by Ted Van Duyn,…
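
A sketch of gathering part of this index from the file system (the author, title, and other business metadata would have to come from elsewhere; the names are hypothetical):

  import os
  from datetime import datetime

  def document_metadata(path):
      info = os.stat(path)
      return {
          "document name": os.path.basename(path),
          "document length": info.st_size,                               # in bytes
          "date last updated": datetime.fromtimestamp(info.st_mtime).date(),
          "date last accessed": datetime.fromtimestamp(info.st_atime).date(),
          # st_ctime is creation time on Windows but inode-change time on most Unix systems.
          "date created": datetime.fromtimestamp(info.st_ctime).date(),
      }

  # Example (assuming the contract document exists on disk):
  # print(document_metadata("jones_contract.txt"))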

Document Classification

In addition to gathering document metadata, it is also possible to classify documents into an index. As an example of classifying documents, suppose the company is an oil company. One way of classifying documents in an oil company is according to the part of the organization to which they belong. Some documents are about exploration. Some documents are about oil production. Some documents are about refining, oil distribution, and oil sales.

Textual ETL can read the document and determine which classification the document belongs in.

Fig. 10.1.19 shows the reading of raw text and the classification of documents.

Fig. 10.1.19
Fig. 10.1.19 Classification of documents.

As an example of document classification, suppose the corporation has a document on deepwater drilling. The database entry that would be produced looks like the following:

  • Document, byte, document type—exploration, document name
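
A sketch of one way such classification could be done, scoring a document against per-department word lists (the lists and names are invented for illustration):

  DEPARTMENT_WORDS = {
      "exploration": {"drilling", "seismic", "deepwater", "well"},
      "refining": {"refinery", "distillation", "cracking"},
      "sales": {"customer", "invoice", "retail"},
  }

  def classify_document(raw_text):
      words = set(raw_text.lower().split())
      scores = {dept: len(words & markers) for dept, markers in DEPARTMENT_WORDS.items()}
      best = max(scores, key=scores.get)
      return best if scores[best] > 0 else "unclassified"

  print(classify_document("Report on deepwater drilling in the Gulf"))   # -> "exploration"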

Proximity Analysis

Occasionally, the analyst needs to look at words or taxonomies that are in proximity to each other. For example, when a person sees the words “New York Yankees,” the thought is about a baseball team. But when the words “New York” and “Yankees” are separated by two or three pages of text, the thought is something entirely different.

Therefore, it is useful to be able to do what is referred to as “proximity analysis” in textual ETL.

Proximity analysis operates on actual words or taxonomies (or any combination of these elements).

The analyst specifies the words/taxonomies that are to be analyzed, gives a proximity value for how close the words need to be in the text, and gives the proximity variable a name.

Fig. 10.1.20 shows proximity analysis operating against raw text.

Fig. 10.1.20
Fig. 10.1.20 Proximity analysis.

As an example of proximity analysis against raw text, suppose there were raw text that looked like

…away in a manger no crib for a child….

Suppose the analyst had specified that the words manger, child, and crib were the words that made up the proximity variable—baby Jesus.

The results of the processing would look like the following:

  • Document name, byte, context—manger, crib, child, value—baby Jesus.

Care must be taken with proximity analysis as a great amount of system resources can be expended if there are many proximity variables to be sought.
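
A minimal sketch of proximity analysis, assuming the proximity value is a character window and each word is required to appear at least once (hypothetical names):

  def proximity_analysis(doc_name, raw_text, variable_name, words, proximity):
      lowered = raw_text.lower()
      positions = []
      for word in words:
          pos = lowered.find(word.lower())
          if pos == -1:
              return []                     # one of the words is missing entirely
          positions.append(pos)
      if max(positions) - min(positions) <= proximity:
          return [(doc_name, min(positions), ", ".join(words), variable_name)]
      return []

  text = "…away in a manger no crib for a child…"
  print(proximity_analysis("carol.doc", text, "baby Jesus",
                           ["manger", "crib", "child"], proximity=50))
  # -> [('carol.doc', 11, 'manger, crib, child', 'baby Jesus')]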

Functional Sequencing Within Textual ETL

There are many different functions that occur within textual ETL. Depending on the document and the processing that needs to occur, the sequence in which the functions are performed has a great impact on the validity of the results. In fact, the sequence of the functions may determine whether the results that are achieved are accurate or not.

Therefore, one of the more important features of textual ETL is the ability to sequence the order in which functions are executed.

Fig. 10.1.21 shows that the different functions can be sequenced at the discretion of the analyst.

Fig. 10.1.21
Fig. 10.1.21 Sequencing the many functions of textual ETL.
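
Conceptually, the analyst's sequencing choice can be thought of as an ordered pipeline of functions. The sketch below (hypothetical names) simply applies the functions in the order given, since running, say, stop word removal before or after stemming can change the outcome.

  def run_pipeline(raw_text, functions):
      # Apply each textual ETL function to the text in the sequence chosen by the analyst.
      for fn in functions:
          raw_text = fn(raw_text)
      return raw_text

  # For example, using sketches defined earlier in this chapter (order chosen by the analyst):
  # processed = run_pipeline(text, [remove_stop_words, stemming_step])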

Internal Referential Integrity

In order to keep track of the many different variables and the many different relationships, textual ETL has an elaborate internal structure. In order for any given iteration of textual ETL to execute properly, the internal relationships MUST be defined properly. Stated differently, if the internal relationships inside textual ETL are not properly defined, textual ETL will not execute properly, and the results obtained will not be valid and accurate.

As an example of internal relationships inside textual ETL, there is a need to define a document. Once a document is defined, the different indexes that can be created for the document can be defined. Once the different indexes are defined, the delimiters that define the index must be defined. This entire infrastructure must be in place before textual ETL can operate accurately.

In order to ensure that ALL internal relationships are accurately defined, verification processing has to be executed before textual ETL can be run.

Fig. 10.1.22 shows the need for verification processing.

Fig. 10.1.22
Fig. 10.1.22 Verification processing.

If one or more internal relationships are found to be out of place or undefined, the verification process sends a message identifying the relationship that is out of order and declares that verification has not been passed.

Preprocessing, Postprocessing

There is a lot of complexity to the processing inside textual ETL. In most cases, a document can be processed entirely within the confines of textual ETL. However, on occasion, it is possible to preprocess a document or postprocess the document (or do both) if necessary.

Fig. 10.1.23 shows that textual ETL can have either or both preprocessing or postprocessing.

Fig. 10.1.23
Fig. 10.1.23 Preprocessing and postprocessing.

Textual ETL is designed to do as much processing as possible within the scope of the program. The reason neither preprocessing nor postprocessing is a normal part of the workflow is overhead: when you do either preprocessing or postprocessing, the overhead of processing is elevated.

There are several activities that occur in preprocessing, if in fact it is necessary to run preprocessing. Some of those activities include the following:

  • Filtering unwanted and unneeded data
  • Fuzzy logic repair of data
  • Classification of data
  • Raw editing of data

Fig. 10.1.24 shows the processing that occurs inside the preprocessor.

Fig. 10.1.24
Fig. 10.1.24 Preprocessor.

Occasionally, there is a document that simply cannot be processed by textual ETL without being first processed by a preprocessor. In cases like this, a preprocessor comes in handy.

After ETL processing, it is possible to postprocess a document. The functions accomplished in postprocessing are seen in Fig. 10.1.25.

Fig. 10.1.25
Fig. 10.1.25 Postprocessing.

On occasion, an index entry needs to be edited before it is clean. Or data need to be merged before they are in the form the end user expects.

These are all typical activities that can occur in postprocessing.
