Chapter 4.5

Contextualizing Repetitive Unstructured Data

Abstract

There are different definitions of big data. The definition used here is that big data encompasses very large volumes of data, is based on inexpensive storage, manages data by the “Roman census” method, and stores data in an unstructured format. There are two major types of big data: repetitive big data and nonrepetitive big data. Only a small fraction of repetitive big data has business value, whereas almost all of nonrepetitive big data has business value. In order to achieve business value, the context of data in big data must be determined. Contextualization of repetitive big data is easily achieved, whereas contextualization of nonrepetitive big data requires textual disambiguation.

Keywords

Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data; Contextualization; Textual disambiguation

In order to be used for analysis, all unstructured data need to be contextualized. This is as true for repetitive unstructured data as it is for nonrepetitive unstructured data. But there is a big difference between the two: contextualizing repetitive unstructured data is easy and straightforward, whereas contextualizing nonrepetitive unstructured data is anything but easy.

Parsing Repetitive Unstructured Data

In the case of repetitive unstructured data, the data are read, usually in Hadoop. After the block of data is read, the data are then parsed. Given the repetitive nature of the data, parsing the data is straightforward. The record is small, and the context of the record is easy to find.

The process of parsing and contextualizing the data found in big data can be done with a commercial utility or can be a custom-written program.
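A minimal sketch of such a custom-written parse step might look like the following. The pipe-delimited click-stream layout and the field names are hypothetical; real repetitive data will have its own fixed layout, but the principle, splitting each small record into named, contextualized fields, is the same.

```python
# Hedged sketch: parsing repetitive unstructured records.
# The "timestamp|user_id|page" layout is a hypothetical example,
# not a format prescribed by the chapter.

def parse_record(line):
    """Split one repetitive record into named, contextualized fields."""
    timestamp, user_id, page = line.strip().split("|")
    return {"timestamp": timestamp, "user_id": user_id, "page": page}

# A block of repetitive records, as might be read from Hadoop.
block = [
    "2024-01-15T09:30:00|u1001|/home",
    "2024-01-15T09:30:02|u1002|/checkout",
]
records = [parse_record(line) for line in block]
print(records[0]["page"])  # -> /home
```

Because every record has the same small, fixed shape, one short function contextualizes the entire block.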

Once the parsing takes place, the output can be placed in any one of many formats. One common format is a set of selected records: as each record is parsed, it is checked against the selection criteria, and the records that qualify are gathered one at a time.
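The record selection step can be sketched as follows, again assuming a hypothetical pipe-delimited layout and an arbitrary selection criterion passed in as a function:

```python
# Hedged sketch: gather only the records that meet a selection
# criterion, one record at a time. Layout and criterion are assumptions.

def select_records(lines, criterion):
    """Parse each repetitive record and keep those matching the criterion."""
    selected = []
    for line in lines:
        timestamp, user_id, page = line.strip().split("|")
        record = {"timestamp": timestamp, "user_id": user_id, "page": page}
        if criterion(record):
            selected.append(record)
    return selected

lines = [
    "2024-01-15T09:30:00|u1001|/home",
    "2024-01-15T09:30:02|u1002|/checkout",
]
hits = select_records(lines, lambda r: r["page"] == "/checkout")
print(len(hits))  # -> 1
```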

A variation of the record selection process occurs when only the context is selected, not the entire record.
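This context-only variation amounts to a projection: instead of keeping whole records, only the contextual fields are kept. A sketch, with hypothetical field names:

```python
# Hedged sketch: select only the context fields of each record,
# not the entire record. Field names are assumptions.

def select_context(lines, fields=("timestamp", "user_id")):
    """Return only the named context fields from each repetitive record."""
    out = []
    for line in lines:
        timestamp, user_id, page = line.strip().split("|")
        record = {"timestamp": timestamp, "user_id": user_id, "page": page}
        out.append({name: record[name] for name in fields})
    return out

lines = ["2024-01-15T09:30:00|u1001|/home"]
print(select_context(lines))
```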

Yet another variation occurs when the record, once selected, is merged on output with another record.
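The merge-on-output variation can be sketched as a lookup against a reference table keyed on a shared field. The customer table and the record layout here are hypothetical:

```python
# Hedged sketch: a selected record merged on output with a matching
# reference record. The customer lookup table is an assumption.

def merge_on_output(line, reference):
    """Parse one record and enrich it with fields from a reference table."""
    timestamp, user_id, page = line.strip().split("|")
    record = {"timestamp": timestamp, "user_id": user_id, "page": page}
    record.update(reference.get(user_id, {}))  # merge only if a match exists
    return record

customers = {"u1002": {"name": "Ann", "segment": "gold"}}
merged = merge_on_output("2024-01-15T09:30:02|u1002|/checkout", customers)
print(merged["segment"])  # -> gold
```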

There are undoubtedly many variations beyond the ones suggested here.

Fig. 4.5.1 shows the possibilities that have been discussed.

Fig. 4.5.1 Two database alternatives.

Recasting the Output Data

Once the parse and selection process has been completed, the next step is to physically recast the data. Many factors determine how the output data are to be physically recast. One factor is how much output data there are. Another factor is what the data will be used for. And there are undoubtedly many other factors as well.

Some of the possibilities for recasting the output data include placing the output data back into big data. Another possibility is to place the output data into an index. Yet another possibility is to send the output data to a standard database management system.
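The last of these possibilities, sending the output to a standard database management system, can be sketched with SQLite from the Python standard library. The table name and columns are assumptions for illustration:

```python
# Hedged sketch: recast parsed output records into a standard DBMS
# table. SQLite stands in for whatever DBMS is actually used;
# the "clicks" table and its columns are hypothetical.
import sqlite3

def recast_to_dbms(records):
    """Load parsed output records into a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clicks (ts TEXT, user_id TEXT, page TEXT)")
    conn.executemany(
        "INSERT INTO clicks VALUES (:timestamp, :user_id, :page)", records
    )
    return conn

records = [
    {"timestamp": "2024-01-15T09:30:00", "user_id": "u1001", "page": "/home"},
]
conn = recast_to_dbms(records)
print(conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0])  # -> 1
```

Placing the output into an index or back into big data would follow the same pattern, with the load step swapped for the chosen destination.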

Fig. 4.5.2 shows the output recasting possibilities.

Fig. 4.5.2 Data recast according to its content.

In the final analysis, even though repetitive unstructured data have to be contextualized, the process of contextualizing them is straightforward.
