Chapter 9.2

Analyzing Repetitive Data

Abstract

There are many facets to the analysis of repetitive data. One place where repetitive data are found is in an open-ended continuous system. Another place where repetitive analytics is done is in a project-based environment. A common practice in repetitive analytics is looking for patterns, and one issue that always accompanies pattern analysis is the occurrence of false positives. A useful approach for doing repetitive analytics is to create what is known as the “sandbox.” Analysis done in the sandbox does not go outside of the corporation; within it, however, the analyst is not constrained with regard to the analysis that is done or the data that can be analyzed. Log tapes often provide a basis for repetitive data analytics.

Keywords

Repetitive data; Open-ended continuous system; Project-based system; Pattern analysis; Outliers; False positives; The “sandbox”; Log tapes

Much of the data found in big data are repetitive. Analyzing repetitive data in the big data environment is quite different from analyzing data in the nonrepetitive environment. As a point of departure, we need to look at what the repetitive big data environment looks like.

Fig. 9.2.1 shows that data in the repetitive big data environment look like lots of units of data laid end to end.

Fig. 9.2.1 Repetitive data.

Repetitive data can be thought of as being organized into blocks, records, and attributes. Fig. 9.2.2 shows this organization.

Fig. 9.2.2 The elements of repetitive data.

A block of data is a large allocation of space. The system knows how to find a block of data. The block of data is loaded with units of data. These units of data can be thought of as records. Within the record of data are attributes of data.

As an example of the organization of data, consider the record of telephone calls. In the block of data is found the information about many phone calls. In the record for each phone call is found some basic information:

  • Date and time of the phone call
  • Who was making the phone call
  • To whom the call was made
  • How long the phone call lasted

There may be other incidental information, such as whether the call was operator-assisted or whether it was an international call. But at the end of the day, the same attributes are found over and over again, for every phone call.

When the system goes looking for data, it knows how to find a block of data. But once the system finds a block, it is up to the analyst to make sense of the data found in it. The analyst does this by “parsing” the data: reading the data in the block, determining where each record is, and then determining which attribute is where within the record.

The process of parsing would be onerous if there were not a high degree of similarity among the records tucked into the block.
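To make the idea concrete, here is a minimal parsing sketch in Python. It assumes a hypothetical fixed-width layout for the call records described above; the field widths and the sample values are invented for illustration and are not taken from any real system.

```python
import struct
from datetime import datetime

# Hypothetical fixed-width layout for one call record inside a block:
#   14 bytes  date and time      (YYYYMMDDHHMMSS)
#   10 bytes  calling number
#   10 bytes  called number
#    6 bytes  duration in seconds, zero-padded
RECORD = struct.Struct("14s10s10s6s")

def parse_block(block: bytes):
    """Walk a block of repetitive data, yielding one call record per
    fixed-width slot."""
    for offset in range(0, len(block) - RECORD.size + 1, RECORD.size):
        stamp, caller, callee, seconds = RECORD.unpack_from(block, offset)
        yield {
            "when": datetime.strptime(stamp.decode(), "%Y%m%d%H%M%S"),
            "from": caller.decode(),
            "to": callee.decode(),
            "duration_sec": int(seconds.decode()),
        }

# Two records laid end to end in one block:
block = (
    b"20240601120005" b"5551234567" b"5559876543" b"000065"
    b"20240601120310" b"5551234567" b"5550001111" b"000182"
)
for record in parse_block(block):
    print(record)
```

Because every record in the block shares the same layout, one small routine suffices for the entire block; this is exactly the similarity that makes parsing repetitive data tractable.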

Fig. 9.2.3 shows that upon encountering a block of data in big data, there is a need to parse the block.

Fig. 9.2.3 Parsing is done to find out what is in a block.

Log Data

One of the most common forms of big data is log data. Indeed, much important corporate information is tucked away in the form of logs.

At first glance, log data do not look much like other repetitive data. Consider the comparison seen in Fig. 9.2.4.

Fig. 9.2.4 The difference between log tape data and repetitive data.

In Fig. 9.2.4, repetitive data do not look like log data at all. It appears that many different kinds of records show up in log data, and indeed, they do. But this apparent contradiction can be resolved by understanding that LOGICALLY, the log tape is nothing more than an amalgamation of repetitive records. This phenomenon is shown by Fig. 9.2.5.

Fig. 9.2.5 Different perspectives.

Even though a log tape is made up of multiple record types and must be parsed, the good news is that there typically is a finite number of record types to parse (unlike nonrepetitive records, where the number of record types to parse is anything but finite).

Fig. 9.2.6 shows that there is a finite number of record types that need to be examined when parsing a log tape.

Fig. 9.2.6 Typical contents of a log tape.
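Because the record types on a log tape form a small, closed set, the parser reduces to a dispatch table with one small routine per type. The sketch below assumes a hypothetical pipe-delimited log with three record types; the type codes and fields are invented for illustration and are not from any real log format.

```python
# One parser per record type; the dispatch table stays small because
# the set of record types on a log tape is finite.

def parse_login(fields):
    return {"type": "login", "user": fields[0], "terminal": fields[1]}

def parse_query(fields):
    return {"type": "query", "user": fields[0], "table": fields[1]}

def parse_update(fields):
    return {"type": "update", "user": fields[0], "rows": int(fields[1])}

PARSERS = {"01": parse_login, "02": parse_query, "03": parse_update}

def parse_log_record(line: str):
    code, *fields = line.rstrip("\n").split("|")
    parser = PARSERS.get(code)
    if parser is None:
        # An unknown code would violate the finite-record-types assumption.
        raise ValueError(f"unrecognized record type: {code}")
    return parser(fields)

log = ["01|jones|T100", "02|jones|ACCOUNTS", "03|jones|42"]
for line in log:
    print(parse_log_record(line))
```

Nonrepetitive data offer no such closed set of types, which is why the same dispatch-table approach does not carry over to them.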

The analysis of repetitive data starts with access to the means by which big data is stored. In many instances, big data is stored in Hadoop. However, there are other technologies (such as Huge Data) that can store and manage large amounts of data.

In an earlier day and age, when there were only structured database management systems, the DBMS itself did much of the basic data management. But in the world of big data, much of the management of the data is up to the user.

Fig. 9.2.7 shows some of the different ways in which basic data management needs to be done in big data.

Fig. 9.2.7 Different means of accessing big data.

Fig. 9.2.7 shows several ways to work with data in Hadoop:

  • Access and analyze the data through an interface
  • Access and parse the data
  • Directly access the data and do the basic functions yourself
  • Use load utilities
  • Use other data management utilities

Most of the focus of the technology that directly accesses data in big data is concerned with two things:

  • The reading and interpretation of the data
  • The management of large amounts of data

The management of large amounts of data is a consuming issue simply because of the volumes involved. The handling of large amounts of data is a science unto itself.
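As one hedged illustration of directly accessing the data and doing the basic functions yourself, the sketch below streams a file out of HDFS with the standard hdfs dfs -cat command and handles it one record at a time, so the whole file never has to fit in memory. It assumes a Hadoop client is installed and on the PATH; the HDFS path is hypothetical.

```python
import subprocess

HDFS_PATH = "/data/calls/2024/06/part-00000"  # hypothetical path

def stream_records(hdfs_path):
    # Stream the file through the standard Hadoop CLI rather than
    # loading it whole; only one line is in memory at a time.
    proc = subprocess.Popen(
        ["hdfs", "dfs", "-cat", hdfs_path],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        proc.stdout.close()
        proc.wait()

for record in stream_records(HDFS_PATH):
    pass  # parse each record as in the earlier sketches
```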

Notwithstanding the need to manage large amounts of data, there is still a need for creating an architecture of data.

Active/Passive Indexing of Data

One of the most useful design techniques available to the architect is creating different kinds of indexes of data. An index helps find data: it is almost always faster to locate data through an index than to search the data directly. So, indexes have their place in analytic processing.

Most indexes are built by starting with a user requirement to access data and then building an index to satisfy that requirement. When an index is built in this manner, it can be called an active index, because there is an expectation that the index will be actively used.

But there is another type of index that can be built, and that index is a passive index. In a passive index, there is no user requirement to start with. Instead, the index is built “just in case” somebody in the future wants to access the data according to how the data are organized. Because there is no active requirement for the building of the index, it is called a “passive” index.

Fig. 9.2.8 shows both active and passive indexes that can be built.

Fig. 9.2.8 Two approaches to accessing repetitive data.

With any index, there is a cost: the cost of initially building the index, the cost of keeping the index current, and the cost of storage for the index. In the world of big data, indexes are typically built by technology called “crawlers.” Crawler technology constantly searches the big data, creating new index records. As long as the data remain stable and unchanged, the data only have to be indexed once. But if data are added or deleted, the index needs constant updates in order to stay current. And in any case, there is the cost of storage for the index itself.
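The sketch below imitates a crawler in miniature: a single pass over pipe-delimited records builds an inverted index from a chosen attribute to record identifiers. The same routine serves for an active index (built for a known query) and a passive index (built “just in case”); the field positions and sample records are assumptions for illustration.

```python
from collections import defaultdict

def crawl(records, key_position):
    # One pass over the data, mapping one attribute to record ids.
    index = defaultdict(list)
    for record_id, record in enumerate(records):
        fields = record.split("|")
        index[fields[key_position]].append(record_id)
    return index

records = [
    "20240601|5551234567|5559876543|65",
    "20240601|5550001111|5551234567|182",
    "20240602|5551234567|5552223333|30",
]

# Active index: built because analysts are known to ask "calls by caller."
active_index = crawl(records, key_position=1)

# Passive index: built "just in case" someone later asks "calls by date."
passive_index = crawl(records, key_position=0)

print(active_index["5551234567"])  # -> [0, 2]
print(passive_index["20240601"])   # -> [0, 1]

# When records are appended, only the new ones need indexing; this is
# the ongoing cost of keeping the index current.
def extend(index, new_records, key_position, start_id):
    for offset, record in enumerate(new_records):
        fields = record.split("|")
        index[fields[key_position]].append(start_id + offset)
```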

Fig. 9.2.9 shows the costs of building and maintaining an index.

Fig. 9.2.9 The costs of building and maintaining an index.

Summary/Detailed Data

Another issue that arises is whether detailed and summary data should both be kept in big data and, if they are, whether there should be a connection between the detailed and the summary data.

First off, there is no reason why summary and detailed data should not be stored in big data. Big data is perfectly capable of holding both kinds of data. But if big data can hold both detailed and summary data, should there be a logical connection between the detailed data and the summary data? In other words, should the detailed data add up to the summary data?

The answer is that even though detailed and summary data can both be stored in big data, there is no necessary connection between them once they are stored there. The reason is that creating summary data from detailed data requires an algorithm, and that algorithm most likely is NOT stored in big data. As long as the algorithm is not stored in big data, there is no necessary logical connection between detailed data and summary data. For this reason, detailed data may or may not add up to the related summary data stored in big data.

Fig. 9.2.10 shows this relationship of data inside big data.

Fig. 9.2.10 Detailed data can be summarized and stored in big data.

But if detailed data and summary data are both kept in big data and the detail does not necessarily add up to the summary found there, then at the very least, the algorithm that was used to create the summary data should be documented.
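A hedged sketch of the idea: the summary record carries, alongside its values, a plain description of the algorithm and of the selection of detail that produced it. The field names here are illustrative, not a standard.

```python
import json
from datetime import datetime

detail = [
    {"caller": "5551234567", "duration_sec": 65,  "intl": False},
    {"caller": "5551234567", "duration_sec": 182, "intl": True},
    {"caller": "5550001111", "duration_sec": 30,  "intl": False},
]

selected = [d for d in detail if not d["intl"]]  # the selection step

summary = {
    "total_duration_sec": sum(d["duration_sec"] for d in selected),
    "call_count": len(selected),
    "algorithm": "sum of duration_sec over selected records",
    "selection": "domestic calls only; international calls excluded",
    "computed_at": datetime.now().isoformat(timespec="seconds"),
}

# The detail (3 calls) does not add up to the summary (2 calls) unless
# the reader also knows the selection that was applied.
print(json.dumps(summary, indent=2))
```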

Fig. 9.2.11 shows that the algorithm and the selection of detailed data should be documented alongside the summary data stored in big data.

Fig. 9.2.11 Documenting summarization.

Metadata in Big Data

While data are the essence of what is stored in big data, it is important not to neglect another type of data: metadata.

There are MANY forms of metadata, and each of them is important. Two of the more important forms are native metadata and derived metadata. Native metadata address the immediate descriptive needs of the data. Typical native metadata include information such as the following:

  • Field name
  • Field length
  • Field type
  • Field identifying characteristics

Native metadata are used to identify and describe data that are stored in big data.

Derived metadata take many forms. Some of those forms include the following (a sketch follows the list):

  • Description of how data were selected
  • Description of when data were selected
  • Description of the source of data
  • Description of how data were calculated
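To make the distinction concrete, the sketch below shows what the two kinds of metadata might look like for the telephone call records used earlier in this chapter. All names and values are illustrative.

```python
# Native metadata: immediate description of the stored fields.
native_metadata = {
    "fields": [
        {"name": "call_date",    "length": 8,  "type": "date"},
        {"name": "caller",       "length": 10, "type": "char", "identifying": True},
        {"name": "callee",       "length": 10, "type": "char"},
        {"name": "duration_sec", "length": 6,  "type": "int"},
    ],
}

# Derived metadata: how, when, and from where the data came to be.
derived_metadata = {
    "selected_how":  "completed calls only; dropped calls excluded",
    "selected_when": "2024-06-01T02:00:00",
    "source":        "switch log tapes, region 7",
    "calculated":    "duration_sec = disconnect_time - connect_time",
}
```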

Fig. 9.2.12 depicts the different types of metadata.

Fig. 9.2.12 Differences between native metadata and derived metadata.

With metadata stored in big data, there arises the issue of where the metadata should be stored. Traditionally, metadata have been stored in a repository, physically separate from the data themselves. But in the world of big data, there are some very good reasons for managing metadata differently. In big data, it often makes sense to store the descriptive metadata physically in the same location and the same data set as the data being described.

There are several very good reasons for storing metadata in the same physical location as the data themselves. Some of those reasons are the following (a sketch follows the list):

  • Storage is cheap. There is no reason why the cost of storage needed to store the metadata should ever be an issue.
  • The world of big data is undisciplined. Having the metadata stored directly with the data being described means that the metadata will never be lost or misplaced.
  • Metadata change over time. When the metadata are stored directly with the data being described, there is ALWAYS a direct relationship between the metadata and the data being described. In other words, the metadata NEVER go out of sync with the data being described.
  • Simplicity of processing. When the analyst starts to process data in big data, there is never a search for the metadata. It is always easy to locate because it is always with the data being described.
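A minimal sketch of the embedded approach: the metadata travel as a header line in the same file as the data, so they cannot be lost, misplaced, or out of sync. The header convention ("#META") is invented for illustration; it is not a standard.

```python
import json

metadata = {
    "fields": ["call_date", "caller", "callee", "duration_sec"],
    "source": "switch log tapes, region 7",
}
records = [
    "20240601|5551234567|5559876543|65",
    "20240601|5550001111|5551234567|182",
]

# Write the description and the data into the same physical file.
with open("calls.dat", "w") as f:
    f.write("#META " + json.dumps(metadata) + "\n")
    f.writelines(r + "\n" for r in records)

# A later reader recovers the description before touching the data;
# the metadata cannot have been separated from what they describe.
with open("calls.dat") as f:
    header = f.readline()
    assert header.startswith("#META ")
    meta = json.loads(header[len("#META "):])
    print(meta["fields"])
```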

Fig. 9.2.13 shows that embedding metadata along with the data stored in big data is a good idea.

Fig. 9.2.13 Embedded metadata is a good idea.

Note that storing metadata directly with the data in big data does not preclude the possibility of also having a repository of metadata. There is nothing to say that metadata cannot be stored with the data in big data AND reside in a repository as well.

Linking Data

One of the fundamental issues of data is how data are linked to each other. Linkage is as much an issue in big data as it has been in other forms of information processing.

In classical information systems, linkage of data was accomplished by matching data values. As an example, one record contained a social security number, and another record contained a social security number as well. The two units of data could then be linked because the same value resided in both records. The analyst could be 99.99999% assured that there was a basis for linkage. (Curiously, since the government reissues social security numbers upon the death of an individual, the analyst cannot be 100% assured that the linkage is real.)

But with the unstructured data (i.e., textual data) that come with big data, it is necessary to accommodate another type of relationship involving the linkage of data. In this case, it is necessary to accommodate what can be called a probable linkage of data.

A probable linkage of data is a linkage based on probability rather than on an exact match of values.

Probabilistic linkages arise wherever there is text.

As an example of a probabilistic linkage, consider the linkages of data based on name. Suppose there are two names in different records—Bill Inmon and William Inmon. Should these values be linked? There is a high probability that these names should be linked. But it is only a probability, not a certainty. Suppose there are two records where the name William Inmon is found. Should these records be linked?

One record refers to a serial killer in Arizona, and another record refers to a data warehouse writer in Colorado. (This is a true example—look it up on the Internet to verify.) Both individuals have the same name. But they are very different people.

When text is involved, linkage is accomplished on the basis of probability of a match, not the certainty of a match.
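A hedged sketch of probabilistic linkage: a simple string-similarity score, plus a tiny nickname table, decides whether two names probably refer to the same party. Real record-linkage systems weigh many attributes at once; the nickname table and the 0.8 threshold here are assumptions for illustration.

```python
from difflib import SequenceMatcher

NICKNAMES = {"bill": "william", "will": "william"}  # illustrative only

def normalize(name: str) -> str:
    # Lowercase the name and expand known nicknames before comparing.
    parts = name.lower().split()
    return " ".join(NICKNAMES.get(p, p) for p in parts)

def probably_linked(a: str, b: str, threshold: float = 0.8) -> bool:
    score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return score >= threshold

print(probably_linked("Bill Inmon", "William Inmon"))   # True  (probable)
print(probably_linked("Bill Inmon", "Barbara Ingram"))  # False

# Even a perfect score is still only a probability: as the text notes,
# two very different people can share the name William Inmon.
```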

Fig. 9.2.14 depicts the different kinds of linkages that are found in big data.

Fig. 9.2.14 Different kinds of linkages.