Chapter 9.3

Repetitive Analysis

Abstract

There are many facets to the analysis of repetitive data. One place where repetitive data are found is in an open-ended continuous system. Another place where repetitive analytics is done is in a project-based environment. A common practice in repetitive analytics is looking for patterns. One issue that always arises with repetitive pattern analysis is the occurrence of false positives. A useful approach for doing repetitive analytics is to create what is known as the “sandbox.” Analysis in the sandbox does not go outside of the corporation. On the other hand, the analyst is not constrained with regard to the analysis that is done or the data that can be analyzed. Log tapes often provide a basis for repetitive data analytics.

Keywords

Repetitive data; Open-ended continuous system; Project-based system; Pattern analysis; Outliers; False positives; The “sandbox”; Log tapes

Internal, External Data

Because storage is so inexpensive with big data, it is possible to consider storing data that come from sources other than internal ones.

In an earlier day, the cost of storage was such that the only data corporations considered storing were internally generated data. But with the cost of storage diminished by the advent of big data, it is now possible to consider storing external data alongside internal data.

One of the issues with storing external data is that of finding and using identifiers. But textual disambiguation can be used on external data just as it can on internal data, so it is entirely possible to establish discrete identifiers for external data.
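To make the idea concrete, the following is a minimal sketch of one way a discrete identifier might be derived from external text. It assumes simple normalization and hashing rather than full textual disambiguation, and the names used are hypothetical:

    import hashlib

    def normalize(text):
        # Collapse case, punctuation, and spacing differences.
        kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
        return " ".join(kept.split())

    def external_identifier(raw_text):
        # Derive a repeatable surrogate identifier from external text.
        return hashlib.sha256(normalize(raw_text).encode("utf-8")).hexdigest()[:16]

    # The same entity, arriving in two different external formats,
    # resolves to the same discrete identifier.
    assert external_identifier("Acme Corp.") == external_identifier("ACME  corp")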

Fig. 9.3.1 shows that storing external data in big data is a real possibility.

Fig. 9.3.1
Fig. 9.3.1 Repetitive data can come from almost anywhere.

Universal Identifiers

As data are stored in big data and as textual disambiguation is used to bring the data into a standard database format, the subject of universal identifiers or universal measurements arises. Because data come from such diverse sources, because there is little or no discipline or uniformity of data across disparate sources, and because there is a need to relate data to common measurements, there is a need for uniform measurement characteristics across the universe from which data come.

Some universal measurements are fairly obvious; other universal measurements are not.

Three standard or universal measurements of data might include the following:

  • Time—Greenwich mean time
  • Date—Julian date
  • Money—US dollar

Undoubtedly, there are other universal measurements. And each of these measurements has its own quirks.

Greenwich mean time (GMT) is the time at the meridian that runs through Greenwich, England. The good news about GMT is that there is universal understanding as to what that time is. The bad news is that it does not agree with the time kept in the 23 other standard time zones of the world. But at least there is an agreed-upon reference time in one place in the world, from which every other time can be derived.
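As a concrete illustration, the following minimal sketch normalizes a local timestamp to GMT, assuming GMT can be treated as UTC for practical purposes and using only the Python standard library (zoneinfo, Python 3.9 and later):

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # A local timestamp in one of the world's time zones ...
    local = datetime(2014, 5, 16, 9, 30, tzinfo=ZoneInfo("America/Denver"))

    # ... normalized to GMT (treated here as UTC).
    gmt = local.astimezone(ZoneInfo("Etc/GMT"))
    print(gmt.isoformat())  # 2014-05-16T15:30:00+00:00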

The Julian date is a sequential count of days starting from day 0, which fell on Jan 1, 4713 BC. The value of the Julian date is that it is universal and that it reduces a date to an ordinal number. In a standard calendar, calculating how many days there are between 16 May 2014 and 3 Jan 2015 is a complex thing to do. But with the Julian date, such a calculation is very simple.
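The following minimal sketch illustrates the point. Python's date.toordinal() is an ordinal day count in the same spirit as the Julian date; adding the standard offset of 1,721,424.5 yields the astronomical Julian date:

    from datetime import date

    d1 = date(2014, 5, 16)
    d2 = date(2015, 1, 3)

    # With ordinal day numbers, "days between" is a single subtraction.
    print((d2 - d1).days)  # 232

    def julian_date(d):
        # Julian date at midnight of a (proleptic Gregorian) calendar day.
        return d.toordinal() + 1721424.5

    print(julian_date(d2) - julian_date(d1))  # 232.0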

The US dollar is as good a measurement of currency as any other. But even with the US dollar, there are challenges. For example, the conversion rate between the dollar and other currencies is constantly changing. If you convert a value from another currency into dollars on Feb 15, chances are excellent that the same conversion made on Aug 7 will yield a different value. But all other factors being equal, the US dollar serves as a good economic measurement of wealth.
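The following minimal sketch shows why the conversion date matters. The rates in the table are purely illustrative, not real market data:

    from datetime import date

    usd_per_eur = {                    # hypothetical daily rates
        date(2014, 2, 15): 1.37,
        date(2014, 8, 7): 1.34,
    }

    def to_usd(amount_eur, on):
        return amount_eur * usd_per_eur[on]

    # The same 1,000 EUR converts to different dollar amounts on the two
    # dates, so a stored conversion should record the rate date it used.
    print(to_usd(1000, date(2014, 2, 15)))  # 1370.0
    print(to_usd(1000, date(2014, 8, 7)))   # 1340.0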

Fig. 9.3.2 shows some of the universal measurements.

Fig. 9.3.2
Fig. 9.3.2 Some standard measurements of data.

Security

Another significant and serious concern of data (anywhere, not just in big data) is that of security. There are literally hundreds of reasons why data need to be secure:

  • Health-care data need to be secure because of privacy reasons.
  • Personal financial data need to be secure because of theft and personal loss.
  • Corporate financial data need to be secure because of insider trading laws.
  • Corporate activity needs to be secure because of the need to keep trade secrets actually secret.
  • And so forth.

There are a multitude of reasons why certain data need to be treated with the utmost of care when it comes to security.

Fig. 9.3.3 shows the need for security.

Fig. 9.3.3
Fig. 9.3.3 Security is always an issue.

There are many facets to security. Only a few of them will be mentioned here. The simplest (and one of the most effective) forms of security is encryption. Encryption is the process of taking data and substituting encrypted values for actual values. For example, you might take the text “Bill Inmon” and substitute “Cjmm Jonpo” in its place. In this case, we have merely substituted the next letter in the alphabet for each actual letter. It would take a good cryptographer about a nanosecond to decrypt such data. But a good encryption analyst can devise far stronger ways to encrypt the data that would stump even the most sophisticated of analysts.
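The toy substitution described above can be written in a few lines. The sketch below shifts each letter one place forward in the alphabet; it illustrates the idea only, since real encryption relies on vetted algorithms:

    def shift_one(text):
        # Replace each letter with the next letter in the alphabet.
        out = []
        for c in text:
            if c.isalpha():
                base = ord("A") if c.isupper() else ord("a")
                out.append(chr(base + (ord(c) - base + 1) % 26))  # z wraps to a
            else:
                out.append(c)  # leave spaces and punctuation alone
        return "".join(out)

    print(shift_one("Bill Inmon"))  # Cjmm Jonpo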

In any case, the process of encrypting data is commonly used. Typically, fields of data are encrypted inside a database. In health care, for example, only the identifying information is encrypted. The remaining data are left untouched. This allows the data to be used in research without endangering the privacy of the people the data describe.
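The following minimal sketch shows field-level encryption of just the identifying field, assuming the third-party Python cryptography package; the record layout is hypothetical:

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # in practice, held by a key-management service
    cipher = Fernet(key)

    record = {"patient": "Bill Inmon", "diagnosis": "hypertension", "age": 67}

    # Encrypt only the identifying field; the rest stays usable for research.
    record["patient"] = cipher.encrypt(record["patient"].encode("utf-8"))

    # Only an authorized holder of the key can recover the identity.
    name = cipher.decrypt(record["patient"]).decode("utf-8")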

Fig. 9.3.4 shows encryption being done on a field of data.

Fig. 9.3.4
Fig. 9.3.4 Encrypting a field of data.

There are many issues that relate to encryption. Some of the issues are as follows:

  • How secure is the encrypting algorithm?
  • Who can decrypt the data?
  • Should fields that need to be indexed be encrypted?
  • How should decryption keys be protected?
  • And so forth.

One of the more interesting issues is consistency of encryption. Suppose you encrypt the name “Bill Inmon.” Suppose that at a later point, you need to encrypt the name once again. You need to ensure that the name is encrypted the same way everywhere there is a need for encryption. This is necessary because if you need to link records based on an encrypted value, you cannot do so unless the encryption is consistent.
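One common way to achieve such consistency is deterministic pseudonymization with a keyed hash, so that the same value always produces the same token and encrypted records can still be linked. The following is a minimal sketch under that assumption; the key shown is hypothetical:

    import hashlib
    import hmac

    SECRET_KEY = b"keep-this-in-a-key-vault"  # hypothetical key

    def pseudonym(name):
        # A keyed hash: the same input always yields the same token.
        return hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256).hexdigest()

    # Encrypted in two different places, the name still matches,
    # so records can be linked on the encrypted value.
    assert pseudonym("Bill Inmon") == pseudonym("Bill Inmon")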

Fig. 9.3.5 shows the need for consistency of encryption.

Fig. 9.3.5
Fig. 9.3.5 Consistency of encryption is an issue.

Another interesting aspect of security is watching who is trying to look at encrypted data. An attempted access may be purely innocent. Then again, it may not be innocent at all. By examining log tapes and seeing who is trying to access what data, the analyst can determine whether someone is trying to access data they shouldn’t be looking at.
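The following minimal sketch scans a set of access-log records for reads that fall outside a user’s entitlements. The log format and entitlement table are hypothetical:

    log = [  # hypothetical access-log records
        {"user": "jsmith", "dataset": "claims", "action": "read"},
        {"user": "jsmith", "dataset": "payroll", "action": "read"},
        {"user": "akumar", "dataset": "claims", "action": "read"},
    ]

    entitlements = {  # hypothetical: which datasets each user may read
        "jsmith": {"claims"},
        "akumar": {"claims", "payroll"},
    }

    # Flag every access to a dataset the user is not entitled to see.
    for entry in log:
        if entry["dataset"] not in entitlements.get(entry["user"], set()):
            print("possible breach:", entry)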

Fig. 9.3.6 shows that examining log tapes is a good practice in determining if there are breaches of security.

Fig. 9.3.6
Fig. 9.3.6 Who is looking at our encrypted data?

Filtering, Distillation

There are two basic kinds of processing that occur in the analysis of repetitive data—distillation and filtering.

In the distillation of data, repetitive records are selected and read. Then, the data are analyzed, looking for average values, total values, exceptional values, and the like. After the analysis has concluded, a single result is achieved and becomes the output of the distillation process.
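A minimal sketch of distillation follows: many repetitive records go in, and a single summary record comes out. The record layout and outlier threshold are hypothetical:

    readings = [101.2, 99.8, 100.5, 250.0, 100.1]  # repetitive input records

    mean = sum(readings) / len(readings)
    outliers = [r for r in readings if abs(r - mean) > 50]  # crude threshold

    # The single output record of the distillation process:
    summary = {
        "count": len(readings),
        "total": sum(readings),
        "average": mean,
        "exceptional_values": outliers,
    }
    print(summary)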

As a rule, distillation is done on a project basis or on an irregular unscheduled basis.

Fig. 9.3.7 shows the process of distillation of repetitive data.

Fig. 9.3.7
Fig. 9.3.7 The process of distillation.

The other type of processing done against repetitive data is that of filtering and reformatting the repetitive data.

Filtering of data is similar to distillation in that data are selected and analyzed. But the output of filtering is different: filtering produces many output records rather than one. And filtering is done on a regular, scheduled basis.
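A minimal sketch of filtering follows, of the kind that might run on a nightly schedule: many records go in, and many selected and reformatted records come out. The record layout is hypothetical:

    records = [  # hypothetical repetitive input records
        {"id": 1, "type": "click", "ts": "2015-01-03T08:00:00"},
        {"id": 2, "type": "error", "ts": "2015-01-03T08:00:05"},
        {"id": 3, "type": "click", "ts": "2015-01-03T08:00:09"},
    ]

    # Select only the records of interest and reformat each one.
    filtered = [
        {"event_id": r["id"], "when": r["ts"]}
        for r in records
        if r["type"] == "click"
    ]

    print(filtered)  # many output records, unlike distillation's single result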

Fig. 9.3.8 depicts the filtering of repetitive records of data.

Fig. 9.3.8
Fig. 9.3.8 The process of filtering.

Archiving Results

Much of the analytic processing done against repetitive data is of the project variety. And there is a problem with analytic processing done on a project basis: once the project is finished, the results are either discarded or put into “mothballs.” There is no problem until it comes time to do another project. When starting a new project, it is very convenient to see what analyses have preceded it. There may be overlap. There may be complementary processing. If nothing else, a description of how the previous analyses were developed can be useful.

Therefore, at the end of a project, it is useful to create an archive of the project.

Typical information that goes into the archive might include the following (a machine-readable sketch follows the list):

  • What data went into the project
  • How data were selected
  • What algorithms were used
  • How many iterations were there in the project
  • What results were attained
  • Where are the results stored
  • Who conducted the project
  • How long did it take to conduct the project
  • Who sponsored the project
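The following minimal sketch captures the items in the list above as a machine-readable manifest; all values are hypothetical:

    import json

    manifest = {  # all values hypothetical
        "input_data": ["call_detail_2014"],
        "selection_criteria": "calls longer than 30 minutes",
        "algorithms": ["k-means clustering"],
        "iterations": 4,
        "results": "three dominant calling patterns identified",
        "results_location": "/archive/projects/2015-001/",
        "analyst": "J. Smith",
        "duration_days": 45,
        "sponsor": "Marketing",
    }

    with open("project_manifest.json", "w") as fp:
        json.dump(manifest, fp, indent=2)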

Fig. 9.3.9 shows that an archive of projects is a worthwhile thing to do.

Fig. 9.3.9
Fig. 9.3.9 Archiving the results of analysis.

At the very least, the results created by the project should be gathered and stored, as seen in Fig. 9.3.10.

Fig. 9.3.10
Fig. 9.3.10 Documenting the output.

Metrics

At the outset of repetitive analysis, it is worthwhile to establish the metrics that will determine whether a project has met its objectives. The optimal time to outline such metrics is at the very beginning of the project.

There is a problem with delineating the metrics at the beginning: in a heuristically run project, many of the metrics cannot be definitively established in advance.

Nevertheless, outlining the metrics at the very least gives the project a sense of focus.

The metrics can be described in very broad terms. There is no need to define them at a fine level of detail.

Fig. 9.3.11 shows that metrics define when a project has been successful or less than successful.

Fig. 9.3.11
Fig. 9.3.11 Crossing the finish line.