Chapter 9.1

Repetitive Analytics: Some Basics

Abstract

There are many facets to the analysis of repetitive data. One environment in which repetitive data are analyzed is the open-ended continuous system. Another place where repetitive analytics is done is a project-based environment. A common practice in repetitive analytics is looking for patterns. One issue that always arises with repetitive pattern analysis is the occurrence of false positives. A useful approach for doing repetitive analytics is to create what is known as the “sandbox.” Analysis in the sandbox does not go outside of the corporation. On the other hand, the analyst is not constrained with regard to the analysis that is done or the data that can be analyzed. Log tapes often provide a basis for repetitive data analytics.

Keywords

Repetitive data; Open-ended continuous system; Project-based system; Pattern analysis; Outliers; False positives; The “sandbox”; Log tapes

There are some basic concepts and practices regarding analytics that are pretty much universal. These practices and concepts apply to repetitive analytics and are essential for the data scientist.

Different Kinds of Analysis

There are two distinct types of analysis—open-ended continuous analysis and project-based analysis. Open-ended continuous analysis is typically found in the structured corporate world but is occasionally found in the repetitive data world. In open-ended continuous analysis, the analysis starts with the gathering of data. Once the data are gathered, the next step is to refine and analyze the data. After the data are analyzed, a decision or a set of decisions is made, and the results of those decisions affect the world. Then, more raw data are gathered, and the process starts over again.

The process of gathering data, refining it, analyzing it, and then making decisions based on the analysis on an ongoing basis is actually very common. An example of such a continuous feedback loop might be a bank's decision to raise or lower the loan rate. The bank gathers information about loan applications and loan payments. Then, the bank digests that information and decides to raise or lower loan rates. The bank raises the rate and then tests to see what the results have been. Such is an example of an open-ended continuous analytic loop.

The other type of analytic system is a project-based analysis. In a project-based analysis, the intent is to do the analysis just once. For example, the government may do an analysis of how many illegal immigrants have been integrated successfully into society. The intent is to conduct such a study exactly once. There may be safety studies conducted by an automobile manufacturer. Or there may be a chemical analysis of a product. Or there may be a study of the content of ethanol in gasoline, and so forth. There can be any number of onetime studies of any sort.

Fig. 9.1.1 shows that there are two types of analytic studies.

Fig. 9.1.1
Fig. 9.1.1 Two types of analytics.

Whether a study is onetime or ongoing greatly affects the infrastructure surrounding the study. A continuous study requires that an ongoing infrastructure be created. A onetime study requires a very different infrastructure.

Looking for Patterns

However the analytic study is done, the study typically looks for patterns. Stated differently, the organization identifies patterns that lead to conclusions. The patterns are tip-offs that important, previously unknown events are occurring. By knowing these patterns, the organization can then have insights that allow it to manage itself more efficiently, more safely, more economically, or according to whatever the end goal of the study is.

The patterns can come in different forms. Sometimes, the patterns appear as discrete occurrences of events. In other cases, a variable is measured continuously. Fig. 9.1.2 shows two common forms in which patterns are measured.

Fig. 9.1.2
Fig. 9.1.2 Different ways to find patterns in data.

Where there are discrete occurrences, the occurrences are plotted onto a “scatter chart.” The scatter chart is merely a collection of the points placed onto a chart. There are many issues that relate to the creation of a scatter chart. One of the more important issues is that of determining whether a pattern is relevant. On occasion, there may be points that have been collected that should not have been collected. On other occasions, the points on the chart may form more than one pattern. A professional statistician is needed to determine the accuracy and the integrity of the points found on a scatter diagram.

Another way of finding patterns is to look at a continuously measured variable. In this case, there typically are threshold levels that are of interest. As long as the continuous variable stays within the thresholds, there is no problem. But the moment the variable exceeds one or more of the threshold levels, the analyst takes interest. Usually, the analysis centers on the question of what else occurred when the variable exceeded the threshold value.
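
As a rough illustration of this kind of threshold analysis, the small sketch below (the readings, timestamps, and threshold value are all hypothetical) simply flags the moments at which a continuously measured variable exceeds its threshold so that the analyst can then ask what else was happening at those moments.

```python
# A minimal sketch of threshold analysis on a continuously measured variable.
# The readings, timestamps, and threshold value are hypothetical.

readings = [
    ("09:00", 71.2), ("09:05", 73.8), ("09:10", 80.4),
    ("09:15", 76.1), ("09:20", 92.7), ("09:25", 69.9),
]
THRESHOLD = 78.0

# Collect the points where the variable exceeds the threshold; these are the
# moments the analyst investigates to see what else occurred at the same time.
exceedances = [(ts, value) for ts, value in readings if value > THRESHOLD]

for ts, value in exceedances:
    print(f"{ts}: value {value} exceeded threshold {THRESHOLD}")
```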

Once the points representing events have been captured and fitted to a graph, the next issue is that of identifying false positives. A false positive is an event that has occurred but for reasons unrelated to the study. If enough variables are studied, false positives will occur merely because so many variables have been correlated with each other.

There once was a famous false-positive correlation that was widely known and discussed. That correlation stated that if the AFC won the Super Bowl, then the stock market would go down for the next year, but if the NFC won the Super Bowl, then the stock market would rise. Based on this false positive, one could supposedly make money in the stock market by knowing in advance what was going to happen.

Of course, there is no real correlation between the rise or fall of the stock market and who wins the Super Bowl.

Fig. 9.1.3 shows this infamous false-positive correlation.

Fig. 9.1.3
Fig. 9.1.3 A famous false positive result.

In reality, winning a football game is no indicator of the economic performance of the nation. The fact that for many years there actually was a correlation illustrates that if enough trends are compared, somebody will find a correlation somewhere, even if the correlation occurs by simple coincidence.
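
The way that false positives arise from sheer volume of comparison can be illustrated with nothing more than random numbers. The sketch below is purely synthetic: every series is random noise, yet when enough pairs of series are compared, some pairs appear to be strongly correlated.

```python
# A minimal sketch showing how false positives arise when enough unrelated
# variables are compared. Every series below is pure random noise.
import random
from statistics import correlation  # available in Python 3.10+

random.seed(7)
n_series, n_points = 40, 12
series = [[random.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

# Correlate every pair of series. None of them is related to any other,
# yet with 780 comparisons some pairs will look strongly correlated.
pairs = [(i, j, correlation(series[i], series[j]))
         for i in range(n_series) for j in range(i + 1, n_series)]
i, j, r = max(pairs, key=lambda p: abs(p[2]))
print(f"strongest apparent 'pattern': series {i} vs series {j}, r = {r:+.2f}")
print(f"pairs with |r| > 0.7: {sum(abs(p[2]) > 0.7 for p in pairs)} -- all coincidence")
```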

There may be many reasons why false-positive readings occur. Consider an analysis of Internet sales. One looks at the results of a sale and starts to draw conclusions. In many cases, the conclusion is correct and valid. But suppose one Internet sale occurred because someone's cat walked across the keyboard at just the wrong time. There is no legitimate conclusion that can be drawn from an occurrence such as that (Fig. 9.1.4).

Fig. 9.1.4
Fig. 9.1.4 A false positive.

False-positive readings can occur for a huge number of unknown and random reasons.

Heuristic Processing

Analytic processing is fundamentally different from other types of processing. In general, analytic processing is known as “heuristic” processing. In heuristic processing, the requirements for the next round of analysis are discovered from the results of the current iteration of processing. In order to understand the dynamics of heuristic processing, consider classical system development life cycle (SDLC) processing.

Fig. 9.1.5 shows a classical SDLC development effort.

Fig. 9.1.5
Fig. 9.1.5 Classical systems development.

In classical SDLC processing, the first step is to gather requirements. In classical SDLC, the intent is to gather all requirements before the next step of development occurs. This approach is sometimes called the “waterfall” approach because of the need to gather all requirements before engaging in the next step of development.

But heuristic processing is fundamentally different from the classical SDLC. In heuristic processing, you start with some requirements. You build a system that does the analysis those requirements call for. Then, once you have results, you sit back, reflect on them, and rethink your requirements. You then restate the requirements and redevelop and reanalyze again. Each pass through the redevelopment exercise is called an “iteration.” You continue building iterations until you achieve results that satisfy the organization sponsoring the exercise.

Fig. 9.1.6 depicts the heuristic approach to analysis.

Fig. 9.1.6
Fig. 9.1.6 The iterative approach.

One of the characteristics of the heuristic process is that, at the beginning, it is impossible to know how many iterations of redevelopment will be done. It simply is impossible to know how long the heuristic analytic process will take. Another characteristic of the heuristic process is that the requirements may change very little or may change completely during the life of the process. Again, it is impossible to know what the requirements will look like at the end of the heuristic process.

Because of the iterative nature of the heuristic process, the development process is much less formal and a lot more relaxed than the development process found in the classical SDLC environment. The emphasis of the heuristic process is on speed of development and the quick production and analysis of results.
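
The heuristic loop just described can be pictured roughly in code. In the sketch below, the analysis step, the revision of requirements, and the stopping test are all stand-ins; the point is only that the results of each iteration feed the requirements of the next, and that the number of iterations is not known in advance.

```python
# A minimal sketch of the heuristic (iterative) approach. The functions are
# placeholders: "analyze" is whatever analysis the current requirements call
# for, "revise" is the analyst rethinking the requirements after reflecting
# on the results, and "satisfied" is the sponsoring organization's judgment.

def analyze(requirements, data):
    # stand-in analysis: select the records that satisfy the current criteria
    return [rec for rec in data if rec >= requirements["minimum"]]

def revise(requirements, results):
    # stand-in revision: tighten the criteria based on what was just learned
    return {"minimum": requirements["minimum"] + 10}

def satisfied(results):
    # stand-in stopping test
    return len(results) <= 3

data = [12, 45, 67, 23, 88, 91, 34, 76, 59, 81]
requirements = {"minimum": 20}           # the initial, incomplete requirements

iteration = 0
while True:
    iteration += 1
    results = analyze(requirements, data)
    print(f"iteration {iteration}: requirements={requirements}, "
          f"{len(results)} records of interest")
    if satisfied(results):
        break                            # how many iterations? unknown in advance
    requirements = revise(requirements, results)
```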

Freezing Data

Another characteristic of the heuristic process is the need for data to be “frozen” from time to time. In the heuristic process, the algorithms that process the data are constantly changing. If the data being operated on are also being changed at the same time, the analyst can never tell whether new results come from the change in algorithms or from a change in the data. Therefore, as long as the algorithms going against the data are changing, it is sometimes useful to “freeze” the data being operated on.

The notion that data need to be frozen is antithetical to other forms of processing. In other forms of processing, there is a need to operate on the most current data possible. In other forms of processing, data are being updated and changed as soon as possible. Such is not the case at all in heuristic processing.

Fig. 9.1.7 shows the need to freeze data as long as the algorithms processing the data are changing.

Fig. 9.1.7
Fig. 9.1.7 Freezing data to ensure consistent results.
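
One simple way to picture the freezing of data is sketched below. The file names, the sample records, and the two algorithm versions are hypothetical; the idea is that the snapshot is taken once and every version of the algorithm is run against that same snapshot, so that any change in results can only have come from the change in the algorithm.

```python
# A minimal sketch of "freezing" data for heuristic analysis. The file names,
# the sample records, and the two algorithm versions are all hypothetical.
import csv
import shutil
from pathlib import Path

WORKING_FILE = Path("transactions.csv")          # keeps being updated by operations
FROZEN_FILE = Path("transactions_frozen.csv")    # the copy the analyst works against

# Stand-in for the operational file, created here only so the sketch runs.
if not WORKING_FILE.exists():
    with WORKING_FILE.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "amount"])
        writer.writerows([[1, 25.00], [2, -4.50], [3, 310.00]])

# Freeze: take the snapshot once; every iteration of the analysis reads it.
if not FROZEN_FILE.exists():
    shutil.copyfile(WORKING_FILE, FROZEN_FILE)

with FROZEN_FILE.open(newline="") as f:
    rows = list(csv.DictReader(f))

def algorithm_v1(rows):
    return sum(float(r["amount"]) for r in rows)

def algorithm_v2(rows):
    # a revised algorithm; because the data are frozen, any change in the result
    # is attributable to the algorithm, not to newly arrived data
    return sum(float(r["amount"]) for r in rows if float(r["amount"]) > 0)

print("algorithm v1 against frozen data:", algorithm_v1(rows))
print("algorithm v2 against frozen data:", algorithm_v2(rows))
```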

The Sandbox

Heuristic processing is often done in what is called the “sandbox.” The sandbox is an environment where the analyst has the opportunity to go and “play” with data. The analyst can look at the data one way one day and another way another day. The analyst is not restricted in terms of what kind of processing can be done or how much processing can be done.

The reason why there is a need for a sandbox is that standard corporate processing requires tight control of processing. One reason for that tight control is resource limitations. In the standard corporate operating environment, there is a need to control the resources used for processing by all analysts, because there is a need for high performance in that environment. But in the sandbox environment, there is no such restriction on the analyst. In the sandbox environment, there is no need for high performance. Therefore, the analyst is free to do whatever analytic investigation he or she wishes to do.

But there is another reason for the sandbox environment. In the standard operating environment, there is a need for tight control of data access and calculation because of security and data governance concerns. In the sandbox, there are no such concerns.

The flip side of processing in the sandbox is that, because there are no controls in the sandbox environment, the results of processing in the sandbox should not be used in a formal manner. The results of processing in the sandbox can lead to great new and important insight. But after the insight has been captured, the insight is translated into a more formal system and is incorporated into the standard operating environment.

The sandbox environment then is a great boon to the analytics community.

Fig. 9.1.8 shows the sandbox environment.

Fig. 9.1.8
Fig. 9.1.8 An analytical sandbox.

The “Normal” Profile

One of the most important things that an analyst can develop is something that can be called the “normal” profile. The normal profile is a composite picture of the audience that is being analyzed.

In the case of people, the normal profile may contain such things as gender, age, education, location, number of children, and marital status.

Fig. 9.1.9 shows a “normal” profile.

Fig. 9.1.9
Fig. 9.1.9 A normal profile.

The normal profile for a corporation may include such attributes as the size of the corporation, its locations, the type of product or service it creates, and its revenue. There are different definitions of what is normal for different environments.

There are many reasons why a “normal” profile is useful. One reason is that the profile is just plain interesting. The normal profile tells management at a glance what is going on inside a system. But there is another very important reason why a normal profile is useful. When looking at a large body of data, it is oftentimes useful to look at a single record and measure just how far from the norm the record is.

In many cases, the further from the norm a record is, the more interesting it becomes. But you can’t spot a record that is far from the norm unless you first understand what the norm is.
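
A rough sketch of building a normal profile and then measuring how far a single record sits from the norm is shown below. The records and attributes are made up for illustration, and the distance measure is one simple choice among many.

```python
# A minimal sketch of a "normal" profile: compute the norm for a few numeric
# attributes, then measure how far any single record is from that norm.
# The records and attributes are hypothetical.
from statistics import mean, stdev

records = [
    {"age": 34, "children": 2, "income": 52000},
    {"age": 41, "children": 3, "income": 61000},
    {"age": 29, "children": 1, "income": 48000},
    {"age": 38, "children": 2, "income": 57000},
    {"age": 73, "children": 0, "income": 210000},   # a record far from the norm
]
attributes = ["age", "children", "income"]

# The "normal" profile: the mean and spread of each attribute.
profile = {a: (mean(r[a] for r in records), stdev(r[a] for r in records))
           for a in attributes}

def distance_from_norm(record):
    # Sum of absolute z-scores: how many standard deviations, in total,
    # the record sits away from the normal profile.
    return sum(abs(record[a] - m) / s for a, (m, s) in profile.items())

for r in sorted(records, key=distance_from_norm, reverse=True):
    print(f"{r} -> distance from norm {distance_from_norm(r):.1f}")
```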

Distillation, Filtering

When doing analytic processing against the repetitive big data environment, the types of processing can be classified in one of two ways. There is what can be termed “distillation” processing, and there is what can be termed “filtering” processing.

Both of these processes can be done depending on the needs of the analyst.

In distillation processing, the results of the processing are a single set of results, such as the creation of a profile. In retail operations, the desire might be to create a normal profile. In banking, the result of distillation might be to create the new lending rate. In manufacturing, the result might be to determine the best materials for manufacture.

In any case, the results of the distillation process are a single occurrence of a set of values.

In filtering, the results are quite different. In filtering, the result of processing is the selection of and the refinement of multiple records. In filtering, the objective is to find all records that satisfy some criteria. Once those records have been found, the records can then be edited, manipulated, or otherwise altered to suit the needs of the analyst. Then, the records are output for further processing or analysis.

In a retail environment, the results of filtering might be the selection of all high-value customers. In manufacturing, the results of filtering might be the selection of all end products that failed quality tests. In health care, the results of filtering might be all patients afflicted with a certain condition and so forth.

The processing that occurs in distillation and in filtering is quite different. The emphasis in distillation is on analytic and algorithmic processing, and the emphasis in filtering is on the selection of records and the editing of those records.

Fig. 9.1.10 illustrates the types of processing that can be done against repetitive data.

Fig. 9.1.10
Fig. 9.1.10 Distillation and filtering.
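
The contrast between the two kinds of processing can be sketched in a few lines. In the illustration below (the retail records and the $100.00 cutoff are hypothetical), distillation reduces all the records to a single set of values, while filtering selects individual records and passes them along.

```python
# A minimal sketch contrasting distillation and filtering over the same
# repetitive records. The sales records and the $100.00 cutoff are hypothetical.
from statistics import mean

sales = [
    {"customer": "C1", "amount": 23.50},
    {"customer": "C2", "amount": 410.00},
    {"customer": "C3", "amount": 67.25},
    {"customer": "C1", "amount": 152.80},
    {"customer": "C4", "amount": 12.10},
]

# Distillation: the whole body of data is reduced to a single set of values.
distilled = {
    "number_of_sales": len(sales),
    "total_revenue": sum(s["amount"] for s in sales),
    "average_sale": mean(s["amount"] for s in sales),
}
print("distillation result:", distilled)

# Filtering: the records themselves are selected (here, high-value sales)
# and passed on for further processing or analysis.
high_value = [s for s in sales if s["amount"] > 100.00]
print("filtering result:", high_value)
```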

Subsetting Data

One of the results of filtering is the creation of subsets of data. As repetitive data are read and filtered, the data are separated into different subsets. There are many practical reasons for subsetting data. Some of those reasons are the following:

  • The reduction in the volume of data that have to be analyzed. It is much easier to analyze and manipulate a small subset of data than it is to analyze that same data mixed in with many other nonrelevant occurrences of data.
  • Purity of processing. By subsetting data, the analyst can filter out unwanted data so that the analysis can focus on the data that are of interest. Creating a subset of data means that the analytic algorithmic processing that occurs can be very focused on the objective of the analysis.
  • Security. Once data are selected into a subset, they can be protected with even higher levels of security than when the data existed in an unfiltered state.

Subsetting data for analysis is a commonly used technique and has been used for as long as there have been data and computers.

One of the uses of subsetting of data is to set the stage for sampling.

In data sampling, processing goes against a sample of data rather than against the full set of data. In doing so, the resources used for creating the analysis are considerably less, and the time that it takes to create the analysis is significantly reduced. And in heuristic processing, the “turnaround time” to do an analysis can be very important.

Sampling is especially important when doing heuristic analysis against big data because of the sheer volume of data that has to be processed.

Fig. 9.1.11 shows the creation of an analytic sample.

Fig. 9.1.11
Fig. 9.1.11 Creating the analytical sample.

There are some downsides to sampling. One downside is that the analytic results obtained when processing the sample may differ from the results achieved when processing the entire database. For example, the sampling may indicate that the average age of a customer is 35.78 years. When the full database is processed, it may be found that the average age of the customer is really 36.21 years. In some cases, this small differential between results is inconsequential. In other cases, the difference in results is truly significant. Whether the difference is significant depends on how large it is and on how important accuracy is.

If there is not much of a problem with slight inaccuracies of data, then sampling works well.

If in fact there is a desire to get results that are as accurate as possible, then the algorithmic development can be done against the sample data. When the analyst is satisfied that the algorithms are working properly, the final run can be made against the entire database, thereby satisfying both the need to do analysis quickly and the need to achieve accurate results.
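
The workflow just described, developing against a sample for quick turnaround and then making the final run against the full body of data, can be sketched roughly as follows. The customer ages and the 10% sample rate are synthetic choices made only for the illustration.

```python
# A minimal sketch of developing an analysis against a sample and then making
# the final run against the full database. The data and sample rate are synthetic.
import random
from statistics import mean

random.seed(42)
full_database = [random.randint(18, 80) for _ in range(100_000)]   # customer ages

# Draw a 10% sample; each heuristic iteration runs against this much smaller set.
sample = random.sample(full_database, k=len(full_database) // 10)

def analysis(ages):
    # the analysis being developed -- here simply the average customer age
    return mean(ages)

print(f"result from the sample:        {analysis(sample):.2f}")
print(f"result from the full database: {analysis(full_database):.2f}")
# The two results differ slightly; whether that difference matters depends on
# how much accuracy the final analysis requires.
```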

Bias of the Sample

One issue that arises with sampling is the bias of the sample. When data are selected for inclusion in the sampling database, there is ALWAYS a bias of the data. What the bias is and how badly the bias colors the final analytic results are a function of the selection process. In some cases, there is a bias, but the bias of the data really doesn’t matter. In other cases, there is a real impact made on the final results because of the bias of the data selected for the sampling database.

The analyst must constantly be aware of the existence of and the influence of the bias of the sampling data.

Fig. 9.1.12 shows that the marginal accuracy gained by processing the fully populated database rather than a sample comes at a considerable cost.

Fig. 9.1.12
Fig. 9.1.12 The differences between results obtained from a sample database and a fully populated database.

Filtering Data

There are many reasons why filtering data—especially big data—is a common practice. Filtering can be done on almost any attribute or attribute value found in the database.

Fig. 9.1.13 shows that filtering data can be done many ways.

Fig. 9.1.13
Fig. 9.1.13 Filtering data.

While data are being filtered, the data can at the same time be edited and manipulated. It is common practice for the output of the filtering process to be records that carry some means of ordering. Usually, the ordering is done by including uniquely valued attributes. For example, the output relating to a person may include the person's social security number. Or the filtered output for manufactured goods may carry the part number along with the lot number and the date of manufacture. Or, if the filtered data were from real estate, the property address may be included as a key attribute.

Fig. 9.1.14 shows that the data that are produced as part of the filtering process usually contain uniquely valued attributes.

Fig. 9.1.14
Fig. 9.1.14 Filtering raw data.
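
A rough sketch of filtered output that carries uniquely valued key attributes, here a part number, lot number, and date of manufacture, is shown below. The records and the quality criterion are made up for the illustration.

```python
# A minimal sketch of filtering manufacturing records while carrying uniquely
# valued attributes (part number, lot number, manufacture date) in the output
# so that the output records can be ordered and tied back to their source.
# The records and the quality criterion are hypothetical.

parts = [
    {"part_no": "P-1001", "lot": "L-07", "mfg_date": "2015-03-02", "passed_qa": True},
    {"part_no": "P-1002", "lot": "L-07", "mfg_date": "2015-03-02", "passed_qa": False},
    {"part_no": "P-1003", "lot": "L-09", "mfg_date": "2015-03-05", "passed_qa": False},
    {"part_no": "P-1004", "lot": "L-09", "mfg_date": "2015-03-05", "passed_qa": True},
]

# Filter: keep only the parts that failed quality tests, retaining the key
# attributes so the output can be ordered and traced.
failed = [
    {"part_no": p["part_no"], "lot": p["lot"], "mfg_date": p["mfg_date"]}
    for p in parts
    if not p["passed_qa"]
]

# Order the filtered output by its key attributes.
for rec in sorted(failed, key=lambda r: (r["mfg_date"], r["lot"], r["part_no"])):
    print(rec)
```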

One result of filtering is the production of subsets of data: when data are filtered, the result is the creation of a subset of data. The analyst creating the filtering mechanism may want to use the creation of a subset as an opportunity to prepare for future analysis.

Stated differently, when a subset is created, it may be useful to put a little planning into the process to create the subset so that it will be useful to future analytic processing.

Fig. 9.1.15 shows that subsets of data are created when data are filtered.

Fig. 9.1.15
Fig. 9.1.15 Subsetting filtered data.

Repetitive Data and Context

In general, repetitive data yield their context easily and readily. Because repetitive data have so many occurrences and because the occurrences are all similarly structured, finding context is easy to do.

When data are in the world of big data, the data are unstructured in the sense that they are not being managed by a standard database management system. Because the data are unstructured, in order to be used, the repetitive data have to pass through the parsing process (as do all unstructured data). But because the data are structurally repetitive, once the analyst has parsed the first record, all subsequent records can be parsed in exactly the same manner. Parsing of repetitive data in big data still must be done, but doing it is an almost trivial thing to do.

Fig. 9.1.16 shows that context for repetitive data is usually easy to find and determine.

Fig. 9.1.16
Fig. 9.1.16 Finding the context of repetitive data.
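
Because every record has the same structure, the parsing worked out for the first record applies to all the rest. The sketch below assumes a simple delimited layout (the layout and field names are hypothetical) to show how little work the parsing of repetitive data really is.

```python
# A minimal sketch of parsing repetitive records. The layout is hypothetical:
# a delimited record of date | store | product | amount. Because every record
# has exactly the same structure, the parsing rule worked out for the first
# record is simply applied to all of the others.

raw_records = [
    "2015-06-01|STORE-12|SKU-8841|19.95",
    "2015-06-01|STORE-04|SKU-1203|4.50",
    "2015-06-02|STORE-12|SKU-7710|129.00",
]

FIELDS = ("sale_date", "store", "product", "amount")   # the one layout for every record

def parse(record):
    values = record.split("|")
    parsed = dict(zip(FIELDS, values))
    parsed["amount"] = float(parsed["amount"])          # give the numeric field its context
    return parsed

for rec in raw_records:
    print(parse(rec))
```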

When looking at repetitive data, most data are fairly unexceptional. About the only interesting data are the exceptional values that occur inside the repetitive data. As an example of interesting values, consider retail sales. Most sales for a retailer are from $1.00 to $100.00. But occasionally, an order is for more than $100.00. These exceptions are of great interest to the retailer (Fig. 9.1.17).

Fig. 9.1.17
Fig. 9.1.17 Most repetitive data are nonexceptional.

The retailer is interested in such issues as the following (a rough sketch of this kind of exception analysis follows the list):

  • How often do they occur?
  • How large are they?
  • What else occurs in conjunction with them?
  • Are they predictable?
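
A rough sketch of that kind of exception analysis is shown below. The sales records, the sales channel attribute, and the $100.00 cutoff are hypothetical.

```python
# A minimal sketch of analyzing exceptional values in repetitive retail data:
# how often the exceptions occur, how large they are, and what accompanies them.
# The sales records and the $100.00 cutoff are hypothetical.
from statistics import mean

sales = [
    {"amount": 23.50, "channel": "store"},
    {"amount": 410.00, "channel": "web"},
    {"amount": 67.25, "channel": "store"},
    {"amount": 152.80, "channel": "web"},
    {"amount": 12.10, "channel": "store"},
    {"amount": 305.00, "channel": "web"},
]

exceptions = [s for s in sales if s["amount"] > 100.00]

print(f"how often: {len(exceptions)} of {len(sales)} sales "
      f"({100 * len(exceptions) / len(sales):.0f}%)")
print(f"how large: average {mean(s['amount'] for s in exceptions):.2f}, "
      f"largest {max(s['amount'] for s in exceptions):.2f}")

# What else occurs in conjunction with them: here, which channel they came through.
channels = {}
for s in exceptions:
    channels[s["channel"]] = channels.get(s["channel"], 0) + 1
print("in conjunction with:", channels)
```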

Linking Repetitive Records

Repetitive records by themselves have value. But occasionally, repetitive records that have been linked together tell an even larger story. When records are linked together and there is a logical reason for the linkage, a more complex story can be derived from the data.

Repetitive records can be linked together in many ways. But the most common way to link them together is through common occurrence of data values. For example, there may be a common customer number that links the records. Or there may be a common part number. Or there may be a common retail location number and so forth.

There are in fact many different ways to link together repetitive records, depending on the business problem being studied.

Fig. 9.1.18 shows that it sometimes makes sense to link repetitive records together based on a business relationship between the records.

Fig. 9.1.18
Fig. 9.1.18 Linking records.
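
A rough sketch of linking repetitive records through a common data value, in this case a shared customer number, is shown below. The transaction records are made up for the illustration.

```python
# A minimal sketch of linking repetitive records through a common value --
# here a shared customer number. The records are hypothetical.
from collections import defaultdict

transactions = [
    {"customer_no": "C-100", "date": "2015-06-01", "amount": 25.00},
    {"customer_no": "C-200", "date": "2015-06-01", "amount": 310.00},
    {"customer_no": "C-100", "date": "2015-06-03", "amount": 48.50},
    {"customer_no": "C-100", "date": "2015-06-09", "amount": 12.75},
    {"customer_no": "C-200", "date": "2015-06-12", "amount": 99.99},
]

# Link the records by customer number; the linked set tells a larger story
# (a purchase history) than any single record does on its own.
linked = defaultdict(list)
for t in transactions:
    linked[t["customer_no"]].append(t)

for customer, history in linked.items():
    total = sum(t["amount"] for t in history)
    print(f"{customer}: {len(history)} purchases, {total:.2f} in total")
```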

Log Tape Records

It is common in examining big data to encounter log tapes. Many organizations create log tapes only to wake up one day and discover that there is a wealth of information on those tapes that has never been used.

As a rule, log tapes contain information that is stored in a cryptic manner. Most log tapes are written for purposes other than analytic processing. Most log tapes are written for purposes of backup and recovery or for the purpose of creating a record of historical events. As a consequence, log tapes require a utility to read and decipher the log tape. The utility reads the log tape, infers the meaning of the data found on the log tape, and then reformats the data into an intelligible form. Once the data are read and reformatted, the analyst can then start to use the data found on the log tape.

Most log tape processing requires the elimination of irrelevant data. Much data appear on the log tape that is of no use to the analyst.

Fig. 9.1.19 shows a schematic of what a typical log tape might look like.

Fig. 9.1.19
Fig. 9.1.19 Log tape records are very irregular.

Fig. 9.1.19 shows that on the log tape, many different kinds of records are found. Typically, these records are written onto the log tape in a chronological manner. As a business event occurs, a record is written to reflect the occurrence of the event.

At first glance, these data might look like nonrepetitive data. Indeed, from a physical occurrence of data standpoint, that is a valid perspective. But there is another way to look at the data found on a log tape. That perspective is that the log tape is merely a chronological accumulation of a bunch of repetitive records. A “logical” perspective of a log tape is seen in Fig. 9.1.20.

Fig. 9.1.20
Fig. 9.1.20 Two different perspectives of the same thing.

In Fig. 9.1.20, it is seen that logically, a log tape is merely a sequential collection of different types of records. The perspective shown in Fig. 9.1.20 shows that the data logically appear to be repetitive records of data.
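
The two perspectives on a log, physically a chronological mix of record types and logically a set of repetitive record collections, can be sketched roughly as shown below. The record layouts, the type codes, and the choice of which records are irrelevant are all invented for the illustration; a real log format requires a utility that knows its actual layout.

```python
# A minimal sketch of working with log records. Physically the log is a
# chronological mix of record types; logically it is several collections of
# repetitive records. The type codes and layouts here are invented.
from collections import defaultdict

log = [
    "20150601083000|LOGIN|user=ann",
    "20150601083009|SALE|user=ann|amount=45.00",
    "20150601083011|HEARTBEAT",                      # irrelevant to the analysis
    "20150601083144|SALE|user=bob|amount=120.00",
    "20150601083150|LOGOUT|user=ann",
    "20150601083201|HEARTBEAT",                      # irrelevant to the analysis
]

IRRELEVANT = {"HEARTBEAT"}       # data on the log that are of no use to the analyst

# Step 1: read and decipher each record; step 2: drop the irrelevant ones;
# step 3: regroup the remainder into repetitive collections by record type.
by_type = defaultdict(list)
for raw in log:
    timestamp, record_type, *rest = raw.split("|")
    if record_type in IRRELEVANT:
        continue
    fields = dict(item.split("=", 1) for item in rest)
    fields["timestamp"] = timestamp
    by_type[record_type].append(fields)

for record_type, records in by_type.items():
    print(record_type, records)
```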

Analyzing Points of Data

One of the ways in which data are analyzed is through the graphing of a collection of points of reference. This technique is called the creation of a scatter diagram and is seen in Fig. 9.1.21.

Fig. 9.1.21
Fig. 9.1.21 A scatter diagram.

While gathering and plotting these points can lead to simple observations, there is a mathematical means of expressing the scatter diagram. A line can be drawn through the points. The line represents a mathematically calculated formula using what is called the least squares method. In the least squares approach, the line is the one for which the sum of the squared distances from the points to the line is smallest.

Outliers

On occasion, there is a point of reference that does not seem to fit with all the other points. If this is the case, the point of reference can be discarded. Such a point of reference is referred to as being an “outlier.”

In the case of an outlier, the theory is that some other factors were relevant to the calculation of the point of reference. Removing the outlier will not hurt the implications created by the calculation of the least squares regression analysis. Of course, if there are too many outliers, then the analyst must indulge in a deeper analysis of why the outliers occurred. But as long as there are only a few outliers and there are reasons why the outliers should be removed, then removal of outliers is a perfectly legitimate thing to do.

Fig. 9.1.22 depicts a scatter diagram with linear regression analysis and a scatter diagram with outliers.

Fig. 9.1.22
Fig. 9.1.22 Least squares regression analysis.
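
A rough sketch of the two steps, fitting a least squares line and then setting aside the points that sit far from it, is shown below. The points are made up, and the rule of two standard deviations of residual is an arbitrary choice made only for the illustration.

```python
# A minimal sketch of least squares fitting with simple outlier handling.
# The points and the "two standard deviations" outlier rule are hypothetical choices.
from statistics import mean, stdev

points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 25.0),      # the point at x = 4 is an outlier
          (5, 9.8), (6, 12.1), (7, 14.2), (8, 15.9)]

def least_squares(pts):
    # slope and intercept of the line minimizing the sum of squared vertical distances
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in pts)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

slope, intercept = least_squares(points)
print(f"first fit: y = {slope:.2f}x + {intercept:.2f}")

# Treat any point more than two standard deviations of residual away from the
# line as an outlier, set it aside, and refit the remaining points.
residuals = [y - (slope * x + intercept) for x, y in points]
cutoff = 2 * stdev(residuals)
kept = [p for p, r in zip(points, residuals) if abs(r) <= cutoff]

slope, intercept = least_squares(kept)
print(f"refit, {len(points) - len(kept)} outlier(s) removed: y = {slope:.2f}x + {intercept:.2f}")
```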

Data Over Time

It is normal to look at data over time. Looking at data over time is a good way to get insight that would otherwise not be possible.

One of the standard ways to look at data over time is through a Pareto chart. Fig. 9.1.23 depicts data found in a Pareto chart.

Fig. 9.1.23
Fig. 9.1.23 A Pareto chart.

While looking at data over time is a standard and good practice, there is an insidious aspect to looking at data over time. That aspect is this: if the data are being examined over a short period of time, then there is no problem. But if the data are being examined over a lengthy enough period of time, then the parameters over which the examination is made change, and that change affects the data.

This effect, the change of parameters over long periods of time, is illustrated by a simple example. Suppose there is an examination of the GNP of the United States over decades. One way to measure GNP is in dollars. So, you plot the national GNP every 10 years or so. The problem is that over time, the dollar means different things in terms of value. The worth of the dollar in 2015 is not at all the same as the worth of the dollar in 1900. If you do not adjust your parameters of measurement for inflation, your measurement of GNP means nothing.

Fig. 9.1.24 shows that the meaning of the basic measurement, the dollar, is not the same over decades.

Fig. 9.1.24
Fig. 9.1.24 Metadata parameters are changing all the time.
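
The adjustment can be sketched roughly as follows. The GNP figures and price index values below are placeholders rather than real statistics; the point is only that nominal dollars must be restated in the dollars of a single base year before the decades can be compared.

```python
# A minimal sketch of adjusting a measurement whose unit changes meaning over
# time: nominal dollars restated in constant (base-year) dollars.
# The GNP figures and price-index values below are placeholders, not real data.

observations = {           # year -> (nominal GNP in billions, price index)
    1970: (1_000, 20.0),
    1990: (5_000, 65.0),
    2010: (15_000, 110.0),
}
BASE_YEAR = 2010
base_index = observations[BASE_YEAR][1]

for year, (nominal, index) in sorted(observations.items()):
    real = nominal * base_index / index     # restate in base-year dollars
    print(f"{year}: nominal {nominal:,} -> {real:,.0f} in {BASE_YEAR} dollars")
```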

The fact is that the dollar and inflation are well-understood phenomena. What is not so well understood is that there are other factors over time that cannot be as easily tracked as inflation.

As an example, suppose one were tracking the revenue of IBM over decades. The revenue of IBM over decades is easy enough to find and track because IBM is a publicly traded company. But what is not so easy to track are all the acquisitions of other companies that IBM has made over the years. Looking at IBM in 1960 and then looking at IBM in 2000 is a little misleading because the company that IBM was in 1960 is very different from the company that IBM was in 2000.

There is constant change in the parameters of measurement of ANY variable over time. The analyst of repetitive data does well to keep in mind that—given enough time—the very patterns of measurement of data over time gradually change.
