The analog data pond is where data that begins life as a mechanically generated measurement resides. There are many sources of analog data: electronic eyes, manufacturing control machines, log or journal tapes, periodic metering measurements, and so forth.
Analog data is often referred to as data measured by the “inch” or by the “millisecond.” Inches and milliseconds refer to the frequency of measurement. Some products are laid out linearly and a snapshot is taken every n inches. Other products are measured every millisecond as they are produced. It does not take a fertile imagination to see that many, many irrelevant data points can result from a mechanical recording of measurements.
Analog Data Issues
There are two generic issues with the data in the analog data pond. The first is the sheer volume of data. Analog processing normally generates a massive amount of data: a machine just sits there and takes a snapshot every millisecond. It is also typical for 99.9% of that data to be unremarkable and of little business value. The same (or nearly the same) value is repeated over and over. In a sense, the interesting data “hides” behind the tremendous volume of information generated.
A second issue is that much of the important data associated with the generation of the analog data is lost. Analog analysts have the habit of collecting only the analog data and not the descriptor data that is associated with it. Unfortunately, the descriptor data is often as valuable as (or even more valuable than) the actual analog data.
The challenge the analyst has in dealing with analog data is in preparing the data for analysis by streamlining and outlining the important analog data. This streamlining and outlining is accomplished in the transformation / conditioning process that occurs inside the analog data pond.
Data Descriptor
The details surrounding the information in the analog data pond are very important. Some of the surrounding data includes:
Fig 6.1 depicts the analog data pond.
Fig 6.1 Storing data in the analog data pond
Capturing Raw Data/Transforming Raw Data
There are two basic steps that occur as analog data is moved into the analog data pond. The first step is capturing and moving the analog data into the data pond. The second step is the transformation / conditioning of the analog data in the analog data pond into a form and structure that is easily analyzed by the end user.
Note that the activity of transformation of the analog data occurs entirely within the confines of the analog data pond itself.
Fig 6.2 shows the capture and transformation activities.
Fig 6.2 Capturing and transforming activities in the analog data pond
Transforming/Conditioning Raw Analog Data
The most interesting aspect of the analog data pond is conditioning the raw analog data into a form that is useful for analysis. The process of conditioning can be called a transformation or a conversion.
In an earlier day and age, the process of conversion was called data reduction and/or data compression. The purpose of data reduction was to significantly reduce the amount of storage and the number of records that were required. Significantly reducing the amount of storage required for data also reduces the amount of work the system must do to process the data analytically.
The data reduction found in the analog data pond is entirely up to the analyst managing the data. The type and amount of data reduction will vary from one set to another.
Some of the techniques of data reduction that can be employed are:
And there are many other forms of data reduction.
One or more of these techniques can be used for any given set of analog data inside the analog data pond. Fig 6.3 shows that a fundamental transformation of data occurs from the time the data enters the analog data pond to the time that data is fit for analysis.
Fig 6.3 Making the data useful for analysis in the data pond
Some of the common forms of data reduction inside the analog data pond will be discussed in the following sections.
Data Excision
Perhaps the most common and useful form of data reduction is data excision. In data excision, data that is not needed is simply removed. So how does the analyst tell that data is not needed? There are lots of ways. One of these is rounding. Suppose a measurement is made saying that a wheel is 16.577638892 cm in diameter. In practice, the only significant digits after the decimal point are the first two. As a consequence, rounding to two decimal places makes sense. The number 16.577638892 is rounded to 16.58, thereby saving significant space.
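As a minimal sketch of the rounding step, the function below rounds a raw reading to a fixed number of decimal places. The function name round_measurement and the choice of Python's decimal module are assumptions made for illustration; the text does not prescribe a particular implementation.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_measurement(value: str, places: int = 2) -> Decimal:
    """Round a raw reading (given as text) to a fixed number of decimal places."""
    # Decimal(10) ** -2 is Decimal('0.01'), the smallest unit we keep.
    quantum = Decimal(10) ** -places
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)

# The wheel diameter from the text, in centimeters.
rounded = round_measurement("16.577638892")
print(rounded)  # 16.58
```

Working from the text form of the reading avoids binary floating point surprises when the stored precision matters.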
Another form of excision is thresholding. Suppose a manufacturing process is being tracked. The output is measured by an electronic eye. As long as the part is no longer than 1.257 cm and no shorter than 1.250 cm, the part is in compliance. The electronic eye reads the following parts as they come off the assembly line:
1.256937
1.251004
1.249887
1.254887
1.261095
1.255087
1.252090
1.254981
Using thresholding, the system would record only the data that fell outside the boundaries of tolerance. In this case, the system would record the values 1.249887 and 1.261095. The other values fall within the tolerance range. Fig 6.4 shows that excision of data is a useful tool for data reduction.
Fig 6.4 Excising data in the analog data pond
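The thresholding example above can be sketched in a few lines. The tolerance limits and the readings come from the text; the names LOWER_CM, UPPER_CM, and excise_in_tolerance are hypothetical and chosen only for this illustration.

```python
# Tolerance boundaries from the example, in centimeters.
LOWER_CM = 1.250
UPPER_CM = 1.257

def excise_in_tolerance(readings, lower=LOWER_CM, upper=UPPER_CM):
    """Drop readings inside [lower, upper]; keep only the exceptions."""
    return [r for r in readings if r < lower or r > upper]

readings = [
    1.256937, 1.251004, 1.249887, 1.254887,
    1.261095, 1.255087, 1.252090, 1.254981,
]

exceptions = excise_in_tolerance(readings)
print(exceptions)  # [1.249887, 1.261095]
```

Eight readings are reduced to two, and the two that remain are exactly the ones an analyst would want to look at.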
Clustering Data
Another useful technique is that of clustering data. There are different forms of clustering data. One of those forms is that of grouping common values or ranges of values. Suppose there were the following measurements:
1.56
1.78
1.67
1.57
1.65
1.70
1.62
1.73
1.77
A more concise way to represent the data is to cluster them. The clustering might look like:
1.5 – 2
1.6 – 3
1.7 – 4
In this clustering, there are 2 values from 1.50 to 1.59, 3 values from 1.60 to 1.69 and 4 values from 1.70 to 1.79.
Another way to cluster the data is:
1.5 (1), (4)
1.6 (3), (5), (7)
1.7 (2), (6),(8),(9)
In this method, the ordinal number is maintained. Note that in the first method of clustering the ordinal number of the value is lost.
In either case, there is the potential for a large reduction in the amount of space required to represent the numbers. There are also many more complicated forms of clustering, such as bitmap indexing. Fig 6.5 depicts clustering as a form of data reduction that can be useful in conditioning data in the analog data pond.
Fig 6.5 Clustering data in the analog data pond
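Both clustering methods described in this section can be sketched as follows. The measurements are the nine values listed in the text; the helper name bucket and the use of Counter and defaultdict are assumptions made for illustration, not a prescribed implementation.

```python
from collections import Counter, defaultdict
from decimal import Decimal, ROUND_DOWN

# The nine measurements from the text, in their original order.
measurements = ["1.56", "1.78", "1.67", "1.57", "1.65",
                "1.70", "1.62", "1.73", "1.77"]

def bucket(value: str) -> str:
    """Truncate a reading to one decimal place, e.g. '1.56' -> '1.5'."""
    return str(Decimal(value).quantize(Decimal("0.1"), rounding=ROUND_DOWN))

# Method 1: count how many readings fall in each range (ordinals are lost).
counts = Counter(bucket(m) for m in measurements)

# Method 2: keep the ordinal position of every reading in each range.
positions = defaultdict(list)
for ordinal, m in enumerate(measurements, start=1):
    positions[bucket(m)].append(ordinal)

for key in sorted(counts):
    print(key, counts[key], positions[key])
# 1.5 2 [1, 4]
# 1.6 3 [3, 5, 7]
# 1.7 4 [2, 6, 8, 9]
```

The first method is smaller but cannot say when each value occurred; the second costs more space and preserves the order of arrival.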
Data Relationships
Another form of data conditioning that can be useful in the analog data pond is that of establishing relationships between measurements of data. As an example, suppose we measured air pressure for tires and those measurements were captured as:
35.6 psi
36.1 psi
34.6 psi
36.2 psi
34.8 psi
35.7 psi
35.9 psi
While the tire pressure may be an interesting number, the measurement becomes more interesting when the tire manufacturer is attached to the pressure. Consider what the attachment of manufacturer looks like:
35.6 psi Goodrich
36.1 psi Bridgestone
34.6 psi Goodyear
36.2 psi Bridgestone
34.8 psi Alliance
35.7 psi Michelin
35.9 psi Panther
Once the tire manufacturer is attached to the pressure, more possibilities for analysis arise. But suppose even more data were available. If the date the tire was installed were attached to the data, the results might look like:
35.6 psi Goodrich July 20, 2016
36.1 psi Bridgestone Jan 5, 2013
34.6 psi Goodyear Oct 6, 2015
36.2 psi Bridgestone Nov 17, 2016
34.8 psi Alliance Dec 20, 2015
35.7 psi Michelin Mar 2, 2013
35.9 psi Panther Apr 28, 2014
And there are even more types of data that could be added. For example, suppose the mileage the tire had on it was added to the data. The result might look like:
35.6 psi Goodrich July 20, 2016 16,500 miles
36.1 psi Bridgestone Jan 5, 2013 85,980 miles
34.6 psi Goodyear Oct 6, 2015 24,000 miles
36.2 psi Bridgestone Nov 17, 2016 2,000 miles
34.8 psi Alliance Dec 20, 2015 14,970 miles
35.7 psi Michelin Mar 2, 2013 78,400 miles
35.9 psi Panther Apr 28, 2014 65,980 miles
Fig 6.6 shows that adding relationships to data in the analog data pond greatly enhances the usability and desirability of data.
Fig 6.6 Making the analog data pond more valuable through relationships
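One way to picture the attachment of descriptor data is as a record that carries the measurement together with its descriptors. The sketch below uses the tire data from the text; the class name TireReading, its field names, and the 60,000 mile cutoff in the sample query are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class TireReading:
    pressure_psi: float   # the analog measurement itself
    manufacturer: str     # descriptor: who made the tire
    installed: date       # descriptor: when the tire was installed
    mileage: int          # descriptor: miles on the tire

readings = [
    TireReading(35.6, "Goodrich", date(2016, 7, 20), 16_500),
    TireReading(36.1, "Bridgestone", date(2013, 1, 5), 85_980),
    TireReading(34.6, "Goodyear", date(2015, 10, 6), 24_000),
    TireReading(36.2, "Bridgestone", date(2016, 11, 17), 2_000),
    TireReading(34.8, "Alliance", date(2015, 12, 20), 14_970),
    TireReading(35.7, "Michelin", date(2013, 3, 2), 78_400),
    TireReading(35.9, "Panther", date(2014, 4, 28), 65_980),
]

# With descriptors attached, pressures can be grouped by manufacturer...
by_maker = defaultdict(list)
for r in readings:
    by_maker[r.manufacturer].append(r.pressure_psi)
print(by_maker["Bridgestone"])  # [36.1, 36.2]

# ...or filtered by mileage (the cutoff here is an arbitrary illustration).
high_mileage = [r.manufacturer for r in readings if r.mileage > 60_000]
print(high_mileage)  # ['Bridgestone', 'Michelin', 'Panther']
```

Neither question could be asked of the bare list of pressures.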
Probability of Future Usage
All the design decisions governing the transformation and conditioning of the data in the analog data pond are shaped by the probability of future usage. If a unit of data has a very low probability of future access, or even no probability of access, then it can safely be removed from the analog data pond. But if a unit of data has a high probability of access, then it is moved to a prominent place in the analog data pond. In fact, the higher the probability of access, the more prominently the data is placed in the analog data pond.
Of course not all probabilities can be accurately predicted. Because of this fact of life, it often makes sense to not throw away data that has a low probability of access, but to place that data in a less conspicuous location.
Fig 6.7 shows that probability of future access to data shapes all design decisions of the conditioning and transformation structure of the analog data pond.
Fig 6.7 Determining the probability of usage in the analog data pond
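A rough sketch of placement by probability of access might look like the following. Everything specific here is an assumption made for illustration: the function name, the data set names, the probability estimates, and the 0.05 cutoff. The sketch follows the more cautious approach of moving low-probability data to a less conspicuous location instead of deleting it.

```python
def arrange_by_access_probability(units, archive_below=0.05):
    """Rank data units by estimated access probability.

    units maps a data set name to an estimated probability (0.0 to 1.0).
    Units under the cutoff are set aside for archiving, not deleted.
    """
    ranked = sorted(units.items(), key=lambda item: item[1], reverse=True)
    prominent = [name for name, p in ranked if p >= archive_below]
    archived = [name for name, p in ranked if p < archive_below]
    return prominent, archived

# Hypothetical data sets and probability estimates.
units = {
    "hourly_summaries": 0.40,
    "raw_in_tolerance_readings": 0.01,
    "out_of_tolerance_readings": 0.90,
}

prominent, archived = arrange_by_access_probability(units)
print(prominent)  # ['out_of_tolerance_readings', 'hourly_summaries']
print(archived)   # ['raw_in_tolerance_readings']
```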
Outliers
Another aspect of data in the analog data pond that sparks interest among analysts is the occurrence of outliers. An outlier is the measurement of an event occurrence that does not fit the norm. Typically, the measurements have a pattern. There are often small variations from the pattern, but most of the measurements fit a predictable and definable pattern. An outlier is a measurement that does not fit the pattern of the other measurements, or whose variance is atypical of theirs. Fig 6.8 shows a collection of measurements of data and a few outliers.
Fig 6.8 Capturing outliers in the analog data pond
Outliers are always of interest and typically deserve special study. As an example of outliers, suppose a telephone company does an analysis of the length of calls made from New Jersey to Texas. Most of the phone calls last five to six minutes. Some phone calls are shorter and some are longer, but most are in that range. However, the telephone company notices that there are three calls lasting longer than 24 hours.
The phone company decides to investigate those really long calls and finds that:
When the organization examines the outliers, it can then decide what it wants to do with them. One option is to remove them from the data set. Another option is to redefine the data set to include them. A third option is to create another data set with a new algorithm defining the distribution of the measurements.
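The text does not name a method for finding outliers. As one possible sketch, the function below flags values that lie far from the median, measured in units of the median absolute deviation (MAD). The choice of MAD, the cutoff of 10, the function name, and the call durations are all assumptions made for illustration; the durations are invented to resemble the telephone example.

```python
from statistics import median

def find_outliers(values, cutoff=10.0):
    """Return values whose distance from the median exceeds cutoff * MAD."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread at all: treat anything different from the median as unusual.
        return [v for v in values if v != center]
    return [v for v in values if abs(v - center) / mad > cutoff]

# Invented call durations in minutes; three calls run longer than 24 hours.
call_minutes = [5.2, 5.8, 4.1, 6.0, 5.5, 7.3, 5.1,
                1500.0, 5.9, 1620.0, 3.8, 5.4, 1475.0]

outliers = find_outliers(call_minutes)
print(outliers)  # [1500.0, 1620.0, 1475.0]
```

The median is used in place of the mean because a handful of very long calls would drag the mean far away from the typical five to six minutes.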
Once the data has been conditioned, it is then made available to the analyst. The analyst then uses the transformed/conditioned data for the purpose of analysis, as seen in Fig 6.9.
Fig 6.9 Analyzing data in the analog data pond
Specialized Ad Hoc Analysis
There is another use for the analog data that has been conditioned. It is entirely possible and likely that specialized analysis needs to be done. It is also possible to use the conditioned data as a basis for a specialized data analysis.
Say the conditioned data is for a manufacturing environment. Analog analysts regularly use the conditioned data for their analysis. But suppose a new manufacturer arrives in the marketplace. The corporation wishes to do a separate analysis of a subset of the products that it produces. There is nothing wrong with separating out the specialized product from the mainstream products and performing a specialized analysis. The conditioned data makes an easy foundation on which to do new and unanticipated analysis. Fig 6.10 depicts the fact that ad hoc specialized analysis can be done from the conditioned data.
Fig 6.10 Performing ad hoc specialized analysis in the analog data pond
In Summary
The analog data pond, then, is the place where analog data is stored, conditioned, and analyzed. The conditioning process varies for every type of data found in the analog data pond. Fig 6.11 shows the analog data pond.