
Chapter 6
Analog Data Pond

The analog data pond is the place where data that begins life as a mechanically generated measurement resides. There are many sources of analog data – electronic eyes, manufacturing control machines, log or journal tapes, periodic metering measurements, and so forth.

Analog data is often referred to as data measured by the “inch” or by the “millisecond.” Inches and milliseconds refer to the frequency of measurement. Some products are laid out linearly and a snapshot is taken every n inches. Or a product is produced and is measured every millisecond. It does not take a fertile imagination to see that many, many irrelevant data points can result from a mechanical recording of measurements.

Analog Data Issues

There are two generic issues with the data in the analog data pond. The first is the sheer volume of data. It is normal for there to be a massive amount of data generated by analog processing. A machine just sits there and takes a snapshot every millisecond. It is also normal for 99.9% of the data to be unremarkable and of little business value. The same (or nearly the same) value is repeated over and over. In a sense, the interesting data “hides” behind the tremendous volume of information generated.

A second issue is that much of the important data associated with the generation of the analog data is lost. Analog analysts have the habit of collecting only the analog data and not the descriptor data that is associated with the analog data. Unfortunately, the descriptor data is often as valuable as (or even more valuable than) the actual analog data.

The challenge the analyst faces in dealing with analog data is preparing the data for analysis by streamlining it and bringing the important analog data to the fore. This streamlining is accomplished in the transformation / conditioning process that occurs inside the analog data pond.

Data Descriptor

The details surrounding the information in the analog data pond are very important. Some of the surrounding data includes the following; a minimal sketch of how such a descriptor record might be captured appears after the list:

  • The selection criteria for the data that finds its way into the analog data pond
  • The originating source of the analog data
  • The frequency with which analog data is moved into the analog data pond
  • The volume of analog data that is moved into the analog data pond
  • The date and time that the movement of analog data occurs
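
As a minimal sketch, one way such descriptor data might be recorded alongside each load of analog readings is shown below. The field names and example values are illustrative assumptions rather than a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AnalogLoadDescriptor:
    """Descriptor data recorded alongside each load of analog readings."""
    selection_criteria: str   # how the readings were chosen for the pond
    originating_source: str   # the machine or sensor that produced the data
    load_frequency: str       # how often readings are moved into the pond
    record_count: int         # volume of analog records in this load
    loaded_at: datetime       # date and time the movement occurred

# Example descriptor for a hypothetical load of electronic-eye readings
descriptor = AnalogLoadDescriptor(
    selection_criteria="readings outside the 1.250-1.257 cm tolerance",
    originating_source="assembly line 7, electronic eye 3",
    load_frequency="hourly",
    record_count=14_400,
    loaded_at=datetime(2016, 7, 20, 14, 0),
)
print(descriptor)
```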

Fig 6.1 depicts the analog data pond.


Fig 6.1 Storing data in the analog data pond

Capturing Raw Data/Transforming Raw Data

There are two basic steps that occur as analog data is moved into the analog data pond. The first step is capturing and moving the analog data into the data pond. The second step is the transformation / conditioning of the analog data in the analog data pond into a form and structure that is easily analyzed by the end user.

Note that the activity of transformation of the analog data occurs entirely within the confines of the analog data pond itself.

Fig 6.2 shows the capture and transformation activities.


Fig 6.2 Capturing and transforming activities in the analog data pond

Transforming/Conditioning Raw Analog Data

The most interesting aspect of the analog data pond is conditioning the raw analog data into a form that is useful for analysis. The process of conditioning can be called a transformation or a conversion.

In an earlier day and age, the process of conversion was called data reduction and/or data compression. The purpose of data reduction was to significantly reduce the amount of storage and the number of records that were required. Reducing the amount of storage required for data in turn reduces the amount of work the system must do for analytical processing of the data.

The data reduction found in the analog data pond is entirely up to the analyst managing the data. The type and amount of data reduction will vary from one set to another.

Some of the techniques of data reduction that can be employed are:

  • Deduplication. Deduplication entails the removal of masses of redundant data.
  • Excision. Data excision calls for the removal of unneeded data and data that is unlikely to ever be needed for analysis.
  • Compression. Data compression allows data to be packed very tightly. The problem with compression arises when compressed data must be altered. It is difficult to alter highly compressed data without incurring a high overhead.
  • Smoothing. Smoothing of data is the practice of removing or editing outliers.
  • Interpolation. Interpolation of data is the practice of inferring values of data based on the values near to the value being created. The interpolated value is the “likely” value, had a value actually been measured.
  • Sampling. Sampling is the practice of selecting a small subset of data that is representative of a larger set of data. Sampling is good for analytical processing but cannot be used for detailed update processing.
  • Rounding. Rounding is the process of removing or rounding away insignificant digits in a data set.
  • Encoding. Encoding is the practice of representing long strings of data with shorter strings of data.
  • Tokenization. Tokenization is a form of encoding. Tokenization can be used effectively when there is a high degree of repetition in the data being stored.
  • Thresholding. Thresholding is a form of excision. In thresholding, only values above (or below) the threshold are stored. Everything within the boundaries of the threshold is ignored.
  • Clustering. Clustering of data is the practice of grouping similar or identical values of data. Clustering is a form of data deduplication.

And there are many other forms of data reduction.
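
To make a couple of these techniques concrete, the following is a minimal sketch of deduplication (collapsing runs of repeated values) and sampling applied to a hypothetical stream of readings. The values and names are illustrative only.

```python
import random

# Hypothetical stream of electronic-eye readings; most values repeat.
readings = [1.2550, 1.2550, 1.2550, 1.2551, 1.2550, 1.2612, 1.2550, 1.2499]

# Deduplication: collapse consecutive repeated values into (value, run_length) pairs.
deduplicated = []
for value in readings:
    if deduplicated and deduplicated[-1][0] == value:
        deduplicated[-1] = (value, deduplicated[-1][1] + 1)
    else:
        deduplicated.append((value, 1))

# Sampling: keep a small, representative subset of the full set of readings.
random.seed(42)
sample = random.sample(readings, k=3)

print(deduplicated)
print(sample)
```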

One or more of these techniques can be used for any given set of analog data inside the analog data pond. Fig 6.3 shows that a fundamental transformation of data occurs from the time the data enters the analog data pond to the time that data is fit for analysis.


Fig 6.3 Making the data useful for analysis in the data pond

Some of the common forms of data reduction inside the analog data pond will be discussed in the following sections.

Data Excision

Perhaps the most common and useful form of data reduction is data excision. In data excision, data that is not needed is simply removed. So how does the analyst tell which data is not needed? There are lots of ways. One of these is rounding. Suppose a measurement is made saying that a wheel is 16.577638892 cm in diameter. In practice, the only digits that are significant are the first two following the decimal point. As a consequence, rounding to the first two decimal digits makes sense. The number 16.577638892 is rounded to 16.58, thereby saving significant space.
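
As a minimal sketch of this kind of rounding, with an illustrative variable name, the raw reading might be reduced before storage like this:

```python
# Hypothetical raw diameter measurement from an analog device, in cm.
raw_diameter = 16.577638892

# Only the first two digits after the decimal point are significant,
# so the value is rounded before being stored in the analog data pond.
stored_diameter = round(raw_diameter, 2)

print(stored_diameter)  # 16.58
```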

Another form of excision is that of thresholding. Suppose a manufacturing process is being tracked. The output is measured by an electronic eye. As long as the part is no longer than 1.257 cm and no shorter than 1.250 cm, the part is in compliance. The electronic eye reads the following parts as they come off the assembly line:

1.256937

1.251004

1.249887

1.254887

1.261095

1.255087

1.252090

1.254981

Using the boundaries of thresholding, the system would record only the data that falls outside the boundaries of tolerance. In this case, the system would record the values 1.249887 and 1.261095. The other values fall within the tolerance thresholds, so the system ignores them. Fig 6.4 shows that excision of data is a useful tool for data reduction.


Fig 6.4 Excising data in the analog data pond
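
A minimal sketch of the thresholding just described, using the tolerance boundaries from the example (the constant names are illustrative), might look like the following:

```python
# Tolerance boundaries for the part length, in cm, taken from the example above.
LOWER_BOUND = 1.250
UPPER_BOUND = 1.257

readings = [1.256937, 1.251004, 1.249887, 1.254887,
            1.261095, 1.255087, 1.252090, 1.254981]

# Thresholding: record only the readings that fall outside the tolerance;
# everything within the boundaries is ignored.
out_of_tolerance = [r for r in readings if r < LOWER_BOUND or r > UPPER_BOUND]

print(out_of_tolerance)  # [1.249887, 1.261095]
```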

Clustering Data

Another useful technique is that of clustering data. There are different forms of clustering data. One of those forms is that of grouping common values or ranges of values. Suppose there were the following measurements:

1.56

1.78

1.67

1.57

1.65

1.70

1.62

1.73

1.77

A more concise way to represent the data is to cluster it. The clustering might look like:

1.5 – 2

1.6 – 3

1.7 – 4

In this clustering, there are 2 values from 1.50 to 1.59, 3 values from 1.60 to 1.69 and 4 values from 1.70 to 1.79.

Another way to cluster the data is:

1.5 (1), (4)

1.6 (3), (5), (7)

1.7 (2), (6), (8), (9)

In this method, the ordinal number is maintained. Note that in the first method of clustering the ordinal number of the value is lost.

In either case, there is the potential for a significant reduction in the amount of space required to represent the numbers. And in fact, there are more sophisticated forms of clustering, such as bitmap indexing. Fig 6.5 depicts clustering as a form of data reduction that can be useful in conditioning data in the analog data pond.


Fig 6.5 Clustering data in the analog data pond
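
As a minimal sketch of the two forms of clustering shown above, using the same nine measurements (the bucketing helper is an illustrative assumption):

```python
from collections import defaultdict

measurements = [1.56, 1.78, 1.67, 1.57, 1.65, 1.70, 1.62, 1.73, 1.77]

def bucket(value: float) -> float:
    """Truncate a measurement to its 0.1-wide cluster, e.g. 1.56 -> 1.5."""
    return int(round(value * 10, 6)) / 10

# First form of clustering: count how many values fall in each range.
counts = defaultdict(int)
for value in measurements:
    counts[bucket(value)] += 1

# Second form of clustering: keep the ordinal position (1-based) of each value.
positions = defaultdict(list)
for ordinal, value in enumerate(measurements, start=1):
    positions[bucket(value)].append(ordinal)

print(dict(counts))     # {1.5: 2, 1.7: 4, 1.6: 3}
print(dict(positions))  # {1.5: [1, 4], 1.7: [2, 6, 8, 9], 1.6: [3, 5, 7]}
```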

Data Relationships

Another form of data conditioning that can be useful in the analog data pond is that of establishing relationships between measurements of data. As an example, suppose we measured air pressure for tires and those measurements were captured as:

35.6 psi

36.1 psi

34.6 psi

36.2 psi

34.8 psi

35.7 psi

35.9 psi

While the tire pressure may be an interesting number, the measurement becomes more interesting when the tire manufacturer is attached to the pressure. Consider what the data looks like with the manufacturer attached:

35.6 psi Goodrich

36.1 psi Bridgestone

34.6 psi Goodyear

36.2 psi Bridgestone

34.8 psi Alliance

35.7 psi Michelin

35.9 psi Panther

Once the tire manufacturer is attached to the pressure, more possibilities for analysis arise. But suppose even more data were available. If the date the tire was installed were attached to the data, the results might look like:

35.6 psi Goodrich July 20, 2016

36.1 psi Bridgestone Jan 5, 2013

34.6 psi Goodyear Oct 6, 2015

36.2 psi Bridgestone Nov 17, 2016

34.8 psi Alliance Dec 20, 2015

35.7 psi Michelin Mar 2, 2013

35.9 psi Panther Apr 28, 2014

And there are even more types of data that could be added. For example, suppose the mileage the tire had on it was added to the data. The result might look like:

35.6 psi Goodrich July 20, 2016 16,500 miles

36.1 psi Bridgestone Jan 5, 2013 85,980 miles

34.6 psi Goodyear Oct 6, 2015 24,000 miles

36.2 psi Bridgestone Nov 17, 2016 2,000 miles

34.8 psi Alliance Dec 20, 2015 14,970 miles

35.7 psi Michelin Mar 2, 2013 78,400 miles

35.9 psi Panther Apr 28, 2014 65,980 miles

Fig 6.6 shows that adding relationships to data in the analog data pond greatly enhances the usability and desirability of data.


Fig 6.6 Making the analog data pond more valuable through relationships
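
As a minimal sketch of carrying these relationships alongside each reading, the record structure below is illustrative rather than a prescribed format. Once the descriptor attributes are attached, analysis such as average pressure per manufacturer becomes possible.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class TirePressureReading:
    psi: float
    manufacturer: str
    installed_on: date
    mileage: int

# The readings from the example above, with their descriptor data attached.
readings = [
    TirePressureReading(35.6, "Goodrich",    date(2016, 7, 20),  16_500),
    TirePressureReading(36.1, "Bridgestone", date(2013, 1, 5),   85_980),
    TirePressureReading(34.6, "Goodyear",    date(2015, 10, 6),  24_000),
    TirePressureReading(36.2, "Bridgestone", date(2016, 11, 17),  2_000),
    TirePressureReading(34.8, "Alliance",    date(2015, 12, 20), 14_970),
    TirePressureReading(35.7, "Michelin",    date(2013, 3, 2),   78_400),
    TirePressureReading(35.9, "Panther",     date(2014, 4, 28),  65_980),
]

# Group pressures by manufacturer and compute the average for each.
by_manufacturer = {}
for r in readings:
    by_manufacturer.setdefault(r.manufacturer, []).append(r.psi)

averages = {m: sum(v) / len(v) for m, v in by_manufacturer.items()}
print(averages)
```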

Probability of Future Usage

All the design decisions governing the transformation and conditioning of the data in the analog data pond are shaped by the probability of future usage. If a unit of data has a very low probability of future access, or even no probability of access, then it can safely be removed from the analog data pond. But if a unit of data has a high probability of access, then it is moved to a prominent place in the analog data pond. In fact, the higher the probability of access, the more prominently the data is placed in the analog data pond.

Of course, not all probabilities can be accurately predicted. Because of this fact of life, it often makes sense not to throw away data that has a low probability of access but instead to place that data in a less conspicuous location.
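
A minimal sketch of this placement decision, with illustrative thresholds and tier names, might look like the following:

```python
# A hypothetical routing rule based on the estimated probability of future access.
# The thresholds and the tier names are illustrative assumptions only.
def choose_placement(access_probability: float) -> str:
    if access_probability >= 0.5:
        return "prominent storage"          # high probability: keep close at hand
    if access_probability > 0.0:
        return "less conspicuous storage"   # low probability: keep, but tucked away
    return "remove from the pond"           # no realistic probability of access

for probability in (0.9, 0.2, 0.0):
    print(probability, "->", choose_placement(probability))
```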

Fig 6.7 shows that probability of future access to data shapes all design decisions of the conditioning and transformation structure of the analog data pond.


Fig 6.7 Determining the probability of usage in the analog data pond

Outliers

Another aspect of data in the analog data pond that sparks interest among analysts is the occurrence of outliers. An outlier is the measurement of an event occurrence that does not fit the norm. Typically, the measurements have a pattern. There are often small variations from the pattern, but most of the measurements fit a predictable and definable pattern. An outlier is a measurement that does not fit the pattern of the other measurements and whose variance is atypical of the rest. Fig 6.8 shows a collection of measurements of data and a few outliers.


Fig 6.8 Capturing outliers in the analog data pond
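
One minimal way to flag such outliers in a set of measurements, offered as a sketch rather than a prescribed method (the median-based rule and its multiplier are illustrative assumptions), is:

```python
import statistics

# Hypothetical call durations in minutes; most fit the five-to-six-minute
# pattern, but a few very long calls do not. The values are illustrative only.
durations = [5.2, 5.8, 6.1, 4.9, 5.5, 6.0, 5.3, 1440.0, 5.7, 1500.0]

# Flag values that sit far from the median relative to the median absolute
# deviation (MAD). The multiplier of 10 is an illustrative choice.
median = statistics.median(durations)
mad = statistics.median(abs(d - median) for d in durations)

outliers = [d for d in durations if abs(d - median) > 10 * mad]

print(outliers)  # [1440.0, 1500.0]
```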

Outliers are always of interest and typically deserve special study. As an example of outliers, suppose a telephone company does an analysis of the length of calls made from New Jersey to Texas. Most of the phone calls last five to six minutes. Some phone calls are shorter and some phone calls are longer, but most are in that range. However, the telephone company notices that there are three calls greater than 24 hours.

The phone company decides to investigate those really long calls and finds that:

  • One call was a computer working with another computer transferring data.
  • One call was a malfunction of the equipment. The call actually only lasted a minute but the monitoring equipment had a problem and the call appeared to be a really long call.
  • The last call was a customer who was downloading movies and was mistakenly using the wrong line to make the download.

When the organization examines the outliers, it can then decide what it wants to do with them. One option is to remove them from the data set. Another option is to redefine the data set to include them. A third option is to create another data set with a new algorithm defining the distribution of the measurements.

Once the data has been conditioned, it is then made available to the analyst. The analyst then uses the transformed/conditioned data for the purpose of analysis, as seen in Fig 6.9.


Fig 6.9 Analyzing data in the analog data pond

Specialized Ad Hoc Analysis

There is another use for the analog data that has been conditioned. It is entirely possible, and indeed likely, that specialized analysis will need to be done, and the conditioned data can serve as the basis for that specialized analysis.

Say the conditioned data is for a manufacturing environment. Analog analysts regularly use the conditioned data for their analysis. But suppose a new manufacturer arrives in the marketplace and the corporation wishes to do a separate analysis of a subset of the products that it produces. There is nothing wrong with separating out the specialized product from the mainstream products and performing a specialized analysis. The conditioned data serves as an easy foundation on which to do new and unanticipated analysis. Fig 6.10 depicts the fact that ad hoc specialized analysis can be done from the conditioned data.


Fig 6.10 Performing ad hoc specialized analysis in the analog data pond
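
As a minimal sketch of such an ad hoc analysis over conditioned data, with illustrative record fields and product names, the subset can simply be separated out and analyzed on its own:

```python
# Hypothetical conditioned records from a manufacturing environment.
conditioned = [
    {"product": "widget-a", "length_cm": 1.2612, "line": 7},
    {"product": "widget-b", "length_cm": 1.2499, "line": 3},
    {"product": "widget-a", "length_cm": 1.2551, "line": 7},
    {"product": "widget-c", "length_cm": 1.2540, "line": 5},
]

# Ad hoc specialized analysis: separate one product from the mainstream
# products and analyze it on its own.
subset = [r for r in conditioned if r["product"] == "widget-a"]
average_length = sum(r["length_cm"] for r in subset) / len(subset)

print(len(subset), average_length)
```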

In Summary

The analog data pond, then, is the place where analog data is stored, conditioned, and analyzed. The conditioning process varies for every type of data found in the analog data pond. Fig 6.11 shows the analog data pond.


Fig 6.11 Analyzing data in the analog data pond
