Chapter 1.6

The Life Cycle of Data: Understanding Data Over Time

Abstract

Corporate data include all of the data found in the corporation. The most basic division of corporate data is into structured data and unstructured data. As a rule, there is much more unstructured data than structured data. Unstructured data have two basic divisions—repetitive data and nonrepetitive data. Big data is made up of unstructured data. Nonrepetitive big data has a fundamentally different form from repetitive big data. In fact, the differences between nonrepetitive big data and repetitive big data are so large that they can be called the boundaries of the “great divide.” The divide is so large that many professionals are not even aware that it exists. As a rule, nonrepetitive big data has MUCH greater business value than repetitive big data.

Keywords

Structured data; Unstructured data; Corporate data; Repetitive data; Nonrepetitive data; Business value; The great divide of data; Big data

Data in the corporation have a predictable life cycle. The life cycle applies to most data; there are, however, a few exceptions. Some data do not follow the life cycle that will be described (but most do). The life cycle of data looks like the diagram shown in Figs. 1.6.1 and 1.6.2.

Fig. 1.6.1
Fig. 1.6.1 The life cycle of data.
Fig. 1.6.2
Fig. 1.6.2 The life cycle of data.

The life cycle of data shows that raw data enter the corporate information systems. The entry of raw data can be made in many ways. The customer may do a transaction, and the data are captured as a by-product of the transaction. An analog computer may make a reading, and the data are entered as part of the analog processing. A customer may initiate an activity (such as make a phone call), and a computer captures that information. There are many ways that data can enter the information systems of the corporation.

After the raw detailed data have entered the system, the next step is for the data to pass through a capture/edit process. In this process, the raw detailed data pass through a basic edit, in which the data can be adjusted (or even rejected). In general, the data that enter the information systems of the corporation are at the most detailed level.

After the raw detailed data have passed through the capture/edit process, they then go through an organization process. The organization process can be as simple as indexing the data, or the raw detailed data may be subjected to an elaborate filtering/calculation/merging process. At this point, the raw detailed data are like putty that can be shaped in many ways by the system designer.

Once the raw detailed data have passed through the organization process, the data are then fit to be stored. The data can be stored in a standard DBMS or in big data (or in other forms of storage). After the data are stored, and before they are fit for analysis, they typically pass through an integration process. The purpose of the integration process is to restructure the data so that they are fit to be combined with other types of data.

It is at this point that the data enter the cycle of usefulness, which will be discussed at length later. After the data have fulfilled their usefulness, the data can be either archived or discarded.
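The stages just described (ingest, capture/edit, organize, store, integrate, then archive or discard) can be sketched as a simple processing pipeline. This is only an illustrative sketch: the record shape, the rejection rule, and the normalization step are assumptions made for the example, not part of any particular product.

```python
# A minimal sketch of the data life cycle described above. Stage names
# follow the text; the edit rule and record shape are assumptions.

def capture_edit(record):
    """Basic edit: adjust or reject raw detailed data."""
    if record.get("value") is None:
        return None  # rejected by the edit process
    record["value"] = round(record["value"], 2)  # a simple adjustment
    return record

def organize(record):
    """Organization can be as simple as indexing the data."""
    record["index_key"] = record["customer_id"]
    return record

def integrate(record):
    """Restructure the data so they can be combined with other data."""
    record["currency"] = "USD"  # an assumed normalization rule
    return record

def life_cycle(raw_records):
    """Run raw detailed records through capture/edit, organize, integrate."""
    stored = []
    for r in raw_records:
        r = capture_edit(r)
        if r is None:
            continue  # rejected records never reach storage
        stored.append(integrate(organize(r)))
    return stored

transactions = [
    {"customer_id": "C1", "value": 19.994},
    {"customer_id": "C2", "value": None},  # will be rejected by the edit
]
print(life_cycle(transactions))
```

The pipeline mirrors the order of the stages in the text; in a real system each stage would, of course, be far more elaborate.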

The life cycle of data that has been described is for raw detailed data. There is a slightly different life cycle of data for summarized or aggregated data.

The life cycle of summarized or aggregated data is seen in Fig. 1.6.3.

Fig. 1.6.3
Fig. 1.6.3 From raw data to summarized data.

The life cycle for most summarized or aggregated data begins the same way that the life cycle of raw detailed data begins. Raw data are ingested into the corporation. But once those raw data become a part of the infrastructure, the raw data are accessed, categorized, and calculated. The calculation is then saved as part of the information infrastructure, as seen in Fig. 1.6.3.
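The access/categorize/calculate path for summarized data can be illustrated with a small aggregation sketch. The grouping key, the value field, and the sample records below are assumptions made for the example.

```python
from collections import defaultdict

# Sketch: raw detailed records are accessed, categorized by a key,
# and a calculation (here, a sum) is saved as summary data.
def summarize(raw_records, category_key, value_key):
    totals = defaultdict(float)
    for record in raw_records:                 # access the raw data
        category = record[category_key]        # categorize each record
        totals[category] += record[value_key]  # calculate the summary
    return dict(totals)  # the result is saved in the infrastructure

sales = [
    {"region": "east", "amount": 100.0},
    {"region": "west", "amount": 250.0},
    {"region": "east", "amount": 50.0},
]
print(summarize(sales, "region", "amount"))  # → {'east': 150.0, 'west': 250.0}
```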

Once raw and summarized data become part of the information infrastructure, the data are then subject to the “curve of usefulness.” The curve of usefulness states that the longer data remain in the infrastructure, the less likely it is that the data will be used in analysis.

Fig. 1.6.4 illustrates that when looked at from the standpoint of age, the fresher the data are, the greater the chances are that the data will be accessed. This phenomenon applies to most types of data found in the corporate information infrastructure.

Fig. 1.6.4
Fig. 1.6.4 Distinctive pattern of usage of data.

As data age in the corporate information infrastructure, the probability of access drops. The older data—for all practical purposes—become “dormant.”

The phenomenon of data becoming dormant is not quite as true for structured online data.

There are certain types of business where the phenomenon of data aging is not as true, as well. One type of industry is the life insurance industry, where actuaries are regularly looking at data that are over 100 years old. And in certain scientific and manufacturing research organizations, there may be great interest in results that were generated over 50 years ago. But most organizations do not have an actuary or a scientific research facility. For those more ordinary organizations, the focus is almost always on the most current data.

The declining curve of usefulness can be expressed by a curve, as seen in Fig. 1.6.5.

Fig. 1.6.5
Fig. 1.6.5 The declining curve of usefulness of data.

The declining curve of usefulness states that over time, the value of data decreases, at least insofar as the probability of access is concerned. Note that the value never actually reaches zero. But after a while, the value comes so close to zero that, for all practical purposes, it might as well be zero.

The curve is a rather sharp one, falling off in the manner of a classical Poisson distribution.
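As a rough sketch, the declining probability of access can be modeled as a sharp decay toward a small floor. The exponential form, the 90-day half-life, and the 0.1% floor below are illustrative assumptions, not measurements from any corporation.

```python
import math

# Sketch: probability that data of a given age are still accessed,
# modeled as exponential decay toward a small nonzero floor.
# Half-life (90 days) and floor (0.001) are illustrative assumptions.
def access_probability(age_days, half_life_days=90.0, floor=0.001):
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    return floor + (1.0 - floor) * decay

# Fresh data are almost certain to be accessed; old data go dormant.
for age in (0, 90, 365, 3650):
    print(f"age {age:>5} days -> access probability {access_probability(age):.4f}")
```

The floor captures the text's observation that the value never actually reaches zero; it merely becomes negligibly small.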

An interesting aspect of the curve is that the curve is actually different for summary and detailed data. Fig. 1.6.6 shows the difference in the curve for detailed data and summary data.

Fig. 1.6.6
Fig. 1.6.6 The declining curve of usefulness for detailed data and summary data.

Fig. 1.6.6 shows that the declining curve of usefulness is much steeper for detailed data than it is for summary data. Furthermore, over time, the usefulness of summary data goes flat but does not approach zero, whereas the curve for detailed data does indeed approach zero. And in some cases, the curve for summarized data actually starts to grow over time, although at a very modest rate.
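The two curves of Fig. 1.6.6 can be sketched with the same kind of decay model, using different rates and floors: detailed data decay quickly toward zero, while summary data decay slowly and flatten well above zero. All of the half-lives and floors here are illustrative assumptions.

```python
import math

def usefulness(age_days, half_life_days, floor):
    """Exponential decay toward a floor; an assumed model, not a measurement."""
    decay = math.exp(-math.log(2) * age_days / half_life_days)
    return floor + (1.0 - floor) * decay

# Detailed data: steep curve whose floor is effectively zero.
# Summary data: gentler curve that flattens well above zero.
for age in (0, 30, 180, 720):
    detailed = usefulness(age, half_life_days=30, floor=0.0)
    summary = usefulness(age, half_life_days=180, floor=0.2)
    print(f"age {age:>4}d  detailed {detailed:.3f}  summary {summary:.3f}")
```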

There is another way to look at the dormancy of data over time. Consider the curve that expresses the accumulation of data over time. This curve is shown in Fig. 1.6.7.

Fig. 1.6.7
Fig. 1.6.7 The increasing volume of data over time.

Fig. 1.6.7 shows that over time, the volume of data that accumulates in the corporation accelerates. This phenomenon is pretty much true for every organization.

Another way to look at this accumulation curve is shown in Fig. 1.6.8.

Fig. 1.6.8
Fig. 1.6.8 Different and dynamic bands of usefulness of data over time.

Fig. 1.6.8 shows that as data accumulate over time in the corporation, there are different and dynamic bands of usage of data. There is one band of data that shows that some data are heavily used over time. There is another band of data for lightly used data. And there is yet another band of data for data that are not used at all.

As time passes, these bands of data expand.

Usually, the bands of data relate to the age of the data. The younger the data are, the more relevant the data are to the current business of the corporation. And the younger the data are, the more the data are accessed and analyzed.
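The bands of usage can be sketched as a simple age-based classification. The thresholds below (90 days for heavily used data, 2 years for lightly used data) are illustrative assumptions; real bands vary by business.

```python
# Sketch: classify data sets into the usage bands of Fig. 1.6.8 by age.
# The age thresholds are illustrative assumptions, not industry standards.
def usage_band(age_days):
    if age_days <= 90:
        return "heavily used"
    if age_days <= 730:
        return "lightly used"
    return "not used (dormant)"

# Hypothetical data sets with their ages in days.
datasets = {"daily_sales": 12, "q3_report": 400, "legacy_logs": 2900}
for name, age in datasets.items():
    print(name, "->", usage_band(age))
```

A classification like this is often the basis for tiered storage: the dormant band is a natural candidate for archiving or discarding.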

When it comes to looking at data over time, there is another interesting phenomenon: over long periods of time, the integrity of data “degrades.” Perhaps the term degrades is not quite appropriate, because it carries a pejorative sense, and as used here, the term has no such pejorative connotation. Instead, it simply means that there is a natural and normal decay of the meaning of data over time.

Fig. 1.6.9 shows the degradation of integrity of data over time.

Fig. 1.6.9
Fig. 1.6.9 The degradation of data over time.

In order to understand the degradation of integrity over time, let's look at some examples. Consider the price of meat—say hamburger—over time. In 1850, hamburger was $0.05 a pound. In 1950, the price of hamburger was $0.95 a pound. And in 2015, the price of hamburger is $2.75 a pound. Does this comparison of the price of hamburger over time make sense? The answer is that it sort of makes sense. The problem is not in the measurement of the price of hamburger. The problem is in the currency by which hamburger is measured. Even the meaning of a dollar in 1850 is different from the meaning of a dollar in 2015.
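One way to make the hamburger comparison meaningful is to restate each price in a single year's dollars using a price index. The index values in this sketch are placeholders, NOT actual CPI figures; only the mechanics of the adjustment matter here.

```python
# Sketch: restate nominal prices in a common year's dollars.
# The index values below are illustrative placeholders, NOT real CPI data.
price_index = {1850: 8.0, 1950: 24.0, 2015: 237.0}  # assumed index levels

def in_2015_dollars(nominal_price, year):
    """Scale a nominal price by the ratio of the 2015 index to its year's index."""
    return nominal_price * price_index[2015] / price_index[year]

hamburger = {1850: 0.05, 1950: 0.95, 2015: 2.75}  # dollars per pound
for year, price in hamburger.items():
    print(year, f"${in_2015_dollars(price, year):.2f} per pound (2015 dollars)")
```

Restated this way, the three prices become comparable; without the adjustment, the comparison mixes three different definitions of a dollar.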

Now, let's consider another example. The stock price of one share of International Business Machines (IBM) was $35 in 1950, and the price of that same share of stock in 2015 is $200. Is the comparison of a stock price over time a valid comparison? The answer is sort of. IBM in 2015 is not the same company as it was in 1950: in terms of products, in terms of customers and revenues, and in terms of the value of the dollar. In a hundred ways, IBM in 1950 simply does not compare with IBM in 2015. Over time, the very definition of the data has changed. So while a comparison of IBM's stock price in 1950 versus its stock price in 2015 is an interesting number, it is an entirely relative one, because the very meaning of the number has drastically changed.

Given enough time, the very definition of values and data changes. That is why degradation of the definition of data is simply a fact of life.
