
Chapter 3
Inside the Data Lake

To better understand how the data lake can be prepared for future access and analysis, it is necessary to take a look at what lies inside it.

While it is true that any kind of data can be found inside the data lake, it is nevertheless possible to group the data into three broad categories:

  • Analog data
  • Application data
  • Textual data

Fig 3.1 shows that most data inside the data lake fits into one of three categories.


Fig 3.1 Categorizing data lake data into three types

Analog Data

The first type of data found in the data lake is analog data. Analog data is typically generated by a machine or some other automated device, even one not connected to the internet. These devices range from diagnostic programs logging the performance of nuclear reactors to monitors tracking the CPU usage of a mobile phone.

In general, analog data is very voluminous and very repetitive. Most analog data consists of a long list of generated numbers. Most records created by an analog device are measurements, and most of the time those measurements vary only slightly from all the others. Typically, the small number of outliers are of the most interest.
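The point above can be sketched in code: a minimal, hedged example of flagging the rare out-of-line reading in a run of repetitive analog measurements. The sensor values and the tolerance band are purely illustrative assumptions, not a standard technique from this book.

```python
# Illustrative sketch: flag analog measurements that sit far from the
# median of the run. The tolerance value is an assumption for the demo.
def find_outliers(readings, tolerance=1.0):
    """Return (index, value) pairs that deviate from the median by more
    than `tolerance`."""
    s = sorted(readings)
    n = len(s)
    median = (s[n // 2] + s[(n - 1) // 2]) / 2
    return [(i, v) for i, v in enumerate(readings)
            if abs(v - median) > tolerance]

# Mostly uniform temperature readings with one out-of-line value,
# perhaps from a machine that has lost its calibration.
readings = [71.2, 71.3, 71.1, 71.2, 98.6, 71.3, 71.2, 71.1]
print(find_outliers(readings))  # → [(4, 98.6)]
```

As the chapter notes, the flagged value itself is only a signal; the analyst still has to look elsewhere for the cause.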

Analog data is usually a simple measurement of some physical value (heat, weight, chemical composition, size, etc.). When a measurement seems out of line, it is an indication to look elsewhere for the cause. For example, the odd measurement may have been caused by a machine losing its calibration, or by a part needing adjustment, and so forth. The analog data is merely a signal to the analyst to look elsewhere for the cause of the variation in measurement.

This is why the metaprocess information associated with analog data is oftentimes more important than the analog data itself. Metaprocess details typically include such information as the time of measurement, the location of measurement, the speed of measurement, and so forth.

Typically, analog data is triggered by or associated with some event, such as a manufacturing event. A part is created. A shipment has been sent. A box has been moved. These are all common events causing the creation of an analog record. The analog measurement is almost always made mechanically, without any human input or extra processing. Fig 3.2 shows an event triggering the creation of an analog measurement.


Fig 3.2 Triggering analog measurements through events

The data points accompanying the raw data captured in the analog measurement process are called “metaprocess” data. While there are different kinds of metaprocess models suited to different objectives, this raw output is the most relevant to data lakes. The metaprocess information provides a different perspective on the analog data than the raw data alone. Fig 3.3 depicts some typical metaprocess details.


Fig 3.3 Providing a different perspective of the analog data than just looking at the raw data itself
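One way to picture metaprocess data traveling alongside a raw reading is a record that pairs the measured value with its time, location, and speed of measurement. This is only a hedged sketch; the field names below are illustrative assumptions, not a schema from this book.

```python
# Illustrative sketch: a raw analog value carried together with its
# metaprocess details (time, location, and speed of measurement).
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AnalogMeasurement:
    value: float              # the raw measurement itself
    measured_at: datetime     # metaprocess: time of measurement
    location: str             # metaprocess: location of measurement
    sampling_rate_hz: float   # metaprocess: speed of measurement

reading = AnalogMeasurement(
    value=71.2,
    measured_at=datetime(2024, 1, 5, 9, 30, tzinfo=timezone.utc),
    location="line-3/station-7",   # hypothetical plant location
    sampling_rate_hz=10.0,
)
print(asdict(reading)["location"])  # → line-3/station-7
```

Keeping the metaprocess fields with the value lets the analyst ask where and when an odd measurement happened, which is often the more interesting question.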

Oftentimes the analog measurements are stored on log tapes or journal tapes. A log tape is a sequential record of one or more variables detected during the event(s) that create an analog measurement. A log tape is very detailed; numbers are generated at very small intervals.

The format of a log tape is typically complex. Oftentimes system utilities are needed to read and interpret the log tape because of its complexity. In most cases, the log tape captures all the events that occur, not just the events that are of interest or that are exceptions. As a consequence, it is normal for a log tape to contain much extraneous information. Fig 3.4 shows the analog data found on a typical log tape.


Fig 3.4 Storing analog data in log tapes or journals
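Because a log captures every event, extracting the few records of interest is a filtering problem. The sketch below assumes a toy comma-separated line format ("timestamp,event,value"); real log tapes usually have complex vendor formats and need dedicated utilities to decode.

```python
# Illustrative sketch: pull only the exceptional events out of a log
# that records everything. The line format is an assumption for the demo.
def events_of_interest(log_lines, wanted=frozenset({"EXCEPTION"})):
    """Return (timestamp, event, value) tuples for events in `wanted`."""
    records = []
    for line in log_lines:
        timestamp, event, value = line.split(",")
        if event in wanted:
            records.append((timestamp, event, float(value)))
    return records

log = [
    "09:00:01,ROUTINE,71.2",
    "09:00:02,ROUTINE,71.3",
    "09:00:03,EXCEPTION,98.6",  # the rare record the analyst cares about
    "09:00:04,ROUTINE,71.1",
]
print(events_of_interest(log))  # → [('09:00:03', 'EXCEPTION', 98.6)]
```

Most of the tape is extraneous routine traffic; only the exceptional record survives the filter.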

Application Data

The second general category of data found in data lakes is application data. Application data is generated by the execution of an application or transaction, and sent to the data lake. As important as transaction data is, it is not the only kind of data found in the application component of the data lake.

Typical types of application data found in the data lake include sales data, payment data, bank checking data, manufacturing process control data, shipment data, contract completion data, inventory management data, billing data, bill payment data, and so forth. When any business-relevant event occurs, the event is measured by an application and the data is created.

The physical manifestation of application data in the data lake can take many forms. The most typical form, however, is a record of activity in an application. The records may or may not have been shaped by a database management system (DBMS). It is typical of application records to have a common, repeating, uniform structure. Fig 3.5 shows that structure.


Fig 3.5 Repeating the same structure

The common, uniform structure of the application data is usually in the form of a record, which carries more structure than a single analog data point. The record may have attributes. One or more of those attributes may be designated as a key. One or more of the attributes can have an independent index. Fig 3.6 shows the key and record structure that is typical of application data in the data lake.


Fig 3.6 Typical key and record structure in the data lake

It is noteworthy that the structure of application data may or may not be rigorously tied to the DBMS that the data once was housed in.
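The key-and-attribute structure described above can be sketched with plain dictionaries. The record fields, key choice, and sample values below are illustrative assumptions, not a schema from this book.

```python
# Illustrative sketch: uniform application records, one attribute
# designated as the key, and an independent index on another attribute.
records = [
    {"order_id": "A-100", "customer": "acme",   "amount": 250.0},
    {"order_id": "A-101", "customer": "globex", "amount": 75.5},
    {"order_id": "A-102", "customer": "acme",   "amount": 19.9},
]

# "order_id" acts as the key: one record per key value.
by_key = {r["order_id"]: r for r in records}

# Any other attribute can carry an independent index, here "customer".
customer_index = {}
for r in records:
    customer_index.setdefault(r["customer"], []).append(r["order_id"])

print(by_key["A-101"]["amount"])  # lookup by key → 75.5
print(customer_index["acme"])     # lookup via index → ['A-100', 'A-102']
```

Note that every record repeats the same attribute layout; that uniformity is what distinguishes application data from text.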

Textual Data

The third general type of data found in the data lake is textual data. The textual data is usually associated with an application. However, the textual data takes a very different form than application data. Whereas application data is shaped into uniform records, data found in a textual format is decidedly not shaped into any uniform form.

Textual data is called “unstructured data” because the text can take any form. For example, when people speak, they can say anything in any fashion they like. Usually the words make sense, but many variables can strip away any structure. Speakers may talk in riddles and parables. They might use a different language. Their speech may contain slang or vulgarities, be in a formal style, or even be an inside joke. Naturally, such text is extremely context dependent and not easily searched or processed by automated means.

Typical text found in corporations includes call center conversations, corporate contracts, email, insurance claims, sales pitches, court orders, jokes, tweets, invitations, and so forth. There is no limit on what kind of text, or how much text, can be stored in a data lake. However, in order for text to be used analytically, it must be transformed. As long as text is in its original form, only the most superficial analysis can be done against it. In order for text to be subjected to useful analytical processing, unstructured text must pass through a process known as textual disambiguation.

Note that analog data and application data rarely have to pass through a similar process. Because of the uniformity with which analog data and application data are captured, those kinds of data can be analyzed by a computer as they are. But if there is to be exhaustive analysis of text, the text must pass from its unstructured form through textual disambiguation, at which point it reaches a state and form that can be analyzed by the computer.

There are two principal activities that are accomplished by textual disambiguation:

  • Text goes from an unstructured state to a structured uniform state that can be analyzed by the computer, and
  • Text has context recognized and associated with the text itself.

While these are the two primary functions of textual disambiguation, there are other useful functions accomplished by textual disambiguation. The most complex of these disambiguation activities is the identification of the context of text and the association of text with that context, as seen in Fig 3.7.


Fig 3.7 Identifying the context of text
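The two activities of textual disambiguation can be sketched in a deliberately tiny example: free-form text goes in, and uniform records with recognized context come out. The vocabulary-to-context mapping below is a toy assumption; real disambiguation is vastly richer than a word lookup.

```python
# Illustrative sketch of textual disambiguation: unstructured text is
# turned into uniform records, each tagged with a recognized context.
# The CONTEXT vocabulary is a toy assumption for the demo.
CONTEXT = {
    "refund": "payment-dispute",
    "broken": "product-complaint",
    "cancel": "churn-risk",
}

def disambiguate(raw_text):
    """Return {word, context} records for words with recognized context."""
    results = []
    for word in raw_text.lower().replace(".", "").split():
        if word in CONTEXT:
            results.append({"word": word, "context": CONTEXT[word]})
    return results

note = "Customer says the unit arrived broken and wants a refund."
print(disambiguate(note))
```

The output is structured and uniform, so it can be analyzed by a computer; the original call-center note could not be.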

Another Perspective

The three major categories of data found in the data lake, then, are analog data, application data, and textual data. But there is another important classification of data in the data lake: the division between repetitive and non-repetitive data. In general, analog and application data are repetitive, whereas textual data is non-repetitive. Fig 3.8 shows data in the data lake divided into the classifications of repetitive data and non-repetitive data.


Fig 3.8 Repetitive data is data where the same unit of data occurs over and over. Non-repetitive data is data where the same unit of data does not occur repeatedly, if at all.

While this might seem minor at first glance, there is great significance to the division of data into these two classifications.
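One rough way to make the classification concrete is to treat records that share a field layout with other records as repetitive, and one-off layouts as non-repetitive. This heuristic and the sample data are illustrative assumptions only, not a method from this book.

```python
# Illustrative sketch: split records into repetitive and non-repetitive
# by structural signature (the set of field names each record carries).
from collections import Counter

def split_by_repetition(records):
    """Records sharing a field layout with others count as repetitive."""
    signatures = Counter(tuple(sorted(r)) for r in records)
    repetitive, non_repetitive = [], []
    for r in records:
        if signatures[tuple(sorted(r))] > 1:
            repetitive.append(r)
        else:
            non_repetitive.append(r)
    return repetitive, non_repetitive

data = [
    {"sensor": "t1", "value": 71.2},  # uniform analog-style records
    {"sensor": "t1", "value": 71.3},
    {"note": "customer called about a broken unit"},  # free-form text
]
rep, non_rep = split_by_repetition(data)
print(len(rep), len(non_rep))  # → 2 1
```

The uniform sensor records land on the repetitive side; the free-form note does not.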

In later chapters, we will explore the differences between repetitive and non-repetitive data in terms of business value and the significance of this division. Generally, there is great business value in non-repetitive data, while significantly less business value is found in repetitive data. Because of this stark difference in business value, the division forms what is called the “great divide” between the two types of data, as seen in Fig 3.9.


Fig 3.9 The “great divide” between repetitive and non-repetitive data

In Summary

There are many ways of organizing a data lake. One of those ways is to categorize data into one of three categories:

  • Analog data
  • Application data
  • Textual data

Another important method of categorizing data is into repetitive and non-repetitive data. The difference between repetitive data and non-repetitive data forms what is termed the “great divide.”
