Chapter 4
Data Ponds

To organize the different types of data so that they can be analyzed, the data lake needs a high-level structure. Data entering the lake first lands in the raw data pond, which serves as a holding cell. Little or no analysis or other organized activity takes place on the data while it is in the raw data pond.

Once it is time for analysis, the data in the raw data pond is sent to one of three ponds according to the kind of data involved: analog, application, and textual data each have their own pond.

It is important to separate the three types of data, and once the data is inside its pond considerable processing takes place. The kind of processing, or conditioning, differs greatly from one pond to the next. After the conditioning is finished, the data in the pond is fit for analysis.

After the data has outlived its useful life in the data pond, it’s moved from the analog, application, or textual pond into an archival data pond. This high-level flow of data from the raw data pond through the analog pond, the application pond, or the textual pond is seen in Fig 4.1.

Fig 4.1 Understanding the data lifecycle across the different types of ponds

Conditioning Data

As raw data enters the various ponds, it goes through a conditioning process that prepares it for analytical processing. Raw data that skips conditioning has a hard time supporting the business analysis that in turn creates business value, because it is not in a format that is easy, or sometimes even possible, to study. Raw data must be conditioned if it is to support analytical processing.

But conditioning for each type of pond is very different.

Raw Data Pond

The genesis of data is the raw data pond. The raw data pond is what many organizations initially call the data lake. Too often, they’ll simply throw data into the lake and then wonder why they can’t do any meaningful analytic processing against the data. In fairness, analytical processing can be done against raw data in the data lake. It just requires a data scientist to do the analysis. But much more lucid and efficient data analysis can be done against data after it has been conditioned. Almost as important, once the data has been conditioned, it can then be analyzed by the ordinary business user.

An interesting architectural question is whether raw data should remain in the raw data pond once it has flowed into one of the other data ponds. The answer is no. Once raw data passes from the raw data pond to the analog, application, or textual data pond, it is best to remove the source data from the raw data pond. The raw data has already served its purpose, and it would be extremely rare for analytical processing ever to be performed in the raw data pond. The raw data pond is thus only a “holding cell” for a jumble of data, as seen in Fig 4.2.

Fig 4.2 Becoming a “holding cell” for a jumble of data

The data in the raw data pond should be passed to the supporting data ponds as quickly as possible. One useful measure of quality for the raw data pond is how small it is and how quickly data passes out of the pond.
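
The hand-off out of the raw data pond can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the `kind` tag on each record, the pond names, and the `drain_raw_pond` function are hypothetical and are not prescribed by any particular data lake product. Each record is moved to the pond that matches its kind and is not kept in the raw data pond afterward.

```python
from collections import defaultdict

# Hypothetical pond names; the "kind" field is an assumed tag set at ingestion.
DOWNSTREAM_PONDS = {"analog", "application", "textual"}


def drain_raw_pond(raw_pond, ponds):
    """Move every routable record out of the raw pond and return what remains."""
    remaining = []
    for record in raw_pond:
        kind = record.get("kind")
        if kind in DOWNSTREAM_PONDS:
            ponds[kind].append(record)   # hand off to the matching pond
        else:
            remaining.append(record)     # unknown kind: stays in the holding cell
    return remaining


raw_pond = [
    {"kind": "analog", "payload": 21.7},
    {"kind": "textual", "payload": "customer called about a late delivery"},
    {"kind": "application", "payload": {"order_id": 17, "amount": 99.5}},
    {"kind": None, "payload": b"\x00\x01"},
]
ponds = defaultdict(list)
raw_pond = drain_raw_pond(raw_pond, ponds)
```

After the call, the length of `raw_pond` is a direct reading of the quality measure described above: the fewer records left behind, the better the raw data pond is doing its job.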

Analog Data Pond

The analog data pond is, naturally enough, where analog data is stored. The conditioning process for analog data consists primarily of data reduction, that is, shrinking the volume of data in the analog pond to a workable, meaningful size and restructuring the data in the pond.
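
One simple form of data reduction is to replace runs of readings with summary rows. The sketch below is an assumption made for illustration, not the only or the recommended reduction technique; the function name `reduce_readings`, the window size, and the sample readings are all hypothetical.

```python
from statistics import mean


def reduce_readings(readings, window=4):
    """Collapse each run of `window` readings into one min/mean/max summary row."""
    summaries = []
    for start in range(0, len(readings), window):
        chunk = readings[start:start + window]
        summaries.append({
            "start_index": start,   # where the run began in the original stream
            "count": len(chunk),    # the last run may be shorter than `window`
            "min": min(chunk),
            "mean": mean(chunk),
            "max": max(chunk),
        })
    return summaries


readings = [20.0, 20.0, 22.0, 22.0, 30.0, 30.0, 30.0, 30.0, 19.0]
summary = reduce_readings(readings, window=4)
```

Here nine readings become three rows. The volume shrinks, and the data is restructured into a uniform record layout that is easier to query.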

Application Data Pond

The application data pond is populated with data produced by running one or more applications. This application data is probably the “cleanest” in the data lake because it has been generated by an application. All the data in the application pond is uniformly structured and contains values relevant to the execution of some business activity. But the data in the application pond is notoriously unintegrated. If all the data in this pond happens to come from a single application, it may actually be integrated. For large corporations, however (and it is mostly large corporations that have data lakes), there is a good chance that the data in this pond comes from different applications. It is this multi-application origin that gives the analyst a hard time.
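
To make the integration problem concrete, the following sketch shows two hypothetical applications that record the same facts under different field names, along with one possible way to map them onto a shared layout. The application names, field names, and the `to_common_schema` function are assumptions invented for this example. Real integration usually also has to reconcile keys, units, and encodings, which the sketch ignores.

```python
# Hypothetical field mappings: each application names the same facts differently.
FIELD_MAPS = {
    "billing": {"cust_no": "customer_id", "amt": "amount_usd"},
    "orders": {"customerId": "customer_id", "total": "amount_usd"},
}


def to_common_schema(source, record):
    """Rename a record's mapped fields to the shared names and tag its origin."""
    mapping = FIELD_MAPS[source]
    # Fields with no entry in the mapping are dropped from the unified row.
    unified = {mapping[name]: value for name, value in record.items() if name in mapping}
    unified["source_application"] = source
    return unified


rows = [
    to_common_schema("billing", {"cust_no": 1001, "amt": 250.0}),
    to_common_schema("orders", {"customerId": 1001, "total": 75.5, "channel": "web"}),
]
```

Once both rows carry `customer_id` and `amount_usd`, an analyst can compare or combine them without knowing which application produced which record.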

Textual Data Pond

The textual data pond is where unstructured textual data is placed. Text here can come from anywhere, and it is notoriously difficult to analyze in any depth. A superficial analysis can be done on text with no transformation, but a deep analysis requires that the text first be disambiguated.

The disambiguation of text has two important effects:

  • Text is restructured into a uniform database format, and
  • Text has context identified and attached to the text itself.
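
Both effects can be seen in a deliberately small sketch. It assumes a hypothetical word-to-context taxonomy and a hypothetical `disambiguate` function. Genuine textual disambiguation is far more involved than a dictionary lookup, so this shows only the shape of the output: uniform rows with context attached.

```python
import re

# Hypothetical taxonomy: maps a word to the context it should be tagged with.
TAXONOMY = {
    "invoice": "billing",
    "refund": "billing",
    "delivery": "logistics",
    "shipment": "logistics",
}


def disambiguate(doc_id, text):
    """Return one row per recognized word: (doc_id, character offset, word, context)."""
    rows = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        context = TAXONOMY.get(word)
        if context is not None:
            rows.append((doc_id, match.start(), word, context))
    return rows


rows = disambiguate("note-1", "Refund issued after the late delivery.")
```

Each resulting row has the same four columns, so the rows can be loaded into an ordinary database table and queried by context rather than by raw wording.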

Data Passing Directly Into the Data Ponds

Note that data does not have to pass through the raw data pond, although it almost always does. A sophisticated developer can send data directly into the analog, application, or textual pond. Most data nevertheless passes through the raw data pond, simply because that is how most organizations did it in the beginning. Fig 4.3 shows that raw data can pass directly into the analog, application, or textual data pond.

Fig 4.3 Sending raw data into the different data ponds

In the final stage of its life cycle, data passes from the analog, application, or textual data pond into the archival pond.

Archival Data Pond

Fig 4.4 shows the passage of data from the various data ponds into the archival data pond.

Fig 4.4 Storing data in the archival data pond is optional

The purpose of the archival data pond is to hold data that is not actively needed for analysis but might be needed for analysis at some future point.
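
How an organization decides that data is no longer actively needed is a policy choice that this chapter does not specify. The sketch below assumes one plausible rule, archiving any record that has not been used within a fixed number of days. The `last_used` field, the one-year threshold, and the `archive_stale` function are hypothetical.

```python
from datetime import date, timedelta


def archive_stale(pond, archive, today, max_idle_days=365):
    """Move records not used within `max_idle_days` into the archive; return the rest."""
    cutoff = today - timedelta(days=max_idle_days)
    active = []
    for record in pond:
        if record["last_used"] < cutoff:
            archive.append(record)   # not actively needed, but kept for the future
        else:
            active.append(record)    # still in use: stays in its pond
    return active


today = date(2024, 6, 1)
pond = [
    {"id": "a", "last_used": date(2024, 5, 1)},
    {"id": "b", "last_used": date(2022, 1, 15)},
]
archive = []
pond = archive_stale(pond, archive, today)
```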

In Summary

A data lake that can support analytical processing is divided into several data ponds:

  • The raw data pond is the place where data first enters the data lake. The raw data pond serves as a holding cell for data.
  • The analog data pond is the place where analog data is channeled.
  • The application data pond is the place where application data is channeled.
  • The textual data pond is the place where textual data is gathered.

Upon entering the different data ponds, raw data passes through a conditioning process. Finally, when data has reached the end of its useful life, it passes into an archival data pond.
