
Chapter 2
Transforming the Data Lake

The data lake has great potential. It can support analytical processing that has never before been possible. From governments to small businesses, organizations can use the data lake to identify, analyze, and even predict important patterns that have heretofore gone unnoticed.

What needs to be done to turn the data lake into an information gold mine? What exactly does the organization need to think about as it creates its data lake? What can be done to the data to prepare it for future usage and analysis?

With care and planning, the data lake can become a bottomless well of actionable insights. Four basic ingredients are needed: metadata, integration mapping, context, and metaprocess.

Metadata

Metadata is the description of the data in the data lake itself (as opposed to the raw data). Metadata is the basic structural information that every collection of data has associated with it. For example, when tracking visits, clicks, and engagement on a website, the metadata would include the IP address and geographic location of the visiting computer. Typical forms of metadata include descriptions of the record, the attributes, the keys, the indexes, and the relationships among the different attributes of data. There are, however, many additional forms of metadata.

Metadata is used by the analyst to decipher the raw data found in the data lake. Or in other words, metadata is the basic roadmap of the data that resides in the data lake.

When only raw data is stored in the data lake, the analyst who needs to use that data is severely handicapped. Imagine trying to search Wikipedia if none of the articles had titles. Raw data by itself just isn't very useful. But when raw data is properly tagged with metadata and the two are stored in the data lake together, you have an incredibly useful resource.
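The idea of pairing raw data with its metadata can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation; the record, attribute, and field names (web_visit, visitor_ip, and so on) are invented for the example.

```python
# A raw record by itself: just values, with no indication of what they mean.
raw_record = ("203.0.113.7", "/pricing", "2024-05-01T14:32:00Z")

# Metadata describes the structure: the record type, its attributes
# and their types, and which attributes form the key.
metadata = {
    "record": "web_visit",
    "attributes": [
        {"name": "visitor_ip", "type": "string"},
        {"name": "page",       "type": "string"},
        {"name": "ts",         "type": "timestamp"},
    ],
    "key": ["visitor_ip", "ts"],
}

# Storing the two together turns opaque raw data into self-describing data.
tagged = {
    "metadata": metadata,
    "data": dict(zip((a["name"] for a in metadata["attributes"]), raw_record)),
}

# The analyst can now address fields by name instead of guessing positions.
print(tagged["data"]["page"])
```

With the metadata attached, an analyst can navigate the record by attribute name rather than deciphering anonymous values.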

Integration Mapping

The integration map describes how data from one application relates to data from another application and how that data can be meaningfully combined. As important as metadata is, it is not the only basic infrastructure ingredient needed in the data lake. Consider that most of the data lake’s input is generated by an application, in one form or the other. What happens when you put data from different applications in the data lake? You create unintegrated “silos” of data in the data lake.

Each application, usually written in a different programming language, sends its input to a separate silo, which cannot communicate or "talk to" the other silos. While the information is all stored in the same data lake, each silo is unable to integrate its data with the others, even if that data is properly tagged with metadata.

In order to make sense of the data in the data lake, it is necessary to create an “integration map.” The integration map is a detailed specification that shows how the lake’s data can be integrated. The integration map is the best method to overcome the isolation of data in the silos.

Fig 2.1 shows that when unintegrated application data is placed in the data lake, silos of data are created. These silos make the reading and interpretation of data a very difficult thing to do.


Fig 2.1 Creating silos leads to unintegrated data, hindering communication

Context

Another complicating factor in the data lake is textual data that has been placed there without the context of the text being identified. Suppose the text "court" appears. Does court refer to a tennis court? To a legal proceeding? To the activities of a young man as he tries to woo a young lady? Does court refer to the people surrounding royalty? When you look at the word "court" by itself, it might mean any of these things or more.

Text without context is meaningless data. In fact, in some cases it is dangerous to store text without an understanding of its context. If you are going to put text in the data lake, then you must also insert context as well, or at least a way to find that context. Fig 2.2 shows that context for text is an essential ingredient for data found in the data lake.
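One simple way to picture "inserting context" is to store each piece of text alongside a context record. The sketch below is illustrative only; the context fields and values (domain, source) are invented for the example.

```python
# Text stored with its context, so an ambiguous word like "court"
# can be interpreted correctly. Context vocabulary is illustrative.
entries = [
    {"text": "court", "context": {"domain": "legal",  "source": "filing_2024_17"}},
    {"text": "court", "context": {"domain": "sports", "source": "match_report"}},
]

# With context attached, an analyst can retrieve only the intended sense.
legal_hits = [e for e in entries if e["context"]["domain"] == "legal"]
print(len(legal_hits))
```

Without the context record, both entries are just the bare string "court," and the analyst has no way to tell the legal proceeding from the tennis court.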


Fig 2.2 Lacking context of textual data

Metaprocess

Metaprocess information is information about how the data was processed or how the information in the data lake will be processed. When was the data generated? Where was the data generated? How much data was generated? Who generated the data? How was the data selected to be placed in the data lake? Once inside the data lake, was the data further processed? All of these forms of metaprocessing are useful to the analyst as they go about extracting and analyzing the lake’s data.
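The when/where/who/how questions above can be captured as a set of metaprocess tags attached to a load of data. This is a minimal sketch under assumed tag names and values; none of them come from a particular product or standard.

```python
from datetime import datetime, timezone

# Metaprocess tags answering: when, where/who, how much, how selected,
# and what processing happened inside the lake. Values are illustrative.
metaprocess = {
    "generated_at": datetime(2024, 5, 1, tzinfo=timezone.utc).isoformat(),
    "generated_by": "web_server_cluster_A",          # who / where
    "record_count": 120_000,                         # how much
    "selection":    "all clickstream events",        # how it was selected
    "post_load":    ["deduplicated", "ip_geocoded"], # processing in the lake
}

# Before trusting the data, an analyst can check its processing lineage.
print("deduplicated" in metaprocess["post_load"])
```

Because these tags travel with the data, the analyst can judge whether a given load is complete, current, and suitably processed before analyzing it.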

The most important point is that these features need to be included at the outset. Usually, after the raw data has been loaded into the data lake, it is too late to go back and include these essential ingredients.

However, once the ingredients have been added, the data lake is a potential information gold mine. Fig 2.3 depicts the broad strokes required to turn the data lake into a powerful and useful corporate resource.


Fig 2.3 Going from a garbage dump to an information gold mine

Another important effect of turning the data lake into a useful corporate resource is that an entirely different and expanded community of users can make use of the tool.

Consider the transformation of a data lake into a useful corporate resource. Fig 2.4 shows the data lake in an untransformed state and the data lake in a transformed state.


Fig 2.4 Going from an untransformed state into a transformed state

Data Scientist

When the data lake is in its raw state, only a handful of specialists can make sense of its data. Typically these people are called data scientists. Data scientists are:

  • Hard to find
  • Expensive to hire
  • Hard to schedule once they are hired

There is nothing wrong with data scientists as a group of people. But the difficulty of finding them, the cost of hiring them, and the scarcity of their time once hired are legendary. No matter how well organized the data lake is, when it can be operated only by a few people whose cost is high and whose time is precious, it has limited corporate value.

General Usability

Now consider what happens when the data lake is fully integrated and the data is transformed into a state of general usability.

Fig 2.5 shows the difference between a data lake that is accessible only to a few data scientists and one after transformation that’s accessible to a large population of business users.


Fig 2.5 Transforming data increases user accessibility

After transformation, the data lake is useful to accountants, managers, systems analysts, end users, finance teams, sales staff, marketing, and so forth. By integrating and conditioning the data, the audience served by the data lake expands greatly. And in doing so, the lake's value to the corporation expands greatly.

In Summary

The data lake has great potential. But when people merely dump data inside with no thought as to how the data will be used, there is the very real danger that the data lake will turn into a garbage dump. With four basic ingredients, the data lake can be turned into an information gold mine:

  • Metadata. Metadata is used by the analyst to decipher the raw data found in the data lake. Metadata is the basic roadmap of data that resides in the data lake.
  • Integration mapping. The integration map is a detailed specification that shows how the data in the data lake can be integrated. The integration map shows how the isolation of data in the silos can be overcome.
  • Context. If you are going to put text in the data lake, then you must also insert context as well, or at least a way to find that context.
  • Metaprocess. Metaprocess tags are information about the processing of data in the data lake.