
Chapter 10
Using the Infrastructure

Nothing elucidates a concept better than an example.

Suppose a corporation has a wide variety of data. The corporation has applications that govern different aspects of its business. There are online systems running transactions that manage the day-to-day interchanges between the corporation and its customers. The corporation has data warehouses where corporate analysis is done. The corporation has data marts, fed by the data warehouse, where key performance indicator (KPI) calculations are made periodically.

Yet the corporation has a lot of other data as well. The firm has competitive data, engineering data, financial data, emails, economic data, tweets, contracts, call center data and a whole host of other data types.

Naturally, the corporation starts to place a lot of its data into a data lake.

After a while, storing data in the lake becomes onerous. Management asks why they are putting information into a data lake when no analysis is being generated from the lake. Or if there is analysis, why is it so slow and expensive?

The organization wakes up to the fact that they have created a “one way” data lake that’s more of a liability than an asset. The “one way” data lake simply does not support business decision making in any meaningful way.

“One Way” Data Lake

Fig 10.1 depicts a “one way” data lake that has been built by well-intentioned Big Data developers and data scientists.

Image250972.jpg

Fig 10.1 Avoiding the “one way” data lake

One day a manager reads a book describing how to turn the data lake into a positive business asset. The manager understands the problems with the “one way” data lake and decides to build an architected data lake/data pond environment that can truly support decision making in the corporation.

Transforming the Data Lake

The manager hires a consulting firm and soon they are busily transforming the “one way” data lake into an architected data lake with data ponds. Fig 10.2 shows the data lake/data pond architecture that has been built from the data found in the “one way” data lake.

Image250979.jpg

Fig 10.2 Transforming the “one way” data lake into an architected data lake

The newly architected data lake contains three primary data ponds: an analog data pond, an application data pond, and a textual data pond. In addition, there is some small amount of data in the miscellaneous data section of the raw data pond.

Transformation Technology

The consulting firm also brings in three distinct technologies for the purpose of transforming/conditioning the raw data in each of the data ponds. For the analog data pond, they use technology that can do data reduction and data compression. For the application data pond, the consulting firm brings in classical ETL technology. And for the textual data pond, they deploy textual disambiguation software. In addition, the consulting firm brings in technology to manage the descriptors, metaprocess information and the metadata that are found in the data lake. Soon the “one way” lake is transformed into a useful tool for the firm.

The transformation process requires work, investment and time. Still, the result is an infrastructure that can really be used for analytical processing, an asset of inestimable value for the corporation.

Some Analytical Questions

As an example of the worth of the architected data lake, consider some simple analytical questions. Suppose the corporation wants to find out what corporate revenues were for the last quarter. Now suppose the corporation goes and looks in the untransformed/unconditioned data lake environment. In the lake, they have transactions recorded in Australian dollars, Mexican pesos, Canadian dollars, and US dollars.

Certainly the corporation can find the financial transactions. But converting the monetary amounts on the transactions into a common value is a confusing and onerous process that the analyst would rather not have to do. When management wants answers, management wants the answers now. Management does not want to have to wait on complicated calculations and complex analysis.

It is one thing to calculate conversion rates. It is another thing to convert amounts at the rates that applied at some moment in the past. The conversion calculation is a messy, inaccurate affair. Fig 10.3 shows what management gets when they query the data lake.

But what happens when management queries the architected, integrated data lake/data pond environment? Since the monetary amounts have already been converted to a common value and integrated, management quickly gets its answer and has confidence in the value.

Image250988.jpg

Fig 10.3 Querying the data lake
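The kind of conditioning that makes the revenue question easy can be sketched in a few lines of Python. This is only a minimal illustration, not the ETL technology the consulting firm would deploy. The rate table, the dates, the amounts and the function names (`to_usd`, `integrate`) are all hypothetical. The point it shows is that each transaction is converted at the rate recorded for its own transaction date, once, before anyone queries the data.

```python
from datetime import date
from decimal import Decimal

# Hypothetical rates to US dollars, keyed by (currency, transaction date).
# The values are invented for illustration only.
RATES_TO_USD = {
    ("USD", date(2016, 1, 15)): Decimal("1.00"),
    ("CAD", date(2016, 1, 15)): Decimal("0.70"),
    ("AUD", date(2016, 2, 10)): Decimal("0.75"),
    ("MXN", date(2016, 3, 5)): Decimal("0.05"),
}


def to_usd(amount, currency, on_date):
    """Convert an amount using the rate recorded for the transaction date."""
    try:
        rate = RATES_TO_USD[(currency, on_date)]
    except KeyError:
        # Refuse to guess: a missing historical rate is a data problem.
        raise LookupError(f"no {currency} rate recorded for {on_date}") from None
    return amount * rate


def integrate(transactions):
    """Sum (amount, currency, date) transactions in a common currency."""
    return sum((to_usd(a, c, d) for a, c, d in transactions), Decimal("0"))


# Hypothetical raw transactions as they might sit in the untransformed lake.
raw_transactions = [
    (Decimal("100.00"), "USD", date(2016, 1, 15)),
    (Decimal("200.00"), "CAD", date(2016, 1, 15)),
    (Decimal("400.00"), "AUD", date(2016, 2, 10)),
    (Decimal("1000.00"), "MXN", date(2016, 3, 5)),
]

quarterly_revenue_usd = integrate(raw_transactions)
print(quarterly_revenue_usd)
```

Using `Decimal` rather than floating point avoids rounding surprises in monetary sums, and raising an error on a missing rate keeps a silent guess from ending up in the figure management sees.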

There is no question that building integrated data ponds takes work and investment. But that investment comes back many times over in the analysis that can be performed with the data after it has been architected.

The world of technology has millions of dollars to build things wrong and not a dime to build things right. And this shortsighted attitude comes back to bite the clients more often than the vendors.

Now suppose management has another question to be addressed by the untransformed data lake. Management wants to know how many female employees have taken the SAT exam.

When management looks into the untransformed data lake, they find that every application has encoded the designator for gender differently. One application has encoded women as 0. Other applications have encoded women as F. Another application has encoded women as X, and so forth.

When the applications were built, each developer had his/her own way of designating gender. It is one thing to find data. It is another thing to interpret the data accurately. Once again, management just wants answers. They don’t want a big explanation about calculations and algorithm processes. But where the applications have not been integrated, management cannot get what it wants. Fig 10.4 shows that accessing and analyzing unintegrated data is a difficult thing to do.

Image250997.jpg

Fig 10.4 Challenging queries against unintegrated data

However, when management accesses and analyzes data from the architected, integrated data lake/data pond environment, the answer is easy to locate. In addition to getting the answer quickly, management has confidence that the answer is accurate as well and doesn’t have a bunch of asterisks next to the figure.
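A rough sketch of how integration resolves the gender encodings is shown here in Python. The codes for women (0, F and X) come from the example in this section. Everything else is an assumption made for illustration: the application names, the codes for men, the record layout and the `condition` function. Classical ETL technology does a great deal more than this.

```python
# Per-application gender codes mapped to one common encoding.
# "0", "F" and "X" for women come from the text; the application names
# and the codes for men are hypothetical.
GENDER_CODES_BY_APP = {
    "payroll": {"0": "F", "1": "M"},
    "benefits": {"F": "F", "M": "M"},
    "recruiting": {"X": "F", "Y": "M"},
}


def condition(record):
    """Return a copy of the record with gender in the common encoding."""
    codes = GENDER_CODES_BY_APP[record["app"]]
    raw = record["gender"]
    if raw not in codes:
        # Surface unknown codes instead of silently miscounting.
        raise ValueError(f"unknown gender code {raw!r} in {record['app']}")
    return {**record, "gender": codes[raw]}


# Hypothetical raw records as each application wrote them.
raw_records = [
    {"app": "payroll", "employee_id": 1, "gender": "0", "took_sat": True},
    {"app": "benefits", "employee_id": 2, "gender": "F", "took_sat": False},
    {"app": "recruiting", "employee_id": 3, "gender": "X", "took_sat": True},
    {"app": "payroll", "employee_id": 4, "gender": "1", "took_sat": True},
]

conditioned = [condition(r) for r in raw_records]

# Once the encoding is uniform, the management question is a simple filter.
female_sat_takers = {
    r["employee_id"] for r in conditioned if r["gender"] == "F" and r["took_sat"]
}
print(len(female_sat_takers))
```

The set of employee IDs guards against counting the same person twice when she appears in more than one application.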

Querying Textual Data

Now let’s consider another type of data, textual data. Management wishes to know how many books Bill Inmon has written.

Management issues a natural language processing (NLP) query to the data lake. When NLP sees the name “bill” it marks the record. Soon all sorts of “bills” start to appear. There are bird bills. There are billboards. There is an Australian billabong. There is Bill Bryson. There are bills in front of Congress. There are dollar bills. There are hotel bills. And along the way, there are a few references to Bill Inmon.

Doing an un-contextualized query against raw text is very confusing and not very productive. Fig 10.5 shows the confusing query that comes from looking at the untransformed data lake.

Image251005.jpg

Fig 10.5 Confusing results from the untransformed data lake

But when management looks at the contextualized data in the textual data pond, they see Bill Inmon is the author of 55 books.

Once again, the integration and transformation work done by the creation of a disambiguated textual data pond has paid off in speed of analysis and in terms of confidence of results.
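The difference between the two queries can be illustrated with a toy Python example. It is a deliberately crude stand-in for textual disambiguation software, which does far more than this. The sample sentences and both function names are hypothetical. The first function matches the bare token “bill” wherever it occurs. The second matches only when the surrounding context, here simply the full name, identifies the person.

```python
import re

# Hypothetical raw text as it might sit in the untransformed lake.
documents = [
    "The duck has a broad, flat bill.",
    "A new billboard went up by the highway.",
    "Bill Bryson wrote a travel book.",
    "The bill is now in front of Congress.",
    "He paid with a twenty dollar bill.",
    "Bill Inmon wrote a book on data warehousing.",
]


def naive_matches(docs):
    """Un-contextualized query: any occurrence of the letters 'bill'."""
    return [d for d in docs if "bill" in d.lower()]


# Context here is just the full name as whole words; real disambiguation
# would rely on much richer context than a single pattern.
PERSON = re.compile(r"\bBill\s+Inmon\b")


def contextualized_matches(docs):
    """Contextualized query: only text that refers to the person."""
    return [d for d in docs if PERSON.search(d)]


print(len(naive_matches(documents)))
print(len(contextualized_matches(documents)))
```

Every sample sentence satisfies the naive query, while only the one that actually refers to Bill Inmon satisfies the contextualized query.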

Real Analysis

The queries and the analysis discussed here are trivial compared to the real analytical queries that organizations do. But these trivial queries are useful in pointing out what the problems of analysis are.

When an untransformed data lake is used for analysis, the results are confusing and complex. It takes quite an effort to conduct a serious analysis of data in the untransformed data lake. And management does not like long and complex efforts. Fig 10.6 shows that using the untransformed data lake as a basis for analysis is a complex and tedious chore. No wonder the untransformed data lake becomes a “one way” street and turns into a garbage dump.

Image251015.jpg

Fig 10.6 Choosing integrity and clarity over ambiguity

It takes time and effort to read, analyze, integrate and condition the data in the data ponds. But that effort turns the data lake into an asset rather than a liability.

In Summary

If you are serious about turning your data lake into a useful corporate asset, you must go through the effort and expense of transforming the raw data. The data ponds do the first high-level separation of data into generic data types, and the transformation / conditioning phase turns the data into something that is useful for corporate business analysis.

The alternative to building the data lake/data pond environment is to build a corporate structure that turns into a liability rather than an asset. It’s much cheaper to get things right the first time around.
