
Chapter 1
Data Lakes

First came the punch card. Then magnetic tapes. Then disk storage and database management systems, followed by fourth-generation languages (4GLs), “metadata,” floppy disks, and mobile computing. Advances came faster than we could memorize their new names. Soon personal computers and spreadsheets became as ubiquitous as suits and ties. And that was just the beginning.

In a few rapid decades, the corporation went from no automation to hyper-automation. Throughout this progression, one of the limiting factors was storage. Storage was always either too expensive or too limited in capacity to hold large volumes of data. This bottleneck had a profound effect on the types of systems that could be built and hampered the performance of the systems that were built.

Enter Big Data

Then one day Big Data changed the world. Big Data technology was best typified by the Hadoop Distributed File System (HDFS), an open-source software framework designed from the ground up to store and process massive datasets distributed across clusters of commodity machines. With Big Data, storage became effectively unlimited: neither cost nor technical constraints imposed a practical ceiling. Most importantly, whole new worlds of processing and opportunity opened up.

In short order, Big Data redefined our very conception of data. The sheer volume of data that could be stored and analyzed with Big Data systems revolutionized not just the industry but the world. Megabytes, gigabytes, terabytes... the old measures were thrown out the window in this new world where storage volume was effectively unlimited. Fig 1.1 depicts the advent of Big Data.


Fig 1.1 Creating unlimited opportunities by leveraging Big Data

Enter the Data Lake

As Big Data blossomed, organizations began to store the endless stream of data being collected in structures called “data lakes.”

While collecting the data was a piece of cake, plucking something useful from this sea of information was the real challenge. Some organizations turned to data scientists to make sense of their data lakes. But despite the money sunk into research, Big Data was just as new and unexplored for the scientists as it was for the organizations. Analytic breakthroughs were rare, expensive to produce, and fraught with false positives and other errors. Fig 1.2 shows that Big Data leads to massive data lakes to sift through.


Fig 1.2 Placing Big Data in the data lake

Fig 1.3 shows the business community’s growing frustration: the volume, and therefore the potential value, of data in the data lake continued to grow, yet the business could do little of value with its treasure.


Fig 1.3 Waking up and finding that we can’t find anything in the data lake

“One Way” Data Lake

There were many reasons for business users to be frustrated with the information pooling in their data lakes. The core issue was that the larger the data lake grew, the more difficult analyzing its data became. A data lake of any significant size was often dubbed a “one way” data lake: data pours in eternally, but no data or analysis ever comes out, and the data is often never even accessed once it is placed inside the lake. Fig 1.4 depicts the “one way” data lake.


Fig 1.4 Entering data into the “one way” data lake, but nothing comes out

It was an expensive and frustrating Catch-22. The larger and more potentially insightful a data lake grew, the more useless it became to the organization. If no one uses the data in the data lake, the lake serves no purpose to the organization. Yet the organization was spending a great deal of money on storage and on the specialized staff hired to extract useful information from the lake.

The question then arose: why is the data lake one way, and what can be done about it? There is great potential in Big Data and data lakes, yet no one seemed to be getting their money’s worth out of the investment. There are many reasons why a data lake turns into a “one way” data lake, but those reasons trace their roots to how data was placed into the lake in the first place: the intent was never to organize the data for future usage. Instead, the data lake became a place simply to “dump” data. So much effort was spent gathering data from every possible source that few engineers or companies gave much thought to organizing it for future use. Fig 1.5 shows that the “one way” data lake becomes little more than a large garbage dump for data.


Fig 1.5 Turning the data lake into a garbage dump

Does the data lake have to become a garbage dump? Isn’t there something that can be done to make the data lake a productive and useful place? Were the promises of Big Data just vendor hype? Indeed, the data lake has the potential to become a quite useful foundation for analytical processing. However, as long as people simply dump data into the lake with little or no thought to its future usage, the data lake is destined to remain a garbage dump.

What are some of the issues with the data lake when data is merely dumped inside? Let’s unpack the core problems one by one.

One issue is that useful data becomes hidden from the analyst, buried behind mountains of other information that is not relevant. There is nothing very remarkable about much of the data that is useful to companies, and given the sheer volume of data found in the data lake, the blandness of useful data makes it that much more difficult to find. Put another way, useful data simply doesn’t stand out in the mountains of data that accumulate in the data lake.

A second and related issue is that the metadata describing the data in the lake is not captured or stored in an accessible location; only the raw data is stored. This makes analysis a dicey proposition, because the analyst never knows the meaning or source of the data that has found its way into the lake. In order to perform useful analysis, the organization needs accurate and readily accessible metadata that puts the data found in the data lake in context.
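As a sketch of what “capturing metadata at ingestion time” could mean in practice, the record below describes a dataset’s source, schema, and business meaning alongside the raw data. All names here (the dataset, the source system, the fields) are illustrative assumptions, not part of any particular catalog product:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class DatasetMetadata:
    """Minimal descriptive metadata for one dataset landed in the lake."""
    name: str            # logical name of the dataset
    source_system: str   # where the raw data came from
    ingested_at: str     # ISO-8601 timestamp of ingestion
    schema: dict         # column name -> type description
    description: str     # business meaning of the data

# Record context at ingestion time, alongside the raw files.
meta = DatasetMetadata(
    name="orders_2024",
    source_system="pos_terminal_feed",  # hypothetical source system
    ingested_at=datetime.now(timezone.utc).isoformat(),
    schema={"order_id": "string", "amount": "decimal", "ts": "timestamp"},
    description="Point-of-sale orders, one row per line item.",
)

# A catalog can start as nothing more than a searchable collection
# of such records, keyed by dataset name.
catalog = {meta.name: asdict(meta)}
print(catalog["orders_2024"]["source_system"])
```

The point is not the particular data structure but the discipline: if a record like this is written every time data lands, an analyst can later answer “what is this data and where did it come from?” without guesswork.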

A third shortcoming of the “one way” data lake is that data relationships are lost, or are never even recognized. The lake is so large that important relationships among the data are not carried forward into it; doing so is considered too cumbersome.
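One lightweight way to carry relationships into the lake is to register them explicitly at ingestion time rather than hoping analysts rediscover them later. The sketch below records a foreign-key-style link between two hypothetical datasets; the function and dataset names are assumptions for illustration only:

```python
# A registry of known relationships between datasets in the lake,
# recorded explicitly so they are not lost once the data is dumped.
relationships = []

def register_relationship(from_dataset, from_column, to_dataset, to_column):
    """Note that from_dataset.from_column references to_dataset.to_column."""
    relationships.append({
        "from": f"{from_dataset}.{from_column}",
        "to": f"{to_dataset}.{to_column}",
        "kind": "foreign_key",
    })

# Hypothetical example: orders reference customers by customer_id.
register_relationship("orders_2024", "customer_id", "customers", "customer_id")

def related_to(dataset):
    """List the outgoing relationships recorded for a dataset."""
    return [r for r in relationships if r["from"].startswith(dataset + ".")]

print(related_to("orders_2024"))
```

With even this minimal registry, an analyst can ask how datasets connect before attempting a join, instead of reverse-engineering the relationships from the raw data.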

And this list is just the beginning of the shortcomings of data in a “one way” data lake. In fact, there are many more technical obstacles in the way of effectively utilizing a data lake. Fig 1.6 shows some of the limitations of data in the data lake.


Fig 1.6 Analyzing data in the data lake with traditional approaches becomes impossible

In Summary

Because the information inside the data lake is not designed for future access and analysis, the organization soon discovers that the data lake, no matter how large, will not support its business.

Organizations have long known that in order to support the business, data must be organized in a rational, easy-to-use, and easy-to-understand manner. Because data is dumped into the lake with no thought for future usage, the data lake is not useful to the business.

When the data lake becomes a “one way” data lake, its only benefit to the business is as a cheap facility for storing useless data. Cheap storage hardly justifies the expense and investment organizations have made.

So let’s take a look at solutions for this quandary.
