
Introduction

We invest millions of dollars and years of time to build it wrong, but can’t we spare a dime or a minute to build it right?

Nowadays corporations are madly building data lakes, a by-product of the Big Data mania. Then one day they wake up and find that they can’t get anything meaningful out of their data lake. Or at least it takes a monumental effort to get the smallest amount of useful information out of their data lake.

They spend huge amounts of money and many man-years of effort to build something that turns out to be a white elephant.

One day the corporation wakes up to the fact that they have built a “one way” data lake. Data goes into the data lake but nothing ever comes out. When this happens, the data lake is no more useful than a garbage dump.

This book is dedicated to corporations that want to build data lakes so that they can get useful information out of their data lakes. There is business value in the data lake, but only if you build it properly. If you are going to build a data lake you may as well build it so that it becomes an important corporate asset, not a liability.

The book examines why corporations have such a hard time getting anything useful out of their data lakes. There are several answers to this important question. One reason is that data is simply packed into the data lake in an indiscriminate fashion. Another is that the data is not integrated. A third is that much data is stored in textual form, and analysis cannot easily be done on raw text.

This book suggests that a high level of organization of the data in the data lake is needed, and that the data must be integrated and "conditioned" in order to make it a foundation for analytical processing. The data lake can be turned into a positive asset for the corporation, but only if there is care and forethought in the shaping of the data lake.

The data lake needs to be divided into several sections, called data ponds:

  • Raw data pond
  • Analog data pond
  • Application data pond
  • Textual data pond
  • Archival data pond
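As an illustration of this division, incoming data can be routed to a pond according to its source. The source types and routing rules below are hypothetical examples, not prescribed by the book; they merely sketch the idea that each kind of data has a designated home in the lake.

```python
from enum import Enum

class Pond(Enum):
    RAW = "raw"
    ANALOG = "analog"
    APPLICATION = "application"
    TEXTUAL = "textual"
    ARCHIVAL = "archival"

def route_record(record: dict) -> Pond:
    """Route an incoming record to a data pond by its source type.

    The source_type values here are illustrative assumptions.
    """
    source = record.get("source_type")
    if source in ("sensor", "machine_log"):
        return Pond.ANALOG          # machine-generated analog data
    if source in ("erp", "crm", "transaction"):
        return Pond.APPLICATION     # classical application data
    if source in ("email", "call_transcript", "document"):
        return Pond.TEXTUAL         # free-form text
    return Pond.RAW                 # unclassified data stays in the raw pond
```

Data that cannot be classified falls back to the raw data pond, from which it can later be moved once its nature is understood.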

After the data ponds are created, the ponds require conditioning in order to make the data accessible and useful. For example, the analog data pond needs to have data reduction and data compression applied to it. The application data pond needs to have classical ETL integration applied to it. The textual data pond needs to have textual disambiguation applied to the text so that the text can be reduced to a uniform database structure and so that the context of the text can be identified.
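The conditioning steps above can be sketched in miniature. The two functions below are toy stand-ins, assumed for illustration only: a simple downsampler standing in for analog data reduction, and a tokenizer standing in for the far richer process of textual disambiguation.

```python
def condition_analog(readings: list[float], keep_every: int = 10) -> list[float]:
    """Data reduction for the analog pond: keep every Nth reading.

    Real analog conditioning also involves compression; downsampling
    is used here as the simplest possible stand-in.
    """
    return readings[::keep_every]

def condition_textual(text: str, stop_words: set[str]) -> list[str]:
    """A toy stand-in for textual disambiguation: lowercase, tokenize,
    and drop stop words so the text fits a uniform structure.
    """
    return [word for word in text.lower().split() if word not in stop_words]
```

The point of both functions is the same: raw data in a pond is transformed into a reduced, uniform form that analytical tools can actually consume.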

Once the data ponds have had conditioning algorithms applied to their data, the ponds serve as a basis for analytical processing. When the data in the data lake has been divided into ponds and the ponds have had their data conditioned, the ponds become an asset for the corporation, not a liability. In addition, data is moved from a pond to the archival data pond when its useful life in that pond is over.

This book is for managers, students, system developers, architects, programmers, and end users. It is designed as a guideline for any organization that wishes to build data lakes that are an asset, not a liability.
