
Chapter 9
Comparing the Ponds

At first glance, the different data ponds seem to be the same. While there are many structural similarities among the different ponds, there are some important and distinct structural dissimilarities as well.

Similarities Across the Data Ponds

In terms of similarities, all data ponds:

  • Ingest raw data, usually lots of it
  • Transform/condition the raw data into a form that is suitable for analysis
  • Produce a uniform, integrated structuring of data that is suitable for analytical processing
  • Support business analysis with their final output
  • Ultimately send their data to the archival data pond
  • Have similar entry points for raw data
  • Have a supporting infrastructure of documentation to help the business analyst.

From a structural standpoint then, there are many core similarities among the different ponds of data. But for all the structural similarities, there are some important and striking differences as well.

Dissimilarities Across the Data Ponds

The structural dissimilarities among the different data ponds include:

  • The raw data entering each pond is very different from the raw data entering the other ponds. One pond contains analog data, another pond contains application data, and another pond contains textual data.
  • The transformation and conditioning process differs greatly from one pond to the next.
  • The type of business analysis conducted on the final data state differs from pond to pond.
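
The similarities and dissimilarities listed above can be sketched as a shared interface with pond-specific conditioning. This is only an illustration, not a prescribed design: the class names, the method names, and the conditioning rules (a made-up threshold of 100.0, lowercase tokenizing) are all assumptions introduced for the example.

```python
from abc import ABC, abstractmethod


class DataPond(ABC):
    """Hypothetical base class holding the steps every pond shares."""

    def __init__(self, name):
        self.name = name
        self.raw = []          # raw records as ingested
        self.final_state = []  # conditioned records, ready for analysis

    def ingest(self, record):
        # Every pond takes in raw data through a similar entry point.
        self.raw.append(record)

    @abstractmethod
    def condition(self, record):
        """Pond-specific transformation of one raw record."""

    def build_final_state(self):
        # Shared step: apply the pond's own conditioning to all raw data.
        self.final_state = [self.condition(r) for r in self.raw]
        return self.final_state

    def archive(self, archival_pond):
        # Every pond ultimately sends its data to the archival pond.
        archival_pond.extend(self.final_state)
        self.final_state = []


class AnalogPond(DataPond):
    def condition(self, record):
        # Placeholder rule: flag readings above an invented threshold.
        return {"reading": record, "out_of_range": record > 100.0}


class TextualPond(DataPond):
    def condition(self, record):
        # Placeholder rule: normalize text into lowercase tokens.
        return {"tokens": record.lower().split()}


analog = AnalogPond("analog")
analog.ingest(98.6)
analog.ingest(120.4)
print(analog.build_final_state())
```

The shared methods carry the similarities; only `condition` changes from pond to pond, which is where the sketch places the differences.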

Relational Format for Final State Data

An interesting question arises when looking at the different ponds. Does the technology holding the final state of the pond have to be in a relational format? Fig 9.1 poses this question.

Image250878.jpg

Fig 9.1 Requiring a relational format?

The answer is no. There is nothing special about the relational format other than the fact that the vast majority of analytical and visualization packages available operate against relational data. The world of analytical processing has been around for a long time, long before there were data ponds. It is no surprise that analytical processing supports the relational data model.

Beyond that, there is no reason why the final state data in the data pond must be in a relational format. If there is an analytical tool that operates on data in a format other than relational, then there is no reason why that analytical package cannot be used.

Technology Differences

A related question is whether the final data state of each pond has to be held in the same technology. It does not. The only reason an organization might want the technology to be the same is to avoid the overhead of supporting more than one platform.

Total Expected Volume of Data in the Data Pond

Another related and interesting question is: what total volume is expected in each data pond? The answer is that the total volume in any given data pond depends entirely upon the business goals and the nature of the data in the business. Different industries will hold different proportions of each type of data in their data ponds.

An engineering firm or a manufacturing organization is probably going to have lots and lots of analog data. A telephone company is going to have lots of application data. And a marketing research firm is going to have lots of textual data.

Moving Data From Pond to Pond

An interesting architectural question is: once the final state data has been created inside a data pond, can the data be moved to another pond and remain resident in the pond?

The answer is that it is certainly technologically feasible to move data from one data pond to the next and allow that data to remain resident in the source pond. But from an architectural standpoint, such a move rarely makes sense. Much of the data pond’s value lies in its supporting infrastructure. In addition to the data itself, the pond holds important infrastructure, such as:

  • Metadata definitions
  • Metaprocess definitions
  • Descriptor information.

It is one thing to shuffle data from pond to pond. It is quite another thing to move the infrastructure that supports the data from pond to pond as well. For these reasons, it normally does not make sense to move data outside of the source pond. Fig 9.2 addresses this issue.

Image250886.jpg

Fig 9.2 Avoiding data movement from one pond to another
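
The cost of moving data without its infrastructure can be shown with a small sketch. It assumes, purely for illustration, that a pond is modeled as a record list plus three dictionaries for metadata, metaprocess, and descriptor information. The `Pond` class, the function names, and the sample fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pond:
    """Hypothetical pond: data plus the infrastructure that explains it."""
    name: str
    data: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)     # field definitions
    metaprocess: dict = field(default_factory=dict)  # how fields were derived
    descriptors: dict = field(default_factory=dict)  # source and context


def move_data_only(source, target):
    # Copies the records but leaves the supporting infrastructure behind.
    target.data.extend(source.data)


def undocumented_fields(pond):
    # Field names that appear in the data but have no metadata definition.
    seen = set()
    for record in pond.data:
        seen.update(record)
    return sorted(seen - set(pond.metadata))


analog = Pond(
    "analog",
    data=[{"temp_c": 71.2, "sensor": "A7"}],
    metadata={"temp_c": "temperature in Celsius", "sensor": "sensor id"},
)
textual = Pond("textual", metadata={"doc_id": "document identifier"})

move_data_only(analog, textual)
print(undocumented_fields(textual))  # ['sensor', 'temp_c']
```

After the move, the target pond holds fields that nothing in its own metadata describes, which is the situation the text warns against.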

Doing Analysis From Multiple Ponds

Another interesting architectural issue is whether it is possible to do analytics based on data found in more than one pond. While this is possible, analytics are usually restricted to the data found in a single pond. This restriction has mostly to do with the type of data found in the pond and the type of analytics being conducted.

If the analysis requires data from more than one pond, then there is no reason why analytics cannot be done from more than one pond. Fig 9.3 shows analysis conducted from more than one pond; such analytics are a real possibility.

Image250896.jpg

Fig 9.3 Can analysis be done using data from more than one data pond? Yes!

Using Metadata to Relate Data From Different Ponds

When analyzing data from more than one data pond, it is necessary to relate the data in one pond to the data in the other. In some cases, this relationship is far from obvious. To facilitate the exchange of data across ponds, it is necessary to use the metadata infrastructure.

The metadata for each pond will describe the data in the pond. If it is possible at all to relate data from one pond to another, the relationship is first realized in the metadata. Fig 9.4 shows that when data from different ponds is related, the relationship begins at the metadata level.

Image250905.jpg

Fig 9.4 Synchronizing metadata is required to analyze across data ponds
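
As a minimal sketch of how a relationship might begin at the metadata level, the example below pairs fields from two ponds whose metadata share a business name and type, then joins the final-state rows on that pair. The metadata layout, the field names, and both functions are assumptions made for illustration; real metadata is rarely this tidy.

```python
# Hypothetical metadata for two ponds: field name -> definition.
application_metadata = {
    "customer_id": {"business_name": "customer identifier", "type": "int"},
    "order_total": {"business_name": "order amount", "type": "float"},
}
textual_metadata = {
    "cust_no": {"business_name": "customer identifier", "type": "int"},
    "sentiment": {"business_name": "call sentiment", "type": "str"},
}


def find_related_fields(meta_a, meta_b):
    """Pair fields across two ponds that share a business name and type."""
    pairs = []
    for name_a, def_a in meta_a.items():
        for name_b, def_b in meta_b.items():
            same_name = def_a["business_name"] == def_b["business_name"]
            same_type = def_a["type"] == def_b["type"]
            if same_name and same_type:
                pairs.append((name_a, name_b))
    return pairs


def join_ponds(rows_a, rows_b, key_a, key_b):
    """Join final-state rows from two ponds on the related fields."""
    index = {}
    for row in rows_b:
        index.setdefault(row[key_b], []).append(row)
    return [{**a, **b} for a in rows_a for b in index.get(a[key_a], [])]


orders = [
    {"customer_id": 1, "order_total": 250.0},
    {"customer_id": 2, "order_total": 75.5},
]
calls = [{"cust_no": 1, "sentiment": "negative"}]

key_a, key_b = find_related_fields(application_metadata, textual_metadata)[0]
joined = join_ponds(orders, calls, key_a, key_b)
print(joined)
```

The join is only attempted after the metadata has identified a matching pair of fields. If `find_related_fields` returns nothing, the two ponds cannot be related in this way.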

What if…?

Yet another interesting question is what if there is data that is not analog data, not application data, and not textual data that finds its way to the raw data pond? It is certainly possible to have data enter the raw pond that does not fit neatly into one of these three categories. If that is the case, what should be done with the data?

The answer is not to force the data into a data pond where it does not belong. That would be a mistake. Each pond's conditioning process and supporting infrastructure are built for its own kind of data, so data of another kind would not be served well there.

Instead, a good idea is to carve out a part of the raw data pond reserved for data that does not fit into one of the “standard” data ponds. This area can be called the miscellaneous data section of the raw data pond. Fig 9.5 shows the miscellaneous data section of the raw data pond.

Image250913.jpg

Fig 9.5 Carving out a miscellaneous data section

The miscellaneous section of the raw data pond can then be used to support business analytical processing, just like other data in the data lake. However, there is a note of caution. The data in the miscellaneous section of the raw data pond must be conditioned in order to support business analytical processing. Fig 9.6 shows the conditioning (transformation and integration) that must be done against data in the miscellaneous data section of the raw data pond.

Image250921.jpg

Fig 9.6 Conditioning must be done against the miscellaneous data section
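
The routing and conditioning rules described above can be sketched as follows. The example assumes each raw record carries a hypothetical `kind` label and a hypothetical `conditioned` flag; neither comes from the text, and a real lake would classify and track its data in its own way.

```python
# The three "standard" pond types named in the text.
KNOWN_PONDS = {"analog", "application", "textual"}


def route(record, raw_pond):
    """File a raw record under its pond type, or under 'miscellaneous'."""
    kind = record.get("kind")
    section = kind if kind in KNOWN_PONDS else "miscellaneous"
    raw_pond.setdefault(section, []).append(record)
    return section


def records_for_analysis(raw_pond, section):
    """Only conditioned records are offered to business analysis."""
    return [r for r in raw_pond.get(section, []) if r.get("conditioned")]


raw_pond = {}
print(route({"kind": "textual", "body": "call transcript"}, raw_pond))
print(route({"kind": "image", "body": "scan-001"}, raw_pond))

# Nothing in the miscellaneous section is usable until it is conditioned.
print(records_for_analysis(raw_pond, "miscellaneous"))

# Conditioning is stubbed here by setting the flag on each record.
for record in raw_pond["miscellaneous"]:
    record["conditioned"] = True
print(len(records_for_analysis(raw_pond, "miscellaneous")))
```

Data that matches none of the standard ponds is kept apart rather than forced into one of them, and it becomes available for analysis only once conditioning has been done.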

In Summary

The data lake can be divided up into separate data ponds. Each data pond has its own data and its own characteristics. Viewed as a whole, the data lake and its subdivision into data ponds are shown in Fig 9.7.

Image250930.jpg

Fig 9.7 Understanding the data pond landscape

Each data pond services its own kind of data and has its own unique analysis that can be performed on data in the pond. In addition, if data enters the lake that does not fit into the analog, application, or textual data ponds, then the data can be stored in the miscellaneous data section of the raw data pond.
