
Chapter 13
Additional Topics

Documentation is part of any computerized system, and it is especially important for the data lake/data pond environment. Without documentation, the analyst trying to use the data lake/data pond environment will not be successful.

High System Level Documentation

There are at least two levels of documentation that are necessary for the data lake/data pond environment. The first is the high system level. At the high system level there is documentation about:

  • How data enters the data lake and/or data pond
  • How data flows from one data pond to the next
  • How data flows into the archival data pond environment.

The high system level documentation for the data pond then shows the business analyst the general flow of data within the data lake/data pond environment.

Detailed Data Pond Level Documentation

The second level of necessary documentation is documentation at the detailed data pond level. The type of documentation that is needed here covers:

  • Metadata of the data found in the data pond
  • Metaprocess information about the activities taking place in the data pond
  • Transformation documentation
  • An architectural description of the flow of data within the data pond
  • The criteria for selection for entry into the data pond
  • The criteria for exit out of the data pond.

Once business analysts find the general place where their data is, they need detailed information about how to access and manipulate that data accurately. The detailed data pond level documentation provides this information.
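The items listed above can be captured in a simple record per data pond. The following is a minimal sketch, not a prescribed format: the class name `PondDocumentation`, its field names, and the helper `missing_sections` are hypothetical and were chosen only to mirror the list of documentation items.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PondDocumentation:
    """Detailed documentation for one data pond (hypothetical structure)."""
    pond_name: str
    metadata: dict = field(default_factory=dict)         # descriptions of the data in the pond
    metaprocess: list = field(default_factory=list)      # activities taking place in the pond
    transformations: list = field(default_factory=list)  # transformation documentation
    data_flow: str = ""                                  # architectural description of the flow
    entry_criteria: list = field(default_factory=list)   # criteria for entry into the pond
    exit_criteria: list = field(default_factory=list)    # criteria for exit out of the pond


def missing_sections(doc: PondDocumentation) -> list:
    """Return the names of the documentation sections that are still empty."""
    return [
        f.name
        for f in fields(doc)
        if f.name != "pond_name" and not getattr(doc, f.name)
    ]


# Illustrative, partially documented pond; the contents are made up.
textual = PondDocumentation(
    pond_name="textual data pond",
    metadata={"document_id": "identifier of the source document"},
    entry_criteria=["data is textual"],
)

print(missing_sections(textual))
```

A check like this gives a quick view of which parts of a pond's documentation still need to be written before analysts can rely on it.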

What Data Flows Into the Data Lake/Data Pond?

In Fig 13.1, there is the familiar corporate information factory, where the application/operational systems, the data warehouse and data marts, and other structures of data are found. But there is a host of other data in the corporation that is not found in the corporate information factory. There is also external data. There is analog data. There is security data. There is textual data, and so forth.

Image251257.jpg

Fig 13.1 Expanding the corporate information factory

Fig 13.2 shows that two sources of data, the corporate information factory and the other data in the corporation, feed the data lake/data pond environment.

Image251266.jpg

Fig 13.2 Focusing in on the data relationships

Where Does Analysis Occur?

Looking at the diagram in Fig 13.2, an interesting question arises: where do different kinds of analysis occur? The organization conducts all sorts of analysis. Some is online and in real time. Some is analysis of corporate historical data. Some is KPI analysis or analysis of textual information.

So it is instructive to ask what kind of analysis occurs where. Fig 13.3 shows that online, real time analysis takes place in the applications. Activities such as bank transactions, airline reservations, manufacturing control activities, shipment recording and so forth occur here. The activity is online and real time, occurring in a matter of seconds. Typically, though, only a very small amount of data is accessed. Update and insert processing usually occur here.

Image251274.jpg

Fig 13.3 Analyzing online real time

The corporate analytical location is the data warehouse, as seen in Fig 13.4. Data from different applications is integrated into the data warehouse. Typically, 3 to 5 years’ worth of history is stored here. The analytical processing that occurs here typically takes anywhere from 5 minutes to 24 hours. In order to get into the data warehouse, data passes through ETL (extract/transform/load) processing. As data passes from the application environment to the data warehouse through ETL processing, it is transformed from an application state to a corporate state.

Image251284.jpg

Fig 13.4 Sourcing data for analysis from the data warehouse
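The move from an application state to a corporate state can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the two source applications, their field names, the date formats, and the currency units are invented, and real ETL processing is considerably more involved.

```python
from datetime import datetime

# Hypothetical records from two applications that describe the same kind of
# sale but use different field names, date formats, and currency units.
billing_rows = [{"cust": "C-17", "sale_dt": "03/15/2016", "amt_cents": 12550}]
orders_rows = [{"customer_id": "C-17", "date": "2016-03-16", "amount_usd": 80.0}]


def billing_to_corporate(row: dict) -> dict:
    """Transform a billing-application record into the corporate layout."""
    return {
        "customer_id": row["cust"],
        "sale_date": datetime.strptime(row["sale_dt"], "%m/%d/%Y").date().isoformat(),
        "amount_usd": row["amt_cents"] / 100,
    }


def orders_to_corporate(row: dict) -> dict:
    """Transform an orders-application record into the corporate layout."""
    return {
        "customer_id": row["customer_id"],
        "sale_date": row["date"],  # already in ISO format in this example
        "amount_usd": row["amount_usd"],
    }


def run_etl() -> list:
    """Extract from both applications, transform, and load into one list."""
    warehouse = []  # stands in for the data warehouse table
    warehouse.extend(billing_to_corporate(r) for r in billing_rows)
    warehouse.extend(orders_to_corporate(r) for r in orders_rows)
    return warehouse


warehouse = run_etl()
for record in warehouse:
    print(record)
```

After the transformation, records from both applications share one set of field names, one date format, and one currency unit, which is what allows them to be analyzed together.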

Surrounding the data warehouse are data marts. Data marts are where KPI analysis occurs, typically on a departmental basis. Marketing, sales, finance and so forth all have their own KPIs. Fig 13.5 depicts the data mart processing and analysis found in the corporate information factory.

Image251292.jpg

Fig 13.5 Sourcing data for analysis from the corporate information factory

Various other kinds of processing and analysis occur outside the corporate information factory. Most often, the processing is very detailed and immediate. Examples include the reading of meters, the control of manufacturing devices, and the electronic eye reading of vehicles passing a control point. Fig 13.6 shows the kind of processing and analysis that occurs outside the corporate information factory.

Image251301.jpg

Fig 13.6 Processing and analyzing data outside the corporate information factory

And finally, there is the analytical processing that occurs in the data lake/data pond environment. The most common forms of analytical processing on the data found in the data lake/data pond environment are pattern discovery and deep historical analysis.

In the textual data pond, sentiment analysis occurs as well. Fig 13.7 shows the analytical processing that occurs in the data lake/data pond environment.

There are, then, many different kinds of analytical activities occurring across the information landscape of the corporation. Analysis in one place is usually quite different from the analysis elsewhere.

Image251309.jpg

Image251318.jpg

Fig 13.7 Analyzing data by applying various processing techniques
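The pairing of analysis types and locations described in this section can be summarized as a lookup table. This is only a sketch: the short labels used as keys and the function name `where_to_analyze` are shorthand invented here, not terms defined in the chapter.

```python
# Kind of analysis -> place in the corporate information landscape.
ANALYSIS_LOCATIONS = {
    "online real-time": "applications",
    "corporate historical": "data warehouse",
    "KPI": "data marts",
    "detailed immediate": "outside the corporate information factory",
    "pattern discovery": "data lake/data pond",
    "deep historical": "data lake/data pond",
    "sentiment": "textual data pond",
}


def where_to_analyze(kind: str) -> str:
    """Return the usual location for a kind of analysis, or 'unknown'."""
    return ANALYSIS_LOCATIONS.get(kind, "unknown")


print(where_to_analyze("KPI"))
print(where_to_analyze("sentiment"))
```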

The Age of Data

Another interesting question is, what is the age of data in the data lake/data pond environment? The answer is that data of any age can be found in the data lake/data pond environment.

Normally, data that is very fresh – seconds old – is found in the operational environment. Data that is from one year to five years old is found in the data warehouse/data mart environment. And data that is of any age is found in the data lake/data pond environment.
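These age ranges can be expressed as a small function. The sketch below rests on stated assumptions: "seconds old" is taken to mean under one minute, a year is approximated as 365 days, and the function name `typical_locations` is hypothetical.

```python
from datetime import timedelta

FRESH_LIMIT = timedelta(minutes=1)       # assumption: "seconds old" means under a minute
WAREHOUSE_MIN = timedelta(days=365)      # one year, approximated as 365 days
WAREHOUSE_MAX = timedelta(days=5 * 365)  # five years


def typical_locations(age: timedelta) -> list:
    """Return the environments where data of the given age is normally found."""
    locations = []
    if age < FRESH_LIMIT:
        locations.append("operational environment")
    if WAREHOUSE_MIN <= age <= WAREHOUSE_MAX:
        locations.append("data warehouse/data mart")
    # Data of any age can be found in the data lake/data pond environment.
    locations.append("data lake/data pond")
    return locations


print(typical_locations(timedelta(seconds=5)))
print(typical_locations(timedelta(days=730)))
print(typical_locations(timedelta(days=3650)))
```

The data lake/data pond entry is always present, which reflects the point that it is the one environment not bounded by the age of the data.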

The data lake is the original long-term carrier of data.

On occasion, information is kept simply because it is cheaper to store the data than it is to ever have to recreate the data again. The theory is that if the data was important enough to be captured electronically in the first place, then the data is important enough to never have to be recreated again. There may be no foreseeable need for the data but the data is kept in any case.

Statutory requirements are another reason to keep data for lengthy periods of time. Some data must be kept forever because of legal mandate. The data lake/data pond environment is a good place to store that data.

Security of Data

Data in the data lake/data pond environment needs security, just like the other parts of the data processing environment. However, the security criticality of the data lake/data pond environment is somewhat less than the security criticality of the other parts of the data processing environment. That is because of the timeliness of the data. Data in the data lake/data pond environment is likely to be much older than the data found elsewhere in the data processing environment.

In Summary

Documentation is an important part of the data lake/data pond environment. Two levels of documentation are required: high system level documentation and detailed data pond level documentation.

Data flows into the data lake/data pond from two basic places – the corporate information factory and other data.

Different kinds of analysis occur in different locations. Online analysis takes place in the online operational systems. Corporate data analysis occurs in the data warehouse. KPI analysis occurs in the data mart. Limited immediate analysis is conducted in the miscellaneous data found elsewhere.

The data lake/data pond supports different kinds of analysis.

Data of any age can be found in the data lake/data pond environment, and some of it is kept for a very long time.

The data lake/data pond environment requires security, but not the stringent level of security that is found elsewhere in the data processing environment.
