Chapter 15
Archiving Data Ponds

An essential part of the data lake/data pond architecture is the archival data pond. The archival data pond is fed by data from the analog data pond, the application data pond and the textual data pond. Fig 15.1 shows the archival data pond.

Fig 15.1 Understanding the archival data pond

The archival data pond holds data whose probable useful life has diminished. The pond serves two purposes:

  • To provide a place to store data that might have some future use.
  • To allow data that is no longer useful for current analysis to be removed from the other data ponds so that analysis in those ponds can proceed efficiently.

Criteria for Removal

There are several criteria for the removal of data from the analog, application and textual data ponds. Some critical ones are:

  • The aging of data.
  • A declining probability of usage.
  • The need to store data because of litigation.
  • The need to store data because of its criticality, regardless of the probability of access.
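The criteria above can be sketched as a simple check on each record. This is a minimal illustration, not a prescribed method: the `PondRecord` fields, the `archive_reasons` function, and both thresholds are hypothetical names and values chosen for the example, and the sketch assumes that any one criterion is enough to qualify a record for the archival data pond.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class PondRecord:
    """Hypothetical shape of a record in an analog, application or textual data pond."""
    record_id: str
    created: date
    access_probability: float  # estimated chance of future access, 0.0 to 1.0
    litigation_hold: bool = False
    critical: bool = False


# Assumed thresholds for illustration only; real values depend on the organization.
MAX_AGE = timedelta(days=5 * 365)
MIN_ACCESS_PROBABILITY = 0.05


def archive_reasons(record, today):
    """Return the criteria (possibly none) under which a record qualifies for archiving."""
    reasons = []
    if today - record.created > MAX_AGE:
        reasons.append("age")
    if record.access_probability < MIN_ACCESS_PROBABILITY:
        reasons.append("low probability of usage")
    if record.litigation_hold:
        reasons.append("litigation")
    if record.critical:
        reasons.append("criticality")
    return reasons
```

A record with an empty list of reasons stays in its original pond. Returning the reasons, rather than a bare yes or no, lets the archiving process record why each record was moved.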

Structural Alteration

As data is moved from the data ponds to the archival data pond, the structure of the data changes. Data in the archival data pond has both metadata and metaprocess information attached directly to the raw data. This attachment ensures that when future analysts search the archival data, the metadata and metaprocess information is not lost. Fig 15.2 shows the restructuring that occurs as data is moved into the archival data pond.

Fig 15.2 Restructuring as data is moved into the archival data pond
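One way to picture this restructuring is a single self-describing record in which the raw data travels together with its metadata and metaprocess information. The sketch below assumes a JSON representation; the function names, field names and sample values are hypothetical and are not taken from the chapter.

```python
import json


def to_archival_record(raw, metadata, metaprocess):
    """Bundle raw data with its metadata and metaprocess information in one record."""
    return {
        "metadata": dict(metadata),        # what the data means
        "metaprocess": dict(metaprocess),  # how the data was conditioned
        "raw": raw,                        # the data itself
    }


def serialize(record):
    """Write the whole bundle as one line so the three parts stay physically adjacent."""
    return json.dumps(record, sort_keys=True)


def deserialize(line):
    """Recover the data together with its description from an archived line."""
    return json.loads(line)


# Hypothetical sample values, for illustration only.
record = to_archival_record(
    raw={"meter_id": "M-17", "reading": 42.5},
    metadata={"source_pond": "analog", "fields": {"meter_id": "string", "reading": "kWh"}},
    metaprocess={"steps": ["deduplicated", "converted units to kWh"]},
)
line = serialize(record)
```

Because each archived line carries its own description, an analyst who reads it later does not depend on an external catalog that may no longer exist.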

Once data is placed in the archival data pond, it is often useful to index the data independently so that future analysts will be able to find data efficiently.

Creating Independent Indexes for Archival Data

Fig 15.3 shows the indexing of data in the archival data pond.

Fig 15.3 Indexing of data in the archival data pond
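An independent index can be sketched as a separate structure that maps a value to the positions of the archived records that contain it. The example below is an assumption-laden illustration: it supposes each archived record is a dictionary with a `metadata` section, and `build_index` and `lookup` are hypothetical helper names.

```python
from collections import defaultdict


def build_index(archive, field):
    """Map each value of a metadata field to the positions of the records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(archive):
        value = record["metadata"].get(field)
        if value is not None:
            index[value].append(position)
    return dict(index)


def lookup(archive, index, value):
    """Fetch matching records through the index instead of scanning the whole archive."""
    return [archive[position] for position in index.get(value, [])]


# Hypothetical archived records, for illustration only.
archive = [
    {"metadata": {"source_pond": "analog"}, "raw": {"reading": 42.5}},
    {"metadata": {"source_pond": "textual"}, "raw": {"text": "contract renewal"}},
    {"metadata": {"source_pond": "analog"}, "raw": {"reading": 17.0}},
]
index = build_index(archive, "source_pond")
analog_records = lookup(archive, index, "analog")
```

The index holds only positions, not copies of the data, so it can be stored, rebuilt or discarded without touching the archived records themselves.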

In Summary

The archival data pond receives data from the other data ponds when the data in those ponds has a very low probability of usage. Data in the archival data pond is held indefinitely. Data is restructured as it enters the archival data pond so that metadata and metaprocess information are placed physically adjacent to the data itself. On occasion, separate and independent indexes of the data are created and stored in the archival data pond.
