An essential part of the data lake/data pond architecture is the archival data pond. The archival data pond is fed with data from the analog data pond, the application data pond and the textual data pond. Fig 15.1 shows the archival data pond.
Fig 15.1 Understanding the archival data pond
The archival data pond is used to hold data whose probable useful life has diminished. The purpose of this pond is:
Criteria for Removal
There are several criteria for the removal of data from the analog, application and textual data ponds. Some critical ones are:
Structural Alteration
As data is moved from the other data ponds to the archival data pond, a structural change to the data occurs. Data in the archival data pond has both metadata and metaprocess information attached directly to the raw data. This attachment ensures that the metadata and metaprocess information is not lost when future analysts search through the archival data. Fig 15.2 shows the restructuring that occurs as data is moved into the archival data pond.
Fig 15.2 Restructuring as data is moved into the archival data pond
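The restructuring described above can be illustrated with a minimal sketch. It assumes a simple dictionary-based record and JSON as the storage format. The names ArchivalRecord and restructure_for_archive, the field names and the sample values are all hypothetical and are not taken from the text. The point is only that the raw data, its metadata and its metaprocess information travel together as one self-describing unit.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any


@dataclass
class ArchivalRecord:
    """Hypothetical archival unit: raw data plus its descriptive context."""
    raw: dict[str, Any]          # the data as it existed in the source pond
    metadata: dict[str, Any]     # what each field means
    metaprocess: dict[str, Any]  # how the data was produced or transformed


def restructure_for_archive(raw, metadata, metaprocess) -> str:
    """Bundle raw data with its metadata and metaprocess information.

    Returns one JSON document so that the context is stored physically
    alongside the data rather than in a separate catalog.
    """
    record = ArchivalRecord(
        raw=dict(raw),
        metadata=dict(metadata),
        metaprocess=dict(metaprocess),
    )
    return json.dumps(asdict(record), sort_keys=True)


# Illustrative values only.
archived = restructure_for_archive(
    raw={"acct": "A-1001", "amt": 250.0},
    metadata={"acct": "account identifier", "amt": "transaction amount"},
    metaprocess={"source_pond": "application", "steps": ["deduplicated"]},
)
restored = json.loads(archived)
```

Because each archived document carries its own description, an analyst who reads a record years later does not need access to the original pond's catalog to interpret it.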
Once data is placed in the archival data pond, it is often useful to index the data independently so that future analysts will be able to find data efficiently.
Creating Independent Indexes for Archival Data
Fig 15.3 shows the indexing of data in the archival data pond.
Fig 15.3 Indexing of data in the archival data pond
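One way to picture an independent index is as a lookup structure kept apart from the archived records themselves. The sketch below assumes archived records are held in a list and that each record keeps its raw fields under a "raw" key. The function build_index, the field names and the sample records are hypothetical. It builds a simple inverted index from a field value to the positions of the records that contain it.

```python
from collections import defaultdict


def build_index(records, field):
    """Map each value of `field` to the positions of records holding it.

    The index is a separate structure; the archived records are not changed.
    """
    index = defaultdict(list)
    for position, record in enumerate(records):
        value = record["raw"].get(field)
        if value is not None:
            index[value].append(position)
    return dict(index)


# Illustrative archive contents only.
archive = [
    {"raw": {"acct": "A-1001", "amt": 250.0}, "metadata": {}, "metaprocess": {}},
    {"raw": {"acct": "A-2002", "amt": 75.5}, "metadata": {}, "metaprocess": {}},
    {"raw": {"acct": "A-1001", "amt": 12.0}, "metadata": {}, "metaprocess": {}},
]

acct_index = build_index(archive, "acct")

# Look up records by account without scanning the whole archive.
matches = [archive[i] for i in acct_index.get("A-1001", [])]
```

Keeping the index separate means it can be rebuilt, or additional indexes on other fields can be added, without rewriting the archived data.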
In Summary
The archival data pond receives data from the other data ponds when the data in those ponds has a very low probability of usage. Data in the archival data pond is held indefinitely. Data is restructured as it enters the archival data pond so that metadata and metaprocess information is placed physically adjacent to the actual data itself. On occasion, separate and independent indexes of the data are created and stored in the archival data pond.