Chapter 5
Generic Structure of the Data Pond

Each of the data ponds (other than the raw data pond) has some common components:

  • Pond descriptor. The pond descriptor contains a description of the external contents and manifestation of the pond, and of where the data in the pond originated.
  • Pond target. The pond target is a description of the relationship between the business of the corporation and the data inside the pond.
  • Pond data. The data in the pond is merely the physical data that resides inside the pond.
  • Pond metadata. The metadata describes the physical characteristics of the data contained in the data pond.
  • Pond metaprocess. Metaprocess information is information about the transformation / conditioning of the data inside the data pond. In order to be useful, data in the pond must undergo a transformation / conditioning process.
  • Pond transformation criteria. Pond transformation criteria are documentation of how the transformation / conditioning of data inside the pond should occur.

Pond Descriptor

The pond descriptor has information such as:

  • Frequency of update or refreshment. The frequency of update or refreshment refers to the cycle on which data is sent to the data pond and/or the refreshment cycle of data outside the pond. The movement of data can be regularly scheduled, or update / refreshment can occur on an as-needed basis.
  • Source description. The source description describes the lineage of the data in the data pond. In many cases, the lineage of data will pass through more than one source. This lineage information is useful to the analyst in determining the fitness of data in the data pond for analysis.
  • Volume of data. The volume of data is a general description of how much data is in the data pond. Data is measured both in terms of number of records and in number of bytes. The volume of data greatly influences the type and depth of analysis that can be done.
  • Selection criteria. The selection criteria are a description of the criteria that were used to select the data for inclusion in the data pond. The selection criteria of data are important to the analyst in determining what data is in the pond and why it is there.
  • Summarization criteria. Most of the time, data is summarized or otherwise processed as it passes into the data pond. The summarization criteria are a description of the algorithms employed. In some cases, data is transformed in some way other than summarization; the criteria then describe the algorithmic processing used in shaping the data in the data pond. The summarization criteria are useful to the analyst in determining how to do analysis.
  • Organization criteria. Once the data is placed in the data pond, it is usually organized along the lines of the target of the pond. The target of the pond is similar to the data model of the business. The organization of data can be rigorous or casual, but in any case there is a description of exactly how the pond is organized. The description of the data organization is useful to the business analyst trying to make sense of the data pond.
  • Data relationships. There normally are many data relationships among the data found in the pond. This is a description of those relationships. The data relationships are useful to the business analyst when it comes time to do business analysis.
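The descriptor elements listed above can be pictured as a single record kept alongside the pond. The following is only a minimal sketch: the class name, field names, and sample values are hypothetical and chosen for illustration, not taken from any particular product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PondDescriptor:
    """Hypothetical container for the descriptor elements of one data pond."""
    update_frequency: str              # e.g. "nightly" or "as needed"
    source_lineage: List[str]          # ordered list of sources the data passed through
    record_count: int                  # volume measured in records
    byte_count: int                    # volume measured in bytes
    selection_criteria: str            # why this data was chosen for the pond
    summarization_criteria: str        # algorithms applied on the way in
    organization_criteria: str         # how the pond is organized around its target
    data_relationships: List[str] = field(default_factory=list)


# Illustrative values only.
descriptor = PondDescriptor(
    update_frequency="nightly",
    source_lineage=["order entry system", "raw data pond"],
    record_count=1_200_000,
    byte_count=850_000_000,
    selection_criteria="orders placed in the current fiscal year",
    summarization_criteria="daily totals per customer",
    organization_criteria="organized by customer, then by order date",
    data_relationships=["customer places order", "order contains line items"],
)
```

An analyst could consult such a record before touching the pond data, for example to check how many sources the data passed through or how large the pond is.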

Pond Target

The pond target is the basic model that is used to shape the data in the data pond. The pond target can be as formal as a data model or can be as informal as a general description of the data found in the data pond. Typical pond target elements include such things as customer profile, sales record, shipment record, patient record, part number, inventory, SKU, telephone call record, click stream activity, delivery information, insurance claim, professor name, class name, class schedule, flight schedule, flight manifest, passenger record, reservation record, and so forth.

The pond target is the means by which a business relationship is made to the data in the data pond. Of necessity, there will be a business relationship between the elements found in the target and the business itself. For this reason, the pond target is invaluable to the business analyst in planning how to conduct an analysis.

Pond Data

The pond data is the physical manifestation of the data itself as it resides in the pond. The data can be organized in many ways depending on the storage mechanism for the data pond. In the world of Big Data, it is customary for the information to be stored in a “schema on read” manner. In this system, the data is initially stored in a block of data. Then, when a query is made against the data, the system reads the block of data and determines the schema inside the block.

By organizing data in this manner, very large amounts of data can be stored efficiently. However, storing the data in a “schema on read” manner can impose significant overhead on the system when the data is retrieved and analyzed. Under a “schema on read” organization, every time data is accessed, all the data in the pond must be read.
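The trade-off can be illustrated with a small sketch. It assumes, purely for illustration, that the block holds one JSON record per line; the block contents and the function names are hypothetical. Nothing about the schema is declared when the block is written, so every query parses the whole block again.

```python
import json

# Hypothetical raw block: records are stored as-is, with no schema declared up front.
RAW_BLOCK = "\n".join([
    '{"customer": "A17", "amount": 120.5}',
    '{"customer": "B02", "amount": 80, "region": "west"}',
    '{"customer": "A17", "amount": 42.0}',
])


def read_block(block):
    """Parse every record in the block and infer the schema at read time."""
    records = [json.loads(line) for line in block.splitlines() if line.strip()]
    schema = {}
    for record in records:
        for field_name, value in record.items():
            # Collect the value types seen for each field across all records.
            schema.setdefault(field_name, set()).add(type(value).__name__)
    return records, schema


def total_for_customer(block, customer):
    """Each query re-reads and re-parses the entire block before filtering."""
    records, _ = read_block(block)
    return sum(r["amount"] for r in records if r["customer"] == customer)
```

Storing the block costs almost nothing, but `total_for_customer` pays the full parsing cost on every call, which is the overhead described above.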

Pond Metadata

An important component of the data pond is the metadata that describes the physical characteristics of the data residing in the pond. The metadata is dependent on the data that exists outside the pond and the physical organization of the pond itself. If the data is stored in a standard DBMS outside the pond, many (or all) of those characteristics will be carried inside. In this case, the analyst can expect to find the same records, attributes, keys, and indexes.

But if the data is stored in document form outside the data pond, then the analyst can expect to find the data organized document by document. Even in the case of data stored in a “schema on read” system, metadata is still needed. No matter how the data is physically organized inside the pond, it will be described by metadata. Without the metadata descriptions, the analyst would have a hard time figuring out how to read and analyze the data pond. Fig 5.1 shows that metadata about the data in the data pond is contained inside the data pond itself.

Fig 5.1 Storing the metadata about the data in the data pond

Pond Metaprocess

The metaprocess description of the transformation that takes place inside the data pond is found in the pond itself. Data enters the data pond in a raw state. Data is then “conditioned” or transformed into a form and structure that makes the data useful and intelligible to the analyst.

It is noteworthy that the conditioning process for each data pond is quite different from the conditioning process for other data ponds. The analog data pond has its own conditioning process, which is quite different from the conditioning process for the application data pond or the textual data pond.

Metaprocess information may describe processing that has occurred outside the data pond as well. On occasion, significant business processing has occurred long before the data arrives at the data pond. It is entirely possible that metaprocess information can be gathered and stored when processing data. The metaprocess information describes the conditioning process that is necessary for each data pond, as seen in Fig 5.2.

Fig 5.2 Performing the conditioning processing for each data pond

Pond Transformation Criteria

The transformation criteria are a description of the criteria used in the transformation process for the conditioning of data within the data pond. Each of the data ponds has its own unique transformation criteria. The analog data pond may have a statement of the threshold for measurements. There may be a criterion that says: “If the length is greater than 45 cm then capture the record, else do not capture the record.” Or there may be a criterion that says: “Catch all measurements of a certain machine for the month of May.”
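The two analog criteria quoted above might be written down as simple predicates. This is a sketch only: the record layout (`length_cm`, `machine_id`, `taken_on`) and the function names are assumptions made for illustration.

```python
from datetime import date

LENGTH_THRESHOLD_CM = 45.0  # threshold taken from the example criterion in the text


def capture_by_length(record):
    """Capture the record only if its length is greater than 45 cm."""
    return record["length_cm"] > LENGTH_THRESHOLD_CM


def capture_by_machine_and_month(record, machine_id, month):
    """Capture all measurements of one machine taken in a given month (May = 5)."""
    return record["machine_id"] == machine_id and record["taken_on"].month == month


# Hypothetical measurements.
measurements = [
    {"machine_id": "M-1", "length_cm": 47.2, "taken_on": date(2016, 5, 3)},
    {"machine_id": "M-1", "length_cm": 44.9, "taken_on": date(2016, 5, 9)},
    {"machine_id": "M-2", "length_cm": 51.0, "taken_on": date(2016, 6, 1)},
]

long_records = [m for m in measurements if capture_by_length(m)]
may_records = [m for m in measurements if capture_by_machine_and_month(m, "M-1", 5)]
```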

In the application data pond, there might be criteria that look like: “If gender = 0 then convert gender to female. If gender = 1 then convert gender to male. If gender = x then convert gender to female. If gender = y then convert gender to male, and so forth.” Or there might be a criterion that says: “If measurement is made in inches, then convert to centimeters.”
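As an illustration, the application criteria quoted above could be expressed as a lookup table and a unit conversion. The record fields (`gender`, `length`, `unit`) and the handling of unrecognized codes are assumptions of this sketch, not part of the criteria themselves.

```python
# Code-to-value mapping taken from the example criteria in the text.
GENDER_CODES = {"0": "female", "1": "male", "x": "female", "y": "male"}
CM_PER_INCH = 2.54


def condition_record(record):
    """Return a conditioned copy of an application record."""
    out = dict(record)
    code = str(record["gender"]).strip().lower()
    # Assumption: codes that are not in the table are left unchanged.
    out["gender"] = GENDER_CODES.get(code, record["gender"])
    # Assumption: the unit of measure is recorded in a "unit" field.
    if record.get("unit") == "in":
        out["length"] = round(record["length"] * CM_PER_INCH, 2)
        out["unit"] = "cm"
    return out


conditioned = condition_record({"gender": 0, "length": 10, "unit": "in"})
```

Keeping the mapping in a table rather than in a chain of if statements makes the criteria easier to document and to extend when a new source system uses yet another encoding.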

In the textual data pond, there might be transformation criteria such as: “If word = Honda then add car to classification. If word = Porsche then add car to classification. If word = Ford then add car to classification. If word = Volkswagen then add car to classification.” Or there might be a criterion that says: “If word = elm then type = tree. If word = oleander then type = bush.”
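The textual criteria quoted above amount to looking up each word in a classification table. A minimal sketch follows; the table holds only the words from the example, and the tokenization (splitting on whitespace and stripping punctuation) is a simplifying assumption.

```python
# Word-to-classification table built from the example criteria in the text.
CLASSIFICATION = {
    "honda": "car",
    "porsche": "car",
    "ford": "car",
    "volkswagen": "car",
    "elm": "tree",
    "oleander": "bush",
}


def classify_words(text):
    """Return (word, classification) pairs for every recognized word in the text."""
    tags = []
    for raw in text.split():
        # Assumption: simple punctuation stripping and lowercasing are enough here.
        word = raw.strip(".,;:!?\"'()").lower()
        if word in CLASSIFICATION:
            tags.append((word, CLASSIFICATION[word]))
    return tags


tags = classify_words("The Honda was parked under an elm.")
```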

The transformation criteria are where the analyst goes to determine exactly how transformations have been accomplished. Fig 5.3 depicts the transformation criteria for each data pond.

Fig 5.3 Determining the transformation criteria for each data pond

In Summary

Each data pond (other than the raw data pond) contains the following components:

  • Pond descriptor
  • Pond target
  • Pond data
  • Pond metadata
  • Pond metaprocess
  • Pond transformation criteria