Each of the data ponds (other than the raw data pond) has some common components:
Pond Descriptor
The pond descriptor has information such as:
Pond Target
The pond target is the basic model that is used to shape the data in the data pond. The pond target can be as formal as a data model or can be as informal as a general description of the data found in the data pond. Typical pond target elements include such things as customer profile, sales record, shipment record, patient record, part number, inventory, SKU, telephone call record, click stream activity, delivery information, insurance claim, professor name, class name, class schedule, flight schedule, flight manifest, passenger record, reservation record, and so forth.
The pond target is the means by which a business relationship is made to the data in the data pond. The pond target is invaluable to the business analyst in planning how to conduct an analysis. There will then be, of necessity, a business relationship between the elements found in the target and the business itself.
Pond Data
The pond data is the physical manifestation of the data itself as it resides in the pond. The data can be organized in many ways depending on the storage mechanism for the data pond. In the world of Big Data, it is customary for the information to be stored in a “schema on read” manner. In this approach, the data is initially stored as a block of data. Then, when a query is made against the data, the system reads the block and determines the schema inside it.
By organizing data in this manner, very large amounts of data can be stored efficiently. However, storing the data in a “schema on read” manner means that retrieval and analysis can impose significant overhead on the system. Every time data is accessed in a “schema on read” organization, all the data in the pond must be read.
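As a rough illustration of the “schema on read” idea, the following Python sketch stores records as an undeclared block of text and works out the schema only when the block is read. The block contents, the field names, and the function read_with_schema are all hypothetical; real Big Data platforms implement this very differently and at far larger scale.

```python
import json

# Hypothetical raw block: records are stored as-is, with no schema declared up front.
RAW_BLOCK = "\n".join([
    '{"customer": "A100", "amount": 25.5}',
    '{"customer": "B200", "amount": 10, "region": "west"}',
])


def read_with_schema(block):
    """Parse every record in the block and infer the schema at read time."""
    records = [json.loads(line) for line in block.splitlines()]
    schema = {}
    for record in records:
        for field, value in record.items():
            # Collect every value type seen for each field name.
            schema.setdefault(field, set()).add(type(value).__name__)
    return records, schema


# The whole block is parsed on every call, which is the overhead described above.
records, schema = read_with_schema(RAW_BLOCK)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

Because nothing about the structure is recorded at write time, the function has to parse the entire block on each read. That is the trade-off the text describes: cheap storage in exchange for more work at query time.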
Pond Metadata
An important component of the data pond is the metadata that describes the physical characteristics of the data residing in the pond. The metadata is dependent on the data that exists outside the pond and the physical organization of the pond itself. If the data is stored in a standard DBMS outside the pond, many (or all) of those characteristics will be carried inside. In this case, the analyst can expect to find the same records, attributes, keys, and indexes.
But if the data is stored in document form outside the data pond, then the analyst can expect to find the data organized document by document. Even in the case of data stored in a “schema on read” system, metadata is still needed. However the data is physically organized inside the pond, it will be described by metadata. Without the metadata descriptions, the analyst would have a hard time figuring out how to read and analyze the data in the pond. Fig 5.1 shows that metadata about the data in the data pond is contained inside the data pond itself.
Fig 5.1 Storing the metadata about the data in the data pond
Pond Metaprocess
The metaprocess description of the transformation that takes place inside the data pond is found in the pond itself. Data enters the data pond in a raw state. Data is then “conditioned” or transformed into a form and structure that makes the data useful and intelligible to the analyst.
It is noteworthy that the conditioning process for each data pond is quite different from the conditioning process for the other data ponds. The analog data pond has its own conditioning process, which is quite different from the conditioning process for the application data pond or the textual data pond.
Metaprocess information may describe processing that has occurred outside the data pond as well. On occasion, significant business processing has occurred long before the data arrives at the data pond. It is entirely possible that metaprocess information can be gathered and stored when processing data. The metaprocess information describes the conditioning process that is necessary for each data pond, as seen in Fig 5.2.
Fig 5.2 Performing the conditioning processing for each data pond
Pond Transformation Criteria
The transformation criteria describe the rules used in the transformation process that conditions data within the data pond. Each of the data ponds has its own unique transformation criteria. The analog data pond may have a statement of the threshold for measurements. There may be a criterion that says: “If the length is greater than 45 cm then capture the record, else do not capture the record.” Or there may be a criterion that says: “Catch all measurements of a certain machine for the month of May.”
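The two analog criteria quoted above could be expressed as simple filter functions. This is only a sketch: the record layout (machine_id, length_cm, measured_on), the function names, and the sample measurements are invented for illustration and do not come from any particular system.

```python
from datetime import date


def capture_by_length(record, threshold_cm=45.0):
    """Capture the record only if its length is greater than the threshold."""
    return record["length_cm"] > threshold_cm


def capture_by_machine_and_month(record, machine_id, month):
    """Capture every measurement of one machine taken in a given month."""
    return record["machine_id"] == machine_id and record["measured_on"].month == month


# Made-up sample measurements (hypothetical field names and values).
measurements = [
    {"machine_id": "M1", "length_cm": 47.2, "measured_on": date(2016, 5, 3)},
    {"machine_id": "M1", "length_cm": 44.9, "measured_on": date(2016, 5, 9)},
    {"machine_id": "M2", "length_cm": 51.0, "measured_on": date(2016, 6, 1)},
]

long_records = [m for m in measurements if capture_by_length(m)]
may_m1 = [m for m in measurements if capture_by_machine_and_month(m, "M1", 5)]
```

Keeping each criterion as a small named function makes it easy to record, alongside the data, exactly which rule was applied.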
In the application data pond, there might be criteria that look like: “If gender = 0 then convert gender to female. If gender = 1 then convert gender to male. If gender = x then convert gender to female. If gender = y then convert gender to male, and so forth.” Or there might be a criterion that says: “If measurement is made in inches, then convert to centimeters.”
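A minimal sketch of how such application criteria might be written down follows. The lookup table mirrors the gender codes quoted above. The function names, the unit labels "in" and "cm", and the choice to return None for an unmapped code are assumptions made for the example.

```python
# Mapping of source-system gender codes to a single common value,
# following the criteria quoted in the text.
GENDER_CODES = {"0": "female", "1": "male", "x": "female", "y": "male"}

CM_PER_INCH = 2.54  # one inch is exactly 2.54 centimeters


def normalize_gender(code):
    """Convert a source gender code to 'female' or 'male'.

    Returns None for a code that has no rule (an assumption of this sketch).
    """
    return GENDER_CODES.get(str(code).strip().lower())


def to_centimeters(value, unit):
    """Convert a measurement to centimeters; 'in' and 'cm' are hypothetical unit labels."""
    if unit == "in":
        return value * CM_PER_INCH
    if unit == "cm":
        return value
    raise ValueError(f"no conversion rule for unit {unit!r}")


print(normalize_gender(0), normalize_gender("Y"), to_centimeters(10, "in"))
```

Holding the code mappings in a table rather than in scattered if-statements keeps the criteria readable for the analyst who later needs to see how a value was converted.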
In the textual data pond, there might be transformation criteria such as: “If word = Honda then add car to classification. If word = Porsche then add car to classification. If word = Ford then add car to classification. If word = Volkswagen, then add car to classification.” Or there might be a criterion that says: “If word = elm then type = tree. If word = oleander, then type = bush.”
The transformation criteria are where the analyst goes to determine exactly how transformations have been accomplished. Fig 5.3 depicts the transformation criteria for each data pond.
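The textual criteria above amount to a lookup from words to classifications. The sketch below shows one possible form, using only the words named in the text. The function name classify, the simple tokenizing rule, and the sample sentence are illustrative assumptions and not a description of any specific text-processing product.

```python
import re

# Word-to-classification rules taken from the examples in the text.
TAXONOMY = {
    "honda": "car",
    "porsche": "car",
    "ford": "car",
    "volkswagen": "car",
    "elm": "tree",
    "oleander": "bush",
}


def classify(text):
    """Return (word, classification) pairs for each recognized word in the text."""
    tags = []
    for word in re.findall(r"[A-Za-z]+", text):
        label = TAXONOMY.get(word.lower())
        if label is not None:
            tags.append((word, label))
    return tags


tags = classify("The Porsche was parked under an elm.")
print(tags)
```

Matching is case-insensitive here so that “Porsche” and “porsche” fall under the same rule; whether that is appropriate depends on the text being processed.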
Fig 5.3 Determining the transformation criteria for each data pond
In Summary
Each data pond contains the following types of data: