Chapter 8.3

The Data Warehouse/Operational Environment Interface

Abstract

Data architecture began with simple storage devices. But soon, the need to store lots of data and to access the data quickly caused these early devices to disappear. In its place came disk storage. With disk storage, data could be accessed directly. But the need for managing volumes of data surpassed that of disk storage. One day, there appeared big data. And with big data came the ability to store effectively unlimited amounts of data. But as big data grew, the older day-to-day systems did not go away. There began to be a need for a rational way to interface legacy systems to big data.

Keywords

Storage device; Paper tape; Punched cards; Disk storage direct access of data; Big data; Interfacing corporate data and big data

As interesting as the big data/existing system interface is, it is not the only interface the data architect needs to know about. The other major interface in the corporate systems environment that is of interest is the interface between the operational environment and the data warehouse.

The Operational/Data Warehouse Interface

Fig. 8.3.1 shows the interface between the operational environment and the data warehouse environment.

Fig. 8.3.1
Fig. 8.3.1 Transforming application data to corporate data.

The operational environment is the place where day-to-day corporate decisions are made at the detailed level. The data warehouse environment is the place where the data that serve as the basis for corporate decision-making are stored.

The Classical ETL Interface

The interface between the two environments is called the “ETL” layer. ETL stands for extract/transform/load. It is in the ETL interface that application data are transformed to corporate data. The transformation is one of the most important transformations of data the corporation has.

Fig. 8.3.2 shows the classical ETL interface.

Fig. 8.3.2
Fig. 8.3.2 ETL performs the transformation.

The transformation of data in this interface is from application data to corporate data. The operational data are defined by each application. As a consequence, there are inconsistent definitions of data, inconsistent formulas, inconsistent structures of data, and so forth. But when the data pass through the ETL layer, the inconsistencies are resolved.

The ODS and the ETL Interface

There are however several variations to the classical ETL interface between the operational environment and the data warehouse environment. One of those variations is the inclusion of the ODS into the interface.

Fig. 8.3.3 shows that the ODS can participate in the interface.

Fig. 8.3.3
Fig. 8.3.3 The operational data store (ODS).

Data that flow to the ODS can flow directly into the ODS from the operational environment, or data that pass to the ODS can pass through the ETL transformation layer. Whether or not the data pass through, the ETL layer depends entirely on the class of the ODS. In the case of a class I ODS, data pass directly from the operational system to the ODS. In the case of a class II or class III ODS, data pass through the ETL interface.

Not every corporation has or needs an ODS. Usually, it is those corporations where there is a high degree of online transaction processing where the ODS is found.

The Staging Area

Another variation of the classical ETL interface between the operational environment and the data warehouse environment is the case where there is a staging area.

Fig. 8.3.4 shows a staging area.

Fig. 8.3.4
Fig. 8.3.4 Staging area.

There are some very specific cases where a staging area is called for. One of those instances is the case where data from two or more files must be merged and there is a timing issue. The data from file ABC are ready for merger at 9:00 am, and the data from file BCD are not ready for a merger until 5:00 pm in the afternoon. In this case, the data from file ABC must be “staged” until the merge is ready to occur.

A second case for the staging area is where there is a large volume of data and the data must be separated into different workloads in order to accommodate the parallelization of the ETL process. In this case, a staging area is needed to separate the volume of data.

A third case for the staging area is the case where the data coming from the operational environment must pass through a preprocessing step. In the preprocessing step, data pass through edit and correction process.

One of the issues with the staging area is whether or not analytic processing can be done against data found in the staging area. As a rule, data in the staging area are not used for analytic processing. This is because the data found in the staging area have not yet been passed through the transformation process. Therefore, it does not make sense to do any analytic processing against data found in the staging area.

Note that a staging area is optional and most organizations do not need one.

Changed Data Capture

Yet, another variation on the classical interface between operational systems and data warehouse systems is that of what is termed the CDC option. “CDC” stands for “changed data capture.” For high-performance online transaction environments, it is difficult or inefficient to scan the entire database every time data need to be refreshed into the data warehouse environment. In these environments, it makes sense to determine what data need to be updated into the data warehouse by examining the log tape or journal tape. The log tape is created for the purposes of online backup and recovery in the eventuality of a failure during online transaction processing. But the log tape contains all the data that need to be updated into the data warehouse. The log tape is read offline and is used to gather the data that need to be updated into the data warehouse.

Fig. 8.3.5 depicts the CDC option.

Fig. 8.3.5
Fig. 8.3.5 Transaction processing systems, changed data capture option.

Inline Transformation

Another alternative to the classical operational to data warehouse interface is that of the inline transformation. In the inline transformation, the data that need to flow to the data warehouse are captured and processed as part of online transaction processing.

Inline transformation is not used very often because the coding needs to be part of the original coding specifications and because of the resource consumption that is required during high-performance online transaction processing. In truth, most code for online transaction processing is created before any one realizes that the results of online transaction processing need to be reflected in the data warehouse environment.

However, on occasion, this option is seen.

Fig. 8.3.6 shows the inline transformation option.

Fig. 8.3.6
Fig. 8.3.6 Textual data, inline contextualization option.

ELT Processing

A final variation on the classical ETL interface is one that can be called the ELT interface. The ELT interface is one where the data are loaded directly from the operational environment to the data warehouse. Once in the data warehouse, the data are then transformed.

The problem with the ELT option is that there is the temptation to simply not execute the “T” step (i.e., the transformation step). In this case, the data warehouse turns into a “garbage dump.” And once the data warehouse is loaded with garbage, it becomes worthless as a foundation for decision-making.

If an organization has the will power to not neglect to do the “T” step, then there is nothing wrong with the ELT approach. But few organizations have the willpower and discipline to use the ELT approach properly.

Fig. 8.3.7 illustrates the ELT approach to interfacing operational systems and data warehouses.

Fig. 8.3.7
Fig. 8.3.7 An alternative to ETL is ELT.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset