Chapter 9
Most applications rely on their database structure to help enforce data quality. Data warehouse applications are an exception—a data warehouse emphasizes the reading of data and makes data quality the responsibility of loading programs. The antipatterns in this chapter simplify reading but compromise the ability of database structure to enforce quality. Hence, these antipatterns are acceptable for data warehouses, but you should avoid them otherwise.
A model has elements (entity types, relationships, attributes) that are not fundamental. These elements can be computed from other elements and lack substance in their own right. Derived data can complicate development and increases the likelihood of data corruption.
Consider using derived data for critical application elements or to resolve performance bottlenecks. For example, it is faster to store the current security positions for a broker account, than to compute them by processing past transactions. Derived data is often used for data warehouses to speed performance and ease the writing of queries.
When possible, rework the model to eliminate derived elements. Otherwise carefully document derived data and pay special attention to checking for inconsistencies in your test plan. Sometimes it is helpful to use a generic mechanism to keep derived data consistent with base data—for example, it is entirely acceptable to have derived data that is computed via a database view.
In Figure 9.1 it is much better to include a person's birthdate rather than age which changes over time. The original model is also undesirable for a data warehouse.
Figure 9.2 and Figure 9.3 show a partial model for a library. The slash prefix in Figure 9.2 is UML notation for derived data. The dueDate can be computed from checkoutDate and checkoutPeriod. The calculatedFine is computed according to the formula. Figure 9.4 shows SQL Server code for the calculations.
Figure 9.5 shows a proper use of derived data for a fact table of a simple data warehouse. The gross profit is the retail price less the wholesale cost.
An entity type has groups of similar attributes. Such a model can be brittle, verbose, and awkward to extend.
Parallel attributes are often used for data warehouses to ease data retrieval. Data warehouses forego the enforcement of data quality, leaving that responsibility to the feeder applications and scripts that load data.
Abstract and factor out commonality.
Figure 9.6a is taken from a legacy application. The repetition is obvious. An organization has different kinds of products for which both sales and profit are recorded. Figure 9.6b avoids the repetition and readily extends to new products and financial metrics. The original model is also undesirable for a data warehouse.
Dimensions and facts of data warehouses often have parallel attributes. For example, saleQuantity and salePrice in Figure 6.6 are parallel attributes. Figure 9.7 shows a Customer dimension with parallel attributes for address and phone number.
Two entity types have several (at least three) similar relationships.
As with parallel attributes, parallel relationships can be acceptable for a data warehouse in order to simplify queries for retrieving data.
Abstract and factor out commonality.
In Figure 9.8, a person can participate in making a movie in various ways. The left model shows five movie roles. The right model permits any number of roles.
Figure 9.9 shows another representation for a Customer dimension. The various addresses can be noted via parallel relationships. The use of parallel relationships could be preferred to Figure 9.7 if there were extensive Address fields to record.
An entity type has disparate attributes and lacks cohesion.
Giant tables are reasonable for input/output staging. It is also acceptable to combine entity types for the dimension tables of data warehouses.
Make each concept its own entity type.
In Figure 9.10 the contact position and contact phone depend on the contact name which in turn depends on the account. In the left model, several account records could have the same contact name with inconsistent positions and phone numbers. Note that the left model violates third normal form.
As a counterexample, it can be appropriate to combine entity types for a data warehouse. Specifically, such a combination can flatten a dimension. Many dimensions, when logically modeled, consist of several related entities. Data warehouses often collapse such snowflakes into a single dimension entity. For example, Figure 9.11 collapses the Product dimension of Figure 6.6 into a single entity.
An antipattern is a characterization of a common software flaw. The antipatterns in Table 9.1 simplify reading but compromise the ability of database structure to enforce quality. These antipatterns are often acceptable for data warehouses, but you should avoid them otherwise.
Summary of Non-Data-Warehouse Antipatterns
Antipattern name |
Observation |
Exceptions |
Resolution |
Frequency |
---|---|---|---|---|
Derived data |
A model has elements that are not fundamental. |
OK for critical elements, bottlenecks, and data warehouses. |
Rework the model to eliminate derived data. |
Common |
Parallel attributes |
An entity type has groups of similar attributes. |
Often used for data warehouses. |
Abstract and factor out commonality. |
Occasional |
Parallel relationships |
Two entity types have several similar relationships. |
Can be acceptable for a data warehouse. |
Abstract and factor out commonality. |
Occasional |
Combined entity types |
An entity type has disparate attributes. |
OK for I/O staging and data warehouses. |
Make each concept its own entity type. |
Occasional |