Chapter 16
This chapter summarizes my approach to database design. I start with a UML model of conceptual and logical intent and use that as the basis for preparing an IDEF1X model. Given an IDEF1X model, modern tools can readily generate SQL code to create the database structure. As a matter of software engineering discipline, I recommend a uniform policy for database design and deviate from the recommendations that follow only for good cause (such as demanding performance).
As Figure 16.1 shows, each entity type normally maps to a table. Each attribute becomes a column in the table. There may be additional columns for existence-based identity, buried relationships, and generalization discriminators (to be explained).
Two alternative mappings are sometimes useful.
Relationship mappings mostly depend on the multiplicity.
Many-to-many relationship. Map a many-to-many relationship to a distinct table (Figure 16.2). The primary key of the relationship combines the primary keys of the entity types. Role names become part of foreign key attribute names.
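As a sketch of this mapping (Student, Course, and Enrollment are hypothetical names, not taken from Figure 16.2), the generated SQL might look like:

```sql
-- Hypothetical many-to-many relationship: a student enrolls in many
-- courses and a course has many students.
CREATE TABLE Student (
    studentID   INTEGER PRIMARY KEY,
    studentName VARCHAR(50) NOT NULL
);

CREATE TABLE Course (
    courseID   INTEGER PRIMARY KEY,
    courseName VARCHAR(50) NOT NULL
);

-- The relationship becomes a distinct table whose primary key combines
-- the primary keys of the related entity types.
CREATE TABLE Enrollment (
    studentID INTEGER NOT NULL REFERENCES Student,
    courseID  INTEGER NOT NULL REFERENCES Course,
    PRIMARY KEY (studentID, courseID)
);
```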
Simple one-to-many relationship. Bury a foreign key in the “many” table (Figure 16.3). Role names become part of foreign key attribute names.
Simple optional-to-exactly-one relationship. Bury a foreign key in the “optional” table (Figure 16.4). Role names become part of foreign key attribute names. The foreign key is unique in the buried table.
Ordered relationship. Use the same mapping as without ordering. Add a sequence number attribute and define a uniqueness constraint on the source entity type plus the sequence number (Figure 16.5).
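The buried foreign key mappings above can be sketched as follows (Department, Employee, and the employer role name are hypothetical, not from the figures):

```sql
CREATE TABLE Department (
    departmentID INTEGER PRIMARY KEY,
    deptName     VARCHAR(50) NOT NULL
);

-- One-to-many: each employee works for one department, so the foreign
-- key is buried in the "many" table. The role name (employer) becomes
-- part of the foreign key column name.
CREATE TABLE Employee (
    employeeID           INTEGER PRIMARY KEY,
    employerDepartmentID INTEGER NOT NULL REFERENCES Department,
    -- Ordered variant: add a sequence number and make the source key
    -- plus the sequence number unique.
    seqNumber            INTEGER NOT NULL,
    UNIQUE (employerDepartmentID, seqNumber)
);

-- For an optional-to-exactly-one relationship, the buried foreign key
-- would instead be declared UNIQUE by itself.
```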
Sometimes it is desirable to disregard the recommended mappings and promote simple one-to-one and one-to-many relationships to distinct tables. For example, Chapter 14 had a separate table for the relationship between Entity and Icon even though it was one-to-many (as well as for the Model-Diagram, Port-Tab, and Connection-Line relationships). It would have been correct to bury the relationships, but from an architectural point of view I wanted to isolate dependencies between subject areas.
There are four possible situations for qualified relationships.
One-to-optional after qualification. This is the most common situation (Figure 16.6). The underlying relationship is one-to-many without the qualifier and you should bury the source entity type key and the qualifier in the “many” table. The combination of the source entity type plus the qualifier is unique.
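For example (Directory, File, and the fileName qualifier are hypothetical names, not from Figure 16.6), a directory plus a file name might identify at most one file:

```sql
CREATE TABLE Directory (
    directoryID INTEGER PRIMARY KEY
);

CREATE TABLE File (
    fileID      INTEGER PRIMARY KEY,
    -- Buried source entity type key plus the qualifier.
    directoryID INTEGER NOT NULL REFERENCES Directory,
    fileName    VARCHAR(255) NOT NULL,
    -- The combination of source entity type plus qualifier is unique.
    UNIQUE (directoryID, fileName)
);
```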
Optional-to-optional after qualification. This situation is also common (Figure 16.7). Bury the source entity type key and the qualifier in the table that is “many” without qualification. In the example, one record (the root of the tree, which has a null parentID) keeps the source entity type plus the qualifier from being unique. (As an alternative, you can promote the relationship to a table, and then the qualified relationship has a unique key.)
Many-to-optional after qualification. These qualified relationships seldom occur (Figure 16.8). Promote the relationship to a table with a primary key of the source entity type plus the qualifier. Note that the combination of the related entity types (assemblyID and componentID in Figure 16.8) need not be unique; the same component part may be used multiple times for the same assembly. For example, a wiper may be used for the left windshield, right windshield, and rear window.
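A sketch of this mapping follows. The column names assemblyID and componentID come from Figure 16.8 as described above; the table names and the qualifier column are assumptions:

```sql
CREATE TABLE Assembly (
    assemblyID INTEGER PRIMARY KEY
);

CREATE TABLE Component (
    componentID INTEGER PRIMARY KEY
);

-- The qualified relationship is promoted to a table. The primary key is
-- the source entity type key plus the qualifier; (assemblyID, componentID)
-- need not be unique, so the same component may recur within an assembly.
CREATE TABLE AssemblyComponent (
    assemblyID  INTEGER NOT NULL REFERENCES Assembly,
    usage       VARCHAR(50) NOT NULL,   -- qualifier, e.g. 'leftWindshield'
    componentID INTEGER NOT NULL REFERENCES Component,
    PRIMARY KEY (assemblyID, usage)
);
```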
Optional-to-many after qualification. These qualified relationships also seldom occur (Figure 16.9). Bury the source entity type key and the qualifier in the “many” table. In the example the “ordered” constraint yields an alternate key: the combination of the source entity type plus the qualifier plus a sequence number is unique. If “ordered” were omitted from the model, there would still be a buried source entity type key plus the qualifier, but no alternate key constraint.
Generalization organizes entity types by their similarities and differences. This book only considers single inheritance—a generalization for which a subtype has one supertype. (Chapter 8 recommends that data models avoid multiple inheritance—a generalization for which a subtype has multiple supertypes.)
As Figure 16.10 shows, I recommend that you map the supertype and each subtype to a table. The tables share common primary key values. The discriminator indicates the appropriate subtype table for each supertype record. Views can be helpful for consolidating the data of an entity across generalization levels.
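Using the Asset example from Figure 16.18 (column lists here are trimmed to the names the chapter mentions), the mapping and a consolidating view might be sketched as:

```sql
CREATE TABLE Asset (
    assetID      INTEGER PRIMARY KEY,
    assetDiscrim VARCHAR(20) NOT NULL   -- names the subtype table
);

-- The subtype table shares primary key values with the supertype.
CREATE TABLE OwnedAsset (
    ownedAssetID INTEGER PRIMARY KEY REFERENCES Asset (assetID)
);

-- A view consolidates an entity's data across generalization levels.
CREATE VIEW OwnedAssetView AS
    SELECT a.assetID, a.assetDiscrim
    FROM Asset AS a
    JOIN OwnedAsset AS oa ON a.assetID = oa.ownedAssetID;
```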
Relational databases only partially support generalization. Relational databases cannot express the generalization partition, that each supertype record is further described by exactly one subtype. Also referential integrity is a one-way mechanism and cannot enforce the two-way correspondence between records of a supertype and a subtype. (Section 16.6 elaborates.) Applications must compensate with additional programming.
Several alternative mappings are sometimes helpful for generalization.
Chapter 11 discussed intrinsic identity — the ability to start from outside a database and find data with fields that have application meaning. In contrast, design identity is the ability to find data within a database. Databases implement design identity with primary keys. A primary key is a candidate key that is chosen for internal (foreign key) references. A table normally has a primary key, and has at most one primary key.
There are several approaches to implementing primary keys. Of these approaches I favor existence-based identity and existence-based identity + lookups.
Existence-based identity. An artificial number (called an ID) is the primary key of each entity table (Figure 16.11). The major DBMS products can generate such numbers; examples include Oracle sequences and the SQL Server identity property. With existence-based identity all primary keys are single attribute, small, and uniform in size. One of the few downsides is that the IDs can complicate debugging. (When debugging you have to dereference IDs to see meaningful fields.)
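For example (Customer is a hypothetical table), each major product has its own syntax for generating the artificial numbers:

```sql
-- SQL Server: the identity property generates the ID.
CREATE TABLE Customer (
    customerID INTEGER IDENTITY(1,1) PRIMARY KEY,
    custName   VARCHAR(50) NOT NULL
);

-- Oracle: a sequence generates the ID.
CREATE SEQUENCE Customer_seq;
INSERT INTO Customer (customerID, custName)
    VALUES (Customer_seq.NEXTVAL, 'Acme Corp');
```

Note that the two fragments use different dialects and would not run together against a single DBMS.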
Existence-based identity + lookups. Another option (Figure 16.12) is to use a mnemonic abbreviation for lookup tables, and existence-based identity for everything else. “Lookup” data means metadata such as AccountType that is often implemented as a pick list in a user interface. Lookup tables have a small number of records (usually no more than tens of records) that are seldom updated.
Value-based identity. A combination of application attributes identifies each entity. In Figure 16.13 names identify Region, Bank, and AccountType. Account is identified by bankName combined with accountNum. An advantage of value-based identity is that primary keys have intrinsic meaning.
However, primary key values can be difficult to change; a change to a primary key may require propagation to many foreign keys. Some entities lack natural real-world identifiers. Also, primary keys can become lengthy, accumulating the foreign keys of a series of related tables, as Figure 16.14 illustrates. Figure 16.15 shows an awkward situation for Transaction: with value-based identity and referential integrity, the two paths lead to two copies of bankID and accountNumber. With value-based identity developers must abandon referential integrity to have a single copy of the fields.
Hybrid identity. A schema can combine existence-based identity with value-based identity. In Figure 16.16 Region, Bank, and AccountType have artificial identity and Account has identity derived from a bank reference combined with an account number.
Propagated identity. Identity can also propagate from one entity type to another. In Figure 16.17 the primary key of Exercise is also the primary key for Answer. (A book can have many exercises; an exercise may have an answer.) This is seldom a good idea as there is a “leakage” of information from one entity type to the other.
Table 16.1 summarizes the approaches to design identity and their trade-offs.
Summary of Approaches to Design Identity
| Identity approach | Definition | Advantages | Drawbacks | Recommendation |
|---|---|---|---|---|
| Existence-based | An artificial number is the primary key of each entity. | Primary keys are single attribute, small, and uniform in size. | IDs can complicate debugging. | A good choice |
| Existence-based + lookups | A mnemonic for lookups and existence-based for everything else. | | | A good choice |
| Value-based | A combination of application attributes identifies each entity. | Primary keys have intrinsic meaning. | Primary keys can be difficult to change and can become lengthy; some entities lack natural identifiers. | Seldom a good idea |
| Hybrid | Schema combines existence-based and value-based identity. | | | A poor option |
| Propagated | PK of an entity comes from a relationship. | | Identity “leaks” from one entity type to another. | A poor option |
Note: I recommend the use of existence-based identity or existence-based + lookups.
Once the IDEF1X model is in place, you should define referential integrity actions to enforce the model’s meaning. Referential integrity is a database mechanism that ensures that references to data really exist and that there are no dangling foreign keys. Normally you should let a DBMS enforce referential integrity constraints rather than write custom application code. If you use existence-based identity, there are no primary key updates to propagate and referential integrity actions are only needed for deletions.
Neither the UML model nor the IDEF1X model specifies referential integrity precisely, so you will have to rely on your application understanding. I recommend the following guidelines for deletions.
Generalization. Cascade deletions for foreign keys that arise from generalization. For example, in Figure 16.18 OwnedAsset is a subtype of Asset and ownedAssetID references assetID. Upon deletion, an Asset record and an OwnedAsset record should both be deleted. Given separate supertype and subtype tables, there should be an on delete cascade clause for each subtype reference to the supertype.
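For the Asset example, the subtype's reference to the supertype might be declared as follows (only the key columns are shown):

```sql
CREATE TABLE Asset (
    assetID INTEGER PRIMARY KEY
);

CREATE TABLE OwnedAsset (
    ownedAssetID INTEGER PRIMARY KEY,
    FOREIGN KEY (ownedAssetID) REFERENCES Asset (assetID)
        ON DELETE CASCADE
);

-- Deleting an Asset row now also deletes the corresponding OwnedAsset row.
```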
A relational DBMS can propagate deletion downward from the supertype to the subtypes. However, a relational DBMS cannot propagate deletion upward from a subtype toward the supertype. For example, the deletion of an Asset record could cause the deletion of the corresponding OwnedAsset record. The deletion of an OwnedAsset could then cause deletion of its RentedAssets. But deletion of a RentedAsset record cannot propagate upward to Asset. Consequently, an application must first delete the Asset records for the RentedAssets and then delete the Asset record for the OwnedAsset.
Some organizations forbid cascaded deletions as a matter of policy (fearful of inadvertent errors). In that case there will have to be additional programming code to propagate a deletion from a supertype record to descendant subtype records.
SQL has powerful constraint mechanisms that are part of the language. As much as possible, it is desirable to place declarative constraints in the database rather than write imperative constraints via programming code. The not null clause enforces that a column of a table must have a value. The previous section discussed referential integrity which ensures that there are no dangling referents. Unique indexes (next section) can enforce candidate keys. In addition SQL has triggers and general constraints.
A trigger performs a database command upon the occurrence of a specified event and satisfaction of a condition [Elmasri-2006]. Although it is a dangerous practice, triggers can be used to enforce database constraints. The concern is that careless use of triggers can lead to explosions of database activity: one trigger fires, causing other triggers to fire, leading to an extensive cascade. One trigger in isolation is straightforward to understand. However, a database with numerous triggers can be inscrutable.
It is especially important not to use SQL triggers to implement referential integrity, as was done with some old DBMS products. Modern SQL has declarative referential integrity that is well understood and efficient; you should use it. Triggers are several orders of magnitude slower at enforcing referential integrity and should not be used for that purpose.
A proper use of triggers is for propagating data — to update related applications, to synchronize distributed databases, or to feed data warehouses. Triggers can also be helpful for keeping derived data consistent with its underlying base data.
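As a sketch of the derived-data use (SQLite-style trigger syntax; PurchaseOrder, LineItem, and their columns are hypothetical), a trigger can keep a stored total consistent with its base data:

```sql
-- Keep the derived orderTotal consistent with its LineItem rows.
CREATE TRIGGER LineItem_after_insert
AFTER INSERT ON LineItem
BEGIN
    UPDATE PurchaseOrder
    SET orderTotal = orderTotal + NEW.amount
    WHERE orderID = NEW.orderID;
END;
```

Corresponding triggers for updates and deletions of LineItem rows would complete the scheme.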
SQL also supports general constraints with the check constraint. Models imply some of these constraints. Others are details that are lacking from the model and rely on your application understanding.
The purpose of the generalization discriminator is to indicate which subtype record elaborates each supertype record. Accordingly a discriminator must be an enumeration with one value for each of the subtypes. For example, in Figure 16.18 assetDiscrim is an enumeration with two values: RentedAsset and OwnedAsset. With SQL assetDiscrim would be stored as a string that is not null. A check constraint could enforce that the string value was in the list {‘RentedAsset’, ‘OwnedAsset’}.
SQL check constraints are also useful for enforcing domains. A SQL table has many columns each of which has a domain. A domain specifies a datatype, constraints on the data, and semantic meaning of the data. Thus the domain for UPC codes may store data as a string of digits with a specified length and have a rule to verify the check digit at the end. (See Chapter 11 for a discussion of UPC codes.) As another example, in Figure 16.18 a RentedAsset’s endTime must be greater than its startTime.
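For example, the discriminator enumeration and the time-interval rule from Figure 16.18 could both be enforced with check constraints (the column datatypes are assumptions):

```sql
CREATE TABLE Asset (
    assetID      INTEGER PRIMARY KEY,
    -- One enumeration value per subtype.
    assetDiscrim VARCHAR(20) NOT NULL
        CHECK (assetDiscrim IN ('RentedAsset', 'OwnedAsset'))
);

CREATE TABLE RentedAsset (
    rentedAssetID INTEGER PRIMARY KEY REFERENCES Asset (assetID),
    startTime     TIMESTAMP NOT NULL,
    endTime       TIMESTAMP NOT NULL,
    -- A rental must end after it starts.
    CHECK (endTime > startTime)
);
```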
SQL check constraints can also enforce enumerations. Enumerations often arise and should be enforced by the database rather than application code. The following are enumerations: actualOrEstimate (Figure 10.7), grade (Figure 10.11), format (Figure 10.15), priority (Figure 10.37), and outcome (Figure 10.37).
Indexes serve two purposes: enforcing uniqueness for primary and candidate keys as well as enabling fast database traversal. Most relational DBMSs create indexes as a side effect of declaring primary keys and candidate keys. I recommend that you also create an index for each foreign key that is not subsumed by a primary key or candidate key. These foreign key indexes are important because they enable the fast performance that users expect when they traverse a model. Joins often occur across relationships and across the levels of generalization hierarchies. Joins are orders of magnitude more efficient if foreign keys and primary keys have indexes.
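For example (Employee and its buried foreign key employerDepartmentID are hypothetical), a foreign key index is a single statement:

```sql
-- Index the buried foreign key so that joins across the relationship
-- have an index available on both sides.
CREATE INDEX Employee_employerDepartmentID_ix
    ON Employee (employerDepartmentID);
```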
You should incorporate foreign key indexes in your initial database design because they are straightforward to include and there is no good reason to defer them. The database administrator (DBA) may define additional indexes to fine-tune performance. The DBA may also use DBMS-specific features.
If you have a modern tool, it is relatively easy to generate SQL code from a database design. With ERwin I pay attention to the following.
This chapter summarizes my approach to database design. I start with a UML model of conceptual and logical intent and use that as the basis for preparing an IDEF1X model. Modern tools, such as ERwin, can then generate SQL code to create the database design. Here is a summary of my preferred database design practices.
Table 16.2 summarizes the recommended mapping rules.
Summary of Relational DBMS Mapping Rules
| Concept | Model construct | Relational DBMS construct |
|---|---|---|
| Entity type | Entity type | Table |
| Non-qualified relationship | Many-to-many | Distinct table |
| | Simple one-to-many | Buried foreign key |
| | Simple one-to-one | Buried foreign key |
| | Relationship with attributes | Distinct table |
| | Aggregation | Same as underlying relationship |
| | Composition | Same as underlying relationship |
| | Ordered relationship | Same as underlying relationship + sequence number |
| Qualified relationship | One-to-optional | Buried foreign key + qualifier |
| | Optional-to-optional | Buried foreign key + qualifier |
| | Many-to-optional | Distinct table |
| | Optional-to-many | Buried foreign key + qualifier |
| Generalization | | Separate supertype and subtype tables |
Many of the ideas in this chapter come from my consulting and database reverse engineering experiences.
[Bruce-1992] is a good reference for IDEF1X. [Elmasri-2006] is a good general database reference.
[Bruce-1992] Thomas A. Bruce. Designing Quality Databases with IDEF1X Information Models. New York, New York: Dorset House, 1992.
[Elmasri-2006] Ramez Elmasri and Shamkant B. Navathe. Fundamentals of Database Systems (5th Edition). Boston, Massachusetts: Addison-Wesley, 2006.