Chapter 11

Identity

Identity is the property that distinguishes an entity from all others. In concept, an entity has intrinsic identity apart from how the entity may happen to be implemented. Users must be able to find data in a database or the database is compromised.

11.1 Intrinsic Identity

Intrinsic identity is the ability to find data with fields that have meaning. Starting from outside a database, a user specifies real-world fields to find one or more entities and then navigates the database to find the desired data. Intrinsic identity has no bearing on how identity is implemented. Rather intrinsic identity provides a way for logically finding entities when searching a database.

11.1.1 Candidate Keys

Candidate keys provide one aspect of intrinsic identity. A candidate key is a combination of one or more fields that uniquely identify the records in a table. The set of fields in a candidate key must be minimal; no field can be discarded from the candidate key without destroying uniqueness. No field in a candidate key can be null. When a candidate key is defined, the DBMS guarantees that the combination of fields will be unique. I strongly encourage the use of candidate keys. After all, the purpose of a database is not only to store data but also to assure data’s quality. When you use a candidate key as the starting point for a search, you are guaranteed to obtain no more than one entity.

The UML lacks a notation for candidate keys. This book uses the notation {unique} to flag a unique field. In Figure 11.1 Airline has two candidate keys, each of which consists of one field. Thus, for example, the name American Airlines and the code AA both denote the same airline.

Figure 11.1

Figure showing intrinsic identity in a model. There are unique combinations of fields that provide starting points for finding data in a database.

Intrinsic identity in a model. There are unique combinations of fields that provide starting points for finding data in a database.

IDEF1X uses the AKn.m annotation to indicate unique combinations of fields (alternate keys). The “n” is the number of the alternate key. The “m” is the sequence of fields within the alternate key. Thus in Figure 11.1 the airlineName and airlineCode each individually denote an Airline. The combination of airlineID and accountNumber is also unique for a FrequentFlyerAccount.

11.1.2 UML Qualifiers

UML qualifiers are also helpful for specifying intrinsic identity. A qualifier is an attribute that distinguishes among the entities at a “many” relationship end. A qualifier selects among the entities, reducing the effective multiplicity, often from “many” to “one”. For example, in Figure 11.1 the UML model and IDEF1X models have the same meaning. They specify that an Airline plus an accountNumber yield at most one FrequentFlyerAccount. In contrast, if the accountNumber is omitted, an Airline yields many FrequentFlyerAccounts. A qualifier specifies an important path for traversing a model and finding data.

11.1.3 Logical Horizon

Traversal is a third aspect of intrinsic identity. Applications can navigate a model and its corresponding database by traversing relationships and generalizations. Traversals of a model become SQL joins in an implementation.

The logical horizon [Feldman-1986] of an entity type is the set of entity types reachable by one or more paths terminating in a cumulative multiplicity of one. A path is a sequence of traversals of relationships and generalization levels. The purpose of the logical horizon is to compute the specific entities that can be inferred from a starting entity.

First I will explain Figure 11.2 and then give some examples for logical horizon. A Person may have multiple FrequentFlyerAccounts, each of which is offered by an Airline. An accountNumber is unique within the context of an Airline. Each FrequentFlyerAccount can have Activities posted of various ActivityTypes. For example, a frequent flyer account can have credits posted for actual flight miles, bonus miles, dining activity, and car rentals.

Figure 11.2

Figure showing names. Names are often helpful for finding entities.

Names. Names are often helpful for finding entities.

The logical horizon of Person is null; the only relationship is to FrequentFlyerAccount and the multiplicity is “many.” The logical horizon of FrequentFlyerAccount is Person and Airline. The logical horizon of Activity is FrequentFlyerAccount, Person, Airline, and ActivityType.

11.2 Names

Names are prominent in models and often helpful for finding entities. Webster’s dictionary defines a name as “a word or phrase that constitutes the distinctive designation of a person or thing.” There are four scenarios for how names can be used to find entities.

  • Unique names. Some names are unique and resolve to a single entity. For example, the name of a country corresponds to a specific country. In Figure 11.2 Airline name, Airline code, and ActivityType name are globally unique.
  • Unique names within a context. Other names are not unique on their own but are unique when combined with a parent entity. For example, the names of provinces are unique within the context of a country. UML qualifiers (and their equivalent in IDEF1X alternate keys) provide a notation for specifying fields that are unique within a context. In Figure 11.2 accountNumber provides the unique name for a FrequentFlyerAccount within the context of an Airline.
  • Non-unique names. Still other names provide important description but alone cannot find an entity. For example, person names are important, but insufficient for finding an individual person. Sometimes non-unique names can be augmented with additional details to find a specific entity.
  • Multiple unique names. Some entities have multiple names. Figure 11.3 promotes the substanceName attribute to an entity type because each chemical substance may have multiple aliases. For example, propylene is known as propylene and C3H6.

Figure 11.3

Figure showing multiple unique names. Chemical substances can have multiple names that identify the substance.

Multiple unique names. Chemical substances can have multiple names that identify the substance.

11.3 Surrogate Identity

Sometimes entities can be identified via other entities with which they are closely related. For example, you cannot reliably identify a person with his or her name. However, in some applications you can identify a person via a passport, driver’s license, or identity card.

11.4 Structured Fields

Some entities, especially mechanical parts and items for commerce, have structured fields that provide identity. A structured field is a field that is composed from constituent pieces with a specified grammar. Structured fields are synthetic but when parsed the pieces have meaning. Many structured fields are backed by standard protocols.

As an example, consider the UPC, EAN, and GTIN codes for consumer packaged goods. The UPC (Universal Product Code) originated in the United States in the early 1970s and has twelve digits. The first digit is the system digit and indicates the kind of item—general merchandise, random-weight item, health item, in-house item, and coupon. The next five digits denotes the vendor. Digits seven through eleven are called the item code—a unique code for an item within the context of a vendor. The last digit is a check digit.

The UPC was followed by the EAN (European Article Number) which has thirteen digits. The EAN has the following sequence: two system digits, five vendor digits, five item digits, and one check digit.

Most recently the UPC and EAN have both been replaced by the GTIN (Global Trade Item Number). The GTIN has fourteen digits with the following sequence: item digit, two system digits, five vendor digits, five item digits, and one check digit. Note that there are a total of six item digits, one at the start of the sequence and the other five later in the sequence. The extra item digit is often used to distinguish individual items from multiple items that are packaged into cases and pallets.

Many manufacturers have product codes that embody information about the major components, location of manufacture, and date of manufacture.

11.5 Master Applications

Some organizations have dedicated applications (called master applications) that enforce the identity of occurrences of crucial concepts and unify their data across an organization’s applications. All requests for creating, modifying, and deleting such occurrences must be coordinated via the corresponding master application. For example, customers, parts, locations, securities, and contracts are often good candidates for master applications. An organization should have no more than a handful of master applications and the subject concepts should pervade the organization.

Consider a Customer master application. If each application names customers, there could be many variations, such as AT&T, A.T.&T, and American Telephone and Telegraph. One application might have the current mailing address and another might have a stale address from the past. With this kind of chaos, it is difficult to have a global perspective for data. In contrast, a Customer master application would dictate the precise name for a company for use throughout an organization and serve as a centralized source for its data.

11.6 Merging Data

Duplicate copies of data can arise in various ways, such as through flaws in business processes. For example, an airline may create two frequent flyer accounts for a person and not realize that they are dealing with the same customer. Duplicate copies can also arise through business acquisitions that cause databases to be merged. For example, the merger of Delta Airlines and Northwest Airlines created a need to merge frequent flyer databases.

The tricky part of merging data is determining if one entity is the same thing as another. The entities being compared must have matching real-world data that can identify them or there must be an algorithm to deduce a match (such as by equating the various AT&T names in Section 11.5). Sometimes the information required to make the determination is missing from a database; then a person has to investigate to reach a decision. Once a match is established the mechanics of combination are straightforward but can still be tedious to accomplish without disrupting a business.

There is no merge technique that is clearly superior. The appropriate solution depends on the situation as Table 11.1 explains.

Table 11.1

Summary of Approaches to Merging Data

Merge approach

Definition

Advantages

Drawbacks

Combine entities

Copy data from one entity to the other. Then discard the rejected entity.

  • Fully consolidates the data.
  • There can be much data to migrate, especially if there are many foreign keys.
  • May have to remap foreign keys and revise foreign key definitions.
  • It is easy to overlook a foreign key, causing a stale reference.

Favor an entity

Mark one entity as deprecated and the other as active. Both entities remain in the database. Gradually migrate data to the active entity.

  • Eventually consolidates the data.
  • There is less risk of stale references as obsolete data can persist for awhile.
  • In the meantime there is database clutter, possible confusion, and indirection in accessing data.
  • There can be much data to migrate, especially if there are many foreign keys.

Logical merge

Use a binding table to logically combine entities. (See Section 10.1.) One entity could be favored for new data.

  • Can readily merge data as well as unmerge.
  • There is no risk of stale references.
  • Causes indirection in accessing data, complicating queries and slowing performance.
  • The database schema remains fragmented by multiple aspects of an entity.

Note: The appropriate solution depends on the situation.

11.7 Chapter Summary

Identity is the property that distinguishes an entity from all others. In concept, an entity has intrinsic identity apart from how the entity may happen to be implemented. Candidate keys and UML qualifiers are important aspects of intrinsic identity. Names are prominent in models and can be helpful for finding specific data.

Bibliographic Notes

[Khoshafian-1986] is a classic reference on identity, but the ideas in the paper reach beyond programming languages and also pertain to databases.

Chapter 5 of [Fowler-1997] has a good discussion of identity.

Chapter 4 of [Arlow-2004] discusses identity for persons and organizations. Chapter 7 discusses identity for products.

References

[Arlow-2004] Jim Arlow and Ila Neustadt. Enterprise Patterns and MDA: Building Better Software with Archetype Patterns and UML. Boston, Massachusetts: Addison-Wesley, 2004.

[Feldman-1986] P. Feldman and D. Miller. Entity model clustering: Structuring a data model by abstraction. Computer Journal 29, 4 (1986), 348-360.

[Fowler-1997] Martin Fowler. Analysis Patterns: Reusable Object Models. Boston, Massachusetts: Addison-Wesley, 1997.

[Khoshafian-1986] S.N. Khoshafian and G.P. Copeland. Object identity. OOPSLA '86 as ACM SIG-PLAN21, 11 (November 1986), 406-416.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset