The two techniques used to build the relational logical data model are normalization and abstraction.

When I turned 12, I received a trunk full of baseball cards as a birthday present from my parents. I was delighted, not just because there may have been a Hank Aaron or Pete Rose buried somewhere in that trunk, but because I loved to organize the cards. I categorized each card according to year and team. Organizing the cards in this way gave me a deep understanding of the players and their teams. To this day, I can answer many baseball card trivia questions.

Normalization, in general, is the process of applying a set of rules with the goal of organizing something. I was normalizing the baseball cards according to year and team. We can also apply a set of rules and normalize the attributes within our organizations. The rules are based on how the business works, which is why normalization is the primary technique used in building the relational logical data model.

Just as those baseball cards lay unsorted in that trunk, our companies have huge numbers of attributes spread throughout departments and applications. The rules applied to normalizing the baseball cards entailed first sorting by year, and then by team within a year. The rules for normalizing our attributes can be boiled down to a single sentence:

Make sure every attribute is single-valued and provides a fact completely and only about its primary key.

The underlined terms require more of an explanation.

Single-valued means an attribute must contain only one piece of information. If Consumer Name contains Consumer First Name and Consumer Last Name, for example, we must split Consumer Name into two attributesConsumer First Name and Consumer Last Name.

Provides a fact means that a given primary key value will always return no more than one of every attribute that is identified by this key. If a Customer Identifier value of 123 for example, returns three customer last names (Smith, Jones, and Roberts), it violates this part of the normalization definition.

Completely means that the minimal set of attributes that uniquely identify an instance of the entity is present in the primary key. If, for example, there are two attributes in an entitys primary key, but only one is needed for uniqueness, the attribute that is not needed for uniqueness should be removed from the primary key.

Only means that each attribute must provide a fact about the primary key and nothing else. That is, there can be no hidden dependencies. For example, assume an Order is identified by an Order Number. Within Order, there are many attributes, including Order Scheduled Delivery Date, Order Actual Delivery Date, and Order On Time Indicator. Order On Time Indicator contains either a Yes or a No, providing a fact about whether the Order Actual Delivery Date is less than or equal to the Order Scheduled Delivery Date. Order On Time Indicator, therefore, provides a fact about Order Actual Delivery Date and Order Scheduled Delivery Date, not directly about Order Number, so it should be removed from the normalized model. Order On Time Indicator is an example of a derived attribute, meaning it is calculated. Derived attributes are removed from a normalized model.

So, a general definition for normalization is that it is a series of rules for organizing something. The series of rules can be summarized as: Every attribute is single-valued and provides a fact completely and only about its primary key. An informal definition I frequently use for normalizing is: A formal process of asking business questions. We cannot determine if every attribute is single-valued and provides a fact completely and only about its primary key unless we understand the data. To understand the data, we usually need to ask lots of questions. Even for an apparently simple attribute such as Phone Number, for example, we can ask many questions:

·          Whose phone number is this?

·          Do you always have to have a phone number?

·          Can you have more than one phone number?

·          Do you ever recognize the area code as separate from the rest of the phone number?

·          Do you ever need to see phone numbers outside a given country?

·          What type of phone number is this? That is, is it a fax number, mobile number, etc.?

·          Does the time of day matter? For example, do we need to distinguish between the phone number to use during working hours, and outside working hours? Of course, that would lead to a discussion on what we mean by working hours.

To ensure that every attribute is single-valued and provides a fact completely and only about its primary key, we apply a series of rules in small steps, where each step (or level of normalization) checks something that moves us towards our goal. Most data professionals would agree that the full set of normalization levels is the following:


·         first normal form (1NF)

·         second normal form (2NF)

·         third normal form (3NF)

·         Boyce/Codd normal form (BCNF)

·         fourth normal form (4NF)

·         fifth normal form (5NF).

Each level of normalization includes the lower levels of rules that precede it. If a model is in 5NF, it is also in 4NF, BCNF, and so on. Even though there are higher levels of normalization than 3NF, many interpret the term normalized to mean 3NF. This is because the higher levels of normalization (that is, BCNF, 4NF, and 5NF) cover specific situations that occur much less frequently than the first three levels. Therefore, to keep things simple, this chapter focuses only on first through third normal forms.

Initial Chaos

I would describe the trunk of baseball cards I received as being in a chaotic state, as there was no order to the cards. Just a bunch of cards thrown in a large box. I removed the chaos by organizing the cards. The term chaos can be applied to any unorganized pile, including attributes. We may have a strong understanding of each of the attributes, such as their name and definition, but we lack knowledge to which entity the attribute should be assigned. When I pick out a 1978 Pete Rose from the box and put this card in the 1978 pile, I start bringing order were there was chaossimilar to assigning Customer Last Name to the customer pile (called the Customer entity).

Lets walk through an example – Figure 9.3 contains a bunch of what appears to be employee attributes.

Figure 9.3 Initial chaotic state

Often definitions are of poor quality or missing completely and let’s assume that this is the case with this Employee entity. We are told however, that Employee Vested Indicator captures whether an Employee is eligible for retirement benefits – a value of Y for “yes” means the employee is eligible, and a value of N for “no” means the employee is not eligible, and this indicator is derived from the employee’s start date. For example, if an employee has worked for the company for at least five years, then this indicator contains the value Y.

What is lacking at this point, and what will be solved though normalization, is putting these attributes into the right entities.

It is very helpful to have some sample values for each of these attributes, so lets assume the spreadsheet in Table 9.2 is a representative set of employee values.

Table 9.2 Sample employees

Emp Id

Dept Cd

Phone 1

Phone 2

Phone 3

Emp Name

Dept Name

Emp Start Date

Emp Vested Ind

123

A

973-555-1212

678-333-3333

343-222-1111

Henry Winkler

Data Admin

4/1/2012

N

789

A

732-555-3333

678-333-3333

343-222-1111

Steve Martin

Data Admin

3/5/2007

Y

565

B

333-444-1111

516-555-1212

343-222-1111

Mary Smith

Data Warehouse

2/25/2006

Y

744

A

232-222-2222

678-333-3333

343-222-1111

Bob Jones

Data Admin

5/5/2011

N

First Normal Form (1NF)

Recall that the series of rules can be summarized as: Every attribute is single-valued and provides a fact completely and only about its primary key. First Normal Form (1NF) is the “single-valued” part. It means that for a given primary-key value, we can find, at most, one of every attribute that depends on that primary key.

Ensuring each attribute provides a fact about its primary key includes correcting the more blatant issue shown in Table 9.2, as well as addressing repeating groups and multi-valued attributes. Specifically, the modeler needs to:

·         Move repeating attributes to a new entity. When there are two or more of the same attribute in the same entity, they are called repeating attributes. The reason repeating attributes violate 1NF is that for a given primary key value, we are getting more than one value back for the same attribute. Repeating attributes often take a sequence number as part of their name, such as phone number in this employee example. We can find ourselves asking many questions just to determine if there are any repeating attributes we need to address. We can have a question template, such as “Can a” [[insert entity name here]] “have more than one” [[insert attribute name here]] “?” So these are all valid questions in our employee example:             

o        Can an Employee have more than one Employee Identifier?

o        Can an Employee have more than one Department Code?

o        Can an Employee have more than one Phone Number?

o        Can an Employee have more than one Employee Name?

o        Can an Employee have more than one Department Name?

o        Can an Employee have more than one Employee Start Date?

o        Can an Employee have more than one Employee Vested Indicator?

·         Separate multi-valued attributes. Multi-valued means that within the same attribute we are storing at least two distinct values. There are at least two different business concepts hiding in one attribute. For example, Employee Name may contain both a first name and last name. Employee First Name and Employee Last Name can be considered distinct attributes, and therefore, Henry Winkler stored in Name, is multi-valued, because it contains both Henry and Winkler. We may find ourselves asking many questions just to determine if there are any multi-valued attributes we need to identify. We can have another question template, such as “Does a” [[insert attribute name here]] “contain more than one piece of business information?” So these are all valid questions in our employee example:

o        Does an Employee Identifier contain more than one piece of business information?

o        Does a Department Code contain more than one piece of business information?

o        Does a Phone Number contain more than one piece of business information?

o        Does an Employee Name contain more than one piece of business information?

o        Does a Department Name contain more than one piece of business information?

o        Does an Employee Start Date contain more than one piece of business information?

o        Does an Employee Vested Indicator contain more than one piece of business information?

After answering these questions based on the sample data from Table 9.2, and by talking with a business expert, we update the model which appears in Figure 9.4.

Figure 9.4 From Chaos to 1NF

We learned that although Phone Number 1, Phone Number 2, and Phone Number 3 appear as repeating attributes, they are really three different pieces of information, based upon the sample values we were given. Phone Number 3 contained the same value for all four employees and we learned after validating with the business expert that this is the organization’s phone number. Phone Number 2 varied by department so this attribute was renamed to Department Phone Number and Phone Number 3 is different for each employee and we learned that this is the Employee Phone Number. We also were told that Employee Name does contain more than one piece of information and therefore it should be split into Employee First Name and Employee Last Name.

Second Normal Form (2NF)

Recall that the series of rules can be summarized as: Every attribute is single-valued and provides a fact completely and only about its primary key. First Normal Form (1NF) is the “single-valued” part. Second Normal Form (2NF) is the completely part. This means each entity must have the minimal set of attributes that uniquely identifies each entity instance.

As with 1NF, we will find ourselves asking many questions to determine if we have the minimal primary key. We can have another question template, such as: Are all of the attributes in the primary key needed to retrieve a single instance of [[insert attribute name here]]? In the Employee example shown in Figure 9.4, the minimal set of primary key instances are Employee Identifier and Department Code.

So these are all valid questions for our employee example:             

·          Are both Employee Identifier and Department Code needed to retrieve a single instance of Employee Phone Number?

·          Are both Employee Identifier and Department Code needed to retrieve a single instance of Department Phone Number?

·          Are both Employee Identifier and Department Code needed to retrieve a single instance of Organization Phone Number?

·          Are both Employee Identifier and Department Code needed to retrieve a single instance of Employee First Name?

·          Are both Employee Identifier and Department Code needed to retrieve a single instance of Employee Last Name?

·          Are both Employee Identifier and Department Code needed to retrieve a single instance of Department Name?

·          Are both Employee Identifier and Department Code needed to retrieve a single instance of Employee Start Date?

·          Are both Employee Identifier and Department Code needed to retrieve a single instance of Employee Vested Indicator?

Normalization is a process of asking business questions. In this example, we could not complete 2NF without asking the business “Can an Employee work for more than one Department at the same time?” If the answer is “Yes” or “Sometimes”, then the first model in Figure 9.5 on the left is accurate. If the answer is “No”, then the second model prevails.

Figure 9.5 Employee example in 2NF

Third Normal Form (3NF)

Recall that the series of rules can be summarized as: Every attribute is single-valued and provides a fact completely and only about its primary key. First Normal Form (1NF) is the “single-valued” part. Second Normal Form (2NF) is the “completely” part. Third Normal Form (3NF) is the “only” part.

3NF requires the removal of hidden dependencies. Each attribute must be directly dependent on the primary key, and not directly dependent on any other attributes within the same entity.

The data model is a communication tool. The relational logical data model communicates which attributes are facts about the primary key and only the primary key. Hidden dependencies complicate the model and make it difficult to determine how to retrieve values for each attribute.

To resolve a hidden dependency, you will either need to remove the attribute that is a fact about non-primary key attribute(s) from the model, or you will need to create a new entity with a different primary key for the attribute that is dependent on the non-primary key attribute(s).

As with 1NF and 2NF, we will find ourselves asking many questions to uncover hidden dependencies. We can have another question template, such as:

Is [[insert attribute name here]] a fact about any other attribute in this same entity?

So these are all valid questions for our contact example:             

·          Is Employee Phone Number a fact about any other attribute in Employee?

·          Is Organization Phone Number a fact about any other attribute in Department?

·          Is Department Phone Number a fact about any other attribute in Department?

·          Is Employee First Name a fact about any other attribute in Employee?

·          Is Employee Last Name a fact about any other attribute in Employee?

·          Is Department Name a fact about any other attribute in Department?

·          Is Employee Start Date a fact about any other attribute in Employee?

·          Is Employee Vested Indicator a fact about any other attribute in Employee?

Note that Employee Vested Indicator may be a fact about Employee Start Date as Employee Vested Indicator can be calculated Y or N based upon the employee’s start date. Figure 9.6 shows the model now in 3NF after removing the derived attribute, Employee Vested Indicator.

Figure 9.6 Employee example in 3NF

You will find that the more you normalize, the more you go from applying rules sequentially to applying them in parallel. For example, instead of first applying 1NF to your model everywhere, and then when you are done applying 2NF, and so on, you will find yourself looking to apply all levels at once. This can be done by looking at each entity and making sure the primary key is correct and that it contains a minimal set of attributes, and that all attributes are facts about only the primary key.

Normalization is a mandatory technique on the relational logical data model. Abstraction is an optional technique. Abstraction brings flexibility to your data models by redefining and combining some of the attributes, entities, and relationships within the model into more generic terms.

For example, we can take our normalized data model and abstract Employee into Party and Role, as shown in Figure 9.7.

Figure 9.7 Employee abstracted

Notice the extra flexibility we gain with abstraction. By abstracting Employee into the Party Role concept, we can accommodate additional roles without changes to our model and most likely without changes to our application. Roles such as Contractor and Consumer can be added gracefully without updates to our model. However, this extra flexibility does come with a price. Actually, three high prices:

·         Loss of communication. The concepts we abstract are no longer represented explicitly on the model. That is, when we abstract, we often convert column names to entity instances. For example, Employee is no longer an explicit entity, but is, instead, an entity instance of Party Role, with a Role Type Code value of 03 for Employee. One of the main reasons we model is for communication, and abstracting can definitely hinder communication.

·         Loss of business rules. When we abstract, we can also lose business rules. To be more specific, the rules we enforced on the data model before abstraction now need to be enforced through other means, such as through programming code. If we wanted to enforce that an Employee must have a Start Date for example, we can no longer enforce this rule through the abstracted data model in Figure 9.7.

·         Additional development complexity. Abstracting requires sophisticated development techniques to turn attributes into values, when loading an abstract structure, or to turn values back into attributes, when populating a structure from an abstract source. Imagine the work to populate Party Role from the source Employee. It would be much easier for a developer to load data from an entity called Employee into an entity called Employee. The code would be simpler and it would be very fast to load.

So, although abstraction provides flexibility to an application, it does come with a cost. It makes the most sense to use abstraction when the modeler or analyst anticipates additional types of something coming in the near future. For example, from Figure 9.7, additional types of people, such as Contractor and Consumer.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset