Chapter 7
Data Quality Refactorings

Data quality refactorings are changes that improve the quality of the information contained within a database. Data quality refactorings improve and/or ensure the consistency and usage of the values stored within the database. The data quality refactorings are as follows:

• Add Lookup Table

• Apply Standard Codes

• Apply Standard Type

• Consolidate Key Strategy

• Drop Column Constraint

• Drop Default Value

• Drop Non-Nullable

• Introduce Column Constraint

• Introduce Common Format

• Introduce Default Value

• Make Column Non-Nullable

• Move Data

• Replace Type Code With Property Flags

Common Issues When Implementing Data Quality Refactorings

Because data quality refactorings change the values of the data stored within your database, several issues are common to all of them. As a result, you need to do the following:

1. Fix broken constraints. You may have constraints defined on the affected data. If so, you can apply Drop Column Constraint (page 172) to first remove the constraint and then apply Introduce Column Constraint (page 180) to add the constraint back, reflecting the values of the improved data.

2. Fix broken views. Views will often reference hard-coded data values in their WHERE clauses, usually to select a subset of the data. As a result, these views may become broken when the values of the data change. You will need to find these broken views by running your test suite and by searching for view definitions that reference the columns in which the changed data is stored.

3. Fix broken stored procedures. The variables defined within a stored procedure, any parameters passed to it, the return value(s) calculated by it, and any SQL defined within it are potentially coupled to the values of the improved data. Hopefully, your existing tests will reveal business logic problems arising from the application of any data quality refactorings; otherwise, you will need to search for any stored procedure code accessing the column(s) in which the changed data is stored.

4. Update the data. You will likely want to lock the source data rows during the update, affecting performance and availability of the data for the application(s). You can take two strategies to do this. First, you can lock all the data and then do the updates at that time. Second, you can lock subsets of the data, perhaps even just a single row at a time, and do the update just on the subset. The first approach ensures consistency but risks performance degradation with large amounts of data—updating millions of rows can take time, preventing applications from making updates during this period. The second approach enables applications to work with the source data during the update process but risks inconsistency between rows because some will have the older, “low-quality” values, whereas other rows will have been updated.
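The subset strategy in item 4 can be sketched in application code: partition the affected row ids into small batches and update each batch in its own transaction, so that only a few rows are locked at any moment. The class, table, and column names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedUpdate {
    // Split the full set of affected row ids into fixed-size batches so
    // that each UPDATE transaction locks only a small subset of the table.
    public static List<List<Long>> partition(List<Long> rowIds, int batchSize) {
        List<List<Long>> batches = new ArrayList<>();
        for (int i = 0; i < rowIds.size(); i += batchSize) {
            batches.add(new ArrayList<>(
                rowIds.subList(i, Math.min(i + batchSize, rowIds.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= 10; i++) {
            ids.add(i);
        }
        for (List<Long> batch : partition(ids, 4)) {
            // Each batch would be updated (and briefly locked) in its own
            // transaction, for example:
            //   UPDATE Address SET State = 'TX' WHERE AddressId IN (...)
            System.out.println("Updating batch of " + batch.size() + " rows");
        }
    }
}
```

The batch size trades off lock duration against the length of the inconsistency window described above.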

Add Lookup Table

Create a lookup table for an existing column.

Motivation

There are several reasons why you may want to apply Add Lookup Table:

Introduce referential integrity. You may want to introduce a referential integrity constraint on an existing column, such as Address.State, to ensure the quality of the data.

Provide code lookup. Many times you want to provide a defined list of codes in your database instead of having an enumeration in every application. The lookup table is often cached in memory.

Replace a column constraint. When you introduced the column, you added a column constraint to ensure that a small number of correct code values persisted. But, as your application(s) evolved, you needed to introduce more code values, until you got to the point where it was easier to maintain the values in a lookup table instead of updating the column constraint.

Provide detailed descriptions. In addition to defining the allowable codes, you may also want to store descriptive information about the codes. For example, in the State table, you may want to relate the code CA to California.

Potential Tradeoffs

There are two issues to consider when adding a lookup table. The first is data population—you need to be able to provide valid data to populate the lookup table. Although this sounds simple, in practice the implication is that you must have an agreement as to the semantics of the existing data values in Address.State of Figure 7.1. This is easier said than done. For example, in the case of introducing a State lookup table, some applications may work with all 50 U.S. states, whereas others may also include four additional jurisdictions (Puerto Rico, Guam, the District of Columbia, and the U.S. Virgin Islands). In this situation, you may either need to add two lookup tables, one for the 50 states and the other for the additional jurisdictions, or implement a single table and then the appropriate validation logic within applications that need only a subset of the lookup data.

Figure 7.1. Adding a State lookup table.

image
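One way to implement the application-side validation mentioned above, where an application accepts only a subset of the shared lookup data, is sketched below. The class name and the excluded-code set are illustrative assumptions, not part of the original schema:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StateValidator {
    // Hypothetical subset filter: the shared State lookup table holds all
    // jurisdictions, but this application accepts only the 50 states.
    private static final Set<String> EXCLUDED_JURISDICTIONS =
        new HashSet<>(Arrays.asList("PR", "GU", "DC", "VI"));

    // A code is accepted if it exists in the shared lookup table and is
    // not one of the jurisdictions this application excludes.
    public static boolean isAcceptedState(String code, Set<String> lookupTableCodes) {
        return lookupTableCodes.contains(code)
            && !EXCLUDED_JURISDICTIONS.contains(code);
    }
}
```

With this approach, a single State table serves every application, and only the applications that need the narrower subset carry the extra filter.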

The second issue is that there will be a performance impact resulting from the addition of a foreign key constraint. (See the refactoring Add Foreign Key Constraint on page 204 for details.)

Schema Update Mechanics

As depicted in Figure 7.1, to update the database schema, you must do the following:

1. Determine the table structure. You must identify the column(s) of the lookup table (State).

2. Introduce the table. Create State in the database via the CREATE TABLE command.

3. Determine lookup data. You have to determine which rows are going to be inserted into State. Consider applying the Insert Data refactoring (page 296).

4. Introduce referential constraint. To enforce referential integrity constraints from the code column in the source table(s) to State, you must apply the Add Foreign Key Constraint refactoring (page 204).

The following code depicts the DDL to introduce the State table and add a foreign key constraint between it and Address:

--Create the lookup table.
CREATE TABLE State (
  State CHAR(2) NOT NULL,
  Name CHAR(50),
  CONSTRAINT PKState
    PRIMARY KEY (State)
);

--Introduce foreign key to the lookup table
ALTER TABLE Address ADD CONSTRAINT FK_Address_State
  FOREIGN KEY (State) REFERENCES State (State);


Data-Migration Mechanics

You must ensure that the data values in Address.State have corresponding values in State. The easiest way to populate State.State is to copy the unique values from Address.State. With this automated approach, you need to remember to inspect the resulting rows to ensure that invalid data values are not introduced—if so, you need to update both Address and State appropriately. When there are descriptive information columns, such as State.Name, you must provide the appropriate values; this is often done manually via a script or a data-administration utility. An alternative strategy is to simply load the values into the State table from an external file.

The following code depicts the DML to populate the State table with distinct values from the Address.State column. We then cleanse the data, in this case ensuring that all addresses use the code TX instead of Tx, tx, or Texas. The final step is to provide state names corresponding to each state code. (In the example, we populate values for just three states.)

--Populate data in the lookup table
INSERT INTO State (State)
  SELECT DISTINCT UPPER(State) FROM Address;

--Update Address.State to valid values and cleanse the data
UPDATE Address SET State = 'TX'
  WHERE UPPER(State) IN ('TX', 'TEXAS');
...
--Now provide state names
UPDATE State SET Name = 'Florida' WHERE State = 'FL';
UPDATE State SET Name = 'Illinois' WHERE State = 'IL';
UPDATE State SET Name = 'California' WHERE State = 'CA';


Access Program Update Mechanics

When you add State, you have to ensure that external programs now use the data values from the lookup table. The following code shows how external programs can now get the name of a state from the State table; in the past, they would have had to obtain this information from an internally hard-coded collection:

// After code
ResultSet rs = statement.executeQuery(
  "SELECT State, Name FROM State");


Some programs may choose to cache the data values, whereas others will access State as needed—caching works well because the values in State rarely change. Furthermore, if you are also introducing a foreign key constraint along with the lookup table, external programs will need to handle any exceptions thrown by the database. See the Add Foreign Key Constraint refactoring (page 204) for details.
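A caching approach along these lines might look like the following sketch. The class is hypothetical; a real version would load the map once from the SELECT shown above rather than receiving it as a constructor argument:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class StateCache {
    private final Map<String, String> codeToName;

    // In a real application, loadedRows would be populated once from
    // "SELECT State, Name FROM State" at startup; since lookup values
    // rarely change, the cache can live for the life of the process.
    public StateCache(Map<String, String> loadedRows) {
        this.codeToName = Collections.unmodifiableMap(new LinkedHashMap<>(loadedRows));
    }

    // Returns the descriptive name for a state code, or null if unknown.
    public String nameFor(String code) {
        return codeToName.get(code);
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("CA", "California");
        rows.put("FL", "Florida");
        StateCache cache = new StateCache(rows);
        System.out.println(cache.nameFor("CA"));
    }
}
```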

Apply Standard Codes

Apply a standard set of code values to a single column to ensure that it conforms to the values of similar columns stored elsewhere in the database.

Motivation

You may need to use Apply Standard Codes to do the following:

Cleanse data. When you have the same semantic meaning for different code values in your database, it is generally better to standardize them so that you can apply standard logic across all data attributes. For example, in Figure 7.2, when the value in Country.Code is USA and the value in Address.CountryCode is US, you have a potential problem because you can no longer accurately join the two tables. Apply a consistent value, one or the other, throughout your database.

Figure 7.2. Applying standard state codes.

image

Support referential integrity. When you want to apply Add Foreign Key Constraint (page 204) to tables based on the code column, you need to standardize the code values first.

Add a lookup table. When you are applying Add Lookup Table (page 153), you often need to first standardize the code values on which the lookup is based.

Conform to corporate standards. Many organizations have detailed data and data modeling standards that development teams are expected to conform to. Often when applying Use Official Data Source (page 271), you discover that your current data schema does not follow your organization’s standards and therefore needs to be refactored to reflect the official data source code values.

Reduce code complexity. When you have a variety of values for semantically the same data, you will be writing extra program code to deal with the different values. For example, existing program code such as countryCode = 'US' || countryCode = 'USA' . . . would be simplified to something like countryCode = 'US'.
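That simplification can be illustrated in Java. The method names are hypothetical; the point is that once the codes are standardized, each check collapses to a single comparison:

```java
public class CountryCodeCheck {
    // Before standardization: every check must enumerate the variants
    // that mean "United States" somewhere in the database.
    public static boolean isUSBefore(String countryCode) {
        return "US".equals(countryCode) || "USA".equals(countryCode);
    }

    // After standardizing on "US": a single comparison suffices.
    public static boolean isUSAfter(String countryCode) {
        return "US".equals(countryCode);
    }
}
```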

Potential Tradeoffs

Standardizing code values can be tricky because they are often used in many places. For example, several tables may use the code value as a foreign key to another table; therefore, not only does the source need to be standardized but so do the foreign key columns. Second, the code values may be hard-coded in one or more applications, requiring extensive updates. For example, applications that access the Country table may have the value USA hard-coded in SQL statements, whereas applications that use the Address table have US hard-coded.

Schema Update Mechanics

To Apply Standard Codes to the database schema, you must do the following:

1. Identify the standard values. You need to settle on the “official” values for the code. Are the values being provided from existing application tables or are they being provided by your business users? Either way, the values must be accepted by your project stakeholder(s).

2. Identify the tables where the code is stored. You must identify the tables that include the code column. This may require extensive analysis and many iterations before you discover all the tables where the code is used. Note that this refactoring is applied a single column at a time; you will potentially need to apply it several times to ensure consistency across your database.

3. Update stored procedures. When you standardize code values, the stored procedures that access the affected columns may need to be updated. For example, if getUSCustomerAddress has a WHERE clause of Address.CountryCode = 'USA', it needs to change to Address.CountryCode = 'US'.

Data-Migration Mechanics

When you standardize on a particular code, you must update all the rows where there are nonstandard codes to use the standard ones. If you are updating small numbers of rows, a simple SQL script that updates the target table(s) is sufficient. When you have to update large amounts of data, or in cases where the code in transactional tables is being changed, apply Update Data (page 310) instead.

The following code depicts the DML to update data in the Address and Country tables to use the standard code values:

UPDATE Address SET CountryCode = 'CA' WHERE CountryCode = 'CAN';
UPDATE Address SET CountryCode = 'US' WHERE CountryCode = 'USA';

UPDATE Country SET CountryCode = 'CA' WHERE CountryCode = 'CAN';
UPDATE Country SET CountryCode = 'US' WHERE CountryCode = 'USA';


Access Program Update Mechanics

When Apply Standard Codes is applied, the following aspects of external programs must be examined:

1. Hard-coded WHERE clauses. You may need to update SQL statements to have the correct values in the WHERE clause. For example, if the Country.Code row value changes from 'USA' to 'US', you will have to change your WHERE clause to use the new value.

2. Validation code. Similarly, you may need to update source code that validates the values of data attributes. For example, code that compares countryCode to the old value 'USA' must be updated to use the new code value.

3. Lookup constructs. The values of codes may be defined in various programming “lookup constructs” such as constants, enumerations, and collections for use throughout an application. The definition of these lookup constructs must be updated to use the new code values.

4. Test code. Code values are often hard-coded into testing logic and/or test data generation logic; you will have to change these to now use the new values.
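As an example of the lookup constructs mentioned in step 3, consider a hypothetical enumeration whose stored code values must change when the codes are standardized (here from "USA" to "US"):

```java
public class CountryCodes {
    // Hypothetical lookup construct: applications often define codes as
    // an enumeration. When the database codes are standardized, the
    // values stored here must be updated to match ("USA" -> "US").
    public enum Country {
        UNITED_STATES("US"),
        CANADA("CA");

        private final String code;

        Country(String code) {
            this.code = code;
        }

        public String getCode() {
            return code;
        }
    }
}
```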

The following code shows you the before and after state of the method to read U.S. addresses:

// Before code
stmt = DB.prepare("SELECT addressId, line1, line2, " +
  "city, state, zipcode, country FROM address " +
  "WHERE countrycode = ?");
stmt.setString(1, "USA");
ResultSet rs = stmt.executeQuery();

// After code
stmt = DB.prepare("SELECT addressId, line1, line2, " +
  "city, state, zipcode, country FROM address " +
  "WHERE countrycode = ?");
stmt.setString(1, "US");
ResultSet rs = stmt.executeQuery();


Apply Standard Type

Ensure that the data type of a column is consistent with the data type of other similar columns within the database.

Motivation

The refactoring Apply Standard Type can be used to do the following:

Ensure referential integrity. When you want to apply Add Foreign Key (page 204) to all tables storing the same semantic information, you need to standardize the data types of the individual columns. For example, Figure 7.3 shows how all the phone number columns are refactored to be stored as integers. Another common example occurs when you have Address.ZipCode stored as a VARCHAR data type and Customer.Zip stored as NUMERIC data type; you should standardize on one data type so that you can apply referential integrity constraints.

Figure 7.3. Applying standard data types in Customer, Branch, and Employee tables.

image

Add a lookup table. When you are applying Add Lookup Table (page 153), you will want a consistent type used for the two code columns.

Conform to corporate standards. Many organizations have detailed data and data modeling standards that development teams are expected to conform to. Often, when applying Use Official Data Source (page 271), you discover that your current data schema does not follow your standards and therefore needs to be refactored to reflect the official data source type.

Reduce code complexity. When you have a variety of data types for semantically the same data, you require extra program code to handle the different column types. For example, phone number-validation code for the Customer, Branch, and Employee classes could be refactored to use a shared method.
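Such a shared validation method might be sketched as follows, once all three tables store the number as an integer. The ten-digit rule is an illustrative assumption, not a rule stated in this refactoring:

```java
public class PhoneNumbers {
    // Shared validation usable by the Customer, Branch, and Employee
    // classes once their phone columns share the same integer type.
    // Assumed rule for illustration: a North American phone number
    // has exactly ten digits (no leading zero).
    public static boolean isValid(long phoneNumber) {
        return phoneNumber >= 1_000_000_000L && phoneNumber <= 9_999_999_999L;
    }
}
```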

Potential Tradeoffs

Standardizing data types can be tricky because the individual columns are often referred to in many places. For example, several application classes may reference the column, code that will need to change when you change the column’s data type. Second, when you are attempting to change the data type of a column, you may have situations when the source data cannot be converted to the destination data type. For example, when you change Customer.Zip to NUMERIC data type, you may not be able to convert international postal codes such as R2D 2C3 that contain character data.
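Before changing a column's type, it can be worth scanning the existing values for ones that cannot be converted, such as the postal code R2D 2C3 mentioned above. A small sketch of such a check (the class and method names are illustrative):

```java
public class ZipConversionCheck {
    // Returns true only if the value could survive a conversion of the
    // column to a NUMERIC type; values such as the Canadian postal code
    // "R2D 2C3" contain character data and would fail.
    public static boolean isConvertible(String zip) {
        if (zip == null) {
            return false;
        }
        return zip.trim().matches("\\d+");
    }
}
```

Running a check like this against every row before applying the refactoring lets you find the problem values while it is still cheap to deal with them.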

Schema Update Mechanics

To apply this refactoring, you must first identify the standard data type. You need to settle on the “official” data type for the columns. This data type must handle all existing data and external access programs must be able to handle it. (Older languages sometimes cannot process new data types.) Then you must identify the tables that include column(s) where the data type needs to change. This may require extensive analysis and many iterations before you find out all the tables where the column needs to change. Note that this refactoring is applied to a single column at a time; therefore, you will potentially need to apply it several times to ensure consistency across your database.

Figure 7.3 depicts how we change the Branch.Phone, Branch.FaxNumber, and Employee.PhoneNumber columns to use the same data type of integer—the column Customer.PhoneNumber is already an integer and therefore does not need to be refactored. Although this would really be three individual refactorings, because two of the columns appear in the same table it behooves us to bundle the refactorings together.

The following code depicts the three refactorings required to change the Branch.Phone, Branch.FaxNumber, and Employee.PhoneNumber columns. We are going to add a new column to each table using Introduce New Column (page 301). Because we want to provide some time for all the applications to migrate to the new columns, during the transition phase we are going to maintain both versions of each column and synchronize the data between them:

ALTER TABLE Branch ADD COLUMN PhoneNumber INT;
COMMENT ON Branch.PhoneNumber 'Replaces Phone, dropdate=2007-03-27';

ALTER TABLE Branch ADD COLUMN FaxNo INT;
COMMENT ON Branch.FaxNo 'Replaces FaxNumber, dropdate=2007-03-27';

ALTER TABLE Employee ADD COLUMN PhoneNo INT;
COMMENT ON Employee.PhoneNo 'Replaces PhoneNumber, dropdate=2007-03-27';


The following code depicts how to synchronize the existing Branch.Phone, Branch.FaxNumber, and Employee.PhoneNumber columns with the new Branch.PhoneNumber, Branch.FaxNo, and Employee.PhoneNo columns:

CREATE OR REPLACE TRIGGER SynchronizeBranchPhoneNumbers
  BEFORE INSERT OR UPDATE
  ON Branch
  REFERENCING OLD AS OLD NEW AS NEW
  FOR EACH ROW
  DECLARE
  BEGIN
    IF :NEW.PhoneNumber IS NULL THEN
      :NEW.PhoneNumber := :NEW.Phone;
    END IF;
    IF :NEW.Phone IS NULL THEN
      :NEW.Phone := :NEW.PhoneNumber;
    END IF;
    IF :NEW.FaxNumber IS NULL THEN
      :NEW.FaxNumber := :NEW.FaxNo;
    END IF;
    IF :NEW.FaxNo IS NULL THEN
      :NEW.FaxNo := :NEW.FaxNumber;
    END IF;
 END;
/

CREATE OR REPLACE TRIGGER SynchronizeEmployeePhoneNumbers
  BEFORE INSERT OR UPDATE
  ON Employee
  REFERENCING OLD AS OLD NEW AS NEW
  FOR EACH ROW
  DECLARE
  BEGIN
    IF :NEW.PhoneNumber IS NULL THEN
      :NEW.PhoneNumber := :NEW.PhoneNo;
    END IF;
    IF :NEW.PhoneNo IS NULL THEN
      :NEW.PhoneNo := :NEW.PhoneNumber;
    END IF;
  END;
/

--Update existing data for the first time
UPDATE Branch SET
  PhoneNumber = formatPhone(Phone),
  FaxNo = formatPhone(FaxNumber);
UPDATE Employee SET
  PhoneNo = formatPhone(PhoneNumber);

--Drop the old columns on 2007-03-27
ALTER TABLE Branch DROP COLUMN Phone;
ALTER TABLE Branch DROP COLUMN FaxNumber;
ALTER TABLE Employee DROP COLUMN PhoneNumber;

DROP TRIGGER SynchronizeBranchPhoneNumbers;
DROP TRIGGER SynchronizeEmployeePhoneNumbers;
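The formatPhone function used in the update above is not defined in this example; presumably it strips formatting characters and returns only the digits. In the database it would be a stored function, but the equivalent logic can be sketched in Java:

```java
public class PhoneFormatter {
    // Sketch of the assumed formatPhone helper: strip everything that is
    // not a digit so that "(416) 555-1234" becomes 4165551234.
    public static long formatPhone(String raw) {
        String digits = raw.replaceAll("\\D", "");
        return Long.parseLong(digits);
    }
}
```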


Data-Migration Mechanics

When small numbers of rows need to be converted, you will likely find that a simple SQL script that converts the target column(s) is sufficient. When you want to convert large amounts of data, or you have complicated data conversions, consider applying Update Data (page 310).

Access Program Update Mechanics

When Apply Standard Type is applied, the external programs should be updated in the following manner:

1. Change application variables data type. You need to change the program code so that its data types match the data type of the column.

2. Database interaction code. The code that saves, deletes, and retrieves data from this column must be updated to work with the new data type. For example, if the Customer.Zip has changed from a character to numeric data type, you must change your application code from customerGateway.getString(“ZIP”) to customerGateway.getLong (“ZIP”).

3. Business logic code. Similarly, you may need to update application code that works with the column. For example, comparison code such as Branch.Phone = ‘XXX-XXXX’ must be updated to look like Branch.Phone = XXXXXXX.

The following code snippet shows the before and after state of the code that finds the Branch row for a given BranchId when we change the data type of PhoneNumber from a String to a Long:

// Before code
stmt = DB.prepare("SELECT BranchId, Name, PhoneNumber, " +
  "FaxNumber FROM Branch WHERE BranchId = ?");
stmt.setLong(1, findBranchId);
ResultSet rs = stmt.executeQuery();
if (rs.next()) {
  rs.getLong("BranchId");
  rs.getString("Name");
  rs.getString("PhoneNumber");
  rs.getString("FaxNumber");
}

// After code
stmt = DB.prepare("SELECT BranchId, Name, PhoneNumber, " +
  "FaxNumber FROM Branch WHERE BranchId = ?");
stmt.setLong(1, findBranchId);
ResultSet rs = stmt.executeQuery();
if (rs.next()) {
  rs.getLong("BranchId");
  rs.getString("Name");
  rs.getLong("PhoneNumber");
  rs.getString("FaxNumber");
}


Consolidate Key Strategy

Choose a single key strategy for an entity and apply it consistently throughout your database.

Motivation

Many business entities have several potential keys. There is usually one or more keys with business meaning, and fundamentally you can always assign a surrogate key column to a table. For example, your Customer table may have CustomerPOID as the primary key and both CustomerNumber and SocialSecurityNumber as alternate/secondary keys. There are several reasons why you would want to apply Consolidate Key Strategy:

Improve performance. You likely have an index supporting each key, and the database must maintain every one of those indexes on each insert, update, and delete. Reducing the number of keys reduces this overhead and thereby improves performance.

Conform to corporate standards. Many organizations have detailed data and data modeling guidelines that development teams are expected to conform to, guidelines that indicate the preferred keys for entities. Often when applying Use Official Data Source (page 271), you discover that your current data schema does not follow your standards and therefore needs to be refactored to reflect the official data source code values.

Improve code consistency. When you have a variety of keys for a single entity, you have code that implements access to the table in various ways. This increases the maintenance burden for those working with that code because they must understand each approach being used.

Potential Tradeoffs

Consolidating key strategies can be difficult. Not only do you need to update the schema of Policy in Figure 7.4, you also need to update the schema of any tables that include foreign keys to Policy that do not use your chosen key strategy. To do this, you need to apply Replace Column (page 126). You may also find that your existing set of keys does not contain a palatable option for the “one, true key,” and therefore you need to apply either Introduce Surrogate Key (page 85) or Introduce Index (page 248).

Figure 7.4. Consolidating the key strategy for the Policy table.

image

Schema Update Mechanics

To implement this refactoring within your database schema, you must do the following:

1. Identify the proper key. You need to settle on the “official” key column(s) for the entity. Ideally, this should reflect your corporate data standards, if any.

2. Update the source table schema. The simplest approach is to use the current primary key and to stop using the alternate keys. With this strategy, you simply drop the indices, if any, supporting the key. This approach still works if you choose to use one of the alternate keys rather than the current primary key. However, if none of the existing keys work, you may need to apply Introduce Surrogate Key.

3. Deprecate the unwanted keys. Any existing keys that are not to be the primary key, in this case PolicyNumber, should be marked that they will no longer be used as keys after the transition period. Note that you may want to retain the uniqueness constraint on these columns, even though you are not going to use them as keys anymore.

4. Add a new index. If one does not already exist, a new index based on the official key needs to be introduced for Policy via Introduce Index (page 248).

Figure 7.4 shows how we consolidate the key strategy for Policy to use only PolicyOID for the key. To do this, the Policy.PolicyNumber column is deprecated to indicate that it will no longer be used as a key as of December 14, 2007, and PolicyNotes.PolicyOID is introduced as a new key column to replace PolicyNotes.PolicyNumber. The following code adds the PolicyNotes.PolicyOID column:

ALTER TABLE PolicyNotes ADD PolicyOID CHAR(16);


The following code is run after the transition period ends to drop PolicyNotes.PolicyNumber and the index for the alternate key based on Policy.PolicyNumber:

COMMENT ON Policy 'Consolidation of keys to use only PolicyOID, dropdate=2007-12-14';
DROP INDEX PolicyIndex2;

COMMENT ON PolicyNotes 'Consolidation of keys to use only PolicyOID, therefore drop the PolicyNumber column, dropdate=2007-12-14';

ALTER TABLE PolicyNotes ADD CONSTRAINT
  PolicyNotesPolicyOID_PK
  PRIMARY KEY (PolicyOID, NoteNumber);

ALTER TABLE PolicyNotes DROP COLUMN PolicyNumber;


Data-Migration Mechanics

Tables with foreign keys maintaining relationships to Policy must now implement foreign keys that reflect the chosen key strategy. For example, the PolicyNotes table originally implemented a foreign key based on Policy.PolicyNumber. It must now implement a foreign key based on Policy.PolicyOID. The implication is that you will potentially need to apply Replace Column (page 126) to do so, and this refactoring requires you to copy the data from the source column (the value in Policy.PolicyOID) to the PolicyNotes.PolicyOID column. The following code sets the value of the PolicyNotes.PolicyOID column:

UPDATE PolicyNotes
  SET PolicyOID = (SELECT Policy.PolicyOID FROM Policy
    WHERE Policy.PolicyNumber = PolicyNotes.PolicyNumber);

Access Program Update Mechanics

When this refactoring is applied, the primary task is to ensure that existing SQL statements use the official primary key column in WHERE clauses to ensure continued good performance of joins. For example, the before code joins the Policy and PolicyNotes tables by a combination of the PolicyOID and PolicyNumber columns. The after code joins them based solely on the PolicyOID column:

// Before code
stmt.prepare(
  "SELECT PolicyNotes.Note FROM Policy, PolicyNotes " +
  "WHERE Policy.PolicyNumber = PolicyNotes.PolicyNumber " +
  "AND Policy.PolicyOID = ?");
stmt.setLong(1, policyOIDToFind);
ResultSet rs = stmt.executeQuery();

// After code
stmt.prepare(
  "SELECT PolicyNotes.Note FROM Policy, PolicyNotes " +
  "WHERE Policy.PolicyOID = PolicyNotes.PolicyOID " +
  "AND Policy.PolicyOID = ?");
stmt.setLong(1, policyOIDToFind);
ResultSet rs = stmt.executeQuery();


Drop Column Constraint

Remove a column constraint from an existing table.

When you need to remove a referential integrity constraint on a column, you should consider the use of Drop Foreign Key Constraint (page 213).

Motivation

The most common reason why you want to apply Drop Column Constraint is that the constraint is no longer applicable because of changes in business rules. For example, perhaps the Address.Country column is currently constrained to the values of “US” and “Canada”, but now you are also doing business internationally and therefore have addresses all over the world. A second reason is that the constraint is applicable to only a subset of the applications that access the column, perhaps because of changes in some applications but not others. For example, some applications may be used in international settings, whereas others are still only North American; because the constraint is no longer applicable to all applications, it must be removed from the common database and implemented in the applications that actually need it. As a side effect, you may see some minor performance improvements because the database no longer needs to do the work of enforcing the constraint. A third reason is that the column has been refactored, perhaps because Apply Standard Codes (page 157) or Apply Standard Type (page 162) has been applied to it, and now the column constraint needs to be dropped, refactored, and then reapplied via Introduce Column Constraint (page 180).

Potential Tradeoffs

The primary challenge with this refactoring is that you potentially need to implement the constraint logic in the subset of applications that require it. Because you are implementing the same logic in several places, you are at risk of implementing it in different ways.
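For example, if the dropped constraint required a non-negative account balance (as in the Account example later in this refactoring), the applications that still need the rule could centralize it in one shared class rather than scattering copies around. A minimal sketch, with an assumed rule and illustrative names:

```java
import java.math.BigDecimal;

public class AccountRules {
    // With the database constraint dropped, applications that still
    // require a non-negative balance must enforce the rule themselves.
    // Centralizing it here avoids divergent reimplementations.
    public static boolean isAcceptableBalance(BigDecimal balance) {
        return balance.signum() >= 0;
    }
}
```

Sharing one implementation like this mitigates, though does not eliminate, the risk of the rule drifting between applications.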

Schema Update Mechanics

To update the database schema, you must simply drop the constraint from Account via the DROP CONSTRAINT clause of the ALTER TABLE SQL command. Figure 7.5 depicts an example where the constraint is that the Account.Balance column be positive; this constraint is being removed to support the ability to allow accounts to be overdrawn. The following code depicts the DDL to drop the constraint on the Account.Balance column:

Figure 7.5. Dropping the column constraint from the Account table.

image

ALTER TABLE Account DROP CONSTRAINT Positive_Balance;


Data-Migration Mechanics

There is no data to migrate for this database refactoring.

Access Program Update Mechanics

The access programs working with this column will include logic to handle any errors being thrown by the database when the data being written to the column does not conform to the constraint. You need to remove this code from the programs.

Drop Default Value

Remove the default value that is provided by a database from an existing table column.

Motivation

We often use Introduce Default Value (page 186) when we want the database to persist a value for a column that the application does not supply. When the application later evolves to provide the value itself, we may no longer want the database to supply a default for the column in question. In this situation, we apply the Drop Default Value refactoring.

Potential Tradeoffs

There are two potential challenges to consider with this refactoring. First, there may be unintended side effects. Some applications may assume that a default value is persisted by the database and therefore exhibit different behavior when columns in new rows, which formerly would have been assigned a value, are now null. Second, you may need to improve the exception handling of external programs, which may be more effort than it is worth: if a column is non-nullable and data is not provided by the application, the database will throw an exception that the application is not expecting.
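The first challenge suggests that applications reading the column should handle the newly possible null. A defensive sketch, in which the fallback value is an assumption for illustration:

```java
public class CustomerStatus {
    // Once the database no longer supplies a default, Status may come
    // back null for rows inserted without it; callers must decide what
    // a missing status means. "RETAIL" is an assumed application-level
    // fallback, not something the database provides.
    public static String effectiveStatus(String statusFromDb) {
        return (statusFromDb != null) ? statusFromDb : "RETAIL";
    }
}
```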

Schema Update Mechanics

To update the schema to remove a default value, you must remove the default value from the column of the table in the database using the MODIFY clause of the ALTER TABLE command. The following code depicts the steps to remove the default value on the Customer.Status column of Figure 7.6. From a database point of view, a default value of null is the same as having no default value:

Figure 7.6. Dropping the default value for the Customer.Status column.

image

ALTER TABLE Customer MODIFY Status DEFAULT NULL;


Data-Migration Mechanics

There is no data to migrate for this database refactoring.

Access Program Update Mechanics

If some access programs depend on the default value to be used by the table, you either need to add data validation code to counteract the change to the table or consider backing out of this refactoring.

The following code shows how your application code has to provide the value for the column now instead of depending on the database to provide the default value:

//Before code
public void createRetailCustomer
  (long customerId, String firstName) {
  stmt = DB.prepare("INSERT INTO Customer " +
    "(CustomerId, FirstName) " +
    "VALUES (?, ?)");
  stmt.setLong(1, customerId);
  stmt.setString(2, firstName);
  stmt.execute();
}

//After code
public void createRetailCustomer
  (long customerId, String firstName) {
  stmt = DB.prepare("INSERT INTO Customer " +
    "(CustomerId, FirstName, Status) " +
    "VALUES (?, ?, ?)");
  stmt.setLong(1, customerId);
  stmt.setString(2, firstName);
  stmt.setString(3, RETAIL); // RETAIL is an application-level status constant
  stmt.execute();
}


Drop Non-Nullable Constraint

Change an existing non-nullable column such that it accepts null values.

Figure 7.7. Making the CityLookup.StateCode column nullable.

image

Motivation

There are two primary reasons to apply this refactoring. First, your business process has changed so that parts of the entity are now persisted at different times. For example, one application may create the entity without assigning a value to this column right away, and another application updates the row at a later time. Second, you may want a particular column to be nullable during a transition period. For example, when one of the applications cannot provide a value for the column because of a refactoring it is undergoing, you can relax the non-nullable constraint for a limited time so that the application can keep working. Later, you will apply Make Column Non-Nullable (page 189) to restore the constraint.

Potential Tradeoffs

Every application that accesses this column must be able to accept a null value, even if it merely ignores it or more likely assumes an intelligent default when it discovers the column is null. If there is an intelligent default value, you should consider the refactoring Introduce Default Value (page 186), too.
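The fallback logic an application might add when it assumes an intelligent default could look like the following Java sketch; the StateCodeDefaulter class and its "NA" default are hypothetical names chosen for this example:

```java
public class StateCodeDefaulter {
    // Hypothetical application-level default agreed on with stakeholders
    private static final String DEFAULT_STATE_CODE = "NA";

    // Returns the stored state code, or the agreed-on default when
    // the now-nullable column holds a null value
    public static String effectiveStateCode(String storedStateCode) {
        return storedStateCode != null ? storedStateCode : DEFAULT_STATE_CODE;
    }
}
```

Centralizing the fallback in one helper keeps the rest of the application from repeating the null check.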

Schema Update Mechanics

To perform this refactoring, you simply remove the NOT NULL constraint from the column. This is done via the MODIFY clause of the ALTER TABLE command. The following code shows how to make CityLookup.StateCode nullable:

-- On 2007-06-14
ALTER TABLE CityLookup MODIFY StateCode NULL;


Data-Migration Mechanics

There is no data to migrate for this database refactoring.

Access Program Update Mechanics

You must refactor all the external programs currently accessing CityLookup.StateCode so that the code can appropriately handle null values. Because null values can be persisted, they can therefore be retrieved, implying that we need to add null value checking code. Furthermore, you want to rework any code that checks for null value exceptions that are no longer being thrown by the database.

The following code sample shows how to add null checking logic to application code:

//Before code
public StringBuffer getAddressString(Address address) {
  StringBuffer stringAddress = new StringBuffer();
  stringAddress.append(address.getStreetLine1());
  stringAddress.append(address.getStreetLine2());
  stringAddress.append(address.getCity());
  stringAddress.append(address.getPostalCode());
  stringAddress.append(states.getNameFor
    (address.getStateCode()));
  return stringAddress;
}

//After code
public StringBuffer getAddressString(Address address) {
  StringBuffer stringAddress = new StringBuffer();
  stringAddress.append(address.getStreetLine1());
  stringAddress.append(address.getStreetLine2());
  stringAddress.append(address.getCity());
  stringAddress.append(address.getPostalCode());
  String stateCode = address.getStateCode();
  if (stateCode != null) {
    stringAddress.append(states.getNameFor(stateCode));
  }
  return stringAddress;
}


Introduce Column Constraint

Introduce a column constraint in an existing table.

When you have a need to add a non-nullable constraint on a column, you should consider the use of Make Column Non-Nullable refactoring (page 189). When you need to add a referential integrity constraint on a column, you should consider the use of Add Foreign Key (page 204).

Motivation

The most common reason why you want to apply Introduce Column Constraint is to ensure that all applications interacting with your database persist valid data in the column. In other words, column constraints are typically used to implement common, and relatively simple, data-validation rules.

Potential Tradeoffs

The primary issue is whether there is truly a common constraint for this data element across all programs that access it. Individual applications may have their own unique version of a constraint for this column. For example, the Customer table could have a FavoriteColor column. Your corporate standard might be the values “Red”, “Green”, and “Blue”, yet for competitive reasons one application may allow the color “Yellow” as a fourth value, whereas two other applications will only allow “Red” and “Blue”. You could argue that there should be one consistent standard across all applications, but the reality is that individual applications have good reasons for doing things a certain way. The implication is that because individual applications will implement their own versions of business rules, even in the best of circumstances you will discover existing data within your database that does not meet the constraint conditions. One strategy is to write a script that crawls through your database tables and reports on any constraint violations, identifying data to be fixed. Run these processes in batch mode during nonbusy hours for the database and the application.
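The row-level check such a crawl script applies can be sketched in Java as follows; the FavoriteColorRule class is a hypothetical example built around the color values above:

```java
import java.util.Arrays;
import java.util.List;

public class FavoriteColorRule {
    // The corporate-standard values from the example above; individual
    // applications may permit additional values such as "Yellow"
    private static final List<String> STANDARD_COLORS =
        Arrays.asList("Red", "Green", "Blue");

    // A crawl script would evaluate this check for each row and report
    // the violations for stakeholders to fix, rather than fixing them itself
    public static boolean violatesStandard(String favoriteColor) {
        return !STANDARD_COLORS.contains(favoriteColor);
    }
}
```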

You may also apply this refactoring because you need to refactor an existing column constraint. Refactorings such as Apply Standard Codes (page 157) and Apply Standard Type (page 162) could force you to change the underlying code value or data type respectively of the column, requiring you to reintroduce the constraint.

Schema Update Mechanics

As depicted in Figure 7.8, to update the database schema you must simply introduce a constraint on the column, in this case Customer.CreditLimit. As you can see in the following code, you do this via the ADD CONSTRAINT clause of the ALTER TABLE SQL command. In this case, the credit limit must be less than $50,000 as a fraud-prevention tactic:

Figure 7.8. Introducing a constraint to the Customer.CreditLimit column.

image

ALTER TABLE Customer ADD CONSTRAINT
  Check_Credit_Limit CHECK (CreditLimit < 50000.00);


Data-Migration Mechanics

Although not a data migration per se, the primary challenge with this refactoring is to ensure that existing data conforms to the constraint being applied to the column. You first need to define the constraint by working with your project stakeholders to identify what needs to be done with the data values that do not conform to it. Then you need to fix the source data, applying Update Data (page 310) as appropriate, to ensure that the values stored in this column conform to the constraint.

Access Program Update Mechanics

You need to ensure that the access programs can handle any errors being thrown by the database when the data being written to the column does not conform to the constraint. You need to look for every point in the programs where an insert or update is being done to the table, and then add the proper exception handling code, as shown in the following code:

//Before code
stmt = conn.prepare(
  "INSERT INTO Customer "+
  "(CustomerID,FirstName,Status,CreditLimit) "+
  "VALUES (?,?,?,?)");
stmt.setLong(1,customerId);
stmt.setString(2,firstName);
stmt.setString(3,status);
stmt.setBigDecimal(4,creditLimit);
stmt.executeUpdate();

//After code
stmt = conn.prepare(
  "INSERT INTO Customer "+
  "(CustomerID,FirstName,Status,CreditLimit) "+
  "VALUES (?,?,?,?)");
stmt.setLong(1,customerId);
stmt.setString(2,firstName);
stmt.setString(3,status);
stmt.setBigDecimal(4,creditLimit);
try {
  stmt.executeUpdate();
} catch (SQLException exception) {
  SQLException current = exception;
  while (current != null) {
    if (current.getErrorCode() == 2290) {
      // ORA-02290: check constraint violated
      handleCreditLimitExceeded();
    }
    current = current.getNextException();
  }
}


Introduce Common Format

Apply a consistent format to all the data values in an existing table column.

Figure 7.9. Applying a common format to the Address.PhoneNumber column.

image

Motivation

You typically apply Introduce Common Format to simplify external program code. When the same data is stored in different formats, external programs require unnecessarily complex logic to work with your data. It is generally better to store the data in a uniform format so that applications interfacing with your database do not have to handle multiple formats. For example, when the values in Address.PhoneNumber are ‘4163458899’, ‘905-777-8889’, and ‘(990) 345-6666’, every application accessing that column must be able to parse all three formats. This problem is even worse when an external program parses the data in several places, often because the program is poorly designed. A common format enables you to reduce the validation code dealing with Address.PhoneNumber and to reduce the complexity of your display logic: although the data is stored as 1234567890, it might be output in a report as (123) 456-7890.

Another common reason is to conform to your existing corporate standards. Often when applying Use Official Data Source (page 271), you discover that your current data format differs from that of the “official” data source, and therefore your data needs to be reformatted for consistency.

Potential Tradeoffs

Standardizing the format for the data values can be tricky when multiple formats exist within the same column. For example, the Customer.Phonenumber column could be stored using 15 different formats—you would need code that can detect and then convert the data in each row to the standard format. Luckily, the code to do this conversion should already exist within one or more external programs.
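The detection-and-conversion code for phone number formats like those shown earlier might be centralized in a single helper; this is a sketch that assumes North American ten-digit numbers and a digits-only storage format:

```java
public class PhoneNumberNormalizer {
    // Converts variants such as '905-777-8889', '(990) 345-6666', and
    // '+1 416 345 8899' to the ten-digit storage format, e.g. '9057778889'
    public static String normalize(String phoneNumber) {
        // Drop everything except the digits
        String digits = phoneNumber.replaceAll("[^0-9]", "");
        // Strip a leading '1' country code when present
        if (digits.length() == 11 && digits.startsWith("1")) {
            digits = digits.substring(1);
        }
        return digits;
    }
}
```

Code like this, whether it already exists in an external program or is written fresh, can serve double duty as the basis for the data-migration script.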

Schema Update Mechanics

To Introduce Common Format to a column, you must first identify the standard format. You need to settle on the “official” format for the data values. The data format may be an existing standard or it may be a new one defined by your business stakeholders, but either way the format must be accepted by your project stakeholder(s). Then, you must identify the column(s) where the format needs to be applied. You apply Introduce Common Format one column at a time; smaller refactorings are always easier to implement and then deploy into production. You will potentially need to apply it several times to ensure consistency across your database.

Data-Migration Mechanics

The first step is to identify the various formats currently being used within the column, which helps you understand what your data-migration code needs to do; often a simple SELECT statement will get you this information. The second step is to write the code to convert the existing data into the standard format. You may write this code using standard SQL data manipulation language (DML), an application programming language such as Java or C#, or an extract-transform-load (ETL) tool. When small numbers of rows need to be updated, a simple SQL script that updates the target table(s) is sufficient. When you need to update a large amount of data, or when data in transactional tables is being changed, apply Update Data (page 310) instead.

The following code depicts the DML to update the data in the Address table:

UPDATE Address SET PhoneNumber =
      REPLACE(PhoneNumber,'-','');

UPDATE Address SET PhoneNumber =
      REPLACE(PhoneNumber,' ','');

UPDATE Address SET PhoneNumber =
      REPLACE(PhoneNumber,'(','');

UPDATE Address SET PhoneNumber =
      REPLACE(PhoneNumber,')','');

UPDATE Address SET PhoneNumber =
      REPLACE(PhoneNumber,'+1','');

UPDATE Address SET PhoneNumber =
      REPLACE(PhoneNumber,'.','');


As you can see, we are updating one type of format at a time. You can also encapsulate the individual changes using a stored procedure, as shown here:

UPDATE Address SET PhoneNumber =
FormatPhoneNumber(PhoneNumber);


Access Program Update Mechanics

When Introduce Common Format is applied, the following aspects of the external programs must be examined:

1. Cleansing code. Your external programs will contain logic that accepts the various formats and converts them to the format that they want to work with. You may even be able to use some of this existing code as the basis for your data migration code above.

2. Validation code. You may need to update source code that validates the values of data attributes. For example, code that looks for formats such as Phonenumber = ‘NNN-NNN-NNNN’ must be updated to use the new data format.

3. Display code. Your user-interface code will often include logic to display data elements in a specific format, often one that differs from the storage format (for example, NNNNNNNNNN for storage, (NNN) NNN-NNNN for display). This includes display logic for both your screens and reports.

4. Test data. Your test data, or test data generation code, must now be changed to conform to the new standard data format. You may also want to add new test cases to test for data rows in inappropriate formats.
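The display logic of step 3 is a good candidate for a single shared helper, so that only one place in the code knows the output format. A sketch, assuming the ten-digit storage format used in the examples above:

```java
public class PhoneNumberFormatter {
    // Formats a ten-digit stored value such as '1234567890'
    // as '(123) 456-7890' for screens and reports
    public static String forDisplay(String storedNumber) {
        if (storedNumber == null || storedNumber.length() != 10) {
            return storedNumber; // leave unexpected values untouched
        }
        return "(" + storedNumber.substring(0, 3) + ") "
                + storedNumber.substring(3, 6) + "-"
                + storedNumber.substring(6);
    }
}
```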

Introduce Default Value

Let the database provide a default value for an existing table column.

Motivation

We often want the value of a column to have a default value populated when a new row is added to a table. However, the insert statements may not always populate that column, often because the column has been added after the original insert was written or simply because the application code that is submitting the insert does not require that column. Generally, we have found that Introduce Default Value is useful when we want to make the column non-nullable later on (see the database refactoring Make Column Non-Nullable on page 189).

Potential Tradeoffs

There are several potential challenges to consider regarding this refactoring:

Identifying a true default can be difficult. When many applications share the same database, they may have different default values for the same column, often for good reasons. Or it may simply be that your business stakeholders cannot agree on a single value—you need to work closely with them to negotiate the correct value.

Unintended side effects. Some applications may assume that a null value within a column actually means something and will therefore exhibit different behavior now that columns in new rows, which formerly would have been null, contain the default value instead.

Confused context. When a column is not used by an application, the default value may introduce confusion over the column’s usage with the application team.

Schema Update Mechanics

This is a single-step refactoring. You merely use the SQL ALTER TABLE command to define the default value for the column. You can optionally note an effective date to tell people when this default value was first introduced into the schema. The following code shows how you would introduce a default value on the Customer.Status column, as depicted in Figure 7.10:

Figure 7.10. Introducing a default value on the Customer.Status column.

image

ALTER TABLE Customer MODIFY Status DEFAULT 'NEW';

COMMENT ON COLUMN Customer.Status IS 'Default value of NEW will be inserted when there is
no data present on the insert, as of June 14 2006';


Data-Migration Mechanics

The existing rows may already have null values in this column, rows that will not be automatically updated as a result of adding a default value. Furthermore, there may be invalid values in some rows, too. You need to examine the values contained in the column—simply looking at a unique list of values may suffice—to determine whether you need to do any updates. If appropriate, you need to write a script that runs through the table to introduce the default value to these rows.

Access Program Update Mechanics

On the surface, it seems unlikely that access programs would be affected by the introduction of a default value; however, appearances can be deceiving. Potential issues you may run across include the following:

1. Invariants are broken by the new value. For example, a class may assume that the value of a color column is red, green, or blue, but the default value has now been defined as yellow.

2. Code exists to apply default values. There may now be extraneous source code that checks for a null value and introduces the default value programmatically. This code should be removed.

3. Existing source code assumes a different default value. For example, existing code may look for the default value of none, which was set programmatically in the past, and if found it gives users the option to change the color. Now the default value is yellow, so this code will never be invoked.

You need to analyze the access programs thoroughly, and then update them appropriately, before introducing a default value to a column.
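As a hypothetical illustration of item 2, the programmatic defaulting that becomes extraneous once the database supplies the value might look like this:

```java
public class CustomerStatusMapper {
    // Before the refactoring: the application supplied the default
    // when the column came back null
    public static String statusBeforeRefactoring(String statusFromDb) {
        if (statusFromDb == null) {
            return "NEW"; // programmatic default, now redundant
        }
        return statusFromDb;
    }

    // After the refactoring: the database guarantees a value,
    // so the null check can simply be deleted
    public static String statusAfterRefactoring(String statusFromDb) {
        return statusFromDb;
    }
}
```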

Make Column Non-Nullable

Change an existing column such that it does not accept any null values.

Motivation

There are two reasons to apply Make Column Non-Nullable. First, you want to enforce business rules at the database level such that every application updating this column is forced to provide a value for it. Second, you want to remove repetitious logic within applications that implement a not-null check—if you are not allowed to insert a null value, you will never receive one.

Potential Tradeoffs

Any external program that updates rows within the given table must provide a value for the column—some programs may currently assume that the column is nullable and therefore not provide such a value. Whenever an update or insertion occurs, you must ensure that a value is provided, implying that the external programs need to be updated and/or the database itself must provide a valid value. One technique we have found useful is to assign a default value using Introduce Default Value (page 186) for this column.

Schema Update Mechanics

As depicted in Figure 7.11, to perform Make Column Non-Nullable, you must simply add a NOT NULL constraint to the column. This is done via the SQL command ALTER TABLE, as shown in the following code:

Figure 7.11. Making the Customer.FirstName column non-nullable.

image

ALTER TABLE Customer
  MODIFY FirstName NOT NULL;


Data-Migration Mechanics

You may need to clean the existing data because you cannot make a column non-nullable if there are existing rows with a null value in the column. To address this problem, write a script that replaces the rows containing a null with an appropriate value.

The following code shows how to clean the existing data to support the change depicted in Figure 7.11. The initial step is to make sure that the FirstName column does not contain any null values; if it does, we have to update the data in the table:

SELECT count(FirstName) FROM Customer
  WHERE
    FirstName IS NULL;


If we do find rows where Customer.FirstName is null, we have to apply some algorithm to ensure that Customer.FirstName no longer contains any null values. In this example, we set Customer.FirstName to ‘???’ to indicate that we need to update this record. This strategy was chosen carefully by our project stakeholders because the data being changed is critical to the business:

UPDATE Customer SET FirstName='???'
  WHERE
    FirstName IS NULL;


Access Program Update Mechanics

You must refactor all the external programs currently accessing Customer.FirstName to provide an appropriate value whenever they modify a row within the table. Furthermore, they must also detect and then handle any new exceptions that are thrown by the database.

If you are unsure about all the places this column is used and do not have a way to change all those instances, you can apply the database refactoring Introduce Default Value (page 186) so that the database provides a default value when no value is provided by the application. This may be an interim strategy for you: after sufficient time has passed and you believe that the access programs have been updated, or at least once you are willing to quickly deal with the handful of access programs that were not updated, you can apply Drop Default Value (page 174) and thereby improve performance.

The following code shows the before and after state when the Make Column Non-Nullable refactoring is applied. In the example, we have decided to throw an exception when null values are found:

//Before code
stmt = conn.prepare(
  "INSERT INTO Customer "+
  "(CustomerID,FirstName,Surname) "+
  "VALUES (?,?,?)");
stmt.setLong(1,customerId);
stmt.setString(2,firstName);
stmt.setString(3,surname);
stmt.executeUpdate();

//After code
if (firstName == null) {
  throw new CustomerFirstNameCannotBeNullException();
}
stmt = conn.prepare(
  "INSERT INTO Customer "+
  "(CustomerID,FirstName,Surname) "+
  "VALUES (?,?,?)");
stmt.setLong(1,customerId);
stmt.setString(2,firstName);
stmt.setString(3,surname);
stmt.executeUpdate();


Move Data

Move the data contained within a table, either all or a subset of its columns, to another existing table.

Motivation

You typically want to apply Move Data as the result of structural changes with your table design; but as always, application of this refactoring depends on the situation. You may want to apply Move Data as the result of the following:

Column renaming. When you rename a column, the process is first to introduce a new column with the new name, move the original data to the new column, and then remove the original column using Drop Column (page 72).

Vertically splitting a table. When you apply Split Table (page 145) to vertically reorganize an existing table, you need to move data from the source table into the new tables that have been split off from it.

Horizontally splitting a table. Sometimes a horizontal slice of data is moved from a table into another table with the same structure because the original table has grown so large that performance has degraded. For example, you might move all the rows in your Customer table that represent European customers into a EuropeanCustomer table. This horizontal split might be the first step toward building an inheritance hierarchy (for example, EuropeanCustomer inherits from Customer) within your database and/or because you intend to add specific columns for Europeans.

Splitting a column. With Split Column (page 140), you need to move data to the new column(s) from the source.

Merging a table or column. Applying Merge Tables (page 96) or Merge Columns (page 92) requires you to move data from the source(s) into the target(s).

Consolidation of data without involving structural changes. Data is often stored in several locations. For example, you may have several customer-oriented tables, one to store Canadian customer information and several for customers who live in various parts of the United States, all of which have the same basic structure. As the result of a corporate reorganization, the data for the provinces of British Columbia and Alberta are moved from the CanadianCustomer table to the WestCoastCustomer table to reflect the data ownership within your organization.

Move data before removing a table or column. Applying Drop Table (page 77) or Drop Column (page 72) may require you to apply Move Data first if some or all of the data stored in the table or column is still required.

Potential Tradeoffs

Moving data between tables or columns can be tricky at any time, but it is particularly challenging when millions of rows are involved. While the move occurs, the applications accessing the data may be impacted: you will likely want to lock the data during the move, which affects both performance and the availability of the data to those applications.

Schema Update Mechanics

To perform Move Data, you must first identify the data to move and any dependencies on it. Should the corresponding row in the source table be deleted, should the column value in the corresponding row be nulled or zeroed out, or should it be left alone? Is the entire row being moved, or just some of its columns?

Second, you must identify the data destination. Where is the data being moved to: another table, or several tables? Is the data being transformed while it is being moved? When you move rows, you must make sure that any dependent tables with foreign key references to the source table are updated to reference the destination table.

Data-Migration Mechanics

When small amounts of data need to be moved, you will likely find that a simple SQL script that inserts the source data into the target location, and then deletes the source, is sufficient. For large amounts of data, you must take a more sophisticated approach because of the time it takes to move the data: you may need to export the source data and then import it into the target location, or use a bulk-loading utility such as Oracle's SQL*Loader.

Figure 7.12 depicts how we moved the data within the Customer.Status column to the CustomerStatusHistory table to enable us to track the status of a customer over time. The following code depicts the DML to move the data from the Customer.Status column to CustomerStatusHistory. First, we insert the data into CustomerStatusHistory from the Customer table, and then we update the Customer.Status column to NULL:

Figure 7.12. Moving status data from Customer to CustomerStatusHistory.

image

INSERT INTO CustomerStatusHistory (CustomerId,Status,EffectiveDate)
SELECT CustomerId,Status,Sysdate FROM Customer;
UPDATE Customer SET Status = NULL;


Access Program Update Mechanics

When Move Data is applied as the result of another database refactoring, the applications should have been updated to reflect that refactoring. However, when Move Data is applied to support a reorganization of data within an existing schema, it is possible that external application code will need to be updated. For example, SELECT clauses may need to be reworked to access the moved data from the new source(s) for it. The following code snippet shows how the getCustomerStatus() method changes:

//Before code
public String getCustomerStatus(Customer customer) {
  return customer.getStatus();
}

//After code
public String getCustomerStatus(Customer customer) throws SQLException {
  stmt = conn.prepare(
    "SELECT Status "+
    "FROM CustomerStatusHistory " +
    "WHERE " +
    "CustomerId = ? "+
    "AND EffectiveDate < TRUNC(SYSDATE) "+
    "ORDER BY EffectiveDate DESC");
  stmt.setLong(1,customer.getCustomerId());
  ResultSet rs = stmt.executeQuery();
  if (rs.next()) {
    return rs.getString("Status");
  }
  throw new CustomerStatusNotFoundInHistoryException();
}


Replace Type Code With Property Flags

Replace a code column with individual property flags, usually implemented as Boolean columns, within the same table.

Note that some database products support native Boolean columns. In other database products, you can use alternative data types for Boolean values (for example, NUMBER(1) in Oracle, where a value 1 means TRUE and 0 means FALSE).
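Where the flag is stored as a NUMBER(1)-style column, the access layer typically maps it to and from a true Boolean; a minimal sketch under that assumption:

```java
public class DbBooleanMapper {
    // Maps a NUMBER(1)-style flag column (1 = TRUE, 0 = FALSE)
    // to a Java boolean
    public static boolean toBoolean(int dbFlag) {
        return dbFlag == 1;
    }

    // Maps a Java boolean back to the flag value for inserts and updates
    public static int toDbFlag(boolean value) {
        return value ? 1 : 0;
    }
}
```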

Motivation

The refactoring Replace Type Code With Property Flags is used to do the following:

Denormalize for performance. When you want to improve performance, by having a column for each instance of the type code. For example, the Address.Type column has values of Home and Business; it would be replaced by isHome and isBusiness property flag columns. This enables a more efficient search because it is faster to compare two Boolean values than it is to compare two string values.

Simplify selections. Type code columns work well when the types are mutually exclusive; when they are not, however, they become problematic to search on. For example, assume that it is possible for an address to be either a home or a business. With the type code approach, you would need the Address.Type column values of Home, Business, and HomeAndBusiness, for example. To obtain a list of all business addresses, you would need to run a query along the lines of SELECT * FROM Address WHERE Type = “Business” OR Type = “HomeAndBusiness”. As you can imagine, this query would need to be updated if a new kind of address type, such as a vacation home address, was added that could also be a potential business. With the property flag column approach, the query would look like SELECT * FROM Address WHERE isBusiness = TRUE. This query is simple and would not need to be changed if a new address type was added.

Decouple applications from type code values. When you have multiple applications using Account.AccountType of Figure 7.13, making changes to the values becomes difficult because most applications will have hard-coded the values. When these type codes are replaced with property flag columns, the applications only have to check for the standard TRUE or FALSE values. With a type code column, the applications are coupled to the name of the column and the values within the column; with property flag columns, the applications are merely coupled to the names of the columns.

Figure 7.13. Replacing the AccountType code column with property flags.

image

Potential Tradeoffs

Every time you want to add a new type of value, you must change the table structure. For example, when you want to add a money market account type, you must add an IsMoneyMarket column to the Account table. This becomes undesirable after a while because tables with large numbers of columns are more difficult to understand than tables with fewer columns, and the result of joins with this table grows each time you add a new type column. However, it is very easy to add a column independently of the rest of the columns; with the type code column approach, the column is coupled to all the applications accessing it.

Schema Update Mechanics

To apply Replace Type Code With Property Flags to your database table, you must do the following:

1. Identify the type code to replace. Find the type code column that you need to replace, which is Account.AccountType in our example. You should also identify the instances of the type code to replace; for example, when you want to replace the AddressType code, you need to find all the different types of addresses and replace each instance of AddressType.

2. Introduce property flag columns. After you have identified the instances of the type codes that you want to replace, you have to add a column for each of them to the table using Introduce New Column (page 301). In Figure 7.13, we are going to add the HasChecks, HasFinancialAdviceRights, HasPassbook, and IsRetailCustomer columns to Account.

3. Remove the type code column. After all the type codes have been converted to property flags, you have to remove the type code column using Drop Column (page 72).

Figure 7.13 shows how we change the Account.AccountType column to use type flag columns for every instance of the AccountType data. In the example, we have four different types of accounts that will be converted to type flag columns, named HasChecks, HasFinancialAdviceRights, HasPassbook, and IsRetailCustomer. The following SQL depicts how to add the four type flag columns to the Account table, the synchronization trigger during the transition period, and the removal of the original column and the trigger after the transition period ends:

ALTER TABLE Account ADD COLUMN HasChecks Boolean;
COMMENT ON COLUMN Account.HasChecks IS 'Replaces AccountType of Checks';

ALTER TABLE Account ADD COLUMN HasFinancialAdviceRights Boolean;
COMMENT ON COLUMN Account.HasFinancialAdviceRights IS 'Replaces AccountType of FinancialAdviceRights';

ALTER TABLE Account ADD COLUMN HasPassbook Boolean;
COMMENT ON COLUMN Account.HasPassbook IS 'Replaces AccountType of Passbook';

ALTER TABLE Account ADD COLUMN IsRetailCustomer Boolean;
COMMENT ON COLUMN Account.IsRetailCustomer IS 'Replaces AccountType of RetailCustomer';

CREATE OR REPLACE TRIGGER SynchronizeAccountTypeColumns
BEFORE INSERT OR UPDATE
ON Account
REFERENCING OLD AS OLD NEW AS NEW
FOR EACH ROW

DECLARE
BEGIN
  IF (:NEW.HasChecks IS NULL AND :NEW.AccountType = 'CHECKS') THEN
    :NEW.HasChecks := TRUE;
  END IF;
  IF (:NEW.HasFinancialAdviceRights IS NULL AND :NEW.AccountType = 'FINANCIALADVICERIGHTS') THEN
    :NEW.HasFinancialAdviceRights := TRUE;
  END IF;
  IF (:NEW.HasPassbook IS NULL AND :NEW.AccountType = 'PASSBOOK') THEN
    :NEW.HasPassbook := TRUE;
  END IF;
  IF (:NEW.IsRetailCustomer IS NULL AND :NEW.AccountType = 'RETAILCUSTOMER') THEN
    :NEW.IsRetailCustomer := TRUE;
  END IF;
END;
/

-- On June 14 2007
ALTER TABLE Account DROP COLUMN AccountType;
DROP TRIGGER SynchronizeAccountTypeColumns;
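The synchronization trigger above is Oracle PL/SQL. As a portable sketch, the same transition-period behavior can be demonstrated in SQLite, whose triggers cannot assign to :NEW directly; this hypothetical version uses an AFTER INSERT trigger that updates the freshly inserted row (only two of the four flags are shown, and the table is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Account (
  AccountID   INTEGER PRIMARY KEY,
  AccountType TEXT,
  HasChecks   BOOLEAN,
  HasPassbook BOOLEAN
);

-- Keep the property flags in sync while legacy code still writes
-- only the type code column.
CREATE TRIGGER SynchronizeAccountTypeColumns
AFTER INSERT ON Account
BEGIN
  UPDATE Account
     SET HasChecks   = (NEW.AccountType = 'CHECKS'),
         HasPassbook = (NEW.AccountType = 'PASSBOOK')
   WHERE AccountID = NEW.AccountID;
END;
""")

# Legacy code inserts only the type code; the trigger fills in the flags.
conn.execute("INSERT INTO Account (AccountID, AccountType) VALUES (1, 'PASSBOOK')")
synced = conn.execute("SELECT HasChecks, HasPassbook FROM Account").fetchone()
print(synced)  # (0, 1)
```

Whichever dialect you use, the trigger exists only for the transition period and is dropped together with the type code column.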


Data-Migration Mechanics

You have to write data-migration scripts to update the property flags based on the value of the type code; you can use Update Data (page 310) to do so. During the transition period, when both the type code column and the property flag columns exist, we have to keep the columns synchronized so that the applications using the data get consistent information. This can be achieved by using database triggers. The following SQL shows how to update the existing data in the Account table (see Update Data for details):

UPDATE Account SET HasChecks = TRUE
  WHERE AccountType = 'CHECKS';
UPDATE Account SET HasChecks = FALSE
  WHERE HasChecks IS NULL;

UPDATE Account SET HasFinancialAdviceRights = TRUE
  WHERE AccountType = 'FINANCIALADVICERIGHTS';
UPDATE Account SET HasFinancialAdviceRights = FALSE
  WHERE HasFinancialAdviceRights IS NULL;

UPDATE Account SET HasPassbook = TRUE
  WHERE AccountType = 'PASSBOOK';
UPDATE Account SET HasPassbook = FALSE
  WHERE HasPassbook IS NULL;

UPDATE Account SET IsRetailCustomer = TRUE
  WHERE AccountType = 'RETAILCUSTOMER';
UPDATE Account SET IsRetailCustomer = FALSE
  WHERE IsRetailCustomer IS NULL;
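One pitfall in this backfill is three-valued logic: a comparison such as HasChecks != TRUE never matches rows where the flag is still NULL, so the "set the rest to FALSE" step must target NULLs explicitly. The following minimal sketch runs one such UPDATE pair against SQLite (illustrative table and data, not the book's Oracle schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Account (AccountID INTEGER PRIMARY KEY,
                      AccountType TEXT, HasChecks BOOLEAN);
INSERT INTO Account VALUES (1, 'CHECKS', NULL), (2, 'PASSBOOK', NULL);

-- First set TRUE for matching rows, then give every untouched
-- (still NULL) row an explicit FALSE.
UPDATE Account SET HasChecks = TRUE  WHERE AccountType = 'CHECKS';
UPDATE Account SET HasChecks = FALSE WHERE HasChecks IS NULL;
""")

rows = conn.execute(
    "SELECT AccountID, HasChecks FROM Account ORDER BY AccountID").fetchall()
print(rows)  # [(1, 1), (2, 0)]
```

After this pair of statements, every row has an explicit TRUE or FALSE, which is what the applications reading the new flag columns expect.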


Access Program Update Mechanics

When Replace Type Code With Property Flags is applied, you need to update external programs in two ways. First, the SQL code (or metadata) that saves, deletes, and retrieves data from this column must be updated to work with the individual property flag columns instead of AccountType. For example, SELECT * FROM Account WHERE AccountType = 'XXXX' changes to SELECT * FROM Account WHERE IsXXXX = TRUE. Similarly, you will have to change the program code to update the property flag columns rather than AccountType during an insert or update operation.
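As a quick, hypothetical check that such a rewrite preserves results, the following SQLite sketch runs the same selection both ways against illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Account (AccountID INTEGER PRIMARY KEY,
                      AccountType TEXT, HasChecks BOOLEAN);
INSERT INTO Account VALUES (1, 'CHECKS', TRUE), (2, 'PASSBOOK', FALSE);
""")

# Before the refactoring: filter on the type code column.
before = conn.execute(
    "SELECT AccountID FROM Account WHERE AccountType = 'CHECKS'").fetchall()

# After the refactoring: filter on the property flag column.
after = conn.execute(
    "SELECT AccountID FROM Account WHERE HasChecks = TRUE").fetchall()

print(before == after)  # True -- both return [(1,)]
```

Running both forms of each affected query against migrated data is a cheap regression test during the transition period.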

Second, you may need to update application code that works with the column. For example, comparison code such as Customer.AddressType = 'Home' must be updated to work with isHome.

The following code depicts how the Account class changes when you replace the account type code with property flags:

//Before code
public class Account {

  private Long accountID;
  private BigDecimal balance;
  private String accountType;

public Long getAccountID() {
  return accountID;
}

public BigDecimal getBalance() {
  return balance;
}

public Boolean HasChecks() {
  return accountType.equals("CHECKS");
}

public Boolean HasFinancialAdviceRights() {
  return accountType.equals("FINANCIALADVICERIGHTS");
}

public Boolean HasPassBook() {
  return accountType.equals("PASSBOOK");
}

public Boolean IsRetailCustomer() {
  return accountType.equals("RETAILCUSTOMER");
}
}

//After code
public class Account {

  private Long accountID;
  private BigDecimal balance;
  private Boolean HasChecks;
  private Boolean HasFinancialAdviceRights;
  private Boolean HasPassBook;
  private Boolean IsRetailCustomer;

public Long getAccountID() {
  return accountID;
}

public BigDecimal getBalance() {
  return balance;
}

public Boolean HasChecks() {
  return HasChecks;
}

public Boolean HasFinancialAdviceRights() {
  return HasFinancialAdviceRights;
}

public Boolean HasPassBook() {
  return HasPassBook;
}

public Boolean IsRetailCustomer() {
  return IsRetailCustomer;
}
}

