Chapter 3
The Process of Database Refactoring

A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it.

—Max Planck

This chapter describes how to implement a single refactoring within your database. We work through an example of applying the Move Column (page 103), a structural refactoring. Although this seems like a simple refactoring, and it is, you will see it can be quite complex to safely implement it within a production environment. Figure 3.1 overviews how we will move the Customer.Balance column to the Account table, a straightforward change to improve the database design.

Figure 3.1. Moving the Customer.Balance column to Account.

image

In Chapter 1, “Evolutionary Database Development,” we overviewed the concept of logical working sandboxes—development sandboxes in which developers have their own copy of the source code and database to work with; a project-integration environment where team members promote and then test their changes; preproduction environments for system, integration, and user acceptance testing; and production. The hard work of database refactoring is done within your development sandbox—it is considered, implemented, and tested before it is promoted into other environments. The focus of this chapter is on the work that is performed within your development sandbox. Chapter 4, “Deploying into Production,” covers the promotion and eventual deployment of your refactorings.

Because we are describing what occurs within your development sandbox, this process applies to both the single-application database as well as the multi-application database environments. The only real difference between the two situations is the need for a longer transition period (more on this later) in the multi-application scenario.

Figure 3.2 depicts a UML 2 Activity diagram that overviews the database refactoring process. The process begins with a developer who is trying to implement a new requirement to fix a defect. The developer realizes that the database schema may need to be refactored. In this example, Eddy, a developer, is adding a new type of financial transaction to his application and realizes that the Balance column actually describes Account entities, not Customer entities. Because Eddy follows common agile practices such as pair programming (Williams & Kessler 2002) and modeling with others (Ambler 2002), he decides to enlist the help of Beverley, the team’s database administrator (DBA), to help him to apply the refactoring. Together they iteratively work through the following activities:

Figure 3.2. The database refactoring process.

image

• Verify that a database refactoring is appropriate.

• Choose the most appropriate database refactoring.

• Deprecate the original database schema.

• Test before, during, and after.

• Modify the database schema.

• Migrate the source data.

• Modify external access program(s).

Run regression tests.

• Version control your work.

• Announce the refactoring.

3.1 Verify That a Database Refactoring Is Appropriate

First, Beverley determines whether the suggested refactoring needs to occur. There are three issues to consider:

1. Does the refactoring make sense?

Perhaps the existing table structure is correct. It is common for developers to either disagree with, or to simply misunderstand, the existing design of a database. This misunderstanding could lead them to believe that the design needs to change when it really does not. The DBA should have a good knowledge of the project team’s database, other corporate databases, and will know whom to contact about issues such as this. Therefore, they will be in a better position to determine whether the existing schema is the best one. Furthermore, the DBA often understands the bigger picture of the overall enterprise, providing important insight that may not be apparent when you look at it from the point of view of the single project. However, in our example, it appears that the schema needs to change.

2. Is the change actually needed now?

This is usually a “gut call” based on her previous experience with the application developer. Does Eddy have a good reason for making the schema change? Can Eddy explain the business requirement that the change supports? Does the requirement feel right? Has Eddy suggested good changes in the past? Has Eddy changed his mind several days later, requiring Beverley to back out of the change? Depending on this assessment, Beverley may suggest that Eddy think the change through some more or may decide to continue working with him, but will wait for a longer period of time before they actually apply the change in the project-integration environment (Chapter 4) if they believe the change will need to be reversed.

3. Is it worth the effort?

The next thing that Beverley does is to assess the overall impact of the refactoring. To do this, Beverley should have an understanding of how the external program(s) are coupled to this part of the database. This is knowledge that Beverley has built up over time by working with the enterprise architects, operational database administrators, application developers, and other DBAs. When Beverley is not sure of the impact, she needs to make a decision at the time and go with her gut feeling or decide to advise the application developer to wait while she talks to the right people. Her goal is to ensure that she implements database refactorings that will succeed—if you are going to need to update, test, and redeploy 50 other applications to support this refactoring, it may not be viable for her to continue. Even when there is only one application accessing the database, it may be so highly coupled to the portion of the schema that you want to change that the database refactoring simply is not worth it. In our example, the design problem is so clearly severe that she decides to implement it even though many applications will be affected.

3.2 Choose the Most Appropriate Database Refactoring

As you can see in this book, you could potentially apply a large number of refactorings to your database schema. To determine which is the most appropriate refactoring for your situation, you must first analyze and understand the problem you face. When Eddy first approached Beverley, he may or may not have done this analysis. For example, he may have just gone to her and said that the Account table needs to store the current balance; therefore, we need to add a new column (via the Introduce Column transformation on page 180). However, what he did not realize was that the column already exists in the Customer table, which is arguably the wrong place for it to be—Eddy had identified the problem correctly, but had misidentified the solution. Based on her knowledge of the existing database schema, and her understanding of the problem identified by Eddy, Beverley instead suggests that they apply the Move Column (page 103) refactoring.

3.3 Deprecate the Original Database Schema

If multiple applications access your database, you likely need to work under the assumption that you cannot refactor and then deploy all of these programs simultaneously. Instead, you need a transition period, also called a deprecation period, for the original portion of the schema that you are changing (Sadalage & Schuh 2002; Ambler 2003). During the transition period, you support both the original and new schemas in parallel to provide time for the other application teams to refactor and redeploy their systems. Typical transition periods last for several quarters, if not years. The potentially long time to fully implement a refactoring underscores the need to automate as much of the process as possible. Over a several-year period, people within your department will change, putting you at risk if parts of the process are manual. Having said that, even in the case of a single-application database, your team may still require a transition period of a few days within your project-integration sandbox—your teammates need to refactor and retest their code to work with the updated database schema.

Figure 3.3 depicts the life cycle of a database refactoring within a multi-application scenario. You first implement it within the scope of your project, and if successful, you eventually deploy it into production. During the transition period, both the original schema and the new schema exist, with sufficient scaffolding code to ensure that any updates are correctly supported. During the transition period, you need to assume two things: first, that some applications will use the original schema whereas others will use the new schema; and second, that applications should only have to work with one but not both versions of the schema. In our example, some applications will work with Customer.Balance and others with Account.Balance, but not both simultaneously. Regardless of which column they work with, the applications should all run properly. When the transition period has expired, the original schema plus any scaffolding code is removed and the database retested. At this point, the assumption is that all applications work with Account.Balance.

Figure 3.3. The life cycle of a database refactoring in a multi-application scenario.

image

Figure 3.4 depicts the original database schema, and Figure 3.5 shows what the database schema would look like during the transition period for when we apply the Move Column database refactoring to Customer.Balance. In Figure 3.5, the changes are shown in bold, a style that we use throughout the book. Notice how both versions of the schema are supported during this period. Account.Balance has been added as a column, and Customer.Balance has been marked for removal on or after June 14, 2006. A trigger was also introduced to keep the values contained in the two columns synchronized, the assumption being that new application code will work with Account.Balance but will not keep Customer.Balance up-to-date. Similarly, we assume that older application code that has not been refactored to use the new schema will not know to keep Account.Balance up-to-date. This trigger is an example of database scaffolding code, simple and common code that is required to keep your database “glued together.” This code has been assigned the same removal date as Customer.Balance.

Figure 3.4. The original Customer/Account schema.

image

Figure 3.5. Supporting both versions of the schema.

image

Not all database refactorings require a transition period. For example, neither Introduce Column Constraint (page 180) nor Apply Standard Codes (page 157) database refactorings require a transition period because they simply improve the data quality by narrowing the acceptable values within a column. A narrower value may break existing applications, so beware of the refactorings.

Chapter 5, “Database Refactoring Strategies,” discusses strategies for choosing an appropriate transition period.

3.4 Test Before, During, and After

You can have the confidence to change your database schema if you can easily validate that the database still works with your application after the change, and the only way to do that is to take a Test-Driven Development (TDD) approach, as suggested in Chapter 1. With a TDD-based approach, you write a test and then you write just enough code, often data definition language (DDL), to fulfill the test. You continue in this manner until the database refactoring has been implemented fully. You will potentially need to write tests that do the following:

• Test your database schema.

• Test the way your application uses the database schema.

• Validate your data migration.

• Test your external program code.

3.4.1 Testing Your Database Schema

Because a database refactoring will affect your database schema, you need to write database-oriented tests. Although this may sound strange at first, you can validate many aspects of a database schema:

Stored procedures and triggers. Stored procedures and triggers should be tested just like your application code would be.

Referential integrity (RI). RI rules, in particular cascading deletes in which highly coupled “child” rows are deleted when a parent row is deleted, should also be validated. Existence rules, such as a customer row corresponding to an account row, must exist before the row can be inserted into the Account table, and can be easily tested, too.

View definitions. Views often implement interesting business logic. Things to look out for include: Does the filtering/select logic work properly? Do you get back the right number of rows? Are you returning the right columns? Are the columns, and rows, in the right order?

Default values. Columns often have default values defined for them. Are the default values actually being assigned? (Someone could have accidentally removed this part of the table definition.)

Data invariants. Columns often have invariants, implemented in the forms of constraints, defined for them. For example, a number column may be restricted to containing the values 1 through 7. These invariants should be tested.

Database testing is new to many people, and as a result you are likely to face several challenges when adopting database refactoring as a development technique:

Insufficient testing skills. This problem can be overcome through training, through pairing with someone with good testing skills (pairing a DBA without testing skills and a tester without DBA skills still works), or simply through trial and error. The important thing is that you recognize that you need to pick up these skills.

Insufficient unit tests for your database. Few organizations have yet to adopt the practice of database testing, so it is likely that you will not have a sufficient test suite for your existing schema. Although this is unfortunate, there is no better time than the present to start writing your test suite.

Insufficient database testing tools. Luckily, tools such as DBUnit (dbunit.sourceforge.net) for managing test data and SQLUnit (sqlunit.sourceforge.net) for testing stored procedures are available as open source software (OSS). In addition, several commercial tools are available for database testing. However, at the time of this writing, there is still significant opportunity for tool vendors to improve their database testing offerings.

So how would we test the changes to the database schema? As you can see in Figure 3.5, there are two changes to the schema during the transition period that we must validate. The first one is the addition of the Balance column to the Account table. This change is covered by our data migration and external program testing efforts, discussed in the following sections. The second change is the addition of the two triggers, SynchronizeAccountBalance and SynchronizeCustomerBalance, which, as their names imply, keep the two data columns synchronized. We need tests to ensure that if Customer.Balance is updated that Account.Balance is similarly updated, and vice versa.

3.4.2 Validating Your Data Migration

Many database refactorings require you to migrate and sometimes even cleanse the source data. In our example, we must copy the data values from Customer.Balance to Account.Balance as part of implementing the refactoring. In this case, we want to validate that the correct balance was in fact copied over for individual customers.

In refactorings such as Apply Standard Codes (page 157) and Consolidate Key Strategy (page 168), you actually “cleanse” data values. This cleansing logic must be validated. With the first refactoring, you may convert code values such as USA and U.S. all to the standard value of US throughout your database. You would want to write tests to validate that the older codes were no longer being used and that they were converted properly to the official value. With the second refactoring, you might discover that customers are identified via their customer ID in some tables, by their social security number (SSN) in other tables, and by their phone number in other tables. You would want to choose one way to identify customers, perhaps by their customer ID, and then refactor the other tables to use this type of column instead. In this case, you would want to write tests to verify that the relationship between the various rows was still being maintained properly. (For example, if the telephone number 555-1234 referenced the Sally Jones customer record, the Sally Jones record should still be getting referenced when you replace it with customer ID 987654321.)

3.4.3 Testing Your External Access Programs

Your database is accessed by one or more programs, including the application that you are working on. These programs should be validated just like any other IT asset within your organization. To successfully refactor your database, you need to be able to introduce the final schema, shown in Figure 3.6, and see what breaks in your external access programs. The only way that you can have the confidence to refactor your database schema is if you have a full regression test suite for these programs—yes, we realize that you likely do not have these test suites. Once again, there is no better time than the present to start building up your test suite. We suggest that you write all the testing code you require to support each individual database refactoring for all external access programs. (Actually, the owners of these systems need to write those tests, not you.) If you work this way, over time you will build up the test suite that you require.

Figure 3.6. The final version of the database schema.

image

3.5 Modify the Database Schema

Eddy and Beverley work together to make the changes within their development sandbox. As you see in Figure 3.5, they need to add the Account.Balance column as well as the two triggers, SynchronizeAccountBalance and SynchronizeCustomerBalance. The DDL code to do this is shown here:

ALTER TABLE Account ADD Balance Numeric;
COMMENT ON Account.Balance 'Move of Customer.Balance column, finaldate = 2006-06-14';

CREATE OR REPLACE TRIGGER SynchronizeCustomerBalance
 BEFORE INSERT OR UPDATE
 ON Account
 REFERENCING OLD AS OLD NEW AS NEW
 FOR EACH ROW
 DECLARE
 BEGIN
  IF :NEW.Balance IS NOT NULL THEN
   UpdateCustomerBalance;
  END IF;
 END;
/
COMMENT ON SynchronizeCustomerBalance 'Move of Customer.Balance column to Account,
dropdate = 2006-06-14';

CREATE OR REPLACE TRIGGER SynchronizeAccountBalance
 BEFORE INSERT OR UPDATE OR DELETE
 ON Customer
 REFERENCING OLD AS OLD NEW AS NEW
 FOR EACH ROW
 DECLARE
 BEGIN
  IF DELETING THEN
   DeleteCustomerIfAccountNotFound;
  END IF;
  IF (UPDATING OR INSERTING) THEN
   IF :NEW.Balance IS NOT NULL THEN
    UpdateAccountBalanceForCustomer;
   END IF;
  END IF;
 END;
/
COMMENT ON SynchronizeAccountBalance 'Move of Customer.Balance column to Account,
dropdate = 2006-06-14'


At the time of this writing, no automated database refactoring tools are available—therefore, you need to code everything by hand for now. Do not worry. This will change in time. For now, you want to write a single script containing the preceding code that you can apply against the database schema. We suggest assigning a unique, incremental number to each script. The easiest way to do so is just to start at the number one and increment a counter each time you define a new database refactoring—the easiest way to do that is to use the build number of your application. However, to make this strategy work within a multiple team environment, you need a way to either assign unique numbers across all teams or to add a unique team identifier to the individual refactorings. Fundamentally, you need to be able to differentiate between Team A’s refactoring number 1701 and Team B’s refactoring number 1701. Another option, discussed in more detail in Chapter 5, is to assign timestamps to the refactoring.

There are several reasons why you want to work with small scripts for individual refactorings:

Simplicity. Small, focused change scripts are easier to maintain than scripts comprising many steps. If you discover that a refactoring should not be performed because of unforeseen problems (perhaps you cannot update a major application that accesses the changed portion of the schema), for example, you want to be able to easily not perform that refactoring.

Correctness. You want to be able to apply each refactoring, in the appropriate order, to your database schema so as to evolve it in a defined manner. Refactorings can build upon each other. For example, you might rename a column and then a few weeks later move it to another table. The second refactoring would depend on the first refactoring because its code would refer to the new name of the column.

Versioning. Different database instances will have different versions of your database schema. For example, Eddy’s development sandbox may have version 163, the project-integration sandbox version 161, the QA/Test sandbox version 155, and the production database version 134. To migrate the project-integration sandbox schema to version 163, you should merely have to apply database refactoring 162 and 163. To keep track of the version number, you need to introduce a common table, such as DatabaseConfiguration, that stores the current version number among other things. This table is discussed in further detail in Chapter 5.

The following DDL code must be run against your database after the transition period has ended (discussed in Chapter 4). Similarly, this code should be captured in a single script file, along with the identifier of 163 in this case, and run in sequential order against your database schema as appropriate.

ALTER TABLE Customer DROP COLUMN Balance;
DROP TRIGGER SynchronizeAccountBalance;
DROP TRIGGER SynchronizeCustomerBalance;


3.6 Migrate the Source Data

Many database refactorings require you to manipulate the source data in some way. Sometimes, you just need to move data from one location to another, something we need to do with Move Data (page 192). Other times, you need to cleanse the values of the data itself; this is common with the data quality refactorings (Chapter 7, “Data Quality Refactorings”) such as Apply Standard Type (page 162) and Introduce Common Format (page 183).

Similar to modifying your database schema, you will potentially need to create a script to perform the required data migration. This script should have the same identification number as your other script to make them easy to manage. In our example of moving the Customer.Balance column to Account, the data migration script would contain the following data manipulation language (DML) code:

/*
One time migration of data from Customer.Balance to Account.Balance.
*/

UPDATE Account SET Balance =
 (SELECT Balance FROM Customer
  WHERE CustomerID = Account.CustomerID);


Depending on the quality of the existing data, you may quickly discover the need to further cleanse the source data. This would require the application of one or more data quality database refactorings. It is good practice to keep your eye out for data quality problems when you are working through structural and architectural database refactorings. Data quality problems are quite common with legacy database designs that have been allowed to degrade over time.

3.7 Refactor External Access Program(s)

When your database schema changes, you will often need to refactor any existing external programs that access the changed portion of the schema. As you learned in Chapter 2, “Database Refactoring,” this includes legacy applications, persistence frameworks, data replication code, and reporting systems, to name a few.

Several good books provide guidance for effective refactoring of external access programs:

Refactoring: Improving the Design of Existing Code (Fowler 1999) is the classic text on the subject.

Working Effectively with Legacy Code (Feathers 2004) describes how to refactor legacy systems that have existed within your organization for many years.

Refactoring to Patterns (Kerievsky 2004) describes how to methodically refactor your code to implement common design and architectural patterns.

When many programs access your database, you run the risk that some of them will not be updated by the development teams responsible for them, or worse yet they may not even be assigned to a team at the present moment. The implication is that someone will need to be assigned responsibility for updating the application(s), as well as the responsibility to burden the cost. Hopefully, other teams are responsible for these external programs; otherwise, your team will need to accept responsibility for making the required changes. It is frustrating to discover that the political challenges surrounding the need to update other systems often far outweigh the technical challenges of doing so.

So what do you do when there is no funding to update the external programs? You have two basic strategies from which to choose. First, make the database refactoring and assign it a transition period of several decades. This way the external programs that you cannot change still work; however, other applications can access the improved design. This strategy has the unfortunate disadvantage that the scaffolding code to support both schemas will exist for a long time, reducing database performance and cluttering your database. The second strategy is to not do the refactoring.

3.8 Run Your Regression Tests

Part of implementing your refactoring is to test it to ensure that it works. As indicated earlier, you will test a little, change a little, test a little, and so on until the refactoring is complete. Your testing activities should be automated as much as possible. A significant advantage of database refactoring is that because the refactorings represent small changes, when a test breaks you have a pretty good idea where the problem lies—where you just made the change.

3.9 Version Control Your Work

When your database refactoring is successful, you should put all your work under configuration management (CM) control by checking it into a version control tool. If you treat your database-oriented artifacts the exact same way that you treat your source code, you should be okay. Artifacts to version control include the following:

• Any scripts that you have created

• Test data and/or generation code

• Test cases

• Documentation

• Models

3.10 Announce the Refactoring

A database is a shared resource. Minimally, it is shared within your application development team, if not by several application teams. Therefore, you need to communicate to interested parties that the database refactoring has been made. Early in the life cycle of the refactoring, you need to communicate the changes within your team, something that could be as simple as announcing the change at your team’s next standup meeting. In a multi-application database environment, you must communicate the changes to other teams, particularly when you decide to promote the refactoring into your preproduction test environments. This communication might be a simple e-mail on an internal mailing list specifically used to announce database changes, it could be a line item in your regular project status report, or it could be a formal report to your operational database administration group.

An important aspect of your announcement efforts will be the update of any relevant documentation. This documentation will be critical during your promotion and deployment efforts (see Chapter 4) because the other teams need to know how the database schema has evolved. A simple approach is to develop database release notes that summarize the changes that you have made, listing each database refactoring in order. Our example refactoring would appear in this list as “163: Move the Customer.Balance column into the Account table.” These release notes will likely be required by enterprise administrators so that they can update the relevant meta data. (Better yet, your team should update this meta data as part of their refactoring efforts.)

You will want to update the physical data model (PDM) for your database. Your PDM is the primary model describing your database schema and is often one of the few “keeper” models created on application development projects, and therefore should be kept up-to-date as much as possible.

3.11 What You Have Learned

The hard work of database refactoring is done within your development sandbox, hopefully by a developer paired with a DBA. The first step is to verify that a database refactoring is even appropriate—perhaps the cost of performing the refactoring currently outweighs the benefit, or perhaps the current schema is the best design for that specific issue. If a refactoring is required, you must choose the most appropriate one to get the job done. In a multi-application environment, many refactorings require you to run both the original and new versions of the schema in parallel during a transition period that is long enough to allow any applications accessing that portion of the schema time to be redeployed.

To implement the refactoring, you should take a Test-First approach to increase the chance that you detect any breakages introduced by the refactoring. You must modify the database schema, potentially migrate any relevant source data, and then modify any external programs that access the schema. All of your work should be version controlled, and after the refactoring has been implemented within your development environment, it should be announced to your teammates and then eventually to appropriate external teams that might need to know about the schema change.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset