Chapter 2
Database Refactoring

As soon as one freezes a design, it becomes obsolete.

—Fred Brooks

This chapter overviews the fundamental concepts behind database refactoring, explaining what it is, how it fits into your development efforts, and why it is often hard to do successfully. In the following chapters, we describe in detail the actual process of refactoring your database schema.

2.1 Code Refactoring

In Refactoring, Martin Fowler (1999) describes the programming technique called refactoring, which is a disciplined way to restructure code in small steps. Refactoring enables you to evolve your code slowly over time, to take an evolutionary (iterative and incremental) approach to programming. A critical aspect of a refactoring is that it retains the behavioral semantics of your code. You do not add functionality when you are refactoring, nor do you take it away. A refactoring merely improves the design of your code—nothing more and nothing less. For example, in Figure 2.1 we apply the Push Down Method refactoring to move the calculateTotal() operation from Offering into its subclass Invoice. This change looks easy on the surface, but you may also need to change the code that invokes this operation to work with Invoice objects rather than Offering objects. After you have made these changes, you can say you have truly refactored your code because it works again as before.

Figure 2.1. Pushing a method down into a subclass.

image

Clearly, you need a systematic way to refactor your code, including good tools and techniques to do so. Most modern integrated development environments (IDEs) now support code refactoring to some extent, which is a good start. However, to make refactoring work in practice, you also need to develop an up-to-date regression-testing suite that validates that your code still works—you will not have the confidence to refactor your code if you cannot be reasonably assured that you have not broken it.

Many agile developers, and in particular Extreme Programmers (XPers), consider refactoring to be a primary development practice. It is just as common to refactor a bit of code as it is to introduce an if statement or a loop. You should refactor your code mercilessly because you are most productive when you are working on high-quality source code. When you have a new feature to add to your code, the first question that you should ask is “Is this code the best design possible that enables me to add this feature?” If the answer is yes, add the feature. If the answer is no, first refactor your code to make it the best design possible, and then add the feature. On the surface, this sounds like a lot of work; in practice, however, if you start with high-quality source code, and then refactor it to keep it so, you will find that this approach works incredibly well.

2.2 Database Refactoring

A database refactoring (Ambler 2003) is a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics—in other words, you cannot add new functionality or break existing functionality, nor can you add new data or change the meaning of existing data. From our point of view, a database schema includes both structural aspects, such as table and view definitions, and functional aspects, such as stored procedures and triggers. From this point forward, we use the terms code refactoring to refer to traditional refactoring as described by Martin Fowler and database refactoring to refer to the refactoring of database schemas. The process of database refactoring, described in detail in Chapter 3, is the act of making these simple changes to your database schema.

Database refactorings are conceptually more difficult than code refactorings: Code refactorings only need to maintain behavioral semantics, whereas database refactorings must also maintain informational semantics. Worse yet, database refactorings can become more complicated by the amount of coupling resulting from your database architecture, overviewed in Figure 2.2. Coupling is a measure of the dependence between two items; the more highly coupled two things are, the greater the chance that a change in one will require a change in another. The single-application database architecture is the simplest situation—your application is the only one interacting with your database, enabling you to refactor both in parallel and deploy both simultaneously. These situations do exist and are often referred to as standalone applications or stovepipe systems. The second architecture is much more complicated because you have many external programs interacting with your database, some of which are beyond the scope of your control. In this situation, you cannot assume that all the external programs will be deployed at once, and must therefore support a transition period (also referred to as a deprecation period) during which both the old schema and the new schema are supported in parallel. More on this later.

Figure 2.2. The two categories of database architecture.

image

Although we discuss the single-application environment throughout the book, we focus more on the multi-application environment, in which your database currently exists in production and is accessed by many other external programs over which you have little or no control. Don’t worry. In Chapter 3, we describe strategies for working in this sort of situation.

To put database refactoring into context, let’s step through a quick example. You have been working on a banking application for a few weeks and have noticed something strange about the Customer and Account tables depicted in Figure 2.3. Does it really make sense that the Balance column be part of the Customer table? No, so let’s apply the Move Column (page 103) refactoring to improve our database design.

Figure 2.3. The initial database schema for Customer and Account.

image

2.2.1 Single-Application Database Environments

Let’s start by working through an example of moving a column from one table to another within a single-application database environment. This is the simplest situation that you will ever be in, because you have complete control over both the database schema and the application source code that accesses it. The implication is that you can refactor both your database schema and your application code simultaneously—you do not need to support both the original and new database schemas in parallel because only the one application accesses your database.

In this scenario, we suggest that two people work together as a pair; one person should have application programming skills, and the other database development skills, and ideally both people have both sets of skills. This pair begins by determining whether the database schema needs to be refactored. Perhaps the programmer is mistaken about the need to evolve the schema, and how best to go about the refactoring. The refactoring is first developed and tested within the developer’s sandbox. When it is finished, the changes are promoted into the project-integration environment, and the system is rebuilt, tested, and fixed as needed.

To apply the Move Column (page 103) refactoring in the development sandbox, the pair first runs all the tests to see that they pass. Next, they write a test because they are taking a Test-Driven Development (TDD) approach. A likely test is to access a value in the Account.Balance column. After running the tests and seeing them fail, they introduce the Account.Balance column, as you see in Figure 2.4. They rerun the tests and see that the tests now pass. They then refactor the existing tests, which verify that customer deposits work properly with the Account.Balance column rather than the Customer.Balance column. They see that these tests fail, and therefore rework the deposit functionality to work with Account.Balance. They make similar changes to other code within the tests suite and the application, such as withdrawal logic, that currently works with Customer.Balance.

Figure 2.4. The final database schema for Customer and Account.

image

After the application is running again, they then back up the data in Customer.Balance, for safety purposes, and then copy the data from Customer.Balance into the appropriate row of Account.Balance. They rerun their tests to verify that the data migration has safely occurred. To complete the schema changes, the final step is to drop the Customer.Balance column and then rerun all tests and fix anything as necessary. When they finish doing so, they promote their changes into the project-integration environment as described earlier.

2.2.2 Multi-Application Database Environments

This situation is more difficult because the individual applications have new releases deployed at different times over the next year and a half. To implement this database refactoring, you do the same sort of work that you did for the single-application database environment, except that you do not delete the Customer.Balance column right away. Instead, you run both columns in parallel during a “transition period” of at least 1.5 years to give the development teams time to update and redeploy all of their applications. This portion of the database schema during the transition period is shown in Figure 2.5. Notice how there are two triggers, SynchronizeCustomerBalance and SynchronizeAccountBalance, which are run in production during the transition period to keep the two columns in sync.

Figure 2.5. The database schema during the transition period.

image

Why such a long period of time for the transition period? Because some applications currently are not being worked on, whereas other applications are following a traditional development life cycle and only release every year or so—your transition period must take into account the slow teams as well as the fast ones. Furthermore, because you cannot count on the individual applications to update both columns, you need to provide a mechanism such as triggers to keep their values synchronized. There are other options to do this, such as views or synchronization after the fact, but as we discuss in Chapter 5, “Database Refactoring Strategies,” we find that triggers work best.

After the transition period, you remove the original column plus the trigger(s), resulting in the final database schema of Figure 2.4. You remove these things only after sufficient testing to ensure that it is safe to do so. At this point, your refactoring is complete. In Chapter 3, we work through implementing this example in detail.

2.2.3 Maintaining Semantics

When you refactor a database schema, you must maintain both the informational and behavioral semantics—you should neither add anything nor take anything away. Informational semantics refers to the meaning of the information within the database from the point of view of the users of that information. Preserving the informational semantics implies that if you change the values of the data stored in a column, the clients of that information should not be affected by the change—for example, if you apply the Introduce Common Format (page 183) database refactoring to a character-based phone number column to transform data such as (416) 555-1234 and 905.555.1212 into 4165551234 and 9055551212, respectively. Although the format has been improved, requiring simpler code to work with the data, from a practical point of view the true information content has not. Note that you would still choose to display phone numbers in (XXX) XXX-XXXX format; you just would not store the information in that manner.

Focusing on practicality is a critical issue when it comes to database refactoring. Martin Fowler likes to talk about the issue of “observable behavior” when it comes to code refactoring, his point being that with many refactorings you cannot be completely sure that you have not changed the semantics in some small way, that all you can hope for is to think it through as best you can, to write what you believe to be sufficient tests, and then run those tests to verify that the semantics have not changed. In our experience, a similar issue exists when it comes to preserving information semantics when refactoring a database schema—changing (416) 555-1234 to 4165551234 may in fact have changed the semantics of that information for an application in some slightly nuanced way that we do not know about. For example, perhaps a report exists that somehow only works with data rows that have phone numbers in the (XXX) XXX-XXXX format, and the report relies on that fact. Now the report is outputting numbers in the XXXXXXXXXX format, making it harder to read, even though from a practical sense the same information is still being output. When the problem is eventually discovered, the report may need to be updated to reflect the new format.

Similarly, with respect to behavioral semantics, the goal is to keep the black-box functionality the same—any source code that works with the changed aspects of your database schema must be reworked to accomplish the same functionality as before. For example, if you apply Introduce Calculation Method (page 245), you may want to rework other existing stored procedures to invoke that method rather than implement the same logic for that calculation. Overall, your database still implements the same logic, but now the calculation logic is just in one place.

It is important to recognize that database refactorings are a subset of database transformations. A database transformation may or may not change the semantics; a database refactoring does not. We describe several common database transformations in Chapter 11, “Non-Refactoring Transitions,” because they are not only important to understand, they can often be a step within a database refactoring. For example, when applying the Move Column earlier to move the Balance column from Customer to Account, you needed to apply the Introduce Column transformation (page 180) as one of the steps.

On the surface, the Introduce Column sounds like a perfectly fine refactoring; adding an empty column to a table does not change the semantics of that table until new functionality begins to use it. We still consider it a transformation (but not a refactoring) because it could inadvertently change the behavior of an application. For example, if we introduce the column in the middle of the table, any program logic using positional access (for example, code that refers to column 17 rather than the column’s name) will break. Furthermore, COBOL code bound to a DB2 table will break if it is not rebound to the new schema, even if the column is added at the end of the table. In the end, practicality should be your guide. If we were to label Introduce Column as a refactoring, or as a “Yabba Dabba Do” for all that matter, would it affect the way that you use it? We hope not.

2.3 Categories of Database Refactorings

We also distinguish six different categories of database refactorings, as described in Table 2.1. This categorization strategy was introduced to help organize this book, and hopefully to help organize future database refactoring tools. Our categorization strategy is not perfect; for example, the Replace Method With View refactoring (page 265) arguably fits into both the Architectural and Method categories. (We have put it into the Architectural category.)

Table 2.1. Database Refactoring Categories

image

2.4 Database Smells

Fowler (1997) introduced the concept of a “code smell,” a common category of problem in your code that indicates the need to refactor it. Common code smells include switch statements, long methods, duplicated code, and feature envy. Similarly, there are common database smells that indicate the potential need to refactor it (Ambler 2003). These smells include the following:

Multipurpose column. If a column is being used for several purposes, it is likely that extra code exists to ensure that the source data is being used the “right way,” often by checking the values of one or more other columns. An example is a column used to store either someone’s birth date if he or she is a customer or the start date if that person is an employee. Worse yet, you are likely constrained in the functionality that you can now support—for example, how would you store the birth date of an employee?

Multipurpose table. Similarly, when a table is being used to store several types of entities, there is likely a design flaw. An example is a generic Customer table that is used to store information about both people and corporations. The problem with this approach is that data structures for people and corporations differ—people have a first, middle, and last name, for example; whereas a corporation simply has a legal name. A generic Customer table would have columns that are NULL for some kinds of customers but not others.

Redundant data. Redundant data is a serious problem in operational databases because when data is stored in several places, the opportunity for inconsistency occurs. For example, it is quite common to discover that customer information is stored in many different places within your organization. In fact, many companies are unable to put together an accurate list of who their customers actually are. The problem is that in one table John Smith lives at 123 Main Street, and in another table at 456 Elm Street. In this case, this is actually one person who used to live at 123 Main Street but who moved last year; unfortunately, John did not submit two change of address forms to your company, one for each application that knows about him.

Tables with too many columns. When a table has many columns, it is indicative that the table lacks cohesion—that it is trying to store data from several entities. Perhaps your Customer table contains columns to store three different addresses (shipping, billing, seasonal) or several phone numbers (home, work, cell, and so on). You likely need to normalize this structure by adding Address and PhoneNumber tables.

Tables with too many rows. Large tables are indicative of performance problems. For example, it is time-consuming to search a table with millions of rows. You may want to split the table vertically by moving some columns into another table, or split it horizontally by moving some rows into another table. Both strategies reduce the size of the table, potentially improving performance.

“Smart” columns. A smart column is one in which different positions within the data represent different concepts. For example, if the first four digits of the client ID indicate the client’s home branch, then client ID is a smart column because you can parse it to discover more granular information (for example, home branch ID). Another example includes a text column used to store XML data structures; clearly, you can parse the XML data structure for smaller data fields. Smart columns often need to be reorganized into their constituent data fields at some point so that the database can easily deal with them as separate elements.

Fear of change. If you are afraid to change your database schema because you are afraid to break something—for example, the 50 applications that access it—that is the surest sign that you need to refactor your schema. Fear of change is a good indication that you have a serious technical risk on your hands, one that will only get worse over time.

It is important to understand that just because something smells, it does not mean that it is bad—limburger cheese smells even when it is perfectly fine. However, when milk smells bad, you know that you have a problem. If something smells, look at it, think about it, and refactor it if it makes sense.

2.5 How Database Refactoring Fits In

Modern software development processes, including the Rational Unified Process (RUP), Extreme Programming (XP), Agile Unified Process (AUP), Scrum, and Dynamic System Development Method (DSDM), are all evolutionary in nature. Craig Larman (2004) summarizes the research evidence, as well as the overwhelming support among the thought leaders within the IT community, in support of evolutionary approaches. Unfortunately, most data-oriented techniques are serial in nature, relying on specialists performing relatively narrow tasks, such as logical data modeling or physical data modeling. Therein lies the rub—the two groups need to work together, but both want to do so in different manners.

Our position is that data professionals can benefit from adopting modern evolutionary techniques similar to those of developers, and that database refactoring is one of several important skills that data professionals require. Unfortunately, the data community missed the object revolution of the 1990s, which means they missed out on opportunities to learn the evolutionary techniques that application programmers now take for granted. In many ways, the data community is also missing out on the agile revolution, which is taking evolutionary development one step further to make it highly collaborative and cooperative.

Database refactoring is a database implementation technique, just like code refactoring is an application implementation technique. You refactor your database schema to ease additions to it. You often find that you have to add a new feature to a database, such as a new column or stored procedure, but the existing design is not the best one possible to easily support that new feature. You start by refactoring your database schema to make it easier to add the feature, and after the refactoring has been successfully applied, you then add the feature. The advantage of this approach is that you are slowly, but constantly, improving the quality of your database design. This process not only makes your database easier to understand and use, it also makes it easier to evolve over time; in other words, you improve your overall development productivity.

Figure 2.6 provides a high-level overview of the critical development activities that occur on a modern project working with both object and relational database technologies. Notice how all the arrows are bidirectional. You iterate back and forth between activities as needed. Also notice how there is neither a defined starting point nor a defined ending point—this clearly is not a traditional, serial process.

Figure 2.6. Potential development activities on an evolutionary development project.

image

Database refactoring is only part of the evolutionary database development picture. You still need to take an evolutionary/agile approach to data modeling. You still need to test your database schema and put it under configuration management control. And, you still need to tune it appropriately. These are topics better left to other books.

2.6 Making It Easier to Refactor Your Database Schema

The greater the coupling, the harder it is to refactor something. This is true of code refactoring, and it is certainly true of database refactoring. Our experience is that coupling becomes a serious issue when you start to consider behavioral issues (for example, code), something that many database books choose not to address. The easiest scenario is clearly the single-application database because your database schema will only be coupled to itself and to your application. With the multi-application database architecture depicted in Figure 2.7, your database schema is potentially coupled to application source code, persistence frameworks and Object-Relational Mapping (ORM) tools, other databases (via replication, data extracts/loads, and so on), data file schemas, testing code, and even to itself.

Figure 2.7. Databases are highly coupled to external programs.

image

An effective way to decrease the coupling that your database is involved with is to encapsulate access to it. You do this by having external programs access your database via persistence layers, as depicted in Figure 2.8. A persistence layer can be implemented in several ways—via data access objects (DAOs), which implement the necessary SQL code; by frameworks; via stored procedures; or even via Web services. As you see in the diagram, you can never get the coupling down to zero, but you can definitely reduce it to something manageable.

Figure 2.8. Reducing coupling via encapsulating access.

image

2.7 What You Have Learned

Code refactoring is a disciplined way to restructure code in small, evolutionary steps to improve the quality of its design. A code refactoring retains the behavioral semantics of your code; it neither adds functionality nor takes functionality away. Similarly, a database refactoring is a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics. Database refactoring is one of the core techniques that enable data professionals to take an evolutionary approach to database development. The greater the coupling that your database is involved with, the harder it will be to refactor.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset