DENORMALIZATION CONSIDERED HARMFUL (II)

I said earlier that the argument in favor of denormalization was that it makes retrievals easier to express and makes them perform better. But does this argument really stand up to careful analysis? Let’s take a closer look.

First of all, it clearly isn’t true across the board that retrievals are easier to express; the previous section presented a detailed counterexample, but the point can be made with much simpler examples. By way of illustration, consider what’s involved in formulating the query “Get all supplier details” against (a) the normalized design of Figure 1-1 and (b) a denormalized design in which relvars S, SP, and P are replaced by a single “joined” relvar called, say, SSPP. Here are Tutorial D formulations:

  1. S

  2. SSPP { SNO , SNAME , STATUS , CITY }

Or if you prefer SQL:

  1. SELECT *
    FROM   S
  2. SELECT DISTINCT SNO , SNAME , STATUS , CITY
    FROM   SSPP
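The need for that DISTINCT can be verified mechanically. Here's a minimal sketch using Python's sqlite3 module; the SSPP column names follow the text, but the sample data values are illustrative assumptions of my own (a supplier that supplies two parts, so its details appear twice in the join):

```python
import sqlite3

# Build a tiny denormalized SSPP table (illustrative data).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE SSPP (SNO TEXT, SNAME TEXT, STATUS INT, CITY TEXT,
                       PNO TEXT, QTY INT);
    -- Supplier S1 supplies two parts, so its details appear twice:
    INSERT INTO SSPP VALUES ('S1', 'Smith', 20, 'London', 'P1', 300);
    INSERT INTO SSPP VALUES ('S1', 'Smith', 20, 'London', 'P2', 200);
    INSERT INTO SSPP VALUES ('S2', 'Jones', 10, 'Paris',  'P1', 300);
""")

plain = con.execute(
    "SELECT SNO, SNAME, STATUS, CITY FROM SSPP").fetchall()
deduped = con.execute(
    "SELECT DISTINCT SNO, SNAME, STATUS, CITY FROM SSPP").fetchall()

print(len(plain))    # 3 rows: S1's details come back twice
print(len(deduped))  # 2 rows: duplicate elimination was needed
```

Against the normalized relvar S, by contrast, `SELECT * FROM S` would return each supplier exactly once with no duplicate elimination at all.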

The next point is that many queries are likely to perform worse, too. There are several reasons for this state of affairs. One is that denormalization leads to redundancy, which in turn can lead to a need to do duplicate elimination (note that DISTINCT in the second of the foregoing SQL formulations!). Another is as follows:

  • Suppose again that the join of suppliers, shipments, and parts is represented as one single stored file. Also, assume for simplicity that any given stored file consists of a physically contiguous collection of stored records, one for each tuple currently appearing in the relvar the stored file represents.

  • Let’s suppose too for the sake of the argument that the query “Get details for suppliers who supply red parts” will perform reasonably well against this physical structure. OK; but the query “Get all supplier details” will perform worse than it would against the structure in which the three relvars map to three physically separate stored files! Why? Because in the latter design, all supplier stored records will be physically contiguous, whereas in the former design they’ll effectively be spread over a wider area, and will therefore require more I/O. Analogous remarks apply to any query that accesses suppliers only, or parts only, or shipments only, instead of performing some kind of join.

  • Note too that denormalization, again because it increases redundancy, will most likely lead to bigger stored records, and this fact too can lead to more I/O, not less. For example, a 4K page can hold two 2K stored records but only one 3K stored record; hence, a denormalization that increases redundancy by 50 percent could increase I/O by 100 percent (I’m speaking pretty loosely here, of course).
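The record-size arithmetic in that last bullet can be made concrete. The sketch below assumes records don't span page boundaries and uses a hypothetical count of 1,000 supplier records; the 4K/2K/3K figures come from the text:

```python
import math

PAGE = 4096  # 4K page, as in the text

def pages_needed(n_records: int, record_size: int) -> int:
    """Pages required when whole records are packed into fixed pages."""
    per_page = PAGE // record_size          # whole records per page
    return math.ceil(n_records / per_page)

# 1,000 records at 2K each vs. the same records bloated to 3K each:
print(pages_needed(1000, 2048))  # 500 pages (two records per page)
print(pages_needed(1000, 3072))  # 1000 pages (one record per page)
```

A 50 percent increase in record size has doubled the page count, and hence (roughly) the I/O for a sequential scan, which is exactly the "loose" claim made above.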

My next observation is that even if we accept the claim that denormalization makes retrievals easier to express and perform better, it certainly makes updates harder to express and perform worse. Now, this point is (as I said before) widely understood; however, what’s not so widely understood is that denormalization opens the door to integrity violations, too. For example, in relvar S (as opposed to the projection relvars SNC and CT), someone—either the system or the user, and in current practice probably the latter—is going to have to be responsible for maintaining the FD {CITY} → {STATUS}; and if that maintenance isn’t done, integrity is lost. (By contrast, in the two-relvar design, all that has to be done is to enforce the key constraint on CT—which will definitely be done by the system, not the user—and the fact that each city has one status will then be maintained “automatically.”)
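The contrast between system-enforced and user-enforced integrity can also be demonstrated. In this sketch (again using sqlite3; the relvar and attribute names follow the text, while the data values are illustrative assumptions), declaring CITY as the key of CT means the system itself rejects a second status for a city, whereas the denormalized relvar S happily accepts contradictory statuses:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Two-relvar design: the key on CT.CITY enforces "one status per city".
    CREATE TABLE CT (CITY TEXT PRIMARY KEY, STATUS INT NOT NULL);
    -- Denormalized design: STATUS is repeated in every supplier row.
    CREATE TABLE S  (SNO TEXT PRIMARY KEY, SNAME TEXT,
                     STATUS INT, CITY TEXT);
""")
con.execute("INSERT INTO CT VALUES ('London', 20)")

# A conflicting status for London is rejected by the key constraint:
try:
    con.execute("INSERT INTO CT VALUES ('London', 30)")
    violation_allowed = True
except sqlite3.IntegrityError:
    violation_allowed = False
print(violation_allowed)  # False: the system maintained the FD for us

# In S, nothing stops two London suppliers from recording different
# statuses; the FD {CITY} -> {STATUS} is now the user's problem:
con.execute("INSERT INTO S VALUES ('S1', 'Smith', 20, 'London')")
con.execute("INSERT INTO S VALUES ('S4', 'Clark', 30, 'London')")
statuses = con.execute(
    "SELECT COUNT(DISTINCT STATUS) FROM S WHERE CITY = 'London'"
).fetchone()[0]
print(statuses)  # 2: integrity has been lost
```

Note that the key constraint is declared once and enforced everywhere, whereas the user-maintained FD has to be checked in every application that touches the table.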

My final point is this: Regardless of whether we’re talking about

  1. True denormalization, which is done at the physical level only, or

  2. The kind of denormalization we have to do in most of today’s SQL products, which affects the logical level as well,

it isn’t widely enough appreciated that when people say “denormalize for performance,” they’re really referring to the performance of specific applications. As I put it earlier, denormalization is typically based on a somewhat narrow perspective on the overall problem. Any given physical design is likely to be good for some applications but bad for others (in terms of its performance implications, that is).
