DENORMALIZATION CONSIDERED HARMFUL (I)

In this section, I’d like to present an argument—a logical argument, that is, and one you might not have seen before—in support of the position that you should denormalize only as a last resort. Essentially, the argument is that while (as is well known) denormalization can be logically bad for update, it can be logically bad for retrieval as well, in the sense that it can make certain queries harder to formulate. (Alternatively, it can make them easier to formulate incorrectly, meaning that, if they execute, you’re getting answers that might be correct in themselves but are answers to the wrong questions.) Let me illustrate.

Once again consider relvar S, with its FD {CITY} → {STATUS}. As noted earlier, that relvar can be regarded as the result of denormalizing relvars SNC (with attributes SNO, SNAME, and CITY) and CT (with attributes CITY and STATUS). Now consider the query “Get the average supplier city status value.” Given the sample values in Figure 3-2, the status values for Athens, London, and Paris are 30, 20, and 30, respectively, and so the average is 80/3, which is 26.667 to three decimal places. Here then are some attempts at formulating this query in SQL (I’ll assume for simplicity that S is nonempty, so we don’t have to worry about what happens in SQL if we try to apply the AVG operator to an empty set):[83]

  1. SELECT AVG ( STATUS ) AS RESULT
    FROM   S

    Result (incorrect): 26. The problem here is that London’s status and Paris’s status have both been counted twice. Perhaps we need a DISTINCT inside the AVG invocation? Let’s try that:

  2. SELECT AVG ( DISTINCT STATUS ) AS RESULT
    FROM   S

    Result (incorrect): 25. No, it’s distinct cities we need to examine, not distinct status values. We can do that by grouping:

  3. SELECT CITY , AVG ( STATUS ) AS RESULT
    FROM   S
    GROUP  BY CITY

    Result (incorrect): (Athens,30), (London,20), (Paris,30). This formulation gives average status per city, not the overall average. Perhaps what we want is the average of the averages?—

  4. SELECT CITY , AVG ( AVG ( STATUS ) ) AS RESULT
    FROM   S
    GROUP  BY CITY

    Result: Syntax error—the SQL standard quite rightly doesn’t allow “set function” invocations to be nested in this manner.[84] One more attempt:

  5. SELECT AVG ( TEMP.STATUS ) AS RESULT
    FROM ( SELECT DISTINCT S.CITY , S.STATUS
                FROM   S ) AS TEMP

    Result (correct at last): 26.667. But note how complicated this expression is compared to its analog on the fully normalized design (relvars SNC and CT):

  6. SELECT AVG ( STATUS ) AS RESULT
    FROM   CT


[83] What changes would be needed to the various SQL expressions if we couldn’t make that assumption?

[84] I say “quite rightly” only because we’re in the SQL context specifically; a more orthodox language such as Tutorial D would certainly let us nest such invocations (or its analog of such invocations, rather). Let me explain. Consider the SQL expression SELECT SUM(QTY) AS RESULT FROM SP WHERE QTY > 100 (I deliberately switch to a different example for reasons of clarity). The argument to the SUM invocation here is really what’s denoted by the expression QTY FROM SP WHERE QTY > 100, and a more orthodox language would therefore enclose that whole expression in parentheses. But SQL doesn’t. As a consequence, an expression of the form AVG(SUM(QTY)) has to be illegal, because SQL can’t figure out which portions of the surrounding expression have to do with the AVG argument and which with the SUM argument.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset