Chapter 15. We Need More Science

What I tell you three times is true

Lewis Carroll: The Hunting of the Snark

Note: Portions of this chapter originally appeared, in considerably different form, in my book Date on Database: Writings 2000-2006 (Apress, 2006).

Redundant adj. de trop, diffuse, excessive, extra, inessential, inordinate, padded, periphrastic, pleonastical, prolix, repetitious, supererogatory, superfluous, supernumerary, surplus, tautological, unemployed, unnecessary, unneeded, unwanted, verbose, wordy

I give the foregoing splendid list of synonyms here purely for whatever intrinsic interest it might have (I found it in Chambers Twentieth Century Thesaurus, which also gives the following nice list of antonyms: concise, essential, necessary). Be that as it may, we’ve seen that design theory in general can be regarded as a set of principles and techniques for reducing redundancy (and thereby reducing the potential for certain inconsistencies and update anomalies that might otherwise occur). But what exactly is redundancy? We don’t seem to have a very precise definition of the term; we just have a somewhat vague idea that it can lead to problems, at least if it isn’t managed properly.[149]

In order to get a slightly better handle on this question, we first need to distinguish clearly between the logical and physical levels of the system. Obviously the design goals are different at the two levels. At the physical level, redundancy will almost certainly exist in some shape or form. Here are a couple of reasons why:

  • Indexes and other such “fast access path” structures necessarily entail some redundancy, because certain data values are stored both in those auxiliary structures and in the structures to which they provide that “fast access.”

  • Derived relvars (and/or derived relations) that are physically stored in some way—what are known variously as snapshots or summary tables or materialized queries or materialized views[150]—also obviously involve some redundancy.

The reason for redundancy at the physical level is performance, of course. But physical redundancy has (or should have!) no effect on the logical level; it’s managed by the DBMS, and it’s never directly seen by the user. I mention it here only to get it out of the way, as it were. From this point forward, I’ll be concerned only with redundancy at the logical level.

At the logical level, then, it’s tempting just to say that redundancy is always bad. But of course this statement is much too simplistic, owing to the availability of the view mechanism if nothing else. Let me digress for a moment to elaborate on this latter point. It’s well known, but worth stating explicitly nevertheless, that views (like normalization, though for very different reasons) serve two rather different purposes:

  1. The user who actually defines view V is, obviously, aware of the expression X in terms of which V is defined. That user can use the name V wherever the expression X is intended, but such uses are basically just shorthand (like the use of macros in a programming language).

  2. By contrast, a user who’s merely informed that view V exists and is available for use is supposed not to be aware of the expression X; to that user, in fact, V is supposed to look and feel just like a base relvar.[151]

As an example of Case 1, suppose the user perceives the database as containing two relvars R1 and R2 and goes on to define their join as a view; clearly, then, that view is redundant so far as that user is concerned, and it could be dropped without any loss of information. For definiteness, therefore, I’m going to assume from this point forward (barring explicit statements to the contrary) that no relvar in the database is defined in terms of any others, so that at least this particular kind of redundancy isn’t usually present. With this possibility ruled out, then, it’s tempting to set a stake in the ground and say again that redundancy at the logical level is always undesirable. In order to adopt such a position, however, we need to be able to say what we mean by redundancy—for otherwise the position can’t possibly make sense. And even if we can come up with a good definition of the term, is the position (that redundancy at the logical level is always bad, that is) really tenable? Is it possible to eliminate all redundancy? Is it even desirable?

These are questions of considerable pragmatic importance, of course. Indeed, I think it’s noteworthy that Codd called his very first (1969) paper on the relational model “Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks” (boldface added; see Appendix C). And his second (1970) paper, “A Relational Model of Data for Large Shared Data Banks” (again, see Appendix C)—this is the one that’s usually regarded as the seminal paper in the field, though that characterization is a little unfair to its 1969 predecessor—was in two parts of almost equal length, the second of which was called “Redundancy and Consistency” (the first was called “Relational Model and Normal Form”). Codd thus clearly regarded his thoughts on redundancy as a major part of the contribution of his relational work: rightly so, in my opinion, since he did at least provide us with a framework in which we could begin to address the issue precisely and systematically.

Now, I showed in the previous chapter that one “definition” of redundancy that doesn’t work is this: The database involves redundancy if and only if it contains two distinct appearances of the same tuple. But we can validly say the following:

  • Definition: The database involves redundancy if and only if it contains, directly or indirectly, two distinct representations of the same proposition.

The trouble is, although this definition is clearly correct, it doesn’t help much with the practical problem of reducing redundancy. But it does at least imply the following, which is a little better:

  • Definition (preliminary version): Let D be a database design; let DB be a database value (i.e., a set of values for the relvars mentioned in D) that conforms to D; and let p be a proposition. Further, let DB contain some specific appearance of some tuple, or some combination of tuples, that represents p (either explicitly or implicitly). If DB additionally contains some distinct appearance of some tuple, or some combination of tuples, that also represents p (either explicitly or implicitly), then DB contains, and D permits, redundancy.

The principles of normalization and The Principle of Orthogonal Design are aimed precisely at reducing redundancy in the foregoing narrow sense. Observe, however, that all the definition says is if (not if and only if) certain tuples appear, then there’s redundancy; i.e., it’s not a complete definition. Indeed, we’ll see several examples later in this chapter that still involve redundancy, even though they don’t contain distinct tuples or tuple combinations that represent the same proposition. What’s more, most of the examples in question are fully normalized and fully orthogonal. In other words, the principles of normalization and orthogonality, though necessary and undoubtedly important, are far from being sufficient.

A LITTLE HISTORY

Before I get into a discussion of how and why normalization and orthogonality are insufficient, I’d like to say a little more about Codd’s attempts, in his very first two papers, to address the redundancy issue. In his 1969 paper, he said this:

A set of relations is strongly redundant if it contains at least one relation [that] is derivable from the rest of the [relations in the set].

And he tightened up this definition somewhat in the 1970 paper:

A set of relations is strongly redundant if it contains at least one relation that possesses a projection [that] is derivable from other projections of relations in the set.

I should explain that when Codd says a relation r is derivable from a set S of relations, he means r is equal to the result of applying some sequence of relational operations (join, projection, and so forth) to relations from S. I do have a few comments on his definitions, however:

  • First, the term relation should be replaced by the term relvar throughout. (Of course, this latter term wasn’t introduced until many years later, and Codd never used it at all.)

  • Second, we can ignore the qualifier strongly. Codd was distinguishing between “strong” redundancy and what he called weak redundancy, but weak redundancy is irrelevant as far as we’re concerned. The reason is that weak redundancy has to do merely with equality dependencies that don’t hold at all times but do happen to hold at particular times, given the relation values that happen to exist at the times in question. In fact, it seems to me that Codd was struggling here with the logical difference between relations and relvars!—see the previous bullet item. “Strong” redundancy applies to relvars (it’s what we usually mean by redundancy when we talk about database design). “Weak” redundancy, by contrast, applies to relations, not relvars (it’s just an artifact of the values the relvars happen to have at some particular time).

  • The 1969 definition is subsumed by the 1970 definition, of course, because (as we know from Chapter 6) every relvar R is identically equal to a certain projection of R—namely, the identity projection.

  • More to the point, the 1970 definition is still deficient as a definition of redundancy in general, for at least the following two reasons:

    1. It includes certain possibilities that we normally wouldn’t regard as redundancies at all. For example, suppose the suppliers-and-parts database is subject to the constraint that every part must be supplied by at least one supplier. Then the projection of relvar SP on {PNO} will necessarily be equal to the projection of relvar P on {PNO}, and we’ll have a “strong redundancy” on our hands. Note: Perhaps a more realistic example to illustrate the same point would be a constraint on a personnel database to the effect that every department must have at least one employee.

    2. At the same time, it excludes many possibilities that we certainly would regard as redundancies—see, e.g., the example of light vs. heavy parts in Chapter 14 (second version, as illustrated in Figure 14-3). Several further examples are given in later sections of the present chapter.

  • Even more to the point, the references to projections in the 1970 definition should be replaced by references to projections that correspond to components of irreducible JDs. (The first of the two objections in the previous bullet item would then go away.)

One last point on Codd’s definitions: Codd did at least say (in both papers) that “we shall associate with [the database] a collection of statements [that] define all of the redundancies” in that database. The “statements” Codd is referring to here are Tutorial D CONSTRAINT statements (or something logically equivalent to such statements, of course). In other words, Codd certainly wanted the system to be aware of the redundancies, and he wanted those redundancies to be managed accordingly. Unfortunately, however, he then went on to say:

The generation of an inconsistency ... could be logged internally, so that if it were not remedied within some reasonable time ... the system could notify the security officer [sic]. Alternatively, the system could [inform the user] that such and such relations now need to be changed to restore consistency ... Ideally, [different remedial actions] should be possible ... for different subcollections of relations.

Note: “Inconsistencies” (or, as I would prefer to call them, integrity violations) can certainly be caused by redundancy—more precisely, by redundancy that’s inadequately managed—but not all integrity violations are caused by redundancy, of course. More to the point, I believe the database should never be allowed to contain any inconsistencies, at least as far as the user is concerned; as I said in the previous chapter, you can never trust the results you get from an inconsistent database. In other words, “remedying inconsistencies” needs to be done immediately, at the level of individual statements (not even at the transaction level).[152] See the section MANAGING REDUNDANCY later in this chapter.



[149] I remind you, though, that we do at least have a precise definition of the kind of redundancy that can be removed by taking projections (see Chapter 13).

[150] This last term is strongly deprecated, by the way, because the construct in question isn’t a view. Views are virtual, not materialized (at least as far as the relational model is concerned), and materialized view is thus a contradiction in terms. The proper term is snapshot.

[151] Emphasis on supposed—I’m describing an ideal situation here. Today’s reality is rather messier, as I’m sure you know.

[152] See SQL and Relational Theory for a defense of this possibly rather unorthodox opinion. Let me add, with little by the way of elaboration, that the position implies a requirement for the system to support multiple assignment, which is a form of assignment that allows several variables—in particular, several relvars—to be updated “simultaneously” (i.e., within the confines of a single statement).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset