Three. Tuples and Relations

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter Three. Tuples and Relations

From the first two chapters you should have gained a pretty good understanding of what tuples and relations are, at least from an informal point of view. Now I want to define those concepts more precisely, and I want to explore some of the consequences of those precise definitions. Perhaps I should warn you that the definitions can look a little daunting, but that’s not unusual with formal definitions; the tuple and relation concepts themselves are quite straightforward, once you’ve struggled through the formalism, and you should be ready to do that by now because the terminology, at least, should be reasonably familiar to you.

What’s a Tuple?

Is this a tuple?

Figure Three-1.

Well, no, of course it isn’t—it’s a picture of a tuple, not a tuple as such (and note that for once I’ve included the type names as well as the attribute names in that picture). As we saw in Chapter 1, there’s a difference between a thing and a picture of a thing, and that difference can be very important. For example, tuples have no left-to-right ordering to their attributes, and so the following is an equally good (or bad?) picture of the very same tuple :

Figure Three-2.

Thus, while I’ll certainly be making use of pictures like these in the sections to follow, please keep in mind that they’re only pictures, and they can sometimes suggest some things that aren’t true.

With that caveat out of the way, I can now say exactly what a tuple is:

Definition: Let T1, T2, . . . ,Tn (n ≥ 0) be type names, not necessarily all distinct. Associate with each Ti a distinct attribute name, Ai; each of the n attribute-name:type-name combinations that results is an attribute. Associate with each attribute a value vi of type Ti; each of the n attribute:value combinations that results is a component. The set of all n components thus defined, t say, is a tuple value (or just a tuple for short) over the attributes A1, A2,..., An. The value n is the degree of t; a tuple of degree one is unary, a tuple of degree two is binary, a tuple of degree three is ternary,..., and more generally a tuple of degree n is n-ary. The set of all n attributes is the heading of t.

For example, with reference to either of the earlier pictures (of our usual tuple for supplier S1), we have:

Degree

Four. The heading is also said to have degree four.

Type names

SNO, NAME, INTEGER, and CHAR.

Corresponding attribute names

SNO, SNAME, STATUS, and CITY.

Corresponding attribute values

SNO('S1'), NAME('Smith'), 20, and ‘London’.

Note

I showed most of these attribute values in simplified form in the pictures (and I’ll continue to make use of similar simplifications in such pictures throughout this book). However, it’s strictly incorrect to say that, for example, the SNO value in the tuple we’re talking about is just ‘S1’ or (even sloppier) just S1. A value of type SNO is a value of type SNO, not a value of type CHAR!—and the expression SNO('S1') is a literal of type SNO. (More precisely, it’s an invocation of the SNO selector, as we saw in Chapter 2. A literal is, precisely, a selector invocation in which the arguments are themselves all literals in turn.)

Heading

The easiest thing to do here is show another picture:

Figure Three-3.

Of course, this picture represents a set, and the order of attributes is arbitrary. Here’s another picture of the same heading:

Figure Three-4.

Exercise: How many different pictures of this same general nature could we draw to represent this heading? (Answer: 4 * 3 * 2 * 1 = 24.)

Now, a tuple is a value; like all values, therefore, it has a type (as we already know from Chapter 2), and that type, like all types, has a name. In Tutorial D, such names take the form TUPLE{H}, where {H} is the heading. In our example, the name is:

    TUPLE { SNO SNO, SNAME NAME, STATUS INTEGER, CITY CHAR }

(but the order in which the attributes are specified is arbitrary).

To repeat, a tuple is a value. Like all values, therefore, it must be returned by some selector invocation (a tuple selector invocation, of course, if the value is a tuple). Here’s a tuple selector invocation for our example (Tutorial D again):

    TUPLE { SNO SNO('S1'), SNAME NAME('Smith'),
                       STATUS 20, CITY 'London' }

(where the order in which the components is specified is arbitrary). Observe that in Tutorial D each component is specified as just an attribute-name:attribute-value pair; the attribute type is omitted because it can always be inferred from the type of the expression denoting the attribute value.

Note

A remark on syntax: As you can see, the keyword TUPLE does double duty in Tutorial D—it’s used in connection with both tuple selector invocations and tuple type names. An analogous remark applies to the keyword RELATION (see the upcoming section See "What’s a Relation?“).

Finally, a word about SQL. Of course, SQL supports rows, not tuples; in particular, it supports row types, a row type constructor, and row value constructors, which are somewhat analogous to tuple types, the TUPLE type generator, and tuple selectors, respectively. But these analogies are loose at best. For example, this SQL expression:

    ROW ( 1, 2 )

which is an example of a row value constructor, (a) has components that are unnamed and (b) doesn’t represent the same row as the row value constructor ROW (2,1). (There are other differences, too, between SQL’s row support and our tuple support, but they’re beyond the scope of the present discussion.)

Note

The keyword ROW in an SQL row value constructor is optional.

Some Important Consequences

Now I want to highlight some important consequences of the foregoing definitions. The first is: no tuple ever contains any nulls. The reason is that, by definition, every tuple contains a value (of the appropriate type) for each of its attributes, and we saw in Chapter 1 that nulls aren’t values.^[*] Of course, if no tuple ever contains any nulls, then no relation does either, a fortiori; so right away we have at least a formal reason for rejecting the concept of nulls—but I’ll give some more pragmatic reasons as well, in the section "Why Nulls Are Prohibited,” later in this chapter.

The next consequence is:every subset of a tuple is a tuple and every subset of a heading is a heading. For example, given our usual tuple for supplier S1, what we might call “the {SNO,CITY} value” within that tuple is itself another tuple:

Figure Three-5.

Its heading is as indicated, and its type is:

    TUPLE { SNO SNO, CITY CHAR }

In the same way, the following is a tuple too (its type is TUPLE {SNO SNO}):

Figure Three-6.

So if we want to access the actual attribute value—SNO('S1') in the example—we have to extract it somehow from its containing tuple. Tutorial D uses syntax of the form SNO FROM t for this purpose (where t is any expression that denotes a tuple with an SNO attribute). SQL uses dot qualification: t.SNO.

Note

We saw in Chapter 2 that a tuple t and a relation r that contains just that tuple t aren’t the same thing. Analogously, a value v and a tuple t that contains just that value v aren’t the same thing, either; in particular, they’re of different types.

Now, I’m sure you know that the empty set is a subset of every set. It follows that a tuple with an empty heading, and therefore an empty set of components, is a valid tuple!—though it’s a little hard to draw a picture of such a tuple on paper, and I’m not even going to try. A tuple with an empty heading has type TUPLE{} (it has no components); indeed, we sometimes refer to it explicitly as a 0-tuple, to emphasize the fact that it is of degree zero. We also sometimes call it an empty tuple. Now, you might think such a tuple is unlikely to be of much use in practice; in fact, it turns out, perhaps rather surprisingly, to be of crucial importance. I’ll have more to say about it in the section See "TABLE_DUM and TABLE_DEE,” later in this chapter.

The last “important consequence” I want to discuss here has to do with the notion of tuple equality . (Recall from Chapter 2 that the “=” comparison operator is defined for every type, and tuple types are no exception.) Basically, of course, two tuples are equal if and only if they’re the very same tuple (just as, for example, two integers are equal if and only if they’re the very same integer). But it’s worth spelling out the semantics of tuple equality explicitly, since so much in the relational model depends on it—for example, candidate keys, foreign keys, and most of the operators of the relational algebra are all defined in terms of it. Here then is a precise definition:

Definition: Tuples t1 and t2 are equal if and only if they have the same attributes A1, A2, ..., An—in other words, they’re of the same type—and, for all i (i = 1, 2, ..., n), the value v1 of Ai in t1 is equal to the value v2 of Ai in t2.

Also (this might seem obvious, but it needs to be said), two tuples are duplicates if and only if they’re equal.

By the way, it’s an immediate consequence of this definition that all 0-tuples are duplicates of one another. For this reason, we’re within our rights if we talk in terms of the 0-tuple instead of a 0-tuple, and indeed we usually do.

Observe finally that the comparison operators “<” and “>” don’t apply to tuples (see Exercise 3-11 at the end of the chapter).

What’s a Relation?

I’ll use our usual suppliers relation as a basis for examples in this section. Here’s a picture:

Figure Three-7.

And here’s a definition:

Definition: Let {H} be a tuple heading and let t1, t2, . . . , tm (m ≥ 0) be distinct tuples with heading {H}. The combination, r say, of {H} and the set of tuples {t1, t2, . . . , tm} is a relation value (or just a relation for short) over the attributes A1, A2, . . . , An, where A1, A2, . . . , An are the attributes in {H}. The heading of r is {H}; r has the same attributes (and hence the same attribute names and types) and the same degree as that heading does. The body of r is the set of tuples {t1, t2, . . . , tm}. The value m is the cardinality of r.

I’ll leave it as an exercise to interpret the suppliers relation in terms of the foregoing definition. However, I will at least explain why we call such things relations. Basically, each tuple in a relation represents an n-ary relationship, in the ordinary natural-language sense, among a set of n values (one value for each tuple attribute), and the full set of tuples in a given relation represents the full set of such relationships that happen to exist at some given time—and, mathematically speaking, that’s a relation. Thus, the “explanation” often heard, to the effect that the relational model is so called because it lets us “relate one table to another,” though accurate in a kind of secondary sense, really misses the basic point. The relational model is so called because it deals with certain abstractions that we can think of as “tables” but are known, formally, as relations in mathematics.

Now, a relation, like a tuple, is itself a value and has a type, and that type has a name. In Tutorial D, such names take the form RELATION{H}, where {H} is the heading. Here’s an example:

    RELATION { SNO SNO, SNAME NAME, STATUS INTEGER, CITY CHAR }

(The order in which the attributes are specified is arbitrary.) Also, every relation value is returned by some relation selector invocation—for example:

    RELATION {
       TUPLE { SNO SNO('S1'), SNAME NAME('Smith'),
                              STATUS 20, CITY 'London' } ,
       TUPLE { SNO SNO('S2'), SNAME NAME('Jones'),
                              STATUS 10, CITY 'Paris'  } ,
       TUPLE { SNO SNO('S3'), SNAME NAME('Blake'),
                              STATUS 30, CITY 'Paris'  } ,
       TUPLE { SNO SNO('S4'), SNAME NAME('Clark'),
                              STATUS 20, CITY 'London' } ,
       TUPLE { SNO SNO('S5'), SNAME NAME('Adams'),
                              STATUS 30, CITY 'Athens' } }

The order in which the tuples are specified is arbitrary.

What support does SQL provide for the foregoing ideas? The answer is: not much. It doesn’t really have anything analogous to the concept of a relation type at all; rather, an SQL table is considered to consist of rows (a multiset or bag of rows, to be precise) that are of a certain row type. (It does have something it calls named table types, but this construct is very different from our relation type, and I’m not going to discuss it here.) It follows that SQL doesn’t really have anything analogous to the RELATION type generator, either. It does, however, have something called a table value constructor that’s akin, somewhat, to a relation selector. Here’s an example:

    VALUES ( 1, 2 ), ( 2, 1 ), ( 1, 1 ), ( 1, 2 )

This expression evaluates to a table with four—not three!—rows and two columns (which have no names).

Further Important Consequences

Most of the properties of relations I talked about in Chapter 1 are direct consequences of the definitions in the previous section, but there are a couple I didn’t call out explicitly before, and I want to elaborate on some of the others.

The first point I didn’t mention before is that every subset of a body is a body (loosely, every subset of a relation is a relation). In particular, therefore, since the empty set is a subset of every set, a relation can have a body that consists of an empty set of tuples (and we call such a relation an empty relation ). For example, suppose there are no shipments right now. In this case, relvar SP will have as its current value the empty shipments relation, which we might draw like this:^[*]

Figure Three-8.

Note that, given any particular relation type, there’s exactly one empty relation of that type—but empty relations of different types aren’t the same thing, precisely because they’re of different types. For example, the empty suppliers relation isn’t equal to the empty parts relation; their bodies are equal but their headings aren’t.

The second point I deliberately didn’t mention before is this. Recall from Chapter 1 that, while a relation can be pictured as a table, a relation isn’t a table. (To say it one more time, a picture of a thing is not the same as the thing.) Of course, it can be convenient to think of a relation as a table; after all, tables are “user-friendly”; indeed, it’s the fact that we can think of relations, informally, as tables—sometimes more explicitly as flat or two-dimensional tables—that makes relational systems intuitively easy to understand and use, and makes it intuitively easy to reason about the way such systems behave. In other words, it’s a very nice property of the relational model that its basic data structure, the relation, has such an intuitively attractive pictorial representation.

Unfortunately, many people seem to have been blinded by that attractive pictorial representation into thinking that relations as such are “flat” or “two-dimensional.” But they’re not. Rather, if relation r has n attributes, then each tuple in r represents a point in a certain n-dimensional space(and the relation overall represents a set of such points). For example, each of the five tuples appearing in our usual sample value for the suppliers relvar S represents a certain point in a certain 4-dimensional space, and the relation overall can thus be said to be 4-dimensional. Thus, relations are n-dimensional, not two-dimensional!^[*] As I’ve written elsewhere (in quite a few places, in fact): let’s all vow never to say “flat relations” ever again.

Now I turn to the issues I want to elaborate on. There are three of them: duplicate tuples, nulls, and relations with no attributes. I’ll discuss each in its own section.

Why Duplicate Tuples Are Prohibited

There are numerous practical arguments in support of the position that duplicate tuples (“duplicates” for short) should be prohibited. Here I have room for just one—but I think it’s a powerful one. However, it does rely on certain notions I haven’t discussed yet in this book, so I need to make a couple of preliminary assumptions:

I assume you know that relational DBMSs include a component called the optimizer , whose job is to try to figure out the best way to implement user queries and the like (where “best” basically means best-performing).
I assume you also know that one of the things optimizers do is what’s sometimes called query rewrite . Query rewrite is the process of transforming some relational expression X1—representing some user query, say—into another such expression X2, such that X1 and X2 are guaranteed to produce the same result when evaluated, but where X2 has better performance characteristics than X1 (at least, we hope it does).

Now I can present my argument. The fundamental point I want to make is that certain expression transformations, and hence certain optimizations, that are valid in a relational context are not valid in the presence of duplicates. By way of example, consider the (nonrelational) database shown in Figure 3-1.

Figure 3-1. A nonrelational database, with duplicates

Before going any further, perhaps I should ask the question: what does it mean to have three <P1,Screw> rows in table P and not two, or four, or seventeen?^[†] It must mean something, for if it means nothing, then why are the duplicates there in the first place? As I once heard Ted Codd say, “If something is true, saying it twice doesn’t make it any more true.”

So I have to assume there’s some meaning attached to the duplication, even though that meaning, whatever it is, is hardly very explicit. (I note in passing, therefore, that duplicates contravene one of the original objectives of the relational model: explicitness. The meaning of the data should be as obvious and explicit as possible, since we’re supposed to be talking about a shared database. The presence of duplicates implies that part of the meaning is hidden.) In other words, given that duplicates do have some meaning, there are presumably going to be business decisions made on the basis of the fact that, for example, there are three <P1,Screw> rows in table P and not two or four or seventeen. If not, then why are the duplicates there in the first place?

Now consider the following query: “Get part numbers for parts that either are screws or are supplied by supplier S1, or both.” Here are some candidate SQL formulations for this query, together with the output produced in each case:

    1  SELECT P.PNO
       FROM   P
       WHERE  P.PNAME = NAME('Screw')
       OR     P.PNO IN
            ( SELECT SP.PNO
              FROM   SP
              WHERE  SP.SNO = SNO('S1') )

       Result: P1 * 3, P2 * 1.

    2   SELECT SP.PNO
       FROM   SP
       WHERE  SP.SNO = SNO('S1')
       OR     SP.PNO IN
            ( SELECT P.PNO
              FROM   P
              WHERE  P.PNAME = NAME('Screw') )

       Result: P1 * 2, P2 * 1.

    3   SELECT P.PNO
       FROM   P, SP
       WHERE  ( SP.SNO = SNO('S1') AND
                SP.PNO = P.PNO )
       OR     P.PNAME = NAME('Screw')

       Result: P1 * 9, P2 * 3.

    4   SELECT SP.PNO
       FROM   P, SP
       WHERE  ( SP.SNO = SNO('S1') AND
                SP.PNO = P.PNO )
       OR     P.PNAME = NAME('Screw')

       Result: P1 * 8, P2 * 4.

    5   SELECT P.PNO
       FROM   P
       WHERE  P.PNAME = NAME('Screw')
    
       UNION  ALL
       SELECT SP.PNO
       FROM   SP
       WHERE  SP.SNO = SNO('S1')

       Result: P1 * 5, P2 * 2.

    6   SELECT DISTINCT P.PNO
       FROM   P
       WHERE  P.PNAME = NAME('Screw')
       UNION  ALL
       SELECT SP.PNO
       FROM   SP
       WHERE  SP.SNO = SNO('S1')

       Result: P1 * 3, P2 * 2.

    7   SELECT P.PNO
       FROM   P
       WHERE  P.PNAME = NAME('Screw')
       UNION  ALL
       SELECT DISTINCT SP.PNO
       FROM   SP
       WHERE  SP.SNO = SNO('S1')

       Result: P1 * 4, P2 * 2.

    8   SELECT DISTINCT P.PNO
       FROM   P
       WHERE  P.PNAME = NAME('Screw')
       OR     P.PNO IN
            ( SELECT SP.PNO
              FROM   SP
              WHERE  SP.SNO = SNO('S1') )

       Result: P1 * 1, P2 * 1.

    9   SELECT DISTINCT SP.PNO
       FROM   SP
       WHERE  SP.SNO = SNO('S1')
       OR     SP.PNO IN
            ( SELECT P.PNO
              FROM   P
              WHERE  P.PNAME = NAME('Screw') )

       Result: P1 * 1, P2 * 1.

    10   SELECT P.PNO
       FROM   P
       GROUP  BY P.PNO, P.PNAME
       HAVING P.PNAME = NAME('Screw')
       OR     P.PNO IN
            ( SELECT SP.PNO
              FROM   SP
              WHERE  SP.SNO = SNO('S1') )

       Result: P1 * 1, P2 * 1.

    
    11   SELECT P.PNO
       FROM   P, SP
       GROUP  BY P.PNO, P.PNAME, SP.SNO, SP.PNO
       HAVING ( SP.SNO = SNO('S1') AND
                SP.PNO = P.PNO )
       OR     P.PNAME = NAME('Screw')

       Result: P1 * 2, P2 * 2.

    12   SELECT P.PNO
       FROM   P
       WHERE  P.PNAME = NAME('Screw')
       UNION
       SELECT SP.PNO
       FROM   SP
       WHERE  SP.SNO = SNO('S1')

       Result: P1 * 1, P2 * 1.

Note

Actually, certain of the foregoing formulations—which?—are a little suspect, because they assume that every screw is supplied by at least one supplier. But this fact has no material effect on the argument that follows.

The obvious first point is that the twelve different formulations produce nine different results—different, that is, with respect to their degree of duplication. (I make no claim, incidentally, that the twelve different formulations and the nine different results are the only ones possible; indeed, they aren’t, in general.) Thus, if the user really cares about duplicates, he or she needs to be extremely careful in formulating the query in such a way as to obtain exactly the desired result.

Furthermore, of course, analogous remarks apply to the system itself: because different formulations can produce different results, the optimizer too has to be extremely careful in its task of expression transformation. For example, the optimizer isn’t free to transform, say, formulation 1 into formulation 3 or the other way around, even if it would like to. In other words, duplicate rows act as a significant optimization inhibitor. Here are some implications of this fact:

The optimizer code itself is harder to write, harder to maintain, and probably more buggy—all of which combines to make the product simultaneously more expensive and less reliable, as well as late in delivery to the marketplace.
System performance is likely to be worse than it might be.
Users are going to have to get involved in performance issues. To be more specific, they’re going to have to spend time and effort in figuring out how to formulate a given query in order to get the best performance—a state of affairs the relational model was expressly meant to avoid!

At this point in the original draft, I said that if you find the discipline of always specifying DISTINCT annoying, don’t complain to me-complain to the SQL vendors instead. But my reviewers reacted with almost unanimous horror to my suggestion that you should always specify DISTINCT. One wrote: “Those who really know SQL well will be shocked at the thought of coding SELECT DISTINCT by default.” Well, I’d like to suggest, politely, that (a) those who are “shocked at the thought” probably know the implementa-tions well, not SQL, and (b) their shock is probably due to their recognition that those implementations do such a poor job of optimizing away unnecessary DISTINCTs. If I write SELECT DISTINCT S.SNO FROM S ..., that DISTINCT can safely be ignored. If I write EXISTS (SELECT DISTINCT ...) or IN (SELECT DISTINCT ...), that DISTINCT can safely be ignored. If I write SELECT DISTINCT SP.SNO FROM SP ... GROUP BY SP.SNO, that DISTINCT can safely be ignored. If I write SELECT DISTINCT ... UNION SELECT DISTINCT ..., those DISTINCTs can safely be ignored. And so on. Why should I, as a user, have to devote time and effort to figuring out whether some DISTINCT is going to be a performance hit and whether it’s logically safe to omit it, and to remembering all of the details of SQL’s inconsistent rules for when duplicates are automatically eliminated and when they’re not? Well, I could go on. However, I decided- against my own better judgment, but in the interests of maintaining good relations (with myreviewers, I mean)-not to follow myown advice in the remainder of this book but only to request duplicate elimination explicitly whenit seemedlogically necessary to doso. It wasn’t always easy to decide when that was, either. But at least now I can add my voice to those complaining to the vendors, I suppose.

The fact that duplicates serve as an optimization inhibitor is particularly frustrating in view of the fact that, in most cases, users probably don’t care how many duplicates appear in the result. In other words:

Different formulations produce different results.
But the differences are probably irrelevant from the user’s point of view.
But the optimizer is unaware of this latter fact and is therefore prevented, unnecessarily, from performing the transformations it might like to perform.

On the basis of examples like the foregoing, I’m tempted to conclude among other things that you should always ensure that query results contain no duplicates—for example, by always specifying DISTINCT in your SQL queries—and thus simply forget about the whole problem. And if you follow this advice, of course, then there’s no good reason for allowing duplicates in the first place . . . .

There’s much more that could be said on this issue, but I have room for just four more points. First, the alternative to SELECT DISTINCT in SQL is SELECT ALL—and SELECT ALL is unfortunately the default. The trouble is, however, SELECT DISTINCT might take longer to execute than SELECT ALL, even if the DISTINCT is effectively a “no op.” I don’t want to discuss this one any further, except to note that the reason SQL systems are typically unable to optimize properly over duplicate elimination is their lack of knowledge of key inheritance (that is, their inability to figure out the keys for the result of an arbitrary relational expression).

Second, you might object, not unreasonably, that base tables in practice never do include duplicates, and my example therefore intuitively fails. True enough; but the trouble is, SQL can generate duplicates in query results. Indeed, different formulations of the same query can produce results with different degrees of duplication, as we’ve already seen, even if the input tables themselves have no duplicates at all. For example, consider the following formulations of the query “Get supplier numbers for suppliers who supply at least one part” on the usual suppliers-and-parts database:

    SELECT S.SNO           |    SELECT S.SNO
    FROM   S               |    FROM   S, SP
    WHERE  S.SNO IN        |    WHERE  S.SNO = SP.SNO
         ( SELECT SP.SNO
           FROM   SP )

(Given our usual sample data values, what results do these two expressions produce?) So if you don’t want to think of the tables in Figure 3-1 as base tables specifically, that’s fine: just take them to be the output from previous queries, and the rest of the analysis goes through unchanged.

Third, there’s another at least psychological argument against duplicates that I think is quite persuasive (thanks to Jonathan Gennick for this one): if, in accordance with the n-dimensional perspective on relations introduced in the previous section, you think of a table as a plot of points in some n-dimensional space, then duplicate rows clearly don’t add anything; they simply amount to plotting the same point twice.

My last point is this. Suppose table T does permit duplicates. Then we can’t tell the difference between “genuine” duplicates in T and duplicates that arise from errors in data entry on T! For example, what happens if the person responsible for data entry unintentionally—that is, by mistake—enters the very same row into T twice? (Easily done, by the way.) Is there a way to delete just the “second” row and not the “first”? Note that we presumably do want to delete that “second” row, since it shouldn’t have been entered in the first place.

Why Nulls Are Prohibited

The opening paragraph from the previous section applies equally well here (with just one tiny text substitution), so I’ll basically just repeat it: There are numerous practical arguments in support of the position that nulls should be prohibited. Here I have room for just one—but I think it’s a powerful one. But it does rely on certain notions I haven’t discussed yet in this book, so I need to make a couple of preliminary assumptions:

I assume you know that any comparison in which either or both of the comparands is null evaluates to the UNKNOWN truth value instead of TRUE or FALSE. The justification for this state of affairs is the intended interpretation of null as value unknown: if the value of A is unknown, then obviously it’s unknown whether, for example, A > B, regardless of the value of B (even—perhaps especially—if the value of B is unknown as well). Incidentally, it’s this fact that’s the source of the term three-valued logic (3VL): the notion of nulls inevitably leads us into a logic in which there are three truth values instead of the usual two. (The relational model, by contrast, is based on conventional two- valued logic, 2VL.)
I assume you also know the 3VL truth tables for the familiar operators NOT, AND, and OR (T = TRUE, F = FALSE, U = UNKNOWN):
Figure Three-10.
Observe in particular that NOT returns UNKNOWN if its input is UNKNOWN; AND returns UNKNOWN if one input is UNKNOWN and the other is either UNKNOWN or TRUE; OR returns UNKNOWN if one input is UNKNOWN and the other is either UNKNOWN or FALSE.

Now I can present my argument. The fundamental point I want to make is that certain boolean expressions—and therefore certain queries—produce results that are correct according to three-valued logic but not correct in the real world. By way of example, consider the (nonrelational) database shown in Figure 3-2, in which “the CITY is null” for part P1. Note carefully that the empty space in that figure, in the place where the CITY value for part P1 ought to be, stands for nothing at all; conceptually, there’s nothing—not even a string of blanks or an empty string—in that position (which means the “tuple” for part P1 isn’t really a tuple, a point I’ll come back to near the end of this section).

Figure 3-2. A nonrelational database, with a null

Consider now the following (admittedly rather contrived) query on the database of Figure 3-2: “Get SNO-PNO pairs where either the supplier and part cities are different or the part city isn’t Paris (or both).” Here’s the obvious SQL formulation of this query:^[*]

    SELECT S.SNO, P.PNO
    FROM   S, P
    WHERE  S.CITY <> P.CITY
    OR     P.CITY <> 'Paris'

Let’s focus on the boolean expression in the WHERE clause (parentheses added for clarity):

    ( S.CITY <> P.CITY ) OR ( P.CITY <> 'Paris' )

For the only data we have in the database, this expression evaluates to UNKNOWN OR UNKNOWN, which reduces to just UNKNOWN. Now, SQL queries retrieve data for which the expression in the WHERE clause evaluates to TRUE, not to FALSE and not to UNKNOWN; in the example, therefore, nothing is retrieved at all.

But of course part P1 does have some corresponding city in the real world; in other words, “the null CITY” for part P1 does stand for some real value, say xyz. Obviously, either xyz is Paris or it isn’t. If it is, then the expression:

    ( S.CITY <> P.CITY ) OR ( P.CITY <> 'Paris' )

becomes (for the only data we have):

    ( 'London' <> 'Paris' ) OR ( 'Paris' <> 'Paris' )

This expression evaluates to TRUE, because the first term evaluates to TRUE. Alternatively, if xyz isn’t Paris, the expression becomes (again, for the only data we have):

    ( 'London' <> xyz ) OR ( xyz <> 'Paris' )

This expression also evaluates to TRUE, because the second term evaluates to TRUE. Thus, the boolean expression is always TRUE in the real world, and the query should return the pair S1-P1, regardless of what real value the null stands for. In other words, the result that’s correct according to the logic (that is, 3VL) and the result that’s correct in the real world are different!

By way of another example, consider the following query (I didn’t lead with this example because it’s even more contrived than the previous one, but in some ways it makes the point with still more force):

    SELECT P.PNO
    FROM   P
    WHERE  P.CITY = P.CITY

The real-world answer here is obviously the set of part numbers currently appearing in P (in other words, the set containing just part number P1, given the sample data shown in Figurer 3-2). SQL, however, will return no part numbers at all.

To sum up: if you have any nulls in your database, you’re getting wrong answers to some of your queries. What’s more, you have no way of knowing, in general, just which queries you’re getting wrong answers to; all results become suspect. You can never trust the answers you get from a database with nulls. In my opinion, this state of affairs is a complete showstopper.

As with the business of duplicates in the previous section, there’s much more that could be said on this issue, but I just want to close with a review of the formal argument against nulls. Recall that, by definition, a null isn’t a value. It follows that:^[*]

A “type” that contains a null isn’t a type (because types contain values).
A “tuple” that contains a null isn’t a tuple (because tuples contain values).
A “relation” that contains a null isn’t a relation (because relations contain tuples, and tuples don’t contain nulls).

The net is that if nulls are present, then we’re certainly not talking about the relational model (I don’t know what we are talking about, but it’s not the relational model); the entire edifice crumbles, and all bets are off.

TABLE_DUM and TABLE_DEE

The previous two sections were perhaps a little depressing; this section, by contrast, is much more upbeat. Recall from the section "Some Important Consequences" that the empty set is a subset of every set, and hence that there’s such a thing as the empty tuple (also called the 0-tuple), and of course it has an empty heading. What I didn’t point out before is that—obviously enough—a relation too might have an empty heading (a heading is a set of attributes, and there’s no reason why that set shouldn’t be empty). Such a relation is of type RELATION{}, and its degree is zero. It’s also sometimes said to be 0-ary (or nullary), just as, for example, a relation of degree two is said to be binary. (By the way, nullary here has nothing to do with nulls; nulls are still bad news, but nullary relations are very good news indeed. However, I prefer to avoid the term nullary, precisely because it does sound as if it had something to do with nulls.)

Let r be a relation of degree zero, then. How many such relations are there? The answer: just two. First, of course, r might be empty (meaning it contains no tuples)—remember there’s exactly one empty relation of any given type. Second, if r isn’t empty, then the tuples it contains must all be 0-tuples. But there’s only one 0-tuple!—equivalently, all 0-tuples are duplicates of one another—and so r cannot possibly contain more than one of them. So there are indeed just two relations with no attributes: one with just one tuple, and one with no tuples at all. For fairly obvious reasons, I’m not going to try drawing pictures of these relations; in fact, this is one place where the idea of thinking of relations as tables really does break down.

Now, you might well be thinking: “So what? Why on earth would I ever want a relation that has no attributes at all?” Even if they’re mathematically respectable (which they are), surely they’re of no practical significance? In fact, however, it turns out they’re of very great practical significance indeed: so much so, that we have pet names for them—we call them TABLE_DUM and TABLE_DEE, or DUM and DEE for short (DUM is the empty one, DEE is the one with one tuple).^[*] And the reason they’re so significant is their meanings,which are FALSE (or NO) for DUM and TRUE (or YES) for DEE. They have the most fundamental meanings of all.

Note

I’ll discuss the whole notion of what relations in general mean in the next chapter.

By the way, a good way to remember which is which is this: YES and DEE both have an “E”; NO and DUM don’t.

I haven’t covered enough in this book yet to show concrete examples of DUM and DEE in action, as it were, but we’ll see plenty of examples of their use in the pages ahead. Here I’ll just mention one point that should make at least intuitive sense at this early juncture: DEE in particular plays a role in the relational algebra that’s analogous to the role played by zero in conventional arithmetic. And we all know how important zero is; in fact, it’s hard to imagine an arithmetic without zero (the ancient Romans tried, but it didn’t get them very far). Well, it should be equally hard to imagine a relational algebra without TABLE_DEE.

Summary

In this chapter I’ve presented precise definitions for the fundamental concepts tuple and relation. As I said in the introductory section, those definitions can be a little daunting at first, but I hope you were able to make sense of them after having read the first two chapters. I also discussed tuple and relation types, and showed some examples of tuple and relation selector invocations. And I discussed certain important consequences of the definitions:

No tuple, and therefore no relation, ever contains any nulls.
Every subset of a tuple is a tuple, every subset of a heading is a heading, and every subset of a body is a body (and those subsets can always be empty).
Two tuples are equal if and only if they’re the very same tuple.

Other consequences already covered in earlier chapters include:

Every tuple contains exactly one value for each of its attributes, and relations are therefore always in first normal form.
The attributes of a tuple, and therefore of a relation, are unordered, left to right.
The tuples of a relation are also unordered, top to bottom.
Relations never contain any duplicate tuples.

I also gave some important pragmatic arguments in favor of prohibiting both duplicates and nulls. And I finished up by briefly introducing the very important degree-zero relations TABLE_DUM and TABLE_DEE; DUM means NO or FALSE, and DEE means YES or TRUE.

Exercises

Exercise 3-1

Define as precisely as you can the terms attribute, body, cardinality, degree, heading, relation, relvar, and tuple. Also, every relation and every relvar is of some (relation) type; define the term relation type as precisely as you can too.

Exercise 3-2

What’s a literal?

Exercise 3-3

State as precisely as you can what it means for two tuples to be equal.

Exercise 3-4

We saw in the section “What’s a Tuple?” that it’s strictly incorrect to say that, for example, S1 or even ‘S1’ is a supplier number (“a value of type SNO is a value of type SNO, not a value of type CHAR”). As a consequence, Figure 1-3 in Chapter 1 is pretty sloppy, inasmuch as it pretends that it is correct to think of, for example, S1 as a supplier number. Show the correct way of referring to the various scalar values in that figure, given the types specified for attributes in the suppliers-and-parts database in the introductory section in Chapter 2.

Exercise 3-5

Write Tutorial D tuple selector invocations for a typical tuple from (a) the parts relvar, (b) the shipments relvar. Also show SQL’s counterparts, if any, to those selector invocations.

Exercise 3-6

Write a typical Tutorial D relation selector invocation. Also show SQL’s counterpart, if any, to that selector invocation.

Exercise 3-7

Does the concept of a tuple literal make any sense? What about a relation literal?

Exercise 3-8

(This is essentially a repeat of Exercise 1-8 from Chapter 1, but you should be able to give a more comprehensive answer now.) There are many differences between a relation and a table. List as many as you can.

Exercise 3-9

The attributes of a tuple can be of any type whatsoever (well, almost; can you think of any exceptions?). Give an example of (a) a tuple with a tuple-valued attribute, (b) a tuple with a relation-valued attribute (RVA).

Exercise 3-10

Tuple equality was defined precisely in the body of the chapter—but what about relation equality? (I’ll discuss the question of relation comparisons in general in Chapter 5, but you might like to take a shot at defining relation equality for yourself here.)

Exercise 3-11

Why don’t the comparison operators “<” and “>” apply to tuples? Note that, by contrast, they do apply to SQL rows. How can this be? (Hint: What’s the difference between SQL’s rules for row equality and the relational rules for tuple equality?)

Exercise 3-12

Give an example of a relation with (a) one RVA, (b) two RVAs. Also give two more relations that represent the same information as those relations but do not involve RVAs. Then give an example of a relation with an RVA such that there’s no relation without an RVA that represents precisely the same information. (You might prefer to come back to this exercise after reading more about RVAs in Chapter 5.)

Exercise 3-13

It’s sometimes suggested that a relation (or relvar, rather) is really just a traditional computer file, with tuples instead of records and attributes instead of fields. Discuss.

Exercise 3-14

Explain the relations TABLE_DUM and TABLE_DEE in your own words. Does SQL support them?

Exercise 3-15

The following is a legitimate example of an SQL row value constructor:

    ROW ( 1, NULL )

Is the row it represents null or not?

Exercise 3-16

“Duplicates are a good idea in databases because duplicates occur naturally in the real world. For example, all pennies are duplicates of one another.” How would you respond to this argument?

Exercise 3-17

Why are the databases of Figures 3-1 and 3-2 nonrelational?

Exercise 3-18

Do you think nulls occur naturally in the real world?

Exercise 3-19

Nulls were originally proposed as a solution to the problem of missing information. Now, it’s true that information is often missing in the real world (think of such things as “Speaker to be announced” or “Present address unknown,” for example). Therefore, if nulls are prohibited, how should we deal with missing information inside our database systems? (As I’m sure you noticed, I didn’t answer this question in the body of the chapter. This exercise might best serve as a basis for group discussion. I’ll have a little more to say on the subject in Chapter 7.)

Exercise 3-20

TABLE_DEE means TRUE and TABLE_DUM means FALSE. Do these facts mean we could dispense with the usual BOOLEAN data type? Also, DEE and DUM are relations, not relvars. Do you think it would ever make sense to define a relvar of degree zero?

Exercise 3-21

We saw in the body of the chapter that this SQL expression:

    VALUES ( 1, 2 ), ( 2, 1 ), ( 1, 1 ), ( 1, 2 )

denotes a table with four rows. What does the following SQL expression denote?

    VALUES ( ( 1, 2 ), ( 2, 1 ), ( 1, 1 ), ( 1, 2 ) )

^[*]Despite the fact that SQL often (though not always) refers to them explicitly as null values.

^[*]I now revert to the convention by which we omit the type names from a heading in informal contexts. Throughout the remainder of this book, in fact, I’ll feel free to regard headings as either including type names or excluding them—whichever best suits my purpose at the time.

^[*]Indeed, I think it could be argued that one reason we hear so much about the need for “multidimensional databases” (for decision support applications in particular) is precisely because so many people think relations aren’t multidimensional.

^[†]I use table terminology in this section because the things we’re talking about are certainly neither relations nor relvars; in particular, they have no keys (observe that there’s no double underlining in Figure 3-1).

^[*]As an exercise, show the result of this query given our usual sample data values (see Figure 1-3 in Chapter 1). Note: The symbol “<>” is SQL syntax for “≠” (and I’ll add for the record that the symbols “<=” and “>=” are SQL syntax for “≤” and “≥”, respectively, as you probably know).

^[*]Remarks analogous to those that follow, though possibly a little less severe, might be made in connection with “relations” with duplicates, or top-to-bottom tuple ordering, or left-to-right attribute ordering.

^[*]Let me say a little more about these pet names. First, for the benefit of non-English speakers, I should explain that they’re basically just wordplay on Tweedledum and Tweedledee, who were originally characters in a children’s nursery rhyme and were subsequently incorporated into Lewis Carroll’s Through the Looking-Glass. Second, the names are a little unfortunate, given that these two relations are precisely the ones that can’t reasonably be depicted as tables! But we’ve been using those names for so long now that we’re probably not going to change them.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Three. Tuples and Relations

Create new playlist

Sign In

Sign Up

Chapter Three. Tuples and Relations

What’s a Tuple?

Note

Note

Note

Some Important Consequences

Note

What’s a Relation?

Further Important Consequences

Why Duplicate Tuples Are Prohibited

Note

Why Nulls Are Prohibited

TABLE_DUM and TABLE_DEE

Note

Summary

Exercises

Exercise 3-1

Exercise 3-2

Exercise 3-3

Exercise 3-4

Exercise 3-5

Exercise 3-6

Exercise 3-7

Exercise 3-8

Exercise 3-9

Exercise 3-10

Exercise 3-11

Exercise 3-12

Exercise 3-13

Exercise 3-14

Exercise 3-15

Exercise 3-16

Exercise 3-17

Exercise 3-18

Exercise 3-19

Exercise 3-20

Exercise 3-21

Table of Contents for
Three. Tuples and Relations