Chapter 6

SQL and Relational Algebra I: The Original Operators

Join the union!

—Susan B. Anthony (1869)

This is the first of two chapters on the operators of the relational algebra; it discusses the original operators (i.e., the ones briefly described in Chapter 1) in some depth, and it also examines certain ancillary but important issues—e.g., the significance of proper attribute (or column) naming once again. It also explains the implications of such matters for our overall goal of using SQL relationally.

SOME PRELIMINARIES

I’ll begin by reviewing a few points from Chapter 1. First, recall that each algebraic operator takes at least one relation as input and produces another relation as output.1 Second, recall too that the fact that the output is the same kind of thing as the input(s)—they’re all relations— constitutes the closure property of the algebra, and it’s that property that lets us write nested relational expressions. Third, I gave outline descriptions in Chapter 1 of what I there called “the original operators” (restrict, project, product, union, intersect, difference, and join); however, now I’m in a position to define those operators, and others, much more carefully. Before I can do that, however, I need to make a few more general points:

  • The operators of the algebra are generic, meaning they apply (in effect) to all possible relations. For example, we don’t need one specific join operator to join departments and employees and another, different, join operator to join suppliers and shipments. (Incidentally, do you think an analogous remark applies to object systems?).

  • The operators are all read-only: They “read” their operands and they return a result, but they don’t update anything. In other words, they operate on relations, not relvars.

  • Of course, the previous point doesn’t mean that relational expressions can’t refer to relvars. For example, if R1 and R2 are relvar names, then R1 UNION R2 is certainly a valid relational expression in Tutorial D (just so long as the relvars denoted by those names are of the same type, that is). In that expression, however, R1 and R2 don’t denote those relvars as such; rather, they denote the relations that happen to be the current values of those relvars at that time. In other words, we can certainly use a relvar name to denote a relation operand—and such a relvar reference in itself thus constitutes a valid relational expression2—but in principle we could equally well denote the very same operand by means of an appropriate relation literal instead.3

    An analogy might help clarify this latter point. Suppose N is a variable of type INTEGER, and at time t it has the value 3. Then N + 2 is certainly a valid numeric expression, but at time t it means exactly the same as 3 + 2, no more and no less.

  • Finally, given that the operators of the algebra are indeed all read-only, it follows that INSERT, DELETE, and UPDATE (and relational assignment), though they’re certainly relational operators, aren’t relational algebra operators as such—though, regrettably, you’ll often come across statements to the contrary in database textbooks and elsewhere.

I also need to say something here about Tutorial D specifically, because it’s in its support for the algebra in particular that the design ofTutorial D differs most significantly from that of SQL. The overriding point is this: In operations like UNION or JOIN that need some kind of correspondence to be established between attributes of their operands, Tutorial D does so by requiring the attributes in question to be, formally, the very same attribute (i.e., to have the same name and same type). For example, here’s a Tutorial D expression for the join of parts and suppliers on cities:

P JOIN S

The join operation here is performed, by definition, on the basis of part and supplier cities, CITY being the only attribute that P and S have in common (i.e., the only common attribute).

To repeat, Tutorial D establishes the correspondence between operand attributes, when such a correspondence is required, by insisting that the attributes in question in fact be one and the same. And it applies this same rule uniformly and consistently across the board, in all pertinent contexts. By contrast, SQL uses different rules in different contexts. Sometimes it uses ordinal position (we’ve already seen an example of this case in connection with foreign keys, as discussed in the previous chapter). Sometimes it uses explicit specifications (and such explicit specifications take different forms in different contexts). Sometimes it requires the attributes in question (or columns, rather) to have the same name—and then the correspondence is sometimes established explicitly, sometimes implicitly. And regardless of whether it requires the columns in question to have the same name, sometimes it requires those columns to be of the same type, and sometimes it doesn’t. In order to illustrate a few of these possibilities, let’s consider the P JOIN S example again. Here’s one possible formulation of that join in SQL:

SELECT P.PNO , P.PNAME , P.COLOR , P.WEIGHT , P.CITY
                                        /* or S.CITY */ ,
       S.SNO , S.SNAME , S.STATUS
FROM   P , S
WHERE  P.CITY = S.CITY

In this formulation, the required column correspondence is specified explicitly in the WHERE clause. As you probably know, however, examples like this one can in fact be formulated in several different ways in SQL.4 Here are three more SQL formulations of the P JOIN S example (I’ve numbered them for purposes of subsequent reference; as you can see, formulations 2 and 3 are a little closer to the spirit of Tutorial D):5

  1. SELECT P.PNO , P.PNAME , P.COLOR , P.WEIGHT , P.CITY
                                            /* or S.CITY */ ,
           S.SNO , S.SNAME , S.STATUS
    FROM   P JOIN S
    ON     P.CITY = S.CITY

  2. SELECT P.PNO , P.PNAME , P.COLOR , P.WEIGHT , CITY ,
           S.SNO , S.SNAME , S.STATUS
    FROM   P JOIN S
    USING  ( CITY )

  3. SELECT P.PNO , P.PNAME , P.COLOR , P.WEIGHT , CITY ,
           S.SNO , S.SNAME , S.STATUS
    FROM   P NATURAL JOIN S

Observe now that:

  • In formulation 1, the column correspondence is again specified explicitly, but this time by means of an ON clause instead of a WHERE clause.

  • In formulation 2, the correspondence is based on common column names, but it’s still specified explicitly by means of the USING clause.

  • In formulation 3, the correspondence is again based on common column names, but this time it’s implicit.

Here are some further points of difference between SQL and Tutorial D, most of them also arising from the difference in the languages’ respective approaches to establishing attribute (or column) correspondence for the purposes of operators like join:

  • SQL permits, and sometimes requires, dot qualified names. Tutorial D doesn’t. (I’ll have more to say about SQL’s dot qualified names in Chapter 12.)

  • Tutorial D sometimes needs to rename attributes in order to avoid what would otherwise be naming clashes or mismatches. SQL usually doesn’t—though for other reasons it does support an analog of the RENAME operator that Tutorial D uses for the purpose, as we’ll see in the next section.

  • Partly as a consequence of the previous point, Tutorial D has no need for SQL’s “correlation name” concept; in effect, it replaces that concept by the idea that attributes sometimes need to be renamed, as mentioned in the previous bullet item. (I’ll be discussing SQL’s correlation names in detail in Chapter 12.)

  • As well as supporting (either explicitly or implicitly) certain features of the relational algebra, SQL also explicitly supports certain features of the relational calculus (correlation names are a case in point, and EXISTS is another). Tutorial D doesn’t. One consequence of this difference is that SQL is a highly redundant language, in that it typically provides numerous different ways of formulating the same query, a fact that can have serious negative implications for both the user and the optimizer. (I once wrote a paper on this topic called “Fifty Ways to Quote Your Query”—see Appendix G—in which I showed that even a query as simple as “Get names of suppliers who supply part P2” can be expressed in well over 50 different ways in SQL.)

  • SQL requires most query formulations to conform to its SELECT - FROM - WHERE template. Tutorial D has no analogous requirement. Note: I’ll have more to say on this particular issue in the next chapter.

    In what follows, I’ll show examples in both Tutorial D and SQL.

MORE ON CLOSURE

To say it again, the result of every relational operation is a relation. Conversely, any operator that produces a result that isn’t a relation is, by definition, not a relational operator.6 For example, any operator that produces an ordered result isn’t a relational operator (see the discussion of ORDER BY in the next chapter). And in SQL in particular, the same is true of any operator that produces a result with duplicate rows, or left to right column ordering, or nulls, or anonymous columns, or duplicate column names. Closure is crucial! As I said near the beginning of the previous section, closure is what makes it possible to write nested relational expressions, and (as we’ll see later) it’s also important in expression transformation, and hence in optimization. Strong recommendation: Don’t use any operation that violates closure if you want the result to be amenable to further relational processing.

Now, when I say the result of every algebraic operation is another relation, I hope it’s clear that I’m talking from a conceptual point of view; I don’t mean the system always has to materialize those results in their entirety. For example, consider the following expression (a restriction of a join—Tutorial D on the left and SQL on the right as usual, and I’ve deliberately shown explicit dot qualifications in the SQL version):

( P JOIN S )                 SELECT P.* , S.SNO , S.SNAME , S.STATUS
WHERE PNAME > SNAME          FROM   P , S
                             WHERE  P.CITY = S.CITY
                             AND    P.PNAME > S.SNAME

Clearly, as soon as any given tuple of the join is formed, the system can test that tuple right away against the condition PNAME > SNAME (P.PNAME > S.SNAME in the SQL version) to see if it belongs in the final output, discarding it if not.7 Thus, the intermediate result that’s the output from the join might never have to exist as a fully materialized relation in its own right at all. In practice, in fact, the system tries very hard not to materialize intermediate results in their entirety, for obvious performance reasons. (As an aside, I remark that the process by which tuples of an intermediate result are produced and passed on to another operation one at a time instead of en bloc is sometimes referred to as pipelining.)

The foregoing example raises another point, however. Consider the boolean expression PNAME > SNAME in the Tutorial D version. That expression applies, conceptually, to the result of P JOIN S, and the attribute names PNAME and SNAME in that expression therefore refer to attributes of that result—not to the attributes of those names in relvars P and S. But how do we know that result has any such attributes? What is the heading of that result? More generally, what’s the heading for the result of any algebraic operation? Clearly, what we need is a set of rules—to be specific, a set of relation type inference rules—such that if we know the headings (and therefore the types) of the input relations for an operation, we can infer the heading (and therefore the type) of the output relation from that operation. And the relational model does include such a set of rules. In the case at hand, for example, those rules say the output from P JOIN S is of this type:

RELATION { PNO CHAR , PNAME CHAR , COLOR CHAR , WEIGHT RATIONAL ,
           CITY CHAR , SNO CHAR , SNAME CHAR , STATUS INTEGER }

In fact, for join, the heading of the output is simply the union of the headings of the inputs (where by union I mean the regular set theory union, not the special relational union I’ll be discussing later in this chapter). In other words, the output has all of the attributes of the inputs, except that common attributes—just CITY in the example—appear once, not twice, in that output. Of course, those attributes don’t have any left to right order, so I could equally well say the type of the result of P JOIN S is (for example):

RELATION { SNO CHAR , PNO CHAR , SNAME CHAR , WEIGHT RATIONAL ,
           CITY CHAR , STATUS INTEGER , PNAME CHAR , COLOR CHAR }

Note that type inference rules of some kind are definitely needed in order to support the closure property fully. After all, closure says every result is a relation, and relations have a heading as well as a body, so every result must have a proper relational heading as well as a proper relational body.

Now, the RENAME operator mentioned in the previous section is needed in large part because of the foregoing type inference rules; among other things, it allows us to perform, e.g., a join, even when the relations involved don’t meet the attribute naming requirements for that operation (speaking a trifle loosely). Here’s the definition:

Definition: Let relation r have an attribute called A and no attribute called B. Then (and only then) the expression r RENAME {A AS B} denotes an (attribute) renaming on r, and it returns the relation with heading identical to that of r except that attribute A in that heading is renamed B, and body identical to that of r (except that references to A in that body—more precisely, in tuples in that body—are replaced by references to B, a nicety that can be ignored for present purposes).

For example:

S RENAME { CITY AS SCITY }          SELECT SNO , SNAME , STATUS ,
                                           S.CITY AS SCITY
                                    FROM   S

Given our usual sample values, the result looks like this (it’s identical to our usual suppliers relation, except that the city attribute is called SCITY):

┌─────┬───────┬────────┬────────┐
SNO SNAME STATUS SCITY  
├═════┼───────┼────────┼────────┤
S1   Smith      20 London
S2   Jones      10 Paris  
S3   Blake      30 Paris  
S4   Clark      20 London
S5   Adams      30 Athens
└─────┴───────┴────────┴────────┘

Note: I won’t usually bother to show results explicitly in this chapter unless I think the particular operator I’m talking about might be unfamiliar to you, as in the case at hand.

Important: The foregoing example does not change relvar S in the database! RENAME isn’t like SQL’s ALTER TABLE; the RENAME invocation is only an expression (just as, for example, P JOIN S or N + 2 are only expressions), and like any expression it simply denotes a value. What’s more, since it is an expression, not a statement or “command,” it can be nested inside other expressions. We’ll see plenty of examples of such nesting later.

So how does SQL handle this business of result type inference? The answer is: Not very well. First of all, as we saw in Chapter 3, it doesn’t really have a notion of “relation type” (or table type, rather) anyway. Second, it can produce results with columns that effectively have no name at all (for example, consider SELECT PNO, 2 * WEIGHT FROM P). Third, it can also produce results with duplicate column names (for example, consider SELECT DISTINCT P.CITY, S.CITY FROM P, S). Strong recommendation: Follow the column naming discipline from Chapter 3 wherever necessary to ensure that SQL conforms as far as possible to the relational rules described in this chapter. Just to remind you, that discipline involved using AS specifications to give proper column names to columns that otherwise (a) wouldn’t have a name at all or (b) would have a name that wasn’t unique. My SQL examples in this chapter and the next (indeed, throughout the rest of this book) will all abide by this discipline.

I haven’t finished with the example from the beginning of this section. Here it is again:

( P JOIN S )                 SELECT P.* , S.SNO , S.SNAME , S.STATUS
WHERE PNAME > SNAME          FROM   P , S
                             WHERE  P.CITY = S.CITY
                             AND    P.PNAME > S.SNAME

As you can see, the counterpart in the SQL version to Tutorial D’s PNAME > SNAME is P.PNAME > S.SNAME (note the “P.” and “S.” qualifiers)—which is curious when you come to think about it, because that expression is supposed to apply to the result of the FROM clause (see the section “Evaluating SQL Expressions,” later in this chapter), and tables P and S certainly aren’t part of that result! Indeed, it’s quite difficult to explain how references to the names P and S in the WHERE and SELECT clauses (and possibly elsewhere in the overall expression) can make any sense at all in terms of the result of the FROM clause. The SQL standard does explain it, but the machinations it has to go through in order to do so are much more complicated than Tutorial D’s type inference rules—so much so that I won’t even try to explain them here, but will simply rely on the fact that they can be explained if necessary. I justify this omission by appealing to the fact that you’re supposed to be familiar with SQL already. It’s tempting to ask, though, whether you had ever thought about this issue before ... but I won’t.

Now I can go on to describe some other algebraic operators. Please note that I’m not trying to be exhaustive in this chapter (or indeed the next); I won’t be covering “all known operators,” and I won’t even describe all of the operators I do cover in full generality. In most cases, in fact, I’ll just give a careful but somewhat informal definition and show some simple examples.

RESTRICTION

Definition: Let r be a relation and let bx be a boolean expression in which every attribute reference identifies some attribute of r and there aren’t any relvar references. Then (and only then) (a) bx is a restriction condition on r, (b) the expression r WHERE bx denotes the restriction of r according to bx, and (c) it returns the relation with heading the same as that of r and body consisting of all tuples of r for which bx evaluates to TRUE.

For example:

P WHERE WEIGHT < 17.5          SELECT *
                               FROM   P
                               WHERE  WEIGHT < 17.5

Let r be a relation. Then the restriction r WHERE TRUE (or, more generally, any expression of the form r WHERE bx where bx is a boolean expression such as 1 = 1 that’s identically TRUE) just returns r. Such a restriction is known as an identity restriction.

Note: Tutorial D does support expressions of the form r WHERE bx, of course, but those expressions aren’t limited to being simple restrictions as defined above, because the boolean expression bx isn’t limited to being a restriction condition but can be more general. Similar remarks apply to SQL also. Examples are given in later chapters.

As an aside, I remark that restrict is sometimes called select; I prefer not to use this term, however, because of the potential confusion with SQL’s SELECT operator (also with selectors as described in Chapter 2). SQL’s SELECT operator—meaning, more precisely, the SELECT clause portion of a SELECT expression—isn’t restriction at all but is, rather, a kind of loose combination of UNGROUP, EXTEND, RENAME, and “project” (“project” in quotes because it doesn’t eliminate duplicates unless explicitly asked to do so). Note: UNGROUP and EXTEND are described in the next chapter.

PROJECTION

Definition: Let relation r have attributes called A1, A2, ..., An (and possibly others). Then (and only then) the expression r{A1,A2,...,An} denotes the projection of r on {A1, A2, ..., An}, and it returns the relation with heading consisting of attributes A1, A2, ..., An and body consisting of all tuples t such that there exists a tuple in r that has the same value for attributes A1, A2, ..., An as t does.

For example:

P { COLOR , CITY }          SELECT DISTINCT COLOR , CITY
                            FROM   P

To repeat, the result is a relation; thus, “duplicates are eliminated,” to use the common phrase, and that DISTINCT in the SQL formulation is really needed, therefore.8 The result heading has attributes (or columns) COLOR and CITY—in that left to right order, in SQL.

Let r be a relation. Then:

  • The projection of r on all of its attributes just returns r. Such a projection is known as an identity projection. For example, the expression SP{SNO,QTY,PNO} denotes the identity projection of the relation that’s the current value of relvar SP.

  • The projection r{ }—in other words, the projection of r on no attributes at all—returns TABLE_DEE if r is nonempty, TABLE_DUM otherwise. Such a projection is sometimes called a nullary projection; however, the term nullary is best avoided because of the potential confusion with SQL-style nulls.

    Note: Just to remind you, TABLE_DEE is the unique relation with no attributes and just one tuple—the 0-tuple, of course—and TABLE_DUM is the unique relation with no attributes and no tuples at all. The fact that projecting r on no attributes always yields one of these two relations is a direct consequence of the fact that every tuple has the same value for the empty set of attributes (namely, the 0-tuple). See the answer to Exercise 3.16 in Chapter 3 if you need to refresh your memory regarding this point.

Tutorial D also allows a projection to be expressed in terms of the attributes to be removed instead of the ones to be kept. Thus, for example, the Tutorial D expressions

P { COLOR , CITY }  and  P { ALL BUT PNO , PNAME , WEIGHT }

mean exactly the same thing. This feature can save a lot of writing (think of projecting a relation of degree 100 on 99 of its attributes).9 Analogous remarks apply, wherever they make sense, to all of the operators in Tutorial D.

In concrete syntax, it turns out to be convenient to assign high precedence to the projection operator. In Tutorial D, for example, we take the expression

P JOIN S { CITY }

to mean

P JOIN ( S { CITY } )

and not

( P JOIN S ) { CITY }

Exercise: Show the difference between these two interpretations, given our usual sample data.

JOIN

Before I get to the join operator as such, it’s helpful to introduce the concept of joinability. Relations r1 and r2 are joinable if and only if attributes with the same name are of the same type (meaning they are in fact the very same attribute)—equivalently, if and only if the set theory union of the headings of r1 and r2 is itself a legal heading. Note that this concept applies not just to join as such but to various other operations as well, as we’ll see in the next chapter. Anyway, armed with this notion, I can now define the join operation (note how the definition appeals to the fact that tuples are sets and hence can be operated upon by set theory operators such as union):

Definition: Let relations r1 and r2 be joinable. Then (and only then) the expression r1 JOIN r2 denotes the natural join (or just the join for short) of r1 and r2, and it returns the relation with heading the set theory union of the headings of r1 and r2 and body the set of all tuples t such that t is the set theory union of a tuple from r1 and a tuple from r2.

The following example is repeated from the section “Some Preliminaries,” except that now I’ve dropped the explicit dot qualifications from the SQL version where they aren’t needed:

P JOIN S          SELECT PNO , PNAME , COLOR , WEIGHT ,
                         P.CITY /* or S.CITY */ ,
                         SNO , SNAME , STATUS
                  FROM   P , S
                  WHERE  P.CITY = S.CITY

I remind you, however, that SQL also allows this join to be expressed in a style that’s a little closer to that of Tutorial D (and this time I deliberately replace that long commalist of column references in the SELECT clause by a simple “*”):

SELECT *
FROM   P NATURAL JOIN S

The result heading, given this latter formulation, has attributes—or columns, rather— CITY, PNO, PNAME, COLOR, WEIGHT, SNO, SNAME, and STATUS (in that order in SQL, though not of course in the Tutorial D analog). Note: I’ll have more to say on this SQL column ordering issue in the subsection “Explicit JOINs in SQL,” later in this section.

There are several further points to be made in connection with the natural join operation. First of all, observe that intersection is a special case (i.e., r1 INTERSECT r2 is a special case of r1 JOIN r2, in Tutorial D terms). To be specific, it’s the special case in which relations r1 and r2 aren’t merely joinable but are actually of the same type (i.e., have the same heading). For example, the following expressions are logically equivalent:

P { CITY } INTERSECT S { CITY }

P { CITY } JOIN S { CITY }

However, I’ll have more to say about INTERSECT as such later in this chapter.

Next, product is a special case, too (i.e., r1 TIMES r2 is a special case of r1 JOIN r2, in Tutorial D terms). To be specific, it’s the special case in which relations r1 and r2 have no attribute names in common. Why? Because, in this case, (a) the set of common attributes is empty; (b) as noted earlier, every possible tuple has the same value for the empty set of attributes (namely, the 0-tuple); thus, (c) every tuple in r1 joins to every tuple in r2, and so we get the product as stated. For example, the following expressions are logically equivalent:

P { ALL BUT CITY } TIMES S { ALL BUT CITY }

P { ALL BUT CITY } JOIN S { ALL BUT CITY }

For completeness, however, I’ll give the definition anyway:

Definition: Let relations r1 and r2 have no attribute names in common. Then (and only then) the expression r1 TIMES r2 denotes the cartesian product (or just the product for short) of r1 and r2, and it returns the relation with heading the set theory union of the headings of r1 and r2 and body the set of all tuples t such that t is the set theory union of a tuple from r1 and a tuple from r2.

Here’s an example:

( P RENAME { CITY AS PCITY } )      SELECT PNO , PNAME , COLOR ,
  TIMES  /* or JOIN */                     WEIGHT , P.CITY AS PCITY ,
( S RENAME { CITY AS SCITY } )             SNO , SNAME , STATUS ,
                                           S.CITY AS SCITY
                                    FROM   P , S

Note the need to rename at least one of the two CITY attributes in this example (in fact I’ve renamed them both, purely for reasons of symmetry). The result heading has attributes or columns PNO, PNAME, COLOR, WEIGHT, PCITY, SNO, SNAME, STATUS, and SCITY (in that order, in SQL).

Last, join is usually thought of as a dyadic operator specifically; however, it’s possible, and useful, to define an n-adic version of the operator (and Tutorial D does), according to which we can write expressions of the form

JOIN { r1 , r2 , ... , rn }

to join any number of relations r1, r2, ..., rn.10 For example, the join of parts and suppliers could alternatively be expressed as follows:

JOIN { P , S }

What’s more, we can use this syntax to ask for “joins” of just a single relation, or even of no relations at all! The join of a single relation r, JOIN {r}, is just r itself; this case is perhaps not of much practical importance (?). Perhaps surprisingly, however, the join of no relations at all, JOIN { }, is very important indeed!—and the result is TABLE_DEE. (Recall once again that TABLE_DEE is the unique relation with no attributes and just one tuple.) Why is the result TABLE_DEE? Well, consider the following:

  • In ordinary arithmetic, 0 is what’s called the identity (or identity value) with respect to “+”; that is, for all numbers x, the expressions x + 0 and 0 + x are both identically equal to x. As a consequence, the sum of an empty set of numbers is 0.11 (To see this claim is reasonable, consider a piece of code that computes the sum of n numbers by initializing the sum to 0 and then iterating over those n numbers. What happens if n = 0?)

  • In like fashion, 1 is the identity with respect to “*”; that is, for all numbers x, the expressions x * 1 and 1 * x are both identically equal to x. As a consequence, the product of an empty set of numbers is 1.

  • In the relational algebra, TABLE_DEE is the identity with respect to JOIN; that is, for all relations r, the expressions r JOIN TABLE_DEE and TABLE_DEE JOIN r are both identically equal to r (see the paragraph immediately following). As a consequence, the join of an empty set of relations is TABLE_DEE.

If you’re having difficulty with this idea, don’t worry about it too much for now. But if you come back to reread this section later, I do suggest you try to convince yourself that r JOIN TABLE_DEE and TABLE_DEE JOIN r are indeed both identically equal to r. It might help to point out that the joins in question are actually cartesian products (right?).

Explicit JOINs in SQL

In SQL, the keyword JOIN can be used to express various kinds of join operations (although those operations can always be expressed without it, too). Simplifying slightly, the possibilities—I’ve numbered them for purposes of subsequent reference—are as follows (t1 and t2 are tables, denoted by table expressions tx1 and tx2, say; bx is a boolean expression; and C1, C2, ..., Cn are columns appearing in both t1 and t2):

  1. t1 NATURAL JOIN t2

  2. t1 JOIN t2 ON bx

  3. t1 JOIN t2 USING ( C1 , C2 , ... , Cn )

  4. t1 CROSS JOIN t2

I’ll elaborate on the four cases briefly, since the differences between them are a little subtle and can be hard to remember:

  1. Case 1 has effectively already been explained. Note: Actually, Case 1 is logically identical to a Case 3 expression12—see below—in which the specified columns C1, C2, ..., Cn are all of the common columns (i.e., all of the columns that appear in both t1 and t2), in the order in which they appear in t1.

  2. Case 2 is logically equivalent to the following:

    ( SELECT * FROM t1 , t2 WHERE bx )

  3. Case 3 is logically equivalent to a Case 2 expression in which bx takes the form

    t1.C1 = t2.C1 AND t1.C2 = t2.C2 AND ... AND t1.Cn = t2.Cn

    —except that columns C1, C2, ..., Cn appear once, not twice, in the result, and the column ordering in the heading of the result is (in general) different: Columns C1, C2, ..., Cn appear first, in that order; then the other columns of t1 appear, in the order in which they appear in t1; then the other columns of t2 appear, in the order in which they appear in t2. (Do you begin to see what a pain this left to right ordering business is?)

  4. Finally, Case 4 is logically equivalent to the following:

    ( SELECT * FROM t1 , t2 )

Recommendations:

  1. Use Case 1 (NATURAL JOIN) in preference to other methods of formulating a join (but make sure columns with the same name are of the same type). Note that the NATURAL JOIN formulation will often be the most succinct if other recommendations in this book are followed.13

  2. Avoid Case 2 (JOIN ON), because it’s guaranteed to produce a result with duplicate column names (unless tables t1 and t2 have no common column names in the first place). But if you really do want to use Case 2—which you just might, if you want to formulate a greater-than join, say14—then make sure columns with the same name are of the same type, and make sure you do some appropriate renaming as well. For example:

    SELECT temp.*
    FROM ( SELECT * FROM S JOIN P ON S.CITY > P.CITY ) AS temp
         ( SNO , SNAME , STATUS , SCITY ,
           PNO , PNAME , COLOR , WEIGHT , PCITY )

    It’s not really clear why you’d ever want to use such a formulation, however, given that it’s logically equivalent to the following slightly less cumbersome one:

    SELECT SNO , SNAME , STATUS , S.CITY AS SCITY ,
           PNO , PNAME , COLOR , WEIGHT , P.CITY AS PCITY
    FROM   S , P
    WHERE  S.CITY > P.CITY

  3. In Case 3 (JOIN USING), make sure columns with the same name are of the same type.

  4. In Case 4 (CROSS JOIN), make sure there aren’t any common column names.

Recall finally that, as noted in Chapter 1, an explicit JOIN invocation isn’t allowed in SQL as a “stand alone” table expression (i.e., one at the outermost level of nesting). Nor is it allowed as the table expression in parentheses that constitutes a subquery (see Chapter 12).

UNION, INTERSECTION, AND DIFFERENCE

Union, intersection, and difference (UNION, INTERSECT, and MINUS in Tutorial D; UNION, INTERSECT, and EXCEPT in SQL) all follow the same general pattern. I’ll start with union.

Union

Definition: Let relations r1 and r2 be of the same type T. Then (and only then) the expression r1 UNION r2 denotes the union of r1 and r2, and it returns the relation of type T with body the set of all tuples t such that t appears in at least one of r1 and r2.

For example (I’ll assume, just for the sake of the examples in this section, that parts have an extra attribute called STATUS, of type INTEGER):

P { STATUS , CITY } UNION          SELECT STATUS , CITY
S { CITY , STATUS }                FROM   P
                                   UNION CORRESPONDING
                                   SELECT CITY , STATUS
                                   FROM   S

As with projection, it’s worth noting explicitly in connection with union that “duplicates are eliminated.” Note that we don’t need to specify DISTINCT in the SQL version in order to achieve this effect; although UNION provides the same options as SELECT does (DISTINCT vs. ALL), the default for UNION is DISTINCT, not ALL (for SELECT it’s the other way around, as you’ll recall from Chapter 4). The result heading has attributes or columns STATUS and CITY—in that order, in SQL. As for the CORRESPONDING specification in the SQL formulation, that specification allows us to ignore the possibility that those columns might appear at different ordinal positions within the operand tables. Recommendations:

  • Make sure every column of the first operand table has the same name and type as some column of the second operand table and vice versa.

    Aside: Here’s another SQL question for you: Does SQL in fact allow corresponding columns in those operand tables to be of different types? Answer: Yes, it does. The standard’s own definition of the result of A UNION B (not meant to be valid SQL syntax) runs something like this: Let r be a row that’s a duplicate of both some row in A and some row in B. Then the result contains (a) exactly one duplicate of every such row r and (b) no row that’s not a duplicate of some such row r. The reason for this somewhat convoluted definition is that two rows can be duplicates in SQL without being identical, owing to the fact that (as we saw in Chapter 2, section “Type Checking and Coercion in SQL”) two scalar values in turn can “compare equal” without being identical.

    Note, incidentally, that the foregoing definition is still not complete, in that it fails to specify exactly which particular “duplicate of row r” actually appears in the result. End of aside.

  • Always specify CORRESPONDING if possible.15 If it isn’t—in particular, if the SQL product you’re using doesn’t support it—then make sure columns line up properly, as in this revised version of the example:

    SELECT STATUS , CITY FROM P
    UNION
    SELECT STATUS , CITY FROM S   /* note the left to right reordering */

  • Don’t include the “BY (column name commalist)” option in the CORRESPONDING specification, unless it makes no difference anyway (e.g., specifying BY (STATUS,CITY) would make no difference in the example).16

    Note: This recommendation is perhaps a little debatable. At least the BY option might sometimes save keystrokes (though not always—see the example below). But it’s misleading, because it means the union operands aren’t the specified tables as such but certain projections of those tables; it’s also unnecessary, because those projections could always be specified explicitly anyway. For example, the SQL expression

    SELECT * FROM P
    UNION  CORRESPONDING BY ( CITY )
    SELECT * FROM S

    is logically equivalent to this (shorter!) one:

    SELECT CITY FROM P
    UNION
    SELECT CITY FROM S

  • Never specify ALL. Note: The usual reason for specifying ALL on UNION isn’t that users want to see duplicate rows in the output; rather, it’s that they know there aren’t any duplicate rows in the input—i.e., the union is disjoint (see below)—and so they’re trying to prevent the system from having to do the extra work of trying to eliminate duplicates that they know aren’t there in the first place. In other words, it’s a performance reason. See the discussion of such matters in Chapter 4, in the section “Avoiding Duplicates in SQL.”

Tutorial D also supports “disjoint union” (D_UNION), which is a version of union that requires its operands to have no tuples in common. For example:

S { CITY } D_UNION P { CITY }

Given our usual sample data, this expression will produce a run time error, because supplier cities and part cities aren’t disjoint. SQL has no direct counterpart to D_UNION.

Tutorial D also supports n-adic forms of both UNION and D_UNION. The syntax consists—with one small exception, explained below—of the operator name (i.e., UNION or D_UNION), followed by a commalist in braces of relational expressions r1, r2, ..., rn. The relations denoted by r1, r2, ..., rn must all be of the same type. For example, the foregoing D_UNION example could alternatively be expressed as follows:

D_UNION { S { CITY } , P { CITY } }

Note: The union or disjoint union of a single relation r is just r. The union or disjoint union of no relations at all is the empty relation of the pertinent type—but that type needs to be specified explicitly, since there aren’t any relational expressions from which the type can be inferred. Thus, for example, the expression

UNION { SNO CHAR , STATUS INTEGER } { }

denotes the empty relation of type RELATION {SNO CHAR, STATUS INTEGER}. Compare the answer to Exercise 3.15 in Chapter 3.

Intersection

Definition: Let relations r1 and r2 be of the same type T. Then (and only then) the expression r1 INTERSECT r2 denotes the intersection of r1 and r2, and it returns the relation of type T with body the set of all tuples t such that t appears in each of r1 and r2.

For example:

P { STATUS , CITY } INTERSECT          SELECT    STATUS , CITY
S { CITY , STATUS }                    FROM      P
                                       INTERSECT CORRESPONDING
                                       SELECT    CITY , STATUS
                                       FROM      S

All comments and recommendations noted under “Union” apply here also, mutatis mutandis. Note: As we’ve already seen, intersect is really just a special case of join. Tutorial D and SQL both support it, however, if only for psychological reasons. As mentioned in a footnote earlier, Tutorial D also supports an n-adic form, but I’ll skip the details here.

Difference

Definition: Let relations r1 and r2 be of the same type T. Then (and only then) the expression r1 MINUS r2 denotes the difference between r1 and r2 (in that order), and it returns the relation of type T with body the set of all tuples t such that t appears in r1 and not in r2.

For example:

P { STATUS , CITY } MINUS          SELECT STATUS , CITY
S { CITY , STATUS }                FROM   P
                                   EXCEPT CORRESPONDING
                                   SELECT CITY , STATUS
                                   FROM   S

All comments and recommendations noted under “Union” apply here also, mutatis mutandis. Note, however, that minus is strictly dyadic—Tutorial D doesn’t support any kind of “n-adic minus” operation (see Exercise 6.17 at the end of the chapter). But it does support “included minus” (I_MINUS), which is a version of minus that requires the second operand to be included in the first (i.e., the second operand mustn’t contain any tuples that aren’t also contained in the first operand). For example:

S { CITY } I_MINUS P { CITY }

Given our usual sample data, this expression will produce a run time error, because there’s at least one part city that isn’t also a supplier city. SQL has no direct counterpart to I_MINUS.

WHICH OPERATORS ARE PRIMITIVE?

I’ve now covered all of the operators I want to cover in this chapter. As I’ve more or less said already, however, not all of those operators are primitive—some of them can be defined in terms of others. One possible primitive set is the set {restrict, project, join, union, difference}; another can be obtained by replacing join in this set by product. Note: You might be surprised to see no mention here of rename. In fact, however, rename isn’t primitive, though I haven’t covered enough groundwork yet to show why not (see Exercise 7.3 in Chapter 7). What this discussion does show, however, is that there’s a difference between being primitive and being useful! I certainly wouldn’t want to be without our useful rename operator, even if it isn’t primitive.

FORMULATING EXPRESSIONS ONE STEP AT A TIME

Consider the following Tutorial D expression (the query is “Get pairs of supplier numbers such that the suppliers concerned are colocated—i.e., are in the same city”):

( ( ( S RENAME { SNO AS SA } ) { SA , CITY } JOIN
    ( S RENAME { SNO AS SB } ) { SB , CITY } )
                               WHERE SA < SB ) { SA , SB }

The result has two attributes, called SA and SB (it would have been sufficient to do just one attribute renaming; once again I did two for symmetry). The purpose of the condition SA < SB is twofold:17

  • It eliminates pairs of supplier numbers of the form (a,a).

  • It guarantees that the pairs (a,b) and (b,a) won’t both appear.

Be that as it may, I now show another formulation of the query in order to show how Tutorial D’s WITH construct can be used to simplify the business of formulating what might otherwise be rather complicated expressions:

WITH ( t1 := ( S RENAME { SNO AS SA } ) { SA , CITY } ,
       t2 := ( S RENAME { SNO AS SB } ) { SB , CITY } ,
       t3 := t1 JOIN t2 ,
       t4 := t3 WHERE SA < SB ) :
t4 { SA, SB }

As the example suggests, a WITH specification in Tutorial D can appear as a prefix to a relational expression. Such a specification consists of the keyword WITH followed by a parenthesized commalist of assignments of the form name := expression, that parenthesized commalist then being followed in its entirety by a colon. For each of those “name := expression” assignments, the expression on the right side is evaluated and the result effectively assigned to the temporary variable whose name appears on the left side (in the example, I’ve used the names t1, t2, t3, t4—“t” for temporary). Also, the assignments are executed in sequence as written; as a consequence, any given assignment in the commalist is allowed to refer to names introduced in assignments earlier in that same commalist. Those introduced names can also be referenced in the relational expression that appears following the colon.

Tutorial D allows WITH specifications to appear on statements as well as on expressions. For example:

WITH ( temp := RELATION { TUPLE { SNO 'S5' , PNO 'P6' , QTY 250 } } ) :
SP := SP UNION temp ;

SQL too supports a WITH construct, with these differences:

  • WITH in Tutorial D can be used at any level of nesting. By contrast, WITH in SQL can be used only at the outermost level.18

  • WITH in Tutorial D can be used in connection with expressions of any kind.19 By contrast, WITH in SQL can be used only in connection with table expressions specifically.

  • SQL uses the keyword AS in place of Tutorial D’s assignment symbol (“:=”).

  • SQL doesn’t use the enclosing parentheses or colon separator.

  • As already noted, Tutorial D allows WITH specifications on statements as well as expressions. SQL doesn’t.

Also, the name portion of a “name AS expression” within an SQL WITH specification can optionally be followed by a parenthesized commalist of column names (much as in a range variable definition—see Chapter 12). However, it shouldn’t be necessary to exercise this option very often if other recommendations in this book are followed.

Here’s an SQL version of the example:

WITH t1 AS ( SELECT SNO AS SA , CITY
             FROM   S ) ,
     t2 AS ( SELECT SNO AS SB , CITY
             FROM   S ) ,
     t3 AS ( SELECT *
             FROM   t1 NATURAL JOIN t2 ) ,
     t4 AS ( SELECT *
             FROM   t3
             WHERE  SA < SB )

SELECT SA , SB
FROM   t4

Aside: It’s worth noting, however, that SQL’s WITH construct is not nearly as useful in practice as its Tutorial D counterpart, because neither the syntactic nor the semantic structure of SQL lends itself readily to the idea of breaking large expressions down into smaller ones. For example, as noted earlier in this chapter, the relational functionality of UNGROUP, EXTEND, RENAME, and projection is all bundled into just one clause in SQL (viz., the SELECT clause portion of an SQL SELECT expression). End of aside.

In closing this section, I should make it clear that WITH isn’t really an operator of the relational algebra as such—it’s just a syntactic device to help with the formulation of complicated expressions (especially ones involving common subexpressions). I’ll be making extensive use of it in the pages ahead.

WHAT DO RELATIONAL EXPRESSIONS MEAN?

Recall now from Chapter 5 that every relvar has a certain relvar predicate, which is, loosely, what the relvar means. For example, the predicate for the suppliers relvar S is:

Supplier SNO is under contract, is named SNAME, has status STATUS, and is located in city CITY.

What I didn’t mention in Chapter 5, however, is that the foregoing notion extends in a natural way to apply to arbitrary relational expressions. For example, consider the projection of suppliers on all attributes but CITY:

S { SNO , SNAME , STATUS }

This expression denotes a relation containing all tuples of the form

TUPLE { SNO s , SNAME n , STATUS t }

such that a tuple of the form

TUPLE { SNO s , SNAME n , STATUS t , CITY c }

currently appears in relvar S for some CITY value c (where c is a value of type CHAR, of course). In other words, the result represents the current extension of a predicate that looks like this:

There exists some city CITY such that supplier SNO is under contract, is named SNAME, has status STATUS, and is located in city CITY.

This predicate can be understood as representing the meaning of the relational expression S{SNO,SNAME,STATUS}. Observe that it has just three parameters and the corresponding relation has just three attributes—CITY isn’t a parameter to that predicate but what logicians call a “bound variable” instead, owing to the fact that it’s “quantified” by the phrase There exists some city (see Chapter 10 for further discussion of bound variables and quantifiers).20 Note: A possibly clearer way of making the same point—viz., that the predicate has just three parameters, not four—is to observe that the predicate in question is logically equivalent to this one:

Supplier SNO is under contract, is named SNAME, has status STATUS, and is located somewhere [in other words, in some city, but we don’t know which].

Remarks analogous to the foregoing apply to every possible relational expression. To be specific: Every relational expression rx always has an associated meaning, or predicate; moreover, the predicate for rx can always be determined from the predicates for the relvars involved in that expression, together with the semantics of the relational operations involved. As an exercise, you might like to revisit some of the relational (or SQL) expressions shown earlier in this chapter, with a view to determining what the corresponding predicate might look like in each case.

EVALUATING SQL TABLE EXPRESSIONS

In addition to natural join, Codd originally defined an operator he called θ-join, where θ denoted any of the usual scalar comparison operators (“=”, “≠”, “<”, and so on). Now, θ-join isn’t primitive; in fact, it’s defined to be a restriction of a product. Here by way of example is the “not equals” join of suppliers and parts on cities (so θ here is “≠”):

( ( S RENAME { CITY AS SCITY } )    SELECT SNO , SNAME , STATUS ,
  TIMES                                    S.CITY AS SCITY , PNO ,
  ( P RENAME { CITY AS PCITY } ) )         PNAME , COLOR , WEIGHT ,
WHERE SCITY ≠ PCITY                        P.CITY AS PCITY
                                    FROM   S , P
                                    WHERE  S.CITY <> P.CITY

Now let’s focus on the SQL formulation specifically. You can think of the expression constituting that formulation as being evaluated in three steps, as follows:

  1. The FROM clause is evaluated, to yield the product of tables S and P. Note: If we were doing this relationally, we would have to rename at least one of the CITY attributes before that product could be computed. SQL gets away with renaming them afterward because its tables have a left to right ordering to their columns, meaning it can distinguish the two CITY columns by their ordinal position. For simplicity, let’s agree to overlook this detail.

  2. Next, the WHERE clause is evaluated, to yield a restriction of that product by eliminating rows in which the two city values are equal. Note: If θ had been “=” instead of “≠” (or “<>”, rather, in SQL), this step would have been: Restrict the product by retaining rows in which the two city values are equal—in which case we would now have formed what’s called the equijoin of suppliers and parts on cities. In other words, an equijoin is a θ-join for which θ is “=”. Exercise: What’s the difference between an equijoin and a natural join?

  3. Finally, the SELECT clause is evaluated, to yield a “projection” of that restriction on the columns specified in the SELECT clause—“projection” in quotes, because it won’t actually eliminate duplicates as true projection does, unless DISTINCT is specified. (Actually it’s doing some renaming as well, in this particular example, and I mentioned earlier in this chapter that SELECT provides other functionality too, in general—but for now I want to overlook these details as well, for simplicity.)

At least to a first approximation, then, the FROM clause corresponds to a product, the WHERE clause to a restriction, and the SELECT clause to a projection; thus, the overall SELECT – FROM – WHERE expression denotes a projection of a restriction of a product. It follows that I’ve just given a loose, but reasonably formal, definition of the semantics of SQL’s SELECT – FROM – WHERE expressions; equivalently, I’ve given a conceptual algorithm for evaluating such expressions. Now, there’s no implication that the implementation has to use exactly that algorithm in order to evaluate such expressions; au contraire, it can use any algorithm it likes, just so long as whatever algorithm it does use is guaranteed to give the same result as the conceptual one. And there are often good reasons—usually performance reasons— for using a different algorithm, thereby (for example) evaluating the clauses in a different order or otherwise rewriting the original query. However, the implementation is free to do such things only if it can be proved that the algorithm it does use is logically equivalent to the conceptual one. Indeed, one way to characterize the job of the optimizer is to find an algorithm that’s guaranteed to be equivalent to the conceptual one but performs better ... which brings us to the next section.

EXPRESSION TRANSFORMATION

In this section, I want to take a slightly closer look at what the optimizer does; more specifically, I want to consider what’s involved in transforming some relational expression into another, logically equivalent, expression. (I mentioned this notion under the discussion of duplicates in Chapter 4, where I explained that such transformations are one of the things the optimizer does. In fact, such transformations constitute one of the two great ideas at the heart of relational optimization. The other, beyond the scope of this book, is the use of “database statistics” to do what’s called cost based optimizing.21 )

I’ll start with a trivial example. Consider the following Tutorial D expression (the query is “Get suppliers who supply part P2, together with the corresponding quantities,” and I’ll ignore the SQL analog for simplicity):

( ( S JOIN SP ) WHERE PNO = 'P2' ) { ALL BUT PNO }

Suppose there are 1,000 suppliers and 1,000,000 shipments, of which 500 are for part P2. If the expression were simply evaluated by brute force, as it were (i.e., without any optimization at all), the sequence of events would be:

  1. Join S and SP: This step involves reading the 1,000 supplier tuples; reading the 1,000,000 shipment tuples 1,000 times each, once for each of the 1,000 suppliers; constructing an intermediate result consisting of 1,000,000 tuples; and writing those 1,000,000 tuples back out to the disk. (I’m assuming here for simplicity that tuples are physically stored as such on the disk, and I’m also assuming I can take “number of tuple reads and writes” as a reasonable measure of performance. Neither of these assumptions is very realistic, but this fact doesn’t materially affect my argument.)

  2. Restrict the result of Step 1: This step involves reading 1,000,000 tuples but produces a result containing only 500 tuples, which I’ll assume can be kept in main memory. (By contrast, in Step 1 I was assuming for the sake of the example—realistically or otherwise— that the 1,000,000 intermediate result tuples required too much space to be kept in main memory.)

  3. Project the result of Step 2: This step involves no tuple reads or writes (i.e., to or from the disk) at all, so we can ignore it.

By contrast, the following procedure is equivalent to the one just described, in the sense that it produces the same final result, but is obviously much more efficient:

  1. Restrict SP to just the tuples for part P2: This step involves reading 1,000,000 shipment tuples but produces a result containing only 500 tuples, which can be kept in main memory.

  2. Join S and the result of Step 1: This step involves reading 1,000 supplier tuples (once only, not once per P2 shipment, because all the P2 shipments are in memory). The result contains 500 tuples (still in main memory).

  3. Project the result of Step 2: Again we can ignore this step.

The first of these two procedures involves a total of 1,002,001,000 tuple reads and writes, whereas the second involves only 1,001,000; thus, it’s clear the second procedure is likely to be over 1,000 times faster than the first. It’s also clear we’d like the implementation to use the second rather than the first! If it does, then what it’s doing (in effect) is transforming the original expression

( S JOIN SP ) WHERE PNO = 'P2'

—I’m ignoring the final projection now, since it isn’t really relevant to the argument—into the expression

S JOIN ( SP WHERE PNO = 'P2' )

These two expressions are logically equivalent, but they have very different performance characteristics, as we’ve seen. If the system is presented with the first expression, therefore, we’d like it to transform it into the second before evaluating it—and of course it can. The point is, the relational algebra, being a high level formalism, is subject to various formal transformation laws; for example, there’s a law that says, loosely, that a join followed by a restriction can always be transformed into a restriction followed by a join (this was the law I was using in the example). And a good optimizer will know those laws, and will apply them— because the performance of a query ideally shouldn’t depend on the specific syntax used to express that query in the first place. Note: Actually it’s an immediate consequence of the fact that not all of the algebraic operators are primitive that certain expressions can be transformed into others (for example, an expression involving intersection can be transformed into one involving join instead), but there’s much more to the issue than that, as I hope is obvious from the example.

Now, there are many possible transformation laws, and this isn’t the place for an exhaustive discussion. But I would at least like to highlight a few important cases and make a few key points. First, the law mentioned in the previous paragraph is actually a special case of a more general law, called the distributive law. In general, the monadic operator f distributes over the dyadic operator g if and only if f(g(a,b)) = g(f(a),f(b)) for all a and b. In ordinary arithmetic, for example, SQRT (nonnegative square root) distributes over multiplication, because

SQRT ( a * b ) = SQRT ( a ) * SQRT ( b )

for all a and b (take f as SQRT and g as “*”); thus, a numeric expression optimizer can always replace either of these expressions by the other when doing numeric expression transformation. As a counterexample, SQRT does not distribute over addition, because the square root of a + b is not equal to the sum of the square roots of a and b, in general.

In relational algebra, restriction distributes over union, intersection, and difference. It also distributes over join, provided the restriction condition consists at its most complex of the AND of two separate restriction conditions, one for each of the two join operands. In the case of the example discussed above, this requirement was satisfied—in fact, the restriction condition was very simple and applied to just one of the operands—and so we were able to use the distributive law to replace the expression by a more efficient equivalent. The net effect was that we were able to “do the restriction early.” Doing restrictions early is almost always a good idea, because it serves, typically, (a) to reduce the number of tuples to be scanned in the next operation in sequence and (b) to reduce the number of tuples in the output from that operation as well.

Here are some other specific cases of the distributive law, this time involving projection. First, projection distributes over union, though not over intersection or difference. Second, it also distributes over join, so long as all of the joining attributes are included in the projection. These laws can be used for “doing projections early,” which again is usually a good idea, for reasons similar to those given above for restrictions.

Two more important general laws are the laws of commutativity and associativity:

  • The dyadic operator g is commutative if and only if g(a,b) = g(b,a) for all a and b. In ordinary arithmetic, for example, addition and multiplication are commutative, but subtraction and division aren’t. Similarly, in relational algebra, union, intersection, and join are commutative, but difference isn’t. So, for example, if a query involves a join of two relations r1 and r2, the commutative law tells us it doesn’t matter which of r1 and r2 is taken as the “outer” relation and which the “inner.” The system is therefore free to choose (say) the smaller relation—i.e., whichever of r1 and r2 has the lower cardinality—as the outer one in computing the join.22

  • The dyadic operator g is associative if and only if g(a,g(b,c)) = g(g(a,b),c) for all a, b, c. In arithmetic, addition and multiplication are associative—e.g., (a + b) + c is equal to a + (b +c)—but subtraction and division aren’t. Similarly, in relational algebra, union, intersection, and join are associative, but difference isn’t. So, for example, if a query involves a join of three relations r1, r2, and r3, the associative and commutative laws taken together tell us we can join the relations pairwise in any order we like.23 The system is thus free to decide which of the various possible sequences is most efficient.

Note, incidentally, that all of these transformations can be performed without any regard for either actual data values or physical access paths (indexes and the like) in the database as physically stored. In other words, such transformations represent optimizations that are virtually guaranteed to be good, regardless of what the database looks like physically. Perhaps I should add, however, that while many such transformations are available for sets, fewer are available for bags (as indeed we saw in some of the exercises in Chapter 4); and fewer still are available if column ordinal position has to be taken into account; and fewer still are available if nulls and 3VL have to be taken into account as well. What do you conclude?

THE RELIANCE ON ATTRIBUTE NAMES

There’s one question that might have been bothering you but hasn’t been addressed in this chapter so far. The operators of the relational algebra, at least as described in this book, all rely heavily on proper attribute naming. For example, the Tutorial D expression R1 JOIN R2— where I’ll suppose, just to be definite, that R1 and R2 are base relvars—is defined to do the join on the basis of those attributes of R1 and R2 that have the same names. But the question often arises: Isn’t this approach rather fragile? For example, what happens if we use SQL’s ALTER TABLE (or something analogous to that operator) to “add a new attribute” to relvar R2, say, that has the same name as one already existing in relvar R1?

Well, first let me clarify one point. It’s true that the operators do rely, considerably, on proper attribute naming. However, they also require attributes of the same name to be of the same type (and hence in fact to be the very same attribute, formally speaking); equivalently, they require attributes of different types to have different names. Thus, for example, an error would occur—at compile time, too, I would hope—if, in the expression R1 JOIN R2, R1 and R2 both had an attribute called A but the two A’s were of different types.24 Note that this requirement (that attributes of different types have different names) imposes no serious functional limitations, thanks to the availability of the attribute RENAME operator.

Now to the substance of the question. In fact, there’s a popular misconception here, and I’m very glad to have this opportunity to dispel it. In today’s SQL systems, application program access to the database is provided either through a call level interface or through an embedded, but conceptually distinct, data sublanguage (“embedded SQL”). But embedded SQL is really just a call level interface with a superficial dusting of syntactic sugar, so the two approaches come to the same thing from the DBMS’s point of view, and indeed from the host language’s point of view as well. In other words, SQL and the host language are typically only loosely coupled in most systems today. As a result, much of the advantage of using a well designed, well structured programming language is lost in today’s database environment. Here’s a pertinent quote:25 “Most programming errors in database applications would show up as type errors [if the database definition were] part of the type structure of the program.”

Now, the fact that the database definition is not “part of the type structure of the program” in today’s systems can be traced back to a fundamental misunderstanding that was prevalent in the database community in the early 1960s or so. The perception at that time was that, in order to achieve data independence (more specifically, logical data independence—see Chapter 9), it was necessary to move the database definition out of the program so that, in principle, that definition could be changed later without changing the program. But that perception was at least partly incorrect. What was, and is, really needed is two separate definitions, one inside the program and one outside; the one inside would represent the programmer’s perception of the database (and would provide the necessary compile time checking on queries, etc.), the one outside would represent the database “as it really is.” Then, if it subsequently becomes necessary to change the definition of the database “as it really is,” logical data independence is preserved by changing the mapping between the two definitions.

Here’s how the mechanism I’ve just described might look in SQL. First let me introduce the notion of a public table, which represents the application’s perception of some portion of the database. For example:

CREATE PUBLIC TABLE X            /* hypothetical syntax! */
   ( SNO   VARCHAR(5)  NOT NULL ,
     SNAME VARCHAR(25) NOT NULL ,
     CITY  VARCHAR(20) NOT NULL ,
     UNIQUE ( SNO ) ) ;

CREATE PUBLIC TABLE Y            /* hypothetical syntax! */
   ( SNO   VARCHAR(5)  NOT NULL ,
     PNO   VARCHAR(6)  NOT NULL ,
     UNIQUE ( SNO , PNO ) ) ,
     FOREIGN KEY ( SNO ) REFERENCES X ( SNO ) ) ;

These definitions effectively assert that “the application believes” there are tables in the suppliers-and-parts database called X and Y, with columns and keys as specified. Such is not the case, of course—but there are database tables called S and SP (with columns and keys as specified for X and Y, respectively, but with one additional column in each case), and we can define mappings as follows:

X image SELECT SNO , SNAME , CITY FROM S ;   /* hypothetical syntax! */

Y image SELECT SNO , PNO FROM SP ;           /* hypothetical syntax! */

These mappings are defined outside the application (the symbol “image” means “is defined as”).

Now consider the SQL expression X NATURAL JOIN Y. Clearly, the join here is being done on the basis of the common column SNO. And if, say, a column SNAME is added to the database table SP, all we have to do is change the mapping—actually no change is required at all, in this particular example!—and everything will continue to work as before; in other words, logical data independence will be preserved.

Unfortunately, today’s SQL products don’t work this way. Thus, for example, the SQL expression S NATURAL JOIN SP is, sadly, subject to exactly the “fragility” problem mentioned in the original question (but then so too is the simpler expression SELECT * FROM S, come to that). However, you can reduce that problem to more manageable proportions by adopting the strategy suggested under the discussion of column naming in Chapter 3. For convenience, I repeat that strategy here:

  • For every base table, define a view identical to that base table except possibly for some column renaming.

  • Make sure the set of views so defined abides by the naming discipline described in that same discussion (i.e., of column naming) in Chapter 3.

  • Operate in terms of those views instead of the underlying base tables.

Now, if the base tables do change subsequently, all you’ll have to do is change the view definitions accordingly.

EXERCISES

6.1 What if anything is wrong with the following SQL expressions or would-be expressions (from a relational perspective or otherwise)?

  1. SELECT * FROM S , SP

  2. SELECT SNO , CITY FROM S

  3. SELECT SNO , PNO , 2 * QTY FROM SP

  4. SELECT S.SNO FROM S , SP

  5. SELECT S.SNO , S.CITY FROM S NATURAL JOIN P

  6. SELECT CITY FROM S UNION SELECT CITY FROM P

  7. SELECT S.* FROM S NATURAL JOIN SP

  8. SELECT * FROM S JOIN SP ON S.SNO = SP.SNO

  9. SELECT * FROM ( S NATURAL JOIN P ) AS temp

  10. SELECT * FROM S CROSS JOIN SP CROSS JOIN P

6.2 Closure is important in the relational algebra for the same kind of reason that numeric closure is important in ordinary arithmetic. In arithmetic, however, there’s one situation where the closure property breaks down, in a sense—namely, division by zero. Is there any analogous situation in the relational algebra?

6.3 Given the usual suppliers-and-parts database, what’s the value of the Tutorial D expression JOIN {S,SP,P}? What’s the corresponding predicate? And how would you express this join in SQL?

6.4 Why do you think the project operator is so called?

6.5 For each of the following Tutorial D expressions on the suppliers-and-parts database, give both (a) an SQL analog and (b) an informal interpretation of the expression (i.e., a corresponding predicate) in natural language. Also show the result of evaluating the expressions, given our usual sample values for relvars S, P, and SP.

  1. ( S JOIN ( SP WHERE PNO = 'P2' ) ) { CITY }

  2. ( P { PNO } MINUS ( SP WHERE SNO = 'S2' ) { PNO } ) JOIN P

  3. S { CITY } MINUS P { CITY }

  4. ( S { SNO , CITY } JOIN P { PNO , CITY } ) { SNO , PNO }

  5. JOIN { ( S RENAME { CITY AS SC } ) { SC } ,
           ( P RENAME { CITY AS PC } ) { PC } }

6.6 Union, intersection, product, and join are all both commutative and associative. Verify these claims. Are they valid in SQL?

6.7 Which if any of the relational algebra operators described in this chapter have a definition that doesn’t rely on tuple equality?

6.8 The SQL FROM clause FROM t1, t2, ..., tn (where each ti denotes a table) returns the product of its arguments. But what if n = 1?—what’s the product of just one table? And by the way, what’s the product of t1 and t2 if t1 and t2 both contain duplicate rows?

6.9 Write Tutorial D and/or SQL expressions for the following queries on the suppliers-and-parts database:

  1. Get all shipments.

  2. Get supplier numbers for suppliers who supply part P1.

  3. Get suppliers with status in the range 15 to 25 inclusive.

  4. Get part numbers for parts supplied by a supplier in London.

  5. Get part numbers for parts not supplied by any supplier in London.

  6. Get all pairs of part numbers such that some supplier supplies both of the indicated parts.

  7. Get supplier numbers for suppliers with a status lower than that of supplier S1.

  8. Get part numbers for parts supplied by all suppliers in London.

  9. Get (SNO,PNO) pairs such that the indicated supplier does not supply the indicated part.

  10. Get suppliers who supply at least all parts supplied by supplier S2.

6.10 Prove the following statements (making them more precise where necessary):

  1. A sequence of restrictions of a given relation can be transformed into a single restriction.

  2. A sequence of projections of a given relation can be transformed into a single projection.

  3. A restriction of a projection can be transformed into a projection of a restriction.

6.11 Union is said to be idempotent, because r UNION r is identically equal to r for all r. (Is this true in SQL?) As you might expect, idempotence can be useful in expression transformation. Which other relational algebra operators, if any, are idempotent?

6.12 Let r be a relation. What does the Tutorial D expression r{ } mean (i.e., what’s the corresponding predicate)? What does it return? Also, what does the Tutorial D expression r{ALL BUT} mean, and what does it return?

6.13 The boolean expression x > y AND y > 3 (which might be part of a query) is equivalent to, and can therefore be transformed into, the boolean expression x > y AND y > 3 AND x > 3. (The equivalence is based on the fact that the comparison operator “>” is transitive—i.e., x > y and y > z together imply x > z.) Note that the transformation is certainly worth making if x and y are from different relations, because it enables the system to perform an additional restriction (using x > 3) before doing the greater-than join implied by x > y. As we saw in the body of the chapter, doing restrictions early is generally a good idea; having the system infer additional “early” restrictions, as here, is also a good idea. Do you know of any SQL product that actually performs this kind of optimization?

6.14 Consider the following Tutorial D expression:

WITH ( pp := P WHERE COLOR = 'Purple' ,
       tx := SP RENAME { SNO AS X } ) :
S WHERE ( tx WHERE X = SNO ) { PNO } ⊇ pp { PNO }

What does this expression mean? Given our usual sample data values, show the result returned. Does that result accord with your intuitive understanding of what the expression means? Justify your answer.

6.15 SQL has no direct counterpart to either D_UNION or I_MINUS. How best might the D_UNION and I_MINUS examples from the body of the chapter—i.e., S{CITY} D_UNION P{CITY} and S{CITY} I_MINUS P{CITY}—be simulated in SQL?

6.16 What do you understand by the term joinable? How could the definition of the term be extended to cover the case of n relations for arbitrary n (instead of just n = 2, which was the case discussed in the body of the chapter)?

6.17 What exactly is it that makes it possible to define n-adic versions of JOIN and UNION (and D_UNION)? Does SQL have anything analogous? Why doesn’t an n-adic version of MINUS (or I_MINUS) make sense?

6.18 I claimed earlier in the book that TABLE_DEE meant TRUE and TABLE_DUM meant FALSE. Substantiate and/or elaborate on these claims.

6.19 What exactly does the following SQL expression return?

SELECT DISTINCT S.*
FROM   S , P

Warning: There’s a trap here.

ANSWERS

First of all, here are answers to a couple of exercises that were stated inline in the body of the chapter:

  • The first asked what the difference was, given our usual sample data, between the expressions P JOIN (S{CITY}) and (P JOIN S){CITY}. Answer: The first yields full part details (PNO, PNAME, COLOR, WEIGHT, and CITY) for parts in the same city as at least one supplier, the second yields just the CITY values for those same parts (speaking a trifle loosely in both cases).

  • The second exercise asked what the difference was between an equijoin and a natural join. Answer: Let the relations to be joined be r1 and r2, and assume for simplicity that r1 and r2 have just one common attribute, A. Before we can perform the equijoin, then, we need to do some renaming. For definiteness, suppose we apply the renaming to r2, to yield r3 = r2 RENAME {A AS B}. Then the equijoin is defined to be equal to (r1 TIMES r3) WHERE A = B. Note in particular that A and B are both attributes of the result, and every tuple in that result will have the same value for those two attributes. Projecting attribute B away from that result yields the natural join r1 JOIN r2.

6.1 a. The result has duplicate column names (as well as left to right column ordering). b. The result has left to right column ordering. c. The result has an unnamed column (as well as left to right column ordering). d. The result has duplicate rows (even though the SELECT clause explicitly specifies S.SNO, not SP.SNO, and SNO values are unique in table S). e. Compile time error: S NATURAL JOIN P has no column called S.CITY.26 f. Nothing wrong (though it would be nice if CORRESPONDING were specified). g. The result has duplicate rows and left to right column ordering; it also has no SNO column, a fact that might come as a surprise.27 h. The result has duplicate column names (as well as left to right column ordering). i. Compile time (syntax) error: AS specification not allowed (because the expression “(S NATURAL JOIN P)” is neither a table name nor a table subquery—see Chapter 12). j. The result has duplicate column names (as well as left to right column ordering).

6.2 No! In particular, certain relational divides that—by analogy—you might intuitively expect to fail in fact don’t. Here are some examples (which might not make much sense until you’ve read the relevant section of Chapter 7):

  • Let relation z be of type RELATION {PNO CHAR} and let its body be empty. Then the expression

    SP { SNO , PNO } DIVIDEBY z { PNO }

    reduces to the projection SP{SNO} of SP on SNO.

  • Let z be either TABLE_DEE or TABLE_DUM. Then the expression

    r DIVIDEBY z

    reduces to r JOIN z. In other words, if z is TABLE_DEE, the result is just r; if z is TABLE_DUM, the result is the empty relation of the same type as r.

  • Let relations r and s be of the same type. Then the expression

    r DIVIDEBY s

    gives TABLE_DEE if r is nonempty and every tuple of s appears in r, TABLE_DUM otherwise.

  • Finally, r DIVIDEBY r gives TABLE_DUM if r is empty, TABLE_DEE otherwise.

6.3 First of all, observe that S, SP, and P are indeed “3-way joinable,” as is required for the expression JOIN {S,SP,P} to be well formed (see the answer to Exercise 6.16 below). The joining attributes are SNO, PNO, and CITY (each of which is a common attribute for exactly two of the relations to be joined, as it happens). The result predicate is: Supplier SNO is under contract, is named SNAME, has status STATUS, and is located in city CITY; part PNO is used in the enterprise, is named PNAME, has color COLOR and weight WEIGHT, and is stored in city CITY; and supplier SNO supplies part PNO in quantity QTY. Note that both appearances of SNO in this predicate refer to the same parameter, as do both appearances of PNO and both appearances of CITY. Given our usual sample values, the result looks like this:

image

The simplest SQL formulation is just

S NATURAL JOIN SP NATURAL JOIN P

(though it might be necessary, depending on context, to prefix this expression with “SELECT * FROM”—see Chapter 12).

6.4 In two-dimensional analytic or cartesian geometry, the points (x,0) and (0,y) are the projections of the point (x,y) on the X axis and the Y axis, respectively; equivalently, (x) and (y) are the projections into certain one-dimensional spaces of the point (x,y) in two-dimensional space. These notions are readily generalizable to n dimensions (recall from Chapter 3 that relations are indeed n-dimensional).

6.5 Throughout these answers, I show SQL expressions that aren’t necessarily direct transliterations of their Tutorial D (i.e., algebraic) counterparts but are, rather, “more natural” formulations of the query in SQL terms.

  1. SQL analog:

    SELECT DISTINCT CITY
    FROM   S NATURAL JOIN SP
    WHERE  PNO = 'P2'

    Predicate: City CITY is such that some supplier who supplies part P2 is located there.

    ┌────────┐
    CITY   
    ├════════┤
    London
    Paris  
    └────────┘

  2. SQL analog:

    SELECT *
    FROM   P
    WHERE  PNO NOT IN
         ( SELECT PNO
           FROM   SP
           WHERE  SNO = 'S2' )

    Predicate: Part PNO is used in the enterprise, is named PNAME, has color COLOR and weight WEIGHT, is stored in city CITY, and isn’t supplied by supplier S2.

    ┌─────┬───────┬───────┬────────┬────────┐
    PNO PNAME COLOR WEIGHT CITY   
    ├═════┼───────┼───────┼────────┼────────┤
    P3   Screw Blue     17.0 Oslo   
    P4   Screw Red      14.0 London
    P5   Cam    Blue     12.0 Paris  
    P6   Cog    Red      19.0 London
    └─────┴───────┴───────┴────────┴────────┘

  3. SQL analog:

    SELECT CITY
    FROM   S
    EXCEPT CORRESPONDING
    SELECT CITY
    FROM   P

    Predicate: City CITY is such that some supplier is located there but no part is stored there.

    ┌────────┐
    CITY   
    ├════════┤
    Athens
    └────────┘

  4. SQL analog:

    SELECT SNO , PNO
    FROM   S NATURAL JOIN P

    Note: There’s no need to do the preliminary projections (of S on {SNO,CITY} and P on {PNO,CITY}) in the Tutorial D version, either. Do you think the optimizer might ignore them?

    Predicate: Supplier SNO and part PNO are colocated.

    ┌─────┬─────┐
    SNO PNO
    ├═════┼═════┤
    S1   P1  
    S1   P4  
    S1   P6  
    S2   P2  
    S2   P5  
    S3   P2  
    S3   P5  
    S4   P1  
    S4   P4  
    S4   P6  
    └─────┴─────┘

  5. SQL analog:

    SELECT S.CITY AS SC , P.CITY AS PC
    FROM   S , P

    Predicate: Some supplier is located in city SC and some part is stored in city PC.

    ┌────────┬────────┐
    SC      PC     
    ├════════┼════════┤
    London London
    London Paris  
    London Oslo   
    Paris   London
    Paris   Paris  
    Paris   Oslo   
    Athens London
    Athens Paris  
    Athens Oslo   
    └────────┴────────┘

6.6 Intersection and product are both special cases of join, so we can ignore them here. The fact that union and join are commutative is immediate from the fact that the definitions are symmetric in the two relations concerned. I now show that union is associative. Let t be a tuple. Using “≡” to stand for “if and only if” (or “is equivalent to”) and “” to stand for “appears in” (as usual in both cases), we have:

t ( r UNION ( s UNION u ) ) ≡ t r OR t ( s UNION u )
                              ≡ t r OR ( t s OR t u )
                              ≡ ( t r OR t s ) OR t u
                              ≡ t ( r UNION s ) OR t u
                              ≡ t ( ( r UNION s ) UNION u )

Note the appeal in the third line to the associativity of OR. The proof that join is associative is analogous.

As for SQL, well, let’s first of all ignore nulls and duplicate rows (what happens if we don’t?). Then:

  • SELECT A, B FROM T1 UNION CORRESPONDING SELECT B, A FROM T2 and SELECT B, A FROM T2 UNION CORRESPONDING SELECT A, B FROM T1 aren’t equivalent, because they produce results with different left to right column orderings. Thus, union in general isn’t commutative in SQL (and the same goes for intersection).

  • T1 JOIN T2 and T2 JOIN T1 aren’t equivalent (in general), because they produce results with different left to right column orderings. Thus, join in general isn’t commutative in SQL (and the same goes for product).

The operators are, however, all associative.

6.7 RENAME is the only one—and even that one’s debatable! See the answer to Exercise 7.3 in the next chapter.

6.8 The product of a single table t is defined to be just t. But the question of what the product of t1 and t2 is if t1 and t2 both contain duplicate rows is a tricky one! See the answer to Exercise 4.4 in Chapter 4 for further discussion.

6.9 Tutorial D on the left, SQL on the right, as usual (the solutions aren’t unique, in general; note too that the Tutorial D solutions in particular could often be improved by using operators to be described in Chapter 7):

  1. SP                              SELECT * FROM SP
                                    or
                                    TABLE SP /* see Chapter 12 */

  2. ( SP WHERE PNO = 'P1' )         SELECT SNO
                      { SNO }       FROM   SP
                                    WHERE  PNO = 'P1'

  3. S WHERE STATUS ≥ 15             SELECT *
        AND STATUS ≤ 25             FROM   S
                                    WHERE  STATUS BETWEEN 15 AND 25

  4. ( ( S JOIN SP )                 SELECT DISTINCT PNO
      WHERE CITY = 'London' )       FROM   SP , S
                       { PNO }      WHERE  SP.SNO = S.SNO
                                    AND    S.CITY = 'London'

  5. P { PNO } MINUS ( ( S JOIN P )         SELECT PNO
      WHERE CITY = 'London' ) { PNO }      FROM   P
                                           EXCEPT CORRESPONDING
                                           SELECT PNO
                                           FROM   SP , S
                                           WHERE  SP.SNO = S.SNO
                                           AND    S.CITY = 'London'

  6. WITH ( z := SP { SNO , PNO } ) :     SELECT DISTINCT XX.PNO AS X ,
    ( ( z RENAME { PNO AS X } )                          YY.PNO AS Y
       JOIN                              FROM   SP AS XX , SP AS YY
       ( z RENAME { PNO AS Y } ) )       WHERE  XX.SNO = YY.SNO
    { X , Y }                           

  7. ( S WHERE STATUS <                     SELECT SNO
      STATUS FROM ( TUPLE FROM             FROM   S
         ( S WHERE SNO = 'S1' ) ) )        WHERE  STATUS <
                          S { SNO }             ( SELECT STATUS
                                                  FROM   S
                                                  WHERE  SNO = 'S1' )

    Note: The expression STATUS FROM (TUPLE FROM ...) in the Tutorial D solution here extracts the STATUS value from the single tuple in the relation that’s the TUPLE FROM argument (that relation must have cardinality one). By contrast, the SQL solution effectively does a double coercion: First, it coerces a table of one row to that row; second, it coerces that row to the single scalar value it contains.

  8. WITH ( tx := S WHERE                   SELECT PNO
                    CITY = 'London' ,      FROM   P
           ty := SP RENAME                 WHERE  NOT EXISTS (
                    { PNO AS Y } ) :       SELECT * FROM S
    ( ( P WHERE                            WHERE  CITY = 'London'
      ( ty WHERE Y = PNO ) ) { SNO }       AND    NOT EXISTS (
                        ⊇ tx { SNO } )     SELECT *
                           { PNO }         FROM   SP
                                           WHERE  SP.SNO = S.SNO
                                           AND    SP.PNO = P.PNO ) )

    Note the use of a relational comparison in the Tutorial D expression here. The SQL version uses EXISTS (see Chapter 10). A more elegant Tutorial D solution can be found as the answer to Exercise 7.9e in Chapter 7.

  9. ( S { SNO } JOIN P { PNO } )           SELECT SNO , PNO
                MINUS SP { SNO , PNO }     FROM   S , P
                                           EXCEPT CORRESPONDING
                                           SELECT SNO , PNO
                                           FROM   SP

  10. WITH ( tx := SP WHERE                   SELECT SNO
                    SNO = 'S2' ,            FROM   S
           ty := SP RENAME                  WHERE  NOT EXISTS (
                 { SNO AS Y } ) :           SELECT *
    S WHERE ( ty WHERE Y = SNO )            FROM   SP AS SPX
            { PNO } ⊇ tx { PNO }            WHERE  SNO = 'S2'
                                            AND    NOT EXISTS (
                                            SELECT *
                                            FROM   SP AS SPY
                                            WHERE  SPY.SNO = S.SNO
                                            AND    SPY.PNO = SPX.PNO ) )

    A more elegant Tutorial D solution can be found as the answer to Exercise 7.9f in Chapter 7.

6.10 It’s intuitively obvious that all three statements are true. No further answer provided.

6.11 Union isn’t idempotent in SQL, because the expression SELECT * FROM T UNION CORRESPONDING SELECT * FROM T isn’t identically equal to SELECT * FROM T.28 That’s because if T contains any duplicates, they’ll be eliminated from the result of the union. (And what happens if T contains any nulls? Good question!)

Join and intersection are also idempotent in the relational model but not in SQL, “thanks” again to duplicates and nulls). Note, however, that cartesian product is not idempotent, in general; in fact, the expression r TIMES r fails on a syntax error, except in the very special case where the heading of r is empty.

6.12 As explained in the body of the chapter, the expression r{ } denotes the projection of r on no attributes; it returns TABLE_DUM if r is empty and TABLE_DEE otherwise. The answer to the question “What’s the corresponding predicate?” depends on what the predicate for r is. For example, the predicate for SP{ } is (a trifle loosely): There exists a supplier SNO, there exists a part PNO, and there exists a quantity QTY such that supplier number SNO supplies part PNO in quantity QTY. Note that this predicate is in fact a proposition; if SP is empty (in which case SP{ } is TABLE_DUM) it evaluates to FALSE, otherwise (in which case SP{ } is TABLE_DEE) it evaluates to TRUE.

The expression r{ALL BUT} denotes the projection of r on all bar none of its attributes (in other words, it denotes the identity projection of r); it returns r. The corresponding predicate is identical to that for r.

6.13 So far as I know, DB2 and Ingres both perform this kind of optimization (DB2 refers to it as “predicate transitive closure”). Other products might do so too.

6.14 The expression means “Get suppliers who supply all purple parts.” Of course, the point is that (given our usual sample data values) there aren’t any purple parts. The expression correctly returns a relation identical to the current value of relvar S (i.e., all five suppliers, loosely speaking). For further explanation—in particular, for justification of the fact that this is indeed the correct answer—see Chapter 11.

6.15 For S{CITY} D_UNION P{CITY}, a rough equivalent in SQL might look like this:

SELECT CITY
FROM ( SELECT CITY FROM S
       UNION  CORRESPONDING
       SELECT CITY FROM P ) AS temp
WHERE  NOT EXISTS
     ( SELECT    CITY FROM S
       INTERSECT CORRESPONDING
       SELECT    CITY FROM P )

This SQL expression isn’t precisely equivalent to the original, however. To be specific, if supplier cities and part cities aren’t disjoint, then the SQL expression won’t fail at run time but will simply return an empty result. Note: In case you’re wondering about that AS temp specification, it’s there because it’s required—on the subquery in the FROM clause but not on the subquery in the WHERE clause (where in fact it would be illegal!)—for reasons explained in Chapter 7.

Turning now to S{CITY} I_MINUS P{CITY}, a rough equivalent in SQL might look like this:

SELECT CITY
FROM ( SELECT CITY FROM S
       EXCEPT CORRESPONDING
       SELECT CITY FROM P ) AS temp
WHERE NOT EXISTS
     ( SELECT CITY FROM P
       EXCEPT CORRESPONDING
       SELECT CITY FROM S )

Again, however, this SQL expression isn’t precisely equivalent to the original, however. To be specific, if part cities aren’t a subset of supplier cities, then the SQL expression won’t fail at run time but will simply return an empty result.

6.16 Relations r1 and r2 are joinable if and only if attributes with the same name are of the same type (equivalently, if and only if the set theory union of their headings is a legal heading). That’s the dyadic case. Extending the definition to the n-adic case is easy: Relations r1, r2, ..., rn (n > 0) are joinable—sometimes n-way joinable, for emphasis—if and only if, for all i, j (1 ≤ in, 1 ≤ jn), relations ri and rj are joinable. Note: It’s worth pointing out that dyadic joinability isn’t transitive; that is, just because (a) r1 and r2 are joinable and (b) r2 and r3 are joinable, it doesn’t necessarily follow that (c) r1 and r3 are joinable. Development of an example to illustrate this point is left as a subsidiary exercise.

6.17 It’s possible to define n-adic versions of JOIN, UNION, and D_UNION because the operators (a) are all both commutative and associative and (b) all have a corresponding identity value.

SQL does effectively support (analogs of) n-adic join and union, though not for n < 2. For join, the syntax is:29

[ SELECT * FROM ] t1 NATURAL JOIN t2
                     NATURAL JOIN t3
                       ...........
                     NATURAL JOIN tn

For union, the syntax is:

SELECT * FROM t1 UNION CORRESPONDING SELECT * FROM t2
                 UNION CORRESPONDING SELECT * FROM t3
                   ................................
                 UNION CORRESPONDING SELECT * FROM tn

An n-adic version of MINUS or I_MINUS makes no sense because MINUS and I_MINUS are neither commutative nor associative, nor do they have a corresponding identity value.

6.18 For a brief justification, see the answer to Exercise 6.12 above. A longer one follows. Consider the projection S{SNO} of (the relation that’s current value of) the suppliers relvar S on {SNO}. Let’s refer to the result of this projection as r; given our usual sample data values, r contains five tuples. Now consider the projection of that relation r on the empty set of attributes, r{ }. As we saw in the answer to Exercise 3.16 in Chapter 3, projecting any tuple on no attributes at all yields an empty tuple; thus, every tuple in r produces an empty tuple when r is projected on no attributes. But all empty tuples are duplicates of one another; thus, projecting the 5-tuple relation r on no attributes yields a relation with no attributes and one (empty) tuple, or in other words TABLE_DEE.

Now recall that every relvar has an associated predicate. For relvar S, that predicate looks like this:

Supplier SNO is under contract, is named SNAME, has status STATUS, and is located in city CITY.

For the projection r = S{SNO}, it looks like this:

There exists some name SNAME, there exists some status STATUS, and there exists some city CITY such that supplier SNO is under contract, is named SNAME, has status STATUS, and is located in city CITY.

And for the projection r{ }, it looks like this:

There exists some supplier number SNO, there exists some name SNAME, there exists some status STATUS, and there exists some city CITY such that supplier SNO is under contract, is named SNAME, has status STATUS, and is located in city CITY.

Observe now that this last predicate is in fact a proposition: It evaluates to TRUE or FALSE, unequivocally. In the case at hand, r{ } is TABLE_DEE and the predicate (proposition) evaluates to TRUE. But suppose no suppliers at all were represented in the database at this time. Then S{SNO} would yield an empty relation r, r{ } would be TABLE_DUM, and the predicate (proposition) in question would evaluate to FALSE.

6.19 The expression returns the current value of table S—unless table P is currently empty, in which case it returns an empty result.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset