SQL and Relational Theory

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 7

SQL and Relational Algebra II: Additional Operators

Algebra is the part of advanced mathematics that is not calculus.

—John Derbyshire: Unknown Quantity: A Real and Imaginary History of Algebra (2006)

As I’ve said several times already, a relational algebra operator is an operator that (a) takes one or more relations—or possibly no relations at all, in the case of the n-adic versions of operators such as join (see Chapter 6)—as input and (b) produces another relation as output. As I observed in Chapter 1, however, any number of operators can be defined that conform to this simple characterization. The previous chapter described the original operators (join, project, etc.); by contrast, the present chapter describes some of the many additional operators that have been defined since the relational model was first invented. It also considers how those operators might best be realized in SQL.

Note: By its nature, this chapter is necessarily something of a miscellany. Thus, you might just want to skim it lightly on a first pass, and come back to it later if you need to gain a deeper understanding of one or more of the topics discussed. Perhaps it would help to say up front that, from a practical point of view at least, the most important topics are probably these:

Semijoin and semidifference (MATCHING and NOT MATCHING)
EXTEND
Image relations
Aggregate operators

But I’ll begin with a brief discussion of exclusive union.

EXCLUSIVE UNION

In set theory, union is inclusive; that is, given sets s1 and s2, an element appears in their union if and only if it appears in either or both of s1 or s2. Thus, the UNION operator can be seen as the set theory counterpart to logical OR, which is inclusive in a similar sense. But logic additionally defines an exclusive version of OR (XOR), and so we can define an exclusive union operator analogously: The exclusive union (XUNION) of two sets s1 and s2 is the set of elements appearing in s1 or s2 but not both. And, of course, we can define a relational version of this operator as well:

Definition: Let relations r1 and r2 be of the same type T. Then (and only then) the expression r1 XUNION r2 denotes the exclusive union of r1 and r2, and it returns the relation of type T with body the set of all tuples t such that t appears in exactly one of r1 and r2.

For example (assuming as we did in Chapter 6, in the section “Union, Intersection, and Difference,” that parts have an extra attribute called STATUS, of type INTEGER):

P { STATUS , CITY } XUNION     │     SELECT STATUS , CITY
S { CITY , STATUS }            │     FROM   P
                               │     WHERE ( STATUS , CITY ) NOT IN
                               │           ( SELECT STATUS , CITY
                               │             FROM   S )
                               │     UNION CORRESPONDING
                               │     SELECT CITY , STATUS
                               │     FROM   S
                               │     WHERE ( CITY , STATUS ) NOT IN
                               │           ( SELECT CITY , STATUS
                               │             FROM   P )

Tutorial D also supports an n-adic form of XUNION. However, the details are a little tricky; for that reason, I’ll just give a definition here, without further discussion. You can find more details, if you’re interested, in the paper “N-adic vs. Dyadic Operators: An Investigation” (see Appendix G).

Definition: Let relations r1, r2, ..., rn (n ≥ 0) all be of the same type T. Then (and only then) the expression XUNION {r1,r2,...,rn}denotes the exclusive union of r1, r2, ..., rn, and it returns the relation of type T with body the set of all tuples t such that t appears in exactly m of r1, r2, ..., rn, where m is odd (and possibly different for different tuples t).

The exclusive union of a single relation r is just r. The exclusive union of no relations at all is the empty relation of the pertinent type—but that type needs to be specified explicitly, since there aren’t any relational expressions from which the type can be inferred. Thus, for example, the expression

XUNION { SNO CHAR , STATUS INTEGER } { }

denotes the empty relation of type RELATION {SNO CHAR, STATUS INTEGER}.

Note: In set theory, exclusive union is more usually known as symmetric difference. Thus, the keyword XMINUS might be acceptable as an alternative to XUNION.

SEMIJOIN AND SEMIDIFFERENCE

Join is one of the most familiar of all of the relational operators. In practice, however, it turns out that queries that require the join operator at all often really require a variation on that operator called semijoin. (You might not have heard of semijoin before, but in fact it’s quite important.) Here’s the definition:

Definition: Let relations r1 and r2 be joinable,¹ and let r1 have attributes called A1, A2, ..., An (and no others). Then (and only then) the expression r1 MATCHING r2 denotes the semijoin of r1 with r2 (in that order), and it returns the relation denoted by the expression (r1 JOIN r2) {A1,A2, ...,An}.

In other words, r1 MATCHING r2 is the join of r1 and r2, projected back on the attributes of r1 (and so the heading of the result is the same as that of r1). Here’s an example (“Get suppliers who currently supply at least one part”):

S MATCHING SP     │    SELECT S.* FROM S
                  │    WHERE  SNO IN
                  │         ( SELECT SNO FROM SP )

Note that the expressions r1 MATCHING r2 and r2 MATCHING r1 aren’t equivalent, in general—the first returns some subset of r1, the second returns some subset of r2. Note too that we could replace IN by MATCH in the SQL version; interestingly, however, we can’t replace NOT IN by NOT MATCH in the semidifference analog (see below), because there’s no “NOT MATCH” operator in SQL.

Turning now to semidifference:² If semijoin is in some ways more important than join, a similar remark applies here also, but with even more force—in practice, most queries that require difference at all really require semidifference. Here’s the definition:

Definition: Let relations r1 and r2 be joinable. Then (and only then) the expression r1 NOT MATCHING r2 denotes the semidifference between r1 and r2 (in that order), and it returns the relation denoted by the expression r1 MINUS (r1 MATCHING r2).

Here’s an example (“Get suppliers who currently supply no parts at all”):

S NOT MATCHING SP               │    SELECT S.* FROM S
                                │    WHERE  SNO NOT IN
                                │         ( SELECT SNO FROM SP )

As with MATCHING, the heading of the result is the same as that of r1. Note: If r1 and r2 are of the same type, r1 NOT MATCHING r2 degenerates to r1 MINUS r2; in other words, difference (MINUS) is a special case of semidifference, relationally speaking. By contrast, join isn’t a special case of semijoin—they’re really different operators, though it’s true that (loosely speaking) some joins are semijoins and some semijoins are joins. See Exercise 7.19 at the end of the chapter.

EXTEND

You might have noticed that the algebra as I’ve described it so far in this book doesn’t have any conventional computational capabilities. Now, SQL does; certainly we can write queries in SQL of the form SELECT A + B AS C ..., for example. However, as soon as we write that “+” sign, we’ve gone beyond the bounds of the algebra as originally defined. So we need to add something to the algebra in order to provide this kind of functionality, and EXTEND is what we need. By way of example, suppose part weights (in relvar P) are given in pounds, and we want to see those weights in grams. There are 454 grams to a pound, and so we can write:

EXTEND P : │ SELECT P.* , WEIGHT * 454 AS GMWT
{ GMWT := WEIGHT * 454 } │ FROM P

Given our usual sample values, the result looks like this:

┌─────┬───────┬───────┬────────┬────────┬────────┐
│ PNO │ PNAME │ COLOR │ WEIGHT │ CITY   │ GMWT   │
├═════┼───────┼───────┼────────┼────────┼────────┤
│ P1  │ Nut   │ Red   │   12.0 │ London │ 5448.0 │
│ P2  │ Bolt  │ Green │   17.0 │ Paris  │ 7718.0 │
│ P3  │ Screw │ Blue  │   17.0 │ Oslo   │ 7718.0 │
│ P4  │ Screw │ Red   │   14.0 │ London │ 6356.0 │
│ P5  │ Cam   │ Blue  │   12.0 │ Paris  │ 5448.0 │
│ P6  │ Cog   │ Red   │   19.0 │ London │ 8626.0 │
└─────┴───────┴───────┴────────┴────────┴────────┘

Important: Relvar P is not changed in the database! EXTEND is not like SQL’s ALTER TABLE; the EXTEND expression is just an expression, and like any expression it simply denotes a value. In particular, therefore, it can be nested inside other expressions. Here’s an example (the query is “Get part number and gram weight for parts with gram weight greater than 7000 grams”):

( ( EXTEND P :                   │   SELECT PNO ,
    { GMWT := WEIGHT * 454 } )   │          WEIGHT * 454 AS GMWT
         WHERE GMWT > 7000.0 )   │   FROM   P
                { PNO , GMWT }   │   WHERE  WEIGHT * 454 > 7000.0

As you can see, there’s an interesting difference between the Tutorial D and SQL versions of this example. To be specific, the (sub)expression WEIGHT * 454 appears once in the Tutorial D version but twice in the SQL version. In the SQL version, therefore, we have to hope the implementation will be smart enough to realize that it can evaluate that subexpression just once per tuple (or once per row, rather) instead of twice.

The problem that this example illustrates is that SQL’s SELECT - FROM - WHERE template is too rigid. What we need to do, as the Tutorial D formulation makes clear, is take a restriction of an extension (and then take a projection of that restriction); in SQL terms, in other words, we need to apply the WHERE clause to the result of the SELECT clause, as it were. But the SELECT - FROM - WHERE template forces the WHERE clause to apply to the result of the FROM clause, not the SELECT clause (see the section “Evaluating SQL Table Expressions” in Chapter 6). To put it another way: In many respects, it’s the whole point of the algebra that (thanks to closure) relational operations can be combined and nested in arbitrary ways; but SQL’s SELECT - FROM - WHERE template effectively means that queries must be expressed as a product, followed by a restrict, followed by some combination of project and/or extend and/or rename³—and many queries just don’t fit this pattern.

Incidentally, you might be wondering why I didn’t formulate the SQL version like this:

SELECT PNO , WEIGHT * 454 AS GMWT
FROM P
WHERE GMWT > 7000.0

(The change is in the last line.) The reason is that GMWT is the name of a column of the final result; table P has no such column, the WHERE clause thus makes no sense, and the expression fails on a syntax error at compile time.

Actually, SQL does allow the query under discussion to be formulated in a style that’s a little closer to that of Tutorial D (and now I’ll show all of the otherwise implicit dot qualifications explicitly, for clarity):

SELECT temp.PNO , temp.GMWT
FROM ( SELECT PNO , WEIGHT * 454 AS GMWT
FROM P ) AS temp
WHERE temp.GMWT > 7000.0

But this functionality—namely, the ability to allow nested subqueries in the FROM clause— wasn’t part of SQL as originally defined (it was introduced into the standard with SQL:1992). Note too that this kind of formulation inevitably leads to a need to reference certain variables (temp, in the example) before they’re defined—quite possibly a long way before they’re defined, in fact, in real world SQL queries.

Note: I need to say a little more about the FROM clause in the foregoing example. As you can see, it takes the form

FROM ( ... ) AS temp

Formally speaking, it’s the parenthesized expression within this FROM clause that constitutes the nested subquery (see Chapter 12). And—here comes the point—SQL has a syntax rule to the effect that a nested subquery in the FROM clause must be accompanied by an explicit AS specification that defines a name for the table denoted by that subquery,⁴ even if that name is never explicitly referenced elsewhere in the overall expression. In fact, in the example at hand, we could omit all of the explicit references to the name temp (i.e., all of the explicit “temp.” dot qualifications) if we wanted to, thus:

SELECT PNO , GMWT
FROM ( SELECT PNO , WEIGHT * 454 AS GMWT
FROM P ) AS temp
WHERE GMWT > 7000.0

But that AS temp specification is still needed nonetheless.

I’ll close this section with a formal definition of the EXTEND operator:

Definition: Let relation r not have an attribute called A. Then (and only then) the expression EXTEND r: {A := exp} denotes an extension of r, and it returns the relation with heading the heading of r extended with attribute A and body the set of all tuples t such that t is a tuple of r extended with a value for A that’s computed by evaluating the expression exp on that tuple of r.

Observe that the result has cardinality equal to that of r and degree equal to that of r plus one. The type of A in that result is the type of exp.

IMAGE RELATIONS

Image relations are a hugely useful abstraction in general, and it’s a little surprising that so few people in the database community seem to be aware of them (though to be fair they haven’t been very widely discussed in the literature). However, I need to alert you up front to the fact that some people do seem to find them a little difficult to come to grips with; I don’t really know why this should be—the idea is actually quite straightforward—but as I say, I think it’s only fair to make you aware of this state of affairs. You have been warned!

First, then, here’s an informal definition: An image relation is, loosely, the “image” of some tuple t1 within some relation r2 (where, typically, tuple t1 is a tuple in the current value of some relvar R1 and relation r2 is the current value of some relvar R2, usually but not necessarily distinct from R1). By way of example, consider the suppliers-and-parts database, with its usual sample values. Let tuple t1 be the tuple for supplier S4 in the current value of relvar S, and let relation r2 be the current value of relvar SP. Then the following is the image of that supplier tuple t1 within that shipments relation r2:

┌─────┬─────┐
│ PNO │ QTY │
├═════┼─────┤
│ P2  │ 200 │
│ P4  │ 300 │
│ P5  │ 400 │
└─────┴─────┘

Clearly, this particular image relation can be obtained by means of the following Tutorial D expression:

( SP WHERE SNO = 'S4' ) { ALL BUT SNO }

In other words, restrict the current value of SP to just the tuples for S4, and then project away attribute SNO (since its value is the same, viz., S4, in every tuple in that restriction).

Here now is a formal definition of image relations in general:

Definition: Let relations r1 and r2 be joinable; let t1 be a tuple of r1; let t2 be a tuple of r2 that has the same values as tuple t1 for those attributes that are common to t1 and t2; let relation r3 be that restriction of r2 that contains all and only such tuples t2; and let relation r4 be the projection of r3 on all but those common attributes. Then r4 is the image relation (with respect to r2) corresponding to t1.

Here’s an example that illustrates the usefulness of image relations:

S WHERE ( !!SP ) { PNO } = P { PNO }

Explanation:

First of all, the roles of r1 and r2 from the formal definition are being played by the suppliers relation and the shipments relation, respectively (where by “the suppliers relation” I mean the current value of relvar S, of course, and similarly for “the shipments relation”).
Next, observe that the boolean expression in the WHERE clause here involves an equality comparison between two relations (actually two projections). We can imagine that boolean expression being evaluated for each tuple t1 in r1 (i.e., each tuple in the suppliers relation) in turn.
Consider one such tuple, say that for supplier Sx. For that tuple, then, the expression !!SP—pronounced “bang bang SP” or “double bang SP”—denotes the corresponding image relation r4 within r2; in other words, it denotes the set of (PNO,QTY) pairs within SP for parts supplied by that supplier Sx.⁵ The expression !!SP is an image relation reference.
The expression (!!SP){PNO}—i.e., the projection of the image relation on {PNO}—thus denotes the set of part numbers for parts supplied by supplier Sx.
The expression overall (i.e., S WHERE ...) thus denotes suppliers from S for whom that set of part numbers is equal to the set of all part numbers in the projection of P on {PNO}. In other words, it represents the query “Get suppliers who supply all parts” (speaking a little loosely).

Note: Since the concept of an image relation is defined in terms of some given tuple (t1, in the formal definition), it’s clear that an image relation reference can appear, not in all possible contexts in which relational expressions in general can appear, but only in certain specific contexts: namely, those in which the given tuple t1 is understood and well defined. WHERE clauses are one such context, as the foregoing example indicates, and we’ll see another in the section “Image Relations Revisited,” later in this chapter.

Aside: SQL has no direct support for image relations as such. Here for interest is an SQL formulation of the query “Get suppliers who supply all parts” (I show it for your consideration, but I’m not going to discuss it in detail here, except to note that it can obviously (?) be improved in a variety of ways):

SELECT *
FROM   S
WHERE  NOT EXISTS
     ( SELECT PNO
       FROM   SP
       WHERE  SP.SNO = S.SNO
       EXCEPT CORRESPONDING
       SELECT PNO
       FROM   P )
AND    NOT EXISTS
     ( SELECT PNO
       FROM   P
       EXCEPT CORRESPONDING
       SELECT PNO
       FROM   SP
       WHERE  SP.SNO = S.SNO )

End of aside.

To get back to image relations as such, it’s worth noting that the “!!” operator can be defined in terms of MATCHING. For example, the expression discussed above—

S WHERE ( !!SP ) { PNO } = P { PNO }

—is logically equivalent to the following:

S WHERE
( SP MATCHING RELATION { TUPLE { SNO SNO } } ) { PNO } = P { PNO }

Explanation: Again consider some tuple of S, say that for supplier Sx. For that tuple, then:

The expression

TUPLE { SNO SNO }

—which is a tuple selector invocation—denotes a tuple containing just the SNO value Sx (the first SNO in that expression is an attribute name, the second denotes the value of the attribute of that name in the tuple for Sx within relvar S).
So the expression

RELATION { TUPLE { SNO SNO } }

—which is a relation selector invocation—denotes the relation that contains just that tuple.
Hence, the expression

SP MATCHING RELATION { TUPLE { SNO SNO } }

denotes a certain restriction of SP: namely, that restriction that contains just those shipment tuples that have the same SNO value as the supplier tuple for supplier Sx does.

It follows that, in the context under consideration, the overall expression “SP MATCHING ...” is logically equivalent to the image relation reference “!!SP” as claimed.

By way of another example, suppose we’re given a revised version of the suppliers-and-parts database—one that’s simultaneously both extended and simplified, compared to our usual version—that looks like this (in outline):

S   { SNO }        /* suppliers               */
SP  { SNO , PNO }  /* supplier supplies part  */
PJ  { PNO , JNO }  /* part is used in project */
J   { JNO }        /* projects                */

Relvar J here represents projects (JNO stands for project number), and relvar PJ indicates which parts are used in which projects. Now consider the query “Get all (sno,jno) pairs such that sno is an SNO value currently appearing in relvar S, jno is a JNO value currently appearing in relvar J, and supplier sno supplies all parts used in project jno.” This is a complicated query!— but a formulation using image relations is almost trivial:

( S JOIN J ) WHERE !!SP ⊇ !!PJ

Exercise: Give an SQL analog of this expression.

Reverting now to the usual suppliers-and-parts database, here’s another example (“Delete shipments from suppliers in London”—and this time I’ll show an SQL analog as well):

DELETE SP WHERE IS_NOT_EMPTY      │   DELETE FROM SP
  ( !!( S WHERE                   │   WHERE  SNO IN
          CITY = 'London' ) ) ;   │        ( SELECT SNO FROM S
                                  │          WHERE CITY = 'London' ) ;

For a given shipment, the relation denoted by the specified image relation reference—i.e., !!(S WHERE ...)—is either empty, if the corresponding supplier isn’t in London, or contains exactly one tuple otherwise. Note: I’m not claiming that image relations are a big help with this particular example; I just wanted to show an image relation reference of the form !!rx where the specified relational expression rx was something more complicated than just a simple relvar reference (i.e., not just a simple relvar name). Here for the record is another formulation of the example in Tutorial D that avoids the image relation reference (actually it’s very similar to the SQL formulation):

DELETE SP WHERE SNO ∊ ( S WHERE CITY = 'London') { SNO }

DIVIDE

I include a discussion of divide in this chapter mainly just to show why I think—contrary to conventional wisdom, perhaps—it isn’t very important; in fact, I think it should be dropped. You can skip this section if you like.

I have at least three reasons for wanting to drop divide. One is that any query that can be formulated in terms of divide can alternatively, and much more simply, be formulated in terms of image relations instead, as I’ll demonstrate in just a moment. Another is that there are at least seven different divide operators anyway!—that is, there are, unfortunately, at least seven different operators all having some claim to be called “divide,” and I certainly don’t want to explain all of them.⁶ Instead, I’ll limit my attention here to the original and simplest one. Here’s a definition:

Definition: Let relations r1 and r2 be such that the heading of r2 is some subset of the heading of r1, and let A1, A2, ..., An be the attributes of r1 that aren’t also attributes of r2. Then (and only then) the expression r1 DIVIDEBY r2 denotes the division of r1 by r2,⁷ and it’s logically equivalent to the following:

WITH ( temp := r1 { A1 , A2 , ... , An } ) :
temp NOT MATCHING ( ( temp JOIN r2 ) NOT MATCHING r1 )

For example, the expression

SP { SNO , PNO } DIVIDEBY P { PNO }

(given our usual sample data values) yields:

┌─────┐
│ SNO │
├═════┤
│ S1 │
└─────┘

Note: I recommend that you take a few moments right now to check this result. (Pause.)

Now, if you do indeed check the result as recommended, you’ll see why the DIVIDEBY expression might loosely be characterized as representing the query “Get supplier numbers for suppliers who supply all parts” (I’ll explain the reason for that qualifier “loosely” in a few moments). In practice, however, we’re more likely to want full supplier details (not just supplier numbers) for the suppliers in question, in which case the division will need to be followed by a join, thus:

( SP { SNO , PNO } DIVIDEBY P { PNO } ) JOIN S

But we already know how to formulate this latter query more simply using image relations:

S WHERE ( !!SP ) { PNO } = P { PNO }

This latter formulation is (a) more succinct, (b) easier to understand (at least, so it seems to me), and (c) correct. This last point is the crucial one, of course, and I’ll explain it below. First, however, I want to explain why the operator is called divide, anyway. The reason is that if r1 and r2 are relations with no attribute names in common and we form the product r1 TIMES r2, and then divide the result by r2, we get back to r1. (At least, we do so just as long as r2 isn’t empty. What happens if it is?) In other words, product and divide are inverses of each other, in a sense.

Now, I’ve said the expression

SP { SNO , PNO } DIVIDEBY P { PNO }

can loosely be characterized as a formulation of the query “Get supplier numbers for suppliers who supply all parts”; indeed, this very example is often used as a basis for explaining, and justifying, the divide operator in the first place. Unfortunately, however, that characterization isn’t quite correct. Rather, the expression is a formulation of the query “Get supplier numbers for suppliers who supply at least one part and in fact supply all parts.”⁸ In other words, the divide operator not only suffers from problems of complexity and lack of succinctness—it doesn’t even solve the problem it was originally, and explicitly, intended to address.

AGGREGATE OPERATORS

In a sense this section is a bit of a digression, because the operators to be discussed aren’t relational but scalar—they return a scalar result.⁹ But I do need to say something about them before I can get back to the main theme of the chapter.

An aggregate operator in the relational model is an operator that derives a single value from the “aggregate” (i.e., the bag or set) of values appearing within some attribute within some relation—or, in the case of COUNT, which is slightly special, from the “aggregate” that’s the entire relation. The examples on the left below illustrate the the use of the COUNT aggregate operator specifically in Tutorial D:

X := COUNT ( S ) ;              │   SELECT COUNT ( * ) AS X
                                │   FROM   S

Y := COUNT ( S { STATUS } ) ;   │   SELECT COUNT ( DISTINCT STATUS ) AS Y
                                │   FROM   S

Note that I carefully don’t say the examples on the right “illustrate the use of the COUNT aggregate operator in SQL.” That’s because I’m going to argue that SQL doesn’t really support aggregate operators at all!—at least, not properly. But first things first ... Until further notice, let me focus on Tutorial D, and in particular on the COUNT examples just shown. Given our usual sample values:

The first example assigns the value 5 (the number of tuples in the current value of relvar S) to the variable X.
The second example assigns the value 3 (the number of tuples in the projection of the current value of relvar S on {STATUS}, which is to say the number of distinct STATUS values in that current value) to the variable Y.

In general, a Tutorial D aggregate operator invocation looks like this:

<agg op name> ( <relation exp> [ , <exp> ] )

Legal <agg op name>s include COUNT, SUM, AVG, MAX, MIN, AND, OR, and XOR. Within the <exp>, an <attribute ref> can appear wherever a literal would be allowed. That <exp> must be omitted if the <agg op name> is COUNT; otherwise, it can be omitted only if the <relation exp> denotes a relation of degree one, in which case an <exp> consisting of a reference to the sole attribute of that relation is assumed.

Aside: The aggregate operators AND, OR, and XOR apply to aggregates of boolean values specifically. AND in particular can be useful in connection with certain integrity constraints (see Chapter 8 for further discussion). As for SQL, SQL’s counterparts to AND and OR are called EVERY and SOME, respectively (there’s no counterpart to XOR). SOME can alternatively be spelled ANY; likewise, in ALL or ANY comparisons (see Chapter 12), ANY can alternatively be spelled SOME. Oddly enough, however, the SQL “set function” EVERY can’t alternatively be spelled ALL, and in ALL or ANY comparisons ALL can’t alternatively be spelled EVERY. End of aside.

Here are some Tutorial D examples:

SUM ( SP , QTY )

This expression denotes the sum of all quantities in relvar SP (given our usual sample values, the result is 3100).
SUM ( SP { QTY } )

This expression is shorthand for SUM(SP{QTY},QTY), and it denotes the sum of all distinct quantities in SP (i.e., 1000).
AVG ( SP , 3 * QTY )

This expression effectively asks what the average shipment quantity would be if quantities were all triple their current value (the answer is 775). More generally, if the expression <exp> is more complicated than a simple <attribute ref>, then the invocation

agg ( rx , exp )

is essentially shorthand for the following:

agg ( EXTEND rx : { a := exp } , a )

I turn now to SQL. For convenience, let me first repeat the examples:

X := COUNT ( S ) ;              │   SELECT COUNT ( * ) AS X
                                │   FROM   S

Y := COUNT ( S { STATUS } ) ;   │   SELECT COUNT ( DISTINCT STATUS ) AS Y
                                │   FROM   S

Now, you might have been surprised to hear me claim earlier that SQL doesn’t really support aggregate operators at all—especially since most people would surely consider SELECT expressions like those on the right above to be, precisely, SQL aggregate operator invocations.¹⁰ But they aren’t. Let me explain. As we know, the counts are 5 and 3, respectively. But those SELECT expressions don’t yield those counts as such, as true aggregate operator invocations would; rather, they yield tables that contain those counts. More precisely, each yields a table with one row and one column, and the sole value in that row is the actual count:

┌───┐         ┌───┐
│ X │         │ Y │      /* the lack of double underlining in these */
├───┤         ├───┤      /* tables is not a mistake -- see Exercise */
│ 5 │         │ 3 │      /* 7.23 at the end of the chapter          */
└───┘         └───┘

As you can see, therefore, the SELECT expressions really don’t represent aggregate operator invocations as such; at best, they represent only approximations to such invocations. In fact, aggregation is treated in SQL as if it were a special case of summarization. Now, I haven’t discussed summarization yet, of course; for present purposes, however, you can regard it loosely as what’s represented in SQL by a SELECT expression with a GROUP BY clause. Now, the foregoing SELECT expressions don’t have a GROUP BY clause—but they’re defined to be shorthand for the following, which do (and do therefore represent summarizations as claimed):

SELECT COUNT ( * ) AS X
FROM   S
GROUP  BY ( )

SELECT COUNT ( DISTINCT STATUS ) AS Y
FROM   S
GROUP  BY ( )

Aside: In case these expressions look strange to you, I should explain that, first, SQL does allow GROUP BY clauses with the operand commalist enclosed in parentheses; second, that commalist can be empty, just so long as those parentheses are specified; finally, specifying such an empty commalist is equivalent to omitting the GROUP BY clause entirely, because:

Such a GROUP BY effectively means “group by no columns.”
Every row has the same value for no columns: namely, the 0-row (despite the fact that SQL doesn’t actually support the 0-row!).
Every row in the table is thus part of the same group; in other words, the entire table is treated as a single group,¹¹ and that’s effectively what happens when the GROUP BY clause is omitted entirely.

End of aside.

So SQL does support summarization—but it doesn’t support aggregation as such. Sadly, the two concepts are often confused, and perhaps you can begin to see why. What’s more, the picture is confused still further by the fact that, in SQL, it’s common in practice for the table that results from an “aggregation” to be coerced to the single row it contains, or even doubly coerced to the single value that row contains: two separate errors (of judgment, if nothing else) thus compounding to make the SQL-style “aggregation” look more like a true aggregation after all! Such double coercion occurs in particular when the SELECT expression is enclosed in parentheses to form a scalar subquery, as in the following SQL assignments:

SET X = ( SELECT COUNT ( * ) FROM S ) ;

SET Y = ( SELECT COUNT ( DISTINCT STATUS ) FROM S ) ;

But assignment as such is far from being the only context in which such coercions occur (see Chapters 2 and 12).

Aside: Actually there’s another oddity arising in connection with SQL-style aggregation (I include this observation here because this is where it logically belongs, but it does rely on a detailed understanding of SQL-style summarization, and you can skip it if you like):

In general, an expression of the form SELECT - FROM T - WHERE - GROUP BY - HAVING delivers a result containing exactly one row for each group in G, where G is the “grouped table” resulting from applying the WHERE, GROUP BY, and HAVING clauses to table T.
Omitting the WHERE and HAVING clauses, as in a “straightforward” SQL-style aggregation, is equivalent to specifying WHERE TRUE and HAVING TRUE, respectively. For present purposes, therefore, we need consider the effect of the GROUP BY clause (only) in determining the grouped table G.
Suppose table T has nT rows. Then arranging those rows into groups can produce at most nT groups; in other words, the grouped table G has nG groups for some nG (nG ≤ nT), and the overall result, obtained by applying the SELECT clause to G, thus has nG rows.
Now suppose nT is zero (i.e., table T is empty); then nG must clearly be zero as well (i.e., the grouped table G, and hence the result of the SELECT expression overall, must both be empty as well).
In particular, therefore, the expression

SELECT COUNT ( * ) AS X
FROM   S
GROUP  BY ( )

—which is the expanded form of SELECT COUNT(*) AS X FROM S—ought logically to produce the result shown on the left, not the one shown on the right, if table S happens to be empty:

┌───┐           ┌───┐
│ X │           │ X │
├───┤           ├───┤
└───┘           │ 0 │
                └───┘

In fact, however, it produces the result on the right. How? Answer: By special casing. Here’s a direct quote from the standard: “If there are no grouping columns, then the result of the <group by clause> is the grouped table consisting of T as its only group.” In other words, while grouping an empty table in SQL does indeed (as argued above) produce an empty set of groups in general, the case where the commalist of grouping columns is empty is special; in that case, it produces a set containing exactly one group, that group being identical to the empty table T. In the example, therefore, the COUNT operator is applied to an empty group, and thus “correctly” returns the value zero.

Now, you might be thinking the discrepancy here is hardly earth shattering; you might even be thinking the result on the right above is somehow “better” than the one on the left. But (to state the obvious) there’s a logical difference between the two, and—to repeat from Chapter 1—as Wittgenstein said, all logical differences are big differences. Logical mistakes like the one under discussion are simply unacceptable in a system that’s meant to be solidly based on logic, as relational systems are. End of aside.

Empty Arguments

The foregoing aside raises yet another issue. To be specific, let agg be an aggregate operator; then what should happen if agg is invoked on an empty argument? For example, given our usual sample data values, what value should the following statement assign to X?

X := SUM ( SP WHERE SNO = 'S5' , QTY ) ;

The answer, of course, is zero; as explained in Chapter 6 under the discussion of n-adic join, zero is the identity value with respect to addition, and the sum of no numbers is therefore zero. More generally, let agg be an aggregate operator (other than COUNT), and let av be the aggregate value over which some given invocation of agg is to be evaluated. If av is of cardinality one, the result of the invocation in question is the single value contained in av. If av is of cardinality zero (i.e., if av is empty), and if all of the following are true—

The invocation in question is essentially just shorthand for repeated invocation of some dyadic operator op;
An identity value iv exists for op;
The semantics of agg don’t demand the result of an invocation to be a value actually appearing in av;

—then

The result of the invocation in question is that identity value iv.

Thus, for the aggregate operators discussed in this section, identity values (and hence the result returned if the argument is empty) are as follows:¹²

AND: TRUE.
OR and XOR: FALSE.
COUNT and SUM: Zero. Note: The type of the result in these cases is INTEGER (for COUNT) and the type of the specified argument expression (for SUM). By way of example, if relvar P is currently empty, COUNT (P) returns 0 and SUM (P,WEIGHT) returns 0.0.
AVG: Since asking for the average of an empty set is effectively asking for zero to be divided by zero, the only reasonable response is to raise an exception (and careful coding might sometimes be called for, therefore).
MAX and MIN: By definition, asking for the maximum or minimum of some set of values is asking for some specific value from within that set. If the set in question happens to be empty, therefore, the only reasonable response is, again, to raise an exception (and careful coding might again sometimes be called for, therefore).

To return to MAX and MIN for a moment: Actually there’s an argument that says the MAX and MIN of an empty aggregate shouldn’t be undefined after all. For definiteness, consider MAX specifically. Define a dyadic operator MAX2 that returns the larger of its two arguments (more precisely, define MAX2{x1,x2} to return x1 if x1 ≥ x2 and x2 otherwise). Then (a) any given MAX invocation is essentially just shorthand for repeated invocation of MAX2, and (b) MAX2 clearly has an identity value, viz., “negative infinity” (meaning the minimum value of the pertinent type); so we might reasonably define MAX to return that identity value if its aggregate argument is empty. Likewise, we might reasonably define MIN to return “positive infinity” (the maximum value of the pertinent type) if its aggregate argument is empty. Perhaps the best approach in practice would be to provide both versions of MAX—they are, after all, different operators—and let the user decide. We might even provide a third version, one that takes an additional argument x, where x is supplied by the user and is the value to be returned if the aggregate argument is empty.

IMAGE RELATIONS REVISITED

In this section, I just want to present a series of examples that show the usefulness of image relations in connection with aggregate operators as discussed in the previous section.

Example 1: Get suppliers for whom the total shipment quantity, taken over all shipments for the supplier in question, is less than 1000.

S WHERE SUM ( !!SP , QTY ) < 1000

For any given supplier, the expression SUM (!!SP,QTY) denotes, precisely, the total shipment quantity for the supplier in question. An equivalent formulation without the image relation is:

S WHERE SUM ( SP MATCHING RELATION { TUPLE { SNO SNO } } , QTY ) < 1000

Here for interest is an SQL “analog”—“analog” in quotes because actually there’s a trap in this example; the SQL expression shown is not quite equivalent to the Tutorial D expressions shown previously (why not?):

SELECT S.*
FROM   S , SP
WHERE  S.SNO = SP.SNO
GROUP  BY S.SNO , S.SNAME , S.STATUS , S.CITY
HAVING SUM ( SP.QTY ) < 1000

Incidentally, I can’t resist pointing out in passing that (as this example suggests) SQL lets us say “S.*” in the SELECT clause but not in the GROUP BY clause, where it would make just as much sense.

Example 2: Get suppliers with fewer than three shipments.

S WHERE COUNT ( !!SP ) < 3

Example 3: Get suppliers for whom the minimum shipment quantity is less than half the maximum shipment quantity (taken over all shipments for the supplier in question in both cases).

S WHERE MIN ( !!SP , QTY ) < 0.5 * MAX ( !!SP , QTY )

Here I’m assuming that MIN and MAX have been defined to return “positive infinity” and “negative infinity,” respectively, if their aggregate argument is empty (see the discussion of such matters at the very end of the previous section). Note that under that assumption (and given our usual sample values), the result contains no tuple for supplier S5.

Example 4: Get shipments such that at least two other shipments involve the same quantity.

SP WHERE COUNT ( !!( SP RENAME { SNO AS SN , PNO AS PN } ) ) > 2

This example is very contrived, but it illustrates the point that we might occasionally need to do some attribute renaming in connection with image relation references. In the example, the renaming is needed in order to ensure that the image relation we want, in connection with a given shipment tuple, is defined in terms of attribute QTY only. The introduced names SN and PN are arbitrary.

I remark in passing that the RENAME invocation in this example—

SP RENAME { SNO AS SN , PNO AS PN }

—illustrates the “multiple” form of the RENAME operator. The individual renamings in such a RENAME invocation are effectively executed in parallel.¹³ Similar “multiple” forms are defined for various other operators, too, including EXTEND in particular (I’ll give an example later).

Example 5: Update suppliers for whom the total shipment quantity, taken over all shipments for the supplier in question, is less than 1000, reducing their status to half its previous value.

UPDATE S WHERE SUM ( !!SP , QTY ) < 1000 : { STATUS := 0.5 * STATUS } ;

SUMMARIZATION

It’s convenient to begin this section with an example (which I’ll label SX1—“SUMMARIZE Example 1”—for purposes of subsequent reference):

SUMMARIZE SP PER ( S { SNO } ) : { PCT := COUNT ( PNO ) }

Given our usual sample values, the result looks like this:

┌─────┬─────┐
│ SNO │ PCT │
├═════┼─────┤
│ S1  │   6 │
│ S2  │   2 │
│ S3  │   1 │
│ S4  │   3 │
│ S5  │   0 │
└─────┴─────┘

Explanation: In this example, the relation to be summarized—“the SUMMARIZE relation”—is the current value of relvar SP, and the relation controlling or driving the summmarization—“the PER relation”—is the current value of the expression S{SNO} (in other words, it’s the projection on {SNO} of the current value of relvar S). The result consists of five tuples, one for each tuple in the PER relation. In turn, each result tuple consists of a PER tuple extended with a certain count, that count being a count of the number of PNO values in the SUMMARIZE relation that are paired with the SNO value in that particular PER tuple.

Here now is a definition:

Definition: Let relations r1 and r2 be such that the heading of r2 is some subset of that of r1. Let r2 have attributes called A1, A2, ..., An and no others (in particular, no attribute called B). Then (and only then) the expression SUMMARIZE r1 PER (r2) : {B := exp} denotes a summarization of r1 according to r2, and it returns the relation with (a) heading consisting of attributes A1, A2, ..., An, and B and (b) body consisting of all tuples t such that t is a tuple of r2, extended with a value b for attribute B. That value b is computed by evaluating the expression exp over all tuples of r1 that have the same value for attributes A1, A2, ..., An as t does.

Points arising from this definition:

With reference to Example SX1, r1 is the current value of SP and r2 is the current value of S{SNO}. Observe in particular, therefore, that the heading of r2 is indeed a subset of the heading of r1, as required.
The result has cardinality equal to that of r2 and degree equal to that of r2 plus one. The type of attribute B in that result is the type of exp. (In terms of Example SX1, the result has cardinality five and degree two, and “attribute B”—i.e., attribute PCT—is of type INTEGER.)
The expression exp will typically be what’s called an open expression. An open expression is one that can’t appear in all possible contexts in which expressions in general can appear, but only in certain specific contexts: namely, those that suffice to give it meaning. Note: Actually this notion should be familiar to you; we’ve encountered it before, in a footnote in the previous chapter. To repeat the example from that footnote, the expression STATUS = 20 is open; by contrast, the expression S WHERE STATUS = 20 is closed. (As a matter of fact we’ve encountered another example of open expressions in this chapter too: viz., image relation references.)
More specifically, the expression exp can, and in practice usually will, include at least one summary. COUNT (PNO) is an example. Note carefully that this “summary” is not an invocation of the COUNT aggregate operator; to be specific, the COUNT aggregate operator takes a relation as its argument, while the argument in the expression COUNT (PNO) is not a relation but an attribute. In fact, the expression COUNT (PNO) is really rather special—it has no meaning outside the context of an appropriate SUMMARIZE, and it can’t be used outside that context. Note, therefore, that my earlier criticisms of COUNT and the rest in SQL (to the effect that they can’t appear “stand alone”) apply with just as much force to Tutorial D’s “summaries.” All of which begins to make it look as if SUMMARIZE might be not quite respectable, in a way, and it might be nice if we could replace it by something better ... See the section “Summarization Revisited,” later.
So there’s a logical difference between aggregate operators and summaries. However, at least it’s true that every aggregate operator does have a summary counterpart (and vice versa). It’s also true that each aggregate operator has the same name as its summary counterpart. We’ll see some more examples of such summaries later.

I’ve said that the heading of the PER relation r2 is required to be the same as that of some projection of the SUMMARIZE relation r1. However, in the special case where r2 doesn’t merely have the same heading as some projection of r1 but actually is such a projection, Tutorial D provides us with a tiny shorthand. To be specific, it allows the PER specification to be replaced by a BY specification, as here (“Example SX2”):

SUMMARIZE SP BY { SNO } : { PCT := COUNT ( PNO ) }

Here’s the result:

┌─────┬─────┐
│ SNO │ PCT │
├═════┼─────┤
│ S1  │   6 │
│ S2  │   2 │
│ S3  │   1 │
│ S4  │   3 │
└─────┴─────┘

As you can see, this result differs from the result in Example SX1 in that it contains no tuple for supplier S5. That’s because BY {SNO} here is defined to be shorthand for PER (SP{SNO})—SP, because it’s the current value of SP that we want to summarize—and SP doesn’t currently contain a tuple for supplier S5.

Now, Example SX2 can be expressed in SQL more or less directly, as follows:

SELECT SNO , COUNT ( ALL PNO ) AS PCT
FROM SP
GROUP BY SNO

Recall now from the section “Aggregate Operators” that, very loosely speaking, aggregation—to the extent it’s supported at all, that is—is represented in SQL by a SELECT expression without an explicit GROUP BY. By contrast, as the foregoing example suggests (and as previously noted in that section “Aggregate Operators,” in fact), summarization is represented in SQL by a SELECT expression with an explicit GROUP BY clause (usually, at any rate, but see further discussion later). Points arising:

You can think of such an expression as being evaluated as follows. First, the table specified by the FROM clause is partitioned into set of disjoint “groups”—actually tables—as specified by the grouping column(s) in the GROUP BY clause; second, result rows are obtained, one for each group, by computing the specified summary (or summaries, plural) for the pertinent group and appending other items as specified by the SELECT item commalist. Note: The SQL analog of the term summary is “set function”; that term is doubly inappropriate, however, because (a) the argument to such a function isn’t a set but a bag, in general, and (b) the result isn’t a set either. For these reasons, in fact, I generally set the term in quotation marks.
It’s safe to specify just SELECT, not SELECT DISTINCT, in the example because (a) the result table is guaranteed to contain just one row for each group, by definition, and (b) each such row contains a different unique value for the grouping column(s), again by definition.
The ALL specification could be omitted from the COUNT invocation in this example, because for “set functions” ALL is the default. (The alternative is DISTINCT, of course. In the case at hand, however, it makes no difference whether ALL or DISTINCT is specified, because the combination of supplier number and part number is a key for SP.)
The “set function” COUNT(*) is special—it applies, not to values in some column (as, e.g., SUM (...) does), but to rows in some table. (In the example, the specification COUNT (PNO) could be replaced by COUNT(*) without affecting the result.)

Now let’s get back to Example SX1. Here’s a possible SQL formulation of that example:

SELECT S.SNO , ( SELECT COUNT ( PNO )
                 FROM   SP
                 WHERE  SP.SNO = S.SNO ) AS PCT
FROM   S

The important point about this example is that the result now does contain a row for supplier S5, because (thanks to the FROM clause, which takes the form FROM S) that result contains one row for each supplier number in table S, not table SP. And, as you can see, this formulation differs from the one given for Example SX2—the one that missed supplier S5—in that it doesn’t include a GROUP BY clause, and it doesn’t do any grouping (at least, not overtly).

Aside: By the way, there’s another trap for the unwary here. As you can see, the second item in the SELECT item commalist in the foregoing SQL expression—i.e., the subexpression (SELECT ... S.SNO) AS PCT—is of the form subquery AS name (and the subquery in question is in fact a scalar one). Now, if that same text were to appear in a FROM clause, the “AS name” specification would be understood as defining a name for the table denoted by that subquery.¹⁴ In the SELECT clause, however, that very same “AS name” specification is understood as defining a name for the pertinent column of the overall result. It follows that the following SQL expression is not logically equivalent to the one shown above:

SELECT S.SNO , ( SELECT COUNT ( PNO ) AS PCT
                 FROM   SP
                 WHERE  SP.SNO = S.SNO )
FROM   S

With this formulation, the table t that’s returned by evaluation of the subquery has a column called PCT. That table t is then doubly coerced to the sole scalar value it contains, producing a column value in the overall result—but (believe it or not) the standard doesn’t guarantee that that column in the overall result has any particular column name; in particular, it doesn’t guarantee that it’s called PCT. End of aside.

To revert to the main thread of the discussion: As a matter of fact, Example SX2 could also be expressed in SQL without using GROUP BY, as follows:

SELECT DISTINCT SPX.SNO , ( SELECT COUNT ( SPY.PNO )
                            FROM   SP AS SPY
                            WHERE  SPY.SNO = SPX.SNO ) AS PCT
FROM   SP AS SPX

As these examples suggest, SQL’s GROUP BY clause is in fact logically redundant (a fact that I’m sure will come as a surprise to some readers); that is, any relational expression that can be represented using GROUP BY can also be represented without it. But there’s another point that needs to be made here too. Suppose Example SX1 had requested, not the count of part numbers, but the sum of quantities, for each supplier. In Tutorial D:

SUMMARIZE SP PER ( S { SNO } ) : { TOTQ := SUM ( QTY ) }

Given our usual sample values, the result looks like this:

┌─────┬──────┐
│ SNO │ TOTQ │
├═════┼──────┤
│ S1  │ 1300 │
│ S2  │  700 │
│ S3  │  200 │
│ S4  │  900 │
│ S5  │    0 │
└─────┴──────┘

By contrast, this SQL expression—

SELECT S.SNO , ( SELECT SUM ( QTY )
                 FROM   SP
                 WHERE  SP.SNO = S.SNO ) AS TOTQ
FROM   S

—gives a result in which the TOTQ value for supplier S5 is shown as null, not zero. That’s because (as mentioned in Chapter 4) if any SQL “set function” other than COUNT or COUNT(*) is invoked on an empty argument, the result is incorrectly defined to be null. To get the correct result, therefore, we need to use COALESCE, as follows:

SELECT S.SNO , ( SELECT COALESCE ( SUM ( QTY ) , 0 )
                 FROM   SP
                 WHERE  SP.SNO = S.SNO ) AS TOTQ
FROM   S

Suppose now that the example had asked for the sum of quantities for each supplier, but only where that sum is greater than 250. In Tutorial D, then, we can simply enclose the formulation shown earlier in parentheses and apply the pertinent restriction to it, thus:

( SUMMARIZE SP PER ( S { SNO } ) : { TOTQ := SUM ( QTY ) } )
WHERE TOTQ > 250

Result:

┌─────┬──────┐
│ SNO │ TOTQ │
├═════┼──────┤
│ S1  │ 1300 │
│ S2  │  700 │
│ S4  │  900 │
└─────┴──────┘

A “natural” SQL formulation of this query would be as follows (note the HAVING clause):

SELECT SNO , SUM ( QTY ) AS TOTQ
FROM   SP
GROUP  BY SNO
HAVING SUM ( QTY ) > 250  /* not TOTQ > 250 !!! */

But it could also be formulated like this:

SELECT DISTINCT SPX.SNO , ( SELECT SUM ( SPY.QTY )
                            FROM   SP AS SPY
                            WHERE  SPY.SNO = SPX.SNO ) AS TOTQ
FROM   SP AS SPX
WHERE  ( SELECT SUM ( SPY.QTY )
         FROM   SP AS SPY
         WHERE  SPY.SNO = SPX.SNO ) > 250

As this example suggests, then, HAVING, like GROUP BY, is also logically redundant— any relational expression that can be represented with it can also be represented without it. So GROUP BY and HAVING could both be dropped from SQL without any loss of relational functionality! And while it might be true that the GROUP BY and HAVING versions of some query are often more succinct,¹⁵ it’s also true that they sometimes deliver the wrong answer. For example, consider what would happen in the foregoing example if we had wanted the sum to be less than, instead of greater than, 250. Simply replacing “>” by “<” in the GROUP BY / HAVING formulation does not work. (Why not? Does it work in the non GROUP BY / non HAVING formulation?) Recommendations: If you do use GROUP BY or HAVING, make sure the table in the FROM clause is the one you really want to drive the operation (typically suppliers rather than shipments, in terms of the examples in this section). Also, be on the lookout for the possibility that some summarization is being done on an empty set, and use COALESCE wherever necessary.

There’s one more thing I need to say about GROUP BY and HAVING. Consider the following SQL expression:¹⁶

SELECT SNO , CITY , SUM ( QTY ) AS TOTQ
FROM S NATURAL JOIN SP
GROUP BY SNO

Observe that column CITY is mentioned in the SELECT item commalist here but isn’t one of the grouping columns. That mention is legitimate, however, because table S is subject to a certain functional dependency—see Chapter 8—according to which each SNO value in that table has just one corresponding CITY value (again, in that table); what’s more, the SQL standard includes rules according to which the system will in fact be aware of that functional dependency. As a consequence, even though it isn’t a grouping column, CITY is still known to be single valued per group, and it can therefore indeed appear in the SELECT clause as shown (also in the HAVING clause, if there is one).

Of course, it’s not logically wrong—though there might be negative performance implications—to specify the column as a grouping column anyway, as here:

SELECT SNO , CITY , SUM ( QTY ) AS TOTQ
FROM S NATURAL JOIN SP
GROUP BY SNO , CITY

SUMMARIZATION REVISITED

The SUMMARIZE operator has been part of Tutorial D since its inception. When image relations were introduced, however, that operator became logically redundant—and while there might be reasons (perhaps pedagogic ones) to retain it, the fact is that most summarizations can be more succinctly expressed by using image relations in combination with the EXTEND operator, as I now proceed to show.¹⁷

First of all, recall Example SX1 from the previous section (“For each supplier, get the supplier number and a count of the number of parts supplied”). The SUMMARIZE formulation looked like this:

SUMMARIZE SP PER ( S { SNO } ) : { PCT := COUNT ( PNO ) }

Here by contrast is an equivalent EXTEND formulation:

EXTEND S { SNO } : { PCT := COUNT ( !!SP ) }

(Since the combination {SNO,PNO} is a key for relvar SP, there’s no need to project the image relation on {PNO} before computing the count.) As this example suggests, EXTEND is certainly another context in which image relations make sense; in fact, they’re arguably even more useful in this context than they are in WHERE clauses.

The rest of this section consists of more examples. I’ve continued the numbering from the examples in the section “Image Relations Revisited.” In each case, I show an EXTEND formulation first, followed by an equivalent SUMMARIZE formulation; equivalent SQL formulations are left as an exercise.

Example 6: For each supplier, get supplier details and total shipment quantity, taken over all shipments for the supplier in question.

EXTEND S : { TOTQ := SUM ( !!SP , QTY ) }

SUMMARIZE analog:

S JOIN ( SUMMARIZE SP PER ( S { SNO } ) : { TOTQ := SUM ( QTY ) } )

Example 7: For each part supplied, get part details and total, maximum, and minimum shipment quantity, taken over all shipments for the part in question.

EXTEND ( P MATCHING SP ) : { TOTQ := SUM ( !!SP , QTY ) ,
MAXQ := MAX ( !!SP , QTY ) ,
MINQ := MIN ( !!SP , QTY ) }

SUMMARIZE analog:

P JOIN ( SUMMARIZE SP BY { PNO } : { TOTQ := SUM ( QTY ) ,
MAXQ := MAX ( QTY ) ,
MINQ := MIN ( QTY ) } )

Note the use of the multiple forms of EXTEND and SUMMARIZE in this example.

Example 8: For each supplier, get supplier details, total shipment quantity taken over all shipments for the supplier in question, and total shipment quantity taken over all shipments for all suppliers.

EXTEND S : { TOTQ := SUM ( !!SP , QTY ) ,
GTOTQ := SUM ( SP , QTY ) }

Result:

┌─────┬──────┬───────┐
│ SNO │ TOTQ │ GTOTQ │
├═════┼──────┼───────┤
│ S1  │ 1300 │  3100 │
│ S2  │  700 │  3100 │
│ S3  │  200 │  3100 │
│ S4  │  900 │  3100 │
│ S5  │    0 │  3100 │
└─────┴──────┴───────┘

SUMMARIZE analog:

JOIN { S , SUMMARIZE SP PER ( S { SNO } ) : { TOTQ := SUM ( QTY ) } ,
SUMMARIZE SP BY { } : { GTOTQ := SUM ( QTY ) } }

Example 9: Let city c be such that some supplier in c supplies some part in c. For each such city c, get c and the maximum and minimum shipment quantities for all shipments for which the supplier and part are both in city c.

WITH ( temp := JOIN { S , SP , P } ) :
EXTEND temp { CITY } : { MAXQ := MAX ( !!temp , QTY ) ,
MINQ := MIN ( !!temp , QTY ) }

SUMMARIZE analog:

WITH ( temp := JOIN { S , SP , P } ) :
SUMMARIZE temp BY { CITY } : { MAXQ := MAX ( QTY ) ,
MINQ := MIN ( QTY ) }

The point of this rather contrived example is to illustrate the usefulness of WITH, in connection with “SUMMARIZE-type” EXTENDs in particular, in avoiding the need to write out some possibly lengthy subexpression several times. Note: This book generally has little to say about performance matters, but I think it’s worth pointing out that we would surely expect the system, in examples like this one, to evaluate the pertinent subexpression once instead of several times. In other words, the use of WITH can be one of those nice win-win situations that are good for both the user and the DBMS.

GROUP, UNGROUP, AND RELATION VALUED ATTRIBUTES

Recall from Chapter 2 that relations with relation valued attributes (RVAs for short) are legal. Refer to Fig. 7.1, which shows relations R1 and R4 from Figs. 2.1 and 2.2, respectively, in that chapter; note that R4 has an RVA and R1 doesn’t, but the two relations clearly represent the same information.

Fig. 7.1: Relations R1 and R4 from Figs. 2.1 and 2.2 in Chapter 2

Note: For the record, the type of relation R4 in the figure is RELATION {SNO CHAR, PNO_REL RELATION {PNO CHAR}}.

Now, we obviously need a way to map between relations without RVAs and relations with them, and that’s the purpose of the GROUP and UNGROUP operators. I don’t want to go into a lot of detail on those operators here; let me just say that—as in fact was previously noted in the answer to Exercise 3.7 in Chapter 3—given the relations shown in Fig. 7.1, the expression

R1 GROUP { PNO } AS PNO_REL

will produce R4, and the expression

R4 UNGROUP PNO_REL

will produce R1.

By the way, it’s worth noting that the following expression—

EXTEND R1 { SNO } : { PNO_REL := !!R1 }

—will produce exactly the same result as the GROUP example shown above. In other words, GROUP can be defined in terms of EXTEND and image relations. Now, I’m not suggesting that we get rid of our useful GROUP operator; quite apart from anything else, a language that had an explicit UNGROUP operator (as Tutorial D does) but no explicit GROUP operator could certainly be criticized on ergonomic grounds, if nothing else. But it’s at least interesting, and perhaps pedagogically helpful, to note that the semantics of GROUP can so easily be explained in terms of EXTEND and image relations.¹⁸

And by the way again: If R4 includes a tuple for supplier number Sx, say, and if the PNO_REL value in that tuple is empty, then the result of the foregoing UNGROUP will contain no tuple at all for supplier number Sx. For further details, I refer you to my book An Introduction to Database Systems (see Appendix G) or the book Databases, Types, and the Relational Model: The Third Manifesto (again, see Appendix G), by Hugh Darwen and myself.

The SQL counterparts to GROUP and UNGROUP are quite complex, and I don’t propose to go into details here. However, I will at least show approximate SQL analogs (?) of the Tutorial D examples above.¹⁹ First GROUP:

SELECT DISTINCT X.SNO ,
                CAST ( TABLE ( SELECT Y.PNO
                               FROM   R1 AS Y
                               WHERE  Y.SNO = X.SNO )
                AS ROW ( PNO VARCHAR(6) ) MULTISET ) AS PNO_REL
FROM   R1 AS X

Now UNGROUP:

SELECT SNO , X.PNO
FROM R4 , UNNEST ( PNO_REL ) AS X ( PNO )

Note: I can’t help pointing out a certain irony in SQL’s version of the GROUP example. As you can see, the SQL expression in that example involves a subquery in (a CAST invocation within) the SELECT clause. Of course, a subquery denotes a table; in SQL, however, that table is often coerced—in the context of a SELECT clause in particular—to a single row, or more frequently to a single column value from within that single row. In the case at hand, however, we don’t want any such coercion; so we have to tell SQL explicitly, by means of the TABLE operator,²⁰ not to do what it normally would do (by default, as it were) in such a context.

RVAs Make Outer Join Unnecessary

There are several further points worth making in connection with relation valued attributes. First of all, they make outer join unnecessary! Second, it turns out they’re sometimes necessary even in base relvars. Third, they’re conceptually necessary anyway in order to support relational comparison operations. And fourth, they make it desirable to support certain additional aggregate operators. I’ll elaborate on each of these points in turn.

I’ll begin by showing a slightly more complicated example of an RVA. Consider the following Tutorial D expression:

EXTEND S : { PQ ≔ !!SP }

Suppose we evaluate this expression and assign the result to a relvar SPQ. A sample value for SPQ, corresponding to our usual sample values for relvars S and SP, is shown (in outline) in Fig. 7.2 below. Attribute PQ is relation valued.

Fig. 7.2: Relvar SPQ (sample value)

Now consider the following SQL expression:

SELECT SNO , SNAME , STATUS , CITY , PNO , QTY
FROM S NATURAL LEFT OUTER JOIN SP

The result of evaluating this expression is shown (again in outline) in Fig. 7.3.

Fig. 7.3: Left outer join of S and SP (sample value)

Observe now that with our usual sample values, the set of shipments for supplier S5 is empty, and that:

In Fig. 7.2, that empty set of shipments is represented by an empty set.
In Fig. 7.3, by contrast, that empty set is represented by nulls (indicated by shading in the figure).

To represent an empty set by an empty set seems like such an obviously good idea! In fact, as I said earlier, there would be no need for outer join at all if RVAs were properly supported. Thus, one advantage of RVAs is that they deal more elegantly with the problem that outer join is intended to solve than outer join itself does—and I’m tempted to say that this fact all by itself, even if there were no other advantages, is a big argument in favor of RVAs.

At the risk of laboring the obvious, I’d like to say too that if there aren’t any shipments for supplier S5, it means, to repeat, that the set of shipments for supplier S5 is empty (and that’s exactly what the relation in Fig. 7.2 says). It certainly doesn’t mean that supplier S5 supplies some unknown part in some unknown quantity; and yet unknown is—and in fact was originally and explicitly intended to be—the way null is usually interpreted. So Fig. 7.3 not only involves nulls (which as we saw in Chapter 4 are bad news anyway, for all kinds of reasons), it actually misrepresents the semantics of the situation.

RVAs in Base Relvars

Let’s look at some typical operations involving relvar SPQ (Fig. 7.2). Consider first the following queries:

Get supplier numbers for suppliers who supply part P2.

( ( SPQ UNGROUP PQ ) WHERE PNO = 'P2' ) { SNO }
Get part numbers for parts supplied by supplier S2.

( ( SPQ WHERE SNO = 'S2' ) UNGROUP PQ ) { PNO }

As you can see, the natural language versions of these two queries are symmetric, but the Tutorial D formulations on the RVA design (Fig. 7.2) aren’t. By contrast, Tutorial D formulations of the same queries against our usual (non RVA) design are symmetric, as well as being simpler than their RVA counterparts:

( SP WHERE PNO = 'P2' ) { SNO }

( SP WHERE SNO = 'S2' ) { PNO }

In fact, the queries on the RVA design effectively involve mapping that design to the non RVA design anyway (that’s what the UNGROUPs do).

Similar remarks apply to updates and constraints. For example, suppose we need to update the database to show that supplier S2 supplies part P5 in a quantity of 500. Here are Tutorial D formulations on (a) the non RVA design, (b) the RVA design:

INSERT SP
RELATION { TUPLE { SNO 'S2' , PNO 'P5' , QTY 500 } } ;
UPDATE SPQ WHERE SNO = 'S2' :
{ INSERT PQ
RELATION { TUPLE { PNO 'P5' , QTY 500 } } } ;

Once again, the natural language requirement is stated in a symmetric fashion; its formulation in terms of the non RVA design is symmetric too; but its formulation in terms of the RVA design isn’t (in fact, it’s quite cumbersome). And, of course, the reason for this state of affairs is that the RVA design itself is asymmetric—in effect, it regards parts as subordinate to suppliers, instead of giving parts and suppliers equal weight, as it were.

Examples like the ones discussed above tend to suggest that RVAs in base relvars are probably a bad idea (certainly relvar SPQ in particular isn’t very well designed). But this position might better be seen as a guideline, not an absolute limitation, because in fact there are cases—comparatively rare ones perhaps—where a base relvar with an RVA is exactly the right design. A sample value for such a relvar (SIBLING) is shown in Fig. 7.4. The intended interpretation is that the persons identified within any given PERSONS value are all siblings of one another, and have no other siblings. Thus, Amy and Bob are siblings; Cal, Don, and Eve are siblings; and Fay is an only child. Note that the relvar has just one attribute (an RVA) and three tuples. Note too that the sole key involves an RVA.

Fig. 7.4: Relvar SIBLING (sample value)

Note: It’s important to understand that no non RVA relation exists that represents exactly the same information, no more and no less, as the relation shown in Fig. 7.4 does. In particular, if we ungroup that relation as follows—

SIBLING UNGROUP SIBS

—we lose the information as to who’s a sibling of whom.

RVAs Are Necessary for Relational Comparisons

Consider once again this example from the section on image relations earlier in this chapter:

S WHERE ( !!SP ) { PNO } = P { PNO }

(“suppliers who supply all parts”). Clearly, the boolean expression in the WHERE clause here involves a relational comparison (actually an equality comparison). Recall now from Chapter 6 that an expression of the form r WHERE bx denotes a restriction as such only if bx is a restriction condition, and bx is a restriction condition if and only if every attribute reference in bx identifies some attribute of r and there aren’t any relvar references. In the example, therefore, the boolean expression isn’t a genuine restriction condition, because (a) it involves some references to attributes that aren’t attributes of S and (b) it also involves some relvar references (viz., to relvars SP and P). In fact, the example overall is really shorthand for something that might look like this:

WITH ( t1 := EXTEND S : { X := ( !!SP ) { PNO } } ,
t2 := EXTEND t1 : { Y := P { PNO } } ) :
t2 WHERE X = Y

Now the boolean expression in the WHERE clause (in the last line) is indeed a genuine restriction condition. Observe, however, that attributes X and Y are both RVAs. As the example suggests, therefore, RVAs are always involved, at least implicitly, whenever relational comparisons are performed.

Aggregate Operators

Consider relvar SPQ again, with sample value as shown in Fig. 7.2. Attribute PQ is relation valued. And just as it makes sense (and is useful) to define, e.g., numeric aggregate operators such as SUM on numeric attributes, so it makes sense, and is useful, to define relational aggregate operators on relation valued attributes. For example, the following expression returns the union of all of the relations currently appearing as values of attribute PQ in relvar SPQ:

UNION ( SPQ , PQ )

Or equivalently (but why exactly is this equivalent?):

UNION ( SPQ { PQ } )

Tutorial D supports the following relation valued aggregate operators: UNION, D_UNION, and INTERSECT. And SQL has analogs of UNION and INTERSECT (though not D_UNION); however, they’re called, not UNION and INTERSECT as one might reasonably have expected, but FUSION and INTERSECTION [sic], respectively. (It would be very naughty of me to suggest that if union is called FUSION, then intersection ought surely to be called FISSION, so I won’t.)

“WHAT IF” QUERIES

“What if” queries are a frequent requirement; they’re used to explore the effect of making certain changes to the database without actually having to make (and subsequently unmake, possibly) the changes in question. Here’s an example (“What if parts in Paris were in Nice instead and their weight was doubled?”):

EXTEND P WHERE CITY = 'Paris' :
{ CITY := 'Nice' , WEIGHT := 2 * WEIGHT }

As you can see, this expression makes use of EXTEND once again. This time, however, the target attributes in the assignments in braces aren’t “new” attributes, as they normally are for EXTEND; instead, they’re attributes already existing in the specified relation. What the expression does is this: It yields a relation containing exactly one tuple t2 for each tuple t1 in the current value of relvar P for which the city is Paris—except that, in that tuple t2, the weight is double that in tuple t1 and the city is Nice, not Paris. In other words, the expression overall is shorthand for the following:

WITH ( t1 := P WHERE CITY = 'Paris' ,
t2 := EXTEND t1 : { NC := 'Nice' , NW := 2 * WEIGHT } ,
t3 := t2 { ALL BUT CITY , WEIGHT } ) :
t3 RENAME { NC AS CITY , NW AS WEIGHT }

Aside: Here for purposes of comparison is an SQL analog of the foregoing Tutorial D expression:

WITH t1 AS ( SELECT P.*
             FROM   P
             WHERE  CITY = 'Paris' ) ,
     t2 AS ( SELECT P.* , 'Nice' AS NC , 2 * WEIGHT AS NW
             FROM   t1 )

SELECT PNO , PNAME , COLOR , NW AS WEIGHT , NC AS CITY
FROM   t2

End of aside.

Here now for the record is a formal definition for this alternative version of EXTEND:

Definition: Let relation r have an attribute called A. Then (and only then) the expression EXTEND r : {A := exp} denotes an extension of r, and it returns the relation with heading the same as that of r and body the set of all tuples t such that t is derived from a tuple of r by replacing the value of A by a value that’s computed by evaluating the expression exp on that tuple of r.²¹

And now I can take care of some unfinished business from Chapter 5. In that chapter, I said the relational UPDATE operator was shorthand for a certain relational assignment, but the details were a little more complicated than they were for INSERT and DELETE. Now I can explain those details. By way of example, consider the following UPDATE statement:

UPDATE P WHERE CITY = 'Paris' :
{ CITY := 'Nice' , WEIGHT := 2 * WEIGHT } ;

This statement is logically equivalent to the following relational assignment:

P := ( P WHERE CITY ≠ 'Paris' )
       UNION
     ( EXTEND ( P WHERE CITY = 'Paris' ) :
                      { CITY := 'Nice' , WEIGHT := 2 * WEIGHT } ) ;

Alternatively, recall from Chapter 5 that “updating relvar R” really means we’re replacing the relation r1 that’s the original value of R by another relation r2, where r2 is computed as (r1 MINUS s1) UNION s2 for certain relations s1 and s2. In the case at hand, using “” as in Chapter 6 to denote “is defined as,” we have:

s1 P WHERE CITY = 'Paris'

s2 EXTEND ( P WHERE CITY = 'Paris' ) :
{ CITY := 'Nice' , WEIGHT := 2 * WEIGHT } )

Thus, the expanded form of the UPDATE becomes:

P := ( P MINUS s1 ) UNION s2 ;

Note: Actually, we could safely replace MINUS and UNION here by I_MINUS and D_UNION, respectively, and we could safely drop the parentheses. (In both cases, why?)²²

A NOTE ON RECURSION

Consider the following lightly edited extract from Exercise 5.16 in Chapter 5:

The well known bill of materials application involves a relvar—PP, say—showing which parts contain which parts as immediate components. Of course, immediate components are themselves parts, and they can have further immediate components of their own.

Fig. 7.5 shows (a) a sample relation value for that relvar PP and (b) the relation that’s the transitive closure of that sample relation value,²³ shown as the corresponding value of a relvar TC. The predicates are as follows:

PP:	Part PX contains part PY as an immediate component.
TC:	Part PX contains part PY as a component at some level, but not necessarily as an immediate component.

For example, if part P1 contains part P2 as a component and part P2 contains part P4 as a component, then certainly part P1 contains part P4 as a component “at some level.”

Fig. 7.5: Relvars PP and TC (sample values)

Given a relation value pp for relvar PP, the relation value tc that’s the transitive closure of pp can be defined as follows (observe that the definition involves a recursive reference to tc):

Definition: The pair (px,py) appears in tc if and only if:

It appears in pp, or
There exists some pz such that the pair (px,pz) appears in pp and the pair (pz,py) appears in tc.

In other words, if we think of pp as representing a directed graph, with a node for each part and an edge from each node to each corresponding immediate component node, then (px,py) appears in the transitive closure if and only if there’s a path in that graph from node px to node py.

Aside: In practice relvar PP would probably have a QTY attribute as well, showing how many instances of the immediate component part PY are needed to make one instance of part PX, and we would probably want to compute not just the transitive closure as such, but also the total number of instances of part PY needed to make one instance of part PX: the gross requirements problem. I ignore this refinement here for simplicity. End of aside.

It’s also possible to define the transitive closure by means of an iterative procedure:

tc := pp ;
do until tc reaches a "fixpoint" ;
   WITH ( t1 := pp RENAME { PY AS PZ } ,
          t2 := tc RENAME { PX AS PZ } ,
          t3 := ( t1 JOIN t2 ) { PX , PY } ) :
   tc := tc UNION t3 ;
end ;

Loosely speaking, this code works by repeatedly forming an intermediate result consisting of the union of (a) the previous intermediate result and (b) a relation computed on the current iteration. The process is repeated until that intermediate result reaches a fixpoint (i.e., until it ceases to grow). Note: It’s easy to see the code is very inefficient!—in effect, each iteration repeats the entire computation of the previous one. In fact, it’s little more than a direct implementation of the original (recursive) definition. However, it could clearly be made more efficient if desired. Similar remarks apply to all of the code samples in the present section.

Turning now to Tutorial D, we could define a recursive operator (TCLOSE) to compute the transitive closure as follows:²⁴

OPERATOR TCLOSE ( XY RELATION { X ... , Y ... } )
             RETURNS RELATION { X ... , Y ... } ;
   RETURN ( WITH ( t1 := XY RENAME { Y AS Z } ,
                   t2 := XY RENAME { X AS Z } ,
                   t3 := ( t1 JOIN t2 ) { X , Y } ,
                   t4 := XY UNION t3 ) :
            IF t4 = XY THEN t4     /* unwind recursion      */
               ELSE TCLOSE ( t4 )  /* recursive invocation  */
            END IF ) ;
END OPERATOR ;

Now, e.g., the expression TCLOSE(pp) will return the transitive closure of pp. Hence, for example, the expression

( TCLOSE ( PP ) WHERE PX = 'P1') { PY }

will give the “bill of materials” for part P1, and the expression

( TCLOSE ( PP ) WHERE PY = 'P6') { PX }

will give the “where used” list for part P6. Note: Computing the bill of materials for a given part is sometimes referred to as part explosion; likewise, computing the “where used” list for a given part is referred to as part implosion.

Now, SQL too supports what it calls “recursive queries.” Here’s an SQL expression to compute the transitive closure of PP:

WITH RECURSIVE TC ( PX , PY ) AS
   ( SELECT PP.PX , PP.PY
     FROM   PP
     UNION  CORRESPONDING
     SELECT PP.PX , TC.PY
     FROM   PP , TC
     WHERE  PP.PY = TC.PX )

SELECT PX , PY
FROM   TC

As you can see, this expression too is very close to being a direct transliteration of the original recursive definition.

Note: This book deliberately has very little to say about commercial SQL products. However, I’d like to offer a brief remark here regarding Oracle specifically. As you might know, Oracle has had some recursive query support for many years. By way of example, the query “Explode part P1” can be expressed in Oracle as follows:

SELECT  LEVEL , PY
FROM    PP
CONNECT BY PX = PY
START   WITH PX = 'P1'

I don’t want to explain in detail how this expression is evaluated—but I do want to show the result it produces, given the sample data of Fig. 7.5. Here it is:

┌───────┬────┐
│ LEVEL │ PY │
├───────┼────┤
│     1 │ P2 │
│     2 │ P4 │
│     3 │ P5 │
│     4 │ P6 │
│     1 │ P3 │
│     2 │ P4 │
│     3 │ P5 │
│     4 │ P6 │
└───────┴────┘

Note carefully that this result isn’t a relation (and the relational closure property has thereby been violated). First of all, it contains some duplicate rows; for example, the row (2,P4) appears twice. More important, those duplicate rows are not duplicate rows as we usually understand them in SQL; that is, they aren’t just “saying the same thing twice,” as I put it in Chapter 4. To spell the point out, one of those two (2,P4) rows reflects the path in the graph from part P1 to part P4 via part P2; the other reflects the path in the graph from part P1 to part P4 via part P3. Thus, if we deleted one of those rows, we would lose information.

Aside: Actually the same kind of problem can arise in the SQL standard if the recursive query in question uses UNION ALL instead of UNION DISTINCT—as in practice such queries very typically do. Further details are beyond the scope of this book; however, if you try to code the gross requirements problem in SQL you might see for yourself why it’s tempting, at least superficially, to use UNION ALL. End of aside.

Note too that in addition to the foregoing violations, the ordering of the rows in the Oracle result is significant as well. For example, the reason we know the first (2,P4) row corresponds to the path from P1 to P4 via P2 specifically is because it immediately follows the row corresponding to the path from P1 to its immediate component P2. Thus, if we reordered the rows, again we would lose information.

Cycles

Consider Fig. 7.5 once again. Suppose the relation pp shown as a value for relvar PP in that figure additionally contained a tuple representing, say, the pair (P5,P1). Then there would be a cycle in the data (actually two cycles, one involving parts P1-P2-P4-P5-P1 and one involving parts P1-P3-P4-P5-P1). In the case of bill of materials, such cycles should presumably not be allowed to occur, since they make no sense. Sometimes, however, they do make sense; the classic example is a transportation network, in which there are routes from, say, New York (JFK) to London (LHR), London to Paris (CDG), and Paris back to New York again (as well as routes in all of the reverse directions, of course).

Now, the existence of a cycle in the data has no effect on the transitive closure as such. However, it does have the potential to cause an infinite loop in certain kinds of processing. For example, a query to find travel routes from New York to Paris might—if we’re not careful— produce results as follows:

JFK - LHR - CDG
JFK - LHR - JFK - LHR - CDG
JFK - LHR - JFK - LHR - JFK - LHR - CDG
etc., etc. etc.

Of course, it might at least be possible to formulate the query in such a way as to exclude segments in which the destination city is JFK (since we certainly don’t want a route that takes us back to where we started). But even this trick will still allow routes such as:

JFK - ORD - LHR - ORD - LHR - ORD - LHR - ... - CDG

(ORD = Chicago). Moreover, it still won’t prevent an infinite loop. Now, we might prevent the infinite loop as such by rejecting routes involving, say, more than four segments; but under such a scheme we could still get, e.g., the route JFK-ORD-LHR-ORD-CDG. Clearly, what we need is a more general mechanism that will allow the query to recognize when a given node in the graph has previously been visited. And SQL does in fact include a feature, the CYCLE clause, that can be used in recursive queries to achieve such an effect. The specifics are a little complicated, and I don’t want to get into details here; suffice it to say that the CYCLE clause provides a means of tagging nodes (i.e., rows) as they’re visited, and then stopping the recursion if a tagged node is subsequently encountered again. For more details, I refer you to the standard document itself.

WHAT ABOUT ORDER BY?

The last topic I want to address in this chapter is ORDER BY (just ORDER, in Tutorial D). Now, despite the title of this chapter, ORDER BY isn’t actually part of the relational algebra; in fact, as I said in Chapter 1, it isn’t a relational operator at all, because it produces a result that isn’t a relation (it does take a relation as input, but it produces something else—namely, a sequence of tuples—as output). Note: Please don’t misunderstand me here. I’m not saying ORDER BY isn’t useful. However, I am saying it can’t sensibly appear in a relational expression²⁵ (unless it’s treated simply as a “no op,” I suppose). By definition, therefore, the following expressions, though legal, aren’t relational expressions as such:

S MATCHING SP ORDER ( ASC SNO )   │   SELECT DISTINCT S.*
                                  │   FROM   S , SP
                                  │   WHERE  S.SNO = SP.SNO
                                  │   ORDER  BY SNO ASC

That said, I’d like to point out that for a couple of reasons ORDER BY is actually a rather strange operator. First, it effectively works by sorting tuples into some specified sequence—and yet “<” and “>” aren’t defined for tuples, as we know from Chapter 3.²⁶ Second, it’s not a function. All of the operators of the relational algebra described in this book—in fact, read-only operators in general, as that term is usually understood—are functions, meaning there’s always just one possible output for any given input. By contrast, ORDER BY can produce several different outputs from the same input. As an illustration of this point, consider the effect of the operation ORDER BY CITY on our usual suppliers relation. Clearly, this operation can return any of four distinct results, corresponding to the following sequences (I’ll show just the supplier numbers, for simplicity):

S5 , S1 , S4 , S2 , S3
S5 , S4 , S1 , S2 , S3
S5 , S1 , S4 , S3 , S2
S5 , S4 , S1 , S3 , S2

Note: It would be remiss of me not to mention in passing that although the operators of the relational algebra described in this book are indeed functions, most of them have counterparts in SQL that aren’t. This state of affairs is due to the fact that, as explained in Chapter 2, SQL sometimes defines the result of the comparison v1 = v2 to be TRUE even when v1 and v2 are distinct. For example, consider the character strings 'Paris' and 'Paris ', respectively (note the trailing space in the second of these); these values are clearly distinct, and yet SQL sometimes regards them as equal. As explained in Chapter 2, therefore, certain SQL expressions are “possibly nondeterministic.” Here’s a simple example:

SELECT DISTINCT CITY FROM S

If one supplier has CITY 'Paris' and another 'Paris ', then the result will include either 'Paris' or 'Paris ' (or possibly both), but which result we get might not be defined; indeed, it isn’t defined, at least as far as the standard is concerned. We could even legitimately get one result on one day and another on another, even if the database hasn’t changed at all in the interim. You might like to meditate on the implications of this state of affairs.

EXERCISES

7.1 For each of the following Tutorial D expressions on the suppliers-and-parts database, give both (a) an SQL analog and (b) an informal interpretation of the expression (i.e., a corresponding predicate) in natural language. Also show the result of evaluating the expression, given our usual sample values for relvars S, P, and SP.

S MATCHING ( SP WHERE PNO = 'P2' )
S NOT MATCHING ( SP WHERE PNO = 'P2' )
P WHERE ( !!SP ) { SNO } = S { SNO }
P WHERE SUM ( !!SP , QTY ) < 500
P WHERE TUPLE { CITY CITY } ∊ S { CITY }
EXTEND S : { TAG := 'Supplier' }
EXTEND ( S MATCHING ( SP WHERE PNO = 'P2' ) ) :
{ TRIPLE_STATUS := 3 * STATUS }
EXTEND ( P JOIN SP ) : { SHIPWT := WEIGHT * QTY }
EXTEND P : { GMWT := WEIGHT * 454 , OZWT := WEIGHT * 16 }
EXTEND P : { SCT := COUNT ( !!SP ) }
EXTEND S : { NP := COUNT ( ( SP RENAME { SNO AS X } ) WHERE X = SNO ) }
SUMMARIZE S BY { CITY } : { SUM_STATUS := SUM ( STATUS ) }
SUMMARIZE ( S WHERE CITY = 'London' ) PER ( TABLE_DEE ) :

Note: Tutorial D allows the PER specification to be omitted from a SUMMARIZE, in which case PER (TABLE_DEE)—equivalently, BY { }—is assumed by default. The foregoing SUMMARIZE could therefore be simplified slightly, thus:

SUMMARIZE ( S WHERE CITY = 'London' ) : { N := COUNT ( SNO ) }
EXTEND SP WHERE SNO = 'S1' : { SNO := 'S7' , QTY := 0.5 * QTY }

7.2 In what circumstances if any are r1 MATCHING r2 and r2 MATCHING r1 equivalent?

7.3 Show that RENAME isn’t primitive.

7.4 Give an expression involving EXTEND instead of SUMMARIZE that’s logically equivalent to the following:

SUMMARIZE SP PER ( S { SNO } ) : { NP := COUNT ( PNO ) }

7.5 Consider the following Tutorial D expressions. Which if any are equivalent to which of the others? Show an SQL analog in each case.

SUMMARIZE r PER ( r { } ) : { CT := SUM ( 1 ) }
SUMMARIZE r PER ( TABLE_DEE ) : { CT := SUM ( 1 ) }
SUMMARIZE r BY { } : { CT := SUM ( 1 ) }
EXTEND TABLE_DEE : { CT := COUNT ( r ) }

7.6 Consider the relational aggregate operators UNION and INTERSECT. What do you think these operators should return if their argument (a set or bag of relations) happens to be empty?

7.7 Let relation R4 in Fig. 7.1 denote the current value of some relvar. If R4 is as described in Chapter 2, what’s the predicate for that relvar?

7.8 Let r be the relation denoted by the following Tutorial D expression:

SP GROUP { } AS X

What does r look like, given our usual sample value for SP? Also, what does the following expression yield?

r UNGROUP X

7.9 Write Tutorial D and/or SQL expressions for the following queries on the suppliers-and-parts database:

Get the total number of parts supplied by supplier S1.
Get supplier numbers for suppliers whose city is first in the alphabetic list of such cities.
Get city names for cities in which at least two suppliers are located.
Get city names for cities in which at least one supplier or part is located, but not both.
Get part numbers for parts supplied by all suppliers in London.
Get suppliers who supply at least all parts supplied by supplier S2.

7.10 Let relation pp be as defined in the section “A Note on Recursion” and let TCLOSE be the transitive closure operator. What does the expression TCLOSE(TCLOSE(pp)) denote?

7.11 Given our usual sample values for the suppliers-and-parts database, what does the following Tutorial D expression denote?

EXTEND S : { PNO_REL := ( !!SP ) { PNO } }

7.12 Let the relation returned by the expression in the previous exercise be kept as a relvar called SSP. What do the following updates do?

INSERT SSP RELATION
     { TUPLE { SNO 'S6' , SNAME 'Lopez' , STATUS 30 , CITY 'Madrid' ,
               PNO_REL RELATION { TUPLE { PNO 'P5' } } } } ;

UPDATE SSP WHERE SNO = 'S2' :
     { INSERT PNO_REL RELATION { TUPLE { PNO 'P5' } } } ;

7.13 Using relvar SSP from the previous exercise, write expressions for the following queries:

Get pairs of supplier numbers for suppliers who supply exactly the same set of parts.
Get pairs of part numbers for parts supplied by exactly the same set of suppliers.

7.14 A quota query is a query that specifies a desired limit, or quota, on the cardinality of the result: for example, the query “Get the two heaviest parts,” for which the quota is two. Give Tutorial D and SQL formulations of this query. Given our usual data values, what exactly do these formulations return?

7.15 Using the Tutorial D SUMMARIZE operator, how would you deal with the query “For each supplier, get the supplier number and the sum of distinct shipment quantities for shipments by that supplier”?

7.16 Given a revised version of the suppliers-and-parts database that looks like this—

S   { SNO }        /* suppliers                 */
SP  { SNO , PNO }  /* supplier supplies part    */
SJ  { SNO , JNO }  /* supplier supplies project */

—give Tutorial D and SQL formulations of the query “For each supplier, get supplier details, the number of parts supplied by that supplier, and the number of projects supplied by that supplier.” For Tutorial D, give both EXTEND and SUMMARIZE formulations.

7.17 What does the following Tutorial D expression mean?

S WHERE ( !!(!!SP) ) { PNO } = P { PNO }

7.18 Is there a logical difference between the following two Tutorial D expressions? If so, what is it?

EXTEND TABLE_DEE : { NSP := COUNT ( SP ) }

EXTEND TABLE_DEE : { NSP := COUNT ( !!SP ) }

7.19 Give an example of a join that’s not a semijoin and a semijoin that’s not a join. When exactly are the expressions r1 JOIN r2 and r1 MATCHING r2 equivalent?

7.20 Let relations r1 and r2 be of the same type, and let t1 be a tuple in r1. For that tuple t1, then, what exactly does the expression !!r2 denote? And what happens if r1 and r2 aren’t just of the same type but are in fact the very same relation?

7.21 What’s the logical difference, if any, between the following SQL expressions?

SELECT COUNT ( * ) FROM S

SELECT SUM ( 1 ) FROM S

7.22 By definition, ORDER BY (or just ORDER, in Tutorial D) can’t appear in a relational expression (or table expression, rather, in SQL). So where can it appear?

7.23 Why don’t the tables shown on pages 231 and 249 (Fig. 7.3) have any doubly underlined columns?

ANSWERS

Here first are answers to certain exercises that were stated inline in the body of the chapter. In one, we were given relvars as follows—

S   { SNO }        /* suppliers               */
SP  { SNO , PNO }  /* supplier supplies part  */
PJ  { PNO , JNO }  /* part is used in project */
J   { JNO }        /* projects                */

—and we were asked for a SQL formulation of the query “Get all (sno,jno) pairs such that sno appears in S, jno appears in J, and supplier sno supplies all parts used in project jno.” A possible formulation is as follows:

SELECT SX.SNO , JX.JNO
FROM   S AS SX , J AS JX
WHERE  NOT EXISTS
     ( SELECT *
       FROM   P AS PX
       WHERE  EXISTS
            ( SELECT *
              FROM   PJ AS PJX
              WHERE  PJX.PNO = PX.PNO
              AND    PJX.JNO = JX.JNO )
       AND    NOT EXISTS
            ( SELECT *
              FROM   SP AS SPX
              WHERE  SPX.PNO = PX.PNO
              AND    SPX.SNO = SX.SNO ) )

Note: For a detailed discussion of how to tackle complicated SQL queries like this one, see Chapter 11.

Another inline exercise asked what happens if (a) r1 and r2 are relations with no attribute names in common, (b) r2 is empty, (c) we form the product r1 TIMES r2, and finally (d) we divide that product by r2. Answer: It should be clear that the product is empty, and hence the final result is empty too (it has the same heading as r1, but of course it isn’t equal to r1, in general). Do note, however, that dividing by an empty relation isn’t an error (it’s not like dividing by zero in arithmetic). See the answer to Exercise 6.2 in Chapter 2.

In another inline exercise, we were given this expression—

SELECT SNO , SUM ( QTY ) AS TOTQ
FROM SP
GROUP BY SNO
HAVING SUM ( QTY ) > 250

—as an SQL formulation of the query “For each supplier, get the supplier number and corresponding total shipment quantity, but only where that total quantity is greater than 250.” Then we were told that (a) if we wanted that total quantity to be less than, instead of greater than, 250, and so (b) we replaced that “>” by “<” in the last line, then (c) the resulting expression would not do the trick. But why not? Answer: Consider supplier S5, who (given our usual sample data) currently supplies no parts at all. The total quantity for that supplier is zero, and so supplier S5 should be represented in the result. In SQL, however, the total quantity for such a supplier is considered to be null, not zero; the comparison in the HAVING clause will therefore evaluate to UNKNOWN, not TRUE, and so that supplier won’t be represented in the result after all. Note: Of course, the obvious fix for this problem is to replace SUM (QTY) in the HAVING clause by COALESCE (SUM (QTY), 0). Subsidiary exercise: Do we also need to apply that same fix in the SELECT clause? If not, why not?

7.1 Throughout these answers, I show SQL expressions that aren’t necessarily direct transliterations of their Tutorial D (i.e., algebraic) counterparts but are, rather, “more natural” formulations of the query in SQL terms. The solutions aren’t necessarily unique. Note: This latter remark applies to many of the code solutions throughout the rest of this book, and I won’t bother to make it again.

SQL analog:

SELECT *
FROM   S
WHERE  SNO IN
     ( SELECT SNO
       FROM   SP
       WHERE  PNO = 'P2' )

Predicate: Supplier SNO is under contract, is named SNAME, has status STATUS, is located in city CITY, and supplies part P2.

┌─────┬───────┬────────┬────────┐
│ SNO │ SNAME │ STATUS │ CITY   │
├═════┼───────┼────────┼────────┤
│ S1  │ Smith │     20 │ London │
│ S2  │ Jones │     10 │ Paris  │
│ S3  │ Blake │     30 │ Paris  │
│ S4  │ Clark │     20 │ London │
└─────┴───────┴────────┴────────┘
SQL analog:

SELECT *
FROM   S
WHERE  SNO NOT IN
     ( SELECT SNO
       FROM   SP
       WHERE  PNO = 'P2' )

Predicate: Supplier SNO is under contract, is named SNAME, has status STATUS, is located in city CITY, and doesn’t supply part P2.

┌─────┬───────┬────────┬────────┐
│ SNO │ SNAME │ STATUS │ CITY   │
├═════┼───────┼────────┼────────┤
│ S5  │ Adams │     30 │ Athens │
└─────┴───────┴────────┴────────┘
SQL analog:

SELECT *
FROM   P AS PX
WHERE  NOT EXISTS
     ( SELECT *
       FROM   S AS SX
       WHERE  NOT EXISTS
            ( SELECT *
              FROM   SP AS SPX
              WHERE  SPX.SNO = SX.SNO
              AND    SPX.PNO = PX.PNO ) )

Predicate: Part PNO is used in the enterprise, is named PNAME, has color COLOR and weight WEIGHT, is stored in city CITY, and is supplied by all suppliers.

┌─────┬───────┬───────┬────────┬────────┐
│ PNO │ PNAME │ COLOR │ WEIGHT │ CITY   │
├═════┼───────┼───────┼────────┼────────┤
└─────┴───────┴───────┴────────┴────────┘
SQL analog:

SELECT *
FROM   P
WHERE  ( SELECT COALESCE ( SUM ( QTY ) , 0 )
         FROM   SP
         WHERE  SP.PNO = P.PNO ) < 500

Predicate: Part PNO is used in the enterprise, is named PNAME, has color COLOR and weight WEIGHT, is stored in city CITY, and is supplied in a total quantity, taken over all suppliers, that’s less than 500.

┌─────┬───────┬───────┬────────┬────────┐
│ PNO │ PNAME │ COLOR │ WEIGHT │ CITY   │
├═════┼───────┼───────┼────────┼────────┤
│ P3  │ Screw │ Blue  │   17.0 │ Oslo   │
│ P6  │ Cog   │ Red   │   19.0 │ London │
└─────┴───────┴───────┴────────┴────────┘
SQL analog:

SELECT *
FROM   P
WHERE  CITY IN
     ( SELECT CITY
       FROM   S )

Predicate: Part PNO is used in the enterprise, is named PNAME, has color COLOR and weight WEIGHT, is stored in city CITY, and is located in the same city as some supplier.

┌─────┬───────┬───────┬────────┬────────┐
│ PNO │ PNAME │ COLOR │ WEIGHT │ CITY   │
├═════┼───────┼───────┼────────┼────────┤
│ P1  │ Nut   │ Red   │   12.0 │ London │
│ P2  │ Bolt  │ Green │   17.0 │ Paris  │
│ P4  │ Screw │ Red   │   14.0 │ London │
│ P5  │ Cam   │ Blue  │   12.0 │ Paris  │
│ P6  │ Cog   │ Red   │   19.0 │ London │
└─────┴───────┴───────┴────────┴────────┘
SQL analog:

SELECT S.* , 'Supplier' AS TAG
FROM   S

Predicate: Supplier SNO is under contract, is named SNAME, has status STATUS, is located in city CITY, and has a TAG of ‘Supplier’.

┌─────┬───────┬────────┬────────┬──────────┐
│ SNO │ SNAME │ STATUS │ CITY   │ TAG      │
├═════┼───────┼────────┼────────┼──────────┤
│ S1  │ Smith │     20 │ London │ Supplier │
│ S2  │ Jones │     10 │ Paris  │ Supplier │
│ S3  │ Blake │     30 │ Paris  │ Supplier │
│ S4  │ Clark │     20 │ London │ Supplier │
│ S5  │ Adams │     30 │ Athens │ Supplier │
└─────┴───────┴────────┴────────┴──────────┘
SQL analog:

SELECT DISTINCT SNO , S.* , 3 * STATUS AS TRIPLE_STATUS
FROM   S NATURAL JOIN SP
WHERE  PNO = 'P2'

Predicate: Supplier SNO is under contract, is named SNAME, has status STATUS, is located in city CITY, supplies part P2, and has TRIPLE_STATUS equal to three times the value of STATUS.

┌─────┬───────┬────────┬────────┬───────────────┐
│ SNO │ SNAME │ STATUS │ CITY   │ TRIPLE_STATUS │
├═════┼───────┼────────┼────────┼───────────────┤
│ S1  │ Smith │     20 │ London │            60 │
│ S2  │ Jones │     10 │ Paris  │            30 │
│ S3  │ Blake │     30 │ Paris  │            90 │
│ S4  │ Clark │     20 │ London │            60 │
└─────┴───────┴────────┴────────┴───────────────┘
SQL analog:

SELECT PNO , PNAME, COLOR , WEIGHT , CITY , SNO , QTY
                            WEIGHT * QTY AS SHIPWT
FROM   P NATURAL JOIN SP

Predicate: Part PNO is used in the enterprise, is named PNAME, has color COLOR and weight WEIGHT, is stored in city CITY, is supplied by supplier SNO in quantity QTY, and that shipment (of PNO by SNO) has total weight SHIPWT equal to WEIGHT times QTY.

┌─────┬───────┬───────┬────────┬────────┬─────┬─────┬────────┐
│ PNO │ PNAME │ COLOR │ WEIGHT │ CITY   │ SNO │ QTY │ SHIPWT │
├═════┼───────┼───────┼────────┼────────┼═════┼─────┼────────┤
│ P1  │ Nut   │ Red   │   12.0 │ London │ S1  │ 300 │ 3600.0 │
│ P1  │ Nut   │ Red   │   12.0 │ London │ S2  │ 300 │ 3600.0 │
│ P2  │ Bolt  │ Green │   17.0 │ Paris  │ S1  │ 200 │ 3400.0 │
│ P2  │ Bolt  │ Green │   17.0 │ Paris  │ S2  │ 400 │ 6800.0 │
│ P2  │ Bolt  │ Green │   17.0 │ Paris  │ S3  │ 200 │ 3400.0 │
│ P2  │ Bolt  │ Green │   17.0 │ Paris  │ S4  │ 200 │ 3400.0 │
│ P3  │ Screw │ Blue  │   17.0 │ Oslo   │ S1  │ 400 │ 6800.0 │
│ P4  │ Screw │ Red   │   14.0 │ London │ S1  │ 200 │ 2800.0 │
│ P4  │ Screw │ Red   │   14.0 │ London │ S4  │ 300 │ 4200.0 │
│ P5  │ Cam   │ Blue  │   12.0 │ Paris  │ S1  │ 100 │ 1200.0 │
│ P5  │ Cam   │ Blue  │   12.0 │ Paris  │ S4  │ 400 │ 4800.0 │
│ P6  │ Cog   │ Red   │   19.0 │ London │ S1  │ 100 │ 1900.0 │
└─────┴───────┴───────┴────────┴────────┴─────┴─────┴────────┘
SQL analog:

SELECT P.* , WEIGHT * 454 AS GMWT , WEIGHT * 16 AS OZWT
FROM   P

Predicate: Part PNO is used in the enterprise, is named PNAME, has color COLOR, weight WEIGHT, weight in grams GMWT (= 454 times WEIGHT), and weight in ounces OZWT (= 16 times WEIGHT).

┌─────┬───────┬───────┬────────┬────────┬────────┬───────┐
│ PNO │ PNAME │ COLOR │ WEIGHT │ CITY   │ GMWT   │ OZWT  │
├═════┼───────┼───────┼────────┼────────┼────────┼───────┤
│ P1  │ Nut   │ Red   │   12.0 │ London │ 5448.0 │ 192.0 │
│ P2  │ Bolt  │ Green │   17.0 │ Paris  │ 7718.0 │ 204.0 │
│ P3  │ Screw │ Blue  │   17.0 │ Oslo   │ 7718.0 │ 204.0 │
│ P4  │ Screw │ Red   │   14.0 │ London │ 6356.0 │ 168.0 │
│ P5  │ Cam   │ Blue  │   12.0 │ Paris  │ 5448.0 │ 192.0 │
│ P6  │ Cog   │ Red   │   19.0 │ London │ 8626.0 │ 228.0 │
└─────┴───────┴───────┴────────┴────────┴────────┴───────┘
SQL analog:

SELECT P.* , ( SELECT COUNT ( SNO )
               FROM   SP
               WHERE  SP.PNO = P.PNO ) AS SCT
FROM   P

Predicate: Part PNO is used in the enterprise, is named PNAME, has color COLOR, weight WEIGHT, and city CITY, and is supplied by SCT suppliers.

┌─────┬───────┬───────┬────────┬────────┬─────┐
│ PNO │ PNAME │ COLOR │ WEIGHT │ CITY   │ SCT │
├═════┼───────┼───────┼────────┼────────┼─────┤
│ P1  │ Nut   │ Red   │   12.0 │ London │   2 │
│ P2  │ Bolt  │ Green │   17.0 │ Paris  │   4 │
│ P3  │ Screw │ Blue  │   17.0 │ Oslo   │   1 │
│ P4  │ Screw │ Red   │   14.0 │ London │   2 │
│ P5  │ Cam   │ Blue  │   12.0 │ Paris  │   2 │
│ P6  │ Cog   │ Red   │   19.0 │ London │   1 │
└─────┴───────┴───────┴────────┴────────┴─────┘
SQL analog:

SELECT S.* , ( SELECT COUNT ( PNO )
               FROM   SP
               WHERE  SP.SNO = S.SNO ) AS NP
FROM   S

Predicate: Supplier SNO is under contract, is named SNAME, has status STATUS, is located in city CITY, and supplies NP parts.

┌─────┬───────┬────────┬────────┬────┐
│ SNO │ SNAME │ STATUS │ CITY   │ NP │
├═════┼───────┼────────┼────────┼────┤
│ S1  │ Smith │     20 │ London │  6 │
│ S2  │ Jones │     10 │ Paris  │  2 │
│ S3  │ Blake │     30 │ Paris  │  1 │
│ S4  │ Clark │     20 │ London │  3 │
│ S5  │ Adams │     30 │ Athens │  0 │
└─────┴───────┴────────┴────────┴────┘
SQL analog:

SELECT CITY , SUM ( STATUS ) AS SUM_STATUS
FROM   S
GROUP  BY CITY

Predicate: The sum of status values for suppliers in city CITY is SUM_STATUS.

┌────────┬────────────┐
│ CITY   │ SUM_STATUS │
├════════┼────────────┤
│ London │         40 │
│ Paris  │         40 │
│ Athens │         30 │
└────────┴────────────┘
SQL analog:

SELECT COUNT ( SNO ) AS N
FROM S
WHERE CITY = 'London'

Predicate: There are N suppliers in London.

┌───┐
│ N │
├───┤
│ 2 │
└───┘

The lack of double underlining here is not a mistake (see the answer to Exercise 7.23).
SQL analog:

SELECT 'S7' AS SNO , PNO , QTY * 0.5 AS QTY
FROM   SP
WHERE  SNO = 'S1'

Predicate: SNO is S7 and supplier S1 supplies part PNO in quantity twice QTY.

┌─────┬─────┬─────┐
│ SNO │ PNO │ QTY │
├═════┼═════┼─────┤
│ S7  │ P1  │ 150 │
│ S7  │ P2  │ 100 │
│ S7  │ P3  │ 200 │
│ S7  │ P4  │ 100 │
│ S7  │ P5  │  50 │
│ S7  │ P6  │  50 │
└─────┴─────┴─────┘

7.2 The expressions r1 MATCHING r2 and r2 MATCHING r1 are equivalent if and only if r1 and r2 are of the same type, in which case both expressions reduce to just r1 JOIN r2 (and this latter expression reduces in turn to r1 INTERSECTr2).

7.3 RENAME isn’t primitive because (for example) the expressions

S RENAME { CITY AS SCITY }

and

( EXTEND S : { SCITY := CITY } ) { ALL BUT CITY }

are equivalent. Note: Possible appearances to the contrary notwithstanding, EXTEND isn’t primitive either—it can be defined in terms of join (at least in principle), as is shown in the book Databases, Types, and the Relational Model: The Third Manifesto, by Hugh Darwen and myself (see Appendix G).

7.4 EXTEND S { SNO } : { NP := COUNT ( !!SP ) }

7.5 You can determine which of the expressions are equivalent to which by inspecting the following results of evaluating them. Note that the “summary” SUM(1), evaluated over n tuples, is equal to n. (Even if n is zero! SQL, of course, would say the result is null in such a case.)

           ┌────┐                        ┌────┐
r empty:   │ CT │     r has n tuples     │ CT │
           ├────┤     (n > 0):           ├────┤
           └────┘                        │  n │
                                         └────┘
           ┌────┐                        ┌────┐
r empty:   │ CT │     r has n tuples     │ CT │
           ├────┤     (n > 0):           ├────┤
           │  0 │                        │  n │
           └────┘                        └────┘
           ┌────┐                        ┌────┐
r empty:   │ CT │     r has n tuples     │ CT │
           ├────┤     (n > 0):           ├────┤
           └────┘                        │  n │
                                         └────┘
           ┌────┐                        ┌────┐
r empty:   │ CT │     r has n tuples     │ CT │
           ├────┤     (n > 0):           ├────┤
           │  0 │                        │  n │
           └────┘                        └────┘

In other words, the result is a relation of degree one in every case. If r is nonempty, all four expressions are equivalent; otherwise a. and c. are equivalent, as are b. and d., but a. and b. aren’t. SQL analogs:

SELECT COUNT ( * ) AS CT
FROM r
EXCEPT CORRESPONDING
SELECT 0 AS CT
FROM r
SELECT COUNT ( * ) AS CT
FROM r
Same as a.
Same as b.

7.6 They return, respectively, the empty relation and the universal relation (of the applicable type in each case). Note: The universal relation of type RELATION H is the relation of that type that contains all possible tuples of type TUPLE H. The implementation might reasonably want to outlaw invocations of INTERSECT on an empty argument (at least if those invocations really need the result to be materialized).

Just to remind you, SQL’s analogs of the UNION and INTERSECT aggregate operators are called FUSION and INTERSECTION, respectively. If their arguments are empty, they both return null. Otherwise, they return a result as follows (these are direct quotes from the standard; T is the table over which the aggregation is being done). First FUSION:

[The] result is the multiset M such that for each value V in the element type, including the null value [sic], the number of elements of M that are identical to V is the sum of the number of identical copies of V in the multisets that are the values of the column in each row of T.

(The “element type” is the type of the elements of the multisets in the argument column.) Now INTERSECTION:

[The] result is a multiset M such that for each value V in the element type, including the null value, the number of duplicates of V in M is the minimum of the number of identical copies of V in the multisets that are the values of the column in each row of T.

Note the asymmetry, incidentally: In SQL, INTERSECTION (and INTERSECT) are defined in terms of MIN, but FUSION (and UNION) are defined in terms not of MAX but of SUM (?).

7.7 The predicate can be stated in many different ways, of course. Here’s one reasonably straightforward formulation: Supplier SNO supplies part PNO if and only if part PNO is mentioned in relation PNO_REL. That “and only if” is important, by the way (right?).

7.8 Relation r has the same cardinality as SP and the same heading, except that it has one additional attribute, X, which is relation valued. The relations that are values of X have degree zero; furthermore, each is TABLE_DEE, not TABLE_DUM, because every tuple sp in SP effectively includes the 0-tuple as its value for that subtuple of sp that corresponds to the empty set of attributes. Thus, each tuple in r effectively consists of the corresponding tuple from SP extended with the X value TABLE_DEE, and thus the original GROUP expression is logically equivalent to the following:

EXTEND SP : { X := TABLE_DEE }

The expression r UNGROUP X yields the original SP relation again.

7.9 Tutorial D on the left, SQL on the right as usual:

N := COUNT ( SP               │   SET N = ( SELECT COUNT ( * )
     WHERE SNO = 'S1' ) ;     │             FROM   SP
                              │             WHERE  SNO = 'S1' ) ;
( S WHERE CITY =              │   SELECT SNO
  MIN ( S , CITY ) ) { SNO }  │   FROM   S
                              │   WHERE  CITY =
                              │        ( SELECT MIN ( CITY )
                              │          FROM   S )
S { CITY }                │   SELECT DISTINCT CITY
WHERE COUNT ( !!S ) > 1   │   FROM   S AS SX
                          │   WHERE  ( SELECT COUNT ( * )
                          │            FROM   S AS SY
                          │            WHERE  SY.CITY = SX.CITY ) > 1
S { CITY } XUNION         │   SELECT CITY
  XUNION P { CITY }       │   FROM   S
                          │   WHERE  CITY NOT IN ( SELECT CITY
                          │                        FROM   P )
                          │   UNION  CORRESPONDING
                          │   SELECT CITY
                          │   FROM   P
                          │   WHERE  CITY NOT IN ( SELECT CITY
                          │                        FROM   S )
( P WHERE ( !!SP ) { SNO } ⊇           │   SELECT PNO
( S WHERE CITY = 'London' ) { SNO } )  │   FROM   P
                              { PNO }  │   WHERE  NOT EXISTS (
                                       │   SELECT * FROM S
                                       │   WHERE  CITY = 'London'
                                       │   AND    NOT EXISTS (
                                       │   SELECT *
                                       │   FROM   SP
                                       │   WHERE  SP.SNO = S.SNO
                                       │   AND    SP.PNO = P.PNO ) )
S WHERE ( !!SP ) { PNO } ⊇             │ SELECT SNO
  ( SP WHERE SNO = 'S2' ) { PNO }      │ FROM   S
                                       │ WHERE  NOT EXISTS (
                                       │ SELECT *
                                       │ FROM   SP AS SPX
                                       │ WHERE  SNO = 'S2'
                                       │ AND    NOT EXISTS (
                                       │ SELECT *
                                       │ FROM   SP AS SPY
                                       │ WHERE  SPY.SNO = S.SNO
                                       │ AND    SPY.PNO = SPX.PNO ) )

7.10 It’s the same as TCLOSE (pp). In other words, transitive closure is idempotent. Note: I’m extending the definition of idempotence somewhat here. In Chapter 6, I said (in effect) that a dyadic operator Op is idempotent if and only if Op(x,x) = x for all x; now I’m saying (in effect) that a monadic operator Op is idempotent if and only if Op(Op(x)) = Op(x) for all x. (Actually, mathematics textbooks typically define idempotence as a concept that applies to just one of these two cases; but some define it for the monadic case only and others for the dyadic case only. However, both cases clearly make sense.)

7.11 It denotes a relation of type

RELATION { SNO CHAR , SNAME CHAR , STATUS INTEGER , CITY CHAR ,
PNO_REL RELATION { PNO CHAR } }

that looks like this (in outline):

┌─────┬───────┬────────┬────────┬─────────┐
│ SNO │ SNAME │ STATUS │ CITY   │ PNO_REL │
├═════┼───────┼────────┼────────┼─────────┤
│     │       │        │        │ ┌─────┐ │
│ S1  │ Smith │     20 │ London │ │ PNO │ │
│     │       │        │        │ ├═════┤ │
│     │       │        │        │ │ P1  │ │
│     │       │        │        │ │ P2  │ │
│     │       │        │        │   ..    │
│     │       │        │        │ │ P6  │ │
│     │       │        │        │ └─────┘ │
│     │       │        │        │ ┌─────┐ │
│ S2  │ Jones │     10 │ Paris  │ │ PNO │ │
│     │       │        │        │ ├═════┤ │
│     │       │        │        │ │ P1  │ │
│     │       │        │        │ │ P2  │ │
│     │       │        │        │ └─────┘ │
  ..    .....       ..   ......     ..
│     │       │        │        │ ┌─────┐ │
│ S5  │ Adams │     30 │ Athens │ │ PNO │ │
│     │       │        │        │ ├═════┤ │
│     │       │        │        │ └─────┘ │
└─────┴───────┴────────┴────────┴─────────┘

The expression is logically equivalent to the following:

( S JOIN SP { SNO , PNO } ) GROUP { PNO } AS PNO_REL

Attribute PNO_REL is an RVA. Note, incidentally, that if r is the foregoing relation, then the expression

( r UNGROUP PNO_REL ) { ALL BUT PNO }

will not return our usual suppliers relation. To be precise, it will return a relation that differs from our usual suppliers relation only in that it’ll have no tuple for supplier S5.

7.12 The first is straightforward: It inserts a new tuple, with supplier number S6, name Lopez, status 30, city Madrid, and PNO_REL value a relation containing just one tuple, containing in turn the PNO value P5. As for the second, I think it would be helpful to show, extracted from Appendix D, the Tutorial D grammar for <relation assign> (the names of the syntactic categories are meant to be self-explanatory):

<relation assign>
    ::=   <relvar name> := <relation exp>
        | <insert> | <d_insert> | <delete> | <i_delete> | <update>

<insert>
    ::=   INSERT <relvar name> <relation exp>

<d_insert>
    ::=   D_INSERT <relvar name> <relation exp>

<delete>
    ::=   DELETE <relvar name> <relation exp>
        | DELETE <relvar name> [ WHERE <boolean exp> ]

<i_delete>
    ::=   I_DELETE <relvar name> <relation exp>

<update>
    ::=   UPDATE <relvar name> [ WHERE <boolean exp> ] :
                               { <attribute assign commalist> }

And an <attribute assign>, if the attribute in question is relation valued, is basically just a <relation assign> (except that the pertinent <attribute name> appears in place of the target <relvar name> in that <relation assign>), and that’s where we came in. Thus, in the exercise, what the second update does is replace the tuple for supplier S2 by another in which the PNO_REL value additionally includes a tuple for part P5.

7.13 Query a. is easy:

WITH ( tx := ( SSP RENAME { SNO AS XNO } ) { XNO , PNO_REL } ,
ty := ( SSP RENAME { SNO AS YNO } ) { YNO , PNO_REL } ) :
( tx JOIN ty ) { XNO , YNO }

Note that the join here is being done on an RVA (and so is implicitly performing relational comparisons).

Query b., by contrast, is not so straightforward. Query a. was easy because SSP “nests parts within suppliers,” as it were; for Query b. we would really like to have suppliers nested within parts instead. So let’s do that:²⁷

WITH ( pps := ( SSP UNGROUP PNO_REL ) GROUP { SNO } AS SNO_REL ,
tx := ( pps RENAME { PNO AS XNO } ) { XNO , SNO_REL } ,
ty := ( pps RENAME { PNO AS YNO } ) { YNO , SNO_REL } ) :
( tx JOIN ty ) { XNO , YNO }

7.14

WITH ( t1 := P RENAME { WEIGHT AS WT } ,
       t2 := EXTEND P : { N_HEAVIER :=
                              COUNT ( t1 WHERE WT > WEIGHT ) } ) :
( t2 WHERE N_HEAVIER < 2 ) { ALL BUT N_HEAVIER }

SELECT *
FROM   P AS PX
WHERE ( SELECT COUNT ( * )
       FROM    P AS PY
       WHERE   PX.WEIGHT < PY.WEIGHT ) < 2

Both formulations return parts P2, P3, and P6 (i.e., a result of cardinality three, even though the specified quota was two). Quota queries can also return a result of cardinality less than the specified quota (e.g., consider the query “Get the ten heaviest parts”).

Note: Quota queries are quite common in practice. In our book Databases, Types, and the Relational Model: The Third Manifesto (see Appendix G), therefore, Hugh Darwen and I suggest a shorthand for expressing them, according to which the foregoing query might be expressed thus in Tutorial D:

( ( RANK P BY ( DESC WEIGHT AS W ) ) WHERE W ≤ 2 ) { ALL BUT W }

SQL has something similar.

7.15 This formulation does the trick:

SUMMARIZE SP { SNO , QTY } PER ( S { SNO } ) : { SDQ := SUM ( QTY ) }

But the following formulation, using EXTEND and image relations, is surely to be preferred:

EXTEND S { SNO } : { SDQ := SUM ( !!SP { QTY } ) }

Here for interest is an SQL analog:

SELECT SNO , ( SELECT COALESCE ( SUM ( DISTINCT QTY ) , 0 ) AS SDQ
FROM S

7.16

EXTEND S : { NP := COUNT ( !!SP ) , NJ := COUNT ( !!SJ ) }

JOIN { S , SUMMARIZE SP PER ( S { SNO } ) : { NP := COUNT ( PNO ) } ,
           SUMMARIZE SJ PER ( S { SNO } ) : { NJ := COUNT ( JNO ) } }

SELECT SNO , ( SELECT COUNT ( PNO )
               FROM   SP
               WHERE  SP.SNO = S.SNO ) AS NP ,
             ( SELECT COUNT ( JNO )
               FROM   SJ
               WHERE  SJ.SNO = S.SNO ) AS NJ
FROM   S

7.17 For a given supplier number, sno say, the expression !!SP denotes a relation with heading {PNO,QTY} and body consisting of those (pno,qty) pairs that correspond in SP to that supplier number sno. Call that relation ir (for image relation). By definition, then, for that supplier number sno, the expression !!(!!SP) is shorthand for the following:

( ( ir ) MATCHING RELATION { TUPLE { } } ) { ALL BUT }

This expression in turn is equivalent to:

( ( ir ) MATCHING TABLE_DEE ) { PNO , QTY }

And this expression reduces to just ir. Thus, “!!” is idempotent (i.e., !!(!!r) is equivalent to !!r for all r), and the overall expression

S WHERE ( !!(!!SP) ) { PNO } = P { PNO }

is equivalent to:

S WHERE ( !!SP ) { PNO } = P { PNO }

(“Get suppliers who supply all parts”).

7.18 No, there’s no logical difference.

7.19 S JOIN SP isn’t a semijoin; S MATCHING SP isn’t a join (it’s a projection of a join). The expressions r1 JOIN r2 and r1 MATCHING r2 are equivalent if and only if relations r1 and r2 are of the same type (when the final projection becomes an identity projection, and the expression overall degenerates to r1 INTERSECT r2).

7.20 If r1 and r2 are of the same type and t1 is a tuple in r1, the expression !!r2 (for t1) denotes a relation of degree zero—TABLE_DEE if t1 appears in r2, TABLE_DUM otherwise. And if r1 and r2 are the same relation (r, say), !!r2 becomes !!r, and it denotes TABLE_DEE for every tuple in r.

7.21 They’re the same unless table S is empty, in which case the first yields a one-column, one- row table containing a zero and the second yields a one-column, one-row table “containing a null.”

7.22 In SQL, typically in a cursor definition; in Tutorial D (where ORDER BY is spelled just ORDER), in a special operator (“LOAD”), not discussed further in this book, that retrieves a specified relation into an array (of tuples). Note that both of these situations are ones in which we’re leaving the relational context and moving out into the external environment, as it were.

7.23 The tables on page 228 have no doubly underlined columns because the relations they represent are both of cardinality one, by definition. If we think of those relations as the current values of two relvars—defined by the SQL expressions SELECT COUNT(*) AS X FROM S and SELECT COUNT (DISTINCT STATUS) AS Y FROM S, respectively—then those relvars are each subject to a constraint to the effect that their cardinality must be one, and hence they each have an empty key (see the answer to Exercise 5.23 in Chapter 5). And empty keys obviously can’t be shown using the double underlining convention.

As for the table on page 246, it doesn’t represent a relation at all. Therefore it can’t be the current value of any relvar; thus, the concept of having a key at all simply doesn’t apply, and the double underlining convention therefore doesn’t apply either.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for SQL and Relational Theory

Create new playlist

Sign In

Sign Up

Chapter 7

EXCLUSIVE UNION

SEMIJOIN AND SEMIDIFFERENCE

EXTEND

IMAGE RELATIONS

DIVIDE

AGGREGATE OPERATORS

Empty Arguments

IMAGE RELATIONS REVISITED

SUMMARIZATION

SUMMARIZATION REVISITED

GROUP, UNGROUP, AND RELATION VALUED ATTRIBUTES

RVAs Make Outer Join Unnecessary

RVAs in Base Relvars

RVAs Are Necessary for Relational Comparisons

Aggregate Operators

“WHAT IF” QUERIES

A NOTE ON RECURSION

Cycles

WHAT ABOUT ORDER BY?

EXERCISES

ANSWERS

Table of Contents for
SQL and Relational Theory