Appendix F

NoSQL and Relational Theory

Just say no!

—Nancy Reagan (1982)

Note: This appendix is based in part on material from two papers that originally appeared in the NoCOUG Journal (the journal of the Northern California Oracle User Group).

The material is reused here by permission.

I said in Chapter 1 that we’d be concerned in this book with principles, not products, and foundations, not fashion or fads. In this appendix, however, I’m going to go back on that promise a little. As I’m sure you know, in recent years there’s been a major upsurge of interest in what are known generically as NoSQL systems. And, as so often happens when something suddenly becomes fashionable for some reason, this new development has been surrounded by a fair degree of confusion. In this appendix, I’d like to try to clear up some of that confusion, if I can. In particular, I’d like to try to clarify the relationship—to the extent that any such relationship exists—between the NoSQL movement and what is after all a major topic for the present book, viz., relational theory.

The first confusion arises in connection with the very name “NoSQL.” Naïvely, what I thought when I first encountered that name was “Aha! So at last people have begun to realize what a difficult language SQL really is, and they want something better to access their databases with.” But, of course, that’s not what the NoSQL advocates are saying at all.1 Their quarrel isn’t with the SQL language as such; rather, it’s with existing SQL products, which, they claim, all fail to provide adequate performance and/or adequate scalability and/or adequate data availability. Well, they might be right in these claims—in fact, let’s assume they are, for the sake of the present discussion at any rate. But of course those deficiencies are deficiencies of the products in question, not deficiencies of the SQL language as such. Indeed, some NoSQL advocates are now saying that “NoSQL” doesn’t really mean “No SQL,” it means “Not Only SQL,” with the implication that the products in question might still support SQL as such in addition to something else (i.e., something that’s explicitly not SQL).

A second confusion arises from the fact that the name “NoSQL” doesn’t refer to just one product, or one kind of product. Rather, it’s an umbrella term, used to refer to at least four different things: viz., document, key-value, column, and graph systems.2 These different systems can be characterized very briefly as follows:

  • Document systems: The data in the database consists of a collection of XML or other documents. Documents can be retrieved in their entirety or searched via some document query language (e.g., XPath or XQuery, in the case of XML).

  • Key-value systems: The database contains nothing but uninterpreted “blobs” (see Chapter 2), each with its own unique ID. The blobs are the values and the IDs are the keys, and the entire collection of key-value pairs is referred to as a dictionary, map, or associative array. All access is by key.

  • Column systems: The database contains (typically) just one giant table, made up of rows and columns as usual.3 The rows are identified by key as usual. However, the columns aren’t columns in the SQL sense—they don’t contain just values, they contain key-value pairs instead, and there’s no requirement that every row contain the same kinds of such pairs (i.e., the rows aren’t homogeneous). Access is by row key and column key within row key.

  • Graph systems: Graph systems are somewhat different in kind from the other three. The database consists of nodes and edges, which to a first approximation you can think of as representing entities and relationships, respectively. Typically, a query language is provided that allows the user to enter the graph at some specific node and then traverse edges in the graph to some desired target.
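
By way of a foretaste of a point I’ll come back to near the end of this appendix (viz., that any graph can be represented, quite succinctly, in relational form), here’s a minimal sketch of such a representation in conventional SQL. The table and column names are invented purely for illustration:

CREATE TABLE NODE
     ( NODE_ID  INTEGER       NOT NULL ,
       LABEL    VARCHAR(100)  NOT NULL ,
       PRIMARY KEY ( NODE_ID ) ) ;

CREATE TABLE EDGE
     ( FROM_ID  INTEGER      NOT NULL REFERENCES NODE ,
       TO_ID    INTEGER      NOT NULL REFERENCES NODE ,
       REL_TYPE VARCHAR(50)  NOT NULL ,     -- e.g., 'REPORTS_TO'
       PRIMARY KEY ( FROM_ID , TO_ID , REL_TYPE ) ) ;

-- "Enter the graph" at node 1 and traverse one kind of edge:
SELECT N.NODE_ID , N.LABEL
FROM   EDGE E JOIN NODE N ON N.NODE_ID = E.TO_ID
WHERE  E.FROM_ID = 1 AND E.REL_TYPE = 'REPORTS_TO' ;

The point of the sketch isn’t that users of a graph DBMS would want to write queries at this level of detail; it’s merely that the structure as such presents no difficulty for the relational model. Traversals of arbitrary length (shortest path and the like) are discussed in the interview at the end of this appendix.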

So what is it that unites these systems?—what is it that they have in common, to justify referring to them all by the same blanket label? Mostly, it seems they’re united in what they’re not. To be specific, they’re obviously not conventional SQL systems, as that term is generally understood. Unfortunately, however, I would have to say they’re also united in the extensive and demonstrable lack of understanding of the relational model displayed by their promoters. (The remainder of this appendix provides evidence in support of this complaint on my part.) And the first three categories, at least, are also united in their failure to distinguish properly between a model and its implementation (this criticism might not apply to graph systems, or at least not so much or so obviously). And there are some further, somewhat interrelated, points of commonality too (though again these points might not apply to graph systems). To be specific:

  • The systems typically reject the ACID properties of transactions, replacing them by a set of so-called “BASE properties” instead. BASE is a somewhat contrived acronym, standing for Basic Availability, Soft state, Eventual consistency. I’ll have more to say about these properties in the section “Eventual Consistency,” later.

  • They typically keep several copies of the stored data (data replication), typically at several different sites, in order to provide improved availability.

  • They also typically employ functional segmentation and sharding in order to improve availability still further. I’ll discuss these issues in the next two sections.

  • Because the data structures they support are so simple, they might not require a schema.4 Of course, without a schema the DBMS has no knowledge of any internal structure the data might possess, in which case it probably won’t be able to support anything very sophisticated by way of a query language.

  • There’s no join support.5 Thus, it might be necessary to do several separate queries in order to obtain some desired result. Here’s what Wikipedia has to say in connection with this point: “NoSQL queries are often faster than traditional SQL queries, so the cost of having to do additional queries [might] be acceptable.” Myself, I would have thought this point could be open to dispute.
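
To make that last point a little more concrete, here’s a deliberately trivial example (the tables are invented for the purpose). The single SQL query below retrieves order and customer data together in one operation; a system with no join support would typically need one access to fetch the qualifying orders and then a further access per order to fetch the corresponding customer data:

SELECT O.ORDER_ID , O.AMOUNT , C.NAME , C.CITY
FROM   ORDERS O JOIN CUSTOMERS C ON C.CUST_ID = O.CUST_ID
WHERE  O.STATUS = 'OPEN' ;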

Aside: Before I go on to elaborate on some of the foregoing ideas, perhaps I should offer some explicit examples of quotes to back up my claim that the promoters of these systems seem not to understand the relational model properly. Here are a few (sources omitted to protect the guilty):

  • “Relational databases lack relationships.”

  • “Relationships do exist in the vernacular of relational databases, but only as a means of joining tables.”

  • “Join tables add accidental complexity; they mix business data with foreign key metadata.”

  • “[This product] goes a step beyond even relational databases by lifting up the connections to be first-class citizens. It does this by allowing the relationships ... to have an arbitrary number of properties.”

  • “The relational database exposes a tabular view of the world. But our [application] is filled with hierarchical and connected data.”

And so on. End of aside.

FUNCTIONAL SEGMENTATION

The basic idea of functional segmentation, or just segmentation for short, is very simple: The overall business enterprise is divided up into a set of more or less independent functional pieces or segments, and then all of the data for a given segment is stored at its own site, in its own database. For example, a computer manufacturer might have a hardware products segment, a software products segment, an education and training segment, a publications segment, and so on, each with its own database. The big advantage of such a scheme is that it avoids having a single point of failure in the system, so that a user of, say, the education database will be unaffected if the publications database goes offline.

Of course, segmentation as just described isn’t exactly a novel concept (many years ago we used to call those separate databases subject databases).6 But more to the point, let me now observe that—as I’m sure you’ve already realized—there’s no conflict whatsoever between such segmentation and relational theory. Segmentation is merely a physical database design decision, and it could perfectly well be applied in the relational context if it were thought desirable to do so. In other words, there’s absolutely no reason why a desire for segmentation should require the rejection of any relational concepts (or SQL concepts, come to that).

Note: The complete set of subject databases for the enterprise could be regarded as a single distributed database. However, distributed transactions can be expensive, and for that reason they aren’t allowed in NoSQL systems. As a consequence, there’s a possibility that those separate subject databases might sometimes become at least temporarily inconsistent—for example, a document in the publications database might contain a reference to a product the hardware division has stopped making. I’ll return to this point in the section “Eventual Consistency” below.

SHARDING

Sharding can be regarded as an extended form of what’s more usually known as horizontal decomposition, as this latter term is used in conventional database design contexts. Suppose the computer manufacturer’s education database from the previous section contains tables that look like this (in outline):

COURSE   ( CNO , TITLE , ... )             KEY ( CNO )
OFFERING ( CNO , ONO , LOCATION , ... )    KEY ( CNO , ONO )
TEACHER  ( CNO , ONO , TNO , ... )         KEY ( CNO , ONO , TNO )
STUDENT  ( CNO , ONO , SNO , GRADE , ... ) KEY ( CNO , ONO , SNO )

Note the hierarchic structure here: There’s a one-to-many relationship from courses to course offerings, another from such offerings to teachers, and another from such offerings to students.7 In particular, each of the four tables has {CNO} (course number) as its key or part of its key. Thus, the tables can each be partitioned on the basis of course numbers, such that the rows from all four tables with CNO in the range 1-100 (say) form one “shard,” the rows with CNO in the range 101-200 form another, and so on. Each shard is then kept in its own database, stored at its own site.
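
In a DBMS that supports declarative range partitioning, that design decision can even be spelled out directly in the schema without affecting the user’s view of the data at all. Here’s a sketch for the COURSE table, using PostgreSQL-style syntax purely by way of illustration (strictly speaking this is single-system partitioning rather than cross-site sharding, but the principle, decomposition by key range as a purely physical design decision, is the same):

CREATE TABLE COURSE
     ( CNO    INTEGER       NOT NULL ,
       TITLE  VARCHAR(100)  NOT NULL ,
       PRIMARY KEY ( CNO ) )
     PARTITION BY RANGE ( CNO ) ;

CREATE TABLE COURSE_1_100   PARTITION OF COURSE FOR VALUES FROM ( 1 )   TO ( 101 ) ;
CREATE TABLE COURSE_101_200 PARTITION OF COURSE FOR VALUES FROM ( 101 ) TO ( 201 ) ;

-- Queries and updates still address COURSE itself; which partition ("shard")
-- a given row lives in is the system's business, not the user's.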

Observe now that—once again as I’m sure you’ve realized—there’s no conflict here with relational theory. Like segmentation, sharding is just a physical database design decision, and it could perfectly well be applied in the relational context if it were thought desirable to do so. Thus, there’s no reason why a desire to carry out such sharding should require the rejection of any relational concepts (or SQL concepts, come to that). Note too that the complete set of shards can be regarded as a single distributed database, though as previously noted distributed transactions aren’t allowed in NoSQL systems.

EVENTUAL CONSISTENCY

Now let me get back to that BASE acronym. B and A stand for basic availability, meaning, loosely, that “most of the data is available most of the time.” S stands for soft state, apparently meaning that the stored data doesn’t always have to be consistent (though exactly how it comes to have that meaning I really couldn’t say). And E stands for eventual consistency, meaning that inconsistencies if any will be resolved at some future time (“eventually”), not necessarily at the time of the pertinent update, nor even necessarily at the end of the transaction. It’s this last item, eventual consistency, that I now want to discuss.

Let me begin by saying that I have no problem with the broad intent of this notion, nor with the business requirement that the NoSQL supporters are aiming at in this connection. However, I do think I have a rather different perspective on what’s really going on here—different, that is, from the perspective usually presented when “eventual consistency” is discussed in the NoSQL literature—and it’s that different perspective that I’d now like to explain.

  • To say a database (be it distributed or otherwise) is consistent merely means, formally speaking, that the database in question conforms to all declared integrity constraints. Now, it’s crucially important that databases always be consistent in this sense; indeed, a database that’s not consistent in this sense, at some particular time, is like a logical system that contains a contradiction. Well, actually, that’s exactly what it is—a logical system that contains a contradiction. And in a logical system that contains a contradiction, you can prove anything; for example, you can prove that 1 = 0. (In fact, as we saw in Chapter 8, you can prove that you can prove that 1 = 0 in such a system.) What this means in database terms is that if the database is inconsistent in the foregoing sense, you can never trust the answers you get to queries (they may be false, they may be true, and you have no way in general of knowing which they are); all bets are off. As far as declared constraints are concerned, in other words, the system simply must do the checking whenever a pertinent update occurs; there’s no alternative, because (to say it again) not to do that checking is to risk having a database for which all bets are off. In other words, immediate constraint checking is logically required.

  • But consistency in the foregoing formal sense isn’t necessarily the same thing as consistency as conventionally (and informally) understood—meaning consistency as understood in conventional real world terms, outside the world of databases.8 Suppose there are two items A and B in the database that, in the real world, we believe should have the same value. They might, for example, both be the selling price for some given commodity, stored twice (perhaps at different sites) because data replication is being used to improve availability. If A and B in fact have different values at some given time, we might certainly say, informally, that there’s an inconsistency in the stored data at that time. But that “inconsistency” is an inconsistency as far as the system is concerned only if the system has been told that A and B are supposed to be equal—i.e., only if “A = B” has been declared as a formal integrity constraint. If it hasn’t, then (a) the fact that A ≠ B at some time doesn’t in itself constitute a consistency violation as far as the system is concerned, and (b) importantly, the system will nowhere rely on an assumption that A and B are equal.

  • Thus, if all we want is for A and B to be equal “eventually”—i.e., if we’re content for that requirement to be handled in the application layer, outside the DBMS—all we have to do as far as the database system is concerned is omit any declaration of “A = B” as a formal constraint. No problem, and in particular no violation of relational theory.
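
In SQL terms, the choice might look like this (a sketch only: the tables are invented, and the constraint uses the ISO standard’s CREATE ASSERTION, which, as noted later in this appendix, few products actually implement):

-- Two replicated copies of the selling price, kept (say) at different sites:
CREATE TABLE PRICE_COPY_A ( COMMODITY VARCHAR(20) PRIMARY KEY ,
                            PRICE     DECIMAL(9,2) NOT NULL ) ;
CREATE TABLE PRICE_COPY_B ( COMMODITY VARCHAR(20) PRIMARY KEY ,
                            PRICE     DECIMAL(9,2) NOT NULL ) ;

-- If the copies must agree at all times, say so; the system must then check
-- the constraint whenever a pertinent update occurs:
CREATE ASSERTION PRICES_AGREE CHECK
     ( NOT EXISTS ( SELECT * FROM PRICE_COPY_A A , PRICE_COPY_B B
                    WHERE  A.COMMODITY = B.COMMODITY
                    AND    A.PRICE    <> B.PRICE ) ) ;

-- If "eventual" agreement is good enough, simply omit the assertion; the
-- system then nowhere relies on the two copies being equal.

Whether and how the application layer then brings the two copies back into agreement is a matter between it and its users; either way, the DBMS has made no promise it can’t keep.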

THE FERNANDEZ INTERVIEW

I’ve claimed in this appendix so far that three of what are generally regarded as the defining characteristics of NoSQL systems—segmentation, sharding, and eventual consistency (or replication together with eventual consistency, rather)—are entirely compatible with relational theory. Thus, it seems appropriate to close the discussion with a lightly edited version of the pertinent portion of the interview conducted by Iggy Fernandez with Hugh Darwen and myself, where some of these matters are examined in a little more depth. Note: The interview was originally published in NoCOUG Journal 27, No. 3 (August 2013), www.nocoug.org/Journal/NoCOUG_Journal_201308.pdf. What follows is essentially the whole of the “No to NoSQL!” section of that interview. My thanks to Iggy Fernandez and NoCOUG for allowing me to republish this material in the present book.

Fernandez: The archetypal NoSQL product is Dynamo from Amazon.com. The 2007 ACM paper by Amazon.com states: “Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados. Therefore, the service responsible for managing shopping carts requires that it can always write to and read from its data store, and that its data needs to be available across multiple data centers … There are many services on Amazon’s platform that only need primary-key access to a data store. For many services, such as those that provide best seller lists, shopping carts, customer preferences, session management, sales rank, and product catalog, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provides a simple primary-key-only interface to meet the requirements of these applications.” The Dynamo paper is where the popular claim originated that NoSQL products are faster, more scalable, and more available than relational products in certain clearly delineated scenarios such as online shopping carts. But is there any merit to the claim at all?

Date: First off, let me make it very clear that I know almost nothing about Dynamo as such. But if the statement is correct that it provides “a simple primary-key-only interface to meet the requirements of [certain rather simple] applications”—well, fine. I have no problem with that. If there’s a class of applications that (a) are important for some pragmatic reason and (b) require only a limited subset of the system’s full functionality, then I think it’s perfectly reasonable for the system to provide a special interface tailored to just those requirements. Why, that’s exactly what IBM did, with its Fast Path option in IMS! In a relational system, that special interface would support a carefully chosen subset of the full relational interface, and the implementation would be able to take advantage of the fact that the interface is circumscribed in just such a way. It might be able to make use of its own special stored data formats as well. And—just to spell the point out—I see no reason why the provision of that special interface and those special stored data formats should have any negative effect at all on users who want to use the regular “full function” relational interface.

That said, if there’s a suggestion that Amazon’s various disaster scenarios, regarding tornados and the rest, are somehow more of a problem for relational systems than they are for nonrelational ones, then of course I reject that suggestion 100 percent. As so often, I strongly suspect that what’s going on here is some kind of confusion between what truly relational systems ought to be capable of and what today’s mainstream SQL products can actually do. If today’s SQL products fail to meet Amazon’s requirements, well, that might be a valid criticism of those products—but it’s not a valid criticism of relational systems in general.

To sum up: I do think we should discard SQL, as quickly as we can, and replace it by something better. Unfortunately, however, most of the people who currently want to discard SQL (or, at least, those who are most vocal about doing so) seem to want to do so for the wrong reasons. And there’s a strong likelihood that they’ll replace it, not by something better, but by something worse.

Darwen: So Amazon uses a Dynamo key value to access a giant blob whose structure is understood only by the applications. In that case they are happy to forgo all of those advantages given by Codd in 1974, not all of which are properly delivered by SQL products in any case.9 If the people at Amazon are satisfied that Dynamo provides everything they need, and they feel they have good reason to reject the use of any SQL products, then who are we to argue? The indictment seems to be of SQL products, not relational databases.

It’s easy to understand why the mainstream SQL products might be too “heavy” for Amazon’s needs. Those products have become extremely cluttered up with all sorts of features that would be of little or no use in the Amazon scenario: baroque support for user defined data types, pointers (in the form of REF values), BLOBs and CLOBs, subtables and supertables, sequence generators, datalinks, locators, system versioned tables, and on and on.

Rel (dbappbuilder.sourceforge.net/Rel.html) is an implementation, by Dave Voorhis of the University of Derby, U.K., of the relational language Tutorial D that we (Chris and I) devised for teaching and illustrative purposes. Rel is a very simple DBMS that meets all of the criteria for being fully relational. If Rel, or one of the other prototype implementations of The Third Manifesto, were dressed up sufficiently for commercial purposes—including in particular the performance enhancements to be obtained by implementation of established optimization techniques and sophisticated storage structures—then it would be interesting to see if Amazon still preferred Dynamo.

All of that said, we admit that full scalability might be hard to achieve with a fully relational system. That’s because we require such a system to support expressions of arbitrary complexity for deriving whatever results might be desired from the database and for expressing whatever integrity constraints might be needed. Let’s suppose there are extreme cases where what runs in acceptable time with small databases is simply not feasible with large ones. Such cases would militate against the declaration of certain integrity constraints (constraints that current SQL products don’t even support anyway). They also militate against the use of certain queries, in an OLTP context, that SQL products do support, but that problem can of course be avoided simply by not doing such queries. We conjecture that such cases would be unlikely to arise in Amazon’s shopping scenario. In any case, if they would still prefer Dynamo to a putative souped up Rel, then so be it. If the benefits of a relational system aren’t all needed in a particular situation, and provision of a more tailored solution in that situation is found to be cost effective, then who could argue that Amazon would have made a bad choice?

By the way, the bogey of scalability is sometimes advanced as justification for failing to provide any solution at all. A pertinent example is the lack of full support for integrity constraints in SQL systems (where something close to full support could be achieved by implementing the ISO standard’s CREATE ASSERTION statement, for example). But some databases are quite small and subject to quite infrequent updates. I use Rel for several small databases that I maintain for domestic and hobby purposes on my home computer. I have benefited significantly from the ability to define constraints that would be impossible to define in SQL without CREATE ASSERTION.10 Without those constraints, certain errors by me would have gone undetected, resulting in incorrect databases. I’m typically dealing with relations consisting of a few hundred tuples, and in some cases I’m updating no more than once per month. It seems unfair that small organizations, which have little clout with the SQL vendors, can’t also enjoy the benefits of such solutions, just because those solutions might not be practical for large organizations (with deep pockets, therefore listened to by the DBMS vendors) using OLTP on enormous databases. In this connection, one SQL DBMS implementer once told me privately that he agreed with me in principle but tellingly added that supporting CREATE ASSERTION would be very difficult in his product and lack of scalability for some kinds of constraints would give him a good get-out clause!

Fernandez: Another breed of NoSQL products that has gained considerable commercial momentum is “graph systems” such as Neo4J. In “Normalized Database Structure: A Brief Tutorial,” Codd carved out a special exception for such products: “It may be argued that in some applications the problems have an immediate natural formulation in terms of networks. This is true of some applications, such as studies of transportation networks, power-line networks, computer design, and the like. We shall call these network applications … Except in network applications, links should not be employed in the user’s data model.” Since the problems addressed by these products (e.g., shortest path) have no solution in relational calculus, do these products have a legitimate case to be nonrelational?

Date: Several points here. Yes, Codd did “carve out a special exception” for what he called network applications. But I’m not sure he was right to do so. As a simple counterexample, a company’s organization chart has “an immediate natural formulation in terms of networks” (in fact, often—though not always—in terms of a hierarchy, which is a simple special case). But it doesn’t follow that we need a network DBMS (i.e., one that exposes “links” or pointers to the user) in order to deal with corporate organizations, and of course we don’t.

Second, I’d like to point out that in the paper you reference, Codd also said this: “[Users often have] occasion to require tables to be printed out or displayed. What could be a simpler, more universally needed, and more universally understood data structure than a table? Why not permit ... users to view all the data ... in a tabular way?”

Third, any graph can always be represented—quite succinctly, in fact—in relational form. As for “shortest path” and other such problems, please note that the relational model is only a minimum requirement. Even if you’re right when you suggest that the shortest path problem can’t be formulated in relational calculus—I presume you’re referring to the fact that the relational calculus as originally defined had no support for the famous “ancestral” problem—well, that’s not to say it never will have such support. In fact, a great deal of research has been done on adding such support (and implementing it efficiently, too).

Fourth, let’s agree for the sake of the argument that there are some problems that graph-based DBMSs can solve better than relational ones. I don’t have an issue with that. My position is this: We know the class of problems for which relational systems are suited is very large—but it’s not necessarily universal. But I very much doubt whether any other approach is universal either. So my objection isn’t to using, e.g., graph-based DBMSs to solve problems that they solve well; rather, my objection is to attempts to solve by nonrelational means problems that can reasonably, perhaps better, be solved by relational means. In other words, graph-based DBMSs (for example) might well have a role to play, but that role is not to take over the entire database world. To repeat something I’ve said elsewhere: I’ve never seen a proposal for “taking over the database world”—i.e., for replacing the relational model—in which the person doing the proposing really understood the relational model. Surely, if you want to claim that Technology A is no good and needs to be replaced by Technology B, then it’s incumbent on you to understand Technology A in the first place? And, more specifically, to demonstrate that Technology B solves not only all of the problems that Technology A does, but also some problem that Technology A doesn’t?

Darwen: Well, graph DBMSs and the like simply are nonrelational. Of course we don’t suggest that all databases should be relational—only that all general purpose ones should be. But if you were to ask if it’s legitimate to claim that solutions to problems like “shortest path” are inherently unobtainable with a relational system, then the answer is an emphatic “No!” As Chris has effectively already said, having no solution in relational calculus doesn’t mean it’s impossible for a relational DBMS to provide the necessary operators. For example, Tutorial D already includes an operator, TCLOSE, for deriving a relation representing the transitive closure of its operand, a recursive relation. And even SQL includes support—rather elaborate support, in fact—for recursive table expressions in general. Any operator that’s closed over relations is admissible in a relational database language.
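
A sketch of that SQL support: the following recursive table expression computes the set of nodes reachable from node 1, assuming an EDGE table like the one sketched earlier in this appendix.

WITH RECURSIVE REACHABLE ( NODE_ID ) AS
     ( SELECT TO_ID FROM EDGE WHERE FROM_ID = 1
       UNION
       SELECT E.TO_ID
       FROM   REACHABLE R JOIN EDGE E ON E.FROM_ID = R.NODE_ID )
SELECT NODE_ID FROM REACHABLE ;

-- UNION (rather than UNION ALL) eliminates duplicate rows as the recursion
-- proceeds, so the computation reaches a fixpoint and terminates even if the
-- graph contains cycles.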

Now, the proponents of graph databases might wish to argue that their systems can provide much faster solutions to such problems than could ever be obtained by implementations of suitable relational operators. But suppose the graph DBMS were the engine of a relational DBMS, such that a relational expression is mapped under the covers to an equivalent expression or procedure in the graph DBMS’s language. Wouldn’t we then see the relational DBMS performing pretty much as well—or as badly!—as the graph DBMS on its own? And wouldn’t its user then be receiving all the benefits claimed for relational DBMSs in general in addition to those claimed for graph DBMSs in particular? It’s interesting to see in this connection that some of the DBMSs listed in the Wikipedia article “Graph database,” notably those available from Oracle, use some form of SQL as their query language.

Fernandez: Another breed of NoSQL products that has gained considerable commercial momentum is the so-called “Big Data” products like Hadoop that aim to process nontransactional data outside the transactional DBMS. It was apparent that the glaring drawback of this class of NoSQL products was the absence of SQL, and so there has been a rush to provide SQL-like functionality in this space, with Impala from Cloudera leading the way. Which leads to the question: Is it kosher to decouple relational algebra and relational calculus from the DBMS as Impala has done?

Date: Before I became “a database person,” I was a languages person. I worked for several years at IBM Hursley (in the U.K.), which in those days was the home of PL/I. (Of course, you might never have heard of PL/I, and I’ll be the first to admit that as a language it looks a little antiquated now. But it was a big deal at the time—and a big revenue earner for IBM, I might add.) So when I first learned about Ted Codd’s relational model, I wanted to add relations and relational operators to PL/I—in order that PL/I would be able to operate on data in a relational database, of course, but not only for that reason; I always thought it would be useful to have “local” relations, meaning ones that weren’t in the database, and to be able to operate on those relations by means of joins and unions and so on. So if that’s what you mean by “decoupling relational algebra and relational calculus from the DBMS,” then I’m all for it.

But you touch on something else here: “Big Data.” Sorry if I’m beginning to sound like a broken record—I guess that metaphor is pretty antiquated too!—but I see no reason why a relational system shouldn’t be able to handle “Big Data” perfectly well. Data size is, of course, an implementation concern, not a model concern. The relational model is deliberately silent on all matters of physical implementation. Just because the implementation has to deal with enormous volumes of data, that’s no reason, as far as I can see, why the user interface has to be anything other than relational.

Darwen: My observations on graph databases seem applicable here too. Couldn’t Impala be thought of as the DBMS and Hadoop as Impala’s database engine? Well, up to a point, perhaps, but if Impala users have to use Hadoop itself for certain purposes (perhaps database definition? updating? constraint enforcement?), then we could hardly call Impala a fully relational DBMS, even if its language, as far as it goes, were in keeping with the relational model. In any case, there have been many examples over the years of the need being perceived for a “decoupled” relational front end to a nonrelational system. For example, in the early 1980s a group in IBM (U.K.) developed an SQL front end to IBM’s ancient and still running hierarchical system IMS. There can be no objection to such products; on the contrary, if the front end were fully relational (as opposed to SQL), we would encourage them to be provided wherever the need arises.
