2

Manipulating Query Expressions

The first step in answering a query posed to a data integration system is selecting which data sources are most relevant to the query. Making this decision relies on a set of methods for manipulating queries and reasoning about relationships between queries. This chapter covers such reasoning techniques in detail. While these techniques are very useful in data integration, they are also of independent interest and have been used in other contexts as well, such as query optimization and physical database design.

Section 2.1 begins with a review of database concepts that are used throughout the book. Section 2.2 describes techniques for query unfolding, where the goal is to reformulate a query, possibly posed over other queries or views, so that it refers only to the database relations. In the context of query optimization, query unfolding may let a query optimizer discover more efficient query processing plans because there is more freedom deciding in which order to perform join operations. In the context of data integration, query unfolding will be used to reformulate a query posed over a mediated schema into a query referring to data sources (Section 3.2.2).

Section 2.3 describes algorithms for query containment and query equivalence. Query containment is a fundamental ordering relationship that may hold between a pair of queries. If the query Q1 contains the query Q2, then the answers to Q1 will always be a superset of the answers of Q2 regardless of the state of the database. If two queries are equivalent, then they always produce the same answers, even though they may look different syntactically. We will use query containment and equivalence to decide whether two data sources are redundant with each other and whether a data source can be used to answer a query. In the context of query optimization, we use query equivalence to verify that a transformation on a query preserves its meaning.

Section 2.4 describes techniques for answering queries using views. Intuitively, answering queries using views considers the following problem. Suppose you have stored the results of several queries into a set of views over a database. Now you receive a new query and you want to know whether you can answer the query using only the views, without accessing the original database. In the context of query optimization, finding a way to answer a query using a set of views can substantially reduce the amount of computation needed to answer the query. As we describe in Section 3.2.3, in the context of data integration we often describe a set of data sources as views over a mediated schema. User queries are formulated using the terms in the mediated schema. Hence, answering queries using views is necessary in order to reformulate user queries to refer to the data sources.

2.1 Review of Database Concepts

We begin with a review of a few basic terms and concepts from the database literature related to data modeling and querying.

2.1.1 Data Model

Data integration systems need to handle data in a variety of data models, be it relational, XML, or unstructured data. We review the basics of the relational data model here, and introduce other data models later in the book. In particular, Chapter 11 discusses XML and its underlying data model, as it plays a key role in many data integration scenarios.

A relational database is a set of relations, also called tables (see Figure 2.1). A database schema includes a relational schema for each of its tables and a set of integrity constraints that we describe a bit later.

image

Figure 2.1 The Interview table stores the employment candidates and the result of their interviews. The EmployeePerformance table describes the quarterly evaluation score of each employee.

A relational schema specifies the set of attributes in the table and a data type for each attribute. The arity of a relation is the number of attributes it has. For example, in Figure 2.1, the arity of the Interview relation is 5 and its schema is:

candidate: string date: date
recruiter: string hireDecision: boolean
grade: float

In a sense, the schema describes how the database author decided to organize the data, what aspects of the data she chose to model, and the distinctions she wished to make. For example, the Interview table does not include an attribute for the location of the interview or the position for which the candidate was interviewing. The EmployeePerformance table only provides a single grade, and does not model the fact that this grade may be the composition of several more specific performance measures. One of the challenges we face in data integration is that different sources organize their data differently, and these differences need to be reconciled.

A relational table includes a finite number of rows, called tuples (or records). A tuple assigns a value to each attribute of the table. If the relation being discussed is clear from the context, we denote its tuples with its values in parentheses, e.g.,

(Alan Jones, 5/4/2006, Annette Young, No, 2.8)

If not, we denote it as a ground atom:

Interview (Alan Jones, 5/4/2006, Annette Young, No, 2.8)

In some cases, we describe a tuple with a mapping from attribute names to values, such as:

{candidate→ Alan Jones, date→ 5/4/2006, recruiter → Annette Young, hireDecision → No, grade → 2.8}

We pay special attention to the NULL value in a database. Intuitively, NULL means that the value is not known or may not exist. For example, the value of an age attribute may be NULL if it’s not known, whereas the value of spouse could be either unknown or may not exist. The important property to keep in mind about the NULL value is that an equality test involving a NULL returns NULL. In fact, even NULL = NULL returns NULL. This makes intuitive sense because if two values are not known, we certainly do not know that they are equal to each other. We can test explicitly for NULL with the predicate is NULL.

A state of the database, or database instance, is a particular snapshot of the contents of the database. We distinguish between set semantics of databases and multi-set semantics. In set semantics, a database state assigns a set of tuples to each relation in the database. That is, a tuple can only appear once in a relation. In multi-set semantics, a tuple can appear any number of times in each relation, and hence a database instance is an assignment of a multi-set of tuples to each relation. Unless we state otherwise, our discussion will assume set semantics. Commercial relational databases support both semantics.

We use these notations throughout the book.

We denote a database instance with D (possibly with subscripts).

We denote attributes with letters from the beginning of the alphabet, e.g., A, B, C. We denote sets or lists of attributes with overbars, e.g., image.

We denote relation names with the letters R, S, and T and sets or lists of relations with overbars, e.g., image.

We denote tuples with the lowercase letters, e.g., s, t.

If image is a set of attributes and t is a tuple in a relation that has the attributes in image, then image denotes the restriction of the tuple t to the attributes in image.

2.1.2 Integrity Constraints

Integrity constraints are a mechanism for limiting the possible states of the database. For example, in the employee database, we do not want two rows for the same employee. An integrity constraint would specify that in the employee table the employee ID needs to be unique across the rows. The languages for specifying integrity constraints can get quite involved. Here we focus on the following most common types of integrity constraints:

Key constraints: A set of attributes image of a relation R is said to be a key of R if there do not exist a pair of tuples t1, t2R such that image and t1t2. For example, the attribute candidate can be a key of the Interview table.

Functional dependencies: We say that a set of attributes image functionally determines a set of attributes image in a relation R if for every pair of tuples t1, t2R, if image, then image.
For example, in the EmployeePerformance table, empID and reviewQuarter functionally determine grade. Note that a key constraint is implies a functional dependency where the key attributes determine all of the other attributes in the relation.

Foreign key constraints: Let S be a relation with attribute A, and let T be a relation whose key is the attribute B. The attribute A is said to be a foreign key of attribute B of table T if whenever there is a row in S where the attribute A has the value v, then there must exist a row in T where the value of the attribute B is v. For example, the empID attribute of the EmployeePerformance table would be a foreign key of the Employee table.

General Constraint Expressions

Tuple-generating dependencies (TGDs) and equality-generating dependencies (EGDs) offer a general formalism for specifying a wide class of integrity constraints. As we see in subsequent chapters, TGDs are also used to specify schema mappings.

Tuple-generating dependencies are formulas of the form

image

where s1, …, sm and t1, …, tl are relation names. The variables image are a subset of image, and the variables image are a subset of image. The variables image do not appear in image. Depending on the context, the relation names on the left- and right-hand side of the dependency may refer to relations in the same schema or may refer to different databases.

Equality-generating dependencies are similar, except that the right-hand side contains only equality atoms:

image

where all the variables on the right-hand side appear on the left-hand side. We note that in practice, the quantifiers (image and image) are often omitted. In these cases, all the variables that appear on the left-hand side are assumed to be universally quantified (∀) and the variables appearing only on the right-hand side are assumed to be existentially quantified (∃).

As we illustrate below, we can use these formalisms to express the contraint classes mentioned above.

The attribute candidate is a key of the Interview table: (the following formula specifies a key constraint assuming set semantics for the table)

(∀C1, D1, R1, H1, G1, D2, R2, H2, G2)

Interview (C1, D1, R1, H1, G1), Interview(C1, D2, R2, H2, G2) →

D1 = D2, R1 = R2, H1 = H2, G1 = G2

The attributes empID and reviewQuarter functionally determine grade in the EmployeePerformance table:

(∀I, N1, R, G1, Re1, N2, G2, Re2)

EmployeePerformance(I, N1, R, G1, Re1), EmployeePerformance(I, N2, R, G2, Re2)

G1 = G2

The recruiter attribute of the Interview table is a foreign key of the EmployeePerformance table:

(∀I, N, R, Gr, Re) EmployeePerformance(I, N, R, Gr, Re) → (∃Na, Hd, Mg) Employee(I, Na, Hd, Ma)

2.1.3 Queries and Answers

Queries are used for several purposes in data integration systems. As in database systems, queries are used in order to formulate users’ information needs. In some cases, we may want to reuse the query expression in other queries, in which case we define a named view over the database, defined by the query. If we want the database system to compute the answer to the view and maintain the answer as the database changes, we refer to the view as a materialized view.

In data integration systems, we also use queries to specify relationships between the schemata of data sources. In fact, as we discuss in Chapter 3, queries form the core of the formalisms for specifying semantic mappings.

We distinguish between structured queries and unstructured queries. Structured queries such as SQL queries over relational databases or XQuery queries over XML databases are the ones we work hard to support in database systems. Unstructured queries are the ones we are most familiar with on the Web: the most common form of an unstructured query is a list of keywords.

We use two different notations for queries over relational databases throughout the book. The first is SQL, which is the language used to query relational data in commercial relational systems. Unfortunately, SQL is not known for its aesthetic aspects and hence not convenient for more formal expositions. Hence, in some of the more formal discussions, we use the notation of conjunctive queries, which is based on (a very simple form of) mathematical logic.

SQL is a very complex language. For our discussion, we typically use only its most basic features: selecting specific rows from a table, selecting specific columns from a table, combining data from multiple tables using the join operator, taking the union of two tables, and computing basic aggregation functions. For example, the following queries are typical of the ones we see in the book.

Example 2.1

SELECT recruiter, candidate

FROM Interview, EmployeePerformance

WHERE recruiter = name AND EmployeePerformance.grade < 2.5

Example 2.2

SELECT reviewer, AVG(grade)

FROM EmployeePerformance

WHERE reviewQuarter = “1/2007

The query in Example 2.1 asks for pairs of recruiters and candidates where the recruiter got a low grade on their performance review. The answer to this query may reveal candidates who we may want to reinterview. The query in Example 2.2 asks for the average grade that a reviewer gave in the first quarter of 2007.

Given a query Q and a database D, we denote by Q (D) the result of applying the query Q to the database D. Recall that Q (D) is also a relation whose schema is defined implicitly by the query expression.

2.1.4 Conjunctive Queries

We briefly review the formalism for conjunctive queries. A conjunctive query has the following form:

image

In the query, image are the subgoals (or conjuncts) of the query and together form the body of the query. The Ri’s are database relations, and the image’s are tuples of variables and constants. Note that the same database relation can occur in multiple subgoals. Unless we explicitly give the query a different name, we refer to it as Q.

The variables in image are called distinguished variables, or head variables, and the others are existential variables. The predicate Q denotes the answer relation of the query. Its arity is the number of elements in image. We denote by Vars (Q) the set of variables that appear in its head or body.

The cj’s are interpreted atoms and are of the form X θ Y, where X and Y are either variables or constants, and at least one of X or Y is a variable. The operator θ is an interpreted predicate such as =, ≤, <, !=, >, or ≥. We assume the obvious meaning for the interpreted predicates, and unless otherwise stated, we interpret them over a dense domain.1

The semantics of a conjunctive query Q over a database instance D is as follows. Consider any mapping ψ that maps each of the variables in Q to constants in D. Denote by ψ (Ri) the result of applying ψ to image, by ψ (ci) the result of applying ψ to ci, and by ψ (Q) the result of applying ψ to image, all resulting in ground atoms. If

each of image is in D, and

for each 1 ≤ jm, ψ (cj) is satisfied,

then (and only then) ψ (Q) is in the answer to Q over D.

To illustrate the correspondence between SQL queries and conjunctive queries, the following conjunctive query is the same as the SQL query in Example 2.1:

Q1 (Y, X) :- Interview (X, D, Y, H, F), EmployeePerformance (E, Y, T, W, Z), W ≤ 2.5

Note that the join in the conjunctive query is expressed by the fact that the variable Y appears in both subgoals. The predicate on the grade is expressed with an interpreted atom.

Conjunctive queries must be safe; that is, every variable appearing in the head also appears in a non-interpreted atom in the body. Otherwise, the set of possible answers to the query may be infinite (i.e., the variable appearing in the head but not in the body can be bound to any value).

We can also express disjunctive queries in this notation. To express disjunction, we write two (or more) conjunctive queries with the same head predicate.

Example 2.3

The following query asks for the recruiters who performed the best or the worst:

Q1 (E, Y) :- Interview (X, D, Y, H, F), EmployeePerformance (E, Y, T, W, Z), W ≤ 2.5

Q1 (E, Y) :- Interview (X, D, Y, H, F), EmployeePerformance (E, Y, T, W, Z), W ≥ 3.9

We also consider conjunctive queries with negated subgoals, of the form

image

For queries with negation, we extend the notion of safety as follows: any variable appearing in the head of the query must also appear in a positive subgoal. To produce an answer for the query, the mapping from the variables of Q to the constants in the database must satisfy image.

In our discussions, the term conjunctive queries will refer to conjunctive queries without interpreted predicates or negation. If we allow interpreted or negated atoms, we will explicitly say so.

2.1.5 Datalog Programs

A datalog program is a set of rules, each of which is a conjunctive query. Instead of computing a single answer relation, a datalog program computes a set of intensional relations (called IDB relations), one of them being designated as the query predicate. In datalog, we refer to the database relations as the EDB relations (extensional database). Intuitively, the extensional relations are given as a set of tuples (also referred to as ground facts), while the intensional relations are defined by a set of rules. Consequently, the EDB relations can occur in only the body of the rules, whereas the IDB relations can occur both in the head and in the body.

Example 2.4

Consider a database that includes a simple binary relation representing the edges in a graph: edge(X,Y) holds if there is an edge from X to Y in the graph. The following datalog query computes the paths in the graph. edge is an EDB relation and path is an IDB relation.

r1 path(X, Y) :- edge (X, Y)

r2 path (X, Y) :- edge (X, Z), path (Z, Y)

The first rule states that all single edges form paths. The second rule computes paths that are composed from shorter ones. The query predicate in this example is path. Note that replacing r2 with the following rule would produce the same result.

r3 path (X, Y) :- path (X, Z), path (Z, Y)

The semantics of datalog programs are based on conjunctive queries. We begin with empty extensions for the IDB predicates. We choose a rule in the program and apply it to the current extension of the EDB and IDB relations. We add the tuples computed for the head of the rule to its extension. We continue applying the rules of the program until no new tuples are computed for the IDB relations. The answer to the query is the extension of the query predicate. When the rules do not contain negated subgoals, this process is guaranteed to terminate with a unique answer, independent of the order in which we applied the rules.

Example 2.5

In Example 2.4, suppose we begin with a database that contains the tuples edge(1,2), edge(2,3), and edge(3,4). When we apply r1 we will obtain path(1,2), path(2,3), and path(3,4). The first time we apply r2 we obtain path(1,3) and path(2,4). The second time we apply r2 we obtain path(1,4). Since no new tuples can be derived, the evaluation of the datalog program terminates.

In data integration we are interested in datalog programs mostly because they are sometimes needed in order to compute all the answers to a query from a set of data sources (see Sections 3.3 and 3.4). Readers who are familiar with the Prolog programming language will notice that datalog is a subset of Prolog. The reader should also note that not all SQL queries can be expressed in datalog. In particular, there is no support in datalog for grouping and aggregation and for outer joins. SQL does support limited kinds of recursion but not arbitrary recursion.

2.2 Query Unfolding

One of the important advantages of declarative query languages is composability: you can write queries that refer to views (i.e., other named queries) in their body. For example, in SQL, you can refer to other views in the FROM clause. Composability considerably simplifies the task of expressing complex queries, because they can be written in smaller fragments and combined appropriately. Query unfolding is the process of undoing the composition of queries: given a query that refers to views, query unfolding will rewrite the query so it refers only to the database tables.

Query unfolding is conceptually quite simple. We iteratively unfold a single view in the definition of the query until no more views remain. The following describes a single unfolding step. Like all other algorithms in this chapter, we describe them using the notation of conjunctive queries (see Section 2.1). We typically restrict our discussion to algorithms for manipulating conjunctive queries, and in some cases describe some important extensions. The bibliographic references contain pointers to algorithms that cover more complex queries.

Unfolding A Subgoal

Let Q be a conjunctive query of the form

Q(image) :- p1 (image, …, pn (image)

where p1 is itself a relation defined by the query

p1(image) :- s1 (image, …, sm (image)

We assume without loss of generality that image is a tuple of variables, and each variable occurs in the tuple at most once. The other subgoals of Q may also be defined by other queries or be database relations.

A single unfolding step is performed as follows. Let ψ be the variable mapping that maps image to image and maps the existential variables in p1 to new variables that do not occur anywhere else. To unfold p1 (image), remove p1 (image) from Q and add the subgoals s1 (ψ(image)), …, sm (ψ(image)) to the body of Q.

We repeat the above procedure until all of the subgoals in Q refer to database relations.

Example 2.6

Consider the following example, where Q3 is defined in terms of Q1 and Q2. The relation Flight stores pairs of cities between which there is a direct flight, and the relation Hub stores the set of hub cities of the airline. The query Q1 asks for pairs of cities between which there is a flight that goes through a hub. The query Q2 asks for pairs of cities that are on the same outgoing path from a hub.

Q1 (X, Y) :- Flight (X, Z), Hub (Z), Flight (Z, Y)

Q2 (X, Y) :- Hub (Z), Flight (Z, X), Flight (X, Y)

Q3 (X, Z) :- Q1 (X, Y), Q2 (Y, Z)

The unfolding of Q3 is

Q3 (X, Z) :- Flight (X, U), Hub (U), Flight (U, Y), Hub (W), Flight (W, Y), Flight (Y, Z)

There are a few points to note about query unfolding. First, the resulting query may have subgoals that seem redundant. The next section will describe a set of algorithms for removing redundant subgoals by leveraging techniques for query containment. Second, the query and the view may each have interpreted predicates (recall from Section 2.1.4) that are satisfiable in isolation, but after the unfolding we may discover that the query is unsatisfiable and therefore the empty answer can be returned immediately. This, of course, is an extreme example where unfolding can lead to significant optimization in query evaluation.

It is also interesting to note that through repeated applications of the unfolding step, the number of subgoals may grow exponentially. It is quite easy to create an example of n queries defining a query Q, such that unfolding Q yields 2n subgoals.

Finally, we emphasize that unfolding does not necessarily yield more efficient ways to execute the query. In fact, in Section 2.4 we do exactly the opposite—we try to rewrite queries so they do refer to views in order to speed up query processing. Unfolding merely allows the query processor to explore a wider collection of query plans by considering a larger set of possible orderings of the join operations in the query, and by considering the complete set of constraints expressed with interpreted predicates holistically. Of course, the ability to unfold queries is crucial when queries are written in a compositional fashion, which is one of the main benefits of declarative querying.

2.3 Query Containment and Equivalence

Let us reconsider the unfolding of the query Q3 in Example 2.6:

Q3 (X, Z) :- Flight (X, U), Hub (U), Flight (U, Y), Hub (W), Flight (W, Y), Flight (Y, Z)

Intuitively, this query seems to have more subgoals than necessary. Specifically, the subgoals Hub (W) and Flight (W, Y) seem redundant, because whenever the subgoals Hub (Z) and Flight (Z, Y) are satisfied, then so must Hub (W) and Flight (W, Y). Hence, we would expect the following query to produce exactly the same result as Q3:

Q4 (X, Z) :- Flight (X, U), Hub (U), Flight (U, Y), Flight (Y, Z)

Furthermore, if we consider the following query, which requires both intermediate stops to be hubs:

Q5 (X, Z) :- Flight (X, U), Hub (U), Flight (U, Y), Hub (Y), Flight (Y, Z)

then we would expect the set of answers to Q4 to always be a superset of the answers to Q5.

Query containment and equivalence provide a formal framework for reaching the conclusions we described above. Reasoning in this way enables us to remove subgoals from queries, thereby reducing the computation needed to execute them. As we will see in the next chapter, query containment and equivalence provide the formal framework for comparing different results of query reformulation in data integration systems.

2.3.1 Formal Definition

We begin with the formal definition. Recall that Q (D) denotes the result of the query Q over the database D. The arity of a query refers to the number of arguments in its head.

Definition 2.1 (Containment and Equivalence).

Let Q1 and Q2 be two queries of the same arity. We say that Q1 is contained in Q2, denoted by image, if for any database D, image. We say that Q1 is equivalent to Q2, denoted by image, if image and image.

The important aspect of the above definition is that containment and equivalence are properties of the queries, and not the current state of the database. The relationship needs to hold for any database state. In fact, determining query containment and equivalence can be viewed as a problem of logical deduction specialized to query expressions.

In our discussion, we do not consider the data types of individual columns of database relations. However, in practice we would only consider containment between pairs of queries that are union compatible, i.e., each of the columns of their head predicates are compatible. In what follows, we describe query containment (and hence equivalence) algorithms for common classes of queries. The bibliographic references point out additional cases that have been studied and indicate where to find the full details of the algorithms we describe.

2.3.2 Containment of Conjunctive Queries

We begin by discussing the simplest case of query containment: conjunctive queries with no interpreted predicates or negation. In our discussion we will often refer to variable mappings. A variable mapping ψ from query Q1 to query Q2 maps the variables of Q1 to either variables or constants in Q2. We also apply variable mappings to tuples of variables and to atoms. Hence, ψ (X1, …, Xn) denotes ψ (X1), …, ψ (Xn), and ψ (p (X1, …, Xn)) denotes p (ψ (X1), …, ψ (Xn)).

In the case of conjunctive queries with no interpreted predicates or negation, checking containment amounts to finding a containment mapping, defined next.

Definition 2.2 (Containment Mapping).

Let Q1 and Q2 be conjunctive queries. Let ψ be a variable mapping from Q1 to Q2. We say that ψ is a containment mapping from Q1 to Q2 if

image, where image and image are the head variables of Q1 and Q2, respectively, and

for every subgoal image in the body of Q1, image is a subgoal of Q2.

The following theorem shows that the existence of a containment mapping is a necessary and sufficient condition for containment.

Theorem 2.1

Let Q1 and Q2 be two conjunctive queries. Then, image if and only if there is a containment mapping from Q1 to Q2.

Proof

Suppose Q1 and Q2 have the following form:

Q1 (image) :- g1 (image), …, gn (image)

Q2 (image) :- h1 (image), …, hm (image)

For the if direction, suppose there is a containment mapping, ψ, from Q1 to Q2 and suppose that D is an arbitrary database instance. We need to show that if image, then image.

Suppose image; then there is a mapping, ϕ, from the variables of Q2 to the constants in D such that

1. image, and

2. image for 1 ≤ im.

Now consider the composition mapping image that maps the variables of Q1 to constants in D (i.e., the mapping that first applies ψ and then applies ϕ to the result). Since ψ is a containment mapping, the following conditions, necessary to show that image, hold:

image because image and image.

For every 1 ≤ in, image is a subgoal of Q2, and therefore image.

Therefore, image.

For the only if direction, assume that image, and we will show that there is a containment mapping from Q1 to Q2.

Let us consider a special database, DC, that we call the canonical database of Q2, and is constructed as follows. The constants in DC are the variables or constants appearing in the body of Q2. The tuples of DC correspond to the subgoals of Q2. That is, for every 1 ≤ im, the tuple image is in the relation hi.

Clearly, image, simply by the definition of Q2. Since image, image must also be in Q1 (DC), and therefore there is a mapping ψ from the variables of Q1 to the constants in Q2 such that

image, and

for every 1 ≤ in, image.

It is easy to see that ψ is a containment mapping from Q1 to Q2.

The important aspect of the above proof is the concept of a canonical database. We were able to show that if image, then containment holds on any database. Hence, we obtain Algorithm 1 for checking containment.

image

Algorithm 1 CQContainment: Query containment for conjunctive queries.

Example 2.7

Recall the queries we considered earlier:

Q3 (X, Z) :- Flight (X, U), Hub (U), Flight (U, Y), Hub (W), Flight (W, Y), Flight (Y, Z)

Q4 (X1, Z1) :- Flight (X1, U1), Hub (U1), Flight (U1, Y1), Flight (Y1, Z1)

The following containment mapping shows that image:

image

As we will see soon, a single canonical database will not always suffice as we consider queries with interpreted predicates or negation, but we will be able to remedy the situation by considering multiple canonical databases.

Computational Complexity

Deciding whether image is NP-complete in the size of the two queries. In practice, however, this is not a concern for multiple reasons. First, we are measuring the complexity in terms of the size of the queries, not the size of the data, and queries tend to be relatively small (though not always!). In fact, the algorithm above evaluates a query over an extremely small database. Second, in many practical cases, there are polynomial-time algorithms for containment. For example, if none of the database relations appears more than twice in the body of one of the queries, then it can be shown that containment can be checked in time that is polynomial in the size of the queries.

2.3.3 Unions of Conjunctive Queries

Next, we consider containment and equivalence of unions of conjunctive queries. Recall that unions are expressed as multiple rules with the same relation in the head.

Example 2.8

The following query asks for the pairs of cities that (1) are either connected by one-stop flight, or (2) both have a flight into the same hub.

Q1 (X, Y) :- Flight (X, Z), Flight (Z, Y)

Q1 (X, Y) :- Flight (X, Z), Flight (Y, Z), Hub (Z)

Suppose we want to decide whether the following query is contained in Q1:

Q2 (X, Y) :- Flight (X, Z), Flight (Z, Y), Hub (Z)

The following theorem shows a very important property: if Q2 is contained in Q1, then there must be a single conjunctive query in Q1 that contains Q2 on its own. In other words, two conjunctive queries in Q1 cannot “gang up” on Q2.

Theorem 2.2

Let image be a union of conjunctive queries and let Q2 be a conjunctive query. Then image if and only if there is an 1 ≤ in, such that image.

Proof

The if direction is obvious. If one of the conjunctive queries in Q1 contains Q2, then clearly image.

For the only if direction, we again consider the canonical database created by Q2, DC. Suppose the head of Q2 is image. Since image, then image should be in Q1 (DC), and therefore there is some 1 ≤ in, such that image. It is easy to see that image.

In Example 2.8, containment holds because Q2 is contained in the first rule defining Q1.

An important corollary of Theorem 2.2 is that the algorithm for checking containment of queries with union is a slight variation on the previous algorithm. Specifically, we create a canonical database from the body of Q2. If image is in the answer to Q1 over the canonical database, then image. In the case that Q2 is a union of conjunctive queries, image if and only if Q1 contains each of the conjunctive queries in Q2. Hence, the complexity results for conjunctive queries transfer over to queries with unions.

2.3.4 Conjunctive Queries with Interpreted Predicates

We now consider conjunctive queries with interpreted predicates, which have the form:

Q (image) :- R1 (image), …, Rn (image), c1, …, cm

where cj’s are interpreted atoms (we often call them comparisons), and are of the form XθY, where X and Y are either variables or constants (but at least one of them must be a variable). The operator θ is an interpreted predicate such as =, ≤, <, ≠, >, or ≥. We assume the obvious meaning for the interpreted predicates, and unless otherwise stated, we interpret them over a dense domain.

Conjunctive queries allow conjunctions of interpreted atoms. In some of our reasoning, we will also manipulate Boolean formulas that include disjunctions. We use the standard notation for logical entailment. Specifically, if C is a Boolean formula over interpreted atoms, and c is an interpreted atom, then image means that any variable substitution that satisfies C also satisfies c. For example, image, but image. Checking whether image can be done in time quadratic in the sizes of C and c.

The following definition extends the notion of containment mappings to queries with interpreted predicates.

Definition 2.3 (Containment Mapping with Interpreted Predicates).

Let Q1 and Q2 be conjunctive queries with interpreted atoms. Let C1 (resp. C2 ) be the conjunction of interpreted predicates in Q1 (resp. Q2 ). Let ψ be a variable mapping from Q1 to Q2. We say that ψ is a containment mapping from Q1 to Q2 if

image, where image and image are the head variables of Q1 and Q2, respectively,

for every non-interpreted subgoal image in the body of Q1, image is a subgoal of Q2, and

image.

It is easy to verify that one direction of Theorem 2.1 extends to queries with interpreted predicates. That is, if there is a containment mapping from Q1 to Q2, then image. However, as the following example shows, the converse is not true.

Example 2.9

Q1 (X, Y) :- R (X, Y), S (U, V), UV

Q2 (X, Y) :- R (X, Y), S (U, V), S (V, U)

It is easy to see that image because in either alternative, UV or U > V, the S subgoal in Q1 will be satisfied. However, there is no containment mapping with interpreted predicates from Q1 to Q2.

In fact, the reasoning that shows that image in Example 2.9 is the key to developing a containment algorithm for conjunctive queries with interpreted predicates. In the example, suppose we rewrite Q2 as the following union:

Q2 (X, Y) :- R (X, Y), S (U, V), S (V, U), UV

Q2 (X, Y) :- R (X, Y), S (U, V), S (V, U), U > V

Now it is fairly easy to verify using containment mappings that Q1 contains each of the two conjunctive queries in Q2. The following discussion will make this intuition precise.

We first introduce complete orderings that are refinements of a set of conjunctive comparison atoms. Specifically, a conjunction of comparison predicates, C, may still only specify partial knowledge about the orderings of the variables in C. For example, {X ≥ 5, Y ≤ 8} does not entail XY, XY, or X = Y. The complete orderings originating from C are the sets of comparisons that are consistent with C, but also completely determine all the order relations between pairs of variables. Naturally, there are multiple complete orderings consistent with a given conjunction, C.

Definition 2.4 (Complete Ordering and Query Refinements).

Let Q be a conjunctive query whose variables are image and mentioning the constants image. Let C be the conjunction of interpreted atoms in Q. A complete ordering CT on the variables image is a satisfiable conjunction of interpreted predicates, such that image and for every pair d1, d2, where image, one of the following holds:

image,

image,

image.

Given a conjunctive query

Q (image) :- R1 (image), …, Rn (image), C

let image be the complete orderings of the variables and constants in Q. Then

Q1 (image) :- R1 (image), …, Rn (image), c1

          image

Ql (image) :- R1 (image), …, Rn (image), cl

are the complete query refinements of Q. Note that image.

To check that Q1 contains Q2, we need to find a containment mapping from Q1 to each of the complete query refinements of Q2. The following theorem states this intuition formally.

Theorem 2.3

Let Q1 and Q2 be two conjunctive queries with interpreted predicates. Let image be the complete query refinements of Q2. image if and only if for every 1 ≤ il there is a containment mapping from Q1 to image.

Proof

The if direction is rather obvious since image and because containment mappings are a sufficient condition for containment even with interpreted predicates.

For the only if direction, consider l canonical databases, image, each constructed as before, except that the constants in image are rational numbers that satisfy the interpreted predicates in image. Let ti be an answer derived by image from image. Since image, then image. A similar argument to the proof of Theorem 2.1 shows that there is a containment mapping from Q1 to image.

The practical problem with the algorithm implied by Theorem 2.3 is that the number of complete query refinements of Q2 can be quite large. In fact, the number of refinements can be exponential in the size of Q2. Recall that the number of refinements corresponds to all the different orderings on constants and variables appearing in the query. As a result, the computational complexity of checking containment for conjunctive queries with interpreted predicates is image-complete, which is the class of problems described by formulas of the form image, where P is a condition that can be verified in polynomial time. Here, X ranges over the set of refinements, Y ranges over the set of variable mappings, and P verifies that the variable mapping is a containment mapping.

Fortunately, in practice we do not need to consider all possible refinements of Q2. Algorithm 2 describes, CQIPContainment, a more efficient algorithm for query containment with interpreted predicates.

Applying algorithm CQIPContainment to our example, we would evaluate Q1 on the database DC that contains the tuples R(X, Y), S(U, V), S(V, U). Q1(DC) would contain the following pairs: image. Hence, image. Since C2 = True, we get image, and therefore image.

2.3.5 Conjunctive Queries with Negation

Next, we consider queries with negation that have the following form:

Q (image) :- R1 (image), …, Rn (image), ¬S1 (image), …, ¬Sm (image)

image

Algorithm 2 CQIPContainment: Query containment for conjunctive queries with interpreted predicates.

We require that the queries be safe, i.e., that any variable appearing in the head of the query also appear in a positive subgoal. For ease of exposition, we consider queries without interpreted predicates.

The natural extension of containment mappings to queries with negation ensures that positive subgoals are mapped to positive subgoals and negative subgoals are mapped to negative subgoals. The following theorem shows that containment mappings offer a sufficient condition for containment. We leave the proof to the reader, as it is an extension of the proof of Theorem 2.1.

Theorem 2.4

Let Q1 and Q2 be safe conjunctive queries with negated subgoals. If there is a containment mapping ψ from Q1 to Q2 such that ψ maps the positive subgoals of Q1 to positive subgoals of Q2 and maps the negative subgoals of Q1 to negative subgoals of Q2, then image.

Here, too, containment mappings do not provide a necessary condition for containment. It is actually quite tricky to find an example where containment holds but there is no containment mapping, but the following example illustrates this case.

Example 2.10

Consider the queries Q1 and Q2. Note that they do not have head variables, and are therefore Boolean queries—they return true or false.

Q1() :- a (A, B), a (C, D), ¬a (B, C)

Q2() :- a (X, Y), a (Y, Z), ¬a (X, Z)

A little bit of thought will show that image; however, there is no containment mapping from Q1 to Q2.

Recall that in the previous cases, the key to proving containment was to consider a very special canonical database (or set of databases). If containment was satisfied on all canonical databases, then we were able to derive that containment holds on all databases. In the case of queries with negated subgoals, the key to proving containment is to show that we only need to consider databases of a certain size.

Theorem 2.5

Let Q1 and Q2 be two conjunctive queries with safe negated subgoals and assume they mention disjoint sets of variables. Let B be the total number of variables and constants in Q2. Then, image if and only if image on all databases D that have at most B constants.

Proof

Let Q1 and Q2 be of the following form (we omit the variables in the subgoals):

Q1 (image) :- image

Q2 (image) :- image

The only if direction is obvious, so we consider the if direction. Suppose image. It suffices to show that there is a counterexample database with at most B constants.

Since image, there must exist a database D, a tuple image, and a mapping ϕ from the variables of Q2 to the constants in D, such that (1) image, (2) image, for 1 ≤ im, (3) image for 1 ≤ ik, and (4) image.

Consider the database D′ that includes only the tuples from D that either have only constants in the range of ϕ or have constants from Q2.

First, we argue that image. This follows from the construction of D′. The range of ϕ is unaffected, and therefore the positive atoms that contributed to deriving image are in D′ and the negative atoms certainly hold because D′ is a subset of D. Hence, (2) and (3) above still hold. In fact, the same argument shows that image for any database D″ such that image.

Second, we argue that we can build a database D″, where image and D″ is a counterexample. Furthermore, the number of constants in D″ is at most B. Note that the number of constants in D′ is bounded by B.

Let us consider two cases. In the first case, there was no mapping, ψ, from the variables of Q1 to D, such that image for 1 ≤ in. In this case, there certainly is no such mapping from Q1 to D′, and therefore D′ is a counterexample because image.

In the second case, there may be mappings from the variables of Q1 to D′ that satisfy its positive subgoals. We construct D″ as following, after initializing it to D′.

Let ψ be a mapping from the variables of Q1 to D′, such that image for 1 ≤ in. There must be a j, 1 ≤ jl, such that image. Otherwise, image would be in the answer to Q1 (D). We add image to D″, and therefore ψ is no longer a variable mapping that would lead to image being in image. Note that adding image to D″ did not change the number of constants in D″ because all the variables that occur in rj must occur in one of the pi’s.

Since the number of tuples we may add to D″ in this way is bounded, we will ultimately not be able to add any more tuples. The database D″ has at most B constants, and image. Hence, D″ is a counterexample, establishing the theorem.

Theorem 2.5 entails that it suffices to check that containment holds on all databases with up to B constants (up to isomorphism). Since there are a finite number of such databases, we can conclude that query containment is decidable. In what follows we describe a strategy that is more efficient than naively enumerating all possible such databases.

Let C denote the variables appearing in Q2. We construct a set of canonical databases whose constants are C. Intuitively, each canonical database corresponds to a set of equality constraints applied to the elements of C (in the same spirit of refinements in Section 2.3.4, but considering only the = predicate).

Formally, in each canonical database we apply a partition to elements of C. A partition of C maps the constants in C to a set of equivalence classes and applies a homomorphism to C where each constant is mapped to a unique representative of its equivalence class. Let k be the number of possible partitions of C, and ϕk the homomorphism associated with the kth partition. We create databases D1, …, Dk, where the database Di includes the tuples image. For each Di we construct a set of databases image as follows:

If the body of Q2 is not satisfied by Di, we set image to be the empty set. Note that Di may not satisfy the body of Q2 because a negative subgoal of Q2 may not be satisfied.

If the body of Q2 is satisfied by Di, then image is the set of canonical databases that can be constructed from Di as follows:

Add any subset of tuples to Di that includes only constants from Di, but do not add the image.

The containment image holds if and only if for every i, 1 ≤ ik and for every image, image.

Example 2.11

We illustrate the containment algorithm on the queries in Example 2.10. Q2 includes four constants, A, B, C, and D.

We first consider the partition in which all variables in Q2 are equated, yielding the database D1 = {a(A, A)}. The body of Q2 is not satisfied on D1 because the negative subgoal is not satisified. Hence, image is the empty set and containment holds trivially.

Now consider the other extreme, the partition in which each variable is in its own equivalence class, yielding the database D2 = {a(A, B), a(C, D)}. Since Q2 is satisfied in D2, image includes any database that can be constructed by adding tuples to D2 that have the constants A, B, C, and D, but not the atom a (B, C). A case-by-case analysis verifies that Q1 is satisfied in every database in image.

In the same fashion, the reader can verify containment holds when we consider the other partitions of the variables in Q2.

Computational Complexity

The computational complexity of checking containment for conjunctive queries with negated subgoals is image-complete, which is the class of problems described by formulas of the form image, where P is a condition that can be verified in polynomial time. Here, X ranges over the set of canonical databases image, Y ranges over the set of variable mappings, and P is the function that verifies that the variable mapping is a containment mapping.

2.3.6 Bag Semantics, Grouping, and Aggregation

SQL queries operate, by default, on bags (multi-sets) of tuples, rather than sets of tuples. Answers to queries in SQL are also bags by default, unless the DISTINCT keyword is used. The key difference in bag semantics is that we also count the number of times a tuple appears in a relation (either the input or the result of a query). We refer to that number as the tuple’s multiplicity. For example, consider the following query over the table in Figure 2.2:

Q1 (X, Y) :- Flight (X, Z, W), Flight (Z, Y, W1)

image

Figure 2.2 A simple flight table that illustrates the difference between set and bag semantics.

Under set semantics, the answer to Q1 would contain a single tuple (San Francisco, Anchorage). Under bag semantics, the tuple (San Francisco, Anchorage) appears in the answer with multiplicity 2, because there are two pairs of flights that connect San Francisco with Anchorage.

The topic of query containment and equivalence for queries with bag semantics is quite involved. Query containment and equivalence for queries with grouping and aggregation depend heavily on our understanding of containment for bags. This section gives an overview of the issues and some of the results known in this area.

To discuss containment of queries with aggregation, we first need to modify the definition of containment slightly to account for the different semantics. Specifically, we say that image with bag semantics if for any database D and for any tuple image, the multiplicity of image in Q1 (D) is at least as much as the multiplicity of image in Q2 (D). As before, image are equivalent if image and image.

Example 2.12

As a trivial example showing that equivalence under set semantics does not imply equivalence under bag semantics, consider the following two queries:

Q1 (X) :- P (X)

Q2 (X) :- P (X), P (X)

Suppose an instance of P has n occurrences of a constant d. Then the result of Q2 would contain image occurrences of d, while Q1 would only contain n occurrences of d. Hence, while Q1 and Q2 are equivalent under set semantics, they are not equivalent under bag semantics.

Query containment for conjunctive queries without interpreted predicates and without negation is not even known to be decidable. Furthermore, it is known that when interpreted predicates are allowed or when unions are allowed, query containment is undecidable. There is more known on query equivalence. In fact, the following theorem states that for two conjunctive queries to be bag-equivalent, they need to be isomorphic to each other. The bibliographic notes include a reference to the proofs of the theorems in this section.

Theorem 2.6

Let Q1 and Q2 be conjunctive queries. Then, Q1 and Q2 are bag-equivalent if and only if there is a 1-1 containment mapping (i.e., isomorphism) from Q1 to Q2.

Queries with Grouping and Aggregation

Conditions for equivalence for queries with grouping and aggregation highly depend on the specific aggregation function in the query. Some aggregation functions are sensitive to multiplicities (e.g., Average and Count), while others are not (e.g., Max and Min). The following are two examples of where precise conditions are known for equivalence of such queries.

Count Queries

These queries are of the form

Q (image, count(image)) :- R1 (image), …, Rn (image)

The query computes the table specified by the body subgoals, groups the results by the variables in image, and for every group outputs the number of different values for image.

The following theorem shows that count queries are sensitive to multiplicities. The theorem can be extended to queries with interpreted predicates using the concept of complete orderings we introduced earlier.

Theorem 2.7

Let Q1 and Q2 be two count queries. The equivalence image holds if and only if there is a variable mapping ψ from the variables of Q1 to the variables of Q2 thatinduces an isomorphism between the bodies of Q1 and Q2 and an isomorphism between the heads of Q1 and Q2.

Max Queries

It can be shown that max queries are not sensitive to multiplicities. A max query has the form

Q (image, max(Y)) :- R1 (image), …, Rn (image)

The query computes the table specified by the body subgoals, groups the results by the variables in image, and for every group outputs the maximum value of the variable Y. We define the core of a max query to be the query where the variable Y appears in the head without the aggregation function, i.e.,

Q (image, Y) :- R1 (image), …, Rn (image)

The following theorem establishes the property of max queries:

Theorem 2.8

Let Q1 and Q2 be two max queries with no comparison predicates. The equivalence image holds if and only if the core of Q1 is equivalent to the core of Q2.

Example 2.13

The following example shows that Theorem 2.8 does not apply to queries with interpreted predicates.

Q1 (max(Y)) :- p (Y), P (Z1), p (Z2), Z1 < Z2

Q2 (max(Y)) :- p (Y), P (Z), Z < Y

Both queries return answers if P contains at least two different constants. However, while Q1 considers all tuples in P, Q2 considers all the tuples except for the one with the smallest value. Hence, the maximum value is always the same (and hence, image), but the cores are not equivalent. Fortunately, there is a more elaborate condition for checking containment of max queries with interpreted predicates (see the bibliographic references for more details).

2.4 Answering Queries Using Views

In the previous section we described query containment, which is a fundamental relationship between a pair of queries. One of the important uses of query containment is that when we detect that a query Q1 is contained in a query Q2, we can often compute the answer to Q1 from the answer to Q2. In this section we are interested in the more general problem of when we can answer a query Q from a collection of views. The following example illustrates our goal.

Example 2.14

Consider the familiar movie domain with the following relations. In the relations, the first argument is a numerical ID of the movies in the database.

Movie (ID, title, year, genre

Director (ID, director)

Actor (ID, actor)

Suppose a user poses the following query, asking for comedies produced after 1950 where the director was also an actor in the movie:

Q (T, Y, D) :- Movie (I, T, Y, G), Y ≥ 1950, G = “comedy”, Director (I, D), Actor (I, D)

Suppose that, in addition, we have access to the following view over the database:

V1 (T, Y, D) :- Movie (I, T, Y, G), Y ≥ 1940, G = “comedy”, Director (I, D), Actor (I, D)

Since the selection on year in V1 is less restrictive than in Q, we can infer that image. Since the year attribute is in the head of V1, we can use V1 to answer Q simply by adding another selection:

Q′ (T, Y, D) :- V1 (T, Y, D), Y ≥ 1950

Answering Q using V1 is likely to be more efficient than answering Q1 directly from the database, because we do not need to perform the join operations in Q. However, suppose that instead of V1 we only had access to the following views:

V2 (I, T, Y) :- Movie (I, T, Y, G), Y ≥ 1950, G = “comedy”

V3 (I, D) :- Director (I, D), Actor (I, D)

Neither V2 nor V3 contain Q, and therefore the same reasoning as above does not apply. However, we can still answer Q using V2 and V3 as follows, also leading to a more efficient evaluation plan in many cases:

Q″ (T, Y, D) :- V2 (I, T, Y), V3 (I, D)

The example above illustrates that deciding whether a query can be answered from a set of precomputed views is a more general problem than query containment, because we need to consider combinations of views. In data integration, such combinations will be quite important because we will consider combinations of data sources, each corresponding to a view, to answer a user query. This section describes algorithms for answering queries using views and provides a deeper understanding for when views can be used and when not.

2.4.1 Problem Definition

We begin by formally defining the problem of answering queries using views. Throughout this section we assume that we are given a set of views denoted as V1, …, Vm. Unless stated otherwise, we will assume that the language for specifying the query and the views is conjunctive queries.

Given a query Q and a set of view definitions V1, …, Vm, a rewriting of the query using the views is a query expression Q′ that refers only to the views V1, …, Vm. In SQL, a query refers only to the views if all the relations mentioned in the FROM clause are views. In the notation of conjunctive queries, a query refers only to the views if all the subgoals are views with the exception of interpreted atoms. In practice, we may also be interested in rewritings that can also refer to the database relations, but all the algorithms we describe in this section can easily be extended to that case.

The most natural question to pose is whether there is an equivalent rewriting of the query that uses the views.

Definition 2.5 (Equivalent Query Rewritings).

Let Q be a query and image be a set of view definitions. The query Qis an equivalent rewriting of Q using image if

Qrefers only to the views in image, and

Qis equivalent to Q.

For example, the query Q″ in Example 2.14 is an equivalent rewriting of Q using V2 and V3. Note that when we consider the equivalence between a query Q and a rewriting Q′, we actually need to consider the unfolding of Q′ w.r.t. the views.

There may not always be an equivalent rewriting of the query using a set of views, and we may want to know what is the best we can do with a set of views. For example, suppose that instead of V2 in Example 2.14, we had the following view that includes comedies produced after 1960:

V4 (I, T, Y) :- Movie (I, T, Y, G), Y ≥ 1960, G = “comedy”

While we can’t use V4 to completely answer Q1, the following rewriting is the best we can do and may be our only choice if we do not have access to any other data:

Qent (T, Y, D) :- V4 (I, T, Y), V3 (I, D)

Qent is called a maximally contained rewriting of Q using V3 and V4. Intuitively, Qent is the best conjunctive query that uses the views to answer the query. If we also have the following view:

V5 (I, T, Y) :- Movie (PI, T, Y, G), Y ≥ 1950, Y ≤ 1955, G = “comedy”

then we can construct a union of conjunctive queries that is the best rewriting we can find.

The following definition makes these intuitions precise. To define maximally contained rewritings, we first need to specify the exact language we consider for rewritings (e.g., do we allow unions, interpreted predicates?). If we restrict or otherwise change the language of the rewriting, the maximally contained rewriting may change. In the definition below, the query language is denoted by image.

Definition 2.6 (Maximally Contained Rewritings)

Let Q be a query, image be a set of view definitions, and image be a query language. The query Qis a maximally contained rewriting of Q using image w.r.t. image if

Qis a query in image that refers only to the views in image,

Qis contained in Q, and

there is no rewriting image, such that image and Q1 is not equivalent to Q′.

When a rewriting Q′ is contained in Q but is not a maximally contained rewriting we refer to it as a contained rewriting.

Example 2.15

If we consider the language of the rewriting to be unions of conjunctive queries, the following is the maximally contained rewriting of Q:

Q(4) (T, Y, D) :- V4 (I, T, Y), V3 (I, D)

Q(4) (T, Y, D) :- V5 (I, T, Y), V3 (I, D)

If we restrict image to be conjunctive queries, then either of the two rules defining Q(4) would be a maximally contained rewriting. This example illustrates that the maximally contained rewriting need not be unique.

The algorithms for finding equivalent rewritings differ a bit from those for finding maximally contained rewritings. As we explain in Chapter 3, in the context of data integration, views describe the contents of the data sources. However, the sources may be incomplete and not contain all the data conforming to their descriptions. In addition, the sources together may not cover the entire domain of interest. In the rest of this chapter, unless otherwise mentioned, our goal is to find a union of conjunctive queries that is the maximally contained rewriting of a conjunctive query using a set of conjunctive views.

2.4.2 When is a View Relevant to a Query?

Informally, a few conditions need to hold in order for a view to be useful for answering a query. First, the set of relations the view mentions should overlap with that of the query. Second, if the query applies predicates to attributes that it has in common with the view, then the view must apply either equivalent or logically weaker predicates in order to be part of an equivalent rewriting. If the view applies a logically stronger predicate, it may be part of a contained rewriting.

The following example illustrates some of the subtleties in answering queries using views. Specifically, it shows how small modifications to the views render them useless for answering a given query.

Example 2.16

Consider the following views:

V6 (T, Y) :- Movie (I, T, Y, G), Y ≥ 1950, G = “comedy”

V7 (I, T, Y) :- Movie (I, T, Y, G), Y ≥ 1950, G = “comedy”, Award (I, W)

V8 (I, T) :- Movie (I, T, Y, G), Y ≥ 1940, G = “comedy”

V6 is similar to V1, except that it does not include the ID attribute of the Movie table in the head, and therefore we cannot perform the join with V2. The view V7 considers only the comedies produced in or after 1950 that also won at least one award. Hence, the view applies an additional condition that does not exist in the query and cannot be used in an equivalent rewriting. Note, however, that if we have an integrity constraint stating that every movie wins an award (unlikely, unfortunately), then V7 would actually be usable. Finally, view V8 applies a weaker predicate on the year than in the query, but the year is not included in the head. Therefore, the rewriting cannot apply the appropriate predicate on the year.

The next few sections will give precise conditions and efficient algorithms for answering queries using views.

2.4.3 The Possible Length of a Rewriting

Before we discuss specific algorithms for rewriting queries using views, a first question we may ask is: What is the space of possible rewritings we even need to consider? We now describe a fundamental result that shows that if a conjunctive query has n subgoals, then we need not consider query rewritings that have more than n subgoals. This result enables us to focus on the set of rewritings that have at most n subgoals, and find efficient ways of searching this set.

Theorem 2.9

Let Q and image be conjunctive queries without comparison predicates or negation, and let n be the number of subgoals in Q.

If there is an equivalent conjunctive rewriting of Q using image, then there is one with at most n subgoals.

If Qis a contained conjunctive rewriting of Q using image, and has more than n subgoals, then there is a conjunctive rewriting Q″, such that image, and Qhas at most n subgoals.

Proof

Let the query be of the form

Q (image) :- e1 (image), …, en (image)

and suppose that

Q′ (image) :- v1 (image), …, vm (image)

is an equivalent rewriting of Q using image, where the vi’s are subgoals referring to view relations, and that m > n. Let the unfolding of Q′ w.r.t. the view definitions be

image

Note that image is equivalent to Q′. Since image, there must be a containment mapping from Q to image. Let the containment mapping from Q to image be ψ. Since Q has only n subgoals, the image of ψ can only be present in at most n rows of the definition of image. Let us assume without loss of generality that these are the first k rows, where kn. Now consider the rewriting Q″ that has only the first k subgoals of Q′. The containment mapping ψ still establishes that image, and since Q″ has a subset of the subgoals of Q′, then clearly image, and hence image. Consequently, image, and Q″ has at most n subgoals. The proof of the second part of the theorem is almost identical.

Theorem 2.9 yields a first naive algorithm for finding a rewriting of a query given a set of views. Specifically, there are a finite number of rewritings of length at most n subgoals using a set of views image, because there are a finite number of variable patterns we can consider for each view. Hence, to find an equivalent rewriting, the algorithm can proceed as follows:

Guess a rewriting Q′ of Q that has at most n subgoals.

Check if image.

Similarly, the union of all rewritings Q′ of length at most n subgoals, such that image is a maximally contained rewriting of Q.

This algorithm shows that finding an equivalent rewriting of a conjunctive query using a set of conjunctive views is in NP. In fact, it can be shown that the problem is also NP-hard, and therefore NP-complete. Employing this algorithm is impractical because it means enumerating all possible rewritings of length at most n subgoals and checking equivalence of these rewritings to the query. The next sections describe algorithms for finding a maximally contained rewriting that are efficient in practice.

2.4.4 The Bucket and MiniCon Algorithms

This section describes two algorithms, the Bucket Algorithm and the MiniCon Algorithm, that drastically reduce the number of rewritings we need to consider for a query given a set of views. Broadly speaking, the algorithms first determine a set of view atoms that are relevant to each subgoal, and then consider only the combinations of these view atoms. The MiniCon Algorithm goes a step further and also considers some interactions between view subgoals, therefore further reducing the number of combinations that need to be considered.

The Bucket Algorithm

The main idea underlying the Bucket Algorithm is that the number of query rewritings that need to be considered can be drastically reduced if we first consider each subgoal in the query in isolation, and determine which views may be relevant to that subgoal. As we discuss later, the Bucket Algorithm is not guaranteed to find all the maximally contained rewritings. However, since the reduction in the number of rewritings is best illustrated when the query and views include interpreted predicates, we use them in our algorithm description and example.

The first step of the algorithm is shown in Algorithm 3. The algorithm constructs for each subgoal g in the query a bucket of relevant view atoms. A view atom is relevant if one of its subgoals can play the role of g in the rewriting. To do that, several conditions must be satisfied: (1) the view subgoal should be over the same relation as g, (2) the interpreted predicates of the view and the query are mutually satisfiable after the appropriate substitution is made, and (3) if g includes a head variable of the query, then the corresponding variable in V must also be a head variable in the view.

image

Algorithm 3 CreateBuckets: Creating the buckets.

The second step of the Bucket Algorithm considers all the possible combinations of rewritings. Each combination that includes a view atom from each bucket (eliminating duplicate atoms if necessary). This phase of the algorithm differs depending on whether we are looking for an equivalent rewriting of the query using the views or the maximally contained rewriting:

For an equivalent rewriting, we consider each candidate rewriting Q′ and check whether image or whether there are interpreted atoms C that can be added to Q′ such that image.

For the maximally contained rewriting we construct a union of conjunctive queries by considering each conjunctive rewriting Q′:

If image, then Q′ is included in the union.

If there exist interpreted atoms C that can be added to Q′ such that image, then image is included in the union.

If image but there exists a homomorphism, ψ, on the head variables of Q′ such that image, then we include image in the union.

Example 2.17

Continuing with the example from the movie domain, suppose we also have the relation Revenues(ID, Amount), describing the revenues each movie garnered over time, and suppose movie IDs are integers. Consider a query that asks for directors whose movies garnered a significant amount of money:

Q (ID, Dir) :- Movie (ID, Title, Year, Genre), Revenues (ID, Amount), Director (ID, Dir), Amount ≥ $100M

Suppose we have the following views:

V1 (I, Y) :-Movie (I, T, Y, G), Revenues (I, A), I ≥ 5000, A ≥ $200M

V2 (I, A) :-Movie (I, T, Y, G), Revenues (I, A)

V3 (I, A) :-Revenues (I, A), A ≤ $50M

V4 (I, D, Y) :-Movie (I, T, Y, G), Director (I, D), I ≤ 3000

In the first step the algorithm creates a bucket for each of the relational subgoals in the query in turn. The resulting contents of the buckets are shown in Table 2.1. The bucket of Movie(ID,Title,Year,Genre) includes views V1, V2, and V4. Note that each view head in a bucket only includes variables in the domain of the mapping. Fresh variables (primed) are used for the other head variables of the view (such as A’ in V2(ID,A’)).

Table 2.1 Contents of the Buckets*.

Movie(ID, Title, Year, Genre) Revenues(ID, Amount) Director(ID, Dir)
V1 (ID, Year) V1 (ID, Y’) V4 (ID, Dir, Y’)
V2 (ID, A’) V2 (ID, Amount)
V4 (ID, D’, Year)

* The primed variables are those that are not in the domain of the unifying mapping.

The bucket of the subgoal Revenues(ID,Amount) contains the views V1 and V2, and the bucket of Director(ID,Dir) includes an atom of V4. There is no atom of V3 in the bucket of Revenues(I,A) because constraint on the revenues considered in V3 is not mutually satisfiable with the predicates in the query.

In the second step of the algorithm, we combine elements from the buckets. The first combination, involving the first element from each bucket, yields the rewriting

q1 (ID, Dir):- V1 (ID, Year), V1 (ID, Y′), V4 (ID, Dir, Y″)

However, while both V1 and V4 are relevant to the query in isolation, their combination is guaranteed to be empty because they cover disjoint sets of movie identifiers.

Considering the second elements in the two left buckets yields the rewriting

q2 (ID, Dir) :- V2 (ID, A′), V2 (ID, Amount), V4 (ID, Dir, Y′)

As is, this rewriting is not contained in the query, but we can add the predicate Amount ≥ $100M, and remove one redundant subgoal V2 (ID, A′) to obtain a contained rewriting

q3 (ID, Dir) :- V2 (ID, Amount), V4 (ID, Dir, Y′), Amount ≥ $100M

Finally, combining the last elements in each of the buckets yields

q4 (ID, Dir) :- V4 (ID, D′, Year), V2 (ID, Amount), V4 (ID, Dir, Y′)

However, after removing the first subgoal, which is redundant, and adding the predicate Amount ≥ $100M, we would obtain q3 again, which is the only contained rewriting the algorithm finds.

The MiniCon Algorithm

The MiniCon Algorithm also proceeds in two steps. It begins like the Bucket Algorithm, considering which views contain subgoals that are relevant to a subgoal g in the query. However, once the algorithm finds a partial mapping from a subgoal g in the query to a subgoal g1 in a view V, it tries to determine which other subgoals from that view must also be used in conjunction with g1. The algorithm considers the join predicates in the query (which are specified by multiple occurrences of the same variable) and finds the minimal additional set of subgoals that need to be mapped to subgoals in V, given that g will be mapped to g1. This set of subgoals and mapping information is called a MiniCon Description (MCD) and can be viewed as a generalization of buckets. The following example illustrates the intuition behind the MiniCon Algorithm and how it differs from the Bucket Algorithm.

Example 2.18

Consider the following example over the same movie schema:

Q1 (Title, Year, Dir) :- Movie (ID, Title, Year, Genre), Director (ID, Dir), Actor (ID, Dir)

V5 (D, A)      :- Director (I, D), Actor (I, A)

V6 (T, Y, D, A)   :- Director (I, D), Actor (I, A), Movie (I, T, Y, G)

In its first step, the Bucket Algorithm would put the atoms V5 (Dir, A’) and V5 (D’, Dir) in the buckets of Director (ID, Dir) and Actor (ID, Dir), respectively. However, a careful analysis reveals that these two bucket items cannot contribute to a rewriting. Specifically, suppose the variable Dir is mapped to the variable D in V5 (see figure below). The variable ID needs to be mapped to the variable I in V5, but I is not a head variable of V5, and therefore we will not be able to join the Director subgoal of V5 with the other subgoals in the query.

Q1 (Title, Year, Amount):- Director (ID, Dir), Actor (ID, Dir), Movie (ID, Title, Year, Genre)

               ↓    ↓

V5 (D, A)        :- Director (I, D) Actor (I, A)

The MiniCon Algorithm realizes that V5 cannot be used to answer the query and therefore ignores it. We now describe the details of the algorithm.

Step 1: Creating MCDs

A MCD is a mapping from a subset of the variables in the query to variables in one of the views. Intuitively, a MCD represents a fragment of a containment mapping from the query to the rewriting of the query. The way in which we construct the MCDs guarantees that these fragments can later be combined seamlessly.

In the algorithm description, we use the following terms. First, given a mapping τ from Vars (Q) to Vars (V), we say that a view subgoal g1 covers a query subgoal g if image. Second, we often need to consider specializations of a view, formed by equating some of the head variables of the view (e.g., V6 (T, Y, D, D) instead of V6 (T, Y, D, A) in our example). We describe these specializations with head homomorphisms. A head homomorphism h on a view V is a mapping h from Vars (V) to Vars (V) that is the identity on the existential variables, but may equate distinguished variables, i.e., for every distinguished variable x, h (x) is distinguished, and h (x) = h (h (x)). We can now define MCDs formally.

Definition 2.7 (MiniCon Descriptions)

A MCD C for a query Q over a view V is a tuple of the form image where

hC is a head homomorphism on V,

image is the result of applying hC to V, i.e., image, where image are the head variables of V,

φC is a partial mapping from Vars (Q) to hC (Vars (V)), and

GC is a subset of the subgoals in Q that are covered by some subgoal in hC (V) using the mapping φC.

In words, φC is a mapping from Q to the specialization of V obtained by the head homomorphism hC. The main idea underlying the algorithm is to carefully choose the set GC of subgoals of Q that we cover by the mapping φC. The algorithm will use only the MCDs that satisfy the following property:

Property 2.1

Let C be a MCD for Q over V. The MiniCon Algorithm considers C only if it satisfies the following conditions.

C1. For each head variable X of Q that is in the domain of φC, image is a head variable in image.

C2. If image is an existential variable in image, then for every g, subgoal of Q, that includes X (1) all the variables in g are in the domain of φC, and (2) image.

Clause C1 is also required by the Bucket Algorithm. Clause C2 captures the intuition we illustrated in our example. If a variable X is part of a join predicate that is not enforced by the view, then X must be in the head of the view so the join predicate can be applied by another subgoal in the rewriting. In our example, clause C2 would rule out the use of V5 for query Q1.

Algorithm 4 builds the MCDs. Note that the algorithm will not consider all the possible MCDs, but only those in which hC is the least restrictive head homomorphism necessary in order to unify subgoals of the query with subgoals in a view.

image

Algorithm 4 FormMCDs: First phase of the MiniCon Algorithm. Note that condition (b) minimizes Gc given a choice of hC and φC, and is therefore not redundant with condition (c).

Example 2.19

In our example, after realizing that V5 cannot be part of any MCD, the algorithm would create an MCD for V6, whose components are

(A → D, V6 (T, Y, D, D), Title → T, Year → Y, Dir → D, {1, 2, 3})

Note that in the MCD the head homomorphism equates the variables D and A in V6, and that the MCD includes all the subgoals from the query.

Step 2: Combining the MCDs

In the second phase, the MiniCon Algorithm combines the MCDs to create conjunctive rewritings and outputs a union of conjunctive queries. Because of the way in which the MCDs were constructed, the second phase of the algorithm is actually simpler and more efficient than the corresponding one in the Bucket Algorithm. Specifically, any time a set of MCDs covers mutually disjoint subsets of subgoals in the query, but together cover all the subgoals, the resulting rewriting is guaranteed to be a contained rewriting. Hence, we do not need to perform any containment checks in this phase. Algorithm 5 combines MCDs.

Minimizing the Rewritings

The rewritings resulting from the second phase of the algorithm may be redundant. In general, they can be minimized by techniques for conjunctive query minimization. However, one case of redundant subgoals can be identified more simply as follows. Suppose a rewriting Q′ includes two atoms A1 and A2 of the same view V, whose MCDs were C1 and C2, and the following conditions are satisfied: (1) whenever A1(resp. A2) has a variable from Q in position i, then A2(resp. A1) has either the same variable or a variable that does not appear in Q in that position, and (2) the ranges of image and image do not overlap on existential variables of V. In this case we can remove one of the two atoms by applying to Q′ the homomorphism τ that is (1) the identity on the variables of Q and (2) the most general unifier of A1 and A2.

Constants in the Query and Views

When the query or the view includes constants, we make the following modifications to the algorithm. First, the domain and range of φC in the MCDs may also include constants. Second, a MCD also records a (possibly empty) set of mappings ψC from variables in Vars (Q) to constants.

When the query includes constants, we add the following condition to Property 2.1:

C3. If a is a constant in Q, it must be the case that either (1) image is a distinguished variable in image or (2) image is the constant a.

image

Algorithm 5 CombineMCDs: Second phase of the MiniCon Algorithm—combining the MCDs.

When the views have constants, we modify Property 2.1 as follows:

We relax clause C1: a variable x that appears in the head of the query must either be mapped to a head variable in the view (as before) or be mapped to a constant a. In the latter case, the mapping xa is added to ψC.

If image is a constant a, then we add the mapping xa to ψC. (Note that condition C2 only applies to existential variables, and therefore if image is a constant that appears in the body of V but not in the head, a MCD is still created.)

Next, we combine MCDs with some extra care. Two MCDs, C1 and C2, both of which have x in their domain, can be combined only if either they (1) both map x to the same constant, or (2) one (e.g., C1) maps x to a constant and the other (e.g., C2) maps x to a distinguished variable in the view. Note that if C2 maps x to an existential variable in the view, then the MiniCon Algorithm would never consider combining C1 and C2 in the first place, because they would have overlapping GC sets.

Finally, we modify the definition of EC, such that whenever possible, it chooses a constant rather than a variable.

Computational Complexity

The worst-case computational complexity of the Bucket and MiniCon Algorithms is the same. In both cases the running time is image, where n is the number of subgoals in the query, m is the maximal number of subgoals in a view, and M is the number of views.

2.4.5 A Logical Approach: The Inverse-Rules Algorithm

In this section we describe an algorithm for rewriting queries using views that takes a purely logical approach to the problem. The following example illustrates the intuition behind the algorithm.

Example 2.20

Consider the view from our previous example:

V7 (I, T, Y, G) :- Movie (I, T, Y, G), Director (I, D), Actor (I, D)

Suppose we know that the tuple (79522, Manhattan, 1979, Comedy) is in the extension of V7. Clearly, we can infer from it that Movie (79522, Manhattan, 1979, Comedy) holds. In fact, the following rule would be logically sound:

IN1 : Movie (I, T, Y, G) :- V7 (I, T, Y, G)

We can also infer from V7 (79522, Manhattan, 1979, Comedy) that some tuples exist in Director and Actor, but it’s a bit trickier. In particular, we do not know which value of D in the database yielded V7 (79522, Manhattan, 1979, Comedy). All we know is that there is some constant, c, such that Director (79522, c), Actor (79522, c) hold. We can express this inference using the following rules:

IN2 : Director (I, f1 (I, T, Y, G)) :- V7 (I, T, Y, G)

IN3 : Actor (I, f1 (I, T, Y, G))  :-V7 (I, T, Y, G)

The term f1 (I, T, Y, G) is called a Skolem term and denotes some constant that depends on the values I, T, Y, and G and the function name f1. Given two nonidentical Skolem terms, we do not know whether they refer to the same constant in the database or not, only that they both exist.

The query rewriting produced by the Inverse-Rules Algorithm includes all the inverse rules we can write for each of the view definitions and the rule defining the query. In our example, the inverse rules are IN1, IN2, and IN3. To illustrate, suppose our query asked for all movie titles, with their genres and years:

Q2 (Title, Year, Genre) :- Movie (ID, Title, Year, Genre)

Evaluating the inverse rules over the extension V7 (79522, Manhattan, 1979, Comedy) yields the tuples

Movie (79522, Manhattan, 1979, Comedy),

Director (79522, f1 ((79522, Manhattan, 1979, Comedy))

Actor (79522, f1 ((79522, Manhattan, 1979, Comedy)).

Evaluating Q2 on that set of tuples would yield the answer Movie (Manhattan, 1979, Comedy).

image

Algorithm 6 CreateInverseRules: Inverse rule rewriting algorithm. Note that the answer contains the original query and the inverse rules image.

Algorithm 6 creates the inverse-rule rewriting.

2.4.6 Comparison of the Algorithms

The strength of the Bucket Algorithm is that it exploits the comparison atoms in the query to prune significantly the number of candidate conjunctive rewritings that need to be considered. Checking whether a view should belong to a bucket can be done in time polynomial in the size of the query and view definition when the predicates involved are arithmetic comparisons. Hence, if the views are indeed distinguished by having different interpreted predicates, then the resulting buckets will be relatively small. This is a common scenario when we apply answering queries using views to integrating data on the World Wide Web, where sources (modeled as views) are distinguished by the geographical area they apply to or other interpreted predicates.

However, the Bucket Algorithm does not consider interactions between different subgoals in the query and the views, and therefore the buckets may contain irrelevant subgoals. The MiniCon Algorithm overcomes that issue. Furthermore, because of the way MCDs are built, the second phase of the MiniCon Algorithm does not require a containment check, and is thereby more efficient. In experiments, the MiniCon Algorithm is uniformly faster than the Bucket Algorithm.

The main advantage of the Inverse-Rules Algorithm over the Bucket and MiniCon algorithms is its conceptual simplicity, being based on a purely logical approach to inverting the view definitions. Because of this simplicity, the algorithm can also be applied when the query Q is recursive and, as we discuss in Chapter 3, to cases where there are known functional dependencies. In addition, the inverse rules can be created in time that is polynomial in the size of the view definitions. Note that the algorithm does not tell us whether the maximally contained rewriting is equivalent to the original query (and hence avoids NP-hardness).

On the other hand, the rewriting produced by the Inverse-Rules Algorithm often leads to a query that that is more expensive to evaluate over the view extensions. The reason is that the Bucket and MiniCon Algorithms create buckets and MCDs in a way that considers the context of the query Q, while the inverse rules are computed only based on the view definition. In experiments, the MiniCon Algorithm has been shown to typically be faster than the Inverse-Rules Algorithm.

2.4.7 View-Based Query Answering

Thus far we have described algorithms for finding the maximally contained rewriting for a query given a set of views. However, these rewritings are maximal with respect to the query language we consider for the rewritings. In our discussion, we have considered rewritings that we can express as unions of conjunctive queries.

In this section we consider a more general question: Given a query Q, a set of view definitions image, and extensions for each of the views, image, what are all the answers to Q that can be inferred from image and image? We begin by introducing the notion of certain answers that enables us to formally state the above question, and then describe some basic results on finding certain answers.

Given a database D and a query Q, we know with certainty all the answers to Q. However, if we are only given image and image, then we do not precisely know the contents of D. Instead, we only have partial information about the real state of the database D. The information is partial in the sense that all we know is the answer to certain queries posed on D.

Example 2.21

Suppose we have the following views computing the set of actors and directors, respectively:

V8 (Dir)  :- Director (ID, Dir)

V9 (Actor) :- Actor (ID, Actor)

Suppose we are given the following view extensions:

v8: {Allen, Coppola}

v9: {Keaton, Pacino}

There are multiple databases that may have led to these view extensions. For example, in one, the pairs of director and actor would be ((Allen, Keaton), (Coppola, Pacino)), while in another it could be ((Allen, Pacino), (Coppola, Keaton)). In fact, the database that contains all the possible pairs would also lead to the same view extensions.

With partial information, all we know is that D is in some set image of possible databases. However, for each database in image, the answer to Q may be different. The certain answers, which we define below, are the ones that are answers to Q in every database in image.

Before we can formally define certain answers, we need to be explicit about our assumptions about the completeness of the views. We distinguish between the closed-world assumption and the open-world assumption. Under the closed-world assumption, the extensions image are assumed to contain all the tuples in their corresponding views, while under the open-world assumption they may contain only a subset of the tuples in the views. The closed-world assumption is the one typically made when using views for query optimization, and the open-world assumption is typically used in data integration.

We can now formally define the notion of certain answers.

Definition 2.8 (Certain Answers)

Let Q be a query and image be a set of view definitions over the database schema R1, …, Rn. Let the sets of tuples image be extensions of the views V1, …, Vm, respectively.

The tuple image is a certain answer to the query Q under the closed-world assumption given v1, …, vm if image for all database instances D such that image for every i, 1 ≤ im.

The tuple image is a certain answer to the query Q under the open-world assumption given v1, …, vm if image for all database instances D such that image for every i, 1 ≤ im.

Example 2.22

Consider the views

V8 (Dir)  :- Director (ID, Dir)

V9 (Actor) :- Actor (ID, Actor)

Under the closed-world assumption, there is only a single database that is consistent with the view extensions. Hence, given the query

Q4 (Dir, Actor) :- Director (ID, Dir), Actor (ID, Actor)

the tuple (Allen, Keaton) is a certain answer. However, under the open-world assumption, this tuple is no longer a certain answer because the view extensions may be missing the other actors and directors.

Certain Answers under the Open-World Assumption

Under the open-world assumption, when the query does not contain interpreted predicates, the union of conjunctive queries produced by the MiniCon and Inverse-Rules Algorithms are guaranteed to compute all the certain answers to a conjunctive query given a set of conjunctive views. The following theorem shows that the Inverse-Rules Algorithm produces all the certain answers.

Theorem 2.10

Let Q be a conjunctive query and let image be conjunctive queries, where neither Q nor image contain interpreted atoms or negation. Let Qbe the result of the Inverse-Rules Algorithm on Q and image. Given extensions image for the views in image, evaluating Qover image will produce all the certain answers for Q w.r.t. image and image.

Proof

Denote by image the set of databases that are consistent with the extensions image. To prove the theorem, we need to show that if image for every image, then image would be produced by evaluating Q′ over image.

Evaluating Q′ on image can be divided into two steps. In the first, we evaluate all the inverse rules on image to produce a canonical database D′. In the second step, we evaluate Q on D′. The proof is based on showing that if image is a tuple of constants with no functional terms and image, then image for every image.

Suppose image where Vi is of the form

v (image) :- p1 (image), …, pm (image)

and the existential variables in V are Y1, …, Yk. If image, then there must be constants, b1, …, bk, and a variable mapping ψ that maps image to image and Y1, …, Yk to b1, …, bk, such that image for 1 ≤ in.

Evaluating the inverse rules on image produces exactly these tuples, except instead of b1, …, bk we get image. In fact, all the tuples in D′ are produced in this way for some image, so there are no extra tuples in D′ that are not required by some image.

Hence, we can characterize the databases in image as follows. Every image can be created from D′ by mapping the functional terms in D′ to constants (possibly equating two distinct functional terms) and adding more tuples. In other words, for every database image there is a homomorphism, ϕ, such that image.

Consequently, it is easy to see that if image, then image for every image, since any pair of constants that are equal in D′ are guaranteed to be equal in D.

Theorem 2.10 can be used to show that the MiniCon Algorithm produces all the certain answers under the same condition. We leave it to the reader to show that any answer produced by the Inverse-Rules Algorithm is also produced by the MiniCon Algorithm, and therefore it produces all the certain answers as well.

Corollary 2.1

Let Q be a conjunctive query, and let image be conjunctive queries. The Inverse-Rules and MiniCon Algorithms produce the maximally contained union of conjunctive query rewriting of Q using image.

Certain Answers under the Closed-World Assumption

Under the closed-world assumption, finding all the certain answers turns out to be computationally much harder. The following theorem shows that finding all the certain answers is co-NP-hard in the size of the data. Hence, none of the query languages we considered for rewriting will produce all the certain answers.

Theorem 2.11

Let Q be a query and image be a set of view definitions. The problem of determining, given a view instance, whether a tuple is a certain answer under the closed-world assumption is co-NP-hard.

Proof crux

We prove the theorem by a reduction from the 3-colorability problem. Let G = (V, E) be an arbitrary graph. Consider the view definitions

v1 (X)  :- color (X, Y)

v2 (Y)  :- color (X, Y)

v3 (X, Y) :- edge (X, Y)

and consider the instance I with image, image, and image. It can be shown that under the closed-world assumption, the query

q() :- edge (X, Y), color (X, Z), color (Y, Z)

has a certain answer if and only if the graph G is not 3-colorable. Because testing a graph’s 3-colorability is NP-complete, the theorem follows. Note that q is a query with arity 0, hence it has a certain answer if and only if for any database, there is at least one substitution that satisfies the body.

Certain Answers for Queries with Interpreted Predicates

When queries and views include interpreted predicates, the results on finding certain answers or maximally contained rewritings do not all carry over. In fact, the fundamental result on the limit on the size of rewritings (Theorem 2.9) does not hold any more.

The hard case is when the query contains interpreted predicates. The following theorem shows that it suffices for the query to contain the ≠ predicate, and then the problem of finding all certain answers is co-NP-complete.

Theorem 2.12

Let Q be a query and image be a set of view definitions, all conjunctive queries, but Q may have thepredicate. The problem of determining, given a view instance, whether a tuple is a certain answer under the open-world assumption is co-NP-hard.

Proof crux

We prove the theorem by a reduction from the problem of testing satisfiability of a CNF formula. Let ψ be a CNF formula with variables x1, …, xn and conjuncts c1, …, cm. Consider the following conjunctive view definitions and their corresponding view instances:

v1 (X, Y, Z) :- p (X, Y, Z)

v2 (X)    :- r (X, Y)

v3 (Y)    :- p (X, Y, Z), r (X, Z)

image occurs in image occurs in image

image

image

Finally, consider the following query:

q() :- r (X, Y), r (X, Y′), YY

It is possible to show that q has a certain answer under the open-world assumption if and only if the formula ψ is not satisfiable. Since the problem of testing satisfiability is NP-complete, the theorem follows.

Fortunately, there are two important cases where the algorithms we described still yield the maximally contained rewritings and all the certain answers:

if the query does not contain interpreted predicates (but the views may), and

if all the interpreted predicates in the query are semi-interval comparisons.

The proof is left as an exercise for the reader.

Bibliographic Notes

Query containment has been an active area of research for many years, beginning with Chandra and Merlin [114], who proved that query containment and equivalence are NP-complete for conjunctive queries. Sagiv and Yannakakis first considered queries with unions and negation [502]. Rarajaman and Chekuri [130] describe several cases where checking query containment can be done in time polynomial in the size of the queries. Saraiya shows that when predicates do not appear more than twice in a conjunctive query, query containment can be checked in polynomial time [505].

Containment for conjunctive queries with interpreted predicates was first considered by Klug [347] (establishing the upper bound) and van der Meyden [557] (establishing the lower bound). Afrati et al. [16] consider more efficient algorithms for testing containment with interpreted predicates. In [12], the authors present a thorough treatment of query containment and answering queries using views with arithmetic comparisons. The algorithm we describe in Section 2.3.4 and the algorithm for query containment with negated subgoals in Section 2.3.5 are from Levy and Sagiv [382]. Benedikt and Gottlob [64] show that query containment for nonrecursive datalog is complete for co-NEXPTIME. Containment for queries with bag semantics were first considered by Chaudhuri and Vardi [127], and then by [320, 332]. Cohen [142] surveys algorithms for query containment with grouping and aggregation. Green [267] describes containment results for queries over annotated relations that cover also queries over bag semantics.

Answering queries using views has been considered in the context of data integration, query optimization (e.g., [13, 63, 125, 259, 588]), and maintenance of physical data independence (e.g., [551, 586]). A survey of the main techniques for answering queries using views is described by Halevy [284]. The result on the length of a rewriting in Section 2.4.3 is from Levy et al. [398] and the Bucket Algorithm is a slightly modified version of the algorithm described in [381]. The MiniCon Algorithm [482] and the SVB Algorithm [439] are also complete and more efficient. In [12] the authors discuss the subtleties of answering queries using views in the presence of interpreted predicates in the query and the views, and they show why the Bucket Algorithm may not be complete in the presence of interpreted predicates. In [349] the authors describe a significant speedup compared to the MiniCon Algorithm. The Inverse-Rules Algorithm is from Duschka and Genesereth [198]. An approach for answering queries using views based on the Chase Algorithm is described in [480]. Several works (e.g., [108]) considered answering queries using views over description logics. Certain answers were introduced by Abiteboul and Duschka [6], as well as the basic complexity results we described. Libkin [389] describes a general framework for defining incomplete information in databases from which he derives several results concerning certain answers for relational data and for XML data.

1 The alternative would be to interpret them over a discrete domain such as the integers. In that case we need to account for subtle inferences such as implying X = 4 from the conjunction X > 3, X < 5.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset