The most basic type of join combines data from two tables
that are specified in the FROM clause of a SELECT statement. When
you specify multiple tables in the FROM clause but do not include
a WHERE statement to subset data, PROC SQL returns the Cartesian product
of the tables. In a Cartesian product, each row in the first table
is combined with every row in the second table. Below is an example
of this type of query, which joins the tables One and Two.
proc sql;
select *
from one, two;
|
|
The output shown above
displays all possible combinations of each row in table One with all
rows in table Two. Note that each table has a column named X, and
both of these columns appear in the output. A Cartesian product includes
all columns from the source tables. Columns that have common names
are not overlaid.
In most cases, generating
all possible combinations of rows from multiple tables does not yield
useful results, so a Cartesian product is rarely the query outcome
that you want. For example, in the Cartesian product of two tables
that contain employee information, each row of output might contain
information about two different employees. Usually, you want your
join to return only a subset of rows from the tables.
The size of a Cartesian
product can also be problematic. The number of rows in a Cartesian
product is equal to the product of the number of rows in the contributing
tables.
The tables One and Two,
used in the preceding example, contain three rows each, as shown below.
The number of rows in
the Cartesian product of tables One and Two is calculated as follows:
3 x 3 = 9 rows
Joining small tables
such as One and Two results in a relatively small Cartesian product.
However, the Cartesian product of large tables can be huge and can
require a large amount of system resources for processing.
For example, joining
two tables of 1,000 rows each results in output of the following size:
1,000 x 1,000 = 1,000,000 rows
When you run a query
that involves a Cartesian product that cannot be optimized, PROC SQL
writes the following warning message to the SAS log.
Table 3.1 SAS Log
NOTE: The
execution of this query involves performing one or more Cartesian
product joins that cannot be optimized.
|
Although you often do
not choose to create a query that returns a Cartesian product, it
is important to understand how a Cartesian product is built. In all
types of joins, PROC SQL generates a Cartesian product first, and
then eliminates rows that do not meet any subsetting criteria that
you have specified.
Note: In many cases, PROC SQL can
optimize the processing of a join, thereby minimizing the resources
that are required to generate a Cartesian product.