Using Inner Joins

A Brief Overview

An SQL inner join combines matching rows between two tables. You can perform an inner join on tables using two methods:

Specify the two tables to be joined in a FROM clause separated by the INNER JOIN keyword. Next, specify an ON clause that indicates how rows should be matched.
Join the two tables based on the matching criteria, known as join conditions, that you specify in a WHERE clause.

The following diagram illustrates an inner join of two tables. The shaded area of overlap represents the matching rows (the subset of rows) that the inner join returns as output.

Figure 3.2 Venn Diagram, Inner Join for Table 1 and Table 2

Note: An inner join is sometimes called a conventional join.

This book focuses on joining tables using the INNER JOIN keyword with the ON clause in the FROM clause.

Inner Join Syntax

In the FROM clause, specify the first table, the keyword INNER JOIN, and then the second table. Following the table names and join type, the syntax requires an ON clause to describe the join criteria for matching rows in the tables. Omitting the ON clause produces a syntax error.

Syntax, SELECT statement for inner join:

SELECT column-1<,...column-n>

FROM table-1 | view-1 INNER JOIN table-2 | view-2 <INNER JOIN...table-n | view-n>

ON table1.column=table2.column

<other clauses>;

INNER JOIN

specifies the join type.

table

specifies the name of the source table.

ON

specifies join conditions, which are expressions that specify the column or columns on which the tables are to be joined.

table.column

refers to the source table and the column name on which the join occurs.

refers to optional PROC SQL clauses.

Note: The maximum number of tables that you can combine in a single inner join depends on your version of SAS.

Understanding How Joins Are Processed

Understanding how PROC SQL processes inner and outer joins will help you to understand which output is generated by each type of join. PROC SQL follows these steps to process a join:

Builds a Cartesian product of rows from the indicated tables.
Evaluates each row in the Cartesian product, based on the join conditions specified in the WHERE clause, along with any other subsetting conditions, and removes any rows that do not meet the specified conditions.
If summary functions are specified, summarizes the applicable rows.
Returns the rows that are to be displayed in output.

Note: The PROC SQL query optimizer follows a more complex process than the conceptual approach described here, by breaking the Cartesian product into smaller pieces. For each query, the optimizer selects the most efficient processing method for the specific situation.

By default, PROC SQL joins do not overlay columns with the same name. Instead, the output displays all columns that have the same name. To avoid having columns with the same name in the output from an inner or outer join, you can eliminate or rename the duplicate columns.

Tip

You can also use the COALESCE function with an inner or outer join to overlay columns with the same name.

Example: Using a FROM Clause with the INNER JOIN Keyword

The following example illustrates an inner join. Inner joins are also known as equijoins because of the equality in the ON clause. The ON clause specifies that only rows with identical values in the column produce a match.

proc sql;
   select *
      from certadv.one inner join certadv.two
      on one.x=two.x;
quit;

Output 3.2 Inner Join Output: INNER JOIN Keyword

The FROM clause, along with the INNER JOIN keyword and the ON clause, specifies that the result set should include only these rows: those whose values of column X in table One are equal to values in column X of table Two. Only one row from table One and one row from table Two have matching values of X. Therefore, those two rows are combined into one row of output.

Note: PROC SQL will not perform a join unless the columns that are compared in the join condition have the same data type. However, the two columns are not required to have the same name. For example, the join condition shown in the following ON clause is valid if ID and EmpID have the same data type:

on table1.id = table2.empid

Note: The join condition that is specified in the ON clause often contains the equal (=) operator, but the expression might contain one or more other operators instead. You can use other comparison operators, such as greater than, less than, or special WHERE operators.

Example: Eliminating Duplicate Columns

The following example uses an inner join to combine the tables One and Two.

proc sql;
   select *
      from certadv.one inner join certadv.two
      on one.x=two.x;
quit;

Output 3.3 Inner Join, Tables One and Two

The two tables have a column with an identical name, X. Because the SELECT clause in the query shown above contains an asterisk, the output displays all columns from both tables.

To eliminate a duplicate column, you can specify just one of the duplicate columns in the SELECT statement. The SELECT statement in the preceding PROC SQL query can be modified as follows:

proc sql;
   select one.x, a, b
      from certadv.one inner join certadv.two
      on one.x=two.x;
quit;

Here, the SELECT clause specifies that only column X from table One will be included in output. The output, which now displays only one column X, is shown below.

Output 3.4 Table One Output

Note: In an inner equijoin, like the one shown here, it does not matter which of the same-named columns is listed in the SELECT statement. The SELECT statement in this example could have specified Two.X instead of One.X.

Another way to eliminate the duplicate X column in the preceding example is shown below:

proc sql;
   select one.*, b
      from certadv.one inner join certadv.two
      on one.x=two.x;
quit;

By using the asterisk (*) to select all columns from table One, and only B from table Two, this query generates the same output as the preceding version.

Example: Renaming a Column by Using a Column Alias

If you are working with several tables that have a column with a common name but slightly different data, you might want both columns to appear in output. To avoid the confusion of displaying two different columns with the same name, you can rename one of the duplicate columns by specifying a column alias in the SELECT statement.

proc sql;
   select one.x as ID, two.x, a, b
      from certadv.one inner join certadv.two
      on one.x=two.x;
quit;

The output of the modified query is shown here.

Output 3.5 PROC SQL Result: Modified Query Output

The column One.X has been renamed to ID and the output clearly indicates that ID and X are two different columns.

Example: Joining Tables That Have Rows with Matching Values

Consider what happens when you join two tables in which multiple rows have duplicate values of the column on which the tables are being joined. Each of the tables Three and Four has multiple rows that contain the value 2 for column X.

Figure 3.3 Original Tables: Certadv.Three and Certadv.Four

The following PROC SQL inner join matches rows from the two tables based on the common column X.

proc sql;
   select *
      from certadv.three inner join certadv.four
      on three.x=four.x;
quit;

The output shows how this inner join handles the duplicate values of X.

Output 3.6 PROC SQL Inner Join

All possible combinations of the duplicate rows are displayed. There are no matches on any other values of X, so no other rows are displayed in output.

Note: A DATA step match-merge would write output to only two rows because it processes data sequentially from top to bottom.

Specifying a Table Alias

To enable PROC SQL to distinguish between same-named columns from different tables, you use qualified column names. To create a qualified column name, you prefix the column name with its table name. The following PROC SQL inner join contains several qualified column names.

proc sql;
title 'Employee Names and Job Codes';
   select staffmaster.empid, lastname, firstname, jobcode
      from certadv.staffmaster inner join certadv.payrollmaster
      on staffmaster.empid=payrollmaster.empid;
quit;

It can be time consuming and difficult to read PROC SQL code that contains lengthy qualified column names. You can use a temporary, alternate name for any or all tables in any PROC SQL query. This temporary name, which is called a table alias, is specified after the table name in the FROM clause. The keyword AS is often used, although its use is optional.

The following modified PROC SQL query specifies table aliases in the FROM clause, and then uses the table aliases to qualify column names in the SELECT and ON clauses.

proc sql;
title 'Employee Names and Job Codes';
   select s.empid, lastname, firstname, jobcode
      from certadv.staffmaster as s inner join 
           certadv.payrollmaster as p
      on s.empid=p.empid;
quit;

In this query, the optional keyword AS is used to define the table aliases in the FROM clause. The FROM clause would be equally valid without the keyword AS.

from certadv.staffmaster s,
     certadv.payrollmaster p

Note: While using table aliases helps you to work more efficiently, the use of table aliases does not cause SAS to execute the query more quickly.

Table aliases are usually optional. However, there are two situations that require their use.

When Are Table Aliases Required?	Example
a table is joined to itself (called a self-join or reflexive join)	from certadv.staffmaster as s1, certadv.staffmaster as s2
you need to reference columns from same-named tables in different libraries	from certadv.flightdelays as af, certadvf.flightdelays as wf on af.delay > wf.delay

Thus far, you have seen relatively simple examples of inner joins. However, as in any other PROC SQL query, inner joins can include more advanced components, such as these:

titles and footers
functions and expressions in a SELECT clause
multiple conditions in a WHERE clause
an ORDER BY clause for sorting
summary functions with grouping

Example: Complex PROC SQL Inner Join

Suppose you want to create a report where the name is displayed with first initial and last name (R. Long), JobCode, and ages of all employees who live in New York. The report also should be sorted by JobCode and Age.

The data that you need is stored in the two tables below.

Table	Relevant Columns
Certadv.Staffmaster	EmpID, LastName, FirstName, State
Certadv.Payrollmaster	EmpId, JobCode, DateOfBirth

Of the three columns that you want to display, JobCode is the only column that already exists in the tables. The other two columns must be created from existing columns.

The PROC SQL query shown here uses an inner join to generate the output that you want:

proc sql outobs=15;
   title 'New York Employees';
   select substr(firstname,1,1) || '. ' || lastname   /*1*/
      as Name,
      jobcode,
      int((today() - dateofbirth)/365.25)
      as Age
      from certadv.payrollmaster as p inner join      /*2*/
           certadv.staffmaster as s
      on p.empid =                                    /*3*/
         s.empid
      where state='NY'                                /*4*/
      order by 2,3                                    /*5*/
;
quit;

1	The SELECT clause specifies the new column Name, the existing column JobCode, and the new column Age. The SELECT clause uses functions and expressions to create two new columns. To create Name, the SUBSTR function extracts the first initial from FirstName. Then the concatenation operator combines the first initial with a period, a space, and then the contents of the LastName column. Finally, the keyword AS names the new column. To calculate Age, the INT function returns the integer portion of the result of the calculation. In the expression that is used as an argument of the INT function, the employee's birth date (DateOfBirth) is subtracted from today's date (returned by the TODAY function), and the difference is divided by the number of days in a year (365.25).
2	The FROM clause lists the tables to select from. The FROM clause uses the AS keyword to distinguish a table alias from other table names. The FROM clause also specifies the INNER JOIN keyword to specify the join type.
3	The ON clause describes the join criteria for matching rows in the tables. The ON clause produces a match for the rows with identical values in the EmpID column.
4	The WHERE clause subsets the data further by selecting only the rows where State is `NY`.
5	The ORDER BY clause specifies the order in which rows are displayed in the result table.

This query writes to output only those rows that have matching values of EmpID and rows in which the value of State is NY. You do not need to prefix the column name State with a table name because State occurs in only one of the tables.

Output 3.7 PROC SQL Query Result: Inner Join

Example: PROC SQL Inner Join with Summary Functions

You can use summary functions to summarize and group data in a PROC SQL join. The following example summarizes columns for New York employees in each job code: number of employees and average age.

proc sql outobs=15;
   title 'Average Age of New York Employees';
   select jobcode,count(p.empid) as Employees,               /*1*/
                  avg(int((today() - dateofbirth)/365.25))
                  format=4.1 as AvgAge
      from certadv.payrollmaster as p inner join             /*2*/
           certadv.staffmaster as s
      on p.empid=                                            /*3*/
         s.empid 
      where state='NY'                                       /*4*/
      group by jobcode                                       /*5*/
      order by jobcode                                       /*6*/
;
quit;

1	The SELECT clause uses summary functions to create two new columns. To create Employees, the COUNT function is used with p.EmpID (Payrollmaster.EmpID) as its argument. To create AvgAge, the AVG function is used with an expression as its argument. As described in the previous example, the expression uses the INT function to calculate each employee's age.
2	The FROM clause lists the tables to select from. The FROM clause uses the AS keyword to distinguish a table alias from other table names. The FROM clause also specifies the INNER JOIN keyword to specify the join type.
3	The ON clause describes the join criteria for matching rows in the tables. The ON clause produces a match for the rows with identical values in the EmpID column.
4	The WHERE clause subsets the data further by selecting only the rows where State is `NY`.
5	The GROUP BY clause specifies what variable to group the data by for summarization.
6	The ORDER BY clause specifies the order in which rows are displayed in the result table.

Output 3.8 PROC SQL Query Result: Inner Join with Summary Functions

Table of Contents for Using Inner Joins

Create new playlist

Sign In

Sign Up