Subsetting Data Using Subqueries

Introducing Subqueries

A Brief Overview

A subquery selects rows from one table based on values in another table. Subqueries are also known as nested queries or inner queries. A subquery is a query expression that is nested as part of another query expression. Depending on the clause that contains it, a subquery can return a single value or multiple values. Subqueries are most often used in WHERE and HAVING clauses. Subqueries are enclosed in parentheses.
The following PROC SQL query contains a subquery in the HAVING clause that returns all jobcodes where the average salary for that jobcode is greater than the company average salary.
proc sql;
   select jobcode, avg(salary) as AvgSalary
         format=dollar11.2
      from certadv.payrollmaster
      group by jobcode
      having avg(salary)>(select avg(salary)
      from certadv.payrollmaster);
quit;
The subquery shown above is a single-value subquery; it returns a single value, the average salary from the table Certadv.Payrollmaster, to the outer query. A subquery can return values for multiple rows but only for a single column.
The table that a subquery references can be either the same as or different from the table that is referenced by the outer query. In the PROC SQL query shown above, the subquery selects data from the same table as the outer query.

Using SAS Functions

PROC SQL supports almost all of the SAS DATA step functions. You can use SAS functions as a part of subqueries to subset your data. For example, you can use date functions in PROC SQL to summarize amounts by quarter and month.

The Two Types of Subqueries

Type
Description
noncorrelated
A noncorrelated subquery is a self-contained subquery that executes independently of the outer query. The simplest type of subquery is a noncorrelated subquery that returns a single value.
correlated
A dependent subquery is one that requires one or more values to be passed to it by the outer query before the subquery can return a value to the outer query.
Both noncorrelated and correlated subqueries can return either single or multiple values to the outer query.

Subsetting Data by Using Noncorrelated Subqueries

Example: Using Single-Value Noncorrelated Subqueries

The following PROC SQL query displays job codes for which a group’s average salary exceeds the company’s average salary. The HAVING clause contains a noncorrelated subquery. PROC SQL always evaluates a noncorrelated subquery before the outer query. If a query contains noncorrelated subqueries at more than one level, PROC SQL evaluates the innermost subquery first and works outward, evaluating the outermost query last.
proc sql;
   select jobcode, avg(salary) as AvgSalary format=dollar11.2
      from certadv.payrollmaster
      group by jobcode
      having avg(salary)>               /*1*/
         (select avg(salary)            /*2*/
            from certadv.payrollmaster);
quit;
1 The HAVING clause completes the expression by calculating the subquery. The subquery calculates the average salary for the entire company, all rows in the table, using the AVG summary function with Salary as an argument. The subquery returns the value of the average salary to the outer query.
2 In the SELECT clause, the outer query calculates the average salary for each JobCode group as defined by the GROUP BY clause. It selects only the groups whose average salary is greater than the company’s average salary.
The PROC SQL query result is displayed below.
Output 5.1 PROC SQL Query Result: Average Salary for Each JobCode
PROC SQL Query Result: Average Salary for Each JobCode
This noncorrelated subquery returns only a single value, the average salary for the whole company, to the outer query. Both the subquery and the outer query use the same table as a source.

Using Multiple-Value Noncorrelated Subqueries

Some subqueries are multiple-value subqueries: they return more than one value, or row, to the outer query. If your noncorrelated subquery might return a value for more than one row, be sure to use one of the following operators in the WHERE or HAVING clause that can handle multiple values:
  • the conditional operator IN
  • a comparison operator that is modified by ANY or ALL
  • the conditional operator EXISTS
Note: If you create a noncorrelated subquery that returns multiple values, but the WHERE or HAVING clause in the outer query contains an operator other than one of the operators that are specified above, the query will fail. An error message is displayed in the SAS log, which indicates that the subquery evaluated to more than one row. For example, if you use the equal (=) operator with a noncorrelated subquery that returns multiple values, the query will fail. The equal operator can handle only a single value.
Consider a query that contains both the conditional operator IN and a noncorrelated subquery that returns multiple values.

Example: Using a Conditional Operator in a Noncorrelated Subquery

Suppose you want to send cards to employees who have birthdays coming up. Create a PROC SQL query that lists the names and addresses of all employees who have birthdays in February. This query selects data from two different tables:
  • employee names and addresses in the table Certadv.Staffmaster
  • employee birth dates in the table Certadv.Payrollmaster
In both tables, the employees are identified by their employee identification number, EmpID.
In the following PROC SQL query, the WHERE clause contains the conditional operator IN followed by a noncorrelated subquery:
proc sql;
   select empid, lastname, firstname, city, state
      from certadv.staffmaster
         where empid in                                /*1*/
            (select empid from certadv.payrollmaster   /*2*/
               where month(dateofbirth)=2);            /*3*/
quit;
1 The WHERE expression selects the employees whose birthday is in the month of February. Note that the MONTH function is used in the subquery.
2 The subquery returns the EmpID values of the selected employees to the outer query.
3 The outer query displays data for the employees identified by the subquery.
The output, shown below, lists the six employees who have February birthdays.
Output 5.2 PROC SQL Query Result: Employees with February Birthdays
PROC SQL Query Result: Employees with February Birthdays
Although an inner join would have generated the same results, it is better to use a subquery in this example since no columns from the Certadv.Payrollmaster table were in the output.

Using Comparisons with Subqueries

Sometimes it is helpful to compare a value with a set of values returned by a subquery. When a subquery might return multiple values, you must use one of the conditional operators ANY or ALL to modify a comparison operator in the WHERE or HAVING clause immediately before the subquery. For example, the following WHERE clause contains the less than (<) comparison operator and the conditional operator ANY:
where dateofbirth < any
   <subquery...>
Note: If you create a noncorrelated subquery that returns multiple values, and if the WHERE or HAVING clause in the outer query contains a comparison operator that is not modified by ANY or ALL, the query will fail.
When the outer query contains a comparison operator that is modified by ANY or ALL, the outer query compares each value that it retrieves against the value or values that are returned by the subquery. All values for which the comparison is true are then included in the query output. If ANY is specified, the comparison is true if it is true for any one of the values that are returned by the subquery. If ALL is specified, the comparison is true only if it is true for all values that are returned by the subquery.
Note: The operators ANY and ALL can be used with correlated subqueries, but they are usually used only with noncorrelated subqueries.
In PROC SQL queries, you can use the following comparisons with subqueries. ANY, ALL, and EXISTS cannot be used in other SAS procedures.
Subquery Comparison Operators
Conditional Operator
Operator Looks for These Values
Example
ANY
values that meet a specified condition with respect to any one of the values returned by a subquery
where dateofbirth < any
   (select dateofbirth
      from certadv.payrollmaster
      where jobcode='FA3')
ALL
values that meet a specified condition with respect to all the values returned by a subquery
where dateofbirth < all
   (select dateofbirth
      from certadv.payrollmaster
      where jobcode='FA3')
EXISTS
values that are returned by a subquery
where exists
   (select *
      from certadv.flightschedule
      where fa.empid=
            flightschedule.empid) 

Using the ANY Operator

An outer query that specifies the ANY operator selects values that pass the comparison test with any of the values that are returned by the subquery.
For example, suppose you have an outer query containing the following WHERE clause:
where dateofbirth < any
   <subquery...>
This WHERE clause specifies that DateofBirth (the operand) should be less than any (the comparison operator) of the values returned by the subquery.
The following comparison shows the effect of using ANY with these common comparison operators: greater than (>), less than (<) and equal to (=).
Comparison Operator with ANY
Outer Query Action
Example
> ANY
Selects values that are greater than any value returned by the subquery.
If the subquery returns the values 20, 30, 40, the outer query selects all values that are > 20 (the lowest value that was returned by the subquery).
< ANY
Selects values that are less than any value returned by the subquery.
If the subquery returns the values 20, 30, 40, the outer query selects all values that are < 40 (the highest value that was returned by the subquery).
= ANY
Selects values that are equal to any value returned by the subquery.
If the subquery returns the values 20, 30, 40, the outer query selects all values that are = 20 or = 30 or = 40.
Tip
Instead of using the ANY operator with a subquery, there are some SAS functions that you can use to achieve the same result with greater efficiency. Instead of > ANY, use the MIN function in the subquery. Instead of < ANY, use the MAX function in the subquery.

Example: Using the ANY Operator

Suppose you want to identify any flight attendants at level 1 or level 2 who are older than any of the flight attendants at level 3. Job type and level are identified in JobCode; each flight attendant has the job code FA1, FA2, or FA3. The following PROC SQL query accomplishes this task by using a subquery and the ANY operator.
proc sql;
   select empid, jobcode, dateofbirth
      from certadv.payrollmaster
         where jobcode in ('FA1', 'FA2')
            and dateofbirth <any     /*1*/
               (select dateofbirth   /*2*/
                  from certadv.payrollmaster
                  where jobcode='FA3');
quit;
1 The subquery returns the birth dates of all level-3 flight attendants.
2 The outer query selects only level-1 and level-2 flight attendants whose birth dates indicate that they are older than any of the others whose birth dates are returned by the subquery.
Note that both the outer query and subquery use the same table.
Note: Internally, SAS represents a date value as the number of days from January 1, 1960, to the given date. For example, the SAS date for 17 October 1991 is 11612. Representing dates as the number of days from a reference date makes it easy for the computer to store them and perform calendar calculations. These numbers are not meaningful to users, however, so several formats are available for displaying dates and datetime values in most of the commonly used notations.
Output 5.3 PROC SQL Query Result: Level-1 and Level-2 Flight Attendants
PROC SQL Query Result: Level-1 and Level-2 Flight Attendants
Tip
Using the ANY operator to solve this problem results in a large number of calculations, which increases processing time. For this example, it would be more efficient to use the MAX function in the subquery. Here is an alternative WHERE clause:
where jobcode in ('FA1','FA2') 
      and dateofbirth < 
         (select max(dateofbirth)
            from [...]

Using the ALL Operator

An outer query that specifies the ALL operator selects values that pass the comparison test with all of the values that are returned by the subquery.
The following comparison shows the effect of using ALL with these common comparison operators: greater than (>) and less than (<).
Comparison Operator with ALL
Sample Values Returned by Subquery
Effect
> ALL
(20, 30, 40)
> 40
(greater than the highest number in the list)
< ALL
(20, 30, 40)
< 20
(less than the lowest number in the list)

Example: Using the ALL Operator

Substitute ALL for ANY in the previous query example. The following query identifies level-1 and level-2 flight attendants who are older than all of the level-3 flight attendants.
proc sql;
   select empid, jobcode, dateofbirth
      from certadv.payrollmaster
         where jobcode in ('FA1', 'FA2')
            and dateofbirth < all    /*1*/
               (select dateofbirth   /*2*/
                  from certadv.payrollmaster
                  where jobcode='FA3');
quit;
1 The subquery returns the birth dates of all level-3 flight attendants.
2 The outer query selects only level-1 and level-2 flight attendants whose birth dates indicate that they are older than any of the others whose birth dates are returned by the subquery.
The following query results show that only two level-1 or level-2 flight attendants are older than all of the level-3 flight attendants.
Output 5.4 PROC SQL Query Result: Level-1 and Level-2 Flight Attendants
PROC SQL Query Result: Level-1 and Level-2 Flight Attendants
Tip
For this example, it would be more efficient to solve this problem using the MIN function in the subquery instead of the ALL operator. Here is an alternative WHERE clause:
where jobcode in ('FA1','FA2') 
      and dateofbirth < 
        (select min(dateofbirth) 
           from [...]

Subsetting Data by Using Correlated Subqueries

A Brief Overview

Correlated subqueries are not evaluated independently because they depend on the values passed to them by the outer query for results. Correlated subqueries evaluate each row in the outer query and often require more processing time than noncorrelated subqueries. Because correlated subqueries are resource intensive, they reduce system performance. A PROC SQL join is a more efficient alternative to a correlated subquery.

Example: Using Correlated Subqueries

The following PROC SQL query displays the names of all navigators who are also managers. The WHERE clause in the subquery specifies the column Staffmaster.EmpID, which is the column that the outer query must pass to the correlated subquery.
proc sql;
   select lastname, firstname
      from certadv.staffmaster
         where 'NA'=
            (select jobcategory
               from certadv.supervisors
               where staffmaster.empid =
               supervisors.empid);
quit;
Note: When a column appears in more than one table, the column name is preceded by the table name or alias to avoid ambiguity. In this example, EmpID appears in both tables, so the appropriate table name is specified in front of each reference to that column.
The output from this query is shown below. There are three navigators who are also managers.
Output 5.5 PROC SQL Query Result: Navigators Who Are Also Managers
PROC SQL Query Result: Navigators Who are Also Managers

Using the EXISTS and NOT EXISTS Conditional Operators

In the WHERE clause or in the HAVING clause of an outer query, you can use the EXISTS or NOT EXISTS conditional operators to test for the existence or non-existence of a set of values returned by a subquery.
Condition
Requirement
EXISTS
the subquery returns at least one row
NOT EXISTS
the subquery returns no data
Note: The operators EXISTS and NOT EXISTS can be used with both correlated and noncorrelated subqueries.

Example: Correlated Subquery with NOT EXISTS

Consider a sample PROC SQL query that includes the NOT EXISTS conditional operator. Suppose you are working with the following tables:
  • Certadv.Flightattendants contains the names and employee ID numbers of all flight attendants.
  • Certadv.Flightschedule contains one row for each crew member who is assigned to a flight for each date.
As shown in the diagram below, the intersection of these two tables contains data for all flight attendants who have been scheduled to work.
Figure 5.1 Flight Attendants and Employees Scheduled to Work
Flight Attendants and Employees Scheduled to Work Venn Diagram Relationship
Now suppose you want to list by name the flight attendants who have not been scheduled. That is, you want to identify the data in the area highlighted below.
Figure 5.2 Flight Attendants Not Scheduled to Work
Flight Attendants Not Scheduled to Work Venn Diagram Relationship
The following PROC SQL query accomplishes this task by using a correlated subquery and the NOT EXISTS operator.
proc sql;
   select lastname, firstname
      from certadv.flightattendants
         where not exists
            (select * from certadv.flightschedule
               where flightattendants.empid=
                  flightschedule.empid);
quit;
Output 5.6 PROC SQL Query Result: List of Employees Not Scheduled for a Flight
PROC SQL Query Result: List of Employees Not Scheduled for a Flight
Last updated: October 16, 2019
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset