Viewing SELECT Statement Syntax
Limiting the Number of Rows Displayed
Eliminating Duplicate Rows from Output
Subsetting Rows by Using Conditional Operators
Using the BETWEEN-AND Operator to Select within a Range of Values
Using the CONTAINS or Question Mark (?) Operator to Select a String
Using the IN Operator to Select Values from a List
Using the IS MISSING or IS NULL Operator to Select Missing Values
Using the LIKE Operator to Select a Pattern
Using the Sounds-Like (=*) Operator to Select a Spelling Variation
Subsetting Rows by Using Calculated Values
Understanding How PROC SQL Processes Calculated Columns
Specifying Column Formats and Labels
Specifying Titles and Footnotes
Adding a Character Constant to Output
Number of Arguments and Summary Function Processing
Groups and Summary Function Processing
SELECT Clause Columns and Summary Function Processing
Using a Summary Function with a Single Argument (Column)
Using a Summary Function with Multiple Arguments (Columns)
Using a Summary Function without a GROUP BY Clause
Using a Summary Function with Columns Outside of the Function
Using a Summary Function with a GROUP BY Clause
Counting Values by Using the COUNT Summary Function
Counting All Non-Missing Values in a Column
Counting All Unique Values in a Column
Selecting Groups by Using the HAVING Clause
Subsetting Data by Using Subqueries
Subsetting Data by Using Noncorrelated Subqueries
Using Single-Value Noncorrelated Subqueries
Using Multiple-Value Noncorrelated Subqueries
Using Comparisons with Subqueries
Subsetting Data by Using Correlated Subqueries
Using the EXISTS and NOT EXISTS Conditional Operators
Example: Correlated Subquery with NOT EXISTS
The SELECT statement is the primary tool of PROC SQL. Using the SELECT statement, you can identify, manipulate, and retrieve columns of data from one or more tables and views.
You should already know how to create basic PROC SQL queries by using the SELECT statement and most of its subordinate clauses. To build on your existing skills, this chapter presents a variety of useful query techniques, such as the use of subqueries to subset data.
The PROC SQL query shown below illustrates some of the new query techniques that you will learn:
proc sql outobs=20;
title 'Job Groups with Average Salary';
title2 '> Company Average';
select jobcode,
avg(salary) as AvgSalary format=dollar11.2,
count(*) as Count
from sasuser.payrollmaster
group by jobcode
having avg(salary) >
(select avg(salary)
from sasuser.payrollmaster)
order by avgsalary desc;
In this chapter, you learn to
• display all rows, eliminate duplicate rows, and limit the number of rows displayed
• subset rows using other conditional operators and calculated values
• enhance the formatting of query output
• use summary functions, such as COUNT, with and without grouping
• subset groups of data by using the HAVING clause
• subset data by using correlated and noncorrelated subqueries
• validate query syntax.
Before you begin this chapter, you should complete the following chapter:
• “Performing Queries Using PROC SQL” on page 4.
The SELECT statement and its subordinate clauses are the building blocks for constructing all PROC SQL queries.
General form, SELECT statement:
SELECT column-1<, ... column-n>
FROM table-1 | view-1<, ... table-n | view-n>
WHERE expression
GROUP BY column-1<, ... column-n>>
HAVING expression>
ORDER BY column-1<, ... column-n>>;
where
SELECT
specifies the column(s) that will appear in the output
FROM
specifies the table(s) or view(s) to be queried
WHERE
uses an expression to subset or restrict the data based on one or more condition(s)
GROUP BY
classifies the data into groups based on the specified column(s)
HAVING
uses an expression to subset or restrict groups of data based on group condition(s)
ORDER BY
sorts the rows that the query returns by the value(s) of the specified column(s).
Note: The clauses in a PROC SQL SELECT statement must be specified in the order shown.
You should be familiar with all of the SELECT statement clauses except for the HAVING clause. The use of the HAVING clause is presented later in this chapter.
Now, we will look at some ways you can limit and subset the number of columns that will be displayed in query output.
You already know how to select specific columns for output by listing them in the SELECT statement. However, for some tasks, you will find it useful to display all columns of a table concurrently. For example, before you create a complex query, you might want to see the contents of the table you are working with.
To display all columns in the order in which they are stored in a table, use an asterisk (*) in the SELECT clause. All rows will also be displayed, by default, unless you limit or subset them.
The following SELECT step displays all columns and rows in the table Sasuser.Staffchanges, which lists all employees in a company who have had changes in their employment status.
proc sql;
select *
from sasuser.staffchanges;
As shown in the output, the table contains six columns and six rows.
When you specify SELECT *, you can also use the FEEDBACK option in the PROC SQL statement, which writes the expanded list of columns to the SAS log. For example, the PROC SQL query shown below contains the FEEDBACK option:
proc sql feedback;
select *
from sasuser.staffchanges;
This query produces the following feedback in the SAS log.
202 proc sql feedback;
203 select *
204 from sasuser.staffchanges;
NOTE: Statement transforms to:
select STAFFCHANGES.EmpID,
STAFFCHANGES.LastName, STAFFCHANGES.FirstName,
STAFFCHANGES.City, STAFFCHANGES.State,
STAFFCHANGES.PhoneNumber
from SASUSER.STAFFCHANGES
The FEEDBACK option is a debugging tool that lets you see exactly what is being submitted to the SQL processor. The resulting message in the SAS log not only expands asterisks (*) into column lists, but it also resolves macro variables and places parentheses around expressions to show their order of evaluation.
When you create PROC SQL queries, you will sometimes find it useful to limit the number of rows that PROC SQL displays in the output. To indicate the maximum number of rows to be displayed, you can use the OUTOBS= option in the PROC SQL statement.
General form, PROC SQL statement with OUTOBS= option:
PROC SQL OUTOBS= n;
where
n
specifies the number of rows.
Note: The OUTOBS= option restricts the rows that are displayed, but not the rows that are read. To restrict the number of rows that PROC SQL takes as input from any single source, use the INOBS= option. For more information about the INOBS= option, see “Managing Processing Using PROC SQL” on page 278.
Suppose you want to quickly review the types of values that are stored in a table, without printing out all the rows. The following PROC SQL query selects data from the table Sasuser.Flightschedule, which contains over 200 rows. To print only the first 10 rows of output, you add the OUTOBS= option to the PROC SQL statement.
proc sql outobs=10; |
When you limit the number of rows that are displayed, a message similar to the following appears in the SAS log.
WARNING: Statement terminated early due to OUTOBS=10 option.
Note: The OUTOBS= and INOBS= options will affect tables that are created by using the CREATE TABLE statement and your report output.
Note: In many of the examples in this chapter, OUTOBS= is used to limit the number of rows that are displayed in output.
In some situations, you might want to display only the unique values or combinations of values in the column(s) listed in the SELECT clause. You can eliminate duplicate rows from your query results by using the keyword DISTINCT in the SELECT clause. The DISTINCT keyword applies to all columns, and only those columns, that are listed in the SELECT clause. We will see how this works in the following example.
Suppose you want to display a list of the unique flight numbers and destinations of all international flights that are flown during the month.
The following SELECT statement in PROC SQL selects the columns FlightNumber and Destination in the table Sasuser.Internationalflights:
proc sql outobs=12;
select flightnumber, destination
from sasuser.internationalflights;
Here is the output.
As you can see, there are several duplicate pairs of values for FlightNumber and Destination in the first 12 rows alone. For example, flight number 182 to YYZ appears in rows 1 and 8. The entire table contains many more rows with duplicate values for each flight number and destination because each flight has a regular schedule.
To remove rows that contain duplicate values, add the keyword DISTINCT to the SELECT statement, following the keyword SELECT, as shown in the following example:
proc sql;
select distinct flightnumber, destination
from sasuser.internationalflights
order by 1;
With duplicate values removed, the output will contain many fewer rows, so the OUTOBS= option has been removed from the PROC SQL statement. Also, to sort the output by FlightNumber (column 1 in the SELECT clause list), the ORDER BY clause has been added.
Here is the output from the modified program.
There are no duplicate rows in the output. There are seven unique FlightNumber-Destination value pairs in this table.
In the WHERE clause of a PROC SQL query, you can specify any valid SAS expression to subset or restrict the data that is displayed in output. The expression might contain any of various types of operators, such as the following.
Type of Operator |
Example |
comparison |
where membertype='GOLD' |
logical |
where visits<=3 or status='new' |
concatenation |
where name=trim(last) ||', '||first |
Note: For a complete list of operators that can be used in SAS expressions, see the SAS documentation.
Comparison, logical, and concatenation operators are used in PROC SQL as they are used in other SAS procedures. For example, the following WHERE clause contains
• the logical operator AND, which joins multiple conditions
• two comparison operators: an equal sign (=) and a greater than symbol (>).
proc sql;
select ffid, name, state, pointsused
from sasuser.frequentflyers
where membertype='GOLD' and pointsused>0
order by pointsused;
In PROC SQL queries, you can also use the following conditional operators. All of these operators except for ANY, ALL, and EXISTS, can also be used in other SAS procedures.
Conditional Operator |
Tests for... |
Example |
BETWEEN-AND |
values that occur within an inclusive range |
where salary between 70000 |
CONTAINS or ? |
values that contain a specified string |
where name contains 'ER' |
IN |
values that match one of a list of values |
where code in ('PT' , 'NA', 'FA') |
IS MISSING or IS NULL |
missing values |
where dateofbirth is missing |
LIKE (with %, _) |
values that match a specified pattern |
where address like '% P%PLACE' |
=* |
values that sound like a specified value |
where lastname=* 'Smith' |
ANY |
values that meet a specified condition with respect to any one of the values returned by a subquery |
where dateofbirth < any |
ALL |
values that meet a specified condition with respect to all the values returned by a subquery |
where dateofbirth < all |
EXISTS |
the existence of values returned by a subquery |
where exists |
TIP To create a negative condition, you can precede any of these conditional operators, except for ANY and ALL, with the NOT operator.
Most of these conditional operators, and their uses, are covered in the next several sections. ANY, ALL, and EXISTS are discussed later in the chapter.
To select rows based on a range of numeric or character values, you use the BETWEEN-AND operator in the WHERE clause. The BETWEEN-AND operator is inclusive, so the values that you specify as limits for the range of values are included in the query results, in addition to any values that occur between the limits.
General form, BETWEEN-AND operator:
BETWEEN value-1 AND value-2
where
value-1
is the value at the one end of the range
value-2
is the value at the other end of the range.
Note: When specifying the limits for the range of values, it is not necessary to specify the smaller value first.
Here are several examples of WHERE clauses that contain the BETWEEN-AND operator. The last example shows the use of the NOT operator with the BETWEEN-AND operator.
Example |
Returns rows in which... |
where date between '01mar2000'd Date In this example, the values are specified as date constants. |
the value of Date is 01mar2000, 07mar2000, or any date value in between |
where salary between 70000 |
the value of Salary is 70000, 80000, or any numeric value in between |
where salary not between 70000 |
the value of Salary is not between or equal to 70000 and 80000 |
The CONTAINS or question mark (?) operator is usually used to select rows for which a character column includes a particular string. These operators are interchangeable.
General form, CONTAINS operator:
sql-expression CONTAINS sql-expression
sql-expression ? sql-expression
where
sql-expression
is a character column, string (character constant), or expression. A string is a sequence of characters to be matched that must be enclosed in quotation marks.
Note: PROC SQL retrieves a row for output no matter where the string (or second sql-expression) occurs within the column’s (or first sql-expression’s) values. Matching is case sensitive when making comparisons.
Note: The CONTAINS or question mark (?) operator is not part of the ANSI standard; it is a SAS enhancement.
The following PROC SQL query uses CONTAINS to select rows in which the Name column contains the string ER. As the output shows, all rows that contain ER anywhere within the Name column are displayed.
proc sql outobs=10; |
To select only the rows that match one of the values in a list of fixed values, either numeric or character, use the IN operator.
General form, IN operator:
column IN (constant-1<,...constant-n>)
where
column
specifies the selected column name
constant-1 and constant-n
represent a list that contains one or more specific values. The list of values must be enclosed in parentheses and separated by either commas or spaces. Values can be either numeric or character. Character values must be enclosed in quotation marks.
Here are examples of WHERE clauses that contain the IN operator.
Example |
Returns rows in which... |
where jobcategory in ('PT','NA','FA') | the value of JobCategory is PT, NA, or FA |
where dayofweek in (2,4,6) |
the value of DayOfWeek is 2, 4, or 6 |
where chesspiece not in |
the value of chesspiece is rook, knight, or bishop |
To select rows that contain missing values, both character and numeric, use the IS MISSING or IS NULL operator. These operators are interchangeable.
General form, IS MISSING or IS NULL operator:
column IS MISSING
column IS NULL
where
column
specifies the selected column name.
Note: The IS MISSING operator is not part of the ANSI standard for SQL. It is a SAS enhancement.
Suppose you want to find out whether the table Sasuser.Marchflights has any missing values in the column Boarded. You can use the following PROC SQL query to retrieve rows from the table that have missing values:
proc sql;
select boarded, transferred,
nonrevenue, deplaned
from sasuser.marchflights
where boarded is missing;
The output shows that two rows in the table have missing values for Boarded.
TIP Alternatively, you can specify missing values without using the IS MISSING or IS NULL operator, as shown in the following examples:
where boarded = .
where flight = ' '
However, the advantage of using the IS MISSING or IS NULL operator is that you do not have to specify the data type (character or numeric) of the column.
To select rows that have values that match as specific pattern of characters rather than a fixed character string, use the LIKE operator. For example, using the LIKE operator, you can select all rows in which the LastName value starts with H. (If you wanted to select all rows in which the last name contains the string HAR, you would use the CONTAINS operator.)
General form, LIKE operator:
column LIKE ‘pattern’
where
column
specifies the column name
pattern
specifies the pattern to be matched and contains one or both of the special characters underscore (_) and percent sign (%). The entire pattern must be enclosed in quotation marks and matching is case sensitive.
When you use the LIKE operator in a query, PROC SQL uses pattern matching to compare each value in the specified column with the pattern that you specify using the LIKE operator in the WHERE clause. The query output displays all rows in which there is a match.
You specify a pattern using one or both of the special characters shown below.
Special Character |
Represents |
underscore (_) |
any single character |
percent sign (%) |
any sequence of zero or more characters |
Note: The underscore (_) and percent sign (%) are sometimes referred to as wildcard characters.
To specify a pattern, combine one or both of the special characters with any other characters that you want to match. The special characters can appear before, after, or on both sides of other characters.
Consider how the special characters can be combined to specify a pattern. Suppose you are working with a table column that contains the following list of names:
• Diana
• Diane
• Dianna
• Dianthus
• Dyan
Here are several patterns that you can use to select one or more of the names from the list. Each pattern uses one or both of the special characters.
LIKE Pattern |
Name(s) Selected |
LIKE 'D_an' |
Dyan |
LIKE 'D_an_' |
Diana, Diane |
LIKE 'D_an__ |
Dianna |
LIKE 'D_an%' |
all names from the list |
The following PROC SQL query uses the LIKE operator to find all frequent-flyer club members whose street name begins with P and ends with the word PLACE. The following PROC SQL step performs this query:
proc sql;
select ffid, name, address
from sasuser.frequentflyers
where address like '% P%PLACE';
The pattern ‘% P%PLACE’ specifies the following sequence:
• any number of characters (%)
• a space
• the letter P
• any number of characters (%)
• the word PLACE.
Here are the results of this query.
To select rows that contain a value that sounds like another value that you specify, use the sounds-like operator (= *) in the WHERE clause.
General form, sounds-like (=*) operator:
sql-expression =* sql-expression
where
sql-expression
is a character column, string (character constant), or expression. A string is a sequence of characters to be matched that must be enclosed in quotation marks.
The sounds-like (=*) operator uses the SOUNDEX algorithm to compare each value of a column (or other sql-expression) with the word or words (or other sql-expression) that you specify. Any rows that contain a spelling variation of the value that you specified are selected for output.
For example, here is a WHERE clause that contains the sounds-like operator:
where lastname =* 'Smith';
The sounds-like operator does not always select all possible values. For example, suppose you use the preceding WHERE clause to select rows from the following list of names that sound like Smith:
• Schmitt
• Smith
• Smithson
• Smitt
• Smythe
Two of the names in this list will not be selected: Schmitt and Smithson.
Note: The SOUNDEX algorithm is English-biased and is less useful for languages other than English. For more information about the SOUNDEX algorithm, see the SAS documentation.
You should already know how to define a new column by using the SELECT clause and performing a calculation. For example, the following PROC SQL query creates the new column Total by adding the values of three existing columns: Boarded, Transferred, and Nonrevenue:
proc sql outobs=10;
select flightnumber, date, destination,
boarded + transferred + nonrevenue
as Total
from sasuser.marchflights
You can also use a calculated column in the WHERE clause to subset rows. However, because of the way in which SQL queries are processed, you cannot just specify the column alias in the WHERE clause. To see what happens, we will take the preceding PROC SQL query and add a WHERE clause in the SELECT statement to reference the calculated column Total, as shown below:
proc sql outobs=10;
select flightnumber, date, destination,
boarded + transferred + nonrevenue
as Total
from sasuser.marchflights
where total < 100;
When this query is executed, the following error message is displayed in the SAS log.
519 proc sql outobs=10;
520 select flightnumber, date, destination,
521 boarded + transferred + nonrevenue
522 as Total
523 from sasuser.marchflights
524 where total < 100;
ERROR: The following columns were not found in the contributing tables: total.
This error message is generated because, in SQL queries, the WHERE clause is processed before the SELECT clause. The SQL processor looks in the table for each column named in the WHERE clause. The table Sasuser.Marchflights does not contain a column named Total, so SAS generates an error message.
When you use a column alias in the WHERE clause to refer to a calculated value, you must use the keyword CALCULATED along with the alias. The CALCULATED keyword informs PROC SQL that the value is calculated within the query. Now, the PROC SQL query looks like this:
proc sql outobs=10;
select flightnumber, date, destination,
boarded + transferred + nonrevenue
as Total
from sasuser.marchflights
where calculated total < 100;
This query executes successfully and produces the following output.
Note: As an alternative to using the keyword CALCULATED, repeat the calculation in the WHERE clause. However, this method is inefficient because PROC SQL has to perform the calculation twice. In the preceding query, the alternate WHERE statement would be:
where boarded + transferred + nonrevenue <100;
You can also use the CALCULATED keyword in other parts of a query. In the following example, the SELECT clause calculates the new column Total and then calculates a second new column based on Total. To create the second calculated column, you have to specify the keyword CALCULATED in the SELECT clause.
proc sql outobs=10;
select flightnumber, date, destination,
boarded + transferred + nonrevenue
as Total,
calculated total/2 as Half
from sasuser.marchflights;
This query produces the following output.
Note: The CALCULATED keyword is a SAS enhancement and is not specified in the ANSI Standard for SQL.
When you are using PROC SQL, you might find that the data in a table is not formatted as you would like it to appear. Fortunately, with PROC SQL you can use enhancements, such as the following, to improve the appearance of your query output:
• column labels and formats
• titles and footnotes
• columns that contain a character constant.
You know how to use the first two enhancements with other SAS procedures. You can also enhance PROC SQL query output by working with the following query:
proc sql outobs=15;
select empid, jobcode, salary,
salary * .10 as Bonus
from sasuser.payrollmaster
where salary>75000
order by salary desc;
This query limits output to 15 observations. The SELECT clause selects three existing columns from the table Sasuser.Payrollmaster, and calculates a fourth (Bonus). The WHERE clause retrieves only rows in which salary is greater than 75,000. The ORDER BY clause sorts by the Salary column and uses the keyword DESC to sort in descending order.
Here is the output from this query.
Note: The Salary column has the format DOLLAR9. specified in the table.
Look closely at this output and you will see that improvements can be made. You will learn how to enhance this output in the following ways:
• replace original column names with new labels
• specify a format for the Bonus column, so that all values are displayed with the same number of decimal places
• display a title at the top of the output
• add a column using a character constant.
By default, PROC SQL formats output using column attributes that are already saved in the table or, if none are saved, the default attributes. To control the formatting of columns in output, you can specify column modifiers, such as LABEL= and FORMAT=, after any column name specified in the SELECT clause. When you define a new column in the SELECT clause, you can assign a label rather than an alias, if you prefer.
Column Modifier |
Specifies... |
Example |
LABEL= |
the label to be displayed for the column |
select hiredate |
FORMAT= |
the format used to display column data |
select hiredate |
Note: LABEL= and FORMAT= are not part of the ANSI standard. These column modifiers are SAS enhancements.
TIP To force PROC SQL to ignore permanent labels in a table, specify the NOLABEL system option.
Your first task is to specify column labels for the first two columns. Below, the LABEL= option has been added after both EmpID and JobCode, and the text of each label is enclosed in quotation marks. For easier reading, each of the four columns in the SELECT clause is now listed on its own line.
proc sql outobs=15;
select empid label='Employee ID',
jobcode label='Job Code',
salary,
salary * .10 as Bonus
from sasuser.payrollmaster
where salary>75000
order by salary desc;
Next, you will add a format for the Bonus column. Because the Bonus values are dollar amounts, you will use the format Dollar 12.2. The FORMAT= modifier has been added to the SELECT clause, below, immediately following the column alias Bonus:
proc sql outobs=15;
select empid label='Employee ID',
jobcode label='Job Code',
salary,
salary * .10 as Bonus
format=dollar12.2
from sasuser.payrollmaster
where salary>75000
order by salary desc;
Now that column formats and labels have been specified, you can add a title to this PROC SQL query.
You should already know how to specify and cancel titles and footnotes with other SAS procedures. When you specify titles and footnotes with a PROC SQL query, you must place the TITLE and FOOTNOTE statements in either of the following locations:
• before the PROC SQL statement
• between the PROC SQL statement and the SELECT statement.
In the following PROC SQL query, two title lines have been added between the PROC SQL statement and the SELECT statement:
proc sql outobs=15;
title 'Current Bonus Information';
title2 'Employees with Salaries > $75,000';
select empid label='Employee ID',
jobcode label='Job Code',
salary,
salary * .10 as Bonus
format=dollar12.2
from sasuser.payrollmaster
where salary>75000
order by salary desc;
Now that these changes have been made, you can look at the enhanced query output.
The first two columns have new labels, the Bonus values are consistently formatted, and two title lines are displayed at the top of the output.
Another way of enhancing PROC SQL query output is to define a column that contains a character constant. To do this, you include a text string in quotation marks in the SELECT clause.
TIP You can define a column that contains a numeric constant in a similar way, by listing a numeric value (without quotation marks) in the SELECT clause.
You can look at the preceding PROC SQL query output again and determine where you can add a text string.
You can remove the column label Bonus and display the text bonus is: in a new column to the left of the Bonus column. This is how you want the columns and rows to appear in the query output.
To specify a new column that contains a character constant, you include the text string in quotation marks in the SELECT clause list. Your modified PROC SQL query is shown below:
proc sql outobs=15;
title 'Current Bonus Information';
title2 'Employees with Salaries > $75,000';
select empid label='Employee ID',
jobcode label='Job Code',
salary,
'bonus is:',
salary * .10 format=dollar12.2
from sasuser.payrollmaster
where salary>75000
order by salary desc;
In the SELECT clause list, the text string bonus is: has been added between Salary and Bonus.
Note that the code as Bonus has been removed from the last line of the SELECT clause. Now that the character constant has been added, the column alias Bonus is no longer needed.
Instead of just listing individual rows, you can use a summary function (also called an aggregate function) to produce a statistical summary of data in a table. For example, in the SELECT clause in the following query, the AVG function calculates the average (or mean) miles traveled by frequent-flyer club members. The GROUP BY clause tells PROC SQL to calculate and display the average for each membership group (MemberType).
proc sql; |
You should already be familiar with the list of summary functions that can be used in a PROC SQL query.
PROC SQL calculates summary functions and outputs results in different ways depending on a combination of factors. Four key factors are
• whether the summary function specifies one or multiple columns as arguments
• whether the query contains a GROUP BY clause
• if the summary function is specified in a SELECT clause, whether there are additional columns listed that are outside of a summary function
• whether the WHERE clause, if there is one, contains only columns that are specified in the SELECT clause.
To ensure that your PROC SQL queries produce the intended output, it is important to understand how the factors listed above affect the processing of summary functions. Consider an overview of all the factors, followed by a detailed example that illustrates each factor.
Summary functions specify one or more arguments in parentheses. In the examples shown in this chapter, the arguments are always columns in the table being queried.
Note: The ANSI-standard summary functions, such as AVG and COUNT, can be used only with a single argument. The SAS summary functions, such as MEAN and N, can be used with either single or multiple arguments.
The following chart shows how the number of columns specified as arguments affects the way that PROC SQL calculates a summary function.
If a summary function... |
Then the calculation is... |
Example |
specifies one column as argument |
performed down the column |
proc sql; |
specifies multiple columns as argument |
performed across columns for each row |
proc sql outobs=10; |
Summary functions perform calculations on groups of data. When PROC SQL processes a summary function, it looks for a GROUP BY clause:
If a GROUP BY clause... |
Then PROC SQL... |
Example |
is not present in the query |
applies the function to the entire table |
proc sql outobs=10; |
is present in the query |
applies the function to each group specified in the GROUP BY clause |
proc sql outobs=10; If a query contains a GROUP BY clause, all columns in the SELECT clause that do not contain a summary function should be listed in the GROUP BY clause or unexpected results might be returned. |
A SELECT clause that contains a summary function can also list additional columns that are not specified in the summary function. The presence of these additional columns in the SELECT clause list causes PROC SQL to display the output differently.
If a SELECT clause... |
Then PROC SQL... |
Example |
contains summary function(s) and no columns outside of summary functions |
calculates a single value by using the summary function for the entire table or, if groups are specified in the GROUP BY clause, for each group combines or rolls up the information into a single row of output for the entire table or, if groups are specified, for each group |
proc sql; |
contains summary function(s) and additional columns outside of summary functions |
calculates a single value for the entire table or, if groups are specified, for each group, and displays all rows of output with the single or grouped value(s) repeated |
proc sql; |
Note: WHERE clause columns also affect summary function processing. If there is a WHERE clause that references only columns that are specified in the SELECT clause, PROC SQL combines information into a single row of output. However, this condition is not covered in this chapter. For more information, see the SAS documentation for the SQL procedure.
In the next few sections, you will look more closely at the query examples shown above to see how the first three factors impact summary function processing.
Compare two PROC SQL queries that contain a summary function: one with a single argument and the other with multiple arguments. To keep things simple, these queries do not contain a GROUP BY clause.
Below is a PROC SQL query that displays the average salary of all employees listed in the table Sasuser.Payrollmaster:
proc sql;
select avg(salary) as AvgSalary
from sasuser.payrollmaster;
The SELECT statement contains the summary function AVG with Salary as its argument. Because there is only one column as an argument, the function calculates the statistic down the Salary column to display a single value: the average salary for all employees. The output is shown here.
Consider a PROC SQL query that contains a summary function with multiple columns as arguments. This query calculates the total number of passengers for each flight in March by adding the number of boarded, transferred, and nonrevenue passengers:
proc sql outobs=10;
select sum(boarded,transferred,nonrevenue)
as Total
from sasuser.marchflights;
The SELECT clause contains the summary function SUM with three columns as arguments. Because the function contains multiple arguments, the statistic is calculated across the three columns for each row to produce the following output.
Note: Without the OUTOBS= option, all rows in the table would be displayed in the output.
Consider how a PROC SQL query with a summary function is affected by including a GROUP BY clause and including columns outside of a summary function.
Once again, here is the PROC SQL query that displays the average salary of all employees listed in the table Sasuser.Payrollmaster. This query contains a summary function but, since the goal is to display the average across all employees, there is no GROUP BY clause.
proc sql outobs=20;
select avg(salary) as AvgSalary
from sasuser.payrollmaster;
Note that the SELECT clause lists only one column: a new column that is defined by a summary function calculation. There are no columns listed outside of the summary function.
Here is the query output.
Suppose you calculate an average for each job group and group the results by job code. Your first step is to add an existing column (JobCode) to the SELECT clause list. The modified query is shown here:
proc sql outobs=20;
select jobcode, avg(salary) as AvgSalary
from sasuser.payrollmaster;
Consider what the query output looks like now that the SELECT statement contains a column (JobCode) that is not a summary function argument.
Note: Remember that this PROC SQL query uses the OUTOBS= option to limit the output to 20 rows. Without this limitation, the output of this query would display all 148 rows in the table.
As this result shows, adding a column to the SELECT clause that is not within a summary function causes PROC SQL to output all rows instead of a single value. To generate this output, PROC SQL
• calculated the average salary down the column as a single value (54079.62)
• displayed all rows in the output, because JobCode is not specified in a summary function.
Therefore, the single value for AvgSalary is repeated for each row.
Note: When this query is submitted, the SAS log displays a message indicating that data remerging has occurred. Data remerging is explained later in this chapter.
While this result is interesting, you have not yet reached your goal: grouping the data by JobCode. The next step is to add the GROUP BY clause.
Below is the PROC SQL query from the previous page, to which has been added a GROUP BY clause that specifies the column JobCode. (In the SELECT clause, JobCode is specified but is not used as a summary function argument.) Other changes to the query include removing the OUTOBS= option (it is unnecessary) and specifying a format for the AvgSalary column.
proc sql;
select jobcode,
avg(salary) as AvgSalary format=dollar11.2
from sasuser.payrollmaster
group by jobcode;
Consider how the addition of the GROUP BY clause affects the output.
Success! The summary function has been calculated for each JobCode group, and the results are grouped by JobCode.
Sometimes you want to count the number of rows in an entire table or in groups of rows. In PROC SQL, you can use the COUNT summary function to count the number of rows that have nonmissing values. There are three main ways to use the COUNT function.
Using this form of COUNT... |
Returns... |
Example |
COUNT(*) |
the total number of rows in a group or in a table |
select count(*) as Count |
COUNT(column) |
the total number of rows in a group or in a table for which there is a nonmissing value in the selected column |
select count(jobcode) as Count |
COUNT(DISTINCT column) |
the total number of unique values in a column |
select count(distinct jobcode) |
CAUTION:
The COUNT summary function counts only the nonmissing values; missing values are ignored. Many other summary functions also ignore missing values. For example, the AVG function returns the average of the nonmissing values only. When you use a summary function with data that contains missing values, the results might not provide the information that you expect. It is a good idea to familiarize yourself with the data before you use summary functions in queries.
TIP To count the number of missing values, use the NMISS function. For more information about the NMISS function, see the SAS documentation.
Consider the three ways of using the COUNT function.
Suppose you want to know how many employees are listed in the table Sasuser.Payrollmaster. This table contains a separate row for each employee, so counting the number of rows in the table gives you the number of employees. The following PROC SQL query accomplishes this task:
proc sql;
select count(*) as Count
from sasuser.payrollmaster;
Note: The COUNT summary function is the only function that allows you to use an asterisk (*) as an argument.
You can also use COUNT(*) to count rows within groups of data. To do this, you specify the groups in the GROUP BY clause. Consider a more complex PROC SQL query that uses COUNT(*) with grouping. This time, the goal is to find the total number of employees within each job category, using the same table that is used above.
proc sql;
select substr(jobcode,1,2)
label='Job Category',
count(*) as Count
from sasuser.payrollmaster
group by 1;
This query defines two new columns in the SELECT clause. The first column, which is labeled JobCategory, is created by using the SAS function SUBSTR to extract the two-character job category from the existing JobCode field. The second column, Count, is created by using the COUNT function. The GROUP BY clause specifies that the results are to be grouped by the first defined column (referenced by 1 because the column was not assigned a name).
CAUTION:
When a column contains missing values, PROC SQL treats the missing values as a single group. This can sometimes produce unexpected results.
Suppose you want to count all of the non-missing values in a specific column instead of in the entire table. To do this, you specify the name of the column as an argument of the COUNT function. For example, the following PROC SQL query counts all non-missing values in the column JobCode:
proc sql; |
Because the table has no missing data, you get the same output with this query as you would by using COUNT(*). JobCode has a non-missing value for each row in the table. However, if the JobCode column contained missing values, this query would produce a lower value of Count than the previous query. For example, if JobCode contained three missing values, the value of Count would be 145.
To count all unique values in a column, add the keyword DISTINCT before the name of the column that is used as an argument. For example, here is the previous query modified to count only the unique values:
proc sql;
select count(distinct jobcode) as Count
from sasuser.payrollmaster;
This query counts 16 unique values for JobCode.
To display the unique JobCode values, you can apply the method of eliminating duplicates, which was discussed earlier. The following query lists only the unique values for JobCode.
proc sql;
select distinct jobcode
from sasuser.payrollmaster;
There are 16 job codes, so the output contains 16 rows.
You have seen how to use the GROUP BY clause to group data. For example, the following query calculates the average salary within each job-code group, and displays the average for each job code:
proc sql;
select jobcode,
avg(salary) as AvgSalary
format=dollar11.2
from sasuser.payrollmaster
group by jobcode;
There are 16 job codes in the table, so the output displays 16 rows.
Now, suppose you want to select only a subset of groups for your query output. You can use a HAVING clause, following a GROUP BY clause, to select (or filter) the groups to be displayed. The way a HAVING clause affects groups is similar to the way that a WHERE clause affects individual rows. As in a WHERE clause, the HAVING clause contains an expression that is used to subset the data. Any valid SAS expression can be used. When you use a HAVING clause, PROC SQL displays only the groups that satisfy the HAVING expression.
Note: You can use summary functions in a HAVING clause but not in a WHERE clause, because a HAVING clause is used with groups, but a WHERE clause can be used only with individual rows.
Modify the query shown above so that it selects only the JobCode groups with an average salary of more than $56,000. The HAVING clause has been added at the end of the query.
proc sql;
select jobcode,
avg(salary) as AvgSalary
format=dollar11.2
from sasuser.payrollmaster
group by jobcode
having avg(salary) > 56000;
TIP Alternatively, because the average salary is already calculated in the SELECT clause, the HAVING clause could specify the column alias AvgSalary:
having AvgSalary > 56000
Note that you do not have to specify the keyword CALCULATED in a HAVING clause; you would have to specify it in a WHERE clause.
The query output is shown below. This output is smaller than the previous output, because only a subset of the job-code groups is displayed.
If you omit the GROUP BY clause in a query that contains a HAVING clause, then the HAVING clause and summary functions (if any are specified) treat the entire table as one group. Without a GROUP BY clause, the HAVING clause in the example shown above calculates the average salary for the table as a whole (all jobs in the company), not for each group (each job code). The output contains either all the rows in the table (if the average salary for the entire table is greater than $56,000) or none of the rows in the table (if the average salary for the entire table is less than $56,000).
Sometimes, when you use a summary function in a SELECT clause or a HAVING clause, PROC SQL must remerge data (make two passes through the table). Remerging requires additional processing time and is often unavoidable. However, there are some situations in which you might be able to modify your query to avoid remerging. Understanding how and when remerging occurs will increase your ability to write efficient queries.
Consider a PROC SQL query that requires remerging. This query calculates each navigator’s salary as a percentage of all navigators’ salaries:
proc sql;
select empid, salary,
(salary/sum(salary)) as Percent
format=percent8.2
from sasuser.payrollmaster
where jobcode contains 'NA';
When you submit this query, the SAS log displays the following message.
NOTE: The query requires remerging summary statistics back with the original data.
Remerging occurs whenever any of the following conditions exist:
• The values returned by a summary function are used in a calculation.
• The SELECT clause specifies a column that contains a summary function and other column(s) that are not listed in a GROUP BY clause.
• The HAVING clause specifies one or more columns or column expressions that are not included in a subquery or a GROUP BY clause.
During remerging, PROC SQL makes two passes through the table:
PROC SQL calculates and returns the value of summary functions. PROC SQL also groups data according to the GROUP BY clause.
PROC SQL retrieves any additional columns and rows that it needs to display in the output, and uses the result from the summary function to calculate any arithmetic expressions in which the summary function participates.
Consider how PROC SQL remerges data when it processes the following query:
proc sql;
select empid, salary,
(salary/sum(salary)) as Percent
format=percent8.2
from sasuser.payrollmaster
where jobcode contains 'NA';
In the first pass, for each row in which the jobcode contains 'NA', PROC SQL calculates and returns the value of the SUM function (specified in the SELECT clause).
In the second pass, PROC SQL retrieves the additional columns and rows that it needs to display in output (EmpID, Salary) and the rows in which JobCode contains “NA”. PROC SQL also uses the result from the SUM function to calculate the arithmetic expression (salary/sum(salary)).
CAUTION:
Some implementations of SQL do not support remerging and would consider the preceding example to be in error.
TIP You can obtain the same results by using a subquery. Subqueries are discussed later in this chapter.
The WHERE and HAVING clauses both subset data based on an expression. In the query examples shown earlier in this chapter, the WHERE and HAVING clauses contained standard SAS expressions. For example, the expression in the following WHERE clause uses the BETWEEN-AND conditional operator and specifies the Salary column as an operand:
where salary between 70000 and 80000
PROC SQL also offers another type of expression that can be used for subsetting in WHERE and HAVING clauses: a query-expression or subquery. A subquery is a query that is nested in, and is part of, another query. A PROC SQL query might contain subqueries at one or more levels.
Note: Subqueries are also known as nested queries, inner queries, and sub-selects.
The following PROC SQL query contains a subquery in the HAVING clause that returns all jobcodes where the average salary for that jobcode is greater than the company average salary.
proc sql;
select jobcode,
avg(salary) as AvgSalary
format=dollar11.2
from sasuser.payrollmaster
group by jobcode
having avg(salary) >
(select avg(salary)
from sasuser.payrollmaster);
TIP It is recommended that you enclose a subquery (inner query) in parentheses, as shown here.
A subquery selects one or more rows from a table, and then returns single or multiple values to be used by the outer query. The subquery shown above is a single-value subquery; it returns a single value, the average salary from the table Sasuser.Payrollmaster, to the outer query. A subquery can return values for multiple rows but only for a single column.
The table that a subquery references can be either the same as or different from the table referenced by the outer query. In the PROC SQL query shown above, the subquery selects data from the same table as the outer query.
There are two types of subqueries.
Type of Subquery |
Description |
noncorrelated |
a self-contained subquery that executes independently of the outer query |
correlated |
a dependent subquery that requires one or more values to be passed to it by the outer query before the subquery can return a value to the outer query |
Both noncorrelated and correlated subqueries can return either single or multiple values to the outer query.
The next few sections provide a more in-depth look at noncorrelated and correlated subqueries, and how they are processed.
A noncorrelated subquery is a self-contained subquery that executes independently of the outer query.
The simplest type of subquery is a noncorrelated subquery that returns a single value.
The following PROC SQL query is the same query that is used in the previous section. This query displays job codes for which the group’s average salary exceeds the company’s average salary. The HAVING clause contains a noncorrelated subquery.
proc sql;
select jobcode,
avg(salary) as AvgSalary
format=dollar11.2
from sasuser.payrollmaster
group by jobcode
having avg(salary) >
(select avg(salary)
from sasuser.payrollmaster);
PROC SQL always evaluates a noncorrelated subquery before the outer query. If a query contains noncorrelated subqueries at more than one level, PROC SQL evaluates the innermost subquery first and works outward, evaluating the outermost query last.
In the query shown above, the inner query and outer query are processed as follows:
To complete the expression in the HAVING clause, the subquery calculates the average salary for the entire company (all rows in the table), using the AVG summary function with Salary as an argument.
The subquery returns the value of the average salary to the outer query.
The outer query calculates the average salary (in the SELECT clause) for each JobCode group (as defined in the GROUP BY clause), and selects only the groups whose average salary is greater than the company’s average salary.
The query output is shown here.
This noncorrelated subquery returns only a single value, the average salary for the whole company, to the outer query. Both the subquery and the outer query use the same table as a source.
Some subqueries are multiple-value subqueries: they return more than one value (row) to the outer query. If your noncorrelated subquery might return a value for more than one row, be sure to use one of the following operators in the WHERE or HAVING clause that can handle multiple values:
• the conditional operator IN
• a comparison operator that is modified by ANY or ALL
• the conditional operator EXISTS.
CAUTION:
If you create a noncorrelated subquery that returns multiple values, but the WHERE or HAVING clause in the outer query contains an operator other than one of the operators that are specified above, the query will fail. An error message is displayed in the SAS log, which indicates that the subquery evaluated to more than one row. For example, if you use the equal (=) operator with a noncorrelated subquery that returns multiple values, the query will fail. The equal operator can handle only a single value.
Consider a query that contains both the conditional operator IN and a noncorrelated subquery that returns multiple values. (The operators ANY, ALL, and EXISTS are presented later in this chapter.)
Suppose you want to send birthday cards to employees who have birthdays coming up. You decide to create a PROC SQL query that will list the names and addresses of all employees who have birthdays in February. This query, unlike the one shown on the previous page, will select data from two different tables:
• employee names and addresses in the table Sasuser.Staffmaster
• employee birthdates in the table Sasuser.Payrollmaster.
In both tables, the employees are identified by their employee identification number (EmpID).
In the following PROC SQL query, the WHERE clause contains the conditional operator IN followed by a noncorrelated subquery:
proc sql;
select empid, lastname, firstname,
city, state
from sasuser.staffmaster
where empid in
(select empid
from sasuser.payrollmaster
where month(dateofbirth)=2);
This query is processed as follows:
To complete the expression in the WHERE clause of the outer query, the subquery selects the employees whose date of birth is February. Note that the MONTH function is used in the subquery.
The subquery then returns the EmpID values of the selected employees to the outer query.
The outer query displays data (from the columns identified in the SELECT clause) for the employees identified by the subquery.
The output, shown below, lists the six employees who have February birthdays.
Although an inner join would have generated the same results, it is better to use a subquery in this example since no columns from the sasuser.payrollmater table were in the output.
Sometimes it is helpful to compare a value with a set of values returned by a subquery. When a subquery might return multiple values, you must use one of the conditional operators ANY or ALL to modify a comparison operator in the WHERE or HAVING clause immediately before the subquery. For example, the following WHERE clause contains the less than (<) comparison operator and the conditional operator ANY:
where dateofbirth < any
<subquery...>
CAUTION:
If you create a noncorrelated subquery that returns multiple values, and if the WHERE or HAVING clause in the outer query contains a comparison operator that is not modified by ANY or ALL, the query will fail.
When the outer query contains a comparison operator that is modified by ANY or ALL, the outer query compares each value that it retrieves against the value(s) returned by the subquery. All values for which the comparison is true are then included in the query output. If ANY is specified, then the comparison is true if it is true for any one of the values that are returned by the subquery. If ALL is specified, then the comparison is true only if it is true for all values that are returned by the subquery.
Note: The operators ANY and ALL can be used with correlated subqueries, but they are usually used only with noncorrelated subqueries.
Consider how the operators ANY or ALL are used.
An outer query that specifies the ANY operator selects values that pass the comparison test with any of the values that are returned by the subquery.
For example, suppose you have an outer query containing the following WHERE clause:
where dateofbirth < any
<subquery...>
This WHERE clause specifies that DateofBirth (the operand) should be less than any (the comparison operator) of the values returned by the subquery.
The following chart shows the effect of using ANY with these common comparison operators: greater than (>), less than (<) and equal to (=).
Comparison Operator with ANY |
Outer Query Selects... |
Example |
>ANY |
values that are greater than any value returned by the subquery |
If the subquery returns the values 20, 30, 40, then the outer query selects all values that are > 20 (the lowest value that was returned by the subquery). |
< ANY |
values that are less than any value returned by the subquery |
If the subquery returns the values 20, 30, 40, then the outer query selects all values that are < 40 (the highest value that was returned by the subquery). |
= ANY |
values that are equal to any value returned by the subquery |
If the subquery returns the values 20, 30, 40, the outer query selects all values that are = 20 or = 30 or = 40. |
TIP Instead of using the ANY operator with a subquery, there are some SAS functions that you can use to achieve the same result with greater efficiency. Instead of > ANY, use the MIN function in the subquery. Instead of < ANY, use the MAX function in the subquery.
Suppose you want to identify any flight attendants at level 1 or level 2 who are older than any of the flight attendants at level 3. Job type and level are identified in JobCode; each flight attendant has the job code FA1, FA2, or FA3. The following PROC SQL query accomplishes this task by using a subquery and the ANY operator:
proc sql;
select empid, jobcode, dateofbirth
from sasuser.payrollmaster
where jobcode in ('FA1','FA2')
and dateofbirth < any
(select dateofbirth
from sasuser.payrollmaster
where jobcode='FA3'),
Here is what happens when this query is processed:
The subquery returns the birthdates of all level-3 flight attendants.
The outer query selects only those level-1 and level-2 flight attendants whose birthdate is less than any of the dates returned by the subquery.
Note that both the outer query and subquery use the same table.
Note: Internally, SAS represents a date value as the number of days from January 1, 1960, to the given date. For example, the SAS date for 17 October 1991 is 11612. Representing dates as the number of days from a reference date makes it easy for the computer to store them and perform calendar calculations. These numbers are not meaningful to users, however, so several formats are available for displaying dates and datetime values in most of the commonly used notations.
Below are the query results.
TIP Using the ANY operator to solve this problem results in a large number of calculations, which increases processing time. For this example, it would be more efficient to use the MAX function in the subquery. The alternative WHERE clause follows:
where jobcode in ('FA1','FA2')
and dateofbirth <
(select max(dateofbirth)
from [...]
For more information about the MAX function, see the SAS documentation.
An outer query that specifies the ALL operator selects values that pass the comparison test with all of the values that are returned by the subquery.
The following chart shows the effect of using ALL with these common comparison operators: greater than (>) and less than (<).
Comparison Operator with ALL |
Sample Values Returned by Subquery |
Signifies... |
> ALL |
(20, 30, 40) |
> 40 (greater than the highest number in the list) |
< ALL |
(20, 30, 40) |
< 20 (less than the lowest number in the list) |
Substitute ALL for ANY in the previous query example. The following query identifies level-1 and level-2 flight attendants who are older than all of the level-3 flight attendants:
proc sql;
select empid, jobcode, dateofbirth
from sasuser.payrollmaster
where jobcode in ('FA1','FA2')
and dateofbirth < all
(select dateofbirth
from sasuser.payrollmaster
where jobcode='FA3'),
Here is what happens when this query is processed:
The subquery returns the birthdates of all level-3 flight attendants.
The outer query selects only those level-1 and level-2 flight attendants whose birthdate is less than all of the dates returned by the subquery.
The query results, below, show that only two level-1 or level-2 flight attendants are older than all of the level-3 flight attendants.
TIP For this example, it would be more efficient to solve this problem using the MIN function in the subquery instead of the ALL operator. The alternative WHERE clause follows:
where jobcode in ('FA1','FA2')
and dateofbirth <
(select min(dateofbirth)
from [...]
For more information about the MIN function, see the SAS documentation.
Correlated subqueries cannot be evaluated independently, but depend on the values passed to them by the outer query for their results. Correlated subqueries are evaluated for each row in the outer query and, therefore, tend to require more processing time than noncorrelated subqueries.
Note: Usually, a PROC SQL join is a more efficient alternative to a correlated subquery. You should already be familiar with basic PROC SQL joins.
Consider an example of a PROC SQL query that contains a correlated subquery. The following query displays the names of all navigators who are also managers. The WHERE clause in the subquery lists the column Staffmaster.EmpID, which is the column that the outer query must pass to the correlated subquery.
proc sql;
select lastname, firstname
from sasuser.staffmaster
where 'NA'=
select jobcategory
from sasuser.supervisors
where staffmaster.empid =
supervisors.empid);
Note: When a column appears in more than one table, the column name is preceded by the table name or alias to avoid ambiguity. In this example, EmpID appears in both tables, so the appropriate table name is specified in front of each reference to that column.
The output from this query is shown below. There are three navigators who are also managers.
In the WHERE clause or in the HAVING clause of an outer query, you can use the EXISTS or NOT EXISTS conditional operator to test for the existence or non-existence of a set of values returned by a subquery.
Condition |
Is true if... |
EXISTS |
the subquery returns at least one row |
NOT EXISTS |
the subquery returns no data |
Note: The operators EXISTS and NOT EXISTS can be used with both correlated and noncorrelated subqueries.
Consider a sample PROC SQL query that includes the NOT EXISTS conditional operator. Suppose you are working with the following tables:
• Sasuser.Flightattendants contains the names and employee ID numbers of all flight attendants.
• Sasuser.Flightschedule contains one row for each crew member assigned to a flight for each date.
As shown in the diagram below, the intersection of these two tables contains data for all flight attendants who have been scheduled to work.
Now suppose you want to list by name the flight attendants who have not been scheduled. That is, you want to identify the data in the area highlighted below.
The following PROC SQL query accomplishes this task by using a correlated subquery and the NOT EXISTS operator:
proc sql;
select lastname, firstname
from sasuser.flightattendants
where not exists
(select *
from sasuser.flightschedule
where flightattendants.empid=
flightschedule.empid);
The output is shown below.
When you are building a PROC SQL query, you might find it more efficient to check your query without actually executing it. To verify the syntax and the existence of columns and tables that are referenced in the query without executing the query, use either of the following:
• the NOEXEC option in the PROC SQL statement
• the VALIDATE keyword before a SELECT statement.
Consider how you specify the NOEXEC option and the VALIDATE keyword, and examine the minor differences between them.
The NOEXEC option is specified in the following PROC SQL statement:
proc sql noexec;
select empid, jobcode, salary
from sasuser.payrollmaster
where jobcode contains 'NA'
order by salary;
If the query is valid and all referenced columns and tables exist, the SAS log displays the following message.
NOTE: Statement not executed due to NOEXEC option.
Or, if there are any errors in the query, SAS displays the standard error messages in the log.
When you invoke the NOEXEC option, SAS checks the syntax of all queries in that PROC SQL step for accuracy but does not execute them.
You specify the VALIDATE keyword just before a SELECT statement; it is not used with any other PROC SQL statement.
We will modify the preceding PROC SQL query by using the VALIDATE keyword instead of the NOEXEC option:
proc sql;
validate
select empid, jobcode, salary
from sasuser.payrollmaster
where jobcode contains 'NA'
order by salary;
Note: Note that the VALIDATE keyword is not followed by a semicolon.
If the query is valid, the SAS log displays the following message.
NOTE: PROC SQL statement has valid syntax.
If there are errors in the query, SAS displays the standard error messages in the log.
The main difference between the VALIDATE keyword and the NOEXEC option is that the VALIDATE keyword only affects the SELECT statement that immediately follows it, whereas the NOEXEC option applies to all queries in the PROC SQL step. If you are working with a PROC SQL query that contains multiple SELECT statements, the VALIDATE keyword must be specified before each SELECT statement that you want to check.
In addition to the SELECT statement, PROC SQL supports the following statements.
Statement |
Use to ... |
ALTER TABLE expression; |
add, drop, and modify columns in a table |
CREATE expression; |
build new tables, views, or indexes |
DELETE expression; |
eliminate unwanted rows from a table or view |
DESCRIBE expression; |
display table and view attributes |
DROP expression; |
eliminate entire tables, views, or indexes |
INSERT expression |
add rows of data to tables or views |
RESET <option(s)>; |
add to or change PROC SQL options without re-invoking the procedure |
UPDATE expression; |
modify data values in existing rows of a table or view |
You can learn more about these PROC SQL statements in the following chapters:
• “Combining Tables Horizontally Using PROC SQL” on page 86
• “Combining Tables Vertically Using PROC SQL” on page 132
• “Creating and Managing Tables Using PROC SQL” on page 175
• “Creating and Managing Indexes Using PROC SQL” on page 238
• “Creating and Managing Views Using PROC SQL” on page 260
• “Managing Processing Using PROC SQL” on page 278.
This section contains the following topics.
• “Text Summary” on page 77
• “Sample Programs” on page 79
• “Points to Remember” on page 80
The SELECT statement and its subordinate clauses are the building blocks that you use to construct all PROC SQL queries.
To display all columns in the order in which they are stored in the table, use an asterisk (*) in the SELECT clause. To write the expanded list of columns to the SAS log, use the FEEDBACK option in the PROC SQL statement.
To limit the number of rows that PROC SQL displays as output, use the OUTOBS= option in the PROC SQL statement.
To eliminate duplicate rows from your query results, use the keyword DISTINCT in the SELECT clause.
In a PROC SQL query, use the WHERE clause with any valid SAS expression to subset data. The SAS expression can contain one or more operators, including the following conditional operators:
• the BETWEEN-AND operator selects within an inclusive range of values
• the CONTAINS or ? operator selects a character string
• the IN operator selects from a list of fixed values
• the IS MISSING or IS NULL operator selects missing values
• the LIKE operator selects a pattern
• the sounds-like (=*) operator selects a spelling variation
It is important to understand how PROC SQL processes calculated columns. When you use a column alias in the WHERE clause to refer to a calculated value, you must use the keyword CALCULATED with the alias.
You can enhance PROC SQL query output by using SAS enhancements such as column formats and labels, titles and footnotes, and character constraints.
PROC SQL calculates summary functions and outputs results differently, depending on a combination of factors:
• whether the summary function specifies one or more multiple columns as arguments
• whether the query contains a GROUP BY clause
• if the summary function is specified in a SELECT clause, whether there are additional columns listed that are outside the summary function
• whether the WHERE clause, if there is one, contains only columns that are specified in the SELECT clause.
To count non-missing values, use the COUNT summary function.
To select the groups to be displayed, use a HAVING clause following a GROUP BY clause.
When you use a summary function in a SELECT clause or a HAVING clause, in some situations, PROC SQL must remerge data. When PROC SQL remerges data, it makes two passes through the data, and this requires additional processing time.
In the WHERE or the HAVING clause of a PROC SQL query, you can use a subquery to subset data. A subquery is a query that is nested in, and is part of, another query. Subqueries can return values from a single row or multiple rows to the outer query but can return values only from a single column.
Noncorrelated subqueries execute independently of the outer query. You can use noncorrelated subqueries that return a single value or multiple values. To further qualify a comparison specified in a WHERE or a HAVING clause, you can use the conditional operators ANY and ALL immediately before a noncorrelated (or correlated) subquery.
Correlated subqueries cannot be evaluated independently because their results are dependent on the values returned by the outer query. In the WHERE or the HAVING clause of an outer query, you can use the EXISTS and NOT EXISTS conditional operators to test for the existence or non-existence of a set of values returned by the subquery.
To check the validity of the query syntax without actually executing the query, use the NOEXEC option or the VALIDATE keyword.
PROC SQL supports many statements in addition to the SELECT statement.
PROC SQL OUTOBS=n;
SELECT column-1<,... column-n>
FROM table-1 | view-1<, ... table-n | view-n>
<WHEREexpression>
<GROUP BY column-1<, ... column-n>>
<HAVINGexpression>
<ORDER BY column-1<, ... column-n>>;
QUIT;
sql-expression <NOT> BETWEEN sql-expression AND sql-expression
sql-expression <NOT> CONTAINS sql-expression
sql-expression <NOT> IN (query-expression | constant-1<,...constant-n>)
sql-expression IS MISSING
sql-expression IS NULL
sql-expression <NOT> LIKE sql-expression
sql-expression =* sql-expression
proc sql feedback;
select *
from sasuser.staffchanges;
quit;
proc sql;
select distinct flightnumber, destination
from sasuser.internationalflights
order by 1;
quit;
proc sql outobs=10;
validate
select flightnumber,
date label="Flight Date", destination,
boarded + transferred + nonrevenue
as Total
from sasuser.marchflights
where calculated total between 100 and 150;
quit;
proc sql noexec;
select jobcode,
avg(salary) as AvgSalary
format=dollar11.2
from sasuser.payrollmaster
group by jobcode
having avg(salary) >
(select avg(salary)
from sasuser.payrollmaster);
quit;
proc sql;
title 'Frequent Flyers Who Are Not Employees';
select count(*) as Count
from sasuser.frequentflyers
where not exists
(select *
from sasuser.staffmaster
where name=
trim(lastname)||', '||firstname);
quit;
• When you use summary functions, look for missing values. If a table contains missing values, your results might not be what you expect. Many summary functions ignore missing values when performing calculations, and PROC SQL treats missing values in a column as a single group.
• When you create complex queries, it is helpful to use the NOEXEC option or the VALIDATE statement to validate your query without executing it.
Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.
Which PROC SQL query will remove duplicate values of MemberType from the query output, so that only the unique values are listed?
proc sql nodup;
select membertype
from sasuser.frequentflyers;
proc sql;
select distinct(membertype)
as MemberType
from sasuser.frequentflyers;
proc sql;
select unique membertype
from sasuser.frequentflyers
group by membertype;
proc sql;
select distinct membertype
from sasuser.frequentflyers;
Which of the following will cause PROC SQL to list rows that have no data in the Address column?
WHERE address is missing
WHERE address not exists
WHERE address is null
both a and c
You are creating a PROC SQL query that will list all employees who have spent (or overspent) their allotted 120 hours of vacation for the current year. The hours that each employee used are stored in the existing column Spent. Your query defines a new column, Balance, to calculate each employee’s balance of vacation hours.
Which query will produce the report that you want?
proc sql;
select name, spent,
120-spent as calculated Balance
from Company.Absences
where balance <= 0;
proc sql;
select name, spent,
120-spent as Balance
from Company.Absences
where calculated balance <= 0;
proc sql;
select name, spent,
120-spent as Balance
from Company.Absences
where balance <= 0;
proc sql;
select name, spent,
120-spent as calculated Balance
from Company.Absences
where calculated balance <= 0;
Consider this PROC SQL query:
proc sql;
select flightnumber,
count(*) as Flights,
avg(boarded)
label="Average Boarded"
format=3.
from sasuser.internationalflights
group by flightnumber
having avg(boarded) > 150;
The table Sasuser.Internationalflights contains 201 rows, 7 unique values of FlightNumber, 115 unique values of Boarded, and 4 different flight numbers that have an average value of Boarded that is greater than 150. How many rows of output will the query generate?
150
7
4
1
You are writing a PROC SQL query that will display the names of all library cardholders who work as volunteers for the library, and the number of books that each volunteer currently has checked out. You will use one or both of the following tables:
• Library.Circulation lists the name and contact information for all library cardholders, and the number of books that each cardholder currently has checked out.
• Library.Volunteers lists the name and contact information for all library volunteers.
Assume that the values of Name are unique in both tables.
Which of the following PROC SQL queries will produce your report?
proc sql;
select name, checkedout
from library.circulation
where * in
(select *
from library.volunteers);
proc sql;
select name, checkedout
from library.circulation
where name in
(select name
from library.volunteers);
proc sql;
select name
from library.volunteers
where name, checkedout in
(select name, checkedout
from library.circulation);
proc sql;
select name, checkedout
from library.circulation
where name in
(select name
from library.volunteers;);
By definition, a noncorrelated subquery is a nested query that
returns a single value to the outer query.
contains at least one summary function.
executes independently of the outer query.
requires only a single value to be passed to it by the outer query.
Which statement about the following PROC SQL query is false?
proc sql;
validate
select name label='Country',
rate label='Literacy Rate'
from world.literacy
where 'Asia' =
(select continent
from world.continents
where literacy.name =
continents.country)
order by 2;
The query syntax is not valid.
The outer query must pass values to the subquery before the subquery can return values to the outer query.
PROC SQL will not execute this query when it is submitted.
After the query is submitted, the SAS log will indicate whether the query has valid syntax.
Consider the following PROC SQL query:
proc sql;
select lastname, firstname,
total, since
from charity.donors
where not exists
(select lastname
from charity.current
where donors.lastname =
current.lastname);
The query references two tables:
• Charity.Donors lists name and contact information for all donors who have made contributions since the charity was founded. The table also contains these two columns: Total, which shows the total dollars given by each donor, and Since, which stores the first year in which each donor gave money.
• Charity.Current lists the names of all donors who have made contributions in the current year, and the total dollars each has given this year (YearTotal).
Assume that the values of LastName are unique in both tables.
The output of this query displays
all donors whose rows do not contain any missing values.
all donors who made a contribution in the current year.
all donors who did not make a contribution in the current year.
all donors whose current year’s donation in Charity.Current has not yet been added to Total in Charity.Donors.
Which statement about data remerging is true?
When PROC SQL remerges data, it combines data from two tables.
By using data remerging, PROC SQL can avoid making two passes through the data.
When PROC SQL remerges data, it displays a related message in the SAS log.
PROC SQL does not attempt to remerge data unless a subquery is used.
A public library has several categories of books. Each book in the library is assigned to only one category. The table Library.Inventory contains one row for each book in the library. The Checkouts column indicates the number of times that each book has been checked out.
You want to display only the categories that have an average circulation (number of checkouts) that is less than 2500. Does the following PROC SQL query produce the results that you want?
proc sql;
title 'Categories with Average Circulation';
title2 'Less Than 2500';
select category,
avg(checkouts) as AvgCheckouts
from library.inventory
having avg(checkouts) < 2500
order by 1;
No. This query will not run because a HAVING clause cannot contain a summary function.
No. This query will not run because the HAVING clause must include the CALCULATED keyword before the summary function.
No. Because there is no GROUP BY clause, the HAVING clause treats the entire table as one group.
Yes.