Chapter 7: Performing Conditional Processing

7.1  Introduction

7.2  The IF and ELSE IF Statements

7.3  The Subsetting IF Statement

7.4  The IN Operator

7.5  Using a SELECT Statement for Logical Tests

7.6  Using Boolean Logic (AND, OR, and NOT Operators)

7.7  A Caution When Using Multiple OR Operators

7.8  The WHERE Statement

7.9  Some Useful WHERE Operators

7.10  Problems

 

7.1  Introduction

This chapter describes the tools that allow programs to “make decisions” based on data values. For example, you may want to read a value of Age and create a variable that represents age groups. You may want to determine if values for a particular variable are within predefined limits. Programs that perform any of these operations require conditional processing—the ability to make logical decisions based on data values.

7.2  The IF and ELSE IF Statements

Two of the basic tools for conditional processing are the IF and ELSE IF statements. To understand how these statements work, suppose you have collected the following data on a group of students:

        Age (in years)

        Gender (recorded as M or F)

        Midterm (grade on the midterm exam)

        Quiz (quiz grade from F to A+)

        FinalExam (grade on the final exam)

Your first task is to create a new variable that represents age groups. Here is a first attempt.

Note: This program is not correct.

Program 7.1: First Attempt to Group Ages into Age Groups (Incorrect)

  data Conditional;

     length Gender $ 1

            Quiz   $ 2;

     input Age Gender Midterm Quiz FinalExam;

     if Age lt 20 then AgeGroup = 1;

     if Age ge 20 and Age lt 40 then AgeGroup = 2;

     if Age ge 40 and Age lt 60 then AgeGroup = 3;

     if Age ge 60 then AgeGroup = 4;

  datalines;

  21 M 80 B- 82

  .  F 90 A  93

  35 M 87 B+ 85

  48 F  . .  76

  59 F 95 A+ 97

  15 M 88 .  93

  67 F 97 A  91

  .  M 62 F  67

  35 F 77 C- 77

  49 M 59 C  81

  ;

  title "Listing of Conditional";

  proc print data=Conditional noobs;

  run;

A complete list of the logical comparison operators is displayed below.

Logical Comparison

Mnemonic

Symbol

Equal to

EQ

=

Not equal to

NE

^= or ~= or ¬= *

Less than

LT

Less than or equal to

LE

<=

Greater than

GT

Greater than or equal to

GE

>=

Equal to one in a list

IN

 

* The symbol you use depends on what symbols are available on your keyboard. You can use the mnemonics on any system.

Let’s follow the logic of Program 7.1. The first IF statement asks if Age is less than 20. When the logical expression following the keyword IF is true, the statement following the word THEN is executed; if the expression is not true, the program continues to process the next statements in the DATA step.

There is one serious problem with this program’s logic and it relates to how SAS treats missing numeric values. Missing numeric values are treated logically as the most negative number you can reference on your computer. Therefore, the first IF statement will be true for missing values as well as for all ages less than 20.

Note: This is a very important point. It is a good example of a program that has no syntax errors, runs without any warning or error messages in the log, and produces incorrect results.

There are several ways to prevent your missing values from being included in AgeGroup 1. Here are several options:

if Age lt 20 and Age ne . then AgeGroup = 1;

if Age ge 0 and Age lt 20 then AgeGroup = 1;

if 0 le Age lt 20 then AgeGroup = 1;

if Age lt 20 and not missing(Age) then AgeGroup = 1;

All of these statements result in the correct value for AgeGroup. Subjects with a missing value for Age will also have a missing value for AgeGroup.

The first IF statement uses the fact that you refer to a numeric missing value in a DATA step by a period. The last IF statement uses the MISSING function. This function returns a value of TRUE if the argument (the variable in parentheses) is missing, and FALSE if the argument is not missing. Note: the MISSING function also works with character data.

This program can be improved further. If a person is less than 20 years of age, the first IF statement is true and AgeGroup is set to 1. All the remaining IF statements are still executed (although they will all be false and AgeGroup remains 1). A better way to write this program is to change all the IF statements after the first one to ELSE IF statements.

Here is what the corrected program looks like:

Program 7.2: Corrected Program to Group Ages into Age Groups

  data Conditional;

     length Gender $ 1

            Quiz   $ 2;

     input Age Gender Midterm Quiz FinalExam;

     if Age lt 20 and not missing(age) then AgeGroup = 1;

     else if Age ge 20 and Age lt 40 then AgeGroup = 2;

     else if Age ge 40 and Age lt 60 then AgeGroup = 3;

     else if Age ge 60 then AgeGroup = 4;

  datalines;

Using this logic, when any of the IF statements is true, all the following ELSE statements are not evaluated. This saves on processing time.

An alternative way to write this program is to test for a missing value in the first IF statement and use the ELSE statements to advantage, as follows:

Program 7.3: An Alternative to Program 7.2

  data Conditional;

     length Gender $ 1

            Quiz   $ 2;

     input Age Gender Midterm Quiz FinalExam;

     if missing(Age) then AgeGroup = .;

        else if Age lt 20 then AgeGroup = 1;

        else if Age lt 40 then AgeGroup = 2;

        else if Age lt 60 then AgeGroup = 3;

        else if Age ge 60 then AgeGroup = 4;

  datalines;

When you write a program like this, you need to “play computer” and make sure your logic is correct. For example, what if a person is 25 years old? The first IF statement is false because Age is not missing. The next ELSE IF statement is evaluated and found to be false as well. Finally, the third IF statement is evaluated and AgeGroup is set equal to 2. Because this IF statement is true, all the remaining ELSE IF statements are skipped.

If you are working with very large data sets and want to squeeze every last drop out of the efficiency tank, you should place the IF statements in order, from the ones most likely to have a true condition to the ones least likely to have a true condition. This increases efficiency because SAS skips testing all the ELSE conditions when a previous IF condition is true. The disadvantage of using this technique is that it makes the program harder to read and more likely to have an error. With the speed of modern computers, you should consider these extreme efforts at efficiency only for production programs that operate on very large data sets.

7.3  The Subsetting IF Statement

If you want to create a subset of data from a raw data file or from an existing SAS data set, you can use a special form of an IF statement called a subsetting IF. As an example, let’s use the raw data from Program 7.1, but restrict the resulting data set to females only. Here is the program:

Program 7.4: Demonstrating a Subsetting IF statement

  data Females;

     length Gender $ 1

            Quiz   $ 2;

     input Age Gender Midterm Quiz FinalExam;

     if Gender eq 'F';

  datalines;

  21 M 80 B- 82

  .  F 90 A  93

  35 M 87 B+ 85

  48 F  . .  76

  59 F 95 A+ 97

  15 M 88 .  93

  67 F 97 A  91

  .  M 62 F  67

  35 F 77 C- 77

  49 M 59 C  81

  ;

  title "Listing of Females";

  proc print data=Females noobs;

  run;

Notice that there is no THEN following the IF in this program. If the condition is true, the program continues to the next statement; if the condition is false, control returns to the top of the DATA step. In this case, the only statement following the subsetting IF statement is a DATALINES statement. If the value of Gender is F, the end of the DATA step is reached and an automatic output occurs. If the value of Gender is not equal to F, control returns to the top of the DATA step and the automatic output does not occur.

Here is the output:

Figure 7.1: Output from Program 7.4

Figure 7.1: Output from Program 7.4

A more efficient way to write Program 7.4 would be to read a value of Gender first. Then if it is an F, continue reading the rest of the data values; if not, return to the top of the DATA step and do not output an observation. To see how this can be accomplished, look at Section 11 in Chapter 21.

7.4  The IN Operator

If you want to test if a value is one of several possible choices, you can use multiple OR statements, like this:

if Quiz = 'A+' or Quiz = 'A' or Quiz = 'B+' or Quiz = 'B'

   then QuizRange = 1;

else if Quiz = 'B-' or Quiz = 'C+' or Quiz = 'C'

   then QuizRange = 2;

else if not missing(Quiz) then QuizRange = 3;

These statements can be simplified by using the IN operator, like this:

if Quiz in ('A+','A','B+','B') then QuizRange = 1;

else if Quiz in ('B-','C+','C') then QuizRange = 2;

else if not missing(Quiz) then QuizRange = 3;

The list of values in parentheses following the IN operator can be separated by commas or blanks. The first line could also be written like this:

if Quiz in ('A+' 'A' 'B+' 'B') then QuizRange = 1;

You can also use the IN operator with numeric variables. For example, if you had a numeric variable called Subject and you wanted to list observations for Subject numbers 10, 22, 25, and 33, the following WHERE statement could be used:

where Subject in (10,22,25,33);

Note: WHERE statements are similar to IF statements but they can be used only when reading observations from a SAS data set, not raw data – see section 7.8 in this chapter.

As with the example using character values, you may separate the values with commas or spaces. You can also specify a range of numeric values, using a colon to separate values in the list. For example, to list observations where Subject is a 10, 22–25, or 30, you can write:

where Subject in (10,22:25,30);

Remember that you should use a colon and not a dash for this feature to work—a dash would be interpreted as a minus sign in a list of numbers separated by spaces.

Note: The colon notation works only for integer values.

7.5  Using a SELECT Statement for Logical Tests

A SELECT statement provides an alternative to a series of IF and ELSE IF statements.

Here is one way to use a SELECT statement:

select (AgeGroup);

   when (1) Limit = 110;

   when (2) Limit = 120;

   when (3) Limit = 130;

   otherwise;

end;

The expression following the SELECT statement is referred to as a select-expression; the expression following a WHEN statement is referred to as a when-expression. In this example, the select-expression(AgeGroup) is compared to each of the when-expressions. If the comparison is true, the statement following the when-expression is executed and control skips to the end of the SELECT group. If the comparison is false, the next when-expression is compared to the select-expression. If none of the comparisons is evaluated to be true, the expression following the OTHERWISE statement is executed. As you can see in this example, the otherwise-expression can be a null statement. It is necessary to include an OTHERWISE statement because the program will terminate if you omit it and none of the preceding comparisons is true.

You can place more than one value in the when-expression, like this:

select (AgeGroup);

   when (1) Limit = 110;

   when (2) Limit = 120;

   when (3,5) Limit = 130;

   otherwise;

end;

In this example, AgeGroup values of 3 or 5 will set Limit equal to 130.

To help clarify this concept, let’s follow some scenarios:

If AgeGroup is equal to 1, Limit will be 110. If Agegroup is equal to 3, Limit will be equal to 130. If AgeGroup is equal to 4, Limit will be a missing value (because it is set to a missing value in the PDV at each iteration of the DATA step and it is never assigned a value).

If you do not supply a select-expression, each WHEN statement is evaluated to determine if the when-expression is true or false. As an example, here is Program 7.5, rewritten using a SELECT statement:

Program 7.5: Demonstrating a SELECT Statement When a Select-Expressionis Missing

  data Conditional;

     length Gender $ 1

            Quiz   $ 2;

     input Age Gender Midterm Quiz FinalExam;

     select;

        when (missing(Age)) AgeGroup = .;

        when (Age lt 20) AgeGroup = 1;

        when (Age lt 40) AgeGroup = 2;

        when (Age lt 60) AgeGroup = 3;

        when (Age ge 60) Agegroup = 4;

        otherwise;

     end;

  datalines;

Notice that there is no select-expression in this SELECT statement. Each when-expression is evaluated and, if true, the statement following the expression is executed.

7.6  Using Boolean Logic (AND, OR, and NOT Operators)

You can combine various logical operators (also known as Boolean operators) to form fairly complex statements. As an example, the data set Medical contains information on clinic, diagnosis (DX), and weight. A program to list all patients who were seen at the HMC clinic and had either a diagnosis 7 or 9 or weighted over 180 pounds demonstrates how to combine various Boolean operators:

Program 7.6: Combining Various Boolean Operators

  title "Example of Boolan Expressions";

  proc print data=Learn.Medical;

     where Clinic eq 'HMC' and

           (DX in ('7','9') or

           Weight gt 180);

     id Patno;

     var Patno Clinic DX Weight VisitDate;

  run;

Notice the parentheses around the two statements separated by OR. The AND operator has precedence over (i.e., is performed before) the OR operator. That is, a statement such as the following:

if X and Y or Z;

is the same as this one:

if (X and Y) or Z;

If you want to perform the OR operation before the AND operation, use parentheses, like this:

if X and (Y or Z);

In Program 7.6, you want the clinic to be HMC and you want either the diagnosis to be one of two values or the weight to be over 180 pounds. You need the parentheses to first decide if either the diagnosis or weight condition is true before performing the AND operation with the Clinic variable. Even in cases where parentheses are not needed, it is fine to include them so that the logical statements are easier to read and understand.

The NOT operator has the highest precedence, so it is performed before AND. For example, the statement:

if X and not y or z;

is equivalent to:

if (X and (not y)) or z;

Here is the output:

Figure 7.2: Output from Program 7.6

Figure 7.2: Output from Program 7.6

7.7  A Caution When Using Multiple OR Operators

Look at the short program here:

Program 7.7: A Caution on the Use of Multiple OR Operators

  data Believe_it_or_Not;

     input X;

     if X = 3 or 4 then Match = 'Yes';

     else Match = 'No';

  datalines;

  3

  7

  .

  ;

  title "Listing of Believe_it_or_Not";

  proc print data=Believe_it_or_Not noobs;

  run;

The programmer probably wanted to say:

if X = 3 or X = 4 then Match = 'Yes';

What happens when you run this program? Many folks would expect to see an error message in the log and would expect the program to terminate. Not so. Here is the output from this program:

Figure 7.3: Output from Program 7.7

Figure 7.3: Output from Program 7.7

In Program 7.7, there is one condition on either side of the OR operator—one is X = 3, the other is 4. In SAS, any value other than 0 or missing is true. Therefore, 4 is evaluated as true and the statement X = 4 OR 4 is always true.

Note: This bears repeating: Any numeric value in SAS that is not zero or missing is considered TRUE.

7.8  The WHERE Statement

If you are reading data from a SAS data set, you can use a WHERE statement to subset your data. For example, if you started with the SAS data set Conditional (Program 7.3), you could create a data set of all females with the following program:

Program 7.8: Using a WHERE Statement to Subset a SAS Data Set

  data Females;

     set Conditional;

     where Gender eq 'F';

  run;

In this example you could use either a WHERE or a subsetting IF statement. There are sometimes advantages to using a WHERE statement instead of a subsetting IF statement. You have a larger choice of operators that can be used with a WHERE statement (to be discussed next) and, if the input data set is indexed, the WHERE statement might be more efficient.

You may also use a WHERE statement in a SAS procedure to subset the data being processed. You cannot use a subsetting IF statement with a SAS procedure.

Note: IF statements are not allowed inside SAS procedures.

7.9  Some Useful WHERE Operators

The table here lists some of the useful operators that you can use with a WHERE statement.

Operator

Description

Example

IS MISSING

Matches a missing value

where Subj is missing

IS NULL

Equivalent to IS MISSING

where Subj is null

BETWEEN AND

An inclusive range

where age between 20 and 40

CONTAINS

Matches a substring

where Name contains Mac

LIKE

Matching with wildcards

where Name like R_n%

=*

Phonetic matching

where Name =* Nick

 

Note: When using the LIKE operator, the underscore character (_) takes the place of a single character, while the percent sign (%) can be substituted for a string of any length (including a null string).

Here are some examples.

Expression

Matches

where Gender is null

A missing character value

where Age is null

A missing numeric value

where Age is missing

A missing numeric value

where Age between 20 and 40

All values between 20 and 40, including 20 and 40

where Name contains ‘mac’

macon immaculate

where Name like ‘R_n%’

Ron Ronald Run Running

where Name =* ‘Nick’

Nick Nack Nikki

Notes:

        The IS NULL or IS MISSING expression matches a character or a numeric missing value.

        The BETWEEN AND expression matches all the values greater than or equal to the first value and less than or equal to the second value. This works with character as well as numeric variables.

        The CONTAINS expression matches any character value containing the given string.

        The LIKE expression uses two wildcard operators. The underscore (_) is a place holder; enter as many underscores as you need to stand for the same number of characters. The percent (%) matches nothing or a string of any length.

7.10  Problems

Solutions to odd-numbered problems are located at the back of this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion website at support.sas.com/cody for information about how to obtain the solutions to all problems.

 

1.       Run the program here to create a temporary SAS data set called School:

  data School;

     input Age Quiz : $1. Midterm Final;

     /* Add you statements here */

  datalines;

  12 A 92 95

  12 B 88 88

  13 C 78 75

  13 A 92 93

  12 F 55 62

  13 B 88 82

  ;

Using IF and ELSE IF statements, create two new variables as follows: Grade (numeric), with a value of 6 if Age is 12 and a value of 8 if Age is 13.

The quiz grades have numerical equivalents as follows: A = 95, B = 85, C = 75, D = 70, and F = 65. Using this information, compute a course grade (Course) as a weighted average of the Quiz (20%), Midterm (30%) and Final (50%).

2.       Using the SAS data set Hosp, use PROC PRINT to list observations for Subject  values of 5, 100, 150, and 200. Do this twice, once using OR operators and once using the IN operator.

Note: Subject is a numeric variable.

3.       Using the Sales data set, list the observations for employee numbers (EmpID) 9888 and 0177. Do this two ways, one using OR operators and the other using the IN operator. Note: EmpID is a character variable.

4.       Using the Sales data set, create a new, temporary SAS data set containing Region and TotalSales plus a new variable called Weight with values of 1.5 for the North Region, 1.7 for the South Region, and 2.0 for the West and East Regions. Use a SELECT statement to do this.

5.       Starting with the Blood data set, create a new, temporary SAS data set containing all the variables in Blood plus a new variable called CholGroup. Define this new variable as follows:

CholGroup

Chol

Low

Low – 110

Medium

111 – 140

High

141 – High

 

Use a SELECT statement to do this.

6.       Using the Sales data set, list all the observations where Region is North and Quantity is less than 60. Include in this list any observations where the customer name (Customer) is Pet's are Us.

7.       Using the Bicycles data set, list all the observations for Road Bikes that cost more than $2,500 or Hybrids that cost more than $660. The variable Model contains the type of bike and UnitCost contains the cost.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset