
Chapter 2 Working with Your Data

2.1 Data Set Options

2.1.1 REPLACE and REPEMPTY

2.1.2 Password Protection

2.1.3 KEEP, DROP, and RENAME Options

2.1.4 Observation Control Using FIRSTOBS and OBS Data Set Options

2.2 Evaluating Expressions

2.2.1 Operator Hierarchy

2.2.2 Using the Colon as a Comparison Modifier

2.2.3 Logical and Comparison Operators in Assignment Statements

2.2.4 Compound Inequalities

2.2.5 The MIN and MAX Operators

2.2.6 Numeric Expressions and Boolean Transformations

2.3 Data Validation and Exception Reporting

2.3.1 Date Validation

2.3.2 Writing to an Error Data Set

2.3.3 Controlling Exception Reporting with Macros

2.4 Normalizing - Transposing the Data

2.4.1 Using PROC TRANSPOSE

2.4.2 Transposing in the DATA Step

2.5 Filling Sparse Data

2.5.1 Known Template of Rows

2.5.2 Double Transpose

2.5.3 Using COMPLETETYPES with PROC MEANS or PROC SUMMARY

2.5.4 Using CLASSDATA

2.5.5 Using Preloaded Formats

2.5.6 Using the SPARSE Option with PROC FREQ

2.6 Some General Concepts

2.6.1 Shorthand Variable Naming

2.6.2 Understanding the ORDER= Option

2.6.3 Quotes within Quotes within Quotes

2.6.4 Setting the Length of Numeric Variables

2.7 WHERE Specifics

2.7.1 Operators Just for the WHERE

2.7.2 Interaction with the BY Statement

2.8 Appending Data Sets

2.8.1 Appending Data Sets Using the DATA Step and SQL UNION

2.8.2 Using the DATASETS Procedure’s APPEND Statement

2.9 Finding and Eliminating Duplicates

2.9.1 Using PROC SORT

2.9.2 Using FIRST. and LAST. BY-Group Processing

2.9.3 Using PROC SQL

2.9.4 Using PROC FREQ

2.9.5 Using the Data Component Hash Object

2.10 Working with Missing Values

2.10.1 Special Missing Values

2.10.2 MISSING System Option

2.10.3 Using the CMISS, NMISS, and MISSING Functions

2.10.4 Using the CALL MISSING Routine

2.10.5 When Classification Variables Are Missing

2.10.6 Missing Values and Macro Variables

2.10.7 Imputing Missing Values

In SAS the data set is central to most of our analyses and reporting, so it is crucial that we be able to learn all sorts of things about our data. The power that we need comes from a multitude of SAS tools and techniques. This chapter is a fairly random collection of those tools and techniques that can help us accomplish our goals when working with our data sets.

2.1 Data Set Options

Data set options can be used to modify how a data set is either read or written. There are over three dozen of these options, and while you will generally make use of only a handful of them, you should have a good idea of their scope.

To use these options, place them in parentheses immediately following the name of the data set to which they are to be applied. While data set options can be used virtually any time the data set is named, some of the options are situation dependent, which means that you will have to understand what an option does before applying it. For instance, options that control how a data set is to be read would not be used on a DATA statement.

In the following example the KEEP data set option is applied to the data set being used by PROC SORT.

proc sort data=advrpt.demog(keep= lname fname ssn)
               out=namesonly;
by lname fname;
run;

Regardless of how many variables are in ADVRPT.DEMOG the SORT procedure will only have to deal with the three variables named in the KEEP= option. For a SORT procedure this can substantially speed up the processing.

data yr6_7;
   set year2006 
       year2007(keep=subject visit labdt);
   run;

Data set options apply only to the data set to which they are associated. Here all the variables from YEAR2006 will be included. Only the variables from YEAR2007 will be limited by the KEEP= data set option.

data yr6_7;
   set year:(keep=subject visit labdt);
   run;

This is not the case when a data set list abbreviation is used. Only these three variables will be read from the incoming data sets (WORK.YEARxxxxxx). The variable list applies to each incoming data set; consequently, an error is generated if a variable is not present on one or more of the incoming data sets.

For a number of the data set options, similar functionality can be obtained through the use of DATA step statements or through the use of system options. System options are the most general (they would apply to all data sets); DATA step statements will only apply to data sets within that DATA step, and data set options are the most specific as they apply only to a specific data set.
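This hierarchy can be illustrated with the observation limit, which exists at both the system-option and data-set-option levels (a minimal sketch; the ADVRPT libref is assumed from the surrounding examples):

```sas
* Most general - the system option limits every data set read from here on;
options obs=50;

* Most specific - the data set option overrides the system option,
  but only for this one data set;
proc print data=advrpt.demog(obs=10);
   run;

* Restore the default so that later steps read all observations;
options obs=max;
```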

MORE INFORMATION

Additional information on the use of data set options with PROC SORT can be found in Section 4.2. The INDSNAME= option, along with the IN= data set options, are discussed in Section 3.8.2.

2.1.1 REPLACE and REPEMPTY

data advrpt.class;
   set sashelp.class;
   where age > 25;
   run;

Since it is possible to create an empty (zero observation) data set, we may want to control whether or not the new table will replace an existing table of the same name. For this example, assume that the data set ADVRPT.CLASS already exists. Because there are no ages > 25 in the SASHELP.CLASS data set, the WHERE clause in this DATA step will always be false, no observations will be written, and the data set ADVRPT.CLASS will be replaced with an empty data set.

The REPLACE and REPEMPTY data set options allow us to control the conditions under which the data set is replaced.

  • REPLACE

REPLACE=NO prevents the replacement of a permanent data set. This data set option overrides the system option of the same name. REPLACE=YES is the default.

  • REPEMPTY

Determines whether or not an empty data set can overwrite an existing data set.

For full protection these two options are usually used together. Normally we want to be able to replace permanent data sets, unless the new version is empty. In the following DATA step there are no observations where AGE is greater than 25, so zero observations will be returned. However, since REPEMPTY=NO, the data set ADVRPT.CLASS will not be replaced.

data advrpt.class(replace=yes repempty=no);
   set sashelp.class;
   where age > 25;
   run;

Traditionally the issue of overwriting a data set which has observations with an empty one has been especially problematic when the semicolon has been left off of the DATA statement. In the following DATA step, because of the missing semicolon Callout 1 the SET statement is masked and three empty data sets are created (ADVRPT.VERYIMPORTANT, WORK.SET, and SASHELP.CLASS).

options DATASTMTCHK=NONE; Callout 2
data advrpt.VeryImportant Callout 1
   set sashelp.class;
   run;

Callout 1 The missing semicolon causes SAS to see the SET statement as part of the DATA statement. The result is that there is no incoming data set; consequently, the created data sets will have no variables or observations.

Callout 2 The DATASTMTCHK system option protects us from this very problem by not allowing data sets to be created with names such as SET and MERGE. Setting DATASTMTCHK to NONE removes this protection.

options DATASTMTCHK=NONE; Callout 2
data advrpt.VeryImportant(replace=yes repempty=no) Callout 3
   set sashelp.class;
   run;

Callout 3 The REPEMPTY=NO option will protect our very important data set, but unfortunately not the SASHELP.CLASS data set.

Without a compelling reason to do so, it is my opinion that the value of DATASTMTCHK should not be set to NONE. If you must change the option, use the REPLACE and REPEMPTY data set options Callout 3 to provide some protection.

2.1.2 Password Protection

Data sets can be both encrypted and password protected. Password and encryption data set options include:

  • ALTER

Password to alter the data set

  • ENCRYPT

Encrypt the data set

  • PW

Specify the password

  • PWREQ

Password request window

  • READ

Password to read the data set

  • WRITE

Password to write to the data set

The following DATA step creates a data set that is not only encrypted, but requires different passwords for both reading and writing.

data advrpt.pword(encrypt=yes  pwreq=yes 
                  read=readpwd write=writepwd);
   DB='DEApp'; UID='MaryJ'; pwd='12z3'; output;
   DB='p127';  UID='Mary';  pwd='z123'; output;
   run; 
proc print data=advrpt.pword;
  run;

Before PROC PRINT can display the protected data set, the following dialogue box will appear requesting the READ password.

image shown here

While these password protections can be useful within SAS, the protected files are still vulnerable to deletion or manipulation using tools outside of SAS. Including the ENCRYPT option adds another layer of protection from tools other than SAS.
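When the password is known to the program, it can be supplied through the same data set options, and the request window is bypassed (a sketch using the passwords assigned above):

```sas
* Supply the READ password directly so the dialogue box is not needed;
proc print data=advrpt.pword(read=readpwd);
   run;
```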

MORE INFORMATION

More about password protection, including the use of a data set containing passwords such as the one created in this section, can be found in Section 5.4.2.

2.1.3 KEEP, DROP, and RENAME Options

When using KEEP, DROP, or RENAME in a DATA step, you can choose between data set options and DATA step statements. When multiple data sets are either created or read, DATA step statements apply to all of the created data sets, while data set options can be applied to specific data sets. As a general rule, when you have a choice, data set options are preferred over statements, as they give you more control and the code is generally clearer.

The KEEP statement is the functional opposite of the DROP statement, and the following discussion of the KEEP statement and the KEEP= data set option could just as easily have been applied to the DROP statement and the DROP= data set option. I am showing KEEP here because I have an admitted bias against DROP. The DROP statement and DROP= data set option work fine; both do exactly what they are supposed to do, and both can save the programmer typing when the list of variables to keep is long and the list of variables to drop is short. However, KEEP conveys more information by documenting which variables the programmer still needs to think about.

The following examples highlight the differences between the KEEP statement and the KEEP= data set option. The KEEP statement below Callout 1 is only applied to the new data set (WORK.LABS) and in no way affects the Program Data Vector or what variables will be read from the incoming data set (ADVRPT.LAB_CHEMISTRY).

data labs;
   set advrpt.lab_chemistry;
   keep subject visit labdt; Callout 1
   if sodium>'142'; Callout 2
   run;

Callout 1 The KEEP statement variable list is applied to the new outgoing data set.

Callout 2 The IF statement is executed after the entire observation is read and loaded into the PDV.

Using the KEEP statement Callout 1 is exactly the same as specifying the KEEP= option on the data set in the DATA statement Callout 3. The KEEP= option on the SET statement Callout 4, however, is applied before the PDV is built. Only those variables listed will be read from the incoming data set and included on the PDV.

data labs(keep=subject visit labdt); Callout 3
   set advrpt.lab_chemistry(keep=subject visit labdt sodium Callout 4
                            where=(sodium>'142')); Callout 5
   run;

Callout 3 The KEEP= data set option only impacts which variables will be written to the new data set.

Callout 4 On the SET statement the KEEP= data set option is applied to the incoming data set; therefore, this variable list affects which variables will appear on the PDV. Because SODIUM is used in the WHERE= clause Callout 5 in this example, the KEEP= data set option must also include SODIUM, even though it is not written to the new data set.

Callout 5 The WHERE= filter is specified as a data set option and is applied before observations are read.

The RENAME option allows you to change the name of a variable either as it is read or as it is written. Like the RENAME statement, the syntax is of the form oldname=newname. Placing the RENAME= option on the SET statement Callout 6 causes the name to be changed before it is written to the PDV.

data labs(keep=subject visit labdate); Callout 7
   set advrpt.lab_chemistry(rename=(labdt=labdate)); Callout 6
   if sodium>'142';
   run;

Callout 6 The original name (LABDT) is changed to the new name (LABDATE).

Callout 7 The new name would be used in any programming statements and it also appears on the KEEP= variable list.

When the RENAME= and KEEP= options are both used on the same data set, it is important to understand which is applied first. In the following DATA step the incoming data set has both a RENAME= and KEEP= option.

data labs(keep=subject visit labdate); Callout 7
   set advrpt.lab_chemistry(rename=(labdt=labdate) Callout 6
                            keep=subject visit labdt Callout 8 sodium
                            );
   if sodium>'142';
   run;

Callout 8 Since the KEEP= option is applied before the RENAME= option, the original variable name is used on the incoming KEEP= variable list.
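The same ordering holds on the DATA statement. Because KEEP= is applied before RENAME=, when both appear on the outgoing data set the KEEP= list must use the original variable name (a sketch consistent with the steps above):

```sas
data labs(keep=subject visit labdt
          rename=(labdt=labdate));
   set advrpt.lab_chemistry;
   * LABDT is renamed only as it is written out, so the
     outgoing KEEP= list must use the original name;
   if sodium>'142';
   run;
```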

MORE INFORMATION

When combined with PROC SORT, these data set options can have a huge impact on processing efficiency; see Section 4.2.

2.1.4 Observation Control Using FIRSTOBS and OBS Data Set Options

The data set options FIRSTOBS and OBS can be used separately or in conjunction with one another to limit which observations are read and/or written. Their operation is similar to the system options with corresponding names; however, as data set options their application can be more refined.

  • FIRSTOBS

specifies the number of the first observation to be read

  • OBS

specifies the last observation that is to be read (when FIRSTOBS is not also used, this corresponds to the number of observations that will be read).

In the following PROC PRINT step we have requested that only the first 6 observations be printed Callout 1.

proc print data=sashelp.class(obs=6); Callout 1
   run;

2.1.4a obs=6
Obs     Name      Sex    Age    Height    Weight
  1    Alfred      M      14     69.0      112.5
  2    Alice       F      13     56.5       84.0
  3    Barbara     F      13     65.3       98.0
  4    Carol       F      14     62.8      102.5
  5    Henry       M      14     63.5      102.5
  6    James       M      12     57.3       83.0

The FIRSTOBS= option Callout 2 identifies the first observation that is to be printed.

title1 '2.1.4b firstobs=4';
proc print data=sashelp.class(firstobs=4); Callout 2
   run;
2.1.4b firstobs=4
Obs    Name       Sex    Age    Height    Weight
  4    Carol       F      14     62.8      102.5
  5    Henry       M      14     63.5      102.5
  6    James       M      12     57.3       83.0
  7    Jane        F      12     59.8       84.5
  8    Janet       F      15     62.5      112.5
  9    Jeffrey     M      13     62.5       84.0
 10    John        M      12     59.0       99.5
 11    Joyce       F      11     51.3       50.5
 12    Judy        F      14     64.3       90.0
 13    Louise      F      12     56.3       77.0
 14    Mary        F      15     66.5      112.0
 15    Philip      M      16     72.0      150.0
 16    Robert      M      12     64.8      128.0
 17    Ronald      M      15     67.0      133.0
 18    Thomas      M      11     57.5       85.0
 19    William     M      15     66.5      112.0

When these two options are used together Callout 3 they work independently of each other. It is important to remember that the OBS= option counts from observation 1 regardless of the value of FIRSTOBS. This is demonstrated in the following example.

title1 '2.1.4c firstobs=4 obs=6';
proc print data=sashelp.class(firstobs=4 obs=6); Callout 3
   run;

Only the first 6 observations are available to be printed (OBS=6); however, the first to be printed is the fourth observation (FIRSTOBS=4). As a result only three observations are actually printed.

2.1.4c firstobs=4 obs=6
Obs    Name     Sex    Age    Height    Weight
  4    Carol     F      14     62.8      102.5
  5    Henry     M      14     63.5      102.5
  6    James     M      12     57.3       83.0

Adding a WHERE clause Callout 4 changes the counting process. The WHERE clause is applied first; consequently, the counts and the selection of the observations are based on the subset of observations.

title1 '2.1.4d firstobs=4 obs=6 where (sex=m)';
proc print data=sashelp.class(firstobs=4 obs=6
                    where=(sex='M')); Callout 4
   run;

The resulting LISTING shows that only the fourth, fifth, and sixth male students have been displayed.

2.1.4d firstobs=4 obs=6 where (sex=m)
Obs    Name       Sex    Age    Height    Weight
  9    Jeffrey     M      13     62.5       84.0
 10    John        M      12     59.0       99.5
 15    Philip      M      16     72.0      150.0

2.2 Evaluating Expressions

A SAS expression contains a combination of operators, constants, functions, and variables. Expressions are used in a number of ways, both within the DATA step and in the PROC step. When we encounter the term ‘expression’, we most commonly think of comparison expressions; however, expressions are actually much more general and can also appear in other statements, such as assignment statements. It is important to remember that, regardless of their use, the evaluation of an expression will follow a series of steps or rules. Understanding these rules can lead us to a more expansive use of expressions.

2.2.1 Operator Hierarchy

Operators are like verbs in the expression. They tell SAS what to do with things like constants and variables. In order to avoid confusion and ambiguity, operators are assigned a hierarchy or order in which they are applied. The hierarchy is formed by seven groups of operators, and within a group, operators of equal rank are applied from left to right (except Group 1 which is applied right to left).

At a simple level, we need to understand why the expression (5+6*2) is equal to 17 and not 22. But as we encounter expressions in non-standard form, such as some of those in Sections 2.2.2 and 2.2.3, we need to have a solid understanding of this hierarchy, if we are to understand why the expressions evaluate the way that they do.

Group                        Operators

Parentheses                  Operations within parentheses are performed first

Group 1                      Exponentiation (**)
(performed right to left)    Prefix operators, such as positive (+), negative (-),
                             and negation (^)
                             Minimum (MIN, ><) and maximum (MAX, <>)

Group 2                      Multiplication (*) and division (/)

Group 3                      Addition (+) and subtraction (-)

Group 4                      Concatenation (||)

Group 5                      Comparisons such as equal (=) and less than (<)

Group 6                      AND - Boolean comparison (&)

Group 7                      OR - Boolean comparison (|)

season = 1*(1 le month(dob) le 3)
       + 2*(4 le month(dob) le 6)
       + 3*(7 le month(dob) le 9)
       + 4*(10 le month(dob) le 12);

Since any of these operators can appear in any expression, whether in an assignment statement or an IF statement, we need to expand our perception of what an expression should contain. The assignment statement shown here creates the variable SEASON which can contain one of the numeric values 0 thru 4 depending on the month of the date of birth. The ‘less-than-or-equal-to’ comparison operators (Group 5) return a zero or one which is multiplied against the constants. The comparison operators are just another form of expression operators and are perfectly suited to assignment statements as well as to logical expressions.

season= ceil(month(dob)/3);

Although it is important to understand the logic of the previous assignment statement, it could have been written more simply using the CEIL function. However, the two statements are not equivalent. When DOB is missing, the first assignment statement returns a zero, while this one returns a missing value.
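When the zero-for-missing behavior of the expression form is needed, the CEIL version can be guarded with the MISSING function (a sketch, not from the original text; the IFN function is discussed in Section 3.6.6):

```sas
* Returns 0 when DOB is missing, otherwise the season number (1-4);
season = ifn(missing(dob), 0, ceil(month(dob)/3));
```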

MORE INFORMATION

There are some additional comparison operators that are unique to the WHERE statement (see Section 2.7.1). The MIN and MAX operators are further discussed in Section 2.2.5.

2.2.2 Using the Colon as a Comparison Modifier

The colon ( : ) can be used as an operator modifier when character values are being compared. The colon permits the comparison of two strings of unequal length, and it follows the comparison operator of choice. In the example below, the subsetting IF statement will select all observations which have a LNAME starting with ‘Mar’. Since it does not matter whether the value with the smaller length is on the left or right side of the equal sign, a last name of ‘Ma’ would also be selected. The lengths of the values on both sides are determined, and the smaller length is applied to both sides. An IF statement with the two values reversed would produce the same result.

data Mar;
   set advrpt.demog (keep=lname fname);
   if lname =: 'Mar';
   run;
if 'Mar' =: lname;
where lname in:('Me', 'Mar', 'Adams');

The colon comparison operator modifier can also be used with the IN operator. In this example the WHERE statement will select a variety of last names that start with the indicated strings. Notice that the target strings are not all of the same length. The number of compared characters in LNAME will be appropriate for each of the individual values.

where 'Adamso'=: trim(lname);

Trailing blanks are counted and are used to determine matches. This WHERE statement will return both ‘Adams’ and ‘Adamson’; however, if the TRIM function had not been used, only ‘Adamson’ would have been found.

Similar functionality can be achieved in an SQL step; however, the syntax can vary. The colon operator can be used in a WHERE= data set option whenever you are importing a SAS data set (see Section 2.1).

proc sql;
   title2 'Used in SQL data set WHERE=';
   select lname, fname, dob
      from advrpt.demog(where=(lname=:'Adams')); Callout 1
   title2 'Used in SQL WHERE Clause';
   select lname, fname, dob
      from advrpt.demog
         where lname=:'Adams'; Callout 2
proc sql;
   title2 'Using the EQT operator';
   select lname, fname, dob
      from advrpt.demog
         where lname eqt 'Adams'; Callout 3
   quit;

Callout 1 The =: can be used within the WHERE= data set option even when it is used within the SQL step.

Callout 2 Although it worked in the WHERE= data set option, the =: will not work in an SQL WHERE clause. This SELECT statement will fail.

Callout 3 In SQL the EQT operator is similar to the =:, and because it is an SQL operator it can be used in the SQL WHERE clause. In addition to the EQT operator, SQL also supports the LET (<=:) and GET (>=:) operators.

SEE ALSO

The colon comparison modifier is not available in the macro language; however, macros have been written to provide similar functionality. See Carpenter (2004, Section 7.6.3 pg. 196).

2.2.3 Logical and Comparison Operators in Assignment Statements

season = 1*(1 le month(dob) le 3)
       + 2*(4 le month(dob) le 6)
       + 3*(7 le month(dob) le 9)
       + 4*(10 le month(dob) le 12);

In Section 2.2.1 the following assignment statement is briefly introduced. This is an example of a value look-up where the value of month determines the value of SEASON (Chapter 6 goes into detail on a variety of table look-up coding strategies). In this case the process could have been simplified by the use of a format. More importantly, however, it demonstrates the use of a comparison operator (LE) in an assignment statement.
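The format-based simplification mentioned above might be sketched as follows (the format name SEASONF and the output data set name are illustrative):

```sas
proc format;
   value seasonf  1-3  = '1'   4-6  = '2'
                  7-9  = '3'  10-12 = '4';
   run;
data seasons;
   set advrpt.demog(keep=lname fname dob);
   * PUT applies the format to the month; INPUT converts
     the formatted value back into a number;
   season = input(put(month(dob), seasonf.), 1.);
   run;
```

Note that, unlike the expression form, this version returns a missing value rather than 0 when DOB is missing.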

In fact, as was mentioned in Section 2.2.1, there is no reason why any of the logical and comparison operators cannot appear in an assignment statement. The key to their use is to remember that logical expressions will yield either TRUE or FALSE, which is represented by 1 or 0 respectively. For a date of birth in May the previous equation is evaluated as shown below. The expression results in a value of 2 for SEASON. When you are generating a numeric value based on a logical determination such as this one, you should be able to write the assignment statement in a form similar to the one above. However, it is not unusual to see this type of assignment made through a series of less efficient IF-THEN/ELSE statements.

season = 1*(0)
       + 2*(1)
       + 3*(0)
       + 4*(0);

Although the previous example could also have been handled with a user-defined format and a PUT function (see Section 6.5), the assignment of a value to GROUP using a series of IF-THEN/ELSE statements, such as the one shown below, does not so easily lend itself to a solution involving a format. The value can, however, be determined with an assignment statement containing the same logic as was used in these IF-THEN/ELSE statements. Since assignment statements tend to be processed faster than IF-THEN/ELSE statements, their use can decrease processing time. This type of assignment statement will also generally outperform a PUT function.

if sex = 'M' and year(dob) >  1949 then group=1;
else if sex = 'M' and year(dob) le  1949 then group=2; 
else if sex = 'F' and year(dob) >  1949 then group=3;
else if sex = 'F' and year(dob) le  1949 then group=4;
group = 1*(sex = 'M' and year(dob) >  1949)
      + 2*(sex = 'M' and year(dob) le  1949) 
      + 3*(sex = 'F' and year(dob) >  1949) 
      + 4*(sex = 'F' and year(dob) le  1949);

Since True/False determinations always result in either a 0 or a 1, this same approach can be especially useful when assigning a numeric 0/1 value to a variable. In the following DATA step we would like to create a flag that indicates whether or not the year of birth is after 1949. Three equivalent flags have been created to demonstrate three different methods.

data flags;
   set advrpt.demog (keep=lname fname dob sex);
   if year(dob) >  1949 then boomer=1; Callout 1
   else boomer=0; 
   boomer2 = year(dob) >  1949; Callout 2
   boomer3 = ifn(year(dob) >  1949, 1, 0); Callout 3
   run;

Callout 1 Very often IF-THEN/ELSE statements are used. These statements tend to process slower than assignment statements.

Callout 2 The logical expression appears on the right of the equal sign.

Callout 3 The IFN function can be used to assign the result. This function has added value when a result other than just 0 or 1 is to be returned.

A similar coding structure is used when you need to create a flag that depends on whether or not an item is in a list of values. Here we need to determine if the number 3 is contained in one or more of the variables X1 through X4.

data _null_;
input x1-x4;
array a {*} x1-x4;
flag1 = (3 in a); Callout 4
flag2 = ^^whichn(3,of x:); Callout 5
put flag1= flag2=;
datalines;
1 2 3 4
5 6 7 8
run;

Callout 4 FLAG1 is created by determining if the value 3 is in the array A. If it is, a 1 is stored.

Callout 5 The WHICHN function returns an item number if the value in the first argument is found (otherwise a 0 is returned). This value is then converted to a 0 or a 1 by use of a double negation (see Section 2.2.6 for more on the use of the double negation).

MORE INFORMATION

The assignment statements discussed in this section are simple table lookups. Chapter 6 discusses a variety of different table lookup techniques in more detail. The IFN and IFC functions are discussed more in Section 3.6.6.

SEE ALSO

The sasCommunity.org tip http://www.sascommunity.org/wiki/Tips:Creating_a_flag_avoiding_the_If_..._Then_Structure discusses the use of this type of expression in an assignment statement. The discussion tab includes alternative forms that can be used in an SQL step. The example of flags used to indicate presence of a value in a list was suggested by Chang Chung and Mike Rhoads. Their examples can be found at:
http://www.sascommunity.org/wiki/Tips:Double_negatives_to_normalize_a_boolean_value and http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1101c&L=sas-l&D=1&O=D&P=9693, respectively.

2.2.4 Compound Inequalities

if 13 le edu le 16;
if (13 le edu) and (edu le 16);

When a compound inequality is used in an expression it is important to understand how the expression is evaluated. An expression with a compound inequality very often contains a variable between two values, which form the upper and lower limits of a range. This compound expression is effectively interpreted as two distinct inequalities which are joined with an AND. The value of EDU must meet both conditions for the overall expression to evaluate as true.

where (13 le edu) le 16 ;

Misplacing the parentheses totally changes the way that the expression is evaluated. Notice the placement of the parentheses in this WHERE statement. The inequality inside the parentheses is evaluated to True or False (0 or 1), and the result compared to 16. This expression will be true for all values of EDU (both 0 and 1 are less than 16).

You may be thinking ‘OK, so what, I will be careful with parentheses. Why should I care?’ Of course I think that the way that SAS evaluates these expressions is both interesting and important, but there are also practical benefits.

In the macro language compound inequalities are not evaluated the same way as we might think based on our knowledge of DATA step expressions. The resolved macro variables in the expression in this %IF-%THEN statement show that the expression should be evaluated as false; however, it evaluates as TRUE and the %PUT executes.

%let x = 5;
%let y = 4;
%let z = 3;
%if &x lt &y lt &z %then 
       %put &x < &y < &z;
%if 5 lt 4 lt 3 %then %put 5 < 4 < 3;
%if (&x lt &y) lt &z %then %put &x < &y < &z;

This happens because the compound expression in the macro language is not broken up into two distinct inequalities like it would be in the DATA step. Instead it is evaluated as if there were parentheses around the first portion of the comparison. The expression is evaluated left to right, and (&x lt &y) will be either TRUE or FALSE (1 or 0). Either way as long as &Z is > 1, the overall expression will be TRUE.
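To obtain the DATA step interpretation in the macro language, state the two inequalities explicitly and join them with AND:

```sas
%let x = 5;
%let y = 4;
%let z = 3;
* Both comparisons must now be true; since 5 lt 4 is false,
  the %PUT does not execute;
%if &x lt &y and &y lt &z %then %put &x < &y < &z;
```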

In actual applications compound inequalities such as the ones shown above are quite common. It is much less likely that you will use other forms of compound expressions; however, as was shown in the example with the %IF statement, it is important to understand how the expression is evaluated.

2.2.5 The MIN and MAX Operators

CAVEAT

It is my opinion that these two operators, and their mnemonics (see Section 2.2.1), should never be used. They are included in this section primarily so that this warning can be given. When you want to return either the minimum or maximum, always use the MIN or MAX function. These functions have none of the problems associated with the two operators.

The MIN and MAX operators return the minimum or maximum of two values respectively.

where x = (y min z);
where x = min(y, z);

In this WHERE statement the smaller value of the variables Y and Z will be compared to the value of the variable X. The MIN and MAX operators allow the comparison of exactly two variables, while the preferred MIN and MAX functions can work with any number of variables.

These two operators do not behave the same in WHERE statement expressions as they do in an IF statement expression, and the differences are important. Actually they are only important if you intend to ignore the caveat and go ahead and use these operators. Assuming that you like living on the edge, read on.

Potential Problem #1 - Mnemonics

In the IF statement these two operators can be replaced with their mnemonics; however, these mnemonics do not work in the same way in the WHERE statement.



Operator   Mnemonic   Behavior of the Mnemonic in the WHERE Statement

MIN        ><         The mnemonic is not supported in the WHERE statement,
                      although MIN itself is converted to >< in the LOG.
                        where x=(y><z);    generates a WHERE error
                        if x=(y><z);       works as expected

MAX        <>         The mnemonic is interpreted as not equal in the WHERE
                      statement.
                        where x=(y<>z);    is true when X=1 and Y is not equal
                                           to Z, or when X=0 and Y=Z

Potential Problem #2 – Negative Values

When the negative sign is used with the first value associated with these operators, the expressions are not interpreted the same in the WHERE and IF statements.

Expression                  Result

if -2 = (-5 min 2);         True – the minus sign is applied after the MIN operator,
                            essentially the same as:  if -2 = -(5 min 2);

where -2 = (-5 min 2);      False – the minus sign is applied before the MIN operator;
                            the comparison is the same as:  where -2 = -5;

2.2.6 Numeric Expressions and Boolean Transformations

It is sometimes necessary to transform numeric values to Boolean (0 or 1) values. There is no one function that does this, and indeed the exact transformation may be too situational for a specific function.

True / False

x = ^^dob;

Partly because of the way that SAS handles TRUE/FALSE, i.e., false is 0 or missing and all else is true, the missing values must also map to 0. The double negation (NOT) is used to perform the transformation. Since negation is a Boolean operator, it converts the original value to either a zero or a one.
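The double negation can be verified with a small sketch (the data set name and the values used here are hypothetical); only the missing value and 0 map to 0:

```sas
data boolflag;
   do dob = ., 0, -12, 16436;   /* missing, day zero, and two nonzero dates */
      flag = ^^dob;             /* 0 for missing and for 0; 1 otherwise     */
      output;
   end;
   run;
```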

MORE INFORMATION

Double negation is used to convert a number to a binary 0/1 operator in an example in Section 12.6.2.

Replace Missing with 0

For reporting purposes missing values can be replaced by a 0 using a simple assignment statement. The COALESCE function returns the first non-missing value. In this example if DOB is missing a 0 is returned. Prior to the inclusion of the COALESCE function, this same operation was sometimes accomplished using the SUM function. Be careful when working with dates as was done here. Remember that, although both are false, a date of missing and a date of 0 have different meanings.

z = coalesce(dob,0);
y = sum(dob,0);

These two expressions do not result in a Boolean value. If you want to convert all missing values to 0 and all other values to 1 (including 0) you can use the negation of the MISSING function. In this expression MVAL will be 1 for all numbers except missing.

mval=^missing(val);
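These forms can be compared side by side in a single DATA step (a sketch; the data set and variable names are hypothetical):

```sas
data misscompare;
   do dob = ., 0, 16436;
      z    = coalesce(dob,0);  /* missing becomes 0; other values are unchanged    */
      y    = sum(dob,0);       /* SUM ignores missing, so missing also becomes 0   */
      mval = ^missing(dob);    /* Boolean: 1 for any nonmissing value, including 0 */
      output;
   end;
   run;
```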

Determine Positive or Negative Values

data posneg;
   do v=.,-2 to 2;
      * Positive;
      pos = sign(v)=1;
      * Not positive;
      notpos = (sign(v) in(-1,0));
      * Negative;
      neg = (sign(v)=-1);
      * Not negative;
      notneg = sign(v) in (0,1);
      output posneg;
   end;
   run;

Because of missing values and the 0 value, we have four distinct possibilities when separating positive and negative values. The groups of positive and non-negative values are not necessarily the same. Fortunately we can build the Boolean flag for each of these four possibilities with the use of the SIGN function, which returns -1 for values < 0, 0 for values=0, 1 for values > 0, and missing for missing values.

2.2.6  Boolean Conversions
Positive or Negative?
Obs  v  pos  notpos  neg  notneg
 1   .   0      0     0      0
 2  -2   0      1     1      0
 3  -1   0      1     1      0
 4   0   0      1     0      1
 5   1   1      0     0      1
 6   2   1      0     0      1

SEE ALSO

Several of the code examples in this section have been suggested by Howard Schreier in the sasCommunity.org article: http://www.sascommunity.org/wiki/Numeric_transformations.

2.3 Data Validation and Exception Reporting

Although we sometimes have the opportunity to work with data that has already been scrubbed clean, often a major portion of our work is the preparation and cleaning of the data. This is especially true for data that has been hand entered or for data that comes from a source without your high standards of excellence. In the biotech/pharmaceutical industries a great deal of time and careful attention to detail is spent on the validation of the data.

For large or complex data sets the manual/visual process of finding data errors is just not practical. We need some tools that will allow us to automate the validation process.

When reporting on errors in your data, you will need to be able to communicate the specific problem and precisely where it occurs. Generally this means that you will have to identify the specific row and column (observation and variable). To identify a particular row you need to know the values of all the variables that form the data set’s primary key (the BY variables that identify the data down to the row level).

SEE ALSO

Wright (2006), and the related papers by Nelson (2004a, 2004b), discusses an approach to the validation of SAS programs. Bahler (2001) provides a summary of a variety of data cleaning techniques. Ron Cody (2008b) has written an entire book on data cleaning techniques.

2.3.1 Date Validation

Since we will work with dates, we need to know how to detect and work with incomplete or inappropriate date values. There are several issues that we need to understand.

Using Formats to Check Date Strings

Formats, whether they are user defined or provided by SAS, can be powerful tools when searching the data for values that are inappropriate, incomplete, missing, or out of range.

Dates can be particularly problematic for some types of data, such as survey response data. Often the dates are collected as character strings and then checked before conversion to a SAS date. Collecting them as character strings is important so that we can record partial dates. In the second observation created below, the date might be from the response: “My second visit was in June of 2004. I do not remember the exact day.”

data visitdates;
   visit=1; v_date_ = '05/25/2004'; output;
   visit=2; v_date_ = '06/XX/2004'; output;
   run;
proc print data=visitdates;
   where input(v_date_,mmddyy10.) = .; 
   run;
2.3.1 Date Validation Using Formats
Date Errors
Obs    visit     v_date_
 2       2      06/XX/2004

The INPUT function in the WHERE statement attempts to convert the character string into a date. Invalid values will result in a missing value, which is selected by the WHERE.

When the INPUT function is used with the informat MMDDYY10. for an incorrectly structured value, as it is in the second observation, an error is issued in the LOG. This error message can be suppressed by using either the ?? informat modifier (see Section 1.3.1) or the NOFMTERR system option.

9    proc print data=visitdates;
10      where input(v_date_,mmddyy10.) eq .;
11      run;
ERROR: INPUT function reported 'ERROR: Invalid date value' while processing WHERE clause.

The ?? informat modifier suppresses informat error messages in the LOG. Because it is applied to a specific informat within the INPUT function, it is more targeted than the NOFMTERR system option. The subsetting IF statement in this step replaces the WHERE statement of the previous example, because the ?? informat modifier cannot be used in a WHERE statement.

data dateerrors;
   set visitdates;
   if input(v_date_,?? mmddyy10.) eq .;
   run; 
proc print data=dateerrors;
   run;

Because we can suppress the error messages, we have more latitude in the use of formats to rebuild dates with missing components. As before, let’s assume that our incoming dates are received as character strings with either the day or month potentially coded as ‘XX’. This example shows the subject’s response to the question: “When did you stop smoking?” Very typically the patient is unable to remember the exact date, let alone the day of the month. In this example we have decided to replace a missing day with the 15th and, when the month is missing, to use the year’s midpoint, July 1. This code assumes that at least the year is present.


data quit_dates;
   subject=201; q_date_ = '05/25/1975'; output;
   subject=205; q_date_ = '10/XX/2001'; output;
   subject=208; q_date_ = 'XX/XX/1966'; output;
   run;
data Qdates(keep=subject q_date_ q_date);
   set quit_dates;
   format q_date date9.;
   q_date = input(q_date_,??mmddyy10.); Callout 1
   if missing(q_date) then do; Callout 2
      * Substitute missing day of month with 15;
      if substr(q_date_,4,2)='XX' then substr(q_date_,4,2)='15'; Callout 3
      q_date = input(q_date_,??mmddyy10.);
   end;
   if missing(q_date) then do; Callout 4
      * Substitute missing month with 07;
      if substr(q_date_,1,2)='XX' then do; Callout 5
         substr(q_date_,1,2)='07';
         * reset day of month also;
         substr(q_date_,4,2)='01';
      end;
      q_date = input(q_date_,??mmddyy10.); Callout 6
   end;
   run;

Callout 1 The first attempt is made to build the SAS date (Q_DATE) using the INPUT function.

Callout 2 If the SAS date is missing, one or more elements are probably coded as XX. First we check the day of the month.

Callout 3 When the day of the month is coded as XX, the value of 15 is substituted. Notice the use of the SUBSTR function on the left side of the = sign to perform the substitution (see Section 3.6.6 for more on the use of the SUBSTR function to substitute characters into the string).

Callout 4 The date is again checked now for a potentially missing month value.

Callout 5 A month coded as XX is detected, and the month and day are substituted.

Callout 6 The corrected date string is converted to a SAS date.

Checking for Missing Date Values

In the ADVRPT.DEMOG data set the variable DEATH records the date of death for those patients who died while under the care of a clinic (remember this is made-up data and no patients were harmed in the writing of this book). We would like to filter the patients for those with a date of death. We need to remember that a SAS date contains the number of elapsed days since January 1, 1960 (day 0). Dates before this date are negative numbers.

The WHERE statement can be used, and a couple of alternative forms of the expression suggest themselves. We need to keep in mind that we are filtering for valid dates. One possible expression filters for values that are greater than zero; however, this clearly excludes any dates before January 2, 1960, and is NOT sufficient.

where death>0;

Since we want to exclude missing dates, and missing values are false, another possible expression would be to simply check to see if DEATH is true. This nearly works, but it has a subtle flaw: it will also exclude the valid date of death January 1, 1960, which is day zero and is therefore interpreted as false.

where death;

What we really need to do is eliminate only the missing values, and we should focus on this. Two expressions can be used. We can explicitly exclude the missing value and allow all others, or we can accept any value that is larger than missing (all numbers, positive or negative, are larger than missing).

where death ne .;
where death > .;

Sometimes you will be working with data that utilizes more than one type of missing value. Since numeric variables can take on up to 28 types of missing values (., .a through .z, and ._), you may need to account for these when testing for missing. Of the 28 types of numeric missing values the smallest is ._, the next smallest is ., and the largest is .z. Therefore, to effectively test for all of these missing value types, we need to code for the largest (.z).

where death > .z;

All missing values would also be detected by using the MISSING function.

where ^missing(death);
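The sort order of the missing value types can be confirmed with a short sketch:

```sas
data _null_;
   x = .a;                                     /* a special missing value           */
   if x > .  then put 'X is greater than .';   /* written: .a sorts above .         */
   if x > .z then put 'X is greater than .z';  /* not written: .z is the largest    */
   if missing(x) then put 'MISSING() detects every missing value type';
   run;
```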

MORE INFORMATION

The WHERE= data set option is used in examples in Sections 2.1.3, 2.1.4d, and 2.2.2.

SEE ALSO

Hunt (2010) shows alternate methodologies for the completion of dates. Alternate solutions for the completion of partial dates can be found on the SAS Forums, see the following thread: http://communities.sas.com/message/40648#40648.

2.3.2 Writing to an Error Data Set

When attempting to identify the observations that contain data errors, it is often helpful to build a secondary data set containing just those observations with errors. Error reporting through a data set is as simple as adding conditional logic and another OUTPUT statement.

data medstartdates(keep=subject mednumber drug 
                        medstdt_ medstartdate)
     medstarterr(keep=subject mednumber drug medstdt_); Callout 1
   set advrpt.conmed(keep=subject  mednumber drug medstdt_);
   medstartdate = input(medstdt_,?? mmddyy10.); Callout 2
   if medstartdate = . then output medstarterr; Callout 3
   output medstartdates; Callout 4
   run;

Callout 1 The variable list in the error data set should contain all the variables that form the primary key, as well as those being tested.

Callout 2 In this example we are converting the text date to a SAS date. The use of the ?? format modifier to suppress error messages in the LOG is discussed in Section 2.3.1.

Callout 3 Incomplete dates, those that cannot be converted to valid SAS dates, result in a missing value.

Callout 4 In this example the data set MEDSTARTDATES will contain all observations, including those with errors. To remove observations with errors from this data set, simply start the OUTPUT statement with an ELSE statement. The statement becomes: else output medstartdates;

The resulting data table (WORK.MEDSTARTERR) will contain one observation for each MEDSTDT_ value that cannot be converted to a valid SAS date.

2.3.2a Collecting Date Errors
Obs    SUBJECT    mednumber    drug                            MEDSTDT_
 51      206          2        LUPRON                         XX/XX/1999
 52      206          3        CALCIUM CITRATE                --/--/----
 53      206          4        MVI                            --/--/----
 54      207          1        METOPROLOL                     --/--/----
 55      207          1        CLONIDINE                      --/--/----
 56      207          1        LUPRON                         XX/XX/1996
 57      207          2        DYAZIDE                        --/--/----
 58      207          2        COUMADIN                       XX/XX/1996
. . . . portions of the table are not shown . . . .

				
When testing more than one variable or different variables in different data sets, this form for the error reporting data set is not flexible enough. This is often the case when we build an automated system for validating our data. In an automated system, the exception data set needs to contain sufficient information for the data manager to get back to the value in question. A minimum amount of information would include:

  • DSN - data set name
  • List of Variables - key variables and their values
  • ERRVAR - variable containing the exception
  • ERRVAL - flagged value
  • ERRTEXT - reason that the value was flagged (exception criteria)
  • ERRRATING - exception rating (may be needed to identify critical variables)

This form of the error data set has one observation per error. It allows the reporting of more than one variable per observation and even errors from multiple data sets. In order to maximize flexibility, consider making the flagged value a character variable. This will require the conversion of numeric values to character, but adds flexibility. Also, since the list of key variables will likely change between data sets, many of the various key variables will be missing for observations that come from data sets for which they are not part of the key. As long as you are careful when designing your original data sets so that variables that appear in multiple data sets always have the same attributes, this will not be a problem.


title1 '2.3.2b Hardcoded Exception Reporting';
data errrpt(keep=dsn errvar errval errtxt errrating 
                 subject visit labdt); Callout 1
   length dsn        $25
          errvar     $15
          errval     $25
          errtxt     $20
          errrating  8;
   set advrpt.lab_chemistry; Callout 2
   if potassium lt 3.1 then do; Callout 3
      dsn = 'advrpt.lab_chemistry'; Callout 4
      errvar = 'potassium'; Callout 5
      errval = potassium; Callout 6
      errtxt = 'Low value(<3.1)'; Callout 7
      errrating= 1; Callout 8
      output errrpt; Callout 9
      end;
   if potassium gt 6.7 then do; Callout 10
      dsn = 'advrpt.lab_chemistry';
      errvar = 'potassium';
      errval = potassium;
      errtxt = 'High value(>6.7)';
      errrating=2;
      output errrpt;
      end;
   run;

Callout 1 The list of variables in the error reporting data table includes the variables which form the primary key for this data table (SUBJECT VISIT LABDT).

Callout 2 Read the data set to be checked.

Callout 3 Enter the DO block when this specific data exception has been detected.

Callout 4 Save the name of the data table being checked.

Callout 5 Save the name of the variable being tested.

Callout 6 Capture the data value for further inspection. When character values have been converted for the check, save the raw or character value. This allows you to see any text that might contribute to character to numeric conversion issues.

Callout 7 The error text should describe the problem detected in the data.

Callout 8 If required, assign an error severity code.

Callout 9 Write this observation out to the error report data set (ERRRPT).

Callout 10 The second exception criterion is tested in a DO block that is essentially the same as the first DO block Callout 3.

The resulting error report data set contains all the information necessary to locate, evaluate, and start the correction process.

2.3.2b Hardcoded Exception Reporting
Obs         dsn           errvar   errval     errtxt      errrating SUBJECT VISIT      LABDT
 1  advrpt.lab_chemistry potassium  6.8   High value(>6.7)    2       203      4  09/29/2006
 2  advrpt.lab_chemistry potassium    3   Low value(<3.1)     1       208     10  03/09/2007

From a coding standpoint, the beauty of this approach is that you can have as many checks as are needed, and each one is simply implemented by the addition of another DO block. The disadvantage becomes apparent for large studies and for complex data tables. While not terribly complex, the program(s) can become large. More importantly, since each individual check is implemented in the program, new or changed criteria require that the program itself be revalidated. These problems can be addressed by storing the test criteria outside of the program. One very convenient way to do this is to store the exception criteria in a data table (see Section 2.3.3).

MORE INFORMATION

Non-numeric values are detected and written to an error data set in Section 3.6.1.

2.3.3 Controlling Exception Reporting with Macros

In the DATA step shown in Example 2.3.2b, a DO block is constructed for each data check. Since each one of these DO blocks is very similar, they are prime candidates for being written for us by the macro language. The following macro, %ERRRPT, will build this DO block.

%macro errrpt(dsn=,errvar=,errval=,errtxt=,errrating=);
      dsn = "&dsn";
      errvar = "&errvar";
      errval = &errval;
      errtxt = "&errtxt"; 
      errrating= &errrating;
      output errrpt;
      end;
%mend errrpt;

When the %ERRRPT macro is called from within the DATA step, the IF THEN/DO statements become:

   if potassium lt 3.1 then do;
      %errrpt(dsn = advrpt.lab_chemistry,
              errvar = potassium,
              errval = potassium,
              errtxt = %str(Low value%(<3.1%)),
              errrating= 1)
   if potassium gt 6.7 then do;
      %errrpt(dsn = advrpt.lab_chemistry,
              errvar = potassium,
              errval = potassium,
              errtxt = %str(High value%(>6.7%)),
              errrating= 2)
   run;

Although we have simplified the code somewhat, there has still not been a huge savings in our coding effort. Of course we could have the macro language do even more of the lifting for us. If we could tell the macro language what and how many checks were needed, then each of the individual macro calls, including the IF THEN/DO, could be generated by a single macro call.

An easy way to store and pass along the needed information is through the use of a SAS data set. We can create a data set that contains the constraints for each data exception check. Each observation can then be used to build data exception and error trapping reports. For the previous example the data set might contain the following.

2.3.3b Data Set with Exception Criteria
Obs    errtst     errvar       errval           errtxt         errrating
 1     lt 3.1    potassium    potassium    Low value(<3.1)         1
 2     gt 6.7    potassium    potassium    High value(>6.7)        2
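This control data set, named VALLAB in the macro that follows, could be created with a simple DATA step such as this sketch (the variable lengths shown are assumptions):

```sas
data vallab;
   length errtst $10 errvar $15 errval $25 errtxt $20 errrating 8;
   errtst='lt 3.1'; errvar='potassium'; errval='potassium';
   errtxt='Low value(<3.1)';  errrating=1; output;
   errtst='gt 6.7'; errvar='potassium'; errval='potassium';
   errtxt='High value(>6.7)'; errrating=2; output;
   run;
```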

This data set is then used by a macro to build a series of macro variable lists, which are then processed so that each observation in the data set becomes a test (a DO block). The %ERRRPT macro shown below builds the lists with an SQL step and then uses them to create a series of DO blocks, using the macro variables in the lists to ‘fill in the blanks’.

%macro errrpt(dsn=, bylst=subject); Callout 1
%local i chkcnt;
proc sql noprint;
   select errtst, errvar, errval, errtxt, errrating
      into :errtst1-:errtst99, Callout 2
           :errvar1-:errvar99,
           :errval1-:errval99,
           :errtxt1-:errtxt99,
           :errrating1-:errrating99
         from vallab; Callout 3
   %let chkcnt = &sqlobs; Callout 4
   quit;
data errrpt(keep=dsn errvar errval errtxt errrating 
                 &bylst); Callout 5
   length dsn        $25
          errvar     $15
          errval     $25
          errtxt     $20
          errrating  8;
set &dsn Callout 6 ;
%do i = 1 %to &chkcnt; Callout 7
   %* Write as many error checks as are needed;
   if &&errvar&i &&errtst&i Callout 8 then do;
      dsn = "&dsn";
      errvar = "&&errvar&i"; Callout 9
      errval = &&errval&i;
      errtxt = "&&errtxt&i";
      errrating= &&errrating&i;
      output errrpt;
      end;
   %end;
   run;
%mend errrpt;
%errrpt(dsn=advrpt.lab_chemistry, 
        bylst=subject visit labdt) Callout 10

Callout 1 The macro %ERRRPT is used to control the error reporting process. Macro %DO loops must appear inside of a macro definition.

Callout 2 The values that are read from the control data set Callout 3 are stored in a series of macro variables. These take the form of &ERRVAL1, &ERRVAL2, etc. This code would allow up to, but no more than, 99 tests. There is no penalty for making this number too big (I like to overshoot by at least an order of magnitude); however, there is no hint in the LOG or elsewhere if the number is too small (values beyond it simply will not be saved).

Callout 4 The number of observations read from the control data set is saved in the macro variable &CHKCNT. This will be the total number of checks for this data set.

Callout 5 The variables that form the primary key are included in the KEEP= list.

Callout 6 The data set to be checked is specified on the SET statement.

Callout 7 The macro %DO loop iterates the number of times contained in the macro variable &CHKCNT, which is once for each observation in the control data set. Each iteration generates a DO block, one for each test to be performed.

Callout 8 For the second %DO loop iteration the IF statement becomes:

       if potassium gt 6.7 then do;

Callout 9 The individual macro variables are addressed using the &&VAR&I macro variable form. For the second pass of the macro %DO loop (&i=2), the macro variable reference &&ERRVAR&I resolves to &ERRVAR2, which in turn resolves to potassium.

Callout 10 The macro call contains the name of the data set to be checked and its list of BY variables.

MORE INFORMATION

The use of data sets to drive macros is discussed further in Section 13.5.

2.4 Normalizing - Transposing the Data

Most, but not all, SAS procedures prefer to operate against normalized data, which tends to be tall and narrow, and often contains classification variables that are used to identify individual rows. In the following presentation of the data, there is one value of SODIUM per observation and the classification variables are SUBJECT and VISIT.

2.4 Normalizing Data
Normal Form
Obs    SUBJECT    VISIT    sodium
  1      208         1      13.7
  2      208         2      14.1
  3      208         4      14.1
  4      208         5      14.1
  5      208         6      13.9
  6      208         7      13.9
  7      208         8      14.0
  8      208         9      14.0
  9      208        10      14.0
 10      209         1      14.0
      . . . . portions of the table are not shown . . . .

The same data could also be presented in a non-normal form, which tends to have one column for each level of one of the classification variables. In the following example, there is one observation per patient, with a column for each visit’s sodium value.

2.4 Normalizing Data
Non-normal Form
    S                                                 V    V    V     V     V    V    V
    U     V     V     V     V     V     V    V   V    i    i    i     i     i    i    i
    B     i     i     i     i     i     i    i   i    s    s    s     s     s    s    s
    J     s     s     s     s     s     s    s   s    i    i    i     i     i    i    i
O   E     i     i     i     i     i     i    i   i    t    t    t     t     t    t    t
b   C     t     t     t     t     t     t    t   t    1    1    1     1     1    1    1
s   T     1     2     4     5     6     7    8   9    0    1    2     3     4    5    6
1  208  13.7  14.1  14.1  14.1  13.9  13.9  14  14  14.0   .    .     .     .    .    .
2  209  14.0  14.0  13.9  14.2  14.5  13.8  14   .  13.8  14  14.1  14.2  14.1  14  14.1

Since we often do not have control over the form of the data when we receive it, we need to be able to both convert the data from the normal to non-normal form and from non-normal to normal form. This process is known as transposing the data and the operations are commonly performed either by PROC TRANSPOSE or in the DATA step.

PROC TRANSPOSE is an efficient, powerful procedure for performing a transpose operation. The DATA step can be more flexible; however, PROC TRANSPOSE has the advantage of not requiring knowledge of how many transformed variables there will be prior to the transformation.

SEE ALSO

Toby Dunn (2010) discusses the differences between the normal and non-normal data forms and suggests programming motivations for using one form over the other. When summarizing at the same time as transposing, the MEANS and SUMMARY procedures can be very useful. King and Zdeb (2010) use the IDGROUP option on the OUTPUT statement to control the transpose process.

2.4.1 Using PROC TRANSPOSE

PROC TRANSPOSE tends to be less than intuitive for most users. The coding is not particularly difficult; however, for most users it is often hard to visualize what the resulting data set will look like. There is also a trap in this procedure that, when sprung, can cause the corruption of the data.

The following step, which creates a non-normal version of the lab chemistry data, demonstrates a simple PROC TRANSPOSE and will also be used to demonstrate the PROC TRANSPOSE trap.

proc transpose data=lab_chemistry(keep=subject visit sodium) Callout 1
                 out=lab_nonnormal(keep=subject visit:) Callout 2
                 prefix=Visit; Callout 3
   by subject; Callout 4
   var sodium; Callout 5
   run;

Callout 1 DATA= identifies the incoming data set.

Callout 2 The transposed data set is named with the OUT= option. Notice the use of the colon to select all variable names that start with the letters VISIT (see Section 2.6.1 for more on variable naming shortcuts).

Callout 3 The PREFIX= option identifies text that will be used to form the new column names.

Callout 4 The transposition process takes place within the group of variable(s) in the BY statement. In this example each distinct SUBJECT will form one row.

Callout 5 The column SODIUM is transposed to rows.

Using PROC TRANSPOSE
2.4.1a Incompletely Specified Observations
    S                                                       V     V     V    V    V
    U     V     V     V     V     V     V    V    V    V    i     i     i    i    i
    B     i     i     i     i     i     i    i    i    i    s     s     s    s    s
    J     s     s     s     s     s     s    s    s    s    i     i     i    i    i
O   E     i     i     i     i     i     i    i    i    i    t     t     t    t    t
b   C     t     t     t     t     t     t    t    t    t    1     1     1    1    1
s   T     1     2     3     4     5     6    7    8    9    0     1     2    3    4
1  208  13.7  14.1  14.1  14.1  13.9  13.9  14  14.0  14    .     .     .    .    .
2  209  14.0  14.0  13.9  14.2  14.5  13.8  14  13.8  14  14.1  14.2  14.1  14  14.1

Notice the value of SODIUM for patient 208 on visit #3 and #9. Compare this result to that shown in Section 2.4. Although it is not immediately obvious without careful inspection of the data, the TRANSPOSE trap has been sprung and the transposed data has been corrupted. The values of SODIUM have in some cases become associated with the wrong visit!

When using PROC TRANSPOSE there are two ways to identify the row, or rows, within which the transpose is to take place. In the previous example the BY statement is used. However, since there are multiple visits for each subject, there are multiple rows within each SUBJECT and these become the new columns (VISIT1, VISIT2, VISIT3, and so on). The problem is that there is nothing in our code that ties the VISIT variable to the new column, which in this case will be one of the variables VISIT1 through VISIT14. Patient 208 missed visit 3; consequently the value from this patient's ninth observation (visit number 10) should have been assigned to VISIT10. However, since it was the ninth observation for patient 208, PROC TRANSPOSE incorrectly classified the data to VISIT9. Notice that both patients 208 and 209 are showing data for VISIT3, although both actually missed visit #3.

The ID statement, which was not included in the previous example, can also be used to help identify rows. This statement names a variable that will be used to create the variable names for the new columns. More importantly it also ties a value in a specific row to a specified new column.

This type of problem is easily solved by the following rule: the combination of variables on the BY and ID statements must identify down to the row level.

In the following PROC TRANSPOSE step, an ID statement has been added. The BY variable, SUBJECT Callout 6, and the ID variable, VISIT Callout 7, form a unique key for this data set (any given combination of these two variables will identify at most one observation).

title2 '2.4.1b BY and ID Form a Unique Key';
proc transpose data=lab_chemistry(keep=subject visit sodium)
                 out=lab_nonnormal(keep=subject visit:)
                 prefix=Visit;
   by subject; Callout 6
   id visit; Callout 7
   var sodium;
   run;

Notice that there is no variable for VISIT3, because neither of these two patients had a visit number 3, so their third physical visit was visit number 4.

Using PROC TRANSPOSE
2.4.1b BY and ID Form a Unique Key
    S                                                 V    V    V     V     V    V    V
    U     V     V     V     V     V     V    V   V    i    i    i     i     i    i    i
    B     i     i     i     i     i     i    i   i    s    s    s     s     s    s    s
    J     s     s     s     s     s     s    s   s    i    i    i     i     i    i    i
O   E     i     i     i     i     i     i    i   i    t    t    t     t     t    t    t
b   C     t     t     t     t     t     t    t   t    1    1    1     1     1    1    1
s   T     1     2     4     5     6     7    8   9    0    1    2     3     4    5    6
1  208  13.7  14.1  14.1  14.1  13.9  13.9  14  14  14.0   .    .     .     .    .    .
2  209  14.0  14.0  13.9  14.2  14.5  13.8  14   .  13.8  14  14.1  14.2  14.1  14  14.1

SEE ALSO

The IDGROUP option in PROC MEANS and in PROC SUMMARY is used to transpose data in King and Zdeb (2010).

2.4.2 Transposing in the DATA Step

The DATA step offers a great deal of flexibility to the process of transposing data. Commonly the process of transposing will involve the use of an array and an iterative DO loop.

Rows to Columns

In order to transpose observations into columns, a series of observations must be processed for each new observation. The array statement is used to hold the values from the individual observations. Once all of the individual observations have been consolidated, the values in the array are written out to the new observation in the new data set.

data lab_nonnormal(keep=subject visit1-visit16);
   set lab_chemistry(keep=subject visit sodium);
   by subject;
   retain visit1-visit16 ; Callout 1
   array visits {16} visit1-visit16; Callout 2
   if first.subject then do i = 1 to 16; Callout 3
      visits{i} = .;
   end;
   visits{visit} = sodium; Callout 4
   if last.subject then output lab_nonnormal; Callout 5
   run;

Callout 1 The array values are retained so that we can accumulate sodium values across visits.

Callout 2 We know that there can be no more than 16 patient visits, so this becomes the dimension of the array.

Callout 3 Since the visit values are retained, the array containing them must be cleared at the start of each subject.

Callout 4 The array is indexed using the visit number. This guarantees that the value of sodium will be assigned to the correct array variable.

Callout 5 The new observation is written after all the incoming observations for each subject have been processed.

Unlike the data set generated by PROC TRANSPOSE, this data set includes a variable for VISIT3 even though VISIT3 does not appear in the untransposed data. This is a result of the implicit creation of VISIT3 by the numbered range list VISIT1-VISIT16 in the RETAIN Callout 1 and ARRAY Callout 2 statements.

Transposing in the DATA step
2.4.2a Rows to Columns
    S                                                    v    v    v     v     v    v    v
    U     v     v   v    v     v     v     v    v   v    i    i    i     i     i    i    i
    B     i     i   i    i     i     i     i    i   i    s    s    s     s     s    s    s
    J     s     s   s    s     s     s     s    s   s    i    i    i     i     i    i    i
O   E     i     i   i    i     i     i     i    i   i    t    t    t     t     t    t    t
b   C     t     t   t    t     t     t     t    t   t    1    1    1     1     1    1    1
s   T     1     2   3    4     5     6     7    8   9    0    1    2     3     4    5    6
1  208  13.7  14.1  .  14.1  14.1  13.9  13.9  14  14  14.0   .    .     .     .    .    .
2  209  14.0  14.0  .  13.9  14.2  14.5  13.8  14   .  13.8  14  14.1  14.2  14.1  14  14.1

Columns to Rows

You can also use the DATA step to normalize a data set. In the previous example we converted some lab data for two patients from a normal form to a non-normal form. We will now use a similar DATA step to convert it back to its original form.

title2 '2.4.2b Columns to Rows';
data lab_normal(keep=subject visit sodium);
   set lab_nonnormal(keep=subject visit:); Callout 6
   by subject;
   array visits {16} visit1-visit16; Callout 7
   do visit = 1 to 16; Callout 8
      sodium = visits{visit}; Callout 9
      output lab_normal; Callout 10
   end;
   run;

Callout 6 The variables necessary for the conversion are kept on the incoming data set.

Callout 7 The VISITS array is declared with each of the visits as an element.

Callout 8 The DO loop index is VISIT, which increments once for each potential visit. The index variable, VISIT, is added to the Program Data Vector and will be written to the new data set.

Transposing in the DATA step
2.4.2b Columns to Rows
Obs    SUBJECT    visit    sodium
  1      208         1      13.7
  2      208         2      14.1
  3      208         3        .
  4      208         4      14.1
  5      208         5      14.1
  6      208         6      13.9
  7      208         7      13.9
  8      208         8      14.0
  9      208         9      14.0
 10      208        10      14.0
 11      208        11        .
 12      208        12        .
 13      208        13        .
 14      208        14        .
 15      208        15        .
 16      208        16        .
 17      209         1      14.0
 18      209         2      14.0
 19      209         3        .
 portions of the table not shown 

Callout 9 The variable SODIUM is created from the array element identified with VISIT as the index.

Callout 10 Since the OUTPUT statement is inside the DO loop, it will write an observation for each of the iterations of the DO loop.

The resulting data set will have an observation for each patient X visit combination. This now includes combinations that did not originally exist.

In this example SUBJECT 208 now has observations for VISIT 3 and VISIT 11-VISIT16, and the value for SODIUM is appropriately missing for each of these visits. Observations with missing SODIUM values could easily be removed by adding an IF criterion to the OUTPUT statement Callout 10.

if sodium gt .z then output lab_normal;

2.5 Filling Sparse Data

Sometimes when data are entered, observations are created only when there is a specific value to record. Observations that would contain only missing values, or counts of 0 for count data, are not created. This makes the stored data denser, as there are fewer missing values. However, depending on which observations are omitted, entire classification levels, or combinations of classification levels, can be missing from the data. This means that the data itself no longer reflects the true sampling scheme. Sometimes we need to show all possible levels - not just those with non-missing values.

Creating observations, with the appropriate missing values, is sometimes known as creating sparse (less dense) data.

The examples shown in this section work with the LAB_CHEMISTRY data, which has one row per patient per visit. Missed visits are not represented in the data and will not be represented in tables and reports.

Each patient (SUBJECT) in the LAB_CHEMISTRY data should have an observation for each of the first 10 visits and may or may not have subsequent follow-up for visits 11 through 16. We need to make sure that each patient has an observation for the first 10 visits.

2.5.1 Known Template of Rows

When we know the full list of values that the classification variable(s) should take on, a template can be built and merged back onto the original data. The process of merging the template containing all possible combinations of the classification variables will add the appropriate observations to the data set. For the following example we want each SUBJECT to have at least the first 10 visits.

proc sort data=advrpt.lab_chemistry Callout 1
          out=lab_chemistry;
   by subject visit;
   run;
proc sort data=advrpt.lab_chemistry 
          out=sublist(keep=subject) Callout 2
          nodupkey; Callout 3
   by subject;
   run;
data subvislist;
   set sublist;
   do visit = 1 to 10; Callout 4
      output subvislist;
   end;
   run;
data sparsed;
   merge subvislist Callout 5
         lab_chemistry;
   by subject visit; Callout 6
   run;

Callout 1 The data set to be filled (sparsed) is sorted by the classification variables that will be used to fill the data.

Callout 2 A data set is created that contains only the classification variable(s) that do not need to be filled. In this example it is SUBJECT number.

Callout 3 NODUPKEY is used to eliminate all duplicate subjects. This list of distinct subject numbers could have also been created in a simple SQL step.

Callout 4 The list of unique subjects is read and a DO loop is used to output an observation for each SUBJECT - VISIT combination. When multiple classification variables need to be filled, nested DO loops are needed.

Callout 5 The template data set (SUBVISLIST) is merged back onto the original data set. Any extra rows in the template data, rows that did not appear in the original data, will now also appear in the sparsed data set.

Callout 6 The BY list will contain all the classification variables.
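As noted in Callout 4, when more than one classification variable must be filled, the template is built with nested DO loops. A minimal sketch, assuming a hypothetical second classification variable TEST with three known levels:

```sas
data subvistestlist;
   set sublist;                    /* one observation per distinct SUBJECT        */
   do visit = 1 to 10;             /* outer loop over the required visits         */
      do test = 1 to 3;            /* hypothetical second classification variable */
         output subvistestlist;    /* one row per SUBJECT-VISIT-TEST combination  */
      end;
   end;
   run;
```

The subsequent MERGE step would then list all of the classification variables on its BY statement.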

The LISTING for SUBJECT 210 shows that observations have been added for visits 3 and 9. Because the template covers only the first 10 visits, visits 11-16 appear only because they were already in the data for SUBJECT 210; had they been missing, they would not have been added.

2.5.1 Creating a Sparse Table
Every Patient should have the first 10 visits
Some patients have up to 16 visits
Obs    SUBJECT    visit         LABDT    potassium    sodium  chloride
114      210         1     02/19/2007       5.0        14.0        103
115      210         2     02/28/2007       4.0        14.2        103
116      210         3              .        .           .           .
117      210         4     03/14/2007       3.9        14.1        101
118      210         5     03/09/2007       4.7        14.4        105
119      210         6     03/16/2007       4.7        14.5        104
120      210         7     03/26/2007       4.7        14.3        103
121      210         8     03/28/2007       4.4        14.1        102
122      210         9              .        .           .           .
123      210        10     07/10/2007       4.3        14.2        106
124      210        11     04/06/2007       4.1        14.2        104
125      210        12     04/17/2007       4.0        13.9        103
126      210        13     04/19/2007       4.4        14.2        104
127      210        14     04/26/2007       4.1        14.1         99
128      210        15     05/22/2007       3.8        13.8         99
129      210        16     06/27/2007       5.2        14.3        104
. . . . Results for only SUBJECT 210 are shown . . . .

This process can also be done in a PROC SQL step; either way the concept is the same.
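As a sketch of the SQL version (using the same data set and variable names as above), a Cartesian join of the distinct subjects with a ten-row visit table replaces the template DATA step, and a left join replaces the match merge:

```sas
data visits10;                         /* the ten required visit numbers */
   do visit = 1 to 10;
      output visits10;
   end;
   run;
proc sql;
   create table sparsed_sql as
   select t.subject, t.visit,
          l.labdt, l.potassium, l.sodium, l.chloride
      from (select s.subject, v.visit       /* all SUBJECT-VISIT pairs   */
              from (select distinct subject
                      from lab_chemistry) as s,
                   visits10 as v) as t
           left join
           lab_chemistry as l
             on t.subject = l.subject
            and t.visit   = l.visit
      order by t.subject, t.visit;
   quit;
```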

2.5.2 Double Transpose

When all possible levels of a classification variable are not known, or not easily specified in a DO loop, it is often possible to create a sparsed data set by performing two PROC TRANSPOSE steps. Rather than building a template data set as was done in Section 2.5.1, we will let the data itself determine the classification levels. It should be noted that the results of this technique and those obtained in Section 2.5.1 are not necessarily the same.

The first PROC TRANSPOSE step creates one column for each value of the variable to be sparsed (VISIT). Assuming that a given visit exists somewhere in the data, it will be represented as a column after the first PROC TRANSPOSE step.

The second PROC TRANSPOSE step reconverts the columns (one for each possible visit) into rows. The data now has the original form (as before the first PROC TRANSPOSE); however, every visit column is now represented as a row for every subject.

proc sort data=advrpt.lab_chemistry Callout 1
          out=lab_chemistry
          nodupkey;
   by subject visit;
   run;
proc transpose data=lab_chemistry Callout 2
               out=labtran 
               prefix=Visit; Callout 3
   by subject; Callout 4
   id visit; Callout 5
   var sodium potassium chloride;
   run;
proc transpose data=labtran Callout 6
               out=sparsed(rename=(_name_=Visit)); Callout 7
   by subject; Callout 8
   id _name_; Callout 9
   var visit:; Callout 10
   run;

Callout 1 The incoming data set must be sorted at least to the level of the BY statement Callout 4.

Callout 2 PROC TRANSPOSE is used to convert the rows that are to be filled into columns.

Callout 3 The PREFIX= option specifies the text used to form the root portion of the new variable names.

Callout 4 The BY statement lists the classification variables that do not need to be filled.

Callout 5 The variable VISIT will be used to identify which columns were formed from which rows. In this case, the numeric variable VISIT will be combined with the prefix text Callout 3 to form the new column name.

Callout 6 The data set LABTRAN will be transposed back to the original set of rows and columns.

Callout 7 By default rows in the new data set (SPARSED) will be identified with the variable _NAME_. This variable is renamed to VISIT.

Callout 8 The incoming data set (LABTRAN) has one observation per SUBJECT _NAME_ combination.

Callout 9 The ID statement identifies the variable (_NAME_) that contains the names of the new columns.

Callout 10 Each of the variables starting with VISIT is included in the transpose (see Section 2.6.1 for more on variable list abbreviations).

Prior to the first transpose, inspection of the data for SUBJECT=210 shows that this subject has missed both visits 3 and 9.

2.5.2 Creating a Sparse Table
Using a Double Transpose
Prior to First Transpose
 Obs    SUBJECT    VISIT         LABDT    potassium    sodium    chloride
 95      210         1     02/19/2007       5.0        14.0        103
 96      210         2     02/28/2007       4.0        14.2        103
 97      210         4     03/14/2007       3.9        14.1        101
 98      210         5     03/09/2007       4.7        14.4        105
 99      210         6     03/16/2007       4.7        14.5        104
100      210         7     03/26/2007       4.7        14.3        103
101      210         8     03/28/2007       4.4        14.1        102
102      210        10     07/10/2007       4.3        14.2        106
103      210        11     04/06/2007       4.1        14.2        104
104      210        12     04/17/2007       4.0        13.9        103
105      210        13     04/19/2007       4.4        14.2        104
106      210        14     04/26/2007       4.1        14.1         99
107      210        15     05/22/2007       3.8        13.8         99
108      210        16     06/27/2007       5.2        14.3        104

After the first PROC TRANSPOSE Callout 2 the data (WORK.LABTRAN) looks like the following (for SUBJECT=210). There is now a VISIT9 column for this subject, even though this subject did not have a VISIT9 in the data; however, there is still no column for visit 3. This is because no subject in the entire data set had a VISIT3.

2.5.2 Creating a Sparse Table
Using a Double Transpose
First Transpose
Obs  SUBJECT    _NAME_       Visit1   Visit2   Visit4   Visit5   Visit6   Visit7   Visit8
 31    210      sodium          14      14.2     14.1     14.4     14.5     14.3     14.1
 32    210      potassium        5       4.0      3.9      4.7      4.7      4.7      4.4
 33    210      chloride       103     103.0    101.0    105.0    104.0    103.0    102.0
Obs    Visit9    Visit10    Visit11   Visit12   Visit13   Visit14   Visit15    Visit16
 31       .        14.2       14.2      13.9      14.2      14.1      13.8       14.3
 32       .         4.3        4.1       4.0       4.4       4.1       3.8        5.2
 33       .       106.0      104.0     103.0     104.0      99.0      99.0      104.0

The second PROC TRANSPOSE Callout 6 uses this data as input and since all visits are included on the VAR statement Callout 10, each visit becomes a row in the new table.

2.5.2 Creating a Sparse Table
Using a Double Transpose
Second Transpose
Obs    SUBJECT     Visit     sodium    potassium    chloride
151      210      Visit1      14.0        5.0          103
152      210      Visit2      14.2        4.0          103
153      210      Visit4      14.1        3.9          101
154      210      Visit5      14.4        4.7          105
155      210      Visit6      14.5        4.7          104
156      210      Visit7      14.3        4.7          103
157      210      Visit8      14.1        4.4          102
158      210      Visit9        .          .             .
159      210      Visit10     14.2        4.3          106
160      210      Visit11     14.2        4.1          104
161      210      Visit12     13.9        4.0          103
162      210      Visit13     14.2        4.4          104
163      210      Visit14     14.1        4.1           99
164      210      Visit15     13.8        3.8           99
165      210      Visit16     14.3        5.2          104

This subject now has a data row for VISIT9 (with missing values) even though VISIT9 was not in the original data for this subject. This implies that at least one subject had a VISIT9. Since VISIT3 still does not appear, we can infer that no subject in our study had a VISIT3. This technique requires that the value to be sparsed appear somewhere in the data set at least once; visit 9 meets this requirement, but visit 3 does not.

Remember when using PROC TRANSPOSE that it is very important that some combination of the BY and ID variables identify down to the row level (see Section 2.4.1).

MORE INFORMATION

The TRANSPOSE procedure and some of its pitfalls (gotchas) can be found in Section 2.4.1.

2.5.3 Using COMPLETETYPES with PROC MEANS or PROC SUMMARY

proc means data=advrpt.lab_chemistry
           completetypes noprint nway;
   class subject visit;
   var sodium potassium chloride;
   output out=allvisits sum=;
   run;

The COMPLETETYPES option can be used on the PROC MEANS or PROC SUMMARY statement to force the procedure to generate statistics for all combinations of the classification variables.

In Section 2.5.2 a double PROC TRANSPOSE is used to determine all of the combinations of SUBJECT and VISIT. This can also be accomplished using the COMPLETETYPES option on the PROC MEANS or PROC SUMMARY statement.

2.5.3 Creating a Sparse Table
Using COMPLETETYPES
Obs   SUBJECT  VISIT  sodium   potassium  chloride
151     210       1    14.0       5.0        103
152     210       2    14.2       4.0        103
153     210       4    14.1       3.9        101
154     210       5    14.4       4.7        105
155     210       6    14.5       4.7        104
156     210       7    14.3       4.7        103
157     210       8    14.1       4.4        102
158     210       9      .         .           . Callout 1
159     210      10    14.2       4.3        106
160     210      11    14.2       4.1        104
161     210      12    13.9       4.0        103
162     210      13    14.2       4.4        104
163     210      14    14.1       4.1         99
164     210      15    13.8       3.8         99
165     210      16    14.3       5.2        104

There are no VISIT 9 observations for SUBJECT 210; however, since at least one subject somewhere in the LAB_CHEMISTRY data table had a VISIT 9, the report generated from the PROC MEANS results will show a VISIT 9 for all subjects Callout 1.

Behind the scenes PROC MEANS and PROC SUMMARY are really the same procedure, so this technique works with either procedure.

MORE INFORMATION

The COMPLETETYPES option is also discussed in Section 7.10. The COMPLETETYPES option also has implications when using preloaded formats; see Section 12.1.3.

2.5.4 Using CLASSDATA

The CLASSDATA option is used with the TABULATE, MEANS, and SUMMARY procedures to specify a data set that contains levels of one or more classification variables. If the data set contains levels that are not found in the data, those levels will be included in the resulting summary.


proc sort data=advrpt.demog(keep=subject)
   out=subjects nodupkey;
   by subject;
   run;
data Visits;
   set subjects;
   do visit = 1 to 16; Callout 1
      output visits;
   end;
   run;
proc means data=advrpt.lab_chemistry
           classdata=visits Callout 2
           noprint nway exclusive;
   class subject visit;
   var sodium potassium chloride;
   output out=allvisits Callout 3 sum=;
   run;

The data set WORK.VISITS is constructed to have one observation for each of the potential 16 visits Callout 1. This data set is then used with the CLASSDATA= option Callout 2 and the EXCLUSIVE option in the PROC MEANS step.

Although no subject in the LAB_CHEMISTRY data set has a visit 3, and SUBJECT 210 does not have a visit 9, in the summary data set (WORK.ALLVISITS) Callout 3 created by PROC MEANS each subject will have a summary row for all sixteen visits. Subject 210, which is shown here, now has both a visit 3 Callout 4 and a visit 9 Callout 5.

2.5.4 Using CLASSDATA
MEANS / SUMMARY
Obs  SUBJECT  VISIT  sodium  potassium  chloride
161    210       1    14.0      5.0        103
162    210       2    14.2      4.0        103
163    210       3      .        .           . Callout 4
164    210       4    14.1      3.9        101
165    210       5    14.4      4.7        105
166    210       6    14.5      4.7        104
167    210       7    14.3      4.7        103
168    210       8    14.1      4.4        102
169    210       9      .        .           . Callout 5
170    210      10    14.2      4.3        106
171    210      11    14.2      4.1        104
172    210      12    13.9      4.0        103
173    210      13    14.2      4.4        104
174    210      14    14.1      4.1         99
175    210      15    13.8      3.8         99
176    210      16    14.3      5.2        104

ASIDE: The CLASSDATA data set must contain each of the CLASS variables.

MORE INFORMATION

The CLASSDATA option is also discussed in Sections 7.9 (PROC MEANS and PROC SUMMARY) and 8.1.4 (PROC TABULATE).

2.5.5 Using Preloaded Formats

For the TABULATE, MEANS, SUMMARY, and REPORT procedures, preloaded formats can be used to add rows to output tables. Like the CLASSDATA option shown in Section 2.5.4, this method adds the sparsed rows to the table, not the data set. Thus, we are not required to either modify the original data or to even make a copy.

proc format; Callout 1
value visits
 1='1'
 2='2'
 3='3'
 4='4'
 5='5'
 6='6'
 7='7'
 8='8'
 9='9'
 10='10';
 run;
ods pdf file="&path\results\E2_5_5.pdf";
proc report data=advrpt.lab_chemistry nowd 
            completerows; Callout 2
   column visit sodium potassium chloride;
   define visit / group 
                  f=visits. preloadfmt Callout 3
                  'Visit' order=data;
   define sodium /analysis mean f=5.2;
   define potassium /analysis mean f=5.3;
   define chloride/analysis mean f=5.1;
   run;
ods pdf close;

Let’s assume that we need to generate a report of mean lab chemistry values for lab visits. The report must contain the first 10 visits regardless of whether or not they appear in the data.

Callout 1 A format is created which contains each of the first 10 visits.

Callout 2 The COMPLETEROWS option, which is unique to PROC REPORT, is used to ensure that every value in the preloaded format will appear as a row in the report.

Callout 3 The PRELOADFMT option will always be present when using preloaded formats. Here the PRELOADFMT option is associated with the format to be preloaded by placing both on the DEFINE statement. A portion of the resultant report is shown to the right.

image shown here

MORE INFORMATION

Preloaded formats can also be used to exclude observations, and are introduced and discussed in more detail in Section 12.1.

2.5.6 Using the SPARSE Option with PROC FREQ

proc freq data=advrpt.demog;
  table edu*symp/ list; Callout 1
  table edu*symp/ list sparse; Callout 2
  run;

By default the table generated by PROC FREQ will contain only those levels that actually exist in the data. In the first TABLE statement Callout 1, only the combinations of the two classification variables (EDU and SYMP) that occur in the data will appear in the table. The SPARSE option on the second TABLE statement Callout 2 causes the table to include all combinations of the values of EDU and SYMP. Notice that in the first table EDU=10 has only two levels of SYMP (04 and 10); however, in the second table Callout 2 each level of SYMP that exists somewhere in the data set is associated with EDU=10. Notice also that, since no subject has a SYMP of either ‘07’ or ‘08’, those levels are not included even in the SPARSE table.

image shown here

image shown here

2.6 Some General Concepts

There are a number of general techniques, shortcuts, and "did you know that you can" items of which you should be aware.

2.6.1 Shorthand Variable Naming

When creating a long list of variable names it is sometimes helpful to not actually write each name individually. Fortunately there are several ways to create lists of variables that require less coding.

These shorthand variable lists can be used wherever a list of variables is expected. This includes the VAR, KEEP, DROP, and ARRAY statements.

Common Prefix Variable Lists (Numbered Range)

Variables with a common prefix and a numeric suffix can be listed as:

   visit1 - visit10

This list will include all the variables between VISIT1 and VISIT10 inclusively. As a general rule, it does not matter if all the variables are already present on the PDV, and their order on the PDV is not important. However, as with any list of variables, the usage itself can have unintended consequences.

In the following ARRAY statement only the first 10 visits will be included in the array; however, if one of these variables is not already on the PDV, it will be added.

   array vis {10} visit1 - visit10;

The KEEP statement does not establish variables, so unlike the previous ARRAY statement, variables in the list that are not already on the PDV will cause an error. If there is no VISIT3 variable on the PDV, the following KEEP statement will fail with an error.

   keep visit1 - visit10;

This type of list can be used wherever a variable list is expected. This includes statements and options such as: KEEP, DROP, VAR, RETAIN. Functions that accept a list of values, e.g., MIN, MAX, MEAN, require the use of the OF operator to prevent confusion with a subtraction.

   m = max(of visit1 - visit10);

PDV Order Dependent Lists (Named Range)

When the order of the variables on the PDV is known, you can use the double dash to specify the list. Unlike the common prefix variable list shown above, the order of the variables on the PDV is very important, and this form of variable list cannot be used to create variables. The following PROC CONTENTS step shows the variables in ADVRPT.DEMOG and their relative position on the PDV through the use of the VARNUM option (which replaces the now outdated POSITION option).

title1 '2.6.1 Variable Shorthand Lists';
title2 'List of variables and their positions';
proc contents data=advrpt.demog varnum;
   run;

The resulting listing shows the names of the variables, their attributes, and their order.


2.6.1 Variable Shorthand Lists
List of variables and their positions
The CONTENTS Procedure
                   Variables in Creation Order
 #    Variable    Type    Len    Format    Label
 1    subject     Num       8
 2    CLINNUM     Char      6              clinic number
 3    LNAME       Char     10              last name
 4    FNAME       Char      6              first name
 5    SSN         Char      9              social security number
 6    SEX         Char      1              patient sex
 7    DOB         Num       8    DATE7.    date of birth
 8    DEATH       Num       8    DATE7.    date of death
 9    RACE        Char      1              race
10    EDU         Num       8              years of education
11    WT          Num       8              weight in pounds
12    HT          Num       8              height in inches
13    SYMP        Char      2              symptom code
14    death2      Num       8    DATE9.

The variable list LNAME--SYMP includes all variables (numeric and character) in the data set ADVRPT.DEMOG except SUBJECT, CLINNUM, and DEATH2.

Inclusion of the list modifiers NUMERIC and CHARACTER can be used to restrict the list to just numeric or just character variables. Again the list is order dependent and includes the endpoints, assuming they are the correct type. The list DOB-numeric-HT excludes RACE, while the list SEX-character-SYMP contains only three variables (SEX, RACE, and SYMP).

Inclusion of an incorrect type does not cause an error. The designation death-character-symp will correctly contain the two variables RACE and SYMP.

Unlike the list abbreviation with a single dash (common prefix numbered list), this list form cannot be used to create variables or add variables to the PDV. It can, however, be otherwise used where you need a list of variables.
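As a sketch of this usage, given the variable order shown in the PROC CONTENTS listing above, a KEEP statement could use the typed named range from the text:

```sas
data numsubset;
   set advrpt.demog;
   keep subject dob-numeric-ht;   /* keeps DOB, DEATH, EDU, WT, and HT; */
   run;                           /* RACE, being character, is excluded */
```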

CAVEAT

Since the order of the variables on the PDV is generally of secondary importance to most SAS programmers, be very careful when using these forms of lists. If the variable order changes for some reason, the list may no longer be what you intend.

Using the Colon Operator (Name Prefix)

Variables named with a common prefix (with or without a numeric suffix) can be listed by following the prefix with a colon.

For the data set STATS generated by the following PROC SUMMARY, you could select all the statistics associated with HT by using the list HT_: Callout 3.


proc summary data=advrpt.demog;
   class race edu;
   var ht wt; Callout 1
   output out=stats
          mean=
          stderr=
          min=/autoname; Callout 2
   run;
proc print data=stats;
   id race edu;
   var ht_: Callout 3;
   run;

Callout 1 The analysis variables HT and WT are used to generate a series of statistics. The names of these statistics are automatically generated Callout 2, and are of the form analysisvariable_statistic (see Section 7.2 for more details on the AUTONAME option).

Callout 3 The list of all statistics generated for the HT variable will be printed using the name prefix list in the VAR statement.

MORE INFORMATION

This list abbreviation is used in a PROC TRANSPOSE example in Section 2.5.2.

Special Name Lists

Three name lists exist that allow you to address variables by their type. These include:

  • _CHARACTER_

All character variables

  • _NUMERIC_

All numeric variables

  • _ALL_

All variables on the PDV

Since each of these lists pertains to the current list of variables, they will not create variables. In each case the resulting list of variables will be in the same order as they are on the Program Data Vector.
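For example, the _NUMERIC_ list can be used to dimension an array of every numeric variable currently on the PDV. In this sketch a hypothetical sentinel value of -99 is recoded to missing:

```sas
data recoded;
   set advrpt.demog;
   array nums {*} _numeric_;             /* all numeric variables read so far */
   do i = 1 to dim(nums);
      if nums{i} = -99 then nums{i} = .; /* -99 is a hypothetical code        */
   end;
   drop i;
   run;
```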

The _ALL_ list abbreviation is used in the following DATASETS step to remove the label and format attributes from a data set. This example was suggested by SAS Sample #25052.

proc datasets lib=work nolist;
   modify demog;
      attrib _all_ label=' '
                   format=;
      contents;
   quit;

The MODIFY statement opens the WORK.DEMOG data set and the ATTRIB statement is applied to all the data set’s variables by listing the variables using the _ALL_ list abbreviation.

SEE ALSO

This example with further explanation can be found in the SAS Sample library at http://support.sas.com/kb/25/052.html.

2.6.2 Understanding the ORDER= Option

The ORDER= option can be used with most procedures that classify or summarize data. It allows us to control both the analysis and display order of information without physically sorting the data. Depending on the procedure the option may be applied on the PROC statement or on one or more of the supporting statements, such as the CLASS statement.

The option can take on the values of:

Option Value    Order is based on:
INTERNAL        the unformatted values (like PROC SORT)
FORMATTED       the formatted value
FREQ            the descending frequency
DATA            the order of the data values

For most procedures the default value for ORDER= is INTERNAL.

In each of the following examples, a simple PROC MEANS with a single classification variable (SYMP) is used to demonstrate the effect of the ORDER= option.

ORDER=INTERNAL

This is typically the order of the variable if it had been sorted with PROC SORT, and is usually the procedure’s default. Its alias is UNFORMATTED.

title2 'order=internal';
proc means data=advrpt.demog 
           n mean;
   class symp;
   var ht;
   run;

The ORDER= option is not specified and the PROC MEANS default order for all classification variables is ORDER=INTERNAL. As a result the symptoms appear in alphabetical order (SYMP is character).

image shown here

ORDER=FORMATTED

When the ORDER=FORMATTED option is used the values are first formatted and then ordered.


proc format;
   value $SYMPTOM	
      '01'='Sleepiness'
      '02'='Coughing'
      '03'='Limping'
      '04'='Bleeding'
      '05'='Weak'
      '06'='Nausea'
      '07'='Headache'
      '08'='Cramps'
      '09'='Spasms'
      '10'='Shortness of Breath';
   run;
title2 'order=formatted';
proc means data=advrpt.demog 
           n mean
           order=formatted; Callout 1
   class symp;
   var ht;
   format symp $symptom.; Callout 2
   run;

image shown here

Callout 1 The ORDER=FORMATTED option on the PROC statement is applied to all classification variables.

Callout 2 The user-defined format $SYMPTOM. is applied to the classification variable SYMP.

The formatted values now determine the order of the rows for the classification variable.

ORDER=FREQ

The frequency of the levels of the classification variable is used to determine the order when ORDER=FREQ is used.

title2 'order=freq';
proc means data=advrpt.demog
           n mean;
   class symp / order=freq; Callout 3
   var ht;
   run;

image shown here

Callout 3 Placing the ORDER= option on the CLASS statement instead of on the PROC statement allows the selective application of the option to only specific classification variables (see Section 7.1.3 for more on the use of options on the CLASS statement).

The symptoms are now listed in order of decreasing frequency. The CLASS statement also supports the ASCENDING option which can be used with the ORDER=FREQ option to list the levels in ascending order.
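Combining these two options might be sketched as follows (based on the step above; output not shown):

title2 'order=freq ascending';
proc means data=advrpt.demog
           n mean;
   class symp / order=freq ascending;
   var ht;
   run;

The symptoms would then be listed from least to most frequent.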

ORDER=DATA

The order of the classification variables will reflect their order in the data itself. The first level detected will be written first. The data do not have to be in any particular order.

title2 'order=data';
proc means data=advrpt.demog 
           n mean 
           order=data;
   class symp;
   var ht;
   run;

Symptom 02 (coughing) is the first symptom in the data, followed by 10 and 06.

image shown here

MORE INFORMATION

Missing values of classification variables are not normally included in the table, see Section 7.1.1 to change this behavior. The ORDER= option is discussed in terms of the TABULATE procedure in Section 8.1.5.

2.6.3 Quotes within Quotes within Quotes

The quote mark is used to identify constant text to the parser. Sometimes that quoted string of constant text will itself contain quotes. Fortunately SAS comes with both single and double quotes and either can be used within the other. But what happens if you need to call a macro variable within the interior string? Regardless of which type is used on the inside, the macro variable will be within single quotes and, therefore, will probably not be resolved.

Each of the following three statements has a quoted string within a quoted string. And each executes successfully.

A DEFINE routine in a REPORT step compute block:

call define(_col_,'style', 'style={flyover="myloc"}');

An X statement executing a Windows DIR command:

x 'dir "c:\myloc\*.sas" /o:n /b > c:\myloc\pgmlist.txt';

A DM statement being used to reroute the LOG file (see Section 14.4.2):

dm 'log; file "c:\myloc\logdump1.log"';

Now assume that we need to embed a macro variable in the above examples. Since macro variables tend not to be resolved when they are used inside of single quotes, we need to understand not only how the statements are parsed and executed, but how we can recode them. Simply substituting the macro variable into the statement does not always work. Interestingly it does not necessarily fail either.

%let temp = myloc;
call define(_col_,'style', 'style={flyover="&temp"}'); Callout 1
x 'dir "c:\&temp\*.sas" /o:n /b > c:\&temp\pgmlist.txt'; Callout 2
dm 'log; file "c:\&temp\logdump1.log"'; Callout 3

Callout 1 While it is generally true that macro variables will not be resolved when they occur within single quotes, this is not strictly true. The CALL DEFINE routine is only called from within a PROC REPORT compute block (not shown), and because of the way that these blocks are executed the macro variable will be resolved even though it is inside of single quotes. Nothing special needs to be done.

x dir "c:\&temp\*.sas" /o:n /b > c:\&temp\pgmlist.txt;   

Callout 2 The X statement will not work as it is currently coded; the macro variable will not resolve. When we pass a path to the OS under Windows, the path should be enclosed in double quotes. Under the current versions of SAS the X statement generally no longer requires the use of the quotes that surround the command that is to be passed to the OS. This simplifies the statement and eliminates the problem. When this does not work, consider one of the solutions used for the DM statement.

dm %unquote(%bquote(')log%bquote(;)file "c:\&temp\logdump1.log" %bquote('));

Callout 3 In the DM statement the string following the keyword DM must be quoted, and the macro variable will not be resolved. The macro quoting functions can be helpful by temporarily masking the single quotes until after the macro variable has been resolved. Since the single quote has been masked, the semicolon used to separate the two DM commands must also be temporarily masked. Prior to execution the macro quoting is removed using the %UNQUOTE function. An approach similar to this may be needed in the FILENAME statement as well. Here we are using the pipe engine to route the results of the DOS command to a virtual (piped) file.

filename list pipe %unquote(%bquote(')dir "&temp\*.rtf" /o:n /b %bquote('));
x "%str(md %"&temp\output%"; )";

In fact, since we are delaying recognition of the quote marks, we do not even need to use both types of quote marks. In the X statement shown here, the %STR function is used to delay recognition of the inner pair of double quotes until after the outer ones have been utilized.

title1 'Tom''s Truck';

Within SAS it is not too unusual to be able to delay the parser’s recognition of a character by doubling it. This technique was common before double quotes were introduced into the language. To show an apostrophe in a title statement two single quotes were used. This works because the parser sees the two single quotes and in a second pass of the string, converts them to a single quote mark (an apostrophe). This technique still works and we can use it to our advantage in the DM statement that we have been working with.

dm "log; file ""c:\&temp\logdump1.log""";

Here only double quote marks are used. Notice a single double quote at the start, a double double quote in the middle of the string, and a triple double quote at the end. This will require three passes for the parser to resolve all the strings. In the meantime the macro variable will have been resolved.

This works because the parser uses the quote marks to 'mark' the strings in a way that tells it how to handle them. Double double quote marks are resolved to just one 'mark' in a second pass of the parser, and by then the macro variable has been resolved.

Rewriting the DM statement using single quotes would only partially be successful. The parser would handle the resolution process for the quotes the same; however, since the first pair of single quotes still resolves to a single quote, that quote would prevent the resolution of the macro variable.

dm 'log; file ''c:\&temp\logdump1.log''';

More rarely you may need a third level of quoting. For this problem what we really want is a third type of quote mark. We only have two, however, so again we can take advantage of the parsing process, and consequently expand the previous technique to additional levels. Surround the whole string with single quotes, as you have already done. Then change each interior single quote to two single quotes (not a double quote). This forces the parser to take a second pass.

2.6.4 Setting the Length of Numeric Variables

While we regularly reduce or control the length of character variables, we more rarely do so for numeric variables. In both cases reduction of variable length can be a successful strategy to reduce data set storage requirements. However, there are specific issues associated with reducing the length of numeric variables – shortening a numeric variable, especially one that holds non-integer values, can drastically reduce its precision. The documentation associated with your version of SAS and your OS will cover topics such as the loss of precision and the size of integers that can be stored with a given length.

So how can you minimize storage costs by controlling length? With character variables it is easy; use the minimum length that avoids truncation. For numeric variables it is less straightforward. One of the first considerations is the value itself. If a numeric code is just a code, such as clinic number, and will not be used in calculations, it should generally be stored as a character variable. An exception would be social security numbers (SSN) and Employer Identification Numbers (EIN), which can be stored in 8 bytes as a numeric variable, but require at least 9 bytes as character variables.
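The storage difference for an SSN might be sketched in a small DATA step (the data set and variable names here are hypothetical; SSN11. is a standard SAS format that supplies the dashes on display):

data ids;
   length ssn_num 8 ssn_char $9;  * 8 bytes numeric vs. 9 bytes character;
   ssn_num  = 123456789;
   ssn_char = '123456789';
   format ssn_num ssn11.;         * displayed with dashes;
   run;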

While SAS dates, which are always integer values, can be safely stored in four bytes, most users and some companies (as company policy) never reduce (or never allow the reduction of) the length of numeric variables – “just in case.” Given that storage is generally cheap and access is generally fast, my rule of thumb is that codes (regardless of content) are text. Only numbers are stored as numeric values. And I only rarely reduce the length of a numeric value.
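Storing an integer-valued SAS date in four bytes might be sketched as follows (the data set and variable names are hypothetical):

data visits;
   length visitdt 4;        * 4 bytes is safe for integers such as SAS dates;
   visitdt = '15may2006'd;
   format visitdt date9.;
   run;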

While the readers of this book are probably an exception, most users are not sophisticated enough to understand the subtle implications of reducing length for numeric variables. If they do understand AND/OR they know that they are only dealing with integers, then some reduction of storage requirements can be achieved by reducing the length of numeric variables.

2.7 WHERE Specifics

The WHERE statement (DATA and PROC steps), WHERE= data set option, and PROC SQL WHERE clause provide subsetting capabilities that are not otherwise available. While the subsetting IF in the DATA step can have similar syntax and often similar results, the filter generated by a WHERE can be substantially different in efficiency, usage, syntax, and result. The differences are important.

When importing or exporting data, it is often necessary to filter portions of data that are to be transferred. There are several ways to provide this filtering, and building a WHERE clause, which can be used in a variety of data import and export situations, is a core methodology. It can be used in both procedure and DATA steps, and can be generated using statements, options, and clauses. Having a firm understanding of the capability of the WHERE can have a major impact when transferring large amounts of data.

The WHERE is a primary tool for the creation of subsets of data. It can be used as a statement, data set option, and as an SQL clause. Not only is it flexible in how it can be used, it has a number of inherently beneficial properties. The following are a few comments about the WHERE that mostly fall into the category of “Did you know that….”.

When creating a data subset in a DATA step the WHERE generally tends to be more efficient than the subsetting IF statement. The selection criteria associated with the IF statement are applied to the values in the PDV, which means that every observation is read, and then potentially discarded. The WHERE clause, on the other hand, is applied before the observation is read, which can save resources by minimizing I/O. The WHERE does have some additional overhead, so its efficiency gains are first noticed, and then become more pronounced, relative to the subsetting IF statement as the fraction of discarded observations grows.

Clearly the WHERE clause must evaluate observations; however, sometimes complete blocks of observations can be eliminated depending on what SAS knows about the data. When the WHERE clause is applied to an indexed data set, the WHERE clause will take advantage of the indexes to optimize the selection process.
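As a sketch, a simple index that subsequent WHERE clauses could exploit might be created with PROC DATASETS (assuming the ADVRPT.DEMOG data set used throughout this chapter is in a writable library):

proc datasets library=advrpt nolist;
   modify demog;
   index create clinnum;    * simple index on CLINNUM;
   quit;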

In the DATA step and in procedures, a WHERE clause can be established either through the use of the WHERE statement Callout 1 or as a WHERE= data set option Callout 2 on the incoming data set. As a general rule best practices suggest that the WHERE statement should be used only when the WHERE= data set option is not available, as the use of the data set option tends to make the code easier to understand. The following two steps yield the same subset of the observations in the data set ADVRPT.DEMOG.

title1 'E1.4a WHERE Statement';
proc print data=advrpt.demog;
   var lname fname sex dob;
   where year(dob)>1960; Callout 1
   run;
title1 'E1.4b WHERE Data Set Option';
proc print data=advrpt.demog(where=(year(dob)>1960)); Callout 2
   var lname fname sex dob;
   run;

In PROC SQL there are three ways of using the WHERE clause:

  • WHERE clause in a pass-through
  • WHERE clause in the SAS SQL
  • WHERE= data set option

Obviously if we are writing code that will be passed through SQL to a database system other than SAS, Oracle for instance, the WHERE= data set option cannot be used. The WHERE clause in SQL pass-through code has to be appropriate to the receiving database, and SAS data set options can only be applied to SAS data sets. The efficiency issues of SQL pass-through are discussed further in Section 1.4.

In a SAS SQL step there can be performance differences between the WHERE clause and the WHERE= data set option. As was mentioned above, the WHERE= data set option is generally optimized for use with indexes. Depending on the type of JOIN the WHERE clause will sometimes be applied after the read.
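The two forms might be sketched in a single SQL step (a hypothetical subset of ADVRPT.DEMOG; compare with the PROC PRINT examples above):

proc sql noprint;
create table sub1 as
   select lname, fname, dob
      from advrpt.demog
      where year(dob) > 1960;                      /* SQL WHERE clause */
create table sub2 as
   select lname, fname, dob
      from advrpt.demog(where=(year(dob) > 1960)); /* WHERE= data set option */
   quit;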

2.7.1 Operators Just for the WHERE

While the basic syntax of the WHERE statement is similar to that of the subsetting IF statement, there are several operators that can be used only with the WHERE (statement, data set option, or SQL clause). These include:

  • BETWEEN

Builds an inclusive range

  • CONTAINS

String search

  • IS MISSING

Check for missing values

  • LIKE

Pattern matching

  • SAME

WHERE clause concatenation

  • =*

Sounds like

The examples below all use the WHERE statement, but these operators apply to the WHERE= data set option and the SQL WHERE clause as well.

BETWEEN

The BETWEEN operator allows us to establish an inclusive range.

title2 'BETWEEN';
proc print data=advrpt.demog;
   var lname fname edu;
   where edu between 15 and 17; Callout 1
/*   where 15 le edu le 17;*/ Callout 2
   run;

Callout 1 The acceptable range for EDU is between 15 and 17 inclusively.

Callout 2 The same list could have been established using this compound expression, which can also be used in an IF statement.

The negation of this range is requested by using the NOT operator with the BETWEEN operator. The following two WHERE statements are equivalent to each other and are the exact opposites of those in the previous PROC PRINT step.

   where edu not between 15 and 17;
   where edu lt 15 or edu gt 17;

CONTAINS

The CONTAINS operator works much like the INDEX function to determine if a text string can be found within another string. The word CONTAINS can be replaced with its mnemonic, the question mark (?).

title2 'CONTAINS';
proc print data=advrpt.demog;
   var lname fname edu;
   where lname contains 'son'; Callout 3
/*   where lname ? 'son';*/ Callout 4
/*   where index(lname,'son');*/ Callout 5
   run;

Callout 3 All last names that contain the letters ‘son’ will be printed. Like all string comparisons the search is case sensitive.

Callout 4 The question mark could be used to replace the CONTAINS operator.

Callout 5 The INDEX function could also be used.

CONTAINS is negated by preceding the operator with a NOT or other negation mnemonic.
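The negation might be sketched as (based on the step above):

   where lname not contains 'son';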

IS MISSING

The IS MISSING operator can be used to check for missing values. One advantage is that it can be used to check for either a numeric or a character missing value. Either the IS MISSING or the IS NULL operator can be used.

title2 'IS MISSING';
proc print data=advrpt.demog;
   var lname fname edu symp;
   where edu is missing or symp is missing; Callout 6
/*   where edu is null or symp is null;*/ Callout 7
/*   where edu = . or symp = ' ';*/ Callout 8
   run;

Callout 6 The syntax is the same for numeric variables (EDU) and character variables (SYMP).

Callout 7 IS NULL can be used instead of IS MISSING.

Callout 8 When checking for missing values using the ‘traditional’ approach the programmer must be aware of the variable’s type.

Negation is formed using NOT (or other negation operator). The NOT may appear either before the IS or between the IS and MISSING.

   where edu is not missing or symp not is missing;

LIKE

The LIKE operator allows for simple pattern matching. When appropriate, more complex pattern matching can be accomplished using regular expressions in the RX and PRX families of functions. This operator uses the percent sign (%) and the underscore (_) as wildcards. The % will stand for any number of characters and the _ is a place holder for exactly one character.

The following table shows some examples using the LIKE operator and alternate equivalent expressions.

Using the LIKE Operator   Without the LIKE Operator     What It Does
lname like 'S%'           lname =: 'S'                  Find all last names beginning
                          substr(lname,1,1) = 'S'       with a capital S.
lname like '%ly%'         index(lname,'ly') > 0         Find all last names containing
                          lname contains 'ly'           an 'ly'.
lname like 'Ch__'         substr(lname,1,2) = 'Ch'      Any two, three, or four letter
                          and length(lname) < 5         last names starting with Ch.

When using the % and _ with the LIKE operator, you need to be careful, as it is possible to return unanticipated values.

  • A trailing _ will not select anything if the _ is past the length of the variable.
  • Whenever the % is followed by other search characters, be sure to enclose the string in single quotes to prevent the macro parser from interpreting the % as a macro trigger.
  • When the searched string contains either a _ or a % there can be confusion between the wildcards and the actual characters. Be sure that you specify what you really mean. Since the CONTAINS operator does not utilize wildcards, it can be used when your target string contains an underscore or percent sign.
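The last point might be sketched with a hypothetical variable CODE whose values contain literal underscores:

   where code like '%_%';     * here the _ is a wildcard for any one character;
   where code contains '_';   * here the _ is taken literally;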

Negation of the LIKE operator is achieved using the standard negation operators.

SAME

The SAME operator allows you to specify a composite clause through the use of separate WHERE statements. Primarily used in interactive or run-group processing, it has little use in other programming situations, since the clause cannot be maintained across step boundaries.

proc print data=advrpt.demog;
   var lname fname edu symp;
   where lname like 'S%';
   where same edu le 15;
   run;

The first clause is specified as usual, and the second is appended with the SAME operator. If the SAME operator had not been included on the second WHERE statement, the second clause would have replaced the first. When joining two WHERE clauses with the SAME operator, the two clauses are effectively joined with an AND (both clauses have to be true for the overall result to be true).

   where lname like 'S%';
   where same and edu le 15;

It is common to explicitly specify the AND on subsequent WHERE clauses, as this can reduce ambiguity.

In both cases the resulting WHERE statement could have been written as:

   where lname like 'S%' and edu le 15;

Sounds like

The sounds-like operator, which is coded using the mnemonic =*, uses the same algorithm as the SOUNDEX function:

  • The first letter is preserved
  • Vowels are eliminated
  • Double letters are compressed
  • The remaining letters are converted to numbers using a scheme that nominally groups letters that sound similar in English

In theory two words with similar pronunciation will yield the same code.

proc print data=advrpt.demog;
   var lname fname dob;
   where lname =* 'che';
   run;

In this example we are searching for all patients with last names that sound something like ‘che’.

2.7.1 Operators Just for the WHERE
Sounds like
Obs    LNAME    FNAME        DOB
 14    Chou     John     15MAY58
 15    Chu      David    18JUN51

The resulting listing shows two names that match the requested text string. Since vowels are dropped, as is the silent ‘h’, the portion that is actually used to form the comparison is the ‘C’.
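The SOUNDEX function itself can be used to inspect the codes behind this comparison (a sketch; the data set and variable names are hypothetical). Each of these three names reduces to the same code:

data codes;
   chou = soundex('Chou');
   chu  = soundex('Chu');
   che  = soundex('che');
   put chou= chu= che=;
   run;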

MORE INFORMATION

There can be serious and sometimes unanticipated consequences for using the MIN and MAX operators. These operators are available to expressions in both the WHERE and IF statements, but they behave differently depending on how they are used. Please review Section 2.2.5 before using either of these two operators or their mnemonics.

2.7.2 Interaction with the BY Statement

When a WHERE clause is created and a BY statement is also present, the groups of observations formed using the BY variables are created after the application of the WHERE clause. This means that any FIRST. or LAST. processing will be applied only to those observations that meet the WHERE criteria. The result can be quite different from using a subsetting IF statement to form the groups, as the BY groups are formed before the IF statement is applied.

data WHEREcnt(keep=clinnum Wsympcnt);
   set demog(keep=clinnum symp);
   by clinnum symp;
   where symp ne ' '; Callout 1
   if first.clinnum then do; Callout 2
      Wsympcnt=0;
      end;
   if first.symp then Wsympcnt+1; Callout 3
   if last.clinnum then output; Callout 4
   run;

This DATA step correctly counts the number of distinct symptoms within each clinic (CLINNUM). Since we do not want to count observations without symptoms (SYMP=' '), a WHERE statement Callout 1 is used to exclude those observations, and the variable WSYMPCNT is used to accumulate the symptom count within each clinic.

Callout 2 The first observation for a given clinic is used to initialize the counter, WSYMPCNT.

Callout 3 Each distinct value of SYMP is counted. We could test either FIRST.SYMP or LAST.SYMP here.

Callout 4 When we have processed all the rows for this clinic we know that we have the total count and we can output the result.

   if symp ne ' '; Callout 5

Using the subsetting IF statement Callout 5 instead of the WHERE statement Callout 1 changes the way that the BY groups are formed. If the DATA step that uses the IF statement Callout 5 does not take this formation process into account, the results will be incorrect.

2.7.2 WHERE and BY Group Processing
Showing Counts
Obs    CLINNUM   IFsympcnt   Wsympcnt
 1     031234        2           2
 2     033476        3           1
 3     036321        2           2
 4     043320        3           1
 5     046789        1           1
 6     049060        1           1
 
2.7.2 WHERE and BY Group Processing

Obs    CLINNUM    SYMP    subject
  1    031234      01       127
  2    031234      01       168
  3    031234      04       110
  4    031234      04       156
  5    033476               148
  6    033476               161
  7    033476      09       116
  8    033476      09       157
  9    036321      02       128
 10    036321      02       147
 11    036321      06       135
 12    036321      06       169
 13    038362               145
 14    038362               175
 15    043320               132
 16    043320               134
 17    043320      02       124
 18    043320      02       152
 19    046789      10       107
 20    046789      10       121
 21    049060      02       101
 22    049060      02       108
 23    049060      02       164
 24    049060      02       165

The table above shows the counts for these two methods for a few of the clinics. Clearly we can see that there is disagreement for two of these clinics, and the reason for this difference is at the heart of the problem.

Examination of the data used to form the counts shows that the variable SYMP has at least one missing value for each of the clinics that have an incorrect count.

When the subsetting IF statement Callout 5 executes, it eliminates these rows, but more importantly these rows are also the rows for which FIRST.CLINNUM is true. Since the row is eliminated, the counter cannot be reset Callout 2. This does not happen when the WHERE statement Callout 1 is used because the BY groups are formed after the elimination of rows. As a result FIRST.CLINNUM will be available for testing and will not be eliminated inappropriately.

You are not constrained to using the WHERE with BY-group processing, but you must be careful. If we remove the subsetting IF statement from the DATA step and add the same logic to the line that counts the symptoms, the counting problem is corrected.

   if first.symp & symp ne ' ' then IFsympcnt+1;

Interestingly, a comparison of the WHERE statement and the corrected IF statement logic highlights another difference between the two approaches.

2.7.2 WHERE and BY Group Processing
Showing Counts Using the Corrected IF
Obs   CLINNUM   IFsympcnt   Wsympcnt
 1    031234        2           2
 2    033476        1           1
 3    036321        2           2
 4    038362        0           .
 5    043320        1           1
 6    046789        1           1
 7    049060        1           1

Notice that a clinic number (038362), which was not in the previous report, now appears. All observations for this clinic have SYMP=' '; consequently, it was completely removed from consideration by both the WHERE Callout 1 and the subsetting IF Callout 5 statements. Since the revised DATA step does not eliminate rows, and instead chooses which rows should be counted, this clinic now shows up with a count of 0. Of course you have to decide whether or not this clinic is appropriate for your report.

2.8 Appending Data Sets

There are several approaches that can be taken when appending two (or more) data sets. Each of these approaches has its own costs and capabilities. It is important for the programmer to understand the differences, similarities, and efficiencies of these techniques.

title '2.8 Appending Data Sets';
* Create a not so big data set;
data big;
   set sashelp.class
         (keep=name sex age height weight);
   where name > 'L';
   output big;
   * create a duplicate for Mary;
   if name=:'M' then output big; 
   run; 
data small;
   * The variable WEIGHT has been misspelled as WT;
   * The variables WT and HEIGHT are out of order;
   name='fred'; sex='m'; age=5;  wt=45; height=30;
   run;

The examples in this section use the data sets BIG and SMALL to build the data set BIGGER. The BIG data set, which is really only pretending to be big, is simply a portion of the familiar SASHELP.CLASS data set and SMALL is a single observation data set with nominally the same variables. Notice, however, that the SMALL data set has the variable WT instead of WEIGHT, and the order of WT and HEIGHT on the PDV has been reversed.

2.8.1 Appending Data Sets Using the DATA Step and SQL UNION

One of the simplest approaches for appending two data sets, if not the simplest, is through the use of the SET statement.

data bigger; 
   set big small;
   run;
2.8 Appending Data Sets
Using the SET Statement
Obs    Name       Sex    Age    Height    Weight    wt
 1     Louise      F      12     56.3        77      .
 2     Mary        F      15     66.5       112      .
 3     Mary        F      15     66.5       112      . Callout 2
 4     Philip      M      16     72.0       150      .
 5     Robert      M      12     64.8       128      .
 6     Ronald      M      15     67.0       133      .
 7     Thomas      M      11     57.5        85      .
 8     William     M      15     66.5       112      .
 9     fred        m       5     30.0         .     45 Callout 1

In this simplified example, a small transaction data set (SMALL) is appended onto a larger data set. The variable list in the new data set will be a union of those in the two data sets. Although this approach is sometimes necessary, it is unfortunately most commonly used by programmers who do not understand the operations that SAS must conduct in order to carry out the instructions. This DATA step will read and write each of the observations, one at a time, from the BIG data set before reading any of the observations from the SMALL data set. Since we are not doing anything with these observations (only reading and then writing them), this is not very efficient. However, it can be used to concatenate a very large number of data sets with a minimal amount of coding.

Variable attributes are determined from those in the left-most data set in which the variable appears, in this case the data set BIG. Variables unique to the SMALL data set will be added to the PDV on the right and subsequently to data set BIGGER. The order of the variables in data set SMALL is not important. Callout 1 Notice that the single observation contributed by data set SMALL is the last one in the listing, and the variable WT has been added last to the PDV for data set BIGGER. Callout 2 The duplicate observation for Mary correctly appears twice in data set BIGGER.
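When the left-most data set’s attributes are not the ones that are wanted, a LENGTH statement placed ahead of the SET statement establishes the attributes first (a sketch using the data sets from this section; the length of 10 is arbitrary):

data bigger;
   length name $10;   * NAME is now $10 rather than the $8 from BIG;
   set big small;
   run;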

data bigger; 
   set big 
       small(rename=(wt=weight));
   run;

A simple RENAME= data set option for the data set SMALL corrects this naming issue and the data sets are appended correctly.

proc sql noprint;
create table bigger as
   select *
      from  big
   union
   select *
      from small;
   quit;

The use of the SQL UNION clause is similar to the previous DATA step, in that all observations from both data sets must be read (in the case of SQL, they are read into memory first) and then written. This SQL step is more sensitive than the DATA step to variable order and type. In fact it is the order of the variables in the second table, and not the variable name, that determines which column from data set SMALL is matched to which column in data set BIG. This can have disastrous consequences.

2.8 Appending Data Sets
Using SQL UNION
Obs   Name      Sex   Age   Height   Weight
 1    Louise     F     12    56.3       77
 2    Mary       F     15    66.5      112 Callout 2
 3    Philip     M     16    72.0      150
 4    Robert     M     12    64.8      128
 5    Ronald     M     15    67.0      133
 6    Thomas     M     11    57.5       85
 7    William    M     15    66.5      112
 8    fred       m      5    45.0       30 Callout 3

This SQL UNION produces almost the same data set as the SET in the previous example; however, the differences are very important. In data set SMALL the variable HEIGHT has a value of 30 and WT is 45. The SQL UNION has ignored the names of the variables in data set SMALL and has appended their values onto the BIGGER data set using position alone. Callout 3

Notice also that the duplicate observation for Mary has been eliminated Callout 2. If the keyword ALL had been used with the SQL UNION operator (UNION ALL), the duplicate observation would not have been removed. The CORR keyword can also be used with SQL UNION (UNION CORR); it matches columns by name rather than by position and eliminates both duplicate records and any variables that are not common to both tables.
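A UNION CORR version of this step might be sketched as follows; columns are matched by name, WT (which is not common to both tables) is dropped, and, without the ALL keyword, the duplicate observation for Mary is still eliminated:

proc sql noprint;
create table bigger as
   select *
      from  big
   union corr
   select *
      from small;
   quit;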

proc sql noprint;
create table bigger as
   select *
      from  big
   union all Callout 6
   select Name,Sex,Age,Height Callout 4,
          wt as Weight Callout 5
      from small;
   quit;

In an SQL step we can duplicate the results of the DATA step’s SET statement by naming the incoming variables, while also renaming WT. Callout 4 Variable order is determined on the SELECT clause (Height has been listed before Weight to match the order in data set BIGGER), and Callout 5 WT is renamed to WEIGHT. Callout 6 Notice that to prevent the elimination of duplicate observations, the ALL keyword has been added to the UNION statement.

2.8.2 Using the DATASETS Procedure’s APPEND Statement

proc datasets library=work nolist;
   append base=big Callout 1 
          data=small Callout 2;
   quit;

The APPEND statement in PROC DATASETS is designed to efficiently append two data tables. The primary advantage of using PROC DATASETS’ APPEND statement is that it does not read any of the observations from the data set named with the BASE= option Callout 1. The second data set (DATA= option Callout 2) is read and appended to the first. Rather than creating a new data set, the BIG data set is replaced with the appended version.

APPEND assumes that both data tables have the same suite of variables and that those variables have the same attributes. The APPEND statement above fails because of the inconsistencies in the two PDVs:

  • NAME is $8 in BIG and $4 in SMALL
  • WT exists in SMALL, but is not found in BIG

Variables in the BASE= data set that are not in the DATA= data set do not create a problem.

Adding the FORCE option Callout 3 to the APPEND statement permits the process to take place despite the inconsistencies. The new version of the data set BIG will retain the same variables and variable attributes as were found on the first version of data set BIG. A listing of the new version of data set BIG shows that the single observation from data set SMALL has been added; however, its value for WT has been lost. A warning is issued to the LOG for each of the inconsistencies.

proc datasets library=work nolist;
   append base=big 
          data=small 
          force Callout 3;
   quit;
2.8 Appending Data Sets
Using the APPEND Statement
Obs   Name      Sex   Age   Height   Weight
 1    Louise     F     12    56.3       77
 2    Mary       F     15    66.5      112
 3    Philip     M     16    72.0      150
 4    Robert     M     12    64.8      128
 5    Ronald     M     15    67.0      133
 6    Thomas     M     11    57.5       85
 7    William    M     15    66.5      112
 8    fred       m      5    30.0        .

The functionality of the APPEND procedure, which is no longer documented, has been incorporated into the DATASETS procedure’s APPEND statement.

2.9 Finding and Eliminating Duplicates

When talking about duplicates, we need to be careful about our terminology. Duplicate observations are equivalent in all regards – the values of all the variables are the same. Observations with duplicate key variables (BY variables) may or may not be duplicate observations. Checks for duplicate key variables, such as the NODUPKEY option in PROC SORT, ignore all the variables that are not on the BY statement, and only compare values associated with the key variables.

The detection and elimination of duplicate observations can be very important, especially when dealing with data sets that should have no duplicates. There are several techniques for dealing with duplicate observations; however, they are not equally effective. It is also important to note that very often program authors are not as careful as they should be when distinguishing between duplicate observations and duplicate key values.

Obs   SUBJECT   VISIT        LABDT   potassium   sodium   chloride
  1     200       1     07/06/2006      3.7       14.0       103
  2     200       2     07/13/2006      4.9       14.4       106
  3     200       1     07/06/2006      3.7       14.0       103
  4     200       4     07/13/2006      4.1       14.0       103

This table shows the first few lines of the lab chemistry data. The variables SUBJECT and VISIT should form a unique key (they don’t) and there should be no duplicate observations (there are – see observations 1 and 3 above). This table will be used in the examples below.

SEE ALSO

Kohli (2006) reviews some of the techniques shown in this section, as well as discussing some others.

2.9.1 Using PROC SORT

The NODUPLICATES option (a.k.a. NODUPREC and NODUPS) on the PROC SORT statement is often used in the mistaken belief that it removes duplicate observations. Actually it will eliminate duplicate observations, but only up to a point. This option only eliminates duplicate observations if they fall next to each other after sorting. This means that if your key (BY) variables are insufficient to bring two duplicate observations next to each other, the duplicate will not be eliminated.

To be absolutely certain that all duplicates are eliminated, the BY statement must contain either a sufficient key or all the variables in the data set (_ALL_). This is generally not very practical and certainly not very efficient. I have found that if your data set contains a derived variable such as a statistic, for instance a mean or variance, including that variable on the BY statement is likely to create a sufficient key so that the NODUPLICATES option will indeed eliminate all duplicate observations.

The data set used in the examples in this section, ADVRPT.LAB_CHEMISTRY, has 169 observations. These include three pairs of duplicate observations and two more pairs of observations with duplicate key variable values (SUBJECT VISIT).

After using the NODUPLICATES option with an insufficient key, the LOG shows that only 5 duplicate observations were eliminated.

proc sort data=advrpt.lab_chemistry 
          out=none noduprec
          ;
   by subject;
   run;
NOTE: There were 169 observations read from the data set ADVRPT.LAB_CHEMISTRY.
NOTE: 5 duplicate observations were deleted.

Re-running the SORT step using all the variables in the data set shows, however, that 6 duplicate observations were eliminated.

proc sort data=advrpt.lab_chemistry 
           out=none nodup;
   by _all_;
   run;
NOTE: There were 169 observations read from the data set ADVRPT.LAB_CHEMISTRY.
NOTE: 6 duplicate observations were deleted.

The NODUPKEY option will successfully return a list of unique combinations of the BY variables, and does not suffer from any of the limitations of the NODUPLICATES option.
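A step of the following general form illustrates this (a sketch; the output data set names UNIQUE and DUPPED and the use of the DUPOUT= option here are illustrative). One observation is kept for each combination of SUBJECT and VISIT, and the eliminated observations are written to a second data set:

proc sort data=advrpt.lab_chemistry
          out=unique
          dupout=dupped
          nodupkey;
   by subject visit;
   run;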

MORE INFORMATION

Section 4.1.1 discusses in more detail the NODUPREC option and its associated inability to remove duplicate observations. This section also presents the NODUPKEY and DUPOUT= options.

2.9.2 Using FIRST. and LAST. BY-Group Processing

In the DATA step, checks for duplicate key fields can be implemented using the BY-group processing techniques known as FIRST. and LAST. processing. Because FIRST. and LAST. processing can only be used with variables listed in a BY statement, these checks are necessarily restricted to duplicates in the key fields.

title1 '2.9.2 Using FIRST. and LAST. Processing';
proc sort data=advrpt.lab_chemistry
                     (keep = subject visit labdt)
          out=labs;
   by subject visit; Callout 1
   run;
data dups;
   set labs;
   by subject visit; Callout 1
   if not (first.visit and last.visit); Callout 2
   run;
proc print data=dups;
   run;

The data are sorted and the same BY statement is used in both the SORT and DATA steps Callout 1. Inclusion of the BY statement automatically makes the FIRST. and LAST. temporary variables available for each variable on the BY statement. An observation with a unique set of key fields will necessarily be both the FIRST. and the LAST. observation for that combination of key variables. An observation that is not both FIRST. and LAST. Callout 2 will necessarily be non-unique.

Obs    SUBJECT    VISIT         LABDT
  1      200        1      07/06/2006
  2      200        1      07/06/2006
  3      200        4      07/13/2006
  4      200        4      07/13/2006
  5      200        7      08/04/2006
  6      200        7      08/04/2006
  7      200        9      09/12/2006 Callout 3
  8      200        9      09/13/2006 Callout 3
  9      200        9      09/13/2006
portions of the table not shown

The listing of the duplicates shows those observations that do not have unique values of their key variables. Since we have only shown three variables, we do not know if the entire observation is duplicated or not. Certainly for SUBJECT 200 Callout 3 the lab date (LABDT) indicates that while the key fields are not unique, the observations are not necessarily duplicates.

data unique;
   set labs(keep=subject visit);
   by subject visit;
   if first.visit;
   run;

Clearly this technique allows us to distinguish between unique and non-unique combinations of key variables, but it does not create a data set with an exhaustive list of the unique combinations. Building such a data set requires only a minor change to the DATA step. By changing the subsetting IF statement to select only the first (or last) occurrence of each BY-variable combination, we guarantee that each combination of the two BY variables appears exactly once.

MORE INFORMATION

The use of FIRST. and LAST. processing is described in more detail in Section 3.1.1.

SEE ALSO

The following SAS Forum thread contains examples of NODUPKEY, the DUPOUT option and the use of FIRST. and LAST. processing http://communities.sas.com/message/41965#41965.

2.9.3 Using PROC SQL

proc sql noprint; 
   create table nodups as 
      select distinct *
         from advrpt.lab_chemistry
            order by subject,visit;
   quit;

We can remove duplicate observations using an SQL step and the DISTINCT keyword. The asterisk in SELECT DISTINCT * indicates that all the variables are to be considered when looking for duplicates. Since SQL holds the entire data set in memory, all observations – not just adjacent ones – are compared. The resulting data set will not contain any duplicate observations. Adding an ORDER BY clause causes the new data set to be sorted.

If you only want to create a list of unique key values, adding a KEEP= option to the incoming data set in the FROM clause will restrict the variables that are checked by DISTINCT. The SQL set operators EXCEPT, INTERSECT, and UNION can also be used to return unique rows.
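A sketch of this approach (the output data set name KEYVALS is arbitrary):

proc sql noprint;
   create table keyvals as
      select distinct *
         from advrpt.lab_chemistry(keep=subject visit)
            order by subject,visit;
   quit;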

2.9.4 Using PROC FREQ

PROC FREQ can be used to build a data set of key variable combinations that are either already unique or already non-unique. It can also be used to create unique combinations of key variables.

proc freq data=advrpt.lab_chemistry;
   table subject*visit / noprint 
                         out=unique(where=(count=1));
   run;

This step does not eliminate duplicates; rather, like the first example in Section 2.9.2 (FIRST. and LAST. processing), it only selects combinations that are already unique. PROC FREQ creates a data set with one row for each combination of SUBJECT and VISIT. The variable COUNT indicates how often that combination appears in the data. Using COUNT in a WHERE clause allows us to select duplicated (COUNT>1) or unique (COUNT=1) combinations of the key variables (SUBJECT and VISIT).

proc freq data=advrpt.lab_chemistry;
   table subject*visit / noprint 
                         out=unique(keep=subject visit);
   run;

The default data set that is created by the TABLE statement contains a list of unique combinations of the key variables. Using the KEEP= option is a simple way to save this list of unique combinations and the result is sorted!

2.9.5 Using the Data Component Hash Object

The data component hash object can be used to eliminate both duplicate observations and duplicate key values. Because the hash object is loaded into memory it can be a fast alternative that does not require the incoming data set to be sorted.

An incoming data set can be loaded directly into the hash object or it can be loaded one observation at a time using one or more methods that have been designed to work with the data component objects.

Determining a Unique Key

In the following example the hash object is loaded from a SAS data set one observation at a time. Each observation is written to the object using the key variables and successive observations overwrite previous observations with the same combination of key values.

data _null_;
   if _n_=1 then do; Callout 1
      declare hash chem (ordered:'Y') ; Callout 2
      chem.definekey ('subject', 'visit'); Callout 3
      chem.definedata ('subject','visit','labdt'); Callout 4
      chem.definedone (); Callout 5
   end;
   set advrpt.lab_chemistry end=eof; Callout 6
   rc = chem.replace(); Callout 7
   if eof then chem.output(dataset:'nokeydups'); Callout 8
   run;

Callout 1 The hash table must be defined before it can be used. This DO block will only be executed once. It is at this time that the hash object is instantiated and its structure is defined. This IF statement and its associated overhead could be eliminated if the SET statement Callout 6 was placed within a special type of DO loop known as a DOW loop (introduced in Section 3.9.1 and used later in this section).

Callout 2 The CHEM hash table is established. The ORDERED: ‘Y’ option causes the table to be written in ascending order based on the values of the variables that have been named as key variables Callout 3.

Callout 3 The key variables for the CHEM hash table are defined using the DEFINEKEY method. These variables are used to determine how an observation is to be written to the hash table when the REPLACE method Callout 7 is executed. Notice the use of the dot notation to form the association with the specific hash table and its key variables.

Callout 4 The data variables that are to be transferred from the PDV to the hash table are specified. In this example we are interested in building a list of unique key variables, so LABDT is not needed. It has been included here to make it easier to see how the REPLACE method works (see Callout 7).

Callout 5 The definition of the CHEM hash table is complete.

Callout 6 Observations from the incoming data set are read sequentially using the SET statement.

2.9.5 Using the Hash Object
Eliminating Duplicate KEY Values
Obs    SUBJECT    VISIT         LABDT
  7      200         8     08/11/2006
  8      200         9     09/13/2006
  9      200        10     10/13/2006
 portions of the table not shown 

Callout 7 The REPLACE method is used to write the contents of the Program Data Vector to the CHEM hash object. If the current combination of key variable values already exists in the hash table, the values will be replaced – not added. Subject 200 has three observations for VISIT 9 (lab dates were 9/12 and 9/13 – see the table in Section 2.9.2). Because of the use of the REPLACE method, the observation with a lab date of 9/13 overwrote the one for 9/12.

Callout 8 After the last observation has been read, the contents of the CHEM hash table are written to the data set WORK.NOKEYDUPS.

Eliminating Duplicate Observations

With only a slight modification, the previous example can be used to eliminate duplicate observations rather than duplicate key values. The difference is in the definition of the key variables Callout 9.

data _null_;
   if _n_=1 then do;
      declare hash chem (ordered:'Y') ;
      chem.definekey ('subject', 'visit','labdt', 'sodium', 'potassium', 'chloride'); Callout 9
      chem.definedata('subject', 'visit','labdt', 'sodium', 'potassium', 'chloride') ;
      chem.definedone () ;
   end;
   set advrpt.lab_chemistry end=eof;
   rc = chem.replace();
   if eof then chem.output(dataset:'nodups');
   run;

Callout 9 The list of key variables has been expanded to include all the variables. Much like using the _ALL_ in the BY statement of PROC SORT (see Section 2.9.1), this forces the hash object to recognize and replace duplicate observations.

While the previous code does what we want it to do, it could be more efficient. There are two IF statements (Callout 1 and Callout 8) that are executed for every observation on the incoming data set, but each is true only once. We can eliminate both IF statements by using what is commonly referred to as a DOW loop (DO-Whitlock). Named for Ian Whitlock, who popularized the approach by demonstrating its advantages, this loop places the SET statement inside of a DO UNTIL loop.

data _null_;
   declare hash chem (ordered:'Y') ;
   chem.definekey ('subject', 'visit','labdt', 'sodium', 'potassium', 'chloride');
   chem.definedata('subject', 'visit','labdt', 'sodium', 'potassium', 'chloride') ;
   chem.definedone () ;
   do until(eof);
      set advrpt.lab_chemistry end=eof;
      rc = chem.replace();
   end;
   chem.output(dataset:'nodups');
   stop;
   run;

Because all of the incoming observations are read inside of the DO UNTIL loop, there is only one pass through the DATA step during execution. Strictly speaking the STOP statement is not necessary here, since we have already read the last observation from the incoming data set; as a general rule, however, the STOP statement provides insurance against infinite loops when using this approach.

If you want to eliminate duplicate observations you can take better advantage of the properties of the hash object. In the following DATA step the incoming data are loaded into the hash object directly using the DATASET: constructor on the DECLARE statement, and then written to the data set WORK.NODUPS using the OUTPUT method.

data _null_;
   length subject $3 Callout 10
          visit    8
          labdt    8
          sodium potassium chloride $12;
   declare hash chem (dataset:'advrpt.lab_chemistry', ordered:'Y') ;
      chem.definekey ('subject','visit','labdt','sodium','potassium','chloride');
      chem.definedata('subject','visit','labdt','sodium','potassium','chloride');
      chem.definedone () ;
   call missing(subject,visit,labdt, sodium, potassium,chloride);
   chem.output(dataset:'nodups');
   run;

Callout 10 Since there is no SET statement, the attributes of variables in the data set created by the OUTPUT method must be established. In this example the attributes are defined using the LENGTH statement. The CALL MISSING routine initializes the variables and assigns them missing values.

Using the LENGTH statement to set the variable attributes requires a certain level of knowledge about the incoming data set. We can avoid both the LENGTH statement and the CALL MISSING routine by taking advantage of the information that SAS already has.

In the following DATA step the SET statement will never be executed (the expression in the IF statement is false), so no observations are read; however, during DATA step compilation the attributes of the variables are loaded into the PDV. Since the last observation is not read from the incoming data set, the STOP statement is needed to close the implied loop created by the SET statement.

data _null_;
   if 0 then set advrpt.lab_chemistry(keep= subject visit labdt
                                            sodium potassium chloride);
   declare hash chem (dataset:'advrpt.lab_chemistry', ordered:'Y') ;
   chem.definekey ('subject', 'visit','labdt', 'sodium', 'potassium', 'chloride');
   chem.definedata('subject', 'visit','labdt', 'sodium', 'potassium', 'chloride') ;
   chem.definedone () ;
   chem.output(dataset:'nodups');
   stop;
   run;

MORE INFORMATION

DATA step component objects are discussed in more detail in Section 3.3.

SEE ALSO

Kohli (2006) includes a brief example of the use of the hash object to remove duplicate observations. Secosky and Bloom (2007) provide a nice introduction to DATA step component (HASH) objects.

2.10 Working with Missing Values

While even the most novice of SAS programmers is familiar with the general concept of working with missing values, most do not realize that there is a great deal more to the topic. To many programmers this seems like a simple topic; however, the ability to fully take advantage of missing values, both character and numeric, is essential when processing data within SAS.

MORE INFORMATION

The replacement of missing values in PROC TABULATE is discussed separately in
Section 8.1.1.

SEE ALSO

Humphreys (2006) includes a number of nice examples and explanations on the use of missing values.

2.10.1 Special Missing Values

Although we usually think of a period (.) as the symbol for a numeric missing value, there are actually 28 different numeric missing values. In addition to the period, which is most commonly used, numeric missing values can also be designated by preceding each of the 26 letters of the alphabet (a through z) and the underscore with a period. These different values can be used to distinguish between kinds of missing values, such as a dropped sample as opposed to a sample that was not taken.

When using these special missing values we need to know how to read them, how to use them, and how they will be displayed. In the following step two different missing values are read from a flat file. These are then processed as a part of an expression. Notice that the missing values are designated by following the period with the designation letter.

data ages;
input name $ age;
if age=.y then note='Too Young';
else if age=.f then note='Refused';
datalines;
Fred 15
Sally .f
Joe .y
run;
2.10.1 Missing Numerics
Obs    name     age      note
 1     Fred      15
 2     Sally      F    Refused
 3     Joe        Y    Too Young

Interestingly, when this data set is printed, the special missing value is displayed capitalized and without the period.

data ages;
missing f y;
input name $ age;
if age=.y then note='Too Young';
else if age=.f then note='Refused';
datalines;
Fred 15
Sally f
Joe y
run;

When the data are coded without the dot in front of the letter, the MISSING statement can be used to declare specific letters as special missing values. The dot is still used when designating the missing value in code.

There is a hierarchy associated with the 28 numeric missing values, and understanding this hierarchy can become critical when comparisons between them are made. In terms of size (sort order) the traditional missing value (.) is neither the smallest nor the largest of the 28 types of numeric missing values. The ._ is smallest, and is the only missing value smaller than (.). The largest numeric missing value is .z.

Suppose we want to subset for all valid dates in a data set. The WHERE clause or subsetting IF statement might be written as where date > .; However, this expression would only eliminate two of the 28 potential types of numeric missing values. In order to guarantee that all numeric missing values are eliminated, the expression should be written as where date > .z; Conversely, if you are searching for the smallest numeric value, remember that ._ is smaller than the traditional missing (.).
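The hierarchy can be demonstrated with a small test step (a sketch using the special missing value .y):

data _null_;
   date = .y;
   if date > .  then put 'Passes the > .  test';
   if date > .z then put 'Passes the > .z test';
   run;

Only the first PUT statement executes; .y is larger than the traditional missing value, but smaller than .z.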

MORE INFORMATION

The .z missing value is used in a subsetting example in Section 2.3.1. A user-defined informat is created to import special codes that need to be mapped to special numeric missing values in Section 12.5.3.

2.10.2 MISSING System Option

The MISSING system option allows you to specify a character to display in place of the period (.). Like all system option settings, once specified the replacement value remains in effect until the end of the SAS session or job, or until it is reset.

data showmiss;
input name $ age;
datalines;
Fred 15
Sally .f
Joe .
run;
options missing=X; Callout 1
title2 'MISSING Text is: X';
proc print data=showmiss;
run;
2.10.2 Using the MISSING System Option
MISSING Text is: X
Obs    name     age
 1     Fred      15
 2     Sally      F
 3     Joe        X Callout 2

The data set SHOWMISS has three observations and two missing values: the special missing value .f (see Section 2.10.1) and a standard missing value. The MISSING option does not change how a missing value is read or how it is used in an expression; however, it does change how the missing value is displayed. Callout 1 Here the MISSING system option is given the value of ‘X’ (the use of quotes is optional on the OPTIONS statement).

Examination of the PROC PRINT results shows that special missing values (.f) are not replaced; however, the missing value for Joe’s age Callout 2 has been replaced with an X.

Because you are limited to a single character when using the MISSING system option, it is often far more flexible to write and use a user-defined format to recode missing values (see Section 12.5.3).
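As a sketch of that approach (the format name MISSTXT and its labels are entirely illustrative), a user-defined format can assign a different label to each kind of missing value while leaving nonmissing values untouched:

proc format;
   value misstxt
      .f    = 'Refused'
      .     = 'Not Recorded'
      other = [best12.];
   run;
proc print data=showmiss;
   format age misstxt.;
   run;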

SEE ALSO

The SAS Forum thread found at http://communities.sas.com/message/57619#57619 discusses the use of the MISSING system option.

2.10.3 Using the CMISS, NMISS, and MISSING Functions

The NMISS function counts the number of its (numeric) arguments that have missing values, and the CMISS function does the same for arguments of either type, while the MISSING function detects whether or not its single argument (numeric or character) is missing.

data cntmiss;
infile cards missover;
input (a b c) ($1.) x y z;
nmisscnt = nmiss(x,y,z);
cmisscnt = cmiss(a,b,c);
missval  = missing(x+y+z);
datalines;
abc 1 2 3
de  3 4 .
    . . .
    1 2 .a
ghi
run;

This example uses the NMISS and CMISS functions to count the number of numeric and character missing values within each observation. The expression used as the argument for the MISSING function will return a 1 if any one of the values of X, Y, or Z are missing.

2.10.3 Using the NMISS, CMISS and MISSING Functions
Noticing Missing Values
Obs  a   b   c   x   y   z   nmisscnt  cmisscnt    missval
 1   a   b   c   1   2   3       0         0          0
 2               .   .   .       3         3          1
 3   d   e       3   4   .       1         1          1
 4               .   .   .       3         3          1
 5               1   2   A       1         3          1
 6   g   h   i   .   .   .       3         0          1

When you do not know the variable names or you just do not want to list them, the NMISS and CMISS functions can still be used. A variant of the following expression was suggested on a SAS Forum thread to perform this count.

tot_missing = nmiss( of _numeric_ Callout 1, 1 Callout 2 ) + cmiss( of _character_ Callout 1, 'a' Callout 3 ) - 1 Callout 4;

Callout 1 The _NUMERIC_ and _CHARACTER_ variable list abbreviations (see Section 2.6.1) are used instead of explicit variable lists.

Callout 2 A non-missing numeric constant has been added as an argument just in case there are no numeric variables.

Callout 3 A non-missing character constant has been added as an argument to prevent an error if there are no character variables.

Callout 4 The variable TOT_MISSING will always be missing at this point (unless its value is retained), therefore it will be counted by NMISS. Consequently we want to decrease the count by one.

MORE INFORMATION

The MISSING function is used in Section 2.2.6 to convert missing values to 0.

SEE ALSO

These two SAS Forum threads discuss missing value functions: http://communities.sas.com/message/57614

http://communities.sas.com/message/57624.

2.10.4 Using the CALL MISSING Routine

Unlike the MISSING function, which detects missing values, the CALL MISSING routine assigns missing values. The arguments to the MISSING routine can be numeric, character, or both. The arguments to this routine can also be variable lists, list abbreviations, and even entire arrays.

data annual(keep=year q: totsales);
   set sashelp.retail(keep=sales date year);
   by year;
   retain q1-q4 .;
   array annual {4} q1-q4;
   if first.year then call missing(of annual{*});
   annual{qtr(date)}=sales;
   if last.year then do;
      totsales=sum(of q:);
      output annual;
   end;
   run;

In the example shown here the CALL MISSING routine is used to clear the values of an array (ANNUAL) by setting them all to missing. Rather than using a DO loop to step through the array one element at a time to assign the missing values, the MISSING routine allows us to access the array values much more efficiently.

SEE ALSO

This example is used in a sasCommunity.org tip: http://www.sascommunity.org/wiki/Tips:Use_CALL_MISSING_to_Set_a_List_of_Variables_to_Missing.

The CALL MISSING routine is used to avoid uninitialized variable notes in the SAS Forum thread:
http://communities.sas.com/message/56784.

2.10.5 When Classification Variables Are Missing

Throughout SAS, when classification variables are missing, their associated observation is excluded from the analysis. This is true for procedures with explicit CLASS statements, such as PROC MEANS and PROC GLM, as well as for those with implicit classification variables, such as PROC FREQ and PROC REPORT. Sometimes this is the behavior that you want; however, often it is important that these observations not be removed. The MISSING option allows missing values to be valid levels of the classification variable.

The MISSING option can be used with most procedures that have either implicit or explicit classification variables. This option can be used on a CLASS statement or on the PROC statement. When used on the PROC statement the option applies to all the classification variables; however, when it is used on the CLASS statement it is only applied to those specific classification variables. In PROC FREQ the MISSING option can also be used as an option on the TABLES statement, and in PROC REPORT it can appear on the DEFINE statement.
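The two placements might be sketched as follows (assuming VISIT may be missing in ADVRPT.LAB_CHEMISTRY; the analysis variable is illustrative):

proc means data=advrpt.lab_chemistry n mean missing;
   class visit;
   var potassium;
   run;
proc freq data=advrpt.lab_chemistry;
   tables visit / missing;
   run;

In the PROC MEANS step the MISSING option appears on the PROC statement, so it applies to every classification variable; in the PROC FREQ step it applies only to the TABLES statement on which it appears.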

MORE INFORMATION

The MISSING option on a CLASS statement is discussed in Section 7.1.1.

2.10.6 Missing Values and Macro Variables

The macro language does not support the concept of a missing value. While a macro variable can take on the value of a blank or a period, these values are not treated as missing by the macro language. A macro variable can, however, take on a null value; that is, the macro variable can store nothing at all. This is not possible for variables in a SAS data set.

When working with null macro variables the syntax may at first look odd to the DATA step programmer.

%if &city = %then %do;

This %IF statement is considered to be standard syntax for comparing a macro variable (&CITY) to a null value. Since DATA step comparisons must have something on the right of the comparison operator, this form makes some macro programmers uneasy. Other methods for comparing against a null value include the use of a quoting function such as %STR. Since the macro variable can contain nothing the %LENGTH function can return a zero, and this can also be used to detect a null value in a macro variable.

%if &city = %str() %then %do;
%if %length(&city) = 0 %then %do;

2.10.7 Imputing Missing Values

There are a number of techniques that have been proposed for imputing missing values in a data set. These include various schemes using spline fitting techniques, which can be found in the SAS/GRAPH procedure G3GRID. The SAS/ETS procedure EXPAND and the SAS/STAT procedure STDIZE can also estimate missing values. Of these, however, only PROC MI (Multiple Imputation) has the primary objective of imputing missing values and is by far the most sophisticated.

This procedure works well; however, there is a caveat of which the user should be aware. Since the procedure calculates values based on the values of related variables it can be sensitive to changes in the order of the data. The PROC MI results may change, although usually not by a lot, just by changing the sort order of the data. And mere changes in the order of the variables in the VAR statement can also result in minor changes to the imputed values even with a fixed SEED value.

Neither of these situations is alarming, unless you encounter them and are not expecting them.
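A minimal PROC MI step might be sketched as follows (the SEED= value and OUT= data set name are arbitrary, and the VAR list assumes that the lab measurements are stored as numeric variables):

proc mi data=advrpt.lab_chemistry seed=12321 out=imputed;
   var potassium sodium chloride;
   run;

Because the SEED= option fixes the random number stream, rerunning this step against the same data, in the same order, reproduces the same imputed values.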
