Chapter 13: Working with Arrays

13.1  Introduction

13.2  Setting Values of 999 to a SAS Missing Value for Several Numeric Variables

13.3  Setting Values of NA and ? to a Missing Character Value

13.4  Converting All Character Values to Propercase

13.5  Using an Array to Create New Variables

13.6  Changing the Array Bounds

13.7  Temporary Arrays

13.8  Loading the Initial Values of a Temporary Array from a Raw Data File

13.9  Using a Multidimensional Array for Table Lookup

13.10  Problems

 

13.1  Introduction

Cody’s rule of SAS programming goes something like this: if you are writing a SAS program, and it is becoming very tedious, stop. There is a good chance that there is a SAS tool, perhaps arrays, functions, or macros, that will make your task less tedious.

To the beginning programmer, arrays can be a bit frightening—to an experienced programmer, arrays can be a huge time saver. So, let’s get you over the fright and into saving time.

First of all, what is an array? SAS arrays are a collection of elements (usually SAS variables) that allow you to write SAS statements referencing this group of variables.

Note: SAS arrays are different from arrays in many other programming languages. They do not hold values, and they allow you to refer to a collection of SAS variables in a convenient manner.

It is much easier to understand what an array is by a few simple examples. So, here goes.

13.2  Setting Values of 999 to a SAS Missing Value for Several Numeric Variables

Typically, arrays are used to perform a similar operation on a group of variables. In this example, you have a SAS data set called SPSS that contains several numeric variables. The folks that created this data set used a value of 999 whenever there was a missing value. (Some statistical packages such as SPSS allow you to substitute a missing value for a specific value. It is common to use values such as 999 or 9999 to represent this missing value.)

The following is a program that solves this problem without using arrays:

Program 13.1: Converting Values of 999 to a SAS Missing Value—Without Using Arrays

  data New;

     set Learn.SPSS;

     if Height = 999 then Height = .;

     if Weight = 999 then Weight = .;

     if Age    = 999 then Age    = .;

  run;

Notice that you are writing the same SAS statement several times—the only thing that is changing is the name of the variable. You would like to be able to write something like this:

if Height, Weight, or Age = 999 then

   Height, Weight or Age = .;

This is pretty much how an array works. Let’s first see the program using arrays, and then we’ll go through the explanation:

Program 13.2: Converting Values of 999 to a SAS Missing Value—Using Arrays

  data New;

     set Learn.SPSS;

     array Myvars{3} Height Weight Age;

     do i = 1 to 3;

        if Myvars{i} = 999 then Myvars{i} = .;

     end;

     drop i;

  run;

The first thing you may notice is that the program with arrays is longer than the one without arrays! However, if you had 50 or 100 variables to process, the program using arrays would be the same length as this program.

The ARRAY statement is used to create the array. Following the keyword ARRAY is the name you choose for your array. Array names follow the same rules you use for SAS variables. In this example, you chose the name MYVARS as the array name. Following the array name, you place the number of elements (in this example, the three variables) in brackets. Finally, you list the variables you want to include in the array. You may use any of the SAS shorthand methods for referring to a list of variables here, such as Var1–Varn. This list of variables must be all numeric or all character—you cannot mix them.

You may also use square brackets [] or parentheses () following the array name to specify the number of elements in the array. SAS documentation usually uses curly brackets {}. We recommend that you always use the same type of brackets (either straight or curly) when you use arrays. The reason—SAS function names are always followed by a set of parentheses, so it might be difficult to know if a program is making an array reference or using a function if you use parentheses in your array references.

Once you have defined your array, you can use an array reference in place of a variable name anywhere in the DATA step. For example, MYVARS{2} can be used in place of the variable Weight. The number in the brackets following the array name is called a subscript, even though it is not true subscript notation.

By placing the array in a DO loop, you can process each variable in the array. Also, because you do not need or want the DO loop counter included in the SAS data set, you use a DROP statement.

Let’s “play computer” and follow the logic of this program. At the top of the DATA step, the SET statement brings in an observation from data set SPSS. Next, the DO loop counter starts at 1. The IF statement references the array element MYVARS{1}, which is equivalent to the variable Height. SAS checks if the value of Height is equal to 999 and, if so, replaces it with a SAS missing value.

The DO loop continues in the same way for all three array elements. When the DO loop finishes, you are at the bottom of the DATA step. An observation is written out to the NEW data set and control returns to the top of the DATA step, where the next observation from data set SPSS is read. This process continues until there are no more observations to read from the SPSS data set.

The best way to get started writing arrays is to first write a few lines of code without an array. Next, write an ARRAY statement where the array elements are all the variables you want to process. Next, use one of the sample lines as a template and write that line, substituting your array name (with a subscript of your choice) for the variable name. Finally, place this line (or lines) of code inside a DO loop. (Don’t forget to drop the DO loop counter.) You are done.

13.3  Setting Values of NA and ? to a Missing Character Value

As we mentioned in the introduction, SAS arrays must contain all numeric or all character variables. If the variables you want to include in a character array are already defined as character (for example, they are coming from a SET statement), you can write an ARRAY statement resembling the one in the previous section. However, if you want the array to contain new variables, you need to include a dollar sign ($) and, optionally, a length when you define the array. For example, to create an array of character variables Q1–Q20, each with a length of 2 bytes, you would write the following:

array Mychars{20} $ 2 Q1-Q20;

It is a good idea to include the dollar sign ($) in every character array, even if the array variables have been previously defined as character.

As an example, the following program uses a character array to convert all values of NA (Not Applicable) or question mark (?) to a SAS missing value. Suppose you are given a SAS data set Chars and you want to create a new data set named Missing with these changes. Here is the program:

 

Program 13.3: Converting Values of NA and ? to a Missing Character Value

  data Missing;

     set Learn.Chars;

     array Char_Vars{*} $ _character_;

     do Loop = 1 to dim(Char_Vars);

        if Char_Vars{Loop} in ('NA','?') then

        Char_Vars{Loop} = ' ';

     end;

     drop Loop;

  run;

This program introduces a number of new features. First, an asterisk is used in place of the number of elements in the array. You can always use an asterisk here if you don’t know how many variables are in the array (as may be the case here where you haven’t counted the number of character variables in the Chars data set). SAS counts for you and computers are better at counting than people, anyway. Next, the keyword _CHARACTER_ is used as the variable list. Because this statement follows the SET statement, _CHARACTER_ includes all the character variables in the Chars data set. _CHARACTER_ references character variables that are present in the PDV at that point in the DATA step. For example, if you define character variables A, B, and C in the first three lines of a DATA step and then use the reference _CHARACTER_ followed by defining character variables D and E, only variables A, B, and C are referenced by _CHARACTER_. In Program 13.3, if you place the ARRAY statement before the SET statement, the array will not reference any variables.

Because you used an asterisk in place of the number of elements in the array, what value do you use for the upper bound of the DO loop? Luckily, the DIM function comes to the rescue. It returns the number of elements in an array. Finally, an IF statement checks for the values of NA or ? and sets them equal to a SAS character missing value.

13.4  Converting All Character Values to Propercase

If you have data (either a raw data file or a SAS data set) where the data entry folks were careless (or didn’t set standards ahead of time), you may have a hodgepodge of upper-, lower-, or propercase values. This section describes a simple program to convert all character values to propercase.

First, here is a listing of data set Careless:

Figure 13.1: Listing of Data Set Careless

Figure 13.1: Listing of Data Set Careless

 

The following program converts all the character variable to propercase:

Program 13.4: Converting All Character Values in a SAS Data Set to Propercase

  data Proper;

     set Learn.Careless;

     array All_Chars{*} _character_;

     do i = 1 to dim(All_Chars);

        All_Chars{i} = propcase(All_Chars{i});

     end;

     drop i;

  run;

The logic of this program is similar to the previous program. You use the keyword _CHARACTER_ to reference all the character variables in data set Careless, and then use a DO loop to convert all the values to Propercase.

Data set Proper, with all the character values in propercase, is shown here:

Figure 13.2: Listing of Data Set Proper

Figure 13.2: Listing of Data Set Proper

13.5  Using an Array to Create New Variables

You can include variables in an ARRAY statement that do not yet exist in your SAS data set. For example, if your SAS data set had variables Fahren1–Fahren24 containing 24-Fahrenheit temperatures, you could use an array to create 24 new variables (say Celsius1–Celsius24) with the Celsius equivalents. Here is a program that accomplishes this:

Program 13.5: Using an Array to Create New Variables

  data Temp;

     input Fahren1-Fahren24 @@;

     array Fahren[24];

     array Celsius[24] Celsius1-Celsius24;

     do Hour = 1 to 24;

        Celsius{Hour} = (Fahren{Hour} - 32)/1.8;

     end;

     drop Hour;

  datalines;

  35 37 40 42 44 48 55 59 62 62 64 66 68 70 72 75 75

  72 66 55 53 52 50 45

  ;

The variables Celsius1–Celsius24 are created by the ARRAY statement. Inside the DO loop, you convert each of the 24-Fahrenheit temperatures to Celsius. Data set Temp contains all 24 Fahrenheit and 24-Celsius temperatures. You may wonder where the variable list went in the FAHREN array. If you omit a variable list in an ARRAY statement and you include the number of elements following the array name, SAS automatically creates variable names for you, using the array name as the base and adding the numbers from 1 to n, where n is the number of elements in the array. In this program, SAS creates the variables Fahren1–Fahren24. You could have used this feature for the Celsius array as well.

13.6  Changing the Array Bounds

By default, SAS numbers the elements of an array starting from 1. There are times when it is useful to specify the beginning and ending values of the array elements. For example, if you have variables Income2010 to Income2017, it would be nice to have the array elements start with 2010 and end with 2017.

The program that follows creates an array of the eight Income values, using the values of 2010 and 2017 as the array bounds, and computes the taxes for each of the eight years:

Program 13.6: Changing the Array Bounds

  data Account;

     input ID Income2010-Income2017;

     array Income{2010:2017} Income2010-Income2017;

     array Taxes{2010:2017} Taxes2010-Taxes2017;

     do Year = 2010 to 2017;

        Taxes{Year} = .25*Income{Year};

     end;

     format Income2010-Income2017

            Taxes2010-Taxes2017 dollar10.;

  datalines;

  001 45000 47000 47500 48000 48000 52000 53000 55000

  002 67130 68000 72000 70000 65000 52000 49000 40100

  ;

In this program, you specify the lower and upper bounds in the brackets following the array name and separate them with a colon.

13.7  Temporary Arrays

You can create an array that only has elements and no variables! As strange as this sounds, elements of temporary arrays are great places to store values or perform table lookups. If you want, you can assign the array elements initial values when you create the temporary array. Alternatively, you can load values into the temporary array in the DATA step. Either way, the values in the temporary array are automatically retained (that is, they are not set to missing values when the DATA step iterates). Thus, they are useful places to store values that you need during the execution of the DATA step.

We start out with an example that uses a temporary array to store the correct answer for each of 10 questions on a multiple-choice quiz. You can then score the quiz using the temporary array as the answer key. Here is the program:

Program 13.7: Using a Temporary Array to Score a Test

  data Score;

     array Ans{10} $ 1;

     array Key{10} $ 1 _temporary_

        ('A','B','C','D','E','E','D','C','B','A');

     input ID (Ans1-Ans10)($1.);

     RawScore = 0;

     do Ques = 1 to 10;

        RawScore + (key{Ques} eq Ans{Ques});

     end;

     Percent = 100*RawScore/10;

     keep ID RawScore Percent;

  datalines;

  123 ABCDEDDDCA

  126 ABCDEEDCBA

  129 DBCBCEDDEB

  ;

This program uses a temporary array (Key) to hold the answers to the 10 quiz questions. The keyword _TEMPORARY_ tells SAS that this is a temporary array and the 10 values in parentheses are the initial values for each of the elements of this array. It is important to remember that there are no corresponding variables (Key1, Key2, and so on) in this DATA step. Also, because elements of a temporary array are retained, the 10 answer key values are available throughout the DATA step for scoring each of the student tests.

The scoring is done in a DO loop. A “trick” is used to do the scoring: a logical comparison is performed between the student answer and the corresponding answer key. If they match, the logical comparison returns a 1 and this is added to RawScore. If not, the result is a 0 and RawScore is not incremented. Here is a listing of the output:

Figure 13.3: Listing of Data Set Score

Figure 13.3: Listing of Data Set Score

13.8  Loading the Initial Values of a Temporary Array from a Raw Data File

If you had a long test, you would probably prefer to load the answer key into the array elements by reading the values from a text file (especially if you are scoring the tests using an optical mark-sense reader). In the program that follows, you enter the answer key values on the first line of your data file and the IDs and student answers on the remaining lines (2 through n):

Program 13.8: Loading the Initial Values of a Temporary Array from a Raw Data File

  data Score;

     array Ans{10} $ 1;

     array Key{10} $ 1 _temporary_;

     /* Load the temporary array elements */

     if _n_ = 1 then do Ques = 1 to 10;

        input key{Ques} $1. @;

     end;

 

     input ID (Ans1-Ans10)($1.);

     RawScore = 0;

     /* Score the test */

     do Ques = 1 to 10;

        RawScore + (key{Ques} eq Ans{Ques});

     end;

     Percent = 100*RawScore/10;

     keep ID RawScore Percent;

  datalines;

  ABCDEEDCBA

  123 ABCDEDDDCA

  126 ABCDEEDCBA

  129 DBCBCEDDEB

  ;

Because you want to read the first line of the data file differently from the other lines, the value of _N_ (which counts iterations of the DATA step) can be used to ensure that the answer key values are read only once.

The INPUT statement that reads the student ID and student answers uses a form of input that may be new to you. You place the list of variables in a set of parentheses and an informat that you want to use to read these variables in a second set of parentheses. (See Chapter 21 for information about using variable and informat lists.)

The result of running this DATA step is identical to the data set you obtained from Program 13.7.

13.9  Using a Multidimensional Array for Table Lookup

SAS arrays can also be multidimensional. Instead of having a single index to identify elements of an array, you can, for example, use two indices (usually thought of as row and column indices) to identify an element. This is particularly useful when you want to retrieve a single value based on two selection criteria.

To define a multidimensional array, you specify the number of elements in each dimension in the brackets following the array name, separated by commas. For example, to define an array called MULTI with three elements on the first dimension and five elements on the second dimension, you could use:

array Multi{3,5} X1-X15;

To determine the number of elements in a multidimensional array, you multiply the number of elements in each dimension. In this example, the 3 by 5 array has 3 times 5 equals 15 elements.

The following table shows the benzene levels for each year (from 1944 to 1949) and the job codes (A through E) at a rubber factory. You want to retrieve a benzene level, given a year and job code. You can see a solution using formats in Chapter 22 of this book. Here is an array solution.

To start, you need to create a two-dimensional array with one index representing the year and the other the job code. To make matters more convenient, you can make the index values for the years range from 1944 to 1949 rather than from 1 to 6. You also need to decide if you want to populate the array with the benzene values as part of the ARRAY statement or if you want to read those values from raw data or, perhaps, a SAS data set. Finally, you can use a regular array or a temporary array.

The solution shown here loads the array from raw data and uses a temporary array to hold the benzene values.

You start with a data set (Expose) that holds a worker ID, the year worked, and the job code. Given this information, you want to look up the worker’s benzene exposure.

A listing of data set Expose is shown here:

Figure 13.4: Listing of Data Set Expose

Figure 13.4: Listing of Data Set Expose

Here is the table of benzene exposures by year and job code.

 

Job Code

Year

A

B

C

D

E

1944

220

180

210

110

90

1945

202

170

208

100

85

1946

150

110

150

60

50

1947

105

56

88

40

30

1948

60

30

40

20

10

1949

45

22

22

10

8

 

 

The first step is to load a temporary array with these values, as follows:

Program 13.9: Loading a Two-Dimensional, Temporary Array with Data Values

  data Look_Up;

     /******************************************************

        Create the array, the first index is the year and

        it ranges from 1944 to 1949. The second index is

        the job code (we're using 1-5 to represent job codes

        A through E).

     *******************************************************/

     array Level{1944:1949,5} _temporary_;

     /* Populate the array */

     if _n_ = 1 then do Year = 1944 to 1949;

        do Job = 1 to 5;

           input level{Year,Job} @;

        end;

     end;

     set Learn.Expose;

     /* Compute the job code index from the JobCode value */

     Job = input(translate(Jobcode,'12345','ABCDE'),1.);

     Benzene = level{Year,Job};

     drop Job;

  datalines;

  220 180 210 110 90

  202 170 208 100 85

  150 110 150 60 50

  105 56 88 40 30

  60 30 40 20 10

  45 22 22 10 8

  ;

There is a lot going on in this program so let’s take it one step at a time. The key is the two-dimensional array (LEVEL). The dimensions of the array are defined by the comma in the brackets following the array name. You can think of the first dimension as a row and the second dimension as a column in a table. Each row of raw data following the DATALINES statement represents a different year (starting from 1944) and each of the five columns represents the values for job codes A through E.

Because you want to populate the array only once, you execute the nested DO loops when _N_ is equal to 1. As mentioned earlier, the index values of the first dimension of the array range from 1944 to 1949. This saves you the trouble of computing the correct row value for a given year.

Finally, you use the keyword _TEMPORARY_ to declare the array to be temporary. This has several advantages. First, you don’t have to drop (or maintain in the PDV) 30 variables for each of the array elements. Next, these values are automatically retained so they are available for the duration of the DATA step. Finally, using temporary arrays is very efficient. They require less storage than regular variables, and all the values are stored in memory for rapid retrieval.

The only remaining problem is that the JobCode variable is a letter from A to E. You use the TRANSLATE function (see Chapter 12, section 16 for more details) to convert each of the letters A to E to the character values 1 to 5. You then use the INPUT function to do the character-to-numeric conversion.

To look up any benzene level, you simply obtain the array value corresponding to the Year and JobCode (converted to a number) values.

Here is a listing of data set Look_Up:

Figure 13.5: Listing of Data Set Look_Up

Figure 13.5: Listing of Data Set Look_Up

SAS arrays are powerful and flexible. You can change array bounds and create multidimensional arrays. Temporary arrays provide a convenient place to store values for efficient table lookup.

13.10  Problems

Solutions to odd-numbered problems are located at the back of this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion website at support.sas.com/cody for information about how to obtain the solutions to all problems.

1.       Using the permanent SAS data set Survey1, create a new, temporary SAS data set (Survey1) where the values of the variables Ques1–Ques5 are reversed as follows:

1 à 5; 2 à 4; 3 à 3; 4 à 2; 5 à 1.

Note: Ques1–Ques5 are character variables. Accomplish this using an array.

2.      Redo Problem 1, except use data set Survey2.

Note: Ques1–Ques5 are numeric variables.

3.       Using the SAS data set Nines, create a new temporary SAS data set (Nonines) where all values of 999 are replaced by SAS missing values. Do this without explicitly naming the numeric variables in data set Nines (use _NUMERIC_ when you define your array).

4.        Data set Survey2 has five numeric variables (Q1–Q5), each with values of 1, 2, 3, 4, or 5. You want to determine for each subject (observation) if they responded with a 5 on any of the five questions. This is easily done using the OR or the IN operators. However, for this question, use an array to check each of the five questions. Set variable (ANY5) equal to Yes if any of the five questions is a 5 and No otherwise.

5.       The passing score on each of five tests is 65, 70, 60, 62, and 68. Using the data here, use a temporary array to count the number of tests passed by each student.

ID

Test 1

Test 2

Test 3

Test 4

Test 5

001

90

88

92

95

90

002

64

64

77

72

71

003

68

69

80

75

70

004

88

77

66

77

67

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset