Chapter 11: Working with Numeric Functions

11.1  Introduction

11.2  Functions That Round and Truncate Numeric Values

11.3  Functions That Work with Missing Values

11.4  Setting Character and Numeric Values to Missing

11.5  Descriptive Statistics Functions

11.6  Computing Sums within an Observation

11.7  Mathematical Functions

11.8  Computing Some Useful Constants

11.9  Generating Random Numbers

11.10  Special Functions

11.11  Functions That Return Values from Previous Observations

11.12  Sorting Within an Observations—a Game Changer

11.13  Problems

 

11.1  Introduction

SAS functions are essential tools in DATA step programming. They perform such tasks as rounding numbers, computing dates from month-day-year values, summing and averaging the values of SAS variables, and hundreds of other tasks. This chapter focuses on functions that primarily operate on numeric values—the next chapter describes some of the functions that work with character data. Date functions were discussed separately in Chapter 9.

There is so much to say about SAS functions that one could write a whole book on the subject. As a matter of fact, someone did. Please check out SAS Functions by Example, 2nd edition by this author. Information about SAS functions can be found on the SAS Help Center at http://go.documentation.sas.com. These two chapters touch only the surface of the enormous power of SAS functions.

11.2  Functions That Round and Truncate Numeric Values

Three of the most useful functions in this category are the ROUND, INT, and CEIL functions. As the names suggest, ROUND is used to round numbers, either to the nearest integer or to other values such as 10ths or 100ths. The INT function returns the integer portion of a numeric value. (That is different from Ents, which are talking trees in Hobbit land.) The CEIL function (stands for ceiling) returns the next highest integer.

As an example, suppose you have raw data values on age (in years) and weight (in pounds). You want a data set with age as of the last birthday (that is, throw away any fractional part of a year) and weight in pounds, rounded to the nearest pound. In addition, you also want to compute weight in kilograms, rounded to the nearest 10th. Here is the program:

Program 11.1: Demonstrating the ROUND and INT Truncation Functions

  data Truncate;

     input Age Weight Cost;

     Age = int(Age);

     WtKg = round(Weight/2.2, .1);

     Weight = round(Weight);

     Next_Dollar = Ceil(Cost);

  datalines;

  18.8 100.7 98.25

  25.12 122.4 5.99

  64.99 188 .0001

  ;

The ROUND function is used twice in this program. When it is used to round the WtKg values, it has two arguments, separated by a comma. The first argument is the value to be rounded—the second argument is the round-off unit. Typical values for round-off units are .1, .01, .001, and so forth. However, you can use other values here as well. For example, a round-off unit of 2 rounds values to the nearest even number—a value of 100 rounds a value to the nearest 100. When ROUND is used with the Weight variable, there is no second argument. Therefore, the default action—rounding to the nearest whole number—occurs. Finally, the CEIL function returns the next highest integer.

Here is a listing of the resulting data set:

Figure 11.1: Listing of Data Set Truncate

Figure 11.1: Listing of Data Set Truncate

Notice that the fractional part of all the age values has been dropped, the WtKg values are rounded to the nearest 10th and the Weight values are rounded to the nearest whole number. In this program, it was important to compute the value of WtKg before Weight was rounded. Each value of Cost was “rounded up” to the next highest integer (even the value of .0001 returned a value of 1).

11.3  Functions That Work with Missing Values

In SAS, you can refer to a missing numeric value with a period and a missing character value by a single blank (inside single or double quotes). For example, take a look at the short program here that tests to see if several variables contain a missing value:

Program 11.2: Testing for Missing Numeric and Character Values (without the MISSING Function)

  data Test_Miss;

     set Learn.Blood;

     if Gender = ' ' then MissGender + 1;

     if WBC = . then MissWBC + 1;

     if RBC = . then MissWBC + 1;

     if Chol lt 200 and Chol ne . then Level = 'Low ';

     else if Chol ge 200 then Level = 'High';

  run;

Instead of using separate values for a missing numeric and character value, you can use the MISSING function instead. This function returns a value of true if its argument is a missing value and false otherwise. The amazing thing about this function is that the argument can be either character or numeric—a real rarity among SAS functions. You can rewrite Program 11.2 using the MISSING function like this:

Program 11.3: Demonstrating the MISSING Function

  data Test_Miss;

     set Learn.Blood;

     if missing(Gender) then MissGender + 1;

     if missing(WBC) then MissWBC + 1;

     if missing(RBC) then MissWBC + 1;

     if Chol lt 200 and not missing(Chol) then

        Level = 'Low ';

     else if Chol ge 200 then Level = 'High';

  run;

Besides making programs easier to read, the MISSING function also returns a value of true for the alternate numeric values (.A, .B,.Z, and ._). These alternative numeric missing values can be useful when you have different categories of missing information.

For example, you could assign .A to did not answer and .B to Not applicable. SAS treats all of these values as missing for numeric calculations but allows you to distinguish among them.

11.4  Setting Character and Numeric Values to Missing

The CALL routine CALL MISSING is a handy way to set one or more character and/or numeric variables to missing. When you use a variable list such as X1–X10 or Char1–Char5, you need to precede the list with the keyword OF. Here are some examples:

 

Examples of the CALL MISSING Routine

call missing(X,Y,Z,of A1–A10);

X, Y, and Z are numeric and A1–A10 are character. X, Y, and Z are set to a numeric missing value; A1–A10 are set to a character missing value.

call missing(of X1–X10);

X1–X10 are numeric and are set to a numeric missing value.

call missing(of _all_);

All variables defined up to the point of the function call are set to missing.

call missing(of X1–X5,of Y1–Y5);

X1–X5 and Y1–Y5 are all numeric and are set to a numeric missing value.

For those readers interested in the details, the following is a brief discussion on the difference between SAS functions and CALL routines.

A SAS function can only return a single value and, typically, the arguments do not change their value when the function is executed. A CALL routine, unlike a function, cannot be used in an assignment statement. Also, the arguments used in a CALL routine can change their value after the call executes. Therefore, if you want to retrieve more than a single value or, as in the CALL MISSING examples above, setting all of the arguments to a missing value, you need to use a CALL routine instead of a function.

11.5  Descriptive Statistics Functions

One group of SAS functions is called descriptive statistics functions. While these functions are capable of computing statistical results such as means and standard deviations, many functions in this category provide extremely useful non-statistical tasks.

Suppose you want to score a psychological test and the scoring instructions state that you should take the average (mean) of the first 10 questions (labeled Q1–Q10). This calculation should be performed only if there are seven or more nonmissing values. In addition, you want to identify the two questions that resulted in the highest and lowest scores, respectively. Here is a program that performs all of these tasks:

Program 11.4: Demonstrating the N, MEAN, MIN, and MAX Functions

  data Psych;

     input ID $ Q1-Q10;

     if n(of Q1-Q10) ge 7 then Score = mean(of Q1-Q10);

     MaxScore = max(of Q1-Q10);

     MinScore = min(of Q1-Q10);

  datalines;

  001 4 1 3 9 1 2 3 5 . 3

  002 3 5 4 2 . . . 2 4 .

  003 9 8 7 6 5 4 3 2 1 5

  ;

What seemed like a difficult problem was solved in just three lines of SAS code, using some of the descriptive statistics functions. The N function returns the number of nonmissing numeric values among its arguments. As with all of the functions in this category, you must precede any list of variables in the form Var1–Varn with the keyword OF. Without this, SAS assumes that you want to subtract the two values. If there are seven or more nonmissing values, you use the MEAN function to compute the mean of the nonmissing values. Otherwise, Score will be a missing value. The MEAN function, as well as the other functions discussed in this section, ignores missing values. For example, subject 001 has nine nonmissing values so the mean is computed by adding up the nine values and dividing by 9.

The MAX and MIN functions return the largest and smallest (nonmissing) value of its arguments. Here is a listing of data set Psych:

Figure 11.2: Listing of Data Set Psych

Figure 11.2: Listing of Data Set Psych

Another function similar to the N function is NMISS. This function returns the number of missing values in the list of variables.

What if you want the second or third largest value in a group of values? The LARGEST function enables you to extract the nth largest value, given a list of variables. For example, to find the sum of the three largest scores in the Psych data set, you could use the SUM and LARGEST functions together like this:

Program 11.5: Finding the Sum of the Three Largest Values in a List of Variables

  data Three_Large;

     set Psych(keep=ID Q1-Q10);

     SumThree = sum(largest(1,of Q1-Q10),

                    largest(2,of Q1-Q10),

                    largest(3,of Q1-Q10));

  run;

The first argument of the LARGEST function tells SAS which value you want—1 gives you the largest value, 2 gives you the second largest, and so forth. This function ignores missing values, the same as the other descriptive statistics functions.

Note: LARGEST(1, variable-list) is the same as MAX(variable-list).

Here is a listing of Three_Large:

Figure 11.3: Listing of Data Set Three_Large

Figure 11.3: Listing of Data Set Three_Large

Are you surprised to know that SAS also has a SMALLEST function? SMALLEST(1, variable-list) gives you the smallest nonmissing value in a list of variables (equal to the result of the MIN function); SMALLEST(2, variable-list) gives you the second smallest value, and so on.

11.6  Computing Sums within an Observation

One way to compute the sum of several variables is to write a statement such as this:

SumCost = Cost1 + Cost2 + Cost3;

What if one of the Cost values is missing? This causes the sum to be missing. If you want to ignore missing values, you can either write some DATA step logic, or simply use the SUM function, like this:

SumCost = sum(of Cost1–Cost3);

If one or two of the Cost values are missing, SumCost is the sum of the nonmissing values. If all three Cost values are missing, the SUM function returns a missing value. If you want the sum to be 0 if all the arguments are missing, a nice trick is to include a 0 in the list of arguments, like this:

SumCost = sum(0, of Cost1–Cost3);

You can also see how useful the SUM function is when you have a large number of variables to sum. Here is an example.

You have a SAS data set (EndofYear) containing Pay1–Pay12 and Extra1–Extra12. You want to compute the sum of these 24 values. In addition, you want the sum to be 0 if all 24 values are missing. Here is the program:

Program 11.6: Using the SUM Function to Compute Totals

  data Sum;

     set Learn.EndofYear;

     Total = sum(0, of Pay1-Pay12, of Extra1-Extra12);

  run;

11.7  Mathematical Functions

We'll start out with four straightforward functions: ABS (absolute value), SQRT (square root), EXP (exponentiation), and LOG (natural log). The following program demonstrates these four functions—a short explanation follows:

Program 11.7: Demonstrating the ABS, SQRT, EXP, and LOG Functions

  data Math;

     input x @@;

     Absolute = abs(x);

     Square = sqrt(x);

     Exponent = exp(x);

     Natural = log(x);

  datalines;

  2 -2 10 100

  ;

To take an absolute value of a number, you throw away the minus sign (if the number is negative). Although you can raise a number to the .5 power to compute a square root, using the SQRT function may make your program easier to read (and to write). The EXP function raises e (the base of natural logarithms) to the value of its argument. The LOG function takes the natural logarithm of its argument. (If you need a base 10 log, use the LOG10 function.)

The @@ signs at the end of the INPUT statement are called a double trailing @. Notice that you are reading four values of x on a single line and creating four observations. Without the @@ at the end of the INPUT statement, SAS would go to a new line each time SAS reached the bottom of the DATA step. The @@ at the end of the line is an instruction to “hold the line” and keep reading values on the same line until there are no more values to read.

Note: The double trailing @ is different from the single trailing @ (described in Chapter 21, sections 11 and 12). The single trailing @ holds the line for another INPUT statement in the DATA step but releases the line at the bottom of the DATA step—the double trailing @ does not release the line at the bottom of the DATA step. One of the only reasons for using a double trailing @ is to create more than one observation from a single line of data.

Here is a listing of the Math data set:

Figure 11.4: Listing of Data Set Math

Figure 11.4: Listing of Data Set Math

Notice that there are missing values for the SQRT and LOG functions when the argument to these functions is a negative value. SAS also prints a note in the log to inform you about this fact.

11.8  Computing Some Useful Constants

The CONSTANT function returns values of commonly used mathematical constants such as pi and e. For a SAS programmer, perhaps the most useful feature of this function is its ability to compute the largest integer that can be stored exactly in less than 8 bytes. The next program demonstrates some of the more common uses of this function:

Program 11.8: Computing Some Useful Constants with the CONSTANT Function

  data Constants;

     Pi = constant('pi');

     e = constant('e');

     Integer3 = constant('exactint',3);

     Integer4 = constant('exactint',4);

     Integer5 = constant('exactint',5);

     Integer6 = constant('exactint',6);

     Integer7 = constant('exactint',7);

     Integer8 = constant('exactint',8);

  run;

Pi and e are computed as shown here. To compute the largest integer stored in n bytes, you provide a second argument indicating the number of bytes. Output from this program is shown as follows:

Figure 11.5: Listing of Data Set Constants

Figure 11.5: Listing of Data Set Constants

To be sure the exact integer feature of this function is clear, if you use a LENGTH statement to set the length of a numeric variable to 3, the largest integer you can represent without losing accuracy is 8,192; with a length of 4, you can represent integers up to 2,097,152. (These values may vary, depending on your operating system.)

11.9  Generating Random Numbers

We will discuss only one of the random number functions here—RAND. This function generates random numbers based on any one of 29 probability distributions. Some of the more commonly used distributions are listed in Figure 11.6:

Figure 11.6: Some of the Commonly Used Probability Distributions

Distribution

Example

Description

Uniform

X = rand('uniform');

Uniform random numbers between 0 and 1

Normal

HR = rand('normal',70,10);

Normally distributed numbers with a mean of 70 and a standard deviation of 10

Bernoulli

Heads = rand('Bernoulli',.5);

A Bernoulli distribution with p = .5 (simulates a coin toss)

Binomial

N = rand('Binomial',.3,10)

The number of “successes” in 10 trials with a probability of .3

 

Note: The RAND function replaces some of the older random number functions such as RANUNI (generates uniform random numbers) and RANNOR (generates normally distributed random numbers).

You may wonder why you would ever need to generate random numbers. Some possible uses are to create data sets for benchmarking, to select random samples, and to assign subjects randomly to two or more groups.

Computers do not really generate true random numbers. However, they are capable of generating series of numbers that are very close to random (called pseudo-random numbers by snobs). In order to get the computer started, you need to provide a seed number to be used to generate the first number in the random sequence. If you use the RAND function without first setting a seed with a call streaminit(n) command (where n is an integer, preferably a large one), SAS uses the computer’s clock to supply the seed. If you choose any positive integer in the call streaminit, that number is used as a seed. SAS recommends that you choose a seed of at least 7 digits. If you supply the seed, the program generates the same sequence of random numbers every time you run the program. Without the call to streaminit, the sequence is different every time you run the program. Hopefully, several examples will make this clear.

As the first example, you want to generate 5 uniform random numbers.

Program 11.9: Program to Generate Five Uniform Random Numbers

  data Uniform;   

     do i = 1 to 5;

        X = rand('uniform');

        output;

     end;

  run;  

First, notice that this DATA step does not read any data. The DO loop causes the RAND function to execute 5 times and output 5 uniform random numbers. It is important that the OUTPUT statement is placed inside the loop, otherwise the program would only output one random number. Let's run this program once and list the output data set:

Figure 11.7: Listing of Data Set Uniform (First Run)

Figure 11.7: Listing of Data Set Uniform (First Run)

Notice that all the random numbers are between 0 and 1. Let's run the program again and list the data:

Figure 11.8: Running Program 11.9 Again and Listing the Results

Figure 11.8: Running Program 11.9 Again and Listing the Results

Notice that the 5 random numbers are different from the ones produced by the first run. The reason is that you did not include the call to streaminit. Let's run Program 11.9 twice more, but specify a seed value with a call to streaminit:

Program 11.10: Including a Call to Streaminit

  data Uniform;

     call streaminit(1234567);  

     do i = 1 to 5;

        X = rand('uniform');

        output;

     end;

  run;  

We will run Program 11.10 twice and list the output:

Figure 11.9: Listing of Data Set Uniform with Call to Streaminit (First Run)

Figure 11.9: Listing of Data Set Uniform with Call to Streaminit (First Run)

Figure 11.10: Listing of Data Set Uniform with Call to Streaminit (Second Run)

Figure 11.10: Listing of Data Set Uniform with Call to Streaminit (Second Run)

Notice that by setting a seed value, the program generates the same random sequence each time it is run.

You may wonder why would anyone want to do this? One possible reason is to generate random groupings for a clinical trial. When the trial is over, you want to be able to generate the exact same random assignments, so you use the same seed value that you used originally.

Suppose you want to select approximately 10% of the observations from the Blood data set. You can use uniform random numbers like this:

Program 11.11: Using the RAND function to randomly select observations

  data Subset;

     set Learn.Blood;

     if rand('uniform') le .1;

  run;

Because values from a uniform distribution are always between 0 and 1, approximately 10% of these numbers will be less than .1 and the observation will be written out to data set Subset. Here is part of the SAS log after this program was run:

NOTE: There were 1000 observations read from the data set LEARN.BLOOD.

NOTE: The data set WORK.SUBSET has 104 observations and 7 variables.

Notice that the Subset data set is not exactly 10% of the Blood data set. If you need an exact 10% sample, Use PROC SURVEYSELECT as shown next.

PROC SURVEYSELECT, which is available in SAS/STAT software, is the best (and most efficient) way to create random subsets. This procedure is quite flexible and offers many options for generating these subsets. As an example, to obtain a simple random sample of size 100 from the Blood data set, you could use the following program:

Program 11.12: Using PROC SURVEYSELECT to Obtain a Random Sample

  proc surveyselect data=Learn.Blood

                    out=Subset

                    method=srs

                    sampsize=100;

  run;

The procedure options DATA= and OUT= are pretty clear. METHOD= allows you to choose a method for selecting your sample (SRS is a simple random sample, a sample taken without replacement). SAMPSIZE= allows you to choose the size of the sample. One additional option, which is not used here, is SEED=. If you supply a value for this option, this value is used as the seed. Also, if you prefer to select a percentage of the original data set, substitute the option SAMPRATE=Percent, (where Percent is the proportion of the data set you want to sample) instead of SAMPSIZE=. For more details on this procedure, see the SAS Help Center at http://go.documentation.sas.com.

To generate random integers in the range from 1 to 10, you could use:

RandomInteger = ceil(rand('uniform')*10);

Remember that the CEIL function rounds up to the next highest integer. For example, suppose that your first random number is almost 0 (remember, it can never be 0). Ten times almost 0 is also a small number and the CEIL function returns a 1. If the RAND function returns a number almost equal to 1, ten times almost 1 is almost 10. However, the CEIL function will return a 10 in this case.

11.10  Special Functions

The name special functions (a SAS category) makes these functions seem esoteric. That is far from the case. The INPUT and PUT functions, in particular, are extremely useful functions that you will use all the time.

The INPUT function enables you to “read” a character value using a SAS or user-defined informat and assign the resulting value to a SAS numeric variable. One of the most common uses of this function is to perform a character-to-numeric conversion. Another is to convert a date as a character string to a SAS date value. Here is an example.

You are given a SAS data set (Chars) that contains the variables Height, Weight, and Date. All three variables are character variables. You want to create a new data set called Nums with the same variables, except that they are numeric variables.

 

A listing of data set Chars is shown here:

Figure 11.11: Listing of Data Set Chars

Figure 11.11: Listing of Data Set Chars

Although it may not be obvious from the listing, all three variables are character variables.

Program 11.13: Using the INPUT Function to Perform a Character-to-Numeric Conversion

  data Nums;

     set Learn.Chars (rename=

                     (Height = Char_Height

                      Weight = Char_Weight

                      Date   = Char_Date));

     Height = input(Char_Height,8.);

     Weight = input(Char_Weight,8.);

     Date   = input(Char_Date,mmddyy10.);

     drop Char_Height Char_Weight Char_Date;

  run;

The technique used in this program is called swap and drop. Because the same variable cannot be both character and numeric, this technique allows you to wind up with numeric variables with the same names as the original character variables. You use the data set option RENAME= to rename all three variables. Next, you use the INPUT function to do the character-to-numeric conversions. A good way to understand the INPUT function is to think about what an INPUT statement does. It reads character data from a raw data file using an informat to determine how the value should be read and assigns the result to a SAS variable. Think of the INPUT function as “reading” a value of a character variable according to whatever informat you supply as the second argument of the function. The INPUT function, when used for the Height and Weight variables, uses an 8. informat. It doesn’t matter if this value is larger than you need. Because the dates are in the month-day-year form, you use the MMDDYY10. informat to do the conversion. Because you don’t want the original character values, you drop them. That’s why this technique is called swap and drop.

Before we leave this function, we should point out that you can shorten the DROP statement as follows:

drop Char_:;

The colon notation is a SAS wildcard. This statement says to drop all variables that begin with Char_.

The other special function we discuss here is the PUT function. Just as a PUT statement can send the formatted value of a variable to an external file, a PUT function takes a value (the first argument), formats this value using the format supplied (the second argument), and “writes” the result to a variable. The result of a PUT function is always a character value. One common use of a PUT function is to perform a numeric-to-character conversion.

The following program demonstrates some possible uses of the PUT function:

Program 11.14: Demonstrating the PUT Function

  proc format;

     value Agefmt low-<20 = 'Group One'

                  20-<40  = 'Group Two'

                  40-high = 'Group Three';

  run;

  data Convert;

     set Learn.Numeric;

     Char_Date = put(Date,date9.);

     AgeGroup = put(Age,Agefmt.);

     Char_Cost = put(Cost,dollar10.);

     drop Date Cost;

  run;

Data set Numeric contains three numeric variables: Date, Age, and Cost. In data set Convert, the three variables Char_Date, AgeGroup, and Char_Cost are all character variables. The second argument of the PUT function specifies the format to apply to the first argument. Therefore, Char_date is a date in the DATE9. format, and Char_Cost is a value written using the DOLLAR10. format. Finally, AgeGroup applies a user-written format to place the age values into one of three groups. Here is a listing of the resulting data set.

Figure 11.12: Listing of Data Set Convert

Figure 11.12: Listing of Data Set Convert

11.11  Functions That Return Values from Previous Observations

Because SAS processes data from raw data files and SAS data sets line by line (or observation by observation), it is difficult to compare a value in the present observation with one from a previous observation. Two functions, LAG and DIF, are useful in this regard.

Let’s start out with a short program that demonstrates how the LAG function works:

 

Program 11.15: Demonstrating the LAG and LAGn Functions

  data Look_Back;

     input Time Temperature;

     Prev_Temp = lag(Temperature);

     Two_Back = lag2(Temperature);

  datalines;

  1 60

  2 62

  3 65

  4 70

  ;

A listing of data set Look_Back follows:

Figure 11.13: Listing of Data Set Look_Back

Figure 11.13: Listing of Data Set Look_Back

As you can see from this listing, the LAG function returns the temperature from the previous time and the LAG2 function returns the temperature from the time before that. (There is a whole family of LAG functions: Lag, LAG2, LAG3, and so on.) This program might give you the idea that the LAG function returns the value of its argument from the previous observation. This is not always true.

Note: The correct definition of the LAG function is that it returns the value of its argument the last time the LAG function executed.

To help clarify this somewhat clunky sounding definition, see if you can predict the values of X and Last_X in the program that follows:

Program 11.16: Demonstrating What Happens When You Execute a LAG Function Conditionally

  data Laggard;

     input X @@;

     if X ge 5 then Last_X = lag(X);

  datalines;

  9 8 7 1 2 12

  ;

 

Here is a listing of data set Laggard:

Figure 11.14: Listing of Data Set Laggard

Figure 11.14: Listing of Data Set Laggard

Are you surprised? The value of Last_X in the first three observations is clear. But, what happened in Observation 6? To understand this, you need to read the definition carefully. The IF statement is not true in Observations 4 and 5; therefore, Last_X, which is set to a missing value at each iteration of the DATA step, remains missing. In Observation 6, the IF statement is true and the LAG function returns the value of X from the last time this function executed, which was back at Observation 3, where X was equal to 7.

Note: The take-home message is this: “Be careful if you execute a LAG function conditionally.” In most cases, you want to execute the LAG function for each iteration of the DATA step. When you do, this function returns the value of its argument from the previous observation.

A common use of the LAG function is to compute differences between observations. For example, you can modify Program 11.15 to compute the difference in temperature from one time to the next, as follows:

Program 11.17: Using the LAG Function to Compute Inter-observation Differences

  data Diff;

     input Time Temperature;

     Diff_Temp = Temperature - lag(Temperature);

  datalines;

  1 60

  2 62

  3 65

  4 70  

  ;

 

Here is a listing of Diff:

Figure 11.15: Listing of Data Set Diff

Figure 11.15: Listing of Data Set Diff

The LAG function is often used to compute differences between values from one observation to the next in the form:

x – lag(x);

Therefore, SAS has a set of DIF functions (DIF, DIF2, DIF3, and so on) that is equal to X – LAG(X). You could, therefore, rewrite Program 11.17 like this:

Program 11.18: Demonstrating the DIF Function

  data Diff;

     input Time Temperature;

     Diff_Temp = dif(Temperature);

  datalines;

  1 60

  2 62

  3 65

  4 70

  ;

For more examples using the LAG and DIF functions, please see the examples in Chapter 24.

11.12  Sorting Within an Observations—a Game Changer

There are some functions that are useful and can save you some programming time. There are other functions and/or call routines that I would call “game changers.” CALL SORTN is in the latter category.

Suppose you give a quiz every week for eight weeks and, if the students take all of the quizzes, you are willing to drop the two lowest quizzes. If they take just 7 quizzes, you will drop only one single lowest quiz grade. You could always use an approach like this:

 

Program 11.19: Solving the Quiz Problem the Hard Way

data Quizzes;

   input ID $ Quiz1 - Quiz8;

   if n(of Quiz1-Quiz8) ge 7 then Quiz_Score = mean(

      largest(1, of Quiz1-Quiz8), largest(2, of Quiz1-Quiz8),

      largest(3, of Quiz1-Quiz8), largest(4, of Quiz1-Quiz8),

      largest(5, of Quiz1-Quiz8), largest(6, of Quiz1-Quiz8));

datalines;

001 80 70 90 100 88 90 90 51

002 80 70 90 100 88 90 . .

003 60 60 70 70 70 70 80 .

;

You are using the LARGEST function to select the six highest quiz scores. The first thing you notice about this code is that it violates what I call “Cody's Law of SAS Programming”:

Note: If a program is getting very tedious to write, check for an easier way (perhaps an ARRAY, a FUNCTION, a CALL ROUTINE or a MACRO).

In order to simplify the program above, we need to introduce the CALL SORTN routine. Given a list of arguments, this call routine returns the values of each of the arguments in ascending order.

For example if x1 = 7, x2 = ., x3=3, and x4 = 9

If you execute the line call sortn(of x1-x4); the four x values are now:

x1 = ., x2 =3, x3 = 7, and x4 = 9. That is, the x-values are now sorted within an observation. Do you see how this reduces the problem of computing the quiz grades above?

Let's rewrite the program above using the CALL SORN routine.

Program 11.20: Repeating Program 11.19 Using the CALL SORTN Routine

data Quizzes;

   input ID $ Quiz1 - Quiz8;

   call sortn(of Quiz1 - Quiz8);

   /***Scores are now in ascending order***/

   if n(of Quiz1-Quiz8) ge 7 then Quiz_Score = mean(of Quiz3-Quiz8);

datalines;

001 80 70 90 100 88 90 90 51

002 80 70 90 100 88 90 . .

003 60 60 70 70 70 70 80 .

;

This works because the call routine rearranged all the Quiz scores (including missing values) from lowest to highest.

Note: Remember that when you use the CALL SORTN routine, the original values of the arguments are lost. If you need the original values and the values are in sorted order, copy the original values, using different variable names (i.e. Orig_Quiz1-Orig_Quiz8).

To make the program a bit easier to read, you can simulate a descending sort by listing the Quiz scores in reverse order like this:

call sortn(of Quiz8 – Quiz1); /***This is a descending sort***/

if n(of Quiz1-Quiz8) ge 7 then Quiz_Score = mean(of Quiz1 – Quiz6);

You will find many applications where using CALL SORTN will save you from writing a tedious program.

11.13  Problems

Solutions to odd-numbered problems are located at the back of this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion website at support.sas.com/cody for information about how to obtain the solutions to all problems.

1.       Using the SAS data set Health, compute the body mass index (BMI) defined as the weight in kilograms divided by the height (in meters) squared. Create four other variables based on BMI: 1) BMIRound is the BMI rounded to the nearest integer, 2) BMITenth is the BMI rounded to the nearest tenth, 3) BMIGroup is the BMI rounded to the nearest 5, and 4) BMITrunc is the BMI with a fractional amount truncated. Conversion factors you will need are: 1 Kg equals 2.2 Lbs and 1 inch = .0254 meters.

2.       Count the number of missing values for WBC, RBC, and Chol in the Blood data set. Use the MISSING function to detect missing values.

3.       Create a new, temporary SAS data set (Miss_Blood) based on the SAS data set Blood. Set Gender, RBC, and Chol to a missing value if WBC is missing. Use the MISSING and CALL MISSING functions in this program.

4.       The SAS data set Psych contains an ID variable, 10 question responses (Ques1–Ques10), and 5 scores (Score1–Score5). You want to create a new, temporary SAS data set (Evaluate) containing the following:

a.      A variable called QuesAve computed as the mean of Ques1–Ques10. Perform this computation only if there are seven or more nonmissing question values.

b.      If there are no missing Score values, compute the minimum score (MinScore), the maximum score (MaxScore), and the second highest score (SecondHighest).

5.       The SAS data set Psych contains an ID variable, 10 question responses (Ques1–Ques10), and 5 scores (Score1–Score5). You want to create a new, temporary SAS data set (Evaluate) containing the following:

a.      A value (ScoreAve) consisting of the mean of the three highest Score values. If there are fewer than three nonmissing score values, ScoreAve should be missing.

b.      An average of Ques1–Ques10 (call it QuesAve) if there are seven or more nonmissing values.

c.      A composite score (Composit) equal to ScoreAve plus 10 times QuesAve.

6.       Write a short DATA _NULL_ step to determine the largest integer you can score on your computer in 3, 4, 5, 6, and 7 bytes.

 

7.       Given values of x, y, and z, compute the following (using a DATA _NULL_ step):

d.      AbsZ = absolute value of z

e.      Expx = e raised to the x power

f.       Circumference = 2 times pi times y

Use values of x, y, and z equal to 10, 20, and –30, respectively. Round the values for b and c to the nearest .001.

8.       Create a temporary SAS data set (Random) consisting of 1,000 observations, each with a random integer from 1 to 5. Make sure that all integers in the range are equally likely. Run PROC FREQ to test this assumption.

9.       Using the random functions, create a temporary SAS data set (Fake) with 100 observations. Each observation should contain a subject number (Subj) starting from 1, a random gender (with approximately 40% females and 60% males), and a random age (integers from 10 to 50). Compute the frequencies for Gender and list the first 10 observations in the data set.

10.   Data set Char_Num contains character variables Age and Weight and numeric variables SS and Zip. Create a new, temporary SAS data set called Convert with new variables NumAge and NumWeight that are numeric values of Age and Weight, respectively, and CharSS and CharZip that are character variables created from SS and Zip. CharSS should contain leading 0s and dashes in the appropriate places for Social Security numbers and CharZip should contain leading 0s. Hint: The Z5. format includes leading 0s for the ZIP code.

11.   Repeat Problem 10, except this time use the same variable names for the converted variables. Hint: Swap and drop.

12.   Using the Stocks data set (containing variables Date and Price), compute daily changes in the prices. Use the statements here to create the plot.

   title "Plot of Daily Price Differences";

   proc sgplot data=Difference;

      series x=Date y=Date;

   run;

13.   Plot the daily stock prices in data set Stocks along with a moving average of the prices using a three-day moving average. Use the PLOT statements here to produce the plots.   

   title "Plot of Price and Moving Average";

   proc sgplot data=Moving;

      series x=Date y=Price;

      series x=Date y=Moving;

   run;

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset