Chapter 3. Programming Techniques for Data Analysis

Contents

  • 3.1 Overview of Programming Techniques 55

  • 3.2 Reading and Writing Data 56

    • 3.2.1 Creating Matrices from SAS Data Sets 56

    • 3.2.2 Creating SAS Data Sets from Matrices 58

  • 3.3 Frequently Used Techniques in Data Analysis 59

    • 3.3.1 Applying a Variable Transformation 59

    • 3.3.2 Locating Observations That Satisfy a Criterion 60

    • 3.3.3 Assigning Values to Observations That Satisfy a Criterion 65

    • 3.3.4 Handling Missing Values 65

    • 3.3.5 Analyzing Observations by Categories 68

  • 3.4 Defining SAS/IML Modules 72

    • 3.4.1 Function and Subroutine Modules 72

    • 3.4.2 Local Variables 73

    • 3.4.3 Global Symbols 74

    • 3.4.4 Passing Arguments by Reference 74

    • 3.4.5 Evaluation of Arguments 75

    • 3.4.6 Storing Modules 76

    • 3.4.7 The IMLMLIB Library of Modules 78

  • 3.5 Writing Efficient SAS/IML Programs 79

    • 3.5.1 Avoid Loops to Improve Performance 79

    • 3.5.2 Use Subscript Reduction Operators 80

    • 3.5.3 Case Study: Standardizing the Columns of a Matrix 83

  • 3.6 Case Study: Finding the Minimum of a Function 84

  • 3.7 References 88

3.1 Overview of Programming Techniques

This chapter features SAS/IML statements, tips, and techniques that might not be familiar to a SAS/STAT programmer, but that are useful for writing programs that analyze data.

Many of these techniques are used in subsequent chapters. For example, this chapter describes how to locate observations that satisfy a criterion, how to handle missing values in matrices, how to analyze observations by categories, how to define and call SAS/IML modules, and how to write efficient SAS/IML programs.

3.2 Reading and Writing Data

You can create a matrix by reading data from a SAS data set. You can also create a SAS data set from the contents of a matrix. This section discusses both of these techniques.

3.2.1 Creating Matrices from SAS Data Sets

This section describes how to read data from a SAS libref into SAS/IML vectors and matrices. If you have not already done so, follow the directions in Section 1.8 to install the data used in this book. (The examples in this book read data from the Sasuser library. If you install the data in a different SAS library, use that library instead of Sasuser.)

The Sasuser.Vehicles data contains engine characteristics and fuel efficiency for 1,187 vehicles sold in 2007. You can read data from a SAS data set by using the USE and READ statements. You can read variables into individual vectors by specifying a character matrix that contains the names of the variables that you want to read. The READ statement creates column vectors with those same names, as shown in the following statements:

/* read variables from a SAS data set into vectors */
varNames = {"Make" "Model" "Mpg_City" "Mpg_Hwy"};
use Sasuser.Vehicles;      /* open data set for reading              */
read all var varNames;     /* create four vectors: Make,...,Mpg_Hwy  */
close Sasuser.Vehicles;    /* close the data set                     */

idx = 1:3;                 /* print only the first 3 observations    */
print (Make[idx])
      (Model[idx])
      (Mpg_City[idx])[label="Mpg_City"]
      (Mpg_Hwy[idx])[label= "Mpg_Hwy"];

Figure 3.1. First Three Observations Read from a SAS Data Set

The USE statement opens Sasuser.Vehicles for reading. The READ statement reads observations into vectors or into a matrix. In this book, the examples almost always use the ALL option to read all the observations. See the chapter "Working with SAS Data Sets" in the SAS/IML User's Guide for further details on reading data sets.

The Sasuser.Vehicles data has 1,187 observations, but only the first few observations are printed. The section "Printing a Submatrix or Expression" on page 35 describes the use of parentheses in printing submatrices.

You can also read a set of variables into a matrix (assuming the variables are either all numeric or all character) by using the INTO clause on the READ statement. The following statements illustrate this approach:

/* read variables from a SAS data set into a matrix */
varNames = {"Mpg_City" "Mpg_Hwy" "Engine_Cylinders"};
use Sasuser.Vehicles;
read all var varNames into m;   /* create matrix with three columns */
close Sasuser.Vehicles;
print (m[1:3,])[colname=VarNames];

Figure 3.2. First Three Observations Read from a SAS Data Set into a Matrix

The COLNAME= option on the PRINT statement is used to display the variable names in the output. Notice also that the READ/INTO statement does not associate variable names with columns of a matrix. However, you can use the MATTRIB statement to permanently associate variable names to columns of a matrix.
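For example, the following statements (a sketch that continues the previous program) use the MATTRIB statement to attach the variable names to the columns of m so that subsequent PRINT statements display the names automatically:

/* associate variable names with the columns of m */
mattrib m colname=varNames;     /* names persist with the matrix       */
print (m[1:3,]);                /* column headings appear automatically */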

You can read only the numeric variables in a data set by specifying the _NUM_ keyword on the READ statement:

/* read all numeric variables from a SAS data set into a matrix */
use Sasuser.Vehicles;
read all var _NUM_ into y[colname=NumericNames];
close Sasuser.Vehicles;
print NumericNames;

Figure 3.3. The Names of the Numeric Variables Read into a Matrix

The matrix NumericNames contains the names of the numeric variables that were read; the columns of matrix y contain the data for those variables.

3.2.2 Creating SAS Data Sets from Matrices

You can use the CREATE and APPEND statements to write a SAS data set from vectors or matrices. The following statements create a data set called MyData in the Work library:

/* create SAS data set from vectors */
call randseed(12345);          /* set seed for RANDGEN              */
x = j(10, 1);                  /* allocate 10 × 1 vector            */
e = x;                         /* allocate 10 × 1 vector            */
call randgen(x, "Uniform");    /* fill x; values from uniform dist  */
call randgen(e, "Normal");     /* fill e; values from normal dist   */
y = 3*x + 2 + e;               /* linear response plus normal error */

create MyData var {x y};       /* create Work.MyData for writing    */
append;                        /* write data in x and y             */
close MyData;                  /* close the data set                */

The CREATE statement opens Work.MyData for writing. The variables x and y are created; the type of the variables (numeric or character) is determined by the type of the SAS/IML vectors of the same name. The APPEND statement writes the values of the vectors listed on the VAR clause of the CREATE statement. The CLOSE statement closes the data set. It is a good programming practice to close data sets when you are finished reading or writing the data because it frees up memory and system resources, and also prevents corruption of the data in the event that your program terminates prematurely. (For example, you might need to terminate a program if a computation is taking much longer than you expected.)

You can write character vectors in the same way. Row vectors are written to data sets as if they were column vectors. If the x and y vectors do not contain the same number of elements, then the shorter vector is padded with missing values.
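As a sketch of the padding behavior, the following statements (illustrative names, not part of the original program) write two vectors of unequal lengths; the resulting data set has three observations, and the third value of the shorter vector is missing:

/* write vectors of unequal lengths; shorter vector is padded */
a = {1, 2, 3};                 /* 3 x 1 vector                 */
b = {10, 20};                  /* 2 x 1 vector; shorter than a */
create PadExample var {a b};
append;                        /* third value of b is missing  */
close PadExample;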

If you want to create a data set from a matrix of values, you can use the FROM clause on the CREATE and APPEND statements. If you do not explicitly specify names for the data set variables, the default names are COL1, COL2, and so forth. You can explicitly specify names for the data set variables by using the COLNAME= option to the FROM clause, as shown in the following statements:

/* create SAS data set from a matrix */
call randseed(12345);            /* set seed for RANDGEN            */
x = j(10, 3);                    /* allocate 10 × 3 matrix          */
call randgen(x, "Normal");       /* fill x; values from normal dist */

create MyData2 from x[colname={"Random1" "Random2" "Random3"}];
append from x;
close MyData2;

Most of the examples in this book use the Work libref for temporary storage. You can create a permanent data set by using the LIBNAME statement to create a libref, as shown in the following statements:

/* create a libref and create a permanent SAS data set */
libname MyLib 'your data directory';
create MyLib.MyData var {x y};
append;
close MyLib.MyData;

3.3 Frequently Used Techniques in Data Analysis

The SAS/IML language can be very useful in data analysis. This section describes SAS/IML statements and techniques that you can use to analyze your data. In particular, this section describes how to do the following:

  • transform variables

  • locate observations that satisfy a criterion

  • handle missing values

  • analyze observations for each level of a categorical variable

3.3.1 Applying a Variable Transformation

It is common to transform data during exploratory data analysis and statistical modeling. For example, you can apply a linear transformation to scale or center a variable. If a variable is heavily skewed, it is common to apply a logarithmic transformation in an attempt to work with a more symmetric distribution.
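For example, centering and scaling are each a single statement in SAS/IML, as the following sketch shows. (The subscript reduction operator used here to compute the mean is described in Section 3.5.)

/* center and scale a variable */
x = {1, 3, 5, 7};
centered = x - x[:];           /* subtract the mean            */
scaled   = x / 10;             /* divide by a constant         */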

Logarithmic transformations are common when dealing with variables that describe how much something costs. For example, in the Movies data set, the Budget and US_Gross variables are prime candidates for a log transformation. Both natural logarithms and base-10 logarithms are frequently used in practice.

Transforming variables in SAS/IML software is easy since the language supports common mathematical functions and algebraic expressions. In addition to logarithmic transformations, frequently used transformations in data analysis include square root transformations, inverse transformations, power transformations, and exponential transformations. All of these are easily programmed in SAS/IML software, often with a single statement. For example, the following statements apply a natural log transform to the Budget variable in the Movies data set:

/* apply log transform to data */
use Sasuser.Movies;                      /* open data set for reading */
read all var {"Budget"};                 /* read data                 */
close Sasuser.Movies;                    /* close data set            */

logBudget = log(Budget);                 /* apply log transform       */

A log transformation is well-defined for these data because the budget of a movie is always positive. If some data vector, say y, is not positive, it is common to add a constant to all values in order to produce a vector that contains positive values. You can show that y + c > 0, provided that c > − min(y). The following statements exploit that fact to compute a log10 transformation for nonpositive data:

/* apply log10 transform to data with negative values */
y = {10 0 -29 -20 273 70};
c = 1 - min(y);                 /* (y + c) > 0 whenever c > -min(y) */
log10Y = log10(y + c);          /* apply log10 transformation       */
print log10Y;

Figure 3.4. A log10 Transformation

The section "Variable Transformations" on page 181 describes an alternative approach for applying a log transformation to a variable that contains nonpositive values.

For other analyses (especially multivariate analyses), you might want to standardize the data to have zero mean and unit standard deviation. Standardizing a variable is described in section "Case Study: Standardizing the Columns of a Matrix" on page 83.

3.3.2 Locating Observations That Satisfy a Criterion

A frequent task in data analysis is to locate observations that satisfy some criterion. For example, you might want to locate all of the males in your data, or all of the events that happened prior to some target date. The LOC function is the key to efficiently finding these observations.

Many experienced SAS programmers are not familiar with the LOC function. That is unfortunate, because it is a powerful function. Perhaps the reason for its obscurity is that it does not have an analogue in the DATA step. Or perhaps the SAS/IML documentation is partially to blame, since the documentation states that the LOC function "finds nonzero elements of a matrix." The casual reader can be forgiven for overlooking the power of such a seemingly bland function.

The power of the LOC function is best understood by looking first at the SAS/IML comparison operators (less than, equal to, greater than, and so forth) as described in the section "Comparison Operators" on page 36. The SAS/IML comparison operators create a matrix of zeros and ones, where 0 indicates that the comparison is false and where 1 indicates that the comparison is true. For example, the following are valid SAS/IML statements:

/* compare data; the comparison operator returns zeros and ones */
x = 1:5;
t = (x > 2);               /* t = {0 0 1 1 1} */

The matrix t contains nonzero values in exactly the locations at which x satisfies the given condition. Consequently, if you want to know the indices for which x is greater than two, you can use the LOC function, as shown in the following statements:

/* find indices that correspond to nonzero values */
idx = loc(t);              /* idx = {3 4 5};  */
print t, idx;

Figure 3.5. Indices That Satisfy a Condition

In practice, you do not need to create the named vector t. SAS/IML creates a temporary matrix when it needs to compute an intermediate result. Therefore, an experienced SAS/IML programmer writes the following statements:

/* find indices that satisfy a condition by using the LOC function */
x = 1:5;
idx = loc(x > 2);          /* idx = {3 4 5};  */

Now the utility of the LOC function becomes clear: you can use the LOC function to find observations that satisfy arbitrarily complex criteria! For example, if you are trying to find vehicles that are fuel efficient, you can write the following statements:

/* find vehicles that get at least 33 mpg in the city */
varNames = {"Make" "Model" "Mpg_City" "Engine_Cylinders"};
use Sasuser.Vehicles;            /* read data */
read all var varNames;
close Sasuser.Vehicles;

idx = loc(Mpg_City >= 33);
print (Make[idx])
      (Model[idx])
      (Mpg_City[idx])[label="Mpg_City"];

Figure 3.6. Vehicles with Mpg_City ≥ 33

Continuing the previous program, if you want to find vehicles that have six cylinders and that are fuel efficient, you use the AND (&) operator in conjunction with the LOC function, as shown in the following statements:

/* find vehicles that have six cylinders and get > 30 mpg in the city */
idx = loc(Engine_Cylinders=6 & Mpg_City>30);
numSatisfy = ncol(idx);
print numSatisfy "vehicles satisfy the criteria.";
print (Make[idx]) (Model[idx]) (Mpg_City[idx])[label="Mpg_City"];

Figure 3.7. Vehicles That Satisfy Multiple Criteria

Notice that you can use the NCOL function to count observations that satisfy the criteria.

You can use the LOC function to do the following:

  • find extreme values in univariate data

  • find observations or variables that contain missing values

  • identify observations that are more than a certain distance from the mean or median

  • detect outliers in a regression model by finding observations whose absolute residuals are large

  • detect observations that have a high influence on the regression model by finding observations with a large Cook's D, leverage, or PRESS value

As an example, the following statements demonstrate how to use the LOC function to detect outliers in a regression model. Figure 3.8 shows a scatter plot of the Mpg_Hwy versus Mpg_City, along with a robust regression curve. (Chapter 4, "Calling SAS Procedures," shows how you can call a SAS procedure such as ROBUSTREG to compute the predicted and residual values for a model. Chapter 7, "Creating Statistical Graphs," shows how to create the graph in SAS/IML Studio.)


Figure 3.8. Scatter Plot with Robust Regression Line

For now, assume that the parameters for the regression line are estimated by some robust regression algorithm. The estimates are approximately given by b = (0.24, 1.35). The following statements use the LOC function to extract and print information about the vehicles that have large residuals:

/* find vehicles with large residuals for a linear model */
varNames = {"Make" "Model" "Mpg_City" "Mpg_Hwy" "Hybrid"};
use Sasuser.Vehicles;                    /* read data               */
read all var varNames;
close Sasuser.Vehicles;

Pred_Mpg_Hwy = 0.24 + 1.35*Mpg_City;     /* create predicted values */
residual = Mpg_Hwy - Pred_Mpg_Hwy;       /* create residual values  */
idx = loc(abs(residual) > 10);
print (Make[idx])
      (Model[idx])
      (residual[idx])[label="Residual"]
      (Hybrid[idx])[label="Hybrid?"];

By now, the first few statements in the program are familiar: define the variables of interest and use the USE and READ statements to create vectors that contain the values for the given variables. The subsequent statement creates predicted values based on some regression model. The predicted values are used to form the residuals. The LOC function finds observations (shown in Figure 3.9) whose observed value of Mpg_Hwy differs from the predicted value by more than 10 mpg.


Figure 3.9. Vehicles with Large Residuals

Notice that the value of the Hybrid indicator variable is 1 for all of these observations. This indicates that the hybrid vehicles do not satisfy the same linear relationship as the other vehicles. A revised model might incorporate the Hybrid variable as a classification variable in the model.

There is a potential problem with each of the previous examples: what happens if there are no observations that satisfy the specified criterion? In that case, the LOC function returns an empty matrix that has zero rows and zero columns. It is invalid to use an empty matrix as an index. In other words, it is an error to form an expression such as residual[idx] when idx is empty.

You can determine whether a matrix is empty by using one of three functions: TYPE, NROW, or NCOL. (These functions are described in Section 2.2.) Which function you use is a matter of personal preference. This book uses the NCOL function. Consequently, the following statements show how to handle the case in which there are no sufficiently large residuals:

/* handle the situation when no observations satisfy a criterion */
Pred_Mpg_City = 0.24 + 1.35*Mpg_Hwy;
residual = Mpg_City - Pred_Mpg_City;
idx = loc(abs(residual) > 10);
if ncol(idx) > 0 then
   print (Make[idx]) (Model[idx]) (residual[idx])[label="Residual"]
         (Hybrid[idx])[label="Hybrid?"];
else
   print "No observations satisfy the criterion.";

3.3.3 Assigning Values to Observations That Satisfy a Criterion

The previous section indicates that hybrid-electric vehicles are different from traditional vehicles in terms of the relationship between MPG_Hwy and MPG_City. Suppose you compute two linear models—one for hybrid vehicles and the other for traditional vehicles—and you want to assign a vector of predicted values for every observation in the Vehicles data set. You can use the LOC function in conjunction with an IF-THEN statement to conditionally assign the predicted values, as shown in the following statements:

/* assign predicted values based on whether vehicle is hybrid-electric */
varNames = {"Mpg_City" "Mpg_Hwy" "Hybrid"};
use Sasuser.Vehicles;      /* read data */
read all var varNames;
close Sasuser.Vehicles;

/* first approach: use the LOC function */
Pred = j(nrow(Hybrid), 1); /* allocate vector for predicted values  */
idx = loc(Hybrid=1);       /* indices that satisfy a criterion      */
if ncol(idx)>0 then        /* assign an expression to those indices */
   Pred[idx] = 6.91 + 0.7*MPG_City[idx];
idx = loc(Hybrid^=1);      /* the other indices                     */
if ncol(idx)>0 then        /* assign a different expression         */
   Pred[idx] = 0.09 + 1.36*MPG_City[idx];

The predicted values for the hybrid-electric vehicles are based on one model; the predicted values for traditional vehicles are based on another model. The LOC function finds the indices for which each model applies. (A less efficient alternative to the LOC function would be to loop over all observations and to use an IF-THEN/ELSE statement inside the loop to check whether each vehicle is hybrid-electric. You should avoid looping over observations whenever possible.)
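For comparison, the loop-based alternative mentioned in the previous paragraph looks something like the following sketch; it computes the same predicted values but is less efficient:

/* less efficient: loop over observations (avoid this approach) */
Pred = j(nrow(Hybrid), 1);
do i = 1 to nrow(Hybrid);
   if Hybrid[i] = 1 then
      Pred[i] = 6.91 + 0.7*MPG_City[i];
   else
      Pred[i] = 0.09 + 1.36*MPG_City[i];
end;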

There is, in fact, another efficient way to conditionally assign values to a vector. The CHOOSE function takes three arguments. The first argument is a logical expression. The second argument contains values that are assigned to observations that satisfy the logical expression; the third argument contains values that are assigned to observations that do not satisfy it. The following statement is equivalent to several statements in the previous program:

/* second approach: use the CHOOSE function */
Pred = choose(Hybrid=1,            /* if the i_th vehicle is hybrid */
              6.91+0.7*MPG_City,   /* assign this value,            */
              0.09+1.36*MPG_City); /* otherwise, assign this value  */

3.3.4 Handling Missing Values

It is common to have missing values in data. For example, in the Sasuser.Movies data, there are 25 missing values for the World_Gross variable. Similarly, there are missing values for the Distributor variable for those movies that are "independent films." Missing values can occur for a variety of reasons such as nonresponse in a survey or because no measurement was taken. Missing values can occur for mathematical or statistical reasons. For example, the logarithm of a nonpositive value or the skewness of constant data are both often represented by a missing value.

There are two common techniques for dealing with missing values during data analysis: excluding observations with any missing variable values from the analysis or imputing values to replace the missing value. Most SAS procedures (for example, UNIVARIATE, REG, and PRINCOMP) exclude the entire observation if the observation has a missing value in any relevant variable. The MI and MIANALYZE procedures in SAS/STAT use multiple imputation to handle missing values. This book does not discuss multiple imputation methods.

Many SAS/IML built-in subroutines and matrix operators handle missing values by omitting them from a computation. For example, you can compute the sum of the elements in a matrix by using the SUM function or by using the summation subscript reduction operator (+), which is described in "Use Subscript Reduction Operators" on page 80. If the matrix contains missing values, those missing values are excluded from the arithmetic operations. Similarly, the Median module (which is distributed with SAS/IML as part of the IMLMLIB module library) excludes missing values when computing the median of each column of a matrix.

However, some SAS/IML subroutines and operators cannot check for missing values without becoming inordinately inefficient. For example, matrix and vector multiplications stop with an error if there are missing values in either factor. Other SAS/IML statistical and mathematical functions similarly report errors if they are called with arguments that contain missing values. Consequently, if you want to write robust, reusable programs, you should write programs that explicitly handle missing values in your data.
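For example, you can check for missing values before a matrix multiplication, as in the following sketch. (The COUNTMISS function, which counts the missing values in a matrix, assumes SAS/IML 9.22 or later.)

/* check for missing values before multiplying */
A = {1 2, . 4};
v = {1, 1};
if countmiss(A) = 0 then
   w = A*v;                    /* safe: A has no missing values */
else
   print "A contains missing values.";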

For example, suppose you wanted to compute the mean of the worldwide revenue for the movies in the Movies data set. The following statements look correct upon a first glance:

/* incorrect computation for data that contain missing values */
use Sasuser.Movies;                  /* read data */
read all var {"World_Gross"};
close Sasuser.Movies;

/* mean is wrong because of missing values */
wrong_mean = sum(World_Gross) / nrow(World_Gross);
print wrong_mean;

Figure 3.10. An Incorrect Computation

The value for the mean (116.77) is wrong. Missing values are the reason. The SUM function skips missing values as it adds up the elements in the World_Gross vector, but the NROW function counts all of the rows, even those with missing values.

There are two ways to get the correct answer: use functions that account for the missing values, or delete the missing values from the data. There are several functions that account for missing values, as shown in the following statements:

/* use functions that account for missing values */
mean1 = World_Gross[:];
mean2 = mean(World_Gross);                           /* SAS/IML 9.22 */
mean3 = sum(World_Gross)/countn(World_Gross);        /* SAS/IML 9.22 */
print mean1 mean2 mean3;

Figure 3.11. Result of Calling Functions That Handle Missing Values

The subscript reduction operator (:) is described in Section 3.5. The MEAN function computes the mean of each column in a matrix. The COUNTN function returns the number of nonmissing values in each column of a matrix.

The alternative way to handle missing values is to delete elements in a vector (or rows in a matrix) that are missing. This is an easy application of using the LOC function. The following statements exclude observations with missing values:

/* exclude observations with missing values */
nonMissing = loc(World_Gross ^= .);       /* find nonmissing obs    */
if ncol(nonMissing)=0 then mean4 = .;
else do;                                  /* extract nonmissing obs */
   World_Gross = World_Gross[nonMissing,];
   mean4 = sum(World_Gross) / nrow(World_Gross);
end;
print mean4;

Figure 3.12. Result of Deleting Missing Values

This time the value for the mean (125.51) is correct because only the nonmissing values are used in the calculation. Section 3.4 extends this example to create a module that excludes all rows of a matrix that contain a missing value.

You can use this same technique to exclude observations that satisfy any criterion—including a criterion that has nothing to do with missing values. To exclude observations that satisfy a criterion, first locate all observations that do not satisfy the criterion, and then extract those observations from the vector. In this way, you can exclude all of the R-rated movies from the movies data, all of the films that had a large budget, and so forth.
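For example, the following sketch excludes the R-rated movies by keeping only the observations whose rating is not "R". (It assumes that the MPAARating vector has been read from the Movies data, as in Section 3.3.5.)

/* exclude R-rated movies */
idx = loc(MPAARating ^= "R");  /* obs that are not rated R     */
if ncol(idx) > 0 then
   notR = MPAARating[idx];     /* extract those obs            */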

The observant reader who is familiar with DATA step programming might ask "Can't you exclude those observations when you read the data by using a WHERE clause?" Yes, you can. You can specify the WHERE clause on the READ statement, as the following statement indicates:

read all var {"World_Gross"} where (World_Gross ^= .);

However, this does not eliminate the need for the technique of this section. For example, suppose that you want to read several variables (say X and Y) in a single READ statement. Do you want to exclude observations from some X variable because a value is missing in the Y variable? The answer is often "yes" for a multivariate analysis, but is often "no" for univariate analyses. One way to handle both situations is to read in all of the data but to exclude missing values in a way that makes sense for a particular analysis.
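For example, the following sketch (with an illustrative matrix in place of data read from a data set) performs listwise deletion: it keeps only the rows of a two-column matrix for which neither value is missing, which is the appropriate choice for a multivariate analysis:

/* listwise deletion for a multivariate analysis */
m = {1 2, 3 ., 5 6};                  /* second row has a missing value */
idx = loc(m[,1] ^= . & m[,2] ^= .);   /* rows with no missing values    */
if ncol(idx) > 0 then
   m = m[idx, ];                      /* keep only complete rows        */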

A second consideration is that much of SAS/IML programming is done inside of a module (modules are introduced in the section "Defining SAS/IML Modules"). A module typically receives a matrix as an argument, and has no control over whether or not it contains missing values. Consequently, the module has to handle missing values.

A third consideration is that a matrix can be created from many sources: SAS data sets on a server, data residing in a DataObject, data from a Microsoft Excel worksheet, an R data frame, or values resulting from a SAS/IML computation. (These situations are discussed in Chapter 6, "Understanding IMLPlus Classes.") In general, an IMLPlus programmer cannot control whether matrices contain missing values, so it is best to write programs that handle them.

3.3.5 Analyzing Observations by Categories

The LOC function is often used in concert with the UNIQUE function when you are analyzing categorical variables. The UNIQUE function returns a vector that contains a sorted list of unique values. The LOC and UNIQUE functions both return their results as row vectors. Consequently, you can use the NCOL function to count how many unique values there are, and also use the NCOL function to count how many observations are associated with each unique value.

For example, the simplest usage of UNIQUE and LOC is to carry out a one-way frequency analysis of a variable. The following statements count the number of different categories in the MPAARating variable in the movies data set:

/* count the number of observations in each category */
use Sasuser.Movies;
read all var {"MPAARating"};              /* 1 */
close Sasuser.Movies;

categories = unique(MPAARating);          /* 2 */
count = j(ncol(categories), 1, 0);        /* 3 */
do i = 1 to ncol(categories);
   idx = loc(MPAARating = categories[i]); /* 4 */
   count[i] = ncol(idx);                  /* 5 */
end;
print count[rowname=categories];

Figure 3.13. A One-Way Frequency Analysis

The previous program contains five main steps:

  1. Read the MPAARating variable. This creates a vector of the same name.

  2. Use the UNIQUE function to get a row vector that contains unique categories of the MPAARating variable. The type (numeric or character) of the categories vector is the same as the type of the MPAARating variable.

  3. Create a constant vector, count, to hold the results. The size of the count vector is the number of unique categories in MPAARating. The count vector is initialized to the zero vector, but count[i] is overwritten by the actual number of observations that belong to the ith category.

  4. Inside a loop over all categories, use the LOC function to find the observations that belong to the ith category. Notice that there is at least one observation in each category because the loop is over the categories found by the UNIQUE function. Therefore, you do not need to check whether idx is empty.

  5. The number of observations in the ith category is counted by using the NCOL function. For this computation, the actual observations are not needed, only the number of observations.

This algorithm is an extremely useful programming technique. You can also use this technique to loop over levels of a categorical variable and to conduct an analysis for the observations in each level.
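For example, the following sketch replaces the counting step with a statistic: it computes the mean of a numeric variable for each category. (The vectors c and x are illustrative; the mean is computed with the subscript reduction operator described in Section 3.5.)

/* compute a statistic for each category */
c = {"a", "b", "a", "b", "b"};        /* categorical variable     */
x = {1, 2, 3, 4, 6};                  /* numeric variable         */
categories = unique(c);
means = j(ncol(categories), 1, .);
do i = 1 to ncol(categories);
   idx = loc(c = categories[i]);      /* obs in the i_th category */
   means[i] = x[idx][:];              /* mean of those obs        */
end;
print means[rowname=categories];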

You can use the UNIQUE-LOC technique twice (by using two nested loops) in order to analyze observations that belong to the joint levels of two categorical variables. For example, the following statements create a two-way frequency table for the Year and MPAARating variables of the Movies data:

/* count the number of observations in each pair of categories */
use Sasuser.Movies;                          /* read data */
read all var {"Year" "MPAARating"};
close Sasuser.Movies;

/* create two-way frequency table */
uYear = unique(Year);
uMPAARating = unique(MPAARating);
Table = j(ncol(uYear), ncol(uMPAARating), 0);    /* 1 */
do j = 1 to ncol(uMPAARating);                   /* 2 */
   jdx = loc(MPAARating = uMPAARating[j]);       /* 3 */
   t = Year[jdx];                                /* 4 */
   do i = 1 to ncol(uYear);
      idx = loc(t=uYear[i]);                     /* 5 */
      Table[i,j] = ncol(idx);
   end;
end;
YearLabel = char(uYear,4);                       /* 6 */
print Table[rowname=YearLabel colname=uMPAARating];

Figure 3.14. A Two-Way Frequency Analysis of MPAA Rating by Year

The example follows the same general principles as the one-variable example:

  1. Create a matrix to hold the results before entering a loop.

  2. Loop over categories, not over observations.

  3. Use the LOC function to find observations in the j th category of the MPAARating variable.

In addition, the example adds new ideas:

  1. Before entering a second loop, restrict your attention to only the observations that satisfy the first condition (that is, movies with a particular MPAA rating). Notice that you do not have to check whether jdx is an empty matrix because the value uMPAARating[j] is one of the values returned by the UNIQUE function.

  2. Find movies of a given rating that also belong to the i th year. The element Table[i,j] contains the count of the number of movies in the joint level of Year and MPAARating. This count might be zero. Therefore, it is important to check that idx is nonempty if you use it as an index into the data.

  3. The ROWNAME= and COLNAME= options to the PRINT statement require character vectors. You can convert a numerical matrix to a character matrix by using the CHAR function in SAS/IML. (The Base PUTN function is also useful if you need to apply a format to the numerical values.)

The purpose of this example is not to form a two-way table, since the FREQ procedure already does this quite well. Rather, the purpose is to show a framework in which you can compute statistics on observations that belong to a joint level of two categorical variables. The matrix idx contains the observation numbers in the j th category of one variable and the i th category of the other variable.

This same technique enables you to loop over joint levels of two categorical variables and to conduct an analysis for the observations in each joint level.

As a closing comment, notice that the LOC function finds the observations in each category without sorting the data. Would this process be easier if you sort the data? Sometimes. If the goal of the analysis is to conduct BY group processing, then calling the SORT procedure followed by a DATA step or SAS/STAT procedure might be simpler and more efficient than programming the analysis in PROC IML. For example, if the statements that produce Figure 3.13 were changed to compute, say, the quartiles of US_Gross for each MPAARating category, you could compute the same quantities by using the following SAS procedures:

/* sort the data by MPAARating */
proc sort data=Sasuser.Movies out=movies;
   by MPAARating;
run;

/* compute quartiles for each BY group */
proc means data=movies noprint;
   by MPAARating;
   var US_Gross;
   output out=OutStat Q1=Gross_Q1 median=Gross_Q2 Q3=Gross_Q3;
run;

proc print data=OutStat noobs;
   var MPAARating Gross_Q1 Gross_Q2 Gross_Q3;
run;

Figure 3.15. Using Base SAS for BY Group Analysis

However, in general, the SAS/IML technique has value for several reasons:

  1. You might be writing the analysis in a SAS/IML module (see Section 3.4), or the data may not exist in any SAS data set. In these instances, it might be more efficient to compute a desired analysis completely in the SAS/IML language than to write the data to a data set, call a SAS procedure, and then read the results into SAS/IML matrices.

  2. The goal of the technique might not be to compute a statistic. In an IMLPlus program, you might want to color observations according to their category, or exclude observations from an analysis if some criterion is true.

  3. You can use the technique to compute statistics that are not available in any SAS procedure. For example, you can use the technique to implement a proprietary algorithm or to compute a new statistic that recently appeared in a journal article.

3.4 Defining SAS/IML Modules

A powerful programming feature in the SAS/IML language is the ability to define your own module. A module is a user-defined function or subroutine that can be called from programs. A module can return a value or not. A module that returns a value is called a function module. You can use a function module as part of an expression. A module that does not return a value is called a subroutine module. You can call a subroutine module by using the CALL or RUN statement.

A module enables you to reuse and maintain related SAS/IML statements in a convenient way. You can pass matrices from your program into the module to provide input data and to control the way the module behaves.

3.4.1 Function and Subroutine Modules

You define a module by using the START and FINISH statements. The START statement defines the name of the module and the arguments.

There are two kinds of modules: functions and subroutines. You can use the RETURN statement to return a value from a function module. A subroutine module usually modifies one or more matrices that are passed into the module.

For example, the following statements define a function module that returns the Pearson correlation coefficient between two column vectors:

/* module to compute Pearson correlation between two column vectors */
start CorrCoef(x, y);                  /* assume no missing values: */
   xStd = standard(x);                 /* standardize data          */
   yStd = standard(y);
   df = nrow(x) - 1;                   /* degrees of freedom        */
   return ( xStd` * yStd / df );
finish;
u = {1,2,3,4,5};
v = {2,3,1,4,4};
r = CorrCoef(u,v);
print r;

Figure 3.16. Result from a User-Defined Function Module

The module (named CorrCoef) is defined to accept two matrix arguments. Inside the module, these vectors are named x and y. The x and y vectors are standardized by calling the Standard module, which is included in the IMLMLIB module library. (The IMLMLIB library is described in "The IMLMLIB Library of Modules" on page 78.) The Standard module applies a linear transformation to the data so that it has zero mean and unit standard deviation: x → (x − x̄)/sx, where x̄ is the sample mean and sx is the sample standard deviation. The CorrCoef module returns the inner product of the standardized vectors, divided by the degrees of freedom. (Recall that the backquote operator (`) transposes a matrix.)

In spite of the recommendations in the section "Handling Missing Values" on page 65, the CorrCoef module does not handle missing values in the x and y vectors.
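One way to extend the module (a sketch, not from the text; the module name is illustrative) is to delete rows in which either vector has a missing value before standardizing:

/* sketch: Pearson correlation with listwise deletion of missing values */
start CorrCoefNM(x, y);
   idx = loc(x^=. & y^=.);             /* rows with both values present */
   if ncol(idx)=0 then return ( . );   /* no complete cases             */
   xs = standard(colvec(x[idx]));      /* standardize nonmissing data   */
   ys = standard(colvec(y[idx]));
   df = ncol(idx) - 1;                 /* degrees of freedom            */
   return ( xs` * ys / df );
finish;

The sketch uses the ColVec module from IMLMLIB to ensure that the extracted data form column vectors before they are standardized.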

3.4.2 Local Variables

A module usually maintains its own set of variables. This means that names used for variables inside a module do not conflict with variables of the same name that exist outside the module. Consequently, the matrix df defined inside the CorrCoef module does not overwrite or conflict with any other variable of the same name that might exist in some other module or in the main program.

If you use the SHOW NAMES statement inside a module, it displays the names, dimensions, and types of all matrices that are local to the module.

3.4.3 Global Symbols

You can force a variable to be a global variable by using the optional GLOBAL clause on the START statement for the module. An example of a module that uses a GLOBAL clause is shown in the following statements:

/* use the GLOBAL clause in a module definition */
start HasValue(x) global(g_Value);
   idx = loc(x=g_Value);              /* find value, if it exists   */
   return ( ncol(idx)>0 );            /* return 1 if value exists   */
finish;

g_Value = 1;                          /* global value to search for */
v = {4,2,1,3,8};                      /* data to search             */
z = HasValue(v);

A more realistic example of using the GLOBAL clause is described in Section 11.8.3.

In PROC IML, a module that has no arguments implicitly uses variables in the global scope of the program. This is discussed in Section 5.7.2.
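For example, the following sketch (not from the text) shows a module with no arguments that reads and creates variables in the main scope:

/* sketch: a module with no arguments shares the global symbol table */
start ComputeTotal;
   total = sum(v);     /* v and total refer to global variables */
finish;

v = {4,2,1,3,8};
run ComputeTotal;
print total;           /* total was created in the main scope   */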

3.4.4 Passing Arguments by Reference

SAS/IML passes matrices to modules "by reference," which means that if you change one of the arguments inside the module, the corresponding matrix also changes outside the module. This is demonstrated in the following example:

/* matrices are passed by reference to modules */
start ReverseRows(x);               /* define subroutine module     */
   r = x[nrow(x):1, ];              /* reverse rows of argument     */
   x = r;                           /* reassign the argument matrix */
finish;

u = {1,2,3,4,5};                    /* original values              */
run ReverseRows(u);                 /* values of u are changed      */
print u;

Figure 3.17. Result of Passing a Matrix by Reference

The module ReverseRows reverses the position of the rows of a matrix. Inside the module, the argument to the module is called x. The module reverses the rows of x and overwrites x with new values. The module alters the matrix u.

3.4.5 Evaluation of Arguments

The SAS/IML language resolves all matrix expressions prior to calling a subroutine. For example, consider the following statements:

/* pass temporary matrices to a function or subroutine */
w = {1 2, 2 3, 3 1, 4 4, 5 4};
r = CorrCoef(w[,1], w[,2]);

The statements call the CorrCoef module defined in Section 3.4.1. The arguments to the module are matrix expressions that each extract a column from the w matrix. When the SAS/IML language processes the CorrCoef function call, the following occurs:

  1. The expression w[,1] is evaluated. The results are placed in a temporary matrix called something like _TEM0001.

  2. The expression w[,2] is evaluated. The results are placed in a temporary matrix called something like _TEM0002.

  3. The _TEM0001 and _TEM0002 vectors are passed into the CorrCoef function as arguments to the module. The module computes the correlation coefficient for the centered data.

  4. When the CorrCoef module returns, the return value of the module is copied into the r matrix. The two temporary vectors, _TEM0001 and _TEM0002, are no longer needed, so they are deleted.

In general, SAS/IML allocates temporary matrices when necessary to hold intermediate expressions. These temporary matrices are deleted when they are no longer needed. In most cases, you do not need to worry about temporary matrices: SAS/IML software handles the creation and deletion without any intervention from you. However, if you use an expression as the argument to a module, SAS/IML creates a temporary matrix to hold the expression. If the module modifies that argument, then it is the temporary matrix that is modified. This is usually not what you intend.

For example, the following statements call the ReverseRows module defined in Section 3.4.4:

/* be careful when you pass a temporary matrix to a subroutine */
w = {1 2, 2 3, 3 1, 4 4, 5 4};
run ReverseRows(w[,1]);         /* w[,1] creates a temporary matrix */
print w;                        /* w is unchanged                   */

Figure 3.18. An Unmodified Matrix

In this example, the first column of the w matrix is passed as the input to the ReverseRows module. You might initially guess that the elements in the first column of w will be reversed by the module call. But this is not what happens! Instead, the following occurs:

  1. The SAS/IML language copies the first column of w into a temporary vector (say, _TEM0003) and passes that vector into the ReverseRows module.

  2. The module reverses the rows of _TEM0003.

  3. When the subroutine returns, the temporary vector _TEM0003 is deleted.

Because a copy of the column was passed in, the matrix w is unchanged.
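If you intend to modify a column of a matrix, one approach (a sketch, not from the text) is to copy the column into a matrix, pass that matrix to the module, and then assign the result back into the column:

/* copy a column so that a module can modify it */
w = {1 2, 2 3, 3 1, 4 4, 5 4};
col = w[,1];               /* col is a matrix, not a temporary */
run ReverseRows(col);      /* the values of col are reversed   */
w[,1] = col;               /* copy the reversed values back    */
print w;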

3.4.6 Storing Modules

An important advantage of modules is that you can write and debug a module and then store it so that you can call it in future programs. You store a module by using the STORE statement. In SAS/IML you need to use the LOAD statement to load a stored module before you can call it. (In IMLPlus, modules are stored into Windows directories. SAS/IML Studio automatically searches certain directories in order to find the definition for an unresolved module. See the section "IMLPlus Modules" on page 114.)

As a final example of writing a module, the following statements define and store a module that returns the rows in a matrix that contain only nonmissing values. You can use this module to delete all rows that contain a missing value in any variable.

/* create a module to return the nonmissing rows of a matrix */
start LocNonMissingRows(x);
   c = cmiss(x);                /* matrix of 0 and 1's              */
   r = c[,+];                   /* number of missing in row         */
   nonMissingIdx = loc(r=0);    /* rows that do not contain missing */
   return ( nonMissingIdx );
finish;
store module=LocNonMissingRows;
z = {1 2, 3 ., 5 6, 7 8, . 10, 11 12};
nonMissing = LocNonMissingRows(z);
if ncol(nonMissing)>0 then
   z = z[nonMissing, ];
print z;

Figure 3.19. Nonmissing Rows of a Matrix

The LocNonMissingRows module calls the CMISS function in Base SAS software to create a matrix, c, of zeros and ones. The matrix returned by the CMISS function has the value 1 at locations for which the corresponding element of x is missing, and has the value 0 at all other locations. The column vector r is the sum across each row of all of the elements of c. The vector r is computed by using the summation subscript reduction operator (+), which is described in Section 3.5.2. The module uses the LOC function to find all rows that contain no missing values. These row numbers are returned by the function. In the example, the module returns the vector {1 3 4 6}. The example then extracts these rows from the matrix z, which results in a matrix that does not contain any missing values.

If the argument to the module has a missing value in every row, then the module returns an empty matrix. Consequently, a careful programmer will check that the matrix nonMissing is nonempty prior to using it. If the module is sent a matrix, z, that does not contain missing values, the module returns a vector identical to 1:nrow(z).

In PROC IML, the STORE statement stores the LocNonMissingRows module in the default "storage library," which is a SAS catalog that is used to store modules and matrices. The default storage library is Work.IMLStore. Notice that this catalog is in the Work library and recall that the Work library vanishes when you exit the SAS System. Consequently, if you want a stored module to persist between SAS sessions, you need to save the module to a permanent library.

You can set the storage location for a module in PROC IML by using the RESET STORAGE statement. For example, the following statements store the LocNonMissingRows module in a catalog in the Sasuser library:

reset storage=Sasuser.MyModules;          /* set location for storage */
store module=LocNonMissingRows;

By using the RESET STORAGE statement prior to any STORE statement, you can ensure that you are storing modules to a permanent storage location. Similarly, you can use the RESET STORAGE statement prior to a LOAD statement to load a module definition from a permanent storage location.

In SAS/IML Studio, the STORE and LOAD statements work differently. For details, see the section "IMLPlus Modules" on page 114.

In SAS/IML 9.22, you can use the COUNTMISS function to write the LocNonMissingRows module more efficiently:

/* SAS/IML 9.22: a module to return the nonmissing rows of a matrix */
start LocNonMissingRows(x);
   r = countmiss(x, "row");     /* number of missing in row         */
   nonMissingIdx = loc(r=0);    /* rows that do not contain missing */
   return ( nonMissingIdx );
finish;

This module is more efficient because it uses less memory. The first version uses the CMISS function to create a matrix that is the same size as the input matrix x. The second version of the function does not create that (potentially large) matrix.

3.4.7 The IMLMLIB Library of Modules

SAS/IML software is distributed with a set of predefined modules called the IMLMLIB modules, because the modules are stored in a SAS catalog called IMLMLIB. You can call these modules without having to LOAD them. Table 3.1 describes some of the modules that are useful in data analysis.

Table 3.1. Frequently Used Modules in IMLMLIB

  Statement        Description
  ColVec           Reshapes a matrix into a column vector.
  Corr             Computes a correlation matrix from a data matrix. This module
                   was replaced by the more general CORR function in SAS/IML 9.22.
  Median           Computes the median value for each column of a matrix.
  Quartile         Computes the minimum, lower quartile, median, upper quartile,
                   and maximum values for each column of a matrix.
  RandNormal, ...  The RandNormal module generates a random sample from a
                   multivariate normal distribution. There are also modules for
                   generating random samples from other multivariate distributions.
  RowVec           Reshapes a matrix into a row vector.
  Standard         Standardizes each column of a matrix.

3.5 Writing Efficient SAS/IML Programs

What does it mean to write efficient SAS/IML programs? The simplest rule to follow is "whenever possible, avoid writing loops." SAS/IML provides many high-level operators and built-in subroutines for working with matrices and vectors, rather than working with each element of an array.

The previous section introduced the LOC and UNIQUE functions. A first step to writing efficient programs is to use those functions whenever you can.

3.5.1 Avoid Loops to Improve Performance

Take another look at the algorithm on page 68 that counts the number of observations that belong to each category. That algorithm contains a loop over the categories, but no loop over the observations. This is in sharp contrast to the following inefficient statements, which compute the same quantities as the statements that produce Figure 3.13.

/* inefficient algorithm which loops over observations */
use Sasuser.Movies;
read all var {"MPAARating"};                      /* 1 */
close Sasuser.Movies;

/* inefficient: do not imitate */
categories = MPAARating[1];                       /* 2 */
count = 1;
do obs = 2 to nrow(MPAARating);                   /* 3 */
   i = loc(categories = MPAARating[obs]);         /* 4 */
   if ncol(i) = 0 then do;                        /* 5 */
      categories = categories // MPAARating[obs];
      count = count // 1;
   end;
   else count[i] = count[i] +1;                   /* 6 */
end;
print count[rowname=categories];

Figure 3.20. Result of Inefficient Statements

The main steps of this program are as follows:

  1. Read the MPAARating variable.

  2. Initialize the matrix categories with the first category. When the program encounters a category that has not previously been observed, the categories vector will grow.

  3. Loop over all observations.

  4. Inside the loop, compare the current observation with the categories that have been seen previously. (Some beginning programmers replace this statement with a second loop over the number of categories, leading to even worse performance!)

  5. If i is empty, then MPAARating[obs] is a new category. Use vertical concatenation to add a new row to the vector that keeps track of categories and to the vector that keeps track of the count for each category.

  6. If i is not empty, then increment count[i], the number of observations encountered for this category.

While the two algorithms compute the same quantity, the statements that produce Figure 3.13 are more efficient for the following reasons:

  • The efficient algorithm loops over the number of unique categories. This is typically a small number. The inefficient algorithm loops over the number of observations, which is typically a much larger number.

  • The efficient algorithm determines the size of all vectors before beginning the loop and preallocates any storage it needs. The inefficient algorithm uses the vertical concatenation operator (//) to dynamically enlarge the result vectors every time a new category is encountered. This inefficient technique is discussed in the section "Concatenation Operators" on page 41.

Not only is the program that produces Figure 3.13 more efficient than the program that produces Figure 3.20, but also the results of the former program are computed in a standard form. The categories of MPAARating are produced in alphabetical order, and therefore the results are independent of the way that the Movies data set is sorted. In contrast, the second algorithm computes the results in the order that the categories appear in the data set.

3.5.2 Use Subscript Reduction Operators

One way to avoid writing unnecessary loops is to take full advantage of the subscript reduction operators for matrices. These operators enable you to perform common statistical operations (such as sums, means, and sums of squares) on either the rows or the columns of a matrix. For example, the following statements compute the sum and mean of columns and of rows for a matrix; the results are shown in Figure 3.21:

/* compute sum and mean of each column */
x = {1 2 3,
     4 5 6,
     7 8 9,
     4 3 .};
colSums = x[+, ];
colMeans = x[:, ];
rowSums = x[ ,+];
rowMeans = x[ ,:];
print colSums, colMeans, rowSums rowMeans;

Figure 3.21. Sums and Means of Rows and Columns

The expression x[+, ] uses the '+' subscript operator to "reduce" the matrix by summing the elements within each column; the result is a row vector that contains the column sums. (Recall that not specifying the second subscript is equivalent to specifying all columns.) The expression x[:, ] uses the ':' subscript operator to compute the mean of each column. The row sums and row means are computed similarly. Note that the subscript reduction operators correctly handle the missing value in the third column.

A common use of subscript reduction operators is to compute the marginal frequencies in a two-way frequency table. For example, the following statements compute the row total, column total, and grand total for a frequency table of two variables in the Movies data:

/* compute column sums, row sums, and grand sum for a matrix */
uYear = 2005:2007;
uMPAARating = {"G"    "NR"    "PG"    "PG-13"   "R"};
Table = {       6       0      26       63       42,
                5       3      23       57       48,
                2       0      17       29       38};
colSums = Table[+, ];
rowSums = Table[ ,+];
Total = Table[+];
YearLabel = char(uYear, 7);
print Table[rowname=YearLabel colname=uMPAARating] rowSums,
      colSums[rowname="colSums" label=""] Total[label=""];

Figure 3.22. Marginal Frequencies for a Matrix

The section "Analyzing Observations by Categories" on page 68 describes how to compute the cell counts for this table from the Movies data set. The new feature in this example is the use of the ' + ' subscript operator to compute three different marginal sums.

The following table summarizes the subscript reduction operators for matrices and an equivalent way to perform the operation that uses function calls.

Table 3.2. Subscript Reduction Operators for Matrices

  Operator   Action              Equivalent Function
  +          Addition            sum(x)
  #          Multiplication      prod(x)              /* SAS/IML 9.22 */
  ><         Minimum             min(x)
  <>         Maximum             max(x)
  >:<        Index of minimum    loc(x=min(x))[1]
  <:>        Index of maximum    loc(x=max(x))[1]
  :          Mean                mean(x)              /* SAS/IML 9.22 */
  ##         Sum of squares      ssq(x)

Beginning in SAS/IML 9.22, the SAS/IML language supports a MEAN function and a PROD function. The MEAN function computes the mean of each column in a matrix, so it is equivalent to x[:, ]. The PROD function returns the product of all nonmissing elements of its arguments.
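For example, the following statements (a small sketch, not from the text) show that the MEAN function and the ':' subscript reduction operator compute the same column means:

/* the MEAN function is equivalent to the ':' reduction operator */
x = {1 2 3, 4 5 6, 7 8 9};
m1 = x[:, ];               /* subscript reduction operator */
m2 = mean(x);              /* SAS/IML 9.22 function        */
print m1, m2;              /* both equal {4 5 6}           */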

3.5.3 Case Study: Standardizing the Columns of a Matrix

The CorrCoef module in the section "Defining SAS/IML Modules" on page 72 calls the standard module in the IMLMLIB module library. This section describes how you can write statements that reproduce the results of that module.

The following statements standardize each column in a numeric matrix to have mean zero and unit standard deviation:

/* standardize data: assume no missing values and no constant column */
x = {7 7, 8 9, 7 9, 5 7, 8 8};
xc = x - x[:,];              /* 1. center the data                  */
ss = xc[##,];                /* 2. sum of squares for columns of xc */
n = nrow(x);                 /* 3. assume no missing values         */
std = sqrt(ss/(n-1));        /* 4. sample standard deviation        */
std_x = xc / std;            /* 5. divide each column (standardize) */
print std_x;

The output from the program is shown in Figure 3.23. Each statement is explained in the following list:

  1. The row vector x[:,] contains the means of each column of x. Consequently, the matrix xc contains the centered data: the mean of each column is subtracted from the data in that column.

  2. The row vector ss contains the sum of squares for each centered column.

  3. Assuming that there are no missing values, the scalar quantity n contains the number of nonmissing values in each column.

  4. The row vector std contains the standard deviation of each column. Each element of std is nonzero, provided that no column of x is constant (that is, contains the same value for all observations).

  5. The matrix std_x contains the standardized data that results from dividing the matrix xc by the row vector std.


Figure 3.23. Standardized Columns of a Matrix

You can relax the assumption that the data do not contain missing values: count the number of nonmissing observations in each column and compute the standard deviation for each column by using the number of nonmissing values for that column. This is left as an exercise for the reader.

Beginning in SAS/IML 9.22, the SAS/IML language supports a MEAN function that computes the mean of each column and a VAR function that computes the variance of each column. With the addition of these functions, the following statements standardize the data:

/* standardize data: SAS/IML 9.22 and beyond */
xc = x - mean(x);               /* center the data                  */
std = sqrt(var(xc));            /* sample standard deviation        */
std_x = xc / std;               /* divide each column (standardize) */

The MEAN and VAR functions both handle missing values. Consequently, the previous statements correctly standardize data that contain missing values.

3.6 Case Study: Finding the Minimum of a Function

This section shows how you can write a SAS/IML module to minimize a continuous function of one variable on a closed interval. To make the problem easier, assume that the function is unimodal, meaning that it has a single minimum on the interval.

There are many techniques for finding the minimum of a continuous function on a closed interval. The technique presented in this section is known as the golden section search (Thisted 1988). The basic idea is to bracket the minimum, meaning that you must find three points a < x1 < b such that the value f (x1) is less than the function evaluated at either endpoint. In the golden section search, the initial interval [a, x1] is smaller than the interval [x1, b]. Therefore, a new point x2 in [x1, b] is chosen and the function is evaluated at x2. If f (x1) < f (x2), then a < x1 < x2 is a new smaller interval that brackets the minimum. Otherwise, x1 < x2 < b is a new smaller interval that brackets the minimum. In either case, the process repeats with the new smaller interval. This process is schematically indicated in Figure 3.24.


Figure 3.24. A Schematic Illustration of the Golden Section Search

In the figure, a = 0, b = 1, x1 = 0.382, x2 = 0.618, and the function is minimized at the value indicated by the vertical dotted line. Because f (x1) < f (x2), the algorithm chooses [a, x2] as the new bracketing interval. A new point is chosen in the larger of the intervals [a, x1] or [x1, x2]. For the example in the figure, the new point is x3 = 0.236. Because f (x3) < f (x1), the algorithm chooses [a, x1] as the new bracketing interval, and so forth.

The golden section algorithm guarantees that the length of the bracketing interval decreases at a constant rate, and that rate is related to the "golden ratio" that gives the algorithm its name. The following statements define a SAS/IML module that implements the golden section search in order to find the value of x that minimizes the function defined in the module Func:

/*  define module that implements the golden section search */
start GoldenSection(a0,b0,eps);
/* Find the value that minimizes the continuous unimodal function
 * f (defined by the module "Func") on the interval [a,b] by using
 * a golden section search. The algorithm stops when b-a < eps.
 *
 * Example:
 * start Func(x);
 *    return  ( x#(x-0.5) );
 * finish;
 * x0 = GoldenSection(0,1,1e-4);
 */

w = (sqrt(5)-1)/2;                     /* "golden ratio" - 1        */
a = a0; b = b0;
x1 = w*a + (1-w)*b;
x2 = (1-w)*a + w*b;
fx1 = Func(x1);
fx2 = Func(x2);
do while (b-a>eps);
   if fx1 < fx2 then do;               /* choose new right endpoint */
      b = x2;                          /* update x1 and x2          */
      x2 = x1;
      fx2 = fx1;
      x1 = w*a + (1-w)*b;
      fx1 = Func(x1);
   end;
   else do;                            /* choose new left endpoint  */
      a = x1;                          /* update x1 and x2          */
      x1 = x2;
      fx1 = fx2;
      x2 = (1-w)*a + w*b;
      fx2 = Func(x2);
   end;
end;
return ( choose(fx1<=fx2, a, b) );
finish;

To test the module, the following statements define a module that returns values of the function f (x) = x(x − 0.5). This function is minimized at x = 0.25. The iteration history is not printed, but is shown in Figure 3.24 for this function. (Incidentally, Figure 3.24 was created in SAS/IML Studio by using drawing commands that are discussed in Chapter 7, "Creating Statistical Graphs.") The output is shown in Figure 3.25.

/* Minimize unimodal function by using GoldenSection module */
/* define unimodal function on [0,1]. Minimum at x=0.25.    */
start Func(x);
   return  ( x#(x-0.5) );
finish;

/* search for minimum on an interval */
x0 = GoldenSection(0, 1, 1e-4);
y = Func(x0);
QuadSoln = x0 || y;
print QuadSoln[colname={"XMin", "f(XMin)"}];

Figure 3.25. Minimum Found by Golden Section Search

Although the golden section search assumes that the function is unimodal, you can often use it in conjunction with other algorithms to refine intervals where a local minimum occurs. You might be interested in a function that has many local minima, but if you can find an interval that brackets a given minimum, you can use the golden section search to home in on the minimum value in that interval.

For example, Figure 3.26 shows the graph of the function f (x) = x(x − 0.5) + 0.1 sin(6πx) on the interval [0,1]. The global minimum occurs at x = 0.25. If you evaluate the function on a coarse grid, as shown in Figure 3.26, you can assume that the true minimum of the function occurs near the smallest value of the function on the grid. The evaluation of f on a grid is a "presearch," which enables you to find a bracketing interval for the minimum. In Figure 3.26 the function f is evaluated at 0, 0.1, ..., 1. The smallest value of f on that set of points occurs at 0.2, so you can use the bracketing interval [0.1, 0.3] as an initial interval for the golden section search.


Figure 3.26. Locating an Interval That Brackets a Minimum

The following statements use this technique to find the minimum of the function f on the interval [0,1]:

/* evaluate function on a grid, followed by golden section search */
/* This function has three local minima on [0,1]. Global min at 0.25. */
start Func(x);
   pi = constant('PI');                               /* 3.14159... */
   return ( x#(x-0.5) + 0.1 *sin(6*pi*x) );
finish;

/* use "presearch" evaluation on coarse grid on [0,1] */
x = do(0, 1, 0.1);
y = Func(x);                     /* 1. evaluate at grid on [0,1]    */
yMinIdx = y[>:<];                /* 2. find index of minimum value  */
aIdx = max(yMinIdx-1, 1);        /* 3. find index of left endpoint  */
bIdx = min(yMinIdx+1, ncol(x));  /*    find index of right endpoint */
a = x[aIdx];                     /* assume global min in interval   */
b = x[bIdx];
print a b;

x0 = GoldenSection(a, b, 1e-4);  /* 4. finds minimum on [a,b]       */
y = Func(x0);
Soln = x0 || y;
print Soln[colname={"XMin", "f(XMin)"}];

Figure 3.27. Minimum Found by Presearch Followed by Golden Section Search

The following list explains a few of the steps in the example:

  1. Notice that the function module is written so that its argument, x, can be a vector. The quadratic term in the function is implemented by using the elementwise multiplication operator, #. This enables you to call Func on every point in the grid with a single call to the module.

  2. The example uses the "index of minimum" operator (>:<) to find the (first) index of y with the smallest value. As shown in Table 3.2, the expression y[>:<] is equivalent to the longer expression y[loc(y=min(y))[1]].

  3. Handle the case where the minimum value of the function occurs at an endpoint of the interval. In the example, the minimum value of the function on the grid occurs for yMinIdx equal to 3. The bracketing interval is therefore chosen to have endpoints x[2] and x[4]. But if yMinIdx were equal to 1, the endpoints of the bracketing interval would need to be x[1] and x[2]. The case where the minimum occurs at x[11] is handled similarly.

  4. The minimum is found on the bracketing interval [a, b]. Note in Figure 3.26 that the function is actually unimodal on this interval.
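The presearch in steps 1–3 can be sketched in Python as follows. This is a hypothetical transliteration, not the book's SAS/IML code; note that Python lists are 0-based, so the endpoint clamping uses 0 and len(xs)−1 where the SAS/IML code uses 1 and ncol(x).

```python
import math

def func(x):
    # three local minima on [0, 1]; global minimum at x = 0.25
    return x * (x - 0.5) + 0.1 * math.sin(6.0 * math.pi * x)

xs = [i / 10.0 for i in range(11)]      # 1. coarse grid 0, 0.1, ..., 1
ys = [func(x) for x in xs]              #    evaluate f at each grid point
i_min = ys.index(min(ys))               # 2. index of smallest grid value
a_idx = max(i_min - 1, 0)               # 3. clamp bracket at the left edge
b_idx = min(i_min + 1, len(xs) - 1)     #    clamp bracket at the right edge
a, b = xs[a_idx], xs[b_idx]             # bracketing interval for step 4
```

The resulting interval [a, b] contains the global minimizer x = 0.25 and can be passed to a golden section search as in step 4.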

The example requires that the module be named Func, but in IMLPlus you can use the ALIAS statement to remove this restriction. For details, see "Creating an Alias for a Module" on page 117.

3.7 References

[bibch03_01] Thisted, R. A. (1988), Elements of Statistical Computing: Numerical Computation, London: Chapman & Hall.
