Chapter 11. Calling Functions in the R Language

Contents

  • 11.1 Overview of Calling Functions in the R Language 251

  • 11.2 Introduction to the R Language 252

  • 11.3 Calling R Functions from IMLPlus 252

  • 11.4 Data Frames and Matrices: Passing Data to R 254

    • 11.4.1 Transferring SAS Data to R 254

    • 11.4.2 What Happens to the Data Attributes? 256

    • 11.4.3 Transferring Data from R to SAS Software 257

  • 11.5 Importing Complicated R Objects 259

  • 11.6 Handling Missing Values 261

    • 11.6.1 R Functions and Missing Values 261

    • 11.6.2 Merging R Results with Data That Contain Missing Values 262

  • 11.7 Calling R Packages 264

    • 11.7.1 Installing a Package 264

    • 11.7.2 Calling Functions in a Package 264

  • 11.8 Case Study: Optimizing a Smoothing Parameter 267

    • 11.8.1 Computing a Loess Smoother in R 268

    • 11.8.2 Computing an AICC Statistic in R 269

    • 11.8.3 Encapsulating R Statements into a SAS/IML Module 270

    • 11.8.4 Finding the Best Smoother by Minimizing the AICC Statistic 271

    • 11.8.5 Conclusions 272

  • 11.9 Creating Graphics in R 273

  • 11.10 References 278

11.1 Overview of Calling Functions in the R Language

Just as you can call SAS statements from an IMLPlus program, you can also call functions in the R language. This chapter shows how to do the following:

  • transfer data between R and SAS software

  • import the results of a statistical analysis that was computed in R

  • reconcile differences between the ways that R and SAS software handle missing values

  • incorporate R code into an IMLPlus module

  • create graphics in R from an IMLPlus program

11.2 Introduction to the R Language

R is an open-source language and environment for statistical computing. The R language has many similarities with SAS/IML: it supports matrix-vector computations, it has a rich library of statistical functions, and it is intended primarily for researchers and statistical programmers. R supports the creation and distribution of packages: user-written functions, data, and documentation that can be added to the R library.

These packages are sometimes written by leading researchers in the academic statistical community. In some fields of statistics, it is common for authors of journal articles and books to create an R package that accompanies the publication. The R package typically includes functions that implement the main ideas of the publication and also provides data used in the publication. In this way, researchers disseminate not only their statistical ideas but also a particular implementation of their ideas.

There are over 2,400 R packages. Some packages are general purpose libraries that implement functions in a particular statistical area. For example, the boot package (Canty and Ripley 2009) contains "functions and datasets for bootstrapping from the book Bootstrap Methods and Their Applications by A. C. Davison and D. V. Hinkley" (Davison and Hinkley 1997). Other packages contain a few functions (or sometimes just one!) that implement particular computations from a journal article or from someone's class project, masters thesis, or PhD dissertation.

SAS/IML Studio 3.2 (released in July 2009) was the first SAS product to offer an interface to the R language. Beginning in SAS/IML 9.22, you can also call R from PROC IML. This chapter describes how to exchange data between R and various SAS data formats, and how to call functions in the R language. This chapter does not teach you how to program in R, but rather how to incorporate knowledge of R into your SAS programs.

The chapter assumes that a 32-bit Windows version of R is installed on the same PC that runs SAS/IML Studio. You can download R from CRAN: The Comprehensive R Archive Network at http://cran.r-project.org. The SAS/IML Studio online Help lists versions of R that are supported.

11.3 Calling R Functions from IMLPlus

You can call R functions from an IMLPlus program by using the SUBMIT and ENDSUBMIT statements. These are the same statements that are used to run SAS statements and procedures. The difference is that to call R functions, you use the R option in the SUBMIT statement as shown in the following example:

/* call R functions */
submit / R;
   n <- 100                  # 100 obs
   set.seed(1)               # seed for pseudorandom number generation
   x <- runif(n)             # x is pseudorandom uniform of length n
   y <- 3*x + 2 + rnorm(n)   # y is linearly related to x
   model <- lm(y~x)          # model y as a function of x
   summary(model)            # print summary of linear model
endsubmit;

The statements in the SUBMIT block are sent to R for interpreting. The left-pointing arrow, <-, is the R assignment operator. The first four R statements create a vector y that is linearly related to x, but that contains errors randomly generated from the standard normal distribution. The set.seed function sets a seed value for the random number generator used by the rnorm and runif functions. (See the section "Using Functions to Create Matrices" on page 24 for analogous SAS/IML statements.) The lm function uses the method of least squares to model the response y as a function of the explanatory variable x. The summary function displays relevant statistics for the model, as shown in Figure 11.1. In SAS/IML Studio, R output appears in the current output window; in PROC IML, R output is sent to all open ODS destinations.

Output from R Statements

Figure 11.1. Output from R Statements

11.4 Data Frames and Matrices: Passing Data to R

In most cases, an R function requires data in the form of an R matrix or an R data frame. A data frame is an R object that stores variables and observations. Like a SAS data set or an IMLPlus data object, a data frame can contain mixed types of variables (numeric or character), whereas a matrix can contain only a single type of data. This section describes how to transfer SAS data sets, data objects, or matrices to an R matrix or data frame.

11.4.1 Transferring SAS Data to R

You can transfer data from three sources: a SAS data set in a SAS library (such as Work, Sasuser, and so on), a data object, and a SAS/IML matrix.

You can transfer a SAS data set by calling the ExportDataSetToR module. The first argument to the module is the two-level name (libref.filename) of the SAS data set. The second argument is the name of a data frame in R to contain the data. For example, the following statements create a data frame called MovieFrame that contains a copy of the data in the Sasuser.Movies data set:

/* transfer data from SAS data set to R data frame */
run ExportDataSetToR("Sasuser.Movies", "MovieFrame");
submit / R;
   names(MovieFrame)
   summary(MovieFrame$Budget)
endsubmit;

The names function displays the names of variables in a data frame. As the example indicates, you can use a dollar sign ($) to refer to a particular variable within a data frame. The summary function prints some descriptive statistics about the variable, as shown in Figure 11.2.

Statistics from R (Data from a SAS Data Set)

Figure 11.2. Statistics from R (Data from a SAS Data Set)

If your data are in a data object, you can transfer all variables in a DataObject by calling the Ex-portToR method in the DataObject class. The method takes a single argument: the name of the data frame in R to contain the data. This is shown in the following statements:

/* transfer data from data object to R data frame */
declare DataObject dobj;
Dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.ExportToR("MovieFrame");                /* write all variables */
submit / R;
   summary(MovieFrame$ReleaseDate)
endsubmit;
Statistics from R (Data from a Data Object)

Figure 11.3. Statistics from R (Data from a Data Object)

Lastly, you can transfer data from a SAS/IML matrix by calling the ExportMatrixToR module. The following statements transfer the data from a SAS/IML vector into an R matrix:

/* transfer data from SAS/IML matrix to R matrix */
dobj.GetVarData("World_Gross", w);        /* get data into vector w */
run ExportMatrixToR(w, "WorldG");         /* send matrix to R       */
submit / R;
   summary(WorldG)
endsubmit;

The statements create a SAS/IML vector w that contains the data in the World_Gross variable. Those data are then transferred to the R matrix named WorldG. The output is shown in Figure 11.4. Notice that the output is column-oriented (compare with Figure 11.3) because the summary function behaves differently for a matrix than for a variable in a data frame.

Statistics from R (Data from a SAS/IML Matrix)

Figure 11.4. Statistics from R (Data from a SAS/IML Matrix)

Notice also in Figure 11.4 that there are 25 NAs in the World vector. The NA represents a missing value in R. A SAS missing value is automatically converted to NA when data are transferred to R.

The following table summarizes the frequently used methods and modules that transfer SAS data to R. For details of the data transfer, see the online Help chapter titled "Accessing R."

Table 11.1. Methods and Modules for Transferring Data from SAS Software to R

Method/Module

Data Source

Data Destination

DataObject.ExportToR

DataObject

R data frame

ExportDataSetToR

SAS data set in libref

R data frame

ExportMatrixToR

SAS/IML matrix

R matrix

11.4.2 What Happens to the Data Attributes?

Although SAS data (numbers and strings) can be sent to R, the properties that are associated with variables and the attributes associated with observations are not, in general, transferred to R. R does not support the same properties of variables that SAS software uses. For example, SAS data sets associate formats, lengths, and labels with variables, but R does not. Similarly, an IMLPlus data object maintains attributes for observations, including marker color, shape, and whether the observation is selected. In general, these attributes are not transferred to R when the data are transferred.

However, there are two variable properties that receive special handling: variables with date or time formats, and IMLPlus nominal variables.

When the ExportDataSetToR module or the ExportToR method transfers a variable with a DATEw., TIMEw., or DATETIMEw. format, the corresponding R variable is automatically assigned a special R class that indicates that the variable contains time or date information. A SAS variable with the DATEw. format is transferred to an R variable with the "Date" class. A SAS variable with a TIMEw. or DATETIMEw. format is transferred to an R variable with the "POSIXct" class.

Similarly, when the ExportToR method of the DataObject class transfers a nominal variable to R, the corresponding R variable is automatically assigned the "factor" class. A factor is a storage class in R that is used for categorical variables. For statistical modeling, using a factor in R to represent a variable is similar to listing the variable on a CLASS statement in a SAS procedure. Factors are treated as classification variables in analyses.

For example, in the Movies data set, the variable ReleaseDate has a DATE9. format. When the Movies data set is used to create an R data frame, the corresponding ReleaseDate variable in R is automatically assigned the "Date" class. Similarly, the NumAANomin variable is a numeric nominal variable in the IMLPlus data object; the corresponding R variable is assigned the "factor" class in R. This is shown in the following statements and in Figure 11.5:

/* show that data transfer preserves some variable properties */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.ExportToR("MovieFrame");                /* write all variables */
submit / R;
   class(MovieFrame$ReleaseDate)
   class(MovieFrame$NumAANomin)
endsubmit;
Classes of R Objects Reflect Variable Properties

Figure 11.5. Classes of R Objects Reflect Variable Properties

Note that not all formats are preserved when data are transferred to R. For example, the Budget variable has a DOLLAR8.1 variable in the SAS data set, but Figure 11.2 shows that the corresponding R variable is just an ordinary numeric variable with no special characteristics.

The SAS/IML Studio online Help provides details about the differences between the ways that SAS and R software handle date and time values.

11.4.3 Transferring Data from R to SAS Software

You can also transfer data from R objects to SAS software. The source (an R data frame or matrix) and destination (a data object, SAS dataset, or SAS/IML matrix) determine which module or method you should use.

If the data are contained in an R data frame (or any expression that can be converted to a data frame), you can create a data object by using the CreateFromR method of the DataObject class, as shown in the following statements.

/* transfer data from R data frame to data object */
declare DataObject dobjR;
dobjR = DataObject.CreateFromR("Old Faithful Data", "faithful");
dobjR.GetVarNames(VarNames);
dobjR.GetDimensions(NumVar, NumObs);
print VarNames NumVar NumObs;

The CreateFromR method has two arguments: the first names the data object and the second gives the name of an R data frame (or an expression that can be converted to a data frame). The previous example creates a data object from the faithful data frame that contains data about eruptions of the Old Faithful geyser in Yellowstone National Park. After the data object is created, the Get-VarNames and GetDimensions methods are called to get basic information about the data. This information is displayed in Figure 11.6.

Attributes of a Data Object Created from a Data Frame

Figure 11.6. Attributes of a Data Object Created from a Data Frame

You can also transfer R data directly to a SAS data set by using the ImportDataSetFromR module. You can transfer data that are contained in an R data frame, an R matrix, or any expression that can be converted to either of these objects. For example, the following statement transfers the Old Faithful data directly to a SAS data set in the Work library:

run ImportDataSetFromR("Work.OldFaithful", "faithful");

You can also create a SAS/IML matrix from either an R data frame or an R matrix by using the ImportMatrixFromR module, as shown in the following statements:

/* transfer data from R matrix to SAS/IML matrix */
run ImportMatrixFromR(w, "faithful$waiting");
q = quartile(w);
print q[rowname={"Min" "Q1" "Median" "Q3" "Max"}];

The output from this example is shown in Figure 11.7 and shows summary statistics for the waiting variable in the faithful data frame. In particular, the median waiting time between eruptions for Old Faithful was 76 minutes for the eruptions recorded in these data.

Summary of Data in a SAS/IML Matrix Created from R Data

Figure 11.7. Summary of Data in a SAS/IML Matrix Created from R Data

Since SAS/IML matrices are either character or numeric, you need to be careful if you transfer data from an R data frame into a SAS/IML matrix. The ImportMatrixFromR module requires that the R data be either all character or all numeric. If df is a data frame, you can always extract the numeric variables with the command df[sapply(df,is.numeric)] and the character variables with df[sapply(df,is.factor)].

Finally, it is possible to add a new variable to an existing DataObject by using the AddVarFromR method in the DataObject class. For example, the R expression faithful$waiting>70 creates a logical vector that indicates which eruptions of Old Faithful occurred more than 70 minutes after the previous eruption. The following statement uses this R expression to create an indicator variable in the dobjR data set that was created earlier in this section:

dobjR.AddVarFromR("wait70", "faithful$waiting>70");

In order for the AddVarFromR method to succeed, the length of the R vector must be the same as the number of observations in the data object.

The following tables summarizes the frequently used methods that transfer data between SAS and R. For details of the data transfer and a full list of methods to transfer data, see the chapter titled "Accessing R" in the SAS/IML Studio online Help.

Table 11.2. Methods and Modules for Transferring Data from R to SAS Software

Method/Module

Data Source

Data Destination

DataObject.CreateFromR

R expression

DataObject

DataObject.AddVarFromR

R expression

DataObject variable

ImportDataSetFromR

R expression

SAS data set in libref

ImportMatrixFromR

R expression

SAS/IML matrix

11.5 Importing Complicated R Objects

The result returned by an R function is often an R object. Not all objects can be converted to a data frame. If m is an object in R, you can make sure it is convertible by typing as.data.frame(m). If this call fails, use str(m) to examine the structure of the object. Often the object is composed of simpler objects (or components) such as a vector, matrix, or list. These components can be readily imported into SAS/IML matrices by using the methods and modules that are described in the previous section.

For example, consider the object returned by the lm function in R. The object is complicated, as partially shown in Figure 11.8. The figure is created by the following statements:

/* create R object */
submit / R;
   lm.obj <- lm(waiting~eruptions, data=faithful)
   str(lm.obj)
endsubmit;
A Partial Listing of an R Data Structure

Figure 11.8. A Partial Listing of an R Data Structure

The first line in the output from the str function is "List of 12." This tells you that the lm.obj object contains 12 components. The description of each component begins with a dollar sign ($) followed by the name of the component. For example, the first object is lm.obj$coefficients, so you could get these regression coefficients into a SAS/IML matrix by using the following statements:

/* access coefficients of R object. First technique. */
submit / R;
   coef <- lm.obj$coefficients    # direct access of coefficients
endsubmit;
run ImportMatrixFromR(c, "coef");

Or, more concisely, you can omit the entire SUBMIT block, as shown in the following statement:

/* access coefficients of R object. Second technique. */
run ImportMatrixFromR(c, "lm.obj$coefficients");

Because the second argument to the ImportMatrixFromR module can be any R expression, both of the previous examples produce the same result.

This technique works, but the R documentation discourages the direct access of components for some analysis objects such as the object of the "lm" class in the example. Instead, each class is supposed to overload certain "helper functions" (R calls them generic functions), and you are supposed to call the helper functions to extract relevant components. For example, the "lm" object overloads the coef, residuals, fitted, and predict functions, among others. The following statement shows the preferred technique for obtaining the coefficients of the linear model:

/* access coefficients of R object. Third technique. */
run ImportMatrixFromR(c, "coef(lm.obj)");

11.6 Handling Missing Values

When you transfer a SAS data set to R, missing values are automatically converted to the missing value in R, which is represented by NA. Similarly, an NA value is converted to a SAS missing value when you transfer data from R into SAS software. However, there are some differences between the way that R statistical functions and SAS procedures handle missing values. This section describes these differences.

11.6.1 R Functions and Missing Values

The treatment of missing values in R varies from function to function. For example, the following statements compare the computation of a mean in SAS/IML to the same computation in R:

/* compute a mean in SAS/IML and in R */
x = {., 2, 1, 4, 3};
meanIML = x[:];
print meanIML;
run ExportMatrixToR(x, "x");
submit / R;
   mean(x)            # compute mean in R; this also prints the result
endsubmit;
Default Computation of Mean for Data with Missing Values

Figure 11.9. Default Computation of Mean for Data with Missing Values

Note that the result of the R mean function is NA because one of the elements of the x vector was missing. You can override this default behavior, as shown in the following statements:

/* compute mean in R; omit missing values */
submit / R;
   mean(x, na.rm=TRUE)
endsubmit;
Mean Computed by R after Removing Missing Values

Figure 11.10. Mean Computed by R after Removing Missing Values

The na.rm= option specifies whether to remove missing values before the computation occurs. In addition to the mean function, the same option occurs for var, sum, min, max, and many other functions.

11.6.2 Merging R Results with Data That Contain Missing Values

By convention, an output data set created by a SAS regression procedure contains the same number of observations as the input data. In fact, the output data set created by most OUTPUT statements also contains a copy of the modeling variables from the input data set. If some observation contains a missing value for any of the explanatory variables in the model, then the predicted value for that observation is also missing.

This is not the default behavior for R regression functions. Instead, the default behavior is to drop observations that contain missing values. Consequently, if your explanatory variables contain missing values, the number of observations for vectors of fitted values and residuals is smaller than the number of observations in the data. This can be a problem if you want to merge the results from R into a SAS data set.

The following statements demonstrate this behavior.

/* compute predicted values in R when the data contain missing values */
x = {., 2, 1, 4, 3};
y = {1, 2, 2, 4, 3};
run ExportMatrixToR(x, "x");
run ExportMatrixToR(y, "y");

submit / R;
   lm.obj <- lm(y~x)        # default behavior: omit obs with NA
   pred <- fitted(lm.obj)   # predicted values from linear model
   print(pred)
endsubmit;

The output, shown in Figure 11.11, shows that the result of the fitted function is a vector of four numbers, whereas the input data contains five observations. The first row of numbers in Figure 11.11 represents the observations with nonmissing values. The predicted values for these observations are given by the second row of numbers.

Predicted Values from R Computation

Figure 11.11. Predicted Values from R Computation

If you want to merge these results with data in SAS, there are two techniques that you can use. You can use the observation numbers to help you merge the results, or you can rerun the lm function with an option that overrides the default behavior.

For the first technique, you need to get the nonmissing observation numbers. You can obtain a character vector that contains the first row of Figure 11.11 by using the names function in R. You then need to convert the character names to observation numbers. The following IMLPlus statements retrieve the fitted values and the observations to which they correspond:

/* adjust for missing values */
run ImportMatrixFromR(p0, "pred");
run ImportMatrixFromR(nonMissing, "as.numeric(names(pred))");
p = j(nrow(x), 1, .);        /* allocate vector with missing values */
p[nonMissing] = p0;          /* plug in nonmissing values           */
print p;
Merging Results with Missing Values

Figure 11.12. Merging Results with Missing Values

The nonmissing fitted values are transferred into the vector p0 while the nonmissing observation numbers are transferred into the vector nonMissing. The observation numbers are used to create a vector p that contains the predicted values in the correct locations.

The alternative to this technique is to specify an option in R so that predicted (and residual) values are always the same length as the input data and contain missing values where applicable. You can specify the option by using the options function as indicated in the following statements:

/* tell R to pad results with missing values (global option) */
submit / R;
options(na.action="na.exclude")
# continue with other R statements...
endsubmit;

The na.action= option specifies the name of a function that determines how the statistical models handle missing values. The function "na.exclude" specifies that models should pad residual and predicted values with missing values where appropriate. You can also explicitly override the default value of this option each time you call a statistical model, as shown in the following statements:

/* tell R to pad results with missing values (each function) */
submit / R;
   lm.obj <- lm(y~x, na.action="na.exclude")   # pad obs with NA
   pred <- fitted(lm.obj)                      # predicted values
endsubmit;

run ImportMatrixFromR(p, "pred");
print p;

The results are identical to Figure 11.12. In particular, there are five fitted values in the vector p.

11.7 Calling R Packages

Because the SAS/IML language (like the R language) is a matrix-vector language with a rich library of statistical functions, the SAS/IML programmer will not typically need to use R for basic computations. Both languages enable the programmer to iterate, index, locate certain elements in a matrix, and so on. However, as mentioned previously, R supports packages that add functionality in certain specialized areas of statistics. The SAS programmer might want to call these packages from a SAS/IML program in order to take advantage of those computational methods.

This section describes how to call functions in a package from SAS/IML programs. The package used in this section is the KernSmooth package by M. Wand, which implements kernel smoothing functions that are associated with the book Kernel Smoothing (Wand and Jones 1995).

In the 1970s, "kernel regression" became a popular method for smoothing a scatter plot. In kernel regression, the prediction at a point x is determined by a weighted mean of y values near x, with the weight given by a kernel function such as the normal distribution. The bandwidth of the kernel determines how much weight is given to points near x. This is similar to loess regression, except that kernel regression uses all observations within a certain distance from x, whereas loess uses a fixed number of neighbors regardless of their distance.

SAS/STAT software does not have a procedure that computes kernel regression, so this is a good candidate for calling an R package to compute a kernel estimator. Furthermore, the KernSmooth package is distributed with the core R distribution as a "recommended" package, so you do not need to download this package from CRAN.

11.7.1 Installing a Package

For packages that are not contained in the core R distribution, the R Project Web site (http://cran.r-project.org/web/packages/) gives instructions for installing packages. The R software contains commands that automatically download and install packages. You can also install a package manually if the automatic installation does not succeed.

11.7.2 Calling Functions in a Package

Calling a function in a package is easy. First, load the package into R by using the library function in R. Then call any function in the package. The example in this section uses the KernSmooth package, which is distributed with the core R software. Be aware that the KernSmooth functions do not support missing values, so you might want to use a different package if your data contain missing values.

The goal of this section is to create a figure similar to Figure 9.7. However, instead of calling a SAS procedure to compute the scatter plot smoother, this section describes how to call R functions to compute the smoother.

As explained in Section 11.4.2, the ReleaseDate variable in the Movies data set is transferred to a variable in R that is assigned the "Date" class. However, the functions in KernSmooth do not handle date values (and neither do many other modeling functions in R). There are two ways to handle this problem: you can remove the format from the SAS variable before transferring the data to R, or you can remove the "Date" class from the R variable after the transfer. This example removes the format from the ReleaseDate, as shown in the following statements; see the R documentation for the unclass function to learn about the second method.

/* remove format on ReleaseDate variable since some R
   functions cannot handle dates for explanatory variables */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

oldFormat = dobj.GetVarFormat("ReleaseDate");      /* save format   */
dobj.SetVarFormat("ReleaseDate", "");              /* remove format */
dobj.ExportToR("df");           /* export data to R as a data frame */

Notice that the variable's format (DATE9.) is saved in oldFormat so that it can be restored later in the program. A variable's format is cleared by specifying the empty string in the SetVarFormat method.

A copy of the data is now in the R data frame named df. The following statements load the KernSmooth package and compute the smoother:

/* compute kernel regression by calling an R package */
YVar = "Budget";
XVar = "ReleaseDate";

/* note: KernSmooth does not handle missing values */
submit XVar YVar / R;
   library(KernSmooth)                  # load package
   attach(df)                           # make names of vars available
   h <- dpill(&XVar, &YVar)  # compute plug-in bandwidth
   model <- locpoly(&XVar, &YVar, bandwidth=h)  # fit smoother
   detach(df);
endsubmit;

To make the program more general, the matrices YVar and XVar are defined to contain the name of the response and explanatory variables. These matrices are listed in the SUBMIT statement and their values are substituted into the R statements as discussed in Section 4.4. As explained in that section, the values of the matrices are referred to by preceding the matrix name with an ampersand (&). The substitution is made by IMLPlus before the SUBMIT block is sent to R.

The SUBMIT block contains R statements. The first statement loads the KernSmooth package. The next statement uses the attach statement to add df to a search path so that variables in the data frame can be referred to by their name. (Alternatively, you can refer to a variable as df$&XVar, and so forth.) The dpill and locpoly functions are provided by the KernSmooth package. The first function uses a "plug-in" method to return a bandwidth for the kernel regression method. The bandwidth is stored in the variable h. The value is used in the locpoly function, which actually computes the kernel regression estimate. The last statement in the SUBMIT block is a call to the detach function, which removes the df data frame from the search path.

The model object is a simple structure that contains two vectors: model$x and model$y. The next statements read the R structure directly into a SAS/IML matrix and plot the model on a scatter plot of the data:

/* read simple R structure and overlay results */
run ImportMatrixFromR(m, "model");          /* x=m[,1]; pred=m[,2]; */
dobj.SetVarFormat("ReleaseDate", oldFormat);/* replace format       */

declare ScatterPlot p1;
p1 = ScatterPlot.Create(dobj, XVar, YVar);
p1.DrawUseDataCoordinates();
p1.DrawSetPenColor(BLUE);
p1.DrawLine(m[,1], m[,2]);

The data in model is transferred into a SAS/IML matrix named m. (Alternatively, the program could have transferred model$x and model$y with two separate calls.) The program then restores the format for the ReleaseDate variable, creates a scatter plot of the data, and overlays the kernel smoother. The scatter plot is shown in Figure 11.13. A comparison with Figure 9.7 shows that the two smoothers are qualitatively similar, although Figure 11.13 exhibits a bump near November 2007 that is not present in the loess model.

Kernel Smoother Computed in R

Figure 11.13. Kernel Smoother Computed in R

11.8 Case Study: Optimizing a Smoothing Parameter

Section 9.7 describes how to call the LOESS procedure and add a smoother onto a scatter plot. The LOESS procedure used the SELECT= option to automatically select a smoothing parameter in the interval [0,1] that minimizes the AICC statistic.

There is also a loess function in R, but that function does not contain an option to automatically choose a bandwidth that optimizes some criterion. However, Section 3.6 describes a SAS/IML module for optimizing a function of one variable over an interval. Is it possible to use the golden section search module that is developed in that section to choose a smoothing parameter for the R loess function?

The answer is yes. Recall that the previously defined GoldenSection module requires that you define the module to be minimized. For the current example, the function must evaluate the loess model for a given parameter and then compute the AICC statistic for that model.

This section demonstrates how SAS and R software can work together. The section consists of four subsections.

  1. Section 11.8.1 describes how to call the loess function in R.

  2. Section 11.8.1 describes how to compute the AICC statistic from the object returned by the loess function.

  3. Section 11.8.3 describes how to wrap (or encapsulate) the R statements into a SAS/IML module.

  4. Section 11.8.4 describes how to call the GoldenSection module developed in an earlier chapter to find the smoothing parameter that minimizes the AICC statistic for a loess fit.

11.8.1 Computing a Loess Smoother in R

The first step in the example is to figure out how to call the loess function in R. Assume that the Movies data set has already been copied to a data frame named df in R as in Section 11.7.2. Then the following statements fit a loess model that is similar to the model from the LOESS procedure that is shown in Figure 9.20:

/* fit loess model in R */
XVar = "ReleaseDate";
YVar = "Budget";

submit XVar YVar / R;
   loess.obj <- loess(&YVar ~ &XVar, data=df, span=0.15,
                     na.action="na.exclude", degree=1, normalize=FALSE,
                     control=loess.control(iterations=1));
   summary(loess.obj)
endsubmit;
Summary of a Loess Model in R

Figure 11.14. Summary of a Loess Model in R

The loess function in R uses different default parameters than the LOESS procedure, but the options specified in the preceding statements force the loess model to be fairly similar to the model shown in Figure 9.20. (For an even closer match, use the INTERP=CUBIC option in the MODEL statement for the LOESS procedure.) Note that the smoothing parameter used in the R model is 0.15. This value is chosen because Figure 9.20 shows that 0.1518 is the value of the smoothing parameter that optimizes the AICC statistic as computed by the LOESS procedure.

11.8.2 Computing an AICC Statistic in R

The second step in this example is to compute the AICC statistic. The documentation for the LOESS procedure specifies that the procedure uses the Mowing formula for the AICC statistic:

11.8.2 Computing an AICC Statistic in R

where n is the number of nonmissing observations, cc 11.8.2 Computing an AICC Statistic in R2 is the error mean square, and L is the loess smoothing matrix. Fortunately, the loess.obj object returned by the loess function contains all of the necessary information required to compute the AICC statistic.

To compute the AICC statistic, look at the contents of the loess.obj object by using the str function, as shown in the following statements. A partial listing of the structure is shown in Figure 11.15.

/* display structure of an R object */
submit / R;
   str(loess.obj)
endsubmit;
Partial Listing of the Structure of a Loess Object in R

Figure 11.15. Partial Listing of the Structure of a Loess Object in R

The structure contains 17 components, including three components that can be used to construct the AICC statistic. The n component gives the number of nonmissing observations. The residuals component can be used to form the expression Partial Listing of the Structure of a Loess Object in R2 = Σir2i/n, where ri, is the i th residual. The trace of the smoothing matrix, Trace(L), is given by trace.hat. Consequently, the following statements create an R function named aicc that computes the AICC statistic for a loess object:

/* create an R function that computes AICC for a loess model */
submit / R;
   aicc <- function(loess.obj)
   {
      n <- loess.obj$n
      r <- loess.obj$residuals;
      TraceL <- loess.obj$trace.hat
      sigma.hat2 <- sum(r^2, na.rm=TRUE) / n
      aicc <- log(sigma.hat2) + 1 + (2*(TraceL+1)) / (n-TraceL-2)
      return(aicc)
   }
endsubmit;

Notice that defining a function in R is very similar to defining a function module in SAS/IML. The following statements call the function and display the AICC value for the loess model with the smoothing parameter 0.15:

submit / R;
   aicc(loess.obj)
endsubmit;

The value computed by the aicc function (shown in Figure 11.16) is close to the value reported in Figure 9.20.

AICC Statistic Computed by an R Function

Figure 11.16. AICC Statistic Computed by an R Function

11.8.3 Encapsulating R Statements into a SAS/IML Module

The third step in this example is to write a SAS/IML module that encapsulates the R code. Recall that the GoldenSection module (defined in Chapter 3) calls a SAS/IML function module called Func. The Func module takes a single parameter and returns the function evaluated at that parameter. In this example, the parameter is the smoothing parameter for the loess fit. The function is the AICC statistic for the fit. Consequently, the following statements define a SAS/IML module that uses R to fit a loess model and to compute the AICC statistic for that model:

/* module that calls R to compute a model and statistic */
start Func(alpha) global (XVar, YVar);
   /* Fits a loess model in R for a value of the smoothing parameter, alpha.
    * Returns the AICC statistic. */
   submit alpha XVar YVar / R;
      loess.obj <- loess(&YVar ~ &XVar, data=df, span=&alpha,
                     na.action="na.exclude", degree=1, normalize=FALSE,
                     control=loess.control(iterations=1));
      crit <- aicc(loess.obj);
   endsubmit;
   run ImportMatrixFromR(aicc, "crit");
   return ( aicc );
finish;

The Func module has a single argument, alpha, but also uses two global variables, XVar and YVar, which should be defined in the scope of the main program. These global variables give the names of the two variables used in the model. All three of these variable values are passed to R by listing the variables in the SUBMIT statement and referring to the values of the variables by preceding the names with an ampersand (&). The R statements merely evaluate a loess model for the given smoothing parameter and compute the AICC statistic for that fit by calling the aicc function that you defined in the previous section. Lastly, the value of the AICC statistic is transferred into a SAS/IML matrix, and this value is returned by the module.

You can call the module to see the AICC value for the loess model with the smoothing parameter 0.15, as shown in the following statements:

AICC = Func(0.15);
print AICC;

The AICC statistic is shown in Figure 11.17. Note that this is the same value shown in Figure 11.16.

Result of Calling the Func Module

Figure 11.17. Result of Calling the Func Module

11.8.4 Finding the Best Smoother by Minimizing the AICC Statistic

The last step of this example is to call the GoldenSection module that you defined in Section 3.6. Assuming that the GoldenSection module is on the SAS/IML Studio search path, you can find the loess smoothing parameter that leads to a minimum value of the AICC statistic by calling the module as shown in the following statements:

/* find parameter that minimizes the AICC on [0, 0.9] */
Alpha = GoldenSection(0, 0.9, 1e-3);
minimumAICC = Func(Alpha);
print Alpha minimumAICC;
Smoothing Parameter for the Minimum Value of the AICC Statistic

Figure 11.18. Smoothing Parameter for the Minimum Value of the AICC Statistic

The minimum value, shown in Figure 11.18, is close to 0.1518, which is the optimal value found by the LOESS procedure. The differences between the two values can be explained by noting that the algorithms are not identical.

Given that the AICC function is not a continuous function of the smoothing parameter and that the AICC function can have multiple local minima, it is surprising that the GoldenSection algorithm works as well as it does for this example. It turns out that for these data the AICC function is particularly well-behaved. The following statements graph the AICC statistic as a function of the smoothing parameter. The resulting graph is shown in Figure 11.19.

/* graph AICC statistic as a function of the smoothing parameter    */
x = do(0.1, 0.9, 0.02);          /* row vector of sequential values */
x = t(x);                        /* transpose to column vector      */
y = j(nrow(x), 1);
do i = 1 to nrow(x);
   y[i] = Func(x[i]);
end;
declare ScatterPlot p2;
p2 = ScatterPlot.Create("AICC", x, y);
p2.SetAxisLabel(XAXIS, "Loess Smoothing Parameter");
p2.SetAxisLabel(YAXIS, "AICC");
The AICC Statistic as a Function of the Smoothing Parameter

Figure 11.19. The AICC Statistic as a Function of the Smoothing Parameter

11.8.5 Conclusions

The example in this section demonstrates how SAS and R software can work together. The SAS/IML program controls the flow of the algorithm, but you can choose to carry out some computation in R.

The interface to R in SAS/IML Studio enables you to pass data between the two statistical systems and to call R functions from SAS/IML programs. As the example in this section shows, you can have data in a SAS data set but choose to call an analysis in R. The loess analysis in R lacks a feature found in the LOESS procedure, namely the automatic selection of a smoothing parameter that minimizes some criterion. However, you can use a previously written SAS/IML module to optimize the smoothing parameter for the R model. You could also have written the AICC function in the SAS/IML language instead of in R.

What else can be learned from this example? First, by calling R functions from SAS/IML programs you can, in effect, increase the SAS/IML library of functions. Second, by wrapping R statements in a SAS/IML module, you can encapsulate the R code; the program that calls the module can ignore the fact that R is used internally by the module. Third, the interface to R provided by SAS/IML Studio gives you tremendous flexibility to take advantage of algorithms written in either software language. You can continue to use all of your existing knowledge about SAS programming even while you evaluate statistical techniques implemented in R packages.

11.9 Creating Graphics in R

Advocates of R software often point to the graphics as a major reason for using R. In R, there is a graphical window that you can partition in order to create multiple plots in a single window. There are built-in high-level functions such as plot and hist that create graphs, and other functions such as lines and points that overlay markers or curves onto an existing graph. There are also several graphical packages that you can use to create special graph types or to add functionality to the R graphics.

You do not have to do anything special to use R graphics from SAS/IML Studio. When you call an R graphical function, that function creates a graphical window and draws to it. The graphical window is not part of the SAS/IML Studio workspace: it is owned and managed by the R process.

The following statements create an R graph from an IMLPlus program:

/* create an R graph */
run ExportDataSetToR("Sasuser.Movies", "MovieFrame");
submit / R;
   plot(Budget ~ ReleaseDate, data=MovieFrame)
endsubmit;

A window appears that contains the R graph, as shown in Figure 11.20. In contrast to the SAS/IML Studio graphics, the default graphics in R are not dynamically linked—neither to each other nor to IMLPlus graphics of the same data. In particular, as explained in Section 11.4.2, the R graphs do not reflect any attributes that are associated with marker shape, color, or selection state, even if the data are transferred to R from an IMLPlus data object.

An R Graph Created from an IMLPlus Program

Figure 11.20. An R Graph Created from an IMLPlus Program

There is little need to use R to create a basic scatter plot, but R also provides some sophisticated graphs that you might want to use as part of an analysis. For example, R has a function called coplot that creates a conditioning plot. A conditioning plot is a panel of scatter plots for the variables X and Y, given particular values of a third variable C (called the conditioning variable). When C is a nominal variable, the conditioning plot is similar to BY-group processing in SAS: coplot produces a scatter plot for each level of C. The coplot function also graphically indicates the levels of C next to the panel of scatter plots. This is shown in Figure 11.21, which is produced by the following statements:

submit / R;
   coplot(Budget ~ ReleaseDate | MPAARating, data=MovieFrame)
endsubmit;
An R Coplot Graph

Figure 11.21. An R Coplot Graph

The scatter plot of Budget versus ReleaseDate, given that the movie is rated G, is located in the leftmost column of the bottom row of Figure 11.21. The scatter plot, given that the movie is not rated, is in the second column of the bottom row, and so forth. This panel of scatter plots enables you to see all BY groups of the MPAARating variable in a single glance. (You can create a similar panel of plots in SAS/IML Studio; see the chapter "Plotting Subsets of Data" in the SAS/IML Studio User's Guide.)

The conditioning plot panel in R is easy to make, and it can be very informative. In SAS/IML Studio you can create each plot in the panel by using the interactive and dynamically linked IMLPlus graphics. The following statements create a single scatter plot of Budget versus ReleaseDate and also create a bar chart of the MPAARating variable:

/* show coplot functionality in IMLPlus */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
declare ScatterPlot cp;
cp = ScatterPlot.Create(dobj, "ReleaseDate", "Budget");
cp.ShowObs(false);          /* show observations only when selected */
declare BarChart bar;
bar = BarChart.Create(dobj, "MPAARating");

The ShowObs method in the Plot class is used to show only those observations that are selected. When you click on a bar in the bar chart, the scatter plot displays one of the five scatter plots shown in the conditioning plot. For example, Figure 11.22 shows the scatter plot for PG-rated movies. The IMLPlus formulation uses the interactive nature of the IMLPlus graphics to enable you to see the scatter plot, given a particular movie rating. (You can even see a scatter plot for a union of rating categories, such as for movies that are rated either PG or PG-13.) Both approaches have advantages: the interactive display is more flexible, but the paneled display enables you to more easily compare data across different categories, such as comparing movies rated PG with those rated PG-13.

An Interactive Coplot in SAS/IML Studio

Figure 11.22. An Interactive Coplot in SAS/IML Studio

As shown in Section 11.8.3, it is sometimes helpful to create an IMLPlus module that encapsulates a call to R. For example, the following statements create an IMLPlus module that calls the coplot function in R:

/* define module that creates a graph in R */
start CallRCoplot(DataObject dobj, XVar, YVar, CoVar);
   DSName = dobj.GetServerDataSetName();
   dobj.ExportToR(DSName);
   submit XVar YVar CoVar DSName / R;
    coplot(&YVar ~ &XVar | &CoVar, data = &DSName)
   endsubmit;
finish;

run CallRCoplot(dobj, "ReleaseDate", "Budget", "MPAARating");

The module takes four arguments: a data object and three character matrices that contain, respectively, the name of the X, Y, and conditioning variables. The first statement in the module illustrates a clever technique: the GetServerDataSetName returns a string that can be used as the name of the R data frame. (In this example, the string is "Movies.") You could hard-code this string by writing DSName = "MyData", but calling the GetServerDataSetName method enables you to make sure that the R data frame has a name related to the data. The second statement in the module creates the R data frame from the data object. The SUBMIT statement then calls the coplot function in R, with the arguments to the function substituted from the SAS/IML matrices XVar, YVar, CoVar, and DSName. After the module is defined, you can call it as you would any IMLPlus module. The results from the previous program is identical to Figure 11.21.

You can even use the following technique to modify the module so that it skips the creation of the R graphics window. (Thanks to Simon Smith for this idea.) The graphic is written to the Windows clipboard and pasted from the clipboard directly into the SAS/IML Studio output document.

/* paste R graphic into the SAS/IML Studio output window */
start CallRCoplot2(DataObject dobj, XVar, YVar, CoVar);
   DSName = dobj.GetServerDataSetName();
   dobj.ExportToR(DSName);
   submit XVar YVar CoVar DSName / R;
     win.metafile(); # 1. write graphics to Windows clipboard
     coplot(&YVar ~ &XVar | &CoVar, data = &DSName)
     rc <- dev.off() # 2. close graphics device (the metafile)
   endsubmit;
   /* 3. paste graphics from clipboard to output document */
   OutputDocument.GetDefault().PasteGraphic(GRAPHICTYPE_WMF);
finish;

run CallRCoplot2(dobj, "ReleaseDate", "Budget", "MPAARating");

The output from the module is shown in Figure 11.23. The module is identical to the CallRCoplot module except for three statements.

  1. The win.metafile function writes the subsequent graph to a file in Windows Metafile Format (WMF). If no filename is specified, as in this example, the output is copied to the Windows clipboard. No writing occurs until the file is closed.

  2. The dev.off function closes the current graphics device, which results in the graph being written to the clipboard.

  3. The contents of the clipboard are pasted into the output document. This is accomplished by using the OutputDocument class to get the default output document. The PasteGraphic method is then used to paste from the clipboard in WMF format. (The OutputDocument class is described in Section 12.8.)

The statement that uses the OutputDocument class might look strange to readers who are not experienced with object-oriented programming techniques. The following statements might look more familiar, but are equivalent:

declare OutputDocument doc;
doc = OutputDocument.GetDefault();
doc.PasteGraphic(GRAPHICTYPE_WMF);

This second formulation explicitly declares that doc is an object of the OutputDocument class. (The OutputDocument class enables you to programatically control a SAS/IML Studio output window.) The GetDefault method returns the current output document if it exists, or creates a new output document if one does not exist. The second formulation is preferred when you intend to use the output document several times. The shorthand used in the module is sometimes used to save typing when you call only a single OutputDocument method.

Copying the R Graphic to the SAS/IML Studio Output Window

Figure 11.23. Copying the R Graphic to the SAS/IML Studio Output Window

11.10 References

[bibch11_01] A. Canty, and B. D. Ripley, (2009), boot: Bootstrap R (S-Plus) Functions, R package version 1.237.

[bibch11_02] A. C. Davison, and D. V. Hinkley, (1997), Bootstrap Methods and Their Application, Cambridge, UK: Cambridge University Press.

[bibch11_03] M. P. Wand, and M. C. Jones, (1995), Kernel Smoothing, London: Chapman & Hall.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset