Chapter 12. Regression Diagnostics

Contents

  • 12.1 Overview of Regression Diagnostics 279

  • 12.2 Fitting a Regression Model 280

  • 12.3 Identifying Influential Observations 284

  • 12.4 Identifying Outliers and High-Leverage Observations 286

  • 12.5 Examining the Distribution of Residuals 287

  • 12.6 Regression Diagnostics for Models with Classification Variables 289

  • 12.7 Comparing Two Regression Models 292

    • 12.7.1 Comparing Analyses in Different Workspaces 293

    • 12.7.2 Comparing Analyses in the Same Workspace 294

  • 12.8 Case Study: Comparing Least Squares and Robust Regression Models 296

  • 12.9 Logistic Regression Diagnostics 303

  • 12.10 Viewing ODS Statistical Graphics 307

  • 12.11 References 310

12.1 Overview of Regression Diagnostics

A regression diagnostic identifies observations that are either unusual or highly influential in a regression analysis. It is a good idea to plot regression diagnostics when you perform such an analysis.

The SAS/STAT regression procedures use ODS Graphics to create diagnostic plots. However, it is also useful to create diagnostic plots by using the statistical graphics in SAS/IML Studio, as shown in this chapter. Because the graphics in SAS/IML Studio are interactive, it is easy to identify and investigate outliers and high-leverage points for a regression model. For example, you can click on an outlier in a graph. This labels the observation and highlights it in other graphs. You can double-click on an observation to examine the values of other covariates. If there is a large group of outliers, you can select all of the outliers and determine whether those observations share some characteristic such as a high (or low) value of some covariate. If so, you can revise your model by incorporating that covariate into the regression model.

If used correctly, these plots can assist you in building statistical models and investigating outliers. For an explanation of the statistics that are associated with regression diagnostics, see Belsley, Kuh, and Welsch (1980).

This chapter is an applied chapter that describes how to do the following:

  • Fit a regression model.

  • Add reference lines to a scatter plot.

  • Identify influential observations such as outliers and high-leverage observations.

  • Compare two regression models.

12.2 Fitting a Regression Model

This section uses the techniques that are described in Chapter 4 to call the GLM procedure to perform a linear regression and to create graphs of the analysis in SAS/IML Studio. Techniques from Chapter 9 are used to add prediction bands and reference lines to the graphs in order to make the graphs more interpretable.

The regression model for this example uses variables from the Movies data set. As seen in previous chapters, there appears to be a relationship between the Budget of a movie and the US_Gross variable (see Figure 7.7). However, neither variable is normally distributed (see Figure 7.6), so Atkinson (1985) and others suggest applying a normalizing transformation to these variables. For these data, it makes sense to apply a logarithmic transformation to each variable (see Figure 8.5) and to model the log-transformed variables.

The analysis that begins in this section is continued in subsequent sections. The following list is an outline of the main steps in the analysis. The numbers used in the list are also used in comments in the program.

  1. Apply a logarithmic transformation to the US_Gross and Budget variables.

  2. Call PROC GLM to run a regression analysis.

  3. Read predicted values and confidence limits into SAS/IML vectors.

  4. Create a data object that contains the data and results.

  5. Create a plot of the data.

  6. Overlay the results of the regression analysis.

  7. Create Cook's D diagnostic plot (Section 12.3).

  8. Overlay reference lines on Cook's D plot.

  9. Create a plot of studentized residuals versus leverage (see Section 12.4).

  10. Overlay reference lines on the residual-leverage plot.

The following statements transform the variables and use PROC GLM to fit a regression model:

/* create regression diagnostic plots in SAS/IML Studio */
submit;
/* 1. log transform of US_Gross and Budget variables */
data movies;
   set Sasuser.Movies;
   LogUS_Gross = log(US_Gross);
   LogBudget = log(Budget);
   label LogUS_Gross = "log(US Gross)" LogBudget = "log(Budget)";
run;

/* 2. run classic regression analysis *
proc glm data=movies;
   model LogUS_Gross = LogBudget;
   output out=GLMOut
          P=P LCLM=LCLM UCLM=UCL           /* predictions and CLM   */
          R=R RSTUDENT=ExtStudR            /* residuals             */
          COOKD=CookD H=Leverage;          /* influence diagnostics */
quit;
endsubmit;

The DATA step computes the transformed variables named LogBudget and LogUS_Gross. Then the GLM procedure is called to model LogBudget by LogUS_Gross. The output data set (called GLMOut) contains the input data in addition to six new variables created by the procedure:

  • P contains the predicted values for the model

  • LCLM contains the lower 95% confidence limits for the mean of the predicted value

  • UCLM contains the upper 95% confidence limits for the mean of the predicted value

  • R contains the residual values for the model

  • ExtStudR contains externally studentized residuals, which are studentized residuals with the current observation deleted. (Recall that a studentized residual is a residual divided by its standard error.)

  • CookD contains Cook's D influence statistic

  • Leverage contains the leverage statistic

The output from the GLM procedure is shown in Figure 12.1.

Regression Summary

Figure 12.1. Regression Summary

The parameter estimates table shown in Figure 12.1 provides estimates for the coefficients of the linear model. The estimate for the LogBudget coefficient indicates that each unit of LogBudgetcan be expected to generate an additional 0.72 units of LogUS_Gross. In terms of the original variables, US_Gross ≈ e0.9 (Budget)0.72. Thus if you have two movies, one with twice the budget of the other, the more expensive movie is expected to bring in 20.72 ≈ 1.64 times the revenue of the less expensive movie. The fact that the coefficient for LogBudget is less than 1 implies that movies with very large budgets have predicted gross revenues that are less than the budget! For this model, the "break-even" point is a budget of about exp(0.9/(1 − 0.72)) ≈ 24.9 million dollars; movies with larger budgets have predicted US gross revenues that are less than their budgets.

As the previous analysis demonstrates, it can be difficult to interpret a model whose variables are log-transformed. It can be useful, therefore, to invert the log-transformation and plot the predicted values and explanatory variables in the original scale units, as shown in Figure 12.2. The following statements read some of the output variables from the GLMOut dataset into SAS/IML matrices and use the EXP function to transform the variables from the log scale back into the original scale:

/* 3a. read variables from output data set into vectors*/
use GLMOut;
read all var {LogBudget P UCLM LCLM} into logFit; /* log scale       */
read all var {CookD};                             /* read for Step 8 */
close GLMOut;

/* 3b. invert transformation for predicted values and CLM */
fit = exp(LogFit);

You can also read the GLMOut data set into a data object so that you can create dynamically linked graphics. The following statements create a data object and add new columns that contain the predicted values in the original scale. The new columns come from data in the LogFit matrix. Observation numbers are also added. These variables will be used in subsequent diagnostic plots.

/* 4. create data object that contains data and results */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Work.GLMOut");
dobj.SetRoleVar(ROLE_LABEL, "Title");

/* 4a. add transformed variable to the data object */
dobj.AddVar("Pred", "Predicted US Gross (million $)", fit[,2]);
dobj.AddVar("ObsNumber", "Observation Number", T(1:nrow(fit)));

The SetRoleVar method specifies a label for selected observations in subsequent plots. The Title variable contains the name of each movie in the data.

After the data object is created, you can create graphs that are linked to the data object. The following statements create a line plot of the Budget and Pred variables, and override a few default characteristics of the line plot:

/* 5. create line plot */
declare LinePlot line;
line = LinePlot.Create(dobj, "Budget", "Pred");
line.SetAxisLabel(XAXIS, AXISLABEL_VARLABEL);
line.SetAxisLabel(YAXIS, AXISLABEL_VARLABEL);
line.ShowObs(false);
line.SetLineWidth(2);

It is useful to add confidence limits to the line plot of the predicted values. The following statements sort the fit matrix by the Budget variable and use the DrawPolygon method to add confidence limits as described in Section 9.1.4:

/* 6. Use drawing subsystem to draw CLM band. */
line.DrawUseDataCoordinates();
line.DrawSetRegion(PLOTBACKGROUND);
LightBlue = 0x9BBBC9;                 /* RGB value: (155, 187, 201) */
line.DrawSetBrushColor(LightBlue);
line.DrawSetPenAttributes(GRAY, DOTTED, 1);
call sort(fit, 1);                    /* sort CLM by Budget         */
x = fit[,1]; LCLM = fit[,3]; UCLM = fit[,4];
N = nrow(fit);
line.DrawPolygon(x//x[N:1], LCLM//UCLM[N:1], true);

The resulting graph is shown in Figure 12.2. This plot (shown on the left in the figure) enables you to focus on the predicted values without seeing the observed values. Alternatively, you could create a scatter plot of the observed values, and add the predicted curve with the Plot.DrawLine method as described in Section 9.1. The second plot (shown on the right in the figure) displays the log-transformed data and predicted values. The program statements that produce this second graph are left as an exercise.

Regression Shown in the Original Scale (left) and the Log-Transformed Scale (right)

Figure 12.2. Regression Shown in the Original Scale (left) and the Log-Transformed Scale (right)

12.3 Identifying Influential Observations

An influential observation is one that, according to some measure, has a large effect on a regression analysis. Intuitively, an influential observation is one for which the analysis would change substantially if the observation were deleted. There are many statistics that attempt to quantify various aspects of the influence of an observation. For details, see Belsley, Kuh, and Welsch (1980) or see the documentation for the INFLUENCE option in the MODEL statement of the REG procedure in the SAS/STAT User's Guide.

In a regression analysis, there are two types of influential observations: outliers in the response variable and outliers in the space of the explanatory variables. The outliers in the response variable (often simply called "outliers") are typically determined by the size of a residual or studentized residual. The outliers in the space of the explanatory variables are often called "high-leverage points." Cook's D is a popular statistic that incorporates both of these notions of influence. An observation is often judged to be influential if the Cook's D statistic exceeds 4/n, where n is the number of nonmissing observations (Rawlings, Pantula, and Dickey 1998).

The program statements in this section are a continuation of the analysis in Section 12.2 and use the abline module defined in Section 9.4. The standard Cook's D plot is a graph of the Cook's D statistic versus the observation number. The Cook's D statistic is contained in the CookD variable in the dobj data object. The data object also contains the ObsNumber variable. The Cook's D statistic is also contained in the SAS/IML vector named CookD, so you can count the number of nonmissing values and use that value to compute the cutoff value for the Cook's D statistic. You can use the abline module to display the cutoff, as shown in the following statements:

/* 7. create Cook's D plot */
declare ScatterPlot cook;
cook = ScatterPlot.Create(dobj, "ObsNumber", "cookD")
cook.SetMarkerSize(5);

/* 8. draw reference lines on Cook's D plot */
solidGray = 0xc8c8c8|| SOLID|| 1;    /* use solid gray for y=0      */
run abline(cook, 0, 0, solidGray);   /* horiz line at y=0           */

nonMiss = sum(CookD^=.);             /* number of nonmissing values */
cookCutoff = 4 / nonMiss;
run abline(cook, cookCutoff, 0, .);  /* high Cook's D: horiz line   */

The plot of Cook's D is shown in Figure 12.3. As discussed in Section 12.1, the reason for creating these plots in SAS/IML Studio is that you can interactively click on influential observations to identify the observation. This is shown in Figure 12.3 where the five movies with the largest Cook's D value are selected. Upon further investigation of the selected movies, you can discover that the movies have similar characteristics. They all have extremely low budgets (most less than one million dollars), which makes them high-leverage points. They also have US_Gross values that are either much higher or much lower than predicted by the model, which makes them outliers or nearly outliers. Two of the movies (Once and Mad Hot Ballroom) were commercial successes that earned many times more than predicted by the regression model; the others were commercial flops.

Plot of Cook's D Statistic

Figure 12.3. Plot of Cook's D Statistic

12.4 Identifying Outliers and High-Leverage Observations

The Cook's D plot attempts to identify observations that either have high leverage or are outliers in the response variable. However, if an observation has a large Cook's D value, it is not clear (without further investigation) whether that point has high leverage, is an outlier, or both. A plot that complements the Cook's D plot is a plot of the studentized residuals versus the leverage statistic for each observation.

These residual and leverage statistics were created in the output data set by the GLM procedure in Section 12.2. The program statements in this section are a continuation of the analysis in the preceding sections. For the residual-leverage plot, Belsley, Kuh, and Welsch (1980) suggests adding a horizontal reference line at ±2 to identify observations with large studentized residuals, and a vertical line at 2p/n to identify points with high leverage, where p is the total number of parameters in the model, including the intercept. The following statements create the plot and use the abline module defined in Section 9.4 to add reference lines:

9. create plot of studentized residuals versus leverage */
declare ScatterPlot resid;
resid = ScatterPlot.Create(dobj, "Leverage", "ExtStudR");
resid.SetMarkerSize(5);

/* 10. draw reference lines on residual-leverage plot */
run abline(resid, 0, 0, solidGray);  /* horiz line at y=0             */

run abline(resid, {2,-2}, {0,0}, .); /* outliers: horiz lines at +/-2 */
p = 2;                               /* number of params              */
leverage = 2*p / nonMiss;
run abline(resid, leverage, ., .);   /* high leverage: vertical line  */

The resulting plot is shown in Figure 12.4. The graph shows 10 points that have been manually selected and labeled:

  • The five movies (also shown in Figure 12.3) that have the largest Cook's D: Once, Feast, Lonesome Jim, Mad Hot Ballroom, and Quiet. It is now apparent that all are high-leverage points and that Once and Feast are outliers.

  • Two movies with large positive residuals that are not high-leverage points: Juno and Saw II.

  • Two movies with large negative residuals: Fateless and DOA: Dead or Alive.

  • A movie with high leverage that has a modest positive residual: Wordplay.

Studentized Residuals versus Leverage

Figure 12.4. Studentized Residuals versus Leverage

You can select all observations with large Cook's D statistics by using the mouse to create a selection rectangle in the Cook's D plot (not shown). If you do this, then the residual-leverage plot updates and reveals that those observations have either high leverage or are outliers.

12.5 Examining the Distribution of Residuals

A quantile-quantile (Q-Q) plot shows whether univariate data approximately follows some distribution. The most common Q-Q plot is the normal Q-Q plot, which graphically compares quantiles of the data to quantiles of a normal distribution. The normal Q-Q plot is often used to check whether the residuals of a regression are normally distributed.

There are several ways to compare the quantiles of data with the quantiles of a theoretical distribution. The UNIVARIATE procedure follows Blom (1958) who recommends the following approach:

  1. Order the n nonmissing data values: x(1) ≤ x(2) ≤ … x(n). These are quantiles of the empirical cumulative distribution function (ECDF).

  2. For each x (i), compute the standardized quantity vi = (i −0:375)/(n + 0:25/.

  3. Compute the associated quantiles from the theoretical distribution: q i = F−1(vi), where F is the theoretical distribution function with unit scale and zero location parameter.

  4. Plot the ordered pairs (qi, vi) for i = 1,…, n.

The SAS documentation for the UNIVARIATE procedure includes a section on how to interpret Q-Q plots. Briefly, if the Q-Q plot lies along a straight line, then there is graphical evidence to believe that the data are well-described by the theoretical distribution.

The following statements define a module that uses Blom's approach to compute the normal quantiles that are associated with univariate data:

/* define module to compute normal quantiles associated with data */
start GetNormalQuantiles(x);
   nonMissing = loc(x^=.);           /* locate nonmissing values    */
   numNM = ncol(nonMissing);         /* count the nonmissing values */

   if numNM=0 then                   /* all values are missing      */
      return ( j(nrow(x), ncol(x), .) );

   /* rank the observations; handle any missing values */
   r = j(nrow(x), ncol(x), .);
   r[nonMissing] = ranktie(x[nonMissing]);

   v = (r-0.375) / (numNM+0.25);     /* transform quantiles of ECDF */
   q = quantile("NORMAL", v);        /* associated normal quantiles */
   return ( q );
finish;

You can call the GetNormalQuantiles module on the residual values computed with the R= option in Section 12.2. These residuals are contained in the data object in the R variable. The following statements retrieve the R variable and compute the normal quantiles that are associated with the residual values. The DataObject.AddVar method adds those normal quantiles to the data object. After adding the normal quantiles, you can create a Q-Q plot that is linked to the other plots.

/* create QQ plot of residuals */
dobj.GetVarData("R", r);
q = GetNormalQuantiles(r);
dobj.AddVar("QN_R", "Normal Quantiles", q);

declare ScatterPlot qq;
qq = ScatterPlot.Create(dobj, "QN_R", "R");
qq.SetAxisLabel(XAXIS, AXISLABEL_VARLABEL);

mu = r[:];
sigma = sqrt(var(r));
run abline(qq, mu, sigma, .);

The graph is shown in Figure 12.5. The abline module adds a line to the plot. The slope of the line is the sample standard deviation of the residuals; the intercept is the mean of the residuals, which is zero.

A Quantile-Quantile Plot of Residuals

Figure 12.5. A Quantile-Quantile Plot of Residuals

Because the points in the Q-Q plot in Figure 12.5 do not lie along a straight line, you can conclude that the residual values are not well-described by a normal distribution. In fact, the curved pattern of the points indicates that the distribution of the residuals is skewed to the left.

12.6 Regression Diagnostics for Models with Classification Variables

There have been many studies that attempt to determine whether movies that are nominated for (or win) Academy Awards also enjoy greater box-office revenue as a result of the increased publicity (for example, Simonoff and Sparrow (2000); Nelson et al. (2001)). Most conclude that an Academy Award nomination does, in fact, result in increased revenue.

The Movies data set contains an indicator variable, AANomin, that has the value 1 if the movie was nominated for an Academy Award. This section modifies the regression analysis in the previous sections to include this classification variable. The complete program is available from this book's companion Web site

The first step in the program (transforming variables) remains the same. However, in Step 2 of the program, you need to modify the call to the GLM procedure. Specifically, you need to add a CLASS statement to specify that AANomin is a classification variable, and also add AANomin and any interaction effects to the MODEL statement. It turns out that the interaction between LogBudget and AANomin is not significant (the p-value for the Type 3 statistic is 0.1223), so the following statements use only main effects:

/* 2. run classic regression analysis: class variable, main effects */
proc glm data=movies;
   class AANomin;
   model LogUS_Gross = LogBudget AANomin / solution;
   output out=GLMOut2
          P=P LCLM=LCLM UCLM=UCLM          /* predictions and CLM   */
          RSTUDENT=ExtStudR                /* studentized residuals */
          COOKD=CookD H=Leverage;          /* influence diagnostics */

quit;

Notice that you need to include the SOLUTION option on the MODEL statement if you want to see the parameter estimates table.

In Step 3, it is useful to also read the AANomin variable into a SAS/IML vector:

read all var {AANomin};

In Step 4, after creating a data object from the GLM output data set, it is convenient to use marker shapes to represent whether a movie was nominated for an Academy Award. The following statements set all marker shapes to 'x', and then set the markers for award-nominated movies to triangles:

/* 4. create data object that contains data and results */
declare DataObject dobj1;
dobj1 = DataObject.CreateFromServerDataSet("Work.GLMOut2");
... lines omitted ...

/* 4b. use marker shapes to represent AANomin levels */
dobj1.SetMarkerShape(OBS_ALL, MARKER_X);
dobj1.SetMarkerShape(loc(AANomin), MARKER_TRIANGLE);
dobj1.SetNominal("AANomin");

The call to the DataObject.SetNominal method is recommended because it enables you to create a bar chart of the AANomin variable. If you do not call the SetNominal method, SAS/IML Studio will use a histogram to plot the distribution of the AANomin variable.

In Step 5, you can use the LinePlot.CreateWithGroup method to create the line plot of the predicted values. When you overlay the confidence limits, you need to iterate over the levels of the classification variable, as shown in the following statements:

/* 5. create line plot */
declare LinePlot line1;
line1 = LinePlot.CreateWithGroup(dobj1, "Budget", "P", "AANomin");
... lines omitted ...

u = unique(AANomin);
do i = 1 to ncol(u);
   idx = loc(AANomin=u[i]);
   N = ncol(idx);
   z = fit[idx,];
   call sort(z, 1);
   x = z[,1]; LCLM = z[,3]; UCLM = z[,4];
   line1.DrawPolygon(x//x[N:1], LCLM//UCLM[N:1], true);
end;

The only additional modification to the program is in Step 10 where you need to adjust the value of p, the number of parameters in the model. You can manually code p=3 for this example, but that is prone to error if you decide to add additional parameters to the model at a later date. A better idea is to obtain the degrees of freedom from one of the ODS tables created by the GLM procedure. For example, prior to calling the GLM procedure, you can add the following statement:

ods output OverallAnova = Anova;

This writes the ANOVA table to a SAS data set. The first element of the DF variable contains the degrees of freedom for the model. You can add 1 to the model degrees of freedom to obtain the number of variables in the design matrix, as shown in the following statements:

/* 10. draw reference lines on residual plot */
use Anova;
read point 1 var {DF};       /* first element is Model DF           */
close Anova;
p = DF + 1;                  /* number of params (+1 for intercept) */

You can use same technique (that is, reading relevant information from an ODS table created by PROC GLM) to read in the number of nonmissing observations from the NObs table. This is left as an exercise.

The results of the program that models LogUS_Gross by LogBudget and AANomin is shown in Figure 12.6. The upper-right graph shows the predicted values and 95% confidence limits for the model. Given a value for the Budget variable, the expected value for the US_Gross variable is higher for movies that are nominated for an Academy Award. For example, for movies with 100-million-dollar budgets, the expected US_Gross is 50 million dollars higher for award-nominated movies.

Regression Model with a Classification Variable

Figure 12.6. Regression Model with a Classification Variable

The upper-left graph in Figure 12.6 is Cook's D plot. The five movies with the largest Cook's D statistic were manually selected. Four of the movies are the same as shown in Figure 12.3. One movie (Good German) was not in the top five for the previous model; it has more leverage under the new model because it was a small-budget movie that was nominated for an Academy Award. The residuals-leverage plot is shown in the lower-right graph in Figure 12.6 (compare with Figure 12.4). In several cases, the movies classified as outliers or as high-leverage points under the new model are different than those classified under the previous model.

12.7 Comparing Two Regression Models

When you are faced with two (or more) regression models on the same data, it is useful to compare details of the models in order to determine which model is better. This section briefly describes techniques you can use to compare regression models. Some of the ideas in this section are implemented in Section 12.8.

Although this section presents techniques in the context of regression models, most of the ideas in this section apply equally well to other comparisons such as comparing density estimates or comparing multivariate analyses.

12.7.1 Comparing Analyses in Different Workspaces

SAS/IML Studio uses the concept of workspaces to help you organize your work. Every time you create a new program or open a new data set, the SAS/IML Studio application creates a new workspace and adds a new button to the workspace bar at the bottom of the main SAS/IML Studio window. You can see the workspace bar at the bottom of Figure 12.6.

When you click a button in the workspace bar, all windows that are associated with the chosen workspace become visible. By default, windows that are associated with other workspaces are hidden. This makes it difficult to compare, for example, the output from two different programs. However, it is easy to get SAS/IML Studio to display the output from two different workspaces at the same time. Assume one of the workspaces is named A and another is named B. The following steps enable you to compare the output window in Workspace A with the output window in Workspace B:

  1. In Workspace A, click on the icon on the left side of the window title bar. A menu appears, as shown in Figure 12.7. (When the output window is active, you can also press ALT+HYPHEN to display this control menu.)

  2. Select Auto Hide from the menu. This toggles the Auto Hide property for the window. When the property is not selected, the window is visible regardless of the active workspace.

  3. Position the output window for easy comparison. For example, move it so that it occupies the entire left side of the SAS/IML Studio application.

  4. Click B on the workspace bar to activate Workspace B. Notice that the output window from Workspace A is still visible.

  5. On the output window for Workspace B, display the control menu and select Auto Hide. Now the output window for Workspace B is visible regardless of the active workspace.

  6. Position the output window for Workspace B for easy comparison. For example, move it so that it occupies the entire right side of the SAS/IML Studio application.

Turning Off the Auto Hide Property

Figure 12.7. Turning Off the Auto Hide Property

When the Auto Hide property is turned off for the output windows, you can scroll both output windows in order to compare parameter estimates, adjusted R square values, and other statistics.

The same technique works for comparing the contents of program windows, data tables, and graphs.

12.7.2 Comparing Analyses in the Same Workspace

The previous section describes how to compare results that are in different workspaces. However, SAS/IML Studio also enables you to use a single workspace to directly compare two different analyses of the same data.

The main technique to compare two different analyses of the same data is to graph similar quantitites against each other. For example, if you are comparing two regression analyses, you can create a scatter plot that graphs the residuals of one regression model against the residuals of the other model. Similarly, you can plot predicted values against each other or, in the case of one regressor, overlay both models' predicted values on a scatter plot of the data. This technique is implemented in Section 12.8.

In order to graph two quantities against each other in SAS/IML Studio, those quantitites should be in a common data object. Therefore, you should add predicted values, residual values, and other observation-wise statistics to a single data object. You can use the DATA step to merge the results from several analyses into a single data set, and then create a data object from that merged data set.

The previous sections describe how to compare the contents of output windows that are in different workspaces. You can also use IMLPlus programming techniques to direct the output from a program to two (or more!) separate output windows in the same workspace. This enables you to compare results (for example, parameter estimates) side-by-side instead of scrolling an output window up and down to compare the output of two procedures. You can use the OutputDocument class in SAS/IML Studio to create multiple output windows and to control which window receives the output from an analysis. This technique is implemented and described in Section 12.8.

In addition to sending text to an output window, you can also paste IMLPlus graphics into an output window. The pasted graphic is not dynamically linked; it is merely a static image of an IMLPlus graph. You can copy and paste a graphic manually by using the standard Windows CTRL+C and CTRL+V as described in the section "Copying Plots to the Windows Clipboard" (Chapter 11, SAS/IML Studio User's Guide). You can also write program statements to copy and paste a graph. For example, suppose you want to copy the graph shown in Figure 12.4 and paste it into the output document for the program in Section 12.2. You can accomplish this with the following statement, which continues the program that creates Figure 12.4:

resid.CopyToOutputDocument();

A static version of the image is pasted into the output window, as shown in Figure 12.8. You can resize the graph in the output window, drag it to a different position within the output window, or copy and paste it to other Windows applications.

A Graph Image Pasted into an Output Window

Figure 12.8. A Graph Image Pasted into an Output Window

12.8 Case Study: Comparing Least Squares and Robust Regression Models

This section shows how you can compare a robust regression model to the least squares regression model that is described in Section 12.2. This comparison is useful if you know that there are outliers or high-leverage points in the data and you want to investigate how the regression changes if you use robust statistics instead of classical least squares. If the robust regression model is similar to the least squares model, you can confidently use the least squares model, which is faster to compute.

The program in this section consists of the following main steps:

  1. Clear and position an output window that will receive output from the classical least squares regression analysis.

  2. Run a least squares regression analysis by using the GLM procedure.

  3. Create a second output window that will receive output from the robust regression analysis.

  4. Run a robust regression on the same data by using the ROBUSTREG procedure.

  5. Create a single data object that contains the data and the results from both analyses.

  6. Create a scatter plot of the data and overlay regression lines for both models.

  7. Create a scatter plot to compare residuals.

  8. Create other interactive plots as needed for further analysis.

The program begins by programatically positioning the default output window and by clearing any previous output it contains. This is accomplished by using the OutputDocument class that is described in the SAS/IML Studio online Help. The program then proceeds as in Section 12.2 by transforming the data and running the GLM procedure, as shown in the following statements:

/* compare regression models */
/* 1. first window is for least squares regression */
declare OutputDocument LSDoc = OutputDocument.GetDefault();
LSDoc.SetWindowPosition(0, 50, 50, 50);
LSDoc.Clear();                        /* clear any previous content */

/* 2. transform variables and run least squares regression analysis */
submit;
data movies;
   set Sasuser.Movies;
   LogUS_Gross = log(US_Gross);
   LogBudget = log(Budget);
   label LogUS_Gross = "log(US Gross)" LogBudget = "log(Budget)";
run;

proc glm data=movies;
   model LogUS_Gross = LogBudget;
   output out=GLMOut P=P R=Resid;
quit;
endsubmit;

printnow;                              /* flush any pending output  */

The last statement is the PRINTNOW statement, which is an IMLPlus statement that ensures that any output buffered on the SAS server is transferred to and displayed on the SAS/IML Studio client. Usually you do not need the PRINTNOW statement, since all output is flushed to the client when the program ends. However, the next step in the current program is to create a new output window to receive the output from a robust regression model, so it is a good idea to make sure that all output related to PROC GLM is displayed in the first output window prior to creating a second output window.

The following statements create and position a new output window, and run the ROBUSTREG procedure for the same model used by the GLM procedure:

/* 3. set up second output window */
declare OutputDocument RobustDoc = OutputDocument.Create();
RobustDoc.SetWindowPosition(50, 50, 50, 50);

/* 4. run robust regression on the same data */
submit;
proc robustreg data=movies method=LTS;
   model LogUS_Gross = LogBudget / leverage;

         output out=RROut P=RobP R=RobResid
                          RD=RobDist MD=MahalDist
                          outlier=RobOutFlag leverage=RobLevFlag;
         ods output DiagSummary = RobustDiagnostics;
quit;
endsubmit;

The output from the ROBUSTREG procedure appears in the second output window. This makes it easy to compare parameter estimates and other statistics side-by-side, as shown in Figure 12.9. The statements also save the DiagSummary table, which shows cutoff values for outliers and leverage points

Separate Output Windows for Two Analyses

Figure 12.9. Separate Output Windows for Two Analyses

The parameter estimates for each model are also displayed in Figure 12.10. The intercept estimate is smaller for the least squares model (0.90) than for the robust model (1.71), and the slope of the least squares model (0.72) is greater than for the robust model (0.60). This occurs because the least squares model for the logged data is affected by a number of small-budget movies that did not earn much money. This is shown graphically in Figure 12.11.

Parameter Estimates from Two Regression Models

Figure 12.10. Parameter Estimates from Two Regression Models

This example does not produce any further output. However, you can call LSDoc.SetDefault() to direct further output to the first output window and RobustDoc.SetDefault() to direct output to the second output window.

The next step in the analysis is to create a single data object that contains the data and the results from both analyses. There are several ways to accomplish this. One way is to use the DATA step (inside of a SUBMIT block) to merge the output from the procedures, and then to create a data object from that merged data. A second way, shown in the following statements and described in Section 8.8.2, is to create a data object from, say, the output data set created by the ROBUSTREG procedure, and then use the CopyServerDataToDataObject module to add certain variables from the GLM output to the same data set:

/* 5. create data object that contains data and results */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Work.RROut");
dobj.SetRoleVar(ROLE_LABEL, "Title");

rc = CopyServerDataToDataObject("Work", "GLMOut", dobj,
   {"P" "Resid"},
   {"P" "Resid"},
   {"Predicted log(US_Gross)" "Residual"},
   true);

The data object contains all of the variables in the Movies data set, the log-transformed variables, and the predicted values, residuals, and statistics from the GLM and ROBUSTREG procedures. You can therefore explore the results interactively by using the SAS/IML Studio GUI, or you can create plots programmatically.

The next step is to create a scatter plot of the data and to overlay regression lines for both models in order to compare them. The following statements use the techniques that are described in Section 9.1 to overlay each regression curve on the data:

/* 6. create scatter plot that compares fits */
declare ScatterPlot plot;
plot = ScatterPlot.Create(dobj, "LogBudget", "LogUS_Gross");
plot.SetAxisLabel(XAXIS, AXISLABEL_VARLABEL);
plot.SetAxisLabel(YAXIS, AXISLABEL_VARLABEL);

/* 6a. use drawing subsystem to add fit lines */
dobj.GetVarData({"LogBudget" "P" "RobP"}, fit);
call sort(fit, 1);                                     /* sort by X */
x = fit[,1]; P = fit[,2]; RobP = fit[,3];

plot.DrawUseDataCoordinates();
plot.DrawSetPenAttributes(BLUE, DASHED, 2);
plot.DrawLine(x, RobP);
plot.DrawSetPenAttributes(BLACK, SOLID, 2);
plot.DrawLine(x, P);

/* 6b. add legend */
Labels = {"Robust LTS" "Least Squares"};
Color = BLUE || BLACK;
Style = DASHED || SOLID;
run DrawLegend(plot, Labels, 12, Color, Style, _NULL_, -1, "ILT");

The previous statements use a dashed blue line to represent the robust regression model, and a solid black line to represent the least squares model. In order to make the graph easier to understand, the statements also call the DrawLegend module to add a legend to the upper left portion of the plot area, as shown in Figure 12.11. The DrawLegend module is described in Section 9.2.

Comparison of Two Regression Models

Figure 12.11. Comparison of Two Regression Models

Figure 12.11 makes it apparent that the least squares regression line is "pulled down" by the high-leverage points in the lower left of the figure. The two movies that have the most negative LogUS_Gross values are Lonesome Jim and Feast. Those two movies are among the movies labeled in Figure 12.4; it is evident from Figure 12.4 that they are high-leverage points with large negative residuals for the least square model.

When comparing two different models, it is often useful to compare the residual values for the models. It is sometimes the case that there are observations that have large residuals for one model but have small residuals or negative residuals for the other model. (This happens frequently with nonparametric models.) The following statements create a scatter plot of the residuals for each model and use the abline module (see Section 9.4) to add a reference line:

/* 7. compare residuals */
declare ScatterPlot residPlot;
residPlot = ScatterPlot.Create(dobj, "Resid", "RobResid");
residPlot.SetMarkerSize(5);
residPlot.ShowReferenceLines();
solidGray = 0xc8c8c8 || SOLID || 1;       /* define line attributes */
run abline(residPlot, 0, 1, solidGray);   /* identity line          */

The results are shown in Figure 12.12. All pairs of residuals from these two models lie near the identity line, which indicates that the two models are fairly similar. If a point in Figure 12.12 were far from the identity line, it would indicate an observation for which the predicted value from one model was vastly different than the predicted value from the other model. Notice that all of the points in Figure 12.12 are below the identity line. This indicates that every residual value for the least squares model is less than the corresponding residual value for the robust model. This is obvious from Figure 12.11, because the predicted values for the robust model are larger than the predicted values for the least squares model for every value of the explanatory variable.

Comparison of Residuals for Each Model

Figure 12.12. Comparison of Residuals for Each Model

Figure 12.12 compares residual values and enables you to examine whether an observation that is an outlier for one model is also an outlier for the other model. You can do a similar comparison for the leverage statistic displayed in Figure 12.4. The robust analogue to the leverage statistic is the robust distance statistic (created by the RD= option on the ROBUSTREG OUTPUT statement) because both statistics indicated how far an observation is from the center of the data in the space of the explanatory variabes. However, the leverage statistic is proportional to a squared distance, so it is desirable to directly compare these quantities. However, recall that the square root of the leverage is proportional to the Mahalanobis distance (created by the MD= option on the ROBUSTREG OUTPUT statement). Therefore, the distance-distance plot shown in Figure 12.13 compares the leverage of the least squares model (measured by the Mahalanobis distance) to the leverage of the robust model (measured by the robust distance). The following statements create Figure 12.13:

/* 8. compare leverage */
declare ScatterPlot distPlot;
distPlot = ScatterPlot.Create(dobj, "MahalDist", "RobDist");
distPlot.SetAxisLabel(XAXIS, AXISLABEL_VARLABEL);
distPlot.SetAxisLabel(YAXIS, AXISLABEL_VARLABEL);
distPlot.SetMarkerSize(5);
run abline(distPlot, 0, 1, solidGray);       /* identity line      */
use RobustDiagnostics;
   read point 2 var {Cutoff};                /* cutoff for RobDist */
close RobustDiagnostics;
run abline(distPlot, Cutoff, 0, solidGray);  /* high-leverage line */
Comparison of Classical and Robust Leverage

Figure 12.13. Comparison of Classical and Robust Leverage

Figure 12.13 includes a horizontal line that indicates the cutoff location for high-leverage points. The value for the cutoff is read from the RobustDiagnostics data set that was created from the DiagSummary ODS table. If you use the mouse to drag out a selection rectangle that selects all observations above the cutoff line, those high-leverage movies are highlighted in all other graphs that were created in this section.

12.9 Logistic Regression Diagnostics

As shown in the previous sections, the dynamically linked graphics in SAS/IML Studio make it easy to identify influential observations. After you create graphs of diagnostic statistics (Cook's D, leverage, and so forth), you can click on the highly influential observations in order to reveal more information about those observations. This technique is not limited to least squares regression. You can also apply the ideas to logistic models, mixed models, nonparametric models, and so forth. This section briefly describes how to create and examine diagnostic statistics for a logistic model.

This section uses the Vehicles data set, which contains an indicator variable named Hybrid. The Hybrid variable has the value 1 for hybrid-electric vehicles and the value 0 for other vehicles.

A distinguishing characteristic of many hybrid-electric vehicles is that they achieve better gas mileage during city driving than on the highway. This is shown in Figure 12.14, where the hybrid-electric vehicles are indicated by stars (*). The graph might lead you to suspect that you can predict whether a vehicle is hybrid-electric based on comparing values of the Mpg_Hwy and Mpg_City variables.

Mileage for Various Vehicles. Stars Represent Hybrid-Electric Vehicles.

Figure 12.14. Mileage for Various Vehicles. Stars Represent Hybrid-Electric Vehicles.

A particularly simple model is to consider only the difference between the Mpg_Hwy and Mpg_City variables. This quantity is contained in the Mpg_Hwy_Minus_City variable in the Vehicles data set.

The process of fitting a logistic model, displaying the results in SAS/IML Studio, and using linked graphics to identify influential observations is very similar to the process described in the section "Fitting a Regression Model" on page 280. The main steps of the process are as follows:

  1. Fit a logistic regression model.

  2. Read the results into a data model.

  3. Set the observation markers to indicate values of the response variable.

  4. Create plots of predicted values, residual values, and/or influence diagnostics.

  5. Use dynamically linked graphics to identify outliers and influential observations.

The following statements call the LOGISTIC procedure to build a predictive model of hybrid electric vehicles that is based on the difference between the mileage on the highway and in the city:

/* create logistic regression diagnostic plots */
submit;
/* 1. run logistic analysis */
proc logistic data=Sasuser.Vehicles;
   model Hybrid(event='1') = Mpg_Hwy_Minus_City;
   output out=LogOut P=Prob c=C
                     reschi=ResChiSq difchisq=DifChiSq;
quit;

data LogOut;            /* add ObsNumber variable and revise labels */
   set LogOut;
   ObsNumber = _N_;
   label Prob      = "Estimated Probability of Being Hybrid"
         C         = "CI Displacement C"
         DifChiSq  = "Influence on Chi-Square"
         ObsNumber = "Observation Number";
run;
endsubmit;

The OUTPUT statement for PROC LOGISTIC creates an output data set that contains all of the variables in the Vehicles data in addition to four variables based on the model:

  • Prob contains the estimated probability that a vehicle is a hybrid.

  • C contains an influence diagnostic statistic (called the confidence interval displacement statistic) which is similar to Cook's D statistic.

  • ResChiSq contains the Pearson chi-square residuals.

  • DifChiSq contains a statistic that measures the effect of deleting an observation on the chi-square statistic.

After using the DATA step to add the ObsNumber variable and to revise the labels of several variables, the next step is to read the data and output statistics into a data object and to set the marker characteristics for the observations. The following statements set the markers for hybrid-electric vehicles to be stars, and the markers for traditional vehicles to be hollow circles:

/* 2. create data object that contains data and results */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Work.LogOut");
dobj.SetRoleVar(ROLE_LABEL, "Model");

/* 3. set observation markers to encode Hybrid=1 */
use LogOut;   read all var {Hybrid};   close LogOut;
dobj.SetMarkerShape(loc(Hybrid=0), MARKER_CIRCLE);
dobj.SetMarkerFillColor(loc(Hybrid=0), NOCOLOR);
dobj.SetMarkerShape(loc(Hybrid=1), MARKER_STAR);

The markers are used for all plots of the data.

The following statements create a graph that shows the estimated probability of being a hybrid, based on a model that uses only the Mpg_Hwy_Minus_City variable:

/* 4. create line plot of estimated probability */
declare LinePlot line;
line = LinePlot.Create(dobj, "Mpg_Hwy_Minus_City", "Prob");
line.SetAxisLabel(XAXIS, AXISLABEL_VARLABEL);
line.SetAxisLabel(YAXIS, AXISLABEL_VARLABEL);
line.SetMarkerSize(5);

The graph is shown in Figure 12.15. Of the 19 hybrid-electric vehicles in the data, the model assigns a high probability of being a hybrid to the 10 that have negative values for the Mpg_Hwy_Minus_City variable. The 1,177 vehicles that have positive values for the Mpg_Hwy_Minus_City variable are assigned a low probability of being a hybrid. Nine of those vehicles are, in fact, hybrid-electric. Consequently, these misclassified vehicles have relatively large Pearson chi-square residuals.

Plot of Estimated Probability

Figure 12.15. Plot of Estimated Probability

You can create many influence diagnostics plots, as described in the documentation for the LOGISTIC procedure. The following statements create two plots. One is a graph of the confidence interval displacement statistic (see Figure 12.16), whereas the other (not shown) is a plot that shows the effect of deleting an observation on the Pearson chi-square statistic.

/* 5. create influence diagnostic plot */
declare ScatterPlot influencePlot;
influencePlot = ScatterPlot.Create(dobj, "ObsNumber", "C");
influencePlot.SetAxisLabel(XAXIS, AXISLABEL_VARLABEL);
influencePlot.SetAxisLabel(YAXIS, AXISLABEL_VARLABEL);
influencePlot.SetMarkerSize(5);

/* 6. create other diagnostic plots */
declare ScatterPlot difChiSqPlot;
difChiSqPlot = ScatterPlot.Create(dobj, "Prob", "DifChiSq");
difChiSqPlot.SetAxisLabel(XAXIS, AXISLABEL_VARLABEL);
difChiSqPlot.SetAxisLabel(YAXIS, AXISLABEL_VARLABEL);
difChiSqPlot.SetMarkerSize(5);

Figure 12.16 shows nine observations with large values of the C statistic. Five of those observations were selected by clicking on them. By using the interactive graphs in SAS/IML Studio, you can come to a better understanding of why certain vehicles are outliers for the models. For example, in the Vehicles data set, observations numbered 610-790 are pickup trucks, and four pickup trucks are hybrid-electric. These four trucks are shown in Figure 12.16 as vehicles with large C values, which indicates that the simple logistic model developed in this section is not adequate to distinguish hybrid-electric pickup trucks from the other pickup trucks. If correctly classifying hybrid-electric pickup trucks is important, then you need to revise the logistic model.

Change in the Deviance Resulting from Deleting an Individual Observation

Figure 12.16. Change in the Deviance Resulting from Deleting an Individual Observation

12.10 Viewing ODS Statistical Graphics

Many of the SAS statistical procedures can produce their own diagnostic plots by using ODS statistical graphics. These plots are not dynamically linked to each other, but still can help you to analyze the fit of a statistical model.

The output window in SAS/IML Studio displays the listing destination for ODS tables. As of SAS/IML Studio 3.3, there is not a way to display HTML output from SAS procedures within SAS/IML Studio. However, if SAS/IML Studio and SAS Foundation are both running on the same PC. you can easily write the HTML output to a file, and then use a Web browser to view the tables and graphs.

The following program demonstrates this approach. Suppose you want to create and view the ODS graphics that are produced by the REG procedure. One way to do this is to specify a Windows path and file name (say, Output.html) for the ODS output on the ODS HTML statement. You can then run one or more procedures and view the HTML output (including any graphs that you have requested) by opening the Output.html file in a Web browser. You can use the Windows operating system to navigate to the file and open it, or you can programatically launch a browser to view the file, as shown in the following example:

/* write ODS output (including graphics) to HTML */
run GetPersonalFilesDirectory(path);                     /* 1 */
htmlfile = "Output.html";
submit path htmlfile;                                    /* 2 */
ds graphics on;
ods html body="&htmlfile" path="&path";                  /* 3 */

proc reg data=Sasuser.Vehicles;                          /* 4 */
   model MPG_Hwy = MPG_City;
quit;

ods html close;
ods graphics off;
endsubmit;                                               /* 5 */

/* launch browser to display HTML */
pgm = "C:Program FilesInternet Exploreriexplore.exe";
ok = ExecuteOSProgram(pgm, path+htmlfile, false);        /* 6 */

The program contains the following main steps:

  1. Define a directory in which to store the HTML and image files. You must have write permission to this directory. One portable way to get such a directory is to use the GetPersonalFiles-Directory module, which is distributed with SAS/IML Studio. By default, the personal files directory corresponds to one of the following Windows directories:

    Windows XP

    C:Documents and SettingsuseridMy DocumentsMy IML Studio Files

    Windows Vista

    C:UsersuseridDocumentsMy IML Studio Files

    Windows 7

    C:UsersuseridDocumentsMy IML Studio Files

  2. Use the SUBMIT statement to set ODS options and to call PROC REG. The Windows path and the name of the HTML file are passed into the SUBMIT block as parameters.

  3. Use the ODS HTML statement to specify that PROC REG should write output to the ODS destination. The PATH= option specifies the directory in which to store the HTML file and image files. The BODY= option specifies the name of the HTML file that you can view with a browser in order to display the output.

  4. Call PROC REG. Because this program specifies the ODS GRAPHICS ON statement, PROC REG creates ODS graphics as part of its output.

  5. The HTML output is written when the ODS HTML CLOSE statement is executed.

  6. Use the ExecuteOSProgram module (which is distributed with SAS/IML Studio) to launch a browser to view the HTML output. The first argument to the module is the name of the program to execute. The second argument contains command-line arguments to the program. In this case, Internet Explorer needs to know the name and location of the Output.html file. The third argument specifies whether or not SAS/IML Studio should pause until Internet Explorer exits. By using false for this argument, the IMLPlus program continues to execute while Internet Explorer is running.

The output from the program is shown in Figure 12.17. You can modify the program to use a different browser, or to store the HTML and image files in a different location. After running the program, the HTML and image files are not automatically deleted, so you should manually delete those files when you are finished viewing them.

If you have configured SAS/IML Studio to communicate with a SAS Workspace Server that is running on a remote computer, then there is no simple way to display ODS graphics on the client PC.

For more information on using ODS statistical graphics, see the chapter "Statistical Graphics Using ODS" in the SAS/STAT User's Guide.

ODS Graphics from SAS Procedures

Figure 12.17. ODS Graphics from SAS Procedures

12.11 References

[bibch12_01] A. C. Atkinson, (1985), Plots, Transformations, and Regression, New York: Oxford University Press.

[bibch12_02] D. A. Belsley,, E. Kuh,, and R. E. Welsch, (1980), Regression Diagnostics, New York: John Wiley & Sons.

[bibch12_03] G. Blom, (1958), Statistical Estimates and Transformed Beta Variables, New York: John Wiley & Sons.

[bibch12_04] R. A. Nelson,, M. R. Donihue,, D. M. Waldman,, and C. Wheaton, (2001), "What's an Oscar Worth?" Economic Inquiry, 39(1), 15-24. URL http://ssrn.com/abstract=2530 82

[bibch12_05] J. O. Rawlings,, S. G. Pantula,, and D. A. Dickey, (1998), Applied Regression Analysis: A Research Tool, Springer Texts in Statistics, Second Edition, New York: Springer-Verlag.

[bibch12_06] J. S. Simonoff, and I. R. Sparrow, (2000), "Predicting Movie Grosses: Winners and Losers, Blockbusters and Sleepers," CHANCE, 13(3), 15-24.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset