Chapter 14. Bootstrap Methods

Contents

  • 14.1 An Introduction to Bootstrap Methods

  • 14.2 The Bootstrap Distribution for a Mean

    • 14.2.1 Obtaining a Random Sample

    • 14.2.2 Creating a Bootstrap Distribution

    • 14.2.3 Computing Bootstrap Estimates

  • 14.3 Comparing Two Groups

  • 14.4 Using SAS Procedures in Bootstrap Computations

    • 14.4.1 Resampling by Using the SURVEYSELECT Procedure

    • 14.4.2 Computing Bootstrap Statistics with a SAS Procedure

  • 14.5 Case Study: Bootstrap Principal Component Statistics

    • 14.5.1 Plotting Confidence Intervals on a Scree Plot

    • 14.5.2 Plotting the Bootstrap Distributions

  • 14.6 References

14.1 An Introduction to Bootstrap Methods

The bootstrap method is a general technique for estimating unknown parameters, computing standard errors for statistics, and finding confidence intervals for parameters. Introductory references for bootstrap techniques include Efron and Tibshirani (1993) and Davison and Hinkley (1997).

This chapter is an applied chapter that describes how to use the bootstrap method in SAS/IML and IMLPlus. IMLPlus has several features that make the bootstrap method easier to implement. The chapter introduces resampling techniques and explains how to estimate variability by using the bootstrap standard error and bootstrap confidence intervals.

The basic bootstrap algorithm is as follows. Given data x = (x1, x2, ..., xn) and a statistic s(x), the bootstrap method generates a large number of independent bootstrap samples x*1, x*2, ..., x*B, each typically of size n. For each bootstrap sample, you compute s(x*i), the statistic on the ith sample. In the simplest cases, the bootstrap samples are formed by randomly sampling from x with replacement.

The union of the statistics forms the bootstrap distribution for s. The bootstrap distribution is an estimate for the sampling distribution of the statistic. The mean of the bootstrap distribution is an estimate of s(x) (which is itself an estimate of the corresponding population parameter). The standard deviation of the bootstrap distribution is the bootstrap estimate of the standard error of s. Percentiles of the bootstrap distribution are the simplest estimates for confidence intervals. There are also more sophisticated estimates for confidence intervals (see Efron and Tibshirani (1993)), but these are not described in this book.
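The algorithm is short enough to sketch in full. The following Python sketch illustrates the steps for the mean statistic; it is an illustration only (the chapter's own implementations use SAS/IML and SAS procedures), and the data vector is made up:

```python
import random
import statistics

def bootstrap_mean(data, B=1000, alpha=0.05, seed=1):
    """Percentile bootstrap for the mean: resample with replacement B times."""
    rng = random.Random(seed)
    n = len(data)
    # each bootstrap sample has size n; the statistic here is the mean
    boot_stats = [statistics.fmean(rng.choices(data, k=n)) for _ in range(B)]
    boot_stats.sort()
    std_err = statistics.stdev(boot_stats)     # bootstrap estimate of SE
    lo = boot_stats[int((alpha / 2) * B)]      # percentile CI limits
    hi = boot_stats[int((1 - alpha / 2) * B) - 1]
    return statistics.fmean(boot_stats), std_err, (lo, hi)

x = [12.0, 15.0, 9.0, 22.0, 30.0, 7.0, 18.0, 25.0, 11.0, 16.0]
mean_boot, se_boot, ci = bootstrap_mean(x)
```

The sorted vector `boot_stats` is the bootstrap distribution; its mean, standard deviation, and percentiles supply the three estimates described above.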

14.2 The Bootstrap Distribution for a Mean

One of the simplest statistics is the mean. Because the sampling distribution of the mean and the standard error of the mean (SEM) are well understood, the mean is a good choice for a statistic to use in an introduction to bootstrap methods.

This section consists of three parts. The first part creates a data set to be analyzed. The second part resamples from the data many times and computes the mean of each resample. The union of the means is the bootstrap distribution for the mean statistic. The third part uses the bootstrap distribution to estimate the SEM and confidence intervals for the mean of the underlying population.

14.2.1 Obtaining a Random Sample

Suppose you are interested in statistics related to movies released in the US during the time period 2005-2007. The Movies data set contains a sampling of movies that were also rated by the kids-in-mind.com Web site. Strictly speaking, these movies are not a random sample from the population of all US movies, since the data include all major movies with nationwide release. (In fact, the data essentially equal the population of US movies; the data set is lacking only minor releases that had limited distribution.) In discussing bootstrap statistics, it is important to differentiate between a random sample and the underlying population. To make this distinction perfectly clear, the following DATA step extracts a random sample of approximately 25% of the movies into the Sample data set:

/* extract random sample of movies for 2005-2007 */
submit;
data Sample;
   set Sasuser.Movies;
   if (ranuni(1)<0.25);
run;
endsubmit;

Figure 14.1 shows the distribution of the budgets for movies in the Sample data set. The histogram is created by the following statements:

/* read in movies and draw histogram */
use Sample;
read all var {Budget} into x;
close Sample;

declare Histogram hist;
hist = Histogram.Create("Movies", x);
hist.SetAxisLabel(XAXIS,"Sample Movie Budgets");

Figure 14.1. Distribution of Movie Budgets in Sample

The data are not normally distributed. The sample mean of the budgets is $39.2 million. There are several movies with very large budgets, although the complete Movies data set has three movies with budgets greater than $200 million that are not included in the Sample data. (For comparison, Figure 7.4 shows the distribution for all of the budgets in the Movies data set.) The remainder of this chapter analyzes the Sample data.

14.2.2 Creating a Bootstrap Distribution

Implementing a bootstrap estimate often includes the following steps:

  1. Compute the statistic of interest on the original data.

  2. Resample B times from the data (with replacement) to form B bootstrap samples.

  3. Compute the statistic on each resample.

  4. If desired, graph the bootstrap distribution.

  5. Compute standard errors and confidence intervals.

For the movies in the Sample data set, you can use bootstrap methods to examine the sampling distribution of the mean (or any other statistic). Since the first step in the bootstrap method is to resample (with replacement) from the data, it is convenient to use the SampleWithReplace module that was developed in Chapter 13, "Sampling and Simulation."

The following statements use the SampleWithReplace module to carry out a simple bootstrap on the budgets for movies:

/* bootstrap method: the statistic to bootstrap is the mean */
/* 1. Compute the statistic on the original data */
Mean = x[:];                             /* sample mean             */

/* 2. Resample B times from the data (with replacement)
   to form B bootstrap samples. */
call randseed(12345);
load module=SampleWithReplace;           /* not required in IMLPlus */
B = 1000;
n = nrow(x);
EQUAL = .;
xBoot = SampleWithReplace(x, B||n, EQUAL);

/* 3. compute the statistic on each resample */
s = xBoot[,:];

As indicated previously, the main steps of the bootstrap algorithm are as follows:

  1. Compute the statistic (in this case, the sample mean) on the data.

  2. Use the SampleWithReplace module to resample from the data. The module returns a matrix, xBoot, which consists of B rows and n columns, where n is the number of observations in the data. Each row represents a resampling from the data.

  3. Compute the statistic on each resample. In this case, compute the mean of each row of xBoot. The resulting vector, s, contains B statistics (means).

The vector s in the previous program contains B statistics. The union of the statistics is the bootstrap distribution for the statistic. It is an estimate for the sampling distribution of the statistic. A histogram of s is shown in Figure 14.2 and is created with the following IMLPlus statements:

/* 4. If desired, graph the bootstrap distribution */
declare Histogram hBoot;
hBoot = Histogram.Create("Means", s);
hBoot.SetAxisLabel(XAXIS,"Means of Resamples");
hBoot.ReBin(40, 1);
attrib = BLACK || SOLID || 2;
run abline(hBoot, Mean, ., attrib);

Figure 14.2. Estimate of the Sampling Distribution of the Mean

The figure shows that most of the means for the resampled data are between $35 and $43 million per movie. A small number were as small as $27 million per movie, or as large as $54 million per movie. The sample statistic has the value 39.2, which is shown as a vertical line in Figure 14.2. This figure enables you to see the variation in the mean of the resampled data.

14.2.3 Computing Bootstrap Estimates

This section shows how you can use the bootstrap distribution shown in Figure 14.2 to compute bootstrap estimates for familiar quantities: the mean, the standard error of the mean (SEM), and the confidence intervals for the mean. The mean of the bootstrap distribution is an estimate of the mean of the original sample. The standard deviation of the bootstrap distribution is the bootstrap estimate of the standard error of s. Percentiles of the bootstrap distribution are the simplest estimates for confidence intervals. The following statements (which continue the program in the previous section) compute these quantities:

/* 5. Compute standard errors and confidence intervals. */
MeanBoot = s[:];                   /* a. mean of bootstrap dist     */
StdErrBoot = sqrt(var(s));         /* b. std error                  */
alpha = 0.05;
prob = alpha/2 || 1-alpha/2;       /* lower/upper percentiles       */
call qntl(CIBoot, s, prob);        /* c. quantiles of sampling dist */
print MeanBoot StdErrBoot CIBoot;

Figure 14.3. Bootstrap Estimates for the Sampling Distribution of the Mean

The output is shown in Figure 14.3. The following list describes the main steps:

  1. Recall that the bootstrap distribution for the mean is contained in s. The mean of the bootstrap distribution is an estimate of the sample mean. In this case, Figure 14.2 shows that the bootstrap distribution is symmetric and unimodal (as expected by the central limit theorem), so the mean of s represents the central tendency of the bootstrap distribution.

  2. The standard deviation of the bootstrap distribution is an estimate of the SEM. In this case, the standard deviation of s represents the spread of the bootstrap distribution. The VAR function is available in SAS/IML 9.22. If you are using an earlier version, you can use the Var module that is defined in Appendix D.

  3. The percentiles of the sampling distribution provide an estimate of a 95% confidence interval. The QNTL function is implemented in SAS/IML 9.22. If you are using an earlier version, you can use the Qntl module that is defined in Appendix D.

You can visualize these quantities by adding vertical lines to the histogram of the sampling distribution. The following statements draw lines on Figure 14.2. The resulting graph is shown in Figure 14.4.

attrib = BLACK || DASHED || 1 || PLOTFOREGROUND;
run abline(hBoot, MeanBoot//CIBoot, {.,.,.}, attrib);

Figure 14.4. Estimate of the Mean and Confidence Intervals for the Sampling Distribution

In Figure 14.4, the solid line is the original statistic on the data. The dashed lines are bootstrap estimates.

The power of the bootstrap is that you can create bootstrap estimates for any statistic, s, even those for which there is no closed-form expression for the standard error and confidence intervals. However, because the statistic in this example is the mean, you can use elementary statistical theory to compute classical statistics (based on the sample) and compare the bootstrap estimates to the classical estimates.

The sampling distribution of the mean is well understood. The central limit theorem says that the sampling distribution of the mean is approximately normally distributed with parameters (μ, σ). The sample mean, x̄, is an estimate for μ. An estimate for σ is the quantity √(sx²/n), where sx² is the sample variance and n is the size of the sample. Furthermore, the two-sided 100(1 − α)% confidence interval for the mean has upper and lower limits given by x̄ ± t(1−α/2, n−1) √(sx²/n), where t(1−α/2, n−1) is the 1 − α/2 percentile of the Student's t distribution with n − 1 degrees of freedom. The following program computes these estimates:

/* Compute traditional estimates for the sampling distribution of
   the mean by computing statistics of the original data */
StdErr = sqrt(var(x)/n);           /* estimate SEM                  */
t = quantile("T", prob, n-1);      /* percentiles of t distribution */
CI = Mean + t * StdErr;            /* 95% confidence interval       */
print Mean StdErr CI;

Figure 14.5. Estimates for the Sampling Distribution of the Mean

The output is shown in Figure 14.5. The sample mean is 39.15. The SEM is estimated by 3.76 and the 95% confidence interval by [31.7,46.6].

A comparison of Figure 14.5 with Figure 14.3 shows that the classical estimates are similar to the bootstrap estimates. However, note that the classical confidence interval is symmetric about the mean (by definition), whereas the bootstrap confidence interval is usually not symmetric. The bootstrap estimate is based on percentiles of the statistic on the resampled data, not on any formula. This fact implies that bootstrap confidence intervals might be useful for computing confidence intervals for a statistic whose sampling distribution is not symmetric.
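For reference, the classical interval can be sketched in a few lines of Python. This is an illustration under two stated simplifications: the data vector is hypothetical, and a normal quantile stands in for the t quantile (a reasonable large-sample approximation):

```python
import math
import statistics

def classical_ci(data, alpha=0.05):
    """Normal-approximation CI for the mean: xbar +/- z * sqrt(s^2/n)."""
    n = len(data)
    xbar = statistics.fmean(data)
    sem = math.sqrt(statistics.variance(data) / n)      # estimate of the SEM
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    return xbar, sem, (xbar - z * sem, xbar + z * sem)

x = [12.0, 15.0, 9.0, 22.0, 30.0, 7.0, 18.0, 25.0, 11.0, 16.0]
xbar, sem, ci = classical_ci(x)
```

Note that this interval is symmetric about the sample mean by construction, whereas a bootstrap percentile interval in general is not.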

It is interesting to note that the mean budget in the Movies data set is $44.8 million, which falls within the 95% confidence limits for either method. The Movies data set contains essentially all (major) movies released in 2005-2007, so you can take $44.8 million to be the mean of the population.

14.3 Comparing Two Groups

In the previous example, you can replace the mean statistic with any other statistic such as the variance or a trimmed mean. If you choose a set of variables in the Sample data set, you can bootstrap the standard errors and confidence intervals for regression coefficients or correlation coefficients.

You can also use the bootstrap method to compare statistics between two subpopulations. For example, you can compare the mean budgets of PG-13 and the R-rated movies in the Sample data set and ask whether they are significantly different. The computations are straightforward:

  1. Given two groups, resample B times from each group (with replacement) to form B bootstrap samples.

  2. Compute the difference between the means for each of the B resamples. This creates the bootstrap distribution for the difference of means.

  3. Determine whether the 95% percentile confidence intervals contain zero. If so, conclude that there is no evidence that the means of the two groups are different.
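The steps above can be sketched in Python (illustrative only; the group data are hypothetical stand-ins for the PG-13 and R-rated budgets):

```python
import random
import statistics

def boot_diff_means(g1, g2, B=1000, alpha=0.05, seed=1):
    """Bootstrap percentile CI for the difference of two group means."""
    rng = random.Random(seed)
    diffs = sorted(
        statistics.fmean(rng.choices(g1, k=len(g1)))
        - statistics.fmean(rng.choices(g2, k=len(g2)))
        for _ in range(B)
    )
    lo = diffs[int((alpha / 2) * B)]
    hi = diffs[int((1 - alpha / 2) * B) - 1]
    return lo, hi

pg13 = [40.0, 55.0, 30.0, 80.0, 25.0, 60.0]   # hypothetical group-1 budgets
r    = [35.0, 20.0, 45.0, 50.0, 30.0, 15.0]   # hypothetical group-2 budgets
lo, hi = boot_diff_means(pg13, r)
contains_zero = lo <= 0 <= hi   # no evidence of a difference if True
```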

The following statements compute the bootstrap estimate:

/* compute bootstrap estimate for difference between means of two groups */
use Sample where (MPAARating="PG-13");    /* read data from group 1 */
read all var {Budget} into x1;
use Sample where (MPAARating="R");        /* read data from group 2 */
read all var {Budget} into x2;
close Sample;

/* 1. compute bootstrap distribution for difference between means   */
call randseed(12345);
load module=SampleWithReplace;       /* not required in IMLPlus     */
B = 1000;
n1 = nrow(x1);
n2 = nrow(x2);
EQUAL = .;
Boot1 = SampleWithReplace(x1, B||n1, EQUAL);   /* resample B times  */
Boot2 = SampleWithReplace(x2, B||n2, EQUAL);   /* from each group   */

/* 2. compute the difference between the means for each resample    */
s1 = Boot1[,:];                      /* means of B resamples from x1 */
s2 = Boot2[,:];                      /* means of B resamples from x2 */
s = s1 - s2;                         /* difference of means         */

/* 3. Compute bootstrap estimate for 95% C.I. */
alpha = 0.05;
prob = alpha/2 || 1-alpha/2;
call qntl(CIBoot, s, prob);
print CIBoot;

Figure 14.6. Bootstrap Confidence Intervals for Difference of Means

As shown in Figure 14.6, the 95% confidence interval for the mean difference includes zero, so the observed difference in the mean budgets between PG-13 and R-rated movies is not significant. Figure 14.7 shows a bootstrap distribution for the difference of means. Vertical lines on the figure indicate the 2.5% and 97.5% percentiles of the distribution.

Figure 14.7. A Bootstrap Distribution for the Difference of Means

14.4 Using SAS Procedures in Bootstrap Computations

In many ways, the SAS/IML language is a natural language for programming simple bootstrap algorithms on small or medium-sized data sets. As shown in the previous sections, it is fast and easy to resample the data and to compute a mean or another descriptive statistic.

However, it is also possible to implement bootstrap methods in IMLPlus by combining SAS/IML statements with calls to SAS procedures. Procedures offer several advantages over implementing complicated bootstrapping algorithms entirely in SAS/IML software:

  • More Resampling Methods The bootstrap resampling is supposed to mimic the design that produced the original data. For random sampling with replacement, this is straightforward, as indicated in Chapter 13, "Sampling and Simulation." For complicated sampling schemes, it might be useful to use the SURVEYSELECT procedure to carry out the bootstrap resampling, as opposed to implementing the resampling entirely in SAS/IML statements.

  • More Statistical Computations Most of the statistical functionality in SAS software is contained in procedures. The functionality available in SAS/STAT, SAS/ETS, and other products is much greater than the built-in functionality of the SAS/IML statistical library. For example, consider the problem of estimating the accuracy of parameter estimates for a complicated generalized linear model. It is straightforward to call the GENMOD procedure in SAS/STAT, whereas SAS/IML does not have an analogous built-in function or module.

  • Procedures Handle Missing Values Not all SAS/IML operations support missing values (for example, matrix multiplication), but SAS procedures handle missing values without any effort from you.

  • Procedures Handle Degenerate Data Writing robust SAS/IML statements to perform a computation means that you have to handle errors that can arise in analyzing all possible data sets. For example, a SAS/IML module for linear regression should correctly handle singular data such as collinear variables and constant data. SAS procedures do this already.

  • Procedures Save Time Writing your own SAS/IML computation means spending time verifying that your program is computing the correct answers. A SAS/IML program that you write yourself might have a subtle (or not so subtle!) error, whereas the SAS procedures have been tested and validated. Furthermore, SAS procedures (which are implemented in compiled and optimized C code) can compute statistics more efficiently than an interpreted language such as the SAS/IML language.

  • Procedures Handle Large Data SAS/IML software requires that all vectors and matrices fit into RAM. One of the advantages of SAS procedures is the ability to compute with data sets so large that they do not fit in RAM. While bootstrapping a mean is not memory-intensive, computing more complex statistics on large data sets might require more memory than is available on your computer.

For all of these reasons, this section describes how you can call SAS procedures to generate resamples and to compute statistics on the resamples. You can use the SURVEYSELECT procedure in SAS/STAT software to carry out the resampling portion of the bootstrap method. You can call any SAS procedure to compute the statistics.

14.4.1 Resampling by Using the SURVEYSELECT Procedure

Some SAS programmers use the DATA step for everything, including generating random samples from data. The DATA step can sample reasonably quickly, but it is even faster to use the SURVEYSELECT procedure to generate a random sample.

The SURVEYSELECT procedure can generate B random samples of the input data, and is therefore ideal for bootstrap resampling (Cassell 2007, 2010; Wicklin 2008). It implements many different sampling techniques, but this section focuses on random sampling with replacement. The SURVEYSELECT procedure refers to "sampling with replacement" as "unrestricted random sampling" (URS).

As an example, suppose your data consists of four values: A, B, C, and D. The following statements create a data set named BootIn that contains the data:

/* create small data set */
x = {A,B,C,D};
create BootIn var {"x"};
append;
close BootIn;

Suppose you want to generate B bootstrap resamples from the data in the BootIn data set. The following statements use the SUBMIT statement to call the SURVEYSELECT procedure and to pass in two parameters to the procedure. The procedure reads data from BootIn and writes the resamples into the BootSamp data set.

/* Use the SURVEYSELECT procedure to generate bootstrap resamples */
N = nrow(x);
B = 5;

submit N B;
proc surveyselect data=BootIn out=BootSamp noprint
     seed   = 12345               /* 1 */
     method = urs                 /* 2 */
     n      = &N                  /* 3 */
     rep    = &B                  /* 4 */
     OUTHITS;                     /* 5 */
run;
endsubmit;

The following list describes the options to the SURVEYSELECT procedure:

  1. Set the seed for the random number generator used by the SURVEYSELECT procedure.

  2. Specify the method used for resampling. This example uses unrestricted random sampling (URS).

  3. Specify the size of the resampled data. For bootstrap resampling this is often the size of the input data set. This value is passed in on the SUBMIT statement.

  4. Specify the number of times, B, to resample. This value is also passed in on the SUBMIT statement.

  5. Specify the OUTHITS option on the PROC SURVEYSELECT statement when you want the output data to consist of nB observations, where the first n observations correspond to the first resample, the next n observations correspond to the next resample, and so on.

The output data set created by the SURVEYSELECT procedure is displayed by the following statements, and is shown in Figure 14.8:

submit;
proc print data=BootSamp;
   var Replicate x;
run;
endsubmit;

Figure 14.8. Bootstrap Resamples Using OUTHITS Option

It is not necessary to specify the OUTHITS option. In fact, it is more efficient to omit it. If you omit the OUTHITS option, the NumberHits variable in the output data set contains the frequency with which each observation is selected. Omitting the OUTHITS option and using NumberHits as a frequency variable when computing statistics is typically faster than using the OUTHITS option because there are fewer observations in the BootSamp data set to read. Most, although not all, SAS procedures have a FREQ statement for specifying a frequency variable. Figure 14.9 shows the output data set created by the SURVEYSELECT procedure when you omit the OUTHITS option:

Figure 14.9. Bootstrap Resamples without Using OUTHITS Option

The BootSamp data set contains a copy of all variables in the input data set and contains the variable Replicate, which identifies the bootstrap resample that is associated with each observation. The data set also contains the NumberHits variable, which can be used on a FREQ statement to specify the frequency of each observation in the resample.
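The hit-count representation is easy to mimic outside of SAS. The following Python sketch (illustrative; the observation values are made up) draws one unrestricted random sample, tabulates the hits per observation, and verifies that a hit-weighted mean equals the mean of the fully expanded resample:

```python
import random
import statistics
from collections import Counter

rng = random.Random(12345)
x = ["A", "B", "C", "D"]
values = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}  # made-up numeric data

resample = rng.choices(x, k=len(x))   # one unrestricted random sample (URS)
hits = Counter(resample)              # NumberHits-style frequency table

# mean of expanded resample == hit-weighted mean of distinct observations
expanded_mean = statistics.fmean(values[obs] for obs in resample)
weighted_mean = sum(values[k] * h for k, h in hits.items()) / sum(hits.values())
```

The frequency table carries the same information as the expanded resample in at most as many rows, which is why omitting OUTHITS is more efficient.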

14.4.2 Computing Bootstrap Statistics with a SAS Procedure

The preceding section described how to use the SURVEYSELECT procedure to generate B bootstrap resamples. This section shows how to use the BY statement (and, if desired, the FREQ statement) of a SAS procedure to generate the bootstrap distribution of a statistic.

This section duplicates the example in Section 14.2.2 by creating a bootstrap distribution for the mean statistic. However, instead of computing the means of each resample in the SAS/IML language, this section uses the MEANS procedure to do the same analysis.

Assume that the data set Sample contains the Budget variable for a subset of movies, as described in Section 14.2.2. The following program uses the SURVEYSELECT procedure to generate bootstrap resamples and uses the MEANS procedure to generate the bootstrap distribution for the mean statistic:

/* use SURVEYSELECT to resample; use PROC to compute statistics */
DSName = "Sample";                          /* 1 */
VarName = "Budget";
B = 1000;

submit DSName VarName B;                    /* 2 */
/* generate bootstrap resamples */
proc surveyselect data=&DSName out=BootSamp noprint
     seed=12345 method=urs rep= &B
     rate = 1;                              /* 3 */
run;

/* use procedure to compute statistic on each resample */
proc means data=BootSamp noprint;           /* 4 */
   by Replicate;                            /* a */
   freq NumberHits;                         /* b */
   var &VarName;
   output out=BootDist mean=s;              /* c */
run;
endsubmit;

The following list describes the main steps of the program:

  1. Define SAS/IML variables that contain parameters for the analysis. These include the name of the input data set, the name of the variable to use in the bootstrap analysis, and the number of bootstrap resamples to generate.

  2. Pass the parameters to SAS procedures by listing the names of the SAS/IML variables on the SUBMIT statement.

  3. Generate the bootstrap resamples by calling the SURVEYSELECT procedure. The options for the procedure are the same as in the previous section, except that the program uses the RATE= option instead of the N= option. The option RATE=1 specifies that each resample should contain the same number of observations as the input data set, but does not require that you specify how many observations are in the input data set. Notice that the OUTHITS option is omitted from the PROC SURVEYSELECT statement.

  4. The MEANS procedure computes the mean of the Budget variable for each resample.

    • a) The BY statement specifies that the B resamples correspond to values of the Replicate variable.

    • b) The FREQ statement specifies that the NumberHits variable contains the frequency of each value of the Budget variable.

    • c) The mean for each BY group is saved in the BootDist data set, in the variable s.
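The BY-group and FREQ logic can be mimicked in a few lines of Python. In this illustrative sketch, the rows stand in for the un-expanded BootSamp data set (one row per distinct observation per replicate), and the per-replicate weighted means form the bootstrap distribution:

```python
from collections import defaultdict

# (Replicate, Budget, NumberHits) rows, as in the un-expanded BootSamp data
rows = [
    (1, 10.0, 2), (1, 30.0, 1), (1, 40.0, 1),
    (2, 20.0, 3), (2, 40.0, 1),
]

sums = defaultdict(float)
freqs = defaultdict(int)
for rep, budget, hits in rows:      # one pass, grouped by Replicate
    sums[rep] += budget * hits      # FREQ-weighted sum
    freqs[rep] += hits              # total frequency in the BY group

# per-replicate weighted mean, analogous to the BootDist data set
boot_dist = {rep: sums[rep] / freqs[rep] for rep in sums}
```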

You can visualize the bootstrap distribution by using IMLPlus graphics, as shown in the following statements and in Figure 14.10:

/* create histogram of bootstrap distribution */
use BootDist;
read all var {s};
close BootDist;

declare Histogram hBoot;
hBoot = Histogram.Create("Means", s);
hBoot.SetAxisLabel(XAXIS,"Means of Resamples (PROC MEANS)");
hBoot.ReBin(40, 1);

Figure 14.10. Estimate of the Sampling Distribution of the Mean

You can compare Figure 14.10, which was computed by calling the SURVEYSELECT and MEANS procedures, with Figure 14.2, which was computed entirely in the SAS/IML language. The distributions are essentially the same. The small differences are due to the fact that the random numbers that were used to generate the resamples in SAS/IML are different from the random numbers used to generate the resamples in the SURVEYSELECT procedure.

14.5 Case Study: Bootstrap Principal Component Statistics

The bootstrap examples in this chapter so far have been fairly simple: the statistic was either a mean or the difference of means. This section uses the bootstrap method to estimate confidence intervals for a statistic that occurs as part of a principal component analysis.

The starting point for the bootstrap example is a principal component analysis for several variables in the Sample subset of the Movies data set. The following program calls the PRINCOMP procedure to compute a principal component analysis of the variables Budget, US_Gross, Sex, Violence, and Profanity:

/* compute principal component analysis of data */
DSName = "Sample";
VarNames = {"Budget" "US_Gross" "Sex" "Violence" "Profanity"};

submit DSName VarNames;
ods select NObsNVar EigenValues Eigenvectors;
proc princomp data=&DSName;
   var &VarNames;
   ods output Eigenvalues=EigenValues;
run;
endsubmit;

Figure 14.11. Principal Component Analysis

Let Pᵢ be the proportion of variance explained (PVE) by the ith principal component, i = 1, ..., 5. This quantity is shown in the third column of the "Eigenvalues" table, shown in Figure 14.11. According to Figure 14.11, the first principal component explains 41.6% of the variance. But how much uncertainty is in that estimate? Is a 95% confidence interval for the PVE small (such as [0.414, 0.418]) or large (such as [0.32, 0.52])? There is no analytic formula for estimating the confidence intervals for the PVE, but bootstrap methods provide a computational way to estimate the sampling distribution of the Pᵢ.

A plot of the Pᵢ versus the component number, i, is called a scree plot. A scree plot for these data is shown in Figure 14.12. The goal of this section is to use bootstrap methods to estimate the variation in the Y coordinates of the scree plot.
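The PVE values themselves are simple functions of the eigenvalues: each proportion is an eigenvalue divided by the sum of all eigenvalues. The following Python sketch uses hypothetical eigenvalues, chosen to sum to 5 (as for a correlation matrix of five variables) and so that the first PVE is 0.416, as in Figure 14.11:

```python
eigenvalues = [2.08, 1.15, 0.90, 0.55, 0.32]  # hypothetical eigenvalues
total = sum(eigenvalues)
pve = [ev / total for ev in eigenvalues]      # proportion of variance explained

# cumulative proportions, as in the last column of the Eigenvalues table
cum = []
running = 0.0
for p in pve:
    running += p
    cum.append(running)
```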

Figure 14.12. Scree Plot of Eigenvalues for PCA

The following program builds on the bootstrap method implemented in Section 14.4.2. The SURVEYSELECT procedure is used to generate the bootstrap resamples, and the call to the PRINCOMP procedure that generates Figure 14.11 is modified to include a BY statement that generates the sampling distribution for the PVE for each principal component.

/* use SURVEYSELECT to resample; use PROC to compute statistics */
B = 1000;                        /* number of bootstrap samples */

submit DSName VarNames B;
/* generate bootstrap resamples */
proc surveyselect data=&DSName out=BootSamp noprint
     seed=12345 method=urs rate=1 rep=&B;
run;

/* Compute the statistic for each bootstrap sample */
ods listing exclude all;
proc princomp data=BootSamp;
   by Replicate;
   freq NumberHits;
   var &VarNames;
   ods output Eigenvalues=BootEVals(keep=Replicate Number Proportion);
run;
ods listing exclude none;
endsubmit;

The sampling distribution of the PVE can be visualized in two ways: as a scree plot and as a panel of five dynamically linked histograms.

14.5.1 Plotting Confidence Intervals on a Scree Plot

You can create a scatter plot that consists of all of the B scree plots generated by the principal component analysis computed on the B resamples:

/* create union of scree plots for bootstrap resamples */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Work.BootEVals");
dobj.SetVarFormat("Proportion","BEST5.");

declare ScatterPlot p;
p = ScatterPlot.Create(dobj, "Number", "Proportion");
p.SetMarkerSize(3);

To that scatter plot, you can overlay the PVE for the original sample, in addition to confidence intervals for the underlying parameter in the population of movies. The following statements overlay the quantities; the result is shown in Figure 14.14.

/* overlay statistics of the original data */
use EigenValues;                 /* 1. read PVE for data */
read all var {Number Proportion};
close EigenValues;
p.DrawUseDataCoordinates();      /* define coordinate system         */
p.DrawMarker(Number, Proportion, MARKER_X, 7); /* plot proportions   */

use BootEVals;                   /* 2. read stats for resamples      */
read all var {Number Proportion};
close BootEVals;

NumPC = ncol(VarNames);          /* number of principal components   */
s = shape(Proportion, 0, NumPC); /* 3. reshape results               */

mean = s[:,];                    /* 4. compute mean of each column   */
alpha = 0.05;                    /* significance level for C.I.      */
prob = alpha/2 || 1-alpha/2;     /* lower/upper values for quantiles */
call qntl(CI, s, prob);          /* compute C.I. for each column     */
print mean, CI[rowname={LCI UCI} label="Confidence Intervals"];

p.DrawSetRegion(PLOTBACKGROUND);
p.DrawSetBrushColor(0xC8C8C8);   /* light gray */
dx = 0.1;                        /* half-width of rectangles         */
do i = 1 to NumPC;               /* 5. draw rectangles and mean line */
   p.DrawRectangle(i-dx, CI[1,i], i+dx, CI[2,i], true);
   p.DrawLine(i-dx, mean[i], i+dx, mean[i]);
end;

The program consists of the following main steps:

  1. Read the statistics for the PVE of the original data. Those proportions are indicated on the scatter plot by large markers (x).

  2. Read the Proportion variable that contains the PVE for each bootstrap resample. The vector Proportion has 5B elements.

  3. Reshape the Proportion vector into a matrix with 5 columns so that the ith column contains the proportion of variance explained by the ith principal component.

  4. Compute the mean and bootstrap percentile confidence interval for each column. The output is shown in Figure 14.13.

  5. Plot rectangles that indicate the bootstrap percentile confidence intervals.
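The reshape-and-quantile logic in steps 3 and 4 can be sketched outside of IML as well. The following NumPy analogue mirrors the SHAPE and QNTL calls; the Dirichlet-simulated proportions are purely illustrative stand-ins for the bootstrap PVE values:

```python
import numpy as np

rng = np.random.default_rng(0)
B, NumPC = 1000, 5

# Illustrative stand-in for the Proportion vector: B resamples, each a
# 5-vector of proportions that sums to 1.
proportion = rng.dirichlet(alpha=[20, 10, 6, 4, 2], size=B).ravel()

# Step 3: reshape the 5B-element vector into a B x NumPC matrix
s = proportion.reshape(-1, NumPC)

# Step 4: column means and 95% bootstrap percentile intervals
mean = s.mean(axis=0)
alpha = 0.05
ci = np.quantile(s, [alpha / 2, 1 - alpha / 2], axis=0)  # 2 x NumPC
```

Each column of `ci` is the percentile interval for one principal component, which is exactly what the DrawRectangle loop in step 5 visualizes.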

Figure 14.13. Principal Component Analysis

Figure 14.14. Scree Plot with Bootstrap Confidence Intervals

Notice that Figure 14.13 answers the question "how much uncertainty is in the PVE estimate for the sample data?" The answer is that there is quite a bit of uncertainty. The estimate for the proportion of variance explained by the first principal component is 0.416, but the confidence interval is fairly wide: [0.368, 0.484]. The half-width of the confidence interval is more than 10% of the size of the estimate. Similar results apply to the proportions of variance explained by the other principal components.
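A quick arithmetic check of that relative-width claim, using the printed values for the first principal component:

```python
est = 0.416                      # PVE estimate for the first PC
lci, uci = 0.368, 0.484          # bootstrap percentile interval
half_width = (uci - lci) / 2     # 0.058
ratio = half_width / est         # about 0.14, i.e., more than 10%
```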

14.5.2 Plotting the Bootstrap Distributions

If you want to view the bootstrap distributions of the PVE for each principal component, you can create a panel of histograms. The ith column of the s matrix (computed in the previous section) contains the data for the ith histogram. The mean vector and CI matrix contain the mean and 95% confidence interval for each PVE. The following statements create a data object from the s matrix. For each column, the program creates a histogram and overlays the mean and the bootstrap percentile confidence interval. The resulting panel of histograms is shown in Figure 14.15.

/* create data object from bootstrap distribution;
   create histograms and overlay bootstrap mean and C.I. */
names = "Proportion1":"Proportion"+strip(char(NumPC));
declare DataObject dobj2;
dobj2 = DataObject.Create("PVE", names, s);

declare Histogram hist;
attrib = BLACK || DASHED || 1 || PLOTFOREGROUND;
do i = 1 to NumPC;
   hist = Histogram.Create(dobj2, names[i]);
   /* use a module distributed with SAS/IML Studio to compute
      the position of i_th plot in an array of 6 plots */
   run CalcPlotPosition(i, 6, left, top, width, height);
   hist.SetWindowPosition(left, top, width, height);

   /* plot mean and C.I. */
   run abline(hist, mean[i]//CI[,i], {.,.,.}, attrib);

   /* draw rectangle in background to indicate C.I. */
   hist.DrawUseDataCoordinates();
   hist.GetAxisViewRange(YAXIS, YMin, YMax);
   hist.DrawSetRegion(PLOTBACKGROUND);
   hist.DrawSetBrushColor(0xC8C8C8);
   hist.DrawRectangle(CI[1,i], 0, CI[2,i], YMax+10, true);
end;
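The grid arithmetic that a module such as CalcPlotPosition performs can be sketched as follows. This is a hypothetical Python analogue (the actual IMLPlus module's signature and layout rules may differ): it tiles n plots row-major into a near-square grid and returns each plot's position as fractions of the window:

```python
import math

def calc_plot_position(i, n, ncols=None):
    """Hypothetical analogue of the IMLPlus CalcPlotPosition module:
    return (left, top, width, height) fractions for the ith of n plots
    (1-based index) tiled row-major in a near-square grid."""
    if ncols is None:
        ncols = math.ceil(math.sqrt(n))   # e.g., 3 columns for 6 plots
    nrows = math.ceil(n / ncols)
    row, col = divmod(i - 1, ncols)
    w, h = 1.0 / ncols, 1.0 / nrows
    return (col * w, row * h, w, h)
```

For 6 plots this yields a 3-by-2 grid, so the fourth plot starts the second row at the left edge of the window.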

Figure 14.15 shows the bootstrap distributions for the proportion of variance explained by each principal component. The distributions are slightly skewed; this is especially noticeable for Proportion2.

The figure also shows how the dynamically linked graphics in SAS/IML Studio can give additional insight into the fact that these five statistics are not independent, since the sum of the five proportions must equal unity. In the figure, large values of the Proportion1 variable are selected. The selected observations correspond to bootstrap resamples for which the value of the Proportion1 variable is large. For these bootstrap resamples, the Proportion2 variable tends to be smaller than the mean taken over all bootstrap resamples. Similar relationships hold for the other variables, and also for the bootstrap resamples for which the Proportion1 variable is small.
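The negative dependence that the linked graphics reveal follows directly from the sum-to-one constraint. A small simulation (Dirichlet draws are an illustrative stand-in for bootstrap PVE vectors; they are not the movie data) shows the negative correlation between the first two proportions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Each row is a 5-vector of proportions that sums to 1
s = rng.dirichlet(alpha=[20, 10, 6, 4, 2], size=2000)

# Because each row sums to 1, a large first proportion forces the
# remaining proportions to be smaller on average, so the sample
# correlation between the first two columns is negative.
r = np.corrcoef(s[:, 0], s[:, 1])[0, 1]
```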

Figure 14.15. Panel of Histograms of Bootstrap Distributions

14.6 References

[bibch14_01] Cassell, D. L. (2007), "Don't Be Loopy: Re-Sampling and Simulation the SAS Way," in Proceedings of the SAS Global Forum 2007 Conference, Cary, NC: SAS Institute Inc. Available at http://www2.sas.com/proceedings/forum2007/183-2007.pdf.

[bibch14_02] Cassell, D. L. (2010), "BootstrapMania!: Re-Sampling the SAS Way," in Proceedings of the SAS Global Forum 2010 Conference, Cary, NC: SAS Institute Inc. Available at http://support.sas.com/resources/papers/proceedings10/268-2010.pdf.

[bibch14_03] Davison, A. C., and Hinkley, D. V. (1997), Bootstrap Methods and Their Application, Cambridge, UK: Cambridge University Press.

[bibch14_04] Efron, B., and Tibshirani, R. (1993), An Introduction to the Bootstrap, New York: Chapman & Hall.

[bibch14_05] Wicklin, R. (2008), "SAS Stat Studio: A Programming Environment for High-End Data Analysts," in Proceedings of the SAS Global Forum 2008 Conference, Cary, NC: SAS Institute Inc. Available at http://www2.sas.com/proceedings/forum2008/362-2008.pdf.
