Chapter 6
Strategies for Efficient and Effective Simulation
Contents
6.1 Overview of Simulation Strategies
6.2 The Design of a Simulation Study
6.3 The Effect of the Number of Samples
6.4 Writing Efficient Simulations
6.4.1 The Basic Structure of Efficient Simulations in SAS
6.4.2 How to Suppress ODS Output and Graphics
6.4.3 How to Suppress Notes to the SAS Log
6.4.4 Caution: Avoid Macro Loops
6.4.5 When Macros Are Useful
6.4.6 Profiling a SAS/IML Simulation
6.4.7 Tips for Shortening Simulation Times
6.5 Disadvantages of Simulation
6.6 References
6.1 Overview of Simulation Strategies
An efficient simulation is one that optimizes computational resources so that it finishes in a reasonable amount of time. An effective simulation is one that produces a desired statistical result. Efficiency is achieved through good programming practices and is a primary theme of this book. Effectiveness is achieved through a solid understanding of statistical concepts and methodology.
The design of a simulation study affects its efficiency and effectiveness. One design choice is the number of samples to use in a simulation. If you use too few samples, then the program runs quickly, but uncertainty in the estimates renders the simulation useless. If you use too many samples, then the standard errors of the estimates are tiny, but the program requires hours or days to finish. Clearly, the best choice lies between these two extremes.
This chapter discusses issues that are important for designing efficient and effective simulation studies. The ideas in this chapter are applied throughout this book, but this chapter collects the various strategies in one place. This chapter also describes details of implementing simulations in SAS software, and points out why some implementations are more efficient than others.
6.2 The Design of a Simulation Study
Gentle (2009, Appendix A) discusses the fact that simulation studies are computer experiments. As such, Gentle argues, the principles of statistical design can be used to improve the efficiency, reproducibility, and documentation of a simulation study.
Gentle outlines a typical simulation study. Suppose that you want to study the power of ordinary least squares regression to test the hypothesis H0: β1 = 0 for the simple regression model Yi = β0 + β1Xi + Ei, where each response value, Yi, is modeled as a linear function of an explanatory variable value, Xi, plus a random variable, Ei, which represents an error term.
This problem contains many experimental factors that might or might not be important in computing the power. Among these factors are the following:
If you use a full factorial design with, for example, 16 values for β1, then there are 16×3×5×3 = 720 experiments that you need to run. Each experiment is a simulation study, which consists of generating thousands of random samples and running a regression analysis on each sample. It is tempting to feel overwhelmed by the magnitude of the study and by the challenge of summarizing the results. How should you proceed?
The best advice is to start small and start simply. For this experiment, you might want to choose the smallest sample size (20) and the simplest choice for the design of x (equally spaced) and for the distribution of the error term (normally distributed). You should also use a smaller number of samples than the number that you intend to use for the final study. One-tenth the number is often appropriate during the development, debugging, profiling, and optimization of the simulation algorithm.
When you run the small-scale simulation, consider whether you can present the results graphically. For example, Figure 6.1 presents the results as a curve that shows the power as a function of the parameter β1. Standard errors for the estimates are displayed. Section 11.4 shows how to create graphs like Figure 6.1.
You can learn a lot from small-scale simulations, such as the values of β1 for which the power curve is steep or flat, and the magnitude of the standard error for the estimates along the curve. Use this preliminary analysis to plan the final simulation. For example, use the preliminary analysis to choose an appropriate range for β1. For small samples, you need a large range for β1; for large samples, you need a much smaller range. Because the preliminary analysis estimates the standard error along the curve, you might decide to generate more samples when the variance of the estimator is relatively large (approximately 1 ≤ β1 ≤ 2 in Figure 6.1) and use fewer samples when the variance is smaller.
You can also use the preliminary analysis to estimate the run time for the final simulation. If you used 1,000 samples for each simulation in the preliminary study, then a simulation that uses 10,000 samples will take about 10 times as long.
In a similar way, you can run preliminary analyses for some of the other parameter values and scenarios, such as for a contaminated normal error term and for different design points for x. For each scenario, you should save the results (the points and upper/lower limits of the power curve) so that, for example, you can combine several curves on a single graph. By the time you run your final simulation study, you should have a good idea what the results will be.
In summary, when you begin a large simulation study, resist the urge to immediately write and run the entire simulation. Start small. Debug and optimize the simple cases. Run a small-scale simulation. These techniques provide important information that you can incorporate into the final study.
6.3 The Effect of the Number of Samples
There are two “sizes” in a simulation study, and it is important to not confuse them. One is the sample size; the other is the number of samples, which is also known as the number of repetitions. As demonstrated in Section 4.4.5, the sample size, N, affects the width of the sampling distribution (that is, the standard error of the statistic). For most well-behaved statistics that are used in practice, small sample sizes correspond to large standard errors, whereas large sample sizes result in smaller standard errors. The value of N is often controlled in a simulation study. It might be set to the sample size of an observed data set, or it might be systematically varied, such as in the power study that is described in Section 6.2.
In contrast, the number of samples in the simulation is something that you have to choose every time that you run a simulation. The number of samples (which in this book is determined by the NumSamples
macro variable) determines how well the approximate sampling distribution (ASD) of a statistic approximates the exact sampling distribution.
How many samples are sufficient? Unfortunately, that is a difficult question to answer. The number of samples that you need depends on characteristics of the sampling distribution. Statistics that depend on lower order moments of the sampling distribution (such as the mean) require fewer samples than statistics that are functions of higher order moments, such as the variance and skewness. You might need many, many samples to capture the extreme tail behavior of a sampling distribution.
A popular choice in research studies is 10,000 or more samples. However, do not be a slave to any particular number. The best approach is to understand what you are trying to estimate and to report not only point estimates but also standard errors and/or confidence intervals.
If you are only interested in the Monte Carlo estimate of the mean, then you can often use a relatively small number of samples. You can quantify this statement by looking at the Monte Carlo standard error, which is a measure of how close the mean of the sampling distribution is to the unknown population parameter. The Monte Carlo standard error is of the form σ/√m, where σ is the standard deviation of the statistic and m is the number of samples that you generate (Ross 2006). This implies that to halve the Monte Carlo standard error, you need to quadruple the number of samples in the simulation.
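As a minimal sketch of this idea (the seed, the sample size, and the number of samples are arbitrary choices for illustration), the following SAS/IML program simulates m samples, computes the mean of each, and reports the Monte Carlo estimate along with its Monte Carlo standard error:

proc iml;
call randseed(4321);
m = 10000;                        /* number of samples            */
N = 50;                           /* size of each sample          */
x = j(m, N);
call randgen(x, "Normal");        /* each row is one sample       */
means = x[ ,:];                   /* mean of each sample          */
MCEst = mean(means);              /* Monte Carlo estimate of mean */
MCSE = std(means) / sqrt(m);      /* Monte Carlo standard error   */
print MCEst MCSE;
quit;

If you rerun this sketch with m = 40000, the printed MCSE should be roughly half as large, which illustrates the quadrupling rule.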
Increasing the number of samples is one technique to reduce the Monte Carlo standard error, but other methods also exist. These so-called variance reduction techniques are described in Ross (2006, Ch. 8), Ripley (1987, Ch. 5), and Jones, Maillardet, and Robinson (2009, Ch. 20). There are two difficulties with using these techniques. First, they sometimes require considerable ingenuity because you need to construct certain auxiliary random variables with special properties. Second, they make programming the simulation more complicated because there is more “bookkeeping” in order to keep track of the auxiliary variables. In other words, these techniques enable you to use fewer simulations to obtain better estimates, but at the cost of a more complicated program.
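To illustrate the flavor of these techniques, the following sketch uses antithetic variates, one of the methods described in Ross (2006). The extra "bookkeeping" consists of pairing each uniform variate u with its antithetic partner 1 − u and averaging the two results; this particular integrand and the seed are illustrative assumptions, not an example from the references:

proc iml;
call randseed(54321);
m = 5000;                          /* number of antithetic pairs     */
u = j(m, 1);
call randgen(u, "Uniform");
y = ( exp(u) + exp(1-u) ) / 2;     /* average each antithetic pair   */
est = mean(y);                     /* estimate of E[exp(U)] = e - 1  */
MCSE = std(y) / sqrt(m);           /* smaller than for 2m plain draws */
print est MCSE;
quit;

Because exp is monotone, the paired estimates are negatively correlated, so the estimate of E[exp(U)] ≈ 1.718 has a smaller Monte Carlo standard error than a plain simulation that uses 2m uniform variates.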
6.4 Writing Efficient Simulations
The previous sections described several efficient simulation techniques. This section summarizes those techniques and describes additional techniques for efficiency.
6.4.1 The Basic Structure of Efficient Simulations in SAS
Recall that there are two basic techniques for simulating and analyzing data: the BY-group technique and the in-memory technique.
When you use the BY-group technique, do the following. Simulate all of the samples in a single data set, and identify each sample by a SampleID variable. Use a BY statement to compute statistics for each BY group.

When you use the SAS/IML in-memory technique, do the following. Generate the samples as rows (or columns) of a matrix. Use matrix operations to compute the statistics for all samples at once. For example, the expression x[,:] computes the means of all rows in the x matrix.

6.4.2 How to Suppress ODS Output and Graphics
This section assumes that you are familiar with the Output Delivery System (ODS), as described in Section 3.5. ODS enables you to select, exclude, and output tables and graphics.
When you use a SAS procedure to compute statistics for each BY group, you should create an output data set that contains the statistics, as described in Section 4.4.1. Because you are writing the statistics to a data set, you do not need to display any output from the procedure. There are two ways to suppress procedure output: by using a NOPRINT option and by using the ODS EXCLUDE ALL statement and related ODS statements.
About 50 SAS/STAT procedures support the NOPRINT option in the PROC statement. When you specify the NOPRINT option, ODS is temporarily disabled while the procedure runs. This prevents SAS from displaying tables and graphs that would otherwise be produced for each BY group. For a simulation that computes statistics for thousands of BY groups, suppressing the display of tables results in a substantial savings of time. (But, the SAS/STAT User's Guide states, “However, there are a few procedures that for historical reasons still might produce some output even when NOPRINT is specified.”)
The NOPRINT option is ideal for writing statistics to a data set by using procedure syntax such as the OUTPUT statement and other “OUT” options, such as the OUTP= option in PROC CORR, the OUT= option in PROC FREQ, the OUTEST= option in PROC REG, and the OUTSTAT= option in PROC GLM.
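For example, a sketch that combines the NOPRINT option with the OUTPUT statement might look like the following. (It reuses the SimNormal data set and SampleID variable from Section 4.4.4; the OutStats data set name is arbitrary.)

proc means data=SimNormal noprint;       /* no tables are displayed    */
   by SampleID;                          /* one statistic per sample   */
   var x;
   output out=OutStats mean=SampleMean;  /* write statistics to a data set */
run;

The OutStats data set contains one row per BY group, which is exactly the approximate sampling distribution of the mean.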
However, sometimes the statistic of interest is available only in an ODS table. In these cases, you cannot use the NOPRINT option because it suppresses all ODS tables, including the one that contains the statistic of interest. In these cases, use ODS to prevent tables and graphics from displaying on your computer monitor, but create an output data set by using the ODS OUTPUT statement.
For example, you can use the following technique to turn off ODS output. Prior to calling the procedure, execute the following statements:
/* suppress output to ODS destinations */
ods graphics off;
ods exclude all;
ods noresults;
The first statement turns off ODS graphics. Technically, you only need to use this statement prior to calling a procedure (such as the REG procedure) that produces ODS graphics by default, but there is no harm in unconditionally turning off ODS graphics. The second statement excludes all tables from open destinations such as HTML or LISTING. The third statement prevents ODS from making entries in the ODS Results window, which is shown in Figure 6.2.
You can now run a SAS procedure without seeing any output displayed, and you can use the ODS OUTPUT statement to save a table of statistics to an output data set. After you compute the statistics for each BY group, re-enable ODS output by using the following statements:
ods graphics on;
ods exclude none;
ods results;
Because these sequences of commands are used so frequently in simulation studies, it is convenient to package them into SAS macros:
%macro ODSOff; /* Call prior to BY-group processing */
ods graphics off;
ods exclude all;
ods noresults;
%mend;
%macro ODSOn; /* Call after BY-group processing */
ods graphics on;
ods exclude none;
ods results;
%mend;
With these definitions, you can easily disable ODS output temporarily while computing the statistics for each BY group. For example, the following statements write descriptive statistics to a data set named Desc but suppress the display of tables and graphs. The SimNormal data set was created in Section 4.4.4.
%ODSOff
proc means data=SimNormal;
by SampleID;
var x;
ods output Summary=Desc;
run;
%ODSOn
Sometimes it is convenient to also suppress SAS notes during a simulation. This is covered in the next section.
6.4.3 How to Suppress Notes to the SAS Log
Some SAS procedures write a note to the SAS log as part of their normal operation. For example, procedures that use maximum likelihood estimation write a note for each BY group that reports whether the numerical optimization succeeded for that BY group, as shown in the following example:
NOTE: Convergence criterion (GCONV=1E-8) satisfied.
NOTE: The above message was for the following BY group:
SampleID=3
Not only do these messages take time to print, but they can also fill up the SAS log, which (when you are running SAS software interactively) results in a dialog box that gives you the option to clear the log. It is therefore desirable to deal with these messages in one of two ways:
The first way is to suppress all notes by using the NONOTES system option before you run the simulation:

options nonotes;
After the procedure runs, you can re-enable notes:
options notes;
The second way is to redirect the log to a file by using the PRINTTO procedure:

proc printto log='name-of-log-file' new;
run;
After the simulation completes, you can check the log file, reset the log, and restore the output destinations to their default values:
proc printto; /* no options ==> restore defaults */
run;
6.4.4 Caution: Avoid Macro Loops
Many SAS users attempt to run a simulation by using a macro loop instead of using the template presented in Section 4.4.1. Do not do this. Novikov (2003) compares times for the two methods and concludes (for his application) that the macro-loop technique is 80–100 times slower than the BY-group technique.
To be concrete, the following program is an example of macro code that computes the same quantities as in Section 4.4.2. You should avoid writing programs like this:
/*************************************/
/* DO NOT USE THIS CODE: INEFFICIENT */
/*************************************/
%macro Simulate(N, NumSamples);
options nonotes; /* turn off notes to log */
proc datasets nolist;
delete OutStats; /* delete data if it exists */
run;
%do i = 1 %to &NumSamples;
data Temp; /* create one sample */
call streaminit(0);
do i = 1 to &N;
x = rand("Uniform");
output;
end;
run;
proc means data=Temp noprint; /* compute one statistic */
var x;
output out=Out mean=SampleMean;
run;
proc append base=OutStats data=Out; /* accumulate stats */
run;
%end;
options notes;
%mend;
/* call macro to simulate data and compute ASD */
%Simulate(10, 100) /* means of 100 samples of size 10 */
This approach suffers from a low ratio of work to overhead. The DATA step and the MEANS procedure are called 100 times, but they generate or analyze only 10 observations in each call. This is inefficient because every time that SAS encounters a procedure call, it must parse the SAS code, open the data set, load data into memory, do the computation, close the data set, and exit the procedure. When a procedure computes complicated statistics on a large data set, these “overhead” costs are small relative to the computation performed by the procedure. However, for this example, the overhead costs are large relative to the computational work. For an example of a hybrid approach that uses macro and BY-group processing together, see the next section.
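For comparison, the following sketch shows the efficient BY-group equivalent of the %Simulate macro, along the lines of the template in Section 4.4.1: one DATA step generates all the samples, and one procedure call analyzes them.

%let N = 10;
%let NumSamples = 100;
data Sim;                              /* all samples in one data set */
call streaminit(123);
do SampleID = 1 to &NumSamples;
   do i = 1 to &N;
      x = rand("Uniform");
      output;
   end;
end;
run;

proc means data=Sim noprint;
   by SampleID;                        /* one call computes all the means */
   var x;
   output out=OutStats mean=SampleMean;
run;

Notice that this version can use a fixed seed because the samples are distinguished by the SampleID variable rather than by separate DATA step invocations.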
The macro approach also makes it difficult to reproduce the simulation. Notice that zero is used as the seed value, which means that the random number stream is initialized by using the system clock. In this example, you cannot call the STREAMINIT subroutine with a nonzero seed because then every sample data set would contain exactly the same data. An alternative to using a zero seed is to use the macro variable &i as the seed value.
Caution: If you do not turn off the NOTES option, then the performance of the %Simulate macro will be even worse. If the number of simulation loops is sufficiently large, then you might fill the SAS log with irrelevant text.
Exercise 6.1: Compare the run time for the %Simulate macro with the run time for the BY-group technique in Section 4.4.2. Because the NONOTES option is used, you might want to use PROC PRINT to print something when the macro completes.
6.4.5 When Macros Are Useful
The point of the previous section is that SAS simulations run faster when you use a small number of DATA steps and procedure calls to generate and analyze a large amount of data. The BY-group approach is efficient because a single call to a SAS procedure results in computing thousands of statistics.
This does not imply that macros are “bad.” After all, a macro merely generates SAS code. However, if you use macros naively, then you might generate inefficient SAS code.
For huge simulation studies with a large number of design parameters, macros can be useful. Provided that a macro generates an efficient SAS program that uses BY-groups to read and process a lot of data, it is perfectly acceptable to encapsulate your simulation into a macro. In fact, for huge simulation studies, it makes sense to run the simulation as a series of smaller sub-studies. For example, for the simulation study in Section 6.2, you might write a macro that takes parameters for the sample size, the distribution of the error term, and the design of the independent variable. You could then call the macro 3 × 5 × 3 times. Each call creates results that are sufficient to create a graph similar to Figure 6.1.
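A sketch of such a macro might look like the following. The macro name %RunStudy, its parameters, and the use of a single distribution name for the error term are hypothetical simplifications for illustration; the key point is that the macro body uses the BY-group template, not a macro loop.

%macro RunStudy(N=, ErrDist=Normal);   /* hypothetical study macro */
data Sim;
   call streaminit(12345);
   do SampleID = 1 to &NumSamples;
      do i = 1 to &N;
         x = rand("&ErrDist");         /* error distribution varies by scenario */
         output;
      end;
   end;
run;
proc means data=Sim noprint;
   by SampleID;
   var x;
   output out=Stats_&ErrDist._N&N mean=SampleMean;   /* save results per scenario */
run;
%mend;

%let NumSamples = 1000;
%RunStudy(N=20, ErrDist=Normal)
%RunStudy(N=20, ErrDist=Exponential)

Each call is itself an efficient simulation, and the saved per-scenario data sets can later be combined into summary graphs.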
For huge simulation studies, encapsulating the program into a macro can provide several benefits:
6.4.6 Profiling a SAS/IML Simulation
Some simulations take a long time to run. Consequently, it is essential to know how to profile a program, which means estimating the performance of various portions of the program. Wicklin (2010, p. 371) defines the performance of a program as “the time required for it to run on typical data, and also how that time changes with the size or characteristics of the data.”
Suppose that part of your simulation involves finding the eigenvalues of a large matrix, and that the size of the matrix varies with the number of variables in the simulation. The following SAS/IML program generates random symmetric n × n matrices for various values of n, and times how long it takes to compute the eigenvalues:
proc iml;
size = do(500, 2000, 250);        /* 500, 750, ..., 2000 */
time = j(1, ncol(size));          /* allocate vector for results */
call randseed(12345);
do i = 1 to ncol(size);
   n = size[i];
   r = j(n*(n+1)/2, 1);           /* generate lower triangular elements */
   call randgen(r, "Uniform");
   A = sqrvech(r);                /* create symmetric matrix */
   t0 = time();
   evals = eigval(A);
   time[i] = time() - t0;         /* elapsed time for computation */
end;
The TIME function returns the time of day as the number of seconds after midnight. The TIME function is called once prior to the eigenvalue computation and then again after the eigenvalue computation. The difference between those times is the elapsed time in seconds.
You can write the elapsed times to a SAS data set and use PROC SGPLOT to visualize the performance as a function of the size of the matrix. See Figure 6.3.
create eigen var {"Size" "Time"};  append;  close;
quit;

proc sgplot data=eigen;
title "Performance of Eigenvalue Computation";
series x=Size y=Time / markers;
yaxis grid label="Time to Compute Eigenvalues (s)";
xaxis grid label="Size of Matrix";
run;
Figure 6.3 shows that the computation time increases nonlinearly with the size of the matrix. The time required for a 500 × 500 matrix is a fraction of a second. For a 2000 × 2000 matrix, the computation requires about eight seconds.
You can use this technique to time various components of your simulation. You can then focus on optimizing the program statements that are consuming the most time.
6.4.7 Tips for Shortening Simulation Times
This section collects some general suggestions for making your simulation run faster and more efficiently.
The following tips apply to SAS/IML programs:
In addition, there are some general performance tips that apply to simulation studies that involve parameters. For example, Section 5.4.2 uses a grid of uniformly spaced values to estimate the power of the t test. Similarly, Section 16.10 uses an evenly spaced grid of parameters to explore how skewness and kurtosis affect the coverage probability of a confidence interval. In studies such as these, a simulation is run for every parameter value, which means that in a straightforward implementation the total time is directly proportional to number of parameter values. Consequently, keep in mind the following tips:
It is a good idea to figure out how long it takes to generate a billion random normal variates on your computer. You can use this length of time as a benchmark for running future simulations.
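A sketch of such a benchmark is shown below. It is scaled down to 100 million variates so that it runs quickly; increase the loop bound to 1E9 for the full benchmark. The seed is arbitrary.

data _null_;                      /* generate without writing a data set */
call streaminit(1);
do i = 1 to 1E8;                  /* scale up to 1E9 for a billion variates */
   x = rand("Normal");
end;
run;

The NOTE in the SAS log reports the real and CPU time that the DATA step used.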
Exercise 6.2: Run a DATA step that generates one million samples, each containing 1,000 random normal observations. Check the SAS log to determine how long it takes to create that simulated data.
Exercise 6.3: Run a SAS/IML program that generates a vector of 1,000 random normal observations. Use the TIME function (see Section 6.4.6) to time how long it takes to fill the vector one million times within a DO loop. (You probably cannot generate all of these values in a single matrix. A billion doubles requires 8 GB of RAM, and SAS/IML 9.3 cannot allocate matrices larger than 2 GB.)
6.5 Disadvantages of Simulation
Simulation is a powerful technique, but it has limitations, which include difficulty in generalizing the results, difficulty in organizing the results, and difficulty in applying the results to real data.
A common criticism of simulation studies is that they cannot be generalized beyond the specific population models and the parameter values in the study. Furthermore, real data are rarely distributed according to any “named” distribution, so how well do the results generalize to real data?
As an example, suppose that you use simulation to reveal how deviations from normality affect the performance of a statistical test. In your study, you might simulate data from a t distribution and a lognormal distribution as two instances of nonnormal distributions. If the test performs well on the simulated nonnormal data, then you might want to conclude that the statistic is robust to departures from normality.
However, can you justify generalizing this result to other nonnormal distributions? Probably not. Do your conclusions hold for an exponential distribution or, more generally, for gamma distributions? Simulations provide evidence that something is true, but you need to be careful not to extrapolate beyond the cases that were simulated.
Because of the previous shortcoming, you might be tempted to run a huge simulation study that incorporates a wide range of nonnormal distributions, such as the exponential, gamma, and beta families. You might feel a need to run a simulation for every nonnormal distribution under the sun.
If you were to give in to these temptations, then how would you organize the results? Would you create pages of tables that list results for every distribution and every parameter value? Some researchers do. However, the fact that the study is not well designed makes it difficult to present the results in an organized manner.
Before you start simulating data, think about what you are trying to demonstrate. Perhaps “show this statistic works well for nonnormal data” is too nebulous a goal. Perhaps a more attainable goal is to study how the test behaves for populations with a wide range of skewness and kurtosis.
With this more modest goal, you can design a better simulation study. As shown in Chapter 16, you can sample from distributions whose skewness and kurtosis values lie on a regularly spaced grid (see Figure 16.16). With this design, you can display the results in an organized graph.
In spite of its limitations, simulation remains a powerful technique that belongs in the toolbox of every statistical programmer. Data simulation enables you to compare two or more statistical techniques by studying their performance on common data with known properties. Most importantly, data simulation enables you to understand the sampling distribution of statistics without resorting to asymptotic theory or overly restrictive assumptions about the distribution of data.
6.6 References
Gentle, J. E. (2009), Computational Statistics, New York: Springer-Verlag.
Jones, O., Maillardet, R., and Robinson, A. (2009), Introduction to Scientific Programming and Simulation Using R, Boca Raton, FL: Chapman & Hall/CRC.
Novikov, I. (2003), “A Remark on Efficient Simulations in SAS,” Journal of the Royal Statistical Society, Series D, 52, 83–86. URL http://www.jstor.org/stable/4128171
Novikov, I. and Oberman, B. (2007), “Optimization of Large Simulations Using Statistical Software,” Computational Statistics and Data Analysis, 51, 2747–2752.
Ripley, B. D. (1987), Stochastic Simulation, New York: John Wiley & Sons.
Ross, S. M. (2006), Simulation, 4th Edition, Orlando, FL: Academic Press.
Wicklin, R. (2010), Statistical Programming with SAS/IML Software, Cary, NC: SAS Institute Inc.