Chapter 1

Introduction to Simulation

1.1 Overview of Simulation of Data

There are many kinds of simulation. Climate scientists use simulation to model the interactions between the earth's atmosphere, oceans, and land. Astrophysicists use simulation to model the evolution of galaxies. Biologists use simulation to model the spread of epidemics and the effects of vaccination programs. Engineers use simulation to study the safety and fuel efficiency of automobile and airplane designs. In these simulations of physical systems, scientists model reality and use a computer to study the model under various conditions.

Statisticians also build models. For example, a simple model of human height might assume that height is normally distributed in the population. This is a useful model, but it turns out that human heights are not actually normally distributed (Schilling, Watkins, and Watkins 2002). Even if you restrict the data to a single gender, there are more very tall and very short people than would be expected from a normal distribution of heights.

If a set of data is only approximately normal, what does that mean for statistical tests that assume normality? If you compute a t test to compare the means of two groups—a test that assumes that the two underlying populations are normally distributed—how sensitive is your conclusion to the actual population distribution? If the populations are slightly nonnormal, does that invalidate the t test? Or are the results fairly robust to deviations from normality?

One way to answer these questions is to simulate data from nonnormal populations. If you construct a distributional model, then you can generate random samples from the model and examine how the t test performs on the simulated data. Simulation gives you complete control over the characteristics of the population model from which the (simulated) data are drawn.
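As a rough illustration of this idea, the following sketch draws many pairs of samples from the same skewed (exponential) population and records how often the pooled t test rejects the true null hypothesis of equal means. The sample size, the number of samples, the seed, and the data set names are assumptions chosen for illustration; they are not taken from the programs in this book.

```sas
%let N = 20;               /* size of each group (assumed value)        */
%let NumSamples = 1000;    /* number of simulated samples (assumed)     */

/* Simulate two groups per sample from the same exponential population */
data SimExpo;
   call streaminit(12345);
   do SampleID = 1 to &NumSamples;
      do Group = 1 to 2;
         do i = 1 to &N;
            x = rand("Exponential");   /* skewed population with mean 1 */
            output;
         end;
      end;
   end;
   drop i;
run;

/* Run the t test on every sample; suppress displayed output and graphs */
ods graphics off;
ods exclude all;
proc ttest data=SimExpo;
   by SampleID;
   class Group;
   var x;
   ods output TTests=TTestResults;
run;
ods exclude none;

/* Indicator: did the pooled test reject H0 at the 0.05 level? */
data Reject;
   set TTestResults(where=(Method="Pooled"));
   RejectH0 = (Probt < 0.05);
run;

/* The mean of the indicator estimates the Type I error rate */
proc means data=Reject mean;
   var RejectH0;
run;
```

If the t test is robust to this kind of nonnormality, the estimated rejection rate should be close to the nominal 0.05 level. Changing the call to the RAND function changes the population model.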

Simulating data is also useful for comparing two different statistical techniques. Perhaps Technique A performs better on skewed data than Technique B. Perhaps Technique B is more robust to the presence of outliers. To a practicing statistician, this kind of information is quite valuable. As Gentle (2009, p. xi) says, “Learning to simulate data with given characteristics means that one understands those characteristics. Applying statistical methods to simulated data…helps us better to understand those methods and the principles underlying them.”

This book is about simulating data in SAS software. This book demonstrates how to generate observations from populations that have specified statistical characteristics. In this book, the phrases “simulating data,” “generating a random sample,” and “sampling from a distribution” are used interchangeably.

A large portion of this book is about learning how to construct statistical models (distributions) that have certain statistical properties. Skewed distributions, fat-tailed distributions, bimodal distributions—these are a few examples of models that you can construct by using the techniques in this book. This book also presents techniques for generating data from correlated multivariate distributions. Each technique is accompanied by a SAS program.

Although this book uses statistics, its audience is not limited to statisticians. It is a how-to book for statistical programmers who use SAS software and who want to simulate data efficiently.

In short, this book describes how to write SAS programs that simulate data with a wide range of characteristics, and describes how to use that data to understand the performance and applicability of statistical techniques.

 

1.2 The Goal of This Book

The goal of this book is to provide tips, techniques, and examples for efficiently simulating data in SAS software.

Data simulation is a fundamental technique in statistical programming and research. To evaluate statistical methods, you often need to create data with known properties, both random and nonrandom. This book contains more than one hundred annotated programs that simulate data with specified characteristics. You can use simulated data to estimate the probability of an event, to estimate the sampling distribution of a statistic, to estimate the coverage probabilities of confidence intervals, and to evaluate the robustness of a statistical test.
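For example, a minimal SAS/IML sketch of two of these tasks might look like the following. It approximates the sampling distribution of the mean of 10 uniform variates and estimates the probability that the sample mean exceeds 0.6. The sample size, the number of samples, the threshold, and the variable names are assumed values for illustration only.

```sas
proc iml;
call randseed(54321);
N = 10;                      /* sample size (assumed value)           */
NumSamples = 10000;          /* number of samples (assumed value)     */

x = j(N, NumSamples);        /* each column holds one sample          */
call randgen(x, "Uniform");  /* fill the matrix with U(0,1) variates  */

m = x[:,];                   /* row vector of the column means        */
MeanOfMeans = m[:];          /* center of the sampling distribution   */
StdErr = sqrt(var(m`));      /* spread of the sampling distribution   */
ProbEst = (m > 0.6)[:];      /* proportion of sample means above 0.6  */
print MeanOfMeans StdErr ProbEst;
quit;
```

The proportion of simulated samples in which an event occurs is the Monte Carlo estimate of the probability of that event.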

Some programs are presented in two forms, first by using the DATA step and then again by using the SAS/IML language. Presenting the same algorithm in two different ways helps the novice SAS/IML programmer learn how to write simulations in the SAS/IML language. Later chapters that discuss multivariate simulation use the SAS/IML language heavily. If you are serious about simulation, you should invest the time to learn how to use the SAS/IML language efficiently.

Although this book covers many standard examples of data distributions, there are many other examples that are not covered. However, many techniques that are described in this book are generally applicable. For example, Section 7.5 describes how to generate random samples from a mixture of normal distributions. The same technique can be used to simulate data from a mixture of other distributions.
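A minimal DATA step sketch of the mixture idea follows. It is not the program from Section 7.5; the mixing probability, the component parameters, and the data set name are assumptions chosen for illustration.

```sas
/* Two-component mixture: 70% N(0,1) and 30% N(4, 0.5) (assumed values) */
data Mixture;
   call streaminit(4321);
   do i = 1 to 1000;
      isFirst = rand("Bernoulli", 0.7);        /* choose a component     */
      if isFirst then x = rand("Normal", 0, 1);
      else            x = rand("Normal", 4, 0.5);
      output;
   end;
run;

proc univariate data=Mixture;
   var x;
   histogram x;
run;
```

To simulate from a mixture of other distributions, replace the two calls that generate the normal variates with calls for the desired component distributions.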

The book also includes more than 100 exercises. Many exercises extend the results of a section to other distributions or to related problems. The exercises provide practical programming problems that encourage you to master the material before moving on to the next section. Most exercises will take 5 to 15 minutes of programming. As Gentle (2009, p. xiii) says, “Programming is the best way to learn programming.” Solutions to selected exercises are available from the book's Web site. See support.sas.com/wicklin.

 

1.3 Who Should Read This Book?

The audience for this book is statisticians, analysts, programmers, graduate students, and researchers who use SAS to analyze and model data.

This book assumes that you are familiar with SAS programming concepts such as missing values, formats, and the SAS DATA step. This book uses simple macro statements such as the %LET statement, but does not include sophisticated macro techniques.

This book presumes familiarity with basic statistical ideas that you might encounter when you use the FREQ, UNIVARIATE, CORR, TTEST, and GLM procedures. This book discusses random variables, quantiles, distributions, and regression. The chapters of the book that deal with multivariate distributions presume familiarity with concepts of computational linear algebra, such as matrix decompositions. There are also sections of the book that discuss various topics in regression analysis and distributional modeling.

 

1.4 The SAS/IML Language

The DATA step is sufficient for simulating data from simple univariate distributions. However, for simulating from more complicated distributions, SAS/IML software is essential.

IML stands for “interactive matrix language.” The SAS/IML language enables you to implement custom algorithms that use vectors and matrices. The SAS/IML language contains hundreds of built-in functions and subroutines, and you can also call hundreds of functions in Base SAS software.

To learn how to write efficient programs in the SAS/IML language, see Wicklin (2010). The Web site for that book (support.sas.com/wicklin) has a “Getting Started” chapter that is available as a free Web download. Furthermore, the SAS/IML language is used and discussed frequently on the author's blog, The DO Loop, which can be read at blogs.sas.com/content/iml. Statistical simulation is a frequent topic on the blog.

This book explains SAS/IML statements that are not obvious. Readers who have previous matrix programming experience with MATLAB or the R language should be able to read the SAS/IML programs in this book. The three languages are similar in syntax.

You need a license for the SAS/IML product in order to run the IML procedure. To see whether PROC IML is licensed at your site, submit the following program:

proc product_status;
run;

If SAS/IML is listed in the SAS log, then SAS/IML software is licensed for your site.

To see if PROC IML is installed at your site, submit the following program:

proc iml;
quit;

If the SAS log contains the message ERROR: Procedure IML not found, then SAS/IML software is not installed. If the SAS log contains the message IML Ready, then SAS/IML is installed.

If SAS/IML software is licensed but not installed, ask your SAS administrator to install it.
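If you prefer to check the license from within a program, one possible approach is the SYSPROD function in the DATA step, which returns 1 when the named product is licensed. The following is a sketch; the text that is written to the log is an assumption chosen for illustration.

```sas
data _null_;
   isLicensed = sysprod("IML");    /* 1 = licensed at this site */
   if isLicensed = 1 then
      put "NOTE: SAS/IML is licensed.";
   else
      put "NOTE: SAS/IML does not appear to be licensed.";
run;
```

This check reports only on licensing. A licensed product might still need to be installed.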

 

1.5 Comparing the DATA Step and SAS/IML Language

Many data analysts use the DATA step as their primary programming tool. The DATA step is adequate for simulating univariate data from many distributions and for simulating uncorrelated multivariate data. However, simulating multivariate correlated data (and even data from complicated univariate distributions) is much easier if you use matrix-vector computations.
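As a brief illustration of why matrix-vector computations help, the following SAS/IML sketch simulates correlated bivariate normal data by calling the RANDNORMAL function. The mean vector, the covariance matrix, the sample size, and the seed are assumed values for illustration.

```sas
proc iml;
call randseed(2718);
mu = {0 10};                        /* mean vector (assumed values)   */
Sigma = {1.0 0.6,
         0.6 2.0};                  /* covariance matrix (assumed)    */
X = RandNormal(1000, mu, Sigma);    /* 1000 x 2 matrix of variates    */

SampleMean = mean(X);               /* should be close to mu          */
SampleCov  = cov(X);                /* should be close to Sigma       */
print SampleMean, SampleCov;
quit;
```

The whole sample is generated by one function call that takes the mean vector and the covariance matrix as arguments.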

Conceptually, there are two main differences between the DATA step and a SAS/IML program. First, a DATA step implicitly loops over all observations; a typical SAS/IML program does not. Second, the fundamental unit in the DATA step is an observation; the fundamental unit in the SAS/IML language is a matrix.
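The following sketch contrasts the two approaches by converting the Height variable in the Sashelp.Class data set from inches to centimeters. The output data set name and the new variable names are assumptions chosen for illustration.

```sas
/* DATA step: the SET statement implicitly loops over the observations */
data ClassCm;
   set Sashelp.Class;              /* one observation per iteration    */
   HeightCm = 2.54 * Height;
run;

/* SAS/IML: read the whole column into a vector and operate on it */
proc iml;
use Sashelp.Class;
read all var {"Height"} into h;    /* h is a column vector             */
close Sashelp.Class;
hCm = 2.54 * h;                    /* one statement handles every row  */
print (hCm[1:5])[label="HeightCm (first five)"];
quit;
```

In the DATA step, the assignment statement runs once for each observation. In the SAS/IML program, the assignment statement runs once and operates on the entire vector.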

The syntax of the SAS/IML language has much in common with the DATA step: neither language is case sensitive, variable names can contain up to 32 characters, and statements must end with a semicolon. Furthermore, the syntax for control statements such as the IF-THEN/ELSE statement and the iterative DO statement is similar for both languages. The two languages use the same symbols to test a quantity for equality (=) and inequality (^=), and to compare quantities (for example, <=). The SAS/IML language enables you to call the same mathematical functions provided in the DATA step, such as LOG, EXP, SQRT, CEIL, and FLOOR, except that the SAS/IML versions act on vectors and matrices.
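The following small sketch shows some of the shared syntax side by side. The values and variable names are arbitrary and are used only for illustration.

```sas
/* DATA step: SQRT acts on one value per observation */
data _null_;
   x = 16;
   y = sqrt(x);
   if y <= 4 then put "DATA step: " y=;
   else put "DATA step: y exceeds 4";
run;

/* SAS/IML: the same function and control statements, applied to a vector */
proc iml;
x = {1, 4, 9, 16};
y = sqrt(x);                       /* SQRT acts on every element */
count = 0;
do i = 1 to nrow(x);
   if y[i] ^= 2 then count = count + 1;
end;
print y count;
quit;
```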

SAS/IML software is intended for statistical computing, and writing a simulation in the SAS/IML language is usually more compact than writing the equivalent simulation in the DATA step. Furthermore, because the SAS/IML language keeps data in RAM (whereas the DATA step writes data sets), the SAS/IML language offers excellent performance for simulations in which the simulated data fit in memory and for which the computations can be vectorized.

A computation is vectorized if it consists of a few executable statements, each of which operates on a fairly large quantity of data, usually a matrix or a vector. A program in a matrix-vector language is more efficient when it is vectorized because most of the computations are done in a low-level language such as C. In contrast, a program that is not vectorized requires many calls that transfer small amounts of data between the high-level program interpreter and the low-level computational code. To vectorize a program in a matrix-vector language, take advantage of built-in functions and linear algebra operations. Avoid loops that access individual elements of matrices.
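To make the distinction concrete, the following sketch computes a sum of squares in two ways: with a loop over individual elements and with a single call to the built-in SSQ function. The vector length and the seed are assumed values.

```sas
proc iml;
call randseed(1);
N = 1e5;                     /* length of the vector (assumed value) */
x = j(N, 1);
call randgen(x, "Normal");

/* Not vectorized: a loop that accesses individual elements */
sumSq = 0;
do i = 1 to N;
   sumSq = sumSq + x[i]##2;
end;

/* Vectorized: one statement operates on the whole vector */
sumSqVec = ssq(x);

diff = abs(sumSq - sumSqVec);   /* any difference is floating-point roundoff */
print sumSq sumSqVec diff;
quit;
```

Both computations give the same answer up to roundoff, but the vectorized version replaces N trips through the interpreter with one function call.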

 

1.6 Overview of This Book

This book consists of four parts. The first part introduces essential concepts. It shows you how to use SAS software to simulate data from frequently used distributions, and how to compute useful related quantities. In addition to the current chapter, this part of the book contains the following chapters:

Chapter 2: Simulate univariate samples from common discrete and continuous distributions.

Chapter 3: Compute basic quantities in SAS software that are essential for simulating data.

The second part of the book describes how to use simulated data to examine the sampling distribution of statistics and to evaluate statistical techniques. This part of the book contains the following chapters:

Chapter 4: Use simulated data to estimate the sampling distributions of basic statistics such as the mean, median, and Pearson correlations.

Chapter 5: Use simulation to evaluate statistical techniques. Examples include using simulation to investigate the coverage probability of a confidence interval, to estimate p-values, and to estimate the power of a t test.

Chapter 6: Develop strategies for efficient and effective simulation in SAS software.

The third part of the book describes advanced simulation of univariate and multivariate data. It also describes how to construct covariance matrices that are often needed for simulation studies. This part of the book contains the following chapters:

Chapter 7: Develop advanced techniques in univariate simulation, including sampling from mixture distributions, acceptance-rejection sampling, and inverse CDF sampling.

Chapter 8: Simulate data from basic multivariate distributions.

Chapter 9: Simulate multivariate data with special structure, such as multivariate binary variables, ordinal variables, and data from copulas.

Chapter 10: Simulate correlation and covariance matrices with known properties and structure, such as Toeplitz or AR(1) structure. The chapter shows how to find the nearest correlation matrix to an estimate that is not positive semidefinite.

The fourth part of the book shows how to use simulation in statistical modeling. This part of the book contains the following chapters:

Chapter 11: Simulate data from a variety of basic regression models, such as linear models with continuous and classification variables.

Chapter 12: Simulate data from generalized linear models, mixed models, and models in survival analysis.

Chapter 13: Simulate data from time series.

Chapter 14: Simulate data from spatial models.

Chapter 15: Use bootstrap methods to resample from the data that you want to simulate.

Chapter 16: Use moment matching and moment-ratio diagrams to simulate data that have properties similar to a given set of real data.

This book also includes an appendix that provides additional background and details about programming in the SAS/IML language.

 

1.7 Obtaining the Programs Used in This Book

This book was developed using SAS 9.3M2, which is the second maintenance release of SAS 9.3 and was released in August 2012. This release includes SAS/IML 12.1 software. When a SAS/IML 12.1 feature is used, the book also describes how to obtain the same result by using SAS/IML 9.3 software.

The SAS programs in this book are available as a free download from the book's Web site: support.sas.com/wicklin.

Of particular interest are dozens of simulation algorithms that the author implemented in the SAS/IML language. You can use these functions in your own simulation studies by doing the following:

  • Download the zip file that includes the programs for this book.
  • The zip file includes the file SimulatingData.sas, which contains the SAS/IML functions.
  • Save SimulatingData.sas to a convenient location, such as
    C:\Users\<userid>\Documents\My SAS Files.

Whenever you want to use the SAS/IML functions, submit the following statement prior to calling PROC IML:

%include "C:\Users\<userid>\Documents\My SAS Files\SimulatingData.sas";

The statement defines all of the SAS/IML functions and stores them in a library. To use a function, you can load it by name from within PROC IML. To load all of the functions, run the following SAS/IML statement:

load module=_all_;
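The store-and-load mechanism can be sketched with a small self-contained example. The module MyStdize is hypothetical and is not one of the functions in SimulatingData.sas; it is defined here only to show how a stored module is loaded by name in a later PROC IML session.

```sas
proc iml;
/* MyStdize is a hypothetical module used only for illustration */
start MyStdize(x);
   return( (x - mean(x)) / sqrt(var(x)) );   /* center and scale a column vector */
finish;
store module=MyStdize;       /* save the module to the default storage catalog */
quit;

proc iml;
load module=MyStdize;        /* load one module by name */
z = MyStdize({1, 2, 3, 4});
print z;
quit;
```

Loading a module by name keeps the session small when you need only one or two functions.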

 

1.8 Specialized Simulation Tools in SAS Software

This book uses the SAS DATA step and SAS/IML software to simulate data. The book also uses the SIMNORMAL and SIM2D procedures in SAS/STAT software, and the COPULA procedure in SAS/ETS software.

SAS software contains other specialized simulation tools that are not covered in this book, including the following:

  • SAS Simulation Studio, which is part of the SAS/OR product, enables you to build discrete event simulations such as those that arise in queuing theory.
  • The MCMC procedure in SAS/STAT software is a general-purpose Markov chain Monte Carlo procedure. This procedure enables you to use simulation to fit Bayesian models. You can also use the MCMC procedure to perform direct sampling.
  • PROC MODEL in SAS/ETS software enables you to perform Monte Carlo simulation of time series models.

If you are interested in these topics, it is worthwhile to learn how to use these specialized tools. Although some of the techniques in this book can be applied to topics such as Monte Carlo integration and Bayesian analysis, using a specialized tool such as PROC MCMC is easier and more efficient than using general-purpose techniques.

 

1.9 References

Gentle, J. E. (2009), Computational Statistics, New York: Springer-Verlag.

Schilling, M. F., Watkins, A. E., and Watkins, W. (2002), “Is Human Height Bimodal?” American Statistician, 56, 223–229.
URL http://www.jstor.org/stable/3087302

Wicklin, R. (2010), Statistical Programming with SAS/IML Software, Cary, NC: SAS Institute Inc.
