Overview
Introduction
Objectives
Computing Statistics Using PROC MEANS
Procedure Syntax
Creating a Summarized Data Set Using PROC SUMMARY
Example
Producing Frequency Tables Using PROC FREQ
Procedure Syntax
Specifying Variables in PROC FREQ
Example
Chapter Summary
Text Summary
Syntax
Sample Programs
Chapter Quiz
As you have seen, one of the many features of PROC REPORT is the ability to summarize large amounts of data by producing descriptive statistics. However, there are SAS procedures that are designed specifically to produce various types of descriptive statistics and to display them in meaningful reports. The type of descriptive statistics that you need and the SAS procedure that you should use depend on whether you need to summarize continuous data values or discrete data values.
If the data values that you want to describe are continuous numeric values (for example, people's ages), then you can use the MEANS procedure or the SUMMARY procedure to calculate statistics such as the mean, sum, minimum, and maximum.
If the data values that you want to describe are discrete (for example, the color of people's eyes), then you can use the FREQ procedure to show the distribution of these values, such as percentages and counts.
This chapter will show you how to use the MEANS, SUMMARY, and FREQ procedures to describe your data.
In this chapter, you learn to
determine the n-count, mean, standard deviation, minimum, and maximum of numeric variables using the MEANS procedure
control the number of decimal places used in PROC MEANS output
specify the variables for which to produce statistics
use the SUMMARY procedure to produce the same results as the MEANS procedure
describe the difference between the SUMMARY and MEANS procedures
create one-way frequency tables for categorical data using the FREQ procedure
create two-way and n-way crossed frequency tables
control the layout and complexity of crossed frequency tables.
Descriptive statistics such as the mean, minimum, and maximum provide useful information about numeric data. The MEANS procedure provides these and other data summarization tools, as well as helpful options for controlling your output.
The MEANS procedure can include many statements and options for specifying needed statistics. For simplicity, let's consider the procedure in its basic form.
In its simplest form, PROC MEANS prints the n-count (number of nonmissing values), the mean, the standard deviation, and the minimum and maximum values of every numeric variable in a data set.
proc means data=perm.survey;
run;
The default statistics that the MEANS procedure produces (n-count, mean, standard deviation, minimum, and maximum) are not always the ones that you need. You might prefer to limit output to the mean of the values. Or you might need to compute a different statistic, such as the median or range of the values.
To specify statistics, include statistic keywords as options in the PROC MEANS statement. When you specify a statistic in the PROC MEANS statement, default statistics are not produced. For example, to see the median and range of Perm.Survey numeric values, add the MEDIAN and RANGE keywords as options.
proc means data=perm.survey median range;
run;
The following keywords can be used with PROC MEANS to compute statistics:
Table 9.1. Descriptive Statistics
Keyword | Description |
---|---|
CLM | Two-sided confidence limit for the mean |
CSS | Corrected sum of squares |
CV | Coefficient of variation |
KURTOSIS | Kurtosis |
LCLM | One-sided confidence limit below the mean |
MAX | Maximum value |
MEAN | Average |
MODE | Value that occurs most frequently (new in SAS 9.2) |
MIN | Minimum value |
N | Number of observations with nonmissing values |
NMISS | Number of observations with missing values |
RANGE | Range |
SKEWNESS | Skewness |
STDDEV / STD | Standard deviation |
STDERR | Standard error of the mean |
SUM | Sum |
SUMWGT | Sum of the Weight variable values |
UCLM | One-sided confidence limit above the mean |
USS | Uncorrected sum of squares |
VAR | Variance |
Table 9.2. Quantile Statistics
Keyword | Description |
---|---|
MEDIAN / P50 | Median or 50th percentile |
P1 | 1st percentile |
P5 | 5th percentile |
P10 | 10th percentile |
Q1 / P25 | Lower quartile or 25th percentile |
Q3 / P75 | Upper quartile or 75th percentile |
P90 | 90th percentile |
P95 | 95th percentile |
P99 | 99th percentile |
QRANGE | Difference between upper and lower quartiles: Q3-Q1 |
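As a minimal sketch of how keywords from both tables can be combined in one step, the program below requests a mix of descriptive and quantile statistics. It assumes the Sashelp.Class sample data set that ships with SAS, which is not one of this chapter's data sets; any data set with numeric variables would serve equally well.

```sas
/* Assumption: sashelp.class is the sample data set supplied with SAS. */
/* Because statistic keywords are listed, the default statistics       */
/* (N, MEAN, STD, MIN, MAX) are replaced by the ones requested here.   */
proc means data=sashelp.class n nmiss mean median q1 q3 qrange;
run;
```

Only the listed statistics appear in the report, one row per numeric variable.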
By default, PROC MEANS output automatically uses the BESTw. format to display numeric values in the report.
The BESTw. format is the default format that SAS uses for writing numeric values. When there is no format specification, SAS chooses the format that provides the most information about the value according to the available field width. At times, this can result in unnecessary decimal places, making your output hard to read.
proc means data=clinic.diabetes min max;
run;
To limit decimal places, use the MAXDEC= option in the PROC MEANS statement, and set it equal to the maximum number of decimal places that you want to display.
proc means data=clinic.diabetes min max maxdec=0;
run;
By default, the MEANS procedure generates statistics for every numeric variable in a data set. But you'll typically want to focus on just a few variables, particularly if the data set is large. It also makes sense to exclude certain types of variables. The values of ID, for example, are unlikely to yield useful statistics.
To specify the variables that PROC MEANS analyzes, add a VAR statement and list the variable names.
proc means data=clinic.diabetes min max maxdec=0;
var age height weight;
run;
In addition to listing variables separately, you can use a numbered range of variables.
proc means data=perm.survey mean stderr maxdec=2;
var item1-item5;
run;
You will often want statistics for grouped observations, instead of for observations as a whole. For example, census numbers are more useful when grouped by region than when viewed as a national total. To produce separate analyses of grouped observations, add a CLASS statement to the MEANS procedure.
PROC MEANS does not generate statistics for CLASS variables, because their values are used only to categorize data. CLASS variables can be either character or numeric, but they should contain a limited number of discrete values that represent meaningful groupings.
The output of the program shown below is categorized by values of the variables Survive and Sex. The order of the variables in the CLASS statement determines their order in the output table.
proc means data=clinic.heart maxdec=1;
var arterial heart cardiac urinary;
class survive sex;
run;
Like the CLASS statement, the BY statement specifies variables to use for categorizing observations.
But BY and CLASS differ in two key ways:
Unlike CLASS processing, BY processing requires that your data already be sorted or indexed in the order of the BY variables. Unless data set observations are already sorted, you will need to run the SORT procedure before using PROC MEANS with any BY group.
BY group results have a layout that is different from the layout of CLASS group results. Note that the BY statement in the program below creates four small tables; a CLASS statement would produce a single large table.
proc sort data=clinic.heart out=work.heartsort;
by survive sex;
run;
proc means data=work.heartsort maxdec=1;
var arterial heart cardiac urinary;
by survive sex;
run;
You might want to create an output SAS data set that contains just the summarized variables. You can do this by using the OUTPUT statement in PROC MEANS.
When you use the OUTPUT statement without specifying statistics, the summary statistics N, MEAN, STD, MIN, and MAX are produced by default for all of the numeric variables or for all of the variables that are listed in a VAR statement. To specify which statistics to produce, use the statistic-keyword= option (for example, MEAN= or MIN=).
To do so, specify the statistic keyword followed by an equal sign, and then list a name for each output variable. The names must be listed in the same order as the corresponding variables in the VAR statement. You can specify more than one statistic in the OUTPUT statement.
The following program creates a typical PROC MEANS report and also creates a summarized output data set.
proc means data=clinic.diabetes;
var age height weight;
class sex;
output out=work.sum_gender
mean=AvgAge AvgHeight AvgWeight
min=MinAge MinHeight MinWeight;
run;
To see the contents of the output data set, submit the following PROC PRINT step.
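proc print data=work.sum_gender;
run;
To suppress the default report and create only the output data set, add the NOPRINT option to the PROC MEANS statement.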
proc means data=clinic.diabetes noprint;
var age height weight;
class sex;
output out=work.sum_gender
mean=AvgAge AvgHeight AvgWeight;
run;
You can also create a summarized output data set by using PROC SUMMARY. When you use PROC SUMMARY, you use the same code to produce the output data set that you would use with PROC MEANS.
The difference between the two procedures is that PROC MEANS produces a report by default (remember that you can use the NOPRINT option to suppress the default report). By contrast, to produce a report in PROC SUMMARY, you must include a PRINT option in the PROC SUMMARY statement.
The following example creates an output data set but does not create a report:
proc summary data=clinic.diabetes;
var age height weight;
class sex;
output out=work.sum_gender
mean=AvgAge AvgHeight AvgWeight;
run;
If you placed a PRINT option in the PROC SUMMARY statement above, this program would produce the same report as if you replaced the word SUMMARY with MEANS.
proc summary data=clinic.diabetes print;
var age height weight;
class sex;
output out=work.sum_gender
mean=AvgAge AvgHeight AvgWeight;
run;
The FREQ procedure is a descriptive procedure as well as a statistical procedure. It produces one-way and n-way frequency tables, and it concisely describes your data by reporting the distribution of variable values. You can use the FREQ procedure to create crosstabulation tables that summarize data for two or more categorical variables by showing the number of observations for each combination of variable values.
The FREQ procedure can include many statements and options for controlling frequency output. For simplicity, let's consider the procedure in its basic form.
By default, PROC FREQ creates a one-way table with the frequency, percent, cumulative frequency, and cumulative percent of every value of all variables in a data set.
For example, the following FREQ procedure creates a frequency table for each variable in the data set Parts.Widgets. All the unique values are shown for ItemName, LotSize, and Region.
proc freq data=parts.widgets;
run;
By default, the FREQ procedure creates frequency tables for every variable in your data set. But this isn't always what you want. A variable that has continuous numeric values—such as DateTime—can result in a lengthy and meaningless table. Likewise, a variable that has a unique value for each observation—such as FullName—is unsuitable for PROC FREQ processing. Frequency distributions work best with variables whose values can be described as categorical, and whose values are best summarized by counts rather than by averages.
To specify the variables to be processed by the FREQ procedure, include a TABLES statement.
The order in which the variables appear in the TABLES statement determines the order in which they are listed in the PROC FREQ report.
Consider the SAS data set Finance.Loans. The variables Rate and Months are best described as categorical values, so they are the best choices for frequency tables.
proc freq data=finance.loans;
tables rate months;
run;
In addition to listing variables separately, you can use a numbered range of variables.
proc freq data=perm.survey;
tables item1-item3;
run;
Adding the NOCUM option to your TABLES statement suppresses the display of cumulative frequencies and cumulative percentages in one-way frequency tables and in list output. The syntax for the NOCUM option is shown below.
TABLES variable(s) / NOCUM;
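The following sketch applies that syntax. It assumes the Sashelp.Class sample data set that ships with SAS rather than one of this chapter's data sets, and the variables Sex and Age are chosen only because they have few distinct values.

```sas
/* Assumption: sashelp.class is the sample data set supplied with SAS. */
/* NOCUM removes the cumulative frequency and cumulative percent       */
/* columns, leaving only Frequency and Percent in each one-way table.  */
proc freq data=sashelp.class;
   tables sex age / nocum;
run;
```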
So far, you have used the FREQ procedure to create one-way frequency tables. The table results show total frequency counts for the values within the data set. However, it is often helpful to crosstabulate frequencies with the values of other variables. For example, census data is typically crosstabulated with a variable that represents geographical regions.
The simplest crosstabulation is a two-way table. To create a two-way table, join two variables with an asterisk (*) in the TABLES statement of a PROC FREQ step.
When crosstabulations are specified, PROC FREQ produces tables with cells that contain
cell frequency
cell percentage of total frequency
cell percentage of row frequency
cell percentage of column frequency.
For example, the following program creates the two-way table shown below.
Note that the first variable, Weight, forms the table rows, and the second variable, Height, forms the columns; reversing the order of the variables in the TABLES statement would reverse their positions in the table. Note also that the statistics are listed in the legend box.
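A two-way request of this kind might look like the sketch below. The TABLES and FORMAT statements follow the form used in this chapter's quiz, but the definitions of the WTFMT. and HTFMT. formats are assumptions made for illustration: the cutoff values and labels are hypothetical. The sketch also assumes that the Clinic libref is already assigned.

```sas
/* Hypothetical formats that group continuous values into a few      */
/* categories; the cutoffs and labels are illustrative assumptions.  */
proc format;
   value wtfmt low-<140  = '< 140'
               140-180   = '140-180'
               180<-high = '> 180';
   value htfmt low-<65   = '< 65'
               65-70     = '65-70'
               70<-high  = '> 70';
run;

/* Weight (listed first) forms the rows; Height forms the columns. */
proc freq data=clinic.diabetes;
   tables weight*height;
   format weight wtfmt. height htfmt.;
run;
```

Grouping the values with formats keeps the table small. Without them, every distinct weight and height would get its own row or column.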
For a frequency analysis of more than two variables, use PROC FREQ to create n-way crosstabulations. A series of two-way tables is produced, with a table for each level of the other variables.
For example, suppose you want to add the variable Sex to your crosstabulation of Weight and Height in the data set Clinic.Diabetes. Add Sex to the TABLES statement, joined to the other variables with an asterisk (*).
tables sex*weight*height;
The order of the variables is important. In n-way tables, the last two variables of the TABLES statement become the two-way rows and columns. Variables that precede the last two variables in the TABLES statement stratify the crosstabulation tables.
      levels
        ↓
tables sex*weight*height;
             ↑      ↑
           rows + columns = two-way tables
Notice the structure of the output that is produced by the program shown below.
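As a runnable sketch of an n-way request, the step below uses the Sashelp.Heart sample data set that ships with SAS instead of Clinic.Diabetes, because its variables Sex, Weight_Status, and BP_Status are already categorical and need no user-defined formats. That substitution is an assumption made for illustration.

```sas
/* Assumption: sashelp.heart is the sample data set supplied with SAS. */
/* Sex stratifies the output: one two-way table is produced for each   */
/* value of Sex, with Weight_Status as rows and BP_Status as columns.  */
proc freq data=sashelp.heart;
   tables sex*weight_status*bp_status;
run;
```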
Beginning in SAS 9, adding the CROSSLIST option to your TABLES statement displays crosstabulation tables in ODS column format. This option creates a table that has a table definition that you can customize by using the TEMPLATE procedure.
Notice the structure of the output that is produced by the program shown below.
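A minimal sketch of the CROSSLIST option follows. It assumes the Sashelp.Heart sample data set that ships with SAS, chosen here only because its Sex, Weight_Status, and BP_Status variables are categorical; it is not one of this chapter's data sets.

```sas
/* Assumption: sashelp.heart is the sample data set supplied with SAS. */
/* CROSSLIST keeps the crosstabulation statistics but lays them out    */
/* in ODS column format, one row per combination of values.            */
proc freq data=sashelp.heart;
   tables sex*weight_status*bp_status / crosslist;
run;
```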
When three or more variables are specified, the multiple levels of n-way tables can produce considerable output. Such bulky, complex crosstabulations are often easier to read as a continuous list. Although this eliminates row and column frequencies and percents, the results are compact and clear.
To generate list output for crosstabulations, add a slash (/) and the LIST option to the TABLES statement in your PROC FREQ step.
TABLES variable-1 *variable-2 <* ... variable-n> / LIST;
Adding the LIST option to our Clinic.Diabetes program puts its frequencies in a simple, short table.
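The same idea can be sketched with data that is available in any SAS installation. The step below assumes the Sashelp.Heart sample data set rather than Clinic.Diabetes, so the variable names differ from the ones discussed above.

```sas
/* Assumption: sashelp.heart is the sample data set supplied with SAS. */
/* LIST prints one line per combination of Sex, Weight_Status, and     */
/* BP_Status instead of a separate two-way table for each Sex value.   */
proc freq data=sashelp.heart;
   tables sex*weight_status*bp_status / list;
run;
```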
Another way to control the format of crosstabulations is to limit the output of the FREQ procedure to a few specific statistics. Remember that when crosstabulations are run, PROC FREQ produces tables with cells that contain:
cell frequency
cell percentage of total frequency
cell percentage of row frequency
cell percentage of column frequency.
You can use options to suppress any of these statistics. To control the depth of crosstabulation results, add any combination of the following options to the TABLES statement:
NOFREQ suppresses cell frequencies
NOPERCENT suppresses cell percentages
NOROW suppresses row percentages
NOCOL suppresses column percentages.
Suppose you want to use only the percentages of Sex and Weight combinations in the data set Clinic.Diabetes. To block frequency counts and row and column percentages, add the NOFREQ, NOROW, and NOCOL options to your program's TABLES statement.
Notice that Percent is the only statistic that remains in the table's legend box.
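A comparable request can be sketched as follows. It assumes the Sashelp.Heart sample data set that ships with SAS and uses its categorical Weight_Status variable in place of a formatted Weight variable; both substitutions are made only so that the step runs without this chapter's data.

```sas
/* Assumption: sashelp.heart is the sample data set supplied with SAS. */
/* NOFREQ, NOROW, and NOCOL suppress the cell frequency, row percent,  */
/* and column percent, so each cell shows only the overall percentage. */
proc freq data=sashelp.heart;
   tables sex*weight_status / nofreq norow nocol;
run;
```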
The MEANS procedure provides an easy way to compute descriptive statistics. Descriptive statistics such as the mean, minimum, and maximum provide useful information about numeric data.
By default, PROC MEANS computes the n-count (the number of non-missing values), the mean, the standard deviation, and the minimum and maximum values for variables. To specify statistics, list their keywords in the PROC MEANS statement.
Table 9.4. Descriptive Statistics
Keyword | Description |
---|---|
CLM | Two-sided confidence limit for the mean |
CSS | Corrected sum of squares |
CV | Coefficient of variation |
KURTOSIS | Kurtosis |
LCLM | One-sided confidence limit below the mean |
MAX | Maximum value |
MEAN | Average |
MIN | Minimum value |
N | Number of observations with nonmissing values |
NMISS | Number of observations with missing values |
RANGE | Range |
SKEWNESS | Skewness |
STDDEV / STD | Standard deviation |
STDERR | Standard error of the mean |
SUM | Sum |
SUMWGT | Sum of the Weight variable values |
UCLM | One-sided confidence limit above the mean |
USS | Uncorrected sum of squares |
VAR | Variance |
Table 9.5. Quantile Statistics
Keyword | Description |
---|---|
MEDIAN / P50 | Median or 50th percentile |
P1 | 1st percentile |
P5 | 5th percentile |
P10 | 10th percentile |
Q1 / P25 | Lower quartile or 25th percentile |
Q3 / P75 | Upper quartile or 75th percentile |
P90 | 90th percentile |
P95 | 95th percentile |
P99 | 99th percentile |
QRANGE | Difference between upper and lower quartiles: Q3-Q1 |
Because PROC MEANS uses the BESTw. format by default, procedure output can contain unnecessary decimal places. To limit decimal places, use the MAXDEC= option and set it equal to the maximum number of decimal places that you want to display.
By default, PROC MEANS computes statistics for all numeric variables. To specify the variables to include in PROC MEANS output, list them in a VAR statement.
Include a CLASS statement, specifying variable names, to group PROC MEANS output by variable values. Statistics are not computed for the CLASS variables.
Include a BY statement, specifying variable names, to group PROC MEANS output by variable values. Your data must be sorted according to those variables. Statistics are not computed for the BY variables.
You can create an output data set that contains summarized variables by using the OUTPUT statement in PROC MEANS. When you use the OUTPUT statement without specifying the statistic-keyword= option, the summary statistics N, MEAN, STD, MIN, and MAX are produced for all of the numeric variables or for all of the variables that are listed in a VAR statement.
You can also create a summarized output data set by using PROC SUMMARY. The PROC SUMMARY code for producing an output data set is exactly the same as the code for producing an output data set with PROC MEANS. The difference between the two procedures is that PROC MEANS produces a report by default, whereas PROC SUMMARY produces an output data set by default.
The FREQ procedure is a descriptive procedure as well as a statistical procedure that produces one-way and n-way frequency tables. It concisely describes your data by reporting the distribution of variable values.
By default, the FREQ procedure creates frequency tables for every variable in your data set. To specify the variables to analyze, include them in a TABLES statement.
When a TABLES statement contains two variables joined by an asterisk (*), PROC FREQ produces crosstabulations. The resulting table displays values for
cell frequency
cell percentage of total frequency
cell percentage of row frequency
cell percentage of column frequency.
Crosstabulations can include more than two variables. When three or more variables are joined in a TABLES statement, the result is a series of two-way tables that are grouped by the values of the first variables listed. Beginning in SAS 9, you can use the CROSSLIST option to format your tables in ODS column format.
To reduce the bulk of n-way table output, add a slash (/) and the LIST option to the end of the TABLES statement. PROC FREQ then prints compact, multi-column lists instead of a series of tables.
PROC MEANS < DATA=SAS-data-set>
< statistic-keyword(s)><option(s)>;
< VAR variable(s)>;
< CLASS variable(s)>;
< BY variable(s)>;
< OUTPUT out=SAS-data-set statistic=variable(s)>;
RUN;
PROC SUMMARY < DATA=SAS-data-set>
< statistic-keyword(s)><option(s)>;
< VAR variable(s)>;
< CLASS variable(s)>;
< OUTPUT out=SAS-data-set>;
RUN;
PROC FREQ < DATA=SAS-data-set>;
TABLES variable-1 *variable-2 <* ... variable-n>
/ <NOFREQ|NOPERCENT|NOROW|NOCOL> <LIST>;
RUN;
proc means data=clinic.heart min max maxdec=1;
var arterial heart cardiac urinary;
class survive sex;
run;
proc summary data=clinic.diabetes;
var age height weight;
class sex;
output out=work.sum_gender
mean=AvgAge AvgHeight AvgWeight;
run;
proc freq data=clinic.heart order=freq;
tables sex*survive*shock / nopercent list;
run;
In PROC MEANS, use a VAR statement to limit output to relevant variables. Exclude statistics for nominal variables such as ID or ProductCode.
By default, PROC MEANS prints the full width of each numeric variable. Use the MAXDEC= option to limit decimal places and to improve legibility.
Data must be sorted for BY group processing. You might need to run PROC SORT before using PROC MEANS with a BY statement.
PROC MEANS and PROC SUMMARY produce the same results; however, the default output is different. PROC MEANS produces a report, whereas PROC SUMMARY produces an output data set.
If you do not include a TABLES statement, PROC FREQ produces statistics for every variable in the data set.
Variables that have continuous numeric values can create a large amount of output. Use a TABLES statement to exclude such variables, or group their values by applying a FORMAT statement.
Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.
The default statistics produced by the MEANS procedure are n-count, mean, minimum, maximum, and...
median
range
standard deviation
standard error of the mean.
Which statement will limit a PROC MEANS analysis to the variables Boarded, Transfer, and Deplane?
by boarded transfer deplane;
class boarded transfer deplane;
output boarded transfer deplane;
var boarded transfer deplane;
The data set Survey.Health includes the following variables. Which is a poor candidate for PROC MEANS analysis?
IDnum
Age
Height
Weight
Which of the following statements is true regarding BY group processing?
BY variables must be either indexed or sorted.
Summary statistics are computed for BY variables.
BY group processing is preferred when you are categorizing data that contains few variables.
BY group processing overwrites your data set with the newly grouped observations.
Which group processing statement produced the PROC MEANS output shown below?
class sex survive;
class survive sex;
by sex survive;
by survive sex;
Which program can be used to create the following output?
proc means data=clinic.diabetes; var age height weight; class sex; output out=work.sum_gender mean=AvgAge AvgHeight AvgWeight; run;
proc summary data=clinic.diabetes print; var age height weight; class sex; output out=work.sum_gender mean=AvgAge AvgHeight AvgWeight; run;
proc means data=clinic.diabetes noprint; var age height weight; class sex; output out=work.sum_gender mean=AvgAge AvgHeight AvgWeight; run;
Both a and b.
By default, PROC FREQ creates a table of frequencies and percentages for which data set variables?
character variables
numeric variables
both character and numeric variables
none: variables must always be specified
Frequency distributions work best with variables that contain
continuous values.
numeric values.
categorical values.
unique values.
Which PROC FREQ step produced this two-way table?
proc freq data=clinic.diabetes; tables height weight; format height htfmt. weight wtfmt.; run;
proc freq data=clinic.diabetes; tables weight height; format weight wtfmt. height htfmt.; run;
proc freq data=clinic.diabetes; tables height*weight; format height htfmt. weight wtfmt.; run;
proc freq data=clinic.diabetes; tables weight*height; format weight wtfmt. height htfmt.; run;
Which PROC FREQ step produced this table?
proc freq data=clinic.diabetes; tables sex weight / list; format weight wtfmt.; run;
proc freq data=clinic.diabetes; tables sex*weight / nocol; format weight wtfmt.; run;
proc freq data=clinic.diabetes; tables sex weight / norow nocol; format weight wtfmt.; run;
proc freq data=clinic.diabetes; tables sex*weight / nofreq norow nocol; format weight wtfmt.; run;