Chapter 9. Producing Descriptive Statistics

Overview

Introduction

As you have seen, one of the many features of PROC REPORT is the ability to summarize large amounts of data by producing descriptive statistics. However, there are SAS procedures that are designed specifically to produce various types of descriptive statistics and to display them in meaningful reports. The type of descriptive statistics that you need and the SAS procedure that you should use depend on whether you need to summarize continuous data values or discrete data values.

If the data values that you want to describe are continuous numeric values (for example, people's ages), then you can use the MEANS procedure or the SUMMARY procedure to calculate statistics such as the mean, sum, minimum, and maximum.

If the data values that you want to describe are discrete (for example, the color of people's eyes), then you can use the FREQ procedure to show the distribution of these values, such as percentages and counts.

This chapter will show you how to use the MEANS, SUMMARY, and FREQ procedures to describe your data.

Objectives

In this chapter, you learn to

  • determine the n-count, mean, standard deviation, minimum, and maximum of numeric variables using the MEANS procedure

  • control the number of decimal places used in PROC MEANS output

  • specify the variables for which to produce statistics

  • use the SUMMARY procedure to produce the same results as the MEANS procedure

  • describe the difference between the SUMMARY and MEANS procedures

  • create one-way frequency tables for categorical data using the FREQ procedure

  • create two-way and n-way crossed frequency tables

  • control the layout and complexity of crossed frequency tables.

Computing Statistics Using PROC MEANS

Descriptive statistics such as the mean, minimum, and maximum provide useful information about numeric data. The MEANS procedure provides these and other data summarization tools, as well as helpful options for controlling your output.

Procedure Syntax

The MEANS procedure can include many statements and options for specifying needed statistics. For simplicity, let's consider the procedure in its basic form.

In its simplest form, PROC MEANS prints the n-count (number of nonmissing values), the mean, the standard deviation, and the minimum and maximum values of every numeric variable in a data set.

proc means data=perm.survey;
run;

Selecting Statistics

The default statistics that the MEANS procedure produces (n-count, mean, standard deviation, minimum, and maximum) are not always the ones that you need. You might prefer to limit output to the mean of the values. Or you might need to compute a different statistic, such as the median or range of the values.

To specify statistics, include statistic keywords as options in the PROC MEANS statement. When you specify a statistic in the PROC MEANS statement, default statistics are not produced. For example, to see the median and range of Perm.Survey numeric values, add the MEDIAN and RANGE keywords as options.

proc means data=perm.survey median range;
run;

The following keywords can be used with PROC MEANS to compute statistics:

Table 9.1. Descriptive Statistics

Keyword         Description
CLM             Two-sided confidence limit for the mean
CSS             Corrected sum of squares
CV              Coefficient of variation
KURTOSIS        Kurtosis
LCLM            One-sided confidence limit below the mean
MAX             Maximum value
MEAN            Average
MODE            Value that occurs most frequently (new in SAS 9.2)
MIN             Minimum value
N               Number of observations with nonmissing values
NMISS           Number of observations with missing values
RANGE           Range
SKEWNESS        Skewness
STDDEV / STD    Standard deviation
STDERR          Standard error of the mean
SUM             Sum
SUMWGT          Sum of the Weight variable values
UCLM            One-sided confidence limit above the mean
USS             Uncorrected sum of squares
VAR             Variance

Table 9.2. Quantile Statistics

Keyword         Description
MEDIAN / P50    Median or 50th percentile
P1              1st percentile
P5              5th percentile
P10             10th percentile
Q1 / P25        Lower quartile or 25th percentile
Q3 / P75        Upper quartile or 75th percentile
P90             90th percentile
P95             95th percentile
P99             99th percentile
QRANGE          Difference between upper and lower quartiles: Q3-Q1

Table 9.3. Hypothesis Testing

Keyword         Description
PROBT           Probability of a greater absolute value for the t value
T               Student's t for testing the hypothesis that the population mean is 0
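
Keywords from all three tables can be combined in a single PROC MEANS statement. The following is a minimal sketch, assuming that the Perm library is assigned and that Perm.Survey contains numeric variables, as in the earlier examples. The particular mix of keywords is only an illustration.

```sas
/* Assumes the Perm library is assigned, as in the earlier examples. */
/* Descriptive keywords: N NMISS MEAN CLM                            */
/* Quantile keywords:    MEDIAN QRANGE                               */
/* Hypothesis testing:   T PROBT                                     */
proc means data=perm.survey n nmiss mean clm median qrange t probt;
run;
```

Because statistic keywords are specified, only the requested statistics appear in the report.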

Limiting Decimal Places

By default, PROC MEANS output automatically uses the BESTw. format to display numeric values in the report.

The BESTw. format is the default format that SAS uses for writing numeric values. When there is no format specification, SAS chooses the format that provides the most information about the value according to the available field width. At times, this can result in unnecessary decimal places, making your output hard to read.

proc means data=clinic.diabetes min max;
run;

To limit decimal places, use the MAXDEC= option in the PROC MEANS statement, and set it equal to the maximum number of decimal places that you want to display.

proc means data=clinic.diabetes min max maxdec=0;
run;

Specifying Variables in PROC MEANS

By default, the MEANS procedure generates statistics for every numeric variable in a data set. But you'll typically want to focus on just a few variables, particularly if the data set is large. It also makes sense to exclude certain types of variables. The values of ID, for example, are unlikely to yield useful statistics.

To specify the variables that PROC MEANS analyzes, add a VAR statement and list the variable names.

proc means data=clinic.diabetes min max maxdec=0;
   var age height weight;
run;

In addition to listing variables separately, you can use a numbered range of variables.

proc means data=perm.survey mean stderr maxdec=2;
   var item1-item5;
run;

Group Processing Using the CLASS Statement

You will often want statistics for grouped observations, instead of for observations as a whole. For example, census numbers are more useful when grouped by region than when viewed as a national total. To produce separate analyses of grouped observations, add a CLASS statement to the MEANS procedure.

PROC MEANS does not generate statistics for CLASS variables, because their values are used only to categorize data. CLASS variables can be either character or numeric, but they should contain a limited number of discrete values that represent meaningful groupings.

The output of the program shown below is categorized by values of the variables Survive and Sex. The order of the variables in the CLASS statement determines their order in the output table.

proc means data=clinic.heart maxdec=1;
   var arterial heart cardiac urinary;
   class survive sex;
run;

Group Processing Using the BY Statement

Like the CLASS statement, the BY statement specifies variables to use for categorizing observations.

But BY and CLASS differ in two key ways:

  1. Unlike CLASS processing, BY processing requires that your data already be sorted or indexed in the order of the BY variables. Unless data set observations are already sorted, you will need to run the SORT procedure before using PROC MEANS with any BY group.

  2. BY group results have a layout that is different from the layout of CLASS group results. Note that the BY statement in the program below creates four small tables; a CLASS statement would produce a single large table.

proc sort data=clinic.heart out=work.heartsort;
   by survive sex;
run;
proc means data=work.heartsort maxdec=1;
   var arterial heart cardiac urinary;
   by survive sex;
run;

Creating a Summarized Data Set Using PROC MEANS

You might want to create an output SAS data set that contains just the summarized variables. You can do this by using the OUTPUT statement in PROC MEANS.

When you use the OUTPUT statement without specifying any statistics, the summary statistics N, MEAN, STD, MIN, and MAX are produced by default for all of the numeric variables or for all of the variables that are listed in a VAR statement. To specify which statistics to produce, use the STATISTIC= option, where STATISTIC is a statistic keyword such as MEAN or MIN.

Specifying the STATISTIC= option

You can specify which statistics to produce in the output data set. To do so, specify the statistic keyword, followed by an equal sign and a name for each output variable. The names are assigned, in order, to the statistics for the variables that are listed in the VAR statement. You can specify more than one statistic in the OUTPUT statement.

The following program creates a typical PROC MEANS report and also creates a summarized output data set.

proc means data=clinic.diabetes;
   var age height weight;
   class sex;
   output out=work.sum_gender
      mean=AvgAge AvgHeight AvgWeight
      min=MinAge MinHeight MinWeight;
run;

To see the contents of the output data set, submit the following PROC PRINT step.

proc print data=work.sum_gender;
run;

You can use the NOPRINT option in the PROC MEANS statement to suppress the default report, so that the step creates only the output data set.

proc means data=clinic.diabetes noprint;
   var age height weight;
   class sex;
   output out=work.sum_gender
      mean=AvgAge AvgHeight AvgWeight;
run;
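
When a CLASS statement is used, the output data set holds more than the requested statistics. PROC MEANS also adds the automatic variables _TYPE_ and _FREQ_. With a single CLASS variable, it writes one overall summary observation (_TYPE_=0) followed by one observation for each class value (_TYPE_=1). As a minimal sketch, assuming that Work.Sum_Gender was created by the step above, the following step prints only the per-group observations:

```sas
/* Assumes Work.Sum_Gender exists and was created with CLASS sex. */
proc print data=work.sum_gender;
   where _type_ = 1;   /* keep only the rows for individual values of Sex */
   var sex _freq_ AvgAge AvgHeight AvgWeight;
run;
```

Filtering on _TYPE_ is one way to drop the overall row. Whether you need it depends on how the summarized data set will be used.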

Creating a Summarized Data Set Using PROC SUMMARY

You can also create a summarized output data set by using PROC SUMMARY. When you use PROC SUMMARY, you use the same code to produce the output data set that you would use with PROC MEANS.

The difference between the two procedures is that PROC MEANS produces a report by default (remember that you can use the NOPRINT option to suppress the default report). By contrast, to produce a report in PROC SUMMARY, you must include a PRINT option in the PROC SUMMARY statement.

Example

The following example creates an output data set but does not create a report:

proc summary data=clinic.diabetes;
   var age height weight;
   class sex;
   output out=work.sum_gender
      mean=AvgAge AvgHeight AvgWeight;
run;

If you placed a PRINT option in the PROC SUMMARY statement above, this program would produce the same report as if you replaced the word SUMMARY with MEANS.

proc summary data=clinic.diabetes print;
   var age height weight;
   class sex;
   output out=work.sum_gender
      mean=AvgAge AvgHeight AvgWeight;
run;

Producing Frequency Tables Using PROC FREQ

The FREQ procedure is a descriptive procedure as well as a statistical procedure. It produces one-way and n-way frequency tables, and it concisely describes your data by reporting the distribution of variable values. You can use the FREQ procedure to create crosstabulation tables that summarize data for two or more categorical variables by showing the number of observations for each combination of variable values.

Procedure Syntax

The FREQ procedure can include many statements and options for controlling frequency output. For simplicity, let's consider the procedure in its basic form.

By default, PROC FREQ creates a one-way table with the frequency, percent, cumulative frequency, and cumulative percent of every value of all variables in a data set.

For example, the following FREQ procedure creates a frequency table for each variable in the data set Parts.Widgets. All the unique values are shown for ItemName, LotSize, and Region.

proc freq data=parts.widgets;
run;

Specifying Variables in PROC FREQ

By default, the FREQ procedure creates frequency tables for every variable in your data set. But this isn't always what you want. A variable that has continuous numeric values—such as DateTime—can result in a lengthy and meaningless table. Likewise, a variable that has a unique value for each observation—such as FullName—is unsuitable for PROC FREQ processing. Frequency distributions work best with variables whose values can be described as categorical, and whose values are best summarized by counts rather than by averages.

To specify the variables to be processed by the FREQ procedure, include a TABLES statement.

Example

The order in which the variables appear in the TABLES statement determines the order in which they are listed in the PROC FREQ report.

Consider the SAS data set Finance.Loans. The variables Rate and Months are best described as categorical values, so they are the best choices for frequency tables.

proc freq data=finance.loans;
   tables rate months;
run;

In addition to listing variables separately, you can use a numbered range of variables.

proc freq data=perm.survey;
   tables item1-item3;
run;

Adding the NOCUM option to your TABLES statement suppresses the display of cumulative frequencies and cumulative percentages in one-way frequency tables and in list output. The syntax for the NOCUM option is shown below.

TABLES variable(s) / NOCUM;
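
As a brief sketch, the NOCUM option could be applied to the Finance.Loans example from earlier in this section. The step assumes that the Finance library is assigned and that the data set contains the variables Rate and Months.

```sas
/* Assumes the Finance library is assigned, as in the earlier example. */
proc freq data=finance.loans;
   tables rate months / nocum;   /* frequency and percent only */
run;
```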

Creating Two-Way Tables

So far, you have used the FREQ procedure to create one-way frequency tables. The table results show total frequency counts for the values within the data set. However, it is often helpful to crosstabulate frequencies with the values of other variables. For example, census data is typically crosstabulated with a variable that represents geographical regions.

The simplest crosstabulation is a two-way table. To create a two-way table, join two variables with an asterisk (*) in the TABLES statement of a PROC FREQ step.

When crosstabulations are specified, PROC FREQ produces tables with cells that contain

  • cell frequency

  • cell percentage of total frequency

  • cell percentage of row frequency

  • cell percentage of column frequency.

For example, a PROC FREQ step for the data set Clinic.Diabetes whose TABLES statement specifies weight*height creates a two-way table of Weight by Height.

Note that the first variable, Weight, forms the table rows, and the second variable, Height, forms the columns; reversing the order of the variables in the TABLES statement would reverse their positions in the table. Note also that the statistics are listed in the legend box.
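
A minimal sketch of such a program follows. It assumes that the Clinic library is assigned. The format names WTFMT. and HTFMT. are the ones used in this chapter's quiz, but the value ranges defined for them here are hypothetical, chosen only to group the continuous Weight and Height values into a few categories.

```sas
/* Hypothetical ranges: adjust them to suit your data. */
proc format;
   value wtfmt low-<140  = '< 140'
               140-180   = '140-180'
               180<-high = '> 180';
   value htfmt low-<65   = '< 65'
               65-70     = '65-70'
               70<-high  = '> 70';
run;

/* Weight (first variable) forms the rows; Height forms the columns. */
proc freq data=clinic.diabetes;
   tables weight*height;
   format weight wtfmt. height htfmt.;
run;
```

Without the FORMAT statement, every distinct value of Weight and Height would get its own row or column.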

Creating N-Way Tables

For a frequency analysis of more than two variables, use PROC FREQ to create n-way crosstabulations. A series of two-way tables is produced, with a table for each level of the other variables.

For example, suppose you want to add the variable Sex to your crosstabulation of Weight and Height in the data set Clinic.Diabetes. Add Sex to the TABLES statement, joined to the other variables with an asterisk (*).

tables sex*weight*height;

Determining the Table Layout

The order of the variables is important. In n-way tables, the last two variables of the TABLES statement become the two-way rows and columns. Variables that precede the last two variables in the TABLES statement stratify the crosstabulation tables.

tables sex*weight*height;

   sex:            levels
   weight*height:  rows + columns = two-way tables

A PROC FREQ step that uses this TABLES statement produces a separate two-way table of Weight by Height for each value of Sex.
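
As a sketch, the following step requests the three-way crosstabulation. It assumes that the Clinic library is assigned and that the user-defined formats WTFMT. and HTFMT. (the names used in this chapter's quiz) have already been created with PROC FORMAT.

```sas
/* Assumes WTFMT. and HTFMT. already exist in the session. */
proc freq data=clinic.diabetes;
   tables sex*weight*height;   /* Sex stratifies; Weight = rows; Height = columns */
   format weight wtfmt. height htfmt.;
run;
```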

Changing the Table Format

Beginning in SAS 9, adding the CROSSLIST option to your TABLES statement displays crosstabulation tables in ODS column format. This option creates a table that has a table definition that you can customize by using the TEMPLATE procedure.

To use the option, add a slash (/) and CROSSLIST to the TABLES statement, as in tables sex*weight*height / crosslist;.
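
A minimal sketch of a complete step with the CROSSLIST option follows. It assumes that the Clinic library is assigned and that the user-defined formats WTFMT. and HTFMT. (the names used in this chapter's quiz) have already been created with PROC FORMAT.

```sas
/* Assumes WTFMT. and HTFMT. already exist in the session. */
proc freq data=clinic.diabetes;
   tables sex*weight*height / crosslist;   /* ODS column format */
   format weight wtfmt. height htfmt.;
run;
```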

Creating Tables in List Format

When three or more variables are specified, the multiple levels of n-way tables can produce considerable output. Such bulky, complex crosstabulations are often easier to read as a continuous list. Although list format omits row and column frequencies and percents, the results are compact and clear.

To generate list output for crosstabulations, add a slash (/) and the LIST option to the TABLES statement in your PROC FREQ step.

TABLES variable-1 *variable-2 <* ... variable-n> / LIST;

Example

Adding the LIST option to our Clinic.Diabetes program puts its frequencies in a simple, short table.
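
As a sketch, the step might look like the following. It assumes that the Clinic library is assigned and that the user-defined formats WTFMT. and HTFMT. (the names used in this chapter's quiz) have already been created with PROC FORMAT.

```sas
/* Assumes WTFMT. and HTFMT. already exist in the session. */
proc freq data=clinic.diabetes;
   tables sex*weight*height / list;   /* one row per combination of values */
   format weight wtfmt. height htfmt.;
run;
```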

Suppressing Table Information

Another way to control the format of crosstabulations is to limit the output of the FREQ procedure to a few specific statistics. Remember that when crosstabulations are run, PROC FREQ produces tables with cells that contain:

  • cell frequency

  • cell percentage of total frequency

  • cell percentage of row frequency

  • cell percentage of column frequency.

You can use options to suppress any of these statistics. To control the depth of crosstabulation results, add any combination of the following options to the TABLES statement:

  • NOFREQ suppresses cell frequencies

  • NOPERCENT suppresses cell percentages

  • NOROW suppresses row percentages

  • NOCOL suppresses column percentages.

Example

Suppose you want to use only the percentages of Sex and Weight combinations in the data set Clinic.Diabetes. To block frequency counts and row and column percentages, add the NOFREQ, NOROW, and NOCOL options to your program's TABLES statement.

Notice that Percent is the only statistic that remains in the table's legend box.
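
A minimal sketch of a step that uses these options follows. It assumes that the Clinic library is assigned and that the user-defined format WTFMT. (a name used in this chapter's quiz) has already been created with PROC FORMAT.

```sas
/* Assumes WTFMT. already exists in the session. */
proc freq data=clinic.diabetes;
   tables sex*weight / nofreq norow nocol;   /* only the cell percentage remains */
   format weight wtfmt.;
run;
```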

Chapter Summary

Text Summary

Purpose of PROC MEANS

The MEANS procedure provides an easy way to compute descriptive statistics. Descriptive statistics such as the mean, minimum, and maximum provide useful information about numeric data.

Specifying Statistics

By default, PROC MEANS computes the n-count (the number of nonmissing values), the mean, the standard deviation, and the minimum and maximum values for variables. To specify statistics, list their keywords in the PROC MEANS statement.

Table 9.4. Descriptive Statistics

Keyword         Description
CLM             Two-sided confidence limit for the mean
CSS             Corrected sum of squares
CV              Coefficient of variation
KURTOSIS        Kurtosis
LCLM            One-sided confidence limit below the mean
MAX             Maximum value
MEAN            Average
MIN             Minimum value
N               Number of observations with nonmissing values
NMISS           Number of observations with missing values
RANGE           Range
SKEWNESS        Skewness
STDDEV / STD    Standard deviation
STDERR          Standard error of the mean
SUM             Sum
SUMWGT          Sum of the Weight variable values
UCLM            One-sided confidence limit above the mean
USS             Uncorrected sum of squares
VAR             Variance

Table 9.5. Quantile Statistics

Keyword         Description
MEDIAN / P50    Median or 50th percentile
P1              1st percentile
P5              5th percentile
P10             10th percentile
Q1 / P25        Lower quartile or 25th percentile
Q3 / P75        Upper quartile or 75th percentile
P90             90th percentile
P95             95th percentile
P99             99th percentile
QRANGE          Difference between upper and lower quartiles: Q3-Q1

Table 9.6. Hypothesis Testing

Keyword         Description
PROBT           Probability of a greater absolute value for the t value
T               Student's t for testing the hypothesis that the population mean is 0

Limiting Decimal Places

Because PROC MEANS uses the BESTw. format by default, procedure output can contain unnecessary decimal places. To limit decimal places, use the MAXDEC= option and set it equal to the maximum number of decimal places that you want to display.

Specifying Variables in PROC MEANS

By default, PROC MEANS computes statistics for all numeric variables. To specify the variables to include in PROC MEANS output, list them in a VAR statement.

Group Processing Using the CLASS Statement

Include a CLASS statement, specifying variable names, to group PROC MEANS output by variable values. Statistics are not computed for the CLASS variables.

Group Processing Using the BY Statement

Include a BY statement, specifying variable names, to group PROC MEANS output by variable values. Your data must be sorted according to those variables. Statistics are not computed for the BY variables.

Creating a Summarized Data Set Using PROC MEANS

You can create an output data set that contains summarized variables by using the OUTPUT statement in PROC MEANS. When you use the OUTPUT statement without specifying the statistic-keyword= option, the summary statistics N, MEAN, STD, MIN, and MAX are produced for all of the numeric variables or for all of the variables that are listed in a VAR statement.

Creating a Summarized Data Set Using PROC SUMMARY

You can also create a summarized output data set by using PROC SUMMARY. The PROC SUMMARY code for producing an output data set is exactly the same as the code for producing an output data set with PROC MEANS. The difference between the two procedures is that PROC MEANS produces a report by default, whereas PROC SUMMARY produces a report only when you specify the PRINT option.

The FREQ Procedure

The FREQ procedure is a descriptive procedure as well as a statistical procedure that produces one-way and n-way frequency tables. It concisely describes your data by reporting the distribution of variable values.

Specifying Variables

By default, the FREQ procedure creates frequency tables for every variable in your data set. To specify the variables to analyze, include them in a TABLES statement.

Creating Two-Way Tables

When a TABLES statement contains two variables joined by an asterisk (*), PROC FREQ produces crosstabulations. The resulting table displays values for

  • cell frequency

  • cell percentage of total frequency

  • cell percentage of row frequency

  • cell percentage of column frequency.

Creating N-Way Tables

Crosstabulations can include more than two variables. When three or more variables are joined in a TABLES statement, the result is a series of two-way tables that are grouped by the values of the first variables listed. Beginning in SAS 9, you can use the CROSSLIST option to format your tables in ODS column format.

Creating Tables in List Format

To reduce the bulk of n-way table output, add a slash (/) and the LIST option to the end of the TABLES statement. PROC FREQ then prints compact, multi-column lists instead of a series of tables.

Suppressing Table Information

You can suppress the display of specific statistics by adding one or more options to the TABLES statement:

  • NOFREQ suppresses cell frequencies

  • NOPERCENT suppresses cell percentages

  • NOROW suppresses row percentages

  • NOCOL suppresses column percentages.

Syntax

PROC MEANS <DATA=SAS-data-set>
        <statistic-keyword(s)> <option(s)>;
        <VAR variable(s)>;
        <CLASS variable(s)>;
        <BY variable(s)>;
        <OUTPUT out=SAS-data-set statistic=variable(s)>;
RUN;

PROC SUMMARY <DATA=SAS-data-set>
        <statistic-keyword(s)> <option(s)>;
        <VAR variable(s)>;
        <CLASS variable(s)>;
        <OUTPUT out=SAS-data-set>;
RUN;

PROC FREQ <DATA=SAS-data-set>;
        TABLES variable-1 *variable-2 <* ... variable-n>
        / <NOFREQ|NOPERCENT|NOROW|NOCOL> <LIST>;
RUN;

Sample Programs

proc means data=clinic.heart min max maxdec=1;
   var arterial heart cardiac urinary;
   class survive sex;
run;


proc summary data=clinic.diabetes;
   var age height weight;
   class sex;
   output out=work.sum_gender
     mean=AvgAge AvgHeight AvgWeight;
run;


proc freq data=clinic.heart order=freq;
   tables sex*survive*shock / nopercent list;
run;

Points to Remember

  • In PROC MEANS, use a VAR statement to limit output to relevant variables. Exclude statistics for nominal variables such as ID or ProductCode.

  • By default, PROC MEANS prints the full width of each numeric variable. Use the MAXDEC= option to limit decimal places and to improve legibility.

  • Data must be sorted for BY group processing. You might need to run PROC SORT before using PROC MEANS with a BY statement.

  • PROC MEANS and PROC SUMMARY produce the same results; however, the default output is different. PROC MEANS produces a report by default, whereas PROC SUMMARY produces a report only if you specify the PRINT option.

  • If you do not include a TABLES statement, PROC FREQ produces statistics for every variable in the data set.

  • Variables that have continuous numeric values can create a large amount of output. Use a TABLES statement to exclude such variables, or group their values by applying a FORMAT statement.

Chapter Quiz

Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.

  1. The default statistics produced by the MEANS procedure are n-count, mean, minimum, maximum, and...

    1. median

    2. range

    3. standard deviation

    4. standard error of the mean.

  2. Which statement will limit a PROC MEANS analysis to the variables Boarded, Transfer, and Deplane?

    1. by boarded transfer deplane;
    2. class boarded transfer deplane;
    3. output boarded transfer deplane;
    4. var boarded transfer deplane;
  3. The data set Survey.Health includes the following variables. Which is a poor candidate for PROC MEANS analysis?

    1. IDnum
    2. Age
    3. Height
    4. Weight
  4. Which of the following statements is true regarding BY group processing?

    1. BY variables must be either indexed or sorted.

    2. Summary statistics are computed for BY variables.

    3. BY group processing is preferred when you are categorizing data that contains few variables.

    4. BY group processing overwrites your data set with the newly grouped observations.

  5. Which group processing statement produced the PROC MEANS output shown below?

    1. class sex survive;
    2. class survive sex;
    3. by sex survive;
    4. by survive sex;
  6. Which program can be used to create the following output?

    1. proc means data=clinic.diabetes;
         var age height weight;
         class sex;
         output out=work.sum_gender
            mean=AvgAge AvgHeight AvgWeight;
      run;
    2. proc summary data=clinic.diabetes print;
         var age height weight; class sex;
         output out=work.sum_gender
            mean=AvgAge AvgHeight AvgWeight;
      run;
    3. proc means data=clinic.diabetes noprint;
         var age height weight;
         class sex;
         output out=work.sum_gender
            mean=AvgAge AvgHeight AvgWeight;
      run;
    4. Both 1 and 2.

  7. By default, PROC FREQ creates a table of frequencies and percentages for which data set variables?

    1. character variables

    2. numeric variables

    3. both character and numeric variables

    4. none: variables must always be specified

  8. Frequency distributions work best with variables that contain

    1. continuous values.

    2. numeric values.

    3. categorical values.

    4. unique values.

  9. Which PROC FREQ step produced this two-way table?

    1. proc freq data=clinic.diabetes;
         tables height weight;
         format height htfmt. weight wtfmt.;
      run;
    2. proc freq data=clinic.diabetes;
         tables weight height;
         format weight wtfmt. height htfmt.;
      run;
    3. proc freq data=clinic.diabetes;
         tables height*weight;
         format height htfmt. weight wtfmt.;
      run;
    4. proc freq data=clinic.diabetes;
         tables weight*height;
         format weight wtfmt. height htfmt.;
      run;
  10. Which PROC FREQ step produced this table?

    1. proc freq data=clinic.diabetes;
         tables sex weight / list;
         format weight wtfmt.;
      run;
    2. proc freq data=clinic.diabetes;
         tables sex*weight / nocol;
         format weight wtfmt.;
      run;
    3. proc freq data=clinic.diabetes;
         tables sex weight / norow nocol;
         format weight wtfmt.;
      run;
    4. proc freq data=clinic.diabetes;
         tables sex*weight / nofreq norow nocol;
         format weight wtfmt.;
      run;