Image483_fmt.png

Chapter 7 MEANS and SUMMARY Procedures

7.1 Using Multiple CLASS Statements and CLASS Statement Options

7.1.1 MISSING and DESCENDING Options

7.1.2 GROUPINTERNAL Option

7.1.3 Order= Option

7.2 Letting SAS Name the Output Variables

7.3 Statistic Specification on the OUTPUT Statement

7.4 Identifying the Extremes

7.4.1 Using the MAXID and MINID Options

7.4.2 Using the IDGROUP Option

7.4.3 Using Percentiles to Create Subsets

7.5 Understanding the _TYPE_ Variable

7.6 Using the CHARTYPE Option

7.7 Controlling Summary Subsets Using the WAYS Statement

7.8 Controlling Summary Subsets Using the TYPES Statement

7.9 Controlling Subsets Using the CLASSDATA= and EXCLUSIVE Options

7.10 Using the COMPLETETYPES Option

7.11 Identifying Summary Subsets Using the LEVELS and WAYS Options

7.12 CLASS Statement vs. BY Statement

While the MEANS and SUMMARY procedures have been a part of Base SAS for a long time (MEANS is an original procedure), and while these procedures are used extensively, many users of these procedures actually take advantage of only a fraction of their capabilities. Primarily this is true because a great deal can be accomplished with fairly simple procedure steps.

With recent enhancements (especially in SAS 8 and SAS®9), a number of additional capabilities have been added to the MEANS and SUMMARY procedures, and even the ‘seasoned’ programmer may not have been exposed to them. This chapter covers some of the more useful of these capabilities. Because these two procedures have the same capabilities and have very few differences, most of the examples and text in this chapter will highlight only one of them. In each case either of the two procedures could be used.

Prior to SAS 6, MEANS and SUMMARY were distinct procedures with overlapping capabilities. Currently the same software is used behind the scenes regardless of which procedure the user calls; therefore, their capabilities are now the same. The only real differences between these procedures are seen in their defaults, and then primarily in the way each procedure creates printed tables. By default MEANS always creates a table to be printed. If you do not want a printed table you must explicitly turn it off (NOPRINT option). On the other hand, the SUMMARY procedure never creates a printed table unless it is specifically requested (PRINT option).

SEE ALSO

Carpenter (2008) discusses these two procedures in more detail, including an introduction as well as additional options not covered in this book.

7.1 Using Multiple CLASS Statements and CLASS Statement Options

Although the following discussion concerning the use of multiple CLASS statements and CLASS statement options is within the context of the MEANS and SUMMARY procedures, it can be generalized to most procedures that use the CLASS statement.

The CLASS statement can be specified as a single statement or it can be broken up into a series of CLASS statements. The order of the CLASS statements determines the overall order of the classification variables.

class race sex edu;
class race sex;
class edu;

The CLASS statement now accepts options and for most procedures that accept the CLASS statement, a single class statement can be replaced by a series of CLASS statements. This allows us to control the application of CLASS statement options to specific classification variables. One or more options are specified on a CLASS statement by preceding the option with a slash. While it is not necessary to have multiple CLASS statements just to apply CLASS statement options, multiple CLASS statements allow you to apply these options differentially. For instance when you use the MISSING option on the PROC statement, it is applied to all of the classification variables. By using multiple CLASS statements along with the MISSING option on the CLASS statement, you can choose which classification variables are to utilize the MISSING option.

class race sex;
class edu / missing;

CLASS statement options include:

ASCENDING/DESCENDING (Section 7.1.1)

Analogous to the DESCENDING option in PROC SORT and other procedures, these options allow you to reverse the order of the displayed values.

GROUPINTERNAL and EXCLUSIVE (Section 7.1.2)

You can use these two options to control how formats associated with CLASS variables are to be used when forming groups (see Section 12.1 for the related topic on preloaded formats).

MISSING (Section 7.1.1)

Observations with missing levels of the classification variables are normally excluded from the analysis. This option allows missing values to represent valid levels of the classification variable.

MLF (Section 12.3)

Multilevel formats allow overlapping formatted levels.

ORDER= (Section 7.1.3)

This option allows you to control the order of the classification variables levels. The ORDER= option is discussed in more detail in Section 2.6.1.

PRELOADFMT and EXCLUSIVE (Section 12.1.3)

When formats are preloaded they can be used to establish data filters when forming groups.

The following example performs a simple SUMMARY step and generates the data set STATS. In this step the two classification variables (RACE and EDU) are used to summarize the data for the two analysis variables (HT and WT).

title1 '7.1 Single Class Statement';
proc summary data=advrpt.demog;
   class race edu;
   var ht wt;
   output out=stats
          mean= htmean wtmean
          stderr=htse wtse
          ;
   run;
proc print data=stats;
   run;

Examination of the partial listing below shows that only 75 (of the potential 77) observations were used in the calculation of the summary statistics. At this point it is not clear why two of the observations were excluded from the analysis.

In the examples that follow, this analysis will be repeated using various CLASS statement options.

MORE INFORMATION

The CLASS statement is not the only statement that can be split. Both the VAR statement and the CLASS statement are commonly split to allow the assignment of options in PROC TABULATE, and there is a PROC PRINT example in Section 8.5.2 with multiple VAR statements.

7.1.1 MISSING and DESCENDING Options

It is very important to understand that all SAS procedures eliminate an entire observation from the analysis if any one of the classification variables has a missing value. This is true for both explicitly declared classification variables (through the use of the CLASS statement), or implicitly declared classification variables such as those on the TABLES statement in PROC FREQ (which does not use a CLASS statement). Since the entire observation is eliminated, this can affect data summaries that do not even include the offending classification variable. This is a problem that can result in incorrect analyses.

The data table ADVRPT.DEMOG has 77 rows; however, because of missing values in one or both of the classification variables, only 75 observations have been used in the previous summary (Section 7.1). From the LISTING above, or even by inspection of the LOG, it is unclear which classification variable has the missing values.

In the next example, the DESCENDING option is applied to RACE and the MISSING option is applied to the classification variable EDU.

proc summary data=advrpt.demog;
   class race/descending; Callout 1
   class edu/missing; Callout 2
   var ht wt;
   output out=stats
          mean= htmean wtmean
          stderr=htse wtse
          ;
   run;

Callout 1 The groups formed by RACE are now shown in decreasing (DESCENDING) order.

Callout 2 A missing value for the variable EDU will now be considered to be a valid level and will be included in the report. Any observation with a missing value for RACE will still be excluded.

7.1.1 Multiple Class Statements           
MISSING and DESCENDING Options           
Obs    race    edu    _TYPE_    _FREQ_     htmean     wtmean      htse       wtse           
  1              .       0        76      67.6053    162.237    0.40135     4.0115           
  2              .  Callout 2                   1         1         68.0000    240.000     .           .
  3             10       1        11      71.3636    194.091    0.96552     5.7532           
  4             12       1        19      67.0526    168.105    0.65102     6.3628           
  5             13       1         4      70.0000    197.000    1.15470    10.3923           
  6             14       1        10      64.2000    108.400    0.13333     1.4236           
  7             15       1         7      65.2857    155.571    0.86504    11.1160           
  8             16       1        10      70.4000    165.200    0.54160     6.1946           
  9             17       1        10      65.2000    145.200    0.74237     7.9342           
 10     Callout 1        18        1         4      69.0000    174.000    2.30940    15.5885           
 11     5        .       2         4      66.5000    147.000    0.86603     0.0000           
 12     4        .       2         4      64.5000    113.500    0.28868     0.8660           
 13     3        .       2         8      65.0000    112.000    0.65465     4.5826           
 14     2        .       2        18      67.6667    166.333    0.72310     8.8325           
 15     1        .       2        42      68.4762    176.143    0.58756     4.0053           
. . . . portions of the table are not shown . . . .           

The overall number of observations is now 76, and we can see that there is one observation with a missing value of EDU (OBS=2 in the listing). Since there are 77 observations in the data set, there must be an observation with a missing value for RACE as well.

7.1.2 GROUPINTERNAL Option

When a classification variable is associated with a format, that format is used in the formation of the groups. In the next example, the EDULEVEL. format maps the years of education into levels of education.

title1 '7.1.2 CLASS Statement Options';
proc format;
   value edulevel Callout 1
      0-12 = 'High School'
      13-16= 'College'
      17-high='Post Graduate';
   run;
title2 'GROUPINTERNAL not used';
proc summary data=advrpt.demog;
   class edu; Callout 3
   var ht wt;
   output out=stats
          mean= MeanHT MeanWT
          ;   
   format edu edulevel.; Callout 2
   run;
proc print data=stats;
   run;

Callout 1 The EDULEVEL. format maps years of education into three ranges.

Callout 2 In the SUMMARY step the FORMAT statement has been used to create the association between EDU and the EDULEVEL. format.

Callout 3 The MISSING option has not been applied; consequently missing values of EDU will not be included in the summary.

A PROC PRINT LISTING of the resulting data table shows that the SUMMARY procedure has used the format to collapse the individual levels of EDU into the three levels of the formatted classification variable.

To use the original data values (internal values) to form the groups, rather than the formatted values, the GROUPINTERNAL option is added to the CLASS statement.

class edu/groupinternal;
7.1.2 CLASS Statement Options
GROUPINTERNAL not used
Obs         edu         _TYPE_ _FREQ_   MeanHT   MeanWT
 1                 .       0     76    67.5526  160.461
 2     High School         1     30    68.6333  177.633
 3     College             1     32    67.0938  147.438
 4     Post Graduate       1     14    66.2857  153.429

Notice that although the original values of EDU are used to form the groups, the formatted values are still displayed. In this example we could have achieved similar results by using the ORDER=INTERNAL option shown in Section 7.1.3.

7.1.2 CLASS Statement Options
Using GROUPINTERNAL
Obs         edu        _TYPE_  _FREQ_   MeanHT   MeanWT
 1                 .      0      76    67.5526  160.461
 2     High School        1      11    71.3636  194.091
 3     High School        1      19    67.0526  168.105
 4     College            1       4    70.0000  197.000
 5     College            1      11    64.1818  108.091
 6     College            1       7    65.2857  155.571
 7     College            1      10    70.4000  165.200
 8     Post Graduate      1      10    65.2000  145.200
 9     Post Graduate      1       4    69.0000  174.000

7.1.3 Order= Option

When procedures create ordered output, often based on the classification variables, there are several different criteria that can be used to determine the order. The ORDER= option is used to establish the scheme, which establishes the ordering criteria. The ORDER= option can generally appear on the PROC statement where it applies to all the classification variables (implicit or explicit), or as an option on the CLASS statement where it can be applied to selected classification variables.

These schemes include:

  • DATA

order is based on the order of the incoming data

  • FORMATTED

values are formatted first and then ordered

  • FREQ

the order is based on the frequency of the class level

  • INTERNAL

same as UNFORMATTED or GROUPINTERNAL

The default ordering is always INTERNAL (whether or not the variable is formatted) except for PROC REPORT. In PROC REPORT, formatted variables have a default order of FORMATTED.

Using the ORDER=FREQ option on the CLASS statement causes the table to be ordered according to the most common levels of education.

class edu/order=freq;

In this table EDU has been left unformatted. Notice that the order of the rows for EDU is based on the value of _FREQ_.

7.1.3 CLASS Statement Options
Using ORDER=FREQ
Obs    edu    _TYPE_    _FREQ_     MeanHT     MeanWT
 1       .       0        76      67.5526    160.461
 2      12       1        19      67.0526    168.105
 3      14       1        11      64.1818    108.091
 4      10       1        11      71.3636    194.091
 5      17       1        10      65.2000    145.200
 6      16       1        10      70.4000    165.200
 7      15       1         7      65.2857    155.571
 8      18       1         4      69.0000    174.000
 9      13       1         4      70.0000    197.000

7.2 Letting SAS Name the Output Variables

In each of the examples in Section 7.1, the statistics that are to be calculated and written to the output data set are explicitly specified. However, you do not necessarily have to specify the statistics or provide names for the variables that are to hold the calculated values.

OUTPUT Statement without Specified Statistics

When no statistics are specified on the OUTPUT statement, the resulting data set will contain a specific set of statistics and will be in a different form. Rather than one column per statistic, the statistics will be in a transposed form—one row per statistic. The type of statistic is named in the _STAT_ column.

This SUMMARY step uses the OUTPUT statement with only the OUT= option. No statistics have been requested; consequently, a standard suite of statistics (the same list for the default printed statistics) are calculated and included in the data set. One column (named after the analysis variable) holds the value for each of the statistics noted under the variable _STAT_.

title1 '7.2 No Statistics Specified';
proc summary data=advrpt.demog;
   class race;
   var ht;
   output out=stats;
   run;

While this form of data set can have its uses, you need to be careful when using it, as the variable HT contains different types of information for each row.

7.2 No Statistics Specified
Obs    race    _TYPE_    _FREQ_    _STAT_       ht
  1               0        76       N        76.0000
  2               0        76       MIN      62.0000
  3               0        76       MAX      74.0000
  4               0        76       MEAN     67.6053
  5               0        76       STD       3.4989
  6     1         1        42       N        42.0000
  7     1         1        42       MIN      62.0000
  8     1         1        42       MAX      74.0000
  9     1         1        42       MEAN     68.4762
 10     1         1        42       STD       3.8078
. . . . portions of the table are not shown . . . .

AUTONAME and AUTOLABEL Options

The OUTPUT statement has always used options to name the summary data set (OUT=), and usually the summary statistics of interest (e.g., MEAN=, N=, MAX=). A second type of option can be placed on the OUTPUT statement. These options follow a slash (/) on the OUTPUT statement and include:

  • AUTONAME

allows MEANS and SUMMARY to determine names for the generated variables

  • AUTOLABEL

allows MEANS and SUMMARY to supply a label for each generated variable

  • LEVELS

adds the _LEVELS_ column to the summary data set (see Section 7.11)

  • WAYS

adds the _WAYS_ column to the summary data set (see Section 7.11)

When a statistic is requested on the OUTPUT statement, the variable that it generates has the default name of the corresponding analysis variable in the VAR statement. Since only one statistic can be generated using the default name, there will be a naming conflict if default naming is used when two or more statistics are requested. For this reason the following PROC SUMMARY will fail because of naming conflicts in the new data set STATS. Actually it only partially fails, which is probably worse. An error is produced in the LOG, but a partial table is still produced.

The AUTONAME option allows you to select multiple statistics without picking a name for the resulting variables in the OUTPUT table. The generated names are unique and therefore naming conflicts are eliminated. Similarly the AUTOLABEL option creates a label, which is based on the analysis variable’s existing label, for variables added to the OUT= data set.

proc summary 
data=advrpt.demog;
   class race;
   var ht;
   output out=stats
          n=
          mean= 
          stderr=
          ;
   run;
output out=stats
       n= 
       mean= 
       stderr=/autoname
      ;

Conveniently the names of the generated variables are both reasonable and predictable, and are in the form of variable_statistic.

7.2 Using AUTONAME
                                                       ht_Std
Obs    race    _TYPE_    _FREQ_    ht_N    ht_Mean      Err
 1                0        76       76     67.6053    0.40135
 2      1         1        42       42     68.4762    0.58756
 3      2         1        18       18     67.6667    0.72310
 4      3         1         8        8     65.0000    0.65465
 5      4         1         4        4     64.5000    0.28868
 6      5         1         4        4     66.5000    0.86603

7.3 Statistic Specification on the OUTPUT Statement

When creating variables based on statistics specified on the OUTPUT statement, there are several ways to name the variables and to associate the resultant variable with the original analysis variable.

The most traditional way of specifying the statistics and naming the generated variables is shown to the left.

var ht wt;
output out=stats
       n   = n_ht n_wt
       mean= mean_HT Mean_WT
       ;

The option specifying the statistic (N= and MEAN= are shown here) is followed by a variable list. This form requires the programmer to make sure that the list of analysis variables and the list of new variables that hold the values of the selected statistics are in the same position and order. The disadvantages of this form include:

  • The order of the statistics is tied to the order of the analysis variables.
  • A statistic must be generated for each of the analysis variables; that is, in order to calculate a mean for WT you must also calculate a mean for HT (since HT is first on the VAR statement).

This is not the only, or even necessarily the most practical, way of specifying the statistics and their associated variables. A list of analysis variables can be included in parentheses as a part of the statistic option. This allows you to specify a subset of the analysis variables Callout 1, as was done here with the N statistic. You can also use this technique to control the order of the usage of the analysis variables Callout 2.

var ht wt;
output out=stats
       n(wt) = n_wt Callout 1
       mean(wt ht) = mean_WT Mean_HT Callout 2
       ;

It is also possible to split up the specification of the statistics of interest. A given statistic can be specified multiple times, each with a different analysis variable. This form of option specification gives you quite a bit of flexibility, not only over which statistics will be calculated for which analysis variables, but also over the order of the generated variables on the resultant data set.

var ht wt;
output out=stats
       n(wt)    = n_wt
       mean(wt) = mean_WT
       n(ht)    = n_ht
       mean(ht) = Mean_HT
       ;

A PROC PRINT of the data set WORK.STATS generated by the OUTPUT statement to the left shows the variable order. You might also notice that SAS remembers the case of the name of the variable as it is first defined: Mean_HT as opposed to mean_WT.

7.3 Splitting the Stat(varlist)=           
Obs    race  _TYPE_   _FREQ_  n_wt    mean_WT    n_ht    Mean_HT           
 1              0       76     76     162.237     76     67.6053           
 2      1       1       42     42     176.143     42     68.4762           
 3      2       1       18     18     166.333     18     67.6667           
 4      3       1        8      8     112.000      8     65.0000           
 5      4       1        4      4     113.500      4     64.5000           
 6      5       1        4      4     147.000      4     66.5000           

It is also possible to specify more than one OUTPUT statement within a given PROC step. Each OUTPUT statement could have a different combination of statistics.

7.4 Identifying the Extremes

When working with data, it is not at all unusual to want to be able to identify the observations that contain the highest or lowest values of the analysis variables. These extreme values are automatically displayed in PROC UNIVARIATE output, but must be requested in the MEANS and SUMMARY procedures.

While the MIN and MAX statistics show the extreme value, they do not identify the observation that contains the extreme. Fortunately there are a couple of ways to identify the observation that contains the MAX or MIN.

7.4.1 Using the MAXID and MINID Options

The MAXID and MINID options in the OUTPUT statement can be used to identify the observations with the maximum and minimum values (the examples in this section are for MAXID; however, MINID has the same syntax). The general form of the option is:

MAXID(analysis_var_list(ID_var_list))=new_var_list

The MAXID option is used in the following example to identify which subjects had the maximums for each value of any classification variables. This option allows us to add a new variable to the OUTPUT data set, which takes on the value of the ID variable for the maximum observation.

title1 '7.4.1a Using MAXID';
title2 'One Analysis Variable';
proc summary data=advrpt.demog;
   class race;
   var ht;
   output out=stats
          mean= meanHT
          max=maxHt Callout 1
          maxid(ht(subject))=maxHtSubject Callout 2
          ;
   run;
proc print data=stats;
   run; 

The maximum for the analysis variable HT is requested Callout 1 for each RACE. We would also like to know which SUBJECT had the maximum HT Callout 2 (the subject number is to be stored in the variable MAXHTSUBJECT).

Using the same generalized option syntax as was discussed in the previous section, there are several variations of the syntax for the MAXID option shown in this example. In this case there is a single analysis variable and a single ID variable.

7.4.1a Using MAXID
One Analysis Variable
                                       max     maxHt
Obs   race  _TYPE_  _FREQ_   meanHT     Ht    Subject
                                        Callout 1       Callout 2
 1             0      76    67.6053     74      209
 2     1       1      42    68.4762     74      209
 3     2       1      18    67.6667     72      201
 4     3       1       8    65.0000     68      215
 5     4       1       4    64.5000     65      244
 6     5       1       4    66.5000     68      212

When there is more than one analysis variable, the MAXID statement can be expanded following the same syntax rules as were discussed in Section 7.3.

In this example the subject number of the tallest and of the heaviest subjects in the study are to be displayed.

var ht wt;
output out=stats
       mean= meanHT MeanWT
       max=maxHt maxWT
       maxid(ht(subject) wt(subject))=maxHtSubject MaxWtSubject
       ;
7.4.1b Using MAXID
Two Analysis Variables
                                             max  max    maxHt    MaxWt
Obs  race  _TYPE_  _FREQ_   meanHT   MeanWT   Ht   WT   Subject  Subject

									
 1            0      76    67.6053  162.237   74  240     209      203
 2    1       1      42    68.4762  176.143   74  215     209      208
 3    2       1      18    67.6667  166.333   72  240     201      203
 4    3       1       8    65.0000  112.000   68  133     215      215
 5    4       1       4    64.5000  113.500   65  115     244      230
 6    5       1       4    66.5000  147.000   68  147     212      211

Because of the flexibility for structuring options in the OUTPUT statement, the previous MAXID option could also have been written as:

var ht wt;
output out=stats
       mean= meanHT MeanWT
       max=maxHt maxWT
       maxid(ht(subject))=maxHtSubject
       maxid(wt(subject))=maxWtSubject
       ;

When more than one variable is needed to identify the observation with the extreme value, the MAXID supports a list. As before when specifying lists, there is a one-to-one correspondence between the two lists (the list of ID variables and the list of generated variables). In this OUTPUT statement both the SUBJECT and SSN are used in the list of identification variables. Consequently a new variable is created for each in the summary data set.

var ht wt;
output out=stats
       mean= meanHT MeanWT
       max=maxHt maxWT
       maxid(ht(subject ssn))= MaxHtSubject MaxHtSSN
       maxid(wt(subject ssn))= MaxWtSubject MaxWtSSN
       ;

7.4.2 Using the IDGROUP Option

The MAXID and MINID options allow you to capture only a single extreme. It is also possible to display a group of the extreme values using the IDGROUP option.

Like the MAXID and MINID options, this option allows you to capture the maximum or minimum value and associated ID variable(s). More importantly, however, you may select more than just the single extreme value.

proc summary data=advrpt.demog;
   class race;
   var wt;
   output out=stats
          mean= MeanWT
          max(wt)=maxWT Callout 1   
          idgroup (max(wt)out[2](subject sex)=maxsubj)
          ;           Callout 2       Callout 3         Callout 4      Callout 5
   run;

In this example the maximum WT has been requested using the MAX statistic Callout 1. In addition the IDGROUP option has been requested to identify the two Callout 3 individuals (identified by using SUBJECT and SEX Callout 4) with the largest values of WT Callout 2.

Callout 5 The prefix for the variable name that will hold SUBJECT number with the maximum weight is MAXSUBJ. Since there is no corresponding prefix for SEX, the original variable will be used as the prefix. Because we have only one analysis variable, there will not be a naming conflict; however, not specifying the new variable’s name is generally to be avoided. Even with the /AUTONAME option in force there can be naming conflicts with only moderately complex IDGROUP options that do not name the new variables.

Since we have requested the top two Callout 3 values, the values are written to MAXSUBJ_1, MAXSUBJ_2, SEX_1, and SEX_2. Notice that a number indicating the relative position is appended to the variable name. In this example we can see that the second heaviest subject in the study had a subject number of 236 and a SEX of M.

7.4.2a Using IDGROUP                      
                                      max                      
Obs   race  _TYPE_  _FREQ_   MeanWT    WT  maxsubj_1 maxsubj_2  sex_1 sex_2                      
 1             0      76    162.237   240     203       236       M     M                      
 2     1       1      42    176.143   215     208       216       M     M                      
 3     2       1      18    166.333   240     203       236       M     M                      
 4     3       1       8    112.000   133     215       256       M     M                      
 5     4       1       4    113.500   115     230       240       F     F                      
 6     5       1       4    147.000   147     211       212       M     M                      

The request for the MAX in IDGROUP is actually independent of the MAX= request issued at.

In the previous example we are able to see who the second heaviest subject was, but because we used the MAX option, which shows only one value—the heaviest, we cannot see the weight of the second heaviest individual. This problem disappears with a slight modification of the IDGROUP option.

In the following example we want to identify the two oldest individuals within each group (minimum value of DOB). Since we want to see the date of birth for each of the oldest two individuals, DOB Callout 6 has been included in the list of ID variables. Notice that the MIN statistic, which would show only one DOB, is not being used at all.

proc summary data=advrpt.demog;
   class race;
   var dob;
   output out = stats
          idgroup (min(dob)out[2](dob subject sex)=
                     MinDOB OldestSubj OldestGender)
          ;
   run;
7.4.2b Using IDGROUP with the Analysis Variable
                           MinDOB_  MinDOB_   Oldest  Oldest   Oldest    Oldest                           
Obs  race  _TYPE_  _FREQ_     1        2      Subj_1  Subj_2  Gender_1  Gender_2                           
 1            0      76    03NOV21  05NOV24     252     269      M         M                           
 2    1       1      42    03NOV21  05NOV24     252     269      M         M                           
 3    2       1      18    15JAN34  15AUG34     203     236      M         M                           
 4    3       1       8    02JUL46  11JUN47     234     268      F         M                           
 5    4       1       4    13FEB48  28FEB49     230     240      F         F                           
 6    5       1       4    18FEB51  18JUN51     212     214      M         M                           

SEE ALSO

The IDGROUP option is used to transpose data in King and Zdeb (2010). It is also used in a subsetting question in the SAS Forum thread
http://communities.sas.com/message/102002102002.

7.4.3 Using Percentiles to Create Subsets

The percentile statistics can be used to create search bounds for potential outlier boundaries. Several percentile statistics are available including the 1% and 5% bounds. In this example we would like to know if any observations fall outside of the 1% percentiles.

title1 '7.4.3 Using Percentiles';
proc summary data=advrpt.lab_chemistry;
   var potassium;
   output out=stats 
          p1=  Callout 1
          p99= /autoname;
   run;
data chkoutlier;
   set stats(keep=potassium_p1 potassium_p99); Callout 2
   do until (done);
      set advrpt.lab_chemistry Callout 2
                (keep=subject visit potassium) 
          end=done; Callout 3
      if   potassium_p1 ge potassium 
        or potassium ge potassium_p99 Callout 4
                             then output chkoutlier;
   end; Callout 5
   run;
options nobyline; Callout 6
proc print data=chkoutlier;
 by potassium_p1 potassium_p99; Callout 7
 title2 'Potassium 1% Bounds are #byval1, #byval2'; Callout 8
 run;

Callout 1 The 1st and 99th percentiles are calculated and saved in the data set STATS.

Callout 2 The single observation of WORK.STATS is added to the Program Data Vector (PDV).

Callout 3 The analysis data are read one row at a time in a DO UNTIL loop. The END= option on the SET statement creates the numeric 0/1 variable DONE, which is used to end the loop.

Callout 4 Check to see if the current POTASSIUM reading is above or below the 1st and 99th percentiles.

Callout 5 A STOP statement has not been used. Although the SET statement is inside the DO UNTIL loop, the STOP is not necessary because all observations have been read from the STATS data set.

Callout 6 The NOBYLINE system option removes the BY variable values from the table created by PRINT.

Callout 7 A BY statement is used so that the values can be loaded into the #BYVAL options on the TITLE statement.

Callout 8 Since the bounds are constants, the #BYVAL option is used to place them in the title. Generally TITLE statements are placed outside of the PROC step; however, for better clarity, when I use the #BYVAR and #BYVAL options I like to move the TITLE statement so that it follows the BY statement.

MORE INFORMATION

The TITLE statement option #BYVAL is introduced in Section 15.1.2.

7.5 Understanding the _TYPE_ Variable

One of the variables automatically included in the summary data set is _TYPE_. By default this is a numeric variable, which can be used to help us track the level of summarization, and to distinguish the groups of statistics. It is not, however, intuitively obvious how to predict its value.

In this SUMMARY step there are three variables in the CLASS statement (RACE, EDU, and SYMP).

proc summary 
      data=advrpt.demog
         (where=(race in('1','4')
                & 12 le edu le 15
                & symp in('01','02','03')))
      ;
   class race edu symp;
   var ht;
   output out=stats
          mean= meanHT
          ;
   run;

Examination of a listing of the data set STATS shows that _TYPE_ varies from 0 to 7 (8 distinct values). With the _TYPE_=0 associated with the single row that summarizes across the entire data set (all three classification variables are ignored), and with _TYPE_=7 summarizing the interaction of all three classification variables (all three classification variables are used). The remaining values of _TYPE_ represent other combinations of classification variables and vary according to which are used and which are ignored.

7.5 Understanding _TYPE_
                                                   mean
Obs    race    edu    symp    _TYPE_    _FREQ_      HT
  1              .               0         8      66.25
  2              .     01        1         2      64.00
  3              .     02        1         4      66.50
  4              .     03        1         2      68.00
  5             12               2         4      67.50
  6             14               2         2      64.00
  7             15               2         2      66.00
  8             12     02        3         2      67.00
  9             12     03        3         2      68.00
 10             14     01        3         2      64.00
 11             15     02        3         2      66.00
 12     1        .               4         6      67.00
 13     4        .               4         2      64.00
 14     1        .     02        5         4      66.50
 15     1        .     03        5         2      68.00
 16     4        .     01        5         2      64.00
 17     1       12               6         4      67.50
 18     1       15               6         2      66.00
 19     4       14               6         2      64.00
 20     1       12     02        7         2      67.00
 21     1       12     03        7         2      68.00
 22     1       15     02        7         2      66.00
 23     4       14     01        7         2      64.00

The following table summarizes the eight possible combinations of these three classification variables for the LISTING shown above. Under the classification variables, a 0 indicates that levels of the classification variable are being ignored when calculating the summary statistics, while a 1 indicates that the classification variables is being used. When considered together, these three zeros and ones (representing each classification variable) form a 3-digit binary number (one digit for each of the three classification variables. When this binary value is converted to decimal, the result yields _TYPE_.

CLASS VARIABLES

Observation Number

RACE

EDU

SYMP

Binary Value

_TYPE_

1

0

0

0

000

0

2 - 4

0

0

1

001

1

5 - 7

0

1

0

010

2

8 - 11

0

1

1

011

3

12 - 13

1

0

0

100

4

14 - 16

1

0

1

101

5

17 - 19

1

1

0

110

6

20 - 23

1

1

1

111

7

The conversion of a binary number to decimal involves the use of powers of 2. A binary value of 110 = 1*22 + 1*21 + 0*20 = 1*4 + 1*2 + 0*1 = 6 = _TYPE_.

The NWAY option limits the output data set to the highest order interaction and consequently only the highest value of _TYPE_ would be displayed.

MORE INFORMATION

Interestingly enough, some SAS programmers find converting a binary number to a decimal number to be inconvenient. The CHARTYPE option (see Section 7.6) makes that conversion unnecessary.

7.6 Using the CHARTYPE Option

The CHARTYPE option displays _TYPE_ as a character binary value rather than the decimal value. The following example repeats the example shown in Section 7.5, while adding the CHARTYPE option on the PROC statement.

Instead of being numeric, _TYPE_ is now created as a character variable with a length corresponding to the number of classification variables.

proc summary 
      data=advrpt.demog
         (where=(race in('1','4')
                & 12 le edu le 15
                & symp in('01','02','03')))
      chartype;
   class race edu symp;
   var ht;
   output out=stats
          mean= meanHT
          ;
   run;
7.6 Using the CHARTYPE Option
                                                   mean
Obs    race    edu    symp    _TYPE_    _FREQ_      HT
  1              .             000         8      66.25
  2              .     01      001         2      64.00
  3              .     02      001         4      66.50
  4              .     03      001         2      68.00
  5             12             010         4      67.50
  6             14             010         2      64.00
  7             15             010         2      66.00
  8             12     02      011         2      67.00
  9             12     03      011         2      68.00
 10             14     01      011         2      64.00
 11             15     02      011         2      66.00
 12     1        .             100         6      67.00
 13     4        .             100         2      64.00
 14     1        .     02      101         4      66.50
 15     1        .     03      101         2      68.00
 16     4        .     01      101         2      64.00
 17     1       12             110         4      67.50
 18     1       15             110         2      66.00
 19     4       14             110         2      64.00
 20     1       12     02      111         2      67.00
 21     1       12     03      111         2      68.00
 22     1       15     02      111         2      66.00
 23     4       14     01      111         2      64.00

7.7 Controlling Summary Subsets Using the WAYS Statement

When you do not need to calculate all possible combinations of the classification variables, you can save not only the resources used in calculating the unneeded values, but the effort of eliminating them later as well. There are several ways that you can specify which combinations are of interest. The WAYS statement can be used to specify the number of classification variables to utilize.

Combinations of the WAYS statement for three classification variables include the following summarizations:

  • ways 0;

across all classification variables

  • ways 1;

each classification variable individually (no cross products)

  • ways 2;

each two-way combination of the classification variables (two-way interactions)

  • ways 3;

three-way interaction. For three classification variables, this is the same as using the NWAY option

  • ways 0,3;

lists of numbers are acceptable

When the number of classification variables becomes large the WAYS statement can utilize an incremental list much like an iterative DO.

ways 0 to 9 by 3;

In the following example main effect summaries and the three-way interaction are eliminated; as a matter of fact, they are not even calculated.

proc summary data=advrpt.demog
         (where=(race in('1','4')
                & 12 le edu le 15
                & symp in('01','02','03')));
   class race edu symp;
   var ht;
   ways 0,2;
   output out=stats
          mean= meanHT
          ;
   run;

The WAYS statement has been used to request calculation of only the overall summary and the two-way interactions.

Notice in the listing shown below that _TYPE_ does not take on the values of 1 or 2. These would be the main effect summaries for SYMP and EDU, respectively. A full examination of the table shows that _TYPE_ appropriately only takes on the values of 0, 3, 5, and 6.

7.7 Using the WAYS Statement
                                                   mean
Obs    race    edu    symp    _TYPE_    _FREQ_      HT
  1              .               0         8      66.25
  2             12     02        3         2      67.00
  3             12     03        3         2      68.00
  4             14     01        3         2      64.00
  5             15     02        3         2      66.00
  6     1        .     02        5         4      66.50
  7     1        .     03        5         2      68.00
  8     4        .     01        5         2      64.00
  9     1       12               6         4      67.50
 10     1       15               6         2      66.00
 11     4       14               6         2      64.00

7.8 Controlling Summary Subsets Using the TYPES
Statement

Like the WAYS statement, the TYPES statement can be used to select and limit the data roll-up summaries. As an added bonus, the TYPES statement eliminates much of the need to understand and to be able to use the _TYPE_ automatic variable. While the WAYS statement (see Section 7.7) lists which levels of summarization are desired, TYPES designates specific summarization levels (effects and interactions).

The TYPES statement used here explicitly requests that statistics be calculated only for the main effect for EDU, and the interaction between RACE and SYMP. None of the other effects or summarizations will even be calculated.

proc summary data=advrpt.demog
         (where=(race in('1','4')
                & 12 le edu le 15
                & symp in('01','02','03')));
class race edu symp;
var ht;
types edu race*symp;
output out=stats
       mean= meanHT
       ;
run;
7.8 Using the TYPES Statement
                                                  mean
Obs    race    edu    symp    _TYPE_    _FREQ_     HT
 1              12               2         4      67.5
 2              14               2         2      64.0
 3              15               2         2      66.0
 4      1        .     02        5         4      66.5
 5      1        .     03        5         2      68.0
 6      4        .     01        5         2      64.0

For the following CLASS statement:

class race edu symp;

Variations of the TYPES statement could also include:

  • types (); overall summary
  • types race*edu edu*symp; two two-way interactions
  • types race*(edu symp); two two-way interactions
  • types race*edu*symp; three-way interaction—same as NWAY

7.9 Controlling Subsets Using the CLASSDATA= and EXCLUSIVE Options

While the WAYS and TYPES statements control the combinations of classification variables that are to be summarized, you can also specify which levels of the classification variables are to appear in the report or output data set by creating a data set that contains the combinations and levels of interest. The data set can even include levels of classification variables that do not exist in the data itself, but that nonetheless are to appear in the data set or report.

This DATA step builds a data set that will be used with the CLASSDATA= option. As an illustration, it also adds a level for each classification variable that does not exist in the data.

data selectlevels(keep=race edu symp);
   set advrpt.demog
         (where=(race in('1','4')
               & 12 le edu le 15
               & symp in('01','02','03')));
   output;
   * For fun add some nonexistent levels;
   if _n_=1 then do;
      edu=0;
      race='0';
      symp='00';
      output;
      end;
   run;
Show the SELECTLEVELS Data
Obs    race    edu    symp
 1      0        0     00
 2      1       12     02
 3      1       12     03
 4      1       15     02
 5      4       14     01

The data set specified with the CLASSDATA option becomes a sophisticated filter for the data entering into the analysis.

proc summary data=advrpt.demog
             classdata=selectlevels;
   class race edu symp;
   var ht;
   output out=stats mean= meanHT;
   run;

The CLASSDATA option can be paired with the EXCLUSIVE option to radically change the observations that are available to the procedure. When the EXCLUSIVE option is not used, all levels of the classification variables that exist either in the analysis data or in the CLASSDATA= data set are included in the summary data set. Since we specifically included a level for each of the classification variables that are not in the data, we should expect to see them summarized in the summary data set.

In the data that is being summarized the variable SYMP never takes on the value of ‘00’, but since it is a value of SYMP in the CLASSDATA= data set it appears in the summary data.

CLASSDATA without EXCLUSIVE
Obs  race   edu   symp  _TYPE_  _FREQ_   meanHT
  1           .            0      63    67.2381
  2           .    00      1       0      .
  3           .    01      1       4    67.5000
  4           .    02      1      10    66.8000
. . . . portions of the table are not shown . . . .

When the EXCLUSIVE option is paired with the CLASSDATA= option the makeup of the summary data set can be altered dramatically. The EXCLUSIVE option forces only those levels that are in the CLASSDATA= data set to appear in the summary report. This includes the levels of the classification variables that do not appear in the data set.

proc summary data=advrpt.demog
             classdata=selectlevels
             exclusive;
   class race edu symp;
   var ht;
   output out=stats mean= meanHT;
   run;

The summary lines for observations 2 and 6 represent levels of the classification variables that do not appear in the data. They were generated through a combination of the CLASSDATA= data set and the EXCLUSIVE option.

7.9 Using the CLASSDATA and EXCLUSIVE Options
                                                   mean
Obs    race    edu    symp    _TYPE_    _FREQ_      HT
  1              .               0         8      66.25
  2              .     00        1         0        .
  3              .     01        1         2      64.00
  4              .     02        1         4      66.50
  5              .     03        1         2      68.00
  6              0               2         0        .
  7             12               2         4      67.50
  8             14               2         2      64.00
  9             15               2         2      66.00
. . . . portions of the table are not shown . . . .

Through the use of these two options we have the capability of creating a sophisticated filter for the classification variables. This combination not only gives us the ability to remove levels, but to add them as well.

The ability to add levels at run time without altering the analysis data set has some potentially huge advantages. First, we can modify the filter by changing the CLASSDATA= data set without changing the program that utilizes the data set. Second, we do not need to ‘sparse’ the data (see Section 2.5 for other sparsing techniques) prior to the analysis, thus increasing the program’s efficiency.

MORE INFORMATION

The CLASSDATA= and EXCLUSIVE options are also available in the TABULATE procedure (see Section 8.1.4).

7.10 Using the COMPLETETYPES Option

All combinations of the classification variables may not exist in the data and therefore those combinations will not appear in the summary table. If all possible combinations are desired, regardless as to whether or not they exist in the data, you can use the COMPLETETYPES option on the PROC statement.

proc summary data=advrpt.demog
                 (where=(race in('1','4')
                        & 12 le edu le 15
                        & symp in('01','02','03')))
                completetypes;
   class race edu symp;
   var ht;
   output out=stats mean= meanHT;
   run;

In the data (ADVRPT.DEMOG) there are no observations with both EDU=12 and SYMP=‘01’; however, since both levels exist somewhere in the data (individually or in combination with another classification variable), the COMPLETETYPES option causes the combination to appear in the summary data set (obs=8).

7.10 Using the COMPLETETYPES Option
                                                   mean
Obs    race    edu    symp    _TYPE_    _FREQ_      HT
  1              .               0         8      66.25
  2              .     01        1         2      64.00
  3              .     02        1         4      66.50
  4              .     03        1         2      68.00
  5             12               2         4      67.50
  6             14               2         2      64.00
  7             15               2         2      66.00
  8             12     01        3         0        .
  9             12     02        3         2      67.00
 10             12     03        3         2      68.00
 11             14     01        3         2      64.00
 12             14     02        3         0        .
 13             14     03        3         0        .
 14             15     01        3         0        .
 15             15     02        3         2      66.00
. . . . portions of the table are not shown . . . .

MORE INFORMATION

COMPLETETYPES is also used to create sparsed data in Section 2.5.3.

The procedures REPORT and TABULATE also have the ability to display non-existent combinations. See Section 8.1.4 for a TABULATE example. Preloaded formats can also be used to similar advantage, see Section 12.1 for examples with the MEANS, SUMMARY, TABULATE, and REPORT procedures.

7.11 Identifying Summary Subsets Using the LEVELS and WAYS Options

LEVELS and WAYS are options that can be used on the OUTPUT statement. They add the variables _LEVEL_ and _WAY_, respectively to the generated data table. Together or individually these variables can be used to help navigate the summary data set.

proc summary data=advrpt.demog;
   class race edu;
   var ht;
   output out=stats 
       mean= meanHT /levels Callout 1
                     ways; Callout 2 
   run;

The LEVELS option Callout 1 adds the variable _LEVEL_ to the OUT= data table. This numeric variable contains a sequential counter of rows within a given value of _TYPE_. This can be useful when working with rows within _TYPE_. Not only does the combination of _TYPE_ and _LEVEL_ form a unique sorted key for the new data set but, for further subsetting and subsequent summarization, when FIRST._TYPE_ is true, _LEVEL_ will necessarily equal 1.

The WAYS option Callout 2 adds the variable _WAY_ to the OUT= data table. This numeric variable equals the number of classification variables that were used to calculate each observation. A two-way interaction between two classification variables will have _WAY_=2.

7.11 Using the LEVELS and WAYS Options
Obs    race    edu    _WAY_    _TYPE_    _LEVEL_    _FREQ_     meanHT                           
                        Callout 2                   Callout 1
  1              .      0         0          1        75      67.6000
  2             10      1         1          1        11      71.3636
  3             12      1         1          2        19      67.0526
  4             13      1         1          3         4      70.0000
  5             14      1         1          4        10      64.2000
  6             15      1         1          5         7      65.2857
  7             16      1         1          6        10      70.4000
  8             17      1         1          7        10      65.2000
  9             18      1         1          8         4      69.0000
 10     1        .      1         2          1        42      68.4762
 11     2        .      1         2          2        17      67.6471
 12     3        .      1         2          3         8      65.0000
 13     4        .      1         2          4         4      64.5000
 14     5        .      1         2          5         4      66.5000
 15     1       10      2         3          1        11      71.3636
 16     1       12      2         3          2        16      67.2500
 17     1       13      2         3          3         4      70.0000
 18     1       15      2         3          4         5      64.2000
 19     1       16      2         3          5         2      71.0000
 20     1       17      2         3          6         2      63.0000
. . . . portions of the table are not shown . . . .

7.12 CLASS Statement vs. BY Statement

Although the CLASS and BY statements will often produce similar results, the user should be aware of the differences, not only in performance, but in function as well, for these two statements.

In terms of general operation the BY statement requires the incoming data to be sorted. Given that the data is sorted, the data is processed in BY groups—one group at a time. This requires less memory and processing resources than when the CLASS statement is used. However, when the data is not already sorted, the sorting of the data itself will generally outweigh the performance advantages of the BY statement.

When the CLASS statement is used, it is possible to calculate any of the possible interactions among the classification variables. This is not possible when using BY group processing. We can examine statistics within each unique combination of BY variables, but not across BY variables.

When a classification variable takes on a missing value, the entire observation is removed from the analysis (see Section 7.1.1 for the use of the MISSING option to change this behavior). Missing levels of the BY variables are considered valid levels and are not eliminated.

Since the MEANS and SUMMARY procedures allow for multi-threaded processing, if you execute SAS on a server or a machine with multiple CPUs you may see a performance difference in the use of BY vs. CLASS statements. The procedure will take advantage of multi-threading for both types of summarizations; however, the internals are not necessarily the same. You may want to experiment a bit on your system.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset