CHAPTER  4

Sorting, Printing, and Summarizing Your Data

 

4.1    Using SAS Procedures

4.2    Subsetting in Procedures with the WHERE Statement

4.3    Sorting Your Data with PROC SORT

4.4    Changing the Sort Order for Character Data

4.5    Printing Your Data with PROC PRINT

4.6    Changing the Appearance of DataValues with Formats

4.7    Selected Standard Formats

4.8    Creating Your Own Formats with PROC FORMAT

4.9    Writing a Report to a Text File

4.10  Summarizing Your Data Using PROC MEANS

4.11  Writing Summary Statistics to a SAS Data Set

4.12  Producing One-Way Frequencies with PROC FREQ

4.13  Producing Crosstabulations with PROC FREQ

4.14  Grouping Data with User-Defined Formats

4.15  Producing Tabular Reports with PROC TABULATE

4.16  Adding Statistics to PROC TABULATE Output

4.17  Enhancing the Appearance of PROC TABULATE Output

4.18  Changing Headers in PROC TABULATE Output

4.19  Producing Simple Output with PROC REPORT

4.20  Using DEFINE Statements in PROC REPORT

4.21  Creating Summary Reports with PROC REPORT

4.22  Adding Summary Breaks to PROC REPORT Output

4.23  Adding Statistics to PROC REPORT Output

4.24  Adding Computed Variables to PROC REPORT Output

 

4.1     Using SAS Procedures

image

Using a procedure, also called a PROC, is like filling out a form. Someone else designed the form, and all you have to do is fill in the blanks and choose from a list of options. Each PROC has its own unique form with its own list of options. But while each procedure is unique, there are similarities too. This section discusses some of those similarities.

All procedures have required statements, and most have optional statements. For example, the only statement required in PROC PRINT is the PROC statement:

PROC PRINT;

However, by adding optional statements you could make this procedure a dozen lines long or even more.

PROC statement  All procedures start with the keyword PROC followed by the name of the procedure, such as PRINT or CONTENTS. Options, if there are any, follow the procedure name. The DATA= option tells SAS which data set to use as input for that procedure. The DATA= option is, of course, optional. If you skip it, then SAS will use the most recently created data set, which is not necessarily the same as the most recently used. Sometimes it is easier to specify the data set you want than to figure out which data set SAS will use by default. In this case, SAS will use a temporary SAS data set named BANANA:

PROC CONTENTS DATA = banana;

To use a permanent SAS data set, add the libref to the data set name. In this case, SAS will use a data set named BANANA that is stored in a SAS data library named TROPICAL:

PROC CONTENTS DATA = tropical.banana;

You can also refer to a permanent SAS data set directly by placing your operating environment’s name for it between quotation marks.

PROC CONTENTS DATA = 'c:MySASLibanana';

See Section 2.2 for a more complete discussion of how to reference SAS data libraries and data set members.

BY statement  The BY statement is required for only one procedure, PROC SORT. In PROC SORT the BY statement tells SAS how to arrange the observations. In all other procedures, the BY statement is optional, and tells SAS to perform a separate analysis for each combination of values of the BY variables rather than treating all observations as one group. For example, this statement tells SAS to run a separate analysis for each state:

BY State;

All procedures, except PROC SORT, assume that your data are already sorted by the variables in your BY statement. If your observations are not already sorted, then use PROC SORT to do the job.

TITLE and FOOTNOTE statements  The TITLE statement specifies titles to print at the top of your procedure output. FOOTNOTE works the same way, but prints at the bottom of the page. These global statements are not technically part of any step. You can put them anywhere in your program, but since they apply to the procedure output it generally makes sense to put them with the procedure. The most basic TITLE statement consists of the keyword TITLE followed by your title enclosed in quotation marks. SAS doesn’t care if the two quotation marks are single or double as long as they are the same:

TITLE 'This is a title';

If you find that your title contains an apostrophe, use double quotation marks around the title, or replace the single apostrophe with two:

TITLE "Here's another title";

TITLE 'Here''s another title';

You can specify up to ten titles or footnotes by adding numbers to the keywords TITLE and FOOTNOTE:

FOOTNOTE3 'This is the third footnote';

When you specify a new title or footnote, it replaces the old title or footnote with the same number and cancels those with a higher number. For example, a new TITLE2 cancels an existing TITLE3, if there is one. Titles and footnotes stay in effect until you replace them with new ones or cancel them with a null statement. The following null statement would cancel all current titles:

TITLE;

LABEL statement  Sometimes variable names are not as informative as you would like. In those cases, you can use the LABEL statement to create descriptive labels, up to 256 characters long, for each variable. This statement creates labels for the variables ReceiveDate and ShipDate:

LABEL ReceiveDate  = 'Date order was received'  

      ShipDate = 'Date merchandise was shipped';

If you put a LABEL statement in a DATA step, the labels will be saved with the data; but if you put a LABEL statement in a PROC, then the labels will be used only by that particular step.

Customizing output  You have a lot of control over the output produced by procedures. Using system options, you can set many features such as centering, dates, and paper orientation.  (See Section 1.7.) With the Output Delivery System, you can also change the overall style of your output, produce output in different formats such as PDF or RTF, or change specific attributes of your output such as font or color. (See Chapter 5.)

Output data sets  Most procedures produce some kind of report, but sometimes you would like the results of the procedure saved as a SAS data set so you can perform further analysis. You can create SAS data sets from any procedure output using the ODS OUTPUT statement (Section 5.13). Some procedures can also write a SAS data set using an OUTPUT statement or OUT= option.

4.2     Subsetting in Procedures with the WHERE Statement

image

One optional statement for a PROC that reads a SAS data set is the WHERE statement. The WHERE statement tells a procedure to use a subset of the data. There are other ways to subset data, as you probably remember, so you could get by without ever using the WHERE statement. However, the WHERE statement is a shortcut. While subsetting IFs work only in DATA steps, the WHERE statement works in PROC steps too.

Unlike subsetting in a DATA step, using a WHERE statement in a procedure does not create a new data set. That is one of the reasons why WHERE statements are sometimes more efficient than other ways of subsetting. (The WHERE= data set option is similar to a WHERE statement. See Section 6.12 for more information.)

The basic form of a WHERE statement is:

WHERE condition;

Only observations satisfying the condition will be used by the PROC. This may look familiar since it is similar to a subsetting IF. The left side of the condition is a variable name, and the right side is a variable name, a constant, or a mathematical expression. Mathematical expressions can contain the standard arithmetic symbols for addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (**). Between the two sides of the expression, you can use comparison and logical operators; those operators may be symbolic or mnemonic:

Symbolic

Mnemonic

Example

=

EQ

WHERE Region = 'Spain';

¬=, ~=, ^=

NE

WHERE Region ~= 'Spain';

GT

WHERE Rainfall > 20;

LT

WHERE Rainfall < AvgRain;

>=

GE

WHERE Rainfall >= AvgRain + 5;

<=

LE

WHERE Rainfall <= AvgRain / 1.25;

&

AND

WHERE Rainfall > 20 AND Temp < 90;

|, ¦ , !

OR

WHERE Rainfall > 20 OR Temp < 90;

 

IS MISSING

WHERE Region IS MISSING;

 

IS NOT MISSING

WHERE Region IS NOT MISSING;

 

BETWEEN AND

WHERE Region BETWEEN 'Plain' AND 'Spain';

?

CONTAINS

WHERE Region CONTAINS 'ain';

=*

 

WHERE Region =* 'Spane';

 

LIKE

WHERE Region LIKE 'Sp%n';

 

IN (list)

WHERE Region IN ('Rain', 'Spain', 'Plain');

 

The sounds-like operator (=*) uses the Soundex algorithm to select observations with data values that are phonetically similar in English (such as Spain and Spane). The LIKE operator selects observations that match a specific pattern where a percent sign (%) serves as a wildcard corresponding to any number of characters.

Example  The following comma-separated values data contain information about well-known painters. For each artist, the data include the painter’s name, primary style, and nationality:

Name,Genre,Nationality

Mary Cassatt,Impressionism,U

Paul Cezanne,Post-impressionism,F

Salvador Dali,Surrealism,S

Henri Matisse,Post-impressionism,F

Claude Monet,Impressionism,F

Berthe Morisot,Impressionism,F

Pablo Picasso,Surrealism,S

Jackson Pollock,Abstract expressionism,U

Pierre Auguste Renoir,Impressionism,F

Vincent van Gogh,Post-impressionism,N

Suppose you wanted a list of just the Impressionist painters. The quick-and-easy way to do this is with PROC PRINT and a WHERE statement. In the following program, a PROC IMPORT reads the data from a file named Artists.csv, and uses a LIBNAME statement to create a permanent SAS data set named STYLE in a directory named MySASLib. Then in the PROC PRINT, a WHERE statement selects just the Impressionists, and a FOOTNOTE statement explains the codes for Nationality.

* Import CSV file of artists;

LIBNAME art 'c:MySASLib';

PROC IMPORT DATAFILE = 'c:MyRawDataArtists.csv' OUT = art.style REPLACE;

RUN;

* Print list of Impressionist painters;

PROC PRINT DATA = art.style;

   WHERE Genre = 'Impressionism';

   TITLE 'Impressionist Painters';

   FOOTNOTE 'F = France U = USA';

RUN;

Here are the results. Notice that because the data were subsetted in the PROC PRINT, the numbers in the Obs column reflect the order of observations in the original data set. If you don’t like that, you can tell SAS not to print the Obs column. (See Section 4.5.)

Impressionist Painters

Obs

Name

Genre

Nationality

1

Mary Cassatt

Impressionism

U

5

Claude Monet

Impressionism

F

6

Berthe Morisot

Impressionism

F

9

Pierre Auguste Renoir

Impressionism

F

F = France U = USA

4.3     Sorting Your Data with PROC SORT

image

There are many reasons for sorting your data: to organize data for a report, before combining data sets, or before using a BY statement in another PROC or DATA step. Fortunately, PROC SORT is quite simple. The basic form of this procedure is:

   PROC SORT;

      BY variable-list;

The variables named in the BY statement are called BY variables. You can specify as many BY variables as you like. With one BY variable, SAS sorts the data based on the values of that variable. With more than one variable, SAS sorts observations by the first variable, then by the second variable within categories of the first, and so on. A BY group is all the observations that have the same values of BY variables. If, for example, your BY variable is Country, then all the observations for Spain form one BY group.

Controlling the output data set  The DATA= and OUT= options specify the input and output data sets. If you don’t specify the DATA= option, then SAS will use the most recently created data set. If you don’t specify the OUT= option, then SAS will replace the original data set with the newly sorted version. This sample statement tells SAS to sort the data set named MESSY, and then put the sorted data into a data set named NEAT:

PROC SORT DATA = messy OUT = neat;

The NODUPKEY option tells SAS to eliminate duplicate observations that have the same values for the BY variables. NODUPKEY applies only to variables listed in the BY statement. So if you want to eliminate duplicate observations that have identical values for all the variables, then you need to include all the variables in your BY statement. If you specify the DUPOUT= option, then SAS will put the deleted observations in that data set. This statement would remove any duplicate observations and put them into a data set named EXTRAOBS:

PROC SORT DATA = messy OUT = neat  NODUPKEY  DUPOUT = extraobs;

Ascending versus descending sorts  By default, SAS sorts data in ascending order, from lowest to highest. To have your data sorted in the opposite order, add the keyword DESCENDING to the BY statement before each variable that should be sorted in reverse order. This statement tells SAS to sort first by Country (from A to Z) and then by City (from Z to A) within Country:

BY Country DESCENDING City;

Example  The following data show the length in feet of whales and sharks. Notice that each line includes data for more than one animal.

beluga  whale 15 dwarf    shark  .5 sperm  whale 60

basking shark 30 humpback .      50 whale  shark 40

gray    whale 50 blue     whale 100 killer whale 30

mako    shark 12 whale    shark  40 blue   whale 90

In the following program, the DATA step reads a raw data file named Lengths.dat and creates a SAS data set named MARINE. Then PROC SORT orders the observations by Family and Name.

DATA marine;

   INFILE 'c:MyRawDataLengths.dat';

   INPUT Name $ Family $ Length @@;

RUN;

* Sort the data and remove duplicates;

PROC SORT DATA = marine OUT = seasort NODUPKEY;

   BY Family Name;

RUN;

* Print the sorted data;

PROC PRINT DATA = seasort;

   TITLE 'Whales and Sharks';

RUN;

The OUT= option writes the sorted data into a new data set named SEASORT. The NODUPKEY option eliminates observations with duplicate values of the BY variables (Family and Name). The log contains these notes showing that the sorted data set has two fewer observations than the original data set.

NOTE: There were 12 observations read from the data set WORK.MARINE.

NOTE: 2 observations with duplicate key values were deleted.

NOTE: The data set WORK.SEASORT has 10 observations and 3 variables.

The output from PROC PRINT looks like this:

Whales and Sharks

Obs

Name

Family

Length

1

humpback

 

50.0

2

basking

shark

30.0

3

dwarf

shark

0.5

4

mako

shark

12.0

5

whale

shark

40.0

6

beluga

whale

15.0

7

blue

whale

100.0

8

gray

whale

50.0

9

killer

whale

30.0

10

sperm

whale

60.0

Notice that the humpback, with a missing value for Family, became observation 1. That is because missing values are always lowest for both numeric and character variables. Also, the NODUPKEY option eliminated two duplicate observations, one for the whale shark and one for the blue whale. Because the BY variables were Family and Name, SAS ignored the variable Length, and just kept the first observation for each combination of Family and Name.

4.4     Changing the Sort Order for Character Data

At first glance, sorting character data appears straightforward. After all, everyone knows that "A" comes before "B." However, it is less obvious whether "A" comes before "a." SAS offers dozens of options for controlling the sort order of character data (also called the collating sequence). This section describes a few of them.

ASCII versus EBCDIC  The default collating sequence for the z/OS operating environment is EBCDIC. The default collating sequence for most other operating environments is ASCII. From lowest to highest, the basic sort orders for character data are

ASCII

blank

numerals

uppercase letters

lowercase letters

EBCDIC

blank

lowercase letters

uppercase letters

numerals

If you work in only one operating environment, this may not matter to you. However, if you need to create a data set on Windows that will be used on z/OS or vice versa, then you might want your data to be sorted in the order expected by that environment. You can use the options SORTSEQ=EBCDIC or SORTSEQ=ASCII to change the sort order:

PROC SORT SORTSEQ = EBCDIC;

Other possible values for the SORTSEQ= option include DANISH, FINNISH, ITALIAN, NORWEGIAN, POLISH, SPANISH, and SWEDISH.

Linguistic sorting  By default, upper- and lowercase letters will be sorted separately, but this is not the way that people generally sort them. You can use linguistic sorting to produce a more intuitive order. The SORTSEQ=LINGUISTIC option with the STRENGTH=PRIMARY suboption tells SAS to ignore case. To use these options, add them to the PROC SORT statement like this:

PROC SORT SORTSEQ = LINGUISTIC (STRENGTH = PRIMARY);

Here are data that are unsorted, sorted in the default ASCII order, and then sorted ignoring case:

Unsorted order

Default Sort

Linguistic Sort (STRENGTH=PRIMARY)

eva

ANNA

amanda

amanda

Zenobia

ANNA

Zenobia

amanda

eva

ANNA

eva

Zenobia

When numerals are sorted as character data, the value "10" comes before "2." The NUMERIC_COLLATION=ON suboption tells SAS to treat numerals as their numeric equivalent.

PROC SORT SORTSEQ = LINGUISTIC (NUMERIC_COLLATION = ON);

 

Here are data that are unsorted, sorted in the default order, and sorted with numeric collation:

Unsorted order

Default Sort

Linguistic Sort (NUMERIC_COLLATION=ON)

1500m freestyle

100m backstroke

50m freestyle

200m breaststroke

1500m freestyle

100m backstroke

100m backstroke

200m breaststroke

200m breaststroke

50m freestyle

50m freestyle

1500m freestyle

Example  The following data contain names and addresses:

Name   Street             City       State

Seiki  100 A St.          Anchorage  Alaska

Wong   2 A St.            honolulu   Hawaii

Shaw   10 A St. Apt. 10   juneau     Alaska

Smith  10 A St. Apt. 2    Honolulu   Hawaii

Lee    100 A St.          Honolulu   hawaii

Jones  10 A St. Apt. 22   Juneau     alaska

This program imports the tab-delimited file. Then the data are sorted by the variables State, City, and Street ignoring case and using numeric collation.

PROC IMPORT DATAFILE = 'c:MyRawDataMail.txt' OUT = addresses REPLACE;

RUN;

* Sort addresses using linguistic sorting with numeric collation;

PROC SORT DATA = addresses OUT = sortout

      SORTSEQ = LINGUISTIC (STRENGTH = PRIMARY NUMERIC_COLLATION = ON);

   BY State City Street;

RUN;

* Print the linguistically sorted data;

PROC PRINT DATA = sortout;

   TITLE 'Addresses Sorted by State, City, and Street';

RUN;

Here are the results:

Addresses Sorted by State, City, and Street

Obs

Name

Street

City

State

1

Seiki

100 A St.

Anchorage

Alaska

2

Shaw

10 A St. Apt. 10

juneau

Alaska

3

Jones

10 A St. Apt. 22

Juneau

alaska

4

Wong

2 A St.

honolulu

Hawaii

5

Smith

10 A St. Apt. 2

Honolulu

Hawaii

6

Lee

100 A St.

Honolulu

hawaii

4.5     Printing Your Data with PROC PRINT

The PRINT procedure is perhaps the most widely used SAS procedure. In its simplest form, PROC PRINT prints all the values of all variables for all observations in the SAS data set. SAS decides the best way to format the output, so you don’t have to worry about things like how many variables will fit on a page. But there are a few more features of PROC PRINT that you might want to use.

The PRINT procedure starts with the keywords PROC PRINT followed by options:

PROC PRINT options;

We recommend always using the DATA= option for clarity in your programs. By default, SAS prints the observation numbers along with the variables’ values. If you don’t want observation numbers, use the NOOBS option. If you define variable labels with a LABEL statement, and you want to print the labels instead of the variable names, then add the LABEL option as well. The following statement shows all of these options together:

PROC PRINT DATA = data-set NOOBS LABEL;

The following are optional statements that sometimes come in handy:

BY variable-list;

The BY statement starts a new section in the output for each new value of the BY variables and prints the values of the BY variables at the top of each section. The data must be presorted by the BY variables.

ID variable-list;

When you use the ID statement, the observation numbers are not printed. Instead, the variables in the ID variable list appear on the left-hand side of the page.

SUM variable-list;

The SUM statement prints sums for the variables in the list.

VAR variable-list;

The VAR statement specifies which variables to print and their order. Without a VAR statement, all variables in the SAS data set are printed in the order that they occur in the data set.

Example  Students from two fourth-grade classes are selling cookies to earn money for a special field trip. The following are the data for the cookie sale. The students’ names are followed by their classroom number, the date they turned in their money, the type of cookie (mint patties or chocolate dinosaurs), and the number of boxes sold. Note that each line of data contains information for two students:

Adriana 21 3/21/2020 MP  7 Nathan 14 3/21/2020 CD 19

Matthew  1 3/21/2020 CD 14 Claire 14 3/22/2020 CD 11

Ian     21 3/24/2020 MP 18 Chris  14 3/25/2020 CD 6

Anthony 21 3/25/2020 MP 13 Erika  21 3/25/2020 MP 17

The class earns $2.50 for each box of cookies sold. The teachers want a report showing the money earned for each classroom, the money earned by each student, the type of cookie sold, and the date the students returned their money. The following program reads the data, creates a permanent SAS data set named SALES and computes money earned (Profit). Then PROC SORT sorts the data by classroom and creates a temporary data set named SALESSORT. The PROC PRINT uses a BY statement to print the data by Class and a SUM statement to give the totals for Profit. The VAR statement lists the variables to be printed.

LIBNAME class 'c:MySASLib';

DATA class.sales;

   INFILE 'c:MyRawDataCookieSales.dat';

   INPUT Name $ Class DateReturned MMDDYY10. CookieType $ Quantity @@;

   Profit = Quantity * 2.5;

RUN;

PROC SORT DATA = class.sales OUT = salessort;

   BY Class;

RUN;

PROC PRINT DATA = salessort;

   BY Class;

   SUM Profit;

   VAR Name DateReturned CookieType Profit;

   TITLE 'Cookie Sales for Field Trip by Class';

RUN;

Here are the results. Notice that the values for the variable DateReturned are printed as their SAS date values. You can use formats, covered in the next section, to print dates in readable forms.

Cookie Sales for Field Trip by Class

Class=14

Obs

Name

DateReturned

CookieType

Profit

1

Nathan

21995

CD

47.5

2

Matthew

21995

CD

35.0

3

Claire

21996

CD

27.5

4

Chris

21999

CD

15.0

Class

 

 

 

125.0

Class=21

Obs

Name

DateReturned

CookieType

Profit

5

Adriana

21995

MP

17.5

6

Ian

21998

MP

45.0

7

Anthony

21999

MP

32.5

8

Erika

21999

MP

42.5

Class

 

 

 

137.5

 

 

 

 

262.5

4.6     Changing the Appearance of Data Values with Formats

image

When SAS displays your data, it decides which format is besthow many decimal places to show, how much space to allow for each value, and so on. This is very convenient and makes your job much easier, but SAS doesn’t always do what you want. Fortunately, you’re not stuck with the format SAS thinks is best. You can change the appearance of data values using SAS formats.

SAS has many formats for character, numeric, and date values. For example, you can use the COMMAw.d format to print numbers with embedded commas, the $w. format to control the number of characters printed, and the MMDDYYw. format to print SAS date values (the number of days since January 1, 1960) in a readable form like 12/03/2023. You can even print your data in more obscure formats like hexadecimal, zoned decimal, and packed decimal, if you like. Using the FORMAT procedure, you can also create your own formats (Section 4.8).

The general forms for SAS formats are:

Character

Numeric

Date

$formatw.

formatw.d

formatw.

where the $ indicates character formats, format is the name of the format, w is the total width including any decimal point, and d is the number of decimal places.

FORMAT statement  You can associate formats with variables in a FORMAT statement. The FORMAT statement starts with the keyword FORMAT, followed by the variable name (or names if more than one variable is to be associated with the same format), followed by the format. For example, the following FORMAT statement associates the DOLLAR8.2 format with the variables Profit and Loss and associates the MMDDYY8. format with the variable SaleDate:

FORMAT Profit Loss DOLLAR8.2 SaleDate MMDDYY8.;

FORMAT statements can go in either DATA steps or PROC steps. If the FORMAT statement is in a DATA step, then the format association is permanent and is stored with the SAS data set. If the FORMAT statement is in a PROC step, then it is temporaryaffecting only the results from that procedure.

Example  In the previous section, results from the fourth-grade cookie sale were printed using the PRINT procedure. The names of the students were printed along with the date they turned in their money, the type of cookie sold, and the profit. You may have noticed that the dates printed were numbers like 21995 and 21999. Using a FORMAT statement in the PRINT procedure, we can print the dates in a readable form. At the same time, we can print the variable Profit using the DOLLAR6.2 format so dollar signs appear before the numbers.

 

This example uses the permanent SAS data set created in the preceding section. The data are the students’ names followed by their classroom, the date they turned in their money, the type of cookie sold (mint patties or chocolate dinosaurs), the number of boxes sold, and profit:

 

Name

Class

DateReturned

CookieType

Quantity

Profit

1

Adriana

21

21995

MP

7

17.5

2

Nathan

14

21995

CD

19

47.5

3

Matthew

14

21995

CD

14

35.0

4

Claire

14

21996

CD

11

27.5

5

Ian

21

21998

MP

18

45.0

6

Chris

14

21999

CD

6

15.0

7

Anthony

21

21999

MP

13

32.5

8

Erika

21

21999

MP

17

42.5

The following program uses the permanent SAS data set SALES. The FORMAT statement in the PRINT procedure associates the DATE9. format with the variable DateReturned and the DOLLAR6.2 format with the variable Profit:

LIBNAME class 'c:MySASLib';

* Print cookie sales data with formatted values;

PROC PRINT DATA = class.sales;

   VAR Name DateReturned CookieType Profit;

   FORMAT DateReturned DATE9. Profit DOLLAR6.2;

   TITLE 'Cookie Sale Data Using Formats';

RUN;

Here are the results with formatted values for DateReturned and Profit:

Cookie Sale Data Using Formats

Obs

Name

DateReturned

CookieType

Profit

1

Adriana

21MAR2020

MP

$17.50

2

Nathan

21MAR2020

CD

$47.50

3

Matthew

21MAR2020

CD

$35.00

4

Claire

22MAR2020

CD

$27.50

5

Ian

24MAR2020

MP

$45.00

6

Chris

25MAR2020

CD

$15.00

7

Anthony

25MAR2020

MP

$32.50

8

Erika

25MAR2020

MP

$42.50

 

4.7     Selected Standard Formats

Format

Definition

Width range

Default width

Character

$UPCASEw.                                                

Converts character data to uppercase

1–32767                                                

Length of variable or 8

$w.                                                

Writes standard character data—does not trim leading blanks (same as $CHARw.)

1–32767                                                

Length of variable or 1

Date, Time, and Datetime1

DATEw.

Writes SAS date values in form ddmmmyy or ddmmmyyyy

5–11                                                

7

DATETIMEw.d

Writes SAS datetime values in form ddmmmyy:hh:mm:ss.ss

7–40                                                

16

DTDATEw.

Writes SAS datetime values in form ddmmmyy or ddmmmyyyy

5–9

7

EURDFDDw.

Writes SAS date values in form dd.mm.yy or dd.mm.yyyy

2–10

8

JULIANw.

Writes SAS date values in Julian date form yyddd or yyyyddd

5–7

5

MMDDYYw.

Writes SAS date values in form mm/dd/yy or mm/dd/yyyy

2–10                                                

8

TIMEw.d                                                

Writes SAS time values in form hh:mm:ss.ss

2–20

8

WEEKDATEw.

Writes SAS date values in form day-of-week, month-name dd, yy, or yyyy

3–37                                                

29

WORDDATEw.

Writes SAS date values in form
month-name dd, yyyy

3–32

18

Numeric

BESTw.                                                

SAS chooses the best format—default format for numeric data

1–32                                                

12

COMMAw.d

 

Writes numbers with commas

1–32

6

DOLLARw.d                                                

Writes numbers with a leading $ and commas separating every three digits

2–32                                                

6

Ew.                                                 

Writes numbers in scientific notation

7–32

12

EUROXw.d

Writes numbers with a leading and periods separating every three digits

1–32

6

PERCENTw.d                                                

Writes numeric data as percentages

4–32

6

w.d                                                

Writes standard numeric data

1–32

none

 

1  SAS date values are the number of days since January 1, 1960. SAS time values are the number of seconds past midnight, and
datetime values are the number of seconds since midnight January 1, 1960.

 

 

Format

Input data

FORMAT statement           

Results

Character

$UPCASEw.                                                 

my cat

FORMAT Animal $UPCASE6.;

MY CAT

$w.                                                 

my cat
 my snake

FORMAT Animal $8.;   

my cat  
 my snak

Date, Time, and Datetime

DATEw.

8966                     

FORMAT Birth DATE7.;
FORMAT Birth DATE9.;

19JUL84
19JUL1984

DATETIMEw.

12182

FORMAT Start DATETIME13.;
FORMAT Start DATETIME18.1;

01JAN60:03:23
01JAN60:03:23:02.0

DTDATEw.                                                

12182                     

FORMAT Start DTDATE7.;
FORMAT Start DTDATE9.;

01JAN60
01JAN1960

EURDFDDw.

8966

FORMAT Birth EURDFDD8.;
FORMAT Birth EURDFDD10.;

19.07.84
19.07.1984

JULIANw.                                                

8966

FORMAT Birth JULIAN5.;
FORMAT Birth JULIAN7.;

84201
1984201

MMDDYYw.                                                

8966                     

FORMAT Birth MMDDYY8.;
FORMAT Birth MMDDYY6.;

7/19/84
071984

TIMEw.d                                                

12182

FORMAT Start TIME8.;
FORMAT Start TIME11.2;

3:23:02
3:23:02.00

WEEKDATEw.                                                

8966                     

FORMAT Birth WEEKDATE15.;
FORMAT Birth WEEKDATE29.;

Thu, Jul 19, 84
Thursday, July 19, 1984

WORDDATEw.                                                

8966                     

FORMAT Birth WORDDATE12.;
FORMAT Birth WORDDATE18.;

Jul 19, 1984
July 19, 1984

Numeric

BESTw.                                                

1200001

FORMAT Value BEST6.;
FORMAT Value BEST8.;

1.20E6
1200001

COMMAw.d

 

1200001                     

FORMAT Value COMMA9.;
FORMAT Value COMMA12.2;

1,200,001
1,200,001.00

DOLLARw.d                                                

1200001

FORMAT Value DOLLAR10.;
FORMAT Value DOLLAR13.2;

$1,200,001
$1,200,001.00

Ew.                                                 

1200001

FORMAT Value E7.;    

1.2E+06

EUROXw.d                                                

1200001

FORMAT Value EUROX13.2;

1.200.001,00

PERCENTw.d                                                

0.05

FORMAT Value PERCENT9.2;

5.00%

w.d

23.635

FORMAT Value 6.3;    
FORMAT Value 5.2;    

23.635
23.64

 

4.8     Creating Your Own Formats with PROC FORMAT

image

At some time you will probably want to create your own custom formatsespecially if you use a lot of coded data. Suppose that you have just completed a survey for your company and to save disk space and time, all the responses to the survey questions are coded. For example, the age categories teen, adult, and senior are coded as numbers 1, 2, and 3. This is convenient for data entry and analysis, but bothersome when it comes time to interpret the results. You could present your results along with a code book, and your company directors could look up the codes as they read the results. But this will probably not get you that promotion you’ve been looking for. A better solution is to create user-defined formats using PROC FORMAT and print the formatted values instead of the coded values.

The FORMAT procedure creates formats that will later be associated with variables in a FORMAT statement. (Formats can also be used with a PUT function to modify data values in a DATA step. See Section 4.14 for an example.) Here is the general form of PROC FORMAT:

PROC FORMAT;

     VALUE name range-1 = 'formatted-text-1'

                range-2 = 'formatted-text-2'

                        .

                        .

                        .

                range-n = 'formatted-text-n';

The name in the VALUE statement is the name of the format that you are creating. If the format is for character data, the name must start with a $. The name can’t be longer than 32 characters (including the $ for character data), it must not start or end with a number, and cannot contain any special characters except an underscore. In addition, the name can’t be the name of an existing format. Each range is the value of a variable that is to be assigned to the text given in quotation marks on the right side of the equal sign. The text can be up to 32,767 characters long. The following are examples of valid range specifications:

            'A' = 'Asia'

  1, 3, 5, 7, 9 = 'Odd'

  500000 - HIGH = 'Not Affordable'

       13 -< 20 = 'Teenager'

      0 <- HIGH = 'Positive Non Zero'

          OTHER = 'Bad Data'

Character data values should be enclosed in quotation marks ('A' for example). If there is more than one value in the range, then separate the values with a comma or use a hyphen (-) for a continuous range. The keywords LOW and HIGH can be used in ranges to indicate the lowest and the highest nonmissing value for the variable. You can also use the less than symbol (<) in ranges to exclude either end point of the range. The OTHER keyword can be used to assign a format to any values not listed in the VALUE statement.

 

Example  Universe Cars is surveying its customers as to their preferences for car colors. They have information about the customer’s age, sex (coded as 1 for male and 2 for female), annual income, and preferred car color (yellow, gray, blue, or white). Here are the data in CSV format:

Age,Sex,Income,Color

19,1,28000,Y

45,1,130000,G

72,2,70000,B

31,1,88000,Y

58,2,166000,W

The following program reads the data; creates three user-defined formats for age, sex, and car color using the FORMAT procedure; and then prints the data using the new formats:

PROC IMPORT DATAFILE = 'c:MyRawDataCars.csv' OUT = carsurvey REPLACE;

RUN;

PROC FORMAT;

   VALUE gender 1 = 'Male'

                2 = 'Female';

   VALUE agegroup 13 -< 20 = 'Teen'

                  20 -< 65 = 'Adult'

                  65 - HIGH = 'Senior';

   VALUE $col  'W' = 'Moon White'

               'B' = 'Sky Blue'

               'Y' = 'Sunburst Yellow'

               'G' = 'Rain Cloud Gray';

RUN;

* Print data using user-defined and standard formats;

PROC PRINT DATA = carsurvey;

   FORMAT Sex gender. Age agegroup. Color $col. Income DOLLAR8.;

   TITLE 'Survey Results Printed with User-Defined Formats';

RUN;

This program creates two numeric formats: GENDER. for the variable Sex and AGEGROUP. for the variable Age. The program creates a character format, $COL., for the variable Color. Notice that the format names do not end with periods in the VALUE statement, but do in the FORMAT statement.

Here is the output:

Survey Results Printed with User-Defined Formats

Obs

Age

Sex

Income

Color

1

Teen

Male

$28,000

Sunburst Yellow

2

Adult

Male

$130,000

Rain Cloud Gray

3

Senior

Female

$70,000

Sky Blue

4

Adult

Male

$88,000

Sunburst Yellow

5

Adult

Female

$166,000

Moon White

This example creates temporary formats that exist only for the current job or session. It is possible to create permanent formats. (See the SAS Documentation for details.) Formats can also be used to group data. (See Section 4.14.)

4.9     Writing a Report to a Text File

PROC PRINT is flexible and easy to use. Still, there are times when PROC PRINT just won’t do: when your report to a state agency has to be spaced just like their fill-in-the-blank form, or when your client insists that the report contain complete sentences, or when you want one page per observation. At those times you can use the flexibility of the DATA step to write a text file. Text files are simple, but with simplicity comes a level of control that is not always possible with other file formats.

You can write data in a DATA step the same way you read databut in reverse. Instead of using an INFILE statement, you use a FILE statement; instead of INPUT statements, you use PUT statements. This is similar to writing a raw data file in a DATA step (Section 10.6), but to write a report you use the PRINT option telling SAS to include the carriage returns and page breaks needed for printing. Here is the general form of a FILE statement for creating a report:

FILE 'file-specification' PRINT;

Like INPUT statements, PUT statements can be in list, column, or formatted style, but since SAS already knows whether a variable is numeric or character, you don’t have to put a $ after character variables. If you use list format, SAS will automatically put a space between each variable. If you use column or formatted styles of PUT statements, SAS will put the variables wherever you specify. You can control spacing with the same pointer controls that INPUT statements use: @n to move to column n, +n to move n columns, / to skip to the next line, #n to skip to line n, and the trailing @ to hold the current line. In addition to printing variables, you can insert a text string by simply enclosing it in quotation marks.

Example  To show how this differs from PROC PRINT, this example uses the permanent SAS data set created in Section 4.5. Two fourth-grade classes have sold cookies to raise money for a field trip. The data are the students’ names followed by their classroom, the date they turned in their money, the type of cookie sold (mint patties or chocolate dinosaurs), the number of boxes sold, and profit:

 

Name

Class

DateReturned

CookieType

Quantity

Profit

1

Adriana

21

21995

MP

7

17.5

2

Nathan

14

21995

CD

19

47.5

3

Matthew

14

21995

CD

14

35.0

4

Claire

14

21996

CD

11

27.5

5

Ian

21

21998

MP

18

45.0

6

Chris

14

21999

CD

6

15.0

7

Anthony

21

21999

MP

13

32.5

8

Erika

21

21999

MP

17

42.5

The teachers want a report for each student showing how much money that student earned. They want each student’s report on a separate page so it is easy to hand out. Lastly, they want it to be easy for fourth graders to understand, with complete sentences. Here is the program:

 

* Write a report with FILE and PUT statements;

LIBNAME class 'c:MySASLib';

DATA _NULL_;

   SET class.sales;

   FILE 'c:MyTextFilesStudent.txt' PRINT;

   TITLE;

   PUT @5 'Cookie sales report for ' Name 'from classroom ' Class

     // @5 'Congratulations!  You sold ' Quantity 'boxes of cookies'

     / @5 'and earned ' Profit DOLLAR6.2 ' for our field trip.';

   PUT _PAGE_;

RUN;

Notice that the keyword _NULL_ appears in the DATA statement instead of a data set name. _NULL_ tells SAS not to bother writing a SAS data set (since the goal is to create a text file not a data set), and makes the program run slightly faster. The FILE statement creates the output file for the report, and the PRINT option tells SAS to include carriage returns and page breaks. The null TITLE statement tells SAS to eliminate any automatic titles.

The first PUT statement in this program starts with a pointer, @5, telling SAS to go to column 5. Then it tells SAS to print the words Cookie sales report for followed by the current value of the variable Name. The variables Name, Class, and Quantity are printed in list style, whereas Profit is printed using formatted style and the DOLLAR6.2 format. A slash line pointer tells SAS to skip to the next line; two slashes skips two lines. You could use multiple PUT statements instead of slashes to skip lines because SAS goes to a new line every time there is a new PUT statement. The statement PUT _PAGE_ inserts a page break after each student’s report. When the program is run, the log will contain these notes:

NOTE: 24 records were written to the file 'c:MyRawDataStudent.txt'.

NOTE: There were 8 observations read from CLASS.SALES.

The first three pages of the report look like this:

    Cookie sales report for Adriana from classroom 21

 

    Congratulations!  You sold 7 boxes of cookies

    and earned $17.50 for our field trip.

 

    Cookie sales report for Nathan from classroom 14

 

    Congratulations!  You sold 19 boxes of cookies

    and earned $47.50 for our field trip.

 

    Cookie sales report for Matthew from classroom 14

 

    Congratulations!  You sold 14 boxes of cookies

    and earned $35.00 for our field trip.

4.10   Summarizing Your Data Using PROC MEANS

One of the first things people usually want to do with their data, after reading it and making
sure it is correct, is to look at some simple statistics. Statistics such as the mean value, standard deviation, and minimum and maximum values give you a feel for your data. These types of information can also alert you to errors in your data (a score of 980 in a basketball game, for example, is suspect). The MEANS procedure provides simple statistics for numeric variables.

The MEANS procedure starts with the keywords PROC MEANS, followed by options:

PROC MEANS options;

Some options control how your data are summarized:

MAXDEC = n

rounds the value to n decimal places

MISSING

treats missing values as valid summary groups

Other options request specific summary statistics :

MAX

maximum value

MIN

minimum value

MEAN

mean

MEDIAN

median

MODE

mode

N

number of nonmissing values

NMISS

number of missing values

RANGE

range

STDDEV

standard deviation

SUM

sum

If you do not specify any summary statistics, SAS will print the number of nonmissing values, the mean, the standard deviation, and the minimum and maximum values for each variable. More options for PROC MEANS are listed in Section 9.3.

Here are some of the optional statements:

BY variable-list;

The BY statement performs a separate analysis for each level of the variables in the list. The data must first be sorted by these variables. (You can use PROC SORT to do this.)

CLASS variable-list;

The CLASS statement also performs a separate analysis for each level of the variables in the list, but its output is more compact than with the BY statement, and the data do not have to be sorted first.

VAR variable-list;

The VAR statement specifies which numeric variables to use in the analysis. If it is absent, then SAS uses all numeric variables.

Example  A wholesale nursery is selling garden flowers, and they want to summarize their sales figures by month. The data contain the customer ID, date of sale, and number of petunias, snapdragons, and marigolds sold:

756-01  05/04/2020 120  80 110

834-01  05/12/2020  90 160  60

901-02  05/18/2020  50 100  75

834-01  06/01/2020  80  60 100

756-01  06/11/2020 100 160  75

901-02  06/19/2020  60  60  60

756-01  06/25/2020  85 110 100

The following program reads the data from a file named Flowers.dat, and creates a permanent SAS data set named PLANTS. Then it computes a new variable, Month, which is the month of the sale; and summarizes the data by Month using PROC MEANS with a CLASS statement. The MAXDEC option is set to zero, so no decimal places will be printed.

LIBNAME garden 'c:MySASLib';

DATA garden.plants;

   INFILE 'c:MyRawDataFlowers.dat';

   INPUT CustID $ @9 SaleDate MMDDYY10. Petunia SnapDragon

         Marigold;

   Month = MONTH(SaleDate);

   FORMAT SaleDate MMDDYY10;

RUN;

* Calculate means by Month for flower sales;

PROC MEANS DATA = garden.plants MAXDEC = 0;

   CLASS Month;

   VAR Petunia SnapDragon Marigold;

   TITLE 'Summary of Flower Sales by Month';

RUN;

Here are the results of the PROC MEANS:

Summary of Flower Sales by Month

The MEANS Procedure

 

Month

N Obs

Variable

N

Mean

Std Dev

Minimum

Maximum

5

3

Petunia
SnapDragon
Marigold

3
3
3

87
113
82

35
42
26

50
80
60

120
160
110

6

4

Petunia
SnapDragon
Marigold

4
4
4

81
98
84

17
48
20

60
60
60

100
160
100

4.11   Writing Summary Statistics to a SAS Data Set

image

Sometimes you want to save summary statistics in a SAS data set for further analysis, or to merge with other data. For example, you might want to
plot the hourly temperature in your office to show how it heats up every afternoon causing you to fall asleep, but the instrument you have records data for every minute. The MEANS procedure can condense the data by computing the mean temperature for each hour, and then save the results in a SAS data set so it can be plotted.

There are two methods in PROC MEANS for saving summary statistics in a SAS data set. You can use the OUTPUT destination (covered in Section 5.13), or you can use the OUTPUT statement. The OUTPUT statement has the following form:

OUTPUT OUT = data-set output-statistic-list;

Here, data-set is the name of the SAS data set which will contain the results, and output-statistic-list specifies the statistics you want and the associated variable names. You can have more than one OUTPUT statement and multiple output statistic lists. This is one of the possible forms for output-statistic-list:

statistic(variable-list) = name-list

Here, statistic can be any of the statistics available in PROC MEANS (SUM, N, MEAN, for example), variable-list specifies the variables in the VAR statement you want to output, and
name-list specifies names for the new summary variables. The new variable names must be
in the same order as their corresponding variables in variable-list. For example, the following
PROC MEANS statements produce a new data set called ZOOSUM, which contains one observation with the variables LionWeight, the mean of the lions’ weights, and BearWeight,
the mean of the bears’ weights:

PROC MEANS DATA = zoo NOPRINT;

   VAR Lions Tigers Bears;

   OUTPUT OUT = zoosum MEAN(Lions Bears) = LionWeight BearWeight;

RUN;

The NOPRINT option in the PROC MEANS statement tells SAS there is no need to produce any printed results since we are saving the results in a SAS data set. (Using PROC SUMMARY is the same as using PROC MEANS with the NOPRINT option.)

The SAS data set created in the OUTPUT statement will contain all the variables defined in the output-statistic-list; any variables listed in a BY or CLASS statement; plus two new variables, _TYPE_ and _FREQ_. If there is no BY or CLASS statement, then the data set will have just one observation. If there is a BY statement, then the data set will have one observation for each level of the BY group. CLASS statements produce one observation for each level of interaction of the class variables. The value of the _TYPE_ variable depends on the level of interaction. The observation where _TYPE_ has a value of zero is the grand total.

 

Example  This example uses the permanent SAS data set created in the preceding section. The data are sales for a wholesale nursery with the customer ID; date of sale; the number of petunias, snapdragons, and marigolds sold, and the month of sale.

 

CustID

SaleDate

Petunia

SnapDragon

Marigold

Month

1

756-01

05/04/2020

120

80

110

5

2

834-01

05/12/2020

90

160

60

5

3

901-02

05/18/2020

50

100

75

5

4

834-01

06/01/2020

80

60

100

6

5

756-01

06/11/2020

100

160

75

6

6

901-02

06/19/2020

60

60

60

6

7

756-01

06/25/2020

85

110

100

6

The nursery wants to summarize the data, and then output the results as a SAS data set for further analysis. The following program uses the MEANS procedure with the NOPRINT option to summarize the data by CustID. Because a CLASS statement is used, the new data set will include one observation for each customer along with one for the grand total. The OUPTUT statement creates  a temporary SAS data set named TOTALS. The maximum values are given the variable names MaxP, MaxS, and MaxM, and the sums are given the original variable names Petunia, SnapDragon, and Marigold:

LIBNAME garden 'c:MySASLib';

* Calculate means by CustID, output sum and max to new data set;

PROC MEANS NOPRINT DATA = garden.plants;

   CLASS CustID;

   VAR Petunia SnapDragon Marigold;

   OUTPUT OUT = totals  

      MAX(Petunia SnapDragon Marigold) = MaxP MaxS MaxM

      SUM(Petunia SnapDragon Marigold) = Petunia SnapDragon Marigold;

RUN;

Here is the TOTALS data set. Notice that the _TYPE_ variable has a value of 0 for the grand total, and 1 for the summaries by customer. If you want just the summaries by customer (and no grand total), you could use a BY statement instead of a CLASS statement, but then the data would need to be sorted by CustID first.

 

CustID

_TYPE_

_FREQ_

MaxP

MaxS

MaxM

Petunia

SnapDragon

Marigold

1

 

0

7

120

160

110

585

730

580

2

756-01

1

3

120

160

110

305

350

285

3

834-01

1

2

90

160

100

170

220

160

4

901-02

1

2

60

100

75

110

160

135

4.12   Producing One-Way Frequencies with PROC FREQ

image

A frequency table is a simple list of counts answering the question “How many?” When you have counts for one variable, they are called one-way frequencies. When you combine two or more variables, the counts are called two-way frequencies, three-way frequencies, and so on. This section focuses on one-way frequencies while the next two sections cover more complex examples.

The most obvious reason for using PROC FREQ is to create tables showing the distribution of categorical data values, but PROC FREQ can also reveal irregularities in your data. You could get dizzy proofreading a large data set, but data entry errors are often glaringly obvious in a frequency table. The basic form of PROC FREQ is:

    PROC FREQ;

       TABLES variable-list;

If you skip the TABLES statement, then you will get one-way frequencies for all variables. With a TABLES statement, SAS will produce just the frequencies you specify. This statement would produce two frequency tables: one for the variable YearsEducation and another for Sex:

TABLES YearsEducation Sex;

You can specify any number of table requests in a single TABLES statement, and you can have
as many TABLES statements as you like.

Options, if any, appear after a slash in the TABLES statement and include the following:

MISSING

includes missing values in frequencies and in percentages

MISSPRINT

includes missing values in frequencies but not in percentages

NOCUM

suppresses cumulative frequencies in one-way frequencies

NOPERCENT

suppresses printing of percentages

NOPRINT

suppresses printing of frequency tables

OUT = data-set

writes a data set containing frequencies

The statement below, for example, tells SAS to treat missing values as valid and save the results in a temporary SAS data set named EDFREQS:

TABLES YearsEducation / MISSING OUT = edfreqs;

PROC FREQ also offers many statistical options. See Section 9.6 for more information.

Example  The proprietor of a coffee shop keeps a record of sales. For each drink sold, she records the type of coffee (cappuccino, espresso, kona, or iced coffee), and whether the customer walked in or came to the drive-up window. Here are the data with ten observations per line:

esp w cap d cap w kon w ice w kon d esp d kon w ice d esp d

cap w esp d cap d Kon d .   d kon w esp d cap w ice w kon w

kon w kon w ice d esp d kon w esp d esp w kon w cap w kon w

 

The following program reads the data from a file named Coffee.dat, and creates a permanent SAS data set named ORDERS in a directory named MySASLib (Windows). Then it produces one-way frequencies for two variables: Window and Coffee:

LIBNAME drinks 'c:MySASLib';

DATA drinks.orders;

   INFILE 'c:MyRawDataCoffee.dat';

   INPUT Coffee $ Window $ @@;

RUN;

* Print tables for Window and Coffee;

PROC FREQ DATA = drinks.orders;

   TABLES Window Coffee;

   TITLE 'Coffee Sales by Window and by Type of Drink';

RUN;

The output contains two tables. The first is a one-way frequency table for the variable Window. You can see that 13 customers came to the drive-up window while 17 walked into the restaurant. In addition to the frequency, the table also shows the percent, cumulative frequency, and cumulative percent for each value of the variable Window.

Coffee Sales by Window and by Type of Drink

The FREQ Procedure

 

Window

Frequency

Percent

Cumulative
Frequency

Cumulative
Percent

d

13

43.33

13

43.33

w

17

56.67

30

100.00

 

Coffee

Frequency

Percent

Cumulative
Frequency

Cumulative
Percent

Kon

1

3.45

1

3.45

cap

6

20.69

7

24.14

esp

8

27.59

15

51.72

ice

4

13.79

19

65.52

kon

10

34.48

29

100.00

Frequency Missing = 1

The second table is a one-way frequency table for Coffee. You can see how many customers ordered each type of drink. Notice that the missing value is noted at the bottom, but not included in the statistics. (Use the MISSING or MISSPRINT options if you want missing values to be included in the counts.) Also, there is one observation with a value of Kon for Coffee. Data values are case sensitive. This is a data entry error; it should be kon.

4.13   Producing Crosstabulations with PROC FREQ

One-way frequencies are a good start, but often you want to know how particular variables interact. You can cross any number of variables to get two-way frequencies, three-way frequencies, and so on, up to n-way frequencies. Tables combining two or more variables are also called crosstabulations or contingency tables.

To produce crosstabulations, separate the variables with asterisks. This statement produces a two-way table showing frequencies for each combination of values of the variables YearsEducation and Sex:

TABLES YearsEducation * Sex;

This statement produces a three-way table:

TABLES YearsEducation * Sex * Age;

Be careful though; reading crosstabulations of three or more levels can feel like playing three-dimensional tic-tac-toe without the benefit of a three-dimensional board.

You can use parentheses to group variables. This statement produces two, two-way tables: one for YearsEducation by Sex, and another for YearsEducation by Age:

TABLES YearsEducation * (Sex Age);

You can specify any number of table requests in a single TABLES statement, and you can have as many TABLES statements as you like.

Options, if any, appear after a slash in the TABLES statement. Possible options include some listed in the previous section (MISSPRINT, MISSING, NOPERCENT, NOPRINT, and OUT=) and some that apply specifically to crosstabulations:

CROSSLIST

displays crosstabulations in list format with totals

LIST

displays crosstabulations in list format without totals

NOCOL

suppresses column percentages in crosstabulations

NOROW

suppresses row percentages in crosstabulations

The statement below, for example, tells SAS to leave out all the percentages:

TABLES YearsEducation * Sex / NOPERCENT NOROW NOWCOL;

Example  This example uses the SAS data set created in the previous section. The data are from a coffee shop. For each drink sold, the owner records the type of coffee (cappuccino, espresso, kona, or iced coffee), and whether the customer walked in or came to the drive-up window.

 

The following program produces two, two-way frequency tables:

LIBNAME drinks 'c:MySASLib';

* Print two tables for Window by Coffee;

PROC FREQ DATA = drinks.orders;

   TABLES Window * Coffee;

   TABLES Window * Coffee / MISSING NOPERCENT NOCOL NOROW;

   TITLE 'Coffee Sales by Window and Type of Drink';

RUN;

Here are the results:

Coffee Sales by Window and Type of Drink

The FREQ Procedure

 

Table of Window by Coffee

Window

Coffee

Frequency
Percent
Row Pct
Col Pct

Kon

cap

esp

ice

kon

Total

d

1
3.45
8.33
100.00

2
6.90
16.67
33.33

6
20.69
50.00
75.00

2
6.90
16.67
50.00

1
3.45
8.33
10.00

12
41.38


w

0
0.00
0.00
0.00

4
13.79
23.53
66.67

2
6.90
11.76
25.00

2
6.90
11.76
50.00

9
31.03
52.94
90.00

17
58.62


Total

1
3.45

6
20.69

8
27.59

4
13.79

10
34.48

29
100.00

Frequency Missing = 1

 

Table of Window by Coffee

Window

Coffee

Frequency

 

Kon

cap

esp

ice

kon

Total

d

1

1

2

6

2

1

13

w

0

0

4

2

2

9

17

Total

1

1

6

8

4

10

30

These two tables are the same except that the second one uses options to include missing values in the table instead of in a note at the bottom, and to exclude all the percentages.

4.14   Grouping Data with User-Defined Formats

With user-defined formats you can change how variables are displayed (covered in Section 4.8), but user-defined formats are also a powerful tool for grouping data. Grouping data this way is a two-step process. First, use the FORMAT procedure to define a format that assigns all the values that you want to group together to a text string. Second, apply the new format to the variable you want to group in a PUT function in a DATA step or a FORMAT statement in a procedure.

PUT function  If you want to create a new variable with a user-defined format, you can use a PUT function in a DATA step. The original variable may be numeric or character, but the resulting variable is always character. The general form is:

new-variable = PUT(old-variable, user-defined-format.);

FORMAT statement  You can also group data directly in a procedure without creating a new variable. This is handy for cases where you have a lot of data, and it takes a long time to run a DATA step. Using this method, it is easy to change the groupings by simply creating a new format. This method works for procedures that group data such as PROC FREQ, PROC TABULATE, PROC MEANS with a CLASS statement, and PROC REPORT with GROUP or ACROSS variables. The general form is:

FORMAT old-variable user-defined-format.;

Example  The staff of the local library want to see the type of books people check out by age group. They have the age of the patron in years and the type of book (fiction, mystery, science fiction, biography, non-fiction, or reference). Here are the data with nine observations per line:

17 sci  9 bio 28 fic 50 mys 13 fic 32 fic 67 fic 81 non 38 non

53 non 16 sci 15 bio 61 fic 52 ref 22 mys 76 bio 37 fic 86 fic

49 mys 78 non 45 sci 64 bio  8 fic 11 non 41 fic 46 ref 69 fic

34 fic 26 mys 23 sci 74 ref 15 sci 27 fic 23 mys 63 fic 78 non

40 bio 12 fic 29 fic 54 mys 67 fic 60 fic 38 sci 42 fic 80 fic

Here is the program that creates three user-defined formats: one for the type of book and two to group the age data in different ways.

*Define formats to group the data;

PROC FORMAT;

   VALUE $typ

        'bio','non','ref' = 'Non-Fiction'

        'fic','mys','sci' = 'Fiction';

   VALUE agegpa

         0-18    = '0 to 18'

         19-25   = '19 to 25'

         26-49   = '26 to 49'

         50-HIGH = '  50+ ';

   VALUE agegpb

         0-25    = '0 to 25'

         26-HIGH = '  26+ ';

RUN;

 

DATA books;

   INFILE 'c:MyRawDataLibraryBooks.dat';

   INPUT Age Book $ @@;

   BookType = PUT(Book,$typ.);

RUN;

*Create two way table with Age grouped into four categories;

PROC FREQ DATA = books;

   TITLE 'Patron Age by Book Type: Four Age Groups';

   TABLES BookType * Age / NOPERCENT NOROW NOCOL;

   FORMAT Age agegpa.;

RUN;

*Create two way table with Age grouped into two categories;

PROC FREQ DATA = books;

   TITLE 'Patron Age by Book Type: Two Age Groups';

   TABLES BookType * Age / NOPERCENT NOROW NOCOL;

   FORMAT Age agegpb.;

RUN;

In a DATA step, a PUT function uses the format $TYP. to create a new variable named BookType that groups books into Fiction or Non-Fiction. Then the first PROC FREQ groups age into four categories using the format AGEGPA., while the second PROC FREQ groups age into two categories using the format AGEGPB. Because the NOPERCENT, NOROW, and NOCOL options were added to the TABLES statements, only frequencies appear in the results.

Patron Age by Book Type: Four Age Groups

The FREQ Procedure

Table of BookType by Age

BookType

Age

Frequency

0 to 18

19 to 25

26 to 49

50+

Total

Fiction

6

3

12

10

31

Non-Fiction

3

0

3

8

14

Total

9

3

15

18

45

 

Patron Age by Book Type: Two Age Groups

The FREQ Procedure

Table of BookType by Age

BookType

Age

Frequency

0 to 25

26+

Total

Fiction

9

22

31

Non-Fiction

3

11

14

Total

12

33

45

 

4.15   Producing Tabular Reports with PROC TABULATE

image

Every summary statistic the TABULATE procedure computes can also be produced by other procedures such as PRINT, MEANS, and FREQ, but PROC TABULATE is popular because its reports are pretty. If PROC TABULATE were a box, it would be gift-wrapped.

PROC TABULATE is so powerful that entire books have been written about it, but it is also so concise that you may feel like you’re reading hieroglyphics. If you find the syntax of PROC TABULATE a little hard to get used to, that may be because it has roots outside of SAS. PROC TABULATE is based in part on the Table Producing  
 Language, a language developed by the U.S. Department of Labor.

The general form of PROC TABULATE is:

PROC TABULATE;

   CLASS classification-variable-list;

   TABLE page-dimension, row-dimension, column-dimension;

The CLASS statement tells SAS which variables contain categorical data to be used for dividing observations into groups, while the TABLE statement tells SAS how to organize your table and which numbers to compute. Each TABLE statement defines only one table, but you can have multiple TABLE statements. If a variable is listed in a CLASS statement, then, by default, PROC TABULATE produces simple counts of the number of observations in each category of that variable. PROC TABULATE offers many other statistics, and the next section describes how to request those.

Dimensions  Each TABLE statement can specify up to three dimensions. Those dimensions, separated by commas, tell SAS which variables to use for the pages, rows, and columns in the report. If you specify only one dimension, then that becomes the column dimension. If you specify two dimensions, then you get rows and columns, but no page dimension. If you specify three dimensions, then you get pages, rows, and columns.

When you write a TABLE statement, start with the column dimension. Once you have that debugged, add the rows. Once you are happy with your rows and columns, you are ready
to add a page dimension, if you need one. Notice that the order of dimensions in the TABLE statement is page, then row, then column. So, to avoid scrambling your table when you add dimensions, insert the page and row specifications in front of the column dimension.

Missing data  By default, observations are excluded from tables if they have missing values for variables listed in a CLASS statement (even if that variable is not listed in the TABLE statement). If you want to keep these observations, then simply add the MISSING option to your PROC statement, like this:

PROC TABULATE MISSING;

 

Example  Here are data about day cruises by boat including the name of each boat, its home port, its type (sailing or power), the kind of vessel (schooner, catamaran, or yacht), the price of an excursion, and the length of the boat in feet.

Name         Port    Type  Vessel  Price Length

Silent Lady  Maalea  sail  sch     95.00     64

America II   Maalea  sail  yac     72.95     65

Aloha Anai   Lahaina sail  cat    112.00     60

Ocean Spirit Maalea  power cat     62.00     65

Anuenue      Maalea  sail  sch    177.50     52

Hana Lei     Maalea  power cat     88.99    110

Leilani      Maalea  power yac     99.99     45

Kalakaua     Maalea  power cat     69.50     70

Reef Runner  Lahaina power yac     59.95     50

Blue Dolphin Maalea  sail  cat     92.95     65

Suppose that you want a report showing the number of boats by Type and Vessel for each port. The following program reads the tab-delimited file using PROC IMPORT and creates a permanent SAS data set named BOATS. Then PROC TABULATE creates a three-dimensional report with the values of Port for the pages, Type for the rows, and Vessel for the columns.

LIBNAME trips 'c:MySASLib';

PROC IMPORT DATAFILE = 'c:MyRawDataBoats.txt' OUT = trips.boats REPLACE;

RUN;

* Tabulations with three dimensions;

PROC TABULATE DATA = trips.boats;

   CLASS Port Type Vessel;

   TABLE Port, Type, Vessel;

   TITLE 'Number of Boats by Port, Type, and Vessel';

RUN;

This report has two pages, one for each value of the page dimension. Here is one page:

Number of Boats by Port, Type, and Vessel

Port Maalea

 

Vessel

cat

sch

yac

N

N

N

Type

3

.

1

power

sail

1

2

1

The value of the page dimension appears above the top, left corner of the table. You can see that this is the page for the port of Maalea. The heading N tells you that the numbers in this table are simple counts, the number of boats in each group.

4.16   Adding Statistics to PROC TABULATE Output

By default, PROC TABULATE produces simple counts for variables listed in a CLASS statement, but you can request many other statistics in a TABLE statement. You can also concatenate or cross variables within dimensions. In fact, you can write TABLE statements so complicated that even you won’t know what the report is going to look like until you run it.

While the CLASS statement lists categorical variables, the VAR statement tells SAS which variables contain continuous data. Here is the general form:

PROC TABULATE;

   VAR analysis-variable-list;

   CLASS classification-variable-list;

   TABLE page-dimension, row-dimension, column-dimension;

While the variables in a CLASS statement can be either numeric or character, variables in a VAR statement must be numeric. You may have both a CLASS statement and a VAR statement, or just one, but all variables listed in a TABLE statement must also appear in either a CLASS or a VAR statement.

Keywords  In addition to variable names, each dimension can contain keywords. Here are a few of the values that PROC TABULATE can compute:

ALL

adds a row, column, or page showing the total

MAX

highest value

MIN

lowest value

MEAN

arithmetic mean

MEDIAN

median

MODE

mode

N

number of nonmissing values

NMISS

number of missing values

PCTN

percentage of observations for that group

PCTSUM

percentage of total represented by that group

STDDEV

standard deviation

SUM

sum

Concatenating, crossing, and grouping  Within a dimension, variables and keywords can be concatenated, crossed, or grouped. To concatenate variables or keywords, simply list them separated by a space; to cross variables or keywords, separate them with an asterisk (*); and to group them, enclose the variables or keywords in parentheses. The keyword ALL is generally concatenated. To request other statistics, however, cross that keyword with the variable name.

Concatenating:

TABLE Type Vessel ALL;

Crossing:

TABLE MEAN * Price;

Crossing, grouping, and concatenating:

TABLE PCTN *(Type Vessel);

 

Example  This example uses the permanent SAS data set created in the preceding section. The data contain the name of each boat, its home port, its type (sailing or power), the kind of vessel (schooner, catamaran, or yacht), the price of an excursion, and the boat's length.

 

Name

Port

Type

Vessel

Price

Length

1

Silent Lady

Maalea

sail

sch

95

64

2

America II

Maalea

sail

yac

72.95

65

3

Aloha Anai

Lahaina

sail

cat

112

60

4

Ocean Spirit

Maalea

power

cat

62

65

5

Anuenue

Maalea

sail

sch

177.5

52

6

Hana Lei

Maalea

power

cat

88.99

110

7

Leilani

Maalea

power

yac

99.99

45

8

Kalakaua

Maalea

power

cat

69.5

70

9

Reef Runner

Lahaina

power

yac

59.95

50

10

Blue Dolphin

Maalea

sail

cat

92.95

65

The following program creates another report about the boats. However, this PROC TABULATE includes a VAR statement. The TABLE statement in this program contains only two dimensions; but it also concatenates, crosses, and groups variables and statistics.

LIBNAME trips 'c:MySASLib';

* Tabulations with two dimensions and statistics;

PROC TABULATE DATA = trips.boats;

   CLASS Type Vessel;

   VAR Price;

   TABLE Type ALL, MEAN*Price*(Vessel ALL);

   TITLE 'Mean Price by Type and Vessel';

RUN;

The row dimension of this table concatenates the classification variable Type with ALL to produce totals. The column dimension, on the other hand, crosses MEAN with the analysis variable Price and with the classification variable Vessel (which happens to be concatenated and grouped with ALL). Here are the results:

Mean Price by Type and Vessel

 

Mean

Price

Vessel

All

cat

sch

yac

Type

73.50

.

79.97

76.09

power

sail

102.48

136.25

72.95

110.08

All

85.09

136.25

77.63

93.08

4.17   Enhancing the Appearance of PROC TABULATE Output

When you use PROC TABULATE, SAS wraps your data in tidy little boxes, but there may be times when they just don’t look right. Using three simple options, you can enhance the appearance of your output. Think of it as changing the wrapping paper.

FORMAT= option  To change the format of all the data cells in your table, use the FORMAT= option in your PROC statement. For example, if you needed the numbers in your table to have commas and no decimal places, you could use this PROC statement:

PROC TABULATE FORMAT = COMMA8.0;

telling SAS to use the COMMA8.0 format for all the data cells in the table.

To apply a format to an individual variable, cross it with the variable name, like this:

variable-name*FORMAT=formatw.d

Then you insert this rather convoluted construction into your TABLE statement:

TABLE Region, Sales*FORMAT=COMMA8.0 Profit*FORMAT=DOLLAR10.2;

This TABLE statement applies the COMMA8.0 format to a variable named Sales, and the DOLLAR10.2 format to Profit. If you specify formats in both the PROC and TABLE statements, then the TABLE statement will override the PROC statement.

BOX= and MISSTEXT= options  While the FORMAT= option can go in either PROC or TABLE statements, the BOX= and MISSTEXT= options go only in TABLE statements. The BOX= option allows you to insert a brief phrase in the normally empty box that appears in the upper left corner of every PROC TABULATE report. Using this empty space can give your reports a polished look. The MISSTEXT= option, on the other hand, specifies a value for SAS to print in empty data cells. The period that SAS prints, by default, for missing values can seem downright mysterious to someone, perhaps your CEO, who is not familiar with SAS output. You can give them something more meaningful with the MISSTEXT= option. This statement:

TABLE Region, MEAN*Sales / BOX='Mean Sales by Region' MISSTEXT='No Sales';

tells SAS to print the title “Mean Sales by Region” in the upper left corner of the table, and to print the words “No Sales” in any cells of the table that have no data. The BOX= and MISSTEXT= options must be separated from the dimensions of the TABLE statement by a slash.

Example  This example uses the permanent SAS data set created in Section 4.15. The data contain the name of each boat, its home port, its type (sailing or power), the kind of vessel (schooner, catamaran, or yacht), the price of an excursion, and the boat's length.

 

 

Name

Port

Type

Vessel

Price

Length

1

Silent Lady

Maalea

sail

sch

95

64

2

America II

Maalea

sail

yac

72.95

65

3

Aloha Anai

Lahaina

sail

cat

112

60

4

Ocean Spirit

Maalea

power

cat

62

65

5

Anuenue

Maalea

sail

sch

177.5

52

6

Hana Lei

Maalea

power

cat

88.99

110

7

Leilani

Maalea

power

yac

99.99

45

8

Kalakaua

Maalea

power

cat

69.5

70

9

Reef Runner

Lahaina

power

yac

59.95

50

10

Blue Dolphin

Maalea

sail

cat

92.95

65

This program is similar to the one in the previous section. However, the variable Length has been added to the VAR and TABLE statements along with the FORMAT=, BOX=, and MISSTEXT= options in the PROC and TABLE statements. Because the BOX= option serves as a title, a null TITLE statement is used to remove the usual title.

LIBNAME trips 'c:MySASLib';

* PROC TABULATE report with options;

PROC TABULATE DATA = trips.boats FORMAT = DOLLAR7.2;

   CLASS Type Vessel;

   VAR Price Length;

   TABLE Type ALL,

         MEAN * (Price Length*FORMAT=2.0) * (Vessel ALL)

         /BOX='Full Day Excursions' MISSTEXT='none';

   TITLE;

RUN;

Here is the enhanced output:

Full Day Excursions

Mean

Price

Length

Vessel

All

Vessel

All

cat

sch

yac

cat

sch

yac

Type

$73.50

none

$79.97

$76.09

82

none

48

68

power

sail

$102.48

$136.25

$72.95

$110.08

63

58

65

61

All

$85.09

$136.25

$77.63

$93.08

74

58

53

65

Notice that the data cells for Price now use the DOLLAR7.2 format as specified in the PROC statement, while the cells for Length use the 2.0 format specified in the TABLE statement. The text “Full Day Excursions” appears in the upper left corner, which was empty in the previous section. In addition, the one data cell with no data shows the word “none” instead of a period.

4.18   Changing Headers in PROC TABULATE Output

The TABULATE procedure produces reports with a lot of headers. Sometimes there are so many headers that your reports look cluttered; at other times you may simply feel that a different header would be more meaningful. Before you can change a header, though, you need to understand what type of header it is. PROC TABULATE reports have two basic types of headers: headers that are the values of variables listed in a CLASS statement, and headers that are the names of variables and keywords. You use different methods to change different types of headers.

CLASS variable values  To change headers that are the values of variables listed in a CLASS statement, use the FORMAT procedure to create a user-defined format. Then assign the format to the variable in a FORMAT statement (discussed in Section 4.8).

Variable names and keywords  You can change headers that are the names of variables using a simple LABEL statement as described in Section 4.1, but that won’t work for keywords. There is another method that works for both variable names and keywords. In the TABLE statement, you put an equal sign after the variable or keyword followed by the new header enclosed in quotation marks. You can eliminate a header entirely by setting it equal to blank (two quotation marks with nothing in between), and SAS will remove the box for that header. This TABLE statement tells SAS to remove the headers for Region and MEAN, and to change the header for the variable Sales to “Mean Sales by Region.”

TABLE Region='', MEAN=''*Sales='Mean Sales by Region';

In some cases SAS leaves the empty box when a row header is set to blank. This happens for statistics and analysis variables (but not class variables). To force SAS to remove the empty box, add the ROW=FLOAT option to the end of your TABLE statement, like this:

TABLE MEAN=''*Sales='Mean Sales by Region', Region='' / ROW=FLOAT;

Example  This example uses the permanent SAS data set created in Section 4.15. The data contain the name of each boat, its home port, its type (sailing or power), the kind of vessel (schooner, catamaran, or yacht), the price of an excursion, and the boat's length.

 

Name

Port

Type

Vessel

Price

Length

1

Silent Lady

Maalea

sail

sch

95

64

2

America II

Maalea

sail

yac

72.95

65

3

Aloha Anai

Lahaina

sail

cat

112

60

4

Ocean Spirit

Maalea

power

cat

62

65

5

Anuenue

Maalea

sail

sch

177.5

52

6

Hana Lei

Maalea

power

cat

88.99

110

7

Leilani

Maalea

power

yac

99.99

45

8

Kalakaua

Maalea

power

cat

69.5

70

9

Reef Runner

Lahaina

power

yac

59.95

50

10

Blue Dolphin

Maalea

sail

cat

92.95

65

 

The following program shows how to change headers. To start with, a FORMAT procedure creates a user-defined format named $VES. Then the $VES. format is assigned to the variable Vessel using a FORMAT statement so that the formatted data values will be used for column headers. In the TABLE statement, more headers are changed. The headers for Type, MEAN, and Vessel are all set to blank, while the header for Price is set to “Mean Price by Kind of Vessel”.

LIBNAME trips 'c:MySASLib';

* Changing headers;

PROC FORMAT;

   VALUE $ves  'cat' = 'catamaran'

               'sch' = 'schooner'

               'yac' = 'yacht';

RUN;

PROC TABULATE DATA = trips.boats FORMAT=DOLLAR7.2;

   CLASS Type Vessel;

   VAR Price;

   FORMAT Vessel $ves.;

   TABLE Type='' ALL,

      MEAN=''*Price='Mean Price by Kind of Vessel'*(Vessel='' ALL)

      /BOX='Full Day Excursions' MISSTEXT='none';

   TITLE;

RUN;

This program does not require the ROW=FLOAT option because the only variable being set to blank in the row dimension is a class variable. If you put an analysis variable or a statistics keyword in the row dimension and set it equal to blank, then you would need to add the ROW=FLOAT option to remove empty boxes.

Here is the output:

Full Day Excursions

Mean Price by Kind of Vessel

catamaran

schooner

yacht

All

power

$73.50

none

$79.97

$76.09

sail

$102.48

$136.25

$72.95

$110.08

All

$85.09

$136.25

$77.63

$93.08

 

4.19   Producing Simple Output with PROC REPORT

image

The REPORT procedure shares features with the PRINT, MEANS, TABULATE, and SORT procedures and the DATA step. With all those features rolled into one procedure, it’s not surprising that PROC REPORT can be complexin fact entire books have been written about itbut with all those features comes power.

 Here is the general form of a basic REPORT procedure:

     PROC REPORT;

          COLUMN variable-list;

In its simplest form, the COLUMN statement is similar to a VAR statement in PROC PRINT, telling SAS which variables to include and in what order. If you leave out the COLUMN statement, SAS will, by default, include all the variables in your data set.

Numeric versus character data  The type of report you get from PROC REPORT depends, in part, on the type of data you use. If you have at least one character variable in your report, then, by default, you will get a detail report with one row per observation. If, on the other hand, your report includes only numeric variables, then, by default, PROC REPORT will sum those variables. Even dates will be summed, by default, because they are numeric. (You can override this default by assigning one of your numeric variables a usage type of DISPLAY in a DEFINE statement. See the next section.)

Example  Here are data about national parks and monuments in the USA. The variables are name, type (NP for national park or NM for national monument), region (East or West), number of museums (including visitor centers), and number of campgrounds.

Dinosaur               NM  West  3   6

Everglades             NP  East  4   2

Grand Canyon           NP  West  3   3

Great Smoky Mountains  NP  East  7  10

Hawaii Volcanoes       NP  West  1   2

Statue of Liberty      NM  East  2   0

Theodore Roosevelt     NP  .     2   2

Yellowstone            NP  West  8  12

Yosemite               NP  West  5  11

The following program creates a permanent SAS data set named NATPARKS, and then runs two reports. The first report has no COLUMN statement so SAS will use all the variables, while the second uses a COLUMN statement to select just the numeric variables.

 

LIBNAME visit 'c:MySASLib';

DATA visit.natparks;

   INFILE 'c:MyRawDataParks.dat';

   INPUT Name $ 1-21 Type $ Region $ Museums Camping;

RUN;

PROC REPORT DATA = visit.natparks;

   TITLE 'Report with Character and Numeric Variables';

RUN;

PROC REPORT DATA = visit.natparks;

   COLUMN Museums Camping;

   TITLE 'Report with Only Numeric Variables';

RUN;

While the two PROC steps are only slightly different, the reports they produce differ dramatically. The first report is almost identical to the output you would get from a PROC PRINT except for the absence of the OBS column. The second report, since it contained only numeric variables, was summed.

Report with Character and Numeric Variables

Name

Type

Region

Museums

Camping

Dinosaur

NM

West

3

6

Everglades

NP

East

4

2

Grand Canyon

NP

West

3

3

Great Smoky Mountains

NP

East

7

10

Hawaii Volcanoes

NP

West

1

2

Statue of Liberty

NM

East

2

0

Theodore Roosevelt

NP

 

2

2

Yellowstone

NP

West

8

12

Yosemite

NP

West

5

11

 

Report with Only Numeric Variables

Museums

Camping

35

48

4.20   Using DEFINE Statements in PROC REPORT

The DEFINE statement is a general purpose statement that specifies options for an individual variable. You can have a DEFINE statement for every variable, but you only need to have a DEFINE statement if you want to specify an option for that particular variable. The general form of a DEFINE statement is:

DEFINE variable / options 'column-header';

In a DEFINE statement, you specify the variable name followed by a slash and any options for that particular variable.

Usage options  The most important option is a usage option that tells SAS how that variable is to be used. Possible values of usage options include the following:

ACROSS

creates a column for each unique value of the variable.

ANALYSIS

calculates statistics for the variable. This is the default usage for numeric variables, and the default statistic is SUM.

COMPUTED

creates a new variable whose value you calculate in a compute block. See Section 4.24 for a discussion of compute blocks.

DISPLAY

creates one row for each observation in the data set. This is the default usage for character variables.

GROUP

creates one row for each unique value of the variable.

ORDER

creates one row for each observation with rows arranged according to the values of the order variable.

Changing column headers  There are several ways to change column headers in PROC REPORT including using a LABEL statement, as described in Section 4.1, or specifying a column header in a DEFINE statement. The following statement tells SAS to arrange a report by the values of the variable Age, and use the words “Age at Admission” as the column header for that variable. Putting a slash in a column header tells SAS to split the header at that point.

DEFINE Age / ORDER 'Age at/Admission';

Missing data  By default, observations are excluded from reports if they have missing values for variables with a usage type of ORDER, GROUP, or ACROSS. If you want to keep these observations, then simply add the MISSING option to your PROC statement, like this:

PROC REPORT MISSING;

Example  This example uses the permanent SAS data set created in the preceding section with data about national parks and monuments in the USA. The variables are name, type (NP for national park or NM for national monument), region (East or West), number of museums (including visitor centers), and number of campgrounds.

 

 

Name

Type

Region

Museums

Camping

1

Dinosaur

NM

West

3

6

2

Everglades

NP

East

4

2

3

Grand Canyon

NP

West

3

3

4

Great Smoky Mountains

NP

East

7

10

5

Hawaii Volcanoes

NP

West

1

2

6

Statue of Liberty

NM

East

2

0

7

Theodore Roosevelt

NP

 

2

2

8

Yellowstone

NP

West

8

12

9

Yosemite

NP

West

5

11

The following PROC REPORT contains two DEFINE statements. The first defines Region with a usage type of ORDER. The second specifies a column header for the variable Camping. Camping is a numeric variable and has a default usage of ANALYSIS, so the DEFINE statement does not change its usage. The MISSING option in the PROC statement tells SAS to include observations with missing values of Region.

LIBNAME visit 'c:MySASLib';

* PROC REPORT with ORDER variable, MISSING option, and column header;

PROC REPORT DATA = visit.natparks MISSING;

   COLUMN Region Name Museums Camping;

   DEFINE Region / ORDER;

   DEFINE Camping / ANALYSIS 'Campgrounds';

   TITLE 'National Parks and Monuments Arranged by Region';

RUN;

Here are the results:

National Parks and Monuments Arranged by Region

Region

Name

Museums

Campgrounds

 

Theodore Roosevelt

2

2

East

Everglades

4

2

 

Great Smoky Mountains

7

10

 

Statue of Liberty

2

0

West

Dinosaur

3

6

 

Grand Canyon

3

3

 

Hawaii Volcanoes

1

2

 

Yellowstone

8

12

 

Yosemite

5

11

4.21   Creating Summary Reports with PROC REPORT

Two different usage types tell the REPORT procedure to “roll up” data into summary groups. While the GROUP usage type produces summary rows, the ACROSS usage type produces summary columns. However, keep in mind that, if you have any display or order variables in a COLUMN statement, they will override this behavior, and SAS will produce a detail report instead of a summary report.

Group variables  Defining a group variable is fairly simple. Just specify the GROUP usage option in a DEFINE statement. By default, analysis variables will be summed. (PROC REPORT can produce many other statistics, see Section 4.23.) The following PROC REPORT tells SAS to produce a report showing the sum of Salary and of Bonus with a row for each value of Department.

image

PROC REPORT DATA = employees;

          COLUMN Department Salary Bonus;

   DEFINE Department / GROUP;

RUN;

 

Across variables  To define an across variable, you also use a DEFINE statement. However, by default, SAS produces counts rather than sums. To obtain sums for across variables, you must tell SAS which variables to summarize. You do that by putting a comma between the across variable and analysis variable (or variables if you enclose them in parentheses). The following PROC REPORT tells SAS to produce a report showing the sum of Salary and of Bonus with one column for each value of Department.

image

PROC REPORT DATA = employees;

   COLUMN Department , (Salary Bonus);

   DEFINE Department / ACROSS;

RUN;

Example  This example uses the permanent SAS data set created in Section 4.19 with data about national parks and monuments in the USA. The variables are name, type (NP for national park or NM for national monument), region, number of museums, and number of campgrounds.

 

Name

Type

Region

Museums

Camping

1

Dinosaur

NM

West

3

6

2

Everglades

NP

East

4

2

3

Grand Canyon

NP

West

3

3

4

Great Smoky Mountains

NP

East

7

10

5

Hawaii Volcanoes

NP

West

1

2

6

Statue of Liberty

NM

East

2

0

7

Theodore Roosevelt

NP

 

2

2

8

Yellowstone

NP

West

8

12

9

Yosemite

NP

West

5

11

 

The following program contains two PROC REPORTs. In the first, Region and Type are both defined as group variables. In the second, Region is still a group variable, but Type is an across variable. Notice that the two COLUMN statements are the same except for punctuation added to the second procedure to cross the across variable with the analysis variables.

LIBNAME visit 'c:MySASLib';

* Region and Type as GROUP variables;

PROC REPORT DATA = visit.natparks;

   COLUMN Region Type Museums Camping;

   DEFINE Region / GROUP;

   DEFINE Type / GROUP;

   TITLE 'Summary Report with Two Group Variables';

RUN;

* Region as GROUP and Type as ACROSS with sums;

PROC REPORT DATA = visit.natparks;

   COLUMN Region Type,(Museums Camping);

   DEFINE Region / GROUP;

   DEFINE Type / ACROSS;

   TITLE 'Summary Report with a Group and an Across Variable';

RUN;

Here is the resulting output:

Summary Report with Two Group Variables

Region

Type

Museums

Camping

East

NM

2

0

 

NP

11

12

West

NM

3

6

 

NP

17

28

 

Summary Report with a Group and an Across Variable

 

Type

 

NM

NP

Region

Museums

Camping

Museums

Camping

East

2

0

11

12

West

3

6

17

28

 

4.22   Adding Summary Breaks to PROC REPORT Output

Two kinds of statements allow you to insert breaks into a report. The BREAK statement adds a break for each unique value of the variable you specify, while the RBREAK statement does the same for the entire report (or BY-group if you are using a BY statement). The general forms of these statements are:

BREAK  location variable / options;

RBREAK location / options;

where location has two possible values—BEFORE or AFTER—depending on whether you want the break to precede or follow that particular section of the report. The options that come after the slash tell SAS what kind of break to insert. Some of the possible options are:

PAGE

starts a new page

SUMMARIZE

inserts summary statistics for numeric variables

Notice that the BREAK statement requires you to specify a variable, but the RBREAK statement does not. That’s because the RBREAK statement produces only one break (at the beginning or end of the report or BY group), while the BREAK statement produces one break for every unique value of the variable you specify. That variable must be either a group or an order variable and, therefore, must also be listed in a DEFINE statement with either the GROUP or ORDER usage option. You can use an RBREAK statement in any report, but you can use BREAK only if you have at least one group or order variable.

Example  This example uses the permanent SAS data set created in Section 4.19 with data about national parks and monuments in the USA. The variables are name, type (NP for national park or NM for national monument), region (East or West), number of museums (including visitor centers), and number of campgrounds.

 

Name

Type

Region

Museums

Camping

1

Dinosaur

NM

West

3

6

2

Everglades

NP

East

4

2

3

Grand Canyon

NP

West

3

3

4

Great Smoky Mountains

NP

East

7

10

5

Hawaii Volcanoes

NP

West

1

2

6

Statue of Liberty

NM

East

2

0

7

Theodore Roosevelt

NP

 

2

2

8

Yellowstone

NP

West

8

12

9

Yosemite

NP

West

5

11

 

 

The following program defines Region as an order variable, and then uses both BREAK and RBREAK statements with the AFTER location. The SUMMARIZE option tells SAS to print totals for the numeric variables.

LIBNAME visit 'c:MySASLib';

* PROC REPORT with breaks;

PROC REPORT DATA = visit.natparks;

   COLUMN Name Region Museums Camping;

   DEFINE Region / ORDER;

   BREAK AFTER Region / SUMMARIZE;

   RBREAK AFTER / SUMMARIZE;

   TITLE 'Detail Report with Summary Breaks';

RUN;

Here is the resulting output:

Detail Report with Summary Breaks

Name

Region

Museums

Camping

Everglades

East

4

2

Great Smoky Mountains

 

7

10

Statue of Liberty

 

2

0

 

East

13

12

Dinosaur

West

3

6

Grand Canyon

 

3

3

Hawaii Volcanoes

 

1

2

Yellowstone

 

8

12

Yosemite

 

5

11

 

West

20

34

 

 

33

46

 

4.23   Adding Statistics to PROC REPORT Output

There are several ways to request statistics in the REPORT procedure. An easy method is to insert statistics keywords directly into the COLUMN statement along with the variable names. This is a little like requesting statistics in a TABLE statement in PROC TABULATE, except that instead of using an asterisk to cross a statistics keyword with a variable, you use a comma. In fact, PROC REPORT can produce all the same statistics as PROC TABULATE and PROC MEANS because it uses the same internal engine to compute those statistics. These are a few of the statistics that PROC REPORT can compute:

MAX

highest value

MIN

lowest value

MEAN

arithmetic mean

MEDIAN

median

MODE

mode

N

number of nonmissing values

NMISS

number of missing values

PCTN

percentage of observations for that group

PCTSUM

percentage of a total sum represented by that group

STD

standard deviation

SUM

sum

Applying statistics to variables  To request a statistic for a particular variable, insert a comma between the statistic and the variable in the COLUMN statement. One statistic, N, does not require a comma because it does not apply to a particular variable. If you insert N in a COLUMN statement, then SAS will print the number of observations that contributed to that row of the report. This statement tells SAS to print two columns of data: the median of a variable named Age, and the number of observations in that row.

COLUMN  Age,MEDIAN  N;

To request multiple statistics or statistics for multiple variables, put parentheses around the statistics or variables. This statement uses parentheses to request two statistics for the variable Age, and then requests one statistic for two variables, Height and Weight.

COLUMN Age,(MIN MAX) (Height Weight),MEAN;

Example  This example uses the permanent SAS data set created in Section 4.19 with data about national parks and monuments in the USA. The variables are name, type, region, number of museums, and number of campgrounds.

 

The following program contains two PROC REPORTs. Both procedures request the statistics N and MEAN, but the first report defines Type as a group variable, while the second defines Type as an across variable.

LIBNAME visit 'c:MySASLib';

*Statistics in COLUMN statement with two group variables;

PROC REPORT DATA = visit.natparks;

   COLUMN Region Type N (Museums Camping),MEAN;

   DEFINE Region / GROUP;

   DEFINE Type / GROUP;

   TITLE 'Statistics with Two Group Variables';

RUN;

*Statistics in COLUMN statement with group and across variables;

PROC REPORT DATA = visit.natparks;

   COLUMN Region N Type,(Museums Camping),MEAN;

   DEFINE Region / GROUP;

   DEFINE Type / ACROSS;

   TITLE 'Statistics with a Group and Across Variable';

RUN;

Here is the resulting output:

Statistics with Two Group Variables

 

Museums

Camping

Region

Type

N

MEAN

MEAN

East

NM

1

2

0

 

NP

2

5.5

6

West

NM

1

3

6

 

NP

4

4.25

7

 

Statistics with a Group and Across Variable

 

Type

 

NM

NP

 

Museums

Camping

Museums

Camping

Region

N

MEAN

MEAN

MEAN

MEAN

East

3

2

0

5.5

6

West

5

3

6

4.25

7

Notice that these reports are similar to the reports in Section 4.21 except that these contain counts and means instead of sums.

4.24   Adding Computed Variables to PROC REPORT Output

Unlike most procedures, the REPORT procedure has the ability to compute not only statistics (like sums and means) but also new variables. You do this using a compute block. Compute blocks start with a COMPUTE statement and end with an ENDCOMP statement. In between, you put the programming statements to calculate your new variable. PROC REPORT uses a limited set of programming statements including assignment statements, IF-THEN/ELSE statements, and DO loops. You do not have to specify a DEFINE statement for the new variable, but if you do, then give it a usage type of COMPUTED. The general form of these statements is:

DEFINE new-variable-name / COMPUTED;

COMPUTE new-variable-name / options;

   programming statements

ENDCOMP;

Computing a numeric variable  For a numeric variable, simply name the new variable in the COMPUTE statement. If you use any variables with a type of analysis in the compute block, you must append the variable name with its statistic. The default statistic for an analysis variable is SUM. The following statements compute a variable named Income by adding Salary and Bonus:

DEFINE Salary / ANALYSIS SUM;

DEFINE Bonus / ANALYSIS SUM;

DEFINE Income / COMPUTED;

COMPUTE  Income;

   Income = Salary.SUM + Bonus.SUM;

ENDCOMP;

Computing a character variable  For a character variable, add the CHAR option to the COMPUTE statement. You will probably also want to include the LENGTH option. Lengths for computed character variables range from 1 to 200, with a default of 8. The following statements compute a variable named JobType using IF-THEN/ELSE statements:

DEFINE Title / DISPLAY;

DEFINE JobType / COMPUTED;

COMPUTE  JobType / CHAR LENGTH = 10;

   IF Title = 'Programmer' THEN JobType = 'Technical';

   ELSE JobType = 'Other';

ENDCOMP;

Example  This example uses the permanent SAS data set created in Section 4.19 with data about national parks and monuments in the USA. The variables are name, type, region, number of museums, and number of campgrounds.

 

 

The following PROC REPORT computes two variables named Facilities and Note. Facilities is a numeric variable equal to the number of museums plus the number of campgrounds. Note is a character variable, which is equal to “No Camping” for parks that have no campgrounds. Notice that the variables Museums and Camping must be listed in the COLUMN statement because they are used to compute the new variables. In order to exclude them from the report, the program uses the NOPRINT option in DEFINE statements.

LIBNAME visit 'c:MySASLib';

* COMPUTE new variables that are numeric and character;

PROC REPORT DATA = visit.natparks;

   COLUMN Name Region Museums Camping Facilities Note;

   DEFINE Museums / ANALYSIS SUM NOPRINT;

   DEFINE Camping / ANALYSIS SUM NOPRINT;

   DEFINE Facilities / COMPUTED 'Campgrounds/and/Museums';

   DEFINE Note / COMPUTED;

   COMPUTE Facilities;

      Facilities = Museums.SUM + Camping.SUM;

   ENDCOMP;

   COMPUTE Note / CHAR LENGTH = 10;

      IF Camping.SUM = 0 THEN Note = 'No Camping';

   ENDCOMP;

   TITLE 'Report with Two Computed Variables';

RUN;

Here is the resulting output:

Report with Two Computed Variables

Name

Region

Campgrounds
and
Museums

Note

Dinosaur

West

9

 

Everglades

East

6

 

Grand Canyon

West

6

 

Great Smoky Mountains

East

17

 

Hawaii Volcanoes

West

3

 

Statue of Liberty

East

2

No Camping

Theodore Roosevelt

 

4

 

Yellowstone

West

20

 

Yosemite

West

16

 

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset