Chapter 4 Sorting, Printing, and Summarizing Your Data

Using a procedure, also called a PROC, is like filling out a form. Someone else designed the form, and all you have to do is fill in the blanks and choose from a list of options. Each PROC has its own unique form with its own list of options. But while each procedure is unique, there are similarities too. This section discusses some of those similarities.

All procedures have required statements, and most have optional statements. For example, the only statement required in PROC PRINT is the PROC statement:

PROC PRINT;

However, by adding optional statements you could make this procedure a dozen lines long or even more.

PROC statement All procedures start with the keyword PROC followed by the name of the procedure, such as PRINT or CONTENTS. Options, if there are any, follow the procedure name. The DATA= option tells SAS which data set to use as input for that procedure. The DATA= option is, of course, optional. If you skip it, then SAS will use the most recently created data set, which is not necessarily the same as the most recently used. Sometimes it is easier to specify the data set you want than to figure out which data set SAS will use by default. In this case, SAS will use a temporary SAS data set named BANANA:

PROC CONTENTS DATA = banana;

To use a permanent SAS data set, add the libref to the data set name. In this case, SAS will use a data set named BANANA that is stored in a SAS data library named TROPICAL:

PROC CONTENTS DATA = tropical.banana;

You can also refer to a permanent SAS data set directly by placing your operating environment’s name for it between quotation marks.

PROC CONTENTS DATA = 'c:MySASLibanana';

See Section 2.2 for a more complete discussion of how to reference SAS data libraries and data set members.

BY statement The BY statement is required for only one procedure, PROC SORT. In PROC SORT the BY statement tells SAS how to arrange the observations. In all other procedures, the BY statement is optional, and tells SAS to perform a separate analysis for each combination of values of the BY variables rather than treating all observations as one group. For example, this statement tells SAS to run a separate analysis for each state:

BY State;

All procedures, except PROC SORT, assume that your data are already sorted by the variables in your BY statement. If your observations are not already sorted, then use PROC SORT to do the job.

TITLE and FOOTNOTE statements The TITLE statement specifies titles to print at the top of your procedure output. FOOTNOTE works the same way, but prints at the bottom of the page. These global statements are not technically part of any step. You can put them anywhere in your program, but since they apply to the procedure output it generally makes sense to put them with the procedure. The most basic TITLE statement consists of the keyword TITLE followed by your title enclosed in quotation marks. SAS doesn’t care if the two quotation marks are single or double as long as they are the same:

TITLE 'This is a title';

If you find that your title contains an apostrophe, use double quotation marks around the title, or replace the single apostrophe with two:

TITLE "Here's another title";

TITLE 'Here''s another title';

You can specify up to ten titles or footnotes by adding numbers to the keywords TITLE and FOOTNOTE:

FOOTNOTE3 'This is the third footnote';

When you specify a new title or footnote, it replaces the old title or footnote with the same number and cancels those with a higher number. For example, a new TITLE2 cancels an existing TITLE3, if there is one. Titles and footnotes stay in effect until you replace them with new ones or cancel them with a null statement. The following null statement would cancel all current titles:

TITLE;

LABEL statement Sometimes variable names are not as informative as you would like. In those cases, you can use the LABEL statement to create descriptive labels, up to 256 characters long, for each variable. This statement creates labels for the variables ReceiveDate and ShipDate:

LABEL ReceiveDate = 'Date order was received'

ShipDate = 'Date merchandise was shipped';

If you put a LABEL statement in a DATA step, the labels will be saved with the data; but if you put a LABEL statement in a PROC, then the labels will be used only by that particular step.

Customizing output You have a lot of control over the output produced by procedures. Using system options, you can set many features such as centering, dates, and paper orientation. (See Section 1.7.) With the Output Delivery System, you can also change the overall style of your output, produce output in different formats such as PDF or RTF, or change specific attributes of your output such as font or color. (See Chapter 5.)

Output data sets Most procedures produce some kind of report, but sometimes you would like the results of the procedure saved as a SAS data set so you can perform further analysis. You can create SAS data sets from any procedure output using the ODS OUTPUT statement (Section 5.13). Some procedures can also write a SAS data set using an OUTPUT statement or OUT= option.

4.2 Subsetting in Procedures with the WHERE Statement

One optional statement for a PROC that reads a SAS data set is the WHERE statement. The WHERE statement tells a procedure to use a subset of the data. There are other ways to subset data, as you probably remember, so you could get by without ever using the WHERE statement. However, the WHERE statement is a shortcut. While subsetting IFs work only in DATA steps, the WHERE statement works in PROC steps too.

Unlike subsetting in a DATA step, using a WHERE statement in a procedure does not create a new data set. That is one of the reasons why WHERE statements are sometimes more efficient than other ways of subsetting. (The WHERE= data set option is similar to a WHERE statement. See Section 6.12 for more information.)

The basic form of a WHERE statement is:

WHERE condition;

Only observations satisfying the condition will be used by the PROC. This may look familiar since it is similar to a subsetting IF. The left side of the condition is a variable name, and the right side is a variable name, a constant, or a mathematical expression. Mathematical expressions can contain the standard arithmetic symbols for addition (+), subtraction (-), multiplication (*), division (/), and exponentiation (**). Between the two sides of the expression, you can use comparison and logical operators; those operators may be symbolic or mnemonic:

Symbolic	Mnemonic	Example
=	EQ	WHERE Region = 'Spain';
¬=, ~=, ^=	NE	WHERE Region ~= 'Spain';
>	GT	WHERE Rainfall > 20;
<	LT	WHERE Rainfall < AvgRain;
>=	GE	WHERE Rainfall >= AvgRain + 5;
<=	LE	WHERE Rainfall <= AvgRain / 1.25;
&	AND	WHERE Rainfall > 20 AND Temp < 90;
\|, ¦ , !	OR	WHERE Rainfall > 20 OR Temp < 90;
	IS MISSING	WHERE Region IS MISSING;
	IS NOT MISSING	WHERE Region IS NOT MISSING;
	BETWEEN AND	WHERE Region BETWEEN 'Plain' AND 'Spain';
?	CONTAINS	WHERE Region CONTAINS 'ain';
=*		WHERE Region =* 'Spane';
	LIKE	WHERE Region LIKE 'Sp%n';
	IN (list)	WHERE Region IN ('Rain', 'Spain', 'Plain');

The sounds-like operator (=*) uses the Soundex algorithm to select observations with data values that are phonetically similar in English (such as Spain and Spane). The LIKE operator selects observations that match a specific pattern where a percent sign (%) serves as a wildcard corresponding to any number of characters.

Example The following comma-separated values data contain information about well-known painters. For each artist, the data include the painter’s name, primary style, and nationality:

Name,Genre,Nationality

Mary Cassatt,Impressionism,U

Paul Cezanne,Post-impressionism,F

Salvador Dali,Surrealism,S

Henri Matisse,Post-impressionism,F

Claude Monet,Impressionism,F

Berthe Morisot,Impressionism,F

Pablo Picasso,Surrealism,S

Jackson Pollock,Abstract expressionism,U

Pierre Auguste Renoir,Impressionism,F

Vincent van Gogh,Post-impressionism,N

Suppose you wanted a list of just the Impressionist painters. The quick-and-easy way to do this is with PROC PRINT and a WHERE statement. In the following program, a PROC IMPORT reads the data from a file named Artists.csv, and uses a LIBNAME statement to create a permanent SAS data set named STYLE in a directory named MySASLib. Then in the PROC PRINT, a WHERE statement selects just the Impressionists, and a FOOTNOTE statement explains the codes for Nationality.

* Import CSV file of artists;

LIBNAME art 'c:MySASLib';

PROC IMPORT DATAFILE = 'c:MyRawDataArtists.csv' OUT = art.style REPLACE;

RUN;

* Print list of Impressionist painters;

PROC PRINT DATA = art.style;

WHERE Genre = 'Impressionism';

TITLE 'Impressionist Painters';

FOOTNOTE 'F = France U = USA';

RUN;

Here are the results. Notice that because the data were subsetted in the PROC PRINT, the numbers in the Obs column reflect the order of observations in the original data set. If you don’t like that, you can tell SAS not to print the Obs column. (See Section 4.5.)

Impressionist Painters

Obs	Name	Genre	Nationality
1	Mary Cassatt	Impressionism	U
5	Claude Monet	Impressionism	F
6	Berthe Morisot	Impressionism	F
9	Pierre Auguste Renoir	Impressionism	F

F = France U = USA

4.3 Sorting Your Data with PROC SORT

There are many reasons for sorting your data: to organize data for a report, before combining data sets, or before using a BY statement in another PROC or DATA step. Fortunately, PROC SORT is quite simple. The basic form of this procedure is:

PROC SORT;

BY variable-list;

The variables named in the BY statement are called BY variables. You can specify as many BY variables as you like. With one BY variable, SAS sorts the data based on the values of that variable. With more than one variable, SAS sorts observations by the first variable, then by the second variable within categories of the first, and so on. A BY group is all the observations that have the same values of BY variables. If, for example, your BY variable is Country, then all the observations for Spain form one BY group.

Controlling the output data set The DATA= and OUT= options specify the input and output data sets. If you don’t specify the DATA= option, then SAS will use the most recently created data set. If you don’t specify the OUT= option, then SAS will replace the original data set with the newly sorted version. This sample statement tells SAS to sort the data set named MESSY, and then put the sorted data into a data set named NEAT:

PROC SORT DATA = messy OUT = neat;

The NODUPKEY option tells SAS to eliminate duplicate observations that have the same values for the BY variables. NODUPKEY applies only to variables listed in the BY statement. So if you want to eliminate duplicate observations that have identical values for all the variables, then you need to include all the variables in your BY statement. If you specify the DUPOUT= option, then SAS will put the deleted observations in that data set. This statement would remove any duplicate observations and put them into a data set named EXTRAOBS:

PROC SORT DATA = messy OUT = neat NODUPKEY DUPOUT = extraobs;

Ascending versus descending sorts By default, SAS sorts data in ascending order, from lowest to highest. To have your data sorted in the opposite order, add the keyword DESCENDING to the BY statement before each variable that should be sorted in reverse order. This statement tells SAS to sort first by Country (from A to Z) and then by City (from Z to A) within Country:

BY Country DESCENDING City;

Example The following data show the length in feet of whales and sharks. Notice that each line includes data for more than one animal.

beluga whale 15 dwarf shark .5 sperm whale 60

basking shark 30 humpback . 50 whale shark 40

gray whale 50 blue whale 100 killer whale 30

mako shark 12 whale shark 40 blue whale 90

In the following program, the DATA step reads a raw data file named Lengths.dat and creates a SAS data set named MARINE. Then PROC SORT orders the observations by Family and Name.

DATA marine;

INFILE 'c:MyRawDataLengths.dat';

INPUT Name $ Family $ Length @@;

RUN;

* Sort the data and remove duplicates;

PROC SORT DATA = marine OUT = seasort NODUPKEY;

BY Family Name;

RUN;

* Print the sorted data;

PROC PRINT DATA = seasort;

TITLE 'Whales and Sharks';

RUN;

The OUT= option writes the sorted data into a new data set named SEASORT. The NODUPKEY option eliminates observations with duplicate values of the BY variables (Family and Name). The log contains these notes showing that the sorted data set has two fewer observations than the original data set.

NOTE: There were 12 observations read from the data set WORK.MARINE.

NOTE: 2 observations with duplicate key values were deleted.

NOTE: The data set WORK.SEASORT has 10 observations and 3 variables.

The output from PROC PRINT looks like this:

Whales and Sharks

Obs	Name	Family	Length
1	humpback		50.0
2	basking	shark	30.0
3	dwarf	shark	0.5
4	mako	shark	12.0
5	whale	shark	40.0
6	beluga	whale	15.0
7	blue	whale	100.0
8	gray	whale	50.0
9	killer	whale	30.0
10	sperm	whale	60.0

Notice that the humpback, with a missing value for Family, became observation 1. That is because missing values are always lowest for both numeric and character variables. Also, the NODUPKEY option eliminated two duplicate observations, one for the whale shark and one for the blue whale. Because the BY variables were Family and Name, SAS ignored the variable Length, and just kept the first observation for each combination of Family and Name.

4.4 Changing the Sort Order for Character Data

At first glance, sorting character data appears straightforward. After all, everyone knows that "A" comes before "B." However, it is less obvious whether "A" comes before "a." SAS offers dozens of options for controlling the sort order of character data (also called the collating sequence). This section describes a few of them.

ASCII versus EBCDIC The default collating sequence for the z/OS operating environment is EBCDIC. The default collating sequence for most other operating environments is ASCII. From lowest to highest, the basic sort orders for character data are

ASCII	blank	numerals	uppercase letters	lowercase letters
EBCDIC	blank	lowercase letters	uppercase letters	numerals

If you work in only one operating environment, this may not matter to you. However, if you need to create a data set on Windows that will be used on z/OS or vice versa, then you might want your data to be sorted in the order expected by that environment. You can use the options SORTSEQ=EBCDIC or SORTSEQ=ASCII to change the sort order:

PROC SORT SORTSEQ = EBCDIC;

Other possible values for the SORTSEQ= option include DANISH, FINNISH, ITALIAN, NORWEGIAN, POLISH, SPANISH, and SWEDISH.

Linguistic sorting By default, upper- and lowercase letters will be sorted separately, but this is not the way that people generally sort them. You can use linguistic sorting to produce a more intuitive order. The SORTSEQ=LINGUISTIC option with the STRENGTH=PRIMARY suboption tells SAS to ignore case. To use these options, add them to the PROC SORT statement like this:

PROC SORT SORTSEQ = LINGUISTIC (STRENGTH = PRIMARY);

Here are data that are unsorted, sorted in the default ASCII order, and then sorted ignoring case:

Unsorted order	Default Sort	Linguistic Sort (STRENGTH=PRIMARY)
eva	ANNA	amanda
amanda	Zenobia	ANNA
Zenobia	amanda	eva
ANNA	eva	Zenobia

When numerals are sorted as character data, the value "10" comes before "2." The NUMERIC_COLLATION=ON suboption tells SAS to treat numerals as their numeric equivalent.

PROC SORT SORTSEQ = LINGUISTIC (NUMERIC_COLLATION = ON);

Here are data that are unsorted, sorted in the default order, and sorted with numeric collation:

Unsorted order	Default Sort	Linguistic Sort (NUMERIC_COLLATION=ON)
1500m freestyle	100m backstroke	50m freestyle
200m breaststroke	1500m freestyle	100m backstroke
100m backstroke	200m breaststroke	200m breaststroke
50m freestyle	50m freestyle	1500m freestyle

Example The following data contain names and addresses:

Name Street City State

Seiki 100 A St. Anchorage Alaska

Wong 2 A St. honolulu Hawaii

Shaw 10 A St. Apt. 10 juneau Alaska

Smith 10 A St. Apt. 2 Honolulu Hawaii

Lee 100 A St. Honolulu hawaii

Jones 10 A St. Apt. 22 Juneau alaska

This program imports the tab-delimited file. Then the data are sorted by the variables State, City, and Street ignoring case and using numeric collation.

PROC IMPORT DATAFILE = 'c:MyRawDataMail.txt' OUT = addresses REPLACE;

RUN;

* Sort addresses using linguistic sorting with numeric collation;

PROC SORT DATA = addresses OUT = sortout

SORTSEQ = LINGUISTIC (STRENGTH = PRIMARY NUMERIC_COLLATION = ON);

BY State City Street;

RUN;

* Print the linguistically sorted data;

PROC PRINT DATA = sortout;

TITLE 'Addresses Sorted by State, City, and Street';

RUN;

Here are the results:

Addresses Sorted by State, City, and Street

Obs	Name	Street	City	State
1	Seiki	100 A St.	Anchorage	Alaska
2	Shaw	10 A St. Apt. 10	juneau	Alaska
3	Jones	10 A St. Apt. 22	Juneau	alaska
4	Wong	2 A St.	honolulu	Hawaii
5	Smith	10 A St. Apt. 2	Honolulu	Hawaii
6	Lee	100 A St.	Honolulu	hawaii

4.5 Printing Your Data with PROC PRINT

The PRINT procedure is perhaps the most widely used SAS procedure. In its simplest form, PROC PRINT prints all the values of all variables for all observations in the SAS data set. SAS decides the best way to format the output, so you don’t have to worry about things like how many variables will fit on a page. But there are a few more features of PROC PRINT that you might want to use.

The PRINT procedure starts with the keywords PROC PRINT followed by options:

PROC PRINT options;

We recommend always using the DATA= option for clarity in your programs. By default, SAS prints the observation numbers along with the variables’ values. If you don’t want observation numbers, use the NOOBS option. If you define variable labels with a LABEL statement, and you want to print the labels instead of the variable names, then add the LABEL option as well. The following statement shows all of these options together:

PROC PRINT DATA = data-set NOOBS LABEL;

The following are optional statements that sometimes come in handy:

BY variable-list;	The BY statement starts a new section in the output for each new value of the BY variables and prints the values of the BY variables at the top of each section. The data must be presorted by the BY variables.
ID variable-list;	When you use the ID statement, the observation numbers are not printed. Instead, the variables in the ID variable list appear on the left-hand side of the page.
SUM variable-list;	The SUM statement prints sums for the variables in the list.
VAR variable-list;	The VAR statement specifies which variables to print and their order. Without a VAR statement, all variables in the SAS data set are printed in the order that they occur in the data set.

Example Students from two fourth-grade classes are selling cookies to earn money for a special field trip. The following are the data for the cookie sale. The students’ names are followed by their classroom number, the date they turned in their money, the type of cookie (mint patties or chocolate dinosaurs), and the number of boxes sold. Note that each line of data contains information for two students:

Adriana 21 3/21/2020 MP 7 Nathan 14 3/21/2020 CD 19

Matthew 1 3/21/2020 CD 14 Claire 14 3/22/2020 CD 11

Ian 21 3/24/2020 MP 18 Chris 14 3/25/2020 CD 6

Anthony 21 3/25/2020 MP 13 Erika 21 3/25/2020 MP 17

The class earns $2.50 for each box of cookies sold. The teachers want a report showing the money earned for each classroom, the money earned by each student, the type of cookie sold, and the date the students returned their money. The following program reads the data, creates a permanent SAS data set named SALES and computes money earned (Profit). Then PROC SORT sorts the data by classroom and creates a temporary data set named SALESSORT. The PROC PRINT uses a BY statement to print the data by Class and a SUM statement to give the totals for Profit. The VAR statement lists the variables to be printed.

LIBNAME class 'c:MySASLib';

DATA class.sales;

INFILE 'c:MyRawDataCookieSales.dat';

INPUT Name $ Class DateReturned MMDDYY10. CookieType $ Quantity @@;

Profit = Quantity * 2.5;

RUN;

PROC SORT DATA = class.sales OUT = salessort;

BY Class;

RUN;

PROC PRINT DATA = salessort;

BY Class;

SUM Profit;

VAR Name DateReturned CookieType Profit;

TITLE 'Cookie Sales for Field Trip by Class';

RUN;

Here are the results. Notice that the values for the variable DateReturned are printed as their SAS date values. You can use formats, covered in the next section, to print dates in readable forms.

Cookie Sales for Field Trip by Class

Class=14

Obs	Name	DateReturned	CookieType	Profit
1	Nathan	21995	CD	47.5
2	Matthew	21995	CD	35.0
3	Claire	21996	CD	27.5
4	Chris	21999	CD	15.0
Class				125.0

Class=21

Obs	Name	DateReturned	CookieType	Profit
5	Adriana	21995	MP	17.5
6	Ian	21998	MP	45.0
7	Anthony	21999	MP	32.5
8	Erika	21999	MP	42.5
Class				137.5
				262.5

4.6 Changing the Appearance of Data Values with Formats

When SAS displays your data, it decides which format is best—how many decimal places to show, how much space to allow for each value, and so on. This is very convenient and makes your job much easier, but SAS doesn’t always do what you want. Fortunately, you’re not stuck with the format SAS thinks is best. You can change the appearance of data values using SAS formats.

SAS has many formats for character, numeric, and date values. For example, you can use the COMMAw.d format to print numbers with embedded commas, the $w. format to control the number of characters printed, and the MMDDYYw. format to print SAS date values (the number of days since January 1, 1960) in a readable form like 12/03/2023. You can even print your data in more obscure formats like hexadecimal, zoned decimal, and packed decimal, if you like. Using the FORMAT procedure, you can also create your own formats (Section 4.8).

The general forms for SAS formats are:

Character	Numeric	Date
$formatw.	formatw.d	formatw.

where the $ indicates character formats, format is the name of the format, w is the total width including any decimal point, and d is the number of decimal places.

FORMAT statement You can associate formats with variables in a FORMAT statement. The FORMAT statement starts with the keyword FORMAT, followed by the variable name (or names if more than one variable is to be associated with the same format), followed by the format. For example, the following FORMAT statement associates the DOLLAR8.2 format with the variables Profit and Loss and associates the MMDDYY8. format with the variable SaleDate:

FORMAT Profit Loss DOLLAR8.2 SaleDate MMDDYY8.;

FORMAT statements can go in either DATA steps or PROC steps. If the FORMAT statement is in a DATA step, then the format association is permanent and is stored with the SAS data set. If the FORMAT statement is in a PROC step, then it is temporary—affecting only the results from that procedure.

Example In the previous section, results from the fourth-grade cookie sale were printed using the PRINT procedure. The names of the students were printed along with the date they turned in their money, the type of cookie sold, and the profit. You may have noticed that the dates printed were numbers like 21995 and 21999. Using a FORMAT statement in the PRINT procedure, we can print the dates in a readable form. At the same time, we can print the variable Profit using the DOLLAR6.2 format so dollar signs appear before the numbers.

This example uses the permanent SAS data set created in the preceding section. The data are the students’ names followed by their classroom, the date they turned in their money, the type of cookie sold (mint patties or chocolate dinosaurs), the number of boxes sold, and profit:

	Name	Class	DateReturned	CookieType	Quantity	Profit
1	Adriana	21	21995	MP	7	17.5
2	Nathan	14	21995	CD	19	47.5
3	Matthew	14	21995	CD	14	35.0
4	Claire	14	21996	CD	11	27.5
5	Ian	21	21998	MP	18	45.0
6	Chris	14	21999	CD	6	15.0
7	Anthony	21	21999	MP	13	32.5
8	Erika	21	21999	MP	17	42.5

The following program uses the permanent SAS data set SALES. The FORMAT statement in the PRINT procedure associates the DATE9. format with the variable DateReturned and the DOLLAR6.2 format with the variable Profit:

LIBNAME class 'c:MySASLib';

* Print cookie sales data with formatted values;

PROC PRINT DATA = class.sales;

VAR Name DateReturned CookieType Profit;

FORMAT DateReturned DATE9. Profit DOLLAR6.2;

TITLE 'Cookie Sale Data Using Formats';

RUN;

Here are the results with formatted values for DateReturned and Profit:

Cookie Sale Data Using Formats

Obs	Name	DateReturned	CookieType	Profit
1	Adriana	21MAR2020	MP	$17.50
2	Nathan	21MAR2020	CD	$47.50
3	Matthew	21MAR2020	CD	$35.00
4	Claire	22MAR2020	CD	$27.50
5	Ian	24MAR2020	MP	$45.00
6	Chris	25MAR2020	CD	$15.00
7	Anthony	25MAR2020	MP	$32.50
8	Erika	25MAR2020	MP	$42.50

4.7 Selected Standard Formats

Format	Definition	Width range	Default width
Character
$UPCASEw.	Converts character data to uppercase	1–32767	Length of variable or 8
$w.	Writes standard character data—does not trim leading blanks (same as $CHARw.)	1–32767	Length of variable or 1
Date, Time, and Datetime1
DATEw.	Writes SAS date values in form ddmmmyy or ddmmmyyyy	5–11	7
DATETIMEw.d	Writes SAS datetime values in form ddmmmyy:hh:mm:ss.ss	7–40	16
DTDATEw.	Writes SAS datetime values in form ddmmmyy or ddmmmyyyy	5–9	7
EURDFDDw.	Writes SAS date values in form dd.mm.yy or dd.mm.yyyy	2–10	8
JULIANw.	Writes SAS date values in Julian date form yyddd or yyyyddd	5–7	5
MMDDYYw.	Writes SAS date values in form mm/dd/yy or mm/dd/yyyy	2–10	8
TIMEw.d	Writes SAS time values in form hh:mm:ss.ss	2–20	8
WEEKDATEw.	Writes SAS date values in form day-of-week, month-name dd, yy, or yyyy	3–37	29
WORDDATEw.	Writes SAS date values in form month-name dd, yyyy	3–32	18
Numeric
BESTw.	SAS chooses the best format—default format for numeric data	1–32	12
COMMAw.d	Writes numbers with commas	1–32	6
DOLLARw.d	Writes numbers with a leading $ and commas separating every three digits	2–32	6
Ew.	Writes numbers in scientific notation	7–32	12
EUROXw.d	Writes numbers with a leading € and periods separating every three digits	1–32	6
PERCENTw.d	Writes numeric data as percentages	4–32	6
w.d	Writes standard numeric data	1–32	none

¹SAS date values are the number of days since January 1, 1960. SAS time values are the number of seconds past midnight, and
datetime values are the number of seconds since midnight January 1, 1960.

Format	Input data	FORMAT statement	Results
Character
$UPCASEw.	my cat	FORMAT Animal $UPCASE6.;	MY CAT
$w.	my cat my snake	FORMAT Animal $8.;	my cat my snak
Date, Time, and Datetime
DATEw.	8966	FORMAT Birth DATE7.; FORMAT Birth DATE9.;	19JUL84 19JUL1984
DATETIMEw.	12182	FORMAT Start DATETIME13.; FORMAT Start DATETIME18.1;	01JAN60:03:23 01JAN60:03:23:02.0
DTDATEw.	12182	FORMAT Start DTDATE7.; FORMAT Start DTDATE9.;	01JAN60 01JAN1960
EURDFDDw.	8966	FORMAT Birth EURDFDD8.; FORMAT Birth EURDFDD10.;	19.07.84 19.07.1984
JULIANw.	8966	FORMAT Birth JULIAN5.; FORMAT Birth JULIAN7.;	84201 1984201
MMDDYYw.	8966	FORMAT Birth MMDDYY8.; FORMAT Birth MMDDYY6.;	7/19/84 071984
TIMEw.d	12182	FORMAT Start TIME8.; FORMAT Start TIME11.2;	3:23:02 3:23:02.00
WEEKDATEw.	8966	FORMAT Birth WEEKDATE15.; FORMAT Birth WEEKDATE29.;	Thu, Jul 19, 84 Thursday, July 19, 1984
WORDDATEw.	8966	FORMAT Birth WORDDATE12.; FORMAT Birth WORDDATE18.;	Jul 19, 1984 July 19, 1984
Numeric
BESTw.	1200001	FORMAT Value BEST6.; FORMAT Value BEST8.;	1.20E6 1200001
COMMAw.d	1200001	FORMAT Value COMMA9.; FORMAT Value COMMA12.2;	1,200,001 1,200,001.00
DOLLARw.d	1200001	FORMAT Value DOLLAR10.; FORMAT Value DOLLAR13.2;	$1,200,001 $1,200,001.00
Ew.	1200001	FORMAT Value E7.;	1.2E+06
EUROXw.d	1200001	FORMAT Value EUROX13.2;	€1.200.001,00
PERCENTw.d	0.05	FORMAT Value PERCENT9.2;	5.00%
w.d	23.635	FORMAT Value 6.3; FORMAT Value 5.2;	23.635 23.64

4.8 Creating Your Own Formats with PROC FORMAT

At some time you will probably want to create your own custom formats—especially if you use a lot of coded data. Suppose that you have just completed a survey for your company and to save disk space and time, all the responses to the survey questions are coded. For example, the age categories teen, adult, and senior are coded as numbers 1, 2, and 3. This is convenient for data entry and analysis, but bothersome when it comes time to interpret the results. You could present your results along with a code book, and your company directors could look up the codes as they read the results. But this will probably not get you that promotion you’ve been looking for. A better solution is to create user-defined formats using PROC FORMAT and print the formatted values instead of the coded values.

The FORMAT procedure creates formats that will later be associated with variables in a FORMAT statement. (Formats can also be used with a PUT function to modify data values in a DATA step. See Section 4.14 for an example.) Here is the general form of PROC FORMAT:

PROC FORMAT;

VALUE name range-1 = 'formatted-text-1'

range-2 = 'formatted-text-2'

range-n = 'formatted-text-n';

The name in the VALUE statement is the name of the format that you are creating. If the format is for character data, the name must start with a $. The name can’t be longer than 32 characters (including the $ for character data), it must not start or end with a number, and cannot contain any special characters except an underscore. In addition, the name can’t be the name of an existing format. Each range is the value of a variable that is to be assigned to the text given in quotation marks on the right side of the equal sign. The text can be up to 32,767 characters long. The following are examples of valid range specifications:

'A' = 'Asia'

1, 3, 5, 7, 9 = 'Odd'

500000 - HIGH = 'Not Affordable'

13 -< 20 = 'Teenager'

0 <- HIGH = 'Positive Non Zero'

OTHER = 'Bad Data'

Character data values should be enclosed in quotation marks ('A' for example). If there is more than one value in the range, then separate the values with a comma or use a hyphen (-) for a continuous range. The keywords LOW and HIGH can be used in ranges to indicate the lowest and the highest nonmissing value for the variable. You can also use the less than symbol (<) in ranges to exclude either end point of the range. The OTHER keyword can be used to assign a format to any values not listed in the VALUE statement.

Example Universe Cars is surveying its customers as to their preferences for car colors. They have information about the customer’s age, sex (coded as 1 for male and 2 for female), annual income, and preferred car color (yellow, gray, blue, or white). Here are the data in CSV format:

Age,Sex,Income,Color

19,1,28000,Y

45,1,130000,G

72,2,70000,B

31,1,88000,Y

58,2,166000,W

The following program reads the data; creates three user-defined formats for age, sex, and car color using the FORMAT procedure; and then prints the data using the new formats:

PROC IMPORT DATAFILE = 'c:MyRawDataCars.csv' OUT = carsurvey REPLACE;

RUN;

PROC FORMAT;

VALUE gender 1 = 'Male'

2 = 'Female';

VALUE agegroup 13 -< 20 = 'Teen'

20 -< 65 = 'Adult'

65 - HIGH = 'Senior';

VALUE $col 'W' = 'Moon White'

'B' = 'Sky Blue'

'Y' = 'Sunburst Yellow'

'G' = 'Rain Cloud Gray';

RUN;

* Print data using user-defined and standard formats;

PROC PRINT DATA = carsurvey;

FORMAT Sex gender. Age agegroup. Color $col. Income DOLLAR8.;

TITLE 'Survey Results Printed with User-Defined Formats';

RUN;

This program creates two numeric formats: GENDER. for the variable Sex and AGEGROUP. for the variable Age. The program creates a character format, $COL., for the variable Color. Notice that the format names do not end with periods in the VALUE statement, but do in the FORMAT statement.

Here is the output:

Survey Results Printed with User-Defined Formats

Obs	Age	Sex	Income	Color
1	Teen	Male	$28,000	Sunburst Yellow
2	Adult	Male	$130,000	Rain Cloud Gray
3	Senior	Female	$70,000	Sky Blue
4	Adult	Male	$88,000	Sunburst Yellow
5	Adult	Female	$166,000	Moon White

This example creates temporary formats that exist only for the current job or session. It is possible to create permanent formats. (See the SAS Documentation for details.) Formats can also be used to group data. (See Section 4.14.)

4.9 Writing a Report to a Text File

PROC PRINT is flexible and easy to use. Still, there are times when PROC PRINT just won’t do: when your report to a state agency has to be spaced just like their fill-in-the-blank form, or when your client insists that the report contain complete sentences, or when you want one page per observation. At those times you can use the flexibility of the DATA step to write a text file. Text files are simple, but with simplicity comes a level of control that is not always possible with other file formats.

You can write data in a DATA step the same way you read data—but in reverse. Instead of using an INFILE statement, you use a FILE statement; instead of INPUT statements, you use PUT statements. This is similar to writing a raw data file in a DATA step (Section 10.6), but to write a report you use the PRINT option telling SAS to include the carriage returns and page breaks needed for printing. Here is the general form of a FILE statement for creating a report:

FILE 'file-specification' PRINT;

Like INPUT statements, PUT statements can be in list, column, or formatted style, but since SAS already knows whether a variable is numeric or character, you don’t have to put a $ after character variables. If you use list format, SAS will automatically put a space between each variable. If you use column or formatted styles of PUT statements, SAS will put the variables wherever you specify. You can control spacing with the same pointer controls that INPUT statements use: @n to move to column n, +n to move n columns, / to skip to the next line, #n to skip to line n, and the trailing @ to hold the current line. In addition to printing variables, you can insert a text string by simply enclosing it in quotation marks.

Example To show how this differs from PROC PRINT, this example uses the permanent SAS data set created in Section 4.5. Two fourth-grade classes have sold cookies to raise money for a field trip. The data are the students’ names followed by their classroom, the date they turned in their money, the type of cookie sold (mint patties or chocolate dinosaurs), the number of boxes sold, and profit:

	Name	Class	DateReturned	CookieType	Quantity	Profit
1	Adriana	21	21995	MP	7	17.5
2	Nathan	14	21995	CD	19	47.5
3	Matthew	14	21995	CD	14	35.0
4	Claire	14	21996	CD	11	27.5
5	Ian	21	21998	MP	18	45.0
6	Chris	14	21999	CD	6	15.0
7	Anthony	21	21999	MP	13	32.5
8	Erika	21	21999	MP	17	42.5

The teachers want a report for each student showing how much money that student earned. They want each student’s report on a separate page so it is easy to hand out. Lastly, they want it to be easy for fourth graders to understand, with complete sentences. Here is the program:

* Write a report with FILE and PUT statements;

LIBNAME class 'c:MySASLib';

DATA _NULL_;

SET class.sales;

FILE 'c:MyTextFilesStudent.txt' PRINT;

TITLE;

PUT @5 'Cookie sales report for ' Name 'from classroom ' Class

// @5 'Congratulations! You sold ' Quantity 'boxes of cookies'

/ @5 'and earned ' Profit DOLLAR6.2 ' for our field trip.';

PUT _PAGE_;

RUN;

Notice that the keyword _NULL_ appears in the DATA statement instead of a data set name. _NULL_ tells SAS not to bother writing a SAS data set (since the goal is to create a text file not a data set), and makes the program run slightly faster. The FILE statement creates the output file for the report, and the PRINT option tells SAS to include carriage returns and page breaks. The null TITLE statement tells SAS to eliminate any automatic titles.

The first PUT statement in this program starts with a pointer, @5, telling SAS to go to column 5. Then it tells SAS to print the words Cookie sales report for followed by the current value of the variable Name. The variables Name, Class, and Quantity are printed in list style, whereas Profit is printed using formatted style and the DOLLAR6.2 format. A slash line pointer tells SAS to skip to the next line; two slashes skips two lines. You could use multiple PUT statements instead of slashes to skip lines because SAS goes to a new line every time there is a new PUT statement. The statement PUT _PAGE_ inserts a page break after each student’s report. When the program is run, the log will contain these notes:

NOTE: 24 records were written to the file 'c:MyRawDataStudent.txt'.

NOTE: There were 8 observations read from CLASS.SALES.

The first three pages of the report look like this:

Cookie sales report for Adriana from classroom 21

Congratulations! You sold 7 boxes of cookies

and earned $17.50 for our field trip.

Cookie sales report for Nathan from classroom 14

Congratulations! You sold 19 boxes of cookies

and earned $47.50 for our field trip.

Cookie sales report for Matthew from classroom 14

Congratulations! You sold 14 boxes of cookies

and earned $35.00 for our field trip.

4.10 Summarizing Your Data Using PROC MEANS

One of the first things people usually want to do with their data, after reading it and making
sure it is correct, is to look at some simple statistics. Statistics such as the mean value, standard deviation, and minimum and maximum values give you a feel for your data. These types of information can also alert you to errors in your data (a score of 980 in a basketball game, for example, is suspect). The MEANS procedure provides simple statistics for numeric variables.

The MEANS procedure starts with the keywords PROC MEANS, followed by options:

PROC MEANS options;

Some options control how your data are summarized:

MAXDEC = n	rounds the value to n decimal places
MISSING	treats missing values as valid summary groups

Other options request specific summary statistics :

MAX	maximum value
MIN	minimum value
MEAN	mean
MEDIAN	median
MODE	mode
N	number of nonmissing values
NMISS	number of missing values
RANGE	range
STDDEV	standard deviation
SUM	sum

If you do not specify any summary statistics, SAS will print the number of nonmissing values, the mean, the standard deviation, and the minimum and maximum values for each variable. More options for PROC MEANS are listed in Section 9.3.

Here are some of the optional statements:

BY variable-list;	The BY statement performs a separate analysis for each level of the variables in the list. The data must first be sorted by these variables. (You can use PROC SORT to do this.)
CLASS variable-list;	The CLASS statement also performs a separate analysis for each level of the variables in the list, but its output is more compact than with the BY statement, and the data do not have to be sorted first.
VAR variable-list;	The VAR statement specifies which numeric variables to use in the analysis. If it is absent, then SAS uses all numeric variables.

Example A wholesale nursery is selling garden flowers, and they want to summarize their sales figures by month. The data contain the customer ID, date of sale, and number of petunias, snapdragons, and marigolds sold:

756-01 05/04/2020 120 80 110

834-01 05/12/2020 90 160 60

901-02 05/18/2020 50 100 75

834-01 06/01/2020 80 60 100

756-01 06/11/2020 100 160 75

901-02 06/19/2020 60 60 60

756-01 06/25/2020 85 110 100

The following program reads the data from a file named Flowers.dat, and creates a permanent SAS data set named PLANTS. Then it computes a new variable, Month, which is the month of the sale; and summarizes the data by Month using PROC MEANS with a CLASS statement. The MAXDEC option is set to zero, so no decimal places will be printed.

LIBNAME garden 'c:MySASLib';

DATA garden.plants;

INFILE 'c:MyRawDataFlowers.dat';

INPUT CustID $ @9 SaleDate MMDDYY10. Petunia SnapDragon

Marigold;

Month = MONTH(SaleDate);

FORMAT SaleDate MMDDYY10;

RUN;

* Calculate means by Month for flower sales;

PROC MEANS DATA = garden.plants MAXDEC = 0;

CLASS Month;

VAR Petunia SnapDragon Marigold;

TITLE 'Summary of Flower Sales by Month';

RUN;

Here are the results of the PROC MEANS:

Summary of Flower Sales by Month

The MEANS Procedure

Month	N Obs	Variable	N	Mean	Std Dev	Minimum	Maximum
5	3	Petunia SnapDragon Marigold	3 3 3	87 113 82	35 42 26	50 80 60	120 160 110
6	4	Petunia SnapDragon Marigold	4 4 4	81 98 84	17 48 20	60 60 60	100 160 100

4.11 Writing Summary Statistics to a SAS Data Set

Sometimes you want to save summary statistics in a SAS data set for further analysis, or to merge with other data. For example, you might want to
plot the hourly temperature in your office to show how it heats up every afternoon causing you to fall asleep, but the instrument you have records data for every minute. The MEANS procedure can condense the data by computing the mean temperature for each hour, and then save the results in a SAS data set so it can be plotted.

There are two methods in PROC MEANS for saving summary statistics in a SAS data set. You can use the OUTPUT destination (covered in Section 5.13), or you can use the OUTPUT statement. The OUTPUT statement has the following form:

OUTPUT OUT = data-set output-statistic-list;

Here, data-set is the name of the SAS data set which will contain the results, and output-statistic-list specifies the statistics you want and the associated variable names. You can have more than one OUTPUT statement and multiple output statistic lists. This is one of the possible forms for output-statistic-list:

statistic(variable-list) = name-list

Here, statistic can be any of the statistics available in PROC MEANS (SUM, N, MEAN, for example), variable-list specifies the variables in the VAR statement you want to output, and
name-list specifies names for the new summary variables. The new variable names must be
in the same order as their corresponding variables in variable-list. For example, the following
PROC MEANS statements produce a new data set called ZOOSUM, which contains one observation with the variables LionWeight, the mean of the lions’ weights, and BearWeight,
the mean of the bears’ weights:

PROC MEANS DATA = zoo NOPRINT;

VAR Lions Tigers Bears;

OUTPUT OUT = zoosum MEAN(Lions Bears) = LionWeight BearWeight;

RUN;

The NOPRINT option in the PROC MEANS statement tells SAS there is no need to produce any printed results since we are saving the results in a SAS data set. (Using PROC SUMMARY is the same as using PROC MEANS with the NOPRINT option.)

The SAS data set created in the OUTPUT statement will contain all the variables defined in the output-statistic-list; any variables listed in a BY or CLASS statement; plus two new variables, _TYPE_ and _FREQ_. If there is no BY or CLASS statement, then the data set will have just one observation. If there is a BY statement, then the data set will have one observation for each level of the BY group. CLASS statements produce one observation for each level of interaction of the class variables. The value of the _TYPE_ variable depends on the level of interaction. The observation where _TYPE_ has a value of zero is the grand total.

Example This example uses the permanent SAS data set created in the preceding section. The data are sales for a wholesale nursery with the customer ID; date of sale; the number of petunias, snapdragons, and marigolds sold, and the month of sale.

	CustID	SaleDate	Petunia	SnapDragon	Marigold	Month
1	756-01	05/04/2020	120	80	110	5
2	834-01	05/12/2020	90	160	60	5
3	901-02	05/18/2020	50	100	75	5
4	834-01	06/01/2020	80	60	100	6
5	756-01	06/11/2020	100	160	75	6
6	901-02	06/19/2020	60	60	60	6
7	756-01	06/25/2020	85	110	100	6

The nursery wants to summarize the data, and then output the results as a SAS data set for further analysis. The following program uses the MEANS procedure with the NOPRINT option to summarize the data by CustID. Because a CLASS statement is used, the new data set will include one observation for each customer along with one for the grand total. The OUPTUT statement creates a temporary SAS data set named TOTALS. The maximum values are given the variable names MaxP, MaxS, and MaxM, and the sums are given the original variable names Petunia, SnapDragon, and Marigold:

LIBNAME garden 'c:MySASLib';

* Calculate means by CustID, output sum and max to new data set;

PROC MEANS NOPRINT DATA = garden.plants;

CLASS CustID;

VAR Petunia SnapDragon Marigold;

OUTPUT OUT = totals

MAX(Petunia SnapDragon Marigold) = MaxP MaxS MaxM

SUM(Petunia SnapDragon Marigold) = Petunia SnapDragon Marigold;

RUN;

Here is the TOTALS data set. Notice that the _TYPE_ variable has a value of 0 for the grand total, and 1 for the summaries by customer. If you want just the summaries by customer (and no grand total), you could use a BY statement instead of a CLASS statement, but then the data would need to be sorted by CustID first.

	CustID	_TYPE_	_FREQ_	MaxP	MaxS	MaxM	Petunia	SnapDragon	Marigold
1		0	7	120	160	110	585	730	580
2	756-01	1	3	120	160	110	305	350	285
3	834-01	1	2	90	160	100	170	220	160
4	901-02	1	2	60	100	75	110	160	135

4.12 Producing One-Way Frequencies with PROC FREQ

A frequency table is a simple list of counts answering the question “How many?” When you have counts for one variable, they are called one-way frequencies. When you combine two or more variables, the counts are called two-way frequencies, three-way frequencies, and so on. This section focuses on one-way frequencies while the next two sections cover more complex examples.

The most obvious reason for using PROC FREQ is to create tables showing the distribution of categorical data values, but PROC FREQ can also reveal irregularities in your data. You could get dizzy proofreading a large data set, but data entry errors are often glaringly obvious in a frequency table. The basic form of PROC FREQ is:

PROC FREQ;

TABLES variable-list;

If you skip the TABLES statement, then you will get one-way frequencies for all variables. With a TABLES statement, SAS will produce just the frequencies you specify. This statement would produce two frequency tables: one for the variable YearsEducation and another for Sex:

TABLES YearsEducation Sex;

You can specify any number of table requests in a single TABLES statement, and you can have
as many TABLES statements as you like.

Options, if any, appear after a slash in the TABLES statement and include the following:

MISSING	includes missing values in frequencies and in percentages
MISSPRINT	includes missing values in frequencies but not in percentages
NOCUM	suppresses cumulative frequencies in one-way frequencies
NOPERCENT	suppresses printing of percentages
NOPRINT	suppresses printing of frequency tables
OUT = data-set	writes a data set containing frequencies

The statement below, for example, tells SAS to treat missing values as valid and save the results in a temporary SAS data set named EDFREQS:

TABLES YearsEducation / MISSING OUT = edfreqs;

PROC FREQ also offers many statistical options. See Section 9.6 for more information.

Example The proprietor of a coffee shop keeps a record of sales. For each drink sold, she records the type of coffee (cappuccino, espresso, kona, or iced coffee), and whether the customer walked in or came to the drive-up window. Here are the data with ten observations per line:

esp w cap d cap w kon w ice w kon d esp d kon w ice d esp d

cap w esp d cap d Kon d . d kon w esp d cap w ice w kon w

kon w kon w ice d esp d kon w esp d esp w kon w cap w kon w

The following program reads the data from a file named Coffee.dat, and creates a permanent SAS data set named ORDERS in a directory named MySASLib (Windows). Then it produces one-way frequencies for two variables: Window and Coffee:

LIBNAME drinks 'c:MySASLib';

DATA drinks.orders;

INFILE 'c:MyRawDataCoffee.dat';

INPUT Coffee $ Window $ @@;

RUN;

* Print tables for Window and Coffee;

PROC FREQ DATA = drinks.orders;

TABLES Window Coffee;

TITLE 'Coffee Sales by Window and by Type of Drink';

RUN;

The output contains two tables. The first is a one-way frequency table for the variable Window. You can see that 13 customers came to the drive-up window while 17 walked into the restaurant. In addition to the frequency, the table also shows the percent, cumulative frequency, and cumulative percent for each value of the variable Window.

Coffee Sales by Window and by Type of Drink

The FREQ Procedure

Window	Frequency	Percent	Cumulative Frequency	Cumulative Percent
d	13	43.33	13	43.33
w	17	56.67	30	100.00

Coffee	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Kon	1	3.45	1	3.45
cap	6	20.69	7	24.14
esp	8	27.59	15	51.72
ice	4	13.79	19	65.52
kon	10	34.48	29	100.00
Frequency Missing = 1

The second table is a one-way frequency table for Coffee. You can see how many customers ordered each type of drink. Notice that the missing value is noted at the bottom, but not included in the statistics. (Use the MISSING or MISSPRINT options if you want missing values to be included in the counts.) Also, there is one observation with a value of Kon for Coffee. Data values are case sensitive. This is a data entry error; it should be kon.

4.13 Producing Crosstabulations with PROC FREQ

One-way frequencies are a good start, but often you want to know how particular variables interact. You can cross any number of variables to get two-way frequencies, three-way frequencies, and so on, up to n-way frequencies. Tables combining two or more variables are also called crosstabulations or contingency tables.

To produce crosstabulations, separate the variables with asterisks. This statement produces a two-way table showing frequencies for each combination of values of the variables YearsEducation and Sex:

TABLES YearsEducation * Sex;

This statement produces a three-way table:

TABLES YearsEducation * Sex * Age;

Be careful though; reading crosstabulations of three or more levels can feel like playing three-dimensional tic-tac-toe without the benefit of a three-dimensional board.

You can use parentheses to group variables. This statement produces two, two-way tables: one for YearsEducation by Sex, and another for YearsEducation by Age:

TABLES YearsEducation * (Sex Age);

You can specify any number of table requests in a single TABLES statement, and you can have as many TABLES statements as you like.

Options, if any, appear after a slash in the TABLES statement. Possible options include some listed in the previous section (MISSPRINT, MISSING, NOPERCENT, NOPRINT, and OUT=) and some that apply specifically to crosstabulations:

CROSSLIST	displays crosstabulations in list format with totals
LIST	displays crosstabulations in list format without totals
NOCOL	suppresses column percentages in crosstabulations
NOROW	suppresses row percentages in crosstabulations

The statement below, for example, tells SAS to leave out all the percentages:

TABLES YearsEducation * Sex / NOPERCENT NOROW NOWCOL;

Example This example uses the SAS data set created in the previous section. The data are from a coffee shop. For each drink sold, the owner records the type of coffee (cappuccino, espresso, kona, or iced coffee), and whether the customer walked in or came to the drive-up window.

The following program produces two, two-way frequency tables:

LIBNAME drinks 'c:MySASLib';

* Print two tables for Window by Coffee;

PROC FREQ DATA = drinks.orders;

TABLES Window * Coffee;

TABLES Window * Coffee / MISSING NOPERCENT NOCOL NOROW;

TITLE 'Coffee Sales by Window and Type of Drink';

RUN;

Here are the results:

Coffee Sales by Window and Type of Drink

The FREQ Procedure

Table of Window by Coffee
Window	Coffee
Frequency Percent Row Pct Col Pct	Kon	cap	esp	ice	kon	Total
d	1 3.45 8.33 100.00	2 6.90 16.67 33.33	6 20.69 50.00 75.00	2 6.90 16.67 50.00	1 3.45 8.33 10.00	12 41.38
w	0 0.00 0.00 0.00	4 13.79 23.53 66.67	2 6.90 11.76 25.00	2 6.90 11.76 50.00	9 31.03 52.94 90.00	17 58.62
Total	1 3.45	6 20.69	8 27.59	4 13.79	10 34.48	29 100.00
Frequency Missing = 1

Table of Window by Coffee
Window	Coffee
Frequency		Kon	cap	esp	ice	kon	Total
d	1	1	2	6	2	1	13
w	0	0	4	2	2	9	17
Total	1	1	6	8	4	10	30

These two tables are the same except that the second one uses options to include missing values in the table instead of in a note at the bottom, and to exclude all the percentages.

4.14 Grouping Data with User-Defined Formats

With user-defined formats you can change how variables are displayed (covered in Section 4.8), but user-defined formats are also a powerful tool for grouping data. Grouping data this way is a two-step process. First, use the FORMAT procedure to define a format that assigns all the values that you want to group together to a text string. Second, apply the new format to the variable you want to group in a PUT function in a DATA step or a FORMAT statement in a procedure.

PUT function If you want to create a new variable with a user-defined format, you can use a PUT function in a DATA step. The original variable may be numeric or character, but the resulting variable is always character. The general form is:

new-variable = PUT(old-variable, user-defined-format.);

FORMAT statement You can also group data directly in a procedure without creating a new variable. This is handy for cases where you have a lot of data, and it takes a long time to run a DATA step. Using this method, it is easy to change the groupings by simply creating a new format. This method works for procedures that group data such as PROC FREQ, PROC TABULATE, PROC MEANS with a CLASS statement, and PROC REPORT with GROUP or ACROSS variables. The general form is:

FORMAT old-variable user-defined-format.;

Example The staff of the local library want to see the type of books people check out by age group. They have the age of the patron in years and the type of book (fiction, mystery, science fiction, biography, non-fiction, or reference). Here are the data with nine observations per line:

17 sci 9 bio 28 fic 50 mys 13 fic 32 fic 67 fic 81 non 38 non

53 non 16 sci 15 bio 61 fic 52 ref 22 mys 76 bio 37 fic 86 fic

49 mys 78 non 45 sci 64 bio 8 fic 11 non 41 fic 46 ref 69 fic

34 fic 26 mys 23 sci 74 ref 15 sci 27 fic 23 mys 63 fic 78 non

40 bio 12 fic 29 fic 54 mys 67 fic 60 fic 38 sci 42 fic 80 fic

Here is the program that creates three user-defined formats: one for the type of book and two to group the age data in different ways.

*Define formats to group the data;

PROC FORMAT;

VALUE $typ

'bio','non','ref' = 'Non-Fiction'

'fic','mys','sci' = 'Fiction';

VALUE agegpa

0-18 = '0 to 18'

19-25 = '19 to 25'

26-49 = '26 to 49'

50-HIGH = ' 50+ ';

VALUE agegpb

0-25 = '0 to 25'

26-HIGH = ' 26+ ';

RUN;

DATA books;

INFILE 'c:MyRawDataLibraryBooks.dat';

INPUT Age Book $ @@;

BookType = PUT(Book,$typ.);

RUN;

*Create two way table with Age grouped into four categories;

PROC FREQ DATA = books;

TITLE 'Patron Age by Book Type: Four Age Groups';

TABLES BookType * Age / NOPERCENT NOROW NOCOL;

FORMAT Age agegpa.;

RUN;

*Create two way table with Age grouped into two categories;

PROC FREQ DATA = books;

TITLE 'Patron Age by Book Type: Two Age Groups';

TABLES BookType * Age / NOPERCENT NOROW NOCOL;

FORMAT Age agegpb.;

RUN;

In a DATA step, a PUT function uses the format $TYP. to create a new variable named BookType that groups books into Fiction or Non-Fiction. Then the first PROC FREQ groups age into four categories using the format AGEGPA., while the second PROC FREQ groups age into two categories using the format AGEGPB. Because the NOPERCENT, NOROW, and NOCOL options were added to the TABLES statements, only frequencies appear in the results.

Patron Age by Book Type: Four Age Groups

The FREQ Procedure

Table of BookType by Age
BookType	Age
Frequency	0 to 18	19 to 25	26 to 49	50+	Total
Fiction	6	3	12	10	31
Non-Fiction	3	0	3	8	14
Total	9	3	15	18	45

Patron Age by Book Type: Two Age Groups

The FREQ Procedure

Table of BookType by Age
BookType	Age
Frequency	0 to 25	26+	Total
Fiction	9	22	31
Non-Fiction	3	11	14
Total	12	33	45

4.15 Producing Tabular Reports with PROC TABULATE

Every summary statistic the TABULATE procedure computes can also be produced by other procedures such as PRINT, MEANS, and FREQ, but PROC TABULATE is popular because its reports are pretty. If PROC TABULATE were a box, it would be gift-wrapped.

PROC TABULATE is so powerful that entire books have been written about it, but it is also so concise that you may feel like you’re reading hieroglyphics. If you find the syntax of PROC TABULATE a little hard to get used to, that may be because it has roots outside of SAS. PROC TABULATE is based in part on the Table Producing
Language, a language developed by the U.S. Department of Labor.

The general form of PROC TABULATE is:

PROC TABULATE;

CLASS classification-variable-list;

TABLE page-dimension, row-dimension, column-dimension;

The CLASS statement tells SAS which variables contain categorical data to be used for dividing observations into groups, while the TABLE statement tells SAS how to organize your table and which numbers to compute. Each TABLE statement defines only one table, but you can have multiple TABLE statements. If a variable is listed in a CLASS statement, then, by default, PROC TABULATE produces simple counts of the number of observations in each category of that variable. PROC TABULATE offers many other statistics, and the next section describes how to request those.

Dimensions Each TABLE statement can specify up to three dimensions. Those dimensions, separated by commas, tell SAS which variables to use for the pages, rows, and columns in the report. If you specify only one dimension, then that becomes the column dimension. If you specify two dimensions, then you get rows and columns, but no page dimension. If you specify three dimensions, then you get pages, rows, and columns.

When you write a TABLE statement, start with the column dimension. Once you have that debugged, add the rows. Once you are happy with your rows and columns, you are ready
to add a page dimension, if you need one. Notice that the order of dimensions in the TABLE statement is page, then row, then column. So, to avoid scrambling your table when you add dimensions, insert the page and row specifications in front of the column dimension.

Missing data By default, observations are excluded from tables if they have missing values for variables listed in a CLASS statement (even if that variable is not listed in the TABLE statement). If you want to keep these observations, then simply add the MISSING option to your PROC statement, like this:

PROC TABULATE MISSING;

Example Here are data about day cruises by boat including the name of each boat, its home port, its type (sailing or power), the kind of vessel (schooner, catamaran, or yacht), the price of an excursion, and the length of the boat in feet.

Name Port Type Vessel Price Length

Silent Lady Maalea sail sch 95.00 64

America II Maalea sail yac 72.95 65

Aloha Anai Lahaina sail cat 112.00 60

Ocean Spirit Maalea power cat 62.00 65

Anuenue Maalea sail sch 177.50 52

Hana Lei Maalea power cat 88.99 110

Leilani Maalea power yac 99.99 45

Kalakaua Maalea power cat 69.50 70

Reef Runner Lahaina power yac 59.95 50

Blue Dolphin Maalea sail cat 92.95 65

Suppose that you want a report showing the number of boats by Type and Vessel for each port. The following program reads the tab-delimited file using PROC IMPORT and creates a permanent SAS data set named BOATS. Then PROC TABULATE creates a three-dimensional report with the values of Port for the pages, Type for the rows, and Vessel for the columns.

LIBNAME trips 'c:MySASLib';

PROC IMPORT DATAFILE = 'c:MyRawDataBoats.txt' OUT = trips.boats REPLACE;

RUN;

* Tabulations with three dimensions;

PROC TABULATE DATA = trips.boats;

CLASS Port Type Vessel;

TABLE Port, Type, Vessel;

TITLE 'Number of Boats by Port, Type, and Vessel';

RUN;

This report has two pages, one for each value of the page dimension. Here is one page:

Number of Boats by Port, Type, and Vessel

Port Maalea
	Vessel
	cat	sch	yac
	N	N	N
Type	3	.	1
power	3	.	1
sail	1	2	1

The value of the page dimension appears above the top, left corner of the table. You can see that this is the page for the port of Maalea. The heading N tells you that the numbers in this table are simple counts, the number of boats in each group.

4.16 Adding Statistics to PROC TABULATE Output

By default, PROC TABULATE produces simple counts for variables listed in a CLASS statement, but you can request many other statistics in a TABLE statement. You can also concatenate or cross variables within dimensions. In fact, you can write TABLE statements so complicated that even you won’t know what the report is going to look like until you run it.

While the CLASS statement lists categorical variables, the VAR statement tells SAS which variables contain continuous data. Here is the general form:

PROC TABULATE;

VAR analysis-variable-list;

CLASS classification-variable-list;

TABLE page-dimension, row-dimension, column-dimension;

While the variables in a CLASS statement can be either numeric or character, variables in a VAR statement must be numeric. You may have both a CLASS statement and a VAR statement, or just one, but all variables listed in a TABLE statement must also appear in either a CLASS or a VAR statement.

Keywords In addition to variable names, each dimension can contain keywords. Here are a few of the values that PROC TABULATE can compute:

ALL	adds a row, column, or page showing the total
MAX	highest value
MIN	lowest value
MEAN	arithmetic mean
MEDIAN	median
MODE	mode
N	number of nonmissing values
NMISS	number of missing values
PCTN	percentage of observations for that group
PCTSUM	percentage of total represented by that group
STDDEV	standard deviation
SUM	sum

Concatenating, crossing, and grouping Within a dimension, variables and keywords can be concatenated, crossed, or grouped. To concatenate variables or keywords, simply list them separated by a space; to cross variables or keywords, separate them with an asterisk (*); and to group them, enclose the variables or keywords in parentheses. The keyword ALL is generally concatenated. To request other statistics, however, cross that keyword with the variable name.

Concatenating:	TABLE Type Vessel ALL;
Crossing:	TABLE MEAN * Price;
Crossing, grouping, and concatenating:	TABLE PCTN *(Type Vessel);

Example This example uses the permanent SAS data set created in the preceding section. The data contain the name of each boat, its home port, its type (sailing or power), the kind of vessel (schooner, catamaran, or yacht), the price of an excursion, and the boat's length.

	Name	Port	Type	Vessel	Price	Length
1	Silent Lady	Maalea	sail	sch	95	64
2	America II	Maalea	sail	yac	72.95	65
3	Aloha Anai	Lahaina	sail	cat	112	60
4	Ocean Spirit	Maalea	power	cat	62	65
5	Anuenue	Maalea	sail	sch	177.5	52
6	Hana Lei	Maalea	power	cat	88.99	110
7	Leilani	Maalea	power	yac	99.99	45
8	Kalakaua	Maalea	power	cat	69.5	70
9	Reef Runner	Lahaina	power	yac	59.95	50
10	Blue Dolphin	Maalea	sail	cat	92.95	65

The following program creates another report about the boats. However, this PROC TABULATE includes a VAR statement. The TABLE statement in this program contains only two dimensions; but it also concatenates, crosses, and groups variables and statistics.

LIBNAME trips 'c:MySASLib';

* Tabulations with two dimensions and statistics;

PROC TABULATE DATA = trips.boats;

CLASS Type Vessel;

VAR Price;

TABLE Type ALL, MEAN*Price*(Vessel ALL);

TITLE 'Mean Price by Type and Vessel';

RUN;

The row dimension of this table concatenates the classification variable Type with ALL to produce totals. The column dimension, on the other hand, crosses MEAN with the analysis variable Price and with the classification variable Vessel (which happens to be concatenated and grouped with ALL). Here are the results:

Mean Price by Type and Vessel

	Mean
	Price
	Vessel			All
	cat	sch	yac	All
Type	73.50	.	79.97	76.09
power	73.50	.	79.97	76.09
sail	102.48	136.25	72.95	110.08
All	85.09	136.25	77.63	93.08

4.17 Enhancing the Appearance of PROC TABULATE Output

When you use PROC TABULATE, SAS wraps your data in tidy little boxes, but there may be times when they just don’t look right. Using three simple options, you can enhance the appearance of your output. Think of it as changing the wrapping paper.

FORMAT= option To change the format of all the data cells in your table, use the FORMAT= option in your PROC statement. For example, if you needed the numbers in your table to have commas and no decimal places, you could use this PROC statement:

PROC TABULATE FORMAT = COMMA8.0;

telling SAS to use the COMMA8.0 format for all the data cells in the table.

To apply a format to an individual variable, cross it with the variable name, like this:

variable-name*FORMAT=formatw.d

Then you insert this rather convoluted construction into your TABLE statement:

TABLE Region, Sales*FORMAT=COMMA8.0 Profit*FORMAT=DOLLAR10.2;

This TABLE statement applies the COMMA8.0 format to a variable named Sales, and the DOLLAR10.2 format to Profit. If you specify formats in both the PROC and TABLE statements, then the TABLE statement will override the PROC statement.

BOX= and MISSTEXT= options While the FORMAT= option can go in either PROC or TABLE statements, the BOX= and MISSTEXT= options go only in TABLE statements. The BOX= option allows you to insert a brief phrase in the normally empty box that appears in the upper left corner of every PROC TABULATE report. Using this empty space can give your reports a polished look. The MISSTEXT= option, on the other hand, specifies a value for SAS to print in empty data cells. The period that SAS prints, by default, for missing values can seem downright mysterious to someone, perhaps your CEO, who is not familiar with SAS output. You can give them something more meaningful with the MISSTEXT= option. This statement:

TABLE Region, MEAN*Sales / BOX='Mean Sales by Region' MISSTEXT='No Sales';

tells SAS to print the title “Mean Sales by Region” in the upper left corner of the table, and to print the words “No Sales” in any cells of the table that have no data. The BOX= and MISSTEXT= options must be separated from the dimensions of the TABLE statement by a slash.

Example This example uses the permanent SAS data set created in Section 4.15. The data contain the name of each boat, its home port, its type (sailing or power), the kind of vessel (schooner, catamaran, or yacht), the price of an excursion, and the boat's length.

	Name	Port	Type	Vessel	Price	Length
1	Silent Lady	Maalea	sail	sch	95	64
2	America II	Maalea	sail	yac	72.95	65
3	Aloha Anai	Lahaina	sail	cat	112	60
4	Ocean Spirit	Maalea	power	cat	62	65
5	Anuenue	Maalea	sail	sch	177.5	52
6	Hana Lei	Maalea	power	cat	88.99	110
7	Leilani	Maalea	power	yac	99.99	45
8	Kalakaua	Maalea	power	cat	69.5	70
9	Reef Runner	Lahaina	power	yac	59.95	50
10	Blue Dolphin	Maalea	sail	cat	92.95	65

This program is similar to the one in the previous section. However, the variable Length has been added to the VAR and TABLE statements along with the FORMAT=, BOX=, and MISSTEXT= options in the PROC and TABLE statements. Because the BOX= option serves as a title, a null TITLE statement is used to remove the usual title.

LIBNAME trips 'c:MySASLib';

* PROC TABULATE report with options;

PROC TABULATE DATA = trips.boats FORMAT = DOLLAR7.2;

CLASS Type Vessel;

VAR Price Length;

TABLE Type ALL,

MEAN * (Price Length*FORMAT=2.0) * (Vessel ALL)

/BOX='Full Day Excursions' MISSTEXT='none';

TITLE;

RUN;

Here is the enhanced output:

Full Day Excursions	Mean
	Price				Length
	Vessel			All	Vessel			All
	cat	sch	yac	All	cat	sch	yac	All
Type	$73.50	none	$79.97	$76.09	82	none	48	68
power	$73.50	none	$79.97	$76.09	82	none	48	68
sail	$102.48	$136.25	$72.95	$110.08	63	58	65	61
All	$85.09	$136.25	$77.63	$93.08	74	58	53	65

Notice that the data cells for Price now use the DOLLAR7.2 format as specified in the PROC statement, while the cells for Length use the 2.0 format specified in the TABLE statement. The text “Full Day Excursions” appears in the upper left corner, which was empty in the previous section. In addition, the one data cell with no data shows the word “none” instead of a period.

4.18 Changing Headers in PROC TABULATE Output

The TABULATE procedure produces reports with a lot of headers. Sometimes there are so many headers that your reports look cluttered; at other times you may simply feel that a different header would be more meaningful. Before you can change a header, though, you need to understand what type of header it is. PROC TABULATE reports have two basic types of headers: headers that are the values of variables listed in a CLASS statement, and headers that are the names of variables and keywords. You use different methods to change different types of headers.

CLASS variable values To change headers that are the values of variables listed in a CLASS statement, use the FORMAT procedure to create a user-defined format. Then assign the format to the variable in a FORMAT statement (discussed in Section 4.8).

Variable names and keywords You can change headers that are the names of variables using a simple LABEL statement as described in Section 4.1, but that won’t work for keywords. There is another method that works for both variable names and keywords. In the TABLE statement, you put an equal sign after the variable or keyword followed by the new header enclosed in quotation marks. You can eliminate a header entirely by setting it equal to blank (two quotation marks with nothing in between), and SAS will remove the box for that header. This TABLE statement tells SAS to remove the headers for Region and MEAN, and to change the header for the variable Sales to “Mean Sales by Region.”

TABLE Region='', MEAN=''*Sales='Mean Sales by Region';

In some cases SAS leaves the empty box when a row header is set to blank. This happens for statistics and analysis variables (but not class variables). To force SAS to remove the empty box, add the ROW=FLOAT option to the end of your TABLE statement, like this:

TABLE MEAN=''*Sales='Mean Sales by Region', Region='' / ROW=FLOAT;

	Name	Port	Type	Vessel	Price	Length
1	Silent Lady	Maalea	sail	sch	95	64
2	America II	Maalea	sail	yac	72.95	65
3	Aloha Anai	Lahaina	sail	cat	112	60
4	Ocean Spirit	Maalea	power	cat	62	65
5	Anuenue	Maalea	sail	sch	177.5	52
6	Hana Lei	Maalea	power	cat	88.99	110
7	Leilani	Maalea	power	yac	99.99	45
8	Kalakaua	Maalea	power	cat	69.5	70
9	Reef Runner	Lahaina	power	yac	59.95	50
10	Blue Dolphin	Maalea	sail	cat	92.95	65

The following program shows how to change headers. To start with, a FORMAT procedure creates a user-defined format named $VES. Then the $VES. format is assigned to the variable Vessel using a FORMAT statement so that the formatted data values will be used for column headers. In the TABLE statement, more headers are changed. The headers for Type, MEAN, and Vessel are all set to blank, while the header for Price is set to “Mean Price by Kind of Vessel”.

LIBNAME trips 'c:MySASLib';

* Changing headers;

PROC FORMAT;

VALUE $ves 'cat' = 'catamaran'

'sch' = 'schooner'

'yac' = 'yacht';

RUN;

PROC TABULATE DATA = trips.boats FORMAT=DOLLAR7.2;

CLASS Type Vessel;

VAR Price;

FORMAT Vessel $ves.;

TABLE Type='' ALL,

MEAN=''*Price='Mean Price by Kind of Vessel'*(Vessel='' ALL)

/BOX='Full Day Excursions' MISSTEXT='none';

TITLE;

RUN;

This program does not require the ROW=FLOAT option because the only variable being set to blank in the row dimension is a class variable. If you put an analysis variable or a statistics keyword in the row dimension and set it equal to blank, then you would need to add the ROW=FLOAT option to remove empty boxes.

Here is the output:

Full Day Excursions	Mean Price by Kind of Vessel
Full Day Excursions	catamaran	schooner	yacht	All
power	$73.50	none	$79.97	$76.09
sail	$102.48	$136.25	$72.95	$110.08
All	$85.09	$136.25	$77.63	$93.08

4.19 Producing Simple Output with PROC REPORT

The REPORT procedure shares features with the PRINT, MEANS, TABULATE, and SORT procedures and the DATA step. With all those features rolled into one procedure, it’s not surprising that PROC REPORT can be complex—in fact entire books have been written about it—but with all those features comes power.

Here is the general form of a basic REPORT procedure:

PROC REPORT;

COLUMN variable-list;

In its simplest form, the COLUMN statement is similar to a VAR statement in PROC PRINT, telling SAS which variables to include and in what order. If you leave out the COLUMN statement, SAS will, by default, include all the variables in your data set.

Numeric versus character data The type of report you get from PROC REPORT depends, in part, on the type of data you use. If you have at least one character variable in your report, then, by default, you will get a detail report with one row per observation. If, on the other hand, your report includes only numeric variables, then, by default, PROC REPORT will sum those variables. Even dates will be summed, by default, because they are numeric. (You can override this default by assigning one of your numeric variables a usage type of DISPLAY in a DEFINE statement. See the next section.)

Example Here are data about national parks and monuments in the USA. The variables are name, type (NP for national park or NM for national monument), region (East or West), number of museums (including visitor centers), and number of campgrounds.

Dinosaur NM West 3 6

Everglades NP East 4 2

Grand Canyon NP West 3 3

Great Smoky Mountains NP East 7 10

Hawaii Volcanoes NP West 1 2

Statue of Liberty NM East 2 0

Theodore Roosevelt NP . 2 2

Yellowstone NP West 8 12

Yosemite NP West 5 11

The following program creates a permanent SAS data set named NATPARKS, and then runs two reports. The first report has no COLUMN statement so SAS will use all the variables, while the second uses a COLUMN statement to select just the numeric variables.

LIBNAME visit 'c:MySASLib';

DATA visit.natparks;

INFILE 'c:MyRawDataParks.dat';

INPUT Name $ 1-21 Type $ Region $ Museums Camping;

RUN;

PROC REPORT DATA = visit.natparks;

TITLE 'Report with Character and Numeric Variables';

RUN;

PROC REPORT DATA = visit.natparks;

COLUMN Museums Camping;

TITLE 'Report with Only Numeric Variables';

RUN;

While the two PROC steps are only slightly different, the reports they produce differ dramatically. The first report is almost identical to the output you would get from a PROC PRINT except for the absence of the OBS column. The second report, since it contained only numeric variables, was summed.

Report with Character and Numeric Variables

Name	Type	Region	Museums	Camping
Dinosaur	NM	West	3	6
Everglades	NP	East	4	2
Grand Canyon	NP	West	3	3
Great Smoky Mountains	NP	East	7	10
Hawaii Volcanoes	NP	West	1	2
Statue of Liberty	NM	East	2	0
Theodore Roosevelt	NP		2	2
Yellowstone	NP	West	8	12
Yosemite	NP	West	5	11

Report with Only Numeric Variables

Museums	Camping
35	48

4.20 Using DEFINE Statements in PROC REPORT

The DEFINE statement is a general purpose statement that specifies options for an individual variable. You can have a DEFINE statement for every variable, but you only need to have a DEFINE statement if you want to specify an option for that particular variable. The general form of a DEFINE statement is:

DEFINE variable / options 'column-header';

In a DEFINE statement, you specify the variable name followed by a slash and any options for that particular variable.

Usage options The most important option is a usage option that tells SAS how that variable is to be used. Possible values of usage options include the following:

ACROSS	creates a column for each unique value of the variable.
ANALYSIS	calculates statistics for the variable. This is the default usage for numeric variables, and the default statistic is SUM.
COMPUTED	creates a new variable whose value you calculate in a compute block. See Section 4.24 for a discussion of compute blocks.
DISPLAY	creates one row for each observation in the data set. This is the default usage for character variables.
GROUP	creates one row for each unique value of the variable.
ORDER	creates one row for each observation with rows arranged according to the values of the order variable.

Changing column headers There are several ways to change column headers in PROC REPORT including using a LABEL statement, as described in Section 4.1, or specifying a column header in a DEFINE statement. The following statement tells SAS to arrange a report by the values of the variable Age, and use the words “Age at Admission” as the column header for that variable. Putting a slash in a column header tells SAS to split the header at that point.

DEFINE Age / ORDER 'Age at/Admission';

Missing data By default, observations are excluded from reports if they have missing values for variables with a usage type of ORDER, GROUP, or ACROSS. If you want to keep these observations, then simply add the MISSING option to your PROC statement, like this:

PROC REPORT MISSING;

Example This example uses the permanent SAS data set created in the preceding section with data about national parks and monuments in the USA. The variables are name, type (NP for national park or NM for national monument), region (East or West), number of museums (including visitor centers), and number of campgrounds.

	Name	Type	Region	Museums	Camping
1	Dinosaur	NM	West	3	6
2	Everglades	NP	East	4	2
3	Grand Canyon	NP	West	3	3
4	Great Smoky Mountains	NP	East	7	10
5	Hawaii Volcanoes	NP	West	1	2
6	Statue of Liberty	NM	East	2	0
7	Theodore Roosevelt	NP		2	2
8	Yellowstone	NP	West	8	12
9	Yosemite	NP	West	5	11

The following PROC REPORT contains two DEFINE statements. The first defines Region with a usage type of ORDER. The second specifies a column header for the variable Camping. Camping is a numeric variable and has a default usage of ANALYSIS, so the DEFINE statement does not change its usage. The MISSING option in the PROC statement tells SAS to include observations with missing values of Region.

LIBNAME visit 'c:MySASLib';

* PROC REPORT with ORDER variable, MISSING option, and column header;

PROC REPORT DATA = visit.natparks MISSING;

COLUMN Region Name Museums Camping;

DEFINE Region / ORDER;

DEFINE Camping / ANALYSIS 'Campgrounds';

TITLE 'National Parks and Monuments Arranged by Region';

RUN;

Here are the results:

National Parks and Monuments Arranged by Region

Region	Name	Museums	Campgrounds
	Theodore Roosevelt	2	2
East	Everglades	4	2
	Great Smoky Mountains	7	10
	Statue of Liberty	2	0
West	Dinosaur	3	6
	Grand Canyon	3	3
	Hawaii Volcanoes	1	2
	Yellowstone	8	12
	Yosemite	5	11

4.21 Creating Summary Reports with PROC REPORT

Two different usage types tell the REPORT procedure to “roll up” data into summary groups. While the GROUP usage type produces summary rows, the ACROSS usage type produces summary columns. However, keep in mind that, if you have any display or order variables in a COLUMN statement, they will override this behavior, and SAS will produce a detail report instead of a summary report.

Group variables Defining a group variable is fairly simple. Just specify the GROUP usage option in a DEFINE statement. By default, analysis variables will be summed. (PROC REPORT can produce many other statistics, see Section 4.23.) The following PROC REPORT tells SAS to produce a report showing the sum of Salary and of Bonus with a row for each value of Department.

PROC REPORT DATA = employees;

COLUMN Department Salary Bonus;

DEFINE Department / GROUP;

RUN;

Across variables To define an across variable, you also use a DEFINE statement. However, by default, SAS produces counts rather than sums. To obtain sums for across variables, you must tell SAS which variables to summarize. You do that by putting a comma between the across variable and analysis variable (or variables if you enclose them in parentheses). The following PROC REPORT tells SAS to produce a report showing the sum of Salary and of Bonus with one column for each value of Department.

PROC REPORT DATA = employees;

COLUMN Department , (Salary Bonus);

DEFINE Department / ACROSS;

RUN;

Example This example uses the permanent SAS data set created in Section 4.19 with data about national parks and monuments in the USA. The variables are name, type (NP for national park or NM for national monument), region, number of museums, and number of campgrounds.

	Name	Type	Region	Museums	Camping
1	Dinosaur	NM	West	3	6
2	Everglades	NP	East	4	2
3	Grand Canyon	NP	West	3	3
4	Great Smoky Mountains	NP	East	7	10
5	Hawaii Volcanoes	NP	West	1	2
6	Statue of Liberty	NM	East	2	0
7	Theodore Roosevelt	NP		2	2
8	Yellowstone	NP	West	8	12
9	Yosemite	NP	West	5	11

The following program contains two PROC REPORTs. In the first, Region and Type are both defined as group variables. In the second, Region is still a group variable, but Type is an across variable. Notice that the two COLUMN statements are the same except for punctuation added to the second procedure to cross the across variable with the analysis variables.

LIBNAME visit 'c:MySASLib';

* Region and Type as GROUP variables;

PROC REPORT DATA = visit.natparks;

COLUMN Region Type Museums Camping;

DEFINE Region / GROUP;

DEFINE Type / GROUP;

TITLE 'Summary Report with Two Group Variables';

RUN;

* Region as GROUP and Type as ACROSS with sums;

PROC REPORT DATA = visit.natparks;

COLUMN Region Type,(Museums Camping);

DEFINE Region / GROUP;

DEFINE Type / ACROSS;

TITLE 'Summary Report with a Group and an Across Variable';

RUN;

Here is the resulting output:

Summary Report with Two Group Variables

Region	Type	Museums	Camping
East	NM	2	0
	NP	11	12
West	NM	3	6
	NP	17	28

Summary Report with a Group and an Across Variable

	Type
	NM		NP
Region	Museums	Camping	Museums	Camping
East	2	0	11	12
West	3	6	17	28

4.22 Adding Summary Breaks to PROC REPORT Output

Two kinds of statements allow you to insert breaks into a report. The BREAK statement adds a break for each unique value of the variable you specify, while the RBREAK statement does the same for the entire report (or BY-group if you are using a BY statement). The general forms of these statements are:

BREAK location variable / options;

RBREAK location / options;

where location has two possible values—BEFORE or AFTER—depending on whether you want the break to precede or follow that particular section of the report. The options that come after the slash tell SAS what kind of break to insert. Some of the possible options are:

PAGE	starts a new page
SUMMARIZE	inserts summary statistics for numeric variables

Notice that the BREAK statement requires you to specify a variable, but the RBREAK statement does not. That’s because the RBREAK statement produces only one break (at the beginning or end of the report or BY group), while the BREAK statement produces one break for every unique value of the variable you specify. That variable must be either a group or an order variable and, therefore, must also be listed in a DEFINE statement with either the GROUP or ORDER usage option. You can use an RBREAK statement in any report, but you can use BREAK only if you have at least one group or order variable.

	Name	Type	Region	Museums	Camping
1	Dinosaur	NM	West	3	6
2	Everglades	NP	East	4	2
3	Grand Canyon	NP	West	3	3
4	Great Smoky Mountains	NP	East	7	10
5	Hawaii Volcanoes	NP	West	1	2
6	Statue of Liberty	NM	East	2	0
7	Theodore Roosevelt	NP		2	2
8	Yellowstone	NP	West	8	12
9	Yosemite	NP	West	5	11

The following program defines Region as an order variable, and then uses both BREAK and RBREAK statements with the AFTER location. The SUMMARIZE option tells SAS to print totals for the numeric variables.

LIBNAME visit 'c:MySASLib';

* PROC REPORT with breaks;

PROC REPORT DATA = visit.natparks;

COLUMN Name Region Museums Camping;

DEFINE Region / ORDER;

BREAK AFTER Region / SUMMARIZE;

RBREAK AFTER / SUMMARIZE;

TITLE 'Detail Report with Summary Breaks';

RUN;

Here is the resulting output:

Detail Report with Summary Breaks

Name	Region	Museums	Camping
Everglades	East	4	2
Great Smoky Mountains		7	10
Statue of Liberty		2	0
	East	13	12
Dinosaur	West	3	6
Grand Canyon		3	3
Hawaii Volcanoes		1	2
Yellowstone		8	12
Yosemite		5	11
	West	20	34
		33	46

4.23 Adding Statistics to PROC REPORT Output

There are several ways to request statistics in the REPORT procedure. An easy method is to insert statistics keywords directly into the COLUMN statement along with the variable names. This is a little like requesting statistics in a TABLE statement in PROC TABULATE, except that instead of using an asterisk to cross a statistics keyword with a variable, you use a comma. In fact, PROC REPORT can produce all the same statistics as PROC TABULATE and PROC MEANS because it uses the same internal engine to compute those statistics. These are a few of the statistics that PROC REPORT can compute:

MAX	highest value
MIN	lowest value
MEAN	arithmetic mean
MEDIAN	median
MODE	mode
N	number of nonmissing values
NMISS	number of missing values
PCTN	percentage of observations for that group
PCTSUM	percentage of a total sum represented by that group
STD	standard deviation
SUM	sum

Applying statistics to variables To request a statistic for a particular variable, insert a comma between the statistic and the variable in the COLUMN statement. One statistic, N, does not require a comma because it does not apply to a particular variable. If you insert N in a COLUMN statement, then SAS will print the number of observations that contributed to that row of the report. This statement tells SAS to print two columns of data: the median of a variable named Age, and the number of observations in that row.

COLUMN Age,MEDIAN N;

To request multiple statistics or statistics for multiple variables, put parentheses around the statistics or variables. This statement uses parentheses to request two statistics for the variable Age, and then requests one statistic for two variables, Height and Weight.

COLUMN Age,(MIN MAX) (Height Weight),MEAN;

Example This example uses the permanent SAS data set created in Section 4.19 with data about national parks and monuments in the USA. The variables are name, type, region, number of museums, and number of campgrounds.

The following program contains two PROC REPORTs. Both procedures request the statistics N and MEAN, but the first report defines Type as a group variable, while the second defines Type as an across variable.

LIBNAME visit 'c:MySASLib';

*Statistics in COLUMN statement with two group variables;

PROC REPORT DATA = visit.natparks;

COLUMN Region Type N (Museums Camping),MEAN;

DEFINE Region / GROUP;

DEFINE Type / GROUP;

TITLE 'Statistics with Two Group Variables';

RUN;

*Statistics in COLUMN statement with group and across variables;

PROC REPORT DATA = visit.natparks;

COLUMN Region N Type,(Museums Camping),MEAN;

DEFINE Region / GROUP;

DEFINE Type / ACROSS;

TITLE 'Statistics with a Group and Across Variable';

RUN;

Here is the resulting output:

Statistics with Two Group Variables

			Museums	Camping
Region	Type	N	MEAN	MEAN
East	NM	1	2	0
	NP	2	5.5	6
West	NM	1	3	6
	NP	4	4.25	7

Statistics with a Group and Across Variable

		Type
		NM		NP
		Museums	Camping	Museums	Camping
Region	N	MEAN	MEAN	MEAN	MEAN
East	3	2	0	5.5	6
West	5	3	6	4.25	7

Notice that these reports are similar to the reports in Section 4.21 except that these contain counts and means instead of sums.

4.24 Adding Computed Variables to PROC REPORT Output

Unlike most procedures, the REPORT procedure has the ability to compute not only statistics (like sums and means) but also new variables. You do this using a compute block. Compute blocks start with a COMPUTE statement and end with an ENDCOMP statement. In between, you put the programming statements to calculate your new variable. PROC REPORT uses a limited set of programming statements including assignment statements, IF-THEN/ELSE statements, and DO loops. You do not have to specify a DEFINE statement for the new variable, but if you do, then give it a usage type of COMPUTED. The general form of these statements is:

DEFINE new-variable-name / COMPUTED;

COMPUTE new-variable-name / options;

programming statements

ENDCOMP;

Computing a numeric variable For a numeric variable, simply name the new variable in the COMPUTE statement. If you use any variables with a type of analysis in the compute block, you must append the variable name with its statistic. The default statistic for an analysis variable is SUM. The following statements compute a variable named Income by adding Salary and Bonus:

DEFINE Salary / ANALYSIS SUM;

DEFINE Bonus / ANALYSIS SUM;

DEFINE Income / COMPUTED;

COMPUTE Income;

Income = Salary.SUM + Bonus.SUM;

ENDCOMP;

Computing a character variable For a character variable, add the CHAR option to the COMPUTE statement. You will probably also want to include the LENGTH option. Lengths for computed character variables range from 1 to 200, with a default of 8. The following statements compute a variable named JobType using IF-THEN/ELSE statements:

DEFINE Title / DISPLAY;

DEFINE JobType / COMPUTED;

COMPUTE JobType / CHAR LENGTH = 10;

IF Title = 'Programmer' THEN JobType = 'Technical';

ELSE JobType = 'Other';

ENDCOMP;

The following PROC REPORT computes two variables named Facilities and Note. Facilities is a numeric variable equal to the number of museums plus the number of campgrounds. Note is a character variable, which is equal to “No Camping” for parks that have no campgrounds. Notice that the variables Museums and Camping must be listed in the COLUMN statement because they are used to compute the new variables. In order to exclude them from the report, the program uses the NOPRINT option in DEFINE statements.

LIBNAME visit 'c:MySASLib';

* COMPUTE new variables that are numeric and character;

PROC REPORT DATA = visit.natparks;

COLUMN Name Region Museums Camping Facilities Note;

DEFINE Museums / ANALYSIS SUM NOPRINT;

DEFINE Camping / ANALYSIS SUM NOPRINT;

DEFINE Facilities / COMPUTED 'Campgrounds/and/Museums';

DEFINE Note / COMPUTED;

COMPUTE Facilities;

Facilities = Museums.SUM + Camping.SUM;

ENDCOMP;

COMPUTE Note / CHAR LENGTH = 10;

IF Camping.SUM = 0 THEN Note = 'No Camping';

ENDCOMP;

TITLE 'Report with Two Computed Variables';

RUN;

Here is the resulting output:

Report with Two Computed Variables

Name	Region	Campgrounds and Museums	Note
Dinosaur	West	9
Everglades	East	6
Grand Canyon	West	6
Great Smoky Mountains	East	17
Hawaii Volcanoes	West	3
Statue of Liberty	East	2	No Camping
Theodore Roosevelt		4
Yellowstone	West	20
Yosemite	West	16

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 4 Sorting, Printing, and Summarizing Your Data

Create new playlist

Sign In

Sign Up

Table of Contents for
Chapter 4 Sorting, Printing, and Summarizing Your Data