Overview 740
Introduction 740
Objectives 741
Prerequisites 741
Eliminating Unnecessary Passes through the Data 756
Using a Single DATA Step to Create Multiple Output Data Sets 757
Using the SORT Procedure with a WHERE Statement to Create Sorted Subsets 757
Using the DATASETS Procedure to Modify Variable Attributes 757
Comparative Example: Creating Multiple Subsets of a SAS Data Set 758
Comparative Example: Creating a Sorted Subset of a SAS Data Set 759
Comparative Example: Changing the Variable Attributes of a SAS Data Set 760
Reading and Writing Only Essential Data 762
Selecting Observations Using Subsetting IF versus WHERE Statement 762
Comparative Example: Creating a Subset of a SAS Data Set 763
Using the WHERE Statement with the OBS= and FIRSTOBS= Options 764
Selecting Observations When Reading Data from External Files 765
Comparative Example: Creating a Subset of Data by Reading Data from an External File 766
Subsetting Variables with the KEEP= and DROP= Statements and Options 767
Comparative Example: Creating a Report That Contains Average and Median Statistics 770
Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables 771
Summary 779
Quiz 780
This chapter demonstrates different ways of using best SAS programming practices to optimize performance. As you compare the techniques described in this chapter, remember that differences in the use of resources are affected by which operating environment you work in and by the characteristics of your data.
This chapter is organized by topics that emphasize the following basic principles:
Execute only necessary statements.
Eliminate unnecessary passes of the data.
Read and write only the data that you require.
Store data in SAS data sets.
Avoid unnecessary procedure invocation.
Each topic includes comparative examples that can improve the efficiency of your programs. Write programs to generate your own benchmarks, and adopt the programming techniques that produce the most savings for you.
This chapter does not cover the SAS Scalable Performance Data Engine (SAS SPD Engine), which is a SAS 9.1 technology for threaded processing. For details about using the SAS SPD Engine to improve performance, see the SAS documentation. Δ
In this chapter, you learn too efficiently
subset observations
create new variables
process and output data conditionally
create multiple output data sets and sorted subsets
modify variable attributes
select observations from SAS data sets and external files
subset variables
read data from SAS data sets
invoke SAS procedures.
Before beginning this chapter, you should complete the following chapters:
Part 1: SQL Processing with SAS
Chapter 1, "Performing Queries Using PROC SQL," on page 3
Chapter 2, "Performing Advanced Queries Using PROC SQL," on page 25
Chapter 3, "Combining Tables Horizontally Using PROC SQL," on page 79
Chapter 4, "Combining Tables Vertically Using PROC SQL," on page 123
Chapter 5, "Creating and Managing Tables Using PROC SQL," on page 159
Chapter 6, "Creating and Managing Indexes Using PROC SQL," on page 221
Chapter 7, "Creating and Managing Views Using PROC SQL," on page 243
Chapter 8, "Managing Processing Using PROC SQL," on page 261.
Part 3: Advanced SAS Programming Techniques
Chapter 13, "Creating Samples and Indexes," on page 449
Chapter 14, "Combining Data Vertically," on page 479
Chapter 15, "Combining Data Horizontally," on page 511
Chapter 16, "Using Lookup Tables to Match Data," on page 557
Chapter 17, "Formatting Data," on page 601
Chapter 18, "Modifying SAS Data Sets and Tracking Changes," on page 631.
Part 4: Optimizing SAS Programs
Chapter 19, "Introduction to Efficient SAS Programming," on page 677
Chapter 20, "Controlling Memory Usage," on page 687
Chapter 21, "Controlling Data Storage Space," on page 705.
Best practices specify that you should write programs that cause SAS to execute only necessary statements. When you execute the minimum number of statements in the most efficient order, you minimize the hardware resources that SAS uses. The resources that are affected include disk usage, memory usage, and CPU usage.
Techniques for executing only the statements you need include
subsetting your data as soon as is logically possible
processing your data conditionally by using the most appropriate syntax for your data.
To subset your data based on a newly derived or computed variable, you must use the subsetting IF statement in a DATA step. You can subset output data to a SAS data set by processing only those observations that meet a specified condition.
The subsetting IF statement causes the DATA step to continue processing only those raw data records or observations that meet the condition of the expression specified in the IF statement. The resulting SAS data set or data sets contain a subset of the original external file or SAS data set.
Position the subsetting IF statement in the program so that it checks the subsetting condition as soon as it is logically possible, and so that unnecessary statements do not execute. When a statement is false, no further statements are processed for that observation.
Also, remember to subset data before performing calculations and to minimize the use of function calls or arithmetic operators. Unnecessary processing of unwanted observations results in higher expenditure of hardware resources.
Suppose you want to create a subset of data, calculate six new variables, and conditionally output data by reading from the SAS data set Retail.Order_fact. The data set should contain new variables for
the month of the order
the elapsed time between the order date and the delivery date
the profit, based on the retail price, discount, and unit price
total profit
total discount
total wait time.
The subset of data that includes only orders for the month of December is approximately 9.66% of the data.
You can accomplish this task by using a subsetting IF statement. Placement of this statement in the DATA step can affect the efficiency of the DATA step in terms of CPU time and real time. Notice the comparison between these two approaches:
A Subsetting IF Statement at the Bottom
A Subsetting IF Statement near the Top.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.
Position the subsetting IF statement in a DATA step as soon as is logically possible in order to save the most resources.
In the last comparative example you saw how a subsetting IF statement can be positioned in the DATA step so that no further statements are processed for that observation. Next, look at how different programming techniques can be used to
create variables conditionally using DO groups
create variables conditionally when calling functions.
Before viewing the sample code for these two comparative examples, review guidelines for using these techniques.
You can use conditional logic to change how SAS processes selected observations. Two techniques—IF-THEN/ELSE statements and SELECT statements—can be used interchangeably and perform comparably. Based on the characteristics of your data and depending on your environment, one of these techniques might give you better performance. Choose a technique that conserves your programming time and makes the program easiest to read.
Technique | Action |
---|---|
IF-THEN/ELSE statement | executes a SAS statement for observations that meet specific conditions. |
SELECT statement | executes one of several statements or groups of statements. |
The number of conditions (values) tested and the type of variable tested affect CPU resources. For numeric variables, SELECT statements should always be slightly more efficient (use less CPU time) than IF-THEN/ELSE statements. The performance gap between IF-THEN/ELSE and SELECT statements gradually widens as the number of conditions increases. For character variables, IF-THEN/ELSE statements are always more efficient than SELECT statements. The performance gap widens quickly between the two techniques as the number of conditions increases. Δ
Use IF-THEN/ELSE statements when
the data values are character values
the data values are not uniformly distributed
there are few conditions to check.
For best practices, follow these guidelines for writing efficient IF/THEN logic:
For mutually exclusive conditions, use the ELSE IF statement rather than an IF statement for all conditions except the first.
Check the most frequently occurring condition first, and continue checking conditions in descending order of frequency.
When you execute multiple statements based on a condition, put the statements in a DO group.
Use SELECT statements when
you have a long series of mutually exclusive numeric conditions
data values are uniformly distributed.
Before writing conditional logic, determine the distribution of your data values. You can use the
FREQ procedure to examine the distribution of the data values
GCHART or GPLOT procedure to display the distribution graphically
UNIVARIATE procedure to examine distribution statistics and to display the information graphically.
Suppose you want to calculate an adjusted profit based on the values of the variable Order_Type
in the data set Retail.Order_fact. For retail sales, which are represented by the value 1, the adjusted profit should be calculated as 105% of profit. For catalog sales, which are represented by the value 2, the adjusted profit should be calculated as 103% of profit. For Internet sales, which are represented by the value 3, the adjusted profit should be equal to profit.
The following table shows that the values for the variable Order_Type
are not uniformly distributed.
The following table shows that the values for the variable Discount
also are not uniformly distributed.
Techniques for creating new variables conditionally include
IF-THEN/ELSE statements
SELECT statements.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.
Suppose you want to create a report that includes a new variable that is based on the value of an existing variable in the SAS data set Retail.Order_fact. Values for the new Month
variable are extracted from the existing variable Order_Date
by using the MONTH function.
The following table shows that the values for Month
are fairly evenly distributed.
Techniques for creating new variables conditionally include
Parallel IF Statements
ELSE IF Statements, Many Function References
ELSE IF Statements, One Function Reference
SELECT Group.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.
Avoid using parallel IF statements, which use the most resources and are the least efficient way to conditionally execute statements.
Use IF-THEN/ELSE statements and SELECT blocks to be more efficient.
To significantly reduce the amount of resources used, write programs that call a function only once instead of repetitively using the same function in many statements. SAS functions are convenient, but they can be expensive in terms of CPU resources.
You have seen different ways of efficiently creating variables conditionally, which reinforce the principle of executing only necessary statements. The next comparative example in this topic addresses techniques that can be used to create data in DO groups. Before viewing this example, you might want to review the following information about which statements are needed in DO groups.
You can conditionally execute only necessary statements by placing them in DO groups that are associated with IF-THEN/ELSE statements or with SELECT/WHEN statements. Groups of statements execute only when a particular condition is true. Remember to use the following criteria when choosing which technique is more efficient:
IF-THEN/ELSE Statements | SELECT/WHEN Statements | |
---|---|---|
The number of conditions | few | many |
The distribution of a variable's values | not uniform | uniform |
The type of variable | character | numeric |
When using a DO group with IF-THEN/ELSE statements, add DO after the THEN clause, and add an END statement after all of the statements that you want executed as a group.
data orders; set company.orders; if order_type = 1 then do; <multiple executable statements here> end; else if order_type = 2 then do; <multiple executable statements here> end; else if order_type = 3 then do; <multiple executable statements here> end; run;
Use an IF-THEN DO group when you create multiple variables based on a condition. Δ
When using a DO group with SELECT/WHEN statements, add DO after the WHEN condition, and add an END statement after all of the statements that you want executed as a group. Use an OTHERWISE statement to specify the statements that you want executed if no WHEN condition is met.
data orders; set company.orders; select (order_type); when (1) do; <multiple executable statements here> end;
when (2) do; <multiple executable statements here> end; when (3) do; <multiple executable statements here> end; otherwise; end; run;
Remember that IF-THEN/ELSE and SELECT/WHEN logic require that there be no intervening statements between the IF and the ELSE conditions or between the SELECT and the WHEN conditions.
Suppose you want to identify which customer groups are Club Members, Club Gold Members, or Internet/Catalog members, based on data from the data set Retail.Customer_hybrid. You also want to identify the nature of customer activity as "inactive", "low activity", "medium activity", or "high activity".
The following table shows the distribution of values for Customer_Type_ID
.
Techniques for creating new variables based on the values of specific variables include
SELECT, IF/SELECT Statements
Nested SELECT Statements
Serial IF Statements
IF-THEN/ELSE IF Statements with a Link.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.
Best practices specify that you should eliminate unnecessary passes through the data. To minimize I/O operations and CPU time, avoid reading or writing data more than necessary. Accomplish this by taking advantage of one-step processing, which can lead to efficiencies.
Whenever possible, use a single DATA or PROC step to enhance efficiency. Techniques that minimize passes through the data include
using a single DATA step to create multiple output data sets
using the SORT procedure with a WHERE statement to create sorted subsets
using the DATASETS procedure to modify variable attributes.
Before viewing comparative examples that address these techniques, it might be helpful to review the following information about these practices.
It is good programming practice to take advantage of the DATA step's ability to create multiple output data sets at the same time. This is more efficient than using a series of individual DATA steps. Using a single DATA step saves resources because input data is read only once.
The following program demonstrates how to use a single DATA step to read data once and create three subsets of data.
data Sales_managers Account_managers Finance_managers; set company.organization; if job_title='Sales Manager' then output Sales_managers; else if job_title='Account Manager' then output Account_managers; else if job_title='Finance Manager' then output Finance_managers; run;
It is good programming practice to take advantage of the SORT procedure's ability to sort and subset in the same PROC step. This is more efficient than using two separate steps to accomplish this—a DATA step to subset followed by a procedure step that sorts.
The following program demonstrates how you can sort data and select only those observations that meet the conditions of the WHERE statement—in one step.
Processing a data set only once saves I/O and CPU resources.
proc sort data=company.organization out=company.managers; by job_title; where job_title in('Sales Manager', 'Account Manager', 'Finance Manager'), run;
Use PROC DATASETS instead of a DATA step to modify data attributes. The DATASETS procedure uses fewer resources than the DATA step because it processes only the descriptor portion of the data set, not the data portion. PROC DATASETS retains the sort flag, as well as indexes.
proc datasets lib=company; modify organization; rename Job_title=Title; quit;
You cannot use the DATASETS procedure to modify the type, length, or position of variables because these attributes directly affect the data portion of the data set. To perform these operations, use the DATA step. Δ
Suppose you want to create five subsets of data from the data set Retail.Customer. You need a subset for each of five countries. Techniques for creating multiple subsets include writing
Multiple DATA Steps
A Single DATA Step.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating unnecessary passes through the data.
Suppose you want to create a sorted subset of a SAS data set named Retail.Customer. You want only data for customers in the United States, France, Italy, Germany, and Spain.
Techniques for creating sorted subsets of SAS data sets include
A DATA Step and PROC SORT
PROC SORT with a WHERE Statement.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating unnecessary passes through the data.
When you need to process a subset of data with a procedure, use a WHERE statement in the procedure instead of creating a subset of data and reading that data with the procedure.
Write one program step that both sorts and subsets. This approach can take less programmer time and debugging time than writing separate program steps that subset and sort.
Suppose you want to change the variable attributes in Retail.NewCustomer to make them consistent with those in the Retail.Customer data set. The data set Retail.NewCustomer contains 89954 observations and 12 variables.
The following table shows the variable names and formats in each SAS data set.
SAS Data Set | Variable Name | Variable Format |
---|---|---|
Retail.Customer | Country Birth_Date | $COUNTRY. DATE9. |
Retail.NewCustomer | Country_ID Birth_Date | $COUNTRY. MMDDYYP10 |
Techniques for changing the variable attributes of a SAS data set include
A DATA Step
PROC DATASETS.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating unnecessary passes through the data.
Best practices specify that you should write programs that read and write only essential data. If you process fewer observations and variables, you conserve resources. This topic covers many different techniques that can improve performance when you
select observations from SAS data sets
select observations from external files
keep or drop variables when creating or reporting on SAS data sets.
Look at how two useful statements differ in how they select observations to subset.
You can use WHERE statements or subsetting IF statements to subset data. Although both statements test a condition to determine whether SAS should process an observation, the WHERE statement is more efficient.
The following graphic illustrates differences in how these statements process data.
I/O operations are measured as data moves between the disk that contains input SAS data and the buffer in memory, and when data moves from the output buffer to the disk that contains output data sets. Input data is not affected by the WHERE statement or subsetting IF statement. However, output data is affected by both.
CPU time is measured when data must be processed in the program data vector. CPU time can be saved if fewer observations are processed.
A WHERE statement and an IF statement make different use of the program data vector. The WHERE statement examines what is in the input page buffer and selects observations before they are loaded in the program data vector, which results in a savings in CPU operations. The subsetting IF statement loads all observations sequentially into the program data vector. If the statement finds a match and the statement is true, then the data is processed and is written to the output page buffer.
WHERE statements work on variables that are already in SAS data sets. IF statements can work on any variable in the program data vector, including new or old variables.
Suppose you want to create a subset of the data set Retail.Customer. You want to include data for only the United Kingdom. The subset contains approximately 5.56% of the Retail.Customer data.
Techniques for subsetting observations include
Subsetting IF Statement
WHERE Statement.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for reading and writing only essential data.
Review the following table to note other differences between the IF and WHERE statements.
The IF Statement | The WHERE Statement | |
---|---|---|
Selecting Data | can select records from external files, observations from SAS data sets, observations created with an INPUT statement, or observations based on the value of a computed or derived variable. | can select only observations from SAS data sets. |
Conditional Execution | can be executed conditionally as part of an IF statement. | cannot be executed conditionally as part of an IF statement. |
Grouping Data Using a BY Statement | produces the same data set when a BY statement accompanies a SET, MERGE, or UPDATE statement. | produces a different data set when a BY statement accompanies a SET, MERGE, or UPDATE statement. |
Merging Data | selects observations after combining current observations. | applies the selection criteria to each input data set before combining observations. |
If you use the WHERE= data set option and the WHERE statement in the same DATA step, SAS ignores the WHERE statement for input data sets. The WHERE= data set option and the WHERE statement call the same SAS routine. Δ
Another way to read and write only essential data is to process a segment of subsetted data. You accomplish this specialized task by using a WHERE expression in conjunction with the OBS= and FIRSTOBS= data set options. This programming feature is available in SAS 8.1 and later.
In the following example, the WHERE expression selects observations before the OBS= and FIRSTOBS= options are applied. The values specified for OBS= and FIRSTOBS= are the logical observation numbers in the subset, not the physical observation numbers in the data set.
FIRSTOBS = 5 is the fifth observation in the subset, whereas it was observation 101 in the data set Company.Staff.
OBS = 8 is the eighth observation in the subset, whereas it was observation 159 in the data set Company.Staff.
Now that you have seen techniques for efficiently subsetting observations that are read from SAS data sets, look at techniques for subsetting records that are read from external files.
Before viewing a comparative example that illustrates these techniques, it might be useful to review which resources are affected as SAS reads and processes data that is read from external files.
Positioning a subsetting IF statement in a DATA step so that it reads only the variables that are needed to select the subset—before reading all the data—can reduce the overhead required for processing data.
The following graphic illustrates how data is read from an external file, loaded into the input buffer, and read into the program data vector.
Remember that I/O operations are measured as data moves between disks and buffers—for both input and output data. All records are loaded into the input buffer before moving to the program data vector for processing, so I/O is not affected by the placement of a subsetting IF statement in the DATA step.
You can reduce the CPU resources that are required for processing data by limiting what is read into the program data vector. Position a subsetting IF statement after an INPUT statement that reads only the data that is required in order to check for specific conditions. Subsequent statements do not execute and do not process variable values for unwanted observations.
Converting raw character fields to SAS character variables requires less CPU time than converting raw numeric fields to the real binary encoding of SAS numeric variables. Δ
Suppose you want to create a SAS data set by reading a subset of data from an external file that is referenced by the fileref Customerdata. You want the subset to contain only customers in the United Kingdom.
The subset is approximately 5.56% of the countries in the external file, which contains 89,954 records and 12 fields.
Techniques for doing this include
Reading All Variables and Subsetting
Reading Selected Variables and Subsetting.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for reading and writing only essential data.
Position a subsetting IF statement in a DATA step so that only variables that are necessary to select the record are read before subsetting. This can result in significant savings in CPU time. There is no difference in I/O or memory usage between the two techniques.
When selecting rows of data from an external file, read the field(s) on which the selection is being made before reading all the fields into the program data vector.
Use the single trailing @ sign to hold the input buffer so that you can continue to read the record when the variable(s) satisfy the IF condition.
In addition to subsetting observations, you can subset variables by using statements or options that efficiently read and write only essential data.
Before viewing two comparative examples that illustrate how to best limit which variables are read and processed, review how these useful statements and options work.
To subset variables, you can use
the DROP and KEEP statements
the DROP= and KEEP= data set options.
You cannot use the DROP and KEEP statements and the DROP= and KEEP= output data set options in the same step. Δ
Use of the KEEP= data set option and the DROP= data set option can affect resource usage, depending on whether they are used in a SET or MERGE statement or in a DATA statement.
The following figure shows how options in these statements process data.
When used in the SET or MERGE statement, the KEEP= and DROP= data set options affect which variables are read into the program data vector. Reading only the variables that need to be processed in the DATA step can improve efficiency.
When used in the DATA statement, these same options put drop flags on variables to be excluded and affect which variables are written to the output data set.
The DROP and KEEP statements work just like the KEEP= or DROP= options in the DATA statement.
The following table describes differences in how the KEEP statement and the KEEP= data set option write variables to SAS data sets.
KEEP Statement | KEEP= Output Data Set Option | KEEP= Input Data Set Option |
---|---|---|
Causes a DATA step to write only the variables listed in the KEEP statement to one or more data sets. | Causes a DATA step to write only the variables listed in the KEEP= variable list to the output data set. | If there is no KEEP=/DROP= output data set option, enables processing of only the variables not listed in the DROP= variable list. |
If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables. | If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables. | Enables processing of only the variables listed in the KEEP= variable list. |
Applies to all data sets that are created within the same DATA step. | Can write different variables to different data sets. | If there is no KEEP=/DROP= output data set option or KEEP/DROP statement, enables processing of only the variables listed in the KEEP= variable list. |
Available only in the DATA step. | Available in the DATA step or in a PROC step. | Available in the DATA step or in a PROC step. |
The following table describes differences in how the DROP statement and the DROP= data set option write variables to SAS data sets.
DROP Statement | DROP= Output Data Set Option | DROP= Input Data Set Option |
---|---|---|
Causes a DATA step to write only the variables not listed in the DROP statement to one or more data sets. | Causes a DATA step to write only the variables not listed in the DROP= variable list to the output data set. | If there is no KEEP=/DROP= output data set option, enables processing of only the variables not listed in the DROP= variable list. |
If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables. | If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables. | Enables processing of only the variables not listed in the DROP= variable list. |
Applies to all data sets created within the same DATA step. | Can write different variables to different data sets. | If there is no KEEP=/DROP= output data set option or KEEP/DROP statement, enables processing of only the variables not listed in the DROP= variable list. |
Available only in the DATA step. | Available in the DATA step or in a PROC step. | Available in the DATA step or in a PROC step. |
Suppose you want to create a report that contains the average and median values for the variable Profit,
based on data that is read from the data set Retail.Order_fact. Depending on the number of variables eliminated, it might be more efficient to use the KEEP= option in a SET statement to limit which variables are read.
Techniques for reading and writing variables to a data set include
Without the KEEP= Statement
KEEP= in the DATA Statement
KEEP= in the DATA and SET Statements
KEEP= in the SET and MEANS Statements.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for reading and writing only essential data.
Suppose you want to read data from an external file that is referenced by the fileref Rawdata and to create a SAS data set that contains only the variables Customer_ID
, Country
, Gender
, and Customer_Name
.
Techniques for accomplishing this task include
Reading All Fields
Reading Selected Fields.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for reading and writing only essential data.
When possible, read only the fields you need from an external data file to save CPU and real-time resources.
To save CPU resources, avoid converting numerics that you do not need in further processing.
Remember that numeric data is moved into the program data vector after being converted to real binary, floating point numbers; multiple digits are stored in one byte. Character data is moved into the program data vector with no conversion; one character is stored in one byte. Δ
In many cases, it is best practice for you to store data in SAS data sets. You can optimize performance if you know when you should create a SAS data set and when you should read data directly from an external file.
Before viewing the comparative example that illustrates different techniques for reading from a SAS data set versus from an external file, consider the following advantages of storing data in SAS data sets.
When you use SAS to repeatedly analyze or manipulate any particular group of data, it is more efficient to create a SAS data set than to read the raw data each time. Although SAS data sets can be larger than external files and can require more disk space, reading from SAS data sets saves CPU time that is associated with reading a raw data file.
Other reasons for storing data in SAS data sets, rather than external files, include:
When the data is already in a SAS data set, you can use a SAS procedure, function, or routine on the data without further conversion.
SAS data sets are self-documenting.
The descriptor portion of a SAS data set documents information about the data set such as
data set labels
variable labels
variable formats
informats
descriptive variable names.
Create a temporary SAS data set if the data set is used for intermediate tasks such as merging and if it is needed in that SAS session only. Create a temporary SAS data set when the external file on which the data set is based might change between SAS sessions. Δ
Suppose you want to create a SAS data set that contains a large number of variables. One way to accomplish this task is to read from raw data fields in an external file that is referenced by the fileref Rawdata. Another way to accomplish this is to read the same data values from an existing SAS data set named Retail.Customer.
Techniques for accomplishing this task include
Reading from an External File
Reading from a SAS Data Set.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for storing data in SAS data sets.
To save CPU resources, you can read variables from a SAS data set instead of reading data from an external file.
To reduce I/O operations, you can read variables from a SAS data set instead of reading data from an external file. However, savings in I/O operations are largely dependent on the block size of the external data file and on the page size of the SAS data set.
Best practices specify that you avoid unnecessary procedure invocation. One way to do this is to take advantage of procedures that accomplish multiple tasks with one invocation.
Several procedures enable you to create multiple reports by invoking the procedure only once. These include
the SQL procedure
the DATASETS procedure
the FREQ procedure
the TABULATE procedure.
BY-group processing can also minimize unnecessary invocations of procedures. Δ
To illustrate this principle, examine features of the DATASETS procedure.
The DATASETS procedure can use RUN-group processing to process multiple sets of statements. RUN-group processing enables you to submit groups of statements without ending the procedure.
When the DATASETS procedure executes,
SAS reads the program statements that are associated with one task until it reaches a RUN statement or an implied RUN statement.
SAS executes all of the preceding statements immediately, and then continues reading until it reaches another RUN statement or an implied RUN statement.
To execute the last task, you must use a RUN statement or a QUIT statement.
proc datasets lib=company; modify orders; rename quantity=Units_Ordered; format costprice_per_unit dollar13.2; label delivery_date='Date of Delivery'; run; modify customers; format customer_birthdate mmddyy10. run; quit;
You can terminate the PROC DATASETS execution by submitting
a DATA statement
a PROC statement
a QUIT statement.
If you can take advantage of RUN-group processing, you can avoid unnecessary procedure invocation. For best programming practices, you need to understand how RUN-group processing affects the execution of SAS statements. The procedures that support RUN-group processing include
CHART, GCHART
PLOT, GPLOT
GIS, GMAP
GLM
REG
DATASETS.
To illustrate how RUN-group processing works, this discussion focuses on the DATASETS procedure. The comparative example that follows includes programs that use PROC DATASETS to modify the descriptor portion of data sets. Before you examine the code to consider efficient programming techniques, review how the principles associated with RUN-group processing apply to PROC DATASETS.
The DATASETS procedure supports four types of RUN groups. Each RUN group is defined by the statements that compose it and by what causes it to execute.
Some statements in PROC DATASETS act as implied RUN statements because they cause the RUN group that precedes them to execute.
The following list identifies which statements compose a RUN group and what causes each RUN group to execute:
The PROC DATASETS statement always executes immediately. No other statement is necessary to cause the PROC DATASETS statement to execute. Therefore, the PROC DATASETS statement alone is a RUN group.
The MODIFY statement and any of its subordinate statements form a RUN group. These RUN groups always execute immediately. No other statement is necessary to cause a MODIFY RUN group to execute.
The APPEND, CONTENTS, and COPY statements (including EXCLUDE and SELECT, if present) form their own separate RUN groups. Every APPEND statement forms a single-statement RUN group; every CONTENTS statement forms a single-statement RUN group; and every COPY step forms a RUN group. Any other statement in the procedure, except those that are subordinate to either the COPY or MODIFY statement, causes the RUN group to execute.
Additionally, one or more of the following statements form a RUN group:
AGE
EXCHANGE
CHANGE
REPAIR.
If any of these statements appear in sequence in the PROC step, the sequence forms a RUN group. For example, if a REPAIR statement appears immediately after a SAVE statement, the REPAIR statement does not force the SAVE statement to execute; it becomes part of the same RUN group. To execute the RUN group, submit one of the following statements:
PROC DATASETS
MODIFY
APPEND
QUIT
CONTENTS
RUN
COPY
another DATA or PROC step.
Suppose you want to use the DATASETS procedure to modify the data sets NewCustomer, NewOrders, and NewItems.
Techniques for accomplishing this task include using
Multiple DATASETS Procedures
A Single DATASETS Procedure.
The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for avoiding unnecessary procedure invocation.
Invoke the DATASETS procedure once and process all the changes for a library in one step to save CPU and I/O resources—at the cost of memory resources.
Because the specified library could change between invocations of the DATASETS procedure, the procedure is reloaded into memory for each invocation. Δ
You minimize the CPU time that SAS uses when you execute the minimum number of statements in the most efficient order.
For a more efficient program, place the subsetting IF statement as soon as logically possible in a DATA step when creating a subset of data.
Review guidelines for using conditional logic efficiently with IF-THEN/ELSE statements or SELECT statements. Remember to minimize the number of statements that use SAS functions or arithmetic operators.
Conditionally execute only necessary statements by placing statements in groups that are associated with IF-THEN/ELSE statements or SELECT/WHEN statements. Groups of statements execute only when a particular condition is true. Review the criteria for using DO groups efficiently.
Review the related comparative examples:
"Comparative Example: Creating a Subset of Data" on page 742
"Comparative Example: Creating Variables Conditionally Using DO Groups" on page 745
"Comparative Example: Creating Variables Conditionally When Calling Functions" on page 747
"Comparative Example: Creating Data in DO Groups" on page 752.
You should avoid reading or writing data more than necessary in order to minimize I/O operations.
There are a variety of techniques that you can use. For example, use a single DATA step to create multiple output data sets from one pass of the input data, rather than using multiple DATA steps to process the input data each time that you create an output data set. Create sorted subsets by subsetting data with the SORT procedure rather than subsetting data in a DATA step and then sorting. Change variable attributes by using PROC DATASETS rather than a DATA step.
Review the related comparative examples:
"Comparative Example: Creating Multiple Subsets of a SAS Data Set" on page 758
"Comparative Example: Creating a Sorted Subset of a SAS Data Set" on page 759
"Comparative Example: Changing the Variable Attributes of a SAS Data Set" on page 760.
If you process fewer observations and variables, SAS performs fewer I/O operations. To limit the number of observations that are processed, you can use the subsetting IF statement and the WHERE statement. Best programming practices can be applied if you understand other differences between subsetting IF and WHERE statements. You can also improve performance by applying OBS= and FIRSTOBS= processing with a WHERE statement.
To select observations when reading data from external files, position a subsetting IF statement in a DATA step so that it reads only the variables that are needed to select the subsetbefore reading all the data. This can reduce the overhead required to process data.
To limit the number of variables that are processed, you can use
the DROP and KEEP statements
the DROP= and KEEP= data set options.
In the SET statement, the DROP= or KEEP= data set option controls which variables are read and subsequently processed. In the DATA statement, the DROP= or KEEP= data set option controls which variables are written to a data set after processing. Using the SET statement with these options is the most efficient and best practice.
Review the related comparative examples:
"Comparative Example: Creating a Subset of a SAS Data Set" on page 763
"Comparative Example: Creating a Subset of Data by Reading Data from an External File" on page 766
"Comparative Example: Creating a Report That Contains Average and Median Statistics" on page 770
"Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables" on page 771.
When you use SAS to repeatedly analyze or manipulate any particular group of data, create a SAS data set instead of reading the raw data each time.
Reading data from an external file versus reading from a SAS data set greatly increases CPU usage.
Review the related comparative example:
"Comparative Example: Creating a SAS Data Set or Reading from an External File" on page 774.
Invoking procedures once rather than multiple times can be the most efficient way to process data. Several procedures enable you to create multiple reports by invoking the procedure only once.
Using a single DATASETS procedure instead of multiple DATASETS procedures to modify the descriptor portion of a data set results in a noticeable savings in both CPU and I/O operations. Also, you can take advantage of RUN-group processing to submit groups of statements without ending the procedure.
Review the related comparative example:
"Comparative Example: Modifying the Descriptor Portion of SAS Data Sets" on page 777.
Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.
Placing the subsetting IF statement at the top rather than near the bottom of a DATA step results in a savings in CPU usage. What happens if the subset is large rather than small?
The savings in CPU usage increases as the subset grows larger because the I/O increases.
The savings in CPU usage decreases as the subset grows larger. However, placing the subsetting IF statement at the top of a DATA step always uses fewer resources than placing it at the bottom.
The savings in CPU usage remains constant as the subset grows larger. However, placing the subsetting IF statement near the bottom of a data set is preferable.
The savings in CPU usage decreases as the subset grows larger. However, placing the subsetting IF statement near the bottom of a data set increases the I/O.
Which of the following statements is true about techniques that are used for modifying data and attributes?
You can use PROC DATASETS to modify both data values and variable attributes.
You can use PROC DATASETS to modify only data values.
You can use the DATA step to modify both data values and variable attributes.
You can use the DATA step to modify only variable attributes.
For selecting observations, is a subsetting IF statement or a WHERE statement more efficient? Why?
A subsetting IF statement is more efficient because it loads all observations sequentially into the program data vector.
A subsetting IF statement is more efficient because it examines what is in the input buffer and selects observations before they are loaded into the program data vector, which results in a savings in CPU operations.
A WHERE statement is more efficient because it loads all observations sequentially into the program data vector.
A WHERE statement is more efficient because it examines what is in the input buffer and selects observations before they are loaded into the program data vector, which results in a savings in CPU operations.
When is it more advantageous to create a temporary SAS data set rather than a permanent SAS data set?
When the external file on which the data set is based might change between SAS sessions.
When the external file on which the data set is based does not change between SAS sessions.
When the data set is needed for more than one SAS session.
When you are converting raw numeric values to SAS data values.
When you compare the technique of using multiple DATASETS procedures to using a single DATASETS procedure to modify the descriptor portion of a data set, which is true?
A one-step DATASETS procedure results in an increase in I/O operations.
Multiple DATASETS procedures result in a decrease in I/O operations.
A one-step DATASETS procedure results in a decrease in CPU usage.
Multiple DATASETS procedures result in a decrease in CPU usage.