Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 22. Using Best Practices

Overview 740
- Introduction 740
- Objectives 741
- Prerequisites 741
Executing Only Necessary Statements 741
- Positioning of the Subsetting IF Statement 742
- Comparative Example: Creating a Subset of Data 742
  - Programming Techniques 743
  - General Recommendations 743
- Using Conditional Logic Efficiently 744
- Comparative Example: Creating Variables Conditionally Using DO Groups 745
  - Programming Techniques 746
  - General Recommendations 747
- Comparative Example: Creating Variables Conditionally When Calling Functions 747
  - Programming Techniques 749
  - General Recommendations 750
- Using DO Groups Efficiently 751
- Comparative Example: Creating Data in DO Groups 752
  - Programming Techniques 753
  - General Recommendations 756
Eliminating Unnecessary Passes through the Data 756
- Using a Single DATA or PROC Step to Enhance Efficiency 756
- Using a Single DATA Step to Create Multiple Output Data Sets 757
- Using the SORT Procedure with a WHERE Statement to Create Sorted Subsets 757
- Using the DATASETS Procedure to Modify Variable Attributes 757
- Comparative Example: Creating Multiple Subsets of a SAS Data Set 758
  - Programming Techniques 758
  - General Recommendations 759
- Comparative Example: Creating a Sorted Subset of a SAS Data Set 759
  - Programming Techniques 760
  - General Recommendations 760
- Comparative Example: Changing the Variable Attributes of a SAS Data Set 760
  - Programming Techniques 761
  - General Recommendations 761
Reading and Writing Only Essential Data 762
- Selecting Observations Using Subsetting IF versus WHERE Statement 762
- Comparative Example: Creating a Subset of a SAS Data Set 763
  - Programming Techniques 763
  - General Recommendations 763
- Other Differences between the IF and WHERE Statements 763
- Using the WHERE Statement with the OBS= and FIRSTOBS= Options 764
- Selecting Observations When Reading Data from External Files 765
- Comparative Example: Creating a Subset of Data by Reading Data from an External File 766
  - Programming Techniques 766
  - General Recommendations 767
- Subsetting Variables with the KEEP= and DROP= Statements and Options 767
- Comparative Example: Creating a Report That Contains Average and Median Statistics 770
  - Programming Techniques 770
  - General Recommendations 771
- Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables 771
  - Programming Techniques 772
  - General Recommendations 772
Storing Data in SAS Data Sets 773
- Comparative Example: Creating a SAS Data Set or Reading from an External File 774
  - Programming Techniques 774
  - General Recommendations 775
Avoiding Unnecessary Procedure Invocation 775
- Executing the DATASETS Procedure 775
- RUN-Group Processing 776
- Using Different Types of RUN Groups with PROC DATASETS 776
- Comparative Example: Modifying the Descriptor Portion of SAS Data Sets 777
  - Programming Techniques 778
- General Recommendations 779
Summary 779
- Executing Only Necessary Statements 779
- Eliminating Unnecessary Passes Through the Data 779
- Reading and Writing Only Essential Data 779
- Storing SAS Data in SAS Data Sets 780
- Avoiding Unnecessary Procedure Invocation 780
Quiz 780

Overview

Introduction

This chapter demonstrates different ways of using best SAS programming practices to optimize performance. As you compare the techniques described in this chapter, remember that differences in the use of resources are affected by which operating environment you work in and by the characteristics of your data.

This chapter is organized by topics that emphasize the following basic principles:

Execute only necessary statements.
Eliminate unnecessary passes of the data.
Read and write only the data that you require.
Store data in SAS data sets.
Avoid unnecessary procedure invocation.

Each topic includes comparative examples that can improve the efficiency of your programs. Write programs to generate your own benchmarks, and adopt the programming techniques that produce the most savings for you.

Note

This chapter does not cover the SAS Scalable Performance Data Engine (SAS SPD Engine), which is a SAS 9.1 technology for threaded processing. For details about using the SAS SPD Engine to improve performance, see the SAS documentation. Δ

Objectives

In this chapter, you learn too efficiently

subset observations
create new variables
process and output data conditionally
create multiple output data sets and sorted subsets
modify variable attributes
select observations from SAS data sets and external files
subset variables
read data from SAS data sets
invoke SAS procedures.

Prerequisites

Before beginning this chapter, you should complete the following chapters:

Part 1: SQL Processing with SAS
- Chapter 1, "Performing Queries Using PROC SQL," on page 3
- Chapter 2, "Performing Advanced Queries Using PROC SQL," on page 25
- Chapter 3, "Combining Tables Horizontally Using PROC SQL," on page 79
- Chapter 4, "Combining Tables Vertically Using PROC SQL," on page 123
- Chapter 5, "Creating and Managing Tables Using PROC SQL," on page 159
- Chapter 6, "Creating and Managing Indexes Using PROC SQL," on page 221
- Chapter 7, "Creating and Managing Views Using PROC SQL," on page 243
- Chapter 8, "Managing Processing Using PROC SQL," on page 261.
Part 3: Advanced SAS Programming Techniques
- Chapter 13, "Creating Samples and Indexes," on page 449
- Chapter 14, "Combining Data Vertically," on page 479
- Chapter 15, "Combining Data Horizontally," on page 511
- Chapter 16, "Using Lookup Tables to Match Data," on page 557
- Chapter 17, "Formatting Data," on page 601
- Chapter 18, "Modifying SAS Data Sets and Tracking Changes," on page 631.
Part 4: Optimizing SAS Programs
- Chapter 19, "Introduction to Efficient SAS Programming," on page 677
- Chapter 20, "Controlling Memory Usage," on page 687
- Chapter 21, "Controlling Data Storage Space," on page 705.

Executing Only Necessary Statements

Best practices specify that you should write programs that cause SAS to execute only necessary statements. When you execute the minimum number of statements in the most efficient order, you minimize the hardware resources that SAS uses. The resources that are affected include disk usage, memory usage, and CPU usage.

Techniques for executing only the statements you need include

subsetting your data as soon as is logically possible
processing your data conditionally by using the most appropriate syntax for your data.

Positioning of the Subsetting IF Statement

To subset your data based on a newly derived or computed variable, you must use the subsetting IF statement in a DATA step. You can subset output data to a SAS data set by processing only those observations that meet a specified condition.

The subsetting IF statement causes the DATA step to continue processing only those raw data records or observations that meet the condition of the expression specified in the IF statement. The resulting SAS data set or data sets contain a subset of the original external file or SAS data set.

Position the subsetting IF statement in the program so that it checks the subsetting condition as soon as it is logically possible, and so that unnecessary statements do not execute. When a statement is false, no further statements are processed for that observation.

Also, remember to subset data before performing calculations and to minimize the use of function calls or arithmetic operators. Unnecessary processing of unwanted observations results in higher expenditure of hardware resources.

Comparative Example: Creating a Subset of Data

Suppose you want to create a subset of data, calculate six new variables, and conditionally output data by reading from the SAS data set Retail.Order_fact. The data set should contain new variables for

the month of the order
the elapsed time between the order date and the delivery date
the profit, based on the retail price, discount, and unit price
total profit
total discount
total wait time.

The subset of data that includes only orders for the month of December is approximately 9.66% of the data.

You can accomplish this task by using a subsetting IF statement. Placement of this statement in the DATA step can affect the efficiency of the DATA step in terms of CPU time and real time. Notice the comparison between these two approaches:

A Subsetting IF Statement at the Bottom
A Subsetting IF Statement near the Top.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.

Programming Techniques

General Recommendations

Position the subsetting IF statement in a DATA step as soon as is logically possible in order to save the most resources.
In the last comparative example you saw how a subsetting IF statement can be positioned in the DATA step so that no further statements are processed for that observation. Next, look at how different programming techniques can be used to
create variables conditionally using DO groups
create variables conditionally when calling functions.

Before viewing the sample code for these two comparative examples, review guidelines for using these techniques.

Using Conditional Logic Efficiently

You can use conditional logic to change how SAS processes selected observations. Two techniques—IF-THEN/ELSE statements and SELECT statements—can be used interchangeably and perform comparably. Based on the characteristics of your data and depending on your environment, one of these techniques might give you better performance. Choose a technique that conserves your programming time and makes the program easiest to read.

Technique	Action
IF-THEN/ELSE statement	executes a SAS statement for observations that meet specific conditions.
SELECT statement	executes one of several statements or groups of statements.

Note

The number of conditions (values) tested and the type of variable tested affect CPU resources. For numeric variables, SELECT statements should always be slightly more efficient (use less CPU time) than IF-THEN/ELSE statements. The performance gap between IF-THEN/ELSE and SELECT statements gradually widens as the number of conditions increases. For character variables, IF-THEN/ELSE statements are always more efficient than SELECT statements. The performance gap widens quickly between the two techniques as the number of conditions increases. Δ

Use IF-THEN/ELSE statements when

the data values are character values
the data values are not uniformly distributed
there are few conditions to check.

For best practices, follow these guidelines for writing efficient IF/THEN logic:

For mutually exclusive conditions, use the ELSE IF statement rather than an IF statement for all conditions except the first.
Check the most frequently occurring condition first, and continue checking conditions in descending order of frequency.
When you execute multiple statements based on a condition, put the statements in a DO group.

Use SELECT statements when

you have a long series of mutually exclusive numeric conditions
data values are uniformly distributed.

Before writing conditional logic, determine the distribution of your data values. You can use the

FREQ procedure to examine the distribution of the data values
GCHART or GPLOT procedure to display the distribution graphically
UNIVARIATE procedure to examine distribution statistics and to display the information graphically.

Comparative Example: Creating Variables Conditionally Using DO Groups

Suppose you want to calculate an adjusted profit based on the values of the variable Order_Type in the data set Retail.Order_fact. For retail sales, which are represented by the value 1, the adjusted profit should be calculated as 105% of profit. For catalog sales, which are represented by the value 2, the adjusted profit should be calculated as 103% of profit. For Internet sales, which are represented by the value 3, the adjusted profit should be equal to profit.

The following table shows that the values for the variable Order_Type are not uniformly distributed.

Comparative Example: Creating Variables Conditionally Using DO Groups

The following table shows that the values for the variable Discount also are not uniformly distributed.

Techniques for creating new variables conditionally include

IF-THEN/ELSE statements
SELECT statements.

Programming Techniques

General Recommendations

Check the most frequently occurring condition first, and continue checking conditions in descending order of frequency, regardless of whether you use IF-THEN/ELSE or SELECT statements.
When you execute multiple statements based on a condition, put the statements in a DO group.

Comparative Example: Creating Variables Conditionally When Calling Functions

Suppose you want to create a report that includes a new variable that is based on the value of an existing variable in the SAS data set Retail.Order_fact. Values for the new Month variable are extracted from the existing variable Order_Date by using the MONTH function.

The following table shows that the values for Month are fairly evenly distributed.

Comparative Example: Creating Variables Conditionally When Calling Functions

Techniques for creating new variables conditionally include

Parallel IF Statements
ELSE IF Statements, Many Function References
ELSE IF Statements, One Function Reference
SELECT Group.

Programming Techniques

General Recommendations

Avoid using parallel IF statements, which use the most resources and are the least efficient way to conditionally execute statements.
Use IF-THEN/ELSE statements and SELECT blocks to be more efficient.
To significantly reduce the amount of resources used, write programs that call a function only once instead of repetitively using the same function in many statements. SAS functions are convenient, but they can be expensive in terms of CPU resources.

You have seen different ways of efficiently creating variables conditionally, which reinforce the principle of executing only necessary statements. The next comparative example in this topic addresses techniques that can be used to create data in DO groups. Before viewing this example, you might want to review the following information about which statements are needed in DO groups.

Using DO Groups Efficiently

You can conditionally execute only necessary statements by placing them in DO groups that are associated with IF-THEN/ELSE statements or with SELECT/WHEN statements. Groups of statements execute only when a particular condition is true. Remember to use the following criteria when choosing which technique is more efficient:

	IF-THEN/ELSE Statements	SELECT/WHEN Statements
The number of conditions	few	many
The distribution of a variable's values	not uniform	uniform
The type of variable	character	numeric

When using a DO group with IF-THEN/ELSE statements, add DO after the THEN clause, and add an END statement after all of the statements that you want executed as a group.

data orders;
   set company.orders;
   if order_type = 1 then
      do;
      <multiple executable statements here>
      end;
   else if order_type = 2 then
      do;
      <multiple executable statements here>
       end;
   else  if order_type = 3 then
       do;
       <multiple executable statements here>
       end;
run;

Note

Use an IF-THEN DO group when you create multiple variables based on a condition. Δ

When using a DO group with SELECT/WHEN statements, add DO after the WHEN condition, and add an END statement after all of the statements that you want executed as a group. Use an OTHERWISE statement to specify the statements that you want executed if no WHEN condition is met.

data orders;
   set company.orders;
   select (order_type);
      when (1)
         do;
         <multiple executable statements here>
         end;

      when (2)
         do;
         <multiple executable statements here>
         end;
      when (3)
         do;
         <multiple executable statements here>
         end;
      otherwise;
   end;
run;

Remember that IF-THEN/ELSE and SELECT/WHEN logic require that there be no intervening statements between the IF and the ELSE conditions or between the SELECT and the WHEN conditions.

Comparative Example: Creating Data in DO Groups

Suppose you want to identify which customer groups are Club Members, Club Gold Members, or Internet/Catalog members, based on data from the data set Retail.Customer_hybrid. You also want to identify the nature of customer activity as "inactive", "low activity", "medium activity", or "high activity".

The following table shows the distribution of values for Customer_Type_ID.

Comparative Example: Creating Data in DO Groups

Techniques for creating new variables based on the values of specific variables include

SELECT, IF/SELECT Statements
Nested SELECT Statements
Serial IF Statements
IF-THEN/ELSE IF Statements with a Link.

Programming Techniques

This program creates a permanent SAS data set named Retail.Customers by reading the Retail.Customer_hybrid data set. SELECT/WHEN logic and SELECT/WHEN statements in an IF/THEN DO group populate the variables Customer_Group and Customer_Activity. If the value of the first two digits of Customer_Type_ID is 10, 20, or 30, then Customer_Group is populated. If the value of the first two digits of Customer_Type_ID is 10 or 20, then Customer_Activity is populated by reading the last two digits of Customer_Type_ID.

General Recommendations

Avoid serial IF statements because they use extra resources.

Eliminating Unnecessary Passes through the Data

Best practices specify that you should eliminate unnecessary passes through the data. To minimize I/O operations and CPU time, avoid reading or writing data more than necessary. Accomplish this by taking advantage of one-step processing, which can lead to efficiencies.

Using a Single DATA or PROC Step to Enhance Efficiency

Whenever possible, use a single DATA or PROC step to enhance efficiency. Techniques that minimize passes through the data include

using a single DATA step to create multiple output data sets
using the SORT procedure with a WHERE statement to create sorted subsets
using the DATASETS procedure to modify variable attributes.

Before viewing comparative examples that address these techniques, it might be helpful to review the following information about these practices.

Using a Single DATA Step to Create Multiple Output Data Sets

It is good programming practice to take advantage of the DATA step's ability to create multiple output data sets at the same time. This is more efficient than using a series of individual DATA steps. Using a single DATA step saves resources because input data is read only once.

The following program demonstrates how to use a single DATA step to read data once and create three subsets of data.

data Sales_managers Account_managers Finance_managers;
   set company.organization;
   if job_title='Sales Manager' then
              output Sales_managers;
   else if job_title='Account Manager' then
              output Account_managers;
   else if job_title='Finance Manager' then
              output Finance_managers;
run;

Using the SORT Procedure with a WHERE Statement to Create Sorted Subsets

It is good programming practice to take advantage of the SORT procedure's ability to sort and subset in the same PROC step. This is more efficient than using two separate steps to accomplish this—a DATA step to subset followed by a procedure step that sorts.

The following program demonstrates how you can sort data and select only those observations that meet the conditions of the WHERE statement—in one step.

Processing a data set only once saves I/O and CPU resources.

proc sort data=company.organization
          out=company.managers;
   by job_title;
   where job_title in('Sales Manager',
                      'Account Manager',
                      'Finance Manager'),
run;

Using the DATASETS Procedure to Modify Variable Attributes

Use PROC DATASETS instead of a DATA step to modify data attributes. The DATASETS procedure uses fewer resources than the DATA step because it processes only the descriptor portion of the data set, not the data portion. PROC DATASETS retains the sort flag, as well as indexes.

proc datasets lib=company;
   modify organization;
   rename Job_title=Title;
quit;

Note

You cannot use the DATASETS procedure to modify the type, length, or position of variables because these attributes directly affect the data portion of the data set. To perform these operations, use the DATA step. Δ

Comparative Example: Creating Multiple Subsets of a SAS Data Set

Suppose you want to create five subsets of data from the data set Retail.Customer. You need a subset for each of five countries. Techniques for creating multiple subsets include writing

Multiple DATA Steps
A Single DATA Step.

Programming Techniques

General Recommendations

When creating multiple subsets from a SAS data set, use a single DATA step with IF-THEN/ELSE IF logic to output to appropriate data sets.

Comparative Example: Creating a Sorted Subset of a SAS Data Set

Suppose you want to create a sorted subset of a SAS data set named Retail.Customer. You want only data for customers in the United States, France, Italy, Germany, and Spain.

Techniques for creating sorted subsets of SAS data sets include

A DATA Step and PROC SORT
PROC SORT with a WHERE Statement.

Programming Techniques

General Recommendations

When you need to process a subset of data with a procedure, use a WHERE statement in the procedure instead of creating a subset of data and reading that data with the procedure.
Write one program step that both sorts and subsets. This approach can take less programmer time and debugging time than writing separate program steps that subset and sort.

Comparative Example: Changing the Variable Attributes of a SAS Data Set

Suppose you want to change the variable attributes in Retail.NewCustomer to make them consistent with those in the Retail.Customer data set. The data set Retail.NewCustomer contains 89954 observations and 12 variables.

The following table shows the variable names and formats in each SAS data set.

SAS Data Set	Variable Name	Variable Format
Retail.Customer	Country Birth_Date	$COUNTRY. DATE9.
Retail.NewCustomer	Country_ID Birth_Date	$COUNTRY. MMDDYYP10

SAS Data Set

Variable Name

Variable Format

Retail.Customer

Country

Birth_Date

$COUNTRY.

DATE9.

Retail.NewCustomer

Country_ID

Birth_Date

$COUNTRY.

MMDDYYP10

Techniques for changing the variable attributes of a SAS data set include

A DATA Step
PROC DATASETS.

Programming Techniques

General Recommendations

To save significant resources, use the DATASETS procedure instead of a DATA step to change the attributes of a SAS data set.

Reading and Writing Only Essential Data

Best practices specify that you should write programs that read and write only essential data. If you process fewer observations and variables, you conserve resources. This topic covers many different techniques that can improve performance when you

select observations from SAS data sets
select observations from external files
keep or drop variables when creating or reporting on SAS data sets.

Look at how two useful statements differ in how they select observations to subset.

Selecting Observations Using Subsetting IF versus WHERE Statement

You can use WHERE statements or subsetting IF statements to subset data. Although both statements test a condition to determine whether SAS should process an observation, the WHERE statement is more efficient.

The following graphic illustrates differences in how these statements process data.

Selecting Observations Using Subsetting IF versus WHERE Statement

I/O operations are measured as data moves between the disk that contains input SAS data and the buffer in memory, and when data moves from the output buffer to the disk that contains output data sets. Input data is not affected by the WHERE statement or subsetting IF statement. However, output data is affected by both.

CPU time is measured when data must be processed in the program data vector. CPU time can be saved if fewer observations are processed.

A WHERE statement and an IF statement make different use of the program data vector. The WHERE statement examines what is in the input page buffer and selects observations before they are loaded in the program data vector, which results in a savings in CPU operations. The subsetting IF statement loads all observations sequentially into the program data vector. If the statement finds a match and the statement is true, then the data is processed and is written to the output page buffer.

WHERE statements work on variables that are already in SAS data sets. IF statements can work on any variable in the program data vector, including new or old variables.

Comparative Example: Creating a Subset of a SAS Data Set

Suppose you want to create a subset of the data set Retail.Customer. You want to include data for only the United Kingdom. The subset contains approximately 5.56% of the Retail.Customer data.

Techniques for subsetting observations include

Subsetting IF Statement
WHERE Statement.

Programming Techniques

General Recommendations

To save CPU resources, use a WHERE statement instead of a subsetting IF statement to subset a SAS data set.

Other Differences between the IF and WHERE Statements

Review the following table to note other differences between the IF and WHERE statements.

	The IF Statement	The WHERE Statement
Selecting Data	can select records from external files, observations from SAS data sets, observations created with an INPUT statement, or observations based on the value of a computed or derived variable.	can select only observations from SAS data sets.
Conditional Execution	can be executed conditionally as part of an IF statement.	cannot be executed conditionally as part of an IF statement.
Grouping Data Using a BY Statement	produces the same data set when a BY statement accompanies a SET, MERGE, or UPDATE statement.	produces a different data set when a BY statement accompanies a SET, MERGE, or UPDATE statement.
Merging Data	selects observations after combining current observations.	applies the selection criteria to each input data set before combining observations.

Note

If you use the WHERE= data set option and the WHERE statement in the same DATA step, SAS ignores the WHERE statement for input data sets. The WHERE= data set option and the WHERE statement call the same SAS routine. Δ

Using the WHERE Statement with the OBS= and FIRSTOBS= Options

Another way to read and write only essential data is to process a segment of subsetted data. You accomplish this specialized task by using a WHERE expression in conjunction with the OBS= and FIRSTOBS= data set options. This programming feature is available in SAS 8.1 and later.

In the following example, the WHERE expression selects observations before the OBS= and FIRSTOBS= options are applied. The values specified for OBS= and FIRSTOBS= are the logical observation numbers in the subset, not the physical observation numbers in the data set.

Using the WHERE Statement with the OBS= and FIRSTOBS= Options

FIRSTOBS = 5 is the fifth observation in the subset, whereas it was observation 101 in the data set Company.Staff.

OBS = 8 is the eighth observation in the subset, whereas it was observation 159 in the data set Company.Staff.

Now that you have seen techniques for efficiently subsetting observations that are read from SAS data sets, look at techniques for subsetting records that are read from external files.

Before viewing a comparative example that illustrates these techniques, it might be useful to review which resources are affected as SAS reads and processes data that is read from external files.

Selecting Observations When Reading Data from External Files

Positioning a subsetting IF statement in a DATA step so that it reads only the variables that are needed to select the subset—before reading all the data—can reduce the overhead required for processing data.

The following graphic illustrates how data is read from an external file, loaded into the input buffer, and read into the program data vector.

Selecting Observations When Reading Data from External Files

Remember that I/O operations are measured as data moves between disks and buffers—for both input and output data. All records are loaded into the input buffer before moving to the program data vector for processing, so I/O is not affected by the placement of a subsetting IF statement in the DATA step.

You can reduce the CPU resources that are required for processing data by limiting what is read into the program data vector. Position a subsetting IF statement after an INPUT statement that reads only the data that is required in order to check for specific conditions. Subsequent statements do not execute and do not process variable values for unwanted observations.

Note

Converting raw character fields to SAS character variables requires less CPU time than converting raw numeric fields to the real binary encoding of SAS numeric variables. Δ

Comparative Example: Creating a Subset of Data by Reading Data from an External File

Suppose you want to create a SAS data set by reading a subset of data from an external file that is referenced by the fileref Customerdata. You want the subset to contain only customers in the United Kingdom.

The subset is approximately 5.56% of the countries in the external file, which contains 89,954 records and 12 fields.

Techniques for doing this include

Reading All Variables and Subsetting
Reading Selected Variables and Subsetting.

Programming Techniques

General Recommendations

Position a subsetting IF statement in a DATA step so that only variables that are necessary to select the record are read before subsetting. This can result in significant savings in CPU time. There is no difference in I/O or memory usage between the two techniques.
When selecting rows of data from an external file, read the field(s) on which the selection is being made before reading all the fields into the program data vector.
Use the single trailing @ sign to hold the input buffer so that you can continue to read the record when the variable(s) satisfy the IF condition.

In addition to subsetting observations, you can subset variables by using statements or options that efficiently read and write only essential data.

Before viewing two comparative examples that illustrate how to best limit which variables are read and processed, review how these useful statements and options work.

Subsetting Variables with the KEEP= and DROP= Statements and Options

To subset variables, you can use

the DROP and KEEP statements
the DROP= and KEEP= data set options.

Subsetting Variables with the KEEP= and DROP= Statements and Options

Note

You cannot use the DROP and KEEP statements and the DROP= and KEEP= output data set options in the same step. Δ

Use of the KEEP= data set option and the DROP= data set option can affect resource usage, depending on whether they are used in a SET or MERGE statement or in a DATA statement.

The following figure shows how options in these statements process data.

When used in the SET or MERGE statement, the KEEP= and DROP= data set options affect which variables are read into the program data vector. Reading only the variables that need to be processed in the DATA step can improve efficiency.

When used in the DATA statement, these same options put drop flags on variables to be excluded and affect which variables are written to the output data set.

The DROP and KEEP statements work just like the KEEP= or DROP= options in the DATA statement.

The following table describes differences in how the KEEP statement and the KEEP= data set option write variables to SAS data sets.

KEEP Statement	KEEP= Output Data Set Option	KEEP= Input Data Set Option
Causes a DATA step to write only the variables listed in the KEEP statement to one or more data sets.	Causes a DATA step to write only the variables listed in the KEEP= variable list to the output data set.	If there is no KEEP=/DROP= output data set option, enables processing of only the variables not listed in the DROP= variable list.
If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables.	If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables.	Enables processing of only the variables listed in the KEEP= variable list.
Applies to all data sets that are created within the same DATA step.	Can write different variables to different data sets.	If there is no KEEP=/DROP= output data set option or KEEP/DROP statement, enables processing of only the variables listed in the KEEP= variable list.
Available only in the DATA step.	Available in the DATA step or in a PROC step.	Available in the DATA step or in a PROC step.

The following table describes differences in how the DROP statement and the DROP= data set option write variables to SAS data sets.

DROP Statement	DROP= Output Data Set Option	DROP= Input Data Set Option
Causes a DATA step to write only the variables not listed in the DROP statement to one or more data sets.	Causes a DATA step to write only the variables not listed in the DROP= variable list to the output data set.	If there is no KEEP=/DROP= output data set option, enables processing of only the variables not listed in the DROP= variable list.
If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables.	If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables.	Enables processing of only the variables not listed in the DROP= variable list.
Applies to all data sets created within the same DATA step.	Can write different variables to different data sets.	If there is no KEEP=/DROP= output data set option or KEEP/DROP statement, enables processing of only the variables not listed in the DROP= variable list.
Available only in the DATA step.	Available in the DATA step or in a PROC step.	Available in the DATA step or in a PROC step.

Comparative Example: Creating a Report That Contains Average and Median Statistics

Suppose you want to create a report that contains the average and median values for the variable Profit, based on data that is read from the data set Retail.Order_fact. Depending on the number of variables eliminated, it might be more efficient to use the KEEP= option in a SET statement to limit which variables are read.

Techniques for reading and writing variables to a data set include

Without the KEEP= Statement
KEEP= in the DATA Statement
KEEP= in the DATA and SET Statements
KEEP= in the SET and MEANS Statements.

Programming Techniques

General Recommendations

To reduce both CPU time and I/O operations, avoid reading and writing variables that are not needed.

Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables

Suppose you want to read data from an external file that is referenced by the fileref Rawdata and to create a SAS data set that contains only the variables Customer_ID, Country, Gender, and Customer_Name.

Techniques for accomplishing this task include

Reading All Fields
Reading Selected Fields.

Programming Techniques

General Recommendations

When possible, read only the fields you need from an external data file to save CPU and real-time resources.
To save CPU resources, avoid converting numerics that you do not need in further processing.

Note

Remember that numeric data is moved into the program data vector after being converted to real binary, floating point numbers; multiple digits are stored in one byte. Character data is moved into the program data vector with no conversion; one character is stored in one byte. Δ

Storing Data in SAS Data Sets

In many cases, it is best practice for you to store data in SAS data sets. You can optimize performance if you know when you should create a SAS data set and when you should read data directly from an external file.

Before viewing the comparative example that illustrates different techniques for reading from a SAS data set versus from an external file, consider the following advantages of storing data in SAS data sets.

When you use SAS to repeatedly analyze or manipulate any particular group of data, it is more efficient to create a SAS data set than to read the raw data each time. Although SAS data sets can be larger than external files and can require more disk space, reading from SAS data sets saves CPU time that is associated with reading a raw data file.

Other reasons for storing data in SAS data sets, rather than external files, include:

When the data is already in a SAS data set, you can use a SAS procedure, function, or routine on the data without further conversion.
SAS data sets are self-documenting.

The descriptor portion of a SAS data set documents information about the data set such as

data set labels
variable labels
variable formats
informats
descriptive variable names.

Note

Create a temporary SAS data set if the data set is used for intermediate tasks such as merging and if it is needed in that SAS session only. Create a temporary SAS data set when the external file on which the data set is based might change between SAS sessions. Δ

Comparative Example: Creating a SAS Data Set or Reading from an External File

Suppose you want to create a SAS data set that contains a large number of variables. One way to accomplish this task is to read from raw data fields in an external file that is referenced by the fileref Rawdata. Another way to accomplish this is to read the same data values from an existing SAS data set named Retail.Customer.

Techniques for accomplishing this task include

Reading from an External File
Reading from a SAS Data Set.

Programming Techniques

General Recommendations

To save CPU resources, you can read variables from a SAS data set instead of reading data from an external file.
To reduce I/O operations, you can read variables from a SAS data set instead of reading data from an external file. However, savings in I/O operations are largely dependent on the block size of the external data file and on the page size of the SAS data set.

Avoiding Unnecessary Procedure Invocation

Best practices specify that you avoid unnecessary procedure invocation. One way to do this is to take advantage of procedures that accomplish multiple tasks with one invocation.

Several procedures enable you to create multiple reports by invoking the procedure only once. These include

the SQL procedure
the DATASETS procedure
the FREQ procedure
the TABULATE procedure.

Note

BY-group processing can also minimize unnecessary invocations of procedures. Δ

To illustrate this principle, examine features of the DATASETS procedure.

Executing the DATASETS Procedure

The DATASETS procedure can use RUN-group processing to process multiple sets of statements. RUN-group processing enables you to submit groups of statements without ending the procedure.

When the DATASETS procedure executes,

SAS reads the program statements that are associated with one task until it reaches a RUN statement or an implied RUN statement.
SAS executes all of the preceding statements immediately, and then continues reading until it reaches another RUN statement or an implied RUN statement.

To execute the last task, you must use a RUN statement or a QUIT statement.

proc datasets lib=company;
   modify orders;
      rename quantity=Units_Ordered;
      format costprice_per_unit dollar13.2;
      label delivery_date='Date of Delivery';
   run;
   modify customers;
      format customer_birthdate mmddyy10.
   run;
quit;

You can terminate the PROC DATASETS execution by submitting

a DATA statement
a PROC statement
a QUIT statement.

RUN-Group Processing

If you can take advantage of RUN-group processing, you can avoid unnecessary procedure invocation. For best programming practices, you need to understand how RUN-group processing affects the execution of SAS statements. The procedures that support RUN-group processing include

CHART, GCHART
PLOT, GPLOT
GIS, GMAP
GLM
REG
DATASETS.

Using Different Types of RUN Groups with PROC DATASETS

To illustrate how RUN-group processing works, this discussion focuses on the DATASETS procedure. The comparative example that follows includes programs that use PROC DATASETS to modify the descriptor portion of data sets. Before you examine the code to consider efficient programming techniques, review how the principles associated with RUN-group processing apply to PROC DATASETS.

The DATASETS procedure supports four types of RUN groups. Each RUN group is defined by the statements that compose it and by what causes it to execute.

Some statements in PROC DATASETS act as implied RUN statements because they cause the RUN group that precedes them to execute.

The following list identifies which statements compose a RUN group and what causes each RUN group to execute:

The PROC DATASETS statement always executes immediately. No other statement is necessary to cause the PROC DATASETS statement to execute. Therefore, the PROC DATASETS statement alone is a RUN group.
The MODIFY statement and any of its subordinate statements form a RUN group. These RUN groups always execute immediately. No other statement is necessary to cause a MODIFY RUN group to execute.
The APPEND, CONTENTS, and COPY statements (including EXCLUDE and SELECT, if present) form their own separate RUN groups. Every APPEND statement forms a single-statement RUN group; every CONTENTS statement forms a single-statement RUN group; and every COPY step forms a RUN group. Any other statement in the procedure, except those that are subordinate to either the COPY or MODIFY statement, causes the RUN group to execute.

Additionally, one or more of the following statements form a RUN group:

AGE
EXCHANGE
CHANGE
REPAIR.

If any of these statements appear in sequence in the PROC step, the sequence forms a RUN group. For example, if a REPAIR statement appears immediately after a SAVE statement, the REPAIR statement does not force the SAVE statement to execute; it becomes part of the same RUN group. To execute the RUN group, submit one of the following statements:

PROC DATASETS
MODIFY
APPEND
QUIT
CONTENTS
RUN
COPY
another DATA or PROC step.

Comparative Example: Modifying the Descriptor Portion of SAS Data Sets

Suppose you want to use the DATASETS procedure to modify the data sets NewCustomer, NewOrders, and NewItems.

Techniques for accomplishing this task include using

Multiple DATASETS Procedures
A Single DATASETS Procedure.

Programming Techniques

General Recommendations

Invoke the DATASETS procedure once and process all the changes for a library in one step to save CPU and I/O resources—at the cost of memory resources.

Note

Because the specified library could change between invocations of the DATASETS procedure, the procedure is reloaded into memory for each invocation. Δ

Summary

Executing Only Necessary Statements

You minimize the CPU time that SAS uses when you execute the minimum number of statements in the most efficient order.

For a more efficient program, place the subsetting IF statement as soon as logically possible in a DATA step when creating a subset of data.

Review guidelines for using conditional logic efficiently with IF-THEN/ELSE statements or SELECT statements. Remember to minimize the number of statements that use SAS functions or arithmetic operators.

Conditionally execute only necessary statements by placing statements in groups that are associated with IF-THEN/ELSE statements or SELECT/WHEN statements. Groups of statements execute only when a particular condition is true. Review the criteria for using DO groups efficiently.

Review the related comparative examples:

"Comparative Example: Creating a Subset of Data" on page 742
"Comparative Example: Creating Variables Conditionally Using DO Groups" on page 745
"Comparative Example: Creating Variables Conditionally When Calling Functions" on page 747
"Comparative Example: Creating Data in DO Groups" on page 752.

Eliminating Unnecessary Passes Through the Data

You should avoid reading or writing data more than necessary in order to minimize I/O operations.

There are a variety of techniques that you can use. For example, use a single DATA step to create multiple output data sets from one pass of the input data, rather than using multiple DATA steps to process the input data each time that you create an output data set. Create sorted subsets by subsetting data with the SORT procedure rather than subsetting data in a DATA step and then sorting. Change variable attributes by using PROC DATASETS rather than a DATA step.

Review the related comparative examples:

"Comparative Example: Creating Multiple Subsets of a SAS Data Set" on page 758
"Comparative Example: Creating a Sorted Subset of a SAS Data Set" on page 759
"Comparative Example: Changing the Variable Attributes of a SAS Data Set" on page 760.

Reading and Writing Only Essential Data

If you process fewer observations and variables, SAS performs fewer I/O operations. To limit the number of observations that are processed, you can use the subsetting IF statement and the WHERE statement. Best programming practices can be applied if you understand other differences between subsetting IF and WHERE statements. You can also improve performance by applying OBS= and FIRSTOBS= processing with a WHERE statement.

To select observations when reading data from external files, position a subsetting IF statement in a DATA step so that it reads only the variables that are needed to select the subsetbefore reading all the data. This can reduce the overhead required to process data.

To limit the number of variables that are processed, you can use

the DROP and KEEP statements
the DROP= and KEEP= data set options.

In the SET statement, the DROP= or KEEP= data set option controls which variables are read and subsequently processed. In the DATA statement, the DROP= or KEEP= data set option controls which variables are written to a data set after processing. Using the SET statement with these options is the most efficient and best practice.

Review the related comparative examples:

"Comparative Example: Creating a Subset of a SAS Data Set" on page 763
"Comparative Example: Creating a Subset of Data by Reading Data from an External File" on page 766
"Comparative Example: Creating a Report That Contains Average and Median Statistics" on page 770
"Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables" on page 771.

Storing SAS Data in SAS Data Sets

When you use SAS to repeatedly analyze or manipulate any particular group of data, create a SAS data set instead of reading the raw data each time.

Reading data from an external file versus reading from a SAS data set greatly increases CPU usage.

Review the related comparative example:

"Comparative Example: Creating a SAS Data Set or Reading from an External File" on page 774.

Avoiding Unnecessary Procedure Invocation

Invoking procedures once rather than multiple times can be the most efficient way to process data. Several procedures enable you to create multiple reports by invoking the procedure only once.

Using a single DATASETS procedure instead of multiple DATASETS procedures to modify the descriptor portion of a data set results in a noticeable savings in both CPU and I/O operations. Also, you can take advantage of RUN-group processing to submit groups of statements without ending the procedure.

Review the related comparative example:

"Comparative Example: Modifying the Descriptor Portion of SAS Data Sets" on page 777.

Quiz

Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.

Placing the subsetting IF statement at the top rather than near the bottom of a DATA step results in a savings in CPU usage. What happens if the subset is large rather than small?
1. The savings in CPU usage increases as the subset grows larger because the I/O increases.
2. The savings in CPU usage decreases as the subset grows larger. However, placing the subsetting IF statement at the top of a DATA step always uses fewer resources than placing it at the bottom.
3. The savings in CPU usage remains constant as the subset grows larger. However, placing the subsetting IF statement near the bottom of a data set is preferable.
4. The savings in CPU usage decreases as the subset grows larger. However, placing the subsetting IF statement near the bottom of a data set increases the I/O.
Which of the following statements is true about techniques that are used for modifying data and attributes?
1. You can use PROC DATASETS to modify both data values and variable attributes.
2. You can use PROC DATASETS to modify only data values.
3. You can use the DATA step to modify both data values and variable attributes.
4. You can use the DATA step to modify only variable attributes.
For selecting observations, is a subsetting IF statement or a WHERE statement more efficient? Why?
1. A subsetting IF statement is more efficient because it loads all observations sequentially into the program data vector.
2. A subsetting IF statement is more efficient because it examines what is in the input buffer and selects observations before they are loaded into the program data vector, which results in a savings in CPU operations.
3. A WHERE statement is more efficient because it loads all observations sequentially into the program data vector.
4. A WHERE statement is more efficient because it examines what is in the input buffer and selects observations before they are loaded into the program data vector, which results in a savings in CPU operations.
When is it more advantageous to create a temporary SAS data set rather than a permanent SAS data set?
1. When the external file on which the data set is based might change between SAS sessions.
2. When the external file on which the data set is based does not change between SAS sessions.
3. When the data set is needed for more than one SAS session.
4. When you are converting raw numeric values to SAS data values.
When you compare the technique of using multiple DATASETS procedures to using a single DATASETS procedure to modify the descriptor portion of a data set, which is true?
1. A one-step DATASETS procedure results in an increase in I/O operations.
2. Multiple DATASETS procedures result in a decrease in I/O operations.
3. A one-step DATASETS procedure results in a decrease in CPU usage.
4. Multiple DATASETS procedures result in a decrease in CPU usage.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 22. Using Best Practices

Create new playlist

Sign In

Sign Up

Chapter 22. Using Best Practices

Overview

Introduction

Note

Objectives

Prerequisites

Executing Only Necessary Statements

Positioning of the Subsetting IF Statement

Comparative Example: Creating a Subset of Data

Programming Techniques

General Recommendations

Using Conditional Logic Efficiently

Note

Comparative Example: Creating Variables Conditionally Using DO Groups

Programming Techniques

General Recommendations

Comparative Example: Creating Variables Conditionally When Calling Functions

Programming Techniques

General Recommendations

Using DO Groups Efficiently

Note

Comparative Example: Creating Data in DO Groups

Programming Techniques

General Recommendations

Eliminating Unnecessary Passes through the Data

Using a Single DATA or PROC Step to Enhance Efficiency

Using a Single DATA Step to Create Multiple Output Data Sets

Using the SORT Procedure with a WHERE Statement to Create Sorted Subsets

Using the DATASETS Procedure to Modify Variable Attributes

Note

Comparative Example: Creating Multiple Subsets of a SAS Data Set

Programming Techniques

General Recommendations

Comparative Example: Creating a Sorted Subset of a SAS Data Set

Programming Techniques

General Recommendations

Comparative Example: Changing the Variable Attributes of a SAS Data Set

Programming Techniques

General Recommendations

Reading and Writing Only Essential Data

Selecting Observations Using Subsetting IF versus WHERE Statement

Comparative Example: Creating a Subset of a SAS Data Set

Programming Techniques

General Recommendations

Other Differences between the IF and WHERE Statements

Note

Using the WHERE Statement with the OBS= and FIRSTOBS= Options

Selecting Observations When Reading Data from External Files

Note

Comparative Example: Creating a Subset of Data by Reading Data from an External File

Programming Techniques

General Recommendations

Subsetting Variables with the KEEP= and DROP= Statements and Options

Note

Comparative Example: Creating a Report That Contains Average and Median Statistics

Programming Techniques

General Recommendations

Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables

Programming Techniques

General Recommendations

Note

Storing Data in SAS Data Sets

Note

Comparative Example: Creating a SAS Data Set or Reading from an External File

Programming Techniques

General Recommendations

Avoiding Unnecessary Procedure Invocation

Note

Executing the DATASETS Procedure

RUN-Group Processing

Using Different Types of RUN Groups with PROC DATASETS

Comparative Example: Modifying the Descriptor Portion of SAS Data Sets

Programming Techniques

General Recommendations

Note

Summary

Table of Contents for
22. Using Best Practices