Chapter 22. Using Best Practices

Overview

Introduction

This chapter demonstrates different ways of using best SAS programming practices to optimize performance. As you compare the techniques described in this chapter, remember that differences in the use of resources are affected by which operating environment you work in and by the characteristics of your data.

This chapter is organized by topics that emphasize the following basic principles:

  • Execute only necessary statements.

  • Eliminate unnecessary passes of the data.

  • Read and write only the data that you require.

  • Store data in SAS data sets.

  • Avoid unnecessary procedure invocation.

Each topic includes comparative examples that can improve the efficiency of your programs. Write programs to generate your own benchmarks, and adopt the programming techniques that produce the most savings for you.

Note

This chapter does not cover the SAS Scalable Performance Data Engine (SAS SPD Engine), which is a SAS 9.1 technology for threaded processing. For details about using the SAS SPD Engine to improve performance, see the SAS documentation. Δ

Objectives

In this chapter, you learn too efficiently

  • subset observations

  • create new variables

  • process and output data conditionally

  • create multiple output data sets and sorted subsets

  • modify variable attributes

  • select observations from SAS data sets and external files

  • subset variables

  • read data from SAS data sets

  • invoke SAS procedures.

Prerequisites

Before beginning this chapter, you should complete the following chapters:

  • Part 1: SQL Processing with SAS

    • Chapter 1, "Performing Queries Using PROC SQL," on page 3

    • Chapter 2, "Performing Advanced Queries Using PROC SQL," on page 25

    • Chapter 3, "Combining Tables Horizontally Using PROC SQL," on page 79

    • Chapter 4, "Combining Tables Vertically Using PROC SQL," on page 123

    • Chapter 5, "Creating and Managing Tables Using PROC SQL," on page 159

    • Chapter 6, "Creating and Managing Indexes Using PROC SQL," on page 221

    • Chapter 7, "Creating and Managing Views Using PROC SQL," on page 243

    • Chapter 8, "Managing Processing Using PROC SQL," on page 261.

  • Part 3: Advanced SAS Programming Techniques

    • Chapter 13, "Creating Samples and Indexes," on page 449

    • Chapter 14, "Combining Data Vertically," on page 479

    • Chapter 15, "Combining Data Horizontally," on page 511

    • Chapter 16, "Using Lookup Tables to Match Data," on page 557

    • Chapter 17, "Formatting Data," on page 601

    • Chapter 18, "Modifying SAS Data Sets and Tracking Changes," on page 631.

  • Part 4: Optimizing SAS Programs

    • Chapter 19, "Introduction to Efficient SAS Programming," on page 677

    • Chapter 20, "Controlling Memory Usage," on page 687

    • Chapter 21, "Controlling Data Storage Space," on page 705.

Executing Only Necessary Statements

Best practices specify that you should write programs that cause SAS to execute only necessary statements. When you execute the minimum number of statements in the most efficient order, you minimize the hardware resources that SAS uses. The resources that are affected include disk usage, memory usage, and CPU usage.

Techniques for executing only the statements you need include

  • subsetting your data as soon as is logically possible

  • processing your data conditionally by using the most appropriate syntax for your data.

Positioning of the Subsetting IF Statement

To subset your data based on a newly derived or computed variable, you must use the subsetting IF statement in a DATA step. You can subset output data to a SAS data set by processing only those observations that meet a specified condition.

Positioning of the Subsetting IF Statement

The subsetting IF statement causes the DATA step to continue processing only those raw data records or observations that meet the condition of the expression specified in the IF statement. The resulting SAS data set or data sets contain a subset of the original external file or SAS data set.

Position the subsetting IF statement in the program so that it checks the subsetting condition as soon as it is logically possible, and so that unnecessary statements do not execute. When a statement is false, no further statements are processed for that observation.

Also, remember to subset data before performing calculations and to minimize the use of function calls or arithmetic operators. Unnecessary processing of unwanted observations results in higher expenditure of hardware resources.

Comparative Example: Creating a Subset of Data

Suppose you want to create a subset of data, calculate six new variables, and conditionally output data by reading from the SAS data set Retail.Order_fact. The data set should contain new variables for

  • the month of the order

  • the elapsed time between the order date and the delivery date

  • the profit, based on the retail price, discount, and unit price

  • total profit

  • total discount

  • total wait time.

The subset of data that includes only orders for the month of December is approximately 9.66% of the data.

You can accomplish this task by using a subsetting IF statement. Placement of this statement in the DATA step can affect the efficiency of the DATA step in terms of CPU time and real time. Notice the comparison between these two approaches:

  1. A Subsetting IF Statement at the Bottom

  2. A Subsetting IF Statement near the Top.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.

Programming Techniques

General Recommendations

  • Position the subsetting IF statement in a DATA step as soon as is logically possible in order to save the most resources.

    In the last comparative example you saw how a subsetting IF statement can be positioned in the DATA step so that no further statements are processed for that observation. Next, look at how different programming techniques can be used to

  • create variables conditionally using DO groups

  • create variables conditionally when calling functions.

Before viewing the sample code for these two comparative examples, review guidelines for using these techniques.

Using Conditional Logic Efficiently

You can use conditional logic to change how SAS processes selected observations. Two techniques—IF-THEN/ELSE statements and SELECT statements—can be used interchangeably and perform comparably. Based on the characteristics of your data and depending on your environment, one of these techniques might give you better performance. Choose a technique that conserves your programming time and makes the program easiest to read.

Technique

Action

IF-THEN/ELSE statement

executes a SAS statement for observations that meet specific conditions.

SELECT statement

executes one of several statements or groups of statements.

Note

The number of conditions (values) tested and the type of variable tested affect CPU resources. For numeric variables, SELECT statements should always be slightly more efficient (use less CPU time) than IF-THEN/ELSE statements. The performance gap between IF-THEN/ELSE and SELECT statements gradually widens as the number of conditions increases. For character variables, IF-THEN/ELSE statements are always more efficient than SELECT statements. The performance gap widens quickly between the two techniques as the number of conditions increases. Δ

Use IF-THEN/ELSE statements when

  • the data values are character values

  • the data values are not uniformly distributed

  • there are few conditions to check.

For best practices, follow these guidelines for writing efficient IF/THEN logic:

  • For mutually exclusive conditions, use the ELSE IF statement rather than an IF statement for all conditions except the first.

  • Check the most frequently occurring condition first, and continue checking conditions in descending order of frequency.

  • When you execute multiple statements based on a condition, put the statements in a DO group.

Use SELECT statements when

  • you have a long series of mutually exclusive numeric conditions

  • data values are uniformly distributed.

Before writing conditional logic, determine the distribution of your data values. You can use the

  • FREQ procedure to examine the distribution of the data values

  • GCHART or GPLOT procedure to display the distribution graphically

  • UNIVARIATE procedure to examine distribution statistics and to display the information graphically.

Comparative Example: Creating Variables Conditionally Using DO Groups

Suppose you want to calculate an adjusted profit based on the values of the variable Order_Type in the data set Retail.Order_fact. For retail sales, which are represented by the value 1, the adjusted profit should be calculated as 105% of profit. For catalog sales, which are represented by the value 2, the adjusted profit should be calculated as 103% of profit. For Internet sales, which are represented by the value 3, the adjusted profit should be equal to profit.

The following table shows that the values for the variable Order_Type are not uniformly distributed.

Comparative Example: Creating Variables Conditionally Using DO Groups

The following table shows that the values for the variable Discount also are not uniformly distributed.

Comparative Example: Creating Variables Conditionally Using DO Groups

Techniques for creating new variables conditionally include

  1. IF-THEN/ELSE statements

  2. SELECT statements.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.

Programming Techniques

General Recommendations

  • Check the most frequently occurring condition first, and continue checking conditions in descending order of frequency, regardless of whether you use IF-THEN/ELSE or SELECT statements.

  • When you execute multiple statements based on a condition, put the statements in a DO group.

Comparative Example: Creating Variables Conditionally When Calling Functions

Suppose you want to create a report that includes a new variable that is based on the value of an existing variable in the SAS data set Retail.Order_fact. Values for the new Month variable are extracted from the existing variable Order_Date by using the MONTH function.

The following table shows that the values for Month are fairly evenly distributed.

Comparative Example: Creating Variables Conditionally When Calling Functions

Techniques for creating new variables conditionally include

  1. Parallel IF Statements

  2. ELSE IF Statements, Many Function References

  3. ELSE IF Statements, One Function Reference

  4. SELECT Group.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.

Programming Techniques

General Recommendations

  • Avoid using parallel IF statements, which use the most resources and are the least efficient way to conditionally execute statements.

  • Use IF-THEN/ELSE statements and SELECT blocks to be more efficient.

  • To significantly reduce the amount of resources used, write programs that call a function only once instead of repetitively using the same function in many statements. SAS functions are convenient, but they can be expensive in terms of CPU resources.

You have seen different ways of efficiently creating variables conditionally, which reinforce the principle of executing only necessary statements. The next comparative example in this topic addresses techniques that can be used to create data in DO groups. Before viewing this example, you might want to review the following information about which statements are needed in DO groups.

Using DO Groups Efficiently

You can conditionally execute only necessary statements by placing them in DO groups that are associated with IF-THEN/ELSE statements or with SELECT/WHEN statements. Groups of statements execute only when a particular condition is true. Remember to use the following criteria when choosing which technique is more efficient:

 

IF-THEN/ELSE Statements

SELECT/WHEN Statements

The number of conditions

few

many

The distribution of a variable's values

not uniform

uniform

The type of variable

character

numeric

When using a DO group with IF-THEN/ELSE statements, add DO after the THEN clause, and add an END statement after all of the statements that you want executed as a group.

data orders;
   set company.orders;
   if order_type = 1 then
      do;
      <multiple executable statements here>
      end;
   else if order_type = 2 then
      do;
      <multiple executable statements here>
       end;
   else  if order_type = 3 then
       do;
       <multiple executable statements here>
       end;
run;

Note

Use an IF-THEN DO group when you create multiple variables based on a condition. Δ

When using a DO group with SELECT/WHEN statements, add DO after the WHEN condition, and add an END statement after all of the statements that you want executed as a group. Use an OTHERWISE statement to specify the statements that you want executed if no WHEN condition is met.

data orders;
   set company.orders;
   select (order_type);
      when (1)
         do;
         <multiple executable statements here>
         end;
      when (2)
         do;
         <multiple executable statements here>
         end;
      when (3)
         do;
         <multiple executable statements here>
         end;
      otherwise;
   end;
run;

Remember that IF-THEN/ELSE and SELECT/WHEN logic require that there be no intervening statements between the IF and the ELSE conditions or between the SELECT and the WHEN conditions.

Comparative Example: Creating Data in DO Groups

Suppose you want to identify which customer groups are Club Members, Club Gold Members, or Internet/Catalog members, based on data from the data set Retail.Customer_hybrid. You also want to identify the nature of customer activity as "inactive", "low activity", "medium activity", or "high activity".

The following table shows the distribution of values for Customer_Type_ID.

Comparative Example: Creating Data in DO Groups

Techniques for creating new variables based on the values of specific variables include

  1. SELECT, IF/SELECT Statements

  2. Nested SELECT Statements

  3. Serial IF Statements

  4. IF-THEN/ELSE IF Statements with a Link.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for executing only necessary statements.

Programming Techniques

General Recommendations

  • Avoid serial IF statements because they use extra resources.

Eliminating Unnecessary Passes through the Data

Best practices specify that you should eliminate unnecessary passes through the data. To minimize I/O operations and CPU time, avoid reading or writing data more than necessary. Accomplish this by taking advantage of one-step processing, which can lead to efficiencies.

Using a Single DATA or PROC Step to Enhance Efficiency

Whenever possible, use a single DATA or PROC step to enhance efficiency. Techniques that minimize passes through the data include

  • using a single DATA step to create multiple output data sets

  • using the SORT procedure with a WHERE statement to create sorted subsets

  • using the DATASETS procedure to modify variable attributes.

Before viewing comparative examples that address these techniques, it might be helpful to review the following information about these practices.

Using a Single DATA Step to Create Multiple Output Data Sets

It is good programming practice to take advantage of the DATA step's ability to create multiple output data sets at the same time. This is more efficient than using a series of individual DATA steps. Using a single DATA step saves resources because input data is read only once.

The following program demonstrates how to use a single DATA step to read data once and create three subsets of data.

data Sales_managers Account_managers Finance_managers;
   set company.organization;
   if job_title='Sales Manager' then
              output Sales_managers;
   else if job_title='Account Manager' then
              output Account_managers;
   else if job_title='Finance Manager' then
              output Finance_managers;
run;

Using the SORT Procedure with a WHERE Statement to Create Sorted Subsets

It is good programming practice to take advantage of the SORT procedure's ability to sort and subset in the same PROC step. This is more efficient than using two separate steps to accomplish this—a DATA step to subset followed by a procedure step that sorts.

The following program demonstrates how you can sort data and select only those observations that meet the conditions of the WHERE statement—in one step.

Processing a data set only once saves I/O and CPU resources.

proc sort data=company.organization
          out=company.managers;
   by job_title;
   where job_title in('Sales Manager',
                      'Account Manager',
                      'Finance Manager'),
run;

Using the DATASETS Procedure to Modify Variable Attributes

Use PROC DATASETS instead of a DATA step to modify data attributes. The DATASETS procedure uses fewer resources than the DATA step because it processes only the descriptor portion of the data set, not the data portion. PROC DATASETS retains the sort flag, as well as indexes.

proc datasets lib=company;
   modify organization;
   rename Job_title=Title;
quit;

Note

You cannot use the DATASETS procedure to modify the type, length, or position of variables because these attributes directly affect the data portion of the data set. To perform these operations, use the DATA step. Δ

Comparative Example: Creating Multiple Subsets of a SAS Data Set

Suppose you want to create five subsets of data from the data set Retail.Customer. You need a subset for each of five countries. Techniques for creating multiple subsets include writing

  1. Multiple DATA Steps

  2. A Single DATA Step.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating unnecessary passes through the data.

Programming Techniques

General Recommendations

  • When creating multiple subsets from a SAS data set, use a single DATA step with IF-THEN/ELSE IF logic to output to appropriate data sets.

Comparative Example: Creating a Sorted Subset of a SAS Data Set

Suppose you want to create a sorted subset of a SAS data set named Retail.Customer. You want only data for customers in the United States, France, Italy, Germany, and Spain.

Techniques for creating sorted subsets of SAS data sets include

  1. A DATA Step and PROC SORT

  2. PROC SORT with a WHERE Statement.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating unnecessary passes through the data.

Programming Techniques

General Recommendations

  • When you need to process a subset of data with a procedure, use a WHERE statement in the procedure instead of creating a subset of data and reading that data with the procedure.

  • Write one program step that both sorts and subsets. This approach can take less programmer time and debugging time than writing separate program steps that subset and sort.

Comparative Example: Changing the Variable Attributes of a SAS Data Set

Suppose you want to change the variable attributes in Retail.NewCustomer to make them consistent with those in the Retail.Customer data set. The data set Retail.NewCustomer contains 89954 observations and 12 variables.

The following table shows the variable names and formats in each SAS data set.

SAS Data Set

Variable Name

Variable Format

Retail.Customer

Country

Birth_Date

$COUNTRY.

DATE9.

Retail.NewCustomer

Country_ID

Birth_Date

$COUNTRY.

MMDDYYP10

Techniques for changing the variable attributes of a SAS data set include

  1. A DATA Step

  2. PROC DATASETS.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for eliminating unnecessary passes through the data.

Programming Techniques

General Recommendations

  • To save significant resources, use the DATASETS procedure instead of a DATA step to change the attributes of a SAS data set.

Reading and Writing Only Essential Data

Best practices specify that you should write programs that read and write only essential data. If you process fewer observations and variables, you conserve resources. This topic covers many different techniques that can improve performance when you

  • select observations from SAS data sets

  • select observations from external files

  • keep or drop variables when creating or reporting on SAS data sets.

Look at how two useful statements differ in how they select observations to subset.

Selecting Observations Using Subsetting IF versus WHERE Statement

You can use WHERE statements or subsetting IF statements to subset data. Although both statements test a condition to determine whether SAS should process an observation, the WHERE statement is more efficient.

The following graphic illustrates differences in how these statements process data.

Selecting Observations Using Subsetting IF versus WHERE Statement

I/O operations are measured as data moves between the disk that contains input SAS data and the buffer in memory, and when data moves from the output buffer to the disk that contains output data sets. Input data is not affected by the WHERE statement or subsetting IF statement. However, output data is affected by both.

CPU time is measured when data must be processed in the program data vector. CPU time can be saved if fewer observations are processed.

A WHERE statement and an IF statement make different use of the program data vector. The WHERE statement examines what is in the input page buffer and selects observations before they are loaded in the program data vector, which results in a savings in CPU operations. The subsetting IF statement loads all observations sequentially into the program data vector. If the statement finds a match and the statement is true, then the data is processed and is written to the output page buffer.

WHERE statements work on variables that are already in SAS data sets. IF statements can work on any variable in the program data vector, including new or old variables.

Comparative Example: Creating a Subset of a SAS Data Set

Suppose you want to create a subset of the data set Retail.Customer. You want to include data for only the United Kingdom. The subset contains approximately 5.56% of the Retail.Customer data.

Techniques for subsetting observations include

  1. Subsetting IF Statement

  2. WHERE Statement.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for reading and writing only essential data.

Programming Techniques

General Recommendations

  • To save CPU resources, use a WHERE statement instead of a subsetting IF statement to subset a SAS data set.

Other Differences between the IF and WHERE Statements

Review the following table to note other differences between the IF and WHERE statements.

 

The IF Statement

The WHERE Statement

Selecting Data

can select records from external files, observations from SAS data sets, observations created with an INPUT statement, or observations based on the value of a computed or derived variable.

can select only observations from SAS data sets.

Conditional Execution

can be executed conditionally as part of an IF statement.

cannot be executed conditionally as part of an IF statement.

Grouping Data Using a BY Statement

produces the same data set when a BY statement accompanies a SET, MERGE, or UPDATE statement.

produces a different data set when a BY statement accompanies a SET, MERGE, or UPDATE statement.

Merging Data

selects observations after combining current observations.

applies the selection criteria to each input data set before combining observations.

Note

If you use the WHERE= data set option and the WHERE statement in the same DATA step, SAS ignores the WHERE statement for input data sets. The WHERE= data set option and the WHERE statement call the same SAS routine. Δ

Using the WHERE Statement with the OBS= and FIRSTOBS= Options

Another way to read and write only essential data is to process a segment of subsetted data. You accomplish this specialized task by using a WHERE expression in conjunction with the OBS= and FIRSTOBS= data set options. This programming feature is available in SAS 8.1 and later.

In the following example, the WHERE expression selects observations before the OBS= and FIRSTOBS= options are applied. The values specified for OBS= and FIRSTOBS= are the logical observation numbers in the subset, not the physical observation numbers in the data set.

Using the WHERE Statement with the OBS= and FIRSTOBS= Options

FIRSTOBS = 5 is the fifth observation in the subset, whereas it was observation 101 in the data set Company.Staff.

OBS = 8 is the eighth observation in the subset, whereas it was observation 159 in the data set Company.Staff.

Now that you have seen techniques for efficiently subsetting observations that are read from SAS data sets, look at techniques for subsetting records that are read from external files.

Before viewing a comparative example that illustrates these techniques, it might be useful to review which resources are affected as SAS reads and processes data that is read from external files.

Selecting Observations When Reading Data from External Files

Positioning a subsetting IF statement in a DATA step so that it reads only the variables that are needed to select the subset—before reading all the data—can reduce the overhead required for processing data.

The following graphic illustrates how data is read from an external file, loaded into the input buffer, and read into the program data vector.

Selecting Observations When Reading Data from External Files

Remember that I/O operations are measured as data moves between disks and buffers—for both input and output data. All records are loaded into the input buffer before moving to the program data vector for processing, so I/O is not affected by the placement of a subsetting IF statement in the DATA step.

You can reduce the CPU resources that are required for processing data by limiting what is read into the program data vector. Position a subsetting IF statement after an INPUT statement that reads only the data that is required in order to check for specific conditions. Subsequent statements do not execute and do not process variable values for unwanted observations.

Note

Converting raw character fields to SAS character variables requires less CPU time than converting raw numeric fields to the real binary encoding of SAS numeric variables. Δ

Comparative Example: Creating a Subset of Data by Reading Data from an External File

Suppose you want to create a SAS data set by reading a subset of data from an external file that is referenced by the fileref Customerdata. You want the subset to contain only customers in the United Kingdom.

The subset is approximately 5.56% of the countries in the external file, which contains 89,954 records and 12 fields.

Techniques for doing this include

  1. Reading All Variables and Subsetting

  2. Reading Selected Variables and Subsetting.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for reading and writing only essential data.

Programming Techniques

General Recommendations

  • Position a subsetting IF statement in a DATA step so that only variables that are necessary to select the record are read before subsetting. This can result in significant savings in CPU time. There is no difference in I/O or memory usage between the two techniques.

  • When selecting rows of data from an external file, read the field(s) on which the selection is being made before reading all the fields into the program data vector.

  • Use the single trailing @ sign to hold the input buffer so that you can continue to read the record when the variable(s) satisfy the IF condition.

In addition to subsetting observations, you can subset variables by using statements or options that efficiently read and write only essential data.

Before viewing two comparative examples that illustrate how to best limit which variables are read and processed, review how these useful statements and options work.

Subsetting Variables with the KEEP= and DROP= Statements and Options

To subset variables, you can use

  • the DROP and KEEP statements

  • the DROP= and KEEP= data set options.

Subsetting Variables with the KEEP= and DROP= Statements and Options

Note

You cannot use the DROP and KEEP statements and the DROP= and KEEP= output data set options in the same step. Δ

Use of the KEEP= data set option and the DROP= data set option can affect resource usage, depending on whether they are used in a SET or MERGE statement or in a DATA statement.

The following figure shows how options in these statements process data.

Subsetting Variables with the KEEP= and DROP= Statements and Options

When used in the SET or MERGE statement, the KEEP= and DROP= data set options affect which variables are read into the program data vector. Reading only the variables that need to be processed in the DATA step can improve efficiency.

When used in the DATA statement, these same options put drop flags on variables to be excluded and affect which variables are written to the output data set.

The DROP and KEEP statements work just like the KEEP= or DROP= options in the DATA statement.

The following table describes differences in how the KEEP statement and the KEEP= data set option write variables to SAS data sets.

KEEP Statement

KEEP= Output Data Set Option

KEEP= Input Data Set Option

Causes a DATA step to write only the variables listed in the KEEP statement to one or more data sets.

Causes a DATA step to write only the variables listed in the KEEP= variable list to the output data set.

If there is no KEEP=/DROP= output data set option, enables processing of only the variables not listed in the DROP= variable list.

If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables.

If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables.

Enables processing of only the variables listed in the KEEP= variable list.

Applies to all data sets that are created within the same DATA step.

Can write different variables to different data sets.

If there is no KEEP=/DROP= output data set option or KEEP/DROP statement, enables processing of only the variables listed in the KEEP= variable list.

Available only in the DATA step.

Available in the DATA step or in a PROC step.

Available in the DATA step or in a PROC step.

The following table describes differences in how the DROP statement and the DROP= data set option write variables to SAS data sets.

DROP Statement

DROP= Output Data Set Option

DROP= Input Data Set Option

Causes a DATA step to write only the variables not listed in the DROP statement to one or more data sets.

Causes a DATA step to write only the variables not listed in the DROP= variable list to the output data set.

If there is no KEEP=/DROP= output data set option, enables processing of only the variables not listed in the DROP= variable list.

If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables.

If there is no KEEP=/DROP= input data set option, enables the DATA step to process all of the variables.

Enables processing of only the variables not listed in the DROP= variable list.

Applies to all data sets created within the same DATA step.

Can write different variables to different data sets.

If there is no KEEP=/DROP= output data set option or KEEP/DROP statement, enables processing of only the variables not listed in the DROP= variable list.

Available only in the DATA step.

Available in the DATA step or in a PROC step.

Available in the DATA step or in a PROC step.

Comparative Example: Creating a Report That Contains Average and Median Statistics

Suppose you want to create a report that contains the average and median values for the variable Profit, based on data that is read from the data set Retail.Order_fact. Depending on the number of variables eliminated, it might be more efficient to use the KEEP= option in a SET statement to limit which variables are read.

Techniques for reading and writing variables to a data set include

  1. Without the KEEP= Statement

  2. KEEP= in the DATA Statement

  3. KEEP= in the DATA and SET Statements

  4. KEEP= in the SET and MEANS Statements.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for reading and writing only essential data.

Programming Techniques

General Recommendations

  • To reduce both CPU time and I/O operations, avoid reading and writing variables that are not needed.

Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables

Suppose you want to read data from an external file that is referenced by the fileref Rawdata and to create a SAS data set that contains only the variables Customer_ID, Country, Gender, and Customer_Name.

Techniques for accomplishing this task include

  1. Reading All Fields

  2. Reading Selected Fields.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for reading and writing only essential data.

Programming Techniques

General Recommendations

  • When possible, read only the fields you need from an external data file to save CPU and real-time resources.

  • To save CPU resources, avoid converting numerics that you do not need in further processing.

Note

Remember that numeric data is moved into the program data vector after being converted to real binary, floating point numbers; multiple digits are stored in one byte. Character data is moved into the program data vector with no conversion; one character is stored in one byte. Δ

Storing Data in SAS Data Sets

In many cases, it is best practice for you to store data in SAS data sets. You can optimize performance if you know when you should create a SAS data set and when you should read data directly from an external file.

Before viewing the comparative example that illustrates different techniques for reading from a SAS data set versus from an external file, consider the following advantages of storing data in SAS data sets.

When you use SAS to repeatedly analyze or manipulate any particular group of data, it is more efficient to create a SAS data set than to read the raw data each time. Although SAS data sets can be larger than external files and can require more disk space, reading from SAS data sets saves CPU time that is associated with reading a raw data file.

Storing Data in SAS Data Sets

Other reasons for storing data in SAS data sets, rather than external files, include:

  • When the data is already in a SAS data set, you can use a SAS procedure, function, or routine on the data without further conversion.

  • SAS data sets are self-documenting.

The descriptor portion of a SAS data set documents information about the data set such as

  • data set labels

  • variable labels

  • variable formats

  • informats

  • descriptive variable names.

Note

Create a temporary SAS data set if the data set is used for intermediate tasks such as merging and if it is needed in that SAS session only. Create a temporary SAS data set when the external file on which the data set is based might change between SAS sessions. Δ

Comparative Example: Creating a SAS Data Set or Reading from an External File

Suppose you want to create a SAS data set that contains a large number of variables. One way to accomplish this task is to read from raw data fields in an external file that is referenced by the fileref Rawdata. Another way to accomplish this is to read the same data values from an existing SAS data set named Retail.Customer.

Techniques for accomplishing this task include

  1. Reading from an External File

  2. Reading from a SAS Data Set.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for storing data in SAS data sets.

Programming Techniques

General Recommendations

  • To save CPU resources, you can read variables from a SAS data set instead of reading data from an external file.

  • To reduce I/O operations, you can read variables from a SAS data set instead of reading data from an external file. However, savings in I/O operations are largely dependent on the block size of the external data file and on the page size of the SAS data set.

Avoiding Unnecessary Procedure Invocation

Best practices specify that you avoid unnecessary procedure invocation. One way to do this is to take advantage of procedures that accomplish multiple tasks with one invocation.

Several procedures enable you to create multiple reports by invoking the procedure only once. These include

  • the SQL procedure

  • the DATASETS procedure

  • the FREQ procedure

  • the TABULATE procedure.

Note

BY-group processing can also minimize unnecessary invocations of procedures. Δ

To illustrate this principle, examine features of the DATASETS procedure.

Executing the DATASETS Procedure

The DATASETS procedure can use RUN-group processing to process multiple sets of statements. RUN-group processing enables you to submit groups of statements without ending the procedure.

When the DATASETS procedure executes,

  • SAS reads the program statements that are associated with one task until it reaches a RUN statement or an implied RUN statement.

  • SAS executes all of the preceding statements immediately, and then continues reading until it reaches another RUN statement or an implied RUN statement.

To execute the last task, you must use a RUN statement or a QUIT statement.

proc datasets lib=company;
   modify orders;
      rename quantity=Units_Ordered;
      format costprice_per_unit dollar13.2;
      label delivery_date='Date of Delivery';
   run;
   modify customers;
      format customer_birthdate mmddyy10.
   run;
quit;

You can terminate the PROC DATASETS execution by submitting

  • a DATA statement

  • a PROC statement

  • a QUIT statement.

RUN-Group Processing

If you can take advantage of RUN-group processing, you can avoid unnecessary procedure invocation. For best programming practices, you need to understand how RUN-group processing affects the execution of SAS statements. The procedures that support RUN-group processing include

  • CHART, GCHART

  • PLOT, GPLOT

  • GIS, GMAP

  • GLM

  • REG

  • DATASETS.

Using Different Types of RUN Groups with PROC DATASETS

To illustrate how RUN-group processing works, this discussion focuses on the DATASETS procedure. The comparative example that follows includes programs that use PROC DATASETS to modify the descriptor portion of data sets. Before you examine the code to consider efficient programming techniques, review how the principles associated with RUN-group processing apply to PROC DATASETS.

The DATASETS procedure supports four types of RUN groups. Each RUN group is defined by the statements that compose it and by what causes it to execute.

Some statements in PROC DATASETS act as implied RUN statements because they cause the RUN group that precedes them to execute.

The following list identifies which statements compose a RUN group and what causes each RUN group to execute:

  • The PROC DATASETS statement always executes immediately. No other statement is necessary to cause the PROC DATASETS statement to execute. Therefore, the PROC DATASETS statement alone is a RUN group.

  • The MODIFY statement and any of its subordinate statements form a RUN group. These RUN groups always execute immediately. No other statement is necessary to cause a MODIFY RUN group to execute.

  • The APPEND, CONTENTS, and COPY statements (including EXCLUDE and SELECT, if present) form their own separate RUN groups. Every APPEND statement forms a single-statement RUN group; every CONTENTS statement forms a single-statement RUN group; and every COPY step forms a RUN group. Any other statement in the procedure, except those that are subordinate to either the COPY or MODIFY statement, causes the RUN group to execute.

Additionally, one or more of the following statements form a RUN group:

  • AGE

  • EXCHANGE

  • CHANGE

  • REPAIR.

If any of these statements appear in sequence in the PROC step, the sequence forms a RUN group. For example, if a REPAIR statement appears immediately after a SAVE statement, the REPAIR statement does not force the SAVE statement to execute; it becomes part of the same RUN group. To execute the RUN group, submit one of the following statements:

  • PROC DATASETS

  • MODIFY

  • APPEND

  • QUIT

  • CONTENTS

  • RUN

  • COPY

  • another DATA or PROC step.

Comparative Example: Modifying the Descriptor Portion of SAS Data Sets

Suppose you want to use the DATASETS procedure to modify the data sets NewCustomer, NewOrders, and NewItems.

Techniques for accomplishing this task include using

  1. Multiple DATASETS Procedures

  2. A Single DATASETS Procedure.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for avoiding unnecessary procedure invocation.

Programming Techniques

General Recommendations

  • Invoke the DATASETS procedure once and process all the changes for a library in one step to save CPU and I/O resources—at the cost of memory resources.

Note

Because the specified library could change between invocations of the DATASETS procedure, the procedure is reloaded into memory for each invocation. Δ

Summary

Executing Only Necessary Statements

You minimize the CPU time that SAS uses when you execute the minimum number of statements in the most efficient order.

For a more efficient program, place the subsetting IF statement as soon as logically possible in a DATA step when creating a subset of data.

Review guidelines for using conditional logic efficiently with IF-THEN/ELSE statements or SELECT statements. Remember to minimize the number of statements that use SAS functions or arithmetic operators.

Conditionally execute only necessary statements by placing statements in groups that are associated with IF-THEN/ELSE statements or SELECT/WHEN statements. Groups of statements execute only when a particular condition is true. Review the criteria for using DO groups efficiently.

Review the related comparative examples:

  • "Comparative Example: Creating a Subset of Data" on page 742

  • "Comparative Example: Creating Variables Conditionally Using DO Groups" on page 745

  • "Comparative Example: Creating Variables Conditionally When Calling Functions" on page 747

  • "Comparative Example: Creating Data in DO Groups" on page 752.

Eliminating Unnecessary Passes Through the Data

You should avoid reading or writing data more than necessary in order to minimize I/O operations.

There are a variety of techniques that you can use. For example, use a single DATA step to create multiple output data sets from one pass of the input data, rather than using multiple DATA steps to process the input data each time that you create an output data set. Create sorted subsets by subsetting data with the SORT procedure rather than subsetting data in a DATA step and then sorting. Change variable attributes by using PROC DATASETS rather than a DATA step.

Review the related comparative examples:

  • "Comparative Example: Creating Multiple Subsets of a SAS Data Set" on page 758

  • "Comparative Example: Creating a Sorted Subset of a SAS Data Set" on page 759

  • "Comparative Example: Changing the Variable Attributes of a SAS Data Set" on page 760.

Reading and Writing Only Essential Data

If you process fewer observations and variables, SAS performs fewer I/O operations. To limit the number of observations that are processed, you can use the subsetting IF statement and the WHERE statement. Best programming practices can be applied if you understand other differences between subsetting IF and WHERE statements. You can also improve performance by applying OBS= and FIRSTOBS= processing with a WHERE statement.

To select observations when reading data from external files, position a subsetting IF statement in a DATA step so that it reads only the variables that are needed to select the subsetbefore reading all the data. This can reduce the overhead required to process data.

To limit the number of variables that are processed, you can use

  • the DROP and KEEP statements

  • the DROP= and KEEP= data set options.

In the SET statement, the DROP= or KEEP= data set option controls which variables are read and subsequently processed. In the DATA statement, the DROP= or KEEP= data set option controls which variables are written to a data set after processing. Using the SET statement with these options is the most efficient and best practice.

Review the related comparative examples:

  • "Comparative Example: Creating a Subset of a SAS Data Set" on page 763

  • "Comparative Example: Creating a Subset of Data by Reading Data from an External File" on page 766

  • "Comparative Example: Creating a Report That Contains Average and Median Statistics" on page 770

  • "Comparative Example: Creating a SAS Data Set That Contains Only Certain Variables" on page 771.

Storing SAS Data in SAS Data Sets

When you use SAS to repeatedly analyze or manipulate any particular group of data, create a SAS data set instead of reading the raw data each time.

Reading data from an external file versus reading from a SAS data set greatly increases CPU usage.

Review the related comparative example:

  • "Comparative Example: Creating a SAS Data Set or Reading from an External File" on page 774.

Avoiding Unnecessary Procedure Invocation

Invoking procedures once rather than multiple times can be the most efficient way to process data. Several procedures enable you to create multiple reports by invoking the procedure only once.

Using a single DATASETS procedure instead of multiple DATASETS procedures to modify the descriptor portion of a data set results in a noticeable savings in both CPU and I/O operations. Also, you can take advantage of RUN-group processing to submit groups of statements without ending the procedure.

Review the related comparative example:

  • "Comparative Example: Modifying the Descriptor Portion of SAS Data Sets" on page 777.

Quiz

Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.

  1. Placing the subsetting IF statement at the top rather than near the bottom of a DATA step results in a savings in CPU usage. What happens if the subset is large rather than small?

    1. The savings in CPU usage increases as the subset grows larger because the I/O increases.

    2. The savings in CPU usage decreases as the subset grows larger. However, placing the subsetting IF statement at the top of a DATA step always uses fewer resources than placing it at the bottom.

    3. The savings in CPU usage remains constant as the subset grows larger. However, placing the subsetting IF statement near the bottom of a data set is preferable.

    4. The savings in CPU usage decreases as the subset grows larger. However, placing the subsetting IF statement near the bottom of a data set increases the I/O.

  2. Which of the following statements is true about techniques that are used for modifying data and attributes?

    1. You can use PROC DATASETS to modify both data values and variable attributes.

    2. You can use PROC DATASETS to modify only data values.

    3. You can use the DATA step to modify both data values and variable attributes.

    4. You can use the DATA step to modify only variable attributes.

  3. For selecting observations, is a subsetting IF statement or a WHERE statement more efficient? Why?

    1. A subsetting IF statement is more efficient because it loads all observations sequentially into the program data vector.

    2. A subsetting IF statement is more efficient because it examines what is in the input buffer and selects observations before they are loaded into the program data vector, which results in a savings in CPU operations.

    3. A WHERE statement is more efficient because it loads all observations sequentially into the program data vector.

    4. A WHERE statement is more efficient because it examines what is in the input buffer and selects observations before they are loaded into the program data vector, which results in a savings in CPU operations.

  4. When is it more advantageous to create a temporary SAS data set rather than a permanent SAS data set?

    1. When the external file on which the data set is based might change between SAS sessions.

    2. When the external file on which the data set is based does not change between SAS sessions.

    3. When the data set is needed for more than one SAS session.

    4. When you are converting raw numeric values to SAS data values.

  5. When you compare the technique of using multiple DATASETS procedures to using a single DATASETS procedure to modify the descriptor portion of a data set, which is true?

    1. A one-step DATASETS procedure results in an increase in I/O operations.

    2. Multiple DATASETS procedures result in a decrease in I/O operations.

    3. A one-step DATASETS procedure results in a decrease in CPU usage.

    4. Multiple DATASETS procedures result in a decrease in CPU usage.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset