Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 23. Selecting Efficient Sorting Strategies

Overview 784
- Introduction 784
- Objectives 785
- Prerequisites 785
Avoiding Unnecessary Sorts 785
- Using BY-Group Processing with an Index 786
- Comparative Example: Using BY-Group Processing with an Index to Avoid a Sort 786
  - Programming Techniques 787
  - General Recommendations 787
- Using the NOTSORTED Option 787
- Example 788
- Using FIRST. and LAST. 790
- Example 790
- Using the GROUPFORMAT Option 791
- Example 792
- Using the CLASS Statement 792
- Example 793
- Comparative Example: Using a BY or CLASS Statement to Avoid a Sort 795
  - Programming Techniques 795
  - General Recommendations 796
- Using the SORTEDBY= Data Set Option 796
- Example 798
Using a Threaded Sort 798
- Using the CPUCOUNT= System Option 799
Calculating and Allocating Sort Resources 800
- Sort Space Requirements 800
- Example 800
- Using the SORTSIZE= Option 802
Handling Large Data Sets 803
- Dividing a Large Data Set 803
- Comparative Example: Dividing and Sorting a Large Data Set 1 803
  - Programming Techniques 804
  - General Recommendations 808
- Comparative Example: Dividing and Sorting a Large Data Set 2 808
  - Programming Techniques 809
  - General Recommendations 813
- Using the TAGSORT Option 813
- Example 814
Removing Duplicate Observations Efficiently 814
- Using the NODUPKEY Option 815
- Example 815
- Using the NODUPRECS Option 816
- Example 817
- Using the EQUALS | NOEQUALS Option 819
- Example 819
- Using FIRST. LAST. Processing in the DATA Step 820
- Comparative Example: Removing Duplicate Observations Efficiently 821
  - Programming Techniques 821
  - Comparing Techniques to Eliminate Duplicate Data 822
  - General Recommendations 823
Additional Features 823
- Selecting a Host Sort Utility 823
- Using the SORTPGM= System Option 824
- Using the SORTCUTP= System Option 824
- Using the SORTCUT= System Option 825
- Using the SORTNAME= System Option 825
- Example 826
Summary 826
- - Avoiding Unnecessary Sorts 826
  - Using a Threaded Sort 827
  - Calculating and Allocating Sort Resources 827
  - Handling Large Data Sets 827
  - Removing Duplicate Observations Efficiently 828
  - Additional Features 828
Quiz 828

Overview

Introduction

Sometimes you need to group observations by the values of a particular variable or order the observations in a particular way, such as alphabetically, in order to

reorder the data for reporting
reduce data retrieval time
enable BY-group processing in both DATA and PROC steps.

The SORT procedure is one technique that can be used to group or order data. However, the SORT procedure can use a high volume of resources. When an uncompressed data file is sorted using the SORT procedure, SAS requires enough space in the data library for two copies of the data file, plus a workspace that is approximately two to four times the size of the data file.

In some cases, you might be able to use techniques other than the SORT procedure to group or order observations. In other cases, you might be able to use options or techniques with the SORT procedure that enable you to minimize resource usage.

Note

This chapter does not cover the SAS Scalable Performance Data Engine (SAS SPD Engine), which is a SAS 9.1 technology for threaded processing. For details about using the SAS SPD Engine to improve performance, see the SAS documentation. Δ

Objectives

In this chapter, you learn to

apply techniques that enable you to avoid unnecessary sorts
calculate and allocate sort resources
use strategies for sorting large data sets
eliminate duplicate observations efficiently.

Prerequisites

Before beginning this chapter, you should complete the following chapters:

Part 1: SQL Processing with SAS

Chapter 1, "Performing Queries Using PROC SQL," on page 3
Chapter 2, "Performing Advanced Queries Using PROC SQL," on page 25
Chapter 3, "Combining Tables Horizontally Using PROC SQL," on page 79
Chapter 4, "Combining Tables Vertically Using PROC SQL," on page 123
Chapter 5, "Creating and Managing Tables Using PROC SQL," on page 159
Chapter 6, "Creating and Managing Indexes Using PROC SQL," on page 221
Chapter 7, "Creating and Managing Views Using PROC SQL," on page 243
Chapter 8, "Managing Processing Using PROC SQL," on page 261.

Part 3: Advanced SAS Programming Techniques

Chapter 13, "Creating Samples and Indexes," on page 449
Chapter 14, "Combining Data Vertically," on page 479
Chapter 15, "Combining Data Horizontally," on page 511
Chapter 16, "Using Lookup Tables to Match Data," on page 557
Chapter 17, "Formatting Data," on page 601
Chapter 18, "Modifying SAS Data Sets and Tracking Changes," on page 631.

Part 4: Optimizing SAS Programs

Chapter 19, "Introduction to Efficient SAS Programming," on page 677
Chapter 20, "Controlling Memory Usage," on page 687
Chapter 21, "Controlling Data Storage Space," on page 705
Chapter 22, "Using Best Practices," on page 739.

Avoiding Unnecessary Sorts

In some cases you can avoid a sort by using

BY-group processing with an index
BY-group processing with the NOTSORTED option
a CLASS statement
the SORTEDBY= data set option.

Using BY-Group Processing with an Index

BY-group processing is a method of processing observations from one or more SAS data sets that are grouped or ordered by the values of one or more common variables. You can use BY-group processing in both DATA steps and PROC steps.

The most common use of BY-group processing in the DATA step is to combine two or more SAS data sets by using the BY statement with a SET, MERGE, UPDATE, or MODIFY statement. When you use a SET, MERGE, or UPDATE statement, the data sets must first be ordered on the values of the BY variable unless you index the data sets. You can also use the NOTSORTED option in the BY statement with a SET statement.

When BY-group processing is used with an index that is based on one of the BY variables, the data can be sequenced without using the SORT procedure. The data can be sequenced by different variables if multiple indexes are used. Because indexes are updated automatically, there is no need to re-sort a data set when observations are modified or added.

However, using BY-group processing with an index has two disadvantages:

It is generally less efficient than sequentially reading a sorted data set because processing BY groups typically means retrieving the entire file.
It requires storage space for the index.

Note

A BY statement does not use an index if the BY statement includes the DESCENDING or NOTSORTED option or if SAS detects that the data file is physically stored in sorted order on the BY variables. Δ

Note

If you use a MODIFY statement, the data does not need to be ordered. However, your program might run more efficiently with ordered data. Δ

Comparative Example: Using BY-Group Processing with an Index to Avoid a Sort

Suppose you want to use an existing data set, Retail.Order_fact, to create a new SAS data set that is ordered by the variable Order_Date. You could accomplish this task using

BY-Group Processing with an Index, Data in Random Order
Presorted Data in a DATA Step
PROC SORT Followed by a DATA Step.

The following sample programs show each of these techniques. You can use these samples as models for creating benchmark programs in your own environment. Your results might vary depending on the structure of your data, your operating environment, and the resources that are available at your site. You can also view general recommendations for BY-group processing with an index or sort.

Programming Techniques

General Recommendations

To conserve resources, use sort order rather than an index for BY-group processing.
Although using an index for BY-group processing is less efficient than using sort order, it might be the best choice if resource limitations make sorting a file difficult.

Using the NOTSORTED Option

You can also use the NOTSORTED option with a BY statement to create ordered or grouped reports without sorting the data. The NOTSORTED option specifies that observations that have the same BY value are grouped together but are not necessarily sorted in alphabetical or numeric order.

The NOTSORTED option can appear anywhere in the BY statement and is useful if you have data that is in logical categories or groupings such as chronological order. The NOTSORTED option works best when observations that have the same BY value are stored together.

Caution

The NOTSORTED option turns off sequence checking. If your data is not grouped, using the NOTSORTED option can produce a large amount of output. Δ

Caution

The NOTSORTED option cannot be used with the MERGE or UPDATE statements. Δ

Example

Suppose you want to use the PRINT procedure to print the contents of the data set Retail.Europe, which contains data about European customers. The data set includes values for country names (Country_Name) as well as two-character country codes (Country_Code).

The data is grouped and sorted by the values of Country_Code. However, you want the observations in the output to be grouped by the values of Country_Name.

Table 23.1. Country Codes and Country Names

Two-Character Country	Code Full-Text Country Name
BE	Belgium
DE	Germany
DK	Denmark
ES	Spain
FI	Finland
FR	France
GB	United Kingdom
GR	Greece
IE	Ireland
IT	Italy
LU	Luxembourg
NL	Netherlands

Table 23.2. SAS Data Set Retail.Europe, Selected Observations

Obs	Customer_ID	Customer_Country	Customer_Name	Country
374	93803	BE	Luc Vandeloo	Belgium
375	93980	BE	Saida Van Itterbeeck	Belgium
376	94040	BE	Pat Sodergard	Belgium
377	13	DE	Markus Sepke	Germany
378	19	DE	Oliver S. Füßling	Germany
379	50	DE	Gert-Gunter Mendler	Germany

You can use the NOTSORTED option with a BY statement to accomplish this task without using the SORT procedure.

PROC PRINT output shows that the data is grouped but is not in sorted order. For example, observations in which the value of Country_Name is Germany are followed by observations in which the value of Country_Name is Denmark.

Table 23.3. PROC PRINT Output, Selected Observations

Using FIRST. and LAST.

The NOTSORTED option can be used with FIRST. variable and LAST. variable, which are temporary automatic variables in the PDV that identify the first and last observations in each BY group.

These temporary variables are available for DATA step programming but are not added to the output data set. Their values indicate whether an observation is

the first one in a BY group
the last one in a BY group
neither the first nor the last one in a BY group
both first and last, as is the case when there is only one observation in a BY group.

You can take actions conditionally, based on whether you are processing the first observation of a BY group or the last.

When an observation is the first in a BY group, SAS sets the value of FIRST.variable to 1. For all other observations in the BY group, the value of FIRST.variable is 0. Likewise, if an observation is the last in a BY group, SAS sets the value of LAST. variable to 1. For all other observations in the BY group, the value of LAST. variable is 0.

Example

The following program creates a new SAS data set Work.New. In the input data set, observations that have the same value for Ordername (Retail, Catalog, or Internet) are grouped together.

data work.new;
   set company.sales
   by ordername notsorted;
run;

When the program is submitted, SAS creates the temporary variables FIRST.Ordername and LAST.Ordername. These variables can be used during the DATA step, but they do not become variables in the new data set. The value 1 flags the beginning and end of each value in the BY group:

Observations			Corresponding FIRST. and LAST. Values
Obs	Customer_ID	OrderName	FIRST.OrderName	FIRST.OrderName
1	11791	Retail	1	0
2	8406	Retail	0	0
3	71020	Retail	0	0
4	21735	Retail	0	1
5	82141	Catalog	1	0
6	30993	Catalog	0	1
7	579	Internet	1	0
8	77184	Internet	0	1

Using the GROUPFORMAT Option

The GROUPFORMAT option uses the formatted values of a variable instead of the internal values to determine where a BY group begins and ends, and how FIRST.variable and LAST. variable are computed.

The GROUPFORMAT option

is available only in the DATA step
is useful when you define formats for grouped data
enables the DATA step to process the same groups of data as a summary procedure or PROC REPORT.

When the GROUPFORMAT option is used, the data set must be sorted by the GROUPFORMAT variable or grouped by the formatted values of the GROUPFORMAT variable.

Example

Suppose you want to create a summary report that includes the number of orders for each quarter in 2002. The data for the report is stored in the SAS data set Company.Orders.

Table 23.4. SAS Data Set Company.Orders, First Five Observations

Obs	Order_ID	Order_Type	Employee_ID	Customer_ID	Order_Date	Delivery_Date
1	1230000033	Internet Sale	99999999	8818	01JAN1998	07JAN1998
2	1230000204	Internet Sale	99999999	47793	01JAN1998	04JAN1998
3	1230000268	Internet Sale	99999999	71727	01JAN1998	03JAN1998
4	1230000487	Internet Sale	99999999	74503	01JAN1998	04JAN1998
5	1230000494	Internet Sale	99999999	8610	01JAN1998	07JAN1998

By creating a format for the data and using the GROUPFORMAT and NOTSORTED options, you cause SAS to create the variables FIRST.Order_Date and LAST.Order_Date based on the formatted values, not the internal values. This groups the data without requiring the creation of a new variable.

SAS Data Set Company.Orders, First Five Observations

Using the CLASS Statement

You can also use a CLASS statement to avoid a sort. Unlike the BY statement, the CLASS statement does not require the data to be presorted using the CLASS values, nor does it require an index that is based on the CLASS variables.

If the data cannot be sorted, the CLASS statement is more efficient than the BY statement in terms of CPU time, memory, and I/O usage. However, unlike using the BY statement, presorting the data for use with a CLASS statement does not provide a significant benefit.

Remember that a CLASS statement specifies the variables whose values define the subgroup combinations for an analysis by a SAS procedure. You can use the CLASS statement with the following Base SAS procedures:

MEANS
TABULATE
SUMMARY
UNIVARIATE.

Variables in a CLASS statement are referred to as class variables. Class variables can be numeric or character. Class variables can have continuous values, but they typically have a few discrete values that define the classifications of the variable.

Caution

The comparison of the use of CLASS and BY statements is appropriate for Base SAS procedures only. Δ

Example

The Company.Orders data set contains the variable Order_Type, which has three discrete values:

Retail
Catalog
Internet.

Suppose you want to show the average retail price, cost per unit, and discount for each value of Order_Type. You could use the MEANS procedure with either a BY statement or a CLASS statement to complete this task. The statistics created with either of these techniques are the same. However, the report layouts differ.

When the BY statement is used, SAS creates a report for each value of the BY variable. The statistics for each value of Order_Type appear in a separate tabular report.

proc sort data=company.order_fact(keep=order_type
          quantity total_retail_price
          costprice_per_unit discount) out=company.orders;
   by order_type; 
run;


proc means data=company.orders mean;

   by order_type; 
   var total_retail_price -- discount;
   freq quantity;
run;

Table 23.5. Output, PROC MEANS with a BY Statement

When the CLASS statement is used, only one report is created. The statistics for each value of Order_Type are consolidated into one tabular report.

Comparative Example: Using a BY or CLASS Statement to Avoid a Sort

Suppose you want to create a summary report that shows the average retail price, cost per unit, and discount for each type of order in the Retail.Order_fact data set.

Among the techniques you could use are

PROC MEANS with a BY Statement, Presorted
PROC MEANS with a CLASS Statement
PROC MEANS with a CLASS Statement, Presorted
PROC SORT and PROC MEANS with a BY Statement.

Programming Techniques

General Recommendations

If the data is presorted, use the BY statement instead of a CLASS statement.
If you cannot sort the data, use the CLASS statement.
Do not presort the data for use with a CLASS statement. Presorting the data does not provide a significant benefit.

Using the SORTEDBY= Data Set Option

If you are working with input data that is already sorted, you can specify how the data is ordered by using the SORTEDBY= data set option.

Note

By default, the collating sequence is that of your operating environment. For details on collating sequences, see the SAS documentation for your operating environment. Δ

Although the SORTEDBY= option does not sort a data set, it sets the value of the Sorted flag. It does not set the value of the Validated sort flag. (PROC SORT sets the Validated sort flag.) To see the values of these flags (YES or NO), use PROC CONTENTS.

proc contents;
   data=company.transactions;
run;

Figure 23.1. Partial PROC CONTENTS Output

Note

Most SAS procedures and subsystems check the order of the data as it is processed unless the Validated sort flag is set on the file. Δ

Note

Using the Sorted flag without the Validated sort flag improves the speed of index creation. Δ

Example

Suppose you want to create a sorted SAS data set from an external file that contains invoice information. The external file is already sorted by invoice number.

You can use the SORTEDBY= data set option to sort the data by the value of Invoice.

data company.transactions (sortedby=invoice); 
     infile extdata;
     input Invoice 1-4 Item $6-20 Amount comma 6.;
   run;

When the Company.Transactions data set is created, the sort information is stored with it. PROC SORT checks the sort information before it sorts a data set so that data is not re-sorted unnecessarily. If you attempt to re-sort the data, the log indicates that the data set is already sorted and that no additional sorting occurred.

proc sort data=company.transactions;
   by invoice;
run;

Table 23.6. SAS Log

Note

You can specify SORTEDBY=_NULL_ to remove the Sorted flag. The Sorted flag is also removed if you change or add any values of the variables by which the data set is sorted. Δ

Using a Threaded Sort

Threaded processing takes advantage of multiple CPUs by executing multiple threads in parallel (parallel processing). Threaded procedures are completed in less real time than if each task were handled sequentially, although the CPU time is generally increased.

Beginning with SAS 9, the SORT procedure can take advantage of threaded processing. A thread is a single, independent flow of control through a program or within a process.

Threaded sorting is enabled or disabled by using the SAS system option THREADS | NOTHREADS or the THREADS | NOTHREADS procedure option.

Note

The THREAD | NOTHREADS procedure option overrides the value of the SAS system option THREADS | NOTHREADS. For information about the THREADS | NOTHREADS system option, see the SAS documentation. Δ

When a threaded sort is used, the observations in the input data set are divided into equal temporary subsets, based on the number of processors that are allocated to the SORT procedure. Each subset is then sorted on a different processor. The sorted subsets are then interleaved to re-create the sorted version of the input data set.

Using the CPUCOUNT= System Option

The performance of threaded sorting is affected by the value of the CPUCOUNT= system option. CPUCOUNT= specifies the number of processors that thread-enabled applications should assume will be available for concurrent processing. SAS uses this information to determine how many threads to start, not to restrict the number of CPUs that will be used.

Caution

Setting CPUCOUNT= to a number greater than the actual number of available CPUs might result in reduced overall performance. Δ

Note

For more information about the CPUCOUNT= system option and other options that are relevant to SAS threading technology, see the SAS documentation. Δ

Calculating and Allocating Sort Resources

Sort Space Requirements

When data is sorted, SAS requires enough space in the data library for two copies of the data file that is being sorted as well as additional workspace.

In releases before SAS 9, the workspace required for an uncompressed data file is approximately three to four times the size of the data file. Beginning with SAS 9, the workspace required for an uncompressed data file is approximately twice the size of the data file. The workspace can be allocated in memory and/or on disk as a utility file, depending on the sort utility and on the options chosen.

You can use the following formula to calculate the amount of workspace that the SORT procedure requires:

Note

The multiplier 4 applies only to utility files used in releases before SAS 9 when PROC SORT needs to use disk space in order to sort the data. For in-memory sorting and sorting with SAS 9 and later, the multiplier is 2 or less. Δ

Example

Suppose you want to submit the following program under SAS 9:

proc sort data=company.customers;
   by customer_group customer_lastname;
run;

You can use the CONTENTS procedure or the DATASETS procedure to obtain the information that is required for the calculation.

Figure 23.2. Partial PROC CONTENTS Output

In this case, the amount of workspace needed to sort the data set is 48,575,160 bytes.

48,575,160 bytes=(70+200)*89954*2

Note

The SORT procedure is very I/O intensive. If the file that you are sorting is located in the Work library, all of the I/O for the procedure takes place on the file system in which the Work library is stored because, by default, the utility files that the SORT procedure creates are created in the Work library.

Beginning with SAS 9, you can use the UTILLOC= system option to specify one or more file systems in which utility files can be stored. For information about the UTILLOC= option, see the SAS documentation. Δ

Using the SORTSIZE= Option

The SORTSIZE= system option or procedure option specifies how much memory is available to the SORT procedure. Specifying the SORTSIZE= option in the PROC SORT statement temporarily overrides the SAS system option SORTSIZE=.

General form, SORTSIZE= option:

SORTSIZE= memory-specification;

where

memory-specification

specifies the maximum amount of memory that is available to PROC SORT. Valid values for memory-specification are as follows:

MAX

specifies that all available memory can be used.

specifies the amount of memory in bytes, where n is a real number.

specifies the amount of memory in kilobytes, where n is a real number.

specifies the amount of memory in megabytes, where n is a real number.

specifies the amount of memory in gigabytes, where n is a real number.

Note

The default value of the SORTSIZE= option depends on your operating environment. See the SAS documentation for your operating environment for more information. Δ

Generally, the value of SORTSIZE= should be less than the physical memory that is available to your process.

If the required workspace is less than or equal to the value specified in the SORTSIZE= system option or procedure option, then the entire sort can take place in memory, which reduces processing time.

If the actual required workspace is greater than the value specified in the SORTSIZE= system option or procedure option, then processing time is increased because the SORT procedure must

create temporary utility files in the Work directory or mainframe temporary area
request memory up to the value specified by SORTSIZE=
write a portion of the sorted data to a utility file.

This process is repeated until all of the data is sorted. The SORT procedure then interleaves the data in the utility files to create the final data set.

PROC SORT attempts to adapt to the constraint that is imposed by the SORTSIZE= option. Because PROC SORT uses memory as much as possible,

a small SORTSIZE= value can increase CPU and I/O resource utilization
a large SORTSIZE= value can decrease CPU and I/O resource utilization.

Handling Large Data Sets

Dividing a Large Data Set

A data set is too large to sort when there is insufficient room in the data library for a second copy of the data set or when there is insufficient disk space for three to four temporary copies of the data set.

One approach to this situation is to divide the large data set into smaller data sets. The smaller data sets can then be sorted and combined to re-create the large data set. This approach is similar to the process that is used in a threaded sort.

Techniques for dividing and sorting a large data set include

using PROC SORT with the OUT= statement option and the FIRSTOBS= and OBS= data set options
using PROC SORT with a WHERE statement
using subsetting with IF-THEN/ELSE or SELECT-WHEN logic to create multiple output data sets, and sorting the output data sets.

Techniques that can be used to rebuild a large data from smaller, sorted data sets set include

concatenating the smaller data sets with a SET statement
interleaving the smaller data sets with SET and BY statements
appending the smaller data sets with the APPEND procedure.

Comparative Example: Dividing and Sorting a Large Data Set 1

Suppose you want to sort the SAS data set Retail.Order_fact by the value of Order_Date. The data set is too large to sort using a single SORT procedure. You could accomplish this task by

Segmenting by Observation
Subsetting Using an IF Statement with the YEAR Function
Subsetting Using an IF Statement with a Date Constant
Subsetting Using a WHERE Statement with the YEAR Function
Subsetting Using a WHERE Statement with a Date Constant.

Programming Techniques

General Recommendations

Use a DATA step rather than PROC APPEND to re-create a large data set from smaller subsets.
Use a constant rather than a SAS function because calling a function repeatedly increases CPU usage.
Use a subsetting IF with either a constant or a function rather than a WHERE statement with a function.

Comparative Example: Dividing and Sorting a Large Data Set 2

Like the programs shown in the previous section, each of the following programs illustrates a method for dividing the large data set Retail.Order_fact into smaller data sets for sorting. However, in this example, the smaller data sets, Work.One, Work.Two, and Work.Three, are combined using the APPEND procedure rather than a DATA step in programs 2, 3, 4, and 5.

Segmenting by Observation
Subsetting Using an IF Statement with the YEAR Function
Subsetting Using an IF Statement with a Date Constant
Subsetting Using a WHERE Statement with the YEAR Function
Subsetting Using a WHERE Statement with a Date Constant.

Programming Techniques

General Recommendations

Use a DATA step rather than PROC APPEND to re-create a large data set from smaller subsets.

Using the TAGSORT Option

You can also use the TAGSORT option to sort a large data set. The TAGSORT option stores only the BY variables and the observation numbers in temporary files. The BY variables and the observation numbers are called tags. At the completion of the sorting process, PROC SORT uses the tags to retrieve records from the input data set in sorted order.

When the total length of the BY variables is small compared to the record length, TAGSORT reduces temporary disk usage considerably because sorting just the BY variables means sorting much less data. However, processing time is usually higher than if a regular sort is used because TAGSORT increases CPU time and I/O usage in order to save memory and disk space. TAGSORT

uses significantly more CPU time and I/O than a regular sort if the data is extremely out of order with regard to the BY variables
uses slightly more CPU time and I/O than a regular sort if the data is mostly in order with regard to the BY variables.

Example

In the following program, only the BY variables, Customer_ID and Order_Date, and the tags are stored in temporary files within the sort workspace. SAS then

sorts the temporary files
uses the tags to retrieve the observations from the original data set
re-creates the sorted data set.

proc sort data=company.orders  tagsort;
   by customer_id order_date;
run;

Caution

The TAGSORT option is not supported by the threaded sort. Δ

Removing Duplicate Observations Efficiently

The SORT procedure can be used to remove duplicate observations when it is

used with the NODUPKEY option
used with the NODUPRECS option
followed by FIRST. processing in the DATA step.

Generally, PROC SORT with the NODUPKEY option uses less I/O and CPU time than PROC SORT followed by a DATA step that uses FIRST. processing. Before viewing a comparative example, examine each of the techniques that are listed above.

Using the NODUPKEY Option

The NODUPKEY option checks for and eliminates observations that have duplicate BY-variable values. If you specify this option, then PROC SORT compares all BY-variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

Example

The SAS data set Company.Reorder contains two duplicated observations. Observation 9 is a duplicate of observation 1, and observation 7 is a duplicate of observation 2. The duplicate observations are removed when the data is sorted by the values of Product_Line and Product_Name and when the NODUPKEY option is used.

proc sort data=company.reorder  nodupkey; 
   by product_line product_name; 
run;

Table 23.7. SAS Data Set Company.Reorder

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
2	Children	Kids Children's Fleece Hat	3Top Sports
3	Clothes & Shoes	Watchit 120 Sterling/Reﬂective	Eclipse Inc
4	Sports	Sparkle Spray Blue	CrystalClear Optics Inc
5	Outdoors	Money Purse, Black	Top Sports
6	Sports	Mayday Serious Down Jacket	Mayday Inc
8	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
9	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S

Table 23.8. SAS Data Set Company.Reorder, Before Removing Duplicate Observations

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kids Children's Fleece Hat	3Top Sports
2	Children	Kids Children's Fleece Hat	3Top Sports
3	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
4	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
5	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
6	Clothes & Shoes	Watchit 120 Sterling/Reﬂective	Eclipse Inc
7	Outdoors	Money Purse, Black	Top Sports
8	Sports	Mayday Serious Down Jacket	Mayday Inc
9	Sports	Sparkle Spray Blue CrystalClear	Optics Inc

Table 23.9. SAS Data Set Company.Reorder, Duplicate Observations Removed

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kids Children's Fleece Hat	3Top Sports
2	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
3	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
4	Clothes & Shoes	Watchit 120 Sterling/Reﬂective	Eclipse Inc
5	Outdoors	Money Purse, Black	Top Sports
6	Sports	Mayday Serious Down Jacket	Mayday Inc
7	Sports	Sparkle Spray Blue CrystalClear	Optics Inc

Using the NODUPRECS Option

The NODUPRECS option also checks for and eliminates duplicate observations. However, unlike the NODUPKEY option, the NODUPRECS option compares all of the variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

Note

NODUP is an alias for NODUPRECS. Δ

Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables.

Example

When Company.Reorder is sorted using the BY variable Product_Line with the NODUPRECS option, the two observations that contain the product name Kids Children's Fleece Hat do not become consecutive duplicates. They are written to the output data set. The duplicate observation that contains the product name Ski Jacket w/Removable Fleece become consecutive duplicates. Only one of them is written to the output data set.

proc sort data=company.reorder noduprecs; 
   by product_line; 
run;

Table 23.10. SAS Data Set Company.Reorder

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
2	Children Kids	Children's Fleece Hat	3Top Sports
3	Clothes & Shoes	Watchit 120 Sterling/Reﬂective	Eclipse Inc
4	Sports	Sparkle Spray Blue CrystalClear	Optics Inc
5	Outdoors	Money Purse, Black	Top Sports
6	Sports	Mayday Serious Down Jacket	Mayday Inc
7	Children	Kids Children's Fleece Hat	3Top Sports
8	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
9	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S

Table 23.11. SAS Data Set Company.Reorder, Before Removing Duplicate Observation

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
2	Children	Kids Children's Fleece Hat	3Top Sports
4	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
5	Clothes & Shoes	Watchit 120 Sterling/Reﬂective	Eclipse Inc
6	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
7	Outdoors	Money Purse, Black	Top Sports
8	Sports	Sparkle Spray Blue	CrystalClear Optic Inc
9	Sports	Mayday Serious Down	Jacket Mayday Inc

Table 23.12. SAS Data Set Company.Reorder, Duplicate Observation Remaining

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
2	Children	Kids Children's Fleece Hat	3Top Sports
3	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
4	Clothes & Shoes	Watchit 120 Sterling/Reﬂective	Eclipse Inc
5	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
6	Outdoors	Money Purse, Black	Top Sports
7	Sports	Sparkle Spray Blue	CrystalClear Optic Inc
8	Sports	Mayday Serious Down Jacket	Mayday Inc

Both duplicate observations are removed when Company.Reorder is sorted by both Product_Line and Product_Name and when the NODUPRECS option is used.

proc sort data=company.reorder  noduprecs; 
      by product_line product_name; 
run;

Table 23.13. SAS Data Set Company.Reorder, Both Duplicate Observations Removed

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kids Children's Fleece Hat	3Top Sports
2	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
3	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
4	Clothes & Shoes	Watchit 120 Sterling/Reﬂective	Eclipse Inc
5	Outdoors	Money Purse, Black	Top Sports
6	Sports	Mayday Serious Down Jacket	Mayday Inc
7	Sports	Sparkle Spray Blue CrystalClear	Optics Inc

Note

The SORTDUP= system option controls how NODUPRECS processing works. Specifying SORTDUP=PHYSICAL removes duplicates based on all variables in the data set. This is the default. Specifying SORTDUP=LOGICAL removes duplicates based only on the variables that remain after the DROP= and KEEP= data set options are processed. See the SAS documentation for more information. Δ

Using the EQUALS | NOEQUALS Option

EQUALS | NOEQUALS is a SORT procedure option that helps to determine the order of observations in the output data set. When you use NODUPRECS or NODUPKEY to remove observations from the output data set, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.

EQUALS is the default. For observations that have identical BY-variable values, EQUALS maintains the order from the input data set in the output data set. NOEQUALS does not necessarily preserve this order in the output data set. NOEQUALS can save CPU time and memory resources.

Example

The following program uses PROC SORT with the NODUPKEY option and the NOEQUALS option to create an output data set that contains only the first observation in each BY group. Notice that the output data set Work.New contains different observations when the EQUALS option is used.

proc sort data=company.products out=work.new
     nodupkey noequals; 
      by product_line;
run;

Table 23.14. SAS Data Set Company.Products

Obs	Product_Line	Product_Name	Supplier_Name
1	Clothes & Shoes	Big Guy Men's Ringer T	Eclipse Inc
2	Children	Boy's and Girl's Ski Pants with Braces	Scandinavian Clothing A/S
3	Outdoors	Cotton Moneybelt/Polyester 45x11	Prime Sports Ltd
4	Sports	Cougar Shorts	SD Sporting Goods Inc
5	Clothes & Shoes	Far Out Teambag S	3Top Sports
6	Children	Kid Basic Tracking Suit	Triple Sportswear Inc
7	Sports	Maxrun Ultra short Sprinter Tights	Force Sports
8	Clothes & Shoes	Wa.leather Street Shoes	Fuller Trading Co.

Table 23.15. SAS Data Set Work.New, NOEQUALS Option Used

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kid Basic Tracking Suit	Triple Sportswear Inc
2	Clothes & Shoes	Far Out Teambag S	3Top Sports
3	Outdoors	Cotton Moneybelt/Polyester 45x11	Prime Sports Ltd
4	Sports	Maxrun Ultra short Sprinter Tights	Force Sports

Table 23.16. SAS Data Set Work.New, EQUALS Option Used

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Boy's and Girl's Ski Pants with Braces	Scandinavian Clothing A/S
2	Clothes & Shoes	Big Guy Men's Ringer T	Eclipse Inc
3	Outdoors	Cotton Moneybelt/Polyester 45x11	Prime Sports Ltd
4	Sports	Cougar Shorts	SD Sporting Goods Inc

Note

The EQUALS | NOEQUALS option is supported by the threaded sort. However, I/O performance might be reduced when you use the EQUALS option because partitioned data sets will be processed as if they are non-partitioned data sets. Δ

Caution

The order of observations within BY groups that are returned by the threaded sort might not be consistent between runs. Therefore, using the NOEQUALS option can produce inconsistent results in your output data sets. Δ

Using FIRST. LAST. Processing in the DATA Step

FIRST. LAST. processing in the DATA step can also be used to remove duplicate observations from a SAS data set.

In the data set Company.Onorder, the fourth observation contains a duplicate value for Product_Name. The following program removes the observation that contains the duplicate value by first sorting the input data set, Company.Onorder, by the value of Product_Name. The DATA step then selects only the first observation in the BY group.

proc sort data=company.onorder 
          out=work.sorted;
      by product_name; 
run;
data work.onorder2;
   set work.sorted;
   by product_name; 
   if first.product_name; 
run;

Table 23.17. SAS Data Set Company.Onorder

Obs	Product_Line	Product_Name	Supplier_Name
1	Clothes & Shoes	Big Guy Men's Ringer T	70
2	Children	Boy's and Girl's Ski Pants with Braces	55
3	Outdoors	Cotton Moneybelt/Polyester 45x11	20
4	Sports	Big Guy Men's Ringer T	70
5	Sports	Cougar Shorts	40
6	Clothes & Shoes	Far Out Teambag S	32
7	Children	Kid's Basic Tracking Suit	20
8	Sports	Maxrun Ultra short Sprinter Tights	25
9	Clothes & Shoes	Wa.leather Street Shoes	30

Table 23.18. SAS Data Set Company.Onorder2

Obs	Product_Line	Product_Name	Supplier_Name
1	Clothes & Shoes	Big Guy Men's Ringer T	70
2	Children	Boy's and Girl's Ski Pants with Braces	55
3	Outdoors	Cotton Moneybelt/Polyester 45x11	20
4	Sports	Cougar Shorts	40
5	Clothes & Shoes	Far Out Teambag S	32
6	Children	Kid's Basic Tracking Suit	20
7	Sports	Maxrun Ultra short Sprinter Tights	25
8	Clothes & Shoes	Wa.leather Street Shoes	30

Comparative Example: Removing Duplicate Observations Efficiently

Suppose you want to remove observations from the data set Retail.Order_fact in which the value of Order_Date is duplicated. Among the techniques you could use are

The NODUPKEY Option and the EQUALS Option
The NODUPKEY Option and the NOEQUALS Option
PROC SORT and a DATA Step with BY-Group and FIRST. Processing

Programming Techniques

Comparing Techniques to Eliminate Duplicate Data

The following table compares the techniques for eliminating duplicate data using PROC SORT, DATA step, and PROC SQL:

Technique	Advantages	Disadvantages
PROC SORT with NODUPKEY and DUPOUT=	No additional passes of the data occur. Only the PROC SORT step is required. Unique observations should be placed in one data set and other observations in another data set.	You cannot specify where observations are output
DATA step with `IF First.by-var=1 and LAST.by-var=1;`	The DATA step has many capabilities for further data manipulation. Unique observations should be placed in one data set and other observations in another data set.	Two steps are required.
DATA step with `IF FIRST.by-var=1;`	The DATA step has many capabilities for further data manipulation.	Two steps are required.
PROC SQL with the SELECT DISTINCT statement	The SQL step has many capabilities for combining, ordering, and grouping data.	The DISTINCT keyword applies to all of the variables in the SELECT statement. Only one data set can be created.

General Recommendations

To remove duplicate observations from a SAS data set, use PROC SORT with the NODUPKEY option rather than a PROC SORT step followed by a DATA step that uses FIRST. processing.
Be careful not to confuse NODUPKEY with NODUPRECS. NODUPRECS compares variables in the OBS.

Additional Features

Selecting a Host Sort Utility

Host sort utilities are third-party sort packages that are available in some operating environments. In some cases, using a host sort utility with PROC SORT might be more efficient than using the SAS sort utility with PROC SORT.

The following table lists the host sort utilities that might be available at your site. SAS uses the values that are set for the SORTPGM=, SORTCUT=, SORTCUTP=, and SORTNAME= system options to determine which sort to use.

Operating Environment	Host Sort Utilities
z/OS	Dfsort (default) Syncsort
UNIX	Cosort Syncsort (default)
Windows	Syncsort

Note

Ask your system administrator whether a host sort utility is available at your site. For more information about host sort utilities, see the SAS documentation for your operating environment. Δ

Using the SORTPGM= System Option

The value specified in the SORTPGM= system option tells SAS whether to use the SAS sort, to use the host sort, or to determine which sort utility is best for the data set.

Using the SORTCUTP= System Option

The SORTCUTP= system option specifies the number of bytes above which the host sort utility is used instead of the SAS sort utility.

Note

To determine the minimum and maximum values for SORTCUTP=, see the SAS documentation for your operating environment. Δ

The following table lists the default values for SORTCUTP= in the z/OS, UNIX, and Windows operating environments.

Operating Environment	Default SORTCUTP= Value	Default Behavior
z/OS	4M	SAS sort is used until this value is reached
UNIX	0	SAS sort is always used
Windows	0	SAS sort is always used

Using the SORTCUT= System Option

Beginning with SAS 9, the SORTCUT= system option can be used to specify the number of observations above which the host sort utility is used instead of the SAS sort utility.

Note

The SORTCUT= system option is not available in the z/OS operating environment.

Note

To determine the maximum value for SORTCUT=, see the SAS documentation for your operating environment. Δ

The default value of the SORTCUT= system option is 0.

Using the SORTNAME= System Option

The SORTNAME= option specifies the host sort utility that will be used if the value of SORTPGM= is BEST or HOST.

Note

The SORTNAME= system option is not available in the Windows operating environment.

Example

When you specify SORTPGM= BEST, SAS uses the value of the SORTCUT= and SORTCUTP= options to determine whether to use the host sort or the SAS sort. If you specify values for both the SORTCUT= and SORTCUTP= options, and if either condition is true, SAS chooses the host sort.

In the program below, if the size of the SAS data set Company.Orders is larger than 10,000 bytes, the host sort utility, Syncsort, will be used instead of the SAS sort utility.

options sortpgm=best sortcutp=10000
        sortname=syncsort;


proc sort data=company.orders out=company.deliveries;
   by delivery_date;
run;

Summary

Avoiding Unnecessary Sorts

When BY-group processing with an index is used, the data can be sequenced by different variables without having to repeat the SORT procedure if multiple indexes are used. Because indexes are updated automatically, there is no need to re-sort a data set when observations are modified or added. However, BY-group processing with an index is less efficient than reading a sorted data set sequentially, and storage space is required for the index.

You can also use the NOTSORTED option with a BY statement to create ordered or grouped reports without sorting the data. The NOTSORTED option specifies that observations that have the same BY value are grouped together but are not necessarily sorted in alphabetical or numeric order. The NOTSORTED option works best when observations that have the same BY value are stored together.

The NOTSORTED option can be used with FIRST. and LAST., which are temporary automatic variables in the PDV that identify the first and last observations in each BY group. These temporary variables are available for DATA step programming but are not added to the output data set.

The GROUPFORMAT option is useful when you have defined formats for grouped data. The GROUPFORMAT option uses the formatted values of a variable, instead of the internal values to determine where a BY group begins and ends, and how FIRST. and LAST. are computed. When the GROUPFORMAT option is used, the data set must be sorted by the GROUPFORMAT variable or grouped by the formatted values of the GROUPFORMAT variable.

You can use a CLASS statement to specify the variables whose values define the subgroup combinations for an analysis by a SAS procedure. Unlike the BY statement, when the CLASS statement is used with Base SAS procedures, it does not require the data to be presorted using the BY-variable values or that you have an index based on the BY variables. If the data cannot be sorted, the CLASS statement is more efficient than the BY statement in terms of CPU time, memory, and I/O usage.

If you are working with input data that is already sorted, you can specify how the data is ordered by using the SORTEDBY= data set option. Although the SORTEDBY= option does not sort a data set, it sets the Sorted flag on the data set.

Review the related comparative examples:

"Comparative Example: Using BY-Group Processing with an Index to Avoid a Sort" on page 786
"Comparative Example: Using a BY or CLASS Statement to Avoid a Sort" on page 795.

Using a Threaded Sort

Beginning with SAS 9, the SORT procedure can take advantage of threaded processing. Threaded jobs are completed in substantially less real time than if each task is handled sequentially. However, the CPU time for threaded jobs is generally increased.

Threaded sorting is enabled or disabled by using the THREADS | NOTHREADS SAS system option or procedure option. The procedure option overrides the value of the system option.

When a threaded sort is used, the observations in the input data set are divided into equal temporary subsets, based on how many processors are allocated to the SORT procedure. Each subset is then sorted on a different processor. The sorted subsets are then interleaved to re-create the sorted version of the input data set.

The performance of a threaded sort is affected by the value of the CPUCOUNT= system option. CPUCOUNT= specifies the number of processors that thread-enabled applications should assume will be available for concurrent processing. SAS uses this information to determine how many threads to start, not to restrict the number of CPUs that will be used.

Calculating and Allocating Sort Resources

When data is sorted, SAS requires enough space in the data library for two copies of the data file that is being sorted, as well as additional workspace.

In releases before SAS 9, the required workspace is approximately three to four times the size of the data file. Beginning with SAS 9, the required workspace is approximately twice the size of the data file. The workspace can be allocated in memory and/or on disk as a utility file, depending on which sort utility and options are specified.

The SORTSIZE= option specifies how much memory is available to the SORT procedure. Generally, the value of SORTSIZE= should be less than the physical memory that is available to your process. If the required workspace is less than or equal to the value specified in the SORTSIZE= system option or procedure option, then the entire sort can take place in memory, which reduces processing time.

Handling Large Data Sets

One approach to this situation is to divide the large data set into smaller subsets. The subsets can then be sorted and combined to re-create the large data set.

You can also use the TAGSORT option to sort a large data set. The TAGSORT option stores only the BY variables and the observation numbers in temporary files. The BY variables and the observation numbers are called tags. At the completion of the sorting process, PROC SORT uses the tags to retrieve records from the input data set in sorted order.

When the total length of the BY variables is small compared to the record length, TAGSORT reduces temporary disk usage considerably because sorting just the BY variables means sorting much less data. However, processing time might be much higher because the TAGSORT option increases CPU and I/O usage in order to save memory and disk space.

Review the related comparative examples:

"Comparative Example: Dividing and Sorting a Large Data Set 1" on page 803
"Comparative Example: Dividing and Sorting a Large Data Set 2" on page 808.

Removing Duplicate Observations Efficiently

The NODUPKEY option checks for and eliminates observations that have duplicate BY-variable values. If you specify this option, then PROC SORT compares all BY-variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

The NODUPRECS option checks for and eliminates duplicate observations. However, unlike the NODUPKEY option, the NODUPRECS option compares all of the variable values for each observation to those for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

EQUALS | NOEQUALS is a procedure option that helps to determine the order of observations in the output data set. When you use NODUPRECS or NODUPKEY to remove observations from the output data set, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.

FIRST. LAST. processing in the DATA step can also be used to remove duplicate observations in a SAS data set.

Review the related comparative example:

"Removing Duplicate Observations Efficiently" on page 814.

Additional Features

Depending on your operating environment, you might be able to use additional sorting options, called host sort utilities. Host sort utilities are third-party sort packages. In some cases, using a host sort utility might be more efficient than using the SAS sort utility with PROC SORT.

SAS uses the values that are set for the SORTPGM=, SORTCUTP=, SORTCUT= and SORTNAME= system options to determine which sort utility to use.

Quiz

Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.

When the following program is submitted, what is the value of FIRST.Product_Line for the third observation in the data set Work.Products?

data new.products;
   set work.products
   by product_line notsorted;
run;

Table 23.19. SAS Data Set Work.Products

Obs	Product_Line	Product_Name	Supplier_Name
1	Children	Kids Children's Fleece Hat	3Top Sports
2	Children	Ski Jacket w/Removable Fleece	Scandinavian Clothing A/S
3	Clothes & Shoes	Tyfoon Linen Pants	Typhoon Clothing
4	Clothes & Shoes	Watchit 120 Sterling/Reﬂective	Eclipse Inc
5	Clothes & Shoes	Money Belt, Black	Top Sports

1
3
0
Clothes & Shoes

Which option is used with the SORT procedure to store only the BY variables and the observation numbers in temporary files?
1. NOTSORTED
2. GROUPFORMAT
3. TAGSORT
4. SORTEDBY=
Which of the following is not an advantage of BY-group processing with an index that is based on the BY variables?
1. The data can be sequenced without using the SORT procedure.
2. There is no need to re-sort a data set when observations are modified or added.
3. It is generally more efficient than reading a sorted data set sequentially.
4. The data can be sequenced by different variables if multiple indexes are used.
Which SORT procedure option compares all of the variable values for each observation to those for the previous observation that was written to the output data set?
1. NODUPKEY
2. NODUPRECS
3. EQUALS
4. NOEQUALS
What happens if the workspace that is required for completing a sort is less than or equal to the value that is specified in the SORTSIZE= system option or procedure option?
1. CPU time is increased.
2. I/O is increased.
3. The entire sort can take place in memory.
4. A temporary utility file is created in the Work directory or in a mainframe temporary area.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 23. Selecting Efficient Sorting Strategies

Create new playlist

Sign In

Sign Up

Chapter 23. Selecting Efficient Sorting Strategies

Overview

Introduction

Note

Objectives

Prerequisites

Avoiding Unnecessary Sorts

Using BY-Group Processing with an Index

Note

Note

Comparative Example: Using BY-Group Processing with an Index to Avoid a Sort

Programming Techniques

General Recommendations

Using the NOTSORTED Option

Caution

Caution

Example

Using FIRST. and LAST.

Example

Using the GROUPFORMAT Option

Example

Using the CLASS Statement

Caution

Example

Comparative Example: Using a BY or CLASS Statement to Avoid a Sort

Programming Techniques

General Recommendations

Using the SORTEDBY= Data Set Option

Note

Note

Note

Example

Note

Using a Threaded Sort

Note

Using the CPUCOUNT= System Option

Caution

Note

Calculating and Allocating Sort Resources

Sort Space Requirements

Note

Example

Note

Using the SORTSIZE= Option

Note

Handling Large Data Sets

Dividing a Large Data Set

Comparative Example: Dividing and Sorting a Large Data Set 1

Programming Techniques

General Recommendations

Comparative Example: Dividing and Sorting a Large Data Set 2

Programming Techniques

General Recommendations

Using the TAGSORT Option

Example

Caution

Removing Duplicate Observations Efficiently

Using the NODUPKEY Option

Example

Using the NODUPRECS Option

Note

Example

Note

Using the EQUALS | NOEQUALS Option

Example

Note

Caution

Using FIRST. LAST. Processing in the DATA Step

Comparative Example: Removing Duplicate Observations Efficiently

Programming Techniques

Comparing Techniques to Eliminate Duplicate Data

General Recommendations

Additional Features

Selecting a Host Sort Utility

Note

Using the SORTPGM= System Option

Table of Contents for
23. Selecting Efficient Sorting Strategies