Image483_fmt.png

Chapter 4 Sorting the Data

4.1 PROC SORT Options

4.1.1 The NODUPREC Option

4.1.2 The DUPOUT= Option

4.1.3 The TAGSORT Option

4.1.4 Using the SORTSEQ Option

4.1.5 The FORCE Option

4.1.6 The EQUALS or NOEQUALS Options

4.2 Using Data Set Options with PROC SORT

4.3 Taking Advantage of Known or Knowable Sort Order

4.4 Metadata Sort Information

4.5 Using Threads

Sorting data is always a resource-intensive operation; therefore, using PROC SORT wisely can save you both time and computing effort. Fortunately, there are both options and strategies to assist you in selecting more efficient, if not optimal methodologies.

MORE INFORMATION

Indexes can be an alternative to sorting the data. Section 5.3 discusses the creation and maintenance of indexes.

4.1 PROC SORT Options

There are a number of options associated with PROC SORT that can be used not only to control performance and capabilities of the procedure, but also the resulting data set. One of the options, NODUPREC, is of special interest as its misuse can result in unanticipated results.

4.1.1 The NODUPREC Option

It is my opinion that the NODUPREC option (as well as its aliases NODUPLICATES and NODUP) is used far too often. While it performs just as is specified in the documentation, it does not do what many users think it does. Consequently, when I see it used in someone else’s code, it always raises a red flag that begs me to look closer at the data. Most users of this option think that it will remove all duplicate observations, and although this is what it nominally is supposed to do, it does not necessarily cause PROC SORT to remove all duplicate observations. In fact it only removes duplicate observations that are adjacent after sorting. When the sorting process results in a data set in which duplicate observations are not next to each other (they do not come one after the other sequentially), they will not be detected and the duplicate observation(s) will not be removed.

The following PROC SORT uses the NODUPREC option with the intent that it will remove any duplicate observations.

title1 '4.1.1a NODUPLICATES in PROC SORT';
proc sort data=advrpt.lab_chemistry
          out=lab_chem
          noduprec;
   by subject;
   run;

In a listing of the resulting data set (lab_chem) we can see that observations 1 and 3 are duplicates that have not been detected by PROC SORT.

4.1.1a NODUPLICATES in PROC SORT
Obs    SUBJECT    VISIT         LABDT    SODIUM    POTASSIUM    CHLORIDE
  1      200        1      07/06/2006     140         3.6         103✓
  2      200        2      07/13/2006     144        4.81         106
  3      200        1      07/06/2006     140         3.6         103✓
  4      200        4      07/13/2006     140        4.02         103
  5      200        4      07/13/2006     140           4         103
  6      200        5      07/21/2006     142        4.57         104
                     . . . portions of the listing are not shown . . . . 

When key fields in the BY statement are sufficient to form a primary key, the observations will necessarily be reordered sufficiently to cause the NODUPREC option to work as we would hope that it would. In the previous example if the BY statement had included VISIT and LABDT as well as SUBJECT, the duplicate record would have been removed.


				
NOTE: There were 169 observations read from the data set ADVRPT.LAB_CHEMISTRY.
NOTE: 3 duplicate observations were deleted.
NOTE: The data set WORK.LAB_CHEM has 166 observations and 6 variables

In fact the LOG shows that although three observations were indeed removed in this example (4.1.1a), the two rows that are checked (obs 1 and 3) are also duplicates and neither was removed.

by _all_;

When you do not have a primary key, or if you do not know which variables will form a primary key, the only way to guarantee that duplicate observations are removed is to list all the data set’s variables in the BY statement. The list of all variables could be abbreviated by using the _ALL_ list abbreviation.

In practice this tends to be a very inefficient solution to the problem of duplicate observations. Although inefficient, if your data set size is such that the extra sorting resources do not impact you to a very great degree, then using _ALL_ at least works and does what we need it to do when using NODUPREC. Generally, although not foolproof, the inclusion of a derived variable (such as a variance or standard deviation) along with the probable primary key variables is often sufficient to successfully reorder the observations so that duplicates will be in adjacent rows.

The NODUPKEY option does not have this problem as only the key fields are evaluated during the check for duplicate values.

MORE INFORMATION

The Hash object is used to eliminate duplicate observations in Section 2.9.5. The NODUPREC option is also discussed in the context of the NOEQUALS option in Section 4.1.6.

4.1.2 The DUPOUT= Option

When the NODUPREC or the NODUPKEY options are used, the LOG will note when observations are removed; however, which observations were removed is not apparent. If you want to be able to see these observations, the DUPOUT= option can be used to save the eliminated observations into a separate data table.

In the example that follows, the BY statement now includes a sufficient key to reorder the problem observations noted in the previous section. The removed observations are written to a separate data set (WORK.REMOVEDOBS).

title1 '4.1.2 NODUPLICATES and DUPOUT= in PROC SORT';
proc sort data=advrpt.lab_chemistry
           out=lab_chem
           dupout=RemovedObs
           noduprec;
   by subject visit labdt;
   run;
proc print data=removedobs(obs=10);
   run;

				

Because we are using a sufficient key to reorder the problem observations noted in the previous section, that duplicate is now also being deleted. Callout 1

4.1.2 NODUPLICATES and DUPOUT= in PROC SORT
Obs    SUBJECT    VISIT         LABDT    SODIUM    POTASSIUM    CHLORIDE
 1       200        1      07/06/2006     140         3.6         103 Callout 1
 2       200        9      09/13/2006     139        4.06         103
 3       201        2      07/14/2006     140        4.15         101
 4       202        6      07/29/2006     139        5.68          96

Although we have detected these duplicate observations, without using what we know to be a sufficient key, we cannot guarantee that these are all of the duplicate observations.

Removal of duplicate observations can also be accomplished using Hash objects (see Sections 2.9.5 and 3.3); however, costs can be similar to those experienced when sorting with _ALL_.

4.1.3 The TAGSORT Option

As the data set to be sorted increases in size (number of rows, number of key variables, or number of variables in the table), more and more resources are required to complete the sorting process. The process itself can result in a number of temporary copies of all or part of the data set that is being sorted, and if the data set is large enough, these temporary tables can exceed the amount of available storage in the WORK directory.

When successful sorting is hampered by a lack of intermediate storage, the TAGSORT option can be used. This option causes PROC SORT to separate the key fields from the rest of the data. The key fields are then sorted, and after the sorting is complete, the data set is reconstructed using the new sort order.

While the TAGSORT option decreases storage requirements during the sort process, the overall time to complete the sort will usually increase.

4.1.4 Using the SORTSEQ Option

PROC SORT uses what is known as the ‘collating sequence’ to determine the sorted order of values. Traditionally there have been two collating sequences, EBCDIC (for mainframe computers) and ASCII (for most other machines running Operating Systems like Windows and UNIX). You have long been able to select one or the other of these two different collating sequences by specifying the EBCDIC or ASCII options on the PROC SORT statement.

With the introduction of National Language Support, NLS, additional collating sequences have been made available to support languages other than English. Like ASCII and EBCDIC these other collating sequences are also selected through options, which include: DANISH, POLISH, SWEDISH, and NATIONAL. The NATIONAL sequence is selected when your site has specified a customized sequence.

The SORTSEQ option allows you to further refine the way the selected collating sequence is used. This includes subsets or locals within a national collating sequence. Even without changing the base collating sequence the SORTSEQ option can be beneficial.

Reordering Numeric Strings

proc sort data=advrpt.clinicnames 
                     (keep=region)
           out= regions1 nodupkey;
   by region;
   run;
Obs    region
  1      1
  2      10
  3      2
  4      3
  5      4
  6      5
  7      6
  8      7
  9      8
 10      9

When character strings that contain numbers are sorted the values are sorted alphabetically. This can be visually unappealing in that the values seem out of order, e.g., ‘10’ < ‘2’. This is shown when we sort on REGION, a $2 character string containing numeric values that range from ‘1’ to ‘10’. The result has region ‘10’ sorted between regions ‘1’ and ‘2’.

The SORTSEQ option can be used to change this default behavior. When the SORTSEQ option is assigned the keyword LINGUISTIC Callout 1, a number of additional keyword qualifiers can also be specified. Turning on the NUMERIC_COLLATION Callout 2 causes the regions to be ordered as if they were numeric. Region ‘10’ will now be sorted last.


				
proc sort data=advrpt.clinicnames(keep=region)
           out= regions2 
           sortseq=linguistic Callout 1 (numeric_collation=on)Callout 2 
           nodupkey;
   by region;
   run;

Case-Sensitive Reordering

Depending on the collating sequence the lowercase letters will all sort either before or after the uppercase letters. The examples in this section have taken the first five names from the ADVRPT.DEMOG data set and copied them into all uppercase and all lowercase, as well as the original mixed case.

proc sort data=anames
          out=anamesE
          sortseq=ebcdic;
   by lname;
   run;
title3 'EBCDIC Sequence';
proc print data=anamesE;
   run;

The side-by-side comparison shows the difference between these two primary collating sequences (the code for generating the ASCII sequence is in the sample programs).

EBCDIC Sequence
Obs    lname
  1    adams
  2    adamson
  3    alexander
  4    antler
  5    atwood
  6    Adams
  7    Adamson
  8    Alexander
  9    Antler
 10    Atwood
 11    ADAMS
 12    ADAMSON
 13    ALEXANDER
 14    ANTLER
 15    ATWOOD
 
ASCII Sequence
Obs    lname
  1    ADAMS
  2    ADAMSON
  3    ALEXANDER
  4    ANTLER
  5    ATWOOD
  6    Adams
  7    Adamson
  8    Alexander
  9    Antler
 10    Atwood
 11    adams
 12    adamson
 13    alexander
 14    antler
 15    atwood
proc sort data=anames
          out=anamesc
          sortseq=linguistic (case_first=upper); Ž
   by lname;
   run;

We can further refine the sequencing by using keyword qualifiers.

Case_First=Upper
Obs    lname
  1    ADAMS
  2    Adams
  3    adams
  4    ADAMSON
  5    Adamson
  6    adamson
  7    ALEXANDER
  8    Alexander
  9    alexander
 10    ANTLER
 11    Antler
 12    antler
 13    ATWOOD
 14    Atwood
 15    atwood

The CASE_FIRST=UPPER Callout 3 qualifier causes uppercase to take priority over lowercase. Notice, however, that the order is still not the same as ASCII which also gives priority to uppercase letters. Here the sensitivity is within the word not across the list of words. The CASE_FIRST keyword qualifier can also take on the value of LOWER.

4.1.5 The FORCE Option

When the OUT= option is not used on the PROC SORT statement, the incoming data set is replaced with its sorted analogue. When the data set is indexed or if the metadata sort indicators (such as is created by the SORTEDBY= data set option – see Section 4.4) indicate that the data set is already sorted, the sorting does not take place. For indexed data sets this protects the index, and for data sets that are already sorted this conserves resources. When this default behavior is not what you want, the FORCE option can be used.

4.1.6 The EQUALS or NOEQUALS Options

Typically when PROC SORT reorders observations based on the levels of the BY variables, the block of observations within a given level or BY group do not change their order. Generally we do not care about the order of the rows within a BY group; if we did we would add another variable to the BY list. While this default behavior can be controlled at the operating system option level using the SORTEQUALS or the NOSORTEQUALS option, it can also be controlled at the PROC SORT step level using the EQUALS or NOEQUALS options.

Under earlier versions of SAS this order preservation made sense from an operational point of view. Fewer resources were expended by handling the rows as a block. Under the current versions of SAS this default behavior may no longer be our best choice. With multi-threading available to the SORT procedure, portions of these blocks may be divided up across processors. When the rows are returned from the different threads, additional resources may actually be expended just to preserve the order within a block, an order that we probably do not care about.

The NOEQUALS option can be used on the PROC SORT statement to allow SAS to not worry about maintaining the original order within groups. Allowing the within block order to change (by not forcing it to be preserved) through the use of the NOEQUALS option can save resources; however, it can also have other impacts. The order returned, especially when multiple processors are involved, can change from one sort to the next. Since the NODUPREC option (see Section 4.1.1) relies on observation order, its results may also vary from one run to the next when the NOEQUALS option is used.

MORE INFORMATION

Caveats associated with the use of the NODUPREC option are discussed in Section 4.1.1.

4.2 Using Data Set Options with PROC SORT

One of the primary efficiency techniques used to speed up our programs is to eliminate variables and/or observations as soon as possible in the data handling process. Just as we will seldom carry a parka in our luggage when visiting Miami, we should not carry the extra baggage of variables or observations that are not needed. Trimming up the data can have a major impact in the time needed to complete a PROC SORT.

Fortunately for us the process of culling unneeded variables and observations can be handled within the PROC SORT step itself through the use of data set options.

If you associate the KEEP= or WHERE= data set options with the data set that is being generated (on the OUT= data set), variables and observations are removed after the sort has been completed. Although this will help with efficiency in subsequent steps, it will do little to help with the current PROC SORT.

proc sort data=realbig 
          out=onoutgoing(keep=sodium2);
   by sodium2;
   run;
proc sort data=realbig(keep=sodium2) 
          out=onincoming;
   by sodium2;
   run;

When data set options are associated with the incoming data set, they are applied before the PROC SORT is executed. This can substantially reduce the processing requirements of the PROC SORT.

Additional efficiency gains can be achieved by eliminating observations. Like in the previous example, which eliminated columns, eliminating observations before they are read (on the DATA= data set) as opposed to as they are being written to the final data set (on the OUT= data set), can make a substantial difference.

SEE ALSO

The use of the WHERE= data set option on the incoming data set is discussed by Benjamin (2007).

4.3 Taking Advantage of Known or Knowable Sort Order

While there can be a negative impact associated with the use of PROC SORT, we obviously still need to be able to use it to reorder the data. Or do we? Often thinking about your program, its flow, and how it is organized, can help you make sure that you only use PROC SORT when it is actually needed. Some strategies to help minimize the number of SORT steps could include the following:

Plan Your Sorts

Since sorts can be costly, plan your program and data flow around your sorts rather than programming sorts as they are needed in your program. If several different steps use a specific sort order, sort the data once for all the steps rather than placing the steps so that the data must be sorted, resorted, and then sorted back to the first order a second time.

Use CLASS Statements

Unless you are going to explicitly use a BY statement, most procedures do not require the data in a specific order. Obviously there are exceptions; however, the point is that you often do not necessarily need to sort your data. This is especially true of procedures that use implicit or explicit classification variables. CLASS statements do NOT require sorted or even ordered data.

When using the MEANS or SUMMARY procedures for instance, the procedure will probably execute faster when a BY statement is used instead of a CLASS statement (of course the results may not contain exactly the same information). However, the BY statement requires sorted data and the sorting itself may increase the overall processing time such that using the CLASS statement would have ultimately been more efficient. The CLASS statement will avoid sorting, but will generally require more memory.

MORE INFORMATION

The use of threads to improve efficiency is discussed in Section 4.5. Differences between the BY and CLASS statements for the MEANS or SUMMARY procedures are discussed in Section 7.12.

Anticipate Procedure Output Order

For procedures that create output data sets, the order of the data is generally known or at least knowable, and knowing the order of the generated data, or planning the procedure so that it generates data in the desired order can eliminate the necessity of a subsequent PROC SORT. To control the possible orderings of the output data set, be sure to take advantage of the ORDER= option. Generally speaking, the order of the classification variables on the incoming data does not affect the output order unless the ORDER= option is set to DATA (see Section 2.6.2).

Even procedures that do not support the CLASS statement may have implied classification variables (e.g., PROC FREQ), and the values of these variables, along with the ORDER= option, help to determine the order of any generated data sets.

The following table is a listing of a data set that was created by a PROC SUMMARY step. By inspection you can see the sort order and you could even infer the CLASS statement. You can also infer the ORDER= option associated with each variable on the CLASS statement.

4.3 Predicting Sort Order
                                        mean
Obs  race  edu  symp  _TYPE_  _FREQ_     HT
  1          .           0       8     66.25
  2          .   01      1       2     64.00
  3          .   02      1       4     66.50
  4          .   03      1       2     68.00
  5         12           2       4     67.50
  6         14           2       2     64.00
  7         15           2       2     66.00
  8         12   02      3       2     67.00
  9         12   03      3       2     68.00
 10         14   01      3       2     64.00
 11         15   02      3       2     66.00
 12   1      .           4       6     67.00
 13   4      .           4       2     64.00
 14   1      .   02      5       4     66.50
 15   1      .   03      5       2     68.00
 16   4      .   01      5       2     64.00
 17   1     12           6       4     67.50
 18   1     15           6       2     66.00
 19   4     14           6       2     64.00
 20   1     12   02      7       2     67.00
 21   1     12   03      7       2     68.00
 22   1     15   02      7       2     66.00
 23   4     14   01      7       2     64.00

Assuming that the classification variables are not formatted, inspection of this table suggests the following CLASS statement (for this procedure INTERNAL is the default value for the ORDER= option).

class race edu symp / 
        order=internal;

If this data set was to be used in a subsequent step, each of these BY statements could be used without first using a PROC SORT.

by _type_ race edu symp;
by race edu symp;
by _type_;

Avoid Sorting by Using Indexes

Indexes provide you with a way to establish one or more virtual sort orders against a data set. While an index must be created and maintained, for stable data sets this cost may be minimal relative to the cost of sorting and then re-sorting the data.

When an index is created it is stored in a separate file from the data set itself. Whenever the data set is modified, even if the modification does not alter the order of the rows, the index must be recreated. The index file itself will take storage space. The amount of required space will depend on several factors, including the number of rows in the table, the number of indexes that have been established, and the number of variables that make up each index. Additional benefits of indexes include optimized searches with WHERE clauses and the ability to perform double SET statement merges and table look-ups without sorting (see Sections 6.4 and 6.6).

MORE INFORMATION

Indexes are discussed in more detail in Section 5.3.

Using PROC SQL to Avoid Sorts

When PROC SQL operates on a data table, the entire table is loaded into memory. While this means that PROC SQL can be limited in what it can do with larger tables (limited by available memory), it also means that the sort order of incoming data is rarely an issue within the SQL step.

A match merge in the DATA step requires a BY statement; however, this is not the case with the equivalent JOIN in an SQL step. Also, the GROUP clause, which is analogous to the CLASS statement, is also available in the SQL step. Before sorting the data, consider whether or not the use of an SQL step might yield the same result, while avoiding a SORT, as well as, an additional pass of the data.

If a data set is too large to sort, especially if you have tried the TAGSORT option, then it is likely to be too large to be effectively handled by an SQL step.

4.4 Metadata Sort Information

When data are sorted or indexed, information about the sort is stored as a part of the table’s metadata. In this example a simple PROC SORT is executed, and the data set’s metadata is then displayed using PROC CONTENTS.

title1 '4.4a Showing SORT Meta-data';
proc sort data=advrpt.lab_chemistry
           out=lab_chem
           noduplicates; 
   by subject visit labdt; 
   run;
proc contents data=lab_chem;
   run;

Among other things, information about how the data is sorted is contained in two different sections of the PROC CONTENTS output.

The upper-most section shows the internal sorted flag Callout 1, which takes on the values of either YES or NO.


				
4.4a Showing SORT Metadata
The CONTENTS Procedure
Data Set Name        WORK.LAB_CHEM                    Observations          165                       
Member Type          DATA                             Variables             6
Engine               V9                               Indexes               0
Created              Thu, Nov 05, 2009 02:26:48 PM    Observation Length    56
Last Modified        Thu, Nov 05, 2009 02:26:48 PM    Deleted Observations  0
Protection                                            Compressed            NO
Data Set Type                                         Sorted                YES Callout 1                     

When the SORTED flag Callout 1 contains YES, indicating that the data are sorted, an additional section is added to the PROC CONTENTS output. This section lets us know more about the conditions of the sort, and these include the BY variables Callout 2 and sort options Callout 4.

         Sort Information
Sortedby       SUBJECT VISIT LABDT Callout 2
Validated      YES Callout 3
Character Set  ANSI
Sort Option    NODUPREC Callout 4

When SAS does the sorting the VALIDATED flag is set to YES.

The SORTEDBY Data Set Option

When the data are already sorted, but not by SAS, the SORTED metadata flag will not be changed to YES, and we can miss out on performance enhancements that take advantage of known sort order. We can let SAS know that the data are actually sorted by setting the SORTED flag set to YES. You can set the SORTED metadata indicator flag directly by using the SORTEDBY data set option.

title1 '4.4b Using the SORTEDBY Option';
data lab2(sortedby=subject visit);
   set lab_chem;
   run;
proc contents data=lab2;
   run;
      Sort Information
Sortedby       SUBJECT VISIT
Validated      NO Callout 5
Character Set  ANSI

Not only is the SORTED flag set to YES, but the sort information section is also completed. Notice, however, that the VALIDATED indicator is still set to NO Callout 5. We are trusted, but only trusted so far.

4.5 Using Threads

Some operations within a computer are computationally intensive. This is especially true for sorting operations and also for the calculation of large numbers of summary statistics. When multiple CPUs are available, some procedures will follow the principle of ‘divide and conquer,’ and they can split up computationally challenging tasks by spreading the work out among the available CPUs. This distributed work load can offer substantial improvements in the elapsed time to complete tasks.

The system option THREADS is used to allow SAS to take advantage of multiple CPUs. Although some of the documentation refers to the option in the singular, it needs to be THREADS. When multi-threads are used, a note is added to the LOG.

A number of Base SAS procedures, as well as several in SAS/STAT®, support multi-threaded operations. Additionally, support can also be found in some SAS Enterprise Miner® procedures and in some SAS/ACCESS® engines. Base procedures which have multi-threaded capabilities include: SORT, MEANS, SUMMARY, SQL, TABULATE, and REPORT.

Although the use of multiple threads will generally improve processing thru-put, this does not necessarily have to be the case. Since resources are used not only to create and maintain the threads, but also to coordinate the information flow between threads, it is possible that the use of these resources can outweigh the advantage of the multiple threads. It is important for you to test your environment with and without threads to determine which has the better performance. You should test with both THREADS and NOTHREADS and, depending on your data and code, determine which is more efficient.

When more than two CPUs are present on the system, you may not want all of them to be available for use by SAS. The system option, CPUCOUNT, is used to control how many of the available CPUs can be used by SAS.

SEE ALSO

Additional information on multi-threading can be found on the SAS R&D site
http://support.sas.com/rnd/scalability/procs/index.html.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset