Image483_fmt.png

Chapter 6 Table Lookup Techniques

6.1 A Series of IF Statements—The Logical Lookup

6.2 IF -THEN/ELSE Lookup Statements

6.3 DATA Step Merges and SQL Joins

6.4 Merge Using Double SET Statements

6.5 Using Formats

6.6 Using Indexes

6.6.1 Using the BY Statement

6.6.2 Using the KEY= Option

6.7 Key Indexing (Direct Addressing)—Using Arrays to Form a Simple Hash

6.7.1 Building a List of Unique Values

6.7.2 Performing a Key Index Lookup

6.7.3 Using a Non-Numeric Index

6.8 Using the HASH Object

A table lookup is performed when you use the value of a variable (e.g., a clinic number) to determine the value of another variable (e.g., clinic name). Often this second piece of information must be ‘looked up’ in some other secondary table or location. The process of finding the appropriate piece of information is generally fast; however, as the number of items and/or observations increases, the efficiency of the process becomes increasingly important. Fortunately there are a number of techniques for performing these table lookups.

These techniques can be radically different both in terms of programming complexity and in performance. As the programmer, you will be faced with not only a variety of techniques, but also performance issues. Which technique will perform best, or even adequately, will vary from situation to situation. This means that there is no easy answer to the question, ‘Which technique should I use?’ It is possible, however, to give you some general guidelines.

In this chapter data set size is often used as a decision point for determining which of these techniques to use. But what exactly is a large data set? This answer too will be situational. It will depend a lot on your OS and your hardware (including available memory and processing capabilities). You will know a data set is large when you have to be careful with the techniques used to process it. For table lookups this chapter will give you alternatives. Each is a compromise between processing efficiency and coding complexity. Each section should give you sufficient information to help you make an informed decision. In all cases you will probably want to test your candidate techniques using your data and your hardware.

6.1 and 6.2 IF – THEN/ELSE

This is the slowest and least sophisticated of the techniques discussed in this chapter; however, these techniques may be adequate for small data sets and simple lookups.

6.3 Merges and Joins

These techniques tend to be slower than the use of formats (6.5); however, they do allow the retrieval of multiple items.

6.4 Double SET Statement DATA Steps

Replacing the MERGE statement with a double SET statement generally provides a performance boost, but the technique is more programming intensive.

6.5 Format-Driven Lookup

These techniques tend to be substantially faster than the use of merges and joins. These techniques are probably the best first choice for most situations with small to somewhat large data sets.

6.6 Using Indexes

Indexes can improve the performance of merge and join techniques, especially when the indexes have already been created. Depending on the situation these techniques might outperform the use of formats.

6.7 Key Indexing (Array Processing)

The use of DATA step arrays and memory eliminates the need for sorting. Although there are potential memory constraints, key indexing typically out performs merges and joins, and for larger data sets out performs the use of formats. When it is possible to use them, array (direct addressing) techniques, such as these, will also generally outperform hash table techniques.

6.8 Hash Tables

Hash tables (hashing) is more flexible than array processing techniques, and except for array processing techniques, these techniques are typically the fastest of the lookup techniques. Coding requires an understanding of the DATA step component hash objects and methods. For very large data sets, the use of hash tables may be the only viable choice.

In each of the examples in this chapter we will be using lookup techniques to determine the clinic name based on the clinic number.

SEE ALSO

An overview of lookup techniques, which includes several nice examples with explanations, can be found in Liu (2008). Aker (2002) and Carpenter (2001b) each discuss differences and programming techniques for lookups including those for match merging, format lookups, and the use of arrays. An overview of lookup techniques with emphasis on hash tables and hash objects can be found in Dorfman and Vyverman (2004b). Comparison papers on the efficiency of table lookup techniques include Stroupe and Jolley (2008) as well as Jolley and Stroupe (2007).

6.1 A Series of IF Statements—The Logical Lookup

The simplest form of a table lookup makes use of the IF-THEN statement. Although easy to code, this is one of the slowest table lookup methods. Essentially this technique creates the new variable with its associated value through the use of IF-THEN processing. Effectively we are ‘hard coding’ the clinic name within the program. This technique is demonstrated in the following DATA step.

data demognames;
   set advrpt.demog(keep=clinnum lname fname);
   length clinname $35;
   if clinnum='011234' then clinname = 'Boston National Medical';
   if clinnum='014321' then clinname = 'Vermont Treatment Center';
   if clinnum='107211' then clinname = 'Portland General';
   if clinnum='108531' then clinname = 'Seattle Medical Complex';
   if clinnum='023910' then clinname = 'New York Metro Medical Ctr';
   if clinnum='024477' then clinname = 'New York General Hospital';
   run;

								

The problem with this approach is that it is not practical if there are more than a very few codes to lookup; besides this is VERY inefficient. SAS must execute each IF statement even if an earlier IF statement was found to be true. To make matters worse, IF statements require a fair bit of processing time.

This is an example of a 100% lookup. It is a sequential search and one where every code is checked regardless of whether or not the answer has already been found.

6.2 IF -THEN/ELSE Lookup Statements

A substantially faster method than the IF-THEN, is to use the IF-THEN / ELSE statement combination. The following DATA step executes more quickly than the previous one because as soon as one IF statement is found to be true, its associated ELSE is not executed. Consequently as soon as an expression is found to be true, none of the remaining IF-THEN / ELSE statements are executed. This technique can be made somewhat faster if the more likely outcomes are placed earlier in the list.

data demognames;
   set advrpt.demog(keep=clinnum lname fname);
   length clinname $35;
   if clinnum='011234' then clinname = 'Boston National Medical';
   else if clinnum='014321' then clinname = 'Vermont Treatment Center';
   else if clinnum='107211' then clinname = 'Portland General';
   else if clinnum='108531' then clinname = 'Seattle Medical Complex';
   else if clinnum='023910' then clinname = 'New York Metro Medical Ctr';
   else if clinnum='024477' then clinname = 'New York General Hospital';
   run;

In terms of performance efficiency this technique is similar to that of the DATA step’s SELECT statement and the CASE statement in SQL.

The SELECT statement is on par with the IF-THEN / ELSE combination when performing table lookups. It can even be a bit faster (Virgle, 1998). Again processing time is minimized when the most likely match is located early in the list.

data demognames;
   set advrpt.demog(keep=clinnum lname fname);
   select(clinnum);
    when( '011234') clinname='Boston National Medical';
    when( '014321') clinname='Vermont Treatment Center';
    when( '107211') clinname='Portland General';
    when( '108531') clinname='Seattle Medical Complex';
    when( '023910') clinname='New York Metro Medical Ctr';
    when( '024477') clinname='New York General Hospital';
    otherwise;
   end;
   run;

Interestingly Virgle (1998) found that the efficiency of the SELECT statement can sometimes be enhanced by placing the entire expression on the WHEN statement.

data demognames;
   set advrpt.demog(keep=clinnum lname fname);
   select;
    when(clinnum='011234') clinname='Boston National Medical';
    when(clinnum='014321') clinname='Vermont Treatment Center';
    when(clinnum='107211') clinname='Portland General';
    when(clinnum='108531') clinname='Seattle Medical Complex';
    when(clinnum='023910') clinname='New York Metro Medical Ctr';
    when(clinnum='024477') clinname='New York General Hospital';
    otherwise;
   end;
   run;

There are two overriding issues with these techniques. The primary problem is that the search is sequential. When the list is long the average number of comparisons goes up quickly, even when you carefully order the list. The second, but no less important, issue is that these techniques hard code the values in the program. This is just not smart programming.

Fortunately, the other lookup techniques in this chapter not only avoid hard coding altogether, but also minimize the need for sequential searches.

6.3 DATA Step Merges and SQL Joins

The use of joins in an SQL step and the MERGE in the DATA step is another very common way to perform table lookups by matching values between two data sets. The MERGE statement (when used with the BY statement as it usually is) requires sorted or indexed data sets, while the SQL step does not. There are advantages and disadvantages to both processes.

MERGE Statement

The MERGE statement is used to combine two or more data sets. For the purpose of this discussion, one of these data sets will contain the information that is to be looked up. The BY statement is used to make sure that the observations are correctly aligned. The BY statement

should include sufficient variables to form a unique key in all but at most one of the data sets. For the example below ADVRPT.CLINICNAMES has exactly one observation for each value of CLINNUM.

Because the BY statement is used, the data must be sorted. Sorting can be time consuming, or even on occasion impossible for very large data sets or for data sets on tape. In the following steps PROC SORT is used to reorder the data into temporary (WORK) data sets. These are then merged using the MERGE statement.

proc sort data=advrpt.demog
          out=demog;
   by clinnum;
   run;
proc sort data=advrpt.clinicnames
          out=clinicnames;
   by clinnum;
   run;
data demognames(keep=clinnum clinname 
                     lname fname);
   merge demog(in=indemog)
         clinicnames(in=innames);
   by clinnum;
   if indemog; Callout 1
   run;

For a successful lookup using the MERGE statement, both of the incoming data sets must be indexed or in sorted order and as was discussed in Chapter 4 sorting can be operationally expensive.

The following PROC PRINT listing of the first 10 observations of the merged data shows that the CLINICNAME has been successfully acquired.

As anticipated the observations are sorted by CLINNUM. Notice also that the variable CLINNAME, which came from the right-most data set in the MERGE statement, is last on the PDV.

6.3 Lookup By Joining or Merging Two Tables                                               
10 Observations of the merged data                                               
Obs    clinnum     lname     fname              clinname                                               
  1    011234     Nabers     David     Boston National Medical                                               
  2    011234     Taber      Lee       Boston National Medical                                               
  3    014321     Lawless    Henry     Vermont Treatment Center                                               
  4    014321     Mercy      Ronald    Vermont Treatment Center                                               
  5    023910     Atwood     Teddy     New York Metro Medical Ctr                                               
  6    023910     Harbor     Samuel    New York Metro Medical Ctr                                               
  7    023910     Leader     Zac       New York Metro Medical Ctr                                               
  8    024477     Haddock    Linda     New York General Hospital                                               
  9    024477     Little     Sandra    New York General Hospital                                               
 10    024477     Maxwell    Linda     New York General Hospital                                               
. . . . portions of the table are not shown . . . .

The IF statement Callout 1 has been used to eliminate any clinic numbers in CLINICNAMES that do not appear in DEMOG. This logic will not eliminate cases where there is a clinic number in DEMOG that does not appear in CLINICNAMES (the clinic name will be missing). If we want to restrict the lookup to only those clinic numbers with matches in CLINICNAMES, the IF statement Callout 1 could be replaced with if indemog and innames;. This result is achieved directly in the SQL join discussed next Callout 2.

SQL Join

When using SQL, the merging process is called a join. The SQL join operations do not require sorting and can be more efficient than the DATA step MERGE, unless the tables are so large that they do not fit well into memory.

Just because an SQL join does not require the incoming data to be sorted, does not mean that no resources are going to be expended in preparation for the join (lookup). Hidden from view, but within the processing of the SQL step, lookup techniques are being applied. These behind the scenes operations are very similar to the hash techniques that we can explicitly apply and control in the DATA step (see Sections 6.7 and 6.8).

There are a number of different types of joins within SQL, and one that closely matches the previous step is shown below.

proc sql noprint;
   create table demognames2 as
    select a.clinnum, b.clinname, lname, fname
     from advrpt.demog a, advrpt.clinicnames b
      where a.clinnum=b.clinnum; Callout 2
   quit;

In this example we have added the requirement (through the use of a WHERE clause Callout 2) that the clinic number be in both data tables. The WHERE clause is used to select rows that have matching values of CLINNUM on both incoming tables. This is a more stringent criteria than was used in the DATA step shown above Callout 1.

SQL does not require either of the two data sets to be sorted prior to the join, and unless we specifically request that the resulting data table be sorted (ORDER BY clause) it will reflect the order of the incoming data. This can be seen in the order of the clinic number in the PROC PRINT results shown below.

6.3 Lookup By Joining or Merging Two Tables
10 Observations of the Joined data
Obs    clinnum    clinname                      lname        fname
  1    049060     Atlanta General Hospital      Adams        Mary
  2    082287     Denver Security Hospital      Adamson      Joan
  3    066789     Austin Medical Hospital       Alexander    Mark
  4    063742     Houston General               Antler       Peter
  5    023910     New York Metro Medical Ctr    Atwood       Teddy
  6    066789     Austin Medical Hospital       Banner       John
  7    046789     Tampa Treatment Complex       Baron        Roger
  8    049060     Atlanta General Hospital      Batell       Mary
  9    095277     San Francisco Bay General     Block        Will
 10    031234     Bethesda Pioneer Hospital     Candle       Sid
. . . . portions of the table are not shown . . . .

6.4 Merge Using Double SET Statements

There are a number of schemes that have been published that utilize two SET statements in the same DATA step. These SET statements replace the single MERGE statement and the programmer takes charge of the joining process (keeping the two data sets in sync). These techniques can be faster but more complicated than a MERGE. However they do still require that both incoming data sets be sorted.


				
data withnames(keep=subject clinnum clinname);
  set demog(rename=(clinnum=code)); Callout 1
  * The following expression is true only when
  * the current CODE is a duplicate.;
  if code=clinnum then output; Callout 2
  do while(code>clinnum); Callout 3
    * lookup the clinic name using the code (clinnum)
    * from the primary data set;
    set clinicnames(keep=clinnum clinname); Callout 4
    if code=clinnum then output; Callout 5
  end;
  run;

In this example, the two incoming data sets have already been sorted (by CLINNUM). The primary data set contains the observations for which we need the clinic name. The secondary data set (CLINICNAMES) contains just the names that are to be retrieved. It is both possible and even likely that the lookup data set will contain values that have no match in the first or primary data set. This is fine as long as we plan for the possibility Callout 3.

Callout 1 An observation is read from the primary data set. Because DEMOG and CLINICNAMES both use the variable CLINNUM to hold the clinic number, when it is read from the DEMOG data set it is renamed CODE. This allows us to access and compare the clinic numbers from both data sets at the same time ➋➌➎.

Callout 2 The value of CODE (clinic number from DEMOG) is compared to the value of CLINNUM, which comes from the second data set (CLINICNAMES). On the very first pass, no observation will have been read from CLINICNAMES, and CLINNUM will be missing. Codes that do not have matching names will not be written out Callout 5.

Callout 3 The DO WHILE is used to read successive rows from the second data set.

Callout 4 It is possible that there are codes and names in the CLINICNAMES data set that are not in the primary data set (DEMOG). These observations will necessarily have a CLINNUM that is less than CODE. This loop Callout 3 cycles through any extra names until the second data set ‘catches up’ to the first (CODE=CLINNUM).

Callout 5 This code matches the current value of CLINNUM and the observation is written out.

As in the MERGE example shown earlier, the data still have to be sorted before the above DATA step can be used. Although the sorting restrictions are the same as when you use the MERGE statement, the advantage of the double SET can be a substantial reduction in processing time.

MORE INFORMATION

The use of two SET statements in one DATA step is introduced in Section 3.8.4 and used in examples in Sections 3.1.5 and 3.1.6.

6.5 Using Formats

The use of FORMATS allows us to step away from the logical processing of assignment statements, and to take advantage of the search techniques that are an inherent part of the use of FORMATS. When a value is retrieved from a format, a binary search is used and this means that we can search 2N items in N or fewer tries. With 10 guesses we can search over 1000 items. Since binary searches operate by iteratively splitting a list in half until the target is found, these searches tend to be faster than sequential searches—especially as the number of items increases.

For many users, especially those with smaller data sets and lookup tables, the efficiency gains realized here may be sufficient for most if not all tasks. Formats with several thousands of items have been used successfully as lookup tables.

Formats can be built and added to a library (permanent or temporary) through the use of PROC FORMAT (see Chapter 12). The process of creating a format is both fast and straightforward. The following format ($CNAME.) contains an association between the clinic code and its name.

proc format;
   value $cname
   '011234'='Boston National Medical'
   '014321'='Vermont Treatment Center'
   '107211'='Portland General'
   '108531'='Seattle Medical Complex'
   '023910'='New York Metro Medical Ctr'
   . . . . some code not shown . . . . 
   '024477'='New York General Hospital';
   run;

Of course typing in a few values is not a ‘big deal’; however, as the number of entries increases the process tends to become tedious and error prone. Fortunately it is possible to build a format directly from a SAS data set. The CNTLIN= option identifies a data set that contains specific variables. These variables store the information needed to build the format, and as a minimum must include the name of the format (FMTNAME), the incoming value (START), and the value that the incoming value will be translated to (LABEL). The following DATA step builds the data set CONTROL, which is used by PROC FORMAT. Notice the use of the RENAME= option and the RETAIN statement. One advantage of this technique is that the control data set does not need to be sorted.

Since we already have a data set with the matched value pairs (ADVRPT.CLINICNAMES), it is a perfect candidate for building a format automatically.

data control; Callout 1
   set advrpt.clinicnames(keep=clinname clinnum 
                          rename=(clinnum=start Callout 2
                                  clinname=label)); Callout 3
   retain fmtname '$cname'; Callout 4
   run;
proc format cntlin=control; Callout 5
   run;

Callout 1 The control data set containing the variables (FMTNAME, START, and LABEL) is created based on the data set ADVRPT.CLINICNAMES.

Callout 2 The START variable (left side of the = sign in the value statement) is created by renaming CLINNUM.

Callout 3 The LABEL variable (right side of the = sign in the value statement) is created by renaming CLINNAME.

Callout 4 The format name is a constant and is created using the RETAIN statement.

Callout 5 The format is created by PROC FORMAT through the use of the CNTLIN= option which points to the control data set.

Once the format has been defined, the PUT function Callout 6 can be used to assign a value to the variable CLINICNAME by using the $CNAME. format.

data fmtnames(keep=subject clinnum clinname dob);
   set demog(keep = subject dob clinnum); 
   clinname = left(put(clinnum,$cname.)); Callout 6
   run;

Remember the PUT function always returns a character string; when a numeric value is required, the INPUT function can be used. The length of the new variable is determined by the format used in the PUT function. If no length is specified, as in this example, the variable’s length will be based on the longest value in the format ($27. in this example). A shorter format width, say $CNAME20., would cause the variable to have the shorter length (truncation would be a possibility). Values longer than the longest formatted value will not increase the variable’s length past the longest formatted value.

The previous DATA step will be substantially faster than the IF-THEN/ELSE or SELECT processing steps shown above. The difference becomes even more dramatic as the number of items in the lookup list increases. The lookup itself will use the format $CNAME., and hence will employ a binary search. As a rule of thumb, format searches should be very efficient up until the number of items to look up exceeds 20,000 or so items.

MORE INFORMATION

An assignment statement technique that outperforms a PUT function lookup when creating a numeric result is discussed in Section 2.2.3. Data are used to create user-defined formats in Section 12.7.

6.6 Using Indexes

Indexes are a way to logically sort your data without physically sorting it. If you find that you are sorting and then re-sorting data to accomplish your various merges, you may find that indexes will be helpful.

Indexes must be created, stored, and maintained. They are usually created through either PROC DATASETS (shown below) or through PROC SQL; however, they can also be created in a DATA step. The index stores the order of the data as if it had been physically sorted. Once an index exists, SAS will be able to access it, and you will be able to use the data set with the appropriate BY statement even though the data has never been physically sorted.

Resources are required to create an index, and these resources should be taken into consideration. Indexes are stored in a separate file, and the size of this file can be substantial especially as the number of indexes, observations, and variables used to form the indexes increases.

Indexes can substantially speed up processes. They can also SLOW things down (Virgle, 1998). Not all data sets are good candidates to be indexed and not all variables will form good indexes. Be sure to read about indexes (see Section 5.3 for more on indexes), and then experiment carefully before investing a lot in the use of indexes.

The following example shows the creation of indexes for the two data sets of interest. The library containing the data sets, ADVRPT, is identified Callout 1. The NOLIST option prevents PROC DATASETS from writing a list of all the objects in this library to the LOG.

proc datasets library=advrpt nolist;Callout 1
   modify clinicnames; Callout 2
     index create clinnum Callout 3/ unique; Callout 4
   modify demog; Callout 2
     index create clinnum; Callout 3
   quit;

Callout 2 The MODIFY statement is then used to name the data sets that are to receive the indexes. And the INDEX statement Callout 3 defines the index for each data set.

Callout 4 The unique option forces unique values for CLINNUM.

MORE INFORMATION

The building, maintenance, and use of indexes are discussed further in Section 5.3.

6.6.1 Using the BY Statement

Making use of an index can be as simple as using a BY statement. When the BY variable is an index, the index is automatically used, and the data does not need to be sorted. However relying on an index to perform a merge is not necessarily as fast as a merge on sorted data. The advantage is that we do not have to sort the data prior to the merge, and the time required to perform the sort should be taken into consideration. Assuming the indexes have already been created, one of the following techniques should generally give you better performance over an indexed merge.

data mrgnames;
   merge demog(keep=subject clinnum edu)
         clinicnames(keep=clinnum clinname);
   by clinnum;
   run;

6.6.2 Using the KEY= Option

You can also look up a value when an index exists on only the data set that contains the values to be looked up. The KEY= option on the SET statement identifies an index that is to be used.

data keynames;
   set advrpt.demog
          (keep=subject clinnum lname fname); Callout 5
   set advrpt.clinicnames key=clinnum/unique; Callout 6
   if _iorc_ne 0 then clinname=' '; Callout 7
run;

Callout 5 An observation is read from the primary, potentially unsorted data set. This loads a value for the index variable (CLINNUM) into the PDV.

Callout 6 An observation is read from the lookup data set. Because the KEY= option has been specified, the observation corresponding to the current value of CLINNUM is returned. Since this is an indexed read, the observations read from ADVRPT.DEMOG can be in any order, and values of CLINNUM can be repeated.

Callout 7 The temporary variable _IORC_ will be 0 when an indexed value is successfully read. If the value of CLINNUM is not found, _IORC_ will not be equal to 0 and we will need to supply a missing value for the clinic name. Otherwise the value of CLINNAME will have been retained from the previous observation.


				
data rckeylookup;
   set advrpt.demog(keep=subject clinnum lname fname);
   set advrpt.clinicnames key=clinnum/unique;
   select (_iorc_); Callout 8
      when (%sysrc(_sok)) do; Callout 9
         * lookup was successful;
         output;
      end;
      when (%sysrc(_dsenom)) do; Callout 10
         * No matching clinic number found;
         clinname='Unknown';
         output;
      end;
      otherwise do;
         put  'Problem with lookup ' clinnum=;
         stop;
      end;
   end;
run;

The values returned to _IORC_ may change in future releases of SAS. Rather than depend directly on the value of _IORC_, the SAS supplied autocall library macro %SYSRC can be used to ‘decode’ the values contained in _IORC_. The following example is the same as the previous one, but it takes advantage of two of the over two dozen values accepted by %SYSRC. Callout 8 The SELECT statement is used to compare the returned _IORC_ value with codes of interest. _IORC_ will be an integer: 0 for success and >0 for various types of failure.

Callout 9 For a given error mnemonic, the %SYSRC macro returns the number associated with the associated error. _SOK is the mnemonic for success and %SYSRC returns a 0, which matches the value of _IORC_.

Callout 10 When a specific value of CLINNUM is not on the index, the error mnemonic is _DSENOM.

SEE ALSO

Additional examples and discussion on the use of the KEY= option can be found in Aker (2000).

6.7 Key Indexing (Direct Addressing)—Using Arrays to Form a Simple Hash

Sometimes when sorting is not an option or when you just want to speed up a search, the use of arrays can be just what you need. Also known as direct addressing, variations on this form of lookup tend to be the fastest of the lookup techniques discussed in this chapter; however, there are some restrictions on their use that can limit their flexibility. These techniques require you to create arrays, sometimes very large arrays. Fortunately under current versions of SAS you can build arrays that can contain millions of values (Dorfman, 2000a, 2000b).

6.7.1 Building a List of Unique Values

To introduce this topic consider the problem of creating a list of unique values from a data set. In terms of the data sets being used in this set of examples, we would like to establish a list of unique clinic numbers within the data set ADVRPT.DEMOGS. One of several ways that you could use to solve this problem is shown to the left. Here PROC SORT uses a NODUPKEY option to build a data set with unique values of CLINNUM.

proc sort data=advrpt.demog
          out=uniquenums 
          nodupkey;
   by clinnum;
   run;

This works, assuming that the data set can be sorted and that the cost of the resources expended in the sorting process is reasonable. An alternate method appropriate for data sets of all sizes makes use of DATA step arrays.

To avoid sorting, we somehow have to “remember” which clinic codes we have already seen. The way to do this is to use the ARRAY statement. The beauty of this technique is that the search is very quick because it has to check only one item. We accomplish this by using the clinic code itself as the index to the array.

data uniquekey;
   array check {999999} _temporary_; Callout 1
   set advrpt.demog; Callout 2
   if check{input(clinnum,6.)}=. then do; Callout 3
      output; Callout 4
      check{input(clinnum,6.)}=1; Callout 5
   end;
   run;

Callout 1 Establish an array with sufficient dimension to handle all the clinic numbers.

Callout 2 Read a clinic number.

Callout 3 When the array element is missing, this is the first occurrence of this clinic number. Write it out and then mark it Callout 5 so it will not be written again.

Callout 4 Write out this clinic number.

Callout 5 Mark this number as having been seen.

As an observation is read from the incoming data set, the character clinic code is converted to a number using the INPUT function and then is used as the index for the ARRAY CHECK. If the array value is missing, this is the first (unique) occurrence of this clinic number. It is then marked as found (the value is set to 1). Notice that this step will allow a range of clinic codes from 1 to 999,999. Larger ranges, into the 10s of millions, are easily accommodated.

The array used in the previous example is numeric; however, we could have stored a single byte character flag and reduced the memory requirements by a factor of 8.

data uniquekey;
   array check {999999} $1 _temporary_; Callout 6
   set advrpt.demog;
   if check{input(clinnum,6.)}=' ' Callout 7 then do;
      output;
      check{input(clinnum,6.)}='x';Callout 8
   end;
   run;

Callout 6 The array is declared to be a character array of $1 elements.

Callout 7 The check is made for a missing (blank) array value.

Callout 8 A non-blank character is stored to indicate that this clinic number has been found.

6.7.2 Performing a Key Index Lookup

In the previous example an array was used to look up whether or not an item had been found before. This process of looking up a value is exactly what we do when we merge two data sets. In the following DATA step the list of codes are read sequentially, once, into an array that stores the clinic name (instead of just the number 1) again using the clinic code as the array subscript. The second DO UNTIL then reads the data set of interest. In this loop the clinic name is recovered from the array and assigned to the variable CLINNAME.

In addition to its speed of execution, a major advantage of this technique is that neither of the incoming data sets needs to be sorted.

This technique is known as key indexing because the index of the array is the value of the variable that we want to use as the lookup value.

The array itself may be numeric or character depending on whether a numeric or character value is to be retrieved. The index, however, must be numeric (or convertible to numeric as in this example). Large arrays are common. In this example there are almost a million array elements, when the example needs a mere 27. Memory is fast, cheap, and generally available on most modern machines, thus making this overkill a small price.

data clinnames(keep=subject lname fname clinnum clinname);
  array chkname {999999} $35 _temporary_; Callout 1
  do until(allnames); Callout 2
     set advrpt.clinicnames end=allnames;
     chkname{input(clinnum,6.)}=clinname; Callout 3
  end;
  do until(alldemog);
     set advrpt.demog(keep=subject lname fname clinnum) Callout 4
                end=alldemog;
     clinname = chkname{input(clinnum,6.)}; Callout 5
     output clinnames;
  end;
  stop; Callout 6
  run;

Callout 1 A character array of temporary values is established. This array will hold the values to be retrieved (clinic names), and will be indexed by the clinic number (CLINNUM). The length of the array elements must be sufficient to hold each value being inserted into the array.

Callout 2 A loop is used to read all of the observations from the data set that contains the values to be looked up. Because the temporary variable ALLNAMES is defined using the END= option on the SET statement, it will be 0 for all observations except the last one, and then it will be assigned the value of 1 (true).

Callout 3 The value of the clinic name, CLINNAME, is stored in the array element identified by the clinic number. The INPUT function is used to convert the character variable CLINNUM into an integer.

Callout 4 An observation is read from the primary data set. This loads the value of the clinic number into the Program Data Vector, PDV, where it can be used to retrieve the clinic name from the CHKNAME array.

Callout 5 The clinic name is retrieved from the CHKNAME array using the value of the clinic number just retrieved from the primary data set.

Callout 6 Because we have placed the SET statements inside of DO loops, it is necessary to terminate the DATA step directly.


				
data crnames(keep=subject lname fname clinnum clinname region);
  array chkname {999999} $35 _temporary_; 
  array chkregn {999999} $2 _temporary_; 
  do until(allnames); 
     set advrpt.clinicnames end=allnames;
     chkname{input(clinnum,6.)}=clinname;
     chkregn{input(clinnum,6.)}=region; Callout 7
  end;
  do until(alldemog);
     set advrpt.demog(keep=subject lname fname clinnum) 
                end=alldemog;
     clinname = chkname{input(clinnum,6.)};
     region   = chkregn{input(clinnum,6.)}; Callout 7
     output crnames;
  end;
  stop; 
  run;

Because we are working with arrays this technique is not limited to the retrieval of a single value. In this example we want to retrieve both the clinic name and the region associated with the clinic number. Callout 7 The only real difference is the addition of another array.

As was noted above memory is fast and usually readily available. We should still be at least conscious of our memory usage. In this example the maximum length of a clinic name is 27 characters. Since the ARRAY definition will not affect the length of the new variable (array elements were defined as $35 and 35>27), this array statement has wasted almost 8 megabytes (999,999 * 8 bytes) of memory. Although not a lot of memory for a small array such as this one, you should at least be aware of the overall cost of your array. This technique will not work in all situations. As the number of array elements increases, the amount of memory used also increases. Paul Dorfman (2000a) discusses memory limitations. Certainly most modern computers should accommodate arrays with the number of elements in the millions.

For situations where this technique requires unreasonable amounts of memory, other techniques such as bitmapping and hashing are available. Again Paul Dorfman is the acknowledged expert in this area and his cited papers should be consulted for more details.

MORE INFORMATION

In the sample programs associated with this section there is a key indexing example that stores and retrieves multiple values of multiple variables.

6.7.3 Using a Non-Numeric Index

One of the limitations of the key indexing techniques is that the index to the array must be numeric. This limitation is overcome completely by the use of hash objects (see Section 6.8). In the examples in Section 6.7.2 the index is a character string that contains a number; therefore, the INPUT function can be used to create the numeric index value. What if the character string does not readily convert to a number?

When the number of items to be looked up is fairly small, for example fewer than 20 or 30 thousand, a format can be used to convert the character key to a number. In this example let’s assume that CLINNUM could not be converted directly to a number. Instead we create a numeric informat to create a unique artificial index number.


				
data control(keep=fmtname start label type); Callout 1
   set advrpt.clinicnames(keep=clinnum
                          rename=(clinnum=start))
       end=eof;
   retain fmtname 'nam2num' type 'I'; Callout 2
   label=_n_; Callout 3
   output control;
   if eof then call symputx('levels',_n_); Callout 4
   run;
proc format cntlin=control; Callout 5
   run;
data clinnames(keep=subject lname fname 
                    clinnum clinname);
  array chkname {&levelsCallout 4} $35 _temporary_; 
  do until(allnames); 
     set advrpt.clinicnames end=allnames;
     chkname{input(clinnum,nam2num.Callout 6)}=clinname; 
  end;
  do until(alldemog);
     set advrpt.demog(keep=subject lname fname 
                           clinnum) 
                end=alldemog;
     clinname = chkname{input(clinnum,nam2num.Callout 6)};
     output clinnames;
  end;
  stop; 
  run;

Callout 1 A control file is created that will be used to build the conversion format NAM2NUM. (see Section 12.7).

Callout 2 The TYPE variable declares this to be a numeric informat.

Callout 3 The label is the numeric counter.

Callout 4 Since we are reading the whole data set (one observation per unique clinic number), we can save the number of possible values. This value can be used to provide a dimension to the array.

Callout 5 The CONTROL data set is used by PROC FORMAT to create the NAM2NUM. format.

Callout 6 The format is used to convert the character value into a usable array index.

6.8 Using the HASH Object

Users with very large data sets are often limited by constraints that are put on them by memory or processor speed. Often, for instance, it is not practical or perhaps even possible to sort a very large data set. Unsorted data sets cannot be merged using a BY statement unless the data set is indexed, and this type of merge is generally not feasible (see Section 6.6.1). Joins in SQL may be possible by using the BUFFERSIZE option, but this still may not be a useful solution. Fortunately there are a number of techniques for handling these situations as well.

In Section 6.7.2 an array was used to hold and retrieve values. This is a form of a simple hash table. In SAS®9 the DATA step has a HASH object that will hold and access the array portion of this hash array. This hash object is a DATA step component object (DATA step component objects are introduced and discussed in Section 3.3).

While key indexing is fast and works well, it does have limitations that the hash object can overcome.

  • Key indexing requires a numeric value as the array index. While techniques have been developed to work around this limitation, hash objects are designed to work with character or numeric keys.
  • Unless using a multi-dimensional array, key indexing can use only a single key while hash objects can use composite keys.
  • Multiple fields can be returned with a hash object.

Essentially the hash object defines an array in memory, initializes its values with data from a table, and sets up an indexing variable or variables that can be either numeric or character.

data hashnames(keep=subject clinnum clinname lname fname);
   * Define the attributes for variables on lookup table;
   if 0 then set advrpt.clinicnames; Callout 1
   * Create and load the hash object;
   declare hash lookup(dataset: 'advrpt.clinicnames', Callout 2
                       hashexp: 8); Callout 3
   lookup.defineKey('clinnum'), Callout 4
   lookup.defineData('clinname'), Callout 5
   lookup.defineDone(); 
   * Read the primary data;
   do until(done); Callout 6
      set advrpt.demog(keep=subject clinnum lname fname) Callout 7
           end=done; Callout 8
      if lookup.find() = 0 then output hashnames; Callout 9
   end;
   stop; Callout 10
   run;

Callout 1 The attributes for the variables that are to be retrieved from the hash object need to be established on the PDV. This SET statement is used only during DATA step compilation to determine variable attributes.

Callout 2 The HASH object itself is defined, named, and loaded using the DECLARE statement. The attributes of the object are then defined using the DEFINEKEY, DEFINEDATA, and DEFINEDONE methods. This hash object has been named LOOKUP, and has been loaded with the data that contains the values (CLINNAME) that we want to be able to look up.

Callout 3 The number of bins (28=256) used by the hash table is specified.

Callout 4 The DEFINEKEY method is used to list one or more key variables whose values are used to index the LOOKUP hash table.

Callout 5 The DEFINEDATA method lists those variables to be added to the LOOKUP hash table. The values of these variables can be retrieved using the FIND method. Although not needed here, you may want to include the key variables here as well if they are also to be retrieved.

Callout 6 A DO UNTIL loop is used to cycle through the observations in the primary data set.

Callout 7 An observation is read and its value for CLINNUM is loaded into the PDV. Since CLINNUM is a key variable for the hash object Callout 4, its value will automatically be used by the FIND
method Callout 9 when retrieving the value of the clinic name.

Callout 8 The temporary variable DONE will be set to 1 when the last observation is read from ADVRPT.DEMOG.

Callout 9 The clinic name, which is being held in the hash table, is retrieved through the use of the FIND method. This method returns its success (0) or failure. When the retrieval is successful, we write out the resulting observation (with the clinic name defined).

Callout 10 Since the SET statement is inside a loop, we should always stop the implied DATA step loop manually.

data hashnames(keep=subject clinnum clinname lname fname);
   if _n_= 1 then do;
      * Define the attributes for variables on lookup table;
      if 0 then set advrpt.clinicnames; Callout 1
      * Create and load the hash object;
      declare hash lookup(dataset: 'advrpt.clinicnames', Callout 2
                          hashexp: 8); Callout 3
      lookup.defineKey('clinnum'), Callout 4
      lookup.defineData('clinname'), Callout 5
      lookup.defineDone(); 
   end;
   * Read the primary data;
   set advrpt.demog(keep=subject clinnum lname fname); Callout 7
   if lookup.find() = 0 then output hashnames; Callout 9
run;

The DATA step shown above is commonly coded using something similar to the following simpler step that is shown here. The definition and loading of the hash object is done inside a DO block that is executed only once, and the SET statement that reads the ADVRPT.DEMOG data set Callout 7 is not within a DO UNTIL loop (thus eliminating the need to include a STOP statement Callout 10). From a performance perspective, it is valuable to understand the difference between these two DATA steps. While the code used in the former step is more complex, it will probably process faster than the code shown here. This performance advantage will be more apparent as the size of the data set ADVRPT.DEMOG increases.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset