Chapter 7: An Example

Problem

The Hash Package

Four Steps to Creation

Lookup and Retrieval

Problem Revisited

Threaded Read

Parameterized Threaded Read

Problem

Suppose that the company is going to institute a new sales commission plan for its sales reps. Commission will be based on product category and month-to-date sales by the rep. You have to provide the costs of the new plan based on one month’s transactions. You might write out the problem as follows:

For each transaction, you need the following information:

•   Get the current month-to-date sales for the rep.

•   Determine the appropriate commission factor.

•   Calculate the commission for this transaction.

•   Update the rep’s month-to-date sales.

You write a skeleton DS2 program to implement:

proc ds2;

data commissions;

method run();

read_transaction_record();

getMonth_To_Date_Sales();

getCommission_Factor();

Calculate_Commission();

updateMonth_To_Date_Sales();

end;

enddata;

run;

The method run clearly states the main program logic. However, you are far from finished. Here are some relevant details about sales activity:

•   There are about 100 million transactions per month.

•   There are about 10,000 sales reps.

•   The proposed commission structure is as follows:

   1% up to $25,000 of sales for category 1 products, 1.5% of any additional sales.

   2% up to $10,000 of sales for category 2 products, 3% of any additional sales.

•   The category is based on the status of the sales rep (inside or outside sales).

•   Transaction data are stored in an SQL database. For testing, you have a subset with one month of transactions.

Because you have to keep a running total of sales amounts for each of the 10,000 reps, you use an in-memory data structure—a hash table. A hash table provides efficient processing and a simple one-pass-over-the-data approach.

Here is the whole program1:

proc ds2;

 data _null_;                                  

 declare package hash h_mtdSales(8,'','A','','','');                 

 declare decimal(10,2) mtdSales;

 declare decimal(10,2) mtdCommission;

 declare float commissionFactor;

 declare float commissionAmt;

 declare integer rc;

method init();                                                       

  h_mtdSales.keys([salesID saleCat]);

  h_mtdSales.data([salesID saleCat mtdSales mtdCommission]);

  h_mtdSales.defineDone();

 end;

 method getmtdSales();                                               

  rc = h_mtdSales.find();

  if rc <> 0 then

   do;

    mtdSales      = 0;

    commissionAmt = 0;

    mtdCommission = 0;

    h_mtdSales.add();

   end;

 end;

method getCommission_Factor();                                       

  if saleCat = 1 then

  do;

    if mtdSales < 700000

       then commissionFactor = 0.02;

       else commissionFactor = 0.03;

end;

  else if saleCat = 2 then

   do;

    if mtdSales < 890000

       then commissionFactor = 0.01;

       else commissionFactor = 0.015;

   end;

 end;

 method Calculate_Commission();                                      

  commissionAmt = saleAmt * commissionFactor;

 end;

 method updatemtdSales();                                            

  mtdSales = mtdSales + saleAmt;

  mtdCommission = mtdCommission + commissionAmt;

  h_mtdSales.replace();

 end;

 method run();                                                       

  set SalesDEC;

  getmtdSales();

  getCommission_Factor();

  Calculate_Commission();

  updatemtdSales();

 end;

 method term();                                                      

   h_mtdSales.output('commissions');

 end;

enddata;

run;

quit;

   This is a data _null_ program, even though you want a table of commissions. The hash table manages the in-memory data, and then it writes the contents to disk when it is done.

   Global variables are declared. You declare and instantiate a hash object, but you are not associating it with a data set. The hash table is updated in code and it is sorted internally using the A parameter. Variables for the commission are declared and are in multiple methods.

   The hash key and hash data elements are defined. Because a hash object can have its key and data defined only one time, this is put into an init method because method init() is called only one time. In a DATA step program, these statements are enclosed in an IF _N_ = 1 block. For users who are familiar with the DATA step hash object, you see that variables are used as arguments rather than strings with variable names. In DS2, both ways (variable as argument, string with variable name) are possible. This hash object is keyed on the sales rep (salesID) and the sale category (saleCat). Data elements are salesID, saleCat, mtdSales, and mtdCommission.

   The month-to-date sales are retrieved. The hash object uses the current value of salesID and saleCat to locate the record in the hash table h_mtdSales.find(). If the entry is found, the values for mtdSales and mtdCommission are moved from the hash table to the PDV, overwriting the values in the PDV. If the salesID and saleCat are not found (if rc <> 0 then), the commission variables are all set to zero and a hash table is updated. The next time you search for this salesID and saleCat, an entry will be found and the zero values will be returned. If you were sequentially processing a sorted table, FIRST. processing would be used. With the hash object, there is no need to sort the data and you can still correctly initialize the first record for salesID and saleCat.

   The commission factor is determined. A simple if then else is used, based on saleCat and mtdSales. Remember, mtdSales was just retrieved from the hash table.

   The commission on the current sale is calculated. The commission factor is updated. You apply the factor to the current sale.

   The running totals of sales and commission are incremented. After you increment the two running totals, you update the hash table h_mtdSales.replace(). The next time that salesID and saleCat are accessed in the hash table, the new values will be available.

   Each method is invoked. In your pseudo code, you implied a method. Because the run method iterates over all of the rows in the input data set, you need to have an explicit set statement. Each method is invoked for each row in the input data set.

   The term method is automatically called when there are no more records in the input data set. Our hash table has kept running totals for each salesID and saleCat. All rows have been processed and the required commission table is in the hash object. You use the output method to write a data set. When you declared the hash object, you specified that you wanted it sorted by the hash key, so the output table is sorted. In addition, you included salesID and saleCat in the data elements, so they are written to the output data set.

The modular design of DS2 facilitates the transition from the logical flow of pseudo code to a working program.

The Hash Package

Hashing is an effective and efficient way to help you solve problems. The original SAS HASH object, first available in SAS®9, introduced hashing and OOP dot notation to DATA step programmers. Since then, many SAS Global Forum and regional user group conference papers have been written about the usage of the HASH object.

Four Steps to Creation

To create a SAS HASH object, there are four basic steps:

1.   Declare the object.

2.   Define the object keys.

3.   Define the object data.

4.   Load data into the object.

In this book, there is a brief description of how to do each of these steps in DS2. Please see the SAS® 9.4 DS2 Language Reference, Fifth Edition for all of the details. Although the HASH object works fundamentally the same way in both the DATA step and DS2, there is far more flexibility in DS2.

Declare a Hash Package

There are four ways to declare a package. In all four ways, you need to specify the keyword package.

Method Call

The hash package is instantiated with no arguments. Method calls are used to define keys and load data.

DECLARE PACKAGE hash h_mtdSales();

Partial Key and Data

The keys and data variables are defined by method calls. Optional parameters that provide initialization data can be specified either in the DECLARE PACKAGE statement above, in the _NEW_ operator, by method calls, or by a combination of any of these. If a parameter value is not going to be passed in, a placeholder is necessary. A single defineDone method call completes the definition. The parameters within the parentheses are positional. Here is the order in which they are specified:

DECLARE PACKAGE hash h_mtdSales (8, '', 'A','','','');

Here is a brief explanation of each parameter within the parentheses:

•   hashexp is the number of buckets for the hash table.

•   datasource is a table name or FedSQL query.

•   ordered is the internal order of the keys (values are A for ascending or D for descending).

•   duplicate specifies how to handle duplicate keys.

•   suminc is a variable to hold a count of hash keys.

•   multidata specifies whether duplicate keys are allowed.

Complete Key and Data

The keys and data variables are defined in the DECLARE PACKAGE statement, which indicates that the package instance should be created as completely defined. No additional initialization data can be specified by subsequent method calls. The parameters within the parentheses are positional.

DECLARE PACKAGE hash h_mtdSales ([salesID saleCat], [salesID saleCat mtdSales mtdCommission], 8, 'SalesDEC’, ‘A’, ‘’, ‘’, ‘’);

•   (hash keys, hash data, hashexp, datasource, ordered, duplicate, suminc, multidata)

Complete or Partial Key

The keys and data variables are defined in the DECLARE PACKAGE statement, which indicates that the package instance should be created as completely defined. No additional initialization data can be specified by subsequent method calls.

DECLARE PACKAGE hash h_mtdSales ([salesID saleCat], 8, 'SalesDEC’, ‘A’, ‘’, ‘’, ‘’);

•   (hash keys, hashexp, datasource, ordered, duplicate, suminc, multidata)

This example uses a partial declaration:

DECLARE PACKAGE hash h_mtdSales (8, '', 'A','','','');

With positional parameters, if a placeholder is passed in (e.g., ''), then the default value for the parameter is used. If trailing parameters are not to be explicitly set, then they can be left off. The partial declaration could also be written as follows:

DECLARE PACKAGE hash h_mtdSales (8, '', 'A');

Define the Keys

If you use either of the complete approaches, keys are defined as part of the package declaration. In a partial declaration, you can use method calls to define the keys. There are two methods to define keys. definekey() takes a single character parameter with the name of the key. keys() takes a list of variables enclosed in square brackets. All of the following are equal:

h_mtdSales.keys([salesID saleCat]);

h_mtdSales.definekey(‘salesID’);

h_mtdSales.definekey(‘saleCat’);

However, you do need two calls for definekey().

Define the Data

If you use either of the complete approaches, data are defined as part of the package declaration. In a partial declaration, you can use method calls to define the data. As with defining keys, there are two methods to define data—defineData() and data(). All of the following are equal:

h_mtdSales.data([salesID saleCat mtdSales mtdCommission]);

h_mtdSales.defineData(‘salesID’);

h_mtdSales.defineData(‘saleCat’);

h_mtdSales.defineData(‘mtdSales’);

h_mtdSales.defineData(‘mtdCommission’);

Complete the Declaration

Although the following step does not add any more information to the hash package, it is a required step for a partial declaration:

h_mtdSales.defineDone();

Add Data to the Hash Package

You can add data using the ADD() method to add the initial record to the hash table. You can use the REPLACE() method to update the record. In many applications, the data are loaded into the hash table as part of the hash declaration:

h_mtdSales.ADD();

h_mtdSales.REPLACE();

With the ADD() and REPLACE() methods, the hash package takes the values of the keys and data variables from the PDV and copies them into the hash table.

Lookup and Retrieval

Retrieving an entry from the hash table is accomplished using the find() method:

rc = h_mtdSales.find();

The find() method uses the current values of the key fields in the PDV. It searches for them in the hash table. If a key is found, the values of the data variables are copied from the hash table to the variables in the PDV, overwriting the existing values. The return code from the method call specifies whether the key was found. A return code of zero specifies it was found. A nonzero return code specifies it was not. There are two overloaded versions of the find() method:

find([key list]);

find([key list], [data list]);

In the first find() method, a set of variables (or hardcoded values) are passed in and searched for. If they are found, the data variables in the PDV used to define the hash table are updated. This version can be used in cases where lookup keys are coming from multiple data sources and the column names do not match the column names that were used to define the key. In the second find() method, a set of key columns are passed in and searched for. In addition, a set of data columns are passed in and searched for—the data values in the hash table update these data columns rather than the columns in the PDV used to define the data. In this version, the columns that were used to define the hash table are not overwritten.

If you need only to determine whether the key value exists in the hash table, the check() method can be used. This method returns a zero if the key is found and a nonzero if it is not. The data values in the PDV are not updated.

Problem Revisited

Using the HASH object gives you an efficient and simple way to process the input data. However, the biggest bottleneck is accessing the data. Even in a big data world, 100 million rows is not trivial. What can you do to improve data access?

Threaded Read

The first change is to move from reading the SAS table in the method run() to reading from a thread instead. Let’s look at the thread code:

proc ds2;

thread getSalesRow  /overwrite=yes;

   method run();

      set SalesDec;

   end;

endthread;

run;

The set statement is moved into a thread. The data program needs two changes. You need one new line to declare the thread and the set statement in method run() to read from the thread:

declare thread getSalesRow getSalesRow;

set from getSalesRow threads=4;

The set statement now tells DS2 to use four threads to read the data. With threaded reads, some experimentation is required to determine how many (if any) threads provide the best performance, given your hardware.

Barring a disastrous degradation of performance2, moving to threaded access has some important value—you can experiment with different data access methods. For example, you can pass FedSQL statements to the back-end database without changing the program because data access is hidden from the program. You can develop the program with test data while another team in parallel tunes the query!

At its simplest, you can change the thread as follows:

proc ds2;

 thread getSalesRow  /overwrite=yes;

   method run();

      set {select * from SalesDec};

   end;

endthread;

run;

quit;

Parameterized Threaded Read

Threads can be parameterized to make them more general-purpose. You can combine the thread parameters with the sqlstmt package to create a parameterized query:

proc ds2;

thread getSalesRow (timestamp inFromDate, timestamp inToDate)  / overwrite=yes;

    declare timestamp FromDate ToDate;                               

    declare char(6) salesID;

    declare char(1) saleCat;

    declare decimal(7,2) saleAmt;

   dcl package sqlstmt stmt1;                                        

   method init();

      FromDate = inFromDate;

      ToDate   = inToDate;

      stmt1 = _new_ sqlstmt ('select salesID, saleCat, saleAmt from salesDEC

 where saleDate between ? and ? ', [FromDate ToDate]);                 ❹

      stmt1.execute();                                               

      stmt1.bindresults([salesID saleCat saleAmt]);                  

   end;

   method run();

      do while (stmt1.fetch() = 0);                                  

        output;

      end;

   end;

endthread;

run;

quit;

   Two parameters are passed to the thread: inFromDate and inToDate. You can run the query against a date range. Because the back-end database is SQL Server, the timestamp data type is passed.

   Global variables are passed back to the calling method. The parameters in the sqlstmt instantiation must be global.

   A sqlstmt package is declared but not instantiated.

   Two date parameters are instantiated.

   The sqlstmt package is sent to the back-end database.

   You need to match the columns returned from the query with the DS2 thread global variables. Ensure that the number and types of variables in the bindResults() method match the columns returned.

   The fetch() method retrieves one row at a time. A return code of zero indicates that a row was retrieved.

With this new thread, you need to make a few minor changes to the program. Here, only the changes with brief explanations are shown:

drop FromDate ToDate;                                                

 method init();

  h_mtdSales.keys([salesID saleCat]);

  h_mtdSales.data([salesID saleCat mtdSales mtdCommission]);

  h_mtdSales.defineDone();

  set sqldb.dateParms;                                               

  getSalesRow.setParms(FromDate, ToDate);                     

 end;

method run();

  set from getSalesRow;                                             

  getmtdSales();

  getCommission_Factor();

  Calculate_Commission();

  updatemtdSales();

 end;

   The parameterized thread required global variables for the sqlstmt parameters. Because they are not needed in any result sets, the drop command is used.

   Because you want to avoid changing the program when there are changes to data access, the date parameters are put into a table and read in using method init().

   The date parameters are passed from the table to the thread.

   The thread is read. There is no THREADS= option. In this SQL Server implementation, you cannot have a multi-threaded parameterized sqlstmt query.

You have moved from basic pseudo code to a DS2 implementation relatively easily. With the ability to create parameterized FedSQL queries, you can take advantage of SQL back-end processing to join multiple tables, apply WHERE clauses, and bring the results back to your DS2 program. By putting the FedSQL queries into a thread, and then compiling and saving the thread, you hide all of the details of data access from the program.

Here is the complete program:

proc sql;

drop table commissions;

quit;

proc ds2;

thread getSalesRow(timestamp inFromDate, timestamp inToDate)  /overwrite=yes;

    declare timestamp FromDate ToDate;

    declare char(6) salesID;

    declare char(1) saleCat;

    declare decimal(7,2) saleAmt;

   dcl package sqlstmt stmt1;

   method init();

      FromDate = inFromDate;

      ToDate   = inToDate;

      stmt1 = _new_ sqlstmt ('select salesID, saleCat, saleAmt from sqldb.salesDECdata where saleDate between ? and ? ', [FromDate ToDate]);

      stmt1.execute();

      stmt1.bindresults([salesID saleCat saleAmt]);

   end;

   method run();

      do while (stmt1.fetch() = 0);

        output;

      end;

   end;

endthread;

run;

quit;

proc ds2;

data _null_;

 declare thread getSalesRow getSalesRow;

 declare package hash h_mtdSales(8 '', 'A','','','');

 declare decimal mtdSales        having format comma18.2;

 declare decimal mtdCommission   having format comma18.2;

 declare float commissionFactor;

 declare float commissionAmt;

 declare integer rc;

 drop FromDate ToDate;

 method init();

  h_mtdSales.keys([salesID saleCat]);

  h_mtdSales.data([salesID saleCat mtdSales mtdCommission]);

  h_mtdSales.defineDone();

  set sqldb.dateParms;

  getSalesRow.setParms(FromDate, ToDate);

 end;

 method getmtdSales();

  rc = h_mtdSales.find();

  if rc <> 0 then

   do;

    mtdSales      = 0;

    commissionAmt = 0;

    mtdCommission = 0;

    h_mtdSales.add();

   end;

 end;

 method getCommission_Factor();

  if saleCat = 1 then

   do;

    if mtdSales < 700000

       then commissionFactor = 0.02;

       else commissionFactor = 0.03;

   end;

  else if saleCat = 2 then

   do;

    if mtdSales < 890000

       then commissionFactor = 0.01;

       else commissionFactor = 0.015;

   end;

 end;

 method Calculate_Commission();

  commissionAmt = saleAmt * commissionFactor;

 end;

 method updatemtdSales();

  mtdSales = mtdSales + saleAmt;

  mtdCommission = mtdCommission + commissionAmt;

  h_mtdSales.replace();

 end;

 method run();

  set from getSalesRow;

  getmtdSales();

  getCommission_Factor();

  Calculate_Commission();

  updatemtdSales();

 end;

 method term();

   h_mtdSales.output('commissions');

 end;

enddata;

run;

quit;

1 The program in this chapter is an adaption of a great example (Kaufman 2014).

2 Testing for this book showed no real difference in performance between a threaded read and non-threaded read of a SAS table with 109,120,000 rows.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset