Chapter 8: Data Sources

Both SET and MERGE operate on tables or views. The tables and views can come from a variety of sources, but they are always tables or views that have already been defined. With data set options, you can control which columns should be read in. With a WHERE clause, you can control which rows should be read in.

DS2 can do the same things¹. Furthermore, DS2 has additional data sources:

1. FedSQL queries

2. threads

On the other hand, DS2 does not have an INPUT statement, so raw data cannot be directly read. But, if you want to split hairs, an ODBC text driver DSN can be used to process the data using the SET statement.

Sample Tables

Here are some sample tables adapted from the SAS 9.4 DS2 Language Reference Manual:

proc ds2;

data animals (overwrite=yes);

dcl varchar(10) common animal;

method init();

common='a'; animal='Ant1'; output;

common='a'; animal='Ant2'; output;

common='a'; animal='Ant3'; output;

common='b'; animal='Bird1'; output;

common='b'; animal='Bird2'; output;

common='c'; animal='Cat'; output;

common='d'; animal='Dog'; output;

common='e'; animal='Eagle'; output;

common='f'; animal='Frog'; output;

end;

enddata;

run;

data plants (overwrite=yes);

dcl varchar(10) common plant;

method init();

common='a'; plant='Apple1'; output;

common='a'; plant='Apple2'; output;

common='a'; plant='Apple3'; output;

common='b'; plant='Banana1'; output;

common='b'; plant='Banana2'; output;

common='b'; plant='Banana3'; output;

common='c'; plant='Coconut1'; output;

common='c'; plant='Coconut2'; output;

common='c'; plant='Coconut3'; output;

common='d'; plant='Dewberry'; output;

common='e'; plant='Eggplant'; output;

common='f'; plant='Fig'; output;

common='g'; plant='Avocado1'; output;

common='g'; plant='Avocado2'; output;

common='g'; plant='Avocado3'; output;

end;

enddata;

run;

In both data sets, there is a different number of rows for certain variables. For example, in the animals data set, there are two B rows. In the plants data set, there are three B rows. In animals, there is one C row, in plants there are 3. There are three G rows in plants, but no G rows in animal s.

SET

SET Data Set Options

DS2 provides a subset of data set options available in the DATA step. Here are some of the important options:

• KEEP=

• DROP=

• RENAME=

• IN=

The WHERE= option (as in the WHERE clause) is not in this list. In DS2, you can replace the table name in a SET statement with a FedSQL query in which you can embed a WHERE clause.

SET with No BY Statement

When you process a single table with a SET statement without a BY statement, DS2 works the same as the DATA step—it simply reads the data row by row. The results are the same in both.

SET with BY Statement

When you add a BY statement to the program, you can get different results. Looking at the plants data set, you see that it is sorted first by common, then by plant. Here is a simple DATA step and the results:

data bothSetByDS;

set plants (in=inP);

by common;

run;

Figure 8.1: Work. bothSetByDS Result Set

The same program in DS2 gets different results:

proc ds2;

data bothSetBy (overwrite=yes) ;

method run();

set plants (in=inP) ;

by common;

end;

enddata;

run;

quit;

Figure 8.2: Work. bothSetBy Result Set

The order of the rows has changed! Although the data were sorted, in DS2, the BY statement explicitly sorts the table, even though the table was already sorted. Moreover, an SQL sort was performed. Remember, with an SQL sort, the order of the result set is not guaranteed. In this case, the rows are in order by common, but within common, the order is not guaranteed.

Let’s look further into this. The plants table was appended to itself until there were 153,600,000 rows. This table was then sorted by common and plant:

Figure 8.3: Work.Bigplants Properties

The DATA step and the DS2 program are run against the data.

Here is the DATA step log:

NOTE: There were 153600000 observations read from the data set WORK.BIGPLANTS.

NOTE: The data set WORK.BOTHSETBYBIGDS has 153600000 observations and 2 variables.

NOTE: DATA statement used (Total process time):

real time 29.56 seconds

user cpu time 10.35 seconds

system cpu time 1.56 seconds

memory 940.90k

OS Memory 37624.00k

Timestamp 11/30/2015 04:30:02 PM

Step Count 20737 Switch Count 0

Here is the DS2 log:

NOTE: Execution succeeded. 153600000 rows affected

NOTE: PROCEDURE DS2 used (Total process time):

real time 1:11.83

user cpu time 1:31.45

system cpu time 10.00 seconds

memory 1686806.51k

OS Memory 1720160.00k

Timestamp 11/30/2015 04:29:32 PM

Step Count 20736 Switch Count 2

DS2 has a longer elapsed time, uses more CPU, and uses considerably more memory. Because you get results back in a different order than the input table (which was already sorted), and the fact that DS2 uses far more resources, you know that DS2 issued an SQL SELECT statement with an ORDER BY clause to return the data.

What does this mean? First, when a BY GROUP is used for processing, you cannot assume the rows will be returned in the BY column order. In this example, the table was sorted by common and plant, but the BY statement specified only common. You cannot assume that the rows will come back to you in plant order as well. Second, if you are using BY GROUP processing against large tables in DS2, you are probably going to take a performance hit, especially if your tables were already sorted. The corollary is that you can now do BY GROUP processing without sorting the input tables. This can lead to an overall performance improvement if the data need to be sorted before the BY GROUP processing and again, after. Moreover, by removing the need to sort tables, you simplify the overall flow. Each step that is removed in the overall flow is one possible source of an error removed.

SET with FedSQL as a Data Source

Instead of a table name, the SET statement can reference a FedSQL query. The following two statements produce the same results:

SET sqldb.salesDecDATA;

SET {select * from sqldb.salesDecDATA};

In this case, however, there is no need to make this trivial change. But, if any valid FedSQL query can be embedded in a SET statement, you could write something like the following code:

SET {select t.salesID, t.saleDate, c.saleCat, t.saleAmt

from allSalesTrans as t left join workLocation as c

on t.salesID = c.employeeID

and t.saleDate between t.startTime and t.endTime

where t.saleDate between '2014-12-01' and '2014-12-31'

order by t.salesID, t.saleDate

};

Note the use of braces surrounding the SELECT statement—these are required. Rather than running one SQL step to generate a table that will then process, you can generate and process the table in one pass, eliminating one step. Remember, each step removed in the overall flow is one possible source of an error removed.

If you use a FedSQL query instead of a table name, you no longer have access to the data set options. This makes sense because you are no longer dealing with a table. This can be a problem if you do BY GROUP processing and use the IN= option because you no longer have this option.

Merge

As with the SET statement and BY processing, with the MERGE statement and BY processing, you are not guaranteed the order of the rows returned. Using the previous data, here is an example DATA step and DS2 merge program:

data mrgDS; ➊

merge plants (in=inP)

animals (in=inA);

by common; ➋

run;

proc ds2;

data mrgDS2 (overwrite=yes);

method run(); ➌

merge plants (in=inP)

animals (in=inA); ➍

by common;

end;

enddata;

run;

quit;

Figure 8.4: Work.mrgDS Result Set

Figure 8.5: Work.mrgDS2 Result Set

➊ Rows are in the order that you expect.

➋ When there is an unequal number of BY variables, the values from the table with fewer values are not retained. Here, note that in the animals table, there are two entries for B (Bird1 and Bird2). In the plants table, there are three entries (Banana1,Banana2, and Banana3). The second row from the animals table (Bird2) is retained to match the third row from the plants table (Banana3).

➌ Neither table is retrieved in its sorted order.

➍ When there is an unequal number of BY variables, the values from the table with fewer values are not retained. Here, note that in the animals table, there are two entries for B (Bird1 and Bird2). In the plants table, there are three entries (Banana1,Banana2, and Banana3). The third row from the plants table (Banana3) has a missing entry for plant.

The implication of the fourth point is that you cannot do a one-to-many merge without some extra work. If the animals table was changed to have only one row for each value of common, the DS2 program would produce the following results:

Figure 8.6: Work.mrgDS2 Result Set Using Revised Animals Table

There are subtle but important differences in how the DATA step and the DS2 program processes the MERGE statement that you will need to deal with. With some added logic, either approach can make its output match that of the other.

Merge with FedSQL as a Data Source

The SET statement can have a FedSQL query instead of a table name. This is not possible with a MERGE statement (contrary to what the documentation states). The MERGE statement must include a table name. If you want to use a FedSQL query instead of a table name to perform a merge, you can do the merge (table join) in the FedSQL query as part of the SET statement, which means a MERGE statement is not needed and avoided.

Remember, if a FedSQL query is used as the data source, the data set option IN= is not available².

Threads

A DS2 program can run in two different ways—as a program or as a thread. With threaded processing, sections of the code can run concurrently (that is, in its own thread). When running in a thread, input data can come only from tables or FedSQL queries, not from other threads. Output data are returned to the DS2 program that started the thread. In essence, the thread reads the data, does any necessary subsetting and evaluations, and then returns the results to the calling program.

A thread is defined by the THREAD statement and ended by the ENDTHREAD statement. To use a thread, an instance of a thread is created using the DECLARE THREAD statement. To execute the thread, you use the SET FROM statement. In the following example, a data set of 60 million rows is created, and a simple thread named t is created. Then, in a DS2 program, the thread is declared and used.

/* create a data set */

%let lv = 25000;

%let uv = 90000;

proc ds2;

data incomes (overwrite=yes);

declare char(5) name;

declare char(1) cityCode;

declare double income;

declare double min max;

drop min max;

method getRange(integer min, integer max, double u) returns integer;

return(min+floor((1+Max-Min)*u)); /* uniform integer in Min..Max */

end;

method init();

streaminit(98765);

min = &lv;

max = &uv;

end;

method run();

declare integer j rows;

declare double u;

rows = 1E7;

do j = 1 to rows;

u=rand('UNIFORM'); income=getRange(min, max, u); name='John'; citycode='1'; output;

u=rand('UNIFORM'); income=getRange(min, max, u); name='Jane'; citycode='1'; output;

u=rand('UNIFORM'); income=getRange(min, max, u); name='Joe'; citycode='2'; output;

u=rand('UNIFORM'); income=getRange(min, max, u); name='Jan'; citycode='2'; output;

u=rand('UNIFORM'); income=getRange(min, max, u); name='Josh'; citycode='3'; output;

u=rand('UNIFORM'); income=getRange(min, max, u); name='Jill'; citycode='3'; output;

/* The three people to find during mining */

if j=5E5 then do; ❶

name='James'; income=103243; citycode='1'; output;

end;

if j=7E5 then do;

name='Joan'; income=233923; citycode='2'; output;

end;

if j=8E5 then do;

name='Joyce'; income=132443; citycode='3'; output;

end;

enddata;

run;

quit;

/* create the thread */

proc ds2;

thread score /overwrite=yes; ❷

method run();

DCL int accept i;

set incomes;

accept=if citycode ='1' and income > 100000 then 1 ❸

else if citycode='2' and income > 200000 then 1

else if citycode ='3'and income > 120000 then 1

else 0;

if accept then output; ❹

end;

endthread; ❷

run;

quit;

/* use the thread */

proc ds2;

data results /overwrite=yes;

declare thread score score; ❺

method run();

set from score threads=4; ❻

end;

enddata;

run;

quit;

❶ Three rows of the 60 million rows are used to create a filter.

❷ A thread is defined in a thread and endthread block. In this example, the thread is compiled and written to the Work library. In the real world, the thread is written to a permanent library where it can be accessed by many programs.

❸ The filter is applied to find the records that you want to locate.

❹ The records are returned to the calling program.

❺ The thread is declared in the program.

❻ The thread read is initiated with four threads.

Any performance improvement from threaded processing depends on your architecture³.

Thread Methods

Threads have three system-defined methods—init(), run(), and term(). In addition, you can define your own methods within a thread. By moving data access to a thread, you can hide the complexities of data from the main program. In addition, if the underlying data architecture changes, only the thread needs to change, not all the data programs that use it.

The method run() in a thread loops over all rows in the table. Each thread gets a unique set of rows. If there is not an explicit OUTPUT statement anywhere in the thread, each iteration of method run() sends back a row. If there is an OUTPUT statement, rows are returned only when the OUTPUT statement is executed.

Thread Parameters

In the previous example, rows were returned only when income was greater than some hardcoded threshold for each city code. What if you wanted these thresholds to be determined at run time. One option would be to set macro variables that get resolved, but this requires the source for the thread to be available for each run. A better option is to add parameters to the thread, and then use the setParms method to pass the parameters to the thread.

proc ds2;

thread score (double aIncome, double bIncome, double cIncome) / overwrite=yes; ❶

method run();

DCL int accept i;

set incomes;

accept=if citycode='1' and Income > aIncome then 1 ❷

else if citycode='2' and Income > bIncome then 1

else if citycode='3' and Income > cIncome then 1

else 0;

if accept then output;

end;

endthread;

run;

quit;

proc ds2;

data results /overwrite=yes;

declare thread score score; ❹

method init();

score.setParms(100000, 200000, 120000); ❸

end;

method run();

set from score threads=4;

end;

enddata;

run;

quit;

❶ A thread that takes three parameters is defined. Unlike methods, threads cannot be overloaded. So, if you want a different signature for each thread, each must have a unique name.

❷ The parameters are used in the filter.

❸ The parameter values are initialized. Here, there are hardcoded values. However, you could read values from a parameter table or use macro variables.

❹ Instead of calling setParms, you can initialize the parameters in a DECLARE statement. You could use macro variables to initialize. By using setParms, you can also read the parameter values from a table.

SQLSTMT Package

Another way to access data is through the SQLSTMT package. This package facilitates the generation of parameterized FedSQL queries. In addition to generating result sets, the SQLSTMT package can be used for other SQL actions (for example, you can create, modify, or drop tables).

To use a package, you have to declare it. You can instantiate the package when you declare it or with the _NEW_ operator.

dcl package sqlstmt stmt1;

This declares the package but does not instantiate it.

dcl package sqlstmt stmt1(select * from incomes);

This declares the package and instantiates it with a select clause.

dcl package sqlstmt stmt1('select * from incomes where
  ((citycode = ''1'' and Income > ?) or
  (citycode = ''2'' and Income > ?) or
  (citycode = ''3'' and Income > ?))', [aIncome bIncome cIncome]);

This declares the package and instantiates it with a select clause that has parameters. The parameter placeholders in the code are represented by ?. A FedSQL string lists the global variables that will be substituted for the placeholders.

stmt1 = sqlstmt ('select * from incomes where
  ((citycode = ''1'' and Income > ?) or
  (citycode = ''2'' and Income > ?) or
  (citycode = ''3'' and Income > ?))', [aIncome bIncome cIncome]);

This instantiates a package that was previously declared with a select clause that has parameters. The parameter placeholders in the code are represented by ?. A FedSQL string lists the global variables that will be substituted for the placeholders.

SQLSTMT Binding

The process of replacing the placeholders in the previous example is called binding. The parameters are positional. So, the first parameter replaces the first ?, and so on. The program variables need to be bound to the columns returned in the result set.

stmt1.bindresults([name cityCode income]);

The first column in the result set is bound to the global variable name. The second column is bound to cityCode, and the third to income. The number of global variables in the list must match the number of columns returned (which can make the select * dangerous). If the data type in the result set is not the same as the data type of the global variable, the result set value is coerced to be the global variable’s data type.

SQLSTMT – Returning the Rows

To get the result set back to the DS2 program, you first execute the statement, and then invoke a series of fetches to retrieve the rows.

method run();

stmt1.bindresults([name cityCode income]); ❶

stmt1.execute(); ❷

do while (stmt1.fetch() = 0); ❸

put _all_;

end;

enddata;

run;

quit;

❶ The columns in the result set are bound to the global variables.

❷ The statement is passed to the data provider to execute.

❸ The fetch method returns a value of zero if there is another row to return from the data source. This loops until you have no more rows in the result set.

¹ The MERGE statement is provided in the third maintenance release for SAS 9.4.

² Will this be a problem for you? Almost every time I merge tables in a DATA set, I use the IN= option for quality control. I like to create output tables that let me know which BY GROUP values were in all tables and in which table they were in. When key values are missing in a table, it sometimes indicates a data quality problem.

³ On my laptop (4 cores/8 logical processors, 24GB memory) using a SAS table, performance ranged from about 4.6/5.1 (elapsed time/CPU time) seconds for one thread to 1.5/10.4 seconds for eight threads.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 8: Data Sources

Create new playlist

Sign In

Sign Up