Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter `5` Working with Data Sets

`5.1 Automating the COMPARE Process`

`5.2 Reordering Variables on the PDV`

`5.3 Building and Maintaining Indexes`

`5.3.1 Introduction to Indexing`

`5.3.2 Creating Simple Indexes`

`5.3.3 Creating Composite Indexes`

`5.3.4 Using the IDXWHERE and IDXNAME Options`

`5.3.5 Index Caveats and Considerations`

`5.4 Protecting Passwords`

`5.4.2 Protecting Database Passwords`

`5.5 Deleting Data Sets`

`5.6 Renaming Data Sets`

`5.6.1 Using the RENAME Function`

While we are usually most interested in the data values and the analysis results that they generate, there are a number of operations that take place at the data table level. Sometimes we need to be able to work with the entire table and not just at the observation level. Fortunately SAS provides us with a number of tools for this type of processing.

5.1 Automating the COMPARE Process

While PROC COMPARE does a good job of comparing data sets, visual inspection of the output is tedious. Fortunately the COMPARE procedure has sufficient options so that its output can be utilized in an automated process.

The primary PROC COMPARE options that you will need to use include:

DATA	the base data set
COMPARE	the comparison data set
OUT	the new data set with the differences
OUTBASE	include the base data set observation
OUTCOMP	include the compare data set observation
OUTNOEQUAL	only write differences

It is of primary importance that the combination of the BY and ID statements include sufficient variables to form a unique key. If a unique key is not formed the COMPARE procedure will be unable to keep the two data sets synchronized. This implies that you may need to do a pre-analysis to check for duplicates in the key variables.

The first step in the automated comparison process is to create a data set containing only those rows that have differences. These rows are written to a data set using the OUTNOEQUAL option .

proc compare

      data=lab_chem

      compare=lab_chem2

      out=cmpr

      outbase outcomp

      noprint

      outnoequal ;

   id subject visit labdt;

   run;

The DATA= and COMPARE= options determine which two data sets are to be compared.

A data set is created using the OUT= option, which will contain the observations with the differences.

The data set of the differences should contain the original observations from both of the data sets that are being compared.

Since the process is being automated, there is no need to create any printed output.

Only those observations that contain differences need to be written to the data set.

For this example changes have been artificially inserted in the values of SODIUM for two observations in the data set LAB_CHEM2 (see the sample code for the full program).

Because the OUTNOEQUAL option has been specified, when a difference is detected, the OUTBASE OUTCOMP options cause the entire observation from each of the two incoming data sets to be written to the data set named in the OUT= option. This data set has the additional variables _TYPE_ and _OBS_ to help identify the original observations.

5.1 Automated Comparisons

Obs with differences

Obs  _TYPE_  _OBS_  SUBJECT  VISIT      LABDT  SODIUM  POTASSIUM  CHLORIDE

 1   BASE       2     200      2   07/13/2006   144       4.9       106

 2   COMPARE    2     200      2   07/13/2006   1A4       4.9       106

 3   BASE      20     202      1   07/07/2006   139       4.8       96

 4   COMPARE   20     202      1   07/07/2006   1A9       4.8       96

_TYPE_ refers to the two data sets being compared:

BASE the data set identified with the DATA= option
COMPARE the data set identified in the COMPARE= option

Each pair of rows in the data set WORK.CMPR has at least one difference; however, searching these pairs of rows for the differences is only incrementally easier than searching the original output. We need to further isolate the individual values that are different. Since we know that there is at least one difference within each BASE/COMPARE pair of observations, we need to examine each pair of values for each variable individually within this observation pair. This can be more easily accomplished if we first transpose the data so that each value pair can be isolated.

PROC TRANSPOSE is used to create a data set with one observation per pair of variables within each of the two observations. Again the BY and ID statements are used to isolate the key variables. The variable _OBS_, which reflects the original observation number, is added to the BY statement to help make the identification process easier.

proc sort data=cmpr;

   by subject visit labdt _obs_;

   run;

proc transpose data=cmpr

                 out=tdiff(drop=_label_

                           rename=(_name_=variable));

   by subject visit labdt _obs_;

   var _numeric_ _character_;

   id _type_;

   run;

The BY statement is used to identify down to the two rows with differences. The ID statement further identifies down to the specific row.

In the VAR statement we need to list all the variables that need to be compared. Since we do not necessarily know the names of the variables we can use variable list abbreviations (see Section 2.6.1). _ALL_ could also have been used here. When the variables to be compared are known (as is technically the case in this example) they can be named explicitly.

The ID variable _TYPE_, which contains ‘BASE’ and ‘COMPARE’, is used to name the two new columns formed in the transformation process. These will be character variables when the variables in the VAR statement are either all character or a mixture of numeric and character. When there is a mixture of variable types, as is the case in this example, a conversion note is written to the LOG.

The data set TDIFF will now have one row for each original variable, and the values from the two original data sets are stored in the variables BASE and COMPARE. By selecting for unequal BASE and COMPARE values, we can determine the differences that have been detected in the comparison process.

title3 'After Transpose';

proc print data=tdiff(where=(variable ne '_TYPE_'

                             & base ne compare));

   run;

The variables BASE and COMPARE now contain the original values of the variable named VARIABLE, which was renamed from _NAME_.

5.1 Automated Comparisons

Obs with differences

After Transpose

Obs  SUBJECT  VISIT       LABDT  _OBS_  variable  BASE  COMPARE

  6    200      2    07/13/2006     2    SODIUM   144     1A4

 14    202      1    07/07/2006    20    SODIUM   139     1A9

In this example the differences are only printed; however, they could have been easily stored in a data set for further processing.

MORE INFORMATION

A macro that further generalizes the comparison process can be found in program E5_1b.sas, which is located in the sample programs accompanying this book.

5.2 Reordering Variables on the PDV

The order of the variables on the Program Data Vector, PDV, is generally of no concern to the SAS programmer. Indeed there are no tools in the language that are specifically designed to help us to change the order once it has been established. Although they come up fairly infrequently, there are legitimate occasions that force us to either know or change the order of the variables on the PDV.

Sometimes when we create an EXCEL spreadsheet, the resulting columns need to be in a specific order. PROC EXPORT (see Section 1.2) will use the PDV order of the variables to determine the order of the EXCEL columns. To change this order we need to change the order of the variables going into PROC EXPORT.

The current order of the variables can be seen in a number of ways. Some of the most straightforward of these include:

When PROC PRINT is used without an ID, BY, or VAR statement, the variables are printed in the order of their position.
VIEWTABLE in the Display Manager displays the variables in position order.
PROC CONTENTS displays the position of each variable, and with the VARNUM option (formerly the POSITION option) it will also list the variables in position order.
The COLUMNS window in the Display Manager displays the columns in position order by default.

Within a DATA step the order of the variables on the PDV is determined as the step is compiled. If the step reads a data set, as with a SET statement, the incoming data set is examined and its variables are added to the PDV using the same order. Once the order is determined on the PDV it is fixed and cannot be altered. If we want to control the order we must do so before it is fixed.

Using the VARNUM option in a PROC CONTENTS step reveals, among other attributes, the order of the variables in ADVRPT.DEMOG. The left-most column is the variable or position number.

5.2 Reordering Variables on the PDV

The CONTENTS Procedure

                   Variables in Creation Order

#    Variable    Type    Len    Format    Label

 1    subject     Num       8

 2    clinnum     Char      6              clinic number

 3    lname       Char     10              last name

 4    fname       Char      6              first name

 5    ssn         Char      9              social security number

 6    sex         Char      1              patient sex

 7    dob         Num       8    DATE7.    date of birth

 8    death       Num       8    DATE7.    date of death

 9    race        Char      1              race

10    edu         Num       8              years of education

11    wt          Num       8              weight in pounds

12    ht          Num       8              height in inches

13    symp        Char      2              symptom code

14    death2      Num       8    DATE9.

In the example that follows, only a few of the variables from ADVRPT.DEMOG flow through to the new data set, and the order of these variables is changed by the use of the LENGTH statement.

data demog2(keep=subject lname fname sex ht wt dob symp);

   length lname $10 fname $6 sex $1 symp $2;

   set advrpt.demog(keep=subject lname fname sex edu

                         death ht wt dob symp);

   where death and edu>15;

   run;

The KEEP and DROP statements/options do not change or affect the order of variables on the PDV.

The LENGTH statement is used not only to assign the length attribute to these four variables, but it also adds them to the PDV. Since the LENGTH statement appears before the SET statement these variables and their associated attributes are added to the PDV before any variables or attributes are contributed by the SET statement. Attributes for these variables that have not been specified, for example formats or labels, will be picked up from the metadata of the incoming data set named on the SET statement.

Although the variables DEATH and EDU are not to be included on the new data set, they must be available on the PDV for use by the WHERE statement.

5.2 Reordering Variables on the PDV

Obs   lname    fname   sex   symp   subject       dob    wt   ht

 1    James    Debra    F     05      232     19JUN42   163   63

 2    Manley   Debra    F     05      241     19JAN42   163   63

Notice that while the order has changed for the variables in the LENGTH statement, the order for the remaining variables is the same as it was on the incoming data.

The ARRAY, FORMAT, INFORMAT, RETAIN, and ATTRIB statements can also be used in a similar manner to reorder variables on the PDV. The KEEP, DROP, and RENAME statements cannot be used to change variable order. Generally the recommended choice of statements for reordering the variables is the RETAIN statement as it does not require any other attributes and does not otherwise change behavior of variables that are being brought into the DATA step via the SET or MERGE statements.

retain lname fname sex symp;

In the previous example the LENGTH statement could have been replaced with a RETAIN statement, which does not require additional knowledge of the attributes of the variables that are to be reordered.

proc sql ;

create table demog4 as

  select lname, fname, sex, symp, subject, dob, wt, ht

      from advrpt.demog(keep=subject lname fname sex

                             edu death ht wt dob symp)

        where death and edu>15;

   select *

      from demog4;

quit;

It is also possible to reorder variables using an SQL step. Here the SELECT statement is used to specify the new variable order. Notice that unlike the DATA step a full list of variables must be specified unless you use the asterisk (*) to specify all variables, in which case you will see them in PDV order.

When you read more than one data set in a step, the order of variables is determined to some extent by the order that the data sets are read in. Variables are added to the PDV in the order that they are encountered by the compiler. The order of the variables taken from the first data set seen by the compiler will be written to the PDV first. Variables not already on the PDV will be added to the PDV in the order that they are encountered on subsequent data sets. This means that you may have some control of variable order by controlling the order in which your incoming data sets are first seen.

MORE INFORMATION

Although I do not necessarily recommend their use, some of the variable list abbreviations require you to have specific knowledge of the order of the variables (see Section 2.6.1).

5.3 Building and Maintaining Indexes

It is a bit surprising how few programmers actually take advantage of indexing. True the topic can be a bit complex; after all, a book has been written on the subject (Raithel, 2006). Complex or not you do not need a lot of knowledge to take advantage of them.

Indexes give us the ability to virtually sort a data table without physically sorting it. More than one index can be created for any give data table and with multiple indexes it is possible to effectively sort the data set multiple ways at the same time. Once created the index allows the use of the BY statement and other processing techniques as if the data had been sorted using PROC SORT. The index can also be very helpful in subsetting and merging situations.

MORE INFORMATION

Indexes are used with the KEY= option in a table lookup example in Section 6.6, and discussed relative to the use of the WHERE statement in Section 2.7.

5.3.1 Introduction to Indexing

Indexes provide a search tool that allows the detection and extraction of a data subset. Well-defined indexes can be especially useful in increasing the efficiency of the subsetting process. The highest efficiency gains can be had as the data subset becomes smaller relative to the size of the overall data set. An index that segments the data into subsets that are no more than 10 or 15% of the total data set will tend to provide the most efficiency benefits.

Data set variables are used to define indexes. The selected variables should be chosen so that they maximize the ability to discriminate or break up the data into smaller subsets. Variables that take on only a few levels, such as GENDER or RACE, would probably make poor candidates, while variables such as SUBJECTID or NAME, which take on many more levels relative to the size of the overall data set, would tend to make better index variables. You can also use two or more variables in combination in order to increase the ability of the index to discriminate among data subsets. When the data are also sorted, the variables used to sort the data are also good index variable candidates.

Indexes can be simple (a single variable) or composite (two or more variables), and they can be created in a DATA step, an SQL step, or through PROC DATASETS. Once created, the user can take advantage of indexes in several different ways. Sometimes SAS will even take advantage of available indexes without the user’s knowledge. The system option MSGLEVEL=I will cause index usage notes to be written to the LOG.

There are options available in some statements that will specifically invoke indexes (see Section 6.6). While the user can use indexes simply by including the appropriate BY statement, this is not necessarily the best use of indexes. BY-group processing with the index can be inefficient, especially when the full data set is being processed.

Indexes are named and for a simple index, which consists of a single variable, the name is the same as that variable. For composite indexes, which use two or more variables, a name must be provided (the name must be different from any variables on the data set). The index name is used to identify the index file, but is not used by the user to retrieve the index. Indexes are used by specifying the names of the variables that make up the index (simple or composite).

proc contents data=advrpt.demog;

   run;

The metadata shown by the CONTENTS procedure shows that the INDEXES flag is now set and shows the number of indexes associated with this table.

5.3.1 Metadata for an Indexed Table

The CONTENTS Procedure

Data Set Name   ADVRPT.DEMOG                    Observations        77

Member Type     DATA                            Variables           14

Engine          V9                              Indexes              3

Created         Tue, Sep 22, 2009 10:57:37 AM   Observation Length  96

Alphabetic List of Indexes and Attributes

                  # of

                Unique

#    Index      Values    Variables

1    group          23    sex race edu

2    ssn            76

3    subject        77

The index definitions are also included in the data set’s metadata as can be seen by looking at the listing generated by PROC CONTENTS. The name of each index is shown under the column labeled Index . For composite indexes the variable list is also shown under the column labeled Variables .

When an index has been selected, its use will be mentioned in the LOG.

66   proc print data=advrpt.demog;

67      by sex race edu;

NOTE: An index was selected to execute the BY statement.

      The observations will be returned in index order rather than

      in physical order.  The selected index is for the variable(s):

sex

 race

edu

The indexes are not actually stored in the data set itself. Instead they reside in a separate file. Under Windows the index file has the same name as the data set with which it is associated; however, the extension is different.

5.3.2 Creating Simple Indexes

Indexes can be created using a DATASETS step, an SQL step, or a DATA step. Each of the next three steps creates a simple index. Later in this section these indexes are used in several PROC PRINT steps.

proc datasets lib=advrpt;

   modify demog;

   index create clinnum;

   quit;

The data set receiving the index is named. The data set may be either permanent or temporary.

The index to be created is named. For simple indexes the name is the same as the single variable used to form the index.

proc sql noprint;

   create index clinnum

      on advrpt.demog  (clinnum);

   quit;

The variable used to define the index is named (separately in the SQL step).

data demog2(index=(clinnum ));

   set advrpt.demog;

   run;

The three methods for creating the index will tend to have different efficiencies based on your particular data. Each method of creating the index will read and handle the data differently.

Of these three methods of creating indexes, the DATA step is the only one that reads the entire data set. PROC SQL and PROC DATASETS don’t read the data in the same way; primarily they just add the index file. This means that the generation of the index can be more costly when done with a DATA step, and its cost will grow as the volume of data grows.

* Create index on ssn;

proc sql noprint;

   create index ssn

      on advrpt.demog (ssn);

   quit;

options msglevel=i;

* Use the ssn index;

proc print data=advrpt.demog;

   var lname fname;

   where ssn < '3';

   id ssn;

   run;

When the MSGLEVEL system option is set to I , a note will be written to the LOG when an index is utilized. In this PROC PRINT step the data are subsetted using the SSN variable. The LOG shows that the index for SSN was selected for use with the WHERE clause.

139  proc print data=advrpt.demog;

140     var lname fname;

141     where ssn < '3';

INFO: Index ssn selected for WHERE

clause optimization.

proc print data=advrpt.demog;

   by clinnum;

   id clinnum;

   run;

proc print data=advrpt.demog;

   by ssn;

   id ssn;

   run;

Once the indexes have been created, a BY statement using the indexed variable(s) will cause the index to be used. In the two PROC PRINT steps, two different BY statements are used with the same incoming data set. Since both are indexes, that data may not be sorted by either of the two BY variables; however, both steps will execute successfully.

proc datasets lib=advrpt;

   modify demog;

   index delete clinnum;

   quit;

An index can be removed from a data set through the use of PROC DATASETS. The index to be removed is identified by its name (simple or composite) on the INDEX statement along with the DELETE option. In this step the simple index CLINNUM is being removed from the ADVRPT.DEMOG data set.

5.3.3 Creating Composite Indexes

Since composite indexes are made up of two or more variables, the index name must necessarily be different from the individual variables and different from any other variable on the data set. In the first three steps below, composite indexes are created using three different methods. The PROC PRINT steps that follow then make use of two of these indexes.

proc datasets lib=advrpt;

   modify conmed;

   index create drgstart =(drug medstdt);

   quit;

The data set receiving the index is named. A given data set may contain multiple indexes at any given time.

proc sql noprint;

   create index drgstart

      on advrpt.conmed  (drug medstdt);

   quit;

The index to be created is named and the name must be different from other variables in the data set.

data cmed2(index=(drgstart=(drug medstdt)));

   set advrpt.conmed;

   run;

The list of variables making up the index is written inside the parentheses.

proc means data=advrpt.conmed noprint;

   by drug;

   var mednumber;

   output out=sumry max= n=/autoname;

   run;

The name of a composite index is not used in the BY statement; however, the variables used to define the index can be used in the BY statement. When using a composite index you can specify any inclusive subset of the variables starting from the left. Consequently just as a data set that is sorted by DRUG MEDSTDT, must necessarily also be sorted DRUG, this sorted hierarchy will be true for indexes as well.

5.3.4 Using the IDXWHERE and IDXNAME Options

In the absence of an index SAS will satisfy the conditions of a WHERE expression by reading the data sequentially. When an index is present, SAS determines whether or not the utilization of the index will be optimal. Without the index the data is read sequentially; however, the use of the index can produce a non-sequential processing of the data. You can force the use of the index through the use of the IDXWHERE data set option. The IDXNAME data set option can also be used to specify a specific index when more than one exists.

Both of the examples below print portions of the same data set. The composite index DRGSTART was established for this data set in Section 5.3.3. Without the IDXWHERE option, the DRGSTART index has not been applied. And, as is shown by the consecutive numbers in the OBS column, a sequential read of the data took place. In the second PROC PRINT (to the right) the use of the index is requested and a non-sequential read takes place.

title2 'With IDXWHERE';

proc print

     data=advrpt.conmed

            (idxwhere=yes

             idxname=drgstart);

   where drug < 'C';

   var drug medspdt;

   run;

title2 'Without IDXWHERE';

proc print data=advrpt.conmed;

   where drug < 'C';

   var drug medspdt;

   run;

5.3.4 Using IDXWHERE

Without IDXWHERE

Obs    drug                medspdt

 21    B1-VIT           01/01/2010

 22    ACCUPRIL/HCT2    01/01/2010

 44    B1-VIT           01/01/2010

 45    ACCUPRIL/HCT2    09/09/2006

 67    B1-VIT           10/24/2006

 68    ACCUPRIL/HCT2    12/10/2006

 90    B1-VIT           01/01/2010

 91    ACCUPRIL/HCT2    01/01/2010

113    B1-VIT           01/01/2010

114    ACCUPRIL/HCT2    01/01/2010

136    B1-VIT           05/13/2007

137    ACCUPRIL/HCT2    03/05/2007

159    B1-VIT                    .

160    ACCUPRIL/HCT2    04/30/2007

182    B1-VIT           08/25/2007

183    ACCUPRIL/HCT2    09/24/2007

5.3.4 Using IDXWHERE

With IDXWHERE

Obs    drug                medspdt

114    ACCUPRIL/HCT2    01/01/2010

 22    ACCUPRIL/HCT2    01/01/2010

 45    ACCUPRIL/HCT2    09/09/2006

 68    ACCUPRIL/HCT2    12/10/2006

 91    ACCUPRIL/HCT2    01/01/2010

137    ACCUPRIL/HCT2    03/05/2007

160    ACCUPRIL/HCT2    04/30/2007

183    ACCUPRIL/HCT2    09/24/2007

113    B1-VIT           01/01/2010

136    B1-VIT           05/13/2007

 21    B1-VIT           01/01/2010

 44    B1-VIT           01/01/2010

 67    B1-VIT           10/24/2006

 90    B1-VIT           01/01/2010

159    B1-VIT                    .

182    B1-VIT           08/25/2007

5.3.5 Index Caveats and Considerations

While the use of indexes can provide a number of efficiency gains, their use is not without a price. The user should have sufficient information to make an informed decision as to when to build and use indexes.

Remember that the indexes are stored in a separate file from the data set. The indexes must therefore be deliberately maintained. If you update a data set without updating its indexes, the indexing will be lost. If you copy a data set, the index file must also be copied. PROC DATASETS and PROC COPY know to look for and copy index files, but data set copies made at the OS level require that the index file be explicitly copied.

The index file will take up space. How much space depends on a number of factors, but the volume can be non-trivial. Indexes also take time to build and are therefore most appropriate for fairly stable data sets. Here stability is measured against the cost and effort of building and maintaining the index.

5.4 Protecting Passwords

When using SQL pass-through statements to access remote databases we often have to pass user identification and password information to the remote database. If we use those passwords in our program, our user ID and password will be surfaced for all to see. Very often we need to protect our passwords and to a lesser extent our user ID. The level of protection will vary from industry to industry, but the basics are the same.

Fortunately there are a number of ways to protect our passwords.

5.4.1 Using PROC PWENCODE

The PWENCODE procedure can be used to encode or encrypt passwords. Encoding is a text substitution technique that disguises your password through a series of text substitutions. Encryption, which is a more secure method than encoding (available starting in SAS 9.2), uses mathematical operations in the transformation of the text.

I have had very limited success at using this procedure to protect passwords. First, encoding is not a very strong protection, but more importantly the encoded or encrypted text can often be used instead of the password. It does not particularly matter if the bad guys cannot ‘see’ the actual password if the encoded text string, which they can see, will work in its place.

Encoding or encrypting your password through PROC PWENCODE creates and then allows you to use a text string instead of the actual password. It protects your password from being viewed directly, since only the encoded/encrypted string is visible. However, remember that the visible string, while not the password, can still be used as the password. This means that we will also need to protect this encoded/encrypted string.

filename pwfile "&path
esultspwfile.txt";

proc pwencode in='pharmer' out=pwfile;

  run;

In this PWENCODE step our password ‘pharmer’ is to be encrypted (under SAS 9.1 the default was encoding). As the procedure executes, the encrypted value of ‘pharmer’ is written to the text file (PWFILE.TXT) as: {sas002}81F6943F251507393B969C0753B2D73B and is not otherwise surfaced for viewing.

Once the value has been stored in the text file, it can be recovered and used at some point in the future. The SAS documentation for PROC PWENCODE shows how this value can be stored in a macro variable or written to the LOG. Both approaches are not practical, because if the user (or someone else) can see the encoded/encrypted value, they can then use it instead of the real password. They may not know your real password, but that does not matter; they still have access to your data.

We need to be able to use the password without ever surfacing its value, either in a macro variable or to the LOG. In the following DATA step the LIBNAME function is used to create a libref which establishes an ODBC connection to an SQL server. The password is recovered from the text file and the value is inserted into the PASSWORD= option.

data _null_;

length tmp $1024 opt $1200;

infile pwfile truncover;

input tmp;

opt='dsn=SQLServer user=myid password="'||left(trim(tmp))||'"';

rc=libname('sqlsrv',,'odbc', opt);

txt=sysmsg();

put rc= txt=;

run;

The encoded/encrypted password cannot be utilized in all coding situations that require the use of a password. For example, the RENAME function (see Section 5.6.1) allows the use of passwords, but restricts them to 8 characters (this limitation should be fixed in a future release of SAS).

MORE INFORMATION

The discussion in Section 5.4.2 takes an alternate approach to the protection of passwords.

5.4.2 Protecting Database Passwords

The following method places the sensitive information, such as passwords and user identification codes, in a protected data set. And then we write our program (the macro %SECRETSQL) to extract the password without surfacing it.

The data set ADVRPT.PASSTAB includes the user ID and password for several different databases. The data set itself is encrypted and password protected using data set options.

data advrpt.passtab(encrypt=yes  pwreq=yes

                    read=readpwd write=writepwd

                    alter=chngpwd );

  format dsn uid pwd $8.;

  dsn='dbprod'; uid='mary'; pwd='wish2pharm'; output;

  dsn='dbprod'; uid='john'; pwd='data4you';   output;

  dsn='dbdev';  uid='mary'; pwd='hope2pharm'; output;

  run;

%let syslast=;

Notice that the macro variable &SYSLAST has been cleared to remove the name of this data set from the global symbol table.

The macro %SECRETSQL retrieves the passwords from the password data set and then uses the information in the SQL pass-through in such a way that the password is never surfaced to the LOG.

%macro secretsql(dbname, username);

  %local dd uu pp;

  Proc SQL noprint nofeedback;

      SELECT dsn, uid, pwd into :dd, :uu, :pp

      FROM advrpt.passtab(read=readpwd)

        WHERE dsn=trim(symget('dbname'))

            AND uid=trim(symget('username'))

);

    connect to odbc(dsn=%superq(dd) uid=%superq(uu) pwd=%superq(pp));

    create table mylib.mytable as select * from connection to odbc(

      %passthru  /* contains your pass-thru SQL statement(s) */

);

    disconnect from odbc;

    quit;

%mend secretsql;

%secretsql(dbname=dbprod, username=John)

The macro variables that will contain the sensitive information are placed in the temporary local symbol table.

We will read the password from the secret password data set. Notice that the user of this macro does not even need to know of the existence of the password data set.

Macro variables are retrieved by using the SYMGET function rather than the more common macro variable reference with an ampersand (&UU or &DBNAME). This is done because the SYMGET function does not write macro variable values in the LOG, even when macro debugging options, such as SYMBOLGEN, are turned on.

The SYMGET function does not always execute in the CONNECT statement. The macro quoting function %SUPERQ will also resolve the macro variable without surfacing its value to the LOG.

The user creates a macro called %PASSTHRU that contains only those SQL statements that are to be processed by the remote database. The user does not need to see anything inside the %SECRETSQL macro.

The macro call only contains information that is not sensitive.

When the %SECRETSQL macro is kept in a stored compiled macro library, the source statements will not be available to the person using it (see Section 13.9 and Sun and Carpenter (2011) for information on protecting the macro code itself).

MORE INFORMATION

Data set options that provide data set protections are described in more detail in Section 2.1.2. SQL pass-through is introduced in Section 1.5. The PWENCODE procedure can potentially also be used to provide password security (see Section 5.4.1). Issues dealing with macro source code security are also discussed in Section 13.9.

5.5 Deleting Data Sets

There are a number of ways of deleting data sets, both from within SAS and from the operating system. Although generally we do not need to delete our data sets during the execution of a program, sometimes when processing especially large data sets, it can be necessary to clear data sets from the WORK library in order to free up disk space.

PROC DATASETS is the tool most often used from within a SAS program. When using PROC DATASETS, there are two basic ways of carrying out the deletions.

proc datasets library=work

              memtype=data

              nolist ;

   delete male female;

   quit;

proc datasets library=work

              memtype=data

              nolist

              kill ;

   quit;

The library from which the items are to be deleted is specified.

Select the type of item to be deleted. Data sets have the MEMTYPE=DATA. When deleting catalogs the CATALOG procedure can also be used.

The NOLIST option suppresses the list of members prior to the deletion from being written to the LOG.

The KILL option deletes all items of the specified type to be deleted.

The DELETE statement lists one or more items to delete from the specified library .

proc delete data=male allgender ;

   run;

Although no longer documented, the DELETE procedure is still available. This procedure is one of the very few, if not only, procedures to allow more than one data set name to be associated with the DATA= option.

proc sql;

   drop table allgender, male;

   quit;

Data sets can also be deleted from within an SQL step by using the DROP TABLE statement . Notice that more than one table can be listed on the DROP TABLE statement by separating the names with a comma.

5.6 Renaming Data Sets

Data sets can be renamed using a variety of methods including a number of ways through the use of the OS tools. Renaming data sets from within a program is also possible, and for some situations even preferable as the process can be automated using the macro language.

5.6.1 Using the RENAME Function

The DATA step function RENAME can be used to rename data sets, catalogs, and even directories. Like most DATA step functions it can also be utilized by the macro language. Here the data set WORK.MALE is being renamed to WORK.MALES.

data  male

      female;

   set sashelp.class;

   if sex='M' then output male;

   else output female;

   run;

%let rc=%sysfunc(rename(work.male,Males,data));

%put &RC;

Notice that the libref for the new name (second argument) is implied and is not explicitly included. The function returns a 0 for a successful rename operation.

5.6.2 Using PROC DATASETS

Within the DATASETS procedure there are two primary methods for renaming data sets and catalogs:

CHANGE	changes or renames a data set
AGE	renames a group of data sets to form a series of previous versions

data current;

   created = datetime();

   format created datetime18.;

   run;

proc datasets library=work nolist;

   change current=now;

   quit;

The CHANGE statement is designed to rename one or more data sets. The data sets are listed on the CHANGE statement in OLDNAME=NEWNAME pairs. In this example the data set WORK.CURRENT is renamed to WORK.NOW. Because the data sets are being renamed, not copied (which uses the COPY statement), the library for the old and new name will always be the same. If a data set with the NEW name already exists the rename will not take place.

proc datasets library=mydata nolist;

   age current currentV1 - currentV7;

   quit;

When you need to retain one or more snapshots (backup) copies of a data set, the AGE statement can be used to perform the operation. The oldest data set is deleted and then in order of age the data sets are renamed one at a time. In this example CURRENTV7 would be deleted, CURRENTV6 is renamed to CURRENTV7, and so on until the most recent version of CURRENT is renamed to CURRENTV1. If this AGE statement was executed every morning, there would be a backup or ‘aged’ copy of CURRENT for each day of the week.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 5 Working with Data Sets

Create new playlist

Sign In

Sign Up

Chapter 5 Working with Data Sets

5.1 Automating the COMPARE Process

MORE INFORMATION

5.2 Reordering Variables on the PDV

MORE INFORMATION

SEE ALSO

5.3 Building and Maintaining Indexes

MORE INFORMATION

SEE ALSO

5.3.1 Introduction to Indexing

5.3.2 Creating Simple Indexes

5.3.3 Creating Composite Indexes

5.3.4 Using the IDXWHERE and IDXNAME Options

SEE ALSO

5.3.5 Index Caveats and Considerations

5.4 Protecting Passwords

5.4.1 Using PROC PWENCODE

MORE INFORMATION

SEE ALSO

5.4.2 Protecting Database Passwords

MORE INFORMATION

SEE ALSO

5.5 Deleting Data Sets

SEE ALSO

5.6 Renaming Data Sets

5.6.1 Using the RENAME Function

5.6.2 Using PROC DATASETS

Table of Contents for
Chapter 5 Working with Data Sets

Chapter `5` Working with Data Sets