Image483_fmt.png

Chapter 5 Working with Data Sets

5.1 Automating the COMPARE Process

5.2 Reordering Variables on the PDV

5.3 Building and Maintaining Indexes

5.3.1 Introduction to Indexing

5.3.2 Creating Simple Indexes

5.3.3 Creating Composite Indexes

5.3.4 Using the IDXWHERE and IDXNAME Options

5.3.5 Index Caveats and Considerations

5.4 Protecting Passwords

5.4.1 Using PROC PWENCODE

5.4.2 Protecting Database Passwords

5.5 Deleting Data Sets

5.6 Renaming Data Sets

5.6.1 Using the RENAME Function

5.6.2 Using PROC DATASETS

While we are usually most interested in the data values and the analysis results that they generate, there are a number of operations that take place at the data table level. Sometimes we need to be able to work with the entire table and not just at the observation level. Fortunately SAS provides us with a number of tools for this type of processing.

5.1 Automating the COMPARE Process

While PROC COMPARE does a good job of comparing data sets, visual inspection of the output is tedious. Fortunately the COMPARE procedure has sufficient options so that its output can be utilized in an automated process.

The primary PROC COMPARE options that you will need to use include:

  • DATA

the base data set

  • COMPARE

the comparison data set

  • OUT

the new data set with the differences

  • OUTBASE

include the base data set observation

  • OUTCOMP

include the compare data set observation

  • OUTNOEQUAL

only write differences

It is of primary importance that the combination of the BY and ID statements include sufficient variables to form a unique key. If a unique key is not formed the COMPARE procedure will be unable to keep the two data sets synchronized. This implies that you may need to do a pre-analysis to check for duplicates in the key variables.

The first step in the automated comparison process is to create a data set containing only those rows that have differences. These rows are written to a data set using the OUTNOEQUAL option Callout 5.

proc compare
      data=lab_chem Callout 1
      compare=lab_chem2 Callout 1
      out=cmpr Callout 2
      outbase outcomp Callout 3
      noprint Callout 4 
      outnoequal Callout 5;
   id subject visit labdt;
   run;

Callout 1 The DATA= and COMPARE= options determine which two data sets are to be compared.

Callout 2 A data set is created using the OUT= option, which will contain the observations with the differences.

Callout 3 The data set of the differences should contain the original observations from both of the data sets that are being compared.

Callout 4 Since the process is being automated, there is no need to create any printed output.

Callout 5 Only those observations that contain differences need to be written to the data set.

For this example changes have been artificially inserted in the values of SODIUM for two observations in the data set LAB_CHEM2 (see the sample code for the full program).

Because the OUTNOEQUAL Callout 5 option has been specified, when a difference is detected, the OUTBASE OUTCOMP Callout 3 options cause the entire observation from each of the two incoming data sets to be written to the data set named in the OUT= option. This data set has the additional variables _TYPE_ and _OBS_ to help identify the original observations.

5.1 Automated Comparisons
Obs with differences
Obs  _TYPE_  _OBS_  SUBJECT  VISIT      LABDT  SODIUM  POTASSIUM  CHLORIDE                                           
 1   BASE       2     200      2   07/13/2006   144       4.9       106                                           
 2   COMPARE    2     200      2   07/13/2006   1A4       4.9       106                                           
 3   BASE      20     202      1   07/07/2006   139       4.8       96                                           
 4   COMPARE   20     202      1   07/07/2006   1A9       4.8       96                                           

_TYPE_ refers to the two data sets being compared: Callout 1

  • BASE the data set identified with the DATA= option
  • COMPARE the data set identified in the COMPARE= option

Each pair of rows in the data set WORK.CMPR Callout 2 has at least one difference; however, searching these pairs of rows for the differences is only incrementally easier than searching the original output. We need to further isolate the individual values that are different. Since we know that there is at least one difference within each BASE/COMPARE pair of observations, we need to examine each pair of values for each variable individually within this observation pair. This can be more easily accomplished if we first transpose the data so that each value pair can be isolated.

PROC TRANSPOSE is used to create a data set with one observation per pair of variables within each of the two observations. Again the BY and ID statements are used to isolate the key variables. The variable _OBS_, which reflects the original observation number, is added to the BY statement to help make the identification process easier.

proc sort data=cmpr; Callout 2
   by subject visit labdt _obs_;
   run;
proc transpose data=cmpr
                 out=tdiff(drop=_label_
                           rename=(_name_=variable));
   by subject visit labdt _obs_; Callout 6
   var _numeric_ _character_; Callout 7
   id _type_; Callout 8
   run;

Callout 6 The BY statement is used to identify down to the two rows with differences. The ID statement Callout 8 further identifies down to the specific row.

Callout 7 In the VAR statement we need to list all the variables that need to be compared. Since we do not necessarily know the names of the variables we can use variable list abbreviations (see Section 2.6.1). _ALL_ could also have been used here. When the variables to be compared are known (as is technically the case in this example) they can be named explicitly.

Callout 8 The ID variable _TYPE_, which contains ‘BASE’ and ‘COMPARE’, is used to name the two new columns formed in the transformation process. These will be character variables when the variables in the VAR statement are either all character or a mixture of numeric and character. When there is a mixture of variable types, as is the case in this example, a conversion note is written to the LOG.

The data set TDIFF will now have one row for each original variable, and the values from the two original data sets are stored in the variables BASE and COMPARE. By selecting for unequal BASE and COMPARE values, we can determine the differences that have been detected in the comparison process.

title3 'After Transpose';
proc print data=tdiff(where=(variable ne '_TYPE_' 
                             & base ne compareCallout 9));
   run;

Callout 9The variables BASE and COMPARE now contain the original values of the variable named VARIABLE, which was renamed from _NAME_.


				
5.1 Automated Comparisons
Obs with differences
After Transpose
Obs  SUBJECT  VISIT       LABDT  _OBS_  variable  BASE  COMPARE
  6    200      2    07/13/2006     2    SODIUM   144     1A4
 14    202      1    07/07/2006    20    SODIUM   139     1A9

In this example the differences are only printed; however, they could have been easily stored in a data set for further processing.

MORE INFORMATION

A macro that further generalizes the comparison process can be found in program E5_1b.sas, which is located in the sample programs accompanying this book.

5.2 Reordering Variables on the PDV

The order of the variables on the Program Data Vector, PDV, is generally of no concern to the SAS programmer. Indeed there are no tools in the language that are specifically designed to help us to change the order once it has been established. Although they come up fairly infrequently, there are legitimate occasions that force us to either know or change the order of the variables on the PDV.

Sometimes when we create an EXCEL spreadsheet, the resulting columns need to be in a specific order. PROC EXPORT (see Section 1.2) will use the PDV order of the variables to determine the order of the EXCEL columns. To change this order we need to change the order of the variables going into PROC EXPORT.

The current order of the variables can be seen in a number of ways. Some of the most straightforward of these include:

  • When PROC PRINT is used without an ID, BY, or VAR statement, the variables are printed in the order of their position.
  • VIEWTABLE in the Display Manager displays the variables in position order.
  • PROC CONTENTS displays the position of each variable, and with the VARNUM option (formerly the POSITION option) it will also list the variables in position order.
  • The COLUMNS window in the Display Manager displays the columns in position order by default.

Within a DATA step the order of the variables on the PDV is determined as the step is compiled. If the step reads a data set, as with a SET statement, the incoming data set is examined and its variables are added to the PDV using the same order. Once the order is determined on the PDV it is fixed and cannot be altered. If we want to control the order we must do so before it is fixed.

Using the VARNUM option in a PROC CONTENTS step reveals, among other attributes, the order of the variables in ADVRPT.DEMOG. The left-most column is the variable or position number.


				
5.2 Reordering Variables on the PDV
The CONTENTS Procedure
                   Variables in Creation Order
#    Variable    Type    Len    Format    Label
 1    subject     Num       8
 2    clinnum     Char      6              clinic number
 3    lname       Char     10              last name
 4    fname       Char      6              first name
 5    ssn         Char      9              social security number
 6    sex         Char      1              patient sex
 7    dob         Num       8    DATE7.    date of birth
 8    death       Num       8    DATE7.    date of death
 9    race        Char      1              race
10    edu         Num       8              years of education
11    wt          Num       8              weight in pounds
12    ht          Num       8              height in inches
13    symp        Char      2              symptom code
14    death2      Num       8    DATE9.

In the example that follows, only a few of the variables from ADVRPT.DEMOG flow through to the new data set, and the order of these variables is changed by the use of the LENGTH statement.

data demog2(keep=subject lname fname sex ht wt dob symp); Callout 1
   length lname $10 fname $6 sex $1 symp $2; Callout 2
   set advrpt.demog(keep=subject lname fname sex edu Callout 1
                         death ht wt dob symp);
   where death and edu>15; Callout 3
   run;

Callout 1 The KEEP and DROP statements/options do not change or affect the order of variables on the PDV.

Callout 2 The LENGTH statement is used not only to assign the length attribute to these four variables, but it also adds them to the PDV. Since the LENGTH statement appears before the SET statement these variables and their associated attributes are added to the PDV before any variables or attributes are contributed by the SET statement. Attributes for these variables that have not been specified, for example formats or labels, will be picked up from the metadata of the incoming data set named on the SET statement.

Callout 3 Although the variables DEATH and EDU are not to be included on the new data set, they must be available on the PDV for use by the WHERE statement.

5.2 Reordering Variables on the PDV
Obs   lname    fname   sex   symp   subject       dob    wt   ht
 1    James    Debra    F     05      232     19JUN42   163   63
 2    Manley   Debra    F     05      241     19JAN42   163   63

Notice that while the order has changed for the variables in the LENGTH statement, the order for the remaining variables is the same as it was on the incoming data.

The ARRAY, FORMAT, INFORMAT, RETAIN, and ATTRIB statements can also be used in a similar manner to reorder variables on the PDV. The KEEP, DROP, and RENAME statements cannot be used to change variable order. Generally the recommended choice of statements for reordering the variables is the RETAIN statement as it does not require any other attributes and does not otherwise change behavior of variables that are being brought into the DATA step via the SET or MERGE statements.

retain lname fname sex symp;

In the previous example the LENGTH statement Callout 2 could have been replaced with a RETAIN statement, which does not require additional knowledge of the attributes of the variables that are to be reordered.

proc sql ;
create table demog4 as 
 Callout 4 select lname, fname, sex, symp, subject, dob, wt, ht 
      from advrpt.demog(keep=subject lname fname sex 
                             edu death ht wt dob symp)
        where death and edu>15;
   select * Callout 5
      from demog4;
quit;

It is also possible to reorder variables using an SQL step. Here the SELECT statement Callout 4 is used to specify the new variable order. Notice that unlike the DATA step a full list of variables must be specified unless you use the asterisk (*) Callout 5 to specify all variables, in which case you will see them in PDV order.

When you read more than one data set in a step, the order of variables is determined to some extent by the order that the data sets are read in. Variables are added to the PDV in the order that they are encountered by the compiler. The order of the variables taken from the first data set seen by the compiler will be written to the PDV first. Variables not already on the PDV will be added to the PDV in the order that they are encountered on subsequent data sets. This means that you may have some control of variable order by controlling the order in which your incoming data sets are first seen.

MORE INFORMATION

Although I do not necessarily recommend their use, some of the variable list abbreviations require you to have specific knowledge of the order of the variables (see Section 2.6.1).

SEE ALSO

SAS Usage Note 8395 discusses the reordering of variables at http://support.sas.com/kb/8/395.html.

5.3 Building and Maintaining Indexes

It is a bit surprising how few programmers actually take advantage of indexing. True the topic can be a bit complex; after all, a book has been written on the subject (Raithel, 2006). Complex or not you do not need a lot of knowledge to take advantage of them.

Indexes give us the ability to virtually sort a data table without physically sorting it. More than one index can be created for any give data table and with multiple indexes it is possible to effectively sort the data set multiple ways at the same time. Once created the index allows the use of the BY statement and other processing techniques as if the data had been sorted using PROC SORT. The index can also be very helpful in subsetting and merging situations.

MORE INFORMATION

Indexes are used with the KEY= option in a table lookup example in Section 6.6, and discussed relative to the use of the WHERE statement in Section 2.7.

SEE ALSO

The SAS Press book written by Michael Raithel (2006) is the definitive work on SAS indexes. A shorter introduction to the practical aspects of indexes can be found in Raithel (2004). Clifford (2005) addresses a number of frequently asked questions about indexes. Andrews and Kress (year unknown) compare the DATASETS and SQL procedures for the building of indexes.

5.3.1 Introduction to Indexing

Indexes provide a search tool that allows the detection and extraction of a data subset. Well-defined indexes can be especially useful in increasing the efficiency of the subsetting process. The highest efficiency gains can be had as the data subset becomes smaller relative to the size of the overall data set. An index that segments the data into subsets that are no more than 10 or 15% of the total data set will tend to provide the most efficiency benefits.

Data set variables are used to define indexes. The selected variables should be chosen so that they maximize the ability to discriminate or break up the data into smaller subsets. Variables that take on only a few levels, such as GENDER or RACE, would probably make poor candidates, while variables such as SUBJECTID or NAME, which take on many more levels relative to the size of the overall data set, would tend to make better index variables. You can also use two or more variables in combination in order to increase the ability of the index to discriminate among data subsets. When the data are also sorted, the variables used to sort the data are also good index variable candidates.

Indexes can be simple (a single variable) or composite (two or more variables), and they can be created in a DATA step, an SQL step, or through PROC DATASETS. Once created, the user can take advantage of indexes in several different ways. Sometimes SAS will even take advantage of available indexes without the user’s knowledge. The system option MSGLEVEL=I will cause index usage notes to be written to the LOG.

There are options available in some statements that will specifically invoke indexes (see Section 6.6). While the user can use indexes simply by including the appropriate BY statement, this is not necessarily the best use of indexes. BY-group processing with the index can be inefficient, especially when the full data set is being processed.

Indexes are named and for a simple index, which consists of a single variable, the name is the same as that variable. For composite indexes, which use two or more variables, a name must be provided (the name must be different from any variables on the data set). The index name is used to identify the index file, but is not used by the user to retrieve the index. Indexes are used by specifying the names of the variables that make up the index (simple or composite).

proc contents data=advrpt.demog;
   run;

The metadata shown by the CONTENTS procedure shows that the INDEXES flag Callout 1 is now set and shows the number of indexes associated with this table.


				
5.3.1 Metadata for an Indexed Table
The CONTENTS Procedure
Data Set Name   ADVRPT.DEMOG                    Observations        77                                      
Member Type     DATA                            Variables           14                                      
Engine          V9                              Indexes              3  Callout 1                                      
Created         Tue, Sep 22, 2009 10:57:37 AM   Observation Length  96                                      
Alphabetic List of Indexes and Attributes
                  # of
                Unique
#    Index      Values    Variables
       Callout 2                    Callout 3
1    group          23    sex race edu
2    ssn            76
3    subject        77

The index definitions are also included in the data set’s metadata as can be seen by looking at the listing generated by PROC CONTENTS. The name of each index is shown under the column labeled Index Callout 2. For composite indexes the variable list is also shown under the column labeled Variables Callout 3.

When an index has been selected, its use will be mentioned in the LOG.

66   proc print data=advrpt.demog;
67      by sex race edu;
NOTE: An index was selected to execute the BY statement.
      The observations will be returned in index order rather than 
      in physical order.  The selected index is for the variable(s):
 sex
 race
 edu

The indexes are not actually stored in the data set itself. Instead they reside in a separate file. Under Windows the index file has the same name as the data set with which it is associated; however, the extension is different.

5.3.2 Creating Simple Indexes

Indexes can be created using a DATASETS step, an SQL step, or a DATA step. Each of the next three steps creates a simple index. Later in this section these indexes are used in several PROC PRINT steps.

proc datasets lib=advrpt;
   modify demog; Callout 1
   index create clinnum; Callout 2 Callout 3
   quit;

Callout 1 The data set receiving the index is named. The data set may be either permanent or temporary.

Callout 2 The index to be created is named. For simple indexes the name is the same as the single variable used to form the index.

proc sql noprint;
   create index clinnum Callout 2
      on advrpt.demog Callout 1 (clinnum)Callout 3;
   quit;

Callout 3 The variable used to define the index is named (separately in the SQL step).

data demog2Callout 1(index=(clinnum Callout 2Callout 3));
   set advrpt.demog;
   run;

The three methods for creating the index will tend to have different efficiencies based on your particular data. Each method of creating the index will read and handle the data differently.

Of these three methods of creating indexes, the DATA step is the only one that reads the entire data set. PROC SQL and PROC DATASETS don’t read the data in the same way; primarily they just add the index file. This means that the generation of the index can be more costly when done with a DATA step, and its cost will grow as the volume of data grows.

* Create index on ssn;
proc sql noprint;
   create index ssn
      on advrpt.demog (ssn);
   quit;
options msglevel=i; Callout 4
* Use the ssn index; 
proc print data=advrpt.demog;
   var lname fname;
   where ssn < '3';
   id ssn;
   run;

When the MSGLEVEL system option is set to I Callout 4, a note will be written to the LOG when an index is utilized. In this PROC PRINT step the data are subsetted using the SSN variable. The LOG shows that the index for SSN was selected for use with the WHERE clause.

139  proc print data=advrpt.demog;
140     var lname fname;
141     where ssn < '3';
INFO: Index ssn selected for WHERE
clause optimization. Callout 4 
proc print data=advrpt.demog;
   by clinnum; Callout 5
   id clinnum;
   run;
proc print data=advrpt.demog;
   by ssn; Callout 5
   id ssn;
   run;

Once the indexes have been created, a BY statement using the indexed variable(s) will cause the index to be used. In the two PROC PRINT steps, two different BY Callout 5 statements are used with the same incoming data set. Since both are indexes, that data may not be sorted by either of the two BY variables; however, both steps will execute successfully.

proc datasets lib=advrpt;
   modify demog;
   index delete clinnum;
   quit;

An index can be removed from a data set through the use of PROC DATASETS. The index to be removed is identified by its name (simple or composite) on the INDEX statement along with the DELETE option. In this step the simple index CLINNUM is being removed from the ADVRPT.DEMOG data set.

5.3.3 Creating Composite Indexes

Since composite indexes are made up of two or more variables, the index name must necessarily be different from the individual variables and different from any other variable on the data set. In the first three steps below, composite indexes are created using three different methods. The PROC PRINT steps that follow then make use of two of these indexes.

proc datasets lib=advrpt;
   modify conmed; Callout 1
   index create drgstart Callout 2=(drug medstdt); Callout 3
   quit;

Callout 1 The data set receiving the index is named. A given data set may contain multiple indexes at any given time.

proc sql noprint;
   create index drgstart Callout 2
      on advrpt.conmed Callout 1 (drug medstdt)Callout 3;
   quit;

Callout 2 The index to be created is named and the name must be different from other variables in the data set.

data cmed2Callout 1(index=(drgstartCallout 2=(drug medstdt)Callout 3));
   set advrpt.conmed;
   run;

Callout 3 The list of variables making up the index is written inside the parentheses.

proc means data=advrpt.conmed noprint;
   by drug; Callout 4
   var mednumber;
   output out=sumry max= n=/autoname;
   run;

The name of a composite index is not used in the BY statement; however, the variables used to define the index can be used in the BY statement. When using a composite index you can specify any inclusive subset of the variables starting from the left.Callout 4 Consequently just as a data set that is sorted by DRUG MEDSTDT, must necessarily also be sorted DRUG, this sorted hierarchy will be true for indexes as well.

5.3.4 Using the IDXWHERE and IDXNAME Options

In the absence of an index SAS will satisfy the conditions of a WHERE expression by reading the data sequentially. When an index is present, SAS determines whether or not the utilization of the index will be optimal. Without the index the data is read sequentially; however, the use of the index can produce a non-sequential processing of the data. You can force the use of the index through the use of the IDXWHERE data set option. The IDXNAME data set option can also be used to specify a specific index when more than one exists.

Both of the examples below print portions of the same data set. The composite index DRGSTART was established for this data set in Section 5.3.3. Without the IDXWHERE option, the DRGSTART index has not been applied. And, as is shown by the consecutive numbers in the OBS column, a sequential read of the data took place. In the second PROC PRINT (to the right) the use of the index is requested and a non-sequential read takes place.


				
title2 'With IDXWHERE';
proc print 
     data=advrpt.conmed
            (idxwhere=yes
             idxname=drgstart);
   where drug < 'C';
   var drug medspdt;
   run;
 
title2 'Without IDXWHERE'; 
proc print data=advrpt.conmed;
   where drug < 'C';
   var drug medspdt;
   run;
5.3.4 Using IDXWHERE
Without IDXWHERE
Obs    drug                medspdt
 21    B1-VIT           01/01/2010
 22    ACCUPRIL/HCT2    01/01/2010
 44    B1-VIT           01/01/2010
 45    ACCUPRIL/HCT2    09/09/2006
 67    B1-VIT           10/24/2006
 68    ACCUPRIL/HCT2    12/10/2006
 90    B1-VIT           01/01/2010
 91    ACCUPRIL/HCT2    01/01/2010
113    B1-VIT           01/01/2010
114    ACCUPRIL/HCT2    01/01/2010
136    B1-VIT           05/13/2007
137    ACCUPRIL/HCT2    03/05/2007
159    B1-VIT                    .
160    ACCUPRIL/HCT2    04/30/2007
182    B1-VIT           08/25/2007
183    ACCUPRIL/HCT2    09/24/2007
 
5.3.4 Using IDXWHERE
With IDXWHERE
Obs    drug                medspdt
114    ACCUPRIL/HCT2    01/01/2010
 22    ACCUPRIL/HCT2    01/01/2010
 45    ACCUPRIL/HCT2    09/09/2006
 68    ACCUPRIL/HCT2    12/10/2006
 91    ACCUPRIL/HCT2    01/01/2010
137    ACCUPRIL/HCT2    03/05/2007
160    ACCUPRIL/HCT2    04/30/2007
183    ACCUPRIL/HCT2    09/24/2007
113    B1-VIT           01/01/2010
136    B1-VIT           05/13/2007
 21    B1-VIT           01/01/2010
 44    B1-VIT           01/01/2010
 67    B1-VIT           10/24/2006
 90    B1-VIT           01/01/2010
159    B1-VIT                    .
182    B1-VIT           08/25/2007
 

SEE ALSO

During SQL joins SAS will determine whether or not it is optimal to utilize an index. The _METHOD option can be used to determine when an index has been utilized (Lavery, 2005).

5.3.5 Index Caveats and Considerations

While the use of indexes can provide a number of efficiency gains, their use is not without a price. The user should have sufficient information to make an informed decision as to when to build and use indexes.

Remember that the indexes are stored in a separate file from the data set. The indexes must therefore be deliberately maintained. If you update a data set without updating its indexes, the indexing will be lost. If you copy a data set, the index file must also be copied. PROC DATASETS and PROC COPY know to look for and copy index files, but data set copies made at the OS level require that the index file be explicitly copied.

The index file will take up space. How much space depends on a number of factors, but the volume can be non-trivial. Indexes also take time to build and are therefore most appropriate for fairly stable data sets. Here stability is measured against the cost and effort of building and maintaining the index.

5.4 Protecting Passwords

When using SQL pass-through statements to access remote databases we often have to pass user identification and password information to the remote database. If we use those passwords in our program, our user ID and password will be surfaced for all to see. Very often we need to protect our passwords and to a lesser extent our user ID. The level of protection will vary from industry to industry, but the basics are the same.

Fortunately there are a number of ways to protect our passwords.

5.4.1 Using PROC PWENCODE

The PWENCODE procedure can be used to encode or encrypt passwords. Encoding is a text substitution technique that disguises your password through a series of text substitutions. Encryption, which is a more secure method than encoding (available starting in SAS 9.2), uses mathematical operations in the transformation of the text.

I have had very limited success at using this procedure to protect passwords. First, encoding is not a very strong protection, but more importantly the encoded or encrypted text can often be used instead of the password. It does not particularly matter if the bad guys cannot ‘see’ the actual password if the encoded text string, which they can see, will work in its place.

Encoding or encrypting your password through PROC PWENCODE creates and then allows you to use a text string instead of the actual password. It protects your password from being viewed directly, since only the encoded/encrypted string is visible. However, remember that the visible string, while not the password, can still be used as the password. This means that we will also need to protect this encoded/encrypted string.

filename pwfile "&path
esultspwfile.txt";
proc pwencode in='pharmer' out=pwfile;
  run;

In this PWENCODE step our password ‘pharmer’ is to be encrypted (under SAS 9.1 the default was encoding). As the procedure executes, the encrypted value of ‘pharmer’ is written to the text file (PWFILE.TXT) as: {sas002}81F6943F251507393B969C0753B2D73B and is not otherwise surfaced for viewing.

Once the value has been stored in the text file, it can be recovered and used at some point in the future. The SAS documentation for PROC PWENCODE shows how this value can be stored in a macro variable or written to the LOG. Both approaches are not practical, because if the user (or someone else) can see the encoded/encrypted value, they can then use it instead of the real password. They may not know your real password, but that does not matter; they still have access to your data.

We need to be able to use the password without ever surfacing its value, either in a macro variable or to the LOG. In the following DATA step the LIBNAME function is used to create a libref which establishes an ODBC connection to an SQL server. The password is recovered from the text file and the value is inserted into the PASSWORD= option.

data _null_;
length tmp $1024 opt $1200;
infile pwfile truncover;
input tmp;
opt='dsn=SQLServer user=myid password="'||left(trim(tmp))||'"';
rc=libname('sqlsrv',,'odbc', opt);
txt=sysmsg();
put rc= txt=;
run;

The encoded/encrypted password cannot be utilized in all coding situations that require the use of a password. For example, the RENAME function (see Section 5.6.1) allows the use of passwords, but restricts them to 8 characters (this limitation should be fixed in a future release of SAS).

MORE INFORMATION

The discussion in Section 5.4.2 takes an alternate approach to the protection of passwords.

SEE ALSO

Although slightly dated by subsequent releases of SAS, Steven (2007) describes the use of the PWENCODE procedure.

5.4.2 Protecting Database Passwords

The following method places the sensitive information, such as passwords and user identification codes, in a protected data set. And then we write our program (the macro %SECRETSQL) to extract the password without surfacing it.

The data set ADVRPT.PASSTAB includes the user ID and password for several different databases. The data set itself is encrypted and password protected using data set options.

data advrpt.passtab(encrypt=yes  pwreq=yes 
                    read=readpwd write=writepwd
                    alter=chngpwd );
  format dsn uid pwd $8.;
  dsn='dbprod'; uid='mary'; pwd='wish2pharm'; output;
  dsn='dbprod'; uid='john'; pwd='data4you';   output;
  dsn='dbdev';  uid='mary'; pwd='hope2pharm'; output;
  run;
%let syslast=;

Notice that the macro variable &SYSLAST has been cleared to remove the name of this data set from the global symbol table.

The macro %SECRETSQL retrieves the passwords from the password data set and then uses the information in the SQL pass-through in such a way that the password is never surfaced to the LOG.


				
%macro secretsql(dbname, username);
  %local dd uu pp; Callout 1
  Proc SQL noprint nofeedback;
    (
      SELECT dsn, uid, pwd into :dd, :uu, :pp  
      FROM advrpt.passtab(read=readpwd) Callout 2  
        WHERE dsn=trim(symget('dbname')) Callout 3
            AND uid=trim(symget('username'))
    );
    connect to odbc(dsn=%superq(dd) uid=%superq(uu) pwd=%superq(pp)); Callout 4 
    create table mylib.mytable as select * from connection to odbc(
      %passthru Callout 5 /* contains your pass-thru SQL statement(s) */
    );
    disconnect from odbc;
    quit;
%mend secretsql; 
%secretsql(dbname=dbprod, username=John)Callout 6

Callout 1 The macro variables that will contain the sensitive information are placed in the temporary local symbol table.

Callout 2 We will read the password from the secret password data set. Notice that the user of this macro does not even need to know of the existence of the password data set.

Callout 3 Macro variables are retrieved by using the SYMGET function rather than the more common macro variable reference with an ampersand (&UU or &DBNAME). This is done because the SYMGET function does not write macro variable values in the LOG, even when macro debugging options, such as SYMBOLGEN, are turned on.

Callout 4 The SYMGET function does not always execute in the CONNECT statement. The macro quoting function %SUPERQ will also resolve the macro variable without surfacing its value to the LOG.

Callout 5 The user creates a macro called %PASSTHRU that contains only those SQL statements that are to be processed by the remote database. The user does not need to see anything inside the %SECRETSQL macro.

Callout 6 The macro call only contains information that is not sensitive.

When the %SECRETSQL macro is kept in a stored compiled macro library, the source statements will not be available to the person using it (see Section 13.9 and Sun and Carpenter (2011) for information on protecting the macro code itself).

MORE INFORMATION

Data set options that provide data set protections are described in more detail in Section 2.1.2. SQL pass-through is introduced in Section 1.5. The PWENCODE procedure can potentially also be used to provide password security (see Section 5.4.1). Issues dealing with macro source code security are also discussed in Section 13.9.

SEE ALSO

The %SECRETSQL macro and a number of related techniques are described in more detail by Sherman and Carpenter (2007).

5.5 Deleting Data Sets

There are a number of ways of deleting data sets, both from within SAS and from the operating system. Although generally we do not need to delete our data sets during the execution of a program, sometimes when processing especially large data sets, it can be necessary to clear data sets from the WORK library in order to free up disk space.

PROC DATASETS is the tool most often used from within a SAS program. When using PROC DATASETS, there are two basic ways of carrying out the deletions.

proc datasets library=work Callout 1
              memtype=data Callout 2
              nolist Callout 3;
   delete male female; Callout 5
   quit;
 
proc datasets library=work Callout 1
              memtype=data Callout 2
              nolist Callout 3
              kill Callout 4;
   quit;

Callout 1 The library from which the items are to be deleted is specified.

Callout 2 Select the type of item to be deleted. Data sets have the MEMTYPE=DATA. When deleting catalogs the CATALOG procedure can also be used.

Callout 3 The NOLIST option suppresses the list of members prior to the deletion from being written to the LOG.

Callout 4 The KILL option deletes all items of the specified type Callout 2 to be deleted.

Callout 5 The DELETE statement lists one or more items to delete from the specified library Callout 1.

proc delete data=male allgender Callout 6;
   run;

Although no longer documented, the DELETE procedure is still available. This procedure Callout 6 is one of the very few, if not only, procedures to allow more than one data set name to be associated with the DATA= option.

proc sql;
   drop table allgender, male; Callout 7
   quit;

Data sets can also be deleted from within an SQL step by using the DROP TABLE statement Callout 7. Notice that more than one table can be listed on the DROP TABLE statement by separating the names with a comma.

SEE ALSO

Rosenbloom and Lafler (2011d) discuss the use of PROC DATASETS to delete data sets.

5.6 Renaming Data Sets

Data sets can be renamed using a variety of methods including a number of ways through the use of the OS tools. Renaming data sets from within a program is also possible, and for some situations even preferable as the process can be automated using the macro language.

5.6.1 Using the RENAME Function

The DATA step function RENAME can be used to rename data sets, catalogs, and even directories. Like most DATA step functions it can also be utilized by the macro language. Here the data set WORK.MALE is being renamed to WORK.MALES.

data  male 
      female;
   set sashelp.class;
   if sex='M' then output male;
   else output female;
   run;
%let rc=%sysfunc(rename(work.male,Males,data));
%put &RC;

Notice that the libref for the new name (second argument) is implied and is not explicitly included. The function returns a 0 for a successful rename operation.

5.6.2 Using PROC DATASETS

Within the DATASETS procedure there are two primary methods for renaming data sets and catalogs:

  • CHANGE

changes or renames a data set

  • AGE

renames a group of data sets to form a series of previous versions

data current;
   created = datetime();
   format created datetime18.;
   run;
proc datasets library=work nolist;
   change current=now;
   quit;

The CHANGE statement is designed to rename one or more data sets. The data sets are listed on the CHANGE statement in OLDNAME=NEWNAME pairs. In this example the data set WORK.CURRENT is renamed to WORK.NOW. Because the data sets are being renamed, not copied (which uses the COPY statement), the library for the old and new name will always be the same. If a data set with the NEW name already exists the rename will not take place.

proc datasets library=mydata nolist;
   age current currentV1 - currentV7;
   quit;

When you need to retain one or more snapshots (backup) copies of a data set, the AGE statement can be used to perform the operation. The oldest data set is deleted and then in order of age the data sets are renamed one at a time. In this example CURRENTV7 would be deleted, CURRENTV6 is renamed to CURRENTV7, and so on until the most recent version of CURRENT is renamed to CURRENTV1. If this AGE statement was executed every morning, there would be a backup or ‘aged’ copy of CURRENT for each day of the week.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset