Chapter 12. Reading SAS Data Sets

Overview

Introduction

You've learned about creating a SAS data set from raw data. However, you may often want to create a new data set from an existing SAS data set. To create the new data set, you can read a data set using the DATA step. As you read the data set, you can use all the programming features of the DATA step to manipulate your data.

Introduction

This chapter shows you how to use the DATA step to read an existing SAS data set. When you create your new data set, you can choose variables, select observations based on one or more conditions, and assign values conditionally. You can also assign variable attributes such as formats and labels.

Introduction

Objectives

In this chapter, you learn to

  • create a new data set from an existing data set

  • use BY groups to process observations

  • read observations by observation number

  • stop processing when necessary

  • explicitly write observations to output

  • detect the last observation in a data set

  • identify differences in DATA step processing for raw data and DATA step processing for existing data sets.

Reading a Single Data Set

Suppose you want to create a small data set, Lab23.Drug1H, from the Research.CLTrials data set, which contains information on treadmill test time and relative tolerance levels.

To create the data set, you must first reference the library in which CLTrials is stored and then the library in which you want to store Drug1H. Then you write a DATA step to read your data and create a new data set.

After you write a DATA step to name the SAS data set to be created, you specify the data set that will be read in the SET statement. The DATA statement below creates the permanent SAS data set Drug1H, which will be stored in a SAS data library to which the libref Lab23 has been assigned. The SET statement below reads the permanent SAS data set Research.CLTrials.

libname lab23 'c:drugallergylabtests';
libname research 'c:drugallergy';
data lab23.drug1h;
   set  research.cltrials;
run;

The DATA step above reads all observations and variables from the existing data set into the new data set. When you submit this DATA step, the following messages appear in the log, confirming that the new data set was created:

Reading a Single Data Set

Manipulating Data

In the previous practice, you created a data set that was identical to Sasuser.Stress2. But you usually don't want an exact duplicate of the existing data set. When you read a data set, you can use any of the programming features of the DATA step to manipulate your data.

For example, you can use any of the statements and data set options that you learned in previous chapters.

To do this…

Use this type of statement...

Subset data

if resthr<70 then delete; if tolerance='D';

Drop unwanted variables

drop timemin timesec;

Create or modify a variable

TotalTime=(timemin*60)+timesec;

Initialize a sum variable

Sum accumulated values

retain SumSec 5400; sumsec+totaltime;

Specify a variable's length

length TestLength $ 6;

Execute statements conditionally

if totaltime>800 then TestLength='Long'; else if 750<=totaltime<=800 then TestLength='Normal'; else if totaltime<750 then TestLength='Short';

Label a variable

Format a variable

label sumsec='Cumulative Total Seconds'; format sumsec comma6.;

Example

The following DATA step reads the data set Research.CLTrials, selects observations and variables, and creates new variables.

data lab23.drug1h(drop=placebo uric);
   set research.cltrials(drop=triglyc);
   if sex='M' then delete;
   if placebo='YES';
   TestDate='22MAY1999'd;
   retain Days 30;
   days+1;
   length Retest $ 5;
   if cholesterol>190 then retest='YES';
   else if 150<=cholesterol<=190 then retest='CHECK';
   else if cholesterol<150 then retest='NO';
   label retest='Perform Cholesterol Test 2?';
   format enddate mmddyy10.;
run;

Where to Specify the DROP= and KEEP= Data Set Options

You've learned that you can specify the DROP= and KEEP= data set options anywhere you name a SAS data set. However, using DROP= and KEEP= when reading an existing data set requires that you decide where to specify these options. You can specify DROP= and KEEP= in either the DATA statement or the SET statement, depending on whether or not you want to process values of the variables in that DATA step:

  • If you don't process certain variables and you don't want them to appear in the new data set, specify them in the DROP= option in the SET statement.

    In the DATA step shown below, the DROP= option in the SET statement prevents the variables Triglycerides and UricAcid from being read. These variables won't appear in the Lab23.Drug1H data set.

    data lab23.drug1h(drop=placebo);
       set research.cltrials(drop=triglycerides uricacid);
       if placebo='YES';
    run;
  • If you do need to process a variable in the original data set (in a subsetting IF statement, for example), you must specify the variable in the DROP= option in the DATA statement. Otherwise, the statement that is using the variable for processing causes an error.

    This DATA step uses the variable Placebo to select observations. To drop Placebo from the new data set, the DROP= option must appear in the DATA statement.

    data lab23.drug1h(drop=placebo);
       set research.cltrials(drop=triglycerides uricacid);
       if placebo='YES';
    run;

When used in the DATA statement, the DROP= option simply drops the variables from the new data set. However, they are still read from the original data set and are available within the DATA step.

Using BY-Group Processing

Finding the First and Last Observations in a Group

If you completed Chapter 4, "Creating List Reports," on page 99, you learned to use a BY statement in PROC SORT to sort observations and in PROC PRINT to group observations for subtotals. You can also use the BY statement in the DATA step to group observations for processing.

data temp;
   set salary;
   by dept;
run;

When you use the BY statement with the SET statement,

  • the data sets that are listed in the SET statement must be sorted by the values of the BY variable(s), or they must have an appropriate index.

  • the DATA step creates two temporary variables for each BY variable. One is named FIRST.variable, where variable is the name of the BY variable, and the other is named LAST.variable. Their values are either 1 or 0. FIRST.variable and LASTvariable. identify the first and last observation in each BY group.

This variable …

Equals …

FIRST.variable

1 for the first observation in a BY group

 

0 for any other observation in a BY group

LAST.variable

1 for the last observation in a BY group

 

0 for any other observation in a BY group

Example

To work with FIRST.variable and LAST.variable, let's look at a different set of data. The Company.USA data set contains payroll information for individual employees. Suppose you want to compute the annual payroll by department. Assume 2,000 work hours per year for hourly employees.

Before computing the annual payroll, you need to group observations by values of the variable Dept.

Example

The following program computes the annual payroll by department. Notice that the variable name Dept has been appended to FIRST. and LAST.

proc sort data=company.usa out=work.temp;
   by dept;
run;
data company.budget(keep=dept payroll);
   set work.temp;
   by dept;
   if wagecat='S' then Yearly=wagerate*12;
   else if wagecat='H' then Yearly=wagerate*2000;
   if  first.dept then Payroll=0;
   payroll+yearly;
   if  last.dept;
run;

If you could look behind the scenes at the program data vector (PDV) as the Company.Budget data set is being created, you would see the following. Notice the values for FIRST.Dept and LAST.Dept.

Example

When you print the new data set, you can now list and sum the annual payroll by department.

proc print data=company.budget noobs;
   sum payroll;
   format payroll dollar12.2;
run;
Example

Finding the First and Last Observations in Subgroups

When you specify multiple BY variables,

  • FIRST.variable for each variable is set to 1 at the first occurrence of a new value for the variable

  • a change in the value of a primary BY variable forces LAST.variable to equal 1 for the secondary BY variables.

Example

Suppose you now want to compute the annual payroll by job type for each manager. In your program, you specify two BY variables, Manager and JobType.

proc sort data=company.usa out=work.temp2;
   by manager jobtype;
data company.budget2(keep=manager jobtype payroll);
   set work.temp2;
   by manager jobtype;
   if wagecat='S' then Yearly=wagerate*12;
   else if wagecat='H' then Yearly=wagerate*2000;
   if first.jobtype then Payroll=0;
   payroll+yearly;
   if last.jobtype;
run;

If you could look at the PDV now, you would see the following. Notice that the values for FIRST.JobType and LAST.JobType change according to values of FIRST.Manager and LAST.Manager.

Example

Now you can sum the annual payroll by job type for each manager. Here, the payroll for only two managers (Coxe and Delgado) is listed.

proc print data=company.budget2 noobs;;
   by manager;
   var jobtype;
   sum payroll;
   where manager in ('Coxe','Delgado'),
   format payroll dollar12.2;
run;
Example

Reading Observations Using Direct Access

So far in this chapter, you've read the observations in an input data set sequentially. That is, you have accessed observations in the order in which they appear in the physical file. However, you can also access observations directly, by going straight to an observation in a SAS data set without having to process each observation that precedes it.

To access observations directly by their observation number, you use the POINT= option in the SET statement.

Example

Let's suppose you want to read only the fifth observation from a data set. In the following DATA step, the value 5 is assigned to the variable ObsNum. The POINT= option reads the value of ObsNum to determine which observation to read from the data set Company.USA.

data work.getobs5;
   obsnum=5;
   set company.usa(keep=manager payroll)  point=obsnum;
run;

But let's see what would happen if you submitted this program.

As you learned in a previous chapter, the DATA step continues to read observations until it reaches the end-of-file marker in the input data. However, because the POINT= option reads only specified observations, SAS cannot read an end-of-file indicator as it would if the file were being read sequentially. So submitting the following program would cause continuous looping.

data work.getobs5(drop=obsnum);
   obsnum=5;
   set company.usa(keep=manager payroll)  point=obsnum;
run;

Preventing Continuous Looping with POINT=

Because there is no end-of-file condition when you use direct access to read data, you must take either or both of the following precautions:

  • Use a STOP statement to prevent continuous looping. The STOP statement causes SAS to stop processing the current DATA step immediately and to resume processing statements after the end of the current DATA step.

  • Use programming logic that checks for an invalid value of the POINT= variable. If SAS reads an invalid value for the POINT= variable, it sets the automatic variable _ERROR_ to1. You can use this information to check for conditions that cause continuous processing.

So, if you add a STOP statement, your program no longer loops continuously.

data work.getobs5(drop=obsnum);
   obsnum=5;
   set company.usa(keep=manager payroll)  point=obsnum;
   stop;
run;

But it doesn't write any observations to output, either! Remember from Chapter 6, "Understanding DATA Step Processing," on page 183 that the DATA step writes observations to output at the end of the DATA step. However, in this program, the STOP statement immediately stops processing before the end of the DATA step.

Let's see how you can write the observation to output before processing stops.

Writing Observations Explicitly

To override the default way in which the DATA step writes observations to output, you can use an OUTPUT statement in the DATA step. Placing an explicit OUTPUT statement in a DATA step overrides the automatic output, so that observations are added to a data set only when the explicit OUTPUT statement is executed.

With an OUTPUT statement, your program now writes a single observation to output—observation 5.

data work.getobs5(drop=obsnum);
   obsnum=5;
   set company.usa(keep=manager payroll) point=obsnum;
   output;
   stop;
run;
proc print data=work.getobs5 noobs;
run;
Writing Observations Explicitly

Suppose your DATA statement contains two data set names, and you include an OUTPUT statement that references only one of the data sets. The DATA step will create both data sets, but only the data set that is named in the OUTPUT statement will contain output. For example, the program below creates two temporary data sets, Empty and Full. The result of this DATA step is that the data set Empty is created but contains no observations, and the data set Full contains all of the observations from Company.Usa.

data empty full;
   set company.usa;
   output full;
run;

More Complex Ways of Using Direct Access

To convey concepts clearly, the examples in this section have been as simple as possible. However, most uses of the POINT= option are more complex. For example, POINT= is commonly used in combining data sets, not simply in reading a single data set.

You can see more complex examples of using POINT= in Chapter 15, "Generating Data with DO Loops," on page 473.

Detecting the End of a Data Set

Instead of reading specific observations, you may want to determine when the last observation in an input data set has been read, so that you can perform specific processing. For example, you might want to write to output only an observation that contains totals for variables in all observations in the data set.

To create a temporary numeric variable whose value is used to detect the last observation, you can use the END= option in the SET statement.

Example

Suppose you want to sum the number of seconds for treadmill stress tests. If you submit the following program, you produce a new data set that contains cumulative totals for each of the values of TotalTime.

data work.addtoend(drop=timemin timesec);
   set clinic.stress2(keep=timemin timesec);
   TotalMin+timemin;
   TotalSec+timesec;
   TotalTime=totalmin*60+totalsec;
run;
proc print data=work.addtoend noobs;
run;
Example

But what if you only want the final total (the last observation) in the new data set? The following program uses the END= variable last to select only the last observation of the data set. You specify END= in the SET statement and last wherever you need it in processing (here, in the subsetting IF statement).

data work.addtoend(drop=timemin timesec);
   set clinic.stress2(keep=timemin timesec)  end=last;
   TotalMin+timemin;
   TotalSec+timesec;
   TotalTime=totalmin*60+totalsec;
   if last;
run;
proc print data=work.addtoend noobs;
run;

Now the new data set has one observation:

Example

Understanding How Data Sets Are Read

In a previous chapter, you learned about the compilation and execution phases of the DATA step as they pertain to reading raw data. DATA step processing for reading existing SAS data sets is very similar. The main difference is that while reading an existing data set with the SET statement, SAS retains the values of the variables from one observation to the next.

Let's briefly look at the compilation and execution phases of DATA steps that use a SET statement. In this example, the DATA step reads the data set Finance.Loans, creates the variable Interest, and creates the new data set Finance.DueJan.

data finance.duejan;
   set finance.loans;
   Interest=amount*(rate/12);
run;
Understanding How Data Sets Are Read

Compilation Phase

  1. The program data vector is created and contains the automatic variables _N_ and _ERROR_.

    Compilation Phase
  2. SAS also scans each statement in the DATA step, looking for syntax errors.

  3. When the SET statement is compiled, a slot is added to the program data vector for each variable in the input data set. The input data set supplies the variable names, as well as attributes such as type and length.

    Compilation Phase
  4. Any variables that are created in the DATA step are also added to the program data vector. The attributes of each of these variables are determined by the expression in the statement.

    Compilation Phase
  5. At the bottom of the DATA step, the compilation phase is complete, and the descriptor portion of the new SAS data set is created. There are no observations because the DATA step has not yet executed.

When the compilation phase is complete, the execution phase begins.

Execution Phase

  1. The DATA step executes once for each observation in the input data set. For example, this DATA step will execute four times because there are four observations in the input data set Finance.Loans.

  2. At the beginning of the execution phase, the value of _N_ is1. Because there are no data errors, the value of _ERROR_ is0. The remaining variables are initialized to missing. Missing numeric values are represented by a period, and missing character values are represented by a blank.

    Execution Phase
  3. The SET statement reads the first observation from the input data set and writes the values to the program data vector.

    Execution Phase
  4. Then, the assignment statement executes to compute the value for Interest.

    Execution Phase
  5. At the end of the first iteration of the DATA step, the values in the program data vector are written to the new data set as the first observation.

    Execution Phase
  6. The value of _N_ is set to2, and control returns to the top of the DATA step. Remember, the automatic variable _N_ keeps track of how many times the DATA step has begun to execute.

    Execution Phase
  7. SAS retains the values of variables that were read from a SAS data set with the SET statement, or that were created by a sum statement. All other variable values, such as the values of the variable Interest, are set to missing.

    Execution Phase
    Execution Phase
    • variables named in a RETAIN statement

    • variables created in a sum statement

    • data elements in a _TEMPORARY_ array

    • any variables created by using options in the FILE or INFILE statements

    • automatic variables.

  8. At the beginning of the second iteration, the value of _ERROR_ is reset to 0.

    Execution Phase
  9. As the SET statement executes, the values from the second observation are written to the program data vector.

    Execution Phase
  10. The assignment statement executes again to compute the value for Interest for the second observation.

    Execution Phase
  11. At the bottom of the DATA step, the values in the program data vector are written to the data set as the second observation.

    Execution Phase
  12. The value of _N_ is set to 3, and control returns to the top of the DATA step. SAS retains the values of variables that were read from a SAS data set with the SET statement, or that were created by a sum statement. All other variable values, such as the values of the variable Interest, are set to missing.

    Execution Phase

This process continues until all of the observations are read.

Additional Features

The DATA step provides many other programming features for manipulating data sets. For example, you can

  • use IF-THEN/ELSE logic with DO groups and DO loops to control processing based on one or more conditions

  • specify additional data set options

  • process variables in arrays

  • use SAS functions.

You can also combine SAS data sets in several ways, including match merging, interleaving, one-to-one merging, and updating.

Chapter Summary

Text Summary

Setting Up

Before you can create a new data set, you must assign a libref to the SAS data library that will store the data set.

Reading a Single Data Set

After you have referenced the library in which your data set is stored, you can write a DATA step to name the SAS data set to be created. You then specify the data set to be read in the SET statement.

Selecting Variables

You can select the variables you want to drop or keep by using the DROP= and KEEP= data set options in parentheses after a SAS data set name. For convenience, use DROP= if more variables are kept than dropped.

BY-Group Processing

Use the BY statement in the DATA step to group observations for processing. When you use the BY statement with the SET statement, the DATA step automatically creates two temporary variables, FIRST. and LAST. When you specify multiple BY variables, a change in the value of a primary BY variable forces LAST.variable to equal 1 for the secondary BY variables.

Reading Observations Using Direct Access

In addition to reading input data sequentially, you can access observations directly by using the POINT= option to go directly to a data set observation. There is no end-of-file condition when you use direct access, so include an explicit OUTPUT statement and then the STOP statement to prevent continuous looping.

Detecting the End of a Data Set

To determine when the last observation in an input data set has been read, use the END= option in the SET statement. The specified variable is initialized to 0, then set to 1 when the SET statement reads the last observation of the data set.

Syntax

LIBNAME libref 'SAS-data-library';

DATA <SAS-data-set> (KEEP= variable-1 <…variable-n>);

      SET SAS-data-set (DROP= variable-1 <…variable-n>)

      POINT=variable | END=variable;

      OUTPUT <SAS-data-set>;

      STOP;

RUN;

Sample Program

proc sort data=company.usa out=work.temp2;
   by manager jobtype;
data company.budget2(keep=manager jobtype payroll);
   set work.temp2;
   by manager jobtype;
   if wagecat='S' then Yearly=wagerate*12;
   else if wagecat='H' then Yearly=wagerate*2000;
   if first.jobtype then Payroll=0;
   payroll+yearly;
   if last.jobtype;
run;


data work.getobs5(drop=obsnum);
   obsnum=5;
   set company.usa(keep=manager payroll) point=obsnum;
   output;
   stop;
run;


data work.addtoend(drop=timemin timesec);
   set clinic.stress2(keep=timemin timesec) end=last;
   TotalMin+timemin;
   TotalSec+timesec;
   TotalTime=totalmin*60+totalsec;
   if last;
run;

Points to Remember

  • When you perform BY-group processing, the data sets listed in the SET statement must either be sorted by the values of the BY variable(s), or they must have an appropriate index.

  • When using direct access to read data, you must prevent continuous looping. Add a STOP statement to the DATA step, or use programming logic that checks for an invalid value of the POINT= variable.

  • Do not specify the END= option with the POINT= option in a SET statement.

Chapter Quiz

Select the best answer for each question. After completing the quiz, you can check your answers using the answer key in the appendix.

  1. If you submit the following program, which variables appear in the new data set?

    data work.cardiac(drop=age group);
       set clinic.fitness(keep=age weight group);
       if group=2 and age>40;
    run;
    1. none

    2. Weight

    3. Age, Group

    4. Age, Weight, Group

  2. Which of the following programs correctly reads the data set Orders and creates the data set FastOrdr?

    1. data catalog.fastordr(drop=ordrtime);
         set july.orders(keep=product units price);
         if ordrtime<4;
         Total=units*price;
      run;
    2. data catalog.orders(drop=ordrtime);
         set july.fastordr(keep=product units price);
         if ordrtime<4;
         Total=units*price;
      run;
    3. data catalog.fastordr(drop=ordrtime);
         set july.orders(keep=product units price
                         ordrtime);
         if ordrtime<4;
         Total=units*price;
      run;
    4. none of the above

  3. Which of the following statements is false about BY-group processing?

    When you use the BY statement with the SET statement:

    1. The data sets listed in the SET statement must be indexed or sorted by the values of the BY variable(s).

    2. The DATA step automatically creates two variables, FIRST. and LAST., for each variable in the BY statement.

    3. FIRST. and LAST. identify the first and last observation in each BY group, respectively.

    4. FIRST. and LAST. are stored in the data set.

  4. There are 500 observations in the data set Usa. What is the result of submitting the following program?

    data work.getobs5(drop=obsnum);
       obsnum=5;
       set company.usa(keep=manager payroll) point=obsnum;
       stop;
    run;
    1. an error

    2. an empty data set

    3. continuous loop

    4. a data set that contains one observation

  5. There is no end-of-file condition when you use direct access to read data, so how can your program prevent a continuous loop?

    1. Do not use a POINT= variable.

    2. Check for an invalid value of the POINT= variable.

    3. Do not use an END= variable.

    4. Include an OUTPUT statement.

  6. Assuming that the data set Company.USA has five or more observations, what is the result of submitting the following program?

    data work.getobs5(drop=obsnum);
       obsnum=5;
       set company.usa(keep=manager payroll) point=obsnum;
       output;
       stop;
    run;
    1. an error

    2. an empty data set

    3. a continuous loop

    4. a data set that contains one observation

  7. Which of the following statements is true regarding direct access of data sets?

    1. You cannot specify END= with POINT=.

    2. You cannot specify OUTPUT with POINT=.

    3. You cannot specify STOP with END=.

    4. You cannot specify FIRST. with LAST.

  8. What is the result of submitting the following program?

    data work.addtoend;
       set clinic.stress2 end=last;
       if last;
    run;
    1. an error

    2. an empty data set

    3. a continuous loop

    4. a data set that contains one observation

  9. At the start of DATA step processing, during the compilation phase, variables are created in the program data vector (PDV), and observations are set to:

    1. blank

    2. missing

    3. 0

    4. there are no observations.

  10. The DATA step executes:

    1. continuously if you use the POINT= option and the STOP statement.

    2. once for each variable in the output data set.

    3. once for each observation in the input data set.

    4. until it encounters an OUTPUT statement.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset