Overview 346
Introduction 346
Objectives 346
Chapter Summary 365
Chapter Quiz 367
You've learned about creating a SAS data set from raw data. However, you may often want to create a new data set from an existing SAS data set. To create the new data set, you can read a data set using the DATA step. As you read the data set, you can use all the programming features of the DATA step to manipulate your data.
This chapter shows you how to use the DATA step to read an existing SAS data set. When you create your new data set, you can choose variables, select observations based on one or more conditions, and assign values conditionally. You can also assign variable attributes such as formats and labels.
In this chapter, you learn to
create a new data set from an existing data set
use BY groups to process observations
read observations by observation number
stop processing when necessary
explicitly write observations to output
detect the last observation in a data set
identify differences in DATA step processing for raw data and DATA step processing for existing data sets.
Suppose you want to create a small data set, Lab23.Drug1H, from the Research.CLTrials data set, which contains information on treadmill test time and relative tolerance levels.
To create the data set, you must first reference the library in which CLTrials is stored and then the library in which you want to store Drug1H. Then you write a DATA step to read your data and create a new data set.
After you write a DATA step to name the SAS data set to be created, you specify the data set that will be read in the SET statement. The DATA statement below creates the permanent SAS data set Drug1H, which will be stored in a SAS data library to which the libref Lab23 has been assigned. The SET statement below reads the permanent SAS data set Research.CLTrials.
libname lab23 'c:drugallergylabtests'; libname research 'c:drugallergy'; data lab23.drug1h; set research.cltrials; run;
The DATA step above reads all observations and variables from the existing data set into the new data set. When you submit this DATA step, the following messages appear in the log, confirming that the new data set was created:
In the previous practice, you created a data set that was identical to Sasuser.Stress2. But you usually don't want an exact duplicate of the existing data set. When you read a data set, you can use any of the programming features of the DATA step to manipulate your data.
For example, you can use any of the statements and data set options that you learned in previous chapters.
To do this… | Use this type of statement... |
---|---|
Subset data |
|
Drop unwanted variables |
|
Create or modify a variable |
|
Initialize a sum variable Sum accumulated values |
|
Specify a variable's length |
|
Execute statements conditionally |
|
Label a variable Format a variable |
|
The following DATA step reads the data set Research.CLTrials, selects observations and variables, and creates new variables.
data lab23.drug1h(drop=placebo uric); set research.cltrials(drop=triglyc); if sex='M' then delete; if placebo='YES'; TestDate='22MAY1999'd; retain Days 30; days+1; length Retest $ 5; if cholesterol>190 then retest='YES'; else if 150<=cholesterol<=190 then retest='CHECK'; else if cholesterol<150 then retest='NO'; label retest='Perform Cholesterol Test 2?'; format enddate mmddyy10.; run;
You've learned that you can specify the DROP= and KEEP= data set options anywhere you name a SAS data set. However, using DROP= and KEEP= when reading an existing data set requires that you decide where to specify these options. You can specify DROP= and KEEP= in either the DATA statement or the SET statement, depending on whether or not you want to process values of the variables in that DATA step:
If you don't process certain variables and you don't want them to appear in the new data set, specify them in the DROP= option in the SET statement.
In the DATA step shown below, the DROP= option in the SET statement prevents the variables Triglycerides and UricAcid from being read. These variables won't appear in the Lab23.Drug1H data set.
data lab23.drug1h(drop=placebo);
set research.cltrials(drop=triglycerides uricacid);
if placebo='YES';
run;
If you do need to process a variable in the original data set (in a subsetting IF statement, for example), you must specify the variable in the DROP= option in the DATA statement. Otherwise, the statement that is using the variable for processing causes an error.
This DATA step uses the variable Placebo to select observations. To drop Placebo from the new data set, the DROP= option must appear in the DATA statement.
data lab23.drug1h(drop=placebo);
set research.cltrials(drop=triglycerides uricacid);
if placebo='YES';
run;
When used in the DATA statement, the DROP= option simply drops the variables from the new data set. However, they are still read from the original data set and are available within the DATA step.
If you completed Chapter 4, "Creating List Reports," on page 99, you learned to use a BY statement in PROC SORT to sort observations and in PROC PRINT to group observations for subtotals. You can also use the BY statement in the DATA step to group observations for processing.
data temp;
set salary;
by dept;
run;
When you use the BY statement with the SET statement,
the data sets that are listed in the SET statement must be sorted by the values of the BY variable(s), or they must have an appropriate index.
the DATA step creates two temporary variables for each BY variable. One is named FIRST.variable, where variable is the name of the BY variable, and the other is named LAST.variable. Their values are either 1
or 0
. FIRST.variable and LASTvariable. identify the first and last observation in each BY group.
This variable … | Equals … |
---|---|
FIRST.variable | 1 for the first observation in a BY group |
| 0 for any other observation in a BY group |
LAST.variable | 1 for the last observation in a BY group |
| 0 for any other observation in a BY group |
To work with FIRST.variable and LAST.variable, let's look at a different set of data. The Company.USA data set contains payroll information for individual employees. Suppose you want to compute the annual payroll by department. Assume 2,000 work hours per year for hourly employees.
Before computing the annual payroll, you need to group observations by values of the variable Dept.
The following program computes the annual payroll by department. Notice that the variable name Dept has been appended to FIRST. and LAST.
proc sort data=company.usa out=work.temp; by dept; run; data company.budget(keep=dept payroll); set work.temp; by dept; if wagecat='S' then Yearly=wagerate*12; else if wagecat='H' then Yearly=wagerate*2000; if first.dept then Payroll=0; payroll+yearly; if last.dept; run;
If you could look behind the scenes at the program data vector (PDV) as the Company.Budget data set is being created, you would see the following. Notice the values for FIRST.Dept and LAST.Dept.
When you print the new data set, you can now list and sum the annual payroll by department.
proc print data=company.budget noobs; sum payroll; format payroll dollar12.2; run;
When you specify multiple BY variables,
FIRST.variable for each variable is set to 1 at the first occurrence of a new value for the variable
a change in the value of a primary BY variable forces LAST.variable to equal 1 for the secondary BY variables.
Suppose you now want to compute the annual payroll by job type for each manager. In your program, you specify two BY variables, Manager and JobType.
proc sort data=company.usa out=work.temp2; by manager jobtype;
data company.budget2(keep=manager jobtype payroll);
set work.temp2;
by manager jobtype;
if wagecat='S' then Yearly=wagerate*12;
else if wagecat='H' then Yearly=wagerate*2000;
if first.jobtype then Payroll=0;
payroll+yearly;
if last.jobtype;
run;
If you could look at the PDV now, you would see the following. Notice that the values for FIRST.JobType and LAST.JobType change according to values of FIRST.Manager and LAST.Manager.
Now you can sum the annual payroll by job type for each manager. Here, the payroll for only two managers (Coxe and Delgado) is listed.
proc print data=company.budget2 noobs;; by manager; var jobtype; sum payroll; where manager in ('Coxe','Delgado'), format payroll dollar12.2; run;
So far in this chapter, you've read the observations in an input data set sequentially. That is, you have accessed observations in the order in which they appear in the physical file. However, you can also access observations directly, by going straight to an observation in a SAS data set without having to process each observation that precedes it.
To access observations directly by their observation number, you use the POINT= option in the SET statement.
Let's suppose you want to read only the fifth observation from a data set. In the following DATA step, the value 5 is assigned to the variable ObsNum. The POINT= option reads the value of ObsNum to determine which observation to read from the data set Company.USA.
data work.getobs5; obsnum=5; set company.usa(keep=manager payroll) point=obsnum; run;
But let's see what would happen if you submitted this program.
As you learned in a previous chapter, the DATA step continues to read observations until it reaches the end-of-file marker in the input data. However, because the POINT= option reads only specified observations, SAS cannot read an end-of-file indicator as it would if the file were being read sequentially. So submitting the following program would cause continuous looping.
data work.getobs5(drop=obsnum); obsnum=5; set company.usa(keep=manager payroll) point=obsnum; run;
Because there is no end-of-file condition when you use direct access to read data, you must take either or both of the following precautions:
Use a STOP statement to prevent continuous looping. The STOP statement causes SAS to stop processing the current DATA step immediately and to resume processing statements after the end of the current DATA step.
Use programming logic that checks for an invalid value of the POINT= variable. If SAS reads an invalid value for the POINT= variable, it sets the automatic variable _ERROR_ to1
. You can use this information to check for conditions that cause continuous processing.
So, if you add a STOP statement, your program no longer loops continuously.
data work.getobs5(drop=obsnum); obsnum=5; set company.usa(keep=manager payroll) point=obsnum; stop; run;
But it doesn't write any observations to output, either! Remember from Chapter 6, "Understanding DATA Step Processing," on page 183 that the DATA step writes observations to output at the end of the DATA step. However, in this program, the STOP statement immediately stops processing before the end of the DATA step.
Let's see how you can write the observation to output before processing stops.
To override the default way in which the DATA step writes observations to output, you can use an OUTPUT statement in the DATA step. Placing an explicit OUTPUT statement in a DATA step overrides the automatic output, so that observations are added to a data set only when the explicit OUTPUT statement is executed.
With an OUTPUT statement, your program now writes a single observation to output—observation 5.
data work.getobs5(drop=obsnum);
obsnum=5;
set company.usa(keep=manager payroll) point=obsnum;
output;
stop;
run;
proc print data=work.getobs5 noobs;
run;
Suppose your DATA statement contains two data set names, and you include an OUTPUT statement that references only one of the data sets. The DATA step will create both data sets, but only the data set that is named in the OUTPUT statement will contain output. For example, the program below creates two temporary data sets, Empty and Full. The result of this DATA step is that the data set Empty is created but contains no observations, and the data set Full contains all of the observations from Company.Usa.
data empty full; set company.usa; output full; run;
To convey concepts clearly, the examples in this section have been as simple as possible. However, most uses of the POINT= option are more complex. For example, POINT= is commonly used in combining data sets, not simply in reading a single data set.
You can see more complex examples of using POINT= in Chapter 15, "Generating Data with DO Loops," on page 473.
Instead of reading specific observations, you may want to determine when the last observation in an input data set has been read, so that you can perform specific processing. For example, you might want to write to output only an observation that contains totals for variables in all observations in the data set.
To create a temporary numeric variable whose value is used to detect the last observation, you can use the END= option in the SET statement.
Suppose you want to sum the number of seconds for treadmill stress tests. If you submit the following program, you produce a new data set that contains cumulative totals for each of the values of TotalTime.
data work.addtoend(drop=timemin timesec); set clinic.stress2(keep=timemin timesec); TotalMin+timemin; TotalSec+timesec; TotalTime=totalmin*60+totalsec; run; proc print data=work.addtoend noobs; run;
But what if you only want the final total (the last observation) in the new data set? The following program uses the END= variable last to select only the last observation of the data set. You specify END= in the SET statement and last wherever you need it in processing (here, in the subsetting IF statement).
data work.addtoend(drop=timemin timesec); set clinic.stress2(keep=timemin timesec) end=last; TotalMin+timemin; TotalSec+timesec; TotalTime=totalmin*60+totalsec; if last; run; proc print data=work.addtoend noobs; run;
Now the new data set has one observation:
In a previous chapter, you learned about the compilation and execution phases of the DATA step as they pertain to reading raw data. DATA step processing for reading existing SAS data sets is very similar. The main difference is that while reading an existing data set with the SET statement, SAS retains the values of the variables from one observation to the next.
Let's briefly look at the compilation and execution phases of DATA steps that use a SET statement. In this example, the DATA step reads the data set Finance.Loans, creates the variable Interest, and creates the new data set Finance.DueJan.
data finance.duejan;
set finance.loans;
Interest=amount*(rate/12);
run;
The program data vector is created and contains the automatic variables _N_ and _ERROR_.
SAS also scans each statement in the DATA step, looking for syntax errors.
When the SET statement is compiled, a slot is added to the program data vector for each variable in the input data set. The input data set supplies the variable names, as well as attributes such as type and length.
Any variables that are created in the DATA step are also added to the program data vector. The attributes of each of these variables are determined by the expression in the statement.
At the bottom of the DATA step, the compilation phase is complete, and the descriptor portion of the new SAS data set is created. There are no observations because the DATA step has not yet executed.
When the compilation phase is complete, the execution phase begins.
The DATA step executes once for each observation in the input data set. For example, this DATA step will execute four times because there are four observations in the input data set Finance.Loans.
At the beginning of the execution phase, the value of _N_ is1
. Because there are no data errors, the value of _ERROR_ is0
. The remaining variables are initialized to missing. Missing numeric values are represented by a period, and missing character values are represented by a blank.
The SET statement reads the first observation from the input data set and writes the values to the program data vector.
Then, the assignment statement executes to compute the value for Interest.
At the end of the first iteration of the DATA step, the values in the program data vector are written to the new data set as the first observation.
The value of _N_ is set to2
, and control returns to the top of the DATA step. Remember, the automatic variable _N_ keeps track of how many times the DATA step has begun to execute.
SAS retains the values of variables that were read from a SAS data set with the SET statement, or that were created by a sum statement. All other variable values, such as the values of the variable Interest, are set to missing.
variables named in a RETAIN statement
variables created in a sum statement
data elements in a _TEMPORARY_ array
any variables created by using options in the FILE or INFILE statements
automatic variables.
At the beginning of the second iteration, the value of _ERROR_ is reset to 0
.
As the SET statement executes, the values from the second observation are written to the program data vector.
The assignment statement executes again to compute the value for Interest for the second observation.
At the bottom of the DATA step, the values in the program data vector are written to the data set as the second observation.
The value of _N_ is set to 3
, and control returns to the top of the DATA step. SAS retains the values of variables that were read from a SAS data set with the SET statement, or that were created by a sum statement. All other variable values, such as the values of the variable Interest, are set to missing.
This process continues until all of the observations are read.
The DATA step provides many other programming features for manipulating data sets. For example, you can
use IF-THEN/ELSE logic with DO groups and DO loops to control processing based on one or more conditions
specify additional data set options
process variables in arrays
use SAS functions.
You can also combine SAS data sets in several ways, including match merging, interleaving, one-to-one merging, and updating.
Before you can create a new data set, you must assign a libref to the SAS data library that will store the data set.
After you have referenced the library in which your data set is stored, you can write a DATA step to name the SAS data set to be created. You then specify the data set to be read in the SET statement.
You can select the variables you want to drop or keep by using the DROP= and KEEP= data set options in parentheses after a SAS data set name. For convenience, use DROP= if more variables are kept than dropped.
Use the BY statement in the DATA step to group observations for processing. When you use the BY statement with the SET statement, the DATA step automatically creates two temporary variables, FIRST. and LAST. When you specify multiple BY variables, a change in the value of a primary BY variable forces LAST.variable to equal 1
for the secondary BY variables.
In addition to reading input data sequentially, you can access observations directly by using the POINT= option to go directly to a data set observation. There is no end-of-file condition when you use direct access, so include an explicit OUTPUT statement and then the STOP statement to prevent continuous looping.
LIBNAME libref 'SAS-data-library';
DATA <SAS-data-set> (KEEP= variable-1 <…variable-n>);
SET SAS-data-set (DROP= variable-1 <…variable-n>)
POINT=variable | END=variable;
OUTPUT <SAS-data-set>;
STOP;
RUN;
proc sort data=company.usa out=work.temp2; by manager jobtype; data company.budget2(keep=manager jobtype payroll); set work.temp2; by manager jobtype; if wagecat='S' then Yearly=wagerate*12; else if wagecat='H' then Yearly=wagerate*2000;
if first.jobtype then Payroll=0; payroll+yearly; if last.jobtype; run; data work.getobs5(drop=obsnum); obsnum=5; set company.usa(keep=manager payroll) point=obsnum; output; stop; run; data work.addtoend(drop=timemin timesec); set clinic.stress2(keep=timemin timesec) end=last; TotalMin+timemin; TotalSec+timesec; TotalTime=totalmin*60+totalsec; if last; run;
When you perform BY-group processing, the data sets listed in the SET statement must either be sorted by the values of the BY variable(s), or they must have an appropriate index.
When using direct access to read data, you must prevent continuous looping. Add a STOP statement to the DATA step, or use programming logic that checks for an invalid value of the POINT= variable.
Do not specify the END= option with the POINT= option in a SET statement.
Select the best answer for each question. After completing the quiz, you can check your answers using the answer key in the appendix.
If you submit the following program, which variables appear in the new data set?
data work.cardiac(drop=age group); set clinic.fitness(keep=age weight group); if group=2 and age>40; run;
none
Weight
Age, Group
Age, Weight, Group
Which of the following programs correctly reads the data set Orders and creates the data set FastOrdr?
data catalog.fastordr(drop=ordrtime); set july.orders(keep=product units price); if ordrtime<4; Total=units*price; run;
data catalog.orders(drop=ordrtime); set july.fastordr(keep=product units price); if ordrtime<4; Total=units*price; run;
data catalog.fastordr(drop=ordrtime); set july.orders(keep=product units price ordrtime); if ordrtime<4; Total=units*price; run;
none of the above
Which of the following statements is false about BY-group processing?
When you use the BY statement with the SET statement:
The data sets listed in the SET statement must be indexed or sorted by the values of the BY variable(s).
The DATA step automatically creates two variables, FIRST. and LAST., for each variable in the BY statement.
FIRST. and LAST. identify the first and last observation in each BY group, respectively.
FIRST. and LAST. are stored in the data set.
There are 500 observations in the data set Usa. What is the result of submitting the following program?
data work.getobs5(drop=obsnum); obsnum=5; set company.usa(keep=manager payroll) point=obsnum; stop; run;
an error
an empty data set
continuous loop
a data set that contains one observation
There is no end-of-file condition when you use direct access to read data, so how can your program prevent a continuous loop?
Do not use a POINT= variable.
Check for an invalid value of the POINT= variable.
Do not use an END= variable.
Include an OUTPUT statement.
Assuming that the data set Company.USA has five or more observations, what is the result of submitting the following program?
data work.getobs5(drop=obsnum); obsnum=5; set company.usa(keep=manager payroll) point=obsnum; output; stop; run;
an error
an empty data set
a continuous loop
a data set that contains one observation
Which of the following statements is true regarding direct access of data sets?
You cannot specify END= with POINT=.
You cannot specify OUTPUT with POINT=.
You cannot specify STOP with END=.
You cannot specify FIRST. with LAST.
What is the result of submitting the following program?
data work.addtoend; set clinic.stress2 end=last; if last; run;
an error
an empty data set
a continuous loop
a data set that contains one observation
At the start of DATA step processing, during the compilation phase, variables are created in the program data vector (PDV), and observations are set to:
blank
missing
0
there are no observations.
The DATA step executes:
continuously if you use the POINT= option and the STOP statement.
once for each variable in the output data set.
once for each observation in the input data set.
until it encounters an OUTPUT statement.