Chapter 6. Understanding DATA Step Processing

Overview

Introduction

If you completed Chapter 5, "Creating SAS Data Sets from Raw Files and Excel Work-sheets," on page 135, you learned how to read data, perform basic modifications, and create a new SAS data set.

This chapter teaches you what happens behind the scenes when the DATA step reads raw data. You'll examine the program data vector, which is a logical framework that SAS uses when creating SAS data sets.

Introduction

Understanding how the program operates can help you to anticipate how variables will be created and processed, to plan your modifications, and to interpret and debug program errors. It also gives you useful strategies for preventing and correcting common DATA step errors.

Objectives

In this chapter, you learn to

  • identify the two phases that occur when a DATA step is processed

  • interpret automatic variables

  • identify the processing phase in which an error occurs

  • debug SAS DATA steps

  • validate and clean invalid data

  • test programs by limiting the number of observations that are created

  • flag errors in the SAS log.

Writing Basic DATA Steps

If you completed Chapter 5, "Creating SAS Data Sets from Raw Files and Excel Work-sheets," on page 135, you learned how to write a DATA step to create a permanent SAS data set from raw data that is stored in an external file.

Writing Basic DATA Steps

You learned how to submit the DATA step and how to check the log to see whether the step ran successfully.

Writing Basic DATA Steps

You also learned how to display the contents of the data set with the PRINT procedure.

proc print data=clinic.stress;
run;
Writing Basic DATA Steps

How SAS Processes Programs

When you submit a DATA step, SAS processes the DATA step and then creates a new SAS data set. Let's see exactly how that happens.

A SAS DATA step is processed in two phases:

How SAS Processes Programs
  • During the compilation phase, each statement is scanned for syntax errors. Most syntax errors prevent further processing of the DATA step. When the compilation phase is complete, the descriptor portion of the new data set is created.

  • If the DATA step compiles successfully, then the execution phase begins. During the execution phase, the DATA step reads and processes the input data. The DATA step executes once for each record in the input file, unless otherwise directed.

The diagram below shows the general flow of DATA step processing for reading raw data. We'll examine both the compilation phase and the execution phase in this chapter.

How SAS Processes Programs

Let's start with the compilation phase.

Compilation Phase

Input Buffer

At the beginning of the compilation phase, the input buffer (an area of memory) is created to hold a record from the external file. The input buffer is created only when raw data is read, not when a SAS data set is read. The term input buffer refers to a logical concept; it is not a physical storage area.

Input Buffer

Program Data Vector

After the input buffer is created, the program data vector is created. The program data vector is the area of memory where SAS builds a data set, one observation at a time. Like the term input buffer, the term program data vector refers to a logical concept.

The program data vector contains two automatic variables that can be used for processing but which are not written to the data set as part of an observation.

  • _N_ counts the number of times that the DATA step begins to execute.

  • _ERROR_ signals the occurrence of an error that is caused by the data during execution. The default value is 0, which means there is no error. When one or more errors occur, the value is set to 1.

Program Data Vector

Syntax Checking

During the compilation phase, SAS also scans each statement in the DATA step, looking for syntax errors. Syntax errors include

  • missing or misspelled keywords

  • invalid variable names

  • missing or invalid punctuation

  • invalid options.

Data Set Variables

As the INPUT statement is compiled, a slot is added to the program data vector for each variable in the new data set. Generally, variable attributes such as length and type are determined the first time a variable is encountered.

Data Set Variables

Any variables that are created with an assignment statement in the DATA step are also added to the program data vector. For example, the assignment statement below creates the variable Total. As the statement is compiled, the variable is added to the program data vector. The attributes of the variable are determined by the expression in the statement. Because the expression contains an arithmetic operator and produces a numeric value, Total is defined as a numeric variable and is assigned the default length of 8.

Data Set Variables

Descriptor Portion of the SAS Data Set

At the bottom of the DATA step (in this example, when the RUN statement is encountered), the compilation phase is complete, and the descriptor portion of the new SAS data set is created. The descriptor portion of the data set includes

  • the name of the data set

  • the number of observations and variables

  • the names and attributes of the variables.

Descriptor Portion of the SAS Data Set

At this point, the data set contains the five variables that are defined in the input data set and in the assignment statement. Remember, _N_ and _ERROR_ are not written to the data set. There are no observations because the DATA step has not yet executed. During execution, each raw data record is processed and is then written to the data set as an observation.

Descriptor Portion of the SAS Data Set

Summary of the Compilation Phase

Let's review the compilation phase.

data perm.update;
   infile invent;
   input Item $ 1-13 IDnum $ 15-19
         InStock 21-22 BackOrd 24-25;
   Total=instock+backord;
run;

During the compilation phase, the input buffer is created to hold a record from the external file.

Summary of the Compilation Phase

The program data vector is created to hold the current observation.

Summary of the Compilation Phase

The descriptor portion of the SAS data set is created.

Summary of the Compilation Phase

Execution Phase

After the DATA step is compiled, it is ready for execution. During the execution phase, the data portion of the data set is created. The data portion contains the data values.

Execution Phase

During execution, each record in the input raw data file is read, stored in the program data vector, and then written to the new data set as an observation. The DATA step executes once for each record in the input file, unless otherwise directed by additional statements.

Execution Phase

Example

The following DATA step reads values from the file Invent and executes nine times because there are nine records in the file.

Example

Initializing Variables

At the beginning of the execution phase, the value of _N_ is 1. Because there are no data errors, the value of _ERROR_ is 0.

Initializing Variables

The remaining variables are initialized to missing. Missing numeric values are represented by periods, and missing character values are represented by blanks.

Input Data

Next, the INFILE statement identifies the location of the raw data.

Input Data

Input Pointer

When an INPUT statement begins to read data values from a record that is held in the input buffer, it uses an input pointer to keep track of its position. The input pointer starts at column 1 of the first record, unless otherwise directed. As the INPUT statement executes, the raw data in columns 1-13 is read and is assigned to Item in the program data vector.

Input Pointer

Notice that the input pointer now rests on column 14. With column input, the pointer moves as far as the INPUT statement instructs it, and it stops in the column immediately following the last one read.

Input Pointer

Next, the data in columns 15-19 is read and is assigned to IDnum in the program data vector.

Input Pointer

Likewise, the INPUT statement reads the values for InStock from columns 21-22, and it reads the values for BackOrd from columns 24-25.

Input Pointer

Next, the assignment statement executes. The values for InStock and BackOrd are added to produce the values for Total.

Input Pointer

End of the DATA Step

At the end of the DATA step, several actions occur. First, the values in the program data vector are written to the output data set as the first observation.

End of the DATA Step

Next, the value of _N_ is set to 2 and control returns to the top of the DATA step. Finally, the variable values in the program data vector are re-set to missing. Notice that the automatic variable _ERROR_ retains its value.

End of the DATA Step
End of the DATA Step
  • variables that are named in a RETAIN statement

  • variables that are created in a sum statement

  • data elements in a _TEMPORARY_ array

  • any variables that are created with options in the FILE or INFILE statements

  • automatic variables.

In contrast, when reading variables from a SAS data set, SAS sets the values to missing only before the first cycle of execution of the DATA step. Thereafter, the variables retain their values until new values become available — for example, through an assignment statement or through the next execution of a SET or MERGE statement. Variables that are created with options in the SET or MERGE statements also retain their values from one cycle of execution to the next. (You can learn about reading SAS data sets and about arrays, the SET statement, and the MERGE statement in later chapters.)

Iterations of the DATA Step

You can see that the DATA step works like a loop, repetitively executing statements to read data values and create observations one by one. Each loop (or cycle of execution) is called an iteration. At the beginning of the second iteration, the value of _N_ is 2, and _ERROR_ is still 0. Notice that the input pointer rests in column 1 of the second record.

Iterations of the DATA Step

As the INPUT statement executes for the second time, the values from the second record are held in the input buffer and then read into the program data vector.

Iterations of the DATA Step

Next, the value for Total is calculated based on the current values for InStock and BackOrd. The RUN statement indicates the end of the DATA step.

Iterations of the DATA Step

At the bottom of the DATA step, the values in the program data vector are written to the data set as the second observation.

Iterations of the DATA Step

Next, the value of _N_ is set to 3, control returns to the top of the DATA step, and the values for Item, IDnum, InStock, BackOrd, and Total are re-set to missing.

Iterations of the DATA Step

End-of-File Marker

The execution phase continues in this manner until the end-of-file marker is reached in the raw data file. When there are no more records in the raw data file to be read, the data portion of the new data set is complete.

End-of-File Marker

Remember, the order in which variables are defined in the DATA step determines the order in which the variables are stored in the data set. The DATA step below, which reverses the order of Item and IDnum, produces a different data set from the same raw data.

End-of-File Marker

Summary of the Execution Phase

You've seen how the DATA step iteratively reads records in the raw data file. Now take a minute to review execution-phase processing.

During the execution phase

  • variables in the program data vector are initialized to missing before each execution of the DATA step

  • each statement is executed sequentially

  • the INPUT statement reads the next record from the external file identified by the INFILE statement, and it writes the values into the program data vector

  • other statements can then further modify the current observation

  • the values in the program data vector are written to the SAS data set at the end of the DATA step

  • program flow is returned to the top of the DATA step

  • the DATA step is executed until the end-of-file marker is reached in the external file.

Summary of the Execution Phase

End of the Execution Phase

At the end of the execution phase, the SAS log confirms that the raw data file was read, and it displays the number of observations and variables in the data set.

End of the Execution Phase

You already know how to display the data set with the PRINT procedure.

proc print data=perm.update;
run;
End of the Execution Phase

Debugging a DATA Step

Diagnosing Errors in the Compilation Phase

Now that you know how a DATA step is processed, you can use that knowledge to correct errors. If you completed Chapter 3, "Editing and Debugging SAS Programs," on page 67, you saw examples of errors that are detected during the compilation phase, including

  • misspelled keywords and data set names

  • missing semicolons

  • unbalanced quotation marks

  • invalid options.

During the compilation phase, SAS can interpret some syntax errors (such as the keyword DATA misspelled as DAAT). If it cannot interpret the error, SAS

  • prints the word ERROR followed by an error message in the log

  • compiles but does not execute the step where the error occurred, and prints the following message to warn you:

NOTE: The SAS System stopped processing this step because of errors.

Some errors are explained fully by the message that SAS prints; other error messages are not as easy to interpret. For example, because SAS statements are free-format, when you fail to end a SAS statement with a semicolon, SAS does not always detect the error at the point where it occurs.

Diagnosing Errors in the Execution Phase

As you have seen, errors can occur in the compilation phase, resulting in a DATA step that is compiled but not executed. Errors can also occur during the execution phase. When SAS detects an error in the execution phase, the following can occur, depending on the type of error:

  • A note, warning, or error message is displayed in the log.

  • The values that are stored in the program data vector are displayed in the log.

  • The processing of the step either continues or stops.

Example

Suppose you misspelled the fileref in the INFILE statement below. This is not a syntax error, because SAS does not validate the file that you reference until the execution phase. During the compilation phase, the fileref Invnt is assumed to reference some external raw data file.

data perm.update;
   infile  invnt;
   input Item $ 1-13 IDnum $ 15-19
         InStock 21-22 BackOrd 24-25;
   Total=instock+backord;
run;

This error is not detected until the execution phase begins. Because there is no external file that is referenced by the fileref Invnt, the DATA step stops processing.

Example

Because Invent is misspelled, the statement in the DATA step that identifies the raw data is incorrect. Note, however, that the correct number of variables was defined in the descriptor portion of the data set.

Incorrectly identifying a variable's type is another common execution-time error. As you know, the values for IDnum are character values. Suppose you forget to place the dollar sign ($) after the variable's name in your INPUT statement. This is not a compile-time error, because SAS cannot verify IDnum's type until the data values for IDnum are read.

Example

In this case, the DATA step completes the execution phase, and the observations are written to the data set. However, several notes appear in the log.

Example

Each note identifies the location of the invalid data for each observation. In this example, the invalid data is located in columns 15-19 for all observations.

The second line in each note (excluding the RULE line) displays the raw data record. Notice that the second field displays the values for IDnum, which are obviously character values.

Example

The third and fourth lines display the values that are stored in the program data vector. Here, the values for IDnum are missing, although the other values have been correctly assigned to their respective variables. Notice that _ERROR_ has a value of 1, indicating that an error has occurred.

Example

The PRINT procedure displays the data set, showing that the values for IDnum are missing. In this example, the periods indicate that IDnum is a numeric variable, although it should be defined as a character variable.

proc print data=perm.update;
run;
Example

When you read raw data with the DATA step, it's important to check the SAS log to verify that your data was read correctly. Here is a typical message:

Example

When no observations are written to the data set, you should check to see whether your DATA step was completely executed. Most likely, a syntax error or another error is being detected at the beginning of the execution phase.

Example

An invalid data message indicates that the program executed, but the data is not acceptable. Typically, the message indicates that a variable's type has been incorrectly identified in the INPUT statement, or that the raw data file contains some invalid data value(s).

Validating and Cleaning Data

Data errors occur when data values are not appropriate for the SAS statements that are specified in a program. SAS detects data errors during program execution. When a data error is detected, SAS continues to execute the program.

In general, SAS procedures analyze data, produce output, or manage SAS files. In addition, SAS procedures can be used to detect invalid data. The following procedures can be used to detect invalid data:

PROC PRINT

PROC FREQ

PROC MEANS

The PRINT procedure displays the data set, showing that the values for IDnum are missing. In this example, the periods indicate that IDnum is a numeric variable, although it should be defined as a character variable.

proc print data=perm.update;
run;
Validating and Cleaning Data

The FREQ procedure detects invalid character and numeric values by looking at distinct values. You can use PROC FREQ to identify any variables that not given an expected value.

In the following example, the data set contains an invalid character for the variable Gender. PROC FREQ displays the number of distinct values for each variable in the data set. In the output you can see both the valid (M and F) and invalid (G) values for Gender.

proc freq data=work.Check_Gender;
run;
Validating and Cleaning Data

You can also use PROC FREQ with the TABLES statement to produce a frequency table for specific variables.

proc freq data=work.Check_Gender;
      tables Gender Age;
run;

In both the gender and age FREQ tables, one observation needs data cleaned as shown below:

Validating and Cleaning Data

The MEANS procedure can also be used to validate data because it produces summary reports displaying descriptive statistics. For example, PROC MEANS can show whether the values for a particular variable are within their expected range.

Using the same data set as in the previous example, we can submit PROC MEANS to determine if the Age of all test subjects is within a reasonable range. Notice the VAR statementis specified with that particular variable to get the statistical information, or range, of the data values.

proc means data=work.Check_Range;
      var Age;
run;

The output for the MEANS procedure displays a range of 26 to 242, which clearly indicates that there is invalid data somewhere in the Age column.

Validating and Cleaning Data

Cleaning the Data

You can use an assignment statement or a conditional clause to programmatically clean invalid data once it is identified. This example shows how to use the IF-THEN statement to isolate and change the invalid data that was found for the Age variable in the previous example. To avoid overwriting your original data set, it may be a good idea to use the DATA statement to create a new data set. The new data set will contain all the data from your original table except for any invalid data that meets the conditions of the IF-THEN statement:

data work.clean_data;
   set work.patients;
   if Age>110 then delete;
proc print data=clean_data;
run;

Similarly, to fix the invalid data for the Gender variable you can use an assignment statement along with an IF-THEN statement to identify and remove the invalid data:

data work.clean_data;
   set work.patients;
   gender=upcase(Gender);
   if Gender not in ('M', 'F') then delete;
proc print data=clean_data;
run;

Testing Your Programs

Writing a NULL Data Set

After you write or edit a DATA step, you can compile and execute your program without creating observations. This enables you to detect the most common errors and saves you development time. A simple way to test a DATA step is to specify the keyword _NULL_ as the data set name in the DATA statement.

data  _null_;
   infile invent;
   input Item $ 1-13 IDnum $ 15-19
         InStock 21-22 BackOrd 24-25;
   Total=instock+backord;
run;

When you submit the DATA step, no data set is created, but any compilation or execution errors are written to the log. After correcting any errors, you can replace _NULL_ with the name of the data set you want to create.

Limiting Observations

Remember that you can use the OBS= option in the INFILE statement to limit the number of observations that are read or created during the execution of the DATA step.

data perm.update;
   infile invent  obs=10;
   input Item $ 1-13 IDnum $ 15-19
         InStock 21-22 BackOrd 24-25;
   Total=instock+backord;
run;

When processed, this DATA step creates the perm.update data set with variables but with only ten observations.

PUT Statement

When the source of program errors is not apparent, you can use the PUT statement to examine variable values and to print your own message in the log. For diagnostic purposes, you can use IF-THEN/ELSE statements to conditionally check for values. You can learn about IF-THEN/ELSE statements in detail in Chapter 11, "Creating and Managing Variables," on page 317.

data work.test;
   infile loan;
   input Code $ 1 Amount 3-10 Rate 12-16
         Account $ 18-25 Months 27-28;
   if code='1' then type='variable';
   else if code='2' then type='fixed';
   else  put 'MY NOTE: invalid value: ' 
   code=;
run;

In this example, if CODE does not have the expected values of 1 or 2, the PUT statement writes a message to the log:

PUT Statement

Character Strings

You can use a PUT statement to specify a character string to identify your message in the log. The character string must be enclosed in quotation marks.

put 'MY NOTE: The condition was met.';
Character Strings

Data Set Variables

You can use a PUT statement to specify one or more data set variables to be examined for that iteration of the DATA step:

put 'MY NOTE: invalid value: '
     code type;
Data Set Variables

Note that when you specify a variable in the PUT statement, only its value is written to the log. To write both the variable name and its value in the log, add an equal sign (=) to the variable name.

put 'MY NOTE: invalid value: '
   code= type=;
Data Set Variables

Automatic Variables

You can use a PUT statement to display the values of the automatic variables _N_ and _ERROR_. In some cases, knowing the value of _N_ can help you locate an observation in the data set:

put 'MY NOTE: invalid value: '
     code= _n_= _error_=;
Automatic Variables

You can also use a PUT statement to write all variable names and variable values, including automatic variables, to the log. Use the _ALL_ specification:

put 'MY NOTE: invalid value: ' _all_ ;
Automatic Variables

Conditional Processing

You can use a PUT statement with conditional processing (that is, with IF-THEN/ELSE statements) to flag program errors or data that is out of range. In the example below, the PUT statement is used to flag any missing or zero values for the variable Rate.

data finance.newcalc;
   infile newloans;
   input LoanID $ 1-4 Rate 5-8 Amount 9-19;
   if rate>0 then
      Interest=amount*(rate/12);
   else put 'DATA ERROR ' rate= _n_=;
run;
Conditional Processing

Chapter Summary

Text Summary

How SAS Processes Programs

A SAS DATA step is processed in two distinct phases. During the compilation phase, each statement is scanned for syntax errors. During the execution phase, the DATA step writes observations to the new data set.

Compilation Phase

At the beginning of the compilation phase, the input buffer and the program data vector are created. The program data vector is the area of memory where data sets are built, one observation at a time. Two automatic variables are also created: _N_ counts the number of DATA step executions, and _ERROR_ signals the occurrence of an error. DATA step statements are checked for syntax errors, such as invalid options or misspellings.

Execution Phase

During the execution phase, each record in the input file is processed, stored in the program data vector, and then written to the new data set as an observation. The DATA step executes once for each record in the input file, unless otherwise directed.

Diagnosing Errors in the Compilation Phase

Missing semicolons, misspelled keywords, and invalid options will cause syntax errors in the compilation phase. Detected errors are underlined and are identified with a number and message in the log. If SAS can interpret a syntax error, the DATA step compiles and executes; if SAS cannot interpret the error, the DATA step compiles but doesn't execute.

Diagnosing Errors in the Execution Phase

Illegal mathematical operations or processing a character variable as numeric will cause errors in the execution phase. Depending on the type of error, the log might show a warning and might include invalid data from the program data vector, and the DATA step either stops or continues.

Testing Your Programs

To detect common errors and save development time, compile and execute your program without creating observations. Specify the keyword _NULL_ as the data set name to view compilation or execution errors without creating a data set. Or use the OBS= option in the INFILE statement to limit the number of observations that are read or created during the DATA step. You can also use the PUT statement to examine variable values and to generate your own message in the log.

Syntax

    DATA <_NULL_|SAS-data-set>;

                INFILE file-specification OBS=n;

                INPUT variable-1 informat-1 input-pointer-control-1 < . . .variable-n informat-n

                input-pointer-control-n;>< >

                PUT specification(s);

    RUN;

Sample Programs

  data perm.update;
     infile invent;
     input Item $ 1-13 IDnum $ 15-19
          InStock 21-22 BackOrd 24-25;
     Total=instock+backord;
run;
data work.test;
   infile loan;
   input Code $ 1 Amount 3-10;
   if code='1' then type='variable';
   else if code='2' then type='fixed';
   else put 'MY NOTE: invalid value: '
        code=;
run;

Points to Remember

  • Making, diagnosing, and resolving errors is part of the process of writing programs. However, checking for common errors will save you time and trouble. Make sure that

    • each SAS statement ends with a semicolon

    • filenames are spelled correctly

    • keywords are spelled correctly.

  • In SAS output, missing numeric values are represented by periods, and missing character values are left blank.

  • The order in which variables are defined in the DATA step determines the order in which the variables are stored in the data set.

  • Standard character values can include numbers, but numeric values cannot include characters.

Chapter Quiz

Select the best answer for each question. After completing the quiz, you can check your answers using the answer key in the appendix.

  1. Which of the following is not created during the compilation phase?

    1. the data set descriptor

    2. the first observation

    3. the program data vector

    4. the _N_ and _ERROR_ automatic variables

  2. During the compilation phase, SAS scans each statement in the DATA step, looking for syntax errors. Which of the following is not considered a syntax error?

    1. incorrect values and formats

    2. invalid options or variable names

    3. missing or invalid punctuation

    4. missing or misspelled keywords

  3. Unless otherwise directed, the DATA step executes...

    1. once for each compilation phase.

    2. once for each DATA step statement.

    3. once for each record in the input file.

    4. once for each variable in the input file.

  4. At the beginning of the execution phase, the value of _N_ is 1, the value of _ERROR_ is 0, and the values of the remaining variables are set to:

    1. 0

    2. 1

    3. undefined

    4. missing

  5. Suppose you run a program that causes three DATA step errors. What is the value of the automatic variable _ERROR_ when the observation that contains the third error is processed?

    1. 0

    2. 1

    3. 2

    4. 3

  6. Which of the following actions occurs at the end of an iteration of the DATA step?

    1. The automatic variables _N_ and _ERROR_ are incremented by one.

    2. The DATA step stops execution.

    3. The descriptor portion of the data set is written.

    4. The values of variables created in programming statements are re-set to missing in the program data vector.

  7. Look carefully at the DATA step shown below. Based on the INPUT statement, in what order will the variables be stored in the new data set?

    data perm.update;
       infile invent;
       input IDnum $ Item $ 1-13 Instock 21-22
             BackOrd 24-25;
       Total=instock+backord;
    run;
    1. IDnum Item InStock BackOrd Total
    2. Item IDnum InStock BackOrd Total
    3. Total IDnum Item InStock BackOrd
    4. Total Item IDnum InStock BackOrd
  8. If SAS cannot interpret syntax errors, then...

    1. data set variables will contain missing values.

    2. the DATA step does not compile.

    3. the DATA step still compiles, but it does not execute.

    4. the DATA step still compiles and executes.

  9. What is wrong with this program?

    data perm.update;
       infile invent
       input Item $ 1-13 IDnum $ 15-19 Instock 21-22
             BackOrd 24-25;
       total=instock+backord;
    run;
    1. missing semicolon on second line

    2. missing semicolon on third line

    3. incorrect order of variables

    4. incorrect variable type

  10. Look carefully at this section of a SAS session log. Based on the note, what was the most likely problem with the DATA step?

    Chapter Quiz
    1. A keyword was misspelled in the DATA step.

    2. A semicolon was missing from the INFILE statement.

    3. A variable was misspelled in the INPUT statement.

    4. A dollar sign was missing in the INPUT statement.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset