If you completed Chapter 5, "Creating SAS Data Sets from Raw Files and Excel Work-sheets," on page 135, you learned how to read data, perform basic modifications, and create a new SAS data set.
This chapter teaches you what happens behind the scenes when the DATA step reads raw data. You'll examine the program data vector, which is a logical framework that SAS uses when creating SAS data sets.
Understanding how the program operates can help you to anticipate how variables will be created and processed, to plan your modifications, and to interpret and debug program errors. It also gives you useful strategies for preventing and correcting common DATA step errors.
In this chapter, you learn to
identify the two phases that occur when a DATA step is processed
interpret automatic variables
identify the processing phase in which an error occurs
debug SAS DATA steps
validate and clean invalid data
test programs by limiting the number of observations that are created
flag errors in the SAS log.
If you completed Chapter 5, "Creating SAS Data Sets from Raw Files and Excel Work-sheets," on page 135, you learned how to write a DATA step to create a permanent SAS data set from raw data that is stored in an external file.
You learned how to submit the DATA step and how to check the log to see whether the step ran successfully.
You also learned how to display the contents of the data set with the PRINT procedure.
proc print data=clinic.stress; run;
When you submit a DATA step, SAS processes the DATA step and then creates a new SAS data set. Let's see exactly how that happens.
A SAS DATA step is processed in two phases:
During the compilation phase, each statement is scanned for syntax errors. Most syntax errors prevent further processing of the DATA step. When the compilation phase is complete, the descriptor portion of the new data set is created.
If the DATA step compiles successfully, then the execution phase begins. During the execution phase, the DATA step reads and processes the input data. The DATA step executes once for each record in the input file, unless otherwise directed.
The diagram below shows the general flow of DATA step processing for reading raw data. We'll examine both the compilation phase and the execution phase in this chapter.
Let's start with the compilation phase.
At the beginning of the compilation phase, the input buffer (an area of memory) is created to hold a record from the external file. The input buffer is created only when raw data is read, not when a SAS data set is read. The term input buffer refers to a logical concept; it is not a physical storage area.
After the input buffer is created, the program data vector is created. The program data vector is the area of memory where SAS builds a data set, one observation at a time. Like the term input buffer, the term program data vector refers to a logical concept.
The program data vector contains two automatic variables that can be used for processing but which are not written to the data set as part of an observation.
_N_ counts the number of times that the DATA step begins to execute.
_ERROR_ signals the occurrence of an error that is caused by the data during execution. The default value is 0
, which means there is no error. When one or more errors occur, the value is set to 1
.
During the compilation phase, SAS also scans each statement in the DATA step, looking for syntax errors. Syntax errors include
missing or misspelled keywords
invalid variable names
missing or invalid punctuation
invalid options.
As the INPUT statement is compiled, a slot is added to the program data vector for each variable in the new data set. Generally, variable attributes such as length and type are determined the first time a variable is encountered.
Any variables that are created with an assignment statement in the DATA step are also added to the program data vector. For example, the assignment statement below creates the variable Total. As the statement is compiled, the variable is added to the program data vector. The attributes of the variable are determined by the expression in the statement. Because the expression contains an arithmetic operator and produces a numeric value, Total is defined as a numeric variable and is assigned the default length of 8.
At the bottom of the DATA step (in this example, when the RUN statement is encountered), the compilation phase is complete, and the descriptor portion of the new SAS data set is created. The descriptor portion of the data set includes
the name of the data set
the number of observations and variables
the names and attributes of the variables.
At this point, the data set contains the five variables that are defined in the input data set and in the assignment statement. Remember, _N_ and _ERROR_ are not written to the data set. There are no observations because the DATA step has not yet executed. During execution, each raw data record is processed and is then written to the data set as an observation.
Let's review the compilation phase.
data perm.update; infile invent; input Item $ 1-13 IDnum $ 15-19 InStock 21-22 BackOrd 24-25; Total=instock+backord; run;
During the compilation phase, the input buffer is created to hold a record from the external file.
The program data vector is created to hold the current observation.
The descriptor portion of the SAS data set is created.
After the DATA step is compiled, it is ready for execution. During the execution phase, the data portion of the data set is created. The data portion contains the data values.
During execution, each record in the input raw data file is read, stored in the program data vector, and then written to the new data set as an observation. The DATA step executes once for each record in the input file, unless otherwise directed by additional statements.
The following DATA step reads values from the file Invent and executes nine times because there are nine records in the file.
At the beginning of the execution phase, the value of _N_ is 1
. Because there are no data errors, the value of _ERROR_ is 0
.
The remaining variables are initialized to missing. Missing numeric values are represented by periods, and missing character values are represented by blanks.
When an INPUT statement begins to read data values from a record that is held in the input buffer, it uses an input pointer to keep track of its position. The input pointer starts at column 1 of the first record, unless otherwise directed. As the INPUT statement executes, the raw data in columns 1-13 is read and is assigned to Item in the program data vector.
Notice that the input pointer now rests on column 14. With column input, the pointer moves as far as the INPUT statement instructs it, and it stops in the column immediately following the last one read.
Next, the data in columns 15-19 is read and is assigned to IDnum in the program data vector.
Likewise, the INPUT statement reads the values for InStock from columns 21-22, and it reads the values for BackOrd from columns 24-25.
Next, the assignment statement executes. The values for InStock and BackOrd are added to produce the values for Total.
At the end of the DATA step, several actions occur. First, the values in the program data vector are written to the output data set as the first observation.
Next, the value of _N_ is set to 2
and control returns to the top of the DATA step. Finally, the variable values in the program data vector are re-set to missing. Notice that the automatic variable _ERROR_ retains its value.
variables that are named in a RETAIN statement
variables that are created in a sum statement
data elements in a _TEMPORARY_ array
any variables that are created with options in the FILE or INFILE statements
automatic variables.
In contrast, when reading variables from a SAS data set, SAS sets the values to missing only before the first cycle of execution of the DATA step. Thereafter, the variables retain their values until new values become available — for example, through an assignment statement or through the next execution of a SET or MERGE statement. Variables that are created with options in the SET or MERGE statements also retain their values from one cycle of execution to the next. (You can learn about reading SAS data sets and about arrays, the SET statement, and the MERGE statement in later chapters.)
You can see that the DATA step works like a loop, repetitively executing statements to read data values and create observations one by one. Each loop (or cycle of execution) is called an iteration. At the beginning of the second iteration, the value of _N_ is 2
, and _ERROR_ is still 0
. Notice that the input pointer rests in column 1 of the second record.
As the INPUT statement executes for the second time, the values from the second record are held in the input buffer and then read into the program data vector.
Next, the value for Total is calculated based on the current values for InStock and BackOrd. The RUN statement indicates the end of the DATA step.
At the bottom of the DATA step, the values in the program data vector are written to the data set as the second observation.
Next, the value of _N_ is set to 3
, control returns to the top of the DATA step, and the values for Item, IDnum, InStock, BackOrd, and Total are re-set to missing.
The execution phase continues in this manner until the end-of-file marker is reached in the raw data file. When there are no more records in the raw data file to be read, the data portion of the new data set is complete.
Remember, the order in which variables are defined in the DATA step determines the order in which the variables are stored in the data set. The DATA step below, which reverses the order of Item and IDnum, produces a different data set from the same raw data.
You've seen how the DATA step iteratively reads records in the raw data file. Now take a minute to review execution-phase processing.
During the execution phase
variables in the program data vector are initialized to missing before each execution of the DATA step
each statement is executed sequentially
the INPUT statement reads the next record from the external file identified by the INFILE statement, and it writes the values into the program data vector
other statements can then further modify the current observation
the values in the program data vector are written to the SAS data set at the end of the DATA step
program flow is returned to the top of the DATA step
the DATA step is executed until the end-of-file marker is reached in the external file.
Now that you know how a DATA step is processed, you can use that knowledge to correct errors. If you completed Chapter 3, "Editing and Debugging SAS Programs," on page 67, you saw examples of errors that are detected during the compilation phase, including
misspelled keywords and data set names
missing semicolons
unbalanced quotation marks
invalid options.
During the compilation phase, SAS can interpret some syntax errors (such as the keyword DATA misspelled as DAAT). If it cannot interpret the error, SAS
prints the word ERROR followed by an error message in the log
compiles but does not execute the step where the error occurred, and prints the following message to warn you:
NOTE: The SAS System stopped processing this step because of errors.
Some errors are explained fully by the message that SAS prints; other error messages are not as easy to interpret. For example, because SAS statements are free-format, when you fail to end a SAS statement with a semicolon, SAS does not always detect the error at the point where it occurs.
As you have seen, errors can occur in the compilation phase, resulting in a DATA step that is compiled but not executed. Errors can also occur during the execution phase. When SAS detects an error in the execution phase, the following can occur, depending on the type of error:
A note, warning, or error message is displayed in the log.
The values that are stored in the program data vector are displayed in the log.
The processing of the step either continues or stops.
Suppose you misspelled the fileref in the INFILE statement below. This is not a syntax error, because SAS does not validate the file that you reference until the execution phase. During the compilation phase, the fileref Invnt is assumed to reference some external raw data file.
data perm.update;
infile invnt;
input Item $ 1-13 IDnum $ 15-19
InStock 21-22 BackOrd 24-25;
Total=instock+backord;
run;
This error is not detected until the execution phase begins. Because there is no external file that is referenced by the fileref Invnt, the DATA step stops processing.
Because Invent is misspelled, the statement in the DATA step that identifies the raw data is incorrect. Note, however, that the correct number of variables was defined in the descriptor portion of the data set.
Incorrectly identifying a variable's type is another common execution-time error. As you know, the values for IDnum are character values. Suppose you forget to place the dollar sign ($) after the variable's name in your INPUT statement. This is not a compile-time error, because SAS cannot verify IDnum's type until the data values for IDnum are read.
In this case, the DATA step completes the execution phase, and the observations are written to the data set. However, several notes appear in the log.
Each note identifies the location of the invalid data for each observation. In this example, the invalid data is located in columns 15-19 for all observations.
The second line in each note (excluding the RULE line) displays the raw data record. Notice that the second field displays the values for IDnum, which are obviously character values.
The third and fourth lines display the values that are stored in the program data vector. Here, the values for IDnum are missing, although the other values have been correctly assigned to their respective variables. Notice that _ERROR_ has a value of 1
, indicating that an error has occurred.
The PRINT procedure displays the data set, showing that the values for IDnum are missing. In this example, the periods indicate that IDnum is a numeric variable, although it should be defined as a character variable.
proc print data=perm.update; run;
When you read raw data with the DATA step, it's important to check the SAS log to verify that your data was read correctly. Here is a typical message:
When no observations are written to the data set, you should check to see whether your DATA step was completely executed. Most likely, a syntax error or another error is being detected at the beginning of the execution phase.
An invalid data message indicates that the program executed, but the data is not acceptable. Typically, the message indicates that a variable's type has been incorrectly identified in the INPUT statement, or that the raw data file contains some invalid data value(s).
Data errors occur when data values are not appropriate for the SAS statements that are specified in a program. SAS detects data errors during program execution. When a data error is detected, SAS continues to execute the program.
In general, SAS procedures analyze data, produce output, or manage SAS files. In addition, SAS procedures can be used to detect invalid data. The following procedures can be used to detect invalid data:
PROC PRINT
PROC FREQ
PROC MEANS
The PRINT procedure displays the data set, showing that the values for IDnum are missing. In this example, the periods indicate that IDnum is a numeric variable, although it should be defined as a character variable.
proc print data=perm.update; run;
The FREQ procedure detects invalid character and numeric values by looking at distinct values. You can use PROC FREQ to identify any variables that not given an expected value.
In the following example, the data set contains an invalid character for the variable Gender. PROC FREQ displays the number of distinct values for each variable in the data set. In the output you can see both the valid (M
and F
) and invalid (G
) values for Gender.
proc freq data=work.Check_Gender; run;
You can also use PROC FREQ with the TABLES statement to produce a frequency table for specific variables.
proc freq data=work.Check_Gender; tables Gender Age;
run;
In both the gender and age FREQ tables, one observation needs data cleaned as shown below:
The MEANS procedure can also be used to validate data because it produces summary reports displaying descriptive statistics. For example, PROC MEANS can show whether the values for a particular variable are within their expected range.
Using the same data set as in the previous example, we can submit PROC MEANS to determine if the Age of all test subjects is within a reasonable range. Notice the VAR statementis specified with that particular variable to get the statistical information, or range, of the data values.
proc means data=work.Check_Range; var Age; run;
The output for the MEANS procedure displays a range of 26 to 242, which clearly indicates that there is invalid data somewhere in the Age column.
You can use an assignment statement or a conditional clause to programmatically clean invalid data once it is identified. This example shows how to use the IF-THEN statement to isolate and change the invalid data that was found for the Age variable in the previous example. To avoid overwriting your original data set, it may be a good idea to use the DATA statement to create a new data set. The new data set will contain all the data from your original table except for any invalid data that meets the conditions of the IF-THEN statement:
data work.clean_data; set work.patients; if Age>110 then delete; proc print data=clean_data; run;
Similarly, to fix the invalid data for the Gender variable you can use an assignment statement along with an IF-THEN statement to identify and remove the invalid data:
data work.clean_data; set work.patients; gender=upcase(Gender); if Gender not in ('M', 'F') then delete; proc print data=clean_data; run;
After you write or edit a DATA step, you can compile and execute your program without creating observations. This enables you to detect the most common errors and saves you development time. A simple way to test a DATA step is to specify the keyword _NULL_ as the data set name in the DATA statement.
data _null_;
infile invent;
input Item $ 1-13 IDnum $ 15-19
InStock 21-22 BackOrd 24-25;
Total=instock+backord;
run;
When you submit the DATA step, no data set is created, but any compilation or execution errors are written to the log. After correcting any errors, you can replace _NULL_ with the name of the data set you want to create.
Remember that you can use the OBS= option in the INFILE statement to limit the number of observations that are read or created during the execution of the DATA step.
data perm.update;
infile invent obs=10;
input Item $ 1-13 IDnum $ 15-19
InStock 21-22 BackOrd 24-25;
Total=instock+backord;
run;
When processed, this DATA step creates the perm.update data set with variables but with only ten observations.
When the source of program errors is not apparent, you can use the PUT statement to examine variable values and to print your own message in the log. For diagnostic purposes, you can use IF-THEN/ELSE statements to conditionally check for values. You can learn about IF-THEN/ELSE statements in detail in Chapter 11, "Creating and Managing Variables," on page 317.
data work.test; infile loan; input Code $ 1 Amount 3-10 Rate 12-16 Account $ 18-25 Months 27-28; if code='1' then type='variable'; else if code='2' then type='fixed'; else put 'MY NOTE: invalid value: ' code=; run;
In this example, if CODE does not have the expected values of 1
or 2
, the PUT statement writes a message to the log:
You can use a PUT statement to specify a character string to identify your message in the log. The character string must be enclosed in quotation marks.
put 'MY NOTE: The condition was met.';
You can use a PUT statement to specify one or more data set variables to be examined for that iteration of the DATA step:
put 'MY NOTE: invalid value: '
code type;
Note that when you specify a variable in the PUT statement, only its value is written to the log. To write both the variable name and its value in the log, add an equal sign (=) to the variable name.
put 'MY NOTE: invalid value: '
code= type=;
You can use a PUT statement to display the values of the automatic variables _N_ and _ERROR_. In some cases, knowing the value of _N_ can help you locate an observation in the data set:
put 'MY NOTE: invalid value: '
code= _n_= _error_=;
You can also use a PUT statement to write all variable names and variable values, including automatic variables, to the log. Use the _ALL_ specification:
put 'MY NOTE: invalid value: ' _all_ ;
You can use a PUT statement with conditional processing (that is, with IF-THEN/ELSE statements) to flag program errors or data that is out of range. In the example below, the PUT statement is used to flag any missing or zero values for the variable Rate.
data finance.newcalc;
infile newloans;
input LoanID $ 1-4 Rate 5-8 Amount 9-19;
if rate>0 then
Interest=amount*(rate/12);
else put 'DATA ERROR ' rate= _n_=;
run;
A SAS DATA step is processed in two distinct phases. During the compilation phase, each statement is scanned for syntax errors. During the execution phase, the DATA step writes observations to the new data set.
At the beginning of the compilation phase, the input buffer and the program data vector are created. The program data vector is the area of memory where data sets are built, one observation at a time. Two automatic variables are also created: _N_ counts the number of DATA step executions, and _ERROR_ signals the occurrence of an error. DATA step statements are checked for syntax errors, such as invalid options or misspellings.
During the execution phase, each record in the input file is processed, stored in the program data vector, and then written to the new data set as an observation. The DATA step executes once for each record in the input file, unless otherwise directed.
Missing semicolons, misspelled keywords, and invalid options will cause syntax errors in the compilation phase. Detected errors are underlined and are identified with a number and message in the log. If SAS can interpret a syntax error, the DATA step compiles and executes; if SAS cannot interpret the error, the DATA step compiles but doesn't execute.
Illegal mathematical operations or processing a character variable as numeric will cause errors in the execution phase. Depending on the type of error, the log might show a warning and might include invalid data from the program data vector, and the DATA step either stops or continues.
To detect common errors and save development time, compile and execute your program without creating observations. Specify the keyword _NULL_ as the data set name to view compilation or execution errors without creating a data set. Or use the OBS= option in the INFILE statement to limit the number of observations that are read or created during the DATA step. You can also use the PUT statement to examine variable values and to generate your own message in the log.
DATA <_NULL_|SAS-data-set>;
INFILE file-specification OBS=n;
INPUT variable-1 informat-1 input-pointer-control-1 < . . .variable-n informat-n
input-pointer-control-n;>< >
PUT specification(s);
RUN;
data perm.update; infile invent; input Item $ 1-13 IDnum $ 15-19 InStock 21-22 BackOrd 24-25; Total=instock+backord; run;
data work.test; infile loan; input Code $ 1 Amount 3-10; if code='1' then type='variable'; else if code='2' then type='fixed'; else put 'MY NOTE: invalid value: ' code=; run;
Making, diagnosing, and resolving errors is part of the process of writing programs. However, checking for common errors will save you time and trouble. Make sure that
each SAS statement ends with a semicolon
filenames are spelled correctly
keywords are spelled correctly.
In SAS output, missing numeric values are represented by periods, and missing character values are left blank.
The order in which variables are defined in the DATA step determines the order in which the variables are stored in the data set.
Standard character values can include numbers, but numeric values cannot include characters.
Select the best answer for each question. After completing the quiz, you can check your answers using the answer key in the appendix.
Which of the following is not created during the compilation phase?
the data set descriptor
the first observation
the program data vector
the _N_ and _ERROR_ automatic variables
During the compilation phase, SAS scans each statement in the DATA step, looking for syntax errors. Which of the following is not considered a syntax error?
incorrect values and formats
invalid options or variable names
missing or invalid punctuation
missing or misspelled keywords
Unless otherwise directed, the DATA step executes...
once for each compilation phase.
once for each DATA step statement.
once for each record in the input file.
once for each variable in the input file.
At the beginning of the execution phase, the value of _N_ is 1
, the value of _ERROR_ is 0
, and the values of the remaining variables are set to:
0
1
undefined
missing
Suppose you run a program that causes three DATA step errors. What is the value of the automatic variable _ERROR_ when the observation that contains the third error is processed?
0
1
2
3
Which of the following actions occurs at the end of an iteration of the DATA step?
The automatic variables _N_ and _ERROR_ are incremented by one.
The DATA step stops execution.
The descriptor portion of the data set is written.
The values of variables created in programming statements are re-set to missing in the program data vector.
Look carefully at the DATA step shown below. Based on the INPUT statement, in what order will the variables be stored in the new data set?
data perm.update; infile invent; input IDnum $ Item $ 1-13 Instock 21-22 BackOrd 24-25; Total=instock+backord; run;
IDnum Item InStock BackOrd Total
Item IDnum InStock BackOrd Total
Total IDnum Item InStock BackOrd
Total Item IDnum InStock BackOrd
If SAS cannot interpret syntax errors, then...
data set variables will contain missing values.
the DATA step does not compile.
the DATA step still compiles, but it does not execute.
the DATA step still compiles and executes.
What is wrong with this program?
data perm.update; infile invent input Item $ 1-13 IDnum $ 15-19 Instock 21-22 BackOrd 24-25; total=instock+backord; run;
missing semicolon on second line
missing semicolon on third line
incorrect order of variables
incorrect variable type
Look carefully at this section of a SAS session log. Based on the note, what was the most likely problem with the DATA step?
A keyword was misspelled in the DATA step.
A semicolon was missing from the INFILE statement.
A variable was misspelled in the INPUT statement.
A dollar sign was missing in the INPUT statement.