Raw data can be organized in several ways.
This external file contains data that is arranged in columns or fixed fields. You can specify a beginning and ending column for each field. However, this file contains nonstandard data, because one of the variable's values includes a special character, the dollar sign ($). |
|
This external file contains no special characters, but its data is free-format, meaning that it is not arranged in columns. Notice that the values for a particular field do not begin and end in the same columns. |
|
How your data is organized and what type of data you have determine which input style you should use to read the data. SAS provides three primary input styles: column input, formatted input, and list input. This chapter teaches you how to use column input and formatted input to read standard and nonstandard data that is arranged in fixed fields.
If you completed Chapter 5, "Creating SAS Data Sets from Raw Files and Excel Work-sheets," on page 135, you learned how to use column input to read raw data that is stored in an external file.
You can use column input to read the values for Item, IDnum, InStock, and BackOrd from the raw data file that is referenced by the fileref Invent.
Notice that the INPUT statement lists the variables with their corresponding column locations in order from left to right. However, one of the features of column input is the ability to read fields in any order.
For example, you could have read the values for InStock and BackOrd before the values for Item and IDnum.
input InStock 21-22 BackOrd 24-25 Item $ 1-13
IDnum $ 15-19;
When you print a report that is based on this data set, the variables will be listed in the order in which they were created by default.
Column input has several features that make it useful for reading raw data.
It can be used to read character variable values that contain embedded blanks.
No placeholder is required for missing data. A blank field is read as missing and does not cause other fields to be read incorrectly.
Fields or parts of fields can be re-read.
Fields do not have to be separated by blanks or other delimiters.
Standard numeric data values can contain only
numbers
decimal points
numbers in scientific, or E, notation (23E4)
minus signs and plus signs.
Some examples of standard numeric data are 15, -15, 15.4, +.05, 1.54E3, and -1.54E-3.
Nonstandard numeric data includes
values that contain special characters, such as percent signs (%), dollar signs ($), and commas (,)
date and time values
data in fraction, integer binary, real binary, and hexadecimal forms.
The external file referenced by the fileref Empdata contains the personnel information for the technical writing department of a small computer manufacturer. The fields contain values for each employee's last name, first name, job title, and annual salary.
Notice that the values for Salary contain commas. So, the values for Salary are considered to be nonstandard numeric values.
Nonstandard data values require an input style that has more flexibility than column input.
You can use formatted input, which combines the features of column input with the ability to read both standard and nonstandard data.
Whenever you encounter raw data that is organized into fixed fields, you can use
column input to read standard data only
formatted input to read both standard and nonstandard data.
Formatted input is a very powerful method for reading both standard and nonstandard data in fixed fields.
In this chapter, you'll be working with two column pointer controls.
The @n moves the input pointer to a specific column number.
The +n moves the input pointer forward to a column number that is relative to the current position.
Let's first take a look at the @n column pointer control.
The @n is an absolute pointer control that moves the input pointer to a specific column number. The @ moves the pointer to column n, which is the first column of the field that is being read.
Let's use the @n pointer control to locate variable values in the external file Empdata. As you can see, the values for LastName begin in column 1. We could start with the @1 pointer control.
input @1 LastName $7.
However, the default column pointer location is column 1, so you do not need to use a column pointer control to read the first field.
Next, the values for FirstName begin in column 9. To point to column 9, use an @sign and the column number in the INPUT statement:
Column pointer controls are very useful. For example, you can use the @n to move a pointer forward or backward when reading a record.
In this INPUT statement, the value for FirstName is read first, starting in column 9.
Now let's read the values for LastName, which begin in the first column. Here you must use the @n pointer control to move the pointer back to column 1.
The rest of the INPUT statement specifies the column locations of the raw data value for JobTitle and Salary.
The +n pointer control moves the input pointer forward to a column number that is relative to the current position. The + moves the pointer forward n columns.
In order to count correctly, it is important to understand where the column pointer control is located after each data value is read. Let's look at an example.
Suppose you want to read the data from Empdata in the following order: LastName, FirstName, Salary, JobTitle. Like the @n pointer control, the default column location for the +n pointer control is column 1. Because the values for LastName begin in column 1, a column pointer control is not needed.
With formatted input, the column pointer control moves to the first column following the field that was just read. In this example, after LastName is read, the pointer moves to column 8.
To start reading FirstName, which begins in column 9, you move the column pointer control ahead 1 column with +1.
After reading FirstName, the column pointer moves to column 14. Now you want to skip over the values for JobTitle and read the values for Salary, which begin in column 19. So you move the column pointer ahead 5 columns from column 14.
The last field to be read contains the values for JobTitle. You can use the @n column pointer control to return to column 15.
Remember that the general form of the INPUT statement for formatted input is:
INPUT <pointer-control> variable informat.;
An informat is an instruction that tells SAS how to read raw data. SAS provides many informats for reading standard and nonstandard data values. Here is a small sample.
PERCENTw.d | DATEw. | NENGOw. |
$BINARYw. | DATETIMEw | PDw.d |
$VARYINGw. | HEXw. | PERCENTw. |
$w. | JULIANw. | TIMEw. |
COMMAw.d | MMDDYYw. | w.d |
Note that
each informat contains a w value to indicate the width of the raw data field
each informat also contains a period, which is a required delimiter
for some informats, the optional d value specifies the number of implied decimal places
informats for reading character data always begin with a dollar sign ($).
The $w. informat enables you to read character data. The w represents the field width of the data value (the total number of columns that contain the raw data field).
In the example below, the $ indicates that FirstName is a character variable, the 5 indicates a field width of five columns, and a period ends the informat.
The informat for reading standard numeric data is the w.d informat.
The w specifies the field width of the raw data value, the period serves as a delimiter, and the d optionally specifies the number of implied decimal places for the value. The w.d informat ignores any specified d value if the data already contains a decimal point.
For example, the raw data value shown below contains 6 digits (4 are decimals) and 1 decimal point. Therefore, the w. informat requires a field width of only 7 to correctly read the raw data value.
In the example shown below, the values for JobTitle in columns 15-17 contain only numbers. Remember that standard numeric data values can contain only numbers, decimal points, scientific notation, and plus and minus signs.
A d value is not necessary to read the values for JobTitle. Simply move the column pointer control forward 7 spaces to column 15, name the variable, and specify a field width of 3.
The COMMAw.d informat is used to read numeric values and to remove embedded
blanks
commas
dashes
dollar signs
percent signs
right parentheses
left parentheses, which are converted to minus signs.
The COMMAw.d informat has three parts:
1. | the informat name | COMMA |
2. | a value that specifies the width of the field to be read (including dollar signs, decimal places, or other special characters), followed by a period | w. |
3. | an optional value that specifies the number of implied decimal places for a value (not necessary if the value already contains decimal places). | d |
In the example below, the values for Salary contain commas, which means that they are nonstandard numeric values.
The values for Salary begin in column 19, so use the @n or +n pointer control to point to column 19, and then name the variable.
Now add the COMMAw.d informat and specify the field width. The values end in column 27, so the field width is 9 columns. Add a RUN statement to complete the DATA step.
If you use PROC PRINT to display the data set, the commas are removed from the values for Salary in the resulting output.
Thus, the COMMAw.d informat does more than simply read the raw data values. It removes special characters such as commas from numeric data and stores only numeric values in a SAS data set.
Let's place our INPUT statement in a DATA step and submit it for processing. Remember that after the DATA step is submitted, it is compiled and then executed.
data perm.empinfo; infile empdata; input @9 FirstName $5.@1 LastName $7. +7 JobTitle 3. @19 Salary comma9.; run;
During the compile phase, the character variables in the program data vector are defined with the exact length specified by the informat. But notice that the lengths that are defined for JobTitle and Salary in the program data vector are different from the lengths that are specified by their informats.
Remember, by default, SAS stores numeric values (no matter how many digits the value contains) as floating-point numbers in 8 bytes of storage. The length of a stored numeric variable is not affected by an informat's width nor by other column specifications in an INPUT statement.
However, it is still necessary to specify the actual width of a raw data field in an INPUT statement. Otherwise, if you specify a default field width of 8 for all numeric values, you'll get inappropriate variable values when the program executes.
In the following example, the values for JobTitle would contain embedded blanks, thus creating invalid numeric values.
The record format of an external file might affect how data is read with column input and formatted input. A record format specifies how records are organized in a file.
On some operating systems, external files can have different types of record formats. Two common record formats are fixed-length records and variable-length records.
External files that have a fixed-length record format have an end-of-record marker after a predetermined number of columns. A typical record length is 80 columns.
Files that have a variable-length record format have an end-of-record marker after the last field in each record.
As you can see, the length of each record varies.
When you are working with variable-length records that contain fixed-field data, you might have values that are shorter than others or that are missing. This can cause problems when you try to read the raw data into your SAS data set.
For example, notice that the following INPUT statement specifies a field width of 8 columns for Receipts. In the third record, the input pointer encounters an end-of-record marker before the 8th column.
input Dept $ 1-11 @13 Receipts comma8.;
The input pointer moves down to the next record in an attempt to find a value for Receipts. However, GRILL
is a character value, and Receipts is a numeric variable. Thus, an invalid data error occurs, and Receipts is set to missing.
When reading variable-length records that contain fixed-field data, you can avoid problems by using the PAD option in the INFILE statement. The PAD option pads each record with blanks so that all data lines have the same length.
You do not need to use the PAD option for the exercises in this chapter. However, when you use column input or formatted input to read fixed-field data in variable-length records, remember to determine whether or not you need to use the PAD option. For more information about the PAD option, see the SAS documentation.
The default value of the maximum record length is determined by your operating system. If you get unexpected results when reading many variables, you might need to change the maximum record length by specifying the LRECL=option in the INFILE statement. For more information about the LRECL= option, see the SAS documentation for your operating environment.
When data is arranged in columns or fixed fields, you can use column input to read them. With column input, the beginning and ending column are specified for each field. Character variables are identified by a dollar ($) sign.
Column input has several features.
Fields can be read in any order.
It can be used to read character variables that contain embedded blanks.
No placeholder is required for missing data. A blank field is read as missing and does not cause other fields to be read incorrectly.
Fields or parts of fields can be re-read.
Fields do not have to be separated by blanks or other delimiters
It can be used to read standard character and numeric data.
Standard numeric data values are values that contain only numbers, scientific notation, decimal points, and plus and minus signs. When numeric data contains characters such as commas or dollar signs, the data is considered to be nonstandard.
Nonstandard numeric data includes
values that contain special characters, such as percent signs, dollar signs, and commas
date and time values
data in fraction, integer binary, real binary, and hexadecimal forms.
SAS provides two input styles for reading data in fixed fields, column input and formatted input. You can use
column input to read standard data only
formatted input to read both standard and nonstandard data.
Formatted input uses column pointer controls to position the input pointer on a specified column. A column pointer control is optional when the first variable is in the first column.
The @n is an absolute pointer control that moves the input pointer to a specific column number. You can read columns in any order with the @n column pointer control.
The +n is a relative pointer control that moves the input pointer forward to a column number that is relative to the current position. The +n pointer control cannot move backwards. However, you can use the notation +(n) to move the pointer control backwards.
An informat tells SAS how to read raw data. There are informats for reading standard and nonstandard character values and for reading standard and nonstandard numeric data values.
Informats always contain a w value to indicate the width of the raw data field. A period (.) ends the informat or separates the w value from the optional d value, which specifies the number of implied decimal places.
A record format specifies how records are organized in a file. Some operating systems have different types of record formats; the two most common are fixed-length records and variable-length records.
When you read variable-length records that contain fixed-field data into a SAS data set, there might be values that are shorter than others or that are missing. The PAD option pads each record with blanks so that all data lines have the same length.
LIBNAME libref 'SAS-data-library';
FILENAME fileref 'filename';
DATA SAS-data-set;
INFILE file-specification;
INPUT <pointer-control> variable informat.;
RUN;
PROC PRINT DATA=SAS-data-set;
RUN;
libname perm 'c:datasales'; filename vandata 'c: ecordsvans.dat'; data perm.vansales; infile vandata; input +12 Quarter 1. @1 Region $9. +6 TotalSales comma11.; run; proc print data=perm.vansales; run;
When you use column input or formatted input, the input pointer stops on the column following the last column that was read.
When you use informats, you do not need to specify a d value if the data values already contain decimal places.
Column input can be used to read standard character or standard numeric data only.
Formatted input can be used to read both standard and nonstandard data.
When reading variable-length records that contain fixed-field data, you can avoid problems by using the PAD option in the INFILE statement.
Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.
Which SAS statement correctly uses column input to read the values in the raw data file below in this order: Address (4th field), SquareFeet (second field), Style (first field), Bedrooms (third field)?
input Address 15-29 SquareFeet 8-11 Style 1-6 Bedrooms 13;
input $ 15-29 Address 8-11 SquareFeet $ 1-6 Style 13 Bedrooms;
input Address $ 15-29 SquareFeet 8-11 Style $ 1-6 Bedrooms 13;
input Address 15-29 $ SquareFeet 8-11 Style 1-6 $ Bedrooms 13;
Which is not an advantage of column input?
It can be used to read character variables that contain embedded blanks.
No placeholder is required for missing data.
Standard as well as nonstandard data values can be read.
Fields do not have to be separated by blanks or other delimiters.
Which is an example of standard numeric data?
-34.245
$24,234.25
1/2
50%
Formatted input can be used to read
standard free-format data
standard data in fixed fields
nonstandard data in fixed fields
both standard and nonstandard data in fixed fields
Which informat should you use to read the values in column 1-5?
w.
$w.
w.d
COMMAw.d
The COMMAw.d informat can be used to read which of the following values?
12,805
$177.95
18%
all of the above
Which INPUT statement correctly reads the values for ModelNumber (first field) after the values for Item (second field)? Both Item and ModelNumber are character variables.
input +7 Item $9. @1 ModelNumber $5.;
input +6 Item $9. @1 ModelNumber $5.;
input @7 Item $9. +1 ModelNumber $5.;
input @7 Item $9 @1 ModelNumber 5.;
Which INPUT statement correctly reads the numeric values for Cost (third field)?
input @17 Cost 7.2;
input @17 Cost 9.2.;
input @17 Cost comma7.;
input @17 Cost comma9.;
Which SAS statement correctly uses formatted input to read the values in this order: Item (first field), UnitCost (second field), Quantity (third field)?
input @1 Item $9. +1 UnitCost comma6. @18 Quantity 3.;
input Item $9. @11 UnitCost comma6. @18 Quantity 3.;
input Item $9. +1 UnitCost comma6. @18 Quantity 3.;
all of the above
Which raw data file requires the PAD option in the INFILE statement in order to correctly read the data using either column input or formatted input?