Chapter 17. Reading Raw Data in Fixed Fields

Overview

Introduction

Raw data can be organized in several ways.

This external file contains data that is arranged in columns or fixed fields. You can specify a beginning and ending column for each field. However, this file contains nonstandard data, because one of the variable's values includes a special character, the dollar sign ($).

Introduction

This external file contains no special characters, but its data is free-format, meaning that it is not arranged in columns. Notice that the values for a particular field do not begin and end in the same columns.

Introduction

How your data is organized and what type of data you have determine which input style you should use to read the data. SAS provides three primary input styles: column input, formatted input, and list input. This chapter teaches you how to use column input and formatted input to read standard and nonstandard data that is arranged in fixed fields.

Objectives

In this chapter, you learn to

  • distinguish between standard and nonstandard numeric data

  • read standard fixed-field data

  • read nonstandard fixed-field data.

Review of Column Input

If you completed Chapter 5, "Creating SAS Data Sets from Raw Files and Excel Work-sheets," on page 135, you learned how to use column input to read raw data that is stored in an external file.

You can use column input to read the values for Item, IDnum, InStock, and BackOrd from the raw data file that is referenced by the fileref Invent.

Review of Column Input

Notice that the INPUT statement lists the variables with their corresponding column locations in order from left to right. However, one of the features of column input is the ability to read fields in any order.

For example, you could have read the values for InStock and BackOrd before the values for Item and IDnum.

input InStock 21-22 BackOrd 24-25 Item $ 1-13
      IDnum $ 15-19;

When you print a report that is based on this data set, the variables will be listed in the order in which they were created by default.

Review of Column Input

Column Input Features

Column input has several features that make it useful for reading raw data.

  • It can be used to read character variable values that contain embedded blanks.

    Column Input Features
  • No placeholder is required for missing data. A blank field is read as missing and does not cause other fields to be read incorrectly.

    Column Input Features
  • Fields or parts of fields can be re-read.

    Column Input Features
  • Fields do not have to be separated by blanks or other delimiters.

    Column Input Features

Identifying Nonstandard Numeric Data

Standard Numeric Data

Standard numeric data values can contain only

  • numbers

  • decimal points

  • numbers in scientific, or E, notation (23E4)

  • minus signs and plus signs.

Some examples of standard numeric data are 15, -15, 15.4, +.05, 1.54E3, and -1.54E-3.

Nonstandard Numeric Data

Nonstandard numeric data includes

  • values that contain special characters, such as percent signs (%), dollar signs ($), and commas (,)

  • date and time values

  • data in fraction, integer binary, real binary, and hexadecimal forms.

The external file referenced by the fileref Empdata contains the personnel information for the technical writing department of a small computer manufacturer. The fields contain values for each employee's last name, first name, job title, and annual salary.

Nonstandard Numeric Data

Notice that the values for Salary contain commas. So, the values for Salary are considered to be nonstandard numeric values.

Choosing an Input Style

Nonstandard data values require an input style that has more flexibility than column input.

You can use formatted input, which combines the features of column input with the ability to read both standard and nonstandard data.

Choosing an Input Style

Whenever you encounter raw data that is organized into fixed fields, you can use

  • column input to read standard data only

  • formatted input to read both standard and nonstandard data.

Using Formatted Input

The General Form of Formatted Input

Formatted input is a very powerful method for reading both standard and nonstandard data in fixed fields.

In this chapter, you'll be working with two column pointer controls.

  • The @n moves the input pointer to a specific column number.

  • The +n moves the input pointer forward to a column number that is relative to the current position.

Let's first take a look at the @n column pointer control.

Using the @n Column Pointer Control

The @n is an absolute pointer control that moves the input pointer to a specific column number. The @ moves the pointer to column n, which is the first column of the field that is being read.

Let's use the @n pointer control to locate variable values in the external file Empdata. As you can see, the values for LastName begin in column 1. We could start with the @1 pointer control.

input @1 LastName $7.

However, the default column pointer location is column 1, so you do not need to use a column pointer control to read the first field.

Using the @n Column Pointer Control

Next, the values for FirstName begin in column 9. To point to column 9, use an @sign and the column number in the INPUT statement:

Using the @n Column Pointer Control
Using the @n Column Pointer Control

Reading Columns in Any Order

Column pointer controls are very useful. For example, you can use the @n to move a pointer forward or backward when reading a record.

In this INPUT statement, the value for FirstName is read first, starting in column 9.

Reading Columns in Any Order

Now let's read the values for LastName, which begin in the first column. Here you must use the @n pointer control to move the pointer back to column 1.

Reading Columns in Any Order

The rest of the INPUT statement specifies the column locations of the raw data value for JobTitle and Salary.

Reading Columns in Any Order
Reading Columns in Any Order

The +n Pointer Control

The +n pointer control moves the input pointer forward to a column number that is relative to the current position. The + moves the pointer forward n columns.

In order to count correctly, it is important to understand where the column pointer control is located after each data value is read. Let's look at an example.

Suppose you want to read the data from Empdata in the following order: LastName, FirstName, Salary, JobTitle. Like the @n pointer control, the default column location for the +n pointer control is column 1. Because the values for LastName begin in column 1, a column pointer control is not needed.

The +n Pointer Control

With formatted input, the column pointer control moves to the first column following the field that was just read. In this example, after LastName is read, the pointer moves to column 8.

The +n Pointer Control

To start reading FirstName, which begins in column 9, you move the column pointer control ahead 1 column with +1.

The +n Pointer Control

After reading FirstName, the column pointer moves to column 14. Now you want to skip over the values for JobTitle and read the values for Salary, which begin in column 19. So you move the column pointer ahead 5 columns from column 14.

The +n Pointer Control

The last field to be read contains the values for JobTitle. You can use the @n column pointer control to return to column 15.

The +n Pointer Control
The +n Pointer Control
The +n Pointer Control

Using Informats

Remember that the general form of the INPUT statement for formatted input is:

INPUT <pointer-control> variable informat.;

An informat is an instruction that tells SAS how to read raw data. SAS provides many informats for reading standard and nonstandard data values. Here is a small sample.

PERCENTw.d

DATEw.

NENGOw.

$BINARYw.

DATETIMEw

PDw.d

$VARYINGw.

HEXw.

PERCENTw.

$w.

JULIANw.

TIMEw.

COMMAw.d

MMDDYYw.

w.d

Note that

  • each informat contains a w value to indicate the width of the raw data field

  • each informat also contains a period, which is a required delimiter

  • for some informats, the optional d value specifies the number of implied decimal places

  • informats for reading character data always begin with a dollar sign ($).

Using Informats

Reading Character Values

The $w. informat enables you to read character data. The w represents the field width of the data value (the total number of columns that contain the raw data field).

In the example below, the $ indicates that FirstName is a character variable, the 5 indicates a field width of five columns, and a period ends the informat.

Reading Character Values

Reading Standard Numeric Data

The informat for reading standard numeric data is the w.d informat.

The w specifies the field width of the raw data value, the period serves as a delimiter, and the d optionally specifies the number of implied decimal places for the value. The w.d informat ignores any specified d value if the data already contains a decimal point.

For example, the raw data value shown below contains 6 digits (4 are decimals) and 1 decimal point. Therefore, the w. informat requires a field width of only 7 to correctly read the raw data value.

Reading Standard Numeric Data

In the example shown below, the values for JobTitle in columns 15-17 contain only numbers. Remember that standard numeric data values can contain only numbers, decimal points, scientific notation, and plus and minus signs.

A d value is not necessary to read the values for JobTitle. Simply move the column pointer control forward 7 spaces to column 15, name the variable, and specify a field width of 3.

Reading Standard Numeric Data
Reading Standard Numeric Data

Reading Nonstandard Numeric Data

The COMMAw.d informat is used to read numeric values and to remove embedded

  • blanks

  • commas

  • dashes

  • dollar signs

  • percent signs

  • right parentheses

  • left parentheses, which are converted to minus signs.

The COMMAw.d informat has three parts:

1.

the informat name

COMMA

2.

a value that specifies the width of the field to be read (including dollar signs, decimal places, or other special characters), followed by a period

w.

3.

an optional value that specifies the number of implied decimal places for a value (not necessary if the value already contains decimal places).

d

In the example below, the values for Salary contain commas, which means that they are nonstandard numeric values.

The values for Salary begin in column 19, so use the @n or +n pointer control to point to column 19, and then name the variable.

Reading Nonstandard Numeric Data

Now add the COMMAw.d informat and specify the field width. The values end in column 27, so the field width is 9 columns. Add a RUN statement to complete the DATA step.

Reading Nonstandard Numeric Data

If you use PROC PRINT to display the data set, the commas are removed from the values for Salary in the resulting output.

Reading Nonstandard Numeric Data

Thus, the COMMAw.d informat does more than simply read the raw data values. It removes special characters such as commas from numeric data and stores only numeric values in a SAS data set.

DATA Step Processing of Informats

Let's place our INPUT statement in a DATA step and submit it for processing. Remember that after the DATA step is submitted, it is compiled and then executed.

data perm.empinfo;
   infile empdata;
   input @9 FirstName $5.@1 LastName $7. +7 JobTitle 3. 
         @19 Salary comma9.;
run;

During the compile phase, the character variables in the program data vector are defined with the exact length specified by the informat. But notice that the lengths that are defined for JobTitle and Salary in the program data vector are different from the lengths that are specified by their informats.

DATA Step Processing of Informats

Remember, by default, SAS stores numeric values (no matter how many digits the value contains) as floating-point numbers in 8 bytes of storage. The length of a stored numeric variable is not affected by an informat's width nor by other column specifications in an INPUT statement.

However, it is still necessary to specify the actual width of a raw data field in an INPUT statement. Otherwise, if you specify a default field width of 8 for all numeric values, you'll get inappropriate variable values when the program executes.

In the following example, the values for JobTitle would contain embedded blanks, thus creating invalid numeric values.

DATA Step Processing of Informats
DATA Step Processing of Informats

Record Formats

The record format of an external file might affect how data is read with column input and formatted input. A record format specifies how records are organized in a file.

On some operating systems, external files can have different types of record formats. Two common record formats are fixed-length records and variable-length records.

Fixed-Length Records

External files that have a fixed-length record format have an end-of-record marker after a predetermined number of columns. A typical record length is 80 columns.

Fixed-Length Records

Variable-Length Records

Files that have a variable-length record format have an end-of-record marker after the last field in each record.

As you can see, the length of each record varies.

Variable-Length Records

Reading Variable-Length Records

When you are working with variable-length records that contain fixed-field data, you might have values that are shorter than others or that are missing. This can cause problems when you try to read the raw data into your SAS data set.

For example, notice that the following INPUT statement specifies a field width of 8 columns for Receipts. In the third record, the input pointer encounters an end-of-record marker before the 8th column.

input Dept $ 1-11 @13 Receipts comma8.;
Reading Variable-Length Records
Reading Variable-Length Records

The input pointer moves down to the next record in an attempt to find a value for Receipts. However, GRILL is a character value, and Receipts is a numeric variable. Thus, an invalid data error occurs, and Receipts is set to missing.

Reading Variable-Length Records

The PAD Option

When reading variable-length records that contain fixed-field data, you can avoid problems by using the PAD option in the INFILE statement. The PAD option pads each record with blanks so that all data lines have the same length.

The PAD Option

You do not need to use the PAD option for the exercises in this chapter. However, when you use column input or formatted input to read fixed-field data in variable-length records, remember to determine whether or not you need to use the PAD option. For more information about the PAD option, see the SAS documentation.

The PAD Option

The default value of the maximum record length is determined by your operating system. If you get unexpected results when reading many variables, you might need to change the maximum record length by specifying the LRECL=option in the INFILE statement. For more information about the LRECL= option, see the SAS documentation for your operating environment.

Chapter Summary

Text Summary

Review of Column Input

When data is arranged in columns or fixed fields, you can use column input to read them. With column input, the beginning and ending column are specified for each field. Character variables are identified by a dollar ($) sign.

Column input has several features.

  • Fields can be read in any order.

  • It can be used to read character variables that contain embedded blanks.

  • No placeholder is required for missing data. A blank field is read as missing and does not cause other fields to be read incorrectly.

  • Fields or parts of fields can be re-read.

  • Fields do not have to be separated by blanks or other delimiters

  • It can be used to read standard character and numeric data.

Identifying Nonstandard Numeric Data

Standard numeric data values are values that contain only numbers, scientific notation, decimal points, and plus and minus signs. When numeric data contains characters such as commas or dollar signs, the data is considered to be nonstandard.

Nonstandard numeric data includes

  • values that contain special characters, such as percent signs, dollar signs, and commas

  • date and time values

  • data in fraction, integer binary, real binary, and hexadecimal forms.

Choosing an Input Style

SAS provides two input styles for reading data in fixed fields, column input and formatted input. You can use

  • column input to read standard data only

  • formatted input to read both standard and nonstandard data.

Using Formatted Input

Formatted input uses column pointer controls to position the input pointer on a specified column. A column pointer control is optional when the first variable is in the first column.

The @n is an absolute pointer control that moves the input pointer to a specific column number. You can read columns in any order with the @n column pointer control.

The +n is a relative pointer control that moves the input pointer forward to a column number that is relative to the current position. The +n pointer control cannot move backwards. However, you can use the notation +(n) to move the pointer control backwards.

Using Informats

An informat tells SAS how to read raw data. There are informats for reading standard and nonstandard character values and for reading standard and nonstandard numeric data values.

Informats always contain a w value to indicate the width of the raw data field. A period (.) ends the informat or separates the w value from the optional d value, which specifies the number of implied decimal places.

Record Formats

A record format specifies how records are organized in a file. Some operating systems have different types of record formats; the two most common are fixed-length records and variable-length records.

When you read variable-length records that contain fixed-field data into a SAS data set, there might be values that are shorter than others or that are missing. The PAD option pads each record with blanks so that all data lines have the same length.

Syntax

    LIBNAME libref 'SAS-data-library';

    FILENAME fileref 'filename';

    DATA SAS-data-set;

            INFILE file-specification;

            INPUT <pointer-control> variable informat.;

    RUN;

    PROC PRINT DATA=SAS-data-set;

    RUN;

Sample Program

libname perm 'c:datasales';
filename vandata 'c:
ecordsvans.dat';
data perm.vansales;
   infile vandata;
   input +12 Quarter 1. @1 Region $9.
         +6 TotalSales comma11.;
run;
proc print data=perm.vansales;
run;

Points to Remember

  • When you use column input or formatted input, the input pointer stops on the column following the last column that was read.

  • When you use informats, you do not need to specify a d value if the data values already contain decimal places.

  • Column input can be used to read standard character or standard numeric data only.

  • Formatted input can be used to read both standard and nonstandard data.

  • When reading variable-length records that contain fixed-field data, you can avoid problems by using the PAD option in the INFILE statement.

Chapter Quiz

Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.

  1. Which SAS statement correctly uses column input to read the values in the raw data file below in this order: Address (4th field), SquareFeet (second field), Style (first field), Bedrooms (third field)?

    Chapter Quiz
    1. input Address 15-29 SquareFeet 8-11 Style 1-6
            Bedrooms 13;
    2. input $ 15-29 Address 8-11 SquareFeet $ 1-6 Style
            13 Bedrooms;
    3. input Address $ 15-29 SquareFeet 8-11 Style $ 1-6
            Bedrooms 13;
    4. input Address 15-29 $ SquareFeet 8-11 Style 1-6
            $ Bedrooms 13;
  2. Which is not an advantage of column input?

    1. It can be used to read character variables that contain embedded blanks.

    2. No placeholder is required for missing data.

    3. Standard as well as nonstandard data values can be read.

    4. Fields do not have to be separated by blanks or other delimiters.

  3. Which is an example of standard numeric data?

    1. -34.245

    2. $24,234.25

    3. 1/2

    4. 50%

  4. Formatted input can be used to read

    1. standard free-format data

    2. standard data in fixed fields

    3. nonstandard data in fixed fields

    4. both standard and nonstandard data in fixed fields

  5. Which informat should you use to read the values in column 1-5?

    Chapter Quiz
    1. w.

    2. $w.

    3. w.d

    4. COMMAw.d

  6. The COMMAw.d informat can be used to read which of the following values?

    1. 12,805
    2. $177.95
    3. 18%
    4. all of the above

  7. Which INPUT statement correctly reads the values for ModelNumber (first field) after the values for Item (second field)? Both Item and ModelNumber are character variables.

    Chapter Quiz
    1. input +7 Item $9. @1 ModelNumber $5.;
    2. input +6 Item $9. @1 ModelNumber $5.;
    3. input @7 Item $9. +1 ModelNumber $5.;
    4. input @7 Item $9 @1 ModelNumber 5.;
  8. Which INPUT statement correctly reads the numeric values for Cost (third field)?

    Chapter Quiz
    1. input @17 Cost 7.2;
    2. input @17 Cost 9.2.;
    3. input @17 Cost comma7.;
    4. input @17 Cost comma9.;
  9. Which SAS statement correctly uses formatted input to read the values in this order: Item (first field), UnitCost (second field), Quantity (third field)?

    Chapter Quiz
    1. input @1 Item $9. +1 UnitCost comma6.
            @18 Quantity 3.;
    2. input Item $9. @11 UnitCost comma6.
            @18 Quantity 3.;
    3. input Item $9. +1 UnitCost comma6.
            @18 Quantity 3.;
    4. all of the above

  10. Which raw data file requires the PAD option in the INFILE statement in order to correctly read the data using either column input or formatted input?

    1. Chapter Quiz
    2. Chapter Quiz
    3. Chapter Quiz
    4. Chapter Quiz
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset