Chapter 18. Reading Free-Format Data

Overview

Introduction

Raw data can be organized in several ways.

This external file contains data that is arranged in columns, or fixed fields. You can specify a beginning and ending column for each field.

Introduction

By contrast, the following external file contains data that is free-format, meaning data that is not arranged in columns. Notice that the values for a particular field do not begin and end in the same columns.

Introduction

How your data is organized determines which input style you should use to read the data. SAS provides three primary input styles: column, formatted, and list input. This chapter teaches you how to use list input to read free-format data that is not arranged in fixed fields.

Objectives

In this chapter, you learn to use the INPUT statement with list input to read

  • free-format data (data that is not organized in fixed fields)

  • free-format data that is separated by nonblank delimiters, such as commas

  • free-format data that contains missing values

  • character values that exceed eight characters

  • nonstandard free-format data

  • character values that contain embedded blanks.

    In addition, you learn how to mix column, formatted, and list input styles in a single INPUT statement.

Free-Format Data

You have already worked with raw data that is in fixed fields. In doing so, you used column input to read standard data values in fixed fields. You have also used formatted input to read both standard and nonstandard data in fixed fields.

Suppose you have raw data that is free-format; that is, it is not arranged in fixed fields. The fields are often separated by blanks or by some other delimiter, as shown below. In this case, column input and formatted input that you may have used before to read standard and nonstandard data in fixed fields will not enable you to read all of the values in the raw data file.

Free-Format Data

Using List Input

List input is a powerful tool for reading both standard and nonstandard free-format data.

Suppose you have an external data file like the one shown below. The file, which is referenced by the fileref Credit, contains the results of a survey on the use of credit cards by males and females in the 18-39 age range.

Using List Input

You need to read the data values for

  • gender

  • age

  • number of bank credit cards

  • bank card use per month

  • number of department store credit cards

  • department store card use per month.

List input may be the easiest input style to use because, as shown in the INPUT statement below, you simply list the variable names in the same order as the corresponding raw data fields. Remember to distinguish character variables from numeric variables.

input Gender $ Age Bankcard FreqBank Deptcard
      FreqDept;

Because list input, by default, does not specify column locations,

  • all fields must be separated by at least one blank or other delimiter

  • fields must be read in order from left to right

  • you cannot skip or re-read fields.

Processing List Input

It's important to remember that list input causes SAS to scan the input lines for values rather than reading from specific columns. When the INPUT statement is submitted for processing, the input pointer is positioned at column 1 of the raw data file, as shown below.

Processing List Input

SAS reads the first field until it encounters a blank space. The blank space indicates the end of the field, and the data value is assigned to the program data vector for the first variable in the INPUT statement.

Processing List Input

Next, SAS scans the record until the next nonblank space is found, and the second value is read until another blank is encountered. Then the value is assigned to its corresponding variable in the program data vector.

Processing List Input

This process of scanning ahead to the next nonblank column, reading the data value until a blank is encountered, and assigning the value to a variable in the program data vector continues until all the fields have been read and values have been assigned to variables in the program data vector.

Processing List Input

When the DATA step has finished executing, you can display the data set with the PRINT procedure. The following code produces the output below.

Processing List Input

Working with Delimiters

Most free-format data fields are clearly separated by blanks and are easy to imagine as variables and observations. But fields can also be separated by other delimiters, such as commas, as shown below.

Working with Delimiters

When characters other than blanks are used to separate the data values, you can tell SAS which field delimiter to use. Use the DLM= option in the INFILE statement to specify a delimiter other than a blank (the default).

Example

The following program creates the output shown below.

Example
Example
Example

If the field delimiter is also a comma, the fields are identified incorrectly, as shown below.

Example
Example

Reading a Range of Variables

When the variable values in the raw data file are sequential and are separated by a blank (or by another delimiter), you can specify a range of variables in the INPUT statement. This is especially useful if your data contains similar variables, such as the answers to a questionnaire.

For example the following INPUT statement creates five new numeric variables and assigns them the names Ques1, Ques2, Ques3, and so on. You can also specify a range in the VAR statement in the PROC PRINT step to list a range of specific variables.

data survey.phone;
   infile survey;
   input IDnum $ Ques1-Ques5;
run;
proc print data=survey.phone;
   var ques1-ques3;
run;
Reading a Range of Variables

If you are specifying a range of character variables, both the variable list and the $ sign must be enclosed in parentheses.

data survey.stores;
   infile stordata;
   input Age  (Store1-Store3) ($);
run;
proc print data=survey.stores;
run;
Reading a Range of Variables
data test.scores;
   infile group3;
   input Age  (Score1-Score4) (6.);
run;

Limitations of List Input

In its default form, list input places several restrictions on the types of data that can be read:

  • Although the width of a field can be greater than eight columns, both character and numeric variables have a default length of 8. Character values that are longer than eight characters will be truncated.

  • Data must be in standard numeric or character format.

  • Character values cannot contain embedded delimiters.

  • Missing numeric and character values must be represented by a period or some other character

Limitations of List Input

Reading Missing Values

Reading Missing Values at the End of a Record

Suppose that the third person represented in the raw data file below did not answer the questions about how many department store credit cards she has and about how often she uses them.

Reading Missing Values at the End of a Record

Because the missing values occur at the end of the record, you can use the MISSOVER option in the INFILE statement to read the missing values at the end of the record. The MISSOVER option prevents SAS from going to another record if, when using list input, it does not find values in the current line for all the INPUT statement variables. At the end of the current record, values that are expected but not found are set to missing.

For the raw data file shown above, the MISSOVER option prevents the fields in the fourth record from being read as values for Deptcard and FreqDept in the third observation. Note that Deptcard and FreqDept are set to missing.

Reading Missing Values at the End of a Record
Reading Missing Values at the End of a Record

Reading Missing Values at the Beginning or Middle of a Record

Remember that the MISSOVER option works only for missing values that occur at the end of the record. A different method is required when you are using list input to read raw data that contains missing values at the beginning or middle of a record. Let's look at what happens when a missing value occurs at the beginning or middle of a record.

Suppose the value for Age is missing in the first record.

Reading Missing Values at the Beginning or Middle of a Record

When the program below executes, each field in the raw data file is read one by one. The INPUT statement tells SAS to read six data values from each record. However, the first record contains only five values.

data perm.survey;
   infile credit  dlm=',';
   input Gender $ Age Bankcard FreqBank
         Deptcard FreqDept;
run;
proc print data=perm.survey;
run;

The two commas in the first record are interpreted as one delimiter. The incorrect value (1) is read for Age. The program continues to read subsequent incorrect values for Bankcard (8), FreqBank (0), and Deptcard (0). The program then attempts to read the character field FEMALE, at the beginning of the second record, as the value for the numeric variable FreqDept. This causes the value of FreqDept in the first observation to be interpreted as missing. The input pointer then moves down to the third record to begin reading values for the second observation. Therefore, the first observation in the data set contains incorrect values and values from the second record in the raw data file are not included.

Reading Missing Values at the Beginning or Middle of a Record

You can use the DSD option in the INFILE statement to correctly read the raw data. The DSD option changes how SAS treats delimiters when list input is used. Specifically, the DSD option

  • sets the default delimiter to a comma

  • treats two consecutive delimiters as a missing value

  • removes quotation marks from values.

When the following program reads the raw data file, the DSD option sets the default delimiter to a comma and treats the two consecutive delimiters as a missing value. Therefore, the data is read correctly.

Reading Missing Values at the Beginning or Middle of a Record

If the data uses multiple delimiters or a single delimiter other than a comma, simply specify the delimiter value(s) with the DLM= option. In the following example, an asterisk (*) is used as a delimiter. However, the data is still read correctly because of the DSD option.

Reading Missing Values at the Beginning or Middle of a Record

The DSD option can also be used to read raw data when there is a missing value at the beginning of a record, as long as a delimiter precedes the first value in the record.

Reading Missing Values at the Beginning or Middle of a Record

You can also use the DSD and DLM= options to read fields that are delimited by blanks.

data perm.survey;
   infile credit  dsd dlm=' ';
   input Gender $ Age Bankcard FreqBank
         Deptcard FreqDept;
run;
Reading Missing Values at the Beginning or Middle of a Record

Specifying the Length of Character Values

Remember that when you use list input to read raw data, character values are assigned a default length of 8. Let's look at what happens when list input is used to read character variables whose values are longer than 8.

The raw data file referenced by the fileref Citydata contains 1970 and 1980 population figures for several large U.S. cities. Notice that some city names are rather long.

Specifying the Length of Character Values

The longer character values are truncated when they are written to the program data vector.

Specifying the Length of Character Values

PROC PRINT output shows the truncated values for City.

Specifying the Length of Character Values

The LENGTH Statement

Remember, variable attributes are defined when the variable is first encountered in the DATA step. In the program below, the LENGTH statement precedes the INPUT statement and defines both the length and type of the variable City. A length of 12 has been assigned to accommodate PHILADELPHIA, which is the longest value for City.

The LENGTH Statement

Using this method, you do not need to specify City's type in the INPUT statement. However, leaving the $ in the INPUT statement will not produce an error. Your output should now display the complete values for City.

The LENGTH Statement
The LENGTH Statement

Modifying List Input

You can make list input more versatile by using modified list input. There are two modifiers that can be used with list input.

  • The ampersand (&) modifier is used to read character values that contain embedded blanks.

  • The colon (:) modifier is used to read nonstandard data values and character values that are longer than eight characters, but which contain no embedded blanks.

You can use modified list input to read the file shown below. This file contains the names of the ten largest U.S. cities ranked in order based on their 1986 estimated population figures.

Notice that some of the values for city names contain embedded blanks. Also, note that the values representing the population of each city are nonstandard numeric values (they contain commas).

Modifying List Input

In the following sections you will learn how to use the ampersand (&) modifier to read the values for city (City). Then you will learn how the colon (:) modifier can be used to read the nonstandard numeric values that represent population (Pop86).

Reading Values That Contain Embedded Blanks

The ampersand (&) modifier enables you to read character values that contain single embedded blanks. The & indicates that a character value that is being read with list input might contain one or more single embedded blanks. The value is read until two or more consecutive blanks are encountered. The & modifier precedes a specified informat if one is used.

input Rank City &;

In the raw data file shown below, each value of City is followed by two consecutive blanks. There are two ways that you can use list input to read the values of City.

Using the & Modifier with a LENGTH Statement

As shown below, you can use a LENGTH statement to define the length of City, and then add an & modifier to the INPUT statement to indicate that the values contain embedded blanks.

Using the & Modifier with a LENGTH Statement

Using the & Modifier with an Informat

You can also read the values for City with the & modifier followed by the $w. informat, which reads standard character values, as shown below. When you do this, the w value in the informat determines the variable's length and should be large enough to accommodate the longest value.

Using the & Modifier with an Informat
Using the & Modifier with an Informat

Reading Nonstandard Values

The colon (:) modifier enables you to read nonstandard data values and character values that are longer than eight characters, but which contain no embedded blanks. The : indicates that values are read until a blank (or other delimiter) is encountered, and then an informat is applied. If an informat for reading character values is specified, the w value specifies the variable's length, overriding the default length.

Notice the values representing the 1986 population of each city in the raw data file below. Because they contain commas, these values are nonstandard numeric values.

Reading Nonstandard Values

In order to read these values, you can modify list input with the colon (:) modifier, followed by the COMMAw.d informat, as shown in the program below. Notice that the COMMAw.d informat does not specify a w value.

data perm.cityrank;
   infile topten;
   input Rank City & $12.
         Pop86 : comma.;

Remember that list input reads each value until the next blank is detected. The default length of numeric variables is 8, so you don't need to specify a w value to indicate the length of a numeric variable.

This is different from using a numeric informat with formatted input. In that case, you must specify a w value in order to indicate the number of columns to be read.

Processing the DATA Step

At compile time, the informat $12. in the example below sets the length of City to 12 and stores this information in the descriptor portion of the data set. During the execution phase, however, the w value of 12 does not determine the number of columns that are read. This is different from the function of informats in the formatted input style.

Processing the DATA Step

The & modifier indicates that the values for City should be read until two consecutive blanks are encountered. Therefore, the value NEW YORK is read from column 4 to column 11, a total of only 8 columns. When blanks are encountered in both columns 12 and 13, the value NEW YORK is written to the program data vector.

Processing the DATA Step

The input pointer moves forward to the next nonblank column, which is column 14 in the first record. Now the values for Pop86 are read from column 14 until the next blank is encountered. The COMMAw.d informat removes the commas, and the value is written to the program data vector.

Processing the DATA Step

Notice that the character values for City and the nonstandard values for Pop86 are stored correctly in the data set.

Processing the DATA Step

Comparing Formatted Input and Modified List Input

As you have seen, informats work differently in modified list input than they do in formatted input. With formatted input, the informat determines both the length of character variables and the number of columns that are read. The same number of columns are read from each record.

Comparing Formatted Input and Modified List Input

The informat in modified list input determines only the length of the variable, not the number of columns that are read. Here, the raw data values are read until two consecutive blanks are encountered.

Comparing Formatted Input and Modified List Input

Creating Free-Format Data

If you completed Chapter 5, "Creating SAS Data Sets from Raw Files and Excel Work-sheets," on page 135, you learned how the PUT statement can be used with column output to write observations from a SAS data set to a raw data file. The PUT statement can also be used with list output to create free-format raw data files.

List output is similar to list input. With list output, you simply list the names of the variables whose values you want to write. The PUT statement writes a variable, leaves a blank, then writes the next value.

The following program creates the raw data file Findat, using the SAS data set Perm.Finance. The DATEw. format is used to write the value of Date in the form DDMMYYYY.

Creating Free-Format Data

Specifying a Delimiter

You can use the DLM= option with a FILE statement to create a character-delimited raw data file.

Specifying a Delimiter
Specifying a Delimiter
Specifying a Delimiter

Using the DSD Option

What happens if you need to create a comma-delimited file that requires the use of a format that writes out values using commas?

If you used the following program, the resulting raw data file would contain five fields rather than four.

Using the DSD Option

You can use the DSD option in the FILE statement to specify that data values containing commas should be enclosed in quotation marks. Remember that the DSD option uses a comma as a delimiter, so a DLM= option isn't necessary here.

Using the DSD Option

Reading Values That Contain Delimiters Within a Quoted String

You can also use the DSD option in an INFILE statement to read values that contain delimiters within a quoted string. As shown in the following PROC PRINT output, the INPUT statement treats the commas within the values for Salary as valid characters and removes the quotation marks from the character strings before the value is stored.

Reading Values That Contain Delimiters Within a Quoted String

Mixing Input Styles

Evaluating your raw data and choosing the most appropriate input style is a very important task. You have already worked with three input styles for reading raw data.

Input Style

Reads

Column

standard data values in fixed fields

Formatted

nonstandard data values in fixed fields

List

data values that are not arranged in fixed fields, but are separated by blanks or other delimiters

With some file layouts, you might need to mix input styles in the same INPUT statement in order to read the data correctly.

Look at the raw data file below and think about how to combine input styles to read these values.

Mixing Input Styles
  • Column input is an appropriate choice for the first field because the values can be read as standard character values and are located in fixed columns.

  • The next two fields are also located in fixed columns, but the values require an informat. So, formatted input is a good choice here.

  • Values in the fourth field begin in column 28 but do not end in the same column. List input is appropriate here, but notice that some values are longer than eight characters. You need to use the : format modifier with an informat to read these values.

  • The last field does not always begin or end in the same column, so list input is the best input style for those values.

Field Description

Starting Column

Field Width

Data Type

Input Style

Social Security #

1

11

character

column

Date of Hire

13

7

date

formatted

Annual Salary

21

6

numeric

formatted

Department

28

5 to 9

character

list

Phone Extension

??

4

numeric

list

The INPUT statement to read the data should look like this:

data perm.mixed;
   infile rawdata;
   input SSN $ 1-11 @13 HireDate date7. 
         @21 Salary comma6. Department : $9. Phone;
run;
proc print data=perm.mixed;
run;

When you submit the PRINT procedure, the output displays values for each variable.

Mixing Input Styles

Additional Features

Writing Character Strings and Variable Values

You can use a PUT statement to write both character strings and variable values to a raw data file. To write out a character string, simply add a character string, enclosed in quotation marks, to the PUT statement. It's a good idea to include a blank space as the last character in the string to avoid spacing problems.

Writing Character Strings and Variable Values
Writing Character Strings and Variable Values

Chapter Summary

Text Summary

Free-Format Data

External files can contain raw data that is free-format; that is, the data is not arranged in fixed fields. The fields can be separated by blanks, or by some other delimiter, such as commas.

Using List Input

Free-format data can easily be read with list input because you do not need to specify column locations of the data. You simply list the variable names in the same order as the corresponding raw data fields. You must distinguish character variables from numeric variables by using the dollar ($) sign.

When characters other than blanks are used to separate the data values, you can specify the field delimiter by using the DLM= option in the INFILE statement.

You can also specify a range of variables in the INPUT statement when the variable values in the raw data file are sequential and are separated by blanks (or by some other delimiter). This is especially useful if your data contains similar variables, such as the answers to a questionnaire.

In its simplest form, list input places several limitations on the types of data that can be read.

Reading Missing Values

If your data contains missing values at the end of a record, you can use the INFILE statement with the MISSOVER option to prevent SAS from going to the next record to find the missing values.

If your data contains missing values at the beginning or in the middle of a record, you might be able to use the DSD option in the INFILE statement to correctly read the raw data. The DSD option sets the default delimiter to a comma and treats two consecutive delimiters as a missing value.

If the data uses multiple delimiters or a single delimiter other than a comma, you can use both the DSD option and the DLM= option in the INFILE statement.

The DSD option can also be used to read raw data when there is a missing value at the beginning of a record, as long as a delimiter precedes the first value in the record.

Specifying the Length of Character Values

You can specify the length of character variables by using the LENGTH statement. The LENGTH statement enables you to use list input to read names that are longer than eight characters without truncating them.

Because variable attributes are defined when the variable is first encountered in the DATA step, the LENGTH statement precedes the INPUT statement and defines both the length and the type of the variable.

When you use the LENGTH statement, you do not need to specify the variable type again in the INPUT statement.

Modifying List Input

Modified list input can be used to read values that contain embedded blanks and nonstandard values. Modified list input uses two format modifiers:

  • the ampersand (&) modifier enables you to read character values that contain single embedded blanks

  • the colon (:) modifier enables you to read nonstandard data values and character values that are longer than eight characters, but which contain no embedded blanks.

Remember that informats work differently in modified list input than they do in formatted input.

Creating Free-Format Data

You can create a raw data file using list output. With list output, you simply list the names of the variables whose values you want to write. The PUT statement writes a variable, leaves a blank, then writes the next value.

You can use the DLM= option with a FILE statement to create a delimited raw data file. You can use the DSD option in a FILE statement to specify that data values containing commas should be enclosed in quotation marks. You can also use the DSD option to read values that contain delimiters within a quoted string.

Mixing Input Styles

With some file layouts, you might need to mix input styles in the same INPUT statement in order to read the data correctly.

Syntax

Reading Free-Format Data

    LIBNAME libref 'SAS-data-library';

    FILENAME fileref 'filename';

    DATA SAS-data-set;

             INFILE file-specification <DLM 'delimiter'><MISSOVER><DSD>;

             LENGTH variable $ length;

             INPUT variable<$ ><&|:><informat>;

    RUN;

    PROC PRINT DATA=SAS-data-set;

    RUN;

Creating Free-Format Data

    LIBNAME libref 'SAS-data-library';

    DATA _NULL_;

             SET SAS-data-set;

             FILE file-specification <DLM 'delimiter'><DSD>;

             PUT variable<: format>;

    RUN;

Sample Programs

Reading Free-Format Data

libname perm 'c:
ecordsdata';
filename credit 'c:
ecordscredit.dat';
data perm.carduse;
   infile credit dlm='#' missover;
   length LastName $ 14;
   input lastname $ Gender $ Age CardType $
         Total : comma.;
run;
proc print data=perm.carduse;
run;

Creating Raw Data Using List Output

libname perm 'c:
ecordsdata';
data _null_;
   set perm.finance;
   file 'c:accounts
ewdata' dsd;
   put ssn name salary : comma. date : date9.;
run;

Points to Remember

  • When you use list input,

    • fields must be separated by at least one blank or other delimiter.

    • fields must be read in order, left to right. You cannot skip or re-read fields.

    • use a LENGTH statement to avoid truncating character values that are longer than eight characters.

  • In formatted input, the informat determines both the length of character variables and the number of columns that are read. The same number of columns are read from each record.

  • The informat in modified list input determines only the length of the variable value, not the number of columns that are read.

Chapter Quiz

Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.

  1. The raw data file referenced by the fileref Students contains data that is

    Chapter Quiz
    1. arranged in fixed fields

    2. free-format

    3. mixed-format

    4. arranged in columns

  2. Which input style should be used to read the values in the raw data file that is referenced by the fileref Students?

    Chapter Quiz
    1. column

    2. formatted

    3. list

    4. mixed

  3. Which SAS program was used to create the raw data file Teamdat from the SAS data set Work.Scores?

    Chapter Quiz
    1. data _null_;
         set work.scores;
         file 'c:data	eamdat' dlm=',';
         put name highscore team;
      run;
    2. data _null_;
         set work.scores;
         file 'c:data	eamdat' dlm=' ';
         put name highscore team;
      run;
    3. data _null_;
         set work.scores;
         file 'c:data	eamdat' dsd;
         put name highscore team;
      run;
    4. data _null_;
         set work.scores;
         file 'c:data	eamdat';
         put name highscore team;
      run;
  4. Which SAS statement reads the raw data values in order and assigns them to the variables shown below?

    Variables: FirstName (character), LastName (character), Age (numeric), School (character), Class (numeric)

    Chapter Quiz
    1. input FirstName $ LastName $ Age School $ Class;

    2. input FirstName LastName Age School Class;

    3. input FirstName $ 1-4 LastName $ 6-12 Age 14-15
            School $ 17-19 Class 21;
    4. input FirstName 1-4 LastName 6-12 Age 14-15
            School 17-19 Class 21;
  5. Which SAS statement should be used to read the raw data file that is referenced by the fileref Salesrep?

    Chapter Quiz
    1. infile salesrep;

    2. infile salesrep ':';

    3. infile salesrep dlm;

    4. infile salesrep dlm=':';

  6. Which of the following raw data files can be read by using the MISSOVER option in the INFILE statement? Spaces for missing values are highlighted with colored blocks.

    1. Chapter Quiz
    2. Chapter Quiz
    3. Chapter Quiz
    4. Chapter Quiz
  7. Which SAS program correctly reads the data in the raw data file that is referenced by the fileref Volunteer?

    Chapter Quiz
    1. data perm.contest;
         infile volunteer;
         input FirstName $ LastName $ Age
               School $ Class;
      run;
    2. data perm.contest;
         infile volunteer;
         length LastName $ 11;
         input FirstName $ lastname $ Age
               School $ Class;
      run;
    3. data perm.contest;
         infile volunteer;
         input FirstName $ lastname $ Age
               School $ Class;   length LastName $ 11;
      run;
    4. data perm.contest;
         infile volunteer;
         input FirstName $ LastName $ 11. Age
               School $ Class;
      run;
  8. Which type of input should be used to read the values in the raw data file that is referenced by the fileref University?

    Chapter Quiz
    1. column

    2. formatted

    3. list

    4. modified list

  9. Which SAS statement correctly reads the values for Flavor and Quantity? Make sure the length of each variable can accommodate the values shown.

    Chapter Quiz
    1. input Flavor & $9. Quantity : comma.;
    2. input Flavor & $14. Quantity : comma.;
    3. input Flavor : $14. Quantity & comma.;
    4. input Flavor $14. Quantity : comma.;
  10. Which SAS statement correctly reads the raw data values in order and assigns them to these corresponding variables: Year (numeric), School (character), Enrolled (numeric)?

    Chapter Quiz
    1. input Year School & $27.
            Enrolled : comma.;
    2. input Year 1-4 School & $27.
            Enrolled : comma.;
    3. input @1 Year 4. +1 School & $27.
            Enrolled : comma.;
    4. all of the above

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset