CHAPTER  2

Accessing Your Data

2.1    Methods for Getting Your Data into SAS

2.2    SAS Data Libraries and Data Sets

2.3    Listing the Contents of a SAS Data Set

2.4    Reading Excel Files with the IMPORT Procedure

2.5    Accessing Excel Files Using the XLSX LIBNAME Engine

2.6    Reading Delimited Files with the IMPORT Procedure

2.7    Telling SAS Where to Find Your Raw Data

2.8    Reading Raw Data Separated by Spaces

2.9    Reading Raw Data Arranged in Columns

2.10  Reading Raw Data Not in Standard Format

2.11  Selected Informats

2.12  Mixing Input Styles

2.13  Reading Messy Raw Data

2.14  Reading Multiple Lines of Raw Data per Observation

2.15  Reading Multiple Observations per Line of Raw Data

2.16  Reading Part of a Raw Data File

2.17  Controlling Input with Options in the INFILE Statement

2.18  Reading Delimited Files with the DATA Step

 

2.1    Methods for Getting Your Data into SAS

image

Data come in many different forms. Your data may be handwritten on a piece of paper, or typed into a raw data file on your computer. Perhaps your data are in a database file on your personal computer, or in a data warehouse on the server at your office. Wherever your data reside, there is a way for SAS to use them. You may need to convert your data from one form to another, or SAS may be able to use your data in their current form. This section outlines several methods for getting your data into SAS. Most of these methods are covered in this book, but a few of the more advanced methods are merely mentioned so that you know they exist. We do not attempt to cover all methods available for getting your data into SAS, as new methods are continually being developed, and creative SAS users can always come up with clever methods that work for their own situations. However your data are stored, there should be at least one method in this book that will work for you.

Methods for accessing your data can be put into five general categories:

      using SAS data sets

      entering data directly into SAS data sets

      creating SAS data sets from raw data files

      converting other software’s data files into SAS data sets

      reading other software’s data files directly

Using SAS data sets  If your data are already stored as SAS data sets, all you have to do is tell SAS which data set to use. SAS data sets are stored and accessed via SAS data libraries. A SAS data library is a collection of one or more SAS data sets that are stored in the same location. Some SAS data libraries are defined by default, but you can also define your own. Individual data sets within a SAS data library are called members. See the next section for more information about SAS data libraries and data set members.

Entering data directly into SAS data sets  Sometimes the best method for getting your data into SAS is to enter the data directly into SAS data sets through your keyboard.

      The Viewtable window is part of the SAS windowing environment and is included with Base SAS software. Viewtable allows you to enter your data in a tabular format. You can define variables (also called columns) and give them attributes such as name, length, and type (character or numeric).

      The Data Grid is part of SAS Enterprise Guide, which is included with Base SAS, but only for Windows. As with Viewtable, you can define variables, give them attributes, and then enter your data.

Creating SAS data sets from raw data files  Much of this chapter is devoted to reading raw data files (also referred to as text, ASCII, sequential, or flat files). You can always read a raw data file using Base SAS software. If your data are not already in a raw data file, chances are you can convert your data into a raw data file. There are two general methods for reading raw data files:

      The DATA step is so versatile that it can read almost any type of raw data file. This method is covered in this chapter starting with Section 2.7.

      The IMPORT procedure, covered in Section 2.6, is available with Base SAS running on the UNIX and Windows operating environments. This is an easy way to read particular types of raw data files including tab-delimited and comma-separated values (CSV) files.

Converting other software’s data files into SAS data sets  Each software application has its own form for data files. While this is useful for software developers, it is troublesome for software usersespecially when your data are in one application, but you need to analyze them with another. There are several options for converting data:

      The IMPORT procedure, available for UNIX and Windows operating environments, can be used to convert Microsoft Excel (covered in Section 2.4), Lotus, dBase, Stata, SPSS, JMP, Paradox, and Microsoft Access files into SAS data sets. All of these except JMP require that you have SAS/ACCESS Interface to PC Files installed on your SAS server.

      If you don’t have SAS/ACCESS software, then you can always create a raw data file from your application and read the raw data file with either the DATA step or the IMPORT procedure. Many applications can create CSV files, which are easily read using the IMPORT procedure (covered in Section 2.6) or the DATA step (covered in Section 2.18).

Reading other software’s data files directly  Under certain circumstances, you may be able to read data without converting to a SAS data set. This method is particularly useful when you have many people updating data files, and you want to make sure that you are using the most current data.

      The SAS/ACCESS products allow you to read data without converting your data into SAS data sets. There are SAS/ACCESS products for most of the popular database management systems including Oracle, DB2, MySQL, and Teradata. This method of data access is not covered in this book.

      SAS/ACCESS Interface to PC Files also allows you to read some PC file types directly without converting them to SAS data sets including Excel (covered in Section 2.5), Access, and JMP. One method that SAS uses to access PC files in their native format is called a LIBNAME engine. To tell SAS the type of file you want to read, you specify a LIBNAME statement that includes the option for that particular engine. See the SAS Documentation for more information on LIBNAME engines.

      There are also LIBNAME engines that allow you to read data directly and are part of Base SAS software (so you don’t need SAS/ACCESS Interface to PC Files). There are engines for SPSS, OSIRIS, old versions of SAS data sets, and SAS data sets in transport format. Check the SAS Documentation for your operating environment for a complete list of available engines.

Given all these methods for getting your data into SAS, you are sure to find at least one method that will work for you—probably more.

 

2.2     SAS Data Libraries and Data Sets

It’s not always obvious, but all SAS data sets have a two-level name such as WORK.BANANA. The first level is called its libref (short for SAS data library reference), and the second level is the member name that uniquely identifies the data set within the library. A SAS data library is a collection of SAS data sets residing in the same location. So a libref is like a nickname for that location. Sometimes a libref refers to a physical location, such as a hard drive or flash drive. Other times it refers to a logical location such as a directory or folder.

Both the libref and member name follow the standard rules for SAS names. They must start with a letter or underscore and contain only letters, numerals, or underscores. However, librefs cannot be longer than 8 characters while member names can be up to 32 characters. (There are system options that allow you to bend these rules, and the defaults may change, so check the SAS Documentation if you have a question.)

Built-in SAS data libraries  SAS comes with some built-in libraries. The exact libraries available to you will depend on the SAS products you have installed and whether you have any locally defined libraries, but will always include these:

WORK

is a special library for temporary SAS data sets. The WORK library is erased when you exit SAS. WORK is the default library if you don’t specify a libref in the data set name.

SASHELP

contains sample data sets that you can use, and is usually read-only.

SASUSER

is a library provided for you to save your data sets. This library is permanent on most systems. However, in z/OS and some UNIX systems, it may be temporary or read-only. If you have more than one SAS server, then you may have a SASUSER library on each server.

Temporary SAS data sets  If you create a data set that you won’t need later, it’s a good idea to make it temporary so your disks don’t get cluttered. Here is the DATA step shown in Chapter 1:

DATA distance;

   Miles = 26.22;

   Kilometers = 1.61 * Miles;

RUN;

This DATA step creates a temporary data set. Notice that the data set name does not include the libref WORK. Because the DATA statement uses a one-level name, SAS assigns the default library, WORK, and uses DISTANCE as the member name within that library. The log contains this note showing the complete, two-level name:

NOTE: The data set WORK.DISTANCE has 1 observations and 2 variables.

Permanent SAS data sets using direct referencing  In general, if you use a data set more than once, it is more efficient to save it as a permanent SAS data set than to create a new temporary SAS data set every time you want to use the data. To use direct referencing, just take your operating environment’s file path and name, enclose it in quotation marks, and put it in your program. The quotation marks tell SAS that this is a permanent SAS data set. Here is the general form of a DATA statement for direct referencing in different operating environments. (Keep in mind that SAS University Edition and SAS OnDemand for Academics use Linux.)

Windows:

DATA 'drive:directoryfilename';

UNIX or Linux:

DATA '/home/path/filename';

z/OS:

DATA 'data-set-name';

This program is the same as the preceding one except that it includes the file path and name enclosed in quotes (using the syntax for the Windows operating environment):

DATA 'c:MySASLibdistance';

   Miles = 26.22;

   Kilometers = 1.61 * Miles;

RUN;

This time the log contains this note showing the file path and name:

NOTE: The data set c:MySASLibdistance has 1 observations and 2 variables.

The SAS log does not show the libref, but if you check your SAS libraries you will see that SAS has created a library with a name like WC000001 and it contains a data set named DISTANCE. Librefs are temporary, but this data set is permanent because it is not in the WORK library.

Permanent SAS data sets with a LIBNAME statement  If you read or write many data sets in the same SAS data library, it is easier (and less error prone) to use a LIBNAME statement than to keep typing your file path and name over and over. Here is the general form of LIBNAME statements for different operating environments:

Windows:

LIBNAME libref 'drive:directory';

UNIX or Linux:

LIBNAME libref '/home/path';

z/OS:

LIBNAME libref 'data-set-name';

This program is the same as the preceding two except that it uses a LIBNAME statement to define a libref. Notice that the DATA statement uses the two-level name, MARATHON.DISTANCE.

LIBNAME marathon 'c:MySASLib';

DATA marathon.distance;

   Miles = 26.22;

   Kilometers = 1.61 * Miles;

RUN;

This time the log contains this note with the two-level data set name:

NOTE: The data set MARATHON.DISTANCE has 1 observations and 2 variables.

This is a permanent SAS data set because the libref is not WORK. If you view the directory of files on your computer, you will not see a file named MARATHON.DISTANCE. That is because operating environments have their own systems for naming files. When run under Windows or UNIX, this data set will be called distance.sas7bdat. Under z/OS, the filename would be the data-set-name specified in the LIBNAME statement. The examples in this section all create SAS data sets, the next section shows an example, which accesses an existing permanent SAS data set.

2.3     Listing the Contents of a SAS Data Set

To use a SAS data set, all you need to do is tell SAS the name and location of the data set you want, and SAS will figure out what is in it. SAS can do this because SAS data sets are self-documenting, which is another way of saying that SAS automatically stores information about the data set (called the descriptor portion) along with the data. There is an easy way to see the information about your data set, use PROC CONTENTS.

PROC CONTENTS is a simple procedure. You just type the keywords PROC CONTENTS and specify the data set you want with the DATA= option:

PROC CONTENTS DATA = data-set;

Example  The SASHELP data library contains sample data sets and is provided with Base SAS. The following PROC CONTENTS produces a report about the CARS data set in the SASHELP data library. Because the SASHELP library is defined by default, no LIBNAME statement is needed.

* Use PROC CONTENTS to describe the CARS data set in the SASHELP library;

PROC CONTENTS DATA = SASHELP.CARS;

RUN;

The output from PROC CONTENTS is like a table of contents for your data set:

The CONTENTS Procedure

 

Data Set Name

SASHELP.CARS

Observations

428

Member Type

DATA

Variables

15

Engine

V9

Indexes

0

Created

06/24/2015 22:15:37

Observation Length

152

Last Modified

06/24/2015 22:15:37

Deleted Observations

0

Protection

 

Compressed

NO

Data Set Type

 

Sorted

YES

Label

2004 Car Data

 

 

Data Representation

WINDOWS_64

 

 

Encoding

us-ascii  ASCII (ANSI)

 

 

 

Engine/Host Dependent Information

Data Set Page Size

65536

Number of Data Set Pages

2

First Data Page

1

Max Obs per Page

430

Obs in First Data Page

413

Number of Data Set Repairs

0

ExtendObsCounter

YES

Filename

C:Program FilesSASHomeSASFoundation9.4coresashelpcars.sas7bdat

Release Created

9.0401M3

Host Created

X64_SRV12

 

Alphabetic List of Variables and Attributes

#

Variable

Type

Len

Format

Label

9

Cylinders

Num

8

 

 

5

DriveTrain

Char

5

 

 

8

EngineSize

Num

8

 

Engine Size (L)

10

Horsepower

Num

8

 

 

7

Invoice

Num

8

DOLLAR8.

 

15

Length

Num

8

 

Length (IN)

11

MPG_City

Num

8

 

MPG (City)

12

MPG_Highway

Num

8

 

MPG (Highway)

6

MSRP

Num

8

DOLLAR8.

 

1

Make

Char

13

 

 

2

Model

Char

40

 

 

4

Origin

Char

6

 

 

3

Type

Char

8

 

 

13

Weight

Num

8

 

Weight (LBS)

14

Wheelbase

Num

8

 

Wheelbase (IN)

 

Sort Information

Sortedby

Make Type

Validated

YES

Character Set

ANSI

 

The output starts with information about your data set and then describes each variable.

For the data set

For each variable

  Data set name

  Variable name

  Number of observations

  Type (numeric or character)

  Number of variables

  Length (storage size in bytes)

  Date created

  Format for printing (if any)

  Data set label (if any)

  Variable label (if any)

 

2.4      Reading Excel Files with the IMPORT Procedure

SAS offers many ways to access data in Microsoft Excel files. In fact, entire books have been written about exchanging data between SAS and Excel. SAS Studio, SAS Enterprise Guide and the SAS windowing environment each have point-and-click windows that will import Excel files for you, but writing a program gives you more control and is easier to repeat. If you have SAS/ACCESS Interface to PC Files and are running in the Windows or UNIX operating environment, then you can use the IMPORT procedure to read Excel files. PROC IMPORT will scan your Excel worksheet and automatically determine the variable types (character or numeric), will assign lengths to the character variables, and can recognize most date formats. Also, if you want, you can use the first row in your worksheet for the variable names.

Here is the general form of the IMPORT procedure for reading Excel files:

PROC IMPORT DATAFILE = 'filename' OUT = data-set

     DBMS = identifier  REPLACE;

where filename is the file you want to read and data-set is the name of the SAS data set you want to create. The REPLACE option tells SAS to replace the SAS data set named in the OUT= option if it already exists. The DBMS= option tells SAS the type of Excel file to read and may not be necessary.

DBMS identifiers  There are several DBMS identifiers you can use to read Excel files. Three commonly used identifiers are: EXCEL, XLS, and XLSX. In the UNIX operating environment, use the XLS identifier for older style files (.xls extension), and the XLSX identifier for newer style files (.xlsx extension). In the Windows operating environment, in addition to the XLS and XLSX identifiers, you can use the EXCEL identifier to read all types of Excel files. The EXCEL identifier uses different technology to read files than do the XLS and XLSX identifiers, so the results may be different. By default, the XLS and XLSX identifiers look at more data rows to determine the column type than does the EXCEL identifier. Not all of these identifiers may work for you if your Windows computer has a mixture of 64-bit and 32-bit applications. In addition, some computer configurations may require that a PC Files Server be installed. The PC Files Server uses the EXCELCS identifier. See the SAS Documentation for more information.

Optional statements  If you have more than one sheet in your file, then you can specify which sheet to read using the following statement:

SHEET = 'sheet-name';

If you want to read only specific cells in the sheet, you can specify a range. The range can be a named range (if defined), or you can specify the upper-left (UL) and lower-right (LR) cells for the range as follows:

RANGE = 'sheet-name$UL:LR';

By default, the IMPORT procedure will take the variable names from the first row of the spreadsheet. If you do not want this, then you can add the following statement to the procedure and SAS will give your variables names like A, B, C, and so on.

GETNAMES = NO;

When using the EXCEL identifier, if you have a column that contains both numeric and character values, then by default, the numbers will be converted to missing values. To read the numbers as character values instead of converting them to missing values, use the following statement:

MIXED = YES;

Example  Suppose you have the following Microsoft Excel spreadsheet which contains data about magnolia trees. For each type of tree the file includes the scientific and common names, maximum height, age at first blooming when planted from seed, whether evergreen or deciduous, and color of flowers.

image

The following program reads the Microsoft Excel file using the IMPORT procedure with the XLSX DBMS identifier and the REPLACE option, and creates a permanent data set named MAGNOLIA.

LIBNAME sasfiles 'c:MySASLib';

* Read an Excel spreadsheet using PROC IMPORT;

PROC IMPORT DATAFILE = 'c:MyExcelTrees.xlsx' OUT = sasfiles.magnolia

   DBMS = XLSX REPLACE;

RUN;

Here is the MAGNOLIA data set. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment.

 

ScientificName

CommonName

MaxHeight

AgeBloom

Type

Color

1

M. grandiflora

Southern Magnolia

80

15

E

white

2

M. campbellii

 

80

20

D

rose

3

M. liliiflora

Lily Magnolia

12

4

D

purple

4

M. soulangiana

Saucer Magnolia

25

3

D

pink

5

M. stellata

Star Magnolia

10

3

D

white

Notice that the variable names were taken from the first row in the spreadsheet. If you have spaces or special characters in the first row, see Section 3.18.

You should always check the SAS log when you create data sets to make sure nothing went wrong. These notes tell you that the MAGNOLIA data set has five observations and six variables.

NOTE: The import data set has 5 observations and 6 variables.

NOTE: SASFILES.MAGNOLIA was successfully created.

2.5     Accessing Excel Files Using the XLSX LIBNAME Engine

The preceding section covered one way to convert an Excel file into a SAS data set, the IMPORT procedure. The XLSX LIBNAME engine goes a step further. With the XLSX LIBNAME engine, you can read from and write to an Excel file without converting it to a SAS data set. This engine works for files created by any version of Microsoft Excel 2007 or later on the Windows or UNIX operating environments. You must have SAS 9.4M2 or higher and SAS/ACCESS Interface to PC Files software. With this engine, you can read an existing worksheet, replace a worksheet, or add a new worksheet, but you cannot update individual values inside a worksheet.

The XLSX LIBNAME engine uses the first line in your data file for the variable names, scans each full column to determine the variable type (character or numeric), assigns lengths to character variables, and recognizes dates, and numeric values containing commas or dollar signs. While the XLSX LIBNAME engine does not offer many options, because you are using an Excel worksheet like a SAS data set, you can use many standard data set options. For example, you can use the RENAME= data set option to change the names of variables. (See Section 6.10.)

Here is the general form of a LIBNAME statement with the XLSX engine option:

LIBNAME libref XLSX 'filename';

Notice that the only difference between this LIBNAME statement and one used to access a SAS data library (Section 2.2) is that the XLSX option has been added. The libref is a name you make up and must follow the rules for names of SAS data libraries (eight characters or fewer, start with a letter or underscore; and contain only letters, numerals or underscores). The filename is the name used by your operating environment.

Keep in mind that, unlike some other LIBNAME engines, the XLSX LIBNAME engine only locks your Excel file while it actively accesses it. The file quickly becomes available to other people and processes that can read and edit it. You do not need to release the file, but if you know you are done with it, it is a good practice to clear the libref with this statement:

LIBNAME libref CLEAR;

Example  This example uses the same Excel spreadsheet as the previous section. The spreadsheet contains data about magnolia trees: the scientific and common names, maximum height, age at first blooming when planted from seed, whether evergreen or deciduous, and color of flowers.

image

 

With the XLSX LIBNAME engine, SAS can read the spreadsheet directly, without first converting it to a SAS data set. Here is a PROC PRINT that prints the data from the Excel spreadsheet. A simple PROC PRINT like this one prints the values for all variables and all observations.

* Read an Excel spreadsheet using the XLSX LIBNAME engine;

LIBNAME exfiles XLSX 'c:MyExcelTrees.xlsx';

PROC PRINT DATA = exfiles.sheet1;

   TITLE 'PROC PRINT of Excel File';

RUN;

Here are the results of the PROC PRINT. Notice that the variable names were taken from the first row in the spreadsheet. If you have spaces or special characters in the first row, see Section 3.18.

PROC PRINT of Excel File

Obs

ScientificName

CommonName

MaxHeight

AgeBloom

Type

Color

1

M. grandiflora

Southern Magnolia

80

15

E

white

2

M. campbellii

 

80

20

D

rose

3

M. liliiflora

Lily Magnolia

12

4

D

purple

4

M. soulangiana

Saucer Magnolia

25

3

D

pink

5

M. stellata

Star Magnolia

10

3

D

white

If you want to convert an Excel spreadsheet to a SAS data set, you can do that too. Here is a DATA step that reads the Excel spreadsheet (using a SET statement in a DATA step, which is covered in more detail in Section 3.1) and creates a permanent SAS data set named MAGNOLIA.

* Read Excel into a permanent SAS data set;

LIBNAME exfiles XLSX 'c:MyExcelTrees.xlsx';

LIBNAME sasfiles 'c:MySASLib';

DATA sasfiles.magnolia;

   SET exfiles.sheet1;

RUN;

Here is the MAGNOLIA data set:

 

ScientificName

CommonName

MaxHeight

AgeBloom

Type

Color

1

M. grandiflora

Southern Magnolia

80

15

E

white

2

M. campbellii

 

80

20

D

rose

3

M. liliiflora

Lily Magnolia

12

4

D

purple

4

M. soulangiana

Saucer Magnolia

25

3

D

pink

5

M. stellata

Star Magnolia

10

3

D

white

The SAS log shows that no observations or variables were lost:

NOTE: The import data set has 5 observations and 6 variables.

NOTE: There were 5 observations read from the data set EXFILES.sheet1.

NOTE: The data set SASFILES.MAGNOLIA has 5 observations and 6 variables.

2.6     Reading Delimited Files with the IMPORT Procedure

image

The IMPORT procedure reads Excel files (Section 2.4), but also reads many other types of files. Delimited files are raw data files that have a special character separating data values. Many types of software can save data as delimited files, often with commas or tab characters for delimiters. The IMPORT procedure for reading delimited files is part of Base SAS in     the UNIX and Windows operating environments. You can also read delimited files with a DATA step (Section 2.18).

PROC IMPORT will scan your data file (the first 20 rows by default) and automatically determine the variable types (character or numeric), will assign lengths to the character variables, and can recognize some date formats. PROC IMPORT will treat two consecutive delimiters in your data file as a missing value, will read values enclosed in quotation marks, and assign missing values to variables when it runs out of data on a line. Also, if you want, you can use the first line in your data file for the variable names.

Here is the general form of the IMPORT procedure for delimited files:

PROC IMPORT DATAFILE = 'filename' OUT = data-set;

where filename is the file you want to read and data-set is the name of the SAS data set you want to create. SAS will determine the file type based on the extension of the file.

Type of File

Extension

DBMS Identifier

Comma-delimited

.csv  

CSV

Tab-delimited

.txt

TAB

Delimiters other than commas or tabs

 

DLM

If your file does not have the proper extension, or your file is of type DLM, then you must use the DBMS= option in the PROC IMPORT statement. Use the REPLACE option if you already have a SAS data set with the name you specified in the OUT= option, and you want to overwrite it. Here is the general form of PROC IMPORT with both the REPLACE and the DBMS options:

PROC IMPORT DATAFILE = 'filename' OUT = data-set

     DBMS = identifier REPLACE;

Optional statements  Some types of files need a few more instructions to be read correctly. If the data do not start in the first line of the file, use the DATAROWS statement. If the delimiter is not a comma, tab, or space, use the DELIMITER statement. If your file contains only data and no headings, use the GETNAMES=NO statement to assign default variable names. Lastly, if your data file has all missing values or non-representative data in the first 20 data rows, you may need the GUESSINGROWS statement to make sure variables are assigned the correct data type and length.

 

DATAROWS = n;

starts reading data in row n. Default is 1.

DELIMITER = 'character';

specifies delimiter for DLM files. Default is space.

GETNAMES = NO;

tells SAS whether to use the first line of the input file for variable names. Default is YES. If NO, then variables are named VAR1, VAR2, VAR3, and so on.

GUESSINGROWS = n;

uses n rows to determine variable types. Default is 20.

  Here is the general form of PROC IMPORT with the GETNAMES=NO statement:

PROC IMPORT DATAFILE = 'filename' OUT = data-set REPLACE;

   GETNAMES = NO;

Example  The following example uses data about Jerry’s Coffee Shop where Jerry employs local bands to attract customers. Jerry keeps records of the number of customers present for each band. The data are the band name, and the number of customers at 8 p.m., 9 p.m., 10 p.m., and 11 p.m. Notice that one of the bands, “Stop, Drop, and Rock-N-Roll,” has commas in the name of the band. When a data value contains the delimiter, then the value must be enclosed in quotation marks.

BandName,EightPM,NinePM,TenPM,ElevenPM

Lupine Lights,45,63,70,

Awesome Octaves,17,28,44,12

"Stop, Drop, and Rock-N-Roll",34,62,77,91

The Silveyville Jazz Quartet,38,30,42,43

Catalina Converts,56,,65,34

Here is the program that will read this data file:

PROC IMPORT DATAFILE ='c:MyRawDataBands2.csv' OUT = music REPLACE;

RUN;

Here is the MUSIC data set created by the program. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment. Notice that PROC IMPORT used the first row of data for variable names. If you have spaces or special characters in the first row, see Section 3.18.

 

BandName

EightPM

NinePM

TenPM

ElevenPM

1

Lupine Lights

45

63

70

.

2

Awesome Octaves

17

28

44

12

3

Stop, Drop, and Rock-N-Roll

34

62

77

91

4

The Silveyville Jazz Quartet

38

30

42

43

5

Catalina Converts

56

.

65

34

The SAS log contains these notes describing the new data set:

NOTE: 5 records were read from the infile 'c:MyRawDataBands2.csv'.

      The minimum record length was 23.

      The maximum record length was 41.

NOTE: The data set WORK.MUSIC has 5 observations and 5 variables.

 

2.7     Telling SAS Where to Find Your Raw Data

The rest of this chapter covers reading raw data using the DATA step. If your data are already stored as SAS data sets, as Excel spreadsheets, as CSV or tab-delimited files, or in a database such as Oracle, then you can probably skip this topic. However, if your data are in raw data files (also referred to as text, ASCII, sequential, or flat files), then you can use the power and flexibility of the DATA step to read your data.

A raw data file can be viewed using simple text editors or system commands. On personal computers, raw data files will either have no program associated with them, or they will be associated with simple editors like Microsoft Notepad. In some operating environments, you can use commands to list the file, such as the cat or more commands in UNIX. Spreadsheet files are examples of data files that are not raw data. If you try using a text editor to look at a spreadsheet file, you will probably see lots of funny special characters you can’t find on your keyboard. It may cause your computer to beep and chirp, making you wish you had that private office down the hall. It looks nothing like the nice neat rows and columns you see when you use your spreadsheet software to view the same file.

The first step toward reading raw data files is telling SAS where to find the raw data. Your raw data may be either internal to your SAS program (also called instream), or in a separate file. Either way, you must tell SAS where to find your data.

Internal raw data  If you type raw data directly into your SAS program, then the data are internal to your program. You may want to do this when you have small amounts of data, or when you are testing a program with a small test data set. Use the DATALINES statement to indicate internal data. The DATALINES statement must be the last statement in the DATA step. All lines in the SAS program following the DATALINES statement are considered data until SAS encounters a semicolon. The semicolon can be on a line by itself or at the end of a SAS statement that follows the data lines. Any statements following the data are part of a new step. The CARDS statement is synonymous with the DATALINES statement.

The following SAS program illustrates the use of the DATALINES statement. (The DATA statement simply tells SAS to create a SAS data set named USPRESIDENTS, and the INPUT statement tells SAS how to read the data. The INPUT statement is discussed in Sections 2.8 through 2.18.)

* Read internal data into SAS data set uspresidents;

DATA uspresidents;

   INPUT President $ Party $ Number;

   DATALINES;

Adams F 2

Lincoln R 16

Grant R 18

Kennedy D 35

   ;

RUN;

If you run this program, the following note appears in the SAS log telling you that the data set USPRESIDENTS in the WORK library contains four observations and three variables.

NOTE: The data set WORK.USPRESIDENTS has 4 observations and 3 variables.

External raw data files  Usually you will want to keep data in external files, separating the data from the program. This eliminates the chance that you might accidentally alter the data when you are editing your SAS program. Use the INFILE statement to tell SAS the filename (and path, if appropriate) of the file containing the data. The INFILE statement follows the DATA statement and must precede the INPUT statement. After the INFILE keyword, enclose the file path and name in quotation marks. Here is the general form for different operating environments. (If you are using SAS University Edition or SAS OnDemand for Academics, then use the Linux style of statement.)

Windows:

INFILE 'c:MyDirPresident.dat';

UNIX or Linux:

INFILE '/home/mydir/president.dat';

z/OS:

INFILE 'MYID.PRESIDEN.DAT';

Suppose the following data are in a file called President.dat in the directory MyRawData on the C drive (Windows):

Adams F 2

Lincoln R 16

Grant R 18

Kennedy D 35

You could use this program with an INFILE statement to read the external data file:

* Read data from external file into SAS data set;

DATA uspresidents;

   INFILE 'c:MyRawDataPresident.dat';

   INPUT President $ Party $ Number;

RUN;

Whenever you read data from an external file, SAS lists some valuable information about the file in the SAS log. Always check this information as it could indicate problems. The following is an excerpt from the SAS log after reading the external file. A simple comparison of the number of records read by the INFILE statement (four in this case) with the number of observations (also four) in the SAS data set can tell you a lot about whether SAS read your data correctly.

NOTE: The infile 'c:MyRawDataPresident.dat' is:

      Filename=c:MyRawDataPresident.dat,

      RECFM=V,LRECL=32767, File Size (bytes)=49

NOTE: 4 records were read from the infile 'c:MyRawDataPresident.dat'.

      The minimum record length was 9.

      The maximum record length was 12.

NOTE: The data set WORK.USPRESIDENTS has 4 observations and 3 variables.

2.8    Reading Raw Data Separated by Spaces

image

If the values in your raw data file are all separated by at least one space, then using list input (also called free formatted or space-delimited input) to read the data may be appropriate. List input is an easy way to read raw data into SAS, but with ease come a few limitations. By default, you must read all the data in a recordno skipping over unwanted values. Any missing data must be indicated with a period. Character data, if present, must be simpleno embedded spaces, and no values greater than 8 characters in length. If the data file contains dates or other values that need special treatment, then list input would not be appropriate. This may sound like a lot of restrictions, but a surprising number of data files can be read using list input.

The INPUT statement, which is part of the DATA step, tells SAS how to read your raw data. To write an INPUT statement using list input, simply list the variable names after the keyword INPUT in the order they appear in the data file. Generally, variable names must be 32 characters or fewer, start with a letter or an underscore, and contain only letters, underscores, or numerals. If the values are character (not numeric), then place a dollar sign ($) after the variable name. Leave at least one space between names, and remember to place a semicolon at the end of the statement. The following is an example of a simple list-style INPUT statement:

INPUT Name $ Age Height;

This statement tells SAS to read three data values. The $ after Name indicates that it is a character variable, whereas the variables Age and Height are both numeric.

If your data file does not quite satisfy the rules for list input, you will be glad to know that there are various ways to work around the limitations. Many of those are discussed later in this chapter. One of the most common problems is character variables with lengths longer than 8. You can tell SAS to use a longer length with a LENGTH statement. This statement, for example, tells SAS to create a variable called Name that is character and has a length of 12:

LENGTH Name $ 12;

In SAS programs, the attributes of a variable are set when SAS first encounters that variable. So if a LENGTH statement comes before an INPUT statement, then SAS will use that length.

Example  Your hometown has been overrun with toads this year. A local resident, having heard of frog jumping in California, had the idea of organizing a toad jump to cap off the annual town fair. For each contestant you have the toad’s name, weight, and the jump distance from three separate attempts. If the toad is disqualified for any jump, then a period is used to indicate missing data. Here is what the data file ToadJump.dat looks like:

Lucky 2.3 1.9 . 3.0

Spot 4.6 2.5 3.1 .5

Toadzilla 7.1 . . 3.8

Hop 4.5 3.2 1.9 2.6

Noisy 3.8 1.3 1.8 1.5

Winner 5.7 . . .

This data file does not look very neat, but it does meet most of the requirements for list input: no embedded spaces, all values are separated by at least one space, and missing data are indicated by a period. However, there is one problem. The name of one toad, Toadzilla, is longer than 8 characters. Adding a LENGTH statement to the program will allow SAS to read the toads’ names with list input.

Here is the SAS program that will read the data:

* Create a SAS data set named toads;

* Read the data file ToadJump.dat using list input;

DATA toads;

   LENGTH ToadName $ 9;

   INFILE 'c:MyRawDataToadJump.dat';

   INPUT ToadName Weight Jump1 Jump2 Jump3;

RUN;

The LENGTH statement creates a variable named ToadName that is character (indicated by the dollar sign) and has a length of 9. Then the INPUT statement reads the variables ToadName, Weight, Jump1, Jump2, and Jump3. Because SAS already knows that ToadName is character, you don’t need to include a dollar sign in the INPUT statement. All the other variables are numeric.

It is always important to check data sets you create to make sure they are correct. Here is the TOADS data set. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment.

 

ToadName

Weight

Jump1

Jump2

Jump3

1

Lucky

2.3

1.9

.

3.0

2

Spot

4.6

2.5

3.1

0.5

3

Toadzilla

7.1

.

.

3.8

4

Hop

4.5

3.2

1.9

2.6

5

Noisy

3.8

1.3

1.8

1.5

6

Winner

5.7

.

.

.

These notes appear in the log:

NOTE: 6 records were read from the infile 'c:MyRawDataToadJump.dat'.

      The minimum record length was 16.

      The maximum record length was 21.

NOTE: The data set WORK.TOADS has 6 observations and 5 variables.

You can use list input to read files with other delimiters too, such as commas or tabs. See Section 2.18.

2.9     Reading Raw Data Arranged in Columns

image

Some raw data files do not have spaces (or other delimiters) between all the values or periods for missing dataso the files can’t be read using list input. But if each of the variable’s values is always found in the same place in the line of data, then you can use column input as long as all the values are character or standard numeric. Standard numeric data contain only numerals, decimal points, plus and minus signs, and E for scientific notation. Numbers with embedded commas or dates, for example, are not standard.

Column input has the following advantages over list input:

      spaces are not required between values

      missing values can be left blank

      character data can have embedded spaces

      you can skip unwanted variables

Data files with street addresses, which often have embedded blanks, are good candidates for column input. The street Martin Luther King Jr. Boulevard should be read as one variable not five, as it would be with list input. Data which can be read with column input can often also be read with formatted input (discussed in Sections 2.10 and 2.11) or a combination of input styles (Section 2.12).

To write an INPUT statement using column input, after the keyword INPUT, list the first variable’s name. If the variable is character, leave a space; then place a $. After the $, or variable name if it is numeric, leave a space; then list the column or range of columns for that variable. The columns are positions of the characters or numbers in the data line and are not to be confused with columns like those you see in a spreadsheet. Repeat this for all the variables you want to read. Here is a simple column-style INPUT statement:

INPUT Name $ 1-10 Age 11-13 Height 14-18;

The first variable, Name, is character, and the data values are in columns 1 through 10. The second variable, Age, is numeric since it is not followed by a $ and is in columns 11 through 13. The third variable, Height, is also numeric and is in columns 14 through 18.

Example  The local minor league baseball team, the Walla Walla Sweets, is keeping records about concession sales. A ballpark favorite are the sweet onion rings which are sold at the concession stands and also by vendors in the bleachers. The ballpark owners have a feeling that in games with lots of hits and runs more onion rings are sold in the bleachers than at the concession stands. They think they should send more vendors out into the bleachers when the game heats up, but need more evidence to back up their feelings.

For each home game they have the following information: name of opposing team, number of onion ring sales at the concession stands and in the bleachers, the number of hits for each team, and the final score for each team. The following is a sample of the data file named OnionRing.dat. For reference, a ruler showing the column numbers has been placed above the data:

----+----1----+----2----+----3----+----4

Columbia Peaches      35  67  1 10  2  1

Plains Peanuts       210      2  5  0  2

Gilroy Garlics        151035 12 11  7  6

Sacramento Tomatoes  124  85 15  4  9  1

Notice that the data file has the following characteristics, all making it a prime candidate for column input. All the values line up in columns, the team names have embedded blanks, missing values are blank, and in one case there is no space between data values. (Those Gilroy Garlics fans must really love onion rings.)

The following program shows how to read these data using column input:

* Create a SAS data set named sales;

* Read the data file OnionRing.dat using column input;

DATA sales;

   INFILE 'c:MyRawDataOnionRing.dat';

   INPUT VisitingTeam $ 1-20 CSales 21-24 BSales 25-28

         OurHits 29-31 TheirHits 32-34 OurRuns 35-37 TheirRuns 38-40;

RUN;

The variable VisitingTeam is character (indicated by a $) and reads the visiting team’s name in columns 1 through 20. The variables CSales and BSales read the concession and bleacher sales in columns 21 through 24 and 25 through 28, respectively. The number of hits for the home team, OurHits, and the visiting team, TheirHits, are in columns 29 through 31 and 32 through 34, respectively. The number of runs for the home team, OurRuns, is in columns 35 through 37, while the number of runs for the visiting team, TheirRuns, is in columns 38 through 40.

Here is the SALES data set. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment.

 

VisitingTeam

CSales

BSales

OurHits

TheirHits

OurRuns

TheirRuns

1

Columbia Peaches

35

67

1

10

2

1

2

Plains Peanuts

210

.

2

5

0

2

3

Gilroy Garlics

15

1035

12

11

7

6

4

Sacramento Tomatoes

124

85

15

4

9

1

These notes appear in the log:

NOTE: 4 records were read from the infile 'c:MyRawDataOnionRings.dat'.

      The minimum record length was 40.

      The maximum record length was 40.

NOTE: The data set WORK.SALES has 4 observations and 7 variables.

2.10     Reading Raw Data Not in Standard Format

image

Sometimes raw data are not straightforward numeric or character. For example, we humans easily read the number 1,000,001 as one million and one, but your trusty computer sees it as a character string. While the embedded commas make the number easier for us to interpret, they make the number impossible for the computer to recognize without some instructions. In SAS, informats are used to tell the computer how to interpret these types of data.

Informats are useful anytime you have nonstandard data. (Standard numeric data contain only numerals, decimal points, plus and minus signs, and E for scientific notation.) Numbers with embedded commas or dollar signs are nonstandard data. Other examples include data in hexadecimal or packed decimal formats. SAS has informats for reading these types of data as well.

Dates are perhaps the most common nonstandard data. Using date informats, SAS will convert conventional forms of dates like 10-31-2023 or 31OCT23 into a number, the number of days since January 1, 1960. This number is referred to as a SAS date value. (Why January 1, 1960? Who knows? Maybe 1960 was a good year for the SAS founders.) This turns out to be extremely useful when you want to do calculations with dates. For example, you can easily find the number of days between two dates by subtracting one from the other. See Section 3.13 for more about dates.

There are three general types of informats: character, numeric, and date. A table of selected SAS informats appears in the next section. The three types of informats have these general forms:

Character

Numeric

Date

$informatw.

informatw.d

informatw.

The $ indicates character informats, informat is the name of the informat, w is the total width, and d is the number of decimal places (numeric informats only). The period is an important part of the informat name. Without a period, SAS may try to interpret the informat as a variable name, which by default, cannot contain any special characters except the underscore. Two informats do not have names: $w., which reads standard character data, and w.d, which reads standard numeric data.

Use informats by placing the informat after the variable name in the INPUT statement; this is called formatted input. Here is an INPUT statement using formatted input:

INPUT Name $10. Age 3. Height 5.1 BirthDate MMDDYY10.;

The columns read for each variable are determined by the starting point and the width of the informat. SAS always starts with the first column; so the data values for the first variable, Name, which has an informat of $10., are in columns 1 through 10. Now the starting point for the second variable is column 11, and SAS reads values for Age in columns 11 through 13. The values for the third variable, Height, are in columns 14 through 18. Those five columns include the decimal place and the decimal point itself (150.3 for example). The values for the last variable, BirthDate, are in columns 19 through 28.

Example  This example illustrates the use of informats for reading data. The data contain the results from a pumpkin-carving contest. Each line includes the contestant’s name, age, type (carved or decorated), the date the pumpkin was entered, and the scores from each of three judges.

Alicia Grossman  13 c 10-28-2020 7.8 6.5 7.2

Matthew Lee       9 D 10-30-2020 6.5 5.9 6.8

Elizabeth Garcia 10 C 10-29-2020 8.9 7.9 8.5

Lori Newcombe     6 D 10-30-2020 6.7     4.9

Jose Martinez     7 d 10-31-2020 8.9 9.510.0

Brian Williams   11 C 10-29-2020 7.8 8.4 8.5

The following program reads the data from a file named Pumpkin.dat, and uses a LIBNAME statement to create a permanent SAS data set named CONTEST. Please note there are many ways to input these data, so if you imagined something else, that’s OK.

LIBNAME pump 'c:MySASLib';

* Create a permanent SAS data set named contest;

* Read the file Pumpkin.dat using formatted input;

DATA pump.contest;

   INFILE 'c:MyRawDataPumpkin.dat';

   INPUT Name $16. Age 3. +1 Type $1. +1 Date MMDDYY10.

         (Score1 Score2 Score3) (4.1);

RUN;

The variable Name has an informat of $16., meaning that it is a character variable 16 columns wide. Variable Age has an informat of 3., is numeric, three columns wide, and has no decimal places. The +1 skips over one column. Variable Type is character, and it is one column wide. Variable Date has an informat MMDDYY10. and reads dates in the form 10-31-2020 or 10/31/2020, each 10 columns wide. The remaining variables, Score1 through Score3, all require the same informat, 4.1. Putting the variables and the informat in separate sets of parentheses saves typing.

Here is the CONTEST data set. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment.

 

Name

Age

Type

Date

Score1

Score2

Score3

1

Alicia Grossman

13

c

22216

7.8

6.5

7.2

2

Matthew Lee

9

D

22218

6.5

5.9

6.8

3

Elizabeth Garcia

10

C

22217

8.9

7.9

8.5

4

Lori Newcombe

6

D

22218

6.7

.

4.9

5

Jose Martinez

7

d

22219

8.9

9.5

10.0

6

Brian Williams

11

C

22217

7.8

8.4

8.5

Notice that these dates are printed as the number of days since January 1, 1960. Section 3.13 discusses how to format these into readable dates.

These notes appear in the log:

NOTE: 6 records were read from the infile 'c:MyRawDataPumpkin.dat'.

      The minimum record length was 44.

      The maximum record length was 44.

NOTE: The data set PUMP.CONTEST has 6 observations and 7 variables.

 

2.11     Selected Informats

Informat

Definition

Width range

Default width

Character

 

 

 

$CHARw.

Reads character data—does not trim leading or trailing blanks

1–32,767

8 or length of variable

$UPCASEw.

Converts character data to uppercase

1–32,767

8

$w.

Reads character data—trims leading blanks

1–32,767

none

Date, Time, and Datetime1

ANYDTDTEw.

Reads dates in various date forms

5–32

9

DATEw.

Reads dates in form: ddmmmyy or ddmmmyyyy

7–32

7

DATETIMEw.

Reads datetime values in the form: ddmmmyy hh:mm:ss.ss

13–40

18

DDMMYYw.

Reads dates in form: ddmmyy or ddmmyyyy

6–32

6

JULIANw.

Reads Julian dates in form: yyddd or yyyyddd

5–32

5

MMDDYYw.

Reads dates in form: mmddyy or mmddyyyy

6–32

6

STIMERw.

Reads time in form: hh:mm:ss.ss (or mm:ss.ss, or ss.ss)

1–32

10

TIMEw. d

Reads time in form: hh:mm:ss.ss (or hh:mm) using a 24-hour clock

5–32

8

Numeric

COMMAw.d

Removes embedded commas and $, converts left parentheses to minus sign

1–32

1

COMMAXw.d

Like COMMAw.d but reverses role of comma and period

1–32

1

PERCENTw.

Converts percentages to proportions

1–32

6

w.d

Reads standard numeric data

1–32

none

 

1  SAS date values are the number of days since January 1, 1960. Time values are the number of seconds past midnight, and datetime values are the number of seconds past midnight January 1, 1960.

 

 

 

Informat

Input data

INPUT statement

Results

Character

$CHARw.

my cat    my cat

INPUT Animal $CHAR10.;  

my cat    my cat

$UPCASEw.

my cat

INPUT Name $UPCASE10.;

MY CAT

$w.

my cat    my cat

INPUT Animal $10.;

my cat my cat

Date, Time, and Datetime

ANYDTDTEw.

1jan1961 01/01/61

INPUT Day ANYDTDTE10.;

366 366

DATEw.

1jan1961 1 jan 61

INPUT Day DATE10.;

366 366

DATETIMEw.

1jan1960 10:30:15 1jan1961,10:30:15

INPUT Dt DATETIME18.;

37815 31660215

DDMMYYw.

01.01.61 02/01/61

INPUT Day DDMMYY8.;

366 367

JULIANw.

61001 1961001

INPUT Day JULIAN7.;

366 366

MMDDYYw.

01-01-61 01/01/61

INPUT Day MMDDYY8.;

366 366

STIMERw.

10:30 10:30:15

INPUT Time STIMER8.;

630 37815

TIMEw. d

10:30 10:30:15

INPUT Time TIME8.;

37800 37815

Numeric

COMMAw.d

$1,000,001 (1,234)

INPUT Income COMMA10.;

1000001 -1234

COMMAXw.d

$1.000.001 (1.234,25)

INPUT Value COMMAX10.;

1000001 -1234.25

PERCENTw.

5% (20%)

INPUT Value PERCENT5.;

0.05 -0.2

w.d

1234 -12.3

INPUT Value 5.1;

123.4 -12.3

 

 

2.12   Mixing Input Styles

Each of the three major input styles has its own advantages. List style is the easiest; column style is a bit more work; and formatted style is the hardest of the three. However, column and formatted styles do not require spaces (or other delimiters) between variables and can read embedded blanks. Formatted style can read special data such as dates. Sometimes you use one style, sometimes another, and sometimes the easiest way is to use a combination of styles. SAS is so flexible that you can mix and match any of the input styles for your own convenience.

Example  The following raw data contain information about U.S. national parks: name, state (or states as the case may be), year established, and size in acres:

Yellowstone           ID/MT/WY 1872    4,065,493

Everglades            FL 1934          1,398,800

Yosemite              CA 1864            760,917

Great Smoky Mountains NC/TN 1926         520,269

Wolf Trap Farm        VA 1966                130

You could write the INPUT statement for these data in many ways, which is the point of this section. This program shows one way to do it:

* Create a SAS data set named nationalparks;

* Read a data file NatPark.dat mixing input styles;

DATA nationalparks;

   INFILE 'c:MyRawDataNatPark.dat';

   INPUT ParkName $ 1-22 State $ Year @40 Acreage COMMA9.;

RUN;

Notice that the variable ParkName is read with column style input, State and Year are read with list input, and Acreage is read with formatted input.

Here is the NATIONALPARKS data set:

 

ParkName

State

Year

Acreage

1

Yellowstone

ID/MT/WY

1872

4065493

2

Everglades

FL

1934

1398800

3

Yosemite

CA

1864

760917

4

Great Smoky Mountains

NC/TN

1926

520269

5

Wolf Trap Farm

VA

1966

130

Sometimes programmers run into problems when they mix input styles. When SAS reads a line of raw data, it uses a pointer to mark its place, but each style of input uses the pointer a little differently. With list input, SAS automatically scans to the next non-blank field and starts reading. With column style input, SAS starts reading in the exact column you specify. But with formatted input, SAS just starts reading—wherever the pointer is, that is where SAS reads.

 

Sometimes you need to move the pointer explicitly, and you can do that by using the column pointer, @n, where n is the number of the column SAS should move to.

In the preceding program, the column pointer @40 tells SAS to move to column 40 before reading the value for Acreage. If you removed the column pointer from the INPUT statement, as shown in the following statement, then SAS would start reading Acreage right after Year:

INPUT ParkName $ 1-22 State $ Year Acreage COMMA9.;

Here is the data set with truncated values of the variable Acreage:

 

ParkName

State

Year

Acreage

1

Yellowstone

ID/MT/WY

1872

4065

2

Everglades

FL

1934

.

3

Yosemite

CA

1864

.

4

Great Smoky Mountains

NC/TN

1926

5

5

Wolf Trap Farm

VA

1966

.

Because Acreage was read with formatted input, SAS started reading right where the pointer was. Here is the data file with a column ruler for counting columns at the top and asterisks marking the place where SAS started reading the values of Acreage:

----+----1----+----2----+----3----+----4----+----5

Yellowstone           ID/MT/WY 1872 *  4,065,493

Everglades            FL 1934 *        1,398,800

Yosemite              CA 1864 *          760,917

Great Smoky Mountains NC/TN 1926 *       520,269

Wolf Trap Farm        VA 1966 *              130

The COMMA9. informat told SAS to read nine columns, and SAS did that even when those columns were completely blank.

The column pointer, @n, has other uses, too, and can be used anytime you want SAS to skip backward or forward within a line of data. You could use it, for example, to skip over unneeded data, or to read a variable twice using different informats.

2.13   Reading Messy Raw Data

image

Sometimes you need to read data that just don’t line up in nice columns or have predictable lengths. When you have these types of messy files, ordinary list, column, or formatted input simply aren’t enough. You need more tools in your bag: tools like the @'character' column pointer, and the colon and ampersand modifiers.

The @'character' column pointer  In the preceding section we showed how you can use the @ column pointer to move to a particular column before reading data. However, sometimes you don’t know the starting column of the data, but you do know that it always comes after a particular character or word. For these types of situations, you can use the @'character' column pointer. For example, suppose you have a data file that has information about dog ownership. Nothing in the file lines up, but you know that the breed of the dog always follows the word Breed. You could read the dog’s breed using the following INPUT statement:

INPUT @'Breed' DogBreed $;

The colon modifier  The above INPUT statement will work just fine as long as the dog’s breed is 8 characters or less (the default length for character variables) and contains no spaces. So if the dog is a Poodle you’re fine, but if the dog is a Rottweiler or Shih Tzu, it won’t work. If you assign the variable an informat such as $20. in the INPUT statement, then SAS will read for 20 columns whether or not there is a space in those columns. So the DogBreed variable may include unwanted characters, which appear after the dog’s breed on the data line. If you want SAS to read only until it encounters a space or the end of the data line, you can use a colon modifier on the informat. (This works with other delimiters besides spaces. See Section 2.18.) To use a colon modifier, simply put a colon (:) before the informat (such as :$20. instead of $20.). For example, given this line of raw data:

My dog Sam  Breed Rottweiler  Vet Bills $478

the following table shows the results you would get using different INPUT statements:

Statement

Value of variable DogBreed

 

INPUT @'Breed ' DogBreed $;

Rottweil

INPUT @'Breed ' DogBreed $20.;

Rottweiler Vet Bill

INPUT @'Breed ' DogBreed :$20.;

Rottweiler

The ampersand modifier  The colon modifier tells SAS to read a data value until it reaches a space. If you have embedded spaces in your data values, you may still be able to read your data. The ampersand modifier (&) tells SAS to read a data value until it reaches two or more spaces in a row. Insert the ampersand after the name of the variable with the embedded spaces.

INPUT @'Breed' DogBreed & $;

Example  Each year engineering students from around the USA and Canada build concrete canoes and hold regional and national competitions. Part of the competition involves racing the canoes. The following data contain the final results of a men’s sprint competition. The data lines start with the name of the canoe, followed by the school, and the time.

 

Bellatorum  School CSULA Time 1:40.5

The Kraken  School ASU Time 1:45.35

Black Widow  School UoA Time 1:33.7

Koicrete  School CSUF Time 1:40.25

Khaos  School UNLV  Time 2:03.45

Max  School UCSD Time 1:26.47

Hakuna Matata  School UCLA Time 1:20.64

Prospector  School CPSLO Time 1:12.08

Andromeda  School CPP  Time 1:25.1

Kekoapohaku  School UHM Time 1:24.49

You can see that some canoe names contain embedded spaces, and there are two spaces after each canoe name. Because the canoe names have different lengths, the school names do not line up in the same column. Also, the time values are sometimes six characters and sometimes seven. This SAS program reads the canoe name, school, and time:

DATA canoeresults;

  INFILE 'c:MyRawDataCanoes.dat';

  INPUT CanoeName & $13. @'School' School $ @'Time' RaceTime :STIMER8.;

RUN;

This INPUT statement uses both an ampersand and an informat ($13.) to read the values of CanoeName. The result is that SAS will read CanoeName until it encounters two or more spaces and then stop even if the length of that CanoeName is less than the specified informat. Then the INPUT statement uses @'School' and @'Time' to position the column pointer to read the school name and time. The school names are all less than the default length of 8 characters so no informat is needed to read them, but times are non-standard data and require an informat. Because the time is not always the same number of characters, a colon modifier is added to the informat (:STIMER8.). Without the colon modifier, SAS would go to a new data line to try to read the time values when it ran out of characters on a line of data.

Here is the CANOERESULTS data set:

 

CanoeName

School

RaceTime

1

Bellatorum

CSULA

100.50

2

The Kraken

ASU

105.35

3

Black Widow

UoA

93.70

4

Koicrete

CSUF

100.25

5

Khaos

UNLV

123.45

6

Max

UCSD

86.47

7

Hakuna Matata

UCLA

80.64

8

Prospector

CPSLO

72.08

9

Andromeda

CPP

85.10

10

Kekoapohaku

UHM

84.49

Note that the times in this output are printed in seconds. See Section 3.14 for how to format these values into minutes and seconds.

 

2.14   Reading Multiple Lines of Raw Data per Observation

image

In a typical raw data file each line of data represents one observation, but sometimes the data for each observation are spread over more than one line. Since SAS will automatically go to the next line if it runs out of data before it has read all the variables in an INPUT statement, you could just let SAS take care of figuring out when to go to a new line. But if you know that your data file has multiple lines of raw data per observation, it is better for you to explicitly tell SAS when to go to the next line than to make SAS figure it out. That way you won’t get a suspicious SAS-went-to-a-new-line note in your log (Section 11.4). To tell SAS when to skip to a new line, you simply add line pointers to your INPUT statement.

The line pointers, slash (/) and pound-n (#n), are like road signs telling SAS, “Go this way.” To read more than one line of raw data for a single observation, you simply insert a slash into your INPUT statement when you want to skip to the next line of raw data. The #n line pointer performs the same action except that you specify the line number. The n in #n stands for the number of the line of raw data for that observation; so #2 means to go to the second line for that observation, and #4 means go to the fourth line. You can even go backward using the #n line pointer, reading from line 4 and then from line 3, for example. The slash is simpler, but #n is more flexible.

image

Example  A colleague is planning his next summer vacation, and he wants to go someplace where the weather is just right. He obtains data from a meteorological database. Unfortunately, he has not quite figured out how to export from this database and makes a rather odd file.

The file contains information about temperatures for the month of July for Alaska, Florida, and North Carolina. (If your colleague chooses the last state, maybe he can visit SAS headquarters.) The first line contains the city and state, the second line lists the normal high temperature and normal low (in degrees Fahrenheit), and the third line contains the record high and low:

 

Nome AK

55 44

88 29

Miami FL

90 75

97 65

Raleigh NC

88 68

105 50

The following program reads the weather data from a file named Temperature.dat:

* Create a SAS data set named highlow;

* Read the data file using line pointers;

DATA highlow;

   INFILE 'c:MyRawDataTemperature.dat';

   INPUT City $ State $

         / NormalHigh NormalLow

         #3 RecordHigh RecordLow;

RUN;

The INPUT statement reads the values for City and State from the first line of data. Then the slash tells SAS to move to column 1 of the next line of data before reading NormalHigh and NormalLow. Likewise, the #3 tells SAS to move to column 1 of the third line of data for that observation before reading RecordHigh and RecordLow. As usual, there is more than one way to write this INPUT statement. You could replace the slash with #2 or replace #3 with a slash.

Here is the HIGHLOW data set:

 

City

State

NormalHigh

NormalLow

RecordHigh

RecordLow

1

Nome

AK

55

44

88

29

2

Miami

FL

90

75

97

65

3

Raleigh

NC

88

68

105

50

These notes appear in the log:

NOTE: 9 records were read from the infile 'c:MyRawDataTemperature.dat'.

      The minimum record length was 5.

      The maximum record length was 10.

NOTE: The data set WORK.HIGHLOW has 3 observations and 6 variables.

Notice that while nine records were read from the INFILE statement, the SAS data set contains just three observations. Usually, this would set off alarms in your mind, but here it confirms that indeed three data lines were read for every observation just as planned. You should always check your log, particularly when using line pointers.

2.15   Reading Multiple Observations per Line of Raw Data

image

There ought to be a Murphy’s law of data: whatever form data can take, it will. Normally SAS assumes that each line of raw data represents no more than one observation. When you have multiple observations per line of raw data, you can use double trailing at signs (@@) at the end of your INPUT statement. This line-hold specifier is like a stop sign telling SAS, “Stop, hold that line of raw data.” SAS will hold that line of data, continuing to read observations until it either runs out of data or reaches an INPUT statement that does not end with a double trailing @.

image

Example  Suppose you have a colleague who is planning a vacation and has obtained a file containing data about rainfall for the three cities he is considering. The file contains the name of each city, the state, average rainfall (in inches) for the month of July, and average number of days with measurable precipitation in July. The raw data look like this:

Nome AK 2.5 15 Miami FL 6.75

18 Raleigh NC . 12

Notice that in this data file the first line stops in the middle of the second observation. The following program reads these data from a file named Precipitation.dat, and uses an @@ so SAS does not automatically go to a new line of raw data for each observation:

* Input more than one observation from each record;

DATA rainfall;

   INFILE 'c:MyRawDataPrecipitation.dat';

   INPUT City $ State $ NormalRain MeanDaysRain @@;

RUN;

 

 

Here is the RAINFALL data set:

 

City

State

NormalRain

MeanDaysRain

1

Nome

AK

2.50

15

2

Miami

FL

6.75

18

3

Raleigh

NC

.

12

These notes will appear in the log:

NOTE: 2 records were read from the infile 'c:MyRawDataPrecipitation.dat'

      The minimum record length was 18.

      The maximum record length was 28.

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

NOTE: The data set WORK.RAINFALL has 3 observations and 4 variables.

While only two records were read from the raw data file, the RAINFALL data set contains three observations. The log also includes a note saying SAS went to a new line when the INPUT statement reached past the end of a line. This means that SAS came to the end of a line in the middle of an observation and continued reading with the next line of raw data. Normally these messages would indicate a problem, but in this case they are exactly what you want.

Section 3.11 shows a different way to produce multiple observations from a single line of raw data, a technique that works when some data values must be repeated.

2.16   Reading Part of a Raw Data File

image

At some time you may find that you need to read a small fraction of the records in a large data file. For example, you might be reading U.S. census data and want only female heads-of-household who have incomes above $225,000 and live in Walla Walla, Washington. You could read all the records in the data file and then throw out the unneeded ones, but that would waste time.

Luckily, you don’t have to read all the data before you tell SAS whether to keep an observation. Instead, you can read just enough variables to decide whether to keep the current observation, and then end the INPUT statement with an at sign (@), called a trailing at. This tells SAS to hold that line of raw data. While the trailing @ holds that line, you can test the observation with an IF statement to see if it’s one you want to keep. If it is, then you can read data for the remaining variables with a second INPUT statement. Without the trailing @, SAS would automatically start reading the next line of raw data with each INPUT statement.

The trailing @ is similar to the column pointer, @n, introduced in Section 2.12. By specifying a number after the @ sign, you tell SAS to move to a particular column. By using an @ without specifying a column, it is as if you are telling SAS, “Stay tuned for more information. Don’t touch that dial!” SAS will hold that line of data until it reaches either the end of the DATA step, or an INPUT statement that does not end with a trailing @.

Example  You want to read part of a raw data file containing local traffic data for freeways and surface streets. The data include information about the type of street, name of street, the average number of vehicles per hour traveling that street during the morning, and the average number of vehicles per hour for the evening. Here are the raw data:

freeway 408                           3684 3459

surface Martin Luther King Jr. Blvd.  1590 1234

surface Broadway                      1259 1290

surface Rodeo Dr.                     1890 2067

freeway 608                           4583 3860

freeway 808                           2386 2518

surface Lake Shore Dr.                1590 1234

surface Pennsylvania Ave.             1259 1290

Suppose you want to see only the freeway data at this point so you read the raw data file, Traffic.dat, with this program:

* Use a trailing @, then delete surface streets;

DATA freeways;

   INFILE 'c:MyRawDataTraffic.dat';

   INPUT Type $ @;

   IF Type = 'surface' THEN DELETE;

   INPUT Name $ 9-38 AMTraffic PMTraffic;

RUN;

 

Notice that there are two INPUT statements. The first reads the character variable Type and then ends with an @. The trailing @ holds each line of data while the IF statement tests it. The second INPUT statement reads Name (in columns 9 through 38), AMTraffic, and PMTraffic. If an observation has a value of surface for the variable Type, then the second INPUT statement never executes. Instead, SAS returns to the beginning of the DATA step to process the next observation and does not add the unwanted observation to the FREEWAYS data set. (Do not pass go, do not collect $200.)

Here is the FREEWAYS data set:

 

Type

Name

AMTraffic

PMTraffic

1

freeway

408

3684

3459

2

freeway

608

4583

3860

3

freeway

808

2386

2518

When you run this program, the log will contain the following two notes, one saying that eight records were read from the input file and another saying that the new data set contains only three observations:

NOTE: 8 records were read from the infile 'c:MyRawDataTraffic.dat'.

      The minimum record length was 47.

      The maximum record length was 47.

NOTE: The data set WORK.FREEWAYS has 3 observations and 4 variables.

The other five observations had a value of surface for the variable Type and were deleted by the IF statement.

This example used a DELETE statement to delete observations for surface streets. Instead of specifying which observations to delete, you could specify which observations to keep. To keep only observations for freeways, you would use a subsetting IF statement. See Section 3.8 for more about subsetting IF statements.

Trailing @ versus double trailing @  The double trailing @, discussed in the previous section, is similar to the trailing @. Both are line-hold specifiers; the difference is how long they hold a line of data for input. The trailing @ holds a line of data for subsequent INPUT statements, but releases that line of data when SAS returns to the top of the DATA step to begin building the next observation. The double trailing @ holds a line of data for subsequent INPUT statements even when SAS starts building a new observation. In both cases, the line of data is released if SAS reaches a subsequent INPUT statement that does not contain a line-hold specifier.

2.17   Controlling Input with Options in the INFILE Statement

So far in this chapter, we have seen ways to use the INPUT statement to read many different types of raw data. When reading raw data files, SAS makes certain assumptions. For example, SAS starts reading with the first data line and, if SAS runs out of data on a line, it automatically goes to the next line to read values for the rest of the variables. Most of the time this is OK, but some data files can’t be read using the default assumptions. The options in the INFILE statement change the way SAS reads raw data files. The following options are useful for reading particular types of data files. Place these options after the filename in the INFILE statement.

FIRSTOBS=  The FIRSTOBS= option tells SAS at what line to begin reading data. This is useful if you have a data file that contains descriptive text or header information at the beginning, and you want to skip over these lines before reading the data. The following data file, for example, has a description of the data in the first two lines:

Ice-cream sales data for the summer

Flavor     Location   Boxes sold

Chocolate  213        123

Vanilla    213        512

Chocolate  415        242

The following program uses the FIRSTOBS= option to tell SAS to start reading data on the third line of the file:

DATA icecream;

   INFILE 'c:MyRawDataIceCreamSales.dat' FIRSTOBS = 3;

   INPUT Flavor $ 1-9 Location BoxesSold;

RUN;

OBS=  The OBS= option can be used anytime you want to read only a part of your data file. It tells SAS to stop reading when it gets to that line in the raw data file. Note that it does not necessarily correspond to the number of observations. If, for example, you are reading two raw data lines for each observation, then an OBS=100 would read 100 data lines, and the resulting SAS data set would have 50 observations. The OBS= option can be used with the FIRSTOBS= option to read lines from the middle of the file. For example, suppose the ice-cream sales data had a remark at the end of the file that was not part of the data.

Ice-cream sales data for the summer

Flavor     Location   Boxes sold

Chocolate  213        123

Vanilla    213        512

Chocolate  415        242

Data verified by Blake White

With FIRSTOBS=3 and OBS=5, SAS will start reading this file on the third data line and stop reading after the fifth data line.

DATA icecream;

   INFILE 'c:MyRawDataIceCreamSales2.dat' FIRSTOBS = 3 OBS=5;

   INPUT Flavor $ 1-9 Location BoxesSold;

RUN;

 

MISSOVER  By default, SAS will go to the next data line to read more data if SAS has reached the end of the data line and there are still more variables in the INPUT statement that have not been assigned values. The MISSOVER option tells SAS not to go to the next line of data when it runs out of data. Instead, assign missing values to any remaining variables. The following data file illustrates where this option may be useful. This file contains test scores for a self-paced course. Since not all students complete all the tests, some have more scores than others.

Nguyen   89 76 91 82

Ramos    67 72 80 76 86

Robbins  76 65 79

The following program reads the data for the five test scores, assigning missing values to tests not completed:

DATA class102;

   INFILE 'c:MyRawDataAllScores.dat' MISSOVER;

   INPUT Name $ Test1 Test2 Test3 Test4 Test5;

RUN;

TRUNCOVER  You need the TRUNCOVER option when you are reading data using column or formatted input and some data lines are shorter than others. If a variable’s field extends past the end of the data line, then, by default, SAS will go to the next line to start reading the variable’s value. This option tells SAS to read data for the variable until it reaches the end of the data line, or the last column specified in the format or column range, whichever comes first. The next file contains addresses and must be read using column or formatted input because the street names have embedded blanks. Note that the data lines have different lengths:

John Garcia     114  Maple Ave.

Sylvia Chung   1302  Washington Drive

Martha Newton    45  S.E. 14th St.

This program uses column input to read the address file. Because some of the addresses stop before the end of the variable Street’s field (columns 22 through 37), you need the TRUNCOVER option. Without the TRUNCOVER option, SAS would try to go to the next line to read the data for Street on the first and third records.

DATA homeaddress;

   INFILE 'c:MyRawDataAddress.dat' TRUNCOVER;

   INPUT Name $ 1-15 Number 16-19 Street $ 22-37;

RUN;

TRUNCOVER is similar to MISSOVER. Both will assign missing values to variables if the data line ends before the variable’s field starts. But when the data line ends in the middle of a variable field, TRUNCOVER will take as much as is there, whereas MISSOVER will assign the variable a missing value.

2.18   Reading Delimited Files with the DATA Step

image

We suspect that by now you have realized that with SAS there is usually more than one way to accomplish the same result. Section 2.6 showed how to read delimited data files using the IMPORT procedure; now we are going to show how to read delimited files using the DATA step. The INFILE statement has two options that make it easy to read delimited data files: the DLM= option and the DSD option.

The DLM= option  If you read your data using list input, SAS expects your file to have spaces between data values. The DELIMITER=, or DLM=, option in the INFILE statement allows you to use other delimiters. To read data with commas as delimiters, add the DLM= option to your INFILE statement like this:

INFILE 'filename' DLM = ',';

You can use the DLM= option to specify tabs as delimiters too. In ASCII, 09 is the hexadecimal equivalent of a tab character, and the notation ‘09’X means a hexadecimal 09. If your computer uses EBCDIC (IBM mainframes) instead of ASCII, then use DLM=’05’X. Here is a DLM= option with the ASCII tab specified as the delimiter.

INFILE 'filename' DLM = '09'X;

The comma and tab characters are common delimiters for data files, but you can read data files with any delimiter character by simply enclosing the delimiter character in quotation marks after the DLM= option (for example, DLM=’&’). If your delimiter is a string of characters, then use the DLMSTR= option.

By default, SAS interprets two or more delimiters in a row as a single delimiter. If your file has missing values, and two delimiters in a row indicate a missing value, then you will also need the DSD option in the INFILE statement.

The DSD option  The DSD (Delimiter-Sensitive Data) option for the INFILE statement does three things for you. First, it ignores delimiters in data values enclosed in quotation marks. Second, it does not read quotation marks as part of the data value. Third, it treats two delimiters in a row as a missing value. The DSD option assumes that the delimiter is a comma. If your delimiter is not a comma, then you can use the DLM= option with the DSD option to specify the delimiter. For example, to read a tab-delimited ASCII file with missing values indicated by two consecutive tab characters use this:

INFILE 'filename' DLM = '09'X DSD;

Comma-separated values files, or CSV files, are a common type of file that can be read with the DSD option. Many programs, such as Microsoft Excel, can save data in CSV format. These files have commas for delimiters and consecutive commas for missing values; if there are commas in any of the data values, then those values are enclosed in quotation marks.

 

Example  The following example illustrates how to read a CSV file using the DSD option. Jerry’s Coffee Shop employs local bands to attract customers. Jerry keeps records of the number of customers for each band, for each night they play in his shop. The band’s name is followed by the number of customers present at 8 p.m., 9 p.m., 10 p.m., and 11 p.m.

Lupine Lights,45,63,70,

Awesome Octaves,17,28,44,12

"Stop, Drop, and Rock-N-Roll",34,62,77,91

The Silveyville Jazz Quartet,38,30,42,43

Catalina Converts,56,,65,34

Notice that one group’s name has embedded commas, and is enclosed in quotation marks. Also, the last group has a missing data point for the 9 p.m. hour as indicated by two consecutive commas. Here is the program that will read this data file:

DATA music;

   INFILE 'c:MyRawDataBands.csv' DLM = ',' DSD MISSOVER;

   INPUT BandName :$30. EightPM NinePM TenPM ElevenPM;

RUN;

Notice that for BandName the INPUT statement uses an informat with a colon modifier. The colon modifier tells SAS to read for the length of the informat (30 for BandName), or until it encounters a delimiter, whichever comes first.

This data file is missing one data value at the end of the first line. By default, SAS would go to the next line of data to find the last data value. The MISSOVER option on the INFILE statement tells SAS not to go to the next line of data when it runs out of values. It is prudent when using the DSD option to add the MISSOVER option if there is any chance that you have missing data values at the end of your data lines.

Here is the MUSIC data set:

 

BandName

EightPM

NinePM

TenPM

ElevenPM

1

Lupine Lights

45

63

70

.

2

Awesome Octaves

17

28

44

12

3

Stop, Drop, and Rock-N-Roll

34

62

77

91

4

The Silveyville Jazz Quartet

38

30

42

43

5

Catalina Converts

56

.

65

34

The SAS log contains these notes describing the new data set:

NOTE: 5 records were read from the infile 'c:MyRawDataBands.csv'.

      The minimum record length was 23.

      The maximum record length was 41.

NOTE: The data set WORK.MUSIC has 5 observations and 5 variables.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset