CHAPTER 2
Accessing Your Data
2.1 Methods for Getting Your Data into SAS
2.2 SAS Data Libraries and Data Sets
2.3 Listing the Contents of a SAS Data Set
2.4 Reading Excel Files with the IMPORT Procedure
2.5 Accessing Excel Files Using the XLSX LIBNAME Engine
2.6 Reading Delimited Files with the IMPORT Procedure
2.7 Telling SAS Where to Find Your Raw Data
2.8 Reading Raw Data Separated by Spaces
2.9 Reading Raw Data Arranged in Columns
2.10 Reading Raw Data Not in Standard Format
2.14 Reading Multiple Lines of Raw Data per Observation
2.15 Reading Multiple Observations per Line of Raw Data
2.16 Reading Part of a Raw Data File
2.17 Controlling Input with Options in the INFILE Statement
2.18 Reading Delimited Files with the DATA Step
2.1 Methods for Getting Your Data into SAS
Data come in many different forms. Your data may be handwritten on a piece of paper, or typed into a raw data file on your computer. Perhaps your data are in a database file on your personal computer, or in a data warehouse on the server at your office. Wherever your data reside, there is a way for SAS to use them. You may need to convert your data from one form to another, or SAS may be able to use your data in their current form. This section outlines several methods for getting your data into SAS. Most of these methods are covered in this book, but a few of the more advanced methods are merely mentioned so that you know they exist. We do not attempt to cover all methods available for getting your data into SAS, as new methods are continually being developed, and creative SAS users can always come up with clever methods that work for their own situations. However your data are stored, there should be at least one method in this book that will work for you.
Methods for accessing your data can be put into five general categories:
♦ using SAS data sets
♦ entering data directly into SAS data sets
♦ creating SAS data sets from raw data files
♦ converting other software’s data files into SAS data sets
♦ reading other software’s data files directly
Using SAS data sets If your data are already stored as SAS data sets, all you have to do is tell SAS which data set to use. SAS data sets are stored and accessed via SAS data libraries. A SAS data library is a collection of one or more SAS data sets that are stored in the same location. Some SAS data libraries are defined by default, but you can also define your own. Individual data sets within a SAS data library are called members. See the next section for more information about SAS data libraries and data set members.
Entering data directly into SAS data sets Sometimes the best method for getting your data into SAS is to enter the data directly into SAS data sets through your keyboard.
♦ The Viewtable window is part of the SAS windowing environment and is included with Base SAS software. Viewtable allows you to enter your data in a tabular format. You can define variables (also called columns) and give them attributes such as name, length, and type (character or numeric).
♦ The Data Grid is part of SAS Enterprise Guide, which is included with Base SAS, but only for Windows. As with Viewtable, you can define variables, give them attributes, and then enter your data.
Creating SAS data sets from raw data files Much of this chapter is devoted to reading raw data files (also referred to as text, ASCII, sequential, or flat files). You can always read a raw data file using Base SAS software. If your data are not already in a raw data file, chances are you can convert your data into a raw data file. There are two general methods for reading raw data files:
♦ The DATA step is so versatile that it can read almost any type of raw data file. This method is covered in this chapter starting with Section 2.7.
♦ The IMPORT procedure, covered in Section 2.6, is available with Base SAS running on the UNIX and Windows operating environments. This is an easy way to read particular types of raw data files including tab-delimited and comma-separated values (CSV) files.
Converting other software’s data files into SAS data sets Each software application has its own form for data files. While this is useful for software developers, it is troublesome for software users—especially when your data are in one application, but you need to analyze them with another. There are several options for converting data:
♦ The IMPORT procedure, available for UNIX and Windows operating environments, can be used to convert Microsoft Excel (covered in Section 2.4), Lotus, dBase, Stata, SPSS, JMP, Paradox, and Microsoft Access files into SAS data sets. All of these except JMP require that you have SAS/ACCESS Interface to PC Files installed on your SAS server.
♦ If you don’t have SAS/ACCESS software, then you can always create a raw data file from your application and read the raw data file with either the DATA step or the IMPORT procedure. Many applications can create CSV files, which are easily read using the IMPORT procedure (covered in Section 2.6) or the DATA step (covered in Section 2.18).
Reading other software’s data files directly Under certain circumstances, you may be able to read data without converting to a SAS data set. This method is particularly useful when you have many people updating data files, and you want to make sure that you are using the most current data.
♦ The SAS/ACCESS products allow you to read data without converting your data into SAS data sets. There are SAS/ACCESS products for most of the popular database management systems including Oracle, DB2, MySQL, and Teradata. This method of data access is not covered in this book.
♦ SAS/ACCESS Interface to PC Files also allows you to read some PC file types directly without converting them to SAS data sets including Excel (covered in Section 2.5), Access, and JMP. One method that SAS uses to access PC files in their native format is called a LIBNAME engine. To tell SAS the type of file you want to read, you specify a LIBNAME statement that includes the option for that particular engine. See the SAS Documentation for more information on LIBNAME engines.
♦ There are also LIBNAME engines that allow you to read data directly and are part of Base SAS software (so you don’t need SAS/ACCESS Interface to PC Files). There are engines for SPSS, OSIRIS, old versions of SAS data sets, and SAS data sets in transport format. Check the SAS Documentation for your operating environment for a complete list of available engines.
Given all these methods for getting your data into SAS, you are sure to find at least one method that will work for you—probably more.
2.2 SAS Data Libraries and Data Sets
It’s not always obvious, but all SAS data sets have a two-level name such as WORK.BANANA. The first level is called its libref (short for SAS data library reference), and the second level is the member name that uniquely identifies the data set within the library. A SAS data library is a collection of SAS data sets residing in the same location. So a libref is like a nickname for that location. Sometimes a libref refers to a physical location, such as a hard drive or flash drive. Other times it refers to a logical location such as a directory or folder.
Both the libref and member name follow the standard rules for SAS names. They must start with a letter or underscore and contain only letters, numerals, or underscores. However, librefs cannot be longer than 8 characters while member names can be up to 32 characters. (There are system options that allow you to bend these rules, and the defaults may change, so check the SAS Documentation if you have a question.)
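For example, the following statements use a valid libref and member name. A libref such as salesdata2016 (more than 8 characters) or a member name such as 2016sales (starting with a numeral) would cause an error. (The path here is just a placeholder for this sketch.)

LIBNAME mylib 'c:\MySASLib';
DATA mylib.sales_2016;
   Amount = 100;
RUN;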
Built-in SAS data libraries SAS comes with some built-in libraries. The exact libraries available to you will depend on the SAS products you have installed and whether you have any locally defined libraries, but will always include these:
WORK     is a special library for temporary SAS data sets. The WORK library is
         erased when you exit SAS. WORK is the default library if you don’t
         specify a libref in the data set name.
SASHELP  contains sample data sets that you can use, and is usually read-only.
SASUSER  is a library provided for you to save your data sets. This library is
         permanent on most systems. However, in z/OS and some UNIX systems, it
         may be temporary or read-only. If you have more than one SAS server,
         then you may have a SASUSER library on each server.
Temporary SAS data sets If you create a data set that you won’t need later, it’s a good idea to make it temporary so your disks don’t get cluttered. Here is the DATA step shown in Chapter 1:
DATA distance;
Miles = 26.22;
Kilometers = 1.61 * Miles;
RUN;
This DATA step creates a temporary data set. Notice that the data set name does not include the libref WORK. Because the DATA statement uses a one-level name, SAS assigns the default library, WORK, and uses DISTANCE as the member name within that library. The log contains this note showing the complete, two-level name:
NOTE: The data set WORK.DISTANCE has 1 observations and 2 variables.
Permanent SAS data sets using direct referencing In general, if you use a data set more than once, it is more efficient to save it as a permanent SAS data set than to create a new temporary SAS data set every time you want to use the data. To use direct referencing, just take your operating environment’s file path and name, enclose it in quotation marks, and put it in your program. The quotation marks tell SAS that this is a permanent SAS data set. Here is the general form of a DATA statement for direct referencing in different operating environments. (Keep in mind that SAS University Edition and SAS OnDemand for Academics use Linux.)
Windows:         DATA 'drive:\directory\filename';
UNIX or Linux:   DATA '/home/path/filename';
z/OS:            DATA 'data-set-name';
This program is the same as the preceding one except that it includes the file path and name enclosed in quotes (using the syntax for the Windows operating environment):
DATA 'c:\MySASLib\distance';
Miles = 26.22;
Kilometers = 1.61 * Miles;
RUN;
This time the log contains this note showing the file path and name:
NOTE: The data set c:\MySASLib\distance has 1 observations and 2 variables.
The SAS log does not show the libref, but if you check your SAS libraries you will see that SAS has created a library with a name like WC000001 and it contains a data set named DISTANCE. Librefs are temporary, but this data set is permanent because it is not in the WORK library.
Permanent SAS data sets with a LIBNAME statement If you read or write many data sets in the same SAS data library, it is easier (and less error prone) to use a LIBNAME statement than to keep typing your file path and name over and over. Here is the general form of LIBNAME statements for different operating environments:
Windows:         LIBNAME libref 'drive:\directory';
UNIX or Linux:   LIBNAME libref '/home/path';
z/OS:            LIBNAME libref 'data-set-name';
This program is the same as the preceding two except that it uses a LIBNAME statement to define a libref. Notice that the DATA statement uses the two-level name, MARATHON.DISTANCE.
LIBNAME marathon 'c:\MySASLib';
DATA marathon.distance;
Miles = 26.22;
Kilometers = 1.61 * Miles;
RUN;
This time the log contains this note with the two-level data set name:
NOTE: The data set MARATHON.DISTANCE has 1 observations and 2 variables.
This is a permanent SAS data set because the libref is not WORK. If you view the directory of files on your computer, you will not see a file named MARATHON.DISTANCE. That is because operating environments have their own systems for naming files. When run under Windows or UNIX, this data set will be called distance.sas7bdat. Under z/OS, the filename would be the data-set-name specified in the LIBNAME statement. The examples in this section all create SAS data sets; the next section shows an example that accesses an existing permanent SAS data set.
2.3 Listing the Contents of a SAS Data Set
To use a SAS data set, all you need to do is tell SAS the name and location of the data set you want, and SAS will figure out what is in it. SAS can do this because SAS data sets are self-documenting, which is another way of saying that SAS automatically stores information about the data set (called the descriptor portion) along with the data. There is an easy way to see this information about your data set: use PROC CONTENTS.
PROC CONTENTS is a simple procedure. You just type the keywords PROC CONTENTS and specify the data set you want with the DATA= option:
PROC CONTENTS DATA = data-set;
Example The SASHELP data library contains sample data sets and is provided with Base SAS. The following PROC CONTENTS produces a report about the CARS data set in the SASHELP data library. Because the SASHELP library is defined by default, no LIBNAME statement is needed.
* Use PROC CONTENTS to describe the CARS data set in the SASHELP library;
PROC CONTENTS DATA = SASHELP.CARS;
RUN;
The output from PROC CONTENTS is like a table of contents for your data set:
The CONTENTS Procedure

❶ Data Set Name        SASHELP.CARS           ❷ Observations           428
  Member Type          DATA                   ❸ Variables              15
  Engine               V9                       Indexes                0
❹ Created              06/24/2015 22:15:37      Observation Length     152
  Last Modified        06/24/2015 22:15:37      Deleted Observations   0
  Protection                                    Compressed             NO
  Data Set Type                                 Sorted                 YES
❺ Label                2004 Car Data
  Data Representation  WINDOWS_64
  Encoding             us-ascii ASCII (ANSI)

Engine/Host Dependent Information
  Data Set Page Size          65536
  Number of Data Set Pages    2
  First Data Page             1
  Max Obs per Page            430
  Obs in First Data Page      413
  Number of Data Set Repairs  0
  ExtendObsCounter            YES
  Filename                    C:\Program Files\SASHome\SASFoundation\9.4\core\sashelp\cars.sas7bdat
  Release Created             9.0401M3
  Host Created                X64_SRV12
Alphabetic List of Variables and Attributes

  #  ❻ Variable     ❼ Type  ❽ Len  ❾ Format   ❿ Label
  9  Cylinders      Num      8
  5  DriveTrain     Char     5
  8  EngineSize     Num      8               Engine Size (L)
 10  Horsepower     Num      8
  7  Invoice        Num      8     DOLLAR8.
 15  Length         Num      8               Length (IN)
 11  MPG_City       Num      8               MPG (City)
 12  MPG_Highway    Num      8               MPG (Highway)
  6  MSRP           Num      8     DOLLAR8.
  1  Make           Char    13
  2  Model          Char    40
  4  Origin         Char     6
  3  Type           Char     8
 13  Weight         Num      8               Weight (LBS)
 14  Wheelbase      Num      8               Wheelbase (IN)

Sort Information
  Sortedby       Make Type
  Validated      YES
  Character Set  ANSI
The output starts with information about your data set and then describes each variable.
For the data set                 For each variable
❶ Data set name                  ❻ Variable name
❷ Number of observations         ❼ Type (numeric or character)
❸ Number of variables            ❽ Length (storage size in bytes)
❹ Date created                   ❾ Format for printing (if any)
❺ Data set label (if any)        ❿ Variable label (if any)
2.4 Reading Excel Files with the IMPORT Procedure
SAS offers many ways to access data in Microsoft Excel files. In fact, entire books have been written about exchanging data between SAS and Excel. SAS Studio, SAS Enterprise Guide and the SAS windowing environment each have point-and-click windows that will import Excel files for you, but writing a program gives you more control and is easier to repeat. If you have SAS/ACCESS Interface to PC Files and are running in the Windows or UNIX operating environment, then you can use the IMPORT procedure to read Excel files. PROC IMPORT will scan your Excel worksheet and automatically determine the variable types (character or numeric), will assign lengths to the character variables, and can recognize most date formats. Also, if you want, you can use the first row in your worksheet for the variable names.
Here is the general form of the IMPORT procedure for reading Excel files:
PROC IMPORT DATAFILE = 'filename' OUT = data-set
DBMS = identifier REPLACE;
where filename is the file you want to read and data-set is the name of the SAS data set you want to create. The REPLACE option tells SAS to replace the SAS data set named in the OUT= option if it already exists. The DBMS= option tells SAS the type of Excel file to read and may not be necessary.
DBMS identifiers There are several DBMS identifiers you can use to read Excel files. Three commonly used identifiers are: EXCEL, XLS, and XLSX. In the UNIX operating environment, use the XLS identifier for older style files (.xls extension), and the XLSX identifier for newer style files (.xlsx extension). In the Windows operating environment, in addition to the XLS and XLSX identifiers, you can use the EXCEL identifier to read all types of Excel files. The EXCEL identifier uses different technology to read files than do the XLS and XLSX identifiers, so the results may be different. By default, the XLS and XLSX identifiers look at more data rows to determine the column type than does the EXCEL identifier. Not all of these identifiers may work for you if your Windows computer has a mixture of 64-bit and 32-bit applications. In addition, some computer configurations may require that a PC Files Server be installed. The PC Files Server uses the EXCELCS identifier. See the SAS Documentation for more information.
Optional statements If you have more than one sheet in your file, then you can specify which sheet to read using the following statement:
SHEET = 'sheet-name';
If you want to read only specific cells in the sheet, you can specify a range. The range can be a named range (if defined), or you can specify the upper-left (UL) and lower-right (LR) cells for the range as follows:
RANGE = 'sheet-name$UL:LR';
By default, the IMPORT procedure will take the variable names from the first row of the spreadsheet. If you do not want this, then you can add the following statement to the procedure and SAS will give your variables names like A, B, C, and so on.
GETNAMES = NO;
When using the EXCEL identifier, if you have a column that contains both numeric and character values, then by default, the numbers will be converted to missing values. To read the numbers as character values instead of converting them to missing values, use the following statement:
MIXED = YES;
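To illustrate, a PROC IMPORT step that reads only one worksheet might look like this. (The sheet name here is an assumption for the sake of the sketch.)

PROC IMPORT DATAFILE = 'c:\MyExcel\Trees.xlsx' OUT = magnolia
      DBMS = XLSX REPLACE;
   SHEET = 'Trees';
RUN;

To read only part of that worksheet, you could replace the SHEET statement with a RANGE statement such as RANGE = 'Trees$A1:F6', since the range specification already includes the sheet name.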
Example Suppose you have the following Microsoft Excel spreadsheet which contains data about magnolia trees. For each type of tree the file includes the scientific and common names, maximum height, age at first blooming when planted from seed, whether evergreen or deciduous, and color of flowers.
The following program reads the Microsoft Excel file using the IMPORT procedure with the XLSX DBMS identifier and the REPLACE option, and creates a permanent data set named MAGNOLIA.
LIBNAME sasfiles 'c:\MySASLib';
* Read an Excel spreadsheet using PROC IMPORT;
PROC IMPORT DATAFILE = 'c:\MyExcel\Trees.xlsx' OUT = sasfiles.magnolia
DBMS = XLSX REPLACE;
RUN;
Here is the MAGNOLIA data set. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment.
     ScientificName  CommonName         MaxHeight  AgeBloom  Type  Color
  1  M. grandiflora  Southern Magnolia         80        15   E    white
  2  M. campbellii                             80        20   D    rose
  3  M. liliiflora   Lily Magnolia             12         4   D    purple
  4  M. soulangiana  Saucer Magnolia           25         3   D    pink
  5  M. stellata     Star Magnolia             10         3   D    white
Notice that the variable names were taken from the first row in the spreadsheet. If you have spaces or special characters in the first row, see Section 3.18.
You should always check the SAS log when you create data sets to make sure nothing went wrong. These notes tell you that the MAGNOLIA data set has five observations and six variables.
NOTE: The import data set has 5 observations and 6 variables.
NOTE: SASFILES.MAGNOLIA was successfully created.
2.5 Accessing Excel Files Using the XLSX LIBNAME Engine
The preceding section covered one way to convert an Excel file into a SAS data set, the IMPORT procedure. The XLSX LIBNAME engine goes a step further. With the XLSX LIBNAME engine, you can read from and write to an Excel file without converting it to a SAS data set. This engine works for files created by any version of Microsoft Excel 2007 or later on the Windows or UNIX operating environments. You must have SAS 9.4M2 or higher and SAS/ACCESS Interface to PC Files software. With this engine, you can read an existing worksheet, replace a worksheet, or add a new worksheet, but you cannot update individual values inside a worksheet.
The XLSX LIBNAME engine uses the first line in your data file for the variable names, scans each full column to determine the variable type (character or numeric), assigns lengths to character variables, and recognizes dates as well as numeric values containing commas or dollar signs. While the XLSX LIBNAME engine does not offer many options, because you are using an Excel worksheet like a SAS data set, you can use many standard data set options. For example, you can use the RENAME= data set option to change the names of variables. (See Section 6.10.)
Here is the general form of a LIBNAME statement with the XLSX engine option:
LIBNAME libref XLSX 'filename';
Notice that the only difference between this LIBNAME statement and one used to access a SAS data library (Section 2.2) is that the XLSX option has been added. The libref is a name you make up and must follow the rules for librefs (eight characters or fewer, starting with a letter or underscore, and containing only letters, numerals, or underscores). The filename is the name used by your operating environment.
Keep in mind that, unlike some other LIBNAME engines, the XLSX LIBNAME engine only locks your Excel file while it actively accesses it. The file quickly becomes available to other people and processes that can read and edit it. You do not need to release the file, but if you know you are done with it, it is a good practice to clear the libref with this statement:
LIBNAME libref CLEAR;
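For example, you could write a SAS data set back to the Excel file as a new worksheet with a DATA step and then clear the libref. (The worksheet name MAGNOLIA2 here is an assumption for illustration.)

LIBNAME exfiles XLSX 'c:\MyExcel\Trees.xlsx';
DATA exfiles.magnolia2;
   SET exfiles.sheet1;
RUN;
LIBNAME exfiles CLEAR;

This reads the worksheet SHEET1 and adds a new worksheet named MAGNOLIA2 to the same Excel file.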
Example This example uses the same Excel spreadsheet as the previous section. The spreadsheet contains data about magnolia trees: the scientific and common names, maximum height, age at first blooming when planted from seed, whether evergreen or deciduous, and color of flowers.
With the XLSX LIBNAME engine, SAS can read the spreadsheet directly, without first converting it to a SAS data set. Here is a PROC PRINT that prints the data from the Excel spreadsheet. A simple PROC PRINT like this one prints the values for all variables and all observations.
* Read an Excel spreadsheet using the XLSX LIBNAME engine;
LIBNAME exfiles XLSX 'c:\MyExcel\Trees.xlsx';
PROC PRINT DATA = exfiles.sheet1;
TITLE 'PROC PRINT of Excel File';
RUN;
Here are the results of the PROC PRINT. Notice that the variable names were taken from the first row in the spreadsheet. If you have spaces or special characters in the first row, see Section 3.18.
PROC PRINT of Excel File
Obs  ScientificName  CommonName         MaxHeight  AgeBloom  Type  Color
  1  M. grandiflora  Southern Magnolia         80        15   E    white
  2  M. campbellii                             80        20   D    rose
  3  M. liliiflora   Lily Magnolia             12         4   D    purple
  4  M. soulangiana  Saucer Magnolia           25         3   D    pink
  5  M. stellata     Star Magnolia             10         3   D    white
If you want to convert an Excel spreadsheet to a SAS data set, you can do that too. Here is a DATA step that reads the Excel spreadsheet (using a SET statement in a DATA step, which is covered in more detail in Section 3.1) and creates a permanent SAS data set named MAGNOLIA.
* Read Excel into a permanent SAS data set;
LIBNAME exfiles XLSX 'c:\MyExcel\Trees.xlsx';
LIBNAME sasfiles 'c:\MySASLib';
DATA sasfiles.magnolia;
SET exfiles.sheet1;
RUN;
Here is the MAGNOLIA data set:
     ScientificName  CommonName         MaxHeight  AgeBloom  Type  Color
  1  M. grandiflora  Southern Magnolia         80        15   E    white
  2  M. campbellii                             80        20   D    rose
  3  M. liliiflora   Lily Magnolia             12         4   D    purple
  4  M. soulangiana  Saucer Magnolia           25         3   D    pink
  5  M. stellata     Star Magnolia             10         3   D    white
The SAS log shows that no observations or variables were lost:
NOTE: The import data set has 5 observations and 6 variables.
NOTE: There were 5 observations read from the data set EXFILES.sheet1.
NOTE: The data set SASFILES.MAGNOLIA has 5 observations and 6 variables.
2.6 Reading Delimited Files with the IMPORT Procedure
The IMPORT procedure reads Excel files (Section 2.4), but also reads many other types of files. Delimited files are raw data files that have a special character separating data values. Many types of software can save data as delimited files, often with commas or tab characters for delimiters. The IMPORT procedure for reading delimited files is part of Base SAS in the UNIX and Windows operating environments. You can also read delimited files with a DATA step (Section 2.18).
PROC IMPORT will scan your data file (the first 20 rows by default) and automatically determine the variable types (character or numeric), will assign lengths to the character variables, and can recognize some date formats. PROC IMPORT will treat two consecutive delimiters in your data file as a missing value, will read values enclosed in quotation marks, and assign missing values to variables when it runs out of data on a line. Also, if you want, you can use the first line in your data file for the variable names.
Here is the general form of the IMPORT procedure for delimited files:
PROC IMPORT DATAFILE = 'filename' OUT = data-set;
where filename is the file you want to read and data-set is the name of the SAS data set you want to create. SAS will determine the file type based on the extension of the file.
Type of File                          Extension  DBMS Identifier
Comma-delimited                       .csv       CSV
Tab-delimited                         .txt       TAB
Delimiters other than commas or tabs             DLM
If your file does not have the proper extension, or your file is of type DLM, then you must use the DBMS= option in the PROC IMPORT statement. Use the REPLACE option if you already have a SAS data set with the name you specified in the OUT= option, and you want to overwrite it. Here is the general form of PROC IMPORT with both the REPLACE and the DBMS options:
PROC IMPORT DATAFILE = 'filename' OUT = data-set
DBMS = identifier REPLACE;
Optional statements Some types of files need a few more instructions to be read correctly. If the data do not start in the first line of the file, use the DATAROWS statement. If the delimiter is not a comma, tab, or space, use the DELIMITER statement. If your file contains only data and no headings, use the GETNAMES=NO statement to assign default variable names. Lastly, if your data file has all missing values or non-representative data in the first 20 data rows, you may need the GUESSINGROWS statement to make sure variables are assigned the correct data type and length.
DATAROWS = n;             starts reading data in row n. Default is 1.
DELIMITER = 'character';  specifies the delimiter for DLM files. Default is space.
GETNAMES = NO;            tells SAS whether to use the first line of the input
                          file for variable names. Default is YES. If NO, then
                          variables are named VAR1, VAR2, VAR3, and so on.
GUESSINGROWS = n;         uses n rows to determine variable types. Default is 20.
Here is the general form of PROC IMPORT with the GETNAMES=NO statement:
PROC IMPORT DATAFILE = 'filename' OUT = data-set REPLACE;
GETNAMES = NO;
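Putting these statements together, suppose a raw data file named Bands.txt uses semicolons as delimiters and has no heading line. (This filename and layout are assumptions for the sake of the sketch.) You could read it this way:

PROC IMPORT DATAFILE = 'c:\MyRawData\Bands.txt' OUT = music
      DBMS = DLM REPLACE;
   DELIMITER = ';';
   GETNAMES = NO;
RUN;

Because the file is of type DLM, the DBMS= option is required, and because GETNAMES = NO, the variables would be named VAR1, VAR2, and so on.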
Example The following example uses data about Jerry’s Coffee Shop where Jerry employs local bands to attract customers. Jerry keeps records of the number of customers present for each band. The data are the band name, and the number of customers at 8 p.m., 9 p.m., 10 p.m., and 11 p.m. Notice that one of the bands, “Stop, Drop, and Rock-N-Roll,” has commas in the name of the band. When a data value contains the delimiter, then the value must be enclosed in quotation marks.
BandName,EightPM,NinePM,TenPM,ElevenPM
Lupine Lights,45,63,70,
Awesome Octaves,17,28,44,12
"Stop, Drop, and Rock-N-Roll",34,62,77,91
The Silveyville Jazz Quartet,38,30,42,43
Catalina Converts,56,,65,34
Here is the program that will read this data file:
PROC IMPORT DATAFILE = 'c:\MyRawData\Bands2.csv' OUT = music REPLACE;
RUN;
Here is the MUSIC data set created by the program. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment. Notice that PROC IMPORT used the first row of data for variable names. If you have spaces or special characters in the first row, see Section 3.18.
     BandName                      EightPM  NinePM  TenPM  ElevenPM
  1  Lupine Lights                      45      63     70         .
  2  Awesome Octaves                    17      28     44        12
  3  Stop, Drop, and Rock-N-Roll        34      62     77        91
  4  The Silveyville Jazz Quartet       38      30     42        43
  5  Catalina Converts                  56       .     65        34
The SAS log contains these notes describing the new data set:
NOTE: 5 records were read from the infile 'c:\MyRawData\Bands2.csv'.
The minimum record length was 23.
The maximum record length was 41.
NOTE: The data set WORK.MUSIC has 5 observations and 5 variables.
2.7 Telling SAS Where to Find Your Raw Data
The rest of this chapter covers reading raw data using the DATA step. If your data are already stored as SAS data sets, as Excel spreadsheets, as CSV or tab-delimited files, or in a database such as Oracle, then you can probably skip this topic. However, if your data are in raw data files (also referred to as text, ASCII, sequential, or flat files), then you can use the power and flexibility of the DATA step to read your data.
A raw data file can be viewed using simple text editors or system commands. On personal computers, raw data files will either have no program associated with them, or they will be associated with simple editors like Microsoft Notepad. In some operating environments, you can use commands to list the file, such as the cat or more commands in UNIX. Spreadsheet files are examples of data files that are not raw data. If you try using a text editor to look at a spreadsheet file, you will probably see lots of funny special characters you can’t find on your keyboard. It may cause your computer to beep and chirp, making you wish you had that private office down the hall. It looks nothing like the nice neat rows and columns you see when you use your spreadsheet software to view the same file.
The first step toward reading raw data files is telling SAS where to find the raw data. Your raw data may be either internal to your SAS program (also called instream), or in a separate file. Either way, you must tell SAS where to find your data.
Internal raw data If you type raw data directly into your SAS program, then the data are internal to your program. You may want to do this when you have small amounts of data, or when you are testing a program with a small test data set. Use the DATALINES statement to indicate internal data. The DATALINES statement must be the last statement in the DATA step. All lines in the SAS program following the DATALINES statement are considered data until SAS encounters a semicolon. The semicolon can be on a line by itself or at the end of a SAS statement that follows the data lines. Any statements following the data are part of a new step. The CARDS statement is synonymous with the DATALINES statement.
The following SAS program illustrates the use of the DATALINES statement. (The DATA statement simply tells SAS to create a SAS data set named USPRESIDENTS, and the INPUT statement tells SAS how to read the data. The INPUT statement is discussed in Sections 2.8 through 2.18.)
* Read internal data into SAS data set uspresidents;
DATA uspresidents;
INPUT President $ Party $ Number;
DATALINES;
Adams F 2
Lincoln R 16
Grant R 18
Kennedy D 35
;
RUN;
If you run this program, the following note appears in the SAS log telling you that the data set USPRESIDENTS in the WORK library contains four observations and three variables.
NOTE: The data set WORK.USPRESIDENTS has 4 observations and 3 variables.
External raw data files Usually you will want to keep data in external files, separating the data from the program. This eliminates the chance that you might accidentally alter the data when you are editing your SAS program. Use the INFILE statement to tell SAS the filename (and path, if appropriate) of the file containing the data. The INFILE statement follows the DATA statement and must precede the INPUT statement. After the INFILE keyword, enclose the file path and name in quotation marks. Here is the general form for different operating environments. (If you are using SAS University Edition or SAS OnDemand for Academics, then use the Linux style of statement.)
Windows:         INFILE 'c:\MyDir\President.dat';

UNIX or Linux:   INFILE '/home/mydir/president.dat';

z/OS:            INFILE 'MYID.PRESIDEN.DAT';
Suppose the following data are in a file called President.dat in the directory MyRawData on the C drive (Windows):
Adams F 2
Lincoln R 16
Grant R 18
Kennedy D 35
You could use this program with an INFILE statement to read the external data file:
* Read data from external file into SAS data set;
DATA uspresidents;
   INFILE 'c:\MyRawData\President.dat';
   INPUT President $ Party $ Number;
RUN;
Whenever you read data from an external file, SAS lists some valuable information about the file in the SAS log. Always check this information as it could indicate problems. The following is an excerpt from the SAS log after reading the external file. A simple comparison of the number of records read by the INFILE statement (four in this case) with the number of observations (also four) in the SAS data set can tell you a lot about whether SAS read your data correctly.
NOTE: The infile 'c:\MyRawData\President.dat' is:
      Filename=c:\MyRawData\President.dat,
      RECFM=V,LRECL=32767, File Size (bytes)=49
NOTE: 4 records were read from the infile 'c:\MyRawData\President.dat'.
      The minimum record length was 9.
      The maximum record length was 12.
NOTE: The data set WORK.USPRESIDENTS has 4 observations and 3 variables.
2.8 Reading Raw Data Separated by Spaces
If the values in your raw data file are all separated by at least one space, then using list input (also called free formatted or space-delimited input) to read the data may be appropriate. List input is an easy way to read raw data into SAS, but with ease come a few limitations. By default, you must read all the data in a record—no skipping over unwanted values. Any missing data must be indicated with a period. Character data, if present, must be simple—no embedded spaces, and no values greater than 8 characters in length. If the data file contains dates or other values that need special treatment, then list input would not be appropriate. This may sound like a lot of restrictions, but a surprising number of data files can be read using list input.
The INPUT statement, which is part of the DATA step, tells SAS how to read your raw data. To write an INPUT statement using list input, simply list the variable names after the keyword INPUT in the order they appear in the data file. Generally, variable names must be 32 characters or fewer, start with a letter or an underscore, and contain only letters, underscores, or numerals. If the values are character (not numeric), then place a dollar sign ($) after the variable name. Leave at least one space between names, and remember to place a semicolon at the end of the statement. The following is an example of a simple list-style INPUT statement:
INPUT Name $ Age Height;
This statement tells SAS to read three data values. The $ after Name indicates that it is a character variable, whereas the variables Age and Height are both numeric.
If your data file does not quite satisfy the rules for list input, you will be glad to know that there are various ways to work around the limitations. Many of those are discussed later in this chapter. One of the most common problems is character variables with lengths longer than 8. You can tell SAS to use a longer length with a LENGTH statement. This statement, for example, tells SAS to create a variable called Name that is character and has a length of 12:
LENGTH Name $ 12;
In SAS programs, the attributes of a variable are set when SAS first encounters that variable. So if a LENGTH statement comes before an INPUT statement, then SAS will use that length.
Example Your hometown has been overrun with toads this year. A local resident, having heard of frog jumping in California, had the idea of organizing a toad jump to cap off the annual town fair. For each contestant you have the toad’s name, weight, and the jump distance from three separate attempts. If the toad is disqualified for any jump, then a period is used to indicate missing data. Here is what the data file ToadJump.dat looks like:
Lucky 2.3 1.9 . 3.0
Spot 4.6 2.5 3.1 .5
Toadzilla 7.1 . . 3.8
Hop 4.5 3.2 1.9 2.6
Noisy 3.8 1.3 1.8 1.5
Winner 5.7 . . .
This data file does not look very neat, but it does meet most of the requirements for list input: no embedded spaces, all values are separated by at least one space, and missing data are indicated by a period. However, there is one problem. The name of one toad, Toadzilla, is longer than 8 characters. Adding a LENGTH statement to the program will allow SAS to read the toads’ names with list input.
Here is the SAS program that will read the data:
* Create a SAS data set named toads;
* Read the data file ToadJump.dat using list input;
DATA toads;
   LENGTH ToadName $ 9;
   INFILE 'c:\MyRawData\ToadJump.dat';
   INPUT ToadName Weight Jump1 Jump2 Jump3;
RUN;
The LENGTH statement creates a variable named ToadName that is character (indicated by the dollar sign) and has a length of 9. Then the INPUT statement reads the variables ToadName, Weight, Jump1, Jump2, and Jump3. Because SAS already knows that ToadName is character, you don’t need to include a dollar sign in the INPUT statement. All the other variables are numeric.
It is always important to check data sets you create to make sure they are correct. Here is the TOADS data set. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment.
     ToadName    Weight  Jump1  Jump2  Jump3

1    Lucky          2.3    1.9    .      3.0
2    Spot           4.6    2.5    3.1    0.5
3    Toadzilla      7.1    .      .      3.8
4    Hop            4.5    3.2    1.9    2.6
5    Noisy          3.8    1.3    1.8    1.5
6    Winner         5.7    .      .      .
These notes appear in the log:
NOTE: 6 records were read from the infile 'c:\MyRawData\ToadJump.dat'.
      The minimum record length was 16.
      The maximum record length was 21.
NOTE: The data set WORK.TOADS has 6 observations and 5 variables.
You can use list input to read files with other delimiters too, such as commas or tabs. See Section 2.18.
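As a preview of Section 2.18, pointing list input at a comma-delimited file usually takes nothing more than a DLM= option on the INFILE statement. This sketch assumes a hypothetical file ToadJumpCSV.dat that holds the same toad data with commas instead of spaces:

```sas
* Hedged sketch: read a comma-delimited version of the toad data;
* ToadJumpCSV.dat is a hypothetical file; DLM= is covered in Section 2.18;
DATA toadscsv;
   LENGTH ToadName $ 9;
   INFILE 'c:\MyRawData\ToadJumpCSV.dat' DLM=',';
   INPUT ToadName Weight Jump1 Jump2 Jump3;
RUN;
```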
2.9 Reading Raw Data Arranged in Columns
Some raw data files do not have spaces (or other delimiters) between all the values or periods for missing data—so the files can’t be read using list input. But if each of the variable’s values is always found in the same place in the line of data, then you can use column input as long as all the values are character or standard numeric. Standard numeric data contain only numerals, decimal points, plus and minus signs, and E for scientific notation. Numbers with embedded commas or dates, for example, are not standard.
Column input has the following advantages over list input:
♦ spaces are not required between values
♦ missing values can be left blank
♦ character data can have embedded spaces
♦ you can skip unwanted variables
Data files with street addresses, which often have embedded blanks, are good candidates for column input. The street Martin Luther King Jr. Boulevard should be read as one variable, not five, as it would be with list input. Data that can be read with column input can often also be read with formatted input (discussed in Sections 2.10 and 2.11) or a combination of input styles (Section 2.12).
To write an INPUT statement using column input, after the keyword INPUT, list the first variable’s name. If the variable is character, leave a space; then place a $. After the $, or variable name if it is numeric, leave a space; then list the column or range of columns for that variable. The columns are positions of the characters or numbers in the data line and are not to be confused with columns like those you see in a spreadsheet. Repeat this for all the variables you want to read. Here is a simple column-style INPUT statement:
INPUT Name $ 1-10 Age 11-13 Height 14-18;
The first variable, Name, is character, and the data values are in columns 1 through 10. The second variable, Age, is numeric since it is not followed by a $ and is in columns 11 through 13. The third variable, Height, is also numeric and is in columns 14 through 18.
Example The local minor league baseball team, the Walla Walla Sweets, is keeping records about concession sales. A ballpark favorite is the sweet onion rings, which are sold at the concession stands and also by vendors in the bleachers. The ballpark owners have a feeling that in games with lots of hits and runs, more onion rings are sold in the bleachers than at the concession stands. They think they should send more vendors out into the bleachers when the game heats up, but they need more evidence to back up their feelings.
For each home game they have the following information: name of opposing team, number of onion ring sales at the concession stands and in the bleachers, the number of hits for each team, and the final score for each team. The following is a sample of the data file named OnionRing.dat. For reference, a ruler showing the column numbers has been placed above the data:
----+----1----+----2----+----3----+----4
Columbia Peaches      35  67  1 10  2  1
Plains Peanuts       210      2  5  0  2
Gilroy Garlics        151035 12 11  7  6
Sacramento Tomatoes  124  85 15  4  9  1
Notice that the data file has the following characteristics, all making it a prime candidate for column input: all the values line up in columns, the team names have embedded blanks, missing values are left blank, and in one case there is no space between data values. (Those Gilroy Garlics fans must really love onion rings.)
The following program shows how to read these data using column input:
* Create a SAS data set named sales;
* Read the data file OnionRing.dat using column input;
DATA sales;
   INFILE 'c:\MyRawData\OnionRing.dat';
   INPUT VisitingTeam $ 1-20 CSales 21-24 BSales 25-28
         OurHits 29-31 TheirHits 32-34 OurRuns 35-37 TheirRuns 38-40;
RUN;
The variable VisitingTeam is character (indicated by a $) and reads the visiting team’s name in columns 1 through 20. The variables CSales and BSales read the concession and bleacher sales in columns 21 through 24 and 25 through 28, respectively. The number of hits for the home team, OurHits, and the visiting team, TheirHits, are in columns 29 through 31 and 32 through 34, respectively. The number of runs for the home team, OurRuns, is in columns 35 through 37, while the number of runs for the visiting team, TheirRuns, is in columns 38 through 40.
Here is the SALES data set. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment.
     VisitingTeam         CSales  BSales  OurHits  TheirHits  OurRuns  TheirRuns

1    Columbia Peaches         35      67        1         10        2          1
2    Plains Peanuts          210       .        2          5        0          2
3    Gilroy Garlics           15    1035       12         11        7          6
4    Sacramento Tomatoes     124      85       15          4        9          1
These notes appear in the log:
NOTE: 4 records were read from the infile 'c:\MyRawData\OnionRing.dat'.
      The minimum record length was 40.
      The maximum record length was 40.
NOTE: The data set WORK.SALES has 4 observations and 7 variables.
2.10 Reading Raw Data Not in Standard Format
Sometimes raw data are not straightforward numeric or character. For example, we humans easily read the number 1,000,001 as one million and one, but your trusty computer sees it as a character string. While the embedded commas make the number easier for us to interpret, they make the number impossible for the computer to recognize without some instructions. In SAS, informats are used to tell the computer how to interpret these types of data.
Informats are useful anytime you have nonstandard data. (Standard numeric data contain only numerals, decimal points, plus and minus signs, and E for scientific notation.) Numbers with embedded commas or dollar signs are nonstandard data. Other examples include data in hexadecimal or packed decimal formats. SAS has informats for reading these types of data as well.
Dates are perhaps the most common nonstandard data. Using date informats, SAS will convert conventional forms of dates like 10-31-2023 or 31OCT23 into a number, the number of days since January 1, 1960. This number is referred to as a SAS date value. (Why January 1, 1960? Who knows? Maybe 1960 was a good year for the SAS founders.) This turns out to be extremely useful when you want to do calculations with dates. For example, you can easily find the number of days between two dates by subtracting one from the other. See Section 3.13 for more about dates.
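Because SAS date values are simply counts of days, subtracting one date constant from another gives the number of days between them. A minimal sketch, writing the result to the SAS log:

```sas
* Subtract two SAS date values to get the days between dates;
DATA _NULL_;
   Days = '31OCT2023'd - '01JAN2023'd;   * date constants become SAS date values;
   PUT Days=;                            * writes Days=303 to the log;
RUN;
```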
There are three general types of informats: character, numeric, and date. A table of selected SAS informats appears in the next section. The three types of informats have these general forms:
Character       Numeric        Date
$informatw.     informatw.d    informatw.
The $ indicates character informats, informat is the name of the informat, w is the total width, and d is the number of decimal places (numeric informats only). The period is an important part of the informat name. Without a period, SAS may try to interpret the informat as a variable name, which, by default, cannot contain any special characters except the underscore. Two informats do not have names: $w., which reads standard character data, and w.d, which reads standard numeric data.
Use informats by placing the informat after the variable name in the INPUT statement; this is called formatted input. Here is an INPUT statement using formatted input:
INPUT Name $10. Age 3. Height 5.1 BirthDate MMDDYY10.;
The columns read for each variable are determined by the starting point and the width of the informat. SAS always starts with the first column; so the data values for the first variable, Name, which has an informat of $10., are in columns 1 through 10. Now the starting point for the second variable is column 11, and SAS reads values for Age in columns 11 through 13. The values for the third variable, Height, are in columns 14 through 18. Those five columns include the decimal place and the decimal point itself (150.3 for example). The values for the last variable, BirthDate, are in columns 19 through 28.
Example This example illustrates the use of informats for reading data. The data contain the results from a pumpkin-carving contest. Each line includes the contestant’s name, age, type (carved or decorated), the date the pumpkin was entered, and the scores from each of three judges.
Alicia Grossman  13 c 10-28-2020 7.8 6.5 7.2
Matthew Lee       9 D 10-30-2020 6.5 5.9 6.8
Elizabeth Garcia 10 C 10-29-2020 8.9 7.9 8.5
Lori Newcombe     6 D 10-30-2020 6.7     4.9
Jose Martinez     7 d 10-31-2020 8.9 9.510.0
Brian Williams   11 C 10-29-2020 7.8 8.4 8.5
The following program reads the data from a file named Pumpkin.dat, and uses a LIBNAME statement to create a permanent SAS data set named CONTEST. Please note there are many ways to input these data, so if you imagined something else, that’s OK.
LIBNAME pump 'c:\MySASLib';
* Create a permanent SAS data set named contest;
* Read the file Pumpkin.dat using formatted input;
DATA pump.contest;
   INFILE 'c:\MyRawData\Pumpkin.dat';
   INPUT Name $16. Age 3. +1 Type $1. +1 Date MMDDYY10.
         (Score1 Score2 Score3) (4.1);
RUN;
The variable Name has an informat of $16., meaning that it is a character variable 16 columns wide. Variable Age has an informat of 3., is numeric, three columns wide, and has no decimal places. The +1 skips over one column. Variable Type is character, and it is one column wide. Variable Date has an informat MMDDYY10. and reads dates in the form 10-31-2020 or 10/31/2020, each 10 columns wide. The remaining variables, Score1 through Score3, all require the same informat, 4.1. Putting the variables and the informat in separate sets of parentheses saves typing.
Here is the CONTEST data set. You can view the data in the Output Data tab in SAS Enterprise Guide or SAS Studio, or in the Viewtable window in the SAS windowing environment.
     Name              Age  Type  Date   Score1  Score2  Score3

1    Alicia Grossman    13  c     22216     7.8     6.5     7.2
2    Matthew Lee         9  D     22218     6.5     5.9     6.8
3    Elizabeth Garcia   10  C     22217     8.9     7.9     8.5
4    Lori Newcombe       6  D     22218     6.7     .       4.9
5    Jose Martinez       7  d     22219     8.9     9.5    10.0
6    Brian Williams     11  C     22217     7.8     8.4     8.5
Notice that these dates are printed as the number of days since January 1, 1960. Section 3.13 discusses how to format these into readable dates.
These notes appear in the log:
NOTE: 6 records were read from the infile 'c:\MyRawData\Pumpkin.dat'.
      The minimum record length was 44.
      The maximum record length was 44.
NOTE: The data set PUMP.CONTEST has 6 observations and 7 variables.
2.11 Selected Informats
Informat      Definition                                                               Width range   Default width

Character
$CHARw.       Reads character data—does not trim leading or trailing blanks            1–32,767      8 or length of variable
$UPCASEw.     Converts character data to uppercase                                     1–32,767      8
$w.           Reads character data—trims leading blanks                                1–32,767      none

Date, Time, and Datetime¹
ANYDTDTEw.    Reads dates in various date forms                                        5–32          9
DATEw.        Reads dates in form: ddmmmyy or ddmmmyyyy                                7–32          7
DATETIMEw.    Reads datetime values in the form: ddmmmyy hh:mm:ss.ss                   13–40         18
DDMMYYw.      Reads dates in form: ddmmyy or ddmmyyyy                                  6–32          6
JULIANw.      Reads Julian dates in form: yyddd or yyyyddd                             5–32          5
MMDDYYw.      Reads dates in form: mmddyy or mmddyyyy                                  6–32          6
STIMERw.      Reads time in form: hh:mm:ss.ss (or mm:ss.ss, or ss.ss)                  1–32          10
TIMEw.d       Reads time in form: hh:mm:ss.ss (or hh:mm) using a 24-hour clock         5–32          8

Numeric
COMMAw.d      Removes embedded commas and $, converts left parentheses to minus sign   1–32          1
COMMAXw.d     Like COMMAw.d but reverses role of comma and period                      1–32          1
PERCENTw.     Converts percentages to proportions                                      1–32          6
w.d           Reads standard numeric data                                              1–32          none

¹ SAS date values are the number of days since January 1, 1960. Time values are the number of seconds past midnight, and datetime values are the number of seconds past midnight January 1, 1960.
Informat      Input data              INPUT statement             Results

Character
$CHARw.       my cat                  INPUT Animal $CHAR10.;      my cat
                my cat                                              my cat
$UPCASEw.     my cat                  INPUT Name $UPCASE10.;      MY CAT
$w.           my cat                  INPUT Animal $10.;          my cat
                my cat                                            my cat

Date, Time, and Datetime
ANYDTDTEw.    1jan1961                INPUT Day ANYDTDTE10.;      366
              01/01/61                                            366
DATEw.        1jan1961                INPUT Day DATE10.;          366
              1 jan 61                                            366
DATETIMEw.    1jan1960 10:30:15       INPUT Dt DATETIME18.;       37815
              1jan1961,10:30:15                                   31660215
DDMMYYw.      01.01.61                INPUT Day DDMMYY8.;         366
              02/01/61                                            367
JULIANw.      61001                   INPUT Day JULIAN7.;         366
              1961001                                             366
MMDDYYw.      01-01-61                INPUT Day MMDDYY8.;         366
              01/01/61                                            366
STIMERw.      10:30                   INPUT Time STIMER8.;        630
              10:30:15                                            37815
TIMEw.d       10:30                   INPUT Time TIME8.;          37800
              10:30:15                                            37815

Numeric
COMMAw.d      $1,000,001              INPUT Income COMMA10.;      1000001
              (1,234)                                             -1234
COMMAXw.d     $1.000.001              INPUT Value COMMAX10.;      1000001
              (1.234,25)                                          -1234.25
PERCENTw.     5%                      INPUT Value PERCENT5.;      0.05
              (20%)                                               -0.2
w.d           1234                    INPUT Value 5.1;            123.4
              -12.3                                               -12.3
2.12 Mixing Input Styles
Each of the three major input styles has its own advantages. List style is the easiest; column style is a bit more work; and formatted style is the hardest of the three. However, column and formatted styles do not require spaces (or other delimiters) between variables and can read embedded blanks. Formatted style can read special data such as dates. Sometimes you use one style, sometimes another, and sometimes the easiest way is to use a combination of styles. SAS is so flexible that you can mix and match any of the input styles for your own convenience.
Example The following raw data contain information about U.S. national parks: name, state (or states as the case may be), year established, and size in acres:
Yellowstone           ID/MT/WY 1872    4,065,493
Everglades            FL 1934          1,398,800
Yosemite              CA 1864            760,917
Great Smoky Mountains NC/TN 1926         520,269
Wolf Trap Farm        VA 1966                130
You could write the INPUT statement for these data in many ways, which is the point of this section. This program shows one way to do it:
* Create a SAS data set named nationalparks;
* Read a data file NatPark.dat mixing input styles;
DATA nationalparks;
   INFILE 'c:\MyRawData\NatPark.dat';
   INPUT ParkName $ 1-22 State $ Year @40 Acreage COMMA9.;
RUN;
Notice that the variable ParkName is read with column style input, State and Year are read with list input, and Acreage is read with formatted input.
Here is the NATIONALPARKS data set:
     ParkName               State     Year  Acreage

1    Yellowstone            ID/MT/WY  1872  4065493
2    Everglades             FL        1934  1398800
3    Yosemite               CA        1864   760917
4    Great Smoky Mountains  NC/TN     1926   520269
5    Wolf Trap Farm         VA        1966      130
Sometimes programmers run into problems when they mix input styles. When SAS reads a line of raw data, it uses a pointer to mark its place, but each style of input uses the pointer a little differently. With list input, SAS automatically scans to the next non-blank field and starts reading. With column style input, SAS starts reading in the exact column you specify. But with formatted input, SAS just starts reading—wherever the pointer is, that is where SAS reads.
Sometimes you need to move the pointer explicitly, and you can do that by using the column pointer, @n, where n is the number of the column SAS should move to.
In the preceding program, the column pointer @40 tells SAS to move to column 40 before reading the value for Acreage. If you removed the column pointer from the INPUT statement, as shown in the following statement, then SAS would start reading Acreage right after Year:
INPUT ParkName $ 1-22 State $ Year Acreage COMMA9.;
Here is the data set with truncated values of the variable Acreage:
     ParkName               State     Year  Acreage

1    Yellowstone            ID/MT/WY  1872     4065
2    Everglades             FL        1934        .
3    Yosemite               CA        1864        .
4    Great Smoky Mountains  NC/TN     1926        5
5    Wolf Trap Farm         VA        1966        .
Because Acreage was read with formatted input, SAS started reading right where the pointer was. Here is the data file with a column ruler for counting columns at the top and asterisks marking the place where SAS started reading the values of Acreage:
----+----1----+----2----+----3----+----4----+----5
Yellowstone           ID/MT/WY 1872 *  4,065,493
Everglades            FL 1934 *        1,398,800
Yosemite              CA 1864 *          760,917
Great Smoky Mountains NC/TN 1926 *       520,269
Wolf Trap Farm        VA 1966 *              130
The COMMA9. informat told SAS to read nine columns, and SAS did that even when those columns were completely blank.
The column pointer, @n, has other uses, too, and can be used anytime you want SAS to skip backward or forward within a line of data. You could use it, for example, to skip over unneeded data, or to read a variable twice using different informats.
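For instance, one way to read the same value twice with different informats is to return the pointer to column 1 with @1. This is an illustrative sketch (Dates.dat and its variables are hypothetical), assuming each line starts with a date such as 10-31-2020 in columns 1 through 10:

```sas
* Read the same ten columns twice: once as a SAS date, once as plain text;
DATA dates;
   INFILE 'c:\MyRawData\Dates.dat';           * hypothetical raw data file;
   INPUT EntryDate MMDDYY10.                  * read columns 1-10 as a SAS date;
         @1 DateText $CHAR10.;                * back to column 1, reread as text;
RUN;
```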
2.13 Reading Messy Raw Data
Sometimes you need to read data that just don’t line up in nice columns or have predictable lengths. When you have these types of messy files, ordinary list, column, or formatted input simply aren’t enough. You need more tools in your bag: tools like the @'character' column pointer, and the colon and ampersand modifiers.
The @'character' column pointer In the preceding section we showed how you can use the @ column pointer to move to a particular column before reading data. However, sometimes you don’t know the starting column of the data, but you do know that it always comes after a particular character or word. For these types of situations, you can use the @'character' column pointer. For example, suppose you have a data file that has information about dog ownership. Nothing in the file lines up, but you know that the breed of the dog always follows the word Breed. You could read the dog’s breed using the following INPUT statement:
INPUT @'Breed' DogBreed $;
The colon modifier The above INPUT statement will work just fine as long as the dog’s breed is 8 characters or less (the default length for character variables) and contains no spaces. So if the dog is a Poodle you’re fine, but if the dog is a Rottweiler or Shih Tzu, it won’t work. If you assign the variable an informat such as $20. in the INPUT statement, then SAS will read for 20 columns whether or not there is a space in those columns. So the DogBreed variable may include unwanted characters, which appear after the dog’s breed on the data line. If you want SAS to read only until it encounters a space or the end of the data line, you can use a colon modifier on the informat. (This works with other delimiters besides spaces. See Section 2.18.) To use a colon modifier, simply put a colon (:) before the informat (such as :$20. instead of $20.). For example, given this line of raw data:
My dog Sam Breed Rottweiler Vet Bills $478
the following table shows the results you would get using different INPUT statements:
Statement                              Value of variable DogBreed

INPUT @'Breed ' DogBreed $;            Rottweil
INPUT @'Breed ' DogBreed $20.;         Rottweiler Vet Bill
INPUT @'Breed ' DogBreed :$20.;        Rottweiler
The ampersand modifier The colon modifier tells SAS to read a data value until it reaches a space. If you have embedded spaces in your data values, you may still be able to read your data. The ampersand modifier (&) tells SAS to read a data value until it reaches two or more spaces in a row. Insert the ampersand after the name of the variable with the embedded spaces.
INPUT @'Breed' DogBreed & $;
Example Each year engineering students from around the USA and Canada build concrete canoes and hold regional and national competitions. Part of the competition involves racing the canoes. The following data contain the final results of a men’s sprint competition. The data lines start with the name of the canoe, followed by the school, and the time.
Bellatorum  School CSULA Time 1:40.5
The Kraken  School ASU Time 1:45.35
Black Widow  School UoA Time 1:33.7
Koicrete  School CSUF Time 1:40.25
Khaos  School UNLV Time 2:03.45
Max  School UCSD Time 1:26.47
Hakuna Matata  School UCLA Time 1:20.64
Prospector  School CPSLO Time 1:12.08
Andromeda  School CPP Time 1:25.1
Kekoapohaku  School UHM Time 1:24.49
You can see that some canoe names contain embedded spaces, and there are two spaces after each canoe name. Because the canoe names have different lengths, the school names do not line up in the same column. Also, the time values are sometimes six characters and sometimes seven. This SAS program reads the canoe name, school, and time:
DATA canoeresults;
   INFILE 'c:\MyRawData\Canoes.dat';
   INPUT CanoeName & $13. @'School' School $ @'Time' RaceTime :STIMER8.;
RUN;
This INPUT statement uses both an ampersand and an informat ($13.) to read the values of CanoeName. The result is that SAS will read CanoeName until it encounters two or more spaces and then stop even if the length of that CanoeName is less than the specified informat. Then the INPUT statement uses @'School' and @'Time' to position the column pointer to read the school name and time. The school names are all less than the default length of 8 characters so no informat is needed to read them, but times are non-standard data and require an informat. Because the time is not always the same number of characters, a colon modifier is added to the informat (:STIMER8.). Without the colon modifier, SAS would go to a new data line to try to read the time values when it ran out of characters on a line of data.
Here is the CANOERESULTS data set:
      CanoeName      School  RaceTime

 1    Bellatorum     CSULA     100.50
 2    The Kraken     ASU       105.35
 3    Black Widow    UoA        93.70
 4    Koicrete       CSUF      100.25
 5    Khaos          UNLV      123.45
 6    Max            UCSD       86.47
 7    Hakuna Matata  UCLA       80.64
 8    Prospector     CPSLO      72.08
 9    Andromeda      CPP        85.10
10    Kekoapohaku    UHM        84.49
Note that the times in this output are printed in seconds. See Section 3.14 for how to format these values into minutes and seconds.
2.14 Reading Multiple Lines of Raw Data per Observation
In a typical raw data file each line of data represents one observation, but sometimes the data for each observation are spread over more than one line. Since SAS will automatically go to the next line if it runs out of data before it has read all the variables in an INPUT statement, you could just let SAS take care of figuring out when to go to a new line. But if you know that your data file has multiple lines of raw data per observation, it is better for you to explicitly tell SAS when to go to the next line than to make SAS figure it out. That way you won’t get a suspicious SAS-went-to-a-new-line note in your log (Section 11.4). To tell SAS when to skip to a new line, you simply add line pointers to your INPUT statement.
The line pointers, slash (/) and pound-n (#n), are like road signs telling SAS, “Go this way.” To read more than one line of raw data for a single observation, you simply insert a slash into your INPUT statement when you want to skip to the next line of raw data. The #n line pointer performs the same action except that you specify the line number. The n in #n stands for the number of the line of raw data for that observation; so #2 means to go to the second line for that observation, and #4 means go to the fourth line. You can even go backward using the #n line pointer, reading from line 4 and then from line 3, for example. The slash is simpler, but #n is more flexible.
Example A colleague is planning his next summer vacation, and he wants to go someplace where the weather is just right. He obtains data from a meteorological database. Unfortunately, he has not quite figured out how to export from this database and makes a rather odd file.
The file contains information about temperatures for the month of July for Alaska, Florida, and North Carolina. (If your colleague chooses the last state, maybe he can visit SAS headquarters.) The first line contains the city and state, the second line lists the normal high temperature and normal low (in degrees Fahrenheit), and the third line contains the record high and low:
Nome AK
55 44
88 29
Miami FL
90 75
97 65
Raleigh NC
88 68
105 50
The following program reads the weather data from a file named Temperature.dat:
* Create a SAS data set named highlow;
* Read the data file using line pointers;
DATA highlow;
   INFILE 'c:\MyRawData\Temperature.dat';
   INPUT City $ State $
         / NormalHigh NormalLow
         #3 RecordHigh RecordLow;
RUN;
The INPUT statement reads the values for City and State from the first line of data. Then the slash tells SAS to move to column 1 of the next line of data before reading NormalHigh and NormalLow. Likewise, the #3 tells SAS to move to column 1 of the third line of data for that observation before reading RecordHigh and RecordLow. As usual, there is more than one way to write this INPUT statement. You could replace the slash with #2 or replace #3 with a slash.
Here is the HIGHLOW data set:
     City     State  NormalHigh  NormalLow  RecordHigh  RecordLow

1    Nome     AK             55         44          88         29
2    Miami    FL             90         75          97         65
3    Raleigh  NC             88         68         105         50
These notes appear in the log:
NOTE: 9 records were read from the infile 'c:\MyRawData\Temperature.dat'.
      The minimum record length was 5.
      The maximum record length was 10.
NOTE: The data set WORK.HIGHLOW has 3 observations and 6 variables.
Notice that while nine records were read from the input file, the SAS data set contains just three observations. Usually, such a discrepancy would set off alarms in your mind, but here it confirms that three data lines were read for each observation, just as planned. You should always check your log, particularly when using line pointers.
2.15 Reading Multiple Observations per Line of Raw Data
There ought to be a Murphy’s law of data: whatever form data can take, it will. Normally SAS assumes that each line of raw data represents no more than one observation. When you have multiple observations per line of raw data, you can use double trailing at signs (@@) at the end of your INPUT statement. This line-hold specifier is like a stop sign telling SAS, “Stop, hold that line of raw data.” SAS will hold that line of data, continuing to read observations until it either runs out of data or reaches an INPUT statement that does not end with a double trailing @.
Example Suppose you have a colleague who is planning a vacation and has obtained a file containing data about rainfall for the three cities he is considering. The file contains the name of each city, the state, average rainfall (in inches) for the month of July, and average number of days with measurable precipitation in July. The raw data look like this:
Nome AK 2.5 15 Miami FL 6.75
18 Raleigh NC . 12
Notice that in this data file the first line stops in the middle of the second observation. The following program reads these data from a file named Precipitation.dat, and uses an @@ so SAS does not automatically go to a new line of raw data for each observation:
* Input more than one observation from each record;
DATA rainfall;
   INFILE 'c:\MyRawData\Precipitation.dat';
   INPUT City $ State $ NormalRain MeanDaysRain @@;
RUN;
Here is the RAINFALL data set:
        City      State   NormalRain   MeanDaysRain
   1    Nome       AK        2.50           15
   2    Miami      FL        6.75           18
   3    Raleigh    NC         .             12
These notes will appear in the log:
NOTE: 2 records were read from the infile 'c:\MyRawData\Precipitation.dat'.
The minimum record length was 18.
The maximum record length was 28.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.RAINFALL has 3 observations and 4 variables.
While only two records were read from the raw data file, the RAINFALL data set contains three observations. The log also includes a note saying SAS went to a new line when the INPUT statement reached past the end of a line. This means that SAS came to the end of a line in the middle of an observation and continued reading with the next line of raw data. Normally these messages would indicate a problem, but in this case they are exactly what you want.
Section 3.11 shows a different way to produce multiple observations from a single line of raw data, a technique that works when some data values must be repeated.
2.16 Reading Part of a Raw Data File
At some time you may find that you need to read a small fraction of the records in a large data file. For example, you might be reading U.S. census data and want only female heads-of-household who have incomes above $225,000 and live in Walla Walla, Washington. You could read all the records in the data file and then throw out the unneeded ones, but that would waste time.
Luckily, you don’t have to read all the data before you tell SAS whether to keep an observation. Instead, you can read just enough variables to decide whether to keep the current observation, and then end the INPUT statement with an at sign (@), called a trailing at. This tells SAS to hold that line of raw data. While the trailing @ holds that line, you can test the observation with an IF statement to see if it’s one you want to keep. If it is, then you can read data for the remaining variables with a second INPUT statement. Without the trailing @, SAS would automatically start reading the next line of raw data with each INPUT statement.
The trailing @ is similar to the column pointer, @n, introduced in Section 2.12. By specifying a number after the @ sign, you tell SAS to move to a particular column. By using an @ without specifying a column, it is as if you are telling SAS, “Stay tuned for more information. Don’t touch that dial!” SAS will hold that line of data until it reaches either the end of the DATA step, or an INPUT statement that does not end with a trailing @.
Example You want to read part of a raw data file containing local traffic data for freeways and surface streets. The data include information about the type of street, name of street, the average number of vehicles per hour traveling that street during the morning, and the average number of vehicles per hour for the evening. Here are the raw data:
freeway 408 3684 3459
surface Martin Luther King Jr. Blvd. 1590 1234
surface Broadway 1259 1290
surface Rodeo Dr. 1890 2067
freeway 608 4583 3860
freeway 808 2386 2518
surface Lake Shore Dr. 1590 1234
surface Pennsylvania Ave. 1259 1290
Suppose you want to see only the freeway data at this point, so you read the raw data file, Traffic.dat, with this program:
* Use a trailing @, then delete surface streets;
DATA freeways;
   INFILE 'c:\MyRawData\Traffic.dat';
   INPUT Type $ @;
   IF Type = 'surface' THEN DELETE;
   INPUT Name $ 9-38 AMTraffic PMTraffic;
RUN;
Notice that there are two INPUT statements. The first reads the character variable Type and then ends with an @. The trailing @ holds each line of data while the IF statement tests it. The second INPUT statement reads Name (in columns 9 through 38), AMTraffic, and PMTraffic. If an observation has a value of surface for the variable Type, then the second INPUT statement never executes. Instead, SAS returns to the beginning of the DATA step to process the next observation and does not add the unwanted observation to the FREEWAYS data set. (Do not pass go, do not collect $200.)
Here is the FREEWAYS data set:
        Type      Name   AMTraffic   PMTraffic
   1    freeway   408       3684        3459
   2    freeway   608       4583        3860
   3    freeway   808       2386        2518
When you run this program, the log will contain the following two notes, one saying that eight records were read from the input file and another saying that the new data set contains only three observations:
NOTE: 8 records were read from the infile 'c:\MyRawData\Traffic.dat'.
The minimum record length was 47.
The maximum record length was 47.
NOTE: The data set WORK.FREEWAYS has 3 observations and 4 variables.
The other five observations had a value of surface for the variable Type and were deleted by the IF statement.
This example used a DELETE statement to delete observations for surface streets. Instead of specifying which observations to delete, you could specify which observations to keep. To keep only observations for freeways, you would use a subsetting IF statement. See Section 3.8 for more about subsetting IF statements.
Trailing @ versus double trailing @ The double trailing @, discussed in the previous section, is similar to the trailing @. Both are line-hold specifiers; the difference is how long they hold a line of data for input. The trailing @ holds a line of data for subsequent INPUT statements, but releases that line of data when SAS returns to the top of the DATA step to begin building the next observation. The double trailing @ holds a line of data for subsequent INPUT statements even when SAS starts building a new observation. In both cases, the line of data is released if SAS reaches a subsequent INPUT statement that does not contain a line-hold specifier.
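The difference can be sketched in two fragments (not complete programs; the variables are those from the earlier examples):

* trailing @ releases the line when SAS returns to the top of the DATA step;
INPUT Type $ @;
* double trailing @@ holds the line even across iterations of the DATA step;
INPUT City $ State $ NormalRain MeanDaysRain @@;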
2.17 Controlling Input with Options in the INFILE Statement
So far in this chapter, we have seen ways to use the INPUT statement to read many different types of raw data. When reading raw data files, SAS makes certain assumptions. For example, SAS starts reading with the first data line and, if SAS runs out of data on a line, it automatically goes to the next line to read values for the rest of the variables. Most of the time this is OK, but some data files can’t be read using the default assumptions. The options in the INFILE statement change the way SAS reads raw data files. The following options are useful for reading particular types of data files. Place these options after the filename in the INFILE statement.
FIRSTOBS= The FIRSTOBS= option tells SAS at what line to begin reading data. This is useful if you have a data file that contains descriptive text or header information at the beginning, and you want to skip over these lines before reading the data. The following data file, for example, has a description of the data in the first two lines:
Ice-cream sales data for the summer
Flavor Location Boxes sold
Chocolate 213 123
Vanilla 213 512
Chocolate 415 242
The following program uses the FIRSTOBS= option to tell SAS to start reading data on the third line of the file:
DATA icecream;
   INFILE 'c:\MyRawData\IceCreamSales.dat' FIRSTOBS = 3;
   INPUT Flavor $ 1-9 Location BoxesSold;
RUN;
OBS= The OBS= option can be used anytime you want to read only part of your data file. It tells SAS to stop reading when it gets to that line in the raw data file. Note that the value of OBS= does not necessarily equal the number of observations. If, for example, you are reading two raw data lines for each observation, then OBS=100 would read 100 data lines, and the resulting SAS data set would have 50 observations. The OBS= option can be used with the FIRSTOBS= option to read lines from the middle of the file. For example, suppose the ice-cream sales data had a remark at the end of the file that was not part of the data:
Ice-cream sales data for the summer
Flavor Location Boxes sold
Chocolate 213 123
Vanilla 213 512
Chocolate 415 242
Data verified by Blake White
With FIRSTOBS=3 and OBS=5, SAS will start reading this file on the third data line and stop reading after the fifth data line.
DATA icecream;
   INFILE 'c:\MyRawData\IceCreamSales2.dat' FIRSTOBS = 3 OBS = 5;
   INPUT Flavor $ 1-9 Location BoxesSold;
RUN;
MISSOVER By default, if SAS reaches the end of a data line while there are still variables in the INPUT statement without values, SAS goes to the next data line to read more data. The MISSOVER option tells SAS not to go to a new line when it runs out of data; instead, SAS assigns missing values to any remaining variables. The following data file illustrates where this option may be useful. This file contains test scores for a self-paced course. Since not all students complete all the tests, some have more scores than others.
Nguyen 89 76 91 82
Ramos 67 72 80 76 86
Robbins 76 65 79
The following program reads the data for the five test scores, assigning missing values to tests not completed:
DATA class102;
   INFILE 'c:\MyRawData\AllScores.dat' MISSOVER;
   INPUT Name $ Test1 Test2 Test3 Test4 Test5;
RUN;
TRUNCOVER You need the TRUNCOVER option when you are reading data using column or formatted input and some data lines are shorter than others. If a variable’s field extends past the end of the data line, then, by default, SAS will go to the next line to start reading the variable’s value. This option tells SAS to read data for the variable until it reaches the end of the data line, or the last column specified in the format or column range, whichever comes first. The next file contains addresses and must be read using column or formatted input because the street names have embedded blanks. Note that the data lines have different lengths:
John Garcia 114 Maple Ave.
Sylvia Chung 1302 Washington Drive
Martha Newton 45 S.E. 14th St.
This program uses column input to read the address file. Because some of the addresses stop before the end of the variable Street’s field (columns 22 through 37), you need the TRUNCOVER option. Without the TRUNCOVER option, SAS would try to go to the next line to read the data for Street on the first and third records.
DATA homeaddress;
   INFILE 'c:\MyRawData\Address.dat' TRUNCOVER;
   INPUT Name $ 1-15 Number 16-19 Street $ 22-37;
RUN;
TRUNCOVER is similar to MISSOVER. Both will assign missing values to variables if the data line ends before the variable’s field starts. But when the data line ends in the middle of a variable field, TRUNCOVER will take as much as is there, whereas MISSOVER will assign the variable a missing value.
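For example, suppose a hypothetical data line contains only the five characters apple, and you read it with column input:

   INPUT Word $ 1-10;

Because the line ends in the middle of the field for Word, TRUNCOVER assigns Word the value apple, while MISSOVER assigns Word a missing value.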
2.18 Reading Delimited Files with the DATA Step
We suspect that by now you have realized that with SAS there is usually more than one way to accomplish the same result. Section 2.6 showed how to read delimited data files using the IMPORT procedure; now we are going to show how to read delimited files using the DATA step. The INFILE statement has two options that make it easy to read delimited data files: the DLM= option and the DSD option.
The DLM= option If you read your data using list input, SAS expects your file to have spaces between data values. The DELIMITER=, or DLM=, option in the INFILE statement allows you to use other delimiters. To read data with commas as delimiters, add the DLM= option to your INFILE statement like this:
INFILE 'filename' DLM = ',';
You can use the DLM= option to specify tabs as delimiters too. In ASCII, 09 is the hexadecimal equivalent of a tab character, and the notation '09'X means a hexadecimal 09. If your computer uses EBCDIC (IBM mainframes) instead of ASCII, then use DLM='05'X. Here is a DLM= option with the ASCII tab specified as the delimiter:
INFILE 'filename' DLM = '09'X;
The comma and tab characters are common delimiters for data files, but you can read data files with any delimiter character by simply enclosing the delimiter character in quotation marks after the DLM= option (for example, DLM='&'). If your delimiter is a string of characters, then use the DLMSTR= option.
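For example, to read a file whose data values are separated by the two-character string :: (a hypothetical delimiter; replace filename with your file's path), you could write:

INFILE 'filename' DLMSTR = '::';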
By default, SAS interprets two or more delimiters in a row as a single delimiter. If your file has missing values, and two delimiters in a row indicate a missing value, then you will also need the DSD option in the INFILE statement.
The DSD option The DSD (Delimiter-Sensitive Data) option for the INFILE statement does three things for you. First, it ignores delimiters in data values enclosed in quotation marks. Second, it does not read quotation marks as part of the data value. Third, it treats two delimiters in a row as a missing value. The DSD option assumes that the delimiter is a comma. If your delimiter is not a comma, then you can use the DLM= option with the DSD option to specify the delimiter. For example, to read a tab-delimited ASCII file with missing values indicated by two consecutive tab characters use this:
INFILE 'filename' DLM = '09'X DSD;
Comma-separated values files, or CSV files, are a common type of file that can be read with the DSD option. Many programs, such as Microsoft Excel, can save data in CSV format. These files have commas for delimiters and consecutive commas for missing values; if there are commas in any of the data values, then those values are enclosed in quotation marks.
Example The following example illustrates how to read a CSV file using the DSD option. Jerry’s Coffee Shop employs local bands to attract customers. Jerry keeps records of the number of customers for each band, for each night they play in his shop. The band’s name is followed by the number of customers present at 8 p.m., 9 p.m., 10 p.m., and 11 p.m.
Lupine Lights,45,63,70,
Awesome Octaves,17,28,44,12
"Stop, Drop, and Rock-N-Roll",34,62,77,91
The Silveyville Jazz Quartet,38,30,42,43
Catalina Converts,56,,65,34
Notice that one group’s name has embedded commas, and is enclosed in quotation marks. Also, the last group has a missing data point for the 9 p.m. hour as indicated by two consecutive commas. Here is the program that will read this data file:
DATA music;
   INFILE 'c:\MyRawData\Bands.csv' DLM = ',' DSD MISSOVER;
   INPUT BandName :$30. EightPM NinePM TenPM ElevenPM;
RUN;
Notice that for BandName the INPUT statement uses an informat with a colon modifier. The colon modifier tells SAS to read for the length of the informat (30 for BandName), or until it encounters a delimiter, whichever comes first.
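By contrast, without the colon modifier the informat reads a fixed number of columns regardless of delimiters, so a statement like this would read 30 columns, commas and all, into BandName and misread the rest of the line:

INPUT BandName $30. EightPM NinePM TenPM ElevenPM;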
This data file is missing one data value at the end of the first line. By default, SAS would go to the next line of data to find the last data value. The MISSOVER option on the INFILE statement tells SAS not to go to the next line of data when it runs out of values. It is prudent when using the DSD option to add the MISSOVER option if there is any chance that you have missing data values at the end of your data lines.
Here is the MUSIC data set:
        BandName                       EightPM   NinePM   TenPM   ElevenPM
   1    Lupine Lights                     45       63       70        .
   2    Awesome Octaves                   17       28       44       12
   3    Stop, Drop, and Rock-N-Roll       34       62       77       91
   4    The Silveyville Jazz Quartet      38       30       42       43
   5    Catalina Converts                 56        .       65       34
The SAS log contains these notes describing the new data set:
NOTE: 5 records were read from the infile 'c:\MyRawData\Bands.csv'.
The minimum record length was 23.
The maximum record length was 41.
NOTE: The data set WORK.MUSIC has 5 observations and 5 variables.