Overview 558
Introduction 558
Objectives 558
Summary 590
Syntax 591
Sample Programs 593
Quiz 595
Sometimes, you need to combine data from two or more sets into a single observation in a new data set according to the values of a common variable. When the data sources are two or more data sets that have a common structure, you can use a match-merge to combine the data sets. However, in some cases the data sources do not share a common structure. When data sources do not have a common structure, you can use a lookup table to match them. A lookup table is a table that contains key values.
The technique that you use to perform a table lookup is dependent on your data. This chapter focuses on using multidimensional arrays to perform table lookups and transposing SAS data sets in preparation for a match-merge.
When a lookup operation depends on more than one numerical factor, you can use a multidimensional array. You use an ARRAY statement to create an array. The ARRAY statement defines a set of elements that you plan to process as a group.
The keyword _TEMPORARY_ might be used instead of array-elements to avoid creating new data set variables. Only temporary array elements are produced as a result of using _TEMPORARY_. Δ
When you are working with arrays, remember that
the name of the array must be a SAS name that is not the name of a SAS variable in the same DATA step
the variables listed as array elements must all be the same type (either all numeric or all character)
the initial values specified can be numbers or character strings. You must enclose all character strings in quotation marks.
If you use the _TEMPORARY_ keyword in an array statement, remember that temporary data elements behave like DATA step variables with the following exceptions:
They do not have names. Refer to temporary data elements by the array name and dimension.
They do not appear in the output data set.
You cannot use the special subscript astersisk (*) to refer to all the elements.
Temporary data element values are always automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step.
Δ
Suppose you need to determine the wind chill values for the flights represented in the SAS data set Sasuser.Flights. The data set contains three variables: Flight
(the flight number), Temp
(the average outdoor temperature during the flight), and Wspeed
(the average wind speed during the flight).
Wind chill values are derived from the air temperature and wind speed as shown in the following wind chill lookup table. To determine the wind chill for each flight, you can create a multidimensional array that stores the wind chill values shown in the table. You can then match the values of Temp
and Wspeed
with the wind chill values stored in the array.
In the following program, the ARRAY statement creates the two-dimensional array WC
and specifies the dimensions of the array: four rows and two columns. No variables are created from the array because the keyword _TEMPORARY_ is used. The initial values specified correspond to the values in the wind chill lookup table. For this example, only the values in the first two columns and four rows in the wind chill lookup table are included in the array.
data work.wndchill (drop = column row);
array WC{4,2} _temporary_
(-22,-16,-28,-22,-32,-26,-35,-29);
set sasuser.flights;
row = round(wspeed,5)/5;
column = (round(temp,5)/5)+3;
WindChill= wc{row,column};
run;
The value of WindChill
for each flight is determined by referencing the array based on the values of Wspeed
and Temp
in the Sasuser.Flights data set. The row number for the array reference is determined by the value of Wspeed
. The column number for the array reference is determined by the value of Temp
.
The rounding unit for the value of Wspeed
is 5 because the values for wind speed in the wind chill table are rounded to every 5 miles-per-hour. Wspeed
is then divided by 5 to derive the row number for the array reference.
Like the value for Wspeed
, the value of Temp
is rounded to the nearest 5, then divided by 5. The offset of 3 is added to the value because the third column in the wind chill lookup table represents 0 degrees.
data work.wndchill (drop = column row); array WC{4,2} _temporary_ (-22,-16,-28,-22,-32,-26,-35,-29); set sasuser.flights; row = round(wspeed,5)/5; column = (round(temp,5)/5)+3;
WindChill= wc{row,column}; run;
PROC PRINT output shows the completed data set.
In the previous section, the wind chill values were loaded into the WC array when the array was created. In some cases, you might need to store array values in a SAS data set rather than loading them in an ARRAY statement. Array values should be stored in a SAS data set when
there are too many values to initialize easily in the array
the values change frequently
the same values are used in many programs.
Suppose you want to compare the actual cargo revenue values in the SAS data set Sasuser.Monthsum to the target cargo revenue values in the SAS data set Sasuser.Ctargets.
Sasuser.Monthsum contains the actual cargo and passenger revenue figures for each month from 1997 through 1999.
Table 16.1. SAS Data Set Sasuser.Monthsum (first four observations of selected variables)
Obs | SaleMon | RevCargo | MonthNo |
---|---|---|---|
1 | JAN1997 | $171,520,869.10 | 1 |
2 | JAN1998 | $238,786,807.60 | 1 |
3 | JAN1999 | $280,350,393.00 | 1 |
4 | FEB1997 | $177,671,530.40 | 2 |
The SAS data set Sasuser.Ctargets contains the target cargo revenue figures for each month from 1997 through 1999.
Table 16.2. SAS Data Set Sasuser.Ctargets
Obs | Year | Jan | Feb | Mar | Apr | May | Jun |
---|---|---|---|---|---|---|---|
1 | 1997 | 192284420 | 86376721 | 28526103 | 260386468 | 109975326 | 102833104 |
2 | 1998 | 108645734 | 147656369 | 202158055 | 41160707 | 264294440 | 267135485 |
3 | 1999 | 85730444 | 74168740 | 39955768 | 312654811 | 318149340 | 187270927 |
Obs | Jul | Aug | Sep | Oct | Nov | Dec | |
---|---|---|---|---|---|---|---|
1 | 196728648 | 236996122 | 112413744 | 125401565 | 72551855 | 136042505 | |
2 | 208694865 | 83456868 | 286846554 | 275721406 | 230488351 | 24901752 | |
3 | 123394421 | 34273985 | 151565752 | 141528519 | 178043261 | 181668256 |
You want to create a new SAS data set, Work.Lookup1, that lists the actual and target values for each month. Work.Lookup1 should have the same structure as Sasuser.Monthsum: an observation for each month and year, as well as a new variable, Ctarget
(target cargo revenues). The value of Ctarget
is derived from the target values in Sasuser.Ctargets.
Table 16.3. SAS Data Set Work.Lookup1 (first five observations of selected variables)
Obs | SaleMon | RevCargo | Ctarget |
---|---|---|---|
1 | JAN1997 | $171,520,869.10 | $192,284,420.00 |
2 | JAN1998 | $238,786,807.60 | $108,645,734.00 |
3 | JAN1999 | $280,350,393.00 | $85,730,444.00 |
4 | FEB1997 | $177,671,530.40 | $86,376,721.00 |
5 | FEB1998 | $215,959,695.50 | $147,656,369.00 |
Sasuser.Monthsum and Sasuser.Ctargets cannot be merged because they have different structures:
Sasuser.Monthsum has an observation for each month and year.
Sasuser.Ctargets has one column for each month and one observation for each year.
However, the data sets have two common factors: month and year. You can use a multidimensional array to match the actual values for each month and year in Sasuser.Monthsum with the target values for each month and year in Sasuser.Ctargets.
The first step is to create an array to hold the values in the target data set, Sasuser.Ctargets. The array needs two dimensions: one for the year values and one for the month values. In the following program, the first ARRAY statement creates the two-dimensional array, Targets
.
Remember that the index of an array does not have to range from one to the number of elements. You can specify a range for the values for the index when you define the array. In this case, the dimensions of the array are specified as three rows (one for each year: 1997, 1998, and 1999) and 12 columns (one for each month).
data work.lookup1;
array Targets{1997:1999,12} _temporary_;
if _n_=1 then do i= 1 to 3;
set sasuser.ctargets;
array mnth{*} Jan--Dec;
do j=1 to dim(mnth);
targets{year,j}=mnth{j};
end;
end;
set sasuser.monthsum(keep=salemon revcargo monthno);
year=input(substr(salemon,4),4.);
Ctarget=targets{year,monthno};
format ctarget dollar15.2;
run;
The following table represents the Targets
array. Notice that the array is not populated. The next step is to load the array elements from Sasuser.Ctargets.
The row dimension for the Targets
array could have been specified using the value 3. For example,
array Targets{3,12} _temporary_;
Δ
However, using the notation 1997:1999
simplifies the program by eliminating the need to map numeric values to the year values.
The Targets
array needs to be loaded with the values in Sasuser.Ctargets. One method for accomplishing this task is to load the array within a DO loop.
Table 16.5. SAS Data Set Sasuser.Ctargets
Year | Jan | Feb | Mar | Apr | May | Jun |
---|---|---|---|---|---|---|
1997 | 192284420 | 86376721 | 28526103 | 260386468 | 109975326 | 102833104 |
1998 | 108645734 | 147656369 | 202158055 | 41160707 | 264294440 | 267135485 |
1999 | 85730444 | 74168740 | 39955768 | 312654811 | 318149340 | 187270927 |
Jul | Aug | Sep | Oct | Nov | Dec | |
---|---|---|---|---|---|---|
196728648 | 236996122 | 112413744 | 125401565 | 72551855 | 136042505 | |
208694865 | 83456868 | 286846554 | 275721406 | 230488351 | 24901752 | |
123394421 | 34273985 | 151565752 | 141528519 | 178043261 | 181668256 |
The IF-THEN statement specifies that the Targets
array is loaded only once, during the first iteration of the DATA step. The DO loop executes three times, once for each observation in Sasuser.Ctargets.
The ARRAY statement within the DO loop creates the Mnth
array, which will be used to store the elements from Sasuser.Ctargets. The dimensions of the Mnth
array are specified using an asterisk, which enables SAS to automatically count the array elements.
If you use an asterisk to specify the dimensions of an array, you must list the array elements. You cannot use an asterisk to specify an array's dimensions if the elements of the array are specified with the _TEMPORARY_ keyword. Δ
The array elements Jan
through Dec
are listed using a double hyphen (- -). The double hyphen (- -) is used to read the specified values based on their positions in the PDV instead of alphabetically.
data work.lookup1; array Targets{1997:1999,12} _temporary_; if _n_=1 then do i= 1 to 3; set sasuser.ctargets; array Mnth{*} Jan--Dec; do j=1 to dim(mnth); targets{year,j}=mnth{j}; end; end; set sasuser.monthsum(keep=salemon revcargo monthno); year=input(substr(salemon,4),4.); Ctarget=targets{year,monthno}; format ctarget dollar15.2; run;
The following table shows the values in the Mnth
array after the first iteration of the DO loop.
Table 16.6. Table Representation of Mnth Array (after the first iteration of the DO loop)
Jan | Feb | Mar... | ...Oct | Nov | Dec |
---|---|---|---|---|---|
192284420 | 86376721 | 260386468 | 125401565 | 72551855 | 136042505 |
Within the nested DO loop, the Targets
array reference is matched to the Mnth
array reference in order to populate the Targets
array. The DIM function returns the number of elements in the Mnth
array (in this case 12) and provides an ending point for the nested DO loop.
data work.lookup1; array Targets{1997:1999,12} _temporary_; if _n_=1 then do i= 1 to 3; set sasuser.ctargets; array Mnth{*} Jan--Dec; do j=1 to dim(mnth); targets{year,j}=mnth{j}; end; end; set sasuser.monthsum(keep=salemon revcargo monthno); year=input(substr(salemon,4),4.); Ctarget=targets{year,monthno}; format ctarget dollar15.2; run;
Table 16.7. Table Representation of Mnth Array (after the second iteration of the DO loop)
Jan | Feb | Mar... | ...Oct | Nov | Dec |
---|---|---|---|---|---|
108645734 | 147656369 | 202158055 | 275721406 | 230488351 | 24901752 |
Table 16.8. Table Representation of Mnth Array (after the third iteration of the DO loop)
Jan | Feb | Mar... | ...Oct | Nov | Dec |
---|---|---|---|---|---|
85730444 | 74168740 | 39955768 | 141528519 | 178043261 | 181668256 |
Table 16.9. Table Representation of Populated Targets Array
1 | 2 | 3... | ...10 | 11 | 12 | |
---|---|---|---|---|---|---|
1997 | 192284420 | 86376721 | 260386468 | 125401565 | 72551855 | 136042505 |
1998 | 108645734 | 147656369 | 202158055 | 275721406 | 230488351 | 24901752 |
1999 | 85730444 | 74168740 | 39955768 | 141528519 | 178043261 | 181668256 |
The dimension of the Mnth
array could also be specified using the numeric value 12. However, the asterisk notation enables the program to be more flexible. For example, using the asterisk, the program would not need to be edited if the target data set contained data for only eleven months. Remember that if you use an asterisk to count the array elements, you must list the array elements. Δ
The last step is to read the actual values stored in Sasuser.Monthsum. Remember that you need to know the month and year values for each observation in order to locate the correct target revenue values.
Table 16.10. SAS Data Set Sasuser.Monthsum (first five observations of selected variables)
SaleMon | RevCargo | MonthNo |
---|---|---|
JAN1997 | $171,520,869.10 | 1 |
JAN1998 | $238,786,807.60 | 1 |
JAN1999 | $280,350,393.00 | 1 |
FEB1997 | $177,671,530.40 | 2 |
FEB1998 | $215,959,695.50 | 2 |
The values for month are read in from MonthNo
. The year values are contained within the values of SaleMon
and can be extracted using the SUBSTR function. In this example, the SUBSTR function brings in four characters from SaleMon
, starting at the fourth character. Note that the INPUT function is used to convert the value that is extracted from SaleMon
from character to numeric in the assignment statement for Year
. A numeric format must be used because the value of Year
will be used as an array reference.
The values of Ctarget
are then read in from the Targets
array based on the value of Year
and MonthNo
.
data work.lookup1; array Targets{1997:1999,12} _temporary_; if _n_=1 then do i= 1 to 3; set sasuser.ctargets; array Mnth{*} Jan--Dec; do j=1 to dim(mnth); targets{year,j}=mnth{j}; end; end; set sasuser.monthsum(keep=salemon revcargo monthno); year=input(substr(salemon,4),4.); Ctarget=targets{year,monthno}; format ctarget dollar15.2; run;
Table 16.11. Table Representation of Targets Array
1 | 2 | 3... | ...10 | 11 | 12 | |
---|---|---|---|---|---|---|
1997 | 192284420 | 86376721 | 260386468 | 125401565 | 72551855 | 136042505 |
1998 | 108645734 | 147656369 | 202158055 | 275721406 | 230488351 | 24901752 |
1999 | 85730444 | 74168740 | 39955768 | 141528519 | 178043261 | 181668256 |
PROC PRINT output shows the new data set Work.Lookup1, which contains the actual cargo values (RevCargo
) and the target cargo values (Ctarget
).
In the previous section, we compared actual revenue values to target revenue values using an array as a lookup table. Remember that
Sasuser.Monthsum has an observation for each month and year.
Sasuser.Ctargets has one variable for each month and one observation for each year.
Using arrays was a good solution because the orientation of the data sets differed. An alternate solution is to transpose Sasuser.Ctargets using PROC TRANSPOSE, and then merge the transposed data set with Sasuser.Monthsum by the values of Year
and Month
.
If output-data-set does not exist, PROC TRANSPOSE creates it by using the DATA n naming convention. Δ
If you omit the VAR statement, the TRANSPOSE procedure transposes all of the numeric variables in the input data set that are not listed in another statement. Δ
You must list character variables in a VAR statement if you want to transpose them. Δ
The TRANSPOSE procedure creates an output data set by restructuring the values in a SAS data set. When the data set is restructured, selected variables are transposed into observations. The TRANSPOSE procedure can often eliminate the need to write a lengthy DATA step to achieve the same result. The output data set can be used in subsequent DATA or PROC steps for analysis, reporting, or further data manipulation.
PROC TRANSPOSE does not print the output data set. Use PROC PRINT, PROC REPORT, or some other SAS reporting tool to print the output data set.
The following program transposes the SAS data set Sasuser.Ctargets. The OUT=option specifies the name of the output data set, Work.Ctarget2. All of the variables in Sasuser.Ctargets are transposed because all of the variables are numeric and a VAR statement is not used in the program.
proc transpose data=sasuser.ctargets
out=work.ctarget2;
run;
Table 16.14. Input Data Set Sasuser.Ctargets (selected variables)
Year | Jan | Feb | Mar | Apr | May | Jun |
---|---|---|---|---|---|---|
1997 | 192284420 | 86376721 | 28526103 | 260386468 | 109975326 | 102833104 |
1998 | 108645734 | 147656369 | 202158055 | 41160707 | 264294440 | 267135485 |
1999 | 85730444 | 74168740 | 39955768 | 312654811 | 318149340 | 187270927 |
Table 16.15. Output Data Set: Work.Ctarget2
Obs | _NAME_ | COL1 | COL2 | COL3 |
---|---|---|---|---|
1 | Year | 1997 | 1998 | 1999 |
2 | Jan | 192284420 | 108645734 | 85730444 |
3 | Feb | 86376721 | 147656369 | 74168740 |
4 | Mar | 28526103 | 202158055 | 39955768 |
5 | Apr | 260386468 | 41160707 | 31265481 |
6 | May | 109975326 | 264294440 | 318149340 |
7 | Jun | 102833104 | 267135485 | 187270927 |
8 | Jul | 196728648 | 208694865 | 123394421 |
9 | Aug | 236996122 | 83456868 | 34273985 |
10 | Sep | 112413744 | 286846554 | 151565752 |
11 | Oct | 125401565 | 275721406 | 141528519 |
12 | Nov | 72551855 | 230488351 | 178043261 |
13 | Dec | 136042505 | 24901752 | 181668256 |
Notice that in the output data set, the variables are named _NAME_
, COL1
, COL2
, and COL3.
_NAME_
is the default name of the variable that PROC TRANSPOSE creates to identify the source of the values in each observation in the output data set. This variable is a character variable whose values are the names of the variables that are transposed from the input data set. For example, in Work.Ctarget2 the values in the first observation in the output data set come from the values of the variable Year
in the input data set.
The remaining transposed variables are named COL1
... COL
n
by default. In Work.Ctarget2, the values of the variables COL1
, COL2
, and COL3
represent the target cargo revenue for each month in the years 1997, 1998, and 1999.
You can use PROC TRANSPOSE options to give the variables in the output data set more descriptive names. The NAME= option specifies a name for the _NAME_
variable.
The PREFIX= option specifies a prefix to use in constructing names for transposed variables in the output data set. For example, if PREFIX=Ctarget, the names of the variables are Ctarget1
, Ctarget2
, and Ctarget3
.
proc transpose data=sasuser.ctargets out=work.ctarget2 name=Month prefix=Ctarget; run;
Table 16.16. Output Data Set: Work.Ctarget2
Obs | Month | Ctarget1 | Ctarget2 | Ctarget3 |
---|---|---|---|---|
1 | Year | 1997 | 1998 | 1999 |
2 | Jan | 192284420 | 108645734 | 85730444 |
3 | Feb | 86376721 | 147656369 | 74168740 |
4 | Mar | 28526103 | 202158055 | 39955768 |
5 | Apr | 260386468 | 41160707 | 31265481 |
6 | May | 109975326 | 264294440 | 318149340 |
7 | Jun | 102833104 | 267135485 | 187270927 |
8 | Jul | 196728648 | 208694865 | 123394421 |
9 | Aug | 236996122 | 83456868 | 34273985 |
10 | Sep | 112413744 | 286846554 | 151565752 |
11 | Oct | 125401565 | 275721406 | 141528519 |
12 | Nov | 72551855 | 230488351 | 178043261 |
13 | Dec | 136042505 | 24901752 | 181668256 |
The RENAME=data set option can also be used with PROC TRANSPOSE to change variable names.
proc transpose data=sasuser.ctargets out=work.ctarget2 (rename=(col1=Ctarget1 col2=Ctarget2 col3=Ctarget3)) name=Month; run;
Δ
The default label for the _NAME_
variable is NAME OF FORMER VARIABLE.
To see this, print the transposed data set using PROC PRINT with the LABEL option. You can use a LABEL statement to override the default label.
proc transpose data=sasuser.ctargets out=work.ctarget2 name=Month prefix=Ctarget; run; proc print data=work.ctarget2 label; label Month=MONTH; run;
Remember that the transposed data set, Work.Ctarget2, needs to be merged with Sasuser.Monthsum by the values of Year
and Month
. Neither data set is currently structured correctly for the merge.
In order to correctly structure Work.Ctarget2 for the merge, a BY statement needs to be used with PROC TRANSPOSE. For each BY group, PROC TRANSPOSE creates one observation for each variable that it transposes. The BY variable itself is not transposed.
The following program transposes Sasuser.Ctargets by the value of Year
. The resulting output data set, Work.Ctarget2, now contains 12 observations for each each year (1997, 1998, and 1999).
proc transpose data=sasuser.ctargets
out=work.ctarget2
name=Month
prefix=Ctarget;
by year;
run;
Table 16.19. Input Data Set Sasuser.Ctargets (selected variables)
Obs | Year | Jan | Feb | Mar | Apr | May | Jun |
---|---|---|---|---|---|---|---|
1 | 1997 | 192284420 | 86376721 | 28526103 | 260386468 | 109975326 | 102833104 |
2 | 1998 | 108645734 | 147656369 | 202158055 | 41160707 | 264294440 | 267135485 |
3 | 1999 | 85730444 | 74168740 | 39955768 | 312654811 | 318149340 | 187270927 |
Table 16.20. Output Data Set Work.Ctarget2 (first 12 observations)
Obs | Year | Month | Ctarget1 |
---|---|---|---|
1 | 1997 | Jan | 192284420 |
2 | 1997 | Feb | 86376721 |
3 | 1997 | Mar | 28526103 |
4 | 1997 | Apr | 260386468 |
5 | 1997 | May | 109975326 |
6 | 1997 | Jun | 102833104 |
7 | 1997 | Jul | 196728648 |
8 | 1997 | Aug | 236996122 |
9 | 1997 | Sep | 112413744 |
10 | 1997 | Oct | 125401565 |
11 | 1997 | Nov | 72551855 |
12 | 1997 | Dec | 136042505 |
The original SAS data set must be sorted or indexed before using a BY statement with PROC TRANSPOSE unless you use the NOTSORTED option. Δ
The last step in preparing Work.Ctarget2 for the merge is to use the SORT procedure to sort the data set by Year
and Month
as shown in the following program:
proc sort data=work.ctarget2;
by year month;
run;
Notice that in the sorted version of Work.Ctarget2, the values of month are sorted alphabetically by year.
The data in Sasuser.Monthsum must also be reorganized for the merge because the month and year values in that data set are combined in the variable SaleMon
.
Table 16.22. SAS Data Set Sasuser.Monthsum (first five observations of selected variables)
Obs | SaleMon | RevCargo | MonthNo |
---|---|---|---|
1 | JAN1997 | $171,520,869.10 | 1 |
2 | JAN1998 | $238,786,807.60 | 1 |
3 | JAN1999 | $280,350,393.00 | 1 |
4 | FEB1997 | $177,671,530.40 | 2 |
5 | FEB1998 | $215,959,695.50 | 2 |
The following program creates two new variables, Year
and Month
, to hold the year and month values. The values for Year
are created from SaleMon
using the INPUT and SUBSTR functions. The values for Month
are extracted from SaleMon
using the LOWCASE and SUBSTR functions.
data work.mnthsum2; set sasuser.monthsum(keep=SaleMon RevCargo); length Month $ 8; Year=input(substr(SaleMon,4),4.); Month=substr(SaleMon,1,1) ||lowcase(substr(SaleMon,2,2)); run;
As with Work.Ctarget2, the last step in preparing for the merge is to sort the data set by the values of Year
and Month
as shown in the following program:
proc sort data=work.mnthsum2;
by year month;
run;
Notice that in the sorted version of Work.Mnthsum2, the values of month are sorted alphabetically by year.
Table 16.24. SAS Data Set Work.Mnthsum3 (sorted, first twelve observations)
Obs | SaleMon | RevCargo | Month | Year |
---|---|---|---|---|
1 | APR1997 | $380,804,120.20 | Apr | 1997 |
2 | AUG1997 | $196,639,501.10 | Aug | 1997 |
3 | DEC1997 | $196,504,413.00 | Dec | 1997 |
4 | FEB1997 | $177,671,530.40 | Feb | 1997 |
5 | JAN1997 | $171,520,869.10 | Jan | 1997 |
6 | JUL1997 | $197,163,278.20 | Jul | 1997 |
7 | JUN1997 | $190,560,828.50 | Jun | 1997 |
8 | MAR1997 | $196,591,378.20 | Mar | 1997 |
9 | MAY1997 | $196,261,573.20 | May | 1997 |
10 | NOV1997 | $190,228,066.70 | Nov | 1997 |
11 | OCT1997 | $196,957,153.40 | Oct | 1997 |
12 | SEP1997 | $190,535,012.50 | Sep | 1997 |
When the data is structured correctly, Work.Mnthsum2 and Work.Ctarget2 can be merged by the values of Year
and Month
as shown in the following program:
data work.merged; merge work.mnthsum2 work.ctarget2; by year month; run;
Table 16.25. SAS Data Set Work.Mnthsum2 (first five observations)
Obs | SaleMon | RevCargo | Month | Year |
---|---|---|---|---|
1 | APR1997 | $380,804,120.20 | Apr | 1997 |
2 | AUG1997 | $196,639,501.10 | Aug | 1997 |
3 | DEC1997 | $196,504,413.00 | Dec | 1997 |
4 | FEB1997 | $177,671,530.40 | Feb | 1997 |
5 | JAN1997 | $171,520,869.10 | Jan | 1997 |
Table 16.26. SAS Data Set Work.Ctarget2 (first five observations)
Obs | Year | Month | Ctarget1 |
---|---|---|---|
1 | 1997 | Apr | 260386468 |
2 | 1997 | Aug | 236996122 |
3 | 1997 | Dec | 136042505 |
4 | 1997 | Feb | 86376721 |
5 | 1997 | Jan | 192284420 |
PROC PRINT output shows the resulting data set Work.Merged. The values of RevCargo
represent the actual cargo revenue for each month. The values of Ctarget1
represent the target cargo values for each month.
Beginning with SAS 9, the hash object is available for use in a DATA step. The hash object provides an efficient, convenient mechanism for quick data storage and retrieval.
Unlike an array, which uses a series of consecutive integers to address array elements, a hash object can use any combination of numeric and character values as addresses. A hash object can be loaded from hard-coded values or a SAS data set, is sized dynamically, and exists for the duration of the DATA step.
The hash object is a good choice for lookups using unordered data that can fit into memory because it provides in-memory storage and retrieval and does not require the data to be sorted or indexed.
When a lookup operation depends on one or more key values, you can use the hash object. A hash object resembles a table with rows and columns and contains a key component and a data component.
The key component
might consist of numeric and character values
maps key values to data rows
must be unique
can be a composite.
The data component
can contain multiple data values per key value
can consist of numeric and character values.
Suppose you have a data set, named Sasuser.Contrib, that lists the quarterly contributions to a retirement fund. You can use the hash object to calculate the difference between the actual contribution and the goal amount.
The following program creates a hash object that stores the quarterly goal for employee contributions to the retirement fund. To calculate the difference between actual contribution and the goal amount, the program retrieves the goal amount from the hash object based on the key values.
data work.difference (drop= goalamount); length goalamount 8; if _N_ = 1 then do; declare hash goal( ); goal.definekey("QtrNum"); goal.definedata("GoalAmount"); goal.definedone( ); call missing(qtrnum, goalamount); goal.add(key:'qtr1', data:10 ); goal.add(key:'qtr2', data:15 ); goal.add(key:'qtr3', data: 5 ); goal.add(key:'qtr4', data:15 ); end; set sasuser.contrib; goal.find(); Diff = amount - goalamount; run;
We will see how the hash object is set up.
The hash object is a DATA step component object. Component objects are data elements that consist of attributes and methods. Attributes are the properties that specify the information that is associated with an object. An example is size. Methods define the operations that an object can perform.
To use a DATA step component object in your SAS program, you must first declare and create (instantiate) the object.
You declare a hash object using the DECLARE statement.
The hash iterator object retrieves data from the hash object in ascending or descending key order. Δ
The following DECLARE statement creates a hash object named Goal.
data work.difference (drop= goalamount);
length goalamount 8;
if _N_ = 1 then do;
declare hash goal;
At this point, you have declared the variable Goal. It has the potential to hold a component object of the type hash.
The DECLARE statement is an executable statement. Δ
You use the _NEW_ statement to instantiate the hash object.
The following _NEW_ statement creates an instance of the hash object and assigns it to the variable Goal.
data work.difference (drop= goalamount);
length goalamount 8;
if _N_ = 1 then do;
declare hash goal;
goal= _new_ hash();
As an alternative to the two-step process of using the DECLARE and _NEW_ statements to declare and instantiate a component object, you can use the DECLARE statement to declare and instantiate the component object in one step.
data Work.Difference (drop= goalamount);
length goalamount 8;
if _N_ = 1 then do;
declare hash Goal();
Remember that the hash object uses lookup keys to store and retrieve data. The keys and the data are DATA step variables that you use to initialize the hash object by using dot notation method calls.
A key is defined by passing the key variable name to the DEFINEKEY method. Data is defined by passing the data variable name to the DEFINEDATA method. When all key and data variables are defined, the DEFINEDONE method is called. Keys and data can consist of any number of character or numeric DATA step variables.
The following code initializes the key variable QtrNum
and the data variable GoalAmount
.
data work.difference (drop= goalamount); length goalamount 8; if _N_ = 1 then do; declare hash goal(); goal.definekey ("QtrNum"); goal.definedata ("GoalAmount"); goal.definedone();
The hash object does not assign values to key variables, and the SAS compiler cannot detect the implicit key and data variable assignments done by the hash object. Therefore, if no explicit assignment to a key or data variable appears in the program, SAS issues a note stating that the variables are uninitialized.
To avoid receiving these notes, use the CALL MISSING routine with the key and data variables as parameters. The CALL MISSING routine assigns a missing value to the specified character or numeric variables.
data Work.Difference (drop= goalamount);
length GoalAmount 8;
if _N_ = 1 then do; declare hash goal();
goal.definekey("QtrNum");
goal.definedata("GoalAmount");
goal.definedone();
call missing(qtrnum, goalamount);
Another way to avoid receiving notes stating that the variables are uninitialized is to provide an initial assignment statement that assigns a missing value to each key and data variable. Δ
So far, you have declared and instantiated the hash object, and initialized the hash object's key and data variables. You are now ready to store data in the hash object using the ADD method. The following code uses the ADD method to load the key values qtr1, qtr2, qtr3, and qtr4 and the corresponding data values 10, 15, 5, and 15 into the hash object.
data work.difference (drop= goalamount); length goalamount 8; if _N_ = 1 then do; declare hash goal(); declare hash goal( ); goal.definekey("QtrNum"); goal.definedata("GoalAmount"); goal.definedone( ); call missing(qtrnum, goalamount); goal.add(key:'qtr1', data:10 ); goal.add(key:'qtr2', data:15 ); goal.add(key:'qtr3', data: 5 ); goal.add(key:'qtr4', data:15 ); end;
You use the FIND method to retrieve matching data from the hash object. The FIND method returns a value that indicates whether the key is in the hash object. If the key is in the hash object, then the FIND method also sets the data variable to the value of the data item so that it is available for use after the method call.
data work.difference (drop= goalamount);
length goalamount 8;
if _N_ = 1 then do;
declare hash goal( );
goal.definekey("QtrNum");
goal.definedata("GoalAmount");
goal.definedone( );
call missing(qtrnum, goalamount);
goal.add(key:'qtr1', data:10 );
goal.add(key:'qtr2', data:15 );
goal.add(key:'qtr3', data: 5 );
goal.add(key:'qtr4', data:15 );
end;
set sasuser.contrib;
goal.find();
Diff = amount - goalamount;
run;
We will consider what happens when the program is submitted for execution.
data Work.Difference (drop= goalamount); length goalamount 8; if _N_ = 1 then do;
declare hash goal( );
goal.definekey("QtrNum");
goal.definedata("GoalAmount");
goal.definedone( );
call missing(qtrnum, goalamount);
goal.add(key:'qtr1', data:10 );
goal.add(key:'qtr2', data:15 );
goal.add(key:'qtr3', data: 5 );
goal.add(key:'qtr4', data:15 );
end;
set sasuser.contrib;
goal.find();
Diff = amount - goalamount;
run;
The program executes until the DATA step encounters the end of the line. PROC PRINT output shows the completed data set.
Suppose you need to create a report that shows revenue, expenses, profits, and airport information. You have two data sets that contain portions of the required data. The SAS data set Sasuser.Revenue contains flight revenue information. The SAS data set Sasuser.Acities contains airport data including the airport code, location, and name.
Table 16.27. SAS Data Set Sasuser.Revenue (first five observations)
Origin | Dest | FlightID | Date | Rev1st | RevBusiness | RevEcon |
---|---|---|---|---|---|---|
ANC | RDU | IA03400 | 02DEC1999 | 15829 | 28420 | 68688 |
ANC | RDU | IA03400 | 14DEC1999 | 20146 | 26460 | 72981 |
ANC | RDU | IA03400 | 26DEC1999 | 20146 | 23520 | 59625 |
ANC | RDU | IA03401 | 09DEC1999 | 15829 | 22540 | 58671 |
ANC | RDU | IA03401 | 21DEC1999 | 20146 | 22540 | 65826 |
Table 16.28. SAS Data Set Sasuser.Acities (first five observations)
City | Code | Name | Country |
---|---|---|---|
Auckland | AKL | International | New Zealand |
Amsterdam | AMS | Schiphol | Netherlands |
Anchorage, AK | ANC | Anchorage International Airport | USA |
Stockholm | ARN | Arlanda | Sweden |
Athens (Athinai) | ATH | Hellinikon International Airport | Greece |
To create the report, you can use a hash object to retreive matching airport data from Sasuser.Acities.
To initialize the attributes of hash variables that originate from an existing SAS data set, you can use a non-executing SET statement.
Because the IF condition is false during execution, the SET statement is compiled but not executed. The PDV is created with the variable Code
, City
, and Name
from Sasuser.Acities.
data work.report; if _N_=1 then do; if 0 then set sasuser.acities (keep=Code City Name);
When you use this technique, the MISSING routine is not required.
The hash object that you worked with earlier in this chapter contains one key variable and one data variable. In this example, you need to associate more than one data value with each key.
In the following code, the DECLARE statement creates the Airports hash object and loads it from Sasuser.Acities. the DEFINEKEY method call defines the key to be the value of the variable Code
. The DEFINEDATA method call defines the data to be the values of the variables City
and Name
.
data work.report; if 0 then set sasuser.acities (keep=Code City Name); if _N_=1 then do; declare hash airports (dataset: "sasuser.acities") airports.definekey ("Code"); airports.definedata ("City", "Name"); airports.definedone(); end;
Table 16.29. Hash Object Airports
Key: Code | Data: City | Data: Name |
---|---|---|
ANC | Anchorage, AK | Anchorage International Airport |
BNA | Nashville, TN | Nashville International Airport |
CDG | Paris | Charles de Gaulle |
LAX | Los Angeles, CA | Los Angeles International Airport |
RDU | Raleigh-Durham, NC | Raleigh-Durham International Airport |
To define all data set variables as data variables for the hash object, use the ALL: "YES" option. Δ
hashobject.DEFINEDATA (ALL:"YES");
The hash object can store multiple key variables as well as multiple data variables. Δ
You can use multiple FIND method calls to retrieve multiple data values. In the following program, the FIND method calls retrieve the value of City
and Name
from the Airports hash object based on the value of Origin
.
data work.report; if _N_=1 then do; if 0 then set sasuser.acities (keep=Code City Name); declare hash airports (dataset: "sasuser.acities"); airports.definekey ("Code"); airports.definedata ("City", "Name"); airports.definedone(); end; set sasuser.revenue; airports.find(key:origin); OriginCity=city; OriginAirport=name; airports.find(key:dest); DestCity=city; DestAirport=name; run;
PROC PRINT output shows the completed data set.
proc print data=work.report; var origin dest flightid date origincity originairport destcity destairport; run;
Method calls create a return code that is a numeric value. The value specifies whether the method succeeded or failed. A value of 0 indicates that the method succeeded. A non-zero value indicates that the method failed.
If the program does not contain a return code variable for the method call and the method fails, then an appropriate error message is written to the log.
To store the value of the return code in a variable, specify the variable name rc
at the beginning of the method call. For example:
rc=hashobject.find (key:keyvalue);
The return code variable value can be used in conditional logic to ensure that the FIND method found a KEY value in the hash object that matches the KEY value from the PDV.
Error messages are written to the log when the following program is submitted.
data work.report; if _N_=1 then do; if 0 then set sasuser.acities (keep=Code City Name); declare hash airports (dataset: "sasuser.acities"); airports.definekey ("Code"); airports.definedata ("City", "Name"); airports.definedone(); end; set sasuser.revenue; airports.find(key:origin); OriginCity=city; OriginAirport=name; airports.find(key:dest); DestCity=city; DestAirport=name; run;
A closer examination of the output shows that the data set Work.Report contains errors. For example, notice that in observations 6 through 8 the value of both OriginCity
and DestCity
is Canberra, Australian C and the values of OriginAirport
and DestAirport
are missing.
The errors occur because the Airports hash object does not include the key value WLG or a corresponding Name
value for the key value CBR.
Conditional logic can be added to the program to create blank values if the values loaded from the input data set, Sasuser.Revenue, cannot be found in the Airports hash object:
If the return code for the FIND method call has a value of 0, indicating that the method succeeded, the value of City
and Name
are assigned to the appropriate variables (OriginCity
and OriginAirport
or DestCity
and DestAirport
).
If the return code for the FIND method call has a non-zero value, indicating the method failed, the value of City
and Name
are assigned blank values.
data work.report; if _N_=1 then do; if 0 then set sasuser.acities(keep=Code City Name); declare hash airports (dataset: "sasuser.acities"); airports.definekey ("Code"); airports.definedata ("City", "Name"); airports.definedone(); end; set sasuser.revenue; rc=airports.find(key:origin); if rc=0 then do; OriginCity=city; OriginAirport=name; end; else do;
OriginCity=''; OriginAirport=''; end; rc=airports.find(key:dest); if rc=0 then do; DestCity=city; DestAirport=name; end; else do; DestCity=''; DestAirport=''; end; run;
PROC PRINT output shows the completed data set. Notice that in observations 6 through 8, the value of DestCity
is now blank and no error messages appear in the log.
proc print data=work.report; var origin dest flightid date origincity originairport destcity destairport; run;
This section contains the following:
a text summary of the material taught in this chapter
syntax for statements and options
sample programs
points to remember.
Sometimes, you need to combine data from two or more sets into a single observation in a new data set according to the values of a common variable. When data sources do not have a common structure, you can use a lookup table to match them.
When a lookup operation depends on more than one numerical factor, you can use a multidimensional array. You use an ARRAY statement to create an array. The ARRAY statement defines a set of elements that you plan to process as a group.
In some cases, you might need to store array values in a SAS data set rather than loading them in an ARRAY statement. Array values should be stored in a SAS data set when
there are too many values to initialize easily in the array
the values change frequently
the same values are used in many programs.
The first step in using stored array values is to create an array to hold the values in the target data set. The next step is to load the array elements. One method for accomplishing this task is to load the array within a DO loop. The last step is to read the actual values stored in the other source data set.
The TRANSPOSE procedure can also be used to match data when the orientation of the data sets differs. PROC TRANSPOSE creates an output data set by restructuring the values in a SAS data set, thereby transposing selected variables into observations. The transposed (output) data set can then be merged with the other data set in order to match the data.
The output data set contains several default variables.
_NAME_
is the default name of the variable that PROC TRANSPOSE creates to identify the source of the values in each observation in the output data set. This variable is a character variable whose values are the names of the variables that are transposed from the input data set. To override the default name, use the NAME= option.
The remaining transposed variables are named COL1
...COL
n by default. To override the default names, use the PREFIX= option.
You might need to use a BY statement with PROC TRANSPOSE in order to correctly structure the data for a merge. For each BY group, PROC TRANSPOSE creates one observation for each variable that it transposes. The BY variable itself is not transposed. In order to structure the data for a merge, you might also need to sort the output data set. Any other source data sets might need to be reorganized and sorted as well. When the data is structured correctly, the data sets can be merged.
Beginning with SAS 9, the hash object is available for use in a DATA step. The hash object provides an efficient, convenient mechanism for quick data storage and retrieval.
A hash object resembles a table with rows and columns; it contains a key component and a data component. Unlike an array, which uses a series of consecutive integers to address array elements, a hash object can use any combination of numeric and character values as addresses.
The hash object is a DATA step component option. Component objects are data elements that consist of attributes and methods. To use a DATA step component object in your SAS program, you must first declare and create (instantiate) the object. After you define the hash object's key and data variables, you can load the variables from hard-coded values or a SAS data set.
You use the FIND method call return code that is a numeric value. The value specifies whether the method succeeded or failed. A value of 0 indicates that the method succeeded. A nonzero value indicates that the method failed. The return code variable value can be used in conditional logic to ensure that the FIND method found KEY value in the hash object that matches the KEY value from the PDV.
LIBNAME libref 'SAS-data-library';
DATA libref.sas-data-set(DROP=variable(s));
ARRAY array-name {rows,cols,...}< $><length>
< array-elements><(initial values)>;
SET<SAS-data-set(s) <(data-set-options(s))>> <options>;
variable=expression;
variable=array-name {rows,cols,...};
RUN;
LIBNAME libref 'SAS-data-library';
DATA libref.sas-data-set;
ARRAY array-name1 {rows,cols,...}<$>< length>
< array-elements><(initial values)>;
IF expression THEN DO index-variable=specification;
SET < SAS-data-set(s) <(data-set-options(s) )>> <options>;
ARRAY array-name2 {rows,cols,...}<$><length>
<array-elements><(initial values)>;
DO index-variable=specification;
array-name1 {rows,cols,...}=array-name2 {rows,cols,...};
END;
END;
RUN;
LIBNAME libref 'SAS-data-library';
PROC TRANSPOSE <DATA=input-data-set>
<OUT=output-data-set>
<NAME=name>
<PREFIX=prefix>;
BY <DESCENDING> variable-1
< ...<DESCENDING> variable-n><NOTSORTED>;
RUN;
PROC SORT;
BY <DESCENDING> variable-1
<...<DESCENDING> variable-n>;
RUN;
DATA libref.sas-data-set;
SET<SAS-data-set(s) <(data-set-option(s))>> <options>;
LENGTH variable(s) < $> length;
variable=expression;
variable=expression;
RUN;
PROC SORT;
BY <DESCENDING> variable-1 <...<DESCENDING> variable-n>;
RUN;
DATA libref.sas-data-set;
MERGE SAS-data-set-1 <(data-set-options)>
SAS-data-set-2 <(data-set-options)>;
BY <DESCENDING> variable-1 <...<DESCENDING> variable-n>;
RUN;
DATA libref.sas-data-set;
IF expression THEN statement;
DECLARE object variable <(<argument_tag-1: value-1 <, ...argument_tag-n:
value-n>>)>;
variable = _NEW_ object(<argument_tag-1: value-1<, ...argument_tag-n:
value-n>>);
object.DEFINEKEY('keyvarname-1'<..., 'keyvarname-n'>);
object.DEFINEDATA('datavarname-1'<,...'datavarname-n'>);
object.DEFINEDONE( );
CALL MISSING(varname1<, varname2, ...>);
object.ADD(<KEY: keyvalue-1,..., KEY: keyvalue-n, DATA: datavalue-1, ...,
DATA: datavalue-n>);
END;
SET <SAS-data-set(s) <(data-set-options(s) )>> < options>;
object.FIND(<KEY: keyvalue-1,..., KEY: keyvalue-n>);
RUN;
data work.wndchill (drop = column row); array WC{4,2} _temporary_ (-22,-16,-28,-22,-32,-26,-35,-29); set sasuser.flights; row = round(wspeed,5)/5; column = (round(temp,5)/5)+3; WindChill= wc{row,column}; run;
data work.lookup1; array Targets{1997:1999,12} _temporary_; if _n_=1 then do i= 1 to 3; set sasuser.ctargets; array Mnth{*} Jan--Dec; do j=1 to dim(mnth); targets{year,j}=mnth{j}; end; end; set sasuser.monthsum(keep=salemon revcargo monthno); year=input(substr(salemon,4),4.); Ctarget=targets{year,monthno}; format ctarget dollar15.2; run;
proc transpose data=sasuser.ctargets out=work.ctarget2 name=Month prefix=Ctarget; by year; run; proc sort data=work.ctarget2;
by year month; run; data work.mnthsum2; set sasuser.monthsum(keep=SaleMon RevCargo); length Month $ 8; Year=input(substr(SaleMon,4),4.); Month=substr(SaleMon,1,1) ||lowcase(substr(SaleMon,2,2)); run; proc sort data=work.mnthsum2; by year month; run; data work.merged; merge work.mnthsum2 work.ctarget2; by year month; run;
data work.difference (drop= goalamount); length goalamount 8; if _N_ = 1 then do; declare hash goal( ); goal.definekey("QtrNum"); goal.definedata("GoalAmount"); goal.definedone( ); call missing(qtrnum, goalamount); goal.add(key:'qtr1', data:10 ); goal.add(key:'qtr2', data:15 ); goal.add(key:'qtr3', data: 5 ); goal.add(key:'qtr4', data:15 ); end; set sasuser.contrib; goal.find(); Diff = amount - goalamount; run;
data work.report; if _N_=1 then do; if 0 then set sasuser.acities(keep=Code City Name); declare hash airports (dataset: "sasuser.acities"); airports.definekey ("Code"); airports.definedata ("City", "Name"); airports.definedone(); end; set sasuser.revenue; rc=airports.find(key:origin); if rc=0 then do; OriginCity=city; OriginAirport=name;
end; else do; OriginCity=''; OriginAirport=''; end; rc=airports.find(key:dest); if rc=0 then do; DestCity=city; DestAirport=name; end; else do; DestCity=''; DestAirport=''; end; run;
The name of an array must be a SAS name that is not the name of a SAS variable in the same DATA step.
Array elements must be either all numeric or all character.
The initial values specified for an array can be numbers or character strings. You must enclose all character strings in quotation marks.
The original SAS data set must be sorted or indexed prior to using a BY statement with PROC TRANSPOSE unless you use the NOTSORTED option.
The hash object is a good choice for lookups using unordered data that can fit into memory because it provides in-memory storage and retrieval and does not require the data to be sorted.
The hash object is sized dynamically, and exists for the duration of the DATA step.
Select the best answer for each question. After completing the quiz, check your answers using the answer key in the appendix.
Which SAS statement correctly specifies the array Sales
as illustrated in the following table?
array Sales{3,4} m1-m12;
array Sales{4,3} m1-m12;
array {3,4} Sales m1-m12;
array {4,12} Sales m1-m12;
Which of the following statements creates temporary array elements?
array new {*} _temporary_;
array new {6} _temporary_;
array new {*} _temporary_ Jan Feb Mar Apr May Jun;
array _temporary_ new {6} Jan Feb Mar Apr May Jun;
Which DO statement processes all of the elements in the Yearx
array?
array Yearx{12} Jan--Dec;
do i=1 to dim(yearx);
do i=1 to 12;
do i=Jan to Dec;
a and b
Given the following program, what is the value of Points
in the fifth observation in the data set Work.Results?
40
50
60
70
Array values should be stored in a SAS data set when
there are too many values to initialize easily in an array.
the values change frequently.
the same values are used in many programs.
all of the above
Given the following program, which statement is not true?
data work.lookup1; array Targets{1997:1999,12} _temporary_; if _n_=1 then do i= 1 to 3; set sasuser.ctargets; array Mnth{*} Jan--Dec;
do j=1 to dim(mnth); targets{year,j}=mnth{j}; end; end; set sasuser.monthsum(keep=salemon revcargo monthno); year=input(substr(salemon,4),4.); Ctarget=targets{year,monthno}; run;
The IF-THEN statement specifies that the Targets
array is loaded once.
During the first iteration of the DATA step, the outer DO loop executes three times.
After the first iteration of the DO loop, the pointer drops down to the second SET statement.
During the second iteration of the DATA step, the condition _N_=1 is false
. So, the DO loop does not execute.
Given the following program, which variable names will appear in the data set Work.New?
Year, Jan, Feb, Mar, Apr
Year, 2000, 2001, 2002
_NAME_, Col1, Col2, Col3
_NAME_, Jan, Feb, Mar, Apr
Which program creates the output data set Work.Temp2?
proc transpose data=work.temp out=work.temp2 prefix=Quarter; run;
proc transpose data=work.temp out=work.temp2 name=Month prefix=Quarter; run;
proc transpose data=work.temp out=work.temp2 prefix=Month name=Quarter; run
proc transpose data=work.temp out=work.temp2 prefix=Month index=Quarter; run;
Which version of the data set Work.Sales2 is created by the following program?
Which program creates the data set Work.Fishsize?
proc transpose data=work.fishdata out=work.fishsize prefix=Measurement; run;
proc transpose data=work.fishdata out=work.fishsize prefix=Measurement; by location; run;
proc transpose data=work.fishdata out=work.fishsize prefix=Measurement; by date; run;
proc transpose data=work.fishdata out=work.fishsize prefix=Measurement; by location date; run;