In this chapter we discuss common data management tasks beginning with combining DataFrames. Other tasks discussed include sorting, finding duplicate values, drawing samples, and transposing. Every analysis task requires data to be organized into a specific form before the data is used to render meaningful results. Data often comes from multiple sources with a range of different formats. This requires you to logically relate and organize data to fit the analysis task. In fact, most of the effort for any data analysis is “wrangling” data to shape it appropriately.
pandas has two main facilities for combining DataFrames, offering various types of set logic and relational/algebraic capabilities for join/merge operations. The concat() method performs row-wise or column-wise concatenation and applies union or intersection set logic to DataFrames. The examples explored in this chapter are analogous to the SAS Data Step SET statement and PROC APPEND.
The merge() method offers an SQL-like interface for performing DataFrame join/merge operations. The SAS MERGE statement and PROC SQL are the analogs used to introduce the merge() method.
The merge() method is like the SAS match-merge operation. Since data is rarely tidy, we also explore cases where key columns are inferred, cases where key columns have different names, and merges involving missing key values.
Start by constructing the left and right DataFrames illustrated in Listing 5-1. In this example, the DataFrames use the ID column as a common key.
Notice the left DataFrame has ‘321’ as a duplicate value in the ID column, making a many-to-one relationship between the DataFrames. Also notice how the right DataFrame has ‘871’ as an ID value not found in the left DataFrame. These are the types of issues that may cause unexpected results when performing merge/join operations. These two DataFrames are used in several examples throughout this chapter.
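The code for Listing 5-1 is not reproduced above; the following is a sketch consistent with the SAS data in Listing 5-2. The column names (Name, ID, Gender, Dept, Salary) are assumptions based on the surrounding text.

```python
import pandas as pd

# Left DataFrame: note the duplicate ID value '321'
left = pd.DataFrame(
    {'Name':   ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                'Sirignano', 'Morrison', 'Morrison', 'Oniedae'],
     'ID':     ['929', '446', '228', '299', '442', '321', '321', '882'],
     'Gender': ['M', 'M', 'F', 'M', 'F', 'M', 'M', 'F'],
     'Dept':   ['Mfg', 'Mfg', 'Mfg', 'Sales', 'Admin', 'Sales', 'Sales', 'Admin']})

# Right DataFrame: note ID '871' has no match in left
right = pd.DataFrame(
    {'ID':     ['929', '446', '228', '299', '442', '871'],
     'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})
```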
Listing 5-2 builds the same input data into the left and right datasets in SAS. With the SAS examples, we explore the use of both Data Step and PROC SQL logic as analogs to the merge() and concat() methods for DataFrames.
4 data left;
5 infile datalines dlm=',';
6 length name $ 12 dept $ 5;
7 input name $
8 id
9 gender $
10 dept;
11 list;
12 datalines;
RULE: --+--1--+--2--+--3--+--4--+--5
13 Gunter, 929, M, Mfg
14 Harbinger, 446, M, Mfg
15 Benito, 228, F, Mfg
16 Rudelich, 299, M, Sales
17 Sirignano, 442, F, Admin
18 Morrison, 321, M, Sales
19 Morrison, 321, M, Sales
20 Oniedae, 882, F, Admin
NOTE: The dataset WORK.LEFT has 8 observations and 4 variables.
21 ;;;;
22
23 data right;
24 input id
25 salary;
26 list;
27 datalines;
RULE: --+--1--+--2--+--3--+--4--+--5
28 929 45650
29 446 51290
30 228 62000
31 299 39800
32 442 44345
33 871 70000
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.
34 ;;;;
35
36 proc print data=left;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
37 proc print data=right;
38 run;
Listing 5-2
Create Left and Right Datasets
Figure 5-1 uses PROC PRINT to display the left dataset.
Figure 5-2 uses PROC PRINT to display the right dataset.
SAS Sort/Merge
To begin, consider Listing 5-3, also referred to as match-merge. This program is a common pattern for match-merging two SAS datasets containing a common key. In this example, both the left and right SAS datasets are sorted by the id variable enabling SAS By Group processing. After sorting, the match-merge joins the datasets by the id variable.
Experienced SQL users know the results from this match-merge are the same as a FULL OUTER join in PROC SQL in cases where the table relationships are one-to-one or many-to-one. In cases where the table relationships are many-to-many, the results from the Data Step and PROC SQL differ. The many-to-many use cases for SAS and pandas are detailed in Appendix B at the end of the book. For the remainder of this chapter, our examples deal with one-to-one or one-to-many join relationships represented by the left and right DataFrames and datasets created in Listing 5-1 and Listing 5-2.
Listing 5-3 combines observations from the left and right datasets into a single observation in the new merge_lr dataset according to the values for the id variable found in both datasets.
4 proc sort data=left;
5 by id;
6 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: The dataset WORK.LEFT has 8 observations and 4 variables.
7 proc sort data=right;
8 by id;
9 run;
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.RIGHT has 6 observations and 2 variables.
10 data merge_lr;
11 merge left
12 right;
13 by id;
14 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.MERGE_LR has 9 observations and 5 variables.
15 proc print data=merge_lr;
16 id id;
17 run;
NOTE: There were 9 observations read from the dataset WORK.MERGE_LR.
Listing 5-3
SAS Sort/Merge
Figure 5-3 uses PROC PRINT to display the resulting merge_lr dataset.
To introduce the pandas merge() method, consider Listing 5-4. Using a single Data Step, the program creates seven output datasets, one for each of the following join operations:
Inner join
Right join
Left join
Outer join
Left join with no matched keys
Right join with no matched keys
Outer join with no matched keys
4 data inner
5 right
6 left
7 outer
8 nomatch_l
9 nomatch_r
10 nomatch;
11
12 merge left(in=l)
13 right(in=r);
14 by id;
15
16 if (l=1 and r=1) then output inner; *Inner Join;
17
18 if r = 1 then output right; * Right Join;
19
20 if l = 1 then output left; * Left Join;
21
22 if (l=1 or r=1) then output outer; *Full Outer Join;
23
24 if (l=1 and r=0) then output nomatch_l;
25
26 if (l=0 and r=1) then output nomatch_r;
27
28 if (l=0 or r=0) then output nomatch;
29
30 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.INNER has 5 observations and 5 variables.
NOTE: The dataset WORK.RIGHT has 6 observations and 5 variables.
NOTE: The dataset WORK.LEFT has 8 observations and 5 variables.
NOTE: The dataset WORK.OUTER has 9 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH_L has 3 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH_R has 1 observations and 5 variables.
NOTE: The dataset WORK.NOMATCH has 4 observations and 5 variables.
Listing 5-4
Create Seven Output Datasets
Each of these seven join/merge operations is explored in detail along with their pandas counterpart operations.
Inner Join
An INNER JOIN selects only those rows whose key values are found in both tables. Another way to say this is the intersection of matched key values. The SAS Data Step for an INNER JOIN is shown in Listing 5-5. The id column in both datasets must have matching values to be included in the result set.
4 data inner;
5 merge left(in=l)
6 right(in=r);
7 by id;
8
9 if (l=1 and r=1) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.INNER has 5 observations and 5 variables.
Listing 5-5
Data Step Inner Join
The dataset IN= option creates Boolean variables to indicate which dataset contributes values to the current observation being read. The IF statement applies a truth test selecting only those observations where the id variable has matched values in both the left and right datasets.
With the SAS example as a template, we illustrate the pandas merge() method. The merge() method signature (abbreviated; defaults vary slightly by pandas version) is

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), indicator=False, validate=None)
Listing 5-6 uses the on='ID' argument to indicate the ID column is a key column found in both DataFrames. It turns out the on='ID' argument is not needed in this example, since the merge() method detects the presence of a column labeled ID in both DataFrames and automatically uses it as the key column. The how='inner' argument performs an inner join, which is the default. The sort= argument is set to False. Not surprisingly, merge() operations on large DataFrames gain substantial performance improvements by not having to return rows in sorted order.
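The merge() call in Listing 5-6 might look like the following sketch, with the left and right tables abridged here to the ID key plus one attribute each (the full tables appear in Listing 5-1):

```python
import pandas as pd

# Abridged left/right tables from Listing 5-1: '321' is duplicated
# in left, and '871' appears only in right
left = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '321', '321', '882'],
                     'Name': ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                              'Sirignano', 'Morrison', 'Morrison', 'Oniedae']})
right = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '871'],
                      'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})

# Inner join: only the five ID values present in both DataFrames survive
inner = pd.merge(left, right, on='ID', how='inner', sort=False)
```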
Right Join
A RIGHT JOIN returns all observations from the right dataset along with any observations from the left dataset whose id values match id values in the right dataset. Observations in the right dataset with no match in the left dataset are still returned.
The SAS Data Step equivalent for a right join is illustrated in Listing 5-8.
4 data r_join;
5 merge left(in=l)
6 right(in=r);
7 by id;
8
9 if r=1 then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.R_JOIN has 6 observations and 5 variables.
Listing 5-8
Data Step Right Join
The dataset IN= option creates Boolean variables to indicate whether the datasets contributed values to the current observation. The IF statement applies a truth test selecting all observations in the right table along with those observations in the left dataset where the id variable has matched values in the right dataset. In cases where there are observations in the right dataset with no matching id in the left dataset, these values are set to missing.
Listing 5-9 illustrates the compactness of the merge() method.
The merge() method automatically coalesces the ID column values from both DataFrames into a single column in the returned r_join DataFrame. In cases where there are rows in the right DataFrame with no matching id in the left DataFrame, these values are set to NaN’s.
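A sketch of the right join described here, using abridged left/right tables consistent with Listing 5-1:

```python
import pandas as pd

left = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '321', '321', '882'],
                     'Name': ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                              'Sirignano', 'Morrison', 'Morrison', 'Oniedae']})
right = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '871'],
                      'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})

# Right join: all six right rows return; ID '871' gets NaN for Name
r_join = pd.merge(left, right, on='ID', how='right', sort=False)
```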
Listing 5-10 illustrates a right join with PROC SQL.
4 proc sql;
5 select coalesce(left.id, right.id) as id
6 ,name
7 ,dept
8 ,gender
9 ,salary
10 from left
11 right join
12 right
13 on left.id = right.id;
14 quit;
Listing 5-10
PROC SQL Right Join
The COALESCE function coerces the id columns from both tables to return a single column. Without the COALESCE function, the result set returns columns labeled ID from both the right and left tables with the ID column from the left table containing nulls for those rows with no matches found for the ID column from the right table.
Figure 5-5 displays the result set created from the right join.
As the output shows, all rows from the right table are returned, and PROC SQL assigns missing values to the columns contributed by the left table for rows in the right table having no matching id value in the left table.
Left Join
A left join returns all rows from the left dataset along with any rows from the right table where the join predicate is true. In this case, all rows from the left are returned along with any rows in the right with id values matching in the left. Listing 5-11 illustrates a left join.
4 data l_join;
5 merge left(in=l)
6 right(in=r);
7 by id;
8
9 if l=1 then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.L_JOIN has 8 observations and 5 variables.
Listing 5-11
Data Step Left Join
Listing 5-12 illustrates a left join on the left and right tables.
Like the pandas right join example in Listing 5-9, the merge() method automatically coalesces the ID column values from both DataFrames into a single column in the returned l_join DataFrame. The output shows all rows from the left DataFrame are returned with the merge() method assigning NaN’s for columns in the right DataFrame having unmatched values for the id column in the left DataFrame.
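The left join in Listing 5-12 might be sketched as follows (tables abridged as before):

```python
import pandas as pd

left = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '321', '321', '882'],
                     'Name': ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                              'Sirignano', 'Morrison', 'Morrison', 'Oniedae']})
right = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '871'],
                      'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})

# Left join: all eight left rows return; the three rows with
# IDs '321', '321', and '882' get NaN for Salary
l_join = pd.merge(left, right, on='ID', how='left', sort=False)
```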
Listing 5-13 illustrates a left join with PROC SQL.
4 proc sql;
5 select coalesce(left.id, right.id) as id
6 ,name
7 ,dept
8 ,gender
9 ,salary
10 from left
11 left join
12 right
13 on left.id = right.id;
14 quit;
Listing 5-13
PROC SQL Left Join
Similar to Listing 5-10, this example uses the COALESCE function to coerce the id columns from both tables to return a single column. Figure 5-6 displays the PROC SQL output.
Outer Join
Earlier we stated that SAS Sort/Merge returns the same result set as a PROC SQL OUTER JOIN in cases where the table relationships are either one-to-one or one-to-many. The results from this example are the same as the SAS Sort/Merge results; see Listing 5-3 for the Data Step source program.
Listing 5-14 illustrates the use of the how='outer' argument to select all rows from the left and right tables, performing an outer join.
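The outer join in Listing 5-14 might be sketched as (tables abridged as before):

```python
import pandas as pd

left = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '321', '321', '882'],
                     'Name': ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                              'Sirignano', 'Morrison', 'Morrison', 'Oniedae']})
right = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '871'],
                      'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})

# Outer join: 5 matched rows + 3 left-only rows + 1 right-only row = 9
outer = pd.merge(left, right, on='ID', how='outer', sort=False)
```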
All rows from the left and right tables are returned. In cases where there are no matched values for the id variable, the values are set to missing.
Figure 5-7 displays the result set from an outer join on the left and right tables with PROC SQL.
Right Join Unmatched Keys
Up to this point, the examples are based on finding matching key values in the data to be joined. The next three examples illustrate joining data where keys are not matched.
Every SQL join is either a Cartesian product join or a subset of a Cartesian product join. In cases involving unmatched key values, a form of WHERE processing is required. The SAS Data Step, with its IN= dataset option and associated IF processing logic, is a common pattern for this type of filtering. PROC SQL with a WHERE clause is used as well.
The next three examples illustrate the indicator= argument for the pandas merge() method as an analog to the SAS IN= dataset option. For pandas, the filtering process utilizes a Boolean comparison based on the indicator= value.
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.R_JOIN_NMK has 1 observations and 5 variables.
Listing 5-16
Data Step Unmatched Keys in Right
The statement
if (l=0 and r=1) then output;
behaves as a WHERE filter matching on the id column in both datasets resulting in rows from the right dataset with no matched values in the left dataset.
To perform a right join on unmatched keys on DataFrames, use the indicator= argument to the merge() method. This argument adds a column to the output DataFrame, named _merge by default, indicating the source of each row. The returned values are
left_only
right_only
both
By applying a Boolean filter to the indicator= values, we replicate the behaviors for the SAS IN= dataset option for merge operations. Consider Listing 5-17.
The indicator= argument adds the in_col column to the nomatch_r DataFrame in this example. We print the nomatch_r DataFrame as an intermediate step to display values for the in_col column added by the indicator= argument.
The Boolean expression
nomatch_r[(nomatch_r['in_col'] == 'right_only')]
is a subsetting operation to create the nomatch_r DataFrame with the Boolean test
(nomatch_r['in_col'] == 'right_only')
selecting rows where the in_col column value is 'right_only'.
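Listing 5-17 might be sketched as follows (tables abridged as before); note how the in_col name is supplied to the indicator= argument:

```python
import pandas as pd

left = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '321', '321', '882'],
                     'Name': ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                              'Sirignano', 'Morrison', 'Morrison', 'Oniedae']})
right = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '871'],
                      'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})

# Outer join with indicator, then keep only the right-only rows
nomatch_r = pd.merge(left, right, on='ID', how='outer',
                     sort=False, indicator='in_col')
nomatch_r = nomatch_r[nomatch_r['in_col'] == 'right_only']
```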
The SAS analog to the Listing 5-17 example is generated by simply adding a WHERE filter to the example from Listing 5-10. This is illustrated in Listing 5-18.
4 proc sql;
5 select coalesce(left.id, right.id) as id
6 ,name
7 ,dept
8 ,gender
9 ,salary
10
11 from left
12 right join right on left.ID = right.ID
13 where left.ID is NULL;
14 quit;
Listing 5-18
PROC SQL RIGHT JOIN Unmatched Keys
The WHERE clause returns those rows from the right table having no matching id values in the left table. Columns returned from the left table are set to missing.
Left Join Unmatched Keys
To find the unmatched key values in the left table, consider Listing 5-19.
4 data l_join_nmk;
5 merge left(in=l)
6 right(in=r);
7 by id;
8
9 if (l=1 and r=0) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.L_JOIN_NMK has 3 observations and 5 variables.
Listing 5-19
Data Step Unmatched Keys in Left
The statement
if (l=1 and r=0) then output;
is the reverse from Listing 5-16. The IF statement behaves like a WHERE filter matching on the id column in both datasets resulting in just the observations from the left dataset having no matched id values in the right dataset.
Finding the unmatched key values in the left DataFrame is illustrated in Listing 5-20. Notice the call to the merge() method is the same as illustrated in Listing 5-17; however, in this case, we create the nomatch_l DataFrame and the subsequent filtering logic is different.
The Boolean expression
nomatch_l[(nomatch_l['in_col'] == 'left_only')]
is the subsetting logic to create the nomatch_l DataFrame with the Boolean test
(nomatch_l['in_col'] == 'left_only')
selecting rows where the in_col column value is 'left_only'.
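Listing 5-20 might be sketched as follows (tables abridged as before); the merge() call mirrors Listing 5-17, with only the filtering condition changed:

```python
import pandas as pd

left = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '321', '321', '882'],
                     'Name': ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                              'Sirignano', 'Morrison', 'Morrison', 'Oniedae']})
right = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '871'],
                      'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})

# Outer join with indicator, then keep only the left-only rows
nomatch_l = pd.merge(left, right, on='ID', how='outer',
                     sort=False, indicator='in_col')
nomatch_l = nomatch_l[nomatch_l['in_col'] == 'left_only']
```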
The same result set is generated by simply adding a WHERE filter to the example from Listing 5-13. This is illustrated in Listing 5-21.
4 proc sql;
5 select coalesce(left.id, right.id) as id
6 ,name
7 ,dept
8 ,gender
9 ,salary
10 from left
11 left join
12 right
13 on left.id = right.id
14 where right.id is null;
15 quit;
Listing 5-21
PROC SQL Left Join on Unmatched Keys
The WHERE clause returns those rows from the left table having no matching id values in the right table. Columns returned from the right table are set to missing.
Outer Join Unmatched Keys
An outer join on unmatched keys returns rows from each dataset with unmatched keys in the other dataset. To find the unmatched key values in the left or right dataset, consider Listing 5-22.
4 data outer_nomatch_both;
5 merge left (in=l)
6 right (in=r);
7 by id;
8
9 if (l=0 or r=0) then output;
10 run;
NOTE: There were 8 observations read from the dataset WORK.LEFT.
NOTE: There were 6 observations read from the dataset WORK.RIGHT.
NOTE: The dataset WORK.OUTER_NOMATCH_BOTH has 4 observations and 5 variables.
Listing 5-22
Data Step Outer Join Unmatched Keys in Both
The statement
if (l=0 or r=0) then output;
behaves like a WHERE filter matching on the id column in both datasets returning rows from the left dataset with no matched values in the right dataset or rows from the right dataset with no matched values in the left dataset.
To find unmatched keys in both DataFrames requires an outer join with a corresponding filter identifying the null or missing values. Consider Listing 5-23.
The nomatch_both DataFrame holds the rows resulting from the outer join. The indicator= argument adds the in_col column to the DataFrame containing values subject to a Boolean test for identifying the null values.
The Boolean test (nomatch_both['in_col'] != 'both') selects those rows where the in_col column value is not 'both'.
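Listing 5-23 might be sketched as follows (tables abridged as before):

```python
import pandas as pd

left = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '321', '321', '882'],
                     'Name': ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                              'Sirignano', 'Morrison', 'Morrison', 'Oniedae']})
right = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '871'],
                      'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})

# Outer join with indicator, then keep rows unmatched on either side
nomatch_both = pd.merge(left, right, on='ID', how='outer',
                        sort=False, indicator='in_col')
nomatch_both = nomatch_both[nomatch_both['in_col'] != 'both']
```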
Listing 5-24 illustrates finding the unmatched key values in both the right and left tables.
4 proc sql;
5 select coalesce(left.id, right.id)
6 ,name
7 ,dept
8 ,salary
9 from left
10 full join
11 right
12 on left.id=right.id
13 where left.id ne right.id;
14 quit;
Listing 5-24
PROC SQL Outer Join Unmatched Keys in Both
The WHERE clause returns rows from the left table having no matching id values in the right table and rows from the right table having no matching id values in the left table. Figure 5-10 displays the PROC SQL output.
Validate Keys
As a part of any data management tasks, one must understand the relationships among tables being joined or merged. For example, the results one expects from joining may not be correct if tables believed to have unique key values turn out to have non-unique key values instead. Fortunately, both the pandas library and SAS have methods to detect and enforce join key uniqueness if this is a requirement. Consider Listing 5-25.
pandas.errors.MergeError: Merge keys are not unique in left dataset; not a one-to-one merge
Listing 5-25
pandas Validate 1:1 Relationship
In this example, the validate= argument determines if the merge is of a specified type of relationship. The valid values for validate= are
one_to_one or 1:1: Check if keys are unique in both the left and right DataFrames.
one_to_many or 1:m: Check if keys are unique in the left DataFrame.
many_to_one or m:1: Check if keys are unique in the right DataFrame.
many_to_many or m:m: Allowed, but does not result in a check.
The default value is None. In this example, validate="one_to_one" raises a pandas.errors.MergeError and reports that the merge key, ID, in the left DataFrame is not unique.
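The Listing 5-25 call might be sketched as follows (tables abridged as before); the duplicate ID '321' in the left DataFrame triggers the error:

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '321', '321', '882'],
                     'Name': ['Gunter', 'Harbinger', 'Benito', 'Rudelich',
                              'Sirignano', 'Morrison', 'Morrison', 'Oniedae']})
right = pd.DataFrame({'ID': ['929', '446', '228', '299', '442', '871'],
                      'Salary': [45650, 51290, 62000, 39800, 44345, 70000]})

try:
    pd.merge(left, right, on='ID', how='inner', validate='one_to_one')
    caught = False
except MergeError:
    caught = True  # ID '321' appears twice in left, violating 1:1
```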
Joining on an Index
Recall in Chapter 4, “Indexing and GroupBy,” we detailed the use of indexes as row labels for DataFrames. The pandas library implements the join() method to combine columns from two differently indexed DataFrames into a single DataFrame. It turns out the join() method is a convenience for calling the merge() method in those cases where a DataFrame lacks an index (other than the default RangeIndex) or where key columns have different names.
Consider Listing 5-26. In this example, the left and right DataFrames use an index to label the rows.
>>> import pandas as pd
>>> left = pd.DataFrame(
... { 'Style' : ['S1', 'S2', 'S3', 'S4'],
... 'Size' : ['SM', 'MD', 'LG', 'XL']},
... index = ['01', '02', '03', '05'])
>>> right = pd.DataFrame(
... { 'Color' : ['Red', 'Blue', 'Cyan', 'Pink'],
... 'Brand' : ['X', 'Y', 'Z', 'J']},
... index = ['01', '02', '03', '04'])
>>> print(left)
Style Size
01 S1 SM
02 S2 MD
03 S3 LG
05 S4 XL
>>> print(right)
Color Brand
01 Red X
02 Blue Y
03 Cyan Z
04 Pink J
>>>
>>> df1 = left.join(right, how="left")
>>> print(df1)
Style Size Color Brand
01 S1 SM Red X
02 S2 MD Blue Y
03 S3 LG Cyan Z
05 S4 XL NaN NaN
Listing 5-26
Left Join on Indexed DataFrames
This example creates the left DataFrame with a Python list of values assigning '01', '02', '03', and '05' as the row labels. Similarly, the right DataFrame is created with an index for row labels, '01' to '04'. The df1 DataFrame is created by calling the join() method which joins either on an index (as in this case) or on a designated key column.
In this example, the call to the join() method performs the default left join with the how='left' argument.
The indexes from both columns are preserved on the new df1 DataFrame as a function of the type of join called. Observe how the print() function displays the index column as row labels in the df1 DataFrame.
We can appreciate how the join() method is a convenience for the merge() method, reducing the amount of typing needed to produce the same results.
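The corresponding merge() call might be sketched as follows, joining on the indexes of both DataFrames:

```python
import pandas as pd

left = pd.DataFrame({'Style': ['S1', 'S2', 'S3', 'S4'],
                     'Size': ['SM', 'MD', 'LG', 'XL']},
                    index=['01', '02', '03', '05'])
right = pd.DataFrame({'Color': ['Red', 'Blue', 'Cyan', 'Pink'],
                      'Brand': ['X', 'Y', 'Z', 'J']},
                     index=['01', '02', '03', '04'])

# Index-on-index left join: same result as left.join(right, how='left')
df2 = pd.merge(left, right, left_index=True, right_index=True, how='left')
```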
An inner join operation on the indexed DataFrames is illustrated in Listing 5-28.
>>> df3 = left.join(right, how="inner")
>>> print(df3)
Style Size Color Brand
01 S1 SM Red X
02 S2 MD Blue Y
03 S3 LG Cyan Z
Listing 5-28
Inner Join on Indexed DataFrames
In both Listing 5-27 and Listing 5-28, the indexes labeling the rows are the join keys and remain a part of the joined DataFrames.
Join Key Column with an Index
In those cases where a DataFrame is not indexed, use the on= argument to identify the key column used in a join operation. Observe in Listing 5-29 that the left DataFrame does not have an index and the right DataFrame does.
>>> left = pd.DataFrame(
... {'Style' : ['S1', 'S2', 'S3', 'S4'],
... 'Size' : ['SM', 'MD', 'LG', 'XL'],
... 'Key' : ['01', '02', '03', '05']})
>>> right = pd.DataFrame(
... {'Color' : ['Red', 'Blue', 'Cyan', 'Pink'],
... 'Brand' : ['X', 'Y', 'Z', 'J']},
... index = ['01', '02', '03', '04'])
>>> print(left)
Style Size Key
0 S1 SM 01
1 S2 MD 02
2 S3 LG 03
3 S4 XL 05
>>> print(right)
Color Brand
01 Red X
02 Blue Y
03 Cyan Z
04 Pink J
>>>
>>> df4 = left.join(right, on="Key", how="outer")
>>> print(df4)
Style Size Key Color Brand
0 S1 SM 01 Red X
1 S2 MD 02 Blue Y
2 S3 LG 03 Cyan Z
3 S4 XL 05 NaN NaN
3 NaN NaN 04 Pink J
Listing 5-29
Outer Join Key Column with Index
The on= argument identifies the Key column in the left DataFrame as the join key, which is matched against the index of the right DataFrame.
As pointed out previously, the join() method is a convenience for calling the merge() method to join an indexed DataFrame with a non-indexed DataFrame. The same result set from Listing 5-29 is generated with Listing 5-30.
In this example, the how='outer' argument calls for an outer join using the left_on='Key' argument to designate the join key on the left DataFrame. The right_index=True argument designates the index as the join key on the right DataFrame.
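The Listing 5-30 call might be sketched as:

```python
import pandas as pd

left = pd.DataFrame({'Style': ['S1', 'S2', 'S3', 'S4'],
                     'Size': ['SM', 'MD', 'LG', 'XL'],
                     'Key': ['01', '02', '03', '05']})
right = pd.DataFrame({'Color': ['Red', 'Blue', 'Cyan', 'Pink'],
                      'Brand': ['X', 'Y', 'Z', 'J']},
                     index=['01', '02', '03', '04'])

# Outer join: the Key column of left joins against the index of right
df5 = pd.merge(left, right, left_on='Key', right_index=True, how='outer')
```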
Update
Update operations are used in cases where there is a master table containing original data values. A transaction dataset is typically shaped like the master dataset and contains new values used to update the master. In the case of SAS, an observation from the transaction dataset that does not correspond to any observation in the master dataset becomes a new observation. Begin by observing the behavior of the SAS UPDATE statement in Listing 5-31.
4 data master;
5 input ID salary;
6 list;
7 datalines;
RULE: --+--1--+--2--+--3--+--4--+--
8 023 45650
9 088 55350
10 099 55100
11 111 61625
NOTE: The dataset WORK.MASTER has 4 observations and 2 variables.
12 ;;;;
13
14 data trans;
15 infile datalines dlm=',';
16 input ID
17 salary
18 bonus;
19 list;
20 datalines;
RULE: --+--1--+--2--+--3--+--4--+--
21 023, 45650, 2000
22 088, 61000,
23 099, 59100,
24 111, 61625, 3000
25 121, 50000,
NOTE: The dataset WORK.TRANS has 5 observations and 3 variables.
26 ;;;;
27 data new_pay;
28 update master(rename=(salary = old_salary))
29 trans (rename=(salary = new_salary));
30 by id;
31 run;
NOTE: There were 4 observations read from the dataset WORK.MASTER.
NOTE: There were 5 observations read from the dataset WORK.TRANS.
NOTE: The dataset WORK.NEW_PAY has 5 observations and 4 variables.
32
33 proc print data=new_pay;
34 id ID;
35 run;
Listing 5-31
SAS Update
The SAS UPDATE statement creates the new_pay dataset by applying transaction values from the trans dataset to the master dataset. A BY variable is required. Because the ID values are read in sorted order on input, a call to PROC SORT is not needed in this example.
All non-missing values for variables in the trans dataset replace the corresponding values in the master dataset. The RENAME= dataset option is used to rename the salary variables in order to display the effects of the UPDATE operation. The resulting new_pay dataset is displayed in Figure 5-11.
In this example, the master dataset does not contain a bonus variable. Since the trans dataset contains a bonus variable, it is applied to the new_pay dataset.
The pandas update() method modifies DataFrame values using non-NA values from another DataFrame. In contrast to SAS, the update() method performs an in-place update to the calling DataFrame, in our case the master DataFrame. The SAS UPDATE statement creates a new dataset and does not modify the master dataset. To understand these differences, consider Listing 5-32.
The master DataFrame calls the update() method, applying non-NA values from the trans DataFrame. The default join operation for the update() method is a left join, which explains why row ID 121 is not in the updated master DataFrame; this row exists only in the trans DataFrame. This is also why the Bonus column in the trans DataFrame is not part of the update to the master DataFrame.
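Listing 5-32 might be sketched as follows, with master and trans built to match the SAS data in Listing 5-31 (the column names are assumptions):

```python
import pandas as pd
import numpy as np

master = pd.DataFrame({'ID': ['023', '088', '099', '111'],
                       'Salary': [45650, 55350, 55100, 61625]})
trans = pd.DataFrame({'ID': ['023', '088', '099', '111', '121'],
                      'Salary': [45650, 61000, 59100, 61625, 50000],
                      'Bonus': [2000, np.nan, np.nan, 3000, np.nan]})

# In-place update: non-NA values from trans overwrite master, aligned
# on row labels; trans row 4 (ID '121') and the Bonus column are ignored
master.update(trans)
```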
An alternate approach is attempting a call to the DataFrame update() method with the join= argument set to outer. Unfortunately, this raises the NotImplementedError: Only left join is supported error illustrated in Listing 5-33. (In order to run this example, copy and paste the code defining the master and trans DataFrames from Listing 5-32.) Remember, the update() method performs an in-place update.
>>> master.update(trans, join="outer")
NotImplementedError: Only left join is supported
Listing 5-33
DataFrame update() Outer Join
The correct alternative is to call the merge() method as illustrated in Listing 5-34.
The on='ID2' argument uses the ID2 column common to the master and trans DataFrames as the join key. The how='outer' argument performs an outer join, and suffixes=('_Old', '_Updated') adds a suffix to like-named columns in both DataFrames to disambiguate each DataFrame's column contribution.
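The Listing 5-34 merge might be sketched as follows, using master and trans DataFrames whose columns (ID2, Salary2, Bonus2) follow the output shown in Listing 5-35:

```python
import pandas as pd
import numpy as np

master = pd.DataFrame({'ID2': ['023', '088', '099', '111'],
                       'Salary2': [45650, 55350, 55100, 61625]})
trans = pd.DataFrame({'ID2': ['023', '088', '099', '111', '121'],
                      'Salary2': [45650, 61000, 59100, 61625, 50000],
                      'Bonus2': [2000, np.nan, np.nan, 3000, np.nan]})

# Outer join on ID2; suffixes disambiguate the two Salary2 columns
df6 = pd.merge(master, trans, on='ID2', how='outer',
               suffixes=('_Old', '_Updated'))
```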
Conditional Update
There are cases when updates need to be applied conditionally. SAS users are accustomed to thinking in terms of IF/THEN/ELSE logic for conditional updates. To help understand how this logic works with pandas, we use two examples. The first example defines a Python function to calculate tax rates conditionally on the Salary2_Updated column in the df6 DataFrame. This example will look familiar to SAS users. It uses row iteration with if/else logic to calculate the new Taxes column.
The second approach is a better-performing method using the .loc indexer to apply calculated values to the DataFrame.
Begin with Listing 5-35. Our first step is to copy the df6 DataFrame, created in Listing 5-34, to one called df7.
>>> #copy df6 to df7 to be used in second example
... df7 = df6.copy()
>>> print(df6)
ID2 Salary2_Old Salary2_Updated Bonus2
0 023 45650.0 45650 2000.0
1 088 55350.0 61000 NaN
2 099 55100.0 59100 NaN
3 111 61625.0 61625 3000.0
4 121 NaN 50000 NaN
>>> def calc_taxes(row):
... if row['Salary2_Updated'] <= 50000:
... val = row['Salary2_Updated'] * .125
... else:
... val = row['Salary2_Updated'] * .175
... return val
...
>>> df6['Taxes'] = df6.apply(calc_taxes, axis=1)
>>> print(df6)
ID2 Salary2_Old Salary2_Updated Bonus2 Taxes
0 023 45650.0 45650 2000.0 5706.250
1 088 55350.0 61000 NaN 10675.000
2 099 55100.0 59100 NaN 10342.500
3 111 61625.0 61625 3000.0 10784.375
4 121 NaN 50000 NaN 6250.000
Listing 5-35
Conditional DataFrame Column Update with a Function
This example defines the calc_taxes function, which is applied iterating over the DataFrame rows. The row parameter is local to the function definition, and the function returns the val object when called. The function contains two conditions, if and else. If you need to cascade a series of if statements, use the elif keyword following the first if statement.
The if and else statements define the Taxes column calculated at a 12.5% rate when the Salary2_Updated value is less than or equal to $50,000. Otherwise the tax rate is 17.5%.
The statement
df6['Taxes'] = df6.apply(calc_taxes, axis=1)
creates the Taxes column in the df6 DataFrame by calling the apply() method with the calc_taxes function; the axis=1 argument indicates the function is applied to each row.
The second approach for conditional update uses the .loc indexer illustrated in Listing 5-36.
One way to read this statement is to consider the syntax to the left of the comma (,) similar to a WHERE clause. The df7 DataFrame uses the .loc indexer to find rows meeting the condition df7['Salary2_Updated'] less than or equal to $50,000. To the right of the comma is the assignment when the condition is True; the value for the Taxes column (which is created on the DataFrame) is calculated at a rate of 12.5% of the value found in the Salary2_Updated column.
The second condition assigns a 17.5% tax rate to rows where the Salary2_Updated value is greater than $50,000.
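The Listing 5-36 statements might be sketched as follows, with df7 abridged to the columns used:

```python
import pandas as pd

df7 = pd.DataFrame({'ID2': ['023', '088', '099', '111', '121'],
                    'Salary2_Updated': [45650, 61000, 59100, 61625, 50000]})

# WHERE-like condition to the left of the comma; assignment on the right
df7.loc[df7['Salary2_Updated'] <= 50000, 'Taxes'] = df7['Salary2_Updated'] * .125
df7.loc[df7['Salary2_Updated'] > 50000, 'Taxes'] = df7['Salary2_Updated'] * .175
```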
Specifically, the .loc indexer creates a Boolean mask which is used to index the DataFrame and return those rows meeting the logical condition. This is illustrated in Listing 5-37. For the logical condition df7['Salary2_Updated'] <= 50000, only when the Boolean mask returns True is the value multiplied by .125. Likewise, for the logical condition df7['Salary2_Updated'] > 50000, only when the mask returns True is the value multiplied by .175.
>>> nl = '\n'
>>>
>>> print(nl,
... "Boolean Mask for 'Salary2_Updated' <= 50000",
... nl,
... df7['Salary2_Updated'] <= 50000,
... nl,
... "Boolean Mask for 'Salary2_Updated' > 50000",
... nl,
... df7['Salary2_Updated'] > 50000)
Boolean Mask for 'Salary2_Updated' <= 50000
0 True
1 False
2 False
3 False
4 True
Name: Salary2_Updated, dtype: bool
Boolean Mask for 'Salary2_Updated' > 50000
0 False
1 True
2 True
3 True
4 False
Name: Salary2_Updated, dtype: bool
Listing 5-37
Boolean Mask for Logical Conditions
A conditional update with a SAS Data Step is illustrated in Listing 5-38.
4 data calc_taxes;
5 set new_pay;
6 if new_salary <= 50000 then
7 taxes = new_salary * .125;
8 else taxes = new_salary * .175;
9 run;
NOTE: There were 5 observations read from the dataset WORK.NEW_PAY.
NOTE: The dataset WORK.CALC_TAXES has 5 observations and 5 variables.
10 proc print data=calc_taxes;
11 id ID;
12 run;
Listing 5-38
SAS Conditional Update
The SAS IF/ELSE statement is similar to the Python calc_taxes function defined in Listing 5-35.
Figure 5-12 uses PROC PRINT to display the newly created taxes variable.
Concatenation
The pandas library implements a concat() method similar in behavior to the SAS SET statement. It is used to “glue” DataFrames together on a row-oriented basis, like the SAS SET statement, and on a column-oriented basis, like the SAS MERGE statement. Here we examine its behavior as an analog to the SAS SET statement.
This example uses the DataFrame() method to create the three DataFrames, loc1, loc2, and loc3. The objects are placed into a Python list and assigned to the frames object with the syntax
frames = [loc1, loc2, loc3]
The three DataFrames are concatenated by calling the concat() method using the syntax
all = pd.concat(frames)
creating the output all DataFrame.
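The construction described above can be sketched as follows. The values come from the SAS datalines in Listing 5-40; the part-ID index is inferred from the concat() output shown later in Listing 5-42, and the result is named all_df here to avoid shadowing Python's built-in all:

```python
import pandas as pd

# Values follow the SAS datalines in Listing 5-40; the part-ID index
# is inferred from the concat() output shown in Listing 5-42
loc1 = pd.DataFrame({'Onhand': [21, 79, 33, 81],
                     'Price': [17.99, 9.99, 21.00, .99]},
                    index=['A0', 'A1', 'A2', 'A3'])
loc2 = pd.DataFrame({'Onhand': [12, 33, 233, 45],
                     'Price': [21.99, 18.00, .19, 23.99]},
                    index=['A4', 'A5', 'A6', 'A7'])
loc3 = pd.DataFrame({'Onhand': [37, 50, 13, 88],
                     'Price': [9.99, 5.00, 22.19, 3.99]},
                    index=['A8', 'A9', 'A10', 'A11'])

frames = [loc1, loc2, loc3]
all_df = pd.concat(frames)   # row-wise concatenation, like the SET statement
```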
The analog SAS program is shown in Listing 5-40. It creates the three SAS datasets, loc1, loc2, and loc3, and uses the SET statement to concatenate them together producing the all dataset.
4 data loc1;
5 length id $ 3;
6 input id $
7 onhand
8 price;
9 list;
10 datalines;
RULE: --+--1--+--2--+--3--+--4--+--
12 A0 21 17.99
13 A1 79 9.99
14 A2 33 21
15 A3 81 .99
NOTE: The dataset WORK.LOC1 has 4 observations and 3 variables.
16 ;;;;
17 data loc2;
18 length id $ 3;
19 input id $
20 onhand
21 price;
22 list;
23 datalines;
RULE: --+--1--+--2--+--3--+--4--+--
24 A4 12 21.99
25 A5 33 18
26 A6 233 .19
27 A7 45 23.99
NOTE: The dataset WORK.LOC2 has 4 observations and 3 variables.
28 ;;;;
29 data loc3;
30 length id $ 3;
31 input id $
32 onhand
34 price;
35 list;
36 datalines;
RULE: --+--1--+--2--+--3--+--4--+--
37 A8 37 9.99
38 A9 50 5
39 A10 13 22.19
40 A11 88 3.99
NOTE: The dataset WORK.LOC3 has 4 observations and 3 variables.
41 ;;;;
42 data all;
43
44 set loc1
45 loc2
46 loc3;
47 run;
NOTE: There were 4 observations read from the dataset WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: The dataset WORK.ALL has 12 observations and 3 variables.
48 proc print data=all;
49 run;
Listing 5-40
SAS SET Statement
The PROC SQL UNION ALL set operator is an alternative to the SAS SET statement for creating the all table. This example is illustrated in Listing 5-41.
4 proc sql;
5 create table all as
6 select * from loc1
7 union all
8 select * from loc2
9 union all
10 select * from loc3;
NOTE: Table WORK.ALL created, with 12 rows and 3 columns.
The concat() method is able to construct a hierarchical index by providing the keys= argument to form the outermost level. Listing 5-42 illustrates this feature.
>>> all = pd.concat(frames, keys=['Loc1', 'Loc2', 'Loc3'])
>>> print(all)
Onhand Price
Loc1 A0 21 17.99
A1 79 9.99
A2 33 21.00
A3 81 0.99
Loc2 A4 12 21.99
A5 33 18.00
A6 233 0.19
A7 45 23.99
Loc3 A8 37 9.99
A9 50 5.00
A10 13 22.19
A11 88 3.99
>>> all.loc['Loc3']
Onhand Price
A8 37 9.99
A9 50 5.00
A10 13 22.19
A11 88 3.99
Listing 5-42
Hierarchical Index from concat() Method
With the hierarchical index in place, we can easily identify subsets using the loc() indexer discussed in Chapter 4, “Indexing and GroupBy.” In this example, the loc() indexer slices those rows belonging to the original loc3 DataFrame.
The IN= dataset option enables the ability to uniquely identify observations contributed by a specific SAS dataset as illustrated in Listing 5-43.
4 data all;
5 length id $ 3;
6 set loc1 (in=l1)
7 loc2 (in=l2)
8 loc3 (in=l3);
9 if l1 then location = 'Loc1';
10 if l2 then location = 'Loc2';
11 if l3 then location = 'Loc3';
12 run;
NOTE: There were 4 observations read from the dataset WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: The dataset WORK.ALL has 12 observations and 4 variables.
13
14 proc print data = all(where=(location='Loc3'));
15 id id;
16 var onhand price;
17 run;
Listing 5-43
SET Statement Using IN=
The IF statements use the IN= dataset option to identify which dataset contributed each observation and create the location variable accordingly. To identify those observations contributed from the loc3 dataset, apply the WHERE= filter. In this example, WHERE=(location='Loc3') is applied when calling PROC PRINT. The results are displayed in Figure 5-14.
Similar to SAS PROC APPEND, the pandas library provisions an append() method, illustrated in Listing 5-44. This method is a convenience wrapper for the concat() method, and its syntax is likely more natural for SAS users. Note that recent pandas releases deprecate DataFrame.append() in favor of concat().
>>> all_parts = loc1.append([loc2, loc3])
>>> print(all_parts)
Onhand Price
A0 21 17.99
A1 79 9.99
A2 33 21.00
A3 81 0.99
A4 12 21.99
A5 33 18.00
A6 233 0.19
A7 45 23.99
A8 37 9.99
A9 50 5.00
A10 13 22.19
A11 88 3.99
Listing 5-44
pandas append() Method
DataFrame loc1 calls the append() method to append the loc2 and loc3 DataFrames. For this example, the equivalent syntax using the concat() method is
all_parts = pd.concat([loc1, loc2, loc3])
SAS uses PROC APPEND to append observations from a dataset to a base dataset as illustrated in Listing 5-45.
4 proc append base = loc1
5 data = loc2;
6 run;
NOTE: Appending WORK.LOC2 to WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC2.
NOTE: 4 observations added.
NOTE: The dataset WORK.LOC1 has 8 observations and 3 variables.
7 proc append base = loc1
8 data = loc3;
NOTE: Appending WORK.LOC3 to WORK.LOC1.
NOTE: There were 4 observations read from the dataset WORK.LOC3.
NOTE: 4 observations added.
NOTE: The dataset WORK.LOC1 has 12 observations and 3 variables.
9 proc print data=loc1;
10 run;
Listing 5-45
SAS PROC APPEND
In cases where more than one dataset is being appended, multiple calls to PROC APPEND are needed. In some cases, appending smaller datasets to a larger dataset performs better than the SET statement. PROC APPEND avoids reading observations in the BASE= dataset by positioning the record pointer at the end of the BASE= dataset. Observations from the DATA= dataset are then appended to the end of the BASE= dataset. Figure 5-15 displays the loc1 dataset after the append operations.
Finding Column Min and Max Values
A common task is finding the minimum and maximum values in a column. The pandas max() and min() methods return the maximum and minimum column values, respectively. This is illustrated in Listing 5-46.
The first two lines of the script return the maximum and minimum values for the Price column in the all_parts DataFrame. The next line of the script returns the row for the maximum value of Price using the [ ] indexer followed by a Boolean comparison.
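Listing 5-46 is not reproduced above; a sketch using an abbreviated all_parts DataFrame, with values taken from Listing 5-44's output, is:

```python
import pandas as pd

# all_parts reconstructed (abbreviated) from Listing 5-44's output
all_parts = pd.DataFrame({'Onhand': [21, 79, 33, 81],
                          'Price': [17.99, 9.99, 21.00, .99]},
                         index=['A0', 'A1', 'A2', 'A3'])

max_price = all_parts['Price'].max()   # largest Price value
min_price = all_parts['Price'].min()   # smallest Price value

# Row holding the maximum Price: [ ] indexer with a Boolean comparison
max_row = all_parts[all_parts['Price'] == max_price]
```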
Listing 5-47 illustrates the same logic with PROC SQL.
Both the pandas library and SAS provide sort methods and options for controlling how values are sorted. Examine Listing 5-48. It creates the df DataFrame and calls the sort_values attribute with a Python list of column names indicating a multi-key sort.
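Listing 5-48 can be sketched as follows, with the values reconstructed from the SAS analog in Listing 5-49:

```python
import pandas as pd
import numpy as np

# df values follow Listing 5-49's datalines; SAS's missing age
# becomes NaN
df = pd.DataFrame({'ID': ['A0', 'A1', 'A2', 'A3', 'A4', '5A', '5B'],
                   'Age': [21, 79, 33, 81, np.nan, 33, 33],
                   'Rank': [1, 2, 3, 3, 4, 5, 6]})

# Multi-key sort: Age first, Rank breaks ties among equal ages
df_sorted = df.sort_values(by=['Age', 'Rank'])
```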
Both the sort_values attribute and PROC SORT use ascending as the default sort order. Listing 5-49 is the analog illustrating a call to PROC SORT using a multi-key sort. Of course, the same dataset can be generated with PROC SQL and an ORDER BY statement.
4 data df;
5 length id $ 2;
6 input id $
7 age
8 rank;
9 list;
10 datalines;
RULE: --+--1--+--2--+--3--+--4--+--
11 A0 21 1
12 A1 79 2
13 A2 33 3
14 A3 81 3
15 A4 . 4
16 5A 33 5
17 5B 33 6
NOTE: The dataset WORK.DF has 7 observations and 3 variables.
18 ;;;;
19
20 proc sort data = df;
21 by age rank;
22 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
23 proc print data = df;
24 id id;
25 run;
Listing 5-49
SAS PROC SORT
Figure 5-17 displays the results of the sort operation.
PROC SORT sorts missing values as the smallest numeric value. The default behavior for pandas sort_values attribute is to sort NaN’s as the largest numeric values. The sort_values attribute has the na_position= argument which uses values of 'first' or 'last' for placing NaN’s at the beginning or the end, respectively, of a DataFrame.
Listing 5-50 illustrates overriding the default behavior of 'last' with 'first' with the sort_values attribute.
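The override in Listing 5-50 might look like the following sketch, with assumed values and A4 carrying the missing Age:

```python
import pandas as pd
import numpy as np

# Values assumed; A4 carries the missing Age
df = pd.DataFrame({'ID': ['A0', 'A4', 'A2'],
                   'Age': [21.0, np.nan, 33.0]})

# Place NaN's first instead of the default 'last'
df_sorted = df.sort_values(by='Age', na_position='first')
```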
The behaviors between sort_values attribute and PROC SORT differ with respect to where the sort occurs. By default, the sort_values attribute returns a new DataFrame with the inplace= argument set to False. By contrast, PROC SORT sorts the dataset in place unless an output dataset is specified with the OUT= option.
For a multi-key sort using the sort_values attribute, the ascending or descending sort order is specified with the ascending= argument, supplying a list of True or False values, one for each column listed with the by= argument. Take, for example, Listing 5-51.
In this example, the Age column is sorted by ascending value and the Rank column is sorted by descending value.
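Listing 5-51's mixed-order sort can be sketched as follows, with assumed values:

```python
import pandas as pd

# Values assumed for illustration
df = pd.DataFrame({'Age': [21, 33, 33, 79],
                   'Rank': [1, 5, 3, 2]})

# One Boolean per by= column: Age ascending, Rank descending
df_sorted = df.sort_values(by=['Age', 'Rank'],
                           ascending=[True, False])
```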
PROC SQL uses the keyword DESCENDING following a column name to indicate a descending sort order used with ORDER BY. Listing 5-52 illustrates the DESCENDING keyword to alter the default sort order.
4 proc sql;
5 select * from df
6 order by age, rank descending;
7 quit;
Listing 5-52
PROC SQL ORDER BY
The results from PROC SQL with a multi-key sort are displayed in Figure 5-18.
Finding Duplicates
In some cases, data entry errors lead to duplicate values, non-unique key values, or duplicate rows. In Listing 5-25, we discuss validating the join relationship among tables, where the validate= argument for the merge() method checks whether a join is of a particular type. This raises the question of how to find and remove duplicate key values.
Consider Listing 5-53. This example illustrates how to find duplicate key values using the duplicated attribute.
>>> print(df)
ID Age Rank
0 A0 21.0 1
1 A1 79.0 2
2 A2 33.0 3
3 A3 81.0 3
4 A4 NaN 4
5 5A 33.0 5
6 5B 33.0 6
>>> dup_mask = df.duplicated('Age', keep="first")
>>> df_dups = df.loc[dup_mask]
>>> print(df_dups)
ID Age Rank
5 5A 33.0 5
6 5B 33.0 6
Listing 5-53
Find DataFrame Duplicate Keys
The duplicated attribute returns Booleans indicating duplicate rows and, in this case, limited to duplicate values in the Age column. By default, the duplicated attribute applies to all DataFrame columns. In other words, the default behavior is to identify duplicate rows.
The statement
dup_mask = df.duplicated('Age', keep="first")
defines a Boolean mask using the keep='first' argument for the duplicated attribute. The keep= argument has the following three values:
first: Mark duplicates as True except for the first occurrence.
last: Mark duplicates as True except for the last occurrence.
False: Mark all duplicates as True.
The statement
df_dups = df.loc[dup_mask]
creates the df_dups DataFrame containing the duplicate values for the Age column. This form of conditional slicing is covered in detail in Chapter 4, “Indexing and GroupBy,” in the section “Conditional Slicing.”
Dropping Duplicates
The drop_duplicates attribute returns a de-duplicated DataFrame based on its argument values. This feature is illustrated in Listing 5-54.
The keep= argument has the same values and behavior as the keep= argument for the duplicated attribute.
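Listing 5-54 is not reproduced here; a sketch using the df values printed in Listing 5-53 is:

```python
import pandas as pd
import numpy as np

# df matches the DataFrame printed in Listing 5-53
df = pd.DataFrame({'ID': ['A0', 'A1', 'A2', 'A3', 'A4', '5A', '5B'],
                   'Age': [21.0, 79.0, 33.0, 81.0, np.nan, 33.0, 33.0],
                   'Rank': [1, 2, 3, 3, 4, 5, 6]})

# Keep the first occurrence of each Age value and drop the rest
df_deduped = df.drop_duplicates('Age', keep='first')
```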
Similarly, with SAS, the NODUPKEY and NODUPRECS options for PROC SORT are used to detect and remove duplicate values. The NODUPKEY option removes observations with duplicate BY values. This is analogous to the drop_duplicates attribute with a column argument.
The PROC SORT NODUPRECS option is analogous to the drop_duplicates attribute called without a column argument, comparing entire rows. In each case, duplicate observations or rows, if found, are eliminated.
Consider Listing 5-55. This example replicates the logic for Listing 5-53 and Listing 5-54.
4 proc sort data = df nodupkey
5 out = df_deduped
6 dupout = df_dups;
7 by age;
8
9 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
NOTE: 2 observations with duplicate key values were deleted.
NOTE: The dataset WORK.DF_DEDUPED has 5 observations and 3 variables.
NOTE: The dataset WORK.DF_DUPS has 2 observations and 3 variables.
10 proc print data = df;
11 id id;
12 run;
NOTE: There were 7 observations read from the dataset WORK.DF.
13 proc print data = df_dups;
14 id id;
15 run;
NOTE: There were 2 observations read from the dataset WORK.DF_DUPS.
16 proc print data = df_deduped;
17 id id;
18 run;
NOTE: There were 5 observations read from the dataset WORK.DF_DEDUPED.
Listing 5-55
SAS Find and Drop Duplicates
In this example, the OUT= dataset option creates the df_deduped dataset dropping the duplicate values for the age variable found in the df dataset. Without the OUT= option, PROC SORT does an in-place sort overwriting the df dataset. The NODUPKEY option removes observations with duplicate BY values for the age variable. The DUPOUT= option identifies the output dataset to output the duplicate values.
Figure 5-19 uses PROC PRINT to display the original dataset with duplicate age values.
Figure 5-20 uses PROC PRINT to display the duplicate observations found using the NODUPKEY option that are written to the df_dups dataset using the DUPOUT= option.
Figure 5-21 uses PROC PRINT to display the de-duplicated dataset df_deduped.
Sampling
Sampling is often a requirement in data analysis. The idea is to select a sample from a population and, by analyzing the sample, draw inferences about the population. In a simple random sample without replacement, every row has an equal chance of being selected, and once selected, a row cannot be chosen again. Listing 5-56 illustrates this concept.
The df DataFrame is constructed using the numpy random.randn random number generator to draw 360 values from a standard normal distribution into the value column. The samp1 DataFrame is created by calling the sample attribute with the n= argument to set the number of rows drawn. The replace= argument defaults to False, sampling without replacement.
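Listing 5-56 can be sketched as follows; the seed is assumed, and a simple daily date index stands in for the book's month-end dates:

```python
import pandas as pd
import numpy as np

np.random.seed(1234)   # seed assumed, for reproducibility
# 360 standard-normal values; a daily index stands in for the
# month-end dates used in the text
idx = pd.date_range('1970-01-01', periods=360, freq='D')
df = pd.DataFrame({'value': np.random.randn(360)}, index=idx)

# 100 rows drawn without replacement (replace=False is the default)
samp1 = df.sample(n=100)
```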
The same logic for a simple random sample without replacement is illustrated in Listing 5-57.
4 data df;
5 do date = '01Jan1970'd to '31Dec2000'd by 31;
6 value = rand('NORMAL');
7 output;
8 end;
9 format date yymmdd10.;
10 run;
NOTE: The dataset WORK.DF has 366 observations and 2 variables.
11 data samp1 (drop = k n);
12 retain k 100 n;
13 if _n_ = 1 then n = total;
14 set df nobs=total;
15
16 if ranuni(654321) <= k/n then do;
17 output;
18 k = k -1;
19 end;
20 n = n -1;
21
22 if k = 0 then stop;
23 run;
NOTE: There were 360 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.SAMP1 has 100 observations and 2 variables.
24 proc print data = samp1 (obs=5);
25 id date;
26 var value;
27 run;
Listing 5-57
Simple Random Sample from a Dataset
The df dataset is generated using a DO/END block. The values assigned to the value variable are randomly generated from the standard normal distribution.
The samp1 dataset is created by reading the df dataset and retaining the k variable to the desired sample size of 100. The total variable holds the value for the number of observations read from the input df dataset using the NOBS= dataset option on the SET statement. When the first observation is read from the df dataset, the n variable is assigned the value from the total variable.
The IF-THEN/DO block is based on the logic that each observation in the input dataset has an equal probability of k/n (100/360) of being selected as the first observation in the sample. The values for the k and n variables are decremented by one to reach the desired sample size. If a random number from the RANUNI function is less than or equal to k/n, then the observation is included in the sample. The iteration stops when k, which began with a retained value of 100, decrements to zero (0).
The pandas sample attribute uses the frac= argument to include a portion, for example, 30% of the rows to be included in the sample. This capability is illustrated in Listing 5-58.
>>> samp2 = df.sample(frac=.3, replace=True)
>>> print(samp2.head(5))
value
1971-05-31 0.639097
1997-10-31 1.779798
1971-07-31 1.578456
1981-12-31 2.114340
1980-11-30 0.293887
>>> print(samp2.shape)
(108, 1)
Listing 5-58
Proportional Sample from a DataFrame with Replacement
The replace=True argument indicates the sample is drawn with replacement.
Convert Types
There are occasions when a value is assigned to a DataFrame column that is not appropriate for a desired operation. For example, date values may have been read as strings or numeric values are read as strings.
pandas provisions the astype attribute to convert types. This feature is illustrated in Listing 5-59. The DataFrame dtypes attribute returns each column’s type.
In this example, DataFrame df8 contains two columns, String and Ints, with types of object and int64, respectively. After the calls to the astype attribute, the String and Ints columns have data types float64 and object, respectively. The “Column Types” section in Chapter 3, “pandas Library,” provides more details for the common pandas types.
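A sketch of Listing 5-59, with the column names from the text and assumed values:

```python
import pandas as pd

# Hypothetical df8; column names from the text, values assumed
df8 = pd.DataFrame({'String': ['1.1', '2.2', '3.3'],
                    'Ints': [1, 2, 3]})
assert df8['String'].dtype == object and df8['Ints'].dtype == 'int64'

# astype returns a new Series with the converted type
df8['String'] = df8['String'].astype(float)   # object -> float64
df8['Ints'] = df8['Ints'].astype(str)         # int64  -> object
```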
Rename Columns
Listing 5-60 illustrates calling the DataFrame rename attribute to rename column labels.
The DataFrame rename attribute accepts the Dictionary key:value pairs where the key is the existing column name and the value is the new name. The default value for the inplace= argument is False.
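The rename pattern from Listing 5-60 can be sketched as follows, with assumed labels:

```python
import pandas as pd

# Assumed starting labels
df = pd.DataFrame({'onhand': [21, 79], 'price': [17.99, 9.99]})

# keys are existing labels, values the new ones; a new DataFrame is
# returned because inplace= defaults to False
df_renamed = df.rename(columns={'onhand': 'Onhand', 'price': 'Price'})
```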
Map Column Values
Mapping column values is similar to the function provided by PROC FORMAT. Mapping allows translation of DataFrame column values into associated values as illustrated in Listing 5-61.
The dow DataFrame is constructed with the Day_Num column having values 1 through 7. dow_map defines a Dictionary of key:value pairs whose keys match the values in the Day_Num column and whose values are the weekday names to be mapped to those keys. The statement
dow["Day_Name"] = dow["Day_Num"].map(dow_map)
creates the Day_Name column in the dow DataFrame by calling the map attribute with the dow_map Dictionary for the Day_Num column.
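A sketch of Listing 5-61 follows; treating day 1 as Sunday is an assumption:

```python
import pandas as pd

dow = pd.DataFrame({'Day_Num': [1, 2, 3, 4, 5, 6, 7]})

# Dictionary keys match the Day_Num values; day 1 = Sunday is assumed
dow_map = {1: 'Sunday', 2: 'Monday', 3: 'Tuesday', 4: 'Wednesday',
           5: 'Thursday', 6: 'Friday', 7: 'Saturday'}

dow['Day_Name'] = dow['Day_Num'].map(dow_map)
```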
Transpose
Transposing creates a DataFrame by restructuring the values in a DataFrame, turning columns into rows. The T attribute returns a transposed copy rather than modifying the DataFrame in place, so in most cases an output DataFrame is created with an assignment statement. Creating a new DataFrame by transposing an existing DataFrame is shown in Listing 5-62.
The example begins by defining the Dictionary uni. The statement
df_uni = pd.DataFrame(data=uni)
uses the uni Dictionary to create the df_uni DataFrame.
The call to create the transposed DataFrame and create the output t_df_uni DataFrame is the statement
t_df_uni = df_uni.T
In cases where the DataFrame has mixed data types, such as this one, the transposed DataFrame columns are returned as object (string) types. The Students column in the original df_uni DataFrame is int64. The columns 0–2 in the transposed DataFrame are object data types.
The [ ] indexer is used to return column 0 in this example.
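The steps above can be sketched end to end; the uni Dictionary is reconstructed from the SAS datalines in Listing 5-63:

```python
import pandas as pd

# uni Dictionary reconstructed from Listing 5-63's datalines
uni = {'School': ['NCSU', 'UNC', 'Duke'],
       'Mascot': ['Wolf', 'Ram', 'Devil'],
       'Students': [22751, 31981, 19610],
       'City': ['Raleigh', 'Chapel Hill', 'Durham']}
df_uni = pd.DataFrame(data=uni)

# T returns a transposed copy; mixed input types yield object columns
t_df_uni = df_uni.T
col0 = t_df_uni[0]   # the [ ] indexer returns column 0
```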
Listing 5-63 illustrates transposing a SAS dataset.
4 data df_uni;
5 infile datalines dlm=',';
6 length school $ 4
7 mascot $ 5
8 city $ 11;
9 input school $
10 mascot $
11 students
12 city $;
13 list;
14 datalines;
RULE: --+--1--+--2--+--3--+--4--+--5
2922 NCSU, Wolf, 22751, Raleigh
2923 UNC, Ram, 31981, Chapel Hill
2924 Duke, Devil, 19610, Durham
NOTE: The dataset WORK.DF_UNI has 3 observations and 4 variables.
15 ;;;;
16
17 proc print data = df_uni;
18 id school;
19 run;
NOTE: There were 3 observations read from the dataset WORK.DF_UNI.
20
21 proc transpose data = df_uni
22 out = t_df_uni;
23 var school mascot students city;
24 run;
NOTE: Numeric variables in the input dataset will be converted to character in the output dataset.
NOTE: There were 3 observations read from the dataset WORK.DF_UNI.
NOTE: The dataset WORK.T_DF_UNI has 4 observations and 4 variables.
25
26 proc print data = t_df_uni;
27 run;
NOTE: There were 4 observations read from the dataset WORK.T_DF_UNI.
28
29 proc sql;
30 select name
31 , type
32 from dictionary.columns
33 where libname="WORK" and memname="T_DF_UNI";
34 quit;
Listing 5-63
SAS Transpose
The call to PROC TRANSPOSE transposes the input dataset df_uni and creates the transposed t_df_uni dataset using the OUT= option. Observe the note in the SAS log
Numeric variables in the input dataset will be converted to character in the output dataset.
The note indicates numerics, in this case, the students variable, are converted to characters. The PROC SQL code queries the SAS DICTIONARY table COLUMNS to display the column names and data types for the transposed t_df_uni dataset.
Figure 5-22 uses PROC PRINT to display the input dataset. Figure 5-23 uses PROC PRINT to display the transposed dataset. Figure 5-24 displays the PROC SQL results from the DICTIONARY.COLUMNS table.
Summary
In this chapter, we examined a variety of essential data management tasks needed to reshape and organize data. The primary focus was to enable an understanding of joining/merging and reshaping DataFrames. These principles set the stage for the next chapter’s topic, “pandas Readers and Writers.”