SAS users tend to think of indexing SAS datasets either as a way to improve query performance or as a method to avoid sorting. Another use case for SAS indexes is to provide direct access to a specific observation in a dataset. For example, to retrieve an observation whose value for the variable name is ‘Lassiter’, absent an index, SAS performs a sequential read: it begins at observation 1 and reads the dataset in sequence, testing each observation for name = ‘Lassiter’, until all observations are read.
Alternatively, if an index on the variable name is present, SAS may use the index to directly access those observations containing these values without performing a sequential read of the dataset. A SAS index stores values in ascending order for a specific variable or variables and manages information on how to locate a given observation(s) in the dataset.
In contrast, pandas automatically creates an index structure at DataFrame creation time for both rows and columns. In Chapter 3, “pandas Library,” we encountered the RangeIndex object used as the default row index. These index objects are responsible for holding the axis labels and other metadata, like integer-based location identifiers, axis name, and so on.
One or more columns in a DataFrame can be used to define an index. Assigning more than one column as an index creates a MultiIndex object, discussed later in this chapter. New users to pandas often get confused about the role of an index, since most of their prior associations consider an index to be an auxiliary structure for columns.
The way we like to think about a pandas index is to consider it as a means for labeling DataFrame rows. Recall from Chapter 3, “pandas Library,” that at DataFrame creation time, the RangeIndex object is created as the default index similar to the automatic _N_ variable SAS establishes at SAS dataset creation time. In a DataFrame, the values from a column or columns may be used as an index to supply values as row labels augmenting the default integer values assigned by the RangeIndex object. Just as you are able to return SAS dataset observations using the automatic variable _N_, a DataFrame’s default RangeIndex is used to return rows using a zero-based offset. By explicitly setting a DataFrame index from column or multiple column values, you can return rows using these column values in addition to returning rows using the RangeIndex object.
Create Index
When a DataFrame is assigned an index, the rows remain accessible by supplying a collection of integer values as well as accessible by the row labels defined by the index. Listing 4-1 illustrates the set_index() method using the ID column. In this example, the values from the ID column supply labels for the DataFrame rows.
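A minimal sketch of the pattern Listing 4-1 describes (the ID and Name values here are assumed for illustration):

```python
import pandas as pd

# ID and Name values are assumed for illustration.
df = pd.DataFrame({'ID': ['I001', 'I002', 'I003', 'I004'],
                   'Name': ['Patton', 'Joyner', 'Williams', 'Jurat']})
print(df)                         # rows labeled 0..3 by the default RangeIndex

df.set_index('ID', inplace=True)  # values from ID now supply the row labels
print(df)
```

With the index set, df.loc['I003'] returns the row labeled 'I003' in addition to positional access with df.iloc[2].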
The DataFrame df is constructed using the DataFrame() constructor method. The first print() function returns all of the rows, labeled by the default RangeIndex object with integers from 0 to (axis length - 1). The syntax
df.set_index('ID', inplace = True)
selects the ID column as the index and performs an inplace operation indicated by the argument: inplace = True updates the index in place rather than creating a new DataFrame. The default value for the set_index() method is inplace = False. In the case where the inplace= argument is False, you must assign the results to a new DataFrame. For example,
df_idx = df.set_index('ID', inplace=False)
creates the df_idx DataFrame with the ID column as its index. The original df DataFrame remains unaltered.
The second print() function illustrates how the DataFrame rows are labeled with values from the ID column. The overhead for creating and dropping indexes is minimal, and it is not unusual to do so repetitively in a single Python program.
Next, we consider subsetting DataFrames. Subsetting data by rows and/or columns is an essential task for any form of data analysis. pandas DataFrames offer three choices for subsetting operations. They are
1. The [ ] operator enables selection by columns or by rows.
2. The loc() indexer uses row and column labels for subsetting. A column label is the column name, and row labels are assigned with an index (either with the index= parameter at DataFrame creation time or with the set_index() method). If no index is explicitly assigned, the integer-based RangeIndex object is the default. If no names are assigned to columns, the RangeIndex object labels the columns, with the first column as 0, the second column as 1, and so on.
3. The iloc() indexer uses integer positions (from 0 to axis length - 1) for subsetting rows and columns. This method remains available even if a user-defined index or MultiIndex is defined. MultiIndexes, or hierarchical indexes, are discussed later in this chapter.
Both the loc() and iloc() indexers accept Boolean logic to perform complex subsetting. The [ ] operator and the iloc() indexer can access rows using the default RangeIndex object, that is, integer values indicating a position along the index. The loc() indexer requires a user-defined index for creating row labels in order to operate.
All three indexers return a DataFrame (or a Series when a single row or column is selected).
Return Columns by Position
Consider Listing 4-2. This example constructs the i DataFrame using a single print() function to display the DataFrame values, default row index, and default column index. The '\n' syntax inserts a newline to display the desired results.
We begin with subsetting using the [ ] operator. Observe that both columns and rows in DataFrame i use the default RangeIndex as their labels.
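A sketch of what Listing 4-2 presumably looks like, matching the output shown in Listing 4-3:

```python
import pandas as pd

# Construction assumed from the output in Listing 4-3; a single column of names.
i = pd.DataFrame(['Patton', 'Joyner', 'Williams', 'Jurat', 'Aden'])
print(i, '\n', i.index, '\n', i.columns)
```

Both axes carry a default RangeIndex, so i[0] selects the column labeled 0.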
The default RangeIndex is used to select rows or columns using integers to locate their positions along the index. Consider Listing 4-3.
>>> i[0]
0 Patton
1 Joyner
2 Williams
3 Jurat
4 Aden
Listing 4-3
Returning Column 0 from the DataFrame
The call to the [ ] operator returns the first column (0) from the DataFrame i. In most cases, DataFrame columns will have labels to return the columns of interest.
The [ ] operator also accepts a Python list of columns to return. Recall that a Python list is a mutable data structure for holding a collection of items. List literals are written within square brackets [ ] with commas (,) to indicate multiple items in the list.
Consider Listing 4-4. Each of the values supplied to the DataFrame() constructor method is a Python list and the columns= argument passes a list.
The expression df[['Name', 'After']] is a column subsetting operation returning the columns Name and After. Notice how the Python list ['Name', 'After'] inside the DataFrame slice operator results in a pair of square brackets [[ ]]. The outer pair is the syntax for the DataFrame [ ] operator, while the inner pair holds the literal values forming the Python list of column names.
Clearly, using a list of column names rather than a list of column integer index positions is a more convenient method for subsetting.
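A sketch of Listing 4-4, with rows drawn from the datalines in Listing 4-5:

```python
import pandas as pd

# Rows drawn from the datalines in Listing 4-5.
df = pd.DataFrame({'District': ['I', 'I', 'II'],
                   'Sector':   ['North', 'South', 'North'],
                   'Name':     ['Patton', 'Joyner', 'Aden'],
                   'Before':   [17, 13, 71],
                   'After':    [27, 22, 70]})

# The inner brackets form the Python list; the outer brackets are the [ ] operator.
subset = df[['Name', 'After']]
print(subset)
```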
The equivalent SAS program is displayed in Listing 4-5. It is used in subsequent examples in this chapter.
4 data df;
5 length region $ 6
6 name $ 8
7 district $ 3;
8 infile cards dlm=',';
9 input district $
10 region $
11 name $
12 before
13 after;
14 list;
15 datalines;
RULE: --+--1--+--2--+--3--+--4--+--5
16 I, North, Patton, 17, 27
17 I, South, Joyner, 13, 22
18 I, East, Williams, 111, 121
19 I, West, Jurat, 51, 55
20 II, North, Aden, 71, 70
21 II, South, Tanner, 113, 122
22 II, East, Jenkins, 99, 99
23 II, West, Milner, 15, 65
24 III, North, Chang, 69, 101
25 III, South, Gupta, 11, 22
26 III, East, Haskins, 45, 41
27 III, West, LeMay, 35, 69
28 III, West, LeMay, 35, 69
NOTE: The dataset WORK.DF has 13 observations and 5 variables.
29 ;;;;
30 proc print data = df(obs=4);
31 var name after;
32 run;
NOTE: There were 4 observations read from the dataset WORK.DF
Listing 4-5
Create df SAS Dataset
The output from PROC PRINT with data=df(obs=4) is displayed in Figure 4-1.
Return Rows by Position
The general syntax for DataFrame row slicing (subsetting rows) using the [ ] operator is
df[start : stop : step]
The start position is included in the output, and the stop position is not included in the output.
This example returns the first three rows from the df DataFrame. A null value for the start position defaults to start position zero (0). The value following the colon (:) indicates the stop position and goes up to but does not include the row in the slicing operation.
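The elided listing is presumably df[:3]; a self-contained sketch using the first five rows from Listing 4-5:

```python
import pandas as pd

# df mirrors the SAS dataset in Listing 4-5 (first five rows shown here).
df = pd.DataFrame({'District': ['I', 'I', 'I', 'I', 'II'],
                   'Sector':   ['North', 'South', 'East', 'West', 'North'],
                   'Name':     ['Patton', 'Joyner', 'Williams', 'Jurat', 'Aden'],
                   'Before':   [17, 13, 111, 51, 71],
                   'After':    [27, 22, 121, 55, 70]})

print(df[:3])        # rows 0, 1, 2; the stop position 3 is excluded
```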
Listing 4-7 illustrates returning every other row from the df DataFrame.
>>> df[::2]
District Sector Name Before After
0 I North Patton 17 27
2 I East Williams 111 121
4 II North Aden 71 70
6 II East Jenkins 99 99
8 III North Chang 69 101
10 III East Haskins 45 41
12 III West LeMay 35 69
Listing 4-7
DataFrame Row Slicing, Example 2
The start and stop positions are null causing the slice df[::2] to default to the first and last row, respectively, in the DataFrame. The value of two (2) for the step position returns every other row.
NOTE: There were 13 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.DF1 has 7 observations and 5 variables.
Listing 4-8
SELECT Every Other Row
The example creates the df1 dataset using a subsetting IF statement that performs modulo division by 2 on the automatic SAS variable _N_ assigned to each observation. Modulo division by 2 on even integers returns 0 (zero). With the IF statement’s inequality test ^= 0, every odd _N_ value (1, 3, 5, etc.) evaluates true, and the observation is written to the output df1 dataset.
Calling PROC PRINT displays the output shown in Figure 4-2.
Return Rows and Columns by Label
The loc() indexer is a method primarily used for returning rows and columns using labels. Allowed inputs to loc() are
A single label such as 12 or ‘Name’. Note that 12 is interpreted as the row label and not as the integer location along the index.
A Python list of labels [‘A’, ‘B’, ‘C’].
A slice object with labels ‘a’ : ‘z’. Both start, in this case, ‘a’, and stop, ‘z’, are included when present in the index.
Conditional evaluations.
Each of these methods is illustrated.
Up to this point, the index for the df DataFrame relies on the default RangeIndex object for returning rows by an integer position along the index. In order to retrieve rows from the df DataFrame by labels, the Name column is set as the DataFrame index. This action assigns the values from the Name column as labels for the DataFrame rows. Said another way, a DataFrame index maps column values onto rows as labels.
The syntax and default values for the set_index() method is
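That signature, with its default values, can be sketched and exercised as follows (the keyword names match the pandas documentation; the column values are assumed):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Patton', 'Joyner'], 'After': [27, 22]})

# All keyword arguments below are set to their documented defaults,
# except that Listing 4-9 sets inplace=True.
out = df.set_index('Name', drop=True, append=False,
                   inplace=False, verify_integrity=False)
print(out)
```

drop=True removes the Name column after promoting it to the index; verify_integrity=True would raise an error on duplicate index values.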
In this example, the first print() function displays all rows from the df DataFrame. The syntax
print(df.index)
returns the default RangeIndex in use for the rows, with integer values from 0 up to, but not including, 13. The syntax
df.set_index('Name', inplace=True, drop=True)
selects the values from the Name column label as row labels. The argument inplace=True updates the df DataFrame in place, and the drop=True argument drops the Name column from the DataFrame.
With this defined index, the loc() indexer uses the Name column values to return rows rather than using row position.
Notice how the third print() function displays the values for the Name column as row labels in the leftmost column of the output. The syntax
print(df.index)
returns the index values as a Python list. An index may have non-unique values which we illustrate with this example. Some DataFrame operations require the index keys be in sorted order, while others may require unique values. We cover the details for sorting in Chapter 5, “Data Management.” Chapter 7, “Date and Time,” covers details for unique index values.
With an index in place as row labels, we can slice rows using the loc() indexer. As well, columns can be sliced with the loc() indexer since they have labels (i.e., names).
The syntax for the loc() indexer is
df.loc[row selection, column selection]
The row selection is listed first, separated by a comma (,) followed by the column selection. For both the row and column selections, a colon (:) is used to request a range of items. Consider Listing 4-10.
>>> df.loc['Patton': 'Aden', ]
District Sector Before After
Name
Patton I North 17 27
Joyner I South 13 22
Williams I East 111 121
Jurat I West 51 55
Aden II North 71 70
Listing 4-10
Return Row Slices
This example slices rows beginning with the row labeled 'Patton' and ending with the row labeled 'Aden' inclusive. The empty value for the column selector, following the comma (,), implies all columns. The same DataFrame can be returned by stating the column selector explicitly with the syntax
df.loc['Patton' : 'Aden', 'District' : 'After']
Listing 4-11 illustrates supplying a single label to the row selector followed by a Python list of labels as the column selector.
>>> df.loc['LeMay', ['Before','After']]
Before After
Name
LeMay 35 69
LeMay 35 69
Listing 4-11
Return Row and Column Slices
Notice how the row labels are not unique.
Conditionals
Listing 4-12 illustrates returning rows and columns based on a Boolean comparison.
The Boolean comparisons are enclosed with parentheses ( ) and utilize any of the comparison operators listed in Table 2-2 from Chapter 2, “Python Types and Formatting.” In this example, the Boolean comparisons contain two predicates; the first is (df['Sector'] == 'West') and the second is (df['Before'] > 20). The Boolean operator & (and) joins the predicates, and therefore both must return True in order to meet the row selection criteria.
Note the syntax differences between Listing 4-11 and Listing 4-12. In the former rows are sliced based on labels. The latter uses the df['Sector'] and df['Before'] to designate column names for the conditional expression.
Suppose we wish to subset rows based on the last letter of the value for the Name column ending with the letter ‘r’. Listing 4-13 combines the loc() indexer with the .str.endswith attribute to satisfy the request.
KeyError: 'the label [Name] is not in the [index]'
Listing 4-13
Conditionally Return Rows with String Manipulation
Unfortunately, this example raises a KeyError since the column Name was dropped when the df.index was initially created in Listing 4-9. Note this error message is truncated here. One remedy for the KeyError is to “return” the Name column using the reset_index() function illustrated in Listing 4-14.
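The failing call presumably referenced the Name column after it became the index; a self-contained sketch reproducing the error (names taken from Listing 4-5):

```python
import pandas as pd

# Name is promoted to the index, so it is no longer a column of df.
df = pd.DataFrame({'Name': ['Patton', 'Joyner', 'Tanner'],
                   'After': [27, 22, 122]}).set_index('Name')

try:
    df.loc[df['Name'].str.endswith('r')]   # Name is not in the columns
except KeyError as e:
    print('KeyError:', e)
```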
The syntax df.reset_index(inplace=True) calls the reset_index() method to “drop” the index and return the Name column as one of the columns in the df DataFrame. The inplace=True argument performs the operation in place. The second line in the program chains the .str.endswith("r") attribute to the Name column, returning True for any value whose last letter is ‘r’.
The purpose of this example is to simply illustrate resetting an index with the reset_index() method. The more Pythonic remedy for the KeyError illustrated in Listing 4-13 is
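That remedy presumably filters on the index itself with .str.endswith, avoiding any reference to the dropped column; a self-contained sketch (names taken from Listing 4-5):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Patton', 'Joyner', 'Tanner', 'Milner'],
                   'After': [27, 22, 122, 65]}).set_index('Name')

# Filter on the index itself; .str works on an Index just as it does on a column.
print(df.loc[df.index.str.endswith('r')])
```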
Another method for conditional testing is to combine the loc() indexer with the isin attribute. The isin attribute returns a Boolean indicating if elements in the DataFrame column are contained in a Python list of values. As an example, consider Listing 4-16.
>>> df.set_index('Name', inplace=True)
>>> df.loc[df['Sector'].isin(['North', 'South'])]
District Sector Before After
Name
Patton I North 17 27
Joyner I South 13 22
Aden II North 71 70
Tanner II South 113 122
Chang III North 69 101
Gupta III South 11 22
Listing 4-16
Select Rows with isin List of Values
Because the index was “dropped” in Listing 4-14, the index is set again, this time with the syntax
df.set_index('Name', inplace=True)
to enable slicing with labels using the loc() indexer.
Listing 4-17 illustrates the same capability using the IN operator. The IN operator performs an implied truth test by including values from a list that match values from the sector variable. The IN operator is also valid with an IF statement in a Data Step. Figure 4-4 displays the subset row.
4 proc sql;
5 select *
6 from df
7 where sector in ('North', 'South');
8 quit;
Listing 4-17
SAS IN Operator
Updating
The loc() indexer can update or set values (the term used with pandas documentation). Consider Listing 4-18.
The first call to the loc() indexer supplies a Python list of Name labels for three individuals and returns a Series of their corresponding After values. Recall that a Series is analogous to a single DataFrame column. The second call to the loc() indexer sets (updates) the After column for each of the labels in the Python list.
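A sketch of the two loc() calls described above, using values from Listing 4-5 (only the relevant columns shown):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Patton', 'Jurat', 'Gupta', 'Aden'],
                   'After': [27, 55, 22, 70]}).set_index('Name')

rows = ['Patton', 'Jurat', 'Gupta']
print(df.loc[rows, 'After'])       # first call: returns a Series of After values

df.loc[rows, 'After'] = 100        # second call: sets (updates) those values
print(df.loc[rows, 'After'])
```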
8 if name in ('Patton', 'Jurat', 'Gupta') then do;
9 after = 100;
10 put @1 name @10 after;
11 end;
12 run;
Listing 4-19
IN Operator Conditionally Select Rows.
OUTPUT:
Name After
Patton 100
Jurat 100
Gupta 100
NOTE: There were 13 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.DF has 13 observations and 5 variables.
This example uses the IN operator with an IF/THEN DO/END block updating the After variable conditionally.
Setting values for an entire DataFrame column is illustrated in Listing 4-20.
>>> df.loc[: , 'After'] = 100
>>> print(df.head(5))
District Sector Before After
Name
Patton I North 17 100
Joyner I South 13 100
Williams I East 111 100
Jurat I West 51 100
Aden II North 71 100
Listing 4-20
Set Values for a Column
The call to the loc() indexer slices all rows from the df DataFrame, indicated by a colon (:) with no start and stop values supplied. The column slice After is set to 100.
Return Rows and Columns by Position
The iloc() indexer uses integer positions (from 0 to axis length - 1) for slicing rows and columns. Allowed inputs to iloc() are
An integer, for example, 12.
A Python list of integers [4, 2, 22].
A slice object with integers 2 : 22. The start, in this case, 2, is inclusive and the stop position 22 is exclusive.
The stop position for the iloc() indexer is exclusive, meaning not included. This is in contrast to the loc() indexer which is inclusive.
The syntax for the iloc() indexer is
df.iloc[row selection, column selection]
A comma (,) is used to separate the request of row slices from column slices. A colon (:) is used to request a range of items. The absence of either a column or row selector is an implicit request for all columns or rows, respectively.
These features are illustrated in Listing 4-21 introducing the iloc() indexer.
>>> df.iloc[[0, -1]]
District Sector Name Before After
Name
Patton I North Patton 17 100
LeMay III West LeMay 35 69
Listing 4-21
Return df First and Last Row
In this example, a Python list of row positions is passed to the iloc() indexer. Row 0 is the first row, and row -1 is the last row in the df DataFrame.
The same logic for SAS is illustrated in Listing 4-22.
4 data df1;
5 set df end = last;
6
7 if name in ('Patton', 'Jurat', 'Gupta') then after = 100;
8 if _n_ = 1 then output;
9 if last = 1 then output;
10 run;
NOTE: There were 12 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.DF1 has 2 observations and 5 variables.
11 proc print data = df1 noobs;
12 run;
Listing 4-22
Return First and Last Observation
The input dataset df uses the END= dataset option to detect the last observation reading the df dataset. The END= dataset option initializes a variable’s value to 0 and is set to 1 when the last observation is read. Subsetting IF statements are used to output the first and last observation to the output dataset df1. The output dataset is displayed in Figure 4-5. The NOOBS option for PROC PRINT suppresses the display of the SAS observation number contained in the automatic SAS variable _N_.
The iloc() indexer accommodates a Python list of integers as well as a slice object to define row and column selections. Listing 4-23 illustrates combining these selectors.
>>> df.reset_index(inplace = True)
>>> df.iloc[[2, 10, 12], :2]
Name District
2 Williams I
10 Haskins III
12 LeMay III
Listing 4-23
iloc() Using List and Slice Object
While it is possible to call the iloc() indexer with an index preset, in order to understand the effect, the Name index is dropped with
df.reset_index(inplace = True)
The row selector uses a Python list of integers selecting rows 2, 10, and 12, followed by a comma (,) to define the column slicer. The column slicer :2 (equivalent to 0:2) selects two columns (column 0 and column 1). Remember, for the iloc() indexer, the stop values used as slicers for row and column values go up to but do not include the stop value.
The syntax for selecting the first three columns and all rows in the DataFrame is
df.iloc[ : , :3]
If the column slicer stop position exceeds the number of columns in the DataFrame, then all columns are returned.
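This behavior can be checked with a small sketch (column values assumed):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Patton', 'Joyner'],
                   'District': ['I', 'I'],
                   'Sector': ['North', 'South']})

# The stop position 99 exceeds the three columns present; iloc returns all of them.
print(df.iloc[:, :99])
```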
The iloc() indexer accepts the value -1 to indicate the last object in a sequence, -2 as second to last, and so on. Listing 4-24 illustrates this feature.
>>> df.iloc[-1, -1]
100
Listing 4-24
Return Last Row from Last Column
This example returns the last row from the last column in DataFrame df.
MultiIndexing
Thus far, the use of indexes involves a single column labeling DataFrame rows. See Listing 4-9 as an illustration. This section introduces MultiIndexing, also known as hierarchical indexing. Often the data for analysis is captured at the detail level. As part of performing an exploratory analysis, a MultiIndex DataFrame provides a useful multi-dimensional “view” of data.
These actions are accomplished using the pandas MultiIndex object. Simply put, a MultiIndex allows multiple index levels within a single index, enabling higher-dimensional data to be represented within a one-dimensional Series or a two-dimensional DataFrame.
In a DataFrame, rows and columns may have multiple levels of indexes defined with a MultiIndex object.
Later in this chapter, we will see the benefits from MultiIndexing for “pivoting” DataFrames much the same way an Excel spreadsheet can be pivoted. We will also discuss “stacking” data as a means for “flattening” DataFrames and “unstacking” to perform the reverse operation.
To begin, consider Listing 4-25. The example creates a hierarchical index for the columns in the df DataFrame.
To control the output, the pd.options.display.float_format option displays floats with two places to the right of the decimal. There are several different constructors for defining a MultiIndex. This example uses MultiIndex.from_tuples() to define a hierarchical index for the DataFrame columns.
A Python tuple is a data structure similar to a list used to hold unlike items such as strings, ints, floats, and so on. Unlike a list, tuples are immutable and are defined using a pair of parentheses ( ). In this example, the for loops are shortcuts creating the strings to populate the tuples. Without the for loops, the syntax is
The df DataFrame in this example uses the DataFrame() constructor assigning row labels with index=['Row 1','Row 2'] and columns = col creating the MultiIndexed or hierarchical columns.
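A sketch of Listing 4-25 written without the for-loop shortcut (the data values and random seed are assumed; the levels match the output in Listing 4-26):

```python
import pandas as pd
import numpy as np

pd.options.display.float_format = '{:,.2f}'.format

# Without the for-loop shortcut, the tuples are written out explicitly.
tuples = [('Test1', 'Pre'), ('Test1', 'Post'),
          ('Test2', 'Pre'), ('Test2', 'Post'),
          ('Test3', 'Pre'), ('Test3', 'Post')]
col = pd.MultiIndex.from_tuples(tuples)

np.random.seed(98765)                      # seed value is assumed
df = pd.DataFrame(np.random.randn(2, 6),
                  index=['Row 1', 'Row 2'], columns=col)
print(df)
```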
With the df DataFrame constructed along with its hierarchical columns and row labels, let’s examine the constituent components closely by considering Listing 4-26.
Col Level 1: Index(['Test1', 'Test2', 'Test3'], dtype="object")
Col Level 2: Index(['Post', 'Pre'], dtype="object")
Listing 4-26
MultiIndex Details, Part 2
Begin with the index. Recall a pandas index is simply a method to assign labels to rows. In this example, df.index returns the row labels, 'Row1' and 'Row2'.
The statement df.columns returns the DataFrame’s column labels: in this case, the unique levels from the MultiIndex assigned as columns, along with the labels (codes) that reference these levels.
This df DataFrame MultiIndex has two levels. The statement
df.columns.levels[0]
returns a Python list of column labels used in the outermost level of the hierarchical index. The statement df.columns.levels[1] returns the innermost level of the hierarchical index. Whether assigned to the DataFrame rows or columns, a hierarchical index can have an arbitrary number of levels.
To further understand MultiIndexes, we construct a more elaborate DataFrame. Listing 4-27 illustrates a hierarchical index for both the DataFrame’s rows and columns. The MultiIndex for the columns has a depth of two with values for Area and When as levels. The second hierarchical index on the rows has a depth of two with Year and Month values as levels. The tickets DataFrame holds values for traffic violation tickets.
Since our objective is to understand hierarchical indexes, the explanation for the Python code creating the tickets DataFrame is found in Appendix A at the end of this book. Note that the script in Appendix A must be executed as a part of the examples in this section.
Notice the output from the print function in Listing 4-27. The results display the hierarchical columns with Area as the outer level and When as the inner level. Likewise, in the hierarchical row index, Year is the outer level and Month is the inner level.
The print(tickets.index) statement returns the MultiIndex levels and labels assigned to the rows. To subset DataFrame rows, we refer to
[2015, 2016, 2017, 2018]
as the outer level of the MultiIndex to indicate Year and
[1, 2, 3]
as the inner level of the MultiIndex to indicate Month to compose the row slices.
Similarly, to subset columns, we refer to
['City', 'Rural', 'Suburbs']
as the outer level of the MultiIndex to indicate Area and
['Day', 'Night']
as the inner portion of the MultiIndex to indicate When for the column slices.
Together, row and column slices determine the DataFrame subset.
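Since the construction code lives in Appendix A, here is a hedged sketch of how a tickets-like DataFrame can be built with MultiIndex.from_product (the values are random and do not match the book’s):

```python
import pandas as pd
import numpy as np

# A sketch of the tickets DataFrame from Appendix A; random values, assumed seed.
np.random.seed(654321)
idx = pd.MultiIndex.from_product([[2015, 2016, 2017, 2018], [1, 2, 3]],
                                 names=['Year', 'Month'])
col = pd.MultiIndex.from_product([['City', 'Rural', 'Suburbs'], ['Day', 'Night']],
                                 names=['Area', 'When'])
tickets = pd.DataFrame(np.round(np.abs(np.random.randn(12, 6)) * 20),
                       index=idx, columns=col)
print(tickets.head(3))
```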
Listing 4-28 produces the analog tickets SAS dataset using PROC TABULATE to render output shaped like the tickets DataFrame. Since the Python code and SAS code call different random number generators, the values created, while similar, differ between the DataFrame and the SAS dataset.
4 data tickets;
5 length Area $ 7
6 When $ 9;
7 call streaminit(123456);
8 do year = 2015, 2016, 2017, 2018;
9 do month = 1, 2, 3;
10 do area = 'City', 'Rural', 'Suburbs';
11 do when = 'Day', 'Night';
12 tickets = abs(int((rand('Normal')*100)/5));
13 output;
14 end;
15 end;
16 end;
17 end;
NOTE: The dataset WORK.TICKETS has 72 observations and 5 variables.
18 proc tabulate;
19 var tickets;
20 class area when year month;
21 table year * month ,
22 area=' ' * when=' ' * sum=' ' * tickets=' ';
23 run;
Listing 4-28
Tickets Dataset from PROC TABULATE
The Data Step uses nested DO/END blocks generating the class variables area, when, year, and month. The tickets variable is created with nested functions working from the inside out:
1. The RAND function draws values from the normal distribution random number generator. These values are then multiplied by 100, and the product is divided by 5.
2. The INT function returns the integer portion of the value.
The row dimension crosses (*) the month variable with the year variable. The column dimension crosses (*) values for tickets with the area variable which in turn is crossed (*) with the when variable and together they are crossed with the summed value for the tickets variable.
Basic Subsets with MultiIndexes
With the tickets DataFrame created having hierarchical indexes for rows and columns, we can apply a range of methods for subsetting as well as applying condition-based logic as filtering criteria.
An important feature of hierarchical indexing is the ability to select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index from the results using methods analogous to row and column slicing for regular DataFrames.
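The elided listing is presumably tickets['Rural']; a self-contained stand-in (random values, assumed seed) behaves the same way:

```python
import pandas as pd
import numpy as np

# A small stand-in for the tickets DataFrame (random values, assumed seed).
np.random.seed(24680)
idx = pd.MultiIndex.from_product([[2015, 2016], [1, 2, 3]], names=['Year', 'Month'])
col = pd.MultiIndex.from_product([['City', 'Rural', 'Suburbs'], ['Day', 'Night']],
                                 names=['Area', 'When'])
tickets = pd.DataFrame(np.abs(np.random.randn(6, 6) * 20).round(),
                       index=idx, columns=col)

# Partial selection on the outer column level: all rows, both When levels for Rural.
print(tickets['Rural'])
```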
In this example, the [ ] operator returns a subset of the tickets DataFrame from the level Rural. In this case, Rural designates one of three values from the outer level of the column hierarchical index. Because there is no explicit row selection, all rows are returned.
Listing 4-30 answers the question: for each month how many tickets were issued in the city at night?
>>> tickets['City', 'Night']
Year Month
2015 1 18.0
2 18.0
3 54.0
2016 1 17.0
2 23.0
3 17.0
2017 1 5.0
2 33.0
3 12.0
2018 1 10.0
2 14.0
3 32.0
Listing 4-30
Identify Subgroups with MultiIndexing, Example 2
This example illustrates selecting with both levels of the column MultiIndex. City is from the outermost level of the hierarchical index and Night is from the innermost level.
Recall that most subsetting and slicing operations return a DataFrame. Listing 4-31 illustrates creating a new DataFrame. In this example, the sum() function is applied to the tickets DataFrame elements returning the sum of all tickets by year. These summed values create the new DataFrame sum_tickets.
>>> sum_tickets = tickets.sum(level = 'Year')
>>> print(sum_tickets)
Area City Rural Suburbs
When Day Night Day Night Day Night
Year
2015 31.0 90.0 19.0 39.0 59.0 36.0
2016 27.0 57.0 35.0 53.0 32.0 45.0
2017 57.0 50.0 60.0 29.0 33.0 14.0
2018 63.0 56.0 50.0 39.0 54.0 22.0
Listing 4-31
Sum Tickets to New DataFrame
The sum() function can also be applied across the columns by supplying the axis=1 argument.
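The elided statement presumably used sum() with a level= argument (e.g., tickets.sum(level='Area', axis=1)); that argument was removed in pandas 2.0, so this sketch uses the groupby equivalent. The DataFrame built here is a stand-in with assumed values:

```python
import pandas as pd
import numpy as np

np.random.seed(13579)                       # assumed seed, illustrative values
col = pd.MultiIndex.from_product([['City', 'Rural', 'Suburbs'], ['Day', 'Night']],
                                 names=['Area', 'When'])
tickets = pd.DataFrame(np.abs(np.random.randn(4, 6) * 20).round(),
                       index=pd.Index([2015, 2016, 2017, 2018], name='Year'),
                       columns=col)

# Book-era syntax: tickets.sum(level='Area', axis=1)
# The level= argument was removed in pandas 2.0; groupby is the current equivalent.
by_area = tickets.T.groupby(level='Area').sum().T
print(by_area)
```

This sums Day and Night within each Area, collapsing the inner column level.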
NOTE: The dataset WORK.SUM_TICKETS has 24 observations and 4 variables.
NOTE: There were 72 observations read from the dataset WORK.TICKETS.
14 ods output close;
15 proc print data = sum_tickets;
16 run;
NOTE: There were 24 observations read from the dataset WORK.SUM_TICKETS.
Listing 4-32
PROC TABULATE Report and New Dataset
The default statistic for PROC TABULATE is sum and is applied to the variable tickets using the VAR statement. The TABLE statement arranges the output similar to the output in Listing 4-30. The PROC TABULATE output is presented in Figure 4-7.
In order to create the output dataset sum_tickets, the syntax
ods output
table = sum_tickets (keep = area
when
year
tickets_sum);
opens the ODS destination sum_tickets, as an output SAS dataset with a KEEP list of variables. This method for summarization is an alternative to calling PROC SUMMARY/MEANS or PROC SQL.
Listing 4-33 illustrates the more conventional method for producing the same “rolled up” or summarized dataset.
4 proc summary data = tickets
5 nway
6 noprint;
7 class area
8 when
9 year;
10 output out=sum_tickets(keep=area when year tickets_sum)
11 sum(tickets)=tickets_sum;
NOTE: There were 72 observations read from the dataset WORK.TICKETS.
NOTE: The dataset WORK.SUM_TICKETS has 24 observations and 4 variables.
Listing 4-33
Summarizing tickets Dataset with PROC SUMMARY
The NWAY option requests a combination for all levels of variable values listed on the CLASS statement. The SUM(tickets)=tickets_sum option then sums the number of tickets for each NWAY crossing.
Advanced Indexing with MultiIndexes
Earlier in the chapter, we detailed the loc() indexer for slicing rows and columns with indexed DataFrames. See the section “Return Rows and Columns by Label” in this chapter. Slicing rows and columns with the loc() indexer can be used with a MultiIndexed DataFrame using similar syntax. The loc() indexer supports Boolean logic for filtering criteria.
The loc() indexer enables partial slicing using hierarchically indexed rows and/or columns. Begin by returning the DataFrame along with its index and column information illustrated in Listing 4-34.
The loc() indexer takes slicers as arguments to determine the DataFrame subset of interest. Listing 4-35 illustrates returning all rows for year 2018.
>>> tickets.loc[2018]
Area City Rural Suburbs
When Day Night Day Night Day Night
Month
1 25.0 10.0 8.0 4.0 20.0 15.0
2 35.0 14.0 9.0 14.0 10.0 1.0
3 3.0 32.0 33.0 21.0 24.0 6.0
Listing 4-35
Year Slice 2018
In this case, the rows are sliced returning those with the MultiIndex level for Year equal to 2018. And because no column slicer is provided, all columns are returned.
We can slice Year level for 2018 and Month level for 3 illustrated in Listing 4-36.
>>> tickets.loc[2018, 3, :]
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2018 3 3.0 32.0 33.0 21.0 24.0 6.0
Listing 4-36
Slice Year 2018 and Month 3
In this example, 2018 denotes the outer row slice and 3 denotes the inner row slice, subsetting the DataFrame to Month 3 of Year 2018. The column slice follows the second comma. Again, with no column slice provided, denoted by the colon (:), all columns are returned.
Slicing Rows and Columns
Consider Listing 4-37. In this example, we wish to return the third month for each year. Based on what we have learned about row and column slicing up to this point, it is reasonable to conclude the statement
tickets.loc[(:,3),:]
is the appropriate syntax. However, this syntax raises an error, since it is illegal to use a colon inside a tuple constructor. Recall a tuple is an immutable sequence of items enclosed by parentheses. As a convenience, Python's built-in slice(None) selects all the content for a level. In this case, we want month level 3 for all years.
>>> tickets.loc[(slice(None), 3), :]
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 3 5.0 54.0 7.0 6.0 14.0 18.0
2016 3 9.0 17.0 31.0 48.0 2.0 17.0
2017 3 31.0 12.0 19.0 17.0 14.0 2.0
2018 3 3.0 32.0 33.0 21.0 24.0 6.0
Listing 4-37
Slice Month 3 for All Years
Here slice(None) is the slicer for the Year level, selecting all of its values, in this case 2015 to 2018; the 3 that follows designates level 3 for Month. All columns are returned since no column slicer was given.
Another way to request this same subset is
tickets.loc[(slice(None), slice(3,3)), :]
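A quick way to convince yourself the two slicers are equivalent, sketched here on a made-up stand-in for the tickets DataFrame (the chapter's actual values are not reproduced):

```python
import pandas as pd
import numpy as np

# Made-up stand-in with the same index shape as the chapter's DataFrame.
rows = pd.MultiIndex.from_product([[2017, 2018], [1, 2, 3]],
                                  names=['Year', 'Month'])
cols = pd.MultiIndex.from_product([['City', 'Rural'], ['Day', 'Night']],
                                  names=['Area', 'When'])
tickets = pd.DataFrame(np.arange(24).reshape(6, 4), index=rows, columns=cols)

# Month 3 for all years, written two ways
a = tickets.loc[(slice(None), 3), :]
b = tickets.loc[(slice(None), slice(3, 3)), :]
print(a.equals(b))
```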
Consider the request for all years and months 2 and 3 as the row slicer in Listing 4-38.
>>> tickets.loc[(slice(None), slice(2,3)), :]
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 2 11.0 18.0 3.0 30.0 42.0 15.0
3 5.0 54.0 7.0 6.0 14.0 18.0
2016 2 7.0 23.0 3.0 5.0 19.0 2.0
3 9.0 17.0 31.0 48.0 2.0 17.0
2017 2 5.0 33.0 19.0 2.0 7.0 10.0
3 31.0 12.0 19.0 17.0 14.0 2.0
2018 2 35.0 14.0 9.0 14.0 10.0 1.0
3 3.0 32.0 33.0 21.0 24.0 6.0
Listing 4-38
Slice Months 2 and 3 for All Years
Alternatively, the same results are accomplished with the syntax
tickets.loc[(slice(None), slice(2,3)), slice(None)]
This syntax helps in further understanding exactly how the slicing operation is performed. The first slice(None) requests all of the rows for the outer row label, years 2015 to 2018. slice(2,3) returns months 2 and 3 for the inner row label. The last slice(None) requests all columns, that is, both the outer column Area and the inner column When.
Fairly quickly, however, supplying collections of tuples as slicers to the loc() indexer becomes unwieldy. Fortunately, pandas provides the IndexSlice object to deal with this situation.
Consider Listing 4-39 as an alternative to Listing 4-38.
>>> idx = pd.IndexSlice
>>> tickets.loc[idx[2015:2018, 2:3], :]
>>>
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 2 11.0 18.0 3.0 30.0 42.0 15.0
3 5.0 54.0 7.0 6.0 14.0 18.0
2016 2 7.0 23.0 3.0 5.0 19.0 2.0
3 9.0 17.0 31.0 48.0 2.0 17.0
2017 2 5.0 33.0 19.0 2.0 7.0 10.0
3 31.0 12.0 19.0 17.0 14.0 2.0
2018 2 35.0 14.0 9.0 14.0 10.0 1.0
3 3.0 32.0 33.0 21.0 24.0 6.0
Listing 4-39
IndexSlice Object
The IndexSlice object provides a more natural syntax for slicing operations on MultiIndexed rows and columns. In this case, the slice
tickets.loc[idx[2015:2018, 2:3], :]
returns years 2015:2018 inclusive on the outer level of the MultiIndex for the rows and months 2 and 3 inclusive on the inner level. The colon (:) designates the start and stop positions for these row labels. Following the row slicer is a comma (,) to designate the column slicer. With no explicit column slicer defined, all columns are returned.
The loc() indexer can slice both axes at once. For example, a row slicer can return level 2018 for Year on the outer level of the MultiIndex and 2 and 3 from Month on the inner level, while a column slicer returns the levels City and Rural from Area on the outer level. In that case, the column slicer does not slice along the inner level of the MultiIndex on When.
Listing 4-41 illustrates details for slicing columns.
The row slicer returns levels 2018 for Year on the outer level of the MultiIndex and 2 and 3 from Month on the inner level. The column slicer returns the level City for Area on the outer level of the MultiIndex and the levels Day and Night on the inner level from When.
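A hedged sketch of this row-and-column slicing on a made-up tickets stand-in; the level names follow the chapter, but the values are invented:

```python
import pandas as pd
import numpy as np

# Made-up stand-in with the chapter's MultiIndex shape.
rows = pd.MultiIndex.from_product([[2017, 2018], [1, 2, 3]],
                                  names=['Year', 'Month'])
cols = pd.MultiIndex.from_product([['City', 'Rural', 'Suburbs'],
                                   ['Day', 'Night']],
                                  names=['Area', 'When'])
tickets = pd.DataFrame(np.arange(36).reshape(6, 6), index=rows, columns=cols)

idx = pd.IndexSlice
# Rows: Year 2018, Months 2 and 3; columns: Area City, When Day through Night
subset = tickets.loc[idx[2018:2018, 2:3], idx['City':'City', 'Day':'Night']]
print(subset)
```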
Conditional Slicing
Oftentimes we need to subset based on conditional criteria. The loc() indexer accepts a Boolean mask for slicing based on criteria applied to values in the DataFrame. We introduced the concept of a Boolean mask in Chapter 3, “pandas Library,” in the section on isnull().
We can identify instances where the number of tickets relates to a given threshold by creating a Boolean mask and applying it to the DataFrame using the loc() indexer. Specifically, we want to know when the number of tickets issued in the city during the day is greater than 25.
In this example, we define the mask object using the column slicing syntax followed by the Boolean operator greater than (>) and 25 as the threshold value. Rows are sliced using the conditional with the mask object. The columns are sliced using the City level from Area and the Day level from When. Area is the outer level of the column MultiIndex and When is the inner level.
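The mask itself is built with ordinary column-selection syntax. Here is a minimal sketch on a made-up tickets stand-in, with 25 as the threshold from the text:

```python
import pandas as pd
import numpy as np

# Made-up stand-in; the chapter's actual ticket counts are not reproduced.
rows = pd.MultiIndex.from_product([[2017, 2018], [1, 2, 3]],
                                  names=['Year', 'Month'])
cols = pd.MultiIndex.from_product([['City', 'Rural'], ['Day', 'Night']],
                                  names=['Area', 'When'])
tickets = pd.DataFrame(np.arange(0, 48, 2).reshape(6, 4),
                       index=rows, columns=cols)

mask = tickets[('City', 'Day')] > 25          # Boolean Series over the rows
subset = tickets.loc[mask, ('City', 'Day')]   # rows where the mask is True
print(subset)
```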
Another form of conditional slicing uses the pandas where() method. The where() method returns a DataFrame the same size as the original whose corresponding values are returned where the condition is True. Where the condition is False, the default behavior is to return NaN's. This feature is illustrated in Listing 4-43.
>>> missing = "XXX"
>>> tickets.where(tickets > 30, other = missing)
Area City Rural Suburbs
When Day Night Day Night Day Night
Year Month
2015 1 XXX XXX XXX XXX XXX XXX
2 XXX XXX XXX XXX 42 XXX
3 XXX 54 XXX XXX XXX XXX
2016 1 XXX XXX XXX XXX XXX XXX
2 XXX XXX XXX XXX XXX XXX
3 XXX XXX 31 48 XXX XXX
2017 1 XXX XXX XXX XXX XXX XXX
2 XXX 33 XXX XXX XXX XXX
3 31 XXX XXX XXX XXX XXX
2018 1 XXX XXX XXX XXX XXX XXX
2 35 XXX XXX XXX XXX XXX
3 XXX 32 33 XXX XXX XXX
Listing 4-43
DataFrame where() Method
The other = argument assigns an arbitrary value for the False condition, in this case, the string 'XXX' held by the missing object. Also notice how the returned DataFrame is the same shape as the original.
Cross Sections
pandas DataFrames provide a cross section method, xs(), as another means for returning rows and columns from an indexed DataFrame, or partial data in the case of a MultiIndexed DataFrame. The compact syntax offered by the xs() method makes it fairly easy to subset MultiIndexed DataFrames. The xs() method is read-only.
The xs() cross section method has two arguments. The first argument in this example is the level value 1, and the second argument, level = 'Month', returns the rows for month 1 for all years with all columns. Recall the Month column is a component of the MultiIndex forming the row labels.
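A minimal sketch of this xs() call on a made-up stand-in (note that xs() drops the selected level from the result by default):

```python
import pandas as pd
import numpy as np

# Made-up stand-in with the chapter's MultiIndex shape.
rows = pd.MultiIndex.from_product([[2017, 2018], [1, 2, 3]],
                                  names=['Year', 'Month'])
cols = pd.MultiIndex.from_product([['City', 'Rural'], ['Day', 'Night']],
                                  names=['Area', 'When'])
tickets = pd.DataFrame(np.arange(24).reshape(6, 4), index=rows, columns=cols)

# Month 1 for every year; the Month level is dropped from the result
month1 = tickets.xs(1, level='Month')
print(month1)
```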
The xs() cross section method also works along the column axis, as illustrated in Listing 4-45.
>>> tickets.xs(('City'), level="Area", axis = 1)
When Day Night
Year Month
2015 1 15.0 18.0
2 11.0 18.0
3 5.0 54.0
2016 1 11.0 17.0
2 7.0 23.0
3 9.0 17.0
2017 1 21.0 5.0
2 5.0 33.0
3 31.0 12.0
2018 1 25.0 10.0
2 35.0 14.0
3 3.0 32.0
Listing 4-45
xs() Cross Section, Example 2
In this example, we return all rows for the level City. The axis = 1 argument slices along the columns, returning just the columns under the City level.
Because the xs() cross section method returns a DataFrame, we can apply mathematical and statistical functions as attributes. Listing 4-46 returns the sum of all tickets issued during daylight hours in each of the three areas.
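One way to express this calculation, sketched on a made-up stand-in, is to cross-section the Day columns and then sum; the chapter's Listing 4-46 performs the equivalent on its tickets DataFrame:

```python
import pandas as pd
import numpy as np

# Made-up stand-in with three Areas, matching the chapter's column shape.
rows = pd.MultiIndex.from_product([[2017, 2018], [1, 2, 3]],
                                  names=['Year', 'Month'])
cols = pd.MultiIndex.from_product([['City', 'Rural', 'Suburbs'],
                                   ['Day', 'Night']],
                                  names=['Area', 'When'])
tickets = pd.DataFrame(np.arange(36).reshape(6, 6), index=rows, columns=cols)

# Keep only the Day columns, then sum down the rows within each Area
day_totals = tickets.xs('Day', level='When', axis=1).sum()
print(day_totals)
```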
Listing 4-47 is the analog program for Listing 4-46.
4 proc sql;
5 select unique area
6 , sum(tickets) as Sum_by_Area
7 from tickets
8 where when = 'Day'
9 group by area;
10 quit;
Listing 4-47
Summed Tickets Where Day over Area
The WHERE clause selects those rows for when = 'Day', and the GROUP BY clause sums the tickets variable within the unique levels of the area variable. The results from the query are displayed in Figure 4-8. As we will see in the next section, grouping operations are essential for data analysis.
GroupBy
A common pattern for data analysis familiar to SAS users is BY-group processing. SAS defines BY-group processing as
a method of processing observations from one or more SAS datasets that are grouped or ordered by values of one or more common variables. The most common use of BY-group processing in the DATA step is to combine two or more SAS datasets by using the BY statement with a SET, MERGE, MODIFY, or UPDATE statement.1
pandas uses the term “Group By” to describe the task in terms of three steps:
Splitting values into groups based on some criteria
Applying a function to each of these groups
Combining the results into a data structure
Within the Apply step, we often wish to do one or more of the following actions:
Aggregate to compute summary statistics, such as
Compute group sums or means.
Compute group counts.
Transform to perform group-specific calculations, such as
Normalize data within the group.
Replace missing values with a value derived from each group.
Filter to discard some groups based on group-wise calculation, such as
Drop data whose values are sparse.
Filter out data based on the group statistic.
To accomplish these types of operations, the pandas library includes a GroupBy object.
When a GroupBy object is created, it contains instructions to map rows and columns to named groups. A crucial benefit of GroupBy is eliminating the need to handle each of the resulting splits, or sub-groups, explicitly. Instead, GroupBy applies operations to the entire DataFrame, often with a single pass of the data. The user does not need to manage group-processing details and instead benefits from a more abstracted processing method.
>>> gb = df.groupby(['District'])
>>> print(gb)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002592A6A3B38>
Listing 4-48
Create GroupBy gb
The gb object is defined as a GroupBy object using the District column as the by = parameter. In this case, the three unique values from the District column define the groups. Notice a DataFrameGroupBy object is returned rather than a DataFrame. The gb object is analogous to an SQL view: it contains instructions for executing an SQL statement to materialize rows and columns when the view is applied.
The SQL analogy is
CREATE VIEW GB as
SELECT distinct District
, mean(Before)
, sum(After)
FROM DF
GROUP BY District
Only when the GroupBy object is applied are results produced. Consider Listing 4-49.
All numeric columns in the underlying df DataFrame are grouped by the unique levels from the District column and then summed within each group. Of course, the sum() method is just one possibility here. Later in this chapter, we will illustrate examples for selecting individual columns and applying different aggregation methods as well as applying nearly any valid DataFrame operation.
Also observe how the District column in the output appears the way an index does when used to define row labels for an indexed DataFrame.
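The aggregation step can be sketched as follows; the df values are rebuilt from those printed in Listing 4-59 (the string columns Sector and Name are omitted for brevity):

```python
import pandas as pd

# df rebuilt from the values printed in Listing 4-59 (numeric columns only);
# sum() aggregates each column within the three District groups.
df = pd.DataFrame({
    'District': ['I']*4 + ['II']*4 + ['III']*5,
    'Before': [17, 13, 111, 51, 71, 113, 99, 15, 69, 11, 45, 35, 35],
    'After': [27, 22, 121, 55, 70, 122, 99, 65, 101, 22, 41, 69, 69],
    'Age': [22, 19, 29, 22, 17, 32, 24, 22, 21, 21, 19, 20, 20]})

d_grby_sum = df.groupby('District').sum()
print(d_grby_sum)
```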
Applying an aggregation method to a GroupBy object returns a DataFrame. Observe what happens when the d_grby_sum DataFrame is created from the GroupBy object in Listing 4-50.
The d_grby_sum DataFrame is indexed with values from the District column. GroupBy objects also have attributes allowing examination of their keys and groups. These object attributes are illustrated in Listing 4-51.
The syntax gb.groups.keys() returns the keys' values. The syntax gb.groups returns a Python Dictionary of key:value pairs, mapping each key to a group, along with a corresponding list of values indicating which rows compose a given group. In this example, rows 0, 1, 2, and 3 define the group for District = 'I'.
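These attributes can be sketched as follows, with df rebuilt from the values in Listing 4-59 (only the columns needed here):

```python
import pandas as pd

# df rebuilt from Listing 4-59, keeping just District and Age for brevity.
df = pd.DataFrame({
    'District': ['I']*4 + ['II']*4 + ['III']*5,
    'Age': [22, 19, 29, 22, 17, 32, 24, 22, 21, 21, 19, 20, 20]})

gb = df.groupby('District')
keys = list(gb.groups.keys())
print(keys)             # the three group keys
print(gb.groups['I'])   # row labels composing group District = 'I'
```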
Listing 4-52 illustrates similar logic calling PROC SUMMARY to create a “grouped” dataset summing the numeric variables by district.
4 data df;
5 infile cards dlm = ',';
6 length district $ 3
7 sector $ 5
8 name $ 8;
9 input district $
10 sector $
11 name $
12 before
13 after
14 age;
15 list;
16 datalines;
RULE: --+--1--+--2--+--3--+--4--+--
17 I, North, Patton, 17, 27, 22
18 I, South, Joyner, 13, 22, 19
19 I, East, Williams, 111, 121, 29
20 I, West, Jurat, 51, 55, 22
21 II, North, Aden, 71, 70, 17
22 II, South, Tanner, 113, 122, 32
23 II, East, Jenkins, 99, 99, 24
24 II, West, Milner, 15, 65, 22
25 III, North, Chang, 69, 101, 21
26 III, South, Gupta, 11, 22, 21
27 III, East, Haskins, 45, 41, 19
28 III, West, LeMay, 35, 69, 20
29 III, West, LeMay, 35, 69, 20
NOTE: The dataset WORK.DF has 13 observations and 6 variables.
30 ;;;;
31
32 proc summary data=df nway;
33 class district;
34 var before after age;
35 output out=gb_sum (drop = _TYPE_ _FREQ_)
36 sum=;
37 run;
NOTE: There were 13 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.GB_SUM has 3 observations and 4 variables.
Listing 4-52
Summary by District
Figure 4-9 displays the resulting “groups” created using PROC SUMMARY. The CLASS statement defines the unique levels for the district variable.
Iteration Over Groups
The GroupBy object supports iterating over the defined groups. As an example, consider Listing 4-53.
>>> gb = df.groupby(['District'])
>>> for name, group in gb:
... print('Group Name===> ',name)
... print(group)
... print('='*47)
...
Group Name===> I
District Sector Name Before After Age
0 I North Patton 17 27 22
1 I South Joyner 13 22 19
2 I East Williams 111 121 29
3 I West Jurat 51 55 22
===============================================
Group Name===> II
District Sector Name Before After Age
4 II North Aden 71 70 17
5 II South Tanner 113 122 32
6 II East Jenkins 99 99 24
7 II West Milner 15 65 22
===============================================
Group Name===> III
District Sector Name Before After Age
8 III North Chang 69 101 21
9 III South Gupta 11 22 21
10 III East Haskins 45 41 19
11 III West LeMay 35 69 20
12 III West LeMay 35 69 20
===============================================
Listing 4-53
Iterate Over Groups
In this example, a for loop iterates over the GroupBy object to produce a custom report. As we have seen previously, iterating manually over objects can be useful; however, the apply() method discussed later in this chapter may be a more productive alternative for applying methods and functions to grouped values in a DataFrame.
With SAS, the same report is easily produced using the Data Step by group processing as shown in Listing 4-54. While we could have called PROC PRINT for this example, the goal of the example is to illustrate how FIRST.district and LAST.district behave for By Group processing.
4 proc sort data = df presorted;
5 by district;
6
NOTE: Sort order of input dataset has been verified.
NOTE: There were 13 observations read from the dataset WORK.DF.
7 data _null_;
8 file print;
9 set df;
10 by district;
11
12 if first.district then
13 put 'Group Name====> ' district /
14 'District Sector Name Before After Age';
15 put @1 district @10 sector @20 name
16 @29 before @37 after @44 age;
17
18 if last.district then
19 put '=========================================';
20 run;
NOTE: 22 lines were written to file PRINT.
Listing 4-54
Iterative By Group Processing
In general, SAS By Group processing is established with either PROC SORT or an ORDER BY statement in PROC SQL. For Data Step processing when a BY statement is encountered, SAS creates the automatic variables FIRST.<by_variable> and LAST.<by_variable> to permit truth testing to control logic by identifying observations as first or last in the by group. The statement fragment
if first.district then
is a truth test with an implied Boolean evaluation of 0 for false and 1 for true. In our example, the preceding statement can also be written as
if first.district = 1 then
Similarly, pandas provisions the first() and last() methods for the GroupBy object as illustrated in Listing 4-55 returning the first and last row, respectively, for each group.
>>> df.groupby('District').first()
Sector Name Before After Age
District
I North Patton 17 27 22
II North Aden 71 70 17
III North Chang 69 101 21
>>> df.groupby('District').last()
Sector Name Before After Age
District
I West Jurat 51 55 22
II West Milner 15 65 22
III West LeMay 35 69 20
Listing 4-55
Return First and Last Rows from GroupBy
GroupBy Summary Statistics
As mentioned earlier a GroupBy feature is the ability to accept most methods applicable to a DataFrame by applying the methods to individual groups. Consider Listing 4-56.
>>> gb.describe()
I 4.00 23.00 4.24 19.00 21.25 22.00 23.75 29.00 ...
II 4.00 23.75 6.24 17.00 20.75 23.00 26.00 32.00 ...
III 5.00 20.20 0.84 19.00 20.00 20.00 21.00 21.00 ...
[3 rows x 24 columns]
Listing 4-56
Summary Statistics by Group
This example illustrates how methods not specifically implemented for the GroupBy object are passed through, allowing each group to call the method. Here the DataFrame's describe() method performs the aggregation, describing values for each group. Due to page width limitations, only a portion of the actual output is presented here.
We can apply different aggregation methods to different columns defined by the GroupBy object. In Listing 4-49, the sum() method is applied to all numeric columns. In contrast, Listing 4-57 illustrates different statistics applied to columns.
>>> gb = df.groupby(['District'])
>>> gb.agg({'Age' : 'mean',
... 'Before' : 'median',
... 'After' : ['sum', 'median', 'std']
... })
Age Before After
mean median sum median std
District
I 23.00 34 225 41.00 45.54
II 23.75 85 356 84.50 26.62
III 20.20 35 302 69.00 30.20
Listing 4-57
Different Statistics Over Group Columns
In this example, the agg() function is applied to the gb GroupBy object using a Python Dictionary to identify aggregation methods applied to designated columns. Recall a Python Dictionary is a data structure for holding key:value pairs. To accommodate multiple statistics for a given column, we pass a Python list of methods as the value for the Dictionary. For example, the After column has as its value a Python list of aggregation methods, sum(), median(), and std().
Listing 4-58 illustrates the same approach using SAS.
4 proc summary data=df nway;
5 class district;
6 output out=gb_sum (drop = _TYPE_ _FREQ_)
7 mean(age) = age_mean
8 median(before) = bfr_median
9 sum(after) = aft_sum
10 median(after) = aft_median
11 std(after) = aft_std;
12 run;
NOTE: There were 13 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.GB_SUM has 3 observations and 6 variables.
13 proc print data = gb_sum noobs;
14 run;
Listing 4-58
By Group Statistics over Different Variables
Figure 4-11 displays the output created by PROC SUMMARY.
The OUTPUT OUT= syntax applies summary statistics to the input variables and permits the naming of the resulting output variables.
Filtering by Group
A common coding pattern for data analysis is applying actions to a set of data based on a group’s statistic. As an example, consider Listing 4-59.
>>> print(df)
District Sector Name Before After Age
0 I North Patton 17 27 22
1 I South Joyner 13 22 19
2 I East Williams 111 121 29
3 I West Jurat 51 55 22
4 II North Aden 71 70 17
5 II South Tanner 113 122 32
6 II East Jenkins 99 99 24
7 II West Milner 15 65 22
8 III North Chang 69 101 21
9 III South Gupta 11 22 21
10 III East Haskins 45 41 19
11 III West LeMay 35 69 20
12 III West LeMay 35 69 20
>>> def std_1(x):
... return x['Age'].std() < 5
...
>>> df.groupby(['District']).filter(std_1)
District Sector Name Before After Age
0 I North Patton 17 27 22
1 I South Joyner 13 22 19
2 I East Williams 111 121 29
3 I West Jurat 51 55 22
8 III North Chang 69 101 21
9 III South Gupta 11 22 21
10 III East Haskins 45 41 19
11 III West LeMay 35 69 20
12 III West LeMay 35 69 20
Listing 4-59
Group By Filtering on a Statistic
This example keeps groups whose standard deviation for Age is less than five (5) and removes the rest. To do this, we define the std_1 function containing the filter criteria as
def std_1(x):
return x['Age'].std() < 5
The def keyword defines a Python function, followed by the function's name, in this case, std_1(). Inside the function definition, x is a local variable holding the group passed in when the function is called.
A new DataFrame is created by passing the std_1() function to the filter() method of the GroupBy object.
Notice how no rows are returned from the District column with a value of ‘II’.
Group by Column with Continuous Values
Sometimes we want to group by a column holding continuous values, for example, Age. To create a meaningful GroupBy object, the first step is mapping the continuous values into “buckets” and using these binned values in the GroupBy operation. Aggregations can then be performed based on the groups determined by the bin ranges formed from the Age column, with a defined function applied via the apply() method.
Here we illustrate this pattern using the pandas cut() method for segmenting and sorting data values into bins. Consider Listing 4-60.
>>> def stats(group):
... return {'count' : group.count(),
... 'min' : group.min(),
... 'max' : group.max(),
... 'mean' : group.mean()}
...
>>> bins = [0, 25, 50, 75, 200]
>>> gp_labels = ['0 to 25', '26 to 50', '51 to 75', 'Over 75']
In the example, we begin by defining the stats() function with the def statement. It simply returns a Python Dictionary of summary statistics, as a convenience for passing this Dictionary to the apply() method when creating the GroupBy object.
The syntax
bins = [0, 25, 50, 75, 200]
gp_labels = ['0 to 25', '26 to 50', '51 to 75', 'Over 75']
assigns the “cut-points” to the bins object as a Python list of values representing the upper and lower bounds for the bins created with the cut() method. The gp_labels object is another Python list of values holding the labels assigned to these bins. These objects are passed to the cut() method with the syntax
defining the Age_Fmt column in the df DataFrame. This assignment creates the column values by calling the cut() method on the df['Age'] column (with bins and labels defined). Note that pd.cut() uses the prefix pd to refer to the pandas library, which is loaded into the namespace with the statement import pandas as pd.
The GroupBy object is then created using unique values from the Age_Fmt column as the group levels, grouping the df['Age'] column. The apply() method calls the defined stats() function, computing the statistics on the values within each group. The unstack() method reshapes the returned object from a stacked form (in this case, a Series object) to an unstacked form (a “wide” DataFrame).
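The whole pattern can be sketched end to end. The Age values come from Listing 4-59; the Age_Fmt assignment and groupby call are reconstructed from the description above, with two cautious deviations: stats() wraps its Dictionary in a pd.Series for robustness across pandas versions, and observed=True is passed so only populated bins appear.

```python
import pandas as pd

# Age values taken from Listing 4-59; other columns omitted for brevity.
df = pd.DataFrame({'Age': [22, 19, 29, 22, 17, 32, 24, 22, 21, 21,
                           19, 20, 20]})

def stats(group):
    # Return the summary statistics for one group as a labeled Series
    return pd.Series({'count': group.count(), 'min': group.min(),
                      'max': group.max(), 'mean': group.mean()})

bins = [0, 25, 50, 75, 200]
gp_labels = ['0 to 25', '26 to 50', '51 to 75', 'Over 75']

# Bin the continuous Age values, then aggregate within each bin
df['Age_Fmt'] = pd.cut(df['Age'], bins, labels=gp_labels)
out = df['Age'].groupby(df['Age_Fmt'], observed=True).apply(stats).unstack()
print(out)
```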
The same logic in SAS is shown in Listing 4-61. In this example, the aggregation functions for the age variable statistics are produced with PROC SQL.
4 proc format cntlout = groups;
5 value age_fmt
6 0 - 25 = '0-25'
7 26 - 50 = '26-50'
8 51 - 75 = '51-75'
9 76 - high = 'Over 75';
NOTE: Format AGE_FMT has been output.
NOTE: The dataset WORK.GROUPS has 4 observations and 21 variables.
10 proc sql;
11 select fmt.label label = 'Group'
12 , count(dat.age) label = 'Count'
13 , min(dat.age) label = 'Min'
14 , max(dat.age) label = 'Max'
15 , mean(dat.age) label = 'Mean'
16 from
17 groups as fmt
18 left join df as dat
19 on fmt.label = put(dat.age, age_fmt.)
20 group by fmt.label;
21 quit;
Listing 4-61
By Group with Continuous Variable
PROC FORMAT provides binning logic similar to the cut() method in the Python example in Listing 4-60. The CNTLOUT = groups option outputs a dataset containing several variables, including the label variable holding the value labels for the user-defined age_fmt. format. The aggregation functions are applied to the age variable using PROC SQL. PROC SQL uses a left join to combine rows on the label column from the groups table (created with CNTLOUT =) with rows from the aggregation functions applied to the age column from the df dataset. The output from PROC SQL is displayed in Figure 4-12.
Transform Based on Group Statistic
Up to this point, the GroupBy objects return DataFrames with fewer rows than the original DataFrame. This is to be expected since GroupBy objects are commonly used in aggregation operations. There are cases where you wish to apply a transformation based on group statistics and merge the transformed version with the original DataFrame. Calculating a z-score is an example illustrated in Listing 4-62.
>>> z = df.groupby('District').transform(lambda x: (x - x.mean()) / x.std())
The logic to compute the z-score is accomplished by creating the z DataFrame with a GroupBy object using the syntax
z = df.groupby('District').transform(lambda x: (x - x.mean()) / x.std())
In this example, a lambda expression is used to create an anonymous or in-line function defining the z-score calculation. Like the def expression, this expression creates a function, but does not provide it a name. Hence, it is known as an anonymous function.
The transform() function computes the z-score for rows within each group using the group’s computed mean and standard deviation. The transform() function returns a DataFrame the same shape as the input DataFrame making it useful for combining the two together.
Because pandas allows the same name for multiple columns, the rename() method is applied to the z DataFrame, passing a Python Dictionary of key:value pairs where the key is the old column name and the value is the new column name. The syntax
df1 = pd.concat([df, z], axis = 1)
creates the df1 DataFrame by concatenating the df and z DataFrames along the columns with the axis = 1 argument. We cover the details for pandas concatenation and joins in Chapter 5, “Data Management.”
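The transform, rename, and concatenation steps can be sketched as follows; df is rebuilt from Listing 4-59 (string columns omitted), and the Z_ column names are an assumption for illustration:

```python
import pandas as pd

# df rebuilt from Listing 4-59 (Sector and Name omitted for brevity).
df = pd.DataFrame({
    'District': ['I']*4 + ['II']*4 + ['III']*5,
    'Before': [17, 13, 111, 51, 71, 113, 99, 15, 69, 11, 45, 35, 35],
    'After': [27, 22, 121, 55, 70, 122, 99, 65, 101, 22, 41, 69, 69],
    'Age': [22, 19, 29, 22, 17, 32, 24, 22, 21, 21, 19, 20, 20]})

# z-scores computed within each District group via an anonymous function
z = df.groupby('District').transform(lambda x: (x - x.mean()) / x.std())

# Rename so the z-score columns do not collide with the originals
z = z.rename(columns={'Before': 'Z_Before', 'After': 'Z_After',
                      'Age': 'Z_Age'})

# Concatenate original and transformed columns side by side
df1 = pd.concat([df, z], axis=1)
print(df1.head())
```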
Listing 4-63 illustrates the same logic in SAS. PROC SUMMARY is called to create the intermediate variables used for calculating the z-scores. PROC SORT is called to sort the df dataset and the z_out dataset produced by PROC SUMMARY using the variable district as the sort key.
4 proc summary nway data = df;
5 class district;
6 var pre post age;
7 output out=z_out (drop = _TYPE_ _FREQ_)
8 mean(age) = age_mean
9 mean(pre) = pre_mean
10 mean(post) = post_mean
11 std(age) = age_std
12 std(pre) = pre_std
13 std(post) = post_std;
NOTE: There were 13 observations read from the dataset WORK.DF.
NOTE: The dataset WORK.Z_OUT has 3 observations and 7 variables.
14 proc sort data = df presorted;
15 by district;
16
NOTE: Sort order of input dataset has been verified.
NOTE: There were 13 observations read from the dataset WORK.DF.
17 proc sort data = z_out presorted;
18 by district;
19
NOTE: Sort order of input dataset has been verified.
NOTE: There were 3 observations read from the dataset WORK.Z_OUT.
20 data z_df (drop = age_mean pre_mean post_mean
21 age_std pre_std post_std);
22 merge df
23 z_out;
24 by district;
25
26 z_pre = (pre - pre_mean) / pre_std;
27 z_post = (post - post_mean) / post_std;
28 z_age = (age - age_mean) / age_std;
29 format z_pre z_post z_age 8.2;
30
NOTE: There were 13 observations read from the dataset WORK.DF.
NOTE: There were 3 observations read from the dataset WORK.Z_OUT.
NOTE: The dataset WORK.Z_DF has 13 observations and 9 variables.
31 proc print data=z_df(obs=6) noobs;
32 var name pre z_pre post z_post age z_age;
33 run;
NOTE: There were 6 observations read from the dataset WORK.Z_DF.
Listing 4-63
Transform Based on By Group Statistic
The final step uses a Data Step to merge the df and z_out datasets on the district sort key and performs the z-score calculations. The intermediate variables from the z_out dataset are dropped with a DROP list. Figure 4-13 displays the output produced by PROC PRINT.
Pivot
pandas provides the pivot_table() function to create spreadsheet-style pivot tables. The pivot_table() function enables aggregation of data values across row and column dimensions. As we will see shortly, this function not only provides a multi-dimensional view of your data but also turns out to be a convenient way to apply a MultiIndex to a DataFrame.
Begin by using the read_csv() method to read detailed sales transaction data collected between 2016 and 2017 in Listing 4-64. The input data consists of transaction details in what is referred to as stacked, or long, format: one row per transaction.
Notice the read_csv() method uses the argument na_filter = False. Without this argument, the Territory column does not include rows with the value “NA”, since pandas interprets the string “NA” as a missing value by default. In our case, “NA” denotes North America and not a missing value. Later, in Chapter 6, “pandas Readers and Writers,” we explore numerous arguments to the read_csv() function in detail.
In order to appreciate the flexibility afforded by the pivot_table() function, the script includes output from info() method indicating the DataFrame has 2823 rows and 10 columns.
In this example, the pivot_table() function uses three arguments:
index: Containing a Python list of columns forming row labels, with Year as the outer level and ProductLine as the inner level.
columns: Containing a Python list of columns acting as keys to GroupBy on the pivot table columns. Unique values from the columns argument make up the columns in the pivot tables. In this example, the Territory column has values “APAC”, “EMEA”, and “NA” (for North America) with each value as the pivot table’s columns.
values: The column or Python list of columns to aggregate, in this example, the Amount column. The default aggregation method is np.mean.
Notice how row labels are formed using the Year column values as the outer level and ProductLine as the inner level. In other words, the index argument to the pivot_table() function creates an Index if one column is specified, or a MultiIndex object if more than one column is specified. The same is true for the columns = argument.
Let’s improve the pivot table created in Listing 4-64. Notice in that report NaN's are returned, indicating missing values. Further, rather than applying the default np.mean aggregation method to every column, we want to sum the values of the Quantity column. Finally, we can add row and column totals to get sub-totals and a grand total. These features are illustrated in Listing 4-65.
The argument fill_value = 0 replaces the NaN’s in the original output with zeros. The aggfunc() function passes a Python Dictionary to associate column names (key) with a corresponding aggregation method (value). In this example, the Amount column is aggregated using np.mean and the Quantity column is aggregated using np.sum. Figure 4-14 shows improvements to the pivot table.
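A sketch of these options on a hypothetical stand-in for the sales data (the column names follow the text, but the values are invented; string aggregation names are used in place of the chapter's np.mean and np.sum):

```python
import pandas as pd

# Invented stand-in for the chapter's sales data, long format.
df2 = pd.DataFrame({
    'Year': [2016, 2016, 2016, 2017, 2017, 2017],
    'ProductLine': ['Cars', 'Cars', 'Planes', 'Cars', 'Planes', 'Planes'],
    'Territory': ['APAC', 'EMEA', 'NA', 'EMEA', 'APAC', 'NA'],
    'Amount': [100.0, 200.0, 300.0, 150.0, 250.0, 350.0],
    'Quantity': [1, 2, 3, 4, 5, 6]})

# fill_value replaces NaN's with zeros; the aggfunc Dictionary maps each
# column to its own aggregation method
pt = pd.pivot_table(df2,
                    index=['Year', 'ProductLine'],
                    columns=['Territory'],
                    fill_value=0,
                    aggfunc={'Amount': 'mean', 'Quantity': 'sum'})
print(pt)
```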
The pivot_table() function syntax is easy to understand and provides a straightforward solution for a variety of analysis problems. Consider Listing 4-66.
pd.pivot_table(df2,values = ['Amount'],
columns = ['Territory'],
index = ['Year', 'Status'],
fill_value = 0,
aggfunc = (np.sum))
Listing 4-66
Sales by Year Over Territory
Figure 4-15 shows that the EMEA territory has an unusually high amount of cancellations compared to the rest of the organization.
To produce the same report with SAS requires multiple steps after the .csv file is read with PROC IMPORT. The task is to summarize the amount variable and transpose the territory variable’s unique values into variables. The steps are
1. Sort the sales_detail dataset created with PROC IMPORT by the territory variable.
2. Summarize the sales_detail dataset by territory for the amount variable with PROC SUMMARY. Output the summary as the sales_sum dataset.
3. Sort the sales_sum dataset by the variables year and status.
4. Transpose the sales_sum dataset on the territory variable (ID) by year and status with PROC TRANSPOSE, creating a transposed dataset called sales_trans.
5. Print the sales_trans dataset using the SAS-supplied dollar13.2 format.
NOTE: The dataset WORK.SALES_DETAIL has 2823 observations and 10 variables.
14 proc sort data=sales_detail;
15 by territory;
16 run;
NOTE: There were 2823 observations read from the dataset WORK.SALES_DETAIL.
17 proc summary data=sales_detail nway;
18 by territory;
19 class year status;
20 var amount;
21 output out=sales_sum (drop = _TYPE_ _FREQ_)
22 sum(amount) = amount_sum;
23 run;
NOTE: There were 2823 observations read from the dataset WORK.SALES_DETAIL.
NOTE: The dataset WORK.SALES_SUM has 14 observations and 4 variables.
24 proc sort data=sales_sum;
25 by year status;
26 run;
NOTE: There were 14 observations read from the dataset WORK.SALES_SUM.
NOTE: The dataset WORK.SALES_SUM has 14 observations and 4 variables.
27 proc transpose data = sales_sum
28 out = sales_trans(drop=_name_);
29 id territory;
30 by year status;
31 run;
NOTE: There were 14 observations read from the dataset WORK.SALES_SUM.
NOTE: The dataset WORK.SALES_TRANS has 6 observations and 5 variables.
32 proc print data=sales_trans;
33 var apac emea na ;
34 id status year;
35 format apac emea na dollar13.2;
36 run;
Listing 4-67
SAS Year Status Over Territory
The output from PROC PRINT is displayed in Figure 4-16.
The key to creating this report is the call to PROC TRANSPOSE. The territory values in the detail dataset, sales_detail, are row-oriented. The ID statement maps the unique values for the territory variable into variables on the sales_trans output dataset. And because the summarizations are by the variables status and year, the call to PROC TRANSPOSE is also BY status and year.
Summary
In this chapter we discussed the role of indexing and hierarchical indexing as a means for providing labels for DataFrame rows and columns. We introduced the three indexers along with slicers to return subsets from a DataFrame:
1. The [ ] operator
2. The loc() indexer for slicing along rows and columns using labels
3. The iloc() indexer for slicing along rows and columns based on a value's position along an index
We examined how to apply a range of methods to the subset DataFrames to perform common data manipulation methods for analysis.
We provided a detailed discussion on the GroupBy object for split-apply-combine operations. We also provided a general introduction to pivot tables. Together these examples lay the foundation for Chapter 5, “Data Management,” where we examine joining DataFrames through concatenation and merging methods.