One of the most fundamental tasks during data analysis involves splitting data into independent groups before performing a calculation on each group. This methodology has been around for quite some time but has more recently been referred to as split-apply-combine. This chapter covers the powerful .groupby method, which allows you to group your data in any way imaginable and apply any type of function independently to each group before returning a single dataset.
Before we get started with the recipes, we will need to know just a little terminology. All basic groupby operations have grouping columns, and each unique combination of values in these columns represents an independent grouping of the data. The syntax looks as follows:
df.groupby(['list', 'of', 'grouping', 'columns'])
df.groupby('single_column') # when grouping by a single column
The result of calling the .groupby method is a groupby object. It is this groupby object that will be the engine that drives all the calculations for this entire chapter. pandas does very little when creating this groupby object, merely validating that grouping is possible. You will have to chain methods on this groupby object to unleash its powers.
The most common use of the .groupby method is to perform an aggregation. What is an aggregation? An aggregation takes place when a sequence of many inputs gets summarized or combined into a single output value. For example, summing up all the values of a column or finding its maximum are aggregations applied to a sequence of data. An aggregation takes a sequence and reduces it to a single value.
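As a minimal illustration of this reduction, consider a tiny, made-up Series:
>>> s = pd.Series([1, 3, 5])
>>> s.sum()  # many values in, one value out
9
>>> s.max()
5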
In addition to the grouping columns defined during the introduction, most aggregations have two other components, the aggregating columns and the aggregating functions. The aggregating columns are the columns whose values will be aggregated. The aggregating functions define what aggregations take place. Aggregation functions include sum, min, max, mean, count, var (variance), std, and so on.
In this recipe, we examine the flights dataset and perform the simplest aggregation involving only a single grouping column, a single aggregating column, and a single aggregating function. We will find the average arrival delay for each airline. pandas has different syntaxes to create an aggregation, and this recipe will show them.
>>> import pandas as pd
>>> import numpy as np
>>> flights = pd.read_csv('data/flights.csv')
>>> flights.head()
   MONTH  DAY  WEEKDAY  ...  ARR_DELAY  DIVERTED  CANCELLED
0      1    1        4  ...       65.0         0          0
1      1    1        4  ...      -13.0         0          0
2      1    1        4  ...       35.0         0          0
3      1    1        4  ...       -7.0         0          0
4      1    1        4  ...       39.0         0          0
First, identify the grouping column (AIRLINE), the aggregating column (ARR_DELAY), and the aggregating function (mean). Place the grouping column in the .groupby method and then call the .agg method with a dictionary pairing the aggregating column with its aggregating function. If you pass in a dictionary, it returns a DataFrame instance:
>>> (flights
... .groupby('AIRLINE')
... .agg({'ARR_DELAY':'mean'})
... )
ARR_DELAY
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
... ...
OO 7.593463
UA 7.765755
US 1.681105
VX 5.348884
WN 6.397353
Alternatively, you may place the aggregating column in the index operator and then pass the aggregating function as a string to .agg. This will return a Series:
>>> (flights
... .groupby('AIRLINE')
... ['ARR_DELAY']
... .agg('mean')
... )
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
...
OO 7.593463
UA 7.765755
US 1.681105
VX 5.348884
WN 6.397353
Name: ARR_DELAY, Length: 14, dtype: float64
You may also pass any aggregating function directly to the .agg method, such as the NumPy mean function. The output is the same as the previous step:
>>> (flights
... .groupby('AIRLINE')
... ['ARR_DELAY']
... .agg(np.mean)
... )
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
...
OO 7.593463
UA 7.765755
US 1.681105
VX 5.348884
WN 6.397353
Name: ARR_DELAY, Length: 14, dtype: float64
It is possible to skip the .agg method altogether in this case and call the .mean method directly. This output is also the same as step 3:
>>> (flights
... .groupby('AIRLINE')
... ['ARR_DELAY']
... .mean()
... )
AIRLINE
AA 5.542661
AS -0.833333
B6 8.692593
DL 0.339691
EV 7.034580
...
OO 7.593463
UA 7.765755
US 1.681105
VX 5.348884
WN 6.397353
Name: ARR_DELAY, Length: 14, dtype: float64
The syntax for the .groupby method is not as straightforward as that of other methods. Let's intercept the chain of methods in step 2 by storing the result of the .groupby method as its own variable:
>>> grouped = flights.groupby('AIRLINE')
>>> type(grouped)
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
A completely new intermediate object is first produced with its own distinct attributes and methods. No calculations take place at this stage. pandas merely validates the grouping columns. This groupby object has an .agg method to perform aggregations. One of the ways to use this method is to pass it a dictionary mapping the aggregating column to the aggregating function, as done in step 2. If you pass in a dictionary, the result will be a DataFrame.
The pandas library often has more than one way to perform the same operation. Step 3 shows another way to perform a groupby. Instead of identifying the aggregating column in the dictionary, place it inside the index operator as if you were selecting it as a column from a DataFrame. The function string name is then passed as a scalar to the .agg method. The result, in this case, is a Series.
You may pass any aggregating function to the .agg method. pandas allows you to use string names for simplicity, but you may also explicitly pass an aggregating function, as done in step 4. NumPy provides many functions that aggregate values.
Step 5 shows one last syntax flavor. When you are only applying a single aggregating function, as in this example, you can often call it directly as a method on the groupby object itself without .agg. Not all aggregation functions have a method equivalent, but most do.
If you do not use an aggregating (reducing) function with .agg, pandas raises an exception. For instance, let's see what happens when we apply the square root function to each group:
>>> (flights
... .groupby('AIRLINE')
... ['ARR_DELAY']
... .agg(np.sqrt)
... )
Traceback (most recent call last):
...
ValueError: function does not reduce
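If you want an element-wise result for each group rather than a reduction, the .transform method (covered later in this chapter) is the right tool. A sketch; note that the square root of a negative delay is NaN:
>>> (flights
...     .groupby('AIRLINE')
...     ['ARR_DELAY']
...     .transform(np.sqrt)  # element-wise; same length as the input
... )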
It is possible to group and aggregate with multiple columns. The syntax is slightly different than it is for grouping and aggregating with a single column. As usual with any kind of grouping operation, it helps to identify the three components: the grouping columns, aggregating columns, and aggregating functions.
In this recipe, we showcase the flexibility of the .groupby method. First, identify the grouping columns (AIRLINE, WEEKDAY), the aggregating column (CANCELLED), and the aggregating function (sum):
>>> (flights
... .groupby(['AIRLINE', 'WEEKDAY'])
... ['CANCELLED']
... .agg('sum')
... )
AIRLINE WEEKDAY
AA 1 41
2 9
3 16
4 20
5 18
..
WN 3 18
4 10
5 7
6 10
7 7
Name: CANCELLED, Length: 98, dtype: int64
Next, to aggregate multiple columns with multiple functions, place a list of aggregating columns inside the index operator and pass a list of aggregating functions to .agg:
>>> (flights
... .groupby(['AIRLINE', 'WEEKDAY'])
... [['CANCELLED', 'DIVERTED']]
... .agg(['sum', 'mean'])
... )
CANCELLED DIVERTED
sum mean sum mean
AIRLINE WEEKDAY
AA 1 41 0.032106 6 0.004699
2 9 0.007341 2 0.001631
3 16 0.011949 2 0.001494
4 20 0.015004 5 0.003751
5 18 0.014151 1 0.000786
... ... ... ... ...
WN 3 18 0.014118 2 0.001569
4 10 0.007911 4 0.003165
5 7 0.005828 0 0.000000
6 10 0.010132 3 0.003040
7 7 0.006066 3 0.002600
Use a dictionary in the .agg method to map specific aggregating columns to specific aggregating functions:
>>> (flights
... .groupby(['ORG_AIR', 'DEST_AIR'])
... .agg({'CANCELLED':['sum', 'mean', 'size'],
... 'AIR_TIME':['mean', 'var']})
... )
CANCELLED ... AIR_TIME
sum mean ... mean var
ORG_AIR DEST_AIR ...
ATL ABE 0 0.000000 ... 96.387097 45.778495
ABQ 0 0.000000 ... 170.500000 87.866667
ABY 0 0.000000 ... 28.578947 6.590643
ACY 0 0.000000 ... 91.333333 11.466667
AEX 0 0.000000 ... 78.725000 47.332692
... ... ... ... ... ...
SFO SNA 4 0.032787 ... 64.059322 11.338331
STL 0 0.000000 ... 198.900000 101.042105
SUN 0 0.000000 ... 78.000000 25.777778
TUS 0 0.000000 ... 100.200000 35.221053
XNA 0 0.000000 ... 173.500000 0.500000
Finally, use the named aggregation syntax to produce flat column names:
>>> (flights
... .groupby(['ORG_AIR', 'DEST_AIR'])
... .agg(sum_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='sum'),
... mean_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='mean'),
... size_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='size'),
... mean_air_time=pd.NamedAgg(column='AIR_TIME', aggfunc='mean'),
... var_air_time=pd.NamedAgg(column='AIR_TIME', aggfunc='var'))
... )
sum_cancelled mean_cancelled ... mean_air_time
ORG_AIR DEST_AIR ...
ATL ABE 0 0.000000 ... 96.387097
ABQ 0 0.000000 ... 170.500000
ABY 0 0.000000 ... 28.578947
ACY 0 0.000000 ... 91.333333
AEX 0 0.000000 ... 78.725000
... ... ... ... ...
SFO SNA 4 0.032787 ... 64.059322
STL 0 0.000000 ... 198.900000
SUN 0 0.000000 ... 78.000000
TUS 0 0.000000 ... 100.200000
XNA 0 0.000000 ... 173.500000
To group by multiple columns as in step 1, we pass a list of the string names to the .groupby method. Each unique combination of AIRLINE and WEEKDAY forms its own group. Within each of these groups, the sum of the canceled flights is calculated and then returned as a Series.
Step 2 groups by both AIRLINE and WEEKDAY, but this time aggregates two columns. It applies each of the two aggregation functions, using the strings sum and mean, to each column, resulting in four returned columns per group.
Step 3 goes even further, and uses a dictionary to map specific aggregating columns to different aggregating functions. Notice that the size aggregating function returns the total number of rows per group. This is different than the count aggregating function, which returns the number of non-missing values per group.
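A quick illustration of that distinction, using a hypothetical Series with one missing value:
>>> s = pd.Series([1.0, np.nan, 3.0])
>>> s.size     # total number of values, including missing ones
3
>>> s.count()  # number of non-missing values
2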
Step 4 shows the new named aggregation syntax, which creates flat columns.
To flatten the columns in step 3, you can use the .to_flat_index method (available since pandas 0.24):
>>> res = (flights
... .groupby(['ORG_AIR', 'DEST_AIR'])
... .agg({'CANCELLED':['sum', 'mean', 'size'],
... 'AIR_TIME':['mean', 'var']})
... )
>>> res.columns = ['_'.join(x) for x in
... res.columns.to_flat_index()]
>>> res
CANCELLED_sum CANCELLED_mean ... AIR_TIME_mean
ORG_AIR DEST_AIR ...
ATL ABE 0 0.000000 ... 96.387097
ABQ 0 0.000000 ... 170.500000
ABY 0 0.000000 ... 28.578947
ACY 0 0.000000 ... 91.333333
AEX 0 0.000000 ... 78.725000
... ... ... ... ...
SFO SNA 4 0.032787 ... 64.059322
STL 0 0.000000 ... 198.900000
SUN 0 0.000000 ... 78.000000
TUS 0 0.000000 ... 100.200000
XNA 0 0.000000 ... 173.500000
That is kind of ugly, and I would prefer a chained operation to flatten the columns. Unfortunately, the .reindex method does not support flattening. Instead, we will have to leverage the .pipe method:
>>> def flatten_cols(df):
... df.columns = ['_'.join(x) for x in
... df.columns.to_flat_index()]
... return df
>>> res = (flights
... .groupby(['ORG_AIR', 'DEST_AIR'])
... .agg({'CANCELLED':['sum', 'mean', 'size'],
... 'AIR_TIME':['mean', 'var']})
... .pipe(flatten_cols)
... )
>>> res
CANCELLED_sum CANCELLED_mean ... AIR_TIME_mean
ORG_AIR DEST_AIR ...
ATL ABE 0 0.000000 ... 96.387097
ABQ 0 0.000000 ... 170.500000
ABY 0 0.000000 ... 28.578947
ACY 0 0.000000 ... 91.333333
AEX 0 0.000000 ... 78.725000
... ... ... ... ...
SFO SNA 4 0.032787 ... 64.059322
STL 0 0.000000 ... 198.900000
SUN 0 0.000000 ... 78.000000
TUS 0 0.000000 ... 100.200000
XNA 0 0.000000 ... 173.500000
Be aware that when grouping with multiple columns, pandas creates a hierarchical index, or MultiIndex. In the preceding example, it returned 1,130 rows. However, if one of the columns that we group by is categorical (with a category type, not an object type), then pandas will create a Cartesian product of all combinations of each level. In this case, it returns 2,710 rows. With categorical columns of higher cardinality, you can get many more values:
>>> res = (flights
... .assign(ORG_AIR=flights.ORG_AIR.astype('category'))
... .groupby(['ORG_AIR', 'DEST_AIR'])
... .agg({'CANCELLED':['sum', 'mean', 'size'],
... 'AIR_TIME':['mean', 'var']})
... )
>>> res
CANCELLED ... AIR_TIME
sum mean ... mean var
ORG_AIR DEST_AIR ...
ATL ABE 0.0 0.0 ... 96.387097 45.778495
ABI NaN NaN ... NaN NaN
ABQ 0.0 0.0 ... 170.500000 87.866667
ABR NaN NaN ... NaN NaN
ABY 0.0 0.0 ... 28.578947 6.590643
... ... ... ... ... ...
SFO TYS NaN NaN ... NaN NaN
VLD NaN NaN ... NaN NaN
VPS NaN NaN ... NaN NaN
XNA 0.0 0.0 ... 173.500000 0.500000
YUM NaN NaN ... NaN NaN
To remedy the combinatoric explosion, use the observed=True parameter. This makes categorical groupby operations work like grouping with string types, showing only the observed combinations rather than the Cartesian product:
>>> res = (flights
... .assign(ORG_AIR=flights.ORG_AIR.astype('category'))
... .groupby(['ORG_AIR', 'DEST_AIR'], observed=True)
... .agg({'CANCELLED':['sum', 'mean', 'size'],
... 'AIR_TIME':['mean', 'var']})
... )
>>> res
CANCELLED ... AIR_TIME
sum mean ... mean var
ORG_AIR DEST_AIR ...
LAX ABQ 1 0.018182 ... 89.259259 29.403215
ANC 0 0.000000 ... 307.428571 78.952381
ASE 1 0.038462 ... 102.920000 102.243333
ATL 0 0.000000 ... 224.201149 127.155837
AUS 0 0.000000 ... 150.537500 57.897310
... ... ... ... ... ...
MSP TTN 1 0.125000 ... 124.428571 57.952381
TUL 0 0.000000 ... 91.611111 63.075163
TUS 0 0.000000 ... 176.000000 32.000000
TVC 0 0.000000 ... 56.600000 10.300000
XNA 0 0.000000 ... 90.642857 115.939560
When grouping by multiple columns or aggregating with multiple functions, you will create a MultiIndex, which can appear in both the index and the columns. DataFrames with MultiIndexes are more difficult to navigate and occasionally have confusing column names as well.
In this recipe, we perform an aggregation with the .groupby method to create a DataFrame with a MultiIndex for the rows and columns. Then, we manipulate the index so that it has a single level and the column names are descriptive.
>>> flights = pd.read_csv('data/flights.csv')
>>> airline_info = (flights
... .groupby(['AIRLINE', 'WEEKDAY'])
... .agg({'DIST':['sum', 'mean'],
... 'ARR_DELAY':['min', 'max']})
... .astype(int)
... )
>>> airline_info
DIST ARR_DELAY
sum mean min max
AIRLINE WEEKDAY
AA 1 1455386 1139 -60 551
2 1358256 1107 -52 725
3 1496665 1117 -45 473
4 1452394 1089 -46 349
5 1427749 1122 -41 732
... ... ... ... ...
WN 3 997213 782 -38 262
4 1024854 810 -52 284
5 981036 816 -44 244
6 823946 834 -41 290
7 945679 819 -45 261
One way to flatten the column MultiIndex is with .to_flat_index. Let's display the output of each level and then concatenate both levels before setting them as the new column values:
>>> airline_info.columns.get_level_values(0)
Index(['DIST', 'DIST', 'ARR_DELAY', 'ARR_DELAY'], dtype='object')
>>> airline_info.columns.get_level_values(1)
Index(['sum', 'mean', 'min', 'max'], dtype='object')
>>> airline_info.columns.to_flat_index()
Index([('DIST', 'sum'), ('DIST', 'mean'), ('ARR_DELAY', 'min'),
('ARR_DELAY', 'max')],
dtype='object')
>>> airline_info.columns = ['_'.join(x) for x in
... airline_info.columns.to_flat_index()]
>>> airline_info
DIST_sum DIST_mean ARR_DELAY_min ARR_DELAY_max
AIRLINE WEEKDAY
AA 1 1455386 1139 -60 551
2 1358256 1107 -52 725
3 1496665 1117 -45 473
4 1452394 1089 -46 349
5 1427749 1122 -41 732
... ... ... ... ...
WN 3 997213 782 -38 262
4 1024854 810 -52 284
5 981036 816 -44 244
6 823946 834 -41 290
7 945679 819 -45 261
To push both row index levels into columns, use the .reset_index method:
>>> airline_info.reset_index()
AIRLINE WEEKDAY ... ARR_DELAY_min ARR_DELAY_max
0 AA 1 ... -60 551
1 AA 2 ... -52 725
2 AA 3 ... -45 473
3 AA 4 ... -46 349
4 AA 5 ... -41 732
.. ... ... ... ... ...
93 WN 3 ... -38 262
94 WN 4 ... -52 284
95 WN 5 ... -44 244
96 WN 6 ... -41 290
97 WN 7 ... -45 261
Alternatively, use named aggregation to produce flat columns in a single chain:
>>> (flights
... .groupby(['AIRLINE', 'WEEKDAY'])
... .agg(dist_sum=pd.NamedAgg(column='DIST', aggfunc='sum'),
... dist_mean=pd.NamedAgg(column='DIST', aggfunc='mean'),
... arr_delay_min=pd.NamedAgg(column='ARR_DELAY', aggfunc='min'),
... arr_delay_max=pd.NamedAgg(column='ARR_DELAY', aggfunc='max'))
... .astype(int)
... .reset_index()
... )
AIRLINE WEEKDAY ... ARR_DELAY_min ARR_DELAY_max
0 AA 1 ... -60 551
1 AA 2 ... -52 725
2 AA 3 ... -45 473
3 AA 4 ... -46 349
4 AA 5 ... -41 732
.. ... ... ... ... ...
93 WN 3 ... -38 262
94 WN 4 ... -52 284
95 WN 5 ... -44 244
96 WN 6 ... -41 290
97 WN 7 ... -45 261
When using the .agg method to perform an aggregation on multiple columns, pandas creates an index object with two levels. The aggregating columns become the top level, and the aggregating functions become the bottom level. pandas displays MultiIndex levels differently from single-level columns. Except for the innermost level, repeated index values do not get displayed in Jupyter or a Python shell. You can inspect the DataFrame from step 1 to verify this. For instance, the DIST column shows up only once, but it refers to both of the first two columns.
Step 2 defines new columns by first retrieving the underlying values of each of the levels with the MultiIndex method .get_level_values. This method accepts an integer identifying the index level. Levels are numbered beginning with zero from the outside (top/left). We use the recently added index method .to_flat_index in combination with a list comprehension to create strings for each column. We assign these new values to the columns attribute.
In step 3, we make use of the .reset_index method to push both index levels into columns. This is easy, and I wish there were a similar method for column name compaction.
In step 4, we use the NamedAgg class (new in pandas 0.25) to create flat aggregate columns.
By default, at the end of a groupby operation, pandas puts all of the grouping columns in the index. The as_index parameter in the .groupby method can be set to False to avoid this behavior. Alternatively, you can chain the .reset_index method after grouping to get the same effect, as seen in step 3. Let's see an example of this by finding the average distance traveled per flight for each airline:
>>> (flights
... .groupby(['AIRLINE'], as_index=False)
... ['DIST']
... .agg('mean')
... .round(0)
... )
AIRLINE DIST
0 AA 1114.0
1 AS 1066.0
2 B6 1772.0
3 DL 866.0
4 EV 460.0
.. ... ...
9 OO 511.0
10 UA 1231.0
11 US 1181.0
12 VX 1240.0
13 WN 810.0
Take a look at the order of the airlines in the previous result. By default, pandas sorts the grouping columns. The sort parameter of the .groupby method defaults to True. You may set it to False to keep the order of the grouping columns the same as the order in which they are encountered in the dataset. There is a small performance improvement from not sorting your data.
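For example (a sketch; the resulting row order depends on where each airline first appears in the file):
>>> (flights
...     .groupby(['AIRLINE'], as_index=False, sort=False)
...     ['DIST']
...     .agg('mean')
...     .round(0)
... )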
pandas provides a number of aggregation functions to use with the groupby object. At some point, you may need to write your own custom user-defined function that does not exist in pandas or NumPy.
In this recipe, we use the college dataset to calculate the mean and standard deviation of the undergraduate student population per state. We then use this information to find the maximum number of standard deviations from the mean that any single population value is per state.
>>> college = pd.read_csv('data/college.csv')
>>> (college
... .groupby('STABBR')
... ['UGDS']
... .agg(['mean', 'std'])
... .round(0)
... )
mean std
STABBR
AK 2493.0 4052.0
AL 2790.0 4658.0
AR 1644.0 3143.0
AS 1276.0 NaN
AZ 4130.0 14894.0
... ... ...
VT 1513.0 2194.0
WA 2271.0 4124.0
WI 2655.0 4615.0
WV 1758.0 5957.0
WY 2244.0 2745.0
Define a custom function that returns the maximum number of standard deviations any value lies from its group's mean:
>>> def max_deviation(s):
... std_score = (s - s.mean()) / s.std()
... return std_score.abs().max()
Pass this custom function to the .agg method to complete the aggregation:
>>> (college
... .groupby('STABBR')
... ['UGDS']
... .agg(max_deviation)
... .round(1)
... )
STABBR
AK 2.6
AL 5.8
AR 6.3
AS NaN
AZ 9.9
...
VT 3.8
WA 6.6
WI 5.8
WV 7.2
WY 2.8
Name: UGDS, Length: 59, dtype: float64
There is no predefined pandas function to calculate the maximum number of standard deviations away from the mean. We need to write our own function. Notice that this custom function, max_deviation, accepts a single parameter, s.
In step 3, you will notice that the function name is placed inside the .agg method without directly being called. Nowhere is the parameter s explicitly passed to max_deviation. Instead, pandas implicitly passes the UGDS column as a Series to max_deviation.
The max_deviation function is called once for each group. As s is a Series, all normal Series methods are available. It subtracts the mean of that particular grouping from each of the values in the group before dividing by the standard deviation, in a process called standardization.
As we are interested in absolute deviation from the mean, we take the absolute value of all the standardized scores and return the maximum. The .agg method requires that we return a scalar from the function, or else an exception will be raised.
pandas defaults to using the sample standard deviation, which is undefined for any group with just a single value. For instance, the state abbreviation AS (American Samoa) has a missing value returned because it has only a single institution in the dataset.
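If you need a defined value even for single-member groups, one option (not part of the original recipe) is the population standard deviation, which sets ddof=0:
>>> (college
...     .groupby('STABBR')
...     ['UGDS']
...     .agg(lambda s: s.std(ddof=0))  # population std; 0.0 for one-value groups
...     .round(1)
... )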
It is possible to apply our custom function to multiple aggregating columns. We simply add more column names to the indexing operator. The max_deviation function only works with numeric columns:
>>> (college
... .groupby('STABBR')
... [['UGDS', 'SATVRMID', 'SATMTMID']]
... .agg(max_deviation)
... .round(1)
... )
UGDS SATVRMID SATMTMID
STABBR
AK 2.6 NaN NaN
AL 5.8 1.6 1.8
AR 6.3 2.2 2.3
AS NaN NaN NaN
AZ 9.9 1.9 1.4
... ... ... ...
VT 3.8 1.9 1.9
WA 6.6 2.2 2.0
WI 5.8 2.4 2.2
WV 7.2 1.7 2.1
WY 2.8 NaN NaN
You can also use your custom aggregation function along with the prebuilt functions. The following does this and groups by state and religious affiliation:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATVRMID', 'SATMTMID']]
... .agg([max_deviation, 'mean', 'std'])
... .round(1)
... )
UGDS ... SATMTMID
max_deviation mean ... mean std
STABBR RELAFFIL ...
AK 0 2.1 3508.9 ... NaN NaN
1 1.1 123.3 ... 503.0 NaN
AL 0 5.2 3248.8 ... 515.8 56.7
1 2.4 979.7 ... 485.6 61.4
AR 0 5.8 1793.7 ... 503.6 39.0
... ... ... ... ... ...
WI 0 5.3 2879.1 ... 591.2 85.7
1 3.4 1716.2 ... 526.6 42.5
WV 0 6.9 1873.9 ... 480.0 27.7
1 1.3 716.4 ... 484.8 17.7
WY 0 2.8 2244.4 ... 540.0 NaN
Notice that pandas uses the name of the function as the name for the returned column. You can change the column name directly with the .rename method, or you can modify the function's .__name__ attribute:
>>> max_deviation.__name__
'max_deviation'
>>> max_deviation.__name__ = 'Max Deviation'
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATVRMID', 'SATMTMID']]
... .agg([max_deviation, 'mean', 'std'])
... .round(1)
... )
UGDS ... SATMTMID
Max Deviation mean ... mean std
STABBR RELAFFIL ...
AK 0 2.1 3508.9 ... NaN NaN
1 1.1 123.3 ... 503.0 NaN
AL 0 5.2 3248.8 ... 515.8 56.7
1 2.4 979.7 ... 485.6 61.4
AR 0 5.8 1793.7 ... 503.6 39.0
... ... ... ... ... ...
WI 0 5.3 2879.1 ... 591.2 85.7
1 3.4 1716.2 ... 526.6 42.5
WV 0 6.9 1873.9 ... 480.0 27.7
1 1.3 716.4 ... 484.8 17.7
WY 0 2.8 2244.4 ... 540.0 NaN
When writing your own user-defined aggregation function, pandas implicitly passes it each of the aggregating columns one at a time as a Series. Occasionally, you will need to pass more arguments to your function than just the Series itself. To do so, you need to be aware of Python's ability to pass an arbitrary number of arguments to functions.
The signature of .agg is agg(func, *args, **kwargs). The func parameter is a reducing function, the string name of a reducing method, a list of reducing functions, or a dictionary mapping columns to functions or to a list of functions. Additionally, as we have seen, you can use keyword arguments to create named aggregations.
If you have a reducing function that takes additional arguments that you would like to use, you can leverage the *args and **kwargs parameters to pass arguments to the reduction function. You can use *args to pass an arbitrary number of positional arguments to your customized aggregation function. Similarly, **kwargs allows you to pass an arbitrary number of keyword arguments.
In this recipe, we will build a customized function for the college dataset that finds the percentage of schools by state and religious affiliation that have an undergraduate population between two values.
First, define a function with hardcoded bounds that calculates the percentage of schools with an undergraduate population between 1,000 and 3,000:
>>> def pct_between_1_3k(s):
... return (s
... .between(1_000, 3_000)
... .mean()
... * 100
... )
Calculate this percentage for every state and religious affiliation group:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... ['UGDS']
... .agg(pct_between_1_3k)
... .round(1)
... )
STABBR RELAFFIL
AK 0 14.3
1 0.0
AL 0 23.6
AR 0 27.9
...
WI 0 13.8
1 36.0
WV 0 24.6
1 37.5
WY 0 54.5
Name: UGDS, Length: 112, dtype: float64
To give the user control over the bounds, create a new function that accepts them as parameters:
>>> def pct_between(s, low, high):
... return s.between(low, high).mean() * 100
Pass this new function to the .agg method along with the lower and upper bounds:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... ['UGDS']
... .agg(pct_between, 1_000, 10_000)
... .round(1)
... )
STABBR RELAFFIL
AK 0 42.9
1 0.0
AL 0 45.8
1 37.5
AR 0 39.7
...
WI 0 31.0
1 44.0
WV 0 29.2
1 37.5
WY 0 72.7
Name: UGDS, Length: 112, dtype: float64
Step 1 creates a function that doesn't accept any extra arguments. The upper and lower bounds are hardcoded into the function, which isn't very flexible. Step 2 shows the results of this aggregation.
We create a more flexible function in step 3 where we parameterize both the lower and upper bounds dynamically. Step 4 is where the magic of *args and **kwargs comes into play. In this particular example, we pass two non-keyword arguments, 1_000 and 10_000, to the .agg method. pandas passes these two arguments respectively to the low and high parameters of pct_between.
There are a few ways we could achieve the same result in step 4. We could have explicitly used keyword parameters to produce the same result:
(college
.groupby(['STABBR', 'RELAFFIL'])
['UGDS']
.agg(pct_between, high=10_000, low=1_000)
.round(1)
)
If we want to call multiple aggregation functions and some of them need parameters, we can utilize Python's closure functionality to create a new function that has the parameters closed over in its calling environment:
>>> def between_n_m(n, m):
... def wrapper(ser):
... return pct_between(ser, n, m)
... wrapper.__name__ = f'between_{n}_{m}'
... return wrapper
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... ['UGDS']
... .agg([between_n_m(1_000, 10_000), 'max', 'mean'])
... .round(1)
... )
between_1000_10000 max mean
STABBR RELAFFIL
AK 0 42.9 12865.0 3508.9
1 0.0 275.0 123.3
AL 0 45.8 29851.0 3248.8
1 37.5 3033.0 979.7
AR 0 39.7 21405.0 1793.7
... ... ... ...
WI 0 31.0 29302.0 2879.1
1 44.0 8212.0 1716.2
WV 0 29.2 44924.0 1873.9
1 37.5 1375.0 716.4
WY 0 72.7 9910.0 2244.4
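As an alternative to a handwritten closure, functools.partial can bake the bounds in. This is a sketch; partial objects accept attribute assignment, and setting __name__ gives pandas a label for the resulting column (the name between_1k_10k is made up for illustration):
>>> from functools import partial
>>> between_1k_10k = partial(pct_between, low=1_000, high=10_000)
>>> between_1k_10k.__name__ = 'between_1k_10k'  # used as the column label
>>> (college
...     .groupby(['STABBR', 'RELAFFIL'])
...     ['UGDS']
...     .agg([between_1k_10k, 'max', 'mean'])
...     .round(1)
... )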
The immediate result from using the .groupby method on a DataFrame is a groupby object. Usually, we chain operations on this object to do aggregations or transformations without ever storing the intermediate values in variables.
In this recipe, we dig into the groupby object itself to examine individual groups.
>>> college = pd.read_csv('data/college.csv')
>>> grouped = college.groupby(['STABBR', 'RELAFFIL'])
>>> type(grouped)
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Use the dir function to discover the attributes of a groupby object:
>>> print([attr for attr in dir(grouped) if not
... attr.startswith('_')])
['CITY', 'CURROPER', 'DISTANCEONLY', 'GRAD_DEBT_MDN_SUPP', 'HBCU', 'INSTNM',
'MD_EARN_WNE_P10', 'MENONLY', 'PCTFLOAN', 'PCTPELL', 'PPTUG_EF', 'RELAFFIL',
'SATMTMID', 'SATVRMID', 'STABBR', 'UG25ABV', 'UGDS', 'UGDS_2MOR', 'UGDS_AIAN',
'UGDS_ASIAN', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_NHPI', 'UGDS_NRA', 'UGDS_UNKN',
'UGDS_WHITE', 'WOMENONLY', 'agg', 'aggregate', 'all', 'any', 'apply',
'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount',
'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes',
'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups', 'head',
'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean', 'median',
'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad',
'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample',
'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take',
'transform', 'tshift', 'var']
Find the number of groups with the .ngroups attribute:
>>> grouped.ngroups
112
To find the uniquely identifying labels for each group, look in the .groups attribute, which contains a dictionary mapping each unique group to all the corresponding index labels of that group. Because we grouped by two columns, each of the keys is a tuple, with one value for the STABBR column and another for the RELAFFIL column:
>>> groups = list(grouped.groups)
>>> groups[:6]
[('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1), ('AR', 0), ('AR', 1)]
Retrieve a single group with the .get_group method by passing it a tuple of an exact group label. For example, to get all the religiously affiliated schools in the state of Florida, do the following:
>>> grouped.get_group(('FL', 1))
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
712 The Bapt... Graceville ... 30800 20052
713 Barry Un... Miami ... 44100 28250
714 Gooding ... Panama City ... NaN PrivacyS...
715 Bethune-... Daytona ... ... 29400 36250
724 Johnson ... Kissimmee ... 26300 20199
... ... ... ... ... ...
7486 Strayer ... Coral Sp... ... 49200 36173.5
7487 Strayer ... Fort Lau... ... 49200 36173.5
7488 Strayer ... Miramar ... 49200 36173.5
7489 Strayer ... Miami ... 49200 36173.5
7490 Strayer ... Miami ... 49200 36173.5
groupby objects are iterable. If you are in Jupyter, you can leverage the display function to show each group in a single cell (otherwise, Jupyter will only show the result of the last statement in the cell):
from IPython.display import display
for name, group in grouped:
print(name)
display(group.head(3))
Figure: displaying multiple DataFrames in a single Jupyter cell
However, I typically want to see some example data from a single group to figure out what function I want to apply to the groups. If I know the names of the values from the columns I grouped by, I can use the previous step. Often, I don't know those names, but I also don't need to see all of the groups. The following is some debugging of the code that is usually sufficient to understand what a group looks like:
>>> for name, group in grouped:
... print(name)
... print(group)
... break
('AK', 0)
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
60 Universi... Anchorage ... 42500 19449.5
62 Universi... Fairbanks ... 36200 19355
63 Universi... Juneau ... 37400 16875
65 AVTEC-Al... Seward ... 33500 PrivacyS...
66 Charter ... Anchorage ... 39200 13875
67 Alaska C... Anchorage ... 28700 8994
5171 Ilisagvi... Barrow ... 24900 PrivacyS...
You can also call the .head method on your groupby object to get the first rows of each group together in a single DataFrame:
>>> grouped.head(2)
INSTNM CITY ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
0 Alabama ... Normal ... 30300 33888
1 Universi... Birmingham ... 39700 21941.5
2 Amridge ... Montgomery ... 40100 23370
10 Birmingh... Birmingham ... 44200 27000
43 Prince I... Elmhurst ... PrivacyS... 20992
... ... ... ... ... ...
5289 Pacific ... Mangilao ... PrivacyS... PrivacyS...
6439 Touro Un... Henderson ... NaN PrivacyS...
7352 Marinell... Henderson ... 21200 9796.5
7404 Universi... St. Croix ... 31800 15150
7419 Computer... Las Cruces ... 21300 14250
Step 1 creates our groupby object. We can display all the public attributes and methods to reveal the functionality of an object, as was done in step 2. Each group is uniquely identified by a tuple containing a unique combination of the values in the grouping columns. pandas allows you to select a specific group as a DataFrame with the .get_group method shown in step 5.
It is rare that you will need to iterate through your groups. In fact, you should avoid doing so, as it can be quite slow. Occasionally, however, you will have no other choice. When iterating through a groupby object, you are given a tuple containing the group name and the DataFrame, with the grouping columns moved into the index. This tuple is unpacked into the name and group variables in the for loop in step 6.
One thing you can do while iterating through your groups is to display a few of the rows from each group directly in the notebook. To do this, you can either use the print function or the display function from the IPython.display module if you are using Jupyter.
There are several useful methods from the list in step 2 that were not explored. Take, for instance, the .nth method, which, when provided with a list of integers, selects those rows (by integer position) from each group. For example, the following operation selects the second and last rows from each group:
>>> grouped.nth([1, -1])
INSTNM CITY ... MD_EARN_WNE_P10
STABBR RELAFFIL ...
AK 0 Universi... Fairbanks ... 36200
0 Ilisagvi... Barrow ... 24900
1 Alaska P... Anchorage ... 47000
1 Alaska C... Soldotna ... NaN
AL 0 Universi... Birmingham ... 39700
... ... ... ... ...
WV 0 BridgeVa... South C... ... NaN
1 Appalach... Mount Hope ... 28700
1 West Vir... Nutter Fort ... 16700
WY 0 Central ... Riverton ... 25200
0 CollegeA... Cheyenne ... 25600
Previously, we examined using Boolean arrays to filter rows. In a similar fashion, when using the .groupby method, we can filter out entire groups. The .filter method of the groupby object accepts a function that must return either True or False to indicate whether a group is kept.
This .filter method applied after a call to the .groupby method is completely different from the DataFrame .filter method covered in the Selecting columns with methods recipe from Chapter 2, Essential DataFrame Operations.
One thing to be aware of is that when the .filter method is applied, the result does not use the grouping columns as the index, but keeps the original index! The DataFrame .filter method filters columns, not values.
In this recipe, we use the college dataset to find all the states that have more non-white undergraduate students than white. This is a dataset from the US, where whites form the majority and therefore, we are looking for states with a minority majority.
Read in the college dataset, group by state, and count the number of groups. This matches the number of unique states, which we verify with the .nunique Series method:
>>> college = pd.read_csv('data/college.csv', index_col='INSTNM')
>>> grouped = college.groupby('STABBR')
>>> grouped.ngroups
59
>>> college['STABBR'].nunique() # verifying the same number
59
The groupby object has a .filter method, which accepts a custom function that determines whether a group is kept. The custom function accepts a DataFrame of the current group and is required to return a Boolean. Let's define a function that calculates the total percentage of minority students and returns True if this percentage is greater than a user-defined threshold:
>>> def check_minority(df, threshold):
... minority_pct = 1 - df['UGDS_WHITE']
... total_minority = (df['UGDS'] * minority_pct).sum()
... total_ugds = df['UGDS'].sum()
... total_minority_pct = total_minority / total_ugds
... return total_minority_pct > threshold
Use the .filter method, passing it the check_minority function and a threshold of 50%, to find all states that have a minority majority:
>>> college_filtered = grouped.filter(check_minority, threshold=.5)
>>> college_filtered
CITY STABBR ... MD_EARN_WNE_P10 GRAD_DEBT_MDN_SUPP
INSTNM ...
Everest C... Phoenix AZ ... 28600 9500
Collins C... Phoenix AZ ... 25700 47000
Empire Be... Phoenix AZ ... 17800 9588
Empire Be... Tucson AZ ... 18200 9833
Thunderbi... Glendale AZ ... 118900 PrivacyS...
... ... ... ... ... ...
WestMed C... Merced CA ... NaN 15623.5
Vantage C... El Paso TX ... NaN 9500
SAE Insti... Emeryville CA ... NaN 9500
Bay Area ... San Jose CA ... NaN PrivacyS...
Excel Lea... San Antonio TX ... NaN 12125
To verify that some rows were filtered out, compare the shape of the original and filtered DataFrames and count the remaining unique states:
>>> college.shape
(7535, 26)
>>> college_filtered.shape
(3028, 26)
>>> college_filtered['STABBR'].nunique()
20
This recipe takes a look at the total population of all the institutions on a state-by-state basis. The goal is to keep all the rows from the states, as a whole, that have a minority majority. This requires us to group our data by state, which we do in step 1. We find that there are 59 independent groups.
The .filter groupby method either keeps all the rows in a group or filters them out. It does not change the number of columns. The .filter groupby method performs this gatekeeping through a user-defined function, check_minority in this recipe. This function accepts a DataFrame of each group and needs to return a Boolean.
Inside the check_minority function, the percentage and the total number of non-white students for each institution are first calculated, followed by the total number of all students. Finally, the percentage of non-white students for the entire state is checked against the given threshold, which produces a Boolean.
The final result is a DataFrame with the same columns as the original (and the same index, not the grouped index), but with the rows from the states that don't meet the threshold filtered out. As it is possible that the head of the filtered DataFrame is the same as the original, you need to do some inspection to ensure that the operation completed successfully. We verify this by checking the number of rows and unique states.
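A common alternative that avoids calling a Python function once per group is to compute the state-level percentage with .transform and then index with a Boolean mask. A sketch under the same column assumptions, not part of the original recipe:
>>> minority = college['UGDS'] * (1 - college['UGDS_WHITE'])
>>> total_minority = minority.groupby(college['STABBR']).transform('sum')
>>> total_ugds = college.groupby('STABBR')['UGDS'].transform('sum')
>>> mask = (total_minority / total_ugds) > .5  # one Boolean per row
>>> college_filtered2 = college[mask]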
Our function, check_minority, is flexible and accepts a parameter to lower or raise the minority percentage threshold. Let's check the shape and number of unique states for a couple of other thresholds:
>>> college_filtered_20 = grouped.filter(check_minority, threshold=.2)
>>> college_filtered_20.shape
(7461, 26)
>>> college_filtered_20['STABBR'].nunique()
57
>>> college_filtered_70 = grouped.filter(check_minority, threshold=.7)
>>> college_filtered_70.shape
(957, 26)
>>> college_filtered_70['STABBR'].nunique()
10
One method to increase motivation to lose weight is to make a bet with someone else. The scenario in this recipe will track weight loss from two individuals throughout a four-month period and determine a winner.
In this recipe, we use simulated data from two individuals to track the percentage of weight loss over four months. At the end of each month, a winner will be declared based on the individual who lost the highest percentage of body weight for that month. To track weight loss, we group our data by month and person, and then call the .transform method to find the percentage weight change for each week relative to the start of the month.
We will use the .transform method in this recipe. This method returns a new object that preserves the index of the original DataFrame but allows you to do calculations on groups of the data.
Read in the weight_loss dataset, and examine the first month of data from the two people, Amy and Bob. There are a total of four weigh-ins per month:
>>> weight_loss = pd.read_csv('data/weight_loss.csv')
>>> weight_loss.query('Month == "Jan"')
Name Month Week Weight
0 Bob Jan Week 1 291
1 Amy Jan Week 1 197
2 Bob Jan Week 2 288
3 Amy Jan Week 2 189
4 Bob Jan Week 3 283
5 Amy Jan Week 3 189
6 Bob Jan Week 4 283
7 Amy Jan Week 4 190
Define a function that computes the percentage loss relative to the first value in a Series:
>>> def percent_loss(s):
...     return ((s - s.iloc[0]) / s.iloc[0]) * 100
Test this function for Bob during the month of January:
>>> (weight_loss
... .query('Name=="Bob" and Month=="Jan"')
... ['Weight']
... .pipe(percent_loss)
... )
0 0.000000
2 -1.030928
4 -2.749141
6 -2.749141
Name: Weight, dtype: float64
Group by Name and Month, and then use the .transform method to apply this custom function to every group. The function we pass to .transform needs to maintain the index of the group that is passed into it, so we can use percent_loss here:
>>> (weight_loss
... .groupby(['Name', 'Month'])
... ['Weight']
... .transform(percent_loss)
... )
0 0.000000
1 0.000000
2 -1.030928
3 -4.060914
4 -2.749141
...
27 -3.529412
28 -3.065134
29 -3.529412
30 -4.214559
31 -5.294118
Name: Weight, Length: 32, dtype: float64
The .transform method takes a function that returns an object with the same index (and the same number of rows) as was passed into it. Because it has the same index, we can insert the result as a column. The .transform method is useful for summarizing information from the groups and then adding it back to the original DataFrame. We will add a percent_loss column and also filter down to two months of data for Bob:
>>> (weight_loss
... .assign(percent_loss=(weight_loss
... .groupby(['Name', 'Month'])
... ['Weight']
... .transform(percent_loss)
... .round(1)))
... .query('Name=="Bob" and Month in ["Jan", "Feb"]')
... )
Name Month Week Weight percent_loss
0 Bob Jan Week 1 291 0.0
2 Bob Jan Week 2 288 -1.0
4 Bob Jan Week 3 283 -2.7
6 Bob Jan Week 4 283 -2.7
8 Bob Feb Week 1 283 0.0
10 Bob Feb Week 2 275 -2.8
12 Bob Feb Week 3 268 -5.3
14 Bob Feb Week 4 268 -5.3
With the percent_loss column in place, we could manually determine a winner, but let's see whether we can find a way to do this automatically. As the only week that matters is the last week, let's select week 4:
>>> (weight_loss
... .assign(percent_loss=(weight_loss
... .groupby(['Name', 'Month'])
... ['Weight']
... .transform(percent_loss)
... .round(1)))
... .query('Week == "Week 4"')
... )
Name Month Week Weight percent_loss
6 Bob Jan Week 4 283 -2.7
7 Amy Jan Week 4 190 -3.6
14 Bob Feb Week 4 268 -5.3
15 Amy Feb Week 4 173 -8.9
22 Bob Mar Week 4 261 -2.6
23 Amy Mar Week 4 170 -1.7
30 Bob Apr Week 4 250 -4.2
31 Amy Apr Week 4 161 -5.3
Use the .pivot method to reshape the data so that Bob's and Amy's percent weight loss is side by side for each month:
>>> (weight_loss
... .assign(percent_loss=(weight_loss
... .groupby(['Name', 'Month'])
... ['Weight']
... .transform(percent_loss)
... .round(1)))
... .query('Week == "Week 4"')
... .pivot(index='Month', columns='Name',
... values='percent_loss')
... )
Name Amy Bob
Month
Apr -5.3 -4.2
Feb -8.9 -5.3
Jan -3.6 -2.7
Mar -1.7 -2.6
NumPy has a vectorized if-then-else function called where, which can map a Series or array of Booleans to other values. Let's create a column, winner, with the name of the winner:
>>> (weight_loss
... .assign(percent_loss=(weight_loss
... .groupby(['Name', 'Month'])
... ['Weight']
... .transform(percent_loss)
... .round(1)))
... .query('Week == "Week 4"')
... .pivot(index='Month', columns='Name',
... values='percent_loss')
... .assign(winner=lambda df_:
... np.where(df_.Amy < df_.Bob, 'Amy', 'Bob'))
... )
Name Amy Bob winner
Month
Apr -5.3 -4.2 Amy
Feb -8.9 -5.3 Amy
Jan -3.6 -2.7 Amy
Mar -1.7 -2.6 Bob
In Jupyter, you can highlight the winning percentage for each month using the .style attribute:
(weight_loss
.assign(percent_loss=(weight_loss
.groupby(['Name', 'Month'])
['Weight']
.transform(percent_loss)
.round(1)))
.query('Week == "Week 4"')
.pivot(index='Month', columns='Name',
values='percent_loss')
.assign(winner=lambda df_:
np.where(df_.Amy < df_.Bob, 'Amy', 'Bob'))
.style.highlight_min(axis=1)
)
Figure: the minimum value in each row highlighted
Finally, use the .value_counts method to return the final score as the number of months won:
>>> (weight_loss
... .assign(percent_loss=(weight_loss
... .groupby(['Name', 'Month'])
... ['Weight']
... .transform(percent_loss)
... .round(1)))
... .query('Week == "Week 4"')
... .pivot(index='Month', columns='Name',
... values='percent_loss')
... .assign(winner=lambda df_:
... np.where(df_.Amy < df_.Bob, 'Amy', 'Bob'))
... .winner
... .value_counts()
... )
Amy 3
Bob 1
Name: winner, dtype: int64
Throughout this recipe, the .query method is used to filter data instead of using Boolean arrays. Refer to the Improving readability of Boolean indexing with the query method recipe in Chapter 7, Filtering Rows, for more information.
Our goal is to find the percentage weight loss for each month for each person. One way to accomplish this task is to calculate each week's weight loss relative to the start of each month. This specific task is perfectly suited to the .transform groupby method. The .transform method requires a function as a parameter. This function gets passed each group (which can be a Series or DataFrame). It must return a sequence of values the same length as the group that was passed in or else an exception will be raised. No aggregation or filtering takes place.
Step 2 creates a function that calculates the percentage loss (or gain) relative to the first value. It subtracts the first value of the passed Series from all of its values and then divides this result by the first value. In step 3, we test this function on one person during one month.
In step 4, we use .groupby with .transform to run this function over every combination of person and month. We are transforming the Weight column into the percentage of weight lost in the current week.
The first month of data is output for each person in step 6. pandas returns the new data as a Series. This Series isn't all that useful by itself and makes more sense appended to the original DataFrame as a new column. We complete this operation in step 5.
To determine the winner, only week 4 of each month is necessary. We could stop here and manually determine the winner, but pandas supplies us with the functionality to automate this. The .pivot method in step 7 reshapes our dataset by pivoting the unique values of one column into new column names. The index parameter is used for the column that you do not want to pivot. The column passed to the values parameter gets tiled over each unique combination of the columns in the index and columns parameters.
The .pivot method only works if there is just a single occurrence of each unique combination of the columns in the index and columns parameters. If there is more than one, an exception is raised. You can use the .pivot_table or .groupby method in that situation.
Here is an example of using .groupby with .unstack to emulate the .pivot functionality:
>>> (weight_loss
... .assign(percent_loss=(weight_loss
... .groupby(['Name', 'Month'])
... ['Weight']
... .transform(percent_loss)
... .round(1)))
... .query('Week == "Week 4"')
... .groupby(['Month', 'Name'])
... ['percent_loss']
... .first()
... .unstack()
... )
Name Amy Bob
Month
Apr -5.3 -4.2
Feb -8.9 -5.3
Jan -3.6 -2.7
Mar -1.7 -2.6
After pivoting, we utilize the NumPy where function, whose first argument is a condition that produces a Series of Booleans. True values get mapped to Amy, and False values get mapped to Bob. We highlight the winner of each month and tally the final score with the .value_counts method.
Take a look at the DataFrame output from step 7. Did you notice that the months are in alphabetical and not chronological order? pandas, unfortunately, in this case at least, orders the months for us alphabetically. We can solve this issue by changing the data type of Month to a categorical variable. Categorical variables map all the values of the column to integers. We can choose this mapping to be the normal chronological order for the months. pandas uses this underlying integer mapping during the .pivot method to order the months chronologically:
>>> (weight_loss
... .assign(percent_loss=(weight_loss
... .groupby(['Name', 'Month'])
... ['Weight']
... .transform(percent_loss)
... .round(1)),
... Month=pd.Categorical(weight_loss.Month,
... categories=['Jan', 'Feb', 'Mar', 'Apr'],
... ordered=True))
... .query('Week == "Week 4"')
... .pivot(index='Month', columns='Name',
... values='percent_loss')
... )
Name Amy Bob
Month
Jan -3.6 -2.7
Feb -8.9 -5.3
Mar -1.7 -2.6
Apr -5.3 -4.2
To convert Month to an ordered categorical column, use the Categorical constructor. Pass it the original column as a Series, and pass a unique sequence of all the categories in the desired order to the categories parameter. In general, to sort columns of the object data type by something other than alphabetical order, convert them to categorical.
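Once the values are ordered categoricals, sorting and comparisons follow the declared order rather than alphabetical order. A minimal sketch:
>>> months = pd.Series(pd.Categorical(
...     ['Mar', 'Jan', 'Feb'],
...     categories=['Jan', 'Feb', 'Mar', 'Apr'],
...     ordered=True))
>>> months.sort_values()  # Jan, Feb, Mar - calendar order, not alphabetical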
The groupby object has four methods that accept a function (or functions) to perform a calculation on each group. These four methods are .agg, .filter, .transform, and .apply. Each of the first three has a very specific output that the function must return: .agg must return a scalar value, .filter must return a Boolean, and .transform must return a Series or DataFrame with the same length as the passed group. The .apply method, however, may return a scalar value, a Series, or even a DataFrame of any shape, making it very flexible. It is also called only once per group (on a DataFrame), while the .transform and .agg methods get called once for each aggregating column (on a Series). The .apply method's ability to return a single object when operating on multiple columns at the same time makes the calculation in this recipe possible.
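A side-by-side sketch of the four methods on a tiny, made-up DataFrame:
>>> df = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 30]})
>>> df.groupby('g')['x'].agg('sum')        # one scalar per group: a -> 3, b -> 30
>>> df.groupby('g')['x'].transform('sum')  # same length as df: 3, 3, 30
>>> df.groupby('g').filter(lambda d: d['x'].sum() > 10)  # keeps only group b's rows
>>> df.groupby('g').apply(lambda d: d['x'].sum())        # may return any shape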
In this recipe, we calculate the weighted average of both the math and verbal SAT scores per state from the college dataset. We weight the scores by the population of undergraduate students per school.
Read in the college dataset and drop any rows that have a missing value in the UGDS, SATMTMID, or SATVRMID columns. We do not want any missing values in those columns:
>>> college = pd.read_csv('data/college.csv')
>>> subset = ['UGDS', 'SATMTMID', 'SATVRMID']
>>> college2 = college.dropna(subset=subset)
>>> college.shape
(7535, 27)
>>> college2.shape
(1184, 27)
Define a function to compute the weighted average of SATMTMID, using the undergraduate population as the weights:
>>> def weighted_math_average(df):
... weighted_math = df['UGDS'] * df['SATMTMID']
... return int(weighted_math.sum() / df['UGDS'].sum())
Group by state and pass this function to the .apply method. Because each group has multiple columns and we want to reduce them to a single value, we need to use .apply. The weighted_math_average function will be called once for each group (not on the individual columns in the group):
>>> college2.groupby('STABBR').apply(weighted_math_average)
STABBR
AK 503
AL 536
AR 529
AZ 569
CA 564
...
VT 566
WA 555
WI 593
WV 500
WY 540
Length: 53, dtype: int64
This calculation is not possible with the .agg method (which calls the function once for every aggregating column):
>>> (college2
... .groupby('STABBR')
... .agg(weighted_math_average)
... )
Traceback (most recent call last):
...
KeyError: 'UGDS'
The weighted_math_average function gets applied to each non-aggregating column in the DataFrame. If you try to limit the columns to just SATMTMID, you will get an error, as you won't have access to UGDS:
>>> (college2
... .groupby('STABBR')
... ['SATMTMID']
... .agg(weighted_math_average)
... )
Traceback (most recent call last):
...
KeyError: 'UGDS'
So, the best way to complete operations that act on multiple columns is with .apply. Another nice feature of .apply is that you can create multiple new columns by returning a Series. The index of this returned Series becomes the new column names. Let's modify our function to calculate the weighted and arithmetic average for both SAT scores, along with the count of the number of institutions in each group. We return these five values as a Series:
>>> def weighted_average(df):
... weight_m = df['UGDS'] * df['SATMTMID']
... weight_v = df['UGDS'] * df['SATVRMID']
... wm_avg = weight_m.sum() / df['UGDS'].sum()
... wv_avg = weight_v.sum() / df['UGDS'].sum()
... data = {'w_math_avg': wm_avg,
... 'w_verbal_avg': wv_avg,
... 'math_avg': df['SATMTMID'].mean(),
... 'verbal_avg': df['SATVRMID'].mean(),
... 'count': len(df)
... }
... return pd.Series(data)
>>> (college2
... .groupby('STABBR')
... .apply(weighted_average)
... .astype(int)
... )
w_math_avg w_verbal_avg math_avg verbal_avg count
STABBR
AK 503 555 503 555 1
AL 536 533 504 508 21
AR 529 504 515 491 16
AZ 569 557 536 538 6
CA 564 539 562 549 72
... ... ... ... ... ...
VT 566 564 526 527 8
WA 555 541 551 548 18
WI 593 556 545 516 14
WV 500 487 481 473 17
WY 540 535 540 535 1
In order for this recipe to complete correctly, we need to filter for institutions that do not have missing values for UGDS, SATMTMID, and SATVRMID. By default, the .dropna method drops rows that have one or more missing values in any column. We must use the subset parameter to limit the columns it looks at, so that it only considers the UGDS, SATMTMID, and SATVRMID columns for missing values.
If we do not remove the missing values, they will throw off the computations for the weighted averages. Below, you can see that the weighted scores for AK are 5 and 6, which does not make sense:
>>> (college
... .groupby('STABBR')
... .apply(weighted_average)
... )
w_math_avg w_verbal_avg math_avg verbal_avg count
STABBR
AK 5.548091 6.121651 503.000000 555.000000 10.0
AL 261.895658 260.550109 504.285714 508.476190 96.0
AR 301.054792 287.264872 515.937500 491.875000 86.0
AS 0.000000 0.000000 NaN NaN 1.0
AZ 61.815821 60.511712 536.666667 538.333333 133.0
... ... ... ... ... ...
VT 389.967094 388.696848 526.875000 527.500000 27.0
WA 274.885878 267.880280 551.222222 548.333333 123.0
WI 153.803086 144.160115 545.071429 516.857143 112.0
WV 224.697582 218.843452 481.705882 473.411765 73.0
WY 216.761180 214.754132 540.000000 535.000000 11.0
In step 2, we define a function that calculates the weighted average for just the SATMTMID
column. The weighted average differs from the arithmetic mean because each value is multiplied by a weight. This quantity is then summed and divided by the sum of the weights. In this case, our weight is the undergraduate student population.
In step 3, we pass this function to the .apply method. Our function, weighted_math_average, gets passed a DataFrame of all the original columns for each group. It returns a single scalar value, the weighted average of SATMTMID. At this point, you might think that this calculation is possible using the .agg method. Directly replacing .apply with .agg does not work, as .agg returns a value for each of its aggregating columns.
Step 6 shows the versatility of .apply. We build a new function that calculates the weighted and arithmetic average of both SAT columns as well as the number of rows for each group. To use .apply to create multiple columns, you must return a Series. The index values are used as column names in the resulting DataFrame. You can return as many values as you want with this method.
Note that because I'm using a Python version greater than 3.5, I can use a normal dictionary in weighted_average to create a Series in a predictable column order. This is because, as of Python 3.6, dictionaries maintain insertion order.
In this recipe, we returned a single row as a Series for each group. It's possible to return any number of rows and columns for each group by returning a DataFrame.
In addition to finding just the arithmetic and weighted means, let's also find the geometric and harmonic means of both SAT columns and return the results as a DataFrame, with rows as the name of the type of mean and columns as the SAT type. To ease the burden on us, we use the NumPy function np.average to compute the weighted average and the SciPy functions gmean and hmean for the geometric and harmonic means:
>>> from scipy.stats import gmean, hmean
>>> def calculate_means(df):
... df_means = pd.DataFrame(index=['Arithmetic', 'Weighted',
... 'Geometric', 'Harmonic'])
... cols = ['SATMTMID', 'SATVRMID']
... for col in cols:
... arithmetic = df[col].mean()
... weighted = np.average(df[col], weights=df['UGDS'])
... geometric = gmean(df[col])
... harmonic = hmean(df[col])
... df_means[col] = [arithmetic, weighted,
... geometric, harmonic]
... df_means['count'] = len(df)
... return df_means.astype(int)
>>> (college2
... .groupby('STABBR')
... .apply(calculate_means)
... )
SATMTMID SATVRMID count
STABBR
AK Arithmetic 503 555 1
Weighted 503 555 1
Geometric 503 555 1
Harmonic 503 555 1
AL Arithmetic 504 508 21
... ... ... ...
WV Harmonic 480 472 17
WY Arithmetic 540 535 1
Weighted 540 535 1
Geometric 540 534 1
Harmonic 540 535 1
When grouping in pandas, you typically use columns with discrete repeating values. If there are no repeated values, then grouping would be pointless as there would only be one row per group. Continuous numeric columns typically have few repeated values and are generally not used to form groups. However, if we can transform columns with continuous values into a discrete column by placing each value in a bin, rounding them, or using some other mapping, then grouping with them makes sense.
In this recipe, we explore the flights dataset to discover the distribution of airlines for different travel distances. This allows us, for example, to find the airline that makes the most flights between 500 and 1,000 miles. To accomplish this, we use the pandas cut function to discretize the distance of each flight flown.
>>> flights = pd.read_csv('data/flights.csv')
>>> flights
MONTH DAY WEEKDAY ... ARR_DELAY DIVERTED CANCELLED
0 1 1 4 ... 65.0 0 0
1 1 1 4 ... -13.0 0 0
2 1 1 4 ... 35.0 0 0
3 1 1 4 ... -7.0 0 0
4 1 1 4 ... 39.0 0 0
... ... ... ... ... ... ... ...
58487 12 31 4 ... -19.0 0 0
58488 12 31 4 ... 4.0 0 0
58489 12 31 4 ... -5.0 0 0
58490 12 31 4 ... 34.0 0 0
58491 12 31 4 ... -1.0 0 0
We want to place the continuous DIST column into discrete bins. Let's use the pandas cut function to split the data into five bins:
>>> bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
>>> cuts = pd.cut(flights['DIST'], bins=bins)
>>> cuts
0 (500.0, 1000.0]
1 (1000.0, 2000.0]
2 (500.0, 1000.0]
3 (1000.0, 2000.0]
4 (1000.0, 2000.0]
...
58487 (1000.0, 2000.0]
58488 (200.0, 500.0]
58489 (200.0, 500.0]
58490 (500.0, 1000.0]
58491 (500.0, 1000.0]
Name: DIST, Length: 58492, dtype: category
Categories (5, interval[float64]): [(-inf, 200.0] < (200.0, 500.0] < (500.0, 1000.0] <
(1000.0, 2000.0] < (2000.0, inf]]
Count the number of flights in each bin:
>>> cuts.value_counts()
(500.0, 1000.0] 20659
(200.0, 500.0] 15874
(1000.0, 2000.0] 14186
(2000.0, inf] 4054
(-inf, 200.0] 3719
Name: DIST, dtype: int64
The cuts Series can now be used to form groups. pandas allows you to pass many types of objects into the .groupby method. Pass the cuts Series to the .groupby method and then call the .value_counts method on the AIRLINE column to find the distribution for each distance group. Notice that SkyWest (OO) makes up 33% of flights of less than 200 miles but only 16% of those between 200 and 500 miles:
>>> (flights
... .groupby(cuts)
... ['AIRLINE']
... .value_counts(normalize=True)
... .round(3)
... )
DIST AIRLINE
(-inf, 200.0] OO 0.326
EV 0.289
MQ 0.211
DL 0.086
AA 0.052
...
(2000.0, inf] WN 0.046
HA 0.028
NK 0.019
AS 0.012
F9 0.004
Name: AIRLINE, Length: 57, dtype: float64
In step 2, the cut function places each value of the DIST column into one of five bins. The bins are created by a sequence of six numbers defining the edges; you always need one more edge than the number of bins. You can instead pass the bins parameter an integer, which automatically creates that number of equal-width bins. The negative and positive infinity values available in NumPy ensure that all values get placed in a bin; any value falling outside the bin edges would otherwise be made missing and not placed in a bin.
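For instance, letting pandas pick the edges is a quick way to explore a continuous column. A minimal sketch of the integer form of the bins parameter:
>>> equal_width = pd.cut(flights['DIST'], bins=5)  # five equal-width bins
>>> equal_width.cat.categories  # inspect the generated interval edges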
The cuts variable is now a Series of five ordered categories. It has all the normal Series methods, and, in step 3, the .value_counts method is used to get a sense of its distribution.
The .groupby method allows you to pass any object to group on. This means that you are able to form groups from something completely unrelated to the current DataFrame. Here, we group by the values in the cuts variable. For each grouping, we find the percentage of flights per airline with .value_counts by setting normalize to True.
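To illustrate grouping on something that is not a column, here is a sketch that groups on a derived key, the distance floor-divided into 500-mile blocks, without ever adding it to the DataFrame:
>>> dist_key = flights['DIST'] // 500   # aligned Series, never a column
>>> flights.groupby(dist_key)['ARR_DELAY'].mean()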
Some interesting insights can be drawn from this result. Looking at the full result, SkyWest is the leading airline for under 200 miles but has no flights over 2,000 miles. In contrast, American Airlines has the fifth highest total for flights under 200 miles but has by far the most flights between 1,000 and 2,000 miles.
We can find more results when grouping by the cuts variable. For instance, we can find the 25th, 50th, and 75th percentiles of air time for each distance grouping. As air time is in minutes, we can divide by 60 to get hours. This will return a Series with a MultiIndex:
>>> (flights
... .groupby(cuts)
... ['AIR_TIME']
... .quantile(q=[.25, .5, .75])
... .div(60)
... .round(2)
... )
DIST
(-inf, 200.0] 0.25 0.43
0.50 0.50
0.75 0.57
(200.0, 500.0] 0.25 0.77
0.50 0.92
...
(1000.0, 2000.0] 0.50 2.93
0.75 3.40
(2000.0, inf] 0.25 4.30
0.50 4.70
0.75 5.03
Name: AIR_TIME, Length: 15, dtype: float64
We can use this information to create informative string labels when using the cut function. These labels replace the interval notation found in the index. We can also chain the .unstack method, which transposes the inner index level to column names:
>>> labels=['Under an Hour', '1 Hour', '1-2 Hours',
... '2-4 Hours', '4+ Hours']
>>> cuts2 = pd.cut(flights['DIST'], bins=bins, labels=labels)
>>> (flights
... .groupby(cuts2)
... ['AIRLINE']
... .value_counts(normalize=True)
... .round(3)
... .unstack()
... )
AIRLINE AA AS B6 ... US VX WN
DIST ...
Under an Hour 0.052 NaN NaN ... NaN NaN 0.009
1 Hour 0.071 0.001 0.007 ... 0.016 0.028 0.194
1-2 Hours 0.144 0.023 0.003 ... 0.025 0.004 0.138
2-4 Hours 0.264 0.016 0.003 ... 0.040 0.012 0.160
4+ Hours 0.212 0.012 0.080 ... 0.065 0.074 0.046
In the flights dataset, we have data on the origin and destination airport. It is trivial to count the number of flights originating in Houston and landing in Atlanta, for instance. What is more difficult is counting the total number of flights between the two cities.
In this recipe, we count the total number of flights between two cities, regardless of which one is the origin or destination. To accomplish this, we sort the origin and destination airports alphabetically so that each combination of airports always occurs in the same order. We can then use this new column arrangement to form groups and then to count.
>>> flights = pd.read_csv('data/flights.csv')
>>> flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
>>> flights_ct
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
...
SFO SNA 122
STL 20
SUN 10
TUS 20
XNA 2
Length: 1130, dtype: int64
>>> flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]]
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
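For this single pair of airports, we can already compute the combined total by summing the two directional counts (121 + 148):
>>> flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]].sum()
269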
To get one count per pair of airports regardless of direction, we sort the origin and destination values within each row alphabetically. The DataFrame .apply method can operate row by row when passed axis='columns' to do that:
>>> f_part3 = (flights
... [['ORG_AIR', 'DEST_AIR']]
... .apply(lambda ser:
... ser.sort_values().reset_index(drop=True),
... axis='columns')
... )
>>> f_part3
         0    1
0      LAX  SLC
1      DEN  IAD
2      DFW  VPS
3      DCA  DFW
4      LAX  MCI
...    ...  ...
58487  DFW  SFO
58488  LAS  SFO
58489  SBA  SFO
58490  ATL  MSP
58491  BOI  SFO
>>> rename_dict = {0:'AIR1', 1:'AIR2'}
>>> (flights
... [['ORG_AIR', 'DEST_AIR']]
... .apply(lambda ser:
... ser.sort_values().reset_index(drop=True),
... axis='columns')
... .rename(columns=rename_dict)
... .groupby(['AIR1', 'AIR2'])
... .size()
... )
AIR1 AIR2
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
...
SFO SNA 122
STL 20
SUN 10
TUS 20
XNA 2
Length: 1130, dtype: int64
>>> (flights
... [['ORG_AIR', 'DEST_AIR']]
... .apply(lambda ser:
... ser.sort_values().reset_index(drop=True),
... axis='columns')
... .rename(columns=rename_dict)
... .groupby(['AIR1', 'AIR2'])
... .size()
... .loc[('ATL', 'IAH')]
... )
269
>>> (flights
... [['ORG_AIR', 'DEST_AIR']]
... .apply(lambda ser:
... ser.sort_values().reset_index(drop=True),
... axis='columns')
... .rename(columns=rename_dict)
... .groupby(['AIR1', 'AIR2'])
... .size()
... .loc[('IAH', 'ATL')]
... )
Traceback (most recent call last)
...
KeyError: 'ATL'
In step 1, we form groups by the origin and destination airport columns and then apply the .size method to the groupby object, which returns the total number of rows for each group. Notice that we could have passed the string size to the .agg method to achieve the same result. In step 2, the total number of flights for each direction between Atlanta and Houston is selected. The result is a Series that has a MultiIndex with two levels. One way to select rows from a MultiIndex is to pass the .loc index operator a tuple of the exact level values. Here, we select two different rows, ('ATL', 'IAH') and ('IAH', 'ATL'), using a list of tuples to do this correctly.
Step 3 is the most important step in the recipe. We would like to have just one label for all flights between Atlanta and Houston, and so far we have two. If we sort each combination of origin and destination airports alphabetically, we get a single label for the flights between any pair of airports. To do this, we use the .apply method on a DataFrame. This is different from the groupby .apply method; no groups are formed in step 3.
The DataFrame .apply method must be passed a function. In this case, it's a lambda function that sorts each row. By default, the function is passed each column; we change the direction of computation by using axis='columns' (or axis=1). The lambda function has each row of data passed to it implicitly as a Series and returns a Series with sorted airport codes. We have to call .reset_index so that the values do not realign to their original columns after the function is applied.
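To see why .reset_index is needed, try the same lambda without it on a few rows. Because the sorted values keep their original labels, pandas aligns them back to those labels when assembling the result, silently undoing the sort; a minimal sketch:
>>> (flights
...     [['ORG_AIR', 'DEST_AIR']]
...     .head(3)
...     .apply(lambda ser: ser.sort_values(), axis='columns')
... )   # values land back in their original columns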
The .apply method iterates over all rows using the lambda function. After this operation completes, the values in the two columns are sorted within each row, and the column names (now 0 and 1) are meaningless. We rename them in the next step and then perform the same grouping and aggregation as in step 2. This time, all flights between Atlanta and Houston fall under the same label.
Steps 3 through 6 are expensive operations that take several seconds to complete. There are only about 60,000 rows, so this solution would not scale well to larger data. Calling the .apply method with axis='columns' (or axis=1) is one of the least performant operations in all of pandas. Internally, pandas loops over each row and gets no speed boost from NumPy. If possible, avoid using .apply with axis=1.
We can get a massive speed increase with the NumPy sort function. Let's use this function and analyze its output. By default, it sorts along the last axis, which for a two-dimensional array means each row:
>>> data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
>>> data_sorted[:10]
array([['LAX', 'SLC'],
['DEN', 'IAD'],
['DFW', 'VPS'],
['DCA', 'DFW'],
['LAX', 'MCI'],
['IAH', 'SAN'],
['DFW', 'MSY'],
['PHX', 'SFO'],
['ORD', 'STL'],
['IAH', 'SJC']], dtype=object)
A two-dimensional NumPy array is returned. NumPy does not do grouping operations, so let's use the DataFrame constructor to create a new DataFrame and check whether it equals the DataFrame from step 3:
>>> flights_sort2 = pd.DataFrame(data_sorted, columns=['AIR1', 'AIR2'])
>>> flights_sort2.equals(f_part3.rename(columns=rename_dict))
True
Because the DataFrames are the same, you can replace step 3 with the previous faster sorting routine. Let's time the difference between each of the different sorting methods:
>>> %%timeit
>>> flights_sort = (flights
... [['ORG_AIR', 'DEST_AIR']]
... .apply(lambda ser:
... ser.sort_values().reset_index(drop=True),
... axis='columns')
... )
1min 5s ± 2.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
>>> data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
>>> flights_sort2 = pd.DataFrame(data_sorted,
... columns=['AIR1', 'AIR2'])
14.6 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The NumPy solution is 4,452 times faster than using .apply with pandas in this example.
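Another vectorized route, sketched here, builds the sorted pair with NumPy's elementwise minimum and maximum ufuncs, which compare the airport code strings directly and also avoid the row-wise Python loop:
>>> (flights
...     .assign(AIR1=np.minimum(flights['ORG_AIR'], flights['DEST_AIR']),
...             AIR2=np.maximum(flights['ORG_AIR'], flights['DEST_AIR']))
...     .groupby(['AIR1', 'AIR2'])
...     .size()
... )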
One of the most important metrics for airlines is their on-time flight performance. The Federal Aviation Administration considers a flight delayed when it arrives at least 15 minutes later than its scheduled arrival time. With pandas, it is straightforward to calculate the total and percentage of on-time flights per airline. While these basic summary statistics are important, there are other non-trivial calculations of interest, such as finding the length of consecutive on-time flights for each airline at each of its origin airports.
In this recipe, we find the longest consecutive streak of on-time flights for each airline at each origin airport. This requires each value in a column to be aware of the value immediately following it. We make clever use of the .diff and .cumsum methods to find streaks before applying this methodology to each of the groups.
The max_streak function we develop in this section exposes a regression in pandas 1.0 and 1.0.1. This bug (https://github.com/pandas-dev/pandas/issues/31802) should be fixed in pandas 1.0.2.
Before tackling the flights dataset, let's practice finding streaks of ones with a small example Series:
>>> s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
>>> s
0 0
1 1
2 1
3 0
4 1
5 1
6 1
7 0
dtype: int64
Take the running total of the ones with the .cumsum method:
>>> s1 = s.cumsum()
>>> s1
0 0
1 1
2 2
3 2
4 3
5 4
6 5
7 5
dtype: int64
Multiply this cumulative sum by the original Series; every value that is not part of a streak of ones becomes zero:
>>> s.mul(s1)
0 0
1 1
2 2
3 0
4 3
5 4
6 5
7 0
dtype: int64
Find where each streak ends with the .diff method, which subtracts the previous value from the current:
>>> s.mul(s1).diff()
0 NaN
1 1.0
2 1.0
3 -2.0
4 3.0
5 1.0
6 1.0
7 -5.0
dtype: float64
Only the negative differences mark the end of a streak. Keep them, and make everything else missing, with the .where method:
>>> (s
... .mul(s.cumsum())
... .diff()
... .where(lambda x: x < 0)
... )
0 NaN
1 NaN
2 NaN
3 -2.0
4 NaN
5 NaN
6 NaN
7 -5.0
dtype: float64
Propagate these negative values down the Series with the .ffill method:
>>> (s
... .mul(s.cumsum())
... .diff()
... .where(lambda x: x < 0)
... .ffill()
... )
0 NaN
1 NaN
2 NaN
3 -2.0
4 -2.0
5 -2.0
6 -2.0
7 -5.0
dtype: float64
Finally, add this result back to the cumulative sum; the excess accumulation cancels out, and every streak counts up from zero again:
>>> (s
... .mul(s.cumsum())
... .diff()
... .where(lambda x: x < 0)
... .ffill()
... .add(s.cumsum(), fill_value=0)
... )
0 0.0
1 1.0
2 2.0
3 0.0
4 1.0
5 2.0
6 3.0
7 0.0
dtype: float64
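As a cross-check, a plain-Python computation of the longest run of ones on the toy Series agrees with the maximum value of 3 in the result above; a minimal sketch using itertools:
>>> from itertools import groupby
>>> max(len(list(run)) for val, run in groupby(s) if val == 1)
3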
Now let's apply this logic to the flights dataset. Read in the data and create an ON_TIME column indicating whether each flight arrived less than 15 minutes late:
>>> flights = pd.read_csv('data/flights.csv')
>>> (flights
... .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
... [['AIRLINE', 'ORG_AIR', 'ON_TIME']]
... )
AIRLINE ORG_AIR ON_TIME
0 WN LAX 0
1 UA DEN 1
2 MQ DFW 0
3 AA DFW 1
4 WN LAX 0
... ... ... ...
58487 AA SFO 1
58488 F9 LAS 1
58489 OO SFO 1
58490 WN MSP 0
58491 OO SFO 1
Wrap the streak logic from the previous steps into a function that returns the longest streak of ones:
>>> def max_streak(s):
... s1 = s.cumsum()
... return (s
... .mul(s1)
... .diff()
... .where(lambda x: x < 0)
... .ffill()
... .add(s1, fill_value=0)
... .max()
... )
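A quick sanity check on the toy Series confirms that the function finds the longest run of three ones (the result is a float because of the intermediate missing values):
>>> max_streak(s)
3.0
With that confirmed, we can pass it to .agg as an aggregating function: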
>>> (flights
... .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
... .sort_values(['MONTH', 'DAY', 'SCHED_DEP'])
... .groupby(['AIRLINE', 'ORG_AIR'])
... ['ON_TIME']
... .agg(['mean', 'size', max_streak])
... .round(2)
... )
mean size max_streak
AIRLINE ORG_AIR
AA ATL 0.82 233 15
DEN 0.74 219 17
DFW 0.78 4006 64
IAH 0.80 196 24
LAS 0.79 374 29
... ... ... ...
WN LAS 0.77 2031 39
LAX 0.70 1135 23
MSP 0.84 237 32
PHX 0.77 1724 33
SFO 0.76 445 17
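If you want just the single longest streak from this table, one short sketch is to chain .nlargest on the aggregated result:
>>> (flights
...     .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
...     .sort_values(['MONTH', 'DAY', 'SCHED_DEP'])
...     .groupby(['AIRLINE', 'ORG_AIR'])
...     ['ON_TIME']
...     .agg(['mean', 'size', max_streak])
...     .nlargest(1, 'max_streak')
... )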
Finding streaks in the data is not a straightforward operation in pandas and requires methods that look ahead or behind, such as .diff or .shift, or those that remember their current state, such as .cumsum. The final result from the first seven steps is a Series the same length as the original that keeps track of all consecutive ones. Throughout these steps, we use the .mul and .add methods instead of their operator equivalents, * and +. In my opinion, this allows for a slightly cleaner progression of calculations from left to right. You can, of course, replace these with the actual operators.
Ideally, we would like to tell pandas to apply the .cumsum method at the start of each streak and to reset itself after the end of each one. It takes many steps to convey this message to pandas. Step 2 accumulates all the ones in the Series as a whole; the rest of the steps slowly remove any excess accumulation. To identify this excess, we need to find the end of each streak and subtract this value from the beginning of the next streak.
To find the end of each streak, we cleverly make all values not part of a streak zero by multiplying the cumulative sum by the original Series of zeros and ones in step 3. The first zero following a non-zero value marks the end of a streak. That's good, but we still need to eliminate the excess accumulation; knowing where a streak ends doesn't quite get us there.
In step 4, we use the .diff method to find this excess. The .diff method takes the difference between the current value and a value located a set number of rows away. By default, the difference between the current and the immediately preceding value is returned.
Only the negative values are meaningful in step 4; they immediately follow the end of a streak and need to be propagated down until the end of the following streak. To eliminate (make missing) all the values we don't care about, we use the .where method (this is different from the NumPy where function), which takes a Boolean array of the same size as the calling Series. By default, all the True values remain the same, while the False values become missing. The .where method also allows you to use the calling Series as part of the condition by taking a function as its first parameter. An anonymous function, which gets passed the calling Series implicitly, checks whether each value is less than zero. The result of step 5 is a Series where only the negative values are preserved, with the rest changed to missing.
The .ffill method in step 6 replaces missing values with the last non-missing value going down the Series. As the first three values don't follow a non-missing value, they remain missing. We finally have a Series that removes the excess accumulation. In step 7, we add our accumulation Series to this result to get streaks that all begin from zero; the .add method allows us to replace the missing values via the fill_value parameter. This completes the process of finding streaks of ones in the dataset. When doing complex logic like this, it is a good idea to use a small dataset where you know what the final output will be. It would be quite a difficult task to start at step 8 and build this streak-finding logic while grouping.
In step 8, we create the ON_TIME column. One item of note is that canceled flights have missing values for ARR_DELAY, which do not pass the Boolean condition and therefore result in a zero in the ON_TIME column; canceled flights are treated the same as delayed ones.
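If you would rather leave cancellations out of the analysis entirely, a sketch that filters them first using the CANCELLED column:
>>> (flights
...     .query('CANCELLED == 0')
...     .assign(ON_TIME=lambda df_: df_['ARR_DELAY'].lt(15).astype(int))
...     .sort_values(['MONTH', 'DAY', 'SCHED_DEP'])
...     .groupby(['AIRLINE', 'ORG_AIR'])
...     ['ON_TIME']
...     .agg(['mean', 'size', max_streak])
... )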
Step 9 turns our logic from the first seven steps into a function and chains the .max method to return the longest streak. As our function returns a single value, it is formally an aggregating function and can be passed to the .agg method in step 10. To ensure that we are looking at consecutive flights, we use the .sort_values method to sort by date and scheduled departure time.
Now that we have found the longest streaks of on-time arrivals, we can easily find the opposite – the longest streak of delayed arrivals. The following function returns two rows for each group passed to it. The first row is the start of the streak, and the last row is the end of the streak. Each row contains the month and day that the streak started and ended, along with the total streak length:
>>> def max_delay_streak(df):
...     df = df.reset_index(drop=True)
...     late = 1 - df['ON_TIME']    # 1 for delayed, 0 for on-time
...     late_sum = late.cumsum()
...     streak = (late              # same streak logic as max_streak
...         .mul(late_sum)
...         .diff()
...         .where(lambda x: x < 0)
...         .ffill()
...         .add(late_sum, fill_value=0)
...     )
...     last_idx = streak.idxmax()                # row where the streak ends
...     first_idx = last_idx - streak.max() + 1   # row where it begins
...     res = (df
...         .loc[[first_idx, last_idx], ['MONTH', 'DAY']]
...         .assign(streak=streak.max())
...     )
...     res.index = ['first', 'last']
...     return res
>>> (flights
... .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
... .sort_values(['MONTH', 'DAY', 'SCHED_DEP'])
... .groupby(['AIRLINE', 'ORG_AIR'])
... .apply(max_delay_streak)
... .sort_values('streak', ascending=False)
... )
MONTH DAY streak
AIRLINE ORG_AIR
AA DFW first 2.0 26.0 38.0
last 3.0 1.0 38.0
MQ ORD last 1.0 12.0 28.0
first 1.0 6.0 28.0
DFW last 2.0 26.0 25.0
... ... ... ...
US LAS last 1.0 7.0 1.0
AS ATL first 5.0 4.0 1.0
OO LAS first 2.0 8.0 1.0
EV PHX last 8.0 1.0 0.0
first NaN NaN 0.0
As we are using the groupby .apply method, a DataFrame of each group is passed to the max_delay_streak function. Inside this function, the index of the DataFrame is dropped and replaced by a RangeIndex so that we can easily find the first and last rows of the streak. The ON_TIME column is inverted, and then the same logic is used to find streaks of delayed flights. The indexes of the first and last rows of the streak are stored as variables and then used to select the month and day when the streak began and ended. We use a DataFrame to return our results, labeling the index to make the final result clearer.
Our final results show the longest delayed streaks accompanied by their first and last dates. Let's investigate to see whether we can find out why these delays happened. Inclement weather is a common reason for delayed or canceled flights. Looking at the first row, American Airlines (AA) started a streak of 38 delayed flights in a row out of the Dallas Fort-Worth (DFW) airport beginning February 26 and lasting until March 1, 2015. Looking at historical weather data from February 27, 2015, two inches of snow fell, which was a record for that day. This was a major weather event for DFW and caused problems for the entire city. Notice that DFW makes another appearance as the third-longest streak, but this time a few days earlier and for a different airline.