9

Grouping for Aggregation, Filtration, and Transformation

Introduction

One of the most fundamental tasks during data analysis involves splitting data into independent groups before performing a calculation on each group. This methodology has been around for quite some time but has more recently been referred to as split-apply-combine. This chapter covers the powerful .groupby method, which allows you to group your data in any way imaginable and apply any type of function independently to each group before returning a single dataset.

Before we get started with the recipes, we will need to know just a little terminology. All basic groupby operations have grouping columns, and each unique combination of values in these columns represents an independent grouping of the data. The syntax looks as follows:

df.groupby(['list', 'of', 'grouping', 'columns'])
df.groupby('single_column')  # when grouping by a single column 

The result of calling the .groupby method is a groupby object. It is this groupby object that will be the engine that drives all the calculations for this entire chapter. pandas does very little when creating this groupby object, merely validating that grouping is possible. You will have to chain methods on this groupby object to unleash its powers.

The most common use of the .groupby method is to perform an aggregation. What is an aggregation? An aggregation takes place when a sequence of many inputs is summarized or combined into a single output value. For example, summing up all the values of a column or finding its maximum are aggregations applied to a sequence of data. In short, an aggregation takes a sequence and reduces it to a single value.

In addition to the grouping columns defined during the introduction, most aggregations have two other components, the aggregating columns and aggregating functions. The aggregating columns are the columns whose values will be aggregated. The aggregating functions define what aggregations take place. Aggregation functions include sum, min, max, mean, count, variance, std, and so on.
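
To make these components concrete, here is a minimal sketch using a tiny, made-up DataFrame (the toy column names grp and val are invented for illustration). The grouping column is grp, the aggregating column is val, and the aggregating function is sum:

>>> import pandas as pd
>>> toy = pd.DataFrame({'grp': ['a', 'a', 'b'],
...                     'val': [1, 2, 10]})
>>> toy.groupby('grp').agg({'val': 'sum'})
     val
grp     
a      3
b     10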

Defining an aggregation

In this recipe, we examine the flights dataset and perform the simplest aggregation involving only a single grouping column, a single aggregating column, and a single aggregating function. We will find the average arrival delay for each airline. pandas has different syntaxes to create an aggregation, and this recipe will show them.

How to do it…

  1. Read in the flights dataset:
    >>> import pandas as pd
    >>> import numpy as np
    >>> flights = pd.read_csv('data/flights.csv')
    >>> flights.head()
    0      1    1        4  ...      65.0        0         0
    1      1    1        4  ...     -13.0        0         0
    2      1    1        4  ...      35.0        0         0
    3      1    1        4  ...      -7.0        0         0
    4      1    1        4  ...      39.0        0         0
    
  2. Define the grouping column (AIRLINE), aggregating column (ARR_DELAY), and aggregating function (mean). Place the grouping column in the .groupby method and then call the .agg method with a dictionary pairing the aggregating column with its aggregating function. Passing in a dictionary returns a DataFrame instance:
    >>> (flights
    ...      .groupby('AIRLINE')
    ...      .agg({'ARR_DELAY':'mean'})
    ... )
             ARR_DELAY
    AIRLINE           
    AA        5.542661
    AS       -0.833333
    B6        8.692593
    DL        0.339691
    EV        7.034580
    ...            ...
    OO        7.593463
    UA        7.765755
    US        1.681105
    VX        5.348884
    WN        6.397353
    

    Alternatively, you may place the aggregating column in the index operator and then pass the aggregating function as a string to .agg. This will return a Series:

    >>> (flights
    ...      .groupby('AIRLINE')
    ...      ['ARR_DELAY']
    ...      .agg('mean')
    ... )
    AIRLINE
    AA    5.542661
    AS   -0.833333
    B6    8.692593
    DL    0.339691
    EV    7.034580
            ...   
    OO    7.593463
    UA    7.765755
    US    1.681105
    VX    5.348884
    WN    6.397353
    Name: ARR_DELAY, Length: 14, dtype: float64
    
  3. The string names used in the previous step are a convenience that pandas offers you to refer to a particular aggregation function. You can pass any aggregating function directly to the .agg method, such as the NumPy mean function. The output is the same as the previous step:
    >>> (flights
    ...     .groupby('AIRLINE')
    ...     ['ARR_DELAY']
    ...     .agg(np.mean)
    ... )
    AIRLINE
    AA    5.542661
    AS   -0.833333
    B6    8.692593
    DL    0.339691
    EV    7.034580
            ...   
    OO    7.593463
    UA    7.765755
    US    1.681105
    VX    5.348884
    WN    6.397353
    Name: ARR_DELAY, Length: 14, dtype: float64
    
  4. It's possible to skip the .agg method altogether in this case and call the .mean method directly on the groupby object. This output is also the same as step 3:
    >>> (flights
    ...     .groupby('AIRLINE')
    ...     ['ARR_DELAY']
    ...     .mean()
    ... )
    AIRLINE
    AA    5.542661
    AS   -0.833333
    B6    8.692593
    DL    0.339691
    EV    7.034580
            ...   
    OO    7.593463
    UA    7.765755
    US    1.681105
    VX    5.348884
    WN    6.397353
    Name: ARR_DELAY, Length: 14, dtype: float64
    

How it works…

The syntax for the .groupby method is not as straightforward as other methods. Let's intercept the chain of methods in step 2 by storing the result of the .groupby method as its own variable:

>>> grouped = flights.groupby('AIRLINE')
>>> type(grouped)
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>

A completely new intermediate object is first produced with its own distinct attributes and methods. No calculations take place at this stage. pandas merely validates the grouping columns. This groupby object has an .agg method to perform aggregations. One of the ways to use this method is to pass it a dictionary mapping the aggregating column to the aggregating function, as done in step 2. If you pass in a dictionary, the result will be a DataFrame.

The pandas library often has more than one way to perform the same operation. The second half of step 2 shows another flavor: instead of identifying the aggregating column in the dictionary, place it inside the index operator as if you were selecting it as a column from a DataFrame, and then pass the function's string name as a scalar to the .agg method. The result, in this case, is a Series.

You may pass any aggregating function to the .agg method. pandas allows you to use string names for simplicity, but you may also pass an aggregating function explicitly, as done in step 3 with np.mean. NumPy provides many functions that aggregate values.

Step 4 shows one last syntax flavor. When you are only applying a single aggregating function, as in this example, you can often call it directly as a method on the groupby object itself, without .agg. Not all aggregation functions have a method equivalent, but most do.

There's more…

If you do not use an aggregating function with .agg, pandas raises an exception. For instance, let's see what happens when we apply the square root function to each group:

>>> (flights
...    .groupby('AIRLINE')
...    ['ARR_DELAY']
...    .agg(np.sqrt)
... )
Traceback (most recent call last):
  ...
ValueError: function does not reduce
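
If you do want to apply a non-reducing function to each group, reach for the .transform method (covered later in this chapter) rather than .agg. As a hedged sketch that is not part of the original recipe, the following returns a value for every row; the square root of a negative delay simply comes back as NaN:

>>> (flights
...     .groupby('AIRLINE')
...     ['ARR_DELAY']
...     .transform(np.sqrt)
... )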

Grouping and aggregating with multiple columns and functions

It is possible to group and aggregate with multiple columns. The syntax is slightly different than it is for grouping and aggregating with a single column. As usual with any kind of grouping operation, it helps to identify the three components: the grouping columns, aggregating columns, and aggregating functions.

In this recipe, we showcase the flexibility of the .groupby method by answering the following queries:

  • Finding the number of canceled flights for every airline per weekday
  • Finding the number and percentage of canceled and diverted flights for every airline per weekday
  • For each origin and destination, finding the total number of flights, the number and percentage of canceled flights, and the average and variance of the airtime

How to do it…

  1. Read in the flights dataset, and answer the first query by defining the grouping columns (AIRLINE, WEEKDAY), the aggregating column (CANCELLED), and the aggregating function (sum):
    >>> (flights
    ...     .groupby(['AIRLINE', 'WEEKDAY'])
    ...     ['CANCELLED'] 
    ...     .agg('sum')
    ... )
    AIRLINE  WEEKDAY
    AA       1          41
             2           9
             3          16
             4          20
             5          18
                        ..
    WN       3          18
             4          10
             5           7
             6          10
             7           7
    Name: CANCELLED, Length: 98, dtype: int64
    
  2. Answer the second query by using a list for the grouping columns, a list for the aggregating columns, and a list for the aggregating functions:
    >>> (flights
    ...     .groupby(['AIRLINE', 'WEEKDAY']) 
    ...     [['CANCELLED', 'DIVERTED']]
    ...     .agg(['sum', 'mean'])
    ... )
                    CANCELLED           DIVERTED          
                          sum      mean      sum      mean
    AIRLINE WEEKDAY                                       
    AA      1              41  0.032106        6  0.004699
            2               9  0.007341        2  0.001631
            3              16  0.011949        2  0.001494
            4              20  0.015004        5  0.003751
            5              18  0.014151        1  0.000786
    ...                   ...       ...      ...       ...
    WN      3              18  0.014118        2  0.001569
            4              10  0.007911        4  0.003165
            5               7  0.005828        0  0.000000
            6              10  0.010132        3  0.003040
            7               7  0.006066        3  0.002600
    
  3. Answer the third query using a dictionary in the .agg method to map specific aggregating columns to specific aggregating functions:
    >>> (flights
    ...     .groupby(['ORG_AIR', 'DEST_AIR'])
    ...     .agg({'CANCELLED':['sum', 'mean', 'size'],
    ...           'AIR_TIME':['mean', 'var']})
    ... )
                     CANCELLED            ...    AIR_TIME            
                           sum      mean  ...        mean         var
    ORG_AIR DEST_AIR                      ...                        
    ATL     ABE              0  0.000000  ...   96.387097   45.778495
            ABQ              0  0.000000  ...  170.500000   87.866667
            ABY              0  0.000000  ...   28.578947    6.590643
            ACY              0  0.000000  ...   91.333333   11.466667
            AEX              0  0.000000  ...   78.725000   47.332692
    ...                    ...       ...  ...         ...         ...
    SFO     SNA              4  0.032787  ...   64.059322   11.338331
            STL              0  0.000000  ...  198.900000  101.042105
            SUN              0  0.000000  ...   78.000000   25.777778
            TUS              0  0.000000  ...  100.200000   35.221053
            XNA              0  0.000000  ...  173.500000    0.500000
    
  4. Starting in pandas 0.25, there is a named aggregation object, pd.NamedAgg, that can create non-hierarchical (flat) columns. We will repeat the previous query using it:
    >>> (flights
    ...     .groupby(['ORG_AIR', 'DEST_AIR'])
    ...     .agg(sum_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='sum'),
    ...          mean_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='mean'),
    ...          size_cancelled=pd.NamedAgg(column='CANCELLED', aggfunc='size'),
    ...          mean_air_time=pd.NamedAgg(column='AIR_TIME', aggfunc='mean'),
    ...          var_air_time=pd.NamedAgg(column='AIR_TIME', aggfunc='var'))
    ... )
                       sum_cancelled  mean_cancelled  ...  mean_air_time
    ORG_AIR DEST_AIR                                  ...             
    ATL     ABE                 0        0.000000     ...    96.387097
            ABQ                 0        0.000000     ...   170.500000
            ABY                 0        0.000000     ...    28.578947
            ACY                 0        0.000000     ...    91.333333
            AEX                 0        0.000000     ...    78.725000
    ...                       ...             ...     ...          ...
    SFO     SNA                 4        0.032787     ...    64.059322
            STL                 0        0.000000     ...   198.900000
            SUN                 0        0.000000     ...    78.000000
            TUS                 0        0.000000     ...   100.200000
            XNA                 0        0.000000     ...   173.500000
    

How it works…

To group by multiple columns as in step 1, we pass a list of the string names to the .groupby method. Each unique combination of AIRLINE and WEEKDAY forms its own group. Within each of these groups, the sum of the canceled flights is calculated and then returned as a Series.

Step 2 groups by both AIRLINE and WEEKDAY, but this time aggregates two columns. It applies each of the two aggregation functions, using the strings sum and mean, to each column, resulting in four returned columns per group.

Step 3 goes even further, and uses a dictionary to map specific aggregating columns to different aggregating functions. Notice that the size aggregating function returns the total number of rows per group. This is different than the count aggregating function, which returns the number of non-missing values per group.
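
To see the size versus count distinction in isolation, here is a small hedged sketch, assuming (as is typical for this dataset) that AIR_TIME is missing for flights that never took off; count will be smaller than size wherever AIR_TIME is missing:

>>> (flights
...     .groupby('AIRLINE')
...     ['AIR_TIME']
...     .agg(['size', 'count'])
... )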

Step 4 shows named aggregation, the newer syntax for creating flat columns.

There's more…

To flatten the columns in step 3, you can use the .to_flat_index method (available since pandas 0.24):

>>> res = (flights
...     .groupby(['ORG_AIR', 'DEST_AIR'])
...     .agg({'CANCELLED':['sum', 'mean', 'size'],
...           'AIR_TIME':['mean', 'var']})
... )
>>> res.columns = ['_'.join(x) for x in
...     res.columns.to_flat_index()]
>>> res
                  CANCELLED_sum  CANCELLED_mean  ...  AIR_TIME_mean 
ORG_AIR DEST_AIR                                 ...
ATL     ABE                 0       0.000000     ...    96.387097
        ABQ                 0       0.000000     ...   170.500000
        ABY                 0       0.000000     ...    28.578947
        ACY                 0       0.000000     ...    91.333333
        AEX                 0       0.000000     ...    78.725000
...                       ...            ...     ...          ...
SFO     SNA                 4       0.032787     ...    64.059322
        STL                 0       0.000000     ...   198.900000
        SUN                 0       0.000000     ...    78.000000
        TUS                 0       0.000000     ...   100.200000
        XNA                 0       0.000000     ...   173.500000

That is kind of ugly and I would prefer a chain operation to flatten the columns. Unfortunately, the .reindex method does not support flattening. Instead, we will have to leverage the .pipe method:

>>> def flatten_cols(df):
...     df.columns = ['_'.join(x) for x in
...         df.columns.to_flat_index()]
...     return df
>>> res = (flights
...     .groupby(['ORG_AIR', 'DEST_AIR'])
...     .agg({'CANCELLED':['sum', 'mean', 'size'],
...           'AIR_TIME':['mean', 'var']})
...     .pipe(flatten_cols)
... )
>>> res
                  CANCELLED_sum  CANCELLED_mean  ...  AIR_TIME_mean 
ORG_AIR DEST_AIR                                 ...
ATL     ABE                 0       0.000000     ...    96.387097
        ABQ                 0       0.000000     ...   170.500000
        ABY                 0       0.000000     ...    28.578947
        ACY                 0       0.000000     ...    91.333333
        AEX                 0       0.000000     ...    78.725000
...                       ...            ...     ...          ...
SFO     SNA                 4       0.032787     ...    64.059322
        STL                 0       0.000000     ...   198.900000
        SUN                 0       0.000000     ...    78.000000
        TUS                 0       0.000000     ...   100.200000
        XNA                 0       0.000000     ...   173.500000

Be aware that when grouping with multiple columns, pandas creates a hierarchical index, or MultiIndex. In the preceding example, this returned 1,130 rows. However, if one of the columns we group by is categorical (having a category dtype rather than an object dtype), pandas creates a Cartesian product of all combinations of every level, which in this case returns 2,710 rows. Categorical columns with higher cardinality can produce far more combinations:

>>> res = (flights
...     .assign(ORG_AIR=flights.ORG_AIR.astype('category'))
...     .groupby(['ORG_AIR', 'DEST_AIR'])
...     .agg({'CANCELLED':['sum', 'mean', 'size'],
...           'AIR_TIME':['mean', 'var']})
... )
>>> res
                 CANCELLED       ...    AIR_TIME           
                       sum mean  ...        mean        var
ORG_AIR DEST_AIR                 ...                       
ATL     ABE            0.0  0.0  ...   96.387097  45.778495
        ABI            NaN  NaN  ...         NaN        NaN
        ABQ            0.0  0.0  ...  170.500000  87.866667
        ABR            NaN  NaN  ...         NaN        NaN
        ABY            0.0  0.0  ...   28.578947   6.590643
...                    ...  ...  ...         ...        ...
SFO     TYS            NaN  NaN  ...         NaN        NaN
        VLD            NaN  NaN  ...         NaN        NaN
        VPS            NaN  NaN  ...         NaN        NaN
        XNA            0.0  0.0  ...  173.500000   0.500000
        YUM            NaN  NaN  ...         NaN        NaN

To remedy this combinatoric explosion, use the observed=True parameter. This makes grouping on categorical columns behave like grouping on string columns, showing only the observed combinations rather than the full Cartesian product:

>>> res = (flights
...     .assign(ORG_AIR=flights.ORG_AIR.astype('category'))
...     .groupby(['ORG_AIR', 'DEST_AIR'], observed=True)
...     .agg({'CANCELLED':['sum', 'mean', 'size'],
...           'AIR_TIME':['mean', 'var']})
... )
>>> res
                 CANCELLED            ...    AIR_TIME            
                       sum      mean  ...        mean         var
ORG_AIR DEST_AIR                      ...                        
LAX     ABQ              1  0.018182  ...   89.259259   29.403215
        ANC              0  0.000000  ...  307.428571   78.952381
        ASE              1  0.038462  ...  102.920000  102.243333
        ATL              0  0.000000  ...  224.201149  127.155837
        AUS              0  0.000000  ...  150.537500   57.897310
...                    ...       ...  ...         ...         ...
MSP     TTN              1  0.125000  ...  124.428571   57.952381
        TUL              0  0.000000  ...   91.611111   63.075163
        TUS              0  0.000000  ...  176.000000   32.000000
        TVC              0  0.000000  ...   56.600000   10.300000
        XNA              0  0.000000  ...   90.642857  115.939560

Removing the MultiIndex after grouping

Inevitably, when using groupby, you will create a MultiIndex. MultiIndexes can happen in both the index and the columns. DataFrames with MultiIndexes are more difficult to navigate and occasionally have confusing column names as well.

In this recipe, we perform an aggregation with the .groupby method to create a DataFrame with a MultiIndex for the rows and columns. Then, we manipulate the index so that it has a single level and the column names are descriptive.

How to do it…

  1. Read in the flights dataset, write a statement to find the total and average miles flown, and the maximum and minimum arrival delay for each airline for each weekday:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> airline_info = (flights
    ...     .groupby(['AIRLINE', 'WEEKDAY'])
    ...     .agg({'DIST':['sum', 'mean'],
    ...           'ARR_DELAY':['min', 'max']}) 
    ...     .astype(int)
    ... )
    >>> airline_info
                        DIST       ARR_DELAY     
                         sum  mean       min  max
    AIRLINE WEEKDAY                              
    AA      1        1455386  1139       -60  551
            2        1358256  1107       -52  725
            3        1496665  1117       -45  473
            4        1452394  1089       -46  349
            5        1427749  1122       -41  732
    ...                  ...   ...       ...  ...
    WN      3         997213   782       -38  262
            4        1024854   810       -52  284
            5         981036   816       -44  244
            6         823946   834       -41  290
            7         945679   819       -45  261
    
  2. Both the rows and columns are labeled by a MultiIndex with two levels. Let's squash both down to just a single level. To address the columns, we use the MultiIndex method, .to_flat_index. Let's display the output of each level and then concatenate both levels before setting it as the new column values:
    >>> airline_info.columns.get_level_values(0)
    Index(['DIST', 'DIST', 'ARR_DELAY', 'ARR_DELAY'], dtype='object')
    >>> airline_info.columns.get_level_values(1)
    Index(['sum', 'mean', 'min', 'max'], dtype='object')
    >>> airline_info.columns.to_flat_index()
    Index([('DIST', 'sum'), ('DIST', 'mean'), ('ARR_DELAY', 'min'),
           ('ARR_DELAY', 'max')],
          dtype='object')
    >>> airline_info.columns = ['_'.join(x) for x in
    ...     airline_info.columns.to_flat_index()]
    >>> airline_info
                     DIST_sum  DIST_mean  ARR_DELAY_min  ARR_DELAY_max
    AIRLINE WEEKDAY                                                 
    AA      1         1455386       1139          -60            551
            2         1358256       1107          -52            725
            3         1496665       1117          -45            473
            4         1452394       1089          -46            349
            5         1427749       1122          -41            732
    ...                   ...        ...          ...            ...
    WN      3          997213        782          -38            262
            4         1024854        810          -52            284
            5          981036        816          -44            244
            6          823946        834          -41            290
            7          945679        819          -45            261
    
  3. A quick way to get rid of the row MultiIndex is to use the .reset_index method:
    >>> airline_info.reset_index()
       AIRLINE  WEEKDAY  ...  ARR_DELAY_min  ARR_DELAY_max
    0       AA        1  ...          -60            551
    1       AA        2  ...          -52            725
    2       AA        3  ...          -45            473
    3       AA        4  ...          -46            349
    4       AA        5  ...          -41            732
    ..     ...      ...  ...          ...            ...
    93      WN        3  ...          -38            262
    94      WN        4  ...          -52            284
    95      WN        5  ...          -44            244
    96      WN        6  ...          -41            290
    97      WN        7  ...          -45            261
    
  4. Refactor the code to make it readable. Use the pandas 0.25 functionality to flatten columns automatically:
    >>> (flights
    ...     .groupby(['AIRLINE', 'WEEKDAY'])
    ...     .agg(dist_sum=pd.NamedAgg(column='DIST', aggfunc='sum'),
    ...          dist_mean=pd.NamedAgg(column='DIST', aggfunc='mean'),
    ...          arr_delay_min=pd.NamedAgg(column='ARR_DELAY', aggfunc='min'),
    ...          arr_delay_max=pd.NamedAgg(column='ARR_DELAY', aggfunc='max'))
    ...     .astype(int)
    ...     .reset_index()
    ... )
       AIRLINE  WEEKDAY  ...  ARR_DELAY_min  ARR_DELAY_max
    0       AA        1  ...            -60            551
    1       AA        2  ...            -52            725
    2       AA        3  ...            -45            473
    3       AA        4  ...            -46            349
    4       AA        5  ...            -41            732
    ..     ...      ...  ...            ...            ...
    93      WN        3  ...            -38            262
    94      WN        4  ...            -52            284
    95      WN        5  ...            -44            244
    96      WN        6  ...            -41            290
    97      WN        7  ...            -45            261
    

How it works…

When using the .agg method to perform an aggregation on multiple columns, pandas creates an index object with two levels. The aggregating columns become the top level, and the aggregating functions become the bottom level. pandas displays MultiIndex levels differently to single-level columns. Except for the innermost levels, repeated index values do not get displayed in Jupyter or a Python shell. You can inspect the DataFrame from step 1 to verify this. For instance, the DIST column shows up only once, but it refers to both of the first two columns.

Step 2 defines new columns by first retrieving the underlying values of each of the levels with the MultiIndex method, .get_level_values. This method accepts an integer identifying the index level. They are numbered beginning with zero from the outside (top/left). We use the recently added index method, .to_flat_index, in combination with a list comprehension to create strings for each column. We assign these new values to the columns attribute.

In step 3, we make use of the .reset_index method to push both index levels into columns. This is easy, and I wish there was a similar method for column name compaction.

In step 4, we use the NamedAgg class (new in pandas 0.25) to create flat aggregate columns.
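
As an aside, named aggregation also accepts plain two-element tuples of (column, aggfunc) in place of pd.NamedAgg objects. A minimal sketch of the same aggregation using that shorthand:

>>> (flights
...     .groupby(['AIRLINE', 'WEEKDAY'])
...     .agg(dist_sum=('DIST', 'sum'),
...          dist_mean=('DIST', 'mean'),
...          arr_delay_min=('ARR_DELAY', 'min'),
...          arr_delay_max=('ARR_DELAY', 'max'))
... )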

There's more…

By default, at the end of a groupby operation, pandas puts all of the grouping columns in the index. The as_index parameter in the .groupby method can be set to False to avoid this behavior. You can chain the .reset_index method after grouping to get the same effect as seen in step 3. Let's see an example of this by finding the average distance traveled per flight from each airline:

>>> (flights
...     .groupby(['AIRLINE'], as_index=False)
...     ['DIST']
...     .agg('mean')
...     .round(0)
... )
   AIRLINE    DIST
0       AA  1114.0
1       AS  1066.0
2       B6  1772.0
3       DL   866.0
4       EV   460.0
..     ...     ...
9       OO   511.0
10      UA  1231.0
11      US  1181.0
12      VX  1240.0
13      WN   810.0

Take a look at the order of the airlines in the previous result. By default, pandas sorts the grouping columns. The sort parameter exists within the .groupby method and defaults to True. You may set it to False to keep the order of the grouping columns the same as how they are encountered in the dataset. There is a small performance improvement by not sorting your data.
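
As a quick sketch of the sort parameter (not shown in the original recipe), passing sort=False keeps the airlines in the order they are first encountered in the dataset rather than in alphabetical order:

>>> (flights
...     .groupby('AIRLINE', as_index=False, sort=False)
...     ['DIST']
...     .agg('mean')
...     .round(0)
... )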

Grouping with a custom aggregation function

pandas provides a number of aggregation functions to use with the groupby object. At some point, you may need to write your own custom user-defined function that does not exist in pandas or NumPy.

In this recipe, we use the college dataset to calculate the mean and standard deviation of the undergraduate student population per state. We then use this information to find the maximum number of standard deviations from the mean that any single population value is per state.

How to do it…

  1. Read in the college dataset, and find the mean and standard deviation of the undergraduate population by state:
    >>> college = pd.read_csv('data/college.csv')
    >>> (college
    ...     .groupby('STABBR')
    ...     ['UGDS']
    ...     .agg(['mean', 'std'])
    ...     .round(0)
    ... )
              mean      std
    STABBR                 
    AK      2493.0   4052.0
    AL      2790.0   4658.0
    AR      1644.0   3143.0
    AS      1276.0      NaN
    AZ      4130.0  14894.0
    ...        ...      ...
    VT      1513.0   2194.0
    WA      2271.0   4124.0
    WI      2655.0   4615.0
    WV      1758.0   5957.0
    WY      2244.0   2745.0
    
  2. This output isn't quite what we desire. We are not looking for the mean and standard deviations of the entire group but the maximum number of standard deviations away from the mean for any one institution. To calculate this, we need to subtract the mean undergraduate population by state from each institution's undergraduate population and then divide by the standard deviation. This standardizes the undergraduate population for each group. We can then take the maximum of the absolute value of these scores to find the one that is farthest away from the mean. pandas does not provide a function capable of doing this. Instead, we will need to create a custom function:
    >>> def max_deviation(s):
    ...     std_score = (s - s.mean()) / s.std()
    ...     return std_score.abs().max()
    
  3. After defining the function, pass it directly to the .agg method to complete the aggregation:
    >>> (college
    ...     .groupby('STABBR')
    ...     ['UGDS']
    ...     .agg(max_deviation)
    ...     .round(1)
    ... )
    STABBR
    AK    2.6
    AL    5.8
    AR    6.3
    AS    NaN
    AZ    9.9
         ... 
    VT    3.8
    WA    6.6
    WI    5.8
    WV    7.2
    WY    2.8
    Name: UGDS, Length: 59, dtype: float64
    

How it works…

There is no predefined pandas function to calculate the maximum number of standard deviations away from the mean. We need to write our own function. Notice that this custom function, max_deviation, accepts a single parameter, s.

In step 3, you will notice that the function name is placed inside the .agg method without directly being called. Nowhere is the parameter s explicitly passed to max_deviation. Instead, pandas implicitly passes the UGDS column as a Series to max_deviation.

The max_deviation function is called once for each group. As s is a Series, all normal Series methods are available. It subtracts the mean of that particular grouping from each of the values in the group before dividing by the standard deviation in a process called standardization.

As we are interested in absolute deviation from the mean, we take the absolute value from all the standardized scores and return the maximum. The .agg method requires that we return a scalar from the function, or else an exception will be raised.

pandas defaults to using the sample standard deviation, which is undefined for any groups with just a single value. For instance, the state abbreviation AS (American Samoa) has a missing value returned as it has only a single institution in the dataset.
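
A small sketch illustrates why AS comes back missing, using the single AS value from step 1: the sample standard deviation (the pandas default, ddof=1) of one observation is undefined, whereas the population standard deviation (ddof=0) would be zero:

>>> pd.Series([1276.0]).std()        # sample std of a single value
nan
>>> pd.Series([1276.0]).std(ddof=0)  # population std is defined
0.0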

There's more…

It is possible to apply our custom function to multiple aggregating columns. We simply add more column names to the indexing operator. The max_deviation function only works with numeric columns:

>>> (college
...     .groupby('STABBR')
...     [['UGDS', 'SATVRMID', 'SATMTMID']]
...     .agg(max_deviation)
...     .round(1)
... )
        UGDS  SATVRMID  SATMTMID
STABBR                          
AK       2.6       NaN       NaN
AL       5.8       1.6       1.8
AR       6.3       2.2       2.3
AS       NaN       NaN       NaN
AZ       9.9       1.9       1.4
...      ...       ...       ...
VT       3.8       1.9       1.9
WA       6.6       2.2       2.0
WI       5.8       2.4       2.2
WV       7.2       1.7       2.1
WY       2.8       NaN       NaN

You can also use your custom aggregation function along with the prebuilt functions. The following does this and groups by state and religious affiliation:

>>> (college
...     .groupby(['STABBR', 'RELAFFIL']) 
...     [['UGDS', 'SATVRMID', 'SATMTMID']]
...     .agg([max_deviation, 'mean', 'std'])
...     .round(1)
... )
                         UGDS          ... SATMTMID      
                max_deviation    mean  ...     mean   std
STABBR RELAFFIL                        ...               
AK     0                 2.1   3508.9  ...      NaN   NaN
       1                 1.1    123.3  ...    503.0   NaN
AL     0                 5.2   3248.8  ...    515.8  56.7
       1                 2.4    979.7  ...    485.6  61.4
AR     0                 5.8   1793.7  ...    503.6  39.0
...                      ...      ...  ...      ...   ...
WI     0                 5.3   2879.1  ...    591.2  85.7
       1                 3.4   1716.2  ...    526.6  42.5
WV     0                 6.9   1873.9  ...    480.0  27.7
       1                 1.3    716.4  ...    484.8  17.7
WY     0                 2.8   2244.4  ...    540.0   NaN

Notice that pandas uses the name of the function as the name for the returned column. You can change the column name directly with the .rename method or you can modify the function attribute .__name__:

>>> max_deviation.__name__
'max_deviation'
>>> max_deviation.__name__ = 'Max Deviation'
>>> (college
...     .groupby(['STABBR', 'RELAFFIL']) 
...     [['UGDS', 'SATVRMID', 'SATMTMID']]
...     .agg([max_deviation, 'mean', 'std'])
...     .round(1)
... )
                         UGDS          ... SATMTMID      
                Max Deviation    mean  ...     mean   std
STABBR RELAFFIL                        ...               
AK     0                 2.1   3508.9  ...      NaN   NaN
       1                 1.1    123.3  ...    503.0   NaN
AL     0                 5.2   3248.8  ...    515.8  56.7
       1                 2.4    979.7  ...    485.6  61.4
AR     0                 5.8   1793.7  ...    503.6  39.0
...                      ...      ...  ...      ...   ...
WI     0                 5.3   2879.1  ...    591.2  85.7
       1                 3.4   1716.2  ...    526.6  42.5
WV     0                 6.9   1873.9  ...    480.0  27.7
       1                 1.3    716.4  ...    484.8  17.7
WY     0                 2.8   2244.4  ...    540.0   NaN

Customizing aggregating functions with *args and **kwargs

When writing your own user-defined customized aggregation function, pandas implicitly passes it each of the aggregating columns one at a time as a Series. Occasionally, you will need to pass more arguments to your function than just the Series itself. To do so, you need to be aware of Python's ability to pass an arbitrary number of arguments to functions.

The signature to .agg is agg(func, *args, **kwargs). The func parameter is a reducing function, the string name of a reducing method, a list of reducing functions, or a dictionary mapping columns to functions or a list of functions. Additionally, as we have seen, you can use keyword arguments to create named aggregations.

If you have a reducing function that takes additional arguments that you would like to use, you can leverage the *args and **kwargs parameters to pass arguments to the reduction function. You can use *args to pass an arbitrary number of positional arguments to your customized aggregation function. Similarly, **kwargs allows you to pass an arbitrary number of keyword arguments.
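
If Python's argument packing is new to you, here is a tiny standalone sketch (plain Python, unrelated to pandas) showing how *args collects positional arguments into a tuple and **kwargs collects keyword arguments into a dictionary:

>>> def show_args(*args, **kwargs):
...     print(args, kwargs)
>>> show_args(1_000, 10_000, round_to=1)
(1000, 10000) {'round_to': 1}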

In this recipe, we will build a customized function for the college dataset that finds the percentage of schools by state and religious affiliation that have an undergraduate population between two values.

How to do it…

  1. Define a function that returns the percentage of schools with an undergraduate population of between 1,000 and 3,000:
    >>> def pct_between_1_3k(s):
    ...     return (s
    ...         .between(1_000, 3_000)
    ...         .mean()
    ...         * 100
    ...     )
    
  2. Calculate this percentage grouping by state and religious affiliation:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL'])
    ...     ['UGDS'] 
    ...     .agg(pct_between_1_3k)
    ...     .round(1)
    ... )
    STABBR  RELAFFIL        
    AK      0           14.3
            1            0.0
    AL      0           23.6
    AR      0           27.9
                        ... 
    WI      0           13.8
            1           36.0
    WV      0           24.6
            1           37.5
    WY      0           54.5
    Name: UGDS, Length: 112, dtype: float64
    
  3. This function works, but it does not give the user any flexibility to choose the lower and upper bound. Let's create a new function that allows the user to parameterize these bounds:
    >>> def pct_between(s, low, high):
    ...     return s.between(low, high).mean() * 100
    
  4. Pass this new function to the .agg method along with the lower and upper bounds:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL'])
    ...     ['UGDS'] 
    ...     .agg(pct_between, 1_000, 10_000)
    ...     .round(1)
    ... )
    STABBR  RELAFFIL        
    AK      0           42.9
            1            0.0
    AL      0           45.8
            1           37.5
    AR      0           39.7
                        ... 
    WI      0           31.0
            1           44.0
    WV      0           29.2
            1           37.5
    WY      0           72.7
    Name: UGDS, Length: 112, dtype: float64
    

How it works…

Step 1 creates a function that doesn't accept any extra arguments. The upper and lower bounds are hardcoded into the function, which isn't very flexible. Step 2 shows the results of this aggregation.

We create a more flexible function in step 3 where we parameterize both the lower and upper bounds dynamically. Step 4 is where the magic of *args and **kwargs comes into play. In this particular example, we pass two non-keyword arguments, 1_000 and 10_000, to the .agg method. pandas passes these two arguments respectively to the low and high parameters of pct_between.

There are a few ways we could achieve the same result in step 4. We could have explicitly used keyword parameters to produce the same result:

>>> (college
...     .groupby(['STABBR', 'RELAFFIL'])
...     ['UGDS']
...     .agg(pct_between, high=10_000, low=1_000)
...     .round(1)
... )

There's more…

If we want to call multiple aggregation functions and some of them need parameters, we can utilize Python's closure functionality to create a new function that has the parameters closed over in its calling environment:

>>> def between_n_m(n, m):
...     def wrapper(ser):
...         return pct_between(ser, n, m)
...     wrapper.__name__ = f'between_{n}_{m}'
...     return wrapper
>>> (college
...     .groupby(['STABBR', 'RELAFFIL'])
...     ['UGDS'] 
...     .agg([between_n_m(1_000, 10_000), 'max', 'mean'])
...     .round(1)
... )
                 between_1000_10000      max    mean
STABBR RELAFFIL
AK     0                       42.9  12865.0  3508.9
       1                        0.0    275.0   123.3
AL     0                       45.8  29851.0  3248.8
       1                       37.5   3033.0   979.7
AR     0                       39.7  21405.0  1793.7
...                             ...      ...     ...
WI     0                       31.0  29302.0  2879.1
       1                       44.0   8212.0  1716.2
WV     0                       29.2  44924.0  1873.9
       1                       37.5   1375.0   716.4
WY     0                       72.7   9910.0  2244.4

Examining the groupby object

The immediate result from using the .groupby method on a DataFrame is a groupby object. Usually, we chain operations on this object to do aggregations or transformations without ever storing the intermediate values in variables.

In this recipe, we dig into the groupby object itself and examine the individual groups it contains.

How to do it…

  1. Let's get started by grouping the state and religious affiliation columns from the college dataset, saving the result to a variable and confirming its type:
    >>> college = pd.read_csv('data/college.csv')
    >>> grouped = college.groupby(['STABBR', 'RELAFFIL'])
    >>> type(grouped)
    <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
    
  2. Use the dir function to discover the attributes of a groupby object:
    >>> print([attr for attr in dir(grouped) if not
    ...     attr.startswith('_')])
    ['CITY', 'CURROPER', 'DISTANCEONLY', 'GRAD_DEBT_MDN_SUPP', 'HBCU', 'INSTNM',
    'MD_EARN_WNE_P10', 'MENONLY', 'PCTFLOAN', 'PCTPELL', 'PPTUG_EF', 'RELAFFIL',
    'SATMTMID', 'SATVRMID', 'STABBR', 'UG25ABV', 'UGDS', 'UGDS_2MOR', 'UGDS_AIAN',
    'UGDS_ASIAN', 'UGDS_BLACK', 'UGDS_HISP', 'UGDS_NHPI', 'UGDS_NRA', 'UGDS_UNKN',
    'UGDS_WHITE', 'WOMENONLY', 'agg', 'aggregate', 'all', 'any', 'apply',
    'backfill', 'bfill', 'boxplot', 'corr', 'corrwith', 'count', 'cov', 'cumcount',
    'cummax', 'cummin', 'cumprod', 'cumsum', 'describe', 'diff', 'dtypes',
    'expanding', 'ffill', 'fillna', 'filter', 'first', 'get_group', 'groups',
    'head', 'hist', 'idxmax', 'idxmin', 'indices', 'last', 'mad', 'max', 'mean',
    'median', 'min', 'ndim', 'ngroup', 'ngroups', 'nth', 'nunique', 'ohlc', 'pad',
    'pct_change', 'pipe', 'plot', 'prod', 'quantile', 'rank', 'resample',
    'rolling', 'sem', 'shift', 'size', 'skew', 'std', 'sum', 'tail', 'take',
    'transform', 'tshift', 'var']
    
  3. Find the number of groups with the .ngroups attribute:
    >>> grouped.ngroups
    112
    
  4. To find the uniquely identifying labels for each group, look in the .groups attribute, which contains a dictionary of each unique group mapped to all the corresponding index labels of that group. Because we grouped by two columns, each of the keys has a tuple, one value for the STABBR column and another for the RELAFFIL column:
    >>> groups = list(grouped.groups)
    >>> groups[:6]
    [('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1), ('AR', 0), ('AR', 1)]
    
  5. Retrieve a single group with the .get_group method by passing it a tuple of an exact group label. For example, to get all the religiously affiliated schools in the state of Florida, do the following:
    >>> grouped.get_group(('FL', 1))
               INSTNM         CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    712   The Bapt...   Graceville  ...        30800           20052
    713   Barry Un...        Miami  ...        44100           28250
    714   Gooding ...  Panama City  ...          NaN     PrivacyS...
    715   Bethune-...  Daytona ...  ...        29400           36250
    724   Johnson ...    Kissimmee  ...        26300           20199
    ...           ...          ...  ...          ...             ...
    7486  Strayer ...  Coral Sp...  ...        49200         36173.5
    7487  Strayer ...  Fort Lau...  ...        49200         36173.5
    7488  Strayer ...      Miramar  ...        49200         36173.5
    7489  Strayer ...        Miami  ...        49200         36173.5
    7490  Strayer ...        Miami  ...        49200         36173.5
    
  6. You may want to take a peek at each individual group. This is possible because groupby objects are iterable. If you are in Jupyter, you can leverage the display function to show each group in a single cell (otherwise, Jupyter will only show the result of the last statement of the cell):
    from IPython.display import display
    for name, group in grouped:
        print(name)
        display(group.head(3))
    
    Displaying multiple DataFrames in Jupyter

    However, I typically want to see some example data from a single group to figure out what function I want to apply to the groups. If I know the values of the grouping columns for a group, I can use the previous step. Often, I don't know those values, but I also don't need to see all of the groups. The following snippet of debugging code is usually sufficient to understand what a group looks like:

    >>> for name, group in grouped:
    ...     print(name)
    ...     print(group)
    ...     break
    ('AK', 0)
               INSTNM       CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    60    Universi...  Anchorage  ...        42500         19449.5
    62    Universi...  Fairbanks  ...        36200           19355
    63    Universi...     Juneau  ...        37400           16875
    65    AVTEC-Al...     Seward  ...        33500     PrivacyS...
    66    Charter ...  Anchorage  ...        39200           13875
    67    Alaska C...  Anchorage  ...        28700            8994
    5171  Ilisagvi...     Barrow  ...        24900     PrivacyS...
    
  7. You can also call the .head method on your groupby object to get the first rows of each group together in a single DataFrame:
    >>> grouped.head(2)
               INSTNM        CITY  ... MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    0     Alabama ...      Normal  ...        30300           33888
    1     Universi...  Birmingham  ...        39700         21941.5
    2     Amridge ...  Montgomery  ...        40100           23370
    10    Birmingh...  Birmingham  ...        44200           27000
    43    Prince I...    Elmhurst  ...  PrivacyS...           20992
    ...           ...         ...  ...          ...             ...
    5289  Pacific ...    Mangilao  ...  PrivacyS...     PrivacyS...
    6439  Touro Un...   Henderson  ...          NaN     PrivacyS...
    7352  Marinell...   Henderson  ...        21200          9796.5
    7404  Universi...   St. Croix  ...        31800           15150
    7419  Computer...  Las Cruces  ...        21300           14250
    

How it works…

Step 1 creates our groupby object. We can display all the public attributes and methods to reveal the functionality of an object as was done in step 2. Each group is uniquely identified by a tuple containing a unique combination of the values in the grouping columns. pandas allows you to select a specific group as a DataFrame with the .get_group method shown in step 5.

It is rare that you will need to iterate through your groups. In fact, you should avoid doing so, as it can be quite slow. Occasionally, however, you will have no other choice. When iterating through a groupby object, you are given a tuple containing the group name and the sub-DataFrame for that group, which keeps its original index. This tuple is unpacked into the name and group variables in the for loop in step 6.

One thing you can do while iterating through your groups is to display a few of the rows from each group directly in the notebook. To do this, you can either use the print function or the display function from the IPython.display module if you are using Jupyter.

There's more…

There are several useful methods from the list in step 2 that were not explored. Take, for instance, the .nth method, which, when provided with a list of integers, selects rows at those positions from each group. For example, the following operation selects the second (position 1) and last rows from each group:

>>> grouped.nth([1, -1])
                      INSTNM         CITY  ...  MD_EARN_WNE_P10 
STABBR RELAFFIL                            ...
AK     0         Universi...    Fairbanks  ...        36200
       0         Ilisagvi...       Barrow  ...        24900
       1         Alaska P...    Anchorage  ...        47000
       1         Alaska C...     Soldotna  ...          NaN
AL     0         Universi...   Birmingham  ...        39700
...                      ...          ...  ...          ...
WV     0         BridgeVa...  South  C...  ...          NaN
       1         Appalach...   Mount Hope  ...        28700
       1         West Vir...  Nutter Fort  ...        16700
WY     0         Central ...     Riverton  ...        25200
       0         CollegeA...     Cheyenne  ...        25600

Filtering for states with a minority majority

Previously, we examined using Boolean arrays to filter rows. In a similar fashion, when using the .groupby method, we can filter out groups. The .filter method of the groupby object accepts a function that must return either True or False to indicate whether a group is kept.

This .filter method applied after a call to the .groupby method is completely different to the DataFrame .filter method covered in the Selecting columns with methods recipe from Chapter 2, Essential DataFrame Operations.

One thing to be aware of is that when the .filter method is applied, the result does not use the grouping columns as the index, but keeps the original index! The DataFrame .filter method filters columns, not values.
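
To make the contrast concrete, here is a hedged sketch of the DataFrame .filter method, which selects columns by label; assuming the flights DataFrame from earlier in the chapter is still in memory, this keeps only the columns whose names contain the substring DELAY:

>>> flights.filter(like='DELAY').head()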

In this recipe, we use the college dataset to find all the states that have more non-white undergraduate students than white. This is a dataset from the US, where whites form the majority and therefore, we are looking for states with a minority majority.

How to do it…

  1. Read in the college dataset, group by state, and display the total number of groups. This should equal the number of unique states retrieved from the .nunique Series method:
    >>> college = pd.read_csv('data/college.csv', index_col='INSTNM')
    >>> grouped = college.groupby('STABBR')
    >>> grouped.ngroups
    59
    >>> college['STABBR'].nunique() # verifying the same number
    59
    
  2. The grouped variable has a .filter method, which accepts a custom function that determines whether a group is kept. The custom function accepts a DataFrame of the current group and is required to return a Boolean. Let's define a function that calculates the total percentage of minority students and returns True if this percentage is greater than a user-defined threshold:
    >>> def check_minority(df, threshold):
    ...     minority_pct = 1 - df['UGDS_WHITE']
    ...     total_minority = (df['UGDS'] * minority_pct).sum()
    ...     total_ugds = df['UGDS'].sum()
    ...     total_minority_pct = total_minority / total_ugds
    ...     return total_minority_pct > threshold
    
  3. Use the .filter method passed with the check_minority function and a threshold of 50% to find all states that have a minority majority:
    >>> college_filtered = grouped.filter(check_minority, threshold=.5)
    >>> college_filtered
                         CITY STABBR  ...  MD_EARN_WNE_P10  GRAD_DEBT_MDN_SUPP
    INSTNM                            ...
    Everest C...      Phoenix     AZ  ...        28600             9500
    Collins C...      Phoenix     AZ  ...        25700            47000
    Empire Be...      Phoenix     AZ  ...        17800             9588
    Empire Be...       Tucson     AZ  ...        18200             9833
    Thunderbi...     Glendale     AZ  ...       118900      PrivacyS...
    ...                   ...    ...  ...          ...              ...
    WestMed C...       Merced     CA  ...          NaN          15623.5
    Vantage C...      El Paso     TX  ...          NaN             9500
    SAE Insti...   Emeryville     CA  ...          NaN             9500
    Bay Area ...     San Jose     CA  ...          NaN      PrivacyS...
    Excel Lea...  San Antonio     TX  ...          NaN            12125
    
  4. Just looking at the output may not be indicative of what happened. The DataFrame starts with the state of Arizona (AZ) and not Alaska (AK), so we can visually confirm that something changed. Let's compare the shape of this filtered DataFrame with the original. Looking at the results, about 60% of the rows have been filtered, and only 20 states remain that have a minority majority:
    >>> college.shape
    (7535, 26)
    >>> college_filtered.shape
    (3028, 26)
    >>> college_filtered['STABBR'].nunique()
    20
    

How it works…

This recipe takes a look at the total population of all the institutions on a state-by-state basis. The goal is to keep all the rows from the states, as a whole, that have a minority majority. This requires us to group our data by state, which we do in step 1. We find that there are 59 independent groups.

The .filter groupby method either keeps all the rows in a group or filters them out. It does not change the number of columns. The .filter groupby method performs this gatekeeping through a user-defined function, check_minority, in this recipe. This function accepts a DataFrame of each group and needs to return a Boolean.

Inside the check_minority function, the percentage and the total number of non-white students for each institution are first calculated followed by the total number of all students. Finally, the percentage of non-white students for the entire state is checked against the given threshold, which produces a Boolean.
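
To see the gatekeeping function in isolation, you can run it on a single group pulled out with .get_group. Given that Arizona survives the 50% filter in step 3, this hedged sketch should return True:

>>> check_minority(grouped.get_group('AZ'), threshold=0.5)
True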

The final result is a DataFrame with the same columns as the original (and the same index, not the grouped index), but with the rows from the states that don't meet the threshold filtered out. As it is possible that the head of the filtered DataFrame is the same as the original, you need to do some inspection to ensure that the operation completed successfully. We verify this by checking the number of rows and unique states.

There's more…

Our function, check_minority, is flexible and accepts a parameter to lower or raise the percentage of minority threshold. Let's check the shape and number of unique states for a couple of other thresholds:

>>> college_filtered_20 = grouped.filter(check_minority, threshold=.2)
>>> college_filtered_20.shape
(7461, 26)
>>> college_filtered_20['STABBR'].nunique()
57
>>> college_filtered_70 = grouped.filter(check_minority, threshold=.7)
>>> college_filtered_70.shape
(957, 26)
>>> college_filtered_70['STABBR'].nunique()
10

Transforming through a weight loss bet

One method to increase motivation to lose weight is to make a bet with someone else. The scenario in this recipe will track weight loss from two individuals throughout a four-month period and determine a winner.

In this recipe, we use simulated data from two individuals to track the percentage of weight loss over four months. At the end of each month, a winner will be declared based on the individual who lost the highest percentage of body weight for that month. To track weight loss, we group our data by month and person, and then call the .transform method to find the percentage weight loss change for each week against the start of the month.

We will use the .transform method in this recipe. This method returns a new object that preserves the index of the original DataFrame but allows you to do calculations on groups of the data.
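
As a hedged warm-up sketch with a made-up toy DataFrame (the values are borrowed from the January weigh-ins shown below), .transform('mean') broadcasts each group's mean back to every row, preserving the original index:

>>> toy = pd.DataFrame({'person': ['Amy', 'Amy', 'Bob', 'Bob'],
...                     'weight': [197, 189, 291, 288]})
>>> toy.groupby('person')['weight'].transform('mean')
0    193.0
1    193.0
2    289.5
3    289.5
Name: weight, dtype: float64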

How to do it…

  1. Read in the raw weight_loss dataset, and examine the first month of data from the two people, Amy and Bob. There are a total of four weigh-ins per month:
    >>> weight_loss = pd.read_csv('data/weight_loss.csv')
    >>> weight_loss.query('Month == "Jan"')
      Name Month    Week  Weight
    0  Bob   Jan  Week 1     291
    1  Amy   Jan  Week 1     197
    2  Bob   Jan  Week 2     288
    3  Amy   Jan  Week 2     189
    4  Bob   Jan  Week 3     283
    5  Amy   Jan  Week 3     189
    6  Bob   Jan  Week 4     283
    7  Amy   Jan  Week 4     190
    
  2. To determine the winner for each month, we only need to compare the weight loss from the first week to the last week of each month. But if we want weekly updates, we can also calculate the weight loss from the current week relative to the first week of each month. Let's create a function capable of providing these weekly updates. It will take a Series and return a Series of the same size:
    >>> def percent_loss(s):
    ...     return ((s - s.iloc[0]) / s.iloc[0]) * 100
    
  3. Let's test out this function for Bob during the month of January:
    >>> (weight_loss
    ...     .query('Name=="Bob" and Month=="Jan"')
    ...     ['Weight']
    ...     .pipe(percent_loss)
    ... )
    0    0.000000
    2   -1.030928
    4   -2.749141
    6   -2.749141
    Name: Weight, dtype: float64
    
  4. By the second weigh-in, Bob had lost 1% of his body weight. He continued losing weight through the third week but made no progress during the last week. We can apply this function to every combination of person and month to get the weight loss per week relative to the first week of the month. To do this, we group our data by Name and Month, and then use the .transform method to apply this custom function. The function we pass to .transform must return either a scalar or an object of the same length as the group; percent_loss returns a Series of the same size, so it works here:
    >>> (weight_loss
    ...     .groupby(['Name', 'Month'])
    ...     ['Weight'] 
    ...     .transform(percent_loss)
    ... )
    0     0.000000
    1     0.000000
    2    -1.030928
    3    -4.060914
    4    -2.749141
            ...   
    27   -3.529412
    28   -3.065134
    29   -3.529412
    30   -4.214559
    31   -5.294118
    Name: Weight, Length: 32, dtype: float64
    
  5. The .transform method takes a function that returns an object with the same index (and the same number of rows) as was passed into it. Because it has the same index, we can insert it as a column. The .transform method is useful for summarizing information from the groups and then adding it back to the original DataFrame. We will also filter down to two months of data for Bob:
    >>> (weight_loss
    ...     .assign(percent_loss=(weight_loss
    ...         .groupby(['Name', 'Month'])
    ...         ['Weight'] 
    ...         .transform(percent_loss)
    ...         .round(1)))
    ...     .query('Name=="Bob" and Month in ["Jan", "Feb"]')
    ... )
       Name Month    Week  Weight  percent_loss
    0   Bob   Jan  Week 1     291           0.0
    2   Bob   Jan  Week 2     288          -1.0
    4   Bob   Jan  Week 3     283          -2.7
    6   Bob   Jan  Week 4     283          -2.7
    8   Bob   Feb  Week 1     283           0.0
    10  Bob   Feb  Week 2     275          -2.8
    12  Bob   Feb  Week 3     268          -5.3
    14  Bob   Feb  Week 4     268          -5.3
    
  6. Notice that the percentage of weight loss resets after the new month. With this new percent_loss column, we can manually determine a winner but let's see whether we can find a way to do this automatically. As the only week that matters is the last week, let's select week 4:
    >>> (weight_loss
    ...     .assign(percent_loss=(weight_loss
    ...         .groupby(['Name', 'Month'])
    ...         ['Weight'] 
    ...         .transform(percent_loss)
    ...         .round(1)))
    ...     .query('Week == "Week 4"')
    ... )
       Name Month    Week  Weight  percent_loss
    6   Bob   Jan  Week 4     283          -2.7
    7   Amy   Jan  Week 4     190          -3.6
    14  Bob   Feb  Week 4     268          -5.3
    15  Amy   Feb  Week 4     173          -8.9
    22  Bob   Mar  Week 4     261          -2.6
    23  Amy   Mar  Week 4     170          -1.7
    30  Bob   Apr  Week 4     250          -4.2
    31  Amy   Apr  Week 4     161          -5.3
    
  7. This narrows down the weeks but still doesn't automatically find out the winner of each month. Let's reshape this data with the .pivot method so that Bob's and Amy's percent weight loss is side by side for each month:
    >>> (weight_loss
    ...     .assign(percent_loss=(weight_loss
    ...         .groupby(['Name', 'Month'])
    ...         ['Weight'] 
    ...         .transform(percent_loss)
    ...         .round(1)))
    ...     .query('Week == "Week 4"')
    ...     .pivot(index='Month', columns='Name',
    ...            values='percent_loss')
    ... )
    Name   Amy  Bob
    Month          
    Apr   -5.3 -4.2
    Feb   -8.9 -5.3
    Jan   -3.6 -2.7
    Mar   -1.7 -2.6
    
  8. This output makes it clearer who has won each month, but we can still go a couple of steps further. NumPy has a vectorized if-then-else function called where, which can map a Series or array of Booleans to other values. Let's create a column, winner, with the name of the winner:
    >>> (weight_loss
    ...     .assign(percent_loss=(weight_loss
    ...         .groupby(['Name', 'Month'])
    ...         ['Weight'] 
    ...         .transform(percent_loss)
    ...         .round(1)))
    ...     .query('Week == "Week 4"')
    ...     .pivot(index='Month', columns='Name',
    ...            values='percent_loss')
    ...     .assign(winner=lambda df_:
    ...             np.where(df_.Amy < df_.Bob, 'Amy', 'Bob'))
    ... )
    Name   Amy  Bob winner
    Month                 
    Apr   -5.3 -4.2    Amy
    Feb   -8.9 -5.3    Amy
    Jan   -3.6 -2.7    Amy
    Mar   -1.7 -2.6    Bob
    

    In Jupyter, you can highlight the winning percentage for each month using the .style attribute:

    (weight_loss
        .assign(percent_loss=(weight_loss
            .groupby(['Name', 'Month'])
            ['Weight'] 
            .transform(percent_loss)
            .round(1)))
        .query('Week == "Week 4"')
        .pivot(index='Month', columns='Name',
               values='percent_loss')
        .assign(winner=lambda df_:
                np.where(df_.Amy < df_.Bob, 'Amy', 'Bob'))
        .style.highlight_min(axis=1)
    )
    
    The styled DataFrame with the winning (minimum) percentage highlighted for each month

  9. Use the .value_counts method to return the final score as the number of months won:
    >>> (weight_loss
    ...     .assign(percent_loss=(weight_loss
    ...         .groupby(['Name', 'Month'])
    ...         ['Weight'] 
    ...         .transform(percent_loss)
    ...         .round(1)))
    ...     .query('Week == "Week 4"')
    ...     .pivot(index='Month', columns='Name',
    ...            values='percent_loss')
    ...     .assign(winner=lambda df_:
    ...             np.where(df_.Amy < df_.Bob, 'Amy', 'Bob'))
    ...     .winner
    ...     .value_counts()
    ... )
    Amy    3
    Bob    1
    Name: winner, dtype: int64
    

How it works…

Throughout this recipe, the .query method is used to filter data instead of using Boolean arrays. Refer to the Improving readability of Boolean indexing with the query method recipe in Chapter 7, Filtering Rows for more information.

Our goal is to find the percentage weight loss for each month for each person. One way to accomplish this task is to calculate each week's weight loss relative to the start of each month. This specific task is perfectly suited to the .transform groupby method. The .transform method requires a function as a parameter. This function gets passed each group (which can be a Series or DataFrame). It must return a sequence of values the same length as the group that was passed in or else an exception will be raised. No aggregation or filtering takes place.
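
Here is a minimal sketch with a toy Series (hypothetical values, not the weight_loss data) contrasting .transform, which returns a result with the same length and index as the input, with .agg, which collapses each group to a single value:

>>> toy = pd.Series([3, 5, 7, 2], index=list('abcd'))
>>> keys = ['g1', 'g1', 'g2', 'g2']
>>> toy.groupby(keys).transform(lambda s: (s - s.iloc[0]) / s.iloc[0] * 100)
a     0.000000
b    66.666667
c     0.000000
d   -71.428571
dtype: float64
>>> toy.groupby(keys).agg('mean')
g1    4.0
g2    4.5
dtype: float64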

Step 2 creates a function that calculates the percentage loss (or gain) relative to the first value. It subtracts the first value of the passed Series from all of its values and then divides this result by the first value. In step 3, we test this function on one person during one month.

In step 4, we use .groupby with .transform to run this function over every combination of person and month. We are transforming the Weight column into the percentage of weight lost in the current week.

pandas returns this new data as a Series. The Series isn't all that useful by itself and makes more sense appended to the original DataFrame as a new column, which is what we do in step 5, where we also filter down to two months of data for Bob.

To determine the winner, only week 4 of each month is necessary. We could stop here and manually determine the winner, but pandas supplies us with the functionality to automate this. The .pivot method in step 7 reshapes our dataset by pivoting the unique values of one column into new column names. The index parameter is used for the column that you do not want to pivot. The column passed to the values parameter gets tiled over each unique combination of the columns in the index and columns parameters.

The .pivot method only works if there is just a single occurrence of each unique combination of the columns in the index and columns parameters. If there is more than one unique combination, an exception is raised. You can use the .pivot_table or .groupby method in that situation.

Here is an example of using .groupby with .unstack to emulate the pivot functionality:

>>> (weight_loss
...     .assign(percent_loss=(weight_loss
...         .groupby(['Name', 'Month'])
...         ['Weight'] 
...         .transform(percent_loss)
...         .round(1)))
...     .query('Week == "Week 4"')
...     .groupby(['Month', 'Name'])
...     ['percent_loss']
...     .first()
...     .unstack()
... )
Name   Amy  Bob
Month          
Apr   -5.3 -4.2
Feb   -8.9 -5.3
Jan   -3.6 -2.7
Mar   -1.7 -2.6
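
Alternatively, as noted above, .pivot_table also handles this situation. Since each Month and Name combination appears only once after filtering to Week 4, an aggregation function such as 'first' simply passes the single value through. Here is a sketch of that approach, which should produce the same table:

>>> (weight_loss
...     .assign(percent_loss=(weight_loss
...         .groupby(['Name', 'Month'])
...         ['Weight']
...         .transform(percent_loss)
...         .round(1)))
...     .query('Week == "Week 4"')
...     .pivot_table(index='Month', columns='Name',
...                  values='percent_loss', aggfunc='first')
... )
Name   Amy  Bob
Month          
Apr   -5.3 -4.2
Feb   -8.9 -5.3
Jan   -3.6 -2.7
Mar   -1.7 -2.6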

After pivoting, we utilize the NumPy where function, whose first argument is a condition that produces a Series of Booleans. True values get mapped to Amy, and False values get mapped to Bob. We highlight the winner of each month and tally the final score with the .value_counts method.
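
As a standalone illustration (toy Booleans, unrelated to the recipe data), np.where maps True positions to its second argument and False positions to its third:

>>> np.where(pd.Series([True, False, True]), 'Amy', 'Bob')
array(['Amy', 'Bob', 'Amy'], dtype='<U3')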

There's more…

Take a look at the DataFrame output from step 7. Did you notice that the months are in alphabetical and not chronological order? pandas, unfortunately, in this case at least, orders the months alphabetically. We can solve this issue by changing the data type of Month to a categorical variable. Categorical variables map each value of the column to an integer, and we can choose this mapping to follow the normal chronological order of the months. pandas uses this underlying integer mapping during the .pivot method to order the months chronologically:

>>> (weight_loss
...     .assign(percent_loss=(weight_loss
...         .groupby(['Name', 'Month'])
...         ['Weight'] 
...         .transform(percent_loss)
...         .round(1)),
...             Month=pd.Categorical(weight_loss.Month,
...                   categories=['Jan', 'Feb', 'Mar', 'Apr'],
...                   ordered=True))
...     .query('Week == "Week 4"')
...     .pivot(index='Month', columns='Name',
...            values='percent_loss')
... )
Name   Amy  Bob
Month          
Jan   -3.6 -2.7
Feb   -8.9 -5.3
Mar   -1.7 -2.6
Apr   -5.3 -4.2

To convert Month to an ordered category column, use the Categorical constructor. Pass it the original column as a Series and a unique sequence of all the categories in the desired order to the categories parameter. In general, to sort columns of the object data type by something other than alphabetical, convert them to categorical.
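
An equivalent approach, sketched below, is to build a reusable pd.CategoricalDtype and convert the column with .astype; the converted column then sorts and pivots in chronological order just like the pd.Categorical version above:

>>> month_type = pd.CategoricalDtype(
...     categories=['Jan', 'Feb', 'Mar', 'Apr'], ordered=True)
>>> weight_loss.assign(Month=weight_loss.Month.astype(month_type))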

Calculating weighted mean SAT scores per state with apply

The groupby object has four methods that accept a function (or functions) to perform a calculation on each group. These four methods are .agg, .filter, .transform, and .apply. Each of the first three of these methods has a very specific output that the function must return. .agg must return a scalar value, .filter must return a Boolean, and .transform must return a Series or DataFrame with the same length as the passed group. The .apply method, however, may return a scalar value, a Series, or even a DataFrame of any shape, therefore making it very flexible. It is also called only once per group (on a DataFrame), while the .transform and .agg methods get called once for each aggregating column (on a Series). The .apply method's ability to return a single object when operating on multiple columns at the same time makes the calculation in this recipe possible.
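
A minimal sketch with a toy DataFrame (hypothetical values) makes the difference concrete: the function passed to .agg sees one column of each group at a time as a Series, while the function passed to .apply sees each group as a whole DataFrame and can therefore combine columns:

>>> toy = pd.DataFrame({'grp': ['a', 'a', 'b'],
...                     'x': [1, 2, 3],
...                     'y': [10, 20, 5]})
>>> toy.groupby('grp').agg(lambda s: s.sum())
     x   y
grp       
a    3  30
b    3   5
>>> toy.groupby('grp').apply(lambda df: df['x'].sum() * df['y'].sum())
grp
a    90
b    15
dtype: int64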

In this recipe, we calculate the weighted average of both the math and verbal SAT scores per state from the college dataset. We weight the scores by the population of undergraduate students per school.

How to do it…

  1. Read in the college dataset, and drop any rows that have missing values in the UGDS, SATMTMID, or SATVRMID columns. We do not want any missing values for those columns:
    >>> college = pd.read_csv('data/college.csv')
    >>> subset = ['UGDS', 'SATMTMID', 'SATVRMID']
    >>> college2 = college.dropna(subset=subset)
    >>> college.shape
    (7535, 27)
    >>> college2.shape
    (1184, 27)
    
  2. The vast majority of institutions do not have data for our three required columns, but this is still more than enough data to continue. Next, create a user-defined function to calculate the weighted average of the SAT math scores:
    >>> def weighted_math_average(df):
    ...     weighted_math = df['UGDS'] * df['SATMTMID']
    ...     return int(weighted_math.sum() / df['UGDS'].sum())
    
  3. Group by state and pass this function to the .apply method. Because each group has multiple columns and we want to reduce those to a single value, we need to use .apply. The weighted_math_average function will be called once for each group (not on the individual columns in the group):
    >>> college2.groupby('STABBR').apply(weighted_math_average)
    STABBR
    AK    503
    AL    536
    AR    529
    AZ    569
    CA    564
         ...
    VT    566
    WA    555
    WI    593
    WV    500
    WY    540
    Length: 53, dtype: int64
    
  4. We successfully returned a scalar value for each group. Let's take a small detour and see what the outcome would have been by passing the same function to the .agg method (which calls the function for every column):
    >>> (college2
    ...     .groupby('STABBR')
    ...     .agg(weighted_math_average)
    ... )
    Traceback (most recent call last):
      ...
    KeyError: 'UGDS'
    
  5. The weighted_math_average function gets applied to each non-aggregating column in the DataFrame. If you try and limit the columns to just SATMTMID, you will get an error as you won't have access to UGDS. So, the best way to complete operations that act on multiple columns is with .apply:
    >>> (college2
    ...     .groupby('STABBR')
    ...     ['SATMTMID'] 
    ...     .agg(weighted_math_average)
    ... )
    Traceback (most recent call last):
      ...
    KeyError: 'UGDS'
    
  6. A nice feature of .apply is that you can create multiple new columns by returning a Series. The index of this returned Series will be the new column names. Let's modify our function to calculate the weighted and arithmetic average for both SAT scores along with the count of the number of institutions from each group. We return these five values in a Series:
    >>> def weighted_average(df):
    ...    weight_m = df['UGDS'] * df['SATMTMID']
    ...    weight_v = df['UGDS'] * df['SATVRMID']
    ...    wm_avg = weight_m.sum() / df['UGDS'].sum()
    ...    wv_avg = weight_v.sum() / df['UGDS'].sum()
    ...    data = {'w_math_avg': wm_avg,
    ...            'w_verbal_avg': wv_avg,
    ...            'math_avg': df['SATMTMID'].mean(),
    ...            'verbal_avg': df['SATVRMID'].mean(),
    ...            'count': len(df)
    ...    }
    ...    return pd.Series(data)
    >>> (college2
    ...     .groupby('STABBR')
    ...     .apply(weighted_average)
    ...     .astype(int)
    ... )
            w_math_avg  w_verbal_avg  math_avg  verbal_avg  count
    STABBR
    AK             503           555       503         555      1
    AL             536           533       504         508     21
    AR             529           504       515         491     16
    AZ             569           557       536         538      6
    CA             564           539       562         549     72
    ...            ...           ...       ...         ...    ...
    VT             566           564       526         527      8
    WA             555           541       551         548     18
    WI             593           556       545         516     14
    WV             500           487       481         473     17
    WY             540           535       540         535      1
    

How it works…

In order for this recipe to complete correctly, we need to filter for institutions that do not have missing values for UGDS, SATMTMID, and SATVRMID. By default, the .dropna method drops rows that have one or more missing values. We must use the subset parameter to limit the columns it looks at. It only considers the UGDS, SATMTMID, or SATVRMID columns for missing values.
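
Here is a minimal sketch with a toy DataFrame (hypothetical values) contrasting the default behavior of .dropna with the subset parameter:

>>> toy = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, 2], 'c': [3, 4]})
>>> toy.dropna()              # any missing value in a row drops it
Empty DataFrame
Columns: [a, b, c]
Index: []
>>> toy.dropna(subset=['c'])  # only column c is inspected, so both rows survive
     a    b  c
0  1.0  NaN  3
1  NaN  2.0  4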

If we do not remove the missing values, they will throw off the computations for the weighted averages. In the output below, you can see that the weighted scores for AK are around 5 and 6, which does not make sense:

>>> (college
...     .groupby('STABBR')
...     .apply(weighted_average)
... )
        w_math_avg  w_verbal_avg    math_avg  verbal_avg  count
STABBR
AK        5.548091      6.121651  503.000000  555.000000   10.0
AL      261.895658    260.550109  504.285714  508.476190   96.0
AR      301.054792    287.264872  515.937500  491.875000   86.0
AS        0.000000      0.000000         NaN         NaN    1.0
AZ       61.815821     60.511712  536.666667  538.333333  133.0
...            ...           ...         ...         ...    ...
VT      389.967094    388.696848  526.875000  527.500000   27.0
WA      274.885878    267.880280  551.222222  548.333333  123.0
WI      153.803086    144.160115  545.071429  516.857143  112.0
WV      224.697582    218.843452  481.705882  473.411765   73.0
WY      216.761180    214.754132  540.000000  535.000000   11.0
    

In step 2, we define a function that calculates the weighted average for just the SATMTMID column. The weighted average differs from the arithmetic mean because each value is multiplied by a weight. This quantity is then summed and divided by the sum of the weights. In this case, our weight is the undergraduate student population.
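
The arithmetic is easy to verify on toy numbers (hypothetical scores and enrollments); np.average performs the same weighted calculation that weighted_math_average does by hand:

>>> scores = pd.Series([500, 600])
>>> ugds = pd.Series([1000, 3000])      # weights: undergraduate populations
>>> (scores * ugds).sum() / ugds.sum()  # manual weighted mean
575.0
>>> np.average(scores, weights=ugds)    # same result with NumPy
575.0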

In step 3, we pass this function to the .apply method. Our function, weighted_math_average, gets passed a DataFrame of all the original columns for each group. It returns a single scalar value, the weighted average of SATMTMID. At this point, you might think that this calculation is possible using the .agg method. Directly replacing .apply with .agg does not work as .agg returns a value for each of its aggregating columns.

Step 6 shows the versatility of .apply. We build a new function that calculates the weighted and arithmetic average of both SAT columns as well as the number of rows for each group. To use .apply to create multiple columns, you must return a Series. The index values are used as column names in the resulting DataFrame. You can return as many values as you want with this method.

Note that because I'm using Python 3.6 or later, I can use a normal dictionary in weighted_average to create a Series with a predictable column order. Since Python 3.6, dictionaries preserve their insertion order.

There's more…

In this recipe, we returned a single row as a Series for each group. It's possible to return any number of rows and columns for each group by returning a DataFrame.

In addition to finding just the arithmetic and weighted means, let's also find the geometric and harmonic means of both SAT columns and return the results as a DataFrame with rows as the name of the type of mean and columns as the SAT type. To ease the burden on us, we use the NumPy function average to compute the weighted average and the SciPy functions gmean and hmean for geometric and harmonic means:

>>> from scipy.stats import gmean, hmean
>>> def calculate_means(df):
...     df_means = pd.DataFrame(index=['Arithmetic', 'Weighted',
...                                    'Geometric', 'Harmonic'])
...     cols = ['SATMTMID', 'SATVRMID']
...     for col in cols:
...         arithmetic = df[col].mean()
...         weighted = np.average(df[col], weights=df['UGDS'])
...         geometric = gmean(df[col])
...         harmonic = hmean(df[col])
...         df_means[col] = [arithmetic, weighted,
...                          geometric, harmonic]
...     df_means['count'] = len(df)
...     return df_means.astype(int)
>>> (college2
...     .groupby('STABBR')
...     .apply(calculate_means)
... )
                   SATMTMID  SATVRMID  count
STABBR                                      
AK     Arithmetic       503       555      1
       Weighted         503       555      1
       Geometric        503       555      1
       Harmonic         503       555      1
AL     Arithmetic       504       508     21
...                     ...       ...    ...
WV     Harmonic         480       472     17
WY     Arithmetic       540       535      1
       Weighted         540       535      1
       Geometric        540       534      1
       Harmonic         540       535      1

Grouping by continuous variables

When grouping in pandas, you typically use columns with discrete repeating values. If there are no repeated values, then grouping would be pointless as there would only be one row per group. Continuous numeric columns typically have few repeated values and are generally not used to form groups. However, if we can transform columns with continuous values into a discrete column by placing each value in a bin, rounding them, or using some other mapping, then grouping with them makes sense.

In this recipe, we explore the flights dataset to discover the distribution of airlines for different travel distances. This allows us, for example, to find the airline that makes the most flights between 500 and 1,000 miles. To accomplish this, we use the pandas cut function to discretize the distance of each flight flown.

How to do it…

  1. Read in the flights dataset:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> flights
           MONTH  DAY  WEEKDAY  ... ARR_DELAY DIVERTED CANCELLED
    0          1    1        4  ...      65.0        0         0
    1          1    1        4  ...     -13.0        0         0
    2          1    1        4  ...      35.0        0         0
    3          1    1        4  ...      -7.0        0         0
    4          1    1        4  ...      39.0        0         0
    ...      ...  ...      ...  ...       ...      ...       ...
    58487     12   31        4  ...     -19.0        0         0
    58488     12   31        4  ...       4.0        0         0
    58489     12   31        4  ...      -5.0        0         0
    58490     12   31        4  ...      34.0        0         0
    58491     12   31        4  ...      -1.0        0         0
    
  2. If we want to find the distribution of airlines over a range of distances, we need to place the values of the DIST column into discrete bins. Let's use the pandas cut function to split the data into five bins:
    >>> bins = [-np.inf, 200, 500, 1000, 2000, np.inf]
    >>> cuts = pd.cut(flights['DIST'], bins=bins)
    >>> cuts
    0         (500.0, 1000.0]
    1        (1000.0, 2000.0]
    2         (500.0, 1000.0]
    3        (1000.0, 2000.0]
    4        (1000.0, 2000.0]
                   ...
    58487    (1000.0, 2000.0]
    58488      (200.0, 500.0]
    58489      (200.0, 500.0]
    58490     (500.0, 1000.0]
    58491     (500.0, 1000.0]
    Name: DIST, Length: 58492, dtype: category
    Categories (5, interval[float64]): [(-inf, 200.0] < (200.0, 500.0] < (500.0, 1000.0] <
                                     (1000.0, 2000.0] < (2000.0, inf]]
    
  3. An ordered categorical Series is created. To help get an idea of what happened, let's count the values of each category:
    >>> cuts.value_counts()
    (500.0, 1000.0]     20659
    (200.0, 500.0]      15874
    (1000.0, 2000.0]    14186
    (2000.0, inf]        4054
    (-inf, 200.0]        3719
    Name: DIST, dtype: int64
    
  4. The cuts Series can now be used to form groups. pandas allows you to pass many types into the .groupby method. Pass the cuts Series to the .groupby method and then call the .value_counts method on the AIRLINE column to find the distribution for each distance group. Notice that SkyWest (OO) makes up 33% of flights of less than 200 miles but only 16% of those between 200 and 500 miles:
    >>> (flights
    ...     .groupby(cuts)
    ...     ['AIRLINE']
    ...     .value_counts(normalize=True) 
    ...     .round(3)
    ... )
    DIST           AIRLINE         
    (-inf, 200.0]  OO         0.326
                   EV         0.289
                   MQ         0.211
                   DL         0.086
                   AA         0.052
                              ...  
    (2000.0, inf]  WN         0.046
                   HA         0.028
                   NK         0.019
                   AS         0.012
                   F9         0.004
    Name: AIRLINE, Length: 57, dtype: float64
    

How it works…

In step 2, the .cut function places each value of the DIST column into one of five bins. The bins are created by a sequence of six numbers defining the edges. You always need one more edge than the number of bins. You can pass the bins parameter an integer, which automatically creates that number of equal-width bins. Negative infinity and positive infinity values are available in NumPy and ensure that all values get placed in a bin. If you have values that are outside the bin edges, they will be made missing and not be placed in a bin.
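
The last two points are easy to check on a toy Series (hypothetical distances): values outside the finite edges come back as missing, and passing an integer to bins builds that many equal-width bins:

>>> toy = pd.Series([5, 150, 700, 2500])
>>> pd.cut(toy, bins=[200, 500, 1000, 2000]).isna()
0     True
1     True
2    False
3     True
dtype: bool
>>> pd.cut(toy, bins=4).cat.categories.size
4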

The cuts variable is now a Series of five ordered categories. It has all the normal Series methods and, in step 3, the .value_counts method is used to get a sense of its distribution.

The .groupby method allows you to pass any object to group on. This means that you are able to form groups from something completely unrelated to the current DataFrame. Here, we group by the values in the cuts variable. For each grouping, we find the percentage of flights per airline with .value_counts by setting normalize to True.
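
To make the "any object" point concrete, here is a minimal sketch grouping a toy Series by a plain NumPy array that has nothing to do with the Series itself, other than having the same length:

>>> toy = pd.Series([1, 2, 3, 4])
>>> keys = np.array(['x', 'y', 'x', 'y'])
>>> toy.groupby(keys).sum()
x    4
y    6
dtype: int64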

Some interesting insights can be drawn from this result. Looking at the full result, SkyWest is the leading airline for under 200 miles but has no flights over 2,000 miles. In contrast, American Airlines has the fifth highest total for flights under 200 miles but has by far the most flights between 1,000 and 2,000 miles.

There's more…

We can find more results when grouping by the cuts variable. For instance, we can find the 25th, 50th, and 75th percentile airtime for each distance grouping. As airtime is in minutes, we can divide by 60 to get hours. This will return a Series with a MultiIndex:

>>> (flights
...   .groupby(cuts)
...   ['AIR_TIME']
...   .quantile(q=[.25, .5, .75]) 
...   .div(60)
...   .round(2)
... )
DIST
(-inf, 200.0]     0.25    0.43
                  0.50    0.50
                  0.75    0.57
(200.0, 500.0]    0.25    0.77
                  0.50    0.92
                          ... 
(1000.0, 2000.0]  0.50    2.93
                  0.75    3.40
(2000.0, inf]     0.25    4.30
                  0.50    4.70
                  0.75    5.03
Name: AIR_TIME, Length: 15, dtype: float64     

We can use this information to create informative string labels when using the cut function. These labels replace the interval notation found in the index. We can also chain the .unstack method, which transposes the inner index level to column names:

>>> labels=['Under an Hour', '1 Hour', '1-2 Hours',
...         '2-4 Hours', '4+ Hours']
>>> cuts2 = pd.cut(flights['DIST'], bins=bins, labels=labels)
>>> (flights
...    .groupby(cuts2)
...    ['AIRLINE']
...    .value_counts(normalize=True) 
...    .round(3) 
...    .unstack() 
... )
AIRLINE           AA     AS     B6  ...     US     VX     WN
DIST                                ...
Under an Hour  0.052    NaN    NaN  ...    NaN    NaN  0.009
1 Hour         0.071  0.001  0.007  ...  0.016  0.028  0.194
1-2 Hours      0.144  0.023  0.003  ...  0.025  0.004  0.138
2-4 Hours      0.264  0.016  0.003  ...  0.040  0.012  0.160
4+ Hours       0.212  0.012  0.080  ...  0.065  0.074  0.046

Counting the total number of flights between cities

In the flights dataset, we have data on the origin and destination airport. It is trivial to count the number of flights originating in Houston and landing in Atlanta, for instance. What is more difficult is counting the total number of flights between the two cities.

In this recipe, we count the total number of flights between two cities, regardless of which one is the origin or destination. To accomplish this, we sort the origin and destination airports alphabetically so that each combination of airports always occurs in the same order. We can then use this new column arrangement to form groups and then to count.

How to do it…

  1. Read in the flights dataset, and find the total number of flights between each origin and destination airport:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
    >>> flights_ct
    ORG_AIR  DEST_AIR       
    ATL      ABE          31
             ABQ          16
             ABY          19
             ACY           6
             AEX          40
                        ... 
    SFO      SNA         122
             STL          20
             SUN          10
             TUS          20
             XNA           2
    Length: 1130, dtype: int64
    
  2. Select the total number of flights between Houston (IAH) and Atlanta (ATL) in both directions:
    >>> flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]]
    ORG_AIR  DEST_AIR
    ATL      IAH         121
    IAH      ATL         148
    dtype: int64
    
  3. We could simply sum these two numbers together to find the total flights between the cities, but there is a more efficient and automated solution that can work for all flights. Let's sort the origin and destination columns for each row alphabetically. We will use axis='columns' to do that:
    >>> f_part3 = (flights  
    ...   [['ORG_AIR', 'DEST_AIR']] 
    ...   .apply(lambda ser:
    ...          ser.sort_values().reset_index(drop=True),
    ...          axis='columns')
    ... )
    >>> f_part3
          DEST_AIR ORG_AIR
    0          SLC     LAX
    1          IAD     DEN
    2          VPS     DFW
    3          DCA     DFW
    4          MCI     LAX
    ...        ...     ...
    58487      DFW     SFO
    58488      SFO     LAS
    58489      SBA     SFO
    58490      ATL     MSP
    58491      BOI     SFO
    
  4. Now that the origin and destination values in each row are sorted, the column names are not correct. Let's rename them to something more generic and then again find the total number of flights between all cities:
    >>> rename_dict = {0:'AIR1', 1:'AIR2'}  
    >>> (flights     
    ...   [['ORG_AIR', 'DEST_AIR']]
    ...   .apply(lambda ser:
    ...          ser.sort_values().reset_index(drop=True),
    ...          axis='columns')
    ...   .rename(columns=rename_dict)
    ...   .groupby(['AIR1', 'AIR2'])
    ...   .size()
    ... )
    AIR1  AIR2
    ATL   ABE      31
          ABQ      16
          ABY      19
          ACY       6
          AEX      40
                 ... 
    SFO   SNA     122
          STL      20
          SUN      10
          TUS      20
          XNA       2
    Length: 1130, dtype: int64
    
  5. Let's select all the flights between Atlanta and Houston and verify that they match the sum of the values in step 2:
    >>> (flights     
    ...   [['ORG_AIR', 'DEST_AIR']]
    ...   .apply(lambda ser:
    ...          ser.sort_values().reset_index(drop=True),
    ...          axis='columns')
    ...   .rename(columns=rename_dict)
    ...   .groupby(['AIR1', 'AIR2'])
    ...   .size()
    ...   .loc[('ATL', 'IAH')]
    ... )
    269
    
  6. If we try and select flights with Houston followed by Atlanta, we get an error:
    >>> (flights
    ...   [['ORG_AIR', 'DEST_AIR']]
    ...   .apply(lambda ser:
    ...          ser.sort_values().reset_index(drop=True),
    ...          axis='columns')
    ...   .rename(columns=rename_dict)
    ...   .groupby(['AIR1', 'AIR2'])
    ...   .size()
    ...   .loc[('IAH', 'ATL')]
    ... )
    Traceback (most recent call last)
      ...
    KeyError: 'ATL'
    

How it works…

In step 1, we form groups by the origin and destination airport columns and then apply the .size method to the groupby object, which returns the total number of rows for each group. Notice that we could have passed the string size to the .agg method to achieve the same result. In step 2, the total number of flights for each direction between Atlanta and Houston is selected. The result is a Series that has a MultiIndex with two levels. One way to select rows from a MultiIndex is to pass the .loc index operator a tuple of the exact level values. Here, we select two different rows, ('ATL', 'IAH') and ('IAH', 'ATL'). We use a list of tuples to do this correctly.

Step 3 is the most important step in the recipe. We would like to have just one label for all flights between Atlanta and Houston and so far we have two. If we sort each combination of origin and destination airports alphabetically, we would then have a single label for flights between airports. To do this, we use the .apply method on a DataFrame. This is different from the groupby .apply method. No groups are formed in step 3.

The DataFrame .apply method must be passed a function. In this case, it's a lambda function that sorts each row. By default, this function is passed each column. We can change the direction of computation by using axis='columns' (or axis=1). The lambda function has each row of data passed to it implicitly as a Series. It returns a Series with sorted airport codes. We have to call .reset_index so that the columns do not realign after the application of the function.

The .apply method iterates over all rows using the lambda function. After completion of this operation, the values in the two columns are sorted for each row. The column names are now meaningless. We rename the column names in the next step and then perform the same grouping and aggregation as was done in step 2. This time, all flights between Atlanta and Houston fall under the same label.

There's more…

Steps 3 through 6 are expensive operations and take several seconds to complete. There are only about 60,000 rows, so this solution would not scale well to larger data. Calling the .apply method with axis='columns' (or axis=1) is one of the least performant operations in all of pandas. Internally, pandas loops over each row and does not provide any speed boosts from NumPy. If possible, avoid using .apply with axis=1.

We can get a massive speed increase with the NumPy sort function. Let's go ahead and use this function and analyze its output. By default, it sorts each row:

>>> data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
>>> data_sorted[:10]
array([['LAX', 'SLC'],
       ['DEN', 'IAD'],
       ['DFW', 'VPS'],
       ['DCA', 'DFW'],
       ['LAX', 'MCI'],
       ['IAH', 'SAN'],
       ['DFW', 'MSY'],
       ['PHX', 'SFO'],
       ['ORD', 'STL'],
       ['IAH', 'SJC']], dtype=object)

A two-dimensional NumPy array is returned. NumPy does not do grouping operations so let's use the DataFrame constructor to create a new DataFrame and check whether it equals the DataFrame from step 3:

>>> flights_sort2 = pd.DataFrame(data_sorted, columns=['AIR1', 'AIR2'])
>>> flights_sort2.equals(f_part3.rename(columns={0:'AIR1', 1:'AIR2'}))
True

Because the DataFrames are the same, you can replace step 3 with the previous faster sorting routine. Let's time the difference between each of the different sorting methods:

>>> %%timeit
>>> flights_sort = (flights   
...     [['ORG_AIR', 'DEST_AIR']] 
...    .apply(lambda ser:
...          ser.sort_values().reset_index(drop=True),
...          axis='columns')
... )
1min 5s ± 2.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
>>> data_sorted = np.sort(flights[['ORG_AIR', 'DEST_AIR']])
>>> flights_sort2 = pd.DataFrame(data_sorted,
...     columns=['AIR1', 'AIR2'])
14.6 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The NumPy solution is 4,452 times faster than using .apply with pandas in this example.

Finding the longest streak of on-time flights

One of the most important metrics for airlines is their on-time flight performance. The Federal Aviation Administration considers a flight delayed when it arrives at least 15 minutes later than its scheduled arrival time. pandas includes methods to calculate the total and percentage of on-time flights per airline. While these basic summary statistics are an important metric, there are other non-trivial calculations that are interesting, such as finding the length of consecutive on-time flights for each airline at each of its origin airports.

In this recipe, we find the longest consecutive streak of on-time flights for each airline at each origin airport. This requires each value in a column to be aware of the value immediately following it. We make clever use of the .diff and .cumsum methods to find streaks before applying this methodology to each of the groups.

The max_streak function we develop in this section exposes a regression in pandas 1.0 and 1.0.1. This bug (https://github.com/pandas-dev/pandas/issues/31802) should be fixed in pandas 1.0.2.

How to do it…

  1. Before we get started with the flights dataset, let's practice counting streaks of ones with a small sample Series:
    >>> s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
    >>> s
    0    0
    1    1
    2    1
    3    0
    4    1
    5    1
    6    1
    7    0
    dtype: int64
    
  2. Our final representation of the streaks of ones will be a Series of the same length as the original with an independent count beginning from one for each streak. To get started, let's use the .cumsum method:
    >>> s1 = s.cumsum()
    >>> s1
    0    0
    1    1
    2    2
    3    2
    4    3
    5    4
    6    5
    7    5
    dtype: int64
    
  3. We have now accumulated all the ones going down the Series. Let's multiply this Series by the original:
    >>> s.mul(s1)
    0    0
    1    1
    2    2
    3    0
    4    3
    5    4
    6    5
    7    0
    dtype: int64
    
  4. We have only non-zero values where we originally had ones. This result is fairly close to what we desire. We just need to restart each streak at one instead of where the cumulative sum left off. Let's chain the .diff method, which subtracts the previous value from the current:
    >>> s.mul(s1).diff()
    0    NaN
    1    1.0
    2    1.0
    3   -2.0
    4    3.0
    5    1.0
    6    1.0
    7   -5.0
    dtype: float64
    
  5. A negative value represents the end of a streak. We need to propagate the negative values down the Series and use them to subtract away the excess accumulation from step 2. To do this, we will make all non-negative values missing with the .where method:
    >>> (s
    ...     .mul(s.cumsum())
    ...     .diff()
    ...     .where(lambda x: x < 0)
    ... )
    0    NaN
    1    NaN
    2    NaN
    3   -2.0
    4    NaN
    5    NaN
    6    NaN
    7   -5.0
    dtype: float64
    
  6. We can now propagate these values down with the .ffill method:
    >>> (s
    ...     .mul(s.cumsum())
    ...     .diff()
    ...     .where(lambda x: x < 0)
    ...     .ffill()
    ... )
    0    NaN
    1    NaN
    2    NaN
    3   -2.0
    4   -2.0
    5   -2.0
    6   -2.0
    7   -5.0
    dtype: float64
    
  7. Finally, we can add this Series back to the cumulative sum to clear out the excess accumulation:
    >>> (s
    ...     .mul(s.cumsum())
    ...     .diff()
    ...     .where(lambda x: x < 0)
    ...     .ffill()
    ...     .add(s.cumsum(), fill_value=0)
    ... )
    0    0.0
    1    1.0
    2    2.0
    3    0.0
    4    1.0
    5    2.0
    6    3.0
    7    0.0
    dtype: float64
    
  8. Now that we have a working consecutive streak finder, we can find the longest streak per airline and origin airport. Let's read in the flights dataset and create a column to represent on-time arrival:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> (flights
    ...     .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
    ...     [['AIRLINE', 'ORG_AIR', 'ON_TIME']]
    ... )
          AIRLINE ORG_AIR  ON_TIME
    0          WN     LAX        0
    1          UA     DEN        1
    2          MQ     DFW        0
    3          AA     DFW        1
    4          WN     LAX        0
    ...       ...     ...      ...
    58487      AA     SFO        1
    58488      F9     LAS        1
    58489      OO     SFO        1
    58490      WN     MSP        0
    58491      OO     SFO        1
    
  9. Use our logic from the first seven steps to define a function that returns the maximum streak of ones for a given Series:
    >>> def max_streak(s):
    ...     s1 = s.cumsum()
    ...     return (s
    ...        .mul(s1)
    ...        .diff()
    ...        .where(lambda x: x < 0) 
    ...        .ffill()
    ...        .add(s1, fill_value=0)
    ...        .max()
    ...     )
    
  10. Find the maximum streak of on-time arrivals per airline and origin airport along with the total number of flights and the percentage of on-time arrivals. First, sort the day of the year and the scheduled departure time:
    >>> (flights
    ...     .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
    ...     .sort_values(['MONTH', 'DAY', 'SCHED_DEP']) 
    ...     .groupby(['AIRLINE', 'ORG_AIR'])
    ...     ['ON_TIME'] 
    ...     .agg(['mean', 'size', max_streak])
    ...     .round(2)
    ... )
                     mean  size  max_streak
    AIRLINE ORG_AIR
    AA      ATL      0.82   233          15
            DEN      0.74   219          17
            DFW      0.78  4006          64
            IAH      0.80   196          24
            LAS      0.79   374          29
    ...               ...   ...         ...
    WN      LAS      0.77  2031          39
            LAX      0.70  1135          23
            MSP      0.84   237          32
            PHX      0.77  1724          33
            SFO      0.76   445          17
    

How it works…

Finding streaks in the data is not a straightforward operation in pandas and requires methods that look ahead or behind, such as .diff or .shift, or those that remember their current state, such as .cumsum. The final result from the first seven steps is a Series the same length as the original that keeps track of all consecutive ones. Throughout these steps, we use the .mul and .add methods instead of their operator equivalents, (*) and (+). In my opinion, this allows for a slightly cleaner progression of calculations from left to right. You, of course, can replace these with the actual operators.

Ideally, we would like to tell pandas to apply the .cumsum method to the start of each streak and reset itself after the end of each one. It takes many steps to convey this message to pandas. Step 2 accumulates all the ones in the Series as a whole. The rest of the steps slowly remove any excess accumulation. To identify this excess accumulation, we need to find the end of each streak and subtract this value from the beginning of the next streak.

To find the end of each streak, we cleverly make all values not part of the streak zero by multiplying the cumulative sum by the original Series of zeros and ones in step 3. The first zero following a non-zero value marks the end of a streak. That's good, but again, we need to eliminate the excess accumulation. Knowing where the streak ends doesn't exactly get us there.

In step 4, we use the .diff method to find this excess. The .diff method takes the difference between the current value and any value located a set number of rows away from it. By default, the difference between the current and the immediately preceding value is returned.

Only negative values are meaningful in step 4. Those are the ones immediately following the end of a streak. These values need to be propagated down until the end of the following streak. To eliminate (make missing) all the values we don't care about, we use the .where method (this is different from the NumPy where function), which takes a Boolean array of the same size as the calling Series. By default, all the True values remain the same, while the False values become missing. The .where method allows you to use the calling Series as part of the conditional by taking a function as its first parameter. An anonymous function is used, which gets passed the calling Series implicitly and checks whether each value is less than zero. The result of step 5 is a Series where only the negative values are preserved, with the rest changed to missing.

The .ffill method in step 6 replaces missing values with the last non-missing value going down a Series. As the first three values don't follow a non-missing value, they remain missing. We finally have our Series that removes the excess accumulation. We add our accumulation Series to the result of step 6 to get the streaks all beginning from zero. The .add method allows us to replace the missing values with the fill_value parameter. This completes the process of finding streaks of ones in the dataset. When doing complex logic like this, it is a good idea to use a small dataset where you know what the final output will be. It would be quite a difficult task to start at step 8 and build this streak-finding logic while grouping.
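
As an aside, the same running count can be produced more compactly by grouping the Series on the cumulative count of its zeros, which starts a new group at every break in the streak. Here is a sketch using the toy Series from step 1; chaining .max onto this would give the same longest-streak value as max_streak:

>>> s = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])
>>> s.groupby(s.eq(0).cumsum()).cumsum()
0    0
1    1
2    2
3    0
4    1
5    2
6    3
7    0
dtype: int64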

In step 8, we create the ON_TIME column. One item of note is that the canceled flights have missing values for ARR_DELAY, which do not pass the Boolean condition and therefore result in a zero for the ON_TIME column. Canceled flights are treated the same as delayed.

Step 9 turns our logic from the first seven steps into a function and chains the .max method to return the longest streak. As our function returns a single value, it is formally an aggregating function and can be passed to the .agg method in step 10. To ensure that we are looking at consecutive flights, we use the .sort_values method to sort by date and scheduled departure time.

There's more…

Now that we have found the longest streaks of on-time arrivals, we can easily find the opposite – the longest streak of delayed arrivals. The following function returns two rows for each group passed to it. The first row is the start of the streak, and the last row is the end of the streak. Each row contains the month and day that the streak started and ended, along with the total streak length:

>>> def max_delay_streak(df):
...     df = df.reset_index(drop=True)
...     late = 1 - df['ON_TIME']
...     late_sum = late.cumsum()
...     streak = (late
...         .mul(late_sum)
...         .diff()
...         .where(lambda x: x < 0) 
...         .ffill()
...         .add(late_sum, fill_value=0)
...     )
...     last_idx = streak.idxmax()
...     first_idx = last_idx - streak.max() + 1
...     res = (df
...         .loc[[first_idx, last_idx], ['MONTH', 'DAY']]
...         .assign(streak=streak.max())
...     )
...     res.index = ['first', 'last']
...     return res
>>> (flights
...     .assign(ON_TIME=flights['ARR_DELAY'].lt(15).astype(int))
...     .sort_values(['MONTH', 'DAY', 'SCHED_DEP']) 
...     .groupby(['AIRLINE', 'ORG_AIR']) 
...     .apply(max_delay_streak) 
...     .sort_values('streak', ascending=False)
... )
                       MONTH   DAY  streak
AIRLINE ORG_AIR
AA      DFW     first    2.0  26.0    38.0
                last     3.0   1.0    38.0
MQ      ORD     last     1.0  12.0    28.0
                first    1.0   6.0    28.0
        DFW     last     2.0  26.0    25.0
...                      ...   ...     ...
US      LAS     last     1.0   7.0     1.0
AS      ATL     first    5.0   4.0     1.0
OO      LAS     first    2.0   8.0     1.0
EV      PHX     last     8.0   1.0     0.0
                first    NaN   NaN     0.0

As we are using the .apply groupby method, a DataFrame of each group is passed to the max_delay_streak function. Inside this function, the index of the DataFrame is dropped and replaced by a RangeIndex in order for us to easily find the first and last row of the streak. The ON_TIME column is inverted and then the same logic is used to find streaks of delayed flights. The indexes of the first and last rows of the streak are stored as variables. These indexes are then used to select the month and day when the streak started and ended. We use a DataFrame to return our results. We label and name the index to make the final result clearer.

Our final results show the longest delayed streaks accompanied by the first and last date. Let's investigate to see whether we can find out why these delays happened. Inclement weather is a common reason for delayed or canceled flights. Looking at the first row, American Airlines (AA) started a streak of 38 delayed flights in a row from the Dallas/Fort Worth (DFW) airport beginning February 26 and lasting until March 1, 2015. Looking at historical weather data from February 27, 2015, two inches of snow fell, which was a record for that day. This was a major weather event for DFW and caused problems for the entire city. Notice that DFW makes another appearance as the third longest streak, but this time a few days earlier and for a different airline.
