10

Restructuring Data into a Tidy Form

Introduction

The datasets used in the preceding chapters required little or no work to change their structure; we immediately began processing them in their original shape. Many datasets in the wild will need a significant amount of restructuring before a more detailed analysis can begin. In some cases, an entire project might concern itself solely with formatting the data in such a way that it can be easily processed by someone else.

There are many terms that are used to describe the process of data restructuring, with tidy data being the most common to data scientists. Tidy data is a term coined by Hadley Wickham to describe a form of data that makes analysis easy to do. This chapter will cover many ideas formulated by Hadley and how to accomplish them with pandas. To learn a great deal more about tidy data, read Hadley's paper (http://vita.had.co.nz/papers/tidy-data.pdf).

The following is an example of untidy data:

Name    Category    Value
Jill    Bank        2,300
Jill    Color       Red
John    Bank        1,100
Jill    Age         40
John    Color       Purple

The following is an example of tidy data:

Name    Age    Bank     Color
Jill    40     2,300    Red
John    38              Purple

What is tidy data? Hadley puts forth three guiding principles that determine whether a dataset is tidy:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

Any dataset that does not meet these guidelines is considered messy. This definition will make more sense once we start restructuring our data into tidy form, but for now, we'll need to know what variables, observations, and observational units are.

Using this jargon, a variable does not refer to a Python variable; it is a piece of data. It is good to think about the distinction between a variable name and the variable value. Variable names are labels, such as gender, race, salary, and position. Variable values are the things liable to change for every observation, such as male, female, or other for gender.

A single observation is the collection of all variable values for a single observational unit. To help understand what an observational unit might be, consider a retail store, which has data for each transaction, employee, customer, item, and the store itself. Each of these can be viewed as an observational unit and would require its own table. Combining employee information (like the number of hours worked) with customer information (like the amount spent) in the same table would break this tidy principle.

The first step to resolving messy data is to recognize it when it exists, and there are boundless possibilities. Hadley explicitly mentions five of the most common types of messy data:

  • Column names are values, not variable names
  • Multiple variables are stored in column names
  • Variables are stored in both rows and columns
  • Multiple types of observational units are stored in the same table
  • A single observational unit is stored in multiple tables

It is important to understand that tidying data does not typically involve changing the values of your dataset, filling in missing values, or doing any sort of analysis. Tidying data consists of changing the shape or structure of the data to meet the tidy principles. Tidy data is akin to having all your tools in the toolbox instead of scattered randomly throughout your house. Having the tools properly in the toolbox allows all other tasks to be completed easily. Once the data is in the correct form, it becomes much easier to perform further analysis.

Once you have spotted messy data, you will use the pandas library to restructure the data, so that it is tidy. The main tidy tools that pandas has available for you are the DataFrame methods .stack, .melt, .unstack, and .pivot. More complex tidying involves ripping apart text, which necessitates the .str accessor. Other helper methods, such as .rename, .rename_axis, .reset_index, and .set_index, will help with applying the final touches to tidy data.

Tidying variable values as column names with stack

To help understand the differences between tidy and messy data, let's take a look at a table that may or may not be in tidy form:

>>> import pandas as pd
>>> import numpy as np
>>> state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
>>> state_fruit
         Apple  Orange  Banana
Texas       12      10      40
Arizona      9       7      12
Florida      0      14     190

There does not appear to be anything messy about this table, and the information is easily consumable. However, according to the tidy principles, it isn't tidy. Each column name is the value of a variable. In fact, none of the variable names are even present in the DataFrame. One of the first steps to transform a messy dataset into tidy data is to identify all of the variables. In this particular dataset, we have variables for state and fruit. There's also the numeric data that wasn't identified anywhere in the context of the problem. We can label this variable as weight or any other sensible name.

This particular messy dataset contains variable values as column names. We will need to transpose these column names into column values. In this recipe, we use the stack method to restructure our DataFrame into tidy form.

How to do it…

  1. First, take note that the state names are in the index of the DataFrame. These states are correctly placed vertically and do not need to be restructured. It is the column names that are the problem. The .stack method takes all of the column names and pivots them into the index. Typically, when you call the .stack method, the data becomes taller.
  2. Note that in this case, the result collapses from a DataFrame to a Series:
    >>> state_fruit.stack()
    Texas    Apple      12
             Orange     10
             Banana     40
    Arizona  Apple       9
             Orange      7
             Banana     12
    Florida  Apple       0
             Orange     14
             Banana    190
    dtype: int64
    
  3. Notice that we now have a Series with a MultiIndex. There are now two levels in the index. The original index has been pushed to the left to make room for the fruit column names. With this one command, we now essentially have tidy data: each of the variables (state, fruit, and weight) is now vertical. Let's use the .reset_index method to turn the result into a DataFrame:
    >>> (state_fruit
    ...    .stack()
    ...    .reset_index()
    ... )
       level_0 level_1    0
    0    Texas   Apple   12
    1    Texas  Orange   10
    2    Texas  Banana   40
    3  Arizona   Apple    9
    4  Arizona  Orange    7
    5  Arizona  Banana   12
    6  Florida   Apple    0
    7  Florida  Orange   14
    8  Florida  Banana  190
    
  4. Our structure is now correct, but the column names are meaningless. Let's replace them with proper identifiers:
    >>> (state_fruit
    ...    .stack()
    ...    .reset_index()
    ...    .rename(columns={'level_0':'state', 
    ...       'level_1': 'fruit', 0: 'weight'})
    ... )
         state   fruit  weight
    0    Texas   Apple      12
    1    Texas  Orange      10
    2    Texas  Banana      40
    3  Arizona   Apple       9
    4  Arizona  Orange       7
    5  Arizona  Banana      12
    6  Florida   Apple       0
    7  Florida  Orange      14
    8  Florida  Banana     190
    
  5. Instead of using the .rename method, it is possible to use the lesser-known Series method .rename_axis to set the names of the index levels before using .reset_index:
    >>> (state_fruit
    ...     .stack()
    ...     .rename_axis(['state', 'fruit'])
    ... )
    state    fruit
    Texas    Apple      12
             Orange     10
             Banana     40
    Arizona  Apple       9
             Orange      7
             Banana     12
    Florida  Apple       0
             Orange     14
             Banana    190
    dtype: int64
    
  6. From here, we can chain the .reset_index method with the name parameter to reproduce the output from step 4:
    >>> (state_fruit
    ...     .stack()
    ...     .rename_axis(['state', 'fruit'])
    ...     .reset_index(name='weight')
    ... )
         state   fruit  weight
    0    Texas   Apple      12
    1    Texas  Orange      10
    2    Texas  Banana      40
    3  Arizona   Apple       9
    4  Arizona  Orange       7
    5  Arizona  Banana      12
    6  Florida   Apple       0
    7  Florida  Orange      14
    8  Florida  Banana     190
    

How it works…

The .stack method is powerful, and it takes time to understand and appreciate fully. By default, it takes the column names (the innermost level, if the columns are hierarchical) and transposes them, so they become the new innermost index level. Notice how each old column name still labels its original value by being paired with each state. There were nine original values in a 3 x 3 DataFrame, which got transformed into a single Series with the same number of values. The original first row of data became the first three values in the resulting Series.
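
If the columns do have multiple levels, .stack also accepts a level name or integer position to control which level gets pivoted. A minimal sketch with made-up data (the level names func and column are invented for illustration):

>>> cols = pd.MultiIndex.from_product(
...     [['min', 'max'], ['UGDS', 'SATMTMID']],
...     names=['func', 'column'])
>>> df = pd.DataFrame([[1, 2, 3, 4]], columns=cols)
>>> df.stack('func')      # pivots the outer 'func' level into the index
>>> df.stack('column')    # pivots the innermost level, same as the default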

After resetting the index in step 3, pandas defaults our DataFrame columns to level_0, level_1, and 0 (two strings and one integer). This is because the Series calling this method has two index levels that were formerly unnamed. pandas also refers to index levels by integer, numbering them from zero beginning at the outermost level.

Step 4 shows an intuitive way to rename the columns with the .rename method.

Alternatively, it is possible to set the column names by chaining the .rename_axis method, which takes a list of values for the index level names. pandas uses these index level names as the new column names when the index is reset. Additionally, the .reset_index method has a name parameter corresponding to the new column name of the Series values.

All Series have a name attribute that can be assigned or changed with the .rename method. It is this attribute that becomes the column name when using .reset_index.
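
For example, here is a sketch that produces the same tidy result as step 6 by naming the Series directly:

>>> (state_fruit
...     .stack()
...     .rename('weight')                  # sets the Series .name attribute
...     .rename_axis(['state', 'fruit'])
...     .reset_index()                     # .name and the level names become columns
... )
     state   fruit  weight
0    Texas   Apple      12
1    Texas  Orange      10
2    Texas  Banana      40
3  Arizona   Apple       9
4  Arizona  Orange       7
5  Arizona  Banana      12
6  Florida   Apple       0
7  Florida  Orange      14
8  Florida  Banana     190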

There's more…

One of the keys to using .stack is to place all of the columns that you do not wish to transform in the index. The dataset in this recipe was initially read with the states in the index. Let's take a look at what would have happened if we did not read the states into the index:

>>> state_fruit2 = pd.read_csv('data/state_fruit2.csv')
>>> state_fruit2
     State  Apple  Orange  Banana
0    Texas     12      10      40
1  Arizona      9       7      12
2  Florida      0      14     190

As the state names are not in the index, using .stack on this DataFrame reshapes all values into one long Series of values:

>>> state_fruit2.stack()
0  State       Texas
   Apple          12
   Orange         10
   Banana         40
1  State     Arizona
              ...   
   Banana         12
2  State     Florida
   Apple           0
   Orange         14
   Banana        190
Length: 12, dtype: object

This command reshapes all the columns, this time including the states, and is not at all what we need. To reshape this data correctly, you will need to put all the non-reshaped columns into the index first with the .set_index method, and then use .stack. The following code gives a similar result to step 2:

>>> state_fruit2.set_index('State').stack()
State
Texas    Apple      12
         Orange     10
         Banana     40
Arizona  Apple       9
         Orange      7
         Banana     12
Florida  Apple       0
         Orange     14
         Banana    190
dtype: int64

Tidying variable values as column names with melt

Like most large Python libraries, pandas has many different ways to accomplish the same task, the differences usually being readability and performance. A DataFrame has a method named .melt that is similar to the .stack method described in the previous recipe but gives a bit more flexibility.

In this recipe, we use the .melt method to tidy a DataFrame with variable values as column names.

How to do it…

  1. Read in the state_fruit2.csv dataset:
    >>> state_fruit2 = pd.read_csv('data/state_fruit2.csv')
    >>> state_fruit2
         State  Apple  Orange  Banana
    0    Texas     12      10      40
    1  Arizona      9       7      12
    2  Florida      0      14     190
    
  2. Use the .melt method by passing the appropriate columns to the id_vars and value_vars parameters:
    >>> state_fruit2.melt(id_vars=['State'],
    ...     value_vars=['Apple', 'Orange', 'Banana'])
         State variable  value
    0    Texas    Apple     12
    1  Arizona    Apple      9
    2  Florida    Apple      0
    3    Texas   Orange     10
    4  Arizona   Orange      7
    5  Florida   Orange     14
    6    Texas   Banana     40
    7  Arizona   Banana     12
    8  Florida   Banana    190
    
  3. This one step creates tidy data for us. By default, .melt names the column holding the old column names variable and the column holding the corresponding values value. Conveniently, .melt has two additional parameters, var_name and value_name, that give you the ability to rename these two columns:
    >>> state_fruit2.melt(id_vars=['State'],
    ...                    value_vars=['Apple', 'Orange', 'Banana'],
    ...                    var_name='Fruit',
    ...                    value_name='Weight')
         State   Fruit  Weight
    0    Texas   Apple      12
    1  Arizona   Apple       9
    2  Florida   Apple       0
    3    Texas  Orange      10
    4  Arizona  Orange       7
    5  Florida  Orange      14
    6    Texas  Banana      40
    7  Arizona  Banana      12
    8  Florida  Banana     190
    

How it works…

The .melt method reshapes your DataFrame. It takes up to five parameters, with two of them being crucial to understanding how to reshape your data correctly:

  • id_vars is a list of column names that you want to preserve as columns and not reshape
  • value_vars is a list of column names that you want to reshape into a single column

The id_vars, or identification variables, remain in the same column but repeat for each of the columns passed to value_vars. One crucial aspect of .melt is that it ignores values in the index; it silently drops your index and replaces it with a default RangeIndex. This means that if you have values in your index that you would like to keep, you will need to reset the index before using .melt.
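
For instance, here is a sketch using the state_fruit DataFrame from the previous recipe, whose unnamed index holds the states (so .reset_index produces a column called index):

>>> (state_fruit
...     .reset_index()
...     .rename(columns={'index': 'State'})
...     .melt(id_vars='State', var_name='Fruit', value_name='Weight')
... )
     State   Fruit  Weight
0    Texas   Apple      12
1  Arizona   Apple       9
2  Florida   Apple       0
3    Texas  Orange      10
4  Arizona  Orange       7
5  Florida  Orange      14
6    Texas  Banana      40
7  Arizona  Banana      12
8  Florida  Banana     190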

There's more…

All the parameters for the .melt method are optional, and if you desire all your values to be in a single column and their old column labels to be in the other, you may call .melt with the default parameters:

>>> state_fruit2.melt()
   variable    value
0     State    Texas
1     State  Arizona
2     State  Florida
3     Apple       12
4     Apple        9
..      ...      ...
7    Orange        7
8    Orange       14
9    Banana       40
10   Banana       12
11   Banana      190

More realistically, you might have lots of variables that need melting and would like to specify only the identification variables. In that case, calling .melt in the following manner will yield the same result as in step 2. You don't even need a list when melting a single column and can pass its string value:

>>> state_fruit2.melt(id_vars='State')
     State variable  value
0    Texas    Apple     12
1  Arizona    Apple      9
2  Florida    Apple      0
3    Texas   Orange     10
4  Arizona   Orange      7
5  Florida   Orange     14
6    Texas   Banana     40
7  Arizona   Banana     12
8  Florida   Banana    190

Stacking multiple groups of variables simultaneously

Some datasets contain multiple groups of variables as column names that need to be stacked simultaneously into their own columns. An example involving the movie dataset can help clarify this. Let's begin by selecting all columns containing the actor names and their corresponding Facebook likes:

>>> movie = pd.read_csv('data/movie.csv')
>>> actor = movie[['movie_title', 'actor_1_name',
...                'actor_2_name', 'actor_3_name',
...                'actor_1_facebook_likes',
...                'actor_2_facebook_likes',
...                'actor_3_facebook_likes']]
>>> actor.head()
                                  movie_title  ...
0                                      Avatar  ...
1    Pirates of the Caribbean: At World's End  ...
2                                     Spectre  ...
3                       The Dark Knight Rises  ...
4  Star Wars: Episode VII - The Force Awakens  ...

If we define our variables as the title of the movie, the actor name, and the number of Facebook likes, then we will need to stack two sets of columns, which is not possible using a single call to .stack or .melt.

In this recipe, we will tidy our actor DataFrame by simultaneously stacking the actor names and their corresponding Facebook likes with the wide_to_long function.

How to do it…

  1. We will be using the wide_to_long function to reshape our data into tidy form. To use this function, we will need to change the column names that we are stacking, so that they end with a digit. We first create a user-defined function to change the column names:
    >>> def change_col_name(col_name):
    ...     # 'actor_1_name' -> 'actor_1'
    ...     col_name = col_name.replace('_name', '')
    ...     if 'facebook' in col_name:
    ...         # move the digit to the end:
    ...         # 'actor_1_facebook_likes' -> 'actor_facebook_likes_1'
    ...         fb_idx = col_name.find('facebook')
    ...         col_name = (col_name[:5] + col_name[fb_idx - 1:]
    ...                     + col_name[5:fb_idx - 1])
    ...     return col_name
    
  2. Pass this function to the rename method to transform all the column names:
    >>> actor2 = actor.rename(columns=change_col_name)
    >>> actor2
          movie_title      actor_1  ... actor_facebook_likes_2
    0          Avatar  CCH Pounder  ...        936.0
    1     Pirates ...  Johnny Depp  ...       5000.0
    2         Spectre  Christop...  ...        393.0
    3     The Dark...    Tom Hardy  ...      23000.0
    4     Star War...  Doug Walker  ...         12.0
    ...           ...          ...  ...          ...
    4911  Signed S...  Eric Mabius  ...        470.0
    4912  The Foll...  Natalie Zea  ...        593.0
    4913  A Plague...  Eva Boehnke  ...          0.0
    4914  Shanghai...    Alan Ruck  ...        719.0
    4915  My Date ...  John August  ...         23.0
    
  3. Use the wide_to_long function to stack the actor and Facebook sets of columns simultaneously:
    >>> stubs = ['actor', 'actor_facebook_likes']
    >>> actor2_tidy = pd.wide_to_long(actor2,
    ...     stubnames=stubs,
    ...     i=['movie_title'],
    ...     j='actor_num',
    ...     sep='_')
    >>> actor2_tidy.head()
                                  actor  actor_facebook_likes
    movie_title  actor_num                          
    Avatar       1          CCH Pounder       1000.0
    Pirates o... 1          Johnny Depp      40000.0
    Spectre      1          Christop...      11000.0
    The Dark ... 1            Tom Hardy      27000.0
    Star Wars... 1          Doug Walker        131.0
    

How it works…

The wide_to_long function works in a fairly specific manner. Its main parameter is stubnames, which is a list of strings. Each string represents a single column grouping. All columns that begin with this string will be stacked into a single column. In this recipe, there are two groups of columns: actor and actor_facebook_likes. By default, each of these groups of columns will need to end in a digit. This digit will subsequently be used to label the reshaped data. Each of these column groups has an underscore character separating the stubname from the ending digit. To account for this, you must use the sep parameter.

The original column names do not match the pattern needed for wide_to_long to work. They could have been changed manually by assigning a list of new names, but that would quickly become a lot of typing, so instead we define a function that automatically converts our columns to a format that works. The change_col_name function removes _name from the actor columns and rearranges the facebook columns so that they both end in digits.

To accomplish the column renaming, we use the .rename method in step 2. It accepts many different types of arguments, one of which is a function. When passed a function, every column name gets passed to it, one at a time.
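
Any callable works here, not just a user-defined function. For instance, this purely illustrative call uppercases every column name:

>>> actor.rename(columns=str.upper).columns
Index(['MOVIE_TITLE', 'ACTOR_1_NAME', 'ACTOR_2_NAME', 'ACTOR_3_NAME',
       'ACTOR_1_FACEBOOK_LIKES', 'ACTOR_2_FACEBOOK_LIKES',
       'ACTOR_3_FACEBOOK_LIKES'],
      dtype='object')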

We have now correctly created the two groups of columns that will be stacked: those beginning with actor and those beginning with actor_facebook_likes. In addition to this, wide_to_long requires a unique column, parameter i, to act as an identification variable that will not be stacked. Also required is the parameter j, which names the new index level holding the identifying digit stripped from the end of the original column names. By default, the suffix parameter contains the regular expression r'\d+', which searches for one or more digits. The \d is a special token that matches the digits 0-9. The plus sign, +, makes the expression match one or more of these digits.

There's more…

The function wide_to_long works when all groupings of variables have the same numeric ending like they did in this recipe. When your variables do not have the same ending or don't end in a digit, you can still use wide_to_long to do simultaneous column stacking. For instance, let's take a look at the following dataset:

>>> df = pd.read_csv('data/stackme.csv')
>>> df
  State Country    a1   b2   Test  d  e
0    TX      US  0.45  0.3  Test1  2  6
1    MA      US  0.03  1.2  Test2  9  7
2    ON     CAN  0.70  4.2  Test3  4  2

Let's say we wanted columns a1 and b2 stacked together, as well as columns d and e. Additionally, we want to use a1 and b2 as labels for the rows. To accomplish this task, we would need to rename the columns so that they ended in the label we desired:

>>> df.rename(columns = {'a1':'group1_a1', 'b2':'group1_b2',
...                      'd':'group2_a1', 'e':'group2_b2'})
  State Country  ...  group2_a1  group2_b2
0    TX      US  ...          2          6
1    MA      US  ...          9          7
2    ON     CAN  ...          4          2

We would then need to modify the suffix parameter, which normally defaults to a regular expression that selects digits. Here, we tell it to find any number of characters:

>>> pd.wide_to_long(
...        df.rename(columns = {'a1':'group1_a1', 
...                  'b2':'group1_b2',
...                  'd':'group2_a1', 'e':'group2_b2'}),
...     stubnames=['group1', 'group2'],
...     i=['State', 'Country', 'Test'],
...     j='Label',
...     suffix='.+',
...     sep='_')
                           group1  group2
State Country Test  Label
TX    US      Test1 a1       0.45       2
                    b2       0.30       6
MA    US      Test2 a1       0.03       9
                    b2       1.20       7
ON    CAN     Test3 a1       0.70       4
                    b2       4.20       2

Inverting stacked data

DataFrames have two similar methods, .stack and .melt, to convert horizontal column names into vertical column values. DataFrames can invert these two operations with the .unstack and .pivot methods, respectively. .stack and .unstack are methods that allow control over only the column and row indexes, while .melt and .pivot give more flexibility to choose which columns are reshaped.
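
As a quick sanity check of this inverse relationship, here is a sketch using the state_fruit data from the first recipe (it has no missing values, so the round trip is exact):

>>> state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
>>> state_fruit.stack().unstack().equals(state_fruit)
True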

In this recipe, we will call .stack and .melt on a dataset and promptly invert the operation with the .unstack and .pivot methods.

How to do it…

  1. Read in the college dataset with the institution name as the index, and with only the undergraduate race columns:
    >>> def usecol_func(name):
    ...     return 'UGDS_' in name or name == 'INSTNM'
    >>> college = pd.read_csv('data/college.csv',
    ...     index_col='INSTNM',
    ...     usecols=usecol_func)
    >>> college
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
    Universit...      0.5922      0.2600  ...    0.0179     0.0100
    Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
    Universit...      0.6988      0.1255  ...    0.0332     0.0350
    Alabama S...      0.0158      0.9208  ...    0.0243     0.0137
    ...                  ...         ...  ...       ...        ...
    SAE Insti...         NaN         NaN  ...       NaN        NaN
    Rasmussen...         NaN         NaN  ...       NaN        NaN
    National ...         NaN         NaN  ...       NaN        NaN
    Bay Area ...         NaN         NaN  ...       NaN        NaN
    Excel Lea...         NaN         NaN  ...       NaN        NaN
    
  2. Use the .stack method to convert each horizontal column name to a vertical index level:
    >>> college_stacked = college.stack()
    >>> college_stacked
    INSTNM
    Alabama A & M University         UGDS_WHITE    0.0333
                                     UGDS_BLACK    0.9353
                                     UGDS_HISP     0.0055
                                     UGDS_ASIAN    0.0019
                                     UGDS_AIAN     0.0024
                                                    ...
    Coastal Pines Technical College  UGDS_AIAN     0.0034
                                     UGDS_NHPI     0.0017
                                     UGDS_2MOR     0.0191
                                     UGDS_NRA      0.0028
                                     UGDS_UNKN     0.0056
    Length: 61866, dtype: float64
    
  3. Invert this stacked data back to its original form with the .unstack method:
    >>> college_stacked.unstack()
                  UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM                                ...
    Alabama A...      0.0333      0.9353  ...    0.0059     0.0138
    Universit...      0.5922      0.2600  ...    0.0179     0.0100
    Amridge U...      0.2990      0.4192  ...    0.0000     0.2715
    Universit...      0.6988      0.1255  ...    0.0332     0.0350
    Alabama S...      0.0158      0.9208  ...    0.0243     0.0137
    ...                  ...         ...  ...       ...        ...
    Hollywood...      0.2182      0.4182  ...    0.0182     0.0909
    Hollywood...      0.1200      0.3333  ...    0.0000     0.0667
    Coachella...      0.3284      0.1045  ...    0.0000     0.0000
    Dewey Uni...      0.0000      0.0000  ...    0.0000     0.0000
    Coastal P...      0.6762      0.2508  ...    0.0028     0.0056
    
  4. A similar sequence of operations can be done with .melt followed by .pivot. First, read in the data without putting the institution name in the index:
    >>> college2 = pd.read_csv('data/college.csv',
    ...    usecols=usecol_func)
    >>> college2
               INSTNM  UGDS_WHITE  ...  UGDS_NRA  UGDS_UNKN
    0     Alabama ...      0.0333  ...    0.0059     0.0138
    1     Universi...      0.5922  ...    0.0179     0.0100
    2     Amridge ...      0.2990  ...    0.0000     0.2715
    3     Universi...      0.6988  ...    0.0332     0.0350
    4     Alabama ...      0.0158  ...    0.0243     0.0137
    ...           ...         ...  ...       ...        ...
    7530  SAE Inst...         NaN  ...       NaN        NaN
    7531  Rasmusse...         NaN  ...       NaN        NaN
    7532  National...         NaN  ...       NaN        NaN
    7533  Bay Area...         NaN  ...       NaN        NaN
    7534  Excel Le...         NaN  ...       NaN        NaN
    
  5. Use the .melt method to transpose all the race columns into a single column:
    >>> college_melted = college2.melt(id_vars='INSTNM',
    ...     var_name='Race',
    ...     value_name='Percentage')
    >>> college_melted
                INSTNM        Race  Percentage
    0      Alabama ...  UGDS_WHITE      0.0333
    1      Universi...  UGDS_WHITE      0.5922
    2      Amridge ...  UGDS_WHITE      0.2990
    3      Universi...  UGDS_WHITE      0.6988
    4      Alabama ...  UGDS_WHITE      0.0158
    ...            ...         ...         ...
    67810  SAE Inst...   UGDS_UNKN         NaN
    67811  Rasmusse...   UGDS_UNKN         NaN
    67812  National...   UGDS_UNKN         NaN
    67813  Bay Area...   UGDS_UNKN         NaN
    67814  Excel Le...   UGDS_UNKN         NaN
    
  6. Use the .pivot method to invert this previous result:
    >>> melted_inv = college_melted.pivot(index='INSTNM',
    ...     columns='Race',
    ...     values='Percentage')
    >>> melted_inv
    Race          UGDS_2MOR  UGDS_AIAN  ...  UGDS_UNKN  UGDS_WHITE
    INSTNM                              ...
    A & W Hea...     0.0000     0.0000  ...     0.0000      0.0000
    A T Still...        NaN        NaN  ...        NaN         NaN
    ABC Beaut...     0.0000     0.0000  ...     0.0000      0.0000
    ABC Beaut...     0.0000     0.0000  ...     0.0000      0.2895
    AI Miami ...     0.0018     0.0000  ...     0.4644      0.0324
    ...                 ...        ...  ...        ...         ...
    Yukon Bea...     0.0000     0.1200  ...     0.0000      0.8000
    Z Hair Ac...     0.0211     0.0000  ...     0.0105      0.9368
    Zane Stat...     0.0218     0.0029  ...     0.2399      0.6995
    duCret Sc...     0.0976     0.0000  ...     0.0244      0.4634
    eClips Sc...     0.0000     0.0000  ...     0.0000      0.1446
    
  7. Notice that the institution names are now shuttled over into the index and are not in their original order; the column names are not in their original order either. To get an exact replication of our starting DataFrame from step 4, use the .loc index operator to select rows and columns simultaneously and then reset the index:
    >>> college2_replication = (melted_inv
    ...     .loc[college2['INSTNM'], college2.columns[1:]]
    ...     .reset_index()
    ... )
    >>> college2.equals(college2_replication)
    True
    

How it works…

There are multiple ways to accomplish the same thing in step 1. Here, we show the versatility of the read_csv function. The usecols parameter accepts either a list of the columns that we would like to import or a function that dynamically determines them. We use a function that checks whether the column name contains UGDS_ or is equal to INSTNM. The function is passed each column name as a string and must return a Boolean. A considerable amount of memory can be saved in this manner.

The .stack method in step 2 puts all column names into the innermost index level and returns a Series. In step 3, the .unstack method inverts this operation by taking all the values in the innermost index level and converting them to column names. Note that the sizes of the results of steps 1 and 3 are different because .stack drops missing values by default. If you pass in the dropna=False parameter, it will round-trip correctly.
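
Here is a sketch of that check, assuming the college and college_stacked objects from steps 1 and 2:

>>> college_stacked.unstack().equals(college)   # institutions that were all NaN are gone
False
>>> college.stack(dropna=False).unstack().equals(college)
True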

Step 4 reads in the same dataset as in step 1 but does not put the institution name in the index because the .melt method isn't able to access it. Step 5 uses the .melt method to transpose all the Race columns. It does this by leaving the value_vars parameter as its default value, None. When not specified, all the columns not present in the id_vars parameter get transposed.

Step 6 inverts the operation from step 5 with the .pivot method, which accepts three parameters. Each parameter takes a single column as a string (the values parameter may also accept a list of column names). The column referenced by the index parameter remains vertical and becomes the new index. The values of the column referenced by the columns parameter become the column names. The values of the column referenced by the values parameter are placed at the intersections of their former index and column labels.
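
To see those three parameters in isolation, here is a minimal sketch that melts the small fruit dataset from earlier in the chapter and then pivots it back:

>>> tidy = (pd.read_csv('data/state_fruit2.csv')
...     .melt(id_vars='State'))
>>> tidy.pivot(index='State', columns='variable', values='value')
variable  Apple  Banana  Orange
State
Arizona       9      12       7
Florida       0     190      14
Texas        12      40      10

Note that .pivot sorts both the index and the columns alphabetically, which is why the replication in step 7 needs .loc to restore the original order.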

To make a replication with pivot, we need to sort the rows and columns in the same order as the original. As the institution name is in the index, we use the .loc index operator to sort the DataFrame by its original index.

There's more…

To help further understand .stack and .unstack, let's use them to transpose the college DataFrame. In this context, we are using the precise mathematical definition of the transposing of a matrix, where the new rows are the old columns of the original data matrix.

If you take a look at the output from step 2, you'll notice that there are two index levels. By default, the .unstack method uses the innermost index level as the new column values. Index levels are numbered beginning from zero from the outside. pandas defaults the level parameter of the .unstack method to -1, which refers to the innermost index level. We can instead unstack the outermost index level using level=0:

>>> college.stack().unstack(0)
INSTNM      Alabama A...  ...  Coastal P...
UGDS_WHITE       0.0333  ...       0.6762
UGDS_BLACK       0.9353  ...       0.2508
UGDS_HISP        0.0055  ...       0.0359
UGDS_ASIAN       0.0019  ...       0.0045
UGDS_AIAN        0.0024  ...       0.0034
UGDS_NHPI        0.0019  ...       0.0017
UGDS_2MOR        0.0000  ...       0.0191
UGDS_NRA         0.0059  ...       0.0028
UGDS_UNKN        0.0138  ...       0.0056

There is a way to transpose a DataFrame that does not require .stack or .unstack. Use the .transpose method or the .T attribute like this:

>>> college.T
>>> college.transpose()
INSTNM      Alabama A...  ...  Coastal P...
UGDS_WHITE       0.0333  ...       0.6762
UGDS_BLACK       0.9353  ...       0.2508
UGDS_HISP        0.0055  ...       0.0359
UGDS_ASIAN       0.0019  ...       0.0045
UGDS_AIAN        0.0024  ...       0.0034
UGDS_NHPI        0.0019  ...       0.0017
UGDS_2MOR        0.0000  ...       0.0191
UGDS_NRA         0.0059  ...       0.0028
UGDS_UNKN        0.0138  ...       0.0056

Unstacking after a groupby aggregation

Grouping data by a single column and performing an aggregation on a single column returns a result that is easy to consume. When grouping by more than one column, a resulting aggregation might not be structured in a manner that makes consumption easy. Since .groupby operations, by default, put the unique grouping columns in the index, the .unstack method can be beneficial to rearrange the data so that it is presented in a manner that is more useful for interpretation.

In this recipe, we use the employee dataset to perform an aggregation, grouping by multiple columns. We then use the .unstack method to reshape the result into a format that makes for easier comparisons of different groups.

How to do it…

  1. Read in the employee dataset and find the mean salary by race:
    >>> employee = pd.read_csv('data/employee.csv')
    >>> (employee
    ...     .groupby('RACE')
    ...     ['BASE_SALARY']
    ...     .mean()
    ...     .astype(int)
    ... )
    RACE
    American Indian or Alaskan Native    60272
    Asian/Pacific Islander               61660
    Black or African American            50137
    Hispanic/Latino                      52345
    Others                               51278
    White                                64419
    Name: BASE_SALARY, dtype: int64
    
  2. This is a groupby operation that results in a Series that is easy to read and has no need to reshape. Let's now find the mean salary by race and gender. Note that the result is a Series:
    >>> (employee
    ...     .groupby(['RACE', 'GENDER'])
    ...     ['BASE_SALARY'] 
    ...     .mean()
    ...     .astype(int)
    ... )
    RACE                               GENDER
    American Indian or Alaskan Native  Female    60238
                                       Male      60305
    Asian/Pacific Islander             Female    63226
                                       Male      61033
    Black or African American          Female    48915
                                                 ...
    Hispanic/Latino                    Male      54782
    Others                             Female    63785
                                       Male      38771
    White                              Female    66793
                                       Male      63940
    Name: BASE_SALARY, Length: 12, dtype: int64
    
  3. This aggregation is more complex and can be reshaped to make different comparisons easier. For instance, it would be easier to compare male versus female salaries for each race if they were side by side and not vertical as they are now. Let's call .unstack on the GENDER index level:
    >>> (employee
    ...     .groupby(['RACE', 'GENDER'])
    ...     ['BASE_SALARY'] 
    ...     .mean()
    ...     .astype(int)
    ...     .unstack('GENDER')
    ... )
    GENDER                             Female   Male
    RACE
    American Indian or Alaskan Native   60238  60305
    Asian/Pacific Islander              63226  61033
    Black or African American           48915  51082
    Hispanic/Latino                     46503  54782
    Others                              63785  38771
    White                               66793  63940
    
  4. Similarly, we can unstack the race index level:
    >>> (employee
    ...     .groupby(['RACE', 'GENDER'])
    ...     ['BASE_SALARY'] 
    ...     .mean()
    ...     .astype(int)
    ...     .unstack('RACE')
    ... )
    RACE    American Indian or Alaskan Native  ...  White
    GENDER                                     ...
    Female                              60238  ...  66793
    Male                                60305  ...  63940
    

How it works…

Step 1 has the simplest possible aggregation with a single grouping column (RACE), a single aggregating column (BASE_SALARY), and a single aggregating function (.mean). This result is easy to consume and doesn't require any more processing to evaluate. Step 2 groups by both race and gender together. The resulting Series (which has a MultiIndex) contains all the values in a single dimension, which makes comparisons more difficult. To make the information easier to consume, we use the .unstack method to convert the values in one (or more) of the levels to columns.

By default, .unstack uses the innermost index level as the new columns. You can specify the level you would like to unstack with the level parameter, which accepts either the level name as a string or the level integer location. It is preferable to use the level name over the integer location to avoid ambiguity. Steps 3 and 4 unstack each level, which results in a DataFrame with a single-level index. It is now much easier to compare salaries from each race by gender.

There's more…

If there are multiple aggregating functions when performing a groupby with a single column from a DataFrame, then the immediate result will be a DataFrame and not a Series. For instance, let's calculate more aggregations than just the mean, as was done in step 2:

>>> (employee
...     .groupby(['RACE', 'GENDER'])
...     ['BASE_SALARY']
...     .agg(['mean', 'max', 'min'])
...     .astype(int)
... )
                                           mean     max    min
RACE                              GENDER
American Indian or Alaskan Native Female  60238   98536  26125
                                  Male    60305   81239  26125
Asian/Pacific Islander            Female  63226  130416  26125
                                  Male    61033  163228  27914
Black or African American         Female  48915  150416  24960
...                                         ...     ...    ...
Hispanic/Latino                   Male    54782  165216  26104
Others                            Female  63785   63785  63785
                                  Male    38771   38771  38771
White                             Female  66793  178331  27955
                                  Male    63940  210588  26125

Unstacking the GENDER level will result in columns with a MultiIndex. From here, you can keep swapping row and column levels with both the .unstack and .stack methods until you achieve the structure of data you desire:

>>> (employee
...     .groupby(['RACE', 'GENDER'])
...     ['BASE_SALARY']
...     .agg(['mean', 'max', 'min'])
...     .astype(int)
...     .unstack('GENDER')
... )
               mean         ...    min
GENDER       Female   Male  ... Female   Male
RACE                        ...
American ...  60238  60305  ...  26125  26125
Asian/Pac...  63226  61033  ...  26125  27914
Black or ...  48915  51082  ...  24960  26125
Hispanic/...  46503  54782  ...  26125  26104
Others        63785  38771  ...  63785  38771
White         66793  63940  ...  27955  26125

Replicating pivot_table with a groupby aggregation

At first glance, it may seem that the .pivot_table method provides a unique way to analyze data. However, after a little massaging, it is possible to replicate its functionality with the .groupby method. Knowing this equivalence can help shrink the universe of pandas functionality.

In this recipe, we use the flights dataset to create a pivot table and then recreate it using the .groupby method.

How to do it…

  1. Read in the flights dataset, and use the .pivot_table method to find the total number of canceled flights per origin airport for each airline:
    >>> flights = pd.read_csv('data/flights.csv')
    >>> fpt = flights.pivot_table(index='AIRLINE',
    ...     columns='ORG_AIR',
    ...     values='CANCELLED',
    ...     aggfunc='sum',
    ...     fill_value=0)
    >>> fpt
    ORG_AIR  ATL  DEN  DFW  IAH  LAS  LAX  MSP  ORD  PHX  SFO
    AIRLINE
    AA         3    4   86    3    3   11    3   35    4    2
    AS         0    0    0    0    0    0    0    0    0    0
    B6         0    0    0    0    0    0    0    0    0    1
    DL        28    1    0    0    1    1    4    0    1    2
    EV        18    6   27   36    0    0    6   53    0    0
    ...      ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
    OO         3   25    2   10    0   15    4   41    9   33
    UA         2    9    1   23    3    6    2   25    3   19
    US         0    0    2    2    1    0    0    6    7    3
    VX         0    0    0    0    0    3    0    0    0    3
    WN         9   13    0    0    7   32    1    0    6   25
    
  2. To replicate this with the .groupby method, we will need to group by two columns and then unstack the result; a groupby aggregation alone cannot replicate this table. The trick is to group by all the columns in both the index and columns parameters first:
    >>> (flights
    ...     .groupby(['AIRLINE', 'ORG_AIR'])
    ...     ['CANCELLED']
    ...     .sum()
    ... )
    AIRLINE  ORG_AIR
    AA       ATL         3
             DEN         4
             DFW        86
             IAH         3
             LAS         3
                        ..
    WN       LAS         7
             LAX        32
             MSP         1
             PHX         6
             SFO        25
    Name: CANCELLED, Length: 114, dtype: int64
    
  3. Use the .unstack method to pivot the ORG_AIR index level to column names:
    >>> fpg = (flights
    ...     .groupby(['AIRLINE', 'ORG_AIR'])
    ...     ['CANCELLED']
    ...     .sum()
    ...     .unstack('ORG_AIR', fill_value=0)
    ... )
    >>> fpt.equals(fpg)
    True
    

How it works…

The .pivot_table method is very versatile and flexible, but it performs an operation rather similar to a groupby aggregation, with step 1 showing an example. The index parameter takes a column (or list of columns) that will not be pivoted and whose unique values will be placed in the index. The columns parameter takes a column (or list of columns) that will be pivoted and whose unique values will be made into column names. The values parameter takes a column (or list of columns) that will be aggregated.

There also exists an aggfunc parameter that takes an aggregating function (or list of functions) that determines how the columns in the values parameter get aggregated. It defaults to the string mean, and, in this example, we change it to calculate the sum. Additionally, some unique combinations of AIRLINE and ORG_AIR do not exist. These missing combinations will default to missing values in the resulting DataFrame. Here, we use the fill_value parameter to change them to zero.

Step 2 begins the replication process using all the columns in the index and columns parameter as the grouping columns. This is the key to making this recipe work. A pivot table is an intersection of all the unique combinations of the grouping columns. Step 3 finishes the replication by pivoting the innermost index level into column names with the .unstack method. Just like with .pivot_table, not all combinations of AIRLINE and ORG_AIR exist; we again use the fill_value parameter to force these missing intersections to zero.

There's more…

It is possible to replicate much more complex pivot tables with the .groupby method. For instance, take the following result from .pivot_table:

>>> flights.pivot_table(index=['AIRLINE', 'MONTH'],
...     columns=['ORG_AIR', 'CANCELLED'],
...     values=['DEP_DELAY', 'DIST'],
...     aggfunc=['sum', 'mean'],
...     fill_value=0)
                    sum     ...         mean
              DEP_DELAY     ...         DIST
ORG_AIR             ATL     ...          SFO
CANCELLED             0  1  ...            0       1
AIRLINE MONTH               ...
AA      1           -13  0  ...  1860.166667     0.0
        2           -39  0  ...  1337.916667  2586.0
        3            -2  0  ...  1502.758621     0.0
        4             1  0  ...  1646.903226     0.0
        5            52  0  ...  1436.892857     0.0
...                 ... ..  ...          ...     ...
WN      7          2604  0  ...   636.210526     0.0
        8          1718  0  ...   644.857143   392.0
        9          1033  0  ...   731.578947   354.5
        11          700  0  ...   580.875000   392.0
        12         1679  0  ...   782.256410     0.0

To replicate this with the .groupby method, follow the same pattern from the recipe, place all the columns from the index and columns parameters into the .groupby method, and then call .unstack to pull the index levels out to the columns:

>>> (flights
...     .groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED']) 
...     [['DEP_DELAY', 'DIST']]
...     .agg(['mean', 'sum']) 
...     .unstack(['ORG_AIR', 'CANCELLED'], fill_value=0) 
...     .swaplevel(0, 1, axis='columns')
... )
                    mean      ...      sum
               DEP_DELAY      ...     DIST
ORG_AIR              ATL      ...      SFO
CANCELLED              0   1  ...        0       1
AIRLINE MONTH                 ...
AA      1      -3.250000 NaN  ...  33483.0     NaN
        2      -3.000000 NaN  ...  32110.0  2586.0
        3      -0.166667 NaN  ...  43580.0     NaN
        4       0.071429 NaN  ...  51054.0     NaN
        5       5.777778 NaN  ...  40233.0     NaN
...                  ...  ..  ...      ...     ...
WN      7      21.700000 NaN  ...  24176.0     NaN
        8      16.207547 NaN  ...  18056.0   784.0
        9       8.680672 NaN  ...  27800.0   709.0
        11      5.932203 NaN  ...  23235.0   784.0
        12     15.691589 NaN  ...  30508.0     NaN

The order of the column levels differs, with .pivot_table putting the aggregation functions at a level preceding the columns in the values parameter. You can use the .swaplevel method to remedy this; it swaps the outermost column level (level 0) with the level below it (level 1). Also note that the order of the columns within each level is different.

Renaming axis levels for easy reshaping

Reshaping with the .stack and .unstack methods is far easier when each axis (both index and column) level has a name. pandas allows users to reference each axis level by integer location or by name. Since integer location is implicit and not explicit, you should consider using level names whenever possible. This advice follows from The Zen of Python (type import this if you are not familiar with it), a short list of guiding principles for Python, of which the second one is "Explicit is better than implicit."

When grouping or aggregating with multiple columns, the resulting pandas object will have multiple levels in one or both of the axes. In this recipe, we will name each level of each axis and then use the .stack and .unstack methods to reshape the data to the desired form.

How to do it…

  1. Read in the college dataset, and find a few basic summary statistics on the undergraduate population and SAT math scores by institution and religious affiliation:
    >>> college = pd.read_csv('data/college.csv')
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL']) 
    ...     [['UGDS', 'SATMTMID']]
    ...     .agg(['size', 'min', 'max'])
    ... )
                    UGDS                 SATMTMID
                    size    min      max     size    min    max
    STABBR RELAFFIL
    AK     0           7  109.0  12865.0        7    NaN    NaN
           1           3   27.0    275.0        3  503.0  503.0
    AL     0          72   12.0  29851.0       72  420.0  590.0
           1          24   13.0   3033.0       24  400.0  560.0
    AR     0          68   18.0  21405.0       68  427.0  565.0
    ...              ...    ...      ...      ...    ...    ...
    WI     0          87   20.0  29302.0       87  480.0  680.0
           1          25    4.0   8212.0       25  452.0  605.0
    WV     0          65   20.0  44924.0       65  430.0  530.0
           1           8   63.0   1375.0        8  455.0  510.0
    WY     0          11   52.0   9910.0       11  540.0  540.0
    
  2. Notice that both index levels have names, which are the old column names. The column levels, on the other hand, do not have names. Use the .rename_axis method to give them level names:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL']) 
    ...     [['UGDS', 'SATMTMID']]
    ...     .agg(['size', 'min', 'max'])
    ...     .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
    ... )
    AGG_COLS        UGDS                 SATMTMID
    AGG_FUNCS       size    min      max     size    min    max
    STABBR RELAFFIL
    AK     0           7  109.0  12865.0        7    NaN    NaN
           1           3   27.0    275.0        3  503.0  503.0
    AL     0          72   12.0  29851.0       72  420.0  590.0
           1          24   13.0   3033.0       24  400.0  560.0
    AR     0          68   18.0  21405.0       68  427.0  565.0
    ...              ...    ...      ...      ...    ...    ...
    WI     0          87   20.0  29302.0       87  480.0  680.0
           1          25    4.0   8212.0       25  452.0  605.0
    WV     0          65   20.0  44924.0       65  430.0  530.0
           1           8   63.0   1375.0        8  455.0  510.0
    WY     0          11   52.0   9910.0       11  540.0  540.0
    
  3. Now that each axis level has a name, reshaping is a breeze. Use the .stack method to move the AGG_FUNCS column to an index level:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL']) 
    ...     [['UGDS', 'SATMTMID']]
    ...     .agg(['size', 'min', 'max'])
    ...     .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
    ...     .stack('AGG_FUNCS')
    ... )
    AGG_COLS                      UGDS  SATMTMID
    STABBR RELAFFIL AGG_FUNCS
    AK     0        size           7.0       7.0
                    min          109.0       NaN
                    max        12865.0       NaN
           1        size           3.0       3.0
                    min           27.0     503.0
    ...                            ...       ...
    WV     1        min           63.0     455.0
                    max         1375.0     510.0
    WY     0        size          11.0      11.0
                    min           52.0     540.0
                    max         9910.0     540.0
    
  4. By default, stacking places the new column level in the innermost index position. Use the .swaplevel method to move AGG_FUNCS from the innermost level to the outer level:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL']) 
    ...     [['UGDS', 'SATMTMID']]
    ...     .agg(['size', 'min', 'max'])
    ...     .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
    ...     .stack('AGG_FUNCS')
    ...     .swaplevel('AGG_FUNCS', 'STABBR',
    ...        axis='index')
    ... )
    AGG_COLS                      UGDS  SATMTMID
    AGG_FUNCS RELAFFIL STABBR
    size      0        AK          7.0       7.0
    min       0        AK        109.0       NaN
    max       0        AK      12865.0       NaN
    size      1        AK          3.0       3.0
    min       1        AK         27.0     503.0
    ...                            ...       ...
                       WV         63.0     455.0
    max       1        WV       1375.0     510.0
    size      0        WY         11.0      11.0
    min       0        WY         52.0     540.0
    max       0        WY       9910.0     540.0
    
  5. We can continue to make use of the axis level names by sorting levels with the .sort_index method:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL']) 
    ...     [['UGDS', 'SATMTMID']]
    ...     .agg(['size', 'min', 'max'])
    ...     .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
    ...     .stack('AGG_FUNCS')
    ...     .swaplevel('AGG_FUNCS', 'STABBR', axis='index') 
    ...     .sort_index(level='RELAFFIL', axis='index') 
    ...     .sort_index(level='AGG_COLS', axis='columns')
    ... )
    AGG_COLS                   SATMTMID      UGDS
    AGG_FUNCS RELAFFIL STABBR
    max       0        AK           NaN   12865.0
                       AL         590.0   29851.0
                       AR         565.0   21405.0
                       AS           NaN    1276.0
                       AZ         580.0  151558.0
    ...                             ...       ...
    size      1        VI           1.0       1.0
                       VT           5.0       5.0
                       WA          17.0      17.0
                       WI          25.0      25.0
                       WV           8.0       8.0
    
  6. To completely reshape your data, you might need to stack some columns while unstacking others. Chain the two methods together:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL']) 
    ...     [['UGDS', 'SATMTMID']]
    ...     .agg(['size', 'min', 'max'])
    ...     .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
    ...     .stack('AGG_FUNCS')
    ...     .unstack(['RELAFFIL', 'STABBR'])
    ... )
    AGG_COLS      UGDS         ... SATMTMID
    RELAFFIL         0      1  ...        1      0
    STABBR          AK     AK  ...       WV     WY
    AGG_FUNCS                  ...
    size           7.0    3.0  ...      8.0   11.0
    min          109.0   27.0  ...    455.0  540.0
    max        12865.0  275.0  ...    510.0  540.0
    
  7. Stack all the columns at once to return a Series:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL']) 
    ...     [['UGDS', 'SATMTMID']]
    ...     .agg(['size', 'min', 'max'])
    ...     .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
    ...     .stack(['AGG_FUNCS', 'AGG_COLS'])
    ... )
    STABBR  RELAFFIL  AGG_FUNCS  AGG_COLS
    AK      0         size       UGDS            7.0
                                 SATMTMID        7.0
                      min        UGDS          109.0
                      max        UGDS        12865.0
            1         size       UGDS            3.0
                                              ...
    WY      0         size       SATMTMID       11.0
                      min        UGDS           52.0
                                 SATMTMID      540.0
                      max        UGDS         9910.0
                                 SATMTMID      540.0
    Length: 640, dtype: float64
    
  8. We can also unstack everything in the index. In this case, the result collapses into a Series whose MultiIndex combines all of the former column and index levels:
    >>> (college
    ...     .groupby(['STABBR', 'RELAFFIL']) 
    ...     [['UGDS', 'SATMTMID']]
    ...     .agg(['size', 'min', 'max'])
    ...     .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
    ...     .unstack(['STABBR', 'RELAFFIL']) 
    ... )
    AGG_COLS  AGG_FUNCS  STABBR  RELAFFIL
    UGDS      size       AK      0             7.0
                                 1             3.0
                         AL      0            72.0
                                 1            24.0
                         AR      0            68.0
                                             ...
    SATMTMID  max        WI      1           605.0
                         WV      0           530.0
                                 1           510.0
                         WY      0           540.0
                                 1             NaN
    Length: 708, dtype: float64
    

How it works…

It is common for the result of a call to the .groupby method to produce a DataFrame or Series with multiple axis levels. The resulting DataFrame from the groupby operation in step 1 has multiple levels for each axis. The column levels are not named, which would require us to reference them only by their integer location. To ease our ability to reference the column levels, we rename them with the .rename_axis method.

The .rename_axis method is a bit strange in that it can modify both the level names and the level values based on the type of the first argument passed to it. Passing it a list (or a scalar if there is only one level) changes the names of the levels. In step 2, we pass the .rename_axis method a list and are returned a DataFrame with all axis levels named.
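
As a minimal, self-contained sketch of the renaming behavior (the toy frame is invented for illustration, not taken from the college data):

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [[1, 2], [3, 4]],
...     columns=pd.MultiIndex.from_tuples(
...         [('UGDS', 'min'), ('UGDS', 'max')]))
>>> df.rename_axis(['AGG_COLS', 'AGG_FUNCS'],
...     axis='columns').columns.names
FrozenList(['AGG_COLS', 'AGG_FUNCS'])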

Once all the axis levels have names, we can control the structure of the data. Step 3 stacks the AGG_FUNCS column level into the innermost index level. The .swaplevel method in step 4 accepts the names or positions of the two levels that you want to swap as its first two arguments. In step 5, the .sort_index method is called twice to sort the values of each level. Notice that the values of the column level are the column names SATMTMID and UGDS.
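
A hedged toy illustration of .swaplevel followed by .sort_index (the level names outer and inner are invented for this sketch):

>>> idx = pd.MultiIndex.from_tuples(
...     [('b', 2), ('a', 1)], names=['outer', 'inner'])
>>> s = pd.Series([10, 20], index=idx)
>>> s.swaplevel('inner', 'outer').sort_index()
inner  outer
1      a        20
2      b        10
dtype: int64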

We can get vastly different output by both stacking and unstacking, as done in step 6. It is also possible to stack every column level or unstack every index level; either way, the result collapses into a Series, as the sketch below demonstrates.
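
The following sketch, again on an invented two-level frame, confirms that collapsing behavior:

>>> df = pd.DataFrame(
...     [[1, 2], [3, 4]],
...     columns=pd.MultiIndex.from_tuples(
...         [('UGDS', 'min'), ('UGDS', 'max')],
...         names=['AGG_COLS', 'AGG_FUNCS']))
>>> type(df.stack(['AGG_COLS', 'AGG_FUNCS']))
<class 'pandas.core.series.Series'>
>>> type(df.unstack(0))
<class 'pandas.core.series.Series'>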

There's more…

If you wish to dispose of the level names altogether, you may set them to None. You can do this to reduce visual clutter, or when it is obvious what the column levels represent and no further processing will take place:

>>> (college
...     .groupby(['STABBR', 'RELAFFIL']) 
...     [['UGDS', 'SATMTMID']]
...     .agg(['size', 'min', 'max'])
...     .rename_axis([None, None], axis='index') 
...     .rename_axis([None, None], axis='columns')
... )
         UGDS                 SATMTMID
         size    min      max     size    min    max
    AK 0    7  109.0  12865.0        7    NaN    NaN
       1    3   27.0    275.0        3  503.0  503.0
    AL 0   72   12.0  29851.0       72  420.0  590.0
       1   24   13.0   3033.0       24  400.0  560.0
    AR 0   68   18.0  21405.0       68  427.0  565.0
    ...   ...    ...      ...      ...    ...    ...
    WI 0   87   20.0  29302.0       87  480.0  680.0
       1   25    4.0   8212.0       25  452.0  605.0
    WV 0   65   20.0  44924.0       65  430.0  530.0
       1    8   63.0   1375.0        8  455.0  510.0
    WY 0   11   52.0   9910.0       11  540.0  540.0

Tidying when multiple variables are stored as column names

One particular flavor of messy data appears whenever the column names contain multiple different variables themselves. A common example of this scenario occurs when age and sex are concatenated together. To tidy datasets like this, we must manipulate the columns with the pandas .str attribute. This attribute contains additional methods for string processing.

In this recipe, we will first identify all the variables, of which some will be concatenated together as column names. We then reshape the data and parse the text to extract the correct variable values.

How to do it…

  1. Read in the men's weightlifting dataset, and identify the variables:
    >>> weightlifting = pd.read_csv('data/weightlifting_men.csv')
    >>> weightlifting
      Weight Category  M35 35-39  ...  M75 75-79  M80 80+
    0           56           137  ...         62       55
    1           62           152  ...         67       57
    2           69           167  ...         75       60
    3           77           182  ...         82       65
    4           85           192  ...         87       70
    5           94           202  ...         90       75
    6          105           210  ...         95       80
    7         105+           217  ...        100       85
    
  2. The variables are the Weight Category, a combination of sex and age, and the qualifying total. The age and sex variables have been concatenated together into a single cell. Before we can separate them, let's use the .melt method to transpose the age and sex column names into a single vertical column:
    >>> (weightlifting
    ...     .melt(id_vars='Weight Category',
    ...           var_name='sex_age',
    ...           value_name='Qual Total')
    ... )
       Weight Category    sex_age  Qual Total
    0            56     M35 35-39         137
    1            62     M35 35-39         152
    2            69     M35 35-39         167
    3            77     M35 35-39         182
    4            85     M35 35-39         192
    ..          ...           ...         ...
    75           77       M80 80+          65
    76           85       M80 80+          70
    77           94       M80 80+          75
    78          105       M80 80+          80
    79         105+       M80 80+          85
    
  3. Select the sex_age column, and use the .split method available from the .str attribute to split the column into two different columns:
    >>> (weightlifting
    ...     .melt(id_vars='Weight Category',
    ...           var_name='sex_age',
    ...           value_name='Qual Total')
    ...     ['sex_age']
    ...     .str.split(expand=True)
    ... )
          0      1
    0   M35  35-39
    1   M35  35-39
    2   M35  35-39
    3   M35  35-39
    4   M35  35-39
    ..  ...    ...
    75  M80    80+
    76  M80    80+
    77  M80    80+
    78  M80    80+
    79  M80    80+
    
  4. This operation returned a DataFrame with meaningless column names. Let's rename the columns:
    >>> (weightlifting
    ...     .melt(id_vars='Weight Category',
    ...           var_name='sex_age',
    ...           value_name='Qual Total')
    ...     ['sex_age']
    ...     .str.split(expand=True)
    ...     .rename(columns={0:'Sex', 1:'Age Group'})
    ... )
        Sex  Age Group
    0   M35      35-39
    1   M35      35-39
    2   M35      35-39
    3   M35      35-39
    4   M35      35-39
    ..  ...        ...
    75  M80        80+
    76  M80        80+
    77  M80        80+
    78  M80        80+
    79  M80        80+
    
  5. Create a Sex column using an index operation after the .str attribute to select the first character from the renamed Sex column:
    >>> (weightlifting
    ...     .melt(id_vars='Weight Category',
    ...           var_name='sex_age',
    ...           value_name='Qual Total')
    ...     ['sex_age']
    ...     .str.split(expand=True)
    ...     .rename(columns={0:'Sex', 1:'Age Group'})
    ...     .assign(Sex=lambda df_: df_.Sex.str[0])
    ... )
       Sex  Age Group
    0    M      35-39
    1    M      35-39
    2    M      35-39
    3    M      35-39
    4    M      35-39
    ..  ..        ...
    75   M        80+
    76   M        80+
    77   M        80+
    78   M        80+
    79   M        80+
    
  6. Use the pd.concat function to concatenate this DataFrame with the Weight Category and Qual Total columns:
    >>> melted = (weightlifting
    ...     .melt(id_vars='Weight Category',
    ...           var_name='sex_age',
    ...           value_name='Qual Total')
    ... )
    >>> tidy = pd.concat([melted
    ...            ['sex_age']
    ...            .str.split(expand=True)
    ...            .rename(columns={0:'Sex', 1:'Age Group'})
    ...            .assign(Sex=lambda df_: df_.Sex.str[0]),
    ...           melted[['Weight Category', 'Qual Total']]],
    ...           axis='columns'
    ... )
    >>> tidy
       Sex  Age Group Weight Category  Qual Total
    0    M      35-39           56            137
    1    M      35-39           62            152
    2    M      35-39           69            167
    3    M      35-39           77            182
    4    M      35-39           85            192
    ..  ..        ...          ...            ...
    75   M        80+           77             65
    76   M        80+           85             70
    77   M        80+           94             75
    78   M        80+          105             80
    79   M        80+         105+             85
    
  7. This same result could have been created with the following:
    >>> melted = (weightlifting
    ...     .melt(id_vars='Weight Category',
    ...           var_name='sex_age',
    ...           value_name='Qual Total')
    ... )
    >>> (melted
    ...     ['sex_age']
    ...     .str.split(expand=True)
    ...     .rename(columns={0:'Sex', 1:'Age Group'})
    ...     .assign(Sex=lambda df_: df_.Sex.str[0],
    ...             Category=melted['Weight Category'],
    ...             Total=melted['Qual Total'])
    ... )
       Sex  Age Group Category  Total
    0    M      35-39       56    137
    1    M      35-39       62    152
    2    M      35-39       69    167
    3    M      35-39       77    182
    4    M      35-39       85    192
    ..  ..        ...      ...    ...
    75   M        80+       77     65
    76   M        80+       85     70
    77   M        80+       94     75
    78   M        80+      105     80
    79   M        80+     105+     85
    

How it works…

The weightlifting dataset, like many datasets, has easily digestible information in its raw form. Still, technically it is messy, as all but one of the column names contain information for sex and age. Once the variables are identified, we can begin to tidy the dataset. Whenever column names contain variables, you will need to use the .melt (or .stack) method. The Weight Category variable is already in the correct position, so we keep it as an identifying variable by passing it to the id_vars parameter. Note that we don't explicitly need to name all the columns that we are melting with value_vars. By default, all the columns not present in id_vars get melted.
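
As a quick sanity check of that default, melting with an explicit value_vars list of every non-identifier column should produce the same result:

>>> explicit = weightlifting.melt(
...     id_vars='Weight Category',
...     value_vars=[c for c in weightlifting.columns
...                 if c != 'Weight Category'],
...     var_name='sex_age', value_name='Qual Total')
>>> implicit = weightlifting.melt(
...     id_vars='Weight Category',
...     var_name='sex_age', value_name='Qual Total')
>>> explicit.equals(implicit)
True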

The sex_age column needs to be parsed and split into two variables. For this, we turn to the extra functionality provided by the .str attribute, available only on a Series (a single DataFrame column) or a non-hierarchical Index. The .split method is one of the more common methods in this situation, as it can separate different parts of the string into their own columns. By default, it splits on whitespace, but you may also specify a string or regular expression with the pat parameter. When the expand parameter is set to True, a new column is formed for each segment of the split. When False, a single column is returned, with each cell containing a list of all the segments.
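
The difference between the two expand settings is easiest to see on a small toy Series:

>>> s = pd.Series(['M35 35-39', 'M80 80+'])
>>> s.str.split(expand=False)
0    [M35, 35-39]
1      [M80, 80+]
dtype: object
>>> s.str.split(expand=True)
     0      1
0  M35  35-39
1  M80    80+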

After renaming the columns in step 4, we need to use the .str attribute again. This attribute allows us to index or slice each value as if it were a string. Here, we select the first character, which is the variable for sex. We could go further and split the ages into two separate columns for minimum and maximum age, but it is common to refer to the entire age group in this manner, so we leave it as is.

Step 6 shows one of two different methods to join all the data together. The concat function accepts a collection of DataFrames and either concatenates them vertically (axis='index') or horizontally (axis='columns'). Because the two DataFrames are indexed identically, it is possible to assign the values of one DataFrame to new columns in the other, as done in step 7.
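
A minimal sketch of the horizontal case, using two invented, identically indexed frames:

>>> left = pd.DataFrame({'Sex': ['M', 'M']})
>>> right = pd.DataFrame({'Qual Total': [137, 152]})
>>> pd.concat([left, right], axis='columns')
  Sex  Qual Total
0   M         137
1   M         152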

There's more…

Another way to complete this recipe, beginning after step 2, is by assigning new columns from the sex_age column without using the .split method. The .assign method may be used to add these new columns dynamically:

>>> tidy2 = (weightlifting
...     .melt(id_vars='Weight Category',
...           var_name='sex_age',
...           value_name='Qual Total')
...     .assign(Sex=lambda df_:df_.sex_age.str[0],
...             **{'Age Group':(lambda df_: (df_
...                 .sex_age
...                 .str.extract(r'(\d{2}[-+](?:\d{2})?)',
...                              expand=False)))})
...     .drop(columns='sex_age')
... )
>>> tidy2
   Weight Category  Qual Total Sex Age Group
0            56            137   M     35-39
1            62            152   M     35-39
2            69            167   M     35-39
3            77            182   M     35-39
4            85            192   M     35-39
..          ...            ...  ..       ...
75           77             65   M       80+
76           85             70   M       80+
77           94             75   M       80+
78          105             80   M       80+
79         105+             85   M       80+
>>> tidy.sort_index(axis=1).equals(tidy2.sort_index(axis=1))
True

The Sex column is found in the same manner as done in step 5. Because we are not using .split, the Age Group column must be extracted in a different manner. The .extract method uses a regular expression to pull out very specific portions of the string. To use .extract correctly, your pattern must contain capture groups. A capture group is formed by enclosing parentheses around a portion of the pattern. In this example, the entire expression is one large capture group. It begins with \d{2}, which searches for exactly two digits, followed by either a literal plus or a dash, optionally followed by two more digits. Although the last part of the expression, (?:\d{2})?, is surrounded by parentheses, the ?: denotes that it is not a capture group. It is technically a non-capturing group, used here to mark the trailing two digits as optional. The sex_age column is no longer needed and is dropped.
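
A toy run of the same pattern shows the single capture group at work; the non-capturing suffix participates in the match but does not create a second column:

>>> pd.Series(['35-39', '80+']).str.extract(
...     r'(\d{2}[-+](?:\d{2})?)', expand=False)
0    35-39
1      80+
dtype: object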

Finally, the two tidy DataFrames are compared against one another and are found to be equivalent.

Tidying when multiple variables are stored as a single column

Tidy datasets must have a single column for each variable. Occasionally, multiple variable names are placed in a single column with their corresponding value placed in another.

In this recipe, we identify the column containing the improperly structured variables and pivot it to create tidy data.

How to do it…

  1. Read in the restaurant inspections dataset, and convert the Date column data type to datetime64:
    >>> inspections = pd.read_csv('data/restaurant_inspections.csv',
    ...     parse_dates=['Date'])
    >>> inspections
                                  Name  ...
    0                E & E Grill House  ...
    1                E & E Grill House  ...
    2                E & E Grill House  ...
    3                E & E Grill House  ...
    4                E & E Grill House  ...
    ..                             ...  ...
    495  PIER SIXTY ONE-THE LIGHTHOUSE  ...
    496  PIER SIXTY ONE-THE LIGHTHOUSE  ...
    497  PIER SIXTY ONE-THE LIGHTHOUSE  ...
    498  PIER SIXTY ONE-THE LIGHTHOUSE  ...
    499  PIER SIXTY ONE-THE LIGHTHOUSE  ...
    
  2. This dataset has two columns, Name and Date, that are each correctly contained in a single column. The Info column has five different variables: Borough, Cuisine, Description, Grade, and Score. Let's attempt to use the .pivot method to keep the Name and Date columns vertical, create new columns out of all the values in the Info column, and use the Value column as their intersection:
    >>> inspections.pivot(index=['Name', 'Date'],
    ...     columns='Info', values='Value')
    Traceback (most recent call last):
      ...
    NotImplementedError: > 1 ndim Categorical are not supported at this time
    
  3. Unfortunately, pandas developers have not implemented this functionality for us. Thankfully, for the most part, pandas has multiple ways of accomplishing the same task. Let's put Name, Date, and Info into the index:
    >>> inspections.set_index(['Name','Date', 'Info'])
                                               Value
    Name         Date       Info
    E & E Gri... 2017-08-08 Borough        MANHATTAN
                            Cuisine         American
                            Description  Non-food...
                            Grade                  A
                            Score                9.0
    ...                                          ...
    PIER SIXT... 2017-09-01 Borough        MANHATTAN
                            Cuisine         American
                            Description  Filth fl...
                            Grade                  Z
                            Score               33.0
    
  4. Use the .unstack method to pivot all the values in the Info column:
    >>> (inspections
    ...     .set_index(['Name','Date', 'Info']) 
    ...     .unstack('Info')
    ... )
                                   Value               ...
    Info                         Borough      Cuisine  ... Grade Score
    Name         Date                                  ...
    3 STAR JU... 2017-05-10     BROOKLYN  Juice, S...  ...     A  12.0
    A & L PIZ... 2017-08-22     BROOKLYN        Pizza  ...     A   9.0
    AKSARAY T... 2017-07-25     BROOKLYN      Turkish  ...     A  13.0
    ANTOJITOS... 2017-06-01     BROOKLYN  Latin (C...  ...     A  10.0
    BANGIA       2017-06-16    MANHATTAN       Korean  ...     A   9.0
    ...                              ...          ...  ...   ...   ...
    VALL'S PI... 2017-03-15  STATEN I...  Pizza/It...  ...     A   9.0
    VIP GRILL    2017-06-12     BROOKLYN  Jewish/K...  ...     A  10.0
    WAHIZZA      2017-04-13    MANHATTAN        Pizza  ...     A  10.0
    WANG MAND... 2017-08-29       QUEENS       Korean  ...     A  12.0
    XIAOYAN Y... 2017-08-29       QUEENS       Korean  ...     Z  49.0
    
  5. Make the index levels into columns with the .reset_index method:
    >>> (inspections
    ...     .set_index(['Name','Date', 'Info']) 
    ...     .unstack('Info')
    ...     .reset_index(col_level=-1)
    ... )
    .                             ... Value
    Info         Name       Date  ... Grade Score
    0     3 STAR J... 2017-05-10  ...     A  12.0
    1     A & L PI... 2017-08-22  ...     A   9.0
    2     AKSARAY ... 2017-07-25  ...     A  13.0
    3     ANTOJITO... 2017-06-01  ...     A  10.0
    4          BANGIA 2017-06-16  ...     A   9.0
    ..            ...        ...  ...   ...   ...
    95    VALL'S P... 2017-03-15  ...     A   9.0
    96      VIP GRILL 2017-06-12  ...     A  10.0
    97        WAHIZZA 2017-04-13  ...     A  10.0
    98    WANG MAN... 2017-08-29  ...     A  12.0
    99    XIAOYAN ... 2017-08-29  ...     Z  49.0
    
  6. The dataset is tidy, but there is some annoying leftover pandas debris that needs to be removed. Let's use the .droplevel method to remove the top column level and then set the remaining column level name to None:
    >>> (inspections
    ...     .set_index(['Name','Date', 'Info']) 
    ...     .unstack('Info')
    ...     .reset_index(col_level=-1)
    ...     .droplevel(0, axis=1)
    ...     .rename_axis(None, axis=1)
    ... )
               Name       Date  ... Grade Score
    0   3 STAR J... 2017-05-10  ...     A  12.0
    1   A & L PI... 2017-08-22  ...     A   9.0
    2   AKSARAY ... 2017-07-25  ...     A  13.0
    3   ANTOJITO... 2017-06-01  ...     A  10.0
    4        BANGIA 2017-06-16  ...     A   9.0
    ..          ...        ...  ...   ...   ...
    95  VALL'S P... 2017-03-15  ...     A   9.0
    96    VIP GRILL 2017-06-12  ...     A  10.0
    97      WAHIZZA 2017-04-13  ...     A  10.0
    98  WANG MAN... 2017-08-29  ...     A  12.0
    99  XIAOYAN ... 2017-08-29  ...     Z  49.0
    
  7. The creation of the column MultiIndex in step 4 could have been avoided by converting the single-column DataFrame from step 3 into a Series with the .squeeze method. The following code produces the same result as the previous step:
    >>> (inspections
    ...     .set_index(['Name','Date', 'Info']) 
    ...     .squeeze() 
    ...     .unstack('Info') 
    ...     .reset_index() 
    ...     .rename_axis(None, axis='columns')
    ... )
               Name       Date  ... Grade Score
    0   3 STAR J... 2017-05-10  ...     A  12.0
    1   A & L PI... 2017-08-22  ...     A   9.0
    2   AKSARAY ... 2017-07-25  ...     A  13.0
    3   ANTOJITO... 2017-06-01  ...     A  10.0
    4        BANGIA 2017-06-16  ...     A   9.0
    ..          ...        ...  ...   ...   ...
    95  VALL'S P... 2017-03-15  ...     A   9.0
    96    VIP GRILL 2017-06-12  ...     A  10.0
    97      WAHIZZA 2017-04-13  ...     A  10.0
    98  WANG MAN... 2017-08-29  ...     A  12.0
    99  XIAOYAN ... 2017-08-29  ...     Z  49.0
    

How it works…

In step 1, we notice that there are five variables placed vertically in the Info column with their corresponding value in the Value column. Because we need to pivot each of these five variables as horizontal column names, it would seem that the .pivot method would work. Unfortunately, pandas developers have yet to implement this special case when there is more than one non-pivoted column. We are forced to use a different method.

The .unstack method also pivots vertical data, but only for data in the index. Step 3 begins this process by moving both the columns that will and will not be pivoted into the index with the .set_index method. Once these columns are in the index, the .unstack method can be put to work, as done in step 4.

Notice that as we are unstacking a DataFrame, pandas keeps the original column names (here, it is just a single column, Value) and creates a MultiIndex with the old column names as the upper level. The dataset is now essentially tidy, but we go ahead and make our non-pivoted columns normal columns with the .reset_index method. Because we have MultiIndex columns, we can choose which level the new column names will belong to with the col_level parameter. By default, the names are inserted into the uppermost level (level 0). We use -1 to indicate the bottommost level.

After all this, we have some excess DataFrame names and index levels that need to be discarded. We use .droplevel to remove the useless top column level, and .rename_axis to set the columns' leftover .name attribute, Info, to None.

Cleaning up the MultiIndex columns could have been avoided by forcing the resulting DataFrame from step 3 to a Series. The .squeeze method works on single-column DataFrames and turns them into Series.
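
A quick sketch of that squeezing behavior on a throwaway single-column frame:

>>> single_col = pd.DataFrame({'Value': [1, 2, 3]})
>>> type(single_col.squeeze())
<class 'pandas.core.series.Series'>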

There's more…

It is possible to use the .pivot_table method, which has no restrictions on how many non-pivoted columns are allowed. The .pivot_table method differs from .pivot by performing an aggregation for all the values that correspond to the intersection between the columns in the index and columns parameters.

Because there may be multiple values in this intersection, .pivot_table requires the user to pass it an aggregating function that outputs a single value. We use the 'first' aggregation function, which takes the first value in each group. In this particular example, there is exactly one value for each intersection, so there is nothing to aggregate. The default aggregation function is the mean, which would produce an error here, since some of the values are strings:

>>> (inspections
...     .pivot_table(index=['Name', 'Date'],
...                  columns='Info',
...                  values='Value',
...                  aggfunc='first') 
...     .reset_index() 
...     .rename_axis(None, axis='columns')
... )
           Name       Date  ... Grade Score
0   3 STAR J... 2017-05-10  ...     A  12.0
1   A & L PI... 2017-08-22  ...     A   9.0
2   AKSARAY ... 2017-07-25  ...     A  13.0
3   ANTOJITO... 2017-06-01  ...     A  10.0
4        BANGIA 2017-06-16  ...     A   9.0
..          ...        ...  ...   ...   ...
95  VALL'S P... 2017-03-15  ...     A   9.0
96    VIP GRILL 2017-06-12  ...     A  10.0
97      WAHIZZA 2017-04-13  ...     A  10.0
98  WANG MAN... 2017-08-29  ...     A  12.0
99  XIAOYAN ... 2017-08-29  ...     Z  49.0
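
To check that claim without relying on a specific error message (the exact exception type varies across pandas versions), we can catch whatever is raised and print its name:

>>> try:
...     inspections.pivot_table(index=['Name', 'Date'],
...                             columns='Info',
...                             values='Value')
... except Exception as err:
...     print(type(err).__name__)  # exact type varies by pandas version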

Tidying when two or more values are stored in the same cell

Tabular data, by nature, is two-dimensional, and thus, there is a limited amount of information that can be presented in a single cell. As a workaround, you will occasionally see datasets with more than a single value stored in the same cell. Tidy data allows for just a single value for each cell. To rectify these situations, you will typically need to parse the string data into multiple columns with the methods from the .str attribute.

In this recipe, we examine a dataset that has a column containing multiple different variables in each cell. We use the .str attribute to parse these strings into separate columns to tidy the data.

How to do it...

  1. Read in the Texas cities dataset:
    >>> cities = pd.read_csv('data/texas_cities.csv')
    >>> cities
          City             Geolocation
    0  Houston  29.7604° N, 95.3698° W
    1   Dallas  32.7767° N, 96.7970° W
    2   Austin  30.2672° N, 97.7431° W
    
  2. The City column looks good and contains exactly one value. The Geolocation column, on the other hand, contains four variables: latitude, latitude direction, longitude, and longitude direction. Let's split the Geolocation column into four separate columns. We will use the regular expression that matches any character followed by a space:
    >>> geolocations = cities.Geolocation.str.split(pat='. ',
    ...     expand=True)
    >>> geolocations.columns = ['latitude', 'latitude direction',
    ...     'longitude', 'longitude direction']
    
  3. Because the original data type for the Geolocation was an object, all the new columns are also objects. Let's change latitude and longitude into float types:
    >>> geolocations = geolocations.astype({'latitude':'float',
    ...    'longitude':'float'})
    >>> geolocations.dtypes
    latitude               float64
    latitude direction      object
    longitude              float64
    longitude direction     object
    dtype: object
    
  4. Combine these new columns with the City column from the original:
    >>> (geolocations
    ...     .assign(city=cities['City'])
    ... )
       latitude latitude direction  ...  longitude direction     city
    0   29.7604            N        ...            W          Houston
    1   32.7767            N        ...            W           Dallas
    2   30.2672            N        ...            W           Austin
    

How it works…

After reading the data, we decide how many variables there are in the dataset. Here, we chose to split the Geolocation column into four variables, but we could have just chosen two for latitude and longitude and used a negative sign to differentiate between west and east and south and north.
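
As a hedged sketch of that two-variable alternative, we can fold each direction into the sign of its coordinate (all three cities lie north and west, so every longitude becomes negative):

>>> (geolocations
...     .assign(latitude=lambda df_: df_['latitude']
...                 .where(df_['latitude direction'].eq('N'),
...                        -df_['latitude']),
...             longitude=lambda df_: df_['longitude']
...                 .where(df_['longitude direction'].eq('E'),
...                        -df_['longitude']))
...     [['latitude', 'longitude']]
... )
   latitude  longitude
0   29.7604   -95.3698
1   32.7767   -96.7970
2   30.2672   -97.7431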

There are a few ways to parse the Geolocation column with the methods from the .str attribute. The easiest way is to use the .split method. We pass it a regular expression composed of any single character (the regex dot) followed by a space. When a space follows any character, a split is made, and a new column is formed. The first occurrence of this pattern takes place at the end of the latitude: a space follows the degree symbol, and a split is made there. The splitting characters are discarded and not kept in the resulting columns. The next split matches the comma and space directly following the latitude direction.

A total of three splits are made, resulting in four columns. The second line in step 2 provides them with meaningful names. Even though the resulting latitude and longitude columns appear to be float types, they are not. They were originally parsed from an object column and therefore remain object data types. Step 3 uses a dictionary to map the column names to their new types.

Instead of using a dictionary, which would require a lot of typing if you had many column names, you can use the pd.to_numeric function to attempt to convert each column to either integer or float. Passing errors='ignore' leaves any column that cannot be converted unchanged. To apply this function iteratively over each column, use the .apply method:

>>> geolocations.apply(pd.to_numeric, errors='ignore')
   latitude latitude direction  longitude longitude direction
0   29.7604                  N    95.3698                   W
1   32.7767                  N    96.7970                   W
2   30.2672                  N    97.7431                   W

Step 4 concatenates the city to the DataFrame to complete the process of making tidy data.

There's more…

The .split method worked well in this example with a regular expression. For other examples, some columns might require you to create splits on several different patterns. To split on any of multiple patterns, separate them with the pipe character (|). For instance, if we wanted to split only on the degree symbol and the comma, each followed by a space, we would do the following:

>>> cities.Geolocation.str.split(pat=r'° |, ', expand=True)
         0  1        2  3
0  29.7604  N  95.3698  W
1  32.7767  N  96.7970  W
2  30.2672  N  97.7431  W

This returns the same DataFrame from step 2. Any number of additional split patterns may be appended to the preceding string pattern with the pipe character.

The .extract method is another method that allows you to pull specific groups out of each cell. These capture groups must be enclosed in parentheses. Anything that matches outside the parentheses is not present in the result. The following line produces the same output as step 2:

>>> cities.Geolocation.str.extract(r'([0-9.]+). (N|S), ([0-9.]+). (E|W)',
...    expand=True)
         0  1        2  3
0  29.7604  N  95.3698  W
1  32.7767  N  96.7970  W
2  30.2672  N  97.7431  W

This regular expression has four capture groups. The first and third groups search for one or more consecutive digits or decimal points. The second and fourth groups search for a single direction character. The first and second capture groups (and likewise the third and fourth) are separated by any character followed by a space; the second and third are separated by a comma and then a space.

Tidying when variables are stored in column names and values

One particularly difficult form of messy data to diagnose appears whenever variables are stored both horizontally across the column names and vertically down column values. This type of dataset is usually found not in a database but in a summarized report that someone else has already generated.

In this recipe, data is reshaped into tidy form with the .melt and .pivot_table methods.

How to do it…

  1. Read in the sensors dataset:
    >>> sensors = pd.read_csv('data/sensors.csv')
    >>> sensors
      Group     Property  2012  2013  2014  2015  2016
    0     A     Pressure   928   873   814   973   870
    1     A  Temperature  1026  1038  1009  1036  1042
    2     A         Flow   819   806   861   882   856
    3     B     Pressure   817   877   914   806   942
    4     B  Temperature  1008  1041  1009  1002  1013
    5     B         Flow   887   899   837   824   873
    
  2. The only variable placed correctly in a vertical column is Group. The Property column appears to have three unique variables: Pressure, Temperature, and Flow. The remaining columns, 2012 through 2016, are all values of a single variable, which we can sensibly name Year. It isn't possible to restructure this kind of messy data with a single DataFrame method. Let's begin with the .melt method to pivot the years into their own column:
    >>> sensors.melt(id_vars=['Group', 'Property'], var_name='Year')
       Group     Property  Year  value
    0      A     Pressure  2012    928
    1      A  Temperature  2012   1026
    2      A         Flow  2012    819
    3      B     Pressure  2012    817
    4      B  Temperature  2012   1008
    ..   ...          ...   ...    ...
    25     A  Temperature  2016   1042
    26     A         Flow  2016    856
    27     B     Pressure  2016    942
    28     B  Temperature  2016   1013
    29     B         Flow  2016    873
    
  3. This takes care of one of our issues. Let's use the .pivot_table method to pivot the Property column into new column names:
    >>> (sensors
    ...     .melt(id_vars=['Group', 'Property'], var_name='Year') 
    ...     .pivot_table(index=['Group', 'Year'],
    ...                  columns='Property', values='value') 
    ...     .reset_index() 
    ...     .rename_axis(None, axis='columns')
    ... )
      Group  Year  Flow  Pressure  Temperature
    0     A  2012   819       928         1026
    1     A  2013   806       873         1038
    2     A  2014   861       814         1009
    3     A  2015   882       973         1036
    4     A  2016   856       870         1042
    5     B  2012   887       817         1008
    6     B  2013   899       877         1041
    7     B  2014   837       914         1009
    8     B  2015   824       806         1002
    9     B  2016   873       942         1013
    

How it works…

Once we have identified the variables in step 1, we can begin our restructuring. pandas does not have a single method that pivots in both directions at once, so we must take on this task one step at a time. We melt the year columns while keeping the Group and Property columns vertical by passing them to the id_vars parameter of the .melt method.

The result is now the pattern of messy data found in the recipe before last. As explained in the There's more... section of that recipe, we must use .pivot_table to pivot a DataFrame when using more than one column in the index parameter. After pivoting, the Group and Year variables are stuck in the index. We push them back out as columns with .reset_index. The .pivot_table method preserves the column name used in the columns parameter as the name of the column index. After resetting the index, this name is meaningless, and we remove it with .rename_axis.

There's more…

Whenever a solution involves .melt, .pivot_table, or .pivot, you can be sure that there is an alternative method using .stack and .unstack. The trick is first to move the columns that are not currently being pivoted into the index:

>>> (sensors
...     .set_index(['Group', 'Property']) 
...     .rename_axis('Year', axis='columns') 
...     .stack() 
...     .unstack('Property') 
...     .rename_axis(None, axis='columns') 
...     .reset_index()
... )
  Group  Year  Flow  Pressure  Temperature
0     A  2012   819       928         1026
1     A  2013   806       873         1038
2     A  2014   861       814         1009
3     A  2015   882       973         1036
4     A  2016   856       870         1042
5     B  2012   887       817         1008
6     B  2013   899       877         1041
7     B  2014   837       914         1009
8     B  2015   824       806         1002
9     B  2016   873       942         1013