None of the datasets used in the preceding chapters required much, if any, work to change their structure; we immediately began processing them in their original shape. Many datasets in the wild need a significant amount of restructuring before a more detailed analysis can commence. In some cases, an entire project might concern itself only with formatting the data in such a way that it can be easily processed by someone else.
There are many terms that are used to describe the process of data restructuring, with tidy data being the most common to data scientists. Tidy data is a term coined by Hadley Wickham to describe a form of data that makes analysis easy to do. This chapter will cover many ideas formulated by Hadley and how to accomplish them with pandas. To learn a great deal more about tidy data, read Hadley's paper (http://vita.had.co.nz/papers/tidy-data.pdf).
The following is an example of untidy data:
Name | Category | Value
Jill | Bank | 2,300
Jill | Color | Red
John | Bank | 1,100
Jill | Age | 40
John | Color | Purple
The following is an example of tidy data:
Name | Age | Bank | Color
Jill | 40 | 2,300 | Red
John | 38 | 1,100 | Purple
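The reshaping between the two tables above can be sketched with a single .pivot call. The DataFrame below is constructed inline to mirror the messy table (note that John's age never appears in the messy example, so it comes out missing):

```python
import pandas as pd

# Recreate the messy table: one row per (Name, Category) pair
messy = pd.DataFrame({
    'Name': ['Jill', 'Jill', 'John', 'Jill', 'John'],
    'Category': ['Bank', 'Color', 'Bank', 'Age', 'Color'],
    'Value': ['2,300', 'Red', '1,100', '40', 'Purple'],
})

# Pivot the Category values into columns to get the tidy layout;
# combinations absent from the messy table become NaN
tidy = messy.pivot(index='Name', columns='Category', values='Value')
print(tidy)
```

This is only a sketch of the end state; the rest of the chapter builds up the tools to perform such reshaping in general.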
What is tidy data? Hadley puts forth three guiding principles that determine whether a dataset is tidy:

- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table

Any dataset that does not meet these guidelines is considered messy. This definition will make more sense once we start restructuring our data into tidy form, but for now, we'll need to know what variables, observations, and observational units are.
Using this jargon, a variable does not refer to a Python variable; it is a piece of data. It is good to think about the distinction between a variable name and a variable value. Variable names are labels, such as gender, race, salary, and position. Variable values are the things liable to change for every observation, such as male, female, or other for gender.
A single observation is the collection of all variable values for a single observational unit. To help understand what an observational unit might be, consider a retail store, which has data for each transaction, employee, customer, item, and the store itself. Each of these can be viewed as an observational unit and would require its own table. Combining employee information (like the number of hours worked) with customer information (like the amount spent) in the same table would break this tidy principle.
The first step to resolving messy data is to recognize it when it exists, and there are boundless possibilities. Hadley explicitly mentions five of the most common types of messy data:

- Column headers are values, not variable names
- Multiple variables are stored in column names
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in the same table
- A single observational unit is stored in multiple tables
It is important to understand that tidying data does not typically involve changing the values of your dataset, filling in missing values, or doing any sort of analysis. Tidying data consists of changing the shape or structure of the data to meet the tidy principles. Tidy data is akin to having all your tools in the toolbox instead of scattered randomly throughout your house. Having the tools properly in the toolbox allows all other tasks to be completed easily. Once the data is in the correct form, it becomes much easier to perform further analysis.
Once you have spotted messy data, you will use the pandas library to restructure it so that it is tidy. The main tidying tools that pandas makes available are the DataFrame methods .stack, .melt, .unstack, and .pivot. More complex tidying involves ripping apart text, which necessitates the .str accessor. Other helper methods, such as .rename, .rename_axis, .reset_index, and .set_index, will help with applying the final touches to tidy data.
To help understand the differences between tidy and messy data, let's take a look at a table that may or may not be in tidy form:
>>> import pandas as pd
>>> import numpy as np
>>> state_fruit = pd.read_csv('data/state_fruit.csv', index_col=0)
>>> state_fruit
Apple Orange Banana
Texas 12 10 40
Arizona 9 7 12
Florida 0 14 190
There does not appear to be anything messy about this table, and the information is easily consumable. However, according to the tidy principles, it isn't tidy. Each column name is the value of a variable. In fact, none of the variable names are even present in the DataFrame. One of the first steps to transform a messy dataset into tidy data is to identify all of the variables. In this particular dataset, we have variables for state and fruit. There's also the numeric data that wasn't identified anywhere in the context of the problem. We can label this variable as weight or any other sensible name.
This particular messy dataset contains variable values as column names. We will need to transpose these column names into column values. In this recipe, we use the .stack method to restructure our DataFrame into tidy form.
The .stack method takes all of the column names and pivots them into the index. Typically, when you call the .stack method, the data becomes taller:

>>> state_fruit.stack()
Texas Apple 12
Orange 10
Banana 40
Arizona Apple 9
Orange 7
Banana 12
Florida Apple 0
Orange 14
Banana 190
dtype: int64
Use the .reset_index method to turn the result into a DataFrame:
>>> (state_fruit
... .stack()
... .reset_index()
... )
level_0 level_1 0
0 Texas Apple 12
1 Texas Orange 10
2 Texas Banana 40
3 Arizona Apple 9
4 Arizona Orange 7
5 Arizona Banana 12
6 Florida Apple 0
7 Florida Orange 14
8 Florida Banana 190
Use the .rename method to name the columns appropriately:
>>> (state_fruit
... .stack()
... .reset_index()
... .rename(columns={'level_0':'state',
... 'level_1': 'fruit', 0: 'weight'})
... )
state fruit weight
0 Texas Apple 12
1 Texas Orange 10
2 Texas Banana 40
3 Arizona Apple 9
4 Arizona Orange 7
5 Arizona Banana 12
6 Florida Apple 0
7 Florida Orange 14
8 Florida Banana 190
Instead of using the .rename method, it is possible to use the lesser-known Series method .rename_axis to set the names of the index levels before calling .reset_index:
>>> (state_fruit
... .stack()
... .rename_axis(['state', 'fruit'])
... )
state fruit
Texas Apple 12
Orange 10
Banana 40
Arizona Apple 9
Orange 7
Banana 12
Florida Apple 0
Orange 14
Banana 190
dtype: int64
Then use the .reset_index method with the name parameter to reproduce the output from step 3:
>>> (state_fruit
... .stack()
... .rename_axis(['state', 'fruit'])
... .reset_index(name='weight')
... )
state fruit weight
0 Texas Apple 12
1 Texas Orange 10
2 Texas Banana 40
3 Arizona Apple 9
4 Arizona Orange 7
5 Arizona Banana 12
6 Florida Apple 0
7 Florida Orange 14
8 Florida Banana 190
The .stack method is powerful, and it takes time to understand and appreciate fully. By default, it takes the column names (the innermost level, in the case of hierarchical columns) and transposes them so that they become the new innermost index level. Notice how each old column name still labels its original value by being paired with each state. There were nine original values in a 3 x 3 DataFrame, which got transformed into a single Series with the same number of values. The original first row of data became the first three values in the resulting Series.
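The "innermost level" behavior can be sketched on a toy DataFrame with hierarchical columns (invented for illustration; this shows the default level=-1):

```python
import pandas as pd

# Toy DataFrame with two column levels: year (outer) and statistic (inner)
cols = pd.MultiIndex.from_product([['2019', '2020'], ['min', 'max']])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=['a', 'b'], columns=cols)

# .stack() moves only the innermost column level ('min'/'max')
# into the index, leaving the years as columns
stacked = df.stack()
```

Only the inner level becomes part of the index; the outer level survives as the remaining column labels.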
After resetting the index in step 2, pandas defaults our DataFrame columns to level_0, level_1, and 0 (two strings and one integer). This is because the Series calling this method has two index levels that were formerly unnamed. pandas also refers to index levels by integer, beginning from zero on the outside.
Step 3 shows an intuitive way to rename the columns with the .rename method.

Alternatively, it is possible to set the column names by chaining the .rename_axis method, which takes a list of values to use as the index level names. pandas uses these index level names as the new column names when the index is reset. Additionally, the .reset_index method has a name parameter corresponding to the new column name of the Series values.
All Series have a name attribute that can be assigned or changed with the .rename method. It is this attribute that becomes the column name when using .reset_index.
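A minimal sketch of this behavior (the names key and score are invented for illustration):

```python
import pandas as pd

s = pd.Series([1.5, 2.5], index=pd.Index(['x', 'y'], name='key'))

# .rename with a scalar sets the Series .name attribute,
# which .reset_index then uses as the new column name
df = s.rename('score').reset_index()
print(list(df.columns))  # ['key', 'score']
```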
One of the keys to using .stack is to place all of the columns that you do not wish to transform in the index. The dataset in this recipe was initially read with the states in the index. Let's take a look at what would have happened if we had not read the states into the index:
>>> state_fruit2 = pd.read_csv('data/state_fruit2.csv')
>>> state_fruit2
State Apple Orange Banana
0 Texas 12 10 40
1 Arizona 9 7 12
2 Florida 0 14 190
As the state names are not in the index, using .stack on this DataFrame reshapes all values into one long Series:
>>> state_fruit2.stack()
0 State Texas
Apple 12
Orange 10
Banana 40
1 State Arizona
...
Banana 12
2 State Florida
Apple 0
Orange 14
Banana 190
Length: 12, dtype: object
This command reshapes all the columns, this time including the states, and is not at all what we need. To reshape this data correctly, you will need to put all the non-reshaped columns into the index first with the .set_index method, and then use .stack. The following code gives a similar result to step 1:
>>> state_fruit2.set_index('State').stack()
State
Texas Apple 12
Orange 10
Banana 40
Arizona Apple 9
Orange 7
Banana 12
Florida Apple 0
Orange 14
Banana 190
dtype: int64
Like most large Python libraries, pandas has many different ways to accomplish the same task, the differences usually being readability and performance. A DataFrame has a method named .melt that is similar to the .stack method described in the previous recipe but gives a bit more flexibility.

In this recipe, we use the .melt method to tidy a DataFrame with variable values as column names.
Read in the state_fruit2.csv dataset:
>>> state_fruit2 = pd.read_csv('data/state_fruit2.csv')
>>> state_fruit2
State Apple Orange Banana
0 Texas 12 10 40
1 Arizona 9 7 12
2 Florida 0 14 190
Use the .melt method by passing the appropriate columns to the id_vars and value_vars parameters:
>>> state_fruit2.melt(id_vars=['State'],
... value_vars=['Apple', 'Orange', 'Banana'])
State variable value
0 Texas Apple 12
1 Arizona Apple 9
2 Florida Apple 0
3 Texas Orange 10
4 Arizona Orange 7
5 Florida Orange 14
6 Texas Banana 40
7 Arizona Banana 12
8 Florida Banana 190
.melt refers to the transformed column names as variables and the corresponding values as values. Conveniently, .melt has two additional parameters, var_name and value_name, that give you the ability to rename these two columns:
>>> state_fruit2.melt(id_vars=['State'],
... value_vars=['Apple', 'Orange', 'Banana'],
... var_name='Fruit',
... value_name='Weight')
State Fruit Weight
0 Texas Apple 12
1 Arizona Apple 9
2 Florida Apple 0
3 Texas Orange 10
4 Arizona Orange 7
5 Florida Orange 14
6 Texas Banana 40
7 Arizona Banana 12
8 Florida Banana 190
The .melt method reshapes your DataFrame. It takes up to five parameters, with two of them being crucial to understanding how to reshape your data correctly:

- id_vars is a list of column names that you want to preserve as columns and not reshape
- value_vars is a list of column names that you want to reshape into a single column

The id_vars, or the identification variables, remain in the same column but repeat for each of the columns passed to value_vars. One crucial aspect of .melt is that it ignores values in the index; it silently drops your index and replaces it with a default RangeIndex. This means that if you do have values in your index that you would like to keep, you will need to reset the index before using .melt.
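A quick sketch of that pitfall on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Apple': [12, 9], 'Orange': [10, 7]},
                  index=pd.Index(['Texas', 'Arizona'], name='State'))

# .melt silently discards the index, so the state names vanish...
lost = df.melt()

# ...unless the index is moved into a column first
kept = df.reset_index().melt(id_vars='State')
```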
All the parameters for the .melt method are optional, and if you desire all your values to be in a single column and their old column labels in another, you may call .melt with the default parameters:
>>> state_fruit2.melt()
variable value
0 State Texas
1 State Arizona
2 State Florida
3 Apple 12
4 Apple 9
.. ... ...
7 Orange 7
8 Orange 14
9 Banana 40
10 Banana 12
11 Banana 190
More realistically, you might have lots of variables that need melting and would like to specify only the identification variables. In that case, calling .melt in the following manner will yield the same result as in step 2. You don't even need a list when melting a single column; you can pass its string value:
>>> state_fruit2.melt(id_vars='State')
State variable value
0 Texas Apple 12
1 Arizona Apple 9
2 Florida Apple 0
3 Texas Orange 10
4 Arizona Orange 7
5 Florida Orange 14
6 Texas Banana 40
7 Arizona Banana 12
8 Florida Banana 190
Some datasets contain multiple groups of variables as column names that need to be stacked simultaneously into their own columns. An example involving the movie dataset can help clarify this. Let's begin by selecting all columns containing the actor names and their corresponding Facebook likes:
>>> movie = pd.read_csv('data/movie.csv')
>>> actor = movie[['movie_title', 'actor_1_name',
... 'actor_2_name', 'actor_3_name',
... 'actor_1_facebook_likes',
... 'actor_2_facebook_likes',
... 'actor_3_facebook_likes']]
>>> actor.head()
movie_title ...
0 Avatar ...
1 Pirates of the Caribbean: At World's End ...
2 Spectre ...
3 The Dark Knight Rises ...
4 Star Wars: Episode VII - The Force Awakens ...
If we define our variables as the title of the movie, the actor name, and the number of Facebook likes, then we will need to stack two sets of columns, which is not possible using a single call to .stack or .melt.
In this recipe, we will tidy our actor DataFrame by simultaneously stacking the actor names and their corresponding Facebook likes with the wide_to_long function.
We will use the wide_to_long function to reshape our data into tidy form. To use this function, we will need to change the column names that we are stacking so that they end with a digit. We first create a user-defined function to change the column names:
>>> def change_col_name(col_name):
... col_name = col_name.replace('_name', '')
... if 'facebook' in col_name:
... fb_idx = col_name.find('facebook')
... col_name = (col_name[:5] + col_name[fb_idx - 1:]
... + col_name[5:fb_idx-1])
... return col_name
Use the .rename method to transform all the column names:
>>> actor2 = actor.rename(columns=change_col_name)
>>> actor2
movie_title actor_1 ... actor_facebook_likes_2
0 Avatar CCH Pounder ... 936.0
1 Pirates ... Johnny Depp ... 5000.0
2 Spectre Christop... ... 393.0
3 The Dark... Tom Hardy ... 23000.0
4 Star War... Doug Walker ... 12.0
... ... ... ... ...
4911 Signed S... Eric Mabius ... 470.0
4912 The Foll... Natalie Zea ... 593.0
4913 A Plague... Eva Boehnke ... 0.0
4914 Shanghai... Alan Ruck ... 719.0
4915 My Date ... John August ... 23.0
Use the wide_to_long function to stack the actor and Facebook sets of columns simultaneously:
>>> stubs = ['actor', 'actor_facebook_likes']
>>> actor2_tidy = pd.wide_to_long(actor2,
... stubnames=stubs,
... i=['movie_title'],
... j='actor_num',
... sep='_')
>>> actor2_tidy.head()
actor actor_facebook_likes
movie_title actor_num
Avatar 1 CCH Pounder 1000.0
Pirates o... 1 Johnny Depp 40000.0
Spectre 1 Christop... 11000.0
The Dark ... 1 Tom Hardy 27000.0
Star Wars... 1 Doug Walker 131.0
The wide_to_long function works in a fairly specific manner. Its main parameter is stubnames, a list of strings. Each string represents a single column grouping. All columns that begin with this string will be stacked into a single column. In this recipe, there are two groups of columns: actor and actor_facebook_likes. By default, each of these groups of columns needs to end in a digit. This digit will subsequently be used to label the reshaped data. Each of these column groups has an underscore character separating the stubname from the ending digit. To account for this, you must use the sep parameter.
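The mechanics can be sketched on a tiny invented DataFrame with a single stubname:

```python
import pandas as pd

df = pd.DataFrame({'id': ['x', 'y'],
                   'score_1': [1, 2],
                   'score_2': [3, 4]})

# Columns beginning with the stubname 'score' are stacked; the digit
# after sep='_' becomes the values of the new index level named 'num'
tidy = pd.wide_to_long(df, stubnames='score', i='id', j='num', sep='_')
```

The result has a MultiIndex of (id, num) pairs and a single score column.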
The original column names do not match the pattern needed for wide_to_long to work. The column names could have been changed manually by specifying their values with a list, but that could quickly become a lot of typing, so instead we define a function that automatically converts our columns to a format that works. The change_col_name function removes _name from the actor columns and rearranges the facebook columns so that they both end in digits.
To accomplish the column renaming, we use the .rename method in step 2. It accepts many different types of arguments, one of which is a function. When passing it a function, every column name gets implicitly passed to it one at a time.
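Any callable works here; a minimal sketch with a lambda (column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'First Name': ['Jill'], 'Last Name': ['Smith']})

# Each column name is passed to the function one at a time
df2 = df.rename(columns=lambda c: c.lower().replace(' ', '_'))
print(list(df2.columns))  # ['first_name', 'last_name']
```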
We have now correctly created the two groups of columns that will be stacked, those beginning with actor and actor_facebook_likes. In addition, wide_to_long requires a unique column, parameter i, to act as an identification variable that will not be stacked. Also required is the parameter j, which renames the identifying digit stripped from the end of the original column names. By default, the suffix parameter contains the regular expression r'\d+', which searches for one or more digits. The \d is a special token that matches the digits 0-9, and the plus sign, +, makes the expression match one or more of these digits.
The wide_to_long function works when all groupings of variables have the same numeric ending, as they did in this recipe. When your variables do not have the same ending or don't end in a digit, you can still use wide_to_long to do simultaneous column stacking. For instance, let's take a look at the following dataset:
>>> df = pd.read_csv('data/stackme.csv')
>>> df
State Country a1 b2 Test d e
0 TX US 0.45 0.3 Test1 2 6
1 MA US 0.03 1.2 Test2 9 7
2 ON CAN 0.70 4.2 Test3 4 2
Let's say we wanted columns a1 and b2 stacked together, as well as columns d and e. Additionally, we wanted to use a1 and b2 as labels for the rows. To accomplish this task, we would need to rename the columns so that they ended in the labels we desired:
>>> df.rename(columns = {'a1':'group1_a1', 'b2':'group1_b2',
... 'd':'group2_a1', 'e':'group2_b2'})
State Country ... group2_a1 group2_b2
0 TX US ... 2 6
1 MA US ... 9 7
2 ON CAN ... 4 2
We would then need to modify the suffix parameter, which normally defaults to a regular expression that selects digits. Here, we tell it to find any number of characters:
>>> pd.wide_to_long(
... df.rename(columns = {'a1':'group1_a1',
... 'b2':'group1_b2',
... 'd':'group2_a1', 'e':'group2_b2'}),
... stubnames=['group1', 'group2'],
... i=['State', 'Country', 'Test'],
... j='Label',
... suffix='.+',
... sep='_')
group1 group2
State Country Test Label
TX US Test1 a1 0.45 2
b2 0.30 6
MA US Test2 a1 0.03 9
b2 1.20 7
ON CAN Test3 a1 0.70 4
b2 4.20 2
DataFrames have two similar methods, .stack and .melt, to convert horizontal column names into vertical column values. DataFrames can invert these two operations with the .unstack and .pivot methods, respectively. .stack and .unstack allow control over only the column and row indexes, while .melt and .pivot give more flexibility to choose which columns are reshaped.

In this recipe, we will call .stack and .melt on a dataset and promptly invert the operation with the .unstack and .pivot methods.
Read in the college dataset, keeping the institution name in the index along with the undergraduate race columns:
>>> def usecol_func(name):
... return 'UGDS_' in name or name == 'INSTNM'
>>> college = pd.read_csv('data/college.csv',
... index_col='INSTNM',
... usecols=usecol_func)
>>> college
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
... ... ... ... ... ...
SAE Insti... NaN NaN ... NaN NaN
Rasmussen... NaN NaN ... NaN NaN
National ... NaN NaN ... NaN NaN
Bay Area ... NaN NaN ... NaN NaN
Excel Lea... NaN NaN ... NaN NaN
Use the .stack method to convert each horizontal column name to a vertical index level:
>>> college_stacked = college.stack()
>>> college_stacked
INSTNM
Alabama A & M University UGDS_WHITE 0.0333
UGDS_BLACK 0.9353
UGDS_HISP 0.0055
UGDS_ASIAN 0.0019
UGDS_AIAN 0.002
...
Coastal Pines Technical College UGDS_AIAN 0.0034
UGDS_NHPI 0.0017
UGDS_2MOR 0.0191
UGDS_NRA 0.0028
UGDS_UNKN 0.0056
Length: 61866, dtype: float64
Invert the stacked data back to its original form with the .unstack method:
>>> college_stacked.unstack()
UGDS_WHITE UGDS_BLACK ... UGDS_NRA UGDS_UNKN
INSTNM ...
Alabama A... 0.0333 0.9353 ... 0.0059 0.0138
Universit... 0.5922 0.2600 ... 0.0179 0.0100
Amridge U... 0.2990 0.4192 ... 0.0000 0.2715
Universit... 0.6988 0.1255 ... 0.0332 0.0350
Alabama S... 0.0158 0.9208 ... 0.0243 0.0137
... ... ... ... ... ...
Hollywood... 0.2182 0.4182 ... 0.0182 0.0909
Hollywood... 0.1200 0.3333 ... 0.0000 0.0667
Coachella... 0.3284 0.1045 ... 0.0000 0.0000
Dewey Uni... 0.0000 0.0000 ... 0.0000 0.0000
Coastal P... 0.6762 0.2508 ... 0.0028 0.0056
Let's do a similar set of operations with .melt followed by .pivot. First, read in the data without putting the institution name in the index:
>>> college2 = pd.read_csv('data/college.csv',
... usecols=usecol_func)
>>> college2
INSTNM UGDS_WHITE ... UGDS_NRA UGDS_UNKN
0 Alabama ... 0.0333 ... 0.0059 0.0138
1 Universi... 0.5922 ... 0.0179 0.0100
2 Amridge ... 0.2990 ... 0.0000 0.2715
3 Universi... 0.6988 ... 0.0332 0.0350
4 Alabama ... 0.0158 ... 0.0243 0.0137
... ... ... ... ... ...
7530 SAE Inst... NaN ... NaN NaN
7531 Rasmusse... NaN ... NaN NaN
7532 National... NaN ... NaN NaN
7533 Bay Area... NaN ... NaN NaN
7534 Excel Le... NaN ... NaN NaN
Use the .melt method to transpose all the race columns into a single column:
>>> college_melted = college2.melt(id_vars='INSTNM',
... var_name='Race',
... value_name='Percentage')
>>> college_melted
INSTNM Race Percentage
0 Alabama ... UGDS_WHITE 0.0333
1 Universi... UGDS_WHITE 0.5922
2 Amridge ... UGDS_WHITE 0.2990
3 Universi... UGDS_WHITE 0.6988
4 Alabama ... UGDS_WHITE 0.0158
... ... ... ...
67810 SAE Inst... UGDS_UNKN NaN
67811 Rasmusse... UGDS_UNKN NaN
67812 National... UGDS_UNKN NaN
67813 Bay Area... UGDS_UNKN NaN
67814 Excel Le... UGDS_UNKN NaN
Use the .pivot method to invert this previous result:
>>> melted_inv = college_melted.pivot(index='INSTNM',
... columns='Race',
... values='Percentage')
>>> melted_inv
Race UGDS_2MOR UGDS_AIAN ... UGDS_UNKN UGDS_WHITE
INSTNM ...
A & W Hea... 0.0000 0.0000 ... 0.0000 0.0000
A T Still... NaN NaN ... NaN NaN
ABC Beaut... 0.0000 0.0000 ... 0.0000 0.0000
ABC Beaut... 0.0000 0.0000 ... 0.0000 0.2895
AI Miami ... 0.0018 0.0000 ... 0.4644 0.0324
... ... ... ... ... ...
Yukon Bea... 0.0000 0.1200 ... 0.0000 0.8000
Z Hair Ac... 0.0211 0.0000 ... 0.0105 0.9368
Zane Stat... 0.0218 0.0029 ... 0.2399 0.6995
duCret Sc... 0.0976 0.0000 ... 0.0244 0.4634
eClips Sc... 0.0000 0.0000 ... 0.0000 0.1446
Use the .loc index operator to select rows and columns simultaneously, and then reset the index:
>>> college2_replication = (melted_inv
... .loc[college2['INSTNM'], college2.columns[1:]]
... .reset_index()
... )
>>> college2.equals(college2_replication)
True
There are multiple ways to accomplish the same thing in step 1. Here, we show the versatility of the read_csv function. The usecols parameter accepts either a list of the columns that we would like to import or a function that dynamically determines them. We use a function that checks whether the column name contains UGDS_ or is equal to INSTNM. The function is passed each column name as a string and must return a Boolean. A considerable amount of memory can be saved in this manner.
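Since the college.csv file may not be at hand, the same idea can be sketched with inline data (the header names below are invented, mirroring the real file's pattern):

```python
import io
import pandas as pd

csv_text = 'INSTNM,UGDS_WHITE,CITY\nSchool A,0.5,Boston\nSchool B,0.7,Austin\n'

# The callable receives each header name and returns True to keep it;
# CITY fails both tests and is never loaded into memory
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=lambda name: 'UGDS_' in name or name == 'INSTNM')
```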
The .stack method in step 2 puts all the column names into the innermost index level and returns a Series. In step 3, the .unstack method inverts this operation by taking all the values in the innermost index level and converting them to column names. Note that the sizes of the results of steps 1 and 3 are different because .stack drops missing values by default. If you pass the dropna=False parameter, the data will round-trip correctly.
Step 4 reads in the same dataset as in step 1 but does not put the institution name in the index because the .melt method isn't able to access it. Step 5 uses the .melt method to transpose all the race columns. It does this by leaving the value_vars parameter as its default value, None. When not specified, all the columns not present in the id_vars parameter get transposed.
Step 6 inverts the operation from step 5 with the .pivot method, which accepts three parameters. Each parameter takes a single column as a string (the values parameter may also accept a list of column names). The column referenced by the index parameter remains vertical and becomes the new index. The values of the column referenced by the columns parameter become the new column names. The values referenced by the values parameter are tiled to correspond to the intersection of their former index and column labels.
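These three parameters can be sketched on a small invented long-format table:

```python
import pandas as pd

long_df = pd.DataFrame({'state': ['TX', 'TX', 'AZ', 'AZ'],
                        'fruit': ['Apple', 'Orange', 'Apple', 'Orange'],
                        'weight': [12, 10, 9, 7]})

# index stays vertical, columns spreads horizontally,
# and values fills the resulting grid
wide = long_df.pivot(index='state', columns='fruit', values='weight')
```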
To make a replication with .pivot, we need to sort the rows and columns in the same order as the original. As the institution name is in the index, we use the .loc index operator to sort the DataFrame by its original index.
To help further understand .stack and .unstack, let's use them to transpose the college DataFrame. In this context, we are using the precise mathematical definition of transposing a matrix, where the new rows are the old columns of the original data matrix.
If you take a look at the output from step 2, you'll notice that there are two index levels. By default, the .unstack method uses the innermost index level as the new column values. Index levels are numbered beginning from zero on the outside. pandas defaults the level parameter of the .unstack method to -1, which refers to the innermost level. We can instead unstack the outermost level using level=0:
>>> college.stack().unstack(0)
INSTNM Alaba/rsity ... Coast/llege
UGDS_WHITE 0.0333 ... 0.6762
UGDS_BLACK 0.9353 ... 0.2508
UGDS_HISP 0.0055 ... 0.0359
UGDS_ASIAN 0.0019 ... 0.0045
UGDS_AIAN 0.0024 ... 0.0034
UGDS_NHPI 0.0019 ... 0.0017
UGDS_2MOR 0.0000 ... 0.0191
UGDS_NRA 0.0059 ... 0.0028
UGDS_UNKN 0.0138 ... 0.0056
There is a way to transpose a DataFrame that does not require .stack or .unstack: use the .transpose method or the .T attribute, like this:
>>> college.T
>>> college.transpose()
INSTNM Alaba/rsity ... Coast/llege
UGDS_WHITE 0.0333 ... 0.6762
UGDS_BLACK 0.9353 ... 0.2508
UGDS_HISP 0.0055 ... 0.0359
UGDS_ASIAN 0.0019 ... 0.0045
UGDS_AIAN 0.0024 ... 0.0034
UGDS_NHPI 0.0019 ... 0.0017
UGDS_2MOR 0.0000 ... 0.0191
UGDS_NRA 0.0059 ... 0.0028
UGDS_UNKN 0.0138 ... 0.0056
Grouping data by a single column and performing an aggregation on a single column returns a result that is easy to consume. When grouping by more than one column, a resulting aggregation might not be structured in a manner that makes consumption easy. Since .groupby operations, by default, put the unique grouping columns in the index, the .unstack method can be beneficial to rearrange the data so that it is presented in a manner that is more useful for interpretation.

In this recipe, we use the employee dataset to perform an aggregation, grouping by multiple columns. We then use the .unstack method to reshape the result into a format that makes for easier comparisons of different groups.
Read in the employee dataset, and find the average BASE_SALARY by RACE:
>>> employee = pd.read_csv('data/employee.csv')
>>> (employee
... .groupby('RACE')
... ['BASE_SALARY']
... .mean()
... .astype(int)
... )
RACE
American Indian or Alaskan Native 60272
Asian/Pacific Islander 61660
Black or African American 50137
Hispanic/Latino 52345
Others 51278
White 64419
Name: BASE_SALARY, dtype: int64
This was a simple groupby operation that results in a Series that is easy to read, with no need to reshape. Let's now find the average salary for all races by gender. Note that the result is a Series:
>>> (employee
... .groupby(['RACE', 'GENDER'])
... ['BASE_SALARY']
... .mean()
... .astype(int)
... )
RACE GENDER
American Indian or Alaskan Native Female 60238
Male 60305
Asian/Pacific Islander Female 63226
Male 61033
Black or African American Female 48915
...
Hispanic/Latino Male 54782
Others Female 63785
Male 38771
White Female 66793
Male 63940
Name: BASE_SALARY, Length: 12, dtype: int64
Use .unstack on the gender index level:
>>> (employee
... .groupby(['RACE', 'GENDER'])
... ['BASE_SALARY']
... .mean()
... .astype(int)
... .unstack('GENDER')
... )
GENDER Female Male
RACE
American Indian or Alaskan Native 60238 60305
Asian/Pacific Islander 63226 61033
Black or African American 48915 51082
Hispanic/Latino 46503 54782
Others 63785 38771
White 66793 63940
Alternatively, we can unstack the race index level:
>>> (employee
... .groupby(['RACE', 'GENDER'])
... ['BASE_SALARY']
... .mean()
... .astype(int)
... .unstack('RACE')
... )
RACE American Indian or Alaskan Native ... White
GENDER ...
Female 60238 ... 66793
Male 60305 ... 63940
Step 1 has the simplest possible aggregation with a single grouping column (RACE), a single aggregating column (BASE_SALARY), and a single aggregating function (.mean). This result is easy to consume and doesn't require any more processing to evaluate. Step 2 groups by both race and gender together. The resulting Series (which has a MultiIndex) contains all the values in a single dimension, which makes comparisons more difficult. To make the information easier to consume, we use the .unstack method to convert the values in one (or more) of the levels to columns.
By default, .unstack uses the innermost index level as the new columns. You can specify the level you would like to unstack with the level parameter, which accepts either the level name as a string or the level integer location. It is preferable to use the level name over the integer location to avoid ambiguity. Steps 3 and 4 unstack each level, which results in a DataFrame with a single-level index. It is now much easier to compare salaries from each race by gender.
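The level choice can be sketched on a toy MultiIndex Series (labels invented for illustration):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], ['F', 'M']],
                                 names=['race', 'gender'])
s = pd.Series([1, 2, 3, 4], index=idx)

# Equivalent ways to pick a level: by name or by integer position
by_gender = s.unstack('gender')   # same as s.unstack(level=-1)
by_race = s.unstack('race')       # same as s.unstack(level=0)
```

Using the level name makes the intent obvious to a reader, which is why it is preferred.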
If there are multiple aggregating functions when performing a groupby with a single column from a DataFrame, then the immediate result will be a DataFrame and not a Series. For instance, let's calculate more aggregations than just the mean, as was done in step 2:
>>> (employee
... .groupby(['RACE', 'GENDER'])
... ['BASE_SALARY']
... .agg(['mean', 'max', 'min'])
... .astype(int)
... )
mean max min
RACE GENDER
American Indian or Alaskan Native Female 60238 98536 26125
Male 60305 81239 26125
Asian/Pacific Islander Female 63226 130416 26125
Male 61033 163228 27914
Black or African American Female 48915 150416 24960
... ... ... ...
Hispanic/Latino Male 54782 165216 26104
Others Female 63785 63785 63785
Male 38771 38771 38771
White Female 66793 178331 27955
Male 63940 210588 26125
Unstacking the gender index level will result in columns with a MultiIndex. From here, you can keep swapping row and column levels with both the .unstack and .stack methods until you achieve the structure of data you desire:
>>> (employee
... .groupby(['RACE', 'GENDER'])
... ['BASE_SALARY']
... .agg(['mean', 'max', 'min'])
... .astype(int)
... .unstack('GENDER')
... )
mean ... min
GENDER Female Male ... Female Male
RACE ...
American ... 60238 60305 ... 26125 26125
Asian/Pac... 63226 61033 ... 26125 27914
Black or ... 48915 51082 ... 24960 26125
Hispanic/... 46503 54782 ... 26125 26104
Others 63785 38771 ... 63785 38771
White 66793 63940 ... 27955 26125
At first glance, it may seem that the .pivot_table method provides a unique way to analyze data. However, after a little massaging, it is possible to replicate its functionality with the .groupby method. Knowing this equivalence can help shrink the universe of pandas functionality.

In this recipe, we use the flights dataset to create a pivot table and then recreate it using the .groupby method.
Use the .pivot_table method to find the total number of canceled flights per origin airport for each airline:
>>> flights = pd.read_csv('data/flights.csv')
>>> fpt = flights.pivot_table(index='AIRLINE',
... columns='ORG_AIR',
... values='CANCELLED',
... aggfunc='sum',
... fill_value=0)
>>> fpt
ORG_AIR ATL DEN DFW IAH LAS LAX MSP ORD PHX SFO
AIRLINE
AA 3 4 86 3 3 11 3 35 4 2
AS 0 0 0 0 0 0 0 0 0 0
B6 0 0 0 0 0 0 0 0 0 1
DL 28 1 0 0 1 1 4 0 1 2
EV 18 6 27 36 0 0 6 53 0 0
... ... ... ... ... ... ... ... ... ... ...
OO 3 25 2 10 0 15 4 41 9 33
UA 2 9 1 23 3 6 2 25 3 19
US 0 0 2 2 1 0 0 6 7 3
VX 0 0 0 0 0 3 0 0 0 3
WN 9 13 0 0 7 32 1 0 6 25
To replicate this table with the .groupby method, we will need to group by two columns and then unstack them. A groupby aggregation alone cannot replicate this table. The trick is to group by all the columns in both the index and columns parameters first:
>>> (flights
... .groupby(['AIRLINE', 'ORG_AIR'])
... ['CANCELLED']
... .sum()
... )
AIRLINE ORG_AIR
AA ATL 3
DEN 4
DFW 86
IAH 3
LAS 3
..
WN LAS 7
LAX 32
MSP 1
PHX 6
SFO 25
Name: CANCELLED, Length: 114, dtype: int64
Use the .unstack method to pivot the ORG_AIR index level into column names:
>>> fpg = (flights
... .groupby(['AIRLINE', 'ORG_AIR'])
... ['CANCELLED']
... .sum()
... .unstack('ORG_AIR', fill_value=0)
... )
>>> fpt.equals(fpg)
True
The .pivot_table
method is very versatile and flexible, but it performs an operation quite similar to a groupby aggregation, as step 1 shows. The index
parameter takes a column (or list of columns) that will not be pivoted and whose unique values will be placed in the index. The columns
parameter takes a column (or list of columns) that will be pivoted and whose unique values will be made into column names. The values
parameter takes a column (or list of columns) that will be aggregated.
There also exists an aggfunc
parameter that takes an aggregating function (or list of functions) that determines how the columns in the values
parameter get aggregated. It defaults to the string mean
, and, in this example, we change it to calculate the sum. Additionally, some unique combinations of AIRLINE
and ORG_AIR
do not exist. These missing combinations will default to missing values in the resulting DataFrame. Here, we use the fill_value
parameter to change them to zero.
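To see these parameters in isolation, here is a minimal sketch with a tiny invented DataFrame (only the column names echo the flights data; the values are made up):

```python
import pandas as pd

# A tiny invented table in the shape of the flights data.
df = pd.DataFrame({'AIRLINE': ['AA', 'AA', 'DL', 'DL'],
                   'ORG_AIR': ['ATL', 'DFW', 'ATL', 'ATL'],
                   'CANCELLED': [1, 0, 1, 1]})

pt = df.pivot_table(index='AIRLINE',    # not pivoted; unique values form the index
                    columns='ORG_AIR',  # pivoted; unique values become column names
                    values='CANCELLED', # the column that gets aggregated
                    aggfunc='sum',      # how to aggregate (default is 'mean')
                    fill_value=0)       # DL/DFW never occurs, so it becomes 0
```

The DL/DFW cell holds 0 only because of fill_value; without it, that missing intersection would be a missing value.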
Step 2 begins the replication process using all the columns in the index
and columns
parameter as the grouping columns. This is the key to making this recipe work. A pivot table is an intersection of all the unique combinations of the grouping columns. Step 3 finishes the replication by pivoting the innermost index level into column names with the .unstack
method. Just like with .pivot_table
, not all combinations of AIRLINE
and ORG_AIR
exist; we again use the fill_value
parameter to force these missing intersections to zero.
It is possible to replicate much more complex pivot tables with the .groupby
method. For instance, take the following result from .pivot_table
:
>>> flights.pivot_table(index=['AIRLINE', 'MONTH'],
... columns=['ORG_AIR', 'CANCELLED'],
... values=['DEP_DELAY', 'DIST'],
... aggfunc=['sum', 'mean'],
... fill_value=0)
sum ... mean
DEP_DELAY ... DIST
ORG_AIR ATL ... SFO
CANCELLED 0 1 ... 0 1
AIRLINE MONTH ...
AA 1 -13 0 ... 1860.166667 0.0
2 -39 0 ... 1337.916667 2586.0
3 -2 0 ... 1502.758621 0.0
4 1 0 ... 1646.903226 0.0
5 52 0 ... 1436.892857 0.0
... ... .. ... ... ...
WN 7 2604 0 ... 636.210526 0.0
8 1718 0 ... 644.857143 392.0
9 1033 0 ... 731.578947 354.5
11 700 0 ... 580.875000 392.0
12 1679 0 ... 782.256410 0.0
To replicate this with the .groupby
method, follow the same pattern from the recipe, place all the columns from the index
and columns
parameters into the .groupby
method, and then call .unstack
to pull the index levels out to the columns:
>>> (flights
... .groupby(['AIRLINE', 'MONTH', 'ORG_AIR', 'CANCELLED'])
... [['DEP_DELAY', 'DIST']]
... .agg(['mean', 'sum'])
... .unstack(['ORG_AIR', 'CANCELLED'], fill_value=0)
... .swaplevel(0, 1, axis='columns')
... )
mean ... sum
DEP_DELAY ... DIST
ORG_AIR ATL ... SFO
CANCELLED 0 1 ... 0 1
AIRLINE MONTH ...
AA 1 -3.250000 NaN ... 33483.0 NaN
2 -3.000000 NaN ... 32110.0 2586.0
3 -0.166667 NaN ... 43580.0 NaN
4 0.071429 NaN ... 51054.0 NaN
5 5.777778 NaN ... 40233.0 NaN
... ... .. ... ... ...
WN 7 21.700000 NaN ... 24176.0 NaN
8 16.207547 NaN ... 18056.0 784.0
9 8.680672 NaN ... 27800.0 709.0
11 5.932203 NaN ... 23235.0 784.0
12 15.691589 NaN ... 30508.0 NaN
The order of the column levels differs: .pivot_table puts the aggregation functions at a level preceding the columns in the values parameter. You can use the .swaplevel method to remedy this; it swaps the outermost column level (level 0) with the level just below it (level 1). Even after swapping, the ordering of the columns within each level still differs.
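The swap itself is easy to see on a toy two-level column index (the level values below are invented, mirroring the recipe's shape):

```python
import pandas as pd

# Columns shaped like a .agg(...) result: (function, value column).
cols = pd.MultiIndex.from_product([['sum', 'mean'],
                                   ['DEP_DELAY', 'DIST']])
df = pd.DataFrame([[1, 2, 3, 4]], columns=cols)

# Swap the outermost column level (0) with the one below it (1).
# The data stays attached to its columns; only level order changes.
swapped = df.swaplevel(0, 1, axis='columns')
```

After the swap, the value-column names sit on top and the function names underneath, matching the .pivot_table layout.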
Reshaping with the .stack
and .unstack
methods is far easier when each axis (both index and column) level has a name. pandas allows users to reference each axis level by integer location or by name. Since integer location is implicit and not explicit, you should consider using level names whenever possible. This advice follows from The Zen of Python (type import this
if you are not familiar with it), a short list of guiding principles for Python, of which the second one is "Explicit is better than implicit."
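As a small sketch of why names beat positions (the level names and numbers below are invented to match the upcoming recipe):

```python
import pandas as pd

# A Series with two named index levels.
idx = pd.MultiIndex.from_tuples([('AK', 0), ('AK', 1), ('AL', 0)],
                                names=['STABBR', 'RELAFFIL'])
s = pd.Series([7, 3, 72], index=idx)

# Both calls unstack the same level, but only the first
# documents its intent to a reader:
by_name = s.unstack('RELAFFIL')
by_position = s.unstack(1)
```

If a later change reorders the index levels, the name-based call keeps working while the positional one silently pivots the wrong level.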
When grouping or aggregating with multiple columns, the resulting pandas object will have multiple levels in one or both of the axes. In this recipe, we will name each level of each axis and then use the .stack
and .unstack
methods to reshape the data to the desired form.
>>> college = pd.read_csv('data/college.csv')
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... )
UGDS SATMTMID
size min max size min max
STABBR RELAFFIL
AK 0 7 109.0 12865.0 7 NaN NaN
1 3 27.0 275.0 3 503.0 503.0
AL 0 72 12.0 29851.0 72 420.0 590.0
1 24 13.0 3033.0 24 400.0 560.0
AR 0 68 18.0 21405.0 68 427.0 565.0
... ... ... ... ... ... ...
WI 0 87 20.0 29302.0 87 480.0 680.0
1 25 4.0 8212.0 25 452.0 605.0
WV 0 65 20.0 44924.0 65 430.0 530.0
1 8 63.0 1375.0 8 455.0 510.0
WY 0 11 52.0 9910.0 11 540.0 540.0
The column levels are unnamed. Use the .rename_axis method to give them level names:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
... )
AGG_COLS UGDS SATMTMID
AGG_FUNCS size min max size min max
STABBR RELAFFIL
AK 0 7 109.0 12865.0 7 NaN NaN
1 3 27.0 275.0 3 503.0 503.0
AL 0 72 12.0 29851.0 72 420.0 590.0
1 24 13.0 3033.0 24 400.0 560.0
AR 0 68 18.0 21405.0 68 427.0 565.0
... ... ... ... ... ... ...
WI 0 87 20.0 29302.0 87 480.0 680.0
1 25 4.0 8212.0 25 452.0 605.0
WV 0 65 20.0 44924.0 65 430.0 530.0
1 8 63.0 1375.0 8 455.0 510.0
WY 0 11 52.0 9910.0 11 540.0 540.0
Use the .stack method to move the AGG_FUNCS column level down into an index level:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
... .stack('AGG_FUNCS')
... )
AGG_COLS UGDS SATMTMID
STABBR RELAFFIL AGG_FUNCS
AK 0 size 7.0 7.0
min 109.0 NaN
max 12865.0 NaN
1 size 3.0 3.0
min 27.0 503.0
... ... ...
WV 1 min 63.0 455.0
max 1375.0 510.0
WY 0 size 11.0 11.0
min 52.0 540.0
max 9910.0 540.0
Use the .swaplevel method to move AGG_FUNCS from the innermost index level out to the outermost level:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
... .stack('AGG_FUNCS')
... .swaplevel('AGG_FUNCS', 'STABBR',
... axis='index')
... )
AGG_COLS UGDS SATMTMID
AGG_FUNCS RELAFFIL STABBR
size 0 AK 7.0 7.0
min 0 AK 109.0 NaN
max 0 AK 12865.0 NaN
size 1 AK 3.0 3.0
min 1 AK 27.0 503.0
... ... ...
min 1 WV 63.0 455.0
max 1 WV 1375.0 510.0
size 0 WY 11.0 11.0
min 0 WY 52.0 540.0
max 0 WY 9910.0 540.0
The index levels are now unsorted. Sort both the rows and the columns with the .sort_index method:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
... .stack('AGG_FUNCS')
... .swaplevel('AGG_FUNCS', 'STABBR', axis='index')
... .sort_index(level='RELAFFIL', axis='index')
... .sort_index(level='AGG_COLS', axis='columns')
... )
AGG_COLS SATMTMID UGDS
AGG_FUNCS RELAFFIL STABBR
max 0 AK NaN 12865.0
AL 590.0 29851.0
AR 565.0 21405.0
AS NaN 1276.0
AZ 580.0 151558.0
... ... ...
size 1 VI 1.0 1.0
VT 5.0 5.0
WA 17.0 17.0
WI 25.0 25.0
WV 8.0 8.0
We can get a very different shape by stacking and unstacking different levels at the same time:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
... .stack('AGG_FUNCS')
... .unstack(['RELAFFIL', 'STABBR'])
... )
AGG_COLS UGDS ... SATMTMID
RELAFFIL 0 1 ... 1 0
STABBR AK AK ... WV WY
AGG_FUNCS ...
size 7.0 3.0 ... 8.0 11.0
min 109.0 27.0 ... 455.0 540.0
max 12865.0 275.0 ... 510.0 540.0
Stacking every column level collapses the DataFrame into a Series:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
... .stack(['AGG_FUNCS', 'AGG_COLS'])
... )
STABBR RELAFFIL AGG_FUNCS AGG_COLS
AK 0 size UGDS 7.0
SATMTMID 7.0
min UGDS 109.0
max UGDS 12865.0
1 size UGDS 3.0
...
WY 0 size SATMTMID 11.0
min UGDS 52.0
SATMTMID 540.0
max UGDS 9910.0
SATMTMID 540.0
Length: 640, dtype: float64
Likewise, unstacking every index level also collapses the DataFrame into a Series:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... .rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
... .unstack(['STABBR', 'RELAFFIL'])
... )
AGG_COLS AGG_FUNCS STABBR RELAFFIL
UGDS size AK 0 7.0
1 3.0
AL 0 72.0
1 24.0
AR 0 68.0
...
SATMTMID max WI 1 605.0
WV 0 530.0
1 510.0
WY 0 540.0
1 NaN
Length: 708, dtype: float64
It is common for the result of a call to the .groupby
method to produce a DataFrame or Series with multiple axis levels. The resulting DataFrame from the groupby operation in step 1 has multiple levels for each axis. The column levels are not named, which would require us to reference them only by their integer location. To ease our ability to reference the column levels, we rename them with the .rename_axis
method.
The .rename_axis
method is a bit strange in that it can modify both the level names and the level values based on the type of the first argument passed to it. Passing it a list (or a scalar if there is only one level) changes the names of the levels. In step 2, we pass the .rename_axis
method a list and are returned a DataFrame with all axis levels named.
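A minimal sketch of the list form (toy columns, invented names):

```python
import pandas as pd

# A two-level column index like the one step 1 produces.
cols = pd.MultiIndex.from_product([['UGDS'], ['min', 'max']])
df = pd.DataFrame([[9.0, 100.0]], columns=cols)

list(df.columns.names)       # the levels start out unnamed

# Passing a list (one entry per level) names every level at once:
named = df.rename_axis(['AGG_COLS', 'AGG_FUNCS'], axis='columns')
list(named.columns.names)
```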
Once all the axis levels have names, we can control the structure of data. Step 3 stacks the AGG_FUNCS
column into the innermost index level. The .swaplevel
method in step 4 accepts the name or position of the levels that you want to swap as the first two arguments. In step 5, the .sort_index
method is called twice and sorts the values of each level. Notice that the values of the column level are the column names SATMTMID
and UGDS
.
We can get vastly different output by both stacking and unstacking, as done in step 6. It is also possible to stack or unstack every single column or index level, and both will collapse into a Series.
If you wish to dispose of the level values altogether, you may set them to None
. You can do this when you want to reduce visual clutter or when it is obvious what the column levels represent and no further processing will take place:
>>> (college
... .groupby(['STABBR', 'RELAFFIL'])
... [['UGDS', 'SATMTMID']]
... .agg(['size', 'min', 'max'])
... .rename_axis([None, None], axis='index')
... .rename_axis([None, None], axis='columns')
... )
UGDS SATMTMID
size min max size min max
AK 0 7 109.0 12865.0 7 NaN NaN
1 3 27.0 275.0 3 503.0 503.0
AL 0 72 12.0 29851.0 72 420.0 590.0
1 24 13.0 3033.0 24 400.0 560.0
AR 0 68 18.0 21405.0 68 427.0 565.0
... ... ... ... ... ... ...
WI 0 87 20.0 29302.0 87 480.0 680.0
1 25 4.0 8212.0 25 452.0 605.0
WV 0 65 20.0 44924.0 65 430.0 530.0
1 8 63.0 1375.0 8 455.0 510.0
WY 0 11 52.0 9910.0 11 540.0 540.0
One particular flavor of messy data appears whenever the column names contain multiple different variables themselves. A common example of this scenario occurs when age and sex are concatenated together. To tidy datasets like this, we must manipulate the columns with the pandas .str
attribute. This attribute contains additional methods for string processing.
In this recipe, we will first identify all the variables, of which some will be concatenated together as column names. We then reshape the data and parse the text to extract the correct variable values.
>>> weightlifting = pd.read_csv('data/weightlifting_men.csv')
>>> weightlifting
Weight Category M35 35-39 ... M75 75-79 M80 80+
0 56 137 ... 62 55
1 62 152 ... 67 57
2 69 167 ... 75 60
3 77 182 ... 82 65
4 85 192 ... 87 70
5 94 202 ... 90 75
6 105 210 ... 95 80
7 105+ 217 ... 100 85
This dataset has a Weight Category variable, a combination of sex and age, and the qualifying total. The age and sex variables have been concatenated together into a single cell. Before we can separate them, let's use the .melt method to transpose the age and sex column names into a single vertical column:
>>> (weightlifting
... .melt(id_vars='Weight Category',
... var_name='sex_age',
... value_name='Qual Total')
... )
Weight Category sex_age Qual Total
0 56 M35 35-39 137
1 62 M35 35-39 152
2 69 M35 35-39 167
3 77 M35 35-39 182
4 85 M35 35-39 192
.. ... ... ...
75 77 M80 80+ 65
76 85 M80 80+ 70
77 94 M80 80+ 75
78 105 M80 80+ 80
79 105+ M80 80+ 85
Select the sex_age column, and use the .split method available from the .str attribute to split it into two different columns:
>>> (weightlifting
... .melt(id_vars='Weight Category',
... var_name='sex_age',
... value_name='Qual Total')
... ['sex_age']
... .str.split(expand=True)
... )
0 1
0 M35 35-39
1 M35 35-39
2 M35 35-39
3 M35 35-39
4 M35 35-39
.. ... ...
75 M80 80+
76 M80 80+
77 M80 80+
78 M80 80+
79 M80 80+
Rename these columns with meaningful names:
>>> (weightlifting
... .melt(id_vars='Weight Category',
... var_name='sex_age',
... value_name='Qual Total')
... ['sex_age']
... .str.split(expand=True)
... .rename(columns={0:'Sex', 1:'Age Group'})
... )
Sex Age Group
0 M35 35-39
1 M35 35-39
2 M35 35-39
3 M35 35-39
4 M35 35-39
.. ... ...
75 M80 80+
76 M80 80+
77 M80 80+
78 M80 80+
79 M80 80+
Clean up the Sex column by using an index operation on the .str attribute to select the first character from the renamed Sex column:
>>> (weightlifting
... .melt(id_vars='Weight Category',
... var_name='sex_age',
... value_name='Qual Total')
... ['sex_age']
... .str.split(expand=True)
... .rename(columns={0:'Sex', 1:'Age Group'})
... .assign(Sex=lambda df_: df_.Sex.str[0])
... )
Sex Age Group
0 M 35-39
1 M 35-39
2 M 35-39
3 M 35-39
4 M 35-39
.. .. ...
75 M 80+
76 M 80+
77 M 80+
78 M 80+
79 M 80+
Use the pd.concat function to concatenate this DataFrame with the Weight Category and Qual Total columns:
>>> melted = (weightlifting
... .melt(id_vars='Weight Category',
... var_name='sex_age',
... value_name='Qual Total')
... )
>>> tidy = pd.concat([melted
... ['sex_age']
... .str.split(expand=True)
... .rename(columns={0:'Sex', 1:'Age Group'})
... .assign(Sex=lambda df_: df_.Sex.str[0]),
... melted[['Weight Category', 'Qual Total']]],
... axis='columns'
... )
>>> tidy
Sex Age Group Weight Category Qual Total
0 M 35-39 56 137
1 M 35-39 62 152
2 M 35-39 69 167
3 M 35-39 77 182
4 M 35-39 85 192
.. .. ... ... ...
75 M 80+ 77 65
76 M 80+ 85 70
77 M 80+ 94 75
78 M 80+ 105 80
79 M 80+ 105+ 85
Alternatively, we can use the .assign method to add those columns back:
>>> melted = (weightlifting
... .melt(id_vars='Weight Category',
... var_name='sex_age',
... value_name='Qual Total')
... )
>>> (melted
... ['sex_age']
... .str.split(expand=True)
... .rename(columns={0:'Sex', 1:'Age Group'})
... .assign(Sex=lambda df_: df_.Sex.str[0],
... Category=melted['Weight Category'],
... Total=melted['Qual Total'])
... )
Sex Age Group Category Total
0 M 35-39 56 137
1 M 35-39 62 152
2 M 35-39 69 167
3 M 35-39 77 182
4 M 35-39 85 192
.. .. ... ... ...
75 M 80+ 77 65
76 M 80+ 85 70
77 M 80+ 94 75
78 M 80+ 105 80
79 M 80+ 105+ 85
The weightlifting dataset, like many datasets, has easily digestible information in its raw form. Still, technically it is messy, as all but one of the column names contain information for sex and age. Once the variables are identified, we can begin to tidy the dataset. Whenever column names contain variables, you will need to use the .melt
(or .stack
) method. The Weight Category
variable is already in the correct position, so we keep it as an identifying variable by passing it to the id_vars
parameter. Note that we don't explicitly need to name all the columns that we are melting with value_vars
. By default, all the columns not present in id_vars
get melted.
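This default can be shown with a miniature of the weightlifting table (the rows here are invented):

```python
import pandas as pd

df = pd.DataFrame({'Weight Category': ['56', '62'],
                   'M35 35-39': [137, 152],
                   'M40 40-44': [130, 145]})

# value_vars is omitted, so every column not listed in
# id_vars is melted into the sex_age/Qual Total pair:
long = df.melt(id_vars='Weight Category',
               var_name='sex_age', value_name='Qual Total')
```

Two identifier rows times two melted columns yields four rows in the long result.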
The sex_age
column needs to be parsed and split into two variables. For this, we turn to the extra functionality provided by the .str attribute, which is available only on a Series (a single DataFrame column) or a non-hierarchical index. The .split method is one of the more common methods in this situation, as it can separate different parts of the string into their own columns. By default, it splits on whitespace, but you may also specify a string or regular expression with the pat parameter. When the expand parameter is set to True, a new column forms for each split segment. When False, a single column is returned, containing a list of all the segments.
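A quick sketch of both modes on two of the melted labels:

```python
import pandas as pd

s = pd.Series(['M35 35-39', 'M80 80+'])

# expand=True: each whitespace-separated piece gets its own column
wide = s.str.split(expand=True)

# expand=False (the default): one column holding a list of the pieces
as_lists = s.str.split()
```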
After renaming the columns in step 4, we need to use the .str
attribute again. This attribute allows us to index or slice off of it, just like a string. Here, we select the first character, which is the variable for sex. We could go further and split the ages into two separate columns for minimum and maximum age, but it is common to refer to the entire age group in this manner, so we leave it as is.
Step 6 shows one of two different methods to join all the data together. The concat
function accepts a collection of DataFrames and either concatenates them vertically (axis='index'
) or horizontally (axis='columns'
). Because the two DataFrames are indexed identically, it is possible to assign the values of one DataFrame to new columns in the other, as done in step 7.
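The two axis options can be demonstrated with two tiny invented frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})

side = pd.concat([a, b], axis='columns')  # align on the index, join sideways
tall = pd.concat([a, a], axis='index')    # stack vertically instead
```

Because a and b share the same index, the horizontal concatenation lines their rows up exactly, which is the situation exploited in step 6.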
Another way to complete this recipe, beginning after step 2, is by assigning new columns from the sex_age
column without using the .split
method. The .assign
method may be used to add these new columns dynamically:
>>> tidy2 = (weightlifting
... .melt(id_vars='Weight Category',
... var_name='sex_age',
... value_name='Qual Total')
... .assign(Sex=lambda df_:df_.sex_age.str[0],
... **{'Age Group':(lambda df_: (df_
... .sex_age
... .str.extract(r'(\d{2}[-+](?:\d{2})?)',
... expand=False)))})
... .drop(columns='sex_age')
... )
>>> tidy2
Weight Category Qual Total Sex Age Group
0 56 137 M 35-39
1 62 152 M 35-39
2 69 167 M 35-39
3 77 182 M 35-39
4 85 192 M 35-39
.. ... ... .. ...
75 77 65 M 80+
76 85 70 M 80+
77 94 75 M 80+
78 105 80 M 80+
79 105+ 85 M 80+
>>> tidy.sort_index(axis=1).equals(tidy2.sort_index(axis=1))
True
The Sex
column is found in the same manner as done in step 5. Because we are not using .split
, the Age Group
column must be extracted in a different manner. The .extract
method uses a complex regular expression to extract very specific portions of the string. To use .extract
correctly, your pattern must contain capture groups. A capture group is formed by enclosing parentheses around a portion of the pattern. In this example, the entire expression is one large capture group. It begins with \d{2}, which searches for exactly two digits, followed by either a literal minus or plus, optionally followed by two more digits. Although the last part of the expression, (?:\d{2})?
, is surrounded by parentheses, the ?:
denotes that it is not a capture group. It is technically a non-capturing group used to express two digits together as optional. The sex_age
column is no longer needed and is dropped.
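The same pattern behaves identically with the standard library's re module, which makes it easy to test in isolation (the sample strings are taken from the melted labels):

```python
import re

# One capturing group; the inner (?:\d{2})? matches an optional
# second pair of digits but is not captured separately.
pattern = r'(\d{2}[-+](?:\d{2})?)'

re.search(pattern, 'M35 35-39').group(1)   # '35-39'
re.search(pattern, 'M80 80+').group(1)     # '80+'
```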
Finally, the two tidy DataFrames are compared against one another and are found to be equivalent.
Tidy datasets must have a single column for each variable. Occasionally, multiple variable names are placed in a single column with their corresponding value placed in another.
In this recipe, we identify the column containing the improperly structured variables and pivot it to create tidy data.
Read in the restaurant inspections dataset and convert the Date column data type to datetime64:
>>> inspections = pd.read_csv('data/restaurant_inspections.csv',
... parse_dates=['Date'])
>>> inspections
Name ...
0 E & E Grill House ...
1 E & E Grill House ...
2 E & E Grill House ...
3 E & E Grill House ...
4 E & E Grill House ...
.. ... ...
495 PIER SIXTY ONE-THE LIGHTHOUSE ...
496 PIER SIXTY ONE-THE LIGHTHOUSE ...
497 PIER SIXTY ONE-THE LIGHTHOUSE ...
498 PIER SIXTY ONE-THE LIGHTHOUSE ...
499 PIER SIXTY ONE-THE LIGHTHOUSE ...
This dataset has two variables, Name and Date, that are each correctly contained in a single column. The Info column, however, holds five different variables: Borough, Cuisine, Description, Grade, and Score. Let's attempt to use the .pivot method to keep the Name and Date columns vertical, create new columns out of all the values in the Info column, and use the Value column as their intersection:
>>> inspections.pivot(index=['Name', 'Date'],
... columns='Info', values='Value')
Traceback (most recent call last):
...
NotImplementedError: > 1 ndim Categorical are not supported at this time
The .pivot method raises an error here. Instead, use the .set_index method to move Name, Date, and Info into the index:
>>> inspections.set_index(['Name','Date', 'Info'])
Value
Name Date Info
E & E Gri... 2017-08-08 Borough MANHATTAN
Cuisine American
Description Non-food...
Grade A
Score 9.0
... ...
PIER SIXT... 2017-09-01 Borough MANHATTAN
Cuisine American
Description Filth fl...
Grade Z
Score 33.0
Use the .unstack method to pivot all the values in the Info column:
>>> (inspections
... .set_index(['Name','Date', 'Info'])
... .unstack('Info')
... )
Value ...
Info Borough Cuisine ... Grade Score
Name Date ...
3 STAR JU... 2017-05-10 BROOKLYN Juice, S... ... A 12.0
A & L PIZ... 2017-08-22 BROOKLYN Pizza ... A 9.0
AKSARAY T... 2017-07-25 BROOKLYN Turkish ... A 13.0
ANTOJITOS... 2017-06-01 BROOKLYN Latin (C... ... A 10.0
BANGIA 2017-06-16 MANHATTAN Korean ... A 9.0
... ... ... ... ... ...
VALL'S PI... 2017-03-15 STATEN I... Pizza/It... ... A 9.0
VIP GRILL 2017-06-12 BROOKLYN Jewish/K... ... A 10.0
WAHIZZA 2017-04-13 MANHATTAN Pizza ... A 10.0
WANG MAND... 2017-08-29 QUEENS Korean ... A 12.0
XIAOYAN Y... 2017-08-29 QUEENS Korean ... Z 49.0
Make the index levels into regular columns with the .reset_index method:
>>> (inspections
... .set_index(['Name','Date', 'Info'])
... .unstack('Info')
... .reset_index(col_level=-1)
... )
. ... Value
Info Name Date ... Grade Score
0 3 STAR J... 2017-05-10 ... A 12.0
1 A & L PI... 2017-08-22 ... A 9.0
2 AKSARAY ... 2017-07-25 ... A 13.0
3 ANTOJITO... 2017-06-01 ... A 10.0
4 BANGIA 2017-06-16 ... A 9.0
.. ... ... ... ... ...
95 VALL'S P... 2017-03-15 ... A 9.0
96 VIP GRILL 2017-06-12 ... A 10.0
97 WAHIZZA 2017-04-13 ... A 10.0
98 WANG MAN... 2017-08-29 ... A 12.0
99 XIAOYAN ... 2017-08-29 ... Z 49.0
Use the .droplevel method to remove the top column level and then rename the columns' axis name to None:
>>> (inspections
... .set_index(['Name','Date', 'Info'])
... .unstack('Info')
... .reset_index(col_level=-1)
... .droplevel(0, axis=1)
... .rename_axis(None, axis=1)
... )
Name Date ... Grade Score
0 3 STAR J... 2017-05-10 ... A 12.0
1 A & L PI... 2017-08-22 ... A 9.0
2 AKSARAY ... 2017-07-25 ... A 13.0
3 ANTOJITO... 2017-06-01 ... A 10.0
4 BANGIA 2017-06-16 ... A 9.0
.. ... ... ... ... ...
95 VALL'S P... 2017-03-15 ... A 9.0
96 VIP GRILL 2017-06-12 ... A 10.0
97 WAHIZZA 2017-04-13 ... A 10.0
98 WANG MAN... 2017-08-29 ... A 12.0
99 XIAOYAN ... 2017-08-29 ... Z 49.0
This cleanup can be avoided by first converting the lone Value column to a Series with the .squeeze method. The following code produces the same result as the previous step:
>>> (inspections
... .set_index(['Name','Date', 'Info'])
... .squeeze()
... .unstack('Info')
... .reset_index()
... .rename_axis(None, axis='columns')
... )
Name Date ... Grade Score
0 3 STAR J... 2017-05-10 ... A 12.0
1 A & L PI... 2017-08-22 ... A 9.0
2 AKSARAY ... 2017-07-25 ... A 13.0
3 ANTOJITO... 2017-06-01 ... A 10.0
4 BANGIA 2017-06-16 ... A 9.0
.. ... ... ... ... ...
95 VALL'S P... 2017-03-15 ... A 9.0
96 VIP GRILL 2017-06-12 ... A 10.0
97 WAHIZZA 2017-04-13 ... A 10.0
98 WANG MAN... 2017-08-29 ... A 12.0
99 XIAOYAN ... 2017-08-29 ... Z 49.0
In step 1, we notice that there are five variables placed vertically in the Info
column with their corresponding value in the Value
column. Because we need to pivot each of these five variables as horizontal column names, it would seem that the .pivot
method would work. Unfortunately, pandas developers have yet to implement this special case when there is more than one non-pivoted column. We are forced to use a different method.
The .unstack
method also pivots vertical data, but only for data in the index. Step 3 begins this process by moving both the columns that will and will not be pivoted into the index with the .set_index
method. Once these columns are in the index, the .unstack
method can be put to work, as done in step 4.
Notice that as we are unstacking a DataFrame, pandas keeps the original column names (here, it is just a single column, Value
) and creates a MultiIndex
with the old column names as the upper level. The dataset is now essentially tidy, but we go ahead and make our non-pivoted columns normal columns with the .reset_index
method. Because we have MultiIndex
columns, we can choose which level the new column names will belong to with the col_level
parameter. By default, the names are inserted into the uppermost level (level 0). We use -1 to indicate the bottommost level.
After all this, we have some excess DataFrame names and indexes that need to be discarded. We use .droplevel
and .rename_axis
to remedy that. These columns still have a useless .name
attribute, Info
, which is renamed None
.
Cleaning up the MultiIndex
columns could have been avoided by forcing the resulting DataFrame from step 3 to a Series. The .squeeze
method works on single-column DataFrames and turns them into Series.
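The behavior is easy to confirm on a toy frame (invented values):

```python
import pandas as pd

df = pd.DataFrame({'Value': [10, 20]})

s = df.squeeze()   # a single-column DataFrame collapses to a Series

# A DataFrame with several rows and columns has nothing to
# squeeze and is returned unchanged:
wide = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}).squeeze()
```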
It is possible to use the .pivot_table
method, which has no restrictions on how many non-pivoted columns are allowed. The .pivot_table
method differs from .pivot
by performing an aggregation for all the values that correspond to the intersection between the columns in the index
and columns
parameters.
Because there may be multiple values in this intersection, .pivot_table
requires the user to pass it an aggregating function to output a single value. We use the first
aggregating function, which takes the first of the values of the group. In this particular example, there is exactly one value for each intersection, so there is nothing to be aggregated. The default aggregation function is the mean
, which will produce an error here, since some of the values are strings:
>>> (inspections
... .pivot_table(index=['Name', 'Date'],
... columns='Info',
... values='Value',
... aggfunc='first')
... .reset_index()
... .rename_axis(None, axis='columns')
... )
Name Date ... Grade Score
0 3 STAR J... 2017-05-10 ... A 12.0
1 A & L PI... 2017-08-22 ... A 9.0
2 AKSARAY ... 2017-07-25 ... A 13.0
3 ANTOJITO... 2017-06-01 ... A 10.0
4 BANGIA 2017-06-16 ... A 9.0
.. ... ... ... ... ...
95 VALL'S P... 2017-03-15 ... A 9.0
96 VIP GRILL 2017-06-12 ... A 10.0
97 WAHIZZA 2017-04-13 ... A 10.0
98 WANG MAN... 2017-08-29 ... A 12.0
99 XIAOYAN ... 2017-08-29 ... Z 49.0
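When an index/columns intersection does hold several values, aggfunc decides which single value survives. A tiny invented example (the column names merely echo the inspections table) shows 'first' in action:

```python
import pandas as pd

# The ('A', 'Grade') intersection holds two values, so an
# aggregating function is required to reduce them to one.
df = pd.DataFrame({'Name': ['A', 'A', 'B'],
                   'Info': ['Grade', 'Grade', 'Grade'],
                   'Value': ['B', 'A', 'A']})

out = df.pivot_table(index='Name', columns='Info',
                     values='Value', aggfunc='first')
```

With aggfunc='first', the first value encountered in each group wins; the default 'mean' would fail on these strings.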
Tabular data is, by nature, two-dimensional, so there is a limit to how much information can be presented in a single cell. As a workaround, you will occasionally see datasets with more than one value stored in the same cell. Tidy data allows for just a single value in each cell. To rectify these situations, you will typically need to parse the string data into multiple columns with the methods from the .str
attribute.
In this recipe, we examine a dataset that has a column containing multiple different variables in each cell. We use the .str
attribute to parse these strings into separate columns to tidy the data.
>>> cities = pd.read_csv('data/texas_cities.csv')
>>> cities
City Geolocation
0 Houston 29.7604° N, 95.3698° W
1 Dallas 32.7767° N, 96.7970° W
2 Austin 30.2672° N, 97.7431° W
The City column looks good and contains exactly one value per cell. The Geolocation column, on the other hand, contains four variables: latitude, latitude direction, longitude, and longitude direction. Let's split the Geolocation column into four separate columns, using a regular expression that matches any character followed by a space:
>>> geolocations = cities.Geolocation.str.split(pat='. ',
... expand=True)
>>> geolocations.columns = ['latitude', 'latitude direction',
... 'longitude', 'longitude direction']
Because Geolocation was stored as an object column, all the new columns are also objects. Let's change latitude and longitude into float types:
>>> geolocations = geolocations.astype({'latitude':'float',
... 'longitude':'float'})
>>> geolocations.dtypes
latitude float64
latitude direction object
longitude float64
longitude direction object
dtype: object
Finally, add back the City column from the original DataFrame:
>>> (geolocations
... .assign(city=cities['City'])
... )
latitude latitude direction ... longitude direction city
0 29.7604 N ... W Houston
1 32.7767 N ... W Dallas
2 30.2672 N ... W Austin
After reading the data, we decide how many variables there are in the dataset. Here, we chose to split the Geolocation
column into four variables, but we could have just chosen two for latitude
and longitude
and used a negative sign to differentiate between west and east and south and north.
There are a few ways to parse the Geolocation
column with the methods from the .str
attribute. The easiest way is to use the .split
method. We pass it a regular expression defined by any character (the period) and a space. When a space follows any character, a split is made, and a new column is formed. The first occurrence of this pattern takes place at the end of the latitude. A space follows the degree character, and a split is formed. The splitting characters are discarded and not kept in the resulting columns. The next split matches the comma and space following directly after the latitude direction.
A total of three splits are made, resulting in four columns. The second line in step 2 provides them with meaningful names. Even though the resulting latitude
and longitude
columns appear to be float
types, they are not. They were originally parsed from an object
column and therefore remain object
data types. Step 3 uses a dictionary to map the column names to their new types.
Instead of using a dictionary, which would require a lot of typing if you had many column names, you can use the function to_numeric
to attempt to convert each column to either integer
or float
. To apply this function iteratively over each column, use the .apply
method with the following:
>>> geolocations.apply(pd.to_numeric, errors='ignore')
latitude latitude direction longitude longitude direction
0 29.7604 N 95.3698 W
1 32.7767 N 96.7970 W
2 30.2672 N 97.7431 W
Step 4 concatenates the city to the DataFrame to complete the process of making tidy data.
The .split
method worked well in this example with a regular expression. For other examples, some columns might require you to create splits on several different patterns. To search for multiple regular expressions, use the pipe character (|
). For instance, if we wanted to split only the degree symbol and comma, each followed by a space, we would do the following:
>>> cities.Geolocation.str.split(pat=r'° |, ', expand=True)
0 1 2 3
0 29.7604 N 95.3698 W
1 32.7767 N 96.7970 W
2 30.2672 N 97.7431 W
This returns the same DataFrame from step 2. Any number of additional split patterns may be appended to the preceding string pattern with the pipe character.
The .extract
method is another method that allows you to extract specific groups within each cell. These capture groups must be enclosed in parentheses. Anything that matches outside the parentheses is not present in the result. The following line produces the same output as step 2:
>>> cities.Geolocation.str.extract(r'([0-9.]+). (N|S), ([0-9.]+). (E|W)',
... expand=True)
0 1 2 3
0 29.7604 N 95.3698 W
1 32.7767 N 96.7970 W
2 30.2672 N 97.7431 W
This regular expression has four capture groups. The first and third groups search for at least one or more consecutive digits with decimals. The second and fourth groups search for a single character (the direction). The first and third capture groups are separated by any character followed by a space. The second capture group is separated by a comma and then a space.
One particularly difficult form of messy data to diagnose appears whenever variables are stored both horizontally across the column names and vertically down column values. This type of dataset usually is not found in a database, but from a summarized report that someone else has already generated.
In this recipe, data is reshaped into tidy data with the .melt
and .pivot_table
methods.
>>> sensors = pd.read_csv('data/sensors.csv')
>>> sensors
Group Property 2012 2013 2014 2015 2016
0 A Pressure 928 873 814 973 870
1 A Temperature 1026 1038 1009 1036 1042
2 A Flow 819 806 861 882 856
3 B Pressure 817 877 914 806 942
4 B Temperature 1008 1041 1009 1002 1013
5 B Flow 887 899 837 824 873
Only one variable column is already correctly placed: Group. The Property column appears to contain three unique variables: Pressure, Temperature, and Flow. The rest of the columns, 2012 to 2016, are themselves values of a single variable, which we can sensibly name Year. It isn't possible to restructure this kind of messy data with a single DataFrame method. Let's begin with the .melt method to pivot the years into their own column:
>>> sensors.melt(id_vars=['Group', 'Property'], var_name='Year')
Group Property Year value
0 A Pressure 2012 928
1 A Temperature 2012 1026
2 A Flow 2012 819
3 B Pressure 2012 817
4 B Temperature 2012 1008
.. ... ... ... ...
25 A Temperature 2016 1042
26 A Flow 2016 856
27 B Pressure 2016 942
28 B Temperature 2016 1013
29 B Flow 2016 873
Next, we use the .pivot_table method to pivot the Property column into new column names:
>>> (sensors
... .melt(id_vars=['Group', 'Property'], var_name='Year')
... .pivot_table(index=['Group', 'Year'],
... columns='Property', values='value')
... .reset_index()
... .rename_axis(None, axis='columns')
... )
Group Year Flow Pressure Temperature
0 A 2012 819 928 1026
1 A 2013 806 873 1038
2 A 2014 861 814 1009
3 A 2015 882 973 1036
4 A 2016 856 870 1042
5 B 2012 887 817 1008
6 B 2013 899 877 1041
7 B 2014 837 914 1009
8 B 2015 824 806 1002
9 B 2016 873 942 1013
Once we have identified the variables in step 1, we can begin our restructuring. pandas does not have a method to perform both pivots simultaneously, so we must take on this task one step at a time. We correct the years while keeping the Property column vertical by passing it to the id_vars parameter of the .melt method.
The result is now the pattern of messy data found in the recipe before last. As explained in the There's more... section of that recipe, we must use .pivot_table to pivot a DataFrame when using more than one column in the index parameter. After pivoting, the Group and Year variables are stuck in the index. We push them back out as columns with .reset_index. The .pivot_table method preserves the column name used in the columns parameter as the name of the column index. After resetting the index, this name is meaningless, and we remove it with .rename_axis.
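One detail to keep in mind: .melt fills the Year column with the old column labels, which are strings, and .pivot_table's default mean aggregation returns floats. The sketch below runs the full pipeline on made-up sensor numbers and adds a cast of Year back to integers; the .assign call is our own addition, not part of the recipe:

```python
import pandas as pd

# Hypothetical stand-in for data/sensors.csv
sensors = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B'],
    'Property': ['Pressure', 'Flow', 'Pressure', 'Flow'],
    '2012': [928, 819, 817, 887],
    '2013': [873, 806, 877, 899]})

tidy = (sensors
    .melt(id_vars=['Group', 'Property'], var_name='Year')
    .pivot_table(index=['Group', 'Year'],
                 columns='Property', values='value')
    .reset_index()
    .rename_axis(None, axis='columns')
    .assign(Year=lambda df_: df_.Year.astype(int)))  # labels -> integers
print(tidy)
```

Because each Group/Year/Property combination is unique, the mean aggregation in .pivot_table preserves every value unchanged.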
Whenever a solution involves .melt, .pivot_table, or .pivot, you can be sure that there is an alternative method using .stack and .unstack. The trick is first to move the columns that are not currently being pivoted into the index:
>>> (sensors
... .set_index(['Group', 'Property'])
... .rename_axis('Year', axis='columns')
... .stack()
... .unstack('Property')
... .rename_axis(None, axis='columns')
... .reset_index()
... )
Group Year Flow Pressure Temperature
0 A 2012 819 928 1026
1 A 2013 806 873 1038
2 A 2014 861 814 1009
3 A 2015 882 973 1036
4 A 2016 856 870 1042
5 B 2012 887 817 1008
6 B 2013 899 877 1041
7 B 2014 837 914 1009
8 B 2015 824 806 1002
9 B 2016 873 942 1013
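To check that the two pipelines agree, the sketch below (using made-up numbers in place of the sensors file) runs both and compares the results, ignoring the dtype difference that .pivot_table's aggregation introduces:

```python
import pandas as pd

# Hypothetical stand-in for data/sensors.csv
sensors = pd.DataFrame({
    'Group': ['A', 'B'],
    'Property': ['Pressure', 'Pressure'],
    '2012': [928, 817],
    '2013': [873, 877]})

via_melt = (sensors
    .melt(id_vars=['Group', 'Property'], var_name='Year')
    .pivot_table(index=['Group', 'Year'],
                 columns='Property', values='value')
    .reset_index()
    .rename_axis(None, axis='columns'))

via_stack = (sensors
    .set_index(['Group', 'Property'])
    .rename_axis('Year', axis='columns')
    .stack()
    .unstack('Property')
    .rename_axis(None, axis='columns')
    .reset_index())

# Raises if the frames differ; dtypes are ignored because
# .pivot_table aggregates to float while .stack keeps integers
pd.testing.assert_frame_equal(via_melt, via_stack, check_dtype=False)
print('pipelines match')
```

The .stack/.unstack route avoids aggregation entirely, which makes it a good choice when you want to be certain no values were combined along the way.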