A wide variety of options are available to combine two or more DataFrames or Series together. The append
method is the least flexible and only allows for new rows to be appended to a DataFrame. The concat
method is very versatile and can combine any number of DataFrames or Series on either axis. The join
method provides fast lookups by aligning a column of one DataFrame to the index of others. The merge
method provides SQL-like capabilities to join two DataFrames together.
When performing data analysis, it is far more common to create new columns than new rows. This is because a new row of data usually represents a new observation, and as an analyst, it is typically not your job to continually capture new data. Data capture is usually left to other platforms like relational database management systems. Nevertheless, it is a necessary feature to know as it will crop up from time to time.
In this recipe, we will begin by appending rows to a small dataset with the .loc
attribute and then transition to using the .append
method.
Read in the names dataset, and output it:
>>> import pandas as pd
>>> import numpy as np
>>> names = pd.read_csv('data/names.csv')
>>> names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
Create a list that contains some new data and use the .loc attribute to set a single row label equal to this new data:
>>> new_data_list = ['Aria', 1]
>>> names.loc[4] = new_data_list
>>> names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
The .loc attribute uses labels to refer to the rows. In this case, the row labels exactly match the integer location. It is possible to append more rows with non-integer labels:
>>> names.loc['five'] = ['Zach', 3]
>>> names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
five Zach 3
>>> names.loc[len(names)] = {'Name':'Zayd', 'Age':2}
>>> names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
five Zach 3
6 Zayd 2
>>> names.loc[len(names)] = pd.Series({'Age':32, 'Name':'Dean'})
>>> names
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
five Zach 3
6 Zayd 2
7 Dean 32
So far, we have used the .loc attribute to make changes to the names DataFrame in-place. There is no separate copy of the DataFrame that is returned. In the next few steps, we will look at the .append method, which does not modify the calling DataFrame. Instead, it returns a new copy of the DataFrame with the appended row(s). Let's begin with the original names DataFrame and attempt to append a row. The first argument to .append must be either another DataFrame, Series, dictionary, or a list of these, but not a list like the one in step 2. Let's see what happens when we attempt to use a dictionary with .append:
>>> names = pd.read_csv('data/names.csv')
>>> names.append({'Name':'Aria', 'Age':1})
Traceback (most recent call last):
...
TypeError: Can only append a Series if ignore_index=True or if the Series has a name
The error message tells us how to fix the problem: pass the ignore_index=True parameter:
>>> names.append({'Name':'Aria', 'Age':1}, ignore_index=True)
Name Age
0 Cornelia 70
1 Abbas 69
2 Penelope 4
3 Niko 2
4 Aria 1
ignore_index
is a sneaky parameter. When set to True
, the old index will be removed completely and replaced with a RangeIndex
from 0 to n-1. For instance, let's specify an index for the names
DataFrame:
>>> names.index = ['Canada', 'Canada', 'USA', 'USA']
>>> names
Name Age
Canada Cornelia 70
Canada Abbas 69
USA Penelope 4
USA Niko 2
To append a row while preserving this index, we can create a Series that has a name attribute and pass it to the .append method:
>>> s = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
>>> s
Name Zach
Age 3
Name: 4, dtype: object
>>> names.append(s)
Name Age
Canada Cornelia 70
Canada Abbas 69
USA Penelope 4
USA Niko 2
4 Zach 3
The .append method is more flexible than the .loc attribute. It supports appending multiple rows at the same time. One way to accomplish this is by passing in a list of Series:
>>> s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
>>> s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')
>>> names.append([s1, s2])
Name Age
Canada Cornelia 70
Canada Abbas 69
USA Penelope 4
USA Niko 2
4 Zach 3
USA Zayd 2
Appending rows one at a time gets tedious when a DataFrame has many columns. Let's read in the 2016 baseball dataset:
>>> bball_16 = pd.read_csv('data/baseball16.csv')
>>> bball_16
playerID yearID stint teamID ... HBP SH SF GIDP
0 altuv... 2016 1 HOU ... 7.0 3.0 7.0 15.0
1 bregm... 2016 1 HOU ... 0.0 0.0 1.0 1.0
2 castr... 2016 1 HOU ... 1.0 1.0 0.0 9.0
3 corre... 2016 1 HOU ... 5.0 0.0 3.0 12.0
4 gatti... 2016 1 HOU ... 4.0 0.0 5.0 12.0
.. ... ... ... ... ... ... ... ... ...
11 reedaj01 2016 1 HOU ... 0.0 0.0 1.0 1.0
12 sprin... 2016 1 HOU ... 11.0 0.0 1.0 12.0
13 tucke... 2016 1 HOU ... 2.0 0.0 0.0 2.0
14 valbu... 2016 1 HOU ... 1.0 3.0 2.0 5.0
15 white... 2016 1 HOU ... 2.0 0.0 2.0 6.0
Select a single row with the .iloc attribute and chain the .to_dict method to it to get an example row as a dictionary:
>>> data_dict = bball_16.iloc[0].to_dict()
>>> data_dict
{'playerID': 'altuvjo01', 'yearID': 2016, 'stint': 1, 'teamID': 'HOU', 'lgID': 'AL', 'G': 161, 'AB': 640, 'R': 108, 'H': 216, '2B': 42, '3B': 5, 'HR': 24, 'RBI': 96.0, 'SB': 30.0, 'CS': 10.0, 'BB': 60, 'SO': 70.0, 'IBB': 11.0, 'HBP': 7.0, 'SH': 3.0, 'SF': 7.0, 'GIDP': 15.0}
>>> new_data_dict = {k: '' if isinstance(v, str) else np.nan
...                  for k, v in data_dict.items()}
>>> new_data_dict
{'playerID': '', 'yearID': nan, 'stint': nan, 'teamID': '', 'lgID': '', 'G': nan, 'AB': nan, 'R': nan, 'H': nan, '2B': nan, '3B': nan, 'HR': nan, 'RBI': nan, 'SB': nan, 'CS': nan, 'BB': nan, 'SO': nan, 'IBB': nan, 'HBP': nan, 'SH': nan, 'SF': nan, 'GIDP': nan}
The .loc
attribute is used to select and assign data based on the row and column labels. The first value passed to it represents the row label. In step 2, names.loc[4]
refers to the row with a label equal to the integer 4. This label does not currently exist in the DataFrame. The assignment statement creates a new row with data provided by the list. As was mentioned in the recipe, this operation modifies the names DataFrame itself. If there were a previously existing row with a label equal to the integer 4, this command would have written over it. Using in-place modification makes this indexing operator riskier to use than the .append
method, which never modifies the original calling DataFrame. Throughout this book we have advocated chaining operations, and you should follow suit.
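To make that risk concrete, here is a minimal sketch (a toy frame of my own, not from the recipe) showing that assigning to an existing label silently overwrites the row instead of appending:
>>> df = pd.DataFrame({'Name': ['Cornelia'], 'Age': [70]})
>>> df.loc[0] = ['Aria', 1]  # label 0 already exists, so this overwrites it
>>> df
Name Age
0 Aria 1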
Any valid label may be used with the .loc
attribute, as seen in step 3. Regardless of what the new label value is, the new row is always appended to the end. Even though assigning with a list works, for clarity, it is best to use a dictionary so that we know exactly which columns are associated with each value, as done in step 4.
Steps 4 and 5 show a trick to dynamically set the new label to be the current number of rows in the DataFrame. Data stored in a Series will also get assigned correctly as long as the index labels match the column names.
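Label alignment also means a partial Series works; a small sketch (again my own toy example) where the missing Age label is filled with NaN, upcasting the integer column to float:
>>> df = pd.DataFrame({'Name': ['Cornelia'], 'Age': [70]})
>>> df.loc[1] = pd.Series({'Name': 'Aria'})  # no 'Age' label supplied
>>> df
Name Age
0 Cornelia 70.0
1 Aria NaN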
The rest of the steps use the .append method, which only appends new rows to DataFrames. Most DataFrame methods allow both row and column manipulation through an axis parameter; .append is one exception, as it works only on rows.
Using a dictionary of column names mapped to values isn't enough information for .append to work, as seen by the error message in step 6. To correctly append a dictionary without a row name, you have to set the ignore_index parameter to True.
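Here is a brief sketch of that side effect (my own example): labels on the index are discarded the moment ignore_index=True is used:
>>> df = pd.DataFrame({'Name': ['Cornelia', 'Abbas']},
...                   index=['Canada', 'Canada'])
>>> df.append({'Name': 'Aria'}, ignore_index=True)  # country labels are gone
Name
0 Cornelia
1 Abbas
2 Aria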
Step 10 shows you how to keep the old index by converting your dictionary to a Series. Make sure to use the name parameter, which is then used as the new index label. Any number of rows may be added with .append in this manner by passing a list of Series as the first argument.
When appending rows in this manner to a much larger DataFrame, you can avoid lots of typing and mistakes by converting a single row to a dictionary with the .to_dict method and then using a dictionary comprehension to clear out all the old values, replacing them with some defaults. This can serve as a template for new rows.
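One plausible way to use the template (the player values here are invented purely for illustration): copy the dictionary, overwrite only the fields you know, and let the rest stay at their defaults:
>>> row = new_data_dict.copy()           # start from the template
>>> row.update({'playerID': 'doejo01', 'yearID': 2017, 'teamID': 'HOU'})
>>> bball_17 = bball_16.append(row, ignore_index=True)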
Appending a single row to a DataFrame is a fairly expensive operation, and if you find yourself writing a loop to append single rows of data to a DataFrame, then you are doing it wrong. Let's first create 1,000 rows of new data as a list of Series:
>>> random_data = []
>>> for i in range(1000):
... d = dict()
... for k, v in data_dict.items():
... if isinstance(v, str):
... d[k] = np.random.choice(list('abcde'))
... else:
... d[k] = np.random.randint(10)
... random_data.append(pd.Series(d, name=i + len(bball_16)))
>>> random_data[0]
2B 3
3B 9
AB 3
BB 9
CS 4
...
Name: 16, Length: 22, dtype: object
Let's time how long it takes to loop through each item making one append at a time:
>>> %%timeit
>>> bball_16_copy = bball_16.copy()
>>> for row in random_data:
... bball_16_copy = bball_16_copy.append(row)
4.88 s ± 190 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That took nearly five seconds for only 1,000 rows. If we instead pass in the entire list of Series, we get an enormous speed increase:
>>> %%timeit
>>> bball_16_copy = bball_16.copy()
>>> bball_16_copy = bball_16_copy.append(random_data)
78.4 ms ± 6.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
By passing in the entire list of Series, the time drops to under one-tenth of a second. Internally, pandas converts the list of Series to a single DataFrame and then appends the data.
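You can do that conversion yourself and skip .append entirely; a sketch of the equivalent route:
>>> new_rows = pd.DataFrame(random_data)  # each Series becomes one row
>>> bball_16_copy = pd.concat([bball_16, new_rows])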
The concat function enables concatenating two or more DataFrames (or Series) together, both vertically and horizontally. As usual when dealing with multiple pandas objects simultaneously, concatenation doesn't happen haphazardly; each object is aligned by its index.
In this recipe, we combine DataFrames both horizontally and vertically with the concat
function and then change the parameter values to yield different results.
>>> stocks_2016 = pd.read_csv('data/stocks_2016.csv',
... index_col='Symbol')
>>> stocks_2017 = pd.read_csv('data/stocks_2017.csv',
... index_col='Symbol')
>>> stocks_2016
Shares Low High
Symbol
AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
>>> stocks_2017
Shares Low High
Symbol
AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300
Place the two DataFrames in a list, and then use the concat function to concatenate them together along the default axis (0):
>>> s_list = [stocks_2016, stocks_2017]
>>> pd.concat(s_list)
Shares Low High
Symbol
AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300
By default, the concat function concatenates DataFrames vertically, one on top of the other. One issue with the preceding DataFrame is that there is no way to identify the year of each row. The concat function allows each piece of the resulting DataFrame to be labeled with the keys parameter. This label appears in the outermost index level of the concatenated frame and forces the creation of a MultiIndex. Also, the names parameter has the ability to rename each index level for clarity:
>>> pd.concat(s_list, keys=['2016', '2017'],
... names=['Year', 'Symbol'])
Shares Low High
Year Symbol
2016 AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
2017 AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300
Concatenating horizontally is also possible by changing the axis parameter to columns or 1:
>>> pd.concat(s_list, keys=['2016', '2017'],
... axis='columns', names=['Year', None])
Year 2016 2017
Shares Low High Shares Low High
AAPL 80.0 95.0 110.0 50.0 120.0 140.0
GE NaN NaN NaN 100.0 30.0 40.0
IBM NaN NaN NaN 87.0 75.0 95.0
SLB NaN NaN NaN 20.0 55.0 85.0
TSLA 50.0 80.0 130.0 100.0 100.0 300.0
TXN NaN NaN NaN 500.0 15.0 23.0
WMT 40.0 55.0 70.0 NaN NaN NaN
The concat function, by default, uses an outer join, keeping all rows from each DataFrame in the list. However, it gives us an option to keep only rows that have the same index values in both DataFrames. This is referred to as an inner join. We set the join parameter to inner to change the behavior:
>>> pd.concat(s_list, join='inner', keys=['2016', '2017'],
... axis='columns', names=['Year', None])
Year 2016 2017
Shares Low High Shares Low High
Symbol
AAPL 80 95 110 50 120 140
TSLA 50 80 130 100 100 300
The concat
function accepts a list as the first parameter. This list must be a sequence of pandas objects, typically a list of DataFrames or Series. By default, all these objects will be stacked vertically, one on top of the other. In this recipe, only two DataFrames are concatenated, but any number of pandas objects work. When we were concatenating vertically, the DataFrames align by their column names.
In this dataset, all the column names were the same, so each column in the 2017 data lined up precisely under the same column name in the 2016 data. However, when they were concatenated horizontally, as in step 4, only two of the index labels matched across both years – AAPL and TSLA. Therefore, only these ticker symbols had no missing values for either year. There are two types of alignment possible with concat: outer (the default) and inner, selected with the join parameter.
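The join parameter behaves the same way for vertical concatenation, where it intersects the column labels instead of the row labels; a short sketch where I deliberately drop the High column from the 2017 data:
>>> pd.concat([stocks_2016, stocks_2017[['Shares', 'Low']]],
...           join='inner')  # only the shared columns survive
Shares Low
Symbol
AAPL 80 95
TSLA 50 80
WMT 40 55
AAPL 50 120
... ... ...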
The .append
method is a heavily watered-down version of concat
that can only append new rows to a DataFrame. Internally, .append
just calls the concat
function. For instance, step 2 from this recipe may be duplicated with the following:
>>> stocks_2016.append(stocks_2017)
Shares Low High
Symbol
AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300
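A quick check confirms the equivalence:
>>> stocks_2016.append(stocks_2017).equals(
...     pd.concat([stocks_2016, stocks_2017]))
True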
The .merge and .join DataFrame methods (they are not available on Series) and the concat function all provide very similar functionality to combine multiple pandas objects together. As they are so similar and can replicate each other in certain situations, it can get very confusing to decide when and how to use them correctly.
To help clarify their differences, take a look at the following outline:

concat:
- A pandas function
- Combines two or more DataFrames (or Series) vertically or horizontally
- Aligns only on the index
- Raises an error whenever duplicates appear in the aligning index
- Defaults to an outer join, with an option for inner

.join:
- A DataFrame method
- Combines two or more pandas objects horizontally
- Aligns the calling DataFrame's columns or index with the other objects' index
- Defaults to a left join, with options for inner, outer, and right

.merge:
- A DataFrame method
- Combines exactly two DataFrames horizontally
- Aligns the calling DataFrame's columns or index with the other DataFrame's columns or index
- Performs a Cartesian product when the joining values are duplicated
- Defaults to an inner join, with options for left, outer, and right
In this recipe, we will combine DataFrames in two situations: the first is simpler with concat, while the second is simpler with .merge.
Read in the 2016, 2017, and 2018 stock datasets with the read_csv function:
>>> years = 2016, 2017, 2018
>>> stock_tables = [pd.read_csv(
... f'data/stocks_{year}.csv', index_col='Symbol')
... for year in years]
>>> stocks_2016, stocks_2017, stocks_2018 = stock_tables
>>> stocks_2016
Shares Low High
Symbol
AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
>>> stocks_2017
Shares Low High
Symbol
AAPL 50 120 140
GE 100 30 40
IBM 87 75 95
SLB 20 55 85
TXN 500 15 23
TSLA 100 100 300
>>> stocks_2018
Shares Low High
Symbol
AAPL 40 135 170
AMZN 8 900 1125
TSLA 50 220 400
The concat function is the only pandas method that is able to combine DataFrames vertically. Let's do this by passing it the list stock_tables:
>>> pd.concat(stock_tables, keys=[2016, 2017, 2018])
Shares Low High
Symbol
2016 AAPL 80 95 110
TSLA 50 80 130
WMT 40 55 70
2017 AAPL 50 120 140
GE 100 30 40
... ... ... ...
TXN 500 15 23
TSLA 100 100 300
2018 AAPL 40 135 170
AMZN 8 900 1125
TSLA 50 220 400
It can also combine DataFrames horizontally, here by passing a dictionary (whose keys label the pieces) and changing the axis parameter to columns:
>>> pd.concat(dict(zip(years, stock_tables)), axis='columns')
2016 ... 2018
Shares Low High ... Shares Low High
AAPL 80.0 95.0 110.0 ... 40.0 135.0 170.0
AMZN NaN NaN NaN ... 8.0 900.0 1125.0
GE NaN NaN NaN ... NaN NaN NaN
IBM NaN NaN NaN ... NaN NaN NaN
SLB NaN NaN NaN ... NaN NaN NaN
TSLA 50.0 80.0 130.0 ... 50.0 220.0 400.0
TXN NaN NaN NaN ... NaN NaN NaN
WMT 40.0 55.0 70.0 ... NaN NaN NaN
We can use the .join and .merge methods to replicate this functionality of concat. Here, we use the .join method to combine the stocks_2016 and stocks_2017 DataFrames. By default, the DataFrames align on their index. If any of the columns have the same names, then you must supply a value to the lsuffix or rsuffix parameters to distinguish them in the result:
>>> stocks_2016.join(stocks_2017, lsuffix='_2016',
... rsuffix='_2017', how='outer')
Shares_2016 Low_2016 ... Low_2017 High_2017
Symbol ...
AAPL 80.0 95.0 ... 120.0 140.0
GE NaN NaN ... 30.0 40.0
IBM NaN NaN ... 75.0 95.0
SLB NaN NaN ... 55.0 85.0
TSLA 50.0 80.0 ... 100.0 300.0
TXN NaN NaN ... 15.0 23.0
WMT 40.0 55.0 ... NaN NaN
To replicate the output of the concat function from step 3, we can pass a list of DataFrames to the .join method:
>>> other = [stocks_2017.add_suffix('_2017'),
... stocks_2018.add_suffix('_2018')]
>>> stocks_2016.add_suffix('_2016').join(other, how='outer')
Shares_2016 Low_2016 ... Low_2018 High_2018
AAPL 80.0 95.0 ... 135.0 170.0
TSLA 50.0 80.0 ... 220.0 400.0
WMT 40.0 55.0 ... NaN NaN
GE NaN NaN ... NaN NaN
IBM NaN NaN ... NaN NaN
SLB NaN NaN ... NaN NaN
TXN NaN NaN ... NaN NaN
AMZN NaN NaN ... 900.0 1125.0
To verify that the two routes produce identical results, save each one and compare them:
>>> stock_join = stocks_2016.add_suffix('_2016').join(other,
... how='outer')
>>> stock_concat = (
... pd.concat(
... dict(zip(years, stock_tables)), axis="columns")
... .swaplevel(axis=1)
... .pipe(lambda df_:
... df_.set_axis(df_.columns.to_flat_index(), axis=1))
... .rename(lambda label:
... "_".join([str(x) for x in label]), axis=1)
... )
>>> stock_join.equals(stock_concat)
True
Finally, let's look at the .merge method which, unlike concat and .join, can only combine two DataFrames together. By default, .merge attempts to align the values in the columns that have the same name for each of the DataFrames. However, you can choose to have it align on the index by setting the Boolean parameters left_index and right_index to True. Let's merge the 2016 and 2017 stock data together:
>>> stocks_2016.merge(stocks_2017, left_index=True,
... right_index=True)
Shares_x Low_x High_x Shares_y Low_y High_y
Symbol
AAPL 80 95 110 50 120 140
TSLA 50 80 130 100 100 300
By default, .merge uses an inner join and automatically supplies suffixes for identically named columns. Let's change to an outer join and then perform another outer join of the 2018 data to replicate the behavior of concat. Note that in pandas 1.0, the merge result's index will be sorted while the concat version's won't be:
>>> stock_merge = (stocks_2016
... .merge(stocks_2017, left_index=True,
... right_index=True, how='outer',
... suffixes=('_2016', '_2017'))
... .merge(stocks_2018.add_suffix('_2018'),
... left_index=True, right_index=True,
... how='outer')
... )
>>> stock_concat.sort_index().equals(stock_merge)
True
When we want to align on the values of columns rather than on the index, the .merge method is built for this situation. Let's take a look at two new small datasets, food_prices and food_transactions:
>>> names = ['prices', 'transactions']
>>> food_tables = [pd.read_csv('data/food_{}.csv'.format(name))
... for name in names]
>>> food_prices, food_transactions = food_tables
>>> food_prices
item store price Date
0 pear A 0.99 2017
1 pear B 1.99 2017
2 peach A 2.99 2017
3 peach B 3.49 2017
4 banana A 0.39 2017
5 banana B 0.49 2017
6 steak A 5.99 2017
7 steak B 6.99 2017
8 steak B 4.99 2015
>>> food_transactions
custid item store quantity
0 1 pear A 5
1 1 banana A 10
2 2 steak B 3
3 2 pear B 1
4 2 peach B 2
5 2 steak B 1
6 2 coconut B 4
To find the total price of each transaction, we need to join these tables on the item and store columns:
>>> food_transactions.merge(food_prices, on=['item', 'store'])
custid item store quantity price Date
0 1 pear A 5 0.99 2017
1 1 banana A 10 0.39 2017
2 2 steak B 3 6.99 2017
3 2 steak B 3 4.99 2015
4 2 steak B 1 6.99 2017
5 2 steak B 1 4.99 2015
6 2 pear B 1 1.99 2017
7 2 peach B 2 3.49 2017
The prices are now correctly aligned with their item and store, but there is a problem with the steak items. As the steak item appears twice in each table for store B, a Cartesian product takes place between them, resulting in four rows. Also, notice that the item coconut is missing because there was no corresponding price for it. Let's fix both of these issues:
>>> food_transactions.merge(food_prices.query('Date == 2017'),
... how='left')
custid item store quantity price Date
0 1 pear A 5 0.99 2017.0
1 1 banana A 10 0.39 2017.0
2 2 steak B 3 6.99 2017.0
3 2 pear B 1 1.99 2017.0
4 2 peach B 2 3.49 2017.0
5 2 steak B 1 6.99 2017.0
6 2 coconut B 4 NaN NaN
We can replicate this with the .join method, but we must first put the joining columns of the food_prices DataFrame into the index:
>>> food_prices_join = (food_prices
...     .query('Date == 2017')
...     .set_index(['item', 'store']))
>>> food_prices_join
price Date
item store
pear A 0.99 2017
B 1.99 2017
peach A 2.99 2017
B 3.49 2017
banana A 0.39 2017
B 0.49 2017
steak A 5.99 2017
B 6.99 2017
The .join method only aligns with the index of the passed DataFrame, but it can use either the index or the columns of the calling DataFrame. To use columns for alignment on the calling DataFrame, you need to pass them to the on parameter:
>>> food_transactions.join(food_prices_join, on=['item', 'store'])
custid item store quantity price Date
0 1 pear A 5 0.99 2017.0
1 1 banana A 10 0.39 2017.0
2 2 steak B 3 6.99 2017.0
3 2 pear B 1 1.99 2017.0
4 2 peach B 2 3.49 2017.0
5 2 steak B 1 6.99 2017.0
6 2 coconut B 4 NaN NaN
The output matches the result from step 11. To replicate this with the concat
function, you would need to put the item
and store
columns into the index of both DataFrames. However, in this particular case, an error would be produced as a duplicate index value occurs in at least one of the DataFrames (with item steak
and store B
):
>>> pd.concat([food_transactions.set_index(['item', 'store']),
... food_prices.set_index(['item', 'store'])],
... axis='columns')
Traceback (most recent call last):
...
ValueError: cannot handle a non-unique multi-index!
It can be tedious to repeatedly write the read_csv
function when importing many DataFrames at the same time. One way to automate this process is to put all the filenames in a list and iterate through them with a for
loop. This was done in step 1 with a list comprehension.
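If you would rather not build each filename by hand, a sketch of the same idea with the standard library's pathlib (assuming the stocks_<year>.csv naming used in this recipe):
>>> from pathlib import Path
>>> stock_tables = [pd.read_csv(path, index_col='Symbol')
...                 for path in sorted(Path('data').glob('stocks_*.csv'))]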
At the end of step 1, we unpack the list of DataFrames into their own appropriately named variables so that each table may be easily and clearly referenced. The nice thing about having a list of DataFrames is that it is the exact requirement for the concat function, as seen in step 2. Notice how step 2 uses the keys parameter to name each chunk of data. This can also be accomplished by passing a dictionary to concat, as done in step 3.
In step 4, we must change the join type to outer to include all of the rows in the passed DataFrame that do not have an index present in the calling DataFrame. In step 5, the passed list of DataFrames cannot have any columns in common. Although there is an rsuffix parameter, it only works when passing a single DataFrame and not a list of them. To work around this limitation, we change the names of the columns beforehand with the .add_suffix method, and then call the .join method.
In step 7, we use .merge
, which defaults to aligning on all column names that are the same in both DataFrames. To change this default behavior, and align on the index of either one or both, set the left_index
or right_index
parameters to True
. Step 8 finishes the replication with two calls to .merge
. As you can see, when you are aligning multiple DataFrames on their index, concat
is usually going to be a far better choice than .merge
.
In step 9, we switch gears to focus on a situation where the .merge
method has the advantage. The .merge
method is the only one capable of aligning both the calling and passed DataFrame by column values. Step 10 shows you how easy it is to merge two DataFrames. The on
parameter is not necessary but provided for clarity.
Unfortunately, it is very easy to duplicate or drop data when combining DataFrames, as shown in step 10. It is vital to take some time to do sanity checks after combining data. In this instance, the food_prices dataset had a duplicate price for steak in store B, so we eliminated this row by querying for only the current year in step 11. We also changed to a left join to ensure that each transaction is kept regardless of whether a price is present.
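A sketch of two such sanity checks against these tables: test the join keys for duplicates before merging (this is how the steak problem would surface), and compare row counts afterwards, since a left join should preserve the left table's count:
>>> food_prices.duplicated(['item', 'store']).any()  # steak/B appears twice
True
>>> len(food_transactions)  # the left join in step 11 also returns 7 rows
7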
It is possible to use .join
in these instances, but all the columns in the passed DataFrame must be moved into the index first. Finally, concat
is going to be a poor choice whenever you intend to align data by values in their columns.
In summary, I find myself using .merge
unless I know that the indexes align.
It is possible to read all files from a particular directory into DataFrames without knowing their names. Python provides a few ways to iterate through directories, with the glob
module being a popular choice. The gas prices
directory contains five different CSV files, each having weekly prices of a particular grade of gas beginning from 2007. Each file has just two columns – the date for the week and the price. This is a perfect situation to iterate through all the files, read them into DataFrames, and combine them all together with the concat
function.
The glob
module has the glob
function, which takes a single parameter – the location of the directory you would like to iterate through as a string. To get all the files in the directory, use the string *. In this example, '*.csv' returns only files that end in .csv. The result from the glob
function is a list of string filenames, which can be passed to the read_csv
function:
>>> import glob
>>> df_list = []
>>> for filename in glob.glob('data/gas prices/*.csv'):
... df_list.append(pd.read_csv(filename, index_col='Week',
... parse_dates=['Week']))
>>> gas = pd.concat(df_list, axis='columns')
>>> gas
Midgrade Premium Diesel All Grades Regular
Week
2017-09-25 2.859 3.105 2.788 2.701 2.583
2017-09-18 2.906 3.151 2.791 2.750 2.634
2017-09-11 2.953 3.197 2.802 2.800 2.685
2017-09-04 2.946 3.191 2.758 2.794 2.679
2017-08-28 2.668 2.901 2.605 2.513 2.399
... ... ... ... ... ...
2007-01-29 2.277 2.381 2.413 2.213 2.165
2007-01-22 2.285 2.391 2.430 2.216 2.165
2007-01-15 2.347 2.453 2.463 2.280 2.229
2007-01-08 2.418 2.523 2.537 2.354 2.306
2007-01-01 2.442 2.547 2.580 2.382 2.334
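Because each file lists the most recent week first, the combined index runs backwards in time; a chronological view is one sort away:
>>> gas.sort_index().head(2)  # earliest weeks first
Midgrade Premium Diesel All Grades Regular
Week
2007-01-01 2.442 2.547 2.580 2.382 2.334
2007-01-08 2.418 2.523 2.537 2.354 2.306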
Learning SQL is a useful skill. Much of the world's data is stored in databases that accept SQL statements. There are many dozens of relational database management systems, with SQLite being one of the most popular and easy to use.
We will be exploring the chinook sample database provided by SQLite that contains 11 tables of data for a music store. One of the best things to do when first diving into a proper relational database is to study a database diagram (sometimes called an entity relationship diagram) to understand how tables are related. The following diagram will be immensely helpful when navigating through this recipe:
(Figure: SQL relationships)
In order for this recipe to work, you will need to have the sqlalchemy
Python package installed. If you installed the Anaconda distribution, then it should already be available to you. SQLAlchemy is the preferred pandas tool for making connections to databases. In this recipe, you will learn how to connect to a SQLite database. You will then pose two different questions and answer them by joining tables together with the .merge method.
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///data/chinook.db')
Read in the tracks table with the read_sql_table function. The name of the table is the first argument and the SQLAlchemy engine is the second:
>>> tracks = pd.read_sql_table('tracks', engine)
>>> tracks
TrackId ... UnitPrice
0 1 ... 0.99
1 2 ... 0.99
2 3 ... 0.99
3 4 ... 0.99
4 5 ... 0.99
... ... ... ...
3498 3499 ... 0.99
3499 3500 ... 0.99
3500 3501 ... 0.99
3501 3502 ... 0.99
3502 3503 ... 0.99
>>> (pd.read_sql_table('genres', engine)
... .merge(tracks[['GenreId', 'Milliseconds']],
... on='GenreId', how='left')
... .drop('GenreId', axis='columns')
... )
Name Milliseconds
0 Rock 343719
1 Rock 342562
2 Rock 230619
3 Rock 252051
4 Rock 375418
... ... ...
3498 Classical 286741
3499 Classical 139200
3500 Classical 66639
3501 Classical 221331
3502 Opera 174813
Now we can find the average length of each song per genre. To make the output easier to read, we convert the Milliseconds column to the timedelta data type:
>>> (pd.read_sql_table('genres', engine)
... .merge(tracks[['GenreId', 'Milliseconds']],
... on='GenreId', how='left')
... .drop('GenreId', axis='columns')
... .groupby('Name')
... ['Milliseconds']
... .mean()
... .pipe(lambda s_: pd.to_timedelta(s_, unit='ms')
... .rename('Length'))
... .dt.floor('s')
... .sort_values()
... )
Name
Rock And Roll 00:02:14
Opera 00:02:54
Hip Hop/Rap 00:02:58
Easy Listening 00:03:09
Bossa Nova 00:03:39
...
Comedy 00:26:25
TV Shows 00:35:45
Drama 00:42:55
Science Fiction 00:43:45
Sci Fi & Fantasy 00:48:31
Name: Length, Length: 25, dtype: timedelta64[ns]
To find the total amount spent per customer, we need the customers, invoices, and invoice_items tables all connected to each other:
>>> cust = pd.read_sql_table('customers', engine,
... columns=['CustomerId','FirstName',
... 'LastName'])
>>> invoice = pd.read_sql_table('invoices', engine,
... columns=['InvoiceId','CustomerId'])
>>> invoice_items = pd.read_sql_table('invoice_items', engine,
... columns=['InvoiceId', 'UnitPrice', 'Quantity'])
>>> (cust
... .merge(invoice, on='CustomerId')
... .merge(invoice_items, on='InvoiceId')
... )
CustomerId FirstName ... UnitPrice Quantity
0 1 Luís ... 1.99 1
1 1 Luís ... 1.99 1
2 1 Luís ... 0.99 1
3 1 Luís ... 0.99 1
4 1 Luís ... 0.99 1
... ... ... ... ... ...
2235 59 Puja ... 0.99 1
2236 59 Puja ... 0.99 1
2237 59 Puja ... 0.99 1
2238 59 Puja ... 0.99 1
2239 59 Puja ... 0.99 1
>>> (cust
... .merge(invoice, on='CustomerId')
... .merge(invoice_items, on='InvoiceId')
... .assign(Total=lambda df_:df_.Quantity * df_.UnitPrice)
... .groupby(['CustomerId', 'FirstName', 'LastName'])
... ['Total']
... .sum()
... .sort_values(ascending=False)
... )
CustomerId FirstName LastName
6 Helena Holý 49.62
26 Richard Cunningham 47.62
57 Luis Rojas 46.62
46 Hugh O'Reilly 45.62
45 Ladislav Kovács 45.62
...
32 Aaron Mitchell 37.62
31 Martha Silk 37.62
29 Robert Brown 37.62
27 Patrick Gray 37.62
59 Puja Srivastava 36.64
Name: Total, Length: 59, dtype: float64
The create_engine
function requires a connection string to work properly. The connection string for SQLite is the location of the database, which is located in the data
directory. Other relational database management systems have more complex connection strings. You will need to provide a username, password, hostname, port, and optionally, a database. You will also need to supply the SQL dialect and the driver. The general form for the connection string is as follows: dialect+driver://username:password@host:port/database
. The driver for your particular relational database might need to be installed separately.
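For instance, a PostgreSQL connection might look like the following sketch; the credentials and database name are placeholders, and the psycopg2 driver would need to be installed:
>>> engine = create_engine(
...     'postgresql+psycopg2://username:password@localhost:5432/sales_db')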
Once we have created the engine, selecting entire tables into DataFrames is very easy with the read_sql_table
function in step 2. Each of the tables in the database has a primary key identifying each row. It is identified graphically with a key symbol in the diagram. In step 3, we link genres
to tracks
through GenreId
. As we only care about the track length, we trim the tracks DataFrame down to just the columns we need before performing the merge. Once the tables have merged, we can answer the query with a basic .groupby
operation.
We go one step further and convert the integer milliseconds into a Timedelta
object that is far easier to read. The key is passing in the correct unit of measurement as a string. Now that we have a Timedelta Series, we can use the .dt
attribute to access the .floor
method, which rounds the time down to the nearest second.
The query required to answer step 5 involves three tables. We can trim the tables down significantly to only the columns we need by passing them to the columns
parameter. When using .merge
, the joining columns are not kept when they have the same name. In step 6, we could have assigned a column for the
price times quantity with the following:
cust_inv['Total'] = cust_inv['Quantity'] * cust_inv['UnitPrice']
As has been emphasized through this book, we prefer chaining operations when possible, and hence you see .assign
used frequently.
If you are adept with SQL, you can write a SQL query as a string and pass it to the read_sql_query
function. For example, the following will reproduce the output from step 4:
>>> sql_string1 = '''
... SELECT
... Name,
... time(avg(Milliseconds) / 1000, 'unixepoch') as avg_time
... FROM (
... SELECT
... g.Name,
... t.Milliseconds
... FROM
... genres as g
... JOIN
... tracks as t on
... g.genreid == t.genreid
... )
... GROUP BY Name
... ORDER BY avg_time'''
>>> pd.read_sql_query(sql_string1, engine)
Name avg_time
0 Rock And Roll 00:02:14
1 Opera 00:02:54
2 Hip Hop/Rap 00:02:58
3 Easy Listening 00:03:09
4 Bossa Nova 00:03:39
.. ... ...
20 Comedy 00:26:25
21 TV Shows 00:35:45
22 Drama 00:42:55
23 Science Fiction 00:43:45
24 Sci Fi & Fantasy 00:48:31
To reproduce the answer from step 6, use the following SQL query:
>>> sql_string2 = '''
... SELECT
... c.customerid,
... c.FirstName,
... c.LastName,
... sum(ii.quantity * ii.unitprice) as Total
... FROM
... customers as c
... JOIN
... invoices as i
... on c.customerid = i.customerid
... JOIN
... invoice_items as ii
... on i.invoiceid = ii.invoiceid
... GROUP BY
... c.customerid, c.FirstName, c.LastName
... ORDER BY
... Total desc'''
>>> pd.read_sql_query(sql_string2, engine)
CustomerId FirstName LastName Total
0 6 Helena Holý 49.62
1 26 Richard Cunningham 47.62
2 57 Luis Rojas 46.62
3 45 Ladislav Kovács 45.62
4 46 Hugh O'Reilly 45.62
.. ... ... ... ...
54 53 Phil Hughes 37.62
55 54 Steve Murray 37.62
56 55 Mark Taylor 37.62
57 56 Diego Gutiérrez 37.62
58 59 Puja Srivastava 36.64