There are various functions that can be used to merge and join pandas' data structures, which include the following functions:
concat
append
The concat
function is used to join multiple pandas' data structures along a specified axis and possibly perform union or intersection operations along other axes. The following command explains the concat
function:
concat(objs, axis=0, , join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False)
The synopsis of the elements of concat
function are as follows:
objs
function: A list or dictionary of Series, DataFrame, or Panel objects to be concatenated.axis
function: The axis along which the concatenation should be performed. 0
is the default value.join
function: The type of join to perform when handling indexes on other axes. The 'outer'
function is the default.join_axes
function: This is used to specify exact indexes for the remaining indexes instead of doing outer/inner join.keys
function: This specifies a list of keys to be used to construct a MultiIndex.For an explanation of the remaining options, please refer to the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html.
Here is an illustration of the workings of concat
using our stock price examples from earlier chapters:
In [53]: stockDataDF=pd.read_csv('./tech_stockprices.csv').set_index(['Symbol']);stockDataDF Out[53]: Closing price EPS Shares Outstanding(M) P/E Market Cap(B) Beta Symbol AAPL 501.53 40.32 892.45 12.44 447.59 0.84 AMZN 346.15 0.59 459.00 589.80 158.88 0.52 FB 61.48 0.59 2450.00 104.93 150.92 NaN GOOG 1133.43 36.05 335.83 31.44 380.64 0.87 TWTR 65.25 -0.30 555.20 NaN 36.23 NaN YHOO 34.90 1.27 1010.00 27.48 35.36 0.66
We now take various slices of the data:
In [83]: A=stockDataDF.ix[:4, ['Closing price', 'EPS']]; A Out[83]: Closing price EPS Symbol AAPL 501.53 40.32 AMZN 346.15 0.59 FB 61.48 0.59 GOOG 1133.43 36.05 In [84]: B=stockDataDF.ix[2:-2, ['P/E']];B Out[84]: P/E Symbol FB 104.93 GOOG 31.44 In [85]: C=stockDataDF.ix[1:5, ['Market Cap(B)']];C Out[85]: Market Cap(B) Symbol AMZN 158.88 FB 150.92 GOOG 380.64 TWTR 36.23
Here, we perform a concatenation by specifying an outer join, which concatenates and performs a union on all the three data frames, and includes entries that do not have values for all the columns by inserting NaN
for such columns:
In [86]: pd.concat([A,B,C],axis=1) # outer join Out[86]: Closing price EPS P/E Market Cap(B) AAPL 501.53 40.32 NaN NaN AMZN 346.15 0.59 NaN 158.88 FB 61.48 0.59 104.93 150.92 GOOG 1133.43 36.05 31.44 380.64 TWTR NaN NaN NaN 36.23
We can also specify an inner join that does the concatenation, but only includes rows that contain values for all the columns in the final data frame by throwing out rows with missing columns, that is, it takes the intersection:
In [87]: pd.concat([A,B,C],axis=1, join='inner') # Inner join Out[87]: Closing price EPS P/E Market Cap(B) Symbol FB 61.48 0.59 104.93 150.92 GOOG 1133.43 36.05 31.44 380.64
The third case enables us to use the specific index from the original DataFrame to join on:
In [102]: pd.concat([A,B,C], axis=1, join_axes=[stockDataDF.index]) Out[102]: Closing price EPS P/E Market Cap(B) Symbol AAPL 501.53 40.32 NaN NaN AMZN 346.15 0.59 NaN 158.88 FB 61.48 0.59 104.93 150.92 GOOG 1133.43 36.05 31.44 380.64 TWTR NaN NaN NaN 36.23 YHOO NaN NaN NaN NaN
In this last case, we see that the row for YHOO
was included even though it wasn't contained in any of the slices that were concatenated. In this case, however, the values for all the columns are NaN
. Here is another illustration of concat
, but this time, it is on random statistical distributions. Note that in the absence of an axis argument, the default axis of concatenation is 0
:
In[135]: np.random.seed(100) normDF=pd.DataFrame(np.random.randn(3,4));normDF Out[135]: 0 1 2 3 0 -1.749765 0.342680 1.153036 -0.252436 1 0.981321 0.514219 0.221180 -1.070043 2 -0.189496 0.255001 -0.458027 0.435163 In [136]: binomDF=pd.DataFrame(np.random.binomial(100,0.5,(3,4)));binomDF Out[136]: 0 1 2 3 0 57 50 57 50 1 48 56 49 43 2 40 47 49 55 In [137]: poissonDF=pd.DataFrame(np.random.poisson(100,(3,4)));poissonDF Out[137]: 0 1 2 3 0 93 96 96 89 1 76 96 104 103 2 96 93 107 84 In [138]: rand_distribs=[normDF,binomDF,poissonDF] In [140]: rand_distribsDF=pd.concat(rand_distribs,keys=['Normal', 'Binomial', 'Poisson']);rand_distribsDF Out[140]: 0 1 2 3 Normal 0 -1.749765 0.342680 1.153036 -0.252436 1 0.981321 0.514219 0.221180 -1.070043 2 -0.189496 0.255001 -0.458027 0.435163 Binomial 0 57.00 50.00 57.00 50.00 1 48.00 56.00 49.00 43.00 2 40.00 47.00 49.00 55.00 Poisson 0 93.00 96.00 96.00 89.00 1 76.00 96.00 104.00 103.00 2 96.00 93.00 107.00 84.00
The append
function is a simpler version of concat
that concatenates along axis=0
. Here is an illustration of its use where we slice out the first two rows and first three columns of the stockData
DataFrame:
In [145]: stockDataA=stockDataDF.ix[:2,:3] stockDataA Out[145]: Closing price EPS Shares Outstanding(M) Symbol AAPL 501.53 40.32 892.45 AMZN 346.15 0.59 459.00
And the remaining rows:
In [147]: stockDataB=stockDataDF[2:] stockDataB Out[147]: Closing price EPS Shares Outstanding(M) P/E Market Cap(B) Beta Symbol FB 61.48 0.59 2450.00 104.93 150.92 NaN GOOG 1133.43 36.05 335.83 31.44 380.64 0.87 TWTR 65.25 -0.30 555.20 NaN 36.23 NaN YHOO 34.90 1.27 1010.00 27.48 35.36 0.66
Now, we use append
to combine the two data frames from the preceding commands:
In [161]:stockDataA.append(stockDataB) Out[161]: Beta Closing price EPS MarketCap(B) P/E Shares Outstanding(M) Symbol AMZN NaN 346.15 0.59 NaN NaN 459.00 GOOG NaN 1133.43 36.05 NaN NaN 335.83 FB NaN 61.48 0.59 150.92 104.93 2450.00 YHOO 27.48 34.90 1.27 35.36 0.66 1010.00 TWTR NaN 65.25 -0.30 36.23 NaN 555.20 AAPL 12.44 501.53 40.32 0.84 447.59 892.45
In order to maintain the order of columns similar to the original DataFrame, we can apply the reindex_axis
function:
In [151]: stockDataA.append(stockDataB).reindex_axis(stockDataDF.columns, axis=1) Out[151]: Closing price EPS Shares Outstanding(M) P/E Market Cap(B) Beta Symbol AAPL 501.53 40.32 892.45 NaN NaN NaN AMZN 346.15 0.59 459.00 NaN NaN NaN FB 61.48 0.59 2450.00 104.93 150.92 NaN GOOG 1133.43 36.05 335.83 31.44 380.64 0.87 TWTR 65.25 -0.30 555.20 NaN 36.23 NaN YHOO 34.90 1.27 1010.00 27.48 35.36 0.66
Note that for the first two rows, the value of the last two columns is NaN
, since the first DataFrame only contained the first three columns. The append
function does not work in places, but it returns a new DataFrame with the second DataFrame appended to the first.
We can append a single row to a DataFrame by passing a series or dictionary to the append
method:
In [152]: algos={'search':['DFS','BFS','Binary Search','Linear'], 'sorting': ['Quicksort','Mergesort','Heapsort','Bubble Sort'], 'machine learning':['RandomForest','K Nearest Neighbor','Logistic Regression','K-Means Clustering']} algoDF=pd.DataFrame(algos);algoDF Out[152]: machine learning search sorting 0 RandomForest DFS Quicksort 1 K Nearest Neighbor BFS Mergesort 2 Logistic Regression Binary Search Heapsort 3 K-Means Clustering Linear Bubble Sort In [154]: moreAlgos={'search': 'ShortestPath' , 'sorting': 'Insertion Sort', 'machine learning': 'Linear Regression'} algoDF.append(moreAlgos,ignore_index=True) Out[154]: machine learning search sorting 0 RandomForest DFS Quicksort 1 K Nearest Neighbor BFS Mergesort 2 Logistic Regression Binary Search Heapsort 3 K-Means Clustering Linear Bubble Sort 4 Linear Regression ShortestPath Insertion Sort
In order for this to work, you must pass the ignore_index=True
argument so that the index [0,1,2,3]
in algoDF
is ignored.
The merge
function is used to obtain joins of two DataFrame objects similar to those used in SQL database queries. The DataFrame objects are analogous to SQL tables. The following command explains this:
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True)
Following is the synopsis of merge
function:
left
argument: This is the first DataFrame objectright
argument: This is the second DataFrame objecthow
argument: This is the type of join and can be inner, outer, left, or right. The default is inner.on
argument: This shows the names of columns to join on as join keys.left_on
and right_on
arguments : This shows the left and right DataFrame
column names to join on.left_index
and right_index
arguments: This has a Boolean value. If this is True
, use the left or right DataFrame
index/row labels to join on.sort
argument: This has a Boolean value. The default True
setting results in a lexicographical sorting. Setting the default value to False
may improve performance.suffixes
argument: The tuple of string suffixes to be applied to overlapping columns. The defaults are '_x'
and '_y'
.copy
argument: The default True
value causes data to be copied from the passed DataFrame
objects.The source of the preceding information can be found at http://pandas.pydata.org/pandas-docs/stable/merging.html.
Let us start to examine the use of merge by reading the U.S. stock index data into a DataFrame
:
In [254]: USIndexDataDF=pd.read_csv('./us_index_data.csv') USIndexDataDF Out[254]: TradingDate Nasdaq S&P 500 Russell 2000 DJIA 0 2014/01/30 4123.13 1794.19 1139.36 15848.61 1 2014/01/31 4103.88 1782.59 1130.88 15698.85 2 2014/02/03 3996.96 1741.89 1094.58 15372.80 3 2014/02/04 4031.52 1755.20 1102.84 15445.24 4 2014/02/05 4011.55 1751.64 1093.59 15440.23 5 2014/02/06 4057.12 1773.43 1103.93 15628.53
The source of this information can be found at http://finance.yahoo.com.
We can obtain slice1
of the data for rows 0 and 1 and the Nasdaq
and S&P 500
columns by using the following command:
In [255]: slice1=USIndexDataDF.ix[:1,:3] slice1 Out[255]: TradingDate Nasdaq S&P 500 0 2014/01/30 4123.13 1794.19 1 2014/01/31 4103.88 1782.59
We can obtain slice2
of the data for rows 0 and 1 and the Russell 2000
and DJIA
columns by using the following command:
In [256]: slice2=USIndexDataDF.ix[:1,[0,3,4]] slice2 Out[256]: TradingDate Russell 2000 DJIA 0 2014/01/30 1139.36 15848.61 1 2014/01/31 1130.88 15698.85
We can obtain slice3
of the data for rows 1 and 2 and the Nasdaq
and S&P 500
columns by using the following command:
In [248]: slice3=USIndexDataDF.ix[[1,2],:3] slice3 Out[248]: TradingDate Nasdaq S&P 500 1 2014/01/31 4103.88 1782.59 2 2014/02/03 3996.96 1741.89
We can now merge slice1
and slice2
as follows:
In [257]: pd.merge(slice1,slice2) Out[257]: TradingDate Nasdaq S&P 500 Russell 2000 DJIA 0 2014/01/30 4123.13 1794.19 1139.36 15848.61 1 2014/01/31 4103.88 1782.59 1130.88 15698.85
As you can see, this results in a combination of the columns in slice1
and slice2
. Since the on
argument was not specified, the intersection of the columns in slice1
and slice2
was used which is TradingDate
as the join column, and the rest of the columns from slice1
and slice2
were used to produce the output.
Note that in this case, passing a value for how
has no effect on the result since the values of the TradingDate
join key match for slice1
and slice2
.
We now merge slice3
and slice2
specifying inner
as the value of the how
argument:
In [258]: pd.merge(slice3,slice2,how='inner') Out[258]: TradingDate Nasdaq S&P 500 Russell 2000 DJIA 0 2014/01/31 4103.88 1782.59 1130.88 15698.85
The slice3
argument has values 2014/01/31 and 2014/02/03 unique values for TradingDate
, and slice2
has values 2014/01/30 and 2014/01/31 unique values for TradingDate
.
The merge
function uses the intersection of these values, which is 2014/01/31. This results in the single row result. Here, we specify outer
as the value of the how
argument:
In [269]: pd.merge(slice3,slice2,how='outer') Out[269]: TradingDate Nasdaq S&P 500 Russell 2000 DJIA 0 2014/01/31 4103.88 1782.59 1130.88 15698.85 1 2014/02/03 3996.96 1741.89 NaN NaN 2 2014/01/30 NaN NaN 1139.36 15848.61
Specifying outer
uses all the keys (union) from both DataFrames, which gives the three rows specified in the preceding output. Since not all the columns are present in the two DataFrames, the columns from the other DataFrame are NaN
for each row in a DataFrame that is not part of the intersection.
Now, we specify how='left'
as shown in the following command:
In [271]: pd.merge(slice3,slice2,how='left') Out[271]: TradingDate Nasdaq S&P 500 Russell 2000 DJIA 0 2014/01/31 4103.88 1782.59 1130.88 15698.85 1 2014/02/03 3996.96 1741.89 NaN NaN
Here, we see that the keys from the left DataFrame slice3
are used for the output. For columns that are not available in slice3
, that is Russell 2000
and DJIA, NaN
are used for the row with TradingDate
as 2014/02/03. This is equivalent to a SQL left outer join.
We specify how='right'
in the following command:
In [270]: pd.merge(slice3,slice2,how='right') Out[270]: TradingDate Nasdaq S&P 500 Russell 2000 DJIA 0 2014/01/31 4103.88 1782.59 1130.88 15698.85 1 2014/01/30 NaN NaN 1139.36 15848.61
This is the corollary to the how='left'
keys from the right DataFrame slice2
that are used. Therefore, rows with TradingDate
as 2014/01/31 and 2014/01/30 are in the result. For columns that are not in slice2
—Nasdaq
and S&P 500
—NaN
are used.
This is equivalent to a SQL right outer join. For a simple explanation of how SQL joins work, please refer to http://bit.ly/1yqR9vw.
The DataFrame.join
function is used to combine two DataFrames that have different columns with nothing in common. Essentially, this does a longitudinal join of two DataFrames. Here is an example:
In [274]: slice_NASD_SP=USIndexDataDF.ix[:3,:3] slice_NASD_SP Out[274]: TradingDate Nasdaq S&P 500 0 2014/01/30 4123.13 1794.19 1 2014/01/31 4103.88 1782.59 2 2014/02/03 3996.96 1741.89 3 2014/02/04 4031.52 1755.20 In [275]: slice_Russ_DJIA=USIndexDataDF.ix[:3,3:] slice_Russ_DJIA Out[275]: Russell 2000 DJIA 0 1139.36 15848.61 1 1130.88 15698.85 2 1094.58 15372.80 3 1102.84 15445.24
Here, we call the join
operator, as follows:
In [276]: slice_NASD_SP.join(slice_Russ_DJIA) Out[276]: TradingDate Nasdaq S&P 500 Russell 2000 DJIA 0 2014/01/30 4123.13 1794.19 1139.36 15848.61 1 2014/01/31 4103.88 1782.59 1130.88 15698.85 2 2014/02/03 3996.96 1741.89 1094.58 15372.80 3 2014/02/04 4031.52 1755.20 1102.84 15445.24
In this case, we see that the result is a combination of the columns from the two Dataframes. Let us see what happens when we try to use join with two DataFrames that have a column in common:
In [272]: slice1.join(slice2) ------------------------------------------------------------ Exception Traceback (most recent call last) ... Exception: columns overlap: Index([u'TradingDate'], dtype=object)
This results in an exception due to overlapping columns. You can find more information on using merge
, concat
, and join
operations in the official documentation page at http://pandas.pydata.org/pandas-docs/stable/merging.html.