In this chapter, we tackle the question of rearranging data in our data structures. We examine the various functions that enable us to rearrange data by utilizing them on real-world datasets. Such functions include groupby
, concat
, aggregate
, append
, and so on. The topics that we'll discuss are as follow:
We often detail granular data that we wish to aggregate or combine based on a grouping variable. We will illustrate some ways of doing this in the following sections.
The groupby
operation can be thought of as part of a process that involves the following three steps:
The groupby
clause is an operation on DataFrames. A Series is a 1D object, so performing a groupby
operation on it is not very useful. However, it can be used to obtain distinct rows of the Series. The result of a groupby
operation is not a DataFrame but dict
of DataFrame objects. Let us start with a dataset involving the world's most popular sport—soccer.
This dataset, obtained from Wikipedia, contains data for the finals of the European club championship since its inception in 1955. For reference, you can go to http://en.wikipedia.org/wiki/UEFA_Champions_League.
Convert the .csv
file into a DataFrame by using the following command:
In [27]: uefaDF=pd.read_csv('./euro_winners.csv') In [28]: uefaDF.head() Out[28]:
Thus, the output shows the season, the nations to which the winning and runner-up clubs belong, the score, the venue, and the attendance figures. Suppose we wanted to rank the nations by the number of European club championships they had won. We can do this by using groupby
. First, we apply groupby
to the DataFrame and see what is the type of the result:
In [84]: nationsGrp =uefaDF.groupby('Nation'), type(nationsGrp) Out[84]: pandas.core.groupby.DataFrameGroupBy
Thus, we see that nationsGrp
is of the pandas.core.groupby.DataFrameGroupBy
type. The column on which we use groupby
is referred to as the key. We can see what the groups look like by using the groups attribute on the resulting DataFrameGroupBy
object:
In [97]: nationsGrp.groups Out[97]: {'England': [12, 21, 22, 23, 24, 25, 26, 28, 43, 49, 52, 56], 'France': [37], 'Germany': [18, 19, 20, 27, 41, 45, 57], 'Italy': [7, 8, 9, 13, 29, 33, 34, 38, 40, 47, 51, 54], 'Netherlands': [14, 15, 16, 17, 32, 39], 'Portugal': [5, 6, 31, 48], 'Romania': [30], 'Scotland': [11], 'Spain': [0, 1, 2, 3, 4, 10, 36, 42, 44, 46, 50, 53, 55], 'Yugoslavia': [35]}
This is basically a dictionary that just shows the unique groups and the axis labels corresponding to each group—in this case the row number. The number of groups is obtained by using the len()
function:
In [109]: len(nationsGrp.groups) Out[109]: 10
We can now display the number of wins of each nation in descending order by applying the size()
function to the group and subsequently the sort()
function, which sorts according to place:
In [99]: nationWins=nationsGrp.size() In [100] nationWins.sort(ascending=False) nationWins Out[100]: Nation Spain 13 Italy 12 England 12 Germany 7 Netherlands 6 Portugal 4 Yugoslavia 1 Scotland 1 Romania 1 France 1 dtype: int64
The size()
function returns a Series with the group names as the index and the size of each group. The size()
function is also an aggregation function. We will examine aggregation functions later in the chapter.
To do a further breakup of wins by country and club, we apply a multicolumn groupby
function before applying size()
and sort()
:
In [106]: winnersGrp =uefaDF.groupby(['Nation','Winners']) clubWins=winnersGrp.size() clubWins.sort(ascending=False) clubWins Out[106]: Nation Winners Spain Real Madrid 9 Italy Milan 7 Germany Bayern Munich 5 England Liverpool 5 Spain Barcelona 4 Netherlands Ajax 4 England Manchester United 3 Italy Internazionale 3 Juventus 2 Portugal Porto 2 Benfica 2 England Nottingham Forest 2 Chelsea 1 France Marseille 1 Yugoslavia Red Star Belgrade 1 Germany Borussia Dortmund 1 Hamburg 1 Netherlands Feyenoord 1 PSV Eindhoven 1 Romania Steaua Bucuresti 1 Scotland Celtic 1 England Aston Villa 1 dtype: int64
A multicolumn groupby
specifies more than one column to be used as the key by specifying the key columns as a list. Thus, we can see that the most successful club in this competition has been Real Madrid of Spain. We now examine a richer dataset that will enable us to illustrate many more features of groupby
. This dataset is also soccer related and provides statistics for the top four European soccer leagues in the 2012-2013 season:
The source of this information is at http://soccerstats.com.
Let us now read the goal stats data into a DataFrame as usual. In this case, we create a row index on the DataFrame using the month:
In [68]: goalStatsDF=pd.read_csv('./goal_stats_euro_leagues_2012-13.csv') goalStatsDF=goalStatsDF.set_index('Month')
We look at the snapshot of the head and tail ends of our dataset:
In [115]: goalStatsDF.head(3) Out[115]: Stat EPL La Liga Serie A Bundesliga Month 08/01/2012 MatchesPlayed 20 20 10 10 09/01/2012 MatchesPlayed 38 39 50 44 10/01/2012 MatchesPlayed 31 31 39 27 In [116]: goalStatsDF.tail(3) Out[116]: Stat EPL La Liga Serie A Bundesliga Month 04/01/2013 GoalsScored 105 127 102 104 05/01/2013 GoalsScored 96 109 102 92 06/01/2013 GoalsScored NaN 80 NaN NaN
There are two measures in this data frame—MatchesPlayed
and GoalsScored
—and the data is ordered first by Stat
and then by Month
. Note that the last row in the tail()
output has the NaN
values for all the columns except La Liga
but we'll discuss this in more detail later. We can use groupby
to display the stats, but this will be done by grouped year instead. Here is how this is done:
In [117]: goalStatsGroupedByYear = goalStatsDF.groupby( lambda Month: Month.split('/')[2])
We can then iterate over the resulting groupby
object and display the groups. In the following command, we see the two sets of statistics grouped by year. Note the use of the lambda function to obtain the year group from the first day of the month. For more information about lambda functions, go to http://bit.ly/1apJNwS:
In [118]: for name, group in goalStatsGroupedByYear: print name print group 2012 Stat EPL La Liga Serie A Bundesliga Month 08/01/2012 MatchesPlayed 20 20 10 10 09/01/2012 MatchesPlayed 38 39 50 44 10/01/2012 MatchesPlayed 31 31 39 27 11/01/2012 MatchesPlayed 50 41 42 46 12/01/2012 MatchesPlayed 59 39 39 26 08/01/2012 GoalsScored 57 60 21 23 09/01/2012 GoalsScored 111 112 133 135 10/01/2012 GoalsScored 95 88 97 77 11/01/2012 GoalsScored 121 116 120 137 12/01/2012 GoalsScored 183 109 125 72 2013 Stat EPL La Liga Serie A Bundesliga Month 01/01/2013 MatchesPlayed 42 40 40 18 02/01/2013 MatchesPlayed 30 40 40 36 03/01/2013 MatchesPlayed 35 38 39 36 04/01/2013 MatchesPlayed 42 42 41 36 05/01/2013 MatchesPlayed 33 40 40 27 06/02/2013 MatchesPlayed NaN 10 NaN NaN 01/01/2013 GoalsScored 117 121 104 51 02/01/2013 GoalsScored 87 110 100 101 03/01/2013 GoalsScored 91 101 99 106 04/01/2013 GoalsScored 105 127 102 104 05/01/2013 GoalsScored 96 109 102 92 06/01/2013 GoalsScored NaN 80 NaN NaN
If we wished to group by individual month instead, we would need to apply groupby
with a level argument, as follows:
In [77]: goalStatsGroupedByMonth = goalStatsDF.groupby(level=0) In [81]: for name, group in goalStatsGroupedByMonth: print name print group print " " 01/01/2013 Stat EPL La Liga Serie A Bundesliga Month 01/01/2013 MatchesPlayed 42 40 40 18 01/01/2013 GoalsScored 117 121 104 51 02/01/2013 Stat EPL La Liga Serie A Bundesliga Month 02/01/2013 MatchesPlayed 30 40 40 36 02/01/2013 GoalsScored 87 110 100 101 03/01/2013 Stat EPL La Liga Serie A Bundesliga Month 03/01/2013 MatchesPlayed 35 38 39 36 03/01/2013 GoalsScored 91 101 99 106 04/01/2013 Stat EPL La Liga Serie A Bundesliga Month 04/01/2013 MatchesPlayed 42 42 41 36 04/01/2013 GoalsScored 105 127 102 104 05/01/2013 Stat EPL La Liga Serie A Bundesliga Month 05/01/2013 MatchesPlayed 33 40 40 27 05/01/2013 GoalsScored 96 109 102 92 06/01/2013 Stat EPL La Liga Serie A Bundesliga Month 06/01/2013 GoalsScored NaN 80 NaN NaN 06/02/2013 Stat EPL La Liga Serie A Bundesliga Month 06/02/2013 MatchesPlayed NaN 10 NaN NaN 08/01/2012 Stat EPL La Liga Serie A Bundesliga Month 08/01/2012 MatchesPlayed 20 20 10 10 08/01/2012 GoalsScored 57 60 21 23 09/01/2012 Stat EPL La Liga Serie A Bundesliga Month 09/01/2012 MatchesPlayed 38 39 50 44 09/01/2012 GoalsScored 111 112 133 135 10/01/2012 Stat EPL La Liga Serie A Bundesliga Month 10/01/2012 MatchesPlayed 31 31 39 27 10/01/2012 GoalsScored 95 88 97 77 11/01/2012 Stat EPL La Liga Serie A Bundesliga Month 11/01/2012 MatchesPlayed 50 41 42 46 11/01/2012 GoalsScored 121 116 120 137 12/01/2012 Stat EPL La Liga Serie A Bundesliga Month 12/01/2012 MatchesPlayed 59 39 39 26 12/01/2012 GoalsScored 183 109 125 72
Note that since in the preceding commands we're grouping on an index, we need to specify the level argument as opposed to just using a column name. When we group by multiple keys, the resulting group name is a tuple, as shown in the upcoming commands. First, we reset the index to obtain the original DataFrame and define a MultiIndex in order to be able to group by multiple keys. If this is not done, it will result in a ValueError
:
In [246]: goalStatsDF=goalStatsDF.reset_index() goalStatsDF=goalStatsDF.set_index(['Month','Stat']) In [247]: monthStatGroup=goalStatsDF.groupby(level=['Month','Stat']) In [248]: for name, group in monthStatGroup: print name print group ('01/01/2013', 'GoalsScored') EPL La Liga Serie A Bundesliga Month Stat 01/01/2013 GoalsScored 117 121 104 51 ('01/01/2013', 'MatchesPlayed') EPL La Liga Serie A Bundesliga Month Stat 01/01/2013 MatchesPlayed 42 40 40 18 ('02/01/2013', 'GoalsScored') EPL La Liga Serie A Bundesliga Month Stat 02/01/2013 GoalsScored 87 110 100 101
If our DataFrame has a MultiIndex, we can use groupby
to group by different levels of the hierarchy and compute some interesting statistics. Here is the goal stats data using a MultiIndex consisting of Month
and then Stat
:
In [134]:goalStatsDF2=pd.read_csv('./goal_stats_euro_leagues_2012-13.csv') goalStatsDF2=goalStatsDF2.set_index(['Month','Stat']) In [141]: print goalStatsDF2.head(3) print goalStatsDF2.tail(3) EPL La Liga Serie A Bundesliga Month Stat 08/01/2012 MatchesPlayed 20 20 10 10 09/01/2012 MatchesPlayed 38 39 50 44 10/01/2012 MatchesPlayed 31 31 39 27 EPL La Liga Serie A Bundesliga Month Stat 04/01/2013 GoalsScored 105 127 102 104 05/01/2013 GoalsScored 96 109 102 92 06/01/2013 GoalsScored NaN 80 NaN NaN
Suppose we wish to compute the total number of goals scored and the total matches played for the entire season for each league, we could do this as follows:
In [137]: grouped2=goalStatsDF2.groupby(level='Stat') In [139]: grouped2.sum() Out[139]: EPL La Liga Serie A Bundesliga Stat GoalsScored 1063 1133 1003 898 MatchesPlayed 380 380 380 306
Incidentally, the same result as the preceding one can be obtained by using sum directly and passing the level as a parameter:
In [142]: goalStatsDF2.sum(level='Stat') Out[142]: EPL La Liga Serie A Bundesliga Stat GoalsScored 1063 1133 1003 898 MatchesPlayed 380 380 380 306
Now, let us obtain a key statistic to determine how exciting the season was in each of the leagues—the goals per game ratio:
In [174]: totalsDF=grouped2.sum() In [175]: totalsDF.ix['GoalsScored']/totalsDF.ix['MatchesPlayed'] Out[175]: EPL 2.797368 La Liga 2.981579 Serie A 2.639474 Bundesliga 2.934641 dtype: float64
This is returned as a Series, as shown in the preceding command. We can now display the goals per game ratio along with the goals scored and matches played to give a summary of how exciting the league was, as follows:
gpg
is returned as a Series:In [234]: gpg=totalsDF.ix['GoalsScored']/totalsDF.ix['MatchesPlayed'] goalsPerGameDF=pd.DataFrame(gpg).T In [235]: goalsPerGameDF Out[235]: EPL La Liga Serie A Bundesliga 0 2.797368 2.981579 2.639474 2.934641
goalsPerGameDF
DataFrame so that the 0
index is replaced by GoalsPerGame
:In [207]: goalsPerGameDF=goalsPerGameDF.rename(index={0:'GoalsPerGame'}) In [208]: goalsPerGameDF Out[208]: EPL La Liga Serie A Bundesliga GoalsPerGame 2.797368 2.981579 2.639474 2.934641
goalsPerGameDF
DataFrame to the original one:In [211]: pd.options.display.float_format='{:.2f}'.format totalsDF.append(goalsPerGameDF) Out[211]: EPL La Liga Serie A Bundesliga GoalsScored 1063.00 1133.00 1003.00 898.00 MatchesPlayed 380.00 380.00 380.00 306.00 GoalsPerGame 2.80 2.98 2.64 2.93
The following is a graph that shows the goals per match of the European leagues, that we discussed, from 1955-2012. The source for this can be found at http://mattstil.es/images/europe-football.png.
Another way to generate summary statistics is by using the aggregate method explicitly:
In [254]: pd.options.display.float_format=None In [256]: grouped2.aggregate(np.sum) Out[256]: EPL La Liga Serie A Bundesliga Stat GoalsScored 1063 1133 1003 898 MatchesPlayed 380 380 380 306
This generates a grouped DataFrame object that is shown in the preceding command. We also reset the float format to None
, so the integer-valued data would not be shown as floats due to the formatting from the previous section.
For a grouped DataFrame object, we can specify a list of functions to be applied to each column:
In [274]: grouped2.agg([np.sum, np.mean,np.size]) Out[274]: EPL La Liga Serie A Bundesliga sum mean size sum mean size sum mean size sum mean size Stat GoalsScored 1063 106.3 11 1133 103.0 11 1003 100.3 11 898 89.8 11 MatchesPlayed 380 38.0 11 380 34.6 11 380 38.0 11 306 30.6 11
Note the preceding output that shows NA values are excluded from aggregate calculations. The agg
is an abbreviation form for aggregate. Thus, the calculations for the mean for EPL, Serie A, and Bundesliga are based on a size of 10 months and not 11. This is because no matches were played in the last month of June in these three leagues as opposed to La Liga, which had matches in June.
In the case of a grouped Series, we return to the nationsGrp
example and compute some statistics on the attendance figures for the country of the tournament winners:
In [297]: nationsGrp['Attendance'].agg({'Total':np.sum, 'Average':np.mean, 'Deviation':np.std}) Out[297]: Deviation Average Total Nation England 17091.31 66534.25 798411 France NaN 64400 64400 Germany 13783.83 67583.29 473083 Italy 17443.52 65761.25 789135 Netherlands 16048.58 67489.0 404934 Portugal 15632.86 49635.5 198542 Romania NaN 70000 70000 Scotland NaN 45000 45000 Spain 27457.53 73477.15 955203 Yugoslavia NaN 56000 56000
For a grouped Series, we can pass a list or dict
of functions. In the preceding case, a dict
was specified and the key values were used for the names of the columns in the resulting DataFrame. Note that in the case of groups of a single sample size, the standard deviation is undefined and NaN
is the result—for example, Romania.
The groupby-transform
function is used to perform transformation operations on a groupby
object. For example, we could replace NaN
values in the groupby
object using the fillna
method. The resulting object after using transform has the same size as the original groupby
object. Let us consider a DataFrame showing the goals scored for each month in the four soccer leagues:
In[344]: goalStatsDF3=pd.read_csv('./goal_stats_euro_leagues_2012-13.csv') goalStatsDF3=goalStatsDF3.set_index(['Month']) goalsScoredDF=goalStatsDF3.ix[goalStatsDF3['Stat']=='GoalsScored'] goalsScoredDF.iloc[:,1:] Out[344]: EPL La Liga Serie A Bundesliga Month 08/01/2012 57 60 21 23 09/01/2012 111 112 133 135 10/01/2012 95 88 97 77 11/01/2012 121 116 120 137 12/01/2012 183 109 125 72 01/01/2013 117 121 104 51 02/01/2013 87 110 100 101 03/01/2013 91 101 99 106 04/01/2013 105 127 102 104 05/01/2013 96 109 102 92 06/01/2013 NaN 80 NaN NaN
We can see that for June 2013, the only league for which matches were played was La Liga
, resulting in the NaN
values for the other three leagues. Let us group the data by year:
In [336]: goalsScoredPerYearGrp=goalsScoredDF.groupby(lambda Month: Month.split('/')[2]) goalsScoredPerYearGrp.mean() Out[336]: EPL La Liga Serie A Bundesliga 2012 113.4 97 99.2 88.8 2013 99.2 108 101.4 90.8
The preceding function makes use of a lambda function to obtain the year by splitting the Month
variable on the /
character and taking the third element of the resulting list.
If we do a count of the number of months per year during which matches were held in the various leagues, we have:
In [331]: goalsScoredPerYearGrp.count() Out[331]: EPL La Liga Serie A Bundesliga 2012 5 5 5 5 2013 5 6 5 5
It is often undesirable to display data with missing values and one common method to resolve this situation would be to replace the missing values with the group mean. This can be achieved using the transform-groupby function. First, we must define the transformation using a lambda function and then apply this transformation using the transform method:
In [338]: fill_fcn = lambda x: x.fillna(x.mean()) trans = goalsScoredPerYearGrp.transform(fill_fcn) tGroupedStats = trans.groupby(lambda Month: Month.split('/')[2]) tGroupedStats.mean() Out[338]: EPL La Liga Serie A Bundesliga 2012 113.4 97 99.2 88.8 2013 99.2 108 101.4 90.8
One thing to note from the preceding results is that replacing the NaN
values with the group mean in the original group, keeps the group means unchanged in the transformed data.
However, when we do a count on the transformed group, we see that the number of matches has changed from five to six for the EPL, Serie A, and Bundesliga:
In [339]: tGroupedStats.count() Out[339]: EPL La Liga Serie A Bundesliga 2012 5 5 5 5 2013 6 6 6 6
The filter method enables us to apply filtering on a groupby
object that results in a subset of the initial object. Here, we illustrate how to display the months of the season in which more than 100 goals were scored in each of the four leagues:
In [391]: goalsScoredDF.groupby(level='Month').filter(lambda x: np.all([x[col] > 100 for col in goalsScoredDF.columns])) Out[391]: EPL La Liga Serie A Bundesliga Month 09/01/2012 111 112 133 135 11/01/2012 121 116 120 137 04/01/2013 105 127 102 104
Note the use of the np.all
operator to ensure that the constraint is enforced for all the columns.