Sometimes, we may wish to split data into subsets and apply a function such as the mean, max, or min to each subset. In R, we can do this via the aggregate
or tapply
functions.
Here, we will use the example of a dataset of statistics on the top five strikers of the four clubs that made it to the semi-final of the European Champions League Football tournament in 2014. We will use it to illustrate aggregation in R and its equivalent GroupBy functionality in pandas.
In R aggregation is done using the following command:
> goal_stats=read.csv('champ_league_stats_semifinalists.csv') >goal_stats Club Player Goals GamesPlayed 1 Atletico Madrid Diego Costa 8 9 2 Atletico Madrid ArdaTuran 4 9 3 Atletico Madrid RaúlGarcía 4 12 4 Atletico Madrid AdriánLópez 2 9 5 Atletico Madrid Diego Godín 2 10 6 Real Madrid Cristiano Ronaldo 17 11 7 Real Madrid Gareth Bale 6 12 8 Real Madrid Karim Benzema 5 11 9 Real Madrid Isco 3 12 10 Real Madrid Ángel Di María 3 11 11 Bayern Munich Thomas Müller 5 12 12 Bayern Munich ArjenRobben 4 10 13 Bayern Munich Mario Götze 3 11 14 Bayern Munich Bastian Schweinsteiger 3 8 15 Bayern Munich Mario Mandžukić 3 10 16 Chelsea Fernando Torres 4 9 17 Chelsea Demba Ba 3 6 18 Chelsea Samuel Eto'o 3 9 19 Chelsea Eden Hazard 2 9 20 Chelsea Ramires 2 10
We can now compute the goals per game ratio for each striker, to measure their deadliness in front of a goal:
>goal_stats$GoalsPerGame<- goal_stats$Goals/goal_stats$GamesPlayed >goal_stats Club Player Goals GamesPlayedGoalsPerGame 1 Atletico Madrid Diego Costa 8 9 0.8888889 2 Atletico Madrid ArdaTuran 4 9 0.4444444 3 Atletico Madrid RaúlGarcía 4 12 0.3333333 4 Atletico Madrid AdriánLópez 2 9 0.2222222 5 Atletico Madrid Diego Godín 2 10 0.2000000 6 Real Madrid Cristiano Ronaldo 17 11 1.5454545 7 Real Madrid Gareth Bale 6 12 0.5000000 8 Real Madrid Karim Benzema 5 11 0.4545455 9 Real Madrid Isco 3 12 0.2500000 10 Real Madrid Ángel Di María 3 11 0.2727273 11 Bayern Munich Thomas Müller 5 12 0.4166667 12 Bayern Munich ArjenRobben 4 10 0.4000000 13 Bayern Munich MarioGötze 3 11 0.2727273 14 Bayern Munich Bastian Schweinsteiger 3 8 0.3750000 15 Bayern Munich MarioMandžukić 3 10 0.3000000 16 Chelsea Fernando Torres 4 9 0.4444444 17 Chelsea Demba Ba 3 6 0.5000000 18 Chelsea Samuel Eto'o 3 9 0.3333333 19 Chelsea Eden Hazard 2 9 0.2222222 20 Chelsea Ramires 2 10 0.2000000
Let's suppose that we wanted to know the highest goals per game ratio for each team. We would calculate this as follows:
>aggregate(x=goal_stats[,c('GoalsPerGame')], by=list(goal_stats$Club),FUN=max) Group.1 x 1 Atletico Madrid 0.8888889 2 Bayern Munich 0.4166667 3 Chelsea 0.5000000 4 Real Madrid 1.5454545
The tapply
function is used to apply a function to a subset of an array or vector that is defined by one or more columns. The tapply
function can also be used as follows:
>tapply(goal_stats$GoalsPerGame,goal_stats$Club,max) Atletico Madrid Bayern Munich Chelsea Real Madrid 0.8888889 0.4166667 0.5000000 1.5454545
In pandas, we can achieve the same result by using the GroupBy
function:
In [6]: import pandas as pd importnumpy as np In [7]: goal_stats_df=pd.read_csv('champ_league_stats_semifinalists.csv') In [27]: goal_stats_df['GoalsPerGame']= goal_stats_df['Goals']/goal_stats_df['GamesPlayed'] In [27]: goal_stats_df['GoalsPerGame']= goal_stats_df['Goals']/goal_stats_df['GamesPlayed'] In [28]: goal_stats_df Out[28]: Club Player Goals GamesPlayedGoalsPerGame 0 Atletico Madrid Diego Costa 8 9 0.888889 1 Atletico Madrid ArdaTuran 4 9 0.444444 2 Atletico Madrid RaúlGarcía 4 12 0.333333 3 Atletico Madrid AdriánLópez 2 9 0.222222 4 Atletico Madrid Diego Godín 2 10 0.200000 5 Real Madrid Cristiano Ronaldo 17 11 1.545455 6 Real Madrid Gareth Bale 6 12 0.500000 7 Real Madrid Karim Benzema 5 11 0.454545 8 Real Madrid Isco 3 12 0.250000 9 Real Madrid Ángel Di María 3 11 0.272727 10 Bayern Munich Thomas Müller 5 12 0.416667 11 Bayern Munich ArjenRobben 4 10 0.400000 12 Bayern Munich Mario Götze 3 11 0.272727 13 Bayern Munich BastianSchweinsteiger 3 8 0.375000 14 Bayern Munich MarioMandžukić 3 10 0.300000 15 Chelsea Fernando Torres 4 9 0.444444 16 Chelsea Demba Ba 3 6 0.500000 17 Chelsea Samuel Eto'o 3 9 0.333333 18 Chelsea Eden Hazard 2 9 0.222222 19 Chelsea Ramires 2 10 0.200000 In [30]: grouped = goal_stats_df.groupby('Club') In [17]: grouped['GoalsPerGame'].aggregate(np.max) Out[17]: Club Atletico Madrid 0.888889 Bayern Munich 0.416667 Chelsea 0.500000 Real Madrid 1.545455 Name: GoalsPerGame, dtype: float64 In [22]: grouped['GoalsPerGame'].apply(np.max) Out[22]: Club Atletico Madrid 0.888889 Bayern Munich 0.416667 Chelsea 0.500000 Real Madrid 1.545455 Name: GoalsPerGame, dtype: float64