Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Aggregation and GroupBy

Sometimes, we may wish to split data into subsets and apply a function such as the mean, max, or min to each subset. In R, we can do this via the aggregate or tapply functions.

Here, we will use the example of a dataset of statistics on the top five strikers of the four clubs that made it to the semi-final of the European Champions League Football tournament in 2014. We will use it to illustrate aggregation in R and its equivalent GroupBy functionality in pandas.

Aggregation in R

In R aggregation is done using the following command:

> goal_stats=read.csv('champ_league_stats_semifinalists.csv')
>goal_stats
              Club                 Player Goals GamesPlayed
1  Atletico Madrid            Diego Costa     8           9
2  Atletico Madrid             ArdaTuran     4           9
3  Atletico Madrid            RaúlGarcía     4          12
4  Atletico Madrid           AdriánLópez     2           9
5  Atletico Madrid            Diego Godín     2          10
6      Real Madrid      Cristiano Ronaldo    17          11
7      Real Madrid            Gareth Bale     6          12
8      Real Madrid          Karim Benzema     5          11
9      Real Madrid                   Isco     3          12
10     Real Madrid         Ángel Di María     3          11
11   Bayern Munich          Thomas Müller     5          12
12   Bayern Munich           ArjenRobben     4          10
13   Bayern Munich            Mario Götze     3          11
14   Bayern Munich Bastian Schweinsteiger     3           8
15   Bayern Munich        Mario Mandžukić     3          10
16         Chelsea        Fernando Torres     4           9
17         Chelsea               Demba Ba     3           6
18         Chelsea           Samuel Eto'o     3           9
19         Chelsea            Eden Hazard     2           9
20         Chelsea                Ramires     2          10

We can now compute the goals per game ratio for each striker, to measure their deadliness in front of a goal:

>goal_stats$GoalsPerGame<- goal_stats$Goals/goal_stats$GamesPlayed
>goal_stats
              Club   Player         Goals GamesPlayedGoalsPerGame
1  Atletico Madrid  Diego Costa     8           9    0.8888889
2  Atletico Madrid  ArdaTuran      4           9    0.4444444
3  Atletico Madrid  RaúlGarcía     4          12    0.3333333
4  Atletico Madrid  AdriánLópez    2           9    0.2222222
5  Atletico Madrid  Diego Godín     2          10    0.2000000
6  Real Madrid  Cristiano Ronaldo  17          11    1.5454545
7  Real Madrid  Gareth Bale         6          12    0.5000000
8  Real Madrid    Karim Benzema     5          11    0.4545455
9  Real Madrid       Isco           3          12    0.2500000
10 Real Madrid  Ángel Di María     3          11    0.2727273
11 Bayern Munich Thomas Müller     5          12    0.4166667
12 Bayern Munich  ArjenRobben     4          10    0.4000000
13 Bayern Munich  MarioGötze      3          11    0.2727273
14 Bayern Munich Bastian Schweinsteiger 3      8    0.3750000
15 Bayern Munich  MarioMandžukić  3          10    0.3000000
16 Chelsea       Fernando Torres   4           9    0.4444444
17 Chelsea           Demba Ba      3           6    0.5000000
18 Chelsea           Samuel Eto'o  3           9    0.3333333
19 Chelsea            Eden Hazard  2           9    0.2222222
20 Chelsea                Ramires  2          10    0.2000000

Let's suppose that we wanted to know the highest goals per game ratio for each team. We would calculate this as follows:

>aggregate(x=goal_stats[,c('GoalsPerGame')], by=list(goal_stats$Club),FUN=max)
          Group.1         x
1 Atletico Madrid 0.8888889
2   Bayern Munich 0.4166667
3         Chelsea 0.5000000
4     Real Madrid 1.5454545

The tapply function is used to apply a function to a subset of an array or vector that is defined by one or more columns. The tapply function can also be used as follows:

>tapply(goal_stats$GoalsPerGame,goal_stats$Club,max)
Atletico Madrid   Bayern Munich         Chelsea     Real Madrid 
      0.8888889       0.4166667       0.5000000       1.5454545

The pandas' GroupBy operator

In pandas, we can achieve the same result by using the GroupBy function:

In [6]: import pandas as pd
importnumpy as np
In [7]: goal_stats_df=pd.read_csv('champ_league_stats_semifinalists.csv')

In [27]: goal_stats_df['GoalsPerGame']=     goal_stats_df['Goals']/goal_stats_df['GamesPlayed']

In [27]: goal_stats_df['GoalsPerGame']= goal_stats_df['Goals']/goal_stats_df['GamesPlayed']

In [28]: goal_stats_df
Out[28]: Club           Player      Goals GamesPlayedGoalsPerGame
0       Atletico Madrid Diego Costa   8       9        0.888889
1       Atletico Madrid ArdaTuran    4       9         0.444444
2       Atletico Madrid RaúlGarcía   4       12        0.333333
3       Atletico Madrid AdriánLópez  2       9         0.222222
4       Atletico Madrid Diego Godín   2       10        0.200000
5       Real Madrid  Cristiano Ronaldo 17      11        1.545455
6       Real Madrid     Gareth Bale   6       12        0.500000
7       Real Madrid     Karim Benzema 5       11        0.454545
8       Real Madrid     Isco          3       12        0.250000
9       Real Madrid     Ángel Di María 3      11        0.272727
10      Bayern Munich   Thomas Müller  5       12        0.416667
11      Bayern Munich   ArjenRobben   4       10        0.400000
12      Bayern Munich   Mario Götze    3       11        0.272727
13      Bayern Munich  BastianSchweinsteiger 3   8     0.375000
14      Bayern Munich  MarioMandžukić  3       10        0.300000
15      Chelsea        Fernando Torres  4       9         0.444444
16      Chelsea        Demba Ba         3       6         0.500000
17      Chelsea        Samuel Eto'o     3       9         0.333333
18      Chelsea        Eden Hazard      2       9         0.222222
19      Chelsea        Ramires          2       10        0.200000

In [30]: grouped = goal_stats_df.groupby('Club')

In [17]: grouped['GoalsPerGame'].aggregate(np.max)
Out[17]: Club
         Atletico Madrid    0.888889
         Bayern Munich      0.416667
         Chelsea            0.500000
         Real Madrid        1.545455
         Name: GoalsPerGame, dtype: float64

In [22]: grouped['GoalsPerGame'].apply(np.max)

Out[22]: Club
         Atletico Madrid    0.888889
         Bayern Munich      0.416667
         Chelsea            0.500000
         Real Madrid        1.545455
         Name: GoalsPerGame, dtype: float64

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Aggregation and GroupBy

Create new playlist

Sign In

Sign Up

Aggregation and GroupBy

Aggregation in R

The pandas' GroupBy operator

Table of Contents for
Aggregation and GroupBy