Data aggregation

As a final topic, we will look at ways to get a condensed view of data with aggregations. pandas comes with a lot of aggregation functions built-in. We already saw the describe function in Chapter 3, Data Analysis with pandas. This works on parts of the data as well. We start with some artificial data again, containing measurements about the number of sunshine hours per city and date:

>>> df.head()
   country     city        date  hours
0  Germany  Hamburg  2015-06-01      8
1  Germany  Hamburg  2015-06-02     10
2  Germany  Hamburg  2015-06-03      9
3  Germany  Hamburg  2015-06-04      7
4  Germany  Hamburg  2015-06-05      3

To view a summary per city, we use the describe function on the grouped data set:

>>> df.groupby("city").describe()
                      hours
city
Berlin     count  10.000000
           mean    6.000000
           std     3.741657
           min     0.000000
           25%     4.000000
           50%     6.000000
           75%     9.750000
           max    10.000000
Birmingham count  10.000000
           mean    5.100000
           std     2.078995
           min     2.000000
           25%     4.000000
           50%     5.500000
           75%     6.750000
           max     8.000000

On certain data sets, it can be useful to group by more than one attribute. We can get an overview about the sunny hours per country and date by passing in two column names:

>>> df.groupby(["country", "date"]).describe()
                         hourscountry date
France  2015-06-01 count  5.000000
                   mean   6.200000
                   std    1.095445
                   min    5.000000
                   25%    5.000000
                   50%    7.000000
                   75%    7.000000
                   max    7.000000
        2015-06-02 count  5.000000
                   mean   3.600000
                   std    3.577709
                   min    0.000000
                   25%    0.000000
                   50%    4.000000
                   75%    6.000000
                   max    8.000000
UK      2015-06-07 std    3.872983
                   min    0.000000
                   25%    2.000000
                   50%    6.000000
                   75%    8.000000
                   max    9.000000

We can compute single statistics as well:

>>> df.groupby("city").mean()
            hours
cityBerlin        6.0
Birmingham    5.1
Bordeax       4.7
Edinburgh     7.5
Frankfurt     5.8
Glasgow       4.8
Hamburg       5.5
Leipzig       5.0
London        4.8
Lyon          5.0
Manchester    5.2
Marseille     6.2
Munich        6.6
Nice          3.9
Paris         6.3

Finally, we can define any function to be applied on the groups with the agg method. The preceding code could have been written in terms of agg like this:

>>> df.groupby("city").agg(np.mean)
hours
city
Berlin        6.0
Birmingham    5.1
Bordeax       4.7
Edinburgh     7.5
Frankfurt     5.8
Glasgow       4.8
...

But arbitrary functions are possible. As a last example, we define a custom function, which takes an input of a series object and computes the difference between the smallest and the largest element:

>>> df.groupby("city").agg(lambda s: abs(min(s) - max(s)))
        hours
city
Berlin         10
Birmingham      6
Bordeax        10
Edinburgh       8
Frankfurt       9
Glasgow        10
Hamburg        10
Leipzig         9
London         10
Lyon            8
Manchester     10
Marseille      10
Munich          9
Nice           10
Paris           9
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset