Split-apply-combine

R has a library called plyr for split-apply-combine data analysis. The plyr library provides a function called ddply, which applies a function to each subset of a data frame and then combines the results into another data frame.
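
To make the three stages concrete before turning to ddply, here is a minimal pandas sketch that performs the split, apply, and combine steps by hand. The data values are made up for illustration and are not taken from nycflights13; the names pieces, means, and summary are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from nycflights13).
df = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "dep_delay": [2.0, 4.0, -1.0, 5.0, 8.0],
})

# Split: one sub-frame per distinct month.
pieces = {month: part for month, part in df.groupby("month")}

# Apply: compute a summary for each piece independently.
means = {month: part["dep_delay"].mean() for month, part in pieces.items()}

# Combine: assemble the per-group results into a single DataFrame.
summary = pd.DataFrame(
    {"month": list(means.keys()), "mean_dep_delay": list(means.values())}
)
print(summary)
```

In practice the three stages are rarely written out like this; ddply in R and groupby in pandas perform them in a single call.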

Note

For more information on ddply, you can refer to the following: http://www.inside-r.org/packages/cran/plyr/docs/ddply

To illustrate, let us consider a subset of a recently created dataset in R, which contains data on flights departing NYC in 2013: http://cran.r-project.org/web/packages/nycflights13/index.html.

Implementation in R

Here, we install the package in R and load the library:

> install.packages('nycflights13')
...

> library('nycflights13')
> dim(flights)
[1] 336776     16

> head(flights, 3)
  year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
1 2013     1   1      517         2      830        11      UA  N14228   1545
2 2013     1   1      533         4      850        20      UA  N24211   1714
3 2013     1   1      542         2      923        33      AA  N619AA   1141
  origin dest air_time distance hour minute
1    EWR  IAH      227     1400    5     17
2    LGA  IAH      227     1416    5     33
3    JFK  MIA      160     1089    5     42

> flights.data <- na.omit(flights[, c('year', 'month', 'dep_delay', 'arr_delay', 'distance')])
> flights.sample <- flights.data[sample(1:nrow(flights.data), 100, replace=FALSE), ]

> head(flights.sample, 5)
       year month dep_delay arr_delay distance
155501 2013     3         2         5      184
2410   2013     1         0         4      762
64158  2013    11        -7       -27      509
221447 2013     5        -5       -12      184
281887 2013     8        -1       -10      937

The ddply function enables us to summarize the departure delays (mean, standard deviation) by year and month:

> library(plyr)
> ddply(flights.sample, .(year, month), summarize, mean_dep_delay=round(mean(dep_delay), 2), sd_dep_delay=round(sd(dep_delay), 2))
   year month mean_dep_delay sd_dep_delay
1  2013     1          -0.20         2.28
2  2013     2          23.85        61.63
3  2013     3          10.00        34.72
4  2013     4           0.88        12.56
5  2013     5           8.56        32.42
6  2013     6          58.14       145.78
7  2013     7          25.29        58.88
8  2013     8          25.86        59.38
9  2013     9          -0.38        10.25
10 2013    10           9.31        15.27
11 2013    11          -1.09         7.73
12 2013    12           0.00         8.58

Let us save the flights.sample dataset to a CSV file so that we can reproduce the same analysis in pandas:

> write.csv(flights.sample, file='nycflights13_sample.csv', quote=FALSE, row.names=FALSE)

Implementation in pandas

In order to do the same thing in pandas, we read the CSV file saved in the preceding section:

In [40]: flights_sample=pd.read_csv('nycflights13_sample.csv')

In [41]: flights_sample.head()
Out[41]:
   year  month  dep_delay  arr_delay  distance
0  2013      3          2          5       184
1  2013      1          0          4       762
2  2013     11         -7        -27       509
3  2013      5         -5        -12       184
4  2013      8         -1        -10       937
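
The preparation done on the R side (selecting columns, dropping incomplete rows with na.omit, and sampling without replacement) has direct pandas counterparts. The following is a sketch only: the small flights frame is a made-up stand-in for the full dataset, the sample size is reduced to 3 to fit it, and the names flights_data and sampled are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full flights table (values are made up).
flights = pd.DataFrame({
    "year": [2013] * 6,
    "month": [1, 1, 2, 2, 3, 3],
    "dep_delay": [2.0, np.nan, 4.0, -3.0, 0.0, 7.0],
    "arr_delay": [11.0, 5.0, np.nan, -8.0, 1.0, 9.0],
    "distance": [1400, 1416, 1089, 184, 762, 509],
    "carrier": ["UA", "UA", "AA", "B6", "DL", "UA"],
})

cols = ["year", "month", "dep_delay", "arr_delay", "distance"]

# Keep the columns of interest and drop rows with any missing value
# (the counterpart of na.omit in R).
flights_data = flights[cols].dropna()

# Draw rows without replacement; random_state makes the draw repeatable.
sampled = flights_data.sample(n=3, replace=False, random_state=0)
print(sampled)
```

As in the R output, the sampled rows keep their original row labels, which is why the index of the result is not consecutive.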

We achieve the same effect as ddply by using the groupby() method:

In [44]: pd.set_option('display.precision', 3)
In [45]: grouped = flights_sample.groupby(['year', 'month'])

In [48]: grouped['dep_delay'].agg(['mean', 'std'])

Out[48]:
             mean     std
year month
2013 1      -0.20    2.28
     2      23.85   61.63
     3      10.00   34.72
     4       0.88   12.56
     5       8.56   32.42
     6      58.14  145.78
     7      25.29   58.88
     8      25.86   59.38
     9      -0.38   10.25
     10      9.31   15.27
     11     -1.09    7.73
     12      0.00    8.58
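
To get output shaped more like the ddply result, with flat year and month columns and the column names mean_dep_delay and sd_dep_delay, one option is pandas named aggregation. This is a sketch under assumptions: it requires pandas 0.25 or later, and toy_flights is a hypothetical stand-in with made-up values rather than the sampled CSV.

```python
import pandas as pd

# Hypothetical stand-in for the sampled data (values are made up).
toy_flights = pd.DataFrame({
    "year": [2013] * 5,
    "month": [1, 1, 2, 2, 2],
    "dep_delay": [2.0, 4.0, -1.0, 5.0, 8.0],
})

summary = (
    toy_flights
    .groupby(["year", "month"])
    .agg(
        # output_column=(input_column, aggregation)
        mean_dep_delay=("dep_delay", "mean"),
        sd_dep_delay=("dep_delay", "std"),  # sample std (ddof=1), like sd() in R
    )
    .round(2)        # mirrors round(..., 2) in the ddply call
    .reset_index()   # turn the group keys back into ordinary columns
)
print(summary)
```

The reset_index() call is what flattens the hierarchical (year, month) index into regular columns, matching the layout ddply returns.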