Split-apply-combine

R has a library called plyr for split-apply-combine data analysis. Its ddply function applies a function to each subset of a data frame and then combines the results into another data frame.

Note

For more information on ddply, you can refer to the following: http://www.inside-r.org/packages/cran/plyr/docs/ddply
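
Before turning to the flights data, the three steps can be sketched by hand in pandas. This is a minimal illustration on a made-up toy table (the column names merely mirror the flights data); it assumes nothing beyond the standard groupby API and is not how plyr or pandas implement the pattern internally.

```python
import pandas as pd

# Hypothetical toy data, not taken from nycflights13.
df = pd.DataFrame({
    "month": [1, 1, 2, 2, 2],
    "dep_delay": [2.0, 4.0, -1.0, 5.0, 8.0],
})

# Split: one sub-DataFrame per distinct month.
pieces = {month: part for month, part in df.groupby("month")}

# Apply: reduce each piece to a single summary value.
means = {month: part["dep_delay"].mean() for month, part in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(means, name="dep_delay").rename_axis("month")

# groupby performs the same three steps in a single expression.
builtin = df.groupby("month")["dep_delay"].mean()

print(manual)
print(builtin)
```

Both printed Series hold one mean per month; the explicit version only spells out what the one-line groupby call does.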

To illustrate, let us consider a subset of nycflights13, an R data package that contains data on flights departing NYC in 2013: http://cran.r-project.org/web/packages/nycflights13/index.html.

Implementation in R

Here, we will install the package in R and load the library:

> install.packages('nycflights13')
...

> library('nycflights13')
> dim(flights)
[1] 336776     16

> head(flights,3)
  year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
1 2013     1   1      517         2      830        11      UA  N14228   1545
2 2013     1   1      533         4      850        20      UA  N24211   1714
3 2013     1   1      542         2      923        33      AA  N619AA   1141
  origin dest air_time distance hour minute
1    EWR  IAH      227     1400    5     17
2    LGA  IAH      227     1416    5     33
3    JFK  MIA      160     1089    5     42

> flights.data <- na.omit(flights[,c('year','month','dep_delay','arr_delay','distance')])
> flights.sample <- flights.data[sample(1:nrow(flights.data),100,replace=FALSE),]

> head(flights.sample,5)
       year month dep_delay arr_delay distance
155501 2013     3         2         5      184
2410   2013     1         0         4      762
64158  2013    11        -7       -27      509
221447 2013     5        -5       -12      184
281887 2013     8        -1       -10      937
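
For comparison, the column selection, NA removal, and sampling steps above could be sketched in pandas roughly as follows. The small flights table here is a hypothetical stand-in with made-up values, and n=3 and random_state=0 are illustrative choices only; the R session draws 100 rows without a seed.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights table; the values are made up.
flights = pd.DataFrame({
    "year": [2013] * 6,
    "month": [1, 1, 2, 2, 3, 3],
    "dep_delay": [2.0, np.nan, -7.0, 4.0, 0.0, 12.0],
    "arr_delay": [11.0, 20.0, np.nan, -3.0, 5.0, 9.0],
    "distance": [1400, 1416, 1089, 184, 762, 509],
    "carrier": ["UA", "UA", "AA", "B6", "DL", "EV"],
})

cols = ["year", "month", "dep_delay", "arr_delay", "distance"]

# Counterpart of na.omit(flights[, cols]): keep the columns, then drop
# every row that has a missing value in any of them.
flights_data = flights[cols].dropna()

# Counterpart of sample(..., replace=FALSE): draw rows without replacement.
flights_sample = flights_data.sample(n=3, replace=False, random_state=0)

print(flights_sample)
```

As in R, the sampled rows keep their original row labels, which is why the index of the result is not consecutive.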

The ddply function enables us to summarize the departure delays (mean, standard deviation) by year and month:

> library(plyr)
> ddply(flights.sample,.(year,month),summarize, mean_dep_delay=round(mean(dep_delay),2), sd_dep_delay=round(sd(dep_delay),2))
   year month mean_dep_delay sd_dep_delay
1  2013     1          -0.20         2.28
2  2013     2          23.85        61.63
3  2013     3          10.00        34.72
4  2013     4           0.88        12.56
5  2013     5           8.56        32.42
6  2013     6          58.14       145.78
7  2013     7          25.29        58.88
8  2013     8          25.86        59.38
9  2013     9          -0.38        10.25
10 2013    10           9.31        15.27
11 2013    11          -1.09         7.73
12 2013    12           0.00         8.58

Let us save the flights.sample dataset to a CSV file so that we can reproduce the same analysis in pandas:

>write.csv(flights.sample,file='nycflights13_sample.csv', quote=FALSE,row.names=FALSE)
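
The pandas counterpart of this export is DataFrame.to_csv. The sketch below writes a made-up two-row table to an in-memory buffer rather than to nycflights13_sample.csv, so it can run without touching the file system; the rows are hypothetical and only shaped like flights.sample.

```python
import io

import pandas as pd

# Hypothetical rows shaped like flights.sample; the values are made up.
sample = pd.DataFrame({
    "year": [2013, 2013],
    "month": [3, 11],
    "dep_delay": [2, -7],
    "arr_delay": [5, -27],
    "distance": [184, 509],
})

# index=False omits the row labels, much as row.names=FALSE does in write.csv.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the same columns and values.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Leaving the row labels out keeps the file free of an extra unnamed column when it is read back.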

Implementation in pandas

In order to do the same thing in pandas, we read the CSV file saved in the preceding section:

In [40]: flights_sample = pd.read_csv('nycflights13_sample.csv')

In [41]: flights_sample.head()
Out[41]:
   year  month  dep_delay  arr_delay  distance
0  2013      3          2          5       184
1  2013      1          0          4       762
2  2013     11         -7        -27       509
3  2013      5         -5        -12       184
4  2013      8         -1        -10       937

We achieve the same effect as ddply by making use of the groupby() method:

In [44]: pd.set_option('display.precision',3)
In [45]: grouped = flights_sample.groupby(['year','month'])

In [48]: grouped['dep_delay'].agg(['mean', 'std'])

Out[48]:
              mean     std
year month
2013 1       -0.20    2.28
     2       23.85   61.63
     3       10.00   34.72
     4        0.88   12.56
     5        8.56   32.42
     6       58.14  145.78
     7       25.29   58.88
     8       25.86   59.38
     9       -0.38   10.25
     10       9.31   15.27
     11      -1.09    7.73
     12       0.00    8.58
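
To get closer to the shape of the ddply result, with named summary columns, values rounded to two decimals, and year and month as ordinary columns, one option is named aggregation. The following is a sketch on a made-up stand-in table rather than the CSV sample. Note that the pandas 'std' aggregation uses the sample standard deviation (ddof=1), as R's sd does.

```python
import pandas as pd

# Hypothetical stand-in for flights_sample; the delays are made up.
flights_sample = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [1, 1, 2, 2],
    "dep_delay": [2, 4, -7, 5],
})

summary = (
    flights_sample
    .groupby(["year", "month"])
    .agg(
        # output_column=(input_column, aggregation)
        mean_dep_delay=("dep_delay", "mean"),
        sd_dep_delay=("dep_delay", "std"),
    )
    .round(2)
    .reset_index()  # turn the group keys back into ordinary columns
)

print(summary)
```

The result has one row per (year, month) pair and the same column names as the ddply call, which makes the two outputs easy to compare side by side.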
