Another popular package is dplyr, which implements a grammar of data manipulation. Instead of relying on the subsetting operator ([]), dplyr defines a set of basic verb functions as the building blocks of data operations and imports a pipeline operator to chain these functions together to perform complex multistep tasks.

Run the following code to install dplyr from CRAN if you don't have it yet:
install.packages("dplyr")
First, we will reload the product tables to reset all the data to its original form:
library(readr)
product_info <- read_csv("data/product-info.csv")
product_stats <- read_csv("data/product-stats.csv")
product_tests <- read_csv("data/product-tests.csv")
toy_tests <- read_csv("data/product-toy-tests.csv")
Then, we will load the dplyr package:
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
The preceding output indicates that dplyr generalizes a number of built-in functions, which are therefore masked once the package is attached.
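Masking does not make the original functions unavailable; a masked function can always be called with its namespace prefix. A minimal sketch (the numeric vector here is made up for illustration):

```r
library(dplyr)

# dplyr::filter() masks stats::filter(), but the stats version is
# still reachable via the :: operator:
x <- c(1, 2, 3, 4, 5)
stats::filter(x, rep(1 / 3, 3))  # 3-point moving average, NA at the ends
```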
Now, we can start to play with the verb functions it provides. First, we will use select to select columns from the provided data frame by creating a new table with the given columns:
select(product_info, id, name, type, class)
## Source: local data frame [6 x 4]
## 
##      id      name  type   class
##   (chr)     (chr) (chr)   (chr)
## 1   T01    SupCar   toy vehicle
## 2   T02  SupPlane   toy vehicle
## 3   M01     JeepX model vehicle
## 4   M02 AircraftX model vehicle
## 5   M03    Runner model  people
## 6   M04    Dancer model  people
This table is printed a bit differently from the way data.frame and data.table objects are printed: it shows not only the table itself, but also a header indicating the size of the data frame and the data type of each column.
It is clear that select() uses non-standard evaluation, which allows us to use the column names of the given data frame directly as arguments. It works in a way similar to subset(), transform(), and with().
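For comparison, here is the same kind of column selection written with select() and with the base functions it resembles; the small data frame is made up for illustration:

```r
library(dplyr)

# A tiny stand-in for product_info
info <- data.frame(
  id = c("T01", "M01"),
  name = c("SupCar", "JeepX"),
  type = c("toy", "model"))

# Non-standard evaluation: column names are passed as bare symbols
select(info, id, name)

# Base-R equivalents
info[, c("id", "name")]
subset(info, select = c(id, name))
```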
We can use filter to filter the data frame by a logical condition, which is also evaluated in the context of the data frame:
filter(product_info, released == "yes")
## Source: local data frame [4 x 5]
## 
##      id      name  type   class released
##   (chr)     (chr) (chr)   (chr)    (chr)
## 1   T01    SupCar   toy vehicle      yes
## 2   M01     JeepX model vehicle      yes
## 3   M02 AircraftX model vehicle      yes
## 4   M03    Runner model  people      yes
If we want to filter records with multiple conditions, we only need to write each condition as an argument of filter():
filter(product_info, released == "yes", type == "model")
## Source: local data frame [3 x 5]
## 
##      id      name  type   class released
##   (chr)     (chr) (chr)   (chr)    (chr)
## 1   M01     JeepX model vehicle      yes
## 2   M02 AircraftX model vehicle      yes
## 3   M03    Runner model  people      yes
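Multiple conditions given to filter() are combined with logical AND, so this call is equivalent to joining the conditions with &. A minimal sketch on a made-up data frame:

```r
library(dplyr)

d <- data.frame(
  type = c("toy", "model", "model"),
  released = c("yes", "yes", "no"))

# Separate arguments...
a <- filter(d, released == "yes", type == "model")
# ...behave like a single conjunction:
b <- filter(d, released == "yes" & type == "model")
identical(a, b)  # TRUE
```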
The mutate function is used to create a new data frame with new columns added or existing columns replaced. It works like transform, but also supports in-place assignment with := if the provided data is a data.table:
mutate(product_stats, density = size / weight)
## Source: local data frame [6 x 5]
## 
##      id material  size weight   density
##   (chr)    (chr) (int)  (dbl)     (dbl)
## 1   T01    Metal   120   10.0 12.000000
## 2   T02    Metal   350   45.0  7.777778
## 3   M01 Plastics    50     NA        NA
## 4   M02 Plastics    85    3.0 28.333333
## 5   M03     Wood    15     NA        NA
## 6   M04     Wood    16    0.6 26.666667
The arrange function creates a new data frame sorted by one or more columns; the desc() function indicates descending order:
arrange(product_stats, material, desc(size), desc(weight))
## Source: local data frame [6 x 4]
## 
##      id material  size weight
##   (chr)    (chr) (int)  (dbl)
## 1   T02    Metal   350   45.0
## 2   T01    Metal   120   10.0
## 3   M02 Plastics    85    3.0
## 4   M01 Plastics    50     NA
## 5   M04     Wood    16    0.6
## 6   M03     Wood    15     NA
The dplyr package provides a rich set of join operations, including inner_join, left_join, right_join, full_join, semi_join, and anti_join. If the two tables to join contain records that do not match, these join operations may behave very differently. For product_info and product_tests, the records match exactly, so left_join should return the same results as merge:
product_info_tests <- left_join(product_info, product_tests, by = "id")
product_info_tests
## Source: local data frame [6 x 8]
## 
##      id      name  type   class released quality durability
##   (chr)     (chr) (chr)   (chr)    (chr)   (int)      (int)
## 1   T01    SupCar   toy vehicle      yes      NA         10
## 2   T02  SupPlane   toy vehicle       no      10          9
## 3   M01     JeepX model vehicle      yes       6          4
## 4   M02 AircraftX model vehicle      yes       6          5
## 5   M03    Runner model  people      yes       5         NA
## 6   M04    Dancer model  people       no       6          6
## Variables not shown: waterproof (chr)
To learn more about the differences between these join operations, run ?dplyr::join.
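The differences show up as soon as some records do not match. The following sketch uses two small made-up tables with partially overlapping ids:

```r
library(dplyr)

x <- data.frame(id = c("A", "B", "C"), val = 1:3)
y <- data.frame(id = c("B", "C", "D"), score = 4:6)

inner_join(x, y, by = "id")  # only the matching ids B and C
left_join(x, y, by = "id")   # all rows of x; score is NA for A
semi_join(x, y, by = "id")   # rows of x with a match, no columns from y
anti_join(x, y, by = "id")   # rows of x without a match
```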
To summarize the data by groups, we first create a grouped table with group_by() and then aggregate it with summarize(). For example, we will group product_info_tests by type and class, and for each group we will calculate the average values of quality and durability:
summarize(group_by(product_info_tests, type, class),
  mean_quality = mean(quality, na.rm = TRUE),
  mean_durability = mean(durability, na.rm = TRUE))
## Source: local data frame [3 x 4]
## Groups: type [?]
## 
##    type   class mean_quality mean_durability
##   (chr)   (chr)        (dbl)           (dbl)
## 1 model  people          5.5             6.0
## 2 model vehicle          6.0             4.5
## 3   toy vehicle         10.0             9.5
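summarize() can compute any number of aggregates per group; for example, the dplyr helper n() counts the rows in each group. A sketch on a made-up table:

```r
library(dplyr)

d <- data.frame(
  type = c("toy", "toy", "model"),
  quality = c(10, 8, 6))

# One count and one mean per group
summarize(group_by(d, type),
  count = n(),
  mean_quality = mean(quality))
```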
From the preceding code examples, you learned the verb functions select(), filter(), mutate(), arrange(), group_by(), and summarize(). Each of them is designed to do one small thing, but properly composed, they can together perform comprehensive data operations. Apart from these functions, dplyr imports the pipeline operator %>% from the magrittr package to compose verb functions into pipelines.
Suppose we have product_info and product_tests. We need to analyze the released products by computing the average quality and durability for each type-class group, and present the summary in descending order of average quality. This can be done nicely with the dplyr verb functions composed by the pipeline operator:
product_info %>%
  filter(released == "yes") %>%
  inner_join(product_tests, by = "id") %>%
  group_by(type, class) %>%
  summarize(
    mean_quality = mean(quality, na.rm = TRUE),
    mean_durability = mean(durability, na.rm = TRUE)) %>%
  arrange(desc(mean_quality))
## Source: local data frame [3 x 4]
## Groups: type [2]
## 
##    type   class mean_quality mean_durability
##   (chr)   (chr)        (dbl)           (dbl)
## 1 model vehicle            6             4.5
## 2 model  people            5             NaN
## 3   toy vehicle          NaN            10.0
But how does %>% work? The pipeline operator basically does only one thing: it puts the result of the expression on its left-hand side into the first argument of the function call on its right-hand side. That is, x %>% f(...) is basically evaluated as f(x, ...). Since %>% is a package-defined binary operator, it allows us to chain function calls and thereby avoid redundant intermediate variables or deeply nested calls.
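To see this rewriting rule in isolation, here it is applied to ordinary base functions (%>% is loaded from magrittr, the package dplyr imports it from):

```r
library(magrittr)

# x %>% f(...) is evaluated as f(x, ...)
c(1, 4, 9) %>% sqrt() %>% sum()  # same as sum(sqrt(c(1, 4, 9)))
## [1] 6
```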
Suppose we need to transform d0 into d3 through three steps, where each step calls a function with the previous result and an additional argument. If we manipulate data like this, there will be many intermediate results, which can consume a lot of memory when the data is large:
d1 <- f1(d0, arg1)
d2 <- f2(d1, arg2)
d3 <- f3(d2, arg3)
If we want to avoid intermediate results, we'll have to write nested calls. This does not look straightforward at all, especially when each function call takes several arguments:
f3(f2(f1(d0, arg1), arg2), arg3)
Using the pipeline operator, the workflow can be rearranged as follows:
d0 %>% f1(arg1) %>% f2(arg2) %>% f3(arg3)
The code looks much cleaner and more straightforward. The whole expression not only looks like a pipeline but also works like one: d0 %>% f1(arg1) is evaluated as f1(d0, arg1), whose result is sent to f2(arg2), whose result is in turn sent to f3(arg3). The output of each step becomes the input of the next step.
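The left-hand side does not have to go into the first argument, either: magrittr lets you mark the target position with a dot. A brief sketch:

```r
library(magrittr)

# By default, the left-hand side becomes the first argument:
5 %>% seq_len()   # same as seq_len(5)

# When . appears as a direct argument, the left-hand side is
# inserted there instead:
5 %>% seq(2, .)   # same as seq(2, 5)
```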
The pipeline operator therefore works not only with dplyr functions, but with all other functions, too. Suppose we want to make a density plot of the diamond prices:
data(diamonds, package = "ggplot2")
plot(density(diamonds$price, from = 0),
  main = "Density plot of diamond prices")
The plot generated is shown as follows:
Using the pipeline operator, we can rewrite the code as follows:
diamonds$price %>%
  density(from = 0) %>%
  plot(main = "Density plot of diamond prices")
Like data.table, dplyr also supplies do() to perform arbitrary operations on each group of data. For example, we can group diamonds by cut, and for each group, fit a linear model of log(price) ~ carat. Different from data.table, we need to name such operations so that the results can be stored as columns. Also, the expression in do() is not directly evaluated in the context of the grouped data; instead, we need to use . to represent the data:
models <- diamonds %>%
  group_by(cut) %>%
  do(lmod = lm(log(price) ~ carat, data = .))
models
## Source: local data frame [5 x 2]
## Groups: <by row>
## 
##         cut    lmod
##      (fctr)   (chr)
## 1      Fair <S3:lm>
## 2      Good <S3:lm>
## 3 Very Good <S3:lm>
## 4   Premium <S3:lm>
## 5     Ideal <S3:lm>
Note that a new column, lmod, is created. It is not a typical data column of atomic vectors; rather, it is a list of linear model objects, that is, the model for each value of cut is stored in the list-typed column lmod. We can access each model by index:
models$lmod[[1]]
## 
## Call:
## lm(formula = log(price) ~ carat, data = .)
## 
## Coefficients:
## (Intercept)        carat  
##       6.785        1.251
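Since lmod is an ordinary list column, the usual list tools apply to it. The following sketch refits the same kind of per-group models on a small made-up data set (so it does not depend on ggplot2) and then extracts all coefficients at once:

```r
library(dplyr)

# Made-up data: log(price) is exactly linear in carat within each group
d <- data.frame(
  grp = rep(c("a", "b"), each = 5),
  carat = rep(1:5, times = 2),
  price = exp(c(1.2 * (1:5), 0.8 * (1:5))))

models <- d %>%
  group_by(grp) %>%
  do(lmod = lm(log(price) ~ carat, data = .))

# Traverse the list column like any list of fitted models
sapply(models$lmod, coef)
```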
The do() function can be very helpful for performing highly customized operations. For example, suppose we need to analyze the toy_tests data by summarizing the quality and durability of each product. Suppose further that we only want the top three test records with the most samples, and that the quality and durability of each product should be averages weighted by sample size.
Using dplyr functions and the pipeline, the preceding task can easily be done with the following code:
toy_tests %>%
  group_by(id) %>%
  arrange(desc(sample)) %>%
  do(head(., 3)) %>%
  summarize(
    quality = sum(quality * sample) / sum(sample),
    durability = sum(durability * sample) / sum(sample))
## Source: local data frame [2 x 3]
## 
##      id  quality durability
##   (chr)    (dbl)      (dbl)
## 1   T01 9.319149   9.382979
## 2   T02 9.040000   8.340000
Note that when the data is grouped, all subsequent operations are performed by group. To see the intermediate result, we will run the pipeline up to, but not including, do(head(., 3)):
toy_tests %>%
  group_by(id) %>%
  arrange(desc(sample))
## Source: local data frame [8 x 5]
## Groups: id [2]
## 
##      id     date sample quality durability
##   (chr)    (int)  (int)   (int)      (int)
## 1   T01 20160405    180       9         10
## 2   T01 20160302    150      10          9
## 3   T01 20160502    140       9          9
## 4   T01 20160201    100       9          9
## 5   T02 20160403     90       9          8
## 6   T02 20160502     85      10          9
## 7   T02 20160303     75       8          8
## 8   T02 20160201     70       7          9
We get all records sorted by sample in descending order. Then, do(head(., 3)) evaluates head(., 3) for each group, where . represents the data in the group:
toy_tests %>%
  group_by(id) %>%
  arrange(desc(sample)) %>%
  do(head(., 3))
## Source: local data frame [6 x 5]
## Groups: id [2]
## 
##      id     date sample quality durability
##   (chr)    (int)  (int)   (int)      (int)
## 1   T01 20160405    180       9         10
## 2   T01 20160302    150      10          9
## 3   T01 20160502    140       9          9
## 4   T02 20160403     90       9          8
## 5   T02 20160502     85      10          9
## 6   T02 20160303     75       8          8
Now we have the top three records with the most samples in each group, and it is straightforward to summarize the data as intended.
The dplyr package defines a very intuitive grammar of data manipulation and provides high-performance verb functions designed for use in pipelines. To learn more, I recommend that you read the package vignettes (https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) and take the interactive tutorial (https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial) at DataCamp.
function defines a very intuitive grammar of data manipulation and provides high-performance verb functions that are designed for use in pipeline. To learn more, I recommend that you read the package vignettes (https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) and visit the interactive tutorial (https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial) at DataCamp.