In the first section, we reviewed some built-in functions used to manipulate data frames. Then, we introduced sqldf
, which makes simple data queries and summaries easier. However, both approaches have their limitations. Using built-in functions can be verbose and slow, while with sqldf it is not easy to summarize data flexibly because SQL is not as powerful as the full spectrum of R functions.
The data.table
package provides a powerful enhanced version of data.frame
. It is blazing fast and can handle any large dataset that fits into memory. It introduces a natural syntax for data manipulation using []
. Run the following command to install the package from CRAN if you don't have it yet:
install.packages("data.table")
Once the package is successfully installed, we will load the package and see what it offers:
library(data.table) ## ## Attaching package: 'data.table' ## The following objects are masked from 'package:reshape2': ## ## dcast, melt
Note that we previously loaded the reshape2
package in which dcast
and melt
are defined. The data.table
package also provides enhanced versions of dcast
and melt
with more powerful functionality, better performance, and higher memory efficiency. We'll take a look at them later in this section.
Creating a data.table
is very much like creating a data.frame
:
dt <- data.table(x = 1:3, y = rnorm(3), z = letters[1:3]) dt ## x y z ## 1: 1 -0.50219235 a ## 2: 2 0.13153117 b ## 3: 3 -0.07891709 c
We can see its structure with str()
:
str(dt) ## Classes 'data.table' and 'data.frame': 3 obs. of 3 variables: ## $ x: int 1 2 3 ## $ y: num -0.5022 0.1315 -0.0789 ## $ z: chr "a" "b" "c" ## - attr(*, ".internal.selfref")=<externalptr>
It is clear that dt
is of class data.table
and data.frame
, which means that data.table
inherits from data.frame
. In other words, it inherits some behaviors of data.frame
, but overrides others as enhancements.
First, we still load the product data. However, this time, we will use fread()
provided by data.table
package. The fread()
function is super-fast, has great memory efficiency, and directly returns a data.table
:
product_info <- fread("data/product-info.csv") product_stats <- fread("data/product-stats.csv") product_tests <- fread("data/product-tests.csv") toy_tests <- fread("data/product-toy-tests.csv")
If we take a look at product_info
, its appearance is only slightly different from that of a data frame:
product_info ## id name type class released ## 1: T01 SupCar toy vehicle yes ## 2: T02 SupPlane toy vehicle no ## 3: M01 JeepX model vehicle yes ## 4: M02 AircraftX model vehicle yes ## 5: M03 Runner model people yes ## 6: M04 Dancer model people no
Again, we will look at its structure:
str(product_info) ## Classes 'data.table' and 'data.frame': 6 obs. of 5 variables: ## $ id : chr "T01" "T02" "M01" "M02" ... ## $ name : chr "SupCar" "SupPlane" "JeepX" "AircraftX" ... ## $ type : chr "toy" "toy" "model" "model" ... ## $ class : chr "vehicle" "vehicle" "vehicle" "vehicle" ... ## $ released: chr "yes" "no" "yes" "yes" ... ## - attr(*, ".internal.selfref")=<externalptr>
As compared to data.frame
, if we supply only one argument to subset data.table
, it means selecting rows rather than columns:
product_info[1] ## id name type class released ## 1: T01 SupCar toy vehicle yes product_info[1:3] ## id name type class released ## 1: T01 SupCar toy vehicle yes ## 2: T02 SupPlane toy vehicle no ## 3: M01 JeepX model vehicle yes
If the number we supply in []
is negative, it means excluding the records at those positions, which is fully consistent with subsetting a vector:
product_info[-1] ## id name type class released ## 1: T02 SupPlane toy vehicle no ## 2: M01 JeepX model vehicle yes ## 3: M02 AircraftX model vehicle yes ## 4: M03 Runner model people yes ## 5: M04 Dancer model people no
In addition, data.table
also provides a number of symbols that represent important components of data.table
. One of the most useful symbols is .N
, which means the number of rows. If we want to select the last row, we no longer need nrow(product_info)
:
product_info[.N] ## id name type class released ## 1: M04 Dancer model people no
We can easily select the first and last rows:
product_info[c(1, .N)] ## id name type class released ## 1: T01 SupCar toy vehicle yes ## 2: M04 Dancer model people no
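Because .N is evaluated as an ordinary number inside the brackets, it also works in arithmetic expressions. Here is a small sketch that selects the last two rows, rebuilding a minimal version of product_info inline so it runs on its own:

```r
library(data.table)

# a minimal inline copy of the product_info data shown above
product_info <- data.table(
  id   = c("T01", "T02", "M01", "M02", "M03", "M04"),
  name = c("SupCar", "SupPlane", "JeepX", "AircraftX", "Runner", "Dancer"))

# .N is the number of rows, so (.N - 1):.N indexes the last two rows
last_two <- product_info[(.N - 1):.N]
```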
The syntax of data.table
subsetting automatically evaluates the expressions in the context of the data, that is, we can directly use column names as symbols, just like how we use subset
, transform
, and with
. For example, we can directly use released
in the first argument to select rows of products that are released:
product_info[released == "yes"] ## id name type class released ## 1: T01 SupCar toy vehicle yes ## 2: M01 JeepX model vehicle yes ## 3: M02 AircraftX model vehicle yes ## 4: M03 Runner model people yes
The first argument in the square brackets is a row filter, while the second is evaluated within the context of the filtered data. For example, we can directly use id
to represent product_info$id
because id
is evaluated within the context of product_info
:
product_info[released == "yes", id] ## [1] "T01" "M01" "M02" "M03"
The way to select columns of a data frame does not work here. If we put a character vector in the second argument, it is evaluated as an ordinary expression, so we simply get the character vector itself back:
product_info[released == "yes", "id"] ## [1] "id"
To disable this behavior, we can specify with = FALSE
so that the second argument accepts a character vector to select columns, and it always returns a data.table
, no matter how many columns are specified:
product_info[released == "yes", "id", with = FALSE] ## id ## 1: T01 ## 2: M01 ## 3: M02 ## 4: M03 product_info[released == "yes", c("id", "name"), with = FALSE] ## id name ## 1: T01 SupCar ## 2: M01 JeepX ## 3: M02 AircraftX ## 4: M03 Runner
We can also write some other expressions as the second argument. For example, we can generate a table of the number of released products for each combination of type
and class
:
product_info[released == "yes", table(type, class)] ## class ## type people vehicle ## model 1 2 ## toy 0 1
However, if a list is produced, it will be transformed to data.table
instead:
product_info[released == "yes", list(id, name)] ## id name ## 1: T01 SupCar ## 2: M01 JeepX ## 3: M02 AircraftX ## 4: M03 Runner
In this way, we can easily create a new data.table
with existing columns replaced:
product_info[, list(id, name, released = released == "yes")] ## id name released ## 1: T01 SupCar TRUE ## 2: T02 SupPlane FALSE ## 3: M01 JeepX TRUE ## 4: M02 AircraftX TRUE ## 5: M03 Runner TRUE ## 6: M04 Dancer FALSE
We can also easily create a new data.table
with new columns based on existing columns:
product_stats[, list(id, material, size, weight, density = size / weight)] ## id material size weight density ## 1: T01 Metal 120 10.0 12.000000 ## 2: T02 Metal 350 45.0 7.777778 ## 3: M01 Plastics 50 NA NA ## 4: M02 Plastics 85 3.0 28.333333 ## 5: M03 Wood 15 NA NA ## 6: M04 Wood 16 0.6 26.666667
For simplicity, data.table
provides .()
as a shorthand for list()
:
product_info[, .(id, name, type, class)] ## id name type class ## 1: T01 SupCar toy vehicle ## 2: T02 SupPlane toy vehicle ## 3: M01 JeepX model vehicle ## 4: M02 AircraftX model vehicle ## 5: M03 Runner model people ## 6: M04 Dancer model people product_info[released == "yes", .(id, name)] ## id name ## 1: T01 SupCar ## 2: M01 JeepX ## 3: M02 AircraftX ## 4: M03 Runner
By supplying the ordered indices, we can easily sort the records by the given criterion:
product_stats[order(size, decreasing = TRUE)] ## id material size weight ## 1: T02 Metal 350 45.0 ## 2: T01 Metal 120 10.0 ## 3: M02 Plastics 85 3.0 ## 4: M01 Plastics 50 NA ## 5: M04 Wood 16 0.6 ## 6: M03 Wood 15 NA
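order() accepts multiple columns, and a numeric column can be negated to sort it in decreasing order while the others stay increasing. A sketch with an inline copy of product_stats:

```r
library(data.table)

# a minimal inline copy of the product_stats data shown above
product_stats <- data.table(
  id       = c("T01", "T02", "M01", "M02", "M03", "M04"),
  material = c("Metal", "Metal", "Plastics", "Plastics", "Wood", "Wood"),
  size     = c(120, 350, 50, 85, 15, 16))

# sort by material alphabetically, then by size in decreasing order
sorted <- product_stats[order(material, -size)]
```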
Previously, we always created a new data.table
after subsetting. The data.table
package also provides :=
for in-place assignment of columns. For example, the original data of product_stats
is shown as follows:
product_stats ## id material size weight ## 1: T01 Metal 120 10.0 ## 2: T02 Metal 350 45.0 ## 3: M01 Plastics 50 NA ## 4: M02 Plastics 85 3.0 ## 5: M03 Wood 15 NA ## 6: M04 Wood 16 0.6
We will use :=
to create a new column directly in product_stats
:
product_stats[, density := size / weight]
Nothing is printed here, but the original data.table
is modified:
product_stats ## id material size weight density ## 1: T01 Metal 120 10.0 12.000000 ## 2: T02 Metal 350 45.0 7.777778 ## 3: M01 Plastics 50 NA NA ## 4: M02 Plastics 85 3.0 28.333333 ## 5: M03 Wood 15 NA NA ## 6: M04 Wood 16 0.6 26.666667
We can use :=
to replace an existing column:
product_info[, released := released == "yes"] product_info ## id name type class released ## 1: T01 SupCar toy vehicle TRUE ## 2: T02 SupPlane toy vehicle FALSE ## 3: M01 JeepX model vehicle TRUE ## 4: M02 AircraftX model vehicle TRUE ## 5: M03 Runner model people TRUE ## 6: M04 Dancer model people FALSE
The data.table
package provides :=
mainly because in-place modification offers much higher performance, since it avoids unnecessary copies of the data.
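The reference semantics of := also covers removing columns: assigning NULL deletes a column in place. Because := modifies the original table, copy() can be used first when the original must stay intact. A minimal sketch:

```r
library(data.table)

dt <- data.table(x = 1:3, y = letters[1:3], z = rnorm(3))

# copy() makes a true duplicate; without it, := on dt2 would modify dt as well
dt2 <- copy(dt)
dt2[, z := NULL]  # remove column z in place
```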
Another distinct feature of data.table
is its support for indexing, that is, we can create a key on a data.table
, so accessing records by key can be extremely efficient. For example, we will use setkey()
to make id
the key of product_info
:
setkey(product_info, id)
Note that this function behaves very differently from most R functions. It does not return a new copy of the data table but directly installs a key on the original input, sorting it by the key column in place:
product_info ## id name type class released ## 1: M01 JeepX model vehicle TRUE ## 2: M02 AircraftX model vehicle TRUE ## 3: M03 Runner model people TRUE ## 4: M04 Dancer model people FALSE ## 5: T01 SupCar toy vehicle TRUE ## 6: T02 SupPlane toy vehicle FALSE
Also, its key is now set:
key(product_info) ## [1] "id"
Now, we can use a key to access the records in product_info
. For example, we can directly write a value of id
to get the records with that id
:
product_info["M01"] ## id name type class released ## 1: M01 JeepX model vehicle TRUE
If we try this with a data.table
that has no key, an error occurs and reminds us to set a key:
product_stats["M01"] ## Error in `[.data.table`(product_stats, "M01"): When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
We can also use setkeyv()
to set the key, but it only accepts a character vector of column names:
setkeyv(product_stats, "id")
This function is much easier to use if the key columns are determined dynamically as a character vector. Now, we can use the key to access product_stats
too:
product_stats["M02"] ## id material size weight density ## 1: M02 Plastics 85 3 28.33333
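A keyed lookup accepts a vector of key values as well, returning the matching records in the order the values are supplied. A sketch with a small keyed table:

```r
library(data.table)

product_info <- data.table(
  id   = c("T01", "T02", "M01", "M02"),
  name = c("SupCar", "SupPlane", "JeepX", "AircraftX"))
setkey(product_info, id)

# fetch several records at once by their key values
found <- product_info[c("M01", "T02")]
```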
If two tables have the same key, we can easily join them together:
product_info[product_stats] ## id name type class released material size ## 1: M01 JeepX model vehicle TRUE Plastics 50 ## 2: M02 AircraftX model vehicle TRUE Plastics 85 ## 3: M03 Runner model people TRUE Wood 15 ## 4: M04 Dancer model people FALSE Wood 16 ## 5: T01 SupCar toy vehicle TRUE Metal 120 ## 6: T02 SupPlane toy vehicle FALSE Metal 350 ## weight density ## 1: NA NA ## 2: 3.0 28.333333 ## 3: NA NA ## 4: 0.6 26.666667 ## 5: 10.0 12.000000 ## 6: 45.0 7.777778
The key of a data.table
can consist of more than one column. For example, to locate a record of toy_tests
, we need to specify both id
and date
. In the following code, we will set a key of the two columns on toy_tests
:
setkey(toy_tests, id, date)
Now, we can get a row by supplying both elements in the key:
toy_tests[.("T01", 20160201)] ## id date sample quality durability ## 1: T01 20160201 100 9 9
If we only supply the first element, we would get a subset of the data with all records that match the first element:
toy_tests["T01"] ## id date sample quality durability ## 1: T01 20160201 100 9 9 ## 2: T01 20160302 150 10 9 ## 3: T01 20160405 180 9 10 ## 4: T01 20160502 140 9 9
However, if we only supply the second element, we can't get anything but an error. This is because the algorithm behind keyed search matches the key columns in order, starting from the first:
toy_tests[.(20160201)] ## Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'id' is a character column being joined to i.'V1' which is type 'double'. Character columns must join to factor or character columns.
Also, we cannot get any data if we supply the key values in the wrong order:
toy_tests[.(20160201, "T01")] ## Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'id' is a character column being joined to i.'V1' which is type 'double'. Character columns must join to factor or character columns.
Another important argument of subsetting a data.table
is by
, which is used to split the data into multiple parts, and for each part, evaluate the second argument. In this section, we'll demonstrate how the by
syntax makes it much easier to summarize data by groups. For example, the simplest usage of by
is counting the records in each group. In the following code, we will count the number of both released and unreleased products:
product_info[, .N, by = released] ## released N ## 1: TRUE 4 ## 2: FALSE 2
The group can be defined by more than one variable. For example, a tuple of type
and class
can be a group, and for each group, we will count the number of records:
product_info[, .N, by = .(type, class)] ## type class N ## 1: model vehicle 2 ## 2: model people 2 ## 3: toy vehicle 2
We can also perform statistical calculations for each group. Here, we will compute the mean value of quality for both waterproof products and non-waterproof ones:
product_tests[, mean(quality, na.rm = TRUE), by = .(waterproof)] ## waterproof V1 ## 1: no 10.00 ## 2: yes 5.75
Note that the mean values are stored in V1
because we didn't supply a name for the column, so the package used its default column name. To avoid that, we will use an expression of the form .(y = f(x))
instead:
product_tests[, .(mean_quality = mean(quality, na.rm = TRUE)), by = .(waterproof)] ## waterproof mean_quality ## 1: no 10.00 ## 2: yes 5.75
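The list in j can hold any number of named expressions, so several statistics per group come out as columns of one data.table. A sketch with inline data modeled on product_tests:

```r
library(data.table)

# inline data roughly modeled on product_tests above
tests <- data.table(
  waterproof = c("no", "no", "yes", "yes", "yes"),
  quality    = c(10, 10, 5, 6, NA))

stats <- tests[, .(
  mean_quality = mean(quality, na.rm = TRUE),
  n            = .N  # group size via the .N symbol
), by = .(waterproof)]
```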
We can chain multiple []
in turn. In the following example, we will first join product_info
and product_tests
by shared key id
and then calculate the mean values of quality
and durability
for each group of type
and class
of released products:
product_info[product_tests][released == TRUE, .(mean_quality = mean(quality, na.rm = TRUE), mean_durability = mean(durability, na.rm = TRUE)), by = .(type, class)] ## type class mean_quality mean_durability ## 1: toy vehicle NaN 10.0 ## 2: model vehicle 6 4.5 ## 3: model people 5 NaN
Note that the values of by
columns will be unique in the resulting data.table
. We can use keyby
instead of by
to ensure that the grouping columns are automatically used as the key of the resulting data.table
:
type_class_tests <- product_info[product_tests][released == TRUE, .(mean_quality = mean(quality, na.rm = TRUE), mean_durability = mean(durability, na.rm = TRUE)), keyby = .(type, class)] type_class_tests ## type class mean_quality mean_durability ## 1: model people 5 NaN ## 2: model vehicle 6 4.5 ## 3: toy vehicle NaN 10.0 key(type_class_tests) ## [1] "type" "class"
Then, we can directly use a tuple of key values to access the records:
type_class_tests[.("model", "vehicle"), mean_quality] ## [1] 6
You can clearly see that using keys can be much more convenient than using logical comparisons when we try to find certain records in a table. However, its true advantage is not demonstrated yet because the data is not large enough. For large data, searching by key can be much faster than iterative logical comparison, because a keyed search takes advantage of binary search while iteration wastes a lot of time on unnecessary computation.
Here is an example to make the contrast. First, we will create a data frame of 10 million rows with an index column id
and two numeric columns filled with random numbers:
n <- 10000000 test1 <- data.frame(id = 1:n, x = rnorm(n), y = rnorm(n))
Now, we want to find the row whose id
is 876543
. Let's see how long it takes:
system.time(row <- test1[test1$id == 876543, ]) ## user system elapsed ## 0.156 0.036 0.192 row ## id x y ## 876543 876543 0.02300419 1.291588
This may seem like no big deal, but suppose you need to do this frequently, say, hundreds of times per second; then your machine simply cannot return the results in time.
Then, we will use data.table
to do this. First, we will call setDT()
to transform
data.frame
to data.table
. This function performs some magic to transform the object in place, with no copy made. When we use the setDT()
function, we also provide a key id
so that the resulting data.table
has id
as its keyed column:
setDT(test1, key = "id") class(test1) ## [1] "data.table" "data.frame"
Now, test1
is transformed to a data.table
. Then, we will perform the same kind of search, this time by key:
system.time(row <- test1[.(8765432)]) ## user system elapsed ## 0.000 0.000 0.001 row ## id x y ## 1: 8765432 0.2532357 -2.121696
The keyed search is dramatically faster: the time data.table
takes is far shorter than that of data.frame
.
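As a side note, newer versions of data.table (1.9.6 and later) also support ad hoc joins through the on argument, so a one-off keyed-style lookup does not require setkey() at all. A minimal sketch:

```r
library(data.table)

dt <- data.table(id = 1:5, x = letters[1:5])  # no key set

# `on` performs the join on the fly without a pre-installed key
row <- dt[.(3L), on = "id"]
```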
Previously, you learned how to reshape a data frame with the reshape2
package. The data.table
package provides faster and more powerful implementations of dcast
and melt
for the data.table
object.
For example, we will reshape toy_tests
by aligning the quality scores of each product to year-month tuples:
toy_tests[, ym := substr(date, 1, 6)] toy_quality <- dcast(toy_tests, ym ~ id, value.var = "quality") toy_quality ## ym T01 T02 ## 1: 201602 9 7 ## 2: 201603 10 8 ## 3: 201604 9 9 ## 4: 201605 9 10
First, we used :=
to create a new column ym
directly in toy_tests
and then used dcast
to transform it in the same way as in the previous example with reshape2
. The result looks the same as the output of reshape2::dcast
for data.frame
.
While reshape2::dcast
does not support multi-value value.var
, data.table::dcast
works with multiple value variables, as shown here:
toy_tests2 <- dcast(toy_tests, ym ~ id, value.var = c("quality", "durability")) toy_tests2 ## ym quality_T01 quality_T02 durability_T01 ## 1: 201602 9 7 9 ## 2: 201603 10 8 9 ## 3: 201604 9 9 10 ## 4: 201605 9 10 9 ## durability_T02 ## 1: 9 ## 2: 8 ## 3: 8 ## 4: 9
The column names, except the first, are no longer the values of id
; instead, each is a value variable and a value of id
joined by an underscore. In addition, the key of the output data.table
is automatically set to the variables that appear on the left-hand side of the reshaping formula
ym ~ id
:
key(toy_tests2) ## [1] "ym"
The key implies that we can access the records directly by supplying a value of ym
. However, the following code ends up with an error:
toy_tests2[.(201602)] ## Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'ym' is a character column being joined to i.'V1' which is type 'double'. Character columns must join to factor or character columns.
There's something wrong with the data types. We can run the following code to see the class of each column:
sapply(toy_tests2, class) ## ym quality_T01 quality_T02 durability_T01 ## "character" "integer" "integer" "integer" ## durability_T02 ## "integer"
The problem lies in the class of ym
. It is a character vector, but we supplied a key of numeric values. Therefore, the search fails with unmatched data types. If we supply a string instead, we get the corresponding record:
toy_tests2["201602"] ## ym quality_T01 quality_T02 durability_T01 ## 1: 201602 9 7 9 ## durability_T02 ## 1: 9
But how did ym
become a character vector in the first place? Recall ym := substr(date, 1, 6)
where date
is an integer vector, but substr()
will coerce date
to a character vector and then take out the first six characters. Therefore, it is natural that the result is a character vector. This is simply demonstrated as follows:
class(20160101) ## [1] "numeric" class(substr(20160101, 1, 6)) ## [1] "character"
The point here is that we need to be careful about the data types of the key columns.
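For completeness, the enhanced melt() can reverse such a multi-value reshape. The following sketch rebuilds a small wide table shaped like toy_tests2 inline; patterns() and the vector form of value.name are data.table extensions of the reshape2 interface:

```r
library(data.table)

# a wide table shaped like toy_tests2 above
wide <- data.table(
  ym             = c("201602", "201603"),
  quality_T01    = c(9L, 10L),
  quality_T02    = c(7L, 8L),
  durability_T01 = c(9L, 9L),
  durability_T02 = c(9L, 8L))

# patterns() selects measure columns by regular expression;
# a vector value.name melts the two measures side by side
long <- melt(wide, id.vars = "ym",
             measure.vars = patterns("^quality", "^durability"),
             value.name = c("quality", "durability"))
```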
If we use data.frame
, renaming columns or changing the column order will cause copies of the data structure. In recent R versions, fewer copies are made when we rename columns, but it is still hard to reorder the columns of a data frame without making a new copy. This should not be a problem when the data is small, but if the data is very large, the performance and memory pressure it imposes can really be an issue.
As an enhanced version of data.frame
, data.table
provides a family of set
functions with reference semantics, that is, they modify data.table
in place and avoid unnecessary copying, thus exhibiting astonishing performance.
Take product_stats
as an example. We can call setDF()
to change data.table
to data.frame
in place without making copies:
product_stats ## id material size weight density ## 1: M01 Plastics 50 NA NA ## 2: M02 Plastics 85 3.0 28.333333 ## 3: M03 Wood 15 NA NA ## 4: M04 Wood 16 0.6 26.666667 ## 5: T01 Metal 120 10.0 12.000000 ## 6: T02 Metal 350 45.0 7.777778 setDF(product_stats) class(product_stats) ## [1] "data.frame"
We can call setDT()
to convert any data.frame
to a data.table
and set up a key if specified:
setDT(product_stats, key = "id") class(product_stats) ## [1] "data.table" "data.frame"
We can call setnames()
to rename the given columns in place:
setnames(product_stats, "size", "volume") product_stats ## id material volume weight density ## 1: M01 Plastics 50 NA NA ## 2: M02 Plastics 85 3.0 28.333333 ## 3: M03 Wood 15 NA NA ## 4: M04 Wood 16 0.6 26.666667 ## 5: T01 Metal 120 10.0 12.000000 ## 6: T02 Metal 350 45.0 7.777778
When we add a new column, it appears as the last one. For example, we will add an index column for all rows using .I
representing 1:.N
:
product_stats[, i := .I] product_stats ## id material volume weight density i ## 1: M01 Plastics 50 NA NA 1 ## 2: M02 Plastics 85 3.0 28.333333 2 ## 3: M03 Wood 15 NA NA 3 ## 4: M04 Wood 16 0.6 26.666667 4 ## 5: T01 Metal 120 10.0 12.000000 5 ## 6: T02 Metal 350 45.0 7.777778 6
By convention, the index column should, in most cases, appear as the first column. We can supply a new order of column names to setcolorder()
so that the columns are directly reordered without making copies:
setcolorder(product_stats, c("i", "id", "material", "weight", "volume", "density")) product_stats ## i id material weight volume density ## 1: 1 M01 Plastics NA 50 NA ## 2: 2 M02 Plastics 3.0 85 28.333333 ## 3: 3 M03 Wood NA 15 NA ## 4: 4 M04 Wood 0.6 16 26.666667 ## 5: 5 T01 Metal 10.0 120 12.000000 ## 6: 6 T02 Metal 45.0 350 7.777778
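The family also includes the low-level set() function, which assigns cells by reference and avoids even the small overhead of [.data.table, which is useful inside tight loops. A sketch:

```r
library(data.table)

dt <- data.table(id = 1:3, weight = c(NA, 3.0, 0.6))

# set(x, i, j, value) assigns by reference: row 1 of column "weight"
set(dt, i = 1L, j = "weight", value = 1.2)
```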
The most commonly used syntax of data.table
is data[i, j, by]
, where i
, j
, and by
are all evaluated with dynamic scoping. In other words, we can use not only the columns directly, but also the predefined symbols such as .N
, .I
, and .SD
to refer to important components of the data, as well as symbols and functions that can be accessed in the calling environment.
Before demonstrating this, we will create a new data.table
named market_data
with a consecutive column of date
:
market_data <- data.table(date = as.Date("2015-05-01") + 0:299) head(market_data) ## date ## 1: 2015-05-01 ## 2: 2015-05-02 ## 3: 2015-05-03 ## 4: 2015-05-04 ## 5: 2015-05-05 ## 6: 2015-05-06
Then, we will add two new columns to market_data
by calling :=
as a function:
set.seed(123) market_data[, `:=`( price = round(30 * cumprod(1 + rnorm(300, 0.001, 0.05)), 2), volume = rbinom(300, 5000, 0.8) )]
Note that price
is a simple random walk, and volume
is randomly drawn from a binomial distribution:
head(market_data) ## date price volume ## 1: 2015-05-01 29.19 4021 ## 2: 2015-05-02 28.88 4000 ## 3: 2015-05-03 31.16 4033 ## 4: 2015-05-04 31.30 4036 ## 5: 2015-05-05 31.54 3995 ## 6: 2015-05-06 34.27 3955
Then, we will plot the data:
plot(price ~ date, data = market_data, type = "l", main = "Market data")
The plot generated is shown as follows:
Once the data is ready, we can aggregate the data and see how dynamic scoping can be used to make things easier.
First, we will look at the range of the date
column:
market_data[, range(date)] ## [1] "2015-05-01" "2016-02-24"
The data can easily be reduced to monthly open-high-low-close (OHLC) data by grouped aggregation:
monthly <- market_data[, .(open = price[[1]], high = max(price), low = min(price), close = price[[.N]]), keyby = .(year = year(date), month = month(date))] head(monthly) ## year month open high low close ## 1: 2015 5 29.19 37.71 26.15 28.44 ## 2: 2015 6 28.05 37.63 28.05 37.21 ## 3: 2015 7 36.32 40.99 32.13 40.99 ## 4: 2015 8 41.52 50.00 30.90 30.90 ## 5: 2015 9 30.54 34.46 22.89 27.02 ## 6: 2015 10 25.68 33.18 24.65 29.32
In the j
expression, we can generate an OHLC record with each data.table
grouped by year
and month
. If the output of j
is a list
, data.frame
, or data.table
, then the results for all groups will be stacked together into one data.table
.
In fact, the j
expression can be anything, even with by
specified. More specifically, j
is evaluated within the context of each data.table
as a subset of the original data split by the value of the by
expression. For example, the following code does not aggregate data by group, but plots a price chart for each year:
oldpar <- par(mfrow = c(1, 2)) market_data[, { plot(price ~ date, type = "l", main = sprintf("Market data (%d)", year)) }, by = .(year = year(date))] par(oldpar)
The plot generated is shown as follows:
Note that we don't specify the data
argument of plot
because it is evaluated in the context of market_data
grouped by year
where price
and date
are already defined.
Moreover, the j
expression can be model-fitting code. Here is an example of batch fitting of linear models. First, we will load diamonds
data from the ggplot2
package:
data("diamonds", package = "ggplot2") setDT(diamonds) head(diamonds) ## carat cut color clarity depth table price x ## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 ## 2: 0.21 Premium E SI1 59.8 61 326 3.89 ## 3: 0.23 Good E VS1 56.9 65 327 4.05 ## 4: 0.29 Premium I VS2 62.4 58 334 4.20 ## 5: 0.31 Good J SI2 63.3 58 335 4.34 ## 6: 0.24 Very Good J VVS2 62.8 57 336 3.94 ## y z ## 1: 3.98 2.43 ## 2: 3.84 2.31 ## 3: 4.07 2.31 ## 4: 4.23 2.63 ## 5: 4.35 2.75 ## 6: 3.96 2.48
The data contains 53940 records of diamonds with 10 properties. Here, we will fit linear regression models on each group of cut
to see how carat
and depth
may provide some information of log(price)
in each group.
In the following code, the j
expression involves fitting a linear model and coercing its coefficients into a list. Note that the j
expression is evaluated for each value of the keyby
expression. Since a list is returned, the estimated linear coefficients for all groups will be stacked into one data.table
, as shown here:
diamonds[, { m <- lm(log(price) ~ carat + depth) as.list(coef(m)) }, keyby = .(cut)] ## cut (Intercept) carat depth ## 1: Fair 7.730010 1.264588 -0.014982439 ## 2: Good 7.077469 1.973600 -0.014601101 ## 3: Very Good 6.293642 2.087957 -0.002890208 ## 4: Premium 5.934310 1.852778 0.005939651 ## 5: Ideal 8.495409 2.125605 -0.038080022
Dynamic scoping also allows us to combine the use of predefined symbols and symbols defined inside or outside data.table
. For example, we can define a function to calculate the annual average values of a user-defined column of market_data
:
average <- function(column) { market_data[, .(average = mean(.SD[[column]])), by = .(year = year(date))] }
In the preceding j
expression, .SD
means the grouped data.table
for each value of year
. We can use .SD[[x]]
to extract the values of column x
, just like extracting an element by name from a list.
Then, we can run the following code to calculate the average prices for each year:
average("price") ## year average ## 1: 2015 32.32531 ## 2: 2016 32.38364
We will just change the argument to volume
to calculate the average volumes for each year:
average("volume") ## year average ## 1: 2015 3999.931 ## 2: 2016 4003.382
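The same idea extends to several columns at once: .SD is itself a data.table, so lapply(.SD, mean) averages every column listed in .SDcols within each group. A sketch with inline data modeled on market_data:

```r
library(data.table)

# inline data roughly modeled on market_data above
md <- data.table(
  year   = c(2015, 2015, 2016, 2016),
  price  = c(30, 32, 28, 26),
  volume = c(4000, 4100, 3900, 3800))

# average every column named in .SDcols within each year
averages <- md[, lapply(.SD, mean), by = .(year), .SDcols = c("price", "volume")]
```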
Also, we can use a special syntax to create a dynamic number of columns with dynamically determined names.
Suppose we add three new alternative price columns, each adds some random noise to the original price
values. Instead of repeatedly calling market_data[, price1 := ...]
and market_data[, price2 := ...]
, we can use market_data[, (columns) := list(...)]
to set columns dynamically, where columns
is a character vector of column names and list(...)
is the values for each corresponding column in columns
:
price_cols <- paste0("price", 1:3) market_data[, (price_cols) := lapply(1:3, function(i) round(price + rnorm(.N, 0, 5), 2))] head(market_data) ## date price volume price1 price2 price3 ## 1: 2015-05-01 29.19 4021 30.55 27.39 33.22 ## 2: 2015-05-02 28.88 4000 29.67 20.45 36.00 ## 3: 2015-05-03 31.16 4033 34.31 26.94 27.24 ## 4: 2015-05-04 31.30 4036 29.32 29.01 28.04 ## 5: 2015-05-05 31.54 3995 36.04 32.06 34.79 ## 6: 2015-05-06 34.27 3955 30.12 30.96 35.19
On the other hand, if we get a table with many columns and we need to perform some computation on a subset of them, we can also use similar syntax to solve the problem. Imagine that the price-related columns may have missing values. We need to perform zoo::na.locf()
on each price column. First, we will use a regular expression to get all the price columns:
cols <- colnames(market_data) price_cols <- cols[grep("^price", cols)] price_cols ## [1] "price" "price1" "price2" "price3"
Then, we will use similar syntax but add an additional argument, .SDcols = price_cols
, in order to limit the columns of .SD
to be only the price columns we get. The following code calls zoo::na.locf()
on each price column, and the old values of each column are replaced:
market_data[, (price_cols) := lapply(.SD, zoo::na.locf), .SDcols = price_cols]
In this section, we have demonstrated the usage of data.table
and how it makes data manipulation much easier. To see the full feature list of data.table
, visit
https://github.com/Rdatatable/data.table/wiki. To quickly review the usage, go through the data table cheat sheet (https://www.datacamp.com/community/tutorials/data-table-cheat-sheet).