Using data.table to manipulate data

In the first section, we reviewed some built-in functions used to manipulate data frames. Then, we introduced sqldf, which makes simple data queries and summaries easier. However, both approaches have their limitations. Using built-in functions can be verbose and slow, and summarizing data with sqldf is limited because SQL does not offer the full spectrum of R functions.

The data.table package provides a powerful enhanced version of data.frame. It is blazing fast and can handle large data as long as it fits into memory. It introduces a natural syntax for data manipulation via []. Run the following command to install the package from CRAN if you don't have it yet:

install.packages("data.table") 

Once the package is successfully installed, we will load the package and see what it offers:

library(data.table) 
##  
## Attaching package: 'data.table' 
## The following objects are masked from 'package:reshape2': 
##  
##     dcast, melt 

Note that we previously loaded the reshape2 package, in which dcast and melt are defined. The data.table package provides enhanced versions of dcast and melt with more powerful functionality, better performance, and higher memory efficiency. We'll take a look at them later in this section.

Creating a data.table is very much like creating a data.frame:

dt <- data.table(x = 1:3, y = rnorm(3), z = letters[1:3]) 
dt 
##    x           y z 
## 1: 1 -0.50219235 a 
## 2: 2  0.13153117 b 
## 3: 3 -0.07891709 c 

We can see its structure with str():

str(dt) 
## Classes 'data.table' and 'data.frame':   3 obs. of  3 variables: 
##  $ x: int  1 2 3 
##  $ y: num  -0.5022 0.1315 -0.0789 
##  $ z: chr  "a" "b" "c" 
##  - attr(*, ".internal.selfref")=<externalptr> 

It is clear that dt has the classes data.table and data.frame, which means that data.table inherits from data.frame. In other words, it inherits some behaviors of data.frame but overrides others as enhancements.
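We can confirm this inheritance with a quick check using base R predicates (a small standalone example, not from the original data):

```r
library(data.table)

dt <- data.table(x = 1:3)
is.data.table(dt)  # TRUE
is.data.frame(dt)  # TRUE: functions expecting a data.frame still work on dt
```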

First, we will load the product data again, but this time with fread() provided by the data.table package. The fread() function is super fast, is memory efficient, and directly returns a data.table:

product_info <- fread("data/product-info.csv") 
product_stats <- fread("data/product-stats.csv") 
product_tests <- fread("data/product-tests.csv") 
toy_tests <- fread("data/product-toy-tests.csv") 

If we take a look at product_info, its appearance is only slightly different from that of a data frame:

product_info 
##     id      name  type   class released 
## 1: T01    SupCar   toy vehicle      yes 
## 2: T02  SupPlane   toy vehicle       no 
## 3: M01     JeepX model vehicle      yes 
## 4: M02 AircraftX model vehicle      yes 
## 5: M03    Runner model  people      yes 
## 6: M04    Dancer model  people       no 

Again, we will look at its structure:

str(product_info) 
## Classes 'data.table' and 'data.frame':   6 obs. of  5 variables: 
##  $ id      : chr  "T01" "T02" "M01" "M02" ... 
##  $ name    : chr  "SupCar" "SupPlane" "JeepX" "AircraftX" ... 
##  $ type    : chr  "toy" "toy" "model" "model" ... 
##  $ class   : chr  "vehicle" "vehicle" "vehicle" "vehicle" ... 
##  $ released: chr  "yes" "no" "yes" "yes" ... 
##  - attr(*, ".internal.selfref")=<externalptr> 

Compared to data.frame, supplying only one argument when subsetting a data.table selects rows rather than columns:

product_info[1] 
##     id   name type   class released 
## 1: T01 SupCar  toy vehicle      yes 
product_info[1:3] 
##     id     name  type   class released 
## 1: T01   SupCar   toy vehicle      yes 
## 2: T02 SupPlane   toy vehicle       no 
## 3: M01    JeepX model vehicle      yes 

If the number we supply in [] is negative, it means excluding the record, which is fully consistent with subsetting a vector:

product_info[-1] 
##     id      name  type   class released 
## 1: T02  SupPlane   toy vehicle       no 
## 2: M01     JeepX model vehicle      yes 
## 3: M02 AircraftX model vehicle      yes 
## 4: M03    Runner model  people      yes 
## 5: M04    Dancer model  people       no 

In addition, data.table also provides a number of symbols that represent important components of data.table. One of the most useful symbols is .N, which means the number of rows. If we want to select the last row, we no longer need nrow(product_info):

product_info[.N] 
##     id   name  type  class released 
## 1: M04 Dancer model people       no 

We can easily select the first and last rows:

product_info[c(1, .N)] 
##     id   name  type   class released 
## 1: T01 SupCar   toy vehicle      yes 
## 2: M04 Dancer model  people       no 

The subsetting syntax of data.table automatically evaluates expressions in the context of the data; that is, we can use column names directly as symbols, just as we do with subset(), transform(), and with(). For example, we can use released in the first argument to select the rows of products that are released:

product_info[released == "yes"] 
##     id      name  type   class released 
## 1: T01    SupCar   toy vehicle      yes 
## 2: M01     JeepX model vehicle      yes 
## 3: M02 AircraftX model vehicle      yes 
## 4: M03    Runner model  people      yes 

The first argument in the square brackets is a row filter, while the second is evaluated within the context of the filtered data. For example, we can directly use id to represent product_info$id because id is evaluated within the context of product_info:

product_info[released == "yes", id] 
## [1] "T01" "M01" "M02" "M03" 

The data.frame way of selecting columns does not work here. If we put a character vector in the second argument, we simply get the character vector back, because a string evaluates to itself:

product_info[released == "yes", "id"] 
## [1] "id" 

To disable this behavior, we can specify with = FALSE so that the second argument accepts a character vector of column names, and it always returns a data.table, no matter how many columns are selected:

product_info[released == "yes", "id", with = FALSE] 
##     id 
## 1: T01 
## 2: M01 
## 3: M02 
## 4: M03 
product_info[released == "yes", c("id", "name"), with = FALSE] 
##     id      name 
## 1: T01    SupCar 
## 2: M01     JeepX 
## 3: M02 AircraftX 
## 4: M03    Runner 
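As a side note, recent versions of data.table (1.10.2 and later) also support the .. prefix, which tells data.table to look a symbol up in the calling scope rather than among the columns, so a character vector of column names can be used without with = FALSE. A small sketch, assuming such a version is installed:

```r
library(data.table)

dt <- data.table(id = c("T01", "M01"), name = c("SupCar", "JeepX"))
cols <- c("id", "name")
# ..cols means: look up cols one level outside the table, not as a column
dt[, ..cols]
```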

We can also write some other expressions as the second argument. For example, we can generate a table of the number of released products for each combination of type and class:

product_info[released == "yes", table(type, class)] 
##        class 
## type    people vehicle 
##   model      1       2 
##   toy        0       1 

However, if a list is produced, it will be converted to a data.table instead:

product_info[released == "yes", list(id, name)] 
##     id      name 
## 1: T01    SupCar 
## 2: M01     JeepX 
## 3: M02 AircraftX 
## 4: M03    Runner 

In this way, we can easily create a new data.table with existing columns transformed:

product_info[, list(id, name, released = released == "yes")] 
##     id      name released 
## 1: T01    SupCar     TRUE 
## 2: T02  SupPlane    FALSE 
## 3: M01     JeepX     TRUE 
## 4: M02 AircraftX     TRUE 
## 5: M03    Runner     TRUE 
## 6: M04    Dancer    FALSE 

We can also easily create a new data.table with new columns computed from existing columns:

product_stats[, list(id, material, size, weight,  
  density = size / weight)] 
##     id material size weight   density 
## 1: T01    Metal  120   10.0 12.000000 
## 2: T02    Metal  350   45.0  7.777778 
## 3: M01 Plastics   50     NA        NA 
## 4: M02 Plastics   85    3.0 28.333333 
## 5: M03     Wood   15     NA        NA 
## 6: M04     Wood   16    0.6 26.666667 

For simplicity, data.table provides .() as shorthand for list():

product_info[, .(id, name, type, class)] 
##     id      name  type   class 
## 1: T01    SupCar   toy vehicle 
## 2: T02  SupPlane   toy vehicle 
## 3: M01     JeepX model vehicle 
## 4: M02 AircraftX model vehicle 
## 5: M03    Runner model  people 
## 6: M04    Dancer model  people 
product_info[released == "yes", .(id, name)] 
##     id      name 
## 1: T01    SupCar 
## 2: M01     JeepX 
## 3: M02 AircraftX 
## 4: M03    Runner 

By supplying the ordered indices, we can easily sort the records by the given criterion:

product_stats[order(size, decreasing = TRUE)] 
##     id material size weight 
## 1: T02    Metal  350   45.0 
## 2: T01    Metal  120   10.0 
## 3: M02 Plastics   85    3.0 
## 4: M01 Plastics   50     NA 
## 5: M04     Wood   16    0.6 
## 6: M03     Wood   15     NA 

Previously, we always created a new data.table when subsetting. The data.table package also provides := for in-place assignment of columns. For example, the original data in product_stats is shown as follows:

product_stats 
##     id material size weight 
## 1: T01    Metal  120   10.0 
## 2: T02    Metal  350   45.0 
## 3: M01 Plastics   50     NA 
## 4: M02 Plastics   85    3.0 
## 5: M03     Wood   15     NA 
## 6: M04     Wood   16    0.6 

We will use := to create a new column directly in product_stats:

product_stats[, density := size / weight] 

Nothing is shown here, but the original data.table has been modified:

product_stats 
##     id material size weight   density 
## 1: T01    Metal  120   10.0 12.000000 
## 2: T02    Metal  350   45.0  7.777778 
## 3: M01 Plastics   50     NA        NA 
## 4: M02 Plastics   85    3.0 28.333333 
## 5: M03     Wood   15     NA        NA 
## 6: M04     Wood   16    0.6 26.666667 

We can use := to replace an existing column:

product_info[, released := released == "yes"] 
product_info 
##     id      name  type   class released 
## 1: T01    SupCar   toy vehicle     TRUE 
## 2: T02  SupPlane   toy vehicle    FALSE 
## 3: M01     JeepX model vehicle     TRUE 
## 4: M02 AircraftX model vehicle     TRUE 
## 5: M03    Runner model  people     TRUE 
## 6: M04    Dancer model  people    FALSE 

The data.table package provides := mainly because in-place modification offers much higher performance: it avoids unnecessary copies of the data.
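One consequence of reference semantics is that := changes the object seen by every variable bound to it; when a genuinely independent copy is needed, data.table provides copy(). A minimal sketch:

```r
library(data.table)

dt <- data.table(x = 1:3)
dt2 <- copy(dt)     # explicit deep copy
dt2[, y := x * 2]   # modifies dt2 only
"y" %in% names(dt)  # FALSE: the original is untouched
```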

Using key to access rows

Another distinct feature of data.table is its support for indexing; that is, we can create a key on a data.table so that accessing records by key is extremely efficient. For example, we will use setkey() to make id the key of product_info:

setkey(product_info, id) 

Note that the function behaves very differently from most R functions. It does not return a new copy of the data table but directly installs a key on the original input, sorting the records by the key in the process:

product_info 
##     id      name  type   class released 
## 1: M01     JeepX model vehicle     TRUE 
## 2: M02 AircraftX model vehicle     TRUE 
## 3: M03    Runner model  people     TRUE 
## 4: M04    Dancer model  people    FALSE 
## 5: T01    SupCar   toy vehicle     TRUE 
## 6: T02  SupPlane   toy vehicle    FALSE 

Also, its key is created:

key(product_info) 
## [1] "id" 

Now, we can use the key to access the records in product_info. For example, we can supply a value of id directly to get the records with that id:

product_info["M01"] 
##     id  name  type   class released 
## 1: M01 JeepX model vehicle     TRUE 

If we try this with a data.table that has no key, an error occurs and reminds us to set a key:

product_stats["M01"] 
## Error in `[.data.table`(product_stats, "M01"): When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey. 

We can also use setkeyv() to set the key; it accepts a character vector of column names:

setkeyv(product_stats, "id") 

This function is much handier when the key columns are determined dynamically as a character vector. Now, we can use the key to access product_stats too:

product_stats["M02"] 
##     id material size weight  density 
## 1: M02 Plastics   85      3 28.33333 

If two tables have the same key, we can easily join them together:

product_info[product_stats] 
##     id      name  type   class released material size 
## 1: M01     JeepX model vehicle     TRUE Plastics   50 
## 2: M02 AircraftX model vehicle     TRUE Plastics   85 
## 3: M03    Runner model  people     TRUE     Wood   15 
## 4: M04    Dancer model  people    FALSE     Wood   16 
## 5: T01    SupCar   toy vehicle     TRUE    Metal  120 
## 6: T02  SupPlane   toy vehicle    FALSE    Metal  350 
##    weight   density 
## 1:     NA        NA 
## 2:    3.0 28.333333 
## 3:     NA        NA 
## 4:    0.6 26.666667 
## 5:   10.0 12.000000 
## 6:   45.0  7.777778 
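Note that X[Y] uses the rows of Y to look up rows in X, so the direction of the join matters; unmatched keys produce NA. A small sketch with two hypothetical keyed tables a and b:

```r
library(data.table)

a <- data.table(id = c("A", "B"), u = 1:2, key = "id")
b <- data.table(id = c("B", "C"), v = 3:4, key = "id")
a[b]  # one row per row of b; id "C" has no match in a, so u is NA
b[a]  # one row per row of a; id "A" has no match in b, so v is NA
```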

The key of a data.table can consist of more than one column. For example, to locate a record in toy_tests, we need to specify both id and date. In the following code, we will set a two-column key on toy_tests:

setkey(toy_tests, id, date) 

Now, we can get a row by supplying both elements in the key:

toy_tests[.("T01", 20160201)] 
##     id     date sample quality durability 
## 1: T01 20160201    100       9          9 

If we only supply the first element, we would get a subset of the data with all records that match the first element:

toy_tests["T01"] 
##     id     date sample quality durability 
## 1: T01 20160201    100       9          9 
## 2: T01 20160302    150      10          9 
## 3: T01 20160405    180       9         10 
## 4: T01 20160502    140       9          9 

However, if we only supply the second element, we get nothing but an error, because the supplied values are matched against the key columns in order: here, 20160201 is joined to the first key column, id, whose type is incompatible:

toy_tests[.(20160201)] 
## Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'id' is a character column being joined to i.'V1' which is type 'double'. Character columns must join to factor or character columns. 

Similarly, we cannot get any data if we supply the key values in the wrong order:

toy_tests[.(20160201, "T01")] 
## Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'id' is a character column being joined to i.'V1' which is type 'double'. Character columns must join to factor or character columns. 
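If you are on data.table 1.9.6 or later, the on= argument performs an ad hoc join against any column, keyed or not, which sidesteps this ordering restriction. A sketch under that version assumption, using a small stand-in table:

```r
library(data.table)

tt <- data.table(id = c("T01", "T02"), date = c(20160201L, 20160302L),
                 sample = c(100L, 150L))
setkey(tt, id, date)
# join on the date column alone, ignoring the key order
tt[.(20160201L), on = "date"]
```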

Summarizing data by groups

Another important argument when subsetting a data.table is by, which splits the data into groups and evaluates the second argument for each group. In this section, we'll demonstrate how the by syntax makes it much easier to summarize data by groups. The simplest usage of by is counting the records in each group. In the following code, we will count the numbers of released and unreleased products:

product_info[, .N, by = released] 
##    released N 
## 1:     TRUE 4 
## 2:    FALSE 2 

The group can be defined by more than one variable. For example, each tuple of type and class forms a group, and for each group, we will count the number of records:

product_info[, .N, by = .(type, class)] 
##     type   class N 
## 1: model vehicle 2 
## 2: model  people 2 
## 3:   toy vehicle 2 

We can also perform statistical calculations for each group. Here, we will compute the mean value of quality for both waterproof products and non-waterproof ones:

product_tests[, mean(quality, na.rm = TRUE),  
  by = .(waterproof)] 
##    waterproof    V1 
## 1:         no 10.00 
## 2:        yes  5.75 

Note that the mean values are stored in V1 because we didn't supply a name for the column, so the package uses its default column names. To avoid this, we will use an expression of the form .(y = f(x)) instead:

product_tests[, .(mean_quality = mean(quality, na.rm = TRUE)), 
  by = .(waterproof)] 
##    waterproof mean_quality 
## 1:         no        10.00 
## 2:        yes         5.75 

We can chain multiple [] operations. In the following example, we will first join product_info and product_tests by their shared key id and then calculate the mean values of quality and durability for each group of type and class among released products:

product_info[product_tests][released == TRUE, 
  .(mean_quality = mean(quality, na.rm = TRUE), 
    mean_durability = mean(durability, na.rm = TRUE)), 
  by = .(type, class)] 
##     type   class mean_quality mean_durability 
## 1:   toy vehicle          NaN            10.0 
## 2: model vehicle            6             4.5 
## 3: model  people            5             NaN 

Note that the values of the by columns are unique in the resulting data.table. We can use keyby instead of by to ensure that the grouping columns automatically become the key of the resulting data.table:

type_class_tests <- product_info[product_tests][released == TRUE, 
  .(mean_quality = mean(quality, na.rm = TRUE), 
    mean_durability = mean(durability, na.rm = TRUE)), 
  keyby = .(type, class)] 
type_class_tests 
##     type   class mean_quality mean_durability 
## 1: model  people            5             NaN 
## 2: model vehicle            6             4.5 
## 3:   toy vehicle          NaN            10.0 
key(type_class_tests) 
## [1] "type"  "class" 

Then, we can directly use a tuple of key values to access the records:

type_class_tests[.("model", "vehicle"), mean_quality] 
## [1] 6 

You can clearly see that using keys is much more convenient than logical comparisons when looking up records in a table. However, its true advantage has not been demonstrated yet because the data is not large enough. For large data, searching by key can be much faster than iterative logical comparison, because a keyed lookup uses binary search while a full logical comparison scans every row, wasting time on unnecessary computation.

Here is an example to make the contrast. First, we will create a data frame of 10 million rows with an index column id and two numeric columns filled with random numbers:

n <- 10000000 
test1 <- data.frame(id = 1:n, x = rnorm(n), y = rnorm(n)) 

Now, we want to find a particular row by its id value. Let's see how long it takes:

system.time(row <- test1[test1$id == 876543, ]) 
##    user  system elapsed  
##   0.156   0.036   0.192 
row 
##            id          x        y 
## 876543 876543 0.02300419 1.291588 

It seems like no big deal, but suppose you need to do this frequently, say, hundreds of times per second; then your machine simply can't return results in time.

Then, we will use data.table to do the same thing. First, we will call setDT() to convert the data.frame to a data.table. This function performs some magic to transform the object in place, with no copy made. When calling setDT(), we also provide key = "id" so that the resulting data.table has id as its key column:

setDT(test1, key = "id") 
class(test1) 
## [1] "data.table" "data.frame" 

Now, test1 has been converted to a data.table. Then, we will search for an element by key:

system.time(row <- test1[.(8765432)]) 
##    user  system elapsed  
##   0.000   0.000   0.001 
row 
##         id         x         y 
## 1: 8765432 0.2532357 -2.121696 

The lookup works the same way, but data.table takes far less time than data.frame.

Reshaping data.table

Previously, you learned how to reshape a data frame with the reshape2 package. The data.table package provides faster and more powerful implementations of dcast and melt for data.table objects.

For example, we will reshape toy_tests by aligning the quality scores of each product to year-month tuples:

toy_tests[, ym := substr(date, 1, 6)] 
toy_quality <- dcast(toy_tests, ym ~ id, value.var = "quality") 
toy_quality 
##        ym T01 T02 
## 1: 201602   9   7 
## 2: 201603  10   8 
## 3: 201604   9   9 
## 4: 201605   9  10 

First, we used := to create a new column, ym, directly in toy_tests, and then used dcast to transform it just as in the earlier reshape2 example. The result looks the same as the output of reshape2::dcast for a data.frame.

While reshape2::dcast does not support multiple value.var columns, data.table::dcast works with multiple value variables, as shown here:

toy_tests2 <- dcast(toy_tests, ym ~ id, value.var = c("quality", "durability")) 
toy_tests2 
##        ym quality_T01 quality_T02 durability_T01 
## 1: 201602           9           7              9 
## 2: 201603          10           8              9 
## 3: 201604           9           9             10 
## 4: 201605           9          10              9 
##    durability_T02 
## 1:              9 
## 2:              8 
## 3:              8 
## 4:              9 

The column names, except the first, are no longer values of id but value variables concatenated with values of id by an underscore. In addition, the key of the output data.table is automatically set to the variables on the left-hand side of the reshaping formula ym ~ id:

key(toy_tests2) 
## [1] "ym" 

The key implies that we can access the records directly by supplying a value of ym. However, the following code ends up with an error:

toy_tests2[.(201602)] 
## Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'ym' is a character column being joined to i.'V1' which is type 'double'. Character columns must join to factor or character columns. 

There's something wrong with the data types. We can run the following code to see the class of each column:

sapply(toy_tests2, class) 
##             ym    quality_T01    quality_T02 durability_T01  
##    "character"      "integer"      "integer"      "integer"  
## durability_T02  
##      "integer" 

The problem lies in the class of ym: it is a character vector, but we supplied numeric key values, so the search fails with mismatched data types. If we supply a string instead, we get the corresponding record:

toy_tests2["201602"] 
##        ym quality_T01 quality_T02 durability_T01 
## 1: 201602           9           7              9 
##    durability_T02 
## 1:              9 

But how did ym become a character vector in the first place? Recall ym := substr(date, 1, 6), where date is an integer vector; substr() coerces date to a character vector and then extracts the first six characters, so the result is naturally a character vector. This is demonstrated simply as follows:

class(20160101) 
## [1] "numeric" 
class(substr(20160101, 1, 6)) 
## [1] "character" 

The point here is that we need to be careful about the data types of the key columns.
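One way to avoid the mismatch, sketched here on a small stand-in table, is to store ym as an integer before setting the key, so that numeric lookups match:

```r
library(data.table)

dt <- data.table(ym = c("201602", "201603"), quality_T01 = c(9L, 10L))
dt[, ym := as.integer(ym)]  # convert the key column to integer
setkey(dt, ym)
dt[.(201602L)]              # an integer lookup now succeeds
```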

Using in-place set functions

With data.frame, changing column names or column order causes copies of the data structure. Recent R versions make fewer copies when renaming columns, but it is still hard to reorder the columns of a data frame without making a new copy. This is not a problem when the data is small, but for very large data, the resulting performance and memory pressure can be a real issue.

As an enhanced version of data.frame, data.table provides a family of set functions with reference semantics; that is, they modify a data.table in place and avoid unnecessary copying, which gives them astonishing performance.

Take product_stats as an example. We can call setDF() to convert a data.table to a data.frame in place, without making a copy:

product_stats 
##     id material size weight   density 
## 1: M01 Plastics   50     NA        NA 
## 2: M02 Plastics   85    3.0 28.333333 
## 3: M03     Wood   15     NA        NA 
## 4: M04     Wood   16    0.6 26.666667 
## 5: T01    Metal  120   10.0 12.000000 
## 6: T02    Metal  350   45.0  7.777778 
setDF(product_stats) 
class(product_stats) 
## [1] "data.frame" 

We can call setDT() to convert any data.frame to a data.table, setting a key if specified:

setDT(product_stats, key = "id") 
class(product_stats) 
## [1] "data.table" "data.frame" 

We can call setnames() to rename the given columns:

setnames(product_stats, "size", "volume") 
product_stats 
##     id material volume weight   density 
## 1: M01 Plastics     50     NA        NA 
## 2: M02 Plastics     85    3.0 28.333333 
## 3: M03     Wood     15     NA        NA 
## 4: M04     Wood     16    0.6 26.666667 
## 5: T01    Metal    120   10.0 12.000000 
## 6: T02    Metal    350   45.0  7.777778 

If we add a new column, it appears as the last column. For example, we will add an index column for all rows using .I, which represents 1:.N:

product_stats[, i := .I] 
product_stats 
##     id material volume weight   density i 
## 1: M01 Plastics     50     NA        NA 1 
## 2: M02 Plastics     85    3.0 28.333333 2 
## 3: M03     Wood     15     NA        NA 3 
## 4: M04     Wood     16    0.6 26.666667 4 
## 5: T01    Metal    120   10.0 12.000000 5 
## 6: T02    Metal    350   45.0  7.777778 6 

By convention, the index column should, in most cases, appear first. We can supply a new order of column names to setcolorder() so that the columns are reordered directly, without making copies:

setcolorder(product_stats,  
  c("i", "id", "material", "weight", "volume", "density")) 
product_stats 
##    i  id material weight volume   density 
## 1: 1 M01 Plastics     NA     50        NA 
## 2: 2 M02 Plastics    3.0     85 28.333333 
## 3: 3 M03     Wood     NA     15        NA 
## 4: 4 M04     Wood    0.6     16 26.666667 
## 5: 5 T01    Metal   10.0    120 12.000000 
## 6: 6 T02    Metal   45.0    350  7.777778 

Understanding dynamic scoping of data.table

The most commonly used syntax of data.table is data[i, j, by], where i, j, and by are all evaluated with dynamic scoping. In other words, we can use not only the columns directly, but also predefined symbols such as .N, .I, and .SD that refer to important components of the data, as well as symbols and functions accessible in the calling environment.

Before demonstrating this, we will create a new data.table named market_data with a consecutive column of date:

market_data <- data.table(date = as.Date("2015-05-01") + 0:299) 
head(market_data) 
##          date 
## 1: 2015-05-01 
## 2: 2015-05-02 
## 3: 2015-05-03 
## 4: 2015-05-04 
## 5: 2015-05-05 
## 6: 2015-05-06 

Then, we will add two new columns to market_data by calling := in its functional form:

set.seed(123) 
market_data[, `:=`( 
  price = round(30 * cumprod(1 + rnorm(300, 0.001, 0.05)), 2), 
  volume = rbinom(300, 5000, 0.8) 
)] 

Note that price is a simple random walk, and volume is randomly drawn from a binomial distribution:

head(market_data) 
##          date price volume 
## 1: 2015-05-01 29.19   4021 
## 2: 2015-05-02 28.88   4000 
## 3: 2015-05-03 31.16   4033 
## 4: 2015-05-04 31.30   4036 
## 5: 2015-05-05 31.54   3995 
## 6: 2015-05-06 34.27   3955 

Then, we will plot the data:

plot(price ~ date, data = market_data,  
  type = "l", 
  main = "Market data") 

The plot generated is shown as follows:

(Figure: line chart of price over date, titled "Market data")

Once the data is ready, we can aggregate the data and see how dynamic scoping can be used to make things easier.

First, we will look at the range of the date column:

market_data[, range(date)] 
## [1] "2015-05-01" "2016-02-24" 

The data can be reduced to monthly open-high-low-close (OHLC) data easily by group aggregate:

monthly <- market_data[,  
  .(open = price[[1]], high = max(price),  
    low = min(price), close = price[[.N]]), 
  keyby = .(year = year(date), month = month(date))] 
head(monthly) 
##    year month  open  high   low close 
## 1: 2015     5 29.19 37.71 26.15 28.44 
## 2: 2015     6 28.05 37.63 28.05 37.21 
## 3: 2015     7 36.32 40.99 32.13 40.99 
## 4: 2015     8 41.52 50.00 30.90 30.90 
## 5: 2015     9 30.54 34.46 22.89 27.02 
## 6: 2015    10 25.68 33.18 24.65 29.32 

In the j expression, we can generate an OHLC record for each group of year and month. If the output of j is a list, data.frame, or data.table, the per-group outputs are stacked together into one data.table.

In fact, the j expression can be anything, even with by specified. More specifically, j is evaluated within the context of each subset of the original data split by the value of the by expression. For example, the following code does not aggregate data by group but plots a price chart for each year:

oldpar <- par(mfrow = c(1, 2)) 
market_data[, { 
  plot(price ~ date, type = "l", 
    main = sprintf("Market data (%d)", year)) 
}, by = .(year = year(date))] 
par(oldpar) 

The plot generated is shown as follows:

(Figure: two price charts side by side, one per year, each titled "Market data (year)")

Note that we don't specify the data argument of plot because it is evaluated in the context of market_data grouped by year where price and date are already defined.

Moreover, the j expression can be model-fitting code. Here is an example of batch fitting of linear models. First, we will load diamonds data from the ggplot2 package:

data("diamonds", package = "ggplot2") 
setDT(diamonds) 
head(diamonds) 
##    carat       cut color clarity depth table price    x 
## 1:  0.23     Ideal     E     SI2  61.5    55   326 3.95 
## 2:  0.21   Premium     E     SI1  59.8    61   326 3.89 
## 3:  0.23      Good     E     VS1  56.9    65   327 4.05 
## 4:  0.29   Premium     I     VS2  62.4    58   334 4.20 
## 5:  0.31      Good     J     SI2  63.3    58   335 4.34 
## 6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 
##       y    z 
## 1: 3.98 2.43 
## 2: 3.84 2.31 
## 3: 4.07 2.31 
## 4: 4.23 2.63 
## 5: 4.35 2.75 
## 6: 3.96 2.48 

The data contains 53940 records of diamonds with 10 properties. Here, we will fit a linear regression model on each group of cut to see how carat and depth provide information about log(price) in each group.

In the following code, the j expression fits a linear model and coerces its coefficients into a list. Note that the j expression is evaluated for each value of the keyby expression. Since a list is returned, the estimated linear coefficients for each group are stacked into one data.table, as shown here:

diamonds[, { 
  m <- lm(log(price) ~ carat + depth) 
  as.list(coef(m)) 
}, keyby = .(cut)] 
##          cut (Intercept)    carat        depth 
## 1:      Fair    7.730010 1.264588 -0.014982439 
## 2:      Good    7.077469 1.973600 -0.014601101 
## 3: Very Good    6.293642 2.087957 -0.002890208 
## 4:   Premium    5.934310 1.852778  0.005939651 
## 5:     Ideal    8.495409 2.125605 -0.038080022 

Dynamic scoping also allows us to combine predefined symbols with symbols defined inside or outside the data.table. For example, we can define a function to calculate the annual average of a user-specified column of market_data:

average <- function(column) { 
  market_data[, .(average = mean(.SD[[column]])), 
    by = .(year = year(date))] 
} 

In the preceding j expression, .SD is the grouped data.table for each value of year. We can use .SD[[x]] to extract the values of column x, just like extracting an element from a list by name.

Then, we can run the following code to calculate the average prices for each year:

average("price") 
##    year  average 
## 1: 2015 32.32531 
## 2: 2016 32.38364 

We will just change the argument to volume to calculate the average volumes for each year:

average("volume") 
##    year  average 
## 1: 2015 3999.931 
## 2: 2016 4003.382 

Also, we can use a special syntax to create a dynamic number of columns with dynamically determined names.

Suppose we want to add three alternative price columns, each adding some random noise to the original price values. Instead of repeatedly calling market_data[, price1 := ...], market_data[, price2 := ...], and so on, we can use market_data[, (columns) := list(...)] to set columns dynamically, where columns is a character vector of column names and list(...) holds the values for each corresponding column:

price_cols <- paste0("price", 1:3) 
market_data[, (price_cols) := lapply(1:3,  
  function(i) round(price + rnorm(.N, 0, 5), 2))] 
head(market_data) 
##          date price volume price1 price2 price3 
## 1: 2015-05-01 29.19   4021  30.55  27.39  33.22 
## 2: 2015-05-02 28.88   4000  29.67  20.45  36.00 
## 3: 2015-05-03 31.16   4033  34.31  26.94  27.24 
## 4: 2015-05-04 31.30   4036  29.32  29.01  28.04 
## 5: 2015-05-05 31.54   3995  36.04  32.06  34.79 
## 6: 2015-05-06 34.27   3955  30.12  30.96  35.19 

On the other hand, if we have a table with many columns and need to perform some computation on a subset of them, we can use similar syntax. Imagine that the price-related columns may contain missing values and we need to apply zoo::na.locf() to each price column. First, we will use a regular expression to get all the price columns:

cols <- colnames(market_data) 
price_cols <- cols[grep("^price", cols)] 
price_cols 
## [1] "price"  "price1" "price2" "price3" 

Then, we will use similar syntax but add an additional argument, .SDcols = price_cols, to restrict the columns of .SD to the price columns. The following code calls zoo::na.locf() on each price column and replaces the old values of each column:

market_data[, (price_cols) := lapply(.SD, zoo::na.locf), 
  .SDcols = price_cols] 
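The same pattern can be seen on a self-contained toy table; here a simple last-observation-carried-forward loop stands in for zoo::na.locf() so that no extra package is required:

```r
library(data.table)

dt <- data.table(id = 1:4,
                 price  = c(1.5, NA, 2.5, NA),
                 price1 = c(NA, 2, 2, 3))
cols <- c("price", "price1")
# simplified stand-in for zoo::na.locf: carry the last value forward
locf <- function(x) {
  for (k in seq_along(x)[-1]) if (is.na(x[k])) x[k] <- x[k - 1]
  x
}
dt[, (cols) := lapply(.SD, locf), .SDcols = cols]
dt
```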

In this section, we have demonstrated the usage of data.table and how it makes data manipulation much easier. To see the full feature list of data.table, visit https://github.com/Rdatatable/data.table/wiki. To quickly review the usage, go through the data.table cheat sheet (https://www.datacamp.com/community/tutorials/data-table-cheat-sheet).
