In the previous chapter, you learned how to compose SQL statements to query data from relational databases such as SQLite and MySQL. Is there a way to directly use SQL to query data frames in R as if these data frames are tables in relational databases? The sqldf
package says yes.
This package takes advantage of SQLite, thanks to its lightweight structure and easiness to embed into an R session. Run the following command to install this package if you don't have it:
install.packages("sqldf")
First, let's attach the package, as shown in the following code:
library(sqldf) ## Loading required package: gsubfn ## Loading required package: proto ## Loading required package: RSQLite ## Loading required package: DBI
Note that when we attach sqldf
, a number of other packages are automatically loaded. The sqldf
package depends on these packages, because what it does is basically transferring data and converting data types between R and SQLite.
Then, we will reload the product tables we used in the previous sections:
product_info <- read_csv("data/product-info.csv") product_stats <- read_csv("data/product-stats.csv") product_tests <- read_csv("data/product-tests.csv") toy_tests <- read_csv("data/product-toy-tests.csv")
The magic of this package is that we can directly query the data frames in our working environment with SQL. For example, we can select all records of product_info
:
sqldf("select * from product_info") ## Loading required package: tcltk ## id name type class released ## 1 T01 SupCar toy vehicle yes ## 2 T02 SupPlane toy vehicle no ## 3 M01 JeepX model vehicle yes ## 4 M02 AircraftX model vehicle yes ## 5 M03 Runner model people yes ## 6 M04 Dancer model people no
The sqldf
package supports simple select queries that are supported by SQLite. For example, we can select a certain set of columns:
sqldf("select id, name, class from product_info") ## id name class ## 1 T01 SupCar vehicle ## 2 T02 SupPlane vehicle ## 3 M01 JeepX vehicle ## 4 M02 AircraftX vehicle ## 5 M03 Runner people ## 6 M04 Dancer people
We can filter records by a certain condition:
sqldf("select id, name from product_info where released = 'yes'") ## id name ## 1 T01 SupCar ## 2 M01 JeepX ## 3 M02 AircraftX ## 4 M03 Runner
We can compute a new column and give it a name:
sqldf("select id, material, size / weight as density from product_stats") ## id material density ## 1 T01 Metal 12.000000 ## 2 T02 Metal 7.777778 ## 3 M01 Plastics NA ## 4 M02 Plastics 28.333333 ## 5 M03 Wood NA ## 6 M04 Wood 26.666667
We can sort records by certain columns in given orders:
sqldf("select * from product_stats order by size desc") ## id material size weight ## 1 T02 Metal 350 45.0 ## 2 T01 Metal 120 10.0 ## 3 M02 Plastics 85 3.0 ## 4 M01 Plastics 50 NA ## 5 M04 Wood 16 0.6 ## 6 M03 Wood 15 NA
The package also supports querying multiple data frames such as join
. In the following code, we will merge product_info
and product_stats
by id
, just like what we did with merge()
previously:
sqldf("select * from product_info join product_stats using (id)") ## id name type class released material size weight ## 1 T01 SupCar toy vehicle yes Metal 120 10.0 ## 2 T02 SupPlane toy vehicle no Metal 350 45.0 ## 3 M01 JeepX model vehicle yes Plastics 50 NA ## 4 M02 AircraftX model vehicle yes Plastics 85 3.0 ## 5 M03 Runner model people yes Wood 15 NA ## 6 M04 Dancer model people no Wood 16 0.6
Moreover, it also supports nested query. In the following code, we will select all records in product_info
that are made of wood:
sqldf("select * from product_info where id in (select id from product_stats where material = 'Wood')") ## id name type class released ## 1 M03 Runner model people yes ## 2 M04 Dancer model people no
Alternatively, we can use join
with the same where
condition to achieve the same goal. For many relational databases, join
usually works faster than in
when the data is large:
sqldf("select * from product_info join product_stats using (id) where material = 'Wood'") ## id name type class released material size weight ## 1 M03 Runner model people yes Wood 15 NA ## 2 M04 Dancer model people no Wood 16 0.6
In addition to join
, we can easily summarize data by group. For example, we group product_tests
by waterproof
into two groups: yes
and no
. For each group, we compute the average values of quality
and durability
, respectively:
sqldf("select waterproof, avg(quality), avg(durability) from product_tests group by waterproof") ## waterproof avg(quality) avg(durability) ## 1 no 10.00 9.5 ## 2 yes 5.75 5.0
For the toy_tests
data, it is easy to aggregate data for each product. Here is an example of averaging quality
and durability
values across time for each product:
sqldf("select id, avg(quality), avg(durability) from toy_tests group by id") ## id avg(quality) avg(durability) ## 1 T01 9.25 9.25 ## 2 T02 8.50 8.50
To make the results more informative, we can join product_info
with the grouped summary table so that the product information and average measures are presented together:
sqldf("select * from product_info join (select id, avg(quality), avg(durability) from toy_tests group by id) using (id)") ## id name type class released avg(quality) ## 1 T01 SupCar toy vehicle yes 9.25 ## 2 T02 SupPlane toy vehicle no 8.50 ## avg(durability) ## 1 9.25 ## 2 8.50
Using sqldf
and SQL to query data frames looks very handy, but the limitations are obvious too.
First, since sqldf
is, by default, based on SQLite, the limitation of the package is also the limitation of SQLite database, that is, the built-in group aggregate functions are limited. The official webpage (https://sqlite.org/lang_aggfunc.html) provides a list of functions: avg()
, count()
, group_concat()
, max()
, min()
, sum()
, and total()
. If we need more than that, for example, quantile()
, it won't be easy. In R, we can use much more advanced algorithms to aggregate columns.
Second, since we need to supply a string of select statements to query data, it is not very convenient to generate SQL dynamically when part of it is determined by R variables. Therefore, we need to use sprintf()
to allow the values of R variables to appear in the SQL statement.
Third, the limitation of sqldf
is also the limitation of SQL. It is hard to compute new columns with more complex algorithms. For example, if we need to compute a ranking column based on an existing numeric column, it would not be very easy to implement. However, in R, we just need order()
. Another thing is that it is hard or verbose to implement more complex filter operations such as ranking-based data filtering. For example, how do you select the first one or two products ordered by size
in descending order grouped by material
? Such a query requires a lot more thinking and tricks.
However, if we use the plyr
package, such a task is a piece of cake. If you have the package installed, run the following code:
install.packages("plyr")
To demonstrate how simple it is, we will use plyr::ddply
to do this. We will supply material
as the data splitter, that is, product_stats
is divided into several parts for each value taken by material
. We also supply a function to transform the input data frame (each part) to a new data frame. Then, the ddply
function combines these data frames together:
plyr::ddply(product_stats, "material", function(x) { head(x[order(x$size, decreasing = TRUE),], 1L) }) ## id material size weight ## 1 T02 Metal 350 45.0 ## 2 M02 Plastics 85 3.0 ## 3 M04 Wood 16 0.6
The anonymous function we supplied is called with three parts of product_stats
with distinct material
, each part having identical material
.
Another example is to select top two test results with the most samples:
plyr::ddply(toy_tests, "id", function(x) { head(x[order(x$sample, decreasing = TRUE), ], 2) }) ## id date sample quality durability ## 1 T01 20160405 180 9 10 ## 2 T01 20160302 150 10 9 ## 3 T02 20160403 90 9 8 ## 4 T02 20160502 85 10 9
The anonymous function we supplied is called with two parts of toy_tests
: one part is a data frame with id
being T01
and the other T02
. For each part, we order the sub-dataframe by sample
in the descending order and take the top two records. The task is easily finished.
In addition, ddply
, plyr
provides a variety of functions of many possible pairs of input-output data types. To learn more, visit http://had.co.nz/plyr/ and https://github.com/hadley/plyr.