When dealing with large datasets, it's useful to be able to query the data based on some arbitrary conditions. Also, it's more reliable to store the data in a database rather than in a flat file or as an in-memory resource. The Incanter library provides us with several useful functions to perform these operations, as we will demonstrate in the code example that will follow.
The Incanter library and the MongoDB driver used in the upcoming example can be added to a Leiningen project by adding the following dependency to the project.clj
file:
[congomongo "0.4.1"] [incanter "1.5.4"]
For the upcoming example, the namespace declaration should look similar to the following declaration:
(ns my-namespace (:use incanter.core [incanter.mongodb :only [insert-dataset fetch-dataset]] [somnium.congomongo :only [mongo!]] [incanter.datasets :only [get-dataset]]))
Also, this example requires MongoDB to be installed and running.
For this example, we will use the Iris dataset, which can be fetched using the get-dataset
function from the incanter.datasets
namespace. The code is as follows:
(def iris (get-dataset :iris))
As shown in the previous code, we simply bind the Iris dataset to a variable iris
. We can perform various operations on this dataset using the with-data
function. To view the data, we can use the view
function along with the with-data
function to provide a tabular representation of the dataset, as shown in the following code:
user> (with-data iris (view (conj-cols (range (nrow $data)) $data)))
The $data
variable is a special binding that can be used to represent the entire dataset within the scope of the with-data
function. In the previous code, we add an extra column to represent the row number of a record to the data using a composition of the conj-cols
, nrows
, and range
functions. The data is then displayed in a spreadsheet-like table using the view
function. The previous code produces the following table that represents the dataset:
We can also select columns we are interested in from the original dataset using the $
function within the scope of the with-data
function, as shown in the following code:
user> (with-data iris ($ [:Species :Sepal.Length])) | :Species | :Sepal.Length | |------------+---------------| | setosa | 5.1 | | setosa | 4.9 | | setosa | 4.7 | ... | virginica | 6.5 | | virginica | 6.2 | | virginica | 5.9 |
The $
function selects the :Species
and :Sepal.Length
columns from the iris
dataset in the code example shown previously. We can also filter the data based on a condition using the $where
function, as shown in the following code:
user> (with-data iris ($ [:Species :Sepal.Length] ($where {:Sepal.Length 7.7}))) | :Species | :Sepal.Length | |-----------+---------------| | virginica | 7.7 | | virginica | 7.7 | | virginica | 7.7 | | virginica | 7.7 |
The previous example queries the iris
dataset for records with the :Sepal.Length
column equal to 7.7
using the $where
function. We can also specify the lower or upper bound of the value to compare a column to using the :$gt
and :$lt
symbols in a map passed to $where
function, as shown in the following code:
user> (with-data iris ($ [:Species :Sepal.Length] ($where {:Sepal.Length {:$gt 7.0}}))) | :Species | :Sepal.Length | |-----------+---------------| | virginica | 7.1 | | virginica | 7.6 | | virginica | 7.3 | ... | virginica | 7.2 | | virginica | 7.2 | | virginica | 7.4 |
The previous example checks for records that have a :Sepal.Length
attribute with a value greater than 7
. To check whether a column's value lies within a given range, we can specify both the :$gt
and :$lt
keys in the map passed to the $where
function, as shown in the following code:
user> (with-data iris ($ [:Species :Sepal.Length] ($where {:Sepal.Length {:$gt 7.0 :$lt 7.5}}))) | :Species |:Sepal.Length | |------------+--------------| | virginica | 7.1 | | virginica | 7.3 | | virginica | 7.2 | | virginica | 7.2 | | virginica | 7.2 | | virginica | 7.4 |
The previous example checks for records that have a :Sepal.Length
attribute within the range of 7.0
and 7.5
. We can also specify a discrete set of values using the $:in
key, such as in the expression {:$in #{7.2 7.3 7.5}}
. The Incanter library also provides several other functions such as $join
and $group-by
that can be used to express more complex queries.
The Incanter library provides functions to operate with MongoDB to persist and fetch datasets. MongoDB is a nonrelational document database that allows for storage of JSON documents with dynamic schemas. To connect to a MongoDB instance, we use the mongo!
function, as shown in the following code:
user> (mongo! :db "sampledb") true
In the previous code, the database name sampledb
is specified as a keyword argument with the key :db
to the mongo!
function. We can also specify the hostname and port of the instance to connect to using the :host
and :post
keyword arguments, respectively.
We can store datasets in the connected MongoDB instance using the insert-dataset
function from the incanter.mongodb
namespace. Unfortunately, MongoDB does not support the use of the dot character (.) as column names, and so we must change the names of the columns in the iris
dataset in order to successfully store it using the insert-dataset
function. Replacing the column names can be done using the col-names
function, as shown in the following code:
user> (insert-dataset :iris (col-names iris [:SepalLength :SepalWidth :PetalLength :PetalWidth :Species]))
The previous code stores the iris
dataset in the MongoDB instance after replacing the dot characters in the column names.
To fetch the dataset back from the database, we use the fetch-dataset
function, as shown in the following code. The value returned by this function can be directly used by the with-data
function to query and view the dataset fetched.
user> (with-data (fetch-dataset :iris) ($ [:Species :_id] ($where {:SepalLength {:$gt 7}}))) | :Species | :_id | |-----------+--------------------------| | virginica | 52ebcc1144ae6d6725965984 | | virginica | 52ebcc1144ae6d6725965987 | | virginica | 52ebcc1144ae6d6725965989 | ... | virginica | 52ebcc1144ae6d67259659a0 | | virginica | 52ebcc1144ae6d67259659a1 | | virginica | 52ebcc1144ae6d67259659a5 |
We can also inspect the database after storing our dataset, using the mongo
client, as shown in the following code. As we mentioned our database name is sampledb
, we must select this database using the use
command, as shown in the following terminal output:
$ mongo MongoDB shell version: 2.4.6 connecting to: test Server has startup warnings: ... > use sampledb switched to db sampledb
We can view all collections in the database using the show collections
command. The queries can be executed using the find()
function on the appropriate property in the variable db
instance, as shown in the following code:
> show collections iris system.indexes > > db.iris.find({ SepalLength: 5}) { "_id" : ObjectId("52ebcc1144ae6d6725965922"), "Species" : "setosa", "PetalWidth" : 0.2, "PetalLength" : 1.4, "SepalWidth" : 3.6, "SepalLength" : 5 } { "_id" : ObjectId("52ebcc1144ae6d6725965925"), "Species" : "setosa", "PetalWidth" : 0.2, "PetalLength" : 1.5, "SepalWidth" : 3.4, "SepalLength" : 5 } ...
To conclude, the Incanter library provides us with a sufficient set of tools for querying and storing datasets. Also, MongoDB can be easily used to store datasets via the Incanter library.