So far, we've been focusing on splitting up datasets, on dividing them into groups of rows or groups of columns with functions and macros such as $ or $where. However, sometimes we'd like to move in the other direction. We might have two related datasets and want to join them together to make a larger one. For example, we might want to join crime data to census data, or take any two related datasets that come from separate sources and analyze them together.
First, we'll need to include these dependencies in our project.clj file:
(defproject inc-dsets "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]
                 [org.clojure/data.csv "0.1.2"]])
We'll use these require statements:
(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv]
         '[clojure.string :as str]
         '[incanter.core :as i])
For our data file, we'll use the same data that we introduced in the Selecting columns with $ recipe: China's development dataset from the World Bank.
In this recipe, we'll take a look at how to join two datasets using Incanter:
First, load the data from the data/chn/chn_Country_en_csv_v2.csv file. We'll use the with-header and read-country-data functions that were defined in the Selecting columns with $ recipe:
(def data-file "data/chn/chn_Country_en_csv_v2.csv")
(def chn-data (read-country-data data-file))
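Next, pull the data for two years into separate datasets. The second dataset needs only the column to match on (:Indicator-Code) and the data column (:2000):
(def chn-1990
  (i/$ [:Indicator-Code :Indicator-Name :1990] chn-data))
(def chn-2000
  (i/$ [:Indicator-Code :2000] chn-data))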
Finally, join the two datasets on :Indicator-Code:
(def chn-decade
  (i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000))
From this point on, we can use chn-decade just as we use any other Incanter dataset.
Let's take a look at this in more detail:
(i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000)
The pair of column keywords in the vector ([:Indicator-Code :Indicator-Code]) names the keys that the datasets will be joined on. In this case, the :Indicator-Code column is used for both datasets, but the keys can be different for the two datasets. The first column listed comes from the first dataset (chn-1990), and the second column listed comes from the second dataset (chn-2000).
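To illustrate joining on differently named keys, here is a minimal sketch that uses two small in-memory datasets instead of the World Bank file. The dataset names (indicator-names, values-2000, labelled), the :code and :label columns, and all of the values are invented for this example. It assumes Incanter 1.5.5 is on the classpath, as in the project.clj above.

```clojure
(require '[incanter.core :as i])

;; Toy data, invented for illustration only (not World Bank values).
;; The lookup table calls its key column :code ...
(def indicator-names
  (i/dataset [:code :label]
             [["A" "Indicator A"]
              ["B" "Indicator B"]]))

;; ... while the data table calls it :Indicator-Code.
(def values-2000
  (i/dataset [:Indicator-Code :2000]
             [["A" 10.0]
              ["B" 20.0]]))

;; The first key belongs to the first dataset, the second key to the
;; second dataset.
(def labelled
  (i/$join [:code :Indicator-Code] indicator-names values-2000))

(i/col-names labelled) ; inspect which columns the join produced
(i/nrow labelled)      ; and how many rows it contains
```

Swapping the two keywords without also swapping the datasets would make $join look for :Indicator-Code in indicator-names, which has no such column, so the order of the keys has to follow the order of the datasets.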
This returns a new dataset. Each row of this new dataset is a superset of the corresponding rows from the two input datasets.
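The idea of merged rows can be sketched in plain Clojure with clojure.set/join. This is only an analogy under stated assumptions, not how Incanter implements $join, and the two libraries may treat unmatched rows differently. The rows and the names rows-1990, rows-2000, and rows-decade are invented for illustration.

```clojure
(require '[clojure.set :as set])

;; Toy rows, invented for illustration only (not World Bank values).
(def rows-1990
  #{{:Indicator-Code "A" :Indicator-Name "Indicator A" :1990 1.0}
    {:Indicator-Code "B" :Indicator-Name "Indicator B" :1990 2.0}})

(def rows-2000
  #{{:Indicator-Code "A" :2000 10.0}
    {:Indicator-Code "B" :2000 20.0}})

;; With no key map, clojure.set/join matches rows on the keys that the
;; two relations share (here only :Indicator-Code) and merges each
;; matching pair into a single map.
(def rows-decade (set/join rows-1990 rows-2000))
```

Each map in rows-decade carries every key from both inputs (:Indicator-Code, :Indicator-Name, :1990, and :2000), which is the sense in which a joined row is a superset of the rows it came from.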