Often, you need to cut the data to make it more useful. One common transformation is to pull out all the values from one or more columns into a new dataset. This can be useful for generating summary statistics or aggregating the values of some columns.
The Incanter macro $
slices out parts of a dataset. In this recipe, we'll see this in action.
For this recipe, we'll need to have Incanter listed in our project.clj
file:
(defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"] [incanter "1.5.5"] [org.clojure/data.csv "0.1.2"]])
We'll also need to include these libraries in our script or REPL:
(require '[clojure.java.io :as io] '[clojure.data.csv :as csv] '[clojure.string :as str] '[incanter.core :as i])
Moreover, we'll need some data. This time, we'll use some country data from the World Bank. Point your browser to http://data.worldbank.org/country and select a country. I picked China. Under World Development Indicators, there is a button labeled Download Data. Click on this button and select CSV. This will download a ZIP file. I extracted its contents into the data/chn
directory in my project. I bound the filename for the primary data file to the data-file
name.
We'll use the $
macro in several different ways to get different results. First, however, we'll need to load the data into a dataset, which we'll do in steps 1 and 2:
(defn with-header [coll] (let [headers (map #(keyword (str/replace % space -)) (first coll))] (map (partial zipmap headers) (next coll)))) (defn read-country-data [filename] (with-open [r (io/reader filename)] (i/to-dataset (doall (with-header (drop 2 (csv/read-csv r)))))))
user=> (def chn-data (read-country-data data-file))
$
macro. It returns a sequence of the values in the column:user=> (i/$ :Indicator-Code chn-data) ("AG.AGR.TRAC.NO" "AG.CON.FERT.PT.ZS" "AG.CON.FERT.ZS" …
user=> (i/$ [:Indicator-Code :1992] chn-data) | :Indicator-Code | :1992 | |---------------------------+---------------------| | AG.AGR.TRAC.NO | 770629 | | AG.CON.FERT.PT.ZS | | | AG.CON.FERT.ZS | | | AG.LND.AGRI.K2 | 5159980 | …
user=> (i/$ [:Indicator-Code :1992 :2002] chn-data) | :Indicator-Code | :1992 | :2002 | |---------------------------+---------------------+---------------------| | AG.AGR.TRAC.NO | 770629 | | | AG.CON.FERT.PT.ZS | | 122.73027213719 | | AG.CON.FERT.ZS | | 373.087159048868 | | AG.LND.AGRI.K2 | 5159980 | 5231970 | …
The $
function is just a wrapper over Incanter's sel
function. It provides a good way to slice columns out of the dataset, so we can focus only on the data that actually pertains to our analysis.
The indicator codes for this dataset are a little cryptic. However, the code descriptions are in the dataset too:
user=> (i/$ [0 1 2] [:Indicator-Code :Indicator-Name] chn-data) | :Indicator-Code | :Indicator-Name | |-------------------+---------------------------------------------------------------| | AG.AGR.TRAC.NO | Agricultural machinery, tractors | | AG.CON.FERT.PT.ZS | Fertilizer consumption (% of fertilizer production) | | AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) | …