We saw in the last chapter how to use Incanter Zoo to work with time series data and how to smooth values using a running mean. However, sometimes we'll want to smooth data that doesn't have a time component. For instance, we may want to track the usage of a word throughout a larger document or set of documents.
For this, we'll need the usual dependencies in our project.clj file:
(defproject statim "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [incanter "1.5.5"]])
We'll also require those in our script or REPL:
(require '[incanter.core :as i]
         '[incanter.stats :as s]
         '[incanter.charts :as c]
         '[clojure.string :as str])
For this recipe, we'll look at Sir Arthur Conan Doyle's Sherlock Holmes stories. You can download this from Project Gutenberg at http://www.gutenberg.org/cache/epub/1661/pg1661.txt or http://www.ericrochester.com/clj-data-analysis/data/pg1661.txt.
We'll look at the distribution of the word baker over the course of the stories. This may give some indication of how important Holmes' residence at 221B Baker Street is to a given story.
First, we need a function to split the text into lower-cased word tokens. Note the regular expression: it must be #"\w+", not #"w+".

;; Split the text into a sequence of lower-cased word tokens.
(defn tokenize [text]
  (map str/lower-case (re-seq #"\w+" text)))
;; Count how many times x occurs in coll, defaulting to 0.
(defn count-hits [x coll]
  (get (frequencies coll) x 0))
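For example, count-hits just builds a frequency map and looks the token up in it:

```clojure
;; "baker" occurs twice in this token list.
(count-hits "baker" ["baker" "street" "is" "in" "baker"])
;; => 2
```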
Now we tokenize the document and partition the tokens into overlapping windows of 500 tokens, with each window starting 250 tokens after the previous one:

(def data-file "data/pg1661.txt")

(def windows
  (partition 500 250 (tokenize (slurp data-file))))
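To see how the overlapping windows line up, here is clojure.core/partition on a small range; each window of four items starts two items after the previous one, and any trailing partial window is dropped:

```clojure
(partition 4 2 (range 8))
;; => ((0 1 2 3) (2 3 4 5) (4 5 6 7))
```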
Next, we use count-hits to get the number of times that baker appears in each window of tokens:

(def baker-hits
  (map (partial count-hits "baker") windows))
To smooth these counts, we define a function that applies f to a rolling window of n items from coll:

;; Apply f to each successive window of n items from coll.
(defn rolling-fn [f n coll]
  (map f (partition n 1 coll)))
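For instance, with a window of three, rolling-fn averages each successive triple of values:

```clojure
(rolling-fn s/mean 3 [1 2 3 4 5])
;; the means of (1 2 3), (2 3 4), and (3 4 5)
```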
We smooth the raw counts by taking a rolling mean over ten windows at a time:

(def baker-avgs (rolling-fn s/mean 10 baker-hits))
The resulting graph shows the smoothed data overlaid on the raw frequencies.
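The exact charting code isn't shown here, but one way to build such a graph with Incanter is sketched below. The title and axis labels are illustrative, and the rolling averages are offset so that each one sits roughly over the center of the ten windows it summarizes:

```clojure
;; A sketch: plot the raw per-window counts, then overlay the
;; rolling means. baker-avgs is nine items shorter than
;; baker-hits, so we shift its x values to center the averages.
(def chart
  (doto (c/xy-plot (range (count baker-hits)) baker-hits
                   :title "Occurrences of \"baker\""
                   :x-label "Window"
                   :y-label "Hits")
    (c/add-lines (range 5 (+ 5 (count baker-avgs))) baker-avgs)))

(i/view chart)
```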
This recipe processes a document through a number of stages to get the results: it reads and tokenizes the raw text, partitions the token stream into overlapping windows, counts the occurrences of baker in each window, and finally smooths those counts with a rolling mean.
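These stages can also be written as a single pipeline with Clojure's ->> threading macro; this is a sketch equivalent to the definitions above:

```clojure
(def baker-avgs
  (->> (slurp data-file)                    ; read the raw text
       tokenize                             ; lower-cased word tokens
       (partition 500 250)                  ; overlapping 500-token windows
       (map (partial count-hits "baker"))   ; hits per window
       (rolling-fn s/mean 10)))             ; smooth with a rolling mean
```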
By the way, that spike is from the short story The Adventure of the Blue Carbuncle. A character in that story is Henry Baker, so the spike reflects references not just to Baker Street but also to the character.