One way to deal with very large data sets is to sample. This can be especially useful when we're getting started and want to explore a dataset. A good sample can tell us what's in the full dataset and what we'll need to do in order to clean and process it. Samples are used in any kind of survey or election exit polling.
In this recipe, we'll see a couple of ways of creating samples.
We'll use a basic project.clj
file:
(defproject cleaning-data "0.1.0-SNAPSHOT" :dependencies [[org.clojure/clojure "1.6.0"]])
There are two ways to sample from a stream of values. If you want 10 percent of the larger population, you can just take every tenth item. If you want 1,000 out of who knows how many items, the process is a little more complicated.
Sampling for an exact count is a little more complicated. We'll use Donald Knuth's algorithm from The Art of Computer Programming, Volume 2. This takes the sample off the front of the input sequence, and then from this point, each new item from the input has a chance of sample-size / size-of-collection-so-far randomly replacing one existing item in the sample. To implement this, we'll need one helper function which takes a map and a new key-value pair. It removes a random key from the map and inserts the new pair:
(defn rand-replace [m k v] (assoc (dissoc m (rand-nth (keys m))) k v))
We'll also need another small utility to create an infinite range that begins at a given place:
(defn range-from [x] (map (partial + x) (range)))
Now, we use this to create the function that does the sampling:
(defn sample-amount [k coll] (->> coll (drop k) (map vector (range-from (inc k))) (filter #(<= (rand) (/ k (first %)))) (reduce rand-replace (into {} (map vector (range k) (take k coll)))) (sort-by first) (map second)))
Using this is as simple as using the first function though:
user=> (sample-amount 10 (range 1000)) (70 246 309 430 460 464 471 547 955 976) user=> (count *1) 10
Sampling by percentage just compares the percentage against a random value for each item in the collection. If the random number is less than the value, it saves the item. Notice though, that since it's random, the exact number that it pulls out doesn't necessarily match the parameter exactly. In this case, 1 percent.
Sampling by a set amount is more complicated. We keep a map of the sample, keyed by each item's position in the original sequence. Originally, we populate this map with the first items off the sequence. Afterwards, we walk through the rest of the collection. For each item, we randomly decide whether to keep it or not. If we do keep it, we randomly swap it with one item that is currently in the sample.
Let's see what this looks like in the code:
sample-amount
will work over the rest of the collection, so we'll begin by dropping the initial sample off the front:(defn sample-amount [k coll] (->> coll (drop k)
(map vector (range-from (inc k)))
(filter #(<= (rand) (/ k (first %))))
rand-replace
to swap out an item from the sample Hashmap
for each item that passed the random filter in Step 3:(reduce rand-replace (into {} (map vector (range k) (take k coll))))
(sort-by first)
(map second)))