One of the issues that we'll need to deal with at some point is spelling errors. Especially when you're trying to work with raw text, spelling errors can throw a wrench in the works.
At one time, spell checkers were major pieces of software, full of optimizations that let them run on the constrained desktop machines of the day. That's no longer the case. Peter Norvig has published an article titled How to Write a Spelling Corrector (http://norvig.com/spell-correct.html). It shows how to take some text that is assumed to be spelled correctly and build a spell checker from it. His implementation is 21 lines of Python.
For this recipe, we'll convert the Python code to Clojure. Our version will be longer, but less dense. We could certainly implement it more concisely, but breaking it out will make it easier to explain.
We'll use the minimal project.clj file that we've seen in several recipes already in this chapter:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
We require clojure.string and one function from clojure.set:
(require '[clojure.string :as string])
(use '[clojure.set :only (union)])
This algorithm works by comparing a set of permutations of a word against a map of correctly spelled words and their frequencies. The spelling that occurs most frequently wins:
(defn words [text]
  (re-seq #"[a-z]+" (string/lower-case text)))
(defn train [feats] (frequencies feats))
(def n-words (train (words (slurp "data/big.txt"))))

(def alphabet "abcdefghijklmnopqrstuvwxyz")
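If you don't have big.txt handy (it's available from Norvig's site), a quick REPL experiment on an inline sample shows what words and train produce. The sample sentence here is our own, and the definitions are repeated so the snippet stands alone:

```clojure
;; A quick look at the training pipeline using an inline sample
;; in place of data/big.txt. The sample text is made up.
(require '[clojure.string :as string])

(defn words [text]
  (re-seq #"[a-z]+" (string/lower-case text)))

(defn train [feats] (frequencies feats))

(words "The cat sat. The cat ran!")
;; => ("the" "cat" "sat" "the" "cat" "ran")

(def sample-counts (train (words "The cat sat. The cat ran!")))
;; => {"the" 2, "cat" 2, "sat" 1, "ran" 1} (key order may vary)
```

The real n-words is just this same frequency map, built from a much larger corpus.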
(defn split-word [word i]
  [(.substring word 0 i) (.substring word i)])

(defn delete-char [[w1 w2]]
  (str w1 (.substring w2 1)))

(defn transpose-split [[w1 w2]]
  (str w1 (second w2) (first w2) (.substring w2 2)))

(defn replace-split [[w1 w2]]
  (let [w2-0 (.substring w2 1)]
    (map #(str w1 % w2-0) alphabet)))

(defn insert-split [[w1 w2]]
  (map #(str w1 % w2) alphabet))
</defn>
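To get a feel for what each edit operation does, we can try them at the REPL on one split of a made-up word (clever is our own example, not part of the recipe's data):

```clojure
;; Exercising the single-edit operations on one split of "clever".
;; Definitions are repeated from above so this snippet stands alone.
(def alphabet "abcdefghijklmnopqrstuvwxyz")

(defn split-word [word i]
  [(.substring word 0 i) (.substring word i)])
(defn delete-char [[w1 w2]]
  (str w1 (.substring w2 1)))
(defn transpose-split [[w1 w2]]
  (str w1 (second w2) (first w2) (.substring w2 2)))
(defn insert-split [[w1 w2]]
  (map #(str w1 % w2) alphabet))

(split-word "clever" 3)              ;; => ["cle" "ver"]
(delete-char ["cle" "ver"])          ;; => "cleer"
(transpose-split ["cle" "ver"])      ;; => "cleevr"
(count (insert-split ["cle" "ver"])) ;; => 26, one per letter
```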
(defn edits-1 [word]
  (let [splits (map (partial split-word word)
                    (range (inc (count word))))
        long-splits (filter #(> (count (second %)) 1) splits)
        deletes (map delete-char long-splits)
        transposes (map transpose-split long-splits)
        replaces (mapcat replace-split long-splits)
        inserts (remove nil? (mapcat insert-split splits))]
    (set (concat deletes transposes replaces inserts))))
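We can spot-check edits-1 at the REPL by asking whether particular single-edit variants of a word are in its output (the word here is our own example; definitions are repeated so the snippet stands alone):

```clojure
;; edits-1 generates strings one edit away from its input.
;; Definitions repeated from above so this snippet runs on its own.
(def alphabet "abcdefghijklmnopqrstuvwxyz")
(defn split-word [word i] [(.substring word 0 i) (.substring word i)])
(defn delete-char [[w1 w2]] (str w1 (.substring w2 1)))
(defn transpose-split [[w1 w2]]
  (str w1 (second w2) (first w2) (.substring w2 2)))
(defn replace-split [[w1 w2]]
  (let [w2-0 (.substring w2 1)] (map #(str w1 % w2-0) alphabet)))
(defn insert-split [[w1 w2]] (map #(str w1 % w2) alphabet))

(defn edits-1 [word]
  (let [splits (map (partial split-word word)
                    (range (inc (count word))))
        long-splits (filter #(> (count (second %)) 1) splits)
        deletes (map delete-char long-splits)
        transposes (map transpose-split long-splits)
        replaces (mapcat replace-split long-splits)
        inserts (remove nil? (mapcat insert-split splits))]
    (set (concat deletes transposes replaces inserts))))

(def e1 (edits-1 "word"))
(contains? e1 "ward")  ;; => true (one replacement)
(contains? e1 "wrod")  ;; => true (one transposition)
(contains? e1 "words") ;; => true (one insertion)
(contains? e1 "wod")   ;; => true (one deletion)
```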
(defn known-edits-2 [word]
  (set (filter (partial contains? n-words)
               (apply union (map #(edits-1 %) (edits-1 word))))))
(defn known [words]
  (set (filter (partial contains? n-words) words)))
All of these pieces come together in the correct function:
(defn correct [word]
  (let [candidate-thunks [#(known (list word))
                          #(known (edits-1 word))
                          #(known-edits-2 word)
                          #(list word)]]
    (->> candidate-thunks
         (map (fn [f] (f)))
         (filter #(> (count %) 0))
         first
         (map (fn [w] [(get n-words w 1) w]))
         (reduce (partial max-key first))
         second)))
Let's see how it works:
user=> (correct "deete")
"delete"
user=> (correct "editr")
"editor"
user=> (correct "tranpsose")
"tranpsose"
user=> (correct "eidtor")
"editor"
user=> (correct "eidtr")
"elder"
It doesn't recognize transpose, and it miscorrects eidtr as elder. Let's take a look at the training data to see why:
user=> (n-words "transpose")
nil
user=> (n-words "elder")
40
user=> (n-words "editor")
17
That explains it. Transpose doesn't occur in the training set, and elder is there more than twice as often as editor, so it's the more likely correction.
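We can replay the ranking step of correct in isolation to see this happen, plugging in the two frequencies we just looked up (the inline map stands in for n-words):

```clojure
;; Replaying correct's ranking step with the frequencies from the
;; REPL session above. The inline map is a stand-in for n-words.
(def freqs {"elder" 40, "editor" 17})

(def winner
  (->> ["elder" "editor"]
       (map (fn [w] [(get freqs w 1) w])) ; pair each word with its count
       (reduce (partial max-key first))   ; keep the highest-count pair
       second))                           ; drop the count, keep the word
;; => "elder"
```

Both words are valid corrections two edits away from eidtr; frequency alone decides between them.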
The heart of this algorithm is the edits-1 and known-edits-2 functions. They search the space of strings near the input, looking for all of the known words that are one or two edits away from the word to be checked. Before the edit operations are applied, each word is split in two by the split-word function, and the operations that constitute one edit (deletion, transposition, replacement, and insertion) are each defined in their own function, as shown earlier.
The correct function looks at all of the returned edits that are in the training set and picks the one that occurs most frequently.
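The candidate-thunks vector in correct also encodes a fallback strategy: each candidate generator is wrapped in a thunk, the generators are ordered from cheapest to most expensive, and the first one that produces any candidates wins. A generic sketch of that pattern, with a function name and data of our own invention:

```clojure
;; A generic sketch of the first-non-empty-thunk pattern used by correct.
(defn first-non-empty [thunks]
  (->> thunks
       (map (fn [f] (f))) ; invoke each generator
       (filter seq)       ; drop generators that produced nothing
       first))            ; keep the first non-empty result

(first-non-empty [#(filter even? [1 3 5])  ; produces nothing
                  #(filter even? [2 4 5])  ; first non-empty result
                  #(list :fallback)])
;; => (2 4)
```

In correct, the final thunk #(list word) plays the role of :fallback here, which is why an unknown word such as tranpsose is returned unchanged.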
If you want more information on the statistics that make this work (and you should—it's quite interesting), see Norvig's explanation in his article at http://norvig.com/spell-correct.html.