One common problem with data is inconsistency. A value may be capitalized in one record and not in another. It may be abbreviated in some places and spelled out in full elsewhere. Occasionally, it is simply misspelled.
When the values come from an open domain, such as words in a free-text field, the problem can be quite difficult. However, when the data represents a limited vocabulary (such as US state names, for our example here), there's a simple trick that can help. While it's common to use full state names, the standard two-letter postal abbreviations are also often used. A mapping from common forms or mistakes to a normalized form is an easy way to fix variants in a field.
For the project.clj file, we'll use a very simple configuration:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]])
We just need to make sure that the clojure.string/upper-case function is available to us:
(use '[clojure.string :only (upper-case)])
(def state-synonyms
  {"ALABAMA" "AL",
   "ALASKA" "AK",
   "ARIZONA" "AZ",
   …
   "WISCONSIN" "WI",
   "WYOMING" "WY"})
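(defn normalize-state [state]
  (let [uc-state (upper-case state)]
    ;; Calling the map as a function looks up uc-state; the second
    ;; argument is returned when no synonym is found.
    (state-synonyms uc-state uc-state)))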
We then call normalize-state with the strings we want to fix:
user=> (map normalize-state ["Alabama" "OR" "Va" "Fla"])
("AL" "OR" "VA" "FL")
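Real input often carries stray whitespace, trailing periods, or missing values as well as case differences. The following is a minimal sketch of one way the same lookup could be made more forgiving, not part of the recipe itself. The names sample-synonyms and normalize-state-loosely are hypothetical, and the small map is a stand-in for illustration only; a real table would list every state and each variant you expect to see.

```clojure
(require '[clojure.string :as str])

;; Hypothetical, deliberately tiny synonym table for illustration only.
(def sample-synonyms
  {"ALABAMA" "AL"
   "FLA"     "FL"
   "FLORIDA" "FL"})

(defn normalize-state-loosely
  "Trims whitespace, drops periods, and upper-cases `state` before the
  lookup. Returns nil for nil input; unknown values come back cleaned
  but otherwise unchanged."
  [state]
  (when state
    (let [cleaned (-> state
                      str/trim
                      (str/replace "." "")
                      str/upper-case)]
      ;; `get` accepts a not-found value, so unmatched input falls
      ;; through as the cleaned string instead of nil.
      (get sample-synonyms cleaned cleaned))))

(map normalize-state-loosely [" alabama " "Fla." "OR" nil])
;; => ("AL" "FL" "OR" nil)
```

Cleaning the key before the lookup keeps the synonym map small: it only needs one entry per distinct spelling, not one per combination of case, spacing, and punctuation.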