One difficult issue when normalizing and cleaning up data is how to deal with time. People enter dates and times in a bewildering variety of formats; some of them are ambiguous, and some of them are vague. However, we have to do our best to interpret them and normalize them into a standard format.
In this recipe, we'll define a function that attempts to parse a date string into a standard date-time object. We'll use the clj-time Clojure library, which is a wrapper around the Joda-Time Java library (http://joda-time.sourceforge.net/).
First, we need to declare our dependencies in the Leiningen project.clj file:
(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clj-time "0.9.0-beta1"]])
Then, we need to load these dependencies into our script or REPL. We'll exclude extend and second from clj-time.core to keep them from clashing with clojure.core/extend and clojure.core/second:
(use '[clj-time.core :exclude (extend second)]
     '[clj-time.format])
In order to solve this problem, we'll specify a sequence of date/time formats and walk through them. The first that doesn't throw an exception will be the one that we'll use.
(def ^:dynamic *default-formats*
  [:date
   :date-hour-minute
   :date-hour-minute-second
   :date-hour-minute-second-ms
   :date-time
   :date-time-no-ms
   :rfc822
   "YYYY-MM-dd HH:mm"
   "YYYY-MM-dd HH:mm:ss"
   "dd/MM/YYYY"
   "YYYY/MM/dd"
   "d MMM YYYY"])
Notice that this list mixes keywords and strings, and each needs to be handled differently. We'll define a protocol with the method ->formatter, which attempts to convert each type to a date formatter, and extend the protocol to both of the types represented in the format list:
(defprotocol ToFormatter
  (->formatter [fmt]))

(extend-protocol ToFormatter
  java.lang.String
  (->formatter [fmt]
    (formatter fmt))
  clojure.lang.Keyword
  (->formatter [fmt]
    (formatters fmt)))
Next, parse-or-nil will take a format and a date string, attempt to parse the date string, and return nil if there are any errors:
(defn parse-or-nil [fmt date-str]
  (try
    (parse (->formatter fmt) date-str)
    (catch Exception ex
      nil)))
Finally, we can implement normalize-datetime. We just attempt to parse a date string with all of the formats, filter out any nil values, and return the first non-nil. Because map returns a lazy sequence, parsing does not have to continue past the first success, although Clojure may realize a few extra elements along the way:
(defn normalize-datetime [date-str]
  (first
    (remove nil?
            (map #(parse-or-nil % date-str)
                 *default-formats*))))
Now we can try this out:
user=> (normalize-datetime "2012-09-12")
#<DateTime 2012-09-12T00:00:00.000Z>
user=> (normalize-datetime "2012/09/12")
#<DateTime 2012-09-12T00:00:00.000Z>
user=> (normalize-datetime "28 Sep 2012")
#<DateTime 2012-09-28T00:00:00.000Z>
user=> (normalize-datetime "2012-09-28 13:45")
#<DateTime 2012-09-28T13:45:00.000Z>
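The results above are DateTime objects. If downstream code needs one uniform string instead, a parsed value can be rendered with unparse from clj-time.format. The following is a minimal sketch: the helper name datetime->string is hypothetical, the :date-time (ISO 8601) formatter is just one reasonable choice, and the namespaces are loaded with require and aliases rather than use.

```clojure
(require '[clj-time.core :as t]
         '[clj-time.format :as f])

;; Stand-in for a value returned by normalize-datetime.
(def parsed (t/date-time 2012 9 12))

;; Hypothetical helper: render any parsed DateTime in a single
;; ISO 8601 layout using clj-time's built-in :date-time formatter.
(defn datetime->string [dt]
  (f/unparse (f/formatters :date-time) dt))

(datetime->string parsed)
;; => "2012-09-12T00:00:00.000Z"
```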
This approach to parsing dates has a number of problems. For example, because some date formats are ambiguous, the first match might not be the correct one.
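To make the ambiguity concrete, the sketch below parses the same strings with a day-first and a month-first pattern. The helper day-and-month is hypothetical, and "MM/dd/YYYY" is an illustrative pattern that is not in the recipe's default list.

```clojure
(require '[clj-time.core :as t]
         '[clj-time.format :as f])

;; Hypothetical helper: parse s with the given pattern and return
;; [day month], or nil if the string does not fit the pattern.
(defn day-and-month [pattern s]
  (try
    (let [dt (f/parse (f/formatter pattern) s)]
      [(t/day dt) (t/month dt)])
    (catch Exception _ nil)))

;; Ambiguous: both patterns accept the string but disagree on its meaning.
(day-and-month "dd/MM/YYYY" "03/04/2012") ;; => [3 4]  (3 April)
(day-and-month "MM/dd/YYYY" "03/04/2012") ;; => [4 3]  (4 March)

;; Unambiguous: there is no month 25, so only the day-first pattern matches.
(day-and-month "dd/MM/YYYY" "25/12/2012") ;; => [25 12]
(day-and-month "MM/dd/YYYY" "25/12/2012") ;; => nil
```

With a first-match strategy, whichever pattern comes first in the list silently decides how the ambiguous string is read.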
However, trying out a list of formats is probably about the best we can do. Knowing something about our data allows us to prioritize the list appropriately, and we can augment it with ad hoc formats as we run across new data. We might also need to normalize data from different sources (for instance, U.S. date formats versus the rest of the world) before we merge the data together.
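Because the format list is held in a dynamic var, one way to prioritize formats per data source is to rebind it with binding around the calls for that source. The sketch below is self-contained, so it uses simplified stand-ins (*formats*, try-parse, and normalize are hypothetical names, and only string patterns are handled). The same idea should carry over to *default-formats* and normalize-datetime.

```clojure
(require '[clj-time.core :as t]
         '[clj-time.format :as f])

;; Simplified stand-in for *default-formats*: day-first by default.
(def ^:dynamic *formats* ["dd/MM/YYYY" "MM/dd/YYYY"])

;; Simplified stand-in for parse-or-nil (string patterns only).
(defn try-parse [pattern date-str]
  (try
    (f/parse (f/formatter pattern) date-str)
    (catch Exception _ nil)))

;; Simplified stand-in for normalize-datetime.
(defn normalize [date-str]
  (some #(try-parse % date-str) *formats*))

;; Default priority reads the string day-first.
(t/month (normalize "03/04/2012")) ;; => 4

;; For a source known to use U.S. dates, put the month-first
;; pattern at the front for the duration of the call.
(binding [*formats* ["MM/dd/YYYY" "dd/MM/YYYY"]]
  (t/month (normalize "03/04/2012"))) ;; => 3
```

The rebinding only applies on the current thread and inside the binding form, so each source can be normalized with its own priorities before the results are merged.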