Hadoop was developed largely at Yahoo! as an open source implementation of Google's MapReduce model. Since then, it has become one of the most widely tested and used systems for building distributed processing applications.
Hadoop is the central part of a larger ecosystem. It is complemented by a range of other tools, including the Hadoop Distributed File System (HDFS) and Pig, a language for writing jobs that run on Hadoop.
One tool that makes working with Hadoop easier is Cascading. This provides a workflow-like layer on top of Hadoop that can make the expression of some data processing and analysis tasks much easier. Cascalog is a Clojure-idiomatic interface to Cascading and, ultimately, Hadoop.
This recipe will show you how to access and query data in Clojure sequences using Cascalog.
First, we have to list our dependencies in the Leiningen project.clj file:

(defproject distrib-data "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog "2.1.1"]
                 [org.slf4j/slf4j-api "1.7.7"]]
  :profiles {:dev
             {:dependencies
              [[org.apache.hadoop/hadoop-core "1.1.1"]]}})
Finally, we'll require the namespaces that we'll use, including the clojure.string library:

(require '[clojure.string :as string])
(require '[cascalog.logic.ops :as c])
(use 'cascalog.api)
Most of this recipe defines the data we'll query. For this, we'll use a list of companions and actors from the British television program Doctor Who. The data is in a sequence of maps, so we'll need to transform it into several sequences of vectors, which is what Cascalog can access. One sequence will list the companions' lowercased given names, which we'll use as keys in the other tables. Another will pair each name key with the companion's full name, and the last will map each companion's key to a doctor they traveled with, one row per pairing. We'll also define a list of the actors who played the Doctor and the years in which they played the role.
(def input-data
  [{:given-name "Susan", :surname "Forman", :doctors [1]}
   {:given-name "Katarina", :surname nil, :doctors [1]}
   {:given-name "Victoria", :surname "Waterfield", :doctors [2]}
   {:given-name "Sarah Jane", :surname "Smith", :doctors [3 4 10]}
   {:given-name "Romana", :surname nil, :doctors [4]}
   {:given-name "Kamelion", :surname nil, :doctors [5]}
   {:given-name "Rose", :surname "Tyler", :doctors [9 10]}
   {:given-name "Martha", :surname "Jones", :doctors [10]}
   {:given-name "Adelaide", :surname "Brooke", :doctors [10]}
   {:given-name "Craig", :surname "Owens", :doctors [11]}])

;; The lowercased given names, used as keys in the other tables.
(def companion
  (map string/lower-case (map :given-name input-data)))

;; Rows of [key full-name]. trim removes the trailing space that
;; the join leaves behind when the surname is nil.
(def full-name
  (map (fn [{:keys [given-name surname]}]
         [(string/lower-case given-name)
          (string/trim (string/join \space [given-name surname]))])
       input-data))

;; Rows of [key doctor-number], one per companion/doctor pairing.
(def doctor
  (mapcat #(map (fn [d] [(string/lower-case (:given-name %)) d])
                (:doctors %))
          input-data))

;; Rows of [doctor-number actor-name years].
(def actor
  [[1 "William Hartnell" "1963–66"]
   [2 "Patrick Troughton" "1966–69"]
   [3 "Jon Pertwee" "1970–74"]
   [4 "Tom Baker" "1974–81"]
   [5 "Peter Davison" "1981–84"]
   [6 "Colin Baker" "1984–86"]
   [7 "Sylvester McCoy" "1987–89, 1996"]
   [8 "Paul McGann" "1996"]
   [9 "Christopher Eccleston" "2005"]
   [10 "David Tennant" "2005–10"]
   [11 "Matt Smith" "2010–present"]])
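To make the shape of these tables concrete, here is a minimal sketch that applies the same transformations to a two-record sample and prints the rows. The names sample-data, sample-full-name, and sample-doctor are made up for this illustration. It needs only clojure.string, not Cascalog.

```clojure
(require '[clojure.string :as string])

;; Hypothetical two-record sample in the same shape as input-data.
(def sample-data
  [{:given-name "Sarah Jane", :surname "Smith", :doctors [3 4 10]}
   {:given-name "Romana", :surname nil, :doctors [4]}])

;; One [key full-name] row per companion. trim drops the trailing
;; space left behind when the surname is nil.
(def sample-full-name
  (map (fn [{:keys [given-name surname]}]
         [(string/lower-case given-name)
          (string/trim (string/join \space [given-name surname]))])
       sample-data))

;; One [key doctor-number] row per companion/doctor pairing.
(def sample-doctor
  (mapcat (fn [{:keys [given-name doctors]}]
            (map (fn [d] [(string/lower-case given-name) d]) doctors))
          sample-data))

(prn sample-full-name)
;; prints (["sarah jane" "Sarah Jane Smith"] ["romana" "Romana"])
(prn sample-doctor)
;; prints (["sarah jane" 3] ["sarah jane" 4] ["sarah jane" 10] ["romana" 4])
```

A companion with several doctors produces several rows in the doctor table, which is the flat, relational shape that query predicates expect.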
(?<- (stdout) [?companion] (companion ?companion))
…
RESULTS
-----------------------
susan
barbara
ian
vicki
steven
…

(?<- (stdout) [?name] (full-name _ ?name))
…
RESULTS
-----------------------
Susan Forman
Barbara Wright
Ian Chesterton
Vicki
Steven Taylor
…
The structure of query statements is not hard to understand. Let's break one query statement apart:
(?<- (stdout) [?name] (full-name _ ?name))
The ?<- operator creates a query and executes it. It's a combination of the <- macro, which creates a query from output variables and predicates, and the ?- function, which executes a query to a sink.
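The relationship between the three operators can be sketched by writing the same query both ways. This assumes the Cascalog dependency from project.clj is on the classpath. The table name-sample and the var name-query are hypothetical names used only for this illustration.

```clojure
(use 'cascalog.api)

;; Hypothetical two-row table in the same shape as full-name.
(def name-sample
  [["susan" "Susan Forman"]
   ["katarina" "Katarina"]])

;; <- only builds the query; nothing runs yet.
(def name-query
  (<- [?name] (name-sample _ ?name)))

;; ?- executes an existing query, sending its tuples to the sink.
(?- (stdout) name-query)

;; ?<- does both steps at once.
(?<- (stdout) [?name] (name-sample _ ?name))
```

Building the query separately with <- is useful when the same query feeds more than one sink or is reused as a subquery.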
(?<- (stdout) [?name] (full-name _ ?name))
The first parameter, (stdout), is a Cascading tap sink. This is the destination for the data. If a query produces a lot of output, dumping it to the console isn't a good idea, and you can send it to a file instead. Since there's not much data here, we'll just write it to the screen.
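As a rough sketch of sending results to a file instead, the sink below uses lfs-textline from cascalog.api, which writes text lines to the local filesystem. The output path "output/names" and the table name-sample are hypothetical, and the :sinkmode :replace option is included on the assumption that you want to overwrite output from an earlier run.

```clojure
(use 'cascalog.api)

;; Hypothetical two-row table in the same shape as full-name.
(def name-sample
  [["susan" "Susan Forman"]
   ["katarina" "Katarina"]])

;; "output/names" is a made-up local directory. Cascading writes
;; one or more part files into it rather than a single file.
(?<- (lfs-textline "output/names" :sinkmode :replace)
     [?name]
     (name-sample _ ?name))
```

Only the sink changes; the output variables and predicates are the same as in the query that prints to the screen.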
(?<- (stdout) [?name] (full-name _ ?name))
The second parameter, [?name], is a vector of output variables. Every name here must also occur in the predicates that follow.
(?<- (stdout) [?name] (full-name _ ?name))
The remaining arguments are the predicates. In this example, there's only one, (full-name _ ?name), which queries the full-name table. It doesn't care about the values in the first column, so it uses an underscore (_) as a placeholder. Using an underscore in this way is a convention in Clojure and similar languages for values that you want to ignore. The values in the second column are bound to the name ?name, which also appears in the vector of output variables.
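The same pattern extends to wider tables. The sketch below queries a hypothetical two-row copy of the actor data, named actor-sample, ignoring the first column and binding the other two. It also uses ??<-, which runs the query and returns the tuples as a Clojure sequence instead of writing them to a sink.

```clojure
(use 'cascalog.api)

;; Hypothetical sample copied from the first two rows of actor.
(def actor-sample
  [[1 "William Hartnell" "1963–66"]
   [2 "Patrick Troughton" "1966–69"]])

;; Ignore the doctor number; bind and output the name and years.
(?<- (stdout) [?actor ?years] (actor-sample _ ?actor ?years))

;; ??<- takes no sink and returns the result tuples directly,
;; which is convenient at the REPL.
(def rows
  (??<- [?actor ?years] (actor-sample _ ?actor ?years)))
```

Each predicate must supply one variable or placeholder per column, so a three-column table needs three positions even when only two are used.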
Of course, querying in-memory data isn't that useful in itself, although it's good for development and debugging. Later, we'll see how to connect to a data file; the query syntax is exactly the same.