In the previous recipe, the file we read was a CSV file, but we read it line by line. That's not optimal. Cascading provides a number of taps—sources of data or sinks to send data to—including one for CSV and other delimited data formats. Also, Cascalog has some good wrappers for several of these taps, but not for the CSV one.
In truth, creating a wrapper that exposes all the functionality of the delimited text format tap will be complex. There are options for delimiter characters, quote characters, including a header row, the types of columns, and other things. That's a lot of options, and dispatching to the right method can be tricky.
We won't worry about how to handle all the options right here. For this recipe, we will create a simple wrapper around the delimited text file tap that includes some of the more common options to read CSV files.
First, we'll need some of the same dependencies we've been using, as well as some new ones. Here are the full dependencies we'll need in our project.clj file:
(defproject distrib-data "0.1.0"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog "2.1.1"]
                 [org.slf4j/slf4j-api "1.7.7"]]
  :profiles {:dev {:dependencies
                   [[org.apache.hadoop/hadoop-core "1.2.1"]]}})
Also, we'll need to import a number of namespaces from these libraries into our script or REPL:
(require '[cascalog.logic.ops :as c]
         '[cascalog.cascading.tap :as tap]
         '[cascalog.cascading.util :as u])
(use 'cascalog.api)
(import [cascading.tuple Fields]
        [cascading.scheme.hadoop TextDelimited])
We'll also use the data file that we did in the Distributing data with Apache HDFS recipe. You can access it either locally or through HDFS, as we did earlier. I'll access it locally for this recipe.
This recipe defines a function that creates a cascading.scheme.hadoop.TextDelimited tap scheme with the correct options and then calls the cascalog.cascading.tap/hfs-tap Cascalog function with it. That will handle the rest, as shown here:

(defn hfs-text-delim
  [path & {:keys [fields has-header delim quote-str]
           :or {fields Fields/ALL, has-header false,
                delim ",", quote-str "\""}
           :as opts}]
  (let [scheme (TextDelimited. (u/fields fields)
                               has-header delim quote-str)
        tap-opts (select-keys opts [:sinkmode :sinkparts
                                    :source-pattern
                                    :sink-template
                                    :templatefields])]
    (apply tap/hfs-tap scheme path (mapcat identity tap-opts))))
user=> (?<- (stdout)
            [?origin_airport ?destin_airport]
            ((hfs-text-delim "data/16285/flights_with_colnames.csv"
                             :has-header true)
             ?origin_airport ?destin_airport ?passengers ?flights
             ?month))
…
RESULTS
-----------------------
MHK	AMW
EUG	RDM
EUG	RDM
EUG	RDM
…
This function takes a number of options, such as fields, has-header, delim, and quote-str. The defaults are for CSV files, but they can easily be overridden for a variety of other formats. We saw the use of the :has-header option in the previous example.
With the options in hand, it creates a TextDelimited scheme object and finally passes it to the hfs-tap function, which wraps the scheme object in a tap. The tap serves as a data generator, and we bind the values from it to the names in our query.
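Because taps can act as sinks as well as sources, the same wrapper can also sit on the output side of a query. This is a sketch rather than something from the recipe; the output path output/pairs is hypothetical, and it assumes hfs-text-delim passes :sinkmode through to hfs-tap:

```clojure
;; Write the query results to a CSV directory instead of stdout.
;; :sinkmode :replace overwrites any existing output.
(?<- (hfs-text-delim "output/pairs" :sinkmode :replace)
     [?origin_airport ?destin_airport]
     ((hfs-text-delim "data/16285/flights_with_colnames.csv"
                      :has-header true)
      ?origin_airport ?destin_airport ?passengers ?flights
      ?month))
```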
Hadoop can consume a number of different file formats. Avro (http://avro.apache.org/) uses JSON schemas to store data in a fast, compact, and binary data format. Sequence files (http://wiki.apache.org/hadoop/SequenceFile) contain a binary key-value store. XML and JSON are also common data formats.
If we want to parse our own data formats in Cascading or Cascalog, we'll need to write our own source tap (http://docs.cascading.org/cascading/2.5/userguide/html/ch03s05.html). If it's a delimited text format, such as CSV or TSV, we can base the new tap on cascading.scheme.hadoop.TextDelimited, just as we did in this recipe. See the JavaDocs for this class at http://docs.cascading.org/cascading/2.5/cascading-hadoop/cascading/scheme/hadoop/TextDelimited.html for more information.