Loading CSV and ARFF files into Weka

Weka works best with its own file format, the Attribute-Relation File Format (ARFF). An ARFF file declares the type of data in each column and carries other information that allows it to be loaded incrementally, and both of these can be important features. Because the column types are declared rather than inferred, Weka can load ARFF data more reliably. Weka can still import CSV files, but when it does, it has to guess the type of data in each column.

In this recipe, we'll see what's necessary to load data from a CSV file and an ARFF file.

Getting ready

First, we'll need to add Weka to the dependencies in our Leiningen project.clj file:

(defproject d-mining "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [nz.ac.waikato.cms.weka/weka-dev "3.7.11"]])

Then we'll import the right classes into our script or REPL:

(import [weka.core.converters ArffLoader CSVLoader]
        [java.io File])

Finally, we'll need a CSV file to import. In this recipe, I'll use the Chinese land use dataset that we compiled for the Scaling variables to simplify variable relationships recipe in Chapter 7, Statistical Data Analysis with Incanter. It's in the file named data/chn-land.csv. You can also download this file from http://www.ericrochester.com/clj-data-analysis/data/chn-land.csv.

How to do it…

For this recipe, we'll write several utility functions and then use them to load the data:

  1. First, we'll need a utility function to convert options into an array of strings:
    (defn ->options
      [& opts]
      (into-array String
                  (map str (flatten (remove nil? opts)))))
  2. Next, we'll create a function that takes a filename and an optional :header keyword argument and returns the Weka dataset of instances:
    (defn load-csv [filename & {:keys [header]
                                :or {header true}}]
      (let [options (->options (when-not header "-H"))
            loader (doto (CSVLoader.)
                     (.setOptions options)
                     (.setSource (File. filename)))]
        (.getDataSet loader)))
  3. Finally, we can use this to load CSV files:
    (def data (load-csv "data/chn-land.csv"))
  4. Alternatively, if we have a file without a header row, we can do this:
    (def data (load-csv "data/chn-land.csv"
                        :header false))
  5. We can use a similar function to load ARFF files:
    (defn load-arff [filename]
      (.getDataSet
        (doto (ArffLoader.)
          (.setFile (File. filename)))))

There are ARFF files of standard datasets already created and available to download from http://weka.wikispaces.com/Datasets. We'll use some of these in later recipes.
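
Since Weka guesses the column types of CSV data, it can help to check what it actually inferred after loading. The following is a minimal sketch and not part of the original recipe: attribute-type and column-types are hypothetical helper names, and they rely only on the accessors numAttributes, attribute, name, isDate, isNumeric, isNominal, and isString from Weka's Instances and Attribute classes.

```clojure
(import [weka.core Attribute Instances])

;; Hypothetical helper: map a Weka attribute to a keyword for its type.
;; isDate is checked first because isNumeric also returns true for
;; date attributes.
(defn attribute-type
  [^Attribute attr]
  (cond
    (.isDate attr)    :date
    (.isNumeric attr) :numeric
    (.isNominal attr) :nominal
    (.isString attr)  :string
    :else             :other))

;; Hypothetical helper: return a vector of [column-name type] pairs,
;; one for each attribute in the dataset.
(defn column-types
  [^Instances data]
  (vec (for [i (range (.numAttributes data))]
         (let [attr (.attribute data (int i))]
           [(.name attr) (attribute-type attr)]))))
```

Assuming load-csv from step 2 is defined, (column-types (load-csv "data/chn-land.csv")) returns one [name type] pair per column, which makes it easy to spot a column that Weka read as nominal when numeric was intended.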

How it works…

Weka can be used in a number of ways. Although we're using it as a library here, it can also be used as a GUI or a command-line application. In fact, in many ways, using Weka as a library is the same as using it as a command-line application (just without calling it from the shell). Whether it's used from a GUI or programmatically, at some point, we're setting options using a command-line-style string array.

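The ->options function takes care of converting and cleaning up a list of these options: it removes nil values, flattens nested sequences, and converts every remaining item to a string. Then into-array converts the resulting sequence into a Java string array.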
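
To make that behavior concrete, here is a small sketch that can be run at the REPL without Weka. It repeats the definition from step 1 so that it stands alone. The "-X" flag is made up for illustration and is not an option of any particular Weka class.

```clojure
;; Same definition as in step 1, repeated so this snippet is self-contained.
(defn ->options
  [& opts]
  (into-array String
              (map str (flatten (remove nil? opts)))))

;; Top-level nils are dropped, nested sequences are flattened, and
;; non-string values are converted with str.
(vec (->options "-H" nil ["-X" 3]))
;; => ["-H" "-X" "3"]

;; This is what load-csv passes when :header is true: when-not returns
;; nil, so the result is an empty options array.
(vec (->options (when-not true "-H")))
;; => []
```

Note that only top-level nils are removed. A nil nested inside a vector survives flatten and is turned into an empty string by str, so optional flags are best passed at the top level, as load-csv does.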

For all of the later recipes that use Weka, the load-csv function will serve as a template. Each time we use a Weka class, we'll essentially create an object, set its options as a string array, and perform the operation. In the Discovering groups of data using K-Means clustering recipe, we'll reify the process with the defanalysis macro.

The ARFF format, for which we created a function in Step 5, is a Weka-specific datafile format. CSV files don't include information about the types of data stored in the columns. Weka tries to be smart and figure out what they are, but it isn't always successful. However, ARFF files do contain information about the columns' datatypes, and this makes them Weka's preferred file format.
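
To see how an ARFF file carries its own type information, the following sketch writes a tiny, made-up dataset to a temporary file and loads it the same way load-arff does. The dataset contents and the names tiny-arff and load-tiny-arff are assumptions for illustration only; the @relation, @attribute, and @data lines are standard ARFF syntax.

```clojure
(import [weka.core.converters ArffLoader]
        [java.io File])

;; A made-up two-column dataset. The header declares one numeric
;; attribute and one nominal attribute with two allowed values.
(def tiny-arff
  (str "@relation tiny\n"
       "@attribute year numeric\n"
       "@attribute region {east,west}\n"
       "@data\n"
       "2000,east\n"
       "2001,west\n"))

;; Write the text to a temporary .arff file and load it with ArffLoader.
(defn load-tiny-arff []
  (let [f (doto (File/createTempFile "tiny" ".arff")
            (.deleteOnExit))]
    (spit f tiny-arff)
    (.getDataSet (doto (ArffLoader.)
                   (.setFile f)))))

;; The types come from the header, so nothing has to be guessed.
(let [data (load-tiny-arff)]
  [(.relationName data)
   (.isNumeric (.attribute data (int 0)))
   (.isNominal (.attribute data (int 1)))])
;; => ["tiny" true true]
```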

There's more…

The columns may need filtering or renaming, especially if the CSV file doesn't have a header row. We'll see how to do that in the next recipe, Filtering, renaming, and deleting columns in Weka datasets.

See also…

For more about Weka, see its website at http://www.cs.waikato.ac.nz/ml/weka/
