Generally, data won't be quite in the form we'll need for our analyses. We spent a lot of time transforming data in Clojure in Chapter 2, Cleaning and Validating Data. Weka contains several methods for renaming columns and filtering the ones that will make it into the dataset.
Most datasets have one or more columns that will throw off clustering (row identifiers or name fields, for instance), so we must filter the columns in the datasets before we perform any analysis. We'll see a lot of examples of this in the recipes to come.
We'll use the same dependencies, imports, and datafiles that we used in the Loading CSV and ARFF files into Weka recipe. We'll also use the dataset that we loaded in that recipe. We'll need to access a different set of Weka classes, as well as the clojure.string library:
(import [weka.filters Filter]
        [weka.filters.unsupervised.attribute Remove])
(require '[clojure.string :as str])
In this recipe, we'll first rename the columns from the dataset. Then we'll look at two different ways to remove columns, one destructively and one not.
We'll create a function to rename the attributes with a sequence of keywords, and then we'll see this function in action:
(defn set-fields [instances field-seq]
  (doseq [n (range (.numAttributes instances))]
    (.renameAttribute instances
                      (.attribute instances n)
                      (name (nth field-seq n)))))
user=> (map #(.. data (attribute %) name)
            (range (.numAttributes data)))
("Country-Code" "Year" "AG.SRF.TOTL.K2" "AG.LND.AGRI.ZS" "AG.LND.AGRI.K2")
(set-fields data [:country-code :year :total-land :agri-percent :agri-total])
This dataset also contains a number of columns that we won't use, for example, the field agri-percent. Since it won't ever be used, we'll destructively remove it from the dataset:
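Weka deletes attributes by index, but we want to specify them by name, so we first need a function that takes an attribute name and returns its index:
(defn attr-n [instances attr-name]
  (->> instances
       (.numAttributes)
       range
       ;; Pair each index with the name of the attribute at that index.
       (map #(vector % (.. instances (attribute %) name)))
       (filter #(= (second %) (name attr-name)))
       ffirst))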
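The lookup in attr-n can be tried out without Weka. The following is a minimal sketch, assuming a plain vector of strings stands in for the dataset's attribute names; attr-names and index-of-name are hypothetical names used only for illustration:

```clojure
;; Hypothetical stand-in for the attribute names of a dataset.
(def attr-names
  ["country-code" "year" "total-land" "agri-percent" "agri-total"])

(defn index-of-name
  "Returns the zero-based position of attr-name in names, or nil if absent."
  [names attr-name]
  (->> names
       (map-indexed vector)                       ; pairs of [index name]
       (filter #(= (second %) (name attr-name)))  ; keep the matching pair
       ffirst))                                   ; index of the first match

(index-of-name attr-names :agri-percent) ;=> 3
(index-of-name attr-names :missing)      ;=> nil
```

Note that a name that is not found yields nil rather than an index, so it is worth checking the names we pass in before relying on the result.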
We can use attr-n with reduce on the instances, removing the attributes as we go:
(defn delete-attrs [instances attr-names]
  (reduce (fn [is n]
            ;; Look up the index against the current state of the dataset.
            (.deleteAttributeAt is (attr-n is n))
            is)
          instances
          attr-names))
(delete-attrs data [:agri-percent])
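Because each deletion shifts the indices of the attributes that follow it, the index has to be looked up again before every removal. The following sketch illustrates that idea with a plain vector of names instead of a Weka dataset; remove-at and delete-names are hypothetical helpers, and every name passed in is assumed to be present:

```clojure
(defn remove-at
  "Returns vector v without the element at index i."
  [v i]
  (into (subvec v 0 i) (subvec v (inc i))))

(defn delete-names
  "Removes each name in to-delete, re-resolving its index at every step."
  [names to-delete]
  (reduce (fn [remaining n]
            (remove-at remaining (.indexOf remaining (name n))))
          names
          to-delete))

(delete-names ["country-code" "year" "total-land" "agri-percent" "agri-total"]
              [:year :agri-percent])
;=> ["country-code" "total-land" "agri-total"]
```

Here agri-percent starts at index 3 but sits at index 2 once year is gone, which is why the indices cannot all be computed up front.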
There are a few attributes that we'll hide. Instead of destructively deleting attributes from one set of instances, filtering them creates a new dataset without the hidden attributes. It can be useful to have one dataset for clustering and another with the complete information for the dataset (for example, a name or ID attribute). For this example, I'll take out the country code:
We apply a filter class to a dataset to create a new dataset. We'll use the Remove filter in this function. It also uses the attr-n function that we defined earlier in this recipe:
(defn filter-attributes [dataset remove-attrs]
  (let [;; Remove's -R option takes one-based indices, hence the inc.
        attrs (map inc (map (partial attr-n dataset) remove-attrs))
        ;; ->options is not defined in this recipe; it is assumed to be
        ;; available from the setup of the earlier loading recipe.
        options (->options "-R" (str/join "," (map str attrs)))
        rm (doto (Remove.)
             (.setOptions options)
             (.setInputFormat dataset))]
    (Filter/useFilter dataset rm)))
(def data-numbers (filter-attributes data [:country-code]))
And we can see the results.
user=> (map #(.. data-numbers (attribute %) name)
            (range (.numAttributes data-numbers)))
("year" "total-land" "agri-total")
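The index list handed to the Remove filter's -R option is one-based and comma-separated, while attr-n returns zero-based positions. The following is a small sketch of that conversion on its own; remove-range is a hypothetical helper, not part of Weka:

```clojure
(require '[clojure.string :as str])

(defn remove-range
  "Builds a comma-separated, one-based index list from zero-based indices."
  [zero-based]
  (str/join "," (map (comp str inc) zero-based)))

(remove-range [0])   ;=> "1"
(remove-range [0 3]) ;=> "1,4"

;; Without a separator the indices run together into a different number.
(str/join (map str [1 4])) ;=> "14"
```

The separator only starts to matter once more than one attribute is removed, so a single-attribute call such as the one above would not reveal a missing comma.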