Support vector machines (SVMs) try to divide two groups of data along a plane. An SVM looks for the plane that is farthest from both groups: if a candidate plane comes much closer to group B than to group A, the SVM will prefer a plane that is roughly the same distance from both. SVMs have a number of nice properties. While other clustering or classification algorithms work well with clearly defined clusters of data, SVMs may work fine with data that isn't in well-defined and delineated groupings. They are also not affected by local minima. Algorithms such as K-Means or SOMs, which begin from a random starting point, can get caught in solutions that are good for the area around the solution but aren't the best for the entire space. This isn't a problem for SVMs.
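To make the idea of "farthest from both groups" concrete, here is a minimal sketch in plain Clojure. It is not how LibSVM works internally; the function names (signed-distance, margin) and the toy points are assumptions made up for illustration. It measures how close the nearest point comes to a candidate line w·x + b = 0 and compares two candidates.

```clojure
(defn dot
  "Dot product of two equal-length vectors."
  [u v]
  (reduce + (map * u v)))

(defn norm
  "Euclidean length of a vector."
  [v]
  (Math/sqrt (dot v v)))

(defn signed-distance
  "Signed distance from point x to the hyperplane w.x + b = 0."
  [w b x]
  (/ (+ (dot w x) b) (norm w)))

(defn margin
  "Smallest distance from any labeled point to the hyperplane.
  Labels are +1 or -1; the result is negative if a point falls on
  the wrong side."
  [w b labeled-points]
  (apply min
         (for [[x label] labeled-points]
           (* label (signed-distance w b x)))))

;; Hypothetical toy data: two groups lying along the x axis.
(def toy-data
  [[[0.0 0.0] -1] [[1.0 0.0] -1]
   [[4.0 0.0]  1] [[5.0 0.0]  1]])

;; The vertical line x = 1.5 hugs the negative group.
(def close-margin (margin [1.0 0.0] -1.5 toy-data))

;; The vertical line x = 2.5 sits midway between the groups.
(def centered-margin (margin [1.0 0.0] -2.5 toy-data))
```

Both lines separate the toy groups, but the centered one leaves more room on each side, and that is the kind of boundary an SVM prefers.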
First, we'll need these dependencies in our project.clj file:
(defproject d-mining "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [nz.ac.waikato.cms.weka/weka-dev "3.7.11"]
                 [nz.ac.waikato.cms.weka/LibSVM "1.0.6"]])
In the script or REPL, we'll import the SVM library:
(import [weka.classifiers.functions LibSVM])
We'll also use the ionosphere dataset from the Weka datasets. (You can download this from http://www.ericrochester.com/clj-data-analysis/data/UCI/ionosphere.arff.) This data is taken from a phased-array antenna system in Goose Bay, Labrador. For each observation, the first 34 attributes are from 17 pulse numbers (a pulse for each observation) for the system, with two attributes per pulse number. The thirty-fifth attribute indicates whether the reading is good or bad. Good readings show evidence of some kind of structure in the ionosphere. Bad readings do not; their signals pass through the ionosphere. We'll load this and set the last column, the "good" or "bad" column, as the class index (index 34, counting from zero):
(def ion (doto (load-arff "data/UCI/ionosphere.arff") (.setClassIndex 34)))
Finally, we'll use the defanalysis macro from the Discovering groups of data with K-Means clustering recipe and the sample-instances function from the Classifying data with Naive Bayesian classifiers recipe.
For this recipe, we'll define some utility functions and the analysis algorithm wrapper. Then we'll put it through its paces:
In Clojure, we'd like to pass true to one option and a mnemonic keyword to another, but Weka wants both of these as integers. So to make the parameter values more natural to Clojure, we'll use several functions that convert the Clojure parameters to the integers that Weka wants:
(defn bool->int [b] (if b 1 0))
(def svm-types
  {:c-svc 0, :nu-svc 1, :one-class-svm 2,
   :epsilon-svr 3, :nu-svr 4})
(def svm-fns
  {:linear 0, :polynomial 1, :radial-basis 2, :sigmoid 3})
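As a rough illustration of what these converters are for, the sketch below turns a few Clojure-friendly values into the flag-and-string pairs that Weka-style options use. The helper option-strings is hypothetical and is not part of defanalysis, whose actual expansion isn't shown in this recipe. The three converter definitions are repeated from above so that the block runs on its own.

```clojure
;; Repeated from the recipe so this sketch is self-contained.
(defn bool->int [b] (if b 1 0))
(def svm-types
  {:c-svc 0, :nu-svc 1, :one-class-svm 2,
   :epsilon-svr 3, :nu-svr 4})
(def svm-fns
  {:linear 0, :polynomial 1, :radial-basis 2, :sigmoid 3})

(defn option-strings
  "Hypothetical helper: pairs a command-line flag with the string
  form of a converted value. `convert` can be any function of one
  argument, including a map used as a lookup function."
  [flag value convert]
  [flag (str (convert value))])

;; Build a flat vector of flags and values from Clojure-friendly input.
(def example-options
  (vec (concat (option-strings "-S" :nu-svc svm-types)
               (option-strings "-K" :linear svm-fns)
               (option-strings "-Z" true bool->int))))
```

Because Clojure maps are also functions of their keys, svm-types and svm-fns can be passed as converters just like bool->int. A vector like example-options could then be turned into a Java String array with (into-array String example-options) before it is handed to a Weka object.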
With these in place, we can define the wrapper for the LibSVM class, which is a standalone library that works with Weka:
(defanalysis svm LibSVM buildClassifier
  [["-S" svm-type :c-svc svm-types]
   ["-K" kernel-fn :radial-basis svm-fns]
   ["-D" degree 3]
   ["-G" gamma nil :not-nil]
   ["-R" coef0 0]
   ["-C" c 1]
   ["-N" nu 0.5]
   ["-Z" normalize false bool->int]
   ["-P" epsilon 0.1]
   ["-M" cache-size 40]
   ["-E" tolerance 0.001]
   ["-H" shrinking true bool->int]
   ["-W" weights nil :not-nil]])
(defn eval-instance
  ([] {:correct 0, :incorrect 0})
  ([_] {:correct 0, :incorrect 0})
  ([classifier sums instance]
   (if (= (.classValue instance)
          (.classifyInstance classifier instance))
     (assoc sums :correct (inc (sums :correct)))
     (assoc sums :incorrect (inc (sums :incorrect))))))
(def ion-sample (sample-instances ion 35))
(def ion-svm (svm ion-sample))
Now we can use eval-instance to see how it did:
user=> (reduce (partial eval-instance ion-svm)
               (eval-instance) ion)
{:incorrect 81, :correct 270}
The classifier gets 270 of the 351 instances right, which is about 77 percent correct.
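The percentage comes straight from the two counts in the result map. As a small sketch, the helper below (accuracy is a name introduced here, not something from Weka or from the recipe) does the arithmetic on the map shown above:

```clojure
(defn accuracy
  "Fraction of instances classified correctly, given a map of
  :correct and :incorrect counts like the one eval-instance builds."
  [{:keys [correct incorrect]}]
  (/ (double correct) (+ correct incorrect)))

;; The counts reported by the REPL session above.
(def ion-result {:incorrect 81, :correct 270})

;; 270 / (270 + 81) = 270 / 351
(def ion-accuracy (accuracy ion-result))

;; Rounded to the nearest whole percent.
(def ion-percent (Math/round (* 100 ion-accuracy)))
```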
For more about the LibSVM class, see http://weka.wikispaces.com/LibSVM