We will start with the most commonly used machine learning technique, that is, classification. As we reviewed in the first chapter, the main idea is to automatically build a mapping between the input variables and the outcome. In the following sections, we will look at how to load the data, select features, implement a basic classifier in Weka, and evaluate the classifier performance.
For this task, we will have a look at the ZOO database [ref]. The database contains 101 data entries of animals, each described with 18 attributes, as shown in the following table:
animal | aquatic | fins
hair | predator | legs
feathers | toothed | tail
eggs | backbone | domestic
milk | breathes | catsize
airborne | venomous | type
An example entry in the dataset is a lion with the following attributes:
animal: lion
hair: true
feathers: false
eggs: false
milk: true
airborne: false
aquatic: false
predator: true
toothed: true
backbone: true
breathes: true
venomous: false
fins: false
legs: 4
tail: true
domestic: false
catsize: true
type: mammal
Our task will be to build a model to predict the outcome variable, type, given the other attributes as input.
Before we start with the analysis, we will load the data in Weka's ARFF format and print the total number of loaded instances. Each data sample is held within an Instance object, while the complete dataset, accompanied by meta-information, is handled by the Instances object.
To load the input data, we will use the DataSource object, which accepts a variety of file formats and converts them to Instances:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Load the dataset from the path given as the first program argument
DataSource source = new DataSource(args[0]);
Instances data = source.getDataSet();
System.out.println(data.numInstances() + " instances loaded.");
// System.out.println(data.toString());
This outputs the number of loaded instances, as follows:
101 instances loaded.
We can also print the complete dataset by calling the data.toString() method.
Our task is to learn a model that is able to predict the type attribute in future examples for which we know the other attributes but do not know the type label. The animal attribute merely names each entry and does not help the model generalize, so we remove it from the training set. We accomplish this by filtering out the animal attribute using the Remove filter.
First, we set a string array of options, specifying that the first attribute must be removed. The remaining attributes are used as our dataset for training a classifier:
Remove remove = new Remove();
String[] opts = new String[] {"-R", "1"};
Finally, we pass the options to the filter and call the Filter.useFilter(Instances, Filter) static method to apply the filter to the dataset. We also mark the last remaining attribute, type, as the class attribute:
remove.setOptions(opts);
remove.setInputFormat(data);
data = Filter.useFilter(data, remove);
// Set the class explicitly instead of relying on later steps to do it
data.setClassIndex(data.numAttributes() - 1);
As introduced in Chapter 1, Applied Machine Learning Quick Start, one of the pre-processing steps is focused on feature selection, also known as attribute selection. The goal is to select a subset of relevant attributes that will be used in a learned model. Why is feature selection important? A smaller set of attributes simplifies the models and makes them easier for users to interpret. It also usually shortens training and reduces overfitting.
Attribute selection may or may not take the class value into account. In the first case, an attribute selection algorithm evaluates different subsets of features and calculates a score that indicates the quality of the selected attributes. We can use different search algorithms, such as exhaustive search or best-first search, and different quality scores, such as information gain, the Gini index, and so on.
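To make the idea of a quality score concrete, the following is a minimal sketch of how information gain can be computed from class counts. It is plain Java and is not Weka's implementation; the class and method names (InfoGainSketch, entropy, infoGain) are our own. The counts are read off the decision tree and the confusion matrix printed later in this section: 20 animals have feathers and all of them are birds, while the remaining 81 animals have no feathers.

```java
final class InfoGainSketch {

    // Shannon entropy (in bits) of a class distribution given as counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Information gain of a split: entropy of the parent distribution
    // minus the size-weighted entropy of each partition.
    static double infoGain(int[] parent, int[][] partitions) {
        int total = 0;
        for (int c : parent) total += c;
        double remainder = 0.0;
        for (int[] part : partitions) {
            int size = 0;
            for (int c : part) size += c;
            remainder += ((double) size / total) * entropy(part);
        }
        return entropy(parent) - remainder;
    }

    public static void main(String[] args) {
        // Counts per class, in the order:
        // mammal, bird, reptile, fish, amphibian, insect, invertebrate
        int[] all = {41, 20, 5, 13, 4, 8, 10};
        int[] feathersTrue = {0, 20, 0, 0, 0, 0, 0};
        int[] feathersFalse = {41, 0, 5, 13, 4, 8, 10};

        double gain = infoGain(all, new int[][] {feathersTrue, feathersFalse});
        System.out.printf("H(type) = %.4f bits%n", entropy(all));
        System.out.printf("Gain(feathers) = %.4f bits%n", gain);
    }
}
```

An attribute with a higher gain leaves less uncertainty about the class once its value is known, which is why it is ranked higher.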
Weka supports this process with an AttributeSelection object, which requires two additional parameters: an evaluator, which computes how informative an attribute is, and a ranker, which sorts the attributes according to the score assigned by the evaluator.
In this example, we will use information gain as an evaluator and rank the features by their information gain score:
InfoGainAttributeEval eval = new InfoGainAttributeEval();
Ranker search = new Ranker();
Next, we initialize an AttributeSelection object and set the evaluator, ranker, and data:
AttributeSelection attSelect = new AttributeSelection();
attSelect.setEvaluator(eval);
attSelect.setSearch(search);
attSelect.SelectAttributes(data);
Finally, we can print an ordered list of attribute indices, as follows:
int[] indices = attSelect.selectedAttributes();
System.out.println(Utils.arrayToString(indices));
The method outputs the following result:
12,3,7,2,0,1,8,9,13,4,11,5,15,10,6,14,16
The indices are zero-based and refer to the dataset after the animal attribute was removed. The four most informative attributes are 12 (legs), 3 (milk), 7 (toothed), and 2 (eggs), and so on; the final index, 16, is the class attribute, type. Based on this list, we can remove additional, non-informative features in order to help learning algorithms build more accurate models faster.
What determines the number of attributes to keep? There's no rule of thumb for an exact number; it depends on the data and the problem. The purpose of attribute selection is to choose attributes that serve your model better, so it is best to focus on whether the attributes are improving the model.
We have loaded our data, selected the best features, and are ready to learn some classification models. Let's begin with the basic decision trees.
Decision trees in Weka are implemented in the J48 class, which is a re-implementation of Quinlan's famous C4.5 decision tree learner [Quinlan, 1993].
First, we initialize a new J48 decision tree learner. We can pass additional parameters with a string array, for instance, tree pruning that controls the model complexity (refer to Chapter 1, Applied Machine Learning Quick Start). In our case, we will build an unpruned tree, hence we will pass a single -U parameter:
J48 tree = new J48();
String[] options = new String[1];
options[0] = "-U"; // build an unpruned tree
tree.setOptions(options);
Next, we call the buildClassifier(Instances) method to start the learning process:
tree.buildClassifier(data);
The built model is now stored in the tree object. We can output the entire J48 unpruned tree by calling the toString() method:
System.out.println(tree);
The output is as follows:
J48 unpruned tree
------------------

feathers = false
|   milk = false
|   |   backbone = false
|   |   |   airborne = false
|   |   |   |   predator = false
|   |   |   |   |   legs <= 2: invertebrate (2.0)
|   |   |   |   |   legs > 2: insect (2.0)
|   |   |   |   predator = true: invertebrate (8.0)
|   |   |   airborne = true: insect (6.0)
|   |   backbone = true
|   |   |   fins = false
|   |   |   |   tail = false: amphibian (3.0)
|   |   |   |   tail = true: reptile (6.0/1.0)
|   |   |   fins = true: fish (13.0)
|   milk = true: mammal (41.0)
feathers = true: bird (20.0)

Number of Leaves  : 9

Size of the tree : 17
The printed tree has 17 nodes in total, 9 of which are terminal (leaves).
Another way to present the tree is to leverage the built-in TreeVisualizer tree viewer, as follows:
TreeVisualizer tv = new TreeVisualizer(null, tree.graph(), new PlaceNode2());
JFrame frame = new JFrame("Tree Visualizer");
frame.setSize(800, 500);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.getContentPane().add(tv);
frame.setVisible(true);
tv.fitToScreen();
The code results in the following frame:
The decision process starts at the top node, also known as the root node. The node label specifies the attribute whose value will be checked. In our example, we first check the value of the feathers attribute. If feathers are present, we follow the right-hand branch, which leads us to the leaf labeled bird, indicating that there are 20 examples supporting this outcome. If feathers are not present, we follow the left-hand branch, which leads us to the next attribute, milk. We check the value of the attribute again and follow the branch that matches the attribute value. We repeat the process until we reach a leaf node.
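The decision process described above can be written out by hand as nested conditions, one per internal node of the printed tree. The following sketch is only an illustration of how the tree is read; the class and method names (ZooTreeSketch, classify) are our own, and Weka does not generate this code.

```java
final class ZooTreeSketch {

    // Hand-written equivalent of the printed J48 tree. Each condition
    // corresponds to one internal node, each return to one leaf.
    static String classify(boolean feathers, boolean milk, boolean backbone,
                           boolean airborne, boolean predator, int legs,
                           boolean fins, boolean tail) {
        if (feathers) return "bird";
        if (milk) return "mammal";
        if (backbone) {
            if (fins) return "fish";
            return tail ? "reptile" : "amphibian";
        }
        if (airborne) return "insect";
        if (predator) return "invertebrate";
        return legs <= 2 ? "invertebrate" : "insect";
    }

    public static void main(String[] args) {
        // The lion entry from the example: no feathers, gives milk.
        System.out.println(
            classify(false, true, true, false, true, 4, false, true));
    }
}
```

For the lion, the second test (milk) already decides the outcome, so the remaining attributes are never inspected.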
We can build other classifiers by following the same steps: initialize a classifier, pass the parameters controlling the model complexity, and call the buildClassifier(Instances) method.
In the next section, we will learn how to use a trained model to assign a class label to a new example whose label is unknown.
Suppose we record the attributes of an animal whose label we do not know. We can predict its label with the learned classification model.
We first construct a feature vector describing the new specimen, as follows:
double[] vals = new double[data.numAttributes()];
vals[0] = 1.0;  // hair {false, true}
vals[1] = 0.0;  // feathers {false, true}
vals[2] = 0.0;  // eggs {false, true}
vals[3] = 1.0;  // milk {false, true}
vals[4] = 0.0;  // airborne {false, true}
vals[5] = 0.0;  // aquatic {false, true}
vals[6] = 0.0;  // predator {false, true}
vals[7] = 1.0;  // toothed {false, true}
vals[8] = 1.0;  // backbone {false, true}
vals[9] = 1.0;  // breathes {false, true}
vals[10] = 1.0; // venomous {false, true}
vals[11] = 0.0; // fins {false, true}
vals[12] = 4.0; // legs INTEGER [0,9]
vals[13] = 1.0; // tail {false, true}
vals[14] = 1.0; // domestic {false, true}
vals[15] = 0.0; // catsize {false, true}
// Assumes Weka 3.7 or later, where Instance is an interface;
// on Weka 3.6, use new Instance(1.0, vals) instead
Instance myUnicorn = new DenseInstance(1.0, vals);
// Link the instance to the dataset so that it can access the attribute definitions
myUnicorn.setDataset(data);
Finally, we call the classifyInstance(Instance) method on the model to obtain the class value. The method returns the index of the predicted label, as follows:
double result = tree.classifyInstance(myUnicorn);
System.out.println(data.classAttribute().value((int) result));
This outputs the mammal class label.
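The feature vector and the returned result are both plain double values because Weka stores a nominal value as the index of that value in the attribute's declared list of values. The following sketch mimics this encoding in plain Java. The helper names (NominalEncodingSketch, encode, decode) are hypothetical, and the order of the type labels is assumed to be the one shown in the confusion matrix later in this section.

```java
import java.util.Arrays;
import java.util.List;

final class NominalEncodingSketch {

    static final List<String> BOOL = Arrays.asList("false", "true");

    // Assumed declaration order of the class labels.
    static final List<String> TYPES = Arrays.asList(
        "mammal", "bird", "reptile", "fish",
        "amphibian", "insect", "invertebrate");

    // Nominal value -> index in the declared value list, as a double.
    static double encode(List<String> declaredValues, String value) {
        int index = declaredValues.indexOf(value);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown value: " + value);
        }
        return index;
    }

    // Index returned by a classifier -> label.
    static String decode(List<String> declaredValues, double index) {
        return declaredValues.get((int) index);
    }

    public static void main(String[] args) {
        System.out.println(encode(BOOL, "true"));
        System.out.println(decode(TYPES, 0.0));
    }
}
```

This is why the example casts the result to int before looking up the label with classAttribute().value(...).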
We built a model, but we do not know if it can be trusted. To estimate its performance, we can apply a cross-validation technique explained in Chapter 1, Applied Machine Learning Quick Start.
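As a reminder of what cross-validation does with the data, the following sketch deals shuffled instance indices into k folds, so that every instance is used for testing exactly once. It is a simplified illustration with our own names (KFoldSketch, makeFolds); Weka's own procedure additionally stratifies the folds by class for nominal class attributes, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class KFoldSketch {

    // Shuffles the indices 0..n-1 and deals them into k folds of
    // near-equal size. Fold i is the test set of round i; the other
    // k-1 folds together form that round's training set.
    static List<List<Integer>> makeFolds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }

    public static void main(String[] args) {
        // 101 instances and 10 folds, as in our example.
        List<List<Integer>> folds = makeFolds(101, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + f + ": "
                + folds.get(f).size() + " test instances");
        }
    }
}
```

The model is trained k times, each time on k-1 folds, and the predictions on the held-out folds are pooled into a single performance estimate.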
Weka offers an Evaluation class implementing cross-validation. We pass the model, data, number of folds, and an initial random seed, as follows:
Classifier cl = new J48();
Evaluation eval_roc = new Evaluation(data);
// 10-fold cross-validation with a fixed random seed for reproducibility
eval_roc.crossValidateModel(cl, data, 10, new Random(1), new Object[] {});
System.out.println(eval_roc.toSummaryString());
The evaluation results are stored in the Evaluation object.
A mix of the most common metrics can be obtained by calling the toSummaryString() method. Note that the output does not differentiate between regression and classification, so pay attention to the metrics that make sense, as follows:
Correctly Classified Instances          93               92.0792 %
Incorrectly Classified Instances         8                7.9208 %
Kappa statistic                          0.8955
Mean absolute error                      0.0225
Root mean squared error                  0.14
Relative absolute error                 10.2478 %
Root relative squared error             42.4398 %
Coverage of cases (0.95 level)          96.0396 %
Mean rel. region size (0.95 level)      15.4173 %
Total Number of Instances              101
In the classification, we are interested in the number of correctly/incorrectly classified instances.
Furthermore, we can inspect where a particular misclassification has been made by examining the confusion matrix. The confusion matrix shows how a specific class value was predicted:
double[][] confusionMatrix = eval_roc.confusionMatrix();
System.out.println(eval_roc.toMatrixString());
The resulting confusion matrix is as follows:
=== Confusion Matrix ===

  a  b  c  d  e  f  g   <-- classified as
 41  0  0  0  0  0  0 |  a = mammal
  0 20  0  0  0  0  0 |  b = bird
  0  0  3  1  0  1  0 |  c = reptile
  0  0  0 13  0  0  0 |  d = fish
  0  0  1  0  3  0  0 |  e = amphibian
  0  0  0  0  0  5  3 |  f = insect
  0  0  0  0  0  2  8 |  g = invertebrate
The column names in the first row correspond to the labels assigned by the classification model. Each additional row then corresponds to an actual true class value. For instance, the second row corresponds to instances with the mammal true class label. Reading along this row, we see that all mammals were correctly classified as mammals. In the fourth row, reptiles, we notice that three were correctly classified as reptiles, while one was classified as fish and one as an insect. The confusion matrix hence gives us an insight into the kinds of errors that our classification model makes.
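The headline numbers in the summary can be recomputed from the confusion matrix alone. The following sketch derives the accuracy and Cohen's kappa from the matrix printed above. It is plain Java with our own names (ConfusionMatrixSketch, accuracy, kappa) and does not use Weka's Evaluation class.

```java
final class ConfusionMatrixSketch {

    // Rows are actual classes, columns are predicted classes.
    static final int[][] ZOO = {
        {41,  0, 0,  0, 0, 0, 0},  // mammal
        { 0, 20, 0,  0, 0, 0, 0},  // bird
        { 0,  0, 3,  1, 0, 1, 0},  // reptile
        { 0,  0, 0, 13, 0, 0, 0},  // fish
        { 0,  0, 1,  0, 3, 0, 0},  // amphibian
        { 0,  0, 0,  0, 0, 5, 3},  // insect
        { 0,  0, 0,  0, 0, 2, 8},  // invertebrate
    };

    // Fraction of instances on the diagonal (actual == predicted).
    static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        }
        return (double) correct / total;
    }

    // Cohen's kappa: agreement beyond what the row and column totals
    // alone would produce by chance.
    static double kappa(int[][] m) {
        int n = m.length;
        long total = 0;
        long[] rowSum = new long[n];
        long[] colSum = new long[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                rowSum[i] += m[i][j];
                colSum[j] += m[i][j];
                total += m[i][j];
            }
        }
        double expected = 0.0;
        for (int i = 0; i < n; i++) {
            expected += (double) rowSum[i] * colSum[i];
        }
        expected /= (double) total * total;
        double observed = accuracy(m);
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        System.out.printf("accuracy = %.4f%n", accuracy(ZOO));
        System.out.printf("kappa    = %.4f%n", kappa(ZOO));
    }
}
```

Both values agree with the summary: 93 of 101 instances lie on the diagonal, and the kappa statistic rounds to 0.8955.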
Naive Bayes is one of the simplest, most efficient, and most effective inductive algorithms in machine learning. When features are independent, which is rarely true in the real world, it is theoretically optimal, and even with dependent features, its performance is amazingly competitive (Zhang, 2004). The main disadvantage is that it cannot learn how features interact with each other. For example, you may like your tea with lemon or with milk, but hate tea that has both at the same time.
The main advantage of a decision tree is that the model, that is, the tree, is easy to interpret and explain, as we saw in our example. It can handle both nominal and numeric features, and you don't have to worry about whether the data is linearly separable.
Some other examples of classification algorithms are as follows:
weka.classifiers.rules.ZeroR: This predicts the majority class and is considered a baseline; that is, if your classifier performs worse than this baseline predictor, it is not worth considering.
weka.classifiers.trees.RandomTree: This constructs a tree that considers K randomly chosen attributes at each node.
weka.classifiers.trees.RandomForest: This constructs a set (that is, a forest) of random trees and uses majority voting to classify a new instance.
weka.classifiers.lazy.IBk: This is the k-nearest neighbors classifier, which is able to select an appropriate number of neighbors based on cross-validation.
weka.classifiers.functions.MultilayerPerceptron: This is a classifier based on neural networks that uses back-propagation to classify instances. The network can be built by hand, created by an algorithm, or both.
weka.classifiers.bayes.NaiveBayes: This is a naive Bayes classifier that uses estimator classes, where numeric estimator precision values are chosen based on the analysis of the training data.
weka.classifiers.meta.AdaBoostM1: This is the class for boosting a nominal class classifier using the AdaBoost M1 method. Only nominal class problems can be tackled. This often dramatically improves the performance, but sometimes it overfits.
weka.classifiers.meta.Bagging: This is the class for bagging a classifier to reduce the variance. This can perform classification and regression, depending on the base learner.
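To see what the ZeroR baseline means for our dataset, the following sketch picks the majority class from the class counts and computes the accuracy of always predicting it. The counts are the row totals of the confusion matrix shown earlier. The code is plain Java with our own names (MajorityBaselineSketch, majorityClass, baselineAccuracy) and is not Weka's ZeroR implementation; the accuracy is measured on the full dataset rather than by cross-validation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MajorityBaselineSketch {

    // Returns the label with the highest count (the first one wins ties).
    static String majorityClass(Map<String, Integer> classCounts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : classCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Accuracy obtained by always predicting the majority class.
    static double baselineAccuracy(Map<String, Integer> classCounts) {
        int total = 0;
        for (int c : classCounts.values()) total += c;
        return (double) classCounts.get(majorityClass(classCounts)) / total;
    }

    // Class counts of the ZOO dataset (row totals of the confusion matrix).
    static Map<String, Integer> zooCounts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("mammal", 41);
        counts.put("bird", 20);
        counts.put("reptile", 5);
        counts.put("fish", 13);
        counts.put("amphibian", 4);
        counts.put("insect", 8);
        counts.put("invertebrate", 10);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = zooCounts();
        System.out.printf("baseline: always predict %s, accuracy %.4f%n",
            majorityClass(counts), baselineAccuracy(counts));
    }
}
```

Always answering mammal is right for 41 of the 101 animals, so any classifier worth keeping should do clearly better than that.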