Before we jump into our recipe, let's spend some time understanding how to check if our classification model is performing to our satisfaction. In the introduction section, we introduced a term called confusion matrix.
A confusion matrix is a matrix arrangement of the actual versus the predicted class labels. Let's say we have a two-class problem, that is, our y
can take either value, T
or F
. Let's say we trained a classifier to predict our y
. In our test data, we know the actual value of y
. We have the predicted value of our y
from our model. w0 values we can fill our confusion matrix, as follows:
Here is a table where we conveniently list our results from the test set. Remember that we know the class labels in our test set; hence, we can compare our classification model output with the actual class label.
TP
, which is an abbreviation for True Positive, we have a count of all those records in the test set whose label is T
, and where the model also predicted T
FN
, which is an abbreviation for False Negative, we have a count of all the records whose actual label is T
, but the algorithm predicted NFP
stands for False Positive, where the actual label is F
, but the algorithm predicted it as T
TN
stands for True Negative, where the algorithm predicted both the label and the actual class label as F
With the knowledge of this confusion matrix, we can now derive performance metrics with which we can measure the quality of our classification model. In future chapters, we will explore more metrics, but for now, we will introduce the accuracy and error rate.
Accuracy is defined as the ratio of a correct prediction to the total number of predictions. From the confusion matrix, we know that the sum of TP and TN is the total number of correct predictions:
Accuracy from the training set is always very optimistic. One should look at the test set's accuracy value to determine the true performance of the model.
Armed with this knowledge, let's jump into our recipe. The first classification algorithm that we will look at is K-Nearest Neighbor, in short, KNN. Before going into the details of KNN, let's look at a very simple classification algorithm, called the rote classifier algorithm. The rote classifier memorizes the entire training data, that is, it loads all the data in the memory. We need to perform classification on an unseen new training instance: it will try to match the new training instance with any of the training instances in the memory. It matches every attribute of the test instance with every attribute in the training instance. If it finds a match, it predicts the class label of the test instance as the class label of the matched training instance.
You should know by now that this classifier will fail if the test instance is not similar to any of the training instances loaded into the memory.
KNN is similar to rote classifier, except that instead of looking for an exact match, it uses a similarity measure. Similar to rote classifier, KNN loads all the training sets into the memory. When it needs to classify a test instance, it measures the distance between the test instance and all the training instances. Using this distance, it chooses K closest instances in the training set. Now, the prediction for the test set is based on the majority classes of the K nearest neighbors.
For example, if we have a two-class classification problem and we choose our K value as three, and if the given test record's three nearest neighbors have classes, 1, 1, and 0, it will classify the test instance as 1, which is the majority.
KNN belongs to a family of algorithms called instance-based learning. Additionally, as the decision to classify a test instance is taken last, it's also called a lazy learner.
For this recipe, we will generate some data using scikit's make_classification method. We will generate a matrix of four columns / attributes / features and 100 instances:
from sklearn.datasets import make_classification import numpy as np import matplotlib.pyplot as plt import itertools from sklearn.ensemble import BaggingClassifier from sklearn.neighbors import KNeighborsClassifier def get_data(): """ Make a sample classification dataset Returns : Independent variable y, dependent variable x """ x,y = make_classification(n_features=4) return x,y def plot_data(x,y): """ Plot a scatter plot fo all variable combinations """ subplot_start = 321 col_numbers = range(0,4) col_pairs = itertools.combinations(col_numbers,2) for col_pair in col_pairs: plt.subplot(subplot_start) plt.scatter(x[:,col_pair[0]],x[:,col_pair[1]],c=y) title_string = str(col_pair[0]) + "-" + str(col_pair[1]) plt.title(title_string) x_label = str(col_pair[0]) y_label = str(col_pair[1]) plt.xlabel(x_label) plt.xlabel(y_label) subplot_start+=1 plt.show() x,y = get_data() plot_data(x,y)
The get_data
function internally calls make_classification
to generate test data for any classification task.
It's always good practice to visualize the data before starting to feed it into any algorithm. Our plot_data
function produces a scatter plot between all the variables:
We have plotted all the variable combinations. The top two charts there in show combinations between the 0th and 1st column, followed by 0th and 2nd. The points are also colored by their class labels. This gives an idea of how much information is these variable combination to do a classification task.
We will separate our dataset preparation and model training into two different methods: get_train_test
to get the train and test data, and build_model
to build our model. Finally, we will use test_model
to validate the usefulness of our model:
from sklearn.cross_validation import StratifiedShuffleSplit from sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import classification_report def get_train_test(x,y): """ Perpare a stratified train and test split """ train_size = 0.8 test_size = 1-train_size input_dataset = np.column_stack([x,y]) stratified_split = StratifiedShuffleSplit(input_dataset[:,-1],test_size=test_size,n_iter=1) for train_indx,test_indx in stratified_split: train_x = input_dataset[train_indx,:-1] train_y = input_dataset[train_indx,-1] test_x = input_dataset[test_indx,:-1] test_y = input_dataset[test_indx,-1] return train_x,train_y,test_x,test_y def build_model(x,y,k=2): """ Fit a nearest neighbour model """ knn = KNeighborsClassifier(n_neighbors=k) knn.fit(x,y) return knn def test_model(x,y,knn_model): y_predicted = knn_model.predict(x) print classification_report(y,y_predicted) if __name__ == "__main__": # Load the data x,y = get_data() # Scatter plot the data plot_data(x,y) # Split the data into train and test train_x,train_y,test_x,test_y = get_train_test(x,y) # Build the model knn_model = build_model(train_x,train_y) # Test the model print " Model evaluation on training set" print "================================ " test_model(train_x,train_y,knn_model) print " Model evaluation on test set" print "================================ " test_model(test_x,test_y,knn_model)
Let's try to follow the code from our main method. We must start by calling get_data
and plotting it using plot_data
, as described in the previous section.
As mentioned previously, we need to separate a part of the training data for the testing that is required to evaluate our model. We then invoke the get_train_test
method to do the same.
In get_train_test
, we decide our train test split size, which is the standard 80/20. We then use 80 percent of our data to train our model. Now, we combine both x and y to a single matrix before the split using NumPy's column_stack
method.
We then leverage StratifiedShuffleSplit
discussed in the previous recipe in order to get a uniform class label distribution between our training and test sets.
Armed with our train and test sets, we are now ready to build our classifier. We must invoke the build model with our training set, attributes x
, and class labels y
. This function also takes K
, the number of neighbors, as a parameter, with a default value of two.
We use scikit-learn's KNN implementation, KNeighborsClassifier
. We then create an object of the classifier and call the fit
method to build our model.
We are ready to test how good the model is using our training data. We can pass our training data (x and y) and our model to the test_model
function.
We know our actual class labels (y). We then invoke the predict function with our x to get the predicted labels. We then print some of the model evaluation metrics. We can start with printing the accuracy of the model, follow it up with a confusion matrix, and then finally show the output of a function called classification_report
. scikit-learn's metrics module provides a function called classification_report
, which can print several model evaluation metrics.
Let's look at our model metrics:
As you can see, our accuracy score is 91.25 percent. We will not repeat the definition of accuracy; you can refer to the introduction section.
Let's now look at our confusion matrix. The top left cell is the true positive cell. We can see that we have no false negatives but we have seven false positives (the first cell in the 2nd row).
Finally, we have precision, recall, an F1 score, and support in our classification report. Let's look at their definitions:
Precision
is the ratio of the true positive and the sum of the true positive and false positive
Accuracy
is the ratio of the true positive and the sum of the true positive and false negative
An F1 score is the harmonic mean of precision and sensitivity
We will see more about this metric in a separate recipe in a future chapter. For now, let's say we shall have high precision and recall values.
It is good to know that we have around 91 percent accuracy for our model, but the real test will be when it is run on the test data. Let's see the metrics for our test data:
It is good to know that our model has 95 percent accuracy for the test data, which is an indication of a good job in fitting the model.
Let's look a little bit deeper into the model that we have built:
We invoked a function called get_params
. This function returns all the parameters that are passed to the model. Let's examine each of the parameters.
The first parameter refers to the underlying data structure used by the KNN implementation. As every record in the training set has to be compared against every other record, brute force implementation may be compute-resource heavy. Hence, we can choose either kd_tree
or ball_tree
as the data structure. A brute will use the brute force method of looping through all the records for every record.
Leaf size is the parameter that is passed to the kd_tree
or ball_tree
method.
Metric is the distance measure used to find the neighbors. The p-value of two reduces Minkowski to Euclidean distance.
Finally, we have the weights parameter. KNN decides the class label of the test instance based on the class label of its K nearest neighbors. A majority vote decides the class label for the test instance. However, if we set the weights to distance, then each neighbor is given a weight that is inversely proportional to its distance. Hence, in order to decide the class label of a test set, weighted voting is performed, rather than simple voting.