Finding the nearest neighbors

Before we jump into our recipe, let's spend some time understanding how to check whether our classification model is performing to our satisfaction. In the introduction section, we introduced the confusion matrix.

A confusion matrix is a matrix arrangement of the actual versus the predicted class labels. Let's say we have a two-class problem, that is, our y can take one of two values, T or F. Let's say we trained a classifier to predict y; in our test data, we know the actual value of y, and we have the predicted value of y from our model. With these values, we can fill our confusion matrix, as follows:

              Predicted: T    Predicted: F
Actual: T     TP              FN
Actual: F     FP              TN

Here is a table where we conveniently list our results from the test set. Remember that we know the class labels in our test set; hence, we can compare our classification model output with the actual class label.

  • Under TP, which is an abbreviation for True Positive, we have a count of all those records in the test set whose actual label is T, and where the model also predicted T
  • Under FN, which is an abbreviation for False Negative, we have a count of all the records whose actual label is T, but the algorithm predicted F
  • FP stands for False Positive, where the actual label is F, but the algorithm predicted it as T
  • TN stands for True Negative, where both the actual and the predicted labels are F

With the knowledge of this confusion matrix, we can now derive performance metrics with which we can measure the quality of our classification model. In future chapters, we will explore more metrics, but for now, we will introduce the accuracy and error rate.

Accuracy is defined as the ratio of correct predictions to the total number of predictions. From the confusion matrix, we know that the sum of TP and TN gives the total number of correct predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Error rate = 1 - Accuracy

Accuracy measured on the training set is always optimistic. One should look at the test set's accuracy value to determine the true performance of the model.
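As a quick illustration (not part of the recipe's code, and using made-up labels), these quantities can be computed with scikit-learn's confusion_matrix and accuracy_score functions:

from sklearn.metrics import confusion_matrix, accuracy_score

# Made-up labels for illustration; 1 stands for T and 0 for F
y_actual    = [1, 1, 1, 0, 0, 0, 1, 0]
y_predicted = [1, 1, 0, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
print confusion_matrix(y_actual, y_predicted)

accuracy = accuracy_score(y_actual, y_predicted)
print "Accuracy   = %0.2f" % accuracy        # 0.75
print "Error rate = %0.2f" % (1 - accuracy)  # 0.25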

Armed with this knowledge, let's jump into our recipe. The first classification algorithm that we will look at is K-Nearest Neighbor, in short, KNN. Before going into the details of KNN, let's look at a very simple classification algorithm, called the rote classifier algorithm. The rote classifier memorizes the entire training data, that is, it loads all the data into memory. When we need it to classify a new, unseen test instance, it tries to match the test instance with any of the training instances in memory. It matches every attribute of the test instance with every attribute of a training instance. If it finds a match, it predicts the class label of the test instance as the class label of the matched training instance.

You should know by now that this classifier will fail if the test instance does not exactly match any of the training instances loaded into memory.
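For illustration, here is a minimal sketch of such a rote classifier; the class and method names are our own, as the recipe does not include this code:

class RoteClassifier:
    """A toy rote classifier: memorize the training data, match exactly"""
    def fit(self, x, y):
        # Memorize the entire training data
        self.memory = [(tuple(row), label) for row, label in zip(x, y)]

    def predict(self, instance):
        # Predict the label of an exact attribute-by-attribute match;
        # return None for instances never seen during training
        for row, label in self.memory:
            if tuple(instance) == row:
                return label
        return None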

KNN is similar to the rote classifier, except that instead of looking for an exact match, it uses a similarity measure. Like the rote classifier, KNN loads all the training instances into memory. When it needs to classify a test instance, it measures the distance between the test instance and all the training instances. Using this distance, it chooses the K closest instances in the training set. The prediction for the test instance is then based on the majority class among its K nearest neighbors.

For example, if we have a two-class classification problem and choose a K value of three, and the given test record's three nearest neighbors have the class labels 1, 1, and 0, it will classify the test instance as 1, which is the majority class.
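To make the distance computation and the majority vote concrete, here is a small sketch (our own helper function, not part of the recipe) using Euclidean distance:

import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, test_instance, k=3):
    # Euclidean distance from the test instance to every training instance
    distances = np.sqrt(((train_x - test_instance) ** 2).sum(axis=1))
    # Indices of the k closest training instances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the class labels of those k neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_x = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
train_y = np.array([1, 1, 0, 0])
print knn_predict(train_x, train_y, np.array([1.1, 1.0]))  # prints 1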

KNN belongs to a family of algorithms called instance-based learning. Additionally, as the decision to classify a test instance is deferred until the instance arrives, it's also called a lazy learner.

Getting ready

For this recipe, we will generate some data using scikit-learn's make_classification method. We will generate a matrix of four columns / attributes / features and 100 instances:

from sklearn.datasets import make_classification

import numpy as np
import matplotlib.pyplot as plt
import itertools

from sklearn.neighbors import KNeighborsClassifier

def get_data():
    """
    Make a sample classification dataset
    Returns : Independent variables x, dependent variable y
    """
    x,y = make_classification(n_features=4)
    return x,y
    





def plot_data(x,y):
    """
    Plot a scatter plot of all variable combinations
    """
    subplot_start = 321  # a 3 x 2 grid of subplots
    col_numbers = range(0,4)
    col_pairs = itertools.combinations(col_numbers,2)
    
    for col_pair in col_pairs:
        plt.subplot(subplot_start)
        plt.scatter(x[:,col_pair[0]],x[:,col_pair[1]],c=y)
        title_string = str(col_pair[0]) + "-" + str(col_pair[1])
        plt.title(title_string)
        x_label = str(col_pair[0])
        y_label = str(col_pair[1])
        plt.xlabel(x_label)
        plt.ylabel(y_label)
        subplot_start+=1
    
    plt.show()


x,y = get_data()    
plot_data(x,y)

The get_data function internally calls make_classification to generate test data for any classification task.

It's always good practice to visualize the data before starting to feed it into any algorithm. Our plot_data function produces a scatter plot between all the variables:

[Figure: scatter plots of all pairwise variable combinations, with points colored by class label]

We have plotted all the variable combinations. The top two charts show the combinations between the 0th and 1st columns, followed by the 0th and 2nd. The points are also colored by their class labels. This gives an idea of how much information these variable combinations carry for the classification task.

How to do it…

We will separate our dataset preparation and model building into different methods: get_train_test to get the train and test data, and build_model to build our model. Finally, we will use test_model to validate the usefulness of our model:

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

   

def get_train_test(x,y):
    """
    Prepare a stratified train and test split
    """
    train_size = 0.8
    test_size = 1-train_size
    input_dataset = np.column_stack([x,y])
    stratified_split = StratifiedShuffleSplit(input_dataset[:,-1],test_size=test_size,n_iter=1)

    for train_indx,test_indx in stratified_split:
        train_x = input_dataset[train_indx,:-1]
        train_y = input_dataset[train_indx,-1]
        test_x =  input_dataset[test_indx,:-1]
        test_y = input_dataset[test_indx,-1]
    return train_x,train_y,test_x,test_y
    

def build_model(x,y,k=2):
    """
    Fit a nearest neighbour model
    """
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x,y)
    return knn

def test_model(x,y,knn_model):
    y_predicted = knn_model.predict(x)
    # Print the accuracy, the confusion matrix, and a detailed
    # per-class report of precision, recall, f1-score, and support
    print "Accuracy = %0.4f" % accuracy_score(y,y_predicted)
    print confusion_matrix(y,y_predicted)
    print classification_report(y,y_predicted)
    
    
if __name__ == "__main__":

    # Load the data    
    x,y = get_data()
    
    # Scatter plot the data
    plot_data(x,y)
    
    # Split the data into train and test    
    train_x,train_y,test_x,test_y = get_train_test(x,y)
    
    # Build the model
    knn_model = build_model(train_x,train_y)
    
    # Test the model
    print "
Model evaluation on training set"
    print "================================
"
    test_model(train_x,train_y,knn_model)
    
    print "
Model evaluation on test set"
    print "================================
"
    test_model(test_x,test_y,knn_model)

How it works…

Let's try to follow the code from our main method. We start by calling get_data and plotting the data using plot_data, as described in the previous section.

As mentioned previously, we need to set aside a part of the data for testing in order to evaluate our model. We invoke the get_train_test method to do this.

In get_train_test, we decide our train-test split size, which is the standard 80/20; we use 80 percent of our data to train the model. Before the split, we combine both x and y into a single matrix using NumPy's column_stack method.

We then leverage StratifiedShuffleSplit discussed in the previous recipe in order to get a uniform class label distribution between our training and test sets.
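As a quick sanity check (our addition, not part of the recipe, assuming train_y and test_y from get_train_test are in scope), we can confirm that the class proportions are indeed preserved:

# Our addition: a stratified split should preserve class proportions,
# so these two distributions should be roughly in the same ratio
from collections import Counter

print "Train label counts %s" % Counter(train_y)
print "Test label counts  %s" % Counter(test_y)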

Armed with our train and test sets, we are now ready to build our classifier. We invoke build_model with our training set: attributes x and class labels y. This function also takes K, the number of neighbors, as a parameter, with a default value of two.

We use scikit-learn's KNN implementation, KNeighborsClassifier. We then create an object of the classifier and call the fit method to build our model.

We are ready to test how good the model is using our training data. We can pass our training data (x and y) and our model to the test_model function.

We know our actual class labels (y). We then invoke the predict function with our x to get the predicted labels, and print some model evaluation metrics. We start by printing the accuracy of the model, follow it up with a confusion matrix, and finally show the output of classification_report, a function in scikit-learn's metrics module that prints several model evaluation metrics.

Let's look at our model metrics:

[Output: accuracy score, confusion matrix, and classification report for the training set]

As you can see, our accuracy score is 91.25 percent. We will not repeat the definition of accuracy; you can refer to the introduction section.

Let's now look at our confusion matrix. The top-left cell is the true positive cell. We can see that we have no false negatives, but we have seven false positives (the first cell in the second row).

Finally, we have precision, recall, the F1 score, and support in our classification report. Let's look at their definitions:

  • Precision is the ratio of true positives to the sum of true positives and false positives
  • Recall is the ratio of true positives to the sum of true positives and false negatives
  • The F1 score is the harmonic mean of precision and recall (recall is also known as sensitivity)
  • Support is the number of test instances of each class
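As an illustrative sketch, these definitions translate directly into arithmetic on the confusion matrix cells; the counts below are made up for demonstration:

# Illustration only: made-up confusion matrix counts
tp, fn, fp, tn = 40, 0, 7, 33

precision = float(tp) / (tp + fp)
recall    = float(tp) / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print "Precision = %0.2f" % precision  # 0.85
print "Recall    = %0.2f" % recall     # 1.00
print "F1 score  = %0.2f" % f1         # 0.92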

We will see more about these metrics in a separate recipe in a future chapter. For now, let's say that we want high precision and recall values.

It is good to know that we have around 91 percent accuracy for our model, but the real test will be when it is run on the test data. Let's see the metrics for our test data:

[Output: accuracy score, confusion matrix, and classification report for the test set]

It is good to know that our model has 95 percent accuracy on the test data, which is an indication of a good job in fitting the model.

There's more…

Let's look a little bit deeper into the model that we have built:
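The exact invocation is not shown in this extract; a minimal equivalent (assuming the knn_model object from our main method is in scope) would be:

# Inspect the parameters of the fitted model
print knn_model.get_params()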

[Output: the dictionary of model parameters returned by get_params]

We invoked a function called get_params. This function returns all the parameters that are passed to the model. Let's examine each of the parameters.

The first parameter, algorithm, refers to the underlying data structure used by the KNN implementation. As every record in the training set has to be compared against every other record, a brute-force implementation may be compute-resource heavy. Hence, we can choose either kd_tree or ball_tree as the data structure; brute uses the brute-force method of looping through all the training records for every test record.

leaf_size is the parameter that is passed to the kd_tree or ball_tree algorithm.

metric is the distance measure used to find the neighbors; the default is the Minkowski distance. A p value of two reduces Minkowski distance to Euclidean distance.

Finally, we have the weights parameter. KNN decides the class label of the test instance based on the class labels of its K nearest neighbors, and a majority vote decides the class label for the test instance. However, if we set weights to distance, then each neighbor is given a weight that is inversely proportional to its distance, so weighted voting, rather than simple voting, decides the class label of the test instance.
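To tie these parameters together, here is an illustrative construction of the classifier with each of the discussed parameters set explicitly; the values shown are examples rather than recommendations:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3,
                           algorithm='ball_tree',    # or 'kd_tree', 'brute', 'auto'
                           leaf_size=30,             # handed to ball_tree / kd_tree
                           metric='minkowski', p=2,  # p=2 gives Euclidean distance
                           weights='distance')       # distance-weighted voting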

See also

  • Preparing data for model building recipe in Chapter 6, Machine Learning I
  • Working with distance measures recipe in Chapter 5, Data Mining - Finding a needle in a haystack