KNN modeling

As previously mentioned, it is critical to select the most appropriate parameter (k or K) when using this technique. Let's put the caret package to good use again in order to identify k. We will create a grid of inputs for the experiment, with k ranging from 2 to 20 by an increment of 1. This is easily done with the expand.grid() and seq() functions. The caret package parameter that works with the KNN function is simply .k:

    > grid1 <- expand.grid(.k = seq(2, 20, by = 1))

We will also incorporate cross-validation in the selection of the parameter, creating an object called control and utilizing the trainControl() function from the caret package, as follows:

    > control <- trainControl(method = "cv")

Now, we can create the object that will show us how to compute the optimal k value with the train() function, which is also part of the caret package. Remember that while conducting any sort of random sampling, you will need to set the seed value as follows:

    > set.seed(502)

The object created by the train() function requires the model formula, the train data name, and an appropriate method. The model formula is the same as we've used before: y ~ x. The method designation is simply knn. With this in mind, this code will create the object that will show us the optimal k value, as follows:

    > knn.train <- train(type ~ ., data = train,
                         method = "knn",
                         trControl = control,
                         tuneGrid = grid1)

Calling the object provides us with the k parameter that we are seeking, which is k=17:

    > knn.train
    k-Nearest Neighbors

    385 samples
      7 predictor
      2 classes: 'No', 'Yes'

    No pre-processing
    Resampling: Cross-Validated (10 fold)
    Summary of sample sizes: 347, 347, 345, 347, 347, 346, ...
    Resampling results across tuning parameters:

      k   Accuracy  Kappa  Accuracy SD  Kappa SD
       2  0.736     0.359  0.0506       0.1273
       3  0.762     0.416  0.0526       0.1313
       4  0.761     0.418  0.0521       0.1276
       5  0.759     0.411  0.0566       0.1295
       6  0.772     0.442  0.0559       0.1474
       7  0.767     0.417  0.0455       0.1227
       8  0.767     0.425  0.0436       0.1122
       9  0.772     0.435  0.0496       0.1316
      10  0.780     0.458  0.0485       0.1170
      11  0.777     0.446  0.0437       0.1120
      12  0.775     0.440  0.0547       0.1443
      13  0.782     0.456  0.0397       0.1084
      14  0.780     0.449  0.0557       0.1349
      15  0.772     0.427  0.0449       0.1061
      16  0.782     0.453  0.0403       0.0954
      17  0.795     0.485  0.0382       0.0978
      18  0.782     0.451  0.0461       0.1205
      19  0.785     0.455  0.0452       0.1197
      20  0.782     0.446  0.0451       0.1124

    Accuracy was used to select the optimal model using the largest value.
    The final value used for the model was k = 17.

In addition to the result that yields k=17, we get a table of the Accuracy and Kappa statistics and their standard deviations from the cross-validation. Accuracy tells us the percentage of observations that the model classified correctly. Kappa refers to what is known as Cohen's Kappa statistic. The Kappa statistic is commonly used to provide a measure of how well two evaluators can classify an observation correctly. It provides an insight into this problem by adjusting the accuracy scores, which is done by accounting for the evaluators being correct by mere chance. The formula for the statistic is Kappa = (per cent of agreement - per cent of chance agreement) / (1 - per cent of chance agreement).
The per cent of agreement is the rate at which the evaluators agree on the class (the accuracy), and the per cent of chance agreement is the rate at which the evaluators would be expected to agree at random. The higher the statistic, the better the performance, with the maximum agreement being one. We will work through an example when we apply our model to the test data.

To do this, we will utilize the knn() function from the class package. With this function, we will need to specify at least four items: the train inputs, the test inputs, the correct labels from the train set, and k. We will do this by creating the knn.test object and then seeing how it performs:

    > knn.test <- knn(train[, -8], test[, -8], train[, 8], k = 17)

With the object created, let's examine the confusion matrix and calculate the accuracy and kappa:

    > table(knn.test, test$type)
    knn.test No Yes
         No  77  26
         Yes 16  28

The accuracy is calculated by simply dividing the correctly classified observations by the total number of observations:

    > (77 + 28) / 147
    [1] 0.7142857

The accuracy of 71 per cent is less than what we achieved on the train data, which was almost 80 per cent. We can now produce the kappa statistic as follows:

    > #calculate Kappa
    > prob.agree <- (77 + 28) / 147 #accuracy
    > prob.chance <- ((77 + 26) / 147) * ((77 + 16) / 147) #proportion predicted 'No' times proportion observed 'No'
    > prob.chance
    [1] 0.4432875
    > kappa <- (prob.agree - prob.chance) / (1 - prob.chance)
    > kappa
    [1] 0.486783

The kappa statistic of 0.49 is roughly what we achieved on the train set. Altman (1991) provides a heuristic to assist us in the interpretation of the statistic, which is shown in the following table:

Value of K     Strength of Agreement
<0.20          Poor
0.21-0.40      Fair
0.41-0.60      Moderate
0.61-0.80      Good
0.81-1.00      Very good
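
If you would like to apply this heuristic programmatically, the following is a minimal sketch; the kappa_strength helper and the exact break points are my own choices, not part of caret or class:

    > kappa_strength <- function(k) {
    +   cut(k, breaks = c(-Inf, 0.20, 0.40, 0.60, 0.80, 1.00),
    +       labels = c("Poor", "Fair", "Moderate", "Good", "Very good"))
    + }
    > kappa_strength(0.49)  # falls in the 0.41-0.60 band: Moderate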

With our kappa only moderate and our accuracy just over 70 per cent on the test set, we should see whether we can perform better by utilizing weighted neighbors. A weighting scheme increases the influence of neighbors that are closest to an observation versus those that are farther away. The farther an observation is from a point in space, the more its influence is penalized. For this technique, we will use the kknn package and its train.kknn() function to select the optimal weighting scheme.

The train.kknn() function uses leave-one-out cross-validation (LOOCV), which we examined in the prior chapters, in order to select the best parameters: the optimal number of neighbors k, one of the two distance measures, and a kernel function.

The unweighted k-nearest neighbors algorithm that we created uses the Euclidean distance, as we discussed previously. With the kknn package, there are options available to compare the sum of the absolute differences versus the Euclidean distance. The package refers to the distance calculation used as the Minkowski parameter.
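
As a quick illustration of the two measures (a toy example, separate from the kknn workflow itself), base R's dist() function exposes the same Minkowski parameter through its p argument:

    > x <- rbind(c(0, 0), c(3, 4))          # two toy observations
    > dist(x, method = "minkowski", p = 1)  # sum of absolute differences: 3 + 4 = 7
    > dist(x, method = "minkowski", p = 2)  # Euclidean distance: sqrt(3^2 + 4^2) = 5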

As for the weighting of the distances, many different methods are available. For our purpose, the package that we will use has ten different weighting schemes, including the unweighted one. They are rectangular (unweighted), triangular, epanechnikov, biweight, triweight, cosine, inversion, gaussian, rank, and optimal. A full discussion of these weighting techniques is available in Hechenbichler K. and Schliep K.P. (2004).

For simplicity, let's focus on just two: triangular and Epanechnikov. Prior to assigning the weights, the algorithm standardizes all the distances so that they are between zero and one. The triangular weighting method gives each neighbor a weight of one minus its standardized distance. With Epanechnikov, the weight is ¾ times (one minus the squared distance). For our problem, we will incorporate these weighting methods along with the standard unweighted version for comparison purposes.
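
As a minimal sketch of these two kernels (written directly from the formulas in Hechenbichler and Schliep (2004) rather than calling anything inside kknn), the weights for a vector of standardized distances can be computed as follows:

    > d <- c(0.1, 0.5, 0.9)             # standardized distances between zero and one
    > triangular <- 1 - d               # closer neighbors receive weights nearer one
    > epanechnikov <- 0.75 * (1 - d^2)  # 3/4 * (1 - squared distance)
    > rbind(d, triangular, epanechnikov)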

After specifying a random seed, we will create the train set object with train.kknn(). This function asks for the maximum number of k values (kmax), distance (1 is the absolute distance and 2 is the Euclidean distance), and kernel. For this model, kmax will be set to 25 and distance will be 2:

    > set.seed(123)
    > kknn.train <- train.kknn(type ~ ., data = train, kmax = 25,
                               distance = 2,
                               kernel = c("rectangular", "triangular", "epanechnikov"))

A nice feature of the package is the ability to plot and compare the results, as follows:

    > plot(kknn.train)

The following is the output of the preceding command:

This plot shows k on the x-axis and the percentage of misclassified observations for each kernel. To my pleasant surprise, the unweighted (rectangular) version at k = 19 performs the best. You can also call the object to see what the classification error and the best parameters are, in the following way:

    > kknn.train

    Call:
    train.kknn(formula = type ~ ., data = train, kmax = 25, distance = 2,
        kernel = c("rectangular", "triangular", "epanechnikov"))

    Type of response variable: nominal
    Minimal misclassification: 0.212987
    Best kernel: rectangular
    Best k: 19

So, with this data, weighting the distance does not improve the model accuracy in training and, as we can see here, does not even do as well on the test set:

    > kknn.pred <- predict(kknn.train, newdata = test)
    > table(kknn.pred, test$type)
    kknn.pred No Yes
          No  76  27
          Yes 17  27

There are other weights that we could try, but when I tried them, the results were no more accurate than these. We don't need to pursue KNN any further. I would encourage you to experiment with the various parameters on your own to see how they perform.
