Chapter 5. More Classification Techniques – K-Nearest Neighbors and Support Vector Machines

 

"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write."

 
 --H.G. Wells

In Chapter 3, Logistic Regression and Discriminant Analysis, we discussed using logistic regression to determine the probability that an observation belongs to a class of a categorical response, which we refer to as a classification problem. Logistic regression was just the beginning of classification methods; there are a number of additional techniques that we can use to improve our predictions.

In this chapter, we will delve into two nonlinear techniques: K-Nearest Neighbors (KNN) and Support Vector Machines (SVM). These techniques are more sophisticated than the ones we've discussed earlier because the linearity assumption can be relaxed: a linear combination of the features is no longer needed to define the decision boundary. Be forewarned, though, that this does not always translate into superior predictive ability. Additionally, these models can be difficult to interpret for business partners, and they can be computationally inefficient. When used wisely, they provide a powerful complement to the other tools and techniques discussed in this book. They can also be used for continuous outcomes; however, for the purposes of this chapter, we will focus only on classification problems.

After a high-level background on the techniques, we will lay out the business case and then put both of them to the test in order to determine which of the two performs best, starting with KNN.

K-Nearest Neighbors

In our previous efforts, we built models that had coefficients or, put another way, parameter estimates for each of our included features. With KNN, we have no parameters, as the learning method is so-called instance-based learning. In short, "The labeled examples (inputs and corresponding output labels) are stored and no action is taken until a new input pattern demands an output value" (Battiti and Brunato, 2014, p. 11). This method is commonly called lazy learning, as no specific model parameters are produced. The training instances themselves represent the knowledge. To predict any new instance (a new data point), the training data is searched for the instances that most resemble the instance in question. KNN does this for a classification problem by looking at the closest points, the nearest neighbors, to determine the proper class. The k comes into play by determining how many neighbors the algorithm should examine, so if k=5, it will examine the five nearest points. A weakness of this method is that all five points are given equal weight, even if some are less relevant to the prediction. We will look at methods in R that try to alleviate this issue.
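To make the idea of lazy learning concrete, here is a minimal sketch using the knn() function from the class package on R's built-in iris data. The random split, the choice of data, and k = 5 are assumptions for illustration only; they are not the chapter's business case.

library(class)

set.seed(123)
idx <- sample(seq_len(nrow(iris)), 100)   # random split: 100 training rows, 50 test rows

# standardize the features, applying the training centers and scales to the test set
train_x <- scale(iris[idx, 1:4])
test_x  <- scale(iris[-idx, 1:4],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# knn() stores the labeled training instances and classifies each test point
# by a majority vote of its k nearest neighbors; no model parameters are fit
pred <- knn(train = train_x, test = test_x, cl = iris$Species[idx], k = 5)
table(pred, actual = iris$Species[-idx])

Note that nothing is "trained" in the usual sense; the call to knn() does all of its work at prediction time by searching the stored training instances.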

The best way to understand how this works is with a simple visual example of a binary classification problem. In the following figure, we have a plot of whether a tumor is benign or malignant based on two predictive features. The X in the plot indicates a new observation that we would like to predict. If our algorithm considers K=3, the circle encompasses the three observations that are nearest to the one that we want to score. As the most commonly occurring class among these neighbors is malignant, the X data point is classified as malignant, as shown in the following figure:

(Figure: K-Nearest Neighbors, classifying a new observation with K=3)

Even from this simple example, it is clear that the selection of k for the nearest neighbors is critical. If k is too small, you may have high variance in the test set predictions even though the bias is low. On the other hand, as k grows, the variance may decrease but the bias may become unacceptable. Cross-validation is necessary to determine the proper k.
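As a sketch of how that cross-validation might look, the caret package can tune k over a grid of candidate values. The 10-fold setup, the odd-valued grid from 1 to 25, and the use of the iris data are illustrative assumptions, not prescriptions.

library(caret)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)     # 10-fold cross-validation

knn_fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 preProcess = c("center", "scale"),  # standardize within each fold
                 tuneGrid = data.frame(k = seq(1, 25, by = 2)),
                 trControl = ctrl)

knn_fit$bestTune                                     # the k with the best cross-validated accuracy

Restricting the grid to odd values of k is a common convenience in binary problems because it avoids tied votes among the neighbors.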

It is also important to point out how the distance, or nearness, of the data points in our feature space is calculated. The default distance is Euclidean distance. This is simply the straight-line distance from point A to point B (as the crow flies); equivalently, it is the square root of the sum of the squared differences between the corresponding coordinates. The formula for Euclidean distance, given points A and B with coordinates p1, p2, ..., pn and q1, q2, ..., qn respectively, is as follows:

d(A, B) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}

This distance is highly dependent on the scale on which the features were measured, so it is critical to standardize them. Other distance measures can be used, as can weights that depend on the distance. We will explore this in the upcoming example.
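Before that, here is a tiny made-up illustration of why standardization matters; the three points and their feature values are invented purely to show how an unscaled feature can dominate the Euclidean distance.

# three hypothetical observations measured on very different scales
m <- rbind(a = c(age = 30, income = 50000),
           b = c(age = 60, income = 52000),
           c = c(age = 32, income = 90000))

dist(m)          # raw Euclidean distances are dominated by the income feature
dist(scale(m))   # after centering and scaling, age differences carry comparable weight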
