K-Nearest Neighbors

K-Nearest Neighbors, or simply k-NN, belongs to the class of instance-based learners, also known as lazy classifiers. It is one of the simplest classification methods because the classification is done by just looking at the K closest examples in the training set (in terms of Euclidean distance, or some other kind of distance) to the case that we want to classify. Then, given these K similar examples, the most popular target among them (majority voting) is chosen as the classification label. Two parameters are mandatory for this algorithm: the neighborhood cardinality (K) and the measure used to evaluate the similarity (although the Euclidean distance, or L2, is the most used and is the default for most implementations).
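
To make the voting mechanism concrete, here is a minimal from-scratch sketch of the idea using NumPy. The knn_predict function and the toy points are our own illustrative assumptions, not the scikit-learn implementation we will actually use later:

In: import numpy as np
from collections import Counter

def knn_predict(X, y, x_new, k=3):
    # Euclidean (L2) distance from x_new to every training sample
    distances = np.sqrt(((X - x_new) ** 2).sum(axis=1))
    # Indices of the K closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority voting among the K neighbors decides the label
    return Counter(y[nearest]).most_common(1)[0][0]

# Toy data: two well-separated clusters of 2D points
X_toy = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))

Out: 0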

Let's take a look at an example. We are going to use a large dataset, the MNIST handwritten digits. We will later explain why we decided to use this dataset for our example. We intend to use only a small portion of it (1,000 samples) to keep the computational time reasonable, and we shuffle the observations to obtain better results (though, as a consequence, your final output may be slightly different from ours):

In: from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
import pickle

# Load a locally cached copy of the MNIST dataset
mnist = pickle.load(open("mnist.pickle", "rb"))
mnist.data, mnist.target = shuffle(mnist.data, mnist.target)

# We reduce the dataset size, otherwise it'll take too much time to run
mnist.data = mnist.data[:1000]
mnist.target = mnist.target[:1000]
X_train, X_test, y_train, y_test = train_test_split(mnist.data,
                                                    mnist.target,
                                                    test_size=0.8,
                                                    random_state=0)
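
If you don't have the mnist.pickle file at hand, one possible alternative (an assumption on our part, requiring an internet connection and a recent scikit-learn version, not the path followed in this example) is to download the dataset with scikit-learn's fetch_openml and cache it yourself:

In: from sklearn.datasets import fetch_openml
import pickle

# Download MNIST from OpenML (this may take a while on the first run)
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# Cache it locally so that the rest of the example can load it quickly
pickle.dump(mnist, open("mnist.pickle", "wb"))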

In: from sklearn.neighbors import KNeighborsClassifier
# k-NN: K=3, default measure of distance (Euclidean)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In: from sklearn.metrics import classification_report
print (classification_report(y_test, y_pred))

Out:
             precision    recall  f1-score   support

        0.0       0.79      0.91      0.85        82
        1.0       0.62      0.98      0.76        86
        2.0       0.88      0.68      0.76        77
        3.0       0.71      0.83      0.77        69
        4.0       0.68      0.88      0.77        91
        5.0       0.69      0.66      0.67        56
        6.0       0.93      0.86      0.89        90
        7.0       0.91      0.85      0.88       102
        8.0       0.91      0.41      0.57        73
        9.0       0.79      0.50      0.61        74

avg / total       0.80      0.77      0.76       800

The performance is not that high on this dataset. However, keep in mind that the classifier has to discriminate among ten different classes. To see where the misclassifications concentrate, we can also print a confusion matrix, as sketched below.
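
As a quick, optional diagnostic (a minimal sketch on our part, not a step of the original pipeline), scikit-learn's confusion_matrix reuses the y_test and y_pred arrays computed above and shows which digits are mistaken for which:

In: from sklearn.metrics import confusion_matrix
# Rows correspond to the true digits, columns to the predicted ones;
# large off-diagonal counts reveal which classes get confused
print(confusion_matrix(y_test, y_pred))

Looking at the off-diagonal counts tells you, for instance, which digits the 8s and 9s (the classes with the lowest recall in the report above) are most often mistaken for. Now, let's check the time the classifier needs for training and prediction: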

In: %timeit clf.fit(X_train, y_train)
Out: 1.18 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In: %timeit clf.predict(X_test)
Out: 179 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The training speed is exceptional, and this makes sense if you consider the algorithm: the training phase consists of nothing more than copying the data into a data structure that the algorithm will use later (that's the reason it is called a lazy learner). On the contrary, the prediction speed depends on the number of samples in your training set and on the number of features composing them (that is, on the number of elements in the feature matrix). In all of the other algorithms we've seen, the prediction speed is independent of the number of training cases in the dataset. In conclusion, we can say that k-NN is great for small datasets, but it's definitely not the algorithm to use when dealing with big data.
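
To make this scaling behavior tangible, here is a small timing sketch (the sample sizes and the use of time.time are our own illustrative choices): fitting stays almost instantaneous, while prediction time should grow roughly with the number of stored training samples. Note that we reuse slices of the reduced dataset only to measure time, not accuracy:

In: import time

for n in (200, 400, 800):
    clf_n = KNeighborsClassifier(n_neighbors=3)
    # "Training" is just storing the n samples
    clf_n.fit(mnist.data[:n], mnist.target[:n])
    tic = time.time()
    # Every prediction scans all n stored samples
    clf_n.predict(X_test)
    print(n, "training samples ->", round(time.time() - tic, 3), "seconds to predict")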

One last remark about this classification algorithm: you can also try the analogous regressor, KNeighborsRegressor, which works in the same way. Its algorithm is pretty much identical, except that the predicted value is the average of the K target values of the neighborhood.
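
Here is a minimal sketch of the regressor at work (the toy one-dimensional data below is just an illustrative assumption):

In: import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1D regression problem: the target is simply 2*x
X_reg = np.arange(10, dtype=float).reshape(-1, 1)
y_reg = 2.0 * X_reg.ravel()

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_reg, y_reg)
# The prediction averages the targets of the 3 closest points
# (x=3, 4 and 5): (6 + 8 + 10) / 3 = 8.0
print(reg.predict([[4.4]]))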
