k-nearest neighbor (k-NN) classification is one of the easiest classification methods to understand, particularly when there is little or no prior knowledge about the distribution of the data. The algorithm stores all the known (labeled) cases and classifies new cases based on a similarity measure, such as the Euclidean distance function. k-NN is popular in statistical estimation and pattern recognition because of its simplicity.
For 1-nearest neighbor (1-NN), the label of a test point is simply the label of its single nearest training point. For a higher value of k, the label of a test point is decided by a majority vote among its k nearest training points. The k-NN algorithm is considered a lazy learning algorithm because the optimization is done locally and the computations are deferred until classification.
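Before walking through the fruit example later in this section, here is a minimal sketch of this voting idea, assuming scikit-learn is available (the training data here is purely illustrative; the example we build below is written from scratch instead):

from sklearn.neighbors import KNeighborsClassifier

# Illustrative training data: two features per sample and a class label.
X = [[1.0, 6.5], [1.2, 7.0], [3.0, 6.1], [3.1, 6.4], [4.0, 7.5], [3.9, 7.2]]
y = ['Apple', 'Apple', 'Pear', 'Pear', 'Banana', 'Banana']

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3; Euclidean distance by default
knn.fit(X, y)

# The test point's label is the majority vote of its 3 nearest neighbors.
print(knn.predict([[3.0, 6.2]]))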
This method has both advantages and disadvantages. Its advantages are high accuracy, insensitivity to outliers, and the absence of assumptions about the data. Its disadvantages are that it is computationally expensive and requires a lot of memory.
One of several standard distance metrics could be used, such as the Euclidean, Manhattan, or Minkowski distance.
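For two points $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$, these standard metrics are defined as follows (the Minkowski distance generalizes the other two, reducing to them at $r = 2$ and $r = 1$):

$$d_{\text{Euclidean}}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

$$d_{\text{Manhattan}}(p, q) = \sum_{i=1}^{n} |p_i - q_i|$$

$$d_{\text{Minkowski}}(p, q) = \left( \sum_{i=1}^{n} |p_i - q_i|^r \right)^{1/r}$$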
Let's consider an example where we are given a big basket of fruits containing only apples, bananas, and pears. We will assume that all the apples are red, not green. Color alone distinguishes these fruits from one another: apples are red, bananas are yellow, and pears are green. The fruits can also be characterized by their weight. The following assumptions are made for the purpose of illustrating this example:
The shape characteristic is categorized as follows:
We have the data about the fruits in a basket as follows:
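The full table is not reproduced here; based on how the plotting code below reads the file, each row of fruits_data.csv holds a shape category number, a weight in ounces, and a fruit name. A minimal file might look like this (the values are illustrative only, not the actual data):

Shape,Weight,Fruit
1,6.3,Apple
1,7.0,Apple
3,6.1,Pear
3,6.4,Pear
4,7.4,Banana
4,7.1,Banana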
If we have an unlabeled fruit with a known weight and shape category, then the k-nearest neighbor method (with any distance formula) will find its k nearest neighbors; if those neighbors are mostly green, red, or yellow, the unlabeled fruit is most likely a pear, apple, or banana, respectively. The following code demonstrates the k-nearest neighbor algorithm using the shape and weight of fruits:
import csv
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt

count = 0
x = []
y = []
z = []

# Read the fruit data, skipping the header row; each row holds the
# shape category number, the weight in ounces, and the fruit name,
# which is mapped to a plotting color.
with open('/Users/myhome/fruits_data.csv', 'r') as csvf:
    reader = csv.reader(csvf, delimiter=',')
    for row in reader:
        if count > 0:
            x.append(float(row[0]))
            y.append(float(row[1]))
            if row[2] == 'Apple':
                z.append('r')
            elif row[2] == 'Pear':
                z.append('g')
            else:
                z.append('y')
        count += 1

plt.figure(figsize=(11, 11))

# Build colored rectangles for the legend, one per class.
recs = []
classes = ['Apples', 'Pears', 'Bananas']
class_colours = ['r', 'g', 'y']
for i in range(len(class_colours)):
    recs.append(mpatches.Rectangle((0, 0), 1, 1, fc=class_colours[i]))

plt.title("Apples, Bananas and Pears by Weight and Shape", fontsize=18)
plt.xlabel("Shape category number", fontsize=14)
plt.ylabel("Weight in ounces", fontsize=14)
plt.scatter(x, y, s=600, c=z)
plt.legend(recs, classes, loc=4)
plt.show()
Let's pick four unlabeled fruits with (shape, weight) values of A(3.5, 6.2), B(2.75, 6.2), C(2.9, 7.6), and D(2.4, 7.2), and classify them with the following code:
from math import pow, sqrt

def determineFruit(xv, yv, threshold_radius):
    # Distance from the unlabeled point (xv, yv) to every training
    # point; points outside the threshold radius get a sentinel of 99.
    dist = []
    for i in range(len(x)):
        xdif = pow(x[i] - xv, 2)
        ydif = pow(y[i] - yv, 2)
        sqrtdist = sqrt(xdif + ydif)
        if sqrtdist < threshold_radius:
            dist.append(sqrtdist)
        else:
            dist.append(99)

    # Count the class labels of the neighbors within the radius.
    pear_count = 0
    apple_count = 0
    banana_count = 0
    for i in range(len(dist)):
        if dist[i] < threshold_radius:
            if z[i] == 'g':
                pear_count += 1
            if z[i] == 'r':
                apple_count += 1
            if z[i] == 'y':
                banana_count += 1

    # Majority vote among the neighbors within the radius.
    if apple_count >= pear_count and apple_count >= banana_count:
        return "apple"
    elif pear_count >= apple_count and pear_count >= banana_count:
        return "pear"
    else:
        return "banana"

determine = determineFruit(3.5, 6.2, 1)
print(determine)

This prints the following result for point A:

pear
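The same function can classify the remaining three points; a short usage sketch follows (the results depend on the actual contents of fruits_data.csv, so no outputs are shown here):

# Classify the remaining unlabeled fruits B, C, and D with the same radius.
for xv, yv in [(2.75, 6.2), (2.9, 7.6), (2.4, 7.2)]:
    print(determineFruit(xv, yv, 1))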