Finding the closest objects in the feature space

Sometimes the easiest approach is simply to measure the distance between two objects. We choose a distance metric, compute the pairwise distances, and compare the results with what we expect.

Getting ready

A lower-level utility in scikit-learn is sklearn.metrics.pairwise. It contains several functions that make it easy to compute the distances between the vectors in a matrix X, or between the vectors in X and those in Y.

This can be useful for information retrieval. For example, given a set of customers with attributes of X, we might want to take a reference customer and find the closest customers to this customer. In fact, we might want to rank customers by the notion of similarity measured by a distance function. The quality of the similarity depends upon the feature space selection as well as any transformation we might do on the space.
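The customer scenario can be sketched as follows. This is a minimal illustration rather than a recommended pipeline: the customers array and its three attributes are made up, and standardizing the features with StandardScaler is just one possible transformation of the feature space.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

# Hypothetical customer attributes: [age, yearly spend, visits per month]
customers = np.array([
    [25, 1200.0, 2],
    [31, 1500.0, 3],
    [58, 300.0, 1],
    [27, 1250.0, 2],
    [45, 5000.0, 8],
])

# Put every feature on a comparable scale so that yearly spend
# does not dominate the distance.
scaled = StandardScaler().fit_transform(customers)

# Distances from the reference customer (row 0) to every customer.
reference = scaled[[0]]                        # shape (1, n_features)
dists = pairwise_distances(reference, scaled)[0]

# Rank customers from most to least similar; the reference customer
# itself comes first, so the closest other customers start at position 1.
ranking = np.argsort(dists)
closest = ranking[1:3]
```

Without the scaling step, the yearly spend column would decide the ranking almost on its own, which is why the choice of feature space and transformation matters.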

We'll walk through several different scenarios of measuring distance.

How to do it...

We will use the pairwise_distances function to determine the "closeness" of objects. Remember that closeness is just similarity as graded by our distance function.

First, let's import NumPy and the pairwise module from sklearn.metrics, and create a dataset to play with:

>>> import numpy as np
>>> from sklearn.metrics import pairwise
>>> from sklearn.datasets import make_blobs
>>> points, labels = make_blobs()

The simplest way to check the distances is pairwise_distances:

>>> distances = pairwise.pairwise_distances(points)

distances is an N x N matrix with 0s along the diagonal, because every point is at distance 0 from itself. As a quick check, let's look at the first five diagonal entries:

>>> np.diag(distances)[:5]
array([ 0.,  0.,  0.,  0.,  0.])

Now let's look at the distances between the first point and the first five points in points:

>>> distances[0][:5]
array([  0.        ,  11.82643041,   1.23751545,   1.17612135,  14.61927874])

Ranking the points by closeness is very easy with np.argsort:

>>> ranks = np.argsort(distances[0])
>>> ranks[:5]
array([ 0, 27, 98, 23, 67])

The great thing about argsort is that we can use its result to index the points matrix and retrieve the actual points in order of closeness:

>>> points[ranks][:5]
array([[ 8.96147382, -1.90405304],
       [ 8.75417014, -1.76289919],
       [ 8.78902665, -2.27859923],
       [ 8.59694131, -2.10057667],
       [ 8.70949958, -2.30040991]])
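The same ranking can be computed for every point at once. The following is a minimal sketch under stated assumptions: random_state=0 and k = 3 are arbitrary choices made only for reproducibility and illustration, and the data is assumed to contain no duplicate points.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

# random_state is fixed only so that the sketch is reproducible.
points, labels = make_blobs(n_samples=100, random_state=0)
distances = pairwise_distances(points)

# Sort every row at once. Assuming there are no duplicate points,
# column 0 of each row is the point itself (distance 0), so the
# nearest neighbours start at column 1.
order = np.argsort(distances, axis=1)
k = 3
neighbor_idx = order[:, 1:k + 1]               # shape (100, 3)

# Look up the distances that belong to those neighbour indices.
neighbor_dist = np.take_along_axis(distances, neighbor_idx, axis=1)
```

Here points[neighbor_idx[0]] gives the coordinates of the three points closest to the first point.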

It's useful to see what the closest points look like; if nothing else, it offers some assurance that the ranking works as intended.


How it works...

Given some distance function, the distance is computed for each pair of points. The default is the Euclidean distance, which is defined as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Verbally, this takes the difference between each component of the two vectors, squares each difference, sums the squares, and then takes the square root. This should look familiar, as we used something very similar when looking at the mean squared error: divide the sum of squares by the number of components n before taking the square root, and you have the root-mean-square error (RMSE), a commonly used metric that is just this distance function scaled by 1/√n.

In Python, this looks like the following:

>>> def euclid_distances(x, y):
...     return np.power(np.power(x - y, 2).sum(), .5)
...
>>> euclid_distances(points[0], points[1])
11.826430406213145
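As a sanity check, the hand-written formula can be compared with pairwise_distances over a whole matrix. The sketch below uses a small random matrix (illustrative data, not the blobs above) and NumPy broadcasting. It also shows how the distance relates to the RMSE between two vectors.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def euclid_distance(x, y):
    """Euclidean distance between two 1-D vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))     # illustrative data only

# Full pairwise matrix via broadcasting:
# (5, 1, 3) - (1, 5, 3) -> (5, 5, 3), then reduce over the last axis.
diff = X[:, None, :] - X[None, :, :]
manual = np.sqrt((diff ** 2).sum(axis=-1))

# Should agree with scikit-learn up to floating-point error.
matches = np.allclose(manual, pairwise_distances(X), atol=1e-6)

# RMSE between two vectors is the Euclidean distance divided by sqrt(n).
a, b = X[0], X[1]
rmse = np.sqrt(np.mean((a - b) ** 2))
scaled_distance = euclid_distance(a, b) / np.sqrt(a.size)
```

The broadcasting version allocates an N x N x d intermediate array, so it is only practical for small inputs. For anything larger, pairwise_distances is the better choice.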

scikit-learn provides several other distance functions of its own and can also use the distance functions from SciPy. At the time of writing, the following scikit-learn distance functions support sparse matrices; check out the SciPy documentation for more information on the remaining distance functions:

  • cityblock
  • cosine
  • euclidean
  • l1
  • l2
  • manhattan
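Sparse input for the metrics listed above can be sketched as follows. The matrix is a made-up example (think of word counts per document), and csr_matrix is just one of the SciPy sparse formats that could be used.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances

# A small, mostly-zero matrix with made-up values.
dense = np.array([
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0, 0.0],
    [2.0, 0.0, 0.0, 4.0],
])
X_sparse = sparse.csr_matrix(dense)

# Both metrics accept the sparse matrix directly.
cosine = pairwise_distances(X_sparse, metric='cosine')
manhattan = pairwise_distances(X_sparse, metric='manhattan')
```

Rows 0 and 2 point in the same direction, so their cosine distance is 0 even though their Manhattan distance is not. The choice of metric changes which objects count as close.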

We can now solve problems. For example, if we were standing at the origin of a grid whose lines are streets, how far would we have to travel to get to the point (5, 5)?

>>> pairwise.pairwise_distances([[0, 0], [5, 5]], metric='cityblock')[0]
array([  0.,  10.])
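For comparison, here is a small sketch that computes the same trip with both the city-block and the Euclidean metric, and checks that 'manhattan' and 'l1' name the same metric as 'cityblock'. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

route = np.array([[0, 0], [5, 5]])

# Walking along the streets: |5 - 0| + |5 - 0|
by_street = pairwise_distances(route, metric='cityblock')[0, 1]

# Straight line ("as the crow flies"): sqrt(5**2 + 5**2)
straight = pairwise_distances(route, metric='euclidean')[0, 1]

# 'manhattan' and 'l1' are other names for the same metric.
same = all(
    np.isclose(pairwise_distances(route, metric=name)[0, 1], by_street)
    for name in ('manhattan', 'l1')
)
```

The street route is 10 units long, while the straight line is 5√2 (about 7.07), which is why the metric has to match the way movement is actually constrained.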

There's more...

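Using pairwise distances, we can find the similarity between bit vectors. It's a matter of finding the Hamming distance, which pairwise_distances computes as the proportion of positions at which the two vectors differ:

$$d(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i]$$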

Use the following command:

>>> X = np.random.binomial(1, .5, size=(2, 4)).astype(bool)
>>> X
array([[False,  True, False, False],
       [False, False, False,  True]], dtype=bool)

>>> pairwise.pairwise_distances(X, metric='hamming')
array([[ 0. ,  0.5],
       [ 0.5,  0. ]])
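The Hamming distance is easy to verify by hand. The sketch below hard-codes two bit vectors instead of drawing them at random, so that the result is reproducible, and compares a manual calculation with pairwise_distances.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[False, True, False, False],
              [False, False, False, True]])

# Fraction of positions where the two bit vectors disagree.
manual = np.mean(X[0] != X[1])

# The 'hamming' metric uses the same normalised definition.
from_sklearn = pairwise_distances(X, metric='hamming')[0, 1]

# Multiply by the vector length to get the raw count of differing bits.
n_differing = int(round(from_sklearn * X.shape[1]))
```

Because the result is a proportion rather than a count, distances between bit vectors of different lengths stay on the same 0-to-1 scale.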