Multidimensional scaling

Whereas PCA optimizes for retained variance, multidimensional scaling (MDS) tries to retain the relative distances as much as possible when reducing the dimensions. This is useful when we have a high-dimensional dataset and want to get a visual impression.

MDS does not care about the data points themselves; instead, it is interested in the dissimilarities between pairs of data points, which it interprets as distances. It takes all N data points of dimension k and calculates a distance matrix using a distance function, d0, which measures the (most of the time, Euclidean) distance in the original feature space, d0(xi, xj) = ||xi - xj||.
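To make this first step concrete, here is a minimal sketch of how such a distance matrix could be computed with SciPy's pdist and squareform helpers; the random X here is just a stand-in for real data:

>>> import numpy as np
>>> from scipy.spatial.distance import pdist, squareform
>>> X = np.random.rand(5, 4)                      # five points in four dimensions
>>> D = squareform(pdist(X, metric='euclidean'))  # 5 x 5 matrix of pairwise distances
>>> print(D.shape)
(5, 5)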

Now, MDS tries to position the individual data points in the lower-dimensional space such that the new distances there resemble the distances in the original space as much as possible. As MDS is often used for visualization, the chosen lower dimension is, most of the time, two or three.

Let's have a look at the following simple data consisting of three data points in five-dimensional space. Two of the data points are close by and one is very distinct, and we want to visualize this in three and two dimensions:

>>> X = np.c_[np.ones(5), 2 * np.ones(5), 10 * np.ones(5)].T
>>> print(X)
[[ 1.  1.  1.  1.  1.]
 [ 2.  2.  2.  2.  2.]
 [10. 10. 10. 10. 10.]]

Using the MDS class in scikit-learn's manifold package, we first specify that we want to transform X into a three-dimensional Euclidean space:

>>> from sklearn import manifold
>>> mds = manifold.MDS(n_components=3)
>>> Xtrans = mds.fit_transform(X) 

To visualize it in two dimensions, we would only need to set n_components to 2.
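For example, a two-dimensional embedding of the same X could be obtained as follows (a minimal sketch; the variable names are just illustrative):

>>> mds2 = manifold.MDS(n_components=2)
>>> Xtrans2 = mds2.fit_transform(X)
>>> print(Xtrans2.shape)
(3, 2)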

The results can be seen in the following two graphs. The triangle and circle are both close together, whereas the star is far away:

Let's have a look at the slightly more complex Iris dataset. We will use it later to contrast LDA with PCA. The Iris dataset contains four attributes per flower. With the preceding code, we can project it into three-dimensional space while keeping the relative distances between the individual flowers as much as possible. In the previous example, we did not specify any metric, so MDS defaults to the Euclidean one. This means that flowers that were different according to their four attributes should also be far away in the MDS-scaled, three-dimensional space, and flowers that were similar should now be almost together, as shown in the following diagram.
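Sketched in code, this projection of the Iris data could look like the following (the variable names are just illustrative):

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> mds = manifold.MDS(n_components=3)
>>> Xtrans = mds.fit_transform(iris.data)
>>> print(Xtrans.shape)
(150, 3)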

Reducing the dimensionality to three and two dimensions with PCA instead, we see the expected larger spread of flowers belonging to the same class, as shown in the following graphs.
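A corresponding PCA sketch for the two-dimensional case could look like this (assuming iris is loaded as above):

>>> from sklearn import decomposition
>>> pca = decomposition.PCA(n_components=2)
>>> Xpca = pca.fit_transform(iris.data)
>>> print(Xpca.shape)
(150, 2)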

Of course, using MDS requires an understanding of the individual features' units; maybe we are using features that cannot be compared using the Euclidean metric. For instance, a categorical variable, even when encoded as an integer (0 = circle, 1 = star, 2 = triangle, and so on), cannot be compared using the Euclidean metric (is a circle closer to a star than to a triangle?).
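In such a case, one option is to compute a dissimilarity matrix ourselves with a more appropriate metric and hand it to MDS using dissimilarity='precomputed'. The following is only a sketch; the tiny categorical example and the choice of the Hamming distance are illustrative assumptions, not a fixed recipe:

>>> from scipy.spatial.distance import pdist, squareform
>>> shapes = np.array([[0], [1], [2], [0]])          # 0 = circle, 1 = star, 2 = triangle
>>> D = squareform(pdist(shapes, metric='hamming'))  # 1.0 where shapes differ, 0.0 where equal
>>> mds = manifold.MDS(n_components=2, dissimilarity='precomputed')
>>> Xtrans = mds.fit_transform(D)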

However, once we are aware of this issue, MDS is a useful tool that reveals similarities in our data that would otherwise be difficult to see in the original feature space.

Looking a bit deeper into MDS, we realize that it's not a single algorithm, but rather a family of different algorithms, of which we have used just one. The same was true for PCA. Also, in case you realize that neither PCA nor MDS solves your problem, just look at the many other manifold learning and embedding algorithms that are available in the scikit-learn toolkit.

However, before you get overwhelmed by the many different algorithms, it's always best to start with the simplest one and see how far you get with it. Then, take the next more complex one and continue from there.
