It is easy to discover patterns by visualizing data in two or three dimensions. A high-dimensional dataset cannot be represented graphically directly, but we can still gain insight into its structure by reducing it to two or three principal components.
Published by Ronald Fisher in 1936, the Iris dataset contains fifty samples from each of three species of Iris: Iris setosa, Iris virginica, and Iris versicolor. The explanatory variables are measurements of the lengths and widths of the flowers' petals and sepals. The Iris dataset is commonly used to test classification models, and is included with scikit-learn. Let's reduce the dataset's four dimensions so that we can visualize it in two:
>>> import matplotlib.pyplot as plt
>>> from sklearn.decomposition import PCA
>>> from sklearn.datasets import load_iris
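As a quick aside (not part of the original listing), it can help to inspect what load_iris returns before reducing it; the following sketch prints the data's shape and the feature and class names that scikit-learn ships:

```python
from sklearn.datasets import load_iris

data = load_iris()
print(data.data.shape)      # (150, 4): 150 flowers, 4 measurements each
print(data.feature_names)   # sepal and petal lengths and widths, in cm
print(data.target_names)    # ['setosa' 'versicolor' 'virginica']
```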
First, we load the built-in iris dataset and instantiate a PCA estimator. The PCA class takes the number of principal components to retain as a hyperparameter. Like the other estimators, PCA exposes a fit_transform() method that returns the reduced data matrix:
>>> data = load_iris()
>>> y = data.target
>>> X = data.data
>>> pca = PCA(n_components=2)
>>> reduced_X = pca.fit_transform(X)
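To check how much information the reduction discards, we can consult the fitted estimator's explained_variance_ratio_ attribute, which reports the fraction of the dataset's variance captured by each retained component (a quick sketch, not part of the original listing):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each retained component;
# for iris, the first two components preserve roughly 98% of the variance
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

A sum close to 1 indicates that the two-dimensional view is a faithful summary of the original four-dimensional data.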
Finally, we assemble and plot the reduced data:
>>> red_x, red_y = [], []
>>> blue_x, blue_y = [], []
>>> green_x, green_y = [], []
>>> for i in range(len(reduced_X)):
...     if y[i] == 0:
...         red_x.append(reduced_X[i][0])
...         red_y.append(reduced_X[i][1])
...     elif y[i] == 1:
...         blue_x.append(reduced_X[i][0])
...         blue_y.append(reduced_X[i][1])
...     else:
...         green_x.append(reduced_X[i][0])
...         green_y.append(reduced_X[i][1])
...
>>> plt.scatter(red_x, red_y, c='r', marker='x')
>>> plt.scatter(blue_x, blue_y, c='b', marker='D')
>>> plt.scatter(green_x, green_y, c='g', marker='.')
>>> plt.show()
The reduced instances are plotted in the following figure. Each of the dataset's three classes is indicated by its own marker style. From this two-dimensional view of the data, it is clear that one of the classes can be easily separated from the other two overlapping classes. It would be difficult to notice this structure without a graphical representation. This insight can inform our choice of classification model.
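As an aside, the bookkeeping loop above can be written more compactly with NumPy boolean indexing, selecting each class's rows in a single scatter call. A sketch with the same data and marker styles (the Agg backend is an assumption here, chosen so the example runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

data = load_iris()
y = data.target
reduced_X = PCA(n_components=2).fit_transform(data.data)

# One scatter call per class: the boolean mask y == label selects
# that class's rows of the reduced matrix
for label, color, marker in zip((0, 1, 2), 'rbg', ('x', 'D', '.')):
    mask = y == label
    plt.scatter(reduced_X[mask, 0], reduced_X[mask, 1], c=color, marker=marker)
plt.show()
```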