From a single column, we will now move on to multiple columns. In multivariate data analysis, we are interested in seeing whether there are any relationships between the columns that we are analyzing. In the two-column (bivariate) case, the best place to start is a standard scatter plot. Broadly, a scatter plot can reveal four types of relationships between two variables: positive, negative, nonlinear, and no relationship at all.
We will use the Iris dataset. It's a multivariate dataset introduced by Sir Ronald Fisher. Refer to https://archive.ics.uci.edu/ml/datasets/Iris for more information.
The Iris dataset has 150 instances and four attributes/columns. The 150 instances are composed of 50 records from each of the three species of the Iris flower (Setosa, virginica, and versicolor). The four attributes are the sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. Thus, the Iris dataset also serves as a great classification dataset. A classification method can be written in such a way that, given a record, we can classify which species that record belongs to after appropriate training.
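The 150-instance, three-species layout described above can be verified quickly by counting the class labels after loading the data with scikit-learn (a minimal check, assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()
# Count how many records carry each class label (0, 1, 2)
counts = np.bincount(data['target'])
for name, count in zip(data['target_names'], counts):
    print(name, int(count))
# setosa 50
# versicolor 50
# virginica 50
```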
Let's load the necessary libraries and extract the Iris data:
# Load libraries
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
import itertools

# 1. Load the Iris dataset
data = load_iris()
x = data['data']
y = data['target']
col_names = data['feature_names']
We will proceed with demonstrating with a scatter plot:
# 2. Perform a simple scatter plot.
# Plot 6 graphs, the pairwise combinations of our columns:
# sepal length, sepal width, petal length and petal width.
plt.close('all')
plt.figure(1)
# We want a figure with 3 rows and 2 columns of subplots;
# the 3 and 2 in the variable below signify that.
subplot_start = 321
col_numbers = range(0, 4)
# Needed for labeling the graphs
col_pairs = itertools.combinations(col_numbers, 2)
plt.subplots_adjust(wspace=0.5)

for col_pair in col_pairs:
    plt.subplot(subplot_start)
    plt.scatter(x[:, col_pair[0]], x[:, col_pair[1]], c=y)
    plt.xlabel(col_names[col_pair[0]])
    plt.ylabel(col_names[col_pair[1]])
    subplot_start += 1

plt.show()
The scikit-learn library provides a convenient function, load_iris(), to load the Iris dataset. We use it in step 1 to load the Iris data into the variable data, which is a dictionary object. Using the data and target keys, we retrieve the records and the class labels. Let's look at the x and y values:
>>> x.shape
(150, 4)
>>> y.shape
(150,)
As you can see, x is a matrix with 150 rows and four columns, and y is a vector of length 150. The data dictionary can also be queried for the column names using the feature_names key, as follows:
>>> data['feature_names']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
We will then create a scatter plot of the Iris variables in step 2. As we did before, we will use subplot here to accommodate all the plots in a single figure. We will generate all pairwise combinations of our columns using itertools.combinations:
col_pairs = itertools.combinations(col_numbers,2)
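As a quick sanity check that this yields the six pairs the 3 x 2 subplot grid expects, we can materialize the iterator (a small illustrative snippet):

```python
import itertools

pairs = list(itertools.combinations(range(0, 4), 2))
print(len(pairs))  # 6
print(pairs)       # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

Note that itertools.combinations returns a one-shot iterator; once the for loop consumes it, it is exhausted, which is why we materialize a copy here rather than reusing col_pairs.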
We can iterate over col_pairs to get each pair of columns and plot a scatter plot for each, as you can see in the following line of code:
plt.scatter(x[:,col_pair[0]],x[:,col_pair[1]],c=y)
We pass the c parameter in order to indicate the color of the points. In this case, we pass our y variable (the class labels) so that the different species of Iris are plotted in different colors in our scatter plot.
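The effect of c can be seen in isolation with a tiny example (a minimal sketch using made-up data; the Agg backend is used so it runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Three groups of points with class labels 0, 1 and 2
labels = np.repeat([0, 1, 2], 3)
xs = np.arange(9)
ys = xs ** 2
sc = plt.scatter(xs, ys, c=labels)
# scatter stores the c values and maps each distinct one to a colormap color
print(len(np.unique(sc.get_array())))  # 3 distinct color values
```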
The resulting plot is as follows:
As you can see, we have plotted all six pairwise combinations of our columns, with the class labels represented using three different colors. Let's look at the bottom-left plot, petal length versus petal width. We can see that different ranges of values belong to different class labels. This gives us a great clue for classification: the petal width and petal length variables are good candidates if the problem at hand is classification.
These kinds of observations can be quickly made during the feature selection process with the help of bivariate scatter plots.
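The visual observation can also be backed up numerically. For instance, the per-species means of petal length and petal width (columns 2 and 3) sit in clearly different ranges (a quick sketch, again assuming scikit-learn is available):

```python
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()
x, y = data['data'], data['target']
# Mean petal length (column 2) and petal width (column 3) per species
for label, name in enumerate(data['target_names']):
    means = x[y == label].mean(axis=0)
    print(name, round(float(means[2]), 2), round(float(means[3]), 2))
```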