Using scatter plots for multivariate data

From a single column, we will now move on to multiple columns. In multivariate data analysis, we are interested in seeing whether there are any relationships between the columns that we are analyzing. In the two-column (two-variable) case, the best place to start is a standard scatter plot. There can be four types of relationships, as follows:

  • No relationship
  • Strong relationship
  • Simple relationship
  • Multivariate (not simple) relationship

Getting ready

We will use the Iris dataset. It's a multivariate dataset introduced by Sir Ronald Fisher. Refer to https://archive.ics.uci.edu/ml/datasets/Iris for more information.

The Iris dataset has 150 instances and four attributes/columns. The 150 instances are composed of 50 records from each of the three species of the Iris flower (setosa, virginica, and versicolor). The four attributes are the sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm. Thus, the Iris dataset also serves as a great classification dataset: after appropriate training, a classification method can predict which species a given record belongs to.
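
As a quick check of the figures above, the following minimal sketch counts the records per species. It assumes scikit-learn is installed and uses the target_names key of the loaded dataset, which this recipe does not otherwise use.

```python
from sklearn.datasets import load_iris
import numpy as np

data = load_iris()

# Class labels are the integers 0, 1, and 2; count the records in each class
counts = np.bincount(data['target'])

# Print each species name next to its record count
for name, count in zip(data['target_names'], counts):
    print(name, count)
```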

How to do it…

Let's load the necessary libraries and extract the Iris data:

# Load libraries
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
import itertools

# 1. Load Iris dataset
data = load_iris()
x = data['data']
y = data['target']
col_names = data['feature_names']

We will now draw the scatter plots:

# 2. Perform a simple scatter plot.
# Plot 6 graphs, combinations of our columns, sepal length, sepal width,
# petal length and petal width.
plt.close('all')
plt.figure(1)
# We want a grid with 3 rows and 2 columns.
# The first two digits of 321 encode that;
# the last digit is the index of the current subplot.
subplot_start = 321
col_numbers = range(0,4)
# Every pair of columns; used for plotting and for labeling the axes
col_pairs = itertools.combinations(col_numbers,2)
plt.subplots_adjust(wspace = 0.5)

for col_pair in col_pairs:
    plt.subplot(subplot_start)
    plt.scatter(x[:,col_pair[0]],x[:,col_pair[1]],c=y)
    plt.xlabel(col_names[col_pair[0]])
    plt.ylabel(col_names[col_pair[1]])
    subplot_start += 1

plt.show()

How it works…

The scikit-learn library provides a convenient function, load_iris(), to load the Iris dataset. We use it in step 1 to load the Iris data into the variable data, which is a dictionary-like object. Using the data and target keys, we retrieve the records and the class labels. Let's look at the x and y values:

>>> x.shape
(150, 4)
>>> y.shape
(150,)
>>>

As you can see, x is a matrix with 150 rows and four columns; y is a vector of length 150. The data dictionary can also be queried to view the column names using the feature_names key, as follows:

>>> data['feature_names']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>>

We then create scatter plots of the iris variables in step 2. As we did before, we use subplot here to accommodate all the plots in a single figure. We get every two-column combination of our columns using itertools.combinations:

col_pairs = itertools.combinations(col_numbers,2)
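
To see which pairs this produces, here is a minimal sketch that materializes the combinations into a list. The list() call is added only for inspection; in the recipe, col_pairs stays a one-shot iterator that the for loop consumes.

```python
import itertools

col_numbers = range(0, 4)

# Materialize the iterator so that the pairs can be printed and counted
col_pairs = list(itertools.combinations(col_numbers, 2))

print(col_pairs)
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(col_pairs))
# 6
```

Four columns taken two at a time give six pairs, which is why the figure is laid out as a grid of three rows and two columns.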

We can iterate over col_pairs to get each pair of columns and draw a scatter plot for it, as you can see in the following line of code:

plt.scatter(x[:,col_pair[0]],x[:,col_pair[1]],c=y)

We will pass a c parameter in order to indicate the color of the points. In this case, we will pass our y variable (class label) so that the different species of iris are plotted in different colors in our scatter plot.

The resulting plot is as follows:

As you can see, we have plotted every two-column combination of our columns. We also have the class labels represented using three different colors. Let's look at the bottom-right plot, petal length versus petal width. We see that different ranges of values belong to different class labels. This gives us a great clue for classification; the petal width and length variables are good candidates if the problem at hand is classification.
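
The colors alone do not tell us which species is which. The following is a minimal sketch, not part of the recipe, that redraws the petal length versus petal width plot with one scatter call per species so that a legend can be attached. It assumes the target_names key of the dataset for the species names.

```python
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

data = load_iris()
x, y = data['data'], data['target']
col_names = data['feature_names']
species = data['target_names']

fig, ax = plt.subplots()

# One scatter call per class so that each species gets its own legend entry
for label, name in enumerate(species):
    mask = (y == label)
    # Column 2 is petal length, column 3 is petal width
    ax.scatter(x[mask, 2], x[mask, 3], label=str(name))

ax.set_xlabel(col_names[2])
ax.set_ylabel(col_names[3])
ax.legend()
# Call plt.show() to display the figure interactively
```

Passing c=y, as the recipe does, is shorter; the per-species loop is only worthwhile when the plot needs a legend.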

Note

For the Iris dataset, petal width and petal length alone go a long way toward classifying the records: they separate setosa cleanly from the other two species, while versicolor and virginica overlap only slightly.

These kinds of observations can be quickly made during the feature selection process with the help of bivariate scatter plots.
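
The visual impression can be backed up with numbers. The following minimal sketch, under the same assumptions as the recipe (scikit-learn's load_iris() and its column order), prints the minimum and maximum petal length and petal width for each species; the target_names key is used only for the printed labels.

```python
from sklearn.datasets import load_iris

data = load_iris()
x, y = data['data'], data['target']

# Columns 2 and 3 are petal length and petal width
for label, name in enumerate(data['target_names']):
    petals = x[y == label][:, 2:4]
    print(name, "min:", petals.min(axis=0), "max:", petals.max(axis=0))
```

If the ranges of one species do not overlap with those of another, a simple threshold on that column is enough to tell the two apart.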

See also

  • Using iterables recipe in Chapter 1, Using Python for Data Science
  • Working with itertools recipe in Chapter 1, Using Python for Data Science