If you have two variables and want to see whether they are correlated, a scatter plot is a quick way to spot patterns.
This type of plot is also a useful starting point for more advanced visualizations of multidimensional data (for example, a scatter plot matrix).
Scatter plots display values for two sets of data. The data is drawn as a collection of points that are not connected by lines. The coordinates of each point are determined by the values of the two variables. One variable is controlled (the independent variable) and is usually plotted on the x axis, while the other is measured (the dependent variable) and is usually plotted on the y axis.
Here's a code sample that draws two plots side by side: one with uncorrelated data and the other with strong positive correlation:
import matplotlib.pyplot as plt
import numpy as np

# generate x values
x = np.random.randn(1000)

# random measurements, no correlation
y1 = np.random.randn(len(x))

# strong correlation
y2 = 1.2 + np.exp(x)

# left panel: uncorrelated data
ax1 = plt.subplot(121)
plt.scatter(x, y1, color='indigo', alpha=0.3, edgecolors='white', label='no correl')
plt.xlabel('no correlation')
plt.grid(True)
plt.legend()

# right panel: correlated data, sharing both axes with the left panel
ax2 = plt.subplot(122, sharey=ax1, sharex=ax1)
plt.scatter(x, y2, color='green', alpha=0.3, edgecolors='grey', label='correl')
plt.xlabel('strong correlation')
plt.grid(True)
plt.legend()

plt.show()
Here, we also pass a few extra parameters to scatter(): color sets the color of the points, alpha sets their transparency, edgecolors sets the color of the marker edge, and label provides the text for the legend box. A further parameter, marker, which this sample leaves at its default, selects the shape used for each point (the default is a circle).
These are the plots we get:
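The marker parameter described earlier is not exercised by the sample, so here is a minimal sketch of how it might be used. The data, the seed, the triangle marker '^', and the label text are all arbitrary choices made for illustration; they are not taken from the original sample.

```python
import matplotlib.pyplot as plt
import numpy as np

# Seeded generator so the figure is reproducible (the seed is arbitrary)
rng = np.random.default_rng(42)
x = rng.standard_normal(200)
y = rng.standard_normal(200)

fig, ax = plt.subplots()

# Same keyword arguments as in the sample, plus an explicit marker:
# '^' draws upward-pointing triangles instead of the default circles
points = ax.scatter(x, y, color='indigo', marker='^', alpha=0.5,
                    edgecolors='white', label='triangles')
ax.grid(True)
ax.legend()

plt.show()
```

Other common marker codes include 's' for squares and 'x' for crosses.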
A scatter plot is often used to identify a potential association between two variables, and it's often drawn before fitting a regression function. It gives a good visual picture of the correlation, particularly for nonlinear relationships. matplotlib provides the scatter() function, which takes x and y, two one-dimensional arrays of the same length, and plots x versus y as a scatter plot.
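To make the step from scatter plot to regression concrete, the following sketch computes the Pearson correlation coefficient and overlays a least-squares straight line on the points. The synthetic data (a line with slope 2.0 and intercept 1.0 plus Gaussian noise), the seed, and the choice of np.corrcoef() and np.polyfit() are assumptions made for illustration; other fitting tools would work just as well.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: a linear relationship plus Gaussian noise
rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.standard_normal(1000)
noise = rng.standard_normal(1000)
y = 2.0 * x + 1.0 + 0.5 * noise

# Pearson correlation coefficient: off-diagonal entry of the 2x2 matrix
r = np.corrcoef(x, y)[0, 1]

# Least-squares fit of a degree-1 polynomial;
# polyfit returns coefficients from the highest degree down
slope, intercept = np.polyfit(x, y, 1)

# Draw the points first, then the fitted line over sorted x values
xs = np.sort(x)
plt.scatter(x, y, color='green', alpha=0.3, edgecolors='grey',
            label=f'data (r = {r:.2f})')
plt.plot(xs, slope * xs + intercept, color='black', label='linear fit')
plt.grid(True)
plt.legend()

plt.show()
```

If the points curve away from the fitted line, as they would for data such as y2 in the sample above, that is a visual hint that a straight line is the wrong model.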