In this recipe, we will present different basic plots and what they are used for. Most of the plots described here are used daily, and some of them form the basis for understanding more advanced concepts in data visualization.
We start with some common charts from the matplotlib.pyplot library, using just sample datasets; this basic charting lays the foundation for the recipes that follow.
We start by creating a simple plot in IPython. IPython is great because it allows us to interactively change plots and see the results immediately. You need to follow these steps for that:
$ ipython
In [1]: from matplotlib.pyplot import *
In [2]: plot([1,2,3,2,3,2,2,1])
Out[2]: [<matplotlib.lines.Line2D at 0x412fb50>]
The plot should open in a new window displaying the default look of the plot and some supporting information as shown here:
The basic plot in matplotlib contains the following elements:
You will notice that the values we provided to plot() are used as the y axis values. plot() provides default values for the x axis; they are linear values from 0 to 7 (the number of y values minus 1).
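The default x values can be checked directly. The following is a minimal sketch, assuming a non-interactive backend (Agg) so that it runs without a display, and using the `plt` alias and object-oriented calls rather than the star import used in the session above; the variable names are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

y = [1, 2, 3, 2, 3, 2, 2, 1]

fig, (ax_default, ax_explicit) = plt.subplots(1, 2)

# Only y values given: matplotlib generates x = 0, 1, ..., len(y) - 1
(line_default,) = ax_default.plot(y)

# Same data with the x values spelled out explicitly
(line_explicit,) = ax_explicit.plot(range(len(y)), y)

default_x = [int(v) for v in line_default.get_xdata()]
explicit_x = [int(v) for v in line_explicit.get_xdata()]
print(default_x)
print(default_x == explicit_x)
```

Both lines end up with the same x data, so passing only y values is a shorthand for plotting against the index of each value.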
Now, try adding values for the x axis as the first argument to the plot() function. In the same IPython session, type the following:
In [3]: plot([4,3,2,1],[1,2,3,4])
Out[3]: [<matplotlib.lines.Line2D at 0x31444d0>]
Note how IPython counts input and output lines (for example, In [2] and Out[2]). This helps us remember where we are in the current session and enables more advanced features, such as saving part of the session to a Python file. During data analysis, prototyping in IPython is the fastest way to reach a satisfying solution; we can then save particular sessions to a file, to be executed later if we need to reproduce the same plot.
This will update the plot to look like this image:
Here we see how matplotlib expands the y axis to accommodate the new value range and automatically changes the color of the second plot line so that we can distinguish the new plot.
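In the matplotlib versions this recipe was written for, unless we turn off the hold property (by calling hold(False)), all subsequent plots draw over the same axes. This is the default behavior in pylab mode in IPython. Note that hold() was deprecated in matplotlib 2.0 and removed in 3.0; in current releases, plots always accumulate on the same axes until they are cleared, for example with cla() or clf().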
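The overlay behavior described above can be sketched as follows. This assumes a current matplotlib release, where axes always accumulate plots and clearing is done explicitly with cla() rather than through hold(); the Agg backend is also an assumption, used so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Two consecutive plot calls draw onto the same axes
ax.plot([1, 2, 3, 2, 3, 2, 2, 1])
ax.plot([4, 3, 2, 1], [1, 2, 3, 4])
lines_after_two_plots = len(ax.lines)

# Each new line takes the next color from the default color cycle
colors_differ = ax.lines[0].get_color() != ax.lines[1].get_color()

# Clearing the axes gives the effect that hold(False) used to provide
ax.cla()
ax.plot([4, 3, 2, 1], [1, 2, 3, 4])
lines_after_clear = len(ax.lines)

print(lines_after_two_plots, colors_differ, lines_after_clear)
```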
Let us pack some more common plots and compare them over the same dataset. You can type this in IPython or run it from a separate Python script:
from matplotlib.pyplot import *

# some simple data
x = [1,2,3,4]
y = [5,4,3,2]

# create new figure
figure()

# divide subplots into 2 x 3 grid
# and select #1
subplot(231)
plot(x, y)

# select #2
subplot(232)
bar(x, y)

# horizontal bar-charts
subplot(233)
barh(x, y)

# create stacked bar charts
subplot(234)
bar(x, y)

# we need more data for stacked bar charts
y1 = [7,8,5,3]
bar(x, y1, bottom=y, color = 'r')

# box plot
subplot(235)
boxplot(x)

# scatter plot
subplot(236)
scatter(x,y)

show()
With figure(), we create a new figure. If we supply a string argument, such as 'sample charts', it becomes the title of the backend window. If we call the figure() function again with the same parameter (which can also be a number), the corresponding figure becomes active and all subsequent plotting is performed on that figure.
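As a small illustration of how figure() reuses an existing figure, here is a sketch assuming the Agg backend (so no window is opened); the label 'other charts' is a hypothetical second figure added only for the demonstration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

# Create two labelled figures; the most recently created one is active
fig_a = plt.figure("sample charts")
fig_b = plt.figure("other charts")  # hypothetical second figure
active_before = plt.gcf()

# Calling figure() with an existing label does not create a new figure;
# it makes the existing one active again
fig_again = plt.figure("sample charts")
active_after = plt.gcf()

print(fig_again is fig_a)
print(active_before is fig_b, active_after is fig_a)
```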
Next, we divide the figure into a 2 x 3 grid using a subplot(231) call. We could also write this as subplot(2, 3, 1), where the first parameter is the number of rows, the second is the number of columns, and the third is the plot number.
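To check that the two call styles select the same grid cell, the sketch below creates one subplot with each form in separate figures and compares their grid geometry. The Agg backend is an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

# Three-digit shorthand: 2 rows, 3 columns, plot number 1
fig_short = plt.figure()
ax_short = plt.subplot(231)

# Equivalent explicit form in a second figure
fig_long = plt.figure()
ax_long = plt.subplot(2, 3, 1)

# get_geometry() reports the grid shape and the cell range the axes occupy
geom_short = ax_short.get_subplotspec().get_geometry()
geom_long = ax_long.get_subplotspec().get_geometry()

print(geom_short == geom_long)
```

The three-digit form only works while every number is a single digit; for larger grids the explicit three-argument form is needed.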
We continue by creating common chart types with simple calls: vertical bar charts (bar()) and horizontal bar charts (barh()). For stacked bar charts, we need to tie two bar chart calls together. We do that by connecting the second bar chart to the first using the parameter bottom=y, which starts each bar of the second series at the top of the corresponding bar of the first.
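The effect of the bottom parameter can be inspected on the bar rectangles themselves. This is a minimal sketch that reuses the x, y, and y1 data from the listing above; the Agg backend and the object-oriented calls are assumptions made so the example is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [5, 4, 3, 2]
y1 = [7, 8, 5, 3]

fig, ax = plt.subplots()
lower = ax.bar(x, y)
# bottom=y makes every red bar start where the matching lower bar ends
upper = ax.bar(x, y1, bottom=y, color="r")

upper_starts = [float(rect.get_y()) for rect in upper]
upper_tops = [float(rect.get_y() + rect.get_height()) for rect in upper]

print(upper_starts)
print(upper_tops)
```

Each upper bar starts at the height of the lower bar and ends at the sum of the two series, which is what makes the chart read as stacked.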
Box plots are created using the boxplot() call, where the box extends from the lower to the upper quartile, with a line at the median value. We will return to box plots shortly.
We finally create a scatter plot to give you an idea of a point-based dataset. This is probably more appropriately used when we have thousands of data points in a dataset, but here we wanted to illustrate the difference in representations of the same dataset.
We can return to box plots now as we need to explain the characteristics of this kind of plot.
A box plot presents, by default, the following elements:
To illustrate this behavior, we will demonstrate plotting the same dataset in a box plot and a histogram as shown in the following code:
from pylab import *

dataset = [113, 115, 119, 121, 124, 124, 125, 126, 126, 126,
           127, 127, 128, 129, 130, 130, 131, 132, 133, 136]

subplot(121)
boxplot(dataset, vert=False)

subplot(122)
hist(dataset)

show()
That will give us the following plots:
In the preceding comparison, we can observe a difference in representation of the same dataset in two different charts. The one on the left points toward the five mentioned statistical values, while the one on the right (the histogram) displays the grouping of the dataset in a given range.
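The numbers behind both charts can also be computed without drawing anything. The sketch below is an illustration under assumptions rather than part of the recipe: it uses numpy.percentile for the quartiles, matplotlib.cbook.boxplot_stats to see what boxplot() would draw with its default whisker setting (1.5 times the interquartile range), and numpy.histogram with its default of 10 bins for the histogram counts.

```python
import numpy as np
from matplotlib import cbook

dataset = [113, 115, 119, 121, 124, 124, 125, 126, 126, 126,
           127, 127, 128, 129, 130, 130, 131, 132, 133, 136]

# Five summary values: minimum, lower quartile, median, upper quartile, maximum
q1, median, q3 = np.percentile(dataset, [25, 50, 75])
five_numbers = (min(dataset), float(q1), float(median), float(q3), max(dataset))

# Statistics matplotlib derives for a box plot with default settings;
# points outside the whiskers are reported separately as fliers
stats = cbook.boxplot_stats(dataset)[0]

# Histogram view of the same data: how many values fall into each bin
counts, edges = np.histogram(dataset)

print(five_numbers)
print(stats["whislo"], stats["whishi"], list(stats["fliers"]))
print(counts)
```

With the default whisker setting, the whiskers do not necessarily reach the minimum and maximum: any value farther than 1.5 times the interquartile range from the box is drawn as an individual point.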