Box plots help identify outliers in data and are useful for comparing distributions. As per Wikipedia, box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the first and third quartiles, and the band inside the box is always the second quartile (the median).
The lines extending from the box are the whiskers. Any data not included between the whiskers is an outlier.
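To make the quartile and whisker terminology concrete, here is a minimal sketch that computes the numbers a box plot is built from. The sample values are made up for illustration, and the 1.5 × IQR rule is matplotlib's default whisker setting (`whis=1.5`), which is an assumption about the plot settings rather than something this recipe configures.

```python
import numpy as np

# Hypothetical sample values, with one unusually large observation
values = np.array([1, 2, 3, 4, 5, 100])

# The box spans the first and third quartiles; the band is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# With matplotlib's default whis=1.5, whiskers reach the most extreme
# data points that still lie inside these bounds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside the bounds is drawn as an outlier (a "flier")
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(q1, median, q3)
print(lower_bound, upper_bound)
print(outliers)
```

Here only the value 100 falls outside the bounds, so it is the one point that would be drawn past the whiskers.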
To show matplotlib plots in IPython Notebook, we will use an IPython magic function which starts with %:
%matplotlib inline
import pandas as pd
import numpy as np
from pymongo import MongoClient
import matplotlib as mpl
import matplotlib.pyplot as plt
client = MongoClient('localhost', 27017)
db = client.pythonbicookbook
collection = db.accidents
fields = {'Date':1,
          'Police_Force':1,
          'Accident_Severity':1,
          'Number_of_Vehicles':1,
          'Number_of_Casualties':1}
data = collection.find({}, fields)
accidents = pd.DataFrame(list(data))
casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': np.sum})
data_to_plot = [casualty_count['Number_of_Casualties'], vehicle_count['Number_of_Vehicles']]
fig = plt.figure(1, figsize=(9, 6))
ax = fig.add_subplot(111)
bp = ax.boxplot(data_to_plot)
Change the color and linewidth of the caps as follows:
for cap in bp['caps']:
    cap.set(color='#7570b3', linewidth=2)
Change the color and linewidth of the medians:
for median in bp['medians']:
    median.set(color='#b2df8a', linewidth=2)
Change the style of the fliers and their fill:
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.5)
ax.set_xticklabels(['Casualties', 'Vehicles'])
fig.savefig('fig1.png', bbox_inches='tight')
First, we import all the required Python libraries, and then connect to MongoDB. After this, we run a query against MongoDB, and create a new DataFrame from the result.
# Create a frequency table of casualty counts from the previous recipe
casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
# Create a frequency table of vehicle counts
vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': np.sum})
The preceding code creates frequency tables for casualty and vehicle counts by summing each column per date.
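The grouping step can be tried without a MongoDB connection. The following is a minimal sketch on a tiny, made-up DataFrame; the name `accidents_demo` and its values are hypothetical stand-ins for the real accidents data, and the string `'sum'` is used as the aggregation, which performs the same per-group sum as `np.sum` does in the recipe.

```python
import pandas as pd

# Hypothetical stand-in for the accidents DataFrame (values are invented)
accidents_demo = pd.DataFrame({
    'Date': ['01/01/2015', '01/01/2015', '02/01/2015'],
    'Number_of_Casualties': [1, 3, 2],
    'Number_of_Vehicles': [2, 1, 4],
})

# One row per date, holding the total for that date
casualty_demo = accidents_demo.groupby('Date').agg({'Number_of_Casualties': 'sum'})
vehicle_demo = accidents_demo.groupby('Date').agg({'Number_of_Vehicles': 'sum'})

print(casualty_demo)
print(vehicle_demo)
```

Each result is a DataFrame indexed by `Date`, which is why the recipe later selects a single column from it before plotting.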
# Create an array from the two frequency tables
data_to_plot = [casualty_count['Number_of_Casualties'],
                vehicle_count['Number_of_Vehicles']]
The preceding code creates an array (a Python list) from the two frequency tables; each entry becomes one box in the plot.
fig = plt.figure(1, figsize=(9, 6))
The preceding line creates an instance of a figure. The figure is the chart that will be rendered in our IPython Notebook when all is said and done.
ax = fig.add_subplot(111)
The preceding line adds an axis instance to our figure. An axis is exactly what you might guess it is: the place for data points.
bp = ax.boxplot(data_to_plot)
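The preceding line of code creates the boxplot using the data. The return value, stored in bp, is a dictionary of the plot's parts, which the following steps use for styling.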
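To see what `boxplot` hands back, the sketch below builds a plot from two small, made-up datasets and lists the keys of the returned dictionary along with how many artists each holds. The `Agg` backend is selected only so the example runs outside a notebook; the data and variable names are illustrative assumptions.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no notebook is needed
import matplotlib.pyplot as plt

# Hypothetical data: two datasets, so two boxes
demo_data = [[1, 2, 3, 4, 5, 100], [2, 4, 6, 8, 10]]

fig, ax = plt.subplots(figsize=(9, 6))
bp = ax.boxplot(demo_data)

# Each key maps to a list of line artists that can be restyled with .set()
for name in sorted(bp):
    print(name, len(bp[name]))

plt.close(fig)
```

Each box has two whiskers and two caps, so those lists are twice as long as the lists of boxes and medians.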
for cap in bp['caps']:
    cap.set(color='#7570b3', linewidth=2)
Here, we change the color and line width of the caps. The caps are the ends of the whiskers.
for median in bp['medians']:
    median.set(color='#b2df8a', linewidth=2)
The preceding code changes the color and linewidth of the medians. The medians divide the box in half; they allow the data to be split into quarters.
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.5)
Here we change the style of the fliers and their fill. The fliers are the outliers in the data, the data plotted past the whiskers.
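Besides styling the fliers, it can be handy to read the outlier values back out of the plot. The sketch below does that with `get_ydata()` on each flier artist; the data is made up, and the whiskers use matplotlib's default setting, which is an assumption about the plot configuration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no notebook is needed
import matplotlib.pyplot as plt

# Hypothetical data with one value far from the rest
demo_data = [1, 2, 3, 4, 5, 100]

fig, ax = plt.subplots()
bp = ax.boxplot(demo_data)

# One flier artist per box; its y-data holds the points drawn past the whiskers
outlier_values = [[float(v) for v in flier.get_ydata()] for flier in bp['fliers']]
print(outlier_values)

plt.close(fig)
```

For a vertical box plot the outlier values sit on the y-axis, which is why `get_ydata()` is used here.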
ax.set_xticklabels(['Casualties', 'Vehicles'])
The preceding line of code puts labels on the x-axis.
fig.savefig('fig1.png', bbox_inches='tight')
Finally, we save the figure. This writes it as a PNG to our working directory (the same one the IPython Notebook is in); because of the %matplotlib inline magic, the figure is also displayed in the IPython Notebook.
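As a quick check that saving works, here is a minimal sketch that writes a small box plot to a temporary directory and confirms the file exists. The file name `fig1_demo.png`, the temporary directory, and the data are assumptions made for the example; the recipe itself saves `fig1.png` to the working directory.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no notebook is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(9, 6))
ax.boxplot([[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]])
ax.set_xticklabels(['Casualties', 'Vehicles'])

# Save to a throwaway location; bbox_inches='tight' trims surrounding whitespace
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'fig1_demo.png')
fig.savefig(out_path, bbox_inches='tight')

saved = os.path.exists(out_path)
print(saved)

plt.close(fig)
```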
You did it! This is definitely the most complex plot we've created yet.