In this recipe, you'll learn how to create a probability distribution histogram of two variables. This plot comes in handy when you are trying to see how much overlap there is between two variables in your data.
To show matplotlib plots in IPython Notebook, we will use an IPython magic function, which starts with %:
%matplotlib inline
import pandas as pd
import numpy as np
from pymongo import MongoClient
import matplotlib as mpl
import matplotlib.pyplot as plt
client = MongoClient('localhost', 27017)
db = client.pythonbicookbook
collection = db.accidents
fields = {'Date': 1, 'Police_Force': 1, 'Accident_Severity': 1, 'Number_of_Vehicles': 1, 'Number_of_Casualties': 1}
data = collection.find({}, fields)
accidents = pd.DataFrame(list(data))
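# Create frequency tables of the daily casualty and vehicle counts
# (casualty_count is needed by the plotting code that follows)
casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': np.sum})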
# density=True normalizes each histogram; older matplotlib versions
# call this argument normed, which was removed in matplotlib 3.1
plt.hist(casualty_count['Number_of_Casualties'], bins=20, histtype='stepfilled', density=True, color='b', label='Casualties')
plt.hist(vehicle_count['Number_of_Vehicles'], bins=20, histtype='stepfilled', density=True, color='r', alpha=0.5, label='Vehicles')
plt.title("Casualties/Vehicles Histogram")
plt.xlabel("Value")
plt.ylabel("Probability")
plt.legend()
plt.show()
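If the MongoDB collection is not at hand, the plotting step can still be tried on synthetic data. The following is a minimal sketch and not part of the original recipe: the two Poisson-distributed series and their means are invented stand-ins for the daily casualty and vehicle totals, the output file name is arbitrary, and `density=True` is used because `normed` is no longer accepted by recent matplotlib releases.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the per-day totals (assumed values, not real data)
rng = np.random.default_rng(seed=42)
casualties = rng.poisson(lam=600, size=365)
vehicles = rng.poisson(lam=650, size=365)

fig, ax = plt.subplots()

# hist() returns the bar heights, the bin edges, and the drawn patches
cas_density, cas_edges, _ = ax.hist(casualties, bins=20, histtype='stepfilled',
                                    density=True, color='b', label='Casualties')
veh_density, veh_edges, _ = ax.hist(vehicles, bins=20, histtype='stepfilled',
                                    density=True, color='r', alpha=0.5, label='Vehicles')

ax.set_title("Casualties/Vehicles Histogram (synthetic data)")
ax.set_xlabel("Value")
ax.set_ylabel("Probability density")
ax.legend()
fig.savefig("casualties_vehicles_hist.png")
```

Inside a notebook, drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of `fig.savefig(...)`.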
If you've been reading this chapter from the very beginning, then the code in this recipe will be very familiar until we get to the second frequency table and the plotting of the two DataFrames.
First, we create a second frequency table of vehicle counts:
# Create a frequency table of vehicle counts
vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': np.sum})
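To make the shape of such a frequency table concrete, here is a small sketch on a hand-made DataFrame. The rows are invented for illustration and only mimic the column names used in this recipe; the string `'sum'` is passed as the aggregation, which should give the same result as `np.sum` here.

```python
import pandas as pd

# Five made-up accident records spread over two dates (not real data)
sample = pd.DataFrame({
    'Date': ['01/01/2015', '01/01/2015', '02/01/2015', '02/01/2015', '02/01/2015'],
    'Number_of_Vehicles': [2, 1, 3, 2, 1],
    'Number_of_Casualties': [1, 1, 2, 1, 4],
})

# One row per date, holding the total for that date
vehicle_count = sample.groupby('Date').agg({'Number_of_Vehicles': 'sum'})
casualty_count = sample.groupby('Date').agg({'Number_of_Casualties': 'sum'})

print(vehicle_count)
print(casualty_count)
```

Each result is indexed by the distinct `Date` values and has a single column with that day's total, which is the column later handed to `plt.hist`.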
# Plot the two dataframes
# density=True normalizes each histogram; older matplotlib versions
# call this argument normed, which was removed in matplotlib 3.1
plt.hist(casualty_count['Number_of_Casualties'], bins=20, histtype='stepfilled', density=True, color='b', label='Casualties')
plt.hist(vehicle_count['Number_of_Vehicles'], bins=20, histtype='stepfilled', density=True, color='r', alpha=0.5, label='Vehicles')
plt.title("Casualties/Vehicles Histogram")
plt.xlabel("Value")
plt.ylabel("Probability")
plt.legend()
plt.show()
Next, we add two histograms to our plot, one for the casualty count and one for the vehicle count. In both cases, we pass normed=True (renamed density=True in newer matplotlib releases, where normed has been removed) so that each histogram is normalized to a probability density whose total area is 1, which puts the two variables on a comparable scale. We then use the color argument to make each histogram a different color, give each a label, and use histtype='stepfilled' to make them step-filled histograms. Finally, we set the alpha value of the vehicle_count histogram to 0.5 so that we can see where the two histograms overlap.
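To put a number on what the normalization does, and on how much two such histograms overlap, the sketch below uses `np.histogram` with shared bin edges. It is an illustration under stated assumptions rather than part of the recipe: the two samples are synthetic, and the overlap measure (the bin-wise minimum of the two densities multiplied by the bin width, then summed) is one common choice among several.

```python
import numpy as np

# Two synthetic samples with partly overlapping ranges (assumed, not real data)
rng = np.random.default_rng(seed=0)
a = rng.normal(loc=50, scale=10, size=1000)
b = rng.normal(loc=60, scale=10, size=1000)

# Shared bin edges so that the bins of both histograms line up
edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 21)
width = np.diff(edges)

# density=True scales the bar heights so that each histogram has total area 1
dens_a, _ = np.histogram(a, bins=edges, density=True)
dens_b, _ = np.histogram(b, bins=edges, density=True)

area_a = (dens_a * width).sum()
area_b = (dens_b * width).sum()

# Area shared by both histograms: 0 means no overlap, 1 means identical shapes
overlap = (np.minimum(dens_a, dens_b) * width).sum()

print(f"area a: {area_a:.3f}, area b: {area_b:.3f}, overlap: {overlap:.3f}")
```

Note that the recipe itself lets each histogram choose its own 20 bins; shared edges are only needed when the bar heights are compared bin by bin, as in this sketch.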
The resulting histogram looks like the following image: