A histogram is a graph that shows the distribution of numerical data. The matplotlib
Python library makes creating a histogram a snap. Here's how.
Before using this recipe, familiarize yourself with the following recipes as we'll be building on them:
To show matplotlib plots in IPython Notebook, we will use an IPython magic function, which starts with %:
    %matplotlib inline
    import pandas as pd
    import numpy as np
    from pymongo import MongoClient
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    client = MongoClient('localhost', 27017)
    db = client.pythonbicookbook
    collection = db.accidents
    fields = {'Date':1, 'Police_Force':1, 'Accident_Severity':1, 'Number_of_Vehicles':1, 'Number_of_Casualties':1}
    data = collection.find({}, fields)
Next, create a DataFrame from the results of the query:
    accidents = pd.DataFrame(list(data))
Create a frequency table of casualty counts using the previous recipe:
    casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
Create a histogram from the casualty count DataFrame and show it inline:
    plt.hist(casualty_count['Number_of_Casualties'], bins=30)
    plt.title('Number of Casualties Histogram')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.show()
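The DataFrame and grouping steps above can be tried without a MongoDB server. The following is a minimal sketch, assuming the query returns documents shaped like the hand-made `records` list (the dates and counts are invented for illustration, not taken from the accidents collection). It uses the string `'sum'` in place of `np.sum`; both request the same aggregation.

```python
import pandas as pd

# Hypothetical documents, standing in for the results of collection.find({}, fields)
records = [
    {"Date": "01/01/2014", "Number_of_Casualties": 2},
    {"Date": "01/01/2014", "Number_of_Casualties": 1},
    {"Date": "02/01/2014", "Number_of_Casualties": 3},
]

# Same construction as in the recipe: one row per document
accidents = pd.DataFrame(records)

# One row per date, holding the total casualties recorded on that date
casualty_count = accidents.groupby("Date").agg({"Number_of_Casualties": "sum"})
print(casualty_count)
```

The result is indexed by `Date`, so `casualty_count['Number_of_Casualties']` is the series of per-date totals that the histogram step consumes.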
    %matplotlib inline
    import pandas as pd
    import numpy as np
    from pymongo import MongoClient
    import matplotlib as mpl
    import matplotlib.pyplot as plt
As in the aforementioned recipes, we first import all the Python libraries that we need. The first line of code here is an IPython magic function. It allows us to show the plots (graphs) generated by matplotlib in the IPython Notebook. In a plain Python script it is not needed, and it must be removed because it is not valid Python syntax.
In addition to pandas, numpy, and pymongo, we import matplotlib and pyplot from matplotlib. These allow us to create the plots:
    # Import the data, 5 fields from the MongoDB data
    client = MongoClient('localhost', 27017)
    db = client.pythonbicookbook
    collection = db.accidents
    fields = {'Date':1, 'Police_Force':1, 'Accident_Severity':1, 'Number_of_Vehicles':1, 'Number_of_Casualties':1}
    data = collection.find({}, fields)
    accidents = pd.DataFrame(list(data))
Next, we create a connection to MongoDB, specify the database and collection to use, create a dictionary of fields that we want our query to return, run our query, and finally create a DataFrame:
    # Create a frequency table of casualty counts from the previous recipe
    casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
As before, we generate a frequency table of casualties over time (the total number of casualties for each date); this is what we will be graphing:
    # Create a histogram from the casualty count DataFrame using the
    # 'Number_of_Casualties' column and specifying 30 bins. The data will
    # automatically be placed into one of the 30 bins based on the column value.
    plt.hist(casualty_count['Number_of_Casualties'], bins=30)
    plt.title('Number of Casualties Histogram')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.show()
Finally, with a few more lines of code, we generate and display the plot, which looks like the following image:
The hist() function takes a number of arguments. Here, we give it the data to plot, in this instance the Number_of_Casualties column of our casualty_count DataFrame, as well as the number of bins to use. The easiest way to describe a bin is to liken it to a bucket. In this instance, we're indicating there should be 30 buckets. The histogram then divides the range of the data into 30 equal-width intervals and counts how many values fall into each one.
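To make the bucket idea concrete, here is a small sketch using numpy's `histogram` function, which matplotlib's `hist()` relies on for the binning itself. The values are made up for illustration, and three bins are used instead of 30 so the result is easy to check by hand.

```python
import numpy as np

# Made-up values, for illustration only
values = np.array([1, 2, 2, 3, 5, 8, 9, 10])

# Three equal-width bins spanning the data range (1 to 10)
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin boundaries: one more entry than there are bins
print(counts)  # number of values that fell into each bin
```

Every bin is closed on the left and open on the right, except the last one, which also includes its right edge. That is why the maximum value, 10, is still counted.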
I chose 30 buckets because our dataset contains millions of observations. Be sure to experiment with the number of buckets you use for the histogram to see how it affects the generated plot.
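One way to run that experiment is to draw the same data side by side with several bin counts. The sketch below is an illustration under assumptions: the data is synthetic (random draws standing in for casualty_count['Number_of_Casualties']), the bin counts 10, 30, and 100 are arbitrary choices, and the output file name is hypothetical. The `matplotlib.use("Agg")` line is only for running outside a notebook; with `%matplotlib inline` it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the per-date casualty totals (not real data)
rng = np.random.default_rng(0)
values = rng.poisson(lam=500, size=365)

# Draw one histogram per candidate bin count
bin_choices = [10, 30, 100]
fig, axes = plt.subplots(1, len(bin_choices), figsize=(12, 3))
for ax, bins in zip(axes, bin_choices):
    ax.hist(values, bins=bins)
    ax.set_title(f"bins={bins}")
    ax.set_xlabel("Value")
axes[0].set_ylabel("Frequency")

fig.tight_layout()
fig.savefig("bins_comparison.png")  # hypothetical output file name
```

Too few bins tend to hide the shape of the distribution, while too many leave most bins nearly empty and make the plot look noisy.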
Before showing the plot, we set the title and the labels for the x axis and y axis. These small additions make our plot production-ready.