Creating a histogram for a column

A histogram is a graph that shows the distribution of numerical data. The matplotlib Python library makes creating a histogram a snap. Here's how.

Getting ready

Before using this recipe, familiarize yourself with the following recipes as we'll be building on them:

Creating a Pandas DataFrame from a MongoDB query
Generating a frequency table for a single column by date

How to do it…

To create a histogram for a single column in a Pandas DataFrame, begin by importing all the required libraries. To show matplotlib plots in IPython Notebook, we will use an IPython magic function which starts with %:
```
%matplotlib inline
import pandas as pd
import numpy as np
from pymongo import MongoClient
import matplotlib as mpl
import matplotlib.pyplot as plt
```

Next, connect to MongoDB and run a query specifying the five fields to be retrieved from the MongoDB data:

client = MongoClient('localhost', 27017)
db = client.pythonbicookbook
collection = db.accidents
fields = {'Date':1,
          'Police_Force':1,
          'Accident_Severity':1,
          'Number_of_Vehicles':1,
          'Number_of_Casualties':1}
data = collection.find({}, fields)

Next, create a DataFrame from the results of the query:

accidents = pd.DataFrame(list(data))

Create a frequency table of casualty counts using the previous recipe:

casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
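To see what this aggregation produces, here is a minimal sketch on a tiny hand-made DataFrame. The rows and the date format are made up for illustration and are not taken from the accidents collection. The sketch passes the string 'sum' instead of np.sum, which should give the same totals.

```python
import pandas as pd

# Hypothetical rows standing in for the accidents DataFrame (values are made up).
toy_accidents = pd.DataFrame({
    "Date": ["01/01/2014", "01/01/2014", "02/01/2014", "02/01/2014", "02/01/2014"],
    "Number_of_Casualties": [1, 3, 2, 1, 1],
})

# Group the rows by date and total the casualties recorded on each date.
# The result has one row per distinct date, indexed by 'Date'.
toy_count = toy_accidents.groupby("Date").agg({"Number_of_Casualties": "sum"})
print(toy_count)
```

Each row of the result is one date, so the histogram in the next step shows how daily casualty totals are distributed, not how individual accidents are distributed.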

Finally, create the histogram from the casualty count DataFrame and show it inline:

plt.hist(casualty_count['Number_of_Casualties'],
         bins=30)
plt.title('Number of Casualties Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

How it works…

%matplotlib inline
import pandas as pd
import numpy as np
from pymongo import MongoClient
import matplotlib as mpl
import matplotlib.pyplot as plt

As in the earlier recipes, we first import all the Python libraries that we need. The first line of code is an IPython magic function that displays the plots (graphs) generated by matplotlib inline in the IPython Notebook. If you run this code as a plain Python script, leave that line out: it is IPython syntax, not valid Python.
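For reference, a plain-script version of the plotting step might look like the following sketch. It assumes you want the plot written to an image file instead of shown in a window. The data is randomly generated stand-in data and the output file name is hypothetical.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to a file, no window needed
import matplotlib.pyplot as plt

# Stand-in data (an assumption): 365 made-up daily casualty totals.
random.seed(0)
daily_totals = [random.randint(400, 900) for _ in range(365)]

# plt.hist returns the per-bin counts, the bin edges, and the drawn patches.
counts, bin_edges, _ = plt.hist(daily_totals, bins=30)
plt.title("Number of Casualties Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")

output_path = "casualties_histogram.png"  # hypothetical file name
plt.savefig(output_path)
plt.close()
```

In a script that should open an interactive window instead, drop the matplotlib.use("Agg") line and call plt.show() in place of plt.savefig().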

In addition to pandas, numpy, and pymongo, we import matplotlib and its pyplot module. These allow us to create the plots.

# Import the data, 5 fields from the MongoDB data
client = MongoClient('localhost', 27017)
db = client.pythonbicookbook
collection = db.accidents
fields = {'Date':1,
          'Police_Force':1,
          'Accident_Severity':1,
          'Number_of_Vehicles':1,
          'Number_of_Casualties':1}
data = collection.find({}, fields)
accidents = pd.DataFrame(list(data))

Next, we create a connection to MongoDB, specify the database and collection to use, create a dictionary of the fields that we want our query to return, run our query, and finally create a DataFrame from the results.

# Create a frequency table of casualty counts from the previous recipe
casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})

As before, we generate a frequency table of casualties over time by grouping the accidents by date and summing the casualties for each date. This is what we will be graphing.

# Create a histogram from the 'Number_of_Casualties' column of the casualty
# count DataFrame, specifying 30 bins. Each value is automatically placed
# into one of the 30 bins based on its value.
plt.hist(casualty_count['Number_of_Casualties'],
         bins=30)
plt.title('Number of Casualties Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Finally, with a few more lines of code, we generate and display the plot, which looks like the following image:

The hist() function takes a number of arguments. Here, we give it the data to plot, which is the Number_of_Casualties column of our casualty_count DataFrame, and the number of bins to use. The easiest way to describe a bin is to liken it to a bucket that covers a range of values. In this instance, we're indicating there should be 30 buckets of equal width. The histogram places each value into the bucket whose range contains it and plots how many values landed in each bucket.
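The bucket idea can be sketched in a few lines of plain Python. This is an illustration of equal-width binning, not matplotlib's actual implementation. The function name bin_counts and the sample values are made up, and the sketch assumes the values are not all identical.

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width buckets."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins  # assumes high > low
    counts = [0] * num_bins
    for value in values:
        # Index of the bucket whose range contains this value.
        # The maximum value is placed in the last bucket.
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


sample = [1, 2, 2, 3, 5, 8, 9, 10]
print(bin_counts(sample, 3))  # [4, 1, 3]
```

With three buckets over the range 1 to 10, the bucket boundaries fall at 1, 4, 7, and 10, so four values land in the first bucket, one in the second, and three in the third.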

I chose 30 buckets because our dataset contains millions of observations. Experiment with the number of buckets to see how it affects the generated plot: too few hide the shape of the distribution, while too many leave most buckets nearly empty and make the plot noisy.
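One quick way to experiment is to compare bin counts numerically before plotting. The sketch below uses numpy's histogram function, which computes counts and bin edges without drawing anything. The data is randomly generated stand-in data, not the accident data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
# Stand-in data (an assumption): 1,000 made-up daily totals.
daily_totals = rng.normal(loc=600, scale=80, size=1000)

for num_bins in (5, 30, 100):
    # counts[i] is the number of values between edges[i] and edges[i + 1].
    counts, edges = np.histogram(daily_totals, bins=num_bins)
    bin_width = edges[1] - edges[0]
    print(f"{num_bins:>3} bins: width {bin_width:6.1f}, "
          f"tallest bin holds {counts.max()} values")
```

As the number of bins grows, each bin gets narrower and holds fewer values. Passing the same bins values to plt.hist() shows the effect visually.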

Before showing the plot, we set the title and the x-axis and y-axis labels. These small additions make our plot production-ready.
