Creating a histogram for a column

A histogram is a graph that shows the distribution of numerical data. The matplotlib Python library makes creating a histogram a snap. Here's how.

Getting ready

Before using this recipe, familiarize yourself with the following recipes as we'll be building on them:

  • Creating a Pandas DataFrame from a MongoDB query
  • Generating a frequency table for a single column by date

How to do it…

  1. To create a histogram for a single column in a Pandas DataFrame, begin by importing all the required libraries. To show matplotlib plots in IPython Notebook, we will use an IPython magic function which starts with %:
    %matplotlib inline
    import pandas as pd
    import numpy as np
    from pymongo import MongoClient
    import matplotlib as mpl
    import matplotlib.pyplot as plt
  2. Next, connect to MongoDB and run a query specifying the five fields to be retrieved from the MongoDB data:
    client = MongoClient('localhost', 27017)
    db = client.pythonbicookbook
    collection = db.accidents
    fields = {'Date':1,
              'Police_Force':1,
              'Accident_Severity':1,
              'Number_of_Vehicles':1,
              'Number_of_Casualties':1}
    data = collection.find({}, fields)
    Next, create a DataFrame from the results of the query:
    accidents = pd.DataFrame(list(data))
  3. Create a frequency table of casualty counts using the previous recipe:
    casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
  4. Finally, create the histogram from the casualty count DataFrame and show it inline:
    plt.hist(casualty_count['Number_of_Casualties'],
             bins=30)
    plt.title('Number of Casualties Histogram')
    plt.xlabel('Value')
    plt.ylabel('Frequency')
    plt.show()

How it works…

%matplotlib inline
import pandas as pd
import numpy as np
from pymongo import MongoClient
import matplotlib as mpl
import matplotlib.pyplot as plt

As in the aforementioned recipes, we first import all the Python libraries that we need. The first line of code here is an IPython magic function, which displays the plots (graphs) generated by matplotlib inside the IPython Notebook. Magic functions are IPython-only syntax, so leave this line out if you run the code as a plain Python script.
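
For the plain-script case, one option is to write the figure to a file instead of displaying it inline. The following is a minimal sketch, not part of the recipe: the sample values are made up, the output filename is hypothetical, and the Agg backend is an assumption for a machine with no display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so use a non-interactive backend
import matplotlib.pyplot as plt

# Made-up values standing in for daily casualty totals
sample_totals = [310, 295, 402, 350, 288, 415, 330, 360, 298, 377]

fig, ax = plt.subplots()
ax.hist(sample_totals, bins=5)
ax.set_title('Number of Casualties Histogram')
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')

# Hypothetical output path; saving to disk takes the place of the inline display
output_path = os.path.join(tempfile.gettempdir(), 'casualty_histogram.png')
fig.savefig(output_path)
plt.close(fig)
```

In an interactive session with a display, calling plt.show() as in the recipe works as well.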

In addition to pandas, numpy, and pymongo, we import matplotlib and its pyplot module, which provides the plotting functions used in this recipe.

# Import the data, 5 fields from the MongoDB data
client = MongoClient('localhost', 27017)
db = client.pythonbicookbook
collection = db.accidents
fields = {'Date':1,
          'Police_Force':1,
          'Accident_Severity':1,
          'Number_of_Vehicles':1,
          'Number_of_Casualties':1}
data = collection.find({}, fields)
accidents = pd.DataFrame(list(data))

Next, we create a connection to MongoDB, specify the database and collection to use, create a dictionary of the fields that we want our query to return, run our query, and finally create a DataFrame from the results.
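
To see what the DataFrame step does without a running MongoDB server, here is a small sketch that uses a hard-coded list of dictionaries in place of the query results. The documents and their values are made up for illustration. Note that MongoDB also returns the _id field unless the projection sets '_id': 0, so the sketch drops that column.

```python
import pandas as pd

# Made-up documents shaped like the query results; values are illustrative only
documents = [
    {'_id': 1, 'Date': '2014-01-01', 'Police_Force': 1,
     'Accident_Severity': 3, 'Number_of_Vehicles': 2, 'Number_of_Casualties': 1},
    {'_id': 2, 'Date': '2014-01-01', 'Police_Force': 4,
     'Accident_Severity': 2, 'Number_of_Vehicles': 1, 'Number_of_Casualties': 3},
    {'_id': 3, 'Date': '2014-01-02', 'Police_Force': 1,
     'Accident_Severity': 3, 'Number_of_Vehicles': 3, 'Number_of_Casualties': 2},
]

# Each dictionary becomes a row and each key becomes a column
accidents_sample = pd.DataFrame(documents)

# Drop the _id column that MongoDB includes by default
accidents_sample = accidents_sample.drop(columns='_id')
print(accidents_sample)
```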

# Create a frequency table of casualty counts from the previous recipe
casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})

As before, we generate a frequency table of casualties over time. This is the data that we will be graphing.
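
The grouping step can be tried on a tiny hand-made DataFrame to see what the table looks like. This is a sketch with made-up dates and counts. It passes the string 'sum' instead of np.sum, which gives the same totals here.

```python
import pandas as pd

# Made-up accident rows: several accidents can share the same date
sample = pd.DataFrame({
    'Date': ['2014-01-01', '2014-01-01', '2014-01-02', '2014-01-03', '2014-01-03'],
    'Number_of_Casualties': [1, 3, 2, 1, 1],
})

# One row per date, holding the total casualties for that date
daily_totals = sample.groupby('Date').agg({'Number_of_Casualties': 'sum'})
print(daily_totals)
```

The result has one row per date, and that column of daily totals is what the histogram summarizes.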

# Create a histogram from the 'Number_of_Casualties' column of the
# casualty count DataFrame, using 30 bins. Each value is automatically
# placed into one of the 30 bins based on its size.
plt.hist(casualty_count['Number_of_Casualties'],
         bins=30)
plt.title('Number of Casualties Histogram')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

Finally, with a few more lines of code, we generate and display the plot, which looks like the following image:

[Image: Number of Casualties Histogram, with Value on the x axis and Frequency on the y axis]

The hist() function takes a number of arguments. Here, we give it two: the data to plot, which is the Number_of_Casualties column of our casualty_count DataFrame, and the number of bins to use. The easiest way to describe a bin is to liken it to a bucket. In this instance, we're indicating there should be 30 buckets. By default, hist() splits the range between the smallest and largest values into 30 equal-width buckets and counts how many values fall into each one.
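
The bucketing itself can be inspected without drawing anything. matplotlib's hist() relies on numpy.histogram to do the binning, so the sketch below calls that function directly on a handful of made-up values with three bins instead of 30.

```python
import numpy as np

# Made-up values; the smallest is 1 and the largest is 10
values = np.array([1, 2, 2, 3, 5, 8, 8, 9, 10])

# Split the range 1..10 into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print(edges)   # bin boundaries: 1, 4, 7, 10
print(counts)  # values per bin: 4, 1, 4
```

Each bin includes its left edge and excludes its right edge, except for the last bin, which also includes the largest value.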

I chose 30 buckets because our dataset has millions of observations. Be sure to experiment with the number of buckets you use for the histogram to see how it affects the generated plot.
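
One way to run that experiment is to draw the same data several times with different bin counts and compare the results side by side. The sketch below assumes synthetic, randomly generated values in place of the real casualty totals. The three bin counts are arbitrary choices, and the Agg backend line is only needed when no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the daily casualty totals (not real data)
rng = np.random.default_rng(0)
synthetic_totals = rng.poisson(lam=300, size=1000)

bin_options = [10, 30, 100]  # arbitrary bin counts to compare
fig, axes = plt.subplots(1, len(bin_options), figsize=(12, 3))

bar_counts = []
for ax, n_bins in zip(axes, bin_options):
    # hist() returns the per-bin counts, the bin edges, and the drawn bars
    counts, edges, patches = ax.hist(synthetic_totals, bins=n_bins)
    ax.set_title(f'bins={n_bins}')
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    bar_counts.append(len(counts))

fig.tight_layout()
plt.close(fig)
```

Too few bins tend to hide the shape of the distribution, while too many make the plot noisy, so a few trial values are usually worth comparing.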

Before showing the plot, we set the title and the labels for the x axis and y axis. These small additions make our plot production-ready.
