Plotting two sets of values in a probability distribution

In this recipe, you'll learn how to create a probability distribution histogram of two variables. This plot comes in handy when you are trying to see how much overlap there is between two variables in your data.

How to do it…

  1. To plot two sets of values in a probability distribution, begin by importing all the required libraries. To show the matplotlib plots in IPython Notebook, we will use an IPython magic function which starts with %:
    %matplotlib inline
    import pandas as pd
    import numpy as np
    from pymongo import MongoClient
    import matplotlib as mpl
    import matplotlib.pyplot as plt
  2. Next, connect to MongoDB and run a query specifying the five fields to be retrieved from the MongoDB data:
    client = MongoClient('localhost', 27017)
    db = client.pythonbicookbook
    collection = db.accidents
    fields = {'Date':1,
              'Police_Force':1,
              'Accident_Severity':1,
              'Number_of_Vehicles':1,
              'Number_of_Casualties':1}
    data = collection.find({}, fields)
  3. Next, create a DataFrame from the results of the query:
    accidents = pd.DataFrame(list(data))
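  4. After that, create frequency tables of the casualty and vehicle counts per date (the plot in the next step needs both):
    casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
    vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': np.sum})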
  5. Next, plot the two DataFrames and render them inline:
    # density=True normalizes each histogram to unit area; older Matplotlib releases called this argument normed
    plt.hist(casualty_count['Number_of_Casualties'], bins=20, histtype='stepfilled', density=True, color='b', label='Casualties')
    plt.hist(vehicle_count['Number_of_Vehicles'], bins=20, histtype='stepfilled', density=True, color='r', alpha=0.5, label='Vehicles')
    plt.title("Casualties/Vehicles Histogram")
    plt.xlabel("Value")
    plt.ylabel("Probability")
    plt.legend()
    plt.show()
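If no MongoDB instance is at hand, the grouping step can be tried on a small hand-made DataFrame. The following is only a sketch: the rows are invented for illustration, and it passes the string 'sum' to agg, which newer pandas releases prefer over np.sum. It builds both frequency tables that the plotting step expects.

```python
import pandas as pd

# Hypothetical stand-in for the accidents collection: one row per accident.
accidents = pd.DataFrame({
    'Date': ['01/01/2015', '01/01/2015', '02/01/2015', '02/01/2015', '03/01/2015'],
    'Number_of_Vehicles': [2, 1, 3, 2, 1],
    'Number_of_Casualties': [1, 1, 4, 1, 2],
})

# Sum each column per date, mirroring the recipe's groupby/agg pattern.
casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': 'sum'})
vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': 'sum'})

print(casualty_count)
print(vehicle_count)
```

Each result is a DataFrame indexed by Date with a single column, which is why the plotting code selects casualty_count['Number_of_Casualties'] and vehicle_count['Number_of_Vehicles'].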

How it works…

If you've been reading this chapter from the very beginning, then this code will be very familiar until we get to the plotting of the two DataFrames. First, we create a second frequency table, this time of vehicle counts per date:

# Create a frequency table of vehicle counts
vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': np.sum})

Then we plot the casualty and vehicle frequency tables on the same axes:

# Plot the two dataframes
# density=True normalizes each histogram to unit area; older Matplotlib releases called this argument normed
plt.hist(casualty_count['Number_of_Casualties'], bins=20, histtype='stepfilled', density=True, color='b', label='Casualties')
plt.hist(vehicle_count['Number_of_Vehicles'], bins=20, histtype='stepfilled', density=True, color='r', alpha=0.5, label='Vehicles')
plt.title("Casualties/Vehicles Histogram")
plt.xlabel("Value")
plt.ylabel("Probability")
plt.legend()
plt.show()

Here, we add two histograms to the plot: one for the casualty count and one for the vehicle count. In both cases, we pass density=True (spelled normed=True in older Matplotlib releases) so that each histogram is normalized to a total area of 1, which lets us compare the two as probability distributions even though the raw counts differ. We then use the color argument to give each histogram a different color, give each one a label for the legend, and use histtype='stepfilled' to draw them as filled outlines rather than separate bars. Finally, we set the alpha value of the vehicle_count histogram to 0.5 so that it is semi-transparent and we can see where the two histograms overlap.
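To make the normalization concrete, here is a minimal sketch using np.histogram, which computes the same kind of binned densities without drawing anything. The two arrays are randomly generated stand-ins for the per-date totals, not the accident data. Unlike the two plt.hist calls in the recipe, which each choose their own bins, this sketch uses shared bin edges so that the overlap can be summed bin by bin; that is a choice made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-day totals standing in for the two frequency tables.
casualties = rng.poisson(lam=500, size=365)
vehicles = rng.poisson(lam=700, size=365)

# Shared bin edges so the two histograms can be compared bin by bin.
edges = np.histogram_bin_edges(np.concatenate([casualties, vehicles]), bins=20)
cas_density, _ = np.histogram(casualties, bins=edges, density=True)
veh_density, _ = np.histogram(vehicles, bins=edges, density=True)
widths = np.diff(edges)

# With density=True, bar height times bin width sums to 1 for each histogram.
cas_area = float(np.sum(cas_density * widths))
veh_area = float(np.sum(veh_density * widths))

# Overlapping area: 0 means no shared bins, 1 means identical histograms.
overlap = float(np.sum(np.minimum(cas_density, veh_density) * widths))
print(cas_area, veh_area, overlap)
```

The overlap value is a numeric counterpart to the region where the semi-transparent red histogram covers the blue one in the plot.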

The resulting histogram looks like the following image:
