Plotting two sets of values in a probability distribution

In this recipe, you'll learn how to create a probability distribution histogram of two variables. This plot comes in handy when you are trying to see how much overlap there is between two variables in your data.

How to do it…

To plot two sets of values in a probability distribution, begin by importing all the required libraries. To show the matplotlib plots in IPython Notebook, we will use an IPython magic function which starts with %:
```
%matplotlib inline
import pandas as pd
import numpy as np
from pymongo import MongoClient
import matplotlib as mpl
import matplotlib.pyplot as plt
```

Next, connect to MongoDB and run a query specifying the five fields to be retrieved from the MongoDB data:

```
client = MongoClient('localhost', 27017)
db = client.pythonbicookbook
collection = db.accidents
fields = {'Date':1,
          'Police_Force':1,
          'Accident_Severity':1,
          'Number_of_Vehicles':1,
          'Number_of_Casualties':1}
data = collection.find({}, fields)
```

Next, create a DataFrame from the results of the query:
```
accidents = pd.DataFrame(list(data))
```
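
To try the DataFrame step without a running MongoDB instance, the following minimal sketch uses a plain list of dictionaries as a stand-in for the query results. The documents and their values are hypothetical; only the field names come from the query above. Note that MongoDB returns the `_id` field as well unless the projection explicitly sets `'_id': 0`.

```python
import pandas as pd

# Hypothetical documents shaped like the query results; all values are made up.
docs = [
    {'_id': 1, 'Date': '01/01/2014', 'Police_Force': 1,
     'Accident_Severity': 3, 'Number_of_Vehicles': 2, 'Number_of_Casualties': 1},
    {'_id': 2, 'Date': '01/01/2014', 'Police_Force': 1,
     'Accident_Severity': 2, 'Number_of_Vehicles': 1, 'Number_of_Casualties': 1},
    {'_id': 3, 'Date': '02/01/2014', 'Police_Force': 4,
     'Accident_Severity': 3, 'Number_of_Vehicles': 3, 'Number_of_Casualties': 4},
]

# iter(docs) stands in for the pymongo cursor: list() drains it,
# and each dictionary becomes one row of the DataFrame.
accidents_demo = pd.DataFrame(list(iter(docs)))

print(accidents_demo.shape)
print(accidents_demo.columns.tolist())
```

Each key becomes a column, so the demo frame has the five requested fields plus `_id`.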

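After that, create frequency tables of casualty counts and vehicle counts. The plot below uses both, so casualty_count is built here in the same way as vehicle_count:

```
# Total casualties and total vehicles per date
casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': 'sum'})
vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': 'sum'})
```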
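
The grouping step can be checked on a tiny hand-made table. This is a sketch only: the three rows below are invented, and the `_demo` names are hypothetical so that they do not clash with the recipe's variables.

```python
import pandas as pd

# Invented rows: two accidents on the first date, one on the second.
accidents_demo = pd.DataFrame({
    'Date': ['01/01/2014', '01/01/2014', '02/01/2014'],
    'Number_of_Vehicles': [2, 1, 3],
    'Number_of_Casualties': [1, 1, 4],
})

# One row per date, holding the sum of the chosen column for that date.
vehicle_count_demo = accidents_demo.groupby('Date').agg({'Number_of_Vehicles': 'sum'})
casualty_count_demo = accidents_demo.groupby('Date').agg({'Number_of_Casualties': 'sum'})

print(vehicle_count_demo)
print(casualty_count_demo)
```

The result is a DataFrame indexed by `Date`, which is why the plotting code selects the column by name before passing it to `plt.hist`.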

Next, plot the two DataFrames and render them inline:

```
# density=True normalizes each histogram; older Matplotlib releases called this argument normed=True
plt.hist(casualty_count['Number_of_Casualties'], bins=20, histtype='stepfilled', density=True, color='b', label='Casualties')
plt.hist(vehicle_count['Number_of_Vehicles'], bins=20, histtype='stepfilled', density=True, color='r', alpha=0.5, label='Vehicles')
plt.title("Casualties/Vehicles Histogram")
plt.xlabel("Value")
plt.ylabel("Probability")
plt.legend()
plt.show()
```
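
One possible variation, not part of the original recipe: with `bins=20`, each call to `plt.hist` picks its own bin edges from its own data, so the bars of the two histograms do not line up. The sketch below passes one shared array of edges to both calls. The data is synthetic (Poisson draws with made-up means), and the `_demo` names and `shared_bins` are assumptions for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt

# Synthetic daily totals standing in for the two frequency tables.
rng = np.random.default_rng(0)
casualties_demo = rng.poisson(lam=500, size=365)
vehicles_demo = rng.poisson(lam=700, size=365)

# 21 edges give 20 bins that span both series.
low = min(casualties_demo.min(), vehicles_demo.min())
high = max(casualties_demo.max(), vehicles_demo.max())
shared_bins = np.linspace(low, high, 21)

fig, ax = plt.subplots()
_, edges_c, _ = ax.hist(casualties_demo, bins=shared_bins, histtype='stepfilled',
                        density=True, color='b', label='Casualties')
_, edges_v, _ = ax.hist(vehicles_demo, bins=shared_bins, histtype='stepfilled',
                        density=True, color='r', alpha=0.5, label='Vehicles')
ax.set_title("Casualties/Vehicles Histogram")
ax.set_xlabel("Value")
ax.set_ylabel("Probability")
ax.legend()
plt.close(fig)  # in a notebook, call plt.show() instead
```

Because both calls receive the same edges, a bar in one histogram covers exactly the same value range as the matching bar in the other.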

How it works…

If you've been reading this chapter from the very beginning, this code will look familiar up to the point where we build the vehicle counts table and plot the two DataFrames:

```
# Create a frequency table of vehicle counts
vehicle_count = accidents.groupby('Date').agg({'Number_of_Vehicles': 'sum'})
```

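First, we create a second frequency table, this time of vehicle counts, by grouping the accidents by Date and summing Number_of_Vehicles. The plotting code then reads: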

```
# Plot the two dataframes
# density=True normalizes each histogram; older Matplotlib releases called this argument normed=True
plt.hist(casualty_count['Number_of_Casualties'], bins=20, histtype='stepfilled', density=True, color='b', label='Casualties')
plt.hist(vehicle_count['Number_of_Vehicles'], bins=20, histtype='stepfilled', density=True, color='r', alpha=0.5, label='Vehicles')
plt.title("Casualties/Vehicles Histogram")
plt.xlabel("Value")
plt.ylabel("Probability")
plt.legend()
plt.show()
```

Next, we add two histograms to our plot, one for the casualty count and one for the vehicle count. In both cases, we normalize the histogram with density=True (spelled normed=True in older Matplotlib releases) so that the bars show a probability density instead of raw counts. We then use the color argument to make each histogram a different color, give each one a label for the legend, and use histtype='stepfilled' to draw them as filled outlines without lines between the bars. Finally, we set the alpha value of the vehicle_count histogram to 0.5 so that it is semi-transparent and we can see where the two histograms overlap.
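
To make the normalization concrete, here is a minimal sketch using `np.histogram`, which accepts the same `bins` and `density` arguments. With density normalization, bar height times bin width is the probability of a bin, so the areas sum to 1. The overlap figure at the end is an illustrative addition, not part of the recipe; the data is synthetic and all names are hypothetical.

```python
import numpy as np

# Synthetic samples standing in for the two columns being compared.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=500, scale=50, size=1000)
sample_b = rng.normal(loc=600, scale=50, size=1000)

# Shared edges so that both densities are defined on the same bins.
edges = np.linspace(min(sample_a.min(), sample_b.min()),
                    max(sample_a.max(), sample_b.max()), 21)
widths = np.diff(edges)

density_a, _ = np.histogram(sample_a, bins=edges, density=True)
density_b, _ = np.histogram(sample_b, bins=edges, density=True)

# Height * width is the probability mass of each bin; the total is 1.
area_a = np.sum(density_a * widths)
area_b = np.sum(density_b * widths)

# Shared area under both histograms: 0 means disjoint, 1 means identical.
overlap = np.sum(np.minimum(density_a, density_b) * widths)

print(area_a, area_b, overlap)
```

The heights themselves can exceed 1 when bins are narrow; only the areas are probabilities.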

The resulting histogram looks like the following image:
