Now that we have a clean dataset to work with, we'll look at the next phase of business intelligence—exploring the data. In this chapter, we will cover:
Creating Pandas DataFrames
  Creating a Pandas DataFrame from a MongoDB query
  Creating a Pandas DataFrame from a CSV file
  Creating a Pandas DataFrame from an Excel file
  Creating a Pandas DataFrame from a JSON file
Creating a data quality report
Generating summary statistics
  For the entire dataset
    Generating summary statistics for the entire dataset
    Generating summary statistics for object type columns
    Getting the mode of the entire dataset
  For a single column
    Generating summary statistics for a single column
    Getting a count of unique values for a single column
    Getting the minimum and maximum values for a single column
    Generating quantiles for a single column
    Getting the mean, median, mode and range for a single column
Generating frequency tables
  Generating a frequency table for a single column by date
  Generating a frequency table of two variables
Creating basic charts
  Creating a histogram for a column
  Plotting the data as a probability distribution
  Plotting a cumulative distribution function
  Showing the histogram as a stepped line
  Plotting two sets of values in a probability distribution
  Creating a customized box plot with whiskers
  Creating a basic bar chart for a single column over time
Creating a Pandas DataFrame from a MongoDB query
To create a Pandas DataFrame from a MongoDB query, we will leverage our knowledge of creating MongoDB queries to get the information that we want.
Getting ready
Before running a query against MongoDB, decide what information you want to look at. A query filter saves time by retrieving only the documents you need, which matters most when a collection holds millions or billions of records.
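As a rough illustration of narrowing a query before it runs, the sketch below builds a filter and a projection as plain dictionaries. The helper name build_accident_query and the field names Date and Accident_Severity are assumptions made for this example; only Day_of_Week appears in this recipe.

```python
def build_accident_query(day_of_week, fields):
    """Return a (filter, projection) pair suitable for collection.find()."""
    # The filter keeps only documents whose Day_of_Week equals the given value.
    query_filter = {"Day_of_Week": day_of_week}
    # The projection lists the fields to return; _id is excluded explicitly
    # because MongoDB includes it by default.
    projection = {"_id": 0}
    for field in fields:
        projection[field] = 1
    return query_filter, projection


# Hypothetical field names, chosen only for illustration.
query_filter, projection = build_accident_query(6, ["Date", "Accident_Severity"])
print(query_filter)
print(projection)
# With a live connection, this pair could be passed as
# collection.find(query_filter, projection)
```

Restricting both the documents (the filter) and the fields (the projection) keeps the amount of data pulled into memory as small as the analysis allows.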
How to do it…
The following code can be run in an IPython Notebook or copied/pasted into a standalone Python script:
First, import the Python libraries that we need:
import pandas as pd
from pymongo import MongoClient
Next, create a connection to the MongoDB database:
client = MongoClient('localhost', 27017)
After that, use the connection we just created to select the database and collection to query:
db = client.pythonbicookbook
collection = db.accidents
Next, run a query and put the results into an object called data:
data = collection.find({"Day_of_Week": 6})
Use Pandas to create a DataFrame from the query results:
accidents = pd.DataFrame(list(data))
Finally, show the top five results of the DataFrame:
accidents.head()
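The steps above can be gathered into one reusable function. This is a minimal sketch, not the only way to structure it: query_to_dataframe and FakeCollection are hypothetical names, Number_of_Vehicles is an assumed field, and the stand-in collection supports only simple equality filters so that the example runs without a MongoDB server.

```python
import pandas as pd


def query_to_dataframe(collection, query_filter):
    """Run a find() query and load every matching document into a DataFrame."""
    cursor = collection.find(query_filter)
    # list() iterates the cursor and pulls each document into memory.
    return pd.DataFrame(list(cursor))


class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo collection."""

    def __init__(self, documents):
        self._documents = documents

    def find(self, query_filter):
        # Yield documents whose fields equal every value in the filter.
        return (
            doc for doc in self._documents
            if all(doc.get(key) == value for key, value in query_filter.items())
        )


sample = FakeCollection([
    {"Day_of_Week": 6, "Number_of_Vehicles": 2},
    {"Day_of_Week": 3, "Number_of_Vehicles": 1},
    {"Day_of_Week": 6, "Number_of_Vehicles": 1},
])
accidents = query_to_dataframe(sample, {"Day_of_Week": 6})
print(accidents.head())
```

Against a real database, the same function could be called with the collection selected earlier, for example query_to_dataframe(client.pythonbicookbook.accidents, {"Day_of_Week": 6}). In that case the resulting DataFrame also contains MongoDB's _id column unless a projection excludes it.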
How it works…
As we have seen in previous recipes in which we queried MongoDB, we create our query filter and then run it. The biggest difference is this bit of code:
accidents = pd.DataFrame(list(data))
The data variable is a cursor object. By definition, a cursor is a pointer to a result set. To retrieve the data that the cursor points to, you have to iterate through it. Calling list(data) iterates through the cursor for us and builds a new Python list from the underlying documents, which pd.DataFrame() then uses to fill the DataFrame.
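To make the iteration behavior concrete, the sketch below uses a plain Python generator as a hypothetical stand-in for a cursor (fake_cursor is not part of pymongo). It shows that list() consumes the results and that a second pass over the same object finds nothing left.

```python
def fake_cursor(documents):
    """Hypothetical stand-in for a MongoDB cursor: yields one document at a time."""
    for doc in documents:
        yield doc


data = fake_cursor([{"Day_of_Week": 6}, {"Day_of_Week": 6}])

first_pass = list(data)   # iterates the cursor and collects every document
second_pass = list(data)  # the cursor is already exhausted, so nothing is left

print(len(first_pass))
print(len(second_pass))
```

A real pymongo cursor behaves in a similar way: once it has been iterated, it yields no further documents unless its rewind() method is called. For that reason it is usually simplest to build the DataFrame from the list once and work with the DataFrame from then on.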