Chapter 3. Learning What Your Data Truly Holds

Now that we have a clean dataset to work with, we'll look at the next phase of business intelligence—exploring the data. In this chapter, we will cover:

Creating Pandas DataFrames
- Creating a Pandas DataFrame from a MongoDB query
- Creating a Pandas DataFrame from a CSV file
- Creating a Pandas DataFrame from an Excel file
- Creating a Pandas DataFrame from a JSON file
Creating a data quality report
Generating summary statistics
- For the entire dataset
  - Generating summary statistics for the entire dataset
  - Generating summary statistics for object type columns
  - Getting the mode of the entire dataset
- For a single column
  - Generating summary statistics for a single column
  - Getting a count of unique values for a single column
  - Getting the minimum and maximum values for a single column
  - Generating quantiles for a single column
  - Getting the mean, median, mode and range for a single column
Generating frequency tables
- Generating a frequency table for a single column by date
- Generating a frequency table of two variables
Creating basic charts
- Creating a histogram for a column
- Plotting the data as a probability distribution
- Plotting a cumulative distribution function
- Showing the histogram as a stepped line
- Plotting two sets of values in a probability distribution
- Creating a customized box plot with whiskers
- Creating a basic bar chart for a single column over time

Creating a Pandas DataFrame from a MongoDB query

To create a Pandas DataFrame from a MongoDB query, we will leverage our knowledge of creating MongoDB queries to get the information that we want.

Getting ready

Before running a query against MongoDB, decide which information you need. A query filter saves time because only the matching documents are retrieved. This matters a great deal when a collection holds millions or billions of documents.
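As a rough sketch of what narrowing a query can look like, the snippet below pairs a filter with a projection, so that only the chosen fields come back for each matching document. The Day_of_Week filter matches the one used in this recipe; the Accident_Severity field and the fetch_filtered helper are assumptions made for illustration, so substitute names from your own collection.

```python
# Filter: only documents where Day_of_Week equals 6.
query_filter = {"Day_of_Week": 6}

# Projection: 1 includes a field, 0 excludes it. Excluding _id alongside
# included fields is allowed. Accident_Severity is a hypothetical field name.
projection = {"_id": 0, "Day_of_Week": 1, "Accident_Severity": 1}


def fetch_filtered(collection):
    """Return a cursor over the matching documents, limited to the projected fields.

    `collection` is expected to be a pymongo collection (or anything with a
    compatible find(filter, projection) method).
    """
    return collection.find(query_filter, projection)
```

With the connection from this recipe in place, fetch_filtered(db.accidents) would return a cursor that can be passed to list() in the same way as the unprojected query.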

How to do it…

The following code can be run in an IPython Notebook or copied/pasted into a standalone Python script:

To create a Pandas DataFrame from a MongoDB query, the first thing we need to do is import the Python libraries that we need:
```
import pandas as pd
from pymongo import MongoClient
```
Next, create a connection to the MongoDB database:
```
client = MongoClient('localhost', 27017)
```
After that, use the connection we just created, and select the database and collection to query:
```
db = client.pythonbicookbook
collection = db.accidents
```
Next, run a query and assign the cursor it returns to a variable called data:
```
data = collection.find({"Day_of_Week": 6})
```
Use Pandas to create a DataFrame from the query results:
```
accidents = pd.DataFrame(list(data))
```
Finally, show the top five results of the DataFrame:
```
accidents.head()
```

How it works…

As we have seen in previous recipes in which we queried MongoDB, we create our query filter and then run it. The biggest difference is this bit of code:

accidents = pd.DataFrame(list(data))

The data variable holds a cursor object. By definition, a cursor is a pointer to a result set, so to retrieve the documents it points to, you have to iterate through it. Calling list(data) does that iteration for us and builds a Python list of the documents. Pandas then turns that list into a DataFrame, with one row per document.
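To make the cursor behavior concrete without needing a running MongoDB server, here is a minimal sketch that uses a plain Python generator as a stand-in for the cursor. The fake_cursor function and the documents it yields are made up for illustration; a real cursor comes from collection.find().

```python
def fake_cursor():
    """Stand-in for a MongoDB cursor: yields one document (a dict) at a time."""
    # Hypothetical documents, not real rows from the accidents collection.
    yield {"Accident_Index": "A1", "Day_of_Week": 6}
    yield {"Accident_Index": "A2", "Day_of_Week": 6}


data = fake_cursor()

# list() walks the cursor and collects every document into a list of dicts.
# This list is what gets handed to pd.DataFrame in the recipe.
rows = list(data)

# The cursor has now been consumed, so iterating it again yields nothing.
leftover = list(data)

print(len(rows))      # 2
print(len(leftover))  # 0
```

The second list() call comes back empty because the stand-in has already been walked to the end. A pymongo cursor is consumed in the same way, so if you need the documents twice, keep the list around or run the query again.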
