Chapter 3. Learning What Your Data Truly Holds

Now that we have a clean dataset to work with, we'll look at the next phase of business intelligence—exploring the data. In this chapter, we will cover:

  • Creating Pandas DataFrames
    • Creating a Pandas DataFrame from a MongoDB query
    • Creating a Pandas DataFrame from a CSV file
    • Creating a Pandas DataFrame from an Excel file
    • Creating a Pandas DataFrame from a JSON file
  • Creating a data quality report
  • Generating summary statistics
    • For the entire dataset
      • Generating summary statistics for the entire dataset
      • Generating summary statistics for object type columns
      • Getting the mode of the entire dataset
    • For a single column
      • Generating summary statistics for a single column
      • Getting a count of unique values for a single column
      • Getting the minimum and maximum values for a single column
      • Generating quantiles for a single column
      • Getting the mean, median, mode and range for a single column
  • Generating frequency tables
    • Generating a frequency table for a single column by date
    • Generating a frequency table of two variables
  • Creating basic charts
    • Creating a histogram for a column
    • Plotting the data as a probability distribution
    • Plotting a cumulative distribution function
    • Showing the histogram as a stepped line
    • Plotting two sets of values in a probability distribution
    • Creating a customized box plot with whiskers
    • Creating a basic bar chart for a single column over time

Creating a Pandas DataFrame from a MongoDB query

To create a Pandas DataFrame from a MongoDB query, we will leverage our knowledge of creating MongoDB queries to get the information that we want.

Getting ready

Before running a query against MongoDB, determine the information you want to look at. A query filter saves time because only the documents you want are retrieved. This matters most when the collection holds millions or billions of documents.

How to do it…

The following code can be run in an IPython Notebook or copied/pasted into a standalone Python script:

  1. To create a Pandas DataFrame from a MongoDB query, first import the Python libraries that we need:
    import pandas as pd
    from pymongo import MongoClient
  2. Next, create a connection to the MongoDB database:
    client = MongoClient('localhost', 27017)
  3. After that, use the connection we just created to select the database and collection to query:
    db = client.pythonbicookbook
    collection = db.accidents
  4. Next, run a query and put the results into an object called data:
    data = collection.find({"Day_of_Week": 6})
  5. Use Pandas to create a DataFrame from the query results:
    accidents = pd.DataFrame(list(data))
  6. Finally, show the top five results of the DataFrame:
    accidents.head()
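
The steps above can be gathered into one small helper. This is only a sketch under stated assumptions: the function name query_to_dataframe and the StubCollection class are made up for illustration, the stub supports simple equality filters only, and the sample rows (including the Accident_Index field) are invented so that the example runs without a MongoDB server.

```python
import pandas as pd


def query_to_dataframe(collection, query_filter):
    """Run a find() query and load every matching document into a DataFrame."""
    cursor = collection.find(query_filter)
    # list() iterates the cursor and pulls each document out of it
    return pd.DataFrame(list(cursor))


class StubCollection:
    """Hypothetical stand-in for a PyMongo collection (equality filters only)."""

    def __init__(self, documents):
        self._documents = documents

    def find(self, query_filter):
        # Yield the documents whose fields equal every value in the filter
        return (
            doc for doc in self._documents
            if all(doc.get(key) == value for key, value in query_filter.items())
        )


# Invented sample rows; the real collection holds the accidents data
sample = StubCollection([
    {"Accident_Index": "A1", "Day_of_Week": 6},
    {"Accident_Index": "A2", "Day_of_Week": 3},
    {"Accident_Index": "A3", "Day_of_Week": 6},
])

accidents = query_to_dataframe(sample, {"Day_of_Week": 6})
print(accidents.head())
```

With a real database, the collection object from step 3 would be passed in place of the stub.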

How it works…

As we have seen in previous recipes in which we queried MongoDB, we create our query filter and then run it. The biggest difference is this bit of code:

    accidents = pd.DataFrame(list(data))

Here, data is a cursor object. A cursor is a pointer to a result set; to retrieve the data that it points to, you have to iterate through it. Calling list(data) iterates through the cursor for us and builds a new Python list from the retrieved documents, which Pandas then uses to fill the DataFrame.
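
To make the iteration behaviour concrete, the following sketch uses a plain Python generator as a stand-in for the cursor. The fake_cursor function is hypothetical and much simpler than a real cursor; it only illustrates that the documents come out one at a time and that a second pass over the same iterator finds nothing left.

```python
def fake_cursor():
    """Hypothetical stand-in for a cursor: yields one document at a time."""
    for day in (6, 6, 6):
        yield {"Day_of_Week": day}


data = fake_cursor()
rows = list(data)       # first pass retrieves every document
leftover = list(data)   # the iterator has already been consumed

print(len(rows), len(leftover))
```

The practical point, under this simplification, is to build the list once and reuse it, rather than iterating over the same cursor object again.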
