Generating a frequency table for a single column by date

A frequency table is another way of summarizing data; it shows the number of times a value occurs. In this recipe, we will create a frequency table of casualties by date.

Getting ready

To use this recipe, you need to have MongoDB running, and to have the accidents data imported.

How to do it…

  1. To generate a frequency table for a single column by date, begin by importing the Python libraries that we need:
    import pandas as pd
    import numpy as np
    from pymongo import MongoClient
  2. Next, connect to MongoDB, and run a query specifying the five fields to be retrieved from the MongoDB data:
    client = MongoClient('localhost', 27017)
    db = client.pythonbicookbook
    collection = db.accidents
    fields = {'Date':1,
              'Police_Force':1,
              'Accident_Severity':1,
              'Number_of_Vehicles':1,
              'Number_of_Casualties':1}
    data = collection.find({}, fields)
  3. Next, create a DataFrame from the results of the query:
    accidents = pd.DataFrame(list(data))
  4. Finally, use the groupby() and agg() methods of the DataFrame and show the results:
    casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
    casualty_count

How it works…

After importing the Pandas, numpy, and pymongo libraries, we create a connection to MongoDB, selecting the accidents collection from the Python Business Intelligence Cookbook database:

fields = {'Date':1,
          'Police_Force':1,
          'Accident_Severity':1,
          'Number_of_Vehicles':1,

Next we specify the five fields we want to use in our query. We are specifying fields as we do not need to retrieve all of the data in the collection in order to produce the frequency table. This makes our query run a lot faster:

data = collection.find({}, fields)
accidents = pd.DataFrame(list(data))

After that, we run our query and put the results into the data variable. We then create a pandas DataFrame from our data:

casualty_count = accidents.groupby('Date').agg({'Number_of_Casualties': np.sum})
casualty_count

Finally, we create our frequency table using the groupby() and agg() functions of Pandas. groupby(), just as in SQL, allows us to group the results by a specified field, in this case 'Date'. agg() computes the summary statistics that you provide. In this example, we are telling agg() to use the numpy sum function. The results look as shown in the following image:

How it works…
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset