Performing categorical variable analysis

Categorical variable analysis helps us understand data whose values fall into a fixed set of categories rather than on a numeric scale. In this recipe, the categorical variable is the day of the week. Although it is stored as a number, it is a category rather than a numeric measurement: the creators of the dataset have already converted the name of each day to a number. If they had not done this, we could use Pandas to do it for us, and then perform our analysis.
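
As a rough sketch of that conversion, assume a hypothetical column of day names. The Stats19 file already stores numbers, so the Day_Name column, the sample values, and the Sunday-first ordering below are all illustrative assumptions. An ordered Pandas categorical can then map each name to an integer code:

```python
import pandas as pd

# Hypothetical data: day names instead of the numeric codes in the real file.
days = pd.DataFrame({"Day_Name": ["Monday", "Sunday", "Friday", "Monday"]})

# Assumed ordering (Sunday first); adjust it to match your dataset's convention.
day_order = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]

# An ordered categorical keeps the labels but assigns each one an integer code.
day_cat = pd.Categorical(days["Day_Name"], categories=day_order, ordered=True)

# Codes are zero-based, so add 1 to get a 1-7 numbering.
days["Day_of_Week"] = day_cat.codes + 1
```

Any name missing from day_order would receive the code -1 (0 after the shift), so it is worth checking for unexpected values before analysing the result.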

In this recipe, we are going to plot the distribution of casualties by the day of the week.

How to do it…

  1. First, import the Python libraries that you need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Next, define a variable for the accidents data file, import the data, and view the top five rows:
    accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
    accidents = pd.read_csv(accidents_data_file,
                            sep=',',
                            header=0,
                            index_col=False,
                            parse_dates=True,
                            on_bad_lines='warn',  # pandas 1.3+: skip malformed lines with a warning
                            skip_blank_lines=True,
                            low_memory=False
                            )
    accidents.head()
  3. After that, get a full list of the columns and data types in the DataFrame:
    accidents.dtypes
  4. With the data imported, create a new Series of the casualty counts:
    casualty_count = accidents.groupby('Day_of_Week').Number_of_Casualties.count()
  5. Next, create a Series of the casualty probabilities:
    casualty_probability = accidents.groupby('Day_of_Week').Number_of_Casualties.sum()/accidents.groupby('Day_of_Week').Number_of_Casualties.count()
  6. After that, create a new plot and add a figure:
    fig = plt.figure(figsize=(8,4))
  7. Next, add a subplot:
    ax1 = fig.add_subplot(121)
  8. Now label the x and y axes and provide a title:
    ax1.set_xlabel('Day of Week')
    ax1.set_ylabel('Casualty Count')
    ax1.set_title("Casualties by Day of Week")
  9. Then create the plot of the casualty counts:
    casualty_count.plot(kind='bar')
  10. Add a second subplot for the casualty probabilities:
    ax2 = fig.add_subplot(122)
  11. Create the plot specifying its type:
    casualty_probability.plot(kind = 'bar')
  12. Label the x and y axes, and add a title:
    ax2.set_xlabel('Day of Week')
    ax2.set_ylabel('Probability of Casualties')
    ax2.set_title("Probability of Casualties by Day of Week")
  13. Once you have run all the preceding commands, the plot is rendered inline if you are using IPython Notebook.

How it works…

The first thing that we need to do is import all the Python libraries that we'll need. The last line of the code—%matplotlib inline—is required only if you are running the code in IPython Notebook:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Next, we define a variable for the full path to our data file. It's recommended to do this so that if the location of your data file changes, you have to update only one line of code.

accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'

Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file:

accidents = pd.read_csv(accidents_data_file,
                        sep=',',
                        header=0,
                        index_col=False,
                        parse_dates=True,
                        on_bad_lines='warn',  # pandas 1.3+: skip malformed lines with a warning
                        skip_blank_lines=True,
                        low_memory=False
                        )

If using IPython Notebook, use the head() function to view the top five rows of the DataFrame. This helps to ensure that the data is imported correctly:

accidents.head()

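With the data imported, we need to create a Series containing the casualty counts. To do that, we use the groupby() function of the Pandas DataFrame, grouping on the Day_of_Week column. We then aggregate the data using the count() function on the Number_of_Casualties column. Note that count() returns the number of non-null rows in each group, so each value is the number of accident records for that day rather than the total number of people injured: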

casualty_count = accidents.groupby('Day_of_Week').Number_of_Casualties.count()
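
To see what count() returns compared with sum(), the following sketch uses a small made-up DataFrame. The values are invented for illustration and are not taken from the Stats19 data:

```python
import pandas as pd

# Invented toy data: three accident records on day 1 and one on day 2.
toy = pd.DataFrame({
    "Day_of_Week": [1, 1, 1, 2],
    "Number_of_Casualties": [1, 3, 1, 2],
})

grouped = toy.groupby("Day_of_Week")["Number_of_Casualties"]

records_per_day = grouped.count()     # non-null rows per day: 3 and 1
casualties_per_day = grouped.sum()    # casualties added up per day: 5 and 2
```

If the total number of casualties per day is what you want on the chart, sum() is the aggregation to plot; count() shows how many accidents were recorded.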

Next, we create a second Series to hold our casualty probability data. For each day of the week, we sum the Number_of_Casualties column and divide the result by the number of records for that day, obtained by calling count() on the same column. Strictly speaking, this ratio is the average number of casualties per accident record, so it can exceed 1:

casualty_probability = accidents.groupby('Day_of_Week').Number_of_Casualties.sum()/accidents.groupby('Day_of_Week').Number_of_Casualties.count()

That can be confusing, as we are doing all of that in a single line of code. Another way to look at it is as a per-day ratio: the total number of casualties on top, and the number of accident records underneath.
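
The single-line calculation can also be split into named steps. This is a minimal sketch on invented toy data, and the intermediate variable names are assumptions added for readability. It also shows that the result matches the group mean:

```python
import pandas as pd

# Invented toy data standing in for the accidents DataFrame.
toy = pd.DataFrame({
    "Day_of_Week": [1, 1, 1, 2],
    "Number_of_Casualties": [1, 3, 2, 4],
})

# Group once and reuse the grouped column for both aggregations.
grouped = toy.groupby("Day_of_Week")["Number_of_Casualties"]

total_casualties = grouped.sum()      # day 1: 6, day 2: 4
accident_records = grouped.count()    # day 1: 3, day 2: 1

# Same ratio as the one-line version in the recipe.
casualty_probability = total_casualties / accident_records

# The ratio is the mean number of casualties per record.
average_casualties = grouped.mean()
```

Grouping once avoids repeating the groupby() call, and mean() gives the same numbers in a single aggregation.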

With all the data in place, we now create a new figure, specifying its size as (width, height) in inches:

fig = plt.figure(figsize=(8,4))

After that, we add our first subplot, passing in the arguments for:

  • nrows: The number of rows
  • ncols: The number of columns
  • plot_number: The position of the plot that we are adding, counted from 1 (named index in current Matplotlib documentation)

This line of code gives instructions to add one row, two columns, and to add our first plot. We are specifying two columns, as we will be creating two plots and want to show them side-by-side:

ax1 = fig.add_subplot(121)
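
The three-digit argument is shorthand for three separate integers. The sketch below creates both subplots, one in each form, to show that they describe the same grid. The matplotlib.use("Agg") line is an assumption that lets the code run without a display; leave it out in IPython Notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 4))

# Three-digit shorthand: 1 row, 2 columns, first position.
ax1 = fig.add_subplot(121)

# The same grid written as separate integers, second position.
ax2 = fig.add_subplot(1, 2, 2)

# Both axes now belong to the figure, side by side.
print(len(fig.axes))
```

The separate-integer form is the only option once a grid has more than nine positions, since the shorthand allows one digit per value.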

Next, add the labels for the x axis, the y axis, and a title. This makes your plot understandable to others:

ax1.set_xlabel('Day of Week')
ax1.set_ylabel('Casualty Count')
ax1.set_title("Casualties by Day of Week")

Now create the first plot using the plot() function of the Series, specifying the type of plot we want, a bar chart in this case:

casualty_count.plot(kind='bar')

We're now finished with the first plot and can add our second. Much like the first time that we added a plot, we now use the same add_subplot() function; however, this time we specify a 2 as the third digit, as this is the second plot we're adding. Matplotlib will put this second plot into the second column:

ax2 = fig.add_subplot(122)

Create the second plot, this time calling plot() on the casualty_probability Series, specifying that it should be a bar chart:

casualty_probability.plot(kind = 'bar')

Finally, we add labels to the x and y axes, and add a title.

ax2.set_xlabel('Day of Week')
ax2.set_ylabel('Probability of Casualties')
ax2.set_title("Probability of Casualties by Day of Week")
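
The recipe relies on each plot() call drawing on the most recently added subplot. A possible alternative, sketched here with invented values in place of the real grouped results, is to create both axes at once with plt.subplots() and pass each one explicitly through the ax argument. Labels are set after plotting, because Pandas can set the x-axis label from the index name when it draws the bars:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values standing in for the grouped results from the recipe.
casualty_count = pd.Series([10, 12, 9], index=[1, 2, 3])
casualty_probability = pd.Series([1.2, 1.4, 1.3], index=[1, 2, 3])

# One row, two columns: both axes are created in a single call.
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))

# Draw each Series on a named axis instead of relying on the current one.
casualty_count.plot(kind="bar", ax=ax1)
ax1.set_xlabel("Day of Week")
ax1.set_ylabel("Casualty Count")
ax1.set_title("Casualties by Day of Week")

casualty_probability.plot(kind="bar", ax=ax2)
ax2.set_xlabel("Day of Week")
ax2.set_ylabel("Probability of Casualties")
ax2.set_title("Probability of Casualties by Day of Week")

# Reduce overlap between the two panels' labels.
fig.tight_layout()
```

Passing ax explicitly keeps the code correct even if the order of the plotting statements changes later.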

If using IPython Notebook, the plot will be displayed inline as two bar charts side by side: casualty counts by day of the week on the left, and casualty probabilities on the right.
