Categorical variable analysis helps us understand categorical data, that is, data whose values are labels rather than measurements. In this recipe, we're using the day of the week. Although it is stored as a number, it is a category rather than a truly numeric quantity: the creators of the dataset have already converted the name of each day to a number. If they had not done this, we could use Pandas to do the conversion for us and then perform our analysis.
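As a rough illustration of that conversion, the following sketch maps day names to numbers with a Pandas categorical. The sample values and the Sunday = 1 through Saturday = 7 ordering are assumptions made for the example, not details taken from the dataset.

```python
import pandas as pd

# Hypothetical sample data: day names stored as plain strings
days = pd.Series(['Monday', 'Sunday', 'Wednesday', 'Monday'])

# Assumed ordering for this sketch: Sunday = 1 through Saturday = 7
day_order = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
             'Thursday', 'Friday', 'Saturday']

# Build an ordered categorical so that every name has a fixed position
day_categories = pd.Categorical(days, categories=day_order, ordered=True)

# .codes is zero-based, so add 1 to get codes in the range 1..7
day_numbers = pd.Series(day_categories.codes + 1, index=days.index)
print(day_numbers.tolist())
```

Listing the categories explicitly keeps the numbering stable even when some days are missing from the data.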
In this recipe, we are going to plot the distribution of casualties by the day of the week.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
accidents = pd.read_csv(accidents_data_file,
                        sep=',',
                        header=0,
                        index_col=False,
                        parse_dates=True,
                        tupleize_cols=False,
                        error_bad_lines=False,
                        warn_bad_lines=True,
                        skip_blank_lines=True,
                        low_memory=False)
accidents.head()
accidents.dtypes
casualty_count = accidents.groupby('Day_of_Week').Number_of_Casualties.count()
casualty_probability = accidents.groupby('Day_of_Week').Number_of_Casualties.sum()/accidents.groupby('Day_of_Week').Number_of_Casualties.count()
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Day of Week')
ax1.set_ylabel('Casualty Count')
ax1.set_title("Casualties by Day of Week")
casualty_count.plot(kind='bar')
ax2 = fig.add_subplot(122)
casualty_probability.plot(kind = 'bar')
ax2.set_xlabel('Day of Week')
ax2.set_ylabel('Probability of Casualties')
ax2.set_title("Probability of Casualties by Day of Week")
The first thing that we need to do is import all the Python libraries that we'll need. The last line of the code, %matplotlib inline, is required only if you are running the code in IPython Notebook:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Next, we define a variable for the full path to our data file. It's recommended to do this so that if the location of your data file changes, you have to update only one line of code.
accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file:
accidents = pd.read_csv(accidents_data_file,
                        sep=',',
                        header=0,
                        index_col=False,
                        parse_dates=True,
                        tupleize_cols=False,
                        error_bad_lines=False,
                        warn_bad_lines=True,
                        skip_blank_lines=True,
                        low_memory=False)
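Recent Pandas releases no longer accept some of the arguments used above: tupleize_cols was removed in Pandas 1.0, and error_bad_lines and warn_bad_lines were replaced by a single on_bad_lines argument (available from Pandas 1.3). The sketch below shows a roughly equivalent call under that assumption. The in-memory CSV text and the Accident_Index column are made-up stand-ins for the real accidents file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the accidents CSV file, including one blank line
csv_text = (
    "Accident_Index,Day_of_Week,Number_of_Casualties\n"
    "A1,1,2\n"
    "A2,3,1\n"
    "\n"
    "A3,3,4\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=',',
    header=0,
    index_col=False,
    on_bad_lines='warn',    # warn about malformed rows and skip them
    skip_blank_lines=True,  # the empty line above is ignored
    low_memory=False,
)
print(sample.shape)
```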
If using IPython Notebook, use the head() function to view the top five rows of the DataFrame. This helps to ensure that the data is imported correctly:
accidents.head()
With the data imported, we need to create a Series containing the casualty counts. To do that, we use the groupby() function of the Pandas DataFrame, grouping on the Day_of_Week column. We then aggregate the data using the count() function on the Number_of_Casualties column, which counts the non-null entries (one per accident record) for each day:
casualty_count = accidents.groupby('Day_of_Week').Number_of_Casualties.count()
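To make clear what count() measures, here is a small sketch on a made-up DataFrame (the values are invented for illustration). It contrasts count(), which tallies the rows in each group, with sum(), which adds up the casualties reported on those rows.

```python
import pandas as pd

# Hypothetical data: five accident records spread over two days
toy = pd.DataFrame({
    'Day_of_Week': [1, 1, 2, 2, 2],
    'Number_of_Casualties': [1, 3, 2, 1, 1],
})

grouped = toy.groupby('Day_of_Week').Number_of_Casualties

# count(): number of non-null records per day
records_per_day = grouped.count()

# sum(): total casualties reported per day
casualties_per_day = grouped.sum()

print(records_per_day.to_dict())
print(casualties_per_day.to_dict())
```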
Next, we create a second Series to hold our casualty probability data. We group by the Day_of_Week column and sum the Number_of_Casualties column, then divide that total by the count() of the same column for each day. The result is the average number of casualties per accident record for each day, which this recipe uses as its casualty probability:
casualty_probability = accidents.groupby('Day_of_Week').Number_of_Casualties.sum()/accidents.groupby('Day_of_Week').Number_of_Casualties.count()
That can be confusing as we are doing all of that in a single line of code. Another way to look at it is like this:
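One possible breakdown splits the single line into named steps. This is a minimal sketch on a made-up DataFrame, and the variable names are chosen for the example only.

```python
import pandas as pd

# Hypothetical data: five accident records spread over two days
toy = pd.DataFrame({
    'Day_of_Week': [1, 1, 2, 2, 2],
    'Number_of_Casualties': [1, 3, 2, 1, 1],
})

# Step 1: group the casualty column by day of the week
grouped = toy.groupby('Day_of_Week').Number_of_Casualties

# Step 2: total casualties per day
total_casualties = grouped.sum()

# Step 3: number of accident records per day
record_count = grouped.count()

# Step 4: divide the two Series; Pandas aligns them on the day index
ratio = total_casualties / record_count

# With no missing values, this is the same as the per-day mean
per_day_mean = grouped.mean()
print(ratio)
```

Grouping once and reusing the grouped object avoids repeating the groupby() call on both sides of the division.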
With all the data in place, we now create a new figure, specifying its size as (width, height) in inches:
fig = plt.figure(figsize=(8,4))
After that, we add our first subplot, passing in the arguments for:
nrows: The number of rows
ncols: The number of columns
plot_number: The plot that we are adding
This line of code gives instructions to add one row, two columns, and to add our first plot. We are specifying two columns, as we will be creating two plots and want to show them side-by-side:
ax1 = fig.add_subplot(121)
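The three-digit argument is shorthand for three separate integers. The sketch below creates one subplot with each form to show that they describe the same 1 x 2 grid. The Agg backend is selected only so that the example runs without a display, and the names left and right are chosen for the example.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 4))

# Three-digit shorthand: 1 row, 2 columns, plot number 1
left = fig.add_subplot(121)

# The same grid written as separate integers: plot number 2
right = fig.add_subplot(1, 2, 2)

# Both axes belong to the same figure and sit side by side
print(len(fig.axes))
```

The separate-integer form is the only option once a grid dimension or plot number reaches 10 or more.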
Next, add the labels for the x axis, the y axis, and a title. This makes your plot understandable to others:
ax1.set_xlabel('Day of Week')
ax1.set_ylabel('Casualty Count')
ax1.set_title("Casualties by Day of Week")
Now create the first plot using the plot() function of the casualty_count Series, specifying the type of plot we want, a bar chart in this case:
casualty_count.plot(kind='bar')
We're now finished with the first plot and can add our second. Much like the first time that we added a plot, we now use the same add_subplot() function; however, this time we specify a 2 as the third digit, as this is the second plot we're adding. Matplotlib will put this second plot into the second column:
ax2 = fig.add_subplot(122)
Create the second plot, this time calling plot() on the casualty_probability Series, specifying that it should be a bar chart:
casualty_probability.plot(kind = 'bar')
Finally, we add labels to the x and y axes, and add a title.
ax2.set_xlabel('Day of Week')
ax2.set_ylabel('Probability of Casualties')
ax2.set_title("Probability of Casualties by Day of Week")
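The steps above rely on Pandas drawing into whichever axes Matplotlib currently treats as active. A variation, sketched here on made-up data, passes each axes object explicitly through the ax argument and sets the labels after plotting so that they are not overwritten. This is an alternative arrangement rather than the recipe's own code, and the Agg backend is used only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: five accident records spread over two days
toy = pd.DataFrame({
    'Day_of_Week': [1, 1, 2, 2, 2],
    'Number_of_Casualties': [1, 3, 2, 1, 1],
})
grouped = toy.groupby('Day_of_Week').Number_of_Casualties

# Create the figure and both side-by-side axes in one call
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))

# First plot: records per day, drawn explicitly on ax1
grouped.count().plot(kind='bar', ax=ax1)
ax1.set_xlabel('Day of Week')
ax1.set_ylabel('Casualty Count')
ax1.set_title('Casualties by Day of Week')

# Second plot: casualties per record, drawn explicitly on ax2
(grouped.sum() / grouped.count()).plot(kind='bar', ax=ax2)
ax2.set_xlabel('Day of Week')
ax2.set_ylabel('Probability of Casualties')
ax2.set_title('Probability of Casualties by Day of Week')

fig.tight_layout()  # keep the two sets of labels from overlapping
```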
If using IPython Notebook, the plot will be displayed inline, and should look like the following image: