Now that we know what we are working with, it is time to gain insights from the data using analysis. In this chapter, we will look at two types of analysis—statistical (what happened) and predictive (what could happen). You will learn how to perform the following:
A distribution analysis helps us understand how the values of each attribute in our data are spread out. Once plotted, you can see how your data is broken up. In this recipe, we'll create three plots: a histogram of weather conditions, a boxplot of light conditions, and a boxplot of light conditions grouped by weather conditions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
accidents = pd.read_csv(accidents_data_file,
                        sep=',',
                        header=0,
                        index_col=False,
                        parse_dates=True,
                        tupleize_cols=False,
                        error_bad_lines=False,
                        warn_bad_lines=True,
                        skip_blank_lines=True,
                        low_memory=False
                        )
accidents.head()
accidents.dtypes
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(accidents['Weather_Conditions'],
        range=(accidents['Weather_Conditions'].min(), accidents['Weather_Conditions'].max()))
counts, bins, patches = ax.hist(accidents['Weather_Conditions'], facecolor='green', edgecolor='gray')
ax.set_xticks(bins)
plt.title('Weather Conditions Distribution')
plt.xlabel('Weather Condition')
plt.ylabel('Count of Weather Condition')
plt.show()
accidents.boxplot(column='Light_Conditions', return_type='dict');
accidents.boxplot(column='Light_Conditions', by = 'Weather_Conditions', return_type='dict');
The first thing we need to do is import all the Python libraries we'll need. The last line of code, %matplotlib inline, is required only if you are running the code in IPython Notebook:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Next, we define a variable for the full path to our data file. Doing so is recommended because, if the location of your data file changes, you have to update only one line of code:
accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file:
accidents = pd.read_csv(accidents_data_file,
                        sep=',',
                        header=0,
                        index_col=False,
                        parse_dates=True,
                        tupleize_cols=False,
                        error_bad_lines=False,
                        warn_bad_lines=True,
                        skip_blank_lines=True,
                        low_memory=False
                        )
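The call above reflects the pandas version the recipe was written against. In pandas 1.3 and later, error_bad_lines and warn_bad_lines were replaced by a single on_bad_lines option, and tupleize_cols was removed in pandas 1.0. The following is a minimal sketch of a comparable call under that assumption; it keeps only a subset of the original arguments, and the helper name load_accidents and the inline sample data are hypothetical, used only so the example runs on its own.

```python
import io

import pandas as pd


def load_accidents(source):
    """Read accident data from a path or file-like object (hypothetical helper)."""
    return pd.read_csv(
        source,
        sep=',',
        header=0,
        index_col=False,
        on_bad_lines='warn',    # assumption: pandas >= 1.3; warn about and skip malformed rows
        skip_blank_lines=True,
        low_memory=False,
    )


# Made-up sample with one malformed row (three fields instead of two) and one blank line
sample = io.StringIO(
    "Weather_Conditions,Light_Conditions\n"
    "1,1\n"
    "2,4,99\n"
    "\n"
    "8,6\n"
)
accidents_sample = load_accidents(sample)
print(accidents_sample)
```

With the real file, the same helper would be called as load_accidents(accidents_data_file).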
If using IPython Notebook, use the head() method to view the first five rows of the DataFrame. This helps to ensure that the data is imported correctly:
accidents.head()
The histogram and boxplots in this recipe need numeric values, not text; so, to check the kind of data we have, we use the dtypes attribute of the DataFrame. This shows us the name and type of each column in the DataFrame:
accidents.dtypes
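As a small illustration of using the type information, the sketch below lists the columns whose dtype is numeric. The DataFrame here is made-up stand-in data, not the real accidents file, and the column selection is one possible approach rather than part of the recipe.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up stand-in for the accidents DataFrame
df = pd.DataFrame({
    'Weather_Conditions': [1, 2, 8],
    'Light_Conditions': [1, 4, 6],
    'Date': ['18/01/1979', '01/01/1979', '01/01/1979'],
})

# Keep only the columns that hold numbers and can be passed to hist() or boxplot()
numeric_columns = [col for col in df.columns if is_numeric_dtype(df[col])]
print(numeric_columns)
```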
Now that we know what we're working with, we can perform our distribution analysis. The first thing we'll do is create a histogram of weather conditions. Start by creating a figure. This figure will be the chart that we output:
fig = plt.figure()
Next we add a subplot to the figure. The 111 argument that we pass in is shorthand for passing three separate arguments to the add_subplot() function:
nrows: The number of rows in the grid of subplots.
ncols: The number of columns in the grid of subplots.
plot_number: The position in that grid of the subplot that we are adding (you can add many).
Each figure we create can have one or more plots. If you think of an image of a chart, the figure is the image and the plot is the chart. By specifying a single row and a single column, we are creating the typical (x, y) graph, which you may remember from math class.
ax = fig.add_subplot(111)
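To make the shorthand concrete, the sketch below compares add_subplot(111) with the three-integer form and then builds a figure with two stacked plots. The Agg backend is an assumption made so the code runs outside a notebook; in IPython Notebook the %matplotlib inline line takes its place.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Shorthand form used in the recipe
fig = plt.figure()
ax_short = fig.add_subplot(111)
geometry_short = ax_short.get_subplotspec().get_geometry()

# The same request written as three separate integers: nrows, ncols, plot_number
fig2 = plt.figure()
ax_long = fig2.add_subplot(1, 1, 1)
geometry_long = ax_long.get_subplotspec().get_geometry()

# A figure holding two plots stacked vertically: 2 rows, 1 column
fig3 = plt.figure()
top = fig3.add_subplot(2, 1, 1)
bottom = fig3.add_subplot(2, 1, 2)
print(len(fig3.axes))
```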
Then we create a histogram for the subplot by calling the hist() function and passing in values for these arguments:
x: The data we want to plot
range: The lower and upper range of the bins into which our data will be placed
ax.hist(accidents['Weather_Conditions'],
        range=(accidents['Weather_Conditions'].min(), accidents['Weather_Conditions'].max()))
We then call hist() again, this time specifying the chart colors and capturing the counts, bin edges, and patches that it returns, and tell the plot to show tick marks on the x axis at each of the bin edges:
counts, bins, patches = ax.hist(accidents['Weather_Conditions'], facecolor='green', edgecolor='gray')
ax.set_xticks(bins)
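Because each hist() call draws onto the same axes, the two calls above can be folded into one. The following is a minimal sketch, not the recipe's own code: it assumes the weather codes are small integers, uses made-up data in place of accidents['Weather_Conditions'], and places one bin on each code so that the ticks line up with the bars.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for accidents['Weather_Conditions'] (integer codes)
weather = pd.Series([1, 1, 1, 2, 2, 3, 5, 8, 9, 1, 2, 1])

fig, ax = plt.subplots()

# One bin per integer code, centred on the code value
edges = np.arange(weather.min() - 0.5, weather.max() + 1.5, 1.0)

# A single call sets the bins and the colors and returns the bin data
counts, bins, patches = ax.hist(weather, bins=edges, facecolor='green', edgecolor='gray')

# Put a tick on every code rather than on every bin edge
ax.set_xticks(np.arange(weather.min(), weather.max() + 1))
```

Centring the bins on the codes is a design choice for coded categorical data; for continuous data the default bins used in the recipe are usually the better starting point.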
Next we add a title, and label the x and y axes:
plt.title('Weather Conditions Distribution')
plt.xlabel('Weather Condition')
plt.ylabel('Count of Weather Condition')
Finally, we use the show() function to show the plot. If using IPython Notebook, the plot will be rendered right in the notebook:
plt.show()
The resulting plot should look like this:
As per the Data Guide provided with the data, the following are the corresponding meanings for the weather condition values:
Next we create a boxplot of the light conditions. We do this directly from the DataFrame using the boxplot() function, passing in the column to use and the return type. Note that the ; at the end of the function call stops the notebook from printing the object that the call returns, so that only the resulting plot is displayed:
accidents.boxplot(column='Light_Conditions', return_type='dict');
The boxplot should look like the following image:
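To show what return_type='dict' gives back, the sketch below draws a boxplot from a tiny made-up DataFrame and reads the median from the returned dictionary. The data and variable names are illustrative assumptions, and the Agg backend is used only so the code runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the Light_Conditions column
df = pd.DataFrame({'Light_Conditions': [1, 1, 1, 4, 4, 6, 7]})

# The returned dictionary maps component names to lists of matplotlib line objects
lines = df.boxplot(column='Light_Conditions', return_type='dict')
print(sorted(lines.keys()))

# The median is drawn as a horizontal line, so both of its y values equal the median
median_line = lines['medians'][0]
median_value = median_line.get_ydata()[0]
print(median_value)
```

Holding on to the dictionary is useful when you want to restyle parts of the plot afterwards, for example by calling set_color() on the lines under 'medians'.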
Lastly, we create a boxplot of the light conditions grouped by the weather condition. We again use the boxplot() function from the DataFrame. We pass in three arguments:
column: the column containing the data that we want to plot
by: the column whose values are used to group the data
return_type: the kind of object to return; 'dict' returns a dictionary whose values are the matplotlib lines of the boxplot.
accidents.boxplot(column='Light_Conditions', by = 'Weather_Conditions', return_type='dict');
The plot generated by these lines of code should look like the following image:
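When by is supplied, the pandas documentation describes the return value as a Series that maps each plotted column to the requested return type, rather than a bare dictionary. The sketch below illustrates this with made-up data; the values and variable names are assumptions for the example only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: two weather codes, three light-condition readings each
df = pd.DataFrame({
    'Weather_Conditions': [1, 1, 1, 2, 2, 2],
    'Light_Conditions': [1, 1, 4, 4, 6, 7],
})

result = df.boxplot(column='Light_Conditions', by='Weather_Conditions', return_type='dict')

# The Series is indexed by column name; each entry is the dictionary of lines for that column
per_column = result['Light_Conditions']
n_boxes = len(per_column['boxes'])

# One median line per weather code, in sorted group order
medians = [line.get_ydata()[0] for line in per_column['medians']]
print(n_boxes, medians)
```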