Chapter 4. Performing Data Analysis for Non Data Analysts

Now that we know what we are working with, it is time to gain insights from the data using analysis. In this chapter, we will look at two types of analysis—statistical (what happened) and predictive (what could happen). You will learn how to perform the following:

  • Statistical analysis
    • Performing a distribution analysis
    • Performing a categorical variable analysis
    • Performing a linear regression
    • Performing a time-series analysis
    • Performing outlier detection
  • Predictive analysis
    • Creating a predictive model using logistic regression
    • Creating a predictive model using random forest
    • Creating a predictive model using Support Vector Machines
  • Saving the results of your analysis
    • Saving the results of your analysis
    • Saving a predictive model for production use

Performing a distribution analysis

A distribution analysis helps us understand how the values of each attribute in our data are distributed. Once plotted, you can see which values are common and which are rare. In this recipe, we'll create three plots: a histogram of weather conditions, a boxplot of light conditions, and a boxplot of light conditions grouped by weather conditions.

How to do it…

  1. First, import the Python libraries that you need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Next, define a variable for the accidents data file, import the data, and view the top five rows:
    accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'
    # on_bad_lines='warn' (pandas 1.3+) replaces the older
    # error_bad_lines/warn_bad_lines flags; tupleize_cols has been removed.
    accidents = pd.read_csv(accidents_data_file,
                            sep=',',
                            header=0,
                            index_col=False,
                            parse_dates=True,
                            on_bad_lines='warn',
                            skip_blank_lines=True,
                            low_memory=False
                            )
    accidents.head()
  3. After that, get a full list of the columns and data types in the DataFrame:
    accidents.dtypes
  4. Create a histogram of the weather conditions:
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.hist(accidents['Weather_Conditions'],
            range = (accidents['Weather_Conditions'].min(),accidents['Weather_Conditions'].max()))
    counts, bins, patches = ax.hist(accidents['Weather_Conditions'], facecolor='green', edgecolor='gray')
    ax.set_xticks(bins)
    plt.title('Weather Conditions Distribution')
    plt.xlabel('Weather Condition')
    plt.ylabel('Count of Weather Condition')
    plt.show()
  5. Next, create a box plot of the light conditions:
    accidents.boxplot(column='Light_Conditions',
                      return_type='dict');
  6. Finally, create a box plot of the light conditions grouped by weather conditions, as follows:
    accidents.boxplot(column='Light_Conditions',
                      by='Weather_Conditions',
                      return_type='dict');

How it works…

The first thing we need to do is import all the Python libraries we'll need. The last line of code—%matplotlib inline—is required only if you are running the code in IPython Notebook:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Next, we define a variable for the full path to our data file. It is recommended to do this, because if the location of your data file changes, you have to update only one line of code:

accidents_data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/Stats19-Data1979-2004/Accidents7904.csv'

Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file:

# on_bad_lines='warn' (pandas 1.3+) replaces the older
# error_bad_lines/warn_bad_lines flags; tupleize_cols has been removed.
accidents = pd.read_csv(accidents_data_file,
                        sep=',',
                        header=0,
                        index_col=False,
                        parse_dates=True,
                        on_bad_lines='warn',
                        skip_blank_lines=True,
                        low_memory=False
                        )

If using IPython Notebook, use the head() function to view the top five rows of the DataFrame. This helps to ensure that the data is imported correctly:

accidents.head()
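If you do not have the accidents file to hand, the same load-and-inspect pattern can be tried on a few rows of in-memory text. The following is only a sketch: the values are invented, and Accident_Index is an assumed column name used for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample. Weather_Conditions and Light_Conditions are
# the columns used in this recipe; Accident_Index and all values are invented.
sample_csv = io.StringIO(
    "Accident_Index,Weather_Conditions,Light_Conditions\n"
    "A001,1,1\n"
    "\n"
    "A002,2,4\n"
    "A003,1,1\n"
)

# skip_blank_lines=True drops the empty line in the middle of the text
sample = pd.read_csv(sample_csv, sep=',', header=0, skip_blank_lines=True)

print(sample.shape)   # number of rows and columns that were loaded
print(sample.head())  # the first rows (up to five)
```

Checking the shape alongside head() is a quick way to confirm that the number of rows and columns matches what you expected from the file.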

The plots in this recipe need numeric values; so, to check the kind of data we have, we use the dtypes attribute of the DataFrame. This shows us the name and type of each column in the DataFrame:

accidents.dtypes
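As a small illustration of reading the dtypes output, the sketch below builds a toy DataFrame (invented values, with Accident_Index as an assumed text column) and then lists only the numeric columns, which are the ones suitable for the plots in this recipe.

```python
import pandas as pd

# Toy stand-in for the accidents DataFrame; all values are invented.
toy = pd.DataFrame({
    "Accident_Index": ["A001", "A002", "A003"],   # text column
    "Weather_Conditions": [1, 2, 1],              # integer codes
    "Light_Conditions": [1, 4, 1],                # integer codes
})

# One line per column: the column name and its data type
print(toy.dtypes)

# Keep only the names of the columns that hold numbers
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```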

Now that we know what we're working with, we can perform our distribution analysis. The first thing we'll do is create a histogram of weather conditions. Start by creating a figure. This figure will be the chart that we output:

fig = plt.figure()

Next, we add a subplot to the figure. The 111 argument that we pass in is shorthand for passing three separate arguments to the add_subplot() function:

  • nrows: The number of rows in the grid of plots.
  • ncols: The number of columns in the grid of plots.
  • plot_number: The position in that grid of the subplot that we are adding (you can add many). Each figure we create can have one or more plots. If you think of an image of a chart, the figure is the image and the plot is the chart.

By specifying a single row and a single column, we are creating the typical (x,y) graph, which you may remember from your math class.

ax = fig.add_subplot(111)
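To make the three-digit shorthand concrete, here is a minimal sketch. The variable names are illustrative, and the Agg backend line is an assumption that lets the code run without a display; omit it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# One row, one column, first (and only) plot: the layout used in this recipe
fig = plt.figure()
ax_single = fig.add_subplot(111)

# One row, two columns: the same figure can hold two plots side by side
fig2 = plt.figure()
ax_left = fig2.add_subplot(1, 2, 1)   # the three arguments written out
ax_right = fig2.add_subplot(122)      # the equivalent three-digit shorthand

print(len(fig.axes), len(fig2.axes))
```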

Then we create a histogram for the subplot by calling the hist() function and passing in values for these arguments:

  • x: The data we want to plot
  • range: The lower and upper range of the bins into which our data will be placed

ax.hist(accidents['Weather_Conditions'],
        range = (accidents['Weather_Conditions'].min(),accidents['Weather_Conditions'].max()))

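We then customize our histogram by calling hist() again, this time specifying the chart colors and capturing the values it returns (the counts, the bin edges, and the drawn bars). Because the default range is already the minimum and maximum of the data, this call draws over the first one using the same bins. We then tell the plot to show a tick mark on the x axis at each bin edge: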

counts, bins, patches = ax.hist(accidents['Weather_Conditions'], facecolor='green', edgecolor='gray')
ax.set_xticks(bins)
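The three values returned by hist() are easier to understand on a small example. This sketch uses ten invented weather codes and the default of ten bins; the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Ten invented weather condition codes
codes = pd.Series([1, 1, 1, 2, 2, 7, 8, 1, 2, 1])

fig = plt.figure()
ax = fig.add_subplot(111)
counts, bins, patches = ax.hist(codes, facecolor='green', edgecolor='gray')
ax.set_xticks(bins)  # one tick at every bin edge

# counts: how many values fell into each bin
# bins: the bin edges, so there is one more edge than there are bins
# patches: the bars that were drawn, one per bin
print(len(counts), len(bins), len(patches))
print(counts.sum())
```

Every value lands in exactly one bin, so the counts add up to the number of rows plotted.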

Next we add a title, and label the x and y axes:

plt.title('Weather Conditions Distribution')
plt.xlabel('Weather Condition')
plt.ylabel('Count of Weather Condition')

Finally, we use the show() function to show the plot. If using IPython Notebook, the plot will be rendered right in the notebook:

plt.show()

The resulting plot should look like this:

[Figure: histogram titled "Weather Conditions Distribution"]

As per the Data Guide provided with the data, the following are the corresponding meanings for the weather condition values:

  • -1: Data missing or out of range
  • 1: Fine no high winds
  • 2: Raining no high winds
  • 3: Snowing no high winds
  • 4: Fine + high winds
  • 5: Raining + high winds
  • 6: Snowing + high winds
  • 7: Fog or mist
  • 8: Other
  • 9: Unknown
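Since the histogram shows only numeric codes, it can help to translate them into the labels listed above. The following is a sketch on a handful of invented codes; WEATHER_LABELS is a name introduced here for illustration, with its contents copied from the list.

```python
import pandas as pd

# Code-to-label mapping copied from the Data Guide list above
WEATHER_LABELS = {
    -1: "Data missing or out of range",
    1: "Fine no high winds",
    2: "Raining no high winds",
    3: "Snowing no high winds",
    4: "Fine + high winds",
    5: "Raining + high winds",
    6: "Snowing + high winds",
    7: "Fog or mist",
    8: "Other",
    9: "Unknown",
}

# Invented codes standing in for accidents['Weather_Conditions']
codes = pd.Series([1, 1, 2, 7, 1, 9, -1], name="Weather_Conditions")

# Replace each code with its label, then count how often each label occurs
labeled = codes.map(WEATHER_LABELS)
label_counts = labeled.value_counts()
print(label_counts)
```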

Next we create a boxplot of the light conditions. We do this directly from the DataFrame using the boxplot() function, passing in the column to plot and the type of object the call should return.

Note that the ; at the end of the function call suppresses the text representation of the returned object that the notebook would otherwise print, so that only the resulting plot is displayed:

accidents.boxplot(column='Light_Conditions',
                  return_type='dict');
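To see what the boxplot is drawing, the sketch below runs the same call on a toy column of invented light condition codes, looks at the dictionary that is returned, and computes the quartiles that define the box. The Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented light condition codes
toy = pd.DataFrame({"Light_Conditions": [1, 1, 1, 4, 4, 6, 7, 1, 4, 1]})

parts = toy.boxplot(column='Light_Conditions', return_type='dict')

# The dictionary groups the drawn lines by the part of the boxplot they form
print(sorted(parts.keys()))

# The box spans the first to third quartile; the line inside it is the median
q1, median, q3 = toy['Light_Conditions'].quantile([0.25, 0.5, 0.75])
print(q1, median, q3)
```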

The boxplot should look like the following image:

[Figure: boxplot of Light_Conditions]

Lastly, we create a boxplot of the light conditions grouped by the weather condition. We again use the boxplot() function from the DataFrame. We pass in three arguments:

  • column: the column containing the data that we want to plot
  • by: the column whose values are used to group the data; one box is drawn per group
  • return_type: the kind of object to return; 'dict' returns a dictionary whose values are the matplotlib lines of the boxplot.

accidents.boxplot(column='Light_Conditions',
                  by = 'Weather_Conditions',
                  return_type='dict');
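The numbers behind a grouped boxplot can also be inspected as a table. This is a sketch on invented data that uses groupby() and describe() to report, for each weather condition, the count and the quartiles that each box would show.

```python
import pandas as pd

# Invented codes for three weather conditions
toy = pd.DataFrame({
    "Weather_Conditions": [1, 1, 1, 2, 2, 2, 7, 7],
    "Light_Conditions":   [1, 1, 4, 4, 4, 6, 1, 7],
})

# One row per weather condition, with summary statistics of light conditions
summary = toy.groupby("Weather_Conditions")["Light_Conditions"].describe()

# 25%, 50%, and 75% are the bottom, middle line, and top of each box
print(summary[["count", "25%", "50%", "75%"]])
```

A table like this is a useful cross-check when a box in the plot looks unexpectedly flat or wide.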

The plot generated by these lines of code should look like the following image:

[Figure: boxplots of Light_Conditions grouped by Weather_Conditions]