Performing outlier detection

Outlier detection finds data points that can throw off your analysis. Outliers come in two flavors: univariate and multivariate. A univariate outlier is a data point with an extreme value on a single variable, so it can be seen by looking at that variable alone. A multivariate outlier is a combination of unusual scores on at least two variables, and is found in multidimensional data.
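
To make the distinction concrete, here is a minimal sketch on synthetic, made-up data (not the college dataset). It assumes a z-score check for the univariate case and a Mahalanobis distance check for the multivariate case; neither technique is used in this recipe, and the threshold of 3 is only a common rule of thumb.

```python
import numpy as np

# Synthetic data (an assumption for illustration): two strongly correlated variables.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)
y = x + rng.normal(0.0, 0.1, 500)
data = np.column_stack([x, y])

# Two hand-picked points appended as the last two rows.
univariate_point = np.array([8.0, 8.0])      # extreme when a column is viewed alone
multivariate_point = np.array([1.5, -1.5])   # each value ordinary, the pair unusual
data = np.vstack([data, univariate_point, multivariate_point])

# Univariate check: absolute z-score of every value within its own column.
z = np.abs((data - data.mean(axis=0)) / data.std(axis=0))

# Multivariate check: Mahalanobis distance of every row from the mean,
# which takes the correlation between the columns into account.
centered = data - data.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
mahal = np.sqrt(np.einsum('ij,jk,ik->i', centered, inv_cov, centered))

THRESHOLD = 3.0  # rule-of-thumb cut-off, chosen for illustration only
print('univariate point flagged by z-score:', bool(z[-2].max() > THRESHOLD))
print('multivariate point flagged by z-score:', bool(z[-1].max() > THRESHOLD))
print('multivariate point flagged by Mahalanobis:', bool(mahal[-1] > THRESHOLD))
```

The second point goes unnoticed when each column is inspected separately, because neither of its values is extreme; only the combination is unusual.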

For this recipe, we are going to use the college dataset from An Introduction to Statistical Learning with Applications in R.

How to do it…

  1. First, import the Python libraries that you need:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Next, define a variable for the college's data file, import the data, and view the top five rows:
    data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/College.csv'
    colleges = pd.read_csv(data_file,
                            sep=',',
                            header=0,
                            index_col=0,
                            parse_dates=True,
                            on_bad_lines='warn',  # pandas >= 1.3
                            skip_blank_lines=True,
                            low_memory=False
                            )
    colleges.head()
  3. After that, get a full list of the columns and data types in the DataFrame:
    colleges.dtypes
  4. Next, create a boxplot of the number of applications and the number of accepted applicants:
    colleges.boxplot(column=['Apps', 'Accept'],
                     return_type='axes',
                     figsize=(12,6))
  5. After that, create a scatterplot showing the relationship between the application and acceptance numbers:
    colleges.plot(kind='scatter',
                  x='Accept',
                  y='Apps',
                  figsize=(16, 6))
  6. Next, label each point:
    fig, ax = plt.subplots()
    colleges.plot(kind='scatter',
                  x='Accept',
                  y='Apps',
                  figsize=(16, 6),
                  ax=ax)
    for k, v in colleges.iterrows():
        ax.annotate(k,(v['Accept'],v['Apps']))
  7. Finally, draw the scatterplot:
    fig.canvas.draw()

How it works…

The first thing we need to do is import all the Python libraries we'll need. The last line of code, %matplotlib inline, is required only if you are running the code in IPython Notebook:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Next, we define a variable for the full path to our data file. It's recommended to do this so that if the location of your data file changes, you have to update only one line of code:

data_file = '/Users/robertdempsey/Dropbox/private/Python Business Intelligence Cookbook/Data/ISL/College.csv'

Once you have the data file variable, use the read_csv() function provided by Pandas to create a DataFrame from the CSV file:

colleges = pd.read_csv(data_file,
                        sep=',',
                        header=0,
                        index_col=0,
                        parse_dates=True,
                        on_bad_lines='warn',  # pandas >= 1.3
                        skip_blank_lines=True,
                        low_memory=False
                        )
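
To see what the index and bad-line arguments do without needing the data file, here is a small sketch that reads made-up CSV text from memory. The college names and numbers are hypothetical stand-ins. It assumes pandas 1.3 or later, where on_bad_lines replaced the older error_bad_lines and warn_bad_lines arguments.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real file. The first column has no
# header name, and the third data row has one field too many.
csv_text = (
    ",Apps,Accept\n"
    "College A,1660,1232\n"
    "College B,2186,1924\n"
    "College C,1428,1097,extra\n"
    "College D,417,349\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=',',
    header=0,              # the first line holds the column names
    index_col=0,           # use the first column (the names) as the index
    on_bad_lines='warn',   # warn about malformed rows and skip them
    skip_blank_lines=True,
)
print(sample)
```

The malformed row is dropped with a warning instead of stopping the import, and the college names become the row labels, which is what later lets us label points by name.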

If using IPython Notebook, use the head() function to view the top five rows of the DataFrame. This helps to ensure that the data is imported correctly, and allows us to see some of the data:

colleges.head()

Because this is a new dataset, we use the dtypes attribute to view the names and types of the columns of the DataFrame:

colleges.dtypes

Next, find out the number of rows and columns contained in the DataFrame using the shape attribute:

colleges.shape
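
Both dtypes and shape are attributes, so they are read without parentheses. A quick sketch on a tiny DataFrame with made-up rows (only the column names follow the recipe) shows what each one returns:

```python
import pandas as pd

# Made-up rows; only the column names come from the recipe.
sample = pd.DataFrame(
    {'Apps': [1660, 2186, 1428],
     'Accept': [1232, 1924, 1097]},
    index=['College A', 'College B', 'College C'],
)

# dtypes is a Series mapping each column name to its data type.
print(sample.dtypes)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
rows, cols = sample.shape
print(rows, cols)
```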

Next, we create a boxplot of the number of applications and the number of accepted applicants. We do this by calling the boxplot() function on the DataFrame and passing in three arguments:

  • column: The columns to plot.
  • return_type: The kind of object to return. 'axes' returns the matplotlib axes the boxplot is drawn on.
  • figsize: The size of the plot, as a (width, height) tuple in inches.

colleges.boxplot(column=['Apps', 'Accept'],
                 return_type='axes',
                 figsize=(12,6))

The resulting plot looks like the following image:

We can immediately see that there are, in fact, outliers in the data. They are the points drawn individually above the whiskers, towards the top of the graph.
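
By default, a matplotlib boxplot extends its whiskers to the furthest data points within 1.5 times the interquartile range (IQR) of the quartiles, and draws anything beyond that as an individual point. The following sketch applies the same rule numerically, so the outliers can be listed by name. The iqr_outliers helper and the sample values are hypothetical and not part of the recipe.

```python
import pandas as pd

def iqr_outliers(series, whis=1.5):
    """Return the values lying more than whis * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return series[(series < lower) | (series > upper)]

# Made-up application counts with one very large value.
apps = pd.Series(
    [1660, 2186, 1428, 417, 193, 587, 48000],
    index=['College A', 'College B', 'College C', 'College D',
           'College E', 'College F', 'College G'],
    name='Apps',
)

print(iqr_outliers(apps))
```

On the real data, the same helper could be called as iqr_outliers(colleges['Apps']), assuming the DataFrame was loaded as shown earlier.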

To better see the outliers, we next create a scatterplot showing the relationship between the application and acceptance numbers:

colleges.plot(kind='scatter',
              x='Accept',
              y='Apps',
              figsize=(16, 6))

This helps confirm that we do indeed have outliers. However, which colleges do those points heading to the top right of the plot represent? To find out, we next label each point.

First we create our plot as we did before:

fig, ax = plt.subplots()
colleges.plot(kind='scatter',
              x='Accept',
              y='Apps',
              figsize=(16, 6),
              ax=ax)

Next, we loop through each row of our DataFrame and label its point using the annotate function, passing in the label and the x and y values of the point we want to annotate.

An example annotation would be (Abilene Christian University, 1232, 1660) with the label being Abilene Christian University, the x value (accepted) being 1232, and the y value (apps) being 1660.

for k, v in colleges.iterrows():
    ax.annotate(k,(v['Accept'],v['Apps']))

Finally, we draw the annotated scatterplot as follows:

fig.canvas.draw()

We can now clearly see the two outliers in the data:

  • Rutgers at New Brunswick
  • Purdue University at West Lafayette

If we wanted to see all 777 of the labeled points, we would need to make the plot a lot bigger.
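
An alternative to enlarging the plot is to label only the extreme points. The sketch below does that on made-up rows; the APPS_THRESHOLD cut-off is a hypothetical value chosen for this sample, and the Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; only the column names come from the recipe.
sample = pd.DataFrame(
    {'Apps': [1660, 2186, 1428, 21000, 48000],
     'Accept': [1232, 1924, 1097, 18000, 26000]},
    index=['College A', 'College B', 'College C', 'College D', 'College E'],
)

APPS_THRESHOLD = 20000  # hypothetical cut-off for "worth labeling"

fig, ax = plt.subplots(figsize=(16, 6))
sample.plot(kind='scatter', x='Accept', y='Apps', ax=ax)

# Annotate only the rows above the threshold instead of every row.
extreme = sample[sample['Apps'] > APPS_THRESHOLD]
for name, row in extreme.iterrows():
    ax.annotate(name, (row['Accept'], row['Apps']))

# The annotations are stored on the axes, so we can list what was labeled.
labels = [t.get_text() for t in ax.texts]
print(labels)
```

This keeps the plot readable at its original size, at the cost of having to choose a cut-off by hand.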
