Finding outliers in univariate data

Outliers are data points that lie far away from the rest of the data. They have to be handled carefully in data science applications: feeding them into some algorithms unknowingly may lead to wrong results or conclusions. It is very important to account for them properly and to use the right algorithms to handle them.

 

"Outlier detection is an extremely important problem with a direct application in a wide variety of application domains, including fraud detection (Bolton, 2002), identifying computer network intrusions and bottlenecks (Lane, 1999), criminal activities in e-commerce and detecting suspicious activities (Chiu, 2003)."

 
 --- Jayakumar and Thomas, A New Procedure of Clustering Based on Multivariate Outlier Detection (Journal of Data Science 11(2013), 69-84)

We will look at the detection of outliers in univariate data in this recipe and then move on to look at outliers in multivariate and text data.

Getting ready

In this recipe, we will look at the following two methods for outlier detection in univariate data:

  • Median absolute deviation
  • Mean plus or minus three standard deviations

Let's see how we can leverage these methods to spot outliers in univariate data. Before we jump into the next section, let's create a dataset with outliers so that we can evaluate our methods empirically:

import numpy as np
import matplotlib.pyplot as plt

n_samples = 100
fraction_of_outliers = 0.1
number_inliers = int((1 - fraction_of_outliers) * n_samples)
number_outliers = n_samples - number_inliers

We will create 100 data points, and 10 percent of them will be outliers:

# Get some samples from a normal distribution
normal_data = np.random.randn(number_inliers,1)

We will use the randn function from NumPy's random module to generate our inliers. This draws samples from a standard normal distribution, that is, a distribution with a mean of zero and a standard deviation of one. Let's verify the mean and standard deviation of our sample:

# Print the mean and standard deviation
# to confirm the normality of our input data.
mean = np.mean(normal_data, axis=0)
std = np.std(normal_data, axis=0)
print("Mean =(%0.2f) and Standard Deviation (%0.2f)" % (mean[0], std[0]))

We calculate the mean and standard deviation using NumPy functions and print them. Let's inspect the output:

Mean =(0.24) and Standard Deviation (0.90)

As you can see, the mean is close to zero and the standard deviation is close to one.

Now, let's create the outliers. They will make up 10 percent of the whole dataset, that is, 10 points, given that our sample size is 100. We will sample our outliers from a uniform distribution between -9 and 9; any point within this range has an equal chance of being selected. We will then concatenate our inlier and outlier data. It is a good idea to look at the data in a scatter plot before we run our outlier detection program:

# Create outlier data
outlier_data = np.random.uniform(low=-9, high=9, size=(number_outliers, 1))
total_data = np.r_[normal_data, outlier_data]
print("Size of input data = (%d,%d)" % (total_data.shape))
# Eyeball the data
plt.cla()
plt.figure(1)
plt.title("Input points")
plt.scatter(range(len(total_data)), total_data, c='b')

Let's look at the graph that is generated:

[Figure: Input points scatter plot]

Our y axis is the actual values that we generated and our x axis is a running count. It will be a good exercise to mark the points that you feel are outliers. We can later compare our program output with your manual selections.

How to do it…

  1. Let's start with the median absolute deviation. Then we will plot our values, with the outliers marked in red:
    # Median Absolute Deviation
    median = np.median(total_data)
    b = 1.4826
    mad = b * np.median(np.abs(total_data - median))
    outliers = []
    # Useful while plotting
    outlier_index = []
    print("Median Absolute Deviation = %.2f" % (mad))
    lower_limit = median - (3*mad)
    upper_limit = median + (3*mad)
    print("Lower limit = %0.2f, Upper limit = %0.2f" % (lower_limit, upper_limit))
    for i in range(len(total_data)):
        if total_data[i] > upper_limit or total_data[i] < lower_limit:
            print("Outlier %0.2f" % (total_data[i]))
            outliers.append(total_data[i])
            outlier_index.append(i)
    
    plt.figure(2)
    plt.title("Outliers using mad")
    plt.scatter(range(len(total_data)), total_data, c='b')
    plt.scatter(outlier_index, outliers, c='r')
    plt.show()
  2. Moving on to the mean plus or minus three standard deviations, we will plot our values, with the outliers colored in red:
    # Standard deviation
    std = np.std(total_data)
    mean = np.mean(total_data)
    b = 3
    outliers = []
    outlier_index = []
    lower_limit = mean - b*std
    upper_limit = mean + b*std
    print("Lower limit = %0.2f, Upper limit = %0.2f" % (lower_limit, upper_limit))
    for i in range(len(total_data)):
        x = total_data[i]
        if x > upper_limit or x < lower_limit:
            print("Outlier %0.2f" % (x))
            outliers.append(x)
            outlier_index.append(i)
    
    plt.figure(3)
    plt.title("Outliers using std")
    plt.scatter(range(len(total_data)), total_data, c='b')
    plt.scatter(outlier_index, outliers, c='r')
    plt.show()

How it works…

In step 1, we use the median absolute deviation to detect the outliers in the data:

median = np.median(total_data)
b = 1.4826
mad = b * np.median(np.abs(total_data - median))

We first calculate the median value of our dataset using the median function from NumPy. Next, we declare a variable, b, with a value of 1.4826. This constant scales the median absolute deviation so that, for normally distributed data, it becomes a consistent estimator of the standard deviation. Finally, we calculate the median of the absolute deviations of every entry from the median value and multiply it by the constant, b.
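
To see where this constant comes from, here is a quick check (not part of the original recipe) that, on a large sample drawn from a standard normal distribution, the scaled median absolute deviation closely tracks the sample standard deviation; clean_data is a name used only for this illustration:

import numpy as np

# Draw a large standard normal sample for illustration only.
clean_data = np.random.randn(100000)

median = np.median(clean_data)
mad = 1.4826 * np.median(np.abs(clean_data - median))

# Both values should come out close to 1.0 for a standard normal sample.
print("Sample std = %0.3f, scaled MAD = %0.3f" % (np.std(clean_data), mad))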

Any point that lies more than three scaled median absolute deviations above or below the median is deemed an outlier by this method:

lower_limit = median - (3*mad)
upper_limit = median + (3*mad)

print "Lower limit = %0.2f, Upper limit = %0.2f"%(lower_limit,upper_limit)

We then calculate the lower and upper limits of the median absolute deviation, as shown previously, and classify every point as either an outlier or inlier, as follows:

for i in range(len(total_data)):
    if total_data[i] > upper_limit or total_data[i] < lower_limit:
        print("Outlier %0.2f" % (total_data[i]))
        outliers.append(total_data[i])
        outlier_index.append(i)

Finally, we have all our outlier points stored in a list named outliers. We also store the indices of the outliers in a separate list called outlier_index. This is done for ease of plotting, as you will see in the next step.

We then plot the original points and outliers. The plot looks as follows:

[Figure: Outliers using mad]

The points marked in red are classified as outliers by the algorithm.

In step 2, we code up the second algorithm, mean plus or minus three standard deviations:

std = np.std(total_data)
mean = np.mean(total_data)
b = 3

We then calculate the standard deviation and mean of our dataset. Here, you can see that we have set b = 3. As the name of our algorithm suggests, we flag points that lie more than three standard deviations from the mean, and b holds this multiplier:

lower_limit = mean - b*std
upper_limit = mean + b*std

print("Lower limit = %0.2f, Upper limit = %0.2f" % (lower_limit, upper_limit))

for i in range(len(total_data)):
    x = total_data[i]
    if x > upper_limit or x < lower_limit:
        print("Outlier %0.2f" % (x))
        outliers.append(x)
        outlier_index.append(i)

We calculate the lower and upper limits as the mean minus and plus three times the standard deviation, respectively. Using these values, we classify every point as either an outlier or an inlier in the for loop. We then add all the outliers and their indices to the two lists, outliers and outlier_index, for plotting.

Finally, we plot the outliers:

[Figure: Outliers using std]

There's more…

As per the definition, outliers in a given dataset are those points that lie far away from the other points. Estimates of the center of the dataset and of its spread can be used to detect them. In the methods outlined in this recipe, we used the mean and the median as estimates of the center of the data, and the standard deviation and the median absolute deviation as estimates of the spread. The spread is also called the scale.

Let's rationalize a little about why these methods work for detecting outliers. Let's start with the method that uses the standard deviation. For Gaussian data, we know that 68.27 percent of the data lies within one standard deviation of the mean, 95.45 percent within two, and 99.73 percent within three. Hence our rule: any point that is more than three standard deviations away from the mean is classified as an outlier. However, this method is not robust. Let's look at a small example.
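
As a quick sanity check (not part of the original recipe), we can verify these coverage figures empirically on a large standard normal sample; sample is a name used only for this illustration:

import numpy as np

sample = np.random.randn(1000000)

# Fraction of points within k standard deviations of the mean,
# for k = 1, 2, 3; expect roughly 0.6827, 0.9545, and 0.9973.
for k in (1, 2, 3):
    within = np.mean(np.abs(sample - sample.mean()) <= k * sample.std())
    print("Within %d std: %0.4f" % (k, within))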

Let's sample eight data points from a normal distribution with a mean of zero and a standard deviation of one.

Let's use the convenient randn function from NumPy's random module to generate our numbers:

np.random.randn(8)

This gives us the following numbers:

-1.76334861, -0.75817064, 0.44468944, -0.07724717, 0.12951944, 0.43096092, -0.05436724, -0.23719402

Let's manually add two outliers, for example, 45 and 69, to this list.

Our dataset now looks as follows:

-1.763348607322289, -0.7581706357821458, 0.4446894368956213, -0.07724717210195432, 0.1295194428816003, 0.4309609200681169, -0.05436724238743103, -0.23719402072058543, 45, 69

The mean of the preceding dataset is 11.211 and the standard deviation is 23.523.

Let's look at the upper rule, mean + 3 * std. This is 11.211 + 3 * 23.523 = 81.78.

Now, according to this upper bound rule, neither 45 nor 69 is flagged as an outlier! Both the mean and the standard deviation are non-robust estimators of the center and scale of the dataset, as they are extremely sensitive to outliers. If we replace even one of the observations in a dataset of n observations with an extreme point, the estimates of the mean and the standard deviation change completely. This sensitivity of an estimator is quantified by its finite sample breakdown point.
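
The arithmetic above is easy to reproduce. The following sketch, using the sample values listed earlier, shows that the upper limit ends up above both 45 and 69, so neither point is flagged:

import numpy as np

data = np.array([-1.763348607322289, -0.7581706357821458, 0.4446894368956213,
                 -0.07724717210195432, 0.1295194428816003, 0.4309609200681169,
                 -0.05436724238743103, -0.23719402072058543, 45, 69])

mean = np.mean(data)          # roughly 11.21
std = np.std(data)            # roughly 23.52
upper_limit = mean + 3 * std  # roughly 81.78

print("Mean = %0.3f, Std = %0.3f, Upper limit = %0.3f" % (mean, std, upper_limit))
print("Points above the upper limit:", data[data > upper_limit])  # empty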

Note

The finite sample breakdown point is defined as the proportion of the observations in a sample that can be replaced before the estimator fails to describe the data accurately.

Thus, for the mean and standard deviation, the finite sample breakdown point is 0 percent because in a large sample, replacing even a single point would change the estimators drastically.

In contrast, the median is a more robust estimator. The median is the middle observation in a finite set of observations sorted in ascending order. For the median to change drastically, we would have to replace half of the observations with points far away from the median. This gives the median a 50 percent finite sample breakdown point.
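
A small sketch (again, not from the original recipe) makes this robustness concrete: replacing a single observation with an extreme value drags the mean far away, while the median barely moves. The variable names here are for illustration only:

import numpy as np

clean = np.random.randn(100)   # a well-behaved standard normal sample
corrupted = clean.copy()
corrupted[0] = 1000.0          # replace one point with an extreme value

# The mean shifts by roughly 10, while the median stays close to 0.
print("Mean:   %0.3f -> %0.3f" % (np.mean(clean), np.mean(corrupted)))
print("Median: %0.3f -> %0.3f" % (np.median(clean), np.median(corrupted)))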

The median absolute deviation method is attributed to the following paper:

Leys, C., et al., Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median, Journal of Experimental Social Psychology (2013), http://dx.doi.org/10.1016/j.jesp.2013.03.013.

See also

  • Performing summary statistics and plots recipe in Chapter 1, Using Python for Data Science