This recipe describes how to deal with datasets coming from the real world and how to clean them before doing any visualization.
We will present a few techniques that differ in approach but share the same goal: getting the data clean.
Cleaning, however, should not be fully automatic. We need to understand the data as given, recognize what the outliers are, and know what the data points represent before we apply any of the robust modern algorithms made to clean the data. This is not something that can be defined in a recipe, because it relies on vast areas such as statistics, knowledge of the domain, and a good eye (and then some luck).
We will use the standard Python modules we already know about, so no additional installation is required.
In this recipe, we will introduce a new concept: the median absolute deviation (MAD). In statistics, MAD is a measure of the variability of a univariate (possessing one variable) sample of quantitative data. It is a measure of statistical dispersion, and it falls into the group of robust statistics in the sense that it is more resilient to outliers.
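To get a quick feel for how MAD behaves, here is a minimal sketch that computes it directly from its definition; the sample array is our own toy example, not part of the recipe's code:

import numpy as np

# a small sample with one obvious outlier (100)
sample = np.array([1, 2, 3, 4, 100])

# MAD is the median of the absolute deviations from the sample median
mad = np.median(np.abs(sample - np.median(sample)))

print(mad)  # prints 1.0; the single outlier barely moves the result

Compare this with the standard deviation of the same sample, which the outlier inflates to roughly 39.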
Here's one example that shows how to use MAD to detect outliers in our data. Look at the following lines of code, which do just that:
import numpy as np
import matplotlib.pyplot as plt


def is_outlier(points, threshold=3.5):
    """
    Returns a boolean array with "True" if points are outliers and
    "False" otherwise.

    Data points with a modified z-score greater than the threshold
    value will be classified as outliers.
    """
    # transform into vector
    if len(points.shape) == 1:
        points = points[:, None]

    # compute median value
    median = np.median(points, axis=0)

    # compute diff sums along the axis
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)

    # compute MAD
    med_abs_deviation = np.median(diff)

    # compute modified Z-score
    # http://www.itl.nist.gov/div898/handbook/eda/section4/eda43.htm#Iglewicz
    modified_z_score = 0.6745 * diff / med_abs_deviation

    # return a mask for each outlier
    return modified_z_score > threshold


# Random data
x = np.random.random(100)

# histogram buckets
buckets = 50

# Add in a few outliers
x = np.r_[x, -49, 95, 100, -100]

# Keep valid data points
# Note here that "~" is logical NOT on boolean numpy arrays
filtered = x[~is_outlier(x)]

# plot histograms
plt.figure()

plt.subplot(211)
plt.hist(x, buckets)
plt.xlabel('Raw')

plt.subplot(212)
plt.hist(filtered, buckets)
plt.xlabel('Cleaned')

plt.show()
Note that in NumPy, the ~ operator is overloaded to act as the logical NOT operator on Boolean arrays.
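As a quick illustration (the array here is our own toy example, not part of the recipe):

import numpy as np

mask = np.array([True, False, True])
print(~mask)  # prints [False  True False]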
The outlier-filtering code above produces two distinct histograms. The first one, drawn using all the data, contains one tall bar of height about 100 centered near 0.5 and a few other very small bars. This means that most of the samples fall into a single bin and the remaining bins contain only the outliers. Indeed, in the second histogram, drawn without the outliers, we can observe the details of the distribution of the data in the interval 0 to 1.
Another way to identify outliers is to visually inspect the data. To do so, we can create scatter plots, where we can easily spot values that lie outside the central swarm, or a box plot, which displays the median, the quartiles above and below the median, and points that are distant even from the extremes of the distribution of the data.
The box extends from the lower to the upper quartile values of the data, with a line at the median. The whiskers extend from the box to show the range of the data. Flier points are those past the ends of the whiskers.
Here's an example to demonstrate that:
from pylab import *

# fake up some data
spread = rand(50) * 100
center = ones(25) * 50

# generate some outliers high and low
flier_high = rand(10) * 100 + 100
flier_low = rand(10) * -100

# merge generated data set
data = concatenate((spread, center, flier_high, flier_low), 0)

subplot(311)

# basic plot
# 'gx' defines the outlier plotting properties
boxplot(data, 0, 'gx')

# compare this with a similar scatter plot
subplot(312)
spread_1 = concatenate((spread, flier_high, flier_low), 0)
center_1 = ones(70) * 25
scatter(center_1, spread_1)
xlim([0, 50])

# and with another one that is more appropriate for
# a scatter plot
subplot(313)
center_2 = rand(70) * 50
scatter(center_2, spread_1)
xlim([0, 50])

show()
We can then see x-shaped markers representing the outliers in the box plot, as shown in the resulting figure.
We can also see that the second plot, which shows a similar dataset as a scatter plot, is not very intuitive, because every value on the x axis sits at 25 and we can't really distinguish inliers from outliers.
The third plot, where we generated x-axis values spread across the range from 0 to 50, gives us much better visibility of the individual values, and we can see which values are outliers in terms of the y axis.
What if we have a dataset with missing values? We can use NumPy's loaders to compensate for missing values, or we can write code that replaces existing values with ones suitable for further use.
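For instance, numpy.genfromtxt() can substitute a default value wherever a field is empty or carries a missing-data marker. The following is a minimal sketch; the file name data.csv and the "N/A" marker are assumptions for illustration:

import numpy as np

# Read a CSV file, treating empty fields and "N/A" as missing
# and filling them with -1 (file name and markers are hypothetical)
data = np.genfromtxt('data.csv', delimiter=',',
                     missing_values=['', 'N/A'],
                     filling_values=-1)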
For example, suppose we want to illustrate some dataset over the geographical map of the USA, but the state names in the dataset are not consistent. We may have the values OH, Ohio, OHIO, US-OH, and OH-USA, all representing the state of Ohio in the USA. In this situation, we need to inspect the dataset manually, by loading it into a spreadsheet processor such as Microsoft Excel or OpenOffice.org Calc. Sometimes it is easy enough to just print all the lines using Python; if the file is CSV or CSV-like, we can open it in any text editor and inspect the data directly.
After we have concluded what is present in the data, we can write Python code to group those similar values and replace them with a single value that makes further processing consistent. The usual way of doing this is to read in the lines of the file using readlines() and use standard Python string manipulation functions to perform the manipulations.
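As a minimal sketch of that approach, the following code reads a hypothetical file states.txt line by line and maps the Ohio spellings mentioned earlier onto a single canonical value; the file name and the chosen canonical value are assumptions for illustration:

# every known spelling of the same state
OHIO_VARIANTS = {'OH', 'Ohio', 'OHIO', 'US-OH', 'OH-USA'}

cleaned = []
with open('states.txt') as f:           # hypothetical input file
    for line in f.readlines():
        value = line.strip()
        if value in OHIO_VARIANTS:
            value = 'OH'                # the single value we settle on
        cleaned.append(value)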
There are special products, both commercial and non-commercial (such as OpenRefine, available at https://github.com/OpenRefine), that provide some automation around transformations of "dirty" live datasets.
Manual work is still involved; how much depends on how noisy the data is and how well we understand it.
If you want to find out more about cleaning outliers and cleaning data in general, look into statistical models and sampling theory.