Imputing the data

In many real-world scenarios, we face the problem of incomplete or missing data, and we need a strategy to handle it. This strategy can be formulated using the data alone or in conjunction with the class labels, if labels are present.

Getting ready

Let's first look at the ways of imputing the data without using the class labels.

A simple technique is to ignore the missing values and thus avoid the overhead of data imputation. However, this works only when data is available in abundance, which is not always the case. If the dataset has very few missing values, and their percentage is minimal, we can ignore them. Typically, we don't ignore a single value of a variable; we ignore the whole tuple that contains it. We have to be careful when discarding a whole tuple, as its other attributes may be critical for our task.
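To make this concrete, here is a minimal sketch of the tuple-dropping approach on a small synthetic matrix (the data and the NaN encoding are assumptions for illustration, not part of the recipe):

```python
import numpy as np

# Hypothetical 4x3 data matrix with one missing value encoded as NaN
x = np.array([[5.1, 3.5, 1.4],
              [4.9, np.nan, 1.4],
              [4.7, 3.2, 1.3],
              [4.6, 3.1, 1.5]])

# Keep only the tuples (rows) that have no missing values
complete_rows = ~np.isnan(x).any(axis=1)
x_complete = x[complete_rows]

print(x_complete.shape)  # (3, 3): one tuple was discarded
```

Note that dropping the second tuple also discards its two valid attribute values, which is exactly the cost discussed above.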

A better way to handle missing data is to estimate it. The estimation can be carried out using the data alone or in conjunction with the class label. For a continuous variable, the mean, median, or most frequent value can be used to replace the missing value. Scikit-learn provides an Imputer class in the preprocessing module to handle missing data. Let's see an example where we perform data imputation. To better understand the technique, we will artificially introduce some missing values into the Iris dataset.
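Note that in recent scikit-learn releases, Imputer was removed from the preprocessing module and replaced by SimpleImputer in sklearn.impute. If you are on a newer version, a roughly equivalent call (sketched here on a toy matrix, not the Iris data) looks like this:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix where 0 marks a missing value (illustrative only)
x = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [3.0, 4.0]])

# Replace every 0 with the mean of its column
imputer = SimpleImputer(missing_values=0, strategy="mean")
x_imputed = imputer.fit_transform(x)
print(x_imputed)  # row 1 becomes [2., 3.]
```

The rest of the recipe uses the older Imputer API, but the parameters carry over unchanged.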

How to do it…

Let's load the necessary libraries to begin with. We will load the Iris dataset as usual and introduce some arbitrary missing values:

# Load Libraries
from sklearn.datasets import load_iris
from sklearn.preprocessing import Imputer
import numpy as np
import numpy.ma as ma

# 1. Load Iris Data Set
data = load_iris()
x = data['data']
y = data['target']

# Make a copy of the original x values
x_t = x.copy()

# 2. Introduce missing values into the third row (index 2)
x_t[2,:] = np.repeat(0,x.shape[1])

Let's see some data imputation in action:

# 3. Now create an imputer object with the mean strategy,
# i.e. fill the missing values with the mean of the respective column.
imputer = Imputer(missing_values=0,strategy="mean")
x_imputed = imputer.fit_transform(x_t)


mask = np.zeros_like(x_t)
mask[2,:] = 1
x_t_m = ma.masked_array(x_t,mask)

print np.mean(x_t_m,axis=0)
print x_imputed[2,:]

How it works…

Step 1 is about loading the Iris data into memory. In step 2, we introduce some missing values; in this case, we set all the columns in the third row to 0.

In step 3, we will use the Imputer object to handle the missing data:

imputer = Imputer(missing_values=0,strategy="mean")

As you can see, we pass two parameters: missing_values, which specifies the value to treat as missing, and strategy, which specifies how to impute these missing values. The Imputer object provides the following three strategies:

  • mean
  • median
  • most_frequent

With the mean strategy, any cell with the value 0 is replaced by the mean of the column that the cell belongs to. With median, the column's median value is used instead, and with most_frequent, as the name suggests, the most frequently occurring value in the column replaces the 0. Which strategy to apply depends on the context of our application.
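To see what each strategy computes, the following sketch reproduces the three replacements by hand on a single toy column (values chosen arbitrarily, with 0 as the missing-value marker):

```python
import numpy as np
from collections import Counter

# Toy column with 0 as the missing-value marker (values are arbitrary)
col = np.array([2.0, 3.0, 3.0, 8.0, 0.0])
observed = col[col != 0]                # [2., 3., 3., 8.]

mean_fill = observed.mean()             # 4.0
median_fill = np.median(observed)       # 3.0
mode_fill = Counter(observed).most_common(1)[0][0]  # 3.0

print(mean_fill, median_fill, mode_fill)
```

The mean is sensitive to the outlying 8.0, while the median and most frequent value are not; this is often the deciding factor between the strategies.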

The initial value of x[2,:] is as follows:

>>> x[2,:]
array([ 4.7,  3.2,  1.3,  0.2])

We will make it 0 in all the columns and use an imputer with the mean strategy.

Before we look at the imputer output, let's calculate the mean values for all the columns:

import numpy.ma as ma
mask = np.zeros_like(x_t)
mask[2,:] = 1
x_t_m = ma.masked_array(x_t,mask)


print np.mean(x_t_m,axis=0)

The output is as follows:

[5.851006711409397 3.053020134228189 3.7751677852349017 1.2053691275167793]

Now, let's look at the imputed output for row number 2:

print x_imputed[2,:]

The following is the output:

[ 5.85100671  3.05302013  3.77516779  1.20536913]

As you can see, the imputer has filled the missing values with the mean value of the respective columns.

There's more…

As we discussed, we can also leverage the class labels and impute the missing values either using the mean or median:

# Impute based on class label
missing_y = y[2]
x_missing = np.where(y==missing_y)[0]
# Mean strategy
print np.mean(x[x_missing,:],axis=0)
# Median strategy
print np.median(x[x_missing,:],axis=0)

Instead of using the mean or median of the whole dataset, we subset the data by the class label of the missing tuple:

missing_y = y[2]

We introduced the missing values in the third record, so we assign the class label associated with this record to the missing_y variable:

x_missing = np.where(y==missing_y)[0]

Now, we will find the indices of all the tuples that have the same class label:

# Mean strategy
print np.mean(x[x_missing,:],axis=0)
# Median strategy
print np.median(x[x_missing,:],axis=0)

We can now replace the missing tuple with the mean or median of only those tuples that belong to the same class label, rather than of the whole dataset.
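The whole class-conditional idea can be condensed into a self-contained sketch. The data and labels below are synthetic, and as a small refinement (not shown in the recipe above) the incomplete tuple itself is excluded from the donor set:

```python
import numpy as np

# Synthetic data: 6 tuples, 2 attributes, 2 class labels (illustrative only)
x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [0.0, 0.0],   # tuple with missing values (0 as marker)
              [5.0, 6.0],
              [7.0, 8.0],
              [9.0, 10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

missing_row = 2
missing_y = y[missing_row]                      # class label of the broken tuple
same_class = np.where(y == missing_y)[0]        # rows sharing that label
donors = same_class[same_class != missing_row]  # exclude the broken tuple itself

# Mean strategy restricted to the tuple's own class
x[missing_row, :] = np.mean(x[donors, :], axis=0)
print(x[missing_row, :])  # [2. 3.]
```

Only the two class-0 donor tuples contribute, so the class-1 tuples with much larger values do not distort the imputed result.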

See also

  • Performing Summary Statistics recipe in Chapter 3, Analyzing Data - Explore & Wrangle