Normalizing the training data

As we have seen, it is best to normalize the data to remove obvious movie- or user-specific effects. We will just use one very simple type of normalization that we used before: conversion to z-scores.

Unfortunately, we cannot simply use scikit-learn's normalization objects as we have to deal with the missing values in our data (that is, not all movies were rated by all users). Thus, we want to normalize by the mean and standard deviation of the values that are, in fact, present.
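To see why the masking matters, here is a small sketch on a made-up ratings matrix: the plain column mean treats every zero as a real rating, while the mean over the observed entries does not:

```python
import numpy as np

# Toy ratings matrix (made up for illustration): rows are users,
# columns are movies; 0 means "not rated".
ratings = np.array([
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 4.0],
    [2.0, 0.0, 0.0],
])

# A plain column mean counts the zeros as if they were real ratings:
plain_mean = ratings.mean(axis=0)

# Masking out the missing entries gives the mean of the observed values only
# (every column here has at least one rating, so the division is safe):
present = (ratings > 0)
observed_mean = ratings.sum(axis=0) / present.sum(axis=0)
```

For the first movie, the plain mean is (4 + 0 + 2)/3 = 2.0, but the mean of the two actual ratings is (4 + 2)/2 = 3.0.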

We will write our own class that will ignore missing values. This class will follow the scikit-learn preprocessing API. We can even derive from scikit-learn's TransformerMixin class to add a fit_transform method:

import numpy as np
from sklearn.base import TransformerMixin

class NormalizePositive(TransformerMixin):

We want to choose the axis of normalization. By default, we normalize along the first axis, but sometimes it will be useful to normalize along the second one. This follows the convention of many other NumPy-related functions:

    def __init__(self, axis=0): 
        self.axis = axis 

The most important method is the fit method. In our implementation, we compute the mean and standard deviation of the values that are not zero. Recall that zeros indicate missing values:

    def fit(self, features, y=None):

If the axis is 1, we operate on the transposed array as follows:

        if self.axis == 1:
            features = features.T
        # count the features that are greater than zero in axis 0:
        binary = (features > 0)
        count0 = binary.sum(axis=0)

        # to avoid division by zero, set zero counts to one:
        count0[count0 == 0] = 1.

        # computing the mean is easy:
        self.mean = features.sum(axis=0)/count0

        # only consider differences where binary is True:
        diff = (features - self.mean) * binary
        diff **= 2
        # regularize the estimate of std by adding 0.1:
        self.std = np.sqrt(0.1 + diff.sum(axis=0)/count0)
        return self

We add 0.1 to the direct estimate of the standard deviation to avoid underestimating the value of the standard deviation when there are only a few samples, all of which may be exactly the same. The exact value used does not matter much for the final result, but we need to avoid division by zero.
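Both safeguards can be checked by hand on a hypothetical 2×2 matrix whose second column has no observed ratings at all; this sketch simply reproduces the fit logic above outside the class:

```python
import numpy as np

# Hypothetical toy matrix (0 = missing); the second column is entirely missing.
features = np.array([
    [4.0, 0.0],
    [2.0, 0.0],
])

binary = (features > 0)
count0 = binary.sum(axis=0).astype(float)
count0[count0 == 0] = 1.0           # second column: avoid dividing 0 by 0

mean = features.sum(axis=0) / count0     # [3.0, 0.0]
diff = (features - mean) * binary
std = np.sqrt(0.1 + (diff ** 2).sum(axis=0) / count0)
# Without the 0.1 term, the std of the empty second column would be exactly
# zero, and a later division by std would blow up.
```

For the first column, the squared deviations are 1 and 1, so the std estimate is sqrt(0.1 + 2/2) = sqrt(1.1); for the empty column it is sqrt(0.1) rather than zero.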

The transform method needs to take care of maintaining the binary structure, as follows:

    def transform(self, features):
        if self.axis == 1:
            features = features.T
        binary = (features > 0)
        features = features - self.mean
        features /= self.std
        features *= binary
        if self.axis == 1:
            features = features.T
        return features
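The effect of the final multiplication by `binary` is easiest to see on a single column; this sketch assumes a mean of 3.0 and a std of 1.0 obtained from a hypothetical fit:

```python
import numpy as np

# One column of ratings; the second entry is missing (zero):
features = np.array([4.0, 0.0, 2.0])
mean, std = 3.0, 1.0                # assumed to come from a previous fit
binary = (features > 0)

normalized = (features - mean) / std   # would map the missing 0 to -3.0
normalized *= binary                   # re-zero the missing entry
# normalized is now [1.0, 0.0, -1.0]: present entries are z-scored,
# missing entries stay zero.
```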

Note how we took care of transposing the input matrix when the axis is 1 and then transformed it back so that the return value has the same shape as the input. The inverse_transform method performs the inverse operation to transform, as shown in the following code:

    def inverse_transform(self, features, copy=True):
        if copy:
            features = features.copy()
        if self.axis == 1:
            features = features.T
        features *= self.std
        features += self.mean
        if self.axis == 1:
            features = features.T
        return features

Finally, we add the fit_transform method, which, as the name indicates, combines both the fit and transform operations:

    def fit_transform(self, features):
        return self.fit(features).transform(features)

The methods that we defined (fit, transform, inverse_transform, and fit_transform) are the same as those of the objects defined in the sklearn.preprocessing module. In the following sections, we will first normalize the inputs, generate normalized predictions, and finally apply the inverse transformation to obtain the final predictions.
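Assembling the snippets above into one place, a minimal round-trip sketch looks like this (the toy ratings matrix is made up for illustration):

```python
import numpy as np
from sklearn.base import TransformerMixin

class NormalizePositive(TransformerMixin):
    """Z-score normalization that treats zeros as missing values."""

    def __init__(self, axis=0):
        self.axis = axis

    def fit(self, features, y=None):
        if self.axis == 1:
            features = features.T
        binary = (features > 0)
        count0 = binary.sum(axis=0)
        count0[count0 == 0] = 1
        self.mean = features.sum(axis=0) / count0
        diff = (features - self.mean) * binary
        diff **= 2
        self.std = np.sqrt(0.1 + diff.sum(axis=0) / count0)
        return self

    def transform(self, features):
        if self.axis == 1:
            features = features.T
        binary = (features > 0)
        features = features - self.mean
        features /= self.std
        features *= binary
        if self.axis == 1:
            features = features.T
        return features

    def inverse_transform(self, features, copy=True):
        if copy:
            features = features.copy()
        if self.axis == 1:
            features = features.T
        features *= self.std
        features += self.mean
        if self.axis == 1:
            features = features.T
        return features

    def fit_transform(self, features):
        return self.fit(features).transform(features)

# Made-up ratings (0 = missing):
ratings = np.array([
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 4.0],
    [2.0, 5.0, 0.0],
])
norm = NormalizePositive()
z = norm.fit_transform(ratings)
back = norm.inverse_transform(z)
```

Note that the round trip recovers the observed entries exactly, while missing entries come back as the per-movie mean (their z-score is zero), which is a reasonable placeholder prediction.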
