Scaling data to the standard normal

A preprocessing step that is almost always recommended is to scale columns to the standard normal. The standard normal is probably the most important distribution in all of statistics.

If you've ever been introduced to statistics, you've almost certainly seen z-scores. In truth, that's all this recipe is about: transforming our features from their original distribution into z-scores.

Getting ready

Scaling data is extremely useful. Many machine learning algorithms perform differently (and sometimes incorrectly) when the features exist at different scales. For example, SVMs perform poorly if the data isn't scaled, because they use a distance function in their optimization, which is biased if one feature varies from 0 to 10,000 and another varies from 0 to 1.

The preprocessing module contains several useful functions to scale features:

>>> from sklearn import preprocessing
>>> import numpy as np # we'll need it later
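
This recipe continues with the Boston dataset used in the earlier recipes. If X isn't already defined in your session, it can be loaded with something like the following (a minimal sketch using the bundled datasets module; only the feature matrix is needed here):

>>> from sklearn import datasets
>>> boston = datasets.load_boston()
>>> X = boston.data   # the feature matrix used throughout this recipe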

How to do it...

Continuing with the boston dataset, run the following commands:

>>> X[:, :3].mean(axis=0) #mean of the first 3 features
array([  3.59376071,  11.36363636,  11.13677866])
>>> X[:, :3].std(axis=0)
array([  8.58828355,  23.29939569,   6.85357058])

There's actually a lot to learn from this output. First, the first feature has the smallest mean, yet it varies more widely than the third feature. The second feature has the largest mean and standard deviation, so it takes the widest spread of values:

>>> X_2 = preprocessing.scale(X[:, :3])

>>> X_2.mean(axis=0)
array([  6.34099712e-17,  -6.34319123e-16,  -2.68291099e-15])

>>> X_2.std(axis=0)
array([ 1.,  1.,  1.])

How it works...

The center and scaling function is extremely simple. It merely subtracts the mean and divides by the standard deviation; that is, each value x is replaced by (x - mean) / std.
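
As a sanity check (a small sketch that isn't part of the original recipe; the variable names are my own), performing the subtraction and division by hand with NumPy reproduces what preprocessing.scale returns:

>>> means = X[:, :3].mean(axis=0)
>>> stds = X[:, :3].std(axis=0)
>>> X_by_hand = (X[:, :3] - means) / stds   # subtract the mean, divide by the standard deviation
>>> np.allclose(X_by_hand, preprocessing.scale(X[:, :3]))
True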

In addition to the function, there is also a center-and-scaling class that is easy to invoke, and it is particularly useful in conjunction with the Pipelines mentioned later. The class is also useful because it persists the fitted statistics across individual scalings:

>>> my_scaler = preprocessing.StandardScaler()
>>> my_scaler.fit(X[:, :3])
>>> my_scaler.transform(X[:, :3]).mean(axis=0)
array([  6.34099712e-17,  -6.34319123e-16,  -2.68291099e-15])
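
Because the fitted StandardScaler instance stores the statistics it learned, they can be inspected and reused on other rows. For example (a quick sketch; mean_ is the attribute that holds the fitted per-feature means):

>>> my_scaler.mean_                         # the per-feature means learned during fit
array([  3.59376071,  11.36363636,  11.13677866])
>>> my_scaler.transform(X[:5, :3]).shape    # the stored statistics are reused on any subset of rows
(5, 3)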

Scaling features to mean 0 and standard deviation 1 isn't the only useful type of scaling. The preprocessing module also contains a MinMaxScaler class, which will scale the data within a certain range:

>>> my_minmax_scaler = preprocessing.MinMaxScaler()
>>> my_minmax_scaler.fit(X[:, :3])
>>> my_minmax_scaler.transform(X[:, :3]).max(axis=0)
array([ 1.,  1.,  1.]) 
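
A quick check (not shown in the original output) confirms that the minimum of each transformed column lands at the other end of the default range:

>>> my_minmax_scaler.transform(X[:, :3]).min(axis=0)
array([ 0.,  0.,  0.])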

It's very simple to change the minimum and maximum values of the MinMaxScaler class from their defaults of 0 and 1, respectively:

>>> my_odd_scaler = preprocessing.MinMaxScaler(feature_range=(-3.14, 3.14))
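
Fitting and transforming work exactly as before with the nondefault range. A quick check with np.allclose (used here to sidestep floating-point rounding at the endpoints) confirms the new bounds:

>>> np.allclose(my_odd_scaler.fit_transform(X[:, :3]).max(axis=0), 3.14)
True
>>> np.allclose(my_odd_scaler.fit_transform(X[:, :3]).min(axis=0), -3.14)
True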

Another option is normalization. This will scale each sample to have a length of 1. This is different from the previous types of scaling, where the features were scaled rather than the samples. Normalization is illustrated in the following command:

>>> normalized_X = preprocessing.normalize(X[:, :3])
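
A quick check (a small addition, not part of the original recipe) shows that every row of the result now has unit Euclidean length:

>>> np.allclose((normalized_X ** 2).sum(axis=1), 1)
True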

If it's not apparent why this is useful, consider the Euclidean distance (a measure of similarity) between three samples, where one sample has the values (1, 1, 0), another has (3, 3, 0), and the third has (1, -1, 0).

The distance between the first and third vectors is less than the distance between the first and second, even though the first and third are orthogonal while the first and second differ only by a scalar factor of 3. Since distances are often used as measures of similarity, failing to normalize the data first can be misleading.
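
A small sketch (again, not part of the original recipe) makes this concrete. Before normalizing, the orthogonal pair of vectors looks closer than the parallel pair; normalizing reverses that:

>>> vecs = np.array([[1., 1., 0.], [3., 3., 0.], [1., -1., 0.]])
>>> # raw distances: the orthogonal pair (rows 0 and 2) is closer than the parallel pair (rows 0 and 1)
>>> np.linalg.norm(vecs[0] - vecs[1]) > np.linalg.norm(vecs[0] - vecs[2])
True
>>> unit_vecs = preprocessing.normalize(vecs)
>>> # after normalizing, the parallel pair nearly coincides and the orthogonal pair is the distant one
>>> np.linalg.norm(unit_vecs[0] - unit_vecs[1]) < np.linalg.norm(unit_vecs[0] - unit_vecs[2])
True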

There's more...

Centering and scaling come with a few practical wrinkles. Here are a few things to consider when using scikit-learn's implementation.

Creating idempotent scaler objects

The with_mean and with_std arguments let you control whether the StandardScaler instance uses the mean and/or the standard deviation. For instance, it's possible (though not particularly useful) to create a StandardScaler instance that simply performs the identity transformation:

>>> my_useless_scaler = preprocessing.StandardScaler(with_mean=False,
                                                     with_std=False)
>>> transformed_sd = my_useless_scaler.fit_transform(X[:, :3]).std(axis=0)
>>> original_sd = X[:, :3].std(axis=0)
>>> np.array_equal(transformed_sd, original_sd)
True
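
The two flags can also be toggled independently. For instance (a quick sketch), passing only with_std=False centers the data without scaling it:

>>> center_only = preprocessing.StandardScaler(with_std=False)
>>> np.allclose(center_only.fit_transform(X[:, :3]).mean(axis=0), 0)
True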

Handling sparse matrices

Sparse matrices require a little care when scaling. Mean centering the data would alter its 0s to nonzero values, and the matrix would no longer be sparse:

>>> import scipy.sparse
>>> matrix = scipy.sparse.eye(1000)
>>> preprocessing.scale(matrix)

ValueError: Cannot center sparse matrices: pass 'with_mean=False' instead See docstring for motivation and alternatives.

As noted in the error, it is possible to scale a sparse matrix by its standard deviation only, passing with_mean=False:

>>> preprocessing.scale(matrix, with_mean=False)
<1000x1000 sparse matrix of type '<type 'numpy.float64'>'
        with 1000 stored elements in Compressed Sparse Row format>
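
The StandardScaler class accepts the same option, which is handy when the scaler needs to sit inside a Pipeline that receives sparse input (a sketch reusing the sparse matrix created above; the variable names are my own):

>>> sparse_scaler = preprocessing.StandardScaler(with_mean=False)
>>> scaled_sparse = sparse_scaler.fit_transform(matrix)   # stays sparse; only the standard deviation is applied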

The other option is to call todense() on the matrix. However, this is dangerous because the matrix is already sparse for a reason, and it can potentially cause a memory error.
