Standardizing the data

Standardization is the process of converting the input so that it has a mean of 0 and standard deviation of 1.

Getting ready

Given a vector X, you can transform it so that it has a mean of 0 and a standard deviation of 1 by using the following equation:

Note

Standardized X = (X - mean(X)) / standard deviation(X)

Let's see how this can be achieved in Python.

How to do it…

Let's import the necessary libraries to begin with. We will follow this with the generation of the input data:

# Load Libraries
import numpy as np
from sklearn.preprocessing import scale

# Input data generation
np.random.seed(10)
x = [np.random.randint(10,25)*1.0 for i in range(10)]

We are now ready to demonstrate standardization:

x_centered = scale(x,with_mean=True,with_std=False)
x_standard = scale(x,with_mean=True,with_std=True)

print(x)
print(x_centered)
print(x_standard)
# The parenthesized form runs on both Python 2 and Python 3
print("Original x mean = %0.2f, Centered x mean = %0.2f, Std dev of "
      "standard x =%0.2f" % (np.mean(x), np.mean(x_centered), np.std(x_standard)))

How it works…

We generate ten random values between 10 and 24 using np.random, after seeding the generator so that the results are reproducible:

x = [np.random.randint(10,25)*1.0 for i in range(10)]

We will perform standardization using the scale function from scikit-learn:

x_centered = scale(x,with_mean=True,with_std=False)
x_standard = scale(x,with_mean=True,with_std=True)

x_centered is only centered: with_mean is set to True and with_std to False, so the mean is subtracted from each value, but the result is not divided by the standard deviation.

x_standard is standardized using both the mean and the standard deviation.
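To make the two calls less opaque, here is a minimal sketch that reproduces the same arithmetic with NumPy alone. It assumes the population standard deviation (np.std with its default ddof=0), which is what the scikit-learn documentation describes for scale; the names manual_centered and manual_standard are introduced only for this illustration.

```python
import numpy as np

np.random.seed(10)
x = np.array([np.random.randint(10, 25) * 1.0 for i in range(10)])

# Centering: subtract the mean only.
manual_centered = x - x.mean()

# Standardization: center, then divide by the standard deviation.
# np.std defaults to ddof=0 (the population formula).
manual_standard = manual_centered / x.std()

print(manual_centered)
print(manual_standard)
print(manual_centered.mean())  # close to 0, up to floating-point error
print(manual_standard.std())   # close to 1, up to floating-point error
```

Dividing by the sample standard deviation instead (x.std(ddof=1)) would give slightly smaller values, so the choice of estimator matters when comparing results across libraries.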

Now let us look at the output.

The original data is as follows:

[19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0]

Next is x_centered, obtained by subtracting the mean from each value:

[ 1.8  5.8 -3.2 -7.2 -6.2  3.8  4.8  1.8  5.8 -7.2]

Finally, we print x_standard, where we used both the mean and the standard deviation:

[ 0.35059022  1.12967961 -0.62327151 -1.4023609  -1.20758855  0.74013492
  0.93490726  0.35059022  1.12967961 -1.4023609 ]

Original x mean = 17.20, Centered x mean = 0.00, Std dev of standard x =1.00

There's more…

Note

Standardization can be generalized to any level and spread, as follows:

Standardized value = (value - level) / spread
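As an illustration of the general form, the following sketch applies the same equation with two different choices of level and spread. The helper standardize is hypothetical (it is not part of NumPy or scikit-learn), and the median with the interquartile range is just one possible alternative, chosen here because it is less sensitive to outliers. The input is the data printed earlier in this recipe.

```python
import numpy as np


def standardize(values, level, spread):
    """Hypothetical helper: apply (value - level) / spread to every value."""
    values = np.asarray(values, dtype=float)
    return (values - level) / spread


x = np.array([19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0])

# Level = mean, spread = standard deviation: the standardization used above.
z_mean_std = standardize(x, level=x.mean(), spread=x.std())

# Level = median, spread = interquartile range: an assumed robust alternative.
q1, q3 = np.percentile(x, [25, 75])
z_median_iqr = standardize(x, level=np.median(x), spread=q3 - q1)

print(z_mean_std)
print(z_median_iqr)
```

With the first choice the result has a mean of 0 and a standard deviation of 1; with the second, the result has a median of 0 and an interquartile range of 1.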

Let's break the preceding equation into two parts: the numerator alone, which is called centering, and the whole equation, which is called standardization. Centering with the mean plays a critical role in regression. Consider a dataset that has two attributes, weight and height, where weight is the predictor. We center the data so that weight has a mean of 0. This makes the intercept easier to interpret: it is the expected height when the predictor is at its mean.
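The effect on the intercept can be checked with a small sketch. The weight and height values below are made up for illustration, and np.polyfit is used here only as a convenient way to fit a straight line by least squares; any linear regression routine should show the same behavior.

```python
import numpy as np

# Hypothetical sample: weight in kg (predictor), height in cm (response).
weight = np.array([52.0, 58.0, 63.0, 70.0, 77.0, 85.0])
height = np.array([155.0, 160.0, 166.0, 171.0, 176.0, 183.0])

# Fit height = slope * weight + intercept on the raw predictor.
# np.polyfit returns coefficients from the highest degree down.
slope_raw, intercept_raw = np.polyfit(weight, height, 1)

# Center the predictor so that its mean is 0, then fit again.
weight_centered = weight - weight.mean()
slope_centered, intercept_centered = np.polyfit(weight_centered, height, 1)

print("Raw intercept = %0.2f" % intercept_raw)
print("Centered intercept = %0.2f" % intercept_centered)
print("Mean height = %0.2f" % height.mean())
```

The slope is the same in both fits. The raw intercept describes the expected height at a weight of 0, which has no practical meaning, whereas the centered intercept equals the mean height, that is, the expected height at the mean weight.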
