Scaling the data

In this recipe, we will learn how to scale data.

Getting ready

Scaling is an important type of data transformation. Typically, by scaling a dataset, we control the range of values that each column can take. In a dataset with multiple columns, the columns with a larger range and scale tend to dominate the other columns. We scale the dataset to avoid this interference.

Let's say that we are comparing two software products based on the number of features and the number of lines of code. The difference in the number of lines of code will be very large compared to the difference in the number of features, so any similarity or distance measure we use will be dominated by the number of lines of code. To avoid such a situation, we adopt scaling. The simplest form of scaling is min-max scaling. Let's look at min-max scaling on a randomly generated dataset.
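The domination effect described above can be sketched with a few lines of Python. The three products and their figures below are hypothetical, invented only for illustration, and the helper names euclidean and min_max_columns are not part of the recipe. The sketch compares Euclidean distances before and after each column is min-max scaled:

```python
import math

# Hypothetical products as (number of features, lines of code);
# the figures are invented purely for illustration.
products = [
    (10.0, 20000.0),  # product A
    (40.0, 21000.0),  # product B: four times the features of A
    (12.0, 26000.0),  # product C: many more lines of code than A
]

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_columns(rows):
    """Min-max scale every column of a list of rows to [0, 1]."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        scaled_columns.append([(v - lo) / (hi - lo) for v in col])
    return [tuple(row) for row in zip(*scaled_columns)]

# Raw distances: driven almost entirely by lines of code
raw_ab = euclidean(products[0], products[1])  # roughly 1000
raw_ac = euclidean(products[0], products[2])  # roughly 6000

# Distances after scaling each column: both columns now contribute
scaled = min_max_columns(products)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(raw_ab)
print(raw_ac)
print(scaled_ab)
print(scaled_ac)
```

On the raw data, B looks far closer to A than C does, even though B has four times as many features. After scaling, the two distances are comparable, because the feature count is no longer swamped by the line count.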

How to do it…

Let's generate some random data in order to test our scaling functionality:

# Load Libraries
import numpy as np

# 1. Generate some random data for scaling
np.random.seed(10)
x = [np.random.randint(10,25)*1.0 for i in range(10)]

Now, we will demonstrate scaling:

# 2. Define a function that performs min-max scaling on a list of numbers
def min_max(x):
    lo, hi = min(x), max(x)  # compute the bounds once
    return [round((xx - lo) / (1.0 * (hi - lo)), 2) for xx in x]

# 3. Perform scaling on the given input list
print(x)
print(min_max(x))

How it works…

In step 1, we generate a list of 10 random numbers from 10 to 24 (np.random.randint excludes the upper bound, 25). In step 2, we define a function that performs min-max scaling on the given input. Min-max scaling is defined as follows:

x_scaled = (x - min(x)) / (max(x) - min(x))

The min_max function from step 2 applies this formula to each element of the list and rounds the result to two decimal places. After the transformation, all the values fall in the [0, 1] range.

In step 3, we will first print the original input list. The output is as follows:

[19, 23, 14, 10, 11, 21, 22, 19, 23, 10]

We will pass this list to our min_max function in order to get the scaled output, which is as follows:

[0.69, 1.0, 0.31, 0.0, 0.08, 0.85, 0.92, 0.69, 1.0, 0.0]

You can see the scaling in action: 10, the smallest number, is assigned a value of 0.0, and 23, the largest number, is assigned a value of 1.0. Thus, we have scaled the data to the [0, 1] range.
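One edge case is worth noting: if every value in the list is the same, max(x) - min(x) is zero and the formula divides by zero. The following is a minimal sketch of a guarded variant. The name min_max_safe is hypothetical, and mapping a constant list to 0.0 is an assumption made here for illustration, not something the recipe prescribes:

```python
def min_max_safe(values):
    """Min-max scale a list to [0, 1], guarding against a zero range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant list carries no spread, so map it to 0.0
        return [0.0 for _ in values]
    return [round((v - lo) / float(span), 2) for v in values]

scaled = min_max_safe([19.0, 23.0, 14.0, 10.0])
constant = min_max_safe([5.0, 5.0, 5.0])

print(scaled)    # [0.69, 1.0, 0.31, 0.0]
print(constant)  # [0.0, 0.0, 0.0]
```

Other conventions for the constant case, such as returning 0.5 or raising an error, are equally defensible; the right choice depends on how the scaled values are used downstream.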

There's more…

Scikit-learn provides the MinMaxScaler class for the same purpose:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

np.random.seed(10)
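# MinMaxScaler expects a 2D array: here, one column (feature) with 10 rows
x = np.array([np.random.randint(10,25)*1.0 for i in range(10)]).reshape(-1, 1)
minmax = MinMaxScaler(feature_range=(0.0,1.0))
print(x)
# Learn the column's min and max, then scale the column
x_t = minmax.fit_transform(x)
print(x_t)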

The output, shown here flattened and rounded to two decimal places, is as follows:

[19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0]
[0.69, 1.0, 0.31, 0.0, 0.08, 0.85, 0.92, 0.69, 1.0, 0.0]

We saw examples where we scaled the data to the range [0, 1]; this can be extended to any range. Let's say that our new range is [nr_min, nr_max]; the min-max formula is then modified as follows:

x_scaled = ((x - min(x)) / (max(x) - min(x))) * (nr_max - nr_min) + nr_min

The following will be the Python code:

import numpy as np

np.random.seed(10)
x = [np.random.randint(10,25)*1.0 for i in range(10)]

def min_max_range(x, range_values):
    lo, hi = min(x), max(x)          # bounds of the input data
    new_lo, new_hi = range_values    # bounds of the target range
    return [round(((xx - lo) / (1.0 * (hi - lo))) * (new_hi - new_lo) + new_lo, 2)
            for xx in x]

print(min_max_range(x, (100, 200)))

Here, range_values is a tuple of two elements: the element at index 0 is the lower end of the new range and the element at index 1 is the upper end. Invoking this function on our input gives the following output:

print(min_max_range(x, (100, 200)))

[169.23, 200.0, 130.77, 100.0, 107.69, 184.62, 192.31, 169.23, 200.0, 100.0]

The lowest value, 10, is now scaled to 100 and the highest value, 23, is scaled to 200.
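The same custom range can also be obtained from scikit-learn through the feature_range parameter of MinMaxScaler, which was set to (0.0, 1.0) earlier. The following sketch assumes scikit-learn and NumPy are installed, and it hard-codes the input list printed earlier in this recipe rather than regenerating it:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Input values copied from the recipe's printed list, as a single column
x = np.array([19.0, 23.0, 14.0, 10.0, 11.0,
              21.0, 22.0, 19.0, 23.0, 10.0]).reshape(-1, 1)

# Scale to [100, 200] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(100.0, 200.0))
x_scaled = scaler.fit_transform(x)
print(np.round(x_scaled.ravel(), 2))

# The fitted scaler can also map scaled values back to the original units
x_back = scaler.inverse_transform(x_scaled)
print(x_back.ravel())
```

The rounded values should agree with the output of min_max_range shown above, and inverse_transform recovers the original list.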
