In this section, we will learn how to scale data.
Scaling is an important type of data transformation. By scaling a dataset, we control the range of values that each variable can assume. In a dataset with multiple columns, the columns with a larger range tend to dominate the others in any computation that combines them, such as a distance or similarity measure. We scale the dataset in order to avoid this interference.
Let's say that we are comparing two software products based on the number of features and the number of lines of code. The difference in the number of lines of code will be very large compared to the difference in the number of features, so any similarity or distance measure that we compute will be dominated by the number of lines of code. To avoid such a situation, we will adopt scaling. The simplest form of scaling is min-max scaling. Let's look at min-max scaling on a randomly generated dataset.
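To make this dominance concrete, here is a minimal sketch that compares two products using the Euclidean distance. The product figures and the euclidean helper are hypothetical, invented purely for illustration:

```python
import math

# Hypothetical products described as (number of features, lines of code).
# These figures are invented for illustration only.
product_a = (12, 150000)
product_b = (30, 152000)

def euclidean(p, q):
    """Euclidean distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

distance = euclidean(product_a, product_b)

# Difference in lines of code alone, ignoring the feature counts
loc_only = abs(product_a[1] - product_b[1])

print(distance)
print(loc_only)
```

The two printed numbers are almost identical: the difference of 18 features contributes next to nothing to the distance, which is determined almost entirely by the difference of 2,000 lines of code.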
Let's generate some random data in order to test our scaling functionality:
# Load libraries
import numpy as np

# 1. Generate some random data for scaling
np.random.seed(10)
x = [np.random.randint(10, 25) * 1.0 for i in range(10)]
Now, we will demonstrate scaling:
# 2. Define a function that performs min-max scaling on a list of numbers
def min_max(x):
    # Compute the minimum and maximum once instead of once per element
    lo, hi = min(x), max(x)
    return [round((xx - lo) / (1.0 * (hi - lo)), 2) for xx in x]

# 3. Perform scaling on the given input list
print(x)
print(min_max(x))
In step 1, we generate a list of ten random numbers from 10 to 24 (the upper bound of 25 passed to randint is exclusive). In step 2, we define a function to perform min-max scaling on the given input. Min-max scaling is defined as follows:
x_scaled = (x - min(x)) / (max(x) - min(x))
This transforms the range of the given values. After the transformation, the values will fall in the [0, 1] range.
In step 3, we will first print the original input list. The output is as follows:
[19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0]
We will pass this list to our min_max function in order to get the scaled output, which is as follows:
[0.69, 1.0, 0.31, 0.0, 0.08, 0.85, 0.92, 0.69, 1.0, 0.0]
You can see the scaling in action: 10, which is the smallest number, has been assigned a value of 0.0, and 23, the highest number, has been assigned a value of 1.0. Thus, we scaled the data to the [0, 1] range.
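One case that the min-max formula does not cover is an input in which every value is the same: the denominator max(x) - min(x) is then zero. The following is a minimal sketch of one way to guard against this. The name min_max_safe and the choice to return 0.0 for every element are assumptions made here for illustration; other conventions are equally reasonable:

```python
def min_max_safe(values):
    """Min-max scale a list of numbers to the [0, 1] range.

    Assumption: when all values are identical, every element is mapped
    to 0.0 instead of dividing by zero.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [round((v - lo) / span, 2) for v in values]

print(min_max_safe([19.0, 23.0, 14.0, 10.0]))
print(min_max_safe([5.0, 5.0, 5.0]))
```

For a list with distinct values, this follows the same formula as min-max scaling above; only the constant-input case is handled differently.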
Scikit-learn provides a MinMaxScaler class for the same purpose:
from sklearn.preprocessing import MinMaxScaler
import numpy as np

np.random.seed(10)
# MinMaxScaler expects a 2D array: one row per sample, one column per feature
x = np.array([np.random.randint(10, 25) * 1.0 for i in range(10)]).reshape(-1, 1)

minmax = MinMaxScaler(feature_range=(0.0, 1.0))
print(x)
x_t = minmax.fit_transform(x)
print(x_t)
The output, shown here flattened into lists and rounded to two decimal places for readability, is as follows:
[19.0, 23.0, 14.0, 10.0, 11.0, 21.0, 22.0, 19.0, 23.0, 10.0]
[0.69, 1.0, 0.31, 0.0, 0.08, 0.85, 0.92, 0.69, 1.0, 0.0]
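When the input has more than one column, MinMaxScaler scales each column independently, using that column's own minimum and maximum. The following plain Python sketch mimics this behaviour without scikit-learn so that the mechanics are visible. The min_max_columns helper and the product figures are hypothetical:

```python
def min_max_columns(rows):
    """Scale each column of a list of rows to the [0, 1] range independently."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero
        scaled_columns.append([0.0 if span == 0 else (v - lo) / span for v in col])
    # Transpose back so that each inner list is one row again
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical (features, lines of code) rows, invented for illustration
products = [(12, 150000), (30, 152000), (21, 151000)]
scaled = min_max_columns(products)
print(scaled)
```

After scaling, both columns span the same [0, 1] range, so neither the number of features nor the number of lines of code dominates a comparison between rows.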
We saw examples where we scaled the data to the [0, 1] range; this can be extended to any range. Let's say that our new range is [nr_min, nr_max]; then the min-max formula is modified as follows:
x_scaled = ((x - min(x)) / (max(x) - min(x))) * (nr_max - nr_min) + nr_min
The following will be the Python code:
import numpy as np

np.random.seed(10)
x = [np.random.randint(10, 25) * 1.0 for i in range(10)]

def min_max_range(x, range_values):
    lo, hi = min(x), max(x)
    new_lo, new_hi = range_values
    # Scale to [0, 1] first, then stretch and shift to the new range
    return [round(((xx - lo) / (1.0 * (hi - lo))) * (new_hi - new_lo) + new_lo, 2)
            for xx in x]

print(min_max_range(x, (100, 200)))
Here, range_values is a tuple of two elements, where the element at index 0 is the lower end of the new range and the element at index 1 is the upper end. Let's invoke this function on our input and see the output, as follows:
print(min_max_range(x, (100, 200)))
[169.23, 200.0, 130.77, 100.0, 107.69, 184.62, 192.31, 169.23, 200.0, 100.0]
The lowest value, 10, is now scaled to 100, and the highest value, 23, is scaled to 200.
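As a quick check of the general formula, the following sketch applies it with two other target ranges. With the range (0, 1), it should reduce to plain min-max scaling. The rescale helper and the short data list are hypothetical and written only for this check; the helper assumes that the input contains at least two distinct values:

```python
def rescale(values, new_min, new_max):
    """Min-max scale values into the [new_min, new_max] range.

    Assumption: values contains at least two distinct numbers, so that
    max(values) - min(values) is not zero.
    """
    lo, hi = min(values), max(values)
    return [round((v - lo) / (hi - lo) * (new_max - new_min) + new_min, 2)
            for v in values]

data = [10.0, 14.0, 19.0, 23.0]

# With the range (0, 1), the general formula gives plain min-max scaling
print(rescale(data, 0, 1))

# A range that is symmetric around zero
print(rescale(data, -1, 1))
```

In both cases, the smallest input maps to the lower end of the new range and the largest input maps to the upper end, just as 10 and 23 mapped to 100 and 200 above.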