Curse of dimensionality

KNN depends entirely on distance. It is therefore worth studying the curse of dimensionality to understand when KNN's predictive power deteriorates as the number of variables used for prediction increases. High-dimensional spaces are vast, and points in them tend to be far more dispersed than points in low-dimensional spaces. Although there are many ways to examine the curse of dimensionality, here we use uniform random values between zero and one, generated in 1D, 2D, and 3D space, to validate this hypothesis.

In the following lines of code, the mean distance over 1,000 randomly generated pairs of points is calculated for each dimension. It becomes apparent that, as the number of dimensions increases, the average distance between points keeps growing (roughly in proportion to the square root of the dimension), which hints that we need an exponential increase in the number of data points as the dimension grows in order for machine learning algorithms to work correctly:

>>> import numpy as np 
>>> import pandas as pd 
 
# KNN Curse of Dimensionality 
>>> import random,math 

The following code generates random numbers between zero and one from a uniform distribution for the given dimension, which is simply the length of the resulting list:

>>> def random_point_gen(dimension): 
...     return [random.random() for _ in range(dimension)] 
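
For instance, a point generated in three dimensions is simply a list of three coordinates, each drawn from the half-open interval [0, 1), which can be verified quickly:

>>> point = random_point_gen(3)
>>> len(point)
3
>>> all(0 <= coord < 1 for coord in point)
True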

The following function calculates the Euclidean distance (2-norm) between two points by taking the element-wise differences, summing their squares, and finally taking the square root of the total:

>>> def distance(v,w): 
...     vec_sub = [v_i-w_i for v_i,w_i in zip(v,w)] 
...     sum_of_sqrs = sum(v_i*v_i for v_i in vec_sub) 
...     return math.sqrt(sum_of_sqrs) 
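
As a quick sanity check, the distance between the points (0, 0) and (3, 4) should be exactly 5 (the classic 3-4-5 right triangle); the result can also be cross-checked against math.dist from the standard library (available in Python 3.8 and later):

>>> distance([0, 0], [3, 4])
5.0
>>> math.dist([0, 0], [3, 4])
5.0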

The following code uses both the dimension and the number of pairs to generate random point pairs and calculate the distances between them:

>>> def random_distances_comparison(dimension, number_pairs): 
...     return [distance(random_point_gen(dimension), random_point_gen(dimension)) 
...             for _ in range(number_pairs)] 
 
>>> def mean(x): 
...     return sum(x) / len(x) 
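
Putting these helpers together, the average of 1,000 random pairwise distances in two dimensions should land close to 0.52, the theoretical expected distance between two uniform random points in the unit square (the exact value varies with the random seed):

>>> random.seed(34)
>>> sample_2d = random_distances_comparison(2, 1000)
>>> print(mean(sample_2d))    # typically around 0.52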

The experiment is run by varying the dimension from 1 to 200 in steps of 5 (that is, 1, 6, 11, ..., 196) to check how the distances increase:

>>> dimensions = range(1, 201, 5)

Both the minimum and the average distances are calculated for comparison; both tell a similar story:

>>> avg_distances = [] 
>>> min_distances = [] 
 
>>> dummyarray = np.empty((len(dimensions), 4)) 
>>> dist_vals = pd.DataFrame(dummyarray) 
>>> dist_vals.columns = ["Dimension", "Min_Distance", "Avg_Distance", "Min/Avg_Distance"] 
 
>>> random.seed(34) 
>>> i = 0 
>>> for dims in dimensions: 
...     distances = random_distances_comparison(dims, 1000) 
...     avg_distances.append(mean(distances)) 
...     min_distances.append(min(distances)) 
...     dist_vals.loc[i, "Dimension"] = dims 
...     dist_vals.loc[i, "Min_Distance"] = min(distances) 
...     dist_vals.loc[i, "Avg_Distance"] = mean(distances) 
...     dist_vals.loc[i, "Min/Avg_Distance"] = min(distances) / mean(distances) 
...     print(dims, min(distances), mean(distances), min(distances) / mean(distances)) 
...     i = i + 1 
 
# Plotting average distances for various dimensions 
>>> import matplotlib.pyplot as plt 
>>> plt.figure() 
>>> plt.xlabel('Dimensions') 
>>> plt.ylabel('Avg. Distance') 
>>> plt.plot(dist_vals["Dimension"], dist_vals["Avg_Distance"], label='Avg. Distance') 
>>> plt.legend(loc='best') 
>>> plt.show() 

From the preceding graph, we can see that the mean distance grows with the dimension, roughly in proportion to its square root. Hence, the higher the dimensionality, the more data is needed to overcome the curse of dimensionality!
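
This square-root growth can also be checked analytically: for two points drawn uniformly from the unit hypercube, each coordinate contributes an expected squared difference of 1/6, so the mean Euclidean distance in d dimensions is approximately sqrt(d/6) for large d. The following short sketch, assuming the dist_vals DataFrame from the loop above is still in memory, overlays this approximation on the simulated averages:

>>> approx = [math.sqrt(d / 6.0) for d in dist_vals["Dimension"]]
>>> plt.figure()
>>> plt.xlabel('Dimensions')
>>> plt.ylabel('Avg. Distance')
>>> plt.plot(dist_vals["Dimension"], dist_vals["Avg_Distance"], label='Simulated average')
>>> plt.plot(dist_vals["Dimension"], approx, linestyle='--', label='sqrt(d/6) approximation')
>>> plt.legend(loc='best')
>>> plt.show()

The two curves should lie almost on top of each other, confirming the square-root growth of the average distance.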