In this recipe, we will see a model-free method for clustering the data points called Learning Vector Quantization, LVQ for short. LVQ can be used in classification tasks. Not much of an inference can be made between the target variables and prediction variables using this technique. Unlike the other methods, it is tough to make out what relationships exist between the response variable, Y, and predictor, X. They serve very well as a black box approach in many real-world scenarios.
LVQ is an online learning algorithm where the data points are processed one at a time. It makes a very simple intuition. Assume that we have prototype vectors identified for the different classes present in our dataset. The training points will be attracted towards the prototypes of similar classes and will repel the other prototypes.
The major steps in LVQ are as follows:
Select k initial prototype vectors for each class in the dataset. If it's a two-class problem and we decide to have two prototype vectors for each class, we will end up with four initial prototype vectors. The initial prototype vectors are selected randomly from the input dataset.
We will start our iteration. Our iteration will end when our epsilon value has reached either zero or a predefined threshold. We will decide an epsilon value and decrement the epsilon value with every iteration.
In each iteration, we will sample an input point (with replacement) and find the closest prototype vector to this point. We will use Euclidean distance to find the closest point. We will update the prototype vector of the closest point, as follows:
If the class label of the prototype vector is the same as the input data point, we will increment the prototype vector with the difference between the prototype vector and data point.
If the class label is different, we will decrement the prototype vector with the difference between the prototype vector and data point.
We will use the Iris dataset to demonstrate how LVQ works. As in some of our previous recipe, we will use the convenient data loading function from scikit-learn in order to load the Iris dataset. Iris is a well known classification dataset. However our purpose of using it here is to only demonstrate LVQ's capability. Datasets without class lablels can also be used or processed by LVQ. As we are going to use Euclidean distance, we will scale the data using minmax scaling.
from sklearn.datasets import load_iris import numpy as np from sklearn.metrics import euclidean_distances data = load_iris() x = data['data'] y = data['target'] # Scale the variables from sklearn.preprocessing import MinMaxScaler minmax = MinMaxScaler() x = minmax.fit_transform(x)
R = 2 n_classes = 3 epsilon = 0.9 epsilon_dec_factor = 0.001
class prototype(object): """ Class to hold prototype vectors """ def __init__(self,class_id,p_vector,eplsilon): self.class_id = class_id self.p_vector = p_vector self.epsilon = epsilon def update(self,u_vector,increment=True): if increment: # Move the prototype vector closer to input vector self.p_vector = self.p_vector + self.epsilon*(u_vector - self.p_vector) else: # Move the prototype vector away from input vector self.p_vector = self.p_vector - self.epsilon*(u_vector - self.p_vector)
def find_closest(in_vector,proto_vectors): closest = None closest_distance = 99999 for p_v in proto_vectors: distance = euclidean_distances(in_vector,p_v.p_vector) if distance < closest_distance: closest_distance = distance closest = p_v return closest
def find_class_id(test_vector,p_vectors): return find_closest(test_vector,p_vectors).class_id
# Choose R initial prototypes for each class p_vectors = [] for i in range(n_classes): # Select a class y_subset = np.where(y == i) # Select tuples for choosen class x_subset = x[y_subset] # Get R random indices between 0 and 50 samples = np.random.randint(0,len(x_subset),R) # Select p_vectors for sample in samples: s = x_subset[sample] p = prototype(i,s,epsilon) p_vectors.append(p) print "class id Initial protype vector " for p_v in p_vectors: print p_v.class_id,' ',p_v.p_vector print
while epsilon >= 0.01: # Sample a training instance randonly rnd_i = np.random.randint(0,149) rnd_s = x[rnd_i] target_y = y[rnd_i] # Decrement epsilon value for next iteration epsilon = epsilon - epsilon_dec_factor # Find closes prototype vector to given point closest_pvector = find_closest(rnd_s,p_vectors) # Update closes prototype vector if target_y == closest_pvector.class_id: closest_pvector.update(rnd_s) else: closest_pvector.update(rnd_s,False) closest_pvector.epsilon = epsilon print "class id Final Prototype Vector " for p_vector in p_vectors: print p_vector.class_id,' ',p_vector.p_vector
predicted_y = [find_class_id(instance,p_vectors) for instance in x ] from sklearn.metrics import classification_report print print classification_report(y,predicted_y,target_names=['Iris-Setosa','Iris-Versicolour', 'Iris-Virginica'])
In step 1, we initialize the parameters for the algorithm. We have chosen our R value as two, that is, we have two prototype vectors per class label. The Iris dataset is a three-class problem, so we have six prototype vectors in total. We must choose our epsilon value and epsilon decrement factor.
We then define a data structure to hold the details of our prototype vector in step 2. Our class stores the following for each point in the dataset:
self.class_id = class_id self.p_vector = p_vector self.epsilon = epsilon
The class id to which the prototype vector belongs is the vector itself and the epsilon value. It also has a function update that is used to change the prototype values:
def update(self,u_vector,increment=True): if increment: # Move the prototype vector closer to input vector self.p_vector = self.p_vector + self.epsilon*(u_vector - self.p_vector) else: # Move the prototype vector away from input vector self.p_vector = self.p_vector - self.epsilon*(u_vector - self.p_vector)
In step 3, we define the following function, which takes any given vector as the input and a list of all the prototype vectors. Out of all the prototype vectors, this function returns the closest prototype vector to the given vector:
for p_v in proto_vectors: distance = euclidean_distances(in_vector,p_v.p_vector) if distance < closest_distance: closest_distance = distance closest = p_v
As you can see, it loops through all the prototype vectors to find the closest one. It uses Euclidean distance to measure the similarity.
Step 4 is a small function that can return the class ID of the closest prototype vector to the given vector.
Now that we have finished all the required preprocessing for the LVQ algorithm, we can move on to the actual algorithm in step 5. For each class, we must select the initial prototype vectors. We then select R random points from each class. The outer loop goes through each class, and for each class, we select R random samples and create our prototype object, as follows:
samples = np.random.randint(0,len(x_subset),R) # Select p_vectors for sample in samples: s = x_subset[sample] p = prototype(i,s,epsilon) p_vectors.append(p)
In step 6, we increment or decrement the prototype vectors iteratively. We loop continuously till our epsilon value falls below a threshold of 0.01.
We then randomly sample a point from our dataset, as follows:
# Sample a training instance randonly rnd_i = np.random.randint(0,149) rnd_s = x[rnd_i] target_y = y[rnd_i]
The point and its corresponding class ID have been retrieved.
We can then find the closed prototype vector to this point, as follows:
closest_pvector = find_closest(rnd_s,p_vectors)
If the current point's class ID matches the prototype's class ID, we call the update
method, with the increment set to True
, or else we will call the update
with the increment set to False
:
# Update closes prototype vector if target_y == closest_pvector.class_id: closest_pvector.update(rnd_s) else: closest_pvector.update(rnd_s,False)
Finally, we update the epsilon value for the closest prototype vector:
closest_pvector.epsilon = epsilon
We can print the prototype vectors in order to look at them manually:
print "class id Final Prototype Vector " for p_vector in p_vectors: print p_vector.class_id,' ',p_vector.p_vector
In step 7, we put our prototype vectors into action to do some predictions:
predicted_y = [find_class_id(instance,p_vectors) for instance in x ]
We can get the predicted class ID using the find_class_id
function. We pass a point and all the learned prototype vectors to it to get the class ID.
Finally, we give our predicted output in order to generate a classification report:
print classification_report(y,predicted_y,target_names=['Iris-Setosa','Iris-Versicolour', 'Iris-Virginica'])
The classification report function is a convenient function provided by the scikit-learn library to view the classification accuracy scores:
You can see that we have done pretty well with our classification. Keep in mind that we did not keep a separate test set. Never measure the accuracy of your model based on the training data. Always use a test set that is unseen by the training routines. We did it only for illustration purposes.
Keep in mind that this technique does not involve any optimization criteria as in the other classification methods. Hence, it is very difficult to judge how good the prototype vectors have been generated.
In our recipe, we initialized the prototype vectors as random values. You can use the k-means algorithm to initialize the prototype vectors.