Learning vector quantization

In this recipe, we will see a model-free method for clustering data points called Learning Vector Quantization, LVQ for short. LVQ can also be used in classification tasks, but not much inference can be made about the relationship between the target variable and the predictor variables with this technique. Unlike the other methods, it is hard to make out what relationship exists between the response variable, Y, and the predictor, X. It serves very well as a black-box approach in many real-world scenarios.

Getting ready

LVQ is an online learning algorithm where the data points are processed one at a time. The intuition behind it is very simple: assume that we have prototype vectors identified for the different classes present in our dataset. Each training point pulls the prototypes of its own class towards itself and pushes the prototypes of the other classes away.

The major steps in LVQ are as follows:

Select R initial prototype vectors for each class in the dataset. If it's a two-class problem and we decide to have two prototype vectors for each class, we will end up with four initial prototype vectors. The initial prototype vectors are selected randomly from the input dataset.

We will start our iterations. We will decide on an epsilon value and decrement it with every iteration; the iterations will end when the epsilon value reaches zero or a predefined threshold.

In each iteration, we will sample an input point (with replacement) and find the prototype vector closest to this point, using Euclidean distance. We will then update that closest prototype vector, as follows:

If the class label of the closest prototype vector is the same as that of the input data point, we will move the prototype vector towards the data point, incrementing it by epsilon times the difference between the data point and the prototype vector.

If the class labels are different, we will move the prototype vector away from the data point, decrementing it by epsilon times the same difference.

We will use the Iris dataset to demonstrate how LVQ works. As in some of our previous recipes, we will use the convenient data-loading function from scikit-learn in order to load the Iris dataset. Iris is a well-known classification dataset; however, our purpose in using it here is only to demonstrate LVQ's capability. Datasets without class labels can also be processed by the closely related, unsupervised vector quantization approach. As we are going to use Euclidean distance, we will scale the data using min-max scaling.

from sklearn.datasets import load_iris
import numpy as np
from sklearn.metrics import euclidean_distances

data = load_iris()
x = data['data']
y = data['target']

# Scale the variables
from sklearn.preprocessing import MinMaxScaler
minmax = MinMaxScaler()
x = minmax.fit_transform(x)
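
As a quick, optional check (not part of the recipe), min-max scaling should map every feature into the [0, 1] range:

# After min-max scaling, each column's minimum is 0 and maximum is 1
print(x.min(axis=0))   # expected: [ 0.  0.  0.  0.]
print(x.max(axis=0))   # expected: [ 1.  1.  1.  1.]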

How to do it…

  1. Let's first declare the parameters for LVQ:
    R = 2
    n_classes = 3
    epsilon = 0.9
    epsilon_dec_factor = 0.001
  2. Define a class to hold the prototype vectors:
    class prototype(object):
        """
        Class to hold prototype vectors
        """
    
        def __init__(self,class_id,p_vector,epsilon):
            self.class_id = class_id
            self.p_vector = p_vector
            self.epsilon = epsilon
            
        def update(self,u_vector,increment=True):
            if increment:
                # Move the prototype vector closer to input vector
                self.p_vector = self.p_vector + self.epsilon*(u_vector - self.p_vector)
            else:
                # Move the prototype vector away from input vector
                self.p_vector = self.p_vector - self.epsilon*(u_vector - self.p_vector)
  3. This is the function to find the closest prototype vector for a given vector:
    def find_closest(in_vector,proto_vectors):
        closest = None
        closest_distance = float("inf")
        for p_v in proto_vectors:
            # Reshape to 2-D arrays, as euclidean_distances expects 2-D input
            distance = euclidean_distances(in_vector.reshape(1,-1),p_v.p_vector.reshape(1,-1))
            if distance < closest_distance:
                closest_distance = distance
                closest = p_v
        return closest
  4. A convenient function to find the class ID of the closest prototype vector is as follows:
    def find_class_id(test_vector,p_vectors):
        return find_closest(test_vector,p_vectors).class_id
  5. Choose the initial R * number of classes of prototype vectors:
    # Choose R initial prototypes for each class        
    p_vectors = []
    for i in range(n_classes):
        # Select a class
        y_subset = np.where(y == i)
        # Select tuples for the chosen class
        x_subset  = x[y_subset]
        # Get R random indices between 0 and the number of points in this class
        samples = np.random.randint(0,len(x_subset),R)
        # Select p_vectors
        for sample in samples:
            s = x_subset[sample]
            p = prototype(i,s,epsilon)
            p_vectors.append(p)
    
    print "class id 	 Initial protype vector
    "
    for p_v in p_vectors:
        print p_v.class_id,'	',p_v.p_vector
           print
  6. Perform the iterations to adjust the prototype vectors so that they can be used to classify/cluster any new incoming points:
    while epsilon >= 0.01:
        # Sample a training instance randomly
        rnd_i = np.random.randint(0,len(x))
        rnd_s = x[rnd_i]
        target_y = y[rnd_i]

        # Decrement epsilon value for the next iteration
        epsilon = epsilon - epsilon_dec_factor
        # Find the closest prototype vector to the given point
        closest_pvector = find_closest(rnd_s,p_vectors)

        # Update the closest prototype vector
        if target_y == closest_pvector.class_id:
            closest_pvector.update(rnd_s)
        else:
            closest_pvector.update(rnd_s,False)
        closest_pvector.epsilon = epsilon

    print "class id \t Final Prototype Vector\n"
    for p_vector in p_vectors:
        print p_vector.class_id,'\t',p_vector.p_vector
  7. The following is a small test to verify the correctness of our method:
    predicted_y = [find_class_id(instance,p_vectors) for instance in x ]
    
    from sklearn.metrics import classification_report
    
    print
    print classification_report(y,predicted_y,target_names=['Iris-Setosa','Iris-Versicolour', 'Iris-Virginica'])

How it works…

In step 1, we initialize the parameters for the algorithm. We have chosen our R value as two, that is, we have two prototype vectors per class label. The Iris dataset is a three-class problem, so we have six prototype vectors in total. We must choose our epsilon value and epsilon decrement factor.

We then define a data structure to hold the details of our prototype vectors in step 2. Our class stores the following for each prototype vector:

self.class_id = class_id
self.p_vector = p_vector
self.epsilon = epsilon

These are the class ID to which the prototype vector belongs, the prototype vector itself, and the epsilon value. The class also has an update function that is used to change the prototype values:

def update(self,u_vector,increment=True):
    if increment:
        # Move the prototype vector closer to input vector
        self.p_vector = self.p_vector + self.epsilon*(u_vector - self.p_vector)
    else:
        # Move the prototype vector away from input vector
        self.p_vector = self.p_vector - self.epsilon*(u_vector - self.p_vector)
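
As a small usage sketch with made-up numbers (assuming the prototype class above and numpy imported as np), a prototype with an epsilon of 0.5 moves halfway towards a point of its own class:

p = prototype(0, np.array([0.2, 0.4, 0.3, 0.1]), 0.5)
p.update(np.array([0.6, 0.4, 0.3, 0.1]))   # same class: attract
print(p.p_vector)                           # -> [ 0.4  0.4  0.3  0.1]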

In step 3, we define the following function, which takes any given vector and the list of all the prototype vectors as input. Out of all the prototype vectors, this function returns the one closest to the given vector:

for p_v in proto_vectors:
    distance = euclidean_distances(in_vector.reshape(1,-1),p_v.p_vector.reshape(1,-1))
    if distance < closest_distance:
        closest_distance = distance
        closest = p_v

As you can see, it loops through all the prototype vectors to find the closest one. It uses Euclidean distance to measure the similarity.
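
The following is a small usage sketch with made-up vectors (assuming the prototype class and find_closest defined above); the function simply returns the prototype whose vector lies nearest to the query point:

p_a = prototype(0, np.array([0.1, 0.1, 0.1, 0.1]), epsilon)
p_b = prototype(1, np.array([0.9, 0.9, 0.9, 0.9]), epsilon)
query = np.array([0.2, 0.2, 0.2, 0.2])
print(find_closest(query, [p_a, p_b]).class_id)   # -> 0, the nearer prototype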

Step 4 is a small function that can return the class ID of the closest prototype vector to the given vector.

Now that we have finished all the required preprocessing for the LVQ algorithm, we can move on to the actual algorithm in step 5. For each class, we must select R initial prototype vectors at random. The outer loop goes through each class, and for each class, we select R random samples and create our prototype objects, as follows:

samples = np.random.randint(0,len(x_subset),R)
# Select p_vectors
for sample in samples:
    s = x_subset[sample]
    p = prototype(i,s,epsilon)
    p_vectors.append(p)
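
After this loop, a quick sanity check (an illustrative print, not part of the recipe) should show R * n_classes starting prototypes:

print(len(p_vectors))   # expected: 6, that is, R = 2 prototypes for each of the 3 classes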

In step 6, we increment or decrement the prototype vectors iteratively. We loop continuously until our epsilon value falls below the threshold of 0.01.
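
As a rough check on how long this loop runs (simple arithmetic on the parameter values chosen in step 1, not part of the recipe itself):

# epsilon starts at 0.9, drops by 0.001 per iteration, and the loop stops below 0.01
print(int((0.9 - 0.01) / 0.001))   # roughly 890 prototype updates in total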

We then randomly sample a point from our dataset, as follows:

# Sample a training instance randomly
rnd_i = np.random.randint(0,len(x))
rnd_s = x[rnd_i]
target_y = y[rnd_i]

The point and its corresponding class ID have been retrieved.

We can then find the closest prototype vector to this point, as follows:

closest_pvector = find_closest(rnd_s,p_vectors)

If the current point's class ID matches the prototype's class ID, we call the update method, with the increment set to True, or else we will call the update with the increment set to False:

# Update the closest prototype vector
if target_y == closest_pvector.class_id:
    closest_pvector.update(rnd_s)
else:
    closest_pvector.update(rnd_s,False)

Finally, we update the epsilon value for the closest prototype vector:

closest_pvector.epsilon = epsilon

We can print the prototype vectors in order to look at them manually:

print "class id 	 Final Prototype Vector
"
for p_vector in p_vectors:
print p_vector.class_id,'	',p_vector.p_vector

In step 7, we put our prototype vectors into action to do some predictions:

predicted_y = [find_class_id(instance,p_vectors) for instance in x ]

We get the predicted class ID using the find_class_id function, passing it a point and all the learned prototype vectors.
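
For a single point, the same helper can be used directly; here we feed the first scaled Iris sample back in as an illustrative query:

# x[0] is an Iris-Setosa sample (class 0), so well-trained prototypes should return 0
print(find_class_id(x[0], p_vectors))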

Finally, we give our predicted output in order to generate a classification report:

print classification_report(y,predicted_y,target_names=['Iris-Setosa','Iris-Versicolour', 'Iris-Virginica'])

The classification_report function is a convenience function provided by the scikit-learn library to view the per-class precision, recall, and F1-score.


You can see that we have done pretty well with our classification. However, keep in mind that we did not keep a separate test set and measured accuracy on the training data purely for illustration purposes. Never measure the accuracy of your model on the training data; always use a test set that is unseen by the training routines.
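
As a minimal sketch of a more honest evaluation (assuming the scaled data and the functions from this recipe are already defined), we could hold out a test split before training and score the prototypes only on the unseen portion:

# train_test_split lives in sklearn.model_selection (sklearn.cross_validation in older versions)
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)

# ... run steps 5 and 6 above using x_train and y_train instead of x and y ...

predicted_y = [find_class_id(instance, p_vectors) for instance in x_test]
print(classification_report(y_test, predicted_y))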

There's more...

Keep in mind that this technique does not involve any optimization criterion, as the other classification methods do. Hence, it is very difficult to judge how well the prototype vectors have been generated.

In our recipe, we initialized the prototype vectors with points chosen at random from each class. Alternatively, you can use the k-means algorithm to initialize the prototype vectors, as sketched below.
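
The following is a minimal sketch of such an initialization (assuming the prototype class and the variables x, y, R, n_classes, and epsilon from this recipe are in scope); the cluster centers found within each class replace the randomly sampled points as starting prototypes:

from sklearn.cluster import KMeans

p_vectors = []
for i in range(n_classes):
    # Fit R cluster centers to the points of class i
    x_subset = x[np.where(y == i)]
    kmeans = KMeans(n_clusters=R, random_state=7)
    kmeans.fit(x_subset)
    # Use each cluster center as an initial prototype for class i
    for center in kmeans.cluster_centers_:
        p_vectors.append(prototype(i, center, epsilon))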

See also

  • Clustering of data using K-Means recipe in Chapter 5, Data Mining - Finding a needle in a haystack