We have now arrived at the core. The discussion up until now was necessary because it gives you the background required to build an object recognition system. Now, let's build an object recognizer that can recognize whether a given image contains a dress, a pair of shoes, or a bag. We can easily extend this system to detect any number of items. We are starting with three distinct items so that you can experiment with it later.
Before we start, we need to make sure that we have a set of training images. There are many databases available online where the images are already arranged into groups. Caltech256 is perhaps one of the most popular databases for object recognition. You can download it from http://www.vision.caltech.edu/Image_Datasets/Caltech256. Create a folder called images and create three subfolders inside it, that is, dress, footwear, and bag. Inside each of those subfolders, add 20 images corresponding to that item. You can just download these images from the internet, but make sure those images have a clean background.
For example, a dress image would look like this:
A footwear image would look like this:
A bag image would look like this:
Now that we have 60 training images, we are ready to start. As a side note, object recognition systems actually need tens of thousands of training images in order to perform well in the real world. Since we are building an object recognizer to detect 3 types of objects, we will take only 20 training images per object. Adding more training images will increase the accuracy and robustness of our system.
The first step here is to extract feature vectors from all the training images and build the visual dictionary (also known as codebook). Here is the code:
import os
import sys
import argparse
import cPickle as pickle
import json

import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_arg_parser():
    parser = argparse.ArgumentParser(description='Creates features for given images')
    parser.add_argument("--samples", dest="cls", nargs="+", action="append",
            required=True, help="Folders containing the training images. \
The first element needs to be the class label.")
    parser.add_argument("--codebook-file", dest='codebook_file', required=True,
            help="Base file name to store the codebook")
    parser.add_argument("--feature-map-file", dest='feature_map_file', required=True,
            help="Base file name to store the feature map")
    return parser

# Loading the images from the input folder
def load_input_map(label, input_folder):
    combined_data = []
    if not os.path.isdir(input_folder):
        raise IOError("The folder " + input_folder + " doesn't exist")

    # Parse the input folder and assign the labels
    for root, dirs, files in os.walk(input_folder):
        for filename in (x for x in files if x.endswith('.jpg')):
            combined_data.append({'label': label,
                    'image': os.path.join(root, filename)})

    return combined_data

class FeatureExtractor(object):
    def extract_image_features(self, img):
        # Dense feature detector
        kps = DenseDetector().detect(img)

        # SIFT feature extractor
        kps, fvs = SIFTExtractor().compute(img, kps)

        return fvs

    # Extract the centroids from the feature points
    def get_centroids(self, input_map, num_samples_to_fit=10):
        kps_all = []
        count = 0
        cur_label = ''
        for item in input_map:
            if count >= num_samples_to_fit:
                if cur_label != item['label']:
                    count = 0
                else:
                    continue

            count += 1
            if count == num_samples_to_fit:
                print "Built centroids for", item['label']

            cur_label = item['label']
            img = cv2.imread(item['image'])
            img = resize_to_size(img, 150)

            fvs = self.extract_image_features(img)
            kps_all.extend(fvs)

        kmeans, centroids = Quantizer().quantize(kps_all)
        return kmeans, centroids

    def get_feature_vector(self, img, kmeans, centroids):
        return Quantizer().get_feature_vector(img, kmeans, centroids)

def extract_feature_map(input_map, kmeans, centroids):
    feature_map = []
    for item in input_map:
        temp_dict = {}
        temp_dict['label'] = item['label']

        print "Extracting features for", item['image']
        img = cv2.imread(item['image'])
        img = resize_to_size(img, 150)

        temp_dict['feature_vector'] = FeatureExtractor().get_feature_vector(
                img, kmeans, centroids)

        if temp_dict['feature_vector'] is not None:
            feature_map.append(temp_dict)

    return feature_map

# Vector quantization
class Quantizer(object):
    def __init__(self, num_clusters=32):
        self.num_dims = 128
        self.extractor = SIFTExtractor()
        self.num_clusters = num_clusters
        self.num_retries = 10

    def quantize(self, datapoints):
        # Create KMeans object
        kmeans = KMeans(self.num_clusters,
                n_init=max(self.num_retries, 1),
                max_iter=10, tol=1.0)

        # Run KMeans on the datapoints
        res = kmeans.fit(datapoints)

        # Extract the centroids of those clusters
        centroids = res.cluster_centers_

        return kmeans, centroids

    def normalize(self, input_data):
        sum_input = np.sum(input_data)
        if sum_input > 0:
            return input_data / sum_input
        else:
            return input_data

    # Extract feature vector from the image
    def get_feature_vector(self, img, kmeans, centroids):
        kps = DenseDetector().detect(img)
        kps, fvs = self.extractor.compute(img, kps)
        labels = kmeans.predict(fvs)
        fv = np.zeros(self.num_clusters)

        for i, item in enumerate(fvs):
            fv[labels[i]] += 1

        fv_image = np.reshape(fv, ((1, fv.shape[0])))
        return self.normalize(fv_image)

class DenseDetector(object):
    def __init__(self, step_size=20, feature_scale=40, img_bound=20):
        self.detector = cv2.FeatureDetector_create("Dense")
        self.detector.setInt("initXyStep", step_size)
        self.detector.setInt("initFeatureScale", feature_scale)
        self.detector.setInt("initImgBound", img_bound)

    def detect(self, img):
        return self.detector.detect(img)

class SIFTExtractor(object):
    def compute(self, image, kps):
        if image is None:
            print "Not a valid image"
            raise TypeError

        gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        kps, des = cv2.SIFT().compute(gray_image, kps)
        return kps, des

# Resize the shorter dimension to 'new_size'
# while maintaining the aspect ratio
def resize_to_size(input_image, new_size=150):
    h, w = input_image.shape[0], input_image.shape[1]
    ds_factor = new_size / float(h)

    if w < h:
        ds_factor = new_size / float(w)

    new_size = (int(w * ds_factor), int(h * ds_factor))
    return cv2.resize(input_image, new_size)

if __name__=='__main__':
    args = build_arg_parser().parse_args()

    input_map = []
    for cls in args.cls:
        assert len(cls) >= 2, "Format for classes is `<label> file`"
        label = cls[0]
        input_map += load_input_map(label, cls[1])

    # Building the codebook
    print "===== Building codebook ====="
    kmeans, centroids = FeatureExtractor().get_centroids(input_map)
    if args.codebook_file:
        with open(args.codebook_file, 'w') as f:
            pickle.dump((kmeans, centroids), f)

    # Input data and labels
    print "===== Building feature map ====="
    feature_map = extract_feature_map(input_map, kmeans, centroids)
    if args.feature_map_file:
        with open(args.feature_map_file, 'w') as f:
            pickle.dump(feature_map, f)
The first thing we need to do is extract the centroids; this is how we build our visual dictionary. The get_centroids method in the FeatureExtractor class is designed to do this. We keep collecting the features extracted from keypoints until we have a sufficient number of them. Since we are using a dense detector, 10 images are sufficient: they already give rise to a large number of features, and the centroids will not change much even if you add more feature points.
Once we've extracted the centroids, we are ready to move on to the next step of feature extraction. The set of centroids is our visual dictionary. The function extract_feature_map will extract a feature vector from each image and associate it with the corresponding label. We do this because we need this mapping to train our classifier: a set of datapoints, where each datapoint is associated with a label. So, we start from an image, extract the feature vector, and then associate it with the corresponding label (such as bag, dress, or footwear).
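The feature map produced this way is just a list of dictionaries pairing each label with its vector. A minimal sketch of the resulting structure, with made-up vectors in place of real image features, mirrors how the training script later unpacks it:

```python
import numpy as np

# Illustrative feature-map entries: each datapoint pairs a class label
# with the image's normalized BoW feature vector (shape (1, 32) here,
# matching the 32-word dictionary used in this chapter)
feature_map = [
    {'label': 'bag', 'feature_vector': np.random.rand(1, 32)},
    {'label': 'dress', 'feature_vector': np.random.rand(1, 32)},
]

# Unpack labels and flatten each vector, as the training script does
labels_words = [x['label'] for x in feature_map]
dim_size = feature_map[0]['feature_vector'].shape[1]
X = [np.reshape(x['feature_vector'], (dim_size,)) for x in feature_map]

print(labels_words, X[0].shape)  # ['bag', 'dress'] (32,)
```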
The Quantizer class is designed to achieve vector quantization and build the feature vector. For each keypoint extracted from the image, the get_feature_vector method finds the closest visual word in our dictionary. By doing this, we end up building a histogram based on our visual dictionary. Each image is now represented as a combination of visual words; hence the name, Bag of Words.
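The quantization step can be sketched on its own, again with made-up descriptors. Note that np.bincount is a compact equivalent of the counting loop in get_feature_vector; the dictionary size matches the Quantizer default of 32:

```python
import numpy as np
from sklearn.cluster import KMeans

num_clusters = 32

# Fit a small dictionary on stand-in descriptors (illustrative data only)
kmeans = KMeans(n_clusters=num_clusters, n_init=10)
kmeans.fit(np.random.rand(300, 128))

# Descriptors extracted from one "image"
fvs = np.random.rand(40, 128)

# Assign each descriptor to its nearest visual word
labels = kmeans.predict(fvs)

# Build a histogram over the dictionary, then L1-normalize it
fv = np.bincount(labels, minlength=num_clusters).astype(float)
fv /= fv.sum()

print(fv.shape)  # (32,)
```

The normalized histogram is the image's Bag of Words representation: its dimensionality depends only on the dictionary size, not on how many keypoints the image produced.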
The next step is to train the classifier using these features. Here is the code:
import os
import sys
import argparse
import cPickle as pickle

import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC
from sklearn import preprocessing

def build_arg_parser():
    parser = argparse.ArgumentParser(description='Trains the classifier models')
    parser.add_argument("--feature-map-file", dest="feature_map_file", required=True,
            help="Input pickle file containing the feature map")
    parser.add_argument("--svm-file", dest="svm_file", required=False,
            help="Output file where the pickled SVM model will be stored")
    return parser

# To train the classifier
class ClassifierTrainer(object):
    def __init__(self, X, label_words):
        # Encoding the labels (words to numbers)
        self.le = preprocessing.LabelEncoder()

        # Initialize One vs One Classifier using a linear kernel
        self.clf = OneVsOneClassifier(LinearSVC(random_state=0))

        y = self._encodeLabels(label_words)
        X = np.asarray(X)
        self.clf.fit(X, y)

    # Predict the output class for the input datapoint
    def _fit(self, X):
        X = np.asarray(X)
        return self.clf.predict(X)

    # Encode the labels (convert words to numbers)
    def _encodeLabels(self, labels_words):
        self.le.fit(labels_words)
        return np.array(self.le.transform(labels_words), dtype=np.float32)

    # Classify the input datapoint
    def classify(self, X):
        labels_nums = self._fit(X)
        labels_words = self.le.inverse_transform([int(x) for x in labels_nums])
        return labels_words

if __name__=='__main__':
    args = build_arg_parser().parse_args()
    feature_map_file = args.feature_map_file
    svm_file = args.svm_file

    # Load the feature map
    with open(feature_map_file, 'r') as f:
        feature_map = pickle.load(f)

    # Extract feature vectors and the labels
    labels_words = [x['label'] for x in feature_map]

    # Here, 0 refers to the first element in the
    # feature_map, and 1 refers to the second
    # element in the shape vector of that element
    # (which gives us the size)
    dim_size = feature_map[0]['feature_vector'].shape[1]

    X = [np.reshape(x['feature_vector'], (dim_size,)) for x in feature_map]

    # Train the SVM
    svm = ClassifierTrainer(X, labels_words)
    if args.svm_file:
        with open(args.svm_file, 'w') as f:
            pickle.dump(svm, f)
We use the scikit-learn package to build the SVM model. You can install it as follows:
$ pip install scikit-learn
We start with labeled data and feed it to OneVsOneClassifier. The classify method classifies an input image and associates a label with it.
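The training logic boils down to a few lines; here is a condensed sketch of what ClassifierTrainer does, using toy 2-dimensional feature vectors and two classes instead of the real 32-dimensional BoW vectors (the data is purely illustrative):

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC
from sklearn import preprocessing

# Toy feature vectors (rows) and their class labels (illustrative data)
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels_words = ['bag', 'bag', 'dress', 'dress']

# Encode the word labels as numbers
le = preprocessing.LabelEncoder()
y = le.fit_transform(labels_words)

# Train a One vs One classifier with a linear SVM per class pair
clf = OneVsOneClassifier(LinearSVC(random_state=0))
clf.fit(X, y)

# Classify a new datapoint and map the number back to a word
pred = clf.predict([[0.85, 0.15]])
print(le.inverse_transform(pred))  # ['bag']
```

The LabelEncoder round-trip is what lets the classifier work with numeric targets internally while the rest of the pipeline keeps using readable class names.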
Let's give this a trial run, shall we? Make sure you have a folder called images, containing the training images for the three classes. Create a folder called models, where the learning models will be stored. Run the following commands on your terminal to create the features and train the classifier:
$ python create_features.py --samples bag images/bag/ --samples dress images/dress/ --samples footwear images/footwear/ --codebook-file models/codebook.pkl --feature-map-file models/feature_map.pkl
$ python training.py --feature-map-file models/feature_map.pkl --svm-file models/svm.pkl
Now that the classifier has been trained, we just need a module to classify the input image and detect the object inside. Here is the code to do it:
import os
import sys
import argparse
import cPickle as pickle

import cv2
import numpy as np

import create_features as cf
from training import ClassifierTrainer

def build_arg_parser():
    parser = argparse.ArgumentParser(description='Extracts features \
from each line and classifies the data')
    parser.add_argument("--input-image", dest="input_image", required=True,
            help="Input image to be classified")
    parser.add_argument("--svm-file", dest="svm_file", required=True,
            help="File containing the trained SVM model")
    parser.add_argument("--codebook-file", dest="codebook_file",
            required=True, help="File containing the codebook")
    return parser

# Classifying an image
class ImageClassifier(object):
    def __init__(self, svm_file, codebook_file):
        # Load the SVM classifier
        with open(svm_file, 'r') as f:
            self.svm = pickle.load(f)

        # Load the codebook
        with open(codebook_file, 'r') as f:
            self.kmeans, self.centroids = pickle.load(f)

    # Method to get the output image tag
    def getImageTag(self, img):
        # Resize the input image
        img = cf.resize_to_size(img)

        # Extract the feature vector
        feature_vector = cf.FeatureExtractor().get_feature_vector(
                img, self.kmeans, self.centroids)

        # Classify the feature vector and get the output tag
        image_tag = self.svm.classify(feature_vector)

        return image_tag

if __name__=='__main__':
    args = build_arg_parser().parse_args()
    svm_file = args.svm_file
    codebook_file = args.codebook_file
    input_image = cv2.imread(args.input_image)

    print "Output class:", ImageClassifier(svm_file,
            codebook_file).getImageTag(input_image)
We are all set! We just extract the feature vector from the input image and use it as the input argument to the classifier. Let's go ahead and see if this works. Download a random footwear image from the internet and make sure it has a clean background. Run the following command, replacing new_image.jpg with the right filename:
$ python classify_data.py --input-image new_image.jpg --svm-file models/svm.pkl --codebook-file models/codebook.pkl
We can use the same technique to build a visual search engine. A visual search engine looks at the input image and shows a bunch of images that are similar to it. We can reuse the object recognition framework to build this. Extract the feature vector from the input image, and compare it with all the feature vectors in the training dataset. Pick out the top matches and display the results. This is a simple way of doing things!
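The naive version of that search can be sketched with scikit-learn's NearestNeighbors; the feature vectors below are random stand-ins for the BoW histograms of a 60-image training set (sizes and data are illustrative only):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in BoW feature vectors for a small image "database"
# (60 images, 32-word dictionary; illustrative data only)
database = np.random.rand(60, 32)

# Index the database, then query with one image's feature vector;
# here we query with image 7 itself, so it should rank first
nn = NearestNeighbors(n_neighbors=5)
nn.fit(database)

query = database[7].reshape(1, -1)
distances, indices = nn.kneighbors(query)

print(indices[0])  # indices of the 5 closest images
```

This brute-force scan is exactly what does not scale, which is why the approximate indexing methods mentioned next exist.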
In the real world, we have to deal with billions of images, so you cannot afford to search through every single image before you display the output. There are a lot of algorithms that are used to make this search efficient and fast. Deep Learning is being used extensively in this field and has shown a lot of promise in recent years. It is a branch of machine learning that focuses on learning optimal representations of data, so that it becomes easier for machines to learn new tasks. You can learn more about it at http://deeplearning.net.