Feature extraction

Chances are that raw pixel values are not the most informative way to represent the data, as we already realized in Chapter 3, Finding Objects via Feature Matching and Perspective Transforms. Instead, we need to derive a measurable property of the data that is more informative for classification.

However, it is often unclear which features will perform best, so it is usually necessary to experiment with different features that the modeler finds appropriate. After all, the choice of features might strongly depend on the specific dataset to be analyzed or the specific classification task to be performed. For example, if you have to distinguish between a stop sign and a warning sign, then the most telling feature might be the shape of the sign or the color scheme. However, if you have to distinguish between two warning signs, then color and shape will not help you at all, and you will be required to come up with more sophisticated features.

In order to demonstrate how the choice of features affects classification performance, we will focus on the following:

  • A few simple color transformations, such as grayscale, RGB, and HSV. Classification based on grayscale images will give us some baseline performance for the classifier. RGB might give us slightly better performance because of the distinct color schemes of some traffic signs. Even better performance is expected from HSV, because it separates hue from brightness and therefore represents colors more robustly than RGB under changing lighting. Traffic signs tend to have very bright, saturated colors that (ideally) are quite distinct from their surroundings.
  • Speeded-Up Robust Features (SURF), which should appear very familiar to you by now. We have previously recognized SURF as an efficient and robust method of extracting meaningful features from an image, so can't we use this technique to our advantage in a classification task?
  • Histogram of Oriented Gradients (HOG), which is by far the most advanced feature descriptor to be considered in this chapter. The technique counts occurrences of gradient orientations along a dense grid laid out on the image, and is well-suited for use with SVMs.

Feature extraction is performed by the gtsrb._extract_features function, which is implicitly called by gtsrb.load_data. It extracts different features as specified by the feature input argument.

The easiest case is not to extract any features, instead simply resizing the image to a suitable size:

def _extract_features(X, feature):
    # operate on smaller image
    small_size = (32, 32)
    X = [cv2.resize(x, small_size) for x in X]


For most of the following features, we will be using the (already suitable) default arguments in OpenCV. However, these values are not set in stone. In real-world classification tasks, it is often necessary to search across the range of possible values for both the feature extraction parameters and the learning parameters, a process called hyperparameter exploration.
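
Hyperparameter exploration can be as simple as an exhaustive grid search. The following is a minimal sketch and not part of the chapter's code: the parameter names in grid and the evaluate callback are hypothetical placeholders for whatever settings you want to vary and for whatever routine trains the classifier and returns a validation score.

```python
import itertools


def grid_search(grid, evaluate):
    """Try every combination in `grid` and return the best (params, score).

    grid     -- dict mapping a parameter name to a list of candidate values
    evaluate -- hypothetical callback: takes a dict of parameters and returns
                a score where higher is better (e.g. validation accuracy)
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative values only; a dummy scorer stands in for real training.
grid = {"num_bins": [6, 9, 12], "small_size": [16, 32]}
best, score = grid_search(
    grid, lambda p: -abs(p["num_bins"] - 9) + p["small_size"])
```

In practice, the scorer would extract features with the given settings, train the classifier, and report accuracy on held-out data. The cost grows with the product of the list lengths, so the candidate lists are best kept short.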

Common preprocessing

There are three common forms of preprocessing that are almost always applied to any data before classification: mean subtraction, normalization, and principal component analysis (PCA). In this chapter, we will focus on the first two.

Mean subtraction (sometimes also referred to as zero-centering or de-meaning) is the most common form of preprocessing. Here, the mean value of every feature dimension is calculated across all samples in a dataset. This feature-wise average is then subtracted from every sample in the dataset. You can think of this process as centering the cloud of data on the origin.

Normalization refers to the scaling of data dimensions so that they are of roughly the same scale. This can be achieved by either dividing each dimension by its standard deviation (once it has been zero-centered), or scaling each dimension to lie in the range of [-1, 1]. It makes sense to apply this step only if you have reason to believe that different input features have different scales or units. In the case of images, the relative scales of pixels are already approximately equal (and in the range of [0, 255]), so it is not strictly necessary to perform this additional preprocessing step.
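
To make the two operations concrete, here is a minimal NumPy sketch on toy data. The random matrix and its column scales are made up for illustration; only the two preprocessing steps themselves come from the description above.

```python
import numpy as np

rng = np.random.RandomState(0)
# Toy data: 100 samples (rows) with 4 features (columns) on different scales.
X = rng.rand(100, 4) * np.array([1.0, 10.0, 100.0, 1000.0])

# Mean subtraction: remove the per-feature average (axis 0 runs over samples).
X_centered = X - X.mean(axis=0)

# Normalization: divide each zero-centered feature by its standard deviation.
X_normalized = X_centered / X_centered.std(axis=0)
```

After these two steps, every column has a mean of (numerically) zero and a standard deviation of one, regardless of its original scale.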

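In this chapter, the idea is to enhance the local intensity contrast of images so that we do not focus on the overall brightness of an image. To this end, we first scale all intensities to the range [0, 1] and then subtract from each image its own mean intensity (rather than the dataset-wide, feature-wise mean described earlier):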

# normalize all intensities to be between 0 and 1
X = np.array(X).astype(np.float32) / 255

# subtract mean
X = [x - np.mean(x) for x in X]
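
Why does subtracting each image's own mean remove the overall brightness? The following sketch, which uses a made-up random image and a hypothetical helper named subtract_own_mean, shows that two versions of the same image that differ only by a constant brightness offset become (numerically) identical after this step:

```python
import numpy as np

rng = np.random.RandomState(42)
image = rng.rand(32, 32).astype(np.float32) * 0.5  # intensities in [0, 0.5)
brighter = image + 0.3                              # same content, uniformly brighter


def subtract_own_mean(x):
    """Remove the average intensity of a single image."""
    return x - np.mean(x)


a = subtract_own_mean(image)
b = subtract_own_mean(brighter)
```

Here, a and b agree up to floating-point rounding, so a classifier trained on such data does not see the brightness offset.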

Grayscale features

The easiest feature to extract is probably the grayscale value of each pixel. Usually, grayscale values are not very indicative of the data they describe, but we will include them here for illustrative purposes (that is, to achieve baseline performance):

if feature == 'gray' or feature == 'surf':
    X = [cv2.cvtColor(x, cv2.COLOR_BGR2GRAY) for x in X]
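
Under the hood, the grayscale conversion is a weighted sum of the three color channels, Y = 0.299 R + 0.587 G + 0.114 B. The following NumPy sketch mimics this for a BGR image; bgr_to_gray is a hypothetical helper written for illustration, and its rounding is not guaranteed to match OpenCV's internal arithmetic bit for bit:

```python
import numpy as np


def bgr_to_gray(image):
    """Weighted sum of the B, G, R channels (last axis), rounded to uint8."""
    weights = np.array([0.114, 0.587, 0.299])  # B, G, R order
    gray = image.astype(np.float64).dot(weights)
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


# One row of three BGR pixels: white, pure red, pure blue.
pixels = np.array([[[255, 255, 255], [0, 0, 255], [255, 0, 0]]], dtype=np.uint8)
gray = bgr_to_gray(pixels)
```

Note how pure red and pure blue end up as fairly dark gray values. This is exactly the kind of color information that is lost in a grayscale representation.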

Color spaces

Alternatively, you might find that colors contain some information that raw grayscale values cannot capture. Traffic signs often have a distinct color scheme, which might be indicative of the information they are trying to convey (for example, red for stop signs and forbidden actions, green for informational signs, and so on). We could opt for using the RGB images as input, in which case we do not have to do anything, since the images are already stored as three-channel color images (in BGR channel order, as is OpenCV's convention).

However, even RGB might not be informative enough. For example, a stop sign in broad daylight might appear very bright and clear, but its colors might appear much less vibrant on a rainy or foggy day. A better choice might be the HSV color space, which rearranges RGB color values in a cylindrical coordinate space along the axes of hue, saturation, and value (or brightness). The most telling feature of traffic signs in this color space might be the hue (a more perceptually relevant description of color or chromaticity), better distinguishing the color scheme of different sign types. Saturation and value could be equally important, however, as traffic signs tend to use relatively bright and saturated colors that do not typically appear in natural scenes (that is, their surroundings).

In OpenCV, the HSV color space is only a single call to cv2.cvtColor away:

if feature == 'hsv':
    X = [cv2.cvtColor(x, cv2.COLOR_BGR2HSV) for x in X]
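
To see why HSV copes better with changes in lighting, consider the following sketch. It uses Python's standard colorsys module instead of OpenCV, purely for illustration; colorsys works on values in [0, 1], whereas OpenCV's 8-bit HSV images store hue in [0, 179] and saturation and value in [0, 255].

```python
import colorsys

# RGB triples scaled to [0, 1]: a vivid red and the same red at half brightness.
bright_red = (1.0, 0.0, 0.0)
dim_red = (0.5, 0.0, 0.0)

h1, s1, v1 = colorsys.rgb_to_hsv(*bright_red)
h2, s2, v2 = colorsys.rgb_to_hsv(*dim_red)
```

Both colors share the same hue and saturation; only the value channel differs. In RGB, by contrast, the change in brightness alters the red channel itself.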

Speeded Up Robust Features

But wait a minute! In Chapter 3, Finding Objects via Feature Matching and Perspective Transforms, you learned that the SURF descriptor is one of the best and most robust ways to describe images independent of scale or rotation. Can we use this technique to our advantage in a classification task?

Glad you asked! To make this work, we need to adjust SURF so that it returns a fixed number of features per image. By default, the SURF descriptor is only applied to a small list of interesting keypoints in the image, the number of which might differ on an image-by-image basis. This is unsuitable for our current purposes, because we want to find a fixed number of feature values per data sample.

Instead, we need to apply SURF to a fixed dense grid laid out over the image, which can be achieved by creating a dense feature detector:

if feature == 'surf':
    # create dense grid of keypoints
    dense = cv2.FeatureDetector_create("Dense")
    kp = dense.detect(np.zeros(small_size).astype(np.uint8))
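
Conceptually, a dense detector does nothing more than place keypoints at regular intervals. The following pure-Python sketch generates such grid coordinates; dense_grid is a hypothetical helper, and the 6-pixel step is an assumption made for illustration rather than a value taken from the chapter's code.

```python
def dense_grid(width, height, step):
    """Return (x, y) coordinates of a regular grid covering the image."""
    return [(x, y)
            for y in range(0, height, step)
            for x in range(0, width, step)]


# Assumed spacing of 6 pixels on a 32 x 32 image.
grid_points = dense_grid(32, 32, step=6)
```

With these settings, the grid contains 6 x 6 = 36 points. In OpenCV releases that no longer provide cv2.FeatureDetector_create, coordinates like these could be wrapped in cv2.KeyPoint objects by hand.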

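Then it is possible to obtain SURF descriptors for each point on the grid and append that data sample to our feature matrix. We initialize SURF with a minHessian value of 400, as we did before, and enable the upright flag (so that no keypoint orientation is computed) and the extended flag (so that each descriptor has 128 instead of 64 values):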

surf = cv2.SURF(400)
surf.upright = True
surf.extended = True

Keypoints and descriptors can then be obtained via this code:

kp_des = [surf.compute(x, kp) for x in X]

Because surf.compute returns both the keypoints and their descriptors, each element of kp_des is a (keypoints, descriptors) pair. The second element of each pair is the descriptor array that we care about. We keep the first num_surf_features descriptors (rows) of each data sample and add them back to the training set:

num_surf_features = 36
X = [d[1][:num_surf_features, :] for d in kp_des]
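
The indexing in the preceding snippet is easier to follow with a stand-in for one element of kp_des. The sketch below uses dummy arrays instead of real SURF output and assumes that 36 keypoints survive the descriptor computation; the variable names other than num_surf_features are made up for illustration.

```python
import numpy as np

num_keypoints, descriptor_length = 36, 128  # 128 values per keypoint in extended mode

# Stand-in for one element of kp_des: a (keypoints, descriptors) pair.
fake_keypoints = [None] * num_keypoints
fake_descriptors = np.zeros((num_keypoints, descriptor_length), dtype=np.float32)
kp_des_item = (fake_keypoints, fake_descriptors)

num_surf_features = 36
selected = kp_des_item[1][:num_surf_features, :]  # descriptors only, first 36 rows
feature_vector = selected.flatten()               # one long vector per sample
```

Under these assumptions, every image ends up as a vector of 36 x 128 = 4,608 values once it is flattened. Slicing to a fixed number of rows is what guarantees that all samples have the same length.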

Histogram of Oriented Gradients

The last feature descriptor to consider is the Histogram of Oriented Gradients (HOG). HOG features have previously been shown to work exceptionally well in combination with SVMs, especially when applied to tasks such as pedestrian recognition.

The essential idea behind HOG features is that the local shapes and appearance of objects within an image can be described by the distribution of edge directions. The image is divided into small connected regions, within which a histogram of gradient directions (or edge directions) is compiled. Then, the descriptor is assembled by concatenating the different histograms. For improved performance, the local histograms can be contrast-normalized, which results in better invariance to changes in illumination and shadowing. You can see why this sort of preprocessing might just be the perfect fit for recognizing traffic signs under different viewing angles and lighting conditions.
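
The core of this idea, a histogram of gradient directions within a single cell, can be sketched in a few lines of NumPy. This is a simplified illustration only: cell_histogram is a hypothetical helper, and unlike a full HOG implementation it performs no interpolation between bins and no block-wise contrast normalization.

```python
import numpy as np


def cell_histogram(cell, num_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    gy, gx = np.gradient(cell.astype(np.float64))  # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0  # fold into [0, 180)
    hist, _ = np.histogram(orientation, bins=num_bins,
                           range=(0.0, 180.0), weights=magnitude)
    return hist


# An 8 x 8 cell whose intensity increases steadily from left to right.
cell = np.tile(np.arange(8), (8, 1))
hist = cell_histogram(cell)
```

Because the intensity in this toy cell changes only in the horizontal direction, all of the gradient magnitude falls into the first orientation bin. A cell containing a diagonal edge would instead fill one of the middle bins.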

The HOG descriptor is fairly accessible in OpenCV by means of cv2.HOGDescriptor, which takes the detection window size (32 x 32), the block size (16 x 16), the block stride (8 x 8), the cell size (8 x 8), and the number of histogram bins as input arguments. For each of the cells, the HOG descriptor then calculates a histogram of oriented gradients using nine bins:

elif feature == 'hog':
    # histogram of oriented gradients
    # use integer division so that the sizes stay integers in Python 3 as well
    block_size = (small_size[0] // 2, small_size[1] // 2)
    block_stride = (small_size[0] // 4, small_size[1] // 4)
    cell_size = block_stride
    num_bins = 9
    hog = cv2.HOGDescriptor(small_size, block_size, block_stride, cell_size, num_bins)

Applying the HOG descriptor to every data sample is then as easy as calling hog.compute:

X = [hog.compute(x) for x in X]
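
How long is the resulting feature vector? The following sketch counts blocks, cells, and bins for the window layout used above. The helper hog_descriptor_length is hypothetical and only does the arithmetic; it does not call OpenCV.

```python
def hog_descriptor_length(win_size, block_size, block_stride, cell_size, num_bins):
    """Number of values produced for one detection window."""
    blocks_x = (win_size[0] - block_size[0]) // block_stride[0] + 1
    blocks_y = (win_size[1] - block_size[1]) // block_stride[1] + 1
    cells_per_block = ((block_size[0] // cell_size[0]) *
                       (block_size[1] // cell_size[1]))
    return blocks_x * blocks_y * cells_per_block * num_bins


small_size = (32, 32)
block_size = (small_size[0] // 2, small_size[1] // 2)
block_stride = (small_size[0] // 4, small_size[1] // 4)
cell_size = block_stride
length = hog_descriptor_length(small_size, block_size, block_stride, cell_size, 9)
```

With a 32 x 32 window, there are 3 x 3 overlapping blocks of 2 x 2 cells each, and every cell contributes nine bins, which gives 3 x 3 x 4 x 9 = 324 values per image.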

After we have extracted all the features we want, we should remember to have gtsrb._extract_features return the assembled list of data samples so that they can be split into training and test sets:

X = [x.flatten() for x in X]
return X

Now, we are finally ready to train the classifier on the preprocessed dataset.

