The convolution operation

CNNs are widely used in computer vision, where they outperform most traditional computer vision techniques. CNNs combine the famous convolution operation with neural networks, hence the name convolutional neural network. So, before diving into the neural network aspect of CNNs, we are going to introduce the convolution operation and see how it works.

The main purpose of the convolution operation is to extract information, or features, from an image. Any image can be considered as a matrix of values, and a specific group of values in this matrix forms a feature. The convolution operation scans this matrix and tries to extract relevant or explanatory features for that image. For example, consider a 5 x 5 image whose corresponding intensity or pixel values are shown as zeros and ones:

Figure 9.1: Matrix of pixel values

And consider the following 3 x 3 matrix:

Figure 9.2: The 3 x 3 convolution kernel

We can convolve the 5 x 5 image using a 3 x 3 one as follows:

Figure 9.3: The convolution operation. The output matrix is called a convolved feature or feature map

The preceding figure can be summarized as follows. In order to convolve the original 5 x 5 image with the 3 x 3 convolution kernel, we need to do the following:

  • Scan the original green image with the orange matrix, moving by only 1 pixel at a time (the stride)
  • For every position of the orange matrix, perform element-wise multiplication between the orange matrix and the corresponding pixel values in the green matrix
  • Add the results of these element-wise multiplications together to get a single integer, which forms a single value in the output pink matrix

As you can see from the preceding figure, the orange 3 x 3 matrix operates on only one part of the original green image in each move (stride); in other words, it only sees a part of the image at a time.
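To make this concrete, here is a minimal NumPy sketch of the sliding multiply-and-sum step just described (strictly speaking this is a cross-correlation, which is what deep learning libraries implement and call convolution). The 5 x 5 image and 3 x 3 kernel values below are placeholders, not necessarily the exact values shown in Figures 9.1 and 9.2:

```python
import numpy as np

# Placeholder 5 x 5 binary image (the "green" matrix).
image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

# Placeholder 3 x 3 kernel (the "orange" matrix).
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image, multiply element-wise, and sum."""
    k = kernel.shape[0]
    out_size = (image.shape[0] - k) // stride + 1
    feature_map = np.zeros((out_size, out_size), dtype=image.dtype)
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

print(convolve2d(image, kernel))  # a 3 x 3 feature map, as in Figure 9.3
```

Each entry of the printed 3 x 3 matrix corresponds to one position of the orange window over the green image.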

So, let's put the previous explanation in the context of CNN terminology:

  • The orange 3 x 3 matrix is called a kernel, feature detector, or filter
  • The output pink matrix, which contains the summed results of the element-wise multiplications, is called the feature map

Because we get the feature map from the element-wise multiplication between the kernel and the corresponding pixels of the original input image, changing the values of the kernel or filter gives a different feature map each time.

So, we might think that we need to figure out the values of the feature detectors ourselves, but this is not the case: CNNs learn these values during the training process. Having more filters therefore means that we can extract more features from the image.
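As a quick illustration, the snippet below (continuing the earlier sketch, so it reuses the `image` array and `convolve2d` helper defined there) applies two different hand-crafted kernels to the same image and produces two different feature maps. In a real CNN, these kernel values would be learned rather than chosen by hand:

```python
import numpy as np

# Two hypothetical hand-crafted kernels; in a CNN their values would be learned.
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])
horizontal_edge = vertical_edge.T

# The same input image gives a different feature map for each kernel.
print(convolve2d(image, vertical_edge))    # responds to vertical transitions
print(convolve2d(image, horizontal_edge))  # responds to horizontal transitions
```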

Before jumping to the next section, let's introduce some terminology that is usually used in the context of CNNs:

  • Stride: We mentioned this term briefly earlier. In general, the stride is the number of pixels by which we move our feature detector or filter over the pixels of the input matrix. For example, a stride of 1 means moving the filter one pixel at a time while convolving the input image, and a stride of 2 means moving it two pixels at a time. The larger the stride, the smaller the generated feature maps.
  • Zero-padding: If we want the filter to cover the border pixels of the input image, part of the filter will fall outside the input image. Zero-padding solves this problem by padding the input matrix with zeros around its borders. The effect of both stride and zero-padding on the size of the feature map is shown in the sketch after this list.
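For a square n x n input, an f x f filter, a stride of s, and zero-padding of p pixels on each border, the standard formula for the size of the feature map is (n + 2p - f) / s + 1 (rounded down). A small sketch:

```python
def output_size(n, f, stride=1, padding=0):
    """Spatial size of the square feature map: (n + 2*padding - f) // stride + 1."""
    return (n + 2 * padding - f) // stride + 1

print(output_size(5, 3, stride=1, padding=0))  # 3 -> the 3 x 3 feature map in Figure 9.3
print(output_size(5, 3, stride=2, padding=0))  # 2 -> a larger stride gives a smaller map
print(output_size(5, 3, stride=1, padding=1))  # 5 -> zero-padding of 1 preserves the input size
```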