A CNN, also known as a ConvNet, is one of the most widely used deep learning algorithms for computer vision tasks. Let's say we are performing an image-recognition task. Consider the following image. We want our CNN to recognize that it contains a horse:
How can we do that? When we feed the image to a computer, it basically converts it into a matrix of pixel values. The pixel values range from 0 to 255, and the dimensions of this matrix will be of [image width x image height x number of channels]. A grayscale image has one channel, and colored images have three channels red, green, and blue (RGB).
Let's say we have a colored input image with a width of 11 and a height of 11, that is 11 x 11, then our matrix dimension would be of [11 x 11 x 3]. As you can see in [11 x 11 x 3], 11 x 11 represents the image width and height and 3 represents the channel number, as we have a colored image. So, we will have a 3D matrix.
But it is hard to visualize a 3D matrix, so, for the sake of understanding, let's consider a grayscale image as our input. Since the grayscale image has only one channel, we will get a 2D matrix.
As shown in the following diagram, the input grayscale image will be converted into a matrix of pixel values ranging from 0 to 255, with the pixel values representing the intensity of pixels at that point:
Okay, now we have an input matrix of pixel values. What happens next? How does the CNN come to understand that the image contains a horse? CNNs consists of the following three important layers:
- The convolutional layer
- The pooling layer
- The fully connected layer
With the help of these three layers, the CNN recognizes that the image contains a horse. Now we will explore each of these layers in detail.