Inception v1

Inception v1 is the first version of the network. An object in an image can appear at different sizes and in different positions. For example, look at the first image: when the parrot is viewed up close, it takes up almost the whole image, but in the second image, when the parrot is viewed from a distance, it takes up a much smaller region of the image:

Thus, we can say that an object (in the given image, a parrot) can appear in any region of the image. It might be small or big, taking up the whole image or just a very small portion of it, and our network has to identify the object in every case. But what's the problem here? Remember how we learned that we use a filter to extract features from an image? Because our object of interest varies in size and location from image to image, choosing the right filter size is difficult.

We can use a large filter when the object is large, but a large filter is not suitable when we have to detect an object that sits in a small corner of the image. Since we use a fixed receptive field, that is, a fixed filter size, it is difficult to recognize objects whose size and position vary greatly across images. We could make the network deeper instead, but very deep networks are more vulnerable to overfitting.

To overcome this, instead of using a single filter of one fixed size, the inception network applies multiple filters of varying sizes to the same input. An inception network consists of nine inception blocks stacked one over another. A single inception block is shown in the following figure. As you will notice, we perform convolution operations on the given input with three filters of different sizes, that is, 1 x 1, 3 x 3, and 5 x 5. Once the convolution operation has been performed by all of these filters, we concatenate the results and feed them to the next inception block:
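To make this concrete, here is a minimal sketch of such a naive inception block written with Keras layers (the filter counts 64, 128, and 32 are just illustrative values for this sketch, not something prescribed by the figure):

```python
import tensorflow as tf
from tensorflow.keras import layers

def naive_inception_block(x, f1, f3, f5):
    """Apply 1 x 1, 3 x 3, and 5 x 5 convolutions to the same input
    and concatenate the results along the channel (depth) axis."""
    conv1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
    conv3 = layers.Conv2D(f3, 3, padding='same', activation='relu')(x)
    conv5 = layers.Conv2D(f5, 5, padding='same', activation='relu')(x)
    # 'same' padding keeps the height and width unchanged, so the three
    # outputs differ only in depth and can be concatenated channel-wise
    return layers.concatenate([conv1, conv3, conv5], axis=-1)

# Example: a 28 x 28 feature map with 192 channels
inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = naive_inception_block(inputs, f1=64, f3=128, f5=32)
print(outputs.shape)  # (None, 28, 28, 224) -- the depth is the sum 64 + 128 + 32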

As we are concatenating the output from multiple filters, the depth of the concatenated result increases. Although we use padding so that the input and output have the same height and width, their depths will still differ. Since the result of one inception block is fed to another, the depth keeps on increasing. So, to avoid this increase in depth, we add a 1 x 1 convolution before the 3 x 3 and 5 x 5 convolutions, as shown in the following figure. We also perform a max pooling operation, and a 1 x 1 convolution is added after the max pooling operation as well:
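The following is a sketch of an inception block with these 1 x 1 dimension-reduction convolutions, again using Keras layers. The filter counts passed to the function (including the f3_reduce, f5_reduce, and pool_proj parameters, which are names chosen here for illustration) control the depth of each branch; the values in the usage example are the ones used for the first block of the original GoogLeNet, not necessarily the book's figure:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_block(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    """Inception block with 1 x 1 convolutions that reduce the depth before
    the expensive 3 x 3 and 5 x 5 convolutions, plus a max pooling branch
    followed by its own 1 x 1 convolution."""
    # Branch 1: plain 1 x 1 convolution
    branch1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)

    # Branch 2: 1 x 1 convolution (depth reduction), then 3 x 3 convolution
    branch3 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)
    branch3 = layers.Conv2D(f3, 3, padding='same', activation='relu')(branch3)

    # Branch 3: 1 x 1 convolution (depth reduction), then 5 x 5 convolution
    branch5 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)
    branch5 = layers.Conv2D(f5, 5, padding='same', activation='relu')(branch5)

    # Branch 4: 3 x 3 max pooling, followed by a 1 x 1 convolution
    branch_pool = layers.MaxPooling2D(pool_size=3, strides=1, padding='same')(x)
    branch_pool = layers.Conv2D(pool_proj, 1, padding='same', activation='relu')(branch_pool)

    # All four branches keep the same height and width, so we concatenate
    # them along the channel (depth) axis
    return layers.concatenate([branch1, branch3, branch5, branch_pool], axis=-1)

# Example usage
inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_block(inputs, f1=64, f3_reduce=96, f3=128,
                          f5_reduce=16, f5=32, pool_proj=32)
print(outputs.shape)  # (None, 28, 28, 256) -- depth is 64 + 128 + 32 + 32
```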

Each inception block extracts some features and feeds them to the next inception block. Let's say we are trying to recognize a picture of a parrot. The inception blocks in the first few layers detect basic features, and the later inception blocks detect high-level features. Just as in a convolutional network, the inception blocks only extract features and don't perform any classification. So, we feed the features extracted by the inception blocks to a classifier, which will predict whether the image contains a parrot or not.

As the inception network is deep, with nine inception blocks, it is susceptible to the vanishing-gradient problem. To avoid this, we introduce classifiers between the inception blocks. Since each inception block learns meaningful features of the image, we try to perform classification and compute the loss from the intermediate layers as well. As shown in the following figure, we have nine inception blocks. We take the result of the third inception block and feed it to an intermediate classifier, and we also feed the result of the sixth inception block to another intermediate classifier. There is yet another classifier at the end of the final inception block. This classifier basically consists of average pooling, 1 x 1 convolutions, and a linear layer with softmax activations:
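As a rough sketch of such an intermediate classifier head, again in Keras; the specific sizes here (5 x 5 average pooling with stride 3, 128 filters, a 1,024-unit dense layer, and 70% dropout) follow the original GoogLeNet paper rather than this book's figure:

```python
import tensorflow as tf
from tensorflow.keras import layers

def auxiliary_classifier(x, num_classes):
    """Classifier attached to an intermediate inception block:
    average pooling, a 1 x 1 convolution, and a softmax output layer."""
    x = layers.AveragePooling2D(pool_size=5, strides=3)(x)           # shrink the feature map
    x = layers.Conv2D(128, 1, padding='same', activation='relu')(x)  # 1 x 1 convolution
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.7)(x)                                       # heavy dropout, as in the paper
    return layers.Dense(num_classes, activation='softmax')(x)        # linear layer with softmax
```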

The intermediate classifiers are actually called auxiliary classifiers. So, the final loss of the inception network is the weighted sum of the auxiliary classifiers' losses and the loss of the final classifier (the real loss).
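As a sketch of that weighted sum, using the 0.3 weighting on each auxiliary loss from the original GoogLeNet paper (the exact weights used here may differ):

$$\text{total loss} = \text{real loss} + 0.3 \times \text{auxiliary loss}_1 + 0.3 \times \text{auxiliary loss}_2$$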
