Capsule networks

Capsule networks (CapsNets) were introduced by Geoffrey Hinton and his colleagues to address some of the limitations of convolutional neural networks (CNNs).

Hinton stated the following:

"The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster."

But what is wrong with the pooling operation? Remember when we used the pooling operation to reduce the dimension and to remove unwanted information? The pooling operation makes our CNN representation invariant to small translations in the input.
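To make the invariance concrete, here is a minimal sketch in plain Python. It assumes a one-dimensional signal and non-overlapping max pooling with a window of 2; the helper name `max_pool_1d` and the input values are made up for illustration and do not come from any library.

```python
def max_pool_1d(signal, window=2):
    """Non-overlapping max pooling over a list (stride equals the window size)."""
    return [
        max(signal[i:i + window])
        for i in range(0, len(signal) - window + 1, window)
    ]

# A single strong activation at index 2 ...
original = [0, 0, 9, 0, 0, 0, 0, 0]
# ... and the same activation shifted one step to the right, to index 3.
shifted = [0, 0, 0, 9, 0, 0, 0, 0]

pooled_original = max_pool_1d(original)
pooled_shifted = max_pool_1d(shifted)

# Both positions fall inside the same pooling window, so the pooled
# outputs are identical and the one-step shift is no longer visible.
print(pooled_original)
print(pooled_shifted)
```

Note that the invariance only holds for small shifts: if the activation moved into a neighbouring window, the pooled output would change. Within a window, though, the exact position is discarded.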

This translation invariance property of a CNN is not always beneficial, and it can lead to misclassifications. For example, let's say we need to recognize whether an image contains a face; the CNN will look for whether the image has eyes, a nose, a mouth, and ears. It does not care where in the image they are located. If it finds all of these features, then it classifies the image as a face.

Consider two images, as shown in the following figure. The first image is an actual face, and in the second image, the eyes are placed on the left side, one above the other, and the ears and mouth are placed on the right. The CNN will still classify both images as a face, because both images have all the features of a face, that is, ears, eyes, a mouth, and a nose. It does not learn the spatial relationship between the features: that the eyes should be placed at the top and should be followed by a nose, and so on. All it checks for is the existence of the features that make up the face.

This problem becomes worse in a deep network, because the features become more abstract and the feature maps also shrink in size due to the repeated pooling operations.

To overcome this, Hinton introduced a new network called the Capsule network, which consists of capsules instead of neurons. Like a CNN, the Capsule network checks for the presence of certain features to classify the image, but apart from detecting the features, it also checks the spatial relationship between them. That is, it learns the hierarchy of the features. Taking our example of recognizing a face, the Capsule network will learn that the eyes should be at the top and the nose should be in the middle, followed by a mouth, and so on. If the image does not follow this relationship, then the Capsule network will not classify it as a face.
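The difference between checking only for presence and also checking the spatial relationship can be illustrated with a toy sketch. Neither a CNN nor a Capsule network works through hand-written rules like these; the feature coordinates, the `REQUIRED` tuple, and both function names are hypothetical and exist only to show the distinction.

```python
# Hypothetical feature detections: name -> (row, col) of the feature's centre,
# with row 0 at the top of the image. The coordinates are made up.
real_face = {"eyes": (2, 5), "nose": (5, 5), "mouth": (8, 5)}
scrambled = {"eyes": (8, 2), "nose": (5, 5), "mouth": (2, 8)}

REQUIRED = ("eyes", "nose", "mouth")

def presence_only(features):
    """Accept the image if every required feature was detected somewhere."""
    return all(name in features for name in REQUIRED)

def presence_and_layout(features):
    """Also require eyes above nose above mouth (strictly increasing rows)."""
    if not presence_only(features):
        return False
    eyes_row, nose_row, mouth_row = (features[name][0] for name in REQUIRED)
    return eyes_row < nose_row < mouth_row

# The presence-only check accepts both images; the layout-aware check
# rejects the scrambled one.
print(presence_only(real_face), presence_only(scrambled))
print(presence_and_layout(real_face), presence_and_layout(scrambled))
```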

A Capsule network consists of several capsules connected together. But, wait. What is a capsule?

A capsule is a group of neurons that learns to detect a particular feature in the image; say, eyes. Unlike neurons, which return a scalar, capsules return a vector. The length of the vector tells us whether a particular feature exists in a given location, and the elements of the vector represent the properties of the feature, such as position, angle, and so on.

Let's say we have a vector as follows:

The length of the vector can be calculated as follows:

We have learned that the length of the vector represents the probability of the existence of the feature. But the preceding length does not represent a probability, as it exceeds 1. So, we convert this value into a probability using a function called the squash function. Besides scaling the length to a value between 0 and 1, the squash function has the advantage of preserving the direction of the vector.
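As a minimal sketch of these two steps, the following plain-Python code computes the Euclidean length of a vector and then applies the squash function as it is defined in the original Capsule network paper: the vector is rescaled so that its new length is ‖s‖² / (1 + ‖s‖²), while its direction is unchanged. The example vector `[3.0, 4.0]`, the helper names, and the small `eps` term that guards against division by zero are assumptions made for illustration.

```python
import math

def vector_length(v):
    """Euclidean length: the square root of the sum of squared elements."""
    return math.sqrt(sum(x * x for x in v))

def squash(v, eps=1e-9):
    """Rescale v to a length in [0, 1) while keeping its direction."""
    norm = vector_length(v)
    scale = norm ** 2 / (1.0 + norm ** 2)   # the new length of the vector
    # Dividing by the norm gives a unit vector; eps (an assumption of this
    # sketch) avoids a division by zero for an all-zero input.
    return [scale * x / (norm + eps) for x in v]

example = [3.0, 4.0]                        # hypothetical capsule output
length_before = vector_length(example)      # 5.0, which exceeds 1
squashed = squash(example)
length_after = vector_length(squashed)      # 25/26, roughly 0.96

print(length_before, length_after)
print(squashed[1] / squashed[0])            # the ratio stays at 4/3
```

Long vectors are squashed to a length just below 1 and short vectors to a length near 0, so the result can be read as a probability, and the ratio between the elements is preserved.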

Just like a CNN, capsules in the earlier layers detect basic features such as eyes, a nose, and so on, and the capsules in the higher layers detect high-level features, such as the overall face. Thus, capsules in the higher layers take input from the capsules in the lower layers. In order for the capsules in the higher layers to detect a face, they not only check for the presence of features such as a nose and eyes, but also check their spatial relationships.

Now that we have a basic understanding of what a capsule is, we will go into this in more detail and see how exactly a Capsule network works.
