The convolution layer

 A convolution layer consists of three major stages, each of which imposes structural constraints on a multilayered network:

  • Feature extraction: Each unit takes its inputs from a local receptive field in the previous layer, forcing the network to extract local features. If we have a 32 x 32 image and the receptive field size is 4 x 4, then each hidden unit will be connected to 16 units in the previous layer, and with a stride of one we will have 29 x 29 hidden units in total. Thus, the input layer makes 29 x 29 x 16 connections to the hidden layer, and if every connection carried its own weight, this would be the number of parameters between these two layers. Had it been a fully connected dense hidden layer, there would be 32 x 32 x 29 x 29 parameters, so local connectivity alone gives a dramatic reduction in the number of parameters. In a convolution layer, the weights are additionally shared across all positions, so the parameter count drops further to just the 16 kernel weights (plus a bias). The output of this local linear activation is then run through a nonlinear activation function, such as ReLU; this stage is sometimes called the detector stage. Once a feature detector is learned, the exact location of the feature in an unseen image is not important, as long as its position relative to other features is preserved. The synaptic weights associated with the receptive field of a hidden neuron are the kernel of the convolution.
  • Feature mapping: The feature detector creates a feature map in the form of a plane (the green plane in the following figure). To extract different types of local features and obtain a richer representation of the data, several convolutions are performed in parallel, producing several feature maps, as shown here:

  • Subsampling by pooling: This is done by a computational layer that subsamples the output of the feature detector, replacing the feature detector units at certain locations with summary statistics of the nearby units. The summary statistic can be the maximum or the average. This operation reduces the sensitivity of the feature map's output to simple distortions, such as small shifts and other deformations; in other words, pooling introduces invariance:
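The three stages above can be sketched with plain NumPy. This is a minimal illustration, not a library implementation: it assumes a single-channel 32 x 32 input, one 4 x 4 kernel applied with stride 1 and no padding (so, strictly, a cross-correlation, as in most deep learning libraries), a ReLU detector stage, and 2 x 2 max pooling with stride 2:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))   # grayscale input image
kernel = rng.standard_normal((4, 4))    # 4 x 4 receptive field: 16 shared weights

# Stage 1: feature extraction -- valid convolution with stride 1,
# giving a (32 - 4 + 1) x (32 - 4 + 1) = 29 x 29 feature map
H = image.shape[0] - kernel.shape[0] + 1
W = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((H, W))
for i in range(H):
    for j in range(W):
        feature_map[i, j] = np.sum(image[i:i + 4, j:j + 4] * kernel)

# Stage 2: detector -- elementwise nonlinearity (ReLU)
detected = np.maximum(feature_map, 0.0)

# Stage 3: subsampling -- 2 x 2 max pooling with stride 2
# (29 is odd, so the last row and column are dropped: 29 -> 14)
pooled = detected[:28, :28].reshape(14, 2, 14, 2).max(axis=(1, 3))

print(feature_map.shape, pooled.shape)  # (29, 29) (14, 14)
```

In practice, several kernels are learned in parallel (one per feature map), and the whole loop is replaced by an optimized library routine; the point here is only the shape arithmetic and the order of the three stages.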

Combining these three stages gives us one complex layer in the CNN, even though each of the three stages is a simple layer in its own right:

The pooled feature maps can be arranged in a volume by stacking them side by side, as follows. We can then apply the next level of convolution to this volume. The receptive field of a hidden unit in a single feature map is now a volume of neural units, as shown in the following figure. The depth dimension is made up of channels; for an RGB input image, the input itself has three channels. The convolution is still applied two-dimensionally over space, but the kernel extends through the full depth of the volume: it carries a two-dimensional slice of weights for each channel, and the same kernel is shared across all spatial positions:
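The volume case can be sketched the same way. This illustrative snippet assumes a 3-channel (RGB) 32 x 32 input and a single kernel of shape 3 x 4 x 4, so each hidden unit's receptive field is a 3 x 4 x 4 volume whose responses are summed over the depth axis into one two-dimensional feature map:

```python
import numpy as np

rng = np.random.default_rng(1)
volume = rng.standard_normal((3, 32, 32))  # channels x height x width (RGB input)
kernel = rng.standard_normal((3, 4, 4))    # one 4 x 4 weight slice per channel

# Valid convolution over the two spatial dimensions; the kernel spans the
# full depth, and the per-channel responses are summed into a single map.
H = volume.shape[1] - kernel.shape[1] + 1
W = volume.shape[2] - kernel.shape[2] + 1
feature_map = np.zeros((H, W))
for i in range(H):
    for j in range(W):
        feature_map[i, j] = np.sum(volume[:, i:i + 4, j:j + 4] * kernel)

print(feature_map.shape)  # (29, 29): one 2D map per kernel, whatever the depth
```

Stacking the maps from several such kernels produces the next volume, so the same construction repeats layer after layer.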
