Getting acquainted 

Before we begin diving into code, let's cover some basic terminology so that we are all on the same page when referring to things. This terminology applies to CNNs as well as the ConvNetSharp framework.

Convolution: In mathematics, a convolution is an operation performed on two functions. This operation produces a third function, which is an expression of how the shape of one is modified by the other. This is represented visually in the following diagram:

It is important to note that the convolutional layer itself is the building block of a CNN. This layer's parameters consist of a set of learnable filters (sometimes called kernels). Each kernel has a small receptive field, which is a smaller view into the total image, and this view extends through the full depth of the input volume. During the forward-propagation phase, each filter is convolved across the width and the height of the entire input volume. It is this convolution that computes the dot product between the filter's weights and the input at each position, producing a two-dimensional map (sometimes called an activation map) for that filter. In this way, the network learns filters that activate when they detect a particular feature at a given position in the input.

Dot product computation: The following diagram is a visualization of what we mean when we say dot product computation:
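
To complement the diagram, here is a small, self-contained C# sketch (independent of ConvNetSharp, using arbitrary example values) that slides a 3 x 3 filter across a 5 x 5 single-channel input and records the dot product at each position, producing the kind of activation map described above:

    using System;

    class ConvolutionDemo
    {
        // Slides a filter across a 2-D input and computes the dot product
        // between the filter and the patch beneath it at every valid position.
        // The result is the activation map described above.
        static double[,] Convolve(double[,] input, double[,] filter)
        {
            int inH = input.GetLength(0), inW = input.GetLength(1);
            int fH = filter.GetLength(0), fW = filter.GetLength(1);
            int outH = inH - fH + 1, outW = inW - fW + 1;   // no padding ("valid" convolution)

            var map = new double[outH, outW];
            for (int y = 0; y < outH; y++)
            for (int x = 0; x < outW; x++)
            {
                double dot = 0.0;                           // dot product for this receptive field
                for (int fy = 0; fy < fH; fy++)
                for (int fx = 0; fx < fW; fx++)
                    dot += input[y + fy, x + fx] * filter[fy, fx];
                map[y, x] = dot;
            }
            return map;
        }

        static void Main()
        {
            // A 5 x 5 input and a 3 x 3 filter produce a 3 x 3 activation map.
            var input = new double[,]
            {
                { 1, 0, 0, 1, 0 },
                { 0, 1, 0, 0, 1 },
                { 0, 0, 1, 0, 0 },
                { 1, 0, 0, 1, 0 },
                { 0, 1, 0, 0, 1 }
            };
            var filter = new double[,]
            {
                { 1, 0, 0 },
                { 0, 1, 0 },
                { 0, 0, 1 }
            };

            var map = Convolve(input, filter);
            for (int y = 0; y < map.GetLength(0); y++)
            {
                for (int x = 0; x < map.GetLength(1); x++)
                    Console.Write(map[y, x] + " ");
                Console.WriteLine();
            }
        }
    }

Notice that the map scores highest wherever the input contains a diagonal pattern matching the filter, which is exactly the behaviour described above: a filter activates where it detects its feature.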

  • Vol class: In ConvNetSharp, the Vol class is simply a wrapper around a one-dimensional list of numbers, their gradients, and dimensions (that is, width, depth, and height).
  • Net class: In ConvNetSharp, Net is a very simple class that contains a list of layers. When a Vol is passed through the Net class, Net iterates through all its layers, forward-propagates each one by calling the forward() function, and returns the result of the last layer. During back propagation, Net calls the backward() function of each layer to compute the gradient.
  • Layers: As we know, every neural network is just a linear list of layers, and ours is no different. For a neural network, the first layer must be an input layer, and our last layer must be an output layer. Every layer takes an input Vol and produces a new output Vol.
  • Fully-connected layer: The fully-connected layer is perhaps the most important layer and is definitely the most interesting in terms of what it does. It houses a layer of neurons, each of which computes a weighted sum of all of its inputs; the result is then passed through a non-linear activation function, such as a ReLU.
  • Loss layers and classifier layers: These layers are helpful when we need to predict a set of discrete classes for our data. You can use softmax, SVM, and many other types of layers. As always, you should experiment with your particular problem to see which one works best.
  • Loss layers and the L2 regression layer: This layer takes a list of targets and backward-propagates the L2 loss through them.
  • Convolution layer: This layer is almost a mirror of the fully-connected layer. The difference is that each neuron connects only to a small, local region of the previous layer (its receptive field) rather than to every neuron in it. The neurons also share their parameters (the filter weights).
  • Trainers: The Trainer class takes a network and a set of training parameters. It passes input data through the network, looks at the predictions, and adjusts the network's weights so that the provided label becomes more likely for that particular input. Over time, this process transforms the network so that it maps all the inputs to the correct outputs. A minimal usage sketch follows this list.
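
Putting these pieces together, here is a minimal sketch of how a Net, its layers, and a trainer are typically wired up, loosely following the pattern in the ConvNetSharp documentation. The exact namespaces and volume-construction helpers (BuilderInstance, Shape) are assumptions that differ between library versions (older releases expose a Vol class, newer ones a generic Volume<T>), so treat this as a starting point to adapt rather than a definitive listing:

    using System;
    using ConvNetSharp.Core;
    using ConvNetSharp.Core.Layers.Double;
    using ConvNetSharp.Core.Training.Double;
    using ConvNetSharp.Volume;               // Shape and volume builders (version-dependent)
    using ConvNetSharp.Volume.Double;

    class MinimalNetDemo
    {
        static void Main()
        {
            // A Net is just an ordered list of layers: input first, classifier last.
            var net = new Net<double>();
            net.AddLayer(new InputLayer(1, 1, 2));    // two input values (width = height = 1, depth = 2)
            net.AddLayer(new FullyConnLayer(20));     // fully-connected layer of 20 neurons
            net.AddLayer(new ReluLayer());            // non-linear activation
            net.AddLayer(new FullyConnLayer(2));      // scores for the two classes
            net.AddLayer(new SoftmaxLayer(2));        // classifier/loss layer

            // Wrap a plain array of numbers in a volume and forward-propagate it.
            // (The shape arguments may need adjusting for your library version.)
            var x = BuilderInstance.Volume.From(new[] { 0.3, -0.5 }, new Shape(1, 1, 2));
            var prob = net.Forward(x);
            Console.WriteLine("P(class 0) before training: " + prob.Get(0));

            // The trainer forward-propagates the input, looks at the prediction,
            // and adjusts the weights so that the supplied label becomes more likely.
            var trainer = new SgdTrainer(net) { LearningRate = 0.01, L2Decay = 0.001 };
            var y = BuilderInstance.Volume.From(new[] { 1.0, 0.0 }, new Shape(1, 1, 2, 1)); // one-hot label: class 0
            trainer.Train(x, y);

            Console.WriteLine("P(class 0) after one step: " + net.Forward(x).Get(0));
        }
    }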

With that behind us, let's now talk a bit about CNNs themselves. A CNN consists of an input and an output layer; there's no big surprise there. There will be one or more hidden layers which consist of convolutional layers, pooling layers, fully-connected layers, or normalization layers. It is in these hidden layers that the magic happens. Convolutional layers apply a convolution operation to the input and pass the result to the next layer. We'll talk more about that in a moment.

As we progress, the activation maps for all of the filters are stacked along the depth dimension, which forms the full output volume of the layer itself. Each neuron in that layer processes data only for its own receptive field (the data view it can see); neighbouring receptive fields overlap, and the filter weights are shared between neurons.

The thing that we always have to keep in mind with a CNN is the input size: processing an image with fully connected neurons can require an extremely high number of weights, depending on the resolution of the image. This can become architecturally inconvenient, and even intractable, because each pixel is a variable that needs to be processed.

Let's take a look at an example. If we have an image of 100 x 100 pixels, we would all agree that this is a small image. However, it contains 10,000 pixels in total (100 x 100), and a fully connected neuron in the second layer would need one weight for every one of them. Convolution is key to addressing this issue, as it reduces the number of parameters and allows the network to go deeper with fewer of them. With 10,000 learnable parameters per neuron, the solution may be totally intractable; however, if each neuron only looks at a 5 x 5 area of the image, it needs just 25 weights instead of 10,000, which is much more feasible. This will also help us to eliminate, or at least greatly reduce, the vanishing or exploding gradient problems we sometimes encounter when we train multi-layer networks.
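
The arithmetic behind this is easy to check. The stand-alone sketch below (a hypothetical calculation that ignores biases and colour channels) simply prints the number of weights a single second-layer neuron needs in the fully connected case versus the 5 x 5 receptive-field case:

    using System;

    class ParameterCountDemo
    {
        static void Main()
        {
            const int imageWidth = 100, imageHeight = 100;
            const int receptiveFieldSize = 5;

            // Fully connected: every pixel feeds every neuron, so each neuron
            // in the next layer carries one weight per pixel.
            int fullyConnectedWeights = imageWidth * imageHeight;               // 10,000

            // Convolutional: each neuron only sees a 5 x 5 receptive field,
            // and those 25 weights are shared across every position.
            int convolutionalWeights = receptiveFieldSize * receptiveFieldSize; // 25

            Console.WriteLine("Weights per fully connected neuron: " + fullyConnectedWeights);
            Console.WriteLine("Weights per convolutional filter:   " + convolutionalWeights);
        }
    }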

Let's now take a quick look at how this works visually. As shown in the following diagram, we will use the number 6 and run it through a CNN to see if our network can detect the number we are trying to draw. The image at the bottom of the following screenshot is what we will draw. By the time we convolve things all the way up to the top, we should be able to light up the single neuron that denotes the number 6, as follows:

In the preceding screenshot, we can see an input layer (our single number 6), convolutional layers, down-sampling layers, and an output layer. Our progression is as follows: we start with a 32 x 32 image, which gives us 1,024 input neurons. We then go down to 120 neurons, then to 100 neurons, and finally to 10 neurons in our output layer – that's one neuron for each of the 10 numerical digits. You can see that, as we progress towards the output layer, the spatial dimensions of the image decrease: the image is 32 x 32 going into our first convolutional layer, the feature maps are 10 x 10 in our second convolutional layer, and 5 x 5 in our second pooling layer.
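
For readers who want to see how a stack like this is declared in code, the following is a sketch of a comparable layer stack in ConvNetSharp. The filter counts, activation layers, and pooling sizes are illustrative assumptions rather than values taken from the figure, and the constructor signatures may differ slightly between library versions:

    using ConvNetSharp.Core;
    using ConvNetSharp.Core.Layers.Double;

    class DigitNetSketch
    {
        // A LeNet-style stack: 32 x 32 input -> conv/pool pairs -> 120 -> 100 -> 10.
        static Net<double> BuildNet()
        {
            var net = new Net<double>();
            net.AddLayer(new InputLayer(32, 32, 1));              // 32 x 32 grayscale input (1,024 pixels)
            net.AddLayer(new ConvLayer(5, 5, 6) { Stride = 1 });  // first convolutional layer
            net.AddLayer(new ReluLayer());
            net.AddLayer(new PoolLayer(2, 2) { Stride = 2 });     // first down-sampling layer
            net.AddLayer(new ConvLayer(5, 5, 16) { Stride = 1 }); // second convolutional layer: 10 x 10 maps
            net.AddLayer(new ReluLayer());
            net.AddLayer(new PoolLayer(2, 2) { Stride = 2 });     // second down-sampling layer: 5 x 5 maps
            net.AddLayer(new FullyConnLayer(120));                // 120 neurons
            net.AddLayer(new ReluLayer());
            net.AddLayer(new FullyConnLayer(100));                // 100 neurons, fully connected
            net.AddLayer(new ReluLayer());
            net.AddLayer(new FullyConnLayer(10));                 // scores for the 10 digits
            net.AddLayer(new SoftmaxLayer(10));                   // output layer: one neuron per digit
            return net;
        }
    }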

It's also worth noting that each neuron in the output layer is fully connected to all 100 nodes in the fully-connected layer preceding it; hence, the term fully-connected layer.

If we make a three-dimensional drawing of this network and flip it around, we can better see how convolution occurs. The following diagram depicts just that, as the activated neurons are brighter in color. The layers continue to convolve until a decision is made as to which digit we have drawn, shown as follows:
