Image classification and drawbacks of DNNs

Before we start developing the end-to-end project for image classification using a CNN, we need some background, such as the drawbacks of regular DNNs, the suitability of CNNs over DNNs for image classification, how a CNN is constructed, a CNN's different operations, and so on. Although regular DNNs work fine for small images (for example, MNIST, CIFAR-10), they break down for larger images because of the huge number of parameters they require. For example, a 100 x 100 grayscale image has 10,000 pixels, and if the first layer has just 1,000 neurons (which already severely restricts the amount of information transmitted to the next layer), this means a total of 10 million connections. And that is just for the first layer.
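
To make that arithmetic concrete, here is a minimal Python sketch of the connection count (the 100 x 100 image and the 1,000-neuron layer are the values quoted above; the image is assumed to be grayscale):

    # Back-of-the-envelope connection count for a fully connected first layer.
    height, width = 100, 100
    pixels = height * width          # 10,000 input values for a grayscale image
    neurons = 1_000                  # neurons in the first hidden layer

    connections = pixels * neurons   # one weight per (input, neuron) pair
    print(connections)               # 10000000 -- ten million, before biases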

CNNs solve this problem using partially connected layers. Because consecutive layers are only partially connected, and because a CNN heavily reuses its weights, it has far fewer parameters than a fully connected DNN, which makes it much faster to train, reduces the risk of overfitting, and requires much less training data. Moreover, when a CNN has learned a kernel that can detect a particular feature, it can detect that feature anywhere in the image. In contrast, when a DNN learns a feature in one location, it can detect it only in that particular location. Since images typically have very repetitive features, CNNs are able to generalize much better than DNNs for image-processing tasks, such as classification, using fewer training examples.
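
The saving is easy to quantify. The following sketch compares the two kinds of first layer for the same grayscale input; note that the 32 feature maps and the 3 x 3 kernel are illustrative choices, not values from the text:

    # Fully connected: every one of 1,000 neurons sees every pixel, plus a bias.
    pixels = 100 * 100
    dense_params = pixels * 1_000 + 1_000    # 10,001,000 parameters

    # Convolutional: each of 32 feature maps is defined by one shared 3 x 3
    # kernel (plus one bias per map), no matter where on the image it looks.
    conv_params = 32 * (3 * 3 * 1 + 1)       # 320 parameters

    print(dense_params, conv_params)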

Importantly, a DNN has no prior knowledge of how pixels are organized; it does not know that nearby pixels are close to each other. A CNN's architecture embeds this prior knowledge. Lower layers typically identify features in small areas of the image, while higher layers combine the lower-level features into larger features. This works well with most natural images, giving CNNs a decisive head start compared to DNNs:

Figure 1: Regular DNN versus CNN

For example, in Figure 1, on the left, you can see a regular three-layer neural network. On the right, a ConvNet arranges its neurons in three dimensions (width, height, and depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. The red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be three (red, green, and blue channels).
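
As a quick illustration of this volume-to-volume mapping, the following sketch (assuming TensorFlow/Keras is available; the 32 x 32 image size and 16 filters are arbitrary) passes one RGB image through a single convolutional layer and prints the resulting 3D shape:

    import tensorflow as tf

    # A batch containing one 32 x 32 RGB image: width, height, and depth = 3.
    image = tf.random.uniform(shape=(1, 32, 32, 3))

    # One ConvNet layer with 16 filters; 'same' padding preserves width/height.
    layer = tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding="same")
    volume = layer(image)

    print(volume.shape)   # (1, 32, 32, 16): a 3D output volume per image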

So far, all the multilayer neural networks we have looked at had layers composed of a long line of neurons, and we had to flatten input images or data to 1D before feeding them to the network. What happens if you feed a 2D image directly instead? In a CNN, each layer is represented in 2D, which makes it easier to match neurons with their corresponding inputs. We will see examples of this in the upcoming sections.
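
The contrast can be sketched directly in code (again assuming TensorFlow/Keras; the MNIST-style 28 x 28 input and the layer sizes are illustrative): the dense network must flatten the image up front, while the ConvNet consumes it in 2D and only flattens at the very end, just before the classifier:

    import tensorflow as tf

    # A regular DNN: the 2D image must be flattened to a 784-long vector first.
    dnn = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # A CNN: the convolutional layer works on the 2D image directly.
    cnn = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])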

Another important fact is that all the neurons in a given feature map share the same parameters, which dramatically reduces the number of parameters in the model but, more importantly, means that once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location.
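
A small NumPy sketch makes the weight sharing tangible (the 5 x 5 images and the two-element edge kernel are made up for illustration): the very same kernel slides over both images, so it detects the vertical edge in each, and only the location of the strong response shifts:

    import numpy as np

    def correlate2d(image, kernel):
        """Valid cross-correlation with a single shared kernel."""
        kh, kw = kernel.shape
        h, w = image.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # Two images with the same vertical edge at different horizontal positions.
    a = np.zeros((5, 5)); a[:, 1] = 1.0
    b = np.zeros((5, 5)); b[:, 3] = 1.0

    kernel = np.array([[1.0, -1.0]])   # fires on a left-to-right intensity drop

    print(correlate2d(a, kernel))      # strong responses around column 1
    print(correlate2d(b, kernel))      # same kernel, responses around column 3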
