Deep neural networks

Neural networks are extremely popular today, thanks to major research advances over the last 10 years. This research has culminated in deep learning algorithms and architectures. Technology giants such as Google, Facebook, and Microsoft are investing heavily in deep learning research. Complex neural networks powered by deep learning are considered state of the art in AI and machine learning, and we encounter them in everyday life. For example, Google's image search and Google Translate are both powered by deep learning, and the field of computer vision has made several advances thanks to it.

The following diagram shows a typical neural network, commonly called a multi-layer perceptron:

This network architecture has a single hidden layer with two nodes, and the output layer is activated by a softmax function, so the network is built for a classification task. The hidden layer can be activated by a tanh, sigmoid, ReLU, or soft ReLU activation function. The activation function plays the key role of introducing non-linearity into the network: the product of the weights and the inputs from the previous layer, summed with the bias, is passed to the activation function.
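
To make this computation concrete, the following is a minimal NumPy sketch of such a forward pass; the layer sizes, tanh hidden activation, variable names, and example values are illustrative assumptions rather than details taken from the diagram:

    import numpy as np

    def softmax(z):
        # Subtract the max for numerical stability before exponentiating
        e = np.exp(z - np.max(z, axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    # Illustrative shapes: 3 input features, 2 hidden nodes, 2 output classes
    rng = np.random.default_rng(0)
    W_hidden = rng.normal(size=(3, 2))   # input-to-hidden weights
    b_hidden = np.zeros(2)               # hidden-layer bias
    W_out = rng.normal(size=(2, 2))      # hidden-to-output weights
    b_out = np.zeros(2)                  # output-layer bias

    x = np.array([0.5, -1.2, 0.3])       # a single input example

    # Weighted sum plus bias, passed through a non-linear activation (tanh here)
    hidden = np.tanh(x @ W_hidden + b_hidden)
    # Output layer activated by softmax to produce class probabilities
    probs = softmax(hidden @ W_out + b_out)
    print(probs)                         # the probabilities sum to 1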

Let's compare this network to a deeper network.

The following is a deep network architecture:

Compared to the multi-layer perceptron, we can see that there are several hidden layers. For the sake of clarity in the diagram, we show only two nodes in each hidden layer, but in practice there may be several hundred.

As you can see, deep learning networks differ from their single-hidden-layer siblings in their depth, that is, the number of hidden layers. The input data is now passed through multiple hidden layers, on the assumption that each layer learns some aspect of the input data.
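
As a rough sketch of this idea, a deep feedforward pass simply repeats the same weighted-sum-plus-activation step once per hidden layer; the layer sizes and names below are illustrative assumptions:

    import numpy as np

    def forward_deep(x, weights, biases):
        """Pass x through each hidden layer in turn; every layer re-represents
        its input as new features for the next layer."""
        a = x
        for W, b in zip(weights, biases):
            a = np.tanh(a @ W + b)   # weighted sum plus bias, then non-linearity
        return a

    # Illustrative deep network: 4 inputs -> three hidden layers of 2 nodes each
    rng = np.random.default_rng(1)
    sizes = [4, 2, 2, 2]
    weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]

    print(forward_deep(rng.normal(size=4), weights, biases))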

So when do we call a network deep? Networks with more than three layers (including the input and output layers) can be called deep neural networks.

As previously mentioned, in deep learning networks the nodes of each hidden layer learn to identify a distinct set of features based on the previous layer’s output. Subsequent hidden layers can understand more complex features as we aggregate and recombine features from the previous layer:

  • Feature hierarchy: The previously mentioned phenomenon, where subsequent layers in a deep network understand more complex features by aggregating and recombining the features of the previous layer, is called a feature hierarchy. This allows the network to handle very large and complex datasets.
  • Latent features: Latent features are the hidden features in the dataset. The feature hierarchy allows deep neural networks to discover these latent features, which lets them work efficiently on unlabeled data to discover anomalies, structures, and other features embedded in the data without any human intervention. For example, we can take a million images and cluster them according to their similarities.

Let us build an intuition for how a feedforward neural network learns from the data.

Let us consider a simple network with:

  • One input layer with two nodes
  • One hidden layer with two nodes
  • One output layer with two nodes

Say we are solving a classification task. Our input X is a 2 x 2 matrix and our output Y is a vector of binary responses, either one or zero.

We initialize the weights and biases to begin with.

The two nodes in our input layer are fully connected to the two nodes in the hidden layer, so we have four weights going from our input nodes to our hidden nodes, which we can represent as a 2 x 2 matrix. Each of the two nodes in our hidden layer also has a bias, and we can represent the biases as a 1 x 2 matrix.

Again, we connect our hidden nodes fully to the two nodes of our output layer, so the weight matrix is again 2 x 2 and the bias matrix is again 1 x 2.

Let us look at our network architecture:

{w1, w2, w3, w4} and {b1, b2} are the weights and biases for the input-to-hidden connections.

{v1, v2, v3, v4} and {bb1, bb2} are the weights and biases for the hidden-to-output connections.
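
Putting these pieces together, here is a minimal NumPy sketch of the forward pass for this 2-2-2 network, using the same parameter names; the sigmoid hidden activation, softmax output, and random example values are illustrative assumptions:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - np.max(z, axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(42)

    # Input-to-hidden parameters: weights {w1, w2, w3, w4} as a 2 x 2 matrix
    # and biases {b1, b2} as a 1 x 2 matrix
    W = rng.normal(size=(2, 2))      # [[w1, w2], [w3, w4]]
    b = rng.normal(size=(1, 2))      # [[b1, b2]]

    # Hidden-to-output parameters: weights {v1, v2, v3, v4} as a 2 x 2 matrix
    # and biases {bb1, bb2} as a 1 x 2 matrix
    V = rng.normal(size=(2, 2))      # [[v1, v2], [v3, v4]]
    bb = rng.normal(size=(1, 2))     # [[bb1, bb2]]

    X = np.array([[0.0, 1.0],        # our 2 x 2 input matrix: two examples,
                  [1.0, 0.0]])       # two features each

    hidden = sigmoid(X @ W + b)      # hidden-layer activations, shape (2, 2)
    probs = softmax(hidden @ V + bb) # class probabilities, shape (2, 2)
    print(probs)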

It is very important to initialize the weights and biases randomly. Xavier initialization is a popular scheme that is widely used today. See this blog post for a very intuitive explanation of Xavier initialization: http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization.
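
As a rough sketch, Xavier initialization draws each weight from a distribution whose scale depends on the number of incoming and outgoing connections of the layer; the uniform variant below is one common formulation, and the bias ranges are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(7)

    def xavier_uniform(n_in, n_out):
        """Sample an (n_in, n_out) weight matrix uniformly from
        [-limit, limit], where limit = sqrt(6 / (n_in + n_out))."""
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_in, n_out))

    # Randomly initialize the parameters of our 2-2-2 example network
    W = xavier_uniform(2, 2)                    # input-to-hidden weights
    V = xavier_uniform(2, 2)                    # hidden-to-output weights
    b = rng.uniform(-0.1, 0.1, size=(1, 2))     # small random hidden biases
    bb = rng.uniform(-0.1, 0.1, size=(1, 2))    # small random output biases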
