As you may know, deep learning is getting a lot of press and is without any doubt the hottest topic in the machine learning field. Deep learning can be understood as a set of algorithms that were developed to train artificial neural networks with many layers most efficiently. In this chapter, you will learn the basic concepts of artificial neural networks so that you will be well equipped to further explore the most exciting areas of research in the machine learning field, as well as the advanced Python-based deep learning libraries that are currently being developed.
The topics that we will cover are as follows:
At the beginning of this book, we started our journey through machine learning algorithms with artificial neurons in Chapter 2, Training Machine Learning Algorithms for Classification. Artificial neurons represent the building blocks of the multi-layer artificial neural networks that we are going to discuss in this chapter. The basic concept behind artificial neural networks was built upon hypotheses and models of how the human brain works to solve complex problem tasks. Although artificial neural networks have gained a lot of popularity in recent years, early studies of neural networks go back to the 1940s, when Warren McCulloch and Walter Pitts first described how neurons could work. However, in the decades that followed the first implementation of the McCulloch-Pitts neuron model, Rosenblatt's perceptron in the 1950s, many researchers and machine learning practitioners slowly began to lose interest in neural networks since no one had a good solution for training a neural network with multiple layers. Eventually, interest in neural networks was rekindled in 1986 when D.E. Rumelhart, G.E. Hinton, and R.J. Williams were involved in the (re)discovery and popularization of the backpropagation algorithm to train neural networks more efficiently, which we will discuss in more detail later in this chapter (Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986). Learning Representations by Back-propagating Errors. Nature 323 (6088): 533–536).
During the previous decade, many more major breakthroughs resulted in what we now call deep learning algorithms, which can be used to create feature detectors from unlabeled data to pre-train deep neural networks, that is, neural networks that are composed of many layers. Neural networks are a hot topic not only in academic research, but also in big technology companies such as Facebook, Microsoft, and Google, which invest heavily in artificial neural networks and deep learning research. As of today, complex neural networks powered by deep learning algorithms are considered state of the art when it comes to complex problem solving such as image and voice recognition. Popular examples of the products in our everyday life that are powered by deep learning are Google's image search and Google Translate, an application for smartphones that can automatically recognize text in images for real-time translation into 20 languages (http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html).
Many more exciting applications of deep neural networks are under active development at major tech companies, for example, Facebook's DeepFace for tagging images (Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition CVPR, 2014 IEEE Conference, pages 1701–1708) and Baidu's DeepSpeech, which is able to handle voice queries in Mandarin (A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al. DeepSpeech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014). In addition, the pharmaceutical industry recently started to use deep learning techniques for drug discovery and toxicity prediction, and research has shown that these novel techniques substantially exceed the performance of traditional methods for virtual screening (T. Unterthiner, A. Mayr, G. Klambauer, and S. Hochreiter. Toxicity prediction using deep learning. arXiv preprint arXiv:1503.01445, 2015).
This chapter is all about multi-layer neural networks, how they work, and how to train them to solve complex problems. However, before we dig deeper into a particular multi-layer neural network architecture, let's briefly reiterate some of the concepts of single-layer neural networks that we introduced in Chapter 2, Training Machine Learning Algorithms for Classification, namely, the ADAptive LInear NEuron (Adaline) algorithm that is shown in the following figure:
In Chapter 2, Training Machine Learning Algorithms for Classification, we implemented the Adaline algorithm to perform binary classification, and we used a gradient descent optimization algorithm to learn the weight coefficients of the model. In every epoch (pass over the training set), we updated the weight vector $\mathbf{w}$ using the following update rule:
$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}, \quad \text{where } \Delta \mathbf{w} = -\eta \nabla J(\mathbf{w})$$
In other words, we computed the gradient based on the whole training set and updated the weights of the model by taking a step in the opposite direction of the gradient $\nabla J(\mathbf{w})$. In order to find the optimal weights of the model, we optimized an objective function that we defined as the Sum of Squared Errors (SSE) cost function $J(\mathbf{w})$. Furthermore, we multiplied the gradient by a factor, the learning rate $\eta$, which we chose carefully to balance the speed of learning against the risk of overshooting the global minimum of the cost function.
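In gradient descent optimization, we updated all weights simultaneously after each epoch, and we defined the partial derivative for each weight $w_j$ in the weight vector $\mathbf{w}$ as follows:
$$\frac{\partial}{\partial w_j} J(\mathbf{w}) = -\sum_i \left( y^{(i)} - a^{(i)} \right) x_j^{(i)}$$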
Here, $y^{(i)}$ is the target class label of a particular sample $\mathbf{x}^{(i)}$, and $a^{(i)}$ is the activation of the neuron, which is a linear function in the special case of Adaline. Furthermore, we defined the activation function $\phi(\cdot)$ as follows:
$$\phi(z) = z = a$$
Here, the net input $z$ is a linear combination of the weights that are connecting the input to the output layer:
$$z = \sum_j w_j x_j = \mathbf{w}^T \mathbf{x}$$
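While we used the activation $\phi(z)$ to compute the gradient update, we implemented a threshold function (Heaviside function) to squash the continuous-valued output into binary class labels for prediction:
$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \ge 0 \\ -1 & \text{otherwise} \end{cases}$$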
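The Adaline recap above can be condensed into a few lines of NumPy. The following is only a minimal sketch, not the implementation from Chapter 2: the function names (net_input, adaline_epoch, predict), the toy data, and the learning rate of 0.05 are assumptions chosen for illustration. The bias is stored in w[0], which corresponds to a constant input of 1.

```python
import numpy as np

def net_input(X, w):
    """Net input z = w^T x for every row of X; w[0] is the bias weight."""
    return X @ w[1:] + w[0]

def adaline_epoch(X, y, w, eta):
    """One full-batch gradient descent step on the SSE cost (illustrative helper)."""
    output = net_input(X, w)            # linear activation: phi(z) = z
    errors = y - output                 # y^(i) - a^(i) for every sample
    cost = 0.5 * (errors ** 2).sum()    # SSE cost before this update
    w = w.copy()
    w[1:] += eta * (X.T @ errors)       # Delta w_j = -eta * dJ/dw_j
    w[0] += eta * errors.sum()          # bias update (its input is always 1)
    return w, cost

def predict(X, w):
    """Threshold the continuous output into the class labels 1 and -1."""
    return np.where(net_input(X, w) >= 0.0, 1, -1)

# Hypothetical, linearly separable toy data with labels in {-1, 1}
X = np.array([[-2.0, -1.0], [-1.0, -1.5], [1.0, 1.5], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

w = np.zeros(X.shape[1] + 1)
costs = []
for _ in range(20):                     # 20 epochs, eta chosen by hand
    w, cost = adaline_epoch(X, y, w, eta=0.05)
    costs.append(cost)

y_pred = predict(X, w)
```

Note that all weights are updated at once from the gradient of the whole training set, and that the threshold is applied only for prediction, never inside the update.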
In this section, we will see how to connect multiple single neurons into a multi-layer feedforward neural network; this special type of network is also called a multi-layer perceptron (MLP). The following figure explains the concept of an MLP consisting of three layers: one input layer, one hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer. If such a network has more than one hidden layer, we also call it a deep artificial neural network.
We could add an arbitrary number of hidden layers to the MLP to create deeper network architectures. Practically, we can think of the number of layers and units in a neural network as additional hyperparameters that we want to optimize for a given problem task using the cross-validation that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.
However, the error gradients that we will calculate later via backpropagation become increasingly small as more layers are added to a network. This vanishing gradient problem makes learning the model more challenging. Therefore, special algorithms have been developed to pre-train such deep neural network structures; this family of techniques is called deep learning.
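As shown in the preceding figure, we denote the $i$th activation unit in the $l$th layer as $a_i^{(l)}$, and the activation units $a_0^{(1)}$ and $a_0^{(2)}$ are the bias units, which we set equal to 1. The activation of the units in the input layer is just its input plus the bias unit:
$$\mathbf{a}^{(1)} = \begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \\ \vdots \\ a_m^{(1)} \end{bmatrix} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_m^{(i)} \end{bmatrix}$$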
Each unit in layer $l$ is connected to all units in layer $l+1$ via a weight coefficient. For example, the connection from the $k$th unit in layer $l$ to the $j$th unit in layer $l+1$ would be written as $w_{j,k}^{(l)}$. Please note that the superscript $i$ in $x_m^{(i)}$ stands for the $i$th sample, not the $i$th layer. In the following paragraphs, we will often omit the superscript $i$ for clarity.
While one unit in the output layer would suffice for a binary classification task, we saw a more general form of a neural network in the preceding figure, which allows us to perform multi-class classification via a generalization of the One-vs-All (OvA) technique. To better understand how this works, remember the one-hot representation of categorical variables that we introduced in Chapter 4, Building Good Training Sets – Data Preprocessing. For example, we would encode the three class labels in the familiar Iris dataset (0=Setosa, 1=Versicolor, 2=Virginica) as follows:
$$0 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad 1 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad 2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
This one-hot vector representation allows us to tackle classification tasks with an arbitrary number of unique class labels present in the training set.
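As a small illustration of this encoding, the following sketch converts integer class labels into one-hot rows with NumPy. The helper name one_hot and the example label array are assumptions made for this example; they do not come from an earlier chapter.

```python
import numpy as np

def one_hot(y, n_classes):
    """Encode integer labels as an (n_samples, n_classes) array of 0s and 1s."""
    onehot = np.zeros((y.shape[0], n_classes))
    # For sample i, set the column given by its label y[i] to 1
    onehot[np.arange(y.shape[0]), y] = 1.0
    return onehot

# Hypothetical labels: 0=Setosa, 1=Versicolor, 2=Virginica
y = np.array([0, 1, 2, 1])
Y = one_hot(y, n_classes=3)

# Taking the index of the 1 in each row recovers the original labels
y_back = Y.argmax(axis=1)
```

Here each sample occupies one row; the transposed layout (one column per sample) works equally well as long as it matches the layout of the network output.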
If you are new to neural network representations, the terminology around the indices (subscripts and superscripts) may look a little bit confusing at first. You may wonder why we wrote $w_{j,k}^{(l)}$ and not $w_{k,j}^{(l)}$ to refer to the weight coefficient that connects the $k$th unit in layer $l$ to the $j$th unit in layer $l+1$. What may seem a little bit quirky at first will make much more sense in later sections when we vectorize the neural network representation. For example, we will summarize the weights that connect the input and hidden layer by a matrix $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times [m+1]}$, where $h$ is the number of hidden units and $m+1$ is the number of input units plus the bias unit. Since it is important to internalize this notation to follow the concepts later in this chapter, let's summarize what we just discussed in a descriptive illustration of a simplified 3-4-3 multi-layer perceptron:
In this section, we will describe the process of forward propagation to calculate the output of an MLP model. To understand how it fits into the context of learning an MLP model, let's summarize the MLP learning procedure in three simple steps:
Finally, after repeating the steps for multiple epochs and learning the weights of the MLP, we use forward propagation to calculate the network output and apply a threshold function to obtain the predicted class labels in the one-hot representation, which we described in the previous section.
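Now, let's walk through the individual steps of forward propagation to generate an output from the patterns in the training data. Since each unit in the hidden layer is connected to all units in the input layer, we first calculate the activation $a_1^{(2)}$ as follows:
$$z_1^{(2)} = a_0^{(1)} w_{1,0}^{(1)} + a_1^{(1)} w_{1,1}^{(1)} + \dots + a_m^{(1)} w_{1,m}^{(1)}$$
$$a_1^{(2)} = \phi\left( z_1^{(2)} \right)$$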
Here, $z_1^{(2)}$ is the net input and $\phi(\cdot)$ is the activation function, which has to be differentiable to learn the weights that connect the neurons using a gradient-based approach. To be able to solve complex problems such as image classification, we need nonlinear activation functions in our MLP model, for example, the sigmoid (logistic) activation function that we used in logistic regression in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn:
$$\phi(z) = \frac{1}{1 + e^{-z}}$$
As we can remember, the sigmoid function is an S-shaped curve that maps the net input $z$ onto the range 0 to 1 and crosses the y-axis at $\phi(0) = 0.5$, as shown in the following graph:
The MLP is a typical example of a feedforward artificial neural network. The term feedforward refers to the fact that each layer serves as the input to the next layer without loops, in contrast to recurrent neural networks, an architecture that we will discuss later in this chapter. The term multi-layer perceptron may sound a little bit confusing, since the artificial neurons in this network architecture are typically sigmoid units, not perceptrons. Intuitively, we can think of the neurons in the MLP as logistic regression units that return values in the continuous range between 0 and 1.
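To make the behavior of the sigmoid unit concrete, here is a minimal NumPy sketch. The function name sigmoid and the sample inputs are chosen for illustration; clipping the net input to the range -500 to 500 is an added assumption to avoid floating-point overflow in np.exp and is not part of the mathematical definition.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid phi(z) = 1 / (1 + exp(-z)), applied element-wise."""
    # Clipping is a numerical safeguard (assumption), not part of the definition
    z = np.clip(z, -500.0, 500.0)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
a = sigmoid(z)   # small for negative z, exactly 0.5 at z = 0, close to 1 for positive z
```

Because the function is applied element-wise, the same code works for a single net input, a vector of net inputs for one layer, or a whole matrix of net inputs.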
For purposes of code efficiency and readability, we will now write the activation in a more compact form using the concepts of basic linear algebra, which will allow us to vectorize our code implementation via NumPy rather than writing multiple nested and expensive Python for loops:
$$\mathbf{z}^{(2)} = \mathbf{W}^{(1)} \mathbf{a}^{(1)}$$
$$\mathbf{a}^{(2)} = \phi\left( \mathbf{z}^{(2)} \right)$$
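Here, $\mathbf{a}^{(1)}$ is our $[m+1] \times 1$ dimensional feature vector of a sample $\mathbf{x}^{(i)}$ plus bias unit. $\mathbf{W}^{(1)}$ is an $h \times [m+1]$ dimensional weight matrix, where $h$ is the number of hidden units in our neural network. After matrix-vector multiplication, we obtain the $h \times 1$ dimensional net input vector $\mathbf{z}^{(2)}$ to calculate the activation $\mathbf{a}^{(2)}$ (where $\mathbf{a}^{(2)} \in \mathbb{R}^{h \times 1}$). Furthermore, we can generalize this computation to all $n$ samples in the training set:
$$\mathbf{Z}^{(2)} = \mathbf{W}^{(1)} \left[ \mathbf{A}^{(1)} \right]^T$$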
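Here, $\mathbf{A}^{(1)}$ is now an $n \times [m+1]$ matrix, and the matrix-matrix multiplication will result in an $h \times n$ dimensional net input matrix $\mathbf{Z}^{(2)}$. Finally, we apply the activation function $\phi(\cdot)$ to each value in the net input matrix to get the $h \times n$ activation matrix $\mathbf{A}^{(2)}$ for the next layer (here, output layer):
$$\mathbf{A}^{(2)} = \phi\left( \mathbf{Z}^{(2)} \right)$$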
Similarly, we can rewrite the activation of the output layer in the vectorized form:
$$\mathbf{Z}^{(3)} = \mathbf{W}^{(2)} \mathbf{A}^{(2)}$$
Here, we multiply the $t \times h$ matrix $\mathbf{W}^{(2)}$ ($t$ is the number of output units) by the $h \times n$ dimensional matrix $\mathbf{A}^{(2)}$ to obtain the $t \times n$ dimensional matrix $\mathbf{Z}^{(3)}$ (the columns in this matrix represent the outputs for each sample).
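Lastly, we apply the sigmoid activation function to obtain the continuous valued output of our network:
$$\mathbf{A}^{(3)} = \phi\left( \mathbf{Z}^{(3)} \right), \quad \mathbf{A}^{(3)} \in \mathbb{R}^{t \times n}$$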
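The vectorized forward propagation described in this section can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the names forward, W1, and W2, the random weights and inputs, and the layer sizes are hypothetical. It follows the dimensions given in the text, so the weight matrix of the output layer has shape t x h and no bias unit is added to the hidden layer. Picking the predicted class as the row with the largest output is also an assumption made for this example.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, W2):
    """Forward-propagate all n samples at once.

    X  : n x m input matrix (one sample per row)
    W1 : h x (m+1) weights, W1[j, k] connects input unit k to hidden unit j
    W2 : t x h weights between hidden and output layer
    """
    n = X.shape[0]
    A1 = np.hstack([np.ones((n, 1)), X])   # n x (m+1), first column is the bias unit
    Z2 = W1 @ A1.T                          # h x n net input of the hidden layer
    A2 = sigmoid(Z2)                        # h x n hidden activations
    Z3 = W2 @ A2                            # t x n net input of the output layer
    A3 = sigmoid(Z3)                        # t x n network output, one column per sample
    return A1, Z2, A2, Z3, A3

# Hypothetical sizes: 5 samples, 3 features, 4 hidden units, 3 output units
rng = np.random.default_rng(0)
n, m, h, t = 5, 3, 4, 3
X = rng.normal(size=(n, m))
W1 = rng.uniform(-1.0, 1.0, size=(h, m + 1))
W2 = rng.uniform(-1.0, 1.0, size=(t, h))

A1, Z2, A2, Z3, A3 = forward(X, W1, W2)

# Assumption: predict the class whose output unit is most active for each sample
y_pred = A3.argmax(axis=0)
```

Each column of A3 holds the outputs for one sample, so the matrix-vector form for a single sample is recovered by looking at a single column, for example A2[:, 0] equals sigmoid(W1 @ A1[0]).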