Chapter 12. Implementing a Multilayer Artificial Neural Network from Scratch

As you may know, deep learning is getting a lot of attention from the press and is without any doubt the hottest topic in the machine learning field. Deep learning can be understood as a set of algorithms that were developed to train artificial neural networks with many layers efficiently. In this chapter, you will learn the basic concepts of artificial neural networks so that you will be well-equipped for the following chapters, which will introduce advanced Python-based deep learning libraries and Deep Neural Network (DNN) architectures that are particularly well-suited for image and text analyses.

The topics that we will cover in this chapter are as follows:

  • Getting a conceptual understanding of multilayer neural networks
  • Implementing the fundamental backpropagation algorithm for neural network training from scratch
  • Training a basic multilayer neural network for image classification

Modeling complex functions with artificial neural networks

At the beginning of this book, we started our journey through machine learning algorithms with artificial neurons in Chapter 2, Training Simple Machine Learning Algorithms for Classification. Artificial neurons represent the building blocks of the multilayer artificial neural networks that we will discuss in this chapter. The basic concept behind artificial neural networks was built upon hypotheses and models of how the human brain works to solve complex problems. Although artificial neural networks have gained a lot of popularity in recent years, early studies of neural networks go back to the 1940s, when Warren McCulloch and Walter Pitts first described how neurons could work.

However, in the decades that followed the first implementation of the McCulloch-Pitts neuron model (Rosenblatt's perceptron in the 1950s), many researchers and machine learning practitioners slowly began to lose interest in neural networks, since no one had a good solution for training a neural network with multiple layers. Eventually, interest in neural networks was rekindled in 1986, when D.E. Rumelhart, G.E. Hinton, and R.J. Williams were involved in the (re)discovery and popularization of the backpropagation algorithm to train neural networks more efficiently, which we will discuss in more detail later in this chapter (Learning representations by back-propagating errors, David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, Nature, 323 (6088): 533–536, 1986). Readers who are interested in the history of Artificial Intelligence (AI), machine learning, and neural networks are also encouraged to read the Wikipedia article on the so-called AI winters, the periods of time during which a large portion of the research community lost interest in the study of neural networks (https://en.wikipedia.org/wiki/AI_winter).

However, neural networks have never been as popular as they are today, thanks to the many major breakthroughs that have been made in the previous decade, which resulted in what we now call deep learning algorithms and architectures—neural networks that are composed of many layers. Neural networks are a hot topic not only in academic research but also in big technology companies such as Facebook, Microsoft, and Google, who invest heavily in artificial neural networks and deep learning research. As of today, complex neural networks powered by deep learning algorithms are considered state of the art when it comes to complex problem solving such as image and voice recognition. Popular examples of the products in our everyday life that are powered by deep learning are Google's image search and Google Translate—an application for smartphones that can automatically recognize text in images for real-time translation into more than 20 languages.

Many exciting applications of DNNs have been developed at major tech companies and the pharmaceutical industry as listed in the following, non-comprehensive list of examples:

  • Facebook's DeepFace for tagging images (DeepFace: Closing the Gap to Human-Level Performance in Face Verification, Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701–1708, 2014)
  • Baidu's DeepSpeech, which is able to handle voice queries in Mandarin (DeepSpeech: Scaling up end-to-end speech recognition, A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and Andrew Y. Ng, arXiv preprint arXiv:1412.5567, 2014)
  • Google's new language translation service (Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, arXiv preprint arXiv:1609.08144, 2016)
  • Novel techniques for drug discovery and toxicity prediction (Toxicity prediction using Deep Learning, T. Unterthiner, A. Mayr, G. Klambauer, and S. Hochreiter, arXiv preprint arXiv:1503.01445, 2015)
  • A mobile application that can detect skin cancer with an accuracy similar to professionally trained dermatologists (Dermatologist-level classification of skin cancer with deep neural networks, A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, Nature 542, no. 7639, 2017, pages 115–118)

Single-layer neural network recap

This chapter is all about multilayer neural networks, how they work, and how to train them to solve complex problems. However, before we dig deeper into a particular multilayer neural network architecture, let's briefly reiterate some of the concepts of single-layer neural networks that we introduced in Chapter 2, Training Simple Machine Learning Algorithms for Classification, namely, the ADAptive LInear NEuron (Adaline) algorithm, which is shown in the following figure:

[Figure: schematic of the Adaline (ADAptive LInear NEuron) single-layer neural network]

In Chapter 2, Training Simple Machine Learning Algorithms for Classification, we implemented the Adaline algorithm to perform binary classification, and we used the gradient descent optimization algorithm to learn the weight coefficients of the model. In every epoch (pass over the training set), we updated the weight vector w using the following update rule:

$$\mathbf{w} := \mathbf{w} + \Delta\mathbf{w}, \quad \text{where } \Delta\mathbf{w} = -\eta\nabla J(\mathbf{w})$$

In other words, we computed the gradient based on the whole training set and updated the weights of the model by taking a step in the opposite direction of the gradient $\nabla J(\mathbf{w})$. In order to find the optimal weights of the model, we optimized an objective function that we defined as the Sum of Squared Errors (SSE) cost function $J(\mathbf{w})$. Furthermore, we multiplied the gradient by a factor, the learning rate $\eta$, which we had to choose carefully to balance the speed of learning against the risk of overshooting the global minimum of the cost function.

In gradient descent optimization, we updated all weights simultaneously after each epoch, and we defined the partial derivative for each weight $w_j$ in the weight vector $\mathbf{w}$ as follows:

$$\frac{\partial}{\partial w_j} J(\mathbf{w}) = -\sum_i \left(y^{(i)} - a^{(i)}\right) x_j^{(i)}$$

Here, $y^{(i)}$ is the target class label of a particular sample $x^{(i)}$, and $a^{(i)}$ is the activation of the neuron, which is a linear function in the special case of Adaline. Furthermore, we defined the activation function $\phi(\cdot)$ as follows:

$$\phi(z) = z = a$$

Here, the net input z is a linear combination of the inputs, weighted by the weights that connect the input layer to the output layer:

$$z = \sum_j w_j x_j = \mathbf{w}^T\mathbf{x}$$

While we used the activation $\phi(z)$ to compute the gradient update, we implemented a threshold function to squash the continuous valued output into binary class labels for prediction:

$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \geq 0 \\ -1 & \text{otherwise} \end{cases}$$
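To make this recap concrete, here is a minimal NumPy sketch of one full-batch gradient descent update for Adaline, assuming the feature matrix X already carries a leading column of 1s for the bias; the function and variable names (adaline_gd_step, predict, eta) are ours for illustration, not the code from Chapter 2.

```python
import numpy as np

def adaline_gd_step(X, y, w, eta=0.01):
    """One full-batch gradient descent update for Adaline.

    X : (n_samples, m) feature matrix with a leading column of 1s for the bias
    y : (n_samples,) target labels in {-1, 1}
    w : (m,) weight vector (w[0] acts as the bias weight)
    """
    z = X.dot(w)                  # net input z = w^T x for every sample
    a = z                         # linear (identity) activation, phi(z) = z
    errors = y - a                # y^(i) - a^(i)
    gradient = -X.T.dot(errors)   # dJ/dw_j = -sum_i (y^(i) - a^(i)) x_j^(i)
    w = w - eta * gradient        # step in the negative gradient direction
    return w

def predict(X, w):
    """Threshold the continuous-valued activation to obtain binary class labels."""
    return np.where(X.dot(w) >= 0.0, 1, -1)
```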

Note

Note that although Adaline consists of two layers, one input layer and one output layer, it is called a single-layer network because of its single layer of weighted connections between the input and output layers.

Also, we learned about a trick to accelerate model learning, so-called stochastic gradient descent (SGD) optimization. Stochastic gradient descent approximates the cost from a single training sample (online learning) or a small subset of training samples (mini-batch learning). We will make use of this concept later in this chapter when we implement and train a multilayer perceptron. Apart from faster learning, due to the more frequent weight updates compared to gradient descent, its noisy nature is also regarded as beneficial when training multilayer neural networks with non-linear activation functions, which do not have a convex cost function. Here, the added noise can help to escape local cost minima, but we will discuss this topic in more detail later in this chapter.
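As an equally illustrative sketch, the following snippet shows one epoch of mini-batch stochastic gradient descent built on the adaline_gd_step helper above; the shuffling scheme and the batch size of 32 are arbitrary choices for demonstration.

```python
def adaline_sgd_epoch(X, y, w, eta=0.01, batch_size=32, rng=None):
    """One epoch of mini-batch SGD: update the weights on small random subsets."""
    rng = np.random.default_rng() if rng is None else rng
    indices = rng.permutation(X.shape[0])          # shuffle to decorrelate updates
    for start in range(0, X.shape[0], batch_size):
        batch = indices[start:start + batch_size]  # indices of one mini-batch
        w = adaline_gd_step(X[batch], y[batch], w, eta)
    return w
```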

Introducing the multilayer neural network architecture

In this section, you will learn how to connect multiple single neurons to a multilayer feedforward neural network; this special type of fully connected network is also called Multilayer Perceptron (MLP). The following figure illustrates the concept of an MLP consisting of three layers:

[Figure: a multilayer perceptron (MLP) with one input layer, one hidden layer, and one output layer]

The MLP depicted in the preceding figure has one input layer, one hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, and the output layer is fully connected to the hidden layer. If such a network has more than one hidden layer, we also call it a deep artificial neural network.

Note

We can add an arbitrary number of hidden layers to the MLP to create deeper network architectures. Practically, we can think of the number of layers and units in a neural network as additional hyperparameters that we want to optimize for a given problem task using cross-validation techniques that we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.

However, the error gradients that we will calculate later via backpropagation will become increasingly small as more layers are added to a network. This vanishing gradient problem makes the model learning more challenging. Therefore, special algorithms have been developed to help train such deep neural network structures; this is known as deep learning.

As shown in the preceding figure, we denote the ith activation unit in the lth layer as $a_i^{(l)}$. To make the math and code implementations a bit more intuitive, we will not use numerical indices to refer to layers, but we will use the in superscript for the input layer, the h superscript for the hidden layer, and the out superscript for the output layer. For instance, $a_i^{(in)}$ refers to the ith value in the input layer, $a_i^{(h)}$ refers to the ith unit in the hidden layer, and $a_i^{(out)}$ refers to the ith unit in the output layer. Here, the activation units $a_0^{(in)}$ and $a_0^{(h)}$ are the bias units, which we set equal to 1. The activation of the units in the input layer is just the input plus the bias unit:

$$\mathbf{a}^{(in)} = \begin{bmatrix} a_0^{(in)} \\ a_1^{(in)} \\ \vdots \\ a_m^{(in)} \end{bmatrix} = \begin{bmatrix} 1 \\ x_1^{(in)} \\ \vdots \\ x_m^{(in)} \end{bmatrix}$$

Note

Later in this chapter, we will implement the multilayer perceptron using separate vectors for the bias unit, which makes the code implementation more efficient and easier to read. This concept is also used by TensorFlow, a deep learning library that we will introduce in Chapter 13, Parallelizing Neural Network Training with TensorFlow. However, the mathematical equations that follow would appear more complex or convoluted if we had to work with additional variables for the bias. Note that the computation via appending 1s to the input vector (as shown previously) and using a weight variable as bias is exactly the same as operating with separate bias vectors; it is merely a different convention.
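To illustrate why the two conventions are interchangeable, here is a small sketch (with made-up array names and random data) that computes the same net input once with a separate bias vector and once by appending 1s to the input and folding the bias into the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))        # 5 samples, 3 features
W = rng.normal(size=(3, 4))        # weights connecting 3 inputs to 4 hidden units
b = rng.normal(size=(4,))          # separate bias vector, one value per hidden unit

# Convention 1: separate bias vector
z_separate = X.dot(W) + b

# Convention 2: append a column of 1s and fold the bias into the weight matrix
X_with_ones = np.hstack([np.ones((X.shape[0], 1)), X])
W_with_bias = np.vstack([b, W])    # first row of the weight matrix acts as the bias
z_appended = X_with_ones.dot(W_with_bias)

assert np.allclose(z_separate, z_appended)  # both conventions give the same net input
```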

Each unit in layer l is connected to all units in layer l + 1 via a weight coefficient. For example, the connection between the kth unit in layer l and the jth unit in layer l + 1 will be written as $w_{k,j}^{(l+1)}$. Referring back to the previous figure, we denote the weight matrix that connects the input to the hidden layer as $\mathbf{W}^{(h)}$, and we write the matrix that connects the hidden layer to the output layer as $\mathbf{W}^{(out)}$.

While one unit in the output layer would suffice for a binary classification task, we saw a more general form of a neural network in the preceding figure, which allows us to perform multiclass classification via a generalization of the One-versus-All (OvA) technique. To better understand how this works, remember the one-hot representation of categorical variables that we introduced in Chapter 4, Building Good Training Sets – Data Preprocessing. For example, we can encode the three class labels in the familiar Iris dataset (0=Setosa, 1=Versicolor, 2=Virginica) as follows:

$$0 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad 1 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad 2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

This one-hot vector representation allows us to tackle classification tasks with an arbitrary number of unique class labels present in the training set.
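As a quick illustration, one possible way to construct such one-hot vectors with NumPy is sketched below; the helper name onehot is ours, and utilities such as scikit-learn's OneHotEncoder would serve the same purpose.

```python
import numpy as np

def onehot(y, n_classes):
    """Encode integer class labels as one-hot indicator vectors (one row per sample)."""
    onehot_matrix = np.zeros((y.shape[0], n_classes))
    onehot_matrix[np.arange(y.shape[0]), y] = 1.0
    return onehot_matrix

y = np.array([0, 1, 2, 2, 0])      # e.g., Iris labels: Setosa, Versicolor, Virginica, ...
print(onehot(y, n_classes=3))
```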

If you are new to neural network representations, the indexing notation (subscripts and superscripts) may look a little bit confusing at first. What may seem overly complicated will make much more sense in later sections when we vectorize the neural network representation. As introduced earlier, we summarize the weights that connect the input and hidden layers by a matrix $\mathbf{W}^{(h)} \in \mathbb{R}^{m \times d}$, where d is the number of hidden units and m is the number of input units including the bias unit. Since it is important to internalize this notation to follow the concepts later in this chapter, let's summarize what we have just learned in a descriptive illustration of a simplified 3-4-3 multilayer perceptron:

[Figure: the indexing notation illustrated on a simplified 3-4-3 multilayer perceptron]

Activating a neural network via forward propagation

In this section, we will describe the process of forward propagation to calculate the output of an MLP model. To understand how it fits into the context of learning an MLP model, let's summarize the MLP learning procedure in three simple steps:

  1. Starting at the input layer, we forward propagate the patterns of the training data through the network to generate an output.
  2. Based on the network's output, we calculate the error that we want to minimize using a cost function that we will describe later.
  3. We backpropagate the error, find its derivative with respect to each weight in the network, and update the model.

Finally, after we repeat these three steps for multiple epochs and learn the weights of the MLP, we use forward propagation to calculate the network output and apply a threshold function to obtain the predicted class labels in the one-hot representation, which we described in the previous section.

Now, let's walk through the individual steps of forward propagation to generate an output from the patterns in the training data. Since each unit in the hidden layer is connected to all units in the input layer, we first calculate the activation unit $a_1^{(h)}$ of the hidden layer as follows:

$$z_1^{(h)} = a_0^{(in)} w_{0,1}^{(h)} + a_1^{(in)} w_{1,1}^{(h)} + \cdots + a_m^{(in)} w_{m,1}^{(h)}$$
$$a_1^{(h)} = \phi\left(z_1^{(h)}\right)$$

Here, $z_1^{(h)}$ is the net input and $\phi(\cdot)$ is the activation function, which has to be differentiable to learn the weights that connect the neurons using a gradient-based approach. To be able to solve complex problems such as image classification, we need non-linear activation functions in our MLP model, for example, the sigmoid (logistic) activation function that we remember from the section about logistic regression in Chapter 3, A Tour of Machine Learning Classifiers Using scikit-learn:

$$\phi(z) = \frac{1}{1 + e^{-z}}$$

As we may remember, the sigmoid function is an S-shaped curve that maps the net input z onto values in the range 0 to 1, crossing the y-axis (z = 0) at $\phi(0) = 0.5$, as shown in the following graph:

[Figure: the S-shaped sigmoid (logistic) activation function plotted over the net input z]
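For reference, a minimal NumPy sketch of this activation function could look as follows; the clipping of the net input is a numerical safeguard we add for illustration, not part of the mathematical definition.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation, phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))  # clip to avoid overflow in exp

print(sigmoid(0.0))                         # 0.5, where the curve crosses the y-axis
print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # values squashed into the open interval (0, 1)
```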

The MLP is a typical example of a feedforward artificial neural network. The term feedforward refers to the fact that each layer serves as the input to the next layer without loops, in contrast to recurrent neural networks, an architecture that we will briefly touch upon later in this chapter and discuss in more detail in Chapter 16, Modeling Sequential Data Using Recurrent Neural Networks. The term multilayer perceptron may sound a little bit confusing since the artificial neurons in this network architecture are typically sigmoid units, not perceptrons. Intuitively, we can think of the neurons in the MLP as logistic regression units that return values in the continuous range between 0 and 1.

For purposes of code efficiency and readability, we will now write the activation in a more compact form using the concepts of basic linear algebra, which will allow us to vectorize our code implementation via NumPy rather than writing multiple nested and computationally expensive Python for loops:

$$\mathbf{z}^{(h)} = \mathbf{a}^{(in)} \mathbf{W}^{(h)}$$
$$\mathbf{a}^{(h)} = \phi\left(\mathbf{z}^{(h)}\right)$$

Here, $\mathbf{a}^{(in)}$ is our 1 x m dimensional feature vector of a sample $\mathbf{x}^{(in)}$ plus a bias unit. $\mathbf{W}^{(h)}$ is an m x d dimensional weight matrix where d is the number of units in the hidden layer. After matrix-vector multiplication, we obtain the 1 x d dimensional net input vector $\mathbf{z}^{(h)}$ to calculate the activation $\mathbf{a}^{(h)}$ (where $\mathbf{a}^{(h)} \in \mathbb{R}^{1 \times d}$). Furthermore, we can generalize this computation to all n samples in the training set:

$$\mathbf{Z}^{(h)} = \mathbf{A}^{(in)} \mathbf{W}^{(h)}$$

Here, $\mathbf{A}^{(in)}$ is now an n x m matrix, and the matrix-matrix multiplication will result in an n x d dimensional net input matrix $\mathbf{Z}^{(h)}$. Finally, we apply the activation function $\phi(\cdot)$ to each value in the net input matrix to get the n x d activation matrix $\mathbf{A}^{(h)}$ for the next layer (here, the output layer):

$$\mathbf{A}^{(h)} = \phi\left(\mathbf{Z}^{(h)}\right)$$

Similarly, we can write the activation of the output layer in vectorized form for multiple samples:

$$\mathbf{Z}^{(out)} = \mathbf{A}^{(h)} \mathbf{W}^{(out)}$$

Here, we multiply the n x d dimensional matrix $\mathbf{A}^{(h)}$ by the d x t matrix $\mathbf{W}^{(out)}$ (t is the number of output units) to obtain the n x t dimensional matrix $\mathbf{Z}^{(out)}$ (each row of this matrix holds the net inputs of the output units for one sample).

Lastly, we apply the sigmoid activation function to obtain the continuous valued output of our network:

$$\mathbf{A}^{(out)} = \phi\left(\mathbf{Z}^{(out)}\right), \quad \mathbf{A}^{(out)} \in \mathbb{R}^{n \times t}$$
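Putting the vectorized equations together, the whole forward pass can be sketched as follows. This is only an illustrative sketch: it uses the separate-bias-vector convention mentioned in the earlier note, re-declares the sigmoid helper from above, and all array names and layer sizes are made up for demonstration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation (same helper as sketched earlier)."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

def forward(X, W_h, b_h, W_out, b_out):
    """Vectorized forward propagation for an MLP with one hidden layer.

    X     : (n, m)  input matrix, one row per sample
    W_h   : (m, d)  weights connecting the input to the hidden layer
    b_h   : (d,)    hidden-layer bias vector
    W_out : (d, t)  weights connecting the hidden to the output layer
    b_out : (t,)    output-layer bias vector
    """
    Z_h = X.dot(W_h) + b_h          # (n, d) net input of the hidden layer
    A_h = sigmoid(Z_h)              # (n, d) hidden-layer activations
    Z_out = A_h.dot(W_out) + b_out  # (n, t) net input of the output layer
    A_out = sigmoid(Z_out)          # (n, t) continuous-valued network output
    return Z_h, A_h, Z_out, A_out

# Example with random weights: 100 samples, 4 features, 10 hidden units, 3 classes
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
W_h, b_h = rng.normal(size=(4, 10)), np.zeros(10)
W_out, b_out = rng.normal(size=(10, 3)), np.zeros(3)
_, _, _, A_out = forward(X, W_h, b_h, W_out, b_out)
print(A_out.shape)                  # (100, 3): one row of output activations per sample
```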