This chapter introduces the core notions of artificial neural networks (ANNs), a class of algorithms central to modern-day deep learning. The history of artificial neural networks is a surprisingly old one, dating back to the early 1940s. It took many decades for its applications to become vast successes in many areas, but the basic ideas remain in effect.
At the core of ANNs is the idea to take inspiration from neuroscience and model a class of algorithms that works similarly to the way we hypothesize part of the brain functions. In particular, we use the notion of neurons as atomic blocks for our artificial networks. Neurons form groups called layers, and these layers are connected to each other in specific ways to span a network. Given input data, neurons can transfer information layer by layer via connections, and we say that they activate if the signal is strong enough. In this way, data is propagated through the network until we arrive at the last step, the output layer, from which we get our predictions. These predictions can then be compared to the expected output to compute the error of the prediction, which the network uses to learn and improve future predictions.
Although the brain-inspired architecture analogy is useful at times, we don’t want to overstress it here. We do know a lot about the visual cortex of our brain in particular, but the analogy can sometimes be misleading or even harmful. We think it’s better to think of ANNs as trying to uncover the guiding principles of learning in organisms, just as an airplane makes use of aerodynamics, but doesn’t try to copy a bird.
To make things more concrete in this chapter, we provide a basic implementation of a neural network to follow—starting from scratch. You’ll apply this network to tackle a problem from optical character recognition (OCR); namely, how to let a computer predict which digit is displayed on an image of a handwritten digit.
Each image in our OCR data set is made up of pixels laid out on a grid, and you must analyze spatial relationships between the pixels to figure out what digit it represents. Go, like many other board games, is played on a grid, and you must consider the spatial relationships on the board in order to pick a good move. You might hope that machine-learning techniques for OCR could also apply to games like Go. As it turns out, they do. Chapters 6 through 8 show how to apply these methods to games.
We keep the mathematics relatively low in this chapter. If you aren’t familiar with the basics of linear algebra, calculus, and probability theory or need a brief and practical reminder, we suggest that you read appendix A first. Also, the more difficult parts of the learning procedure of a neural network can be found in appendix B. If you know neural networks, but have never implemented one, we suggest that you skip to section 5.5 right away. If you’re familiar with implementing networks as well, jump right into chapter 6, in which you’ll apply neural networks to predict moves from games generated in chapter 4.
Before introducing neural networks in detail, let’s start with a concrete use case. Throughout this chapter, you’ll build an application that can predict digits from handwritten image data reasonably well, with about 95% accuracy. Notably, you’ll do all this by exposing only the pixel values of the images to a neural network; the algorithm will learn to extract relevant information about the structure of digits on its own.
You’ll use the Modified National Institute of Standards and Technology (MNIST) data set of handwritten digits to do so, a well-studied data set among machine-learning practitioners and the fruit fly of deep learning.
In this chapter, you’ll use the NumPy library for handling low-level mathematical operations. NumPy is the industry standard for machine learning and mathematical computing in Python, and you’ll use it throughout the rest of the book. Before trying out any of the code samples in this chapter, you should install NumPy with your preferred package manager. If you use pip, run pip install numpy from a shell to install it. If you use Conda, run conda install numpy.
The MNIST data set consists of 60,000 images, 28 × 28 pixels each. A few examples of this data are shown in figure 5.1. To humans, recognizing most of these examples is a trivial task, and you can easily read the examples in the first row as 7, 5, 3, 9, 3, 0, and so on. But in some cases, it’s difficult even for humans to understand what the picture represents. For instance, the fourth picture in row five in figure 5.1 could easily be a 4 or a 9.
Each image in MNIST is annotated with a label, a digit from 0 to 9 representing the true value depicted on the image.
Before you can look at the data, you need to load it first. In our GitHub repository for this book, you’ll find a file called mnist.pkl.gz located in the folder http://mng.bz/P8mn.
In this folder, you’ll also find all the code you’ll write in this chapter. As before, we suggest that you follow the flow of this chapter and build the code base as you go, but you can also run the code as found in the GitHub repository.
Because the labels in this data set are integers from 0 to 9, you’ll use a technique called one-hot encoding to transform the digit 1 to a vector of length 10 with all 0s, except you place a 1 at position 1. This representation is useful and widely used in the context of machine learning. Reserving the first slot in a vector for label 1 allows algorithms such as neural networks to distinguish more easily between labels. Using one-hot encoding, the digit 2, for instance, has the following representation: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].
import six.moves.cPickle as pickle import gzip import numpy as np def encode_label(j): 1 e = np.zeros((10, 1)) e[j] = 1.0 return e
The benefit of one-hot encoding is that each digit has its own “slot,” and you can use neural networks to output probabilities for an input image, which will become useful later.
Examining the contents of the file mnist.pkl.gz, you have access to three pools of data: training, validation, and test data. Recall from chapter 1 that you use training data to train, or fit, a machine-learning algorithm, and use test data to evaluate how well the algorithm learned. Validation data can be used to tweak and validate the configuration of the algorithm, but can be safely ignored in this chapter.
Images in the MNIST data set are quadratic and have both height and width of 28 pixels. You load image data into feature vectors of size 784 = 28 × 28; you discard the image structure altogether and look only at pixels represented as a vector. Each value of this vector represents a grayscale value between 0 and 1, with 0 being white and 1 black.
def shape_data(data): features = [np.reshape(x, (784, 1)) for x in data[0]] 1 labels = [encode_label(y) for y in data[1]] 2 return zip(features, labels) 3 def load_data(): with gzip.open('mnist.pkl.gz', 'rb') as f: train_data, validation_data, test_data = pickle.load(f) 4 return shape_data(train_data), shape_data(test_data) 5
You now have a simple representation of the MNIST data set; both features and labels are encoded as vectors. Your task is to devise a mechanism that learns to map features to labels accurately. Specifically, you want to design an algorithm that takes in training features and labels to learn, such that it can predict labels for test features.
Neural networks can do this job well, as you’ll see in the next section, but let’s first discuss a naive approach that will show you the general problems you have to tackle for this application. Recognizing digits is a relatively simple task for humans, but it’s difficult to explain exactly how to do it and how we know what we know. This phenomenon of knowing more than you can explain is called Polanyi’s paradox. This makes it particularly hard to describe to a machine explicitly how to solve this problem.
One aspect that plays a crucial role is pattern recognition—each handwritten digit has certain traits that derive from its prototypical, digital version. For instance, a 0 is roughly an oval, and in many countries a 1 is simply a vertical line. Given this heuristic, you can naively approach classifying handwritten digits by comparing them to each other: given an image of an 8, this image should be closer to the average image of an 8 than to any other digit. The following average_digit function does this for you.
import numpy as np from dlgo.nn.load_mnist import load_data from dlgo.nn.layers import sigmoid_double def average_digit(data, digit): 1 filtered_data = [x[0] for x in data if np.argmax(x[1]) == digit] filtered_array = np.asarray(filtered_data) return np.average(filtered_array, axis=0) train, test = load_data() avg_eight = average_digit(train, 8) 2
What does the average 8 in your training set look like? Figure 5.2 gives the answer.
Because handwriting can differ quite a lot from individual to individual, as expected, the average 8 is a little fuzzy, but it’s still noticeably shaped like an 8. Maybe you can use this representation to identify other 8s in your data set? You use the following code to compute and display figure 5.2.
from matplotlib import pyplot as plt img = (np.reshape(avg_eight, (28, 28))) plt.imshow(img) plt.show()
This average representation of an 8, avg_eight, in the training set of MNIST, should carry a lot of information about what it means for an image to have an 8 on it. You’ll use avg_eight as parameters of a simple model to decide whether a given input vector x, representing a digit, is an 8. In the context of neural networks, we often speak of weights when referring to parameters, and avg_eight will serve as your weight.
For convenience, you’ll use transposition and define W = np.transpose(avg_eight). You can then compute the dot product of W and x, which does pointwise multiplication of values of W and x and sums up all 784 resulting values. If your heuristic is right, if x is an 8, individual pixels should have a darker tone at roughly the same places W does, and vice versa. Conversely, if x isn’t an 8, there should be less overlap. Let’s test this hypothesis in a few examples.
x_3 = train[2][0] 1 x_18 = train[17][0] 2 W = np.transpose(avg_eight) np.dot(W, x_3) 3 np.dot(W, x_18) 4
You compute the dot product of your weights W with two MNIST samples, one representing a 4, and another one representing an 8. You can see that the latter result of 54.2 for the 8 is much higher than the 20.1 result for the 4. So, it seems you’re on to something. Now, how do you decide when a resulting value is high enough to predict it as an 8? In principle, the dot product of two vectors can spit out any real number. What you do to address this is transform the output of the dot product to the range [0, 1]. Doing so, you can, for instance, try to define a cutoff value at 0.5 and declare everything above this value an 8.
One way to do this is with the sigmoid function. The sigmoid function is often denoted by s, the Greek letter sigma. For a real number x, the sigmoid function is defined as:
Figure 5.3 shows how it looks to gain some intuition.
Next, let’s code the sigmoid function in Python before you apply it to the output of the dot product.
def sigmoid_double(x): return 1.0 / (1.0 + np.exp(-x)) def sigmoid(z): return np.vectorize(sigmoid_double)(z)
Note that you provide both sigmoid_double, which operates on double values, as well as a version that computes the sigmoid for vectors that you’ll use extensively in this chapter. Before you apply sigmoid to your previous computation, note that the sigmoid of 2 is already close to 1, so for your two previously computed samples, sigmoid(54.2) and sigmoid(20.1) will be practically indistinguishable. You can fix this issue by shifting the output of the dot product toward 0. Doing this is called applying a bias term, which we often refer to as b. From the samples, you compute that a good guess for a bias term might be b = –45. Using weights and the bias term, you can now compute predictions of your model as follows.
def predict(x, W, b): 1 return sigmoid_double(np.dot(W, x) + b) b = -45 2 print(predict(x_3, W, b)) 3 print(predict(x_18, W, b)) 4
You get satisfying results on the two examples x_3 and x_18. The prediction for the latter is close to 1, and for the former is almost 0. This procedure of mapping an input vector x to s(Wx + b) for W, a vector the same size as x, is called logistic regression. Figure 5.4 depicts this algorithm schematically for a vector of length 4.
To get a better idea of how well this procedure works, let’s compute predictions for all training and test samples. As indicated before, you define a cutoff, or decision threshold, to decide whether a prediction is counted as an 8. As an evaluation metric here, you choose accuracy; you compute the ratio of correct predictions among all predictions.
def evaluate(data, digit, threshold, W, b): 1 total_samples = 1.0 * len(data) correct_predictions = 0 for x in data: if predict(x[0], W, b) > threshold and np.argmax(x[1]) == digit: 2 correct_predictions += 1 if predict(x[0], W, b) <= threshold and np.argmax(x[1]) != digit: 3 correct_predictions += 1 return correct_predictions / total_samples
Let’s use this evaluation function to assess the quality of predictions for three data sets: the training set, the test set, and the set of all 8s among the test set. You do this with a threshold of 0.5 and weights W and bias term b as before.
evaluate(data=train, digit=8, threshold=0.5, W=W, b=b) 1 evaluate(data=test, digit=8, threshold=0.5, W=W, b=b) 2 eight_test = [x for x in test if np.argmax(x[1]) == 8] evaluate(data=eight_test, digit=8, threshold=0.5, W=W, b=b) 3
What you see is that accuracy on the training set is highest, at about 78%. This shouldn’t come as a surprise, because you calibrated your model on the training set. In particular, it doesn’t make sense to evaluate on the training set, because it doesn’t tell you how well your algorithm generalizes (how well it performs on previously unseen data). Performance on test data is close to that for training, at about 77%, but it’s the last accuracy term that’s noteworthy. On the set of all 8s in the test set, you achieve merely 66%, so with your simple model, you’re right in only two out of three unseen cases of 8s. This result might be acceptable as a first baseline, but is far from the best you can do. What went wrong, and what can you do better?
Although the discussion of this little use case and the naive model you built might not seem like much, you’ve already seen many parts constituting neural networks. In the next section, you’ll use the intuition built around this use case to take your first steps with neural networks by tackling each of these four points.
How can you improve your OCR model? As we hinted in the introduction, neural networks can do a bang-up job on this sort of task—much better than our handcrafted model. But the handcrafted model does illustrate key concepts that you’ll use to build your neural network. This section describes the model from the previous section in the language of neural networks.
In section 5.1, you saw logistic regression used for binary classification. To recap, you took a feature vector x, representing a data sample, fed it into the algorithm by first multiplying it by a weight matrix W and then adding a bias term b. To end up with a prediction y between 0 and 1, you applied the sigmoid function to it: y = s(Wx + b)
You should notice a few things here. First of all, the feature vector x can be interpreted as a collection of neurons, sometimes called units, connected to y by means of W and b, which you’ve already seen in figure 5.4. Next, note that the sigmoid can be seen as an activation function, in that it takes the outcome of Wx + b and maps it to the range [0,1]. If you interpret a value close to 1 as the neuron y activating, and in turn not activating if it’s close to 0, this setup can be seen as a small example of an artificial neural network already.
In the use case in section 5.1, you simplified the problem of recognizing handwritten digits to a binary classification problem; namely, distinguishing an 8 from all other digits. But you’re interested in predicting 10 classes, one for each digit. At least formally, you can achieve this fairly easily by changing what you denote by y, W, and b; you alter the output, weights, and bias of your model.
First, you make y a vector of length 10; y will have one value representing the likelihood of each of the 10 digits:
Next, let’s adapt weights and bias accordingly. Recall that so far W is a vector of length 784. Instead, you make W a matrix with dimensions (10, 784). This way, you can do matrix multiplication of W and an input vector x, namely Wx, the result of which will be a vector of length 10. Continuing, if you make the bias term a vector of length 10, you can add it to Wx. Finally, note that you can compute the sigmoid of a vector z by applying it to each of its components:
Figure 5.5 depicts this slightly altered setup for four input and two output neurons.
Now, what did you gain? You can now map an input vector x to an output vector y, whereas before, y was just a single value. The benefit of this is that nothing stops you from doing this vector-to-vector transformation multiple times, thereby building what we call a feed-forward network.
Let’s quickly recap what you did in the preceding section. On a high level, you carried out the following steps:
At the heart of feed-forward networks is the idea that you can apply this process iteratively, thereby applying many times over the simple building blocks specified by these two steps. These building blocks form what we call a layer. With this notation, you can say that you stack many layers to form a multilayer neural network. Let’s modify our last example by introducing one more layer. You now have to run the following steps:
Note that you use superscripts here to denote which layer you’re in, and subscripts to denote position within a vector or matrix. The prescription of working with two layers instead of just one is visualized in figure 5.6.
At this point, it should become clear that you’re not bound to any particular number of layers to stack. You could use many more. Moreover, you don’t necessarily have to use logistic sigmoid as the activation all the time. You have a plethora of activation functions to choose from, and we introduce some of them in the next chapter. Applying these functions of all layers in a network sequentially to one or more data points is usually referred to as a forward pass. It’s referred to as forward, because data always flows only forward, from input to output (in the figures, left to right), and never back.
With this notation, depicting a regular feed-forward network with three layers looks like figure 5.7.
To recap what you’ve learned so far, let’s put together all the notions we mentioned in one concise list:
At this point in the book, you’re concerned with only sequential neural networks, in which the layers form a sequence. In a sequential network, you start with the input, and each following (hidden) layer has precisely one predecessor and one successor, ending in the output layer. This is enough to cover everything you need in order to apply deep learning to the game of Go.
In general, the theory of neural networks allows for arbitrary nonsequential architectures as well. For instance, in some applications it makes sense to concatenate, or add, the output of two layers (you merge two or more previous layers). In such a scenario, you merge multiple inputs and emit one output.
In other applications, it can be useful to split an input into several outputs. In general, a layer can have multiple inputs and outputs. We introduce multi-input and multi-output networks in chapters 11 and 12, respectively.
The setup of a multilayer perceptron with l layers is fully described by the set of weights W = W1, ..., Wl, the set of biases b = b1, ..., bl, and the set of activation functions chosen for each layer. But a vital ingredient for learning from data and updating parameters is still missing: loss functions and how to optimize them.
Section 5.3 defined how to set up a feed-forward neural network and pass input data through it, but you still don’t know how to assess the quality of your predictions. To do so, you need a measure to define how close prediction and actual outcome are.
To quantify how much you missed your target with your prediction, we introduce the concept of loss functions, often called objective functions. Let’s say you have a feed-forward network with weights W, biases b, and sigmoid activation functions. For a given set of input features X1, ..., Xk and respective labels ŷ1, ..., ŷk (the symbol ŷ for a label is pronounced y-hat), using your network you can compute predictions y1, ..., yk. In such a scenario, a loss function is defined as follows:
Here, Loss(yi,ŷi) ≥ 0, and Loss is a differentiable function. A loss function is a smooth function that assigns a non-negative value to each (prediction, label) pair. The loss of a bunch of features and labels is the sum of losses of the samples. A loss function assesses the fit of your algorithm’s parameters, given the data you show it. Your training target is to minimize loss by finding good strategies to adapt your parameters.
One example of a widely used loss function is mean squared error (MSE). Although MSE isn’t ideal for our use case, it’s one of the most intuitive loss functions to work with. You measure how close your prediction was to the actual label, by measuring the squared distance and averaging over all observed examples. Writing ŷ = ŷ1, ..., ŷk for labels and y = y1, ..., yk for predictions, the mean squared error is defined as follows:
We’ll present the benefits and drawbacks of various loss functions after you’ve seen an application of the theory presented here. For now, let’s implement the mean squared error in Python.
import random import numpy as np class MSE: 1 def __init__(self): pass @staticmethod def loss_function(predictions, labels): diff = predictions - labels return 0.5 * sum(diff * diff)[0] 2 @staticmethod def loss_derivative(predictions, labels): return predictions - labels 3
Note that you implemented not only the loss function itself, but also its derivative with respect to your predictions: loss_derivative. This derivative is a vector and is obtained by subtracting labels from predictions.
Next, you’ll see how derivatives like this one for MSE play a crucial role in training neural networks.
The loss function for a set of predictions and labels gives you information about how well tuned the parameters of your model are. The smaller the loss, the better your predictions, and vice versa. The loss function itself is a function of the parameters of your network. In your MSE implementation, you don’t directly supply weights, but they’re implicitly given through predictions, because you use weights to compute them.
In theory, you know from calculus that to minimize the loss, you need to compute its derivative and set it to 0. We call the set of parameters at this point a solution. Computing the derivative of a function and evaluating it at a specific point is called computing the gradient. You’ve done the first step in computing the derivative in your MSE implementation, but there’s more to it. What you aim for is to explicitly compute gradients for all weight and bias terms in your network.
If you need a refresher on the basics of calculus, make sure to check out appendix A. Figure 5.8 shows a surface in three-dimensional space. This surface can be interpreted as a loss function for two-dimensional input. The first two axes represent your weights, and the third, upward-pointing axis indicates the loss value.
Intuitively speaking, when you compute the gradient of a function for a given point, that gradient points in the direction of steepest ascent. Starting with a loss function, Loss, and a set of parameters W, the gradient descent algorithm to find a minimum for this function goes like this:
Because your loss function is non-negative, you know in particular that it has a minimum. It could have many, even infinitely many, minima. For instance, if you think about a flat surface, every point on it is a minimum.
A point with zero gradient reached by gradient descent is by definition a minimum. The precise mathematical definition of a minimum for differentiable functions of many variables is relatively involved and uses information about the curvature of the function.
With gradient descent, you’ll eventually find a minimum; you can follow the gradient of the function until you find a point of zero gradient. There’s only one caveat: you don’t know whether this minimum is a local or a global minimum. You might be stuck in a plateau that locally is the smallest point the function can take, but other points might have a smaller absolute value. The marked point in figure 5.8 is a local minimum, but clearly smaller values exist on this surface.
What we do to resolve this problem might strike you as strange: we ignore it. In practice, gradient descent often leads to satisfying results, so in the context of loss functions for neural networks, we tend to ignore the question of whether a minimum is local or global. We usually won’t even run the algorithm until convergence, but rather stop after a predefined number of steps.
Figure 5.9 shows how gradient descent works for the loss surface from figure 5.8 and a choice of parameters indicated by the marked point in the upper right.
In your MSE implementation, you’ve seen that the derivative of the mean squared error loss is easy to calculate formally: it’s the difference between labels and predictions. But to evaluate such a derivative, you have to compute predictions first. To get a view of the gradients for all parameters, you have to evaluate and aggregate derivatives for every sample in your training set. Given that you’re usually dealing with many thousands, if not millions, of data samples for your network, this is practically infeasible. Instead, you approximate gradient computation with a technique called stochastic gradient descent.
To compute gradients and apply gradient descent for a neural network, you’d have to evaluate the loss function and its derivative with respect to network parameters at every single point in the training set, which is too expensive in most cases. Instead, you use a technique called stochastic gradient descent (SGD). To run SGD, you first select a few samples from your training set, which you call a mini-batch. Each mini-batch is selected with a fixed length that we call the mini-batch size. For a classification problem like the handwritten digit problem you’re tackling, it’s good practice to choose a batch size in the same order of magnitude as the number of labels so as to make sure each label is represented in a mini-batch.
For a given feed-forward neural network with l layers and a mini-batch of input data x1, ..., xk of mini-batch size k, you can compute the forward pass of your neural network and compute the loss for that mini-batch. For each sample xj in this batch, you can then evaluate the gradient of your loss function with respect to any parameter in your network. The weight and bias gradients in layer i we call ΔjWi and Δjbi, respectively.
For each layer and each sample in the batch, you compute the respective gradients and use the following update rules for parameters:
You update your parameters by subtracting the cumulative error you receive for that batch. Here a > 0 denotes the learning rate, an entity specified ahead of training the network.
You’d get much more precise information about gradients if you were to sum over all training samples at once. Using mini-batches is a compromise in terms of gradient accuracy, but is much more computationally efficient. We call this method stochastic gradient descent because the mini-batch samples are randomly chosen. Although in gradient descent you have a theoretical guarantee of approaching a local minimum, in SGD that’s not the case. Figure 5.10 displays the typical behavior of SGD. Some of your approximate stochastic gradients may not point toward a descending direction, but given enough iterations, you’ll usually get close to a (local) minimum.
Computing (stochastic) gradients is defined by the fundamental principles of calculus. The way you use the gradients to update parameters isn’t. Techniques like the update rule for SGD are called optimizers.
Many other optimizers exist, as well as more sophisticated versions of stochastic gradient descent. We cover some of the extensions of SGD in chapter 7. Most of these extensions revolve around adapting the learning rate over time or have more granular updates for individual weights.
We discussed how to update parameters of a neural network by using stochastic gradient descent, but we didn’t explain how to arrive at the gradients. The algorithm used to compute these gradients is called the backpropagation algorithm and is covered in detail in appendix B. This section gives you the intuition behind backpropagation and the necessary building blocks to implement a feed-forward network yourself.
Recall that in a feed-forward network, you run the forward pass on data by computing one simple building block after another. From the output of the last layer, the network’s predictions, and the labels, you can compute the loss. The loss function itself is the composition of simpler functions. To compute the derivative of the loss function, you can use a fundamental property from calculus: the chain rule. This rule roughly says that the derivative of composed functions is the composition of the derivatives of these functions. Therefore, just as you passed input data forward layer by layer, you can pass derivatives back layer by layer. You propagate derivatives back through your network, hence the name backpropagation. In figure 5.11, you can see the backpropagation in action for a feed-forward network with two dense layers and sigmoid activations.
To guide you through figure 5.11, let’s take it step-by-step:
Now that you have all the necessary mathematics covered to build and run a feed-forward network, let’s apply what you’ve learned on a theoretical level by building a neural network implementation from scratch.
The preceding section covered a lot of theoretical ground, but on a conceptual level, you got away with working through only a few basic concepts. For our implementation, you need to worry about only three things: a Layer class, a SequentialNetwork class that’s built by adding several Layer objects one by one, and a Loss class that the network needs for backpropagation. These three classes are covered next, after which you’ll load and inspect handwritten digit data and apply your network implementation to it. Figure 5.12 shows how those Python classes fit together to implement the forward and backward passes described in the previous section.
To start with a general Layer class, note that layers, as we discussed them before, come with not only a prescription to deal with input data (the forward pass), but also a mechanism to propagate back error terms. In order not to recompute activation values on the backward pass, it’s practical to maintain the state of data coming into and out of the layer for both passes. Having said that, the following initialization of Layer should be straightforward. You’ll begin creating a layers module; later in this chapter you’ll use the components in this module to build up a neural network.
import numpy as np class Layer: 1 def __init__(self): self.params = [] self.previous = None 2 self.next = None 3 self.input_data = None 4 self.output_data = None self.input_delta = None 5 self.output_delta = None
A layer has a list of parameters and stores both its current input and output data, as well as the respective input and output deltas for the backward pass.
Also, because you’re concerned with sequential neural networks, it makes sense to give each layer a successor and a predecessor. Continuing with the definition, you add the following.
def connect(self, layer): 1 self.previous = layer layer.next = self
Next, you provide stubs for forward and backward passes in an abstract Layer class, which subclasses have to implement.
def forward(self): 1 raise NotImplementedError def get_forward_input(self): 2 if self.previous is not None: return self.previous.output_data else: return self.input_data def backward(self): 3 raise NotImplementedError def get_backward_input(self): 4 if self.next is not None: return self.next.output_delta else: return self.input_delta def clear_deltas(self): 5 pass def update_params(self, learning_rate): 6 pass def describe(self): 7 raise NotImplementedError
As helper functions, you provide get_forward_input and get_backward_input, which just retrieve input for the respective pass, but take special care of input and output neurons. On top of this, you implement a clear_deltas method to reset your deltas periodically, after accumulating deltas over mini-batches, as well as update_params, which takes care of updating parameters for this layer, after the network using this layer tells it to.
Note that as a last piece of functionality, you add a method for a layer to print a description of itself, which you add for convenience to get an easier overview of what your network looks like.
Next, you’ll provide your first layer, an ActivationLayer. You’ll work with the sigmoid function, which you’ve implemented already. To do backpropagation, you also need its derivative, which can be implemented easily.
def sigmoid_prime_double(x): return sigmoid_double(x) * (1 - sigmoid_double(x)) def sigmoid_prime(z): return np.vectorize(sigmoid_prime_double)(z)
Note that as for the sigmoid itself, you provide both scalar and vector versions for the derivative. Now, to define an ActivationLayer with the sigmoid function as activation built in, you note that the sigmoid function doesn’t have any parameters, so you don’t need to worry about updating any parameters just yet.
class ActivationLayer(Layer): 1 def __init__(self, input_dim): super(ActivationLayer, self).__init__() self.input_dim = input_dim self.output_dim = input_dim def forward(self): data = self.get_forward_input() self.output_data = sigmoid(data) 2 def backward(self): delta = self.get_backward_input() data = self.get_forward_input() self.output_delta = delta * sigmoid_prime(data) 3 def describe(self): print("|-- " + self.__class__.__name__) print(" |-- dimensions: ({},{})" .format(self.input_dim, self.output_dim))
Carefully inspect the gradient implementation to see how it fits the picture described in figure 5.11. For this layer, the backward pass is just element-wise multiplication of the layer’s current delta with the sigmoid derivative evaluated at the input of this layer: σ(x) · (1 – σ(x)) · Δ.
To continue with your implementation, you next turn to DenseLayer, which is the more complicated layer to implement, but also the last one you’ll tackle in this chapter. Initializing this layer takes a few more variables, because this time you also have to take care of the weight matrix, bias term, and their respective gradients.
class DenseLayer(Layer): def __init__(self, input_dim, output_dim): 1 super(DenseLayer, self).__init__() self.input_dim = input_dim self.output_dim = output_dim self.weight = np.random.randn(output_dim, input_dim) 2 self.bias = np.random.randn(output_dim, 1) self.params = [self.weight, self.bias] 3 self.delta_w = np.zeros(self.weight.shape) 4 self.delta_b = np.zeros(self.bias.shape)
Note that you initialize W and b randomly. There are many ways to initialize the weights of a neural network. Random initialization is an acceptable baseline, but there are many more sophisticated ways to initialize parameters so that they more accurately reflect the structure of your input data.
Initializing parameters is an interesting topic, and we’ll discuss a few other initialization techniques in chapter 6.
For now, just keep in mind that initialization will influence your learning behavior. If you think about the loss surface in figure 5.10, initialization of parameters means choosing a starting point for optimization; you can easily imagine that different starting points for SGD on the loss surface of figure 5.10 may lead to different results. That makes initialization an important topic in the study of neural networks.
Now, the forward pass for a dense layer is straightforward.
def forward(self): data = self.get_forward_input() self.output_data = np.dot(self.weight, data) + self.bias 1
As for the backward pass, recall that to compute the delta for this layer, you just need to transpose W and multiply it by the incoming delta: WtΔ. The gradients for W and b are also easily computed: ΔW = Δyt and Δb = Δ, where y denotes the input to this layer (evaluated with the data you’re currently using).
def backward(self): data = self.get_forward_input() delta = self.get_backward_input() 1 self.delta_b += delta 2 self.delta_w += np.dot(delta, data.transpose()) 3 self.output_delta = np.dot(self.weight.transpose(), delta) 4
The update rule for this layer is given by accumulating the deltas, according to the learning rate you specify for your network.
def update_params(self, rate): 1 self.weight -= rate * self.delta_w self.bias -= rate * self.delta_b def clear_deltas(self): 2 self.delta_w = np.zeros(self.weight.shape) self.delta_b = np.zeros(self.bias.shape) def describe(self): 3 print("|--- " + self.__class__.__name__) print(" |-- dimensions: ({},{})" .format(self.input_dim, self.output_dim))
Having taken care of layers as building blocks for a network, let’s turn to the networks themselves. You initialize a sequential neural network by equipping it with an empty list of layers and let it use MSE as the loss function, unless provided otherwise.
class SequentialNetwork: 1 def __init__(self, loss=None): print("Initialize Network...") self.layers = [] if loss is None: self.loss = MSE() 2
Next, you add functionality to add layers one by one.
def add(self, layer): 1 self.layers.append(layer) layer.describe() if len(self.layers) > 1: self.layers[-1].connect(self.layers[-2])
At the core of your network implementation is the train method. You use mini-batches as input: you shuffle training data and split it into batches of size mini_batch_size. To train your network, you feed it one mini-batch after another. To improve learning, you’ll feed the network your training data in batches multiple times. We say that we train it for multiple epochs. For each mini-batch, you call the train_batch method. If test_data is provided, you evaluate network performance after each epoch.
def train(self, training_data, epochs, mini_batch_size, learning_rate, test_data=None): n = len(training_data) for epoch in range(epochs): 1 random.shuffle(training_data) mini_batches = [ training_data[k:k + mini_batch_size] for k in range(0, n, mini_batch_size) 2 ] for mini_batch in mini_batches: self.train_batch(mini_batch, learning_rate) 3 if test_data: n_test = len(test_data) print("Epoch {0}: {1} / {2}" .format(epoch, self.evaluate(test_data), n_test)) 4 else: print("Epoch {0} complete".format(epoch))
Now, your train_batch computes forward and backward passes on this mini-batch and updates parameters afterward.
def train_batch(self, mini_batch, learning_rate): self.forward_backward(mini_batch) 1 self.update(mini_batch, learning_rate) 2
The two steps, update and forward_backward, are computed as follows.
def update(self, mini_batch, learning_rate): learning_rate = learning_rate / len(mini_batch) 1 for layer in self.layers: layer.update_params(learning_rate) 2 for layer in self.layers: layer.clear_deltas() 3 def forward_backward(self, mini_batch): for x, y in mini_batch: self.layers[0].input_data = x for layer in self.layers: layer.forward() 4 self.layers[-1].input_delta = self.loss.loss_derivative(self.layers[-1].output_data, y) 5 for layer in reversed(self.layers): layer.backward() 6
The implementation is straightforward, but there are a few noteworthy points to observe. First, you normalize the learning rate by your mini-batch size in order to keep updates small. Second, before computing the full backward pass by traversing layers in reversed order, you compute the loss derivative of the network output, which serves as the first input delta for the backward pass.
The remaining part of your SequentialNetwork implementation concerns model performance and evaluation. To evaluate your network on test data, you need to feed this data forward through your network, and this is precisely what single_forward does. The evaluation takes place in evaluate, and you return the number of correctly predicted results to assess accuracy.
def single_forward(self, x): 1 self.layers[0].input_data = x for layer in self.layers: layer.forward() return self.layers[-1].output_data def evaluate(self, test_data): 2 test_results = [( np.argmax(self.single_forward(x)), np.argmax(y) ) for (x, y) in test_data] return sum(int(x == y) for (x, y) in test_results)
Having implemented a feed-forward network, let’s return to our initial use case of predicting handwritten digits for the MNIST data set. After importing the necessary classes that you just built, you load MNIST data, initialize a network, add layers to it, and then train and evaluate the network with your data.
To build a network, keep in mind that your input dimension is 784 and your output dimension is 10, the number of digits. You choose three dense layers with output dimensions 392, 196, and 10, respectively, and add sigmoid activations after each of them. With each new dense layer, you are effectively dividing layer capacity in half. The layer sizes and the number of layers are hyperparameters for this network. You’ve chosen these values to set up a network architecture. We encourage you to experiment with other layer sizes to gain intuition about the learning process of a network in relation to its architecture.
from dlgo.nn import load_mnist from dlgo.nn import network from dlgo.nn.layers import DenseLayer, ActivationLayer training_data, test_data = load_mnist.load_data() 1 net = network.SequentialNetwork() 2 net.add(DenseLayer(784, 392)) 3 net.add(ActivationLayer(392)) net.add(DenseLayer(392, 196)) net.add(ActivationLayer(196)) net.add(DenseLayer(196, 10)) net.add(ActivationLayer(10)) 4
You train the network on data by calling train with all required parameters. You run training for 10 epochs and set the learning rate to 3.0. As mini-batch size, you choose 10, the number of classes. If you were to shuffle training data near perfectly, in most batches each class would be represented, leading to good stochastic gradients.
net.train(training_data, epochs=10, mini_batch_size=10, learning_rate=3.0, test_data=test_data) 1
Now, run this on the command line:
python run_network.py
That yields the following prompt:
Initialize Network... |--- DenseLayer |-- dimensions: (784,392) |-- ActivationLayer |-- dimensions: (392,192) |--- DenseLayer |-- dimensions: (192,10) |-- ActivationLayer |-- dimensions: (10,10) Epoch 0: 6628 / 10000 Epoch 1: 7552 / 10000 ...
The numbers you get for each epoch don’t matter here, apart from the fact that the result is highly dependent on the initialization of the weights. But it’s noteworthy to observe that you often end up with more than 95% accuracy in less than 10 epochs. This is quite an achievement already, especially given that you did this completely from scratch. In particular, this model performs vastly better than your naive model from the beginning of this chapter. Still, you can do much better.
Note that for the use case you studied, you completely disregard the spatial structure of your input images and treat them as vectors. But it should be clear that the neighborhood of a given pixel is important information that should be used. Ultimately, you want to come back to the game of Go, and you’ve seen throughout chapters 2 and 3 just how important the neighborhood of (a string of) stones is.
In the next chapter, you’ll see how to build a particular kind of neural network that’s more suitable for detecting patterns in spatial data, such as images or Go boards. This will bring you much closer to developing a Go bot in chapter 7.