Chapter 5. Getting started with neural networks

This chapter covers

  • Introducing the fundamentals of artificial neural networks
  • Teaching a network to recognize handwritten digits
  • Creating neural networks by composing layers
  • Understanding how neural networks learn from data
  • Implementing a simple neural network from scratch

This chapter introduces the core notions of artificial neural networks (ANNs), a class of algorithms central to modern-day deep learning. The history of artificial neural networks is a surprisingly old one, dating back to the early 1940s. It took many decades for its applications to become vast successes in many areas, but the basic ideas remain in effect.

At the core of ANNs is the idea to take inspiration from neuroscience and model a class of algorithms that works similarly to the way we hypothesize part of the brain functions. In particular, we use the notion of neurons as atomic blocks for our artificial networks. Neurons form groups called layers, and these layers are connected to each other in specific ways to span a network. Given input data, neurons can transfer information layer by layer via connections, and we say that they activate if the signal is strong enough. In this way, data is propagated through the network until we arrive at the last step, the output layer, from which we get our predictions. These predictions can then be compared to the expected output to compute the error of the prediction, which the network uses to learn and improve future predictions.

Although the brain-inspired architecture analogy is useful at times, we don’t want to overstress it here. We do know a lot about the visual cortex of our brain in particular, but the analogy can sometimes be misleading or even harmful. We think it’s better to think of ANNs as trying to uncover the guiding principles of learning in organisms, just as an airplane makes use of aerodynamics, but doesn’t try to copy a bird.

To make things more concrete in this chapter, we provide a basic implementation of a neural network to follow—starting from scratch. You’ll apply this network to tackle a problem from optical character recognition (OCR); namely, how to let a computer predict which digit is displayed on an image of a handwritten digit.

Each image in our OCR data set is made up of pixels laid out on a grid, and you must analyze spatial relationships between the pixels to figure out what digit it represents. Go, like many other board games, is played on a grid, and you must consider the spatial relationships on the board in order to pick a good move. You might hope that machine-learning techniques for OCR could also apply to games like Go. As it turns out, they do. Chapters 6 through 8 show how to apply these methods to games.

We keep the mathematics relatively low in this chapter. If you aren’t familiar with the basics of linear algebra, calculus, and probability theory or need a brief and practical reminder, we suggest that you read appendix A first. Also, the more difficult parts of the learning procedure of a neural network can be found in appendix B. If you know neural networks, but have never implemented one, we suggest that you skip to section 5.5 right away. If you’re familiar with implementing networks as well, jump right into chapter 6, in which you’ll apply neural networks to predict moves from games generated in chapter 4.

5.1. A simple use case: classifying handwritten digits

Before introducing neural networks in detail, let’s start with a concrete use case. Throughout this chapter, you’ll build an application that can predict digits from handwritten image data reasonably well, with about 95% accuracy. Notably, you’ll do all this by exposing only the pixel values of the images to a neural network; the algorithm will learn to extract relevant information about the structure of digits on its own.

You’ll use the Modified National Institute of Standards and Technology (MNIST) data set of handwritten digits to do so, a well-studied data set among machine-learning practitioners and the fruit fly of deep learning.

In this chapter, you’ll use the NumPy library for handling low-level mathematical operations. NumPy is the industry standard for machine learning and mathematical computing in Python, and you’ll use it throughout the rest of the book. Before trying out any of the code samples in this chapter, you should install NumPy with your preferred package manager. If you use pip, run pip install numpy from a shell to install it. If you use Conda, run conda install numpy.

5.1.1. The MNIST data set of handwritten digits

The MNIST data set consists of 60,000 images, 28 × 28 pixels each. A few examples of this data are shown in figure 5.1. To humans, recognizing most of these examples is a trivial task, and you can easily read the examples in the first row as 7, 5, 3, 9, 3, 0, and so on. But in some cases, it’s difficult even for humans to understand what the picture represents. For instance, the fourth picture in row five in figure 5.1 could easily be a 4 or a 9.

Figure 5.1. A few samples from the MNIST data set for handwritten digits, a well-studied entity in the field of optical character recognition

Each image in MNIST is annotated with a label, a digit from 0 to 9 representing the true value depicted on the image.

Before you can look at the data, you need to load it first. In our GitHub repository for this book, you’ll find a file called mnist.pkl.gz located in the folder http://mng.bz/P8mn.

In this folder, you’ll also find all the code you’ll write in this chapter. As before, we suggest that you follow the flow of this chapter and build the code base as you go, but you can also run the code as found in the GitHub repository.

5.1.2. MNIST data preprocessing

Because the labels in this data set are integers from 0 to 9, you’ll use a technique called one-hot encoding to transform the digit 1 to a vector of length 10 with all 0s, except you place a 1 at position 1. This representation is useful and widely used in the context of machine learning. Reserving the first slot in a vector for label 1 allows algorithms such as neural networks to distinguish more easily between labels. Using one-hot encoding, the digit 2, for instance, has the following representation: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].

Listing 5.1. One-hot encoding of MNIST labels
import six.moves.cPickle as pickle
import gzip
import numpy as np


def encode_label(j):         1
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

  • 1 You one-hot encode indices to vectors of length 10.

The benefit of one-hot encoding is that each digit has its own “slot,” and you can use neural networks to output probabilities for an input image, which will become useful later.

Examining the contents of the file mnist.pkl.gz, you have access to three pools of data: training, validation, and test data. Recall from chapter 1 that you use training data to train, or fit, a machine-learning algorithm, and use test data to evaluate how well the algorithm learned. Validation data can be used to tweak and validate the configuration of the algorithm, but can be safely ignored in this chapter.

Images in the MNIST data set are quadratic and have both height and width of 28 pixels. You load image data into feature vectors of size 784 = 28 × 28; you discard the image structure altogether and look only at pixels represented as a vector. Each value of this vector represents a grayscale value between 0 and 1, with 0 being white and 1 black.

Listing 5.2. Reshaping MNIST data and loading training and test data
def shape_data(data):
    features = [np.reshape(x, (784, 1)) for x in data[0]]        1

    labels = [encode_label(y) for y in data[1]]                  2

    return zip(features, labels)                                 3


def load_data():
    with gzip.open('mnist.pkl.gz', 'rb') as f:
        train_data, validation_data, test_data = pickle.load(f)  4

    return shape_data(train_data), shape_data(test_data)         5

  • 1 Flatten the input images to feature vectors of length 784.
  • 2 All labels are one-hot encoded.
  • 3 Create pairs of features and labels.
  • 4 Unzipping and loading the MNIST data yields three data sets.
  • 5 Discard validation data here and reshape the other two data sets.

You now have a simple representation of the MNIST data set; both features and labels are encoded as vectors. Your task is to devise a mechanism that learns to map features to labels accurately. Specifically, you want to design an algorithm that takes in training features and labels to learn, such that it can predict labels for test features.

Neural networks can do this job well, as you’ll see in the next section, but let’s first discuss a naive approach that will show you the general problems you have to tackle for this application. Recognizing digits is a relatively simple task for humans, but it’s difficult to explain exactly how to do it and how we know what we know. This phenomenon of knowing more than you can explain is called Polanyi’s paradox. This makes it particularly hard to describe to a machine explicitly how to solve this problem.

One aspect that plays a crucial role is pattern recognition—each handwritten digit has certain traits that derive from its prototypical, digital version. For instance, a 0 is roughly an oval, and in many countries a 1 is simply a vertical line. Given this heuristic, you can naively approach classifying handwritten digits by comparing them to each other: given an image of an 8, this image should be closer to the average image of an 8 than to any other digit. The following average_digit function does this for you.

Listing 5.3. Computing the average value for images representing the same digit
import numpy as np
from dlgo.nn.load_mnist import load_data
from dlgo.nn.layers import sigmoid_double


def average_digit(data, digit):                                      1
    filtered_data = [x[0] for x in data if np.argmax(x[1]) == digit]
    filtered_array = np.asarray(filtered_data)
    return np.average(filtered_array, axis=0)


train, test = load_data()
avg_eight = average_digit(train, 8)                                  2

  • 1 Compute the average over all samples in your data representing a given digit.
  • 2 Use the average 8 as parameters for a simple model to detect 8s.

What does the average 8 in your training set look like? Figure 5.2 gives the answer.

Figure 5.2. This is what an average handwritten 8 from the MNIST training set looks like. Averaging many hundreds of images in general will result in an unrecognizable blob, but this average 8 still looks very much like an 8.

Because handwriting can differ quite a lot from individual to individual, as expected, the average 8 is a little fuzzy, but it’s still noticeably shaped like an 8. Maybe you can use this representation to identify other 8s in your data set? You use the following code to compute and display figure 5.2.

Listing 5.4. Computing and displaying the average 8 in your training set
from matplotlib import pyplot as plt

img = (np.reshape(avg_eight, (28, 28)))
plt.imshow(img)
plt.show()

This average representation of an 8, avg_eight, in the training set of MNIST, should carry a lot of information about what it means for an image to have an 8 on it. You’ll use avg_eight as parameters of a simple model to decide whether a given input vector x, representing a digit, is an 8. In the context of neural networks, we often speak of weights when referring to parameters, and avg_eight will serve as your weight.

For convenience, you’ll use transposition and define W = np.transpose(avg_eight). You can then compute the dot product of W and x, which does pointwise multiplication of values of W and x and sums up all 784 resulting values. If your heuristic is right, if x is an 8, individual pixels should have a darker tone at roughly the same places W does, and vice versa. Conversely, if x isn’t an 8, there should be less overlap. Let’s test this hypothesis in a few examples.

Listing 5.5. Computing how close a digit is to your weights by using the dot product
x_3 = train[2][0]            1
x_18 = train[17][0]          2

W = np.transpose(avg_eight)
np.dot(W, x_3)               3
np.dot(W, x_18)              4

  • 1 Training sample at index 2 is a 4
  • 2 Training sample at index 17 is an 8
  • 3 This evaluates to about 20.1.
  • 4 This term is much bigger, about 54.2.

You compute the dot product of your weights W with two MNIST samples, one representing a 4, and another one representing an 8. You can see that the latter result of 54.2 for the 8 is much higher than the 20.1 result for the 4. So, it seems you’re on to something. Now, how do you decide when a resulting value is high enough to predict it as an 8? In principle, the dot product of two vectors can spit out any real number. What you do to address this is transform the output of the dot product to the range [0, 1]. Doing so, you can, for instance, try to define a cutoff value at 0.5 and declare everything above this value an 8.

One way to do this is with the sigmoid function. The sigmoid function is often denoted by s, the Greek letter sigma. For a real number x, the sigmoid function is defined as:

Figure 5.3 shows how it looks to gain some intuition.

Figure 5.3. Plot of the sigmoid function. The sigmoid maps real values to a range of [0, 1]. Around 0, the slope is rather steep, and the curve flattens out both for small and large values.

Next, let’s code the sigmoid function in Python before you apply it to the output of the dot product.

Listing 5.6. Simple implementation of sigmoid function for double values and vectors
def sigmoid_double(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid(z):
    return np.vectorize(sigmoid_double)(z)

Note that you provide both sigmoid_double, which operates on double values, as well as a version that computes the sigmoid for vectors that you’ll use extensively in this chapter. Before you apply sigmoid to your previous computation, note that the sigmoid of 2 is already close to 1, so for your two previously computed samples, sigmoid(54.2) and sigmoid(20.1) will be practically indistinguishable. You can fix this issue by shifting the output of the dot product toward 0. Doing this is called applying a bias term, which we often refer to as b. From the samples, you compute that a good guess for a bias term might be b = –45. Using weights and the bias term, you can now compute predictions of your model as follows.

Listing 5.7. Computing predictions from weights and bias with dot product and sigmoid
def predict(x, W, b):                           1
    return sigmoid_double(np.dot(W, x) + b)


b = -45                                         2

print(predict(x_3, W, b))                       3
print(predict(x_18, W, b))                      4

  • 1 A simple prediction is defined by applying sigmoid to the output of np.doc(W, x) + b.
  • 2 Based on the examples computed so far, you set the bias term to –45.
  • 3 The prediction for the example with a 4 is close to 0.
  • 4 The prediction for an 8 is 0.96 here. You seem to be onto something with your heuristic.

You get satisfying results on the two examples x_3 and x_18. The prediction for the latter is close to 1, and for the former is almost 0. This procedure of mapping an input vector x to s(Wx + b) for W, a vector the same size as x, is called logistic regression. Figure 5.4 depicts this algorithm schematically for a vector of length 4.

Figure 5.4. An example of logistic regression, mapping an input vector x of length 4 to an output value y between 0 and 1. The schematic indicates how the output y depends on all four values in the input vector x.

To get a better idea of how well this procedure works, let’s compute predictions for all training and test samples. As indicated before, you define a cutoff, or decision threshold, to decide whether a prediction is counted as an 8. As an evaluation metric here, you choose accuracy; you compute the ratio of correct predictions among all predictions.

Listing 5.8. Evaluating predictions of your model with a decision threshold
def evaluate(data, digit, threshold, W, b):                               1
    total_samples = 1.0 * len(data)
    correct_predictions = 0
    for x in data:
        if predict(x[0], W, b) > threshold and np.argmax(x[1]) == digit:  2
            correct_predictions += 1
        if predict(x[0], W, b) <= threshold and np.argmax(x[1]) != digit: 3
            correct_predictions += 1
    return correct_predictions / total_samples

  • 1 For an evaluation metric, you choose accuracy, the ratio of correct predictions among all.
  • 2 Predicting an instance of an 8 as 8 is a correct prediction.
  • 3 If the prediction is below your threshold and the sample isn’t an 8, you also predicted correctly.

Let’s use this evaluation function to assess the quality of predictions for three data sets: the training set, the test set, and the set of all 8s among the test set. You do this with a threshold of 0.5 and weights W and bias term b as before.

Listing 5.9. Calculating prediction accuracy for three data sets
evaluate(data=train, digit=8, threshold=0.5, W=W, b=b)        1

evaluate(data=test, digit=8, threshold=0.5, W=W, b=b)         2

eight_test = [x for x in test if np.argmax(x[1]) == 8]
evaluate(data=eight_test, digit=8, threshold=0.5, W=W, b=b)   3

  • 1 Accuracy on training data of your simple model is 78% (0.7814).
  • 2 Accuracy on test data is slightly lower, at 77% (0.7749).
  • 3 Evaluating only on the set of 8s in the test set results in only 67% accuracy (0.6663).

What you see is that accuracy on the training set is highest, at about 78%. This shouldn’t come as a surprise, because you calibrated your model on the training set. In particular, it doesn’t make sense to evaluate on the training set, because it doesn’t tell you how well your algorithm generalizes (how well it performs on previously unseen data). Performance on test data is close to that for training, at about 77%, but it’s the last accuracy term that’s noteworthy. On the set of all 8s in the test set, you achieve merely 66%, so with your simple model, you’re right in only two out of three unseen cases of 8s. This result might be acceptable as a first baseline, but is far from the best you can do. What went wrong, and what can you do better?

  • Your model is capable of distinguishing between only a specific digit (here, an 8) and all others. Because in both the training and test sets the number of images per digit is balanced, only about 10% are 8s. Thus, a model predicting 0 all the time would yield about 90% accuracy. Keep in mind such class imbalances when analyzing classification problems like this one. In light of this, your 77% accuracy on test data doesn’t look quite as strong anymore. You need to define a model that can predict all 10 digits accurately.
  • The parameters of your models are fairly small. For a collection of many thousand diverse handwritten images, all you have is a set of weights the size of one of such image. It’s unrealistic to believe you can capture the variability in handwriting found on these images with such a small model. You have to find a class of algorithms that uses many more parameters effectively to capture the variability in data.
  • For a given prediction, you simply chose a cutoff value to declare a digit an 8 or not. You didn’t use the actual prediction value to assess the quality of your model. For instance, a correct prediction at 0.95 certainly indicates a stronger result than one at 0.51. You have to formalize the notion of how close the prediction was to the actual outcome.
  • You handcrafted the parameters of your model, guided by intuition. Although this might be a good first shot, the promise of machine learning is that you don’t have to impose your view on the data, but rather let the algorithm learn from data. Whenever your model makes correct predictions, you need to reinforce this behavior, and whenever the output is wrong, you need to adjust your model accordingly. In other words, you need to devise a mechanism that updates model parameters according to how well you predicted on training data.

Although the discussion of this little use case and the naive model you built might not seem like much, you’ve already seen many parts constituting neural networks. In the next section, you’ll use the intuition built around this use case to take your first steps with neural networks by tackling each of these four points.

5.2. The basics of neural networks

How can you improve your OCR model? As we hinted in the introduction, neural networks can do a bang-up job on this sort of task—much better than our handcrafted model. But the handcrafted model does illustrate key concepts that you’ll use to build your neural network. This section describes the model from the previous section in the language of neural networks.

5.2.1. Logistic regression as simple artificial neural network

In section 5.1, you saw logistic regression used for binary classification. To recap, you took a feature vector x, representing a data sample, fed it into the algorithm by first multiplying it by a weight matrix W and then adding a bias term b. To end up with a prediction y between 0 and 1, you applied the sigmoid function to it: y = s(Wx + b)

You should notice a few things here. First of all, the feature vector x can be interpreted as a collection of neurons, sometimes called units, connected to y by means of W and b, which you’ve already seen in figure 5.4. Next, note that the sigmoid can be seen as an activation function, in that it takes the outcome of Wx + b and maps it to the range [0,1]. If you interpret a value close to 1 as the neuron y activating, and in turn not activating if it’s close to 0, this setup can be seen as a small example of an artificial neural network already.

5.2.2. Networks with more than one output dimension

In the use case in section 5.1, you simplified the problem of recognizing handwritten digits to a binary classification problem; namely, distinguishing an 8 from all other digits. But you’re interested in predicting 10 classes, one for each digit. At least formally, you can achieve this fairly easily by changing what you denote by y, W, and b; you alter the output, weights, and bias of your model.

First, you make y a vector of length 10; y will have one value representing the likelihood of each of the 10 digits:

Next, let’s adapt weights and bias accordingly. Recall that so far W is a vector of length 784. Instead, you make W a matrix with dimensions (10, 784). This way, you can do matrix multiplication of W and an input vector x, namely Wx, the result of which will be a vector of length 10. Continuing, if you make the bias term a vector of length 10, you can add it to Wx. Finally, note that you can compute the sigmoid of a vector z by applying it to each of its components:

Figure 5.5 depicts this slightly altered setup for four input and two output neurons.

Figure 5.5. In this simple network, four input neurons are connected to two output neurons by means of first multiplying with a two-by-four matrix, adding a two-dimensional bias term, and then applying the sigmoid component-wise.

Now, what did you gain? You can now map an input vector x to an output vector y, whereas before, y was just a single value. The benefit of this is that nothing stops you from doing this vector-to-vector transformation multiple times, thereby building what we call a feed-forward network.

5.3. Feed-forward networks

Let’s quickly recap what you did in the preceding section. On a high level, you carried out the following steps:

  1. You started from a vector of input neurons x and applied a simple transformation to it, namely z = Wx + b. In the language of linear algebra, these transformations are called affine linear. Here you use z as an intermediary variable to ease notation down the line.
  2. You applied an activation function, the sigmoid y = s(z), to get output neurons y. The outcome of applying s tells you how much y activates.

At the heart of feed-forward networks is the idea that you can apply this process iteratively, thereby applying many times over the simple building blocks specified by these two steps. These building blocks form what we call a layer. With this notation, you can say that you stack many layers to form a multilayer neural network. Let’s modify our last example by introducing one more layer. You now have to run the following steps:

  1. Starting with the input x, compute z1 = W1x + b1.
  2. From the intermediate result z1, you get the output y by computing y = W2z1 + b2.

Note that you use superscripts here to denote which layer you’re in, and subscripts to denote position within a vector or matrix. The prescription of working with two layers instead of just one is visualized in figure 5.6.

Figure 5.6. An artificial neural network with two layers. The input neurons x connect to an intermediate set of units z, which themselves connect to the output neurons y.

At this point, it should become clear that you’re not bound to any particular number of layers to stack. You could use many more. Moreover, you don’t necessarily have to use logistic sigmoid as the activation all the time. You have a plethora of activation functions to choose from, and we introduce some of them in the next chapter. Applying these functions of all layers in a network sequentially to one or more data points is usually referred to as a forward pass. It’s referred to as forward, because data always flows only forward, from input to output (in the figures, left to right), and never back.

With this notation, depicting a regular feed-forward network with three layers looks like figure 5.7.

Figure 5.7. A neural network with three layers. When defining a neural network, you’re limited in neither the number of layers, nor the number of neurons per layer.

To recap what you’ve learned so far, let’s put together all the notions we mentioned in one concise list:

  • A sequential neural network is a mechanism to map features, or input neurons, x, to predictions, or output neurons, y. You do this by stacking layers of simple functions one by one in a sequential manner.
  • A layer is a prescription to map a given input to an output. Computing the output of a layer for a batch of data is called a forward pass. Likewise, to compute the forward pass for a sequential network is to compute the forward pass of each layer sequentially, starting with the input.
  • The sigmoid function is an activation function that takes a vector of real-valued neurons and activates them so that they’re mapped to the range [0,1]. You interpret values close to 1 as activating.
  • Given a weight matrix W and a bias term b, applying the affine-linear transformation Wx + b forms a layer. This kind of layer is usually called a dense layer or fully connected layer. Going forward, we’ll stick to calling them dense layers.
  • Depending on the implementation, dense layers may or may not come with activations built in; you might see s(Wx + b) as the layer, not just the affine-linear transformation. On the other hand, it’s common to consider just the activation function a layer, and you’ll do so in your implementation. In the end, whether to add activations into your dense layers or not is just a slightly different view of how to split up and group parts of a function into a logical unit.
  • A feed-forward neural network is a sequential network consisting of dense layers with activations. For historical reasons we don’t have space to dive into, this architecture is also often called multilayer perceptron, or MLP for short.
  • All neurons that are neither input nor output neurons are called hidden units. In contrast, input and output neurons are sometimes called visible units. The intuition behind this is that hidden units are internal to the network, whereas the visible ones are directly observable. This is somewhat of a stretch, because you normally have access to any part of the system, but it’s nevertheless good to know this nomenclature. Consequently, the layers between input and output are called hidden layers: each sequential network with at least two layers has at least a hidden one.
  • If not stated otherwise, x will stand for input to the network, y for output of it; sometimes with subscripts to indicate which sample you’re considering. Stacking many layers to build a large network with many hidden layers is called a deep neural network, hence the name deep learning.
Nonsequential neural networks

At this point in the book, you’re concerned with only sequential neural networks, in which the layers form a sequence. In a sequential network, you start with the input, and each following (hidden) layer has precisely one predecessor and one successor, ending in the output layer. This is enough to cover everything you need in order to apply deep learning to the game of Go.

In general, the theory of neural networks allows for arbitrary nonsequential architectures as well. For instance, in some applications it makes sense to concatenate, or add, the output of two layers (you merge two or more previous layers). In such a scenario, you merge multiple inputs and emit one output.

In other applications, it can be useful to split an input into several outputs. In general, a layer can have multiple inputs and outputs. We introduce multi-input and multi-output networks in chapters 11 and 12, respectively.

The setup of a multilayer perceptron with l layers is fully described by the set of weights W = W1, ..., Wl, the set of biases b = b1, ..., bl, and the set of activation functions chosen for each layer. But a vital ingredient for learning from data and updating parameters is still missing: loss functions and how to optimize them.

5.4. How good are our predictions? Loss functions and optimization

Section 5.3 defined how to set up a feed-forward neural network and pass input data through it, but you still don’t know how to assess the quality of your predictions. To do so, you need a measure to define how close prediction and actual outcome are.

5.4.1. What is a loss function?

To quantify how much you missed your target with your prediction, we introduce the concept of loss functions, often called objective functions. Let’s say you have a feed-forward network with weights W, biases b, and sigmoid activation functions. For a given set of input features X1, ..., Xk and respective labels ŷ1, ..., ŷk (the symbol ŷ for a label is pronounced y-hat), using your network you can compute predictions y1, ..., yk. In such a scenario, a loss function is defined as follows:

Here, Loss(yi,ŷi) ≥ 0, and Loss is a differentiable function. A loss function is a smooth function that assigns a non-negative value to each (prediction, label) pair. The loss of a bunch of features and labels is the sum of losses of the samples. A loss function assesses the fit of your algorithm’s parameters, given the data you show it. Your training target is to minimize loss by finding good strategies to adapt your parameters.

5.4.2. Mean squared error

One example of a widely used loss function is mean squared error (MSE). Although MSE isn’t ideal for our use case, it’s one of the most intuitive loss functions to work with. You measure how close your prediction was to the actual label, by measuring the squared distance and averaging over all observed examples. Writing ŷ = ŷ1, ..., ŷk for labels and y = y1, ..., yk for predictions, the mean squared error is defined as follows:

We’ll present the benefits and drawbacks of various loss functions after you’ve seen an application of the theory presented here. For now, let’s implement the mean squared error in Python.

Listing 5.10. Mean squared error loss function and its derivative
import random
import numpy as np


class MSE:                                        1

    def __init__(self):
        pass

    @staticmethod
    def loss_function(predictions, labels):
        diff = predictions - labels
        return 0.5 * sum(diff * diff)[0]          2

    @staticmethod
    def loss_derivative(predictions, labels):
        return predictions - labels               3

  • 1 Use the mean squared error as your loss function.
  • 2 By defining MSE as 0.5 times the square difference between predictions and labels...
  • 3 ...the loss derivative is simply predictions—labels.

Note that you implemented not only the loss function itself, but also its derivative with respect to your predictions: loss_derivative. This derivative is a vector and is obtained by subtracting labels from predictions.

Next, you’ll see how derivatives like this one for MSE play a crucial role in training neural networks.

5.4.3. Finding minima in loss functions

The loss function for a set of predictions and labels gives you information about how well tuned the parameters of your model are. The smaller the loss, the better your predictions, and vice versa. The loss function itself is a function of the parameters of your network. In your MSE implementation, you don’t directly supply weights, but they’re implicitly given through predictions, because you use weights to compute them.

In theory, you know from calculus that to minimize the loss, you need to compute its derivative and set it to 0. We call the set of parameters at this point a solution. Computing the derivative of a function and evaluating it at a specific point is called computing the gradient. You’ve done the first step in computing the derivative in your MSE implementation, but there’s more to it. What you aim for is to explicitly compute gradients for all weight and bias terms in your network.

If you need a refresher on the basics of calculus, make sure to check out appendix A. Figure 5.8 shows a surface in three-dimensional space. This surface can be interpreted as a loss function for two-dimensional input. The first two axes represent your weights, and the third, upward-pointing axis indicates the loss value.

Figure 5.8. An example of a loss function for two-dimensional input (a loss surface). This surface has a minimum around the dark area in the lower right that can be computed by solving the derivative of the loss function.

5.4.4. Gradient descent to find minima

Intuitively speaking, when you compute the gradient of a function for a given point, that gradient points in the direction of steepest ascent. Starting with a loss function, Loss, and a set of parameters W, the gradient descent algorithm to find a minimum for this function goes like this:

  1. Compute the gradient Δ of Loss for the current set of parameters W (compute the derivative of Loss with respect to each weight W).
  2. Update W by subtracting Δ from it. We call this step following the gradient. Because Δ points in the direction of steepest ascent, subtracting it leads you in the direction of steepest descent.
  3. Repeat until Δ is 0.

Because your loss function is non-negative, you know in particular that it has a minimum. It could have many, even infinitely many, minima. For instance, if you think about a flat surface, every point on it is a minimum.

Local and global minima

A point with zero gradient reached by gradient descent is by definition a minimum. The precise mathematical definition of a minimum for differentiable functions of many variables is relatively involved and uses information about the curvature of the function.

With gradient descent, you’ll eventually find a minimum; you can follow the gradient of the function until you find a point of zero gradient. There’s only one caveat: you don’t know whether this minimum is a local or a global minimum. You might be stuck in a plateau that locally is the smallest point the function can take, but other points might have a smaller absolute value. The marked point in figure 5.8 is a local minimum, but clearly smaller values exist on this surface.

What we do to resolve this problem might strike you as strange: we ignore it. In practice, gradient descent often leads to satisfying results, so in the context of loss functions for neural networks, we tend to ignore the question of whether a minimum is local or global. We usually won’t even run the algorithm until convergence, but rather stop after a predefined number of steps.

Figure 5.9 shows how gradient descent works for the loss surface from figure 5.8 and a choice of parameters indicated by the marked point in the upper right.

Figure 5.9. Iteratively following the gradients of a loss function will eventually lead you to a minimum.

In your MSE implementation, you’ve seen that the derivative of the mean squared error loss is easy to calculate formally: it’s the difference between labels and predictions. But to evaluate such a derivative, you have to compute predictions first. To get a view of the gradients for all parameters, you have to evaluate and aggregate derivatives for every sample in your training set. Given that you’re usually dealing with many thousands, if not millions, of data samples for your network, this is practically infeasible. Instead, you approximate gradient computation with a technique called stochastic gradient descent.

5.4.5. Stochastic gradient descent for loss functions

To compute gradients and apply gradient descent for a neural network, you’d have to evaluate the loss function and its derivative with respect to network parameters at every single point in the training set, which is too expensive in most cases. Instead, you use a technique called stochastic gradient descent (SGD). To run SGD, you first select a few samples from your training set, which you call a mini-batch. Each mini-batch is selected with a fixed length that we call the mini-batch size. For a classification problem like the handwritten digit problem you’re tackling, it’s good practice to choose a batch size in the same order of magnitude as the number of labels so as to make sure each label is represented in a mini-batch.

For a given feed-forward neural network with l layers and a mini-batch of input data x1, ..., xk of mini-batch size k, you can compute the forward pass of your neural network and compute the loss for that mini-batch. For each sample xj in this batch, you can then evaluate the gradient of your loss function with respect to any parameter in your network. The weight and bias gradients in layer i we call ΔjWi and Δjbi, respectively.

For each layer and each sample in the batch, you compute the respective gradients and use the following update rules for parameters:

You update your parameters by subtracting the cumulative error you receive for that batch. Here a > 0 denotes the learning rate, an entity specified ahead of training the network.

You’d get much more precise information about gradients if you were to sum over all training samples at once. Using mini-batches is a compromise in terms of gradient accuracy, but is much more computationally efficient. We call this method stochastic gradient descent because the mini-batch samples are randomly chosen. Although in gradient descent you have a theoretical guarantee of approaching a local minimum, in SGD that’s not the case. Figure 5.10 displays the typical behavior of SGD. Some of your approximate stochastic gradients may not point toward a descending direction, but given enough iterations, you’ll usually get close to a (local) minimum.

Figure 5.10. Stochastic gradients are less precise, so when following them on a loss surface, you might take a few detours before closing in on a local minimum.

Optimizers

Computing (stochastic) gradients is defined by the fundamental principles of calculus. The way you use the gradients to update parameters isn’t. Techniques like the update rule for SGD are called optimizers.

Many other optimizers exist, as well as more sophisticated versions of stochastic gradient descent. We cover some of the extensions of SGD in chapter 7. Most of these extensions revolve around adapting the learning rate over time or have more granular updates for individual weights.

5.4.6. Propagating gradients back through your network

We discussed how to update parameters of a neural network by using stochastic gradient descent, but we didn’t explain how to arrive at the gradients. The algorithm used to compute these gradients is called the backpropagation algorithm and is covered in detail in appendix B. This section gives you the intuition behind backpropagation and the necessary building blocks to implement a feed-forward network yourself.

Recall that in a feed-forward network, you run the forward pass on data by computing one simple building block after another. From the output of the last layer, the network’s predictions, and the labels, you can compute the loss. The loss function itself is the composition of simpler functions. To compute the derivative of the loss function, you can use a fundamental property from calculus: the chain rule. This rule roughly says that the derivative of composed functions is the composition of the derivatives of these functions. Therefore, just as you passed input data forward layer by layer, you can pass derivatives back layer by layer. You propagate derivatives back through your network, hence the name backpropagation. In figure 5.11, you can see the backpropagation in action for a feed-forward network with two dense layers and sigmoid activations.

Figure 5.11. Forward and backward passes in a two-layer feed-forward neural network with sigmoid activations and MSE loss function

To guide you through figure 5.11, let’s take it step-by-step:

  1. Forward pass on training data. In this step, you take an input data sample x and pass it through the network to arrive at a prediction, as follows:

    1. You compute the affine-linear part: Wx + b.
    2. You apply the sigmoid function σ(x) to the result. Note that we’re abusing notation slightly in that x in a computation step means the output of the previous result.
    3. You repeat these two steps until you arrive at the output layer. We chose two layers in this example, but the number of layers doesn’t matter.
  2. Loss function evaluation. In this step, you take your labels ŷ for the sample x and compare them to your predictions y by computing a loss value. In this example, you choose mean squared error as your loss function.
  3. Propagating error terms back. In this step, you take your loss value and pass it back through the network. You do so by computing derivatives layer-wise, which is possible because of the chain rule. While the forward pass feeds input data through the network in one direction, the backward pass feeds error terms back in the opposite direction.

    1. You propagate error terms, or deltas, denoted by Δ, in the inverse order of the forward pass.
    2. To begin with, you compute the derivative of the loss function, which will be your initial Δ. Again, as in the forward pass, we abuse notation and call the propagated error term Δ at every step in the process.
    3. You compute the derivative of the sigmoid with respect to its input, which is simply σ·(1 – σ). To pass Δ to the next layer, you can do component-wise multiplication: σ(1 – σ) · Δ.
    4. The derivative of your affine-linear transformation Wx + b with respect to x is simply W. To pass on Δ, you compute Wt · Δ.
    5. These two steps are repeated until you reach the first layer of the network.
  4. Update weights with gradient information. In the final step, you use the deltas you computed along the way to update your network parameters (weights and bias terms).

    1. The sigmoid function doesn’t have any parameters, so there’s nothing for you to do.
    2. The update Δb that the bias term in each layer receives is simply Δ.
    3. The update ΔW for the weights in a layer is given by Δ · xT (you need to transpose x before multiplying it with delta).
    4. Note that we started out by saying x is a single sample. Everything we discussed carries over to mini-batches, however. If x denotes a mini-batch of samples (x is a matrix in which every column is an input vector), the computations of forward and backward passes look exactly the same.

Now that you have all the necessary mathematics covered to build and run a feed-forward network, let’s apply what you’ve learned on a theoretical level by building a neural network implementation from scratch.

5.5. Training a neural network step-by-step in Python

The preceding section covered a lot of theoretical ground, but on a conceptual level, you got away with working through only a few basic concepts. For our implementation, you need to worry about only three things: a Layer class, a SequentialNetwork class that’s built by adding several Layer objects one by one, and a Loss class that the network needs for backpropagation. These three classes are covered next, after which you’ll load and inspect handwritten digit data and apply your network implementation to it. Figure 5.12 shows how those Python classes fit together to implement the forward and backward passes described in the previous section.

Figure 5.12. Class diagram for your Python implementation of a feed-forward network. A SequentialNetwork contains several Layer instances. Each Layer implements a mathematical function and its derivative. The forward and backward methods implement the forward and backward pass, respectively. A Loss instance calculates your loss function, the error between your prediction and your training data.

5.5.1. Neural network layers in Python

To start with a general Layer class, note that layers, as we discussed them before, come with not only a prescription to deal with input data (the forward pass), but also a mechanism to propagate back error terms. In order not to recompute activation values on the backward pass, it’s practical to maintain the state of data coming into and out of the layer for both passes. Having said that, the following initialization of Layer should be straightforward. You’ll begin creating a layers module; later in this chapter you’ll use the components in this module to build up a neural network.

Listing 5.11. Base layer implementation
import numpy as np

class Layer:                        1
    def __init__(self):
        self.params = []

        self.previous = None        2
        self.next = None            3

        self.input_data = None      4
        self.output_data = None

        self.input_delta = None     5
        self.output_delta = None

  • 1 Layers are stacked to build a sequential neural network.
  • 2 A layer knows its predecessor (previous)...
  • 3 ...and its successor (next).
  • 4 Each layer can persist data flowing into and out of it in the forward pass.
  • 5 Analogously, a layer holds input and output data for the backward pass.

A layer has a list of parameters and stores both its current input and output data, as well as the respective input and output deltas for the backward pass.

Also, because you’re concerned with sequential neural networks, it makes sense to give each layer a successor and a predecessor. Continuing with the definition, you add the following.

Listing 5.12. Connecting layers through successors and predecessors
    def connect(self, layer):        1
        self.previous = layer
        layer.next = self

  • 1 This method connects a layer to its direct neighbors in the sequential network.

Next, you provide stubs for forward and backward passes in an abstract Layer class, which subclasses have to implement.

Listing 5.13. Forward and backward passes in a layer of a sequential neural network
    def forward(self):                           1
        raise NotImplementedError

    def get_forward_input(self):                 2
        if self.previous is not None:
            return self.previous.output_data
        else:
            return self.input_data

    def backward(self):                          3
        raise NotImplementedError

    def get_backward_input(self):                4
        if self.next is not None:
            return self.next.output_delta
        else:
            return self.input_delta

    def clear_deltas(self):                      5
        pass

    def update_params(self, learning_rate):      6
        pass

    def describe(self):                          7
        raise NotImplementedError

  • 1 Each layer implementation has to provide a function to feed input data forward.
  • 2 input_data is reserved for the first layer; all others get their input from the previous output.
  • 3 Layers have to implement backpropagation of error terms—a way to feed input errors backward through the network.
  • 4 Input delta is reserved for the last layer; all other layers get their error terms from their successor.
  • 5 You compute and accumulate deltas per mini-batch, after which you need to reset these deltas.
  • 6 Update layer parameters according to current deltas, using the specified learning_rate.
  • 7 Layer implementations can print their properties.

As helper functions, you provide get_forward_input and get_backward_input, which just retrieve input for the respective pass, but take special care of input and output neurons. On top of this, you implement a clear_deltas method to reset your deltas periodically, after accumulating deltas over mini-batches, as well as update_params, which takes care of updating parameters for this layer, after the network using this layer tells it to.

Note that as a last piece of functionality, you add a method for a layer to print a description of itself, which you add for convenience to get an easier overview of what your network looks like.

5.5.2. Activation layers in neural networks

Next, you’ll provide your first layer, an ActivationLayer. You’ll work with the sigmoid function, which you’ve implemented already. To do backpropagation, you also need its derivative, which can be implemented easily.

Listing 5.14. Implementation of the derivative of the sigmoid function
def sigmoid_prime_double(x):
    return sigmoid_double(x) * (1 - sigmoid_double(x))


def sigmoid_prime(z):
    return np.vectorize(sigmoid_prime_double)(z)

Note that as for the sigmoid itself, you provide both scalar and vector versions for the derivative. Now, to define an ActivationLayer with the sigmoid function as activation built in, you note that the sigmoid function doesn’t have any parameters, so you don’t need to worry about updating any parameters just yet.

Listing 5.15. Sigmoid activation layer
class ActivationLayer(Layer):                             1
    def __init__(self, input_dim):
        super(ActivationLayer, self).__init__()

        self.input_dim = input_dim
        self.output_dim = input_dim

    def forward(self):
        data = self.get_forward_input()
        self.output_data = sigmoid(data)                  2

    def backward(self):
        delta = self.get_backward_input()
        data = self.get_forward_input()
        self.output_delta = delta * sigmoid_prime(data)   3

    def describe(self):
        print("|-- " + self.__class__.__name__)
        print("  |-- dimensions: ({},{})"
              .format(self.input_dim, self.output_dim))

  • 1 This activation layer uses the sigmoid function to activate neurons.
  • 2 The forward pass is simply applying the sigmoid to the input data.
  • 3 The backward pass is element-wise multipli-cation of the error term with the sigmoid derivative evaluated at the input to this layer.

Carefully inspect the gradient implementation to see how it fits the picture described in figure 5.11. For this layer, the backward pass is just element-wise multiplication of the layer’s current delta with the sigmoid derivative evaluated at the input of this layer: σ(x) · (1 – σ(x)) · Δ.

5.5.3. Dense layers in Python as building blocks for feed-forward networks

To continue with your implementation, you next turn to DenseLayer, which is the more complicated layer to implement, but also the last one you’ll tackle in this chapter. Initializing this layer takes a few more variables, because this time you also have to take care of the weight matrix, bias term, and their respective gradients.

Listing 5.16. Dense layer weight initialization
class DenseLayer(Layer):

    def __init__(self, input_dim, output_dim):                 1

        super(DenseLayer, self).__init__()

        self.input_dim = input_dim
        self.output_dim = output_dim

        self.weight = np.random.randn(output_dim, input_dim)   2
        self.bias = np.random.randn(output_dim, 1)

        self.params = [self.weight, self.bias]                 3

        self.delta_w = np.zeros(self.weight.shape)             4
        self.delta_b = np.zeros(self.bias.shape)

  • 1 Dense layers have input and output dimensions.
  • 2 Randomly initialize weight matrix and bias vector.
  • 3 The layer parameters consist of weights and bias terms.
  • 4 Deltas for weights and biases are set to 0.

Note that you initialize W and b randomly. There are many ways to initialize the weights of a neural network. Random initialization is an acceptable baseline, but there are many more sophisticated ways to initialize parameters so that they more accurately reflect the structure of your input data.

Parameter initialization as a starting point for optimization

Initializing parameters is an interesting topic, and we’ll discuss a few other initialization techniques in chapter 6.

For now, just keep in mind that initialization will influence your learning behavior. If you think about the loss surface in figure 5.10, initialization of parameters means choosing a starting point for optimization; you can easily imagine that different starting points for SGD on the loss surface of figure 5.10 may lead to different results. That makes initialization an important topic in the study of neural networks.

Now, the forward pass for a dense layer is straightforward.

Listing 5.17. Dense layer forward pass
    def forward(self):
        data = self.get_forward_input()
        self.output_data = np.dot(self.weight, data) + self.bias    1

  • 1 The forward pass of the dense layer is the affine-linear transformation on input data defined by weights and biases.

As for the backward pass, recall that to compute the delta for this layer, you just need to transpose W and multiply it by the incoming delta: WtΔ. The gradients for W and b are also easily computed: ΔW = Δyt and Δb = Δ, where y denotes the input to this layer (evaluated with the data you’re currently using).

Listing 5.18. Dense layer backward pass
    def backward(self):
        data = self.get_forward_input()
        delta = self.get_backward_input()                          1

        self.delta_b += delta                                      2

        self.delta_w += np.dot(delta, data.transpose())            3

        self.output_delta = np.dot(self.weight.transpose(), delta) 4

  • 1 For the backward pass, you first get input data and delta.
  • 2 The current delta is added to the bias delta.
  • 3 Then you add this term to the weight delta.
  • 4 The backward pass is completed by passing an output delta to the previous layer.

The update rule for this layer is given by accumulating the deltas, according to the learning rate you specify for your network.

Listing 5.19. Dense layer weight update mechanism
    def update_params(self, rate):                       1
        self.weight -= rate * self.delta_w
        self.bias -= rate * self.delta_b

    def clear_deltas(self):                              2
        self.delta_w = np.zeros(self.weight.shape)
        self.delta_b = np.zeros(self.bias.shape)

    def describe(self):                                  3
        print("|--- " + self.__class__.__name__)
        print("  |-- dimensions: ({},{})"
              .format(self.input_dim, self.output_dim))

  • 1 Using weight and bias deltas, you can update model parameters with gradient descent.
  • 2 After updating parameters, you should reset all deltas.
  • 3 A dense layer can be described by its input and output dimensions.

5.5.4. Sequential neural networks with Python

Having taken care of layers as building blocks for a network, let’s turn to the networks themselves. You initialize a sequential neural network by equipping it with an empty list of layers and let it use MSE as the loss function, unless provided otherwise.

Listing 5.20. Sequential neural network initialization
class SequentialNetwork:                    1
    def __init__(self, loss=None):
        print("Initialize Network...")
        self.layers = []
        if loss is None:
            self.loss = MSE()               2

  • 1 In a sequential neural network, you stack layers sequentially.
  • 2 If no loss function is provided, MSE is used.

Next, you add functionality to add layers one by one.

Listing 5.21. Adding layers sequentially
    def add(self, layer):                           1
        self.layers.append(layer)
        layer.describe()
        if len(self.layers) > 1:
            self.layers[-1].connect(self.layers[-2])

  • 1 Whenever you add a layer, you connect it to its predecessor and let it describe itself.

At the core of your network implementation is the train method. You use mini-batches as input: you shuffle training data and split it into batches of size mini_batch_size. To train your network, you feed it one mini-batch after another. To improve learning, you’ll feed the network your training data in batches multiple times. We say that we train it for multiple epochs. For each mini-batch, you call the train_batch method. If test_data is provided, you evaluate network performance after each epoch.

Listing 5.22. Train method on a sequential network
    def train(self, training_data, epochs, mini_batch_size,
              learning_rate, test_data=None):
        n = len(training_data)
        for epoch in range(epochs):                                     1
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k + mini_batch_size] for
                k in range(0, n, mini_batch_size)                       2
            ]
            for mini_batch in mini_batches:
                self.train_batch(mini_batch, learning_rate)             3
            if test_data:
                n_test = len(test_data)
                print("Epoch {0}: {1} / {2}"
                      .format(epoch, self.evaluate(test_data), n_test)) 4
            else:
                print("Epoch {0} complete".format(epoch))

  • 1 To train your network, you pass over data for as many times as there are epochs.
  • 2 Shuffle training data and create mini-batches.
  • 3 For each mini-batch, you train your network.
  • 4 If you provided test data, you evaluate your network on it after each epoch.

Now, your train_batch computes forward and backward passes on this mini-batch and updates parameters afterward.

Listing 5.23. Training a sequential neural network on a batch of data
    def train_batch(self, mini_batch, learning_rate):
        self.forward_backward(mini_batch)              1

        self.update(mini_batch, learning_rate)         2

  • 1 To train the network on a mini-batch, you compute feed-forward and backward pass...
  • 2 ...and then update model parameters accordingly.

The two steps, update and forward_backward, are computed as follows.

Listing 5.24. Updating rule and feed-forward and backward passes for your network
    def update(self, mini_batch, learning_rate):
        learning_rate = learning_rate / len(mini_batch)                   1
        for layer in self.layers:
            layer.update_params(learning_rate)                            2
        for layer in self.layers:
            layer.clear_deltas()                                          3

    def forward_backward(self, mini_batch):
        for x, y in mini_batch:
            self.layers[0].input_data = x
            for layer in self.layers:
                layer.forward()                                           4
            self.layers[-1].input_delta = 
                self.loss.loss_derivative(self.layers[-1].output_data, y) 5
            for layer in reversed(self.layers):
                layer.backward()                                          6

  • 1 A common technique is to normalize the learning rate by the mini-batch size.
  • 2 Update parameters for all layers.
  • 3 Clear all deltas in each layer.
  • 4 For each sample in the mini batch, feed the features forward layer by layer.
  • 5 Compute the loss derivative for the output data.
  • 6 Do layer-by-layer backpropagation of error terms.

The implementation is straightforward, but there are a few noteworthy points to observe. First, you normalize the learning rate by your mini-batch size in order to keep updates small. Second, before computing the full backward pass by traversing layers in reversed order, you compute the loss derivative of the network output, which serves as the first input delta for the backward pass.

The remaining part of your SequentialNetwork implementation concerns model performance and evaluation. To evaluate your network on test data, you need to feed this data forward through your network, and this is precisely what single_forward does. The evaluation takes place in evaluate, and you return the number of correctly predicted results to assess accuracy.

Listing 5.25. Evaluation
    def single_forward(self, x):              1
        self.layers[0].input_data = x
        for layer in self.layers:
                layer.forward()
        return self.layers[-1].output_data

    def evaluate(self, test_data):            2
        test_results = [(
            np.argmax(self.single_forward(x)),
            np.argmax(y)
        ) for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

  • 1 Pass a single sample forward and return the result.
  • 2 Compute accuracy on test data.

5.5.5. Applying your network handwritten digit classification

Having implemented a feed-forward network, let’s return to our initial use case of predicting handwritten digits for the MNIST data set. After importing the necessary classes that you just built, you load MNIST data, initialize a network, add layers to it, and then train and evaluate the network with your data.

To build a network, keep in mind that your input dimension is 784 and your output dimension is 10, the number of digits. You choose three dense layers with output dimensions 392, 196, and 10, respectively, and add sigmoid activations after each of them. With each new dense layer, you are effectively dividing layer capacity in half. The layer sizes and the number of layers are hyperparameters for this network. You’ve chosen these values to set up a network architecture. We encourage you to experiment with other layer sizes to gain intuition about the learning process of a network in relation to its architecture.

Listing 5.26. Instantiating a neural network
from dlgo.nn import load_mnist
from dlgo.nn import network
from dlgo.nn.layers import DenseLayer, ActivationLayer

training_data, test_data = load_mnist.load_data()  1

net = network.SequentialNetwork()                  2

net.add(DenseLayer(784, 392))                      3
net.add(ActivationLayer(392))
net.add(DenseLayer(392, 196))
net.add(ActivationLayer(196))
net.add(DenseLayer(196, 10))
net.add(ActivationLayer(10))                       4

  • 1 Load training and test data.
  • 2 Initialize a sequential neural network.
  • 3 You can then add dense and activation layers one by one.
  • 4 The final layer has size 10, the number of classes to predict.

You train the network on data by calling train with all required parameters. You run training for 10 epochs and set the learning rate to 3.0. As mini-batch size, you choose 10, the number of classes. If you were to shuffle training data near perfectly, in most batches each class would be represented, leading to good stochastic gradients.

Listing 5.27. Running a neural network instance on training data
net.train(training_data, epochs=10, mini_batch_size=10,
          learning_rate=3.0, test_data=test_data)          1

  • 1 You can now easily train the model by specifying train and test data, the number of epochs, the mini-batch size, and the learning rate.

Now, run this on the command line:

python run_network.py

That yields the following prompt:

Initialize Network...
|--- DenseLayer
  |-- dimensions: (784,392)
|-- ActivationLayer
  |-- dimensions: (392,192)
|--- DenseLayer
  |-- dimensions: (192,10)
|-- ActivationLayer
  |-- dimensions: (10,10)
Epoch 0: 6628 / 10000
Epoch 1: 7552 / 10000
...

The numbers you get for each epoch don’t matter here, apart from the fact that the result is highly dependent on the initialization of the weights. But it’s noteworthy to observe that you often end up with more than 95% accuracy in less than 10 epochs. This is quite an achievement already, especially given that you did this completely from scratch. In particular, this model performs vastly better than your naive model from the beginning of this chapter. Still, you can do much better.

Note that for the use case you studied, you completely disregard the spatial structure of your input images and treat them as vectors. But it should be clear that the neighborhood of a given pixel is important information that should be used. Ultimately, you want to come back to the game of Go, and you’ve seen throughout chapters 2 and 3 just how important the neighborhood of (a string of) stones is.

In the next chapter, you’ll see how to build a particular kind of neural network that’s more suitable for detecting patterns in spatial data, such as images or Go boards. This will bring you much closer to developing a Go bot in chapter 7.

5.6. Summary

  • A sequential neural network is a simple artificial neural network built from a linear stack of layers. You can apply neural networks to a wide variety of machine-learning problems, including image recognition.
  • A feed-forward network is a sequential network consisting of dense layers with an activation function.
  • Loss functions assess the quality of our predictions. Mean squared error is one of the most common loss functions used in practice. A loss function gives you a rigorous way to quantify the accuracy of your model.
  • Gradient descent is an algorithm for minimizing a function. Gradient descent involves following the steepest slope of a function. In machine learning, you use gradient descent to find model weights that give you the smallest loss.
  • Stochastic gradient descent is a variation on the gradient descent algorithm. In stochastic gradient descent, you compute the gradient on a small subset of your training set called a mini-batch, and then update the network weights based on each mini-batch. Stochastic gradient descent is normally much faster than regular gradient descent on large training sets.
  • With a sequential neural network, you can use the backpropagation algorithm to calculate the gradient efficiently. The combination of backpropagation and mini-batches makes training fast enough to be practical on huge data sets.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset