Chapter 3. Deep Neural Network (DNN) Fundamentals

In this chapter, we’ll explore the core concepts behind DNN models, the category of machine learned models usually used for image and audio processing. Comprehending these basic ideas now will help you understand adversarial examples in greater depth later in the book. Following this basic introduction, Chapter 4 will then explore models for understanding complex images, audio, and video. The two chapters will provide sufficient background for the discussions on adversarial examples that follow but are not intended to provide a comprehensive introduction to deep learning.

If you are familiar with the principles of deep learning and neural networks, feel free to skip this chapter and Chapter 4. Conversely, if you are inclined to learn more than is required for this book, there are numerous excellent resources available to gain a better understanding of machine learning and neural networks. Links to a selection of online resources are included in this book’s GitHub repository.

At the end of this chapter, there are some snippets of Python code. As with all the code in this book, reading it is optional (you can understand adversarial examples without the code). If you are interested in the code, I encourage you to also download and experiment with some of the Jupyter notebooks provided in the associated GitHub repository.

Machine Learning

DNNs belong to the broader category of machine learning (ML): the capability of a machine to learn how to extract patterns from data, without requiring the rules behind the data patterns to be explicitly coded. The resulting learned algorithm is known as a model.

The model is a mathematical algorithm whose parameters have been refined to optimize its behavior. The algorithm “learns” by being repeatedly presented with training data, where each piece of training data is an example of what it needs to learn. Each training example enables the model to be incrementally improved by adjusting its parameters. When the algorithm is suitably refined, it has been trained. Typically, the model’s accuracy is then tested against a test dataset that differs from the training data. The hope is that the model will perform the specific task well on data other than that in the initial training set, so it will work when presented with data that it’s never seen before.

Using traditional (nonneural network) machine learning you can create models that perform fairly clever tasks. For example, based on training data representing many different plant characteristics (height, leaf shape, blossom color, etc.) and corresponding taxonomic genera, it may be possible to train an ML model to infer the genus of any plant based on a list of supplied characteristics. The plant genus may depend on a complex combination of characteristics, but given sufficient training examples and the correct ML approach, the resulting trained model should be able to classify plants without a software engineer needing to explicitly code (or even understand) the relationships between the various characteristics and plant genera.

As you’ll recall from Chapter 1, there are a number of broad strategies for training ML models:

Supervised learning: The flora classification example falls into the category of supervised learning because the model is trained using characteristics along with a label representing the correct answer (in this case, the genus).
Unsupervised learning: In unsupervised learning, no labels are provided during the training step. The model has not been presented with clear answers—it is trained to find patterns in the data, to detect anomalies, or to find associations. A model trained using plant data without the associated labels may be able to learn plant groupings based on combinations of characteristics that often occur together. The trained model could then establish which group a new plant falls into (based on its characteristics), or perhaps whether the new plant doesn’t naturally fall into any of the groupings learned by the model (i.e., identify an anomaly).
Semi-supervised learning: Semi-supervised learning is (as you might expect) a halfway house between supervised and unsupervised learning, for when training data is available but not all of it is labeled. Typically, the model uses unlabeled data to establish patterns, then labeled data is used to improve its accuracy.
Reinforcement learning: In reinforcement learning, the ML model receives some kind of feedback to establish how good it is. The model is refined to optimize for a particular goal (such as learning how to win at a video game) by repeatedly attempting the goal, receiving some kind of feedback (a score indicating how well it has done), and adjusting its approach.

Most of the discussions in this book will relate to models learned through supervised learning, as this is where the majority of research into adversarial input has focused so far. However, the concepts of adversarial examples also apply to models learned through unsupervised, semi-supervised, and reinforcement methods.

A Conceptual Introduction to Deep Learning

Although traditional machine learning techniques are powerful, these models are unable to deal with very complex data where the relevant information—the salient features—required for the task is unclear. Plant characteristics are represented by structured data, such as a binary value to indicate whether the plant has blossoms, or an integer to indicate the number of petals on each flower. The importance of each feature in establishing the plant genus may not be clear, but the features themselves are clearly articulated.

Many tasks use data types that are unstructured or have features that are difficult to discern. In the world of AI, focus is often on unstructured raw data extracted from the real world, such as image and audio. A digital image typically comprises thousands of pixel values, each individual pixel having little relevance on its own; the meaning of the image is based on complex spatial relationships between pixels. Similarly, audio is represented by a series of values over time, where each value is meaningless individually. The order of and spacing between the values determine how the sound should be interpreted.

Visual and audio data is something that we humans and other animals are particularly good at processing with our biological “neural network” brains. The visual and audio processing areas of our brains extract the relevant information from the raw data we receive through our eyes and ears. It’s so easy for us humans to process audio and visual data that it can be difficult to understand why this is challenging for a machine.

Take a simple example of recognizing a cat. For a machine, establishing object boundaries, catering for occlusion and shading, and recognizing the features that constitute a “cat” is incredibly difficult. However, for humans this is a trivial problem. We can recognize a cat from minimal information—a tail, a paw, or perhaps just a movement. We can even recognize breeds of cat that we have never seen before as “cats.”

Audio is no different. Humans recognize sounds and also comprehend speech effortlessly. We filter and ignore background noise and focus on relevant detail. We understand speech based on complex orders of different sounds and context. Yet for a computer this is a nontrivial task, as the digital representation of a sound file is complex and messy, with simultaneous noises from multiple origins.

As humans, we manage visual and audio processing without difficulty. We are unlikely to even be able to explain what features and patterns make up a cat or what sound combinations make up a particular sentence. Somehow our brains manage to learn those features and patterns, and we apply this learned algorithm in our everyday lives as if it were trivial. However, it’s infeasible to explicitly write code to adequately cater for all the different possible scenarios in the physical world, and traditional ML algorithms are insufficiently flexible to learn the features required to handle this complexity.

Traditional machine learning techniques may not be up to the task of audio and image processing, but deep learning is. Deep models excel in a multitude of complex computational tasks, especially when the data is unstructured (such as image, audio, or text) or where features are difficult to discern (such as predictive modeling¹ where there are many variables).

Chapter 1 introduced the concept of an artificial neural network (ANN) visually in terms of “neurons” (very roughly akin to biological neurons) and connections between these neurons which could be considered (again, very roughly) similar to the axons and synapses in the brain. This provides a nice way of thinking about artificial neural networks; they are essentially interconnected neurons, typically arranged in layers² as depicted in Figure 3-1. This illustrates the most basic sort, known as a multilayer perceptron. In this case every node in one layer is connected to every node in the adjacent layer, so it’s classified as a fully connected or dense neural network.

The first layer (on the left) that takes the input is called the input layer. The output of the algorithm is seen in the righthand output layer. The layers between are called hidden layers because they represent partial computations performed by the network that are not visible externally. The hidden layers are where all the magic happens as the computation propagates through the network, and it is the existence of one or more hidden layers that makes an ANN a “deep” neural network.

Figure 3-2 illustrates this idea. Data is input to the DNN through the first layer (on the left), causing the neurons in that layer to be activated to differing degrees. This just means that the neurons are allocated a number, an activation or an intensity. A high activation means the neuron has a higher numeric value assigned, and a low activation indicates a lower value is assigned.

Neurons firing in one layer cause the connections to relay this information to the next layer, so the activation of one layer is determined by the activation of the previous layer. Neurons respond differently to input; some neurons fire (activate) at a greater or lower intensity than others, given the same input. Also, some connections are “stronger,” meaning that they carry more weight than others, so they play a more significant role in determining the effect of the behavior downstream. Eventually a result pops out at each of the nodes in the righthand layer (the output layer). This is known as a feed-forward network as the data flows in one direction. The information is always “fed” forward through the network and there’s no looping back to previous nodes. More formally, a feed-forward network is a type of directed acyclic graph.

An image depicting the incremental steps in a DNN calculation.

There’s obviously lots of great stuff happening as the data passes through the hidden layers of the network and is transformed, perhaps from an image to a classification, or from speech to representative text. We might assume (rightly) that the hidden layers are gradually extracting the relevant information from the input (be it a bit of a cat’s ear or the phoneme sound for an “a,” or whatever). Higher- and higher-level features are extracted, then potentially combined, until the relevant answer is generated at the output layer.

DNN Models as Mathematical Functions

Up to this point we have considered DNNs as artificial approximations of biological processes (the brain) purely in a conceptual (visual) way by considering the neurons and their interconnections. However, ML models (including DNNs) are actually mathematical functions. The mathematics of the functions may be complicated, but an ML model is just a piece of math.

So, simply put, ML models are mathematical functions which take some input and return some output. We can therefore write them like this:

$y = f(x)$

The act of training the model is working out what the function f should be.

To introduce the mathematics behind DNNs, let’s use an example dataset—Fashion-MNIST. This dataset is available online for the purposes of experimenting and benchmarking ML models.

The Universality of Artificial Neural Networks

A key differentiator of a DNN model compared with traditional ML techniques is that DNNs are able to represent universal functions. That is, there is a neural network that can represent an accurate approximation for every function,³ no matter how complicated that function is. For a DNN to adhere to the universality principle, it requires just a single hidden layer.

Fashion-MNIST comprises 70,000 grayscale images of 28 x 28 pixels resolution. Each image depicts an item of clothing that can be separated into one of 10 classifications: “T-shirt/top,” “Trouser,” “Pullover,” “Dress,” “Coat,” “Sandal,” “Shirt,” “Sneaker,” “Bag,” or “Ankle boot.” Examples of images from this dataset, along with their corresponding labels, are shown in Figure 3-3.

The Fashion-MNIST dataset is a particularly nice one to demonstrate image classification using a DNN model because it does not include the complexity usually associated with images. First, the images are low resolution and in monochrome. This reduces the complexity of the DNN required to perform the classification, making it easier to train without specialist hardware. Second, as you can see in Figure 3-3, each image only depicts a single item, sized and placed centrally. In Chapter 4 we’ll see that dealing with spatial positioning within an image requires a more complex DNN architecture.

A selection of examples from Fashion-MNIST.

We can use this dataset to experiment with simple neural network image classification: given an input image, produce a clothing classification. Figure 3-4 illustrates (conceptually) what we would like from a trained DNN given an image of trousers as input.

The DNN model is f. A well-trained image classification DNN model f presented with an image of some trousers (depicted by x) would return a value of y that means “this is a pair of trousers.”

A conceptual depiction of a DNN to classify Fashion-MNIST images.

You don’t need to be a mathematician to realize that there will be a lot of clever stuff happening in the DNN model that converts an image to one of 10 clothing categories. We’ll begin by considering the inputs and outputs to the model, then consider the internals. Finally, we’ll look at how the model is trained to make accurate predictions.

DNN Inputs and Outputs

Each neuron in the input layer is assigned a value representing an aspect of the input data. So, for a Fashion-MNIST input, each neuron in the input layer represents one pixel value of the input image. The model will require 784 neurons in its input layer, because each image has 28 x 28 pixels (Figure 3-4, with its 5 input neurons, is clearly an extreme oversimplification). Each neuron in this layer is allocated a value between 0 and 255 depicting the intensity of the pixel that it represents.⁴

The output layer is another list of values (numbers) representing whatever output the DNN has been trained to generate. Typically, this output is a list of predictions, each associated with a single classification. The Fashion-MNIST dataset has 10 categories of clothing, so there are 10 neurons in the output layer and 10 numeric outputs, each one representing the relative confidence that the image belongs to a specific clothing category. The first neuron value represents the prediction that the image is a “T-shirt/top,” the second represents the prediction that the image is a “Trouser,” and so on.

As this is a classification task, the most activated (highest value) neuron in the output layer is going to represent the DNN’s answer. In Figure 3-4, the neuron corresponding to “Trouser” would have the highest activation if the model was classifying successfully.

The list of inputs (pixel values) and the list of outputs (clothing classification confidences) are represented mathematically as vectors. The previous formula can therefore be better expressed as a function that takes a vector input x and returns a vector output y:

$\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n-1} \\ y_{n} \end{pmatrix} = f\begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{i-1} \\ x_{i} \end{pmatrix}$

where i is the number of neurons in the input layer (784 in the Fashion-MNIST case) and n is the number of neurons in the output layer (10 in the Fashion-MNIST case).
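As a minimal sketch of these input and output vectors, the plain-Python snippet below flattens a 28 x 28 image into a 784-value input vector and picks the most activated entry of a 10-value output vector. The helper names are hypothetical, and the image and output scores are made-up illustrative values rather than results from a trained model.

```python
# Class names in the order used by the Fashion-MNIST labels (0 to 9).
CLASS_NAMES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


def flatten_image(image):
    """Turn a 2D grid of pixel rows into one flat input vector x."""
    return [pixel for row in image for pixel in row]


def predicted_class(outputs):
    """Return the class name for the most activated output neuron."""
    best_index = max(range(len(outputs)), key=lambda i: outputs[i])
    return CLASS_NAMES[best_index]


# Hypothetical all-black 28 x 28 image: every pixel intensity is 0.
image = [[0 for _ in range(28)] for _ in range(28)]
x = flatten_image(image)  # input vector with one value per pixel

# Made-up output vector y; the second entry ("Trouser") is the largest.
y = [0.02, 0.85, 0.01, 0.03, 0.02, 0.01, 0.03, 0.01, 0.01, 0.01]

print(len(x), predicted_class(y))
```

The function f is everything that happens between `x` and `y`; here it is simply skipped.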

So, we have the concept of a mathematical function representing the DNN. This function takes one vector (the input) and returns another (the output). But what exactly is that function?

DNN Internals and Feed-Forward Processing

Like the input and output layers, each of the hidden layers in a DNN is represented as a vector. Every layer in the network can be represented as a simple vector with one value per neuron. The values of the vector describing the input layer determine the values of the first hidden layer vector, which determines the values of the subsequent hidden layer vector, and so on until an output vector is generated to represent the predictions.

So, how do the vector values of one layer influence the vector values of the next? To appreciate how a DNN model actually calculates its answer, we’ll zoom into a specific node in the network, shaded in blue in Figure 3-5, and consider how its activation value is calculated.

The value of a specific neuron will be defined by the activations of the previous layer of neurons, each depicted by the letter a with a subscript indicating the neuron.

An image showing the weights and biases that contribute to a neuron's activation.

There are two types of adjustable parameters in a simple feed-forward neural network:

Weights: An individual weight value is associated with each connection in the network. This determines the strength of the connection, or the amount that the activation of the connected-from neuron influences the connected-to neuron. In Figure 3-5 the weights are depicted by the letter w with a subscript to indicate which connection they are associated with.
Biases: An individual bias value is associated with every neuron in the network. This determines whether the neuron tends to be active or not. In Figure 3-5 the bias is depicted by the letter b.

The activation of a particular neuron (after the input layer) will be determined by contributions from the neuron’s upstream connections, plus an adjustment by the neuron’s bias. The contribution of an individual connection is the product of the connected-from neuron’s activation and the weight of the connection itself. Therefore, the contribution of the first connection in Figure 3-5 is the product of the weight of the connection and the activation of the connected-from neuron:

$w_{0} \cdot a_{0}$

Now, simply sum them all up to get the combined input from all the contributing connections:

$w_{0} \cdot a_{0} + w_{1} \cdot a_{1} + \dots + w_{n-1} \cdot a_{n-1} + w_{n} \cdot a_{n}$

And add the bias associated with the connected-to neuron:

$w_{0} \cdot a_{0} + w_{1} \cdot a_{1} + \dots + w_{n-1} \cdot a_{n-1} + w_{n} \cdot a_{n} + b$

To make the DNN behave as required, the result from this calculation is then fed into an activation function. We’ll define this as A. This gives the formula for calculating the activation of any particular neuron in the network (other than those in the input layer whose activations are determined directly by the input data):

$A(w_{0} \cdot a_{0} + w_{1} \cdot a_{1} + \dots + w_{n-1} \cdot a_{n-1} + w_{n} \cdot a_{n} + b)$
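The calculation for a single neuron can be sketched in a few lines of plain Python. The function names and the numeric values are hypothetical and chosen only for illustration; the activation function A is passed in as a parameter, and a do-nothing placeholder is used here so that the weighted sum plus bias is easy to check by hand.

```python
def neuron_activation(weights, activations, bias, activation_fn):
    """Compute A(w_0*a_0 + ... + w_n*a_n + b) for one neuron."""
    weighted_sum = sum(w * a for w, a in zip(weights, activations))
    return activation_fn(weighted_sum + bias)


def identity(z):
    """Placeholder activation function that returns its input unchanged."""
    return z


# Made-up values for a neuron with three upstream connections.
weights = [0.5, -1.0, 2.0]      # w_0, w_1, w_2
activations = [1.0, 2.0, 0.5]   # a_0, a_1, a_2 from the previous layer
bias = 0.25                     # b

# 0.5*1.0 + (-1.0)*2.0 + 2.0*0.5 + 0.25 = -0.25
result = neuron_activation(weights, activations, bias, identity)
print(result)
```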

There are a number of possible activation functions that can be used during neural processing. The particular function for each neuron of the network is explicitly defined when the network is architected and depends upon the type of neural network and the network layer.

For example, an activation function that has proven to work very well for the hidden layers of DNNs is the Rectified Linear Unit (ReLU). The hidden neural network layers are most effective at learning when small upstream contributions do not cause the neuron to fire and larger ones do. This is like the synapses in the brain that fire at a particular threshold. ReLU mimics this idea. It’s simple—if the input to the ReLU function:

$(w_{0} \cdot a_{0} + w_{1} \cdot a_{1} + \dots + w_{n-1} \cdot a_{n-1} + w_{n} \cdot a_{n} + b)$

exceeds a static threshold (usually zero), the function returns that value; if not, it returns zero. This is depicted in Figure 3-6.
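As a small sketch, ReLU with a threshold of zero can be written as a one-line Python function. The name `relu` and the sample inputs are illustrative choices.

```python
def relu(z):
    """Return z if it exceeds zero, otherwise return zero."""
    return max(0.0, z)


# Negative and zero inputs are clipped to zero; positive inputs pass through.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
print([relu(z) for z in samples])  # [0.0, 0.0, 0.0, 0.5, 2.0]
```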

When the DNN is performing a classification task, it is useful for the final layer to output a set of prediction probabilities that all sum to 1. Therefore, for the Fashion-MNIST classification, we require a final layer in the network that takes the scores output from the network—called the logits—and scales them so that they represent probabilities and therefore add to exactly 1. Neurons in this layer will use a different activation function called softmax to perform this scaling step.
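The scaling step can be sketched using the standard softmax definition: exponentiate each logit, then divide by the sum of the exponentials. The logit values below are made up, and subtracting the largest logit before exponentiating is a common numerical-stability precaution that does not change the result.

```python
import math


def softmax(logits):
    """Scale a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtracted only for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a three-class problem.
logits = [2.0, 1.0, 0.1]
probabilities = softmax(logits)
print(probabilities, sum(probabilities))
```

Larger logits map to larger probabilities, so the ordering of the scores is preserved.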

Applying all this to create a Fashion-MNIST classification model, Figure 3-4 can be redrawn with more accuracy. The number of hidden layers and number of neurons in each hidden layer is somewhat arbitrary, but there are some rules of thumb to get a good result. Typically, one or two hidden layers will suffice, each with a number of neurons somewhere between the number in the input layer and the number in the output layer. Let’s assume there are two hidden layers, each with 56 neurons. The resulting network architecture is shown in Figure 3-7.
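To make the feed-forward computation concrete, here is a minimal plain-Python sketch of an untrained network with the layer sizes just described (784, 56, 56, 10), using ReLU in the hidden layers and softmax in the final layer. All names are hypothetical, the random initialization scheme and the input values are assumptions made for illustration, and because the parameters are random the output probabilities are meaningless.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def softmax(logits):
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_inputs, n_neurons, rng):
    """Random weights (one row per neuron) and random biases (assumed scheme)."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    biases = [rng.gauss(0.0, 0.05) for _ in range(n_neurons)]
    return weights, biases


def weighted_sums(inputs, weights, biases):
    """For each neuron: sum of weight * upstream activation, plus bias."""
    return [sum(w * a for w, a in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def feed_forward(x, layers):
    """Propagate the input vector x through every layer in turn."""
    activations = x
    for weights, biases in layers[:-1]:          # hidden layers use ReLU
        sums = weighted_sums(activations, weights, biases)
        activations = [relu(z) for z in sums]
    weights, biases = layers[-1]                 # final layer uses softmax
    return softmax(weighted_sums(activations, weights, biases))


rng = random.Random(0)
sizes = [784, 56, 56, 10]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

# Made-up input vector standing in for a flattened, scaled image.
x = [rng.random() for _ in range(784)]
y = feed_forward(x, layers)
print(len(y), round(sum(y), 6))
```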

The next part of the puzzle is to understand how all the weights and bias values are adjusted to give a good result.

A DNN Architecture for Image Classification for Fashion-MNIST Data

How a DNN Learns

So far, we have established that a DNN model is a mathematical function, f. In “DNN Inputs and Outputs” we wrote this as a function that takes a vector representing the input data and returns an output vector. We can write this simply as:

$\mathbf{y} = f(\mathbf{x})$

where the bold form of x and y indicates that they represent vectors.

The function f comprises many parameters depicting each of the weights and biases in the network. These must be adjusted correctly for it to give the expected output given any input. So let’s rewrite the preceding expression, this time specifying a function that takes both an input image and a set of parameters that represent all the weights and biases. The character Θ (theta) is the notation used to represent all the values of the weights and biases in the network:

$\mathbf{y} = f(\mathbf{x}; \Theta)$

Training the network involves adjusting these weights and biases (Θ) so that each time a training input is presented to the network it returns a value as close as possible to the correct (target) label for that input. Don’t underestimate the task of optimizing all these parameters: the DNN function has one bias parameter for each neuron after the input layer and one weight for each connection. So, for each layer after the input layer, that gives us:

$\mathit{numberOfParametersPerLayer} = (\mathit{numberOfNodesInPreviousLayer} \times \mathit{numberOfNodesInLayer}) + \mathit{numberOfNodesInLayer}$

For example, for the Fashion-MNIST classifier model shown in Figure 3-7 there are 47,722 different parameters to adjust—and this is a relatively simple neural network! It’s unsurprising, then, that training a DNN requires considerable amounts of training data examples and is computationally expensive.
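The figure of 47,722 can be checked with a short calculation that applies the per-layer formula to the layer sizes 784, 56, 56, and 10. The function and variable names are illustrative.

```python
def parameters_per_layer(nodes_in_previous_layer, nodes_in_layer):
    """One weight per incoming connection plus one bias per node."""
    return nodes_in_previous_layer * nodes_in_layer + nodes_in_layer


# Input layer, two hidden layers of 56 neurons, and the output layer.
layer_sizes = [784, 56, 56, 10]

per_layer = [parameters_per_layer(previous, current)
             for previous, current in zip(layer_sizes, layer_sizes[1:])]
total_parameters = sum(per_layer)

print(per_layer)         # [43960, 3192, 570]
print(total_parameters)  # 47722
```

Almost all of the parameters sit between the input layer and the first hidden layer, because every one of the 784 pixels connects to every one of the 56 neurons.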

To begin training, the weights and biases within the network are initialized randomly. Consequently, the network will perform dreadfully; it needs to learn. During this learning phase, the weights and biases are iteratively adjusted to ensure that the DNN works optimally across all the training data in the hope that, given new examples, it will return the correct answer.

Here’s how it learns. The network is repeatedly presented with example inputs from the training dataset and scored as to how badly it does relative to the expected labels associated with the training inputs. This score is referred to as the cost (or loss) of the network and is established by a cost function or loss function, a special function that quantifies how badly the network is performing its task. If the cost is big, then the network is giving poor results. If the cost is small, the network is performing well.

The cost function will take a number of parameters: the DNN function itself (f) with all its weights and biases (that is, all its parameters as represented by Θ), and the training examples. The training example labels will be needed too because these define what a “good” answer is—the ground truth.

For a single training example, the cost function can therefore be expressed as follows:

$C(f(\mathbf{x}; \Theta), \mathbf{l})$

where:

C represents the cost function for the DNN f with the parameters Θ, given the training example $\mathbf{x}$ and its associated target predictions $\mathbf{l}$.

So what does this function C actually do? Well, there are multiple ways that we could measure how well the model has performed for a single training example. One simple method is to subtract the expected labels from the actual values produced by the DNN for that training example. The difference is squared to ensure that larger discrepancies between the target labels and predicted probabilities generate a disproportionately big loss value. This causes big differences between the target and predicted values to be penalized more than small ones.

Here’s the equation that we use to measure the cost for a specific training example:

${\left(\begin{pmatrix} y_{0} \\ y_{1} \\ \vdots \\ y_{n-1} \\ y_{n} \end{pmatrix} - \begin{pmatrix} l_{0} \\ l_{1} \\ \vdots \\ l_{n-1} \\ l_{n} \end{pmatrix}\right)}^{2}$

where:

$l_{i}$ has the value 1 when i represents the correct target classification, and the value 0 otherwise.

Put another way, the cost for one training example is:

$C(f(\mathbf{x}; \Theta), \mathbf{l}) = {(f(\mathbf{x}; \Theta) - \mathbf{l})}^{2}$

For example, say we present the Fashion-MNIST image classifier with an image of a pair of trousers during training, and it returns the following vector:

$\begin{pmatrix} 0.032 \\ 0.119 \\ \vdots \\ 0.151 \\ 0.005 \end{pmatrix}$

The target label associated with the image is “Trouser,” corresponding to the second value in the vector. Ideally, this prediction should be close to 1 rather than its current value of 0.119. If the network was performing perfectly, it would have returned the vector of predictions:

$\begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \\ 0 \end{pmatrix}$

The calculated cost for this example is the sum of the squared differences between each predicted value and the corresponding value in the target vector:

${\left(\begin{pmatrix} 0.032 \\ 0.119 \\ \vdots \\ 0.151 \\ 0.005 \end{pmatrix} - \begin{pmatrix} 0 \\ 1 \\ \vdots \\ 0 \\ 0 \end{pmatrix}\right)}^{2} = 0.001 + 0.776 + \dots + 0.023 + 0.000$

In this case, you can see that the cost is high for the trouser classification. Therefore, some parameters in the network need to be tuned to fix this.
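The squared-error terms in this worked example can be reproduced with a short sketch. Only the four predictions and targets that are written out in the example are used; the elided middle entries are left out, so the sum printed here is a partial cost rather than the full cost for the image. The function name is hypothetical.

```python
def squared_error_terms(predictions, targets):
    """Squared difference between each prediction and its target value."""
    return [(y - l) ** 2 for y, l in zip(predictions, targets)]


# The four entries shown in the worked example (elided entries omitted).
shown_predictions = [0.032, 0.119, 0.151, 0.005]
shown_targets = [0, 1, 0, 0]   # 1 marks the correct class, "Trouser"

terms = squared_error_terms(shown_predictions, shown_targets)
rounded_terms = [round(term, 3) for term in terms]

print(rounded_terms)   # [0.001, 0.776, 0.023, 0.0]
print(sum(terms))      # partial cost over the four shown entries
```

The "Trouser" term dominates because its prediction of 0.119 is far from the target of 1.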

To really assess how good the network is, however, we need to consider the cost over all the training examples. One measure of this overall performance is to take the average cost. This calculation provides us with the loss function. In the case described, we’re taking the mean of the squares of all the errors. This loss function is known as mean squared error (MSE).

Other loss functions use different algorithms to calculate loss during training. For classification models, it’s usual to use a categorical cross entropy loss function. This function is better at penalizing errors that the network returns with higher confidence. We’ll use a variation of this in the code example at the end of this chapter.
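The difference between the two kinds of loss can be sketched for a two-class problem. The functions below follow the description above (the average of the per-example squared costs) and the standard definition of categorical cross entropy (the negative log of the probability assigned to the correct class). All names and values are illustrative, and the small clamp value is an assumption added to avoid taking the log of zero.

```python
import math


def squared_error(predictions, targets):
    """Cost for one example: sum of squared differences."""
    return sum((y - l) ** 2 for y, l in zip(predictions, targets))


def average_cost(batch_predictions, batch_targets):
    """Mean of the per-example squared-error costs."""
    costs = [squared_error(p, t)
             for p, t in zip(batch_predictions, batch_targets)]
    return sum(costs) / len(costs)


def categorical_cross_entropy(predictions, targets, epsilon=1e-12):
    """Negative log of the probability given to the correct class."""
    return -sum(l * math.log(max(y, epsilon))
                for y, l in zip(predictions, targets))


target = [0, 1]                # the second class is correct
unsure_wrong = [0.6, 0.4]      # wrong, but with low confidence
confident_wrong = [0.9, 0.1]   # wrong, with high confidence

for name, prediction in [("unsure", unsure_wrong),
                         ("confident", confident_wrong)]:
    print(name,
          round(squared_error(prediction, target), 3),
          round(categorical_cross_entropy(prediction, target), 3))

print(average_cost([unsure_wrong, confident_wrong], [target, target]))
```

In this example the cross entropy grows faster than the squared error as the wrong answer becomes more confident.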

To improve the network, we want to adjust its weights and biases to reduce the average cost to its minimum possible value. In other words, the parameters represented by Θ require adjusting until the equation is the best it can be for all the possible training examples. At this point, the network will be optimally adjusted. This is achieved through a technique called gradient descent.

To understand gradient descent, imagine that there is just one adjustable parameter $Θ_{i}$ in the network, rather than thousands. This enables a pictorial representation, as shown in Figure 3-8. The x-axis corresponds to the adjustable parameter. Given any value of x, the average cost of the network can be calculated using our chosen loss function (such as MSE). Of course, the x-axis is vastly simplified (it should really be thousands of axes, one for each parameter), but the illustration holds to represent the concept.

Any point on the curve represents the cost for a particular combination of weights and biases for the network. At the start of the training phase, when the parameters are initialized randomly, the cost will be high. We need to find the parameters that will bring the cost down to its minimum value.

Gradient descent uses mathematical methods to calculate the gradient of the cost function at the current stage of training (that is, for the current parameter settings, Θ). This gradient can then be used to calculate whether we need to increase or decrease each weight and bias in the network to improve it. Through repetitive application of this technique, the network parameters are optimized to produce a good result. Gradient descent is akin to rolling down the slope of the function in Figure 3-8. There is a risk that you will roll to a nonoptimum place, marked on the graph as “good parameter optimization.” Here you would get “stuck” because you would need to roll uphill first to reach the “superb parameter optimization” location. This point is referred to as a “local minimum.” Techniques can be employed during gradient descent to reduce this risk.⁵
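The idea of rolling downhill can be sketched for the single-parameter case. The cost curve below is a made-up bowl-shaped function with its minimum at 3.0, not the cost of a real network, and the step size (often called the learning rate) and the number of steps are assumed values chosen for illustration.

```python
def cost(theta):
    """Hypothetical cost curve with a single minimum at theta = 3.0."""
    return (theta - 3.0) ** 2 + 1.0


def gradient(theta):
    """Slope of the hypothetical cost curve at theta."""
    return 2.0 * (theta - 3.0)


theta = -4.0          # arbitrary starting value for the parameter
learning_rate = 0.1   # assumed step size
initial_cost = cost(theta)

for step in range(100):
    # Move the parameter a small step in the downhill direction.
    theta -= learning_rate * gradient(theta)

print(round(theta, 6), round(cost(theta), 6))
```

Because this curve has only one minimum, the sketch cannot get stuck; a curve with several dips, like the one in Figure 3-8, could leave the parameter in a local minimum depending on where it starts.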

The method of deciding which weights and biases are updated and by how much during the deep neural network training phase is known as backpropagation because the loss calculated for the output layer is “pushed back” through the network to cause the relevant adjustments in the weight and bias parameters. The math for backpropagation is complicated and not necessary for understanding adversarial examples.⁶ However, if you are curious and would like to understand this and the feed-forward aspects of neural network processing (as introduced in “DNN Internals and Feed-Forward Processing”) in greater detail, some links to useful resources are included in this book’s GitHub repository.

The notion of gradient descent is an important and recurring one throughout machine learning. It’s a technique usually employed for optimizing models during training. However, we will see in Chapter 5 that the technique can also be applied to optimize the generation of adversarial examples.

Creating a Simple Image Classifier

Creating and training a deep learning model from scratch is a difficult task requiring understanding of the mathematics behind feed-forward processing and backpropagation. Nowadays, however, there are multiple libraries available to automatically build and train models, making it extremely simple to create a DNN without needing to code the underpinning algorithms.

Jupyter Notebook for Simple Image Classifier

The code included in this book is available for download from the book’s GitHub repository.

To access and run the code and to set up dependencies and install the Jupyter notebook, see the instructions in the GitHub repository.

The code snippets in this section can be found in the Jupyter notebook chapter03/fashionMNIST_classifier.ipynb.

This section shows the primary code steps required to create a classification model for the Fashion-MNIST image data in Python. It’s a network that demonstrates how easy it is to create a deep model using open software libraries. We’ll use the TensorFlow deep learning library with the Keras API to build and train the model. This code is based on one of the online tutorials provided as an introduction to Keras.⁷

The network is fully connected, meaning that every node in each layer is connected to every node in the subsequent layer. It is also feed-forward, meaning that the computation travels from the input layer sequentially through all the hidden layers and out of the output layer.

First, import the required libraries, TensorFlow, and its version of the Keras library:

import tensorflow as tf
from tensorflow import keras

The Fashion-MNIST data is provided with Keras. There are two datasets provided, one for training the model and one for evaluating it. Both datasets comprise a list of images with their corresponding labels. The labels are provided with the test data to evaluate how good the model is by testing it with each test image and checking whether the result matches the expected label.

It’s also handy to provide a list of classification names corresponding to the 10 possible clothing classifications. This will enable us to print the name of the classification rather than just its number (e.g., “T-shirt/top” rather than “0”) later on:

fashion_mnist = keras.datasets.fashion_mnist

(train_images,train_labels),(test_images,test_labels) = fashion_mnist.load_data()

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

Let’s take a look at the image at index 9, shown in Figure 3-9:

import matplotlib.pyplot as plt
plt.gca().grid(False)
plt.imshow(test_images[9], cmap=plt.cm.binary)

Each image in the dataset comprises an array of pixels, each with a value depicting its intensity: 0 to 255. We need to normalize the values for the input layer of the DNN so that each lies between 0 and 1:

train_images = train_images/255.0
test_images = test_images/255.0
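To see what this scaling does without downloading the dataset, here is a small sketch that uses a randomly generated stand-in for the image data. The array fake_images and its contents are hypothetical; only the 28 x 28 shape and the 0 to 255 range are taken from the text.

```python
import numpy as np

# Hypothetical stand-in for a batch of five 28 x 28 grayscale images.
rng = np.random.default_rng(seed=0)
fake_images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Dividing by a float converts the integers to floating point in the range 0 to 1.
scaled = fake_images / 255.0

print(fake_images.dtype, fake_images.min(), fake_images.max())
print(scaled.dtype, scaled.min(), scaled.max())
```

Note that the division also changes the data type from 8-bit integers to floating-point values, which is what the input layer of the network works with.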

The Keras programming interface provides a simple way to create a model layer by layer with the appropriate neuron activation functions (in this case ReLU or softmax). The following code illustrates how to create a model architecture like that shown in Figure 3-7. The compile step also includes variables to define the way that the model learns and how it judges its accuracy:

model = keras.Sequential([keras.layers.Flatten(input_shape=(28,28)),
                          keras.layers.Dense(56, activation='relu'),
                          keras.layers.Dense(56, activation='relu'),
                          keras.layers.Dense(10, activation='softmax')
                         ])
model.compile(optimizer=tf.keras.optimizers.Adam(), 
              loss='sparse_categorical_crossentropy', 
              metrics=['accuracy'])

The optimizer parameter determines how the network will be optimized during training. The Adam optimizer is a good choice in this case to speed up the training as it uses an intelligent algorithm to perform gradient descent.
The loss parameter is where the previously described loss function is defined. The loss function sparse_categorical_crossentropy is a variation on categorical_crossentropy, used when the target labels are passed as a single list of values rather than a zeroed array with the relevant value set to 1. This representation is possible if exactly one class is true at a time. For example, a label representing “Pullover” is represented as 2 in the training label data, as it’s the third label in a list starting at 0, rather than as [0, 0, 1, 0, 0, 0, 0, 0, 0, 0].
The metrics parameter determines how the model will be judged during training.
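The difference between the two label representations can be made concrete with NumPy. The following sketch shows the arithmetic that the two loss names suggest; it is not the Keras implementation, and the prediction values are hypothetical.

```python
import numpy as np

# Hypothetical softmax outputs for two images over the 10 classes.
predictions = np.full((2, 10), 0.02)
predictions[0, 2] = 0.82   # first image: mostly "Pullover" (index 2)
predictions[1, 9] = 0.82   # second image: mostly "Ankle boot" (index 9)

sparse_labels = np.array([2, 9])              # integer class indices
one_hot_labels = np.eye(10)[sparse_labels]    # the same labels as zeroed arrays

# Categorical cross-entropy: multiply by the zeroed array and sum.
categorical = -np.sum(one_hot_labels * np.log(predictions), axis=1)

# Sparse version: look up the probability of the true class directly.
sparse = -np.log(predictions[np.arange(len(sparse_labels)), sparse_labels])

print(one_hot_labels[0])
print(categorical, sparse)
```

Both calculations give the same loss per image. The sparse form simply avoids building the zeroed arrays, which is convenient here because the Fashion-MNIST labels are already supplied as integers.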

We can take a look at the generated model to ensure it is what we expected:

model.summary()

This generates the following output:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten_2 (Flatten)          (None, 784)               0
_________________________________________________________________
dense_4 (Dense)              (None, 56)                43960
_________________________________________________________________
dense_5 (Dense)              (None, 56)                3192
_________________________________________________________________
dense_6 (Dense)              (None, 10)                570
=================================================================
Total params: 47,722
Trainable params: 47,722
Non-trainable params: 0

Looks good. The total number of parameters also matches the value calculated in “How a DNN Learns”.
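The parameter counts in the summary can be reproduced with a little arithmetic: each dense layer has one weight per incoming connection plus one bias per node. The sketch below assumes only the layer sizes used in the model definition (784 flattened inputs, two hidden layers of 56 nodes, and 10 outputs).

```python
layer_sizes = [784, 56, 56, 10]   # flattened input, two hidden layers, output

total = 0
for inputs, outputs in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = inputs * outputs   # one weight per connection
    biases = outputs             # one bias per node in the receiving layer
    print(f"{inputs} -> {outputs}: {weights + biases} parameters")
    total += weights + biases

print("Total:", total)   # Total: 47722
```

The three per-layer figures (43960, 3192, and 570) match the Param # column, and the Flatten layer contributes nothing because it only reshapes the input.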

Using the principles described in “How a DNN Learns”, a single line of code is all that is required to train the model. The function is called fit because it fits the model to the training data. The epochs parameter defines the number of training iterations, that is, the number of times that the model will be refined based on its calculated loss across all the training examples. Let’s set that to 6:

model.fit(train_images, train_labels, epochs=6)

This generates the following output:

Epoch 1/6
60000/60000 [================] - 4s 66us/sample - loss: 0.5179 - acc: 0.8166
Epoch 2/6
60000/60000 [================] - 4s 58us/sample - loss: 0.3830 - acc: 0.8616
Epoch 3/6
60000/60000 [================] - 3s 58us/sample - loss: 0.3452 - acc: 0.8739
Epoch 4/6
60000/60000 [================] - 4s 59us/sample - loss: 0.3258 - acc: 0.8798
Epoch 5/6
60000/60000 [================] - 4s 59us/sample - loss: 0.3087 - acc: 0.8863
Epoch 6/6
60000/60000 [================] - 4s 59us/sample - loss: 0.2933 - acc: 0.8913
Out[8]:
<tensorflow.python.keras.callbacks.History at 0x22f2b3b06d8>

As you can see, Keras displays the model’s loss and accuracy during each stage of the training phase with respect to the training data. The accuracy of the model is the percentage of training data samples that it is correctly classifying. The loss decreases in each epoch as the model’s accuracy increases. This is gradient descent in action! The model’s weights and biases are being adjusted to minimize the loss, as previously illustrated in Figure 3-8.
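As a rough illustration of what the accuracy figure means, the following sketch computes it by hand for a handful of made-up predictions. The probabilities and labels are hypothetical, and the calculation is the usual definition of classification accuracy, assumed here to correspond to what is reported for this model.

```python
import numpy as np

# Hypothetical model outputs for four images over three classes.
probabilities = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
true_labels = np.array([0, 1, 2, 1])

# The predicted class is the one with the highest probability.
predicted_labels = np.argmax(probabilities, axis=1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = np.mean(predicted_labels == true_labels)

print(predicted_labels, accuracy)   # [0 1 2 0] 0.75
```

Three of the four made-up images are classified correctly, giving an accuracy of 0.75.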

Keras provides the following method for checking the accuracy of the generated model after it’s been trained. We want to be sure that the model works well on data other than that provided during training, so the test dataset (test images with their expected labels) is used for this evaluation:

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Model accuracy based on test data:', test_acc)

This generates the following output:

10000/10000 [================] - 0s 35us/sample - loss: 0.3623 - acc: 0.8704
Model accuracy based on test data: 0.8704

The model accuracy is slightly lower on the test data than on the training data, illustrating that the model has been tailored a little too much to the training data. This is called overfitting because the model fits the training data too accurately and has not generalized enough for other data. Still, it’s fairly good, returning about 87% correct answers for the test dataset.

Let’s look at the predictions that the model generates for a particular image in the test dataset. We will take the image at index 6. First, here’s the code to plot the image and its label (Figure 3-10) to show what we are expecting:

image_num = 6
print("Expected label: ", class_names[test_labels[image_num]])

import matplotlib.pyplot as plt
imgplot = plt.imshow(test_images[image_num], cmap=plt.cm.binary)

Expected label:  Coat

Keras uses the function predict to generate predictions for a set of inputs. The following code generates a set of model predictions for each of the test data images and prints the predictions for our selected image:

predictions = model.predict(test_images)
print("Predictions for image:", predictions[image_num])

This generates the following output:

Predictions for image: [2.0931453e-04 2.5958019e-05 5.3381161e-03
 9.3024952e-05 9.8870182e-01 3.4905071e-08 5.4028798e-03 2.1192791e-10
  2.2762216e-04 1.2793139e-06]

The output is the vector produced in the final layer of the model for the selected image. The softmax layer ensures that all the values add up to 1. Just to prove this, let’s add them up:

total_probability = 0
for i in range(10):
    total_probability += predictions[image_num][i]
print(total_probability)

This produces the following output:

1.0000000503423947
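The tiny deviation from 1 is most likely floating-point rounding rather than anything to do with the model. To see why softmax outputs always sum to 1, here is a minimal NumPy sketch of the function itself. The raw scores are hypothetical, and this is the textbook definition rather than the Keras implementation.

```python
import numpy as np


def softmax(scores):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    exps = np.exp(scores - np.max(scores))
    # Dividing by the total forces the outputs to sum to 1.
    return exps / np.sum(exps)


# Hypothetical raw scores for the 10 classes, before softmax is applied.
scores = np.array([1.2, -0.8, 3.1, 0.0, 7.5, -4.2, 3.0, -9.1, 1.3, -3.7])

probabilities = softmax(scores)

print(probabilities)
print(probabilities.sum())
```

Every output is positive, the largest raw score (index 4 in this made-up example) receives the largest probability, and the values sum to 1 up to rounding error.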

Here’s the code that takes the highest prediction and returns the associated clothing classification as a string:

import numpy as np
index_of_highest_prediction = np.argmax(predictions[image_num])

print("Classification: ", class_names[index_of_highest_prediction])
print("Confidence:     ", predictions[image_num][index_of_highest_prediction])

This generates the following output:

Classification:  Coat
Confidence:      0.9887018

The image has been correctly classified as a “Coat” and we can see that the model was confident in this prediction.

Finally, let’s save the model so we can use it again later:

model.save("../models/fashionMNIST.h5")

The model is converted to HDF5 format and stored in the models directory.

We will return to this classifier throughout the book to explore adversarial examples.

¹ The forecasting of outcomes based on past data.

² As any self-respecting data scientist will be keen to point out, ANNs are not necessarily arranged in layers, but let’s keep it simple and not get distracted by all the possible ANN structures at this stage.

³ Strictly speaking, this is true for continuous functions only where a small change in the input will not cause a sudden large step change in the output. However, as discontinuous functions can often be approximated by continuous ones, this is not usually a restriction.

⁴ In practice, it’s beneficial to scale these values so that they all lie between 0 and 1.

⁵ In deep learning, there’s an additional interesting complication whereby the number of potential changeable parameters results in saddle points in the cost graph. These are points that appear to be a minimum for one parameter, but a maximum for another. To picture this visually, think of a point at the center of a horse’s saddle where one direction takes you up and another takes you downward.

⁶ Backpropagation uses the chain rule, a mathematical technique for differentiating a function that is composed of other functions. This enables f(x) to be optimized based on the contribution of the functions in each layer of the DNN.

⁷ You can see the original tutorial on the TensorFlow site. The Keras tutorials are excellent resources for an early introduction to the Keras programming model.


