Chapter 10. Defending Against Adversarial Inputs

In this chapter, we’ll consider some of the methods that have been proposed for detecting and defending against adversarial example attacks. The good news is that some defenses can work. The bad news is that each defense has limitations, and if the attacker is aware of the method being used, they may be able to adapt their attack to circumvent the defense.

This chapter considers three broad approaches to defense:

Improve the model

In the first part of this chapter, we’ll focus on the model itself and techniques that have been proposed for creating more robust neural networks.

Remove adversarial aspects from input

In “Data Preprocessing”, we’ll then look at whether it’s possible to render adversarial input benign before it’s submitted to the model.

Minimize the adversary’s knowledge

Next, “Concealing the Target” will consider ways in which the adversary’s knowledge of the target model and broader processing chain might be reduced to make it more difficult to create successful adversarial examples. As highlighted in Chapter 9, target concealment should not be relied upon as a defense.

There’s currently no single solution to this problem, but it is an active area of research. Table 10-1 summarizes the capabilities of the defense techniques described in this chapter at the time of writing.

Table 10-1. Summary of defenses
Defense | Improve model robustness | Remove adversarial data characteristics | Minimize the adversary’s knowledge
Gradient masking (in “Improving the Model”) | Limited | N/A | N/A
Adversarial training (in “Improving the Model”) | Limited | N/A | N/A
Out-of-distribution confidence training (in “Improving the Model”) | Promising but not guaranteed | N/A | N/A
Randomized dropout at test time (in “Improving the Model”) | Promising but not guaranteed | N/A | N/A
Data preprocessing (in “Data Preprocessing”) | N/A | Limited | N/A
Target concealment (in “Concealing the Target”) | N/A | N/A | Limited

In practice, the capability of an adversary to successfully launch an adversarial attack will be constrained by several factors imposed by the broader processing chain and its security. This chapter will therefore consider both the model and the system in which it is a component.

The chapter concludes with some practical advice for developing robustness against adversarial examples in real-world systems, in “Building Strong Defenses Against Adversarial Input”.

Improving the Model

Let’s start by exploring what can be done to protect the model itself from adversarial examples. For example, can the DNN be retrained so that it is robust to adversarial input? Are there characteristics shared by adversarial examples that we could use to detect an attack? Alternatively, could we predict when the algorithm is likely to perform incorrectly and therefore reduce the confidence of results that are less certain?

It’s important to remember that any alteration to the model for the purpose of defense must not unacceptably impact the algorithm’s accuracy; we must consider the effect of any defense mechanism on the good data as well as on adversarial input.

We’ll consider four approaches:

Gradient masking

This approach alters the model to hide the gradients in the prediction landscape to make it difficult to create adversarial examples.

Adversarial training

This involves training (or retraining) the network so that it learns to distinguish adversarial input. This is done by including adversarial examples in the training data.

Out-of-distribution (OoD) detection

Here we look at whether it is possible to train the network to return not only a prediction, but also a confidence measure of how sure it is of that prediction, based on whether the data lies within the distribution that the network is able to operate over with high accuracy.

Randomized dropout uncertainty measurements

Finally, this approach applies a training technique called randomized dropout to the model post-training to introduce uncertainty into the network’s predictions. This is based on the premise that adversarial inputs result in greater uncertainty, so it may be possible to detect them.

Gradient Masking

Gradient masking1 is a technique that has been proposed to make the calculation of adversarial examples more difficult. The idea is to either hide the gradients of the DNN algorithm’s prediction landscape, or smooth them in such a way as to make them useless to an attacker.

Of course, this approach is only beneficial in scenarios where an attacker has sufficient access to the DNN algorithm, enabling the use of the model’s gradients to develop the attack. Most obviously, it is prudent to conceal the algorithm from an attacker (see “Concealing the Target”) to prevent attacks that exploit model gradients. These types of attacks include the white box gradient-based methods applied to the target (“White Box Methods”) and score-based methods (“Score-Based Black Box Methods”). Score-based methods use predictions to infer gradients, so unless there is a necessity to return a network’s scores to a user, this information should not be exposed directly or indirectly through a response.

A technique called defensive distillation has been suggested to disrupt the generation of adversarial examples by gradient-based methods.2 This approach repurposes a technique called distillation that was initially proposed to reduce the size of neural networks. Distillation of a neural network re-creates the DNN function (its parameters) so that the gradients of the prediction landscape near the training points are smoothed.

How to Distill a Neural Network

To “distill” a network, it is initially trained as usual using labels. The model is then used to create predictions for the training dataset. Next, a new “distilled” version of the model is trained using the training data and predictions (probabilities), rather than the initial hard labels.

Neural networks that are trained on discrete labels have less smooth prediction landscape gradients than those trained on probabilities. Distillation has the effect of reducing the size of the model, but the smoothing may also reduce the model’s accuracy. This trade-off may be justifiable; a smaller footprint may be useful for running a model under hardware constraints, such as on a mobile device.
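To make these steps concrete, here is a minimal sketch of defensive distillation in Keras, reusing the simple Fashion-MNIST architecture from earlier chapters. It assumes train_images and train_labels are already loaded; the temperature T, layer sizes, and number of epochs are illustrative choices, not values taken from the book’s repository.

import tensorflow as tf
from tensorflow import keras

T = 20.0  # distillation temperature (illustrative value)

def build_model():
    # Returns raw logits (no softmax) so that a temperature can be applied explicitly.
    return keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(56, activation='relu'),
        keras.layers.Dense(56, activation='relu'),
        keras.layers.Dense(10)
    ])

# 1. Train the initial (teacher) network as usual on the hard labels, at temperature T.
teacher = build_model()
teacher_at_T = keras.Sequential(
    [teacher, keras.layers.Lambda(lambda z: tf.nn.softmax(z / T))])
teacher_at_T.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])
teacher_at_T.fit(train_images, train_labels, epochs=6)

# 2. Use the teacher to generate soft labels (probabilities) for the training data.
soft_labels = teacher_at_T.predict(train_images)

# 3. Train the distilled network on the soft labels, again at temperature T.
student = build_model()
student_at_T = keras.Sequential(
    [student, keras.layers.Lambda(lambda z: tf.nn.softmax(z / T))])
student_at_T.compile(optimizer='adam', loss='categorical_crossentropy')
student_at_T.fit(train_images, soft_labels, epochs=6)

# At test time the distilled model is used at temperature 1 (a plain softmax),
# which is what produces the smoothed gradients seen by an attacker.
distilled_model = keras.Sequential([student, keras.layers.Softmax()])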

The idea of network distillation is depicted in Figure 10-1. The image illustrates the prediction landscape for a particular classification before and after distillation. From a position in the input space near the training data points in the distilled network, there are no obvious gradients, so methods that extrapolate gradients from a specific point in the input space will not work. The most likely direction of change to create an adversarial example is not obvious.

Image showing prediction landscape around the training points before and after distillation of a network.
Figure 10-1. The effect of distillation on the prediction landscape around training points for a particular classification

Smoothing of model gradients has limited potential for defending against adversarial attacks, because the defense assumes that the gradients are important to establishing the adversarial input. Consider the methods introduced in Chapter 6. The gradients are important for attacks such as FGSM and JSMA. However, they are not important for methods that explore the search space without the use of gradients, such as the white box L-BFGS attack, or limited black box methods such as the boundary attack. An attacker can also circumvent gradient masking, even when using one of the gradient approaches, by introducing a random step in the algorithm to move the input to a location in the input space that avoids masked gradients.

Smoothing of gradients also provides no protection against transfer attacks. In fact, adversarial examples created using gradient-based approaches on a substitute model that has not been subjected to gradient masking can transfer to a target model whose gradients have been masked.

Adversarial Training

Adversarial training is perhaps the most intuitive approach to strengthening a neural network against adversarial examples. After all, the network can be trained to distinguish complicated features and patterns, so surely there are some features of adversarial examples that will enable them to be spotted by the model?

Adversarial input indicates a flaw in the DNN algorithm; it illustrates that the DNN is unable to generalize over all inputs. Therefore, detecting adversarial input and treating it correctly as such would improve the robustness of the algorithm. If we can detect adversarial input, we can respond appropriately, perhaps by reducing the confidence of the output prediction or by flagging the input as “adversarial” to trigger additional verification or actions (essentially, introducing an additional classification for this input). An alternative approach is to train the network to correctly classify adversarial data with its original (nonadversarial) label.

As discussed in Chapter 7, most adversarial examples are believed to lie within adversarial subspaces—continuous regions of misclassification within the input space. Surely, then, we could just train (or retrain) the model with many labeled adversarial examples so that it learns to generalize correctly over these subspaces? Figure 10-2 illustrates this idea.

This notion of training the model using adversarial examples has been explored from several perspectives.3 Unfortunately, although this technique can appear to be a good defense, the trained model is only robust to adversarial examples generated by the same or a similar method as the adversarial training data.

Image depicting input spaces of target and substitute models with adversarial training data indicated
Figure 10-2. Prediction landscape for a model trained with adversarial examples

To understand the limits of adversarial training, think about how the adversarial training data needs to be generated. Performing the training requires the generation of large quantities of training examples, through the methods described in Chapter 6 (or similar ones). Some of these techniques are computationally expensive or require many iterations, so this places a constraint on the ability to generate this data at scale. The iterative boundary attack, for example, takes thousands of iterations to generate a single adversarial example, so it is too slow for creating a whole adversarial training dataset. For this reason, there’s an obvious benefit to using quick adversarial generation methods that use simple approximations (such as FGSM and its variants) to generate the training set.

However, you are simply training the network to recognize a certain type of adversarial example. If, for the sake of speed and resources, the adversarial training data is generated using white box methods that approximate model gradients, you will end up with a DNN only able to recognize the features of adversarial examples generated using similar methods. As soon as an attacker uses a different method to generate examples (such as a boundary attack that doesn’t use gradients), the defense will fail. Furthermore, if an attacker uses the same simple gradient approaches on a (different) substitute model and then performs a transfer attack, the defense will have poor ability to detect adversarial examples because it has learned what looks “adversarial” using a different set of gradients.

“OK,” you might say, “so just create adversarial training data using the boundary attack method too.” But therein lies the problem—new methods for adversarial generation are continually being devised, so you can never be sure that your DNN is entirely robust. Adversarial training will only catch similar adversarial input drawn from training data created with similar methods; it is not guaranteed to protect against methods that you have not thought of or not had time to generate training data for, or methods that have not yet been devised.

Jupyter Notebooks for Adversarial Training

The code snippets in this section are from the following:

The Jupyter notebook chapter10/fashionMNIST_adversarial_training.ipynb contains the code to train a network with adversarial data.

The Jupyter notebook chapter10/fashionMNIST_adversarial_training_evaluation.ipynb contains the code to experiment with and evaluate the adversarially trained network.

A proposed method to improve the defense is to use ensemble adversarial training.4 This technique still uses low-cost methods such as FGSM and JSMA to generate the training data. However, by generating the adversarial training data using different models with different parameters and therefore gradients, you ensure that the model learns adversarial examples that are not tightly coupled to its parameters. This results in greater diversity in the adversarial training data perturbations and a model that exhibits greater ability to recognize adversarial input.
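As an illustration only, here is a minimal sketch of how the adversarial training data might be gathered from an ensemble of substitute models. It reuses the Foolbox API and the generate_adversarial_data helper used elsewhere in this chapter (and described later in this section); the list substitute_models and the batch x_images are assumptions for the sketch.

import numpy as np
import foolbox

ensemble_adv_images = []
ensemble_adv_labels = []

for sub_model in substitute_models:
    # Wrap each substitute model and attack it with a cheap gradient-based method.
    sub_fmodel = foolbox.models.TensorFlowModel.from_keras(sub_model, bounds=(0, 1))
    attack_fn = foolbox.attacks.GradientSignAttack(
                    sub_fmodel,
                    criterion=foolbox.criteria.Misclassification(),
                    distance=foolbox.distances.Linfinity)

    adv_images, _, adv_labels = generate_adversarial_data(
                    original_images=x_images,
                    predictions=sub_model.predict(x_images),
                    attack_fn=attack_fn)

    ensemble_adv_images.append(adv_images)
    ensemble_adv_labels.append(adv_labels)

# The combined, more diverse dataset is then used to augment the training data.
ensemble_adv_images = np.concatenate(ensemble_adv_images, axis=0)
ensemble_adv_labels = np.concatenate(ensemble_adv_labels, axis=0)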

The fact that adversarial training is not guaranteed to create a model able to correctly classify all adversarial examples does not make it worthless. If the model is able to correctly classify even some adversarial inputs, that is likely to be beneficial. Adversarial training should not be relied upon for defense, however.

Let’s try doing some adversarial training to see whether it improves the Fashion-MNIST model’s robustness to adversarial examples.

The first step is to create some adversarial training data. For the purposes of demonstration, we’ll generate this using images from the original training dataset and a weak attack (the simple GradientSignAttack provided with Foolbox).

First, define the attack:

import foolbox
fmodel = foolbox.models.TensorFlowModel.from_keras(model, bounds=(0, 1))

attack_criterion = foolbox.criteria.Misclassification()
attack_fn = foolbox.attacks.GradientSignAttack(fmodel,
                            criterion=attack_criterion,
                            distance=foolbox.distances.Linfinity)

Next, generate adversarial examples from 6,000 of the original training images:

x_images = train_images[0:6000, :]
predictions = model.predict(x_images)

x_train_adv_images, x_train_adv_perturbs, x_train_labels = \
                     generate_adversarial_data(original_images = x_images,
                                               predictions = predictions,
                                               attack_fn = attack_fn) 1
1

generate_adversarial_data is a helper utility included in the repository. It iterates through the provided images to create one adversarial example for each (assuming an adversarial example can be found). It also returns as output additional information on the perturbation distances and labels, which we’ll use.

This produces the following warnings:

Warning: Unable to find adversarial example for image at index:  2393
Warning: Unable to find adversarial example for image at index:  3779
Warning: Unable to find adversarial example for image at index:  5369

There were 3 images out of the 6,000 for which the algorithm was unable to find an adversarial example. We still have 5,997, which will be plenty for training.

Next, augment the training data with these examples and retrain the model:

train_images_adv = np.concatenate((train_images, x_train_adv_images),
                                  axis=0)
train_labels_adv = np.concatenate((train_labels,
                                   np.full(x_train_adv_images.shape[0],
                                           adversarial_label)),
                                  axis=0)

model_adv = keras.Sequential([keras.layers.Flatten(input_shape=(28,28)),
                              keras.layers.Dense(56, activation='relu'),
                              keras.layers.Dense(56, activation='relu'),
                              # 10 Fashion-MNIST classes plus the extra
                              # "adversarial" classification
                              keras.layers.Dense(11, activation='softmax',
                                                   name='predictions_layer')
                         ])
model_adv.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

model_adv.fit(train_images_adv, train_labels_adv, epochs=6)

This generates the following output:

Epoch 1/6
65996/65996 [================] - 5s 72us/sample - loss: 0.5177 - acc: 0.8151
Epoch 2/6
65996/65996 [================] - 4s 67us/sample - loss: 0.3880 - acc: 0.8582
Epoch 3/6
65996/65996 [================] - 4s 67us/sample - loss: 0.3581 - acc: 0.8677
Epoch 4/6
65996/65996 [================] - 5s 69us/sample - loss: 0.3310 - acc: 0.8763
Epoch 5/6
65996/65996 [================] - 4s 58us/sample - loss: 0.3141 - acc: 0.8839
Epoch 6/6
65996/65996 [================] - 4s 64us/sample - loss: 0.3016 - acc: 0.8881
Out[29]:
<tensorflow.python.keras.callbacks.History at 0x181239196a0>

The first thing to check is whether our new adversarially trained model performs as well as the original on the original test data:

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Original model accuracy based on nonadversarial test data:', test_acc)
test_loss, test_acc = model_adv.evaluate(test_images, test_labels)
print('Adversarially trained model accuracy based on nonadversarial test data:',
       test_acc)

This produces the following output:

10000/10000 [================] - 0s 37us/sample - loss: 0.3591 - acc: 0.8699
Original model accuracy based on nonadversarial test data: 0.8699
10000/10000 [================] - 1s 53us/sample - loss: 0.3555 - acc: 0.8707
Adversarially trained model accuracy based on nonadversarial test data: 0.8707

Not bad! It actually performs slightly better, so the training doesn’t appear to have affected the model’s accuracy.

Now we need some adversarial test data. For a proper evaluation, we should assume that the attacker has complete knowledge of the defense and, therefore, the adversarially trained model. However, let’s take a naive approach to begin with for comparison and use the original (nonadversarially trained) model to create the test data.

Define the attack:

import foolbox
fmodel = foolbox.models.TensorFlowModel.from_keras(model, bounds=(0, 1))

attack_criterion = foolbox.criteria.Misclassification()
x_images = test_images[0:600, :]

attack_fn = foolbox.attacks.GradientSignAttack(fmodel,
                            criterion=attack_criterion,
                            distance=foolbox.distances.Linfinity)

Then generate the test dataset. We’ll call this x_test_adv_images1:

(x_test_adv_images1, x_test_adv_perturbs1, x_test_labels1) = \
                     generate_adversarial_data(original_images = x_images,
                              predictions = model.predict(x_images),
                              attack_fn = attack_fn)

Take a look at the confusion matrix demonstrating the adversarially trained model’s performance against the adversarial examples generated with the initial model (Figure 10-3):

show_confusion_matrix(model_adv, x_test_adv_images1, x_test_labels1, class_names)
A confusion matrix illustrating the adversarially trained model's accuracy over test data set 1
Figure 10-3. Code output

This looks really good—the majority of the adversarial test data has been correctly classified by the model.

Unfortunately, this is a poor evaluation. First, we’ve only tested the model’s robustness to a specific attack (see the following note). Second, for a proper evaluation, we need to assume that the attacker has complete knowledge of the model and its defenses. An attacker with complete knowledge is able to generate adversarial data directly against the adversarially trained model, so this is the scenario that needs to be evaluated.

Testing the Adversarially Trained Model Using Different Attacks

Using the Jupyter notebook, you can further experiment with adversarial training using different attacks to generate training data.

You can also check the generated model against test data created using a different attack. Unless the test attack is similar in approach to the one used to train the model, the resulting confusion matrix is unlikely to classify the resulting data as effectively as that shown in Figure 10-3.

The first step is to regenerate the test data using the adversarially trained model. We’ll call this x_test_adv_images2:

fmodel_adv = foolbox.models.TensorFlowModel.from_keras(model_adv,
                            bounds=(0, 1)) 1
attack_fn = foolbox.attacks.GradientSignAttack(fmodel_adv,
                            criterion=attack_criterion,
                            distance=foolbox.distances.Linfinity)

(x_test_adv_images2, x_test_adv_perturbs2, x_test_labels2) = \
                     generate_adversarial_data(original_images = x_images,
                              predictions = model_adv.predict(x_images), 1
                              attack_fn = attack_fn)
1

Notice the use of the adversarial model on these two lines.

This output is generated:

Warning: Unable to find adversarial example for image at index:  76

Figure 10-4 shows the resulting confusion matrix:

show_confusion_matrix(model_adv, x_test_adv_images2, x_test_labels2, class_names)
A confusion matrix illustrating the adversarially trained model's accuracy over test data set 2
Figure 10-4. Code output

You can see that this time the model has performed far worse and has failed to correctly classify any of the adversarial examples (see the diagonal line of zeros from top left to bottom right). This is no surprise, as the examples in x_test_adv_images2 were all developed to fool this adversarially trained model.

Now let’s take a look at whether it’s more difficult for the adversary to create adversarial examples for the adversarially trained model. We’ll plot success rate against required perturbation.

The helper method generate_adversarial_data returns distance measurements for each of the adversarial examples that were found. Assuming the GradientSign method attempts to optimize for the minimum distance, these should indicate the minimum distance required for each adversarial example.

Let’s plot the perturbations required to generate adversarial examples against the original model and the adversarially trained model:

plt.hist((x_test_adv_perturbs1['foolbox_diff'], 1
          x_test_adv_perturbs2['foolbox_diff']), 2
         bins=20, 3
         cumulative=True, 4
         label=('Original model','Adversarially trained model'))

plt.title("Adversarial example success rate")
plt.xlabel("Perturbation")
plt.ylabel("Number of successful adversarial examples")
plt.legend(loc='right')

plt.show()
1

This gives a list of each adversarial example’s perturbation measurement (in this case, the L∞-norm) for the examples generated using the original model.

2

This gives a list of each adversarial example’s perturbation measurement (in this case, the L∞-norm) for the examples generated using the adversarially trained model.

3

This defines the number of histogram “bins.”

4

This specifies a cumulative histogram plot.

Figure 10-5 shows the resulting graph.

Success rate against perturbation is similar for examples generated on the original and adversarially trained models.
Figure 10-5. Code output

What a disappointment! The graph indicates that the adversarial success rate with respect to perturbation is no worse on an adversarially trained network than the original one. The adversarial training hasn’t made it any more difficult to create adversarial examples using the same attack method.

To understand why this is the case, once again consider the prediction landscape. There are thousands of directions in which the image might be perturbed to create an adversarial example. The trained network has closed off some of these options, but there are many more still available to the algorithm. With extensive adversarial training we might remove increasing numbers of options, but this is unlikely to cover every possibility open to the adversarial algorithm (although it might take the algorithm longer to locate adversarial input).

If you look carefully at the lower perturbations on the histogram, you’ll notice that the success rate is slightly better on the adversarially trained network than the original one. This suggests that the algorithm didn’t produce the best results on the original model—in other words, it does not always return an adversarial example with the minimum perturbation possible. This may be because each iteration of the gradient attack is based on the gradients in the prediction landscape at the point of the current image. If those gradients don’t reflect those further away from the image, the algorithm may step in directions that don’t produce the optimum result.

Out-of-Distribution Confidence Training

As we have seen, DNNs sometimes produce incorrect results with high confidence because they fail to generalize correctly for all the possible inputs. Many of the generalizations that a DNN will struggle with fall into the more general out-of-distribution (OoD) category—those inputs that are not within the distribution of the training data, and therefore are not inputs that the network could be expected to safely perform on.

We initially considered OoD inputs in Chapter 5. Figure 10-6 illustrates the concept of OoD data. The input space on the left shows the distribution of training data for three classifications. The fully trained model taking test data shown on the right will work well for test data of a similar distribution to the training data. However, there is no guarantee that it will perform correctly given a point outside this distribution.

Image showing a point in the input space that is not in the same distribution as the training data.
Figure 10-6. The test data (right) has a similar distribution to the training data (left) except for one OoD point.

OoD input is not necessarily adversarial. It might be simply an edge case that was not represented by the training dataset. Alternatively, it may be an unrealistic or nonsensical image. The problem with OoD input is that the network may still return a confident prediction for this data, but the reliability of this prediction is lower than that of input with similarities to the training dataset.

Chapter 1 illustrated this point with OoD adversarial examples that did not look anything like real-life images but resulted in high-confidence predictions for a particular classification. As a reminder, Figure 10-7 illustrates one of these examples.

Image showing adversarial example that is unrecognizable by humans.
Figure 10-7. A digitally generated adversarial example that is OoD but results in a high-confidence prediction of “cheetah” from an image classifier (image from Nguyen et al. 2015)

So, are all adversarial examples also OoD? Not necessarily. While adversarial examples might lie in areas of the input space that are OoD (like the OoD data point shown in Figure 10-6), they may also exploit points in the input space where the algorithm has failed to generalize properly. These points may still be in the same distribution as the training data. The two cases are shown in Figure 10-8, which depicts the prediction landscapes resulting from a training dataset. Adversarial point 1 is clearly OoD, whereas adversarial point 2 lies comfortably within the training data distribution, but at a point where the algorithm has been unable to generalize correctly.

If we were able to detect OoD input, this would not guarantee detection of all adversarial examples. However, associating some confidence with a model’s predictions also has broader benefits. Any method of measuring the distribution of data that the network is most likely to perform well against will be helpful in identifying whether an input falls into the “safe” distribution or not.

There are some methods for checking for realism in images, such as detecting high contrast between neighboring pixels. These methods may be successful in capturing “obvious” OoD data such as the “cheetah” shown in Figure 10-7, but they are less successful at detecting OoD images with clearer shapes and patterns.

Image showing a prediction landscape with two adversarial points: one lying outside the training data distribution (OoD) and one within it.
Figure 10-8. Adversarial examples are not necessarily OoD.

Although it’s difficult to detect OoD statistically outside the network, other techniques have been proposed using the network itself. A promising approach is for the network to calibrate its score (make it less confident) or flag input when it is classed as OoD. To achieve this we train a DNN to output not only predicted scores but also the confidence that the network has in its prediction. For example, the way that this approach would fit into the DNN architecture introduced in “DNNs for Image Processing” is illustrated in Figure 10-9.5

During training, the network learns to estimate confidence in its predictions as it also learns to make the predictions itself.

The training cost function introduced back in Chapter 3 can be rearticulated to optimize the accuracy of its confidence as well as the accuracy of the actual predictions. So, if the network is making bad predictions for a training example, it should return a low confidence score; the network is penalized during training (the cost increases) if it returns a high confidence score instead. Similarly, the modified cost function ensures that a low confidence score returned for a correct prediction is also penalized. The network is trained well when it has not only learned to generate predictions close to the target labels, but can also generate accurate confidence measures for each training value.

Image showing a CNN architecture extended with an additional confidence output.
Figure 10-9. Extending a CNN architecture to calculate confidence scores for predictions

This additional feedback from the network indicates whether the input lies in the distribution of data on which it is safely trained to perform. For example, the cheetah probability score for the image in Figure 10-7 might be high, but if the network works correctly, its confidence in that score will be low. At the time of writing, this is nascent research, but it’s showing promising results.
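As an illustration only, here is one way a confidence output and modified cost function of this kind might be sketched in TensorFlow/Keras. The two-headed architecture, the lambda_conf weighting, and the label-blending trick are assumptions made for the sketch, not the exact formulation from the cited research.

import tensorflow as tf
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(28, 28))
x = Flatten()(inputs)
x = Dense(56, activation='relu')(x)
x = Dense(56, activation='relu')(x)
class_probs = Dense(10, activation='softmax', name='class_probs')(x)
confidence = Dense(1, activation='sigmoid', name='confidence')(x)
conf_model = Model(inputs=inputs, outputs=[class_probs, confidence])

optimizer = tf.keras.optimizers.Adam()
lambda_conf = 0.1   # weight of the confidence penalty (assumed hyperparameter)

@tf.function
def train_step(images, one_hot_labels):
    with tf.GradientTape() as tape:
        probs, conf = conf_model(images, training=True)
        # Blend the prediction towards the true label in proportion to (1 - confidence):
        # the network can "ask for hints" but pays a penalty of -log(confidence).
        blended = conf * probs + (1.0 - conf) * one_hot_labels
        nll = tf.reduce_mean(
            tf.keras.losses.categorical_crossentropy(one_hot_labels, blended))
        conf_penalty = tf.reduce_mean(-tf.math.log(conf + 1e-8))
        loss = nll + lambda_conf * conf_penalty
    grads = tape.gradient(loss, conf_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, conf_model.trainable_variables))
    return loss

# Example usage for one batch (labels converted to one-hot):
#   loss = train_step(train_images[:32], tf.one_hot(train_labels[:32], depth=10))

With a loss of this shape, the network only benefits from claiming low confidence when its own prediction is poor, so the confidence output becomes a usable signal for flagging inputs outside the distribution it performs well on.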

Randomized Dropout Uncertainty Measurements

There is a group of techniques collectively known as regularization used during neural network training to reduce the possibility of the network overfitting to the training data. These techniques reduce the complexity of the neural network, forcing it to generalize more and therefore work over a wider variety of data. Dropout is one of these regularization techniques—it randomly “removes” neurons so that they do not contribute to training iterations, and reinstates them for subsequent iterations. This may seem counterintuitive; surely that makes it more difficult for the network to learn? Yes, it does, but that is the whole point of the technique; it prevents the network from relying too heavily on specific units in the network when making a decision, and thus ensures that the network generalizes better.

Dropout is not only useful for preventing the model from overfitting during training; it can also be used when the model goes live, forcing the model to rely on different units each time it is queried. This introduces some uncertainty to the network, so the predictions for a particular input are not deterministic. Repeated queries to the network using the same input will return different results, and the variance of these results provides a measure of the network’s uncertainty for a particular instance.

In their research, Feinman et al.6 propose a method of adversarial detection called Bayesian neural network uncertainty whereby adversarial examples are detected because the randomized network is less certain of them than their natural counterparts. This defense relies on the premise that the predictions for a nonadversarial input will be more consistent than those for an adversarial one. So, if the same input were presented to a network incorporating randomized dropout multiple times, its predictions would vary more if that input were adversarial. When the variance metric for a specific input is over a defined threshold, the input is classified as adversarial. This approach has shown promising results in detecting adversarial inputs even when the attacker is assumed to have knowledge of the defense.7

Code Example: Dropout for Adversarial Detection

The code snippets in this section are from the Jupyter notebook chapter10/fashionMNIST_dropout_for_detection.ipynb in the book’s GitHub repository.

You can use this notebook to further experiment with randomized dropout, such as by changing the dropout parameters or altering the attack methods.

We’ll use the Keras “functional” API here rather than the “sequential” API used up to now in this book, as it enables us to create a network that incorporates randomized dropout after it has been trained.8

Here is the code to create the same Fashion-MNIST classifier as previously, but with dropout enabled on one of the hidden layers:

from tensorflow.keras.layers import Input, Dense, Flatten, Dropout
from tensorflow.keras.models import Model

inputs = Input(shape=(28,28))
x = Flatten()(inputs)
x = Dense(56, activation='relu')(x)
x = Dropout(0.2)(x, training=True) 1
x = Dense(56, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

model = Model(inputs=inputs, outputs=predictions)

model.compile(optimizer=tf.train.AdamOptimizer(),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
1

This line adds the dropout. The training=True parameter indicates (unintuitively) that dropout should be applied post-training as well as during training. This will add uncertainty to the network’s predictions. The dropout rate (and hence the amount of uncertainty introduced) is determined by the parameter passed to the Dropout layer. You can experiment with this level of uncertainty in the notebook to see its effect.

This generates the following output:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_5 (InputLayer)         (None, 28, 28)            0
_________________________________________________________________
flatten_4 (Flatten)          (None, 784)               0
_________________________________________________________________
dense_12 (Dense)             (None, 56)                43960
_________________________________________________________________
dropout_4 (Dropout)          (None, 56)                0     1
_________________________________________________________________
dense_13 (Dense)             (None, 56)                3192
_________________________________________________________________
dense_14 (Dense)             (None, 10)                570
=================================================================
Total params: 47,722
Trainable params: 47,722
Non-trainable params: 0
_________________________________________________________________
1

Here’s the additional dropout layer.

Next, we train the model and take a look at its accuracy against the test data in the Fashion-MNIST dataset:

model.fit(train_images, train_labels, epochs=12) 1
1

A model incorporating dropout during training requires more epochs to establish the same accuracy. Hence the epochs parameter is set higher than in the previous examples.

This produces the following output:

Epoch 1/12
60000/60000 [================] - 4s 63us/sample - loss: 0.3243 - acc: 0.8777
Epoch 2/12
60000/60000 [================] - 4s 63us/sample - loss: 0.3174 - acc: 0.8812
Epoch 3/12
60000/60000 [================] - 4s 61us/sample - loss: 0.3119 - acc: 0.8834
Epoch 4/12
60000/60000 [================] - 4s 61us/sample - loss: 0.3114 - acc: 0.8845
Epoch 5/12
60000/60000 [================] - 4s 63us/sample - loss: 0.3042 - acc: 0.8854
Epoch 6/12
60000/60000 [================] - 4s 61us/sample - loss: 0.2987 - acc: 0.8882
Epoch 7/12
60000/60000 [================] - 4s 61us/sample - loss: 0.2982 - acc: 0.8870
Epoch 8/12
60000/60000 [================] - 3s 53us/sample - loss: 0.2959 - acc: 0.8889
Epoch 9/12
60000/60000 [================] - 4s 61us/sample - loss: 0.2931 - acc: 0.8902
Epoch 10/12
60000/60000 [================] - 4s 63us/sample - loss: 0.2894 - acc: 0.8909
Epoch 11/12
60000/60000 [================] - 3s 53us/sample - loss: 0.2859 - acc: 0.8919
Epoch 12/12
60000/60000 [================] - 4s 66us/sample - loss: 0.2831 - acc: 0.8927
Out[11]:
<tensorflow.python.keras.callbacks.History at 0x156074c26d8>

Let’s check the network’s accuracy:

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Model accuracy based on test data:', test_acc)

This produces the following output:

10000/10000 [================] - 0s 40us/sample - loss: 0.3827 - acc: 0.8689
Model accuracy based on test data: 0.8689

If you rerun this cell in the Jupyter notebook, the accuracy will keep changing due to the uncertainty of the network.

Now we require a batch of nonadversarial images and a batch of adversarial ones. We’ll use the test data supplied with Fashion-MNIST and some adversarial images that have been generated previously. For this example, we’ll use the images generated using the FastGradient Foolbox attack against the original model:

import numpy as np

num_images = 1000
x_images = test_images[:num_images]
x_images_adv = .... 1
1

To generate the adversarial images, use the generate_adversarial_data helper utility that we used previously. For conciseness, this code is not included here.

We repeatedly generate predictions using the dropout model for both batches of data. The number of times each image will be submitted to the model is defined by L:

L = 100
num_classes = 10

predictions_matrix = np.zeros((L, num_images, num_classes))  1
predictions_matrix_adv = np.zeros((L, num_images, num_classes)) 2

for i in range(L):
    predictions = model.predict(x_images)
    predictions_adv = model.predict(x_images_adv)
    predictions_matrix[i] = predictions
    predictions_matrix_adv[i] = predictions_adv
1

predictions_matrix is a matrix representing the predictions for all the non-adversarial images over L submissions.

2

predictions_matrix_adv is a matrix representing the predictions for all the adversarial images over L submissions.

Next we calculate a single value of uncertainty for each image that represents the amount of variation over all the image’s predictions. Here’s the function for determining the uncertainty of a set of predictions for a single image:

def uncertainty(predictions):
    return(np.sum(np.square(predictions))/predictions.shape[0]
           - np.sum(np.square(np.mean(predictions, axis=0)))) 1

uncertainty_results = np.zeros((num_images))
uncertainty_results_adv = np.zeros((num_images))

for i in range(num_images): 2
    uncertainty_results[i] = uncertainty(predictions_matrix[:,i])
    uncertainty_results_adv[i] = uncertainty(predictions_matrix_adv[:,i])
1

This is an implementation of the following calculation defined in Carlini and Wagner 2017: $U(x) = \left(\frac{1}{L}\sum_{i=1}^{L} \lVert F_r(x) \rVert\right) - \left\lVert \frac{1}{L}\sum_{i=1}^{L} F_r(x) \right\rVert$, where $\lVert y \rVert$ is taken to be the square of the L2-norm and $F_r$ is the randomized network. It’s simply a way of calculating a measure of variance for a set of predictions.

2

Calculate the uncertainty measure for each image to generate two lists—one for the uncertainties associated with each of the nonadversarial images and one for the uncertainties associated with the adversarial ones.

Finally, let’s plot the results:

import matplotlib.pyplot as plt

plt.hist((uncertainty_results, uncertainty_results_adv),
            bins=50,
            label=('Nonadversarial','Adversarial'))

plt.title("Model prediction uncertainty")
plt.xlabel("Uncertainty per image")
plt.ylabel("Number of images")
plt.legend(loc='right')

plt.show()

Figure 10-10 shows the resulting graph.

Output image from code depicting the variation of responses for adversarial and non-adversarial images.
Figure 10-10. Code output

You can see that the predictions returned from the model tend to have greater uncertainty for adversarial images, so this is a good result. In this case, the threshold at which we would classify data as “adversarial” or “normal” is not clear-cut.9 The threshold might be established by using an ROC curve, as described in the following note.

ROC Curves

An important tool in machine learning is the receiver operating characteristic (ROC) curve, often used to address the problem of setting a prediction threshold at which a binary decision is made. The ROC curve plots the true positive rate against the false positive rate in the ROC space. This plot can be used to establish a threshold by comparing the true positives (the benefit) against the false positives (the cost).

In the case of an adversarial defense, the binary threshold might be the threshold at which the probability of an input being “adversarial” means that it is treated as such. If the defense identifies adversarial inputs when it’s 50% certain, for example, what is the false positive rate? That is, how many nonadversarial inputs are misclassified as adversarial with this threshold? The acceptable trade-off between false positives and false negatives will depend on the scenario; is it more important to catch adversarial input at the cost of reducing the accuracy across nonadversarial data, or is it better to ensure that nonadversarial data is not misclassified? An ROC curve can be used to articulate this as part of the evaluation, allowing the thresholds to be established for specific scenarios.
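For example, a threshold could be derived from the uncertainty values computed earlier. The following sketch assumes scikit-learn is available; the choice of “best” point (Youden’s J statistic) is just one possible criterion and would depend on the operational scenario.

import numpy as np
from sklearn.metrics import roc_curve

# Combine the two lists of uncertainties computed above and label them:
# 0 = nonadversarial, 1 = adversarial.
scores = np.concatenate((uncertainty_results, uncertainty_results_adv))
labels = np.concatenate((np.zeros(num_images), np.ones(num_images)))

fpr, tpr, thresholds = roc_curve(labels, scores)

# Pick the threshold that maximizes (true positive rate - false positive rate).
best = np.argmax(tpr - fpr)
print('Suggested threshold:', thresholds[best],
      'TPR:', tpr[best], 'FPR:', fpr[best])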

Using the Jupyter notebook, you could also experiment with other attacks, or with adversarial data generated on the dropout model itself (with dropout enabled during training only). If you try to generate the adversarial examples using a model that has dropout enabled after training, it may give interesting results; it’s difficult for attacks to work with a continually shifting prediction landscape.

A model using dropout to detect adversarial input might be used alongside a non-dropout operational model to ensure that the functional behavior of the system is deterministic.

Data Preprocessing

Now let’s take a different approach and consider whether we can remove adversarial data in the broader processing chain prior to it being submitted to the DNN.

We’ll focus on two areas:

Preprocessing in the broader processing chain

We’ll first look at the effects preprocessing in the broader processing chain might inadvertently have on adversarial input.

Intelligent removal of adversarial content

Next we’ll consider whether there are any proven statistical methods to deliberately remove adversarial content before it reaches the DNN. This could be detecting the adversarial input itself or removing aspects from all data that are likely to cause the model to return a result that is incorrect.

Preprocessing in the Broader Processing Chain

In real-world applications, DNNs do not exist in isolation. The efficacy of an adversarial example depends on many more factors when tested outside the laboratory environment as a result of the broader processing chain. Our adversary cannot be confident that their beautifully crafted adversarial examples will not be rendered benign (either deliberately or unintentionally) or detected by the target system’s processing.

As part of the wider processing chain, network and computer security solutions provide protection against threats by automatically detecting data likely to contain malicious content. Most commonly, this is achieved through firewalls and antivirus software that assess risk based on either the provenance of the data (whether or not it is acquired from a trustworthy source) or the data itself (whether it contains content indicative of malware).

Unfortunately, because DNNs processing image, audio, and video often take data generated directly from the physical world (e.g., by cameras) or from untrusted digital sources (such as web uploads), adversarial examples cannot be detected according to trustworthiness. Similarly, detection based on content is challenging because adversarial examples appear benign to the target system.

Once the data has passed any firewall or other security boundaries, it is subject to organizational processing. It’s helpful to consider a couple of different example scenarios:

Example: Social network image upload filtering

A processing chain for uploading images or video to a social networking site might involve images undergoing various preprocessing steps prior to vectorization and ingest to the neural network. This might include transformations such as compression, normalization, smoothing, and resizing.

Example: Digital assistant

In this scenario, the input is also subject to limitations or preprocessing at the input sensor (in this case, the microphone) that may introduce noise to the signal or filter the data being captured.10 It then undergoes further processing, such as Fourier transformation, before reaching the DNN.

Often, it is the precision of the data that is being exploited to make input adversarial, so a transformation that ultimately removes some of the precision from the data may render an input nonadversarial.

Consider the impact a reduction in image precision might have on adversarial examples. In Chapter 4 the precision of an image was defined as its spatial resolution and pixel density. An adversarial example dependent on a few pixels (perhaps generated by minimizing the L0-norm) may be more likely to be impacted by reduced spatial resolution because the changed pixels may be lost when pixel density is reduced. Conversely, an adversarial example with minuscule changes to pixels across the whole image (possibly generated by minimizing the L∞-norm) may be less robust to reduced color resolution, where those subtle variations in color are lost. Adversarial input that unwittingly exploits characteristics that are lost or approximated during the processing chain may therefore be rendered less robust.

There are several reasons for data reduction during preprocessing, including:

Normalization and compression

The processing chain may perform normalization steps (for example, converting data to a consistent format or resizing images prior to submitting them to the DNN). If this normalization step results in a reduction of information contained in the data, it may remove the information that makes the input adversarial.

Compression is also an important aspect of the digital encoding step, and this may be relevant to adversarial input if it results in a reduction in precision (see the following note).

The precision of any digital information is constrained to the lowest precision of its previous processing. An image stored as 640 x 480 pixels, suitable for photographic display, will always retain that resolution; displaying it on an HD television screen will not increase the spatial detail, nor will storing it in a file format with higher resolution.11

Removal of noise and extraneous data

The processing chain will remove noise or extraneous data from the source if it is likely to aid the processing of that data or improve it for human perception.

Noise refers to distortion in the data, often introduced during the data capture step. Visual distortion in images might show as pixels that do not represent the scene accurately. Speckles, for example, might appear in an image taken in a low-light situation. Audio noise might manifest itself as crackles or interference introduced by the microphone sensor or audio equipment, reverberation, echo, and background noise.

Gaussian blur is a blurring method commonly used to remove noise from images. This could be performed in the broader system preprocessing unrelated to the DNN to clean up images. Any adversarial perturbation or patch that might be categorized as “noise” or not relevant by the processing chain is subject to removal.

Other extraneous data might also be removed during preprocessing. For example, speech processing systems might exploit specific audio processing techniques to extract the most relevant aspects of the sound—those that correspond to the human vocal tract—using MFC (as introduced in “Audio”).

Lossless and Lossy Data Compression

Digital, audio, and video formats are often compressed to save space. When compression is simply a shorthand way of storing the same data using fewer bytes, it is known as lossless.

In contrast, lossy compression uses intelligent algorithms to also remove data that is likely to be superfluous. This has the advantage that the image, audio, or video takes up less space without incurring any noticeable reduction in quality. Any data lost during lossy compression will be lost forever; it is not reinstated when the image, audio, or video is uncompressed.

For example, the MP3 format compresses audio data using intelligent (lossy) compression that bases the bit rate on the audio’s complexity; less complex aspects may be stored at a lower bit rate.12 JPEGs also use lossy compression to reduce image size by removing information that is nonessential and unlikely to be missed by humans.

Data preprocessing may thwart simple attacks where the adversary is not aware of the transformation steps.13 It may also make attacks more difficult by placing additional constraints on the attack. However, data preprocessing should not be relied upon for effective defense.
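As a simple illustration of this kind of transformation, the following sketch applies a Gaussian blur to a batch of Fashion-MNIST style images before they are passed to the model. It assumes SciPy is available; the sigma value is illustrative and a defender would need to tune it against both clean and adversarial data.

import numpy as np
from scipy.ndimage import gaussian_filter

def preprocess_images(images, sigma=0.5):
    """Apply a Gaussian blur to each image in a batch of shape (N, 28, 28)."""
    blurred = np.stack([gaussian_filter(img, sigma=sigma) for img in images])
    return np.clip(blurred, 0.0, 1.0)

# Compare the effect on clean data against the effect on adversarial data, e.g.:
#   model.evaluate(preprocess_images(test_images), test_labels)
#   model.predict(preprocess_images(x_test_adv_images1))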

Intelligently Removing Adversarial Content

For now, there are no proven statistical methods for detecting adversarial examples prior to their submission to the DNN (see the following note for a discussion).

Statistical Methods for Detecting Adversarial Examples

A few methods have been proposed for the statistical detection of adversarial examples.

For example, Grosse et al.14 use data distribution methods to ascertain whether adversarial input can be distinguished using a technique called Maximum Mean Discrepancy, and Feinman et al.15 propose analyzing the distribution of the final hidden layer neural network outputs to detect adversarial input, rather than the raw input. The output of this hidden layer represents the extracted higher-level features of the input, so this approach considers the distribution of semantic information, rather than raw pixel data.

Unfortunately, these approaches have not yet proven to be effective defenses.16

As an alternative approach, it might be easier to apply an intelligent transformation that will not affect the classification of nonadversarial input, but will alter adversarial input so that the characteristics that make it adversarial are removed. This doesn’t require detecting input that is adversarial, just removing aspects of the input that are likely to be exploited for “adversariality” during a preprocessing step before the data is passed to the DNN.

For example, it may be that some pixels in an image are not very important in determining the image’s classification for nonadversarial data, but are exploited by adversarial examples to force the DNN to create an incorrect result. Removing these pixels from the data might remove “adversariality” while not adversely affecting the predictions for “good” data. Similarly, removal of audio frequencies outside speech thresholds removes the potential for creating adversarial input outside the vocal range without affecting the effectiveness of a speech recognition system.

One approach that has been explored is the use of principal component analysis (PCA), a mathematical technique to identify the characteristics of data that are most influential in a decision. PCA has been tested as a way to establish which parts of the data are exploited by adversarial examples but do not influence the decisions for good data. Unfortunately, it turns out that there aren’t any obvious characteristics that influence “adversariality,” so once again, this is not yet an effective defense.17
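To make the idea concrete, here is a minimal sketch (assuming scikit-learn is available) of a PCA-based filter that reconstructs each input from the top principal components of the training data before classification. As noted above, this is for illustration only; it has not proven to be an effective defense, and the number of components is an arbitrary choice.

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on the flattened (N, 784) training images.
flat_train = train_images.reshape(len(train_images), -1)
pca = PCA(n_components=50).fit(flat_train)

def pca_filter(images):
    """Project images onto the top components and reconstruct them."""
    flat = images.reshape(len(images), -1)
    reconstructed = pca.inverse_transform(pca.transform(flat))
    return np.clip(reconstructed, 0.0, 1.0).reshape(images.shape)

# Example usage:
#   model.predict(pca_filter(x_test_adv_images1))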

Concealing the Target

An important aspect of target concealment is that the target is not just the DNN; it’s the complete processing system, including the defense mechanisms in place.

Chapter 7 explored the challenges of generating robust adversarial content in the real world. Lack of knowledge of the DNN or an inability to query the target greatly impacts the ease with which adversarial input can be created. Conversely, knowledge of the DNN enables the attacker to create a replica to develop and sharpen adversarial input prior to launching it on the target system. The ability to test an attack on the target system is also a valuable aid in either verifying adversarial input or developing an attack using black box methods (the latter usually requiring many queries to the target).

If your organization is using a pretrained, openly or commercially available model, it will be fairly easy for the adversary to generate a replica DNN to develop accurate adversarial examples. As seen in Chapter 7, even access to the training data provides the adversary with sufficient information to generate a model substitute close enough to the original for adversarial examples to successfully transfer. However, while basing a DNN algorithm on a commercially sourced or openly available model may not be ideal, it may be the only practical option given the quantities of labeled data typically required to train a network.

Knowledge of the complete processing chain includes knowledge of preprocessing steps that might transform adversarial input to nonadversarial and the treatment of the output from the neural network. This includes prediction thresholds for making decisions and the organizational response to such predictions.

Knowledge of any active defenses in place includes the knowledge of methods to identify or remove adversarial input. Knowledge of such defenses might also be established through testing—querying the system and checking the broader organizational response to adversarial input. For example, if the attacker is aware that, on identifying an adversarial patch, you will then use that patch to search for other adversarial images, they may know not to reuse the patch, or might even exploit it to generate false positives.

There are practical measures that can be put in place to limit an adversary’s access to the target and, therefore, their ability to generate adversarial input. These include:

Throttling queries

A black box direct attack requiring high volumes of queries might be made more difficult by hindering the speed at which queries can be made. However, throttling based on client identification may have little effect, as a determined attacker will simply issue queries using a variety of whichever identifiers the throttling is based on (a variety of IP addresses or accounts, for example).

Detecting based on query patterns

Alternatively, the target system could detect an attempted black box direct attack by testing for large quantities of very similar but nonidentical inputs, because such an attack requires many similar queries, each an iterative tweak to the previous input. For example, input similarity might be detected quickly through image hashing. Image hashing assigns a “hash” value to an image, where similar images are assigned hash values that are close to each other.18 There is, of course, the risk that a high volume of similar queries is not necessarily adversarial, so the relevance of this defense depends on the scenario. (A minimal sketch of this idea follows this list.)

Minimizing feedback

An adversary can exploit responses from queries to the target system to generate robust adversarial examples. These responses might be used to develop the adversarial input itself, or to test the adversarial input after it has been initially crafted on a substitute model. Reducing the amount of useful information returned from a query will increase the difficulty of generating adversarial input. This includes ensuring that model scores are not released and error messages do not reveal unnecessary information.

Providing nondeterministic responses

A nondeterministic response is an extremely powerful defense if it prevents the attacker from establishing the detailed workings of your system. “Randomized Dropout Uncertainty Measurements” presents an example of such an approach, but nondeterminism could also be introduced in the broader processing chain if it was acceptable to the operational scenario.
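Below is a minimal sketch of the query-pattern idea mentioned above, using a simple average hash implemented with NumPy. The hash function, distance threshold, and window of recent queries are all illustrative assumptions rather than code from the book’s repository.

import numpy as np

def average_hash(image, hash_size=8):
    """Compute a 64-bit average hash for a 2D grayscale image in [0, 1]."""
    h, w = image.shape
    # Crude downsampling to hash_size x hash_size by block averaging.
    small = image[:h - h % hash_size, :w - w % hash_size]
    small = small.reshape(hash_size, h // hash_size,
                          hash_size, w // hash_size).mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def hamming_distance(hash1, hash2):
    return int(np.sum(hash1 != hash2))

# Keep hashes of a rolling window of recent queries (thresholds are scenario-dependent).
recent_hashes = []

def is_suspicious(image, max_distance=5, max_near_duplicates=10):
    query_hash = average_hash(image)
    near_duplicates = sum(1 for h in recent_hashes
                          if hamming_distance(h, query_hash) <= max_distance)
    recent_hashes.append(query_hash)
    return near_duplicates > max_near_duplicates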

Building Strong Defenses Against Adversarial Input

This chapter has introduced a combination of approaches for detecting or removing adversarial input.

As the robustness of neural networks increases, so will the sophistication of the attacks. This continual process—an “arms race” between attacker and defender—is similar to the evolution of malware detection or spam email filtering, and the adversarial example landscape will similarly change along with the available defenses.

Open Projects

While there are currently no sure defenses, several initiatives seek to bring the exploration of adversarial attacks and defenses into the public domain and to help improve the robustness of DNN algorithms by pitting attacks against defense mechanisms. Some of these were mentioned previously, in Chapter 6. Good places to start include:

CleverHans

CleverHans is an open source library and code repository for the development of attacks and associated defenses with the aim of benchmarking machine learning system vulnerability to adversarial examples.19

Foolbox

Foolbox is a toolbox for creating adversarial examples to enable testing of defenses.20 Start by reviewing the documentation; a minimal usage sketch appears after this list.

IBM’s Adversarial Robustness Toolbox

This library’s code repository includes adversarial attacks, defenses, and detection methods. It also supports robustness metrics.

Robust ML

Robust ML aims to provide a central website for learning about defenses and their analyses and evaluations.

Robust Vision Benchmark

Robust Vision Benchmark provides a platform for testing the effectiveness of attacks and the robustness of models.
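As an illustration of how such toolkits are used in evaluation, here is a minimal Foolbox sketch that measures how often a bounded PGD attack succeeds against a Keras classifier. It assumes Foolbox 3.x (the API differs in earlier releases); the model filename and the placeholder test batch are illustrative and should be replaced with your own model and genuinely labeled test data:

import foolbox as fb
import tensorflow as tf

# Assumes a trained tf.keras classifier with inputs scaled to [0, 1];
# the filename is illustrative.
model = tf.keras.models.load_model("my_classifier.h5")
fmodel = fb.TensorFlowModel(model, bounds=(0, 1))

# Placeholder batch: substitute real, correctly labeled test data,
# with shapes matching your model's input.
images = tf.random.uniform((16, 28, 28, 1))
labels = tf.zeros((16,), dtype=tf.int64)

# Projected gradient descent attack, bounded by an L-infinity budget.
attack = fb.attacks.LinfPGD()
raw, clipped, success = attack(fmodel, images, labels, epsilons=0.03)

# The fraction of inputs the attack flipped is a rough measure of (non)robustness.
print("Attack success rate:", float(success.numpy().mean()))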

Competitions

Several competitions have encouraged participation in the generation of adversarial attacks and defenses, including some organized by Google.21

In addition, see the Unrestricted Adversarial Examples Challenge and Kaggle.22

Taking a Holistic View

Defense depends on the context in which the DNN is deployed and the risks that adversarial examples could pose to your organization. If you have complete control over and trust in the data that the AI is processing, there may be no risk of attack from adversarial input.

Developing end-to-end solutions that are robust to adversarial examples requires a multipronged approach. It is important to consider the complete system and not just the model (or models) used by your organization in isolation. The risk of some threats can be reduced through simple process or processing chain changes.

For example, you may not be able to remove capabilities (resources and skills) from an adversary, but you can increase the difficulty of an attack by reducing the attacker’s ability to affect the input or by removing access to target information. This could be achieved by preventing inadvertent and unnecessary information leakage in responses. Where there is a risk of a physical-world attack, simple nontechnical measures, such as monitoring the physical areas where an adversarial attack could take place, may be sufficient to remove the threat.

In practice, many applications already have additional Information Assurance (IA) protecting the AI components. For example, digital assistants provide additional levels of security to prevent inadvertent commands, such as requiring authentication before performing impactful actions like money transfers, and providing audio responses to voice commands. You may also be able to reduce the risk by establishing the veracity of the model’s output through information acquired from other data sources. An example of this is an autonomous vehicle that augments its knowledge with camera data but does not wholly rely on this image information.

Defenses should be evaluated assuming that the attacker has complete knowledge. That is, you should evaluate the robustness of a system to attack with full knowledge of the model, the processing chain, and all defenses. You should also use the strongest attacks available to perform any evaluation, which requires remaining informed of the latest developments in attacks and defenses. The evaluation is not a static one-time process, but is ongoing as better attacks and defenses are developed. Evaluation should be a combination of formal assessments and cybersecurity testing of the complete system, so that adversarial examples are incorporated into any “red-blue” team approach to testing.

A holistic view is not only about the defenses in place in the technical solution, but also about understanding the broader impact that such attacks have on the organization and how to prevent inappropriate responses. For example, having a human in the loop to check alerts from surveillance cameras prior to acting on them may be appropriate to prevent a DoS attack. The AI processing then performs a triage of the surveillance data, leaving the ultimate decision to a human.

If you are able to detect adversarial attacks, consider your response. It may be appropriate to log and otherwise ignore the attempted attacks, similar to dealing with spam email. Alternatively, in some scenarios, it may be appropriate to respond more actively; repeated detection of adversarial input could be grounds to (for example) limit a user’s subsequent access to a social media platform. If you are explicitly detecting adversarial input, ensure that your organization does not respond in a way that would leave it open to a DoS attack if it were flooded with adversarial examples.

Finally, detecting and dealing with adversarial input is only part of the assurance of machine learned models. Assurance also involves ensuring that a model is able to operate safely over the inputs it could receive, whether they are adversarial or not. In addition, there are other adversarial threats to machine learning that should be considered as part of information assurance, such as poisoning of training data impacting the integrity of the model, and model “reverse engineering” to extract confidential training data.

1 This term was initially coined by Nicolas Papernot et al., in “Practical Black-Box Attacks Against Machine Learning,” Sixth International Conference on Learning Representations (2018), http://bit.ly/2IrqeJc.

2 Nicolas Papernot et al., “Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks” (2016), http://bit.ly/2KZXfOo.

3 See, for example, Goodfellow et al., “Explaining and Harnessing Adversarial Examples.”

4 Florian Tramèr et al., “Ensemble Adversarial Training: Attacks and Defenses,” International Conference on Learning Representations (2018), http://bit.ly/2XldcFh.

5 Terrance De Vries and Graham W. Taylor, “Learning Confidence for Out-of-Distribution Detection in Neural Networks” (2018), http://bit.ly/2XZHpH1.

6 Reuben Feinman et al., “Detecting Adversarial Samples from Artifacts” (2017), http://bit.ly/2XpavTe.

7 Nicolas Carlini and David Wagner, “Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods” (2017), http://bit.ly/2WTMhBe.

8 For details of the two different programming approaches, refer to the Keras documentation for the Sequential model and the functional API guide.

9 The researchers achieved better results than this, with a clearer distinction between normal and adversarial. This may be due to use of more accurate models than our very simple classifier.

10 Other challenges imposed by the physical environment were discussed in Chapter 8.

11 There are processing techniques that increase the resolution of image and audio through inference of missing data, but these techniques would not reinstate adversarial perturbations previously lost during data compression and normalization.

12 The Fourier transform is also used in MP3 compression.

13 See, for example, Xin Li and Fuxin Li, “Adversarial Examples Detection in Deep Networks with Convolutional Filter Statistics,” International Conference on Computer Vision (2017), http://bit.ly/2FjDIVu. The authors show that passing a filter over an image can be successful in removing the adversariality from examples generated by simple techniques such as FGSM.

14 Kathrin Grosse et al., “On the (Statistical) Detection of Adversarial Examples” (2017), http://bit.ly/2IszblI.

15 Reuben Feinman et al., “Detecting Adversarial Samples from Artifacts” (2017), http://bit.ly/2XpavTe.

16 See Carlini and Wagner, “Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods.”

17 In Carlini and Wagner, “Adversarial Examples Are Not Easily Detected.”

18 Not to be confused with a cryptographic hash, where the hash value does not indicate similarity of the input.

19 Nicolas Papernot et al., “Technical Report on the CleverHans v2.1.0 Adversarial Examples Library” (2017), http://bit.ly/2Xnwav0.

20 Jonas Rauber et al., “Foolbox: A Python Toolbox to Benchmark the Robustness of Machine Learning Models” (2018), http://bit.ly/2WYFgPL.

21 Alexey Kurakin et al., “Adversarial Attacks and Defences Competition” (2018), organized as part of the Neural Information Processing Systems (NIPS) conference 2017, http://bit.ly/2WYGzy9.

22 Several competitions as part of NIPS 2017.
