Chapter 6. Methods for Generating Adversarial Perturbation

Chapter 5 considered the principles of adversarial input, but how are adversarial examples generated in practice? This chapter presents techniques for generating adversarial images and provides some code for you to experiment with. In Chapter 7 we’ll then explore how such methods might be incorporated into a real-world attack where the DNN is part of a broader processing chain and the adversary has additional challenges, such as remaining covert.

Open Projects and Code

There are several initiatives to bring the exploration of adversarial attacks and defenses into the public domain, such as CleverHans, Foolbox, and IBM’s Adversarial Robustness Toolbox. These projects are detailed further in Chapter 10.

For consistency, all the code in this book uses the Foolbox libraries.

Before considering the methods for creating adversarial input, you might wonder—how difficult is it to create an adversarial example simply by trial and error? You might, for example, add some random perturbation to an image and see the effect it has on the model’s predictions. Unfortunately for an adversary, it isn’t quite so simple. During its learning phase, the DNN will have generalized from the training data, so it is likely to have resilience to small random perturbations; such changes are therefore unlikely to be successful. Figure 6-1 illustrates that even when every pixel color value has been incrementally perturbed by a significant random amount, the ResNet50 classifier still makes a correct classification. Misclassification only occurs when the perturbation is visible.

Figure 6-1. Resulting ResNet50 predictions for the koala image with random perturbation added; the maximum amount of perturbation per pixel is shown for each iteration

To effectively generate adversarial examples, an adversary will need to be far more cunning. There are several approaches that they might take, each assuming a different level of knowledge of the DNN algorithm. We can categorize the methods for generating adversarial input as follows, based on the attacker’s level of access to the model:

White box

These methods exploit complete knowledge of the DNN model to create adversarial input.

Limited black box

These methods refine adversarial input based on an output generated from the model or from the system in which it resides. For example, the output might be simply a final classification.

Score-based black box

These methods refine adversarial input based on the raw predictions (scores) returned from the DNN. Score-based methods may have access to all of the scores or just the highest (for example, the top 10) scores. Score-based methods lie somewhere between white box and limited black box methods; they require access to more detailed responses than a limited black box attack, but do not require access to the model algorithm as a white box attack does.

A Naive Approach to Generating Adversarial Perturbation

To experiment with the code used for Figure 6-1, use the Jupyter notebook chapter06/resnet50_naive_attack.ipynb on the book’s GitHub site.

For the Fashion-MNIST classifier, use the Jupyter notebook chapter06/fashionMNIST_naive_attack.ipynb.

The information available to the attacker for each of these three methods is depicted pictorially in Figure 6-2.

Figure 6-2. Information available in white box, score-based black box, and limited black box methods

The remainder of this chapter considers each of these methods in turn.

White Box Methods

White box methods involve complete knowledge of the DNN model (its parameters and architecture) and use mathematical optimization to establish adversarial examples based on the gradients of the input space landscape discussed in Chapter 5. Such techniques are particularly interesting because they provide insight into the weaknesses inherent within DNNs. This section explains how these white box methods work.

Searching the Input Space

Chapter 5 introduced the notion of moving an image across classification boundaries in the input space through carefully selected perturbations. The aim was to minimize the movement in the input space (as measured by one of the Lp-norms) as well as achieving the required adversarial goal—be it a targeted or untargeted misclassification. Let’s reconsider the Fashion-MNIST example from that chapter and the impact that altering a coat image has on the predictions returned from the DNN. We’ll focus on the “Coat” prediction landscape, as highlighted in Figure 6-3.

Figure 6-3. Untargeted attack—moving an image outside the “Coat” classification area of the input space

Perhaps the most obvious approach to finding a position in the input space that achieves the adversarial goal is to simply search outward from the initial image. In other words, start with the image, test a few small changes to see which move the predictions in the required direction (i.e., which lower the strength of the “Coat” prediction in our example), then repeat on the tweaked image. This is essentially an iterative search of the input space, beginning at the original image and moving outwards until the predictions change in such a way as to alter its classification, as shown in Figure 6-4.

Although this might appear to be a pretty simple task, this search is in fact a considerable challenge in itself. There are simply so many different pixel changes and combinations that need to be explored. One option might be to get the answer by brute force, perhaps experimenting with all the possible small perturbations to an input to see what gives the best result. However, this is not computationally feasible, as the following note explains.

Many Possible Perturbations

Let’s assume an image resolution of 224 x 224 pixels. Say we want to generate all the possible very small perturbations to that image by restricting the change to any specific pixel value to either plus or minus ε, where ε is a small amount. If we could generate all these perturbations, perhaps we could test each one to see whether any fulfilled our adversarial criteria. But how many variations would there be to test?

Recall from Chapter 5 that a low-resolution (224 x 224) color image is represented by 150,528 pixel values (where each pixel has 3 values for red, green, and blue).

For each possible perturbation that we want to generate from the original image, an individual pixel value might remain the same, increase by ε, or decrease by ε.

Therefore, there are 3^150,528 combinations of perturbation whereby each pixel value changes by at most ε from its original value. To be pedantic, we should subtract 1 from this value so we don’t count the perturbation where every pixel remains the same. Try putting this into your calculator and you will get an overflow error. Alternatively, pop it into a Jupyter notebook cell using:

pow(3,150528) - 1

(The number returned is too large to include here.)

The search approach may be computationally intractable, but access to the DNN algorithm grants the attacker a huge privilege; they can use the algorithm to generate a mathematical approximation to this search, which reduces the number of search combinations.

In their initial paper,1 Szegedy et al. use the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm to speed up the exploration of the area surrounding the original image to find adversarial examples. The L-BFGS algorithm is a mathematical technique that allows more effective searching of the input space by approximating the probability gradients in the input space near a particular point. The “limited-memory” aspect of the algorithm is a further approximation to reduce the amount of computer memory required during the iterative search.

The L-BFGS algorithm proved to be effective in establishing adversarial examples, but it’s still incredibly slow and computationally expensive.2 To optimize the search further, we need to understand more about the characteristics of DNN algorithms and the shapes of the prediction landscapes that they define.

Exploiting Model Linearity

In 2015, some of the same authors of the original research into adversarial examples reconsidered the problem. They proved a simple algorithm—the Fast Gradient Sign Method (FGSM)3—to be effective in generating adversarial examples.

The FGSM algorithm was never intended to be the best approach to finding adversarial input. Rather, its purpose was to demonstrate a characteristic of DNN algorithms that makes the problem of creating adversarial input far simpler. Let’s begin by looking at the algorithm itself, and then consider what it tells us.

The FGSM algorithm calculates the direction in the input space that, at the location of the image, appears to be the fastest route to a misclassification. This direction is derived from the gradient of a cost function similar in principle to the one used for training a network (as explained in “How a DNN Learns”). Conceptually, you can think of the direction calculation as an indirect measure of the steepness of the contours in the prediction landscape at the location of the image.

The adversarial direction is very crudely calculated: each input value (pixel value or, put another way, axis in the multidimensional input space) is assigned one of two directions:

Plus

Indicating that this input value is best increased to cause a misclassification

Minus

Indicating that this input value is best decreased to cause a misclassification

This simple allocation of “plus” or “minus” may appear counterintuitive. However, FGSM isn’t concerned with the relative importance of any particular change, just the direction (positive or negative) of the change. If, for example, increasing one input value would have a greater adversarial effect than increasing another, the two values would still both be assigned a “plus” and treated equivalently.

With the direction established, FGSM then applies a tiny perturbation to every input value (pixel value in the case of an image), adding the perturbation if the value’s adversarial direction of change has been deemed positive, and subtracting the perturbation otherwise. The hope is that these changes will alter the image so that it resides just outside its correct classification, therefore resulting in an (untargeted) adversarial example.

FGSM works on the principle that every input value (pixel) is changed, but each one only by a minuscule amount. The method is built on the fact that tiny changes to every dimension in a high-dimensional space create a significant overall change; changing lots of pixels each by a tiny amount results in a big movement across the input space. Cast your mind back to the Lp-norm measurements and you’ll realize that this method is minimizing the maximum change to a single pixel—that’s the L∞-norm.
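
To make this concrete, here is a minimal sketch of the single-step calculation. This is not the Foolbox implementation used later in the chapter; it assumes TensorFlow 2 eager execution, a Keras classifier called model that outputs class probabilities, inputs scaled to the range [0, 1], and a hypothetical helper name fgsm_perturb:

import numpy as np
import tensorflow as tf

def fgsm_perturb(model, x, y_true, epsilon=0.01):
    """Single-step FGSM sketch: nudge every input value by +/- epsilon
    in the direction that increases the classification loss."""
    x_tensor = tf.convert_to_tensor(x[np.newaxis], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x_tensor)
        predictions = model(x_tensor)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            np.array([y_true]), predictions)
    gradient = tape.gradient(loss, x_tensor)   # d(loss)/d(input value)
    signed_gradient = tf.sign(gradient)        # keep only the direction of change
    x_adv = x_tensor + epsilon * signed_gradient
    return tf.clip_by_value(x_adv, 0.0, 1.0)[0].numpy()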

The surprising aspect of FGSM is that it works at all! To illustrate this point, take a look at the two scenarios in Figure 6-5, where the input space of two different DNN models is depicted. Both images show very zoomed-in pictures of the input space, so the arrow depicting the difference between the position of the original and new image generated by the FGSM algorithm represents the very small movement in each dimension.

The boundary for the “Shirt” classification area is depicted by the same thick line in both cases, but you can see that the model on the left has far more consistent gradients. The contours suggest a smooth, consistent hill, rather than the inconsistent, unpredictable landscape on the right.

The direction of the perturbation calculated by FGSM is determined by the gradient at the point of the original image. On the left, this simple algorithm takes us successfully to an adversarial location because the gradients remain roughly the same as the image moves away from the original. In contrast, using FGSM with the gradients on the right fails to place the image in an adversarial location because the gradient near the original image location is not indicative of the gradients further away. The resulting image still lies within the “Shirt” classification area.

Figure 6-5. The Fast Gradient Sign Method assuming model linearity and nonlinearity

The whole premise of the success of FGSM assumes that the steepness of the slope in a particular direction will be maintained. Put in mathematical terminology, the function that the model represents exhibits linear behavior. With linear models, it’s possible to vastly approximate the mathematics for generating adversarial perturbation by simply looking at local gradients.

FGSM will not generate the best adversarial input, but that isn’t the purpose of this algorithm. By showing that the FGSM approach worked across state-of-the-art image classification DNNs, the researchers proved that the fundamental characteristic of linearity was inherent to many of these algorithms; DNN models tend to have consistent gradients across the input space landscape, as shown in the image on the left of Figure 6-5, rather than the inconsistent gradients shown on the right. Prior to FGSM, it was assumed that DNN algorithms comprised more complex nonlinear gradients, which you could envisage as landscapes with continually varying steepness, incorporating dips and hills. This linearity occurs because the optimization step during training favors the simplest model (the simplest gradients).

Model linearity makes the mathematics for generating adversarial examples with white box methods far simpler; the required perturbation can be closely approximated simply by looking at local gradients.

Even with the tiny perturbations spread across the image, FGSM can overstep. Hence, it can be improved by iteratively adding very small perturbations until the image is just past the classification boundary and therefore becomes adversarial. While we’re at it, we might as well recheck the gradient direction on each iteration just in case the model is not entirely linear. This technique is referred to as the basic iterative method.6
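
As a rough illustration, the basic iterative method can be sketched as a loop around the single-step calculation above. Again, this is an illustrative sketch rather than a reference implementation; it reuses the hypothetical fgsm_perturb helper and assumes the same TensorFlow 2 Keras setup:

def basic_iterative_perturb(model, x, y_true, epsilon=0.1, step=0.005, iterations=50):
    """Repeat small FGSM steps, rechecking the gradient direction each time,
    and stop as soon as the prediction changes."""
    x_adv = x.copy()
    for _ in range(iterations):
        x_adv = fgsm_perturb(model, x_adv, y_true, epsilon=step)
        # Keep the total perturbation within an epsilon ball around the original
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon).clip(0.0, 1.0)
        if np.argmax(model.predict(x_adv[np.newaxis])) != y_true:
            break  # the image is now (untargeted) adversarial
    return x_adv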

The following code snippet demonstrates the FGSM attack on the Fashion-MNIST classifier that we created in Chapter 3. We’ll use the openly available Foolbox library for this example.

Code Example: Gradient Attack

To experiment with the FGSM code in this section, see the Jupyter notebook chapter06/fashionMNIST_foolbox_gradient.ipynb.

To begin, import the required packages:

import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras

Load the model previously saved in Chapter 3 and run the test images through it:

fashion_mnist = keras.datasets.fashion_mnist
_, (test_images, test_labels) = fashion_mnist.load_data()
test_images = test_images/255.0

model = tf.keras.models.load_model("../models/fashionMNIST.h5") 1

predictions = model.predict(test_images) 2
1

Load the Fashion-MNIST classifier that was saved in Chapter 3.

2

Get the model’s predictions for the test data.

Select an original (nonadversarial) image and display it (see Figure 6-6) along with its prediction:

image_num = 7 1

class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

x = test_images[image_num]
y = np.argmax(predictions[image_num])
y_name = class_names[y]

print("Prediction for original image:", y, y_name)

plt.imshow(x, cmap=plt.cm.binary)
1

Change this number to run the attack on a different image.

This output is generated:

Prediction for original image: 6 Shirt
Figure 6-6. Code output

Next, create a Foolbox model from our Keras one:

import foolbox
fmodel = foolbox.models.TensorFlowModel.from_keras(model, bounds=(0, 255))

Define the attack specificity:

attack_criterion = foolbox.criteria.Misclassification() 1
distance = foolbox.distances.Linfinity                  2
1

The attack_criterion defines the specificity of the attack. In this case, it is a simple misclassification.

2

The perturbation distance will be optimized according to the L∞ (Linfinity) norm.

Define the attack method:

attack = foolbox.attacks.GradientSignAttack(fmodel,
                                            criterion=attack_criterion,
                                            distance=distance)

And run the attack:

x_adv = attack(input_or_adv = x,
               label = y,
               unpack = False) 1
1

Specifying unpack = False means that a foolbox.adversarial.Adversarial object is returned, rather than an image. The image and other information about the adversarial example (such as its distance from the original) can be accessed from this object.

Now, let’s print out the results (see Figure 6-7):

preds = model.predict(np.array([x_adv.image]))  1

plt.figure()

# Plot the original image
plt.subplot(1, 3, 1)
plt.title(y_name)
plt.imshow(x, cmap=plt.cm.binary)
plt.axis('off')

# Plot the adversarial image
plt.subplot(1, 3, 2)
plt.title(class_names[np.argmax(preds[0])])
plt.imshow(x_adv.image, cmap=plt.cm.binary)
plt.axis('off')

# Plot the difference
plt.subplot(1, 3, 3)
plt.title('Difference')
difference = x_adv.image - x
plt.imshow(difference, vmin=0, vmax=1, cmap=plt.cm.binary)
plt.axis('off')

print(x_adv.distance)  2

plt.show()
1

This line gets the predictions for the adversarial example. x_adv.image represents the adversarial image part of the Adversarial object.

2

x_adv.distance is an object representing the perturbation required to generate the adversarial image.

This generates the following output:

normalized Linf distance = 1.50e-04
Figure 6-7. Code output

The code has optimized for the L∞-norm. If you look carefully at the difference, you can see that many pixels have been changed slightly.

Before moving on, there’s an important caveat to this discussion of model linearity with respect to audio. Preprocessing such as MFCC and the recurrent nature of the LSTMs that often form part of speech-to-text solutions (see “Audio”) introduce nonlinearity into the model. Therefore, successfully establishing adversarial distortion with a technique such as FGSM on a speech-to-text system requires iterative steps rather than a single-step approach. In addition, the complex processing chain in speech-to-text systems makes generating a loss function more difficult than in the image domain; the loss for an audio sample must be minimized over the complete end-to-end chain (including steps such as MFCC and CTC), which in turn requires more challenging mathematics and increased compute power.7

Adversarial Saliency

“What’s the DNN Thinking?” introduced the concept of saliency maps that enable us to visualize the aspects of input data most important in determining the DNN’s predictions. This concept is not unique to DNN processing; these maps have been used for many years as a method of depicting the pixels (or groups of pixels) most relevant to a particular computer vision recognition task.

Saliency calculations can also be exploited in generating adversarial examples. Knowing the most relevant features in determining a classification is bound to be useful if we wish to restrict perturbation to the areas that will have the most influence in moving a benign input to an adversarial one. The Jacobian Saliency Map Approach (JSMA)8 demonstrates this approach.

The JSMA involves calculating the adversarial saliency score for each value that makes up the input. Applied to image data, this is a score for each and every pixel value (three per pixel in the case of color images) indicating its relative importance in achieving the adversarial goal. Changes to pixels with a higher score have greater potential to change the image to an adversarial one than pixels with a lower score.

The adversarial saliency for a particular pixel considers two things:

  • The effect of the change on increasing the predicted score for the target classification (in a targeted attack)

  • The effect of the change on decreasing the predicted score for all other classifications

The changes to input that are likely to have the greatest effect on achieving the adversarial goal will have a high value for both, so these are the changes that will be made first.

The JSMA selects the pixels that have the greatest impact and changes them by a set amount in the relevant direction (i.e., either increases or decreases their values). Essentially, this is moving a set distance along carefully chosen directions in the multidimensional input space in the hope of bringing the image to a location in the prediction landscape that satisfies the adversarial criteria. If the goal is not achieved, the process is repeated until it is.

The JSMA minimizes the number of pixel value changes to the image, so it’s the L0-norm measurement that’s being used this time as a measure of change.
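
To make the saliency calculation more concrete, the following sketch computes a JSMA-style score for every input value of a small classifier. It assumes TensorFlow 2 eager execution and a Keras model such as the Fashion-MNIST classifier; the full JSMA described in the paper additionally considers pairs of input values and iterates, which this sketch omits:

import numpy as np
import tensorflow as tf

def adversarial_saliency(model, x, target_class):
    """Score each input value: it scores highly when increasing it both raises
    the target class prediction and lowers the other classes' predictions."""
    x_tensor = tf.convert_to_tensor(x[np.newaxis], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x_tensor)
        predictions = model(x_tensor)[0]
    # Jacobian: d(prediction)/d(input value) for every class and input value
    jacobian = tape.jacobian(predictions, x_tensor)[:, 0]
    d_target = jacobian[target_class]
    d_others = tf.reduce_sum(jacobian, axis=0) - d_target
    # Zero out values that move the predictions the wrong way; otherwise
    # score by how strongly the change helps the target class
    saliency = tf.where((d_target > 0) & (d_others < 0),
                        d_target * tf.abs(d_others),
                        tf.zeros_like(d_target))
    return saliency.numpy()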

The following code snippet demonstrates the Foolbox SaliencyMapAttack on the ResNet50 classifier that we created in Chapter 4.

Code Example: Saliency Attack

The complete code for this attack can be found in the Jupyter notebook chapter06/resnet50_foolbox_saliency.ipynb.

Let’s begin by selecting our original nonadversarial photograph and running it through the classifier:

original_image_path = '../images/koala.jpg'
x = image_from_file(original_image_path, [224,224]) 1
1

This helper utility is in the GitHub repository. It reads in the file and resizes it.

Import the relevant libraries and the ResNet50 model. We’ll pass the nonadversarial image to the model to check the prediction returned from ResNet50:

import numpy as np

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow.keras.applications.resnet50 import decode_predictions

model = ResNet50(weights='imagenet', include_top=True)

x_preds = model.predict(np.expand_dims(preprocess_input(x), 0))
y = np.argmax(x_preds)
y_name = decode_predictions(x_preds, top=1)[0][0][1]

print("Prediction for image: ", y_name)

This generates the following output:

Prediction for image:  koala

Now let’s create the Foolbox model from the ResNet50 one. We need to articulate the preprocessing required:

import foolbox

preprocessing = (np.array([103.939, 116.779, 123.68]), 1) 1
fmodel = foolbox.models.TensorFlowModel.from_keras(model,
                                                   bounds=(0, 255),
                                                   preprocessing=preprocessing)
1

The Foolbox model can perform preprocessing on the image to make it suitable for ResNet50. This involves normalizing the data around the ImageNet mean RGB values on which the classifier was initially trained. The preprocessing variable defines the means for this preprocessing step. The equivalent normalization is done in keras.applications.resnet50.preprocess_input—the function that we have called previously to prepare input for ResNet50. To understand this preprocessing in greater detail and try it for yourself, take a look at the Jupyter notebook chapter04/resnet50_preprocessing.ipynb.

As mentioned in “Image Classification Using ResNet50”, ResNet50 was trained on images with the channels ordered BGR, rather than RGB. This step (also in keras.applications.resnet50.preprocess_input) switches the channels of the image data to BGR:

x_bgr = x[..., ::-1]

Next, we set up the Foolbox attack:

attack_criterion = foolbox.criteria.Misclassification()
attack = foolbox.attacks.SaliencyMapAttack(fmodel, criterion=attack_criterion)

and run it:

x_adv = attack(input_or_adv = x_bgr,
               label = y,
               unpack = False)

Let’s check the predicted label and class name of the returned adversarial image:

x_adv = x_adv.image[..., ::-1] 1

x_adv_preds = model.predict(preprocess_input(x_adv[np.newaxis].copy()))
y_adv = np.argmax(x_adv_preds)
y_adv_name = decode_predictions(x_adv_preds, top=1)[0][0][1]

print("Prediction for image: ", y_adv_name)
1

Get the adversarial image from the foolbox.adversarial object and change the channels back to RGB order.

The output is:

Prediction for image:  weasel

Finally, we display the images alongside each other with their difference (Figure 6-8):

import matplotlib.pyplot as plt

plt.figure()

# Plot the original image
plt.subplot(1, 3, 1)
plt.title(y_name)
plt.imshow(x)
plt.axis('off')

# Plot the adversarial image
plt.subplot(1, 3, 2)
plt.title(y_adv_name)
plt.imshow(x_adv)
plt.axis('off')

# Plot the difference
plt.subplot(1, 3, 3)
plt.title('Difference')
difference = x_adv - x

# Set the difference to 255 where pixels haven't changed so they don't show on the plot
difference[difference == 0] = 255
plt.imshow(abs(difference))
plt.xticks([])
plt.yticks([])

plt.show()
Figure 6-8. Code output

The difference image in Figure 6-8 shows that a relatively small number of pixels have been changed. You may be able to see the perturbation on the adversarial image in the center.

Increasing Adversarial Confidence

The FGSM and JSMA methods generate adversarial examples, but because these attacks generate input close to classification boundaries, the results may be susceptible to preprocessing or active defense by the target system. For example, a minor change to the pixels of an adversarial image might change its classification.

Creating input that is more confidently adversarial allows the adversary greater robustness. Carlini and Wagner proposed an alternative attack method which does exactly this.10 The attack iteratively minimizes the L2-norm measurement of change, while also ensuring that the difference between the confidence of the adversarial target and the next most likely classification is maximized. This gives the adversarial example greater wiggle room before it is rendered nonadversarial. This attack is referred to as the C&W attack.
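
To give a flavor of the optimization, here is a sketch of the kind of objective the C&W attack minimizes for a targeted attack. It is an illustrative formulation only, operating on the model’s raw logits; the variable names and the constants c and confidence are assumptions, and the real attack also includes a change of variables and a search over c:

import tensorflow as tf

def cw_style_loss(logits, target_class, delta, c=1.0, confidence=20.0):
    """Trade off the L2 size of the perturbation (delta) against how far the
    target logit leads the next most likely class."""
    target_logit = logits[target_class]
    other_logits = tf.concat([logits[:target_class], logits[target_class + 1:]], axis=0)
    best_other = tf.reduce_max(other_logits)
    # No further reward once the target logit leads by at least `confidence`
    misclassification_term = tf.maximum(best_other - target_logit + confidence, 0.0)
    return tf.reduce_sum(tf.square(delta)) + c * misclassification_term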

Variations on White Box Approaches

This section has introduced a few of the available white box approaches for generating adversarial examples. There are other methods, and undoubtedly more will be developed.

All white box approaches share the same aim: to optimize the search of the input space by using some computationally feasible algorithm. Either directly or indirectly, all the methods exploit knowledge of model gradients to minimize the required adversarial perturbation. Optimizing the search may involve some approximation or random step, and that may incur a trade-off in terms of a larger perturbation than is absolutely necessary. In addition, due to the variation in how the algorithms search the input space, different approaches will return different adversarial examples.

Limited Black Box Methods

Limited black box query methods iteratively refine the adversarial perturbation based on output returned from the model. This could be, for example, the final classification of an image (“cat” or “dog”) or the text output from a speech-to-text translation. With these methods, the adversary has no access to the model itself or to the raw scores (predictions) from its output layer. Essentially, there’s a level of indirection; the output of the DNN model has been processed and simplified in some way to provide a result and it is only this result that the attacker has access to.

It’s not immediately obvious how an adversary might effectively search the input space in the limited black box scenario. One approach would be to iteratively make a small change to the input, run it through the model, and check the classification to see whether it has altered. However, this is unlikely to be an effective strategy because there is no way of knowing whether the small change is moving the image toward the required nefarious part of the input space until the classification changes. This brute-force method will be too slow and clumsy; the adversary will need a better plan.

The boundary attack proposed by Brendel et al.11 is a clever strategy that is pleasantly simple in its approach. A targeted boundary attack is depicted in Figure 6-9. It is seeded with the original image (in this case, a sneaker) and a sample image (in this case, a sandal), as shown by the circle and square, respectively. The sample is an image that is classified by the model as the target classification. So, in this case, we want the sneaker image to look like a sneaker but be classified as a sandal.

The boundary attack begins with the sandal image rather than the original sneaker image. It iteratively moves the image closer to the sneaker, while never allowing each iterative step to alter the adversarial classification by straying across the classification boundary.

The algorithm begins with an initialization step. The sample image from the adversarial target class is blended with a diluted version of the original to create an input that is only just adversarial. This is done by overlaying the sample sandal image with selected pixels from the original sneaker image: overlay, test against the model, and repeat. At some point the image moves just across the classification boundary and is classified as a sneaker. It has gone a little too far, so the contribution of the original image is reduced very slightly to return it to the adversarial (sandal) classification.

Conceptually, this is moving the image from the sample toward the original, then stopping at the boundary just before it takes on the original classification. The location of the image at the end of the initialization step is shown in Figure 6-9 as the triangle marked with a “1.”

Figure 6-9. Targeted limited black box boundary attack

At the end of the initialization step, the current image is just within the adversarial boundary. However, it is unlikely to be sufficiently close to the original for the perturbation to be hidden. In this case, it most likely still looks like the sandal. The algorithm then creeps along the edge of the boundary, at each step testing random perturbations that would bring the input closer to the original sneaker. Each time, the input is submitted to the target DNN; if it steps outside the adversarial boundary, it is discarded. Each iteration to bring the sample image closer to the original may therefore take multiple steps. At some point, the image is deemed close enough to the original to look like it, and yet retain the adversarial classification of the sample.
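
The core of the algorithm can be sketched in a few lines. This is a drastically simplified illustration: the real attack adapts its step sizes and projects the random step onto a sphere around the original image, and is_adversarial is an assumed helper that queries the target model and reports whether a candidate still receives the target classification:

import numpy as np

def boundary_attack_sketch(is_adversarial, x_original, x_start,
                           steps=5000, step_size=0.01):
    x_adv = x_start.copy()  # begin at the sample from the target class
    for _ in range(steps):
        # A small random perturbation, followed by a small step toward the original
        candidate = x_adv + step_size * np.random.randn(*x_adv.shape)
        candidate += step_size * (x_original - candidate)
        candidate = np.clip(candidate, 0.0, 1.0)
        # Keep the step only if it stays on the adversarial side of the boundary
        if is_adversarial(candidate):
            x_adv = candidate
    return x_adv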

Let’s take a look at the code for a targeted boundary attack on Fashion-MNIST. We’ll try to turn the sneaker into a sandal, as illustrated in Figure 6-9.

Code Example: Boundary Attack

The complete code for this attack can be found in the Jupyter notebook chapter06/fashionMNIST_foolbox_boundary.ipynb.

The code for the boundary attack requires specifying an original image (x) and a starting point image. First we’ll specify the original image, establish its classification, and display it (see Figure 6-10):

original_image_num = 9

x = test_images[original_image_num]
y = np.argmax(predictions[original_image_num]) 1
y_name = class_names[y]

print("Prediction for original image:", y, y_name)

plt.imshow(x, cmap=plt.cm.binary)
1

y is the original (nonadversarial) prediction.

This output is generated:

Prediction for original image: 7 Sneaker
Figure 6-10. Code output

Next, we select the starting point image, establish its classification, and display it (see Figure 6-11). This image has the required adversarial classification:

starting_point_image_num = 52

starting_point_image = test_images[starting_point_image_num]
y_adv = np.argmax(predictions[starting_point_image_num])    1
y_adv_name = class_names[y_adv]

print("Prediction for starting point image:", y_adv, y_adv_name)

plt.imshow(starting_point_image, cmap=plt.cm.binary)
1

y_adv is the target adversarial prediction.

This generates the following output:

Prediction for starting point image: 5 Sandal
Figure 6-11. Code output

Now we prepare the Foolbox attack:

import foolbox

fmodel = foolbox.models.TensorFlowModel.from_keras(model, bounds=(0, 1))
attack_criterion = foolbox.criteria.TargetClass(y_adv) 1
attack = foolbox.attacks.BoundaryAttack(fmodel, criterion=attack_criterion)
1

This is a targeted attack, so we use the TargetClass criteria.

And issue the attack:

x_adv = attack(input_or_adv = x,
               label = y,
               starting_point = starting_point_image,
               unpack = False,
               log_every_n_steps = 500)

Which generates this output:

run with verbose=True to see details
Step 0: 5.60511e-02, stepsizes = 1.0e-02/1.0e-02:
Step 500: 1.44206e-02, stepsizes = 1.5e-02/2.0e-03:
Step 1000: 3.43213e-03, stepsizes = 1.5e-02/1.3e-03: d. reduced by 0.26% (...)
Step 1500: 1.91473e-03, stepsizes = 6.7e-03/5.9e-04: d. reduced by 0.12% (...)
Step 2000: 1.54220e-03, stepsizes = 3.0e-03/1.7e-04: d. reduced by 0.03% (...)
Step 2500: 1.41537e-03, stepsizes = 8.8e-04/5.1e-05: d. reduced by 0.01% (...)
Step 3000: 1.37426e-03, stepsizes = 5.9e-04/2.3e-05: d. reduced by 0.00% (...)
Step 3500: 1.34719e-03, stepsizes = 3.9e-04/2.3e-05:
Step 4000: 1.32744e-03, stepsizes = 3.9e-04/1.5e-05: d. reduced by 0.00% (...)
Step 4500: 1.31362e-03, stepsizes = 1.7e-04/1.0e-05:
Step 5000: 1.30831e-03, stepsizes = 5.1e-05/2.0e-06:

The attack is stopped by default when the algorithm converges or after 5,000 iterations. At this point we hope that we have crept around the decision boundary to a position close enough to the original image for it to look the same. Let’s take a look (see the output in Figure 6-12):

preds = model.predict(np.array([x_adv.image]))

plt.figure()

# Plot the original image
plt.subplot(1, 3, 1)
plt.title(y_name)
plt.imshow(x, cmap=plt.cm.binary)
plt.axis('off')

# Plot the adversarial image
plt.subplot(1, 3, 2)
plt.title(class_names[np.argmax(preds[0])])
plt.imshow(x_adv.image, cmap=plt.cm.binary)
plt.axis('off')

# Plot the difference
plt.subplot(1, 3, 3)
plt.title('Difference')
difference = x_adv.image - x
plt.imshow(difference, vmin=0, vmax=1, cmap=plt.cm.binary)
plt.axis('off')

plt.show()
Figure 6-12. Code output

Figure 6-13 shows the image after every hundred iterations (up to 1,200 iterations) during the attack. The sandal changes to be increasingly similar to the original image without changing its classification. The sandal prediction is in brackets under each image—you’ll see that it creeps around the boundary with the prediction close to 0.5 throughout the optimization. The code to generate Figure 6-13 is also in the Jupyter notebook chapter06/fashionMNIST_foolbox_boundary.ipynb.

Figure 6-13. The boundary attack begins with an image from the target class and gradually moves it closer to the original

The boundary attack is a very powerful attack able to effectively create adversarial input. However, it is likely to take thousands of iterations, and each iteration may involve several queries to the DNN.

Score-Based Black Box Methods

Score-based methods fall somewhere between the white box and limited black box categories. Sometimes in the research literature score-based methods are termed black box, but the term score-based is used throughout this book to clearly distinguish between the two.

Score-based methods require access to the output class probabilities of the model; that is, the attacker can submit an input and receive the predicted scores from which the final decision (such as the classification) will be made by the DNN. It’s worth noting that the scores available to the attacker may be limited (for example, the top five probabilities).

Score-based methods might appear closer to limited black box methods; after all, the adversary only has access to input and output. However, access to the scores could be considered “privileged,” as typically an attacker would not have access to the raw DNN output. Score-based methods are therefore closer in character to white box approaches. They approximate the model’s algorithm through the predicted scores it returns and then perform an intelligent search to establish the perturbation required to achieve the adversarial goal. But unlike in a white box attack, the attacker does not have access to the model algorithm, so they cannot compute the gradients that the white box methods presented previously require. (There are other ways of searching for the adversarial example, such as using a genetic algorithm.12)
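
As a simple illustration of what access to scores makes possible, an attacker might estimate a gradient numerically by probing the model with small random perturbations and observing how the returned score changes. Here score_fn is an assumed helper that submits an input and returns a single score (for example, the probability of the target class); practical score-based attacks are considerably more sophisticated and query efficient:

import numpy as np

def estimate_score_gradient(score_fn, x, epsilon=0.01, num_samples=100):
    """Average finite differences of the returned score along random directions."""
    gradient = np.zeros_like(x, dtype=np.float64)
    base_score = score_fn(x)
    for _ in range(num_samples):
        direction = np.random.randn(*x.shape)
        direction /= np.linalg.norm(direction)
        probe_score = score_fn(np.clip(x + epsilon * direction, 0.0, 1.0))
        gradient += (probe_score - base_score) / epsilon * direction
    return gradient / num_samples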

At this point, you might be wondering why any organization that cared about the security of its model would make the scores available. For example, if an image was automatically deemed to contain inappropriate content on a social media website and therefore qualified for removal, the individual who uploaded it would receive at best a warning or notification that the image had been censored. They would not receive the detailed output scores depicting the classification probabilities that the DNN assigned to the image. So, no, an organization wouldn’t make the scoring information available in scenarios such as this. However, there are a number of open APIs created to showcase and advance DNN technology that do make the scores available. While the security of these models themselves may not be at risk, as we shall see in Chapter 7, openly available model APIs could be exploited as model substitutes, saving the adversary the effort of creating their own model.

Summary

This chapter has considered a selection of methods for the generation of adversarial perturbation. There are many other variations on these methods, and new methods are regularly proposed.

Although these methods were discussed in the context of images, the fundamental techniques are just as applicable to audio or video. Also, don’t assume mathematical optimization approaches are always required to generate adversarial input; very simple methods might also be exploited if there is no mitigation in place at the target. For example, it has been shown13 that algorithms performing video search and summarization can be fooled by simply inserting occasional still images into the video. The still images contain content that results in the network returning an incorrect result, but they are inserted at a sufficiently low rate to not be noticed by humans.

As we have seen, the algorithms to generate adversarial perturbation are essentially mathematical optimizations. When considering defenses in Part IV, it will become clear that many approaches to defending against adversarial examples involve changing or extending the model’s algorithm. Creating adversarial examples that thwart the defenses thus remains an optimization problem, just against a different algorithm. Many defenses, therefore, succeed in constraining the methods available to the adversary, but are not guaranteed defenses.

This chapter has considered the mathematical approaches for generating adversarial examples against DNN models. Next, Part III explores how an attacker can use these theoretical approaches against real-world systems that incorporate AI.

1 Szegedy et al., “Intriguing Properties of Neural Networks.”

2 Optimizations have since been made to this approach to generate more effective perturbations, such as those described in Nicholas Carlini and David Wagner, “Towards Evaluating the Robustness of Neural Networks” (2016), http://bit.ly/2KZoIzL.

3 Ian J. Goodfellow et al., “Explaining and Harnessing Adversarial Examples” (2015), http://bit.ly/2FeUtRJ.

4 Goodfellow et al., “Explaining and Harnessing Adversarial Examples.”

5 A classifier collapses this probability vector to a single value that represents the most probable classification.

6 Alexey Kurakin et al., “Adversarial Machine Learning at Scale” (2016), http://bit.ly/31Kr3EO.

7 Carlini and Wagner, “Audio Adversarial Examples.”

8 Nicolas Papernot et al., “The Limitations of Deep Learning in Adversarial Settings,” 1st IEEE European Symposium on Security & Privacy (2016), http://bit.ly/2ZyrSOQ.

9 Once again, as with backpropagation, derivatives are calculated using the chain rule: the mathematical technique that enables the derivative to be calculated by considering f(x) as a composition of the functions in each DNN layer.

10 Carlini and Wagner, “Towards Evaluating the Robustness of Neural Networks.”

11 Wieland Brendel et al., “Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models,” Proceedings of the International Conference on Learning Representations (2018), http://bit.ly/2Y7Zi6w.

12 M. Alzantot et al., “Did You Hear That? Adversarial Examples Against Automatic Speech Recognition,” Conference on Neural Information Processing Systems, Machine Deception Workshop (2017), http://bit.ly/2ITvtR3.

13 Hosseini Hossein et al., “Deceiving Google’s Cloud Video Intelligence API Built for Summarizing Videos” (2017), http://bit.ly/2FhbDxR.
