Chapter 5. The Principles of Adversarial Input

This chapter looks at some of the core principles underpinning the generation of adversarial examples. We’ll hold off on the more detailed mathematics and specific techniques and begin by building upon the ideas presented in the previous chapters. This discussion will use analogy and approximation to provide an intuitive understanding prior to delving into the details. The aim is to understand, at a high level, how the addition of adversarial perturbation or an adversarial patch could cause a DNN to return an incorrect result.

To recap:

Adversarial perturbation

A combination of small, imperceptible (or nearly imperceptible) changes distributed across the input data that cause the model to return an incorrect result. For an image, this might be small changes to several disparate pixels across the image.

Adversarial patch

An addition to a specific area (spatial or temporal) of the input data to cause the model to return an incorrect result. An adversarial patch is likely to be perceptible by a human observer, but could be disguised as something benign.

This chapter considers the generation of adversarial perturbation and patches by direct manipulation of digital data. Adversarial patches and perturbation are easier to apply to the input in its digital form, but it may be possible to apply these techniques to the real world (altering traffic signs for autonomous vehicles, for example) to cause the sensor (camera or microphone) to generate digital input that has the adversary’s desired effect. Chapter 8 looks at the additional challenges facing the adversary if they do not have access to the digital form of the input.

Very broadly speaking, adversarial attacks can be divided into two types:

Untargeted attacks

An untargeted (or indiscriminate) attack aims to cause the DNN to return an incorrect result, such as a misclassification. An example of this would be evading face recognition; so long as the image is not positively identified as a specific person, the actual DNN output is not important.

Targeted attacks

A targeted attack aims to generate a specific output from the DNN processing; for example, to cause an autonomous vehicle to misinterpret a stop sign as some other specific sign.

It will come as no surprise that an untargeted attack will be easier to achieve than a targeted attack, as the attacker is less fussy about the DNN output and so has greater scope for manipulation of the input data. However, the techniques employed are similar for both cases.

Prior to considering the attacks themselves, let’s begin by looking at the raw input presented to a DNN and the features that the model subsequently extracts from that input—those are the characteristics that it deems most important in making its decision. For the purposes of this explanation and those that follow in this chapter, we’ll use image classification, the most commonly researched area of adversarial examples. However, the concepts presented here are by no means limited to images; the ideas can be applied to other modalities, such as audio.

Mathematics Refresher

If you are unfamiliar with (or have forgotten) mathematical notation, examples and explanations of the mathematical symbols used in this book can be found in Appendix A.

The Input Space

DNNs are learned functions that map some complex input to an output. We considered a simple image classification task using the Fashion-MNIST dataset in Chapter 3. Chapter 4 then went on to explain how the principles of deep learning can be applied to other scenarios, such as more complex image recognition, audio classification, and speech-to-text.

All the scenarios presented in the preceding chapters take complex data as input. For example, the ResNet50 classifier presented in Chapter 4 takes ImageNet data cropped to 224 x 224 pixels, so each image comprises 50,176 pixels in total. The color of each pixel is described by three channels (red, green, blue), so each image is represented by 50,176 x 3 (150,528) values, each lying between 0 and 255. This gives a staggering 256^150,528 possible images that could be presented to the image classifier!

Using a similar calculation,1 a relatively low-resolution 1.3 megapixel photograph can encode 256^3,932,160 possible picture variations. Even the far lower-resolution monochrome Fashion-MNIST classifier has 256^784 possible inputs.2
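To put these numbers in perspective, here is a minimal Python sketch (Python being the language of this book's notebooks) that reproduces the counts above. The approximation in the output simply uses log10(256) ≈ 2.408.

```python
import math

# Number of raw input values for each of the examples discussed above.
counts = {
    "ResNet50 input (224 x 224 x 3)": 224 * 224 * 3,       # 150,528 values
    "1.3 MP photo (1,280 x 1,024 x 3)": 1280 * 1024 * 3,   # 3,932,160 values
    "Fashion-MNIST input (28 x 28)": 28 * 28,               # 784 values
}

for name, n in counts.items():
    # Each value can take one of 256 levels, giving 256^n possible inputs.
    print(f"{name}: 256^{n:,} ≈ 10^{int(n * math.log10(256)):,}")
```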

One way of envisaging all the possible images that could be input to the DNN would be to place each one at its own point in a highly dimensional input space. In such a space, one dimension represents one input neuron value (or raw “feature”). This equates to one dimension per pixel value (three values and three dimensions for each pixel if it’s in color, one value and one dimension for each pixel if it’s in grayscale). For the ResNet50 classifier, the input space would comprise 150,528 dimensions, each of which could have one of 256 values. For Fashion-MNIST, it’s 784 dimensions.

We cannot visualize such a complex hyperdimensional space. So, for this discussion, Figure 5-1 presents an outrageous oversimplification to two dimensions for the two example datasets.

2-d depiction of input spaces with locations of images marked.
Figure 5-1. Input spaces outrageously simplified to two dimensions (obviously not to scale)

Although the input space is vast, it’s worth noting that most of the possible images would not represent anything that we would understand as a proper “picture.” They might look like random pixel combinations, or perhaps patterns that don’t represent anything in the real world. Every possible image has a specific location in the input space, however. Changing one of its pixels will move it in the space along the dimension (or dimensions in the case of a color image) representing that pixel value.

As described in Chapter 3, for each image the DNN returns a vector of probabilities—one value within the vector for each possible classification.

In the Fashion-MNIST input space, images falling in one area of the input space might be assigned a high probability of being “Bag,” and images falling in another area might be assigned a higher probability of being “Coat” or “Sandal” or one of the other clothing classifications. Every point in the input space has a set of 10 values returned from this image classifier.

Input Space Versus Feature Space

The term feature space expresses the same concept of a multidimensional space, but with variation across features rather than raw input values. The feature space is therefore the collection of feature combinations used by the ML algorithm to make its predictions.

In more traditional (non-DNN) ML applications, the raw data fed into the learned model represents the features on which the model will make its predictions. Therefore, the feature space can be considered to be the same as the input space.

In contrast, DNNs are usually trained to extract the features from raw data. Therefore, an interpretation of feature space in the context of neural networks would be the lower-dimensional space of more complex features that have been extracted by the DNN to make its predictions. For example, in the case of a CNN performing image classification, this might be the higher-level feature information output from the convolutional layers in the first part of the network.

Strictly speaking, when referring to changes affecting the raw data to create DNN adversarial examples, it may be more correct to use the term input space. In practice, however, the two terms are often used interchangeably—the input pixels in an image are, after all, just very low-level features.
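As a concrete illustration of the distinction, the following hedged sketch treats the output of Keras ResNet50's final pooling layer (named "avg_pool" in the Keras implementation) as a feature space; this is just one reasonable interpretation, and any convolutional layer's output could be used instead.

```python
import numpy as np
import tensorflow as tf

# A pretrained ResNet50, as used elsewhere in this chapter.
model = tf.keras.applications.ResNet50(weights="imagenet")

# Input space: the raw 224 x 224 x 3 pixel values (a random image here).
image = np.random.randint(0, 256, size=(1, 224, 224, 3)).astype("float32")

# Feature space: activations of an intermediate layer extracted by the DNN.
feature_extractor = tf.keras.Model(
    inputs=model.input, outputs=model.get_layer("avg_pool").output)
features = feature_extractor(
    tf.keras.applications.resnet50.preprocess_input(image))

print("Input space dimensions:  ", int(np.prod(image.shape[1:])))  # 150528
print("Feature space dimensions:", int(features.shape[-1]))        # 2048
```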

A nice way to think of this is in terms of multiple landscapes, each with contours. Higher (darker) ground depicts a more confident prediction for the classification that the landscape refers to (“Coat,” “Bag,” etc.). Although this landscape analogy is very simple, it forms the fundamental basis of the mathematical explanations of adversarial examples that follow.

For example, if we zoom into the area of the input space where a coat image resides and provide shading to illustrate the probabilities assigned to each of the 10 classes for images in this area, the input space might be visualized as a set of prediction landscapes, something like that shown in Figure 5-2. The outrageous simplification of the high-dimensional input space to two dimensions still applies.

A visual depiction of a 2-dimensional input space with contours showing model predictions for each classification.
Figure 5-2. A model’s prediction landscapes for each classification—zoomed into a tiny area of the complete input space

For each classification, the darker shaded areas represent the areas in the input space where images are confidently predicted to be that classification. Images in lighter shaded areas are less confidently predicted to be that classification.

A single classification for an image can be established based on whichever prediction in the list is the highest, possibly with the extra constraint that it must exceed a minimum confidence. In Figure 5-2, this minimum confidence is 0.5 and is marked with a continuous line. Images falling outside this boundary in the prediction landscape will therefore be assigned a different classification (or remain unclassified). In the Figure 5-2 depiction, the image falls within the "Coat" boundary and is therefore classified correctly.
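As a minimal sketch of this decision rule, consider a hypothetical probability vector for one Fashion-MNIST image (the class names and the 0.5 threshold match the discussion above; the probabilities themselves are invented for illustration):

```python
import numpy as np

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Hypothetical 10-value prediction vector returned by the classifier.
predictions = np.array([0.01, 0.02, 0.05, 0.08, 0.72,
                        0.00, 0.09, 0.00, 0.02, 0.01])

best = int(np.argmax(predictions))
if predictions[best] >= 0.5:                 # minimum-confidence constraint
    print(f"Classified as '{class_names[best]}' ({predictions[best]:.2f})")
else:
    print("Unclassified: no prediction meets the 0.5 threshold")
```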

Generalizations from Training Data

Each of the landscapes in Figure 5-2 represents a mapping from the image position to a specific classification. So this is simply a visual depiction of the formula representing the neural network algorithm:

y=f(x;Θ)

where the shading represents one of the values in the returned vector y for the coat image at location x, assuming the DNN has the set of weights and biases represented by Θ.

Cast your mind back to “How a DNN Learns”, which discussed the process by which a DNN learns by adjusting all the weights and biases represented by Θ. In terms of the contour analogy, this can be thought of as shifting the prediction landscapes so that each of the training examples is at (or close to) the relevant height for its true classification. The training examples’ x values do not change, but the landscape is molded to ensure that the DNN is as accurate as possible for the training data.

The process of gradient descent readjusts the parameters of the function to shift the contours so that the training data is correctly classified. At the beginning of training, the parameters are randomly initialized, so the initial landscape is a poor fit for the training examples. During training, you can imagine the landscape gradually morphing as the parameters are altered, optimizing the function for the training set. This is shown in Figure 5-3.

The changing prediction landscape of the input space during training.
Figure 5-3. The changing prediction landscape of the input space during training
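For intuition only, here is a toy gradient descent loop on a simple linear model; the data and loss are invented for illustration, but the repeated parameter-update step is the same basic mechanism that molds a DNN's prediction landscape during training.

```python
import numpy as np

def loss(theta, x, y):
    # Squared-error loss of a linear model: a stand-in for the DNN's loss.
    return np.mean((x @ theta - y) ** 2)

def gradient(theta, x, y):
    # Gradient of the loss with respect to the parameters theta.
    return 2 * x.T @ (x @ theta - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

theta = rng.normal(size=3)      # random initialization: a poor initial "landscape"
learning_rate = 0.1
for step in range(200):         # each step nudges theta to reduce the loss
    theta -= learning_rate * gradient(theta, x, y)

print("Final loss:", loss(theta, x, y))
```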

At the end of the training the majority of the training samples with the “Coat” label should fall within an area of the input space that allocates a high prediction to “Coat,” and similarly for all the other classes. The optimization step might fail to create classification boundaries that allocate all the training images to their correct classes, especially when the image does not have features similar to the others in its class. However, the purpose of the optimization is to generalize over patterns and create a best fit for the training data.

The ability of a model to perform accurate predictions for all possible inputs depends upon it learning the correct prediction landscapes for the areas at and around the training examples, and also in areas of the input space where no training examples exist. This is a nontrivial task, as by far the majority of the input space will have no training examples and many areas of the input space will reside outside the set of data with which the network has been reliably trained.

The aim of a DNN is to return accurate results across all possible input data, based on generalization of characteristics, but the accuracy of the model will be highly dependent on the characteristics of the training data on which these generalizations are made. For example, exactly what aspect of a Fashion-MNIST image causes the DNN to allocate it to the “Coat” classification? If the characteristic is not something that a human would use to perform the same reasoning, there’s scope for an adversary to exploit this difference.

Furthermore, the training data is unlikely to be representative of all possible types of input. The model is unlikely to perform accurately on data outside the training data distribution, known as out-of-distribution (OoD) data, and adversarial examples can exploit this weakness in the algorithm.

Out-of-Distribution Data

OoD data does not conform to the same distribution as the training set. If you consider the number of possible inputs to a DNN, it is unsurprising that the majority of potential inputs for image and audio tasks will lie outside this distribution.

For example, the training data for Fashion-MNIST comprises 60,000 examples. Sounds like a lot? This is actually tiny relative to the 256^784 possible inputs for a 28 x 28 grayscale image. Similarly, even though there are over 14 million images in the ImageNet dataset, if we restrict the images to a resolution of 224 x 224, this training set provides a very sparse representation of the complete input space of 256^150,528 possibilities.

In practice, if the training data represents examples from the real world (such as photographic images), many OoD inputs will correspond to data that would not occur naturally. Random-pixel images, for example, or images that have undergone some strange manipulation would be OoD. Usually these inputs will result in inconclusive model predictions, which is the behavior we might expect. Sometimes, however, the model will return confident but incorrect predictions for OoD inputs.

Recognizing OoD data is extremely challenging—we’ll return to this in Chapter 10 when considering defenses.

Experimenting with Out-of-Distribution Data

It’s interesting to experiment with random or unreal images to see what DNN classifiers make of them. Figure 5-4 shows a couple of examples of the predictions returned when random images are presented to the Fashion-MNIST and ResNet50 models.

Random images and their resulting predicted classes.
Figure 5-4. Classification predictions for randomly generated images

Code Examples: Experimenting with Random Data

You can test the Fashion-MNIST classifier on random images using the Jupyter notebook chapter05/fashionMNIST_random_images.ipynb on this book’s GitHub site.

To improve the model, you might like to experiment with retraining it with training images comprising random pixels and an additional classification label of “Unclassified.” The Jupyter notebook also includes the code to do this.

The Jupyter notebook chapter05/resnet50_random_images.ipynb provides the code for testing ResNet50 on random data.
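If you would rather not run the notebooks, the following self-contained sketch captures the idea: it briefly trains a small stand-in for the Chapter 3 classifier (so the exact predictions will differ from those in Figure 5-4) and then asks it to classify random-pixel images.

```python
import numpy as np
import tensorflow as tf

# A minimal stand-in for the Chapter 3 classifier, trained here for one epoch
# so that the example is self-contained.
(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x_train, y_train, epochs=1, verbose=0)

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Random-pixel images: out-of-distribution by construction.
random_images = np.random.rand(5, 28, 28).astype("float32")
for probs in model.predict(random_images):
    best = int(np.argmax(probs))
    print(f"{class_names[best]:<12} confidence {probs[best]:.2f}")
```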

The confident prediction returned by the Fashion-MNIST classifier on the left is "Bag," and by far the majority (over 99%) of random-pixel images passed to the classifier will receive this classification. This indicates that the model has learned to classify most of the input space as "Bag." It's likely that images of bags are not being identified by specific pixels, but by the fact that they do not belong to any other classification. ResNet50, at least, does not return a confident classification, so it has not misidentified the random image.

What’s the DNN Thinking?

The mathematical function described by a DNN extracts and quantifies characteristics of the data in order to make its predictions. Perhaps some particular characteristics of the image data are more important to the algorithm than others; for example, a particular combination of pixels in an image might indicate a feature such as a dog’s nose, thus increasing the probability that the image is of a dog.

This would make sense, but how could we work out which features a DNN is actually responding to? Put another way, what is the model “seeing” (in the case of image data) or “hearing” (in the case of audio)? Knowing this information might aid in creating adversarial examples.

Once again, this is best illustrated using image classification. Taking each individual pixel in an image, we can calculate the pixel’s saliency with respect to a particular classification—that is, how much the pixel contributes to a specific classification. A high value means that the pixel is particularly salient to the DNN in producing a particular result and a low value indicates that it is less important to the model for that result. For images, we can view all these values in a saliency map to see which aspects of the image the DNN focuses on to make its classification.

Code Examples: Generating Saliency Maps

There are several Python packages available for visualizing image saliency. The code used to generate the images in this chapter uses the Keras-vis Python package.

If you would like to experiment with the code for generating the saliency visualizations described in this chapter, the code and more detailed explanations are included in the GitHub repository. You can experiment using the Fashion-MNIST data in the Jupyter notebook chapter05/fashionMNIST_vis_saliency.ipynb or using the ResNet50 data found in the Jupyter notebook chapter05/resnet50_vis_saliency.ipynb.
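The notebooks use the Keras-vis package, but the underlying idea can be sketched directly with TensorFlow's GradientTape. In the hypothetical helper below, `model` is assumed to be any Keras classifier and `image` a single, already-preprocessed input array; the "vanilla gradient" approach shown here is simpler than the visualizations in the figures, which may differ in detail.

```python
import numpy as np
import tensorflow as tf

def saliency_map(model, image, class_index):
    """Vanilla-gradient saliency: how much each input value affects one class score."""
    x = tf.convert_to_tensor(image, dtype=tf.float32)[tf.newaxis, ...]
    with tf.GradientTape() as tape:
        tape.watch(x)                          # treat the input as a variable
        class_score = model(x)[0, class_index]
    grads = tape.gradient(class_score, x)[0].numpy()
    saliency = np.abs(grads)                   # gradient magnitude = importance
    if saliency.ndim == 3:                     # color image: collapse channel axis
        saliency = saliency.max(axis=-1)
    return saliency
```

Plotting the returned array (for example, with matplotlib's imshow) gives a per-pixel importance map in the same spirit as Figures 5-5 through 5-7.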

Figure 5-5 provides an example of an image with its top three classifications using the ResNet50 DNN classifier. To the right of the image is the associated saliency map, which highlights the pixels that were most important in generating the classification. The top three prediction scores returned from the ResNet50 classifier are “analog_clock,” “wall_clock,” and “bell_cote.” (A “bell cote” is a small chamber containing bells.)

ImageNet Data

The ResNet50 model used in these examples was trained on ImageNet data. If you are interested in exploring this training data to see for yourself what the model is learning from, search ImageNet for training examples assigned to the different classifications.

The saliency map highlights aspects of the clock face such as the digits, suggesting that ResNet50 deems these to be critical features to its prediction. Look closely and you will see that the classifier is also extracting other “clock” features, such as the hands and the square boxing that may be typical of wall clocks. The third-highest (but nonetheless low) prediction of “bell_cote” might be attributed to the casing below the clock, which has similarities in shape to bell cote chambers.

The aspects of the image in Figure 5-5 that the ResNet50 classifier identifies as salient are what we (as humans) might expect to be most important in the classification. Neural networks are, however, simply generalizing patterns based on training data, and those patterns are not always so intuitive.

Clock image with the appropriate saliency map.
Figure 5-5. Clock image with associated saliency map (ResNet50 classifier)

Figure 5-6 illustrates this point. Two cropped versions of an identical image return completely different results from the neural network.

Where the candles are more prevalent in the top image, the highest prediction is "candle" followed by "matchstick," and it is clear from the associated saliency map that the DNN's attention is on the flames and the candles. The circular outline of the cake is also extracted by the neural network; this is likely to be a feature common across images of Dutch ovens, explaining the third prediction.

The second image, where the candles have been partially cropped, is misclassified by the DNN as a “puck.” The saliency map might provide some explanation for this classification; the disc shape being extracted is similar to that of an ice hockey puck. One of the candles is salient to the DNN, but its flame is not. This may explain the third prediction of “spindle.”

Now let’s consider the Fashion-MNIST model that we trained in Chapter 3. This was a particularly simple neural network, trained to classify very low-resolution images into one of 10 clothing types. This classifier may be basic in the world of DNN models, but it is effective at achieving its task with high model accuracy.

Cake images with the appropriate saliency map.
Figure 5-6. Cake images with associated saliency map (ResNet50 classifier)

Figure 5-7 depicts the pixels that are deemed most important (salient) by the simple model in correctly classifying a selection of images as "Trouser" and "Ankle boot." Unlike the previously shown saliency pictures, the saliency maps here overlay the original images so that the relationship between the pixels and the images is clear. To keep the images simple, the saliency maps also show only the 10 most salient pixels in determining the predicted classification.

Fashion-MNIST images with associated saliency overlaid for the target classification.
Figure 5-7. Fashion-MNIST images with associated saliency overlaid for the target classification (basic classifier)

Figure 5-7 illustrates that the pixels most salient to a DNN in determining the image classification are not what we might expect. For example, it appears that the model has learned to distinguish trousers based primarily on pixels in the top and bottom rows of the image rather than, for example, the leg shape of the trousers. Similarly, particular pixels around the toes of the boots seem important to the "Ankle boot" classification. Certain clusters of pixels, once again near the edges of the images, also have unexpected relevance. Because the model has been trained to find the easiest way to distinguish categories of clothing, it might not pick out the features that we would intuitively use to categorize clothing. Therefore, the pixels at the edges of the image might be sufficient for discriminating between the different clothing categories in the restricted Fashion-MNIST dataset.

With the concepts of an input space and saliency in mind, let’s move on to see how this all relates to the generation of adversarial input.

Perturbation Attack: Minimum Change, Maximum Impact

As you will have gathered from the previous sections, adversarial examples exploit flaws in untested areas of the input space of a DNN model, causing it to return an incorrect answer. These examples introduce a perturbation or patch that would not fool a human, or might not even be noticed by one.3 So, whatever alteration changes a benign image into an adversarial one, there's a broad principle that it should result in minimum change to the data while maximizing the effect on the result produced by the DNN.

Let’s begin by considering the challenge of adding some perturbation—perhaps changing a few salient pixels, or changing many pixels very slightly—to our Fashion-MNIST coat image, to result in a misclassification. Changing a selection of pixels in the image will shift it through the input space to another location, moving it across the landscapes depicted originally in Figure 5-2. This shift is depicted in Figure 5-8 by the arrow going from the original image position indicated by the circle to the adversarial image position indicated by the triangle.

On the one hand, the image must be changed sufficiently so that its position within the input space is no longer within the "Coat" classification area. If this is a targeted attack, there is the additional constraint that the image must move to an area of the input space that will result in the target classification. In Figure 5-8, the changed image is now classified as "Sneaker." Generating adversarial perturbation therefore comes down to the challenge of determining which pixel changes will move the image furthest from the correct classification, and possibly toward a target classification.

On the other hand, any perturbation must be minimized so that it is insignificant to the human eye. In other words, the perturbation is ideally the minimum change to the image required to move it just outside the "Coat" classification boundary or just inside the target classification boundary.4 There are a number of approaches; we might focus on changing a few of the pixels that are most critical for the classification change (the most salient ones), or we could change many pixels, but to such a small extent that the overall effect on the image is not noticeable.

The concepts described use a vastly simplified pictorial representation, but they illustrate the fundamental principles of adversarial example generation, regardless of the technique employed. The generation of adversarial examples typically requires altering a nonadversarial example to move it to a different part of the input space that will change the model’s predictions to the maximum desired effect.

Movement of an image to an area in the input space outside its classification.
Figure 5-8. Untargeted attack—moving outside the “Coat” classification area of the input space
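As a preview of the techniques covered in Chapter 6, the hedged sketch below shows the core mechanic in code: use the gradient of the loss with respect to the input to decide which direction each pixel should move, then take one small step through the input space. The function name, the single-step approach, and the assumption of pixel values in [0, 1] are illustrative only; real attacks refine this considerably.

```python
import tensorflow as tf

def nudge_away_from_class(model, image, true_label, step_size=0.01):
    """One small step through the input space away from the correct class."""
    x = tf.convert_to_tensor(image, dtype=tf.float32)[tf.newaxis, ...]
    label = tf.convert_to_tensor([true_label])
    with tf.GradientTape() as tape:
        tape.watch(x)
        # Loss for the true label: the larger it is, the less "Coat-like" the image.
        loss = tf.keras.losses.sparse_categorical_crossentropy(label, model(x))
    grads = tape.gradient(loss, x)
    # Move each pixel slightly in whichever direction increases the loss.
    perturbed = x + step_size * tf.sign(grads)
    return tf.clip_by_value(perturbed, 0.0, 1.0)[0]   # keep pixels in [0, 1]
```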

Adversarial Patch: Maximum Distraction

The principles behind the generation of an adversarial patch are very similar to those employed for a perturbation attack. Once again, the aim is to alter the input in such a way that it moves through the input space, either away from its initial classification (untargeted attack) or toward a target classification (targeted attack). This time, however, a localized area of the image is altered, rather than applying more general perturbations across the whole picture. The altered area or "patch" must be optimized to "pull" the image toward another part of the input space.

If the target misclassification is “koala,” the patch would ideally represent what would be perceived as the perfect koala, encapsulating every characteristic the model deemed to be important to a koala classification. The patch should contain all the salient features of a koala, so it appears to be more koala-like (to the DNN) than anything that you would ever see in the real world—in other words, it should be an excessively “koala-y” koala. This positions it comfortably within an area of the input space such that the features of the unpatched image are overlooked. The very toastery toaster in Figure 1-5 illustrates this nicely.

Optimizing the adversarial example might also consider the size of the patch, its location on the image, and potentially the way it will be perceived by humans. Moving the patch around on the image and resizing it will obviously have an effect on the resulting image’s position within the input space and may affect its classification.
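The mechanics of placing a patch are simple; all of the effort goes into optimizing the patch's contents. The hypothetical helper below (names and sizes are invented) shows only the placement step.

```python
import numpy as np

def apply_patch(image, patch, top, left):
    """Paste a patch onto a copy of the image at the given position."""
    patched = image.copy()
    h, w = patch.shape[:2]
    patched[top:top + h, left:left + w] = patch
    return patched

# Hypothetical usage: a 50 x 50 patch placed near the top-left corner
# of a 224 x 224 color image (values in [0, 1]).
image = np.random.rand(224, 224, 3)
patch = np.random.rand(50, 50, 3)
patched_image = apply_patch(image, patch, top=10, left=10)
```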

Supernormal Stimulus

The concept of distraction through exaggerated, unnatural versions of things in the real world is not unique to AI. Scientists have also proven a similar concept of supernormal stimulus in animals and humans.

In the 1950s, ethologist Nikolaas Tinbergen demonstrated that artificially exaggerated versions of natural objects could stimulate innate behaviors in gulls to a greater extent than their natural equivalents.5 He proved this with false oversized eggs and mock “beaks” made from knitting needles with patterns that exaggerated those on a real beak. Psychologists have since extended these ideas to humans in areas such as junk food, entertainment, and art.

Measuring Detectability

Methods used to generate adversarial perturbation require a measurement of the distance from the benign input to the adversarial one. This is a measurement of the distance moved through the input space, as shown by the arrow in Figure 5-8. The mathematics then seeks to minimize this value (minimize the change) while ensuring that the input fulfills the adversarial criteria.

Mathematics provides us with different mechanisms for measuring differences between points across multidimensional spaces, and we can exploit these techniques to provide a quantifiable "difference" between two image positions in the input space. Constraining the amount of difference allowed between an adversarial example and its nonadversarial counterpart ensures that the perturbation is minimized. A high similarity score (a small difference) implies that the example is likely to appear nonadversarial to a human observer, whereas a large change suggests it is more likely to be noticed. In fact, human perception is more complex than this, as some aspects of the input may be more noticeable to a human observer than others, so the mathematical quantification may be too simple.

The next section describes how the difference between a benign and an adversarial input can be measured mathematically, and the section “Considering Human Perception” considers the added complexity introduced by human perception.

A Mathematical Approach to Measuring Perturbation

There are several different mathematical approaches to measuring distance in high-dimensional space. These measurements are called Lp-norms, where the value of p determines how the distance is calculated. Figure 5-9 summarizes these various distance measurements.

A visual depiction of Lp-norms
Figure 5-9. A visual depiction of Lp-norm measurements where the number of dimensions is 2

Perhaps the most obvious measurement is the Euclidean distance between the original and adversarial images in the input space. This is simply an application of Pythagoras’ theorem (albeit in a highly dimensional feature space) to determine the distance between the two images by calculating the sum of the squared difference in each feature dimension and then taking its root. In mathematical parlance, this difference measurement is called the L2-norm and belongs to a wider set of approaches for measuring the size of a vector. The vector being considered in measuring adversarial change has its origin at the original image and its end at the adversarial one.

A curious characteristic of high-dimensional spaces is that points within them tend to lie at very similar Euclidean distances from one another. While the L2-norm is the most intuitive measure of distance in the two- or three-dimensional spaces that we understand, it turns out to be a poor measure of distance in high-dimensional spaces. The L2-norm is often used in the generation of adversarial examples, but it is not necessarily the best measure of perturbation.

An alternative distance measurement is the L1-norm, which is simply the sum of all the pixel differences. This is sometimes referred to as the "taxicab" norm; whereas the L2-norm measures direct distance ("as the crow flies"), the L1-norm is akin to the distance a taxicab travels through a city with its streets arranged in a grid plan.

Another approach might be to measure the difference between two images in terms of the total number of pixels that have different values. In input space terms, this is simply the number of dimensions along which the two images differ. This measurement is referred to mathematically as the L0-"norm."6 Intuitively, this is a reasonable approach, as we might expect fewer pixel changes to be less perceptible than many, but the L0-norm does not restrict the amount by which those pixels change (so there could be considerable difference, confined to small portions of the image).

Finally, you might argue that it really does not matter how many pixels are changed, so long as each individual change is imperceptible, or difficult to perceive. In this case we'd be interested in ensuring that the maximum change made to any pixel is kept within a threshold. This is known as the L∞-norm (the "infinity" norm) and has been a very popular approach in research, as it enables many infinitesimal, imperceptible changes to an image that, combined, can have a significant effect on the image's classification.
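These measurements are straightforward to compute. The sketch below calculates all four for a hypothetical perturbation to a grayscale image (pixel values scaled to [0, 1]).

```python
import numpy as np

def perturbation_norms(original, adversarial):
    """The Lp-norm measurements discussed above, computed on the perturbation."""
    delta = (adversarial - original).flatten()
    return {
        "L0": int(np.count_nonzero(delta)),        # number of changed values
        "L1": float(np.sum(np.abs(delta))),        # "taxicab" distance
        "L2": float(np.sqrt(np.sum(delta ** 2))),  # Euclidean distance
        "Linf": float(np.max(np.abs(delta))),      # largest single change
    }

# Hypothetical example: change three pixels of an otherwise unchanged image.
original = np.zeros((28, 28))
adversarial = original.copy()
adversarial[0, 0] += 0.2
adversarial[5, 5] += 0.1
adversarial[9, 9] -= 0.1
print(perturbation_norms(original, adversarial))
# {'L0': 3, 'L1': 0.4..., 'L2': 0.24..., 'Linf': 0.2}
```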

So, which is the best measurement for an adversary to use to ensure an effective perturbation attack? Is it better to change a few pixels (minimize the L0-norm), to change many pixels but constrain each change to be small (minimize the L∞-norm), or perhaps to minimize the overall distance that the perturbation moves through the input space (the L2-norm or L1-norm)? The answer depends on several factors that will be discussed later in the book, including human perception and the level of robustness required of the adversarial example to data preprocessing.

Considering Human Perception

The challenge of the adversarial example is to generate input that results in incorrect interpretation by the network, but without being detectable as an attack by a human being. This might mean that the change is imperceptible to humans or insignificant enough for a human to disregard it either consciously or subconsciously.

At the most fundamental level, our perception is constrained by the physical limitations to the range of electromagnetic or sound waves that our senses can process. It would seem intuitive, therefore, that the data allowed into neural network technology designed to mimic human decisions about images or audio should be subject to similar restrictions as those imposed on us by our eyes and ears. To a large extent, this constraint is imposed on human-consumable digital data. For example, image formats (PNG, JPEG, etc.) are designed to represent information in the visible spectrum. Similarly, much audio processing is constrained to frequencies that are human-audible or, in the case of speech processing, within the range of sound produced by the human vocal tract. Without these constraints, an adversary might simply augment data with information that humans cannot hear or see to confuse a DNN (see the following note for an example).

Dolphin Attack: Exploiting Ultrasound

In 2017, researchers Zhang et al. demonstrated the efficacy of ultrasound voice commands as a mechanism to add an audio adversarial patch inaudible to humans but discernible by digital assistants (referred to as a “dolphin” attack).7 While interesting, this type of attack could be easily prevented simply by ensuring that the digital assistant filters out sounds inaudible to humans or, better still, outside the human vocal range. Adversarial attacks that use parts of the electromagnetic spectrum or sound wave frequencies that cannot be perceived by human eyes and ears are unlikely, therefore, to pose a realistic threat.

Assuming all the data presented is within the human sensory range, the problem with the mathematical difference measurements described here is that they assign equal weight to each part of the input data. The measurements assume every pixel within an image is perceived equally and contributes equally to a human's perception of that image. This is clearly not the case; it has been shown that people are typically less aware of changes to a busy part of an image. Pixel changes in a simple area (clear sky, for example) are likely to be more noticeable.

For audio, the distortion metric often employed in generating adversarial examples is decibels (dB), a logarithmic scale that measures the relative loudness of the distortion with respect to the original audio. This is a good way to ensure that adversarial audio remains imperceptible to humans because it ensures that the changes during quiet points are relatively small with respect to any changes introduced to louder aspects of the audio.
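A sketch of this measurement, following the convention commonly used in audio adversarial example research (exact formulations vary between papers), with an invented sine-wave example:

```python
import numpy as np

def db(x):
    """Peak level of a waveform in decibels (a relative, logarithmic scale)."""
    return 20 * np.log10(np.max(np.abs(x)) + 1e-12)

def relative_distortion_db(original, adversarial):
    """Loudness of the perturbation relative to the original audio.

    A more negative value means a quieter, less perceptible distortion.
    """
    return db(adversarial - original) - db(original)

# Hypothetical example: a small amount of noise added to a 440 Hz tone.
t = np.linspace(0, 1, 16000)
original = np.sin(2 * np.pi * 440 * t)
adversarial = original + 0.001 * np.random.randn(len(t))
print(f"Distortion: {relative_distortion_db(original, adversarial):.1f} dB")
```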

There has been considerable research into the aspects of image or sound that humans pay most attention to. These are the salient features from the human rather than the machine perspective (as previously discussed in “What’s the DNN Thinking?”). Adversarial examples might be improved by skewing perturbations toward aspects of the input data that are less interesting to a human but more interesting to the model. Consider images: humans subconsciously divide image information into constituent parts, paying greater attention to the foreground and less to the background. There may be effective, simple techniques that could enable greater flexibility in creating adversarial input, such as favoring perturbations in busy background areas of the image.

We can turn to psychology for the definition of the absolute threshold of sensation: the minimum stimulus required for it to be registered by an individual 50% of the time. Unsurprisingly, this varies from person to person and also varies for a specific individual depending on aspects such as their physiological state. From the perspective of an adversary, understanding thresholds by which stimuli might be registered by a human may be beneficial in creating adversarial data.

Another interesting consideration is sensory adaptation. Our senses become less responsive to constant stimulus over time, enabling us to notice changes that might be important to survival. In a car, for example, we may stop noticing the sound of the engine after a time. Conversely, we are particularly sensitive to abrupt changes to sensory input, such as sudden noise. From an adversarial perspective, therefore, there may be benefit in gradually introducing perturbation to video or audio to remain undetected.

Summary

This chapter introduced some of the high-level principles underpinning adversarial examples. Prior to getting into more detail about adversarial techniques in Chapter 6, here’s the high-level mathematical explanation of adversarial input.

1 The number of pixels is 1,280 x 1,024, which equals 1,310,720. There are 3 channels per pixel, giving 3,932,160 pixel values, each between 0 and 255.

2 28 x 28 = 784 grayscale pixels for each image.

3 As mentioned in Chapter 1, as strictly defined an adversarial example does not need to remain unnoticed by a human, as this term may be used to signify adversarial intent. However, for the audio and image discussions in this book, these are the adversarial examples that we are interested in.

4 There is a risk that a less confident adversarial example might creep back to its original (correct) classification if it resides too close to the critical boundary that makes it adversarial. This might be the case, for example, if its pixels were changed in a small way during the processing chain prior to it reaching the DNN classifier. An adversarial example with greater robustness might reside more significantly away from the original classification boundary or more comfortably within the target classification area (in the case of a targeted attack).

5 Niko Tinbergen, The Herring Gull’s World: A Study of the Social Behavior of Birds (London: Collins, 1953), 25.

6 The quotes are deliberate here; see the mathematical explanation in “Mathematical Norm Measurements” if you are interested in an explanation.

7 Guoming Zhang et al., “DolphinAttack: Inaudible Voice Commands,” Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), http://bit.ly/2MWUtft.
