Chapter 7. Attack Patterns for Real-World Systems

In this chapter we explore attack patterns that could be used to generate adversarial input, taking into account the attacker’s goals and capabilities. These patterns exploit the methods described in Chapter 6, and as we will see, the approach selected will depend on factors such as the adversary’s access to the target for testing and developing adversarial input, and their knowledge of the target model and processing chain. We’ll also consider whether an adversarial perturbation or an adversarial patch could be reused across different image or audio files.

Attack Patterns

Chapter 6 considered different techniques for generating adversarial examples. These methods have been proven in a “laboratory” environment, but how do they play out in real-world scenarios where the adversary has limited knowledge of or access to the target model and broader system? Creating adversarial input that is effective in a real-world scenario will pose a significant challenge to any attacker.

There are several different patterns that might be exploited to generate adversarial input and subsequently launch an attack. These patterns vary in terms of complexity and the resources needed to generate adversarial examples. In addition, some approaches require greater knowledge of, or access to, the target system than others. The pattern selected may also depend upon the required robustness and covertness of the attack.

Broadly speaking, we can categorize these approaches as follows:

Direct attack

The attacker develops the attack on the target system itself.

Replica attack

The attacker has access to an exact replica of the target DNN in order to develop the attack.

Transfer attack

The attacker develops the attack on a substitute model which approximates the target.

Universal transfer attack

The attacker has no information about the target model. They create adversarial input that works across an ensemble of models that perform similar functions to the target in the hope that it will also work on the target DNN.

Figure 7-1 provides a summary of the four approaches.

Figure 7-1. Different attack patterns for developing adversarial input

In the following sections we explore each of the patterns in greater detail. We’ll assume at this stage that the adversary can manipulate the digital content; physical-world adversarial examples will be considered in Chapter 8.

Attack Pattern Terminology

Many different terms are used across the literature to describe the various attack patterns. The lack of consistently adopted terminology can make things quite confusing.

For example, you may come across the term black box used to refer to a direct attack. Similarly, a replica attack using white box methods may be referred to elsewhere simply as a “white box attack.”

To avoid ambiguity, the terms white box and black box have not been used in this book to refer to attack patterns because these terms also imply the use of a specific algorithmic method.

Consider, for example, a replica attack. As the attacker has complete knowledge of the model, the model architecture, and its parameters, it may seem logical to use a white box method to generate adversarial input. However, the adversary could use a black box method (such as the boundary attack), possibly because it is more robust against defenses, or maybe just because it’s easier to implement. Similarly, although transfer attacks are sometimes referred to as “black box,” such an attack could use white box or black box methods on the substitute model.

Direct Attack

In a direct attack, the attacker is able to submit inputs to the actual target and receive corresponding results, enabling accurate feedback to refine adversarial input.

In such an attack, the adversary is unlikely to have access to more detail than the restricted responses that the target system returns.1 Additionally, the feedback may not be direct but inferred; for example, failure of a video to be successfully uploaded might suggest that it has been classified as containing violent content, although the adversary did not receive this classification explicitly. Creation of adversarial input will therefore require a black box approach. As discussed in “Limited Black Box Methods”, black box approaches iteratively refine queries submitted to the system based on the responses returned to morph the input and move it to the required adversarial area of the input space.

A direct attack poses a significant problem. Finding that perfect adversarial input using a black box approach such as a boundary attack takes many iterations (tens of thousands). Each iteration also requires a handful of queries to the target DNN. That’s a lot of queries, which are unlikely to go unnoticed by the defending organization! What’s more, the throughput and latency of a commercial deployment will slow the rate at which these queries can be processed. In fact, the target system might also limit queries or introduce latency to responses specifically to protect against such an attack. If the attacker is fortunate enough to have access to the scores returned from the target, it may be possible to reduce the query volume through more intelligent strategies, such as the genetic algorithm approach described in “Score-Based Black Box Methods”. However, as discussed previously, access to the scores is likely to be limited.
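To make the query cost concrete, here is a minimal sketch of the kind of decision-based loop an attacker might run against a label-only interface. It is not a full boundary attack: the query_target function, the step sizes, and the acceptance rule are all simplifying assumptions, and a real implementation would need to respect the target’s rate limits and input format.

import numpy as np

def query_target(x):
    """Hypothetical stand-in for the deployed system's interface.
    Returns only a predicted label -- no scores, no gradients."""
    raise NotImplementedError  # replace with a (rate-limited) call to the target

def decision_based_attack(original, starting_adversarial, target_label,
                          steps=10_000, step_size=0.01, seed=0):
    """Very simplified boundary-attack-style random walk.

    `starting_adversarial` is any input the target already assigns
    `target_label` (for example, a genuine image of the target class).
    Each iteration nudges the candidate toward `original` and adds a
    small random perturbation, keeping the step only if the result
    remains adversarial.
    """
    rng = np.random.default_rng(seed)
    adv = starting_adversarial.copy()
    queries = 0
    for _ in range(steps):
        # Move slightly toward the original image (reduces perceptibility)...
        candidate = adv + step_size * (original - adv)
        # ...then explore with a small random perturbation.
        candidate = np.clip(candidate + step_size * rng.standard_normal(adv.shape),
                            0.0, 1.0)
        queries += 1
        if query_target(candidate) == target_label:
            adv = candidate  # still adversarial: accept the step
    return adv, queries

Even this toy loop issues one query per iteration; tens of thousands of iterations means tens of thousands of queries, which is exactly the detection and throttling problem described above.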

The direct approach considers the complete processing chain and any active defenses being employed by the system, not only the DNN. So, although an attacker may use one of the other nondirect approaches to develop an attack, an element of direct experimentation on the target is likely to be critical to ensure the adversarial robustness of the input.

Replica Attack

An obvious approach to developing adversarial input is to use an exact replica of the target to finesse the adversarial input, prior to launching it on the target. We’ll consider a couple of scenarios: when the attacker has access to a replica of the complete target, and when the attacker has access to just the DNN algorithm.

Replica system

It is possible for the attacker to have a local copy of the complete target system to experiment with; for example, a commercially purchased digital voice assistant or perhaps an autonomous vehicle. An attacker could develop adversarial input on this local copy by repeatedly submitting queries and monitoring the responses, using a black box method.2

In practice, target systems that are commercially available to buy (rather than web hosted) often do not accept digital input or return a digital response. For example, a commercially purchased digital assistant is unlikely to provide a convenient programming interface to its internal processing chain that the attacker can use to iteratively refine the adversarial input through repeated queries. Rather, it will take audio input and return audio (speech) or instigate some digital command (such as an online purchase). Automating the generation of adversarial input is more challenging when the interaction (request or response) is not digital.

As with a direct attack, access to a complete replica enables the attacker to test the attack against the complete processing chain, not just the target DNN.

Replica DNN

Armed with information about all aspects of the trained DNN under attack (that is, the model architecture and all its parameters), the attacker is in a strong position to generate adversarial input. In traditional software terms, it’s akin to having knowledge of the source code of an internal algorithm in a target system. With this knowledge and sufficient capability, the attacker can create a replica of the DNN and use any method they wish on the replica to create adversarial examples that exploit flaws in the target DNN exactly.

Using a copy of the target to develop adversarial input might seem like a no-brainer, but where would an attacker get access to a replica? Surely a security-conscious organization would never knowingly share any algorithm that it uses internally? This isn’t necessarily true. Because DNNs typically require large amounts of labeled data, computational resources, and data science expertise to train effectively, it would not be unreasonable for an organization to use a pretrained model obtained commercially or through open source, simply to save time and resources. If the attacker is aware of the model being used and has access to an identical one (either by creating a replica or by using an existing published copy), they can launch a replica attack. Even if the attacker has no insider knowledge of the model being used, they might be able to infer likely models from those publicly available and through educated guesswork about the target organization’s internal processing.
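As an illustration, here is a minimal sketch of a white box attack run against a local replica. It assumes the replica is a Keras model that outputs class probabilities and uses the Fast Gradient Sign Method described in Chapter 6; the epsilon value and the input format are illustrative assumptions.

import tensorflow as tf

def fgsm_on_replica(replica_model, x, true_label, epsilon=0.01):
    """Untargeted FGSM computed on a local replica of the target DNN.

    replica_model : a tf.keras.Model believed to be identical to the target
    x             : input batch of shape (1, height, width, channels), values in [0, 1]
    true_label    : integer class index for x
    """
    x = tf.convert_to_tensor(x)
    label = tf.convert_to_tensor([true_label])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(label, replica_model(x))

    # Step in the direction that increases the replica's loss. Because the
    # replica is (assumed to be) identical to the target, the same
    # perturbation should fool the deployed model.
    gradient = tape.gradient(loss, x)
    return tf.clip_by_value(x + epsilon * tf.sign(gradient), 0.0, 1.0)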

Transfer Attack

When the adversary does not have access to the DNN algorithm used by the defending organization, it might be possible to use a good-enough approximation of the target DNN to develop adversarial input prior to launching it on the target. This is called a transfer attack.

The transfer attack strategy is likely to be the most feasible approach in many real-life scenarios where the internal DNN implementation is unknown outside the defending organization. This approach is also preferable to a direct attack because the development of the adversarial example is performed without querying the target. Thus, it does not arouse suspicion and is not subject to any limitations on or throttling of queries.

In a transfer attack, the adversary creates a substitute model to develop the adversarial input, based on some limited knowledge of the target DNN. They use white box, score-based black box, or limited black box methods on this substitute to refine the adversarial example, prior to submitting it to the target. As with a replica attack, they may not need to create their own substitute model; they might be able to exploit an existing one instead (for example, an online model with open API access).

There is, of course, an obvious problem for the attacker: the substitute needs to behave in a similar way to the target, at least on the adversarial input. So, without access to the exact model, how easy would it be to create something close enough to work?

Creating something akin to the target system’s DNN might appear to be an insurmountable problem. Think of the complexity of a DNN: how many layers does it have, what activation functions does it use, what’s its structure, and what are its possible outputs? Is it really possible to create adversarial input using an approximate model substitute that will transfer effectively to the target system?

The answer is quite surprising. It turns out that, with limited access to information about the model, it is sometimes possible to create an imitation close enough to the original to develop adversarial examples that will transfer between the two models.3 To understand why this is the case, consider what defines a DNN:

Input and output mappings

This includes the format and precision of the input and the allowable outputs. In the case of a DNN classifier, there’s the set of classifications and potentially the hierarchical structure of those classes (e.g., “dog” is a subset of the “animal” class).

Internal architecture

The network type (for example, LSTM or CNN), its layers, and the number of nodes in each layer. This includes details of the activation function(s): essentially, the aspects of the model defined prior to training.

Parameters

The weights and biases established during training.

The adversary might be able to make an educated guess regarding certain aspects of these. For example, for the DNN architecture, it’s likely that image classification is performed with some type of convolutional network. The adversary may also be able to guess the resolution of the input and possible output predictions (even without information on the exhaustive list of classifications).
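To make this concrete, here is a sketch of what such an educated guess might look like for an image classifier, written with Keras. Every choice here (the input resolution, the layer sizes, the assumption of 10 output classes) is illustrative; none of it needs to match the target exactly for the substitute to be useful.

import tensorflow as tf
from tensorflow.keras import layers

# An educated guess at the target: a small convolutional network over
# 224x224 RGB images with an assumed 10 output classes.
guessed_substitute = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),  # guessed number of classes
])
guessed_substitute.compile(optimizer="adam",
                           loss="sparse_categorical_crossentropy",
                           metrics=["accuracy"])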

Assuming the attacker has made some informed guesses about the network architecture and input/output mappings, let’s consider the case where they also have access to some, or all, of the training data. This is not an unreasonable assumption: due to the cost of generating and labeling training data, training sets are often shared online.

Knowing the training data can enable the attacker to create a good approximation of the target model, even if they are able to infer very little about its architecture. Put another way, two models with different architectures are likely to be susceptible to similar adversarial examples if they have been trained with the same data. This idea is illustrated in Figure 7-2: two completely different models trained on the same data are likely to establish similar prediction landscapes. This is as you would expect, because the training data itself is instrumental in defining the model parameters.

Figure 7-2. Input spaces of target and substitute models with training data indicated

Regions of the input space where the model returns an incorrect result are known as adversarial subspaces. These subspaces occur where the training step has failed to generalize correctly, so they are likely to be at similar locations in models that share training data. It is therefore probable, though not guaranteed, that an adversarial example will transfer successfully if the substitute model shares its training data with its target.
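The following sketch shows how an attacker might exploit shared training data: train the guessed substitute on that data, craft adversarial examples against the substitute (reusing, say, the FGSM sketch above), and estimate how often they transfer to the label-only target. The query_target function is again a hypothetical stand-in for the deployed system, and the variable names are illustrative.

import numpy as np

# Assumed to be available: x_shared, y_shared (the shared training data),
# the guessed_substitute model, fgsm_on_replica, and query_target.
# guessed_substitute.fit(x_shared, y_shared, epochs=10, batch_size=64)

def transfer_rate(substitute, x_samples, y_samples, craft, query_target):
    """Fraction of adversarial examples crafted on the substitute that also
    change the (black box) target's predicted label."""
    fooled = 0
    for x, y in zip(x_samples, y_samples):
        x_adv = craft(substitute, x[np.newaxis, ...], int(y))
        if query_target(np.asarray(x_adv)) != int(y):
            fooled += 1  # the target's label changed: the example transferred
    return fooled / len(x_samples)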

To illustrate the notion of similar adversarial subspaces across models, Figure 7-3 depicts this idea using the prediction landscapes of the two models shown in Figure 7-2. The models have similar adversarial subspaces as they are based on the same training data. Hence, adversarial examples are likely to transfer across these models.

Figure 7-3. Input spaces of target and substitute models with adversarial subspaces indicated in white

As we will see in Chapter 10, the ability to approximate a DNN model based on knowledge of the training data has important implications for information security. Training datasets should therefore be treated as sensitive artifacts, as they indirectly reveal the behavior of the machine learned models they are used to train.

Universal Transfer Attack

A universal transfer attack is a method exploited when the adversary has no knowledge of the target DNN or its training data. In this scenario, the adversary creates adversarial input using an ensemble of substitute models. If the example works across a variety of substitutes, it may be flexible enough to transfer to the target.

It’s interesting that although training datasets may differ, they are likely to populate similar areas of the input space if they are derived from information representing the natural world. The areas of the input space where there is a lack of representative data (corresponding to OoD data4) are likely to be similar regardless of the training dataset. Training datasets are also likely to bear similarities to each other due to (for example) common camera angles or common voice characteristics. In other words, different training datasets may have similar distributions in terms of characteristics, features, and, most importantly, adversarial subspaces.

These shared, or universal, adversarial subspaces are what make a universal transfer attack possible. It is (unsurprisingly) more difficult to achieve than the other patterns, but the ability to launch a universal transfer attack is very powerful to an adversary. Adversarial input that works across models also allows an attack against a group of DNNs (such as multiple search engines). Once again, in practice an attacker might combine a universal transfer attack with direct experimentation (see the following sidebar for an example).

A particularly interesting application of the universal transfer attack arises when the attacker receives no feedback from the target because the target organization is processing the data for its own purposes (market analysis, for example). We could consider this a “black hole” attack; the adversary may never know whether the attack succeeded and will need a high tolerance for failure.
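Before moving on, here is a sketch of how adversarial input might be developed across an ensemble of substitutes. It simply sums the loss over every model in the ensemble before taking a gradient step, on the assumption (as in the earlier sketches) that the substitutes are Keras models returning class probabilities over the same input format; an example that fools all of them is more likely to sit in a shared adversarial subspace and therefore to transfer to the unseen target.

import tensorflow as tf

def ensemble_fgsm(models, x, true_label, epsilon=0.02):
    """Craft a single perturbation against an ensemble of substitute models."""
    x = tf.convert_to_tensor(x)
    label = tf.convert_to_tensor([true_label])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    with tf.GradientTape() as tape:
        tape.watch(x)
        # Sum the losses so the gradient step pushes the input across every
        # substitute's decision boundary at once.
        total_loss = tf.add_n([loss_fn(label, model(x)) for model in models])

    gradient = tape.gradient(total_loss, x)
    return tf.clip_by_value(x + epsilon * tf.sign(gradient), 0.0, 1.0)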

Reusable Patches and Reusable Perturbation

Imagine the possibility of creating reusable adversarial alterations that could be shared across different input data: for example, an adversarial perturbation or patch that works on any image, or an adversarial distortion applicable to all audio files. This create-once-and-reuse approach would open an exciting opportunity for our adversary. No longer would they need to generate fresh adversarial content for every image or audio file; they could simply overlay a patch or perturbation prepared earlier, saving on cost, time, and queries to the target system.

In “Adversarial Patch,” researchers Brown et al.5 generated patches that can be reused on different images. The size and location of the patch on the image it is added to will affect its effectiveness, but these patches really do work across images. This makes intuitive sense: if you can find a super-salient patch that is able to distract a neural network, it is likely to work across multiple images.
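Applying a precomputed patch is trivial, which is part of its appeal. The following minimal sketch simply overlays a patch onto an arbitrary image at a chosen position; the function and its arguments are illustrative, and in practice the location and scale would be tuned for effectiveness.

import numpy as np

def apply_patch(image, patch, top, left):
    """Overlay a precomputed adversarial patch onto any image.

    image     : array of shape (height, width, channels), values in [0, 1]
    patch     : array of shape (h, w, channels), the reusable patch
    top, left : placement of the patch's top-left corner
    """
    patched = image.copy()
    h, w = patch.shape[:2]
    patched[top:top + h, left:left + w, :] = patch
    return patched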

Experimentation with Reusable Patches

If you’d like to try a reusable patch on your own images, head over to the Jupyter notebook chapter07/reusable_patch.ipynb in this book’s GitHub repository.

Intuitively, you might assume that reusable adversarial perturbation is unlikely to be achievable, because the adversarial change is generated based on the characteristics of a specific image or piece of audio. Put another way, adversarial perturbation is a movement from one location in the input space that is unlikely to work from a different starting point. Figure 7-4 illustrates this point. It shows a zoomed-out view of an imaginary input space containing two images that share the same classification but sit at different locations. When the adversarial perturbation calculated to move image 1 to a misclassification is applied to image 2, the second image still resides comfortably within the correct classification area.

Figure 7-4. Transferring adversarial perturbation to a different image

Although it might seem unlikely, researchers Moosavi-Dezfooli et al.6 have proven that it is possible to generate perturbations that are universal. This is achieved by optimizing adversarial perturbation using a formula that considers a sample of images from a distribution, rather than a single image in isolation.

The technique works as follows. An adversarial perturbation is calculated for an initial image to take that image outside its classification boundary. The same perturbation is then applied to a second image in the sample. If the second image is correctly classified (in other words, remains nonadversarial) after the perturbation has been applied—as shown in Figure 7-4—the perturbation has failed to demonstrate that it works across both images. An additional delta to the original perturbation is calculated that is sufficient to misclassify the second image. As long as this second delta does not return the first image to its correct classification, the resulting perturbation will succeed across both images. Figure 7-5 depicts this second delta.

Figure 7-5. Calculating universal adversarial perturbation

This is repeated over a distribution of realistic images, with the constraints that the resulting perturbation must remain within a minimally quantified change (as defined by an Lp-norm) and that a specified proportion of the images must be fooled (termed the fooling rate). Obviously, a higher fooling rate may require more perturbation and therefore be more noticeable, but the results are impressive. Although the perturbation is not guaranteed to work over all images, the researchers demonstrated that it is possible to create perturbations that are imperceptible and work on 80% to 90% of images. They also showed that these perturbations transferred well across different models. The universal perturbations are untargeted; they change any image from its correct classification to another, unspecified one, and the resulting adversarial classification may differ between images.
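The following sketch conveys the shape of this algorithm in simplified form. It assumes a per-image attack function (a cheap stand-in for the DeepFool step used by the researchers), a label-only predict helper, and an L-infinity budget rather than a general Lp projection; all of these are simplifications of the published method.

import numpy as np

def universal_perturbation(model, images, labels, per_image_attack, predict,
                           epsilon=0.04, target_fooling_rate=0.8, max_passes=5):
    """Simplified sketch of building an image-agnostic perturbation.

    For each sampled image the current perturbation fails to fool, compute
    the extra per-image delta needed to cross the decision boundary, add it
    to the running universal perturbation, and clip back into the allowed
    budget. Stop once the fooling rate over the sample is high enough.
    """
    v = np.zeros_like(images[0])
    for _ in range(max_passes):
        fooled = 0
        for x, y in zip(images, labels):
            if predict(model, x + v) != y:
                fooled += 1  # already fooled; nothing more to do for this image
                continue
            # Extra delta needed to misclassify this particular image.
            delta = per_image_attack(model, x + v, y) - (x + v)
            # Accumulate and project back into the perturbation budget.
            v = np.clip(v + delta, -epsilon, epsilon)
        if fooled / len(images) >= target_fooling_rate:
            return v
    return v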

Methods similar to those described above are likely to be applicable to audio. In the same way that the location and size of an adversarial patch on an image might affect its perceptibility and effectiveness, an audio patch may need to be located at a particular point within the audio or have its loudness adjusted to work optimally. In addition to the benefits of reusability, universal adversarial audio distortion could potentially be played in any environment to, for example, maliciously control audio devices with hidden voice commands embedded in the waveform. Adversarial audio reproduced in the physical world is discussed in “Adversarial Sound”.

Adversarial alterations designed to work across multiple inputs cannot be guaranteed to succeed on any particular one. However, they are low-cost, so they may be suitable for cases where robustness is not paramount (see the following sidebar for a hypothetical example). Perhaps the most interesting aspect of this approach is that it potentially allows adversarial capability to be shared between different threat actors. For example, a group or individual could sell adversarial patches online with simple software to attach them to images.

Bringing It Together: Hybrid Approaches and Trade-offs

In practice, a real-world attack is likely to comprise a combination of approaches. Perhaps the adversary will initially develop adversarial examples on a substitute model, then occasionally test them on the target system to get the best of all worlds.

As with all cyberattacks, preparation is key. Generating effective adversarial input prior to the attack is likely to involve experimentation on the target system to establish what works and what doesn’t. So there’s often a trade-off between covertness and robustness; developing a robust adversarial example may require more queries to the target. Running more queries, however, increases the chance of detection.

Adversarial examples that perform well in a theoretical environment may be susceptible to processing in the broader system. For example, as we will see in Chapter 10, adversarial examples often exploit the accuracy and resolution of digital data, so preprocessing that reduces data resolution may reduce adversarial effectiveness. Therefore, unless the adversary has an exact copy of the complete system under attack (that is, the complete processing chain, not just the DNN), they are likely to need to perform at least some experimentation on the target system.

Covertness during the generation of adversarial examples boils down to minimizing the number of queries to the target system. The replica, transfer, and universal transfer attacks all use substitute models to develop the attack prior to launching it. Access to a substitute model in these cases grants the adversary a luxury: it removes the requirement to submit inputs to the target system until the actual attack, making it easier for the attacker to remain undetected.

In addition to all these challenges, unfortunately for our adversary, the defending system may not simply be an implementation that processes all input passively; there may be active defenses in place. We’ll get to these in Chapter 10.

Knowledge of the target is always beneficial to an attacker. As we will also see in Chapter 10, the DNN models, or data that could be used to derive them, such as training data, should be treated as sensitive assets of an organization.

There is no “standard” threat. This chapter has presented several approaches to generating adversarial examples, but in practice, a combination of approaches that weighs these trade-offs against the target system and the adversarial goals will be most effective.

1 In real-world systems where there might be a motivation for an attack, the adversary will not receive the raw probabilistic scores from the target.

2 The local copy may send information back to a centralized backend for processing, as is often the case for voice controlled audio assistants.

3 Florian Tramèr et al., “The Space of Transferable Adversarial Examples” (2017), http://bit.ly/2IVGNfc.

4 Out-of-distribution data was introduced in “Generalizations from Training Data”.

5 Brown et al., “Adversarial Patch.”

6 Seyed-Mohsen Moosavi-Dezfooli et al., “Universal Adversarial Perturbations,” IEEE Conference on Computer Vision and Pattern Recognition (2017), http://bit.ly/2WV6JS7.
