Chapter 9. Evaluating Model Robustness to Adversarial Inputs

To begin the exploration of defenses, this chapter looks at evaluating the robustness of DNN models against adversarial examples. This will provide the foundations for understanding the effectiveness of the defenses described in Chapter 10.

Evaluating the robustness of individual DNN components enables objective comparison of models and defense approaches. For example, this might be for research purposes, to see whether a new defense approach is more or less effective than previous approaches. Alternatively, evaluation may be necessary to ensure that the most recently deployed model in your organization is at least as secure as the previous version.

Model evaluation requires a consistent methodology and consistent measures to ensure that the metrics used for comparison are objective. Unfortunately, generating metrics that indicate a neural network’s ability to defend against adversarial examples is not simple. We first need to answer the question: defense against what? Therefore, based on Chapter 7 and Chapter 8, we will begin by considering how we model the threat that is being defended against. This is the threat model discussed in “Adversarial Goals, Capabilities, Constraints, and Knowledge”.

Complete Knowledge Evaluation

When evaluating defenses, it’s important to bear in mind that keeping the workings of the target system secret should never be viewed as defense in itself. Information security practice does not adhere to the “security by obscurity” principle,1 so any evaluation of adversarial attacks should assume the adversary has complete knowledge of the DNN and any defense mechanisms in place. This is sometimes referred to as the “complete knowledge” attack scenario.

Evaluation against consistent threat models is critical for the objective comparison of defenses. In “Model Evaluation” we’ll look at techniques for evaluating a model’s robustness to adversarial examples.

“Empirically Derived Robustness Metrics” examines the role of the threat model during testing and considers some of the empirically derived metrics that can be used to establish model robustness.

We’ll then consider whether it is possible to use theoretical approaches to determine a model’s robustness in “Theoretically Derived Robustness Metrics”. A formal proof of a model’s robustness has the advantage that it would enable us to derive metrics relevant to a model but independent of the threat.

“Summary” summarizes this chapter before we delve into the defenses in Chapter 10.

The evaluation of a model (including any defense built into the model) is important, but it is not the whole story. Used in real systems, DNNs are simply components in a broader processing chain. An evaluation of a model in isolation is useful for model comparison and for assuring the individual component, but the system should also be tested as a whole. Assuring required levels of system robustness to adversarial examples requires broader system security testing that treats adversarial examples as a potential vulnerability. A common approach to cybersecurity testing of systems is red-blue teaming.

Red-Blue Teaming

Red-blue teaming assumes that the individuals evaluating the defense by attacking it (the red team) are not those who developed it (the blue team). The red team emulates the behaviors of the adversary, and the blue team emulates the defending organization. Separating the roles of attacker and defender ensures a different mindset in attack methodology, making it more likely that weaknesses in the system will be found.

Red teams should attempt to exploit adversarial example vulnerabilities using methods such as those described in Chapter 7 and Chapter 8 to emulate the best possible attack. Unlike in model evaluation, the red team should not be given complete knowledge of the target system, as this may bias their thinking and cause them to miss vulnerabilities (for example, by not considering attacks against parts of the system where they know defenses are in place).

Adversarial Goals, Capabilities, Constraints, and Knowledge

When the attacker’s profile is systematically scrutinized in the context of a target system, this articulation of the threat is known as a threat model. Threat modeling is a detailed analysis of the threat actor, attack vectors, and risks, and is specific to the organization and scenario. This chapter is not intended to provide a complete threat model. Rather, it describes some of the key information that should be captured when modeling the threat of adversarial examples—information that could be used in a broader threat modeling scenario.

It may not be feasible to model or anticipate all possible threats, but it is prudent to consider different possible threat scenarios. An understanding from the attacker’s perspective will aid the design of secure defenses against adversarial input.

The adversary is often modeled in terms of goals, capabilities, constraints, and knowledge.

Goals

Let’s start with the goals—what is the attacker trying to achieve, and how does this define the nature of the adversarial input that will be presented to the system?

At the highest level, you can view the threat goals as what the attacker is trying to achieve by fooling the system. This takes us back to the initial motivations presented in Chapter 2. More nebulous reasons such as “because I can” shouldn’t be overlooked. Such motivations can result in less specific goals and less obvious threat models.

The high-level motivations determine the goals. These goals can be defined by the required specificity of the attack, minimum success rate, and maximum perturbation limit (Figure 9-1):

Specificity

The required specificity of the attack refers to whether the aim is simply to create a false prediction from the DNN (an untargeted attack) or to create a specific false prediction (a targeted attack). This is determined by the motivation of an attack. For example, it may be sufficient for an evasion attack to be untargeted if it does not matter what the DNN’s interpretation of the input is, as long as it evades a specific interpretation. Conversely, an attack to create confusion may require that the DNN interprets the input in a targeted way to generate a specific false positive.

Success rate

The attack success rate refers to the confidence of the adversary that the attack will achieve its aim of fooling the classifier to the required specificity.

The rate at which the attacker requires attacks to be successful will be determined by the consequences of the input failing to achieve its aim. An evasion attack typically warrants greater confidence in success than a false-positive attack, for example. If failure to evade the AI might result in prosecution, the stakes would be higher and more effort would be invested into methods that will ensure that the adversarial input is robust to failure. Conversely, if the motivation is to cause confusion with large quantities of adversarial input resulting in false positives, it may not matter much if the attack fails some of the time. While motivations such as evasion from detection might appear obvious, other motivations, such as creating a denial of service (DoS), causing confusion, or simply discrediting an organization, should not be overlooked. These types of attacks may not require high success rates, which makes it far easier for the adversary to launch an attack.

Perturbation limit (perceptibility)

The perturbation limit is the maximum perturbation acceptable to the attacker and is an indirect measurement of perceptibility. The acceptable perceptibility of the adversarial input depends on the attack motivation and context. In some cases, there may not even be any need to disguise the adversarial alteration. Take, for example, an adversarial patch added to the corner of an image. The human sender and human recipient of the image may both be aware that the patch has been added, so the sender doesn’t need to disguise the patch if it does not affect the main image content. In other scenarios, minimizing perceptibility may be very important. For example, if adding a malicious voice command to audio, it is likely to be important that it is not noticed.

Although perturbation is generally measured in terms of Lp-norms, there’s no reason why another measure of change might not be used. Perhaps it will incorporate a more complex algorithm that takes into account the nuances of human perception or the context in which the adversarial example will be seen: for example, giving a higher perturbation value to a change that a person might notice than to one that’s not obvious, even if they have the same “digital distance.” In the context of an adversarial patch, for example, the perturbation might be measured in terms of the L0-norm (simply the number of pixels changed), with additional constraints on where the changed pixels may be located in the image (such as near its edge).
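As a concrete illustration of these measures, here is a minimal sketch (using NumPy; the array names are illustrative assumptions rather than code from elsewhere in this book) that computes the L0, L2, and L∞ distances between an original input and its adversarial counterpart:

import numpy as np

def perturbation_norms(x, x_adv):
    """Return common Lp measures of the perturbation x_adv - x.

    Both inputs are assumed to be NumPy arrays of the same shape with
    values scaled to the range [0, 1].
    """
    delta = (x_adv - x).flatten()
    l0 = np.count_nonzero(delta)        # number of values (pixels) changed
    l2 = np.linalg.norm(delta)          # Euclidean distance
    linf = np.max(np.abs(delta))        # largest single change
    return l0, l2, linf

An adversarial patch might be judged primarily by its L0 value (how many pixels were touched), whereas a barely perceptible perturbation spread across the whole image is better characterized by its L2 or L∞ value.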

Figure 9-1. The goal of the adversary can be described in terms of required specificity, success rate, and perceptibility (perturbation limit).

There’s an obvious tension between success rate and perturbation; the success rate of an attack will increase as the constraints on perturbation are lessened. If we were to plot the success rate of a specific attack against the allowed perturbation for a particular input x, it might look something like Figure 9-2. This graph illustrates the ease with which adversarial examples can be created for a specific input as the allowed perturbation varies. Notice that there is a minimum perturbation required to create an adversarial example. The shape of the plot and the minimum perturbation will depend on the network, the original image, and the perturbation measurement. Hold this thought because this plot and its associated metrics are useful for evaluating the robustness of neural networks too. We’ll revisit this type of graph, and the code needed to generate it, in Chapter 10.

Figure 9-2. Allowing greater perturbation increases the success rate for an adversarial example.

The specificity of the attack is varied by increasing or lessening the constraints of the logic used to generate the adversarial input. Targeted attacks impose tighter constraints on the adversarial input generated, so are likely to be more difficult to achieve and therefore have a lower success rate.

The attack goals are encapsulated in the mathematical and logical constraints used to generate the adversarial example. For example, consider the C&W attack summarized in Figure 9-3. (The complete mathematics is detailed in “Increasing Adversarial Confidence”.)

You can see how this algorithm articulates the adversarial goals. The specificity (in this case targeted) is captured by the restriction that the DNN returns a target class for the adversarial input. The perturbation is calculated to minimize the distance from the original image by including the L2-norm measurement between the adversarial example and the original, defined by $\|x_{adv} - x\|_2^2$. This is addressing the goal of ensuring that the image remains within the required perceptibility. The additional robustness requirement $c \cdot \ell(x_{adv})$ affects the likely success rate.

Figure 9-3. Adversarial goals are captured in the mathematics of the C&W algorithm
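To make the mapping from goals to mathematics more concrete, here is a minimal sketch of a C&W-style objective written for TensorFlow. It is a simplification for illustration only: model is assumed to be a Keras classifier that returns logits, and the constant c is fixed rather than searched for as in the full attack described in Chapter 6.

import tensorflow as tf

def cw_style_loss(model, x, x_adv, target_class, c=1.0, confidence=0.0):
    """Simplified C&W-style objective combining the adversarial goals."""
    # Perceptibility goal: squared L2 distance between adversarial and original input.
    l2_term = tf.reduce_sum(tf.square(x_adv - x))

    # Specificity goal (targeted): the target class logit must beat the best
    # competing logit by at least `confidence`. A larger margin makes the
    # example more robust, increasing its likely success rate.
    logits = model(tf.expand_dims(x_adv, axis=0))[0]
    num_classes = logits.shape[-1]
    mask = tf.one_hot(target_class, num_classes)
    target_logit = tf.reduce_sum(mask * logits)
    best_other_logit = tf.reduce_max(logits - mask * 1e9)  # exclude the target
    adv_term = tf.maximum(0.0, best_other_logit - target_logit + confidence)

    # The attacker minimizes the sum: stay close to x while hitting the target class.
    return l2_term + c * adv_term

Minimizing this quantity with respect to x_adv (for example, with a gradient-based optimizer) trades off perceptibility against confidence in the targeted misclassification, exactly as described above.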

The methods for generating attacks shown in Chapter 6, including the C&W attack, allow an attacker to create an attack that adheres to their goals, but only consider the neural network algorithm on its own. The threat is against the complete processing chain and any defense mechanisms in place, so achieving the required specificity, success rate, and perturbation on the actual target will require more mathematical modeling and/or experimentation on the complete target. In other words, generating the adversarial example using an algorithm such as the one in Figure 9-3 on a DNN in isolation doesn’t ensure that the specificity, success rate, and perturbation goals hold when the example is tried on the complete system.

Capabilities, Knowledge, and Access

The ability of the attacker to attain these goals will depend on several factors: their capabilities, their knowledge of the target, and their ability to affect the input data to make it adversarial. The latter two can be interpreted as potential constraints on the attack; less knowledge will constrain the attacker in achieving their goals, as will a lack of ability to alter the input data. This is summarized in Figure 9-4.

Figure 9-4. Adversarial goals are constrained by target knowledge and access, adversarial capability, and the attacker’s ability to affect the input.

Consider each of the factors in turn:

Capabilities

The success of an attack is constrained by the resources (skills, software, hardware, and time) available to the adversary. The resources required will depend greatly on the goals of the attack. A simple, low-cost attack could be adding an adversarial patch that has been shared online to a digital image. At the other extreme, developing a robust perturbation using an ensemble of models and large compute capacity will require greater expenditure, time, and expertise.

The capability of lone threat actors should not be underestimated. Hackers may work alone, but they are likely to be highly knowledgeable and can utilize public cloud and online resources to develop attacks.

Ability to affect the input

The extent to which an attacker is able to alter the data may restrict their ability to create adversarial input. For example, the level to which changes might impact human perception may restrict the ways digital image content can be altered.

In a physical-world attack, the adversary has no access to the digital data, placing significant constraints on the creation of adversarial content. The attacker may be hindered by very low-tech challenges, such as physical access to the required location of the sensor.

Knowledge of or access to the target

Knowledge of the DNN model itself is a significant aid in creating adversarial examples, enabling (for example) a more powerful replica attack for developing the adversarial input. Knowledge of the target should not be solely seen in terms of the ability of the attacker to replicate the model, however. Successfully launching a robust adversarial example attack ideally requires knowledge of the complete processing chain and all its defenses.

Where an attacker does not have complete knowledge of the target, it may be possible to infer the target’s behavior through analyzing its responses to queries. Experimenting directly with the target system will increase the robustness of an adversarial example, but will incur the trade-off that it might be detected if the target is checking for suspicious input. Therefore, target access during attack preparation will be constrained by a need to remain undetected. The attacker may need to ensure queries do not look suspicious (for example, by slowing the rate at which they are submitted or by submitting queries from multiple IPs to make them appear unrelated).

If the attacker has their own copy of the target (for example, a digital assistant), they have unlimited access to experiment with, and gain knowledge about, the system. However, systems that take data from the physical world and respond directly through nondigital means are more difficult to perform automated experimentation on. For example, there is no programmable interface for interaction with a digital assistant. In these cases, it’s more likely that the attacker would generate the adversarial data on a substitute model, then perform refinement testing on the physical copy before launching it on the target device(s).

Model Evaluation

An interesting question is whether it is possible to quantify a model’s robustness to adversarial examples. This would enable comparisons and assurances of models deployed to operational environments. For example, it might be useful to quantify the effect that a particular defense has on the DNN’s robustness, especially if the defense resulted in a trade-off regarding model accuracy. Alternatively, it may be useful to objectively compare defenses to establish which were most effective in assuring the safe operation of the model.

It’s not currently possible to create DNNs that perform completely flawlessly across all possible inputs. Even a network taking low-resolution (224 × 224 pixel) color images, for example, would need to be proven to perform correctly over $256^{150528}$ different possible images (there are 224 × 224 × 3 = 150,528 pixel values, each taking one of 256 possible values). We know that DNNs can provide results with outstanding accuracy across inputs that are representative of the training dataset and not deliberately adversarial, but it is computationally infeasible to assure the network’s integrity across all input.

An evaluation of defenses can be done empirically (by testing) or theoretically (by mathematical calculations). When evaluating a defense, either empirically or theoretically, it is critical that the effect of the defense on the model’s accuracy on nonadversarial data is not overlooked. There are two aspects to this (both are illustrated in the short sketch following the list):

  • Model accuracy on nonadversarial data must be retested when the defense is in place. The tolerance to a reduction in model accuracy will depend very much on the operational scenario.

  • In developing defenses to augment existing models, care must be taken not to inadvertently reduce the model accuracy for good data. An overzealous adversarial defense mechanism might wrongly predict benign data as adversarial (in other words, placing inputs in the “false positive” categorization, rather than “true negative”).
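The following sketch measures both effects for a hypothetical detector-style defense. The names defended_model and detects_adversarial are assumptions introduced purely for illustration; they do not correspond to any API used elsewhere in this book.

import numpy as np

def evaluate_clean_data_impact(model, defended_model, detects_adversarial,
                               benign_images, benign_labels):
    """Measure the cost of a defense on benign (nonadversarial) test data."""
    # Aspect 1: accuracy on benign data, before and after the defense is in place.
    _, base_acc = model.evaluate(benign_images, benign_labels, verbose=0)
    _, defended_acc = defended_model.evaluate(benign_images, benign_labels, verbose=0)

    # Aspect 2: proportion of benign inputs wrongly flagged as adversarial
    # (the defense's false positive rate).
    flags = np.array([detects_adversarial(x) for x in benign_images])
    false_positive_rate = flags.mean()

    print('Benign accuracy without defense:', base_acc)
    print('Benign accuracy with defense:   ', defended_acc)
    print('False positive rate of defense: ', false_positive_rate)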

Let’s consider the empirical and theoretical approaches to robustness evaluations in turn.

Empirically Derived Robustness Metrics

The problem with adversarial examples is that they are deliberate attempts to fool the model. It is very difficult to create appropriate adversarial tests and consistent robustness metrics because:

  • The nature of the adversarial data is very difficult to predict. Adversarial data generated using one attack method may have totally different characteristics from adversarial data generated using another method.

  • The likelihood of data that fools the model occurring during normal operations may appear very low; however, if an attacker is deliberately crafting inputs to fool the model, the probability of the model encountering such data is far higher.

Establishing accurate and useful comparative robustness metrics for a particular network requires a clear definition of the threat model, attacks, and test data against which the model robustness is being evaluated:

Threat model

The threat model includes the goals—the specificity (targeted versus untargeted), success rate, and perceptibility threshold (such as the Lp-norm measurement being used and its acceptable bounds). It may also include the ability of the attacker to affect the input (considering, for example, physical-world constraints). The evaluation is meaningless without a clear definition of the threat model; it defines the scope of adversarial tests applied during the evaluation.

The selected threat model will depend on the reason for the evaluation. For example, from a research perspective, you might evaluate a defense with the aim of creating better and more robust DNNs. In this case, you may choose to use common threat models to enable direct comparison with other defenses. Alternatively, you might want to evaluate a defense for a specific operational deployment; you might then choose to focus on specific threat models and test scenarios (for example, particular targeted attacks) because they pose greater risk to your organization. In practice, you may choose to perform multiple evaluations against different threat models.

Attack methodology

A comprehensive description of the attack methods, including any parameters used, forms part of the evaluation. You should assume complete knowledge of the system and its defenses when generating the evaluation adversarial examples.

Proving that a defense works for one attack does not guarantee its effectiveness against another, stronger method. The evaluation should therefore encompass a variety of attacks; while it is not possible to know all the attacks, the attacks should include the strongest known at the time. The aim is to evaluate the defense against the best attacks possible, and this includes attacks that adapt to circumvent the defense.

Test data

The evaluation of a defense’s robustness will relate to the test attack data used during experimentation. This data is defined by the method of attack used (including its parameters) and also the data from which the test adversarial examples were generated by this method. Varying the original data used for generating test adversarial examples may affect the success rate of an attack.

Evaluating robustness to adversarial examples is part of the broader problem of evaluating the veracity of the model over inputs. Therefore, evaluating defenses is part of the broader task of establishing the model’s accuracy. Assuming that the test data has a similar distribution to the training data, there are commonly used methods for empirically evaluating model accuracy.

The code presented back in Chapter 3 that created a classifier for the Fashion-MNIST dataset demonstrated a very basic evaluation of the model:

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('Model accuracy based on test data:', test_acc)

This generates the following output:

10000/10000 [================] - 0s 35us/sample - loss: 0.3623 - acc: 0.8704
Model accuracy based on test data: 0.8704

This evaluation states that approximately 87% of all the test examples presented to the model were classified correctly. We can perform a more detailed examination of the model’s accuracy through a confusion matrix. Figure 9-5 is a confusion matrix for the classifier in Chapter 3.

Code Example: Confusion Matrix

The code for producing this matrix is included in the Jupyter notebook chapter03/fashionMNIST_classifier.ipynb.

Figure 9-5. A confusion matrix for the Fashion-MNIST classifier provides a summary of the model’s performance for each of the fashion labels.

Each row in the matrix corresponds to the test images with that row’s particular (true) label. The top row, for example, refers to the test data that is labeled “T-shirt/top.” The cells contain the proportion of the test data that was predicted according to the column label. So, the proportion of T-shirt/tops that were correctly labeled is 0.83 (83%). The proportion of T-shirt/tops that were mislabeled as “Shirt” is 0.07 (7%). If the model was performing perfectly against the test data, the cells of the diagonal would all be 1.0 and all others would be 0.

You can learn quite a lot about a classifier model from its confusion matrix. For example, Figure 9-5 indicates that shirts are the most likely to be misclassified—11% of the time they are misinterpreted by the model as T-shirts/tops. Similarly, the model is most accurate when classifying the trouser, bag, and ankle boot examples in the test data (98%). As we will see in Chapter 10, a confusion matrix can be useful in indicating how adversarial test data is misclassified.
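For reference, a matrix like the one in Figure 9-5 can be produced along the following lines. This is a minimal sketch that assumes the model, test_images, and test_labels from the Chapter 3 code are already loaded; the accompanying notebook may differ in detail.

import numpy as np
import tensorflow as tf

# Predicted class for each test image.
predictions = np.argmax(model.predict(test_images), axis=1)

# Rows correspond to true labels, columns to predicted labels.
matrix = tf.math.confusion_matrix(test_labels, predictions).numpy()

# Normalize each row so the cells show proportions; the diagonal then holds
# the per-class accuracy, as in Figure 9-5.
matrix = matrix / matrix.sum(axis=1, keepdims=True)
print(np.round(matrix, 2))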

To use the confusion matrix for evaluation, we need to create the appropriate attack methodology and generate adversarial test data. To evaluate the efficacy of a specific defense, the testing is performed on the nondefended model and then repeated on the model with the defense in place. The evaluation should be performed with the strongest attacks possible that deliberately aim to defeat any defense mechanism in place. This is extremely difficult because new attacks are constantly evolving, so evaluation of adversarial robustness is a point-in-time assessment that may require subsequent recalculation when further attacks are created. Once again, we may just be interested in the difference in perturbation required to achieve a specific success rate. Or, the metric of greatest interest might be the change in minimum perturbation for an adversarial example. We’ll take a look at using a confusion matrix for defense evaluation in a code example in Chapter 10.

Another metric is how difficult (in terms of perturbation required) it is to create adversarial examples against a particular target network. This takes us back to the plot presented earlier in Figure 9-2 and repeated in Figure 9-6.

Figure 9-6. Allowing greater perturbation increases the success rate for an adversarial example.

For measuring the robustness of a network to any attack, we might want to take the worst case (from the defender’s perspective), which is the smallest perturbation required for an attack to be successful (label 1 on the plot). If the threat model does not place any constraints on the perceptibility boundaries, then the evaluation of the defense is just the point at which the success rate is sufficiently high to achieve the threat model goal (label 2 on the plot). We’ll take a further look at how graphs of this type are generated in a code example in Chapter 10.
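Ahead of the full code example in Chapter 10, the following simplified sketch shows how a curve of this kind might be produced by sweeping the allowed L∞ perturbation of a basic FGSM-style attack. It assumes the Fashion-MNIST model, test_images, and test_labels from Chapter 3; a serious evaluation would use a much stronger attack.

import numpy as np
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

def untargeted_success_rate(model, images, labels, epsilon):
    """Proportion of inputs misclassified after an FGSM perturbation of size epsilon."""
    x = tf.convert_to_tensor(images, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = loss_fn(labels, model(x))
    # Step in the direction that increases the loss, bounded in L-infinity norm.
    x_adv = tf.clip_by_value(x + epsilon * tf.sign(tape.gradient(loss, x)), 0.0, 1.0)
    predictions = np.argmax(model.predict(x_adv, verbose=0), axis=1)
    return np.mean(predictions != labels)

# Sweep the allowed perturbation and record the attack success rate.
# (For a curve closer to Figure 9-6, restrict the sample to inputs the model
# classifies correctly when epsilon is zero.)
epsilons = np.linspace(0.0, 0.3, 13)
for e in epsilons:
    rate = untargeted_success_rate(model, test_images[:500], test_labels[:500], e)
    print('epsilon =', round(float(e), 3), ' success rate =', round(float(rate), 2))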

Evaluation of a Defense Applied Across Models

It’s important to realize that the evaluation of a defense on one model is not guaranteed to hold across other models. In particular, a defense approach tested on a model trained with a “toy” dataset such as Fashion-MNIST may not hold for models trained on data with more realism.

It may be acceptable to perform the evaluation on a single model before and after the defense has been added if you wish to evaluate the robustness of one network only (for example, to test a model to be deployed operationally). However, for a more general evaluation of a defense (for example, for research purposes), performing the evaluation across a variety of models with the defense in place will give a better indication of the defense’s effectiveness.

Theoretically Derived Robustness Metrics

Empirical robustness measurements are subject to ambiguity because there are so many variables associated with the evaluation method. They are also not guaranteed measures of a network’s robustness because increasingly effective attack methods for generating adversarial examples are continually being developed.

A mathematical, rather than empirical, measurement of robustness opens the possibility of a more consistent and reliable metric. Mathematically calculated metrics for software assurance are particularly relevant for the evaluation of safety-critical systems. In the context of adversarial examples, theoretical evaluation may be required to ensure the adequate assurance of a component in an autonomous vehicle, for example. Theoretically derived metrics based wholly on the model (not the threat) also offer the advantage of being attack-agnostic.

One approach is to mathematically calculate the minimum perturbation required to generate an adversarial example. To date, researchers have been able to prove that no adversarial examples can be generated within a defined distance of each of a specific set of inputs. So, assuming there is a “safe zone” around “normal” inputs where no adversarial examples lie, can we establish the minimum (worst case) safe zone for all such inputs? The idea behind this is illustrated in Figure 9-7.

Figure 9-7. Calculating adversarial example safe zones in the input space

The metric that is being calculated is the minimum perturbation required to generate an adversarial example across all correctly classified in-distribution inputs (the smallest safe zone). This is a nontrivial calculation because of the vastness of the input space and the complexity of the prediction landscape. However, researchers have proposed methods using mathematical approximation to establish a robustness metric with a high level of accuracy.2 At the time of writing, this is nascent research. However, theoretically derived metrics such as this are likely to play an increasingly important part in DNN network security evaluation.
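The cited approach relies on careful mathematical approximation, but the idea of a provable safe zone is easy to see in the simplest possible setting. For a linear binary classifier that predicts sign(w·x + b), the minimum L2 perturbation that can change the decision for an input x is exactly |w·x + b| / ‖w‖2, the distance to the decision boundary. The sketch below illustrates that concept only; it is not the method of the paper referenced above, whose challenge is precisely that a DNN’s decision boundary is highly nonlinear.

import numpy as np

def linear_safe_radius(w, b, x):
    """Exact minimum L2 perturbation that flips a linear classifier's decision.

    No perturbation smaller than the returned radius can change the prediction
    sign(w.x + b), so a ball of this radius around x is a provable safe zone.
    """
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Example: w.x + b = 3 and ||w|| = 5, so the safe radius is 0.6.
w = np.array([3.0, -4.0])
b = 1.0
x = np.array([2.0, 1.0])
print(linear_safe_radius(w, b, x))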

Summary

This chapter has considered the threat model and explored different methods for evaluating robustness to adversarial examples. Although it is difficult to achieve consistent approaches to model evaluation, there is considerable interest in establishing standard empirical measures for evaluating model robustness through open projects that are enumerated in Chapter 10.

Open Project for Evaluating Adversarial Robustness

Developing methodologies for evaluating adversarial robustness is a new area and subject to change. “On Evaluating Adversarial Robustness” is a living document produced by Carlini et al.3 to solicit and share contributions relating to the evaluation of neural network defenses in an open forum.

Any empirical evaluation of a model holds only for a specific threat. To make comparisons between models requires that the threat considered is consistent. The possibility of theoretically derived metrics based on the model itself is appealing, and more research in this area will hopefully lead to more approaches to generating objective metrics by which models can be compared.

The next chapter investigates different approaches that have been proposed to defend against adversarial input. We’ll refer back to the evaluation methods in this chapter when exploring defenses.

1 In accordance with Shannon’s maxim, “the enemy knows the system.”

2 Tsui-Wei Weng et al., “Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach,” International Conference on Learning Representations (2018), http://bit.ly/2Rn5THO.

3 Nicholas Carlini et al., “On Evaluating Adversarial Robustness” (2019), http://bit.ly/2IT2jkR.
