Chapter 6

Incidents – Markers of Resilience or Brittleness?

David D. Woods

Richard I. Cook

Incidents are Ambiguous

The adaptive capacity of any system is usually assessed by observing how it responds to disruptions or challenges. Adaptive capacity has limits or boundary conditions, and disruptions provide information about where those boundaries lie and how the system behaves when events push it near or over those boundaries. Resilience in particular is concerned with understanding how well the system adapts and to what range or sources of variation. This allows one to detect undesirable drops in adaptive capacity and to intervene to increase aspects of adaptive capacity.

Thus, monitoring or measuring the adaptiveness and resilience of a system quickly leads to a basic ambiguity. Any given incident includes the system adapting in an attempt to handle the disrupting event or the variation on textbook situations. In an episode we observe the system stretch nearly to failure or to a fracture point. Hence, are cases that fall short of breakdown success stories, or anticipations of future failures? And if the disruption pushes the system to a fracture point, do the negative consequences always indicate a brittle system, since all finite systems can eventually be pushed to a breaking point?

Consider the cases in this book that provide images of resilience. Cook & Nemeth (Chapter 13, example 2) analyze how a system under tremendous challenges has adapted to achieve high levels of performance. The medical response to attacks via bus bombings in Israel quickly became highly adapted to providing care, to identifying victims, and to providing information and counselling to families despite many kinds of uncertainty and difficulty. Yet in that analysis one also sees potential limiting conditions should circumstances change.

Patterson et al. (2004b) analyze a case where the normal cross-checks (one class of mechanisms for achieving resilience) broke down. As a result, a miscommunication and the resulting erroneous treatment plan went forward, with real, though not fatal, consequences for the patient involved. The medication misadministration in this case represents a breakdown in the resilience of the system. Yet the analysis also shows that the system did have mechanisms for reducing errors in treatment plans and cross-checks for detecting problems before they affected any patient, based on particular individuals who always handled the chemotherapy cases and thus had the knowledge and position in the network to detect possible problems and challenge physician plans effectively. In addition, the analysis of the case in terms of resilience factors revealed many aspects of successful and unsuccessful cross-checks that could play out in other situations. By understanding the processes of cross-checking as a contributor to system resilience (even though they broke down in this episode), new system design concepts and new criteria for assessing performance are generated.

Cook & O’Connor (2005) provide a case that falls between the ‘success’ story of the Israeli medical response to bus bombings and the ‘failure’ story of the medication misadministration and the missed opportunities to catch that error. Cook & O’Connor describe the ‘MAR knockout’ (medication administration record) case, where the pharmacy computer system broke down in a way that provided inaccurate medication plans hospital wide. The nurses did detect that medication plans were inaccurate, and that this was not an isolated problem but rather hospital wide. While no one knew why the medication administration records were wrong, the situation required a quick response to limit potential misadministrations to patients. Pharmacy, nurses, and physicians improvised a manual system by finding the last reliable medication records, updating these manually, and taking advantage of the presence of fax machines on wards. Through their efforts over 24 hours, no medication misadministrations occurred while the software trouble was diagnosed and the system restored (a failed reload from backup). The response to the computer infrastructure breakdown revealed that ‘the resilience of this system resides in its people rather than in its technology. It was people who detected the failure, responded, planned a response, devised the workarounds needed to keep the rest of the system working, and restored the pharmacy system. There were a large cadre of experienced operations people … available to assist in the recovery. … What kept the MAR knockout case from being a catastrophe was the ability of workers to recover the system from the brink of disaster’ (Cook & O’Connor, 2005).

While no one was injured, the authors see the case as an indicator of the increasing brittleness of the medication administration system. Computer infrastructure is becoming ever more vital to the ability to deliver care. Economic pressure is stripping away, as ‘inefficiencies,’ the buffers that operated in the incident response. The resource of knowledgeable staff is tightening as managers misunderstand and undervalue the human contribution to successful function. Alternative modes of managing medications are becoming less viable as paper-based records disappear (the goal of the ‘paperless’ hospital) and as the experience base for using alternative systems for managing medication orders erodes.

Are cases of successful response to disruptions stories of resilience that indicate future success when disruptions occur? Are stories of brittleness anticipations of future failures? Do the adaptations evident in any case illustrate resilience, since there was successful adaptation to handle the disrupting event, or do these cases reveal brittleness, since the episode showed how close the system came to falling off the edge? The reverse question can be posed for every case of failure as well: do breakdowns indict the adaptiveness of the system, given that there are always finite resources, irreducible uncertainty, and basic dilemmas behind every situation?

These questions reveal that the value of incidents is in how they mark boundary conditions on the mechanisms/model of adaptiveness built into the system’s design. Incidents simultaneously show how the system in question can stretch given disruptions and the limits on that capacity to handle or buffer these challenges.

Assessing resilience requires models of classes of adaptive behavior. These models need to capture the processes that contribute to adaptation when surprising disruptions occur. The descriptions of processes are useful when they specify the limits on these processes relative to challenges posed by external events. This allows resilience managers to monitor the boundary conditions on the current system’s adaptive capacity and to target specific points where investments are valuable to preserve or expand that adaptive capacity given the vectors of change being experienced by that field of practice.

Patterson et al. (2004b) provide the start of one class of adaptive behavior – cross-check processes. Here we provide another class of adaptive behavior – the ‘decompensation’ event pattern.

‘Decompensation’: A Pattern in Adaptive Response

One part of assessing a system’s resilience is whether that system knows it is operating near boundary conditions. Assessing the margin is not a matter of reading off a simple static state (the distance of an operating point from a definitive boundary), but a more complex assessment of adaptive responses to different kinds of disturbances. Incidents are valuable because they provide information about what stretches the system and how well the system can stretch.

These complexities are illustrated by one kind of pattern in adaptive response called ‘decompensation.’ Cases of decompensation constitute a kind of incident and have been analyzed in highly automated systems such as aircraft and in cardiovascular physiology (Cook et al., 1991; Woods, 1994; Sarter et al., 1997). The basic decompensation pattern evolves across two phases. In the first phase, automated loops compensate for a growing disturbance; the successful compensation partially masks the presence and development of the underlying disturbance. The second phase occurs because the automated response cannot compensate for the disturbance indefinitely. Once the response mechanism’s capacity is exhausted, the controlled parameter suddenly collapses (the decompensation that gives the pattern its name).
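
The dynamics can be made concrete with a small simulation. What follows is a minimal sketch rather than a model of any case discussed in this chapter; the linear disturbance ramp and the capacity limit max_effort are illustrative assumptions.

```python
# Minimal sketch of the two-phase decompensation pattern. The disturbance
# ramp, the capacity limit, and the step count are illustrative
# assumptions, not values drawn from any case in this chapter.

def simulate(steps=61, ramp=0.1, max_effort=4.0):
    for t in range(0, steps, 5):
        disturbance = ramp * t                  # external challenge keeps growing
        effort = min(disturbance, max_effort)   # compensation, capped by capacity
        error = disturbance - effort            # deviation visible on the plant
        margin = max_effort - effort            # reserve capacity, easily overlooked
        phase = 1 if margin > 0 else 2
        print(f"t={t:2d}  phase={phase}  effort={effort:4.1f}  "
              f"margin={margin:4.1f}  error={error:4.1f}")

simulate()
```

Throughout phase 1 the visible error stays at zero while the margin drains away: nothing in the controlled parameter itself warns of the approaching collapse. That masking is precisely the monitoring problem taken up below.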

The question is: does the supervisory controller of such systems detect the developing problem during the first phase of the event pattern, or does it miss the signs that the lower order or base controller (automated loops in the typical system analysis) is working harder and harder to compensate and getting nearer to its capacity limit as the external challenge persists or grows?

In these situations, the critical information for the outside monitor is not the symptoms per se but the force with which they must be resisted, relative to the capabilities of the base control systems. For example, when a human is acting as the base control system, an effective team member would communicate to others the fact that unusual control effort is needed (Norman, 1990). Such information provides a diagnostic cue for the team and a signal that additional resources need to be injected to keep the process under control. Without information about how hard the base control system is working to maintain control in the face of disturbances, it is quite difficult to recognize the seriousness of the situation during phase 1 and to respond early enough to avoid the decompensation collapse that marks phase 2 of the event pattern.
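
In terms of the simulation sketch above, this diagnostic cue amounts to watching effort relative to capacity rather than the controlled parameter itself. A minimal monitor might look like the following; the 80 per cent alert threshold is an assumed value chosen purely for illustration.

```python
# Sketch of a monitor that tracks the base controller's effort as a
# fraction of its capacity. The 0.8 threshold is an illustrative
# assumption, not an empirically derived alarm limit.

def effort_alarm(effort, max_effort, threshold=0.8):
    """Raise a flag while the controlled parameter still looks normal,
    based only on how hard the base controller is working."""
    return effort / max_effort >= threshold

# Example: at t=35 in the simulation above, error is still 0.0,
# but effort is 3.5 of a 4.0 capacity, so the alarm fires.
print(effort_alarm(effort=3.5, max_effort=4.0))  # True
```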

While this pattern has been noted in supervisory control of automated processes, it also can be used as an analogy or test case for the challenges of monitoring the resilience of organizations. In the standard decompensation pattern, the base system adapts to handle disruptions or faults. To determine if this adaptive behavior is a signal of successful control or a sign of incipient failure requires an assessment of the control capability of the base system in the face of various kinds and sizes of disturbances.

Consider the challenge for a person monitoring these kinds of situations. When disturbances occur, the presence of adaptive capacity produces counter-influences that make control appear adequate, or allow only slight signs of trouble, over a period of time. Eventually, if no changes are made or assistance injected, the capacity to compensate becomes exhausted and control collapses in the form of a major incident or accident. The apparent success during the first phase of the event can mask how adaptive mechanisms are being stretched to work harder to compensate and how buffers are being exhausted. This makes it difficult for monitors to understand what is occurring, especially as the first phase may play out over relatively long time periods, and it leads to great surprise when the second phase occurs.

While detecting the slide toward decompensation is in principle difficult, decompensation incidents reveal how human supervisory controllers of automated systems normally develop the expertise needed to anticipate the approach to boundaries of adaptive capacity and to intervene to provide additional capacity to cope with the disturbances underway. For example, in intensive care units and in anesthesiology (Cook et al., 1991), physicians are able to make judgements about the adaptive capacity of a patient’s physiological systems. Precariously balanced heart patients have little capacity for handling stressors or disrupting events – their control mechanisms are ‘run out.’ Anesthesiologists and intensivists try to avoid physiological challenges in such cases and try to provide extra control capacity from outside mechanisms with drips of various cardiovascular medications. Note that adaptive behavior still occurs whether the patient is a young adult with a single problem (a large adaptive reservoir) or a precariously balanced elderly patient with several longstanding cardiovascular problems (very limited control reserves). But there is something about the kinds of adaptations going on, or not going on, that allows the experienced physician to recognize the difference in adaptive capacities and to project the cardiovascular system’s ability to handle the next challenge (i.e., more than simple demographics and previous medical history).

There is information available to support these kinds of assessments. The information consists of relationships between the state of the process being controlled, the disturbances the process has been or could be subjected to, and the kinds of responses that have been made to recent challenges. These relationships indicate whether control loops are having trouble handling situations (working very hard to maintain control), working at the extreme of their range, or moving toward the extreme part of their capabilities. Picking up on these relationships tells the observer whether there is limited remaining capacity to compensate, even though no direct indicators of control failure have yet occurred.

It is also clear that expert judgements about adaptive capacities come from tangible experience with the system’s responses to variations and disruptions. Seeing how the system responds to small perturbations provides feedback to the physician about the capacity to handle other, larger disruptions or challenges (a feel for the dynamics of disturbance and response, often captured in terms like a ‘sluggish’ response). Such judgements require baseline information about normal adaptive capabilities given typical disturbances. In monitoring patient physiology, anesthesiologists may sometimes even inject small challenges to observe the dynamic responses of the cardiovascular system as a means to generate information about adaptive capacity.
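
A rough computational analogue of this probing strategy is sketched below, under strong simplifying assumptions: recovery from a small probe is modelled as first-order decay, and the time constants and 5 per cent settling band are illustrative values only. The point is simply that the same small probe produces a brisk recovery from a system with reserves and a sluggish one from a system that is ‘run out.’

```python
# Sketch of probing for adaptive capacity: inject a small test
# perturbation and compare the recovery dynamics against a healthy
# baseline. The first-order model and its time constants are
# illustrative assumptions, not a physiological model.
import math

def recovery_trace(time_constant, probe=1.0, steps=40, dt=0.5):
    """Response to a small step probe for a first-order system:
    a smaller time constant means a brisker recovery (more reserve)."""
    return [probe * math.exp(-dt * t / time_constant) for t in range(steps)]

def settling_time(trace, dt=0.5, band=0.05):
    """Time until the response stays within 5% of the probe size."""
    for t in range(len(trace)):
        if all(abs(v) < band for v in trace[t:]):
            return t * dt
    return None  # did not settle within the observation window

healthy = recovery_trace(time_constant=1.0)   # large reserves: quick recovery
depleted = recovery_trace(time_constant=5.0)  # 'run out': sluggish recovery
print(settling_time(healthy), settling_time(depleted))  # 3.0 15.0
```

The depleted system settles five times more slowly in this toy model: the kind of dynamic signature that an experienced observer reads as limited remaining reserve.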

Our conjecture is that, inspired directly or indirectly by these very detailed situations of judging adaptive capacity in supervisory control, we can create mechanisms to monitor the adaptive capacity of an organization and to anticipate when that capacity is precarious.

For example, in safety management, change and production pressure are disturbances that erode or place new demands on adaptive capacity. During the first phase of a decompensation pattern, it is difficult for normal monitoring mechanisms to see or interpret the subtle signs of reduced adaptive capacity. Eventually, failures occur if adaptive capacity continues to erode and new challenges combine. These failures represent a sudden collapse in adaptive capacity as demands continue to stretch the system. While a failure may be read as a message about specific vulnerabilities and proximal factors, it is also a sign of the general decompensation signature as a pattern of change in the adaptive capacity of the organization; cf. Doyle’s ‘HOT’ analysis of systems with ‘highly optimized tolerance’ (Carlson & Doyle, 2000).

Returning to the opening question of this chapter, cases are not valuable based on whether adaptation was successful or unsuccessful. Cases are valuable because they reveal patterns in how adaptive capacity is used to meet challenges (Cook et al., 2000). Such cases need to be analyzed to determine when challenges fall outside or stress the competence envelope, how the system engages pools of adaptive capacity to meet these challenges, and where adaptive capacity is being used up in handling these unanticipated perturbations. The decompensation kind of incident reminds us that the basic unit of analysis for assessing resilience is the observation of a system’s responses to varying kinds of disturbances.

Measures of brittleness and resilience will emerge when we abstract general patterns from specific cases of challenge and response. Resilience engineering needs to search out incidents because they reveal boundary conditions and how the system behaves near these boundaries. These cases will provide evidence of otherwise hidden sources of adaptiveness and also illustrate the limits on a system’s adaptive capabilities. General patterns in adaptive system behavior (as illustrated by the decompensation event signature and by other patterns taken from the cases of resilience and brittleness captured in this book) will help measure an organization’s resilience and target interventions to enhance it.

Acknowledgements

This work was supported in part by grant NNA04CK45A from NASA Ames Research Center to develop resilience engineering concepts for managing organizational risk.
