Chapter 7

Resilience Engineering: Chronicling the Emergence of Confused Consensus

Sidney Dekker

My contribution to this book on resilience engineering consists of a chapter that attempts to capture the emergence and exchange of ideas during the symposium (held in Söderköping, Sweden, during October 2004) – ideas that together seem to move towards a type of confused consensus on resilience engineering (and sometimes towards a common understanding of what it is not). From the presentations, arguments, claims, hypotheses, questions and suggestions, I have extracted themes that seemed to animate many of the ideas that generated discussions, and discussions that generated ideas. The themes, as organised below, could represent at least some kind of logical flow; an order in which one idea gives rise to the next, as the constraints of the previous one on progress for resilience engineering become clear. I first present the themes in summary, and then discuss them in more detail:

•  We have to get smarter at predicting the next accident. The recombinant predictive logic that drives accident prediction today is insensitive to the pressures of normal work by normal people in normal organizations. We may not even need accident models, as the focus should be more on models of normal work. It is normal work (and the slow, incremental drift into safety margins and eventually across boundaries that is a side effect of normal work in resource-constrained, adaptive organizations) that seems to be an important engine behind complex system accidents.

•  Detecting the drift into failure that happens to seemingly safe systems, before breakdowns occur, is a major role for resilience engineering. While it is critical to capture the relational dynamics and longer-term socio-organizational trends behind system failure, ‘drift’ as such is ill-defined and not well-modelled. One important ingredient in drift is the sacrificing decision, where goals of safety and production/efficiency are weighed against one another, and managers make locally rational decisions based on criteria from inside their situated context. But establishing a connection between individual micro-level managerial trade-offs and macro-level drift is challenging both practically and theoretically. In addition, it is probably not easy to influence sacrificing decisions so that they do not put a system on the road towards failure. Should we really micro-adjust managers’ day-to-day cognitive calculus of reward?

•  Detecting drift is difficult and rests on many assumptions about our ability to distinguish longitudinal trends, establish the existence of safety boundaries, and warn people of their movement towards them. Instead, a critical component in estimating an organization’s resilience could be a momentary charting of the distance between operations as they really go on, and operations as they are imagined in the minds of managers or rulemakers. This distance tells us something about the models of risk currently applied, and how well (or badly) calibrated organizational decision makers are. This, however, requires comparisons and perhaps a type of quantification (even if conceptual) that may be challenging to attain. Updating management about the real nature of work (and their deficient perception of it) also requires messengers who go beyond sending ‘reassuring’ signals upward, which in turn demands a particular climate that allows the boss to hear such news.

•  What additional markers of resilience were explored at the symposium? If charting the distance between operations as imagined and as they really occur is too difficult, then an even broader indicator of resilience could be the extent to which the organization succeeds in keeping discussions of risk alive even when everything looks safe. The requirement to keep discussions of risk alive invites us to think about the role and nature of a safety organization in novel ways. Another powerful indicator is how the organization responds to failure: can the organization rebound even when exposed to enormous pressure? This closes the circle as it brings us back to theme 1, whether resilience is about effectively predicting (and then preventing) the next accident. In terms of the cognitive challenge it represents, preventing the next accident could conceptually be close to managing the aftermath of one: doing both effectively involves rebounding from previous understandings of the system and interpretations of the risk it is exposed to, being forced to see the gap between work as imagined and work as done, updating beliefs about safety and brittleness, and recalibrating models of risk.

Resilience Engineering and Getting Smarter at Predicting the Next Accident

The need for resilience engineering arises in part out of the inadequacy of current models for understanding, and methods for predicting, safety breakdowns in complex systems. In very safe systems, incident reporting no longer works well, as it makes a number of questionable assumptions that no longer seem to be valid.

The first assumption of incident reporting is that we possess enough computational capacity to perform meaningful recombinant logic on the database of incidents, or that we have other working ways of extracting intelligence from our incident data. This assumption is already being undermined by very successful incident reporting systems (e.g., NASA’s Aviation Safety Reporting System), which indeed could be said to be victims of their own success. The number of reports received can be overwhelming, which makes meaningful extraction of risk factors and prediction of how they will recombine extremely difficult. In fact, a main criterion of success of incident reporting systems across industries (the number of reports received) could do with reconsideration: celebrating a large count of negatives (instances of system problems, errors, or failures) misses the point, since the real purpose of incident reporting is not the gathering of data but the analysis, anticipation and reduction of risk.
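To give a feel for the computational-capacity problem behind this first assumption, the following back-of-the-envelope sketch (in Python, with purely hypothetical numbers that are not drawn from any reporting system discussed here) simply counts how many candidate factor combinations an analyst would face when trying to ‘recombine’ coded incident factors into possible accident scenarios.

```python
# Back-of-the-envelope illustration (hypothetical numbers): how quickly the space of
# candidate factor combinations grows when we try to "recombine" incident factors
# into possible accident scenarios.
from math import comb

distinct_factors = 500        # assumed: distinct contributing factors coded in a reporting system
for scenario_size in (2, 3, 4, 5):
    combos = comb(distinct_factors, scenario_size)
    print(f"{scenario_size}-factor combinations: {combos:,}")

# Even with only 500 coded factors, 5-factor scenarios already number in the
# hundreds of billions, far more than any analyst (or screening heuristic)
# can meaningfully rank for plausibility.
```

Even at this modest, assumed scale, the combination space dwarfs any realistic screening effort, which is the sense in which the recombinant logic outpaces our capacity to use it.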

The second assumption is that incident reports actually contain the ingredients of the next accident. This may not always be true, as accidents in very safe systems seem to emerge from (what looks to everybody like) normal people doing normal work in normal organizations. In the time leading up to an accident there may be few reportworthy failures or noteworthy organizational deficiencies that would end up as recombinable factors in an incident reporting system. Even if we had the computational capacity to predict how factors may recombine to produce an accident, this is going to be of no help if the relevant ‘factors’ are not in the incident database in the first place (as they were never judged to be ‘abnormal’ enough to warrant reporting). The focus on human errors and failure events may mean that incident reporting systems are not as useful when it comes to predicting accidents in safe systems. This is because in safe systems, it is not human errors or failure events that lead to accidents. Normal work does.

The third assumption is that the decomposition principles used in incident reporting systems and the failure models that underlie them have something to do with the way in which factors actually recombine to create an accident. Most fundamentally, the metaphors that inspire and justify incident reporting (e.g., the ‘iceberg’, the ‘Swiss cheese’) assume that there is a linear progression of breakdowns (a domino effect) that either gets stopped somewhere (constituting an incident) or not (creating an accident). Neither the presumed substantive similarity between incident and accident, nor the linear, chain-like breakdown scenarios enjoy much empirical support in safe systems. Instead, safety and risk in safe systems are emergent properties that arise from a much more complex interaction of all factors that constitute normal work.

As a result, linear extensions of current safety efforts (incident reporting, safety and quality management, proficiency checking, standardization and proceduralization, more rules and regulations) seem of little use in breaking the asymptote, even if they could be necessary to sustain the high safety levels already attained.

Do We Need Accident Models at All?

The problem with the third assumption calls another fundamental idea into question: what kinds of accident models are useful for resilience engineering? People use models of accidents all the time. Most accident models, especially naïve or folk ones, appeal to a kind of exclusivity of factors and activities (or how they interact) that lead up to an accident. The idea is that organizational work (or the organizational structure that exists) preceding an accident is somehow fundamentally or at least identifiably different from that which is not followed by an accident. Thus, accidents require their own set of models if people want to gain predictive leverage. Accident models are about understanding accidents and, presumably, about predicting them before they happen. But if success and failure stem from the same source, then accident models are either useless, or as useless (or useful) as equally good models of success would be. The problem is that we do not have many good models of operational success, or even of normal work that does not lead to trouble. More than once, symposium participants remarked (in multifarious language) that we need better models of normal work, including better operationalised models of the effects of ‘safety culture’ (or lack thereof) on people’s everyday decision-making.

Resilience engineering seeks to distinguish itself from safety practices as they have evolved in various industries. The nuclear industry, for example, relies to a large extent on barriers and defences in depth: whatever goes on behind all the barriers, and however risk may change or shift out there, the system is insulated from it. Aviation seems forever rooted in a ‘fly-fix-fly’ model, where (presumably or hopefully) fail-safe designs are introduced into operation. Subsequently, huge transnational, partially overlapping webs of failure monitoring (from event reporting to formal accident investigation) are relied upon, largely ‘bottom-up’, to have the industry make adjustments where necessary. Aviation system safety thus seems to run relatively open-loop and represents a partially reactive control of risk. Military systems, having to deal with active adversaries, emphasise hazards, hazard analysis and hazard control, relying in large part on top-down approaches to making safer systems. In contrast with aviation, this represents a more proactive control of risk. But how do these systems control (or even become aware of) their own adaptive processes that enable them to update and adjust their management of risk?

There is a complex dynamic underneath the processes through which organizations decide how to control risk. One critical aspect is that doing things safely is ongoing. Consistent with the challenge to the use of accident models, doing things safely is perhaps not separable from doing things at all. Individuals as well as entire industries continually develop and refine strategies that are sensitive to known or likely pathways to failure, they continually apply these strategies as they go about their normal work, and they adapt these strategies as they see new evidence on risk come in. Leverage for change and improvement, then, lies in the interpretation of what constitutes such risk; leverage lies in helping individuals and systems adapt effectively to a constantly changing world, calibrating them to newly emerging threats, identifying obsolete models of risk, and helping them avoid overconfidence in the continued effectiveness of their strategies. This, as said above, could be central to resilience engineering: helping organizations not with better control of risk per se, but helping them better manage the processes by which they decide how to control such risk.

This obviously lies beyond the prediction of accidents. This lies beyond the mere controlling of risk, instead taking resilience engineering to a new level. It is about attaining better control of the control of risk. Rather than just helping organizations, or people, become sensitive to pathways to failure, resilience engineering is about helping them become sensitive about the models of risk that they apply to their search and control of pathways to failure.

Modelling the Drift into Failure

One way to think about the greatest risk to today’s safe socio-technical systems is the drift into failure, and a potential contribution from resilience engineering could be to help organizations detect this drift. ‘Drifting into failure’ is a metaphor for the slow, incremental movement of systems operations toward (and eventually across) the boundaries of their safety envelope. Pressures of scarcity and competition typically fuel such movement, while uncertain technology and incomplete knowledge about where the boundaries actually are mean that people neither stop the movement nor even see it. Recognising that a system is drifting into failure is difficult because the entire protective structure (including suppliers, regulators, managerial hierarchies, etc.) seems to slide along with the operational core toward the boundary. Even if an operational system is ‘borrowing more from safety’ than it did previously, or than it does elsewhere, by operating with smaller failure margins, this may be considered ‘normal’, as the regulator approved it. Almost everybody inside the system does it, goes along, and agrees, implicitly or not, with what is defined as risky or safe. Also, the departures from previous practice are seldom quick or large or shocking (and thus difficult to detect): rather, there is a slow succession of tiny incremental deviations from what previously was the ‘norm’. Each departure in itself is hardly noteworthy. In fact, such ‘departures’ are part and parcel of normal adaptive system behaviour, as organizations (and their regulators) continually realign themselves and their operations with current interpretations of the balance between profitability and risk (and have to do so in environments of resource scarcity and competition). As a result, large system accidents of the past few decades have revealed that what is considered ‘normal’ or acceptable risk is highly negotiable and subject to all kinds of local pressures and interpretations.

Do We Need (and Can We Attain) a Better Model of Drift?

The intuitive appeal of ‘drift into failure’ is not matched by specific, working models of it. In fact, drift into failure is so far an unsatisfactory explanation because it is not a model, just as the ‘Swiss cheese’ (or defences-in-depth) idea is not a model. It may be a good metaphor for how complex systems slowly move towards the edges of breakdown, but it lacks the underlying dynamic relationships and potential for operationalization that could make it into a model.

One important ingredient for such a model would be the ‘sacrificing decision’. To make risk a proactive part of management decision-making requires ways of knowing when to relax the pressure on throughput and efficiency goals, i.e., when to make a sacrificing decision, and resilience engineering could be working on ways to help organizations decide when to relax production pressure to reduce risk. Symposium participants agreed on the need to better understand this judgement process in organizations, and how micro-level decisions are linked to macro-level ‘drift’. The decision to value production over safety is often implicit, and any larger or broader side effects that echo through an organization go unrecognised. So far, the result is individuals and organizations acting much more riskily than they would ever desire, or than they would, in retrospect, find acceptable. The problem with sacrificing decisions is that they are often very sound when set against local judgement criteria, given the time and budget pressures and short-term incentives that shape decision-making behaviour in those contexts. In other words, given the decision makers’ knowledge, goals, focus of attention, and the nature of the data available to them at the time, the decision made sense. Another problem with the sacrifice judgement is related to hindsight. The retrospective view will indicate that the sacrifice or relaxation may have been unnecessary since ‘nothing happened.’ A substitute measure for the ‘soundness’ of a sacrificing decision, then, could be to assess how peers and superiors react to it. Indeed, in the absence of such ersatz criteria, aggregate organizational behaviour is likely to move closer to safety margins, the final evidence of which could be an accident.

Of course, it could be argued that ‘sacrifice decisions’ are not really special at all. In fact, all decisions are sacrifices of one kind or another, as the choosing of one option almost always excludes other options. All decisions, in that sense, could be seen as trade-offs, where issues of sensitivity and decision criterion enter into consideration (whether consciously or not).

The metaphor of drift suggests that it is in these normal, day-to-day processes of organizational management and decision-making that we can find the seeds of organizational failure and success, and a role of resilience engineering could be to find leverage for making further progress on safety by better understanding and influencing these processes. One large operationalization problem lies in the reciprocal macro-micro connection: how do micro-level decisions and trade-offs not only represent and reproduce macro-structural pressures (of production, resource scarcity, competition) and the associated organizational priorities and preferences, but how are micro-level decisions related to eventual macro-level organizational drift? These links, from macro-structural forces down to micro-level decisions, and in turn from micro-level decisions up to macro-level organizational drift, are wide open for future investigation.

Yet even if we conduct such research, is a model of drift desirable? The complexity of the organizational, political, psychological and sociological phenomena involved would make it a serious challenge, but that may not even be the main problem. It would seem, as has been suggested before, that safety and risk in complex organizations are emergent, not resultant, properties: safety and risk cannot be predicted or modelled on the basis of constituent components and their interactions. So why try to build a model of drift as an emergent phenomenon, or why even apply modelling criteria to the way in which you want to capture it? One way to approach the way in which complex systems fail or succeed, then, is through a simile (like ‘drift into failure’) rather than through a model. A simile (just like the Swiss cheese) can guide, direct attention, take us to interesting places for research and progress on safety. It may not ‘explain’ as a traditional model would. But, again, this is the whole point of emergent processes – they call for different kinds of approaches (e.g., system dynamics) and different kinds of accident models.
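As an illustration only of what a system-dynamics style treatment of drift might look like, the sketch below simulates a hypothetical ‘safety margin’ stock that is eroded by slowly ratcheting production pressure and only replenished once risk becomes visibly alarming. Every variable and parameter in it is an assumption made for the sake of the example, not a model proposed at the symposium.

```python
# Toy system-dynamics sketch of 'drift': a hypothetical safety-margin stock that is
# eroded by production pressure and only partially replenished, because the weak
# signals that would trigger reinvestment fade as the margin quietly shrinks.
# All variables and parameters are illustrative assumptions, not measured quantities.

def simulate_drift(steps=100, dt=1.0):
    margin = 1.0              # normalized safety margin (1.0 = original design intent)
    pressure = 0.2            # production/efficiency pressure on operations
    history = []
    for t in range(steps):
        # Erosion: each locally rational adaptation shaves a little off the margin.
        erosion = 0.02 * pressure * margin
        # Reinvestment: proportional to how alarming the situation *looks*,
        # and it only starts to look alarming once the margin is visibly low.
        perceived_risk = max(0.0, 0.5 - margin)
        reinvestment = 0.05 * perceived_risk
        # Pressure itself ratchets up as past adaptations seem to 'work'.
        pressure += 0.005 * dt
        margin += (reinvestment - erosion) * dt
        history.append((t, round(margin, 3), round(pressure, 3)))
    return history

if __name__ == "__main__":
    for t, margin, pressure in simulate_drift()[::10]:
        print(f"t={t:3d}  margin={margin:.3f}  pressure={pressure:.3f}")
```

The point of such a toy is not prediction but the shape of its output: a long run of small, individually unremarkable declines, which mirrors the incremental character of drift described above.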

Where are the Safety Boundaries that You can Drift Across?

Detecting how an organization is drifting into failure before breakdowns occur is difficult. One reason is that we lack a good, operationalised model of drift that could focus us on the tell-tale signs. And, as mentioned previously, a true model of drift may be out of reach altogether, since drift may be an emergent phenomenon that resists modelling from constituent parts. Another issue is that of boundaries: drifting would not be risky if it were not for the existence of safety boundaries across which you actually can drift into breakdown. Other than shedding light on the adaptive life of organizations (a key aspect of resilience engineering), there is perhaps even no point in understanding drift if we cannot chart it relative to safety boundaries. But if so, what are these boundaries, and where are they?

The idea of safety boundaries appeals to intuitive sense: it can appear quite clear when boundaries are crossed (e.g., in an accident). Yet this is not the same as trying to plot them before an accident. There, the notion of hard, identifiable boundaries is very difficult to develop further in any meaningful way. With safety (and accidents) as emergent phenomena, deterministic efforts to nail down the borders of safe practice seem rather hopeless. The number of variables involved, and their interactions, makes the idea of safety boundaries as probability patterns more appealing: probability patterns that vary in an indeterministic fashion with a huge complex of operational and organizational factors.

Going even further, it seems that safety boundaries, and where they lie, are a projection or an expression of our current models of risk. This idea moves away from earlier, implied realist interpretations of safety margins and boundaries, and denies their independent existence. They are not present in themselves, but are reflections of how we think about safety and risk at that time and in that organization and operation. The notion of boundaries as probability patterns that arise out of our own projections and models of risk is as profoundly informative as it is unsettling. It does not mean that organizations cannot adapt in order to stay away from those boundaries; indeed, they can. People and organizations routinely act on their current beliefs of what makes them safe or unsafe. But it does mean that the work necessary to identify boundaries must shift from the engineering-inspired calculative to the ethnographically or sociologically interpretive. We want to learn something meaningful about insider interpretations, about people’s changing (or fixed) beliefs and how they do or do not act on them.

Work as Imagined versus Work as Actually Done

One marker of resilience that comes out of converging lines of evidence is the distance between operations as management imagines they go on and how they actually go on. A large distance indicates that organizational leadership may be ill-calibrated to the challenges and risks encountered in real operations.

Commercial aircraft line maintenance is emblematic: a job-perception gap exists where supervisors are convinced that safety and success result from mechanics following procedures – a sign-off means that applicable procedures were followed. But mechanics may encounter problems for which the right tools or parts are not at hand; the aircraft may be parked far away from base. Or there may be too little time: aircraft with a considerable number of problems may have to be turned around for the next flight within half an hour. Mechanics, consequently, see success as the result of their evolved skills at adapting, inventing, compromising, and improvising in the face of local pressures and challenges on the line – a sign-off means the job was accomplished in spite of resource limitations, organizational dilemmas, and pressures. Those mechanics who are most adept are valued for their productive capacity even by higher organizational levels. Unacknowledged by those levels, though, are the vast informal work systems that develop so mechanics can get work done, advance their skills at improvising and satisficing, impart them to one another, and condense them in unofficial, self-made documentation (McDonald et al., 2002). Seen from the outside, a defining characteristic of such informal work systems would be routine violations of procedures (which, in aviation, is commonly thought to be ‘unsafe’). But from the inside, the same behaviour is a mark of expertise, fuelled by professional and inter-peer pride. And of course, informal work systems emerge and thrive in the first place because procedures are inadequate to cope with local challenges and surprises, and because procedures’ (and management’s) conception of work collides with the scarcity, pressure and multiple goals of real work.

International disaster relief work similarly features widely diverging images of work. The formal understanding of relief work, which needs to be coordinated across various agencies and nations, includes an allegiance to distant supervisors (e.g., in the head office in some other country) and their higher-order goals. These include a sensitivity to the political concerns and bureaucratic accountability that form an inevitable part of cross-national relief work. Plans and organizational structures are often overspecified, and new recruits to disaster work are persuaded to adhere to procedure and protocol and defer to hierarchy. Lines of authority are clear and should be checked before acting in the field.

None of this works in practice. In the field, disaster relief workers show a surprising dissociation from distant supervisors and their global goals. Plans and organizational structures are fragile in the face of surprise and the inevitable contretemps. Actual work immediately drifts away from protocol and procedure, and workers defer to people with experience and resource access rather than to formal hierarchy. Improvisation occurs across organizational and political boundaries, sometimes in contravention to larger political constraints or sensitivities. Authority is diffuse and often ignored. Sometimes people check after they have acted (Suparamaniam & Dekker, 2003).

Generally, the consequence of a large distance appears to be greater organizational brittleness, rather than resilience. In aircraft maintenance, for example, the job-perception gap means that incidents linked with maintenance issues do not generate meaningful pressure for change, but instead produce outrage over widespread violations and cause the system to reinvent or reassert itself according to the official understanding of how it works (rather than how it actually works). Weaknesses persist, and a fundamental misunderstanding of what makes the system work (and what might make it unsafe) is reinforced (i.e., follow the procedures and you will be safe). McDonald et al. (2002) famously called such continued dysfunctional reinvention of the system-as-it-ought-to-be ‘cycles of stability’. Similarly, the distributed and politically pregnant nature of international disaster relief work seems to make the aggregate system immune to adapting on a global scale. The distance between the formal and the actual images of work can be taken as a sign of local worker intransigence, or of unusual adversity, but not as a compelling indication of the need to better adapt the entire system (including disaster relief worker training) to the conditions that shape and constrain the execution of real work. The resulting global brittleness is compensated for by local improvisation, initiative and creativity.

But this is not the whole story. A distance between how management understands operations to go on and how they really go on can actually be a marker of resilience at the operational level. There are indications from commercial aircraft maintenance, for example, that management’s relative lack of clues about real conditions and fluctuating pressures inspires the locally inventive and effective use of tools and other resources vis-à-vis task demands. Similarly, practitioners ranging from medical workers to soldiers can ‘hide’ available resources from their leadership and keep them in stock in order to have some slack or reserve capacity to expand when demands suddenly escalate. ‘Good’ leadership always extracts maximum utilization from invested resources. ‘Good’ followership, then, is about hiding resources for times when leadership turns out to be not that ‘good’ after all. Indeed, leadership can be ignorant of, or insensitive to, the locally fluctuating pressures of real work at the sharp end. As a result, even safety investments can quickly get consumed for production purposes. In response, practitioners at the sharp end may secure a share of resources and hold it back for possible use in more pressurised times.

From an organizational perspective, such parochial and subversive investments in slack are hardly a sign of effective adjustment, and there is a need to better capture processes of cross-adaptation and their relationship to organizational resilience. One idea here is to see the management of resilience as a matter of balancing individual resilience (individual responses to operational challenges) and system resilience (the large-scale autocatalytic combination of individual behaviour). Again, the hope here is that this can produce a better understanding of the micro-macro connection – how individual trade-offs at the sharp end (themselves influenced by macro-structural forces) in turn can create global side effects on an emergent scale.

Measuring and Closing the Gap

How can we measure this gap between the system as designed or imagined and the system as actually operated? And, perhaps more importantly, what can we do about it? The distance between the real system and the system as imagined grows in part through local feedback loops that confirm managers in thinking that what they are doing is ‘right’. Using their local judgement criteria, this can indeed be justified. The challenge, then, is to make the gap visible and provide a basis for learning and adaptation where necessary. This implies the ability to contrast the present situation with, for example, a historical ‘ideal’. The hierarchical control model proposed by Nancy Leveson is one example of an approach that can provide snapshots of the system-as-designed and the system as actually evolved. From one time to the next, it can show that changes have occurred in, and erosion has happened to, the control loops between various processes and structures (Leveson, 2002). The gradual erosion of control, and the subsequent mismatch between the system-as-designed and the system as actually operated, can be seen as an important ingredient in the drift into failure.
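As a minimal sketch of what comparing such snapshots could look like in data terms, the code below represents a hypothetical control structure as a set of control and feedback links with strengths, and diffs a ‘system-as-designed’ snapshot against a later ‘system-as-operated’ one to flag links that have disappeared or weakened. The representation and the example links are invented for illustration; this is not Leveson’s formalism.

```python
# Minimal sketch: represent each snapshot of a (hypothetical) safety control structure
# as a set of directed links (control actions going down, feedback going up) and
# diff two snapshots to flag links that have disappeared or weakened (eroded).
# The example links are invented for illustration; this is not Leveson's formalism.

from typing import Dict, Tuple

Link = Tuple[str, str, str]   # (controller, controlled_process, kind)

def diff_snapshots(designed: Dict[Link, float], current: Dict[Link, float]):
    """Return links missing from the current snapshot and links whose strength eroded."""
    missing = [link for link in designed if link not in current]
    eroded = [link for link, strength in designed.items()
              if link in current and current[link] < strength]
    return missing, eroded

# System-as-designed: every control channel has its feedback channel (strength 1.0).
designed = {
    ("regulator", "airline_mgmt", "control"): 1.0,
    ("airline_mgmt", "maintenance_ops", "control"): 1.0,
    ("maintenance_ops", "airline_mgmt", "feedback"): 1.0,
    ("airline_mgmt", "regulator", "feedback"): 1.0,
}

# System-as-operated some years later: one feedback loop gone, another thinned out.
current = {
    ("regulator", "airline_mgmt", "control"): 1.0,
    ("airline_mgmt", "maintenance_ops", "control"): 1.0,
    ("airline_mgmt", "regulator", "feedback"): 0.4,
}

missing, eroded = diff_snapshots(designed, current)
print("missing links:", missing)
print("eroded links:", eroded)
```

Missing or weakened feedback links are exactly the kind of erosion that, in the argument above, lets the system-as-operated quietly diverge from the system-as-designed.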

One problem here is that operations ‘as imagined’ are not always as amenable to capture or analysis as the idealised control model of a system’s starting position would be. Understanding the gap between the system-as-imagined and the system as actually operated requires investment not only in understanding how the system really works, but also in how it is imagined to work. The latter can sometimes be even more difficult. Take the opinion of the quality manager of an airline flight school that suffered a string of fatal accidents, who ‘was convinced that the school was a top training institution; he had no knowledge of any serious problems in the organization’ (Raad voor de Transportveiligheid, 2003, p. 41). This is a position that can be shown, with hindsight, to contrast with a deeply troubled actual organization which faced enormous cost pressure, lacked qualified or well-suited people in leadership posts, ran operations with non-qualified instructors, featured an enormous distrust between management and personnel, and had a non-functioning reporting system (all of which was taken as ‘normal’ or acceptable by insiders, as well as the regulator, and none of which was taken as a signal of potential danger). But the issue here is that the quality manager’s idea about the system not only stands in sharp contradistinction to the actual system, it is also very underspecified, general and vague. There is no way to really calibrate whether this manager has the ‘requisite imagination’ (Adamski & Westrum, 2003) to adapt his organization as necessary, although it is tempting to entertain the suspicion that he does not.

One way forward is to make both images of work explicit and confront them with each other. A part of resilience engineering could be to help leadership with what has been called ‘broadening checks’ – a confrontation with their beliefs about what makes the system work (and safe or unsafe), in part through a better visualization of what actually goes on inside. One approach to broadening checks, suggested and developed by Andrew Hale, is scenario-based auditing. Scenario-based auditing represents a form of guided simulation, which could be better suited for capturing eroded portions of an organization’s control structure. In general, broadening checks could help managers achieve greater requisite imagination by confronting them with the way in which the system actually works, in contrast to their imagination of it.

Towards Broader Markers of Resilience

What additional markers of resilience were explored at the symposium? Here I take up two points: whether safe organizations succeed in keeping discussions about risk alive, and how organizations respond to failure. Both of these could be telling us something important, and something potentially measurable, about the resilience or brittleness of the organization.

Keeping Discussions about Risk Alive Even When all Looks Safe

As identified previously, an important ingredient of engineering a resilient organization is constantly testing whether ideas about risk still match reality; whether the model of operations (and what makes them safe or unsafe) is still up to date. As also said, however, it is not always easy to expose this gap between the system-as-imagined and the system as really operated. The models that managers and even operational people apply in their management of risk could be out of date without anybody (including outsiders, such as researchers or regulators) knowing it, because they are often so implicit, so unspecified. In fact, industries are seldom busily engaged in explicating and recalibrating their models of risk. Rather, there seems to be hysteresis: old models of risk and safety (e.g., human errors are a major threat to otherwise safe systems) still characterise most approaches in aviation, including latter-day proposals to conduct ‘line-oriented safety audits’ (which, ironically, are designed explicitly to find out how real work actually takes place!).

If the distance between the model of risk (or the system as imagined) and actual risk (or the system as actually operated) is difficult to illuminate, are there other indicators that could help us judge the resilience of an organization? One way is to see whether activities associated with recalibrating models of safety and risk are going on at all. This typically involves stakeholders discussing risk even when everything looks safe. Indeed, if discussions about risk are going on even in the absence of obvious threats to safety, we could get some confidence that an organization is investing in an analysis, and possibly in a critique and subsequent update, of its models of risk. Finding out that such discussions are taking place, of course, is not the same as knowing whether their resulting updates are effective or result in a better calibrated organization. It is questionable whether it is sufficient just to keep a discussion about risk alive. We should also be thinking about ways to measure the extent to which these discussions have a meaningful effect on how the organization relates to operational challenges. We must remind ourselves that talking about risk is not the managing of it, and that it constitutes only a step toward actually engineering the resilience of an organization.

The role of a safety organization in fanning this discussion, by the way, is problematic. We could think that a safety department can take a natural lead in driving, guiding and inspiring such discussions. But a safety department has to reconcile competing demands for being involved as an insider (to know what is going on), while being independent as an outsider (to be able to step back and look at what really is going on and influence things where necessary). In addition, a strong safety department could send the message to the rest of an organization that safety is something that can and should be delegated to experts; to a dedicated group (and indeed, parts of it probably should be). But keeping discussions about risk alive is something that should involve every organizational decision maker, throughout organizational hierarchies.

Resilience and Responses to Failure

What if failures do occur? Are there criteria of resilience that can be extracted then? Indicators of how resilient an organization is consistently seem to include how the organization responds to failure. Intuitively, crises could be a sign of a lack of fundamental organizational resilience, independent of the effectiveness of the response. But effective management of crises may also represent an important marker of resilience: What is the ability of the organization to rebound even when exposed to enormous pressure? Preventing the next accident could conceptually be close to managing the aftermath of one and in this sense carry equally powerful implications for our ability to judge the resilience of an organization.

The comparison is this: suppose we want to predict an accident accurately. This, and responding to a failure effectively, both involve retreating or recovering from previous understandings of the system and interpretations of the risk it is exposed to. Stakeholders need to abandon old beliefs about the real challenges that their system faces and embrace new ones. Both are also about stakeholders being forced to acknowledge the gap between work as imagined and work as done. The rubble of an accident often yields powerful clues about the nature of the work as actually done and can put this in sharp contradistinction with official understandings or images of that work (unless the work as actually done retreats underground until the probe is over – which it does in some cases, see McDonald et al., 2002). Similarly, getting smarter at predicting the next accident (if we see that as part of resilience engineering) is in part about finding out about this gap, this distance between the various images of work (e.g., ‘official’ versus ‘real’), and carefully managing it in the service of better grasping the actual nature of risk in operations.

Cognitively, then, predicting the next accident and mopping up the previous one demands the same types of revisionist activities and insights. Once again, throughout the symposium (and echoed in the themes of the present book) this has been identified as a critical ingredient of resilience: constantly testing whether ideas about risk still match reality; updating beliefs about safety and brittleness; and recalibrating models of risk. A main question is how to help organizations manage these processes of insight and revision effectively, and further work on resilience engineering is bound to address this question.
