© Raymond Pompon 2016

Raymond Pompon, IT Security Risk Control Management, 10.1007/978-1-4842-2140-2_4

4. Risk Analysis: Natural Threats

Raymond Pompon

(1)Seattle, Washington, USA

Even with all our technology and the inventions that make modern life so much easier than it once was, it takes just one big natural disaster to wipe all that away and remind us that, here on Earth, we’re still at the mercy of nature.

—Neil deGrasse Tyson

There are many ways to look at risk. One way is to divide risk into natural, accidental events and man-made intentional acts of aggression. Both types of risk are important, but there is some insight to be gained in looking at them differently. This chapter explores risk arising from natural and accidental threats.

Disaster Strikes

What is a disaster? What can go wrong when a random occurrence has far-ranging consequences? Consider this:

  • A transformer in a nearby substation explodes and an entire nearby business park goes dark. The medical service company there has its own generators, so their server room stays up. However, the business park is on a hill and the local water pumps are offline. The landlord closes all the offices because none of the bathrooms are functioning and the local board of health is citing a sanitation hazard.

  • A backhoe tears through an underground telecom conduit, breaking the primary T3 serving the nearby bank. Good news: the bank had a secondary Internet connection. Bad news: the second Internet connection ran through the same conduit as the primary. The bank is offline and the telecom company is digging furiously to expose the conduit to replace the broken lines. Then it starts raining. Hard. Now the hole is filling up with mud and water, slowing repair efforts.

  • Heavy snowfall blankets the hills and streets of a major metropolitan city. Then temperatures drop, freezing the snow into slick ice. No wheeled vehicle in the city can drive faster than ten miles per hour. Roadways clog up with stalled cars. At a large software development company, hundreds of employees jump onto the remote access VPN and promptly overload it. Only 50 can work at a time out of an office of 200. Work now breaks down into shifts with many personnel forced to work off their cell phones and locally stored files until the snowplows can clear the roads.

  • A thousand-year storm (now more common in the age of climate change) hits and flooding takes out not only the data center but also an entire company’s campus. It turns out that the campus was built in an ancient lakebed. Sewers are reversing and streets have become rivers. Even if they could work remotely, failing levees are forcing local employees to flee their homes.

  • A 6.8-magnitude earthquake hits just south of a major city. At one company’s data center, a large battery stack that was not properly bolted down pulls away from the wall by a few inches, breaking the main power coupling. The coupling is how all power flows into the server room. The emergency generator in the parking lot, despite being fully operational, is useless without a connection to the server room. Worse, outside feeds also come in over the same coupler. It’s a specialized part and none are available locally. The company needs a new coupler flown in but the earthquake damaged the airport runway so nothing can land.

  • A law firm is on the top floor of a prestigious downtown office building boasting an award-winning Venetian trattoria in the lobby. One evening, a fire breaks out in the kitchen of the restaurant. Fire fighters quickly douse the flames but they also close the building for days to inspect for damages. During that time, the power to the building is also turned off. The law firm servers go dark.

  • In 2012, Hurricane Sandy damaged hundreds of thousands of homes and shut down power to dozens of data centers all over the New York metropolitan area. Data centers switched over to generators, but generators require fuel and the fuel trucks couldn’t get through the flooded roads.

These are all natural disasters based on real events that significantly affected the organizations. Clearly, these kinds of risks are worth examining in more detail.

Risk Modeling

A risk model is nothing more than a taxonomy and a method of measurement that provides a picture of the likelihood and impact of potential damaging events. To be useful, a model needs to reflect reality as closely as possible, so choosing the right risk model matters. One thing to consider is using different risk models for different kinds of risk.

You could use the same model for all of your risk calculations. Many knowledgeable risk analysts do, and come up with great results. I want to illustrate a couple of different specialized models that I find useful. You may find the risk model that works best for your organization is entirely different. The most important thing is to think thoroughly about risk and build a practical methodology that is appropriate for your industry, threat landscape, and compliance requirements.

Let’s consider the following list of risks:

  • Earthquake

  • Denial-of-service

  • Fire

  • Fraud

  • Hazardous materials spill

  • Industrial espionage

  • Insider sabotage

  • Intellectual property theft

  • Landslide

  • System failure

Examining this list, you may see two kinds of threats here. There is the threat of natural hazards and accidents, where bad things just happen. There is also the threat of adversarial risk, where bad people make bad things happen, either on purpose or through carelessness. So, this risk list can be split into two columns, as shown in Table 4-1.

Table 4-1. Risks: Natural vs. Man-made

| Bad Things Just Happen | Bad People Make Bad Things Happen |
| --- | --- |
| Earthquake | Denial-of-service |
| Fire | Fraud |
| Landslide | Industrial espionage |
| Hazardous materials spill | Intellectual property theft |
| System failure | Insider sabotage |

Given the distinctive difference in likelihood and impact between these two kinds of threats, it might be more accurate to model these threats differently. The likelihood of a random event that causes damage, such as a tornado or a technology failure, manifests much differently than an attack by intelligent adversaries (such as cyber-criminals or malicious insiders) who adapt their strategies to your defenses and choose when to attack. Likewise, the impact of natural disasters can sometimes be so vast and multifaceted that you might want to look at a more detailed impact model.

The next chapter is about adversarial threats, whereas the rest of this chapter is about natural/operational threats.

Modeling Natural Threats

In some ways, natural threats are easier to model than adversarial threats. Although there are still many unknowns, such as trying to predict earthquakes or weather, there is also a lot of scientific expertise and historical data to draw upon.

Like real estate, the three most important things about natural hazards are location, location, location. Location even factors in natural threats like pandemics, where large cities and regional travel hubs are at higher likelihood of outbreaks than sleepier locales. Location modeling is where your asset analysis comes in. To do any kind of natural risk analysis, you absolutely need to locate all the significant physical locations of your organization. With that, you can determine what infrastructure and hazards are present in each area. By infrastructure, I mean the following:

  • Connected utilities (power, water, sewer)

  • Communication providers serving that location and their entries into the building

  • Nearest fire department

  • Nearest major highway

  • Nearest major airport

  • Nearby waterways

  • Nearby industrial activity

Based on this, you can gather information about what can go wrong. There are many resources available for getting this kind of data, most from governmental agencies. One warning: sometimes the data can require some work to decipher. For example, where I live, there is the Seattle Hazard Explorer, which is an interactive map that provides data on a dozen different possible natural hazards (see www.seattle.gov/emergency-management/hazards-and-plans/hazards).

For earthquakes, it gives its results as a number range mapped to a “10% chance of exceeding in 50 years.” Reading around the site, you learn that the number range refers to the acceleration generated by an earthquake, expressed as a percentage of the acceleration due to gravity (g). So a value of 50 would mean shaking that adds a force equal to half the strength of normal Earth gravity. As a reference, at values higher than 30 (more than 0.3 g), you can expect some building damage to occur. This gives you some idea of impact. The 10% refers to the probability of such a quake in a 50-year period, which gives you likelihood. So with a little work figuring out the numbers, you have enough for a quantitative risk analysis for earthquakes in Seattle. Many of these sites require this kind of decoding to get data for a risk model.
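The step from a 50-year figure to a yearly likelihood can be worked out directly. The following is a minimal sketch, assuming every year is independent and equally likely to see such a quake, which is a simplification; the function name is only illustrative.

```python
def annual_probability(p_window: float, years: int) -> float:
    """Convert the probability of at least one event in `years` years
    into an equivalent per-year probability (assumes independent years)."""
    return 1.0 - (1.0 - p_window) ** (1.0 / years)


# "10% chance of exceeding in 50 years" taken from the hazard map
p_year = annual_probability(0.10, 50)
return_period = 1.0 / p_year

print(f"annual probability: {p_year:.4%}")
print(f"return period: about {return_period:.0f} years")
```

Under that assumption, the map's figure works out to roughly a 0.2% chance per year, or about one such quake every 475 years, which is a number you can place directly into a yearly risk register.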

Here is a list of other good resources for researching natural hazards for North America.

As you assess natural hazards, you quickly realize that there is a huge range of possible threats. There are many things to calculate and consider. Here is a large list of possible natural threats for you to consider.

  • Earthquake

  • Flood

  • Blizzard / ice-storm

  • Landslides/mudslides

  • Building fire

  • Forest fire

  • Heat wave

  • Hazardous material spill

  • Gas leak

  • Prolonged road closure

  • Pandemic

  • Blackout/electrical disruptions

  • Solar flare

  • Volcano

  • Lightning

  • Tsunami

  • HVAC failure

  • Aircraft accident

  • Windstorm

  • Communications failure

  • Nuclear power mishap

  • Civil disturbance

  • Active shooter/domestic terrorism

You may notice that the last two bullet items are human-originated but still listed as natural hazards. It is easier to model these kinds of threats as random events than as intentional attacks against your organization. This could change if your organization is attractive to highly motivated attackers. A law enforcement agency, for example, could appeal to political or nation-state attackers, which then skews the normal random distribution of Internet threats. Use your best judgement.

Modeling Impact with Failure Mode Effects Analysis

Operational impacts involving technology can be tricky because of the intricate dependencies involved. Consider a situation like a power surge in a single colocation facility that causes a single rack to fail. This rack holds dozens of servers, including the main external DNS server for an entire company. Coincidentally, the secondary DNS in another facility is undergoing a maintenance event. Now the entire company’s Internet presence is offline. Web, FTP, e-mail, and remote access are all down. Although this is an unusual chain of events, it still happens. Considering the severe impact of these kinds of problems, you might want to model these kinds of occurrences.

For operational risk analysis, consider looking at Failure Mode and Effects Analysis (FMEA). FMEA is based on a US military procedure. The model was formalized for general use and is currently published as International Standard IEC 60812. FMEA is a risk analysis methodology that focuses on the ways that components in a system fail and the downstream effects of those failures. Although it can be time-consuming, it is a very systematic and easy method for a team to use to get a complete picture of how complex systems fail. You will quickly see alternate design ideas, fail-over mechanisms, and monitoring requirements as you walk through an FMEA analysis.

The essence of FMEA is to

  • Break down a complex system into its major functions.

  • Analyze the functions.

  • Determine the effects of the failure of each of the functions on the overall system.

Simple FMEA Example

Table 4-2 illustrates an example of how to break down an Internet banking system into functions.

Table 4-2. System Example: Internet Banking System

| Subsystem | Functional Subsystems | Effect of Failure |
| --- | --- | --- |
| Servers | Database server, app server, web server, DNS server | Immediate failure of entire system |
| Network Devices | Firewall, router, switch, cables | Immediate failure of entire system |
| Connectivity | ISP, local link, cables | Immediate failure of entire system |
| Facilities | Server rack, power, HVAC | Varies but assume near-term failure of entire system |
| Personnel | DBA, sysadmin, net engineer, programmer | Varies but assume failure of system within a few weeks |

You can increase the availability of all the technological systems by adding redundant systems. However, adding more personnel and facilities is much more expensive. So let’s dig deeper into these two functions. In Table 4-3, we can add another FMEA factor, the detection method for a failing function.

Table 4-3. FMEA Example of Facilities

| Function | Effect of Failure | Explanation | Detection |
| --- | --- | --- | --- |
| Server rack | Immediate failure of entire system | Need rack to hold server | Beyond systems failing, none |
| Power | Immediate failure of entire system | Need electricity | Audible alarm in room on batteries |
| HVAC | Failure of system in 1–6 hours, plus possible equipment damage | Servers overheat | Thermometer alarm via SMS pager |

In Table 4-4, we can add another layer of impact by looking at how personnel are affected by function failures.

Table 4-4. Example of FMEA Breakdown of Personnel

| Function | Effect of Failure | Explanation | Detection |
| --- | --- | --- | --- |
| Database admin | Failure of system within 30–60 days | Database logs fill up without admin to clear them | E-mailed alarms to admins |
| Sysadmin | Degraded performance, failure of system within 15–30 days | No maintenance by admins | User complaints |
| Net engineer | Degraded performance | No network fixes or optimization | User complaints |
| Programmer | Degraded performance | No bug fixes or features added | User complaints |

As you can see, there are some problems with delayed failures caused by functions not being available. Coupled with poor detection capability, this could mean a serious problem emerging when you are least able to deal with it, like in the middle of the night. Anyone who works in IT will probably cringe when they read that the detection method is user complaints.
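Once FMEA rows are written down as data, spotting weak detection can be automated. The following is a minimal sketch: the rows are copied from Table 4-4, while the `FailureMode` class, the `weak_detection` helper, and the list of detection methods treated as "weak" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    function: str
    effect: str
    detection: str


# Rows copied from Table 4-4 (personnel)
personnel = [
    FailureMode("Database admin", "Failure of system within 30-60 days",
                "E-mailed alarms to admins"),
    FailureMode("Sysadmin", "Degraded performance, failure of system within 15-30 days",
                "User complaints"),
    FailureMode("Net engineer", "Degraded performance", "User complaints"),
    FailureMode("Programmer", "Degraded performance", "User complaints"),
]

# Assumption: these detection methods only reveal a failure after the fact
WEAK_DETECTION = {"user complaints", "beyond systems failing, none"}


def weak_detection(modes):
    """Return the functions whose failure would only be noticed late."""
    return [m.function for m in modes if m.detection.lower() in WEAK_DETECTION]


print(weak_detection(personnel))
```

For the personnel rows, this flags the sysadmin, net engineer, and programmer functions, which are the ones where better monitoring would pay off first.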

Breaking Down a System

The first step is to break down a system into its primary functions. You can represent most systems, especially technological ones, as a hierarchy. In addition, large systems are built to serve a purpose. This is your beginning point: the primary mission of the system. From here, functional requirements will flow. Examine its structure as it fulfills its functions. You may want to look at how it performs those functions over time, considering environment changes on a scale of hours, days, weeks, and so forth. For example, an accounting system may have different functions and dependency cycles for month-end, quarterly profit reporting, and annual tax analysis. Remember to use your asset analysis and review those diagrams, documentation, data flows, and subject-matter expert interviews.

You should consciously decide and document what you consider to be the boundaries of the system. This is useful for analysis and will come in handy in Chapter 6 when we discuss the concept of scope. What is a system? What should you include? What should you leave out? For example, a typical corporate Windows workstation needs dynamic addressing, Active Directory, name services, clock synchronization, and a local area network. Maybe you can leave out authentication services, file sharing services, and Internet links. Maybe you should include Internet connectivity? It all depends on your business requirements.

When looking at larger systems, functions can be more than just components or technical services. You can look at things like facilities, personnel, ISPs, and other major systems. For example, here is a functional breakdown of a retail store’s sales system:

  • Point-of-Sale (POS) terminals

  • Receipt printers

  • Card readers

  • Sales clerks

  • Store network

  • Store wiring closet

  • Store server rack

  • Store POS server

  • Store network link to headquarters in another city

  • Regional IT service technician (part-time)

This method is called the top-down approach and it is the most common way people approach an already existing system. You start big and work your way down into smaller and smaller pieces, stopping analysis when you feel that you understand enough. Another approach is bottom-up, where you start with components and build up to a complete system. This takes longer but is more complete.

Analyzing Functions

With the major functions defined, you can begin analyzing those functions for how they will fail. This means looking at dependencies, redundancies, inputs, and outputs. Like the overall system, you should begin with the goal of the function. How does the function fulfill its mission? What was the mindset of the designers of the function? These things can provide overlooked clues that can help you find shortcomings and possible flaws in the function.

You should also consider how this system feeds into other systems. In the workstation example, that system could be considered part of the “marketing subnet.” Are there any implications and feedback loops because of that relationship? Perhaps the marketing department does monthly video broadcasts; so now you realize that each workstation should also have speakers or headphones as critical functions.

Things to look at when analyzing a function:

  • Command inputs (the range of possible changes that can be made)

  • Breadth of command inputs (How many external parties can issue commands to the function?)

  • Data flows (size, speed, and path)

  • Internal feedback mechanisms (Can the function monitor its own status? How does it react?)

  • External feedback mechanism (Can the function receive warnings from the outside? How does it react?)

  • Does the function provide feedback to outside systems? How and how often?

  • What does the function require from outside systems to function? What happens when it doesn’t get it?

  • How does the function handle load? What does it do when overloaded? Idle?

  • Does the function adjust its own parameters? How?

  • How long can the function run without rest/maintenance? External adjustment?

Determining Failure Effects

Modeling failures within complex systems full of interacting components is a challenge. Large systems can be non-linear, where a minor change in a single subsystem can resonate with large consequences. Systems also have a history, as they have evolved and been adapted from earlier, simpler systems. Over time, their purpose may have changed. Sometimes that legacy of changes is reflected in the components in the system. For systems already in place, it is important to analyze the systems as they exist, not based on the original specifications or on idealized plans yet to be implemented.

When brainstorming failure modes, you should consider exactly what that failure would look like. Be sure to take into account the scale (how bad) and the duration (how long). For some kinds of systems, scale and duration could be limited or prolonged depending on how the system reacts to and recovers from problems. In some cases, disruption is momentary and evades detection. I have seen more than a few cases where a server quietly crashes, reboots, and comes back up, all between two cycles of the monitors’ five-minute uptime check. Users, however, did notice because all of their transactions failed and data was lost. Improving function failure detection is an important part of failure effects analysis.
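One way to catch a crash that hides between two monitoring polls is to compare the host's reported uptime against the polling interval. The following is a minimal sketch, assuming the monitor can read an uptime counter from each host; how that reading is collected is left out, and the function name and sample values are hypothetical.

```python
def rebooted_since_last_check(uptime_seconds: float,
                              seconds_since_last_check: float) -> bool:
    """A host that has been up for less time than has passed since the
    previous poll must have restarted somewhere in between."""
    return uptime_seconds < seconds_since_last_check


# Hypothetical readings from a monitor that polls every 300 seconds
steady_host = rebooted_since_last_check(uptime_seconds=86_700,
                                        seconds_since_last_check=300)
bounced_host = rebooted_since_last_check(uptime_seconds=140,
                                         seconds_since_last_check=300)

print(steady_host, bounced_host)
```

A simple "is it up right now?" probe would report both hosts as healthy; the uptime comparison reveals that the second one restarted during the last interval.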

Drawing a system can also help illustrate which failure modes can be the most catastrophic to a system. Figure 4-1 is a simple diagram of an Internet banking web farm.

Figure 4-1. Internet banking system

Can you spot the single point of failure in the example? Hint: it’s probably the cheapest piece of equipment in the entire stack. Drawing a system on a whiteboard during a group brainstorming session can be very effective; not only for spotting failure points, but also in creating general understanding on how the system actually functions.
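The whiteboard exercise can also be checked mechanically once the diagram is written down as a list of connections. The following is a minimal sketch over a hypothetical topology; it is not a transcription of Figure 4-1, and the node names and helper functions are invented for illustration. It removes each component in turn and tests whether a customer request can still reach the account data.

```python
from collections import deque

# Hypothetical topology (NOT Figure 4-1): request flow from customer to data
LINKS = {
    "customer": ["fw1", "fw2"],
    "fw1": ["switch"],
    "fw2": ["switch"],
    "switch": ["web1", "web2"],
    "web1": ["db1", "db2"],
    "web2": ["db1", "db2"],
    "db1": ["accounts"],
    "db2": ["accounts"],
}


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Components whose loss alone cuts every path from start to goal."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates
                  if not reachable(links, start, goal, removed=n))


print(single_points_of_failure(LINKS, "customer", "accounts"))
```

In this made-up layout, every tier is doubled except the switch, so the switch is the only component reported.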

Business Impact Analysis

If you are responsible for the organization’s business continuity plan, then this type of analysis is useful for the Business Impact Analysis (BIA) section of the business continuity plan. Business continuity is a requirement of ISO 27001, PCI DSS, and SSAE 16 SOC 2 and 3. Sometimes this role falls on IT security and sometimes the role falls on a dedicated business continuity team. In any case, it’s essential to understand the process.

BIA identifies the functions that are necessary to the organization and what the effects of an interruption would be. Through BIA, you create disaster impact scenarios based on those mission-critical services to help determine what resources you might need to get business operations going again. This all fits very nicely into the risk analysis methods discussed so far in this chapter.

A way to enhance this analysis, especially for disaster scenarios, is to break down duration of unavailability or degradation of function. For the following examples, I have divided the period of unavailability into three times: one day, three days, and five or more days. These times allow a response plan to address specific kinds of interruptions, but also can work with an escalating crisis. You can use whatever periods you think are appropriate for your organization. Table 4-5 shows one way of categorizing different downtime periods for a typical office.

Table 4-5. Example of Impacts to Normal Business Operations for the Office

| Function | Duration: 1 day down | Duration: 3 days down | Duration: 5+ days down |
| --- | --- | --- | --- |
| Water & Sewer | Minor | Significant | Major |
| HVAC | Minor | Significant | Significant |
| Electricity | Significant | Major | Major |
| Elevator | Minor | Minor | Minor |
| Connectivity (Phone & Net) | Significant | Major | Major |
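A duration-based impact table like this one can be turned into a simple lookup for use during scenario planning. The following is a minimal sketch: the ratings are copied from Table 4-5, while the `impact` function and its rule for in-between durations (two days uses the three-day column, four days uses the five-plus column) are conservative assumptions of mine.

```python
# Ratings copied from Table 4-5, ordered as (1 day, 3 days, 5+ days)
IMPACT = {
    "Water & Sewer": ("Minor", "Significant", "Major"),
    "HVAC": ("Minor", "Significant", "Significant"),
    "Electricity": ("Significant", "Major", "Major"),
    "Elevator": ("Minor", "Minor", "Minor"),
    "Connectivity (Phone & Net)": ("Significant", "Major", "Major"),
}


def impact(function: str, days_down: float) -> str:
    """Look up the impact rating for an outage of the given length.

    Assumption: durations between columns round up to the next column.
    """
    if days_down <= 0:
        raise ValueError("days_down must be positive")
    if days_down <= 1:
        column = 0
    elif days_down <= 3:
        column = 1
    else:
        column = 2
    return IMPACT[function][column]


print(impact("HVAC", 2))
print(impact("Water & Sewer", 7))
```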

One thing you may notice from the example in Table 4-5 is that I didn’t define specific threats, just functional interruptions. I used FMEA to treat office services as functions and combined threats into a basic assumption of an interruption of service, regardless of reason, with a defined area of impact. You can use this shortcut with the FMEA model so that you do not need to look at every possible threat. This automatically rolls up the threats of fires, cable cuts, blackouts, pipe breaks, and severe storms into a single threat vector. For example, you could focus on an electrical outage and not worry about whether it was caused by a backhoe accident, windstorm, regional blackout, or an earthquake. All you focus on is the loss of the function and its effects. You can also use FMEA to ignore the various types of threats and group common likelihoods together. For the BIA, you can also break outages down into likely durations, so in your scenario planning, you can just focus on a function interruption event of specified duration. Table 4-6 shows more examples of rolling natural threats and likely durations into an analysis table.

Table 4-6. An Example of Threat Mapping with FMEA to Generalized Failure Threats

| Threat | Effects on Services | Outage Duration | Likelihood |
| --- | --- | --- | --- |
| Earthquake | Regional, all services | 5+ days | 1-Rare |
| Large earthquake | Widespread, all services | 5+ days | 0-Very Rare |
| Flood | Regional, all services | 2–5 days | 3-Unlikely |
| Blizzard/ice storm | Regional, transport | 2–5 days | 5-Possible |
| Storm | Regional, power & comm. & transport | 2–5 days | 5-Possible |
| Building fire | Facility, all services | 5+ days | 3-Unlikely |
| Civil disturbance | Facility, transport | 2–5 days | 1-Rare |
| Hazardous spill | Facility, all services | 2–5 days | 1-Rare |
| Pandemic | Regional, personnel | 5+ days | 3-Unlikely |
| Bomb threat | Facility, all services | 1 day | 3-Unlikely |
| Blackout | Regional, power | 2–5 days | 5-Possible |
| Volcano | Regional, all | 5+ days | 1-Rare |
| Lightning | Facility, power & comm. | 2–5 days | 1-Rare |
| HVAC failure | Facility, equipment | 2–5 days | 5-Possible |
| Aircraft accident | Facility, all services | 5+ days | 1-Rare |
| Communications loss | Regional, comm. | 1 day | 5-Possible |
| Nuclear mishaps | Regional, all | 5+ days | 1-Rare |
| Sabotage/terrorism | Facility, all services | 5+ days | 3-Unlikely |
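The roll-up of individual threats into generalized failure events can be sketched as a grouping step. The rows below are a subset copied from Table 4-6; the `roll_up` function, and the choice to rate each group by the highest likelihood score among its threats, are assumptions for illustration rather than a rule from the FMEA method.

```python
from collections import defaultdict

# Subset of Table 4-6: (threat, effects on services, outage duration, score)
ROWS = [
    ("Earthquake", "Regional, all services", "5+ days", 1),
    ("Flood", "Regional, all services", "2-5 days", 3),
    ("Building fire", "Facility, all services", "5+ days", 3),
    ("Hazardous spill", "Facility, all services", "2-5 days", 1),
    ("Aircraft accident", "Facility, all services", "5+ days", 1),
    ("Sabotage/terrorism", "Facility, all services", "5+ days", 3),
    ("Bomb threat", "Facility, all services", "1 day", 3),
]


def roll_up(rows):
    """Group threats that produce the same effect and outage duration."""
    groups = defaultdict(list)
    for threat, effects, duration, score in rows:
        groups[(effects, duration)].append((threat, score))
    return {
        key: {
            "threats": [threat for threat, _ in members],
            # Assumption: a group is as likely as its most likely threat
            "likelihood": max(score for _, score in members),
        }
        for key, members in groups.items()
    }


for (effects, duration), info in roll_up(ROWS).items():
    print(f"{effects} / {duration}: {info['threats']} -> {info['likelihood']}")
```

Seven threats collapse into five planning events here. The building fire, aircraft accident, and sabotage rows all become a single "facility lost for five or more days" event, which is the event the response plan has to address.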

Here’s how you can use this mapping, along with combinations of outages, to get a simpler breakdown of the various impacts to business operations that a disaster would create. In this example, I model a service company with two customer-facing services: customer support calls and consulting. Of the two, the call center is the more important because it facilitates immediate customer communication. If a customer calls and the lines are down, the company reputation sinks. Consulting services can be delayed but call answering cannot. In this example, most of the offices are in Nevada, except the sales office, which is in Chicago. These are the facilities:

  • Main office: Las Vegas, NV

  • Sales office: Chicago, IL

  • Call center: Reno, NV

  • Call center: Las Vegas, NV

In Table 4-7, I break down the first cut of the outage scenarios for this example.

Table 4-7. Sample Scenario Overview

| Scenario | Example | Likelihood | Impacts to Critical Service: Customer Call Center | Other Impacts |
| --- | --- | --- | --- | --- |
| Scenario 1: All services in the state of Nevada heavily damaged | Major earthquake | Very rare | Major: Complete outage | Major: Corporate outage; sales office becomes primary contact |
| Scenario 2: Service outages in the Vegas area | Storms, flooding | Possible | Minor: Reno call center takes all calls; capacity reduced | Minor: Corporate outages; sales office becomes primary contact |
| Scenario 3: Personnel in the Vegas area are disabled | Pandemic | Rare | Minor: Reno call center takes all calls; capacity reduced | Minor: Corporate outages; sales office becomes primary contact |
| Scenario 4: Main office is unavailable | Building fire | Possible | No direct impacts | Minor: Corporate outages; sales office becomes primary contact |
| Scenario 6: Reno call center is unavailable | Building fire | Possible | Minor: Vegas call center takes all calls; capacity reduced | No additional impacts |
| Scenario 7: Vegas call center is unavailable | Building fire | Possible | Minor: Reno call center takes all calls; capacity reduced | No additional impacts |
| Scenario 8: Chicago consulting office is unavailable | Building fire | Possible | No direct impacts | Minor: Corporate takes over sales calls as needed |

As you can see in this example, things seem to be pretty well in hand except for a statewide disaster scenario. In this case, the risk has been identified and leadership can choose to mitigate that risk or accept it.
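The reasoning behind the call center column can be sketched in a few lines: the critical service survives as long as at least one call center does. The facility names come from the example; the `state` and `role` tags, the `call_center_impact` function, and the wording of its results are assumptions for illustration.

```python
# Facilities from the example; the state and role tags are my own labels
FACILITIES = {
    "Main office (Las Vegas)": {"state": "NV", "role": "corporate"},
    "Sales office (Chicago)": {"state": "IL", "role": "sales"},
    "Call center (Reno)": {"state": "NV", "role": "call_center"},
    "Call center (Las Vegas)": {"state": "NV", "role": "call_center"},
}


def call_center_impact(unavailable: set) -> str:
    """Rate the impact on the critical service for a set of lost facilities."""
    centers = [name for name, info in FACILITIES.items()
               if info["role"] == "call_center"]
    surviving = [name for name in centers if name not in unavailable]
    if not surviving:
        return "Major: complete outage"
    if len(surviving) < len(centers):
        return "Minor: capacity reduced"
    return "No direct impacts"


# Statewide disaster: every Nevada facility is lost
nevada = {name for name, info in FACILITIES.items() if info["state"] == "NV"}

print(call_center_impact(nevada))
print(call_center_impact({"Call center (Reno)"}))
print(call_center_impact({"Main office (Las Vegas)"}))
```

Only the statewide case removes both call centers at once, which matches the one scenario in the table rated as a major impact.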

Documenting Assumptions

Whenever you work on these kinds of analyses, there will be assumptions. You need to identify them as you go along. Do not let them be invisibly baked into the resulting analysis. The results can be misleading if an assumption turns out to be false or misunderstood. The easiest way to avoid this is to solicit multiple perspectives in the analysis. Once you have identified the assumptions, they need to be documented along with the analysis. You should also have leadership review and approve these assumptions when you present your analysis. They may have different assumptions that could change your results. Here are some sample assumptions to go along with the earlier examples:

  • Unless otherwise noted in the analysis, all staff can telecommute to perform their job functions with at least 75% efficiency.

  • Services hosted outside of the state remain available and functional. They are considered “always available” during an emergency due to the low probability that both services are unavailable at the same time. External hosted services include chat (cloud-based), accounting (outsourced), HR payroll (outsourced).

  • IT personnel suffer no more than 25% incapacitation in an interruption event.
