© Raymond Pompon 2016

Raymond Pompon, IT Security Risk Control Management, 10.1007/978-1-4842-2140-2_20

20. Response Controls

Raymond Pompon

Seattle, Washington, USA

A good plan, violently executed now, is better than a perfect plan next week.

—General George S. Patton, Jr.

How you react when things go wrong is a huge factor in how much damage an incident does to your organization. If you run around like your hair is on fire, things will not go so well. When we are busy or stressed, we make bad decisions. There is panic, confusion, and indecision. Who is in charge? What do we do? Who do we call? This kind of disorder can magnify impacts and turn a bad situation into a disaster. However, if you remember the assume breach principle, then you know incidents are inevitable and you can be ready. What do you need to do to be ready? It involves three principles: preparation, planning, and practice.

When preparing for incidents, there are two kinds of controls that can be used: detective controls, which are primarily about event logging, and corrective controls, which are primarily about backups and failover systems. When planning, we need to look at business continuity plans, which define how the organization can keep running after disasters and outages, and security incident response plans, which contain and remedy breaches. Lastly, we try to learn from these events and practice for future incidents.

Logging

An important part of response is responding to the right thing. This means knowing what is actually going on. This is where logging comes in. With comprehensive logging and regular log review routines, it’s possible to catch a breach attempt in progress before catastrophic damage occurs. Even if you don’t have an active incident occurring, logs can give you an idea about what is going on inside your organization. Let’s begin with the logging policy, which will give you a good idea of what logging is all about.

Sample Logging Policy

ORGANIZATION will monitor and review critical systems for signs of unauthorized access as well as to ensure that controls and systems are properly functioning within expected parameters.

The IT and the Security department will share responsibility for configuring and maintaining event logging for all critical systems and security devices. The Security department will have sole responsibility for maintaining and protecting security event logs from unauthorized erasure or modification. Access to logs will be limited to those with specific needs.

  • Systems will be configured such that all system clocks will be synchronized to a trusted time source

  • Systems will be configured to record the following types of events:

    • Login/authentication successes and failures

    • Process or service execution successes and failures

    • Major system faults and errors

    • Administrator actions and configuration changes

    • Creation and deletion of system services and objects

    • Creation and deletion of security services and objects

    • Access, modification, and restart of logging

    • Antivirus alerts and events

  • Each log record will include the following details:

    • Date/time of event

    • User ID of who initiated event

    • Type of event

    • Source or origination of event (network address or subsystem)

    • Name of affected data, system component, or resource objects

Confidential data should not be stored in event log data. If for unavoidable technical or business reasons, confidential data is stored in a log, then ORGANIZATION will use additional controls to protect either the confidential data elements or the entire log file.

The IT Department is responsible for regularly reviewing all logs for critical systems.

The Security department is responsible for regularly reviewing all security device logs for systems like firewalls, intrusion detection systems (IDS), and two-factor authentication servers.

Log history will be retained for at least one year, with a minimum of three months of searchable online availability.

Log files will be backed up to a centralized log server or media that is difficult to alter.

What You Must Log

While some of the items in the policy are self-evident, there are others that are worth exploring in more detail. One of them is what should be logged. In an ideal world, you’d log every device capable of logging and keep the data forever. While storage space is relatively cheap, it’s usually not feasible to store, organize, and review that much data. I’ve worked with systems where nearly a quarter of the internal network bandwidth was consumed by data streams from security devices to the log servers. You need to prioritize what you need to record.

The first things to consider are which controls should be generating logs and how those logs should be captured. Nearly every security device and service on the market now has the capability of generating logs. Since your technical security controls are your primary means of defense, they will be one of your best sources of information when something is going on. You will want as much data as possible from them. This means firewalls, intrusion detection/prevention systems, virtual private network devices, antivirus software, and authentication servers.

In that same vein, you want to capture any security-related events generated by the systems maintaining your infrastructure. Nearly every infrastructure system, including cloud-based services, can generate logs about its status. You absolutely want to capture the events related to security. Sometimes those are clearly categorized by the system and sometimes you have to explicitly select them. When selecting for security events, you want information on adding/changing users, modifying user privileges, stopping/starting services, adding/changing access rules, excessive login failures, and modifications to the logging system itself. Infrastructure systems can include anything from storage systems and domain controllers to file servers, network devices, and virtualization platforms. Many public cloud service providers provide detailed logging feeds on usage and management of their systems. Be sure to capture the security-related events there as well.

Any systems in scope, including the systems managing the scope barriers, should also be logging. This includes accounting servers, web servers, database servers, e-commerce servers, and mail servers. These servers should also log any access attempts, failure or success, to the software and data that is in scope. If someone logs in and views the credit card data, you want that event logged. If a system administrator modifies the software running the e-commerce site, you want that logged as well. If someone tries to log in and fails repeatedly, you definitely want to capture that. In some regulatory environments, like HIPAA, all accesses to health records (for example) must be logged and auditable.
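Repeated login failures are one of the easier things to pull out of these logs. Here is a minimal sketch in Python, assuming each event has already been parsed into a dictionary; the field names (`event_type`, `user_id`, `success`) and the threshold of five are assumptions for illustration, not values from any particular product.

```python
from collections import Counter


def repeated_login_failures(events, threshold: int = 5) -> dict:
    """Count failed logins per user ID and keep those at or above the threshold."""
    failures = Counter(
        event["user_id"]
        for event in events
        if event["event_type"] == "login" and not event["success"]
    )
    return {user: count for user, count in failures.items() if count >= threshold}


# Hypothetical events; a real feed would come from your authentication logs.
events = [{"event_type": "login", "user_id": "jsmith", "success": False}] * 6
events.append({"event_type": "login", "user_id": "alice", "success": True})
print(repeated_login_failures(events))
```

In practice you would probably count failures inside a time window rather than over the whole log, but the idea is the same.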

Lastly, every server and workstation should be set up to do a minimal set of logging as well, even if those logs aren’t sent off-box for collection and analysis. You want the same level of logging, capturing security and administrative events on those boxes. On workstations and servers, it’s really useful to track software installs and local logins. All of this can prove valuable during an incident.

Logging systems differ, but most should offer the same basic capabilities. You should be able to record the user ID or IP address of what triggered the event, the type of event (which will vary based on the system), whether the event failed or succeeded, where the event came from (the source address or which subsystem), which subsystems or data were affected by the event, and the date/time of the event.
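To make those fields concrete, here is a small sketch of a single log record in Python. The class name `LogRecord`, the field names, and the sample values are all assumptions made up for this example; real logging systems each have their own schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class LogRecord:
    """One event carrying the basic details described above (hypothetical schema)."""
    timestamp: datetime   # date/time of the event
    user_id: str          # user ID or IP address that triggered the event
    event_type: str       # for example "login" or "config_change"
    success: bool         # whether the event succeeded or failed
    source: str           # network address or subsystem the event came from
    target: str           # affected data, system component, or resource

    def to_json(self) -> str:
        """Serialize to one line of structured text, with the time in UTC."""
        fields = asdict(self)
        fields["timestamp"] = self.timestamp.astimezone(timezone.utc).isoformat()
        return json.dumps(fields, sort_keys=True)


record = LogRecord(
    timestamp=datetime(2016, 1, 4, 9, 30, tzinfo=timezone.utc),
    user_id="jsmith",
    event_type="login",
    success=False,
    source="10.0.0.15",
    target="accounting-db",
)
print(record.to_json())
```

If a system cannot fill in every one of these fields, that is worth knowing before an incident rather than during one.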

When looking at the time/date of the event, make sure that all of your systems are set up to do clock synchronization to a trusted, reliable master time source. Without clock sync, system clocks will slowly drift away from the actual time. During an incident, you do not want to realize that one system is 9 seconds fast, while another is set to the wrong time zone and 12 seconds behind. It can turn into a real mess and slow down a critical investigation when every minute counts. Use clock sync and verify that it is working periodically. Establishing an accurate timeline of events across multiple disparate devices is much easier if they are all using the same time source.
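Here is a rough sketch of what untangling that mess looks like. It assumes you have somehow measured each system’s clock error after the fact, which is exactly the work that clock sync saves you. The system names, timestamps, and the `to_true_utc` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_true_utc(logged: datetime, clock_error: timedelta) -> datetime:
    """Convert a logged timestamp to UTC and remove a known clock error.

    clock_error is positive when the system clock runs fast.
    """
    return logged.astimezone(timezone.utc) - clock_error


# Hypothetical events: (system name, timestamp as logged, measured clock error).
pacific = timezone(timedelta(hours=-8))
events = [
    # The firewall logs in UTC but runs 9 seconds fast.
    ("firewall", datetime(2016, 1, 4, 17, 0, 9, tzinfo=timezone.utc), timedelta(seconds=9)),
    # The file server logs in local time and runs 12 seconds behind.
    ("fileserver", datetime(2016, 1, 4, 8, 59, 50, tzinfo=pacific), timedelta(seconds=-12)),
]

timeline = sorted((to_true_utc(ts, err), name) for name, ts, err in events)
for when, name in timeline:
    print(when.isoformat(), name)
```

Going by the raw timestamps, the file server event appears to come 19 seconds before the firewall event. After correction, the firewall event actually happened first, by two seconds. That kind of reversal can send an investigation down the wrong path.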

Look at Your Logs

Capturing logs is great. Looking at them is better. You do not want to be in the middle of an incident when you realize that you haven’t been capturing proper log data from key systems. More subtly, if you haven’t been studying your logs, it’s hard to spot abnormal behavior and discern attacker actions from normal user actions. The gold standard is checking your logs so often that you spot an incident in progress and are able to stop it before the damage goes too far. To keep everyone on track, assign the responsibility for log review, set a schedule, and have that person follow a procedure for log review. Part of that log review procedure should be to generate records of the log review and its findings. This serves several purposes. First, filling out the paperwork forces the reviewer to do the actual log review. Second, this log about your logs provides you with some history and intelligence to review in case you need to look back after an event has occurred. Third, auditors will ask you for these records as proof that someone is doing log review. Fourth, they provide proof to everyone else that IT security is doing its job. Security is mostly invisible and when we succeed, nothing happens. Having a record showing how hard we work to make nothing happen is a wonderful tool during budget season. By the way, you don’t actually have to use paper. Many security analysts just open a help desk ticket and record their log review events and findings there.

If you can manage it, there are some great log-review software packages out there. Some of them are commercial and some are open source, but all require a lot of setup and customization. Every organization’s infrastructure is unique and their logging needs vary, so setting up a reliable and useful system to review logs can take some time. Some systems allow you to set alerts and triggers, which watch the logs for you and send you an e-mail or raise an alarm when something happens. Here are some things to set triggers on:

  • Security changes

  • Root logins (sysadmins should use named accounts)

  • After-hours access (if atypical for your organization)

  • Access control list and other security policy changes

  • Disabling/restarting of services

  • Changes to the local logging system

  • Account lockouts for system or service accounts

All of these can be indications that a security incident is happening. They could also be administrators doing their jobs, but those actions should be traceable in the change control system. Remember that for any alert to be useful, it has to be actionable. It makes no sense to receive hundreds of alerts every day if all you do is file them away and forget them. That’s logging, not alerting.
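The triggers listed above can be sketched as a handful of simple rules. This is an illustration only: the event type strings, the dictionary fields, the business hours, and the ticket numbers are all assumptions, and a real log-review package will have its own rule syntax.

```python
from datetime import datetime

# Hypothetical event types that should always raise an alert.
ALERT_EVENT_TYPES = {
    "acl_change",
    "service_stopped",
    "logging_config_change",
    "service_account_lockout",
}
BUSINESS_HOURS = range(7, 19)  # 07:00 to 18:59 local time, an assumed schedule


def alert_reasons(event: dict) -> list:
    """Return the reasons this event deserves an alert (empty list if none)."""
    reasons = []
    if event["event_type"] in ALERT_EVENT_TYPES:
        reasons.append(event["event_type"])
    if event["event_type"] == "login" and event["user_id"] == "root":
        reasons.append("root_login")
    if event["event_type"] == "login" and event["timestamp"].hour not in BUSINESS_HOURS:
        reasons.append("after_hours_access")
    return reasons


def needs_followup(event: dict, approved_changes: set) -> bool:
    """Alert only when the event is not explained by a change control ticket."""
    return bool(alert_reasons(event)) and event.get("ticket") not in approved_changes


event = {"event_type": "login", "user_id": "root", "timestamp": datetime(2016, 1, 4, 2, 15)}
print(alert_reasons(event))
print(needs_followup(event, approved_changes={"CHG-1042"}))
```

Filtering against the change control system is what keeps the alert actionable: an event tied to an approved ticket is an administrator doing their job, and everything else is worth a look.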

You need triggers and alarms on any changes to the logging system. If the logs consume all the disk space, you need to get on that. It may be that you are under a large-scale-but-below-the-radar attack, so no large event triggered but millions of small ones did. This can also happen during a port scan or a brute force attack. In any case, you don’t want logging to stop because your disk is full. You also want an alert if the logs suddenly stop coming or are cleared. Attackers will try to shut down or tamper with logging to cover their tracks. Sometimes the only sign you’ll see of an intrusion is logging being silenced.
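Both checks are simple enough to sketch. The following assumes you keep a record of when each log source last sent something; the source names, the 15-minute gap, and the 90 percent threshold are assumptions chosen for illustration.

```python
import shutil
from datetime import datetime, timedelta


def silent_sources(last_seen: dict, now: datetime, max_gap: timedelta) -> list:
    """Return the log sources that have sent nothing for longer than max_gap."""
    return sorted(name for name, seen in last_seen.items() if now - seen > max_gap)


def disk_nearly_full(path: str, threshold: float = 0.9) -> bool:
    """True when the volume holding path is at or above the usage threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold


# Hypothetical record of when each source last sent a log entry.
now = datetime(2016, 1, 4, 12, 0)
last_seen = {
    "firewall": datetime(2016, 1, 4, 11, 59),
    "ids": datetime(2016, 1, 4, 10, 30),
    "vpn": datetime(2016, 1, 4, 11, 50),
}
print(silent_sources(last_seen, now, max_gap=timedelta(minutes=15)))
```

The right gap depends on the source. A busy firewall that goes quiet for five minutes is suspicious, while a small file server may legitimately log nothing for hours.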

What Are You Looking For?

An answer is only as good as the question, so what questions are you asking of your logging system? Here are some of the questions I want to know the answers to:

Has Someone Successfully Broken In?

Which boxes of mine have been breached? Who did it? How did they do it? How far did they get, and are they still going? What are they after? What data did they access? What software did they plant? What users did they add? How sure can I be about all of these answers?

Has Someone Singled Me out for Special Attention?

If you’re on the Internet, you’re under attack right now. It’s mostly harmless junk bouncing off your firewalls and filters. Is someone poking at me and only me (or my industry)? Do they know my weaknesses? What do they know about my organization and its people? Is this part of some kind of campaign against my organization? Can I tell who they are and what they want? When I combine my logs with external intelligence sources, what else have they done that I haven’t noticed yet?

What Is Going On in My Scoped Systems?

Is someone doing something to those systems without authorization? Do active changes match up to change control tickets? Are patches and hardening being put in place in a timely manner? Are all the security controls on those systems working properly? Did someone add a new system to my scoped environment and not tell me? Often interpreting these kinds of logs requires some technical and local environment expertise. A software install or change will look different depending on the operating system, environment, and usage patterns of the system.

How Is Everything Going on that Internet Thing?

What is the state of the state? Are there more probes today than yesterday? If so, why? Is there a new vulnerability out there that I don’t know about? Are users surfing to strange and scary places more than usual? What parts of my infrastructure are getting the most attention? What kinds of malware, spam, and phishing are coming into our network? What services are getting lots of attention from attackers right now?

What Have I Seen Today that I’ve Never Seen Before?

I see a lot of stuff in my logs, but what’s appeared today that was never there before? I couldn’t get through a security book without mentioning the great Marcus Ranum. He said, “By definition, something we have never seen before is anomalous” and he even created a simple little log tool to look for it.1 Maybe the new thing is just a new technology or service on the Internet that we’re just noticing for the first time. No big deal. Maybe a sysadmin made a change and we didn’t get the memo. Good to know. Or maybe something wicked this way comes. It’s an easy check to build into your log review process and always seems to reveal interesting bits of information.
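The idea is simple enough to sketch in a few lines of Python. To be clear, this is my own illustration of the concept and not Ranum’s tool; the `normalize` rule (replacing every run of digits so that process IDs and timestamps don’t make each line look new) and the sample log lines are assumptions.

```python
import re


def normalize(line: str) -> str:
    """Collapse digits so process IDs and timestamps don't make every line unique."""
    return re.sub(r"\d+", "#", line.strip())


def never_seen_before(lines, seen: set) -> list:
    """Return lines whose pattern is not in seen, and remember them for next time."""
    new_lines = []
    for line in lines:
        pattern = normalize(line)
        if pattern not in seen:
            seen.add(pattern)
            new_lines.append(line)
    return new_lines


seen = set()
yesterday = ["sshd[411]: Accepted password for alice", "cron[20]: job started"]
never_seen_before(yesterday, seen)  # first pass builds the baseline

today = ["sshd[982]: Accepted password for alice", "sshd[983]: Accepted password for mallory"]
print(never_seen_before(today, seen))
```

In real use, the `seen` set would be saved to disk between runs, and the first few days would be noisy while the baseline fills in.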

Protecting Your Logs

As I mentioned before, attackers go after logging systems to cover their tracks by erasing or altering log data. With the threat of insiders, those attackers can include your own system administrators. Some systems, especially the scoped business servers, put confidential data into the logs as well. So you need to protect your logs. It’s best to have the logs sent away from the log source to a protected server, and then encrypted and digitally signed as they’re saved. The first part is pretty easy, because most systems have the native capability to send logs over the network to some kind of a repository. The most common format for this is syslog with the data being captured in structured text.

The log repository should be secured, even from the IT department and possibly even secure from tampering by the security team as well. This is where digital signatures of hashes of the log data can be used. Encrypting the logs is best, but if you can’t, then lock down the log repository server as best you can. Two-factor access and segregating it from the other systems is a good start.
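Here is a minimal sketch of tamper-evident log storage. Note that it swaps in a keyed hash (HMAC-SHA256) for a true public-key digital signature because the standard library supports it directly; the chaining scheme and function names are assumptions for illustration, not a description of any specific product.

```python
import hashlib
import hmac


def sign_chain(lines, key: bytes) -> list:
    """Return one HMAC-SHA256 tag per line; each tag also covers the previous tag."""
    tags, previous = [], b""
    for line in lines:
        tag = hmac.new(key, previous + line.encode("utf-8"), hashlib.sha256).digest()
        tags.append(tag)
        previous = tag
    return tags


def verify_chain(lines, tags, key: bytes) -> bool:
    """True only if the lines still match the recorded tags, in order."""
    expected = sign_chain(lines, key)
    return len(expected) == len(tags) and all(
        hmac.compare_digest(a, b) for a, b in zip(expected, tags)
    )


key = b"example-key-kept-off-the-log-server"  # hypothetical; store it elsewhere
log_lines = ["09:30 login jsmith ok", "09:31 acl_change jsmith", "09:35 logout jsmith"]
tags = sign_chain(log_lines, key)

tampered = [log_lines[0], log_lines[2]]  # the attacker deletes the middle entry
print(verify_chain(log_lines, tags, key), verify_chain(tampered, tags, key))
```

Because each tag depends on the one before it, editing, deleting, or reordering an entry breaks every later tag. Two caveats: anyone holding the key can forge the chain, so the key must live somewhere the log administrators can’t reach, and an attacker who trims both the log and the tag list from the end goes unnoticed unless the latest tag is also recorded somewhere else.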

Backup and Failover

When talking about responding to problems and disasters, our oldest and best control is backup. Although the backup process is usually owned by the IT department, the security team has a stake in its success or failure. When something bad happens, be it man-made or natural, everyone is going to turn to the backups to get things going again. Backups need to be reliable and available when a problem strikes. Not having a good backup when you need it is one of those things that make you look negligent or stupid.

Keep Backups Offsite and Safe

Backups should be stored securely and some distance away from the systems. If a flood takes out the city where the office resides, you don’t want the backup tapes to be in a nearby building. Ideally, you should look at your risk analysis and check the area of effect of the more likely natural disasters when planning an offsite storage system. For example, in Seattle, I try to get my backups out of the fault zone of any major earthquakes. I don’t want my tapes buried in the rubble along with my data center.

If backups are being sent offsite, how do they get there? The old-fashioned way means a courier driving tapes around. Remember that lost tapes can lead to data breaches, so the tapes should be encrypted. If the tapes are encrypted, then you need to have access to the decryption key in the event you need to rebuild somewhere else. Obviously, you don’t want the key to travel with the same courier as the tapes. You also want to make sure that a device and software that can read the tapes are available somewhere outside the potential disaster area, in case you have to rebuild at a new location.

If you are sending your backups offsite the new-fashioned way, then you need lots of network bandwidth. Modern offsite backup entails copying your data to a remote archive over private lines or encrypted Internet links. In some cases, backups require so much bandwidth that they can take huge amounts of time to transfer offsite. Also, watch out for data restores. If you have to rebuild at a new location, make sure that there is sufficient bandwidth to pull down your backup files in time to meet requirements. If you have the resources, you can stand up a remote failover data center and stream backups to that site, so they’re immediately available in the event of an emergency.
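A quick back-of-the-envelope calculation shows why. The sketch below estimates transfer time from data size and link speed; the 80 percent efficiency factor, the 2 TB data set, and the 100 Mbps link are assumptions picked for illustration, so plug in your own numbers.

```python
def transfer_hours(data_gigabytes: float, link_megabits_per_second: float,
                   efficiency: float = 0.8) -> float:
    """Estimate hours needed to move the data over the link.

    efficiency is an assumed fraction of the link actually usable after
    protocol overhead and competing traffic.
    """
    bits = data_gigabytes * 8 * 1000**3
    usable_bits_per_second = link_megabits_per_second * 1000**2 * efficiency
    return bits / usable_bits_per_second / 3600


# Hypothetical case: restoring 2 TB over a 100 Mbps link.
print(round(transfer_hours(2000, 100), 1), "hours")
```

Under those assumptions, pulling the full backup down takes more than two days, which matters a great deal if the business expects to be running again in a few hours.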

What to Back Up

In addition to backing up key data, there should be backups of the software and configuration of supporting systems. The goal with backup is to be able to rebuild from scratch, which means starting with brand new servers (bare metal) and building from there. Your recovery procedures need to be written with that assumption in mind. A good place to begin is to enumerate and evaluate the business process that needs restoration, as the technical requirements will flow from there.

When responding to security incidents, you may need to take key systems offline for analysis. New systems should be able to be put in their place to keep the business up and running. The last thing you want to do in an incident is get into a fight with a business head about whether to keep a compromised system up in the middle of an investigation. Make your plan able to replace things quickly and as needed.

Lastly, I once had a boss who didn’t consider any backup to be real until it was tested. I’ve worked in other places where they’ve worked on faith that all their backups were going to work perfectly without testing. Being a man who lives by assume breach, you can guess that I prefer my old boss’s philosophy. Test your backups by attempting to restore from them. You should also have some idea how long it takes to restore a system from backup. I once worked 74 hours straight when I was a sysadmin because I was restoring a failed file server before the users came back to work on a Monday morning. The restore function on our backup library would fail after transferring data for a few hours, so I couldn’t run the complete restore in one shot. I had to babysit the tape drive and coax it through byte by byte, hour after hour. It was a very long weekend. Test your full restore procedures.
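One simple way to check a test restore is to compare checksums of the original files against the restored copies. Here is a minimal sketch, assuming both sets of files are reachable as directories; the function names are made up for this example, and backup products usually have their own verification features.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict:
    """Map each file's path relative to root to the SHA-256 of its contents."""
    # read_bytes() loads each file fully into memory; fine for a sketch,
    # but very large files should be hashed in chunks.
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def restore_mismatches(original: Path, restored: Path) -> list:
    """List files that are missing from, extra in, or different in the restored copy."""
    before, after = file_digests(original), file_digests(restored)
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Calling `restore_mismatches(Path("/data"), Path("/restore-test"))` (both paths hypothetical) should return an empty list after a clean restore. On a live system, files that changed after the backup ran will show up too, so compare against a snapshot taken at backup time if you can.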

Backup Policy

Here is a basic backup policy. In addition to this, you should have standards describing what should be backed up and when, along with written procedures for backup and restoration of data and for backup media management.

The IT Department will be responsible for performing adequate data backup for ORGANIZATION corporate resources, hosted production environments and the supporting infrastructure.

The IT Department will be responsible for maintaining documented operational processes for backup, restoration, and media handling.

The IT Department will be responsible for documenting a schedule and processes for data archiving, media rotation and proper media destruction.

Failover Systems

The IT department should also be responsible for building failover and redundant systems as necessary. This includes systems capable of taking over for failed or overloaded storage, bandwidth, or compute and memory. Sometimes these devices sit cold, requiring some effort and time to be brought online when needed. Sometimes they can be hot and ready to accept data at a moment’s notice. Some are already in line as part of a load-balanced solution, where workloads are spread evenly amongst them. In larger, more mature environments, entire secondary data centers and sites are available to mirror or take over in the event of a problem in the main location.

Failover and high-availability is a diverse topic and one I’m just skimming here. Where you need to be concerned is what failover capacity is available for which systems. You will be working with the IT team on the disaster and business continuity response plans, so you need to know what capabilities are in place. Even if you don’t do any disaster work, there are security incidents that effectively act like disasters and take down systems. A denial-of-service attack or virulent malware infection can easily overwhelm a data center. It would be good to know what your options are for failover and restoration in that event.

Business Continuity Planning

Business continuity is an area of specialization connected to IT security but not necessarily part of it. In smaller organizations, the head of IT security is also responsible for business continuity. In larger organizations, they are separate functions that still work together. It’s not uncommon to meet business continuity professionals who have different training, certifications, and backgrounds than IT security professionals. Nonetheless, business continuity often falls within security and some of its functions are audited in a security audit as well.

This chapter provides an overview of a business continuity plan, but it is not complete. There are many excellent guides to building a business continuity plan. A great one is the National Fire Protection Association’s Standard on Disaster/Emergency Management and Business Continuity Programs, which is available at www.nfpa.org/assets/files/AboutTheCodes/1600/1600-13-PDF.pdf .

One of the key elements of a business continuity plan is the business impact analysis. It gives you the set of risks that you need to respond to with the plan. Chapter 4 already covered everything you need to know (and more) to create a useful and comprehensive business impact analysis. In fact, if you used failure mode effect analysis, you already have specific disaster scenarios that you can construct response plans against.

Next is a policy defining the business continuity plan.

Sample Business Continuity Policy

ORGANIZATION will create, maintain, communicate, and test business continuity processes to mitigate unplanned interruptions and outages of critical system processes and networks.

Department heads will be responsible for writing and testing disaster recovery plans for their business units to maintain or restore critical functions in the event of a disaster. The security department will be responsible for providing information on potential disasters. The business units will be responsible for identifying critical business functions and defining alternative work procedures in the event of a loss of IT resources, facilities, or personnel.

The Head of IT will be responsible for technical operational responsibilities and duties related to maintaining or restoring systems in the event of a disaster. The Head of IT will designate and ensure adequate resources to support primary, secondary, and emergency roles for critical functions to ensure consistent and reliable coverage.

The IT Department and the Security department will share responsibility for maintaining a general disaster recovery plan for critical ORGANIZATION infrastructure and corporate resources.

Disaster Recovery plans will include the following information: locations, personnel, business processes, technical system inventory, impact analysis, recovery site information, disaster declaration procedure, recovery roles and responsibilities, recovery training plan and schedule, applicable service level agreements, contracts, and other records.

The ORGANIZATION will securely store the business continuity plan in a secure offsite location so that it can be easily located by authorized personnel in the event of a disaster.

During disasters and recovery operations, the ORGANIZATION will strive to maintain the same security objectives it has defined for normal operations.

The business continuity plan and disaster recovery plans will be reviewed at least annually and updated to reflect changes and new requirements in ORGANIZATION.

Expectations for Recovery

Regarding disasters that can take down entire business functions within an organization, what is the expectation from upper management? In the absence of information, management is likely to expect that everyone is just taking care of this, and that things will fail over if something happens. Since this is likely not the case, it is someone’s responsibility (probably yours) to inform them of the current recovery capability of the organization and the business impact implications.

From here, you can find out what management expects you to recover from and how fast. If they expect things to be running perfectly in the face of category five storms and massive denial-of-service attacks, then you need to explain what resources are needed. Is there any point where management will throw their hands up and say that after a large-scale disaster, company survival is up to the will of the gods? I have heard both responses. You can also factor in regulatory and customer contract requirements, as there are often business continuity service levels that need to be met. Find out what is expected before you begin the long and tedious process of building response plans.

When talking about business continuity and disaster recovery expectations, two key terms often come up—RTO and RPO, which are covered next.

RTO, or recovery time objective, is the maximum amount of time a system should take to come back online after a disaster takes it down. This is the running stopwatch on the recovery or failover efforts. It is the goal that you work from when building your plan. Different services and business functions can have different RTOs depending on need and resources available. Not meeting an RTO usually has consequences, especially if it is part of customer contractual requirements. Usually, the lower the RTO, the higher the cost to implement.

RPO, or recovery point objective, defines how much data you can afford to lose. Since backups are never going to be instantaneous, it is likely that when your IT systems go down, you will lose some data. Some RPOs are measured in minutes and some in days. It all depends on the criticality of data and resources available. For RPOs measured in minutes, usually data replication systems are needed to copy live data to backup systems as soon as possible. Like RTO, the lower the RPO, the higher the cost. In many cases, the price goes up exponentially as you approach lower and lower objectives.
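A simple way to sanity-check both objectives is to compare them against what your current backup setup can actually deliver. The sketch below assumes the worst-case data loss equals the time between backups and that you have measured a real restore; the function name and all the figures are hypothetical.

```python
from datetime import timedelta


def meets_objectives(backup_interval: timedelta, measured_restore_time: timedelta,
                     rpo: timedelta, rto: timedelta) -> dict:
    """Compare current capability against the stated objectives.

    Worst case, a failure hits just before the next backup runs, so the
    potential data loss is the full interval between backups.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": measured_restore_time <= rto,
    }


# Hypothetical system: nightly backups, a six-hour tested restore,
# and objectives of four hours (RPO) and eight hours (RTO).
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=6),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(result)
```

In this example the restore is fast enough, but nightly backups can lose up to a day of data against a four-hour RPO, so more frequent backups or replication would be needed.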

The business owners should also set expectations as to when they expect systems to be failed over and when they should be restored. IT should be given explicit information as to when a disaster is declared and when failover mode is triggered. Expectations as to when to restore from backup should also be defined. In some cases, this expectation can take the form of a particular person (or persons) making a formal disaster declaration.

Disaster Recovery Planning

So far, I’ve talked about business continuity and disaster recovery without defining the terms. Disaster recovery is a subset of business continuity. Business continuity is about the entire business response process to ensure that an organization keeps chugging along in the face of a disaster. Disaster recovery refers to the specific response plans for specific systems or business units. The business continuity plan is the big picture and the disaster recovery plan is the detailed technical procedures. Usually the bulk of disaster recovery efforts happen with the IT systems, since IT systems run nearly all of our organizations now.

As you can see in the policy, the design and execution of IT disaster recovery is owned by the IT department. They are in the best position to set up data backup, failover systems, redundant links, as well as test them. One element often overlooked in disaster recovery plans is key personnel continuity. What happens when a pandemic hits and the one database administrator who knows how to run everything is sick with the Ebola Gulf-A virus? An effective disaster recovery plan includes contingency plans for personnel. This may include hiring and training backup personnel or having contractors ready to go to take over functions. Ensuring personnel are safe and can work effectively during a disaster event is also an important factor.

Staging and having an assured source of equipment in the event of a major disaster should figure in a recovery plan. Even if you already have a replacement agreement with your suppliers, you do not want to find out during an emergency that you are not the highest priority for the limited available hardware.

One thing to remember is that unless your organization runs or has critical regional resources or assets, it’s likely that you will receive lower priority (or no) support from government emergency response in the event of a large regional disaster. I have heard the fire chief tell me that in the event of a large earthquake, he will drive his fire engines right by my collapsed building and wave as he heads to the nearby school. And that’s the way that it should be. So in a disaster, you can expect to be on your own for some time. Plan accordingly, with food and shelter-in-place capability. It’d also be helpful if some of your staff had some Red Cross training.

Business Continuity Plan

Overall, the business continuity plan is a big document. Also, if the building burns down, it doesn’t help if the plan burns with it. The plan needs to be available so that personnel can use it during a disaster. Also, it’s likely the plan contains confidential details about the organization and potential security weaknesses. This means the plan shouldn’t just be posted on a web site for all to peruse. It needs to be available and protected.

The business continuity plan should include responses for each of your identified risk scenarios as they affect the various business units. Sometimes the disaster recovery plans for each of these units are stored in the main plan and other times they can be found as separate accompanying documents. The most important thing is to have complete coverage of response plans and that the plans are relevant and understandable. Other key elements that the plan should include are:

  • Coordination and role definitions telling who’s in charge of what, and their designated backups

  • Activation instructions detailing how a disaster is declared and by whom

  • Notification defining how people will be called, checked on, and organized

  • Plans detailing what to do if people aren’t available and where to get additional help

  • Priorities telling which recoveries go first, in what order, and what a completed recovery looks like

  • How things go back to normal from disaster mode and how services fail back

How Security Is Maintained During the Disaster

During a disaster, when you’re executing the disaster recovery plans, what is the status of all of your security controls? Just because your organization is in trouble doesn’t mean the bad guys are going to lay off. In fact, they may be more inclined to attack since they know things are in chaos and you’re operating out of a recovery site. Sometimes they may even cause the disaster event to move your organization to a more vulnerable state. They might expect your recovery site and failover systems to have a lower level of security than usual. Personnel usually engaged with monitoring controls and locking down systems will be unavailable or otherwise engaged. This is a new risk that you need to raise with the ISMS committee and management.

Is it acceptable for security to be downgraded during an emergency event? If not, you need to identify resources and make plans to ensure that things stay locked down. This is one reason why secondary sites are often identical to primary sites in all aspects, including the same controls. If this is not feasible, you can look at focusing your efforts on the key systems that are recovered. Maybe you do not bring up all services in a disaster so you can ensure that the ones that are up remain strong against attack.

Incident Response Planning

Security incidents are what you’ve been working hard to avoid, but they are also inevitable. Your goal is to catch them early, with complete information, and contain the damage. A security incident can be as small as a user surfing pornographic web sites or as large as a group of cyber-criminals downloading your entire payment card database. Both kinds of incidents require a response, and the organization will look to you for leadership. That is why it is crucial for you to be the one who remains calm while pointing to an existing incident response plan and offering confident and reassuring advice on how to weather the storm. While it is also important that you be brought into the crisis as soon as possible, a good incident response plan should work without your direct involvement. The entire organization should be aware that any security problems or policy violations are to be relayed to the security team immediately so the plan can be activated. The best way to get that ball rolling is to have a policy.

Incident Response Policy

ORGANIZATION will maintain and communicate security incident management processes to ensure timely reporting, tracking, and analysis of unauthorized access to, or modification of, critical systems and data. All ORGANIZATION employees and contractors are required to report security incidents to the Security department.

Security incidents can include: unauthorized exposure of confidential data, unauthorized access to systems or data, unauthorized modification or shutdown of a security device, loss of system or media containing ORGANIZATION data, malware infections, unauthorized or unexplained access to systems or data, denial of service, threatening behavior, or violation of acceptable usage policies.

The Security department is responsible for maintaining and communicating an incident response plan that describes the response procedures, recovery procedures, evidence collection processes, technical responsibilities, law enforcement contact plans, and communication strategies.

If the breach involves data not owned by ORGANIZATION but entrusted to ORGANIZATION, then upon confirmation of the breach the ORGANIZATION will notify the affected parties as quickly as possible based on advice from legal counsel and law enforcement.

Executive Management and the Public Relations department will be responsible for contacting the affected third-party data owners and facilitating ongoing communication with them.

If a data breach involves internal data such as employee records, then Executive Management in conjunction with the Human Resource department will facilitate notification.

The Security department is responsible for facilitating a post-incident process to uncover lessons learned from the incident. Furthermore, the Security department is responsible for recommending modifications to the incident response plan and the security policy according to lessons learned.

Incident Response Plan

This chapter isn’t going to cover how to write a complete incident response plan, but you definitely need to write one. It needs to be specific and customized to your organization, its compliance requirements, and the culture. There are many guidelines to base your plan on. Here are three good resources:

Let’s go over some of the important pieces of an effective incident response plan. These are the high-level steps:

  1. Detect.

  2. Contain.

  3. Eradicate.

  4. Recover.

  5. Post mortem.

A Team Effort

Security incidents can have huge impacts that vary greatly based on how you respond. Therefore, security incidents are an all-hands-on-deck situation where you do not want to be working alone. In fact, a Super Friends approach works best, where you bring together the best and most powerful heroes of your organization to meet the challenge. Having pre-existing relationships with law enforcement will speed this process along as well. These individuals need to be ready to answer the call at any time, so a designated secondary should be identified for each role as well. Table 20-1 shows some common incident response team roles and their responsibilities.

Table 20-1. Roles During a Security Incident

Role | Individual | Duties
--- | --- | ---
Lead incident handler | CSO | In charge of the incident response team; directs other team members and is responsible for executing the plan
Incident recorder | Varies, but could be another security team member | Keeps track of the timeline of events, information known so far, pending tasks, and questions. Acts as secretary and project assistant, since the incident handler is busy
Executive representative | C-level officer | Makes executive decisions regarding the incident; provides authority to the team
Legal representative | Legal counsel | Provides legal advice regarding the incident
IT operations coordinator | Head of IT | Provides, coordinates, and leads technical resources in response efforts
HR representative | Head of HR | Facilitates employee communications; provides advice on internal personnel issues
Customer communications representative | Head of public relations or head of customer-facing business unit | Facilitates and helps craft two-way communication with customers regarding the incident

Within these roles and responsibilities, the team should meet on a regular basis, usually quarterly, to work out the specifics of these roles. There are certain things that you want to have already decided before an incident happens, such as the following:

  • Who has the authority to take a system offline? This includes live, customer-serving systems that could affect service level agreements or ongoing revenue.

  • Who will notify law enforcement and work with them?

  • How do we respond to ransom/blackmail demands from criminals?

  • Who will handle communication with customers (minor and major), third parties, business partners, vendors, and suppliers?

  • What message will we post publicly in the event of a breach? Whom can customers contact with questions, and how?

  • What message will we post internally in the event of a breach? Whom can employees contact with questions, and how?

  • How do we go about doing an immediate termination? What do we need to do if there is to be legal action? If so, is the legal action going to be civil or criminal or both? Will the terminated person need to sign something?

Communication Strategies

A number of the items to work out beforehand include communication plans. This also means you need to have all those critical outside contacts detailed within the plan. This includes names, organizations, and full contact information. You should have escalation paths worked out as well, so that if you can’t reach someone, you can go upstream to get help. The key contacts you want to have are:

  • Law enforcement contacts for several agencies, federal and local

  • Legal advice (if you need outside counsel), specializing in computer intrusions

  • Key vendors, including all of your ISPs, co-location providers, hosting companies, and cloud service providers

  • Security vendor contacts, in case signatures or controls need to be updated or patched

  • Key customers and third parties

  • Forensic investigators (either in-house or external)
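One way to keep escalation paths unambiguous is to store each contact together with the next person to try if they cannot be reached. The following is a minimal sketch in Python, not a prescribed format; the `Contact` class, the `escalation_chain` helper, and every name and phone number in it are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Contact:
    name: str
    organization: str
    phone: str
    escalate_to: Optional[str] = None  # key of the next contact to try, if any


def escalation_chain(contacts: Dict[str, Contact], start: str) -> List[Contact]:
    """Return the contacts to try, in order, beginning with `start`."""
    chain: List[Contact] = []
    seen = set()  # guards against an accidental loop in the contact data
    key: Optional[str] = start
    while key is not None and key not in seen:
        seen.add(key)
        contact = contacts[key]
        chain.append(contact)
        key = contact.escalate_to
    return chain


# Hypothetical entries; a real plan would hold full contact details.
CONTACTS = {
    "isp_noc": Contact("ISP NOC on-call", "Example ISP", "555-0100", "isp_manager"),
    "isp_manager": Contact("ISP account manager", "Example ISP", "555-0101", "isp_director"),
    "isp_director": Contact("ISP support director", "Example ISP", "555-0102"),
}

for person in escalation_chain(CONTACTS, "isp_noc"):
    print(person.name, person.phone)
```

Keeping the chain in the data rather than in people’s heads means the incident recorder can walk the list during a crisis without having to ask who comes next.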

Beyond who you are contacting, the message is also important. You do not want to be bickering with the team about the details of a notification to all of your customers in the heat of crisis. In major breaches, this initial notification is what hits the news. Regardless of the actual response, this is what outside analysts discuss regarding how competently the company is handling the crisis. You surely don’t want to go off half-cocked and send out a poorly worded message that you have to later take back. Have the team work on canned response messages for the major incident scenarios ahead of time. They can be customized as needed when the time comes. Messages would include things like:

  • “Sorry, we’ve had a breach and here’s what we’re doing about it…”

  • “Hackers are attacking our site so we’re going offline for a bit but we’ll be back…”

  • “All employees - we’ve had an incident and we’re working on it. In the meantime, everyone log off the system now…”

  • “All employees - we’ve had an incident but it’s over. Now, change your password…”

The legal implications of reporting breaches to customers are described later in this chapter.
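If the canned messages described above are stored as templates, customizing them during an incident becomes a matter of filling in a few details. Here is a small sketch using Python’s standard `string.Template`; the message wording, the `CANNED_MESSAGES` dictionary, and the `render_message` helper are illustrative assumptions, and your team would supply its own approved text.

```python
from string import Template

# Hypothetical pre-approved templates; the wording is illustrative only.
CANNED_MESSAGES = {
    "customer_breach": Template(
        "We have had a security incident affecting $affected. "
        "Here is what we are doing about it: $actions. "
        "Questions: $contact."
    ),
    "employee_logoff": Template(
        "All employees: we have had an incident and are working on it. "
        "Please log off $system now. Questions: $contact."
    ),
}


def render_message(kind: str, **details: str) -> str:
    """Fill in a canned message. Raises KeyError if a detail is missing,
    so a half-completed notification cannot be sent by accident."""
    return CANNED_MESSAGES[kind].substitute(**details)


print(render_message("employee_logoff", system="the order system", contact="ext. 5555"))
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a missing detail fails loudly instead of leaving a raw placeholder in a public message.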

Procedures for Common Scenarios

Your previously completed risk analysis and knowledge of what attacks are affecting your industry should give you a rough idea of the major threat scenarios that could affect your organization. Put that information to good use by preparing general response plans to guide the team if the incident happens. These plans can include checklists and scripts for IT responders, data gathering goals and procedures, key individuals to contact, and systems/services to activate or deactivate. You can even include instructions for critical controls to use during an incident. Table 20-2 shows some common general scenarios to get you started:

Table 20-2. Sample Incident Scenarios

Scenario | Checklists for
--- | ---
Denial of service | Working with Internet service providers, activating anti-DDoS firewall tools, contacting customers, contacting law enforcement
Malware infection | Gathering samples, performing rapid assessments, containing malware, determining if a system is clean or infected, obtaining new antivirus signatures and rolling them out, contacting users
Insider unauthorized access | Collecting log data and evidence, analyzing systems for data storage, shutting down external links to contain exfiltration, disabling a user, working with HR and legal
Inappropriate usage | Collecting log data and evidence from browsers and firewalls, disabling a user, working with HR and legal

Gathering Data

Knowing what data was accessed by intruders figures prominently in your notification response plan. According to most breach disclosure laws,² you are required to notify the affected parties if you have evidence that their information has been exposed to unauthorized individuals. If you cannot determine what data was leaked and what wasn’t, you may need to notify based on the assumption that all of it was leaked. This is why logging is so critical to response. Knowing that only 50 people had their credit cards stolen calls for a different response than assuming that a hundred thousand cards were leaked.

When IT admins are taking systems offline or seizing laptops, they need to be very careful not to destroy or corrupt potential evidence. Even if you later choose not to go to court, you should still capture as much pristine evidence as you can. You never know when a later lawsuit or related incident may occur. Seizing, handling, and analyzing digital evidence is a discipline all of its own. It’s also a discipline where a wrong move can cause an entire legal response to be invalidated or even backfire. There is insufficient information in this chapter to teach anyone how to do this properly. This is a case where, if you don’t know what you’re doing, you should hire or contract with an expert.

That said, the basic idea is to leave the system and data as untouched as possible. This means taking physical possession of the system or copying the data to read-only media (burn it to DVD). This evidence needs to be tracked (time/date stamped, who has custody) and locked up where no one can tamper with it. If you absolutely need to do analysis before expertise is available, you can work off copies from the original. Unless you are qualified or absolutely must capture the current state of memory, avoid doing forensics on the live affected machine.
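To make the tracking idea concrete, the sketch below records a cryptographic hash of an evidence copy alongside a timestamp and the custodian’s name, so later tampering with the copy can be detected. It is only an illustration under stated assumptions: the `EvidenceItem` and `CustodyEntry` classes are hypothetical, it works on a copied file rather than a live machine, and it is no substitute for proper forensic tooling or expert handling.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class CustodyEntry:
    timestamp: str   # UTC time the entry was made
    custodian: str   # who held the evidence at that point
    action: str      # what was done (copied, transferred, stored)
    sha256: str      # hash of the evidence at that moment


@dataclass
class EvidenceItem:
    path: Path
    log: List[CustodyEntry] = field(default_factory=list)

    def record(self, custodian: str, action: str) -> CustodyEntry:
        """Append a time-stamped custody entry with the current hash."""
        entry = CustodyEntry(
            timestamp=datetime.now(timezone.utc).isoformat(),
            custodian=custodian,
            action=action,
            sha256=sha256_of_file(self.path),
        )
        self.log.append(entry)
        return entry

    def is_unchanged(self) -> bool:
        """True if the file still matches the hash in the first log entry."""
        return bool(self.log) and sha256_of_file(self.path) == self.log[0].sha256
```

A typical use would be to call `record` when the copy is first made and again at every hand-off, then check `is_unchanged` before any analysis begins.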

Hunting and Fixing

Part of your response plans should include how systems and data can be protected during an incident. The plan should have information on how to move or segregate data stores from the active infected network. This may mean new internal firewall rules, temporary user rights revocations, and disconnecting systems from the network. The plan needs to include how to restore affected systems to their last known good state. Proper communication from the incident response team to the responders in the field is also crucial. Information about the incident should be passed along so that those in the field can properly defend systems as needed. Remember the Northwest Hospital malware infection, where systems were taken offline, cleaned, and then placed back on the wire—only to be reinfected.

A formal method for communicating this information has been developed, called an indicator of compromise, or IOC. An IOC is usually a signature of an intrusion that technical personnel can use to scan systems. Most IOCs are for malware and exploit-driven intrusions and can be shared amongst defenders online. For example, an IOC may be a digital hash of the malware, attacker source IP address, malicious domains, malware filenames, or known artifacts found on compromised systems. If the incident analysts can determine an IOC for an ongoing incident, it can be used to scan systems to see how many machines are infected. Depending on how detailed your available logs are, and how far back they go, you can also use IOCs to determine when an attack began and how it spread throughout the organization.
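As a rough illustration of both uses, the sketch below checks files against a set of known-bad SHA-256 hashes and finds the earliest log line that mentions an attacker IP address. The function names are hypothetical, and the code assumes the log lines are already in chronological order with whitespace-separated fields; real IOC formats and scanning tools are considerably richer than this.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Optional, Set


def find_files_matching_hashes(root: Path, bad_sha256: Set[str]) -> List[Path]:
    """Return files under `root` whose SHA-256 hash is a known-bad IOC.

    Reads each file fully into memory, which is fine for a sketch but
    not for very large files.
    """
    matches = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest in bad_sha256:
                matches.append(path)
    return matches


def first_log_line_with_ip(lines: Iterable[str], attacker_ip: str) -> Optional[str]:
    """Return the first log line containing the IP as a whole field.

    Assumes `lines` are in chronological order, so the first hit
    approximates when the attacker was first seen.
    """
    for line in lines:
        if attacker_ip in line.split():
            return line
    return None
```

Matching the IP as a whole field rather than as a substring avoids false hits such as `203.0.113.70` when searching for `203.0.113.7`.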

Legal Reporting Requirements

If your organization is in possession of other people’s personal information, then you likely have legal obligations to provide notification in a timely manner. If you are a service provider in a business-to-business (B2B) arrangement, you may be in possession of personal information not shared directly with you, but with your partner/client. This means you probably have contractual obligations with those customers to notify them so that they can work directly with their end consumers.

Having complete information is critical when you need to do the notification. Most breach disclosure laws have definitions that tie disclosure to a “reasonable belief” that “personal information” was “acquired by an unauthorized person.” Unfortunately, attackers work diligently to hide their actions, making the determination of the extent and timeline for a breach difficult. Nevertheless, you are on the hook for notification. When you are working with your legal team and law enforcement on how and when to notify, there is some key information you need to have:

  • Exact data elements that have been compromised

    Does the data constitute PII? Customers also want to know what was leaked.

  • Exact format of the data

    Is it encrypted, obfuscated, de-identified, or obscured in an odd format?

  • Likely identity and motivation of the attacker

    Will the information be used for fraud?

  • How the data was compromised

    Was it copied, viewed, modified, or physically lost?

  • How the incident has been mitigated

    Do we expect more breaches or are we sure it’s all over? How sure? Why?

Remember that combinations of what is considered personal information create the obligation to notify. That includes names in combination with social security numbers, government ID numbers, financial and credit card numbers, and passwords.

All in all, it’s in everyone’s best interest to act like a responsible victim and be as transparent and clear as possible. Being transparent doesn’t mean starting a publicity campaign with incorrect or insufficient information. It means sharing what you know and don’t know with the right persons. Sometimes the right persons are law enforcement and regulators, not the general public.

A good general guideline is available from the California Office of the Attorney General at https://oag.ca.gov/sites/all/files/agweb/pdfs/privacy/recom_breach_prac.pdf.

Working with Law Enforcement

Many security incidents do involve the violation of the law, which means you can contact the police for help. In some cases, law enforcement involvement can override notification timeline requirements as going public could jeopardize an ongoing investigation. Law enforcement can sometimes offer useful incident response advice as they may have responded to similar cases and have an idea about outcomes and magnitudes.

One thing that you need to do before contacting law enforcement is get permission from organizational leadership. Unless you personally are the victim of a cyber-crime, you should not be speaking on behalf of your organization without authorization. The executive leadership represents the organization and is the best one to decide whether to report and when. There are, however, some crimes, like child pornography, where failing to report is itself a crime.

Sadly, some executives are reluctant to report cyber-crimes for fear of bad publicity. Some even wonder if law enforcement can do anything to help the company once the damage has been done. In some cases, they can. If a perpetrator is successfully prosecuted, reparations can be identified as part of the judgement. Successful prosecutions also open the way for civil damage lawsuits. Reporting cyber-crimes helps make the whole Internet community safer, too. Even if law enforcement can’t prosecute on your organization’s particular incident, they will keep your information for a later investigation. There have been quite a few major cases where the perpetrator was not identified until years later. When that happens, your case can figure into the overall damages and sentence for the criminal. If nothing else, reporting cyber-crime helps law enforcement build threat intelligence and encourages future investigations and community warnings. Conversely, cyber-crime being allowed to flourish without consequences creates an incentive for epidemics of new attacks, like the ransomware epidemic.

There are many different law enforcement agencies that you can work with, depending on the crime. By being active in security communities like InfraGard, you can establish law enforcement relationships beforehand. This helps speed up their response during an incident as you already know and trust each other. In general, the FBI is the primary agency to contact. They can redirect you to various other agencies as needed depending on the nature of the crime.

Human Side of Incident Response

This is a hard job, and a lot of what’s tough about it isn’t taught in books. Incident response can get ugly, especially when dealing with insiders. As a responder, you may end up having to dig through people’s browser history, e-mail boxes, and chat logs. You may learn things about people you know that will make you see them in a new light. You may discover that people you thought were trustworthy have a dark side or unethical motivations. Even if the data you uncover doesn’t lead you to a malicious incident, you may still find shameful or personal secrets. It’s up to your professional integrity and discretion to keep these matters to yourself or limit the scope of your investigation.

Another difficult aspect of incident response is that, as a consequence of it, people can get fired, get sued, or even go to prison. The things they did may become public because of legal action and there may be personal repercussions for them. It is normal to feel guilty, as though you were the cause of these consequences, even though you know you weren’t. It helps to talk to someone about it. This can be other security professionals, friends or family, or even a mental health professional. Sometimes it’s better to feel a little bad about the fate of wrongdoers than it is to be indifferent or rejoice in their misfortune.

After Action Analysis

When things go wrong and you’ve finally made it through to the other side, the last thing people want to do is rehash the events. However, reviewing the event and how the teams reacted is a vital learning opportunity. In some ways, examining what went wrong also gives the organization a sense of closure and, in the case of a control failure, the confidence that things won’t repeat themselves. Any major outage or security incident should be followed up by after action analysis. More mature organizations also do analysis after near miss events where the disaster or incident didn’t really occur but came very close. Those too can yield a lot of valuable data.

The response team, whether it’s the business continuity team or the incident response team, should all be present for after action analysis. It should be a brainstorming session, where you are looking for causes and effects, not assigning blame. Remember that by definition, whatever caused the interruption event was beyond your organization’s normal capability to respond. Also don’t forget the assume breach principle. People are going to be pushed to their limit, technology isn’t going to work right, and communications are going to break down. You know that there are people out there who have made a career out of breaking security systems. When you examine things carefully, you’re likely to find that many of the real causes are systems failures that run much deeper than a particular individual, team, or technology.

Root Cause Analysis

The goal of getting to the real cause of the incident is to fix whatever went wrong so it doesn’t happen again. It’s likely that the true cause might go much deeper than something as simple as “there was a blackout.”

One straightforward but effective technique is called the 5 Whys. You simply start at the beginning of the event and walk backward, asking why something happened. You do this at least five times, but you can go deeper. As you can see from the following example, a number of interesting things turn up that are ripe for fixing.

  • The web farm crashed…. Why?

  • The load balancers started flip-flopping…. Why?

  • The high-availability connection had an error. Both systems thought the other was down…. Why?

  • The cable connecting the units got crimped…. Why?

  • Someone was in the server room and closed the cage door on the cable…. Why?

But don’t stop now, let’s keep going.

  6. The cabling in the back of the racks is a huge rat’s nest…. Why?

  7. Everyone ignores cable management…. Why?

  8. We’re busy and cable management isn’t high priority…. Why?

So here we’re getting closer to the bigger problem. What other IT hygiene and maintenance tasks are being skipped over in favor of more expedient work? If the response committee’s report can convince management, perhaps some lower-level resources (like interns) could help IT? Or maybe management is willing to accept the risk of future outages based on the current workload? In any case, you’ve uncovered a bigger potential problem that can go into your future risk analysis reports.

Another type of root cause analysis is the Ishikawa Fishbone, which looks at the interactions of many different possible causes. You draw a cause-and-effect diagram with spokes for different possible cause categories such as people, processes, technology, information, and environment. You can learn more about it at http://asq.org/learn-about-quality/cause-analysis-tools/overview/fishbone.html.

Executive Summary

Once the analysis is complete, the results should be written up. Not only do stakeholders and customers often require these reports, auditors like to see them as well. Naturally, you should use this report to justify future control and process improvements or lacking that, identify new risks. The report should include the following:

  • Executive summary with a few paragraphs at most describing the event and the response

  • A list of strengths listing what went right, worked well, and was effective

  • An analysis of response metrics, such as:

    • Time to detect

    • Time to respond once detected

    • Time to containment (if applicable)

    • Time to recovery (vs. RTO)

    • Recovery coverage (vs. RPO)

    • Time to resume

  • Areas for improvement

  • Recommendations for next steps
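The response metrics listed above can be computed directly from the timeline kept by the incident recorder. The following is a minimal sketch; the `IncidentTimeline` fields and the exact definition of each interval (for example, measuring time to recovery from the start of the incident and comparing it to the RTO) are assumptions that you should align with your own plan’s definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    started: datetime         # when the incident or outage actually began
    detected: datetime        # when it was first noticed
    response_began: datetime  # when the response team was activated
    contained: datetime       # when the damage stopped spreading
    recovered: datetime       # when services were restored


def response_metrics(t: IncidentTimeline, rto: timedelta) -> Dict[str, object]:
    """Compute basic response intervals and compare recovery time to the RTO."""
    time_to_recovery = t.recovered - t.started
    return {
        "time_to_detect": t.detected - t.started,
        "time_to_respond": t.response_began - t.detected,
        "time_to_containment": t.contained - t.detected,
        "time_to_recovery": time_to_recovery,
        "met_rto": time_to_recovery <= rto,
    }


# Hypothetical timeline for illustration only.
timeline = IncidentTimeline(
    started=datetime(2016, 5, 1, 8, 0),
    detected=datetime(2016, 5, 1, 9, 30),
    response_began=datetime(2016, 5, 1, 9, 45),
    contained=datetime(2016, 5, 1, 11, 0),
    recovered=datetime(2016, 5, 1, 14, 0),
)
for name, value in response_metrics(timeline, rto=timedelta(hours=4)).items():
    print(name, value)
```

Tracking the same intervals across incidents and practice exercises gives the report a trend to show, not just a single data point.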

Here are a few sample after action reports for some major disasters.

Practicing

You have already heard that how an organization responds to an incident or disaster is a crucial factor in its outcome. If response is so important, then it’s a good idea to practice. Not only does this shake out the bugs in the plan, but it also gives the team the chance to work together and gel. In some regulatory environments, like HIPAA, testing your incident response plan is required.

There are many IT professionals and executives who don’t understand the assume breach concept and may resist training exercises. Their expectation is that a breach will never happen because the security team will keep them safe. Explain to them that preparation for a breach is part of your defense. Having strong incident response can make the difference between a minor problem and a major breach. This work is as vital as any other work you do to defend the organization. It only takes a few hours once a year (or more), and it can be fun.³

There are lots of different exercises and tests that can be done to practice response plans. Some can involve just the response committee, some can include just key stakeholders, and some can engage the entire organization. Practice runs can include:

  • Walk through

    Everyone involved in a response scenario simply reads their steps out loud from the plan, describing the actions they would take. This is a good exercise for a new plan and/or a new team.

  • Tabletop exercise

    Much like the role-playing games that I enjoyed in college, this is a dice and paper simulation of an event scenario. A moderator creates and runs a session, describing events and changing conditions, and the participants react and verbalize their responses. The moderator can even assign probability to the success of certain actions, using dice to determine the outcomes. These can even be held in remote conference sessions to make it easier for all participants to attend.

  • Simulations

    This is a functional run through of a scenario complete with participants doing as much real work in response as they can. Failover to remote locations may actually be tested. If a response plan says that someone needs to call an engineer and have her run a program, then an actual phone call is made to that engineer and a simulated program is run. Even fake news stories and panicked distress calls can be made to the response center to fully immerse people. I’ve participated in regional disaster simulations where some participants were irradiated and were escorted out of the practice room for decontamination (made to wait in a nearby conference room).
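For tabletop sessions held remotely, the moderator’s dice can be replaced by a few lines of code. This is a small sketch of one possible convention, a percentile roll where an action succeeds if the roll is at or below the assigned probability; the `action_succeeds` function is hypothetical and any dice scheme the moderator prefers would work just as well.

```python
import random


def action_succeeds(success_percent: int, rng: random.Random) -> bool:
    """Roll a percentile die (1-100); succeed if the roll is <= success_percent."""
    if not 0 <= success_percent <= 100:
        raise ValueError("success_percent must be between 0 and 100")
    roll = rng.randint(1, 100)
    return roll <= success_percent


# Example: the moderator rules that restoring from backup has a 70% chance.
rng = random.Random()  # pass a seed, e.g. random.Random(42), for a repeatable session
if action_succeeds(70, rng):
    print("The restore works.")
else:
    print("The restore fails; what does the team do next?")
```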

The first place you should look for response scenarios is your risk analysis. The top risk scenarios identified are excellent candidates for practice as by definition, they’re likely to affect your organization the most. As the saying goes, never let a good crisis go to waste. Scenarios can also come from past disasters, incidents, and near misses. You have the after action report data to help you define the scenario. It’s also a chance to assure everyone that if it happens again, you’re ready.

As with actual incidents, you should write after action reports for practice exercises as well. Stakeholders and auditors will be looking for them, and following up on the suggested next steps. How often should you do these practices? At least once a year is standard, but you can do more if you need it. The saying goes, “Amateurs practice until they get it right. Professionals practice until they can’t get it wrong.”
