Chapter 3: Engineering for Incident Response

In the previous two chapters, we discussed security operations and incident response and looked at some of the key elements that come into play in incident response, such as the incident response cycle and the kill chain. We have also argued, albeit somewhat loosely, that agile is the best approach for both security operations and incident response. In this chapter, the aim is to tighten up that argument and develop an agile framework in more detail, as well as outline what relationships exist between existing agile approaches and agile security operations.

In this chapter, we will discuss the engineering aspects of incident response, from the viewpoint that incident response is a continuing operational activity that defines agile security operations.

We will primarily build on the incident response loop to develop an agile framework for security operations and discuss some of the engineering aspects. This will be the final chapter that builds the framework for agile security operations, and the focus will be both on the agile security operations process and how the tooling needs to change because of that process.

The key to understanding the agile extension of the incident loop is the realization that it is important, especially if we have assumed compromise, that the incident loop does not become a hamster wheel to nowhere, but instead facilitates the improvement of cyber defenses. The underlying idea is that actual attacks on your organization, once detected and handled properly, are the best guides to the threats faced by your organization, and hence also a robust guide to strategy. Threats, in turn, translate into risks, in a way we have discussed in Chapter 1, How Security Operations Are Changing. Hence, the data you gain from actual intrusions is too valuable to treat sloppily.

The agile method for security operations developed in this chapter, based as it is on specific engineering for incident response, focuses on how we can get the most out of an incident and use it to improve detection, analysis, containment, and eradication.

We will use the Mitre ATT&CK and Shield frameworks and integrate this understanding into an overall picture of defensible infrastructure and applications.

In this chapter, we will cover the following topics:

  • From incident response to agile security operations engineering
  • A brief discussion of agile frameworks
  • Key activities in agile operations
  • Tooling – defend to respond

From incident response to agile security operations engineering

In this section, we will develop the framework for agile security operations engineering. The framework for agile security operations engineering is based on the incident loop we discussed in Chapter 2, Incident Response – A Key Capability in Security Operations, and develops the details as well as the consequences for tooling and engineering practices.

The key to going from incidents to incident response to security operations is to close the loop with a set of specific practices that ensure that organizations make the most of their incidents – when preventing, detecting, responding, and continuously managing their security posture and program.

The process starts with mapping the incident loop from the previous chapter and then defining the activities that turn the loop from an incident into a cybersecurity program.

It's not agile

This chapter develops an agile approach to security operations that is agile in the sense that it rapidly adapts to and anticipates fast-moving environments and situations. It is not agile in the sense that it strictly adheres to a specific underlying methodology that is commonly used in agile software development, such as Scrum or Kanban, although it learns from and takes the best of these frameworks to target the four objectives of incident response we mentioned in the previous chapters.

Therefore, some may argue that the framework in this chapter is not agile. Viewed from the perspective of strict adherence to frameworks, it isn't. But in terms of developing a tactical framework in which security teams can defend their organizations during cyber attacks while enabling the flexibility, effectiveness, and capability to respond to rapidly changing situations during attacks, it is agile.

A good website to consult while reading this chapter is the Agile Alliance (https://www.agilealliance.org/), which contains resources and a glossary of the most important terms. Keep in mind that in this book, we'll repurpose the agile approach to suit security operations.

The best security teams have already operated on agile principles for a while. The framework developed in this chapter loosely derives from a talk by Frode Hommedal at the First.org TC in Oslo, 2015 that, to my knowledge, is no longer available on the internet.

The key step in that process is closing the incident loop, which is an abbreviated way of saying that a structured set of activities must exist that enable an organization to learn from an incident and improve its detection and response capabilities.

Such an approach has significant advantages. In many cases, incidents are Pareto optimizable, which means that the large number of attacks that organizations experience fall into a small number of categories. Accelerating the response for the categories of the most frequent attacks and incidents will pay significant dividends in reducing the workload of the security team.

Mapping the incident loop

In Chapter 2, Incident Response – A Key Capability in Security Operations, we discussed the incident response loop and why an agile approach in incidents is needed. The incident loop consists of six phases: preparation, identification (which we broke down into detection and analysis), containment, eradication, recovery, and lessons learned.

The problem with these loop approaches is that they can become a hamster wheel of pain. Lessons learned as the last step does not necessarily imply that security posture, defense capability, and response agility will improve. Hence, just having a lessons learned step to close the incident loop is insufficient to guarantee that we will be better off.

Feedback – closing the incident loop

If just lessons learned is not the optimal model for how we close the incident loop, we need to specify what is (in other words, what needs to change because of the incident). Agile security operations look for the specific patterns that allow organizations to not only learn from but also improve upon how incidents are handled. In addition, organizations learn about important aspects of the motivation of attackers and can use this to update their threat intelligence.

Closing the incident loop in agile security operations entails three specific requirements:

  • Improving the prevention measures against the type of attack that was just experienced, such as by improving the architecture, hardening the existing infrastructure, vulnerability management, and either disabling unused and aging infrastructure or mitigating the risk around it
  • Improving detection of attacks like the one we just experienced, such as by using detection engineering or replacing poor-quality detection software with better alternatives
  • Improving responses to the attack that we just experienced by automating and streamlining the response and communication processes, to optimize the quality and speed of the response.

These three improvements require structured learning out of our incidents, especially around attacker tactics, techniques, and procedures (TTPs).

TTPs are aligned with the business model of an attacker and outline how an attacker drives their actions for objectives. Improvements in prevention, detection, and response should be driven by TTPs rather than low-value indicators such as IOCs, file hashes, DNS names, or similar attributes. Attackers change these quite frequently, whereas the business model of an attacker – how they aim to achieve their objectives – is much harder to change.

Sometimes, this hierarchy of data is referred to as the pyramid of pain, a term introduced in a blog post by David Bianco: http://detect-respond.blogspot.com/2013/03/the-pyramid-of-pain.html. The idea of the pyramid of pain is that a robust understanding of the business model of an attacker and the behavior of their tools on your network leads you to more effective use of indicators.

Example – APTs and Incident Response

Around 2015-2018, the term Advanced Persistent Threat (APT) was used a lot in marketing material, where the customer was led to focus on the advanced aspect of the APT, whereas the component that needs to be focused on is persistent. Since APTs are persistent, prevention, detection, and response methods must be improved after attacks using information gleaned from past attacks.

Dealing with adversaries often benefits greatly from the ability to develop specific TTPs about an attack, and hence on the robust process of closing the incident loop, as we will advocate in this chapter.

As an example, attackers often work in the time zone of their residence, or with keyboard layouts matching their country of origin. This can be a useful TTP. Similarly, some attackers like to register most of their DNS names with the same registrar or use specific top-level domain (TLD) names for their activities. Another useful TTP is tooling or details around the initial attack vector, such as the mail domains that initial emails are sent from, if the initial intrusion vector is email.

With attacks becoming more businesslike in terms of their objectives, attackers also operate more like businesses.

The businesslike weaknesses of attackers

It is a somewhat common misconception that attackers are either loners or highly organized state-sponsored actors. The cybercrime scene has evolved into a highly specialized set of businesses, with malware developers, exploit traders, people handling customer interactions, and monetization all forming an organized cybercriminal ecosystem that operates much like a modern business. This leads to specific weaknesses of cyber-attackers, such as reliance on standardized business models and a limited set of tooling providers, adherence to common working hours, cost-optimized business models that rely on the repeatability (and thus predictability) of processes and procedures, and an ongoing need for revenue.

Research done by the endpoint security firm CrowdStrike for their Global Threat Report 2021 shows an interconnected web of malware families and actors that have defined economic relationships.

CrowdStrike divides the ecosystem into services, such as malware development and packing, credit card testing, development of phish kits, loaders and packers, and infrastructure hosting; distribution, such as spamming and social media; and monetization, which focuses on ransom negotiations, muling, money laundering, and cryptocurrency services.

Understanding the business model of an attacker allows security teams to devise defenses that break that model. This dynamic will be discussed in greater depth in Chapter 8, Red, Blue, and Purple Teaming.

A brief discussion of agile frameworks

In Chapter 2, Incident Response – A Key Capability in Security Operations, we defined the beginnings of an agile framework for security operations. It is now time to build that framework out to a fully fledged capability framework that implements the necessary components of agile security operations.

Agile, as a method, is primarily used in software development, and it can use several different underlying frameworks to enable the agile development of software. In this section, we aim to briefly review them and develop an idea of their possible use in incident response and security operations. The frameworks we will cover have been developed with a view to software and product development, not for incident response. However, they do contain useful information for an incident response process.

Let's briefly look at the main ones before developing a framework that is ideally suited to incident response and security operations.

Lean

The Lean method is closely related to the Kanban method, which we will discuss next. In Lean, teams focus on visualizing the workflow and minimizing waste. Waste is defined as, for example, time spent on any product for which there is no need, or time spent on rework, task switching, waiting for others, unnecessary features, and handoffs.

Lean minimizes waste by rearchitecting the development process away from large upfront designs, instead optimizing learning through experimentation and aiming to make consequential design commitments as late as possible (also known as keeping your options open).

Another key concept from Lean is the minimum viable product (MVP), which we can define as a version of the product that gives the team the maximum amount of learning about the customer's interaction with a product with the least effort. For instance, an MVP may consist of a landing page with mockup reports or rely on a manual process behind the scenes.

The approaches stressed by Lean certainly resonate in the realm of security operations. A Lean approach to defense stresses the rapid development of many different defenses with rapid feedback loops, which is an approach that is useful during the detection, analysis, and containment phases of an incident.

Interlude – Why having many detections working together is better

Imagine having the following choice: you can implement a super-magic security tool that is 99% effective against any attack (that is, it alerts on 99 out of 100 attacks), or you can implement two tools that are both 95% effective (that is, they alert on 95 out of 100 attacks). Which is better?

If you put the two tools in parallel, the tools work together, but independently of each other, on a data feed. Each of them gives a verdict on whether malicious activity is occurring in this feed. If you, as the operator, raise an alert each time one of the tools detects something as malicious, the probability that both tools miss an attack is 0.05 × 0.05 = 0.0025, so the overall effectiveness is 1 minus 0.0025, or 99.75%. This arrangement is attractive because a tool with 95% effectiveness is usually orders of magnitude cheaper than a single tool with high effectiveness, and two of these tools will still be cheaper than a single highly effective tool.

This is a crude calculation that relies on several assumptions, but it gives you the general picture.
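
A minimal sketch of this calculation, assuming the tools alert independently of each other (real-world detections only approximate this):

    # Toy illustration of layered, independent detections.
    # Assumes each tool's miss probability is independent of the others'.
    def combined_effectiveness(per_tool_effectiveness):
        """Probability that at least one tool alerts on an attack."""
        miss_probability = 1.0
        for effectiveness in per_tool_effectiveness:
            miss_probability *= (1.0 - effectiveness)  # all tools must miss
        return 1.0 - miss_probability

    print(combined_effectiveness([0.99]))        # 0.99   - one expensive tool
    print(combined_effectiveness([0.95, 0.95]))  # 0.9975 - two cheaper tools in parallel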

Kanban

The Kanban philosophy is all about managing the flow, limiting the work in progress (WIP), and visualizing the progress of the work. The philosophy of Kanban is to minimize work in progress and multitasking, which slows down the flow and leads to waste, both in terms of time and resources. All the work is mapped on a Kanban board under specific columns.

Kanban also introduces the concept of pull, meaning that teams pull in work according to their capacity, rather than it being pushed to them via a ticketing system, production belt, or supervisor. This notion is important, as we will see later, and radically new for most security teams, who get their incidents via an ever-increasing alert stream that threatens to overwhelm the team.

One of the elements of Kanban is that teams can develop a definition of done, which is a shared understanding of the expectations that software or a product must meet before the team considers its work on a feature finished. A definition of done does not always have to mean that a feature is complete.

Scrum

Scrum is a structured method for operating software and product development teams that optimizes for speed of development and feedback on earlier releases. The rapid feedback that is emphasized by the Scrum process is vital for good incident response.

Scrum is highly structured and prescriptive, and primarily a framework for managers of teams. Scrum focuses on a limited time horizon and defines its cadence in terms of predefined sprints, during which the team goes through rituals that are also extensively documented and prescribed.

Scrum also uses the notion of team pull rather than push and relies on teams themselves defining which requests from the backlog will be pulled by the team for each sprint.

Some of the key Scrum artifacts that will be useful for this chapter are as follows:

  • User story: The user story is a description of a small piece of functionality that an end user needs in a piece of software. It is usually captured in the form of a sentence: As a user of <application> I need to do <something> so that <result or objective of the improvement or functionality>. After creation, the user story is put on the backlog.
  • Task: Tasks can be related to user stories (such as implement this functionality) but can also be related to things the team needs, such as new development tools, environments, or security reviews. All the tasks go on the backlog.
  • Backlog: A list of all the work that needs to be done by the team.
  • Sprint: A defined period in which a team seeks to complete several tasks, which are pulled by the team from the backlog.
  • Sprint backlog: The tasks pulled by the team from the backlog, to be completed in the current sprint.
  • Increment: A potentially shippable new version of the product or software that contains the functionality related to the user stories that have been pulled in the most recent sprint.
  • Sprint review: The sprint review examines the output of the sprint in terms of the product and is also used to collaborate with the product owners on what to do next.
  • Sprint retrospective: The retrospective focuses on the individual interactions, communication, processes, and tools used during the sprint, with the aim of improving the team's effectiveness.

Perhaps not that surprisingly, Scrum as a framework is less suitable for incident response for several reasons:

  • Scrum (along with the other methodologies) focuses on product and software development, not on responding to incidents.
  • Scrum is heavily structured and lacks some of the flexibility required to adapt to attacker behavior. In general, during an incident, a team does not get the opportunity to define a sprint; the cadence is set by events and is outside of the control of the responder.
  • Scrum is driven by the cadence of specific sprints that are set in advance. In an incident response process, we do not usually get to choose the cadence, but it is driven by both the attacker and the cadence of detection and analysis.

    Scrum

    The www.scrum.org website contains a lot of useful information about the Scrum framework and its key components.

That said, as we will see later, the Scrum framework contains a lot that can be repurposed in the context of incident response, and its artifacts in particular can be adapted for the process of agile security operations.

In addition to these three frameworks, many other specific frameworks fit with the agile manifesto and are used in various organizations to good effect. For this book, however, there is little value in providing detailed evaluations of all that is possible.

None of the methods used for software and product development translates immediately into what is needed for agile security operations.

The following table maps out some of these models and what the requirements for agile security operations are:

Table 3.1 – Evaluation of the available Agile methodologies

Agile security operations are not necessarily committed to any of the Lean, Kanban, or Scrum models. Instead, we will focus on the specific process of incident response, draw on some of the agile practices from the methods we discussed previously, and then design an agile security operations process that is based on closing the incident loop, ensuring that an organization can improve its security posture based on a deep understanding of the incidents that have occurred.

Agile security operations

The purpose of agile security operations is to provide processes and procedures that ensure that the effectiveness of the organization during the incident loop is monitored and optimized. The critical difference between agile and agile security operations is that in agile software methodologies, we are dealing with code and environments, and, most importantly, a product. In security operations, we are dealing with attackers who have a specific set of objectives, capabilities, and tools that are being used in an adversarial context. Despite the use of the word agile in both cases, this makes agile security operations qualitatively different from agile software development.

Agile security operations have requirements along the following dimensions:

  • Philosophy: The key philosophy of agile security operations is one of assumed compromise. We assume not only that adversaries exist outside our network, but also that they already have a foothold. All that is necessary is for us to discover them.
  • Cadence: Attacks follow a cadence that depends on the attacker and their speed of movement. Advanced Persistent Threat (APT) attackers are persistent, with attacks following a businesslike production schedule for compromise. Cadence is also influenced by the lateral movement of the attackers.
  • Key metrics: The key metrics of the agile security operations process are related to whether an organization can successfully handle a cyber attack before the attackers reach their objectives, and how efficiently they then evict those attackers, as well as become more successful at dealing with subsequent attacks from the same group.
  • Methodology: Agile security operations rely on limiting the number of alerts that analysts are working on, the transparency of what they are doing, and rapidly responding to important intrusions. This is, of course, the primary role of a SIEM, but agile security operations take this a step further by improving the processes by means of which alerts are handled by using automation and orchestration for standard alerts, leaving time for the events and incidents that require the most attention because they pose the largest threat and hence the most risk.

The following diagram provides a high-level overview of the incident loop, optimized for agile security operations:

Figure 3.1 – The Agile incident loop that drives security operations. The loop on the right is the incident response hamster wheel.

One way to think about this diagram is that it expands the prepare and lessons learned stages of the incident response loop. Specifically, agile security operations define a method for structured learning and improvement from cyber incidents that closes the incident loop, in ways that put an organization in a better position post-breach than it was pre-breach.

Principles for security operations

Agile methodologies such as Kanban and Scrum give a lot of autonomy and authority to teams to determine what problems to work on and how fast. Hence, it is key that teams work from a robust set of principles to ensure that they use the autonomy and authority with integrity and to a high standard. In these methodologies, management places a high amount of trust in individual teams.

One of the best books on operative principles in business and life is Ray Dalio's book Principles, together with the website https://www.principles.com, where you can also find additional materials related to the book.

I have collected some of the key principles for security operations in my book Principles for Cybersecurity Operations, (2020), which you may read as a companion volume to this one. It outlines the principles that I have found useful in managing security teams and resolving incidents. These principles are summarized in the Appendix section.

Key activities in agile security operations

In this section, we will focus on the key activities in agile security operations and the approaches and tooling they require. How we specify the activities in the agile incident loop will determine what tooling a security team requires.

This section will only contain the summaries of the relevant activities. More details will be provided in Sections 2 and 3 of this book, especially in Chapter 6, Active Defense.

Breach

The philosophy of agile security operations is that a breach is eternal, and that incident response is an ongoing process. With that, it is important to realize that not all breaches are created equal. Breaches follow a cadence that is primarily determined by the attacker as they move through our environment. The cadence itself is determined by the kill chain, which was discussed in the previous chapter. Our visibility of that cadence is limited by our capability to observe this movement.

Determinants of a cyber attack

In 2014, two researchers from the University of Michigan, Robert Axelrod and Rumen Iliev, studied the tactical factors that determine the timing of a cyber conflict, given that nation states have a stockpile of unused zero-day exploits. Axelrod and Iliev found that the timing of a digital attack can be modeled economically using the following information:

  • Persistence: How long will my target remain exposed to the vulnerability I'm planning to exploit? Sooner or later, my target will patch or apply workarounds to their vulnerabilities, making my exploit useless.
  • Stealth: Can my target detect this? A zero-day is not very useful if a target can detect its use.
  • Threshold: What conditions would lead to the use of a cyber attack vector? Is it worth using on this target? Attackers hate resource waste – in this case, expensive zero-days – as much as defenders do and will generally try to use a tool that just does the job, rather than something that is overkill.

See Axelrod and Iliev, Timing of cyber conflict, PNAS 111(4), 1298-1303 (2014); https://doi.org/10.1073/pnas.1322638111.
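
The sketch below is not the Axelrod-Iliev model itself, merely a toy illustration of why persistence and stealth matter to an attacker holding a zero-day: the longer an exploit is likely to survive patching and detection, the more valuable it is to hold back for a higher-stakes target.

    # Toy sketch (not the published model): expected number of future uses of a
    # zero-day, if each use survives patching with probability `persistence`
    # and avoids detection with probability `stealth`.
    def expected_future_uses(persistence, stealth, horizon=20):
        survive = persistence * stealth  # probability the exploit survives one more period
        return sum(survive ** t for t in range(1, horizon + 1))

    # A stealthy, long-lived exploit is worth saving for a high-value target...
    print(expected_future_uses(persistence=0.9, stealth=0.9))  # ~4.2
    # ...while a noisy, short-lived one is "use it or lose it".
    print(expected_future_uses(persistence=0.5, stealth=0.5))  # ~0.33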

Detect

The key problem for many security teams is that the number of alerts overwhelms the available resources in the team. This is the well-known phenomenon of alert fatigue.

Many teams reach for automation to lower their workload. The problem with using automation is that it can deal with standard detection and incidents but may fail to cope with tasks that require a non-standard, highly contextual, deep understanding of sophisticated (and sometimes even not so sophisticated) attacks. Automation may even lead to an ever-increasing list of alerts if it is not configured well; that is, if the automation is geared toward creating even more alerts.

Machine learning may help here, but its effect is somewhat unclear. At best, it adds more high-fidelity detections. Ideally, the introduction of machine learning should replace some of the existing detections with better ones.

At worst, machine learning may add a new category of mathematically anomalous but potentially irrelevant detections and alerts to the stream. This makes the problem worse in many ways. Detections from machine learning require more work to investigate (because it is not initially clear what triggers them), and a thorough analysis of cause and effect is more difficult with machine learning algorithms.

Pull events instead of push

Here is a radical question: why do security teams not pull detection events instead of having an overwhelming number of tickets opened for them? This will allow a team to focus on useful alerts that can be investigated in more depth, contextualized, and updated according to the detection engineering process.

On the face of it, that seems like a crazy idea. Alerts indicate potential security incidents; shouldn't they all be investigated? However, incidents often generate many alerts, all of which are related to the same incident. The size of the alert stream may be misleading. The idea of pulling detections relies on the idea that the team itself will have the best understanding of what constitutes alerts worth their time.

In many cases, alerts that do not seem all that serious lead us to more interesting intrusions, while also explaining many related aspects of the alert stream.

How this should be done is still an open question: security teams should not miss events and should stay alert to what goes on in these alert streams. At the same time, automated alert generation may only adapt to new types of attacks with difficulty and certainly won't adapt without active intervention. From this perspective, a team of analysts can handle the daily checks and standard alerts, while a team of threat hunters or purple teamers works on dedicated alert pulling and the investigation of new types of attack. This will be discussed in more depth in Chapter 8, Red, Blue, and Purple Teaming.

A security team that has a good handle on the quality of the alerts that it receives will quickly be able to identify the most promising ones, and then spend the right amount of effort on each one.

Pulling alerts requires a high degree of trust in, and integrity within, the security team. Here, a few basic rules should be followed (a small sketch follows the list):

  • Define criticality based on the criticality of a process, rather than a system.
  • Define relationships with other alerts or alert families and place an alert in a specific context.
  • Define the scope and priority of the investigation, as well as how much time can be spent on it before a decision is made to either continue or abandon the investigation.
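
As a hedged sketch of what a pull-based triage record might capture (the field names and values are illustrative, not a standard schema):

    # Illustrative sketch of a pull-based alert triage record.
    from dataclasses import dataclass, field

    @dataclass
    class PulledAlert:
        alert_id: str
        business_process: str        # criticality follows the process, not the system
        process_criticality: str     # e.g. "high" for payment processing
        related_alerts: list = field(default_factory=list)  # context: the alert family
        scope: str = ""              # what the investigation will and will not cover
        timebox_hours: int = 4       # effort before a continue/abandon decision

    backlog = [
        PulledAlert("A-1041", "payment processing", "high",
                    related_alerts=["A-1038", "A-1039"],
                    scope="single host, last 72h of logs", timebox_hours=8),
        PulledAlert("A-1042", "staff intranet", "low", timebox_hours=2),
    ]

    # The analyst pulls the alert tied to the most critical business process first,
    # rather than working a first-in-first-out ticket queue.
    rank = {"low": 0, "medium": 1, "high": 2}
    next_alert = max(backlog, key=lambda a: rank[a.process_criticality])
    print(next_alert.alert_id)  # A-1041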

The approach to detection and response will be discussed in more detail in Chapter 8, Red, Blue, and Purple Teaming.

Analyze

Analyzing a detection involves getting the additional context around an alert to decide whether an alert is a true positive or false positive (an alert has, of course, already happened at this point).

Analysis involves looking at the specific conditions that the alert has occurred in, as well as looking up additional information on other systems. For instance, when a malware detection event occurs, we may want to consider the DNS logs or system logs to determine whether the malware managed to run and whether it managed to establish, for instance, command and control.
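
As a simple, hedged sketch of that kind of correlation, assuming the DNS logs are already available as structured records (the file layout, field names, and indicator list are hypothetical):

    # Hypothetical sketch: after a malware alert on a host, check the DNS logs
    # for lookups of domains that could indicate command and control.
    import csv

    SUSPECT_DOMAINS = {"update-check.example-bad.net", "cdn.example-bad.org"}  # illustrative indicators

    def dns_lookups_for_host(dns_log_path, host_ip):
        """Return suspicious DNS queries made by the host around the time of the alert."""
        hits = []
        with open(dns_log_path, newline="") as f:
            for row in csv.DictReader(f):  # expects columns: timestamp, client_ip, query
                if row["client_ip"] == host_ip and row["query"] in SUSPECT_DOMAINS:
                    hits.append((row["timestamp"], row["query"]))
        return hits

    for ts, domain in dns_lookups_for_host("dns.csv", "10.1.2.34"):
        print(f"{ts} possible C2 lookup: {domain}")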

For security operations, analysis must happen quickly and reliably, and to that end, organizations need to ensure that their security teams have sufficient visibility as well as searchability of logs.

Logging strategy

Specifically, organizations should consider implementing a logging strategy, which determines which system and application logs are collected, how they are stored, and how they will be searched. A logging strategy for security needs to be driven from an understanding of what constitutes a security event, and then you must work backward from there to determine what logs should be collected and how. In addition, compliance requirements may determine what logs are collected and how long they are kept.
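
One way to make working backward from security events concrete is to keep a simple mapping from the events you care about to the log sources they require. The entries below are illustrative examples, not a recommended baseline:

    # Illustrative sketch of a logging strategy worked backward from security events.
    LOGGING_STRATEGY = {
        "suspicious interactive logon": {
            "sources": ["Windows Security event log (4624, 4625)", "VPN gateway logs"],
            "retention_days": 365,  # may be driven by compliance rather than security
        },
        "possible C2 over DNS": {
            "sources": ["DNS resolver query logs", "proxy logs"],
            "retention_days": 90,
        },
        "data exfiltration from a file server": {
            "sources": ["file share audit logs", "NetFlow from the server segment"],
            "retention_days": 180,
        },
    }

    for event, needs in LOGGING_STRATEGY.items():
        print(f"{event}: collect {', '.join(needs['sources'])} for {needs['retention_days']} days")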

Modern SIEMs allow a team to capture this information and apply it across the data being ingested by the SIEM to determine when a sequence of events becomes a security incident.

Windows event logging

A good example of a (generic) logging strategy for Windows systems was developed by The Australian Cybersecurity Centre, which defined specific guidance for Windows event logging and forwarding. This may serve as an initial logging strategy for Windows systems. It is available here: https://www.cyber.gov.au/acsc/view-all-content/publications/windows-event-logging-and-forwarding. Similarly, the benchmarks developed by the Center for Internet Security also contain logging guidelines for various environments and systems, along with various hardening guides. They are available (with free registration) from this page: https://www.cisecurity.org/cis-benchmarks/.

The approach to and context of a logging strategy are primarily architectural and will be discussed in more detail in Chapter 5, Defensible Architecture.

No single pane of glass

A common pitfall of security tooling requirements is the idea that all tooling should be accessible through a single interface. Even worse, sometimes, this is considered a critical requirement of a security tool, often in the mistaken belief that such a single pane of glass will assist with alert fatigue and accelerate the analysis of alerts.

It is unclear whether this is money well spent. In practice, all that security teams need to perform the analysis is the ability to rapidly pivot between tools, perform ad hoc queries, and analyze the results from different tools. Therefore, the following requirements are preferable when choosing security tooling:

  • Rapid pivoting gets easier if the tools have web-based interfaces with a single authentication system; for instance, using OAuth or SAML.
  • Performing ad hoc queries implies that the tool is driven off a data lake architecture with loose coupling, APIs, and some analytics tooling.
  • Ensure systems run from the same time source to make timelining possible (see the sketch after this list).
  • Ensure a single identity across the organization to speed up analysis and the attribution of events.
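
The timelining point is easy to illustrate: if every tool stamps events from the same synchronized clock, building a cross-tool timeline is a simple merge. A minimal sketch with hypothetical event records:

    # Minimal sketch: merging events from different tools into a single timeline.
    # This only works cleanly if all sources share one synchronized time source (for example, NTP).
    from datetime import datetime, timezone

    edr_events = [
        ("2021-06-01T10:02:11Z", "EDR", "powershell.exe spawned by winword.exe on HOST-17"),
    ]
    proxy_events = [
        ("2021-06-01T10:02:14Z", "proxy", "HOST-17 HTTP POST to a rarely seen external domain"),
    ]

    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

    timeline = sorted(edr_events + proxy_events, key=lambda e: parse(e[0]))
    for ts, source, description in timeline:
        print(ts, source, description)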

Let's proceed to the next steps.

Next steps

The outcome of the analysis step is determining whether an alert was a true positive (an alert and a security event) or a false positive (an alert but no security event). Each of these two possibilities triggers an activity:

  • A false positive prompts us to consider whether we should update the detection and move into detection engineering.
  • A true positive leads us to move to the next step in incident response: containment. In this case, the analysis has given us additional context and information about the event that we can use to both contain and recover from the incident.

Typically, we must determine what stage of the kill chain an attacker may have progressed to. The kill chain was outlined in Chapter 2, Incident Response – A Key Capability in Security Operations. The containment and eradication phases of incident response, as well as the time available to respond, rely on this information.

Contain

If the analysis reveals that an incident has occurred, the incident needs to be contained.

The analysis has given us additional context around the alert, determined by the stage of the kill chain the attacker has reached, as well as some further clues as to what tooling and methods are being used by the attacker.

Containment can follow a combination of the following strategies (a small sketch of how these options might be recorded follows the list):

  • Deceive an attacker.
  • Deny an attacker the opportunity for further lateral movement by isolating the host or network segment that has been compromised.
  • Disrupt the capabilities of the tooling they have already deployed.
  • Destroy the tooling that has already been deployed, thus making it impossible for an attacker to continue.
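
A hedged sketch of how a team might map the kill chain stage an attacker has reached to candidate containment strategies; the mapping below is illustrative, not prescriptive, since real decisions depend on the incident context:

    # Illustrative mapping from kill chain stage to candidate containment strategies.
    CONTAINMENT_OPTIONS = {
        "delivery":              ["deny (block the sending domain at the mail gateway)"],
        "exploitation":          ["deny (isolate the affected host)", "disrupt (kill the malicious process)"],
        "installation":          ["destroy (remove the deployed tooling)", "disrupt (remove persistence mechanisms)"],
        "command and control":   ["disrupt (sinkhole the C2 domain)", "deceive (redirect traffic to a honeynet)"],
        "actions on objectives": ["deny (segment the network)", "destroy (remove tooling and revoke credentials)"],
    }

    stage_reached = "command and control"
    for option in CONTAINMENT_OPTIONS.get(stage_reached, ["deny (isolate the affected host)"]):
        print(f"candidate containment action: {option}")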

It almost always makes sense to monitor the compromised and neighboring parts of the network with extra care after containment to make sure that an attacker is indeed contained and not running amok somewhere else.

Note

Research done by the firm CrowdStrike has indicated that the time from initial access to lateral movement in 2018 was about 2 hours. More detailed research even developed a league table of attackers based on their breakout time, the time it took for lateral movement to be successful: https://www.crowdstrike.com/blog/first-ever-adversary-ranking-in-2019-global-threat-report-highlights-the-importance-of-speed/.

To contain effectively, teams should have the authority and ability to make the necessary changes quickly. Authority is a key element of the design of security teams and should be covered before incidents in an SOC charter that defines the constituency (that is, who the security team will act for), as well as the authority of the team (that is, what they can do under what circumstances). This aspect will be discussed in more detail in Chapter 9, Running and Operating Security Services.

Authority

More information about authority in security teams may be found in Section 2.3.3 of Carson Zimmerman's Ten Strategies of a World-Class Cybersecurity Operations Center, Mitre Corporation, 2014, available here: https://www.mitre.org/publications/all/ten-strategies-of-a-world-class-cybersecurity-operations-center.

Eradicate

Eradicating attackers involves removing the tooling and access that attackers have deployed on the network, as well as making sure that re-entry is not possible, at least not through the same vector.

Something to consider here, and something that good analysis should be able to give a lead on, is whether it is preferable to observe an attacker rather than eradicate them, especially in an early stage of the kill chain. Provided the attacker is contained and can be observed, this sometimes gives us a good opportunity to discover more about their intentions. However, the observation strategy needs a mature team coupled with a robust understanding of when to pull the plug on the attacker.

The requirements for eradication are as follows:

  • Ensure that all of the attacker's tooling and access is removed, not just some of it.
  • Ensure that the vulnerabilities that enabled the initial attack vector (and some trivial variants of it) have been closed to prevent attackers from reentering the network.
  • Write specific alerts for the attacker's activity and put them into the detection system to alert on any recurrence of that activity.

Recover

Recovery from an incident becomes easier if the attackers have not achieved their objectives because in this case, the organization has likely avoided the large data breaches, ransoms, and compromises that make recovery costly and difficult.

Even incidents that were stopped early in the kill chain can be costly to recover from. For instance, an environment may need to be rebuilt in its entirety, some data may be lost, or complex changes to prevent recurrence may be necessary. Recovery in almost all instances is something that needs to be operationally addressed by the business. It is unlikely that it can be outsourced, and the effort involved in recovery can be significant and have a significant business impact. Hence, it makes sense to plan recovery activities in advance.

Develop context and TTPs

Many organizations stop the process of incident response once they have recovered from the incident, thus denying themselves the opportunity to learn from the incident and improve their defenses. Agile incident response focuses specifically on the positive feedback from learning and how this may drive improvement.

The first step in this process is to collect the work that was started in the analysis and add anything else discovered about the incident to a threat intelligence platform. This allows further events associated with the same activity to be categorized under the same threat group. In the first instance, the threat intelligence platform can be a simple wiki or even a document, as long as what is recorded about the incident is kept and remains searchable.

The naming of threat groups

Many groups are persistent, and it may help a team to develop code names for them, on the basis that these threat groups will try to compromise us many times. Having a code name assists in gathering all that activity under a single label. It is better to have such names not refer to nationalities, individuals, or national symbols, as that assumes knowledge about an attacker that is, in most cases, unsupported by the available attribution and may also prejudge future analysis. It is usually best to take something as neutral as possible but meaningful to the team, such as a reference to a memorable event during the first or second time the team responded to an incident caused by these attackers.

The incident and how it's handled may also provide useful insights into the tactics, techniques, and procedures of an attacker. This is useful to collect at the closure of an incident as well.

Updating the architecture, strategy, and risk

The incident and how it's handled can also provide useful clues for updating or improving the architecture, although generally, improvements here will take some time to work through the deployed infrastructure.

Strategy and risk calculations will also benefit from incident data as it provides an improved understanding of the adversaries that are faced by the business and the effectiveness of the current defenses.

Detection engineering

As part of our closure of an incident, we can also improve detection engineering. In Chapter 2, Incident Response – A Key Capability in Security Operations, we discussed detection engineering as an activity that develops, tests, and applies detections of cyber compromises based on an understanding of the business context. This imposes quality control on detection code.

Detection engineering aims to reduce the number of false positives, detect as many incidents as possible promptly, and ensure that detection tools are tasked with the right information (that is, the detection is deployed to the tool that does the detection) and reviewed for fidelity. As suggested by the workflow in Figure 2.2, the detection engineering process consists of four specific steps:

  1. Developing the detection: The detection is developed based on the information available from false positives, past breaches, threat intelligence, business context, or threat modeling.
  2. Storing the detection in a detection repository: It is important to have a central repository for detection code.
  3. Tasking workflow: This is where the detection is deployed to the tool that monitors the network, the logs, the packets on the firewall, or the files and events on endpoints.
  4. Monitoring the performance of the deployed detection, especially the false positives and reliability of the detection.

    Detection as code

    The GitHub repository of Florian Roth, https://github.com/neo23x0, contains good examples of what detection as code may look like in practice. Detection engineering is currently not something that is supported widely by most toolsets, and it will require (in the first instance) a manual rather than an automated deployment pipeline.
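
As a hedged illustration of the idea, a detection kept as code can carry its own metadata and a test alongside the detection logic. The rule below is a simplified example, not production detection content:

    # Simplified sketch of "detection as code": the rule, its metadata, and a test
    # live together in version control. The logic and field names are illustrative only.
    DETECTION = {
        "id": "DET-0042",
        "title": "Possible credential dumping via suspicious lsass access",
        "ttp": "Credential Access",  # tie the detection to a TTP, not just an IOC
        "references": ["internal incident IR-2021-07"],
    }

    def matches(event):
        """Return True if a process event looks like suspicious lsass access."""
        return (
            event.get("target_process") == "lsass.exe"
            and event.get("source_process", "").lower() not in {"csrss.exe", "wininit.exe"}
            and event.get("access") == "read_memory"
        )

    def test_detection():
        benign = {"target_process": "lsass.exe", "source_process": "wininit.exe", "access": "read_memory"}
        suspicious = {"target_process": "lsass.exe", "source_process": "dumper.exe", "access": "read_memory"}
        assert not matches(benign)
        assert matches(suspicious)

    test_detection()
    print(f"{DETECTION['id']} loaded and self-test passed")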

Improvements – prevention, discovery, and prediction

The last step of agile security operations is to improve the prevention, discovery, and prediction of future breaches. The design and implementation of improvements will be discussed in more detail in Chapter 5, Defensible Architecture.

Making changes to the configuration of tools that only do detection is generally low risk and should be a pre-approved change that's managed by the SOC. It is then up to the SOC to decide whether they need a Test and User Acceptance Testing (UAT) environment to test passive detection changes, but this should not be necessary. It will be hard to robustly test changes to detections in a test environment because this would involve having to simulate actual attacks in this environment. Such simulations are possible, but mostly for larger and well-resourced teams.

For many small teams, the deployment pipeline of detections and predictions will be a manual affair, deployed with little to no risk on the actual environment, and monitored intensively after deployment.

In the remainder of this chapter, we'll focus on some tactical comments on tooling and tool deployment by looking at automation and coverage using the MITRE Shield framework.

Tooling – defend to respond

The agile security operations process strongly influences the tools that are deployed and how they are deployed. In this section, I will briefly discuss some of the tooling for passive and active defense.

Passive defense

Passive defense tooling covers both the defenses that block attackers, such as access control, firewalls, and system hardening, and the tooling that organizations can deploy to detect that an attack has occurred and to analyze it. Similarly, if you wish to contain incidents, you will need passive defense tools.

The SOC nuclear triad

The security researcher Anton Chuvakin maintains that a nuclear triad of SOC tooling exists, consisting of Security Information and Event Management (which, in this model, would include logging), Enterprise Detection and Response, and a capability for network detection and forensics. This is still a good model to go on: https://blogs.gartner.com/anton-chuvakin/2015/08/04/your-soc-nuclear-triad/. The SOC nuclear triad focuses on passive defense, not on how defense teams may actively engage with an attacker. More recently, he also suggested that Application Security Visibility may become the fourth pillar, turning the triad into a quad: https://medium.com/anton-on-security/back-in-2015-while-working-on-a-gartner-soc-paper-i-coined-the-concept-of-soc-nuclear-triad-8961004c734.

Passive defense tooling also needs to be managed across multiple environments, such as on-premises infrastructure, cloud, and increasingly intelligent devices. Hence, passive defense engineering is a large, growing, and complex problem that needs to be carefully considered.

To deploy and maintain all that tooling, security teams need a specific engineering function that focuses on the tools themselves. Developing and maintaining security tools is an ongoing concern, where a few guidelines apply:

  • To cut down on alert fatigue and the volume of alerts, SIEM tools should seek to automate frequent high-fidelity alerts and route them immediately into the help desk system for resolution. The security team itself only maintains the oversight of these automated alerts; it does not handle them.
  • Enterprise detection and response tools should go beyond antivirus and anti-malware functionality and should have the capability to detect and record system events (such as processes that have been stopped and started). They should also include tools to rapidly isolate endpoints remotely and perform searches on an environment in the form of OSQuery or Velociraptor queries, which can help determine whether malicious artifacts are present on an endpoint (a small query sketch follows this list). Such tools may be available as open source or be part of a commercial detection and response solution, often referred to by the acronym XDR.
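
As a small example of what such a search might look like, assuming osquery is installed on the endpoint (the query and the artifact it looks for are illustrative):

    # Illustrative sketch: querying an endpoint with osquery for a suspicious process.
    # Assumes osqueryi is installed and on the PATH; the query itself is just an example.
    import json
    import subprocess

    QUERY = "SELECT pid, name, path FROM processes WHERE name = 'mimikatz.exe';"

    result = subprocess.run(
        ["osqueryi", "--json", QUERY],
        capture_output=True, text=True, check=True,
    )
    for process in json.loads(result.stdout):
        print(f"suspicious process {process['name']} (pid {process['pid']}) at {process['path']}")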

The growth in cloud environments and the growing use of encryption will potentially make network intrusion detection less useful over time, although it is still a useful tool. The placement of network sensors at strategic points in the network needs careful consideration, and sensors are most effective at points where network encryption is terminated.

Active defense – Mitre ATT&CK and Shield

The Mitre ATT&CK framework is a well-known framework that models attackers and attack groups based on their tooling and techniques, tactics, and procedures. The ATT&CK framework can be found here: https://attack.mitre.org. The tactics of the ATT&CK framework map roughly to the kill chain and are used widely in active defense and threat intelligence.

Note

ATT&CK is quite well documented on the attack.mitre.org website. An extensive chapter on ATT&CK can be found in Practical Threat Intelligence and Data-Driven Threat Hunting, by Valentina Palacín, Packt. A further discussion of ATT&CK can also be found in Chapter 6, Active Defense, of this book.

The tactics that make up the ATT&CK framework are as follows:

  • Reconnaissance: The external investigation of the victim's environment to determine avenues of access and weapons that could be used to attack the victim.
  • Resource Development: Development of the resources needed by the attacker to execute the attack.
  • Initial Access: The initial access to the environment.
  • Execution: Executing the payload of the cyber weapon in the victim's environment.
  • Persistence: Ensuring that the attackers have a configured way back into the system, such as a backdoor, or an account that may be used to log back in later.
  • Privilege Escalation: Activities to gain more access from the initial access point and elevate the authorization of the attacker or gain access to more systems.
  • Defense Evasion: The activities to hide the attacker from defensive tooling and teams.
  • Credential Access: Techniques to steal credentials or passwords.
  • Discovery: Activities on the victim network that assist an attacker in discovering additional systems, or any living off-the-land tools that may be used.
  • Lateral Movement: Activities that the attacker performs to move from the initial points of compromise to other systems in the network.
  • Collection: Gathering information or information sources that are relevant to the attacker's objectives.
  • Command and Control: Establishing communications between the attacker's infrastructure and the victim's infrastructure to control the victim's infrastructure.
  • Exfiltration: Exfiltrating the data that forms part of achieving the attacker's objective. This can also involve staging, packaging, and encrypting the data to avoid detection.
  • Impact: This category describes a list of possible impacts of a cyber attack that result from the previous tactics being successfully applied to the victim's environment.

The tactics in the ATT&CK framework can be used as guides to design and implement specific defenses that aim to thwart successful completion of this tactic. The ATT&CK framework can be combined with the Shield framework, which focuses on a range of defense tooling that is available to security teams, as well as its effectiveness against specific tactics in the ATT&CK framework. Whereas defenders can use the ATT&CK framework to analyze attacks and perform intelligence-driven incident responses, Shield is an extensive guide to tooling.

Like ATT&CK, Shield also consists of several defensive tactics. The focus of the Shield framework is specifically on active defense, which is defined as the employment of limited offensive action and counterattacks to deny a contested area or position to the attacker. Various definitions can be found here: https://shield.mitre.org/resources/getting-started. Shield, like ATT&CK, uses the usual terminology of tactics, techniques, and procedures, but focuses them on defensive actions. This is, of course, entirely possible: defenders, like attackers, also have a specific arsenal of TTPs.

The focus of Shield is deception: how to divert an attacker from their real objective and drive them to an environment of the defender's choosing so that their behavior can be studied further. Such environments could be honeypots or honeynets.

The tactics that make up the Shield framework are as follows:

  • Channel: Channel is the capability a security team may have to deceive attackers and channel them into an area of the defender's choosing, such as a decoy network or a hardened network.
  • Collect: This term refers to the capability of a team to gather an adversary's toolset and collect other data about their activity on the network. This information may then be used to strengthen defenses or contain an attacker.
  • Contain: This is the capability to restrict an attacker to a specific and constrained area that they cannot escape from.
  • Detect: Detection is the capability to develop and maintain awareness of an attacker's behavior.
  • Disrupt: Disruption prevents an attacker from completing their mission by either slowing the attacker down (increasing the time required to complete the mission) or making it harder, through system hardening or otherwise, to complete the mission.
  • Facilitate: The facilitate step is the opposite of the disrupt step and allows an attacker to complete part of their mission. The purpose of this could be to present an attacker with a false or watermarked set of data of relatively little value.
  • Legitimize: This activity adds authenticity to deceptive components to convince an attacker that a decoy system is real.
  • Test: This capability studies an attacker in the wild to determine their capability, motivation, or behavior.

The Shield terminology also adds two new terms to the defense lexicon:

  • Opportunity spaces, which are the high-level opportunities for active defense that arise when the attacker employs a specific technique.
  • Use cases, which describe what a defender could do to take advantage of an opportunity presented by an attacker's action.

In the latter aspect, the Shield framework can be mapped to ATT&CK techniques, where the use of a particular ATT&CK technique generates an opportunity space for the defender, with specific use cases that outline what a defender might do to thwart that specific technique.

In that sense, specific Shield techniques fit with several opportunity spaces, use cases, and procedures intended to limit the effectiveness of several ATT&CK techniques.

There are two ways in which defenders can use Shield to their advantage:

  • Shield provides a catalog of defensive techniques that enable active defense, and defenders can rate their current toolset and capabilities against what is required.
  • During the analysis and containment phase of an incident, defenders can consult the Shield matrix to determine what their active defense options are.

Summary

In this chapter, we defined an agile framework for security operations that is centered on the incident response loop, with a specific methodology for agile security operations. This chapter has stressed that learning from incidents must lead to improvements in prevention, detection, and response practices.

An important consequence of this development is that the key tooling of security operations changes. We focused on developing a framework for agile security operations that specifies specific activities that form part of the core of agile security operations. This realignment has consequences for engineering and tooling requirements as it introduces a new set of activities and associated tools to the security team.

To free up the time to work with high profile incidents in more detail, I have proposed that, under certain conditions and in addition to the performance of daily checks, security teams should be allowed to pull their alerts from the queue, rather than be expected to deal with all alerts in a priority-based, first-in-first-out model. In other words, security is not a help desk.

We also discussed the importance of passive defense tooling and the Mitre Shield framework, which classifies and categorizes active defense tooling.

This chapter concludes Section 1 of the book. In Section 2, we will focus on some of the underlying ideas of agile security operations, especially the principles and key concepts of defensible architecture and active defense. Section 3 will then fill in some of the remaining gaps. In the next chapter, we will discuss some of the key concepts of resilience and dive deeper into concepts that shape incident response.
