The Five Layers of Incident Response (Part 2)

Lessons from the field in structuring a modern, code-driven, and data-centric IR program.

Jack Naglieri and Jeff Bollinger

Jun 11, 2024

Welcome to Detection at Scale, a weekly newsletter for SecOps practitioners covering detection engineering, cloud infrastructure, the latest vulns/breaches, and more.

This week, we continue a three-part series about the Layers of Code-Driven Incident Response, co-authored with Jeff Bollinger, Director of Detection Engineering and Incident Response at LinkedIn. In Part 1, we covered the Playbook, Data, and Presentation layers:

The Five Layers of Incident Response (Part 1), by Jack Naglieri and Jeff Bollinger, May 21

Together, those three layers represent the preparation and monitoring phases of the Incident Response (IR) lifecycle. Playbooks map the crucial threats to your organization, data is gathered to satisfy those playbooks, and visual cues help you understand the activity.

In part two, we’ll examine the remaining two of the Five Layers of Incident Response: Case Tracking, and Remediation and Containment. These complete the IR lifecycle and are the response-oriented layers, where we decide whether alerts become incidents and then act on them.

How can we effectively track our confirmed cases for the best coordination? How do we make sure we effectively learn from the past? How well are we performing, what insights can we discover, and what is our impact on security? What is the plan to move fast on containment while not breaking things in production? These are the topics for part two of this series.

Case Tracking Layer

Case tracking involves gathering evidence and organizing all the pertinent information about an incident. This includes response coordination, timeline creation, artifact collection, and the responders’ reported observations. A case is an easily parsable repository of evidence and outcomes about an incident.

Case tracking helps us manage and document all aspects of an incident, and it can improve response times by drawing on past learnings. Because cases contain the responders’ notes, observations, and evidence, a future team member can review which runbooks were executed and what lessons were learned from a previous incident, leading to a faster, more consistent response.
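To make "easily parsable" concrete, a case can be modeled as a small structured record. The sketch below is one hypothetical shape, not a schema from any of the tools mentioned in this post. The field names (`playbook_id`, `artifacts`, `timeline`) and the `add_event` helper are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TimelineEvent:
    at: str       # ISO 8601 timestamp
    author: str   # responder who recorded the observation
    note: str     # what was observed or done


@dataclass
class Case:
    case_id: str
    playbook_id: str   # hypothetical link back to the Playbook layer
    title: str
    status: str = "open"
    artifacts: list = field(default_factory=list)   # e.g. hostnames, hashes, URLs
    timeline: list = field(default_factory=list)    # ordered TimelineEvent entries

    def add_event(self, author: str, note: str, at: Optional[datetime] = None) -> None:
        """Append a responder observation to the case timeline."""
        when = at or datetime.now(timezone.utc)
        self.timeline.append(TimelineEvent(when.isoformat(), author, note))

    def to_json(self) -> str:
        """Serialize the whole case so other tools (or future responders) can parse it."""
        return json.dumps(asdict(self), indent=2)


case = Case("CASE-0001", "PB-PHISH-01", "Credential phishing against a finance user")
case.artifacts.append("login.example-phish.test")
case.add_event(
    "responder-a",
    "Victim confirmed entering credentials",
    at=datetime(2024, 6, 1, 14, 5, tzinfo=timezone.utc),
)
print(case.to_json())
```

Keeping the timeline as explicit, timestamped entries is a design choice that makes it straightforward to rebuild the sequence of events later or to hand the record to another system.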

Tip: Case tracking should happen in a system with strict access controls where only the security team can see alerts, manage tasks, and add comments. Popular tools for case tracking include Jira, PagerDuty, RTIR, ServiceNow, and TheHive.

The first step in case tracking is understanding when a new case should be opened. For this, we return to the Playbooks we created. Do the alerts rise to the level of a declared incident, according to previously established criteria grounded in business risk acceptance? If so, we should open a case. Even when they do not, if a responder spends more than a few minutes on an investigation, we should open a case at minimum to track the team’s efforts and outputs, so that if a similar alert shows up again, there is historical precedent that can add context or reveal systemic problems.

The number of confirmed cases per month should be used to estimate and optimize our team’s workload and other KPIs, such as the mean time to respond. When adding new playbooks and rules, it’s normal to see an increase in cases opened until the rules reach maturity through tuning and other optimizations.
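As a rough illustration of these two metrics, the sketch below counts confirmed cases per month and averages the time to respond. The sample timestamps are made up, and defining "time to respond" as the gap between a case being opened and the first response action is an assumption; use whatever definition your team has agreed on.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical confirmed cases: (opened_at, first_response_at)
cases = [
    (datetime(2024, 4, 3, 9, 0), datetime(2024, 4, 3, 9, 40)),
    (datetime(2024, 4, 17, 13, 0), datetime(2024, 4, 17, 14, 20)),
    (datetime(2024, 5, 8, 22, 0), datetime(2024, 5, 8, 22, 30)),
]


def cases_per_month(cases):
    """Count confirmed cases by the month in which they were opened."""
    return Counter(opened.strftime("%Y-%m") for opened, _ in cases)


def mean_time_to_respond(cases):
    """Average gap between opening a case and the first response action."""
    total = sum((responded - opened for opened, responded in cases), timedelta())
    return total / len(cases)


print(dict(cases_per_month(cases)))
print(mean_time_to_respond(cases))
```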

[Figure: line graph of the number of alerts (y-axis) over time (x-axis), showing a brief spike followed by a slow decline.]

When triaging alerts before opening a case, the goal is to determine whether the activity is malicious. If it’s not, we tune the Playbook or update the underlying rule logic, because we want rules optimized for high accuracy and analysis efficiency. The information in the Presentation and Data layers should also enable quick decision-making. If the activity is malicious, we open a case and progress to the Remediation and Containment layer.
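The triage rules described in this section can be sketched as a small decision function. This is an illustration under stated assumptions rather than a prescribed workflow: the `Triage` type, the step names, and the five-minute threshold (a stand-in for "more than a few minutes") are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: a stand-in for "more than a few minutes"; pick your own value.
TRACKING_THRESHOLD_MINUTES = 5


@dataclass
class Triage:
    malicious: bool        # the responder's determination
    minutes_spent: float   # time invested in the investigation


def next_steps(triage: Triage) -> list:
    """Map a triage outcome to follow-up actions."""
    if triage.malicious:
        return ["open_case", "contain_and_remediate"]
    # Not malicious: improve the detection so it is more accurate next time.
    steps = ["tune_playbook_or_rule"]
    if triage.minutes_spent > TRACKING_THRESHOLD_MINUTES:
        # Still worth a case so the effort leaves a historical record.
        steps.append("open_case_for_tracking")
    return steps


print(next_steps(Triage(malicious=True, minutes_spent=12)))
print(next_steps(Triage(malicious=False, minutes_spent=2)))
print(next_steps(Triage(malicious=False, minutes_spent=30)))
```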

Automating Case Actions with GenAI

With the rise of GenAI tooling, it’s now possible to build autonomous agents that deliver useful insights, summaries, and interesting connections. This works through a combination of RAG (retrieval-augmented generation) and registered agent function calls that perform specific tasks. For example, if an alert fires regularly and has historically been a true positive, you can reasonably understand its origins and predict an outcome based on contextual clues.

Related post: The GenAI D&R Revolution Begins, by Jack Naglieri, May 13

With agents, we can make a data-backed determination or verdict on whether the alert is a true positive, benign positive, or false positive by automating:

Blast radius discovery: finding other assets that might have been impacted.
Containment, such as blocking access or resetting credential(s) with an API.
Notifying the victim and their management to obtain human context.
Updating the case with evidence, timelines, and all communications.
Closing the case and calculating the basic KPIs of time to detect and time to contain.
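The steps above can be sketched as a small pipeline. This is not a real agent: `decide_verdict` is a placeholder that takes the majority verdict of past cases for the same rule, where a production system would call a model with retrieved context. The action names, the alert fields, and the KPI definitions (time to detect as alert time minus activity time, time to contain as containment time minus alert time) are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime
from enum import Enum


class Verdict(Enum):
    TRUE_POSITIVE = "true positive"
    BENIGN_POSITIVE = "benign positive"
    FALSE_POSITIVE = "false positive"


def decide_verdict(alert, history):
    """Stand-in for the agent: majority verdict of past cases for the same rule."""
    past = Counter(c["verdict"] for c in history if c["rule_id"] == alert["rule_id"])
    if not past:
        raise LookupError("no precedent for this rule; route to a human analyst")
    return past.most_common(1)[0][0]


def handle_alert(alert, history, actions, clock=datetime.now):
    """Run the automated steps and return a closed case record."""
    case = {"rule_id": alert["rule_id"], "evidence": []}
    case["verdict"] = decide_verdict(alert, history)
    if case["verdict"] is Verdict.TRUE_POSITIVE:
        # Each action is a caller-supplied callable (hypothetical integrations).
        for step in ("discover_blast_radius", "contain", "notify_victim"):
            case["evidence"].append((step, actions[step](alert)))
        case["time_to_contain"] = clock() - alert["alerted_at"]
    case["time_to_detect"] = alert["alerted_at"] - alert["occurred_at"]
    case["status"] = "closed"
    return case


history = [
    {"rule_id": "impossible-travel-login", "verdict": Verdict.TRUE_POSITIVE},
    {"rule_id": "impossible-travel-login", "verdict": Verdict.TRUE_POSITIVE},
    {"rule_id": "impossible-travel-login", "verdict": Verdict.BENIGN_POSITIVE},
]
alert = {
    "rule_id": "impossible-travel-login",
    "user": "alice",
    "occurred_at": datetime(2024, 6, 1, 10, 0),
    "alerted_at": datetime(2024, 6, 1, 10, 4),
}
actions = {
    "discover_blast_radius": lambda a: ["laptop-042"],
    "contain": lambda a: f"reset credentials for {a['user']}",
    "notify_victim": lambda a: f"messaged {a['user']} and their manager",
}
# A fixed clock keeps the example deterministic.
case = handle_alert(alert, history, actions, clock=lambda: datetime(2024, 6, 1, 10, 9))
print(case["verdict"].value, case["time_to_detect"], case["time_to_contain"])
```

Raising an error when there is no precedent is deliberate: an alert the system has never seen should reach a human rather than receive a guessed verdict.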

New cases can also be fed back into the model/agent to aid future decisions and pattern matching. This type of automation will not work on novel attacks or techniques because the model doesn’t yet understand their features and attributes. However, it should be effective for triaging alerts that are new but resemble previous cases.

By utilizing agents, your organization can begin to automate away the common alerts and focus its human talent on the more complex and less understood threats.

Remediation and Containment Layer

The Remediation and Containment layers represent the final components of a solid playbook management system, as they are the most direct actions that a response team will take throughout the process. Incident discovery, lessons learned, and long-term fixes are incredibly important. However, they are largely process and documentation overhead, and the resultant work is often owned by external teams in departments like IT and Engineering.

Remediation and Containment require the response team to take direct action to isolate compromised systems or identities to reduce attacker dwell time and damage potential. This prepares the organization for recovery after an incident, as normal business operations cannot be restored after a security incident if you have no confidence that the issue has been fully contained and remediated.

Containment can be achieved in many ways and at many layers. Because of this, it’s important to understand the nature of the attack or compromise and the suitability of various containment methods. For example, if you reset an employee’s login credentials after a successful phishing attack, you have remediated the threat of stolen credentials at the identity layer and contained any additional identity-based attacks that leverage that credential. However, is this sufficient? What if the phishing attack dropped malware on the endpoint after the victim logged into the attacker’s phish kit? What if the attacker no longer needs a login credential because all browser session cookies were stolen?
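One way to keep these questions from being missed is to map each investigation finding to the containment action it requires. The sketch below is a hypothetical lookup table built from the phishing example above; the finding names and action descriptions are illustrative assumptions, not a complete catalog.

```python
# Hypothetical mapping from investigation findings to (layer, containment action).
CONTAINMENT_BY_FINDING = {
    "credentials_phished": ("identity", "reset the victim's credentials"),
    "session_cookies_stolen": ("identity", "revoke active sessions and tokens"),
    "malware_on_endpoint": ("endpoint", "isolate the host from the network"),
}


def containment_plan(findings):
    """Return (layer, action) pairs for every finding, in a stable order."""
    unknown = set(findings) - set(CONTAINMENT_BY_FINDING)
    if unknown:
        # Fail loudly: an unmapped finding means containment may be incomplete.
        raise ValueError(f"no containment defined for: {sorted(unknown)}")
    return [
        action
        for finding, action in CONTAINMENT_BY_FINDING.items()
        if finding in findings
    ]


plan = containment_plan({"credentials_phished", "session_cookies_stolen"})
for layer, action in plan:
    print(f"{layer}: {action}")
```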

Persistence is a core part of the attacker kill chain and the ATT&CK Matrix. The longer an attacker remains, the more opportunity they have for damage. The goal of strong containment is to reduce the attacker’s ability to persist in an organization. The goal of remediation is to restore any compromised entities to a working, secure state and remove traces of attacker behavior once the evidence is cataloged in your Case Tracking system.

The key takeaway is to develop a comprehensive and highly reliable containment system at multiple layers. You need to prevent attackers from achieving their objectives, whether data exfiltration, lateral movement, or maintaining persistence. Consider options to block or reset domains, IP and MAC addresses, URLs, identities, tokens (bearer tokens, session cookies), access groups, cloud security groups, and endpoint lockdown controls. Regularly verifying that these systems are functional and responsive is as important as setting them up. You do not want to discover during an active incident that your containment systems fail!
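A scheduled health check is one way to verify that containment controls still work. In the sketch below, each probe is a caller-supplied function that exercises a control against a test target (for example, resetting a test identity) and returns a truthy value on success. The probe names, the two-second time budget, and the simulated failure are assumptions for illustration.

```python
import time


def check_controls(probes, max_seconds=2.0):
    """Run each probe and report whether it succeeded within the time budget."""
    report = {}
    for name, probe in probes.items():
        started = time.monotonic()
        try:
            ok, error = bool(probe()), None
        except Exception as exc:  # a crashing control counts as a failing control
            ok, error = False, repr(exc)
        elapsed = time.monotonic() - started
        report[name] = {
            "ok": ok and elapsed <= max_seconds,
            "seconds": elapsed,
            "error": error,
        }
    return report


def broken_probe():
    # Simulates an integration that has silently stopped working.
    raise ConnectionError("EDR API unreachable")


probes = {
    "reset_test_identity": lambda: True,
    "block_test_domain": lambda: True,
    "isolate_test_endpoint": broken_probe,
}
report = check_controls(probes)
failing = [name for name, result in report.items() if not result["ok"]]
print("failing controls:", failing)
```

Treating a slow control as a failed one reflects the point above: during an incident, a containment action that takes too long is nearly as bad as one that does not work.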

The Remediation layer is similar to Containment in that you need to recover quickly from an incident. Some specific examples would be resetting compromised credentials, re-imaging a compromised laptop, or taking down and rebuilding a cloud-hosted server with stronger controls. Remediation is what gets you back in business, and it often requires help from other groups like IT Help Desks and the victims themselves to ensure they have the confidence their systems are back in shape and no longer impacted by the initial incident.

Whatever your remediation processes, ensure you preserve evidence in your Case Tracking system first. Once the threat is gone, much of the evidence goes with it, and without that evidence you won’t be able to understand exactly what happened or create a useful timeline of events. It’s as important to archive evidence as it is to ensure a full and proper remediation.
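A simple way to preserve evidence before remediation is to record a hash manifest of each collected file and attach it to the case, so that later copies can be checked against the originals. This is a minimal sketch using Python’s standard library; the manifest fields and the sample log line are assumptions, and very large files would be better hashed in chunks than read into memory at once.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def build_manifest(paths):
    """Record name, size, and SHA-256 for each evidence file."""
    manifest = []
    for path in paths:
        data = Path(path).read_bytes()
        manifest.append({
            "file": Path(path).name,
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return manifest


# Demo with a throwaway file standing in for a collected log.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "auth.log"
    sample.write_bytes(b"suspicious login from 203.0.113.7\n")
    manifest = build_manifest([sample])

print(json.dumps(manifest, indent=2))
```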

Summary and Up Next

To summarize what we have covered so far in the first two parts of our Five Layers of Incident Response series:

Playbook Management Layer: The strategic plan for detecting and responding to incidents, involving defining rules and response actions to minimize manual triage.
Data Layer: Identifying and capturing relevant audit logs and data to support the playbooks through organizational context.
Presentation Layer: The human interface for visualizing incidents, timelines, summaries, and actionable insights.
Case Tracking Layer: Manages and documents all aspects of an incident, including evidence gathering and coordination.
Remediation and Containment Layer: Isolating compromised systems and identities to reduce damage and restore normal operations, ensuring thorough removal of threats.

The final part of this blog series will cover best practices for getting started. Subscribe to get notified of the next post! Please share this with a friend if you found it helpful.

Cover photo by Clark Van Der Beken on Unsplash

A guest post by

Jeff Bollinger

https://www.jeff-bollinger.com