The Five Layers of Incident Response (Part 1)
Lessons from the field in structuring a modern, code-driven, and data-heavy IR program.
Welcome to Detection at Scale, a weekly newsletter for SecOps practitioners covering detection engineering, cloud infrastructure, the latest vulns/breaches, and more.
This week, we begin a three-part series about the Layers of Code-Driven Incident Response, co-authored with Jeff Bollinger, Director of Detection Engineering and Incident Response at LinkedIn!
Check out our episode below on the Detection at Scale Podcast:
We’re not going to have 50 people staring at the same screen. We try to automate everything from enrichment to data gathering to execution of queries. That’s how we detect at scale.
- Jeff Bollinger
The rise of Detection Engineering has transformed how modern SecOps teams are built, applying software engineering and high-scale data principles to automate as much as possible. It’s not about engineering people out of jobs but rather up-leveling and augmenting our capabilities as defenders. This is the “SOCless” mindset made popular by teams like Netflix and Twilio, with detection and response coming together as two halves of one discipline, built around code and automation.
But how can we ensure our efforts are properly coordinated, directed, and focused? This blog will provide a playbook for teams to follow when adopting this mindset.
Many security teams get bogged down manually triaging floods of low-severity alerts and lose sight of the broader point. By combining higher levels of intentionality with automation, we can work backward to protect our organization from the most impactful threat models and improve operational focus. Leaders can follow the “Five Layers of IR” strategy:
Playbook Management Layer
Data Layer
Presentation Layer
Case Tracking Layer
Remediation and Containment Layer
In Part 1, we cover the Playbook, Data, and Presentation layers; Part 2 covers the Case Tracking and Remediation layers; and Part 3 closes with best practices for getting started.
Playbook Management Layer
Everything starts with the Playbook—the detection rules and subsequent response actions. But how do we know where to begin? What kinds of detections should we build, and what are our containment capabilities? What are the most damaging or difficult attack scenarios we must protect against? What are the ultimate attacker objectives and worst-case scenarios for your data integrity, confidentiality, and availability?
Working backward from these answers helps us prioritize what matters most to detect in our organization. Then, we can tactically consume and curate alerts and signals that lead to the detection of an attacker achieving a malicious objective. By taking this approach from the start, alerts become much more meaningful.
For example:
Threat Model: Production Customer Data Stolen from Amazon S3
Scenario: Attackers gain privileged access to our production environment, then discover, copy, and exfiltrate decrypted customer data.
Detection Rules:
- CompromisedSupportAccess (Valid Accounts)
- S3Scanning (Discovery: Cloud Storage Object)
- S3BucketDumped (Collection: Data from Cloud Storage)
Response:
- Kill sessions, rotate keys
Hardening Opportunities:
- Investigate root causes of initial access, then prevent it or diminish its efficacy.
- Audit all organization tenant storage security policies for excessive access.
- Block outbound network access by default; only allow selected, managed traffic.
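To make this concrete, here’s a minimal sketch of what the S3BucketDumped detection above might look like as code. The event fields, bucket names, and threshold are illustrative assumptions about CloudTrail S3 data events, not a finished rule:

```python
# Illustrative sketch only: field names, buckets, and the threshold are
# assumptions about CloudTrail S3 data events, tuned per environment.
SENSITIVE_BUCKETS = {"prod-customer-data"}  # assumption: your crown-jewel buckets
GET_OBJECT_THRESHOLD = 100  # assumption: depends on your access baseline


def rule(event: dict) -> bool:
    """S3BucketDumped: a single principal reads an unusually large number
    of objects from a sensitive bucket (Collection: Data from Cloud Storage)."""
    if event.get("eventName") != "GetObject":
        return False
    bucket = event.get("requestParameters", {}).get("bucketName")
    if bucket not in SENSITIVE_BUCKETS:
        return False
    # In practice this count would come from a windowed aggregation keyed
    # on the principal's ARN (e.g., GetObject calls in a 15-minute window);
    # the precomputed field below is a hypothetical stand-in.
    return event.get("get_object_count_15m", 0) > GET_OBJECT_THRESHOLD
```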
The Playbook defines the strategy and the workflows for executing it, with the goal of continuous improvement over time. The number of incidents should decrease through investments in preventative measures and policy controls. The Playbook can be managed and stored in various ways, but developing it in a source code repository is best for scalability, non-repudiation, and access control.
Palantir has a great structure you can borrow from: the Alerting and Detection Strategy (ADS) framework.
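As one possible sketch (not the ADS spec itself), the ADS sections can be captured as structured metadata version-controlled next to each detection rule; the example values below are hypothetical:

```python
from dataclasses import dataclass, field

# One possible representation of ADS-style metadata, kept in the same
# repository as the detection rule it documents.
@dataclass
class AlertingDetectionStrategy:
    goal: str                  # the attacker behavior this detects
    categorization: str        # e.g., a MITRE ATT&CK technique ID
    strategy_abstract: str     # how the detection works, in plain language
    technical_context: str     # log sources and fields involved
    blind_spots: list[str] = field(default_factory=list)
    false_positives: list[str] = field(default_factory=list)
    validation: str = ""       # how to safely trigger the alert for testing
    response: str = ""         # first steps for the responding analyst


s3_dump_ads = AlertingDetectionStrategy(
    goal="Detect bulk reads of customer data from sensitive S3 buckets",
    categorization="T1530 (Data from Cloud Storage)",
    strategy_abstract="Threshold on GetObject calls per principal per window",
    technical_context="CloudTrail S3 data events",
)
```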
The Playbook's detection opportunities should largely focus on the threats most likely to damage your organization. Rather than covering a potentially distracting checklist of Tactics, Techniques, and Procedures (TTPs), the focus should be on the most relevant and high-impact TTPs.
With a clear intention, we can begin gathering data to feed the playbook.
Data Layer
The Data Layer defines which audit logs we must collect to meet the needs of our Playbooks. This means identifying and capturing the most relevant data, starting with security-specific events and then your most critical production services, and transforming it into a well-supported schema for easy indexing and searching.
Today's most widely adopted data schema formats in security are the Open Cybersecurity Schema Framework (OCSF) and the Elastic Common Schema (ECS). Common schemas define a set of objects representing a specific activity, like DNS, DHCP, and Processes. Any data source telemetry producing that category of data then maps into the corresponding object, enabling SecOps teams to combine datasets and making rule writing and hunting easier.
The OCSF/examples repo on GitHub can provide a visual of how raw logs translate into OCSF and a basic understanding of the base format.
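For a feel of what that mapping involves, here’s a simplified sketch that normalizes a raw Okta-style sign-in event into an OCSF-like authentication record. The target field names only approximate OCSF’s Authentication class; consult the OCSF/examples repo for exact mappings:

```python
# Simplified sketch: normalize a raw Okta System Log event into an
# OCSF-like record. Field names approximate OCSF's Authentication class.
def to_ocsf_auth(raw: dict) -> dict:
    return {
        "class_name": "Authentication",
        "time": raw["published"],                       # event timestamp
        "activity_name": raw["eventType"],              # e.g., user.session.start
        "user": {"name": raw["actor"]["alternateId"]},  # who authenticated
        "src_endpoint": {"ip": raw["client"]["ipAddress"]},
        "status": raw["outcome"]["result"],             # SUCCESS / FAILURE
        "raw_data": raw,                                # keep original for forensics
    }
```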
Adopting a common schema is beneficial because it makes rule creation and triage simpler and more portable across platforms that recognize the same schema, avoiding lock-in. Sharing rules across the community also has the inherent benefit of improving mindshare and detection techniques. However, trying to fit every possible log into a common schema has challenges, such as evolving the base schema, handling leftover fields, and optimizing query performance.
Enrichments and lookups can be integrated into the Data Layer to fill in contextual gaps in data. For example, enriching Okta logs with HR information (e.g., Rippling, Workday) helps answer the “who” during analysis, investigation, and response. These enrichments can be added directly to logs upon ingestion, and while this increases the overall size of your data, it significantly helps in the Playbook layer, where your core threat models are defined. Enrichment can also happen post-alert, but enriching first makes that context available to correlation rules. Enriching at ingestion time also provides a contextual snapshot at a point in time, which is helpful for triaging indicators that may have expired or lost their value.
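A minimal sketch of ingest-time enrichment, assuming a hypothetical hr_directory lookup exported from your HR system:

```python
# hr_directory and its fields are hypothetical stand-ins for a
# Workday/Rippling export refreshed on a schedule.
hr_directory = {
    "alice@example.com": {
        "team": "Support",
        "manager": "bob@example.com",
        "employment_status": "active",
    },
}


def enrich(event: dict) -> dict:
    person = hr_directory.get(event["user"]["name"], {})
    # Snapshot the context at ingest time: if Alice changes teams next
    # quarter, this event still reflects who she was when it happened.
    event["user"]["hr_context"] = person
    return event
```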
Data transformations, log filtering, a unified data model, and routing between hot and cold storage are all attempts to optimize search performance, costs, and ease of querying/rule-writing. These strong data principles will provide the foundation for a scalable, sustainable program.
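As a rough sketch, filtering and hot/cold routing can be a single decision at ingest; the tiering policy below is an assumption to tune against your own playbooks:

```python
# Assumption: which event classes deserve fast (expensive) hot storage
# is driven by the Playbook layer's threat models.
HOT_EVENT_CLASSES = {"Authentication", "API Activity"}


def route(event: dict) -> str:
    # Drop known-noisy health checks entirely (log filtering).
    if event.get("activity_name") == "healthcheck":
        return "drop"
    # High-value security events go hot for fast querying; the long
    # tail goes to cheaper cold storage for occasional retrieval.
    return "hot" if event.get("class_name") in HOT_EVENT_CLASSES else "cold"
```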
Presentation Layer
With the “critical threats” modeled, the playbooks created, and data flowing into the system, we can now evaluate the alerts from our correlations, visualize behaviors, validate edge cases, and begin our feedback optimization loop.
The Presentation Layer is the human interface to the evidence unearthed by your playbooks. Imagine this layer as a visual report about an event, including a timeline, summary, and the next potential action to take.
A common initial step for investigations is to review all relevant, concurrent data in a timeline. The presentation layer should allow an analyst to review what happened and analyze related events around the same time as the initial trigger. When pivoting, it’s common to expand or home in on dimensions like:
Log Sources: Similar events in other places
Time: Adjusting the window ±1 hour
User: Same behavior, different user
Action: Same user, different actions
Parameters: Similar requests but different options
The system’s query language, either domain-specific or a standard language like SQL, should allow teams to express these pivots easily. A library of pivots and contextual queries can also be built and reused across playbooks.
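Here’s one way such a pivot library might look, sketched as parameterized SQL templates; the table and column names are assumptions about a normalized schema like the one built in the Data Layer:

```python
# Reusable pivot queries an analyst (or an automated playbook) can run
# from the presentation layer. Paramstyle shown is DB-API "pyformat"
# (e.g., psycopg2); adapt to your query engine.
PIVOTS = {
    # Same user, different actions, within the investigation window.
    "user_activity": """
        SELECT time, action, src_ip
        FROM auth_events
        WHERE user_name = %(user)s
          AND time BETWEEN %(start)s AND %(end)s
        ORDER BY time
    """,
    # Same behavior, different users: who else did this?
    "action_by_others": """
        SELECT time, user_name, src_ip
        FROM auth_events
        WHERE action = %(action)s
          AND user_name != %(user)s
          AND time BETWEEN %(start)s AND %(end)s
    """,
}


def run_pivot(cursor, name: str, **params):
    cursor.execute(PIVOTS[name], params)
    return cursor.fetchall()
```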
It’s also valuable to determine relationships between historical incidents or previous alerts for involved entities, which should result in exceptions, rule suppressions, or net-new rules to build. Rule performance metrics can also live in this layer to help calculate the ROI of our Playbooks. Rapid7 has a great article about measuring Detection Efficacy and calculating additional metrics that precede it.
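As a small example of what a rule-performance metric might look like in code (simple alert precision here; the Rapid7 article covers richer measures):

```python
# Simple per-rule precision: what fraction of a rule's alerts turned out
# to be real incidents? The example numbers are hypothetical.
def rule_precision(true_positives: int, total_alerts: int) -> float:
    """Fraction of alerts that were real incidents; 0.0 if the rule is silent."""
    return true_positives / total_alerts if total_alerts else 0.0


# Example: 6 confirmed incidents out of 40 alerts last quarter -> 0.15.
# A consistently low score argues for tuning, suppression, or retirement.
print(rule_precision(6, 40))
```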
With a visualization framework, the Presentation Layer enables data exploration and opens up threat-hunting and tagging opportunities. Analysts should also be able to understand the data's provenance, see what metadata is available to search with (via the Data Layer), and shape the output into the most usable form for the team's consumption.
Automation in this layer is also being explored through the introduction of GenAI analysts.
Stay Tuned for Part 2!
In Part 2, we’ll cover:
Case Tracking
Remediation and Containment
Cover photo by Clark Van Der Beken on Unsplash