The Five Layers of Incident Response (Part 3)
Lessons from the field in structuring a modern, code-driven, and data-centric incident response program.
Welcome to Detection at Scale, a weekly newsletter for SecOps practitioners covering detection engineering, cloud infrastructure, the latest vulns/breaches, and more.
We want your feedback about Detection at Scale! It will take < 1 minute and will help improve this newsletter. Thank you!
This week, we conclude a three-part series about the Layers of Code-Driven Incident Response, co-authored with Jeff Bollinger, LinkedIn's Director of Detection Engineering and Incident Response.
In the first two parts of the series, we covered the five layers of incident response: Playbook, Data, Presentation, Case Tracking, and Remediation and Containment.
In this post, we will cover best practices for implementing this approach.
Operating with intention and measurable goals is crucial to making progress in life. It is also central to building a successful security operations program.
Security teams are shifting from generic, spray-and-pray SIEM alerting to tailored, risk-oriented, focused security monitoring. It’s the only way to operate effectively at scale as businesses grow, data demands increase, and our time becomes more scarce.
This transition to a modern Playbook strategy might seem daunting, but this post can help you get started and avoid potential pitfalls early on.
#1 Build the Critical Playbooks
Build your initial playbook based on the scariest scenarios.
If you are affected by this scenario in the future, you’ll have a defined response and remediation plan in place. To build this playbook effectively, you’ll need a deep understanding of the surrounding systems and how to break into them with a red-teamer mindset. That combined knowledge and effective pressure testing are the key to building great monitoring controls.
Once the playbook is ready, get buy-in from your team and management by demonstrating the process flows required and the commitments necessary for response and containment. Also, shop it around with other stakeholder teams, like engineering. This will help build a deeper technical context, improve preparedness, confirm assumptions, and improve your confidence with an accurate and comprehensive response to your organization’s most concerning security threats.
Then, continue developing the next playbook. Only focus on what’s most important from an organizational risk and threat modeling perspective. By being intentional and pragmatic, you can improve the quality of your signal and response plans and prevent unnecessary toil and alert fatigue.
#2 Monitor in-depth
Add redundant logging layers near critical assets as a failsafe.
Data pipeline errors can occur for various reasons: unexpected schema changes, a log source gets accidentally disabled/deleted, capacity concerns, network changes, or permission issues cause failures.
Monitoring in depth can also help boost redundancy in detection by observing similar tactics or techniques from multiple perspectives. For example, simultaneously monitoring traffic from the network (layer 3) and operating system (layer 7) provides richer context about the activity with complementary features like IP addresses, user agents, software versions, hostnames, etc. Threading the features or entities across multiple detection layers provides a clearer picture of the activity.
Ensure that your operational due diligence is in place through metric alarms that monitor for significant log volume dips, permissions errors, schema errors, or other issues. Consider automating routine troubleshooting actions like gathering debug logs or restarting a service or container to ensure that humans are only involved when there is an unusual or systemic problem.
#3 Clearly divide IR responsibilities across functions
The complete incident lifecycle requires diverse skill sets across the board. Given the potential impact of some larger incidents, it is best to divide responsibilities across the lifecycle to encourage scalability. So, no matter which IR framework you choose—this series’ IR Playbook strategy, NIST, or SANS—build each layer as a team capability. For example:
Whether incident responder, SOC analyst, or detection engineer, there are roles to play at all phases of the incident lifecycle, and ensuring preparedness requires understanding what you are accountable for before the incident begins.
#4 Design your program on software principles
Create a culture of “everything as code” to foster a consistent, scalable, and reliable program. This ensures transparency, shared accountability, and easier integration with other technology and engineering organizations and products.
Code means any state, condition, or task declared and managed with a version code system like Git. This means playbooks and dependencies are declared in code and checked into a shared repository as a single source of truth tracking who changed what, when, and why. Peer review is structured through pull requests. Deployments and testing are automated end-to-end to your SIEM.
Your response team, analysts, and detection engineers will contribute feedback over time as the detections produce results, creating an opportunity for iteration and improvement through updates and versioning based on tested use cases.
Ultimately, your program becomes agile, consistent, and scalable, benefiting from “everything as code.” It’ll also become easier to govern your program against accidental, destabilizing changes.
To get started, try using an approach similar to Palantir’s ADS or any vendor aligned on Detection and Response as Code. Try to avoid the concept of “ClickOps” as much as possible, meaning all changes go through Git and not a user interface. If they do need to go through a UI, be sure to push them into code.
#5 Iterate over time and avoid early complexity
Start with a simple tools stack and add components if they solve real problems.
After your critical playbooks are stable, add software as needed to improve the quality of alerts or drive new automation. Diving head-first into dozens of Threat Intelligence feeds and SOAR capabilities without a strong intention can add unnecessary complexity, cost, and work that may not return the investment.
Grow and measure the impact of your playbooks using SOC metrics like alert false positive or true positive rate, mean time to detect, or mean time to respond. Then, work to continuously optimize these values over time to meet your organization's risk tolerance and requirements.
Conclusion
Making the jump to a modern IR playbook strategy is one you won’t regret. As the business climate has shifted to reward efficiencies, security teams can make an impact by taking an approach heavily influenced by automation and code. The techniques covered in this blog series can help future-proof your program by taking an extensible SOCless approach.
The best type of teams and programs are ones that are practical metric-driven and can continue operating harmoniously and uninterrupted over a long period of time. We hope this series armed you with the knowledge to make the shift. Thank you for reading!
Photo by Clark Van Der Beken on Unsplash