Focus on processes and infrastructure during postmortems to avoid blaming individuals

Focus postmortems on operational processes and infrastructure, not on individuals. A blameless, systems-oriented review reveals gaps in runbooks, escalation paths, and automation, guiding improvements that strengthen incident response and prevent repeat issues while building a culture of learning and accountability.

When a production outage hits, the room tightens up. Phones buzz, dashboards glow, and everyone trades the same anxious glance: who did what, and why did it happen now? It’s tempting to start naming people, listing individual missteps, and assigning blame. But the real discipline of incident response isn’t about finger-pointing; it’s about learning. And learning happens best when we resist the pull of the fundamental attribution error—the tendency to blame a person rather than the system that set the stage for the issue.

Let me explain what that bias looks like in the heat of a postmortem. If we zero in on “the actor,” we miss the broader picture: how the operational processes failed, where the infrastructure lacked a safeguard, or how a small configuration drift snowballed into a service outage. When we evaluate an outage through the lens of people alone, we flatten complexity, derail useful improvements, and create a culture where people fear mistakes rather than embracing the hard work of fixing them. And that fear, honestly, makes future incidents more likely.

The truth is simple: the right focus is the operational processes and the infrastructure that support incident response. Why? Because those are the levers you can actually adjust. People are always part of the picture, yes, but the system—the way alerts are generated, how on-call rotations are arranged, how runbooks guide responses, and how changes are rolled out—determines whether a fault cascades or is contained.

What to focus on during a postmortem (the systemic lens)

When you sit down to review an outage, shift the conversation toward these systemic areas:

  • Alerting and monitoring hygiene: Were alerts actionable? Were we drowning in noise or missing signals? Look at how thresholds were set, how alert fatigue affected response speed, and whether monitoring coverage caught the problem early or only after it escalated. (A small sketch of this kind of noise-reduction logic appears after this list.)

  • On-call processes and escalation: How was the incident detected, triaged, and escalated? Were the on-call rotations balanced? Did the right people get paged at the right times, or did a gap in coverage contribute to delay?

  • Runbooks and response playbooks: Do we have clear, actionable steps for common failure modes? Are the steps current with recent changes in the system, or did a drift in tooling derail the recommended path?

  • Incident coordination and collaboration: How did teams communicate during the incident? Were there silos that hindered shared understanding? Was information flowing fast enough to support a timely resolution?

  • Change management and dependency risk: Did a recent change play a role? Were external dependencies stable, and did we have visibility into those risks before the incident happened?

  • Infrastructure and architectural fragility: Are there single points of failure, brittle integrations, or insufficient redundancy? How do the components talk to each other when load spikes?

In short, assess the system’s health: the processes that orchestrate the response, the tools that empower it, and the infrastructure that underpins it. That’s where you’ll uncover meaningful opportunities for improvement.
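
To make the first of those areas concrete, here is a minimal, illustrative sketch of the kind of noise-reduction logic a review might examine: an alert pages only when a signal stays above its threshold for several consecutive readings, rather than on every momentary spike. This is a toy example in Python; the rule name, threshold, and sample counts are invented for illustration and don’t come from any particular monitoring tool.

    from dataclasses import dataclass

    @dataclass
    class AlertRule:
        """A toy alert rule: page only when a metric stays above the
        threshold for `sustained_samples` consecutive readings."""
        name: str
        threshold: float
        sustained_samples: int

    def should_page(rule: AlertRule, readings: list[float]) -> bool:
        """Return True if the most recent readings breach the rule.

        Requiring a sustained breach (not a single spike) is one way a
        postmortem might reduce alert noise without missing real signals.
        """
        if len(readings) < rule.sustained_samples:
            return False
        recent = readings[-rule.sustained_samples:]
        return all(value > rule.threshold for value in recent)

    # Hypothetical p95 latency readings in milliseconds.
    latency_rule = AlertRule(name="p95-latency-ms", threshold=500.0, sustained_samples=3)
    print(should_page(latency_rule, [220, 610, 180, 590]))  # False: isolated spikes
    print(should_page(latency_rule, [220, 610, 650, 700]))  # True: sustained breach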

A practical way to frame the postmortem

Let’s keep the structure simple but powerful. A blameless postmortem works best when it mirrors how you actually work during an incident (a minimal sketch of this structure follows the list):

  • Start with a neutral, factual timeline. What happened, in what order, and what did each team observe? The timeline should avoid naming individuals and focus on actions, events, and signals.

  • Then map each event to a systemic factor. For every step, ask: what process or infrastructure enabled this step, and how could we redesign it to reduce risk next time?

  • Label root causes as systemic, not personal. Instead of “Team X should have noticed earlier,” frame it as “Our alerting logic and escalation policy could be refined to surface critical signals sooner.”

  • Propose concrete changes. Each recommended improvement should target a process, a runbook, a monitoring tweak, or an architectural safeguard. Include owners and a realistic timeline.

  • Close with learning and accountability, not blame. Emphasize what the organization will change, how success will be measured, and how the learning will be shared.
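
One way to keep that structure honest is to write it down in a consistent shape. The sketch below is a hypothetical, minimal postmortem record in Python: a neutral timeline of observations, each tied to a systemic factor, plus follow-up actions with owners, target dates, and a verification step. The field names and the example content are assumptions made up for illustration, not a standard template.

    from dataclasses import dataclass, field
    from datetime import date

    @dataclass
    class TimelineEvent:
        when: str             # e.g. "14:02 UTC"
        observation: str      # factual and nameless: "error rate crossed 5%"
        systemic_factor: str  # the process or infrastructure element involved

    @dataclass
    class ActionItem:
        change: str           # concrete process, runbook, or infrastructure change
        owner: str            # a team or role, not a scapegoat
        due: date
        verification: str     # how we will confirm the fix worked

    @dataclass
    class Postmortem:
        incident_id: str
        summary: str
        timeline: list[TimelineEvent] = field(default_factory=list)
        root_causes: list[str] = field(default_factory=list)  # phrased systemically
        actions: list[ActionItem] = field(default_factory=list)

    # A tiny, made-up example:
    pm = Postmortem(
        incident_id="2024-07-checkout-latency",
        summary="Checkout latency degraded during peak traffic.",
        timeline=[
            TimelineEvent("14:02 UTC", "p95 latency crossed 2s", "alert threshold too coarse"),
            TimelineEvent("14:19 UTC", "first page sent", "escalation policy delayed paging"),
        ],
        root_causes=["Alerting and escalation surfaced the critical signal 17 minutes late."],
        actions=[
            ActionItem(
                change="Alert on a sustained p95 latency breach at 1s instead of 2s",
                owner="observability team",
                due=date(2024, 8, 15),
                verification="Replay the incident's metrics in staging; the alert fires within 3 minutes",
            )
        ],
    )
    print(len(pm.timeline), "timeline events;", len(pm.actions), "follow-up actions")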

A few concrete practices to keep the focus on systems, not people

  • Write the timeline in one go, then step back. Don’t rush to conclusions. A calm re-review often reveals that several small changes would have altered the outcome.

  • Separate personal from process discussions. Use neutral language like “the checklists didn’t cover this scenario” rather than “typo by the engineer.” It keeps the focus on improvement.

  • Use evidence, not impressions. Collect metrics, logs, and runbooks as you talk. If a tool didn’t surface a critical alert, ask why and how the tooling can be adjusted.

  • Create actionable, testable follow-ups. Don’t stop at “fix it later.” Define specific changes, owners, and how you’ll verify the fix, ideally through a test or a staged rollout (see the closure check sketched after this list).

  • Share the learning broadly. A single incident should benefit the whole organization. Create a concise memo or an internal write-up that peers can reference when similar situations arise.
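
To illustrate that closure check, here is a small, hypothetical gate that refuses to call a postmortem done while any follow-up is missing an owner or a way to verify the fix. The structure and field names are invented for this sketch, not taken from any real tool.

    from dataclasses import dataclass

    @dataclass
    class FollowUp:
        change: str         # the concrete process or infrastructure change
        owner: str          # a team or role responsible for landing it
        verification: str   # how the fix will be confirmed (test, staged rollout, metric)

    def ready_to_close(follow_ups: list[FollowUp]) -> bool:
        """A postmortem is only 'done' when every follow-up names an owner
        and says how the change will be verified."""
        return bool(follow_ups) and all(
            item.owner.strip() and item.verification.strip() for item in follow_ups
        )

    items = [
        FollowUp("Add a cache hit-rate alert", owner="platform team",
                 verification="Synthetic load test in staging trips the alert"),
        FollowUp("Update the performance runbook", owner="", verification=""),
    ]
    print(ready_to_close(items))  # False: the second item isn't actionable yet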

How this looks in practice, with real-world flavor

Picture a midsize engineering team supporting a web app that suddenly slows during peak hours. The first instinct might be to wonder whether a particular developer pushed a questionable change. But a systemic lens quickly reveals a more reliable narrative: the monitoring didn’t flag the performance drift early, the runbook didn’t cover performance degradation in this subsystem, and the on-call rotation didn’t include anyone who understood the related caching layer.

From there, the team might decide to tighten the monitoring around cache efficiency, revise the runbook steps for performance incidents, and adjust the on-call policy to ensure experts for the caching layer are looped in promptly. The incident becomes a catalyst for a more robust signal chain, better runbooks, and a more resilient architecture—not a moment to assign fault to a person.

Another example: an outage traced to a misconfiguration in a deployment pipeline. It’s tempting to blame the engineer who merged the change. But a systemic view would examine the change control process, the testing coverage, and the rollback capabilities. Perhaps the pipeline needed a safer default, or the rollback path required clearer automation. The fix is then about improving the pipeline itself—guardrails, automated tests, and safer deployment steps—so similar mistakes are less likely to cascade again.
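
To make that pipeline example concrete, here is a minimal, hypothetical sketch of the kind of guardrail such a review might produce: after a deploy, the pipeline samples a health signal and rolls back automatically on the first unhealthy reading. The function names, check counts, and intervals are invented for illustration; a real pipeline would wire these hooks into its own deploy and monitoring tooling.

    import time
    from typing import Callable

    def deploy_with_rollback(
        deploy: Callable[[], None],
        rollback: Callable[[], None],
        is_healthy: Callable[[], bool],
        checks: int = 5,
        interval_seconds: float = 30.0,
    ) -> bool:
        """Deploy, then watch a health signal; roll back on the first bad reading.

        Putting the guardrail in the pipeline means a risky change is contained
        by the system rather than depending on any one person noticing in time.
        """
        deploy()
        for _ in range(checks):
            time.sleep(interval_seconds)
            if not is_healthy():
                rollback()
                return False
        return True

    # Usage sketch with stand-in callables (replace with real deploy and health hooks):
    kept = deploy_with_rollback(
        deploy=lambda: print("deploying new config"),
        rollback=lambda: print("rolling back to last known-good config"),
        is_healthy=lambda: True,
        checks=3,
        interval_seconds=0.1,
    )
    print("deploy kept" if kept else "deploy rolled back")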

The role of culture in sticking to the right focus

It’s hard to overstate this part. A culture that rewards quick blame will erode openness, and that undermines the very insights you need to improve. So, how do teams cultivate a blameless, learning-first mindset?

  • Normalize postmortems as a learning ritual. Treat every incident as an opportunity to strengthen the system, not as a stress test of individuals.

  • Lead with transparency. Leaders should model the behavior: acknowledge systemic gaps, share what’s being changed, and celebrate improvements rather than hunting for scapegoats.

  • Protect the information flow. Ensure that what’s discussed during a postmortem stays focused on system health and is used to inform better practices—not to punish.

A few words about the human side

People bring passion and expertise to incident response, and that’s precious. When we keep the spotlight on systems rather than souls, we protect that energy. It’s not about shying away from accountability; it’s about channeling accountability into concrete improvements. The result? Fewer recurring incidents, faster resolutions, and a culture where engineers feel safe to experiment, fail, and learn—without fear of blame.

How PagerDuty and similar tools support the right focus

If you’re using modern incident response platforms, you have powerful capabilities at your disposal to keep the focus where it belongs:

  • Incident timelines and collaboration threads help you reconstruct events with a systems view rather than a narrative about individuals.

  • Runbooks and knowledge bases provide the procedural guardrails teams can lean on during chaos, ensuring responses stay consistent and scalable.

  • Automated escalation policies and on-call routing reduce alert fatigue and get the right people involved promptly. (A rough sketch of what such a policy encodes follows this list.)

  • Post-incident templates encourage uniform reporting, linking incident data to concrete process and infrastructure improvements.

  • Integrations with monitoring and tracing tools give you richer context for root-cause analysis, so you’re not guessing about what failed.
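
As a rough, vendor-neutral illustration of what an escalation policy encodes, here is a hypothetical sketch: ordered escalation levels with targets and delays, plus a lookup for who should have been paged a given number of minutes into an unacknowledged incident. Real platforms configure this declaratively through their own interfaces; the structure below is invented purely to show the idea.

    from dataclasses import dataclass

    @dataclass
    class EscalationLevel:
        delay_minutes: int   # minutes after the incident opens, still unacknowledged
        targets: list[str]   # on-call schedules or roles to page at this level

    # Hypothetical policy: page the primary on-call immediately, the secondary
    # after 10 unacknowledged minutes, then an engineering manager after 25.
    policy = [
        EscalationLevel(delay_minutes=0, targets=["primary on-call"]),
        EscalationLevel(delay_minutes=10, targets=["secondary on-call"]),
        EscalationLevel(delay_minutes=25, targets=["engineering manager"]),
    ]

    def paged_so_far(policy: list[EscalationLevel], minutes_unacked: int) -> list[str]:
        """Everyone who should have been paged by now under this policy."""
        paged: list[str] = []
        for level in policy:
            if minutes_unacked >= level.delay_minutes:
                paged.extend(level.targets)
        return paged

    print(paged_so_far(policy, minutes_unacked=12))
    # ['primary on-call', 'secondary on-call']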

The bottom line

A postmortem anchored in the operational processes and infrastructure is a powerful antidote to the fundamental attribution error. It shifts the focus from “who caused this” to “what systemic weaknesses let this happen, and how do we fix them?” That shift isn’t merely theoretical; it translates into tangible improvements: smarter alerting, clearer runbooks, more resilient infrastructure, and a more confident, collaborative on-call culture.

If you’re involved in incident response, and you want to reduce the chance of repeating the same mistakes, start with the question that matters most: what in our systems made this happen, and how can we change the setup to keep it from happening again? The answers won’t always be glamorous, but they’ll be practical, measurable, and, most importantly, human-friendly. After all, the goal isn’t to assign blame—it’s to build a more reliable product and a team that learns together.

So, next time a disruption hits, use this lens. Let the focus be on processes, patterns, and infrastructure. Let the team gain momentum from clear, actionable improvements. And yes, let the learning ripple outward—across teams, across responsibilities, and across the entire reliability journey. That’s how you turn a scare into a stronger, steadier system—and how you keep real progress moving, one well-aimed change at a time.
