Running human-focused postmortems



Incidents and postmortems can be stressful, but there is a way to resolve issues with empathy for everyone involved.

Developing software can sometimes feel like rowing a boat through a swirling maelstrom, trying to stay afloat through a constantly changing set of circumstances.

Issues are inevitable in this line of work; what matters is how we handle them. Great DevOps teams know the answer lies in studying how engineers respond to incidents. The lessons learned there are a lens into the wider organization, and they provide a path to building more stable systems and less stressed teams.

Viewing incidents as investments

Let’s look at an incident I recently helped investigate. A key piece of our architecture is our overnight extract, transform, and load (ETL) data system. The details of the incident are fairly simple: none of the ETL jobs ran overnight because someone had turned the system off. To resolve things, we needed to turn it back on.

With the issue resolved, we could have gone back to our regular day-to-day work. But that would have meant missing out on a valuable learning opportunity. Instead, we should view incidents as investments, not catastrophes.

And if that investment has already happened, how can we aim to get the most value out of that work? An incident means that our mental model of our system missed something. What can we learn from the incident? What are the best ways to fuel that learning?

Most companies already include investigation as part of their incident process. It’s common practice to have a postmortem meeting, where incidents are discussed. It’s rarer to have a dedicated investigation phase, where a site reliability engineer (SRE) digs through logs, reads chat messages, and interviews participants to learn as much as possible about the incident.

Not every incident needs a deep investigation and review though, and like any other aspect of engineering, it’s up to the team to determine where to invest their effort. Let’s start with postmortems.

A brief history of postmortem strategies

A long time ago (in SRE years) a blameful postmortem would have pinpointed the engineer responsible for turning off the ETL system and concluded that they aren’t to be trusted with on/off switches. This focus on human error not only misses the deeper questions, but also discourages future incident response, as teams are more focused on avoiding blame than solving problems.

Thankfully, we’ve largely moved past that point as an industry. Most companies have embraced Blameless Postmortems, where the focus is on what happened and not who did what. A blameless postmortem lets us analyze and learn from the technical side of our systems. In the case of the missing ETL jobs, we could learn why the jobs were turned off in the first place and how that mechanism works.

However, such an investigation often misses the deeper insights. Humans are constantly meddling with the system, deploying changes, flipping feature flags, or even changing the base infrastructure. How do we learn from that?

The cutting edge of incident investigation has now shifted towards Blame-Aware Postmortems, where we acknowledge that these discussions can tend toward blame and actively try to counter that instinct. The keystone of blame-aware thinking is to assume that everyone involved in the incident did the best they could with the information available to them at the time.

Putting things into action

And you may ask yourself: “How did I get here?” – Talking Heads, Once In A Lifetime

In practical terms, running a human-focused postmortem starts well before the actual meeting. By reading through the chat logs from the incident ahead of time, you can identify key themes, the points where people got stuck, and any potentially sensitive issues.
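If the incident chat can be exported, a small script can help with that first read-through. The sketch below is only an illustration: it assumes a hypothetical export with one message per line in the form `ISO timestamp<TAB>author<TAB>text`, and the names `Message`, `parse_line`, and `find_quiet_gaps` are invented for this example. It orders the messages into a timeline and flags long silences, which may (or may not) mark moments where responders were stuck. The sample messages are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    sent_at: datetime
    author: str
    text: str


def parse_line(line: str) -> Message:
    """Parse one line of the assumed 'timestamp<TAB>author<TAB>text' export."""
    stamp, author, text = line.rstrip("\n").split("\t", 2)
    return Message(datetime.fromisoformat(stamp), author, text)


def find_quiet_gaps(messages, threshold=timedelta(minutes=10)):
    """Return (earlier, later) message pairs separated by more than `threshold`."""
    ordered = sorted(messages, key=lambda m: m.sent_at)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later.sent_at - earlier.sent_at > threshold
    ]


# Made-up sample export, for illustration only.
sample_export = [
    "2024-01-01T08:00:00\talex\tETL dashboard is empty this morning",
    "2024-01-01T08:03:00\tsam\tLooking at the scheduler now",
    "2024-01-01T08:25:00\tsam\tScheduler was switched off, turning it back on",
]

messages = [parse_line(line) for line in sample_export]
gaps = find_quiet_gaps(messages)

for earlier, later in gaps:
    minutes = int((later.sent_at - earlier.sent_at).total_seconds() // 60)
    print(f"{minutes} min of silence after: {earlier.text!r}")
```

A flagged gap is a prompt for an open-ended question in the meeting, not evidence about any one person. Silence in chat can just as easily mean people were on a call or heads-down in a terminal.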

Now you can start the meeting by outlining the timeline of the incident and some of the themes you found. This helps postmortem participants build empathy with those at the sharp end of the incident.

With the stage set, use blame-aware thinking to prompt discussion. Open-ended questions help discussion flow and give an opportunity for different voices to be heard. One trick is to avoid zooming in on any particular individual and instead focus on the system itself.

A good starter question might be: “Why does our system rely on one individual to remember to do the right thing?” Let the discussion flow and don’t go too deep into any particular topic or solution. Make sure to thank people for contributing. Speaking up in these meetings can be hard.

One question I make sure to ask in every postmortem is, “What surprised you?” This is a great way to identify knowledge gaps and assumptions among the team, and it gives people the floor to talk about what they’ve learned.

It’s also important to remember that your goal with this meeting is learning, and we can’t learn from hypotheticals. Don’t focus too much on action items, as these often end up being overfitted alerts or tickets addressing edge cases that fall into the backlog never to be touched again.

Rewards

All of this is a lot of work, but it’s worth it! Running human-focused postmortems centered around learning helps us discover the edges of our systems as we scale. These postmortems are also a helpful onboarding tool: employees quickly pick up practical lessons about how our systems operate and become less afraid to speak up when things go haywire.

Save yourself from the tedious postmortems of old, where you read a timeline and create the same old tickets. Now, by asking questions, you might just find some surprising answers.
