So, what is a postmortem?
As defined in Google’s SRE book, a postmortem is “a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.”
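To make that definition concrete, here is a minimal sketch of a postmortem as a structured record. The class name, field names, and the `is_complete` helper are assumptions made for illustration; they are not taken from the SRE book or from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Postmortem:
    """One written record per incident. Field names are hypothetical."""

    title: str
    detected_at: datetime
    resolved_at: datetime
    impact: str = ""
    mitigation_steps: List[str] = field(default_factory=list)
    root_causes: List[str] = field(default_factory=list)
    follow_up_actions: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True once every part of the quoted definition has been filled in."""
        return all(
            [
                self.impact,
                self.mitigation_steps,
                self.root_causes,
                self.follow_up_actions,
            ]
        )
```

A record like this could back a simple template or checklist: a postmortem is not ready for review until `is_complete()` returns `True`.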
Translating that to the real world, postmortems are a critical part of the incident review process. They are an active record of where we made mistakes as a company and where we can do better. By detailing what caused the incident, we can better understand how to resolve it. As the name implies, postmortems help us fully understand the reason behind a “deadly” issue.
Postmortems also keep us focused on how to improve: either by making sure the incident never happens again or, if it does, by having better action plans in place to minimize our mean time to resolution.
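Mean time to resolution is simply the average time between detecting an incident and resolving it. As a rough sketch, assuming each incident is recorded as a (detected, resolved) pair of timestamps, it could be computed like this. The function name and the sample incidents are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) across a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Start the sum from an empty timedelta so durations add up correctly.
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Illustrative data only: one 90-minute incident and one 30-minute incident.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 10, 30)),
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 14, 30)),
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Tracking this number across postmortems is one way to see whether the action plans are actually shortening recovery.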
How postmortems can keep you accountable – without throwing blame
While we can mitigate risk when we’re working on anything in the software realm, we can’t remove it entirely. Postmortems are a great way to publicly identify issues without throwing blame at a single person or telling an engineer that their job is on the line for mistakes or outages. They shift the onus from the individual to the company level: it’s not about the individual who pushed the code. Sometimes there are landmines, and you just happen to be the person walking on that path.
And if we tell our engineers that we blame them for every incident, we’re not going to see meaningful progress or development, simply because they won’t want to take necessary risks. As a company, we need to embrace the idea that an outage is everyone’s responsibility.
By using a clear feedback or postmortem approach, you create space for accountability and growth. You can structure feedback in a much more productive way and help your development team improve overall by prioritizing common issues.
How do you choose action items from postmortems?
Write everything down and think of it like a brainstorming session. The only way to fix issues is to first identify them, then break them down into smaller parts and figure out how to manage each piece.
As a team, determine which parts are a priority and set actionable, achievable timelines to make those things happen. As for how to pick priorities, that’s where working with your team or company helps. One person might not know that a problem is linked to another area, so value and encourage that transparency and communication. A postmortem is no good if you don’t actually work towards solving the issue behind the incident.
Once you figure out what you want to do, put those action items into whatever ticketing system you and your team typically use.
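As a rough illustration of that workflow, the sketch below orders brainstormed action items by priority and due date before they are filed. The `ActionItem` fields, the 1-to-3 priority scale, and the sample items are all assumptions; map them onto whatever your own ticketing system expects.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    summary: str
    owner: str
    priority: int  # assumed scale: 1 = most urgent, 3 = least urgent
    due: date


def to_ticket_queue(items):
    """Order items by priority first, then by due date, ready for filing."""
    return sorted(items, key=lambda item: (item.priority, item.due))


# Hypothetical items from a brainstorming session.
items = [
    ActionItem("Document rollback steps", "sam", 2, date(2024, 3, 15)),
    ActionItem("Add alert on error rate", "alex", 1, date(2024, 3, 8)),
    ActionItem("Review deploy checklist", "kim", 2, date(2024, 3, 1)),
]

for item in to_ticket_queue(items):
    print(item.priority, item.due, item.summary)
```

Giving every item an owner and a due date keeps the "actionable, achievable timelines" part honest: an item with neither rarely gets done.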
Can this help our team reduce tech debt or better hit SLAs?
Tech debt can manifest itself in many ways, and some of it is simply a product of velocity. Managing code is often harder than it needs to be when a lot of people have worked on it. Postmortems are a good way to reduce the time it takes to resolve issues, since you break down the issue and are prepared if it happens again, but there are all sorts of tech debt that they don’t touch.
Long deployments and long builds, for example, could be considered tech debt, and neither is really improved by this process. So, though others might argue, postmortems may not affect SLAs in that way. But! Teams are able to chase down potential bugs or fixes through this process, and that’s incredibly valuable, even if it is an indirect result.
So, when you have an incident, should you do a postmortem? Yes. Absolutely!