
Re-framing how we think about production incidents

How to deal with production incidents

This post lists the key insights from this article by Shubheksha, a backend engineer on the Fincrime team at Monzo.


The goal of this post is to offer some pointers on how teams can help new engineers deal with production incidents, and on how folks early in their career can re-frame the way they think about and handle production outages.

Re-assurance from senior members of the team goes a long way

At her second job, the first big change the author worked on broke something in production. As soon as her change was identified as the cause, her tech lead sent her a reassuring message.

[Screenshot: the tech lead's reassuring message]

Not all bugs are equal

At Monzo, they believe in shipping small, incremental changes regularly, and their platform enables them to do exactly that. As a result, rolling back most changes is exceptionally easy: you just need to run a single command to revert the change.

However, not all bugs are equal. It’s important to design systems and processes that make it difficult to ship large bugs to production. Even so, bugs will always slip through, and it’s invaluable to have tools that let you resolve issues quickly and limit their impact.
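As an illustration of the “limit the impact” idea, here is a minimal Go sketch (not Monzo’s actual tooling; the flag name and code paths are hypothetical) of gating a risky change behind a runtime flag, so disabling it is as quick as flipping a value rather than shipping a revert:

```go
package main

import (
	"fmt"
	"os"
)

// featureEnabled reads a runtime flag so a risky code path can be
// switched off without a redeploy. In a real system this would come
// from a config service or feature-flag store rather than an env var.
func featureEnabled(name string) bool {
	return os.Getenv("FEATURE_"+name) == "on"
}

func handlePayment() {
	if featureEnabled("NEW_LEDGER_WRITE") {
		// New, riskier code path: easy to disable if it misbehaves.
		fmt.Println("writing via new ledger pipeline")
		return
	}
	// Old, known-good path stays as the fallback.
	fmt.Println("writing via existing ledger pipeline")
}

func main() {
	handlePayment()
}
```

Turning the hypothetical FEATURE_NEW_LEDGER_WRITE flag off has a similar effect to the single-command rollback described above: the blast radius stays small and recovery is quick.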

Talk about the stuff you fucked up

This can take the form of in-depth debriefs after an incident (depending on how major it is), knowledge-sharing sessions, detailed incident reports, and so on. Detailed investigation reports are especially useful here: they pass on context that usually lives inside people’s heads to folks new to the team, and they fill gaps in documentation.

At Monzo, they recently started doing weekly lightning talks and one of the first ones was about “Stuff I broke in prod”.

Communication is key

When you know you’ve messed something up, the most important thing is to ask for help. At the end of the day, managing incidents is a team sport, not a one-person show. If there’s one thing you take away from this blog post, let it be this.

Treat it as a learning opportunity

It seems like a no-brainer, but here is some insight into how the author likes to think about it. When something breaks in production, she usually has multiple takeaways:

  • You learn to reason about the code better
  • You learn how to debug the problem (even better if it’s other people fixing it)
  • It also gives the broader engineering team insight into how to make its processes more resilient to failure.

It’s never just one person’s fault

When something goes wrong, chances are that even though you wrote the code, someone else reviewed it, and they didn’t catch the bug either. This isn’t an invitation to shift the blame onto them for whatever went wrong; it’s another way to look at the same thing: humans are flawed and will make mistakes no matter how perfect the tooling or automation. The goal isn’t to never make mistakes, it’s to learn from them and to avoid making the same mistake twice.

Engineering culture sets the tone for everything

How an individual reacts in such a situation is inherently tied to the team dynamics and the engineering culture of the organisation. A blameless culture, where people feel safe admitting that they messed something up, underpins and is a hallmark of a good engineering team.

Instead of treating outages as an invitation to throw an individual under the bus, we should treat them as opportunities to make our systems better, more robust, and more resilient to failure.


Subscribe to our newsletter Weekly Bytes to get our curation of the top 5 articles of the week.

