blameless postmortem

description: an analysis or discussion following an incident, focusing on learning rather than assigning blame

9 results

pages: 719 words: 181,090

Site Reliability Engineering: How Google Runs Production Systems
by Betsy Beyer , Chris Jones , Jennifer Petoff and Niall Richard Murphy
Published 15 Apr 2016

Postmortem and Root-Cause Analysis
We aim to be alerted on and manually solve only new and exciting problems presented by our service; it’s woefully boring to “fix” the same issue over and over. In fact, this mindset is one of the key differentiators between the SRE philosophy and some more traditional operations-focused environments. This theme is explored in two chapters. Building a blameless postmortem culture is the first step in understanding what went wrong (and what went right!), as described in Chapter 15, Postmortem Culture: Learning from Failure. Related to that discussion, in Chapter 16, Tracking Outages, we briefly describe an internal tool, the outage tracker, that allows SRE teams to keep track of recent production incidents, their causes, and actions taken in response to them.

The ideal methodology in incident management strikes the perfect balance of taking steps at the desired pace when enough data is available to make a reasonable decision while simultaneously critically examining your assumptions. It’s important that on-call SREs understand that they can rely on several resources that make the experience of being on-call less daunting than it may seem. The most important on-call resources are:
• Clear escalation paths
• Well-defined incident-management procedures
• A blameless postmortem culture ([Loo10], [All12])
The developer teams of SRE-supported systems usually participate in a 24/7 on-call rotation, and it is always possible to escalate to these partner teams when necessary. The appropriate escalation of outages is generally a principled way to react to serious outages with significant unknown dimensions.

• A resolution time above some threshold
• A monitoring failure (which usually implies manual incident discovery)
It is important to define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary. In addition to these objective triggers, any stakeholder may request a postmortem for an event. Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had.
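One way to make such criteria explicit before an incident occurs is to write them down as a small check. The following is only a sketch, not anything taken from the book: the Incident record, its field names, and the one-hour threshold are all assumptions chosen for illustration, and it covers only the two triggers visible in this excerpt plus the stakeholder request.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    resolution_time: timedelta
    detected_by_monitoring: bool
    stakeholder_requested: bool = False


# Assumed value; the excerpt only says "some threshold".
RESOLUTION_THRESHOLD = timedelta(hours=1)


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return the triggers this incident meets (empty list: none met)."""
    reasons = []
    if incident.resolution_time > RESOLUTION_THRESHOLD:
        reasons.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        # A monitoring failure usually implies manual incident discovery.
        reasons.append("monitoring failure (manual discovery)")
    if incident.stakeholder_requested:
        reasons.append("stakeholder request")
    return reasons
```

Returning the list of reasons, rather than a bare yes or no, keeps the check focused on what happened instead of on who was involved.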

Seeking SRE: Conversations About Running Production Systems at Scale
by David N. Blank-Edelman
Published 16 Sep 2018

SRE can fit into your organization if you have enough discipline to trust the process. At first, it seems like SRE establishes another set of constraints. But, actually, people gain freedom when they adopt practices like SLOs and blameless postmortems. An SLO is a codification of clear expectations. It allows local optimizations and empowers people to reason about the reliability impacts of their decisions. A blameless postmortem is explicit permission to point out things that went wrong. It allows you to begin to address the root cause of your reliability issues. These are powerful tools. Like all good tools, they require training and maintenance.

Learnings: TL;DR
If you want to simplify everything from this chapter into a sentence, here it is: measure everything, be relentlessly curious, and don’t be afraid to get dirty and wallow in your data to find the right actions to take. In many cases, getting these insights required hand curating a fair bit of data, but after we understood which metrics mattered, we could then instrument and automate them and help bring visibility to metrics that could help the services get better.
Further Reading
Blameless postmortems:
• “Blameless PostMortems and a Just Culture”: John Allspaw, Etsy
• “Postmortem Action Items: Plan the Work and Work the Plan”: Sue Lueder and Betsy Beyer, Google
• Beyond Blame: Learning from Failure and Success: Dave Zwieback
Using data to derive operational insights:
• “Improving Operations Using Data Analytics”: Parviz Deyhim and Arti Garg, Datapipe
• “Incident Analysis”: Sue Lueder, Google
• “Measuring the Success of Incident Management at Atlassian”: Gerry Millar, Atlassian
• “PDA: A Tool for Automated Problem Determination”: Hai Huang, Raymond Jennings III, Yaoping Ruan, Ramendra Sahoo, Sambit Sahu, and Anees Shaikh, IBM T.J.

Even so, there appears to be a growing cadre of practitioners and researchers adopting the mantle of cognitive systems engineer, many of them with direct experience in our industry. This bodes well for our collective future. References Allspaw, J. (2012). Blameless PostMortems and a Just Culture. (Etsy Code as Craft blog.) https://codeascraft.com/2012/05/22/blameless-postmortems/, accessed June 18, 2018. Bainbridge, L. (1983). Ironies of Automation. Automatica 19(6): 775–779. Billings, C. E. (1996). Aviation Automation: The Search for a Human-Centered Approach. Boca Raton, FL: CRC Press. Cook, R. I. (2010).

pages: 282 words: 85,658

Ask Your Developer: How to Harness the Power of Software Developers and Win in the 21st Century
by Jeff Lawson
Published 12 Jan 2021

In The Paper Chase, a first-year student who has taken an intellectual beating from Kingsfield in front of his peers races to the men’s room to get sick. To be clear: I definitely do not advocate going that far. But the same approach that’s used in complex graduate education can be applied to the task of training business leaders. It’s much more effective than sitting through a seminar or reading a book. The Blameless Postmortem We often talk about learning to make decisions in the context of business planning, but what about when things go wrong? You’ve been there—in a technical organization, it might be when the servers fail and the product suffers an outage. Outages aren’t the only kinds of failures, though: maybe it’s an M&A integration that went south, or a financial model that didn’t remotely pan out, or the mis-hire of an important leader.

When things go wrong, it’s either a time to blame, or a time to learn. I believe each failure is an opportunity to uncover deep learnings about how the organization operates, and what could strengthen it systematically, and then take action. We, and many other software companies, do this via a ritual called the “blameless postmortem.” The purpose of the blameless postmortem is to dig below the surface of some kind of bad outcome to the true root cause, and address that as an organization. Here’s how it works: Let’s take a common type of issue. A software developer introduces a bug into the code, which makes its way to the production servers and takes down the website.

By addressing the true root cause, you’re going to not just fix the cause of your last outage but also likely address the cause of your next outage. If you repeat this process long enough, you’ll systematically build a stronger and stronger organization. I used a technical example, because the practice of blameless postmortems is more common in technical organizations. However, I’ve seen this method applied to every part of our business and it works the same way. In 2010, a fledgling ten-person startup called Uber (actually UberCab at the time) became a Twilio customer. Over the years, they experienced meteoric growth and by the time we went public in 2016 they represented more than 10 percent of our revenue and were a prominent part of our IPO road show.

pages: 232 words: 71,237

Kill It With Fire: Manage Aging Computer Systems
by Marianne Bellotti
Published 17 Mar 2021

These practices helped Etsy scale its technology to 40 million unique visitors every month.7 If you want your team to be able to handle breaking things, pay attention to what the organization celebrates. Blameless postmortems and just culture are a good place to start, because they both manipulate how people perceive failure and establish good engineering practices.
Who Draws the Line?
But, can blameless postmortems ever really be blameless? In 2008, system safety researcher Sidney Dekker published an article titled “Just Culture: Who Gets to Draw the Line?”8 Dekker’s article addresses whether true “blameless” postmortems, where no one was ever punished for errors, are the desired end state of just cultures.

If a project is failing, you need to earn both the trust and respect of the team already at work to course-correct. The best way to do that is by finding a compounding problem and halting its cycle. If an organization is having too many meetings, cut all of them and gradually reintroduce them one by one. If career-minded leaders are damaging psychological safety, start educating people about blameless postmortems and just culture. Talk to people and observe how the team behaves as a unit. When you can, it is always better to set up someone else for victory rather than solving the problem yourself. The rest of this chapter describes various in-progress failures I have seen and what we did to pull the project out of a death spiral.

Team Topologies: Organizing Business and Technology Teams for Fast Flow
by Matthew Skelton and Manuel Pais
Published 16 Sep 2019

Coaching Agile Teams: A Companion for ScrumMasters, Agile Coaches, and Project Managers in Transition. Upper Saddle River, NJ: Addison-Wesley Professional, 2010. Allen, Thomas J. Managing the Flow of Technology. Cambridge, MA: MIT Press, 1984. Allspaw, John. “Blameless PostMortems and a Just Culture.” Code as Craft (blog), May 22, 2012. https://codeascraft.com/2012/05/22/blameless-postmortems/. Almeida, Thiago. “DevOps Lessons Learned at Microsoft Engineering.” InfoQ, May 22, 2016. https://www.infoq.com/articles/devops-lessons-microsoft. Ancona, Deborah Gladstein, and David F. Caldwell. “Demography and Design: Predictors of New Product Team Performance.”

InfoQ, April 5, 2018. https://www.infoq.com/presentations/observability-testing. Pink, Daniel. Drive: The Surprising Truth About What Motivates Us. New York: Riverhead Books, 2009. Raymond, Eric. The New Hacker’s Dictionary, 3rd Edition. Boston, MA: MIT Press, 1996. Reed, J. Paul. “Blameless Postmortems Don’t Work. Be Blame-Aware but Don’t Go Negative.” TechBeacon, March 22, 2016. https://techbeacon.com/blameless-postmortems-dont-work-heres-what-does. Reinertsen, Donald. The Principles of Product Development Flow: Second Generation Lean Product Development. Redondo Beach, CA: Celeritas Publishing, 2009. Rensin, Dave. “Introducing Google Customer Reliability Engineering.”

pages: 1,380 words: 190,710

Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems
by Heather Adkins , Betsy Beyer , Paul Blankinship , Ana Oprea , Piotr Lewandowski and Adam Stubblefield
Published 29 Mar 2020

When evaluating your organization’s responses to incidents and tests, consider the following best practices:
• Measure the responses. Evaluators should be able to identify what worked well and what did not. Measure the amount of time it took to implement each stage of the response so you can identify corrective measures.
• Write blameless postmortems and focus on how you can improve the systems, procedures, and processes.4
• Create feedback loops for improving existing plans or developing new plans as needed.
• Collect artifacts and feed them back into signal detection. Make sure you address any gaps you identify.
So that you can perform forensic analysis and address gaps, make sure you save the appropriate logs and other relevant material, especially when conducting security exercises.
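Measuring the time taken by each stage of a response can be as simple as subtracting consecutive timestamps from the incident timeline. The sketch below is illustrative only: the stage names and timestamps are invented, and it assumes the timeline is an ordered list of (stage, start time) pairs ending with a closing marker.

```python
from datetime import datetime


def stage_durations(timeline):
    """Return minutes spent in each stage.

    `timeline` is an ordered list of (stage_name, start_time) pairs; the last
    entry only marks the end of the previous stage and gets no duration.
    """
    durations = {}
    for (name, start), (_, next_start) in zip(timeline, timeline[1:]):
        durations[name] = (next_start - start).total_seconds() / 60
    return durations


# Hypothetical timeline for a single exercise.
timeline = [
    ("detection", datetime(2020, 1, 1, 9, 0)),
    ("triage", datetime(2020, 1, 1, 9, 20)),
    ("containment", datetime(2020, 1, 1, 10, 5)),
    ("resolved", datetime(2020, 1, 1, 12, 5)),
]
print(stage_durations(timeline))
```

Comparing these per-stage numbers across exercises is one way to see which stage most needs corrective measures.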

A strong postmortem covers technology issues that the attacker exploited, and also recognizes opportunities for improved incident handling. Additionally, you should document the time frames and efforts associated with these action items, and decide which action items belong to short-term versus long-term roadmaps. We cover blameless postmortems in detail in Chapter 15 of the SRE book, but here are some additional security-focused questions to consider: What were the main contributing factors to the incident? Are there variants and similar issues elsewhere in the environment that you can address? What testing or auditing processes should have detected these factors earlier?

The Concord Law School at Purdue University has written a good article on this trend.
5 See Chapter 31 in the SRE book.
6 This color scheme is derived from the US military.
7 For more on Purple Teams, see Brotherston, Lee, and Amanda Berlin. 2017. Defensive Security Handbook: Best Practices for Securing Infrastructure. Sebastopol, CA: O’Reilly Media.
8 Do so by building on a culture of blameless postmortems, as described in Chapter 15 of the SRE book.
9 See Google researcher sirdarckcat’s blog post on rewards for a more philosophical outlook.
10 For further reading, see sirdarckcat’s post about vulnerability pricing.
11 sirdarckcat has also written a post about vulnerability disclosure.
Chapter 21.

pages: 757 words: 193,541

The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2
by Thomas A. Limoncelli , Strata R. Chalup and Christina J. Hogan
Published 27 Aug 2014

., & Piatek, M. (2011). Thialfi: A client notification service for internet-scale applications, Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP), pp. 129–142. http://research.google.com/pubs/pub37474.html Allspaw, J. (2009). Blameless postmortems and a just culture. http://codeascraft.com/2012/05/22/blameless-postmortems. Allspaw, J. (2012a). Each necessary, but only jointly sufficient. http://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient. Allspaw, J. (2012b). Fault injection in production, Queue 10(8): 30:30–30:35. http://queue.acm.org/detail.cfm?

They should drive engineering efforts related to improving the situation. A culture of accountability, rather than blame, fosters an organization that values innovation. If blame is used to avoid responsibility, the whole team suffers. For more information about this topic, we recommend Allspaw’s (2009) article “Blameless Postmortems and a Just Culture.”
* * *
A Postmortem Report for Every High-Priority Alert
At Google many teams had a policy of writing a postmortem report every time their monitoring system paged the oncall person. This was done to make sure that no issues were ignored or “swept under the rug.” As a result there was no back-sliding in Google’s high standards for high uptime.

• Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Humble & Farley 2010) The canonical book about service delivery platforms.
• Release It!: Design and Deploy Production-Ready Software (Nygard 2007) Detailed coverage and examples of how to implement many of the ideas in Chapter 11.
• Blameless PostMortems and a Just Culture (Allspaw 2009) Theory and practice of postmortems.
• A Mature Role for Automation (Allspaw 2012c) Why “Automate Everything!” is bad, and what to do instead.
• Each Necessary, But Only Jointly Sufficient (Allspaw 2012a) Myths and limits of “root cause analysis” and “The Five Why’s.”

pages: 395 words: 110,994

The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win
by Gene Kim , Kevin Behr and George Spafford
Published 14 Jul 2013

I’m amazed at the difference two weeks can make. During the last Sev 1 incident that hit our credit card processing systems, the conference call was full of finger-pointing, denials, and, most importantly, wasted time when our customers couldn’t give us money. Afterward, we did the first of a series of ongoing blameless postmortems to figure out what really happened and come up with ideas on how to prevent it from happening again. Better yet, Patty led a series of mock incident calls with all hands on deck, to rehearse the new procedures. It was terrific to watch. Even Wes saw the value. I’m pleased to see all the e-mails indicating a lot of good information and effective discussion among the teams working the problem.

pages: 468 words: 124,573

How to Build a Billion Dollar App: Discover the Secrets of the Most Successful Entrepreneurs of Our Time
by George Berkowski
Published 3 Sep 2014

You can’t have one component fail and take down the entire system. Build resilience into your organisation, processes and systems. By hiring the best people – and ones with a variety of strong talents – you’re going to build in redundancy, and the ability to weather big team challenges, as Square was able to.
• Blameless postmortems are the key to learning from a tech-ops crisis. Fear-driven organisations do not scale. Calmness in chaotic situations is another trait learned over time. Blame is never helpful when working with others – and certainly not in public forums. While you should always keep track of how employees are performing, fear and blame are not strategies that build a great organisation