description: an analysis or discussion following an incident, focusing on learning rather than assigning blame
10 results
by David N. Blank-Edelman · 16 Sep 2018
. SRE teams interact with other teams, particularly development teams, so these need to be bought in, too. For example, it’s hard to create a blameless postmortem culture if developers or managers are pointing fingers. When paired with an SRE team, a development team should put something of value into the partnership
…
. Be alert for signs of maturity like focus on long-term benefits, incremental improvement, and acknowledgment of existing problems. If people can face problems, a blameless postmortem culture can take root. If they want to measure the pace of improvement, SLOs and error budgets will work. If they are willing to invest
…
the process. At first, it seems like SRE establishes another set of constraints. But, actually, people gain freedom when they adopt practices like SLOs and blameless postmortems. An SLO is a codification of clear expectations. It allows local optimizations and empowers people to reason about the reliability impacts of their decisions. A
…
blameless postmortem is explicit permission to point out things that went wrong. It allows you to begin to address the root cause of your reliability issues. These
…
which metrics mattered, we could then instrument and automate them and help bring visibility to metrics that could help the services get better. Further Reading Blameless postmortems: “Blameless PostMortems and a Just Culture”: John Allspaw, Etsy “Postmortem Action Items: Plan the Work and Work the Plan”: Sue Lueder and Betsy Beyer, Google Beyond Blame
…
top remediations that would reduce the likelihood of that type of incident by 90% in the future. Throughout this shift, we retained the principle of blameless postmortems, emphasizing the importance of learning from our mistakes, and ensuring that no one was thrown under the bus. Key Learnings Some of our key learnings
…
? There are many items that people think of when they think of reliability culture that are not specific to the database world, including the following: Blameless postmortems; Automating away repetitive work; Structured and rational decision making. This all makes sense, and everyone within an operations or SRE organization should constantly be working
…
them with direct experience in our industry. This bodes well for our collective future. References Allspaw, J. (2012). Blameless PostMortems and a Just Culture. (Etsy Code as Craft blog.) https://codeascraft.com/2012/05/22/blameless-postmortems/, accessed June 18, 2018. Bainbridge, L. (1983). Ironies of Automation. Automatica 19(6): 775–779. Billings, C
…
powerful, which in turn makes us less likely to “anchor” on specific scenarios to the exclusion of all else; reading John Allspaw’s materials on blameless postmortems can also help to shake out assumptions about what is actually at fault when things go wrong in distributed systems, and helps to increase self
by Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Richard Murphy · 15 Apr 2016 · 719pp · 181,090 words
one of the key differentiators between the SRE philosophy and some more traditional operations-focused environments. This theme is explored in two chapters. Building a blameless postmortem culture is the first step in understanding what went wrong (and what went right!), as described in Chapter 15, Postmortem Culture: Learning from Failure. Related
…
being on-call less daunting than it may seem. The most important on-call resources are: Clear escalation paths; Well-defined incident-management procedures; A blameless postmortem culture ([Loo10], [All12]). The developer teams of SRE-supported systems usually participate in a 24/7 on-call rotation, and it is always possible to
…
occurs so that everyone knows when a postmortem is necessary. In addition to these objective triggers, any stakeholder may request a postmortem for an event. Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without
…
postmortem is seen by engineers as an opportunity not only to fix a weakness, but to make Google more resilient as a whole. While a blameless postmortem doesn’t simply vent frustration by pointing fingers, it should call out where and how services can be improved. Here are two examples: Pointing fingers
…
difficult to be fully trained up on. I’m sure our future on-callers will thank us!” Best Practice: Avoid Blame and Keep It Constructive Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people
…
reinforcement. We reinforce a collaborative postmortem culture through senior management’s active participation in the review and collaboration process. Management can encourage this culture, but blameless postmortems are ideally the product of engineer self-motivation. In the spirit of nurturing the postmortem culture, SREs proactively create activities that disseminate what we learn
…
weathers fewer outages and fosters a better user experience. Our “Postmortems at Google” working group is one example of our commitment to the culture of blameless postmortems. This group coordinates postmortem efforts across the company: pulling together postmortem templates, automating postmortem creation with data from tools used during an incident, and helping
…
an outage while you’re embedded. If you aren’t the person on-call, team up with the on-call SRE to write a great, blameless postmortem. This document is an opportunity to demonstrate how a shift toward the SRE model benefits the team by making bug fixes more permanent. More permanent
…
systematic investigation of root causes of identified issues or risks in order to prevent recurrence. This principle is embodied by SRE’s strong culture of blameless postmortems. When something goes wrong (and given the scale, complexity, and rapid rate of change at Google, something inevitably will go wrong), it’s important to
…
Notes in Computer Science 5959, 2010. [All10] J. Allspaw and J. Robbins, Web Operations: Keeping the Data on Time: O’Reilly, 2010. [All12] J. Allspaw, “Blameless PostMortems and a Just Culture”, blog post, 2012. [All15] J. Allspaw, “Trade-Offs Under Pressure: Heuristics and Observations of Teams Resolving Internet Service Outages”, MSc thesis
by Marianne Bellotti · 17 Mar 2021 · 232pp · 71,237 words
many meetings, cut all of them and gradually reintroduce them one by one. If career-minded leaders are damaging psychological safety, start educating people about blameless postmortems and just culture. Talk to people and observe how the team behaves as a unit. When you can, it is always better to set up
…
million unique visitors every month.7 If you want your team to be able to handle breaking things, pay attention to what the organization celebrates. Blameless postmortems and just culture are a good place to start, because they both manipulate how people perceive failure and establish good engineering practices. Who Draws the
…
Line? But, can blameless postmortems ever really be blameless? In 2008, system safety researcher Sidney Dekker published an article titled “Just Culture: Who Gets to Draw the Line?”8 Dekker
…
’s article addresses whether true “blameless” postmortems, where no one was ever punished for errors, are the desired end state of just cultures. People want psychological safety, but they also want accountability
by Heather Adkins, Betsy Beyer, Paul Blankinship, Ana Oprea, Piotr Lewandowski and Adam Stubblefield · 29 Mar 2020 · 1,380pp · 190,710 words
and what did not. Measure the amount of time it took to implement each stage of the response so you can identify corrective measures. Write blameless postmortems and focus on how you can improve the systems, procedures, and processes.4 Create feedback loops for improving existing plans or developing new plans as
…
the time frames and efforts associated with these action items, and decide which action items belong to short-term versus long-term roadmaps. We cover blameless postmortems in detail in Chapter 15 of the SRE book, but here are some additional security-focused questions to consider: What were the main contributing factors
…
Amanda Berlin. 2017. Defensive Security Handbook: Best Practices for Securing Infrastructure. Sebastopol, CA: O’Reilly Media. 8 Do so by building on a culture of blameless postmortems, as described in Chapter 15 of the SRE book. 9 See Google researcher sirdarckcat’s blog post on rewards for a more philosophical outlook. 10
…
that teams can manage the needs of customers and the risk of missing or delaying the Big Deal. In these situations, cultural norms should encourage blameless postmortems to uncover patterns of failure that can be fixed, thereby avoiding harmful conditions in the future.1 Companies with healthy cultures know that getting hacked
…
refine your organization’s processes. Organizations that embrace the inevitable also study any failures that do occur, including within their peer groups. Internally, they use blameless postmortems—discussed in Chapter 18 of this book and in Chapter 15 of the SRE book—to reduce the fear of failure and build confidence that
…
norms had impeded a culture of safety. This publicly available real-world example demonstrates good use of after-action reports, which we recommend conducting as blameless postmortems. These kinds of reports can also give leadership a clear understanding of how cultural breakdowns can lead to reliability and security issues. Culture of Sustainability
…
zero trust networking model, Zero Trust Networking Bigtable, Improve observability binary provenance, Binary Provenance-What to put in binary provenance, Data Sanitization BIOS, Device firmware blameless postmortems, Building a Culture of Security and Reliability, Culture of Inevitably blast radius, controlling, Controlling the Blast Radius-Time Separation failure domains, Failure Domains-Low-dependency components
by Thomas A. Limoncelli, Strata R. Chalup and Christina J. Hogan · 27 Aug 2014 · 757pp · 193,541 words
values innovation. If blame is used to avoid responsibility, the whole team suffers. For more information about this topic, we recommend Allspaw’s (2009) article “Blameless Postmortems and a Just Culture.” * * * A Postmortem Report for Every High-Priority Alert At Google many teams had a policy of writing a postmortem report every
…
. • Release It!: Design and Deploy Production-Ready Software (Nygard 2007) Detailed coverage and examples of how to implement many of the ideas in Chapter 11. • Blameless PostMortems and a Just Culture (Allspaw 2009) Theory and practice of postmortems. • A Mature Role for Automation (Allspaw 2012c) Why “Automate Everything!” is bad, and what
…
ACM Symposium on Operating Systems Principles (SOSP), pp. 129–142. http://research.google.com/pubs/pub37474.html Allspaw, J. (2009). Blameless postmortems and a just culture. http://codeascraft.com/2012/05/22/blameless-postmortems. Allspaw, J. (2012a). Each necessary, but only jointly sufficient. http://www.kitchensoap.com/2012/02/10/each-necessary-but-only
…
storage system, 24 Bimodal patterns in histograms, 361 BIOS settings in deployment phase, 218 Blackbox monitoring, 346–347 Blacklists, 40–42 Blade servers, 217–218 “Blameless Postmortems and a Just Culture” article, 301 Blog Search, upgrading, 226 Blue-green deployment, 230 BMC (Baseboard Management Controller), 218 Botnets, 140 Bots in virtual offices
by Jeff Lawson · 12 Jan 2021 · 282pp · 85,658 words
can be applied to the task of training business leaders. It’s much more effective than sitting through a seminar or reading a book. The Blameless Postmortem We often talk about learning to make decisions in the context of business planning, but what about when things go wrong? You’ve been there
…
organization operates, and what could strengthen it systematically, and then take action. We, and many other software companies, do this via a ritual called the “blameless postmortem.” The purpose of the blameless postmortem is to dig below the surface of some kind of bad outcome to the true root cause, and address
…
. If you repeat this process long enough, you’ll systematically build a stronger and stronger organization. I used a technical example, because the practice of blameless postmortems is more common in technical organizations. However, I’ve seen this method applied to every part of our business and it works the same way
…
newly public company and a mistake we didn’t want to make again. I asked our CFO at the time, Lee Kirkpatrick, to run a blameless postmortem. The finance team had never run one before, so we tapped Jason Hudak, our head of technical infrastructure, whom you’ll meet in Chapter 11
…
a company culture more like Google, Apple, or Facebook than like the U.S. federal government. But ask yourself: When people make mistakes, are there blameless postmortems, or are your people hauled in front of Congress (aka the executive team)? Are people encouraged to learn quickly, even at the risk of making
by Matthew Skelton and Manuel Pais · 16 Sep 2019
, 2010. Allen, Thomas J. Managing the Flow of Technology. Cambridge, MA: MIT Press, 1984. Allspaw, John. “Blameless PostMortems and a Just Culture.” Code as Craft (blog), May 22, 2012. https://codeascraft.com/2012/05/22/blameless-postmortems/. Almeida, Thiago. “DevOps Lessons Learned at Microsoft Engineering.” InfoQ, May 22, 2016. https://www.infoq.com/articles
…
, Eric. The New Hacker’s Dictionary, 3rd Edition. Boston, MA: MIT Press, 1996. Reed, J. Paul. “Blameless Postmortems Don’t Work. Be Blame-Aware but Don’t Go Negative.” TechBeacon, March 22, 2016. https://techbeacon.com/blameless-postmortems-dont-work-heres-what-does. Reinertsen, Donald. The Principles of Product Development Flow: Second Generation Lean
by Gene Kim, Kevin Behr and George Spafford · 14 Jul 2013 · 395pp · 110,994 words
-pointing, denials, and, most importantly, wasted time when our customers couldn’t give us money. Afterward, we did the first of a series of ongoing blameless postmortems to figure out what really happened and come up with ideas on how to prevent it from happening again. Better yet, Patty led a series
by Zoë Schiffer · 13 Feb 2024 · 343pp · 92,693 words
colleague had told Musk was technically true. But it wasn’t how Twitter used to operate. In the past, the engineering organization had followed a “blameless postmortem” model, meaning the team would try to figure out what had gone wrong without pinning it on anyone in particular. This put all the engineers
by George Berkowski · 3 Sep 2014 · 468pp · 124,573 words
a variety of strong talents – you’re going to build in redundancy, and the ability to weather big team challenges, as Square was able to. • Blameless postmortems are the key to learning from a tech-ops crisis. Fear-driven organisations do not scale. Calmness in chaotic situations is another trait learned over