Failure is only the opportunity more intelligently to begin again.
Henry Ford
This whole section has been about managing and reducing risk, but an unfortunate fact of life is that it’s nearly impossible to completely eliminate all risk involved in writing software. With any moderately complex software, things will go wrong at some point. And sometimes things will go very wrong. Failure is inevitable, and at some point, you’ll be pulled into an incident. When these incidents happen, it’s important to use them as learning experiences and take the time to reflect on the preceding events in order to better understand how and why they happened. In doing so, you’ll be able to learn from your mistakes and make any appropriate changes to prevent them from happening again in the future.
The best thing you can do in the aftermath of an incident is to capture and document what happened leading up to, during, and after the incident so that you can reflect, learn, and share that knowledge with others within your organization. This process is known as a postmortem.
An incident postmortem should bring people together to discuss and document the details of an incident:
What was the timeline of events leading up to and during the incident?
What was the ultimate root cause?
What was the impact on the customers and the organization?
What actions were taken to mitigate the failures and get the system back to a stable condition?
What steps, if any, should be taken to prevent the same thing from happening again?
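One way to make sure each of these questions gets answered is to give the postmortem document a fixed shape. The following is a minimal sketch in Python, not a standard format: the `Postmortem` and `TimelineEntry` classes, their field names, and the example incident are all hypothetical, invented here purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    """One event leading up to or during the incident."""
    at: datetime
    description: str


@dataclass
class Postmortem:
    """Hypothetical record with one field per question in the list above."""
    title: str
    root_cause: str = ""
    impact: str = ""
    timeline: list[TimelineEntry] = field(default_factory=list)
    mitigations: list[str] = field(default_factory=list)
    # Preventive steps are optional ("if any"), so an empty list is allowed.
    follow_ups: list[str] = field(default_factory=list)

    def unanswered(self) -> list[str]:
        """Return the required sections that are still empty."""
        sections = {
            "timeline": self.timeline,
            "root_cause": self.root_cause.strip(),
            "impact": self.impact.strip(),
            "mitigations": self.mitigations,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Render the record as plain text, with the timeline in time order."""
        lines = [f"Postmortem: {self.title}", "", "Timeline:"]
        for entry in sorted(self.timeline, key=lambda e: e.at):
            lines.append(f"  {entry.at:%Y-%m-%d %H:%M}  {entry.description}")
        lines += ["", f"Root cause: {self.root_cause}", f"Impact: {self.impact}"]
        lines += ["", "Mitigations:"] + [f"  - {m}" for m in self.mitigations]
        lines += ["", "Follow-ups:"] + [f"  - {s}" for s in self.follow_ups]
        return "\n".join(lines)


# Hypothetical incident, invented for illustration only.
example = Postmortem(
    title="Checkout outage",
    root_cause="A config change disabled the payment client's retries.",
    timeline=[
        TimelineEntry(datetime(2024, 1, 5, 14, 20), "Error rate alert fired"),
        TimelineEntry(datetime(2024, 1, 5, 14, 2), "Config change deployed"),
    ],
    mitigations=["Rolled back the config change"],
)

print(example.unanswered())
print(example.render())
```

Because the example leaves `impact` blank, `unanswered()` flags it, which is the point of the structure: a gap in the write-up is visible before the document is shared. Note that nothing in the record names a person. The fields describe events and systems, in keeping with a blameless review.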
If you and your team are able to set aside time to put together a root-cause analysis after a major operational incident, then you’re setting yourself up for the opportunity to improve yourself, your teammates, and your team’s software development processes. When you learn from your mistakes, you’re able to reduce the risk of making those same mistakes in the future, but it takes time and effort to assess the impact and damage after the dust has settled. A postmortem is a useful framework for sharing knowledge and learning from incidents. Its ultimate purpose is to help organizations turn negative events into forward progress.
Postmortems can be difficult, however, especially when one highlights a mistake or oversight you personally made. You or one of your colleagues may be embarrassed or nervous about sharing details within your organization. Successful postmortems are blameless: they focus on finding a solution that prevents the root cause from recurring, not on pointing fingers and assigning blame.
Your goal should be to bring people together in a constructive and collaborative environment that allows everyone to contribute to the progress and evolution of the organization. Postmortems are designed to build trust among team members, across teams, and even with customers. Some companies choose to publish their postmortems publicly in order to show their customers transparency and rebuild confidence in their products.
You don’t need to wait for an incident to reflect on and learn from your past, however. There’s another framework, called retrospectives, commonly used by many modern software companies.