Edition e1.0.1. Updated October 11, 2022.
So, your code was deployed to production, and now errors are being thrown left and right. What do you do? It can be stressful, especially for someone who doesn’t have a lot of experience dealing with production outages. The errors keep coming, and you haven’t been able to identify the root cause yet. You don’t even know where to begin looking.
First off, take a breath. Panicking won’t do you any good here and will probably make the situation worse. So, the first thing you should focus on is staying as calm and collected as you can. As you get more experience working during production incidents, this gets more natural, but it’s easier said than done the first few times.
Next, try to be methodical with your approach to identifying the root cause of the issue. There might be a million thoughts racing through your head about what it could be, but if you don’t slow down, you won’t be able to think clearly. Slow is smooth, smooth is fast. It may sound counterintuitive, but by focusing your attention on one thing at a time, you can often move quicker than trying to do too many things at once.
Try to eliminate potential causes one-by-one. And keep notes as to what changes were tried and which possible causes were eliminated, including why. That will help avoid rework and help document the final resolution, which may come in handy during a postmortem.
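The note-taking habit described above can be as simple as a shared document, but a rough sketch in code shows what is worth capturing. The names here (`Note`, `IncidentNotes`, and the `outcome` labels) are hypothetical and chosen only for illustration; they are not part of any particular incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Note:
    """One timestamped entry in the incident notes (hypothetical structure)."""
    hypothesis: str  # the suspected cause being checked
    action: str      # what was tried or inspected
    outcome: str     # assumed labels: "eliminated", "confirmed", "inconclusive"
    reason: str      # why that conclusion was reached
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class IncidentNotes:
    entries: list[Note] = field(default_factory=list)

    def record(self, hypothesis: str, action: str, outcome: str, reason: str) -> Note:
        """Append a note; entries are never edited, only added."""
        note = Note(hypothesis, action, outcome, reason)
        self.entries.append(note)
        return note

    def eliminated(self) -> list[str]:
        """Causes already ruled out, so nobody spends time re-checking them."""
        return [n.hypothesis for n in self.entries if n.outcome == "eliminated"]
```

Recording the reason alongside each eliminated cause is what makes the notes reusable later, both for teammates joining mid-incident and for the postmortem.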
So, where do you start?
There are whole books written on incident management, but almost every single one of them will mention some form of how to identify the severity of an incident. When you first hear that there is an issue in a production environment, one of the first things you should do is determine just how bad the issue is. In essence, determining the severity of an incident is the process of quantifying the impact that an incident has on the business.
Severity is often measured on a numerical scale, with lower numbers representing a greater impact than higher numbers. For example, a severity of 1 is an all-hands-on-deck type of incident, whereas a severity of 5 may be a minor, low-impact incident that can be fixed later.
Different businesses may use different severity scales and may even label them differently. One business may define a scale from Sev0 (high severity) to Sev4 (low severity), while another business may define their scale as Sev1 (high severity) to Sev3 (low severity).
Severity scores are useful for communicating the urgency in the midst of an ongoing incident, and they also help communicate to business stakeholders what kind of fallout they can expect from a given incident. This helps customer success and public relations teams when communicating with external customers and partners, and helps the executive leadership team understand the impact to the company.
Additionally, having a defined severity scoring system contributes to a framework for who should respond to incidents and when others need to be pulled into active incidents. The more impactful an incident, the more important it becomes to have a plan in place for who should respond and when they need to be notified.
Example: Here’s a severity framework with descriptions of what each level entails:
|Severity|Description|Examples|
|---|---|---|
|Sev1|A critical incident with significant impact|A customer-facing service or feature (such as a checkout page) is down for all customers. Customer data is breached or compromised. Customer data is lost.|
|Sev2|A major incident with a large impact|A customer-facing service or feature (such as a push notification system) is unavailable for a subset of customers. Core functionality such as invoice creation is significantly impacted.|
|Sev3|A minor incident with a low impact|A bug is causing a minor inconvenience to customers, but a workaround is available. Performance for a page or feature is degraded.|
Table: Sample severity framework.
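To make the sample framework concrete, here is a minimal sketch of how its three levels might be encoded so that a triage decision is repeatable. The `Impact` fields and the `classify` rules are assumptions that loosely paraphrase the table; a real company’s definitions will differ and should take precedence.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means greater impact, as in the sample table."""
    SEV1 = 1  # critical
    SEV2 = 2  # major
    SEV3 = 3  # minor


@dataclass
class Impact:
    """Hypothetical yes/no questions a responder answers during triage."""
    data_breached_or_lost: bool = False
    customer_facing_outage: bool = False
    all_customers_affected: bool = False
    core_functionality_impaired: bool = False


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity level (illustrative rules only)."""
    if impact.data_breached_or_lost:
        return Severity.SEV1
    if impact.customer_facing_outage and impact.all_customers_affected:
        return Severity.SEV1
    if impact.customer_facing_outage or impact.core_functionality_impaired:
        return Severity.SEV2
    # Anything else, such as a bug with a workaround or degraded performance.
    return Severity.SEV3
```

Using an `IntEnum` keeps the "lower number is worse" ordering comparable in code, so `Severity.SEV1 < Severity.SEV3` holds.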
These severity levels also determine how much post-resolution follow-up, if any, is expected in order to prevent the same issue from happening again. In some cases, the most critical severity incidents require communication with customers and other parties who were impacted.
The more clearly defined your severity levels are, the more likely your team will know how to react and respond to an ongoing incident. If your company already has clearly defined severity levels, it’s a good idea to read them and understand when each one should be used. This will help you know what to do and how you should respond during an incident.
Some companies have dedicated incident management or crisis management teams. In those cases, they will be responsible for defining the severity levels, determining the severity of each incident, and even deciding whether an issue should be classified as an incident at all or just a bug. These teams often get involved in higher severity incidents to coordinate which teams need to be involved in resolving the incident. They also handle communication between teams as well as status updates to company executives and external customers, vendors, or partners. Learning how incident management teams operate is key to working well with them during high-stress incidents and getting issues resolved quickly. If you’re able to hop into an incident and provide valuable and timely information to help get things resolved, you’ll be able to build your reputation as a problem solver and leader within your company.
Look for clues, anomalies, and patterns. You’ll want to observe and absorb as much information as possible and then figure out how it’s all related, if at all. Hopefully, you have some internal dashboard, monitoring tools, or logs that you can dig into to find the information you need.
Important: Look for these three things to help you figure out what’s happening.
Clues. Your goal here is to gather evidence.
When did the errors first start?
What changed around that time?
Did we recently deploy new code?
If not, when was the last time we deployed, and what changes went out?
Do we have any cron jobs that run around that time?
Check your logs and any other observability tools.
Look for HTTP status codes.
Look for stack traces.
Read the actual error messages, and then read them again. What is it actually saying? Sometimes you might need to read a message a few times before you truly understand it.
Are errors coming from specific servers or all? What about containers?
Check server health metrics.
What does the CPU utilization look like?
How much disk space is left?
How much free memory is remaining?
Are bug reports coming from customers?
What were they doing when they experienced the error?
What browser are they using?
What operating system?
What steps did they take to trigger the error?
Anomalies. Your goal here is to identify outliers.
Try to determine a baseline for your system under normal conditions.
What did the server metrics look like before the incident vs. now?
Are you seeing an increase in any metrics vs. the baseline, such as mobile devices hitting your API?
Are you seeing a decrease in any metrics vs. the baseline, such as a slowdown in processing your background jobs?
Patterns. Your goal here is to find any repeating occurrences that could offer additional insights.
Try to find patterns in the errors across time.
Do the errors happen at certain times of the day? Do they line up with cron jobs?
Do the errors happen at specific time intervals?
Try to find patterns in the data inputs when errors occur.
Do the errors happen when a user enters an integer? Or only when they enter a number with a decimal point?
Are the errors happening with all input values, or just some? Maybe your logic doesn’t account for an edge case when they select a specific value from a dropdown list.
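Several of the questions above (when did the errors start, are they coming from specific servers, do they cluster at certain hours) can be answered by counting log lines. The following is a rough sketch, assuming a made-up log format of `timestamp host status path` separated by spaces. The format, the host names, and the `threshold` default are all hypothetical, so adapt the parsing to whatever your logs actually contain.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample lines in the assumed "timestamp host status path" format.
SAMPLE_LOG = [
    "2022-10-11T14:01:05Z web-1 200 /checkout",
    "2022-10-11T14:02:10Z web-2 500 /checkout",
    "2022-10-11T14:03:01Z web-2 500 /checkout",
    "2022-10-11T14:03:15Z web-2 502 /checkout",
    "2022-10-11T14:03:40Z web-1 500 /cart",
    "2022-10-11T14:03:59Z web-2 500 /checkout",
]


def parse(line):
    """Split one log line into (timestamp, host, status, path)."""
    ts, host, status, path = line.split()
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ"), host, int(status), path


def server_errors(lines):
    """Keep only records with a 5xx HTTP status code."""
    return [rec for rec in map(parse, lines) if rec[2] >= 500]


def first_spike(errors, threshold=3):
    """Earliest minute with at least `threshold` errors, or None."""
    per_minute = Counter(ts.replace(second=0) for ts, *_ in errors)
    for minute in sorted(per_minute):
        if per_minute[minute] >= threshold:
            return minute
    return None


def count_by_host(errors):
    """Are errors coming from specific servers or all of them?"""
    return Counter(host for _, host, _, _ in errors)


def count_by_hour(errors):
    """Do errors cluster at certain hours, for example around a cron job?"""
    return Counter(ts.hour for ts, *_ in errors)
```

With the sample lines, `first_spike(server_errors(SAMPLE_LOG))` points at the 14:03 minute, and `count_by_host` shows most of the errors coming from `web-2`, which are exactly the kinds of clues worth sharing with the team.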
In some cases, asking questions like the above may help you uncover the root cause, but not always. If you’re able to ask the right questions, and then dig deeper to uncover the answers to those questions, you should be able to at least narrow down the problem.
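The baseline comparison described under anomalies can also be sketched in a few lines. This example flags a metric reading as unusual when it sits far from the mean of readings taken under normal conditions. The z-score approach and the threshold of 3 standard deviations are assumptions made for illustration, not a recommendation from any specific monitoring tool.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, z_threshold=3.0):
    """Return True if `current` is far outside the baseline readings.

    `baseline` is a list of at least two readings taken under normal
    conditions (for example, CPU utilization percentages).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is notable.
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

For instance, against a hypothetical CPU baseline of `[40, 42, 38, 41, 39]`, a reading of 43 would not be flagged while a reading of 95 would. The check works in both directions, so it also catches a drop, such as a slowdown in background job throughput.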
As you gain more experience and help diagnose and fix issues in production, you’ll gain a better feel for what questions to ask based on the nature of the errors you’re seeing. A big part of this just comes from experience, and that experience comes from observing how your coworkers handle these kinds of situations.
The senior engineers on your team have likely dealt with many more production incidents than you have, and they will demonstrate that experience when dealing with an outage. Use this as a learning opportunity.
Watch your coworkers and observe what they do during an incident.
What monitoring dashboards are they checking? This will show you what metrics they consider important.
How urgently are they moving? Is this an all-hands-on-deck incident, or something that needs to be fixed but isn’t a major issue?
Who are they communicating with?
What channels are they communicating across?
You can learn quite a bit just by observing the senior engineers and how they conduct themselves during an outage. Before you realize it, you’ll start picking up the same techniques and processes they follow, and you’ll be able to diagnose and put out production fires in no time.
Communicating is arguably the most important thing you should remember to do during an incident. Collaboration tends to suffer when communication breaks down, and collaboration is paramount during a high-priority incident.
When in doubt, don’t be afraid to overcommunicate to your coworkers. Let them know what you find, especially when searching for clues, anomalies, and patterns. You may not be able to connect the dots yourself, but you might offer a clue that helps your coworker do so. Or perhaps a coworker will mention something that helps you connect the dots and identify the root cause. The important thing is to share as much information as you can, so that you can work together to solve the problem.
You’re part of a team, and you’re all working towards the same goal—fixing the issue. It doesn’t matter who discovers the root cause. It’s more important that the root cause is identified and a fix is applied so that you can get back to the project you were working on.
There will be times in your career when you’ll be pulled into an incident for something that’s not your fault, but it’s still your job to fix the problem without assigning blame. There will also be times when you’ll be dealing with an incident that was caused by a change you made, whether it’s bad logic, a bad configuration change, or an inefficient SQL query. When these things happen, it’s best to be honest and own up to your mistake.
This may be much harder than it sounds depending on the severity of the issue, but taking responsibility when you make a mistake is always the best thing to do in these situations. Avoiding ownership is the worst thing you can do, because it can hurt the credibility you’ve been working on building with your coworkers and your manager.
After all, there’s a good chance your teammates already know you’re the one who made the mistake because they reviewed your code and know it’s related to the outage. You can make all the excuses in the world, point fingers, or go hide in a conference room, but it won’t change the fact that an incident happened that was related to code you wrote. It’s all documented right there in the version control system, and it’s easy for your coworkers to search the commit history to find your name next to the code that broke.
Your coworkers may start asking questions about your changes, but do your best not to view it as a personal attack. They may just be trying to gather more information about how the code is supposed to work or the context around why you made the change.
Caution: Whatever you do, try not to get defensive if people start asking questions. Immediately trying to justify a mistake or possible mistake is not going to help solve the problem and will make others defensive or frustrated, too.
Example: Some programmers do this by pointing out that their pull request passed the code review and was approved by their teammates. This is bad because they’re trying to shift the blame onto someone else for not catching their mistakes. Even worse, some programmers may try to blame the QA team for not catching their mistakes. Ultimately, you are responsible for your own code. So, it’s not fair to rely on others to catch your mistakes and then blame them when they don’t.
Example: Other programmers may blame some constraint they had to work around, such as a legacy part of the codebase that hasn’t been updated in a while. Once again, you’re responsible for the code you write. Sometimes you need to work around constraints that are out of your control. That’s part of the job as a software engineer: solving problems within a set of constraints. Blaming your mistakes on something that is out of your control doesn’t look good. Instead, it’s better to admit that you hadn’t considered an edge case or that you didn’t understand the legacy system as well as you thought you did.
When you blame your mistakes on someone or something out of your control, you’re playing the victim. And when you play the victim, you’re admitting that you were not in full control of the situation. You’re admitting that you didn’t fully think through the ramifications of your code or that you didn’t test your code as thoroughly as you should have. Not only does playing the victim look bad, but it will strain your relationships with your coworkers, especially if you’re attempting to pin blame on them.
By being open and honest about where you fell short, you’re allowing yourself to accept that you’re not perfect. In a way, this makes it easier to learn from your mistakes because you’re letting your walls down. It’ll be easier to learn how you can avoid making the same mistake again in the future, rather than feeling like you are being judged for not knowing something.
Making mistakes can be painful, but they are opportunities to grow significantly as a software engineer. There’s a saying among sailors that “smooth seas never make a good sailor,” because you’ll never be great if you never experience adversity. The skills you learn during the rough times in your career will be hard fought, but I can guarantee you’ll learn to never make those mistakes again, and you’ll come out of the storm much smarter and more experienced.