At some point in every programmer’s career, they’ll go through the inevitable “oh crap” moment when some code they wrote suddenly breaks in production. It happens to the best software engineers, and it’ll happen to you. No matter how many steps you take to mitigate risks, sometimes bad code will slip through the cracks and cause a catastrophic failure in production.
In some ways, this is a rite of passage on your journey to becoming an experienced software engineer, because you’ll gain valuable experience identifying, triaging, fixing, and recovering from an incident in a high-pressure situation.
Software engineers work on complex systems. It’s impossible to fully understand how each new code change you deploy will behave in a production environment. We can take measures to mitigate risks, but it’s impossible to avoid them. So, what can you do if you’re not able to completely steer clear of breaking code?
Good programmers accept that mistakes will happen. They don’t know when one will happen, but they know what to do when things do go wrong. The ability to stay calm and collected and work through the problem under pressure, especially when alerts are going off and logs are filling up with errors, is a sign of an experienced programmer. They move with urgency and without losing their composure, because their main priority is getting the issue fixed and dealing with the impact.
Experienced engineers don’t worry about their reputation or what their coworkers will think of them during a production outage. They know they may have to deal with some fallout after the dust has settled, but that’s not their main concern when systems are down. While it’s natural to be concerned about your reputation when things go wrong, that may hinder your ability to think through the problem clearly.
In How to Manage Risk, we’ll dive deeper into different things you can do to prevent future incidents. Here, we’ll focus on things you can do during an ongoing incident. Let’s look at different ways things can go wrong.
Why Code Breaks
We can’t predict every dependency between our systems, or even all the dependencies between pieces of our logic in the same system. This alone makes it difficult to avoid introducing breaking changes, but it’s not the only thing that contributes to broken code.
example Let’s look at other ways the code you write may break.
Untested code. You may think it’s a small change that doesn’t need testing, or you may be in a hurry to fix a bug, so you put your code up for review as soon as you finish writing it. It may feel like you’re working quickly, but this is an easy way to introduce broken code into production because you didn’t take the time to actually test it. Your code may look correct at first glance, yet your logic may have unintended behavior that you would never know about unless you actually ran it.
Unknown edge cases. The data you use during development and testing may be clean, structured, and made up of expected values, but production data is often messy and varies greatly. Your system will need to handle inputs and events from your users (or other systems) that you didn’t know about or account for when writing your code.
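As a hypothetical illustration of this, consider a parser that handles clean test data perfectly but chokes on real production values (the function names and sample data here are invented for the sketch):

```python
def parse_age(value):
    # Works fine on the clean, well-formed inputs used in tests
    return int(value)

# Test fixtures are tidy; production data rarely is
clean_rows = ["25", "40"]
messy_rows = ["25", " 40 ", "", "N/A", "30.5"]

print([parse_age(v) for v in clean_rows])  # [25, 40]
# parse_age("N/A") or parse_age("30.5") would raise ValueError in production

# A defensive version that tolerates the edge cases we didn't anticipate
def parse_age_safe(value, default=None):
    try:
        return int(float(value.strip()))
    except (ValueError, AttributeError):
        return default

print([parse_age_safe(v) for v in messy_rows])  # [25, 40, None, None, 30]
```

The point isn’t this particular fix; it’s that the failure mode only appears once inputs you never imagined start flowing through your code.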
Missing context. Oftentimes, the person who wrote the original code won’t be the person who is updating or fixing it. Perhaps the original author is out of the office on vacation, moved to a different team, or moved on to a new company. When this happens, you may not have the full context about how part of the system works when you need to make modifications to it. There may be a specific reason the logic was written a certain way, or the logic may account for an edge case that isn’t apparent when first reading the code.
Hidden dependencies. As codebases grow, so does the dependency graph. You may deploy some code to production that you thought you had tested thoroughly, only to find that there is an obscure part of the codebase that also relies on the logic that was changed. Suddenly, you have another team asking you why their code, which they haven’t touched in months, is throwing errors in production. Even worse, there could be bugs in a third-party or open-source library that your codebase relies on. These are often difficult to track down and can be difficult to fix if the maintainers aren’t responsive.
Code rot. Also referred to as software rot, code rot describes how behavior or usability of a codebase degrades over time, sometimes even if the code itself has not been modified. The environment the code runs in will change over time, or your customer’s usage patterns may shift. They may request new features that build on top of existing logic, which can introduce new bugs. Even routine software maintenance contributes to code rot. You may need to update third-party libraries to patch security vulnerabilities, only to find the newer version breaks existing logic.
Environment differences. As much as we’d like to keep our development environments in sync with our staging and production environments, it’s extremely difficult to match them exactly. Your local environment variables may differ from what’s running on production, which could lead to code that works locally but breaks on production. Differences in scale between environments can cause entirely new errors to occur in production systems that are hard to reproduce in smaller environments. Operating systems and their packages in higher environments may differ from your development environment. And finally, the hardware itself: while you develop your code on a personal computer, your code runs in production on powerful servers with different CPU and memory characteristics.
Third-party dependencies. You may have external dependencies that your product relies on, such as third-party APIs that your code calls out to. When those services have outages of their own, they may affect your system and cause errors on your end. Even cloud hosting providers such as Amazon Web Services or Microsoft Azure go down from time to time and could bring your system down with them.
The above examples are just a few ways in which your code can break—the list keeps going. Murphy’s Law states that “anything that can go wrong, will go wrong.” As a programmer, it’s your job to identify all the scenarios in which your program can fail, and then to take steps to reduce the likelihood of those scenarios happening. In some cases though, your program will fail in unexpected ways that you never could have imagined, which makes it hard to plan for.
Here are more examples of ways that your system can fail. Remember, sometimes it’s not just the code itself but other pieces of the system that can fail too.
CPU maxed out
No remaining disk space
Social engineering attacks
Any combination of these failures can occur at any given time, and you or your team might be responsible for getting the system back online when it does. The risk of some of these disasters can be managed and mitigated ahead of time, which we’ll learn about in How to Manage Risk, but others are much harder to predict or prevent. The best thing you can do is be prepared for anything to happen at any time.
Now, let’s shift our focus to things you should do during an incident that will help you triage, identify, and fix issues as they occur.
So, your code was deployed to production, and now errors are getting thrown left and right. What do you do? It can be stressful, especially for someone who doesn’t have a lot of experience dealing with production outages. The errors keep coming, and you haven’t been able to identify the root cause yet. You don’t even know where to begin looking.
First off, take a breath. Panicking won’t do you any good here and will probably make the situation worse. So, the first thing you should focus on is staying as calm and collected as you can. As you get more experience working during production incidents, this gets more natural, but it’s easier said than done the first few times.
Next, try to be methodical with your approach to identifying the root cause of the issue. There might be a million thoughts racing through your head about what it could be, but if you don’t slow down, you won’t be able to think clearly. Slow is smooth, smooth is fast. It may sound counterintuitive, but by focusing your attention on one thing at a time, you can often move quicker than trying to do too many things at once.
Try to eliminate potential causes one by one, and keep notes on which changes you tried and which possible causes you ruled out, including why. That will help you avoid rework and will document the final resolution, which may come in handy during a postmortem.
So, where do you start?
Determine the Severity
There are whole books written on incident management, but almost every single one of them will mention some form of how to identify the severity of an incident. When you first hear that there is an issue in a production environment, one of the first things you should do is determine just how bad the issue is. In essence, determining the severity of an incident is the process of quantifying the impact that an incident has on the business.
Severity is often measured on a numerical scale, with lower numbers representing a greater impact than higher numbers. For example, a severity of 1 is an all-hands-on-deck type of incident, whereas a severity of 5 may be a minor, low-impact incident that can be fixed later.
Different businesses may use different severity scales and may even label them differently. One business may define a scale from Sev0 (high severity) to Sev4 (low severity), while another business may define their scale as Sev1 (high severity) to Sev3 (low severity).
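One possible way to encode such a scale, sketched here with hypothetical level names and an invented paging policy (your company’s definitions will differ):

```python
from enum import IntEnum

class Severity(IntEnum):
    # Lower numbers mean greater impact, per the Sev1-Sev3 convention above
    SEV1 = 1  # critical: customer-facing outage, data breached or lost
    SEV2 = 2  # major: partial outage, core feature significantly impacted
    SEV3 = 3  # minor: workaround available, fix can wait

def page_on_call(severity):
    # Hypothetical policy: only Sev1 and Sev2 wake people up
    return severity <= Severity.SEV2

print(page_on_call(Severity.SEV1))  # True
print(page_on_call(Severity.SEV3))  # False
```

Using an `IntEnum` keeps the levels comparable, so “is this severe enough to page someone?” becomes a simple ordering check.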
Severity scores are useful for communicating the urgency in the midst of an ongoing incident, and they also help communicate to business stakeholders what kind of fallout they can expect from a given incident. This helps customer success and public relations teams when communicating with external customers and partners, and helps the executive leadership team understand the impact to the company.
Additionally, having a defined severity scoring system contributes to a framework for who should respond to incidents and when others need to be pulled into active incidents. The more impactful an incident, the more important it becomes to have a plan in place for who should respond and when they need to be notified.
example Here’s an example of a severity framework with descriptions of what each level entails:
Sev1: A critical incident with significant impact. Examples: a customer-facing service or feature (such as a checkout page) is down for all customers; customer data is breached or compromised; customer data is lost.
Sev2: A major incident with a large impact. Examples: a customer-facing service or feature (such as a push notification system) is unavailable for a subset of customers; core functionality such as invoice creation is significantly impacted.
Sev3: A minor incident with a low impact. Examples: a bug is causing a minor inconvenience to customers, but a workaround is available; performance for a page or feature is degraded.
Table: Sample severity framework.
These severity levels also determine how much post-resolution follow-up, if any, is expected in order to prevent the same issue from happening again. In some cases, the most critical severity incidents require communication with customers and other parties who were impacted.
The more clearly defined your severity levels are, the more likely your team will know how to react and respond to an ongoing incident. If your company already has clearly defined severity levels, it’s a good idea to read them and understand when each one should be used. This will help you know what to do and how you should respond during an incident.
Some companies have dedicated incident management or crisis management teams. In those cases, they will be responsible for defining the severity levels, determining the severity of each incident, and even deciding whether an issue should be classified as an incident at all or just a bug. These teams often get involved in higher severity incidents to help facilitate which teams need to be involved in resolving the incident. They also handle communication between teams as well as status updates to company executives and external customers, vendors, or partners. Learning how incident management teams operate is key to working well with them during high-stress incidents and getting issues resolved quickly. If you’re able to hop into an incident and provide valuable and timely information to help get things resolved, you’ll be able to build your reputation as a problem solver and leader within your company.
Look for clues, anomalies, and patterns. You’ll want to observe and absorb as much information as possible and then figure out how it’s all related, if at all. Hopefully, you have some internal dashboard, monitoring tools, or logs that you can dig into to find the information you need.
important Look for these three things to help you figure out what’s happening.
Clues. Your goal here is to gather evidence.
When did the errors first start?
What changed around that time?
Did we recently deploy new code?
If not, when was the last time we deployed, and what changes went out?
Do we have any cron jobs that run around that time?
Check your logs and any other observability tools.
Look for HTTP status codes.
Look for stack traces.
Read the actual error messages, and then read them again. What is it actually saying? Sometimes you might need to read a message a few times before you truly understand it.
Are errors coming from specific servers or all of them? What about specific containers?
Check server health metrics.
What does the CPU utilization look like?
How much disk space is left?
How much memory is free?
Are bug reports coming from customers?
What were they doing when they experienced the error?
What browser are they using?
What operating system?
What steps did they take to trigger the error?
Anomalies. Your goal here is to identify outliers.
Try to determine a baseline for your system under normal conditions.
What did the server metrics look like before the incident vs. now?
Are you seeing an increase in any metrics vs. the baseline, such as mobile devices hitting your API?
Are you seeing a decrease in any metrics vs. the baseline, such as a slowdown in processing your background jobs?
Patterns. Your goal here is to find any repeating occurrences that could offer additional insights.
Try to find patterns in the errors across time.
Do the errors happen at certain times of the day? Do they line up with cron jobs?
Do the errors happen at specific time intervals?
For example, the errors might happen once every minute, or they might seem to happen at random, with no discernible interval.
Do the errors happen at specific times during the day?
For example, the errors seem to occur towards the beginning of the workday, right when users log on in the mornings.
Try to find patterns in the data inputs when errors occur.
Do the errors happen when a user enters a whole number? Or just when they enter a number with a decimal point?
Are the errors happening with all input values, or just some? Maybe your logic doesn’t account for an edge case when they select a specific value from a dropdown list.
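A quick way to hunt for time-based patterns like these is to bucket error timestamps and look for clusters. This sketch uses an invented log format and service name; real log formats vary with your logging setup:

```python
from collections import Counter
from datetime import datetime

# Hypothetical log lines pulled from an error search
log_lines = [
    "2024-05-01 09:00:12 ERROR timeout calling payments-api",
    "2024-05-01 09:00:45 ERROR timeout calling payments-api",
    "2024-05-01 09:01:07 ERROR timeout calling payments-api",
    "2024-05-01 09:03:30 ERROR timeout calling payments-api",
    "2024-05-01 14:00:02 ERROR timeout calling payments-api",
]

# Bucket errors by hour to see whether they cluster
# (e.g., at the start of the workday when users log on)
buckets = Counter()
for line in log_lines:
    ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
    buckets[ts.strftime("%H:00")] += 1

for hour, count in sorted(buckets.items()):
    print(hour, count)
# 09:00 4
# 14:00 1
```

The same bucketing idea works at minute granularity for spotting fixed intervals (a cron job firing every minute) or grouped by input value for spotting data-dependent failures.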
In some cases, asking questions like the above may help you uncover the root cause, but not always. If you’re able to ask the right questions, and then dig deeper to uncover the answers to those questions, you should be able to at least narrow down the problem.
Observe How Your Coworkers Work
As you gain more experience and help diagnose and fix issues in production, you’ll gain a better feel for what questions to ask based on the nature of the errors you’re seeing. A big part of this just comes from experience, and that experience comes from observing how your coworkers handle these kinds of situations.
The senior engineers on your team have likely dealt with many more production incidents than you have, and they will demonstrate that experience when dealing with an outage. Use this as a learning opportunity.
Watch your coworkers and observe what they do during an incident.
What monitoring dashboards are they checking? This will show you what metrics they consider important.
How urgently are they moving? Is this an all-hands-on-deck incident, or something that needs to be fixed but isn’t a major issue?
Who are they communicating with?
What channels are they communicating across?
You can learn quite a bit just by observing the senior engineers and how they conduct themselves during an outage. Before you realize it, you’ll start picking up the same techniques and processes they follow, and you’ll be able to diagnose and put out production fires in no time.
Communication is arguably the most important thing to remember during an incident. Collaboration tends to suffer when communication breaks down, and collaboration is paramount during a high-priority incident.
When in doubt, don’t be afraid to overcommunicate to your coworkers. Let them know what you find, especially when searching for clues, anomalies, and patterns. You may not be able to connect the dots yourself, but you might offer a clue that helps your coworker do so. Or perhaps a coworker will mention something that helps you connect the dots and identify the root cause. The important thing is to share as much information as you can, so that you can work together to solve the problem.
You’re part of a team, and you’re all working towards the same goal—fixing the issue. It doesn’t matter who discovers the root cause. It’s more important that the root cause is identified and a fix is applied so that you can get back to the project you were working on.
There will be times in your career when you’ll be pulled into an incident for something that’s not your fault, but it’s still your job to fix the problem without assigning blame. There will also be times when you’ll be dealing with an incident that was caused by a change you made, whether it’s bad logic, a bad configuration change, or an inefficient SQL query. When these things happen, it’s best to be honest and own up to your mistake.
This may be much harder than it sounds depending on the severity of the issue, but taking responsibility when you make a mistake is always the best thing to do in these situations. Avoiding ownership is the worst thing you can do, because it can hurt the credibility you’ve been working on building with your coworkers and your manager.
After all, there’s a good chance your teammates already know you’re the one who made the mistake because they reviewed your code and know it’s related to the outage. You can make all the excuses in the world, point fingers, or go hide in a conference room, but it won’t change the fact that an incident happened that was related to code you wrote. It’s all documented right there in the version control system, and it’s easy for your coworkers to search the commit history to find your name next to the code that broke.
Your coworkers may start asking questions about your changes, but do your best not to view it as a personal attack. They may just be trying to gather more information about how the code is supposed to work or the context around why you made the change.
caution Whatever you do, try not to get defensive if people start asking questions. Immediately trying to justify a mistake or possible mistake is not going to help solve the problem and will make others defensive or frustrated, too.
example Some programmers do this by pointing out that their pull request passed the code review and was approved by their teammates. This is bad because they’re trying to shift the blame onto someone else for not catching their mistakes. Even worse, some programmers may try to blame the QA team for not catching their mistakes. Ultimately, you are responsible for your own code, so it’s not fair to rely on others to catch your mistakes and then blame them when they don’t.
example Other programmers may blame some constraint they had to work around, such as a legacy part of the codebase that hasn’t been updated in a while. Once again, you’re responsible for the code you write. Sometimes you need to work around constraints that are out of your control. That’s part of the job as a software engineer—to solve problems within a set of constraints. Blaming your mistakes on something that is out of your control doesn’t look good. Instead, it’s better to admit that you hadn’t considered an edge case or that you didn’t understand the legacy system as well as you thought you did.
When you blame your mistakes on someone or something out of your control, you’re playing the victim. And when you play the victim, you’re admitting that you were not in full control of the situation. You’re admitting that you didn’t fully think through the ramifications of your code or that you didn’t test your code as thoroughly as you should have. Not only does playing the victim look bad, but it will strain your relationships with your coworkers, especially if you’re attempting to pin blame on them.
By being open and honest about where you fell short, you’re allowing yourself to accept that you’re not perfect. In a way, this makes it easier to learn from your mistakes because you’re letting your walls down. It’ll be easier to learn how you can avoid making the same mistake again in the future, rather than feeling like you are being judged for not knowing something.
Mistakes can be painful, but they are opportunities to grow significantly as a software engineer. There’s a saying among sailors that “smooth seas never make a good sailor,” because you’ll never be great if you never experience adversity. The skills you learn during the rough times in your career will be hard-won, but I can guarantee you’ll learn to never make those mistakes again, and you’ll come out of the storm much smarter and more experienced.
Asking questions is a natural instinct in humans. In fact, it’s so ingrained in our DNA that we begin asking questions about anything and everything from the earliest days of our childhood. Kids are curious about how the world works, so they ask questions as a way to help them make sense of the things they see. They’re full of endless questions, but there’s one that is simple, yet so powerful at the same time: why?
Why is the sky blue?
Why do I have to eat my vegetables?