Incident Escalation and Management


Edition e1.0.3

Updated March 23, 2023

You’re reading an excerpt of The Holloway Guide to Remote Work, a book by Katie Wilde, Juan Pablo Buriticá, and over 50 other contributors.

After defining what urgency looks like, you can focus on how to deal with it.

The first step in handling an emergency is being informed about it. Since we’ve built the “homeostasis” of our distributed work practices around asynchronous communication channels, this means stepping out of being “on track” and signaling to the organization that attention is needed somewhere. Here, interrupting is not only OK; it’s the right thing to do.

This doesn’t mean any interruption is welcome. Before we are in a state of emergency, we should also outline clear protocols so we can all understand as a team that what is happening is serious and needs to be looked at. We should explicitly define what this looks like.

What protocols you use are specific to the nature of your team. You will need to consider factors like whether there’s an office or not, the nature of the issue, and even the time zone distribution when determining the protocols for reaction.

A few examples of what to consider as you build your incident escalation and management protocols are:

  • Who to report to. Knowing which person or group you should go to when something is wrong is fundamental. This means designating a person or group of people who become first responders when incidents happen, and who can act on the issue. This process is sometimes referred to as being “on call.” Who you choose depends on the nature of your business, and you may have different groups of people on call at the same time. For example, customer experience teams can be thought of as being “on call” for customer needs and could be first responders if they are empowered to act on problems. In other cases, you may choose to designate managers as the recipients of these reports. You can find detailed information about on-call strategies in this guide by Atlassian. It’s written for software teams, but many of the lessons apply to other kinds of teams or departments.

  • How to report. In a distributed organization where constant presence is not expected, everyone should know how to report an incident, and have the means to do so. This is when we shift away from asynchronous communication and use channels that interrupt and get attention. Incident-alerting platforms like PagerDuty or VictorOps can be useful for software teams, since they support both automated and manual incident alerting. If you use Slack, you can write a custom Slackbot response that lists the phone number of the person on call when messages like “!emergency”, “!incident” or “!911” are posted. The “!” prefix keeps the response from being triggered during normal conversations.

  • Have backups. If the first mechanism to report didn’t work, it’s wise to have backup people who can get help. Managers are good candidates for backup emergency contacts, since they may know how to navigate the organization better; but they aren’t usually in a position to take action other than aiding with decision making and communications. If your organization is big enough, you can consider having a specific incident org chart so the structure is clearly defined.

  • Have a directory. When responding to an incident, you may need specialized knowledge or access to resolve the issue. It may make sense to have people list their contact numbers in their chat profile, or to build a small emergency directory for this purpose.

  • Build redundancy. Sometimes big parts of the internet break, bringing down large services with them. If your company chat goes down, you should still be able to respond and communicate. It’s worth exploring having a virtual situation room with a dedicated link on Zoom or UberConference so people know where to gather if something goes wrong. A backup chat group in a different service, used exclusively for emergencies, may allow you to recover if your main system goes down.

  • False alarms. You can expect occasional reports of emergencies that were not really emergencies. When we’re remote, it’s better to err on the side of overreacting rather than not reacting at all. Even so, you will want to watch for alert abuse, which may lead to alert fatigue and undermine the effectiveness of your incident management procedures.

  • Explicit point of contact. When dealing with an emergency, the entire organization must have access to the issue status without having to ask questions or interrupt those who are responding to the issue. The best way to do this is to designate, or have a procedure to select, an Incident Commander who will be the primary point of contact during incident response.

  • Postmortems. Incidents are excellent opportunities to learn. Many times they happen because something that we thought was true stopped being so. Emphasizing the importance of learning what we can when things go wrong will help organizations develop a culture where failure leads to inquiry, instead of blame or punishment. Postmortems, or Incident Learning Reports, can help teams be introspective and learn together. They’re also an interesting artifact that can teach you about how others fail.

  • Focus on stability. When things go wrong, our impulse may be to try to fix the underlying problem, but this may be difficult. The primary focus of your incident management will likely have to be to stabilize the system, including your organization, so that work can resume. Solving difficult problems under stress isn’t ideal and may lead to more problems, especially after hours. Depending on your incident matrix, you will consider incidents resolved when critical systems like payments or shipping are restored, but let your teams address less critical issues when the adrenaline has subsided. Restoring the “About Us” page can wait until the next morning.

  • Quick fixes aren’t universal. Not all problems can be easily resolved. When the issue will require a significant effort to fix, it’s better to prioritize stabilizing the system and defer the full solution to normal business hours or until the team has had time to recover after the emergency. For these cases, you can use a version of your incident management protocol, but replace synchronous communication methods with asynchronous ones, and lower the communication frequency. This means that you would still inform the organization so that your co-workers understand that you’re addressing the issue, but you’d want to let them know that its complexity means it will take time to figure out.

  • Be prepared. Teams are complex, and learn at different rates. Every time we add or remove a member, the end result is a completely different team. For this reason, organizations will need to be patient and understand that internalizing incident management procedures comes with time, and teams get better with practice. To accelerate learning, you can run remote game days, which simulate various kinds of failures or emergency scenarios. Marc Hedlund wrote about how he ran them at Stripe, and Chris Hansen shared how they do it at AdHoc. This requires setting some time aside and practicing as a team so you can be ready when the time comes to react.
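As an illustration of the “How to report” and “Have backups” points above, here is a minimal Python sketch, not a real Slack or PagerDuty integration. It assumes the three trigger words from the text and a hypothetical `Contact` record with a `reachable` flag standing in for “did this person acknowledge the alert?”. All names and phone numbers are placeholders.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Trigger words taken from the examples in the text.
TRIGGERS = ("!emergency", "!incident", "!911")

# Match a trigger only as a standalone token, so ordinary sentences that
# mention "incident" or "emergency" without the "!" prefix do not fire it.
TRIGGER_RE = re.compile(
    r"(?<!\S)(?:" + "|".join(re.escape(t) for t in TRIGGERS) + r")(?!\S)",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Contact:
    """Hypothetical on-call entry; the phone numbers used here are placeholders."""
    name: str
    phone: str
    reachable: bool = True  # stand-in for "acknowledged the alert"


def is_emergency_trigger(message: str) -> bool:
    """Return True if the chat message contains one of the trigger words."""
    return TRIGGER_RE.search(message) is not None


def first_reachable(chain: List[Contact]) -> Optional[Contact]:
    """Walk the escalation chain in order and return the first reachable person."""
    for contact in chain:
        if contact.reachable:
            return contact
    return None


def respond(message: str, chain: List[Contact]) -> Optional[str]:
    """Build the bot reply for a message, or None if it is normal conversation."""
    if not is_emergency_trigger(message):
        return None
    contact = first_reachable(chain)
    if contact is None:
        return "No one in the escalation chain is reachable; use the backup channel."
    return f"On call: {contact.name}, {contact.phone}"


chain = [
    Contact("Primary responder", "555-0100", reachable=False),
    Contact("Backup manager", "555-0101"),
]
print(respond("!incident checkout is failing", chain))
```

The order of the list is the escalation order, so adding a backup is a matter of appending another entry. In practice the `reachable` check would be replaced by whatever acknowledgment mechanism your alerting tool provides.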

Further Reading About Incident Management

Incident management is complex, and there are many different factors that go into defining how it should work at your organization. Some valuable resources to learn about this process and provide inspiration include:

Interruption Management

Managing team focus still matters at a distance. When priorities change or opportunities arise, we want our organizations to be in a position to react and take advantage of them. You may need to look at incidents with medium impact or lower urgency without raising a full-blown alarm. For these cases, individuals and teams can build similar protocols to those of incident management, but with more tolerance for time and attention.

For interruptions, you will want to consider designating an “interrupt handler” for the team. This is someone who will be in a higher state of alert on chat, email, and other channels, paying attention to changes. The interrupt handler can escalate issues or delegate as needed; they’re like a triage nurse for your distributed team. It’s a lighter version of being on call, and may or may not be independent of usual work hours. Since the person will be interruptible, they can also take on maintenance or supporting tasks that don’t need much focus. This is a good way of getting work done that is not urgent, but is still important.

In addition to designating a point person for interruptions, you’ll need to define protocols for individuals to know when their attention is needed to support such ad-hoc priorities. Relying on instant messaging or phone calls may be challenging if you’re in different time zones, but having a predefined email subject prefix like “[action-needed]” or “[priority-shift]” will allow people to shift their attention promptly, without using channels reserved for real emergencies.
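The subject-prefix convention can be checked mechanically, for example in a mail filter or a small script. The sketch below is one possible way to do it in Python; the function name `priority_prefix` is hypothetical, and the two prefixes are the examples from the paragraph above, so substitute your team’s own.

```python
import re
from typing import Optional

# Example prefixes from the text; replace with your team's agreed list.
KNOWN_PREFIXES = {"action-needed", "priority-shift"}

# A bracketed tag at the very start of the subject line, e.g. "[action-needed] ...".
PREFIX_RE = re.compile(r"^\s*\[([a-z-]+)\]", re.IGNORECASE)


def priority_prefix(subject: str) -> Optional[str]:
    """Return the recognized prefix (lowercased) or None for ordinary email."""
    match = PREFIX_RE.match(subject)
    if match is None:
        return None
    tag = match.group(1).lower()
    return tag if tag in KNOWN_PREFIXES else None


print(priority_prefix("[action-needed] Review the vendor contract today"))
print(priority_prefix("Weekly newsletter"))
```

Only a tag at the start of the subject counts, so a reply such as “Re: [action-needed] ...” is treated as ordinary mail in this sketch. Whether replies should keep the elevated priority is a choice for the team to make.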
