Edition e1.0.3. Updated March 23, 2023.
You’re reading an excerpt of The Holloway Guide to Remote Work, a book by Katie Wilde, Juan Pablo Buriticá, and over 50 other contributors. It is the most comprehensive resource on building, managing, and adapting to working with distributed teams.
This section was written by Juan Pablo Buriticá.
Most of this section focuses on building predictable and reliable practices as a platform for distributed teams to build upon. A steady cadence, explicit (largely asynchronous) communication practices, and documented team agreements free us to focus on being productive despite being separated. This strategy works when everything is going well, but becomes inadequate in situations where the need to react can’t wait. While we recommend relying as much as possible on asynchronous methods for day-to-day operations, we encourage teams to fall back on synchronous methods when communication breaks down or time-sensitive matters appear.
“If everything is important, then nothing is.” ―Patrick Lencioni, author, The Five Dysfunctions of a Team
Before teams learn how to react to incidents and emergencies, they first need a definition of what counts as urgent. Teams should have a clear understanding of what warrants shifting out of the default way of working, and how and when they should raise the issue.
By having clear guidelines regarding urgent matters, teams and individuals can also learn how to protect their own focus—they understand that unless something is classified as highly urgent, it can wait. You can facilitate asynchronous work by making sure everyone is able to defer non-urgent items. This allows individuals to devote all their energy to the task at hand without refreshing their email or checking the work chat every 20 minutes to see if there’s something important that needs solving.
Caution: If there’s no definition of critical systems or operations, then anything that people with authority ask for becomes highly urgent. Bosses aren’t always conscious of their power. They may report a website bug on a channel assuming it will be prioritized and eventually solved, when instead the team interprets this as extremely important and drops everything to make the change. It’s critical for teams to have explicit permissions to defer solving problems reported by people with authority, or to have a framework to ask if there’s a need to reprioritize.
An incident priority matrix is a document that outlines how to gauge the priority of an incident to determine if it should be classified as an emergency or not.
Urgency is contextual to your organization, but a few examples can help you think about what it means in your business. You likely don’t need to wake up your on-call engineer at 3am because your blog went down; it can probably wait a few hours. If you are a transportation company and your dispatching system went offline, that would be a different story. Michael Churchman wrote a simple guide that you can use to build an incident priority matrix. Urgency comes down to impact and the context of your operations.
Impact is generally based on the scope of an incident’s effects—how many departments, users, or key services are affected. A large number of near-simultaneous reports that a specific service is unavailable, for example, may be a good indication of a high-impact incident; while a report of a problem from a single user, unaccompanied by any similar reports, is more likely to indicate a low-impact incident. For many IT departments, the guidelines for determining incident impact might look something like this:
High impact. A critical system is down.
One or more departments are affected.
A significant number of staff members are not able to perform their functions.
The incident affects a large number of customers.
The incident has the potential for major financial loss or damage to the organization’s reputation.
Other criteria, depending on the function of the organization and the affected systems, could include such things as threats to public safety, potential loss of life, or major property damage.
Moderate impact. Some staff members or customers are affected.
None of the services lost are critical.
Financial loss and damage to the organization’s reputation are possible, but limited in scope.
There is no threat to life, public safety, or physical property.
Low impact. Only a small number of users are affected.
It is not always easy to draw a strict distinction between incident impact and incident urgency, but for the most part, urgency in this context can be defined as how quickly a problem will begin to have an effect on the system or people who rely on it. The failure of a payroll system may have a high impact, for example, but if it occurs at the beginning of a pay cycle, it is likely to be less urgent than the loss of a customer-relations database that is put to heavy use on a daily basis.
High urgency. A service that is critical for day-to-day operations is unavailable.
The incident’s sphere of impact is expanding rapidly, or quick action may make it possible to limit its scope.
Time-sensitive work or customer actions are affected.
The incident affects high-status individuals or organizations (for example, upper management or major clients).
Low urgency. Affected services are optional and used infrequently.
The effects of the incident appear to be stable.
Important or time-sensitive work is not affected.
Important: Note that for both impact and urgency, meeting a single criterion (rather than all or a majority of criteria) in a category is generally sufficient. Best practice is to place incidents in the highest category for which they qualify.
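The criteria above can be turned into a small lookup table. The following is a minimal sketch, not a standard: the `P1` to `P4` labels, the contents of `PRIORITY_MATRIX`, and the rule that only `P1` counts as an emergency are all assumptions made for illustration, and each organization would fill in its own cells. The only rule taken from the text is that one met criterion qualifies an incident for a category, and that the incident is placed in the highest category for which it qualifies.

```python
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


class Urgency(IntEnum):
    LOW = 1
    HIGH = 2


# Hypothetical priority labels; every organization fills in its own cells.
PRIORITY_MATRIX = {
    (Impact.HIGH, Urgency.HIGH): "P1",
    (Impact.HIGH, Urgency.LOW): "P2",
    (Impact.MODERATE, Urgency.HIGH): "P2",
    (Impact.MODERATE, Urgency.LOW): "P3",
    (Impact.LOW, Urgency.HIGH): "P3",
    (Impact.LOW, Urgency.LOW): "P4",
}


def classify(impact_met, urgency_met):
    """Return a priority label from the categories in which criteria were met.

    A single met criterion qualifies an incident for a category, and the
    incident is placed in the highest category for which it qualifies.
    """
    impact = max(impact_met, default=Impact.LOW)
    urgency = max(urgency_met, default=Urgency.LOW)
    return PRIORITY_MATRIX[(impact, urgency)]


def is_emergency(priority):
    # Assumption: only P1 justifies interrupting people synchronously.
    return priority == "P1"


# Dispatching system offline at a transportation company.
dispatch_down = classify([Impact.HIGH], [Urgency.HIGH])
# Payroll failure at the start of a pay cycle: high impact, low urgency.
payroll_down = classify([Impact.MODERATE, Impact.HIGH], [Urgency.LOW])
# The company blog is down at 3am.
blog_down = classify([Impact.LOW], [])
```

Keeping the whole matrix in one short table matches the goal of a brief guide: a responder can read the six cells at a glance instead of consulting a long manual during an emergency.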
The best way to build these guidelines is to outline what the organization values and why, and then write a brief guide that makes it easy for people to understand impact and urgency. Dealing with an emergency is not the right time for anyone to have to go over a multi-page manual to make a decision.
After defining what urgency looks like, you can focus on how to deal with it.
The first step in handling an emergency is being informed about it, and since we’ve built the “homeostasis” of our distributed work practices around asynchronous communication channels, this means we have to come out of being “on track” and signal to our organization that attention is needed somewhere. Here, interrupting is not only OK; it’s the right thing to do.
This doesn’t mean any interruption is welcome. Before we are in a state of emergency, we should also outline clear protocols so we can all understand as a team that what is happening is serious and needs to be looked at. We should explicitly define what this looks like.
What protocols you use are specific to the nature of your team. You will need to consider factors like whether there’s an office or not, the nature of the issue, and even the time zone distribution when determining the protocols for reaction.
A few examples of what to consider as you build your incident escalation protocols are:
Who to report to. Knowing which person or group you should go to when something is wrong is fundamental. This means designating a person or group of people who become first responders when incidents happen, and who can act on the issue. This process is sometimes referred to as being “on call.” Who you choose depends on the nature of your business, and you may have different groups of people on call at the same time. For example, customer experience teams can be thought of as being “on call” for customer needs and could be first responders if they are empowered to act on problems. In other cases, you may choose to designate managers as the recipient of these reports. You can find detailed information about on-call strategies for software teams in this guide by Atlassian. It’s written for software teams, but many of the lessons apply to other kinds of teams or departments.
How to report. In a distributed organization where constant presence is not expected, everyone should know how to report an incident, and have the means to do so. This is when we shift away from asynchronous communication and use channels that interrupt and get attention. Incident-alerting platforms like PagerDuty or VictorOps can be useful for software teams, since they have automated and manual incident alerting. If you use Slack, you can write a custom Slackbot response that lists the phone number of the person on call when messages like “!emergency”, “!incident” or “!911” are posted. The “!” prefix keeps the response from being triggered during normal conversations.
Have backups. If the first mechanism to report didn’t work, it’s wise to have backup people who can get help. Managers are good candidates for backup emergency contacts, since they may know how to navigate the organization better; but they aren’t usually in a position to take action other than aiding with decision making and communications. If your organization is big enough, you can consider having a specific incident org chart so the structure is clearly defined.
Have a directory. When responding to an incident, you may need specialized knowledge or access to resolve the issue. It may make sense to have people list their contact numbers in their chat profile, or to build a small emergency directory for this purpose.
Build redundancy. Sometimes big parts of the internet break, bringing down large services with them. If your company chat goes down, you should still be able to respond and communicate. It’s worth exploring having a virtual situation room with a dedicated link on Zoom or UberConference so people know where to gather if something goes wrong. A backup chat group in a different service, used exclusively for emergencies, may allow you to recover if your main system goes down.
False alarms. You can expect occasional reports of emergencies that were not really emergencies. When we’re remote, it’s better to err on the side of overreacting than to not react at all. Even so, you will want to watch for alert abuse, which may lead to fatigue and undermine the effectiveness of your incident management procedures.
Explicit point of contact. When dealing with an emergency, the entire organization must have access to the issue status without having to ask questions or interrupt those who are responding to the issue. The best way to do this is to designate, or have a procedure to select, an Incident Commander who will be the primary point of contact during incident response.
Postmortems. Incidents are excellent opportunities from which to learn. Many times they happen because something that we thought was true stopped being so. Emphasizing the importance of learning what we can when things go wrong will help organizations develop a culture where failure leads to inquiry, instead of blame or punishment. Postmortems, or Incident Learning Reports, can help teams be introspective and learn together. They’re also an interesting artifact that can teach you about how others fail.
Focus on stability. When things go wrong, our impulse may be to try to fix the underlying problem, but this may be difficult. The primary focus of your incident management will likely have to be to stabilize the system, including your organization, so that work can resume. Solving difficult problems under stress isn’t ideal and may lead to more problems, especially after hours. Depending on your incident matrix, you will consider incidents resolved when critical systems (like payment or shipping) are restored, but let your teams address less critical issues when the adrenaline has subsided. Restoring the “About Us” page can wait until the next morning.
Quick fixes aren’t universal. Not all problems can be easily resolved. When the issue will require a significant effort to fix, it’s better to prioritize stabilizing the system and defer the full solution to normal business hours or until the team has had time to recover after the emergency. For these cases, you can use a version of your incident management protocol, but replace synchronous communication methods with asynchronous ones, and lower the communication frequency. This means that you would still inform the organization about the issue so that your co-workers understand that you’re addressing it, but you’d want to let them know that its complexity means it will take time to figure out.
Be prepared. Teams are complex, and learn at different rates. Every time we add or remove a member, the end result is a completely different team. For this reason, organizations will need to be patient and understand that internalizing incident management procedures comes with time, and teams get better with practice. To accelerate learning, you can run remote game days, which simulate various kinds of failures or emergency scenarios. Marc Hedlund wrote about how he ran them at Stripe, and Chris Hansen shared how they do it at AdHoc. This requires setting some time aside and practicing as a team so you can be ready when the time comes to react.
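To make the “How to report” and “Have backups” items above more concrete, here is a minimal sketch of the logic behind a chat response that reacts only to “!”-prefixed keywords and replies with an ordered contact list. It is plain Python and deliberately does not call any chat platform’s API; wiring it into Slack or another tool is left out. The `Contact` type, the names, and the phone numbers are hypothetical placeholders.

```python
from dataclasses import dataclass

# Keywords from the text; the "!" prefix keeps ordinary uses of the
# words "emergency" or "incident" from triggering a response.
TRIGGERS = {"!emergency", "!incident", "!911"}


@dataclass(frozen=True)
class Contact:
    name: str
    role: str
    phone: str


def is_emergency_trigger(message):
    """Return True if any word in the message is one of the trigger keywords."""
    words = message.lower().split()
    # Trailing punctuation is dropped so "!incident," still matches.
    return any(word.rstrip(".,;:?!") in TRIGGERS for word in words)


def emergency_reply(message, contacts):
    """Return the reply text for a trigger message, or None for normal chat.

    `contacts` is ordered: the primary responder first, then backups.
    """
    if not is_emergency_trigger(message):
        return None
    lines = ["Emergency contacts, in the order to try them:"]
    for position, contact in enumerate(contacts, start=1):
        lines.append(f"{position}. {contact.name} ({contact.role}): {contact.phone}")
    return "\n".join(lines)


# Hypothetical directory: placeholder names and fictional numbers.
CONTACTS = [
    Contact("Ada", "on call", "555-0100"),
    Contact("Grace", "backup, engineering manager", "555-0101"),
]

reply = emergency_reply("Checkout is failing for everyone !emergency", CONTACTS)
ignored = emergency_reply("We had an incident like this last year", CONTACTS)
```

Listing the backup directly under the primary contact means the reporter does not need to look anything up if the first call goes unanswered.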
Incident management is complex, and there are many different factors that go into defining how it should work at your organization. Some valuable resources to learn about this process and provide inspiration include:
Chaos Engineering principles take a proactive approach to prevent incidents
Managing team focus still matters at a distance. When priorities change or opportunities arise, we want our organizations to be in a position to react and take advantage of them. You may need to look at incidents with medium impact or lower urgency without raising a full-blown alarm. For these cases, individuals and teams can build similar protocols to those of incident management, but with more tolerance for time and attention.
For interruptions, you will want to consider designating an “interrupt handler” for the team. This is someone who will be in a higher state of alert on chat, email, and other channels, paying attention to changes. The interrupt handler can escalate issues or delegate as needed; they’re like a triage nurse for your distributed team. It’s a lighter version of being on call, and may or may not be independent of usual work hours. Since the person will be interruptible, they also can take on maintenance or supporting tasks that don’t need too much focus to handle. This is a good way of getting work done that is not urgent, but is still important.
In addition to designating a point person for interruptions, you’ll need to define protocols for individuals to know when their attention is needed to support such ad-hoc priorities. Relying on instant messaging or phone calls may be challenging if you’re in different time zones, but having a predefined email subject prefix like “[action-needed]” or “[priority-shift]” will allow people to shift their attention promptly, without using channels reserved for real emergencies.
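As an illustration of the subject-prefix convention described above, the sketch below checks whether an email subject starts with one of the agreed tags. The two tags come from the text; the function name and the decision to tolerate leading “Re:” or “Fwd:” markers are assumptions. A mail filter or a small script could use a check like this to surface tagged messages without touching the channels reserved for real emergencies.

```python
import re

# Tags agreed on by the team (taken from the examples in the text).
PREFIXES = ("action-needed", "priority-shift")

# Match a tag at the start of the subject, allowing reply/forward markers
# such as "Re:" or "Fwd:" in front of it. Matching is case-insensitive.
_TAG_PATTERN = re.compile(
    r"^\s*(?:(?:re|fwd?):\s*)*\[(" + "|".join(map(re.escape, PREFIXES)) + r")\]",
    re.IGNORECASE,
)


def priority_tag(subject):
    """Return the normalized tag if the subject starts with one, else None."""
    match = _TAG_PATTERN.match(subject)
    return match.group(1).lower() if match else None


tagged = priority_tag("[action-needed] Review the vendor contract by Friday")
replied = priority_tag("Re: [Priority-Shift] Launch date moved up")
untagged = priority_tag("Notes on how we use [action-needed] tags")
```

Only a tag at the start of the subject counts, so a message that merely mentions the tag does not pull anyone’s attention away.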
Finally, you will want to agree as a team to be mindful when others need help, so regular work can continue to happen. Jory MacKay shares some useful recommendations for individuals on the RescueTime blog:
“Schedule dedicated time for more complex questions. If a conversation is going to take time to get through, it’s a good idea to schedule dedicated time. This way that person can prepare for it and make sure it works around their schedule.
Not all interruptions are equal. Decide on which non-urgent form of communication you’ll use. Most people we spoke to said emails are easier to ignore than other forms. Find what channel or tool makes the most sense for your co-workers and use that when you need something. Consider providing context around urgency when you contact someone.
Ask if someone’s free before getting to what you need. It seems almost too easy, but simply asking if someone’s available before jumping into your ask can avoid most face-to-face interruptions.
Have set ‘office hours.’ If your job involves being available to others for questions, set aside dedicated time rather than being always around. This way people know when they can interrupt and you can schedule your day around those periods.”
Beyond cost savings, easier access to talented employees is one of the biggest reasons employers consider supporting remote work. For startups and high-growth companies in particular, the talent supply is limited and in high demand, making hiring very competitive, especially for technical roles. Being able to hire outside traditional tech hubs like the San Francisco Bay Area or greater New York City means people can find the talent they want almost anywhere in the world. Even in less competitive industries, advances in technology make remote work feasible in many roles, granting those employers access to a much broader and more specialized workforce.
A lot of hiring best practices are similar across in-office and remote roles—for example, clarity about the role, sourcing a high-quality applicant pool, and being explicit about cultural values. These practices become even more important when hiring for remote roles. Candidates will be working in physically isolated locales, so small issues can be magnified. We won’t cover all the ins-and-outs of standard good hiring practices, which you can find in our Guide to Technical Recruiting and Hiring as a companion to this section.