Edition e1.0.0
Updated October 9, 2023. As explained by Laura.
A disaster recovery plan is critical to your organization’s ability to respond to and recover from a range of disruptive events.
The objectives of this plan are to:
Undertake a risk management assessment.
Define and prioritize your critical business functions.
Detail your immediate response to a critical incident.
Detail the strategies and actions to be taken to enable you to stay in business.
In plain English, the aim of this entire plan is to know what has gone wrong and get your most critical systems and processes back up and running with minimal disruption.
Next, we are going to look at all the sections you would typically put into your business continuity plan or disaster recovery plan and outline the types of information you should capture in each. Towards the end of this chapter we’ll look at what to do after an incident or disaster, and mistakes to avoid.
You need to manage the risks to your business by identifying and analyzing the things that may have an adverse effect on your business and choosing the best method of dealing with each of these identified risks.
The questions to ask are:
What could cause an impact?
How serious would that impact be?
What is the likelihood of this occurring?
Can it be reduced or eliminated?
Risk/Description | Likelihood | Impact | Preventative Action | Contingency Plans |
---|---|---|---|---|
Natural Disaster | Low | High | Insurance. Off-site backups in multiple locations. | |
Epidemic | Low | High | Well-defined and tested remote working arrangements. | |
Fire | Medium | High | Use of well-provisioned working spaces with fire prevention mechanisms such as sprinklers. | Off-site backups in multiple locations. Insurance. |
Flood | Medium | High | Use of watertight and well-maintained working environments. | Insurance. |
Theft of Equipment | High | Medium | Encryption of all disks and portable devices. Encryption of backup files. Physical controls on working spaces. Guidance for travel with work devices. | Restoration of device data from backups. Insurance. |
Loss of Key Staff Member | High | Medium | Ensuring roles are known by multiple staff members. Use of access management and sharing solutions to ensure all passwords and access keys are securely stored and accessible. | Prompt assessment and revocation of access. |
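One way to make the four questions above concrete is to turn the ratings in the table into a ranked list. The sketch below is illustrative only: the 1 to 3 numeric scale and the likelihood-times-impact score are assumptions rather than part of the plan, and `Risk` and `LEVELS` are hypothetical names.

```python
from dataclasses import dataclass

# Assumed numeric scale for the Low/Medium/High ratings used in the table.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


@dataclass
class Risk:
    name: str
    likelihood: str  # "Low", "Medium", or "High"
    impact: str      # "Low", "Medium", or "High"

    @property
    def score(self) -> int:
        # Likelihood x impact is a common heuristic, assumed here for illustration.
        return LEVELS[self.likelihood] * LEVELS[self.impact]


risks = [
    Risk("Natural Disaster", "Low", "High"),
    Risk("Epidemic", "Low", "High"),
    Risk("Fire", "Medium", "High"),
    Risk("Flood", "Medium", "High"),
    Risk("Theft of Equipment", "High", "Medium"),
    Risk("Loss of Key Staff Member", "High", "Medium"),
]

# Highest score first; sorted() is stable, so ties keep their table order.
ranked = sorted(risks, key=lambda r: r.score, reverse=True)
for risk in ranked:
    print(f"{risk.score}  {risk.name}")
```

A score like this only helps decide what to discuss first. The preventative actions and contingency plans still need to be written by people who know the business.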
Determine what types of insurance are available, and purchase the necessary policies. Your disaster recovery plan should document any policies you have so that if something happens, they are easy to find and trigger.
Example data to capture about your insurance policies:
insurance type
policy details and documents
exclusions
insurance company and contact details
renewal and review dates.
Ensure that any backup processes for critical data are recorded in your disaster recovery plan.
This helps you understand how much data you can recover, how that recovery process works, and when it was last tested.
Example data to capture about your backups:
backup frequency
where and how the backups are made
owner of the system and associated backups
recovery procedures (where to find them)
how frequently the backups are tested and when the last test was held.
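If the backup details listed above are kept in a structured form, it becomes easy to spot backups whose restore test is overdue. This is a minimal sketch under stated assumptions: the `BackupRecord` fields mirror the list above, while the systems, owners, paths, and test intervals are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BackupRecord:
    system: str
    owner: str
    frequency: str           # e.g. "hourly", "daily"
    location: str            # where and how the backups are made
    recovery_procedure: str  # where to find the recovery procedure
    test_interval_days: int  # how often a restore should be tested
    last_tested: date

    def test_overdue(self, today: date) -> bool:
        """True if the last restore test is older than the agreed interval."""
        return today - self.last_tested > timedelta(days=self.test_interval_days)


# Hypothetical records; replace with your own systems.
records = [
    BackupRecord("Customer database", "Platform team", "hourly",
                 "Encrypted snapshots in two regions",
                 "runbooks/customer-db-restore.md", 90, date(2023, 1, 10)),
    BackupRecord("Shared documents", "Operations", "daily",
                 "Export from cloud storage provider",
                 "runbooks/docs-restore.md", 180, date(2023, 8, 1)),
]

today = date(2023, 10, 9)  # fixed date so the example is repeatable
overdue = [r.system for r in records if r.test_overdue(today)]
print(overdue)
```

In practice you would pass `date.today()` and run the check on a schedule, so an untested backup is flagged before you need it.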
This is where we start getting into the really crucial part of disaster recovery. As we all know, you can’t do everything at once; there always has to be an order to the actions we carry out that works with the time, money, and people we have available.
Disaster recovery is one area that really highlights this reality. Imagine you lost all of your systems in one day after a freak accident in a hosting center wiped out your infrastructure. While this is a highly unlikely, contrived example, the point is still the same. If you had nothing and had to rebuild everything from scratch to resume your business operations, what would you restore first?
There are two key measurements we use to prioritize our systems.
Definition: The recovery time objective (RTO) is the amount of time you can operate or survive as a business without a system. In short, how quickly do you need the system to resume?
Definition: The recovery point objective (RPO) defines how much data you would need to have restored for a system to function or to be of use to your company.
The RTO and RPO are a balance. Here are some scenarios that outline the relationship between these two values.
You may be able to resume services very quickly (short RTO) but with a small amount of data (short RPO), such as only the records from the last hour.
You may be able to last a long time without a system (long RTO) so long as when it comes back, you haven’t lost any data at all (long RPO).
You may need your system to be back quickly (short RTO) and have all the data back including historical records (long RPO).
Whatever your requirements when recovering from a disaster, it’s important that every system, tool, or process that needs to be restored is documented in your plan, along with your RTO and RPO for that system. This analysis allows those responding to the event to prioritize and get systems back up in the right order, as well as with enough data to make them useful.
Remember that not everything can be restored first and not all data can come back in those early hours and days. Think carefully about your RTO and RPO expectations so that you can make this process easy and reduce conflict or arguments.
Critical Business Activity/System | Name of the system |
---|---|
Description | What does this system do? |
Priority | What is the priority for this system when recovering? |
Impact of system loss | What impact would losing this system or not recovering it have on the organization? |
Recovery time objective (RTO) | How long can you live without it? |
Recovery point objective (RPO) | How much data do you need back? |
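The template above maps naturally onto a small record per system, which responders can sort to get a restore order. The sketch below is one possible shape, not a prescribed format: `SystemRecord` and `restore_order` are hypothetical names, the three systems and their values are invented, and breaking priority ties by the shorter RTO is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SystemRecord:
    name: str
    description: str     # what does this system do?
    priority: int        # 1 = restore first
    impact_of_loss: str  # what happens if it is not recovered?
    rto_hours: float     # how long can you live without it?
    rpo: str             # how much data do you need back?


def restore_order(systems):
    """Sort by priority, then by the shorter RTO when priorities tie."""
    return sorted(systems, key=lambda s: (s.priority, s.rto_hours))


# Hypothetical systems for illustration only.
systems = [
    SystemRecord("Internal wiki", "Team documentation", 3,
                 "Slower onboarding and lookups", 72, "Latest weekly export"),
    SystemRecord("Payments API", "Takes customer payments", 1,
                 "No revenue can be collected", 2, "All transaction records"),
    SystemRecord("Support inbox", "Customer support email", 1,
                 "Customers cannot reach us", 8, "Open tickets only"),
]

for s in restore_order(systems):
    print(f"P{s.priority}  RTO {s.rto_hours}h  {s.name}  (RPO: {s.rpo})")
```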
Business continuity requires a number of coordinated roles to work efficiently.
To ensure that your organization is able to respond quickly, incident-specific roles such as “incident lead” and “deputy” should be rotated between team members. This redundancy reduces the reliance on individuals and that nasty “key person risk” we discussed in Part III.
Role | Description | Responsibilities |
---|---|---|
Business Continuity Owner | Owns this business continuity plan and its associated risks at management level. | • Update and maintain this document • Arrange for regular tests of this process |
Incident Lead | Controls and leads activities for a specific business continuity event. | • Lead the response team • Coordinate response activities • Manage prioritization during event |
Deputy | Supports the Incident Lead and manages communications for a specific incident. | • Manage communications with internal and external stakeholders • Support the Incident Lead |
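Rotating the incident lead and deputy roles can be as simple as a round-robin schedule. The following sketch assumes a fixed team list and a rotation period you choose yourself (weekly, monthly, and so on). The `rotation` function and the names in `team` are hypothetical.

```python
def rotation(team, periods):
    """Pair each period with a lead and a deputy, rotating through the team.

    The deputy for one period becomes the lead for the next, so every
    lead has recently shadowed the role.
    """
    if len(team) < 2:
        raise ValueError("need at least two people so lead and deputy differ")
    schedule = []
    for i in range(periods):
        lead = team[i % len(team)]
        deputy = team[(i + 1) % len(team)]
        schedule.append((i + 1, lead, deputy))
    return schedule


team = ["Ana", "Ben", "Chen"]  # placeholder names
for period, lead, deputy in rotation(team, 4):
    print(f"Period {period}: lead={lead}, deputy={deputy}")
```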
This is where our plan starts to move from collecting important information to documenting the key steps we need our response team to take for every event.
The following activities should be conducted in the event of a serious business continuity incident. They are listed in priority order.
Assess the severity of the incident.
Evacuate the site.
Account for everyone.
Identify any injuries to people.
Contact emergency services.
Start event log.
Begin restoration plan activities.
Activate staff members and resources.
It’s important to review these suggestions and see if there are any additional steps you need to take based on your location, operating model, health and safety risks, or culture.
The aim should remain the same, however. The first steps of this process are always focused on quickly triaging the situation and ensuring people are removed from harm’s way. Later steps focus on addressing human harm first and then, when safe to do so, restoring services and operations.
Upon loss of your physical infrastructure or office, or any event that prevents staff from safely reaching usual working premises, you need to ensure that your team take the right steps to stay safe:
If you are at the affected site at the time of the event, report to the business continuity lead to register your safety and presence. In some cases, like a fire alarm, this might mean gathering at a physical location such as a car park or muster point. In cases like natural disasters, or for remote teams affected by disaster events, this might be a digital check-in to say you are safe.
Seek medical assistance where required.
Remain at or return to your homes or other appropriate safe location.
Resume working in a remote capacity when safe to do so.
Await further instructions.
The important thing to remember is that by documenting this plan and your expectations, you can remove some of the anxiety and uncertainty from a very stressful situation. Disasters and business continuity events are very hard to manage for most people, and by having a simple, well-communicated plan, your team can focus on staying safe and can fall back to your instructions at any time if they are lost, uncertain, or unclear as to what is expected of them.
Once your people are safe, the next step is to locate the essential equipment and documentation you need to begin the recovery process.
For essential equipment, you should make sure the following are available:
emergency medical supplies
first aid kits
earthquake kits
flashlights.
Many national and international civil defense organizations provide guidance on preparing these kits and what to include in them. Please remember, if your team is remote or distributed, you should provide this equipment to all operating locations.
When preparing your essential documentation, you should make sure that the following are available:
contact details for communicating with the team and key stakeholders
insurance information
your disaster recovery plan
recovery codes for secure accounts, password managers, or other highly sensitive systems.
Each of these items should be stored somewhere suitable and accessible in the event of a disaster. It’s no good having a well-documented plan if nobody can find it. Your plan and critical information should be stored both electronically and physically in a number of geographically separated locations. This ensures that in a bad situation, there should always be a copy accessible.
As well as choosing good locations, ensure that multiple people have access so that in the event of injury or loss of contact, there are additional people who can locate the plan and activate it.
Much like you need to be able to find your first aid kit in case of an emergency, having contact details at hand is also critical to how well you can respond.
Remember that depending on the type of emergency, you won’t just be able to look up someone’s details on your company’s computer systems. You may have to resort to more manual and old-school mechanisms, like a call sheet.
When capturing contact information, remember to capture the details of both internal contacts (people on your team) and external contacts (people outside your organization who are essential to its operation).
For each of these groups, you should capture:
name
contact number
responsibilities or roles
which company they represent (external only).
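Because the call sheet may need to be printed, it helps to keep contacts in one structured list and generate the plain-text sheet from it. This is a rough sketch: the `Contact` fields follow the list above, while `call_sheet`, the names, the company, and the phone numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    name: str
    number: str
    role: str
    company: Optional[str] = None  # set for external contacts only


def call_sheet(contacts):
    """Render a printable plain-text call sheet, internal contacts first."""
    groups = (
        ("INTERNAL", [c for c in contacts if c.company is None]),
        ("EXTERNAL", [c for c in contacts if c.company is not None]),
    )
    lines = []
    for heading, group in groups:
        lines.append(heading)
        for c in sorted(group, key=lambda c: c.name):
            company = f" ({c.company})" if c.company else ""
            lines.append(f"  {c.name}{company} | {c.role} | {c.number}")
    return "\n".join(lines)


# Placeholder contacts with fictional numbers.
contacts = [
    Contact("Ben", "555-0102", "Deputy"),
    Contact("Dana", "555-0103", "Account manager", "Example Hosting Co."),
    Contact("Ana", "555-0101", "Incident Lead"),
]

sheet = call_sheet(contacts)
print(sheet)
```

Regenerating and reprinting the sheet whenever the list changes keeps the paper copy in step with the electronic one.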
The painful part of this section is maintaining your lists. It’s been a long time since any of us kept a physical address book. Make a point to schedule updates to both this list and your overall disaster recovery plan and assign owners from across your team to ensure that many hands make light work of its upkeep.
Finally, our disaster recovery plan would not be complete without instructions on how to recover the systems, infrastructure, facilities, and data we rely on to get the job done every day. The more systems you have and the more complex your organization, the more you will need to document here.
The aim of this section is to give responders enough information to get going with restoring systems. This often includes:
where to find recovery playbooks for each system
who to contact for each system to talk through the process and set expectations
where to find essential equipment, backups, or authentication materials
how to get physical access where needed.
Remember that any document you reference here should be stored along with the overall plan so that it can be accessed in times of need.
The first common element of both disaster recovery and incident response plans is the need to plan your communications during an emergency. There are many reasons why you don’t want to leave this to chance:
Your normal communication tools may not be available due to an outage or fault.
You may have no physical access to your communication devices, or other physical locations or equipment needed to use them.
You may not have reliable internet access.
Regardless of why you can’t just “do what you always do,” there are a number of key communications channels you need to establish when handling an incident or disaster. These include:
Channel | Reason |
---|---|
Emergency Services | To coordinate any response needed from fire, ambulance, police, or other emergency support services. |
Executive and Board | To communicate updates and briefings as the situation evolves. |
Whole Company | To inform the team of the situation and any changes to operations as a result. |
Media | To manage and respond proactively to media questions in the event of a publicized issue. |
Customers | To support, soothe, and inform customers as the situation evolves, such that they know what to expect and are aware of any risks or service interruptions. |
Internal Response Team | To communicate internally to collaborate on incident response or disaster recovery activity, as well as to capture the timeline of events as they emerge. |
When choosing appropriate communications channels and technologies, you should consider some of the following:
Does my audience have access to this channel?
Do we need any specialist equipment, accounts, or access that can be set up in advance?
Is this channel secure enough to send sensitive information during an emergency, or do you need to document guidance concerning what information can be shared and where?
Does my audience know where to expect communications?
Do I need evidence of this communication after the incident or disaster has ended?
The right communication channel is one that you can safely access, that can reach your required audience, and that will protect your communications in transit (while being sent) and at rest (once they have been sent). Remember that in stressful situations, choosing simple, reliable communication is much better for reducing stress than choosing cutting-edge, untested options. To that end, don’t forget that sometimes just picking up the phone and calling someone is the easiest path to get the job done.
For those items that need some form of evidence after the event, ensure that any verbal channels are followed up by written summaries, shared with both parties.
Whatever channels you choose, whether it be telephone, email, collaborative documents, or messaging platforms like WhatsApp, Signal, or Slack, remember to test them first—in fact, test the entire plan.
The second common element of both disaster recovery and incident response plans is the need to test that the plans work.
I know that it’s tempting to say “we have incidents all the time so we know what to do,” but in all honesty, just because you have incidents frequently, it doesn’t mean that they are representative of all the events you might need to deal with. There is also the question about who is “handling” your incidents. If you are responding from instinct, experience, or memory, that response is probably different from what is in your plan and may be difficult for someone else on the team to replicate.
Important: Every plan you create should be tested at least once a year. It’s as simple as that.
The risks and threats faced by an organization change over time, as do the staff members involved with protecting it. Testing on a regular basis ensures that the plan remains accurate and appropriate. Testing also ensures that all potential response team members are familiar with executing this plan.
The point of the test is to gather together the people and teams who would likely be involved in the response and walk through the plan together. This process allows all these different people to identify gaps or questions that arise from the process. The more they identify, the more you can improve your plan (or associated systems) to make sure that in a real emergency, the plan will be at its most effective.
You’ve decided to run your first testing session; fabulous. Here are some things you need to do that will help you get the most out of your session.
Create a list of representatives from key areas in your organization that are likely to be involved in responding to an incident. For example:
Customer success (to explain outages to customers)
Engineering (to diagnose or fix issues)
Operations (to be involved in process alteration or backup systems)
Board and executive members (to be briefed)
Legal (to assess implications of incidents and advise the board and executive team)
Marketing (to engage with the media or create a communications plan)
Schedule a time to meet; this needs to be enough time to get through the plan and allow for people to discuss challenges and ask questions (at least a couple of hours normally).
Choose a testing scenario and make sure everyone has access to the plan you are testing in advance.
Choose a lead for the plan test; this person needs to control the scenario and walk the other participants through the challenge. They should be very familiar with the plan and be able to adapt the scenario if questions arise.
Choose someone to take notes, as you will need these to identify issues or updates that need to be made.
Run the testing session; you will probably need a whiteboard, pens, and a private space.
Record any outcomes or issues that need to be addressed and assign them to teams.
Ensure all issues are addressed within 30 days of the testing session.
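The last two steps are easier to enforce when the recorded issues are tracked against the 30-day deadline. The sketch below is one simple way to do that. `Action` and `overdue_actions` are hypothetical names, and the example issues and dates are invented.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# The 30-day window comes from the final step above.
REMEDIATION_WINDOW = timedelta(days=30)


@dataclass
class Action:
    description: str
    owner: str
    done: bool = False


def overdue_actions(actions, test_date, today):
    """Return open actions once the window after the test session has passed."""
    deadline = test_date + REMEDIATION_WINDOW
    if today <= deadline:
        return []
    return [a for a in actions if not a.done]


# Hypothetical outcomes from a test session.
actions = [
    Action("Add muster point to the plan", "Operations", done=True),
    Action("Store recovery codes off-site", "Engineering"),
]

late = overdue_actions(actions, test_date=date(2023, 9, 1), today=date(2023, 10, 9))
for a in late:
    print(f"OVERDUE: {a.description} (owner: {a.owner})")
```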
When something goes wrong, the best course of action (once you have recovered) is to do some reflection and try to identify changes that can be made to systems, processes, or situations to avoid the same thing happening again.
A post-incident review is a structured exercise designed to review the chain of events surrounding an incident or event. By evaluating the activities that led to and resulted from an incident, the post-incident review is able to establish a timeline of events and identify any areas for improvement.
When structured well, a post-incident review is a blameless tool for evaluation, feedback, and process improvement. You can learn more about blameless approaches to post-incident reviews by checking out Etsy’s work in this space.
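The timeline that a post-incident review relies on can be assembled from the event log kept during the response. The following is a minimal sketch, assuming each log entry records a timestamp, who wrote it, and a short note. `LogEntry`, `timeline`, and the sample entries are illustrative and not taken from any real incident.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    at: datetime
    author: str  # a role rather than a name keeps the review blameless
    note: str


def timeline(entries):
    """Return the log as formatted lines in chronological order."""
    ordered = sorted(entries, key=lambda e: e.at)
    return [f"{e.at:%Y-%m-%d %H:%M} [{e.author}] {e.note}" for e in ordered]


# Hypothetical entries, deliberately out of order.
log = [
    LogEntry(datetime(2023, 10, 9, 9, 40), "Incident Lead", "Restoration plan started"),
    LogEntry(datetime(2023, 10, 9, 9, 5), "Deputy", "Site evacuated, everyone accounted for"),
    LogEntry(datetime(2023, 10, 9, 9, 15), "Deputy", "Customers notified of outage"),
]

for line in timeline(log):
    print(line)
```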