5 ways to test your BCDR potential

Jul 20 · 08 min read


Flight attendants always review the safety protocol before each flight, even though pilots are well-trained. Tech giants like Google have recovery data centers all over the world. Everybody needs a disaster plan, including your business. However, a plan isn't just documentation. How do you know your business continuity and disaster recovery (BCDR) plan is effective? What is ManageEngine's disaster recovery (DR) testing process? Let's take a look.

DR is a primary requirement, not just for us, but for our customers as well. It helps us minimize downtime during unexpected events, and testing our plans is crucial to:

  • Ensure stakeholders are aware of their roles and responsibilities
  • Verify all procedures and documentation are up to date
  • Verify the components of the IT infrastructure are available and easily accessible
  • Identify areas of improvement and gaps in the process or strategy, and find solutions

Types of DR testing

There are five common types of DR tests that help your organization confirm that its methods and processes are optimal.

1. Checklist testing:

The first and simplest test is to figure out whether you have everything you need: people, process, and technology. The DR owner and team members convene to review existing documentation and make changes as required. It is also often called a paper test.

During the Chennai floods in 2015, some of our critical teams had to shift operations to our secondary office in Tenkasi (L2). We had to accommodate close to 150 employees in a new location and get them working immediately, as they engaged with high-profile customers. The IT teams needed to ensure the necessary infrastructure was in place to help these employees get on their feet right away. Having a blueprint enabled faster communication between stakeholders and ensured favorable results. Here's a sample checklist that helped us make the transition smooth:

People

Point of contact for operations

Admin: Siva Venkat

HR: Deepa Kumar

Sysadmin: Raj Singh

Process

  • VPN access
  • Network ports in L2
  • Phone extensions for customer-facing teams
  • Datacenter access for engineering and support teams
  • Encryption checks for employees utilizing laptops
  • Bandwidth to host 150 more employees in L2
  • Seating capacity and conference rooms for 150 new employees
  • Safe transport arrangements for 150 people

Technology

Internal communication: Zoho Cliq

Remote attendance: Zoho People

Flood-related announcements: Zoho Connect

Employee survey for safety updates: Zoho Creator

Each team involved in this BCDR plan uses its own checklists. They help the team test how quickly and effectively it can carry out the plan.
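A checklist like the one above can also be kept in a form that a script can review. The following is a minimal sketch, not our internal tooling: the `ChecklistItem` structure, the owner assigned to each item, and the verified flags are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    category: str          # "People", "Process", or "Technology"
    name: str
    owner: str             # hypothetical owner, not taken from the article
    verified: bool = False


def open_items(items):
    """Return the items that have not been verified yet."""
    return [item for item in items if not item.verified]


def readiness(items):
    """Fraction of checklist items verified, between 0.0 and 1.0."""
    if not items:
        return 0.0
    return sum(item.verified for item in items) / len(items)


# Illustrative statuses only; the real checklist has more entries.
checklist = [
    ChecklistItem("Process", "VPN access", "Sysadmin", verified=True),
    ChecklistItem("Process", "Network ports in L2", "Sysadmin", verified=True),
    ChecklistItem("Process", "Bandwidth to host 150 more employees in L2", "Sysadmin"),
    ChecklistItem("Process", "Safe transport arrangements for 150 people", "Admin"),
]

for item in open_items(checklist):
    print(f"OPEN [{item.category}] {item.name} (owner: {item.owner})")
print(f"Readiness: {readiness(checklist):.0%}")
```

Tracking each item with an owner and a verified flag makes it easy to see, during a paper test, which parts of the plan still need attention.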

For events like network and power outages, or natural disasters, you need to document critical inventory, the recovery time objective (RTO), the recovery point objective (RPO), recovery sites, and sensitive information. A checklist test, however beneficial, is not a reliable method on its own to guarantee that your BCDR plan works. It works best when combined with other methods of testing.
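Once the recovery time objective (RTO) and recovery point objective (RPO) are documented, a test run can be measured against them. Here is a small sketch of that arithmetic; the objectives and timestamps are hypothetical values chosen for illustration, not figures from our plan.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup, outage_start, rpo):
    """The data-loss window (outage start minus last good backup) must not exceed the RPO."""
    return outage_start - last_backup <= rpo


def meets_rto(outage_start, service_restored, rto):
    """The downtime (restore time minus outage start) must not exceed the RTO."""
    return service_restored - outage_start <= rto


# Hypothetical objectives and timings from a test run.
rpo = timedelta(hours=1)
rto = timedelta(hours=4)
last_backup = datetime(2024, 1, 10, 8, 30)
outage_start = datetime(2024, 1, 10, 9, 15)
service_restored = datetime(2024, 1, 10, 14, 0)

# 45 minutes of potential data loss against a 1-hour RPO.
print("RPO met:", meets_rpo(last_backup, outage_start, rpo))
# 4 hours 45 minutes of downtime against a 4-hour RTO.
print("RTO met:", meets_rto(outage_start, service_restored, rto))
```

In this example the RPO is met but the RTO is missed, which is exactly the kind of gap a test is meant to surface.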

2. Walk-through testing:

Walk-through testing is a group discussion in which stakeholders review the plan, identify potential obstacles, and agree on methods to overcome them. It is also often called a tabletop exercise. A walk-through test is a simple, low-cost exercise that can be highly effective. It does not require a lot of resources, just a conference room and the stakeholders' time. By the end of this stage, everyone involved must be clear about their role in case of an emergency.

Here's a standard template you can use during discussions. Your aim should be to predict potential scenarios, define the response plan for each, and confirm that everyone knows what to do when a disaster strikes.

Exercise information

Objective: During the initial stages of COVID-19, we executed a no-contact policy when employees returned their used devices or came here to collect new ones.

Type of exercise: Walk-through

Scope: All business units located in the Estancia office

Elements to be tested:

  • People: Yes
  • Process: Yes
  • Infrastructure: Yes
  • Resources: Yes

Success criteria: Employees safely receive their new devices and hand over the old ones without coming in contact with any other staff.

Pre-conditions: The Admin team must be aware of the procedure for moving devices from the warehouse to the pickup point, keeping the safety of the employees in mind.

Participants (internal/external): Internal

Conduct of exercise

Date: February 1, 2022

Start and end time: 9am - 11pm IST

Duration: 14 hours

Location: Zoho Estancia office, Chennai

Frequency:

  • Mandatory: Twice a year
  • Voluntary: After any changes/reviews

Notice required: Yes

Participant briefing: Yes. This includes a short introduction prior to the exercise and supporting guidelines.

Exercise budget: The project budget is determined by the IT management team.

Controllers: Head of administrative operations and one representative from HR (TBA)

Documentation/ Post walk-through

Actions

Exercise report

  • When employees raise an asset request, they receive an automated email that runs them through the guidelines.
  • When their device is ready, they receive another email providing them with a unique ID. Only after receiving this ID do they plan their travel to the office.
  • When they arrive at the office, a security officer conducts basic COVID-19 checks and requests the unique ID. The security officer enters the ID in the database to check if the associated device is ready in the warehouse.
  • If it is ready, the officer escorts the employees to the pickup point and seats them at a socially-distanced location.
  • The officer arranges transit personnel to move devices from the warehouse to the pickup point.
  • After the employees receive their devices, the security officer notifies the IT team and records the serial numbers of the devices in the database.
  • The employees receive an email stating that they've collected their devices, with an option to reply and confirm this.

Action plan

Each team will be notified to modify the portion of the report that relates to its activities.

Supporting documents

  • The record of the assets leaving the warehouse (as maintained by the security team)
  • The template and distribution list for the confirmation email sent to the employees, and relevant partners, vendors, and interns
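During a walk-through, it can help to trace the pickup flow from the exercise report step by step. The sketch below models the unique ID check at the gate and the handover record. It is an illustration only: `PickupDesk`, `AssetRequest`, the ID format, and the serial number are hypothetical, and an in-memory dictionary stands in for the database the security officer queries.

```python
from dataclasses import dataclass, field


@dataclass
class AssetRequest:
    employee: str
    device_ready: bool = False
    collected_serials: list = field(default_factory=list)


class PickupDesk:
    """In-memory stand-in for the database used at the security gate."""

    def __init__(self):
        self._requests = {}  # unique ID -> AssetRequest

    def register(self, unique_id, request):
        self._requests[unique_id] = request

    def can_enter(self, unique_id):
        """True only if the ID is known and its device is ready in the warehouse."""
        request = self._requests.get(unique_id)
        return request is not None and request.device_ready

    def record_handover(self, unique_id, serial_number):
        """Record the serial number of a collected device against the request."""
        if not self.can_enter(unique_id):
            raise ValueError(f"No ready device for ID {unique_id!r}")
        request = self._requests[unique_id]
        request.collected_serials.append(serial_number)
        return request


desk = PickupDesk()
desk.register("REQ-001", AssetRequest("employee-a", device_ready=True))
desk.register("REQ-002", AssetRequest("employee-b"))

print(desk.can_enter("REQ-001"))  # known ID, device ready
print(desk.can_enter("REQ-002"))  # known ID, device not ready yet
print(desk.can_enter("REQ-999"))  # unknown ID
handover = desk.record_handover("REQ-001", "SN-12345")
print(handover.collected_serials)
```

Walking through each branch (ready, not ready, unknown ID) is a quick way to confirm that every participant knows what happens in each case.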

3. Simulation testing:

As the name suggests, this test involves imitating real-life scenarios to determine how teams respond. You should be able to answer questions like:

  • Do we have the right tools?
  • Do we have enough people?
  • Are we capable of getting the job done?
  • How does it affect business?
  • How much time does recovery take?

Let's take the same scenario from earlier, the 2015 floods. With the inauguration of our hub and spoke offices, employees have the option to work from any office at their convenience. Team members often show up unannounced. Say a product team of 50 members shows up at a spoke office. Product teams include members in different roles, such as developers, marketers, and managers, and these roles have different requirements. So, IT teams at these offices keep checklists to ensure they can provide the required infrastructure. Ideally, a simulation test should be carried out periodically, say once or twice a year.
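A simulation like this can start with a simple capacity check: add up what each role needs and compare it with what the office has. The sketch below assumes a team split, per-role requirements, and an office inventory that are all made up for illustration.

```python
from collections import Counter

# Hypothetical per-person requirements for each role.
ROLE_NEEDS = {
    "developer": {"desk": 1, "monitor": 2, "network_port": 1},
    "marketer": {"desk": 1, "monitor": 1},
    "manager": {"desk": 1, "monitor": 1, "phone_extension": 1},
}


def total_needs(team):
    """Sum the resources required by a team given as {role: headcount}."""
    needs = Counter()
    for role, headcount in team.items():
        for resource, quantity in ROLE_NEEDS[role].items():
            needs[resource] += quantity * headcount
    return needs


def shortfall(needs, available):
    """Return {resource: missing quantity} for every resource that falls short."""
    return {
        resource: quantity - available.get(resource, 0)
        for resource, quantity in needs.items()
        if quantity > available.get(resource, 0)
    }


# A 50-member product team arriving at a spoke office (assumed split).
team = {"developer": 30, "marketer": 12, "manager": 8}
# Assumed spare inventory at the spoke office.
available = {"desk": 60, "monitor": 70, "network_port": 40, "phone_extension": 5}

needs = total_needs(team)
gaps = shortfall(needs, available)
print("Needs:", dict(needs))
print("Shortfall:", gaps)
```

Any non-empty shortfall points to the resources the IT team should arrange before the next simulation, or before a real event.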

4. Parallel testing:

You've identified backup recovery systems during the walk-through testing. But are you sure they can handle real-world business operations? Time to find out. Run the disaster recovery systems in parallel with the primary systems to test accuracy in performance and support.

Application switchover testing

Application switchover testing at Zoho Corp handled by Zorro

For instance, Zorro, Zoho's infrastructure operations team, conducts application switchover testing by validating output from the recovery data center against the primary data center. Each component inside a data center has redundancy. If one component fails during a disaster, we have a secondary component replicating its exact configuration and data. Likewise, there is an overall redundancy for data centers too. Components, activities, and data in all primary data centers are replicated dynamically in secondary data centers. Primary and secondary data centers are exact copies of each other and are separated physically. Yet, both run simultaneously. When it's time to test, we switch production from the primary to the secondary data center. The data from the secondary data center is then compared against the primary's to confirm accurate data replication.
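The comparison step can be illustrated with a small sketch. This is not how Zorro validates replication; it only shows the general idea of comparing records from two sites by digest. The record layout and the sample data are assumptions.

```python
import hashlib
import json


def record_digest(record):
    """Stable SHA-256 digest of a record, independent of key order."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def compare_sites(primary, secondary):
    """Compare two {key: record} snapshots and report replication differences."""
    shared = primary.keys() & secondary.keys()
    return {
        "missing": sorted(primary.keys() - secondary.keys()),
        "extra": sorted(secondary.keys() - primary.keys()),
        "mismatched": sorted(
            key for key in shared
            if record_digest(primary[key]) != record_digest(secondary[key])
        ),
    }


# Hypothetical snapshots taken from each data center at the same point in time.
primary = {
    "order-1": {"total": 120, "status": "paid"},
    "order-2": {"total": 75, "status": "shipped"},
    "order-3": {"total": 40, "status": "paid"},
}
secondary = {
    "order-1": {"status": "paid", "total": 120},    # same data, different key order
    "order-2": {"total": 75, "status": "pending"},  # stale copy
}

report = compare_sites(primary, secondary)
print(report)
```

An empty report (no missing, extra, or mismatched keys) would indicate that the two snapshots agree.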

5. Cut over testing:

What happens if your primary system fails? Often called full interruption testing, a cut over test aims to evaluate the ability of your recovery systems to support complete business functions. The primary systems are shut off or disconnected, forcing the recovery systems to kick in and take over.

Here's a simplified guide used to test the performance of a new application.

  • Create a cut-over plan with a list of roles, responsibilities, tools, and processes involved.
  • List out the dependencies (if any) for each task.
  • Conduct risk analysis for dependencies and tasks.
  • Include a rollback and business continuity plan.
  • Announce the exact time for each task and record it on a downtime calendar.
  • Conduct a mock run.
  • Sort out any issues that arise during the mock run.
  • Go live: release the application to the target users.
  • Provide post-implementation support.
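Listing dependencies for each task pays off when you need a safe execution order. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to order tasks so that each one runs only after the tasks it depends on. The task names and their dependencies are hypothetical examples, not our actual cut-over plan.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical cut-over tasks: each task maps to the set of tasks it depends on.
cutover_tasks = {
    "freeze_changes": set(),
    "backup_primary": {"freeze_changes"},
    "mock_run": {"backup_primary"},
    "fix_mock_run_issues": {"mock_run"},
    "go_live": {"fix_mock_run_issues"},
    "post_implementation_support": {"go_live"},
}


def cutover_order(tasks):
    """Return the tasks in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(tasks).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of tasks that form the cycle.
        raise ValueError(f"Circular dependency in cut-over plan: {exc.args[1]}") from exc


order = cutover_order(cutover_tasks)
for step, task in enumerate(order, start=1):
    print(f"{step}. {task}")
```

A circular dependency raises an error instead of producing an order, which flags a planning mistake before the mock run rather than during it.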

A BCDR plan determines your organization's preparedness in difficult times. Conduct tests periodically and ensure stakeholders are aware of their roles and responsibilities at all times for a fast and secure recovery from operational disruptions. Want to know more? Our BCDR e-book goes in-depth about ManageEngine's BCDR blueprint, the implementation process, and the teams and tools involved.

About the author

Mahanya Vanidas, Content writer
