5 ways to test your BCDR potential

Flight attendants always review the safety protocol before each flight, even though pilots are well-trained. Tech giants like Google have recovery data centers all over the world. Everybody needs a disaster plan, including your business. However, a plan isn't just documentation. How do you know your business continuity and disaster recovery (BCDR) plan is effective? What is ManageEngine's disaster recovery (DR) testing process? Let's take a look.

DR is a primary requirement, not just for us, but for our customers as well. It helps us minimize downtime during unexpected events, and testing our plans is crucial to:

Ensure stakeholders are aware of their roles and responsibilities
Verify all procedures and documentation are up to date
Verify the components of the IT infrastructure are available and easily accessible
Identify areas of improvement, gaps in the process or strategy, and find solutions

Types of DR testing

There are five common types of DR tests that help your organization confirm that its methods and processes are optimal.

1. Checklist testing:

The first and simplest test is to confirm that you have everything you need: people, process, and technology. The DR owner and team members convene to review existing documentation and make changes as required. This is often called a paper test.

During the Chennai floods in 2015, some of our critical teams had to shift operations to our secondary office in Tenkasi (L2). We had to accommodate close to 150 employees in a new location and get them working immediately, as they engaged with high-profile customers. The IT teams needed to ensure the necessary infrastructure was in place to help these employees get on their feet right away. Having a blueprint enabled faster communication between stakeholders and ensured favorable results. Here's a sample checklist that helped us make the transition smooth:

People

Point of contact for operations

Admin: Siva Venkat

HR: Deepa Kumar

Sysadmin: Raj Singh

Process

VPN access
Network ports in L2
Phone extensions for customer-facing teams
Datacenter access for engineering and support teams
Encryption checks for employees utilizing laptops
Bandwidth to host 150 more employees in L2
Seating capacity and conference rooms for 150 new employees
Safe transport arrangements for 150 people

Technology

Internal communication: Zoho Cliq

Remote attendance: Zoho People

Flood-related announcements: Zoho Connect

Employee survey for safety updates: Zoho Creator

Each team involved in this BCDR plan uses its own checklist, which helps the team test how quickly and effectively it can carry out its part of the plan.
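One way a team might track its checklist is as structured data, so that open items are easy to list before a drill. The following is a minimal sketch, not the tooling used at ManageEngine: the item names are taken from the checklist above, while the `ChecklistItem` type, the `open_items` helper, and the ready flags are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    category: str   # "People", "Process", or "Technology"
    name: str
    ready: bool     # has the owning team confirmed this item? (illustrative)


def open_items(checklist: list[ChecklistItem]) -> dict[str, list[str]]:
    """Group the items that are not yet confirmed, keyed by category."""
    pending: dict[str, list[str]] = {}
    for item in checklist:
        if not item.ready:
            pending.setdefault(item.category, []).append(item.name)
    return pending


# Item names come from the sample checklist; the ready flags are made up.
checklist = [
    ChecklistItem("Process", "VPN access", True),
    ChecklistItem("Process", "Network ports in L2", True),
    ChecklistItem("Process", "Bandwidth to host 150 more employees in L2", False),
    ChecklistItem("Process", "Safe transport arrangements for 150 people", False),
    ChecklistItem("Technology", "Internal communication: Zoho Cliq", True),
]

for category, names in open_items(checklist).items():
    print(category, "->", names)
```

An empty result from `open_items` would mean every item on the checklist has been confirmed.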

For events like network and power outages or natural disasters, you need to document your critical inventory, recovery time objective (the maximum acceptable downtime), recovery point objective (the maximum acceptable window of data loss), recovery sites, and sensitive information. A checklist test, however beneficial, is not a reliable method on its own to guarantee that your BCDR plan works. It works best when combined with other methods of testing.
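Once the recovery time objective (RTO) and recovery point objective (RPO) are documented, later tests can be scored against them. The sketch below shows one possible way to do that comparison. The `RecoveryTarget` and `DrillResult` types, the system name, and all durations are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTarget:
    """One documented inventory item and its recovery objectives."""
    system: str        # hypothetical system name
    rto: timedelta     # recovery time objective: maximum acceptable downtime
    rpo: timedelta     # recovery point objective: maximum acceptable data loss window


@dataclass
class DrillResult:
    """What was actually measured during a DR test."""
    system: str
    downtime: timedelta          # time from outage to restored service
    data_loss_window: timedelta  # age of the newest recoverable data at the outage


def unmet_objectives(target: RecoveryTarget, result: DrillResult) -> list[str]:
    """Return the objectives the drill failed to meet (an empty list means pass)."""
    misses = []
    if result.downtime > target.rto:
        misses.append("RTO")
    if result.data_loss_window > target.rpo:
        misses.append("RPO")
    return misses


# Illustrative values only.
target = RecoveryTarget("helpdesk-portal", rto=timedelta(hours=4), rpo=timedelta(minutes=15))
result = DrillResult("helpdesk-portal", downtime=timedelta(hours=3),
                     data_loss_window=timedelta(minutes=30))
print(unmet_objectives(target, result))  # ['RPO']
```

In this example the service came back within the four-hour RTO, but 30 minutes of data would have been lost against a 15-minute RPO, so the drill flags the RPO.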

2. Walk-through testing:

Walk-through testing is a group discussion in which stakeholders go over the plan, identify potential obstacles, and agree on the methods to overcome them. It is also often called a table-top exercise. A walk-through test is a simple, low-cost exercise that can be highly effective. It does not require a lot of resources, just a conference room and the stakeholders' time. At this stage, everyone involved must be clear about their role in case of an emergency.

Here's a standard template you can use during discussions. Your aim should be to predict potential scenarios, define the response plan for each, and confirm that everyone knows what to do when a disaster strikes.

Exercise Information
Objective	During the initial stages of COVID-19, we executed a no-contact policy when employees returned their used devices or came here to collect new ones.
Type of exercise	Walk-through
Scope	All business units located in the Estancia office
Elements to be tested:
People	Yes
Process	Yes
Infrastructure	Yes
Resources	Yes
Success Criteria	Employees safely receive their new devices and hand over the old ones without coming in contact with any other staff.
Pre-conditions	The Admin team must be aware of the procedure for moving devices from the warehouse to the pickup point, keeping the safety of the employees in mind.
Participants (Internal/External):	Internal

Conduct of exercise
Date	February 1, 2022
Start and end time	9am - 11pm IST
Duration	14 hours
Location	Zoho Estancia office, Chennai
Frequency:
Mandatory	Twice a year
Voluntary	After any changes/reviews
Notice required	Yes
Participant briefing	Yes. This includes a short introduction prior to the exercise and supporting guidelines.
Exercise budget	The project budget is determined by the IT management team.
Controllers	Head of administrative operations and one representative from HR (TBA)

Documentation/ Post walk-through	Actions
Exercise report	When employees raise an asset request, they receive an automated email that runs them through the guidelines. When their device is ready, they receive another email providing them with a unique ID. Only after receiving this ID do they plan their travel to the office. When they arrive at the office, a security officer conducts basic COVID-19 checks and requests the unique ID. The security officer enters the ID in the database to check if the associated device is ready in the warehouse. If it is ready, the officer escorts the employees to the pickup point and seats them at a socially distanced location. The officer arranges transit personnel to move devices from the warehouse to the pickup point. After the employees receive their devices, the security officer notifies the IT team, and the serial numbers of the devices are recorded in the database. The employees receive an email stating they've collected their devices, and it includes an option for them to reply to confirm this.
Action plan	Each team will be notified to modify the portion of the report that relates to its activities.
Supporting documents	The record of the assets leaving the warehouse (as maintained by the security team); the template and distribution list for the confirmation email sent to the employees and to relevant partners, vendors, and interns

3. Simulation testing:

As the name suggests, this test involves imitating real-life scenarios to determine how teams respond. You should be able to answer questions like:

Do we have the right tools?
Do we have enough people?
Are we capable of getting the job done?
How does it affect business?
How much time does recovery take?

Let's take the same scenario from earlier, the 2015 floods. With the inauguration of our hub-and-spoke offices, employees have the option to work from any office at their convenience, and team members often show up unannounced. Say a product team of 50 members shows up at a spoke office. Product teams comprise members in different roles, such as developers, marketers, and managers, and these roles have different requirements. So, IT teams at these offices have checklists to ensure they can provide the required infrastructure. Ideally, a simulation test should be carried out periodically, say once or twice a year.
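The spoke-office scenario can be turned into a quick capacity check: add up what each role needs and compare it with what the office has. This is only a sketch under assumed numbers. The per-role requirements in `ROLE_NEEDS`, the team split, and the office inventory are all hypothetical.

```python
from collections import Counter

# Hypothetical per-person needs for each role.
ROLE_NEEDS = {
    "developer": Counter(desk=1, monitor=2, network_port=1),
    "marketer": Counter(desk=1, monitor=1),
    "manager": Counter(desk=1, monitor=1, phone_extension=1),
}


def shortfall(team: dict[str, int], available: Counter) -> Counter:
    """Return how many units of each resource the office is short of."""
    required = Counter()
    for role, headcount in team.items():
        for resource, per_person in ROLE_NEEDS[role].items():
            required[resource] += per_person * headcount
    # Counter subtraction keeps only positive counts, i.e. the real gaps.
    return required - available


team = {"developer": 30, "marketer": 15, "manager": 5}  # 50 members in total
available = Counter(desk=60, monitor=70, network_port=40, phone_extension=2)
print(dict(shortfall(team, available)))  # {'monitor': 10, 'phone_extension': 3}
```

With these made-up figures the office has enough desks and network ports but is ten monitors and three phone extensions short, which is the kind of gap a simulation test is meant to surface before a real event.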

4. Parallel testing:

You've identified backup recovery systems during the walk-through testing. But are you sure they can handle real-world business operations? Time to find out. Run the disaster recovery systems in parallel with the primary systems to test accuracy in performance and support.

Application switchover testing at Zoho Corp handled by Zorro

For instance, Zorro, Zoho's infrastructure operations team, conducts application switchover testing by validating output from the recovery data center against the primary data center. Each component inside a data center has redundancy. If one component fails during a disaster, we have a secondary component replicating its exact configuration and data. Likewise, there is overall redundancy for data centers too. Components, activities, and data in all primary data centers are replicated dynamically in secondary data centers. Primary and secondary data centers are exact copies of each other and are physically separated, yet both run simultaneously. When it's time to test, we switch production from the primary to the secondary data center. The data from the secondary data center is then compared against the primary to confirm accurate data replication.
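The comparison step at the end of a parallel test can be illustrated with record fingerprints. The sketch below is a generic approach and not a description of Zorro's tooling: it assumes each data center can export a snapshot as a dictionary keyed by record ID, and the `fingerprint` and `replication_diff` helpers and the sample records are hypothetical.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Stable SHA-256 digest of a record, independent of key order."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def replication_diff(primary: dict[str, dict],
                     secondary: dict[str, dict]) -> dict[str, list[str]]:
    """Compare two snapshots keyed by record ID and report discrepancies."""
    missing = sorted(primary.keys() - secondary.keys())   # not replicated
    extra = sorted(secondary.keys() - primary.keys())     # only in secondary
    mismatched = sorted(
        key for key in primary.keys() & secondary.keys()
        if fingerprint(primary[key]) != fingerprint(secondary[key])
    )
    return {"missing": missing, "extra": extra, "mismatched": mismatched}


# Made-up snapshots: t1 matches despite key order, t2 differs in status.
primary = {"t1": {"status": "open", "owner": "a"},
           "t2": {"status": "closed", "owner": "b"}}
secondary = {"t1": {"owner": "a", "status": "open"},
             "t2": {"status": "open", "owner": "b"}}
print(replication_diff(primary, secondary))
# {'missing': [], 'extra': [], 'mismatched': ['t2']}
```

Hashing a canonical form of each record means two copies that differ only in field order still count as identical, so the report lists only genuine replication gaps.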

5. Cut over testing:

What happens if your primary system fails? Often called full interruption testing, a cut over test aims to evaluate the ability of your recovery systems to support complete business functions. The primary systems are shut off or disconnected, forcing the recovery systems to kick in and take over.

Here's a simplified guide used to test the performance of a new application.

Create a cut-over plan with a list of roles, responsibilities, tools, and processes involved.
List out the dependencies (if any) for each task.
Conduct risk analysis for dependencies and tasks.
Include a rollback and business continuity plan.
Announce the exact time for each task and record it on a downtime calendar.
Conduct a mock run.
Sort out any issues that arise during the mock run.
Go live: release the application to the target users.
Provide post-implementation support.
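Listing the dependencies for each task, as the guide above suggests, also makes it possible to derive a safe execution order automatically. Here is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The task names and their dependencies are hypothetical, loosely modeled on the steps above, and `cutover_order` is an illustrative helper.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks, each mapped to the set of tasks it depends on.
dependencies = {
    "freeze changes": set(),
    "back up primary database": {"freeze changes"},
    "mock run": {"back up primary database"},
    "fix mock-run issues": {"mock run"},
    "go live": {"fix mock-run issues"},
    "post-implementation support": {"go live"},
}


def cutover_order(deps: dict[str, set[str]]) -> list[str]:
    """Return the tasks in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # err.args[1] holds the list of tasks that form the cycle.
        raise ValueError(f"circular dependency in cut-over plan: {err.args[1]}") from err


for step, task in enumerate(cutover_order(dependencies), start=1):
    print(step, task)
```

A circular dependency, such as two tasks that each wait on the other, raises an error here, which is a useful finding during the risk analysis rather than during the cut over itself.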

A BCDR plan determines your organization's preparedness in difficult times. Conduct tests periodically and ensure stakeholders are aware of their roles and responsibilities at all times for a fast and secure recovery from operational disruptions. Want to know more? Our BCDR e-book goes in-depth about ManageEngine's BCDR blueprint, the implementation process, and the teams and tools involved.

About the author

Mahanya Vanidas, Content writer
