A BCDR blueprint for enterprises

Zoho’s BCDR framework


Purpose and Scope

Purpose

The BCDR plan lays out the steps and procedures that Zoho and ManageEngine will follow before, during, and in the wake of a disaster (e.g., natural disasters, man-made events, pandemics) so that we remain resilient as a company, ensure maximum functionality during the emergency, and get our operations back to normalcy in the shortest possible time.

Key elements of the BCDR:

  • Resilience: Withstanding business interruptions in the face of adverse conditions
  • Recovery: Getting back to business as quickly as possible after a disaster
  • Contingency: Having a comprehensive set of measures and controls in place for a full recovery
  • Continual improvement: Continually reviewing the plan to make necessary revisions and keep it updated

Scope

The effectiveness of a BCDR depends on a well-defined scope. As Zoho is a large enterprise with distributed teams, scoping a BCDR is understandably more complex. There are many questions that we ask, answer, and record when scoping the BCDR:

  • Is it intended to cover all work sites, disaster prone sites, or the production center?
  • Is it to cover all customers, or just a percentage of them?
  • Is it intended to cover a local disaster, or widespread disasters such as hurricanes and pandemics?
  • What are our essential products and services?
  • What are the critical processes and business units that MUST function in the event of a disaster? Example: Customer facing teams

Next, we validate certain assumptions. For example: skilled resources, team leaders, or alternates will be available following a disaster.

BCDR Governance

For many organizations tasked with BCDR, the first instinct is to immediately start writing a plan. However, experience tells us that a good governance structure is key to steering our BCM efforts and ensuring there are no dead ends and pitfalls in the processes.

We have a control or governance system, the BCDRC, which comprises the board of directors and senior management executives of Zoho and ManageEngine. The BCDRC is brought on board early to steer our BCDR efforts and to ensure that a) the right individuals are in the right roles to maximize our business continuity efforts and b) the BCDR is kept ready and relevant at all times.

The following table highlights the roles and responsibilities of our BCDRC.

The roles and responsibilities of our BCDRC

Board of directors

  • Understand and communicate the value of BCM and the risks in the absence of BCM
  • Review the organization's BCDR annually
  • Get frequent updates from the senior management team on any new business continuity policies and procedures
  • Direct and approve the planning, implementation, testing, and other strategic objectives of the BCDR
  • Direct the audit committee to prepare for external audits
  • Direct the external communication plan to investors, customers, media, and law enforcement authorities

Senior management

  • Maintain a sound working knowledge of BCM practices and business risks
  • Keep the board of directors and C-suite executives informed of any significant changes to the business continuity plans
  • Define Zoho's business management objectives, provide strategic inputs for the BCDR, and designate the BCC
  • Review and approve the critical processes, standard operating procedures (SOPs), and planning exercises of the BCDR for each business unit whenever they are created or updated
  • Support and communicate the importance of BCM planning, training, and testing to all stakeholders
  • Assign the right middle managers to perform key BCM-related procedures and exercises

Other roles and responsibilities

Who?

Does what?

Middle management (Business owners)

  • Meet with the BCC to brief them on the critical processes of their business units.
  • Assess risks specific to their business units
  • Recommend steps to the BCDRC to address the identified risks of their individual business units.
  • Collaborate with the IT management team and BCC to design and implement BCDR based on risk assessment and BIA.
  • Conduct adequate tests to ensure the correctness of the business operability procedures as per BCDR

Internal - Audit and risk management committee (ARMC)

  • Conduct internal compliance audits
  • Review and report to BCDRC on the effectiveness of BCDR

Risk management team

  • Risk owners are ultimately accountable for ensuring the risk is managed appropriately

IT incident management team

  • Assess risks specific to IT environment along with the risk management team
  • Recommend steps to the BCDRC for IT related risks.
  • Conduct tests to ensure the correctness of the IT continuity measures in BCDR.

Business continuity coordinators (BCC) are similar to incident coordinators (refer to the IM process). The BCC create and maintain the BCDR and work closely with other critical business functions to understand their processes, identify risks, and help manage and minimize those risks.

  • Manage, communicate, and control all activities associated with the BCDR and the recovery of critical business functions.
  • Activate BCDR for affected departments
  • Keep the incident manager informed of continuity/disaster recovery efforts (refer to the IM process).
  • Meet middle management to review and prioritize critical processes of their respective business units
  • Receive updates from middle managers and adjust plans as necessary
  • Work with communications team and provide inputs to help craft the communications plan.
  • When business returns to normalcy, receive feedback from middle management, identify gaps, and record the learnings to continually improve the BCDR.
  • Maintain documentation of BCDR while ensuring confidentiality and privacy

Legal counsel

TBD

Other stakeholders

  • Report their concerns in risk assessment and BIA of their respective business units
  • Familiarize themselves with the BCDR and emergency contacts
  • Participate in the test and training sessions and provide feedback to middle management.

Risk Assessment

Risk Assessment Methodology


The first and key step of BCM is the assessment of risks. Risk is uncertainty about achieving objectives that can affect our business in an adverse way. Risks are realized when:

  • The objectives of the business are not achieved.
  • There is non-compliance with organization's policies and procedures or external legislation and regulation.
  • The resources of the business are not utilized in an efficient and effective manner.
  • There is a violation of the Confidentiality, Integrity and Availability (CIA) of information.

It is important for Zoho to have an all-hazards approach to risk assessment and control processes in place to ensure that potential impacts do not become real, or if they do, there is a contingency plan in place to deal with them. It is also important that the process is sufficiently clear so that successive assessments produce consistent, valid, and comparable results, even when carried out by different people.

Establish the context

The scope of risk assessment is defined based on factors such as:

  • Geographical location: Distributed data centers and office set up
  • Business units or departments
  • Business process(es)
  • IT services, systems, and networks
  • Customers, partners, products, or services

The overall environment in which the risk assessment is carried out should be identified and rationalized. This will include a description of the internal and external context and any recent changes that affect the likelihood and impact of risks in general.

Internal Context

  • Governance, organizational structure, roles and accountabilities
  • Policies, objectives, and the strategies
  • Capital, time, people, processes, systems and technologies
  • Information systems, information flows and decision-making processes
  • Relationships with, and perceptions and values of, internal stakeholders
  • The organization’s culture
  • Standards, guidelines and models adopted by the organization
  • Form and extent of contractual relationships
  • The type(s) of cloud services provided

External Context

  • The cultural, social, political, legal and regulatory environment
  • Financial, technological, economic, natural and competitive environments
  • International, national, regional or local environments
  • Key drivers and trends which have an impact on the objectives of the organization
  • Relationships with, and perceptions and values of, external stakeholders

Risk Identification

Although there are myriad disasters, the resulting effects are similar for most, and it is these we plan for. They result in scenarios such as loss of infrastructure or sustained IT failure. Preparing for the worst-case scenario helps cover many scenarios and risks in a single plan.

Our risk assessment team identifies, classifies, and assesses a wide range of disasters, especially those with catastrophically high impact potential, then characterizes their effects on business to enhance preparedness, response, and resilience.

Natural (by subcategory)

  • Geophysical: earthquake, volcanic eruption, landslide, rockfall
  • Meteorological: thunderstorm, lightning, snowstorm, blizzard, tornado
  • Hydrological: flood, tsunami, avalanche
  • Climatological: drought, heatwave/coldwave, forest/land fires
  • Biological: epidemic, pandemic

Willful

  • Bomb threat
  • Terrorist activity
  • Civil disorder
  • Bomb explosion
  • Bio weapons
  • Disastrous waste
  • Employee strike
  • Cyber attack
  • Disgruntled employees sabotaging an organization's systems

Accidental

  • Chemical spill
  • Radiation contamination
  • Heating systems or air conditioning failure
  • Telecommunication failure
  • Network failure
  • Gas leak
  • Fire (internal)
  • Wildfire (external)

Most enterprises have resilience plans for geophysical, willful, and accidental disasters and for IT disaster recovery. These plans, although effective for many business disruptions, can fall short during a global pandemic like COVID-19.

It's important that enterprises understand the significant differences between natural disasters and pandemic outbreaks so they can look beyond traditional business continuity strategies. At Zoho, we have established pandemic-specific policies and communication strategies to minimize business disruptions.

While natural disasters with physical phenomena are limited to a particular geography, biological ones such as viral pandemics spread globally. The table below lists the differences between the disruptions due to natural disasters and pandemics.

Disruption differences between natural and biological disasters

Distinguishing factors

Impact

  • Natural: Affects the organization, facility, workforce, and third parties.
  • Biological: A systemic event that affects everyone globally, including the organization and its workforce, customers, suppliers, and competitors.

Exposure

  • Natural: Can be contained and isolated as soon as the root cause is identified.
  • Biological: A contagion that spreads rapidly across geographies with severe impacts.

Duration

  • Natural: Shorter duration that varies from a few hours to a week.
  • Biological: Longer duration, with a viral pandemic lasting for several months.

Workforce

  • Natural: Temporary shortage and relocation of the workforce.
  • Biological: Significant shortage of workforce that needs alternatives such as telecommuting.

External communication

  • Natural: Emergencies should be reported to the appropriate law enforcement authorities and health care assistance (e.g., police, fire station, ambulance).
  • Biological: High degree of coordination with the local government, law enforcement, and health care assistance.

Infrastructure

  • Natural: Affects public infrastructure availability, such as electricity, telecommunications, and internet.
  • Biological: Affects the global supply chain.

Compile/maintain asset inventory

The definition of an asset is taken to be "anything that has value to the organization" and needs to be protected. A full inventory of assets is compiled and maintained by Zoho using the ServiceDesk Plus application. This includes customer data that Zoho stores and processes in its role as a cloud service provider.

Two major types of assets are identified as:

  • Primary assets — information and business processes and activities
  • Supporting assets — hardware, software, network, personnel, site, organization structure

The list of assets is held in the document "Information Asset Inventory" and in the ServiceDesk Plus application. Within the inventory, every asset is assigned a value, which should be considered as part of the impact assessment stage of this process. Each asset also has an owner who should be involved in the risk assessment for that asset. Where appropriate for the purposes of risk assessment, cloud customer data assets may be owned by an internal role, with the customer consulted regarding the value of those assets.

For the purposes of risk assessment, it is recommended to group assets with similar requirements together so that the number of risks to be assessed remains manageable.
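As a rough illustration of this grouping step, the sketch below buckets inventory entries so that each group can be assessed once. It assumes, purely for the example, that "similar requirements" means the same asset type and the same assigned value; the asset names, types, and values are hypothetical and do not come from the actual inventory.

```python
from collections import defaultdict


def group_assets(assets):
    """Group asset names by (asset type, assigned value).

    `assets` is an iterable of (name, asset_type, value) tuples. Treating
    "same type and same value" as "similar requirements" is an assumption
    made for this sketch.
    """
    groups = defaultdict(list)
    for name, asset_type, value in assets:
        groups[(asset_type, value)].append(name)
    return dict(groups)


# Hypothetical inventory entries, for illustration only.
inventory = [
    ("mail-server-01", "hardware", 3),
    ("mail-server-02", "hardware", 3),
    ("billing-process", "business process", 3),
    ("test-laptop", "hardware", 1),
]
asset_groups = group_assets(inventory)
```

Here the two mail servers fall into one group, so the four inventory entries produce three groups to assess.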

For each asset (or asset group), the threats that could be reasonably expected to apply to it will be identified. These will vary according to the type of asset and could be accidental events such as fire, flood or vehicle impact or malicious attacks such as viruses, theft or sabotage. Threats will apply to one or more of the Confidentiality, Integrity, and Availability of the asset.

Risk scenarios

The identification of risk scenarios is performed by a combination of group discussion and interviews with interested parties such as:

  • Business unit manager(s) responsible for each business-critical activity
  • Representatives of the people who normally conduct each aspect of the activity
  • Providers of the inputs to the activity
  • Recipients of the outputs of the activity
  • Appropriate third parties with relevant knowledge
  • Representatives of those providing supporting services and resources to the activity
  • Any other party that is felt to provide useful input to the risk identification process

The identified risks along with a description are recorded to assess the likelihood and impact of the risks.

Disasters and risk scenarios identified in the last decade

Hazards

  • Earthquakes
  • Floods
  • Tsunami
  • Pandemic
  • Ransomware
  • DDoS attacks
  • Telecommunication failure
  • Network failure

Risk scenarios

  • Irreversible damage to IT infrastructure
  • Half of core revenue generating business units
  • Loss of Zoho production center (Zoho Estancia building) and data centers
  • Loss of critical customer data
  • Absenteeism of critical employees
  • Loss of access to our work sites
  • Interruption of supply chain

Risk analysis

This process involves assigning a numerical value to the a) likelihood and b) impact of a disaster. These values are then multiplied to arrive at a classification level of high, medium, or low for the disaster.

Assessing the likelihood

An estimate of the likelihood of a disaster occurring is made. This should take into account whether the disaster has occurred before either to Zoho or to similar organizations or location and whether there exists sufficient motive, opportunity, and capability for a threat to be realized.

The likelihood of each disaster is graded on a numerical scale of 0 to 3, where a higher score means a higher likelihood. General guidance for the meaning of each grade is given in the likelihood table below. When assessing the likelihood of a disaster, existing controls are taken into account, which means an assessment has to be made of how effective those controls are. The rationale for the grade assigned to a disaster risk is recorded to aid understanding and to allow the assessment to be repeated in the future.

LIKELIHOOD

  • Score 0 (LOW): An event that has never occurred
  • Score 1 (LOW): An event that is highly unlikely to occur or occurs rarely (perhaps once in 3 years)
  • Score 2 (MEDIUM): An event likely to occur relatively infrequently, perhaps once a year
  • Score 3 (HIGH): An event that is fairly probable and could be expected to occur several times a year

Assessing the impact

An estimate is made of the impact that the disaster risk could have on the organization through its effect on Confidentiality, Integrity, or Availability. This takes into account existing controls that lessen the impact, as long as these controls are seen to be effective. Consideration is given to the impact on the following:

  • Customers
  • Finance
  • Health and Safety
  • Reputation
  • Knock-on impact within the organization
  • Legal, contractual or organizational obligations

The impact of each risk is graded on a numerical scale of 0 to 3, where a higher score means a higher impact.

IMPACT

  • Score 0 (LOW): No impact
  • Score 1 (LOW): Negligible impact that takes little effort to repair; damage to reputation or revenue loss is minimal
  • Score 2 (MEDIUM): Tangible harm that requires extra effort to repair; damage to reputation or revenue loss is significant
  • Score 3 (HIGH): Significant expenditure of resources required and compromise of the system; damage to reputation and revenue loss is high

Risk classification

Based on the assessment of the grade of likelihood and impact, a score is calculated for each risk by multiplying the two numbers. This resulting score is then used to decide the classification of the risk based on the matrix.

Risk formula: Risk score = Likelihood × Impact

Each risk will be allocated a classification based on its score as follows:

RISK VALUE: RISK LEVEL (COLOR CODING)

  • 0-3: LOW (green)
  • 4-6: MEDIUM (orange)
  • 7-9: HIGH (red)
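The scoring and classification steps can be sketched in a few lines of Python. This is a minimal illustration, assuming the 0 to 3 scores shown in the likelihood and impact tables and the 0-3 / 4-6 / 7-9 bands of the classification table; the function and class names are hypothetical and not part of any Zoho tooling.

```python
from enum import Enum


class RiskLevel(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"


def risk_score(likelihood: int, impact: int) -> int:
    """Multiply likelihood by impact; both are expected on the 0-3 scale."""
    for name, value in (("likelihood", likelihood), ("impact", impact)):
        if not 0 <= value <= 3:
            raise ValueError(f"{name} must be between 0 and 3, got {value}")
    return likelihood * impact


def classify(score: int) -> RiskLevel:
    """Map a risk value (0-9) to the level in the classification table."""
    if not 0 <= score <= 9:
        raise ValueError(f"score must be between 0 and 9, got {score}")
    if score <= 3:
        return RiskLevel.LOW
    if score <= 6:
        return RiskLevel.MEDIUM
    return RiskLevel.HIGH


# Example: a disaster with likelihood 2 (MEDIUM) and impact 3 (HIGH).
example_score = risk_score(likelihood=2, impact=3)   # 6
example_level = classify(example_score)              # RiskLevel.MEDIUM
```

If the band boundaries are adjusted to reflect a different risk appetite, only `classify` would need to change.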

Note: Based on our risk appetite, we do change the definition of high, medium, and low classifications. For example: We may decide that only risks with a score of 16 or more

Risk evaluation

Risk acceptance criteria

By default, risks ranked at the “Low” risk level are not treated: if the risk value is 3 or lower, no action is taken, and if it is 4 or higher, treatment actions are initiated. The BCDRC can still decide to treat a risk in the “Low” category.

We evaluate risks to decide on the risks that can be accepted and the ones that need to be treated. This should take into account the risk acceptance criteria. The matrix above shows the classifications of risks, where the green indicates that the risk is below the acceptable threshold and could be regarded as “safe”. The orange and red areas generally indicate that a risk does not meet the acceptance criteria and needs to be treated. Risks will be prioritized for treatment according to their score and classification so the high scoring risks are recommended to be addressed before those with lower levels of exposure for the organization.
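The evaluation and prioritization described here can be sketched as a simple filter and sort. The example assumes a treatment threshold of 4, as stated in the acceptance criteria; the function name and the sample scores are hypothetical and chosen only for illustration.

```python
from __future__ import annotations


def treatment_queue(risks: dict[str, int], threshold: int = 4) -> list[tuple[str, int]]:
    """Return the risks scoring at or above the threshold, highest score first.

    Risks below the threshold are treated as accepted and left out.
    """
    needs_treatment = [(name, score) for name, score in risks.items() if score >= threshold]
    # sorted() is stable, so risks with equal scores keep their input order.
    return sorted(needs_treatment, key=lambda item: item[1], reverse=True)


# Illustrative scores only; these are not Zoho's actual assessments.
sample_risks = {"Network failure": 2, "Ransomware": 9, "Flood": 6, "Pandemic": 4}
queue = treatment_queue(sample_risks)
# queue == [("Ransomware", 9), ("Flood", 6), ("Pandemic", 4)]
```

A risk below the threshold can still be added to the queue by hand when the BCDRC decides to treat it.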

Risk assessment report

The results derived from risk evaluation are captured in the risk assessment report with the following information:

  • Assets (asset-based risk assessment only)
  • Threats
  • Vulnerabilities
  • Risk scenario descriptions (scenario-based risk assessment only)
  • Controls currently implemented
  • Likelihood (including rationale)
  • Impact (including rationale)
  • Risk score
  • Risk classification
  • Risk owner
  • Whether the risk is recommended for acceptance or treatment
  • Priority of risks for treatment

Note: The risk assessment report holds the inputs to the risk treatment stage of the process and is signed off by the BCDRC before proceeding further, particularly those risks that are recommended for acceptance.
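One way to picture a single row of such a report is as a small record type. The sketch below is an assumption about how the listed fields could be laid out, not the actual report format; the class name, field names, and sample values are hypothetical. The risk classification is left out because it can be derived from the score.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RiskReportEntry:
    threat: str
    vulnerability: str
    likelihood: int
    likelihood_rationale: str
    impact: int
    impact_rationale: str
    risk_owner: str
    asset: Optional[str] = None      # asset-based assessments only
    scenario: Optional[str] = None   # scenario-based assessments only
    current_controls: List[str] = field(default_factory=list)
    recommended_for_treatment: bool = False
    treatment_priority: Optional[int] = None

    def __post_init__(self) -> None:
        # Every entry should be tied to an asset or to a risk scenario.
        if self.asset is None and self.scenario is None:
            raise ValueError("provide an asset or a scenario description")

    @property
    def score(self) -> int:
        """Risk score: likelihood multiplied by impact."""
        return self.likelihood * self.impact


# Illustrative entry; all values are made up for the example.
entry = RiskReportEntry(
    threat="Ransomware",
    vulnerability="Unpatched servers",
    likelihood=2,
    likelihood_rationale="Similar organizations were hit in the past year",
    impact=3,
    impact_rationale="Critical customer data would be unavailable",
    risk_owner="IT incident management team",
    scenario="Loss of critical customer data",
    recommended_for_treatment=True,
    treatment_priority=1,
)
```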

Risk treatment

Risk treatment is a process to develop a range of options for mitigating the risks that are agreed to be unacceptable. We apply the following measures to treat the risks:

  • Modify the risk by applying appropriate controls to lessen the likelihood and/or impact of the risk.
  • Avoid the risk by taking action that means it no longer applies.
  • Share the risk with another party. For example: insurer or supplier.

We use our judgement to decide which course of action to follow based on a sound knowledge of the circumstances surrounding the risk. Example: Business strategy, regulatory and legislative considerations, technical issues, commercial and contractual issues.

Note: The risk reviewer ensures that all parties who have an interest or bearing on the treatment of the risk are consulted, including the risk owner.

Risk treatment plan

On evaluating the treatment options, the risk treatment plan is created with the below details:

  • Risks requiring treatment
  • Risk owner
  • Recommended treatment option
  • Control(s) to be implemented
  • Responsibility for the identified actions
  • Timescales for actions
  • Residual risk levels after the controls have been implemented.

Statement of Applicability (SOA)

The SOA sets out the standard controls that have been selected and the reasons for their selection. It also details those that have been implemented and identifies any that have been explicitly excluded, along with the reasons for exclusion.

BCDRC Approval

At each stage of the risk assessment process, the BCDRC is kept informed of the progress made and the decisions taken, including the formal signoff of the proposed residual risks. The BCDRC approves the following documents:

  • Risk assessment report
  • Risk treatment plan
  • Statement of Applicability (SOA)

The acceptance or treatment of each risk will be signed off by the relevant risk owner.

Risk Monitoring and Reporting

As part of the implementation of new controls and the maintenance of existing ones, key performance indicators (KPIs) are identified that allow us to measure how successfully the controls address the relevant risks. These indicators are reported on a regular basis, and trend information is produced so that exceptional situations are identified and dealt with as part of the BCDRC review process.

Regular Reviews

In addition to a full annual review by ARMC, risk assessments are evaluated on a regular basis to ensure that they remain current and the applied controls are valid and relevant. The relevant risk assessments are also reviewed upon major changes to the business such as office moves, mergers and acquisitions or introduction of new or changed IT services.

Business Impact Analysis


While some business functions may be relatively unimportant, some are absolutely critical to ongoing business. The BIA process makes it easy to pinpoint the most critical business functions, their interdependencies, and whether they should be considered for inclusion in the business continuity strategy. It also helps us identify how these core functions can be impacted by disasters, and lays the groundwork for more systematic and logical recovery plans.

Furthermore, doing this analysis makes us more confident and secure about our business decisions, knowing full well that our decisions are based on a solid understanding of the most essential components of our business.

The core objectives of BIA are as follows:

  • Prioritize business-critical units or departments, products, and services that must be protected
  • Create an inventory of essential business activities and the minimum resources required to conduct business as usual, or close to it.
  • Establish recovery time frames or recovery time objectives (RTOs) to help prioritize risk treatment plans and select the appropriate response and recovery strategies.

As shown in the process activity diagram below, BIA is a multi-phase process performed by the BCC.

BIA process activities

BIA Interviews

The BCC take stock of all business units and gather some basic information before the actual interview using a Zoho Creator form. A link to the questionnaire is sent out as an email in the name of a department head, along with a note on what the BCC are trying to accomplish through this exercise and why it's important. A reasonable amount of time (around 2 weeks) is given to the concerned teams to complete the task. This prework sets the stage for more focused and effective BIA interviews and also cuts down the time they take.

The BCC initially ask the below questions:

  • Name of the business unit?
  • What does the business unit do?
  • How many resources does the business unit have?
  • Where is the business unit located?
  • What are the hours of operation? Does it involve shifts?

Tip: Choose a data gathering model that is the least time-consuming and the most aligned with how your organization works. Efforts outside mainstream business activities, such as business continuity, disaster recovery, and compliance, are usually low on the priority list for business units, so any steps you take to reduce the effort of gathering the data can pay off.

Gather information

The BCC hold a kick off meeting to hand out the questionnaire to the department heads and to clearly articulate the purpose of the whole exercise. The questionnaire covers all the required data points as the final output of the BIA relies on this step.

Below is a sample questionnaire from the BCC:

Sample questionnaire from the BCC

For each data point, the general questions come first, followed by any IT-related questions.

Business unit and processes

Describe your business unit and its processes?

What IT systems and applications does this business unit use?

Dependencies

What are your dependencies with other business units?

Would a disruption of this business unit impact others?

How and when would this disruption to other units happen?

What are the IT systems that impact or are impacted by this business unit?

Resource dependencies

Does this business unit depend on any key job functions? If yes, then what is the job function and to what extent does this business unit depend on the job function? What is the minimum number of resources needed for this business unit to function?

What are the secondary systems (if any) needed for these job functions?

Expertise dependencies

Does this business unit depend on the knowledge and expertise of a skilled worker? If yes, describe the role and expertise of the skilled worker and the impact on business in their absence.

Operational

If this business unit did not function, how would it impact business?

If this business unit did not function, how would it affect IT operations?

Tolerance to outages

In the face of a disaster, such as loss of production center (Zoho Estancia), how long can the business unit/systems sustain before the loss impacts the organization, its stakeholders, and suppliers?

Minimum infrastructure requirements

What are the infrastructure requirements for your business unit (physical space, office supplies, network, communication, furniture, lighting, HVAC, water, and food supplies)?

Others

Other concerns (if any) that can affect the recovery of your business unit?

Alternate business processes and resources

What are workarounds currently in place for your business processes? Who are the alternate or back up resources?

Critical documentation

Where do you store your critical documents? Mention the type of documents, location, and alternate locations (if any)

Recovery timeframes

What are the potential recovery issues that your business unit can face? What's the minimum recovery time frame? Who are the essential resources needed to restore operations to a near-normal state?

Financial impact

If this business unit did not function, what would be the financial impact on business? When would the impact be realized? Will it be a one-off impact or recurring?

Recovery time frame

What is the minimum time frame (in hours, days, weeks, months) to recover this business unit?

How long would it take to recover or replace the IT systems/applications related to this business unit?

Service level agreements

Are there any service level agreements in place for this business unit? In the event of a disaster, what would be the impact on SLAs? What are the key metrics associated with the SLAs?

How would IT service levels be impacted during a disruption of this business unit?

IT applications

What software applications are needed for this business unit?

What IT assets are needed to run these applications and to support this business unit?

Desktops, laptops, workstations

How many desktops, laptops, workstations are needed for this business unit?

What is the configuration data for these systems?

Servers and networks

Does this business unit require backend systems and network?

Workarounds

Does this business unit have any workaround processes that have been developed and tested? If yes, would these processes facilitate the smooth function of this business unit during an event? If no, is it feasible to develop such workarounds?

Are there any IT-related workarounds for this business unit? If yes, what are those workarounds and how can they be implemented?

Remote

Will this business unit be able to work from backup recovery sites of Zoho? OR work remotely from home?

What should IT do to enable remote access for this business unit?

Vital records

Where does this business unit store critical documents? Are these documents backed up? If yes, where and how frequently does the business unit back up documents?

Where are the document backups stored? Is the current document backup strategy sound enough?

Previous business disruption experience

Has this business unit faced any disruptions earlier? If yes, what was the disruption scenario and duration? Any learnings that can be incorporated into the BCDR to prepare for future disruptions?

Has IT been involved in this disruption scenario? If yes, how did IT address this disruption?

Competitive impact

What would be the competitive impact to Zoho if this business unit faced significant disruption? What percentage of customers would we lose?

The BCC conduct follow-up interviews to validate the gathered information and to fill any gaps.

Analyze the information

The questionnaire is designed to gather information such as financial and non-financial impacts, recovery timeframes, and resource and application requirements. The BCC compile and analyze the responses to provide the information required to develop corporate-wide recovery and continuity strategies.

The below table captures some of the most important impact categories that we consider. This table can be used as a checklist by other IT organizations while conducting BIA.

What's at stake for Zoho?

Impact categories

Impact

Financial impact

  • Loss of revenue due to lost sales
  • Penalties due to non-compliance of service-level agreements (SLAs)
  • Increased operating, relief, and recovery expenses

Infrastructure

  • Damage to production/data centers
  • Restricted access to work sites
  • Damage to IT systems
  • Damage to other physical assets

Resource

  • Absenteeism
  • Loss of data
  • Supply chain disruption
  • Loss of network, power, telecommunication systems

Health and safety

  • Compromised employee health (pandemics) and worker safety (fire)
  • Environmental damage

Legal

  • Inability to fulfill service level agreements
  • Inability to comply with regulations

Strategic

  • Delay in new business initiatives
  • Lack of innovation due to low employee performance

Intangible

  • Dissatisfied customers
  • Customer defection
  • Damage to Zoho's business reputation
  • Loss of goodwill with partners
  • Loss of employee morale

The information gathered in the BIA interviews is used to:

  • Identify the critical business units and processes
  • Define the recovery time objective (RTO) for each business process.
  • Define the recovery point objective (RPO) for each business process
  • Identify resource requirements

Identifying critical functions

In the big picture, how critical are each business unit and its processes to Zoho's ability to operate? A three-point rating system helps the BCC assign a "criticality rating" to a business unit and its functions.

CATEGORIES AND CRITICALITY

  • Category 1: Critical (mission-critical BUs and processes)
  • Category 2: Important (necessary BUs and processes)
  • Category 3: Minor (desirable BUs and processes)

Category 1

Critical business units and processes are those that:

  • are most sensitive to downtime
  • maintain cash flow
  • fulfill service level agreements
  • play a key role in maintaining Zoho's business reputation.

The BCDR focuses more time and resources on the critical BUs and functions first, followed by the important BUs and functions.

Category 2

Important business units and processes don't affect Zoho's business operations in the near term. However, if they are not functional for a longer term, they can cause some disruption to the business.

Category 3

Minor or desirable business units and processes do not cause significant disruption to the business. They are usually dealt with in the later stages of business recovery.

Recovery time objective

Once the impact data is analyzed, the BCC define the recovery time objective (RTO) for each process. The RTO is the time within which a business process should be restored following a disruption. It depends on the criticality of the business unit, process, and application, and can range from no downtime to several days or weeks. Simply put, “How long can we be down?”

This timeframe can vary by organization. For some IT organizations, the recovery time for processes can be as low as 0 minutes.

CATEGORY | CRITICALITY | RTO
1 | Critical (mission-critical BUs and processes) | 12 hours or less
2 | Important (necessary BUs and processes) | 48 hours or less
3 | Minor (desirable BUs and processes) | < 3 days
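As an illustration only, the RTO ceilings in the table above can be expressed as a small lookup. This is a minimal sketch, not part of our tooling: the names `RTO_BY_CATEGORY` and `meets_rto` are hypothetical, and treating "< 3 days" as a strict 72-hour ceiling is an assumption.

```python
from datetime import timedelta

# RTO ceilings per criticality category, taken from the table above.
# Assumption: "< 3 days" for category 3 is read as a strict 72-hour ceiling.
RTO_BY_CATEGORY = {
    1: timedelta(hours=12),  # Critical
    2: timedelta(hours=48),  # Important
    3: timedelta(days=3),    # Minor
}


def meets_rto(category: int, downtime: timedelta) -> bool:
    """Return True if the observed downtime is within the category's RTO."""
    limit = RTO_BY_CATEGORY[category]
    # Categories 1 and 2 are "or less" (inclusive); category 3 is "<" (strict).
    if category == 3:
        return downtime < limit
    return downtime <= limit
```

For example, `meets_rto(1, timedelta(hours=12))` is true, while a 13-hour outage of a critical process would miss its objective.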

Recovery point objective (RPO):

RPO defines the maximum data loss that a critical business process can tolerate. Simply put, if the IT systems supporting a critical business process were to fail, how much data can we afford to lose? We use three time frames here, and these can also vary by organization.

RPO 0: no data loss (real-time backups)

RPO 1: less than 4 hours of data loss

RPO 2: 24 hours of data loss
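One way to reason about these tiers is to ask which tier a given backup schedule can satisfy. The sketch below is illustrative and rests on a simplifying assumption: the worst-case data loss equals the interval between backups. The names `RPO_TIERS` and `tightest_rpo_tier` are hypothetical.

```python
from datetime import timedelta
from typing import Optional

# Maximum tolerable data loss per tier, as listed above.
RPO_TIERS = {
    "RPO 0": timedelta(0),         # no data loss (real-time backups)
    "RPO 1": timedelta(hours=4),   # less than 4 hours of data loss
    "RPO 2": timedelta(hours=24),  # 24 hours of data loss
}


def tightest_rpo_tier(backup_interval: timedelta) -> Optional[str]:
    """Return the strictest tier a backup interval can satisfy, or None.

    Assumption: worst-case data loss equals the time between backups.
    """
    if backup_interval <= RPO_TIERS["RPO 0"]:
        return "RPO 0"
    if backup_interval < RPO_TIERS["RPO 1"]:  # "less than 4 hours" is strict
        return "RPO 1"
    if backup_interval <= RPO_TIERS["RPO 2"]:
        return "RPO 2"
    return None
```

Under this assumption, hourly backups satisfy RPO 1, while a weekly backup satisfies none of the three tiers.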

Identifying resource requirements and dependencies

The BCC document each department and process along with the resource(s) responsible for the processes of a business unit. A list of backup resources for the process is also identified in case the lead resources are unavailable during an emergency.

The BCC also identify the systems and applications (be it CRM, payroll, or HR software) that each business unit relies on, and the level of access its members need to get their jobs done. The level of reliance of a business unit on these systems and applications is rated as high, medium, or low in order to ensure the availability of crucial systems and applications during an emergency.

A thorough understanding of interdependencies between business units, their functions, and IT systems is crucial to both disaster recovery and business continuity. If System B depends on System A, it's pointless for our IT teams to spend a week trying to restore System B while System A is still down. The BCC document and highlight these interdependencies at this stage to ensure the effectiveness of business continuity.
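Documented interdependencies can be turned into a restoration order with a topological sort. The sketch below uses Python's standard `graphlib` module; the dependency map and the system names in it are hypothetical and stand in for whatever the BIA actually records.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists the systems it needs
# to be running before it can be restored.
DEPENDS_ON = {
    "System A": set(),
    "System B": {"System A"},
    "CRM": {"System B"},
}


def restoration_order(depends_on):
    """Return systems ordered so that every dependency is restored first."""
    # static_order() raises graphlib.CycleError if the map contains a cycle,
    # which usually signals a documentation error in the dependency list.
    return list(TopologicalSorter(depends_on).static_order())
```

With the map above, System A is restored before System B, and System B before the CRM.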

BIA Report

The outcome of the BIA is documented as a BIA report with recommendations of recovery strategies and presented to the BCDRC for approval. This report is also appropriately incorporated into our IT disaster recovery and incident management plans. Here is a sample BIA report of one of our BUs - IT operations.

BU Head

BU name: Network operations center (NOC)

BU head: Prabhu Ponnukumaraswamy

Email ID: xxxx@zohocorp.com

Mobile: +919999999999

Headcount

50

Priority

Critical

Business unit functions

  • Network monitoring
  • Incident response
  • Provide 24/7 LAN, WAN, VPN, and network connectivity with 99.999% uptime.
  • Provide hardware and software application support to the employees
  • Manage the IT infrastructure of Zoho

Business unit disruption impact

  • Disruption of the NOC will directly impact all BUs and the productivity of Zoho.
  • Disruption of this BU means Zoho will be unable to conduct business, and the downtime will be directly proportional to lost dollars.

RTO

15 minutes.

RPO

0

Internal dependencies

Human Resources, Finance, Facilities, and Security.

External dependencies

  • Dependency on external physical server vendors and technicians
  • ISP for network connectivity

Recommendations

  • The backup site (Tenkasi) should be at a safe distance from the production center.
  • Engage two external ISP vendors for network connectivity.

BCDRC Approvals

The BIA report is sent to the BCDRC for their perspective and approval, as the BIA results are used to formulate recovery strategies and continuity planning. The BIA goes through a multi-step approval process: the first level of approval is given by the BIA owner, and the final go-ahead is given by the BCDRC.

BCDR Planning

The bulk of our work in developing our BCDR plan is almost complete when we get to this point. This section is where everything comes together - the risk assessment we performed gave us the data that helped us identify the business impact those risks can have on our business. Finally, all of that data is now going to help us identify the disaster response, mitigation, and recovery strategies, as well as the people, resources, and activities that we need for effective BCDR.

The BCDR plan includes two phases:

  • Emergency response procedures that all Zoho worksites will follow as the appropriate emergency response to disasters like fire, flood, and earthquakes to protect employee lives and limit damages.
  • Disaster recovery and business continuity activities conducted after the disruption for the restoration of business operations.

Roles and responsibilities

One of the crucial steps in emergency response and recovery is assigning roles and responsibilities. When disasters strike, the response teams on the scene are our first line of protection.

These teams help contain the impact of the disaster and effect a timely recovery before the first responders such as police or firefighters arrive at the disaster site.

Below are the response teams and responsibilities.

BCDR Roles and Responsibilities

Emergency personnel

  • Trained emergency personnel act as floor wardens during a disaster to aid the EMT in immediate evacuation
  • Notify security personnel
  • Brief external emergency services, upon their arrival, on the type and location of the emergency, the extent of the damage (e.g., minimal, heavy, total destruction), and the status of the evacuation.
  • Notify building security personnel who will establish security at the facility and not allow access to the site unless notified by the BCDRC.

Security personnel

  • In the event of an emergency, safety and security operations are one of the first points of contact for the BCDRC.
  • Dial national emergency number
  • Activate evacuation alarm followed by a verbal announcement to all employees to evacuate the building.
  • Outside of business hours, the security personnel remain on-call to notify the BCDRC and manage an emergency.
  • Provide emergency response to all on-site emergencies.
  • Provide security resources and work with all recovery teams as needed.
  • Contact external emergency services.

Head of facilities

  • Responsible for life safety measures of employees including fire alarms, extinguishers, emergency lighting, fire detection systems, emergency exits, and other warning systems.
  • Provide emergency floor plans on request
  • Ensure all employees evacuate the facilities, meet at the assigned outside location (assembly point), and follow instructions given by the emergency personnel.
  • Act as a liaison between Zoho and essential services vendors such as HVAC, electrical, and plumbing.

In house medical officers

  • In-house medical officers are mobilized during emergencies

Ambulance service

  • Provide transport for the injured employees.

Employees

  • Familiarize themselves with the standard emergency procedures
  • Respond to emergencies
  • Follow instructions of the emergency and security personnel
  • Keep all emergency exits clear and avoid panic during emergencies

Disaster recovery team

  • Coordinate with EMT and BCDRC for appropriate recovery actions.
  • Notify all company department heads and advise them to activate their plan(s) if applicable, based on the disaster situation.
  • Determine recovery needs
  • Establish command center and assembly areas
  • Assess the damage to the affected location and/or assets.
  • Contact vendors/contractors of installed equipment for their expert opinions on the condition of the equipment.
  • Document assessment results using assessment and evaluation form
  • Inspect the affected areas to assess damage to essential hard copies of records (files, manuals, contracts, documentation, etc.) and electronic data.
  • Gather information regarding damage to other work site(s) (e.g., environmental conditions, physical structure integrity, furniture, and fixtures) from the DRT.
  • Develop a restoration priority list, identifying facilities, vital records, and equipment needed for resumption of activities that could be operationally restored and retrieved quickly.
  • Prepare post-disaster debriefing report.

Emergency management team

  • Evaluate which recovery actions should be invoked and activate the corresponding recovery teams.
  • Evaluate and assess damage assessment findings.
  • Set restoration priority based on the damage assessment reports.
  • Provide Senior Management with ongoing status information.
  • Act as a communication channel to corporate teams and major customers.
  • Work with vendors and DRT to develop a rebuild/repair schedule.

IT

  • Facilitate technology recovery and restoration activities, providing guidance on replacement equipment and IT systems.
  • Coordinate removal of salvageable equipment at the disaster site(s) that may be used for alternate site operations.

Notification procedures

  • If in-hours: On observing or being notified of a potentially serious situation (for example, a fire), the employee identifying the incident (the reporter) calls their BU head. If the BU head is unreachable or incapacitated, the reporter calls the BU head's backup, a senior manager.
  • The BU head/backup notifies the emergency personnel on site (who carry out the standard emergency and evacuation procedures if necessary) and the EMT and DRT.
  • If out of hours: IT personnel notify the EMT and DRT.
  • The EMT, DRT, and other response teams respond based on the directives specified by BCDRC.
  • When a disaster is declared, the EMT and/or DRT will notify IT immediately for deployment.
  • The person who is authorized to declare a disaster within the BCDRC has a backup who is also authorized to declare a disaster in the event the primary person is unavailable. For example: CEO - primary authority, COO - secondary authority.

A call tree is a general notification technique that we use to list the primary and alternate contact numbers of key personnel, as well as the numbers of backup personnel in the event that the key personnel are unreachable. The contact list includes the name, department, role, mobile number, residential number, and address of the key and backup personnel.
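The escalation logic behind a call tree can be sketched as a chain of contacts, each pointing to a backup. This is a minimal illustration, not our actual notification system: the `Contact` fields are a subset of the contact list described above, and `is_reachable` is a hypothetical callback standing in for an actual phone call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Contact:
    name: str
    role: str
    mobile: str
    backup: Optional["Contact"] = None  # next person to call if unreachable


def first_reachable(
    contact: Optional[Contact],
    is_reachable: Callable[[Contact], bool],
) -> Optional[Contact]:
    """Walk the backup chain and return the first contact who can be reached."""
    while contact is not None:
        if is_reachable(contact):
            return contact
        contact = contact.backup
    return None  # nobody in the chain could be reached
```

If the BU head does not answer, the function moves on to the backup. A `None` result means the whole chain was exhausted and the reporter has to escalate by other means.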

BCDR notification procedures

Disaster declaration

A disaster is declared only when the emergency is not likely to be contained and resolved within predefined time frames. The BCDRC is responsible for declaring a disaster and has to be well informed about the geographical, political, social, and environmental events that can pose a threat to Zoho's business operations. To avoid false alarms, the BCDRC has identified institutions that provide timely and meaningful disaster predictions that allow Zoho to respond and recover effectively. Below are a few identified institutions that help the BCDRC with disaster monitoring for regional work sites.

Type of disaster | Early warning/prediction systems (for regional work sites)
Cyclones and earthquakes | India Meteorological Department and earthquake sensors
Tsunami | India Meteorological Department and earthquake sensors
Cyclones and earthquakes | Indian National Centre for Ocean Information Services
Floods | Central Water Commission

Invoking the plan

Like every IT organization, we hope to never have to invoke the BCDR. However, emergencies can arise at any time and we believe in readiness. The BCDR is reserved for significant disasters and business disruption and is invoked by BCDRC.

Regardless of the service disruption circumstances, or the identity of the individual(s) in the BCDRC who are first notified of the disaster, the EMT and DRT are activated immediately in the following cases:

  • The production center at Estancia is down due to a natural disaster such as a flood or an earthquake.
  • Any disruption in the IT systems or network facility that can cause concurrent downtime in the production center for more than three hours.
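The two activation criteria above reduce to a simple check. The sketch below is illustrative only: the function and parameter names are hypothetical, and the downtime figure is assumed to come from whoever estimates the outage.

```python
from datetime import timedelta

# "More than three hours" of production-center downtime, per the criteria above.
DOWNTIME_THRESHOLD = timedelta(hours=3)


def should_activate_teams(
    production_center_down: bool,
    expected_downtime: timedelta,
) -> bool:
    """Return True if the EMT and DRT should be activated immediately."""
    return production_center_down or expected_downtime > DOWNTIME_THRESHOLD
```

Note that the threshold is strict: an outage expected to last exactly three hours does not trigger activation under this reading.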

Internal communication

Effective internal communication is key to ensuring that employees are well informed, supported, reassured, and, most importantly, safe during a disaster. Ideally, messages are relayed to stakeholders face to face during a disaster. At Zoho, however, an effective alternative is a forum post from the CEO and HR on Zoho Connect, carrying the key messages about the disaster and the BCDR. Zoho Connect is team collaboration software, like an internal Facebook-like application, that connects all stakeholders and enables collaboration during a disaster.

In addition to the forums on Zoho Connect from the BCDRC and HR teams, the BU heads are the focal points for their departments to provide updates on the progress of their disaster recovery and business continuity efforts and how they can contribute to the recovery efforts.

Initial response

It might sound obvious but the BCDR prioritizes our employees and their lives over assets.

The emergency response procedures taken in the initial minutes of an emergency are critical to saving the lives of our employees. Our emergency procedures capture four protective actions (evacuation, shelter, shelter-in-place, and lockdown) and the relevant procedures for each. These emergency actions apply to all employees (including management personnel) and to all work sites of Zohocorp.

Authority

The instructions and guidance given by Zoho's trained emergency personnel override the reporting structure. This authority is given to the emergency personnel to ensure that the life and safety of the employees take precedence over IT systems, other assets, and production during an emergency.

Assembly points

The BCDR plan identifies two assembly points both inside and outside Zoho premises where employees should gather after evacuating. These areas of refuge have sufficient space to accommodate all of Zoho's employees and are away from buildings, power lines, trees, gas lines, poles, and vehicles.

The BCDR plan identifies two evacuation assembly points:

  • Primary - Open ground behind Zoho
  • Secondary - Open ground across the street opposite to Zoho

Protective action and emergency procedures by disaster types

Disaster Type

Protective Action

Procedures

Fire/smoke

Evacuation

  • If fire or smoke is present in the facility, evaluate the situation, determine its severity, and categorize the fire as “major” or “minor.”
  • In case of a minor fire (e.g., a single hardware component or paper fire), employees attempt to extinguish it using the hand-held fire extinguishers located throughout the Zoho facilities. Any other fire or smoke situation is to be handled by qualified building personnel until the local fire department arrives.
  • In the event of a major fire, the fire alarm system has to be activated immediately and ring continuously for 60 seconds.
  • The reporter has to provide the recovery teams with their name, extension, work location (block, floor, workstation ID), and the nature of the emergency, and then follow all instructions given.
  • The reporter notifies the EMT and DRT on their mobile numbers.
  • The emergency personnel assist employees (giving utmost care for the physically-challenged) to the nearest safety exit to the staircase that leads to the building exit. Lifts not to be used.
  • All employees to gather at the assigned outside location or assembly point and follow instructions given by the emergency personnel.
  • It should be ensured that all employees are evacuated and headcount taken.
  • The assembly area has to be monitored and employees should be reassured of their safety.
  • Appropriate first aid to be given to the injured (if any) by the in-house medical officers until paramedics arrive.

Flood/water damage

Evacuation

  • Cease operations and usage of electrical equipment and move to higher ground
  • If water is dripping from an air conditioning unit and is not endangering IT systems and other assets, contact plumbing and HVAC repair personnel immediately.
  • If flooding is severe, activate alarm/warning system for employees and immediately notify EMT/DRT teams, emergency personnel, and implement power-down procedures.
  • While power-down procedures are in progress, evacuate the area and follow BCDRC instructions.
  • Evacuate the work site building if necessary and proceed to the emergency assembly area. Follow evacuation procedures.
  • Dial national emergency number immediately and wait for outside help.

Tornado/cyclones

Shelter

  • Activate alarm system to warn employees
  • Notify EMT/DRT teams
  • Follow instructions from emergency personnel and move to the basement or lower floors, on the strongest side of the building.
  • If there is no basement, go to a hall/room on the lowest level of the building.
  • Move away from glass windows, large shelves, ceiling decor, and other potentially harmful things.
  • Stock supplies such as water, nonperishable food, first-aid kits, batteries, flashlights (cyclones can cause power failure), and other necessities based on the forecasted duration of the cyclone.
  • Employees to remain inside campus until the cyclone passes.
  • Heads of BUs to take headcount of their respective reportees and notify emergency personnel in case of missing employees.

Earthquakes

Shelter-in-place

  • Take shelter under sturdy furniture or other shelter.
  • Move away from glass windows, large shelves, ceiling decor, and other potentially harmful things.
  • Make oneself as small as possible and cover the head and neck with the hands.
  • Remain in shelter until the shaking stops.
  • Evacuate the work site building if necessary and proceed to the emergency assembly area. Follow evacuation procedures.
  • The emergency personnel to assist employees to the exit for evacuation and look for the injured to assist.
  • Dial national emergency number immediately and wait for outside help.

Terrorist attack

Lockdown

  • When a terrorist attack is suspected and gunshots are heard, the primary evacuation route is not safe. All employees, including emergency personnel, are to find safe spots to hide and remain silent.
  • Mobile devices are to be silenced by turning off both the ring tone and vibration functions.
  • All employees to work together as a team and escort panic-stricken colleagues to safety.
  • Security personnel to call the emergency hotline.
  • Once the threat is eliminated through the intervention of the local authorities, follow evacuation procedures and instructions from emergency personnel.

Emergency contacts

In case of emergencies we call for help, information, and services on these emergency hot lines.

Emergency Crisis Hot Lines (Regional)

National emergency hotline

Disaster management services

AIIMMS

Air ambulance

Red cross

Gas leak

Fire department

Police department

Hospital

Medical services (mobile)

Ambulance services

Utility Companies

Network provider

Gas

Plumbing

HVAC

Electricity board

BCDR activities

Here is what an emergency scenario at Zoho can look like as the recovery activities unfold. The activities below are some of the recovery activities in case of fire, flood and earthquakes, and of course, will vary depending on the nature of the emergency and its impact on business.

How is Zoho geared for eventualities?

Timeframe

Activities

First 4 hours

External communication: Our communications team collects information from reliable sources and crafts key messages (before, during, and after the disaster), as well as ensures a consistent message across all channels: website, blog, media, news release, social media et cetera. The team holds a ready list of potential external audiences: emergency medical services, fire department, police, local government, suppliers and vendors along with their contact numbers.

Two official spokespersons, the President/Vice President of Zoho and ManageEngine, with solid experience in working with both print and broadcast media, will be the primary contacts for all media inquiries. The spokespersons typically run all press conferences and give most of the analyst and partner interviews during a crisis.

All external communication will include details of the disaster including the date and time of occurrence, a description stating the impact of disaster on business, steps being taken to mitigate the risks, recovery, and business continuity, and estimated time for recovery.

Emergency command centers (ECC): Our emergency command centers are the coordination hubs for disaster response. From these centers, the BCDRC and response team personnel gather critical information, coordinate response and recovery activities, and manage employees as the emergency situation demands.

Emergency command center 1: Estancia IT Park, Chennai, India

Emergency command center 2: Tenkasi, India

Alternate locations: In case of temporary or permanent loss of a disaster-struck facility, the 12 offices spread across different countries act as alternate locations for each other.

We move our critical business functions to alternate sites that are equipped to provide similar working environments as other sites.

Alternate sites may include (but are not limited to):

  • Zoho's alternate site(s) listed here that are not affected by disaster. The sites closer to the affected site can host the essential resources and also assist in recovering business operations.
  • Temporary worksites: Temporary worksites are set up in case of emergencies with minimal IT systems, telecommunications, and other equipment.
  • Telecommuting: Employees work remotely from home or alternate locations of their choice as Zoho runs on cloud applications.

Critical teams and resources: In the BIA phase of this plan we already identified the critical teams and employees that are considered essential during an emergency or disaster. These critical BUs such as customer-facing teams (presales, sales, customer support) and their resources are moved to alternate locations. Minimal resources from other critical BUs such as HR and facilities report to work regardless of conditions.

Availability: Application data is stored on resilient storage that is replicated across data centers. Data in the primary DC is replicated in the secondary in near real time. In case of failure of the primary DC, secondary DC takes over and the operations are carried on smoothly with minimal or no loss of time. Both the centers are equipped with multiple ISPs. We have power back-up, temperature control systems and fire-prevention systems as physical measures to ensure business continuity. These measures help us achieve resilience. The live status and historical status data (30 days) of cloud services can be seen at status.zoho.com / status.zoho.eu / status.zoho.in / status.zoho.com.au.

Disaster-ready data backups: Data backup and recovery is critical for recalling data during natural disasters. At Zoho, we perform full and incremental backups to preserve corporate information. These backups are performed on a regular basis for audit logs and files that are considered critical. The backup media is stored in a secure offsite data center, geographically separate from the original.
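To show how full and incremental backups combine at restore time, here is a minimal sketch. It assumes each incremental backup holds the changes since the previous backup of any kind (not since the last full backup), so a restore needs the latest full backup plus every incremental taken after it. The `Backup` type and `restore_chain` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental"


def restore_chain(backups, target):
    """Return the backups to apply, oldest first, to restore state as of `target`."""
    usable = sorted(
        (b for b in backups if b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    # Index of the most recent full backup taken at or before the target time.
    last_full = max(
        (i for i, b in enumerate(usable) if b.kind == "full"),
        default=None,
    )
    if last_full is None:
        return []  # no full backup to start from, so nothing can be restored
    # Everything after the latest full backup is, by construction, incremental.
    return usable[last_full:]
```

The design choice here is to fail closed: without a full backup at or before the target time, the function returns an empty chain rather than a partial one.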

5-24 hours

Succession plan: In case of casualties, activate the succession plan that lists who replaces the BCDRC, senior managers, managers, team leads during an emergency if they are not available to carry out their responsibilities.

Stabilize the situation: The disaster situation is stabilized to save lives, and is usually done at the response stage. However some stabilization activities such as removing records from the disaster location, and isolating affected systems are done before damage assessment to prevent further damage to the records and information, as well as the assets.

Damage assessment: Once a disaster is declared, the DRT should be mobilized. Damage assessment is done as quickly as conditions permit by the DRT (under the direction of the location authorities) to assess the damage to:

  • Essential hard copy records (files, manuals, contracts, documentation, etc.) and electronic data.
  • The site(s), e.g., environmental conditions, physical structure integrity, furniture, and fixtures

Damage assessment helps us gauge the extent of damage: what can be replaced, salvaged, or reconstructed. The results of the damage assessment are documented in the damage assessment and evaluation form (Check forms section below for a complete list of forms that we use during emergencies). This helps develop a restoration priority list, identify facilities, vital records, and equipment needed for resumption of activities.

The EMT and DRT gather all the information regarding the event and send it for the BCDRC's review. The decision to move to the business continuity phase is made at this point. If the situation does not warrant this action, then the EMT and DRT continue to address the situation at the affected site(s).

Supply chain: In times of disaster, our supply chains that were functioning well can experience significant disruption. We've identified a list of key backup vendors for all essential equipment and supplies so we can switch to these vendors in the event the primary vendor is also affected by the disaster.

Days 2-4

Salvage operations at disaster site: The salvage operations now begin for damaged IT systems, furniture, workstations, and records with appropriate procedures. The activities include:

  • Isolating and removing affected systems, furniture, and other equipment from the disaster site.
  • Sending the systems and equipment for salvage to the respective vendors for repair.
  • Organizing the undamaged systems and equipment
  • Cleaning all workstations including the furniture, the undamaged IT systems, and other equipment.
  • Removing debris and making sure the facility is restored to normalcy.

Move critical resources back to primary site: As soon as the primary site is stabilized and repaired, the critical resources are moved back into the primary site.

Days 5-14

Bringing back business as usual (BAU): In the event of total facility destruction, efforts begin to fully rebuild the facility, while the critical employees continue to work from alternate locations and other employees work from home.

In case of partial damage, the facility is rebuilt in the shortest time possible and all employees are moved into the primary facility.

Once all the IT systems, records, data, supplies are restored and normalcy returns at the organization, external communication is sent out to customers, partners, press, and concerned authorities.

Forms

Disaster form

In the event of a disaster, the on-duty personnel make the initial entries into a disaster form. This form captures a chronological log of the business impact reported during the event. It is then forwarded to the ECC, where it is continually updated. The running log remains active until the disaster ends and it's business as usual.

Date and time

Type of event

Location

Building access issues

Projected impact to operations

Running log (ongoing events)
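The disaster form maps naturally onto a small record type with an append-only running log. The sketch below is an assumption-laden illustration, not our actual form: the class and field names are hypothetical renderings of the fields listed above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class DisasterForm:
    occurred_at: datetime
    event_type: str
    location: str
    building_access_issues: str
    projected_impact: str
    running_log: list = field(default_factory=list)  # (timestamp, note) pairs

    def log(self, when: datetime, note: str) -> None:
        """Append an entry and keep the running log in chronological order."""
        self.running_log.append((when, note))
        self.running_log.sort(key=lambda entry: entry[0])
```

Sorting on every append keeps the log chronological even when entries reach the ECC out of order.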

Critical equipment status assessment and evaluation form

Date and time

Type of event

Location

Equipment

Condition

Salvage

Comments

Critical equipment status form:

OK - Undamaged

DBU - Damaged, but usable

DS - Damaged, requires salvage before use

D - Destroyed, requires reconstruction
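The four status codes can be captured as an enumeration so that assessment entries are validated rather than typed freehand. This is a minimal sketch; the `EquipmentStatus` class and `usable_now` helper are hypothetical names.

```python
from enum import Enum


class EquipmentStatus(Enum):
    OK = "Undamaged"
    DBU = "Damaged, but usable"
    DS = "Damaged, requires salvage before use"
    D = "Destroyed, requires reconstruction"


def usable_now(status: EquipmentStatus) -> bool:
    """Return True if the equipment can be used without salvage or rebuild."""
    return status in (EquipmentStatus.OK, EquipmentStatus.DBU)
```

Looking up a code with `EquipmentStatus["DBU"]` returns the matching member, and an unknown code raises `KeyError`, which catches typos on the form.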

BCDR Approvals

Once the BCDR plan is completed along with the estimated costs for recovery, it is sent to the BCDRC for formal approval. The BCC secure the support and buy-in of senior management, which underlines senior management's commitment to the BCDR process and its importance.

Implementation and Training

So we've now created a BCDR plan, and it's part of our mainstream processes and policies. The last step in being BCDR-ready is training. Regularly training all those who use the BCDR plan, as well as the employees who weren't part of its development, is critical to the success of the plan. The training can take the form of walk-throughs, mock disaster drills, or component testing.

The DRT and EMT teams choose disaster scenarios that can realistically happen. For example, they can build a scenario around a fire accident to conduct mock fire drills. The fire drills are conducted every six months to check the reaction of the employees, the efficiencies of the fire alarm and fire fighting systems, execution of evacuation procedures by the emergency personnel, and the disaster response and recovery activities.

We also train our IT teams in disaster recovery activities to get them up to speed as they are instrumental in keeping our systems available and accessible in an emergency.

Plan Review and Maintenance

Review and maintenance of the BCDR plan are closely tied: maintaining the plan requires a review from time to time to ensure that it stays current and that any changes to the infrastructure or personnel details are reflected in it. As part of our continual improvement efforts, the BCC gather lessons learned from disaster experiences and mock drills, and update the plan with new information gleaned from these experiences.

The updates, revisions, and approvals of changes to the BCDR plan are handled using the change management module in our ITSM tool.
