Incident management handbook: How Zoho manages the spectrum of IT incidents

Whew! That was a close call. Let's hope that never happens again!

Over the last decade, we’ve combated thousands of incidents.

As bootstrappers, we’ve experienced low-impact incidents that typically needed fewer technicians but still required a well-established incident management (IM) framework. For most incidents in the past, we relied on problem solving by individuals. However, as our IT infrastructure grew, we faced more complex and high-impact incidents, forcing us to up our IM game.

Soon, we realized there is no one-size-fits-all process to manage all the different types of incidents our organization faced. So, we took the frameworks that were most effective and added, combined, or omitted steps to handle each type of incident based on its impact and our business operations. This ensures each response is tailored to the challenges that incident presents.

The result? Our incident process now extends beyond established industry frameworks. We classify our IM frameworks by the severity of each type of incident and its impact on business operations.

IM framework	Severity	Impact	Scenarios
Desktop sprint	Break/fix incidents that affect an individual user	A single user is affected; no critical services are involved	Password resets; internet is slow
Desktop sprint	Low or medium-impact incidents that affect user groups or departments	A single VIP user is affected; a small group of end users is affected; there is no potential for financial loss or loss of reputation	The CEO's laptop is not working, and the CEO is unable to send and receive communications; a printer is not working on a particular floor
Big bang	High urgency; affects service	A business-critical service, application, or infrastructure component is unavailable, and the estimated time for recovery is unknown or exceedingly long; service is unavailable, and immediate restoration of service is expected	Our high-speed network connection fails, and communication to and from outside our organization is cut off; core functionality of an application is down, affecting several customers; one of our applications is down; a distributed denial-of-service (DDoS) attack
CyberSec (Showstoppers)	Immediate urgency; critical or red alert situations that affect business	Affects our company's bottom line; major impact on revenue, reputation, and legal affairs	Software bugs and vulnerabilities; malware; advanced persistent threat (APT); ransomware; phishing and social engineering; insider threat
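
The routing implied by the table above can be sketched in a few lines of Python. This is an illustrative sketch only, not how our tooling is implemented: the `Incident` fields and the `choose_framework` function are hypothetical names, and the two boolean flags are a simplifying assumption that stands in for the fuller severity and impact assessment a technician would make.

```python
from dataclasses import dataclass
from enum import Enum


class Framework(Enum):
    DESKTOP_SPRINT = "Desktop sprint"
    BIG_BANG = "Big bang"
    CYBERSEC = "CyberSec (Showstoppers)"


@dataclass
class Incident:
    # All field names are hypothetical, chosen to mirror the table's columns.
    summary: str
    is_security_threat: bool = False     # e.g. malware, ransomware, phishing
    critical_service_down: bool = False  # a business-critical service is unavailable


def choose_framework(incident: Incident) -> Framework:
    """Pick an IM framework, checking the most severe category first."""
    if incident.is_security_threat:
        return Framework.CYBERSEC
    if incident.critical_service_down:
        return Framework.BIG_BANG
    # Everything else (single users, small groups) is a desktop incident.
    return Framework.DESKTOP_SPRINT


if __name__ == "__main__":
    for example in (
        Incident("Password reset"),
        Incident("Core application is down", critical_service_down=True),
        Incident("Ransomware on a file server", is_security_threat=True),
    ):
        print(f"{example.summary}: {choose_framework(example).value}")
```

Checking the most severe category first means an incident that matches more than one row is handled under the stricter framework. Note that the table lists a DDoS attack under Big bang, so in this sketch it would be flagged as an availability problem rather than a security threat.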

When it comes to IM, there is no one-size-fits-all solution as every organization is different. What will work for your organization will depend on your business model, infrastructure, operations, the information you are protecting, your resources, and more. Recognize that some techniques only come with time and experience. This should not, however, discourage you from getting started!

Who is this guide for?

This e-book is written for IT leaders, managers, and practitioners from a service management perspective. We will walk you through our IM processes with illustrated process flows, roles, and best practices. This guide is full of lessons we've learned through trial and error—so you don't have to.

Before we dive in, let's get the basics out of the way.

What is an incident?

An incident is an unplanned event that interrupts, may interrupt, or reduces the quality of an IT service. Some classic examples are the internet running too slow, a business application going down, or a printer not working.

Truth is, we can define an incident in many ways. What matters most is that every incident should have a well-structured, timely response and resolution.

What is incident management?

Incident management is a way to restore normal service operations as quickly as possible, minimizing any adverse impact on business operations or the user.

Our incident values

Incident principles	Approach
Be proactive, not reactive	We perform preventive maintenance regularly to reduce the likelihood of failure before users are even impacted; monitoring tools provide visibility into the network's health and performance, and alert us to issues before they become incidents
Be open, and communicate	We communicate with our customers early and often to let them know that we are aware of and working on issues; a predefined group of stakeholders is automatically notified through their preferred contact methods when an incident strikes
Align teams, collaborate effectively	During high-impact incidents, our distributed teams work together across time zones, dialing in to a conference call bridge and using communication and productivity applications to handle the incident
Bounce back quickly	Incident management could mean many things, but for us, it translates into time management. We use Site24x7, which lets us know as soon as something breaks; getting ahead of issues is crucial to our IM. Sometimes, our employees act as an alert system: they use our systems daily and are likely to be the first to notice when something does not feel right. We have an open IM system, follow protocols where needed, and work as a team to resolve the issue as early as possible
Document the lessons	We sometimes make mistakes. Who doesn't? However, we ensure that we learn from those mistakes by documenting the lessons learned
Continually improve	We dig deep into what went wrong to ensure we don't make the same mistake twice; sometimes, we run mock incidents to see how our IM strategy holds up, and continue to fine-tune it before the real deal
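
The "Be open, and communicate" principle mentions notifying a predefined group of stakeholders through their preferred contact methods. A minimal sketch of that idea follows. The `Stakeholder` class, the `build_notifications` function, the message wording, and the method names such as "email" and "sms" are all hypothetical, and the sketch only prepares messages rather than sending them.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Stakeholder:
    name: str
    preferred_method: str  # hypothetical values, e.g. "email", "sms", "chat"


def build_notifications(
    stakeholders: Iterable[Stakeholder], incident_summary: str
) -> Dict[str, List[str]]:
    """Prepare one message per stakeholder, grouped by preferred contact method."""
    outbox: Dict[str, List[str]] = defaultdict(list)
    for person in stakeholders:
        outbox[person.preferred_method].append(
            f"To {person.name}: incident reported: {incident_summary}"
        )
    return dict(outbox)


if __name__ == "__main__":
    group = [Stakeholder("Asha", "email"), Stakeholder("Ben", "sms")]
    for method, messages in build_notifications(group, "network outage").items():
        print(method, messages)
```

Grouping by contact method keeps the delivery step simple: each channel receives its own batch of messages and can be handed to whatever sender that channel requires.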

Our IM tools

We utilize several tools to aid our IM processes.

Desktop incidents

Track & manage incidents:

ServiceDesk Plus Cloud is customized to fit our incident management processes.

Password resets:

ADSelfService Plus is a self-service password reset tool.

Password management:

Password Manager Pro is a secure vault for storing and managing shared sensitive information such as passwords, documents, and digital identities of enterprises.

Endpoint management:

Endpoint Central, a unified endpoint management solution, helps manage servers, laptops, desktops, smartphones, and tablets from a central location.

Major availability incidents

Alerting tool:

We use Site24x7 to monitor the availability of servers and applications.

Security incidents

Bug Bounty program:

Bug Bounty is a third-party tool for employees and individuals to report bugs, like exploits and vulnerabilities.

Communication

Note:

We also use social networking sites, messaging platforms like WhatsApp, and phone calls as fallbacks should Cliq go down, because it’s important to have alternative means of communication during a disaster.

Documentation:

Zoho Docs is a central system for storing all incident and root cause analysis (RCA) documents.

Chat:

Zoho Cliq is a real-time business messaging app that helps our employees communicate effectively with each other anytime, including during an incident.

Collaborate:

Zoho Connect is collaboration software that ensures that all teams can be on the same page when resolving incidents. Some call it the Facebook of our workplace.
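
The note above about alternative channels describes a simple fallback pattern: try the primary channel first and move to the next one if it fails. The sketch below illustrates that pattern under stated assumptions. The `send_with_fallback` function and the sender callables are hypothetical and do not call any real Cliq, messaging, or telephony API.

```python
from typing import Callable, List, Sequence, Tuple

# A sender is any callable that delivers a message or raises an exception.
Sender = Callable[[str], None]


def send_with_fallback(message: str, channels: Sequence[Tuple[str, Sender]]) -> str:
    """Try each channel in order and return the name of the first that succeeds."""
    errors: List[str] = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:  # any delivery failure triggers the next channel
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))


if __name__ == "__main__":
    def chat_unavailable(msg: str) -> None:
        # Simulates the primary chat tool being down.
        raise ConnectionError("chat unavailable")

    def phone_tree(msg: str) -> None:
        print(f"Calling on-call staff with: {msg}")

    used = send_with_fallback(
        "Incident update", [("chat", chat_unavailable), ("phone", phone_tree)]
    )
    print(f"Delivered via {used}")
```

Keeping the channel order in a plain list makes the priority explicit and easy to rehearse during a mock incident.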

Our incident management command center (IMCC)

Our incident management command center (IMCC) is a large secure room with big, NASA-like screens of monitoring devices to provide detailed metrics and visibility, enabling our IM teams to react quickly and troubleshoot effectively during incidents. This room hosts three core teams: the network operations center (NOC) team, the Zorro team, and the central system admin team. We have dynamic access control in other work sites to perform monitoring activities.
