Root cause analysis (RCA)
What is RCA?
Root cause analysis (RCA) is a systematic approach that drills deep to identify the root cause of an incident by repeatedly asking “why” questions until no additional diagnostic responses can be provided. It typically involves an analysis or a discussion soon after an incident has occurred. An additional resource, the incident state document, serves as a written record of what happened before and during the incident, and gives answers to the questions necessary to conduct a root cause analysis.
The incident state document, also known as an incident report, is the best place to start with root cause analysis. However, it is critical to dig deeper than just what the form states. At Zoho, we create a problem record from within the incident ticket to perform a fully fledged RCA through our ITSM tool.
Why perform RCA?
At Zoho, we never let a good crisis go to waste. We see an unfortunate event as an opportunity to learn from our mistakes, identify where the processes or systems failed, and be better prepared to handle similar incidents in the future.
- RCA is conducted to determine the factors that resulted in the incident and to take corrective actions rather than to simply treat the symptoms.
- A successful RCA is performed systematically with conclusions backed by real evidence.
- Most often, there is more than just one root cause for an incident.
- “If you are not making mistakes, then you are not doing anything.” At Zoho, we believe in learning from our mistakes. Having a “blameless” RCA process allows our employees and teams to give the exact details of their approach such as the actions they took and the assumptions they made when handling the incident.
Corrective Action Preventive Action (CAPA) is our structured approach to investigate, identify the root cause, take corrective action, and prevent the recurrence of the root cause(s).
Here are the actions performed by the incident manager during RCA:
Create: Creates a problem record from the incident ticket to perform RCA.
Investigate: The information in the incident state document serves as the basis for RCA. The incident manager identifies the departments and the processes associated with CAPA, and conducts a thorough investigation. Throughout the RCA process, we learn lessons and discover opportunities for improvement. We ask the questions below to form a conclusion.
Root cause statement: Zoho Accounts servers were up but were unable to serve any requests resulting in Zoho CRM and Zoho Mail facing accessibility issues.
Incident summary: This availability incident was triggered on 22-Jan-2019 at 15:31 IST and ended on 22-Jan-2019 at 15:52 IST. The incident was detected by Site24x7 and affected Zoho CRM and Zoho Mail services.
The event was mitigated by taking the following actions:
Temporary fix (immediate): The problem-causing service entry was removed within 15 minutes of the outage. Zoho Accounts servers were up and services were accessible within 20 minutes.
Permanent fix (next day build update): Previously, adding a new service cleared the JVM cache and then repopulated it. As an alternative, we now add the newly created service to the existing JVM cache, so that even if repopulation fails in the future, the older service list is still used.
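The idea behind the permanent fix, keeping the older service list in use when repopulation fails, can be illustrated with a minimal sketch. This is not Zoho's code: the actual cache lives in a JVM, and the class name `ServiceCache`, the `loader` callback, and the `refresh` method are hypothetical names chosen for illustration. The sketch builds the new list off to the side and swaps it in only when loading succeeds.

```python
from typing import Callable, List


class ServiceCache:
    """Hypothetical stand-in for an in-memory service list cache."""

    def __init__(self, loader: Callable[[], List[str]]) -> None:
        # loader fetches the full service list, for example from a database.
        self._loader = loader
        self._services: List[str] = loader()

    def services(self) -> List[str]:
        # Return a copy so callers cannot mutate the cached list.
        return list(self._services)

    def refresh(self) -> bool:
        """Reload the list; keep the old one if reloading fails."""
        try:
            fresh = self._loader()
        except Exception:
            # Repopulation failed: keep serving the older list
            # instead of leaving the cache empty.
            return False
        self._services = fresh
        return True
```

With this shape, a failed reload returns `False` and requests continue to be served from the previous list, which is the property the permanent fix is after.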
The downtime lasted for 21 minutes, and Zoho CRM and Zoho Mail customers were not able to access the services.
20 support tickets were raised following the incident through phone call, email, and chat.
The incident was detected by customers, and the incident coordinators of the Zoho CRM and Zoho Mail teams responded by informing the incident manager and involving the heads of product teams and other stakeholders.
The incident was responded to within 15 minutes of occurrence, and a temporary fix was provided.
Temporary fix (immediate): The problem-causing service entry was removed within 15 minutes after the outage. Zoho Accounts servers were up and services were accessible within 20 minutes.
Permanent fix (next day build update): Whenever a new service is added in Zoho Accounts, a list in which all services are stored will be cleared. When populating the now empty list, we try to sort the services obtained from the database. Since this list is no longer needed, we have completely removed it from our codebase.
A detailed incident timeline, in chronological order, timestamped with the time zone.
22-Jan-2019 15:27 IST: A new service entry was added to Zoho Accounts.
22-Jan-2019 15:30 IST: Zoho Accounts configurations were cleared to reflect the new entry. Subsequently, Zoho Accounts servers failed to reload.
22-Jan-2019 15:31 IST: Zoho Accounts servers were down and the services were inaccessible.
22-Jan-2019 15:51 IST: A temporary fix was executed in 20 minutes.
22-Jan-2019 15:52 IST: Zoho Accounts became stable and services were accessible again.
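The durations quoted in this report can be checked directly from the timeline. The following is a small sketch, assuming the timestamps are written as `DD-Mon-YYYY HH:MM` in IST (UTC+05:30); the helper names `parse_ist`, `minutes_between`, and `is_chronological` are introduced here for illustration only.

```python
from datetime import datetime, timedelta, timezone

# IST is a fixed offset of UTC+05:30.
IST = timezone(timedelta(hours=5, minutes=30), "IST")

# Timeline entries taken from the incident report above.
TIMELINE = [
    ("22-Jan-2019 15:27", "New service entry added"),
    ("22-Jan-2019 15:30", "Configurations cleared; servers failed to reload"),
    ("22-Jan-2019 15:31", "Servers down; services inaccessible"),
    ("22-Jan-2019 15:51", "Temporary fix executed"),
    ("22-Jan-2019 15:52", "Zoho Accounts stable; services accessible"),
]


def parse_ist(stamp: str) -> datetime:
    """Parse a 'DD-Mon-YYYY HH:MM' string as a timezone-aware IST datetime."""
    return datetime.strptime(stamp, "%d-%b-%Y %H:%M").replace(tzinfo=IST)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two IST timestamps."""
    delta = parse_ist(end) - parse_ist(start)
    return int(delta.total_seconds() // 60)


def is_chronological(timeline) -> bool:
    """True if the entries are already in time order."""
    times = [parse_ist(stamp) for stamp, _ in timeline]
    return times == sorted(times)


downtime = minutes_between("22-Jan-2019 15:31", "22-Jan-2019 15:52")
time_to_fix = minutes_between("22-Jan-2019 15:31", "22-Jan-2019 15:51")
print(downtime, time_to_fix)  # prints: 21 20
```

Storing the time zone with each timestamp keeps these calculations unambiguous when responders in other regions read the report.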
The sorting of services was part of the old UI. As the listing was not required anymore, we removed it from the codebase.
We also identified and removed similar functions where the same sorting algorithm is employed so that this downtime does not occur in the future.
The incident manager determines the root cause of an incident using the “5 whys” technique, which involves repeatedly asking the question “why?” until the root cause is identified. The purpose is not to place blame, but to uncover why an incident occurred in the first place.
Note: Sometimes it may take just three “why?” questions to reach the root cause; often, it requires more. It takes a while to master the art of questioning, but when the right questions are asked, the root cause can be identified quickly. In this case, the root cause was identified with just three questions.
Simply put, corrective actions respond to an adverse event that has already happened, while preventive actions aim to stop an adverse event from happening in the future. Corrective Action Preventive Action, typically referred to as CAPA, is an integral part of our continual improvement process.
The success of RCA requires careful management of the action plan. So the next stage of the RCA process is to establish a proposed action plan defining the list of corrective actions and preventive actions. The action plan should define the time frame in which the actions will be completed and who handles each task.
Here’s our checklist to ensure a systematic action plan:
- Are there corrective actions listed that are not supported by the analysis?
- Are the corrective actions clear and appropriate for the cause? Are the corrective actions listed in order of priority?
- If a third party is involved, will the action items be delivered within the intended time frame?
- Are the corrective actions likely to cause unintended consequences?
- Are the corrective actions under the control of management? Are the corrective actions likely to prevent recurrence?
- Has the department/action owner agreed to do the corrective action?
- Does each corrective action have a clear owner and due date?
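Some of these checks, such as whether every action has an owner and a due date and whether actions are listed in priority order, are easy to automate. The sketch below is one possible way to do that; the `ActionItem` fields, the 1-to-3 priority scale, and the owner name are assumptions made for illustration, not part of Zoho's ITSM tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    kind: str                     # "corrective" or "preventive"
    owner: Optional[str] = None   # person or team responsible
    due: Optional[date] = None    # agreed completion date
    priority: int = 3             # assumed scale: 1 = highest, 3 = lowest


def plan_gaps(items: List[ActionItem]) -> List[str]:
    """List every action that is missing an owner or a due date."""
    gaps = []
    for item in items:
        if not item.owner:
            gaps.append(f"{item.description}: no owner")
        if item.due is None:
            gaps.append(f"{item.description}: no due date")
    return gaps


def by_priority(items: List[ActionItem]) -> List[ActionItem]:
    """Return the actions ordered from highest to lowest priority."""
    return sorted(items, key=lambda item: item.priority)


# Example plan; the owner and the second item's missing fields are hypothetical.
plan = [
    ActionItem("Remove similar sorting functions", "preventive", priority=2),
    ActionItem("Remove unused service list", "corrective",
               owner="accounts-team", due=date(2019, 1, 23), priority=1),
]

for gap in plan_gaps(plan):
    print(gap)
```

Running a check like this before the plan goes for approval surfaces unowned or undated actions early, while the remaining checklist questions still need human judgment.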
The preventive actions process is to build in safeguards and process changes to prevent non-conformance. As a proactive measure, we:
- Analyze processes and services for negative trends that could escalate an incident.
- Perform risk analysis to uncover latent hazards.
- Conduct training programs to enhance our employees’ skills and to be better prepared during an incident.
- Introduce disaster recovery, security, and contingency plans for unpredictable crisis situations.
- Set up preventive maintenance to ensure our services are always safe, available, and performing optimally.
- Perform audits to assist in streamlining processes and to deliver quality service.
Finally, the RCA goes to management for approval to make the changes and prevent repeat problems. The incident manager establishes thorough follow-ups with resolver groups to ensure the corrective steps are effective and recurrence has been prevented.
The checklist below can be used by all IT teams to evaluate the overall quality of an incident response plan.
- Did the incident response plan help resolve the incident, or did the organization rely on “off-plan” activities?
- Is there a clear summary document for quickly understanding the incident?
- Is the entire incident analysis fact-based?
- Was the IT architecture robust enough to limit the impact between internal systems?
- How well did the associated teams, for example HR, legal, product, and so on, engage in assessment and communication?
- Were the data protection policy and practices adequate to identify and prioritize critical data? How effective was the communication plan?
- Have we asked “why” enough times to determine the root cause?
- Is there a clear link between facts, causes, and corrective actions?
- Did the analysis identify if the incident occurred previously?
- Were the resolvers identified earlier to handle this type of incident, or pulled in later based on their knowledge?
- Were the risks to the organization evaluated and managed? Has the RCA gone through the approval mechanism?
We conduct RCA meetings to get to the bottom of the issue, take necessary corrective actions to fix the issue permanently, and take preventive actions. The most important guideline for our RCA meetings is to learn and continually improve, not to assign blame or to vent.
Here are some tips to ensure an effective RCA meeting:
- Schedule a date and time that’s convenient for all meeting participants, keeping in mind the team members who work in shifts and other distributed teams.
- Develop and stick to a meeting agenda that doesn’t exceed two hours.
- Reserve a conference/meeting room with sufficient seating for all resolver groups, stakeholders, and top management.
- Schedule and invite participants (we use Zoho Calendar) one to two days prior to the RCA meeting, emphasizing the importance of the meeting and including the meeting agenda.
- Keep a written record of how long the meeting ran.
Incident management processes are meant to shield organizations against adverse events. This is especially true for organizations like Zoho Corporation that rely heavily on the internet and computer networks, and deal with a vast amount of personal data.
An effective incident response policy focuses on four key aspects: risk management, regular audits, preventive measures, and most importantly, employee training. At Zoho, we have the right people, processes, and tools in place to stay ahead of future cyberattacks.
Now that you’ve seen how Zoho handles incidents, we hope your organization can design and pursue a similar strategy keeping in mind your business operations, task force, and company culture.