ITIL problem management techniques

In this section, we dive into the techniques used to find the root cause of a problem in an IT environment.

IT problem management techniques

The problem management process can be mandated with a good service desk tool, but the techniques used for investigation and diagnosis should vary according to the organization. It's recommended that investigation techniques are flexible based on the organization's needs rather than being overly prescriptive.

Since problems can appear in any shape or size, it's impossible to stick to one technique to find a solution every time; instead, using a combination of techniques will yield the best results. A simple LAN connectivity problem might be solved with a quick brainstorming session, but a network or VoIP issue might need a deeper look.

Here are several techniques you can practice in your organization's problem management process.

Brainstorming

By establishing a dialogue between departments, you gain various perspectives and new information, generating many potential solutions.

To have a productive brainstorming session, you need a moderator. The moderator handles the following:

Driving the direction of the meeting
Documenting the insights obtained
Highlighting the measures to be taken
Tracking the discussed deliverables
Preventing a time-consuming session

Brainstorming sessions are more productive when collaborative problem-solving techniques, such as Ishikawa analysis and the five whys method, are used. These techniques will be discussed later in this section.

Kepner-Tregoe method

The Kepner-Tregoe (K-T) method is a problem-solving and decision-making technique used in many fields due to its step-by-step approach for logically solving a problem. It's well-suited for solving complex problems in both proactive and reactive problem management.

The method follows four processes:

Situation appraisal: Assessment and clarification of the scenario
Problem analysis: Connecting cause with effect
Decision analysis: Weighing the alternative options
Potential problem analysis: Anticipating the future

However, problem analysis is the only part that concerns IT problem management, and it consists of five steps.

Define the problem

Identifying what the problem truly is can be a problem in itself. Since problem management is inherently a collaborative effort, having a comprehensive definition of the problem eliminates preconceived notions that any participating member might have, saving a considerable amount of time.

For example, if an organization's automatic data backup on a server has failed, the problem can be defined as:

Failed backup on server

This definition indeed describes the deviation from the normal situation, but it demands more questions and information. A good model of a definition should be unambiguous and easily understood.

To remove ambiguity, the above definition can be updated to:

Data backup on November 15 failed on server #34-C

This definition provides more clarity, and spares employees from redundant questions. Nevertheless, this definition can be further improved. Suppose the cause of the data backup failure can be attributed to an event such as the application of a new patch; then the initial problem analysis would undoubtedly lead to this event.

To save time and effort, let's update the definition to:

Data backup on November 15 failed on server #34-C after application of patch 3.124 by engineer Noah

This detailed definition leaves no room for redundant questions, and provides a good amount of information on where the problem could lie. These extra minutes spent on the initial definition save valuable time and effort, provide a logical sense of direction to analysis, and remove any preconceived notions about the problem.
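The progression from a vague to a detailed definition can be sketched as a small record type. This is a minimal illustration, not part of the K-T method itself: the class name `ProblemDefinition` and its field names are assumptions made for this example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProblemDefinition:
    """Hypothetical record for a problem statement (names are assumptions)."""
    subject: str                             # what is affected
    deviation: str                           # how it deviates from normal
    date: Optional[str] = None               # when it happened
    location: Optional[str] = None           # where it happened
    preceding_change: Optional[str] = None   # change applied just before

    def statement(self) -> str:
        """Build a one-line definition from whichever details are known."""
        parts = [self.subject]
        if self.date:
            parts.append(f"on {self.date}")
        parts.append(self.deviation)
        if self.location:
            parts.append(f"on {self.location}")
        if self.preceding_change:
            parts.append(f"after {self.preceding_change}")
        return " ".join(parts)

    def missing_details(self) -> List[str]:
        """List the fields that still need an answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


vague = ProblemDefinition(subject="Data backup", deviation="failed",
                          location="server")
detailed = ProblemDefinition(
    subject="Data backup",
    deviation="failed",
    date="November 15",
    location="server #34-C",
    preceding_change="application of patch 3.124 by engineer Noah",
)

print(vague.statement(), "| still missing:", vague.missing_details())
print(detailed.statement())
```

Listing the missing fields gives the team a quick prompt for the questions a vague definition would otherwise trigger later.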

Describe the problem

The next step is to lay out a detailed description of the problem. The K-T method provides the questions that need to be asked on any problem to help identify the possible causes.

The questions below help describe four parts of any problem:

What is the problem?
Where did the problem occur?
When did the problem occur?
To what extent did the problem occur?

Each of these questions demands two types of answers:

IS: As in, "What is the problem?" or "Where is the problem?"

and

COULD BE but IS NOT: As in, "Where could the problem be but is not?"

This exercise helps compare and highlight what the deviation from normal performance in business processes is, where and when it occurs, and to what extent.

Establish possible causes

The comparison between normal performance and deviated performance made in the previous step helps in shortlisting the possible causes of the problem. Making a table with all the information in one place can be helpful to make the comparison.

	Is	Could be but is not	Differences	Changes
What	Server #34-C backup failed after patch 3.124	Failed backups in other servers with patch 3.124	New engineer (Noah) applied the patch	New patch procedure followed
Where	4th floor server	Basement servers	Normally done by Level 3 engineers	Level 1 engineer applied it
When	November 15, 12:32am	Any other time	None noted
Extent	Only on server #34-C	Any other server	None noted

New possible causes become evident when the information is assembled together. For our example problem, the most likely cause can be narrowed down to:

Procedural error caused by the inadequate transfer of knowledge by the Level 3 engineers.

Whatever the problem, a sound analysis for possible causes can be done based on relevant comparison.
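The comparison table can also be kept as structured data so that every noted difference or change is collected as a lead. The sketch below is only one way to model it: the `Row` type and the `leads` function are hypothetical names, and cells marked "None noted" (or left blank) in the table are represented as empty lists.

```python
from typing import Dict, List, NamedTuple, Tuple


class Row(NamedTuple):
    """One dimension of the IS / COULD BE but IS NOT comparison."""
    is_: str
    is_not: str
    differences: List[str]
    changes: List[str]


analysis: Dict[str, Row] = {
    "What": Row("Server #34-C backup failed after patch 3.124",
                "Failed backups in other servers with patch 3.124",
                ["New engineer (Noah) applied the patch"],
                ["New patch procedure followed"]),
    "Where": Row("4th floor server", "Basement servers",
                 ["Normally done by Level 3 engineers"],
                 ["Level 1 engineer applied it"]),
    "When": Row("November 15, 12:32am", "Any other time", [], []),
    "Extent": Row("Only on server #34-C", "Any other server", [], []),
}


def leads(table: Dict[str, Row]) -> List[Tuple[str, str]]:
    """Collect every noted difference or change, tagged with its dimension."""
    return [(dimension, note)
            for dimension, row in table.items()
            for note in row.differences + row.changes]


for dimension, note in leads(analysis):
    print(f"{dimension}: {note}")
```

Dimensions with no differences or changes contribute nothing, which keeps the list of possible causes focused on what actually distinguishes the failing case.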

Test the most probable cause

The penultimate step is to shortlist the probable causes and test them before proceeding to the conclusion. Each probable cause should be tested against this question:

If _______ is the root cause of this problem, does it explain what the problem IS and what the problem COULD BE but IS NOT?

Again, it's beneficial to populate all the information into a table.

Potential root cause	True if	Probable root cause?
Server #34-C has a problem	Only server #34-C has been affected	Maybe
Incorrect procedure	Same procedure affects another server	Probably
Engineer error	Problem did not reoccur with same procedure	Probably not
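The test question can be expressed as a simple set check: a candidate cause passes only if it accounts for everything the problem IS and predicts nothing the problem COULD BE but IS NOT. In the sketch below, the function name `explains`, the candidate labels, and what each candidate is assumed to predict are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, Set


def explains(predicted: Set[str], is_facts: Set[str],
             is_not_facts: Set[str]) -> bool:
    """True if the cause predicts every IS fact and no IS NOT fact."""
    return is_facts <= predicted and not (predicted & is_not_facts)


IS_FACTS = {"backup failed on server #34-C"}
IS_NOT_FACTS = {"backup failed on other servers with patch 3.124"}

# Hypothetical candidates: each maps to the observations it would predict.
candidates: Dict[str, Set[str]] = {
    "Patch 3.124 itself is defective": {
        "backup failed on server #34-C",
        "backup failed on other servers with patch 3.124",
    },
    "Incorrect procedure used on server #34-C": {
        "backup failed on server #34-C",
    },
}

for cause, predicted in candidates.items():
    verdict = "consistent" if explains(predicted, IS_FACTS, IS_NOT_FACTS) else "ruled out"
    print(f"{cause}: {verdict}")
```

A cause that would also have broken the other patched servers is ruled out, because those servers are part of what the problem IS NOT.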

Verify the true cause

The final step is to eliminate all the improbable causes and provide evidence for the most probable causes. With this verification, it's time to propose a solution to the problem. Without evidence of the possible root cause, the solution should not be attempted.

Ishikawa analysis, or fishbone diagram analysis

Ishikawa analysis uses the fishbone framework to enumerate the causes and effects of a problem, and can be used in conjunction with brainstorming sessions and the five whys method. Don't let the simplicity of performing root cause analysis (RCA) with an Ishikawa diagram fool you; the technique can handle complex problems.

To start the analysis, define the problem and use it as the head of the fishbone. Draw the spine and add the categories that the problem could be originating from as ribs to the fishbone.

Generally, it's easiest to start the categories with the four dimensions of service management: partners, processes, people, and technology. However, these categories can be anything relevant to your problem, environment, organization, or industry.

Once these categories form the ribs of the fishbone, start attaching possible causes to each category. Each possible cause can also branch out to detail the reason for that occurrence. This could lead to a complex diagram of four to five levels of causes and effects, subsequently drilling down to the root cause of the problem.

It's recommended to split up dense ribs into additional ribs as required. Alternatively, merging empty ribs with other suitable ribs keeps the fishbone clean and easy to read. Additionally, you should ensure the ribs are populated with causes, not just symptoms of the problem.

This analysis is again a collaborative effort, and requires a moderator to direct the brainstorming sessions in an effective way. Every participant has the opportunity to engage, providing a comprehensive view of the problem.
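A fishbone diagram is essentially a tree, so it can be sketched as a nested dictionary: the problem is the head, the categories are the ribs, and each cause can hold deeper causes. The structure, the function names, and the specific causes placed on each rib below are assumptions for illustration, loosely based on the backup example used in this section.

```python
from typing import Dict, Iterator, List, Tuple

fishbone = {
    "problem": "Data backup failed on server #34-C",
    "ribs": {
        "People": {
            "Level 1 engineer applied the patch": {
                "Level 3 engineers busy with a major incident": {},
            },
        },
        "Processes": {
            "Different patch procedure used": {
                "No standardized knowledge transfer format": {},
            },
        },
        "Technology": {"Patch 3.124 applied before the backup": {}},
        "Partners": {},  # nothing attached yet
    },
}


def deepest_causes(node: Dict[str, dict],
                   path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the full path to every cause with nothing deeper beneath it."""
    for cause, children in node.items():
        current = path + (cause,)
        if children:
            yield from deepest_causes(children, current)
        elif len(current) > 1:  # skip ribs that have no causes at all
            yield current


def empty_ribs(ribs: Dict[str, dict]) -> List[str]:
    """Ribs with no causes attached, which are candidates for merging."""
    return [name for name, causes in ribs.items() if not causes]


for path in deepest_causes(fishbone["ribs"]):
    print(" -> ".join(path))
print("Empty ribs:", empty_ribs(fishbone["ribs"]))
```

Walking the tree to its leaves mirrors the drill-down toward root causes, and listing the empty ribs shows which ones could be merged to keep the diagram readable.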

Pareto analysis

The Pareto principle is an observation that approximately 80 percent of effects come from approximately 20 percent of causes. This observation applies to a wide range of subjects, including problem management.

When trying to reduce the number of incidents occurring in an organization, it's highly efficient to apply Pareto analysis before jumping into solving the problems. Pareto analysis prioritizes the causes of incidents, and helps in managing problems based on their impact and probability.

This analysis is carried out by generating a Pareto chart from a Pareto table. A Pareto table lists each problem classification with its count and cumulative count. A Pareto chart is a bar graph of those counts, sorted in decreasing order, together with the cumulative percentage across the classifications.

To create a Pareto chart, follow the steps given below:

Collect problem ticket data from your service desk tool.
Remodel the data into categories based on various attributes.
Create a Pareto table to find the frequency of problems in each classification over a period of time.
Compute the frequency of problem occurrences in each category.
Sort the categories in decreasing order of frequency and generate the cumulative frequency percentage.
Plot the data on a graph to create a Pareto chart.
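The counting, sorting, and cumulative steps above can be sketched in a few lines. The ticket data here is made up for illustration, and `pareto_table` and `vital_few` are hypothetical helper names; the plotting step is left out, since any charting tool can draw the bars and the cumulative line from the resulting rows.

```python
from collections import Counter
from typing import Iterable, List, Tuple

ParetoRow = Tuple[str, int, int, float]  # name, count, cumulative, cumulative %


def pareto_table(categories: Iterable[str]) -> List[ParetoRow]:
    """Count each category, sort by decreasing count, add cumulative columns."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows: List[ParetoRow] = []
    running = 0
    for name, count in counts.most_common():  # highest count first
        running += count
        rows.append((name, count, running, round(100 * running / total, 2)))
    return rows


def vital_few(rows: List[ParetoRow], threshold: float = 80.0) -> List[str]:
    """Categories needed to reach the cumulative threshold (80% by default)."""
    selected = []
    for name, _count, _running, cumulative_pct in rows:
        selected.append(name)
        if cumulative_pct >= threshold:
            break
    return selected


# Hypothetical export: one category label per problem ticket.
tickets = ["Network"] * 5 + ["Software assets"] * 3 + ["Hardware assets"] * 2

table = pareto_table(tickets)
for row in table:
    print(row)
print("Address first:", vital_few(table))
```

With this sample data, the first two categories account for 80 percent of the tickets, so they are the ones to investigate first.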

The most important step is to remodel the data into a countable set of classifications and attributes.

Classification	Attribute
Impact	Affects business	Affects department	Affects user
Priority	Low	High	Urgent
Category	Network	Hardware assets	Software assets
Duration	In SLA	Outside SLA	No SLA

Classification	Attribute	Count	Cumulative	% of contribution
Priority	Low	800	800	21.07%
Duration	No SLA	670	1,470	38.72%
Priority	High	550	2,020	53.21%
Duration	Outside SLA	500	2,520	66.39%
Category	Network	430	2,950	77.71%
Priority	Urgent	300	3,250	85.62%
Category	Software assets	270	3,520	92.73%
Category	Hardware assets	150	3,670	96.68%
Impact	Affects department	80	3,750	98.79%
Impact	Affects user	35	3,785	99.71%
Impact	Affects business	9	3,794	99.95%
Duration	In SLA	2	3,796	100%

This chart helps identify the problems that should be solved first to significantly reduce service disruption. This analysis complements the Ishikawa and Kepner-Tregoe methods by providing a way to prioritize the category of problems, while the other methods analyze the root cause.

It's important to remember that the 80/20 rule suggests likely causes, and may be incorrect at times.

Five whys technique

Five whys is a straightforward technique for RCA. It defines a problem statement, then repeatedly asks why until the underlying root cause of the problem is discovered. The number of whys doesn't need to be limited to five, but can be based on the problem and the situation.

The five whys technique complements many other problem-solving techniques like the Ishikawa method, Pareto analysis, and the K-T method.

Using the previous example of the data backup failure in a server, let's apply the five whys technique.

Why did the data backup fail in server #34-C?	Due to the application of patch 3.124.
Why was it due to patch 3.124?	The procedure used was different.
Why was the procedure different?	A Level 1 engineer was responsible for it.
Why was the Level 1 engineer responsible?	The Level 3 engineers were busy with a major incident, and their transfer of knowledge was inadequate.
Why was there an improper transfer of knowledge?	There isn't a standardized schedule or format used in the organization.

The above iterative process reveals the absence of a standardized format, which has led to the problem of data backup failure.

For our purposes, the example above is a simple execution of the method. In a real scenario, the next question depends on the answer to the previous question, so it's imperative to collaborate with stakeholders who have in-depth knowledge of the domain the problem resides in.

By adopting parts of the K-T method along with the five whys technique, such as providing evidence for each answer before validating it with a return question, you can ensure precise analysis during problem-solving sessions.
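One way to picture this combination is a chain of question-and-answer steps in which each answer must carry evidence before the next "why" is accepted. The sketch below is an assumption-laden illustration: the `Why` type, the `first_unverified` function, and the evidence strings are hypothetical placeholders, not records from a real investigation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Why:
    question: str
    answer: str
    evidence: Optional[str] = None  # hypothetical placeholder text


def first_unverified(chain: List[Why]) -> Optional[int]:
    """Index of the first answer lacking evidence, or None if all have it."""
    for index, step in enumerate(chain):
        if not step.evidence:
            return index
    return None


chain = [
    Why("Why did the data backup fail?",
        "Due to the application of patch 3.124.",
        evidence="placeholder: backup and patch timestamps"),
    Why("Why was it due to patch 3.124?",
        "The procedure used was different.",
        evidence="placeholder: comparison with the usual procedure"),
    Why("Why was the procedure different?",
        "A Level 1 engineer was responsible for it."),  # not yet verified
]

stalled_at = first_unverified(chain)
if stalled_at is None:
    print("Root cause candidate:", chain[-1].answer)
else:
    print("Gather evidence before asking the next why:", chain[stalled_at].answer)
```

Stopping at the first unverified answer keeps the chain from drifting toward a plausible but unsupported root cause.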

Other techniques

Apart from the five major techniques, there are still numerous others, each with their own unique strengths. Overall, problem investigation is carried out using a combination of techniques suitable for the situation. Some other techniques that are prevalent in the problem management community are chronological testing, fault tree analysis, the fault isolation method, hypothesis testing, and pain value analysis. It's worth taking the time to learn many techniques as your organization's problem management process matures.

Up next:

You've made it this far! In the penultimate part of this six-part series, you will learn about the best practices of problem management that can help you clear any hurdles during your problem management journey.
