What is ITIL® problem management?

A problem is the cause or potential cause of multiple incidents. Problems can arise from major incidents affecting many users, or from recurring incidents. Further, problems can be identified in infrastructure diagnostic systems before users are affected.

Incidents hinder business productivity, and providing quick solutions helps ensure seamless continuity of business operations. However, when multiple incidents occur at once or the same incident occurs multiple times, it's not feasible to move forward by providing patchwork solutions, or offering the same resolutions over and over again.

ITIL® problem management is a structured way to minimize the incidents that emerge from IT infrastructure operations. It digs into incidents to find their root causes and fixes, and it reduces the severity of incidents by documenting existing issues and providing workarounds.

Problem management is a methodical approach to identifying the cause of incidents and managing the life cycle of all problems. The goal of the ITIL® problem management process is to minimize the impact of incidents and eliminate recurring ones. While ITIL® doesn't prescribe any specific technique for performing problem management, it recommends three phases to follow:

  • Problem identification
  • Problem control
  • Error control

These phases will be discussed in detail later in the guide.

Reactive problem management deals with incidents that are currently affecting users, whereas proactive problem management addresses issues that could surface as incidents in the future if left alone.

A sound problem management process has the potential to significantly reduce the influx of incident tickets, saving IT service desk staff significant time and effort. This advantage ripples into other benefits such as reduction in mean time to repair (MTTR), higher customer satisfaction, a robust known error database, and reduced cost of IT services and issues. Moreover, an organization that practices proactive problem management is likely to find tremendous value from identifying and eliminating issues before they disrupt business processes.

Problem management as an ITIL® practice is most useful when used with other ITIL® practices in the overall service value chain. Information is exchanged between the various ITIL® practices, namely incident management, change management, IT asset management, knowledge management, and continual service improvement. This information accumulates value as it moves through each ITIL® practice, in turn building an ideal IT service management process.

Before going further, the following definitions will be useful in understanding the context of this guide.

  • Workaround: Temporary solutions that restore services and ensure business continuity. A workaround reduces the impact of an incident or problem.
  • Root cause analysis (RCA): The root cause is the problem's underlying issue. RCA is the set of investigation techniques that help discover the root cause of a problem.
  • Known error: Problems that have occurred before and have a workaround or known root causes.
  • Known error database (KEDB): A database created by documenting the known errors using incident management and problem management.

In this guide, we'll examine each facet of problem management in detail, providing all the knowledge you need to get up to speed on how to implement problem management in your enterprise.

Incident management vs. problem management

In ITIL®, the terms incident and problem might appear to be synonymous, but they play distinct roles in achieving ideal service quality. It's important to know where incident management and problem management interact with each other and how they differ, especially where an incident ends and a problem begins.

Incident management

An incident is an unplanned interruption of an entire service or just a component of one. Let's look at a scenario to understand it better. There's an important meeting in 15 minutes, and a report has to be printed out. Unfortunately, the department printer isn't working. A ticket is quickly raised to arrange a workaround and get the report printed. This is an incident.

The incident management process is about handling incidents and restoring service as soon as possible. In our scenario, the service desk staff quickly connects the laptop to the adjoining department's printer to help the user get the reports ready in time for the meeting. Therefore, incident management's goal is to ensure that an interruption or incident gets resolved as quickly as possible with a workaround or a resolution.

Problem management

Problem management isn't about restoring services or troubleshooting, but about determining and removing the cause. A problem is logged in a service desk when there are recurring incidents that have common issues, or if a major incident occurs that impacts many users. In our scenario, the sole printer in the department went down and all the users in that department were affected, so the service desk staff logged a problem to find the cause and a solution. The incident can be closed once a workaround is provided, but a problem is raised to fix the printer permanently so the issue does not occur again.

Referring back to our scenario, the printer issue will undergo RCA to find a permanent fix, and be tracked as a problem ticket while the business continues with the workaround in place. If the problem management team is unable to find a solution, the workaround is documented and the issue is added to the KEDB. In this way, problem management is not only about eliminating incidents by finding the underlying root cause, but also determining the most feasible solution that can be implemented to minimize disruptions. Sometimes, despite knowing the root cause, the most feasible solution is to implement a workaround and document it as a known error.

Despite being different, incident management and problem management complement each other and are closely aligned. Incident management ensures continuity in business operations, while problem management takes care of the underlying issues and problems.

Reactive problem management vs. proactive problem management

What is reactive problem management?

Reactive problem management reacts to the incidents that show up, then proceeds with the problem management process. Essentially, a reactive problem management approach aims to find and eliminate the root causes of known errors, and deals with a problem only when it shows up as major or recurring incidents.

What is proactive problem management?

Proactive problem management seeks out issues, faults, and known errors in IT systems by going through past incidents, network monitoring logs, and other sources of information, then proceeds to solve them permanently before they arise as incidents. This process is a part of continual service improvement. Proactive problem management also aims to solve all known errors in the KEDB where it is feasible to do so.

Both types of problem management follow the same phases of problem-solving once presented with a problem: problem identification, problem control, and error control. The only difference is the approach towards identifying the problem. Nonetheless, both processes offer distinct advantages to service management, and require unique resources to function.

Choosing between reactive and proactive problem management approaches

Organizations that are new to problem management should focus their efforts on implementing a reactive problem management process. It's sensible to use the problem-solving talent of the existing service desk staff when they aren't occupied with daily incidents; in doing this, they gain valuable experience before implementing proactive problem management.

As an organization's service delivery matures, it should transition to a proactive problem management process. This transition should be carried out by a team with a good analytical skill set that's highly proficient in IT infrastructure and the tools and technology that support the organization.

However, many organizations don't undergo this transition since it's tricky to quantify the benefits of proactive problem management, which can be perceived as solving potential problems and not actual ones. Nevertheless, some of the world's most effective organizations practice proactive problem management and find tremendous benefit in it.

What are the benefits of IT problem management?

There are a few hurdles organizations might encounter in the process of establishing problem management. The organization might not have the resources to allocate for a problem management team, or it may already have an unorthodox way of managing problems and is reluctant to change. Sometimes, it could just be a cost-related denial of request.

Consequently, it's vital to include all stakeholders in the problem management process, and express how it provides value to different facets of the organization. These benefits include:

  • Eliminates the faults in an organization's services through suitable documentation.
  • Refines the service design by identifying and solving weak points, ensuring the most effective and efficient path for service delivery.
  • Increases the first-time fix rate on service failures by providing permanent solutions to incidents rather than stopping at workarounds.
  • Diminishes the impact of incidents affecting multiple users, or a single user at a crucial time.
  • Prevents most of the incidents and problems plaguing an organization over time, boosting user productivity.
  • Strengthens the confidence users have in the organization's IT services.
  • Decreases the time it takes to recover from failures through systematic maintenance of a KEDB.
  • Prevents recurring incidents through one-time fixes, sparing valuable service desk efforts in resolving them.
  • Encourages IT services to mature as the organization develops by learning from resolved problems.
  • Develops IT talent within the organization through technical awareness and valuable insights.

IT Problem management roles and responsibilities

The roles of a problem management team are directly related to the organizational structure that is present. The organization's age, culture, technology, and number of locations worldwide affect the composition of its problem management team. In the case of small IT organizations, the team's responsibilities might all be combined, or in the case of large, multinational corporations, they may be specialized.

Either way, it's up to the IT team to tailor an arrangement that ensures problems are addressed efficiently and in line with ITIL® recommendations. Being aware of the organization's general strategy is a good starting point for forming the team. It's also important to be aware of the resources the organization is ready to expend on developing a problem management team.

The team's roles and responsibilities should extend, diverge, and mature as the organization's technology grows; otherwise, confusion over accountability can arise during service delivery.

The general roles and responsibilities of problem management teams are listed below.

| Role | Responsibility |
| --- | --- |
| Problem manager | Responsible for the effectiveness and efficiency of the entire practice. Akin to team leader. |
| Problem owner | Accountable for the life cycle of any problem tickets they're assigned. |
| Problem agent | Accountable for the tasks associated with a problem ticket. |
| Diagnosis team | An assortment of people with various expertise, responsible for RCA of a problem. |

IT Problem management process flow

Just like an organization creates value for its customers, IT service management creates value for its users through best practices, and indirectly aids in creating value for the organization. To create this value, there must be a process with defined inputs and outputs. When an ITIL®-ready service desk is put in place, the streamlined flow of a problem process looks like this:

ITIL Problem management process flow steps

According to ITIL®, you can implement problem management processes with any technology you deem the right fit for your organization. The technology put in place should have functionalities that enable the three phases of ITIL® problem management.

The three phases are problem identification, problem control, and error control.

Problem identification

The problem identification phase identifies and records problems in a management tool. A service desk tool associated with multiple practices of service management, including incident management, asset management, the CMDB, and change management, gives organizations an advantage in this phase.

While the service desk staff would normally report problems based on a surge of incidents, a proactive approach to problem management identifies problems by:

  • Analyzing incident trends, leveraging network monitoring systems, and utilizing other diagnostic software.
  • Detecting risks from incidents that might recur.
  • Evaluating information received from partners and suppliers.
  • Evaluating information from internal software developers, engineers, and test teams.

Depending on your organization's structure, domain, and culture, there could be even more modes through which problems can be identified. Nevertheless, it's important to have a system in place for problems to be brought in, identified, prioritized, and recorded for further investigation and diagnosis.
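
As a rough illustration of the incident trend analysis mentioned above, the sketch below counts recent incidents per configuration item and flags the items that recur. The `Incident` fields, the ticket data, and the threshold of three incidents are assumptions made for the example, not part of ITIL® or of any particular service desk tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Incident:
    """Minimal incident record; the field names are illustrative."""
    ticket_id: str
    opened_on: date
    config_item: str  # asset or service the incident was logged against


def recurring_items(incidents, since, min_count=3):
    """Return (config_item, count) pairs that have at least `min_count`
    incidents opened on or after `since`, most frequent first."""
    counts = Counter(i.config_item for i in incidents if i.opened_on >= since)
    return [(item, n) for item, n in counts.most_common() if n >= min_count]


# Hypothetical ticket data for the example.
incidents = [
    Incident("INC-090", date(2023, 6, 30), "sql-server-delhi"),
    Incident("INC-101", date(2024, 1, 31), "sql-server-delhi"),
    Incident("INC-117", date(2024, 2, 29), "sql-server-delhi"),
    Incident("INC-130", date(2024, 3, 29), "sql-server-delhi"),
    Incident("INC-131", date(2024, 4, 2), "printer-floor-2"),
]

candidates = recurring_items(incidents, since=date(2024, 1, 1))
print(candidates)  # [('sql-server-delhi', 3)]
```

Each flagged item is only a candidate for a problem record; someone still has to confirm that the incidents share a common cause before a problem is logged.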

Problem control

Problem management is a collaborative effort, so for results to be effective, multiple departments and stakeholders should be involved in the problem control phase.

Problem control includes activities like prioritization, investigation, analysis, and documenting known errors and workarounds. There are numerous techniques that help in prioritization and analysis of problems. A good rule of thumb to follow is first tackling problems that, when solved, significantly curb the disruption of services in the organization.

Feasibility is another aspect to note when tackling problems. Fixing a problem permanently might require more resources than settling for a workaround. A quick cost-benefit analysis can determine whether you should proceed with a permanent fix or not.
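
A quick cost-benefit check of this kind can be sketched as a comparison between the one-time cost of the fix and the cost of continuing to handle incidents with the workaround. The function name, the 12-month horizon, and all figures below are hypothetical; a real analysis would also weigh risk, downtime, and user impact.

```python
def permanent_fix_pays_off(fix_cost, incidents_per_month, cost_per_incident,
                           horizon_months=12):
    """Return True when a one-time fix costs less than handling the expected
    incidents with the workaround over the planning horizon."""
    workaround_cost = incidents_per_month * cost_per_incident * horizon_months
    return fix_cost < workaround_cost


# Hypothetical figures: 4 incidents a month at 150 each is 7,200 a year.
cheap_fix = permanent_fix_pays_off(5_000, 4, 150)
costly_fix = permanent_fix_pays_off(20_000, 4, 150)
print(cheap_fix, costly_fix)  # True False
```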

Workarounds are documented in problem records. Generally, the longer a problem is expected to persist, the more advisable it is to implement a quick workaround. This workaround can even come from the incident management resolution; however, the problem management team should review the workaround and refine the resolution if necessary. In some cases, an effective incident workaround can become the permanent solution to a problem.

Error control

This phase manages the known errors in the KEDB by regularly reviewing them for possible permanent fixes and pursuing those fixes that pass the cost-benefit analysis.

Once a problem is analyzed, it's documented as a known error. These known errors are regularly reassessed to account for the impact they create, and to test the effectiveness of workarounds.
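
One way to picture a known error record and its periodic reassessment is the sketch below. The `KnownError` fields, the sample entries, and the 90-day review interval are assumptions for illustration; an actual KEDB schema depends on the service desk tool in use.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class KnownError:
    """Illustrative known error record; not a standard KEDB schema."""
    error_id: str
    summary: str
    workaround: str
    root_cause: Optional[str]  # None while the cause is still unknown
    last_reviewed: date


def due_for_review(kedb, today, interval_days=90):
    """Return the known errors last reviewed `interval_days` or more ago."""
    cutoff = today - timedelta(days=interval_days)
    return [entry for entry in kedb if entry.last_reviewed <= cutoff]


# Hypothetical entries for the example.
kedb = [
    KnownError("KE-001", "Department printer fails",
               "Print via the adjoining department's printer",
               None, date(2024, 1, 10)),
    KnownError("KE-002", "Month-end report generation is slow",
               "Generate the report locally and send it",
               "Backup traffic saturates the WAN", date(2024, 5, 20)),
]

overdue = [entry.error_id for entry in due_for_review(kedb, date(2024, 6, 1))]
print(overdue)  # ['KE-001']
```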

The relationship between ITIL® processes and problem management

An integrated system of service delivery best practices improves business services and IT service capabilities. An effective problem management process has interactions with several other ITIL® processes.

The processes that interact with problem management are briefly discussed below:

Incident management

Incident management is the methodical process of logging, categorizing, prioritizing, assigning, and resolving issues in an organization. The goal of incident management is to restore the interrupted services as soon as possible; often, this means a workaround is arranged in place of a permanent solution. Every activity in this practice is documented on a granular scale and passed to the problem management team, which initiates RCA to develop a permanent solution. So although problem management is its own process, it depends on a robust incident management process.

Change management

The objective of change management is to increase the success rate of any changes implemented in the organization. A change refers to any modification made to an organization's IT infrastructure, processes, services, products, applications, vendors, or anything else that implicitly or explicitly affects the organization's service delivery.

According to the ITIL® framework, problem management's responsibility concludes with finding the root cause that leads to a solution for a problem; actually implementing the solution is carried out through change control. Since implementing a change involves managing risk in multiple business units, it requires a process of its own for efficient handling. However, the problem management team should participate in the post-implementation review of a change to ensure consistency between the problem solution and the implemented change associated with it.

IT asset management

IT asset management is the practice of governing the life cycle of an asset in an organization. Its activities include deriving maximum value from assets, controlling asset costs, and managing the risks of assets. These risks can be in terms of compliance, vendor selection, usage policies, and disposal practices.

The practices of asset management and problem management may cross paths when problems emerge from hardware and software assets used by the organization. When the root cause of a problem appears to lie in a product or service, IT asset management's detailed record of the inventory expedites the problem-solving process. Apart from this, IT asset management assists problem management in studying the impact of an incident, examining the effects of implementing a solution, and providing information whenever necessary during RCA.

Zylker root cause analysis

Let's put things into perspective with a scenario.

Zylker is a fast-growing stock photography provider in India. A manager in Mumbai has been having trouble generating monthly reports from the SQL server in New Delhi. An incident has been raised, and the service desk staff has notified the technicians in New Delhi. As a temporary workaround, the reports are generated locally and sent to ensure business continuity.

Zylker's proactive problem management team decides to run trend analysis on incidents occurring over the past six months. They find multiple incidents pertaining to the server in New Delhi. This leads to them initiating a problem ticket and proceeding with the investigative analysis using the accumulated data from all the documented incidents.

The technician in New Delhi sees that the SQL server is using multiple types of protocols, including iSCSI and Fibre Channel, for linking data storage facilities. Since both protocols function on an Ethernet network, there is doubt about whether the local block switch was configured for large packet data transfer. The technician receives data from the IT asset management team and verifies that the switch was not the culprit. This is supported by the evidence that generating reports locally was not a problem.

The wide area network (WAN) is next in line for analysis, as a manager from Mumbai is having trouble generating the monthly report. The technician, due to their experience in network issues, has doubts about traffic flow at the end of every month, so they install software on the company's routers and switches to analyze traffic passing through them and statistically aggregate the information.

The software generates graphs and charts that indicate the top protocols that were used, along with the bandwidth each protocol consumed over a month. This unveils significant bandwidth usage at the end of the month around the same time the monthly report is generated. After careful examination, it's revealed that full image backups were scheduled around the same time as the monthly report, and this caused a significant bottleneck in the WAN.

Now that the problem's root cause is identified, the technician raises a change ticket to reschedule the image backup to the early hours of the morning before business begins, leveling out the traffic in the network.

Here's an overview of the steps performed in this scenario:

| Activity | Practice involved |
| --- | --- |
| The manager in Mumbai had trouble generating monthly reports from the SQL server in New Delhi. An incident was raised and the reports were generated locally and sent to the manager. The ticket was closed. | Incident management |
| The proactive problem management team ran trend analysis on incidents over the past six months. They found multiple incidents involving the server in New Delhi. | Problem management, incident management |
| The technician in New Delhi observed the SQL server's network and protocol, and was unsure whether or not the local block switch was configured for large packet data transfer. | Problem management, IT asset management |
| The technician received data from the IT asset management team and verified that the switch was not the culprit. | Problem management, IT asset management |
| The technician had suspicions about the traffic flow at the end of every month, and installed software on the routers and switches that analyzed traffic and statistically aggregated the information. | Problem management, IT asset management |
| After careful examination, it was revealed that full image backups were scheduled for around the same time as the report generation, and this caused a significant bottleneck in the WAN. | Problem management |
| The technician raised a change ticket to reschedule the image backup to the early hours of the morning before business begins. | Problem management, change management |

All ITIL® practices have an intricate relationship with other ITIL® practices. As your problem management matures in service delivery, make sure to improve the way it interacts with other practices for healthy, business-oriented service delivery.

IT Problem management techniques used in ITIL®

The problem management process can be mandated with a good service desk tool, but the techniques used for investigation and diagnosis should vary according to the organization. It's recommended that investigation techniques are flexible based on the organization's needs rather than being overly prescriptive.

Since problems can appear in any shape or size, it's impossible to stick to one technique to find a solution every time; instead, using a combination of techniques will yield the best results. A simple LAN connectivity problem might be solved with a quick brainstorming session, but a network or VoIP issue might need a deeper look.

Here are several techniques you can practice in your organization's problem management process.

Brainstorming

By establishing a dialogue between departments, you gain various perspectives and new information, generating many potential solutions.

To have a productive brainstorming session, you need a moderator. The moderator handles the following:

  • Driving the direction of the meeting
  • Documenting the insights obtained
  • Highlighting the measures to be taken
  • Tracking the discussed deliverables
  • Preventing a time-consuming session

Brainstorming sessions are more productive when collaborative problem-solving techniques, such as Ishikawa analysis and the five whys method, are used. These techniques will be discussed later in this section.

Kepner-Tregoe method

The Kepner-Tregoe (K-T) method is a problem-solving and decision-making technique used in many fields due to its step-by-step approach for logically solving a problem. It's well-suited for solving complex problems in both proactive and reactive problem management.

The method follows four processes:

  • Situation appraisal: Assessment and clarification of the scenario
  • Problem analysis: Connecting cause with effect
  • Decision analysis: Weighing the alternate options
  • Potential problem analysis: Anticipating the future

However, problem analysis is the only part that concerns ITIL® problem management, and it consists of five steps.

Define the problem

Identifying what the problem truly is can be a problem in itself. Since problem management is inherently a collaborative effort, having a comprehensive definition of the problem eliminates preconceived notions that any participating member might have, saving a considerable amount of time.

For example, if an organization's automatic data backup on a server has failed, the problem can be defined as:

Failed backup on server

This definition indeed describes the deviation from the normal situation, but it demands more questions and information. A good model of a definition should be unambiguous and easily understood.

To remove ambiguity, the above definition can be updated to:

Data backup on November 15 failed on server #34-C

This definition provides more clarity, and spares employees from redundant questions. Nevertheless, this definition can be further improved. Suppose the cause of the data backup failure can be attributed to an event such as the application of a new patch; then the initial problem analysis would undoubtedly lead to this event.

To save time and effort, let's update the definition to:

Data backup on November 15 failed on server #34-C after application of patch 3.124 by engineer Noah

This detailed definition leaves no room for redundant questions, and provides a good amount of information on where the problem could lie. These extra minutes spent on the initial definition save valuable time and effort, provide a logical sense of direction to analysis, and remove any preconceived notions about the problem.

Describe the problem

The next step is to lay out a detailed description of the problem. The K-T method provides the questions that need to be asked on any problem to help identify the possible causes.

The questions below help describe four parts of any problem:

  • What is the problem?
  • Where did the problem occur?
  • When did the problem occur?
  • To what extent did the problem occur?

Each of these questions demands two types of answers:

IS: As in, "What is the problem?" or "Where is the problem?"

and

COULD BE but IS NOT: As in, "Where could the problem be but is not?"

This exercise compares and highlights what the deviation from normal performance in business processes is, where and when it is happening, and to what extent.

Establish possible causes

The comparison between normal performance and deviated performance made in the previous step helps in shortlisting the possible causes of the problem. Making a table with all the information in one place can be helpful to make the comparison.

| | Is | Could be but is not | Differences | Changes |
| --- | --- | --- | --- | --- |
| What | Server #34-C backup failed after patch 3.124 | Failed backups in other servers with patch 3.124 | New engineer (Noah) applied the patch | New patch procedure followed |
| Where | 4th floor server | Basement servers | Normally done by Level 3 engineers | Level 1 engineer applied it |
| When | November 15, 12:32am | Any other time | None noted | |
| Extent | Only on server #34-C | Any other server | None noted | |

New possible causes become evident when the information is assembled together. For our example problem, the root cause can be narrowed down to:

Procedural error caused by the inadequate transfer of knowledge by the Level 3 engineers.

Whatever the problem, a sound analysis for possible causes can be done based on relevant comparison.

Test the most probable cause

The penultimate step is to short-list the probable causes and test them before proceeding to the conclusion. Each probable cause should follow this question:

If _______ is the root cause of this problem, does it explain what the problem IS and what the problem COULD BE but IS NOT?

Again, it's beneficial to populate all the information into a table.

| Potential root cause | True if | Probable root cause? |
| --- | --- | --- |
| Server #34-C has a problem | Only server #34-C has been affected | Maybe |
| Incorrect procedure | Same procedure affects another server | Probably |
| Engineer error | Problem did not reoccur with same procedure | Probably not |
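
The test question can be read as a consistency check: a candidate cause survives only if what it would lead you to expect matches both the IS and the COULD BE but IS NOT observations. The sketch below illustrates that idea with a deliberately simplified pair of candidates; the "faulty patch" candidate and the expectations attached to each cause are hypothetical and are not taken from the table above.

```python
# Observations from the IS / COULD BE but IS NOT comparison.
observed = {
    "backup failed on server #34-C": True,            # IS
    "backup failed on other patched servers": False,  # COULD BE but IS NOT
}

# What each candidate cause would lead us to expect (illustrative only).
candidates = {
    "patch 3.124 itself is faulty": {
        "backup failed on server #34-C": True,
        "backup failed on other patched servers": True,
    },
    "incorrect patch procedure on server #34-C": {
        "backup failed on server #34-C": True,
        "backup failed on other patched servers": False,
    },
}


def explains(expected, observed):
    """A cause survives only if its expectations match every observation."""
    return all(expected.get(fact) == value for fact, value in observed.items())


survivors = [cause for cause, expected in candidates.items()
             if explains(expected, observed)]
print(survivors)  # ['incorrect patch procedure on server #34-C']
```

A surviving cause is still only the most probable one; it needs to be verified with evidence before a solution is proposed.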

Verify the true cause

The final step is to eliminate all the improbable causes and gather evidence for the most probable ones. With this verification, it's time to propose a solution to the problem. A solution should not be attempted without evidence for the suspected root cause.

Ishikawa analysis, or fishbone diagram analysis

Ishikawa analysis uses the fishbone framework to enumerate the causes and effects of a problem, and can be used in conjunction with brainstorming sessions and the five whys method. The simplicity of executing RCA with an Ishikawa diagram shouldn't lead you to underestimate its ability to handle complex problems.

To start the analysis, define the problem and use it as the head of the fishbone. Draw the spine and add the categories that the problem could be originating from as ribs to the fishbone.

Generally, it's easiest to start the categories with the four dimensions of service management: partners, processes, people, and technology. However, these categories can be anything relevant to your problem, environment, organization, or industry.

Once these categories form the ribs of the fishbone, start attaching possible causes to each category. Each possible cause can also branch out to detail the reason for that occurrence. This could lead to a complex diagram of four to five levels of causes and effects, subsequently drilling down to the root cause of the problem.

Fishbone diagram example

It's recommended to split up dense ribs into additional ribs as required. Alternatively, merging empty ribs with other suitable ribs keeps the fishbone clean and easy to read. Additionally, you should ensure the ribs are populated with causes, not just symptoms of the problem.
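
If it helps to capture a fishbone outside of a drawing tool, the structure maps naturally onto a nested dictionary: categories are the ribs, and each cause can carry its own sub-causes. The sketch below is one possible representation; the causes are borrowed from the backup example, and their placement under particular categories is an assumption made for illustration.

```python
problem = "Data backup on November 15 failed on server #34-C"

# Ribs (categories) map to causes; each cause maps to its own sub-causes.
fishbone = {
    "People": {
        "Level 1 engineer applied the patch": {
            "Level 3 engineers were unavailable": {},
        },
    },
    "Processes": {
        "New patch procedure followed": {},
    },
    "Technology": {},
    "Partners": {},
}


def cause_paths(tree, prefix=()):
    """Return every branch as a tuple running from rib to deepest cause."""
    paths = []
    for cause, children in tree.items():
        path = prefix + (cause,)
        if children:
            paths.extend(cause_paths(children, path))
        else:
            paths.append(path)
    return paths


# Ribs with no causes attached are candidates for merging or removal.
empty_ribs = [rib for rib, causes in fishbone.items() if not causes]
branches = [path for path in cause_paths(fishbone) if len(path) > 1]

print(empty_ribs)  # ['Technology', 'Partners']
for branch in branches:
    print(" -> ".join(branch))
```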

This analysis is again a collaborative effort, and requires a moderator to direct the brainstorming sessions in an effective way. Every participant has the opportunity to engage, providing a comprehensive view of the problem.

Pareto analysis

The Pareto principle is an observation that approximately 80 percent of effects come from approximately 20 percent of causes. This observation applies to a wide range of subjects, including problem management.

When trying to reduce the number of incidents occurring in an organization, it's highly efficient to apply Pareto analysis before jumping into solving the problems. Pareto analysis prioritizes the causes of incidents, and helps in managing problems based on their impact and probability.

This analysis is carried out by generating a Pareto chart from a Pareto table. A Pareto table lists each classification of problems with its count and cumulative count. A Pareto chart is a bar graph of those counts in decreasing order, overlaid with the cumulative percentage of the total.

To create a Pareto chart, follow the steps given below:

  • Collect problem ticket data from your service desk tool.
  • Remodel the data into categories based on various attributes.
  • Create a Pareto table to find the frequency of problems in each classification over a period of time.
  • Compute the frequency of problem occurrences in each category.
  • Generate the cumulative frequency percentage in decreasing order.
  • Plot the data on a graph to create a Pareto chart.

The most important step is to remodel the data into a countable set of classifications and attributes.

| Classification | Attributes |
|---|---|
| Impact | Affects business, Affects department, Affects user |
| Priority | Low, High, Urgent |
| Category | Network, Hardware assets, Software assets |
| Duration | In SLA, Outside SLA, No SLA |

| Classification | Attribute | Count | Cumulative count | Cumulative % of contribution |
|---|---|---|---|---|
| Priority | Low | 800 | 800 | 21.07% |
| Duration | No SLA | 670 | 1,470 | 38.72% |
| Priority | High | 550 | 2,020 | 53.21% |
| Duration | Outside SLA | 500 | 2,520 | 66.39% |
| Category | Network | 430 | 2,950 | 77.71% |
| Priority | Urgent | 300 | 3,250 | 85.62% |
| Category | Software assets | 270 | 3,520 | 92.73% |
| Category | Hardware assets | 150 | 3,670 | 96.68% |
| Impact | Affects department | 80 | 3,750 | 98.79% |
| Impact | Affects user | 35 | 3,785 | 99.71% |
| Impact | Affects business | 9 | 3,794 | 99.95% |
| Duration | In SLA | 2 | 3,796 | 100% |

This chart helps identify the problems that should be solved first to significantly reduce service disruption. This analysis complements the Ishikawa and Kepner-Tregoe methods by providing a way to prioritize the category of problems, while the other methods analyze the root cause.
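
The steps for building the Pareto table can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any particular service desk tool: the function names `pareto_table` and `vital_few` are hypothetical, and the counts are copied from the table above. The count of 800 for Priority: Low is an assumption inferred from the cumulative column (1,470 minus 670), since it is the only attribute without a stated count.

```python
from itertools import accumulate

# Counts per (classification, attribute), copied from the table above.
# Assumption: the ("Priority", "Low") count of 800 is inferred from the
# cumulative column (1,470 - 670); it is not stated directly in the source.
COUNTS = {
    ("Priority", "Low"): 800,
    ("Duration", "No SLA"): 670,
    ("Priority", "High"): 550,
    ("Duration", "Outside SLA"): 500,
    ("Category", "Network"): 430,
    ("Priority", "Urgent"): 300,
    ("Category", "Software assets"): 270,
    ("Category", "Hardware assets"): 150,
    ("Impact", "Affects department"): 80,
    ("Impact", "Affects user"): 35,
    ("Impact", "Affects business"): 9,
    ("Duration", "In SLA"): 2,
}


def pareto_table(counts):
    """Return (key, count, cumulative count, cumulative %) rows, largest count first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(count for _, count in ordered)
    running = accumulate(count for _, count in ordered)
    return [
        (key, count, cumulative, round(100 * cumulative / total, 2))
        for (key, count), cumulative in zip(ordered, running)
    ]


def vital_few(table, threshold=80.0):
    """Return the leading rows needed to reach the cumulative % threshold."""
    selected = []
    for row in table:
        selected.append(row)
        if row[3] >= threshold:
            break
    return selected


table = pareto_table(COUNTS)
for (classification, attribute), count, cumulative, pct in table:
    print(f"{classification:<9} {attribute:<19} {count:>4} {cumulative:>5} {pct:>6.2f}%")
```

With this data, `vital_few(table)` returns the first six rows, which is where the cumulative share first passes 80 percent. Plotting the counts as bars and the last column as a line gives the Pareto chart.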

It's important to remember that the 80/20 rule suggests likely causes, and may be incorrect at times.

Five whys technique

Five whys is a straightforward technique for RCA. It defines a problem statement, then repeatedly asks why until the underlying root cause of the problem is discovered. The number of whys doesn't need to be limited to five, but can be based on the problem and the situation.

The five whys technique complements many other problem-solving techniques like the Ishikawa method, Pareto analysis, and the K-T method.

Using the previous example of the data backup failure in a server, let's apply the five whys technique.

Why did the data backup fail in server #32-C? Due to the application of patch 3.124.
Why was it due to patch 3.124? The procedure used was different.
Why was the procedure different? A Level 1 engineer was responsible for it.
Why was the Level 1 engineer responsible? The Level 3 engineers were busy with a major incident and had improper transfer of knowledge.
Why was there an improper transfer of knowledge? There isn't a standardized schedule or format used in the organization.

The above iterative process reveals the absence of a standardized format, which has led to the problem of data backup failure.

The example above is a deliberately simple execution of the method. In a real scenario, each question depends on the answer to the previous one, so it's imperative to collaborate with stakeholders who have detailed knowledge of the domain the problem resides in.

By adopting parts of the K-T method along with the five whys technique, such as providing evidence to each answer before validating it with a return question, you can ensure precise analysis during problem-solving sessions.
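
One way to keep a five whys session honest is to record each question and answer together with the evidence that supports it. The sketch below is only an illustration: the `Why` record and its `evidence` field are hypothetical names, and the questions and answers are taken from the backup failure example above. No evidence is filled in, because the example does not supply any.

```python
from dataclasses import dataclass


@dataclass
class Why:
    question: str
    answer: str
    evidence: str = ""  # hypothetical field: what supports this answer


# Questions and answers from the backup failure example above.
chain = [
    Why("Why did the data backup fail in server #32-C?",
        "Due to the application of patch 3.124."),
    Why("Why was it due to patch 3.124?",
        "The procedure used was different."),
    Why("Why was the procedure different?",
        "A Level 1 engineer was responsible for it."),
    Why("Why was the Level 1 engineer responsible?",
        "The Level 3 engineers were busy with a major incident and had "
        "improper transfer of knowledge."),
    Why("Why was there an improper transfer of knowledge?",
        "There isn't a standardized schedule or format used in the organization."),
]


def candidate_root_cause(chain):
    """The last answer in the chain is the candidate root cause."""
    return chain[-1].answer


def unverified(chain):
    """Return the questions whose answers have no recorded evidence yet."""
    return [step.question for step in chain if not step.evidence.strip()]


print(candidate_root_cause(chain))
print(f"{len(unverified(chain))} answers still need evidence")
```

Until `unverified(chain)` is empty, the last answer is best treated as a candidate root cause rather than a confirmed one.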

Other techniques

Apart from the five major techniques, there are still numerous others, each with their own unique strengths. Overall, problem investigation is carried out using a combination of techniques suitable for the situation. Some other techniques that are prevalent in the problem management community are chronological testing, fault tree analysis, the fault isolation method, hypothesis testing, and pain value analysis. It's worth taking the time to learn many techniques as your organization's problem management process matures.

Best practices for problem management

Although we've discussed the process and the various methods to practice problem management, there are certain things to keep in mind while actually going about it. These do's and don'ts will help you avoid little hiccups on your problem management journey.

DO's:

  • Broadcast the exact differences between an incident and a problem: ITIL® processes work only when there's a clear, acknowledged divide between incident management and problem management, so develop a distinction that works for your organization.
  • Acknowledge that the problem manager is a non-technical role: The problem manager is the glue that holds the entire team together. The technical part of the process will be conducted by experts, but the problem manager enables it to happen.
  • Establish the objectives for your problem management efforts: Move forward with short-term and long-term goals so your focus is not easily diminished. For example, consider short-term goals as something like solving the top ten problems that disrupt the business, and long-term goals akin to improving your support cost savings.
  • Always go for permanent solutions rather than temporary ones: Realize the true benefits of problem management by always going for permanent solutions, even if it's a permanent workaround.
  • Welcome the people that challenge the state of things: Be appreciative of members who question the existing state of things. This could be the spark that leads to positively improving your organization's system.

DON'Ts:

  • Try to achieve perfection from the start: Problem management is a learning experience and is unique to each organization. Trying to be perfect from the start is setting yourself up for failure. No one becomes a rock star the first day they start playing the guitar.
  • Overcomplicate things with reactive and proactive approaches: Take it easy in the beginning. Some problems can't be overlooked due to their severity, and some have to be found through analysis.
  • Measure problem management as an individual process: ITIL® processes as a whole are designed to cooperate with each other to make your IT service delivery more manageable. Both the good and bad results of your problem management can stem from incident management, change management, or even project management.

IT Problem management key performance indicators

Key performance indicators (KPIs) should provide value to users, technicians, and stakeholders alike. While these metrics act as a self-examination tool, it is advisable to limit the metrics to seven or eight for the problem management process since too many might provide a skewed perception of the process itself. There could be problems on the ground level, but the various metrics acting together could come to a different conclusion.

KPIs can vary according to the way an organization functions, so there isn't one single list of applicable metrics for all organizations. In order to determine which KPIs should be monitored, stakeholders should be asked to weigh in and decide what would be beneficial.

Below are the most applicable KPIs for the problem management process.

| KPI | Formula | Comment |
|---|---|---|
| Average time to start RCA | The average time taken from identifying a problem to initiating RCA. | This showcases the efficiency of the problem diagnosis team. |
| Total number of uncompleted problems | The count of problems that are yet to undergo RCA. | This is different from an unresolved problem. Uncompleted problems are logged, but work hasn't been started on them yet. |
| Percentage increase/decrease of major incidents | Percentage increase/decrease of major incidents | This metric can assist in finding trends, such as the frequency of problem occurrences. |
| Total number of problem records reported | The total number of problems logged from incidents. | As your problem management practice matures, the problems reported from incidents should go down. |
| Average resolution time of problems | The average time taken from identification to the resolution of a problem. | Problems can take a long time to be solved. To speed the process up, it helps to measure the improvement efforts in RCA and the actual problem management process. |
| Total number of known errors | The count of known errors in the KEDB. | This highlights your organization's documentation efforts. If the ratio between the number of problems logged and known errors is low, it's a good sign. |
| Total number of unresolved problems | The count of unresolved problems in the service desk. | Unresolved problems are ones with ongoing RCA. |
| Total/average number of incidents associated with problems | The count of all incidents with an associated problem ticket. | When trying to scale up your proactive diagnosing activities, ensure that this metric gradually recedes to a minimum. |
| Percentage of problems with an identified root cause | The number of problems that have a clear, identified root cause, compared to the overall logged problems. | Together with the percentage of problems with a workaround, this metric supplements other metrics, like the effectiveness of the problem management practice, and helps with decision-making, such as monetary decisions. |
| Percentage of problems with a workaround | The number of problems that have a workaround in place rather than a permanent fix, relative to the overall logged problems. | Together with the percentage of problems with an identified root cause, this metric supplements other metrics and helps with decision-making. |
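
Several of these KPIs can be computed directly from exported problem records. The sketch below is a rough illustration under stated assumptions: the `Problem` record and its field names are hypothetical rather than taken from any service desk tool, the four sample records are made up purely to exercise the code, and the definitions of "uncompleted" and "unresolved" follow the comments in the table above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Problem:
    """Hypothetical problem record; field names are illustrative only."""
    identified: datetime
    rca_started: Optional[datetime] = None
    resolved: Optional[datetime] = None
    root_cause: Optional[str] = None
    workaround: Optional[str] = None


def mean_delta(deltas):
    """Average a sequence of timedeltas; None if the sequence is empty."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas) if deltas else None


def percentage(part, whole):
    """Share of `part` in `whole` as a percentage; 0.0 when `whole` is zero."""
    return 100.0 * part / whole if whole else 0.0


def problem_kpis(problems):
    total = len(problems)
    return {
        "avg_time_to_start_rca": mean_delta(
            p.rca_started - p.identified for p in problems if p.rca_started),
        "avg_resolution_time": mean_delta(
            p.resolved - p.identified for p in problems if p.resolved),
        # Uncompleted: logged, but RCA has not started yet.
        "uncompleted": sum(1 for p in problems if p.rca_started is None),
        # Unresolved: RCA has started but the problem is not resolved.
        "unresolved": sum(1 for p in problems if p.rca_started and not p.resolved),
        "pct_with_root_cause": percentage(
            sum(1 for p in problems if p.root_cause), total),
        "pct_with_workaround": percentage(
            sum(1 for p in problems if p.workaround), total),
    }


# Made-up sample records, used only to exercise the functions above.
t0 = datetime(2024, 1, 1, 9, 0)
hours = lambda n: timedelta(hours=n)
sample = [
    Problem(t0, t0 + hours(2), t0 + hours(26), root_cause="sample cause"),
    Problem(t0, t0 + hours(4), workaround="sample workaround"),
    Problem(t0),
    Problem(t0, t0 + hours(6), t0 + hours(10),
            root_cause="sample cause", workaround="sample workaround"),
]

for name, value in problem_kpis(sample).items():
    print(f"{name}: {value}")
```

Keeping the calculation in one function makes it easier to agree with stakeholders on exactly what each metric counts before it goes on a dashboard.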

The best features for problem management software

It's easier for organizations to leverage software to formulate their problem management process rather than try to develop it from scratch. There are numerous solutions on the market that claim to be the best for problem management; at the very least, problem management software should feature the capabilities listed below.

| Area | Feature | Value |
|---|---|---|
| Problem creation and logging | Create problems from an incident | Identify incidents of an underlying problem that require a full investigation, and associate changes. |
| Problem creation and logging | Mark a problem as a known error | Maintain a KEDB. |
| Problem creation and logging | Create problem templates | Standardize the format for defining problems. |
| Problem analysis | Problem roles and technician | Identify the problem owner. |
| Problem analysis | Include analysis, impact, and RCAs for every problem | Analyze the impact, symptoms, and root cause of the problem, and document them. |
| Problem analysis | Mark services affected and assets involved | Precisely define each problem, and quantify the business impact. |
| Problem solution | Add tasks with dependencies within a problem | Assign solution implementation to specific technicians with due dates. |
| Problem solution | Mark workarounds as solutions | Provide a temporary solution with a workaround, or a permanent solution to the problem. |
| Problem solution | Associate a change with the problem | Make problem management work in tandem with other ITSM processes. |
| Problem closure | Copy a problem solution and workaround to all associated incidents | Avoid redundant activities, and ensure consistent records on all tickets. |
| Problem closure | Close all associated incidents automatically on closure of the problem | Save technicians time and effort. |
| Problem closure | Create work logs to record the cost, effort, and time taken to resolve a problem | Get detailed KPIs with respect to cost and time taken to resolve problems. |
| Problem closure | Notification rules and announcements | Put notification mechanisms in place to keep stakeholders informed. |

Conclusion

ITIL®'s problem management framework is a guiding light for every organization on the path to proactive problem diagnosis and resolution. Problem management and its practices are flexible enough to suit any organization, irrespective of size, geographical spread, industry, or the technology it relies on every day.

Organizations with robust incident management should aim for a basic problem management setup by implementing a separate channel for logging and managing problems and maintaining a KEDB. As the problem management team's experience grows along with the organization, the process should mature as well.

For an organization that already practices problem management, its aspirations should lie in reducing incidents to an all-time low. This is most attainable through a proactive approach to problem management.

An easy first step in implementing the problem management process is to utilize a service desk tool with the right modules to ensure comprehensive IT service desk operations and centralized control of tickets, incidents, and problems. Having a streamlined problem management process in your organization is a long-term project that will pay off as your business grows and your IT infrastructure scales.

Align your IT environment with the industry's best practices

Armed with problem management expertise, you can implement other ITSM disciplines with confidence, and learn how you can reinforce your service delivery the ITIL way. Download a free copy of our ITIL handbook and a best practice checklist to review your problem management solution.
