8 Key Performance Indicators (KPIs) that every IT help desk needs to know
People often say "what gets measured gets improved," but they rarely say what,
exactly, should be measured. With the recent developments in the reporting
capabilities of IT help desk software, hundreds of KPIs and help desk metrics can be
measured and monitored.
But that doesn't mean you should measure them all.
Only the KPIs and metrics that are critical to your IT help desk need to be
measured to improve service delivery.
This paper describes the 8 KPIs that are critical to every IT help desk.
These KPIs help meet basic IT help desk objectives such as business continuity,
organizational productivity, and delivery of services on time and within budget.
The KPIs are as follows:
Ensuring business continuity
Making the organization productive
Ensuring business continuity
1. Lost business hours
The number of hours the business is down because IT services are unavailable.
Keep lost business hours to the bare minimum.
Most IT teams track service availability to see the overall performance of their
IT service desks. But the pain of lost business isn't always reflected in service
availability levels, even when those levels are high. For instance, if service
availability is at 99.9%, the company still loses more than eight hours
per year. Tracking lost business hours clearly highlights the loss and its impact.
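The arithmetic behind the 99.9% figure can be checked with a minimal sketch. It assumes a service that is expected to run 24x7 over a 365-day year; the function name `lost_hours_per_year` is illustrative and does not come from any help desk product.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming 24x7 service in a non-leap year


def lost_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability level."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Compare a few availability levels side by side.
for level in (99.0, 99.9, 99.99):
    hours = lost_hours_per_year(level)
    print(f"{level}% available -> {hours:.2f} hours lost per year")
```

At 99.9% availability this works out to 8.76 hours, which is the "more than eight hours" mentioned above.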
Case study: No-fly time at Virgin Blue
In September 2010, Virgin Blue faced what could be considered every
airline's worst nightmare. About 50,000 customers and 100 flights were
grounded. Four hundred more flights were delayed or rescheduled over the
following days because the solid-state disk server infrastructure hosting
Virgin Blue's applications failed. This affected Virgin Blue's online check-in
and booking system.
Despite SLAs to restore services immediately, it took 11 hours for the service
to be restored, and 10 more hours to restore full operations. This was
because of an attempted repair of a faulty device, which delayed the switch
over to a contingency hardware platform. By then, the damage was already
done. Although these 11 hours didn't cost much in terms of Virgin Blue's IT
service availability for the year, they cost Virgin Blue approximately $10
million in terms of lost business.
Industry standards - Lost business hours
- Number of downtime events in the last 12 months
- Average amount of downtime per event in the last 12 months
- Longest downtime event
- Critical application availability
- Length of time to recover from last downtime event
Tips for minimizing lost business hours
- Proper planning and execution of application upgrades, server migration,
and any IT change implementation process.
- Having a clean and well-defined CMDB to identify critical failure points
and understanding CI interactions in the network to identify the cascading
impact of failed changes.
- Educating IT teams on the risks of SLA violations in terms of lost business
hours and revenue.
- Gaining insight into anticipating and handling outages by evaluating the
past performance of the IT help desk.
That said, a lot of factors could contribute negatively towards lost business
hours. In 2010, Gartner projected that, "Through 2015, 80% of outages impacting
mission-critical services will be caused by people and process issues, and
more than 50% of those outages will be caused by change/configuration/release
integration, and hand-off issues."
2. Change success rate
The ratio of the number of successful changes to the total number of changes
that were executed in a given time frame.
Achieve a higher percentage of successful change implementations.
Opinion remains divided on what a failed change implies. It basically refers to
any change that did not meet its objectives or go as planned.
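As a rough illustration of the definition above, the ratio can be expressed as a percentage. The function name and the sample figures (47 successful changes out of 50) are hypothetical; what counts as "successful" is left to the organization's own definition.

```python
def change_success_rate(successful_changes: int, total_changes: int) -> float:
    """Return successful changes as a percentage of all changes executed."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    if not 0 <= successful_changes <= total_changes:
        raise ValueError("successful_changes must be between 0 and total_changes")
    return 100 * successful_changes / total_changes


# Hypothetical month: 50 changes executed, 47 met their objectives.
print(f"Change success rate: {change_success_rate(47, 50):.1f}%")
```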
Case study: The ASX outage
On October 27, 2011, trading had to be halted at the Australian Stock Exchange
(ASX) for four hours due to a failed change implementation. An upgrade
on the ASX's internal network (to improve the latency of the trading
platform) led to unprecedented connectivity issues between the supporting
components and the disseminating gateways of the trading system. ASX had
to initiate trading services from one of their disaster recovery sites. Finally,
to restore normalcy, the change had to be backed out that night.
A downward trend or a stagnant change success rate usually results from
change implementations failing due to:
- Lack of relevant information such as the impact of the change, the dependencies
of the assets involved, the change implementation window, and
- Inability to collaborate between teams for successful change
- Improper communication to end users and stakeholders of the change
Tips for a high change success rate
- Perform a proper impact analysis and a detailed rollout plan with a check
list of tasks to be completed.
- Collect all relevant information from end users and technicians before the
- Constitute CABs and ensure a strict approval process.
Another help desk metric that should be tracked to have an effective change management process is the number of unplanned changes. An unplanned change can be an
emergency change or an urgent change.
- An emergency change: A service restoration change due to an incident, or
a change that needs to be implemented quickly to avoid an incident.
- An urgent or expedited change: Changes that are required quickly due to
a pressing need such as a legal requirement or a business need, but are not
related to restoring service.
Although there is no industry standard or defined number for the number of
unplanned changes permissible in an IT infrastructure, this reporting metric is important,
especially during an increasing trend or a spike in the number of unplanned changes.
An increasing trend in unplanned changes
An increasing trend in the number of unplanned changes indicates inadequate
planning of changes and calls into question the efficiency of the change management
process. Therefore, the change management process has to be improved
to ensure proper planning and execution of changes.
Increasing trend in unplanned changes
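One simple way to check whether unplanned changes are trending upward is to fit a straight line to the monthly counts and look at the sign of its slope. The sketch below uses an ordinary least-squares slope. That is an assumption made for illustration, not a method prescribed by this paper, and the monthly counts are made up.

```python
def trend_slope(counts: list[float]) -> float:
    """Return the least-squares slope of counts against their position (0, 1, 2, ...)."""
    n = len(counts)
    if n < 2:
        raise ValueError("at least two data points are required")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical number of unplanned changes per month over six months.
unplanned_per_month = [4, 5, 5, 7, 8, 10]
slope = trend_slope(unplanned_per_month)
direction = "increasing" if slope > 0 else "flat or decreasing"
print(f"Slope: {slope:.2f} changes per month ({direction})")
```

A positive slope sustained over several months is the signal to revisit change planning. A single high month is better treated as a discrete spike.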
A discrete spike in unplanned changes
A sudden spike in the number of unplanned changes can be due to unanticipated
major incidents, which warrant emergency changes to restore service. Such
a situation is probably due to an unstable infrastructure, which could affect
service availability and, ultimately, the business.
Discrete spikes in unplanned changes
3. Infrastructure stability
A highly stable infrastructure is characterized by maximum availability, very
few outages, and low service disruptions.
Maintain a highly stable infrastructure.
To effectively gauge and monitor infrastructural stability, IT help desks need to
monitor the following:
- Percentage reduction in the number of problematic assets
- Percentage reduction in the number of major incidents
Percentage reduction in the number of problematic assets
Delivering maximum availability and better service quality will be impossible
in an infrastructure where routers have to be restarted multiple times a day,
servers are often down, or workstations have to be rebooted every now and
then. Therefore, such problematic assets must be identified and replaced to
ensure business continuity.
A problematic asset might repeatedly be the cause
for service disruptions or outages, and for reporting purposes, these could be
assets that have more than a couple of incidents associated with them. The percentage
reduction in the number of problematic assets can be calculated using
the following formula:
\[
\text{Percentage reduction} = \frac{\text{Number of problematic assets replaced at the end of the time frame}}{\text{Number of problematic assets identified at the beginning of the time frame}} \times 100
\]
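The calculation above can be sketched in two steps: flag assets with more than a couple of incidents, then compare the replaced assets against that list. The threshold of two incidents, the asset names, and the function names are all assumptions made for this example.

```python
from collections import Counter


def problematic_assets(incident_asset_ids: list[str], threshold: int = 2) -> set[str]:
    """Return assets with more than `threshold` incidents (threshold is an assumption)."""
    counts = Counter(incident_asset_ids)
    return {asset for asset, n in counts.items() if n > threshold}


def percentage_reduction(identified: set[str], replaced: set[str]) -> float:
    """Return replaced problematic assets as a percentage of those identified."""
    if not identified:
        raise ValueError("no problematic assets were identified")
    return 100 * len(identified & replaced) / len(identified)


# Hypothetical incident log: one asset ID per incident raised in the time frame.
incident_log = ["router-1"] * 5 + ["srv-2"] * 3 + ["ws-9"] + ["srv-7"] * 4
identified = problematic_assets(incident_log)
reduction = percentage_reduction(identified, replaced={"router-1"})
print(f"{len(identified)} problematic assets, {reduction:.1f}% replaced")
```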
Percentage reduction in the number of major incidents
Another major indication of stability is the recurrence of major incidents on the
IT infrastructure, which can lead to service disruptions or service level deterioration.
A major incident, by definition, is a high-impact, high-urgency incident
that affects a large number of users, depriving the business of one or two key
The goal is to reduce the number of major incidents, which can be
achieved with efficient Root Cause Analysis (RCA) and a reduction of problem
backlog. Identifying root causes and fixing problems can reduce the recurrence
of major incidents and, subsequently, ticket volumes to the IT help desk.
Tips to reduce problem backlog (and therefore major incidents)
- Faster initiation of RCA: In this case, the sooner the better. The sooner the
RCA is initiated, the greater the chances are of identifying the root cause.
- Quick completion of investigations: If the root cause is identified faster,
the IT team can fix and resolve the problem faster, making sure that incidents
Teams can also measure these action items by tracking the time taken to initiate
root cause analysis after problem identification and the time taken to complete root
cause analysis.
Case study: Reducing major incidents helps improve IT stability
One of the world's leading financial institutions was able to improve its stability
by reducing its major incidents. This reduction in the number of incidents
was achieved by improving its root cause analysis process.
Reducing major incidents helps improve IT stability
The major reasons for a heavy problem backlog could be:
- Delayed and long-pending RCAs.
- Inconsistent quality of RCAs, and lack of proper documentation.
- Not effectively communicating the investigation process to the
Without identifying and rectifying the root cause, the chances of major incidents
recurring are fairly high. Thankfully, though, the problem backlog can
be reduced by:
Working on these two simple ITIL® service desk metrics (percentage reduction
in the number of major incidents and percentage reduction in the number of
problematic assets) can help you maintain a highly stable IT infrastructure.
4. Ticket volume trends
Total number of tickets handled by the IT help desk and their patterns within a
given time frame.
Optimize the number of incidents and service requests, and prepare the IT team
to handle the ticket load.
What can you do with ticket volume trends?
- Identify peaks and troughs to optimize resource management and
technician workload.
- Create a better staffing model.
- Design training sessions for your IT service desk team.
- Analyze service request patterns and plan ahead for purchases of
assets and licenses.
- Validate any additional resource requirements.
IT help desks should watch out for a few trends when it comes to ticket volumes,
Discrete spikes in ticket volumes
A sudden upward spike in the ticket volume can be due to the following:
- a. Period of peak business activity
- b. IT rollouts leading to:
  - i. Service disruptions and unavailability
  - ii. FAQs
- c. IT disruptions
- d. Post-holiday password reset tickets
Case study: Fall intake leads to ticket spike at a university
The figure below (7) represents the number of tickets handled by the IT help
desk at a university in the United States. The graph clearly indicates a ticket
spike in September of both 2012 and 2013. This is due to the increased
number of students joining the university during the fall. So, the IT team makes
sure that this extra load is distributed evenly across the team, and each member
works overtime to handle these ticket spikes.
Ticket volume at an American university
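A discrete spike like the September one can be flagged automatically by comparing each month against a typical month. The sketch below uses the median as the baseline and a factor of 1.5 as the cut-off. Both choices, and the monthly figures, are assumptions for illustration rather than values from the case study.

```python
from statistics import median


def spike_months(volumes: dict[str, int], factor: float = 1.5) -> list[str]:
    """Return the months whose ticket volume exceeds `factor` times the median volume."""
    baseline = median(volumes.values())
    return [month for month, count in volumes.items() if count > factor * baseline]


# Hypothetical monthly ticket volumes for a university help desk.
monthly_volumes = {
    "2013-06": 410,
    "2013-07": 395,
    "2013-08": 450,
    "2013-09": 980,
    "2013-10": 520,
    "2013-11": 430,
}
print("Spike months:", spike_months(monthly_volumes))
```

The median is used instead of the mean so that the spike itself does not inflate the baseline it is compared against.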
Gradual continuous upward trend
Continuous upward trend in ticket volumes
An upward trend could be due to any of the following reasons:
Increase in the organization size
As the business grows, it is obvious that the IT service desk has to support more
end users, which typically leads to increased ticket volumes. This gradual increase in the ticket volume can be handled by an effective staffing plan in accordance with the growth of the business. Furthermore, end users can be segregated
into departments and user groups to handle tickets effectively.
Initiatives to support more business functions
As IT starts supporting more business functions, the ticket volume (both incidents and service requests) rises. This can be handled by understanding the requirements and expectations of the end users, and equipping the IT help desk team to handle the increase in tickets.
Decrease of infrastructure stability
With the increasing number of problematic and outdated assets in the IT network,
the number of tickets is bound to increase as well. This can be addressed
by associating the incidents and problems with their assets, helping the IT team
decide whether to retire the assets, upgrade them, and so on.
5. First Call Resolution Rate (FCRR)
Percentage of incidents resolved by the first level of support (first call or
contact with the IT help desk).
Have a higher level of FCRR.
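A minimal sketch of this calculation is shown below. It assumes each incident record notes the support level at which it was finally resolved; the `Incident` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    ticket_id: int
    resolved_at_level: int  # 1 = first level of support; 2 or more = escalated


def first_call_resolution_rate(incidents: list[Incident]) -> float:
    """Return the percentage of incidents resolved by the first level of support."""
    if not incidents:
        raise ValueError("at least one incident is required")
    first_level = sum(1 for incident in incidents if incident.resolved_at_level == 1)
    return 100 * first_level / len(incidents)


# Hypothetical sample: three of five incidents were resolved at level 1.
sample = [Incident(1, 1), Incident(2, 1), Incident(3, 2), Incident(4, 1), Incident(5, 3)]
print(f"FCRR: {first_call_resolution_rate(sample):.1f}%")
```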
A high first call resolution rate is usually associated with higher customer satisfaction,
as confirmed by a study conducted by Customer Relationship Metrics.
Furthermore, a study conducted by the Service Quality Measurement Group
also revealed that for every one percent improvement in FCR, you get a one
percent improvement in customer or end user satisfaction.
First call resolution is also related to cost per ticket. The following graph
represents the cost per ticket for every level.
Cost per ticket at various levels of support
Sometimes IT help desk technicians rush to close tickets during the first call,
even without accurate resolutions. In such cases, first call resolution rates
rise while end user satisfaction rates drop drastically, as depicted in the
following graph.
First call resolution rate Vs. End user satisfaction
FCRR Excellence Tip
Here is a simple three-phase technique to get your IT help desk team resolving
tickets in the first call.
Phase 1: Learn the environment
- Gather environment-specific knowledge.
- Populate the knowledge base with the information collected, creating relevant
- Generate regular status reports on the IT help desk performance with sections
on lessons learned, achievements, and obstacles overcome.
- Invite experts to evaluate performance.
- Create an operations manual that clearly outlines support processes, centralizes
key environmental information, and explicitly defines complex procedures
for ticket resolution.
Phase 2: Fine tune
Generate reports to ascertain that the efforts of Phase 1 panned out, and identify
areas of improvement. Below are some sample reports to help you get started.
- Percentage of calls taken by each technician.
- Number of calls taken per agent, per hour.
- Average talk time, by agent.
- Of the tickets we did not close, where were they transferred?
- Of those transfer destinations, who received the most tickets?
Phase 3: Optimize
Establish a well-defined process for continual improvement of first call
resolution.
This technique not only helps you improve the FCRR levels, but also helps
ensure that tickets are properly resolved, not just closed.
Another possible trend is a constantly degrading FCRR, as shown in the
following graph.
Constantly degrading FCRR
There are a few reasons this could occur, but the primary reasons are as follows:
- Lack of requester and system information.
- Poor technician capabilities.
- Poor knowledge transfer and sharing.
According to MetricNet's benchmarking levels, the average net FCRR for service
desks globally is 74 percent, with a range of 41 to 94 percent. The most
common factors among all the service desks on the higher end of the spectrum were
the presence of highly trained agents, the availability of knowledge management
tools, and the presence of tools such as remote desktop management.
FCRR can be improved with the following tips:
- Communicate the importance of FCRR to the technicians.
- Design training programs for the first level technicians on specific subjects
to help resolve tickets faster.
- Maintain a knowledge base of advanced technical solutions and articles
exclusively for, and limited to, technicians.
- Create custom forms to collect all relevant information at the time of ticket
creation to avoid turnaround delays.
- Automatically route tickets to the right technician or group based on ticket
6. SLA compliance rate
Percentage of incidents resolved within the agreed SLA time.
Maintain maximum SLA compliance rate.
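The definition above can be sketched as follows, assuming each incident carries a priority, an opening time, and a resolution time. The per-priority SLA targets and the sample incidents are hypothetical values chosen only to make the example runnable.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per priority; real targets come from the agreed SLAs.
SLA_TARGETS = {
    "high": timedelta(hours=4),
    "medium": timedelta(hours=8),
    "low": timedelta(hours=24),
}


def sla_compliance_rate(incidents: list[tuple[str, datetime, datetime]]) -> float:
    """Return the percentage of incidents resolved within their SLA target.

    Each incident is a (priority, opened_at, resolved_at) tuple.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    within_sla = sum(
        1
        for priority, opened_at, resolved_at in incidents
        if resolved_at - opened_at <= SLA_TARGETS[priority]
    )
    return 100 * within_sla / len(incidents)


sample_incidents = [
    ("high", datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 12, 30)),   # 3.5 h, within SLA
    ("high", datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 15, 0)),   # 5 h, violation
    ("medium", datetime(2024, 1, 9, 9, 0), datetime(2024, 1, 9, 16, 0)),  # 7 h, within SLA
    ("low", datetime(2024, 1, 9, 9, 0), datetime(2024, 1, 10, 8, 0)),     # 23 h, within SLA
]
print(f"SLA compliance rate: {sla_compliance_rate(sample_incidents):.1f}%")
```

This sketch measures elapsed clock time. A help desk that counts only business hours would need to adjust the duration calculation accordingly.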
Tracking SLA compliance levels helps IT help desks:
- Ascertain that the service levels are real and obtainable.
- Check the performance of the IT help desk against the service levels agreed
with the end user.
- Identify areas of improvement, strengths, and weaknesses of the IT help desk.
Sometimes IT help desk technicians close tickets without proper resolutions,
just to avoid SLA violations. When this happens, the SLA compliance
rate remains high, but end-user satisfaction levels are bound to decrease, as
shown in the following graph.
SLA compliance rate Vs. End user satisfaction
SLA compliance levels may drop for other reasons, though, so it is important to
keep the following possibilities in mind:
- Your team may not understand the business requirements, which can lead
to service level agreements that don't fulfill the business needs, or improper
categorization and prioritization of tickets leading to SLA violations.
- There is often a lack of proper communication on the risks of outages affecting
mission-critical services and their business impacts.
In such scenarios, IT service desk teams must understand the requirements of
the business, and redefine their SLAs as appropriate.
Case study: When meeting SLAs doesn't help
SLAs and SLA compliance are critical to ensuring business continuity. This
case from a cement manufacturing company, however, stresses that SLAs must
also be set carefully. The IT help desk was unavailable for immediate response
to an issue on a truck dispatch, but did resolve it within the SLA. Unfortunately,
the cement manufactured had to be dispatched to the client location within
one hour to avoid hardening.
The IT help desk was unaware of this, and SLAs
were set without considering these factors. As a result, though the ticket was
resolved within the SLA, the cement had already hardened, which affected the
Decreasing trend in SLA compliance rate
Another alarming trend to keep an eye out for is a constantly degrading SLA
compliance rate.
This falling trend could be due to any of the following:
- Unrealistic service level agreements.
- Lack of awareness of the SLAs and the risks of SLA violations.
- Absence of proper monitoring and proactive escalation.
- Lack of technician expertise.
- Unassigned tickets and delayed and faulty ticket assignments.
The SLA compliance rate can be kept at higher levels by:
- Setting realistic SLAs based on the business requirements and
- Communicating the SLAs and risks of SLA violations to the business and
- Setting necessary escalation rules.
- Automating the process of routing and assigning tickets.
- Designing training programs for your technicians.
7. Cost per ticket
The total monthly operating expense of IT support, divided by the monthly ticket volume.
Maintain minimum levels of cost per ticket.
As per MetricNet, the following were the cost per ticket benchmarks for 2014.
Industry standard - Cost per ticket at a high density environment
Industry standard - Cost per ticket at a medium density environment
As seen in both cases, the cost of the service request is usually higher than the
cost of the incidents. This is because incidents typically take less time to resolve
than service requests. So, the cost per ticket is heavily influenced by the
mix of incidents and service requests.
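The sketch below shows the basic calculation and how the ticket mix shifts the result. The unit costs (20 per incident, 40 per service request) and the volumes are made-up figures, not MetricNet benchmarks.

```python
def cost_per_ticket(monthly_operating_expense: float, monthly_ticket_volume: int) -> float:
    """Return the total monthly operating expense divided by the monthly ticket volume."""
    if monthly_ticket_volume <= 0:
        raise ValueError("monthly_ticket_volume must be positive")
    return monthly_operating_expense / monthly_ticket_volume


def blended_cost_per_ticket(
    incidents: int, cost_per_incident: float, requests: int, cost_per_request: float
) -> float:
    """Return the average cost per ticket for a given mix of incidents and service requests."""
    total_cost = incidents * cost_per_incident + requests * cost_per_request
    return cost_per_ticket(total_cost, incidents + requests)


# Hypothetical unit costs: the same 1,000 tickets, with two different mixes.
mostly_incidents = blended_cost_per_ticket(800, 20.0, 200, 40.0)
even_mix = blended_cost_per_ticket(500, 20.0, 500, 40.0)
print(f"80/20 mix: {mostly_incidents:.2f} per ticket")
print(f"50/50 mix: {even_mix:.2f} per ticket")
```

With identical unit costs, moving from an 80/20 mix to a 50/50 mix raises the blended cost per ticket from 24 to 30, which is why the mix should be considered before comparing against a benchmark.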
IT support is considered a cost center in most organizations, and is usually the
first to get budget cuts during a financial downturn. Therefore, IT support must
remain efficient, even when IT spending is reduced. Cost per ticket is a key service desk performance metric that helps IT support analyze its efficiency in handling tickets within a given budget. The goal is always to maintain an optimal level of cost per ticket.
However, it is important to keep in mind that a higher-than-average cost per
ticket may not necessarily be a bad thing, and a lower-than-average cost per
ticket may not always be good, as shown in the following graphs.
The scenario depicted in this graph may mean that the IT service desk team is
compromising on service quality to reduce the cost per ticket, which often results
in lower customer satisfaction levels.
Cost per ticket Vs End user satisfaction
The scenario depicted in the above graph shows where the increase in the cost
per ticket is accompanied by an increase in the customer satisfaction levels.
This may mean that the increasing cost per ticket has led to better service delivery,
justifying the extra cost.
One key factor for optimizing the cost per ticket is to enable quick resolution
of tickets and reduce any unnecessary escalation. Cost per ticket can be kept in
control by following these pointers:
- Analyze service request patterns to plan ahead for purchase of assets and
licenses, reducing the time taken to close service requests.
- Identify peaks and troughs to optimize resource management and
technician work load.
- Properly categorize and prioritize tickets to reduce incorrect ticket
assignments, helping provide quick resolutions.
- Create a robust knowledge base.
8. Software asset utilization rate
Percentage of software products and licenses in actual use by the business.
Maximize ROI (return on investment) on software investments.
With software license purchases taking up a major part of the IT spending, it
is important to track software utilization. Unfortunately, this is one of the least
discussed service desk metrics. For easy management, the software can be categorized as follows:
- Category 1 - Software that needs the most attention (with the highest
business implications, license cost, or compliance risks).
- Category 2 - Software that needs the least attention (free software such as
- Category 3 - Prohibited software and malware.
The following service desk metrics can be used to track software utilization:
Ratio of total used to total owned software
This metric helps identify any software purchase expenditure that does not
provide any value to the organization. Ideally, this ratio should be close to
one, meaning there is maximum utilization of all purchased software, thereby
ensuring a maximum ROI on the software license purchase. A large number of
Category 1 applications in the unused list means that a major portion of the software
asset spending is tied up in idle software.
Ratio of unallocated licenses to total license count
This metric helps analyze the license utilization of a particular software, helping
IT teams plan ahead for license purchases. The ratio should be as small as
possible for maximum ROI. A higher ratio could mean that some of the software
applications are over licensed, which could be an idle investment with no
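Both utilization ratios described in this section reduce to simple divisions, sketched below. The function names and sample figures are hypothetical; the inputs would normally come from a software inventory or asset management scan.

```python
def used_to_owned_ratio(used_titles: int, owned_titles: int) -> float:
    """Return the ratio of software in use to software owned (ideally close to 1)."""
    if owned_titles <= 0:
        raise ValueError("owned_titles must be positive")
    return used_titles / owned_titles


def unallocated_license_ratio(total_licenses: int, allocated_licenses: int) -> float:
    """Return the share of licenses for one product that are not allocated (ideally small)."""
    if total_licenses <= 0:
        raise ValueError("total_licenses must be positive")
    if not 0 <= allocated_licenses <= total_licenses:
        raise ValueError("allocated_licenses must be between 0 and total_licenses")
    return (total_licenses - allocated_licenses) / total_licenses


# Hypothetical figures: 96 of 120 owned titles in use; 425 of 500 licenses allocated.
print(f"Used to owned: {used_to_owned_ratio(96, 120):.2f}")
print(f"Unallocated to total: {unallocated_license_ratio(500, 425):.2f}")
```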
Case study: Increasing software asset utilization saves a million dollars
A leading global pharmaceutical company saved about one million dollars
in spending. The pharmaceutical company, with its services
spread across 50+ countries, was using a diverse range of Microsoft products.
At one particular office, there were thousands of software applications licensed
under a Microsoft volume licensing agreement, but there was no visibility or
control of these software assets, initially. The purchase had been made without
understanding the business requirements.
In fact, the company had limited information
on the software assets and the number and type of assets the organization
actually needed. This, again, put the organization at the risk of over-licensing,
under-licensing, and compliance penalties.
The IT help desk started with a simple analysis by comparing the installed Microsoft software with the
Microsoft licenses they held. The insight gained, and IT's efforts to understand
the business requirements, led to a redesigned Microsoft license purchase that
involved stepping down from the Microsoft Office Professional edition to the
cheaper standard edition, which met the business requirement.
Several other volume licenses were also replaced, and the resulting cost cuts saved the
company about one million dollars in software license purchases.
License compliance rate
Another important software asset management metric that could incur cost to
the organization is the license compliance rate. Maintaining maximum compliance
can save your organization from penalties and fines. The following are a
few tips for achieving maximum compliance:
- Track all software installations and license purchases.
- Allocate licenses to individual software installations to find the over and
- Purchase the right license types for the software. For example, it is better to
purchase a perpetual license for a core software to avoid compliance issues
due to license expiry.
- Conduct formal internal assessments for compliance and audit readiness.
Achieve maximum compliance with a three-step pre-audit
A 100 percent license compliance rate will no longer be a myth with this
simple three-step pre-audit.
Step 1: Gap analysis
- Request a list of all software applications licensed to your organization
from the specific vendor.
- Identify and pin down software that is in use by the business, but not on the
list provided by the vendor.
Step 2: Compliance analysis
Check the total number of software installations vs. the total number of licenses
purchased for every software application to identify over and under-licensed
Step 3: Software license optimization
With all the insight gained from Steps 1 and 2, redesign your software purchases
to optimize compliance and attain a 100 percent license compliance rate.
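The first two steps of the pre-audit lend themselves to a short sketch: Step 1 is a set difference, and Step 2 is a per-application comparison of licenses against installations. The application names, counts, and function names below are placeholders and do not refer to any real vendor report format.

```python
def gap_analysis(vendor_licensed: set[str], in_use: set[str]) -> set[str]:
    """Step 1: return software in use by the business but missing from the vendor's list."""
    return in_use - vendor_licensed


def compliance_analysis(installations: dict[str, int], licenses: dict[str, int]) -> dict[str, int]:
    """Step 2: return licenses minus installations for every application.

    A positive value means over-licensed; a negative value means under-licensed.
    """
    applications = sorted(installations.keys() | licenses.keys())
    return {app: licenses.get(app, 0) - installations.get(app, 0) for app in applications}


# Hypothetical data for one vendor.
vendor_licensed = {"AppA", "AppB", "AppC"}
in_use = {"AppA", "AppB", "AppD"}
installations = {"AppA": 300, "AppB": 40, "AppD": 5}
licenses = {"AppA": 350, "AppB": 25, "AppC": 10}

print("Not on vendor list:", gap_analysis(vendor_licensed, in_use))
for app, balance in compliance_analysis(installations, licenses).items():
    status = "over-licensed" if balance > 0 else "under-licensed" if balance < 0 else "compliant"
    print(f"{app}: {balance:+d} ({status})")
```

The output of both functions is the input to Step 3, where purchases are adjusted to close the gaps.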
These 8 KPIs, with respective metrics, will help you establish a measurement
engine to constantly measure and continuously improve your service desk
performance. The first step in establishing this measurement engine is to understand
the business that the IT help desk is supporting, and align the IT help desk
objectives to the business objectives. The next step is to identify the KPIs and
metrics that are critical to these help desk objectives, and constantly measure
The 8 service desk KPIs discussed here are critical to the three basic IT help desk
objectives of ensuring business continuity, making the organization productive,
and delivering services within budgets and on time, which underlines the fact
that these 8 KPIs are the ones that your IT help desk should care most about.