
Incident postmortem and RCA templates for IT teams


Last updated on: May 15, 2025

A wave of panic hit Zylker's IT team on a busy Monday morning. Incident tickets flooded their support inbox, and phones buzzed endlessly. The company's order management system, the mainstay of their business, had gone down. Customers were locked out of their accounts, and sales ground to a halt. A prolonged outage could damage user trust and lead to lost revenue. Thankfully, Zylker had a well-oiled incident response plan for major incidents. Once the outage was resolved, the IT team decided to perform an incident postmortem to zero in on what went wrong and identify the steps that could be taken to make their IT infrastructure more resilient.

But before we dive in, what is an incident postmortem?

An incident postmortem is a thorough review of an IT incident that documents the timeline of events and the impact of the incident, as well as the steps taken to resolve it and prevent it from recurring. Rather than being an exercise to direct blame, it's a collaborative effort to identify mistakes and optimize future incident responses. The process involves personnel from different departments and carefully documents the timeline of the incident to pinpoint potential problems. The postmortem process also includes critically reviewing the incident response strategy and learning whether it was effective or if there was any scope for improvement.

It is recommended to perform a postmortem shortly after an incident is resolved so that you can capture as much fresh, relevant information and insight as possible from the incident responders and other stakeholders. Ideally, the postmortem should commence no later than a week after the incident.

Incident postmortem template

So here's how Zylker performed its incident postmortem. The IT team at Zylker started by designating John Clarke as the incident postmortem owner. John was part of the incident response team and had first-hand knowledge of the outage. He prepared the following postmortem document for Zylker.

Title

SEV-1 incident: Order management system outage on August 1, 2024 [INC-14819]

Incident start and end date

  • Start: August 1, 2024 at 8:57am GMT
  • End: August 1, 2024 at 9:05am GMT

Summary

A database upgrade was performed on August 1, 2024, at 8:50am on legacy database servers of the order management system. This led to an unforeseen compatibility issue with the application servers and caused the application to be unavailable for all end users.

Sequence of events

  • Aug. 1, 8:50am GMT: The primary database servers that run Microsoft SQL Server 2014 are upgraded to Microsoft SQL Server 2017.
  • Aug. 1, 9:00am GMT: Users report login failures, meaning customers are unable to place orders.
  • Aug. 1, 9:04am GMT: The incident response team identifies the disconnection between the database server and the application servers.
  • Aug. 1, 9:05am GMT: The application server is connected to the secondary database server as part of the disaster recovery plan.
  • Aug. 1, 9:47am GMT: The database upgrade is rolled back and the primary database server is reconnected to the application server.

Impact

Employees and third-party vendors were unable to access the order management system for a duration of five minutes. Customers were unable to place orders between 9:00am GMT and 9:05am GMT on August 1, 2024. This unavailability led to 6,713 failed orders, causing losses amounting to $756,000.
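
For teams that want to quantify impact the same way, the figures above can be derived with a few lines of code. Here's a minimal sketch in Python; the average order value is an illustrative calculation derived from the reported numbers, not a figure from Zylker's report:

  from datetime import datetime, timezone

  # Outage window taken from the incident timeline (GMT)
  outage_start = datetime(2024, 8, 1, 9, 0, tzinfo=timezone.utc)
  outage_end = datetime(2024, 8, 1, 9, 5, tzinfo=timezone.utc)

  failed_orders = 6_713    # failed orders reported in the postmortem
  revenue_lost = 756_000   # USD lost, as reported in the postmortem

  downtime_minutes = (outage_end - outage_start).total_seconds() / 60
  avg_order_value = revenue_lost / failed_orders  # illustrative derived figure

  print(f"Downtime: {downtime_minutes:.0f} minutes")
  print(f"Failed orders: {failed_orders:,}")
  print(f"Estimated average order value: ${avg_order_value:,.2f}")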

Factors contributing to the outage

  • Incomplete configuration analysis due to outdated CMDB
  • Inadequate dependency mapping
  • Absence of comprehensive change testing
  • Non-adherence to the change freeze window
  • Communication lapses between the digital transformation team and the OMS maintenance team

Resolution

The order management system application servers were reconnected to the secondary databases, and the database upgrade was rolled back in the primary database servers.

Lessons learned

  • An up-to-date CMDB with comprehensive documentation and dependency mapping could have prevented this outage.
  • Database administrators should have been included in the change advisory board (CAB) meeting for this change. Moving forward, CI owners and service administrators will review change requests as part of the CAB.
  • Change testing and freeze window recommendations were disregarded. Both need to be enforced, and the creation of change requests during freeze windows should be restricted to emergency changes.

Corrective and/or preventive measures

  • A comprehensive revamp of the CMDB has been initiated, including a migration to ServiceDesk Plus.
  • CI owners and business service owners have been designated as CAB approvers.
  • During peak shopping season, the change freeze window will be enforced.
  • An org-wide infrastructure upgrade project has been initiated and will be completed phase by phase.

Incident response team

  • William Smith - Lead
  • John Clarke
  • Jessica Chan
  • Kumar Gupta
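
Teams that keep postmortems in version control or feed them into reporting sometimes capture the same fields in a machine-readable form. Below is a minimal sketch in Python that mirrors the sections of the template above; the Postmortem class and its field names are illustrative and are not part of ServiceDesk Plus or the downloadable template.

  from dataclasses import dataclass, field
  from datetime import datetime

  @dataclass
  class Postmortem:
      """Minimal record mirroring the sections of the postmortem template."""
      title: str
      severity: str
      incident_id: str
      start: datetime
      end: datetime
      summary: str
      timeline: list[tuple[datetime, str]] = field(default_factory=list)
      impact: str = ""
      contributing_factors: list[str] = field(default_factory=list)
      resolution: str = ""
      lessons_learned: list[str] = field(default_factory=list)
      corrective_actions: list[str] = field(default_factory=list)
      responders: list[str] = field(default_factory=list)

  # Zylker's outage, abbreviated
  zylker_postmortem = Postmortem(
      title="Order management system outage",
      severity="SEV-1",
      incident_id="INC-14819",
      start=datetime(2024, 8, 1, 8, 57),
      end=datetime(2024, 8, 1, 9, 5),
      summary="A database upgrade caused a compatibility issue with the application servers.",
      contributing_factors=["Outdated CMDB", "Inadequate dependency mapping"],
      responders=["William Smith (lead)", "John Clarke", "Jessica Chan", "Kumar Gupta"],
  )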

Want to use Zylker's incident postmortem template? Get your complimentary copy here!

What is root cause analysis?

One of the most crucial steps involved in the incident postmortem is the root cause analysis (RCA).

An RCA is a systematic method to find the underlying reasons for an incident. The team gathers and analyzes all the available information—such as event logs and server configurations—and nails down the root cause. This helps the organization learn from the incident and ensure similar incidents do not occur in the future.

Types of root cause analysis

Here are a few methods Zylker used in their RCA:

Five whys method

The five whys method is a technique for thinking through an incident by repeatedly asking the question "why?" until the root cause is identified, typically after about five iterations.

In Zylker's case, let's see how their IT team identified the root cause:

  • Why were customers unable to place orders? Because the order management system was unavailable.
  • Why was the order management system unavailable? Because the database server was disconnected from the application servers.
  • Why did the disconnection happen? Because the upgraded version of Microsoft SQL Server was incompatible with the application codebase.
  • Why was the database upgraded without adequate analysis? Because the CMDB was not updated to reflect the latest dependency relationships, so the analysis that was performed did not reflect the live environment.
  • Why was the CMDB not updated with the latest dependency relationships? Because communication lapses within the team led to the standard operating procedure for dependency mapping no longer being followed.
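
The chain above can also be captured as a simple ordered list so that it drops straight into the postmortem record. Here's a minimal sketch in Python; the variable names are illustrative and not tied to any ServiceDesk Plus feature:

  # Each entry pairs a "why" question with the answer that prompts the next one.
  five_whys = [
      ("Why were customers unable to place orders?",
       "The order management system was unavailable."),
      ("Why was the order management system unavailable?",
       "The database server was disconnected from the application servers."),
      ("Why did the disconnection happen?",
       "The upgraded Microsoft SQL Server version was incompatible with the application codebase."),
      ("Why was the database upgraded without adequate analysis?",
       "The CMDB did not reflect the latest dependency relationships."),
      ("Why was the CMDB not updated?",
       "Communication lapses meant the dependency-mapping procedure was no longer followed."),
  ]

  # The answer to the last "why" is treated as the root cause.
  root_cause = five_whys[-1][1]
  print(f"Root cause: {root_cause}")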

Fishbone diagram

Sometimes known as an Ishikawa diagram or a cause-and-effect diagram, this visual tool identifies potential contributing factors, grouped into categories such as people, processes, technology, and environment.

The incident response team would create a fishbone diagram with the central problem being the "Order management system outage." Each branch would then represent a potential root cause category:

  • People: Communication lapses between the digital transformation team and the OMS maintenance team
  • Process: Non-adherence to the change freeze window
  • Technology: Incomplete configuration analysis due to outdated CMDB
  • Environment: Inadequate dependency mapping
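
The same categorization can be kept as a simple mapping for later reporting or trend analysis. Here's a minimal sketch in Python; the structure is illustrative, not a prescribed format:

  # Fishbone categories mapped to the contributing causes identified above
  fishbone = {
      "People": ["Communication lapses between the digital transformation team and the OMS maintenance team"],
      "Process": ["Non-adherence to the change freeze window"],
      "Technology": ["Incomplete configuration analysis due to outdated CMDB"],
      "Environment": ["Inadequate dependency mapping"],
  }

  for category, causes in fishbone.items():
      print(f"{category}:")
      for cause in causes:
          print(f"  - {cause}")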

Kepner-Tregoe (KT) method

This structured problem-solving approach uses a series of questions to define the problem, analyze potential causes, and develop corrective actions.

Here's how the KT model was applied to find the root cause:

  • Define the problem: The order management system suffered an outage, leading to a significant loss of revenue.
  • Analyze potential causes: Based on the information gathered in the postmortem and brainstorming from the fishbone diagram, the team identifies potential causes like communication lapses and incomplete configuration analysis.
  • Develop and test solutions: The team then evaluates and prioritizes solutions for each potential cause. This could involve implementing a real-time CMDB with granular dependency mapping, updating CAB policies, revising change management workflows, and instituting change freeze windows.

By employing these three RCA methods, Zylker identified the root cause of the outage. Remember that an RCA is a crucial step in the incident postmortem and directs the problem management efforts that follow in cases of repeated incidents or major outages.

Final thoughts

Organizations that do not actively engage in comprehensive incident postmortems deprive their IT teams of a valuable opportunity to retrace their steps during incident response and identify insights that could make or break their incident and problem management strategy. Even among teams that do perform incident postmortems, the exercise can quickly descend into a blame game of pointing fingers. For an incident postmortem and RCA to be truly valuable, IT teams must run both exercises objectively—without directing blame toward anyone—and leverage the incident experience as knowledge that can help infuse resilience into their IT services.

Saket Pasumarthy

Author's bio

Saket Pasumarthy, a product expert at ManageEngine ServiceDesk Plus, is an ITSM enthusiast who is fascinated by the latest advancements in the IT space. Saket writes articles and blogs that help IT service management teams around the globe handle service management challenges. He also presents user education sessions in the ServiceDesk Plus Masterclass series. Saket spends his free time playing football and flying planes on a flight simulator.