How to communicate effectively during an IT incident

There's a stark difference between what you see during an IT incident and what your customers experience. For you, it might start with an alert from the monitoring tool and end with an email about the fix. But for your customers, it could start with panic and end in a loss of trust. That's why you need a solid incident communication plan to ensure incidents are handled smoothly.

However, you shouldn't fall into the trap of thinking incident communication is just sending a well-written email to affected users. It's about ensuring your incident response mechanism runs smoothly and helps both the employees who handle the incident and the users on the receiving end.

Over a couple of decades, we at ManageEngine have internalized this learning and created a framework for communicating effectively during an incident.

Choosing the right category

We've covered our incident management process at length in our incident management handbook, and we believe communication starts in the analysis phase itself. We've announced how to contact us to report incidents, both publicly and within the organization, and we ensure every employee learns this during their training. Once we're aware of an incident, we first categorize it based on its type and impact.

An approach that works well for us is to categorize incidents and respond to them based on NIST's Cybersecurity Framework. Once we categorize the incident, we use a communication template suitable for that category.

Creating templates for better communication

A common reason why companies mishandle incident communication is a lack of clarity. A developer might not know whom to ping for a particular type of incident. A data center operator or a network administrator might be unsure whether to use a group chat or send an email while handling a particular issue. A coordinator might miss a few details in the email that they need to send to end users. Having templates for each category eliminates these uncertainties.

It's also important to provide a clear distinction between internal (between teams) and external (to users outside the organization) communication.

Here's what one of our templates typically covers:

	Availability incidents	Security incidents	Privacy incidents
What?	External: Expected downtime, works in progress and alternate measures for users, and the root cause of downtime	External: Details on the bug or attack, the root cause of the incident, the severity and the extent of impact end users can expect, immediate measures taken by our teams, any required action to be taken by users, and other information relevant to end users	External: The nature of the data involved, any privacy laws violated, details we require from users to determine the severity and impact of the incidents, the reason for the incident, and an apology if caused by us
What?	Internal: Details of the affected services (from the monitoring tool), the root cause of the incident, and preventive measures to be taken	Internal: Teams responsible for the incident, details of the incident, the possible impact, immediate action to be taken by the teams, the root cause of the incident, and preventive measures to be taken	Internal: Teams responsible for the data involved, steps to be taken by the internal teams to assist the Privacy team, the root cause of the incident, and preventive measures to be taken
How?	External: Email and in-app notifications	External: Email and related forums	External: Email
How?	Internal: Zoho Connect groups and Cliq channels	Internal: Zoho Connect groups and Cliq channels	Internal: Cliq channels
Who?	External: Organization admins	External: Organization admins, and security officers if required	External: Organization admins, and data protection officers if required
Who?	Internal: Team managers, appropriate team members, the Security team, and the head of IT	Internal: The Security team, the team manager, and members of appropriate teams	Internal: The Privacy team, the Security team, appropriate team members, and the data protection officer
When?	External: Updates immediately after acknowledging the incident, when the incident is resolved, and if the time taken to resolve increases unexpectedly	External: Updates immediately after acknowledging the incident, when the incident is resolved, and after the incident is resolved	External: Updates immediately after acknowledging the incident, when the nature and severity of the incident are determined, and if further action needs to be taken to minimize the impact
When?	Internal: Communication immediately after acknowledging the incident, during the course of fixing the incident, immediately after the downtime is resolved, and after the incident is resolved	Internal: Communication immediately after acknowledging the incident, while implementing solutions, and after the incident is resolved	Internal: Communication immediately after acknowledging the incident and after the incident is resolved
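
If you want to keep a template like this in a machine-readable form, one option is a small lookup table keyed by category and audience. The Python sketch below is an illustration only, not a description of how any ManageEngine tool stores templates. It encodes just the "How?" and "Who?" rows from the table above; the names Category, Audience, TemplateEntry, TEMPLATES, and lookup are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Category(Enum):
    AVAILABILITY = "availability"
    SECURITY = "security"
    PRIVACY = "privacy"


class Audience(Enum):
    EXTERNAL = "external"
    INTERNAL = "internal"


@dataclass(frozen=True)
class TemplateEntry:
    channels: Tuple[str, ...]    # the "How?" row
    recipients: Tuple[str, ...]  # the "Who?" row


# Values are copied from the "How?" and "Who?" rows of the table above.
TEMPLATES: Dict[Tuple[Category, Audience], TemplateEntry] = {
    (Category.AVAILABILITY, Audience.EXTERNAL): TemplateEntry(
        channels=("email", "in-app notifications"),
        recipients=("organization admins",),
    ),
    (Category.AVAILABILITY, Audience.INTERNAL): TemplateEntry(
        channels=("Zoho Connect groups", "Cliq channels"),
        recipients=("team managers", "appropriate team members",
                    "Security team", "head of IT"),
    ),
    (Category.SECURITY, Audience.EXTERNAL): TemplateEntry(
        channels=("email", "related forums"),
        recipients=("organization admins", "security officers (if required)"),
    ),
    (Category.SECURITY, Audience.INTERNAL): TemplateEntry(
        channels=("Zoho Connect groups", "Cliq channels"),
        recipients=("Security team", "team manager",
                    "members of appropriate teams"),
    ),
    (Category.PRIVACY, Audience.EXTERNAL): TemplateEntry(
        channels=("email",),
        recipients=("organization admins",
                    "data protection officers (if required)"),
    ),
    (Category.PRIVACY, Audience.INTERNAL): TemplateEntry(
        channels=("Cliq channels",),
        recipients=("Privacy team", "Security team",
                    "appropriate team members", "data protection officer"),
    ),
}


def lookup(category: Category, audience: Audience) -> TemplateEntry:
    """Return the channels and recipients for one cell of the template."""
    return TEMPLATES[(category, audience)]


if __name__ == "__main__":
    entry = lookup(Category.SECURITY, Audience.EXTERNAL)
    print("Channels:", ", ".join(entry.channels))
    print("Recipients:", ", ".join(entry.recipients))
```

Keying the table by (category, audience) keeps the internal and external sides of each template separate, so a responder looking up one cannot accidentally pick the channels meant for the other.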

Let's say our monitoring tool, Site24x7, alerted us about downtime in one of our ManageEngine products, AssetExplorer. These alerts are automatically posted to our incident management tool. A post is then created on Zoho Connect, our team collaboration platform. The incident management process would be as follows:

1. Internally, we tag the managers (Who?) of the concerned team in the post (How?) and give them details about the downtime. We determine that we can keep the downtime below 30 minutes, so we categorize the incident as medium impact.
2. Incident coordinators (Who?) on the team check for possible causes in application logs and the monitoring tool. The team works on applying an immediate fix. Simultaneously (When?), we email (How?) the concerned users about the downtime and the reason for it (What?).
3. While working on the fix, the team involves incident coordinators and any other experts needed (Who?) by creating a separate Cliq group for the incident (How?). The team applies a fix and resolves the issue. The details of the fix (What?) are communicated to the incident team.
4. We immediately (When?) inform users about the fix via email (How?).
5. Once the issue is settled, we create a root cause analysis (RCA) document. The RCA template (What?) is filled out for the incident and emailed (How?) to users for better clarity.
6. We implement preventive measures and update the details (What?) in the incident management tool (How?). We email (How?) users about the preventive measures taken.
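
The walkthrough above can also be written down as an ordered communication plan, which makes it easy to check who hears what and in which order. The Python sketch below is a minimal illustration under stated assumptions: the Message class, the AVAILABILITY_PLAN list, and the messages_for helper are hypothetical names, and the entries simply restate the six steps of the AssetExplorer example. Where the walkthrough does not name a recipient, the field is left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Message:
    step: int           # position in the walkthrough (the "When?" order)
    audience: str       # "internal" or "external"
    how: str            # channel used
    what: str           # content communicated
    who: Optional[str]  # recipients; None where the walkthrough names none


# Each entry restates one communication from the walkthrough above.
AVAILABILITY_PLAN: List[Message] = [
    Message(1, "internal", "Zoho Connect post",
            "details about the downtime", "managers of the concerned team"),
    Message(2, "external", "email",
            "the downtime and the reason for it", "concerned users"),
    Message(3, "internal", "separate Cliq group",
            "details of the fix", "incident coordinators and other experts"),
    Message(4, "external", "email", "the fix", "users"),
    Message(5, "external", "email",
            "root cause analysis (RCA) document", "users"),
    Message(6, "internal", "incident management tool",
            "preventive measures", None),
    Message(6, "external", "email",
            "preventive measures taken", "users"),
]


def messages_for(audience: str) -> List[Message]:
    """Return the plan entries for one audience, in walkthrough order."""
    return [m for m in AVAILABILITY_PLAN if m.audience == audience]


if __name__ == "__main__":
    for m in messages_for("external"):
        print(f"Step {m.step}: {m.how} -> {m.who}: {m.what}")
```

Filtering by audience gives each side its own checklist: in this example, the external list shows that users receive four emails over the life of the incident.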

For an incident with longer downtime, we might need to communicate more frequently with users. For a security incident like a vulnerability being exploited, we may need more internal communication through chat groups and meetings. Depending on your organization, you could create more categories and templates to suit your needs.

Implementing a template-based framework

While templates work, they're only as good as their execution. To execute them well, you need a certain organizational discipline and structure that can assist your employees. Here's what we do to ensure we execute templates well:

1. Establish the templates in a central repository:

We have a portal for governance, risk, and control (GRC) that the entire organization can access. Further, we use Zoho Connect to let everyone know about updates in the GRC portal. You should choose a portal that functions well as a central repository of communication templates for your organization.

2. Appoint an incident coordinator to every team:

An incident coordinator is necessary for more than just communication, but they make communication much easier because they can take responsibility for all communication within that team.

3. Create awareness:

We include these templates in our employees’ security and privacy training to spread awareness about them.

4. Test and improve:

Ultimately, testing and improving is the only way to know whether our templates actually work. After every incident, our incident team reflects on how the templates performed and how we can improve them before the next incident.

We touched upon the communication part of incident management in this article. Good communication will bear more fruit when you have a solid incident management framework to guide you throughout the process. If you're looking for such a framework, check out our incident management handbook.

About the author

Shivaram P R, Content writer