Our enterprise IT change management framework

This chapter deals with how we implement each type of change through a set of processes. We'll share our overall methodology first, followed by specific frameworks for scenarios involving each of these types.

Our overall change management methodology

Each type of change goes through a set of processes based on factors specific to that change. We arrived at those processes by first applying our overall methodology, which we look at here. The following sections then discuss the processes themselves and how we implement different types of changes using our framework.

We manage changes through six phases: initiation, planning, approval, implementation, review, and closure.

Figure: ManageEngine’s change management approach

1) Initiation

The initiation phase involves officially kick-starting a change through the change management module in our service management solution. The employees who wish to request a change raise it in a change control form.

This phase focuses on collecting accurate data about the change that helps during further processes. This information might include:

  • The location of the required change, like our data centers, development centers, or hub offices.
  • The nature of the change, like whether it is one regarding infrastructure components; configurations; network changes involving firewalls, switches, or routers; or hardware changes.
  • The proposed duration of the change.
  • A backup plan.
  • The person responsible for implementation of the change.
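
The information collected at initiation can be thought of as a simple record. The sketch below is only an illustration: the class name, field names, and the blank-field check are hypothetical and do not reflect the actual change control form in our tool.

```python
from dataclasses import dataclass
from datetime import timedelta


# Hypothetical structure: the fields mirror the list above, not a real form schema.
@dataclass
class ChangeRequest:
    location: str                 # e.g. "data center", "development center", "hub office"
    nature: str                   # e.g. "infrastructure", "configuration", "network", "hardware"
    proposed_duration: timedelta  # how long the change is expected to take
    backup_plan: str              # what to fall back on if the change fails
    implementer: str              # person responsible for implementation

    def missing_fields(self) -> list[str]:
        """Return the names of text fields that were left blank."""
        text_fields = {
            "location": self.location,
            "nature": self.nature,
            "backup_plan": self.backup_plan,
            "implementer": self.implementer,
        }
        return [name for name, value in text_fields.items() if not value.strip()]


request = ChangeRequest(
    location="data center",
    nature="network",
    proposed_duration=timedelta(hours=2),
    backup_plan="",
    implementer="network operations engineer",
)
print(request.missing_fields())  # ['backup_plan']
```

A check like this is one way to make sure a request does not move to planning with incomplete data.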

2) Planning

The planning phase is about analyzing the feasibility of the change. The stakeholders of this change request perform high-level impact and risk analysis.

They come up with a rollout plan, a series of steps to implement the change. They also clarify who is involved in implementing the change, which documents they can refer to, and which process they must follow. At this point, a rollback plan is also created; this plan can undo the change if it doesn't meet the standards we set. This phase helps the change manager (we'll see more about change roles later) decide which process suits the change we're dealing with.

3) Approval

The approval phase is where the designated manager accepts or rejects the change request. Our tool sends an automated email notification to the manager, through which they can view the details of the change request. If they deny the change request, they must provide a reason for doing so. How we designate managers for various types of changes is driven by our policies.

Based on the expected duration of the change, a senior engineer of the team that carries out the implementation assigns appropriate personnel to do the job. We treat each change request as a ticket, and we allocate the tickets into various levels. For example, we expect a Level 1 change ticket to be completed within four hours of approval. A Level 5 ticket might take weeks to close. Senior engineers assign change tickets to levels based on the scope of work involved in executing the change request and the level of impact it would have on the regular functioning of the environment (which could be a network, data center, or server, depending on the change).
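
As a rough illustration, ticket levels can be mapped to completion targets and checked against the approval time. In this sketch, only the four-hour Level 1 target comes from the text; the targets for Levels 2 to 5 and the function names are placeholder assumptions that you would replace with your own SLAs.

```python
from datetime import datetime, timedelta

# Only Level 1 (four hours) is stated in the text; Level 5 "might take weeks".
# All other values are placeholder assumptions.
SLA_TARGETS = {
    1: timedelta(hours=4),
    2: timedelta(days=1),    # assumed
    3: timedelta(days=3),    # assumed
    4: timedelta(weeks=1),   # assumed
    5: timedelta(weeks=3),   # assumed
}


def sla_deadline(level: int, approved_at: datetime) -> datetime:
    """Return the time by which a ticket of this level should be closed."""
    return approved_at + SLA_TARGETS[level]


def is_breached(level: int, approved_at: datetime, closed_at: datetime) -> bool:
    """Return True if the ticket was closed after its SLA deadline."""
    return closed_at > sla_deadline(level, approved_at)


approved = datetime(2024, 1, 8, 9, 0)
print(sla_deadline(1, approved))                                # 2024-01-08 13:00:00
print(is_breached(1, approved, datetime(2024, 1, 8, 14, 30)))   # True
```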

As an IT leader, you can choose these levels based on the SLAs your organization wants to have.

4) Implementation

The implementation phase puts the plan created earlier into action. Once the change is approved, the responsible people we identified earlier follow the series of steps and ensure they complete the change process.

If the change process involves downtime of some of our services, we send a notification in advance to the users via a pop-up, banner, or social media.

After implementing the change, we monitor it using our in-house monitoring tools for a certain duration, depending on the level of change.

5) Review

The review phase kicks off after the monitoring in order to see if the change is successful. The reviewers use parameters to see how well we've implemented the change and how we can improve it. They record their comments as part of the change so the feedback helps improve future changes.

How we execute this methodology

There are three main components needed to execute this methodology: resources, responsibilities, and review.

1) Resources

We ensure the change managers have sufficient technological resources, personnel, and knowledge at their disposal. Resources also vary significantly with the type of change we're dealing with. We offer a suite of tools to external organizations, and we use those same tools to manage our own changes.

To execute the entire change management process, we require a set of tools that we will discuss when we talk about normal changes. Here are some of the tools that help us execute the majority of changes in our IT:

Requirement based on changes: Manage and track change requests as tickets till resolution.
Resources we use: ServiceDesk Plus
Use case: An employee raises a request to change firewall access to a group of applications. After raising the request, the employee's manager must approve it. Then the network operations manager must give their approval, after which a member of the IT services team will grant access after verifying the need for the change, which is recorded along with the request, and analyzing the impact. All these processes happen inside the tool.

Requirement based on changes: Monitor server and application availability.
Resources we use: OpManager, Monitoring Instrumentation Tool
Use case: A change request for migrating applications from one server to another needs sufficient data for execution. This tool helps us analyze the server’s capacity, user load, and speed.

Requirement based on changes: Manage inventory, load balancer operations, SSL operations, and the process of switching from one data center to another for disaster recovery.
Resources we use: Zoho Admin Console, OpManager
Use case: A request is submitted for configuration changes in a load balancer. These tools offer visibility into the load balancer and enable us to apply the configuration changes and to mirror the new configuration across all servers.

Requirement based on changes: Manage device credentials.
Resources we use: Password Manager Pro
Use case: A request is made to change an employee’s access to a server. We use this tool to revoke the current password for that employee in the secure vault and provide alerts to the IT team based on access to the password.

Requirement based on changes: Monitor service availability from different global locations.
Resources we use: Site24x7
Use case: Someone submits a request to modify a particular DNS record on our website. We use this tool to monitor the respective site, verify DNS correctness, and perform lookups from multiple locations before modifying the record.

Requirement based on changes: Create and manage forms to track specific requests.
Resources we use: Zoho Creator
Use case: A request to change the location of a certain set of files is submitted. It requires the approval of multiple teams. We create a form and circulate it to those teams to ensure they all approve it.

Requirement based on changes: Maintain all documents, including policies, procedures, and vendor contracts.
Resources we use: Zoho Wiki
Use case: A new member of the IT services team must implement a standard change of upgrading a machine’s RAM. But, unsure of the process, they must refer to supporting documents like procedures and policies.

Requirement based on changes: Monitor system and firewall logs, track network traffic, and receive alerts based on the configured threshold values.
Resources we use: EventLog Analyzer, FortiAnalyzer
Use case: A change request to modify firewall rules is raised in response to malicious inbound traffic. This tool analyzes perimeter devices, including routers and switches, to provide insights on how and why the change must happen.

Requirement based on changes: Analyze and monitor switch ports and interface traffic.
Resources we use: NetFlow Analyzer
Use case: A request to change the status of a set of applications to “non-critical” is raised. This tool helps analyze network traffic to see how much bandwidth these applications occupy and how they're useful.

These tools are but a few of the many we use. The wide breadth of tools we require stems from the size of our organization and the complexity of our changes. The main takeaway here is the requirements that drive the use of these tools. As an IT leader, your organization might have these requirements too, and it's up to you to list them out and decide what resources you can have in place to satisfy those requirements.

2) Responsibilities

Throughout a change, we provide clarity and ensure those involved in the change are aware of their responsibilities. Each person or team involved plays a role, and we lay it out for them through policies and guidelines.

Among the many roles, a significant one is that of the change advisory board (CAB). It is a group of people whom the change manager trusts as guides in the decision-making process. The CAB evaluates the change once the change manager approves it and an implementation plan is in sight. It provides crucial recommendations and oversees the implementation. Let’s consider a major change that affects the entire organization, like shifting a division of our IT operations from one office to another.

Here's what our CAB looked like when we implemented that change in 2015:

Figure: Change advisory board members in ManageEngine for a major change that affects the entire organization

As you can see, this is a mixture of experienced individuals and experts from various levels of the organization. For a change like migrating a service to a new data center, the CAB would be different and include more members from the IT team.

  • The CAB must have a good mix of various job roles and teams.
  • A large CAB could slow down the process. We choose CAB members on a minimalist basis.
  • We appoint experienced individuals from various facets of the business. This may include C-level executives, managers, technical teams, finance teams, and more, depending on the nature, severity, and scale of the change. Their experience in helping us spot hidden challenges cannot be overstated.
  • We document and communicate the role of CAB members clearly to them. They are responsible for assessing the change from the initial stage to the review stage, providing their perspectives, recommendations, and course corrections.
  • Their contributions will be crucial while creating risk mitigation plans, monitoring the change for red flags, and suggesting improvements by reviewing the changes.

We also ensure change managers and implementers clearly understand the roles of the CAB. Not all changes need to go through a CAB, and neither is the CAB a rigid establishment. That's why the change manager takes charge of the change and decides how the CAB members must get involved.

A CAB is just one crucial role in our change management model. Apart from the CAB, we have four critical roles with their own sets of responsibilities:

4 critical change management roles at ManageEngine

Change initiator

They create and submit a request for change with a good reason attached to it. They are also responsible for planning, defining roles, and helping out with implementation.

Change manager

They drive the change from start to completion and make most of the crucial decisions. Our senior engineers and project managers usually take up this role.

Change owner

They are held accountable for the entire process. Our top officials with experience and expertise, like the vice president or the head of IT operations, may take up this role.

The review team

It provides a neutral opinion. Members of this team have expertise and could be part of the CAB as well. The team’s role is to reflect on how the change happened, what our performance metrics say, etc.

3) Review

Review has helped us shape our framework over the years. While being an integral part of our change management process itself, it is also a pillar of learning and improvement. Review goes beyond just reviewing a change on a case-by-case basis. IT leaders review how we perform changes overall and where we're headed, using a set of metrics and parameters.

  • Percent of successful changes: During change closure, the change manager records how well the change went and whether it was successful or not. The review team determines why the change was a success or failure based on how well the objective was achieved and if resources were wasted. We aim to improve this percentage every quarter.
  • Patterns of unsuccessful changes that were rolled back: Be it a simple network change or a complex data center migration, a rollback indicates we didn't do something right in the planning stage. It is normal to roll back changes, but we ensure we understand the reason and notice patterns in these changes.
  • Proportion of change requests in a particular division: Too many change requests in firewalls, switches, or routers could indicate a need to revisit this infrastructure. Too many changes in the data center could indicate more risk around the corner, or it could just be due to increased customers, new employees, or a host of other reasons. Either way, it is worth looking out for this parameter to be on the safe side.
  • Frequency of incidents caused by change: Obviously, if a change causes incidents, it needs a revised process or more checklists. Further, we see if a particular area of change (like in networks or devices) frequently causes minor or major incidents.
  • Percent increase in emergency changes: A high percent increase is a huge red flag, especially if the emergency changes are unforced. A DDoS attack or a disaster may not be in our control, but a normal change gone wrong is certainly something we want to avoid. We investigate such emergency changes and record the learnings.
  • Cost incurred on changes: This can be a bit difficult to calculate at times due to the sheer breadth of our operations. However, the objective here is to see what proportion of our resources (person hours, tools, and direct costs) we spend on changes. It need not be an exact amount, but we must know if it is increasing every month or even every quarter.
  • Time taken to complete various levels of changes: We have our SLAs set for change requests, and we ensure we maintain SLAs at all levels.
  • Approval rate: This is the percent of changes approved by the CAB. A high percentage means our planning is good.
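
Several of the metrics above are simple ratios over change records. The following is a minimal sketch, assuming a hypothetical `ChangeRecord` structure and assuming that success, rollback, and incident rates are measured against approved changes only; neither the structure nor that choice of denominator comes from our tooling.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical record; field names are assumptions for illustration.
@dataclass
class ChangeRecord:
    kind: str                      # "standard", "normal", "major", or "emergency"
    approved: bool
    successful: bool = False
    rolled_back: bool = False
    caused_incident: bool = False


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place; 0.0 when there is no data."""
    return round(100.0 * part / whole, 1) if whole else 0.0


def review_metrics(records: list[ChangeRecord]) -> dict[str, float]:
    approved = [r for r in records if r.approved]
    return {
        "approval_rate": percent(len(approved), len(records)),
        "success_rate": percent(sum(r.successful for r in approved), len(approved)),
        "rollback_rate": percent(sum(r.rolled_back for r in approved), len(approved)),
        "incident_rate": percent(sum(r.caused_incident for r in approved), len(approved)),
    }


def percent_increase(previous: int, current: int) -> Optional[float]:
    """Percent increase between two periods, e.g. in emergency change counts."""
    if previous == 0:
        return None  # undefined when there is no baseline
    return round(100.0 * (current - previous) / previous, 1)


quarter = [
    ChangeRecord("normal", approved=True, successful=True),
    ChangeRecord("normal", approved=True, successful=True),
    ChangeRecord("normal", approved=True, rolled_back=True, caused_incident=True),
    ChangeRecord("emergency", approved=True, successful=True),
    ChangeRecord("normal", approved=False),
]
metrics = review_metrics(quarter)
print(metrics)
print(percent_increase(previous=4, current=5))  # 25.0
```

Tracking the same figures every quarter matters more than the exact definitions, as long as the definitions stay consistent between periods.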

The review team evaluates each change, updates the knowledge management database, and documents the learnings. The IT leaders review the overall progress of our change management journey and make decisions to improve our processes.

Scenario 1: A change escalation

One of the most interesting perspectives of our leaders is a counterintuitive one. Here's how some of our wise people put it:

Successful change management is mostly about being aware of what not to do and not doing it.

Here's a good example of what not to do and the type of escalation one should avoid:

Let's start with a standard change of updating security patches in a server. Since it's a no-brainer that it's critical, you might have categorized it as a standard change, checked for the pre-approved status and the documentation, and gone on to update the server. However, some of your critical applications might be running on that server, and they might only be compatible with older versions. Now those applications are not functioning properly, and you must move them to a backup server. This will be a major change. You'll now have to involve a CAB and follow a different process. Let's say this backup server does not have a disaster-recovery mechanism, and your critical application data is lost. This is an emergency change as it affects many users, and your C-level executives are now responsible for cleaning up this mess.

What should have been a standard change has now moved on to become an emergency change. This is exactly what our IT leaders try to avoid. Here is the road you should avoid too:

Figure: A change management roadmap to be avoided

The methods we discuss in the following chapters will help you avoid this road by using frameworks for each type of change.

Scenario 2: Facilitating normal changes

As we adopt our general approach to normal changes, we take them through a comprehensive framework supported with processes and resources during each stage. We treat changes like an OS upgrade or a data center migration for a service as normal changes.

Let’s look at a detailed breakdown of our framework.

Framework for normal changes

Phase: Initiation

Process breakdown:

  • The change initiator creates a request for the change.
  • The change owner (an IT leader) assigns a change manager.
  • The change requester identifies and records the reason for the change.
  • The change requester determines the tangible and intangible benefits of the change and records them.
  • The change manager confirms the type of change (could vary for an emergency change).
  • The change manager assigns suitable roles to other technicians: approvers, reviewers, and implementers.
  • The change manager assigns workflows and the priorities of the change.

Resources required:

  • A single platform for managing incidents, problems, assets, and changes

Phase: Planning

Process breakdown:

  • The change owner analyzes the change. The reason for the change could be linked to other problems or incidents. They link these items to the change for better clarity.
  • The change owner figures out the possible business impact the change could bring, including the number of users affected.
  • The change owner documents the rollout plan, a series of steps the implementers need to follow to execute the change.
  • The change owner documents the back-out plan, a series of steps the implementers need to follow to undo the change if it turns out to be unsuccessful.
  • The change owner creates and attaches a checklist to the plan in order to help the implementers exercise reasonable precautions.
  • The change owner records the downtime that the change could bring, if any.
  • The change manager reviews the plan and approves it if they find it satisfactory.

Resources required:

  • A CMDB to support the decision-making process. It is the repository of all the resources and assets, and the relationships between them. It is also a tool that helps the change owner assess the impact of the change beyond the surface level. It lays out all entities, including external parties that participate in the change.
  • A unified desktop management or asset management solution. This helps the change owner get enough visibility about the assets needed to execute the change, and the ones that could be impacted by the change.

Phase: Approval

Process breakdown:

  • The change manager seeks the CAB’s recommendation. If needed, they can add a CAB member depending on the nature of the change. Once the CAB provides its recommendations and comments, the change manager implements them if needed.

Resources required:

  • A change management view that gives the change manager options to add CAB members and view their comments

Phase: Implementation

Process breakdown:

  • The change implementer creates a project and then lists the tasks to complete that project. They assign roles to technicians and supply them with the necessary knowledge resources. If needed, the change implementer breaks the project down into smaller fragments or vice versa.

Resources required:

  • A project management solution that can link tasks with the people who must complete them
  • We already discussed the resources that help us implement changes. Depending on the nature of the change (could be a network change or a change in our data centers or devices), we use one or more of those resources.

Phase: Review

Process breakdown:

  • The review team analyzes the change using suitable metrics and provides comments. It documents its observations, and schedules another review if need be.

Resources required:

  • Problem management tools to investigate the change. They help in calling out areas where the process could be improved.

The framework in action

We often make normal changes as part of technological advancements, like updating all Windows servers to the latest version. Let’s take this example and see how our framework helps execute this change.

In this case, the need for such a change is straightforward: the servers must receive appropriate security updates released by Microsoft. For the majority of changes, the aspect our leaders emphasize most is the CMDB, which the change owner focuses on during the planning phase. Our CMDB will tell us we have four servers in our data center with the 2012 version of the OS, and we want to bring them up to the 2020 version. During this phase, the change owner will consider two questions before we proceed:

  • Do we know the impact of this change?
  • Do we have enough assets in place to execute this change?

For the first question, the change owner will ensure they have sufficient evidence to back up their answer. It could be the learnings from our previous version updates registered in the CMDB, or their own analysis of how this occurs elsewhere. For the second question, we have our in-house desktop management solution in place to give the change owner enough visibility into our infrastructure.

Now we need a plan. This plan will be a natural result of the two questions above because we'll know what the right resources are and how we would execute the change by dividing it into granular levels. Here, the CMDB helps us understand the impact of such execution. Analysis of the CMDB, along with desktop management solutions, will give us the following:

  • Risks involved in the change
  • Impact of each risk
  • Risk mitigation measures in place—both technical and procedural measures for various risks
  • Entities (servers) and other related assets involved in the change
  • Other stakeholders who could be affected by this change

Once the plan is set, our CAB members will have a bird’s-eye view of this change from the change management tool's dashboard. For this change, our CAB would consist of the IT operations head of ManageEngine, the senior engineer in charge of implementation, and the vice president of security and compliance. There's also a provision to add CAB members if we need more perspectives. CAB members give their recommendations, which could be to modify the time frame of execution, use additional tools for monitoring, or even re-schedule the change if the business is impacted significantly.

For implementation, we have project management tools to help us. The implementers create a project, add each server as a milestone, and split the implementation into tasks. The tool also supports them with the documented procedures for carrying out each task. Each task has a dependency and an accountable person. When the change owner looks at the resource management section, they can tell right away whether any resource is overused. For example, if a single resource has 40 tasks, nothing will go as planned, but if four people have 10 tasks each, it'll be smooth. We map resources to assets and move the resources around if needed. When we do this, the change won't be rushed; it'll be measured and implemented well.
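
The workload check described above can be sketched in a few lines. This is an illustration only: the limit of 10 tasks per person is taken from the example in the text and treated as an assumed threshold, and the function names and round-robin redistribution are hypothetical, not features of our project management tool.

```python
from collections import Counter

# Assumed threshold, borrowed from the "four people, 10 tasks each" example.
MAX_TASKS_PER_PERSON = 10


def overloaded(assignments: dict[str, str], limit: int = MAX_TASKS_PER_PERSON) -> dict[str, int]:
    """assignments maps task name -> assignee; return assignees above the limit."""
    load = Counter(assignments.values())
    return {person: count for person, count in load.items() if count > limit}


def rebalance(tasks: list[str], people: list[str]) -> dict[str, str]:
    """Spread tasks evenly across people in round-robin order."""
    return {task: people[i % len(people)] for i, task in enumerate(tasks)}


tasks = [f"task-{n}" for n in range(1, 41)]

single = {task: "engineer-a" for task in tasks}
print(overloaded(single))  # {'engineer-a': 40}

spread = rebalance(tasks, ["engineer-a", "engineer-b", "engineer-c", "engineer-d"])
print(overloaded(spread))  # {}
```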

Once implemented, we review the change using problem management tools. We investigate how well the teams performed and sort out the problems that arose during the change.

Quick summary

  • The CMDB, integrated with your change management module, gives you enough information to spot and tackle hidden challenges. Use the information in the CMDB to thoroughly analyze all related assets, their history, and their functioning.
  • Use the expertise of CAB members to include technological and procedural checks during the planning and implementation stages.

Other tips to nail normal changes:

  • Visibility: Use desktop management solutions for high visibility.
  • Clarity: Be clear about the process and the implementation. Establish who is responsible, which tools they use at which step, which document they should refer to, where that document is stored, etc.
  • Granularity: Manage your implementation by creating projects, fragmenting them into multiple tasks, and distributing them well.

Scenario 3: Perfecting the execution of standard changes

Standard changes are similar to normal changes, but they don't need approval. We have to deploy a Microsoft patch or a service pack update for our services regularly, and they don't need approval every time. That's why we pre-approve them and bring them under the category of standard changes. These changes, even if unsuccessful, don't have the potential to create an immediate business impact on a large scale.

Standard changes follow the same framework as normal changes but skip the approval part. We initiate the change, plan it, implement it, and review it.
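
The difference between the two paths can be expressed as a phase list with one phase removed. This is a minimal sketch; the function name is hypothetical, and it deliberately covers only the normal and standard types described here.

```python
# Phases detailed in the overall methodology.
PHASES = ["initiation", "planning", "approval", "implementation", "review"]


def phases_for(change_type: str) -> list[str]:
    """Return the phases a change goes through; standard changes are pre-approved."""
    if change_type == "normal":
        return list(PHASES)
    if change_type == "standard":
        return [phase for phase in PHASES if phase != "approval"]
    raise ValueError(f"unsupported change type: {change_type}")


print(phases_for("normal"))    # ['initiation', 'planning', 'approval', 'implementation', 'review']
print(phases_for("standard"))  # ['initiation', 'planning', 'implementation', 'review']
```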

There are hundreds of such standard changes that are practically invisible. However, we often notice that we need to implement some of these changes with a slightly modified approach. To give ourselves more clarity on standard changes, we categorize standard changes further into two sub-categories:

  • Standard documented changes
  • Standard maintenance changes

Standard documented change

We handle these changes using robust documentation. Let’s consider one such change: upgrading the RAM in one of our virtual machines. The procedure is transparent, the technicians know how to carry it out, and no services need to be stopped to execute this change. So, we consider this a standard documented change, where the technicians carry out the change with sufficient documentation that includes the following details:

  • Why the change happened
  • Who carried out the change and which documented procedure they followed
  • When the change happened and how long it took
  • Whether the change was successful and its effects

Our in-house service management tool has features to gather this data and support the change manager in execution and decision-making. When it comes to why the change is needed, the change manager takes a look at our CMDB to understand why this documented change needs to happen.

One such reason could be the incidents surrounding the change. Let’s say there are 10 incidents stating that the server is slow. Quick analysis of the CMDB would give the change manager a reason—only five services can go through with the current capacity, while we’re letting 10 services get through. So, that calls for a RAM upgrade. Other reasons could be based on multiple low-impact service requests or the need to adapt to new technology.

Restarting our services could also be a standard documented change. When there's a troubleshooting process in one of our services, it usually leads to a change. For the change to take effect, we must restart the service. Obviously, a technician needs access to a server to perform this operation. We consider it a standard change, which is pre-approved. Our tool will create a record stating a particular technician has accessed this server for so-and-so reason. They'll also have documented procedures to support them with the process. If there’s trouble later and it leads to a minor incident, the change manager will know who is responsible and will ensure that technician learns from the mistake.

Ultimately, these are the types of changes where the track has been laid and it’s just a matter of implementation. However, we still ensure we support the implementation with well-documented procedures and checklists.

Standard maintenance change

A maintenance change is a slight tweak in the approach to a documented change. Let's go through an example to understand maintenance changes.

We used to consider changing the toner in a printer a standard change. Soon, we realized these changes were being documented frequently, which consumed more resources and time. So, we decided to switch it to a maintenance change so there’s no need for such a change at all.

One instance where this helped was when our employees were heavily dependent on printers during tax filing season. We had long queues. If a particular printer kept having problems, it was replaced, which was treated as a change. We incurred the cost of changing the printer, and many users faced problems. We switched to a bi-weekly maintenance schedule to cut down on the cost of printer replacement. During printer maintenance, our asset management team ensures the cartridges are full, the configuration is set right, and the printer has enough paper. This means we avoid paper jams and the need for employees to switch floors to use a different printer. Moreover, this has reduced the total number of incidents reported with printers.

Replacing printers had been a standard documented change that required no approval. Because we kept repeating it, we took a shift-left approach and turned it into a maintenance change.

We used the preventive maintenance feature in our service management tool and ensured that a ticket is issued to a desktop management representative, who takes responsibility for the bi-weekly maintenance. They have a checklist and ensure each printer is always working smoothly with full functionality.
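
A recurring maintenance schedule like this one can be generated ahead of time. The sketch below assumes "bi-weekly" means every two weeks, and the function name and start date are hypothetical; the checklist items are the ones mentioned for printer maintenance earlier in this section.

```python
from datetime import date, timedelta

# Checklist items taken from the printer maintenance description above.
PRINTER_CHECKLIST = [
    "cartridges are full",
    "configuration is set right",
    "printer has enough paper",
]


def maintenance_dates(start: date, count: int, interval_days: int = 14) -> list[date]:
    """Return `count` maintenance dates, one every `interval_days` days from `start`."""
    return [start + timedelta(days=interval_days * i) for i in range(count)]


schedule = maintenance_dates(date(2024, 1, 1), 3)
for day in schedule:
    print(day.isoformat(), "->", "; ".join(PRINTER_CHECKLIST))
# Dates printed: 2024-01-01, 2024-01-15, 2024-01-29
```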

You can also make service pack updates and antivirus updates into maintenance changes if it suits your organization. For instance, antivirus software will seldom change our firewall. And if it seems logical, our CAB would take a shift-left approach and use the preventive maintenance feature in our tool for those types of changes too. It will then take the route of a maintenance change. We also use different tools like our patch management solution to carry out periodic maintenance changes.

You must resist the temptation to list too many changes under the bracket of standard changes. For example, OS upgrades should be treated as normal changes. The ransomware attacks on Microsoft OSs in 2016 and 2017 made Microsoft introduce upgrades. Many of our systems had to be upgraded. But you can’t simply upgrade all servers without caution. You must check the CMDB to investigate which of these servers have critical applications, and check for compatibility, business impact, impact on number of users, etc., before making the upgrade. So, it’s a good idea to have these as normal changes and put approvals in place.

Quick summary

  • When you track changes over a period of time, you will see that many changes pose few challenges and can be performed without approval. By studying this track record, you will know what it takes to eliminate the little difficulties in those normal changes and make them standard ones.
  • The best way to eliminate the possibility of challenges in standard changes is to use documentation. Clear procedures, revision control, and periodic review of documents ensure your standard changes fare well.

Scenario 4: Establishing a system for major changes

Major changes are mostly normal changes that could turn dangerous without more rigorous filters. For example, firewall policy changes are mostly normal, but whether one is major or not depends on which firewall it is. A policy change on the firewall for a back-end internal application, like the one we use for tracking attendance, would be a normal change; it’ll only affect a small proportion of users for a limited time. But if the firewall protects a data center that hosts critical services, it could be a major change. Which applications run in that data center, who uses those applications, and what the impact is if the change goes south: all these questions come into the picture. The impact ultimately decides whether it’s a major or a normal change.
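
The impact-based decision above could be captured as a simple rule. This is a sketch under stated assumptions: the function name, its inputs, and especially the 25% user-share cutoff are hypothetical and are not figures from our policies.

```python
def classify_change(
    hosts_critical_services: bool,
    share_of_users_affected: float,
    major_user_share: float = 0.25,  # hypothetical cutoff, not a figure from the text
) -> str:
    """Return "major" or "normal" for a change that would otherwise be normal."""
    if hosts_critical_services or share_of_users_affected >= major_user_share:
        return "major"
    return "normal"


# Firewall for an internal attendance-tracking application: few users, not critical.
print(classify_change(hosts_critical_services=False, share_of_users_affected=0.02))  # normal

# Firewall for a data center hosting critical services.
print(classify_change(hosts_critical_services=True, share_of_users_affected=0.60))   # major
```

In practice the inputs would come from the CMDB, which records which applications run where and who depends on them.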

Most major changes affect the entire organization. One of our IT leaders first identifies the need for the change and calculates the benefits it would bring to us. Then, they assign a suitable change manager to collect information about how this major change would affect our operations. It is then a matter of executing the general methodology we discussed earlier.

However, when we implement major changes, it is crucial to look beyond the elemental approach to spot hidden challenges.

Also, since it's a major change with considerable impact, we need to tackle these challenges with even more care. Let's see how these hidden challenges cropped up when we implemented a major change many years back.

One of our primary services, our in-house email service, was hosted in a primary data center, which we’ll call DC A. We have primary and secondary data centers along with a backup for each. Over a period of time, we started seeing minor incidents about the email service being slow for some users in our company. When our incident management team analyzed this with the help of our IT operations team and technical staff from the email team, we initially figured the issue might be due to our internet service providers (ISPs).

However, the ISPs were firm that the problem was on our end. This back-and-forth communication didn't take us anywhere, so we decided to make a major change: transfer the email service to a secondary data center (let's say DC B). Then we'd move on to analyzing DC A and sorting the problem out once and for all.

Moving a service to another data center is a major change, and we had a framework with the following:

  • Well-documented procedures to carry out the implementation
  • Trained technicians
  • An asset management database coupled with our change management module
  • In-house tools for monitoring, implementing, and reviewing the change

On paper, these are enough to carry out our elemental approach and execute the change. But that wasn't enough, and that's where we needed a filter to spot and tackle hidden challenges. When we went on to implement the change, here's what we found:

  • DC B's backup was not configured well.
  • There were other issues with DC B, like firmware and hardware problems.
  • These issues were not documented according to procedure.
  • There were other applications that had to be moved to DC B, but they weren't moved because they were not compatible with the hardware and OS.

Luckily, our experienced IT leaders had the eye to notice these issues early, and we rolled back the change soon. But if we hadn't, it could have paved the way for a major incident that affected multiple users in our company and might have even turned into an emergency change.

We had a proven framework that helps us all day with normal and standard changes, but something was lacking. What was it?

  • A CMDB that covers 360 degrees of a change
  • An additional filter to rule out possible dangers and train the CMDB

In other words, our CMDB was not equipped to cover all aspects of a change. The CMDB that helped us execute normal changes well needed an even more thorough upgrade. A CMDB is a machine that facilitates decision-making. Technically, a CMDB holds a list of items that must be managed to ensure services function properly, and it also records the relationships between these items. Practically, though, it is a machine that needs training; the more varied the information you feed it, the better it'll help you make decisions.

In this example, a 360-degree CMDB would have included the technician, the secondary data center, the applications running on it, its history, the other technicians involved in its operation, and a lot more. A review of the architecture of the secondary data center would have saved us from this risk.
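To make the idea of "items plus relationships" concrete, here is a minimal sketch in Python. It assumes a toy CMDB stored as a dictionary that maps each item to the items affected when it changes; the item names and the structure are hypothetical and far simpler than a real CMDB.

```python
from collections import deque
from typing import Dict, List

# Hypothetical CMDB fragment: each item maps to the items affected when it changes.
RELATED = {
    "DC B": ["DC B backup", "email service"],
    "DC B backup": [],
    "email service": ["product teams", "operations teams"],
    "product teams": [],
    "operations teams": [],
}


def impacted_items(cmdb: Dict[str, List[str]], start: str) -> List[str]:
    """Breadth-first walk returning everything downstream of `start`."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        item = queue.popleft()
        for related in cmdb.get(item, []):
            if related not in seen:
                seen.add(related)
                order.append(related)
                queue.append(related)
    return order


print(impacted_items(RELATED, "DC B"))
# ['DC B backup', 'email service', 'product teams', 'operations teams']
```

The more relationships you record (technicians, history, hosted applications), the more such a walk can surface before the change is planned.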

When you make a major change, you go from path A to path B. A 360-degree CMDB lets you confirm first that path B isn't a dead end.

Before you make a major change, ask yourself a set of questions related to the CMDB along the lines of compatibility, security, availability, and integrity. Let's see how such a filter would help for major changes.

Creating a filter for major changes

The table below gives you a checklist that would have helped us when we implemented the change. There is also a generalized version of the same list that you can use to create a filter for major changes.

Enterprise change management checklist

| S.No | Contextual filters | Answers | A general filter for you |
|---|---|---|---|
| 1 | Who are the different ISPs? | ISP 1 and ISP 2 | What external parties are involved in the change? |
| 2 | Who from the IT team communicates with that ISP? | Insert email address of the responsible person | Who communicates with each of these external parties? |
| 3 | What services use this ISP? | Email, internal applications | Which applications depend on these external parties? |
| 4 | Where is the data from this ISP traveling to? | DC A | What is the data involved concerning these parties, and how is it handled? |
| 5 | How many data centers are there? | Two | How many such entities is the data traveling to? |
| 6 | Which is the primary data center, and which is the secondary? | Primary: DC A; Secondary: DC B | How are these entities organized? |
| 7 | How are the ISPs connected to the data centers? | ISP 1 and ISP 2 serve DC A | How are the external parties connected to the entities? |
| 8 | Which services are initiated from DC A? | Mail, chat, database, and others | Which services are dependent on these entities? |
| 9 | Who might need these services when the migration is scheduled? | Product teams, operations teams, etc. | Which stakeholders depend on these services during the change implementation window? |
| 10 | What backup is there for DC B? | Insert backup server details here | What is the business continuity and disaster recovery (BCDR) setup for these entities? |
| 11 | Who is responsible for the backup of DC B? | Insert email address of the technician | Who is the owner of the BCDR setup? |
| 12 | Where will the escalations for DC A and DC B go, and who will handle them? | Insert email address and designation | How are escalations regarding these entities handled? |
| 13 | Is there a way to monitor the availability of DC B? | ManageEngine's performance monitoring solutions | How is the availability of entities monitored? |
| 14 | Which firewall guards DC A and DC B? | Insert firewall details | What are the network security systems in place for these entities? |
| 15 | Who manages those firewalls? | Insert email address | Who is accountable for those security systems? |

This checklist would have reduced the risk we were about to face. In fact, we'd have not planned the change at all before fixing the problems with DC B. A CMDB trained with all the above information should be the starting point of a major change.

The above is by no means an exhaustive checklist. Ultimately, it's up to you and the other IT leaders in your company to come up with such checklists for the different major-change scenarios you go through. To create such a filter for yourself, you can focus on these aspects: resources, responsibilities, and review.

Resources:

Questions related to the entities, external parties, etc. give you a good idea of the resources involved in that change. You must ensure you consider all possible resources that could be involved.

Responsibilities:

Questions where we inserted the data of a person in the middle column are all about responsibilities. These people hold the assets and the process together.

Review:

Questions involving “how” could be part of your review mechanism. You must review the resources and the process together and improve this checklist based on your findings.
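As a rough illustration of the three aspects, the sketch below sorts a few questions from the general filter into resources, responsibilities, and review. The heuristic is our own simplification for the example (questions starting with "who" are responsibilities, "how" questions are review, everything else is a resource question), and treating "how many" as a resource question is an added assumption. It is written in Python and is not part of any product.

```python
def aspect(question: str) -> str:
    """Coarse, hypothetical heuristic mapping a filter question to an aspect."""
    q = question.strip().lower()
    if q.startswith("who"):
        return "responsibilities"
    # Assumption: "how many ..." counts resources rather than reviewing a process.
    if q.startswith("how") and not q.startswith("how many"):
        return "review"
    return "resources"


questions = [
    "What external parties are involved in the change?",
    "Who is the owner of the BCDR setup?",
    "How are escalations regarding these entities handled?",
    "How many such entities is the data traveling to?",
]

for q in questions:
    print(f"{aspect(q):<17} {q}")
```

A heuristic like this is only a starting point; your CAB members should decide which aspect each question really belongs to.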

Applying the filter

change management cycle

We apply this filter during the planning stage. This ensures we have an implementation plan with minimal loopholes.

We once had to upgrade our email servers. Since this is a critical service, our change owner categorized it as a major change and decided to implement it during the weekend so that the business impact would be minimal.

Our CAB consisted of senior engineers, IT department technicians, and also the head of operations since Zoho Corp. (the parent company of ManageEngine) was itself a user of the email service. We followed the general approach of initiating the change, planning it, and analyzing the impact. We estimated there would be considerable downtime and contemplated our backup resources and other security measures.

Since it is a major change, we decided to run it through the filter we discussed earlier.

Let’s consider the following questions from the filter we created earlier:

  • Which services are dependent on these entities?
  • Which stakeholders depend on these services during the change implementation window?

With our CMDB, we identified the stakeholders involved in the services affected by the change, and we learned that a series of HR interviews was scheduled for that weekend (the scheduled downtime). We consulted the CAB member from operations. But we didn't pull back the change, since the IT leaders in the CAB advised that it was crucial and that not doing it immediately could cause the company to lose money. The change manager took that recommendation, and we rescheduled the HR interviews. If we had not had a well-trained CMDB and a filter, the HR team would not have known about the change, and the interviews would have gone ahead as scheduled. They'd have been disrupted by the email server upgrade, and the change would have gone down as an unsuccessful one.
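The check that surfaced the HR interviews boils down to comparing the downtime window with events tied to the affected services. Here is a minimal sketch in Python; the dates, the event list, and the idea that events are stored alongside services are assumptions for illustration only.

```python
from datetime import datetime


def overlaps(start_a, end_a, start_b, end_b) -> bool:
    """Two time ranges overlap when each one starts before the other ends."""
    return start_a < end_b and start_b < end_a


# Hypothetical weekend downtime window for the email server upgrade.
window = (datetime(2024, 6, 1, 8, 0), datetime(2024, 6, 2, 20, 0))

# Hypothetical stakeholder events that depend on the email service.
events = [
    ("HR interviews", datetime(2024, 6, 1, 10, 0), datetime(2024, 6, 1, 16, 0)),
    ("Product release", datetime(2024, 6, 4, 9, 0), datetime(2024, 6, 4, 11, 0)),
]

conflicts = [name for name, start, end in events if overlaps(*window, start, end)]
print(conflicts)  # ['HR interviews']
```

Whether a conflict leads to moving the change or moving the event remains a decision for the change manager and the CAB.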

This shows the importance of handling the planning stage using a filter, and how crucial it is to communicate with stakeholders. Let's take a look at another example, the one we discussed in Chapter 2 when we saw how an escalation could create huge problems. Let's see how this filter can prevent the scenario from turning into an escalation:

“Let's start with a standard change of updating security patches in a server. Since it's a no-brainer that it's critical, you might have categorized it as a standard change, checked for the pre-approved status and the documentation, and went on to update the server. However, some of your critical applications might be running on that server, and they'd still only be compatible with older versions. Now those applications are not functioning properly, and you must move them to a backup server. This will be a major change. You'll now have to involve a CAB and follow a different process. Let's say this backup server does not have a disaster-recovery mechanism, and your critical application data is lost. This is an emergency change as it affects many users, and your C-level executives are now responsible for cleaning up this mess.”

The second stage here is a major change. You're supposed to move some applications from your main server to a backup server. Without the filter for the major change, we know it's going to result in an emergency change. But with the filter, the following questions come up:

Note: The entities here would be the main server and the backup server, according to the filter.

  • Which services are dependent on these entities?
  • What is the BCDR setup for these entities?
  • Who is the owner of the BCDR setup?

These three questions would have helped you realize that there's no backup, and you'd have held off on the change until you fixed that problem first.
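Those three questions can be turned into a simple pre-check. The following Python sketch assumes each entity record carries its dependent services, its BCDR setup, and the BCDR owner; the record layout, server names, and email address are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entity:
    """A server involved in the change (hypothetical record layout)."""
    name: str
    services: List[str]
    bcdr_setup: Optional[str] = None  # e.g. a description of the DR mechanism
    bcdr_owner: Optional[str] = None  # contact for the BCDR setup


def blocking_gaps(entities: List[Entity]) -> List[str]:
    """List reasons to hold the change until they are fixed."""
    gaps = []
    for e in entities:
        if e.services and not e.bcdr_setup:
            gaps.append(f"{e.name}: no BCDR setup for {', '.join(e.services)}")
        elif e.bcdr_setup and not e.bcdr_owner:
            gaps.append(f"{e.name}: BCDR setup has no owner")
    return gaps


main = Entity("main server", ["critical app"], "nightly replica", "owner@example.com")
backup = Entity("backup server", ["critical app"])

print(blocking_gaps([main, backup]))
# ['backup server: no BCDR setup for critical app']
```

An empty list would mean the filter found no BCDR-related reason to delay the move.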

Quick summary

  • The added filter has all the questions to help you spot the hidden challenges beforehand. And for major changes, you need a more enhanced CMDB that covers 360 degrees of the change—including all the resources and responsibilities related to a particular entity involved in the change—so your filter works well.
  • To apply both technical and procedural interventions, you can take the help of your CAB members to use this filter during the planning stage. They can customize this filter based on the major changes you generally deal with. They must also keep improving this filter and train the CMDB even better with each major change.

Scenario 5: Preparing for emergency changes

An emergency change is when you realize something is broken; it's predominantly a reaction to an incident. In our incident management handbook, we discussed how ManageEngine deals with such incidents and carries out root cause analysis (RCA). Corrective and preventive action (CAPA) is a natural result of RCA. The corrective actions mostly coincide with how we must execute the emergency change.

Emergency changes could belong to one of these two categories:

  • A reaction to external events
  • An attempt at fixing an unsuccessful normal or major change

The frameworks we discussed so far will help you reduce the chance of falling under the second category. We will also see how else you can prevent these emergencies. For the first one, however, you need to have a system of measures to help you handle those events with enough conviction.

A system for the outside

ManageEngine's leaders focus on one prime aspect when they are faced with an emergency from the outside: the ability to balance urgency and caution.

When the emergency change is forced from outside, we are looking at a potentially huge impact and the need to hurry to reduce damage.

To handle time constraints and the possible business impact, we have a process where we skip planning and instead focus more on implementation.

4 stages of the emergency change life cycle

  • Forming an emergency CAB (ECAB)
  • Putting emphasis on fast testing

We will consider a scenario to see how we apply these steps. The number of security vulnerabilities is constantly rising, and vulnerabilities pose a huge threat to customer data if we don't handle them well. Let's see how we handle this threat.

Our IT team updates its knowledge of security vulnerabilities through sources like the CVE Program, Red Hat Security, and the Cybersecurity and Infrastructure Security Agency. Once the IT team finds a vulnerability that could affect our servers, it tests the vulnerability and gauges its severity. If the vulnerability is decided to be a severe threat, the IT team calls for an emergency change where we implement a fix or patch on the server.
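The triage step can be pictured as a small decision function. This is a hedged sketch in Python: the 0 to 10 severity scale, the 9.0 threshold, and the identifiers are assumptions for the example, and the severity is whatever your IT team gauges after testing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vulnerability:
    identifier: str              # hypothetical ID, not a real advisory
    severity: float              # 0.0 to 10.0, as gauged by the IT team
    affected_servers: List[str]  # servers of ours that the test showed are exposed


def needs_emergency_change(vuln: Vulnerability, threshold: float = 9.0) -> bool:
    """Call for an emergency change only if we are exposed and severity is high."""
    return bool(vuln.affected_servers) and vuln.severity >= threshold


findings = [
    Vulnerability("VULN-A", 9.8, ["mail-01"]),
    Vulnerability("VULN-B", 5.0, ["web-02"]),
    Vulnerability("VULN-C", 9.9, []),  # severe, but none of our servers are affected
]

for v in findings:
    print(v.identifier, needs_emergency_change(v))
```

Findings that don't meet the bar would still be patched, just through the normal change process.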

Forming an ECAB

Unlike a regular CAB, the ECAB's focus is expediting the emergency change process by assisting the change manager. We limit the ECAB strictly to those who have the expertise to implement the solution for the emergency change. That naturally excludes the top C-level executives whom you may find in the CABs for our major changes.

In this case of a security threat in a server, the ECAB consists of the following members:

  • Cybersecurity experts from the Security team
  • Manager and a member from the Framework team

With the help of these experts, the change manager creates a fix or patch to implement. The ECAB's focus here is to implement the change as soon as possible to prevent further damage from occurring.

The review phase is equally important too, and the ECAB also sticks around after implementation to re-evaluate the change. Once we raise our defenses to counter the immediate danger, the ECAB focuses on a long-term solution by testing the change again and seeing how it can be improved to prevent even bigger threats in the long run.

Emphasis on fast testing

We use our in-house patch management solution for all-around implementation. We use it for:

  • Testing the patches in a local environment before deployment.
  • Automating patch deployment for all OSs.
  • Monitoring the performance of patches using reports.

Once we develop a patch with the security and framework teams, we implement it using the following process that focuses on thorough but fast testing on multiple levels:

Emergency change implementation process

Though there is a time constraint, our framework ensures we focus on testing the change to prevent even more adverse effects than the original emergency. Compromising on the effectiveness of the change due to the urgency is a bad idea. However, you can help yourself and speed up the process by creating an environment of urgency that also has enough support.

Here's how we create the ideal environment to implement emergency changes:

1. Unified solutions for monitoring and implementation:

Like we do for patch management, we also use solutions for network monitoring, data center monitoring and infrastructure management, domain and certification management, inventory management, and operations management.

For you, the requirements may differ. However, you should know the importance of your IT team's ability to test the changes multiple times under a strict time constraint while still having enhanced visibility and control.

2. Transparent, simple, readily available documentation:

Documentation must be presented in a way that your IT team understands. It should be transparent to all members of the team, and it should be available where the entire team can access it. These documents include procedures, basic definitions, guidelines, and policies that every member of the team must be aware of.

We have spent years perfecting our documentation. We also hire technical writers with deep understanding of the domain for this purpose. You can check out a section of our previous e-book A CIO's guide to rethinking compliance, which explains the importance of documentation and how you create it.

3. Communication channels:

We use dedicated communication channels to reduce confusion as well as improve the sense of urgency during an emergency change. These channels have proven to be a useful tool for change managers, especially when they must implement the change in a matter of hours. Even after the implementation, the channels serve as records that one can review later to improve on the changes.

Tactics for the inside

If an emergency change is not a reaction to an external force, then we have probably gone down the road not to be taken. The ideal way to handle such emergency changes is to first ensure we don't cause them at all. But it takes a certain amount of tactical impetus to achieve that.

Let’s see how we can prevent the rise of emergency changes from everyday happenings.

Strengthening the fort

Let’s take the example of one of our customers that falls under the SME category. The organization offers IoT solutions, using devices like tablets and other sensors for monitoring and maintenance of spaces like conference rooms with the help of applications. The customer has a team of desktop management professionals who decided that early in the morning, a few hours before business kicks in, was a good time for a server migration. Their boss, the one who is ultimately accountable for this migration, was out of town that day due to a personal emergency. But since this is almost as simple as planned maintenance, the team decided to go ahead with their best judgement. They had to transfer terabytes of data from one server to another. It was a simple process; they simply had to replicate the data with a documented process.

As they got started and the process went on, the system got slower. Business hours kicked in, and applications became inaccessible. Slowly, the team started receiving tickets from their customers, and some escalations began to rise. They realized they had made a mistake. They apologized to the customers and salvaged what they could, but the damage was done.

A planned change turned out to be an emergency change for them.

What was the real problem here? They simply didn't know when to back out.

They knew the procedure for migration but didn’t have the knowledge to judge the situation. If a migration takes more than 30 minutes, it’s time to back out and roll back. Since their boss wasn't reachable, nobody could tell them that. There were two further lessons. First, this type of change needs an approval mechanism, but they had treated it as a standard change. Second, they learned that these changes should happen after business hours.

This is where the need for business rules and strong policies comes in. Beyond well-documented procedures and tools, you need policies to ensure your teams apply the business rules.

They didn’t have a good indicator or a scale to judge how long it would take given the conditions. These things should also be included in the documentation. After the incident, they revisited their documentation and added these indicators.
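An indicator of this kind can be as simple as projecting the total duration from the progress so far. The Python sketch below is illustrative: the 30-minute limit comes from the example above, while the function name, the linear projection, and the sample figures are assumptions.

```python
def should_back_out(total_gb: float, copied_gb: float,
                    elapsed_min: float, limit_min: float = 30.0) -> bool:
    """Back out if the limit is already exceeded or the projection exceeds it."""
    if elapsed_min >= limit_min:
        return True
    if copied_gb <= 0:
        return False  # no throughput data yet, keep watching
    # Assumes throughput stays roughly constant for the rest of the transfer.
    projected_min = elapsed_min * total_gb / copied_gb
    return projected_min > limit_min


# 10 minutes in, only 100 GB of 2,000 GB copied: projected 200 minutes.
print(should_back_out(2000, 100, 10))   # True
# 10 minutes in, half done: projected 20 minutes.
print(should_back_out(2000, 1000, 10))  # False
```

Writing such a rule into the procedure means the team does not need a manager on call to know when to roll back.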

Looking back at the incident, the planning was good, they had documentation, but they did not have enough support during implementation. No one could be blamed. People were trained—they were skilled, knowledgeable, and following a documented procedure—but the change engine collapsed due to lack of business rules during implementation.

To prevent such happenings, we ensure we form policies with the help of our IT leaders and make them visible to all teams. Even though our culture allows for flexibility and independence, these policies help our IT teams implement changes effectively.

The art of understanding

Our IT leaders' understanding of technology and process has helped us prevent many emergency changes. We keep an eye out, and so must you.

Keeping up with trends in business and technology is crucial to prevent emergency changes.

For example, Microsoft may give us the end-of-life date for an OS version. And then they'll give us a year of notice, too. If we still maintain old servers, we're bound to face an emergency change in the future.
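A periodic check for servers nearing end of life is one way to act on that notice. This Python sketch uses made-up server names, OS labels, and dates; in practice the inventory would come from your asset management solution, and the dates from the vendor's published lifecycle information.

```python
from datetime import date, timedelta
from typing import Dict, List


def approaching_eol(servers: Dict[str, str], eol_dates: Dict[str, date],
                    today: date, notice: timedelta = timedelta(days=365)) -> List[str]:
    """Return servers whose OS reaches end of life within the notice period."""
    return sorted(
        name for name, os_version in servers.items()
        if os_version in eol_dates and eol_dates[os_version] - today <= notice
    )


# Hypothetical inventory and vendor-announced end-of-life dates.
servers = {"mail-01": "os-v1", "db-01": "os-v2", "web-01": "os-v3"}
eol_dates = {"os-v1": date(2025, 1, 14), "os-v2": date(2027, 6, 30)}

print(approaching_eol(servers, eol_dates, today=date(2024, 6, 1)))  # ['mail-01']
```

Servers flagged this way can be scheduled as normal changes well before the deadline turns them into emergencies.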

Beyond the OS or the network updates, the path to an emergency change could be a security risk. Our IT teams and security teams have dedicated resources for becoming aware of vulnerabilities and other security threats going around in the digital world. Many organizations have a tough time figuring out which of their machines could be under attack or are vulnerable. We approach such problems using our asset management solution and CMDB. We maintain assets using desktop management solutions and frame relationships with assets to figure out which ones are problematic. It all comes down to the visibility we have and our framework to use that visibility.

An important part of understanding is also not overdoing changes.

While keeping up with technology, you shouldn't let your enthusiasm make you overdo it. The same example of Microsoft's OS upgrades holds good here. Microsoft will keep releasing them, but if we upgrade our servers recklessly, our critical applications are bound to go down. So, we do these upgrades with enough visibility and the help of our framework.

For the best level of understanding, we depend on the last stage of our process: review. That's the most crucial part of an emergency change as it helps you prevent such changes in the future. Sometimes our server patches malfunction and throw a blue screen. That would be the latest version with a good configuration, but it still “broke.” This is where a review would help. After we raise an incident and do RCA, we will see a small fragment that caused the emergency change. Maybe we missed a small performance issue or we skipped maintenance. We ensure we figure out the reason and build our understanding.

Quick summary

  • Practically, your hidden challenges will be based on the environment you create for fast testing and the training your technicians have. To get both of those right, you can have a schedule to test your testing environments when there isn't an emergency change happening. This helps you improve your environment and the readiness of your technicians.
  • The best way to handle these hidden challenges is to prevent them from occurring during an emergency. Strong policies and a better understanding of your IT infrastructure will help you here.
