The complete server monitoring guidebook

Find the right monitoring strategies, best practices, and tools to optimize your server performance!

Duration: 6-7 minutes
Published: September 4, 2025
Author: Visakh

Servers are at the heart of IT. They run your apps, databases, domains, and email, and they manage operations across networks. Server monitoring helps prevent sudden, unplanned outages as well as long-term performance inefficiencies from affecting your service delivery.

This article lists 12 commonly asked questions about server monitoring and explains the best practices you need to follow to improve the reliability, resilience, and performance of your server devices.

What is server monitoring?

Server monitoring is the process of checking your server devices at regular intervals to verify that your systems and services are running optimally. To ensure this, you can check for three things:

Is the server up and running?

This is known as an uptime check. Uptime checks see whether a server is reachable via the network. If the server doesn't respond to an uptime check, it can indicate one of two issues: a problem on the network end that's preventing the server from being contacted, or a problem with the server itself.
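
To make this concrete, here's a minimal uptime-check sketch in Python. It uses a TCP connection attempt rather than ICMP (which often requires elevated privileges or is blocked on some networks); the host and port are placeholders:

```python
import socket

def is_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False

# 'example-server.local' and port 22 are placeholders; substitute your own.
print("UP" if is_reachable("example-server.local", 22) else "DOWN")
```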

How is the server performing?

If the server is up and running, you need to check how the server is performing. There are multiple metrics related to the server's processor, memory, storage, and hardware that indicate how the server is performing at any given time. Choosing the right performance indicators depends on the type and functionalities of the server.

What processes and services are running?

You might also want to know the processes and services running on the server. Having visibility into all the processes and services running on a server, and the resources utilized by them, lets you prioritize the critical ones and restore services in case they go down.

How are servers monitored?

Server monitoring tools help you check the uptime and performance of server devices at regular intervals. You can use them to monitor many devices at the same time, eliminating the hassles of checking them manually.

Server monitoring tools get access to the requisite data with two mechanisms: they can either 'pull' the information out of the server using specific network protocols supported by the server, or they can install a lightweight software agent inside the server that 'pushes' the data to them at regular intervals.

Network protocol-based server monitoring

Network protocols are guidelines that enable network communication between IT systems. Network protocols are specified in each layer of the IT stack, ranging from the Internet Protocol (IP), which routes traffic between network devices, to the Hypertext Transfer Protocol (HTTP), which facilitates application-to-client communication.

When it comes to monitoring, certain specialized network protocols enable information retrieval from specific devices. These protocols are used to establish connections and collect data from various servers.

  • SNMP: SNMP, or Simple Network Management Protocol, is a widely accepted protocol for monitoring devices within a network. Windows, Linux, macOS, and various virtual server vendors support SNMP. Three versions of SNMP are available (SNMP v1, v2c, and v3) with varying levels of authentication and encryption (see the polling sketch after this list).
  • WMI: WMI, or Windows Management Instrumentation, is Microsoft's specialized protocol for data collection from Windows devices and servers. Compared to SNMP, WMI offers more detail and can be easier to configure, but it also consumes more resources.
  • CLI: Server monitoring tools use the CLI, or command-line interface, to query data from Linux servers. Specific CLI commands yield outputs like CPU, memory, and disk space usage, and this data is presented in the monitoring interface.
  • Nutanix Prism API: Nutanix Prism APIs are used to reach, communicate, and gather information from Nutanix clusters.
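
To make protocol-based polling concrete, here's a minimal sketch of an SNMP v2c query using the third-party pysnmp library; the target address, community string, and choice of the standard sysUpTime OID are assumptions for the example:

```python
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

# Query sysUpTime (OID 1.3.6.1.2.1.1.3.0) from a hypothetical server
# using SNMP v2c with the 'public' community string.
error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),       # mpModel=1 -> SNMP v2c
        UdpTransportTarget(("192.0.2.10", 161)),  # placeholder address
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.3.0")),
    )
)

if error_indication:
    print(f"Polling failed: {error_indication}")
else:
    for name, value in var_binds:
        print(f"{name} = {value}")
```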

Agent-based server monitoring

Agents are lightweight software programs that collect and push data to monitoring software. Agents are usually preferred over network protocol-based monitoring when you want to reduce the load on the monitoring solution. Moreover, agents are also suitable for networks where security requirements limit the usage of network protocols like SNMP.

Agents can be manually installed by you or pushed remotely through the network. Once installed, the agent pulls the requisite information from within the server and sends it to the monitoring solution at regular intervals.
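
As a rough illustration of the push model, the sketch below collects a few local metrics with the third-party psutil library and posts them to a hypothetical collector endpoint at a fixed interval; the URL and payload shape are assumptions for the example:

```python
import time
import psutil    # third-party: pip install psutil
import requests  # third-party: pip install requests

COLLECTOR_URL = "https://monitoring.example.com/api/metrics"  # placeholder
INTERVAL_SECONDS = 60

while True:
    payload = {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }
    try:
        # Push the snapshot to the monitoring server.
        requests.post(COLLECTOR_URL, json=payload, timeout=5)
    except requests.RequestException:
        pass  # A real agent would buffer and retry on failure.
    time.sleep(INTERVAL_SECONDS)
```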


What are the core functionalities of a server monitoring tool?

Most server monitoring tools incorporate some core functionalities that help ensure optimal server operations. These functionalities use various mechanisms and technologies to help you detect issues, diagnose the nature of the issue, and resolve it. We've divided core server monitoring functionalities into three:

Monitoring (data collection)

We discussed this in the previous question. Server monitoring tools collect data from various server devices using either network protocols or monitoring agents. Data collection for server monitoring is performed by an NMS (network management system), which is configured to collect uptime and performance data at specified intervals.

Alerting (Fault detection)

Server monitoring tools are used to detect faults and outages. This process can be automated with the help of alerts. A server monitoring tool with a built-in alerting feature can compare the collected data against a set of baselines. If a collected value crosses its baseline, an alert is generated.

Uptime monitoring

When it comes to uptime, the baseline is obviously up. If the monitoring tool detects that a device is reachable and responding, it's marked as up. Otherwise, the device is marked as down. If the user sets up an alert to be generated for down devices, the monitoring tool generates a downtime alert.

Performance monitoring

For performance monitoring, the baseline depends on various factors. Lower values are preferred for metrics like CPU utilization, but when it comes to metrics like free disk space, the higher the value, the better.

Most monitoring tools can't infer the context required to identify baselines for server performance monitoring, so users have to set them up manually based on the server's expected behaviour at peak load. This is done with the help of thresholds. Thresholds are set based on the 'normal' performance baseline for a server. If the monitored value crosses the threshold (for CPU utilization, rising above the set threshold; for free disk space, falling below it), an alert is generated. Let's explain this with an example:

An application server that hosts an online ticket booking service experiences higher loads on the weekends than on the weekdays as more users try to access the service. The admin sizes the server to deal with this peak demand and expects the server to run at 60% CPU during the weekends. To monitor the server CPU for abnormalities, they set an alert at 65%. This means that if the application server experiences an issue that causes its CPU to shoot up beyond 65%, this will be flagged as an alert by the server monitoring tool.
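A threshold check like this boils down to a directional comparison: some metrics alert when they rise above the limit, others when they fall below it. A minimal sketch with illustrative metric names and limits:

```python
# Direction matters: CPU alerts when usage rises ABOVE the threshold,
# free disk space alerts when it falls BELOW it. Values are illustrative.
THRESHOLDS = {
    "cpu_percent": {"limit": 65.0, "alert_when": "above"},
    "free_disk_gb": {"limit": 50.0, "alert_when": "below"},
}

def check(metric: str, value: float) -> bool:
    """Return True if the value violates the configured threshold."""
    rule = THRESHOLDS[metric]
    if rule["alert_when"] == "above":
        return value > rule["limit"]
    return value < rule["limit"]

print(check("cpu_percent", 72.0))    # True  -> raise an alert
print(check("free_disk_gb", 120.0))  # False -> healthy
```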

AI-driven monitoring

Modern ML (Machine learning) and AI (Artificial intelligence) technologies have enabled monitoring tools to get the context required to identify, set, and update the baselines for normal server performance. An AI-driven monitoring tool can analyze the performance of a server for a training period and then set thresholds automatically thereafter.
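
Vendors implement this differently, but a simple stand-in for the idea is a statistical baseline: learn the mean and spread of a metric over a training window, then flag values far outside it. A minimal sketch, assuming hourly CPU samples:

```python
from statistics import mean, stdev

def adaptive_threshold(training_samples: list[float], k: float = 3.0) -> float:
    """Derive a threshold as mean + k standard deviations of past behaviour."""
    return mean(training_samples) + k * stdev(training_samples)

# A window of (made-up) hourly CPU readings for one server.
history = [55, 58, 61, 57, 60, 63, 59, 62, 56, 64, 60, 58]
limit = adaptive_threshold(history)

latest = 78.0
if latest > limit:
    print(f"Anomaly: {latest}% exceeds learned baseline of {limit:.1f}%")
```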

Process and service monitoring

You'd also need to track critical processes and services and get alerted if they go down. This can be done by checking whether a certain process or service is up at regular intervals. With more advanced solutions, the resources utilized by them can also be tracked.
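
As a rough illustration (not any specific tool's implementation), a basic process-availability check with the third-party psutil library might look like this; the process name is a placeholder:

```python
import psutil  # third-party: pip install psutil

def find_process(name: str) -> list[dict]:
    """Return CPU/memory usage for every running process matching name."""
    matches = []
    for proc in psutil.process_iter(["name", "cpu_percent", "memory_percent"]):
        if proc.info["name"] == name:
            matches.append(proc.info)
    return matches

procs = find_process("postgres")  # placeholder process name
if procs:
    print(f"{len(procs)} instance(s) running: {procs}")
else:
    print("Process down -- raise an alarm and attempt a restart")
```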

Analysis (Data insights)

Most server monitoring tools excel at monitoring and generating alerts, but for many tools, that's the limit of their capabilities. Without analysis features, you would have to draw insights from large volumes of raw data manually. Let's see how this could happen.

A server hosting critical applications and services can experience anywhere between 10 and 30 abnormal events per day. For a 42U rack fully populated with 1U servers, that equates to between 420 and 1,260 events per day. Now imagine similar alerts coming in from other racks, network nodes, storage systems, and power units. The IT team would have to manually correlate all this data to arrive at conclusions.

How does analysis help?

Reports

Reports distill large and complex datasets into quick, actionable insights. By categorizing servers based on their performance metrics over time, you can review resource consumption, failure statistics, and availability timelines effortlessly. Moreover, reports can be easily exported and sent to other stakeholders, like upper management.

Dashboards

Dashboards offer quick, easily decipherable insights into server performance at any given instant. If the server monitoring tool has a dashboard that displays the last monitored uptime status and CPU usage of all the critical servers, it helps you keep a close watch on the servers and react quickly to incidents.

Graphs

Graphs help IT teams investigate issues after they occur or analyze potential issues. Graphs are usually time-stamped, colour-coded representations of monitored data, well suited to continuously varying data (like CPU utilization or traffic usage).

Correlation

Correlation involves comparing multiple monitored parameters to arrive at conclusions. For server monitoring, correlation is usually done between monitored metrics within a server (like the relationship between memory usage and disk space), between different server devices (like a VM and its host), or between servers and network components (like a server and its dependent network switch). Correlation features in server monitoring tools allow you to analyze disparate, seemingly random events and build a coherent incident timeline from them.

Maps

It's always easier to analyze and monitor data visually rather than with numbers or tables. Maps allow you to visualize servers and server architectures with ease. Generally, different types of maps are used to represent topology (of servers within the network), location (within actual maps), datacenters, and business criticality.


Why is server monitoring important?

Servers perform critical compute operations and host user-facing applications. It goes without saying that the user experience, revenue, and reputation of your organization hinge on the performance of your servers. The importance of server monitoring is further compounded by the following factors:

Increased complexity of modern IT

The complexity of the modern IT environment has made server monitoring an indispensable part of IT operations. Earlier, monolithic servers were deployed, with each server designed to handle peak operations for its operating range.

Modern applications are deployed as microservices, virtualization is used to consolidate standalone monoliths onto shared machines, and containerization is performed to improve application resilience and scalability.

This results in an incredibly distributed architecture with high levels of consolidation. Services are distributed both physically and logically, and it's difficult to pinpoint from the outside which server performs which operation.

Modern architectures are also very dynamic. VMs can be spun up, migrated, and taken down with ease, and containers are made to be portable and scaled up rapidly.

Server monitoring provides visibility into this environment by discovering and classifying servers. This helps you to find specific servers, monitor the components involved in critical operations, and track the sprawl and resource utilization for your services.

Perils of inefficiencies and downtime

Without constant surveillance and monitoring, you wouldn't have any awareness of server performance. Your services would be affected, applications sluggish, and user experience impacted, but you wouldn't be able to narrow down the root cause of it all.

Downtime can also lead to serious losses in revenue and reputation. Studies have placed the cost of downtime for critical services between $5,600 and $14,000 per minute.

SLA violations and penalties

Without constant monitoring, any incident is a race against time. When incidents are reported, you have to diagnose the issue within a specified time (the mean time to diagnose, or MTTD) and restore services within the time specified in your SLA.

Modern SLAs ask for five nines (99.999%) or similar standards of availability. Without server monitoring, MTTR could be stretched out indefinitely as you'd have no visibility into the state of your servers at the moment of and prior to the incident.

Modern compliance standards (ISO 27001, SOC 2) also mandate availability and business continuity for IT services. This means that you have to ensure that your services remain up, and if something happens, you have to restore services quickly. Without server monitoring, this can be difficult. Failure to comply with these standards often leads to hefty penalties.


What are the key benefits of using a server monitoring tool?

Server monitoring tools can improve server performance and uptime and consequently lead to better service delivery. Let's take a look at some specific benefits.

Proactive incident response

With constant surveillance and monitoring, you can reduce the MTTD for your servers drastically. Depending on the criticality of the server and its operations, you can set up monitoring intervals as low as 30 seconds or 1 minute. Once you set up the right kind of alarms, you get notified instantly about potential issues before they escalate.

Efficient ITSM practices

Reduced MTTD translates into other aspects of your IT processes as well. Once you detect and diagnose an issue early, your team of technicians can get on the task and restore services. Having visibility into server performance also helps them restore services faster. This results in a lower MTTR (mean time to restore/resolve).

Server performance optimization

With in-depth visibility into server performance indicators, you can easily detect bottlenecks and optimize performance. For instance, if while monitoring the hardware of an under-performing server, you detect that the CPU temperature is constantly high, you can deduce that the cooling system is a bottleneck that affects performance and optimize it accordingly.

Improved user experience

Server monitoring naturally leads to reduced downtime and improved performance. This also improves the user experience. Users prefer services and applications that load fast and perform consistently, and monitoring helps deliver this.

Informed capacity planning

You can also plan and size your servers to ensure that they scale well with growing demands. Monitoring server resources like CPU utilization, memory usage, and disk space keeps you aware of current usage trends. When these resources near exhaustion, you can calculate the resources needed to maintain consistent performance for the future and provision more resources.

Better sprawl prevention

Monitoring helps you detect resource-starved and resource-greedy servers. This is particularly beneficial in virtualized environments, where virtual sprawl can be a major bottleneck. By balancing resources between under-sized and over-sized servers, you can ensure that your critical services are always well provisioned.

Happier IT teams

Another indirect impact can be seen in your IT team. With a good server monitoring tool, your IT team can achieve better results with less effort. Let's see how.

An IT team made a switch from an outdated server monitoring tool to a newer, better tool. Before, the IT admins had to create custom SNMP monitors to track their servers. Each custom monitor required scripting and could only retrieve basic data about the server. Moreover, since newer Windows servers stopped supporting SNMP, they had to separately install SNMP agents on their servers.

The new tool they switched to supported various monitoring protocols like SNMP, WMI, and CLI. Monitors were available out-of-the-box and could be deployed instantly. The tool also incorporated ML/AI, which further reduced the effort needed to monitor the servers. The team found that with repetitive, manual work reduced, they could spend more time strategizing and innovating, improving their work-life balance and productivity.


What are the key performance indicators we need to monitor with server monitoring?

Here, we have a list of performance indicators that might be critical for various server types.

  • CPU or Memory Utilization: The percentage of CPU or RAM actively in use by the server. Setting high thresholds prevents bottlenecks, poor performance, or resource scarcity by alerting to excessive consumption.
  • Active Transactions: The count of ongoing server operations at any given moment. This helps identify the server's load, potential deadlocks between processes, or slow operations affecting overall performance.
  • Network Bandwidth Usage: The volume of data being transferred across the network connection. High usage indicates heavy traffic, potential network saturation, or extensive data transfer operations.
  • Data Files Size: The cumulative disk space occupied by application or database data files. Tracking growth is essential for proactive capacity planning to ensure sufficient storage.
  • Packet Loss Rate: The percentage of data packets that fail to reach their destination. High rates are a strong indicator of underlying network connectivity issues impacting data integrity.
  • Active Database Connections: The number of currently open connections to the database. High counts can signify high application demand, issues with connection pooling efficiency, or exceeding resource limits.
  • Interface Error Rate: The percentage of errors detected on network interfaces. High rates often suggest physical network problems, directly impacting server connectivity and data flow.
  • Logins per Second: The rate at which user authentications are occurring on the server. Spikes can indicate either a sudden increase in demand or potential security threats like brute-force attacks.
  • IO Read and Write Rate: The speed of operations for reading from and writing to disk storage. High rates suggest applications heavily reliant on disk I/O or the presence of I/O bottlenecks.
  • Transactions per Second: The number of completed server operations within one second. This metric directly reflects the server's throughput and its overall capacity for processing tasks.
  • Total Active Locks: The number of locks currently held on server resources (e.g., database tables). High counts indicate contention, poor concurrency, or potential deadlocks hindering operations.
  • Log Files Used Percentage: The proportion of disk space consumed by system and application log files. Monitoring ensures proper log rotation, archiving, and prevents disk space exhaustion.
  • Average Disk Latency: The average time taken for disk input/output operations to complete. High latency indicates slow disk performance, directly impacting application responsiveness and speed.
  • Lock Wait Time: The average duration processes spend waiting to acquire necessary locks. High wait times indicate resource contention, leading to significant performance degradation for applications.
  • Partition Details of the Device: Information about the server's disk partitions, including size and free space. This aids in capacity planning, identifying imbalances, and efficient space management.
  • Number of Deadlocks per Second: The frequency at which deadlocks, where processes block each other, occur. High rates severely impact application performance and require immediate resolution to restore functionality.
  • Data Space of DB: The total disk space utilized by the database itself. Monitoring its growth is crucial for accurate capacity planning and efficient overall database management.
  • Instance Count: The number of running instances of an application or service. This helps ensure scalability, appropriate resource allocation, and maintaining continuous service availability.
  • Cache Hit Ratio: The percentage of data requests successfully served directly from the cache. A high ratio indicates efficient caching, leading to faster responses and reduced load on backend systems.
  • Thread/Process Count: The total number of active threads or processes running on the server. High counts can indicate resource contention, inefficient code, or potential memory exhaustion.
  • Active Memory: The amount of memory actively being used by a virtual machine, as observed by the hypervisor. This indicates actual memory demand, helping optimize resource allocation.
  • Balloon Memory: Memory reclaimed from a virtual machine by the hypervisor using a balloon driver to free up physical host memory. This occurs under memory pressure, potentially impacting VM performance.
  • CPU Ready: The time a virtual machine is ready to run but cannot get CPU time from the hypervisor. High values indicate CPU resource contention, directly impacting VM performance.
  • CPU Contention: A state where multiple virtual machines compete for limited CPU resources on a host. High contention leads to increased CPU ready time and degraded virtual machine performance.
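
Several of the OS-level indicators above (CPU and memory utilization, disk I/O rates, network errors, process counts) can be sampled locally with the third-party psutil library. A minimal sketch of a one-off snapshot:

```python
import psutil  # third-party: pip install psutil

# One-off snapshot of a few OS-level KPIs from the list above.
snapshot = {
    "cpu_percent": psutil.cpu_percent(interval=1),
    "memory_percent": psutil.virtual_memory().percent,
    "disk_free_percent": 100 - psutil.disk_usage("/").percent,
    "disk_io": psutil.disk_io_counters()._asdict(),  # read/write counts & bytes
    "net_io": psutil.net_io_counters()._asdict(),    # bytes and errors in/out
    "process_count": len(psutil.pids()),
}

for metric, value in snapshot.items():
    print(f"{metric}: {value}")
```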

Like I mentioned earlier, choosing the right performance indicators depends on the type of server you're monitoring. For instance, if you're monitoring a database server, I/O operations (input and output operations) would be a key metric. Similarly, CPU ready time would be a key metric for a VM.

Certain metrics like CPU or hardware health will obviously be critical no matter what the server is. When you're setting up performance monitoring, you have to prioritize certain metrics over others and ensure that you're not overlooking any metric that might be important.

Setting up the right performance indicators will give you the best chance to optimize server performance.


What are the hallmarks of the best server monitoring tools?

Choosing the characteristics of the 'best server monitoring tool' can be tricky, since different tools cater to different use cases and market niches. What we can do is provide general characteristics suited to each market type. A tool that supports most of these characteristics may be the best fit for your specific requirements.

Home server monitoring

If your monitoring requirements are limited to the basic uptime / performance of say, under 10 server devices, you can look at the following capabilities:

  • Affordability: Open source monitoring tools are the most affordable tools at this range, as you can obtain and operate them for free.
  • Ease of use: You shouldn't have to spend hours setting up monitors for your servers. Your tool should be able to set up simple uptime and CPU checks.
  • Active user community: Having an active community of users and product support teams can be helpful in solving issues.
  • Strong support for core functionalities: When you're looking to monitor server availability and basic performance, you may not need any advanced features, but a strong support for core features would go a long way.

Small to medium business (SMB) server monitoring

For businesses with a moderate number of servers (e.g., 10-50 servers) and a need for more robust monitoring beyond basic uptime, consider these characteristics:

  • Scalability: The tool should be able to easily scale as your server infrastructure grows without significant re-configuration or cost increases.
  • Centralized dashboard: A single, intuitive dashboard to view the health and performance of all servers, making it easy to identify issues quickly.
  • Alerting and notifications: Customizable alerts via email, SMS, or other channels for critical events, ensuring prompt awareness of problems.
  • Basic reporting: Ability to generate reports on server performance trends, historical data, and resource utilization for capacity planning and troubleshooting.
  • Integration capabilities: Support for integrating with other IT management tools, such as ticketing systems or configuration management databases.
  • Technical support: Professional, SLA-based support improves the reliability of IT solutions. This is particularly helpful for SMBs where the IT teams might be smaller and under-staffed.

Enterprise server monitoring

Large enterprises with hundreds or thousands of servers, complex infrastructures, and stringent performance requirements will need tools with these advanced characteristics:

  • Comprehensive monitoring: Deep visibility into all layers of the IT stack, including physical and virtual servers, applications, databases, and network devices.
  • Advanced analytics and AI/ML: Capabilities for predictive analytics, anomaly detection, root cause analysis, and automated remediation powered by AI/ML algorithms.
  • Customization and extensibility: Highly customizable dashboards, metrics, and alerting rules, along with API access for integrating with proprietary systems and workflows.
  • Mapping and visualization: Enterprise server arrays are often complex, involving both physical and logical interdependencies. Visualization helps contextualize such architecture.
  • Security and compliance: Robust security features, role-based access control, and compliance certifications to meet enterprise-level security standards and regulatory requirements.
  • Distributed monitoring: Ability to monitor geographically dispersed datacenters and cloud environments from a central console.
  • Dedicated support: Technical support for enterprises requires intimate knowledge of the IT architecture, organizational policies, and preferences. A vendor that provides dedicated support technicians is preferable.

Cloud-native server monitoring

For organizations primarily leveraging cloud infrastructure (e.g., AWS, Azure, Google Cloud) and microservices architectures, the best tools will exhibit these traits:

  • Cloud platform native integrations: Seamless and deep integration with specific cloud provider services, APIs, and metrics for comprehensive visibility.
  • Dynamic scalability: Ability to automatically scale monitoring capabilities to match the dynamic nature of cloud environments, including auto-scaling instances and serverless functions.
  • Cost optimization: Features that help monitor and optimize cloud spending related to server resources, identifying inefficiencies and waste.
  • Container and microservices monitoring: Specialized capabilities for monitoring containers (Docker, Kubernetes) and microservices architectures, including service mesh visibility.
  • Observability (logs, metrics, traces): Unified collection and analysis of logs, metrics, and distributed traces to provide end-to-end visibility into complex cloud-native applications.

What are the server monitoring features available in OpManager?

OpManager's monitoring features enable you to visualize, monitor, and manage server infrastructure. We've discussed various server monitoring concepts and terminologies so far; now let's put them into practice with OpManager.

Server uptime monitoring

OpManager monitors the uptime of all discovered servers at regular intervals by checking whether they're available with a network ping. You can view the availability percentage of a server for a particular period, review availability timelines, and set up alarms for downtime. Uptime monitoring is important, but the quality of the network connection is also vital. To track this, OpManager also monitors packet loss and response time for each server connection. You can set up three levels of alarm thresholds to track these metrics.

Learn more about uptime monitoring with OpManager.

Server performance monitoring

OpManager monitors over 3,000 performance metrics specialized for various monitoring protocols, vendors, and device types. You can set up performance monitors with three levels of alarm thresholds and colour-coded graphs to track these metrics.

When it comes to server performance monitors, a general list of supported performance metrics can be found here. This might vary based on the type of server, but generally for every server, you can monitor performance metrics relating to the:

  • CPU
  • Memory
  • Disk
  • Network
  • Processor hardware
  • PSU (Power supply unit)
  • Cooling systems

Learn more about performance monitoring with OpManager.

Process and service monitoring

OpManager offers instant alarms for process and service outages. You can use it to fetch all active processes and services running on your servers and establish regular monitoring checks. If a process or service stops, you can set up alarms detailing the outage, the specific process or service, and the affected device. OpManager also allows you to remotely terminate and restart services or processes directly from the console.

How process and service monitoring works with OpManager.

Server Log Monitoring

OpManager enables comprehensive monitoring of Windows event logs and syslogs, to track server activity at all times. You can configure specific alarms for critical Windows event logs and establish rules to trigger alerts for syslogs matching defined criteria.

Log monitoring in OpManager.

Files, folders, URLs, and scripts

You can monitor files and folders, tracking metrics such as size, age, existence, and the presence of specific strings within files. You can also monitor URLs, checking their availability and content for specific strings. Furthermore, OpManager allows you to create custom scripts using Perl, VBScript, PowerShell, shell scripts, etc., enabling the monitoring of highly specific metrics from your server devices.

Learn more about file and folder monitoring with OpManager.

Server visualization

You can visualize servers in different formats with OpManager. Network topology maps and organization maps help you visualize layer 2 and full-stack architectures in an intuitive manner. VM maps represent virtual hosts, VMs, and datastores with their logical connections. Rack builders and datacenter floor views can be used to build faithful representations of your datacenters, complete with server racks, walkways, and walls.


How to set up server monitoring with OpManager?

OpManager has an intuitive UI that's designed to simplify the process of monitoring your IT systems. Before setting up monitoring, you'd have to scan or discover your IT systems, including your servers.

OpManager has a robust auto-discovery feature with filters and rule-based automation to scan your network, find devices, and add them to a monitoring 'inventory'. If you've added the relevant credentials (dedicated access mechanisms for various protocols like SNMP, WMI, and CLI), OpManager also collects device information like the vendor, type, and category of the device.

So before setting up server monitoring itself, OpManager provides you with a neatly-classified inventory where the virtual servers are listed based on their vendors and their role in the virtual architecture, the Windows servers are listed based on their model, and the domain controllers, exchange servers, and the MS SQL servers are all listed separately.

Now that this is ready, let's take a look at how you can set up server monitoring:

Setting up server uptime monitoring with OpManager

OpManager uses network protocols like SNMP, ICMP, and TCP to monitor the uptime or availability of devices. By default, OpManager uses ICMP to ping the IP addresses of your servers. But you can change this and use SNMP or TCP as well to monitor either IP addresses or DNS names (Some networks might block ICMP requests, and some devices may have dynamically varying IP addresses).

You can also customize the monitoring interval for uptime checks. For critical devices, you can set up intervals as low as 30 seconds. You can review the following uptime statistics in OpManager:

Availability state: Indicates whether the server is up or down. You can configure a set number of unresponsive pings before an alarm is generated.

Availability timeline: A timeline of availability for the past hours, days, and weeks to help you track service downtime.

Packet loss: The percentage (%) of packets lost during a network ping. Indicates the quality of the network connection.

Response time: The time required for the server to respond to a network ping, measured in milliseconds (ms). Indicates whether the device is up and the quality of the connection.

Setting up server performance monitors with OpManager

Performance monitors query the value of a monitored metric (like CPU utilization) at a set monitoring interval. Unlike uptime, OpManager doesn't automatically start monitoring performance metrics after discovery and classification. However, you'll be presented with a curated list of performance metrics designated for a particular server model.

You can edit the monitoring interval and set up alarms for these metrics to start monitoring them. The alarms have three levels of severities. Severities indicate the urgency of an alarm and provide more granular control and visibility into alarms.

  • Attention: This is the lowest severity of alarms. These alarms are colour-coded yellow and can be used to indicate a potentially dangerous scenario that, as the name suggests, needs your IT admin's attention.
  • Trouble: This is a medium-level alarm severity, colour-coded orange. These alarms are useful for denoting issues that spell trouble but aren't quite critical yet.
  • Critical: This is the highest-tier alarm severity, aptly coloured red. These can be used to alert for issues that would most definitely affect your systems.

When it comes to performance monitoring or certain availability metrics like response time and packet loss, you have the option to set thresholds for each severity. Thresholds, as mentioned before, are values or limits that you can configure manually.

For instance, you can provide an attention level threshold for a monitored metric like packet loss. If the monitored value crosses the value you provided, an alarm is generated with the attention level severity. Similarly, you can set up trouble and critical thresholds; if the monitored value continues to break the trouble and critical thresholds as well, the alarm will be updated with a trouble and critical severity respectively.

OpManager has a functionality known as re-arm. The re-arm is a value that resets the alarm state to 'cleared'. Naturally, for a metric where higher values indicate failure (like CPU utilization), this will be a lower value than all the other thresholds. If the monitored value goes below it, the alarm is updated as cleared.

Let's see an example:

For instance, if you want to track the quality of network connections to your servers, you can set up packet loss thresholds of: 50% attention level, 70% trouble level, and 90% critical level. You can also set up a re-arm value of 45%.

When the packet loss crosses 50%, you get an attention alarm. If the packet loss continues to go over 70% and 90%, you get trouble and critical alarms. If, by some chance, it goes down below 45% after this, the alarm is updated as cleared. However, the record of the previous violations will be listed in the alarm, so you can track the performance of the device.
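
Here's a minimal sketch of that escalation-and-re-arm logic, using the same illustrative values (50/70/90% thresholds, 45% re-arm):

```python
# Escalating thresholds with a re-arm value, mirroring the packet-loss
# example above. Values are illustrative, not defaults of any tool.
LEVELS = [("critical", 90.0), ("trouble", 70.0), ("attention", 50.0)]
REARM = 45.0

def evaluate(packet_loss: float, current: str | None) -> str | None:
    """Return the alarm severity after this poll (None means cleared)."""
    if packet_loss < REARM:
        return None  # recovered below the re-arm value: alarm cleared
    for name, limit in LEVELS:  # checked from most to least severe
        if packet_loss >= limit:
            return name
    return current  # between re-arm and the lowest threshold: keep state

state = None
for sample in [30, 55, 75, 93, 47, 40]:
    state = evaluate(sample, state)
    print(f"packet loss {sample}% -> alarm severity: {state}")
```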

Setting up process and service monitors with OpManager

OpManager can reach server devices and collect active processes and Windows services. In addition, you can also monitor TCP services using port checks.

Process monitoring: For each server, you can fetch all active running processes with a single click. You can set up alarms for process availability (whether the process is running), process CPU utilization, memory utilization, and instance count.

Service monitoring: OpManager monitors whether TCP services (Like MSSQL service or DNS services) are responding in their respective ports. You can specify the name of the service and the port it's available at, and you can set three levels of alarms for the response times of each service.

Windows service monitoring: Similar to process monitoring, OpManager fetches the active Windows services running on your servers. You can set up an alarm to alert you if a service is unavailable after a certain number of polls. You can further configure OpManager to restart the service, or restart the server entirely, in case the service isn't running.


Best practices for server performance monitoring with OpManager

Now that we've seen how you can set up server monitoring with OpManager, let's take a look at some basic best practices you can follow to simplify this process.

#1 Setting up and customizing server monitors in bulk

There are multiple ways for you to set up the various monitoring features that we mentioned in the previous section. You can leverage the features mentioned below to set up and refine monitoring configurations to best fit your unique requirements.

Using device templates and monitor templates: A template in OpManager refers to a pre-set configuration that applies to either a particular device type or a monitor. Device templates are available for a particular vendor and type of device.

For instance, a Windows 2016 server would have its own template. This template lists things like the availability or uptime polling intervals, the polling mechanism (SNMP/ICMP/TCP), as well as the curated performance metrics and their alarm thresholds. Any changes made at the template level are applied to all devices that fall under it. For instance, if I set up a CPU utilization monitor with a 90% critical threshold for the Windows 2016 server template, all Windows 2016 servers I monitor will have the same configuration.

The same also applies for monitor templates, where each monitor is a specific template. Monitor templates are available for performance monitors (Like CPU utilization, disk space, and IOPs), process monitors, service monitors, and so on. OpManager has over 11,000 device templates and 3000+ performance monitoring templates.

Using the quick configuration wizard: The quick configuration wizard enables you to execute configuration actions in bulk. When it comes to server monitoring, you can execute the following actions:

  • Managing device templates
  • Setting up thresholds
  • Configuring monitoring intervals
  • Creating service monitors

Using the quick configuration wizard and the templates in OpManager, you can set up monitoring for thousands of servers in a matter of minutes. Once the monitors are set up in bulk, you can go to individual servers and fine-tune the monitoring configurations.

#2 Automating performance monitoring with AI/ML

Instead of setting up alarm thresholds manually, you can leverage OpManager's adaptive thresholds to calculate, set, and update alarm thresholds automatically on a regular basis. Adaptive thresholds use AI/ML to predict the normal or expected performance for your server's performance metrics. It uses monitored data for each performance metric to train the ML engine and uses predictions to set three levels of alarm thresholds for every performance metric for every hour.

This vastly reduces the manual effort needed to calculate, set, and update alarm thresholds, particularly when you're monitoring hundreds of servers. It also cuts down false positives (which happen when static thresholds flag incidents that are actually normal system behaviour) and consequently improves anomaly detection.

OpManager generally needs 14 days of performance data to start predicting device behaviour accurately. So it's recommended to use manual thresholds to monitor your devices until OpManager has 14 days worth of performance data. Once OpManager has enough data, you can enable adaptive thresholds.

#3 Performing server capacity sizing with AI-powered insights

Like I mentioned earlier, server capacity planning is a vital process that helps ensure that your servers run optimally without resource constraints. Performance monitoring helps you track this, but you can still get caught unawares by sudden capacity constraints.

OpManager's AI/ML engine helps resolve this with two features:

Forecast reports: Forecast reports track the utilization of your server's key resources like CPU, memory, and disk space. They collect the current value of each resource and use ML to predict the number of days left until they reach 80%, 90%, and 100% capacity. You can schedule this report at regular intervals to review your server capacity utilization and plan capacity sizing when required.

Forecast alarms: Forecast alarms can be used to warn you before critical server resources run out. You can set them up to generate alarms when only a specific number of days are left until a particular resource (like disk space) runs out. The number of days is calculated with the help of ML.

For instance, if you set up a forecast alarm for server disk space running out in 30 days, OpManager analyzes the server's disk space with ML, and when it determines that there are 30 days left until the disk space runs out, it creates an alarm, leaving you with enough time to procure more space.
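
To illustrate the idea (this is a simple stand-in, not OpManager's actual ML model), a linear trend over recent daily disk-usage samples can estimate days to exhaustion:

```python
def days_until_full(daily_usage_percent: list[float]) -> float | None:
    """Estimate days until 100% disk usage via a least-squares trend."""
    n = len(daily_usage_percent)
    x_mean = (n - 1) / 2
    y_mean = sum(daily_usage_percent) / n
    num = sum((x - x_mean) * (y - y_mean)
              for x, y in enumerate(daily_usage_percent))
    den = sum((x - x_mean) ** 2 for x in range(n))
    slope = num / den  # percentage points of growth per day
    if slope <= 0:
        return None  # usage is flat or shrinking; nothing to forecast
    return (100.0 - daily_usage_percent[-1]) / slope

usage = [62.0, 62.8, 63.5, 64.1, 65.0, 65.9, 66.5]  # made-up daily samples
days = days_until_full(usage)
if days is None:
    print("No upward trend; no forecast alarm")
elif days <= 30:
    print(f"Forecast alarm: about {days:.0f} days of disk space left")
else:
    print(f"Healthy: about {days:.0f} days until the disk fills up")
```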

#4 Using consecutive polls and uplink dependencies to eliminate false positives

Consecutive polls: Getting alarms for one-off occurrences can often lead to alarm floods. For instance, if your server's CPU utilization crosses the alarm threshold you set up for a few seconds, it may not mean much. But if the CPU stays at such high levels for longer, it most likely indicates a deeper issue. Consecutive polls are used to separate one-off issues from persistent ones.

If you set your consecutive poll count to three, OpManager won't create an alarm for the first two violations but will raise one on the third consecutive violation. This means that when you get an alarm, you get it for real, persistent issues that are worth your time and effort.
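
A minimal sketch of that suppression logic, with illustrative poll values:

```python
# Suppress alerts until a threshold is violated on N consecutive polls.
CONSECUTIVE_POLLS = 3
THRESHOLD = 90.0  # e.g., CPU utilization %

violations = 0
for cpu in [95, 40, 92, 93, 96, 50]:  # illustrative poll results
    violations = violations + 1 if cpu > THRESHOLD else 0
    if violations == CONSECUTIVE_POLLS:
        print(f"Alarm: CPU above {THRESHOLD}% for {violations} consecutive polls")
    else:
        print(f"CPU {cpu}% -> no alarm (streak: {violations})")
```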

Uplink dependencies: Another potential way for false positives to occur is when the parent devices go down. Let's say you have 10 servers that are dependent on a single access switch to connect to the network. OpManager is monitoring both the switch and the servers. In case the switch fails, you'd get 11 alarms.

But you can prevent this by designating an uplink dependency for the servers. That way, OpManager understands that the switch is a parent device the servers depend on. If the switch goes down, OpManager only generates an alarm for the switch.

#5 Leveraging customizable colour codes

By default, alarm colour codes are set as:

  • Yellow for attention
  • Orange for trouble
  • Red for critical
  • Green for clear
  • Violet for service down

But you can customize this colour scheme to align the alarms you get with your colour preferences. This can also help admins who might be colour-blind.

#6 Setting up monitoring graphs for performance monitoring

When you're setting up performance monitors in OpManager, you have an option to enable graphs for these monitors. Graphs present data in a colour-coded, time-stamped format with options to customize the data collection time period. You can also leverage OpManager's AI/ML engine to forecast the data from the graphs and extrapolate it across the next hours, days, and weeks.

#7 Improving data polling efficiency with data collection manager

The data collection manager (DCM) is designed to optimize polling efficiency and reduce the overall load on your monitoring server. It acts as a watchdog, preventing unnecessary polling attempts on unresponsive servers.

For example, if you monitor a server at 5-minute intervals and it responds to the polls, it remains in a normally monitored state. However, if OpManager receives no response for, say, 6 hours, the DCM moves the server to a suspended state. Here, the server is polled less frequently to conserve resources, perhaps once every 4 hours for 10 days. If no response is received even then, the server's status shifts to disabled, halting all polling. You can customize the polling intervals for each state and review all disabled polls in data collection reports. If you prefer not to disable devices after unresponsive polls, you can also keep them in the suspended state, where polling is more efficient.
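
The general pattern behind such a watchdog is a small state machine keyed on how long a device has been unresponsive. A hedged sketch using the illustrative timings from the example above (real intervals are configurable):

```python
# Polling states driven by how long a device has been unresponsive.
# Timings mirror the example above and are illustrative, not defaults.
def polling_state(hours_unresponsive: float) -> str:
    if hours_unresponsive < 6:
        return "normal"      # poll at the regular interval (e.g., 5 min)
    if hours_unresponsive < 6 + 10 * 24:
        return "suspended"   # poll sparsely (e.g., every 4 hours)
    return "disabled"        # stop polling entirely

for hours in [0, 7, 200, 300]:
    print(f"{hours}h unresponsive -> {polling_state(hours)}")
```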

Managing polling activity based on server responsiveness ensures efficient resource allocation and minimizes wasted effort on unresponsive devices, improving overall monitoring performance.

#8 Providing additional context with Custom fields

The Custom Fields feature in OpManager is designed to optimize server monitoring by allowing you to add specific, descriptive properties for each server. This acts as a comprehensive documentation tool, preventing information silos and enhancing data organization for your monitoring server.

For example, you can add custom fields like 'Department', 'Location', 'Criticality Level', or 'Maintenance Schedule' to each server. This enriches the monitoring data, moving beyond basic availability and performance metrics to provide crucial business context. If you need to quickly identify all servers belonging to a specific department or those with a high criticality level, custom fields enable efficient filtering and grouping.

Custom fields add more context to your reports. Moreover, OpManager can suggest potential device groups based on the custom fields associated with them. Custom fields can also be associated with IT assets in your ITSM tools when you integrate OpManager with ITSM tools.


How does OpManager help you monitor servers?

With OpManager fully deployed in your IT infrastructure, you can ensure the following things:

Proactive fault identification

OpManager monitors your servers for uptime, performance, processes, services, files, folders, and URLs. Detected issues are raised instantly as context-rich alarms. With AI/ML-driven adaptive thresholds, you can detect anomalies in real time with a high level of automation.

Faster incident response

OpManager has built-in integrations with ITSM tools like ServiceDesk Plus, ServiceNow, Jira, and Freshdesk (you can also build integrations with any other tool with the help of a custom integration builder). This means you can convert context-rich alarms into tickets and kick-start incident response.

Automated fault remediation

When server faults occur, you can automate remediation actions like restarting servers, processes, and services; powering VMs on and off; executing scripts; and logging tickets. OpManager's workflows have over 70 actions that can be arranged and executed in a code-free, drag-and-drop interface.

Server performance optimization

You get visibility into server environments: processors, memory, disks, traffic, hardware, virtual networks, VMs, and datastores. You can track KPIs for each of these components and detect potential bottlenecks. Removing these bottlenecks helps you optimize server performance.

Server capacity management

OpManager gives you detailed insights into server resource utilization. You can track resource-starved and resource-greedy servers and prioritize resource provisioning. You can also analyze VM sprawl and forecast resource utilization trends to perform informed capacity planning.


What are the types of servers I can monitor with OpManager?

OpManager supports all major server types, functionalities, and vendors by default. There are over 350 server templates with dedicated performance monitors. For each supported server, you can monitor the reachability, processor performance, traffic statistics, hardware health, etc.

Based on server types

  • Windows servers
  • Linux servers
  • HP-UX servers
  • IBM AIX servers
  • VMware hosts and VMs
  • Hyper-V servers
  • Xen servers
  • Nutanix servers
  • Proxmox servers

Based on server functionalities

  • App servers
  • Web servers
  • Mail servers
  • File servers
  • Database servers
  • Domain controllers
  • Virtual servers

Based on server vendor

  • Dell
  • HPE
  • IBM
  • Cisco
  • Lenovo
  • Fujitsu
  • Huawei
  • Supermicro, and more.

5 reasons to choose ManageEngine as your server monitoring vendor

So far, we have talked about the basics of server monitoring, the mechanisms used to monitor servers, the basic expectations from a server monitoring tool, and the hallmarks of a good server monitor.

We also talked about OpManager's server monitoring capabilities, including brief guidelines for setting up monitors and best practices for getting the most optimal performance. Now, let's take a look at five reasons explaining the advantages you get when you choose ManageEngine as your IT solutions vendor.

Cost-effective and flexible

ManageEngine offers flexible pricing with multiple deployment types and license models. Depending on your monitoring requirements, you can choose plans that maximize your returns.

Deployment types

  • On-premise: OpManager Plus (On-premise full-stack monitoring)
  • Cloud-based: Site24x7 (Cloud-based full-stack monitoring)

Editions

  • Home server monitoring: Free license tier (No licensing charges), standard edition (Core features)
  • Small to medium (SMB): Professional edition (Advanced features, AI/ML engine)
  • Enterprise: Enterprise edition (Distributed monitoring)
  • Managed service providers: OpManager MSP (Multi-client monitoring)

License types

  • Perpetual: Pay once, use forever. You can pay an annual maintenance subscription to continue accessing our technical support
  • Subscription: Pay to renew the license at set intervals

OpManager is priced competitively against other market players, delivering more functionality at a lower price. Learn more about our pricing on our store page.

Intuitive and automated

From discovering the servers in OpManager to monitoring, getting alerted, and troubleshooting issues, you get an intuitive interface, a high level of automation, and ergonomic functionalities. Let's see how.

  • Discovering and onboarding servers: Automated discovery and classification
  • Assigning performance monitors: Pre-configured device templates and curated monitors
  • Setting up performance monitors: AI driven adaptive thresholds
  • Detecting and escalating incidents: Multi-channel notifications, ITSM integrations
  • Incident response: Automated remediation workflows

Regional support teams & user communities

ManageEngine is used by over 1 million IT admins from 200+ countries. We maintain dedicated teams in different regions to cater to these users. Our users are also very active in our community forum: Pitstop. This allows you to find solutions to your problems, either from our community or with the help of our support technicians.

Dedicated to product development & R&D

In a market characterized by incessant price hikes, product acquisitions, and takeovers, ManageEngine builds IT tools from the ground up. This offers multiple advantages:

  • Homegrown AI engines: Zia, the in-house AI/ML engine, is built to seamlessly enable AI-driven functionalities across IT solutions.
  • Powerful integration ecosystems: Homegrown IT tools integrate more intuitively and seamlessly.
  • Customization and adaptability: Dedicated DevOps engineers provide customized solutions to unique user challenges.

Data protection standards & security features

ManageEngine's IT solutions offer robust authentication and encryption to ensure that your data and information are always safe. OpManager includes advanced security features like:

  • Multi-factor authentication
  • SAML authentication
  • SHA hashing and AES encryption
  • File integrity monitoring
  • SSL configuration
  • Password protection for sensitive data
  • Trusted certificates

In addition to this, ManageEngine is also compliant with international privacy and security standards like:

  • ISO/IEC 27001
  • ISO/IEC 27701
  • FIPS 140-2
  • GDPR

Let me conclude: Server monitoring is a diverse and extensive field with varying mechanisms, requirements, and expectations. Choosing a server monitoring tool that aligns with your unique environment gives you the best chance at optimizing server performance. In today's IT landscape, applications, services, and data are distributed across complex infrastructures, and optimal server performance translates directly into a positive user experience and business success.


By Visakh,

Product marketer, ManageEngine

Editorial expert who enjoys translating the technical jargon of the IT industry into relatable, easy-to-read content. Specializes in ITOps, network monitoring, and full-stack observability.

Discover more about ManageEngine OpManager

Try a free, 30-day trial of OpManager

Download now
 