Network monitoring forms the backbone of IT and is essential to prevent unplanned outages which can plague your business. As a case in point, look to the downing of Facebook's network in 2021.
Forbes estimated that the downtime, which spanned a few hours, could have cost around $65 million. Though the exact figures have not been officially released, Facebook declared in its official post that "ads did not deliver during the time our systems were offline, and advertisers were not and will not be charged for ads during the outage." They also mentioned compensating the impacted customers.
This clearly highlights two aspects:
1. An unplanned outage could wreak havoc on a business' finances.
2. An outage also leaves customers unhappy.
This is why investing in network monitoring solutions is a wise and safe move. With remote work becoming the norm, monitoring tools enable companies to monitor the network from anywhere in the world and help maintain peak performance throughout.
When you monitor your network, there are many aspects to keep track of, and this can make monitoring a complex and stressful job.
Focusing on the most important network performance monitoring metrics will simplify this, enabling you to understand the performance of your network clearly. Analyzing the performance based on certain metrics will give you an overall perspective of your network performance and also aid in identifying specific issues that dampen the end-user experience.
Among network monitoring metrics, the most fundamental and crucial one is availability. Availability tells you how much of the time your network was up and reachable. Most monitoring tools use ping to poll a device at regular intervals and determine its availability.
Measuring packet loss and response time alongside the ping results gives you a more accurate picture of availability.
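As a rough illustration of how polled results turn into an availability figure, the sketch below computes the percentage of successful polls. The function name and the sample data are assumptions made for this example, not the output of any particular monitoring tool.

```python
from typing import Sequence


def availability_percent(poll_results: Sequence[bool]) -> float:
    """Return the share of successful polls as a percentage."""
    if not poll_results:
        raise ValueError("at least one poll result is required")
    # True counts as 1 and False as 0, so sum() gives the successful polls.
    return 100.0 * sum(poll_results) / len(poll_results)


# Hypothetical data: one entry per ping poll, True if the device replied.
# 288 polls corresponds to one day of polling every 5 minutes.
polls = [True] * 287 + [False]
print(f"Availability: {availability_percent(polls):.2f}%")
```

A single missed poll in a day already pulls the figure below 100%, which is why the polling interval you choose affects how precisely availability can be reported.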
CPU and memory utilization are basic pillars for monitoring the devices and applications in your network.
CPU usage or utilization means the amount of CPU resources used to run an application or to render a service. This tells you about the applications running as well as about the performance of your system.
High CPU utilization does not necessarily mean your system is incapable of handling the load, but if it is constant it can point to the need for proper load balancing.
Memory may seem like a resource you think about only when you run short of it, but high memory utilization can slow down operations and degrade application performance.
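To separate a brief spike from constantly high utilization, one simple approach is to look for a run of consecutive samples above a threshold. The sketch below works for CPU or memory readings alike; the 90% threshold and the five-sample window are illustrative assumptions, not recommended values.

```python
from typing import Sequence


def longest_run_above(samples: Sequence[float], threshold: float) -> int:
    """Length of the longest run of consecutive samples above the threshold."""
    longest = current = 0
    for value in samples:
        # Extend the current run, or reset it when a sample is at or below the threshold.
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def is_sustained(samples: Sequence[float], threshold: float = 90.0,
                 min_consecutive: int = 5) -> bool:
    """True if utilization stayed above the threshold for min_consecutive samples."""
    return longest_run_above(samples, threshold) >= min_consecutive


# Hypothetical utilization readings in percent.
spiky = [40, 95, 42, 38, 97, 41]
sustained = [91, 93, 96, 92, 95, 94]
print(is_sustained(spiky), is_sustained(sustained))
```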
Bandwidth and throughput are explained together because of a common misconception that they are the same thing and that the two terms can be used interchangeably.
Bandwidth refers to the maximum rate at which data can be exchanged in your network, as allowed by your service provider. You can use bandwidth as a baseline value and fine-tune the routers and connecting channels in your network to keep data transfer rates as close as possible to the bandwidth.
Throughput, on the other hand, tells you the actual rate of data transfer within your network. When throughput drops, it can point to high packet loss, and you can troubleshoot the associated devices to find the cause.
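The difference between the two can be made concrete with a small calculation. The sketch below derives throughput from two readings of an interface byte counter (such as the octet counters exposed over SNMP) and compares it with the link bandwidth. The function names and sample values are assumptions for illustration.

```python
def throughput_bps(octets_start: int, octets_end: int, interval_s: float,
                   counter_bits: int = 64) -> float:
    """Throughput in bits per second between two byte-counter readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # The modulo handles a counter that wrapped around between the two readings.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return delta_octets * 8 / interval_s


def utilization_percent(throughput: float, bandwidth_bps: float) -> float:
    """Throughput expressed as a percentage of the available bandwidth."""
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth_bps must be positive")
    return 100.0 * throughput / bandwidth_bps


# Hypothetical readings taken 60 seconds apart on a 100 Mbps link.
rate = throughput_bps(1_000_000, 76_000_000, interval_s=60)
print(f"{rate / 1e6:.1f} Mbps, {utilization_percent(rate, 100e6):.1f}% of bandwidth")
```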
Read more on the relationship between throughput and SNR value.
Latency tells you the total time taken between a request and its response. It is a useful metric because it reflects the responsiveness that the end user actually experiences.
Latency is usually impacted by under-performing devices such as a faulty router or an overloaded server. It is also affected by the geographical distance between the client and the server.
Once you find the cause, you can reduce latency either by fixing the devices or by setting up extra servers in new locations to handle the extra load.
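When reviewing latency samples, an average alone can be misleading, since a single slow response distorts it. The sketch below summarizes a set of response times with the mean, the median, and a nearest-rank 95th percentile; the sample values are made up for illustration.

```python
import math
from statistics import mean, median
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (0 < pct <= 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical round-trip times in milliseconds, including one slow outlier.
rtt_ms = [21.0, 23.5, 22.1, 250.0, 24.0, 22.8, 23.1, 21.9, 22.4, 23.0]
print(f"mean={mean(rtt_ms):.2f} ms  median={median(rtt_ms):.2f} ms  "
      f"p95={percentile(rtt_ms, 95):.1f} ms")
```

Here the median stays close to the typical response time while the 95th percentile exposes the outlier, which is the kind of sample worth tracing back to a specific device or location.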
Mean opinion score (MOS) is an effective way of measuring the quality of Voice over IP (VoIP) calls. The score tells you the quality of the call as perceived by a human being, which is why it is called an opinion score. It ranges from 1 to 5, and the higher the number, the better the call quality.
As organizations increasingly depend on voice calls and video calls, analyzing the performance of those calls is important to optimize your network so that users enjoy seamless and uninterrupted calls.
Analyzing the associated metrics such as jitter, packet loss, round trip time, and latency will give you a comprehensive picture of call performance, which simplifies troubleshooting.
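To show how these associated metrics relate to the score, the sketch below estimates MOS from latency, jitter, and packet loss. The mapping from R-factor to MOS follows the formula in ITU-T G.107; the way the R-factor itself is derived here is a commonly circulated rule of thumb, not the full E-model, so treat the result as an approximation. Note that this mapping tops out at 4.5 rather than 5. All function names are assumptions for this example.

```python
def r_factor_to_mos(r: float) -> float:
    """Map an R-factor to an estimated MOS (ITU-T G.107 mapping, clamped to 1.0..4.5)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    mos = 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6
    # The polynomial dips slightly below 1 for very small R, so clamp the result.
    return min(4.5, max(1.0, mos))


def estimate_mos(latency_ms: float, jitter_ms: float, loss_percent: float) -> float:
    """Rule-of-thumb MOS estimate; a simplification, not the full E-model."""
    # Assumption: jitter is weighted double and 10 ms is added for codec delay.
    effective_latency = latency_ms + 2 * jitter_ms + 10
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10
    # Assumption: each percent of packet loss costs 2.5 R-factor points.
    r -= 2.5 * loss_percent
    return r_factor_to_mos(r)


# Hypothetical call conditions: a healthy link and a degraded one.
print(f"healthy:  {estimate_mos(20, 2, 0):.2f}")
print(f"degraded: {estimate_mos(150, 30, 5):.2f}")
```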
Data is received and transmitted in the form of packets in the network. So, analyzing the packets is important and helps you troubleshoot issues like sluggish network speed.
Packet loss is one metric that you can focus on. It means that packets are lost (somewhere) during transit. It is a powerful metric that tells you why your VoIP calls are truncated or why a download takes longer than is ideal.
With the help of your monitoring software you can trace the exact spot where the packet loss occurred in your network to resolve it.
Packet discard is another useful metric, and it is not the same as packet loss. When you monitor packet loss, you are trying to identify where a packet was dropped along the way, which means the packet never reached its destination. Packet discard concerns a packet that has reached its destination but has a problem: the packet may be damaged, or it may be a packet that was never requested.
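Both metrics are usually reported as percentages. The sketch below shows the arithmetic, keeping loss (packets that never arrived) separate from discards (packets that arrived but were rejected). The function names and counter values are assumptions for illustration.

```python
def packet_loss_percent(sent: int, received: int) -> float:
    """Percentage of sent packets that never reached the destination."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


def packet_discard_percent(discarded: int, arrived: int) -> float:
    """Percentage of arrived packets that the destination discarded."""
    if arrived <= 0:
        raise ValueError("arrived must be positive")
    if not 0 <= discarded <= arrived:
        raise ValueError("discarded must be between 0 and arrived")
    return 100.0 * discarded / arrived


# Hypothetical counters: 1000 probes sent, 985 replies; 12 of 4000 arrivals discarded.
print(f"loss: {packet_loss_percent(1000, 985):.1f}%")
print(f"discards: {packet_discard_percent(12, 4000):.1f}%")
```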
Network configurations are not a metric in themselves; however, managing configurations is important to reduce the time required to resolve outages.
For example, suppose a device in your data center suddenly fails, bringing your services to a crashing halt. You arrange a replacement device and physically install it.
But to operate the device you need to configure it. Manually configuring the new device is time-consuming, and it is nearly impossible to remember all the configuration changes that were made incrementally over time. By backing up the configs, you can set up the device and restore your services quickly.
Faulty configurations can also affect device performance, sometimes even triggering an outage. The Facebook network outage cited earlier was triggered by a configuration change. So managing configurations is crucial to keeping your network healthy in the long term.
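One way to catch an unintended configuration change is to compare the running configuration against the last known-good backup. The sketch below does this with Python's standard `difflib` module; the configuration text is a made-up example, and how you retrieve the two configurations from a device is left out.

```python
import difflib
from typing import List


def config_diff(backup: str, running: str) -> List[str]:
    """Return unified-diff lines between a backed-up and a running configuration."""
    return list(difflib.unified_diff(
        backup.splitlines(),
        running.splitlines(),
        fromfile="backup",
        tofile="running",
        lineterm="",
    ))


# Hypothetical configurations: the running config has the interface shut down.
backup_cfg = """hostname edge-router-1
interface GigabitEthernet0/1
 ip address 10.0.0.1 255.255.255.0
 no shutdown"""
running_cfg = """hostname edge-router-1
interface GigabitEthernet0/1
 ip address 10.0.0.1 255.255.255.0
 shutdown"""

changes = config_diff(backup_cfg, running_cfg)
print("\n".join(changes) if changes else "No configuration drift detected")
```

An empty result means the running configuration still matches the backup; any output shows exactly which lines changed, which shortens the search when a change triggers an issue.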
Syslogs again are not a specific performance metric but they do help you identify issues as and when they appear by enabling you to narrow down the cause.
By forwarding syslogs from the end device to a syslog server, you can pinpoint the issue. The beauty of syslogs is that each message carries a severity level, such as error, informational, or debug, that indicates how critical it is. The severity enables you to decide whether to act upon a certain event immediately or attend to it later.
For example, you can configure syslog messages to be raised as alerts when a certain number of incorrect login attempts are made. From the syslog message you can tell that someone may be attempting unauthorized entry into your network and take corrective steps immediately.
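The severity of a syslog message is encoded in the priority value at the start of the message, where priority = facility × 8 + severity. The sketch below decodes that value and flags messages that are severe enough to act on right away. The sample messages and the cut-off at "error" are assumptions for illustration.

```python
import re
from typing import Tuple

# Severity names in order of their numeric codes 0..7 (0 is the most severe).
SEVERITIES = ("emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug")
PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def parse_priority(message: str) -> Tuple[int, int, str]:
    """Return (facility, severity code, severity name) from a syslog message."""
    match = PRI_PATTERN.match(message)
    if not match:
        raise ValueError("message does not start with a <PRI> field")
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        raise ValueError("priority value out of range")
    facility, severity = divmod(pri, 8)
    return facility, severity, SEVERITIES[severity]


def needs_immediate_action(message: str, max_severity: int = 3) -> bool:
    """True if the severity code is at or below max_severity (error or worse)."""
    return parse_priority(message)[1] <= max_severity


# Hypothetical messages: a failed login (critical) and a routine notice.
failed_login = "<34>Oct 11 22:14:15 host1 sshd: authentication failure for admin"
routine = "<190>Oct 11 22:15:02 host1 app: scheduled job completed"
print(parse_priority(failed_login), needs_immediate_action(failed_login))
print(parse_priority(routine), needs_immediate_action(routine))
```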
Most of the network monitoring software available in the market supports various key performance metrics and helps you monitor your network. But that alone is not enough if you want to achieve your business outcomes.
A good monitoring tool will help you reduce the time wasted on mundane, repetitive tasks and enable you to use that time for other meaningful work.
So when purchasing a monitoring tool, we suggest you consider the following aspects:
ManageEngine OpManager ticks all these boxes and is suitable for SMBs as well as enterprises. It can monitor a wide range of devices including servers, switches, load balancers, routers, storage devices, and anything else that has an IP address and is connected to your network.
It gives a unified, intuitive view of your network performance with granular visibility into each individual device, making troubleshooting easier.