Availability monitoring - Troubleshooting

Availability monitoring in OpManager allows you to continuously monitor the availability of network devices, servers, applications, and services. By monitoring key metrics such as uptime, response times, and outage durations, OpManager helps IT teams maintain high availability and reliability across their network infrastructure. With proactive alerts and real-time monitoring capabilities, OpManager ensures that potential issues are addressed swiftly, minimizing impact on business operations and maintaining optimal performance levels.

This help document covers the steps to troubleshoot the errors encountered in Availability monitoring.

1. Error: 'Unable to contact IP driver. General failure'

This alert is generated when the OpManager server fails to contact the monitored device during its periodic availability poll. The error typically appears in virtualized environments where the virtual machines run Windows and cannot communicate outside their network, for one of the following reasons.

Error: Hyper-V – WinSock issue

Cause:

This error can occur when the WinSock and WinSock2 settings in the VM are corrupted.

Solution:

Locate the following registry keys:

HKLM\SYSTEM\CurrentControlSet\Services\WinSock
HKLM\SYSTEM\CurrentControlSet\Services\WinSock2

Back up these registry keys.

On another server running the same OS configuration, export the same registry keys and copy the exported .reg files to the affected server.

Double-click each .reg file to import it into the registry, then reboot the system and check whether the issue is resolved.

Source
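The backup and export steps can also be run from the command line with the built-in Windows reg export command. The following Python sketch only builds and runs those commands; the backup folder and file names are hypothetical, and it assumes it is run from an elevated prompt on Windows.

```python
# Sketch: back up or export the WinSock registry keys using the built-in
# Windows "reg export" command. Folder and file names are hypothetical.
import subprocess
import sys
from pathlib import Path

WINSOCK_KEYS = [
    r"HKLM\SYSTEM\CurrentControlSet\Services\WinSock",
    r"HKLM\SYSTEM\CurrentControlSet\Services\WinSock2",
]

def export_commands(out_dir):
    """Build one 'reg export <key> <file> /y' command per WinSock key.

    /y overwrites an existing output file without prompting.
    """
    commands = []
    for key in WINSOCK_KEYS:
        file_name = key.rsplit("\\", 1)[-1] + ".reg"  # e.g. WinSock.reg
        commands.append(["reg", "export", key, str(Path(out_dir) / file_name), "/y"])
    return commands

if __name__ == "__main__" and sys.platform == "win32":
    # Hypothetical folder; use a different one on each server so that the
    # backup of the affected server is not overwritten by the copied files.
    for cmd in export_commands(r"C:\regbackup"):
        subprocess.run(cmd, check=True)
```

Run it once on the affected server to back up its current keys, and once on the reference server to produce the files to copy. On the affected server, reg import <file> is the command-line equivalent of double-clicking a .reg file.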

Error: VM duplicate Security Identifier issue

Cause:

This issue is caused by a duplicate Security Identifier (SID) in a Windows Server 2008 or Windows Server 2012 virtual machine that was deployed from a template or cloned from another virtual machine without selecting the guest customization option.

Solution:

To resolve the issue, run the Sysprep tool to generate a new SID for the virtual machine:

Open a console to the affected Windows virtual machine.
Open an elevated command prompt: right-click Command Prompt and select Run as administrator.
Change to the C:\Windows\System32\sysprep directory.
Run the sysprep command.
When the System Preparation Tool window appears, select the Generalize check box and leave all other settings at their default values.
Reboot the virtual machine to apply the changes.

Source

Error: TCP/IP issues

Cause:

If you cannot ping the loopback address (127.0.0.1) or the local machine, the TCP/IP stack may be corrupted.

Solution:

Turn off User Account Control (UAC) and log in with a domain administrator account. Then follow the steps below to reset TCP/IP to its default state:

On the Start screen, type CMD. In the search results, right-click Command Prompt, and then select Run as administrator.
At the command prompt, enter the following command and press Enter:

netsh int ip reset resetlog.txt

Restart the computer.

When you run the reset command, it overwrites the following registry keys, both of which are used by TCP/IP:

SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
SYSTEM\CurrentControlSet\Services\DHCP\Parameters

Source

2. Error: Availability value mismatch in the device availability report

The Uptime column in the 'Device Availability' report shows an incorrect value even when availability is 100%.

Cause:

Report availability data is fetched from either the hourly or the daily archive table, depending on the time period selected when the report is generated (the default is 'Last 24 hours'). The device snapshot page, in contrast, reads availability data from the raw table. A mismatch in the 'Uptime' column of the report is therefore most likely caused by a missing entry for that time period in the corresponding archive table.
For example, if the availability monitoring interval is 15 minutes, the raw table holds 4 entries per hour. The average of those 4 values is calculated and written to the hourly table. If the OpManager service is down, or the database is disconnected (with MSSQL), when the hourly archive runs, the update to the hourly table fails and the generated report shows incorrect data.
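
To see how a failed hourly archive can make the Uptime column disagree with a 100% availability figure, the following Python sketch models the archive step described above. It assumes, purely for illustration, that each hourly row stores the average of that hour's raw samples and that the report derives uptime by summing the hourly rows that exist; the function names and this report formula are hypothetical, not OpManager's actual schema or implementation.

```python
# Hypothetical model of the raw -> hourly archive step and a 24-hour report.
# The table layout and report formula are assumptions for illustration only.
from statistics import mean

POLL_INTERVAL_MIN = 15
SAMPLES_PER_HOUR = 60 // POLL_INTERVAL_MIN  # 4 raw entries per hour

def archive_hourly(raw_samples, failed_hours=()):
    """Average each hour's raw samples (1.0 = up, 0.0 = down) into one hourly row.

    Hours in failed_hours simulate an archive run that failed because the
    OpManager service was down or the database was disconnected.
    """
    hourly = {}
    for hour in range(len(raw_samples) // SAMPLES_PER_HOUR):
        if hour in failed_hours:
            continue  # the hourly row is never written
        chunk = raw_samples[hour * SAMPLES_PER_HOUR:(hour + 1) * SAMPLES_PER_HOUR]
        hourly[hour] = mean(chunk)
    return hourly

def report(hourly):
    """Return (availability %, uptime in minutes) computed from hourly rows only."""
    values = list(hourly.values())
    availability = 100 * mean(values)
    uptime_min = sum(value * 60 for value in values)
    return availability, uptime_min

raw = [1.0] * (24 * SAMPLES_PER_HOUR)  # device up at every poll for 24 hours
print(report(archive_hourly(raw)))                    # (100.0, 1440.0)
print(report(archive_hourly(raw, failed_hours={5})))  # (100.0, 1380.0)
```

Under these assumptions, a single failed hourly archive leaves availability at 100% (it is averaged over the remaining rows) but removes 60 minutes from the reported uptime.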

Solution:

This is an environment-specific issue. To avoid it, monitor for database downtime and OpManager server outages. For further assistance, contact our support team at opmanager-support@manageengine.com.

3. Error: Device is available, but the status is shown as down in OpManager

This issue occurs when the monitored device is reachable from the network, but OpManager continues to display it as down. Follow the steps below to verify the polling and status sync conditions:

Ensure that the monitoring interval has passed, or wait for the next polling cycle for the status to be updated.
If Poll using IP Address is set, check whether the device’s IP address is reachable from the OpManager server.
If Poll using DNS is set, verify that the device's DNS name resolves and is reachable from the OpManager server.
If the Monitoring Via option is set to ICMP, ensure that the IP address is valid, and try pinging the device manually from the OpManager server.
If the Monitoring Via option is set to TCP, ensure that the configured TCP port is open and accessible from the OpManager server.
If the Monitoring Via option is set to SNMP, verify that an SNMP credential is associated with the device and is valid. Increase the SNMP timeout value if needed.
If High Availability (HA) is configured, verify that the device is reachable from the failover server.
In Enterprise Edition, check if the device status is consistent in both the Central and Probe servers. If the probe shows the correct status, it will be synced to Central during the next poll or periodic sync operation.
Check whether OpManager has raised a lost network connectivity alarm, which indicates that the server was cut off from the network due to latency or disconnection.
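
The DNS and TCP checks in the list above can be scripted so that they run from the OpManager server itself. The following Python sketch uses only the standard socket module; the function names, the sample host, and the port are placeholders. It does not cover ICMP (a raw ICMP ping needs elevated privileges, so use the ping command instead) or SNMP.

```python
# Minimal reachability checks to run on the OpManager server host.
# Function names, host, and port are hypothetical placeholders.
import socket

def resolve(dns_name):
    """Return the IPv4 address for dns_name, or None if it does not resolve."""
    try:
        return socket.gethostbyname(dns_name)
    except socket.gaierror:
        return None

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # connection refused, timed out, host unreachable, etc.
        return False

if __name__ == "__main__":
    # Replace with the device's DNS name and the TCP port configured in OpManager.
    device, port = "localhost", 22
    address = resolve(device)
    print(f"{device} resolves to {address}")
    if address is not None:
        print(f"TCP port {port} open: {tcp_port_open(address, port)}")
```

If the name does not resolve, check the Poll using DNS setting and the DNS configuration. If the port check fails while the device is otherwise up, check firewalls between the OpManager server and the device.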