Linux Performance Monitor


Overview

ManageEngine Applications Manager provides out-of-the-box Linux server performance monitoring. It helps the operations team ensure that servers are up (ping) and running at peak performance by monitoring CPU usage, memory utilization, processes, disk utilization, and disk I/O statistics.

In this help document, you will learn how to get started with Linux performance monitoring along with the list of parameters that are monitored with Applications Manager's Linux monitoring tool.

Creating a new Linux monitor

Supported Distributions: We support monitoring most popular Linux distributions, including but not limited to Debian, Ubuntu, CentOS / CentOS Stream, RedHat, Oracle Linux, Mandriva, Fedora, SLES, OpenSUSE, Amazon Linux, IBM Cloud Linux, Microsoft Azure Linux, Google Cloud Platform (GCP) Linux, and more.

Prerequisites for monitoring Linux server performance metrics: Click here

Using the REST API to add a new Linux server monitor: Click here

Follow the steps given below to create a new Linux server monitor:

  1. Select the Mode of Monitoring (Telnet, SSH or SNMP). For IBM AIX, HP Unix, Tru64 Unix, only Telnet and SSH are supported. For Novell, only SNMP is supported.
  2. If Telnet, provide the port number (default is 23) and user name and password information of the server.
  3. If SSH, provide the port number (default is 22) and user name and password information of the server. Alternatively, you can use Public Key Authentication (user name and private key). You can also give a passphrase if the private key is protected with one.

    Note: To identify the public/private key, open a command prompt and type cd .ssh/. Then, from the list of files, open id_dsa.pub or id_rsa.pub (public key) or id_dsa or id_rsa (private key) to get the keys.

  4. If SNMP, provide the port at which it is running (default is 161) and SNMP Community String (default is 'public'). This requires no user name and password information.
  5. For Telnet/SSH mode of monitoring, specify the command prompt value, which is the last character in your command prompt. Default value is $ and possible values are >, #, etc.
    Note: On the server you are attempting to monitor through SSH, the PasswordAuthentication variable should be set to 'yes' to enable data collection. To ensure this, open the file /etc/ssh/sshd_config and verify the value of the PasswordAuthentication variable. If it is set to 'no', change it to 'yes' and restart the SSH daemon using the command /etc/rc.d/sshd restart.
  6. Choose the Monitor Group from the combo box to which you want to associate the Monitor (optional).
  7. Click Add Monitor(s). This discovers the host or server on the network and starts monitoring it.
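
The PasswordAuthentication check described in the note under step 5 can be sketched as follows. This is a minimal illustration and not part of Applications Manager: the function name password_auth_enabled is hypothetical, and the sketch ignores Match blocks and Include directives. It assumes OpenSSH behavior, where keywords are case-insensitive, the first occurrence of a keyword wins, and a missing PasswordAuthentication keyword defaults to 'yes'.

```python
def password_auth_enabled(sshd_config_text: str) -> bool:
    """Return True if PasswordAuthentication is effectively 'yes'.

    Hypothetical helper for illustration; ignores Match and Include.
    """
    for raw_line in sshd_config_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.replace("=", " ").split()
        # Keywords are case-insensitive; the first occurrence wins.
        if len(parts) >= 2 and parts[0].lower() == "passwordauthentication":
            return parts[1].lower() == "yes"
    # Assumption: OpenSSH treats a missing keyword as 'yes'.
    return True
```

To use it, read the contents of /etc/ssh/sshd_config on the monitored server and pass the text to the function. A result of False means the value should be changed to 'yes' before adding the monitor in SSH mode with password authentication.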

Monitored Parameters

Applications Manager's Linux performance monitoring monitors the key performance indicators of Linux servers to detect any performance problems. These indicators include CPU, memory, disk, etc.

  • Availability tab shows the availability history of the Linux server for the past 24 hours or 30 days.
  • Performance tab shows some key performance indicators of the Linux server such as physical memory utilization, CPU utilization, response time and swap memory utilization along with heat charts for these attributes. This tab also shows the health status and events for the past 24 hours or 30 days.
  • List view tab lists all the Linux servers monitored by Applications Manager along with their overall availability and health status. It enables you to perform bulk admin configurations.

Click on the individual monitors listed to view detailed Linux server performance metrics. The performance metrics have been categorized into 7 different tabs:

Overview

This tab provides a high-level overview of the health and performance of the Linux server along with information pertaining to the processes running on the system.

Parameter Description
Monitor Information
Name The name of the Linux server monitor.
System Health Denotes the health status of the Linux server (clear, critical, warning).
Type Denotes the type you are monitoring.
Host Name The host name of the Linux system.
Host OS The main OS installed on the system.
Last Polled at Specifies the time at which the last poll was performed.
Next Poll at Specifies the time at which the next poll is scheduled.
Today's Availability Shows the overall availability status of the server for the day. You can also view 7-day and 30-day reports and the current availability status of the server.

Parameter Description
Thread count The number of threads running on the Linux machine.
Process Count The number of processes. Too many open processes can degrade server performance. It is helpful to be warned when the process count is increasing so that users can take corrective action before an issue arises.
Zombie Process Count The number of zombie processes. Zombie processes can hold ports open with no control. It is helpful to see when a zombie process is spawned so that it can be dealt with before any issues arise.
Major Page Faults/s Number of major faults the system has made per second, those which have required loading a memory page from disk.
Context Switches/s Total number of context switches per second.
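
As a rough illustration of the Process Count and Zombie Process Count attributes above, the sketch below counts processes from the output of ps -eo stat= (one state code per line, no header). How Applications Manager collects these values internally is not stated here, so the function name count_processes and the choice of ps as the data source are assumptions made for illustration only. In ps output, a state code beginning with Z denotes a zombie process.

```python
def count_processes(ps_stat_output: str) -> dict:
    """Count all processes and zombies from `ps -eo stat=` output.

    Hypothetical helper; each non-empty line is one process state code.
    """
    states = [line.strip() for line in ps_stat_output.splitlines() if line.strip()]
    # A state code starting with 'Z' marks a zombie (defunct) process.
    zombies = sum(1 for state in states if state.startswith("Z"))
    return {"process_count": len(states), "zombie_count": zombies}
```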

You can use the Custom Fields option in the 'Monitor Information' section to configure additional fields for the monitor.

  • The Overview tab shows dials for CPU, memory and disk utilization. You can click on these dials to view detailed graphs and charts for these attributes. The graphs available are History report, hour of day report, day of week report and heat chart. These graphs can be generated for both real time and historical data.
  • The CPU and memory utilization - last six hours graph shows the memory usage and CPU usage values for the last six hours. Swap Memory Utilization, Physical Memory Utilization (in % and MB), and CPU utilization (%) are the common attributes for Linux in both SNMP mode and SSH/Telnet mode, whereas the Available memory (in %) and Total Memory Utilization (in % and MB) attributes are available only for Linux in SSH/Telnet mode.
  • The Breakup of CPU Utilization graph provides a break up of performance metrics for the entire system processor with attributes such as run queue, blocked process, user time (%), system time (%), I/O wait (%), idle time (%), steal time (%) and interrupts/sec.
  • The System Load graph shows the average system load on the central processing unit (CPU) over predefined time intervals. The system load during the last one-, five- and fifteen-minute periods is represented by the parameters Load average in minute, Load average in 5 minutes and Load average in 15 minutes.
    Note: The attributes are displayed differently for Applications Manager versions below 170500:
    • Load average in minute → Jobs in Minute
    • Load average in 5 minutes → Jobs in 5 Minutes
    • Load average in 15 Minutes → Jobs in 15 Minutes
  • The Average System Load graph gives users an idea of the amount of work that the system performs per core. The average system load per core during the last one-, five- and fifteen-minute periods is represented by the parameters Average load per core in minute, Average load per core in 5 minutes and Average load per core in 15 minutes. It is available only in Telnet/SSH mode.
    Note: The attributes are displayed differently for Applications Manager versions below 170500:
    • Average load per core in minute → Average Load in Minute
    • Average load per core in 5 minutes → Average Load in 5 Minutes
    • Average load per core in 15 minutes → Average Load in 15 Minutes
  • The Process Details section shows information about the processes running on the Linux server. You can add processes for monitoring using the Add New Process option. You can also delete unwanted processes and enable/disable reports for specific processes. Click on any of the attributes listed to view more details.
  • The Service Details section displays the details of the services configured in this server when monitored in Telnet/SSH mode. This includes the process ID (PID), memory usage, sub status, status, last start time, availability, and health of a service. Users can choose to delete, restart, start, stop, manage, unmanage, and unmanage & reset the status of selected services.
  • The Monitors in this System section shows the availability and health of the monitors configured in this server. To add new monitors for monitoring, use the Add Monitors option.
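
To make the relationship between the System Load and Average System Load graphs concrete, the sketch below derives per-core load from the contents of /proc/loadavg. It assumes that "average load per core" means the load average divided by the number of CPU cores, which is the common interpretation but is not confirmed by this document. The function name load_per_core is hypothetical.

```python
def load_per_core(loadavg_text: str, core_count: int) -> tuple:
    """Return (1-min, 5-min, 15-min) load averages divided by core count.

    Assumption: per-core load = load average / number of cores.
    """
    if core_count <= 0:
        raise ValueError("core_count must be positive")
    # The first three fields of /proc/loadavg are the 1-, 5- and 15-minute averages.
    one, five, fifteen = (float(value) for value in loadavg_text.split()[:3])
    return (one / core_count, five / core_count, fifteen / core_count)
```

On a live system, the text can be read from /proc/loadavg and the core count taken from os.cpu_count(). A per-core value that stays above 1.0 generally suggests that more work is queued than the cores can service.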

CPU

This tab provides the CPU usage statistics of the Linux server. The tab includes two graphs - one that displays the CPU utilization by CPU cores and another that shows the Breakup of CPU Utilization - by CPU cores. You can view additional reports by clicking the graphs present in the Breakup of CPU Utilization - by CPU cores section. These reports include Break up of CPU Utilization (%) Vs Time, User Time (%) Vs Time, System Time (%) Vs Time, I/O Wait Time (%) Vs Time, Idle Time (%) Vs Time, Steal Time (%) Vs Time, CPU Utilization (%) Vs Time and Interrupts/sec Vs Time for all the CPU cores.

The CPU tab also shows the following performance metrics:

Parameter Description Monitoring Mode
Telnet/SSH SNMP
Core The name of the CPU core    
User Time(%) The percentage of time that the processor spends on User mode operations. This generally means application code.
System Time(%) The percentage of time that the processor spends on kernel (system) mode operations.
I/O Wait Time(%) The percentage of time that the processor spends waiting for I/O to complete.
Idle Time(%) The time when the CPU is idle (not being used by any program)
Steal Time(%) Amount of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.
CPU Utilization(%) Specifies the total CPU used by the system.
Interrupts/sec The rate at which CPU handles interrupts from applications or hardware each second. If the value for Interrupts/sec is high over a sustained period of time, there could be hardware issues.

You can also view graphs for these attributes by selecting the necessary CPU core and then choosing the appropriate attribute.
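
The percentages in the table above can be illustrated with a short sketch that compares two samples of a cpu line from /proc/stat. This is not the product's collection logic; it is a sketch under stated assumptions. The names FIELDS, parse_cpu_line and cpu_breakdown are hypothetical, the input lines are assumed to carry at least the first eight counters (as on current kernels), nice time is folded into user time, interrupt time is folded into system time, and utilization is taken as everything except idle and I/O wait.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(stat_line: str) -> dict:
    """Parse a 'cpu' or 'cpuN' line from /proc/stat into named tick counters."""
    values = stat_line.split()[1:1 + len(FIELDS)]
    return dict(zip(FIELDS, (int(value) for value in values)))

def cpu_breakdown(prev_line: str, curr_line: str) -> dict:
    """Return percentage breakdown of CPU time between two samples."""
    prev, curr = parse_cpu_line(prev_line), parse_cpu_line(curr_line)
    delta = {name: curr[name] - prev[name] for name in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")

    def pct(ticks: int) -> float:
        return 100.0 * ticks / total

    return {
        "user_pct": pct(delta["user"] + delta["nice"]),
        "system_pct": pct(delta["system"] + delta["irq"] + delta["softirq"]),
        "iowait_pct": pct(delta["iowait"]),
        "idle_pct": pct(delta["idle"]),
        "steal_pct": pct(delta["steal"]),
        # Assumption: utilization excludes both idle and I/O wait time.
        "utilization_pct": pct(total - delta["idle"] - delta["iowait"]),
    }
```

Passing a per-core line (cpu0, cpu1 and so on) instead of the aggregate cpu line gives the same breakdown for an individual core.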

Disk

This tab displays disk usage and disk I/O statistics of the Linux server.

Parameters Description
Disk Utilization
Disk The name of the disk drive.
Used (%) Denotes how much disk space out of the total disk space has actually been used (in percentage)
Used (MB) The disk space used, in megabytes.
Free (%) The percentage of total usable space on the disk that is free.
Free (MB) The unallocated space on the disk, in megabytes.
Disk I/O Statistics
Transfers/sec The number of read/write operations on the disk that occur each second.
Writes/sec The number of write operations on the disk that occur each second.
Reads/sec The number of read operations on the disk that occur each second.
% Busy Time The percentage of time the disk was busy.
Average Queue Length The average number of both read and write requests that were queued for the disk during the sample interval.
Avg. Disk Latency The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
Read Wait Time The average time (in milliseconds) for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
Write Wait Time The average time (in milliseconds) for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
Inode Usage
Inode The name of the Inode.
Total The total number of Inodes available in that particular disk.
Used The number of Inodes used in that particular disk.
Free The remaining number of Inodes that are available in that particular disk.
Used (%) The percentage of Inodes used in that particular disk.
Free (%) The percentage of Inodes that remain available in that particular disk.

You can also delete disks that have been physically removed using the Delete Orphaned Disk option.

Note: Data collection for Disk I/O statistics and Inode statistics can be enabled from 'Disk I/O Statistics Monitoring' and 'Inode Monitoring' options under Settings → Performance Polling → Servers tab.
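
The Used/Free columns for both disk space and Inodes follow the same arithmetic, which the sketch below illustrates. The helper names usage_summary and inode_usage are hypothetical and this is not the product's implementation. The inode_usage function relies on os.statvfs from the Python standard library, which is available on Unix-like systems, and it treats f_ffree (all free Inodes) as "free", which is an assumption.

```python
import os

def usage_summary(total: int, free: int) -> dict:
    """Derive used/free counts and percentages from total and free values."""
    used = total - free
    if total <= 0:
        return {"total": total, "used": used, "free": free,
                "used_pct": 0.0, "free_pct": 0.0}
    return {
        "total": total,
        "used": used,
        "free": free,
        "used_pct": 100.0 * used / total,
        "free_pct": 100.0 * free / total,
    }

def inode_usage(path: str = "/") -> dict:
    """Summarize Inode usage for the file system that contains `path`."""
    stats = os.statvfs(path)
    # f_files is the total number of Inodes, f_ffree the number still free.
    return usage_summary(stats.f_files, stats.f_ffree)
```

For example, a file system with 1000 Inodes of which 250 are free reports Used (%) as 75 and Free (%) as 25.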

Memory

Parameters Description
Memory Usage Statistics
Active Memory Memory that has been used more recently and is usually not reclaimed unless absolutely necessary (in MB).
Active anonymous memory Anonymous memory that has been used more recently and usually not swapped out (in MB).
Active Files Memory Pagecache memory that has been used more recently and usually not reclaimed until needed (in MB).
Anonymous huge pages memory Non-file backed huge pages mapped into userspace page tables (in MB).
Anonymous pages memory Non-file backed pages mapped into userspace page tables (in MB).
Cached memory Memory in the pagecache (Diskcache and Shared Memory) (in MB).
Commit limit memory Based on the overcommit ratio (vm.overcommit_ratio), this is the total amount of memory currently available to be allocated on the system (in MB). This limit is only adhered to if strict overcommit accounting is enabled (mode 2 in vm.overcommit_memory).
Committed Memory The amount of memory presently allocated on the system (in MB). The committed memory is a sum of all of the memory which has been allocated by processes, even if it has not been "used" by them as of yet.
Unevictable memory Unevictable pages that cannot be swapped out for a variety of reasons (in MB).
Unreclaimable memory The part of the slab that cannot be reclaimed under memory pressure (in MB).
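
The attributes in this table resemble fields exposed by the Linux kernel in /proc/meminfo. The sketch below parses that file's text and converts selected fields from kB to MB. The mapping between the labels above and the /proc/meminfo keys (for example, Committed Memory to Committed_AS and Unreclaimable memory to SUnreclaim) is an assumption made for illustration, as are the names MEMINFO_KEYS and parse_meminfo_mb.

```python
# Assumed mapping targets: /proc/meminfo keys that appear to match the table.
MEMINFO_KEYS = (
    "Active", "Active(anon)", "Active(file)", "AnonHugePages", "AnonPages",
    "Cached", "CommitLimit", "Committed_AS", "Unevictable", "SUnreclaim",
)

def parse_meminfo_mb(meminfo_text: str, keys=MEMINFO_KEYS) -> dict:
    """Return selected /proc/meminfo values converted from kB to MB."""
    result = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            fields = rest.split()
            if fields:
                # Values are reported in kB; divide by 1024 to get MB.
                result[name] = int(fields[0]) / 1024.0
    return result
```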

Network

Parameter Description Monitoring Mode
Telnet/SSH SNMP
NETWORK INTERFACE
Name The name of the network interface present in the Linux system.
Speed (Mbps) The estimate of the current bandwidth in Mbps.
MTU Maximum Transmission Unit (MTU) is a measurement of the largest data packet that a network-connected device can accept.
Input Traffic (Kbps) The rate at which data is received on the interface, in kilobits per second.
Output Traffic (Kbps) The rate at which data is sent on the interface, in kilobits per second.
Errors Number of packets that could not be sent or received.
Connection Stats
Socket State The state of the sockets. The following socket states are shown:
  • ESTABLISHED - The socket has an established connection.
  • FIN_WAIT1 - The socket is closed, and the connection is shutting down.
  • FIN_WAIT2 - Connection is closed, and the socket is waiting for a shutdown from the remote end.
  • LISTEN - The socket is listening for incoming connections.
  • TIME_WAIT - The socket is waiting after close to handle packets still in the network.
No. of Connections Number of connections that are in the particular socket state.
NTP Stats
NTP Status Indicates whether the client is synchronized with the server or not.
Server Name Indicates the hostname of the server to which the client is synchronized.
Stratum Level Indicates the level of the strata at which the client is located.
NTP Time correct to within Indicates the time offset value (in milliseconds) displayed for 'time correct to within' after executing the ntpstat/chrony command.

Time correct to within = (Root dispersion + Root Delay) / 2
Poll Interval Indicates the polling time interval between each sync (in seconds).
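
As a worked example of the 'Time correct to within' formula stated above, the sketch below applies it to a root dispersion and root delay given in milliseconds. The function name time_correct_within is hypothetical, and the formula is taken as written in this document rather than from the NTP tools themselves.

```python
def time_correct_within(root_dispersion_ms: float, root_delay_ms: float) -> float:
    """Apply the formula from this document: (root dispersion + root delay) / 2."""
    return (root_dispersion_ms + root_delay_ms) / 2.0
```

For instance, a root dispersion of 10 ms and a root delay of 30 ms give a value of 20 ms.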

Note: You can also delete interfaces that have been physically removed using the Delete Orphaned Interface option.
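
The Connection Stats attributes (Socket State and No. of Connections) can be illustrated by tallying TCP socket states from the output of netstat -tan. This is only a sketch: the function name count_socket_states is hypothetical, and the document does not state which command Applications Manager uses to collect these values.

```python
from collections import Counter

def count_socket_states(netstat_output: str) -> Counter:
    """Count TCP connections per state from `netstat -tan` output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        # Data rows start with tcp or tcp6; the state is the sixth column.
        if len(parts) >= 6 and parts[0].startswith("tcp"):
            counts[parts[5]] += 1
    return counts
```

Header lines are skipped automatically because they do not begin with tcp.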

Cron Job

Cron jobs are used for scheduling tasks like backups, emails, status checks, etc. in Linux and can have a major impact on the performance of your web servers and applications. Applications Manager makes it easy by continuously monitoring them and helps you gain insight into the execution of important jobs in the back-end systems.

Adding a Cron job monitor

Prerequisites : Click here

  1. Go to the Cron Job tab and click on Add Cron Job.
  2. Enter the following details:
    • Display Name - A user-friendly name for identification.
    • Cron Expression - Expression used for scheduling the cron job.
    • Time Zone - The time zone configured in the remote Linux machine. Select the appropriate value from the drop-down.
    • Job Script Path - The complete script path that needs to be executed in the cron job.
    • Cron Job Period - The amount of time within which the job should run (in Minutes). If it exceeds the configured time, then the status will be updated as EXCEEDJOBTIME.
  3. After adding a cron job monitor, the curl details for your cron job will be shown below. Copy the displayed curl details by clicking on them and close the curl details window. You will now be redirected to the Cron Job tab of Applications Manager automatically.
  4. In the remote Linux machine, open the command prompt and execute the command crontab -e. This will open the crontab in edit mode. Paste the curl details that were copied earlier, then save and close the crontab.
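
Before entering a Cron Expression in step 2, it can help to check that it has the expected number of fields. The sketch below assumes the standard five-field crontab format (minute, hour, day of month, month, day of week). Whether Applications Manager accepts other variants is not stated here, and the names CRON_FIELDS and split_cron_expression are hypothetical.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def split_cron_expression(expression: str) -> dict:
    """Split a five-field cron expression into named fields.

    Assumption: standard crontab format with exactly five time fields.
    """
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(CRON_FIELDS), len(parts))
        )
    return dict(zip(CRON_FIELDS, parts))
```

For example, the expression 30 2 * * 1 splits into minute 30, hour 2, any day of month, any month, and day of week 1.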

The table below describes the details of the Cron jobs running on the Linux server.

Parameters Description
Cron Job Details:
Cron Name Name of the Cron job.
Cron Expression The Cron expression for the corresponding Cron job.
Job Start Time Time and date at which the Cron job started.
Job End Time Time and date at which the Cron job ended.
Next Run Time Time and date at which the next Cron job is scheduled to run.
Elapsed Time The amount of time elapsed since the Cron job started (in Minutes).
Exit Code Denotes the exit code of the Cron job.
Missed Runs The number of times the Cron job failed to start at the scheduled time.
Status Status of the Cron job. Possible values are:
  • PASSED - Job has run successfully with exit code equal to 0.
  • RUNNING - Job is running currently.
  • FAILED - Job has failed with exit code greater than 0.
  • EXCEEDJOBTIME - Job has been running for longer than the configured job time.
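
The status values listed above can be summarized as a small decision function. This is a sketch of the rules as described in this section, not the product's actual logic: the function name cron_job_status is hypothetical, and representing a job that has not yet finished with an exit code of None is an assumption made for illustration.

```python
from typing import Optional

def cron_job_status(exit_code: Optional[int],
                    elapsed_minutes: float,
                    period_minutes: float) -> str:
    """Map exit code and elapsed time to a status from the list above.

    exit_code is None while the job is still running (an assumption).
    """
    if exit_code is None:
        # Still running: flag the job once it exceeds the Cron Job Period.
        if elapsed_minutes > period_minutes:
            return "EXCEEDJOBTIME"
        return "RUNNING"
    return "PASSED" if exit_code == 0 else "FAILED"
```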

Note: Once the Cron job is added, it will be in discovery state until we receive the first response from the remote server.

Updating cron jobs

To update a Cron job,

  1. Click on the Edit icon for the required cron job.
  2. Enter the required display name and the Cron Job Period for that cron job.
  3. Click Update.

Deleting cron jobs

To delete Cron jobs,

  1. Select the cron jobs that need to be deleted.
  2. Click on Delete Cron Jobs. This will delete the cron jobs from Applications Manager.
  3. Finally, make sure you remove the curl appended to the cron jobs in the remote server using the crontab -e command.

Note: Addition, update, and deletion of Cron jobs will be possible only in managed servers by the administrator user.

Configuration

This tab contains information about system configuration attributes.

Parameters Description
System Information
Host Name The name of the system.
Domain The name of the domain to which the system belongs.
OS Information
OS Name The name of the operating system instance.
OS Version Version number of the operating system.
OS Release The Linux distribution
Memory Information
Total Physical Memory (MB) Total amount of physical memory as available to the operating system.
Total Swap Memory (MB) Total amount of swap memory available.
Processor Information
Id Unique identifier of a processor on the system
Model The processor model type
Implementation The processor family type.
Manufacturer Name of the processor manufacturer
Speed(MHz) Current speed of the processor
Cache (KB) Size of the processor cache. A cache is an external memory area that has a faster access time than the main memory.
Network Interface Settings
Name The name of the network adapter.
IP Address The IP address configured for this network interface
MTU The Maximum Transmission Unit configured for this network interface, that is, the largest data packet that it can accept.
Type The type of network adapter.
Mac Address The Media access control address for this network adapter. A MAC address is a unique 48-bit number assigned to the network adapter by the manufacturer. It uniquely identifies this network adapter and is used for mapping TCP/IP network communications.
Status The current status of the network adapter.
Broadcast Address The IP address to which messages are broadcast.
Printer Settings
Name Name of the printer.
Device The name of the server that controls the printer.
Default Indicates whether the printer is the default one. Values are either True or False.
Status Current status of the printer.

Note: The data present in the configuration tab is not updated during every poll. So if you make any changes to the server configuration, you need to restart Applications Manager for those changes to be reflected in the 'Configuration' tab.

Hardware Metrics

The following are metrics pertaining to the hardware of Dell and HP servers:

Category Attribute Description DELL HP
SNMP Mode WMI Mode SNMP Mode WMI Mode
Temperature Sensor The name of the temperature sensor.
Temperature Reading (deg C) The current temperature reading.
Status The temperature status - Critical, Warning, Clear
Fan Sensor Name of the fan sensor.
Fan Speed (RPM) The fan speed values displayed in RPM.
Status The fan status - Critical, Warning, Clear
Power Sensor Name of the power supply.
Reading (Watts) The power supply reading values displayed in Watts.
Status The power status - Critical, Warning, Clear
Voltages Sensor Name of the voltage supply.
Reading (Volts) The voltage reading values displayed in Volts.
Status The voltage status - Critical, Warning, Clear
Battery Sensor Name of the Battery sensor.
Status The battery status - Critical, Warning, Clear
Memory Sensor Name of the Memory sensor.
Memory Device Type The type of memory device
Size (MB) The amount of memory currently installed in MB.
Status The memory status - Critical, Warning, Clear
Disk Sensor Identifies the disk's label
Device Name The device name configured for the disk
Size (MB) The allocated size in MB
Status The disk status - Critical, Warning, Clear.
Array Sensor The name of the array disk
Bus protocol The bus type of the array disk
Size (MB) The amount in MB of the used space on the array disk.
Status The array status - Critical, Warning, Clear
Chassis Sensor The user-assigned name of the chassis.
Model The system model type for this chassis
Status The chassis status - Critical, Warning, Clear
Processor Sensor The location name of the processor device status probe
Processor Brand The brand of the processor device.
Processor Current Speed The current speed of the processor device in MHz.
Processor Core Count The number of processor cores detected for the processor device.
Status The processor status - Critical, Warning, Clear
  • If a component is functioning normally, the status indicator is green.
  • The status indicator changes to orange or red if a system component violates a performance threshold or is not functioning properly. Generally, an orange indicator signifies degraded performance.
  • A red indicator signifies that a component stopped operating or exceeded the highest threshold.
  • If the status is blank, then the health monitoring service cannot determine the status of the component.

Note: Currently hardware performance monitoring is supported in SNMP and WMI monitoring mode.

Hardware Device-Level Configuration

The Hardware Configuration option, available under Host Details on the right-hand side of the details page, allows you to select the hardware components you want to monitor. This can also be done using the Performance Polling option under the Settings tab, which configures the hardware stats globally.

Advanced Settings

By clicking the Advanced Settings option available under Host Details on the right-hand side of the details page, you can go to the Performance Data Collection page for Servers.

Here you can use the Hardware Health monitoring option to enable or disable hardware monitoring in servers. You can also select the hardware components (such as power, fan, disk, etc.) to be monitored by checking the options given. This will globally configure the hardware monitoring status. You can also configure the health status by defining values in the respective text boxes:

    • Critical Severity: If the status matches with any of the values defined in the Critical Severity text box, then Applications Manager displays the status of the hardware device as Critical. The values defined by default are failed, error, failure, nonRecoverable, criticalUpper, criticalLower, nonRecoverableLower and critical.
    • Warning Severity: If the status matches with any of the values defined in the Warning Severity text box, then Applications Manager displays the status of the hardware device as Warning. The values defined by default are degraded, warning, nonCritical, nonCriticalUpper, nonRecoverableUpper and nonCriticalLower.
    • Clear Severity: If the status matches with any of the values defined in the Clear Severity text box, then Applications Manager displays the status of the hardware device as clear. The value defined by default is 'ok'.

Note: If the status of the device does not match with any of the values defined in the severity text box, the device status is displayed as unknown. Status values defined within the severity text boxes are comma-separated and case-insensitive.
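
The severity matching described above can be sketched as follows, using the default values listed for each text box. This is an illustration of the stated rules (comma-separated values, case-insensitive matching, unknown when nothing matches) and not the product's implementation. The names DEFAULT_SEVERITIES and classify_status are hypothetical.

```python
# Default values as listed in this section, one comma-separated string per severity.
DEFAULT_SEVERITIES = {
    "Critical": "failed, error, failure, nonRecoverable, criticalUpper, "
                "criticalLower, nonRecoverableLower, critical",
    "Warning": "degraded, warning, nonCritical, nonCriticalUpper, "
               "nonRecoverableUpper, nonCriticalLower",
    "Clear": "ok",
}

def classify_status(status: str, severities=DEFAULT_SEVERITIES) -> str:
    """Return the severity whose value list contains the given status."""
    needle = status.strip().lower()
    for severity, csv_values in severities.items():
        # Values are comma-separated and compared case-insensitively.
        values = {value.strip().lower() for value in csv_values.split(",")}
        if needle in values:
            return severity
    return "Unknown"
```

For example, a device reporting nonRecoverableUpper is shown as Warning with the defaults, while a status that appears in none of the lists is shown as Unknown.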