Metrics to monitor in Kubernetes

Kubernetes hides much of the complexity of your infrastructure. While this abstraction is helpful, it makes the choice of monitoring metrics very important. Without the right visibility, performance issues and resource bottlenecks can stay hidden until they impact your users.

This article explains the core Kubernetes metrics you should track across every layer. It also covers how a unified observability platform can help you manage these insights effectively.

Why you need to monitor Kubernetes metrics

Kubernetes environments change constantly. Pods are temporary, workloads scale on demand, and multiple services share the same resources. Tracking specific metrics allows teams to achieve several goals:

  • Maintain application speed and availability.
  • Lower costs by improving resource use.
  • Find failures before they become major problems.
  • Speed up the troubleshooting process.
  • Make better decisions about capacity and scaling.

Core Kubernetes metrics to track

Here are the metrics in Kubernetes that you need to monitor at each layer:

Cluster-level metrics

Cluster-level metrics provide a high-level view of your entire infrastructure. Tracking these helps you detect systemic issues before they impact your users.

  • Node Readiness: Monitor the number of ready nodes. You should alert your team if readiness drops below 80%, as this often indicates failures or issues with draining nodes.
  • Resource Utilization: This aggregates CPU and memory across all nodes. Compare your total capacity against "allocatable" resources to find overcommitment risks. A good rule is to keep average usage under 70% to handle sudden bursts.
  • Control Plane Health: Monitor etcd to confirm that the cluster has a leader and that the database size is stable. High write churn can lead to latency.
  • API Server Latency: If request latency for basic operations exceeds one second, your control plane may be overloaded by too many active "watches" or background reconciliations.
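The readiness and utilization thresholds above can be sketched as a small check. This is a minimal illustration, assuming you already have the counts and averages from your metrics pipeline; the function name and alert strings are hypothetical.

```python
# Hypothetical cluster health check applying the thresholds above:
# alert when fewer than 80% of nodes are Ready, or when average
# CPU/memory utilization exceeds 70% of allocatable capacity.
def cluster_alerts(ready_nodes, total_nodes, avg_cpu_pct, avg_mem_pct):
    alerts = []
    if total_nodes and ready_nodes / total_nodes < 0.80:
        alerts.append("node-readiness-below-80pct")
    if avg_cpu_pct > 70 or avg_mem_pct > 70:
        alerts.append("utilization-above-70pct")
    return alerts
```

For example, a ten-node cluster with only seven Ready nodes trips the readiness alert even if utilization is healthy.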

Node-level metrics

Nodes are the physical or virtual machines hosting your pods. Monitoring their hardware health is vital to prevent "eviction storms," where pods are kicked off a node due to lack of resources.

Metric Type       | What to Watch                   | Why it Matters
CPU/Memory        | Utilization and throttling      | Prevents host starvation and virtualization overhead.
Disk I/O          | Read/write time and latency     | Detects slow persistent volumes or excessive log spam.
Kubelet Health    | Running pod count vs. capacity  | Avoids scheduling more pods than the node can handle.
Hardware Pressure | Disk, memory, and PID pressure  | Identifies nodes that need to be drained before evictions begin.
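The pressure conditions in the table map directly onto the Kubernetes NodeCondition types. As a sketch (the helper name and input shape are hypothetical), you can flag nodes reporting any pressure condition so they can be cordoned and drained early:

```python
# Condition names match the Kubernetes NodeCondition types.
PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure"}

def nodes_to_drain(node_conditions):
    """node_conditions: {node_name: {condition_type: status_bool}}.
    Returns the sorted names of nodes reporting any pressure condition."""
    return sorted(
        name for name, conds in node_conditions.items()
        if any(conds.get(c, False) for c in PRESSURE_CONDITIONS)
    )
```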

Pod and container metrics

Pods are the smallest deployable units in Kubernetes and require the most granular monitoring.

Resource usage and throttling

Track CPU usage over five-minute intervals against your set limits. If your "throttled periods" exceed 5%, your application might be slow because it cannot access enough CPU. For memory, track the "working set" to prevent OOMKills (Out Of Memory kills), which happen when a container tries to use more memory than its limit allows.
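The 5% throttling rule can be computed from the CPU throttling counters exposed by the kernel's cgroup accounting (the same counters cAdvisor surfaces): the number of enforcement periods and the number of those periods in which the container was throttled. This is a minimal sketch; the function names are illustrative.

```python
# Fraction of CFS enforcement periods in which the container was
# throttled, computed from counter deltas over a window (e.g. 5 minutes).
def throttle_ratio(nr_throttled_delta, nr_periods_delta):
    if nr_periods_delta == 0:
        return 0.0
    return nr_throttled_delta / nr_periods_delta

# Apply the 5% threshold from the text.
def is_cpu_starved(nr_throttled_delta, nr_periods_delta, threshold=0.05):
    return throttle_ratio(nr_throttled_delta, nr_periods_delta) > threshold
```

A container throttled in 30 of 300 periods (10%) would be flagged; 10 of 300 (about 3%) would not.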

Stability and lifecycle

Monitor the distribution of pod phases (Running, Pending, or Failed). If you see more than five restarts per hour, your application is likely unstable due to code crashes or failed health probes.
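The five-restarts-per-hour rule can be checked with a simple sliding window over restart timestamps. This is a sketch assuming you collect restart events with timestamps; the function name and input shape are hypothetical.

```python
# Flag pods with more than `max_per_hour` restarts in the trailing hour.
def unstable_pods(restart_events, now, max_per_hour=5):
    """restart_events: {pod_name: [unix_timestamp, ...]}."""
    hour_ago = now - 3600
    return sorted(
        pod for pod, times in restart_events.items()
        if sum(1 for t in times if t >= hour_ago) > max_per_hour
    )
```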

Workload and application metrics

This layer focuses on the controllers that manage your pods, such as Deployments and StatefulSets.

  • Replica Health: Ensure the number of available replicas matches your desired state. If fewer than 90% of your desired pods are available, your application may struggle to handle traffic.
  • Autoscaling: Monitor Horizontal Pod Autoscaler (HPA) activity to validate that your cluster scales up or down as traffic changes.
  • Application KPIs: Move beyond infrastructure by tracking HTTP error rates (4xx/5xx) and latency histograms. Linking these to infrastructure data helps you see if a slow database query is tied to a specific pod's performance.
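The 90% replica-availability rule above can be expressed as a short check. The helper name and input shape are illustrative assumptions, not a specific API:

```python
# Flag workloads whose available replica count has fallen below 90%
# of the desired count declared in the controller spec.
def degraded_deployments(deployments, min_ratio=0.90):
    """deployments: {name: (available_replicas, desired_replicas)}."""
    return sorted(
        name for name, (available, desired) in deployments.items()
        if desired > 0 and available / desired < min_ratio
    )
```

A deployment with 8 of 10 desired pods available (80%) would be flagged, while one at full strength would not.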

Application performance monitoring metrics

Infrastructure data is not enough on its own. You need application metrics to provide business context. This includes request rates, response times, and error percentages. When you link these to your Kubernetes data, you can find the root cause of an issue much faster.

Control plane and event monitoring

The control plane orchestrates everything. You should monitor the scheduler to see if binding attempts are successful and the controller manager to track queue depths.

Kubernetes events provide a timeline of what is happening. Use these to correlate diagnostics; for example, if you see "Pending" pods alongside "DiskPressure" events, you likely have a storage crunch. Anomaly detection can also help flag sudden spikes in restarts that deviate from your normal baseline.

Best practices for monitoring

To get the most out of your data, follow these guidelines:

  • Monitor every layer: Do not focus only on pods; check the nodes and the cluster too.
  • Compare data: Always look at actual usage alongside your resource requests.
  • Watch the trends: Look at how metrics change over time rather than just looking at instant spikes.
  • Focus your alerts: Make sure your notifications relate to the actual impact on your applications.

Best practices for collection and alerting

To build a reliable monitoring stack, follow these guidelines:

  • Use the right stack: Deploy Prometheus with components like kube-state-metrics and node-exporter for thorough scraping.
  • Tier your alerts:
    • P0 (Critical): Availability issues (e.g., pods not ready).
    • P1 (High): Performance problems (e.g., high latency).
    • P2 (Medium): Capacity warnings (e.g., low allocatable resources).
  • Optimize storage: Retain metrics for 14 to 30 days and use recording rules to speed up expensive data queries.
  • Avoid "cardinality traps": Do not use unique IDs (like User IDs) as labels, as this can crash your monitoring database.
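The alert tiers above can be encoded as a simple routing table so paging policy stays consistent. The alert kinds here are hypothetical examples, not a fixed taxonomy:

```python
# Map alert kinds to the P0/P1/P2 tiers described above.
# The alert-kind names are illustrative placeholders.
TIERS = {
    "pods_not_ready": "P0",    # availability: page immediately
    "high_latency": "P1",      # performance: act within hours
    "low_allocatable": "P2",   # capacity: review during business hours
}

def tier_for(alert_kind):
    # Unknown alert kinds default to the medium tier.
    return TIERS.get(alert_kind, "P2")
```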

Monitoring Kubernetes metrics with ManageEngine Applications Manager

ManageEngine Applications Manager simplifies the task of tracking Kubernetes metrics. It provides end-to-end visibility by bringing infrastructure and application data into one place. You can add your master node credentials, and the platform will auto-discover your clusters across EKS, GKE, or AKS.

Unified metrics dashboard

The platform offers a central view where you can see cluster, node, and pod metrics together. This makes it easy to spot resource bottlenecks and track usage trends over time. You can see your application performance right alongside the health of your Kubernetes environment.

Intelligent alerts

Applications Manager uses threshold-based and anomaly alerts. It sends notifications for resource exhaustion, pod failures, and scaling problems. This provides context to your alerts, which helps reduce notification fatigue for your team.

Application-centric approach

The tool goes beyond basic infrastructure. It links Kubernetes metrics to:

  • Application response times.
  • Transaction health.
  • The actual experience of the end user.

This approach helps you understand exactly how a technical metric in Kubernetes affects your business outcomes. It turns raw data into operational intelligence, allowing you to manage your system proactively.

 

Angeline, Marketing Analyst

Angeline is a part of the marketing team at ManageEngine. She loves exploring the tech space, especially observability, DevOps and AIOps. With a knack for simplifying complex topics, she helps readers navigate the evolving tech landscape.
