Metrics to monitor in Kubernetes
Kubernetes hides much of the complexity of your infrastructure. While this abstraction is helpful, it makes the choice of monitoring metrics very important. Without the right visibility, performance issues and resource bottlenecks can stay hidden until they impact your users.
This article explains the core Kubernetes metrics you should track across every layer. It also covers how a unified observability platform can help you manage these insights effectively.
Why you need to monitor Kubernetes metrics
Kubernetes environments change constantly. Pods are temporary, workloads scale on demand, and multiple services share the same resources. Tracking specific metrics allows teams to achieve several goals:
- Maintain application speed and availability.
- Lower costs by improving resource use.
- Find failures before they become major problems.
- Speed up the troubleshooting process.
- Make better decisions about capacity and scaling.
Core Kubernetes metrics to track
Here are the metrics in Kubernetes that you need to monitor at each layer:
Cluster-level metrics
Cluster-level metrics provide a high-level view of your entire infrastructure. Tracking these helps you detect systemic issues before they impact your users.
- Node Readiness: Monitor the number of ready nodes. You should alert your team if readiness drops below 80%, as this often indicates failures or issues with draining nodes.
- Resource Utilization: This aggregates CPU and memory across all nodes. Compare your total capacity against "allocatable" resources to find overcommitment risks. A good rule is to keep average usage under 70% to handle sudden bursts.
- Control Plane Health: Monitor etcd to confirm that the cluster has a leader and that the database size is stable. High write churn can lead to latency.
- API Server Latency: If request latency for basic operations exceeds one second, your control plane may be overloaded by too many active "watches" or background reconciliations.
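The readiness rule above is easy to automate. Here is a minimal Python sketch of the 80% readiness check, assuming you have already fetched node conditions (for example, from `kubectl get nodes -o json` or kube-state-metrics); the data shape and function names are illustrative, not a specific API.

```python
def ready_fraction(nodes):
    """Return the fraction of nodes reporting the Ready condition as True.

    `nodes` is a list of dicts with a "conditions" mapping, e.g. a shape
    you might build from node status output. Empty clusters count as 0.
    """
    if not nodes:
        return 0.0
    ready = sum(1 for n in nodes if n["conditions"].get("Ready") == "True")
    return ready / len(nodes)

def readiness_alert(nodes, threshold=0.8):
    """Alert when node readiness drops below the threshold (80% here)."""
    return ready_fraction(nodes) < threshold
```

For example, a five-node cluster with two unready nodes is at 60% readiness, which would fire this alert.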
Node-level metrics
Nodes are the physical or virtual machines hosting your pods. Monitoring their hardware health is vital to prevent "eviction storms," where pods are kicked off a node due to lack of resources.
| Metric Type | What to Watch | Why it Matters |
|---|---|---|
| CPU/Memory | Utilization and throttling. | Reveals host starvation and CPU contention before applications slow down. |
| Disk I/O | Read/Write time and latency. | Detects slow persistent volumes or excessive log spam. |
| Kubelet Health | Running pod count vs. capacity. | Keeps pod counts under the node's limit so scheduling does not stall. |
| Hardware Pressure | Disk, Memory, and PID pressure. | Preemptively identifies nodes that need to be drained. |
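The hardware pressure row maps directly to the node conditions Kubernetes itself reports (`MemoryPressure`, `DiskPressure`, `PIDPressure`). A minimal sketch of flagging nodes for draining, assuming you have the condition statuses in hand; the input shape is illustrative:

```python
# Node condition types that kubelet sets when a resource is under pressure.
PRESSURE_CONDITIONS = ("MemoryPressure", "DiskPressure", "PIDPressure")

def nodes_to_drain(nodes):
    """Return names of nodes reporting any pressure condition as True.

    `nodes` maps node name -> {condition type: status string}, a shape
    you might build from node status conditions.
    """
    return [
        name
        for name, conditions in nodes.items()
        if any(conditions.get(c) == "True" for c in PRESSURE_CONDITIONS)
    ]
```

Flagging these nodes preemptively lets you cordon and drain them before the kubelet starts evicting pods on its own schedule.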
Pod and container metrics
Pods are the smallest units in Kubernetes and require the most granular look.
Resource usage and throttling
Track CPU usage over five-minute intervals against your set limits. If your "throttled periods" exceed 5%, your application might be slow because it cannot access enough CPU. For memory, track the "working set" to prevent OOMKills (Out Of Memory kills), which happen when a container tries to use more memory than its limit allows.
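The 5% throttling rule can be computed from the CFS period counters that container runtimes expose (for example, cAdvisor's throttled-periods and total-periods counters). A small sketch, with illustrative function names:

```python
def throttled_ratio(throttled_periods, total_periods):
    """Fraction of CFS scheduling periods in which the container was throttled."""
    if total_periods == 0:
        return 0.0
    return throttled_periods / total_periods

def throttling_alert(throttled_periods, total_periods, limit=0.05):
    """Flag containers throttled in more than 5% of periods."""
    return throttled_ratio(throttled_periods, total_periods) > limit
```

A container throttled in 8 of its last 100 periods (8%) would trip this check and is a candidate for a higher CPU limit.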
Stability and lifecycle
Monitor the distribution of pod phases (Running, Pending, or Failed). If you see more than five restarts per hour, your application is likely unstable due to code crashes or failed health probes.
Workload and application metrics
This layer focuses on the controllers that manage your pods, such as Deployments and StatefulSets.
- Replica Health: Ensure the number of available replicas matches your desired state. If fewer than 90% of your desired pods are available, your application may not handle traffic well.
- Autoscaling: Monitor Horizontal Pod Autoscaler (HPA) activity to validate that your cluster scales up or down as traffic changes.
- Application KPIs: Move beyond infrastructure by tracking HTTP error rates (4xx/5xx) and latency histograms. Linking these to infrastructure data helps you see if a slow database query is tied to a specific pod's performance.
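The 90% replica-health rule reduces to comparing a workload's available replicas against its desired count, both of which appear in Deployment and StatefulSet status. A minimal sketch with illustrative names:

```python
def replica_availability(available, desired):
    """Fraction of desired replicas that are currently available."""
    if desired == 0:
        return 1.0  # nothing desired, nothing missing
    return available / desired

def underreplicated(available, desired, threshold=0.9):
    """Flag workloads running fewer than 90% of their desired replicas."""
    return replica_availability(available, desired) < threshold
```

A Deployment with 8 of 10 pods available (80%) would be flagged, while 9 of 10 just clears the threshold.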
Application performance monitoring metrics
Infrastructure data is not enough on its own. You need application metrics to provide business context. This includes request rates, response times, and error percentages. When you link these to your Kubernetes data, you can find the root cause of an issue much faster.
Control plane and event monitoring
The control plane orchestrates everything. You should monitor the scheduler to see if binding attempts are successful and the controller manager to track queue depths.
Kubernetes events provide a timeline of what is happening. Use these to correlate diagnostics; for example, if you see "Pending" pods alongside "DiskPressure" events, you likely have a storage crunch. Anomaly detection can also help flag sudden spikes in restarts that deviate from your normal baseline.
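The Pending-plus-DiskPressure correlation can be expressed as a simple check over pod phases and recent event reasons. A sketch assuming you have already collected both (for example, from pod status and `kubectl get events`); the event reason shown matches what kubelet emits, but the input shapes are illustrative:

```python
def pending_with_disk_pressure(pod_phases, event_reasons):
    """Return True when Pending pods coincide with disk-pressure events.

    `pod_phases` maps pod name -> phase string; `event_reasons` is a list
    of event reason strings collected over the same window.
    """
    has_pending = any(phase == "Pending" for phase in pod_phases.values())
    has_pressure = "NodeHasDiskPressure" in event_reasons
    return has_pending and has_pressure
```

When this fires, the likely remediation is freeing disk on the affected nodes (or expanding storage) rather than debugging the pods themselves.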
Best practices for monitoring
To get the most out of your data, follow these guidelines:
- Monitor every layer: Do not focus only on pods; check the nodes and the cluster too.
- Compare data: Always look at actual usage alongside your resource requests.
- Watch the trends: Look at how metrics change over time rather than just looking at instant spikes.
- Focus your alerts: Make sure your notifications relate to the actual impact on your applications.
Best practices for collection and alerting
To build a reliable monitoring stack, follow these guidelines:
- Use the right stack: Deploy Prometheus with components like kube-state-metrics and node-exporter for thorough scraping.
- Tier your alerts:
- P0 (Critical): Availability issues (e.g., pods not ready).
- P1 (High): Performance problems (e.g., high latency).
- P2 (Medium): Capacity warnings (e.g., low allocatable resources).
- Optimize storage: Retain metrics for 14 to 30 days and use recording rules to speed up expensive data queries.
- Avoid "cardinality traps": Do not use unique IDs (like User IDs) as labels, as this can crash your monitoring database.
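The cardinality trap is easiest to see by counting time series: each unique label set becomes its own series, so an unbounded label like a user ID multiplies storage with your user base. A small illustration (the label names are hypothetical):

```python
def series_count(samples):
    """Count distinct time series: one per unique label set.

    `samples` is a list of label dicts attached to a metric. Bounded
    labels (status code, path) keep this small; unbounded labels
    (user IDs) make it grow without limit.
    """
    return len({tuple(sorted(labels.items())) for labels in samples})
```

With a `user_id` label, 1,000 requests from 1,000 users create 1,000 series; with a bounded `status` label, the same traffic collapses to a handful.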
Monitoring Kubernetes metrics with ManageEngine Applications Manager
ManageEngine Applications Manager simplifies the task of tracking Kubernetes metrics. It provides end-to-end visibility by bringing infrastructure and application data into one place. You can simply add your master node credentials, and the platform will auto-discover your clusters across EKS, GKE, or AKS.
Unified metrics dashboard
The platform offers a central view where you can see cluster, node, and pod metrics together. This makes it easy to spot resource bottlenecks and track usage trends over time. You can see your application performance right alongside the health of your Kubernetes environment.
Intelligent alerts
Applications Manager uses threshold-based and anomaly alerts. It sends notifications for resource exhaustion, pod failures, and scaling problems. This provides context to your alerts, which helps reduce notification fatigue for your team.
Application-centric approach
The tool goes beyond basic infrastructure. It links Kubernetes metrics to:
- Application response times.
- Transaction health.
- The actual experience of the end user.
This approach helps you understand exactly how a technical metric in Kubernetes affects your business outcomes. It turns raw data into operational intelligence, allowing you to manage your system proactively.