What are the core network performance monitoring metrics?
Let's now look at a few core metrics that act as pillars of the entire network performance monitoring function:
Availability and uptime
Availability is the foremost service-level indicator because it answers the fundamental question: can the device, link, or service be reached, and is it functioning within expected thresholds over time? It’s typically measured via ping checks, health endpoints, synthetic transactions, and BFD (Bidirectional Forwarding Detection) for path liveness on routed links. The nuance isn’t just “up” or “down,” but scope and dependency: available in this site/region, through this upstream, with these prerequisites satisfied. When availability degrades, it drives incident declarations, failovers, and potentially SLA penalties. Teams improve it through redundancy (devices, paths, power), tested failover plans, staged maintenance with rollbacks, and lifecycle management to avoid known-bad firmware or hardware. Availability alone isn’t enough (something can be “up” yet unusable), but without it, every higher-order metric is moot.
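As a minimal sketch of the idea, the following Python snippet performs repeated TCP connect checks against a placeholder target (gateway.example.net:443 is an assumption, not a real device) and reports availability as the share of successful probes. Real monitors layer health endpoints, synthetic transactions, and dependency scoping on top of this.

```python
import socket
import time

def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def availability(host: str, port: int, checks: int = 5, interval: float = 1.0) -> float:
    """Run repeated reachability checks and return availability as a percentage."""
    successes = 0
    for _ in range(checks):
        if is_reachable(host, port):
            successes += 1
        time.sleep(interval)
    return 100.0 * successes / checks

if __name__ == "__main__":
    # "gateway.example.net" is a placeholder target; substitute a real endpoint.
    print(f"availability: {availability('gateway.example.net', 443):.1f}%")
```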
Latency (end-to-end and hop-by-hop)
Latency is the time it takes traffic to traverse a path, captured as one-way or round-trip time. It’s central to perceived responsiveness: users feel latency as lag in loading pages, fetching APIs, or interacting with virtual desktops; real-time voice/video suffers from conversational delays. Measurement typically blends synthetic tests (ICMP, HTTP(S), DNS), TCP handshake timing, browser TTFB, and hop-by-hop traces to pinpoint where delay accumulates. Tail behavior (P95/P99) often explains complaints even when averages look fine. When latency drifts up, it’s usually due to congestion, suboptimal routing, middlebox processing, or simply distance. Remediation levers include QoS for interactive flows, routing/peering optimization, right-sizing links, reducing inspection hops, and placing content/services closer to users.
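A simple way to see the tail-versus-average distinction is to sample TCP handshake times and summarize them. The sketch below uses only the Python standard library; the host and sample count are illustrative.

```python
import socket
import statistics
import time

def handshake_rtt_ms(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Measure one TCP connect (handshake) time in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

def latency_summary(host: str, samples: int = 20) -> dict:
    """Collect handshake timings and summarize average and tail latency."""
    rtts = [handshake_rtt_ms(host) for _ in range(samples)]
    cuts = statistics.quantiles(rtts, n=100)  # cut points for percentiles 1..99
    return {
        "avg_ms": statistics.mean(rtts),
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }

if __name__ == "__main__":
    print(latency_summary("example.com"))
```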
Packet loss
Packet loss is the share of packets that don’t arrive. TCP compensates with retransmissions and congestion control, reducing effective throughput, while real-time UDP streams cannot recover in time and instead degrade in quality. Loss is measured via synthetic tests, device drops per interface or queue, and TCP retransmit indicators in flows or application telemetry. Even low single-digit loss over longer RTTs sharply limits throughput and produces visible artifacts in voice/video. Common drivers include queue overflows, faulty optics or cables, RF contention on Wi‑Fi, MTU mismatches, and overloaded security appliances. Addressing loss involves fixing physical faults, tuning QoS queues and shaping, correcting MTU/MSS, improving Wi‑Fi airtime, and scaling or offloading deep inspection.
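For illustration, the sketch below shells out to the system ping utility (a Unix-style -c count flag is assumed; options vary by platform) and parses the loss percentage from its summary line. Production monitors pair synthetic probes like this with interface drop counters and retransmit telemetry.

```python
import re
import subprocess

def packet_loss_pct(host: str, count: int = 20) -> float:
    """Return the packet loss percentage reported by the system ping utility."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    # Summary line looks like "... 0% packet loss ..." on common platforms.
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    if match is None:
        raise RuntimeError("could not parse ping output")
    return float(match.group(1))

if __name__ == "__main__":
    print(f"loss to example.com: {packet_loss_pct('example.com'):.1f}%")
```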
Jitter
Jitter is the variation in packet delay over time. Real-time media depends on consistent inter-packet timing; jitter forces larger jitter buffers, which in turn add delay and can still get overrun, resulting in choppy audio or frozen frames. It’s measured with RTP/UDP synthetic tests, voice endpoint metrics, and MOS, and should be tracked across time windows to catch microbursts. Typical causes include shared queues with bursty bulk transfers, route/path flaps, Wi‑Fi contention, and oversubscribed WAN circuits. Relief comes from QoS with strict priority for media, separating or scheduling bulk traffic, stabilizing SD‑WAN path switching, improving RF design, and monitoring queue depth by class.
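The calculation itself is simple: the sketch below reports the mean absolute difference between consecutive delay samples, in the spirit of RFC 3550 interarrival jitter. The sample values are placeholders that a real probe would replace with RTP/UDP measurements or voice endpoint telemetry.

```python
import statistics

def jitter_ms(delays_ms: list[float]) -> float:
    """Mean absolute difference between consecutive delay samples, in ms."""
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    return statistics.mean(diffs)

if __name__ == "__main__":
    # Illustrative samples: a microburst shows up around the 5th-6th packets.
    samples = [20.1, 20.4, 19.9, 20.2, 38.7, 41.2, 22.0, 20.3, 20.1, 20.5]
    print(f"jitter: {jitter_ms(samples):.2f} ms")
```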
Throughput and bandwidth utilization
Throughput is the rate of data actually transferred; utilization expresses what fraction of capacity is in use. These metrics connect performance to capacity planning and cost. Persistent high utilization signals congestion risk; low utilization alongside slow apps points elsewhere (loss, DNS/TCP issues, server saturation). They’re measured with interface counters, per-class utilization, flow records for “who/what/where,” and synthetic bulk tests. Peaks often trace back to backups, software updates, data pipelines, or “noisy neighbors.” Action typically includes adding or enabling burst capacity, applying QoS and shaping, rescheduling bulk jobs, deploying caches/CDNs, and planning upgrades from trends rather than emergencies.
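As a rough illustration on a Linux host, the snippet below samples byte counters from /proc/net/dev twice and converts the delta into a utilization percentage. The interface name and link speed are assumptions; routers and switches expose the same counters via SNMP or streaming telemetry rather than /proc.

```python
import time

def rx_tx_bytes(interface: str) -> tuple[int, int]:
    """Read received/transmitted byte counters for one interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(interface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])  # rx_bytes, tx_bytes
    raise ValueError(f"interface {interface!r} not found")

def utilization_pct(interface: str, link_speed_bps: float, window_s: float = 5.0) -> float:
    """Percentage of link capacity used (rx + tx) over the sampling window."""
    rx1, tx1 = rx_tx_bytes(interface)
    time.sleep(window_s)
    rx2, tx2 = rx_tx_bytes(interface)
    bits = ((rx2 - rx1) + (tx2 - tx1)) * 8
    return 100.0 * bits / (link_speed_bps * window_s)

if __name__ == "__main__":
    # Assumes a 1 Gbps link on "eth0"; adjust for your environment.
    print(f"utilization: {utilization_pct('eth0', 1e9):.2f}%")
```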
Interface errors and discards
Errors (CRC, alignment, FCS) and discards (buffer overflows or policy drops) are the physical and queue-level tells that explain many “random slowness” tickets. They’re tracked per interface and per QoS class through telemetry. Healthy links should show near-zero sustained errors; discards should be rare in priority classes. Causes range from failing optics/cables and duplex mismatches to microbursts overwhelming buffers or misclassified traffic saturating a class. Fixes include replacing optics/cables, normalizing speed/duplex, resizing queues and policing, refining traffic classification, and instrumenting short-interval counters to catch brief but damaging spikes.
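A minimal watcher for these counters might look like the Linux sysfs sketch below; the interface name and polling interval are illustrative, and switches and routers expose equivalents such as ifInErrors and ifOutDiscards over SNMP or streaming telemetry.

```python
import time
from pathlib import Path

COUNTERS = ["rx_errors", "tx_errors", "rx_dropped", "tx_dropped"]

def read_counters(interface: str) -> dict[str, int]:
    """Read error/discard counters from /sys/class/net/<iface>/statistics/."""
    base = Path("/sys/class/net") / interface / "statistics"
    return {name: int((base / name).read_text()) for name in COUNTERS}

def watch(interface: str, interval_s: float = 30.0) -> None:
    """Poll at short intervals and print any counter increases (should stay zero)."""
    previous = read_counters(interface)
    while True:
        time.sleep(interval_s)
        current = read_counters(interface)
        deltas = {k: current[k] - previous[k] for k in COUNTERS if current[k] > previous[k]}
        if deltas:
            print(f"{interface}: counters increased: {deltas}")
        previous = current

if __name__ == "__main__":
    watch("eth0", interval_s=10.0)
```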
Application response time
Application response time reflects what users actually perceive. The network share can be isolated by combining synthetic app checks, browser timings, and APM breakdowns across client, network, and server tiers. Rising response time often correlates with added network latency/loss, slow DNS, chatty protocols over long RTT, or middlebox overhead. It’s the metric that closes the loop from infrastructure health to business impact: abandonment rates, agent handle times, and API partner satisfaction. Improvements include reducing round trips (HTTP/2/3, keep‑alive), optimizing egress and peering to reach the nearest POP/region, co-locating chatty services, and using CDNs or edge compute for latency-sensitive components, with shared dashboards so app and network teams align on the same view.
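A client-side breakdown can be approximated with the standard library alone, as in the sketch below, which splits a simple HTTPS GET against a placeholder host into DNS, connect (TCP plus TLS), and time-to-first-byte stages. APM agents and browser timings provide the same split with far more fidelity.

```python
import http.client
import socket
import time

def response_breakdown(host: str, path: str = "/") -> dict[str, float]:
    """Return per-stage timings (milliseconds) for a simple HTTPS GET."""
    timings = {}

    start = time.perf_counter()
    socket.getaddrinfo(host, 443)                    # DNS resolution
    timings["dns_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    conn = http.client.HTTPSConnection(host, 443, timeout=5)
    conn.connect()                                   # TCP + TLS handshakes
    timings["connect_ms"] = (time.perf_counter() - start) * 1000

    start = time.perf_counter()
    conn.request("GET", path)
    response = conn.getresponse()                    # headers received = first byte
    timings["ttfb_ms"] = (time.perf_counter() - start) * 1000

    response.read()
    conn.close()
    return timings

if __name__ == "__main__":
    print(response_breakdown("example.com"))
```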
DNS and TCP health (resolution time, retransmits, handshake time)
Every connection starts with DNS and TCP. Slow or unreliable DNS inflates page loads and API calls across the board; TCP retransmissions, long handshakes, and zero-window events crush throughput and create intermittent failures. Measurement blends DNS synthetic queries (success and latency) and TCP metrics from APM/flows/packet analysis. Root causes include overloaded or remote resolvers, split-horizon misconfigurations, lossy/high-latency paths, MTU blackholing, and poor TCP tuning (small buffers, disabled window scaling). Remedies include anycast/local resolvers with caching, correcting MTU/MSS, enabling modern TCP features, and improving path quality to reduce handshake time and retransmits.
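The resolution-time half of this can be sampled with a few lines of Python, as sketched below against illustrative names through the system resolver (local caching will flatter repeat lookups). Real checks also query specific resolvers directly and pull TCP retransmit counts from flow or packet telemetry.

```python
import socket
import statistics
import time

def resolve_once(name: str) -> float | None:
    """Return resolution time in ms via the system resolver, or None on failure."""
    start = time.perf_counter()
    try:
        socket.getaddrinfo(name, None)
    except socket.gaierror:
        return None
    return (time.perf_counter() - start) * 1000

def dns_health(names: list[str], rounds: int = 5) -> dict:
    """Summarize success rate and latency over several resolution rounds."""
    timings, failures = [], 0
    for _ in range(rounds):
        for name in names:
            elapsed = resolve_once(name)
            if elapsed is None:
                failures += 1
            else:
                timings.append(elapsed)
    total = rounds * len(names)
    return {
        "success_pct": 100.0 * (total - failures) / total,
        "avg_ms": statistics.mean(timings) if timings else None,
        "max_ms": max(timings) if timings else None,
    }

if __name__ == "__main__":
    print(dns_health(["example.com", "example.org", "example.net"]))
```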
Path stability and route change frequency
Stable paths underpin stable performance. Frequent route changes (flaps) or aggressive SD‑WAN path switching induces jitter and transient loss that evade coarse polling but are obvious to users of SaaS and collaboration tools. Stability is monitored with continuous traceroute/path tests, BGP telemetry, and SD‑WAN path stats, and correlated to jitter and incident logs. Causes include carrier or peering instability, asymmetric routing that confuses stateful devices, and overly sensitive brownout thresholds. Teams typically tune SD‑WAN policies to require sustained degradation before failover, prefer symmetric routing for sensitive flows, and measure provider performance over time to justify changes or multihoming.
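One lightweight way to spot flaps is to record the hop sequence from periodic traceroutes and log whenever it changes, as in the sketch below (Unix-style traceroute with -n and -q flags is assumed, and output parsing is deliberately rough). BGP telemetry and SD‑WAN path statistics complement this view.

```python
import subprocess
import time

def current_path(host: str) -> tuple[str, ...]:
    """Return the sequence of hop addresses reported by traceroute -n."""
    result = subprocess.run(
        ["traceroute", "-n", "-q", "1", host],
        capture_output=True,
        text=True,
    )
    hops = []
    for line in result.stdout.splitlines()[1:]:      # skip the header line
        fields = line.split()
        if len(fields) >= 2:
            hops.append(fields[1])                   # hop address, or "*" on timeout
    return tuple(hops)

def watch_path(host: str, interval_s: float = 60.0) -> None:
    """Log a message whenever the observed forwarding path changes."""
    previous = current_path(host)
    while True:
        time.sleep(interval_s)
        current = current_path(host)
        if current != previous:
            print(f"path to {host} changed: {previous} -> {current}")
            previous = current

if __name__ == "__main__":
    watch_path("example.com", interval_s=30.0)
```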
Device health: CPU, memory, temperature/fans
Network nodes under resource or thermal stress silently degrade forwarding and control-plane behavior, leading to delays, drops, or crashes. Telemetry covers CPU (ideally by process), memory, control-plane queues, temperature sensors, and fan status. Chronic pressure often traces to excessive inspection features (DPI/IPS/IDS) on undersized hardware, volumetric spikes or logging/telemetry misconfiguration, or environmental issues (blocked airflow, failed fans). The impact is cascading: slow reconvergence, packet punts, sporadic drops, and prolonged outages with long MTTR. Standard responses include scaling or offloading inspection, rate-limiting export volumes, upgrading hardware, enforcing airflow/HVAC best practices, and capacity planning for control-plane headroom, not just data-plane throughput.
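As a host-level analogue of this polling, the sketch below reads CPU load and memory pressure from Linux /proc and flags threshold breaches; actual network devices expose the equivalent data, including temperature and fan status, via SNMP or streaming telemetry, and the thresholds shown are illustrative.

```python
import os

def cpu_load_per_core() -> float:
    """1-minute load average normalized by CPU count."""
    load_1m, _, _ = os.getloadavg()
    return load_1m / os.cpu_count()

def memory_used_pct() -> float:
    """Percentage of memory in use, derived from /proc/meminfo."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])        # values are reported in kB
    return 100.0 * (info["MemTotal"] - info["MemAvailable"]) / info["MemTotal"]

def check_health(cpu_threshold: float = 0.8, mem_threshold: float = 90.0) -> list[str]:
    """Return a list of warnings for any metric over its (illustrative) threshold."""
    warnings = []
    if cpu_load_per_core() > cpu_threshold:
        warnings.append(f"CPU load per core above {cpu_threshold:.0%}")
    if memory_used_pct() > mem_threshold:
        warnings.append(f"memory use above {mem_threshold:.0f}%")
    return warnings

if __name__ == "__main__":
    print(check_health() or "healthy")
```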