Hunting p99: where tail latency hides

Nika Engineering

Averages lie about latency. A service can report a 40ms mean response time and still be failing a meaningful slice of users — because the average hides the tail, and the tail is where the pain lives. The user staring at a spinner isn’t experiencing your mean. They’re experiencing your p99.

We design to percentiles, and we treat the tail as a defect to be hunted, not a constant to be accepted.

Nobody remembers the request that was fast. They remember the one that hung.

Why the tail matters more than the mean

On any page that makes several backend calls, the slowest one sets the pace. Make ten parallel requests, each with an independent 1% chance of being slow, and the chance that at least one of them is slow is 1 - 0.99^10, or about 9.6%. Roughly one in ten page loads inherits a slow request. Your p99 per call quietly becomes your p90 per page. Tail latency compounds, which is exactly why shaving the worst 1% often does more for perceived speed than optimizing the average ever could.
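The arithmetic behind that claim is short enough to check directly. The sketch below assumes every call is slow independently with the same probability, which real systems only approximate. The function name `p_page_slow` is ours, chosen for illustration.

```python
def p_page_slow(p_call_slow: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent calls is slow."""
    if not 0.0 <= p_call_slow <= 1.0:
        raise ValueError("p_call_slow must be a probability")
    if fan_out < 0:
        raise ValueError("fan_out must be non-negative")
    # All calls are fast with probability (1 - p)^n; the page is slow otherwise.
    return 1.0 - (1.0 - p_call_slow) ** fan_out


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        print(f"{n:>3} calls -> {p_page_slow(0.01, n):.1%} of page loads hit a slow call")
```

With a 1% slow rate per call, this gives about 9.6% at a fan-out of 10, about 39.5% at 50, and about 63.4% at 100. The wider the fan-out, the more of the page-level experience the per-call tail controls.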

Where it hides

Tail latency rarely comes from the code you’d expect. The usual suspects:

  • Queueing, not computing. The work is fast; waiting for a thread, a connection, or a lock is slow. Saturated pools are the most common cause we find.
  • Garbage collection and other pauses. Periodic stop-the-world events turn a fast service into an occasionally frozen one.
  • The cold path. Cache misses, first-request JIT, a lazily-opened connection — the rare path is the slow path, and the rare path is your tail.
  • Noisy neighbours. Shared infrastructure means someone else’s spike becomes your latency.
  • Retries and fan-out. A timeout that triggers a retry can double the work at the worst possible moment, turning a blip into a storm.
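To see why queueing dominates, a textbook model helps. The sketch below uses the M/M/1 queue (one server, random arrivals, exponentially distributed service times, first come first served), where time in system is itself exponentially distributed with rate mu - lambda. That model is an assumption made for illustration, not a description of any particular service, and the function name is hypothetical.

```python
import math


def mm1_sojourn_percentile(service_ms: float, utilization: float, q: float = 0.99) -> float:
    """q-quantile of time in system (waiting + service), in ms, for an M/M/1 FCFS queue.

    service_ms  -- mean service time per request
    utilization -- fraction of server capacity in use, in [0, 1)
    """
    if service_ms <= 0:
        raise ValueError("service_ms must be positive")
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    if not 0.0 < q < 1.0:
        raise ValueError("q must be in (0, 1)")
    mu = 1.0 / service_ms        # service rate, requests per ms
    lam = utilization * mu       # arrival rate, requests per ms
    # Time in system ~ Exponential(mu - lam), so invert its CDF at q.
    return -math.log(1.0 - q) / (mu - lam)


if __name__ == "__main__":
    for rho in (0.5, 0.9, 0.99):
        p99 = mm1_sojourn_percentile(service_ms=10.0, utilization=rho)
        print(f"utilization {rho:.0%}: p99 ~ {p99:,.0f} ms")
```

With a 10 ms mean service time, the modeled p99 is roughly 92 ms at 50% utilization, 461 ms at 90%, and 4.6 s at 99%. The code doing the work never got slower; only the wait did.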

How we hunt it

You can’t fix what you can’t see, and averages render the tail invisible. So we instrument for it directly:

  1. Measure percentiles, not means — p50, p95, p99, and p99.9 — on every meaningful path, broken down by endpoint.
  2. Trace the slow requests specifically. Distributed tracing on the requests that exceed budget shows you exactly which span ate the time.
  3. Set a latency budget per call and treat breaches as alerts, the same way you’d treat errors.
  4. Attack the queue first. More often than not, the fix is connection-pool sizing, backpressure, or concurrency limits — not a faster algorithm.
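Steps 1 and 3 can be sketched in a few lines. This is a minimal illustration using the nearest-rank percentile over raw samples held in memory; a production system would more likely use histograms in its metrics backend. The `summarize` function, the record shape, and the budget values are all assumptions made for the example.

```python
import math
from collections import defaultdict


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # round() guards against float noise such as 99.9 * n / 100 landing just above an integer.
    rank = max(1, math.ceil(round(q * len(ordered) / 100.0, 9)))
    return ordered[rank - 1]


def summarize(records, budgets_ms):
    """records: iterable of (endpoint, latency_ms) pairs.
    budgets_ms: hypothetical per-endpoint p99 budgets, in ms.
    Returns per-endpoint percentiles plus a flag for a p99 budget breach."""
    by_endpoint = defaultdict(list)
    for endpoint, latency_ms in records:
        by_endpoint[endpoint].append(latency_ms)

    report = {}
    for endpoint, samples in by_endpoint.items():
        stats = {f"p{q:g}": percentile(samples, q) for q in (50, 95, 99, 99.9)}
        stats["mean"] = sum(samples) / len(samples)
        budget = budgets_ms.get(endpoint)
        stats["breach"] = budget is not None and stats["p99"] > budget
        report[endpoint] = stats
    return report


if __name__ == "__main__":
    # 98 fast requests and 2 slow ones: the mean looks healthy, the p99 does not.
    records = [("/search", 10.0)] * 98 + [("/search", 500.0)] * 2
    print(summarize(records, budgets_ms={"/search": 200.0}))
```

In the sample data the mean is 19.8 ms while the p99 is 500 ms, so a dashboard built on averages would show nothing wrong while the budget check fires.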

Make the fast path the only path

Once the worst offenders are found, the durable fixes are structural: keep hot data in memory and close to the compute that needs it; bound every queue and pool so saturation degrades predictably instead of catastrophically; cap and jitter retries so they help instead of amplify; and shed load deliberately before the tail runs away from you.
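Two of those fixes are small enough to sketch: capped, jittered retries and a bounded queue that sheds load when full. The version below uses capped exponential backoff with full jitter and the standard library's `queue.Queue`. The class and function names, the default delays, and the choice to reject new work rather than drop old work are assumptions for illustration, not a prescribed design.

```python
import queue
import random


def backoff_delays(max_retries=3, base_s=0.05, cap_s=1.0, rng=random):
    """Yield one sleep duration per retry: exponential growth, capped, with full jitter."""
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
        yield rng.uniform(0.0, ceiling)


class BoundedInbox:
    """A work queue with a hard depth limit that rejects new work when full."""

    def __init__(self, max_depth: int):
        if max_depth < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, which defeats the purpose.
            raise ValueError("max_depth must be at least 1")
        self._q = queue.Queue(maxsize=max_depth)
        self.shed = 0  # count of rejected items, worth exporting as a metric

    def offer(self, item) -> bool:
        """Enqueue without blocking. Returns False (and counts a shed) if the queue is full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.shed += 1
            return False

    def take(self, timeout_s: float = 0.1):
        """Dequeue one item, or return None if nothing arrives within timeout_s."""
        try:
            return self._q.get(timeout=timeout_s)
        except queue.Empty:
            return None


if __name__ == "__main__":
    print([round(d, 3) for d in backoff_delays(max_retries=5, cap_s=0.2)])
    inbox = BoundedInbox(max_depth=2)
    print([inbox.offer(n) for n in range(3)], "shed:", inbox.shed)
```

Rejecting at the door keeps the wait for accepted work bounded by the queue depth, and a finite retry count with a cap means a brief outage cannot multiply traffic without limit.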

Chasing p99 is unglamorous work — it’s measurement, patience, and a refusal to accept “it’s usually fine.” But on a platform where latency is the product, the slowest 1% is the part of the experience your users remember. So that’s the part we engineer for.