You bucketed the conversations. The p50 is probably fine — somewhere in the 1–2s range. The p75 is probably OK. The p95 is where the problem is hiding. Usually in the 4–8s range. The p99 is in the teens.
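If the bucketing isn't scripted yet, a minimal sketch of the percentile cut could look like this. It assumes time-to-first-token is logged in milliseconds for each conversation. The names `percentile`, `latencySummary`, and `ttftMs` are illustrative, and nearest-rank is only one of several valid percentile definitions.

```javascript
// Nearest-rank percentile: the smallest value such that at least p% of
// the samples are less than or equal to it. p is in the range 0..100.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing so integer inputs avoid float rounding drift.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize an array of time-to-first-token samples (milliseconds).
function latencySummary(ttftMs) {
  return {
    p50: percentile(ttftMs, 50),
    p75: percentile(ttftMs, 75),
    p95: percentile(ttftMs, 95),
    p99: percentile(ttftMs, 99),
  };
}
```

Run it over at least a few hundred conversations: with small samples the p95 and p99 are decided by one or two outliers.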
Now look at completion rate bucketed by the same time-to-first-token windows. Almost every deployment we've audited shows the same shape:
- Sub-2s buckets: most users finish the conversation
- 2–4s buckets: meaningful drop-off begins
- 4–8s buckets: completion roughly halves
- Over 8s: most users have left
The drop-off cliff is between 2s and 4s. Your median is fine. The slow tail around your 95th percentile is where the conversions you could still win are being lost.
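To check that shape against your own logs, something along these lines may be enough. It is a sketch under assumptions: each conversation record carries a `ttftMs` number and a `completed` boolean (both hypothetical field names), and the bucket edges simply mirror the windows listed above.

```javascript
// Bucket edges in milliseconds; a record lands in the first bucket whose
// upper bound it is strictly below, so exactly 2000ms counts as "2-4s".
const BUCKETS = [
  { label: '<2s', maxMs: 2000 },
  { label: '2-4s', maxMs: 4000 },
  { label: '4-8s', maxMs: 8000 },
  { label: '>8s', maxMs: Infinity },
];

function completionByBucket(conversations) {
  const stats = BUCKETS.map((b) => ({ label: b.label, total: 0, completed: 0 }));
  for (const c of conversations) {
    // Skip records with a missing or non-numeric latency.
    if (!Number.isFinite(c.ttftMs)) continue;
    const i = BUCKETS.findIndex((b) => c.ttftMs < b.maxMs);
    stats[i].total += 1;
    if (c.completed) stats[i].completed += 1;
  }
  // rate is null for empty buckets so they are not mistaken for 0%.
  return stats.map((s) => ({
    ...s,
    rate: s.total === 0 ? null : s.completed / s.total,
  }));
}
```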
Why 3 seconds?
It's a real number, not a vibe. Response-time research in interaction design has pointed the same way for decades: a few seconds of silence is where users start questioning whether the system is still working, and 3 seconds is a practical place to draw the line. Past 3s without feedback, the user shifts from "waiting" to "is this broken?" and the next click is often Cmd-W.
For human SMS conversations the threshold is more like 30s — because you assume the other person is busy. For AI conversations, the user's mental model is "this is a computer, the response should be instant." A 4-second silence reads as a hang.
The fix has nothing to do with making the model faster. The fix is making the wait feel intentional.
The pattern: a 3-second "still working" interim message
Most AI-chat UIs stream tokens. If the LLM has a 4s thinking phase before the first token arrives, the user sees zero feedback for 4 seconds.
The pattern: if no token has arrived after 3 seconds, send a single interim message back to the client. Not the answer. A signal. Something concrete enough that the user knows the system is alive and progressing — but not so concrete that you've committed to a path the model hasn't reached yet.
Done well, this looks like:
- "Pulling your account history..." for retrieval-flavored queries
- "Drafting a response..." for open-ended ones
- "Checking the calendar..." for scheduling intents
The interim message itself is generated cheaply: either rule-based off the detected intent, or by a single fast call to a small model (gpt-4.1-nano, Claude Haiku, etc.) routed in parallel with the main reasoning call. It's a 50-token response with a 200ms target. The main response continues uninterrupted.
The implementation
If you're on the OpenAI-compatible streaming protocol, this is one helper function. Wrap your existing stream with a watchdog:
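async function streamWithHeartbeat(modelStream, onChunk, intentHint) {
  let firstTokenSeen = false;

  // Watchdog: fires once if no token has arrived after 3 seconds.
  const heartbeatTimer = setTimeout(async () => {
    if (firstTokenSeen) return;
    try {
      const interim = await getInterimMessage(intentHint);
      // Re-check: the first token may have landed while the interim was generated.
      if (!firstTokenSeen) onChunk({ type: 'interim', text: interim });
    } catch {
      // A failed interim must never break the main response, so skip it.
    }
  }, 3000);

  try {
    for await (const chunk of modelStream) {
      if (!firstTokenSeen) {
        firstTokenSeen = true;
        clearTimeout(heartbeatTimer);
      }
      onChunk({ type: 'token', text: chunk });
    }
  } finally {
    // Covers streams that end or throw before producing any token.
    firstTokenSeen = true;
    clearTimeout(heartbeatTimer);
  }
}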
getInterimMessage() can be a switch on intent (free, 0ms) or a fast LLM call (one-shot, ~200ms). The watchdog fires only if the main stream hasn't started by t=3s. Once tokens start flowing, the timer is cleared and the interim never appears.
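The rule-based variant can be as small as a lookup table. A minimal sketch, assuming an upstream classifier emits intent labels such as `retrieval`, `scheduling`, and `open_ended` (these labels are hypothetical, so substitute whatever your intent detection produces):

```javascript
// Interim copy keyed by detected intent. A Map avoids accidental hits on
// inherited object properties such as "constructor".
const INTERIM_BY_INTENT = new Map([
  ['retrieval', 'Pulling your account history...'],
  ['scheduling', 'Checking the calendar...'],
  ['open_ended', 'Drafting a response...'],
]);

// Returns a plain string. Callers that `await` the result still work,
// because awaiting a non-promise value simply yields that value.
function getInterimMessage(intentHint) {
  // Unknown or missing intents fall back to the most generic message.
  return INTERIM_BY_INTENT.get(intentHint) ?? 'Drafting a response...';
}
```

Keeping the fallback generic matters: an interim that names the wrong activity commits you to a path the model hasn't reached.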
What changes after you ship it
Two patterns repeat in the deployments where we've shipped this:
- Clients with a long p95 (slow retrieval, slow upstream APIs, or a fan-out reasoning step) see meaningful completion-rate lift. The worse the p95, the bigger the lift.
- Clients with already-tight p95 (sub-3s for nearly all traffic) see a flat outcome — exactly the result you'd expect, because their problem wasn't latency-related.
The signal: this is a fix that only matters if you have a p95 latency problem. Most teams do, but they don't realize it because the median looks fine. Bucket the data and you find it immediately.
The other thing it fixes
Mid-conversation drop-off looks like a single metric on most dashboards. It isn't. It's three separate failure modes glued together:
- Initial-load drop: the user sends the first message, then leaves before the response arrives. The 3-second pattern fixes this.
- Mid-stream abandonment: the response starts streaming but the user stops reading. This is a content/length problem, not a latency problem.
- Completion fatigue: the conversation got to a useful place but the user didn't take the next action (book, buy, reply). This is a CTA problem.
The heartbeat fix is surgical — it targets failure mode #1 only. Bucket your drop-offs by point-of-loss before you ship anything, and you'll know which one is yours.
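One way to do that bucketing, sketched under assumptions: each record describes the last assistant turn before the user left, with millisecond timestamps `firstTokenAt`, `responseEndAt`, and `leftAt`, plus a `converted` flag. All four field names are hypothetical, and real logs will likely need their own mapping.

```javascript
// Assign a conversation to one of the three failure modes, or 'none'
// if the user took the next action.
function classifyDropOff(c) {
  if (c.converted) return 'none';
  // No token was ever shown, or the user left before the first one.
  if (c.firstTokenAt == null || (c.leftAt != null && c.leftAt < c.firstTokenAt)) {
    return 'initial-load';
  }
  // Streaming started, but the user left before it finished.
  if (c.responseEndAt == null || (c.leftAt != null && c.leftAt < c.responseEndAt)) {
    return 'mid-stream';
  }
  // The full response was delivered and the user still did not act.
  return 'completion-fatigue';
}

function countByFailureMode(conversations) {
  const counts = { none: 0, 'initial-load': 0, 'mid-stream': 0, 'completion-fatigue': 0 };
  for (const c of conversations) counts[classifyDropOff(c)] += 1;
  return counts;
}
```

If `initial-load` is a small share of the total, the heartbeat is unlikely to be the highest-value fix.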
For every agent we deploy at AutomateScale, we run a 7-day capture: every conversation logged with time-to-first-token, total response time, intent classification, and outcome. p50/p75/p95/p99 latency histograms get cut by intent. The heartbeat threshold gets tuned per intent (retrieval-heavy intents have looser thresholds than greeting-style ones, because users have a different mental model for each). Want the script? It's part of the Scale Audit deliverable — apply for the audit and we send it.
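This is not the audit script, but a rough illustration of the per-intent cut and a per-intent threshold lookup might look like this. The record fields (`intent`, `ttftMs`) are assumed, and the millisecond values in the table are placeholders, not tuned recommendations.

```javascript
// Nearest-rank p95 of time-to-first-token, computed separately per intent.
function p95ByIntent(conversations) {
  const byIntent = new Map();
  for (const c of conversations) {
    if (!byIntent.has(c.intent)) byIntent.set(c.intent, []);
    byIntent.get(c.intent).push(c.ttftMs);
  }
  const result = new Map();
  for (const [intent, samples] of byIntent) {
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = Math.ceil((95 * sorted.length) / 100);
    result.set(intent, sorted[rank - 1]);
  }
  return result;
}

// Placeholder thresholds: looser for retrieval, tighter for greetings.
const DEFAULT_HEARTBEAT_MS = 3000;
const HEARTBEAT_MS_BY_INTENT = new Map([
  ['greeting', 2000],  // placeholder value
  ['retrieval', 4000], // placeholder value
]);

// Delay to pass to the watchdog timer for a given intent.
function heartbeatDelayMs(intent) {
  return HEARTBEAT_MS_BY_INTENT.get(intent) ?? DEFAULT_HEARTBEAT_MS;
}
```

In the watchdog, the fixed `3000` passed to `setTimeout` would then become `heartbeatDelayMs(intentHint)`.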
The one-line summary
Your median agent response time isn't the bottleneck. Your p95 is. Most users have already left by the time your slow path returns. A 3-second "still working" signal recovers the conversations you're currently losing to perceived hangs — without making the model faster.
Ship the heartbeat this week. Re-bucket the data next week. If your p95 was the problem, the lift will be obvious.