The Scale Brief · Issue #144

Your median response time is fine.
Your 95th percentile is killing you.

Look at your last 1,000 AI-agent conversations and bucket them by time-to-first-token. I'll wait.

You bucketed the conversations. The p50 is probably fine — somewhere in the 1–2s range. The p75 is probably OK. The p95 is where the problem is hiding. Usually in the 4–8s range. The p99 is in the teens.
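If you have not cut the numbers yet, a minimal sketch of the percentile pass might look like this. It assumes each logged conversation carries a `ttftMs` field (time-to-first-token in milliseconds); that field name and both function names are placeholders, not part of any particular logging stack.

```javascript
// Nearest-rank percentile over an array of latencies in milliseconds.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing so whole-number ranks stay exact.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Assumption: each conversation record has a numeric `ttftMs` field.
function latencySummary(conversations) {
  const ttft = conversations.map((c) => c.ttftMs);
  return {
    p50: percentile(ttft, 50),
    p75: percentile(ttft, 75),
    p95: percentile(ttft, 95),
    p99: percentile(ttft, 99),
  };
}
```

Nearest-rank is the simplest percentile definition; an interpolating one would give slightly different values on small samples.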

Now look at completion rate bucketed by the same time-to-first-token windows. Almost every deployment we've audited shows the same shape:

The drop-off cliff is between 2s and 4s. Your median is fine. The slow tail around your 95th percentile is where would-be conversions quietly die.
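One way to produce that cut, as a rough sketch: group conversations into time-to-first-token windows and compute the completion rate per window. The window edges below and the `ttftMs` and `completed` field names are assumptions for illustration; adjust them to whatever your logs actually record.

```javascript
// Hypothetical windows; upper bounds are exclusive.
const BUCKETS = [
  { label: '0-2s', maxMs: 2000 },
  { label: '2-4s', maxMs: 4000 },
  { label: '4-8s', maxMs: 8000 },
  { label: '8s+', maxMs: Infinity },
];

// Assumption: each record has `ttftMs` (number) and `completed` (boolean).
function completionRateByBucket(conversations) {
  const rows = BUCKETS.map((b) => ({ label: b.label, total: 0, completed: 0 }));
  for (const c of conversations) {
    if (!Number.isFinite(c.ttftMs)) continue; // skip records with no usable latency
    const i = BUCKETS.findIndex((b) => c.ttftMs < b.maxMs);
    rows[i].total += 1;
    if (c.completed) rows[i].completed += 1;
  }
  // rate is null for empty windows so they are not mistaken for 0%.
  return rows.map((r) => ({
    ...r,
    rate: r.total === 0 ? null : r.completed / r.total,
  }));
}
```

If the cliff is there, it shows up as a sharp drop in `rate` between two adjacent windows.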

Why 3 seconds?

It's a real number, not a vibe. The interaction-design literature has been consistent on it for forty years: 3 seconds is the threshold beyond which users start questioning whether the system is still working. Past 3s without feedback, the user shifts from "waiting" to "is this broken?" and the next click is often Cmd-W.

For human SMS conversations the threshold is more like 30s — because you assume the other person is busy. For AI conversations, the user's mental model is "this is a computer, the response should be instant." A 4-second silence reads as a hang.

The fix has nothing to do with making the model faster. The fix is making the wait feel intentional.

The pattern: a 3-second "still working" interim message

Most AI-chat UIs stream tokens. If the LLM has a 4s thinking phase before the first token arrives, the user sees zero feedback for 4 seconds.

The pattern: if no token has arrived after 3 seconds, send a single interim message back to the client. Not the answer. A signal. Something concrete enough that the user knows the system is alive and progressing — but not so concrete that you've committed to a path the model hasn't reached yet.

Done well, this looks like:

The interim message itself is generated cheaply: either rule-based off the detected intent, or by a single fast call to a small model (gpt-4.1-nano, Claude Haiku, etc.) routed in parallel with the main reasoning call. It's a 50-token response with a 200ms target. The main response continues uninterrupted.

The implementation

If you're on the OpenAI-compatible streaming protocol, this is one helper function. Wrap your existing stream with a watchdog:

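// Assumes modelStream is an async iterable that yields text deltas.
async function streamWithHeartbeat(modelStream, onChunk, intentHint) {
  let firstTokenSeen = false;
  let streamDone = false;

  // Watchdog: fires once if no token has arrived within 3 seconds.
  const heartbeatTimer = setTimeout(async () => {
    if (firstTokenSeen) return;
    try {
      const interim = await getInterimMessage(intentHint);
      // Re-check after the await: the first token may have arrived (or the
      // stream may have ended) while the interim message was being generated.
      if (firstTokenSeen || streamDone) return;
      onChunk({ type: 'interim', text: interim });
    } catch (err) {
      // A failed interim must never break the main response.
      console.warn('interim message failed:', err);
    }
  }, 3000);

  try {
    for await (const chunk of modelStream) {
      if (!firstTokenSeen) {
        firstTokenSeen = true;
        clearTimeout(heartbeatTimer);
      }
      onChunk({ type: 'token', text: chunk });
    }
  } finally {
    // Also stop the watchdog if the stream errors or ends without tokens.
    streamDone = true;
    clearTimeout(heartbeatTimer);
  }
}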

getInterimMessage() can be a switch on intent (free, 0ms) or a fast LLM call (one-shot, ~200ms). The watchdog fires only if the main stream hasn't started by t=3s. Once tokens start flowing, the timer is cleared and the interim never appears.
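A minimal sketch of what `getInterimMessage()` could look like, covering both options. The intent labels, the message copy, and the optional `fastCall` and `budgetMs` parameters are all hypothetical; `fastCall` stands in for whatever small-model client you already have and is not a real library API.

```javascript
// Hypothetical intent labels and copy. Keep the wording non-committal.
const INTERIM_BY_INTENT = new Map([
  ['order_status', 'Pulling that up now...'],
  ['booking', 'Checking availability...'],
]);
const DEFAULT_INTERIM = 'Still working on that...';

// fastCall: optional async function (intentHint) => string, e.g. a small-model call.
// budgetMs: how long to wait for fastCall before using the rule-based text.
async function getInterimMessage(intentHint, fastCall = null, budgetMs = 200) {
  const fallback = INTERIM_BY_INTENT.get(intentHint) ?? DEFAULT_INTERIM;
  if (!fastCall) return fallback; // rule-based path: no network, no delay

  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), budgetMs);
  });
  try {
    // Whichever settles first wins: the fast model or the budget timer.
    return await Promise.race([fastCall(intentHint), timeout]);
  } catch {
    return fallback; // a failed fast call degrades to the rule-based text
  } finally {
    clearTimeout(timer);
  }
}
```

Racing the fast call against a budget keeps the interim from becoming its own latency problem: if the small model is slow or errors, the user still gets the rule-based line.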

What changes after you ship it

Two patterns repeat in the deployments where we've shipped this:

The signal: this is a fix that only matters if you have a p95 latency problem. Most teams do, but they don't realize because the median looks fine. Bucket the data and you find it immediately.

The other thing it fixes

Mid-conversation drop-off looks like a single metric on most dashboards. It isn't. It's three separate failure modes glued together:

The heartbeat fix is surgical — it targets failure mode #1 only. Bucket your drop-offs by point-of-loss before you ship anything, and you'll know which one is yours.

The audit script

For every agent we deploy at AutomateScale, we run a 7-day capture: every conversation logged with time-to-first-token, total response time, intent classification, and outcome. p50/p75/p95/p99 latency histograms get cut by intent. The heartbeat threshold gets tuned per intent (retrieval-heavy intents have looser thresholds than greeting-style ones, because users have a different mental model for each). Want the script? It's part of the Scale Audit deliverable — apply for the audit and we send it.
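This is not the audit script, but the per-intent cut can be approximated in a few lines. The sketch below assumes each captured record has an `intent` label and a `ttftMs` value (both field names are placeholders) and reports the p95 per intent, which is a reasonable starting point for choosing a per-intent heartbeat threshold.

```javascript
// Nearest-rank p95 of a non-empty numeric array.
function p95(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((95 * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Assumption: each record has `intent` (string) and `ttftMs` (number).
function p95ByIntent(conversations) {
  const groups = new Map();
  for (const c of conversations) {
    if (!groups.has(c.intent)) groups.set(c.intent, []);
    groups.get(c.intent).push(c.ttftMs);
  }
  const summary = new Map();
  for (const [intent, values] of groups) {
    summary.set(intent, { count: values.length, p95Ms: p95(values) });
  }
  return summary;
}
```

Intents with only a handful of conversations will give noisy p95 values, so check `count` before tuning anything off them.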

The one-line summary

Your median agent response time isn't the bottleneck. Your p95 is. Most users who hit the slow path have already left by the time it returns. A 3-second "still working" signal recovers the conversations you're currently losing to perceived hangs, without making the model faster.

Ship the heartbeat this week. Re-bucket the data next week. The lift will be obvious.
