It happens around the 40th or 50th turn. The user has been working with the agent for ten minutes, building up context. Then the agent answers a question and the answer is weirdly… disconnected. It forgets a detail the user mentioned six turns ago. The user repeats themselves. The trust drops a notch. The conversation ends not long after.
You can ignore this until it bites, or you can plan for it. The pattern that solves it is mostly named after itself: retry-with-summary.
What's actually happening
Every LLM has a context window. The window is finite. Most agent setups today operate well within it for the first 10-20 turns of a conversation. Past that, depending on system prompt length + tool-call overhead + retrieved context + per-turn message size, the agent starts crowding the ceiling.
Two failure modes typically follow:
- Naive truncation: the agent silently drops the oldest messages to make room for the new one. The agent "forgets" the early context. The user notices. Trust drops.
- Hard error: the call throws a context_length_exceeded error. The agent shows the user something like "Sorry, something went wrong." The conversation dies.
Both are bad. Both are avoidable.
The pattern
Before each model call, check the prospective token count against the model's context limit. If you're approaching the ceiling — say 80% — pause the live exchange and do a summarization pass:
- Compress the conversation: a cheap model (Claude Haiku, gpt-4.1-nano) summarizes the older turns into a 200-word digest that preserves the user's stated goal, the key decisions made so far, any concrete data the user provided (names, IDs, preferences), and the current state of the conversation.
- Restart with the digest: replace the older messages with a single system-role or assistant-role message containing the summary. Keep the most recent 4-6 turns verbatim so continuity is intact.
- Continue: the agent answers the user's latest question with the digest as context. The user notices nothing.
The user sees no error, no interruption, no "let me start over." The conversation just continues.
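One detail in the steps above is what counts as a "turn" when you decide where to cut. The sketch below is one possible reading, not the only one: it assumes a turn starts at a user message and includes everything after it (assistant replies, tool calls, tool results), so the kept span never begins with an orphaned tool result. The function name `splitAtTurnBoundary` and the sample log are hypothetical, and the log is assumed to be passed without its leading system prompt.

```javascript
// Split a message log so that `recent` holds the last `keepTurns` user-led turns.
// Assumption: a turn begins at a role === "user" message.
function splitAtTurnBoundary(messages, keepTurns) {
  let seen = 0;
  let cut = 0; // default: fewer than keepTurns turns exist, so nothing is "old"
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") {
      seen += 1;
      if (seen === keepTurns) {
        cut = i; // the kept span starts at this user message
        break;
      }
    }
  }
  return { old: messages.slice(0, cut), recent: messages.slice(cut) };
}

// Hypothetical log: two user-led turns, the second one involving a tool call.
const log = [
  { role: "user", content: "Book a table for four." },
  { role: "assistant", content: "Which day?" },
  { role: "user", content: "Friday." },
  { role: "assistant", content: "Checking availability." },
  { role: "tool", content: "{\"available\": true}" },
  { role: "assistant", content: "Booked for Friday." },
];

const { old, recent } = splitAtTurnBoundary(log, 1);
console.log(old.length, recent.length); // 2 4
```

Cutting on a user message costs a little precision in the token budget (the kept span can be longer than a fixed message count), but it keeps each tool result next to the call that produced it.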
The implementation
If you're using the standard chat-completion protocol, this is a 30-line helper:
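// Assumes chat({ messages, model }) resolves to an object with a `content` string
// and estimateTokens(messages) returns an approximate token count.
async function chatWithSummaryFallback(messages, model, opts = {}) {
  const ESTIMATED_LIMIT = opts.contextLimit || 120000; // tokens
  const TRIGGER_AT = ESTIMATED_LIMIT * 0.8;
  const KEEP_RECENT = 6; // keep last N messages verbatim (messages, not user/assistant pairs)

  const estimated = estimateTokens(messages);
  if (estimated < TRIGGER_AT) {
    return await chat({ messages, model });
  }

  // Keep the original system prompt out of the span that gets summarized
  const hasSystem = messages.length > 0 && messages[0].role === "system";
  const head = hasSystem ? [messages[0]] : [];
  const body = hasSystem ? messages.slice(1) : messages;

  // Compress everything except the last KEEP_RECENT messages
  const old = body.slice(0, -KEEP_RECENT);
  const recent = body.slice(-KEEP_RECENT);
  if (old.length === 0) {
    // Nothing old enough to compress; summarizing cannot help here
    return await chat({ messages, model });
  }

  const digest = await chat({
    model: opts.cheapModel || model,
    messages: [
      { role: "system", content: "Summarize the conversation. Preserve: user goal, decisions made, any specific data the user mentioned, current state. Be concise: at most 200 words." },
      ...old
    ]
  });

  const compressed = [
    ...head,
    { role: "system", content: `Earlier conversation summary: ${digest.content}` },
    ...recent
  ];
  return await chat({ messages: compressed, model });
}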
That's the core of it. With a reasonable token estimator and the edge cases below handled, the agent degrades gracefully, and conversations can go 200 turns without hitting a hard error.
The edge cases
A few things the naive version misses:
- Tool-call results live in the history. If your agent uses tools, the call results take up tokens. A retrieval call that pulls 8000 tokens of context lives in the message log just like a user message. The token estimator needs to count those.
- Streaming-mid-call. If the agent is already mid-response when it would overflow, you can't insert a summary in front. Make the check before the call, not during.
- Multi-modal content. Images cost tokens too (often a lot). If your agent handles screenshots or PDFs, the budget math is different per model.
- The summary itself takes tokens to generate. Don't trigger summarization on every turn, only when you're approaching the threshold. The summary call reads every older message as input, so its cost scales with the history being compressed, not just with the 200-word output.
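The helper calls `estimateTokens` without showing it. The sketch below is one rough way to write it so that tool results and images are counted, as the first and third points above require. Everything numeric here is an assumption rather than a measured value: about 4 characters per token, a flat 4-token overhead per message, and a flat placeholder cost per image. The message shapes (`tool_calls`, content parts with `type: "text"`) are assumed to follow the common chat-completion layout. For exact counts, use your provider's tokenizer instead.

```javascript
// Rough token estimate for a chat-completion style message log.
// All three constants are assumptions; tune them against real usage data.
const CHARS_PER_TOKEN = 4;
const PER_MESSAGE_OVERHEAD = 4;
const ASSUMED_IMAGE_TOKENS = 1000; // placeholder; real image cost varies by model

function estimateContent(content) {
  if (typeof content === "string") {
    return Math.ceil(content.length / CHARS_PER_TOKEN);
  }
  if (Array.isArray(content)) {
    // Multi-part content: text parts are measured, anything else is
    // treated as an image and charged the flat assumed cost.
    return content.reduce((sum, part) => {
      if (part.type === "text") {
        return sum + Math.ceil(part.text.length / CHARS_PER_TOKEN);
      }
      return sum + ASSUMED_IMAGE_TOKENS;
    }, 0);
  }
  return 0; // null or missing content
}

function estimateTokens(messages) {
  let total = 0;
  for (const msg of messages) {
    total += PER_MESSAGE_OVERHEAD;
    // Tool results are ordinary messages with string content, so they
    // are counted here like any user or assistant message.
    total += estimateContent(msg.content);
    // Tool calls issued by the assistant carry their arguments as text.
    if (Array.isArray(msg.tool_calls)) {
      total += Math.ceil(JSON.stringify(msg.tool_calls).length / CHARS_PER_TOKEN);
    }
  }
  return total;
}

// An 8000-character retrieval result costs about 2000 tokens plus overhead.
console.log(estimateTokens([{ role: "tool", content: "x".repeat(8000) }])); // 2004
```

A heuristic like this tends to drift for code, non-English text, and JSON, which is one more reason to trigger at 80% rather than at the limit.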
Why this beats "just use a bigger context window"
Modern models advertise 200K+ context windows. Some go to 1M+. So you can just buy your way out of this, right?
Two reasons no:
- Quality degrades at the long end. Retrieval accuracy in extreme long-context regimes is documented to drop, especially for facts that sit in the middle of the conversation. A 200-word digest of relevant facts often beats 200K tokens of raw history.
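- Cost scales with context length. Every turn re-sends the entire conversation. At the 80th turn, you're paying again for the input tokens of all 80 turns, so the per-call cost grows linearly and the total bill for the conversation grows roughly quadratically. Summarization caps that growth.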
The retry-with-summary pattern is a quality + cost win, not a workaround.
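To put rough numbers on the cost argument, here is a small simulation. It is a back-of-the-envelope sketch, not a pricing model: it assumes every turn adds a fixed number of tokens to the log, that each call re-sends the whole log, and that the summarization call is priced like a normal call. The function names and all parameter values are illustrative assumptions.

```javascript
// Total input tokens billed over a conversation with no summarization.
function cumulativeInputTokens(turns, perTurn, systemTokens) {
  let log = systemTokens;
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    log += perTurn; // this turn's new messages join the log
    total += log;   // the whole log is sent as input
  }
  return total;
}

// Same, but once the log reaches `trigger` tokens it is collapsed to
// system prompt + digest + kept recent messages before the live call.
function cumulativeWithSummary(turns, perTurn, systemTokens, trigger, digestTokens, keptTokens) {
  let log = systemTokens;
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    log += perTurn;
    if (log >= trigger) {
      total += log; // the summarization call reads the log once
      log = systemTokens + digestTokens + keptTokens;
    }
    total += log;
  }
  return total;
}

// 80 turns, 1500 tokens per turn, 2000-token system prompt.
const plain = cumulativeInputTokens(80, 1500, 2000);
console.log(plain); // 5020000

// Same conversation, summarizing at 96000 tokens (80% of 120000),
// with a 300-token digest and 9000 tokens of recent messages kept.
const capped = cumulativeWithSummary(80, 1500, 2000, 96000, 300, 9000);
console.log(capped < plain); // true
```

Under these assumptions the savings only start once the trigger fires, so a lower trigger saves more input tokens at the price of summarizing more often.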
Add one chart to your agent observability dashboard: per-conversation message count, bucketed by outcome (completed / abandoned). If you see a spike in abandonment past a specific turn count — that's where your agent is hitting the ceiling without recovery. The fix is the pattern above. Want us to audit yours? Apply for the audit.
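The bucketing behind that chart can be sketched in a few lines. The record shape below (`messageCount`, `outcome`) and the function name are hypothetical; adapt them to whatever your logging actually stores.

```javascript
// Abandonment rate per message-count bucket.
// Assumed input: [{ messageCount: number, outcome: "completed" | "abandoned" }]
function abandonmentByBucket(conversations, bucketSize = 10) {
  const buckets = new Map();
  for (const { messageCount, outcome } of conversations) {
    const start = Math.floor(messageCount / bucketSize) * bucketSize;
    const bucket = buckets.get(start) || { start, completed: 0, abandoned: 0 };
    if (outcome === "abandoned") {
      bucket.abandoned += 1;
    } else {
      bucket.completed += 1;
    }
    buckets.set(start, bucket);
  }
  // Sort by bucket start and attach the abandonment rate for plotting.
  return [...buckets.values()]
    .sort((x, y) => x.start - y.start)
    .map((b) => ({ ...b, abandonRate: b.abandoned / (b.completed + b.abandoned) }));
}

// Made-up sample records, for illustration only.
const sample = [
  { messageCount: 12, outcome: "completed" },
  { messageCount: 15, outcome: "abandoned" },
  { messageCount: 41, outcome: "completed" },
  { messageCount: 44, outcome: "abandoned" },
  { messageCount: 47, outcome: "abandoned" },
];
console.log(abandonmentByBucket(sample).map((b) => b.start)); // [ 10, 40 ]
```

A jump in `abandonRate` from one bucket to the next is the signal to look for; compare the bucket where it happens against the turn count at which your agent's log reaches the trigger threshold.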
The one-line summary
Every long conversation eventually hits the context-window ceiling. Naive truncation makes the agent forget. Hard errors kill the conversation. A 30-line retry-with-summary helper makes the failure invisible. Ship it before your longest-conversation tail breaks something visible.