Most of your agent's calls don't need a frontier model. They need a router. The user says "what's my balance" — that's a 50-token decision: which tool to invoke + which arguments to pass. You're paying gpt-4-class rates to make a decision a Haiku-class model would get right almost every time.
The pattern that fixes this is mostly named after itself: cheap planner, expensive executor.
The shape
Two LLM calls per turn instead of one:
- Planner: small fast model (sonnet-haiku, gpt-4.1-nano, gemini-flash). Input is the user's message + a compressed system prompt that lists available intents/tools. Output is a structured JSON object:
{ intent, tool, args, needs_reasoning }. ~200-500 tokens out. - Executor: depends on
needs_reasoning. If false, run the deterministic tool directly + format the response with a cheap-model pass. If true, hand off to the frontier model with only the relevant slice of context.
The math:
- Conversational turns ("what's my balance", "schedule the call", "cancel my order") → never touch the expensive model. They run planner + tool + cheap formatting.
- Reasoning turns ("explain why my last campaign underperformed and what to try next") → planner routes to the frontier model with a focused context.
- The bulk of turns in a typical conversational agent fall in the first bucket. That's where the cost savings live.
The implementation
Here's the planner in 30 lines. The trick is the structured-output format: tool-use schema or a strict JSON schema. Don't free-text it — the planner is supposed to be cheap and predictable, not creative.
const PLANNER_PROMPT = `You route user requests to the right handler.
Output ONLY a JSON object: {"intent": "...", "tool": "...", "args": {...}, "needs_reasoning": bool}.
Intents: balance_query, schedule_call, cancel_order, billing_question, free_form.
Tools: get_account, book_calendar, cancel_order, lookup_billing, none.
Set needs_reasoning=true ONLY when the user asks "why", "explain", or "what should I do".`;
async function planTurn(userMessage, sessionContext) {
const r = await fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: { "x-api-key": KEY, "content-type": "application/json", "anthropic-version": "2023-06-01" },
body: JSON.stringify({
model: "claude-haiku-4-5", // ~10-15× cheaper than frontier
max_tokens: 300,
system: PLANNER_PROMPT,
messages: [{ role: "user", content: userMessage }]
})
});
const data = await r.json();
return JSON.parse(data.content[0].text);
}
async function handleTurn(userMessage, sessionContext) {
const plan = await planTurn(userMessage, sessionContext);
if (plan.needs_reasoning) {
return await frontierModel(userMessage, sessionContext, plan); // expensive path
}
const toolResult = await runTool(plan.tool, plan.args);
return await cheapFormatter(toolResult, plan.intent); // also haiku-class
}
The flow: every turn pays the planner cost (small). Most turns finish in the cheap path. The minority that need real reasoning still get it, with focused context.
The wins beyond cost
Cost reduction is the obvious win. Two less-obvious ones matter just as much:
- Latency. Haiku-class models stream first tokens in 200-400ms vs 800-1500ms for frontier models. On a routing turn, that's the entire perceptible response latency. Pairs well with the heartbeat pattern from Scale Brief #144.
- Predictability. Structured-output planners are easier to monitor, log, and replay than free-text frontier reasoning. When something goes wrong, the JSON intent + tool + args tell you exactly what the agent thought it was doing. Easier to debug, easier to set guardrails on.
The edge cases
- Planner gets the intent wrong. Inevitable. Mitigation: log
plan.intent+ actual outcome on every turn. Once a week, sample 50 failed turns and refine the planner prompt. Accuracy climbs into the high-nineties after a few iterations. - The "needs_reasoning" flag is wrong. Bias toward overuse, not underuse. False positives are expensive (you paid frontier rates unnecessarily). False negatives are worse (cheap model produces a shallow answer to a reasoning-heavy question). When in doubt, set
needs_reasoning=true. - Multi-step flows. If a turn needs three tool calls in sequence, you have two choices: (a) let the planner emit a plan-list that's executed in order, or (b) call the planner per step. The planner-per-step approach is easier to reason about; the plan-list approach is cheaper. Most teams should start with planner-per-step.
Instrument three numbers per turn: planner cost, executor cost, total latency. Bucket by intent. The shape of cost-per-intent will tell you which intents are over- or under-routed. Want us to audit yours? Apply for the audit.
The one-line summary
Most of your agent's turns are routing decisions, not reasoning. Pay routing rates for routing turns. The cheap-model planner pattern compounds with the heartbeat fix in Issue #144 and the context-window fix in Issue #146 — all three are about not paying frontier rates for non-frontier work.