+1 on interaction terms + tails: fanout × retries × context growth is where linear token math dies.
One thing we do in enzu is make “budget as constraint” executable: we clamp `max_output_tokens` from the budget before the call, and in multi-step/RLM runs we adapt output caps downward as the budget depletes (so it naturally gets shorter/cheaper instead of spiraling). When token counting is unavailable we explicitly enter a “budget degraded” mode rather than pretending estimates are exact.
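A minimal sketch of what that clamping can look like; `TokenBudget`, `clamp_output_cap`, and the reserve ratio are illustrative names and numbers, not enzu's actual API:

```python
# Illustrative sketch of "budget as constraint": clamp the per-call output
# cap from the remaining token budget, and hold back a reserve so later
# steps in a multi-step run still have something to spend.
from dataclasses import dataclass

@dataclass
class TokenBudget:
    remaining: int              # tokens left for the whole run
    reserve_ratio: float = 0.2  # fraction held back for future steps

    def clamp_output_cap(self, requested_max: int) -> int:
        # A single call may never request more output than the budget
        # can pay for, minus the reserve.
        spendable = int(self.remaining * (1 - self.reserve_ratio))
        return max(0, min(requested_max, spendable))

    def charge(self, tokens_used: int) -> None:
        self.remaining = max(0, self.remaining - tokens_used)

budget = TokenBudget(remaining=1000)
cap = budget.clamp_output_cap(requested_max=4096)       # clamped to 800
budget.charge(800)
next_cap = budget.clamp_output_cap(requested_max=4096)  # now 160: caps shrink as budget depletes
```

Because the cap is recomputed from `remaining` before every call, later steps naturally get shorter instead of spiraling, which is the point of the adaptive behavior described above.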
Also agree p90/p95 cost/run matters more than averages; max-output caps are crude but effective.
If you’re trying to estimate before prod, logging these 4 things in a pilot gets you 80% there:
- tokens/run (in+out)
- tool calls/run (and fanout)
- retry rate (timeouts/429s)
- context length over turns (P50/P95)
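The metrics above are cheap to compute from per-run logs. A minimal sketch, with illustrative record fields and numbers (nearest-rank percentiles are good enough for a pilot):

```python
# Sketch of pilot metrics from per-run logs; all data is made up.
import math

runs = [
    {"tokens": 12_000, "tool_calls": 3, "retries": 0, "peak_context": 2_400},
    {"tokens": 30_000, "tool_calls": 8, "retries": 2, "peak_context": 9_500},
    {"tokens": 15_000, "tool_calls": 4, "retries": 1, "peak_context": 3_100},
]

def pct(values, q):
    # nearest-rank percentile; deterministic and dependency-free
    s = sorted(values)
    k = max(0, math.ceil(q / 100 * len(s)) - 1)
    return s[k]

p50_tokens = pct([r["tokens"] for r in runs], 50)         # 15000
p95_context = pct([r["peak_context"] for r in runs], 95)  # 9500
retry_rate = sum(r["retries"] for r in runs) / sum(r["tool_calls"] for r in runs)  # 0.2
```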
Fanout × retries is the classic “bill exploder”, and P95 context growth is the stealth one. The point of “budget as contract” is deciding in advance what happens at limit (degraded mode / fallback / partial answer / hard fail), not discovering it from the invoice.
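Back-of-envelope arithmetic for why the interaction term dominates: fanout and retries multiply per-step cost rather than adding to it. All numbers below are illustrative.

```python
# Expected tokens per step = base × fanout × (1 + retry_rate):
# each step spawns `fanout` sub-calls, and a fraction of them are retried.
base_tokens = 2_000    # one sub-call round-trip
fanout = 5             # sub-calls spawned per step
retry_rate = 0.2       # fraction of sub-calls retried (timeouts/429s)
steps = 3

expected_per_step = base_tokens * fanout * (1 + retry_rate)  # 12000.0
expected_total = expected_per_step * steps                   # 36000.0
# Linear math on base_tokens alone would predict 6000 tokens for 3 steps;
# the fanout × retry multiplier makes that a 6x underestimate.
```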
Docs: https://github.com/teilomillet/enzu/blob/main/docs/PROD_MULT... and https://github.com/teilomillet/enzu/blob/main/docs/BUDGET_CO...