For auto-compact, we do essentially the same thing Anthropic does, but at an 85%-filled context window. Then, when the window is 100% filled, we pull this precompaction and append the accumulated 15%. This allows us to run compaction instantly.
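Roughly, the flow can be sketched like this (a toy sketch only - the class, names, and thresholds are illustrative placeholders, not our actual gateway code):

```python
# Toy sketch of precompaction: compact early in the background, then
# splice the cached summary together with the accumulated tail later.
# All names here (Precompactor, compact_fn, WINDOW) are illustrative.

PRECOMPACT_AT = 0.85   # run compaction once the window is 85% full
WINDOW = 200_000       # context window size in tokens (illustrative)

class Precompactor:
    def __init__(self, compact_fn):
        self.compact_fn = compact_fn   # e.g. an LLM summarization call
        self.precompacted = None       # cached compaction of the first 85%
        self.tail = []                 # messages accumulated after 85%

    def on_message(self, history, tokens_used):
        if tokens_used >= PRECOMPACT_AT * WINDOW and self.precompacted is None:
            # Kick off compaction early, while the session keeps running.
            self.precompacted = self.compact_fn(history)
        elif self.precompacted is not None:
            self.tail.append(history[-1])

    def compact_now(self):
        # At 100% fill there is no LLM call on the critical path:
        # just splice the cached compaction with the accumulated tail.
        return [self.precompacted] + self.tail
```

The expensive LLM call happens off the critical path at 85%, so the user-visible compaction at 100% is just a list splice.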
It seems to be the hit rate of a very straightforward (literal-matching) retrieval task. Just checked the benchmark description (https://huggingface.co/datasets/openai/mrcr); here it is:
"The task is as follows: The model is given a long, multi-turn, synthetically generated conversation between user and model where the user asks for a piece of writing about a topic, e.g. "write a poem about tapirs" or "write a blog post about rocks". Hidden in this conversation are 2, 4, or 8 identical asks, and the model is ultimately prompted to return the i-th instance of one of those asks. For example, "Return the 2nd poem about tapirs".
As a side note, steering away from literal matching crushes performance already at 8k+ tokens: https://arxiv.org/pdf/2502.05167, although the models in that paper are quite old (GPT-4o-ish). It would be interesting to run the same benchmark on newer models.
Also, there is strong evidence that aggregating over long context is much more challenging than the "needle extraction task": https://arxiv.org/pdf/2505.08140
All in all, in my opinion, "context rot" is far from being solved
Probably LLM-generated, but that's a fair point :D Well, the proxy is open source, maybe someone will even implement this before we do :)
Talking about the features the proxy unlocks - we have already added some monitoring, such as a dashboard of currently running sessions and a "prompt bank" storing previous user interactions.
Claude Code still has /compact taking ages - and it is a relatively easy fix. Doing proactive compression the right way is much tougher. For now, they seem to bet on subagents solving that, which is essentially summarization with Haiku. We don't think that's the way to go, because summarization is lossy + the additional generation steps add latency.
I think we should draw a distinction between two compression "stages":
1. Tool output compression: vanilla Claude Code doesn't do it at all and just dumps the entire tool outputs, bloating the context. We add <0.5s of compression latency, but then you gain some time back on the target model's prefill, as a shorter context speeds it up.
2. /compact once the context window is full - the one that is painfully slow in Claude Code. We do it instantly - the trick is to run /compact when the context window is 80% full and then fetch this precompaction later (our context gateway handles that).
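For stage 1, here is a minimal illustration of what trimming a tool output before it enters the context can look like (the head/tail heuristic and the function name are purely illustrative - this is not our actual compressor):

```python
# Illustrative stage-1 sketch: trim a long tool output instead of
# dumping it verbatim into the context. Heuristics are placeholders.

def compress_tool_output(output: str, max_lines: int = 40) -> str:
    lines = output.splitlines()
    if len(lines) <= max_lines:
        return output  # short outputs pass through unchanged
    head = lines[: max_lines // 2]
    tail = lines[-(max_lines // 2):]
    dropped = len(lines) - len(head) - len(tail)
    return "\n".join(head + [f"... [{dropped} lines omitted] ..."] + tail)
```

Even a dumb filter like this bounds how much one tool call can bloat the window; the real gains come from smarter, intent-aware selection.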
Please try it out and let us know your feedback, thanks a lot!
Subagents do summarization - usually with cheaper models like Haiku. Summarizing tool outputs doesn't work well because of information loss: https://arxiv.org/pdf/2508.21433. Compression is different: we keep preserved pieces of context unchanged, and we condition compression on the tool call's intent, which makes it more precise.
I can control the model, prompt, and permissions for the subagents. Can you show by example how your compression differs from summarization? What do you mean by "we keep preserved pieces of context unchanged"?
"We keep preserved pieces of context unchanged" means compression removes some pieces of the input while keeping the others verbatim. We'll share a concrete example shortly.
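In the meantime, here is a toy sketch of what extractive compression looks like in general: score spans against the tool-call intent and keep the relevant ones byte-for-byte (the keyword-overlap scorer below is a deliberately trivial stand-in, not our actual scorer):

```python
# Toy extractive compression: keep relevant lines verbatim, drop the rest.
# The relevance scorer here is trivial keyword overlap - illustration only.

def extractive_compress(tool_output: str, intent: str, keep_ratio: float = 0.5) -> str:
    intent_words = set(intent.lower().split())
    lines = tool_output.splitlines()
    scored = [(len(intent_words & set(line.lower().split())), i, line)
              for i, line in enumerate(lines)]
    k = max(1, int(len(lines) * keep_ratio))
    # Take the top-k by score, then restore original order.
    kept = sorted(sorted(scored, reverse=True)[:k], key=lambda t: t[1])
    # Kept lines are identical to the input - no paraphrasing, unlike a
    # summarizer, which rewrites (and may silently lose) details.
    return "\n".join(line for _, _, line in kept)
```

The contrast with summarization: a summarizer regenerates text, so exact paths, IDs, and error strings can be garbled or dropped; an extractive pass can only omit, never distort, what it keeps.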
You’re right - poor compression can cause that. But skipping compression altogether is also risky: once the context gets too large, models can fail to use it properly even when the needed information is there. So the way to go is to compress without stripping useful context, and that’s what we are doing.