Ok, so --ctx-size with a value != 0 means that we can override the default model...

ggerganov · on Jan 24, 2025

Yes, exactly. You can set --ctx-size to a smaller value if you know that you will not hit the limit of 32k - this will save you VRAM.

To control how much global context to keep in the ring buffer (i.e. the context that is being reused to enrich the local context), you can adjust the "ring_n_chunks" and "rink_chunk_size". With the default settings, this amounts to about 8k tokens of context on our codebases when the ring buffer is full, which is a conservative setting. Increasing these numbers will make the context bigger, will improve the quality but will affect the performance.

There are a few other tricks to reduce the compute for the local context (i.e. the 1k batch of tokens), so that in practice, a smaller amount is processed. This further saves compute during the prefill.

menaerus · on Jan 24, 2025

Since qwen 2.5 turbo with 1M context size is advertised to be able to crunch ~30k LoC, I guess we can say then that the 32k qwen 2.5 model is capable of ~960 LoC and therefore 32k model with an upper bound of context set to 8k is capable of ~250 LoC?

Not bad.