For me, the biggest advantage of ollama is being able to "hotswap" models with different uses instead of restarting the server with a different model, combined with the simple "ollama pull model". In other words, it has been quite convenient.
Due to this post I had to search a bit and it seems that llama.cpp recently got router support[1], so I need to have a look at this.
My main use for this is a discord bot where I have different models for different features like replying to messages with images/video or pure text, and non reply generation of sentiment and image descriptions. These all perform best with different models and it has been very convenient for the server to just swap in and out models on request.
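A minimal sketch of that per-feature routing, assuming a server that picks the model from the request body (Ollama's `/api/generate` takes `model`, `prompt` and `stream` fields). The feature keys and model names below are made-up placeholders, not the ones the bot actually uses.

```python
import json

# Hypothetical feature-to-model mapping; every model name here is a placeholder.
MODEL_FOR_FEATURE = {
    "image_reply": "example-vision-model",
    "text_reply": "example-chat-model",
    "sentiment": "example-small-model",
    "image_description": "example-vision-model",
}


def build_request(feature: str, prompt: str) -> dict:
    """Build a JSON body for a generate call, choosing the model by feature."""
    model = MODEL_FOR_FEATURE[feature]
    # The server loads or swaps in whichever model is named in the request.
    return {"model": model, "prompt": prompt, "stream": False}


if __name__ == "__main__":
    body = build_request("sentiment", "Classify the mood of: 'great game last night'")
    print(json.dumps(body, indent=2))
```

The bot only has to know the feature name; which model is resident at any moment is left to the server.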
There are a few countries just below as well like Norway with about 98% renewables in 2024 [1].
The gas power plant is mostly up north powering the gas compressors that fill LNG ships headed for Europe and the coal I think is for Svalbard but that mine/plant closed in 2025 [2].
Norwegian electricity production in February this year was 98,8% hydro and wind and 1,1% thermal (mainly gas). I guess the last 0,1% might be the diesel powered generator on Svalbard (the coal one was turned off 3 years ago).
To a firm with such policies, to allow Cowork outside the VM should be strictly worse.
Ironically, VMs are typically blocked because the infosec team isn't sure how to look inside them and watch you, unlike containers where whatever's running is right there in the `ps` list.
They don't look inside the JVM or .exes either, but they don't think about that the same way. If they treat an app like an exe like a VM, and the VM is as bounded as an app or an exe, with what's inside staying inside, they can get over concerns. (If not, build them a VM with their sensors inside it as well, and move on.)
This conversation can take a while, and several packs of whiteboard markers.
I didn't really understand the performance table until I saw the top ones were 8B models.
But 5 seconds / token is quite slow yeah. I guess this is for low ram machines? I'm pretty sure my 5950x with 128 gb ram can run this faster on the CPU with some layers / prefill on the 3060 gpu I have.
I also see that they claim the process is compute bound at 2 seconds/token, but that doesn't seem correct with a 3090?
DRAM speed is one thing, but you should also account for the data rate of the PCIe bus (and/or VRAM speed). But yes, holding it "lukewarm" in DRAM rather than on NVMe storage is obviously faster.
Four channels of DDR4-3200 vs two channels of DDR5-6400 (four subchannels) should come out pretty close. I don't see any reason why the DDR4 configuration would be consistently faster; you might have more bank groups on DDR4, but I'm not sure that would outweigh other factors like the topology and bandwidth of the interconnects between the memory controller and the CPU cores.
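To put numbers on "pretty close": a rough sketch of theoretical peak bandwidth, assuming a 64-bit (8-byte) data bus per channel, with a DDR5 channel counted as its two 32-bit subchannels together. Real sustained throughput will be lower and depends on the platform.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels * MT/s * bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


if __name__ == "__main__":
    ddr4 = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=3200)
    ddr5 = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=6400)
    print(f"4 x DDR4-3200: {ddr4} GB/s")
    print(f"2 x DDR5-6400: {ddr5} GB/s")
```

Both configurations come out at the same theoretical peak, so any consistent difference would have to come from latency, bank groups or the interconnect rather than raw channel bandwidth.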
Llama 3.1, however, is not MoE, so all params are active.
For MoE it is tricky, because for each token you only use a subset of params (an “expert”) but you don’t know which one, so you have to keep them all in memory or wait until it loads from slower storage, potentially different for each token.
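A back-of-the-envelope sketch of why this matters, using the common rule of thumb that memory-bound generation reads every active weight once per token. The model sizes and the 50 GB/s figure are illustrative assumptions, not measurements of any particular model or machine.

```python
def weights_size_gb(total_params_billion: float, bytes_per_param: float) -> float:
    """Memory needed to keep all weights resident, in GB."""
    return total_params_billion * bytes_per_param


def max_tokens_per_s(bandwidth_gbs: float, active_params_billion: float,
                     bytes_per_param: float) -> float:
    """Upper bound on tokens/s if each token must stream all active weights once."""
    return bandwidth_gbs / (active_params_billion * bytes_per_param)


if __name__ == "__main__":
    bandwidth = 50.0  # GB/s, assumed memory bandwidth

    # Dense model: every parameter is active for every token.
    print("dense 8B, 8-bit:", weights_size_gb(8, 1.0), "GB resident,",
          max_tokens_per_s(bandwidth, 8, 1.0), "tok/s at most")

    # Hypothetical MoE: 100B parameters in total, 10B active per token.
    print("MoE 100B/10B, 8-bit:", weights_size_gb(100, 1.0), "GB resident,",
          max_tokens_per_s(bandwidth, 10, 1.0), "tok/s at most")
```

The MoE case only streams the active subset per token, but the full set still has to sit somewhere fast, because the subset can change from one token to the next.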
I was using gemini antigravity in opencode a few weeks ago before they started banning everyone for that and I got into the habit of writing "do x, then wait for instructions".
That helped quite a bit but it would still go off on its own from time to time.
I've been trying opencode a bit with gemini pro (and claude via those) with a rust project, and I have a push pre-hook to cargo check the code.
The number of times I have to "yell" at the llm for adding #[allow] attributes to silence the linter instead of fixing the code is crazy, and when I point it out they go "Oops, you caught me, let me fix it the proper way".
So the tests don't necessarily make them produce proper code.
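One way to catch this mechanically is to have the hook also scan the outgoing diff for newly added `#[allow(...)]` attributes. This is a rough sketch, not what my hook does: the function name is made up, and it assumes the hook can pass in unified diff text (for example from `git diff`).

```python
import re

# Matches both outer (#[allow(...)]) and inner (#![allow(...)]) attributes.
ALLOW_ATTR = re.compile(r"#!?\[\s*allow\s*\(")


def added_allow_lines(diff_text: str) -> list[str]:
    """Return the added lines in a unified diff that contain an allow attribute."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with '+'; '+++' is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if ALLOW_ATTR.search(line):
                hits.append(line[1:].strip())
    return hits


if __name__ == "__main__":
    sample = (
        "+++ b/src/main.rs\n"
        "+#[allow(dead_code)]\n"
        "+fn helper() {}\n"
        "-#[allow(unused)]\n"
    )
    for hit in added_allow_lines(sample):
        print("new lint suppression:", hit)
```

A hook could refuse the push whenever the list is non-empty, so the suppression has to be justified by a human instead of slipping through.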
I was doing a somewhat elaborate form/graph in Google Worksheets, had to translate a bunch of cells from English to Spanish, and said "Why not use Gemini for this easy, grunt work? They tend to output good translations".
I spent 20 minutes guiding it because it was putting the translations in the wrong cells, asking it not to convert the cells to a fancy table, and finally convincing it that it really had access to alter the document, because at some point it denied it. I wasn't being rude, but it seems I somehow made it evasive.
I had to ask it to translate in the chat, and manually copy-pasted the translations in the proper cells myself. Bonus points because it only translated like ten cells at a time, and truncated the reply with a "More cells translated" message.
I can't imagine how hard it would be to handhold an LLM while working in a complex code base. I guess they are a godsend for prototypes and proofs of concept, but they can't beat a competent engineer yet. It's like that joke where a student answers that 2+2=5, and when questioned, he replies that his strength is speed, not accuracy.
This is one of those places where I feel like they're trying to do too much with the LLMs, and where I think there's "a bubble". I feel like the LLMs are text tools, so if you take them out of their domain and force them somewhere else, you're going to have problems.
Anyways, I replied because I had something else I wanted to say.
I was using Gemini in a google worksheet a while back. I had to cross reference a website and input stuff into a cell. I got Gemini to do it, had it do the first row, then the second, then I told it to do a batch of 10, then 20. It had a hiccup at 20, would take too long I guess. So I had it go back to 10. But then Gemini tells me it can't read my worksheet. I convince it that it can, but then it tells me it can't edit my worksheet. I argue with it, "you've been changing the worksheet wtf?" I convinced it that it could and it started again, but then after doing a couple it told me it couldn't again. We went back and forth a bit, I'd get it working, it would break, repeat. I think it was after the third time I just couldn't get it to do it again.
I looked up the docs, searched online, and I was concerned that I found Google didn't allow Gemini to do a lot of stuff to worksheets/docs/other google workspace stuff. They said they didn't allow it to do a ton of stuff that I definitely had Gemini doing.
Then a week or two went by and google announced they're allowing gemini to directly edit worksheets.
So wtf how did I get it to do it before it could do it???
I added a bunch of lines telling it to never do that in CLAUDE.md and it worked flawlessly.
So I have a different experience with Claude Code, but I'm not trying to say you're holding it wrong, just adding a data point, and then, maybe I got lucky.
I'm a bit puzzled, isn't VRR more for low powered hardware to consume less battery (handhelds like steam deck)? How does it fit hardware that is constantly connected to power?
Variable refresh rate is nice when your display's refresh rate doesn't match your output frame rate, especially when you're getting into higher refresh rates. So if your display is running at 120hz, but you're only outputting 100 frames per second: you cannot fit 100 frames evenly into 120 refreshes. 1/6 of the refreshes will have to be repeats of other frames, and in an inconsistent manner. This is usually called judder.
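The uneven pattern is easy to see with a small simulation. This is a simplified sketch that assumes perfectly paced frames and that each refresh shows the newest frame completed so far; real presentation timing is messier.

```python
def refreshes_per_frame(fps: int, refresh_hz: int) -> list[int]:
    """For one second, count how many display refreshes each frame stays on screen."""
    counts = [0] * fps
    for tick in range(refresh_hz):
        # Index of the newest frame that is ready at this refresh.
        frame = tick * fps // refresh_hz
        counts[frame] += 1
    return counts


if __name__ == "__main__":
    counts = refreshes_per_frame(fps=100, refresh_hz=120)
    print("first ten frames:", counts[:10])
    print("frames shown twice:", counts.count(2))
    print("60 fps on 120hz:", set(refreshes_per_frame(60, 120)))
```

At 100 fps on 120hz, one frame in every five is held for two refreshes while the rest get one, which is the inconsistency you perceive. At 60 fps every frame is held for exactly two refreshes, so motion stays even.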
Most TVs will not let you set the refresh rate to 100hz. Even if my computer could run a game at 100hz, without VRR, my choices are either lots of judder, or lowering it to 60hz. That's a wide range of possible refresh rates you're missing out on.
V-Sync and console games will do this too at 60hz. If you can't reach 60hz, cap the game at 30hz to prevent judder that would come from anything in between 31-59. The Steam Deck actually does not support VRR. Instead the actual display driver does support anything from 40-60hz.
[1] https://huggingface.co/blog/ggml-org/model-management-in-lla...