Hacker News | tyfon's comments

I think the biggest advantage for me with ollama is the ability to "hotswap" models with different utility instead of restarting the server with different models combined with the simple "ollama pull model". In other words, it has been quite convenient.

Due to this post I had to search a bit and it seems that llama.cpp recently got router support[1], so I need to have a look at this.

My main use for this is a Discord bot where I have different models for different features, like replying to messages with images/video or pure text, and non-reply generation of sentiment and image descriptions. These all perform best with different models, and it has been very convenient for the server to just swap models in and out on request.

[1] https://huggingface.co/blog/ggml-org/model-management-in-lla...


> the ability to "hotswap" models with different utility instead of restarting the server

The article mentions llama-swap does this


Llama.cpp added the ability to load/switch models on demand with the max-models and models preset flags.

You can do that with llama-server

llama-server, which is part of llama.cpp, has been able to do this for a few months now

There are a few countries just below as well like Norway with about 98% renewables in 2024 [1]. The gas power plant is mostly up north powering the gas compressors that fill LNG ships headed for Europe and the coal I think is for Svalbard but that mine/plant closed in 2025 [2].

[1] https://www.nve.no/energi/energisystem/energibruk/stroemdekl...

[2] https://www.nrk.no/tromsogfinnmark/norges-siste-kullgruve-pa...


Norwegian electricity production in February this year was 98.8% hydro and wind and 1.1% thermal (mainly gas). I guess the last 0.1% might be the diesel-powered generator on Svalbard (the coal one was turned off 3 years ago).

It would be really nice to have an option to not do this since a ton of companies deny VMs in their group policies.


To a firm with such policies, to allow Cowork outside the VM should be strictly worse.

Ironically, VMs are typically blocked because the infosec team isn't sure how to look inside them and watch you, unlike containers where whatever's running is right there in the `ps` list.

They don't look inside the JVM or .exes either, but they don't think about that the same way. If they treat a VM like an app or an exe, and the VM is as bounded as an app or an exe, with what's inside staying inside, they can get over their concerns. (If not, build them a VM with their sensors inside it as well, and move on.)

This conversation can take a while, and several packs of whiteboard markers.


Agreed. Need to make this a choice for us.


I didn't really understand the performance table until I saw the top ones were 8B models.

But 5 seconds / token is quite slow yeah. I guess this is for low ram machines? I'm pretty sure my 5950x with 128 gb ram can run this faster on the CPU with some layers / prefill on the 3060 gpu I have.

I also see that they claim the process is compute bound at 2 seconds/token, but that doesn't seem correct with a 3090?


LLM speed is roughly <memory_bandwidth> / <model_size> tok/s.

DDR4 tops out at about 27 GB/s

DDR5 can do around 40 GB/s

So for a 70B model at an 8-bit quant, you will get around 0.3-0.5 tokens per second using RAM alone.
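
A rough sketch of the rule of thumb above, assuming every generated token needs one full read of the weights from memory. The function name is made up for illustration, and the 1 byte per parameter for an 8-bit quant is an assumption; the bandwidth figures are the ones quoted above, read as GB/s.

```python
def tokens_per_second(bandwidth_gb_s: float, params_billions: float,
                      bytes_per_param: float) -> float:
    """Upper-bound estimate: memory bandwidth divided by bytes read per token."""
    # Assumption: 1 billion params at 1 byte each is about 1 GB of weights.
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_size_gb


# 70B parameters at 8 bits (1 byte) per parameter, RAM only.
for label, bandwidth in [("DDR4", 27.0), ("DDR5", 40.0)]:
    print(f"{label}: {tokens_per_second(bandwidth, 70, 1.0):.2f} tok/s")
```

With these inputs the estimate comes out at about 0.39 tok/s for DDR4 and 0.57 tok/s for DDR5, which is the same ballpark as the range given above. Real throughput will be lower, since this ignores compute and cache effects.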


DRAM speed is one thing, but you should also account for the data rate of the PCIe bus (and/or VRAM speed). But yes, holding it "lukewarm" in DRAM rather than on NVMe storage is obviously faster.


Yes.

In general, a system's PCIe version has more bandwidth than that system's RAM.

For example, a system with DDR4 (27 GB/s) usually has at least PCIe 4.0 (32 GB/s at x16).

But you can create a bottleneck by building a DDR5 (40 GB/s) system with a PCIe 4.0 card.


yeah, actually, I'm bottlenecked af since my mobo got pcie3 only :(


Channels matter a lot, quad channel ddr4 is going to beat ddr5 in dual channel most of the time.


Four channels of DDR4-3200 vs two channels of DDR5-6400 (four subchannels) should come out pretty close. I don't see any reason why the DDR4 configuration would be consistently faster; you might have more bank groups on DDR4, but I'm not sure that would outweigh other factors like the topology and bandwidth of the interconnects between the memory controller and the CPU cores.
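
A back-of-the-envelope check of that comparison, assuming theoretical peak bandwidth is simply transfer rate times bus width times channel count. The function name is hypothetical, and the figures are theoretical peaks, not measured throughput.

```python
def peak_bandwidth_gb_s(transfers_mt_s: int, channels: int,
                        bus_width_bits: int = 64) -> float:
    """Theoretical peak: transfers per second x bytes per transfer x channels."""
    bytes_per_transfer = bus_width_bits / 8
    return transfers_mt_s * bytes_per_transfer * channels / 1000  # MB/s -> GB/s


# Four 64-bit channels of DDR4-3200.
ddr4_quad = peak_bandwidth_gb_s(3200, channels=4)

# Two DDR5-6400 channels, counted as four 32-bit subchannels.
ddr5_dual = peak_bandwidth_gb_s(6400, channels=4, bus_width_bits=32)

print(f"DDR4-3200 x4: {ddr4_quad:.1f} GB/s, DDR5-6400 x2: {ddr5_dual:.1f} GB/s")
```

Both configurations land on the same theoretical peak of 102.4 GB/s, so any real difference would have to come from the other factors mentioned above, such as bank groups or the interconnect.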


Faster than the 0.2tok/s this approach manages


Should be active param size, not model size.


Yes, you’re right.

Llama 3.1, however, is not MoE, so all params are active.

For MoE it is tricky, because for each token you only use a subset of params (an “expert”) but you don’t know which one, so you have to keep them all in memory or wait until it loads from slower storage, potentially different for each token.


I was using gemini antigravity in opencode a few weeks ago before they started banning everyone for that and I got into the habit of writing "do x, then wait for instructions".

That helped quite a bit but it would still go off on its own from time to time.


My pro account also got banned yesterday, but I didn't use any clawd / molt stuff like the OP. I've only been using opencode.

Seems like only openai are willing to let people use their subscriptions with 3rd party tools.


I'm having a hard time parsing the openai website.

Anyone know if it is possible to use this model with opencode with the plus subscription?


It's possible to use opencode with the plus subscription using this plugin for auth [0][1]. Just tested this and it appears to work.

[0]: https://opencode.ai/docs/ecosystem/#:~:text=Use%20your%20Cha...

[1]: https://github.com/numman-ali/opencode-openai-codex-auth


You do not need a plugin anymore.


I've been trying opencode a bit with gemini pro (and claude via those) with a rust project, and I have a push pre-hook to cargo check the code.

The number of times I have to "yell" at the LLM for adding #[allow] attributes to silence the linter instead of fixing the code is crazy, and when I point it out they go "Oops, you caught me, let me fix it the proper way".

So the tests don't necessarily make them produce proper code.
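
One way to catch this mechanically, sketched under assumptions: scan a unified diff for newly added `#[allow(...)]` attributes and flag them before the push goes through. The function name and the regex are illustrative, not part of any existing tool, and wiring the diff text in from git is left to the hook.

```python
import re

# Matches outer (#[allow(...)]) and inner (#![allow(...)]) Rust attributes.
ALLOW_ATTR = re.compile(r"#!?\[\s*allow\s*\(")


def added_allow_lines(diff_text: str) -> list:
    """Return lines that a unified diff adds and that contain an allow attribute."""
    hits = []
    for line in diff_text.splitlines():
        # '+' marks an added line; '+++' is the file header, not content.
        if line.startswith("+") and not line.startswith("+++"):
            if ALLOW_ATTR.search(line):
                hits.append(line[1:].strip())
    return hits


sample_diff = "\n".join([
    "+++ b/src/main.rs",
    "+#[allow(dead_code)]",
    "+fn unused() {}",
    "-#[allow(unused_variables)]",
])
print(added_allow_lines(sample_diff))
```

A hook could refuse the push whenever the returned list is non-empty, which turns the "don't silence the linter" rule into a check instead of a prompt instruction.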


I was doing a somewhat elaborate form/graph in Google Worksheets, had to translate a bunch of cells from English to Spanish, and said "Why not use Gemini for this easy, grunt work? They tend to output good translations".

I spent 20 minutes guiding it because it was putting the translations in the wrong cells, asking it not to convert the cells to a fancy table, and finally convincing it that it really had access to alter the document, because at some point it denied it. I wasn't being rude, but it seems I somehow made it evasive.

I had to ask it to translate in the chat, and manually copy-pasted the translations in the proper cells myself. Bonus points because it only translated like ten cells at a time, and truncated the reply with a "More cells translated" message.

I can't imagine how hard it would be to handhold an LLM while working in a complex code base. I guess they are a godsend for prototypes and proofs of concept, but they can't beat a competent engineer yet. It's like that joke where a student answers that 2+2=5, and when questioned, he replies that his strength is speed, not accuracy.


This is one of those places I feel like they're trying to do too much with the LLMs and I think this is one of those places where there's "a bubble". I feel like the LLMs are text tools, so trying to take them out of their domain and force them somewhere else you're going to have problems.

Anyways, I replied because I had something else I wanted to say.

I was using Gemini in a google worksheet a while back. I had to cross reference a website and input stuff into a cell. I got Gemini to do it, had it do the first row, then the second, then I told it to do a batch of 10, then 20. It had a hiccup at 20, would take too long I guess. So I had it go back to 10. But then Gemini tells me it can't read my worksheet. I convince it that it can, but then it tells me it can't edit my worksheet. I argue with it, "you've been changing the worksheet wtf?" I convinced it that it could and it started again, but then after doing a couple it told me it couldn't again. We went back and forth a bit, I'd get it working, it would break, repeat. I think it was after the third time I just couldn't get it to do it again.

I looked up the docs, searched online, and I was concerned that I found Google didn't allow Gemini to do a lot of stuff to worksheets/docs/other google workspace stuff. They said they didn't allow it to do a ton of stuff that I definitely had Gemini doing.

Then a week or two went by and google announced they're allowing gemini to directly edit worksheets.

So wtf how did I get it to do it before it could do it???


I added a bunch of lines telling it to never do that in CLAUDE.md and it worked flawlessly.

So I have a different experience with Claude Code, but I'm not trying to say you're holding it wrong, just adding a data point, and then, maybe I got lucky.


I'm curious how many of those directives you'll have in that file at the end of the year.


I think it won't be bigger than the giant set of rules people are supposed to read through (they never do) when onboarding.

At least with AGENTS/CLAUDE.md file, you know the agent will re-read those rules on every new session.


Why are you guys having LLMs use git at all???

Manage that yourself! If you have hooks throwing errors then feed the error back into the llm.


That actually worked since I subscribed a few days ago specifically to try open code.

"Your subscription has been canceled and your refund is on the way. Please allow 5-10 business days for the funds to appear in your account."


My PS5 can do 4k/120 hz with VRR support, not sure about the others.


I'm a bit puzzled, isn't VRR more for low-powered hardware to consume less battery (handhelds like the Steam Deck)? How does it fit hardware that is constantly connected to power?

(I assume VRR = Variable Refresh Rate)


Variable refresh rate is nice when your display's refresh rate doesn't match your output frame rate, especially when you're getting into higher refresh rates. So if your display is running at 120hz but you're only outputting 100 fps, you cannot fit 100 frames evenly into 120 refreshes. 1/6 of the refreshes will have to repeat a frame, and at inconsistent intervals. This is usually called judder.

Most TVs will not let you set the refresh rate to 100hz. Even if my computer could run a game at 100hz, without VRR, my choices are either lots of judder, or lowering it to 60hz. That's a wide range of possible refresh rates you're missing out on.

V-Sync and console games will do this too at 60hz. If you can't reach 60hz, cap the game at 30hz to prevent the judder that would come from anything between 31 and 59. The Steam Deck actually does not support VRR. Instead, the display driver itself supports any refresh rate from 40 to 60hz.

This is also sometimes an issue with movies filmed at 24hz on 60hz displays too: https://www.rtings.com/tv/tests/motion/24p
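
A small sketch of why mismatched rates judder, assuming a simplified fixed-rate display that shows the newest completed frame at every refresh and frames that arrive at perfectly even intervals. The function name is made up for illustration.

```python
def refreshes_per_frame(fps: int, refresh_hz: int, frames: int) -> list:
    """Count how many refreshes each of the first `frames` frames stays on screen."""
    counts = []
    for i in range(frames):
        # Frame i is ready at time i/fps; it first appears on the first
        # refresh at or after that time: ceil(i * refresh_hz / fps).
        first = -(-i * refresh_hz // fps)
        # It stays up until the first refresh that can show frame i + 1.
        following = -(-(i + 1) * refresh_hz // fps)
        counts.append(following - first)
    return counts


print(refreshes_per_frame(100, 120, 10))
print(refreshes_per_frame(24, 60, 4))
print(refreshes_per_frame(40, 120, 3))
```

Under these assumptions, 100 fps on a 120hz display gives 2, 1, 1, 1, 1 repeating, so every fifth frame is held twice as long as the others. 24 fps on 60hz gives the familiar 3, 2, 3, 2 pattern, while 40 fps on 120hz gives an even 3, 3, 3, which is why rates that divide the refresh rate look smooth without VRR.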


It reduces screen tearing without adding all the latency that vsync introduces.


VRR is necessary to avoid tearing or FPS caps (V-sync) when your hardware cannot stably output an FPS count matching the screen refresh rate.


Are there games running at 4k 120hz?


Call of Duty and Battlefield both run at 4K@120 with dynamic resolution scaling, PSSR or FSR.

Most single-player games (Spider-Man, God of War, Assassin's Creed, etc.) offer a balanced graphics/performance mode that runs at 40 fps within a 120hz refresh.


Full 4k - very few, but lots are running adaptive resolutions at > 2k and at 120hz


Touryst renders the game at 4K120 or 8k60. In the latter case, the image is subsampled to 4K output.

