htsh's comments

yes! especially b/c i want to process a lot of email and directories full of old, personal documents


are we sure the RAM market will stop being insane in a year or two or could this be the new norm?


Every ten years the RAM cartel raises prices (it's not really about AI, see Gamer's Nexus) and every ten years it is forced to lower them again.


Why should it be the new norm? We have an abnormal situation now, of massive amounts of investor money being poured into unprofitable bets, that this time had the side effect of eating up hardware components. There are two possible outcomes:

1. Yes, it's the new normal, then production capacity will be increased and prices fall.

2. No, it's not the new normal, the bubble pops and component prices come crashing down when buyers default etc.

Option 2 has been the usual outcome of these situations so far. But sure, the question remains how long all of this will take.


Option 3: the wars escalate and continue to be the new normal, with shipping routes disrupted, until the climax: China annexes Taiwan.

In that case prices will continue to rise (among other things).


I don't know if it'll be a year or two; it's hard to say exactly when the AI bubble will pop, but I feel quite certain it's coming. The AI stuff is great, but most of the money being thrown around to all these different companies is going to be wasted. Investors don't know who the winners and losers will be, just like when people were investing in pets.com instead of amazon.com.


thanks! came in here to ask this.

we can do much better with a cheap model on openrouter (glm 4.7, kimi, etc.) than with anything I can run on my lowly 3090 :)


Also recently added ollama launch claude if you want to connect to cloud models from there :)


I have been doing this with claude code and openai codex and/or cline. One of the three takes the first pass (usually claude code, sometimes codex), then I will have cline / gemini 2.5 do a "code review" and offer suggestions for fixes before it applies them.


curious, why the 30b MoE over the 32b dense for local coding?

I do not know much about the benchmarks but the two coding ones look similar.


The MoE version with 3b active parameters will run significantly faster (tokens/second) on the same hardware, by about an order of magnitude (e.g. ~4 t/s vs ~40 t/s), since decode speed scales roughly with the number of active parameters.


> The MoE version with 3b active parameters

~34 tok/s on a Radeon RX 7900 XTX under today's Debian 13.


And vmem use?


~18.6 GiB, according to nvtop.

ollama 0.6.6 invoked with:

    # server: enable flash attention and quantize the KV cache to q8_0 to reduce VRAM use
    OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

    # client
    ollama run --verbose qwen3:30b-a3b
~19.8 GiB with:

    # raise the context window to 32k tokens
    /set parameter num_ctx 32768


Very nice, should run nicely on a 3090 as well.

TY for this.

update: wow, it's quite fast - 70-80t/s on LM Studio with a few other applications using GPU.


A lot of us have ryzen / nvidia combos... hopefully, soon, though.


OpenVINO runs fine on AMD, last I checked


Maybe it does, but the system requirements page makes it look like it supports everything BUT AMD.

https://docs.openvino.ai/2024/about-openvino/release-notes-o...


It supports AMD CPUs because, if I understand correctly, AMD licenses x86 from Intel, so they share the same bits needed to run OpenVINO as Intel's CPUs.

Go look at CPU benchmarks on Phoronix; AMD Ryzen CPUs regularly trounce Intel CPUs at OpenVINO inference.


assuming you want to run entirely on GPU with 12gb vram, your sweet spot is likely the distilled 14b qwen at a 4bit quant. so just run:

    ollama run deepseek-r1:14b

generally, if the model file size < your vram, it is gonna run well. this file is 9gb.

if you don't mind slower generation, you can run models that fit within your vram + ram, and ollama will handle that offloading of layers for you.

so the 32b should run on your system, but it is gonna be much slower as it will be using GPU + CPU.
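
if you want to check how a loaded model ended up split, or cap the offload yourself, something like this should work (the layer count of 24 below is just an illustrative guess, tune it against your vram):

    # show the running model and how it is split between CPU and GPU
    ollama ps

    # inside an `ollama run` session: cap the number of layers offloaded to the GPU
    /set parameter num_gpu 24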

prob of interest: https://simonwillison.net/2025/Jan/20/deepseek-r1/

-h


Thank you!! I just loaded it and it fits in memory as you said.

I am testing it now and it seems quite fast at giving responses for a local model.


As a longtime user of nodemailer, thank you.

I am gonna check out emailengine for future work.


Dreambooth was kinda great?

That said, I agree; I wish more were done post-research to turn some of this stuff into products.


Yes, offloading some layers to the GPU and VRAM should still help. And 11gb isn't bad.

If you're on linux or wsl2, I would run oobabooga with --verbose. Load a GGUF, start with a small number of GPU layers and creep up, keeping an eye on VRAM usage.
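
something along these lines is roughly how I'd start it (flag names are from text-generation-webui's llama.cpp loader as I remember them, so double-check against your version; the model file and layer count are just placeholders):

    # start conservative on GPU layers, then creep up while watching VRAM in the --verbose output / nvtop
    python server.py --verbose --loader llama.cpp \
        --model mistral-7b-instruct.Q4_K_M.gguf --n-gpu-layers 20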

If you're on windows, you can try out LM Studio and fiddle with layers while you monitor VRAM usage, though windows may be doing some weird stuff sharing ram.

Would be curious to see the diffs, specifically whether there's a complexity tax in offloading that makes CPU-alone inference faster. But in my experience with a 3060 and a mobile 3080, offloading what I can makes a big diff.


> Specifically if there's a complexity tax in offloading that makes the CPU-alone faster

Anecdotal, but I played with a bunch of models recently on a machine with a 16GB AMD GPU, 64GB of system memory, and a 12-core CPU. I found offloading significantly sped things up for large models, but there seemed to be an inflection point as I tested models approaching the limits of the system, where offloading actually slowed things down significantly vs just running on the CPU.

