
I'm no expert on these MoE models with "a total of 389 billion parameters and 52 billion active parameters". Do hobbyists stand a chance of running this model (quantized) at home? For example on something like a PC with 128GB (or 512GB) RAM and one or two RTX 3090 24GB VRAM GPUs?


You would need to fit all 389B parameters in VRAM to get usable speed. Different experts are activated on a per-token basis, so if you offloaded parameters to system RAM or an SSD you would need to load/unload a large chunk of the 52B active parameters every token. PCIe 4.0 x16 tops out around 32 GB/s per direction (64 GB/s bidirectional), so you could stream those active parameters maybe once or twice per second depending on quantization, yielding an output speed of 1-2 tokens per second, which most would consider "unusable".
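
As a sanity check on that estimate, here is a rough back-of-envelope sketch in Python; the bus bandwidth and bytes-per-parameter figures are assumptions, and it ignores compute time and any expert reuse between tokens:

    # Upper bound on tokens/s when every token must stream its active
    # experts across the bus from system RAM or SSD.
    def offload_tokens_per_sec(active_params=52e9, bytes_per_param=1.0, bus_gb_s=32):
        bytes_per_token = active_params * bytes_per_param   # ~52 GB at 8-bit
        return bus_gb_s * 1e9 / bytes_per_token

    print(offload_tokens_per_sec())                      # ~0.6 t/s at 8-bit over PCIe 4.0 x16
    print(offload_tokens_per_sec(bytes_per_param=0.5))   # ~1.2 t/s at 4-bit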


Does that have to be same-node VRAM? Or can you fit 52B each on several nodes, and only copy the transient state around?


Generally speaking this works well, depending on your definition of node and the interconnect between them. If by node you mean GPU, and you have several of them in the same system (the interconnect is PCIe, which doesn't need to run at full speed for inference), you're good. If you mean multiple computers connected by 1 Gigabit Ethernet? More challenging.

When splitting models layer by layer, users in r/LocalLLaMA have reported good results with an interconnect as slow as PCIe 3.0 x4 (~4 GB/s). For tensor parallelism the interconnect requirements are higher, but the upside can be speed that scales with the number of GPUs, whereas layer-by-layer splitting operates like a pipeline, so it isn't necessarily faster than a single GPU even when split across 8 of them.
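
To make the distinction concrete, here is a minimal sketch (in PyTorch, with made-up layer counts and sizes, assuming two CUDA devices) of the layer-by-layer split: the weights stay put on their cards, and only a small activation tensor crosses the PCIe link each step, which is why a slow interconnect is tolerable.

    import torch
    import torch.nn as nn

    class TwoGPUPipeline(nn.Module):
        """First half of the layers on cuda:0, second half on cuda:1."""
        def __init__(self, hidden=4096, layers_per_gpu=16):
            super().__init__()
            self.part0 = nn.Sequential(
                *[nn.Linear(hidden, hidden) for _ in range(layers_per_gpu)]).to("cuda:0")
            self.part1 = nn.Sequential(
                *[nn.Linear(hidden, hidden) for _ in range(layers_per_gpu)]).to("cuda:1")

        def forward(self, x):
            x = self.part0(x.to("cuda:0"))   # runs entirely on GPU 0
            x = self.part1(x.to("cuda:1"))   # only the activation hops to GPU 1
            return x

    # Tokens traverse GPU 0 then GPU 1, so per-token latency is not reduced;
    # tensor parallelism would instead split each layer's matrices across GPUs.
    out = TwoGPUPipeline()(torch.randn(1, 4096))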


An H100 has 80 GB of memory, so at FP8 that would allow 3 of the 16+1 expert models per GPU (assuming around 26B parameters per expert), requiring 9 H100s, which usually would not fit in one chassis, I guess.

Once you have something with 192 GB it gets interesting. You could probably fit 7 per GPU at FP8. At FP16 it would probably only fit 3 per card, requiring 9 again.

I'd say they missed the sweet spot a bit for the memory capacity of current cards. With slightly smaller experts, or one expert fewer, one should be able to run it on 8 H100s at FP8, on 2 B100s at FP8, or even on 4 B100s at FP16, if I calculated correctly.
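
The per-card arithmetic behind those counts, as a hedged sketch (assuming ~26B parameters per expert as above, and ignoring KV cache and activation memory):

    def experts_per_gpu(gpu_mem_gb, params_per_expert=26e9, bytes_per_param=1.0):
        # Whole experts that fit on one card.
        expert_gb = params_per_expert * bytes_per_param / 1e9
        return int(gpu_mem_gb // expert_gb)

    print(experts_per_gpu(80))                        # H100 at FP8    -> 3
    print(experts_per_gpu(192))                       # 192 GB at FP8  -> 7
    print(experts_per_gpu(192, bytes_per_param=2.0))  # 192 GB at FP16 -> 3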


You could always split one of the experts across multiple GPUs. I tend to agree with your sentiment; I think researchers in this space tend not to optimize that well for inference deployment scenarios. To be fair, there are a lot of different ways to deploy something, and a lot of quantization techniques and parameters.


Yes, it can be done. I'm running a 24-channel DDR5 dual-EPYC rig and get good speed on large MoE models. I only use the GPU for context processing.

MoE models are actually a best case for CPU inference compared to dense models. I usually run DeepSeek 2.5 quantized to Q8, but if this model works well I'll probably switch to it once support lands in llama.cpp.


>I only use the GPU for context processing.

If your GPU has enough VRAM to support it, you might benefit from https://github.com/kvcache-ai/ktransformers


Interesting, what RAM do you use exactly? 24x 16GB DDR5-6000 DIMMs? It seems that those boards only support up to DDR5-4800: https://geizhals.de/?cat=mbsp3&xf=4921_2%7E493_24x+DDR5+DIMM...

Does the core count matter or can you get away with the smallest 2x EPYC 9015 configuration? What are "good speeds"?


I use 24 sticks of DDR5-4800, which gets me up to 9 t/s on DeepSeek 2.5 at Q8. 48 threads was optimal in llama.cpp. I would like to move to EPYC 9005 chips and DDR5-6000, but it's cost-prohibitive with CPUs still over $10k each on eBay.

I followed the guide at https://rentry.co/miqumaxx/
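
For scale, a bandwidth-bound ceiling for a rig like this, as a rough sketch (the channel count and transfer rate are from above; DeepSeek-V2.5's ~21B active parameters at Q8 are an assumption, and real decode speed lands well below this ceiling):

    def cpu_moe_tokens_per_sec_ceiling(channels=24, mt_per_s=4800,
                                       active_params=21e9, bytes_per_param=1.0):
        # Every generated token must read the active parameters once from RAM;
        # ignores caches, NUMA effects, and compute time.
        bandwidth_gb_s = channels * mt_per_s * 8 / 1000   # 8 bytes per transfer per channel
        return bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)

    print(cpu_moe_tokens_per_sec_ceiling())  # ~44 t/s theoretical vs ~9 t/s observed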


How many cores do your CPUs have? Are you using the 64 core EPYC 9334 mentioned in the linked page? Do that many cores provide a speedup versus having fewer cores?


Yes, I have the same engineering sample. I can only keep 48 cores fed at full memory bandwidth given current llama.cpp architectural constraints.


Looks like we will soon get boards supporting 24x DDR5-6000 for the EPYC 9005 CPUs.


At 4-bit quantization, figure 1 GB of RAM per 2 billion parameters, so you will want 256 GB of RAM and at least one GPU. With one server and one user you pay for the full parameter count. (With multiple GPUs/servers and many users in parallel, you can shard the experts and route requests so each GPU/server only needs roughly the active parameter count. So it's cheaper at scale.)
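
The sizing math, as a small sketch (the 20% headroom factor for KV cache and runtime overhead is an assumption, not from the thread):

    def ram_gb_needed(total_params=389e9, bits_per_param=4, overhead=1.2):
        return total_params * bits_per_param / 8 / 1e9 * overhead

    print(ram_gb_needed())  # ~233 GB, so a 256 GB build is the natural fit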


Do the inactive parameters need to be loaded into RAM to run an MoE model decently enough?



