I'm no expert on these MoE models with "a total of 389 billion parameters and 52 billion active parameters".
Do hobbyists stand a chance of running this model (quantized) at home?
For example on something like a PC with 128GB (or 512GB) RAM and one or two RTX 3090 24GB VRAM GPUs?
You would need to fit the 389B parameters in VRAM to get usable speed. Different experts are activated on a per-token basis, so if you offload to system RAM or SSD you'd be loading/unloading a large chunk of the 52B active parameters every token. PCIe 4.0 x16 tops out around 32 GB/s in each direction (64 GB/s if you count both ways), and at 4-bit the active parameters are roughly 26 GB, so you could shuttle them over maybe once or twice per second, yielding an output speed of 1-2 tokens per second, which most would consider "unusable".
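To put rough numbers on it (a quick sketch, not a measurement; the 4-bit size, the PCIe figure, and the worst case that the whole active set changes every token are all assumptions):

```python
# Back-of-the-envelope: decode speed if the active experts have to be
# streamed over PCIe every token. Assumptions: ~0.5 bytes/param for a
# 4-bit quant, ~32 GB/s one-way for PCIe 4.0 x16, and the pessimistic
# case that the whole active set changes each token.

active_params   = 52e9    # active parameters per token
bytes_per_param = 0.5     # ~4-bit quantization
pcie_gb_per_s   = 32      # PCIe 4.0 x16, one direction

gb_per_token = active_params * bytes_per_param / 1e9   # ~26 GB
tokens_per_s = pcie_gb_per_s / gb_per_token

print(f"weights per token: ~{gb_per_token:.0f} GB")
print(f"upper bound:       ~{tokens_per_s:.1f} tokens/s")
# -> ~1.2 tokens/s at best; some experts repeat between tokens, so real
#    numbers end up somewhere in the 1-2 t/s range.
```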
Generally speaking this works well, depending on your definition of "node" and the interconnect between them. If by node you mean GPU, and you have multiple of them in the same system (the interconnect is PCIe, which doesn't need to be full speed for inference), you're good. If you mean multiple computers connected by 1 Gigabit Ethernet? More challenging.
When splitting models layer by layer, users in r/LocalLLaMA have reported good results with as little as PCIe 3.0 x4 as the interconnect (~4 GB/s). For tensor parallelism the interconnect requirements are higher, but the upside can be speedups that scale with the number of GPUs you split across (whereas a layer-by-layer split operates like a pipeline, so it isn't necessarily faster than what a single GPU can provide, even when splitting across 8 GPUs).
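For a sense of why the layer-split case is so undemanding: at each split point only the hidden state of the tokens in flight crosses the link, not any weights. A rough sketch (the hidden size and dtype are assumed for illustration, not taken from this particular model):

```python
# At a layer-by-layer split point, only the hidden state of the tokens
# currently being decoded crosses the link; the weights stay put.
# hidden_size and dtype are assumptions, not this model's actual specs.

hidden_size   = 8192      # assumed model width
bytes_per_val = 2         # fp16 activations
batch_tokens  = 1         # single user, one token at a time

bytes_per_hop    = hidden_size * bytes_per_val * batch_tokens   # ~16 KiB
link_bytes_per_s = 4e9                                          # PCIe 3.0 x4

print(f"per-token transfer at the split: {bytes_per_hop / 1024:.0f} KiB")
print(f"hops/s the link could sustain:   {link_bytes_per_s / bytes_per_hop:,.0f}")
# -> hundreds of thousands of hops per second; the link is nowhere near
#    the bottleneck. Tensor parallelism exchanges partial results inside
#    every layer instead, which is why it wants a much fatter interconnect.
```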
An H100 has 80 GB of memory, so at FP8 three of the 16+1 experts would fit per GPU by raw weight size (assuming around 26B parameters, i.e. ~26 GB, per expert), but once you leave room for KV cache and activations it's closer to 2 per GPU, requiring 9 H100s, which usually won't fit in one chassis, I guess.
Once you have something with 192 GB it gets interesting. You could probably fit 7 per GPU at FP8. At FP16 it would probably only fit 3 per card, requiring 6 cards.
I'd say with the current memory sizes of cards they missed the sweet spot a little bit.
With slightly smaller experts, or one expert fewer, one should be able to run it on 8 H100s at FP8, on 2 B100s at FP8, or even on 4 B100s at FP16, if I calculated correctly.
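Here's the packing arithmetic as a quick sketch, for anyone who wants to play with the numbers; the ~26B-per-expert size and the headroom reserved for KV cache/activations are assumptions:

```python
# Packing experts onto cards: how many fit per GPU at a given precision,
# and how many GPUs that implies. The ~26B-per-expert size and the
# 10 GB headroom for KV cache/activations are assumptions.
import math

def gpus_needed(gpu_mem_gb, bytes_per_param, num_experts=17,
                params_per_expert=26e9, headroom_gb=10):
    expert_gb = params_per_expert * bytes_per_param / 1e9
    per_gpu = int((gpu_mem_gb - headroom_gb) // expert_gb)
    return per_gpu, math.ceil(num_experts / per_gpu)

for mem, label, bpp in [(80, "H100 FP8", 1.0), (192, "192GB FP8", 1.0),
                        (192, "192GB FP16", 2.0)]:
    per_gpu, gpus = gpus_needed(mem, bpp)
    print(f"{label:11s}: {per_gpu} experts/GPU -> {gpus} GPUs")
# -> 2 experts per 80 GB card at FP8 once you leave headroom (hence 9
#    H100s), 7 per 192 GB card at FP8, and 3 per card at FP16 (6 cards).
```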
You could always split one of the experts across multiple GPUs. I tend to agree with your sentiment; I think researchers in this space tend not to optimize that well for inference deployment scenarios. To be fair, there are a lot of different ways to deploy something, and a lot of quantization techniques and parameters.
Yes, it can be done. I'm running a 24-channel DDR5 dual-EPYC rig and get good speed on large MoE models. I only use the GPU for context processing.
They're actually a best case for CPU inference compared to dense models, since only the active parameters have to be read from RAM for each token. I usually run deepseek 2.5 quanted to q8, but if this model works well I'll probably switch to it once support hits llama.cpp.
I use 24 sticks of ddr5-4800, which gets me up to 9t/s on deepseek 2.5 at q8. 48 threads was optimal in llama.cpp. I would like to move to epyc 9005 chips and ddr5-6000, but it is cost prohibitive with CPUs still over $10k each on eBay.
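For anyone curious where the ceiling is: CPU decode speed is basically bounded by memory bandwidth divided by bytes read per token. A rough estimate (the efficiency factor and the ~21B-active figure I'm assuming for deepseek 2.5 may be off):

```python
# Upper bound on CPU decode speed: every token streams the active
# parameters out of RAM, so tokens/s <= effective bandwidth / bytes per
# token. The efficiency factor and the ~21B-active figure for
# deepseek 2.5 are assumptions.

channels        = 24            # dual EPYC, 12 channels per socket
per_channel_gbs = 4.8 * 8       # DDR5-4800: 4.8 GT/s * 8 bytes = 38.4 GB/s
efficiency      = 0.5           # assumed fraction of peak actually reached

active_params   = 21e9          # assumed active params for deepseek 2.5
bytes_per_param = 1.0           # q8 ~= 1 byte per weight

bandwidth_gbs = channels * per_channel_gbs * efficiency    # ~460 GB/s
gb_per_token  = active_params * bytes_per_param / 1e9      # ~21 GB

print(f"effective bandwidth: ~{bandwidth_gbs:.0f} GB/s")
print(f"ceiling:             ~{bandwidth_gbs / gb_per_token:.0f} tokens/s")
# A dense model of the same total size would need ~10x more bytes per
# token, which is why MoE is comparatively friendly to CPU inference.
```

Real decode speeds land well below that ceiling once attention, KV-cache reads, and cross-socket NUMA traffic are factored in, so 9 t/s is in the plausible range.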
How many cores do your CPUs have? Are you using the 64 core EPYC 9334 mentioned in the linked page? Do that many cores provide a speedup versus having fewer cores?
RAM for 4-bit is roughly 1 GB per 2 billion parameters, so the full 389B model is about 195 GB; you will want 256 GB of RAM and at least one GPU. If you only have one server and one user, what matters is the full parameter count. (If you have multiple GPUs/servers and many users in parallel, you can shard and route it so each GPU/server only needs to hold around the active parameter count, so it's cheaper at scale.)
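The rule of thumb in numbers, if you want to sanity-check other quant levels (the headroom comment at the end is just an assumption about typical overhead):

```python
# Sanity-checking the "1 GB per 2B parameters at 4-bit" rule of thumb
# for this model. The headroom comment is an assumption about typical
# KV cache / OS / runtime overhead.

total_params  = 389e9
active_params = 52e9

def weight_gb(params, bits):
    return params * bits / 8 / 1e9

for bits in (4, 8, 16):
    print(f"{bits:>2}-bit: full model ~{weight_gb(total_params, bits):.0f} GB, "
          f"active set ~{weight_gb(active_params, bits):.0f} GB")
# -> the 4-bit full model is ~195 GB, so a 256 GB box leaves ~60 GB for
#    context and the OS, while 128 GB isn't enough without offloading.
```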