
Generally speaking this works well, depending on your definition of node and the interconnect between them. If by node you mean a GPU, and you have multiple of them in the same system (the interconnect is PCIe, which doesn't need to be full speed for inference), you're good. If you mean multiple computers connected by 1 Gigabit Ethernet? More challenging.

When splitting models layer by layer, users in r/LocalLLaMA have reported good results with interconnects as slow as PCIe 3.0 x4 (~4 GB/s). For tensor parallelism the interconnect requirements are higher, but the upside can be speed that scales with the number of GPUs the model is split across (whereas layer-by-layer splitting operates like a pipeline, so it isn't necessarily faster than a single GPU, even when split across 8 GPUs).
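
Roughly what layer-by-layer splitting looks like in PyTorch (just a sketch with made-up layer sizes and two GPUs, not any particular library's implementation):

    import torch
    import torch.nn as nn

    # Sketch of layer-by-layer (pipeline-style) splitting: consecutive
    # blocks are pinned to different GPUs and only the activations cross
    # the interconnect, which is why modest PCIe bandwidth is often enough.
    # Hidden size, layer count, and device count are made up here.
    class TwoGpuPipeline(nn.Module):
        def __init__(self, hidden=4096, layers_per_gpu=16):
            super().__init__()
            self.first_half = nn.Sequential(
                *[nn.TransformerEncoderLayer(hidden, 32, batch_first=True)
                  for _ in range(layers_per_gpu)]).to("cuda:0")
            self.second_half = nn.Sequential(
                *[nn.TransformerEncoderLayer(hidden, 32, batch_first=True)
                  for _ in range(layers_per_gpu)]).to("cuda:1")

        def forward(self, x):
            x = self.first_half(x.to("cuda:0"))
            # Only this activation tensor (batch x seq x hidden) travels
            # over PCIe per forward pass, not the weights themselves.
            x = self.second_half(x.to("cuda:1"))
            return x

Note that the two GPUs take turns on a given token rather than working on it simultaneously, which is the pipeline behaviour mentioned above.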



An H100 has 80 GB of memory, so at FP8 that would allow 3 of the 16+1 models per GPU (assuming around 26B parameters per model), requiring 9 H100s, which usually would not fit in one chassis, I guess.

Once you have something with 192 GB it gets interesting. You could probably fit 7 per GPU at FP8. At FP16 it would probably only fit 3 per card, requiring 9 again.

I'd say for the current memory layout of cards they missed the sweet spot a bit. With slightly smaller models, or one expert fewer, one should be able to run it on 8 H100s at FP8, or 2 B100s at FP8, or even on 4 B100s at FP16, if I calculated correctly.
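
For anyone following along, the back-of-the-envelope math (weights only, ignoring KV cache and activations; the ~26B-per-model figure is the assumption above, not an official number):

    # Weight memory per model = parameter count * bytes per parameter.
    params_per_model = 26e9                   # assumed ~26B parameters per model
    bytes_per_param = {"fp8": 1, "fp16": 2}

    for dtype, nbytes in bytes_per_param.items():
        gb_per_model = params_per_model * nbytes / 1e9
        print(dtype,
              "~%.0f GB per model," % gb_per_model,
              "fits %d per 80 GB H100," % (80 // gb_per_model),
              "%d per 192 GB card" % (192 // gb_per_model))

That lines up with the 3-per-H100 and 7-per-192-GB figures above, before leaving any headroom for KV cache and activations.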


You could always split one of the experts up across multiple GPUs. I tend to agree with your sentiment; I think researchers in this space tend not to optimize that well for inference deployment scenarios. To be fair, there are a lot of different ways to deploy something, and a lot of quantization techniques and parameters.
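
Splitting a single expert is basically tensor parallelism applied to that expert's weight matrices. A toy sketch of the idea (arbitrary dimensions; a real deployment would use Megatron-style sharding or something like vLLM's tensor parallelism rather than doing this by hand):

    import torch

    # Split one expert's weight matrix column-wise across two GPUs, so
    # neither card has to hold the whole expert.
    hidden, ffn = 4096, 14336
    w = torch.randn(ffn, hidden)                 # full weight of one expert layer
    w0 = w[: ffn // 2].to("cuda:0")              # first half of the output rows
    w1 = w[ffn // 2 :].to("cuda:1")              # second half of the output rows

    x = torch.randn(1, hidden)
    y0 = x.to("cuda:0") @ w0.T                   # each GPU computes its slice
    y1 = x.to("cuda:1") @ w1.T
    y = torch.cat([y0.cpu(), y1.cpu()], dim=-1)  # gather the partial outputs
    assert y.shape == (1, ffn)

The trade-off is that the gather step runs on every forward pass, which is where the higher interconnect requirements of tensor parallelism come from.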



