
Generally speaking this works well, depending on your definition of node and the interconnect between them. If by node you mean a GPU, and you have multiple of them in the same system (the interconnect is PCIe, which doesn't need to be full speed for inference), you're good. If you mean multiple computers connected by 1 Gigabit Ethernet? More challenging.

When splitting models layer by layer, users in r/LocalLLaMA have reported good results with interconnects as slow as PCIe 3.0 x4 (~4 GB/s). For tensor parallelism the interconnect requirements are higher, but the upside can be speed that scales with the number of GPUs the model is split across (whereas layer-by-layer splitting operates like a pipeline, so it isn't necessarily faster than a single GPU, even when split across 8 GPUs).
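
Roughly what layer-by-layer splitting looks like in PyTorch (just a sketch with made-up layer sizes and two GPUs, not any particular library's implementation):

    import torch
    import torch.nn as nn

    # Sketch of layer-by-layer (pipeline-style) splitting: consecutive
    # blocks are pinned to different GPUs and only the activations cross
    # the interconnect, which is why modest PCIe bandwidth is often enough.
    # Hidden size, layer count, and device count are made up here.
    class TwoGpuPipeline(nn.Module):
        def __init__(self, hidden=4096, layers_per_gpu=16):
            super().__init__()
            self.first_half = nn.Sequential(
                *[nn.TransformerEncoderLayer(hidden, 32, batch_first=True)
                  for _ in range(layers_per_gpu)]).to("cuda:0")
            self.second_half = nn.Sequential(
                *[nn.TransformerEncoderLayer(hidden, 32, batch_first=True)
                  for _ in range(layers_per_gpu)]).to("cuda:1")

        def forward(self, x):
            x = self.first_half(x.to("cuda:0"))
            # Only this activation tensor (batch x seq x hidden) travels
            # over PCIe per forward pass, not the weights themselves.
            x = self.second_half(x.to("cuda:1"))
            return x

Note that the two GPUs take turns on a given token rather than working on it simultaneously, which is the pipeline behaviour mentioned above.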



An H100 has 80 GB of memory, so at FP8 that would allow 3 of the 16+1 models per GPU (assuming around 26B parameters per model), requiring 9 H100s, which usually would not fit in one chassis, I guess.

Once you have something with 192 GB it gets interesting. You could probably fit 7 per GPU at FP8. At FP16 it would probably only fit 3 per card, requiring 9 again.

I'd say for the current memory layout of cards they missed the sweet spot a bit. With slightly smaller models, or one expert fewer, one should be able to run it on 8 H100s at FP8, or 2 B100s at FP8, or even on 4 B100s at FP16, if I calculated correctly.
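
For anyone following along, the back-of-the-envelope math (weights only, ignoring KV cache and activations; the ~26B-per-model figure is the assumption above, not an official number):

    # Weight memory per model = parameter count * bytes per parameter.
    params_per_model = 26e9                   # assumed ~26B parameters per model
    bytes_per_param = {"fp8": 1, "fp16": 2}

    for dtype, nbytes in bytes_per_param.items():
        gb_per_model = params_per_model * nbytes / 1e9
        print(dtype,
              "~%.0f GB per model," % gb_per_model,
              "fits %d per 80 GB H100," % (80 // gb_per_model),
              "%d per 192 GB card" % (192 // gb_per_model))

That lines up with the 3-per-H100 and 7-per-192-GB figures above, before leaving any headroom for KV cache and activations.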


You could always split one of the experts up across multiple GPUs. I tend to agree with your sentiment; I think researchers in this space tend not to optimize that well for inference deployment scenarios. To be fair, there are a lot of different ways to deploy something, and a lot of quantization techniques and parameters.
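
Splitting a single expert is basically tensor parallelism applied to that expert's weight matrices. A toy sketch of the idea (arbitrary dimensions; a real deployment would use Megatron-style sharding or something like vLLM's tensor parallelism rather than doing this by hand):

    import torch

    # Split one expert's weight matrix column-wise across two GPUs, so
    # neither card has to hold the whole expert.
    hidden, ffn = 4096, 14336
    w = torch.randn(ffn, hidden)                 # full weight of one expert layer
    w0 = w[: ffn // 2].to("cuda:0")              # first half of the output rows
    w1 = w[ffn // 2 :].to("cuda:1")              # second half of the output rows

    x = torch.randn(1, hidden)
    y0 = x.to("cuda:0") @ w0.T                   # each GPU computes its slice
    y1 = x.to("cuda:1") @ w1.T
    y = torch.cat([y0.cpu(), y1.cpu()], dim=-1)  # gather the partial outputs
    assert y.shape == (1, ffn)

The trade-off is that the gather step runs on every forward pass, which is where the higher interconnect requirements of tensor parallelism come from.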



