Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Several hints here are severely outdated.

For instance, never train a model in end-to-end FP16. Use mixed precision, either via native TF/PyTorch or as a freebie when using TF32 on A100s. This’ll ensure that only suitable ops are run with lower precision; no need to fiddle with anything. Also, PyTorch DDP in multi-node regimes hasn’t been slower or less efficient than Horovod in ages.

Finally, buying a local cluster of TITAN Xs is an outright weird recommendation for massive models. VRAM limitations alone make this a losing proposition.



Hi there - OP here - thanks for reading!

This blog is more of an intro to a few high level concepts (multi-GPU and multi-node training, fp32 vs fp16, buying hardware and dedicated machines vs AWS/GCP, etc) for startups that are early into their deep learning journey, and that might need a nudge in the right direction.

If you're looking for a deep dive into the best GPUs to buy (cost/perf, etc), the link in the below comment gives a pretty good overview.

PS - I can send you some benchmarks we did that show (at least for us) Horovod is ~10% faster than DDP for multi-node training FWIW. Email is in my profile!


> Finally, buying a local cluster of TITAN Xs is an outright weird recommendation for massive models. VRAM limitations alone make this a losing proposition.

Do you have an alternative recommendation?


You can check out some of the benchmarks here: https://lambdalabs.com/blog/nvidia-rtx-a6000-benchmarks/

It provides some modern, real life, deep learning benchmarks using the mixed precision (TF32) that gp was referring to.




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: