Several hints here are severely outdated. For instance, never train a model in e...

dylanbfox · on Oct 8, 2021

Hi there - OP here - thanks for reading!

This blog post is more of an intro to a few high-level concepts (multi-GPU and multi-node training, fp32 vs. fp16, buying hardware and dedicated machines vs. AWS/GCP, etc.) for startups that are early in their deep learning journey and might need a nudge in the right direction.

If you're looking for a deep dive into the best GPUs to buy (cost/perf, etc.), the link in the comment below gives a pretty good overview.

PS - I can send you some benchmarks we ran that show Horovod is ~10% faster than DDP for multi-node training (at least for us), FWIW. Email is in my profile!

jxcole · on Oct 7, 2021

> Finally, buying a local cluster of TITAN Xs is an outright weird recommendation for massive models. VRAM limitations alone make this a losing proposition.

Do you have an alternative recommendation?

sabalaba · on Oct 8, 2021

You can check out some of the benchmarks here: https://lambdalabs.com/blog/nvidia-rtx-a6000-benchmarks/

It provides some modern, real-life deep learning benchmarks using the mixed precision (TF32) that gp was referring to.