
> What this article misses though is that despite this, each GPU in the distributed cluster still needs to have enough VRAM to load the entire copy of the model to complete the training process.

That's not exactly accurate. On the data-parallel side of techniques, the Distributed Data Parallel (DDP) approach does require a full copy of the model on each GPU. However, there's also Fully Sharded Data Parallel (FSDP), which does not.
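A minimal sketch of the difference in PyTorch (the MyModel class and the process-group setup, e.g. via torchrun, are assumed; in practice you'd pick one wrapper, not both):

    import torch
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    model = MyModel().cuda()  # hypothetical model class

    # DDP: every rank keeps a full replica of the parameters;
    # only gradients are all-reduced after each backward pass.
    ddp_model = DDP(model, device_ids=[torch.cuda.current_device()])

    # FSDP: parameters, gradients, and optimizer state are sharded
    # across ranks and gathered layer by layer just in time for
    # compute, so no rank ever holds the whole model at once.
    fsdp_model = FSDP(model)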

Similarly, tensor parallelism (TP) splits the model across GPUs, to the point where a full layer never resides on a single GPU.
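For intuition, here's a minimal column-parallel sketch for a single linear layer (assumes torch.distributed is already initialized; the column_parallel_linear helper and the random weight shard are illustrative, not a real training setup):

    import torch
    import torch.distributed as dist

    def column_parallel_linear(x, full_out_features, in_features):
        world_size = dist.get_world_size()
        local_out = full_out_features // world_size
        # Each rank materializes only its shard of the weight matrix.
        w_shard = torch.randn(local_out, in_features, device="cuda")
        y_local = x @ w_shard.T  # (batch, local_out)
        # Gather the per-rank outputs to rebuild the full activation.
        y_parts = [torch.empty_like(y_local) for _ in range(world_size)]
        dist.all_gather(y_parts, y_local)
        return torch.cat(y_parts, dim=-1)  # (batch, full_out_features)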

Combining several of the above is how huge foundation models are trained. Meta used 4D parallelism (FSDP + TP + pipeline/context parallelism) to train Llama 405B.
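Recent PyTorch exposes this kind of layout through DeviceMesh; a hedged sketch (the dimension names and sizes are my illustration, not Meta's actual training code):

    from torch.distributed.device_mesh import init_device_mesh

    # e.g. 64 GPUs = 2 (pipeline) x 2 (context) x 4 (FSDP) x 4 (tensor)
    mesh = init_device_mesh(
        "cuda",
        (2, 2, 4, 4),
        mesh_dim_names=("pp", "cp", "dp", "tp"),
    )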


You're right. My caveat wasn't exactly accurate, but I wanted to point out where DisTrO might come in and why it's relevant here.

I mean, it reduces communication overhead by orders of magnitude more than DiLoCo.


Unlike text generation with LLMs, text-to-video generation brings unique challenges: balancing realism, prompt alignment, and artistic vision is much more nuanced and intuitive than evaluating generated code.

But how do we measure the quality of the outputs? Is the choice of color more important than realism, or is it the composition of the scene?

We’ve launched a Text-to-Video Model Leaderboard to explore these questions, inspired by the LLM Leaderboard (lmarena.ai). Our idea: many models exist, but only an unbiased comparison can reveal what users of text-to-video models actually find most important.
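For context, arena-style leaderboards typically turn pairwise votes into an Elo-style rating; a minimal sketch of the standard update rule (the K-factor of 32 and the 1000 starting rating are conventional defaults shown for illustration):

    def elo_update(r_winner, r_loser, k=32.0):
        # Expected score of the winner given the current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
        r_winner += k * (1.0 - expected_win)
        r_loser -= k * (1.0 - expected_win)
        return r_winner, r_loser

    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    # A user preferred model_a's video in a head-to-head comparison.
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"]
    )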

Right now, the leaderboard includes five open-source models:

* HunyuanVideo
* Mochi1
* CogVideoX-5b
* Open-Sora 1.2
* PyramidFlow

We plan to expand it to include proprietary models from Kling AI, LumaLabs.ai, and Pika.art. You can check out the current leaderboard here: https://t2vleaderboard.lambdalabs.com/leaderboard/

We’re looking for feedback from the HN community:

* How should text-to-video models be evaluated?
* What criteria or benchmarks would you find meaningful?
* Are there other models we should include?

We’d love to hear your thoughts and suggestions!


Something that I always think about when I see discussions about hallucinations or "confidently wrong answers" is that humans have this issue too.

For those on TikTok: how many times have you found yourself readily believing some random person online whose credentials you know nothing about? And this has been a problem for a long time (news, the internet, etc.).

It's just interesting to me that we're asking far more of AI than we could ever ask of humans.


It's very appropriate to me that we demand more of AI than we do of humans. The amount and scale of harm AI can do is much greater than what a single person can do, and there are no consequences for doing it. If a human kills someone, they go to jail. If an AI kills someone, say via a self-driving car, it's just the cost of doing business.


It's interesting to be sure. Context is everything, though. It's one thing for a random tiktok head or media pundit to be confidently wrong and it's another thing entirely if I have to work with someone who I cannot trust to be right.

For LLMs to be broadly valuable, they need to be qualified for a role more like a coworker than a stranger.


Mark Twain: "It's not what you don't know that gets you into trouble. It's what you know for sure that just ain't so."



Awesome, thank you!


I think the benefit is that SpinQuant had higher throughput and required less memory, at least according to the tables at the bottom of the article.

Definitely nice to see them not cherry-pick results; admitting it's not the best along every axis makes the numbers more believable.


Hey, there are some details about this scattered throughout. The answer really depends on the technique. For DDP you can fairly easily match single-GPU throughput (we were getting ~80% GPU utilization across multiple nodes, IIRC), as long as all the workers are getting the same-sized data.

Once you move to training really large models like Llama 405B with FSDP and use things like CPU offloading, throughput drops quite a bit due to all the data transfers between CPU and GPU. If you have a large enough cluster and don't have to use CPU offloading, you can get higher throughput.
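For reference, a minimal sketch of turning on CPU offload with PyTorch FSDP (the base_model variable is assumed to be an already-constructed module):

    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

    # Params live on the CPU and are streamed to the GPU as needed:
    # saves GPU memory, costs host<->device transfer time.
    model = FSDP(
        base_model,
        cpu_offload=CPUOffload(offload_params=True),
    )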


You are talking about a specific setup:

> Here we are going to utilize an 8 node cluster (64 H100 GPUs)


The benchmark is matrix multiplication with the shapes `(6, 1500, 256) x (6, 256, 1500)`, which just isn't that big in the AI world. I think the gap would be larger with much larger matrices.

E.g. Llama 3.1 8B, one of the smaller models, has matrix multiplications like `(batch, 14336, 4096) x (batch, 4096, 14336)`.

I just don't think this benchmark is realistic enough.
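A rough timing sketch for comparing the two scales (the batch sizes and fp16 dtype are my assumptions; a careful benchmark would also add warmup iterations and torch.cuda.Event timing):

    import time
    import torch

    def bench(a_shape, b_shape, iters=50):
        a = torch.randn(*a_shape, device="cuda", dtype=torch.float16)
        b = torch.randn(*b_shape, device="cuda", dtype=torch.float16)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(iters):
            torch.matmul(a, b)
        torch.cuda.synchronize()
        return (time.perf_counter() - t0) / iters  # seconds per matmul

    print(bench((6, 1500, 256), (6, 256, 1500)))      # the article's shapes
    print(bench((1, 14336, 4096), (1, 4096, 14336)))  # Llama 3.1 8B-scale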


Let me know if there are any questions or suggestions!

Feel free to open an issue on GitHub; contributions are also welcome.


The idea that time is tied to computation makes me wonder if everything we see as 'progress' is just the universe showing us the loading screen percentage of the game of life.

