
Shouldn't it be the fact that they're non-associative? Reduction kernels combine partial results (like the dot products in a GEMM or the sum across attention heads), and the order in which they combine them may change. Since floating-point addition is non-associative, a different order can round the individual floats differently.
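To make that concrete, here's a rough sketch in plain Python (no GPU involved, so it only imitates what a reduction kernel does). The helper names `sequential_sum` and `pairwise_sum` are made up for illustration, and the input values are chosen on purpose to exaggerate the effect:

```python
# Floating-point addition is not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False


def sequential_sum(values):
    """Add left to right, like a single thread walking the array."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Add in a tree shape, like a parallel reduction combining partial results."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# Hypothetical inputs: the exact sum is 2.0, but each order loses different bits.
values = [1e17, 1.0, -1e17, 1.0]
print(sequential_sum(values))  # 1.0
print(pairwise_sum(values))    # 0.0
```

Same numbers, same arithmetic, two different answers, and neither is the exact sum. A real kernel would see much smaller differences on typical data, but the mechanism is the same.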

