> I doubt Newton-Raphson gonna help much, if at all.
FWIW, in some machines in the past, division was implemented via Newton-Raphson and there was no other division hardware available. The implementation was either done in software or in microcode. The same statement applies to square roots as well.
The foundational technical result (due I think to Peter Markstein but I can't find the paper right now) is that a sequence of 4 (or 5?) fused multiply/add instructions produces a correctly rounded division (or square root) in double precision. The fused part is essential: you must implement ab+c with a single rounding at the end, because if you round after ab the final result is no longer exact in all cases.
Fused multiply/add instructions (FMA) were pioneered by the IBM RS/6000 and its Power/PowerPC descendants, and were adopted by Itanium, MIPS, and later x86/arm. IIRC, Itanium had no division instruction at all, and the compiler would generate the 4 (or 5?) FMAs for you, properly pipelined. PowerPC had a division instruction that would use the FMA unit multiple times.
The last time I looked into this problem (in 2005 or so) there wasn't much latency difference between FMA-based implementations and a dedicated hardware divider. I don't know what tradeoffs people are making these days.
> last time I looked into this problem (in 2005 or so) there wasn't much latency difference between FMA-based implementation and a dedicated hardware divider
In 2005 or so, the newest and greatest CPUs were probably Core 2 Duo, i.e. the Conroe microarchitecture.
Hardware dividers have become much faster since then. Conroe took up to 20 cycles for FP32 divide (DIVPS), and up to 34 cycles for FP64 divide (DIVPD). AMD Zen 3 does that in 10 and 13 cycles, respectively.
Additions and multiplications improved too, but not nearly as much. FP32 multiplication (MULPS) took 4 cycles on Conroe, 3 cycles on Zen 3. FP64 multiplication (MULPD) took 5 cycles on Conroe, 3 cycles on Zen 3. FP32 and FP64 additions (ADDPS, ADDPD) take the same 3 cycles on both, despite 16 years of technical progress.
Do you happen to know how these algorithms are implemented? Intel used to implement some variant of long-division (likely SRT), which takes O(n) cycles to divide two n-bit numbers. The Conroe numbers that you are quoting are consistent with an O(n) algorithm that processes two or perhaps three bits at each step.
However, the Zen 3 numbers that you quote are not consistent with an O(n) algorithm, but they are consistent with O(log n) steps of Newton-Raphson. Basically, do whatever you want to compute the single-precision answer in 10 cycles, and spend 3 cycles on one FMA to extend the result to double precision. Could it be that AMD is in fact implementing Newton-Raphson exploiting its 3-cycle FMA?
> Do you happen to know how these algorithms are implemented?
I do not. I never designed CPUs, only optimized different code for different processors; that's why I'm aware of these timings. BTW, for AMD64, this is the best resource I'm aware of: https://www.uops.info/table.html
> Could it be that AMD is in fact implementing Newton-Raphson exploiting its 3-cycle FMA?
Maybe; I can only guess. One interesting point, though.
For modern Intel CPUs with lake-based names, the latency is close to Zen 3's: 11-12 cycles for FP32 divide, 13-15 cycles for FP64 divide.
AMD only has 2 kinds of ALUs (two of each kind per core): one for scalar stuff and very basic SIMD (bitwise ops and integer add), and another for real SIMD, which seems to be a superset (includes FP, integer multiply, AES, etc.)
Intel cores, however, have a heterogeneous set of ALUs, which Intel calls "execution units", fed by "execution ports". On these lakes, FMA instructions can run on ports 0 or 1, while division instructions must be scheduled on port 0 only. However, the "fast approximate reciprocal" FP32 instruction, RCPPS, which with ~11 bits of precision and 4 cycles of latency seems like a good first step for Newton-Raphson, must be scheduled on that same port 0 as divide.