> I doubt Newton-Raphson gonna help much, if at all.
FWIW, in some machines in the past, division was implemented via Newton-Raphson and there was no other division hardware available. The implementation was either done in software or in microcode. The same statement applies to square roots as well.
The foundational technical result (due I think to Peter Markstein but I can't find the paper right now) is that a sequence of 4 (or 5?) fused multiply/add instructions produces a correctly rounded division (or square root) in double precision. The fused part is essential: you must implement ab+c with a single rounding at the end, because if you round after ab the final result is no longer exact in all cases.
Fused multiply/add instructions (FMA) were pioneered by the IBM RS/6000 and its Power/PowerPC descendants, and were adopted by Itanium, MIPS, and later x86/arm. IIRC, Itanium had no division instruction at all, and the compiler would generate the 4 (or 5?) FMAs for you, properly pipelined. PowerPC had a division instruction that would use the FMA unit multiple times.
The last time I looked into this problem (in 2005 or so) there wasn't much latency difference between FMA-based implementations and a dedicated hardware divider. I don't know what tradeoffs people are making these days.
> last time I looked into this problem (in 2005 or so) there wasn't much latency difference between FMA-based implementation and a dedicated hardware divider
In 2005 or so, the newest and greatest CPUs were probably Core 2 Duo, i.e. the Conroe microarchitecture.
Hardware dividers have become much faster since then. Conroe took up to 20 cycles for FP32 divide (DIVPS), and up to 34 cycles for FP64 divide (DIVPD). AMD Zen 3 does that in 10 and 13 cycles, respectively.
Additions and multiplications improved too, but not nearly as much. FP32 multiplication (MULPS) took 4 cycles on Conroe, 3 cycles on Zen 3. FP64 multiplication (MULPD) took 5 cycles on Conroe, 3 cycles on Zen 3. FP32 and FP64 additions (ADDPS, ADDPD) take the same 3 cycles on both, despite 16 years of technical progress.
Do you happen to know how these algorithms are implemented? Intel used to implement some variant of long-division (likely SRT), which takes O(n) cycles to divide two n-bit numbers. The Conroe numbers that you are quoting are consistent with an O(n) algorithm that processes two or perhaps three bits at each step.
However, the Zen 3 numbers that you quote are not consistent with an O(n) algorithm, but they are consistent with O(log n) steps of Newton-Raphson. Basically, do whatever you want to compute the single-precision answer in 10 cycles, and spend 3 cycles on one FMA to extend the result to double precision. Could it be that AMD is in fact implementing Newton-Raphson exploiting its 3-cycle FMA?
> Do you happen to know how these algorithms are implemented?
I do not. I never designed CPUs, only optimized different code for different processors; that's why I'm aware of these timings. BTW, for AMD64, this is the best resource I'm aware of: https://www.uops.info/table.html
> Could it be that AMD is in fact implementing Newton-Raphson exploiting its 3-cycle FMA?
Maybe; I can only guess. One interesting point, though.
For modern Intel CPUs with lake-based names, the latency is close to Zen 3's: 11-12 cycles for FP32 divide, 13-15 cycles for FP64 divide.
AMD only has 2 kinds of ALUs (two of each kind per core): one for scalar stuff and very basic SIMD (bitwise ops and integer add), and another for real SIMD, which seems to be a superset (includes FP, integer multiply, AES, etc.)
Intel cores, however, have a heterogeneous set of ALUs, which Intel calls "execution units", fed by "execution ports". On these lakes, FMA instructions can run on ports 0 or 1, while division instructions must be scheduled on port 0 only. However, the "fast approximate reciprocal" FP32 instruction, RCPPS, which with ~11 bits of precision and 4 cycles of latency seems like a good first step for Newton-Raphson, must be scheduled on that same port 0 as divide.