Yes, it's a tradeoff. Many other FPUs at the time used radix-4 instead because i...

sifar · 2025-03-03T03:25:31 1740972331

Sure it takes a few iterations, but there is nothing inherently complex about it.

You feed the partial products into a carry-save adder tree and choose a level at which to place the pipeline flops, one that minimizes the delay and the extra flop area/power. I am not sure about the Pentium's technology, but whatever the area, it might be comparable to that of the x3 multiplier.
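To make the adder-tree point concrete, here is a rough sketch of a carry-save reduction in Python. It is not the Pentium's actual tree; `csa` and `reduce_tree` are names I made up, and the grouping is a simple greedy one. It only shows that each level turns three addends into two without propagating carries, and that more partial products means more levels.

```python
def csa(a, b, c):
    """3:2 compressor: three addends in, a sum word and a carry word out.
    No carry propagates along the word, so the delay is one full-adder deep."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def reduce_tree(terms):
    """Reduce non-negative addends to two words with layers of CSAs.
    Returns (sum_word, carry_word, levels). Greedy grouping, illustrative only."""
    terms = list(terms)
    levels = 0
    while len(terms) > 2:
        nxt = []
        i = 0
        while i + 3 <= len(terms):
            nxt.extend(csa(terms[i], terms[i + 1], terms[i + 2]))
            i += 3
        nxt.extend(terms[i:])  # leftovers pass through to the next level
        terms = nxt
        levels += 1
    while len(terms) < 2:
        terms.append(0)
    return terms[0], terms[1], levels


if __name__ == "__main__":
    # Partial-product counts one would get for a 64-bit multiplier
    # with radix-8 (22) and radix-4 (33) Booth recoding.
    for n in (22, 33):
        _, _, levels = reduce_tree(range(1, n + 1))
        print(n, "addends ->", levels, "CSA levels")
```

The two output words still need one final carry-propagate add, which is the same for either radix. A pipeline flop can go between any two levels, which is the "choose a level" part.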

There is an implicit add in the x3 multiplier circuit that increases the delay, which was evidently deemed more acceptable than the radix-4 delay. Considering all this, I strongly believe latency was a hard constraint (maybe for SW perf).
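For anyone who wants to see where the x3 comes from, here is a minimal sketch of Booth recoding in Python, assuming an unsigned multiplier and a 64-bit width. `booth_digits` and `booth_multiply` are hypothetical helper names, and this is textbook Booth recoding rather than a model of the actual Pentium circuit. Radix-4 digits stay in -2..2, so every multiple is a shift or a negation; radix-8 digits reach ±3, and 3x needs a real carry-propagate add (x + 2x) before the tree can start.

```python
def booth_digits(y, bits, k):
    """Recode a non-negative `bits`-wide multiplier into signed radix-2**k
    Booth digits, least significant first. One extra zero bit is assumed
    above the top so that the last digit absorbs the final borrow."""
    n = -(-(bits + 1) // k)  # ceil((bits + 1) / k) digits
    mask = (1 << k) - 1
    digits = []
    prev = 0  # top bit of the previous group
    for i in range(n):
        group = (y >> (i * k)) & mask
        top = group >> (k - 1)
        digits.append(group + prev - (top << k))
        prev = top
    return digits


def booth_multiply(x, y, bits, k):
    """Sum the partial products digit * x * 2**(k*i); should equal x * y."""
    return sum((d * x) << (k * i) for i, d in enumerate(booth_digits(y, bits, k)))


if __name__ == "__main__":
    x, y = 0xDEADBEEFCAFEF00D, 0x0123456789ABCDEF
    for k in (2, 3):  # radix-4, radix-8
        digits = booth_digits(y, 64, k)
        assert booth_multiply(x, y, 64, k) == x * y
        print(f"radix-{1 << k}: {len(digits)} partial products, "
              f"multiples used: {sorted({abs(d) for d in digits})}")
```

So the tradeoff in this sketch is 33 partial products with only trivial multiples versus 22 partial products plus one up-front add to form 3x.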