Yes, it's a tradeoff. Many other FPUs at the time used radix-4 instead because it avoids the extra ×3 circuit. Pipelining is tricky because there isn't a nice place to break the multiplication array into two pieces.
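To make the ×3 point concrete, here is a rough behavioral sketch in Python of Booth recoding for both radices. It is only a software model under my own assumptions, not the Pentium's actual circuit; the names `booth_digits` and `booth_multiply` are made up for illustration. Radix-4 digits fall in -2..2, so every multiple is a shift or a negation, while radix-8 digits fall in -4..4, so the ±3 multiple needs a real addition.

```python
def booth_digits(multiplier, bits, radix_log2):
    """Recode an unsigned multiplier into signed Booth digits.

    radix_log2 = 2 gives radix-4 (digits -2..2),
    radix_log2 = 3 gives radix-8 (digits -4..4).
    """
    k = radix_log2
    digits = []
    prev = 0  # implicit 0 bit to the right of the LSB
    # One group more than bits // k, so the top group's MSB is always 0
    # for an unsigned input and the recoding stays exact.
    for i in range(bits // k + 1):
        group = (multiplier >> (i * k)) & ((1 << k) - 1)
        msb = group >> (k - 1)
        # The group's top bit carries negative weight; the previous
        # group's top bit is added back in as +1.
        digits.append(group - (msb << k) + prev)
        prev = msb
    return digits


def booth_multiply(a, b, bits, radix_log2):
    """Multiply unsigned a by unsigned b by summing Booth partial products."""
    # 1x, 2x and 4x are plain shifts. 3x is the odd one out: it needs
    # an adder (a + 2a). Radix-4 never selects the 3x or 4x entries.
    multiples = {0: 0, 1: a, 2: a << 1, 3: a + (a << 1), 4: a << 2}
    total = 0
    for i, d in enumerate(booth_digits(b, bits, radix_log2)):
        pp = multiples[abs(d)]
        total += (pp if d >= 0 else -pp) << (i * radix_log2)
    return total
```

With this recoding, a 64-bit multiplier yields 33 partial products in radix-4 and 22 in radix-8, which is the tradeoff in a nutshell: fewer terms to sum, at the price of computing 3× up front.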
Sure, it takes a few iterations, but there is nothing inherently complex about it.
You feed the partial products into a carry-save adder tree and choose the level at which to place the pipeline flops so that it minimizes both the delay and the extra flop area/power. I am not sure about the Pentium's technology, but whatever the area, it might be comparable to that of the ×3 multiplier.
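As a rough illustration of what "choosing a level" means, here is a small Python sketch of a carry-save reduction that records how many terms are live after each level. This is a naive grouping of my own (hypothetical names `carry_save_add` and `reduce_tree`), not an optimized Wallace or Dadda tree and not how any particular chip lays it out. The number of live terms at a level is roughly proportional to the flops you would need if you cut the pipeline there.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: three addends in, a sum word and a carry word out."""
    s = x ^ y ^ z                             # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1    # majority bits, shifted left
    return s, c


def reduce_tree(terms):
    """Reduce non-negative integers to two terms with layers of 3:2 adders.

    Returns the final two terms and the list of live-term counts per
    level (index 0 is the input).
    """
    terms = list(terms)
    widths = [len(terms)]
    while len(terms) > 2:
        nxt = []
        # Compress every full group of three into two terms.
        for i in range(0, len(terms) - 2, 3):
            nxt.extend(carry_save_add(*terms[i:i + 3]))
        # Pass the one or two leftover terms straight through.
        nxt.extend(terms[len(terms) - len(terms) % 3:])
        terms = nxt
        widths.append(len(terms))
    return terms, widths
```

For 22 input terms this grouping gives live-term counts of 22, 15, 10, 7, 5, 4, 3, 2, so a cut further down the tree needs far fewer flops than a cut near the top, but leaves less logic for the second stage. The final two terms still need one ordinary carry-propagate add.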
There is an implicit add in the ×3 multiplier circuit that increases the delay, which was deemed more acceptable than the radix-4 delay.
Considering all this, I strongly believe latency was a hard constraint (maybe for SW perf).