You can tell the veteran status of FPGA devs by the quality of their rants about the tools. The big FPGA companies have no quality metrics for developer experience. You should be able to make an LED blink within a minute of powering up a board, not after a day of downloading and installing stuff. It used to be possible to get started quickly with Vivado in the AWS cloud, and I used that workflow for years, but recent licensing changes put a speed bump there, and I ended up going with a local install for my most recent project.
Even once you get that LED blinking, changing the clock speed for that blinking LED should be near instantaneous, but more likely it requires rebuilding the whole project. Fundamentally the vendors don't view their chips as something designed to run programs, and this legacy hardware design mentality plagues their whole business.
Something important here: Xilinx could and should have been where NVidia is today. They were certainly aware of the competitive accelerated computing market as early as 2005, and fundamentally failed to make a software architecture competitive with CUDA.
Before CUDA even existed I interned at Xilinx working on the beginnings of their HLS C compiler. My (decade older) fraternity brother led the C compiler team at Altera. We almost went into business together building a spreadsheet compiler for FPGAs (my master's thesis), but 2007 ended up being a terrible year to sell accelerated computing to Wall Street.
Xilinx never had hardware that was even remotely capable of competing with Nvidia, so I don't think it's solely a software problem: they have literally never developed hardware that is programmable or general-purpose enough. Even their Versal hardware today is hideously difficult to program and has a very FPGA-centric workflow.
This isn't the full story though. I analyzed (professionally, as a consultant) GOPs/$ and GOPs/Watt for big multi-chip GPU and FPGA systems from 2006-2011.
Xilinx routinely had more I/O (SerDes, 100/200/400G MACs on-die) and at times more HBM bandwidth than contemporary GPUs, plus deterministic latency and perfectly acceptable DSP primitives.
The gap has always been the software.
Of course NVidia wasn't such an obvious hit either: they flubbed the tablet market due to yield issues, and it really only went exponential in 2014. I invested heavily in NVidia from 2007-2014 because of the CUDA edge they had, but sold my $40K of stock at my cost basis.
I currently do DSP for radar, and implemented the same system on FPGA and in CUDA from 2020-2023. I know for a fact that in 2022 the FFT performance of a $9000 FPGA was equal to that of a $16000 A100, which also needed a $10000 host computer (the FPGA types were fixed point instead of float, so it's not quite apples-to-apples, but it's definitely application equivalent).
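For context, the CUDA side of that kind of comparison is basically what cuFFT gives you out of the box. A minimal sketch (the transform size and batch count below are illustrative placeholders, not the actual radar parameters):

    #include <cufft.h>
    #include <cuda_runtime.h>

    int main() {
        // Illustrative sizes only: 4096-point FFTs over a batch of 1024 pulses.
        const int n = 4096, batch = 1024;

        cufftComplex* d_data;
        cudaMalloc((void**)&d_data, sizeof(cufftComplex) * n * batch);

        // One plan covers the whole batch; cuFFT chooses the kernels.
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, batch);

        // In-place forward transform across every pulse in the batch.
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }

Build with nvcc and link against -lcufft; the whole edit-compile-run cycle is seconds.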
I think you are making the mistake of thinking that Xilinx software can fix the programmability of their hardware. It cannot. If you have to solve a place-and-route problem or do timing closure in your software, you have made a design mistake in your hardware. You cannot design hardware such that a single FFT kernel takes 2 hours to compile and then fails, when nvcc takes 30 seconds and will always succeed. You have taken your software into the domain of RTL design. This is a result of the hardware design. Xilinx could have made their Versal hardware a programmable parallel processor array that is cache coherent, where everyone has access to global memory. It fundamentally isn't that, though. It's a bizarre dataflow-graph system that requires DMA engines, a place and route, and a static fabric configuration. That's a fault of your hardware design!
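To make the contrast concrete, here is the programming model the GPU gives you for free: any thread can read or write any global address, and nvcc turns this into a runnable binary in seconds with no placement, routing, or fabric-configuration step. A trivial sketch, not from any real radar code:

    #include <cuda_runtime.h>

    // Every thread can touch any location in global memory;
    // there is no static routing or DMA choreography to set up.
    __global__ void scale(float* data, float gain, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= gain;
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        cudaMalloc((void**)&d, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }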
It depends on what you want to do. FPGAs excel at periodic "always on" workloads that need deterministic timing and low latency. If you don't have those constraints and just care about total throughput rather than energy efficiency, then Nvidia will sell you more tflops per chip.
The energy efficiency of FPGAs can't be overstated. Reducing the clock and voltage to levels comparable to an FPGA will kill your GPU's tflops, and the control overhead and the energy spent on data movement are unavoidable in a GPU.
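The first-order reasoning behind that, using the standard CMOS dynamic-power approximation (textbook model, not vendor numbers):

    P_{\mathrm{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f, \qquad \text{throughput} \propto f, \qquad \text{energy/op} \propto C \, V_{dd}^{2}

Lowering V_dd and f together cuts energy per operation roughly quadratically, but peak throughput falls linearly with f, so a GPU down-clocked to FPGA-like power levels gives up most of its tflops.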