You can tell the veteran status of FPGA devs by the quality of their rants about the tools. The big FPGA companies have no quality metrics for developer experience. You should be able to make an LED blink within a minute of powering up a board, not after a day of downloading and installing stuff. It used to be possible to get started quickly with Vivado in the AWS cloud, and I used that workflow for years, but recent licensing changes put a speed bump there, and I ended up going with a local install for my most recent project.
Even once you get that LED blinking, changing the clock speed for that blinking LED should be near instantaneous, but more likely it requires rebuilding the whole project. Fundamentally the vendors don't view their chips as something designed to run programs, and this legacy hardware design mentality plagues their whole business.
Something important here: Xilinx could and should have been where NVidia is today. They were certainly aware of the competitive accelerated computing market as early as 2005, and fundamentally failed to make a software architecture competitive with CUDA.
Before CUDA even existed I interned at Xilinx working on the beginnings of their HLS C compiler. My (decade older) fraternity brother led the C compiler team at Altera. We almost went into business together building a spreadsheet compiler for FPGAs (my master's thesis), but 2007 ended up being a terrible year to sell accelerated computing to Wall Street.
Xilinx never had hardware that was even remotely capable of competing with Nvidia, so I don't think it's solely a software problem: they have literally never developed hardware that is programmable or general-purpose enough. Even their Versal hardware today is hideously difficult to program and has a very FPGA-centric workflow.
This isn't the full story though. I analyzed (professionally, as a consultant) GOPs/$ and GOPs/Watt for big multi-chip GPU and FPGA systems from 2006-2011.
Xilinx routinely had more I/O (SerDes, 100/200/400G MACs on-die) and at times more HBM bandwidth than contemporary GPUs, plus deterministic latency and perfectly acceptable DSP primitives.
The gap has always been the software.
Of course NVidia wasn't such an obvious hit either: they flubbed the tablet market due to yield issues, and it really only went exponential in 2014. I invested heavily in NVidia from 2007-2014 because of the CUDA edge they had, but sold my $40K of stock at my cost basis.
I currently do DSP for radar, and implemented the same system on FPGA and in CUDA from 2020-2023. I know for a fact that in 2022 the FFT performance of a $9000 FPGA was equal to that of a $16000 A100, which also needed a $10000 host computer (the FPGA types were fixed point instead of float, so it's not quite apples-to-apples, but it's definitely application equivalent).
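For context, the CUDA side of that kind of comparison is basically what cuFFT gives you out of the box. A minimal sketch (the transform size and batch count below are illustrative placeholders, not the actual radar parameters):

    #include <cufft.h>
    #include <cuda_runtime.h>

    int main() {
        // Illustrative sizes only: 4096-point FFTs over a batch of 1024 pulses.
        const int n = 4096, batch = 1024;

        cufftComplex* d_data;
        cudaMalloc((void**)&d_data, sizeof(cufftComplex) * n * batch);

        // One plan covers the whole batch; cuFFT chooses the kernels.
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, batch);

        // In-place forward transform across every pulse in the batch.
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cudaDeviceSynchronize();

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }

Build with nvcc and link against -lcufft; the whole edit-compile-run cycle is seconds.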
I think you are making the mistake of thinking that Xilinx software can fix the programmability of their hardware. It cannot. If you have to solve a place-and-route problem or do timing closure in your software, you have made a design mistake in your hardware. You cannot design hardware such that a single FFT kernel takes 2 hours to compile and then fails, when nvcc takes 30 seconds and will always succeed. You have taken your software into the domain of RTL design. This is a result of the hardware design. Xilinx could have made their Versal hardware a programmable parallel processor array that is cache coherent, where everyone has access to global memory. It fundamentally isn't that, though. It's a bizarre dataflow-graph system that requires DMA engines, a place and route, and a static fabric configuration. That's a fault of your hardware design!
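To make the contrast concrete, here is the programming model the GPU gives you for free: any thread can read or write any global address, and nvcc turns this into a runnable binary in seconds with no placement, routing, or fabric-configuration step. A trivial sketch, not from any real radar code:

    #include <cuda_runtime.h>

    // Every thread can touch any location in global memory;
    // there is no static routing or DMA choreography to set up.
    __global__ void scale(float* data, float gain, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= gain;
    }

    int main() {
        const int n = 1 << 20;
        float* d;
        cudaMalloc((void**)&d, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }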
It depends on what you want to do. FPGAs excel at periodic "always on" workloads that need deterministic timing and low latency. If you don't have those constraints and just care about total throughput rather than energy efficiency, then Nvidia will sell you more tflops per chip.
The energy efficiency of FPGAs can't be overstated. Reducing the clock and voltage to levels comparable to an FPGA will kill your GPU's tflops, and the control overhead and the energy spent on data movement are unavoidable in a GPU.
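The first-order reasoning behind that, using the standard CMOS dynamic-power approximation (textbook model, not vendor numbers):

    P_{\mathrm{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f, \qquad \text{throughput} \propto f, \qquad \text{energy/op} \propto C \, V_{dd}^{2}

Lowering V_dd and f together cuts energy per operation roughly quadratically, but peak throughput falls linearly with f, so a GPU down-clocked to FPGA-like power levels gives up most of its tflops.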