Would it be possible to have fast stack machine CPUs nowadays (let's say, competing with x64 CPUs), or is there some inherent benefit to register machines?
I think one of the main issues with stack machines is that the order of instructions in the instruction stream is constrained by the instruction format: you can't schedule instructions in software as easily (it's harder to express "I have all the information required to load that value from memory around now, so I'll start the load now because it might take a while").
Now, any high-performance stack machine is likely to decompose stack code into micro-ops with register renaming, so some of that is going to happen anyway (as it would in a register machine), and some stack ops (like dup) are going to disappear into the renamer (but then a compiler would do the equivalent as CSE on a register machine).
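To make the renaming point concrete, here is a rough sketch in Python. The instruction set (push, dup, swap, add, mul) and the `rename` function are made up for illustration and don't model any real CPU. It walks stack code with a symbolic stack of virtual register names and emits three-address ops; dup and swap emit nothing because they only change which names sit in which stack slots.

```python
from itertools import count

def rename(stack_code):
    """Translate toy stack code into three-address ops via a symbolic stack.

    Hypothetical instruction set, for illustration only.
    """
    fresh = count()
    stack = []   # holds virtual register names, not values
    ops = []     # emitted register-style micro-ops
    for insn in stack_code:
        op, *args = insn.split()
        if op == "push":            # push <var>: load a named value
            reg = f"r{next(fresh)}"
            ops.append(f"{reg} = load {args[0]}")
            stack.append(reg)
        elif op == "dup":           # no micro-op: two slots alias one register
            stack.append(stack[-1])
        elif op == "swap":          # also pure renaming
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op in ("add", "mul"):  # binary op: pop two names, push a new one
            b, a = stack.pop(), stack.pop()
            reg = f"r{next(fresh)}"
            ops.append(f"{reg} = {op} {a}, {b}")
            stack.append(reg)
        else:
            raise ValueError(f"unknown instruction: {insn}")
    return ops, stack

# x*x + y in stack form
ops, stack = rename(["push x", "dup", "mul", "push y", "add"])
for line in ops:
    print(line)
```

Five stack instructions turn into four register ops, with the dup gone. In the register form it's also plain that the load of y doesn't depend on the multiply, so out-of-order hardware could start it early even though it comes later in the stack stream.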
I'm pretty sure a GA144 can do more additions or ANDs per second than any comparably-sized register-machine microcontroller: IIRC it's about a million transistors (comparable to an 80486, seven AVRs, or nine ARM9TDMIs) and can do about a hundred billion 18-bit additions per second on a few milliwatts. It definitely doesn't decompose stack code to register renaming and micro-ops.
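For scale, a back-of-the-envelope calculation of what that rate means per processor, taking the "about a hundred billion" figure at face value and assuming the additions are spread evenly across all 144 processors (both are rough assumptions, not measurements):

```python
ADDS_PER_SECOND = 100e9   # "about a hundred billion", the recalled figure above
NODES = 144               # processors on a GA144

adds_per_node = ADDS_PER_SECOND / NODES   # additions per processor per second
ns_per_add = 1e9 / adds_per_node          # nanoseconds per addition on one processor

print(f"{adds_per_node / 1e6:.0f} million adds/s per processor")
print(f"{ns_per_add:.2f} ns per add")
```

That works out to roughly 700 million additions per second per processor, or about 1.4 ns each.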
Unfortunately, you can't program it in either C or Verilog, and it doesn't have multipliers, so it hasn't seen significant adoption. But its problem is that it's hard to program, not that its performance is poor.
They definitely have some things in common, but the GA144 doesn't have multipliers, doesn't do SIMT, doesn't do SIMD, doesn't have texture mapping units, and doesn't have any globally accessible RAM or routing, so I don't think it's very close.
No, the problem is that each of the 144 processors in the GA144 has 64 18-bit words of RAM, which are not byte-addressable. Each word holds three or four 5-bit instructions, so that's at most 256 machine instructions if you don't use any RAM for data. Even if you had C-friendly features like a stack pointer and stack-pointer-relative addressing modes, 64 words is just not very much memory for a C program.
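Spelling out the arithmetic with the numbers above (the byte figures are just bits divided by 8 for comparison, since the memory isn't actually byte-addressable):

```python
WORDS_PER_NODE = 64
BITS_PER_WORD = 18
NODES = 144

bits_per_node = WORDS_PER_NODE * BITS_PER_WORD   # RAM bits in one processor
byte_equiv_per_node = bits_per_node // 8         # equivalent 8-bit bytes
insns_min = WORDS_PER_NODE * 3                   # 3 instructions in every word
insns_max = WORDS_PER_NODE * 4                   # 4 instructions in every word
byte_equiv_total = NODES * bits_per_node // 8    # whole chip, all RAM combined

print(bits_per_node, byte_equiv_per_node)   # 1152 144
print(insns_min, insns_max)                 # 192 256
print(byte_equiv_total)                     # 20736
```

So each processor has the equivalent of 144 bytes for code and data together, and the whole chip has about 20 KB, split into 144 separate pieces.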
You might be able to compile a C program to a floorplan of processors passing messages to one another, but it would be similar to writing a SystemC compiler which compiles C to, ultimately, a circuit netlist. The impedance mismatch is severe.