> The problem is 95% about laying out the instruction dispatching code for the branch predictor to work optimally.
A fun fact I learned while writing this post is that that's no longer true! Modern branch predictors can pretty much accurately predict through a single indirect jump, if the run is long enough and the interpreted code itself has stable behavior!
Here's a paper that studied this (for both real hardware and a certain simulated branch predictor): https://inria.hal.science/hal-01100647/document

My experiments on this project anecdotally agree; they didn't make it into the post, but I also explored a few of the interpreters through hardware CPU counters and `perf stat`, and branch misprediction never showed up as a dominant factor.
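For readers who haven't written an interpreter loop: the "single indirect jump" being discussed is the classic switch-in-a-loop dispatch, where every handler returns to one shared dispatch site. A minimal C sketch (the opcode names and the `run` function are made up for illustration, not taken from any project discussed here):

```c
/* Switch-dispatch bytecode interpreter: every handler falls back to the
   same switch, so the compiler typically emits a single indirect jump
   that the branch predictor has to learn. Toy opcodes, illustrative only. */
enum op { OP_PUSH, OP_ADD, OP_HALT };

static int run(const unsigned char *code) {
    int stack[64];
    int sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {              /* the single dispatch site */
        case OP_PUSH: stack[sp++] = code[pc++]; break;
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;
        case OP_HALT: return stack[sp - 1];
        default:      return -1;           /* unknown opcode */
        }
    }
}
```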
Yes, this was already becoming true around the time I was writing the linked article. And I also read the paper. :-) I also remember I had access to a pre-Haswell-era Intel CPU vs. something a bit more recent, and could see that the more complicated dispatcher no longer made as much sense.
Conclusion: the rise of popular interpreter-based languages led to CPUs with smarter branch predictors.
How do you reconcile that with the observation that moving to a computed-goto style provides better codegen in Zig[1]? They claim that using their “labeled switch” (which is essentially computed goto) gives you multiple dispatch branches, which improves branch predictor performance. They even get a 13% speedup in their parser by moving from a regular switch to this style. If modern CPUs are good at predicting through a single branch, I wouldn’t expect this feature to make any difference.
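For anyone unfamiliar with the technique being referenced: computed goto (a GNU C extension, roughly what Zig's labeled switch lowers to) replicates the dispatch at the end of every handler, so each opcode gets its own indirect branch and its own predictor history. A rough sketch under that assumption, with made-up opcodes matching the toy VM above:

```c
/* Computed-goto dispatch (GNU C extension, works in GCC/Clang).
   Each handler ends with its own `goto *`, so there is one indirect
   branch per opcode rather than one shared dispatch site. */
static int run_goto(const unsigned char *code) {
    /* table order matches the toy opcodes: 0 = push, 1 = add, 2 = halt */
    static void *dispatch[] = { &&do_push, &&do_add, &&do_halt };
    int stack[64];
    int sp = 0, pc = 0;

    goto *dispatch[code[pc++]];
do_push:
    stack[sp++] = code[pc++];
    goto *dispatch[code[pc++]];
do_add:
    sp--; stack[sp - 1] += stack[sp];
    goto *dispatch[code[pc++]];
do_halt:
    return stack[sp - 1];
}
```

Whether those extra branches still buy anything on a recent CPU is exactly the question upthread.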
While it's unlikely to be as neat as this, the blog post we're all commenting on is an "I thought we had a 10-15% speedup, but it turned out to be an LLVM optimisation misbehaving" story. And Zig (for now) uses LLVM for optimised builds too.