I have just read the post, and I would appreciate a more accurate description of the testing methodology. Some of the techniques described are well known in interpreter optimization, and his description of a "threaded code" interpreter is not consistent with what the term usually means (viz. threaded code interpreters have nothing to do with "threading" the decode for the successor instruction into the operation implementation of the current instruction; they just move instruction decode and dispatch into the operation implementation).
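To make the usual meaning concrete, here is a minimal sketch of my own, not the code from the post. It assumes a hypothetical three-opcode instruction set (OP_INC, OP_DOUBLE, OP_HALT) and uses the GNU C "labels as values" extension, so it needs GCC or Clang. Strictly speaking this is the token-threaded variant, since the program stores opcodes rather than addresses, but the point is the same: there is no central dispatch loop.

```c
/* Hypothetical three-opcode instruction set, for illustration only. */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Threaded-code style: each operation implementation ends by decoding
   the next opcode and jumping straight to its implementation. */
long run_threaded(const unsigned char *code)
{
    /* Label addresses indexed by opcode (GNU C extension). */
    static void *const targets[] = { &&op_inc, &&op_double, &&op_halt };
    const unsigned char *pc = code;
    long acc = 0;

    goto *targets[*pc++];       /* dispatch the first instruction */

op_inc:
    acc += 1;
    goto *targets[*pc++];       /* decode and dispatch live in the op itself */
op_double:
    acc *= 2;
    goto *targets[*pc++];
op_halt:
    return acc;
}
```

Note that every operation still ends in an indirect jump; the difference from a switch-in-a-loop interpreter is only that each operation has its own copy of the dispatch.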
Aside from that, there have been papers detailing the software pipelining approach (cf. Hoogerbrugge et al.: "A code compression system based on pipelined interpreters", 1999), but I cannot for the love of god imagine that the loop shown there is faster than a JIT compiler, for a very simple reason: each interpreter operation is invoked via a function pointer, which means that the compiler emits an indirect branch instruction to implement the call. These indirect branches are very costly, and a simple JIT compiler could just take the actual values (callee target addresses) from the input program and generate direct call instructions instead. (And I am not even talking about inlining the function bodies of the operation implementations.)
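A rough sketch of the difference, again my own illustration rather than the loop from the post: op_fn, op_inc, op_double and both run_* functions are hypothetical names. The second function is written by hand to stand in for what a simple JIT could emit for one specific input program.

```c
#include <stddef.h>

/* Hypothetical operation signature: each op updates an accumulator. */
typedef void (*op_fn)(long *acc);

void op_inc(long *acc)    { *acc += 1; }
void op_double(long *acc) { *acc *= 2; }

/* Interpreter loop: the callee is only known at run time, so each
   iteration performs an indirect call through a function pointer. */
long run_indirect(const op_fn *program, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        program[i](&acc);
    return acc;
}

/* Hand-written stand-in for JIT output for the program
   { op_inc, op_inc, op_double }: same callees, but as direct calls. */
long run_direct_inc_inc_double(void)
{
    long acc = 0;
    op_inc(&acc);
    op_inc(&acc);
    op_double(&acc);
    return acc;
}
```

Both compute the same result; the second form is also what gives a compiler or JIT the chance to inline the operation bodies.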
An indirect branch is not too costly when it is correctly predicted, and due to the proliferation of C++ code, modern CPUs are very good at predicting indirect branches when they mostly lead to the same target.