Right, it's that sync work that I need to research. If I'm compiling some func down to, say, LLVM IR, how do I interleave the synchronization points without too much penalty? Again, surely just a matter of research, just seems like magic from the outside
Haven't written a JIT/dynamic recompiler myself, but here's how I've thought of it:
When two devices are communicating over a port, only operations writing to the port matter. If something is spinning waiting on a signal (assuming the repeated reads have no side effects on the partner), you can simulate that by sleeping until the partner writes to the port (or another internal event occurs), then updating any internal counters with the number of cycles the spin would have burned.
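To make the "sleep until the partner writes" idea concrete, here's a rough Python sketch. `Device`, `Port`, and `wait_for_write` are names I made up for illustration, and it assumes both devices count cycles on a shared timeline; a real emulator would need clock-ratio conversion on top of this.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    cycles: int = 0        # local cycle counter (shared timeline assumed)
    waiting: bool = False  # True while the device is parked on a port


class Port:
    """Hypothetical port where only writes count as sync events."""

    def __init__(self):
        self.value = 0
        self.waiters = []

    def wait_for_write(self, dev):
        # Park the device instead of emulating every iteration of its spin loop.
        dev.waiting = True
        self.waiters.append(dev)

    def write(self, writer, value):
        self.value = value
        for dev in self.waiters:
            # Credit the cycles the spin loop would have burned while parked.
            dev.cycles = max(dev.cycles, writer.cycles)
            dev.waiting = False
        self.waiters.clear()


cpu = Device("cpu", cycles=100)
apu = Device("apu", cycles=40)
port = Port()

port.wait_for_write(apu)  # apu starts polling the port at cycle 40
cpu.cycles += 500         # cpu runs a translated block uninterrupted
port.write(cpu, 0x7F)     # the write wakes apu and fast-forwards its counter
```

The point is that the waiting device costs nothing on the host while it is parked; its counter is only fixed up when the write lands.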
Where it gets more difficult is interrupts, which in the above paradigm could come at any time. If they're timers, you can know before going into a "basic block" (the JITted chunk, probably larger than a classic basic block) whether one would fire during the block, and if so just single-step interpret the remaining cycles instead. If it's irregular... you might either checkpoint regularly and revert to a previous checkpoint if an interrupt comes in during a block, or batch up changes and only commit them if you hit the end of the block with no interrupt.
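The timer case could look roughly like this in Python. It's only a sketch: `step_block` is a hypothetical helper, and it assumes each translated block knows its per-instruction cycle costs up front and that the next timer deadline is known as an absolute cycle count.

```python
def step_block(now, instr_costs, timer_at):
    """Run one translated block starting at cycle `now`.

    instr_costs: per-instruction cycle costs, assumed known at translate time.
    timer_at:    absolute cycle at which the next timer interrupt is due.
    Returns (new_now, instructions_retired, interrupted).
    """
    total = sum(instr_costs)
    if now + total <= timer_at:
        # Fast path: the timer cannot fire inside this block, so the whole
        # JITted chunk can run without any checks.
        return now + total, len(instr_costs), False

    # Slow path: single-step until the deadline has been reached, then stop
    # at the next instruction boundary so the interrupt can be delivered.
    retired = 0
    for cost in instr_costs:
        if now >= timer_at:
            break
        now += cost
        retired += 1
    return now, retired, True


fast = step_block(0, [2, 3, 4], timer_at=100)
slow = step_block(0, [2, 3, 4], timer_at=4)
```

Here `fast` runs all three instructions in one go, while `slow` drops to single-stepping and stops after the second instruction, since the deadline passes while that instruction is executing.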
One big problem with this idea is that RAM is a device, and it's probably connected to more than a single-core CPU! You can get around this a little by modelling RAM over time as separate ranges: first assume that writes are only going to be visible to the device that made them, then if another device reads or writes that memory, split the region, treat the resulting ranges as separate devices, and treat the event as an interrupt to the first device.
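Here's a rough Python sketch of the range-splitting idea. Everything in it is my own assumption for illustration: the `Ram` class, the `SHARED` marker, and the 256-byte split granule are not from any real emulator, and it doesn't distinguish reads from writes.

```python
SHARED = "shared"  # marker owner for ranges touched by more than one device


class Ram:
    def __init__(self, size, granule=256):
        self.granule = granule              # hypothetical split granularity
        self.regions = [[0, size, None]]    # [start, end, owner], end exclusive
        self.sync_events = []               # (previous_owner, address) pairs

    def access(self, device, addr):
        """Record an access; return True if it needs a sync point."""
        i = next(k for k, (s, e, _) in enumerate(self.regions) if s <= addr < e)
        start, end, owner = self.regions[i]
        if owner is None:
            self.regions[i][2] = device     # first toucher owns the range
            return False
        if owner == device:
            return False                    # private access, no sync needed
        if owner != SHARED:
            # A second device touched a private range: carve out one granule
            # as shared and notify the previous owner (the "interrupt").
            lo = max(addr - addr % self.granule, start)
            hi = min(addr - addr % self.granule + self.granule, end)
            pieces = [[lo, hi, SHARED]]
            if start < lo:
                pieces.insert(0, [start, lo, owner])
            if hi < end:
                pieces.append([hi, end, owner])
            self.regions[i:i + 1] = pieces
            self.sync_events.append((owner, addr))
        return True                         # shared ranges always synchronize


ram = Ram(1024)
ram.access("cpu", 0x010)            # cpu claims all of RAM, no sync
needs_sync = ram.access("dma", 0x120)  # dma touches it: split and notify cpu
```

After the second access, only the granule around 0x120 is treated as shared; the cpu can keep running unsynchronized against the rest.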
The problem arises when you need to decide how frequently to checkpoint, i.e. the size of said blocks. If I am running the emulator on a machine 200x faster than the emulatee, the check can happen after every instruction, because we have the time and executing too many instructions between syncs would be very noticeable. But if I am running the emulator on a machine 1.05x faster than the emulatee, there isn't a lot of time to work with, so do you just guess at the block size between checks? Get it wrong and you've done too many ops or too few. Either way, you definitely don't have time to sleep or use small block sizes. I'm sure I'm overthinking it.
On the plus side, the synchronization is explicit on newer consoles because they couldn't rely on constant timing. Caches and multiple bus masters meant that they couldn't cycle count like they used to be able to.