>> Micro-fusion improves instruction bandwidth delivered from decode to retireme...

stmw · on Feb 4, 2018

Nice article. I wish it had also discussed the potential value of future-proofing the code (for future Intel or, more importantly, Intel-compatible CPUs), in which case wouldn't it make sense to use 3 simpler instructions? That might be worth it even at the risk of more instruction cache pressure. It wouldn't be any worse in this case, since the problem was alignment rather than instruction selection.

userbinator · on Feb 5, 2018

> in which case wouldn't it make sense to use 3 simpler instructions?

No. Any competitor's CPUs will have similar constraints on fetch and decode bandwidth, so smaller code that isn't slower will always be the better choice. Sometimes even a slightly slower sequence is the better choice, if its smaller size results in fewer cache misses overall.

Plus, the simpler sequence needs an extra register, instead of the implicitly allocated ones that memory ops use, and that will affect other code nearby.
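
To make the size and register argument concrete, here is a rough sketch. It assumes a hypothetical case (not taken from the article): adding a register to a 32-bit value in memory on x86-64, once as a single memory-destination `add` and once as a load/add/store sequence through `eax`. The encodings are hand-assembled for illustration, and the script only compares bytes and scratch registers; it says nothing about uop counts or timing.

```python
# Hypothetical comparison: one memory-destination instruction versus
# three simpler instructions doing the same work (x86-64, 32-bit operand).
# The instruction choice and registers are assumptions for illustration.

# Each entry: (assembly text, machine-code bytes, scratch registers used)
fused = [
    ("add dword ptr [rdi], esi", bytes([0x01, 0x37]), set()),
]

split = [
    ("mov eax, dword ptr [rdi]", bytes([0x8B, 0x07]), {"eax"}),
    ("add eax, esi",             bytes([0x01, 0xF0]), {"eax"}),
    ("mov dword ptr [rdi], eax", bytes([0x89, 0x07]), {"eax"}),
]


def summarize(seq):
    """Return instruction count, total code bytes, and scratch registers."""
    size = sum(len(enc) for _, enc, _ in seq)
    scratch = set().union(*(regs for _, _, regs in seq))
    return {
        "instructions": len(seq),
        "bytes": size,
        "scratch_regs": sorted(scratch),
    }


fused_summary = summarize(fused)
split_summary = summarize(split)

print("memory-operand form:", fused_summary)
print("three-instruction form:", split_summary)
```

In this particular case the three-instruction form takes three times the code bytes and ties up `eax`, which the compiler or programmer then can't use for anything else across that sequence.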