>> Micro-fusion improves instruction bandwidth delivered from decode to retirement and saves power. Coding an instruction sequence by using single-uop instructions will increase the code size, which can decrease fetch bandwidth from the legacy pipeline.
> Out of all runs that I did, the unfused version was never faster than the fused one.
I mean, yeah. These little microbenchmarks wouldn't cause the kind of I$ pressure that Intel's talking about.
Nice article. I wish it also discussed the potential value of future-proofing the code (for future Intel or, more importantly, Intel-compatible CPUs), in which case wouldn't it make sense to use 3 simpler instructions? Even at the risk of instruction cache pressure. The 3-instruction version isn't worse here, since the problem was alignment rather than instruction selection.
> in which case wouldn't it make sense to use 3 simpler instructions?
No. Any competitor's CPUs will have similar constraints on fetch and decode bandwidth, so smaller code that isn't slower will always be the better choice. Sometimes a sequence that is slightly slower in isolation is still better, if its smaller size results in fewer cache misses overall.
Plus, the simpler sequence needs an extra architectural register, where a memory-operand instruction uses an implicitly allocated internal one, and that extra register pressure will affect other code nearby.
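To put rough numbers on the size argument, here's a minimal sketch. It assumes the instruction in question is a read-modify-write like `add dword ptr [rdi], esi` (my guess, the thread doesn't say which one it was) and compares its hand-assembled encoding against the equivalent load/add/store sequence. The names `FUSED`, `SPLIT`, `code_size` and `copies_per_line` are made up for illustration, and the 64-byte line size is an assumption.

```python
# Hand-assembled x86-64 encodings. The instruction choice is an assumption,
# not taken from the article being discussed.
FUSED = [
    ("add dword ptr [rdi], esi", bytes([0x01, 0x37])),
]
SPLIT = [
    ("mov eax, dword ptr [rdi]", bytes([0x8B, 0x07])),  # needs a scratch register (eax)
    ("add eax, esi",             bytes([0x01, 0xF0])),
    ("mov dword ptr [rdi], eax", bytes([0x89, 0x07])),
]

CACHE_LINE_BYTES = 64  # typical x86 line size, assumed here


def code_size(seq):
    """Total encoded length of a sequence, in bytes."""
    return sum(len(encoding) for _, encoding in seq)


def copies_per_line(seq, line_bytes=CACHE_LINE_BYTES):
    """How many back-to-back copies of the sequence fit in one cache line."""
    return line_bytes // code_size(seq)


if __name__ == "__main__":
    for name, seq in (("fused", FUSED), ("split", SPLIT)):
        print(f"{name}: {code_size(seq)} bytes, "
              f"{copies_per_line(seq)} copies per {CACHE_LINE_BYTES}-byte line")
```

With these particular encodings the split version is three times the size and also clobbers `eax`. Longer addressing modes change the absolute numbers, but the split form still pays for the memory operand twice (once in the load, once in the store), so the direction stays the same.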