His terminology appears to be weird and i don't think he's testing what he's try...

BeeOnRope · on Feb 5, 2018

> 1- using the "INC mem" form is not a fused instruction. Intel has macro op fusion

inc mem is a micro-fused instruction. He's not talking about macro-fusion here, which as you point out involves two separate instructions (a flag-setting ALU op and a conditional jump) being fused together into one uop.

> and the INC and STORE will be fused and retired together (not the LOAD and INC), IIRC.

Actually it's the load and the inc that will be fused together. The store itself internally is 2 uops, which are also fused together. So there are a total of 4 (unfused) uops implied by inc [mem] and those get fused in pairs for a total of 2 fused uops.

Edit: what I said is true for "add [mem], 1" but apparently not for "inc [mem]" which apparently needs 3 fused domain uops, not 2. I'm not sure why it's worse and which "pair" isn't fusing (but if I had to guess it's the load-op pair that fails to fuse). It's kind of dumb since you could just use add [mem], 1 to get the same effect, except in the rare case you wanted the different flag setting behavior of inc.
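To make the counting concrete, here is a rough sketch of the arithmetic only. It is not a model of the real hardware: the uop names and the `fused_domain_count` helper are made up for illustration, and the choice of which pair fails to fuse for inc [mem] is just the guess stated above.

```python
# Unfused-domain uops implied by a read-modify-write instruction, as described
# in the comment above. The names are illustrative labels, not Intel terminology.
RMW_UOPS = ["load", "alu", "store_address", "store_data"]


def fused_domain_count(uops, fused_pairs):
    """Count fused-domain uops: each fused pair takes one slot, every other uop its own."""
    paired = {u for pair in fused_pairs for u in pair}
    unpaired = [u for u in uops if u not in paired]
    return len(fused_pairs) + len(unpaired)


# add [mem], 1: both pairs micro-fuse.
add_mem = fused_domain_count(
    RMW_UOPS, [("load", "alu"), ("store_address", "store_data")]
)

# inc [mem]: ASSUMPTION (the guess above) that the load-op pair does not fuse.
inc_mem = fused_domain_count(RMW_UOPS, [("store_address", "store_data")])

print(add_mem, inc_mem)
```

Under that assumption the totals match the numbers quoted: 2 fused-domain uops for add [mem], 1 and 3 for inc [mem], with 4 unfused uops in both cases.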

jnordwick · on Feb 5, 2018

Thanks. Is all this learnable from the Intel docs? Or is there a lot of IACA experimentation?

BeeOnRope · on Feb 5, 2018

Most of this is in Agner's docs (check out microarchitecture.pdf and look for 'fusion') and his spreadsheets (which list the number of uops in the fused and unfused domains).

Once you understand that background see this answer:

https://stackoverflow.com/a/31027695/149138

which has the most up-to-date info I'm aware of about some important issues that doc doesn't cover (in particular, revealing that there are actually two types of micro-fusion: the bad kind that unlaminates before renaming, and the good kind that doesn't).

IACA is mostly pretty useless for this stuff, or at least I'd never trust it by itself, since it fails to correctly model a lot of the various cases (including stuff as basic as L1-hit latency for the various addressing modes). In this case, it actually was IACA that triggered the investigation, so this is a bit of an exception to that, but even here it failed to model it correctly once the details were worked out through testing.

jnordwick · on Feb 5, 2018

Thanks so much. It really fills a gap in my knowledge. I do high performance work, so I don't use this directly, but I find it very interesting and feel it improves my mental model of what's going on under the hood.

I'd also like to dig into compilers a little more than my college courses, and I have an idea for code gen of APL functions used in a particular setting (database stored procs).