
His terminology seems off, and I don't think he's testing what he's trying to test.

1- using the "INC mem" form is not a fused instruction. Intel has macro op fusion, where two macro ops (like cmp+jmp) are sent to the decoder as a unit so they can be decoded into uops more efficiently. This isn't that.

2- micro op fusion is when, after decoding from macro ops to micro ops, two adjacent micro ops are retired together, such as "ADD reg mem", which goes to the micro ops "LOAD reg2 mem" and "ADD reg reg2".

Or in his example:

INC mem ->

LOAD reg mem

INC reg

STORE mem reg

and the INC and STORE will be fused and retired together (not the LOAD and INC), IIRC.

3- he's hitting the loop stream detector, so decoding is bypassed anyway. His 3 macro op version is hitting the uop cache, same as the 1 macro op version.

4- he's trying to say both the "good" and "bad" versions will execute the same, but I don't think this is impacted by the uop fusion. And nothing in his counters shows any difference.

The single direct-memory form of INC cuts down on register and L1 instruction cache pressure, neither of which this benchmark will pick up.

I'm a software guy though, so corrections are definitely encouraged.



> 1- using the "INC mem" form is not a fused instruction. Intel has macro op fusion

inc mem is a micro-fused instruction. He's not talking about macro-fusion here, which as you point out involves two separate instructions (a flag-setting ALU op and a conditional jump) being fused together into one uop.
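To make the macro-fusion side concrete, here's a toy sketch of the pairing rule described above: a flag-setting ALU op directly followed by a conditional jump collapses into one unit. The mnemonic sets and the `macro_fuse` helper are made up for illustration, and real hardware has more conditions (which ops qualify varies by microarchitecture), so treat this as a mental model only.

```python
# Toy model of macro-fusion: NOT a description of any real decoder.
# Assumption: these mnemonic sets are an illustrative subset only.
FLAG_SETTERS = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jge"}


def macro_fuse(instructions):
    """Pair each flag-setting op with an immediately following conditional jump."""
    fused = []
    i = 0
    while i < len(instructions):
        cur = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if (
            nxt is not None
            and cur.split()[0] in FLAG_SETTERS
            and nxt.split()[0] in COND_JUMPS
        ):
            # Two separate instructions become one fused unit.
            fused.append(f"{cur} + {nxt}")
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


if __name__ == "__main__":
    for unit in macro_fuse(["inc [mem]", "cmp eax, 10", "jne loop"]):
        print(unit)
```

Note that "inc [mem]" passes through untouched: whatever fusion happens there is inside a single instruction, which is the micro-fusion case.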

> and the INC and STORE will be fused and retired together (not the LOAD and INC), IIRC.

Actually it's the load and the inc that will be fused together. The store itself is internally 2 uops, which are also fused together. So there are a total of 4 (unfused) uops implied by inc [mem], and those get fused in pairs for a total of 2 fused uops.

Edit: what I said is true for "add [mem], 1" but apparently not for "inc [mem]", which needs 3 fused-domain uops, not 2. I'm not sure why it's worse or which "pair" isn't fusing (but if I had to guess, it's the load-op pair that fails to fuse). It's kind of dumb, since you could just use add [mem], 1 to get the same effect, except in the rare case where you want the different flag-setting behavior of inc.
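A quick way to keep the two domains straight is to tabulate the numbers quoted in this comment. This is only bookkeeping, not a measurement: the 4 unfused uops for inc [mem] are assumed to match add [mem], 1 (the comment implies it but doesn't state it separately), and `UopCount` and `report` are names invented for the sketch.

```python
# Bookkeeping for the uop counts quoted in the thread (nothing is measured here).
from dataclasses import dataclass


@dataclass(frozen=True)
class UopCount:
    unfused: int  # uops in the unfused domain
    fused: int    # uops in the fused domain

    @property
    def fused_pairs(self) -> int:
        # Each micro-fused pair saves exactly one slot in the fused domain.
        return self.unfused - self.fused


# Assumption: unfused count for inc [mem] is taken to be 4, same as add [mem], 1.
COUNTS = {
    "add [mem], 1": UopCount(unfused=4, fused=2),
    "inc [mem]": UopCount(unfused=4, fused=3),
}


def report(counts):
    """Map each instruction to the number of pairs that micro-fused."""
    return {insn: c.fused_pairs for insn, c in counts.items()}


if __name__ == "__main__":
    for insn, pairs in report(COUNTS).items():
        print(f"{insn}: {pairs} fused pair(s)")
```

So with these numbers, add [mem], 1 fuses both pairs while inc [mem] only fuses one, which is the difference the edit is pointing at.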


Thanks. Is all this learnable from the Intel docs? Or is there a lot of IACA experimentation?


Most of this is in Agner's docs (check out microarchitecture.pdf and look for 'fusion') and his spreadsheets (which list the number of uops in the fused and unfused domains).

Once you understand that background, see this answer:

https://stackoverflow.com/a/31027695/149138

which has the most up-to-date info I'm aware of about some important issues that doc doesn't cover (in particular, revealing that there are actually two types of micro-fusion: the bad kind that unlaminates before renaming, and the good kind that doesn't).

IACA is mostly pretty useless for this stuff: or at least I'd never trust it by itself since it fails to correctly model a lot of the various cases (including stuff as basic as L1-hit latency for the various addressing modes). In this case, it actually was IACA that triggered the investigation, so this is a bit of an exception to that - but even here it failed to model it correctly once the details were worked out through testing.


Thanks so much. It really fills a gap in my knowledge. I do high performance work, so I don't use this directly, but I find it very interesting and feel it improves my mental model of what's going on under the hood.

I'd also like to dig into compilers a little more than my college courses did, and I have an idea for code gen of APL functions used in a particular setting (database stored procs).



