It might be slower to use a memory access, even if you assume an L1 cache hit on average. You can pipeline the bitwise version and maybe achieve an amortized divide per cycle, whereas (depending on the processor and many other things) the lookup table might still be around 4 cycles per divide even in unrolled/parallel situations. Plus the technique in the article is good for just about any constant divisor and, if generalized, works on much larger ranges, which is why many compilers use this trick when they can.
Lookup tables can be slower. Do not assume that memory is fast. Even if the table is small and fits in cache, it will still be displacing other useful things from the cache. If you can have a small sequence of fast instructions using only registers and no branches, that's very likely to be faster (if not much faster) than lookup tables. Just one L1 cache miss could take much longer to resolve than computing this particular sequence of instructions.
The example is literally one cache line, which probably won't affect the rest of the program too much. But given the average L1 throughput, I'd bet the bitwise version is faster.
Cache pressure is not really relevant: the result can be represented in 3 bits, so the entire table can fit in three 64-bit ints. It's uglier and you still need to do bit shifts, but it's much easier to write.
The method in the blog post uses two bit shifts and two additions. I don't really see how to beat that with this.
You can fit twenty-one 3-bit entries in a 64-bit int, with one bit to spare.
So naively you get:
(table[nlz / 21] >> (3 * (nlz % 21))) & 7
Which involves a division and a modulo, plus the multiply, shift and mask to extract the field. And that's assuming 63 entries in total; I'm not even trying to handle fitting the 64th entry into those three leftover bits somehow, or spreading it across the three ints (again, I don't see how to do that without division).
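For reference, building the packed words would look something like this (hypothetical helper; entries[] stands in for whatever the real table holds):

    #include <stdint.h>

    /* Pack a 64-entry table of 3-bit values into three 64-bit words,
       21 entries per word. Entry 63 doesn't fit and is simply dropped. */
    static void pack_table(const uint8_t entries[64], uint64_t packed[3]) {
        packed[0] = packed[1] = packed[2] = 0;
        for (int i = 0; i < 63; i++)
            packed[i / 21] |= (uint64_t)(entries[i] & 7) << (3 * (i % 21));
    }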
Alternatively, each integer could contain one bit of the output, so:
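(Untested sketch; B0, B1 and B2 are placeholder constants, where bit i of Bk holds bit k of the table entry for nlz == i.)

    #include <stdint.h>

    /* Placeholder constants; the real values depend on the table being encoded. */
    static const uint64_t B0 = 0, B1 = 0, B2 = 0;

    static inline unsigned lookup3(unsigned nlz) {
        return  ((B0 >> nlz) & 1)
              + (((B1 >> nlz) & 1) << 1)
              + (((B2 >> nlz) & 1) << 2);
    }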
> Which is five shifts and two additions, so more work plus lookup.
Something like that, perhaps with more bitmasking and fewer additions. There is no lookup: just three constants loaded into different registers and handled by different out-of-order execution units, then combined in a final addition/OR. So what you will "see" on the critical path is two bitshifts and three bitwise ORs/ANDs.
It's not faster; the point is that the speedup of "binary division" is probably not worth the day or so spent developing it, except in extreme cases.
No need to pack bits. Use whole words. The table with 64 entries is small, and if the operation is performed often, the table will always be in L1 cache.
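Something like this (the table contents are placeholders here, since they depend on what the article computes):

    #include <stdint.h>

    /* One byte per entry; placeholder contents, not the real values. */
    static const uint8_t table[64] = {0};

    static inline unsigned lookup(unsigned nlz) {
        return table[nlz];  /* single load, no shifting or masking */
    }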
Table lookup will take 1 cycle. The method with shifts takes 4 cycles. (It can make sense only in tight loops; the gain is really negligible in absolute terms in the grand scheme of things.)
Instruction timings for Intel can be found here: www.agner.org/optimize/instruction_tables.pdf
I know, but the person I replied to specifically said "it fits in three 64-bit integers" to point out you could squeeze the table into three words. I just tried to figure out how that could possibly be advantageous here (I don't think it can).
The best I can come up with is using 32 8-bit integers (so 64 bits more than three 64-bit ints in size) and doing this:
(table[nlz >> 1] >> ((nlz & 1) << 2)) & 7
… which, with two entries per byte (one in each nibble), is three shifts, two ANDs and one lookup. I'm fairly certain that two shifts and two additions beat that. Well, would beat that, if everything else in the code weren't likely to be a much bigger bottleneck that dwarfs the impact of this.
Lookup table with 64 entries, anyone?