Okay, I had something wrong earlier, but I deleted my earlier, incorrect post. T...

gpderetta · on Oct 14, 2020

You also need to load root. If you do not count that, then in the linked list case there are no dereferences (root is literally the pointer to the allocated block).

The stack case will touch three cachelines (root, nodeptrs and the final block). The linked list touches only two (there is no nodeptrs).
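To make the cacheline count concrete, here is a minimal sketch of the two designs being compared. The type and member names (`Block`, `StackPool`, `FreeListPool`, `storage`) are assumptions for illustration and not the allocator from the original post; only `root`, `nodeptrs` and `size` echo names used in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size block; layout is illustrative only.
struct Block {
    Block* next;        // used only by the free-list variant
    char payload[56];
};

// Stack-of-pointers pool: alloc reads `size`, then nodeptrs[size],
// and the caller later touches the block itself (three locations).
struct StackPool {
    std::vector<Block> storage;
    std::vector<Block*> nodeptrs;  // stack of pointers to free blocks
    std::size_t size;              // number of free entries in nodeptrs

    explicit StackPool(std::size_t n) : storage(n), nodeptrs(n), size(n) {
        for (std::size_t i = 0; i < n; ++i) nodeptrs[i] = &storage[i];
    }
    Block* alloc() { return size ? nodeptrs[--size] : nullptr; }
    void free(Block* b) {
        assert(size < nodeptrs.size());  // more frees than allocs is a bug
        nodeptrs[size++] = b;
    }
};

// Intrusive free list: root already is the pointer to the block handed out,
// so alloc touches root and the block (two locations).
struct FreeListPool {
    std::vector<Block> storage;
    Block* root;

    explicit FreeListPool(std::size_t n) : storage(n), root(nullptr) {
        // Thread every block onto the list, last block first.
        for (std::size_t i = n; i-- > 0;) {
            storage[i].next = root;
            root = &storage[i];
        }
    }
    Block* alloc() {
        Block* b = root;
        if (b) root = b->next;  // reads the block being returned
        return b;
    }
    void free(Block* b) {
        b->next = root;
        root = b;
    }
};
```

In the free-list version the only extra memory read is `b->next`, which lives in the block the caller is about to use anyway; the stack version has the additional `nodeptrs` array in between.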

TBH it would be very hard to design a not completely artificial benchmark where one design makes a difference compared to the other. The only way is to test on an actual application and check what fits best.

dragontamer · on Oct 14, 2020

> TBH it would be very hard to design a not completely artificial benchmark where one design makes a difference compared to the other. The only way is to test on an actual application and check what fits best.

I think I can agree to that.

I should note that I designed this allocator to be used on the GPU. "Size" is shared across a wavefront / SIMD unit, and popcnt(exec_mask) determines how many items to allocate at a time. (If 200 threads call alloc, they collectively grab 200 pointers all at once by doing size -= popcount(exec_mask).)
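A rough CPU-side sketch of that idea, assuming a 64-lane wavefront modeled as a 64-bit `exec_mask`: one atomic subtraction reserves a contiguous run of stack entries for all active lanes, and each lane picks its entry by counting the active lanes below it. `WavePool`, `alloc_wave`, `lane_rank` and `popcount64` are hypothetical names; the portable `popcount64` loop stands in for the hardware popcount instruction, and pool exhaustion is only asserted, not handled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Portable bit count; a stand-in for the GPU's popcount instruction.
inline std::uint32_t popcount64(std::uint64_t x) {
    std::uint32_t n = 0;
    while (x) { x &= x - 1; ++n; }  // clear the lowest set bit each step
    return n;
}

struct WavePool {
    std::vector<std::uint32_t> nodeptrs;  // stack of free slot indices
    std::atomic<std::uint32_t> size;      // shared count of free entries

    explicit WavePool(std::uint32_t n) : nodeptrs(n), size(n) {
        for (std::uint32_t i = 0; i < n; ++i) nodeptrs[i] = i;
    }

    // Models one wavefront calling alloc: a single atomic op performs
    // size -= popcount(exec_mask) and returns the base of the reserved run.
    std::uint32_t alloc_wave(std::uint64_t exec_mask) {
        std::uint32_t count = popcount64(exec_mask);
        std::uint32_t old = size.fetch_sub(count);
        assert(old >= count);  // assumption: the pool never runs dry
        return old - count;
    }
};

// Rank of `lane` (0..63) among the active lanes: set bits below it.
inline std::uint32_t lane_rank(std::uint64_t exec_mask, unsigned lane) {
    return popcount64(exec_mask & ((std::uint64_t{1} << lane) - 1));
}
```

An active lane `L` would then take `nodeptrs[base + lane_rank(exec_mask, L)]`, so the whole wavefront pays for one atomic operation rather than one per thread.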

gpderetta · on Oct 15, 2020

Oh, in that context, being able to parallelize allocation this way must be a big win. Of course it wouldn't work with a free list.