Okay, I had something wrong earlier, but I deleted my earlier, incorrect post. That's what I get for rushing and trying to do this in one go. I think I got it correct this time:
struct Node{
char rep[64]; // 64-bytes per node
};
struct StackLink {
struct StackLink* prev;
Node* nodePtrs[4096];
};
static StackLink* last; // Points to nullptr when empty
static int stackLinkSize;
It'd be a stack, not a queue. So I changed the name to "stack" to better represent the strategy. stackLinkSize doesn't need to be part of the link structure.
That's why the fast path is simply:
I think I agree with you that "size" is dereferenced (for dereference #2), but since its "Hot" I don't think it would be the bottleneck for anything. L1 references are very fast.
You also need to load root. If you do not count that, then in the linked list case there are no dereferences (root is litterally the pinter to the allocated block).
The stack case will touch three cachelines (root, nodeptrs and the final block). The linked list only two (there is no nodeptrs).
TBH it would be very hard to design a nonncompletely artificial benchmark were une desig make a difference compared to the other. Only way is test on an actual application and check what fits best.
> TBH it would be very hard to design a nonncompletely artificial benchmark were une desig make a difference compared to the other. Only way is test on an actual application and check what fits best.
I think I can agree to that.
I should note that I designed this allocator to be used on the GPU. "Size" is shared between a wavefront / SIMD unit. popcnt(exec_mask) determines how many items to alloc at a time. (if 200 threads call alloc, then they will collectively grab 200 pointers all at once, by doing size -= popcount(exec_mask)).
I think I agree with you that "size" is dereferenced (for dereference #2), but since its "Hot" I don't think it would be the bottleneck for anything. L1 references are very fast.