
L1 bandwidth is very intimately tied to the core (it's really more proper to talk about the bandwidth of the memory pipes instead of L1), so that's not changing on a small revision.


What makes you say that? Intel, for instance, IIRC has a high 'burst' bandwidth and a lower sustained bandwidth for L1. What's improper about discussing it in those terms?


Are you considering just clock speed? The "bandwidth" of L1 is just how fast the load/store units can operate on it, and they do the same amount of work per clock regardless of conditions.


I am not talking about clock speed, no. I am all but certain I read somewhere that (Intel) L1 caches have a sustained throughput somewhat lower than their peak throughput, falling behind what the load/store pipes can deliver. This could be explained by some queue not being large enough. Can't find the reference now, though.


What you might be talking about are the effects of load and store buffers. These are essentially FIFO queues sitting between the core and the cache hierarchy, and consequently main memory.

For example, if you want to store some data, it first goes into a store buffer, and from the core's point of view the store is basically done at that point. Each store-data port can move up to 64 bytes per cycle into the buffer. Skylake had only one such port (Store Data), whereas Sunny Cove upgraded to two. In practice this means that, provided you have at least two store uOps in the pipeline (which maxes out at 4-5 uOps per cycle), Sunny Cove could double the bandwidth, since it could move 128 bytes into the store buffer per cycle. Buffering in general, regardless of the uarch, helps hide memory-subsystem latencies. And I guess these could be the "bursts" you might have read about.

The endgame for hiding latencies, though, is when your code demands that the data you store be immediately visible to other cores. In that case you have to drain the store buffer, which, along with a pipeline flush due to branch misprediction, is one of the most expensive operations you can do on x86-64.


I don't see the significance. That would not explain a burst bandwidth being greater than the sustained bandwidth.

BTW, I'm pretty sure that in Sunny Cove they just went from 1x 64-byte store port to 2x 32-byte store ports, so the actual bandwidth did not increase for vectorised code.



