
I think this is inadvertently a good critique of this popular item - it doesn't really tell you why you might want to have some exposure to these numbers.

For a lot of programming, the lowest of these latencies - cache and memory access and branch mispredicts - are averaged out and essentially unnoticeable, or not worth caring too much about. That's to be expected; it's the design goal. But even in regular, high-level language programming it's not too rare for them to become 'macroscopic', and that is a useful, practical thing to be aware of, rather than some academic curiosity.



I think you underestimate the domination of large numbers.

Let's say you have code that is 50% main memory and 50% L1 cache. Let's say there are 1000 operations.

You have (500 x 100ns) + (500 x 1ns) or 50500ns total time.

Now let's say you optimize the code so that they are all L3 operations: 20ns x 1000 operations is 20000ns, or over twice as fast.
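
To put the same arithmetic in code (just an illustration using the rough figures from the list, not a benchmark):

    /* Back-of-the-envelope sketch of the comparison above, using the
     * approximate latency figures from the list. Purely illustrative. */
    #include <stdio.h>

    int main(void) {
        const double L1_NS   = 1.0;    /* ~L1 cache hit  */
        const double L3_NS   = 20.0;   /* ~L3 cache hit  */
        const double MAIN_NS = 100.0;  /* ~main memory   */
        const int ops = 1000;

        double mixed  = (ops / 2) * MAIN_NS + (ops / 2) * L1_NS; /* 50500 ns */
        double all_l3 = ops * L3_NS;                             /* 20000 ns */

        printf("50/50 main memory + L1: %.0f ns\n", mixed);
        printf("all L3:                 %.0f ns (%.1fx faster)\n",
               all_l3, mixed / all_l3);
        return 0;
    }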


I'm not sure I'm underestimating anything since I'm not estimating things and we seem to be kind of saying the same thing?

Edit: to clarify a bit further - people read this list and think '1 nanosecond, 5 nanoseconds, not important to me, academic, etc'. My point is that's a misunderstanding but the list alone doesn't do a good job of disabusing one of the misunderstanding.


If you optimize for L1-L3 caches, don't you need to know/control what hardware you are running on, since cache sizes differ between processors?


Surprisingly: no.

1. Cache-oblivious data-structures work no matter the cache-size. They aren't 100% optimized to a particular cache size, but they are proven to be "efficient" across many different cache sizes.

2. For an example of #1, matrix multiplication is one of the most optimized functions of modern computing. One major step is to transpose the matrix, so that the "vertical" movement turns into a "horizontal" movement. The horizontal movement, which is "address + 1" allows for the cache to work with your loop, while a vertical movement across the matrix is "address + width", and not cache-efficient.
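
A rough sketch of that transpose trick in C (illustrative, not a tuned kernel - real libraries also block/tile for the cache):

    /* Sketch: C = A*B, but transpose B first so the inner loop walks both
     * operands with stride-1 ("address + 1") access instead of striding
     * by the row width. */
    #include <stdlib.h>

    void matmul_transposed(const double *A, const double *B, double *C, int n) {
        double *Bt = malloc((size_t)n * n * sizeof *Bt);
        if (!Bt) return;

        /* Transpose B once, so column j of B becomes row j of Bt. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                Bt[j * n + i] = B[i * n + j];

        /* Inner loop now reads A[i][k] and Bt[j][k]: both sequential. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * Bt[j * n + k];
                C[i * n + j] = sum;
            }

        free(Bt);
    }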

------------

3. I think it is safe to start making some degree of cache-assumptions on modern systems. All caches today have 64-byte cache lines (or greater): ARM, Intel, and AMD CPUs all use 64-byte cache lines. AMD GCN GPUs use 64-byte cache lines. AMD RDNA GPUs use 128-byte cache lines (and Intel is rumored to "pair up" cache lines into 128-byte segments at L3).

This means that every memory operation to main memory reads or writes at least 64 bytes of data at a time, on all modern systems. DDR4 RAM itself has Burst-Length 8, meaning 64 bytes is literally the smallest unit of data that can be transferred, so it isn't surprising that all CPUs / GPUs are working with such a large number these days.
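
One way to lean on that assumption in C (a sketch; the 64 here is assumed rather than queried from the hardware):

    /* Sketch of using the "64-byte line" assumption: pad/align per-thread
     * counters to a full cache line so two threads don't ping-pong the
     * same line back and forth (false sharing). */
    #include <stdalign.h>
    #include <stdint.h>

    #define CACHE_LINE 64  /* assumed line size on current x86/ARM CPUs */

    struct padded_counter {
        alignas(CACHE_LINE) uint64_t value;
        char pad[CACHE_LINE - sizeof(uint64_t)];
    };

    struct padded_counter per_thread_counts[16]; /* one line per thread */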

-----------

Given the 64-byte cache line (or 128-byte, for Intel's L3 cache or AMD RDNA GPUs), "unrolling" your linked lists, using B-trees with larger nodes (instead of the cache-inefficient binary tree), or using d-heaps (instead of binary heaps) plays to the strengths of the cache and helps out a lot.

You don't need to know the specific size of the cache line to know that d-heaps are more efficient than binary-heaps on modern, cache-heavy systems. You just need to know that sequential-memory access stays within the L1 cache.
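
For example, the index math for a d-heap is just as simple as a binary heap's; here's a sketch in C, assuming 8-byte keys so a node's children land in one or two 64-byte lines:

    /* Sketch of the index arithmetic for a d-ary heap. With d = 8 and
     * 8-byte keys, the D children of a node are contiguous and span at
     * most two 64-byte cache lines, so a sift-down touches far fewer
     * lines than a binary heap of the same size. */
    #include <stddef.h>

    #define D 8  /* branching factor; 8 * sizeof(long) == 64 bytes */

    static size_t parent(size_t i)      { return (i - 1) / D; }
    static size_t first_child(size_t i) { return i * D + 1; }

    /* One sift-down step: find the largest of up to D children. */
    static size_t largest_child(const long *heap, size_t n, size_t i) {
        size_t best = first_child(i);
        if (best >= n) return i;                 /* no children */
        for (size_t c = best + 1; c < first_child(i) + D && c < n; c++)
            if (heap[c] > heap[best]) best = c;
        return best;
    }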


Often you can get pretty good results from just knowing that cache misses are expensive and working from that point: Store together what is accessed together, use linear arrays to help prefetcher, prefetch manually if really needed (it usually isn't and the prefetch instruction isn't free), if your data is accessed effectively randomly try packing used parts into smaller amount of cache lines, etc...
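
As an illustration, "store together what is accessed together" often just means splitting a fat struct; a C sketch with made-up field names:

    /* Sketch: if a hot loop only reads position and velocity, splitting
     * them out of a fat record lets the loop stream through tightly
     * packed arrays instead of dragging the whole record through the
     * cache. Field names are invented for illustration. */

    /* Array-of-structs: every iteration pulls in name/health/etc. too. */
    struct entity_aos {
        float x, y, z;
        float vx, vy, vz;
        char  name[32];
        int   health;
    };

    /* Struct-of-arrays: the physics loop touches only the data it needs. */
    struct entities_soa {
        float *x, *y, *z;
        float *vx, *vy, *vz;
        /* cold data (name, health, ...) kept elsewhere */
    };

    void integrate(struct entities_soa *e, int n, float dt) {
        for (int i = 0; i < n; i++) {
            e->x[i] += e->vx[i] * dt;   /* sequential, prefetcher-friendly */
            e->y[i] += e->vy[i] * dt;
            e->z[i] += e->vz[i] * dt;
        }
    }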

The latency numbers are still useful for making back of the envelope calculations about how much it is possible to gain and whether it's worth the trouble.


Can you explain the randomly-accessed-data cache mitigation strategy more?


A linked list is a random-memory traversal every time you do "node = node->next".

You can grossly mitigate the random-memory read/write by using unrolled linked lists (https://en.wikipedia.org/wiki/Unrolled_linked_list).

Or, instead of using binary-heaps, use d-heaps: https://en.wikipedia.org/wiki/D-ary_heap
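
Roughly what an unrolled node looks like in C (a sketch; the node size of ~64 bytes is an assumption, not a tuned value):

    /* Sketch of an unrolled linked list node: each node packs several
     * elements, so most of a traversal is sequential reads within one
     * node, and only the occasional "node = node->next" is a potential
     * cache miss. Sized to roughly one 64-byte cache line. */

    #define ELEMS_PER_NODE 13   /* 8B next + 4B count + 13 * 4B = 64 bytes */

    struct unrolled_node {
        struct unrolled_node *next;
        int count;                      /* how many slots are in use */
        int elems[ELEMS_PER_NODE];
    };

    long sum(const struct unrolled_node *node) {
        long total = 0;
        for (; node; node = node->next)          /* one random hop per node... */
            for (int i = 0; i < node->count; i++)
                total += node->elems[i];         /* ...then sequential reads   */
        return total;
    }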


Well, it doesn't work if the access is so random that everything is a guaranteed cache miss. On the other hand, if there's a subset that's accessed more often than the rest, it reduces cache pollution from accesses outside that core subset.


If it's this random it might be worth stepping back and trying to figure out how to get the portion of the results that are desired using a more effective mechanism.

Sometimes the source data structure is just a bad fit for the expressed data. I've seen this often in combinatorial expansions of data stored in a hierarchical structure to a 2D table.

Sometimes all, or a majority, of the data is going to be touched anyway, so it's more costly to avoid a full read; it's better to make the results of that read parallel, either by sorting into a more suitable structure or by working on streams / permutations / parallel processes from a dispatch design.



