The % of programmers that actually need to know any of these numbers is tiny. I've done plenty of optimizations for performance-critical systems and have never known any of these numbers. Certainly, I know general principles like "disk is a lot slower than memory" and "don't block the UI on a network call". But knowing whether an L1 cache read is 0.5ns or 5ns has never been necessary. So much of the "highly optimized" code out there was written without those numbers. I'm sure there are some people who have to know that stuff, maybe a couple of people on an AAA video game title, or computer engineers at Intel. But they're the exception, not the norm.
> The % of programmers that actually need to know any of these numbers is tiny.
Context is key. This list was originally written by Jeff Dean, as noted in the credits. He's a legendary [1] Google engineer who (usually side-by-side with the under-credited Sanjay Ghemawat) created such systems as MapReduce, GFS, Bigtable, and Spanner. I'd read "numbers every programmer should know" as "numbers every programmer who wants to be like Jeff Dean would benefit from learning about". I can see how folks might interpret it as gatekeeping—if you can't rattle off these numbers exactly from memory, you aren't a real programmer. I don't think that's the spirit in which it's intended.
These numbers are useful in designing efficient systems in Google datacenters, making decisions like "is it better for all of my stateless servers to keep this on local disk or to send an RPC to an in-cluster server that has it in RAM?" or "do I have sufficient deadline to fail over to another cluster in the 99.999%ile case?" You don't really need to have the entire list memorized, but it helps to know the appropriate operations to consider when deciding if something is suitable from efficiency and latency standpoints. It helps to have intuition for the orders of magnitude. It's okay to look at the cheat sheet. You can do math on these numbers to get a surprisingly accurate idea of a design's performance/feasibility long before you work up a prototype, moving more quickly. Comparing these expectations to experiment helps you spot implementation flaws and get closer to achieving the theoretical performance. And if nothing else, being familiar with the ideas here will help you ace a Google system design interview.
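To make that concrete, here's a rough sketch of the local-disk-vs-RPC question, using ballpark figures from the 2012-era list (the constants and names below are illustrative only, not measurements; modern SSDs and NICs will differ):

    #include <stdio.h>

    /* Ballpark figures from the classic list, in nanoseconds -- used purely
     * for back-of-the-envelope comparison, not as measured values. */
    #define READ_1MB_FROM_RAM_NS      250000.0   /* ~0.25 ms */
    #define READ_1MB_FROM_SSD_NS     1000000.0   /* ~1 ms    */
    #define READ_1MB_FROM_DISK_NS   20000000.0   /* ~20 ms   */
    #define DC_ROUND_TRIP_NS          500000.0   /* ~0.5 ms  */
    #define SEND_1MB_1GBPS_NS       10000000.0   /* ~10 ms   */

    int main(void) {
        /* Option A: each stateless server reads the 1 MB from its own disk. */
        double local_disk = READ_1MB_FROM_DISK_NS;
        double local_ssd  = READ_1MB_FROM_SSD_NS;

        /* Option B: RPC to an in-cluster server that already has it in RAM:
         * one in-datacenter round trip, a memory read, and 1 MB on the wire. */
        double rpc_to_ram = DC_ROUND_TRIP_NS + READ_1MB_FROM_RAM_NS
                          + SEND_1MB_1GBPS_NS;

        printf("local spinning disk : %5.2f ms\n", local_disk / 1e6);
        printf("local SSD           : %5.2f ms\n", local_ssd  / 1e6);
        printf("RPC to remote RAM   : %5.2f ms\n", rpc_to_ram / 1e6);
        return 0;
    }

By these figures, the RPC to a server holding the data in RAM beats a local spinning disk but loses to a local SSD, which is exactly the kind of quick answer the list is meant to give you before you build anything.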
Many of these numbers are also useful in other contexts in which performance matters and you have fairly direct control over what expensive operations are performed. (They're less useful, say, in an interpreted programming language, where much happens behind your back and you may not care much about performance anyway.) It certainly wouldn't hurt any programmer to know these numbers.
While I totally agree with the argument that everyone should sometimes think a bit more about the impact of their code (or its execution), the relationship between the SO article you mentioned and the gist is close to zero.
There's no way to get from "memory latency" to "branch prediction".
Yes, I sort of meta-agree with this, in that the raw list of numbers regularly appearing at the top of nerd forums tends to generate more confusion than enlightenment. I have a comment downthread about that: https://news.ycombinator.com/item?id=24675039
Maybe the list should link the SO answer and similar descriptions of macro-manifestations of seemingly nano-things!
> The % of programmers that actually need to know any of these numbers is tiny.
If you mean the whole range of values I would agree.
However, I once interviewed a developer for a front-end position who was entirely oblivious to the cost of making an HTTP request, firmly believing that only large downloads had any measurable impact on performance.
Even if you do not have to do back-of-the-napkin calculations on cache latency, knowing the relative cost of each of these operations does wonders for your decision process.
I think having a ballpark intuition for disk/SSD reads and long network packets is possibly useful to a lot of devs. Much beyond that it's academic unless you really need to know.
I think this is inadvertently a good critique of this popular item - it doesn't really tell you why you might want to have some exposure to these numbers.
For a lot of programming, the lowest of these latencies - cache and memory access, branch mispredicts - are averaged out and essentially unnoticeable, or not worth caring too much about. Which is to be expected, that being the design goal. But it's not too rare, even in regular, high-level language programming, for them to become 'macroscopic', and that is a useful, practical thing to be aware of, rather than some academic curiosity.
I'm not sure I'm underestimating anything since I'm not estimating things and we seem to be kind of saying the same thing?
Edit: to clarify a bit further - people read this list and think '1 nanosecond, 5 nanoseconds, not important to me, academic, etc'. My point is that's a misunderstanding but the list alone doesn't do a good job of disabusing one of the misunderstanding.
1. Cache-oblivious data-structures work no matter the cache-size. They aren't 100% optimized to a particular cache size, but they are proven to be "efficient" across many different cache sizes.
2. For an example of #1, matrix multiplication is one of the most optimized functions of modern computing. One major step is to transpose the matrix, so that the "vertical" movement turns into a "horizontal" movement. The horizontal movement, which is "address + 1", allows the cache to work with your loop, while a vertical movement across the matrix is "address + width", which is not cache-efficient.
------------
3. I think it is safe to start making some degree of cache assumptions on modern systems. All caches today have 64-byte cache lines (or greater): ARM, Intel, and AMD CPUs all use 64-byte cache lines. AMD GCN GPUs use 64-byte cache lines. AMD RDNA GPUs use 128-byte cache lines (and Intel is rumored to "pair up" cache lines into 128-byte segments at L3).
This means that every memory operation to main memory reads or writes at least 64 bytes of data at a time, on all modern systems. DDR4 RAM itself has Burst-Length 8, meaning 64 bytes is literally the smallest unit of data that can be addressed, so it isn't surprising that all CPUs / GPUs are working with such a large number these days.
-----------
Given the 64-byte cache line (or 128-byte, for Intel's L3 cache or AMD RDNA GPUs), "unrolling" your linked lists, using B-trees with larger nodes (instead of the cache-inefficient binary tree), or using d-heaps (instead of binary heaps) plays to the strengths of the cache and helps out a lot.
You don't need to know the specific size of the cache line to know that d-heaps are more efficient than binary-heaps on modern, cache-heavy systems. You just need to know that sequential-memory access stays within the L1 cache.
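To see the effect in #2 directly, here's a minimal sketch (names are made up; it assumes Linux/POSIX clock_gettime, and the exact row/column ratio will vary by machine) that sums a 64 MB matrix once row-by-row and once column-by-column:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 4096  /* 4096 x 4096 ints = 64 MB, far larger than any cache */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        int *m = malloc((size_t)N * N * sizeof *m);
        if (!m) return 1;
        for (size_t k = 0; k < (size_t)N * N; k++)
            m[k] = (int)k;              /* touch every page up front */

        long sum = 0;
        double t;

        /* Horizontal walk: address + 1 each step, so one 64-byte cache line
         * serves 16 consecutive ints before the next line is fetched. */
        t = now_sec();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[(size_t)i * N + j];
        printf("row-major    : %.3f s\n", now_sec() - t);

        /* Vertical walk: address + N each step, so essentially every access
         * lands on a different cache line and misses. */
        t = now_sec();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[(size_t)i * N + j];
        printf("column-major : %.3f s  (sum=%ld)\n", now_sec() - t, sum);

        free(m);
        return 0;
    }

On most machines the column-major walk comes out several times slower, purely because of how the accesses map onto 64-byte cache lines.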
Often you can get pretty good results from just knowing that cache misses are expensive and working from that point: store together what is accessed together, use linear arrays to help the prefetcher, prefetch manually if really needed (it usually isn't, and the prefetch instruction isn't free), and if your data is accessed effectively randomly, try packing the used parts into a smaller number of cache lines, etc.
The latency numbers are still useful for making back of the envelope calculations about how much it is possible to gain and whether it's worth the trouble.
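As one hypothetical illustration of "store together what is accessed together": if an update loop only ever touches positions and velocities, splitting those hot fields out of a fat struct means every 64-byte line the loop pulls in is filled with data it actually uses (the types and field names below are made up for the sketch):

    #include <stddef.h>

    /* Array-of-structs: sizeof(struct Particle) is ~56 bytes here, so each
     * 64-byte cache line holds barely one particle, and much of what it
     * drags into cache (mass, id, name) is cold data the loop never reads. */
    struct Particle {
        float  x, y, z;
        float  vx, vy, vz;
        double mass;
        int    id;
        char   name[16];
    };

    void step_aos(struct Particle *p, size_t n, float dt) {
        for (size_t i = 0; i < n; i++) {
            p[i].x += p[i].vx * dt;
            p[i].y += p[i].vy * dt;
            p[i].z += p[i].vz * dt;
        }
    }

    /* Struct-of-arrays: the same loop now streams through tightly packed
     * floats, so every byte fetched from memory is a byte that gets used. */
    struct Particles {
        float  *x, *y, *z;
        float  *vx, *vy, *vz;
        double *mass;   /* cold fields live off in their own arrays */
        int    *id;
    };

    void step_soa(struct Particles *p, size_t n, float dt) {
        for (size_t i = 0; i < n; i++) {
            p->x[i] += p->vx[i] * dt;
            p->y[i] += p->vy[i] * dt;
            p->z[i] += p->vz[i] * dt;
        }
    }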
Well, it doesn't work if the access is so random that everything is a guaranteed cache miss. On the other hand, if there's a subset that's accessed more often than the rest, it reduces cache pollution from accesses outside that core subset.
If it's this random it might be worth stepping back and trying to figure out how to get the portion of the results that are desired using a more effective mechanism.
Sometimes the source data structure is just a bad fit for the expressed data. I've seen this often in combinatorial expansions of data stored in a hierarchical structure into a 2D table.
Sometimes all, or a majority, of the data is going to be touched anyway, so it's more costly to avoid a full read; it's better to parallelize the results of that read instead, either by sorting into a more suitable structure or by working on streams / permutations / parallel processes from a dispatch design.
I work on a system for remote sensing and automated marine mammal detection. I may not need contingencies for - but I have reckoned with - the full gamut of time deltas from seconds to microseconds. Maybe I've not had to deal with L1 cache misses, but at times memory fetch vs in-pipeline has come up when it comes to image processing choices.
I'm kind of in an odd niche, so maybe your last sentiment applies. But it strikes me as very un-hacker-mindset to be like, "no one _needs_ to know X". It's curiosity.
I work in food ordering, and our volume is big enough now that understanding CPU caches is definitely important for our oft-hit endpoints.
There are also domains like real-time control systems, video conferencing, image processing, devices where battery life is at a premium, AI, video editing, decoding, encoding, stock market trading...
Surely the point is not to "know the numbers", but to understand the mechanics and the relative positions of each piece and their relative efficiencies in system architecture.
When I'm doing optimisation, my goal is not to explicitly say "oh, L1 cache is X nanoseconds", it's to say "the profile pattern of the performance we see indicates that X is the current performance bottleneck, and if we can change the programming/design, we'll gain Y". Especially since, by the time we're optimising heavily, it generally means something was amiss in your original coding or expectations of the system anyway, so your memorised numbers of what performance should be are often shown to be wrong in the first place, and your real-world measurements will tell you the numbers that are relevant to your particular use case.
The point is not to memorize the numbers, but to have used them enough that they are at your fingertips at all times. Awareness of costs is essential to engineering economics.
And it is always better not to need to optimize, because the code is already fast enough. You may leap up to proclaim against premature optimization, but only if you have forgotten what Knuth actually wrote.
Having a solid mental model for “how fast is fast” is, in my opinion, critical to what makes an excellent engineer: knowing when to care about performance up front.
And not even big O notation of algorithms or IO latency. But just a general feel for the performance of higher level abstractions. To look at a design that involves some input, data processing, a transfer somewhere, rendering, presentation, or whatever, and to instantly have an intuition on what parts to worry most about.
I would much prefer people know the scale of these numbers than know big O notation. The number of times I've seen big O used as a hammer because everything looks like a nail can be frustrating.
This is doubly true on embedded systems where N > 3000 tends to be the less common case. Most of the time when I do performance profiles it's a 'flat profile' because the majority of the time is lost thrashing the L1/L2 cache.
When I get onto an unfamiliar platform I do a test of sequential search vs binary search to see where the crossover point is on number of elements for small arrays. Then I know roughly when not to bother with a better algorithm.
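Something like this quick-and-dirty sketch (not the parent's actual harness; names are made up and timings are machine-dependent) is enough to find that crossover point on an unfamiliar platform:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Plain left-to-right scan over a sorted array. */
    static int linear_search(const int *a, int n, int key) {
        for (int i = 0; i < n; i++)
            if (a[i] >= key)
                return a[i] == key ? i : -1;
        return -1;
    }

    /* Textbook binary search over the same array. */
    static int binary_search(const int *a, int n, int key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] == key) return mid;
            if (a[mid] < key)  lo = mid + 1;
            else               hi = mid - 1;
        }
        return -1;
    }

    int main(void) {
        enum { REPS = 1000000 };
        volatile int sink = 0;  /* keeps the compiler from dropping the loops */
        for (int n = 4; n <= 256; n *= 2) {
            int *a = malloc(n * sizeof *a);
            for (int i = 0; i < n; i++) a[i] = 2 * i;  /* sorted, even keys */

            clock_t t0 = clock();
            for (int r = 0; r < REPS; r++) sink += linear_search(a, n, 2 * (r % n));
            clock_t t1 = clock();
            for (int r = 0; r < REPS; r++) sink += binary_search(a, n, 2 * (r % n));
            clock_t t2 = clock();

            printf("n=%4d  linear %4.0f ms  binary %4.0f ms\n", n,
                   (t1 - t0) * 1000.0 / CLOCKS_PER_SEC,
                   (t2 - t1) * 1000.0 / CLOCKS_PER_SEC);
            free(a);
        }
        return 0;
    }

Below the crossover (often a few dozen elements, depending on branch prediction and cache behaviour) the "dumb" scan wins; above it, binary search pulls ahead.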
This was just posted and already there are two types of comments:
1. Most devs don’t need this, it’s not so helpful to know etc.
2. These are critical numbers to know, and at the very least devs should know these numbers.
This sort of disagreement is common in our industry; it's not just this, it's also Big O, algorithms and data structures, and even OS fundamentals that people disagree on.
I’d love to take both groups of devs commenting and give each group a set of programming tasks to complete. Judge them based on correctness, speed of development, how fast the tasks run, the quality of the code, etc.
Is it perhaps because our jobs are so varied? My team implements what I would call "business logic" for a (relative to Google scale) small number of users. We often frown at algorithmic coding exercises, like what we believe Google would ask in interviews, using the argument: we suppose that makes sense if you have to optimize compression of YouTube's video streams, which we don't have to... (Rather, our small, ambitious team emphasizes optimizing for maintainability over performance.)
In my experience, at least, the vast majority of software development jobs are developing this kind of 'enterprise' infrastructure. The technical details are usually straightforward and the complexity is in meeting the needs of the users and the business, and the amount of processing required is trivial on modern computers to the point where implementation performance almost never matters.
Jobs where you need to do anything tricky or performance-sensitive, where it's worth knowing fancy algorithms and low level implementation details, are few and far between.
In "The Datacenter as a Computer" there was an interesting diagram sketching a datacenter memory hierarchy latency and capacity and bandwidth to access RAM, SSD, and HDD in the same machine, the same rack, and a larger cluster. It's page 72 of https://www.morganclaypool.com/doi/pdf/10.2200/S00874ED3V01Y... and with luck https://books.google.com/books?id=Td51DwAAQBAJ&pg=PA72 will get you to the page. The numbers have changed (commodity SSDs have markedly improved!) but the idea still makes sense.
Fun to consider that boxes under one top-of-rack switch can hold dozens of TB of RAM and hundreds of TB of Flash with latency measured in µs between them. There's legitimately interesting space in between writing code for a single machine and code for huge, spread-out clusters of thousands of boxes.
This gets posted here frequently - I submitted it a year ago - but imo that's not a problem because it's insightful and useful. Should have a (2012) tag though.
I also ran across this version that updates the numbers to 2020 values:
Great idea! I think this is the way to make this more relevant and easier to consume.
Besides, I think the point of the exercise is not to "know" that reading 1MB sequentially from memory is 250K ns but rather that it is worth 50K branch mispredicts. Knowing these numbers won't make you a better engineer, but being able to compare them (and influence your design decisions with it) will.
Though every now and then you get some time-critical tasks, in which case the actual numbers might matter. But I imagine at that scale so many more factors will influence your execution that ballpark estimates are nowhere close to tight. Sure, 1KB over 1Gbps is 10K ns, and you can extrapolate from that how long your 1TB data migration will take. But then it's more worthwhile to check Grafana and reason from the observed bandwidth trend.
Mutex lock/unlock 25ns
Main memory reference 100ns
Honest question: how can this be right? Surely locking a mutex requires synchronising across the cores of a CPU, which requires at least as much time – probably quite a bit more – than an uncached access to memory?
Funnily enough, there is another version of the slides these numbers come from in which mutex lock/unlock is listed as the same 100ns as memory access: http://static.googleusercontent.com/media/research.google.co.... I can't tell which came first, or if one is a typo, or what's going on there.
In any case, assuming the mutex is in cache and does not currently appear to be locked, the core performing the atomic operation needs to: (1) broadcast an invalidate/exclusive hold on the cache line to the other cores, (2) update its own cache, (3) eventually write back to memory. (1) is faster than a main memory access, since CPU cores are closer to each other, physically, than they are to the memory bus, and have hard-wired access specifically for this operation. The slowest part is (3), but with a write-back cache the writer doesn't really pay the latency cost, because it gets amortized into the next cache flush.
Interesting, thanks. The idea of it just hitting the shared cache had occurred to me, but somehow locking a mutex seemed like a complex enough operation that it would surely take more time overall than a full memory access. I'm glad to be corrected about it.
Uncontested mutex lock/unlock is just a swap instruction followed (or in "unlock": preceded) by a memory barrier. The flush pushes data to L3 cache, where it can be shared between multiple cores. (L1 and L2 cache is local to a core).
L3 cache is far closer than main-memory, and roughly 20ns these days.
-------
In "actuality", it's more of a MESI message, but in the abstract it's your L3 caches communicating with each other and synchronizing.
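For a rough picture of why the uncontended path is that cheap, here's a minimal C11 test-and-set spinlock sketch (type and field names are just for the sketch; it's not what a real pthread mutex does internally, which adds futex fallbacks, backoff, etc.). The fast path is exactly the swap plus the acquire/release ordering described above:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_bool held; } spinlock_t;

    void spin_lock(spinlock_t *l) {
        /* Atomic swap: set "held" and observe the previous value in one step.
         * In the uncontended case this is a single read-modify-write on a
         * cache line the core likely already owns, plus acquire ordering --
         * no trip to main memory is needed. */
        while (atomic_exchange_explicit(&l->held, true, memory_order_acquire)) {
            /* Contended case: spin until the current holder releases. */
            while (atomic_load_explicit(&l->held, memory_order_relaxed))
                ;  /* a real lock would pause, yield, or park in the kernel */
        }
    }

    void spin_unlock(spinlock_t *l) {
        /* Release ordering: prior writes become visible before the lock looks
         * free. The dirty line is handed to other cores by the coherence
         * protocol (MESI and friends), not by a synchronous write to RAM. */
        atomic_store_explicit(&l->held, false, memory_order_release);
    }

When the line holding `held` is already in the locking core's cache and nobody else is fighting for it, that exchange never goes anywhere near main memory, which is how a lock/unlock pair can come in under a single main memory reference.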
Hey man! Love your newsletter but many times I am just too lazy or occupied to properly read through them :/
Since you already are on YouTube, my one suggestion to improve the adoption would be to make videos on them. I know it might be super time consuming but imo it will reach more people.
Loading a webpage that is considered "lightning fast" - 1 second
Of course the common opinion expressed here is that people couldn't care less about those numbers - it's hard to get them to care that the website they are working on can't finish rendering in the time it takes to pour a cup of coffee.
It's great if you know them, but you can put "every programmer should know" on anything you use daily and post it. It reads like the clickbait articles titled "N books every programmer should read" that point you at thick algorithms books.
How do you folks think about network bandwidth vs disk throughput? I used to think disk was way slower, but now if we look at AWS EC2 instances, they are within one order of magnitude of each other: