
The % of programmers that actually need to know any of these numbers is tiny. I've done plenty of optimizations for performance-critical systems and have never known any of these numbers. Certainly, I know general principles like "disk is a lot slower than memory" and "don't block the UI on a network call". But knowing whether an L1 cache read is 0.5ns or 5ns has never been necessary. So much of the "highly optimized" code out there was written without those numbers. I'm sure there are some people that have to know that stuff, maybe a couple of people on an AAA video game title, or computer engineers at Intel. But it's the exception and not the norm.


> The % of programmers that actually need to know any of these numbers is tiny.

Context is key. This list was originally written by Jeff Dean, as noted in the credits. He's a legendary [1] Google engineer who (usually side-by-side with the under-credited Sanjay Ghemawat) created such systems as MapReduce, GFS, Bigtable, and Spanner. I'd read "numbers every programmer should know" as "numbers every programmer who wants to be like Jeff Dean would benefit from learning about". I can see how folks might interpret it as gatekeeping—if you can't rattle off these numbers exactly from memory, you aren't a real programmer. I don't think that's the spirit in which it's intended.

These numbers are useful in designing efficient systems in Google datacenters, making decisions like "is it better for all of my stateless servers to keep this on local disk or to send an RPC to an in-cluster server that has it in RAM?" or "do I have sufficient deadline to fail over to another cluster in the 99.999%ile case?" You don't really need to have the entire list memorized, but it helps to know the appropriate operations to consider when deciding if something is suitable from efficiency and latency standpoints. It helps to have intuition for the orders of magnitude. It's okay to look at the cheat sheet. You can do math on these numbers to get a surprisingly accurate idea of a design's performance/feasibility long before you work up a prototype, which lets you move more quickly. Comparing these expectations to experiment helps you spot implementation flaws and get closer to achieving the theoretical performance. And if nothing else, being familiar with the ideas here will help you ace a Google system design interview.
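To make that concrete, here's the kind of back-of-the-envelope math the list enables, using the classic figures (treat the exact values as illustrative, and assume a 1 MB blob and a 1 Gbps link):

    local spinning disk:   ~10 ms seek + ~20 ms sequential read of 1 MB        ~= 30 ms
    RPC to in-cluster RAM: ~0.5 ms round trip + ~0.25 ms memory read of 1 MB
                           + ~10 ms to move 1 MB over a 1 Gbps link            ~= 11 ms

So even a naive estimate says the RPC wins by roughly 3x (more on a 10 Gbps network), while a local SSD (~1 ms for a 1 MB sequential read) flips the answer again. That's the level of precision these numbers are meant to support.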

Many of these numbers are also useful in other contexts in which performance matters and you have fairly direct control over what expensive operations are performed. (They're less useful, say, in an interpreted programming language where much happens behind your back and you may not care much about performance anyway.) It certainly wouldn't hurt any programmer to know these numbers.

[1] see the "Jeff Dean facts" for an idea of how Google engineers view him: https://www.informatika.bg/jeffdean


If ("if" ;P) #18 is real, I'd love to read that change.


No, those numbers are not useful for designing MapReduce, BigTable, GFS or Spanner.


This is one of the most popular answers on SO. The effects of these latencies can easily become visible in pretty vanilla programming.

https://stackoverflow.com/questions/11227809/why-is-processi...


While I totally agree with the argument that everyone should sometimes think a bit more about the impact of their code (or the execution of it), the relationship between the SO article you mentioned and the gist is close to zero.

There's no way to conclude "branch prediction" from "memory latency".


'Branch mis-predict' is the second item in the Latency Numbers thing.


Sorry - my bad. Totally overlooked that.


The actual number is not used in the SO answer, and moreover most of those numbers should be obsolete in 2020.


Yes, I sort of meta-agree with this, in that the raw list of numbers regularly appearing at the top of nerd forums tends to generate more confusion than enlightenment. I have a comment downthread about that: https://news.ycombinator.com/item?id=24675039

Maybe the list should link the SO answer and similar descriptions of macro-manifestations of seemingly nano-things!


> The % of programmers that actually need to know any of these numbers is tiny.

If you mean the whole range of values I would agree.

However, I once interviewed a developer for a front-end position who was entirely oblivious to the cost of making an HTTP request, firmly believing that only large downloads had any measurable impact on performance.

Even if you do not have to do back-of-the-napkin calculations on cache latency, knowing the relative cost of each of these operations does wonders for your decision process.


I think having a ballpark intuition for disk/SSD reads and long network packets is possibly useful to a lot of devs. Much beyond that it's academic unless you really need to know.


I think this is inadvertently a good critique of this popular item - it doesn't really tell you why you might want to have some exposure to these numbers.

For a lot of programming, the lowest of these latencies (cache and memory access, branch mispredicts) are averaged out and essentially unnoticeable or not worth caring too much about. Which is to be expected, being the design goal. But it's not too rare, even in regular, high-level language programming, for them to become 'macroscopic', and that is a useful, practical thing to be aware of, rather than some academic curiosity.
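A minimal sketch of what "macroscopic" means here (plain C; the array size is an arbitrary assumption, just big enough to spill out of cache):

    /* Sum the same data twice: once in order, once in a shuffled order.
     * The arithmetic is identical; the wall-clock gap is almost entirely
     * cache misses and memory latency showing up at the macro level. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 24)                 /* 16M ints = 64 MB, bigger than L3 */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        int    *data  = malloc(sizeof *data  * N);
        size_t *order = malloc(sizeof *order * N);
        for (size_t i = 0; i < N; i++) { data[i] = (int)i; order[i] = i; }

        /* Fisher-Yates shuffle of the visit order */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }

        long long sum = 0;
        double t0 = now_sec();
        for (size_t i = 0; i < N; i++) sum += data[i];          /* sequential */
        double t1 = now_sec();
        for (size_t i = 0; i < N; i++) sum += data[order[i]];   /* random     */
        double t2 = now_sec();

        printf("sequential %.3fs  random %.3fs  (sum=%lld)\n",
               t1 - t0, t2 - t1, sum);
        return 0;
    }

On a typical desktop the random pass is several times slower, sometimes an order of magnitude, purely because of where the data happens to be when it's asked for.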


I think you underestimate the domination of large numbers.

Let's say you have code that is 50% main memory and 50% L1 cache. Let's say there are 1000 operations.

You have (500 x 100ns) + (500 x 1ns) or 50500ns total time.

Now let's say you optimize the code so that they are all L3 operations: 20ns x 1000 operations is 20000ns, or over twice as fast.


I'm not sure I'm underestimating anything since I'm not estimating things and we seem to be kind of saying the same thing?

Edit: to clarify a bit further - people read this list and think '1 nanosecond, 5 nanoseconds, not important to me, academic, etc'. My point is that's a misunderstanding but the list alone doesn't do a good job of disabusing one of the misunderstanding.


If you optimize for L1-L3 caches, you need to know/control what hardware you are running on, since cache sizes differ between processors, don't you?


Surprisingly: no.

1. Cache-oblivious data-structures work no matter the cache-size. They aren't 100% optimized to a particular cache size, but they are proven to be "efficient" across many different cache sizes.

2. For an example of #1, matrix multiplication is one of the most optimized functions of modern computing. One major step is to transpose the matrix, so that the "vertical" movement turns into a "horizontal" movement. The horizontal movement, which is "address + 1", allows the cache to work with your loop, while a vertical movement across the matrix is "address + width", and not cache-efficient. (There's a sketch of this at the end of this comment.)

------------

3. I think it is safe to start making some degree of cache assumptions on modern systems. All caches today have 64-byte cache lines (or greater): ARM, Intel, and AMD CPUs all use 64-byte cache lines. AMD GCN GPUs use 64-byte cache lines. AMD RDNA GPUs use 128-byte cache lines (and Intel is rumored to "pair up" cache lines into 128-byte segments at L3).

This means that every memory operation to main memory reads or writes at least 64 bytes of data at a time, on all modern systems. DDR4 RAM itself has a burst length of 8 (8 transfers on a 64-bit bus = 64 bytes), meaning 64 bytes is effectively the smallest unit of data that can be transferred, so it isn't surprising that all CPUs / GPUs are working with such a large number these days.

-----------

Given the 64-byte cache line (or 128-byte, for Intel's L3 cache or AMD RDNA GPUs), "unrolling" your linked lists, using B-trees of larger node sizes (instead of the cache-inefficient binary tree), or using d-heaps (instead of binary heaps) plays to the strengths of caches and helps out a lot.

You don't need to know the specific size of the cache line to know that d-heaps are more efficient than binary-heaps on modern, cache-heavy systems. You just need to know that sequential-memory access stays within the L1 cache.
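Here's the sketch promised in #2, in plain C (illustrative only, not a tuned kernel; a real BLAS library also blocks for cache size and vectorizes):

    /* Naive version: the inner loop walks B down a column ("address + width"),
     * touching a new cache line on every iteration. */
    void matmul_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double acc = 0.0;
                for (int k = 0; k < n; k++)
                    acc += A[i*n + k] * B[k*n + j];    /* B: stride n */
                C[i*n + j] = acc;
            }
    }

    /* Transposed version: pay for one transpose up front, then both inner
     * accesses are "address + 1", so every 64-byte line fetched is fully used. */
    void matmul_transposed(int n, const double *A, const double *B, double *C,
                           double *Bt /* caller-provided n*n scratch */) {
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                Bt[j*n + k] = B[k*n + j];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double acc = 0.0;
                for (int k = 0; k < n; k++)
                    acc += A[i*n + k] * Bt[j*n + k];   /* both: stride 1 */
                C[i*n + j] = acc;
            }
    }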


Often you can get pretty good results from just knowing that cache misses are expensive and working from that point: Store together what is accessed together, use linear arrays to help the prefetcher, prefetch manually if really needed (it usually isn't, and the prefetch instruction isn't free), and if your data is accessed effectively randomly, try packing the used parts into a smaller number of cache lines, etc...
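"Store together what is accessed together" often just means splitting hot and cold fields, or going from an array of structs to parallel arrays. A sketch (field names made up for illustration):

    #include <stddef.h>

    /* Before: each entity is ~136 bytes, so a position update drags about
     * three cache lines per entity through the cache even though it only
     * needs 24 bytes of them. */
    struct entity_fat {
        double x, y, z;          /* hot: touched every tick          */
        char   name[64];         /* cold: touched on UI events only  */
        char   description[48];
    };

    /* After: hot data lives in contiguous parallel arrays, so one 64-byte
     * line of x[] holds eight entities' worth, and the hardware prefetcher
     * sees a plain linear scan. */
    struct positions {
        double *x, *y, *z;       /* parallel arrays of length n */
    };

    void integrate(struct positions *p, size_t n, double dx, double dy, double dz) {
        for (size_t i = 0; i < n; i++) {
            p->x[i] += dx;
            p->y[i] += dy;
            p->z[i] += dz;
        }
    }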

The latency numbers are still useful for making back of the envelope calculations about how much it is possible to gain and whether it's worth the trouble.


Can you explain the cache-mitigation strategy for randomly accessed data a bit more?


A linked list is a random-memory traversal every time you do "node = node->next".

You can grossly mitigate the random-memory read/write by using unrolled linked lists (https://en.wikipedia.org/wiki/Unrolled_linked_list).

Or, instead of using binary-heaps, use d-heaps: https://en.wikipedia.org/wiki/D-ary_heap
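To make it concrete, a minimal sketch of the unrolled variant in C (the chunk size is an arbitrary choice, picked so the node is 64 bytes on a typical LP64 system):

    #include <stddef.h>

    /* Classic list: one element per node, so every "node = node->next" is a
     * dependent load that usually misses cache. */
    struct node {
        int value;
        struct node *next;
    };

    /* Unrolled list: each node carries a small array, so one pointer hop
     * pulls in 13 values sitting on the same cache line (when aligned). */
    #define CHUNK 13
    struct unode {
        int count;                    /* slots of values[] in use          */
        int values[CHUNK];
        struct unode *next;           /* total: 4 + 52 + 8 = 64 bytes      */
    };

    long sum_unrolled(const struct unode *n) {
        long s = 0;
        for (; n != NULL; n = n->next)          /* ~one miss per node...      */
            for (int i = 0; i < n->count; i++)
                s += n->values[i];              /* ...amortized over 13 reads */
        return s;
    }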


Well, it doesn't work if the access is so random that everything is a guaranteed cache miss. On the other hand, if there's a subset that's accessed more often than the rest, it reduces cache pollution from accessing outside the core subset.


If it's this random it might be worth stepping back and trying to figure out how to get the portion of the results that are desired using a more effective mechanism.

Sometimes the source data structure is just a bad fit for the expressed data. I've seen this often in combinatorial expansions of data stored in a hierarchical structure into a 2D table.

Sometimes all, or a majority, of the data is going to be touched anyway, so it's more costly to avoid a full read; instead it's more optimal to parallelize the work done on that read, either by sorting into a more suitable structure or by working on streams / permutations / parallel processes from a dispatch design.


It's also good to know the relative costs of memory systems. My rule of thumb, based on market prices:

RAM:SSD:Magnetic - 100:10:1

For example 10 terabytes of RAM costs as much as 1 petabyte of spinning disk magnetic storage.


I work on a system for remote sensing and automated marine mammal detection. I may not need contingencies for - but I have reckoned with - the full gamut of time deltas from seconds to microseconds. Maybe I've not had to deal with L1 cache misses, but at times memory fetch vs in-pipeline has come up when it comes to image processing choices.

I'm kind of in an odd niche, so maybe your last sentiment applies. But it strikes me as very un-hacker-mindset to be like, "no one _needs_ to know X". It's curiosity.


I work in food ordering, and our volume is big enough now that understanding CPU caches is definitely important for our oft-hit endpoints.

There are also domains like realtime control systems, video conferencing, image processing, devices where battery life is at a premium, AI, video editing, decoding, encoding, stock market trading...


You may not feel you need to know them, but I use them every day, and I will not hire anybody who doesn't know them.


Surely the point is not to "know the numbers", but to understand the mechanics and the relative position of each piece, and their relative efficiencies, in the system architecture.

When I'm doing optimisation, my goal is not to explicitly say "oh, L1 cache is X nanoseconds"; it's to say "the profile pattern of the performance we see indicates that X is the current bottleneck, and if we change the programming/design, we'll gain Y". Especially since, by the time we're optimising heavily, it generally means something was amiss in your original coding or expectations of the system anyway, so your memorised numbers of what performance should be are often shown to be wrong in the first place, and your real-world measurements will tell you the numbers that are relevant to your particular use case.


The point is not to memorize the numbers, but to have used them enough that they are at your fingertips at all times. Awareness of costs is essential to engineering economics.

And it is always better not to need to optimize, because the code is already fast enough. You may leap up to proclaim against premature optimization, but only if you have forgotten what Knuth actually wrote.


“Write everything in Python”?


What kind of software do you work on?


That was the right question. Fintech. Previously, data communications. Those numbers are etched into my cranium.

There is lots of programming that is not performance-sensitive, but I don't do it, or hire for it.



