Latency Numbers Every Programmer Should Know (2012) (gist.github.com)
197 points by albertzeyer on Oct 3, 2020 | 68 comments


The % of programmers that actually need to know any of these numbers is tiny. I've done plenty of optimizations for performance-critical systems and have never known any of these numbers. Certainly I know general principles like "disk is a lot slower than memory" and "don't block the UI on a network call". But knowing whether an L1 cache read is 0.5ns or 5ns has never been necessary. You can write plenty of highly optimized code without knowing those numbers. I'm sure there are some people who have to know that stuff, maybe a couple of people on an AAA video game title, or computer engineers at Intel. But they're the exception, not the norm.


> The % of programmers that actually need to know any of these numbers is tiny.

Context is key. This list was originally written by Jeff Dean, as noted in the credits. He's a legendary [1] Google engineer who (usually side-by-side with the under-credited Sanjay Ghemawat) created such systems as MapReduce, GFS, Bigtable, and Spanner. I'd read "numbers every programmer should know" as "numbers every programmer who wants to be like Jeff Dean would benefit from learning about". I can see how folks might interpret it as gatekeeping—if you can't rattle off these numbers exactly from memory, you aren't a real programmer. I don't think that's the spirit in which it's intended.

These numbers are useful in designing efficient systems in Google datacenters, making decisions like "is it better for all of my stateless servers to keep this on local disk or to send an RPC to an in-cluster server that has it in RAM?" or "do I have sufficient deadline to fail over to another cluster in the 99.999%ile case?" You don't really need to have the entire list memorized, but it helps to know the appropriate operations to consider when deciding if something is suitable from efficiency and latency standpoints. It helps to have intuition for the orders of magnitude. It's okay to look at the cheat sheet. You can do math on these numbers to get a surprisingly accurate idea of a design's performance/feasibility long before you work up a prototype, moving more quickly. Comparing these expectations to experiment helps you spot implementation flaws and get closer to achieving the theoretical performance. And if nothing else, being familiar with the ideas here will help you ace a Google system design interview.

Many of these numbers are also useful in other contexts in which performance matters and you have fairly direct control over what expensive operations are performed. (They're less useful, say, in an interpreted programming language, where a lot happens behind your back and you may not care much about performance anyway.) It certainly wouldn't hurt any programmer to know these numbers.

[1] see the "Jeff Dean facts" for an idea of how Google engineers view him: https://www.informatika.bg/jeffdean


If ("if" ;P) #18 is real, I'd love to read that change.


No, those numbers are not useful for designing MapReduce, BigTable, GFS or Spanner.


This is one of the most popular answers on SO. The effects of these latencies can easily become visible in pretty vanilla programming.

https://stackoverflow.com/questions/11227809/why-is-processi...


While I totally agree with the argument that everyone should sometimes think a bit more about the impact of their code (or the execution of it), the relationship between the SO answer you mention and the gist is close to zero.

There's no way to get from "memory latency" to "branch prediction".


'Branch mis-predict' is the second item in the Latency Numbers thing.


Sorry - my bad. Totally overlooked that.


The actual number is not used on SO, and moreover most of those numbers should be obsolete by 2020.


Yes, I sort of meta-agree with this, in that the raw list of numbers regularly appearing at the top of nerd forums tends to generate more confusion than enlightenment. I have a comment downthread about that: https://news.ycombinator.com/item?id=24675039

Maybe the list should link the SO answer and similar descriptions of macro-manifestations of seemingly nano-things!


> The % of programmers that actually need to know any of these numbers is tiny.

If you mean the whole range of values I would agree.

However, I once interviewed a developer for a front-end position who was entirely oblivious to the cost of making an HTTP request, firmly believing that only large downloads had any measurable impact on performance.

Even if you do not have to do back-of-the-napkin calculations on cache latency, knowing the relative cost of each of these operations does wonders for your decision process.


I think having a ballpark intuition for disk/SSD reads and long network packets is possibly useful to a lot of devs. Much beyond that it's academic unless you really need to know.


I think this is inadvertently a good critique of this popular item - it doesn't really tell you why you might want to have some exposure to these numbers.

For a lot of programming, the lowest of these latencies (cache and memory access, branch mispredicts) are averaged out and essentially unnoticeable, or not worth caring too much about. Which is to be expected, since that's the design goal. But it's not too rare, even in regular, high-level-language programming, for them to become 'macroscopic', and that is a useful, practical thing to be aware of, rather than some academic curiosity.


I think you underestimate the domination of large numbers.

Let's say you have code that is 50% main memory and 50% L1 cache. Let's say there are 1000 operations.

You have (500 x 100ns) + (500 x 1ns) or 50500ns total time.

Now let's say you optimize the code so that they are all L3 operations: 20ns x 1000 operations is 20000ns, or over twice as fast.
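To make that back-of-the-envelope style concrete, here's a trivial sketch in C. The per-access latencies are taken from the 2012 gist and are assumptions, not measurements; real hardware will differ:

    #include <stdio.h>

    int main(void) {
        /* Rough per-access latencies from the gist (assumptions, not measurements). */
        const double L1_NS  = 1.0;    /* ~0.5-1 ns */
        const double L3_NS  = 20.0;   /* ~20 ns    */
        const double RAM_NS = 100.0;  /* ~100 ns   */

        const int ops = 1000;
        double before = 500 * RAM_NS + 500 * L1_NS;  /* half main memory, half L1 */
        double after  = ops * L3_NS;                 /* everything hits L3        */

        printf("before: %.0f ns, after: %.0f ns, speedup: %.2fx\n",
               before, after, before / after);
        return 0;
    }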


I'm not sure I'm underestimating anything since I'm not estimating things and we seem to be kind of saying the same thing?

Edit: to clarify a bit further - people read this list and think '1 nanosecond, 5 nanoseconds, not important to me, academic, etc'. My point is that's a misunderstanding but the list alone doesn't do a good job of disabusing one of the misunderstanding.


If you optimize for L1-L3 caches you need to know/control what hardware you are running on, since cache sizes differ between processors, don't you?


Surprisingly: no.

1. Cache-oblivious data-structures work no matter the cache-size. They aren't 100% optimized to a particular cache size, but they are proven to be "efficient" across many different cache sizes.

2. For an example of #1, matrix multiplication is one of the most optimized functions in modern computing. One major step is to transpose the matrix, so that the "vertical" movement turns into a "horizontal" movement. The horizontal movement, which is "address + 1", lets the cache work with your loop, while a vertical movement across the matrix, which is "address + width", is not cache-efficient (see the sketch at the end of this comment).

------------

3. I think it is safe to start making some degree of cache assumptions on modern systems. All caches today have 64-byte cache lines (or greater): ARM, Intel, and AMD CPUs all use 64-byte cache lines. AMD GCN GPUs use 64-byte cache lines. AMD RDNA GPUs use 128-byte cache lines (and Intel is rumored to "pair up" cache lines into 128-byte segments at L3).

This means that every memory operation to main-memory reads or writes at least 64-bytes of data at a time, on all modern systems. DDR4 RAM itself has Burst-Length 8, meaning 64-bytes is literally the smallest unit of data that can be addressed, so it isn't surprising that all CPUs / GPUs are working with such a large number these days.

-----------

Given the 64-byte cache line (or 128-byte, for Intel's L3 cache or AMD RDNA GPUs), "unrolling" your linked lists, using B-trees with larger nodes (instead of the cache-inefficient binary tree), or using d-heaps (instead of binary heaps) plays to the strengths of the cache and helps out a lot.

You don't need to know the specific size of the cache line to know that d-heaps are more efficient than binary heaps on modern, cache-heavy systems. You just need to know that sequential memory access stays within already-loaded cache lines.
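A minimal sketch of that access-pattern difference in C (just summing a matrix, not a full multiplication; N and the timings are illustrative, and you'd compile with something like -O2):

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <time.h>

    #define N 4096

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        static float m[N][N];        /* ~64 MB, far bigger than any cache */
        volatile float sum = 0.0f;
        double t;

        /* Row-major walk: "address + 1"; one 64-byte line serves 16 floats. */
        t = now_sec();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        printf("row-major:    %.3f s\n", now_sec() - t);

        /* Column-major walk: "address + width"; a new cache line on every access. */
        t = now_sec();
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        printf("column-major: %.3f s\n", now_sec() - t);
        return 0;
    }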


Often you can get pretty good results from just knowing that cache misses are expensive and working from that point: store together what is accessed together, use linear arrays to help the prefetcher, prefetch manually if really needed (it usually isn't, and the prefetch instruction isn't free), and if your data is accessed effectively randomly, try packing the used parts into a smaller number of cache lines, etc.
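As a layout illustration of "store together what is accessed together" (the structs and field names here are made up for the example):

    #include <stddef.h>

    /* Mixed layout: every update drags the rarely-used fields into cache too,
       so fewer hot elements fit per cache line. */
    struct particle_mixed {
        float x, y, z;      /* hot: touched every update */
        char  name[48];     /* cold: touched only for debugging */
        float vx, vy, vz;   /* hot */
    };

    /* Hot/cold split, structure-of-arrays flavour: the update loop streams
       through tightly packed hot data; cold data lives elsewhere. */
    struct particles {
        float *x, *y, *z;       /* hot, contiguous */
        float *vx, *vy, *vz;    /* hot, contiguous */
        char (*name)[48];       /* cold, touched rarely */
        size_t count;
    };

    void update(struct particles *p, float dt) {
        for (size_t i = 0; i < p->count; i++) {  /* linear, prefetcher-friendly */
            p->x[i] += p->vx[i] * dt;
            p->y[i] += p->vy[i] * dt;
            p->z[i] += p->vz[i] * dt;
        }
    }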

The latency numbers are still useful for making back of the envelope calculations about how much it is possible to gain and whether it's worth the trouble.


Can you explain the mitigation strategy for randomly accessed data a bit more?


A linked list is a random-memory traversal every time you do "node = node->next".

You can largely mitigate the random memory reads/writes by using unrolled linked lists (https://en.wikipedia.org/wiki/Unrolled_linked_list).

Or, instead of using binary-heaps, use d-heaps: https://en.wikipedia.org/wiki/D-ary_heap
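A bare-bones sketch of the unrolled-list idea in C (field names and the node size are made up; the point is one pointer chase per block instead of per element):

    #include <stddef.h>

    /* Classic list: one pointer chase (likely a cache miss) per element. */
    struct node {
        int value;
        struct node *next;
    };

    /* Unrolled list: one pointer chase per up to 13 elements; within a node
       the reads are sequential. 13 ints + count + next = 64 bytes on a
       64-bit system, i.e. roughly one cache line. */
    struct unrolled_node {
        int count;                  /* elements used in this node */
        int values[13];
        struct unrolled_node *next;
    };

    long unrolled_sum(const struct unrolled_node *n) {
        long total = 0;
        for (; n != NULL; n = n->next)          /* the only "random" hops */
            for (int i = 0; i < n->count; i++)  /* sequential within a node */
                total += n->values[i];
        return total;
    }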


Well, it doesn't work if the access is so random that everything is a guaranteed cache miss. On the other hand, if there's a subset that's accessed more often than the rest, it reduces cache pollution from accesses outside that core subset.


If it's this random it might be worth stepping back and trying to figure out how to get the portion of the results that are desired using a more effective mechanism.

Sometimes the source data structure is just a bad fit for the expressed data. I've seen this often in combinatorial expansions of data stored in a hierarchical structure into a 2D table.

Sometimes all, or a majority, of the data is going to be touched anyway, so it's more costly to avoid a full read, and better to make that read run in parallel: either by sorting into a more suitable structure or by working on streams / permutations / parallel processes from a dispatch design.


It's also good to know the relative costs of memory systems. My rule of thumb, based on market prices:

RAM:SSD:Magnetic - 100:10:1

For example 10 terabytes of RAM costs as much as 1 petabyte of spinning disk magnetic storage.


I work on a system for remote sensing and automated marine mammal detection. I may not need contingencies for - but I have reckoned with - the full gamut of time deltas from seconds to microseconds. Maybe I've not had to deal with L1 cache misses, but at times memory fetch vs in-pipeline has come up when it comes to image processing choices.

I'm kind of in an odd niche, so maybe your last sentiment applies. But it strikes me as very un-hacker-mindset to be like, "no one _needs_ to know X". It's about curiosity.


I work in food ordering and our volume is big enough now that understanding CPU caches is definitely important for our oft-hit endpoints.

There are also domains like realtime control systems, video conferencing, image processing, devices where battery life is at a premium, AI, video editing, decoding, encoding, stock market trading...


You may not feel you need to know them, but I use them every day, and I will not hire anybody who doesn't know them.


Surely the point is not to "know the numbers", but to understand the mechanics, the relative positions of each piece, and their relative efficiencies in system architecture.

When I'm doing optimisation, my goal is not to explicitly say "oh, L1 cache is X nanoseconds"; it's to say "the profile pattern of the performance we see indicates that X is the current bottleneck, and if we change the programming/design, we'll gain Y". Especially since, by the time we're optimising heavily, it generally means something was amiss in the original coding or in our expectations of the system anyway, so your memorised numbers of what performance should be are often shown to be wrong in the first place, and your real-world measurements will tell you the numbers that are relevant to your particular use case.


The point is not to memorize the numbers, but to have used them enough that they are at your fingertips at all times. Awareness of costs is essential to engineering economics.

And it is always better not to need to optimize, because the code is already fast enough. You may leap up to proclaim against premature optimization, but only if you have forgotten what Knuth actually wrote.


“Write everything in Python”?


What kind of software do you work on?


That was the right question. Fintech. Previously, data communications. Those numbers are etched into my cranium.

There is lots of programming that is not performance-sensitive, but I don't do it, or hire for it.


Having a solid mental model for “how fast is fast” is, in my opinion, critical to what makes an excellent engineer: knowing when to care about performance up front.

And not even big O notation of algorithms or IO latency. But just a general feel for the performance of higher level abstractions. To look at a design that involves some input, data processing, a transfer somewhere, rendering, presentation, or whatever, and to instantly have an intuition on what parts to worry most about.


I would much prefer people know the scale of these numbers than know big O notation. The number of times I've seen big O used as a hammer because everything looks like a nail can be frustrating.

This is doubly true on embedded systems where N > 3000 tends to be the less common case. Most of the time when I do performance profiles it's a 'flat profile' because the majority of the time is lost thrashing the L1/L2 cache.


When I get onto an unfamiliar platform I do a test of sequential search vs binary search to see where the crossover point is on number of elements for small arrays. Then I know roughly when not to bother with a better algorithm.
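For what it's worth, a rough sketch of that kind of crossover test in C (array sizes, repetition counts, and the query pattern are arbitrary; measure on your own platform and compile with optimizations):

    #include <stdio.h>
    #include <time.h>

    /* Linear scan: simple, predictable branches, sequential memory access. */
    static int linear_search(const int *a, int n, int key) {
        for (int i = 0; i < n; i++)
            if (a[i] == key) return i;
        return -1;
    }

    /* Binary search: fewer comparisons, but unpredictable branches and jumps. */
    static int binary_search(const int *a, int n, int key) {
        int lo = 0, hi = n - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] == key) return mid;
            if (a[mid] < key)  lo = mid + 1;
            else               hi = mid - 1;
        }
        return -1;
    }

    int main(void) {
        enum { MAX_N = 512, REPS = 200000 };
        static int a[MAX_N];
        for (int i = 0; i < MAX_N; i++) a[i] = 2 * i;   /* sorted data */

        for (int n = 4; n <= MAX_N; n *= 2) {
            volatile int sink = 0;
            clock_t t0 = clock();
            for (int r = 0; r < REPS; r++)
                sink += linear_search(a, n, 2 * (r % n));
            clock_t t1 = clock();
            for (int r = 0; r < REPS; r++)
                sink += binary_search(a, n, 2 * (r % n));
            clock_t t2 = clock();
            printf("n=%4d  linear=%.0fus  binary=%.0fus\n", n,
                   (t1 - t0) * 1e6 / CLOCKS_PER_SEC,
                   (t2 - t1) * 1e6 / CLOCKS_PER_SEC);
        }
        return 0;
    }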


"How fast is fast" can be known, but how well someone weighs tradeoffs is a better way to test an engineer.


This was just posted and already there are two types of comments:

1. Most devs don’t need this, it’s not so helpful to know etc.

2. These are critical numbers to know and in the very least devs should know these numbers.

This sort of disagreement is common in our industry; it's not just this, it's also Big O, algorithms and data structures, and even OS fundamentals that people disagree on.

I’d love to take both groups of devs commenting and give each group a set of programming tasks to complete. Judge them based on correctness, speed of development, speed the tasks run, the quality of the code etc, etc.

I think the results would be profound.


Isn't that simply because our jobs are so varied? My team implements what I would call "business logic" for a (relative to Google scale) small number of users. We often frown at algorithmic coding exercises, like what we believe Google would ask in interviews, using the argument: we suppose that makes sense if you have to optimize compression of YouTube's video streams, which we don't have to... (Rather, our small, ambitious team emphasizes optimizing for maintainability over performance.)


In my experience, at least, the vast majority of software development jobs are developing this kind of 'enterprise' infrastructure. The technical details are usually straightforward and the complexity is in meeting the needs of the users and the business, and the amount of processing required is trivial on modern computers to the point where implementation performance almost never matters.

Jobs where you need to do anything tricky or performance-sensitive, where it's worth knowing fancy algorithms and low level implementation details, are few and far between.


You'd then also have to give each dev a task representative of the kind of work they do on a daily basis.

I think you'd probably find that people are good at what they spend a lot of time doing and tend to brush off the things they don't as not important.


Alternatively, find successful startups and find who coded their MVP or whatever they started with, and see which type of dev has more success.

I'm guessing the results won't be as lopsided and "profound."

Correctness, quality, speed are not always defining metrics (and sometimes they definitely are!).


In "The Datacenter as a Computer" there was an interesting diagram sketching a datacenter memory hierarchy latency and capacity and bandwidth to access RAM, SSD, and HDD in the same machine, the same rack, and a larger cluster. It's page 72 of https://www.morganclaypool.com/doi/pdf/10.2200/S00874ED3V01Y... and with luck https://books.google.com/books?id=Td51DwAAQBAJ&pg=PA72 will get you to the page. The numbers have changed (commodity SSDs have markedly improved!) but the idea still makes sense.

Fun to consider that boxes under one top-of-rack switch can hold dozens of TB of RAM and hundreds of TB of Flash with latency measured in µs between them. There's legitimately interesting space in between writing code for a single machine and code for huge, spread-out clusters of thousands of boxes.


Consumer latency caused by cookie syncs

https://s6.io/consumer-latency-caused-by-cookie-syncs/

Latency in Digital Advertising: A Guide for Publishers

https://blog.ad-juster.com/latency-in-digital-advertising-a-...

Case Study: Ads Increase Page Load Times by 40 Percent

https://rigor.com/blog/ads-increase-latency-by-40-percent/


This gets posted here frequently - I submitted it a year ago - but imo that's not a problem because it's insightful and useful. Should have a (2012) tag though.

I also ran across this version that updates the numbers to 2020 values:

https://colin-scott.github.io/personal_website/research/inte...

edit: it appears they're just estimating for a given time above, not measuring...


I loved the latency numbers posted by Nick Craver, one of Stack Overflow's engineers: https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we...

    L1: 1.3ns
    L2: 3.92ns (3x slower)
    L3: 11.11ns (8.5x slower)
    DDR4 RAM: 100ns (77x slower)
    NVMe SSD: 120,000ns (92,307x slower)
    SATA/SAS SSD: 400,000ns (307,692x slower)
    Rotational HDD: 2–6ms (1,538,461x slower)
    Microsoft Live Login: 12 redirects and 5s (3,846,153,846x slower, approximately)

“How many Microsoft Live Logins before this thing loads?”


Great idea! I think this is the way to make this more relevant and easier to consume.

Besides, I think the point of the exercise is not to "know" that reading 1MB sequentially from memory is 250K ns, but rather that it is worth 50K branch mispredicts. Knowing these numbers won't make you a better engineer, but being able to compare them (and influence your design decisions with them) will.

Though every now and then you get some time-critical tasks, in which case the actual numbers might matter. But I imagine at that scale so many more factors will influence your execution that ballpark estimates are nowhere close to tight. Sure, 1KB over 1Gbps is 10K ns, and you can extrapolate from that how long your 1TB data migration will take. But then it's more worthwhile to check Grafana and reason from the bandwidth trend.



    Mutex lock/unlock       25ns
    Main memory reference  100ns
Honest question: how can this be right? Surely locking a mutex requires synchronising across the cores of a CPU, which requires at least as much time – probably quite a bit more – than an uncached access to memory?


Funnily enough, there is another version of the slides these numbers come from in which mutex lock/unlock is listed as the same 100ns as memory access: http://static.googleusercontent.com/media/research.google.co.... I can't tell which came first, or if one is a typo, or what's going on there.

In any case, assuming the mutex in cache, and appears to not currently be locked, the core performing the atomic operation needs to: (1) broadcast an invalidate/exclusive hold on the cache line to the other cores, (2) update its own cache, (3) eventually write back to memory. (1) is faster than main memory access, since CPU cores are closer to each other, physically, than they are to the memory bus, and have hard-wired access specifically for this operation. The slowest part is (3), but with a write-back cache the writer doesn't really pay the latency cost, because it gets amortized into the next cache flush.
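The uncontended case is also easy to measure yourself. A rough single-threaded sketch (so no contention at all; compile with -pthread, and expect the number to vary by CPU and libc):

    #define _POSIX_C_SOURCE 200809L
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        enum { ITERS = 10 * 1000 * 1000 };
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&m);    /* uncontended: stays in this core's cache */
            pthread_mutex_unlock(&m);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per lock/unlock pair\n", ns / ITERS);
        return 0;
    }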


Interesting, thanks. The idea of it just hitting the shared cache had occurred to me, but somehow locking a mutex seemed like a complex enough operation that it would surely take more time overall than a full memory access. I'm glad to be corrected about it.


Mutex lock/unlock is commonly over L3 cache on modern systems. It won't hit main-memory.


> Surely locking a mutex requires synchronising across the cores of a CPU

Right, it's done over cache coherence, which doesn't need to reach all the way out to main memory. And that's only if it's contested.


A contested mutex lock / unlock is over L3 cache.

Uncontested mutex lock/unlock is just a swap instruction followed (or in "unlock": preceded) by a memory barrier. The flush pushes data to L3 cache, where it can be shared between multiple cores. (L1 and L2 cache is local to a core).

L3 cache is far closer than main-memory, and roughly 20ns these days.

-------

In "actuality", its more of a MESI message, but in the abstract its your L3 caches communicating with each other and synchronizing.


Not sure if you're agreeing with me or contradicting me? Isn't that what I said?


Reminds me of Dan Luu's various latency articles.

https://danluu.com/input-lag/


More numbers with accompanying code: https://github.com/sirupsen/napkin-math


Hey man! Love your newsletter, but many times I'm just too lazy or occupied to properly read through it :/

Since you're already on YouTube, my one suggestion to improve adoption would be to make videos on them. I know it might be super time consuming, but imo it would reach more people.




comments>2 might be a good default qualifier for the 'past' link.


Loading a webpage that is considered "lightning fast" - 1 second

Of course the common opinion expressed here is that people couldn't care less about those numbers - it's hard to get them to care that the website they are working on can't finish rendering in the time it takes someone to pour a cup of coffee.


It's great if you know them, but you can slap "every programmer should know" on anything you use daily and post it. It reads like the clickbait articles "N books every programmer should read", followed by the usual list of thick algorithms books.


How do you folks think about network bandwidth vs disk throughput? I used to think disk was way slower, but now if we look at AWS EC2 instances, they are within one order of magnitude of each other:

Network: EC2 instances (except tiny ones) 1-2 Gbit/s baseline, 10 Gbit/s burst

Disk: EBS volumes, general purpose, throughput 128 MiB/s - 250 MiB/s or ~2 Gbit/s (1,000 MiB/s for io-optimized)

So the top of the baseline network bandwidth is the same as the top of the disk throughput, or am I wrong?


I found this article from coding horror about the same thing interesting - "The Infinite Space Between Worlds". https://blog.codinghorror.com/the-infinite-space-between-wor...


This reminds me of one of the few good r/ProgrammerHumor posts from this year: https://old.reddit.com/r/ProgrammerHumor/comments/g4cr9m/my_...



Do we need an update?

    Latency between AWS AZs (2-3 ms)
    Latency to NVMe
    Latency to NAS vs directly attached (cloud options)
    Latency to external RAM (RAMCloud?)


With C programming, can we access the L1/L2 cache via variables?


The link rot on that page frustrates me :/



