Where Linux's load average comes from in the kernel

tanelpoder · on April 18, 2022

When I was looking into this, I found that just running ps (with -L option to see all threads, not just processes/thread group leaders) with some grep/sort/uniq was the easiest way to break down where does the “too high Linux system load” come from. No need to compile C code or have root access to drill down into load. And you could drill down further by sampling some additional /proc/PID/task/TID/ files, like “syscall”, “stack” to see which (blocked) syscall is contributing to the load and where in the kernel is it stuck. Knowing what kind of process/thread level /proc files are available and reading/sampling them with a shell one-liner is a powerful entry point for performance troubleshooting and may allow you to delay writing advanced kernel tracing scripts further.

For those that are interested:

https://tanelpoder.com/posts/high-system-load-low-cpu-utiliz...

anonymousDan · on April 18, 2022

Interesting link thanks. The part I always find tricky with kernel debugging is distinguishing normal vs abnormal behaviour (e.g. how many kworker threads is too many)?

tanelpoder · on April 19, 2022

Yep, one thing that helps in this context is to only look into threads in R or D state (and ignore the hundreds or sometimes thousands of kernel threads that are just sleeping or waiting for more work to do in idle (I) state).

But yeah, even that doesn't clearly tell you what amount of active threads is normal for your workload/configuration. Or you can just save the thread count info (grouped by thread state) and save/graph it over time, to have a better idea of what's normal. The difference of this approach vs just graphing Linux-reported load is that you'll get a breakdown of how much of it was CPU demand vs sync I/O demand. Or you could even break the the thread states down further, by syscall and WCHAN, for example. This will give you a much more detailed idea of why the load is high, not just that it is high.

(Like I mentioned in the article), I even have a tool for sampling /proc at regular intervals (1 Hz by default) and saving it into hourly CSV files, so you can easily go "back in time" and zoom into short spikes and see what was going on:

  $ xcapture
  
  0xTools xcapture v1.0 by Tanel Poder [https://0x.tools]
  
  Sampling /proc...
  
  DATE       TIME             PID     TID USERNAME   ST COMMAND        SYSCALL    WCHAN                    
  2020-10-17 12:01:50.583    6404    7524 mysql      R  (mysqld)       fsync      wait_on_page_bit          
  2020-10-17 12:01:50.583    6404    8944 mysql      D  (mysqld)       fsync      wait_on_page_bit          
  2020-10-17 12:01:50.583    6404    8946 mysql      D  (mysqld)       fsync      wait_on_page_bit          
  2020-10-17 12:01:50.583    6404   76046 mysql      D  (mysqld)       fsync      wait_on_page_bit          
  2020-10-17 12:01:50.583    6404   76811 mysql      D  (mysqld)       fdatasync  xfs_log_force_lsn         
  2020-10-17 12:01:50.583    6404   76815 mysql      D  (mysqld)       fsync      blkdev_issue_flush        
  2020-10-17 12:01:50.583    8803    8803 root       R  (md10_resync)  [running]  0                         

  DATE       TIME             PID     TID USERNAME   ST COMMAND        SYSCALL    WCHAN                    
  2020-10-17 12:01:51.623    6404    7521 mysql      D  (mysqld)       pwrite64   xfs_file_buffered_aio_write 
  2020-10-17 12:01:51.623    6404    7524 mysql      D  (mysqld)       fsync      xfs_log_force_lsn         
  ...

belter · on April 18, 2022

From the code comments...

/*

* kernel/sched/loadavg.c

* This file contains the magic bits required to compute the global loadavg

* figure. Its a silly number but people think its important. We go through

* great pains to make it work on big machines and tickless kernels.

*/

R0b0t1 · on April 18, 2022

The comment really undersells it. The fact that it is a regular number, despite not being strongly tied to something real, does make it useful. Even when you have a lot of IO wait causing an extremely large number you are being told something useful.

the8472 · on April 18, 2022

There is a newer alternative, pressure stall information[0], split by CPU, memory and IO pressure. htop can display the rolling PSI averages and one can also see the total time during which some tasks were stalled. They're also available per cgroup.

[0] https://lwn.net/Articles/759781/

tanelpoder · on April 19, 2022

By the way, PSI has the same shortcoming of not accounting the time of threads waiting for async I/O completion as I/O pressure.

zaarn · on April 19, 2022

Yes, but async I/O means the process can continue doing other, potentially useful, work, while waiting for async I/O to complete. So if a process has issues an async I/O, it's not blocked or dead, like a process who issued sync I/O.

And on the topside, if you block on the io_submit and friends, async I/O counts towards the PSI and load average again. Ie, when you can't do useful work because the device queue is full, for example.

So it's not as much of a shortcoming as you think, because if the system is I/O exhausted, async I/O processes will quickly contribute to load as well once the I/O queues are nearly full or full.

tanelpoder · on April 19, 2022

The reason why I'm aware of this is that Oracle databases use libaio for some reads and most of the writes on Linux, including transaction log writes. There are plenty of cases where you don't exhaust the block device I/O queues - io_submit() finishes quickly without blocking, but then the I/O issuing process wants to wait for the asynchronously submitted I/O completion, before moving on. It will use io_getevents() with the "timeout > 0" argument, so it will sleep (in S mode), waiting for I/O completion. If it had submitted a synchronous I/O, it would have increased Linux load, but if configured to use libaio, it won't.

So, the I/O component of Linux system load and PSI do not include the async I/O waiting threads that decide to synchronoysly wait for asynchronously submitted I/O requests :-)

I see MySQL is using libaio in some places too (the io_submit/io_getevents show up in syscalls), but I haven't looked deeper into whether the getevents are "willing to wait" or not. But any application using libaio, that at some point uses io_getevents() in "willing to wait" mode, would be affected by this discrepancy.

zaarn · on April 19, 2022

I still think that is a reasonable caveat; the process didn't get hung up waiting for IO because IO was exhausted when it issues the write, rather it got hung because it deliberately chose to wait for IO to be completed. That isn't a load factor, it was the application doing it. Ideally Oracle DB (and MySQL) should track this and offer a performance metric somewhere (It's systemd service for example) of how long it's waiting for AIO completion.

The crucial point there is that while the application is slow, the system remains as reactive as the PSI number indicates. Just the application didn't.

tanelpoder · on April 19, 2022

Yep I do agree that it's a reasonable caveat and your point that async I/O allows the app to move on and do other stuff (in whatever thread state) and come back later.

But I'm talking about the cases where the app is done with the other work and needs to ensure that the previous write is persisted (for example to a write ahead log), before moving on? It will have to deliberately wait for I/O completion now, thus will run io_getevents() with the "willing to wait" mode enabled. I see this in Oracle database world all the time and MySQL world occasionally too:

- Not seeing a high number of threads in D state or PSI IO figure does not mean that there are no threads waiting for I/O

- Seeing a high number of threads in D state and high PSI IO figure means that there are threads waiting for I/O (but possibly even more, if you'd look into the ones in S state, but in io_getevents syscall, with WCHAN=read_events.

zaarn · on April 19, 2022

Applications should account for it and measure it themselves, when an application chooses to wait (to persist writes, for example), then it's hard to argue the kernel should take that as load. A similar argument could otherwise be made if an app is waiting on a futex or other resource. I don't think an fsync counts either. The solution there, IMO, is to use IO fences. Tell the OS that async writes and reads may not reorder beyond point X in time for this thread (or all threads, optionally for all FDs or just a specific one).

Then your app, like Oracle, simply issues a fence and can be done with it. If the app crashes before the fence is persisted, then it must be able to resume work from a previous one (a simple example would be that between WAL checkpoints, a fence is issued). The app won't have to worry that only some specific writes completed vs some others not beyond what the fence permitted. Additionally a good mechanism might be a call to wait for a fence to be persisted to disk.

Simple example for the WAL use case:

  1. Write new transaction to WAL
  2. Issue an IO fence for the WAL file
  3. Write new data to database file
  4. Wait for Fence from 2
  5. Return success

This is roughly equivalent to the synchronous example:

  1. Write new transaction to WAL file
  2. Issue fsync
  3. Write new data to database file
  4. Return Success

In case writes to the database file are lost, you can recover from the WAL (as intended). Notably Step 4 of the Async Example is a case where the thread is waiting but it can do other useful work while that is happening. The same thread can offload the work and simply issue more IO in the meanwhile and return success to the client once it sees the correct wait-for-fence returning in the async queue. And it won't have to wait for AIO event completion/checkpoints like currently, that make the system load non-indicative of app load (though frankly, system load is never indicative of app load, apache2 doesn't increase load if it runs out of workers).

tanelpoder · on April 19, 2022

You have a good point that the app should track the difference between I/O submission times and I/O completion times. Can't speak for MySQL, but Oracle nowadays has two "wait events" for the DB Writer flow (that syncs dirty buffers from its buffer cache to disk):

1. db file async I/O submit

2. db file parallel write

(the 2nd one has such a name for historical reasons, but it's the I/O reaping/completion check timing, not the submission).

mardifoufs · on April 19, 2022

Now that makes me wonder how does Windows deal with this? Say when you open the task manager, what does the cpu usage % actually mean? I'd guess it's not measuring it in the same way Linux does, but I'd be surprised if it used pressure stall information?

SuperQue · on April 19, 2022

Windows users just look at CPU utilization, which is what most Linux users should be doing as well. Monitoring both total system and single core utilization.

Of course, 100% CPU use also isn't a "problem", it's just a fact. PSI is the thing that actually reveals problems. I'm not aware of a Windows equivalent.

mardifoufs · on April 19, 2022

Okay so my question might not make a lot of sense but how does an OS or I'd guess a kernel knows what's the CPU utilization? I'm pretty sure it's reported by the CPU, but how does the CPU measure that? From my understanding, CPUs have multiple "units" that can specialize in specific operations. Is that taken into account, does it just measure the utilization of it's "general use" units, and if so how does the CPU keep track of it's own usage? I'd google it but I honestly don't know what to look for past the OS/kernel layer

johntb86 · on April 19, 2022

The OS is in charge of telling the CPU what to run. Literally the context switching code sets saves its old registers and reads in the registers of the program that's about to run (including the program counter that says what address to start at). So at that time it can save information about how the current time and long the previous code ran.

Low-level hardware block usage generally isn't accounted for that way, but CPUs can keep track of performance counter events tracking how many of each operation are performed and how many cycles each subunit is active. The OS can periodically read that data, but it's only used for low level metrics about how efficient the code is rather than for calculating overall CPU utilization.

There are separate measurements about GPU and hard disk utilization, which are either determined by when driver queues work on the device and receives a notification that the device is done, or by the device keeping track of what work it's switching between and reporting it periodically.

bragr · on April 18, 2022

One of my old Linux sys admin interview questions was describing a situation with a high load number but low CPU utilization. People who said that was impossible or similar got shown the door and people who knew or could reason their way to IO wait/problem would get to proceed.

EdSchouten · on April 18, 2022

The fact that the load average also counts threads stuck on I/O happens to be Linux specific. On the BSDs it only measures CPU utilization.

So it may be the case that you showed people the door, simply because their experience was based on non-Linux operating systems.

bragr · on April 18, 2022

One does typically prefer Linux experience for a _Linux_ sys admin, but don't get me wrong that's just one of many screening question. That said, if someone doesn't have enough Linux experience to know how Linux differs from *BSD or Solaris or whatever they have time on, that seems like a valid exclusion to me. I don't tend to put any stock of claims of being able to be a quick study unless they've obviously done some interview prep on the subjects they're unfamiliar with. Best way to show you are a motivated, quick study is by being a motivated, quick study.

inferiorhuman · on April 19, 2022

For context: most of my professional experience is with Linux, recreational with FreeBSD. If I'd been shown the door for failing a litmus test I'd probably be more relieved than anything. Your flippant response validates that.

When I've been on the employer side of the interview table my priorities are: does this person have a rudimentary knowledge of the required subjects and what is their problem solving process like. Basically do they know enough to handle the majority of day to day things and know where to look and who to ask when something is beyond their knowledge. Arcane trivia is of little-to-no interest to me because nobody is going to memorize every single detail. There will always be surprises. And, yes, implementation details of the load average is arcane trivia – that's precisely why this article is/was on the front page of HN.

With the scenario you've put forth it's absolutely possible to hit the ground running without having to know the gory details of how the Linux kernel calculates load average. Surely you're already monitoring I/O activity and CPU usage alongside load average. And if you're not (tsk tsk) a competent candidate would know where to look for current system information, and how to respond to the information (e.g. is this elevated load average worth being concerned about in the first place?).

> Best way to show you are a motivated, quick study is by being a motivated, quick study.

The best advice I can give is to look for successes and not failures in your candidates. Asking something tantamount to a trick question. Asking them to work through a scenario where the load average is high and the CPU utilization is low is looking for success.

toast0 · on April 18, 2022

> Best way to show you are a motivated, quick study is by being a motivated, quick study.

It's hard to get generalized knowledge being a quick study, it's a lot easier to get specific knowledge to work on an actual issue being a motivated, quick study. So if that's really what you're looking for, you've got to be OK with using a search engine and manual pages during the interview.

tanelpoder · on April 18, 2022

There's also one interesting addition that people may not be aware of: Synchronous I/O that blocks (like pread64/pwrite64) will contribute to Linux system load (threads in D state).

Asynchronous I/O completion checks (libaio's io_getevents) that are willing to wait for I/O completion, will not contribute to Linux system load (threads in S state).

Asynchronous I/O submissions (libaio's io_submit) either quickly submit their I/O (a small amount of time in R state) OR get stuck in io_submit() if the underlying block device I/O queue is full. When io_submit() gets stuck, then you're sleeping in D mode, thus contributing to system load again.

Drybones · on April 19, 2022

Recently, I was asked an interview question about what affects Linux IO load averages with a MANGA company. When I mentioned that one of the multiple possible reasons could be IO, he acted like I said the sky was green. He did not push me through the process. He was a systems engineer too and not just some HR person reading a script.

belter · on April 18, 2022

"The many load averages of Unix(es)"

https://utcc.utoronto.ca/~cks/space/blog/unix/ManyLoadAverag...

shon · on April 18, 2022

We used to ask the same question. TBH I haven’t read the f’ing article yet but… I’d summarize this as load=Run-que length. Number of run-eligible processes blocked by something where something is usually IO in a low CPU utilization scenario.

lamontcg · on April 18, 2022

processes stuck in a D state on stale NFS mounts.

pram · on April 18, 2022

Ah my old nemesis, you just had to reboot the machine to unfuck it lol

natmaka · on April 19, 2022

Was 'umount -f' unable to unmount? Did you use 'soft' NFS mounting?

walrus01 · on April 18, 2022

one possible 'correct' answer for something like this would be iowait caused by slow disk performance on a degraded raid array that was still operational.

as one of many possible theoretical scenarios.

wbh1 · on April 18, 2022

Similar article from Brendan Gregg a few years back: https://www.brendangregg.com/blog/2017-08-08/linux-load-aver...

tie_ · on April 18, 2022

Install atop and configure it to sample every second.

I can't count the number of times this has helped solve a mysterious behavior. Atop is king.

jeffbee · on April 19, 2022

It's great but it's very expensive so be aware that if you are down to the level of caring about things in the single-digit percentages of your budget, something like atop once per second could easily consume that amount of resources. A lot of these stats are really not cheap at all. E.g. you may find a substantial difference in the amount of CPU time needed to produce /proc/pid/stat as opposed to /proc/pid/statm. And some of the system-wide stats are even worse.

geocrasher · on April 18, 2022

Yes, atop is great for this! There's also "sar" (system activity report) which can do similar things. Both are quite helpful.

winrid · on April 19, 2022

Be careful running atop with the -R flag on machines that create lots of short lived threads or do lots of IO, like many databases.

nousermane · on April 18, 2022

Pretty sure that running "top -b >/var/log/whatever" in background would catch the culprit(s) of originally reported "spikes".

waynesonfire · on April 18, 2022

If I understand the authors use-case to get more refined top results, from the first paragraph,

"Suppose, not hypothetically, that you have a machine that periodically has its load average briefly soar to relatively absurd levels for no obvious reason; the machine is normally at, say, 0.5 load average but briefly spikes to 10 or 15 a number of times a day. "

$ sudo perf sched record -- sleep 5 && sudo perf sched latency

seems to do the trick. I'm not even a perf engineer just did a simple google search. Though, that makes for pretty crummy blog content.

tanelpoder · on April 18, 2022

Perf sched shows you scheduling latency (time spent waiting on CPU runqueue in Runnable state), but not "demand" for (some of) the system resources like the load figure tries to estimate.

I think the author didn't realize that the unit of load is just "number of threads" doing _something_ (wanting to be on CPU, running on CPU or waiting for I/O in D state on Linux, just CPU stuff on other Unixes).

Load average is just "average number of threads" doing something over last 1,5,15 minutes.

So if just the single-number averaged over multiple minutes is not good enough for drilling down into your load spikes, then you just go look into the data source (not necessarily source code) yourself. Just use ps or /proc filesystem to list the _number of threads_ that are currently either in R or D state. That's your system load at the current moment. If you want some summary/average over 10 seconds, run the same commant 100x in a row (and sleep a bit in the between) and count all threads in R & D state (and then divide by 100 to normalize it to an average).

It's basically sampling-based profiling of Linux thread states.