
Memory management, and specifically OOM conditions, remains an unbelievably painful nightmare on Linux. It's not like I run into these issues constantly, but I've definitely tried to debug issues like these (unsuccessfully). Ultimately, if I OOM a machine I usually wind up installing more RAM, which is wasteful/expensive, but it's pretty clear that handling OOM conditions gracefully is going to remain a hard problem for Linux for the foreseeable future.

This is really great work and will serve as a reference point for debugging similar issues in the future. Pretty happy to learn about systemd's debug-shell feature; I had no idea that existed. I don't think my X670E Steel Legend board has a serial header anywhere on it, though. How do modern built-in serial ports work, anyway? Are they attached off of the chipset PCIe lanes?

Something else that's very useful when trying to dive into the Linux kernel: there are a bunch of great talks on kernel subsystems from conferences like FOSDEM and the Linux Plumbers Conference, and recordings are usually available online. For example, there's this one for TTM, the memory subsystem that most of the desktop GPU DRM drivers use:

https://www.youtube.com/watch?v=MG7_tUNKSt0



Windows says that my motherboard serial port is connected to the PCI bus → PCI standard ISA bridge. Long live DOS!

Thanks for the video about TTM, I'll watch it when I have a chance.


I’ve had good luck containing OOMs with cgroups. I’m not sure if there is a state of the art for handling OOM conditions beyond what Linux does. If anyone knows and can recommend some reading, I would appreciate it.
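
For concreteness, here's a rough sketch (not my exact setup) of what that containment can look like with cgroup v2 from C. It assumes the v2 hierarchy is mounted at /sys/fs/cgroup, that you run with enough privileges to create a group, and the group name "build" and the 2G limit are made up:

    /* Sketch: put the current process into a new cgroup v2 group with a
     * hard memory limit, so the kernel OOM-kills inside the group instead
     * of taking down the whole machine. Assumes cgroup v2 at /sys/fs/cgroup
     * and sufficient privileges; the group name "build" is arbitrary. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *val) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); return; }
        fputs(val, f);
        fclose(f);
    }

    int main(void) {
        mkdir("/sys/fs/cgroup/build", 0755);
        /* Hard cap: going past 2G gets something in the group killed, not the host. */
        write_file("/sys/fs/cgroup/build/memory.max", "2G");
        /* Optional: start reclaim earlier so the hard cap is rarely hit. */
        write_file("/sys/fs/cgroup/build/memory.high", "1800M");

        char pid[32];
        snprintf(pid, sizeof pid, "%d", getpid());
        write_file("/sys/fs/cgroup/build/cgroup.procs", pid);

        /* From here, exec the memory-hungry workload, e.g. make -j32. */
        return 0;
    }

In practice you'd usually let systemd do this for you, e.g. systemd-run --scope -p MemoryMax=2G <command>, which sets up an equivalent transient cgroup.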


There are really two problems, as I understand it:

- Overcommit. Linux will "overcommit" memory: allocations succeed even when there isn't enough memory to back them, and things only go wrong later, when the pages are actually touched and no physical pages are available (to my understanding; see the sketch after this list). Windows NT doesn't do this. Not sure exactly how macOS/XNU handles it.

- The OOM killer. Because allocations don't fail, the only way to recover from an OOM situation is for the kernel to enumerate processes and try to kill the ones using a lot of memory, scoring them with heuristics. The big problem? If there isn't a single process hogging the memory, this approach tends to work very poorly. As an example, consider a highly parallel task like make -j32. An individual C++ compiler invocation is unlikely to use more than a gigabyte or two of memory, so it's more likely that things like Electron apps will get caught first. The thrashing of memory, combined with the high CPU consumption of the compilers that are not getting killed, will grind the machine to a near-complete halt. If you are lucky, it will finally pick a compiler to kill and set off a chain reaction that ends your make invocation.
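
To make the overcommit half of this concrete, here's a small C sketch (the chunk size and loop count are arbitrary): each malloc succeeds no matter how much RAM the machine actually has, and nothing bad happens until the pages are touched.

    /* Sketch of overcommit: keep allocating 1 GiB chunks. In the default
     * overcommit mode each individual allocation passes the kernel's check,
     * so this succeeds far beyond physical RAM; none of it is backed by
     * real pages until written. Watch VmRSS in /proc/self/status. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t chunk = 1UL << 30;                /* 1 GiB per allocation */
        for (int i = 0; i < 256; i++) {          /* up to 256 GiB of address space */
            char *p = malloc(chunk);
            if (!p) { printf("malloc failed after %d GiB\n", i); return 1; }
            /* Uncomment to actually touch the memory -- this is what turns
             * "mapped" into "resident" and eventually wakes the OOM killer:
             * for (size_t o = 0; o < chunk; o += 4096) p[o] = 1;           */
            printf("mapped %d GiB, RSS still small\n", i + 1);
        }
        getchar();                               /* pause so you can inspect it */
        return 0;
    }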

There are solutions... Indeed, you can use quotas with cgroups. There are tools like systemd-oomd that try to provide better userspace OOM killing using cgroups. You can disable overcommit, but some software won't function well that way, since it likes to allocate a ton of pages ahead of time and only potentially use them later. Overcommit fundamentally improves the ability to efficiently utilize all available memory. Ultimately I think overcommit is probably a bad idea... but it is hard to come up with a zero-compromises solution that keeps optimal memory/CPU utilization yet avoids pathological OOM conditions by design.
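
For reference, the kernel-level knobs behind a couple of these options look roughly like this (a sketch, not a recommendation; paths and value ranges are the standard procfs interfaces):

    /* Two of the knobs mentioned above:
     *  - /proc/sys/vm/overcommit_memory: 0 = heuristic (default),
     *    1 = always overcommit, 2 = strict accounting ("disable overcommit")
     *  - /proc/<pid>/oom_score_adj: -1000..1000 bias for the OOM killer;
     *    -1000 makes a process effectively unkillable by it. */
    #include <stdio.h>

    int main(void) {
        int mode = -1;
        FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
        if (f) { fscanf(f, "%d", &mode); fclose(f); }
        printf("overcommit_memory = %d\n", mode);

        /* Make this process a preferred OOM victim, so the kernel kills it
         * before, say, your compile jobs. Positive values can be written
         * unprivileged; negative ones need CAP_SYS_RESOURCE. */
        f = fopen("/proc/self/oom_score_adj", "w");
        if (f) { fputs("500\n", f); fclose(f); }
        return 0;
    }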


> two problems ... overcommit

Is there any other sensible way to do this, though? It would be quite inefficient to constantly call mmap for additional small(ish) pieces of memory. In effect, overcommit just means that until a page is actually written to, it hasn't really been allocated. (Aside: I believe a malloc implementation that zeroed out blocks on allocation would fail up front rather than later, in case that happens to be what bugs you about it.)

Additionally, how do you suppose fork should be implemented efficiently? Currently it relies on copy-on-write. At minimum you'd need a way to mark pages as "never going to write to these, don't reserve space for a copy". Except such an API is either very awkward to use in practice or leaves you with some very awkward edge cases to deal with in your program logic.
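
For what it's worth, Linux does have something in this neighbourhood already via madvise(); neither flag is exactly the semantics you describe, which is sort of the point. A rough sketch:

    /* Sketch of the closest existing knobs: madvise() flags that change how
     * fork() treats a mapping. Neither is exactly "share read-only, never
     * reserve a copy":
     *  - MADV_DONTFORK:   the child simply doesn't get this range at all
     *  - MADV_WIPEONFORK: the child gets the range, but zero-filled
     * (Both are Linux-specific; MADV_WIPEONFORK needs kernel >= 4.14.) */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t len = 1UL << 30;   /* 1 GiB scratch buffer */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* Tell the kernel a forked child never needs this range, so fork()
         * has nothing to copy or account for here. */
        if (madvise(buf, len, MADV_DONTFORK) != 0)
            perror("madvise(MADV_DONTFORK)");

        /* fork()/exec() of helpers is now cheaper w.r.t. this mapping. */
        return 0;
    }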

> You can disable overcommit, but some software will not function very well

Yeah about that.

Chromium runs (AFAIK) 1 PID namespace per tab. On my machine right now it reports 1.1 TiB virtual memory with a little over 100 MiB resident per tab. 1.1 TiB mapped PER TAB. Of the resident I have no idea how much is actually unique (ie written to following the initial fork).

Firefox is much more reasonable at a mere 18 GiB mapped per PID.
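
(If anyone wants to reproduce these numbers: the mapped vs. resident figures come from /proc/<pid>/status, e.g. with something like this sketch, where the PID is whichever browser process you want to look at.)

    /* Sketch: print VmSize (total mapped address space) and VmRSS (resident
     * in RAM) for a process, straight from /proc, which is where top/ps get
     * them. Pass a PID as argv[1], or omit it for the current process. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv) {
        char path[64];
        snprintf(path, sizeof path, "/proc/%s/status",
                 argc > 1 ? argv[1] : "self");

        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }

        char line[256];
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "VmSize:", 7) == 0 ||
                strncmp(line, "VmRSS:", 6) == 0)
                fputs(line, stdout);
        }
        fclose(f);
        return 0;
    }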


> Chromium runs (AFAIK) 1 PID namespace per tab. On my machine right now it reports 1.1 TiB virtual memory with a little over 100 MiB resident per tab. 1.1 TiB mapped PER TAB. Of the resident I have no idea how much is actually unique (ie written to following the initial fork).

This is most likely a trick for garbage collection or memory-bug hardening, or both. Haskell programs also map 1 TB.
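
The underlying trick, roughly: reserve a huge range of address space with PROT_NONE (which costs no RAM and, to my understanding, no commit charge), then flip pieces of it to read/write on demand. A sketch of the idea, with made-up sizes:

    /* Sketch of the address-space reservation trick: grab a huge PROT_NONE
     * range up front, then hand out pieces by flipping them to read/write
     * on demand. This is roughly what GC runtimes do; it shows up as a
     * terrifying "virtual" size in top while RSS stays small. */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t reserve = 1UL << 40;   /* reserve 1 TiB of address space */
        char *base = mmap(NULL, reserve, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        /* "Allocate" the first 16 MiB of the reservation for real. */
        size_t chunk = 16UL << 20;
        if (mprotect(base, chunk, PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect");
            return 1;
        }
        base[0] = 42;   /* only now do physical pages start to appear */

        printf("reserved 1 TiB at %p, committed 16 MiB\n", (void *)base);
        return 0;
    }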


A potential workaround would be to still allow giant mmaps, but instead of hanging a program when it runs out of pages, send it a signal. Obviously, neither Chrome nor Firefox actually uses this much memory in practice.


Rather than a workaround, I think that would just be an overall better approach: receive an actionable error when the allocation happens "for real", whether that's at an arbitrary point in user code or when malloc zeroes out the block ahead of time.

However I think you'd need per-thread signal handlers for that to work sensibly. Which the kernel supports (see man 2 clone) but would require updates to (at least) posix and glibc.

It would probably also be nice to have a way to allocate pages without writing to them. Currently we have mlock, but that prevents swapping, which isn't desirable in this context.
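
There is something fairly close already, for what it's worth: MAP_POPULATE prefaults the pages at mmap() time, so they're genuinely allocated without the program writing to them, and unlike mlock'd pages they stay swappable. A sketch (the 256 MiB size is arbitrary):

    /* Sketch: MAP_POPULATE asks the kernel to prefault the pages at mmap()
     * time, so they are allocated without the program touching them first.
     * Unlike mlock(), the pages remain swappable afterwards. */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t len = 256UL << 20;   /* 256 MiB */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("256 MiB mapped and prefaulted at %p\n", (void *)p);
        return 0;
    }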


> Memory management and specifically OOM conditions remain an unbelievably painful nightmare on Linux.

Yes. It's horrendous, to put it mildly. Linux does not handle OOM conditions properly.

I know I can set up a few guardrails with cgroups. I know I can also install earlyoom. I know I can increase swap or use zram. In the end these are all fundamentally just nasty hacks that might spare one once in a while. They do not fix how these conditions are handled. Please do not offer these as solutions.

I've seen LUKS volumes go read-only because the kernel couldn't allocate memory in dm_crypt; for the love of god, just kill something in userspace instead. The current state is utterly unacceptable, and I'm tired of all the excuses.


Have you been running zswap / zram?

With zstd you can turn 8 GB of RAM into 20 GB of 'RAM' without much issue, or 16 GB into 40 GB. Hell, if you're feeling adventurous (and Android does this, so it's very stable), you can overcommit your memory past 100%.
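
If you want to see the actual ratio on a machine that's already using zram, the device exposes original vs. compressed sizes in sysfs (per the kernel's zram documentation; this sketch assumes the device is zram0 and is already in use as swap):

    /* Sketch: read a zram device's compression statistics from sysfs. Per
     * Documentation/admin-guide/blockdev/zram.rst, mm_stat starts with
     * orig_data_size and compr_data_size, both in bytes. */
    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("/sys/block/zram0/mm_stat", "r");
        if (!f) { perror("/sys/block/zram0/mm_stat"); return 1; }

        unsigned long long orig = 0, compr = 0;
        if (fscanf(f, "%llu %llu", &orig, &compr) == 2 && compr > 0)
            printf("stored %.1f MiB in %.1f MiB (ratio %.2fx)\n",
                   orig / 1048576.0, compr / 1048576.0, (double)orig / compr);
        fclose(f);
        return 0;
    }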



