> However, I think you’re making a mistake on a core part of your argument:
No, that part has been misunderstood, so I guess I didn't write it clearly enough.
I'm not saying the filesystem defragments, or makes any particular effort to ensure contiguous on-disk storage. I'm saying that, as a result of other data being present on the filesystem and the historical, accumulated entropy in its layout (sometimes caused by an LSM tree DB!), the filesystem ends up tracking appended data discontiguously on disk. To do that tracking, it has to consult its free-space structures to find new free areas, apply a heuristic or two to decide whether to allocate a large or small one, write updates to those free-space structures, and record the resulting discontiguous mapping for the file by writing to its metadata structures too. Even when the file is contiguous on disk, versions of those metadata writes are still needed for transactional, durable DB writes; they're simpler during bulk non-durable writes, of course.
These are the "more work, sometimes more I/O, just to allocate space".
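To make that concrete, here's a toy sketch (purely illustrative, not modelled on any real filesystem's allocator) of the bookkeeping an append can trigger: consult a free-space structure, pick extents with a crude large-vs-small heuristic, update that structure, and record the resulting file-offset-to-disk-extent mapping. Every one of those updates is state the filesystem eventually has to persist.

```python
# Toy model (illustrative only, not any real filesystem's allocator) of the
# bookkeeping described above: appending to a file means consulting a
# free-space structure, choosing extents, updating that structure, and
# recording the resulting (file offset -> disk extent) mapping.
from dataclasses import dataclass, field

@dataclass
class Extent:
    disk_start: int   # first block of the extent on disk
    length: int       # number of blocks

@dataclass
class ToyAllocator:
    free_extents: list          # free-space structure: list of free Extents
    file_map: dict = field(default_factory=dict)  # file offset (blocks) -> Extent
    metadata_writes: int = 0    # metadata updates that would need persisting

    def append(self, file_len_blocks: int, want_blocks: int) -> int:
        """Allocate space for an append; returns the new file length in blocks."""
        offset = file_len_blocks
        while want_blocks > 0:
            if not self.free_extents:
                raise OSError("ENOSPC: no free extents left")
            # Crude heuristic: take from the largest free extent. On a
            # fragmented filesystem only small ones exist, so one logical
            # append turns into several discontiguous extents.
            self.free_extents.sort(key=lambda e: e.length, reverse=True)
            free = self.free_extents[0]
            take = min(free.length, want_blocks)

            # Record the mapping for this piece of the file (metadata write #1).
            self.file_map[offset] = Extent(free.disk_start, take)
            self.metadata_writes += 1

            # Update the free-space structure (metadata write #2).
            free.disk_start += take
            free.length -= take
            if free.length == 0:
                self.free_extents.pop(0)
            self.metadata_writes += 1

            offset += take
            want_blocks -= take
        return offset

# A fragmented free list: one 8-block hole and one 4-block hole.
alloc = ToyAllocator(free_extents=[Extent(0, 8), Extent(100, 4)])
alloc.append(file_len_blocks=0, want_blocks=10)
print(alloc.file_map)          # two discontiguous extents for one logical append
print(alloc.metadata_writes)   # 4 metadata updates for this single append
```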
Good filesystems are efficient, but that activity does add overhead (especially when durable transactions are involved) and big-O algorithmic factors (those 1 MiB extents add up), so the picture of purely linear-time operations painted in LSM papers is inaccurate. I don't think this overhead is necessarily large in practice most of the time, but it's where much of the complexity and many of the corner cases lie, if you honestly analyse the performance range and big-O scaling factors.
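To put a rough number on "those 1 MiB extents add up", here's a back-of-envelope calculation; the DB size, extent size, and per-extent record size are illustrative guesses, not measurements of any particular filesystem:

```python
# Back-of-envelope: extent metadata for a large DB, assuming (for illustration)
# 1 MiB extents and ~50 bytes of mapping metadata per extent.
db_bytes = 2 * 2**40              # a 2 TiB database
extent_bytes = 1 * 2**20          # 1 MiB per extent
bytes_per_extent_record = 50      # rough guess at per-extent mapping overhead

extent_count = db_bytes // extent_bytes
print(extent_count)                                     # 2,097,152 extents
print(extent_count * bytes_per_extent_record / 2**20)   # ~100 MiB of extent metadata
```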
You make an interesting point about write amplification at the filesystem layer not being accounted for. In addition, classic LSM trees themselves have significant write amplification (regardless of filesystem or even block device) due to the simple act of rewriting all data into every level. This is well known by LSM designers, and there are mitigations, but it's somehow left out of the public image of LSM trees. Classic LSM trees are excellent for random-access writes that need to be gradually sorted, but for some other write patterns they are slower (more I/O) than good-quality B-trees and some other structures.
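For a sense of scale, the commonly quoted approximation for leveled compaction is that write amplification grows with the level fanout times the number of levels, since each byte is rewritten roughly once per level and merged into about `fanout` times as much existing data each time. The parameters below are illustrative, not taken from any specific RocksDB configuration:

```python
import math

# Commonly quoted approximation for leveled-compaction write amplification:
# WA ~= fanout * number_of_levels. Parameters are illustrative only.
fanout = 10            # size ratio between adjacent levels
db_size_gib = 2048     # ~2 TiB of data in the DB
memtable_gib = 0.25    # ~256 MiB flushed per memtable

levels = math.ceil(math.log(db_size_gib / memtable_gib, fanout))
write_amp = fanout * levels
print(levels, write_amp)   # 4 levels -> write amplification on the order of 40x
```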
> no one is really going to care about performance on an almost full filesystem (kinda full, like 75%, but old so lots of fragments is valid, but I doubt it's actually a problem because of how good filesystems are).
Heh. In practice, every RocksDB or LevelDB I've seen in production is hundreds of GB or several TB, on filesystems that have run low on space or even run out, mainly due to the size of the DB itself :-) They also have thousands or tens of thousands of files in each DB, which they have to open for random-access queries. This is what motivated me to recognise how involved the filesystem can be.
I think we're mostly on the same page. I'm approaching it more from the angle of intrinsic complexity vs not. Could a database do a better job than the FS by managing extents manually? Maybe. But I'm not as sold that the win is significant compared with other techniques that could be used. So yes, the papers should definitely consider the filesystem, and they don't. I'm just not convinced that the fragmentation issue is a filesystem-vs-DB thing so much as a block-allocation thing that would always be there and would probably behave very similarly, with little room for optimization. Certainly I've seen people too ready to throw away the filesystem without actually confirming the cost-benefit.