This is partly why SSDs just lie nowadays and tell you they only have 75-90% of the capacity that is actually built into them. You can't directly access that excess capacity but the drive controller can when it needs to (primarily to extend the life of the drive).
Some filesystems do stake out a reservation but I don't think any claim one as large as 5% (not counting the effect of fixed-size reservations on very small volumes). Maybe they ought to, as a way of managing expectations better.
For people who used computers when the disks were a lot smaller, or who primarily deal in files much much smaller than the volumes they're stored on, the absolute size of a percentage reservation can seem quite large. And, in certain cases, for certain workloads, the absolute size may actually be more important than the relative size.
But most file systems are designed for general use and, across a variety of different workloads, spare capacity and the impact of (not) keeping it open is more about relative than absolute sizes. Besides fragmentation, there are also bookkeeping issues, like adding one more file to a directory cascading into a complete rearrangement of the internal data structures.
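To make that "cascading rearrangement" point concrete, here's a toy sketch: a minimal B-tree-style index with tiny nodes (nothing like any real filesystem's on-disk format; the capacity of 4 keys is purely for illustration), showing how adding one more entry to a full node can ripple splits all the way up to the root:

```python
# Toy model (not btrfs/ext4 internals): a minimal B-tree-style index with
# small fixed-capacity nodes, to show how inserting one more key into a
# full node can cascade splits all the way up to the root.

CAPACITY = 4  # max keys per node; real filesystems use much larger nodes

class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []   # empty for leaves

    def is_leaf(self):
        return not self.children

def insert(node, key):
    """Insert key; return (promoted_key, new_sibling) if this node split."""
    if node.is_leaf():
        node.keys.append(key)
        node.keys.sort()
    else:
        # descend into the child whose range covers the key
        i = sum(k < key for k in node.keys)
        result = insert(node.children[i], key)
        if result:
            promoted, sibling = result
            node.keys.insert(i, promoted)
            node.children.insert(i + 1, sibling)
    if len(node.keys) > CAPACITY:
        # split: middle key moves up, right half becomes a new sibling
        mid = len(node.keys) // 2
        promoted = node.keys[mid]
        sibling = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return promoted, sibling
    return None

root = Node()
for k in range(1, 22):           # 21 inserts
    result = insert(root, k)
    if result:                   # the root itself split: tree grows a level
        promoted, sibling = result
        root = Node([promoted], [root, sibling])
        print(f"insert {k}: root split, tree is now one level deeper")
```

Real filesystem trees use far larger nodes, so the splits are much rarer, but the same cascade is what you pay for when a directory or extent tree has to reorganize itself.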
> spare capacity and the impact of (not) keeping it open is more about relative than absolute sizes
I don't think this is correct. At least btrfs works with slabs in the 1 GB range IIRC.
One of my current filesystems is upwards of 20 TB. Reserving 5% of that would mean reserving 1 TB. I'll likely double it in the near future, at which point it would mean reserving 2 TB. At least for my use case those numbers are completely absurd.
We're not talking about optical discs or backup tapes which usually get written in full in a single session. Hard drive storage in general use is constantly changing.
As such, fragmentation is always there; absolute disk sizes don't change the propensity of typical workloads to produce it. A modern file system is not merely a bucket of files; it is a database that manages directories, metadata, files, and free space. If you mix small and large directories, small and large files, creation and deletion of files, appending to or truncating existing files, etc., you will get fragmentation. When you get close to full, everything gets slower. Files written early in the volume's life that haven't been altered since may remain fast to access, but creating new files will be slower, and reading those files afterward will be slower too. Large directories follow the same rules as large files: they can easily get fragmented (or, if they must be kept compact, time will be spent on defragmentation). If your free space is spread across the volume in small chunks, and at 95% full it almost certainly will be, then the fact that it sums to 1 TB confers no benefit by dint of absolute size.
Even if you had SSDs accessed with NVMe, fragmentation would still be an issue, since the file system must still store lists or trees of all the fragments, and accessing those data structures still takes more time as they grow. But most NAS setups are still using conventional spinning-platter hard drives, where the effects of fragmentation are massively amplified. A 7200 RPM drive takes 8.33 ms to complete one rotation. No improvements in technology have any effect on this number (though there used to be faster-spinning drives on the market). The denser storage of modern drives improves throughput when reading sequential data, but not random seek times. Fragmentation increases the frequency of random seeks relative to sequential access. Capacity issues tend to manifest as performance cliffs, whereby operations which used to take e.g. 5 ms suddenly take 500 or 5000. Everything can seem fine one day and then not the next, or fine on some operations but terrible on others.
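To put rough numbers on that cliff, here's a back-of-the-envelope model. The rotation time comes from the 7200 RPM figure above; the seek time and sequential throughput are assumptions I picked for illustration, not measurements of any particular drive:

```python
# Back-of-the-envelope numbers (assumed values, not measurements): why a
# fragmented read on a spinning disk falls off a performance cliff.

ROTATION_MS = 60_000 / 7200            # one rotation at 7200 RPM ~= 8.33 ms
AVG_ROT_LATENCY_MS = ROTATION_MS / 2   # on average you wait half a turn ~= 4.17 ms
AVG_SEEK_MS = 8.0                      # typical published seek time (assumption)
SEQ_THROUGHPUT_MBPS = 200              # sequential throughput (assumption)

def read_time_ms(file_mb, fragments):
    """Rough model: one seek + rotational wait per fragment, plus transfer."""
    positioning = fragments * (AVG_SEEK_MS + AVG_ROT_LATENCY_MS)
    transfer = file_mb / SEQ_THROUGHPUT_MBPS * 1000
    return positioning + transfer

for frags in (1, 10, 100, 1000):
    print(f"100 MB file in {frags:4d} fragments: ~{read_time_ms(100, frags):7.0f} ms")
# On these assumptions: 1 fragment ~= 512 ms, 1000 fragments ~= 12.7 s.
```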
Of course, you should be free to (ab)use the things you own as much as you wish. But make no mistake, 5% free is deep into abuse territory.
Also, as a bit of an aside, a 20 TB volume split into 1 GB slabs means there are 20,000 slabs. That's about the same as the number of 512-byte sectors in a 10 MB hard drive, which was the size of the first commercially available consumer hard drive for the IBM PC in the early 1980s. That's just a coincidence of course, but I find it funny that the numbers are so close.
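For the curious, the arithmetic behind that aside (decimal and binary megabytes give slightly different sector counts, but both land near 20,000):

```python
# Just checking the coincidence with actual numbers.
slabs       = 20 * 10**12 // 10**9   # 20 TB volume in 1 GB slabs   -> 20,000
sectors_dec = 10 * 10**6 // 512      # 10 MB (decimal) in 512 B     -> 19,531
sectors_bin = 10 * 2**20 // 512      # 10 MB (binary) in 512 B      -> 20,480
print(slabs, sectors_dec, sectors_bin)
```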
Now, I assume the slabs are allocated from the start of the volume forward, which means external slab fragmentation is nonexistent (unless slabs can also be freed). But unless you plan to create no more than 20,000 files, each exactly 1 GB in size, in the root directory only, and never change anything on the volume ever again, internal slab fragmentation will occur all the same.
Yes, thank you, I am aware of what fragmentation is.
There are two sorts of fragmentation that can occur with btrfs: free space and file data. File data is significantly more difficult to deal with, but it "only" degrades read performance. It's honestly a pretty big weakness of btrfs. You can't realistically defragment file data if you have a lot of deduplication going on because (at least last I checked) the tooling breaks the deduplication.
> If your free space is spread across the volume in small chunks, and at 95% full it almost certainly will be
Only if you failed to perform basic maintenance. Free space fragmentation is a non-issue as long as you run the relevant tooling when necessary. Chunks get compacted when you rebalance.
Where it gets dicey is that the btrfs tooling is pretty bad at handling the situation where you have a small absolute number of chunks available. Even if you theoretically have enough chunks to play musical chairs and perform a rebalance, the tooling will happily back itself into a corner through a series of utterly idiotic decisions. I've been bitten by this before, but in my experience it doesn't happen until you're somewhere under 100 GB of remaining space, regardless of the total filesystem size.
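To illustrate what "chunks get compacted when you rebalance" means, here's a toy model (my own simplification, not btrfs's actual allocator or its balance filters): chunks each have some used bytes, and a filtered rebalance rewrites the sparsely used ones, packing their data densely and freeing the chunks that are left empty:

```python
# Toy model (not btrfs's real allocator): chunks of fixed size with some used
# bytes each. A filtered "rebalance" rewrites chunks below a usage threshold,
# packing their data into as few chunks as possible and freeing the rest.

CHUNK = 1 << 30                      # 1 GiB chunks, as in the discussion

def rebalance(chunks_used, usage_threshold=0.5):
    """chunks_used: list of used-bytes per chunk. Returns the new layout."""
    keep  = [u for u in chunks_used if u / CHUNK > usage_threshold]
    loose = sum(u for u in chunks_used if u / CHUNK <= usage_threshold)
    # repack the loose data densely into fresh chunks
    while loose > 0:
        keep.append(min(loose, CHUNK))
        loose -= CHUNK
    return keep

before = [CHUNK // 10] * 50          # 50 chunks, each only 10% used
after  = rebalance(before)
print(len(before), "chunks before,", len(after), "chunks after")   # 50 -> 5
```

That repacking is also where the write amplification mentioned below comes from: every byte that gets moved is a byte rewritten.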
If compaction (= defragmentation) runs continuously or near-continuously, it results in write amplification of 2x or more. For a home/small-office NAS (the topic at hand) that's also lightly used with a read-heavy workload, it should be fine to rely on compaction to keep things running smoothly, since you won't need it to run that often and you have cycles and IOPS to spare.
If, under those conditions, 100 GB has proven to be enough for a lot of users, then it might make sense to add more flexible alarms. However, this workload is not universal, and setting such a low limit (0.5% of 20 TB) in general will not reflect the diverse demands that different people put on their storage.
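As a sketch of what a more flexible alarm might look like (the thresholds and the function here are hypothetical, not any existing NAS product's defaults): warn when free space drops below either a relative floor or an absolute floor, whichever is larger for the volume in question.

```python
# Hypothetical free-space alarm: combine a relative floor with an absolute
# floor instead of hard-coding one small absolute limit for every volume.

def free_space_alarm(total_bytes, free_bytes,
                     min_free_ratio=0.05,            # 5% relative floor (assumption)
                     min_free_bytes=100 * 10**9):    # 100 GB absolute floor (assumption)
    """Return a warning string if free space is below either floor."""
    floor = max(min_free_ratio * total_bytes, min_free_bytes)
    if free_bytes < floor:
        return (f"low space: {free_bytes/1e9:.0f} GB free, "
                f"want at least {floor/1e9:.0f} GB")
    return None

# A 20 TB volume with 800 GB free trips the 5% (1 TB) floor even though it
# comfortably clears the 100 GB absolute floor.
print(free_space_alarm(20 * 10**12, 800 * 10**9))
```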