Current file systems are impressive - flexible, robust, close the hardware perfo...

ok_computer · on May 24, 2023

I think the simplicity and flexibility and lack of overall framework is the benefit. Dead simple bytes that may or may not be arranged in a way that works with the program you’re trying to open them with. Then build the relational model on top of it.

Git’s now out of style and we’re onto ____ but my storage is identical. I used to use flickr but now I dump directly to s3 and my jpgs are indistinguishable.

Especially so some consortium of tech companies don’t come up with the next-gen db/fs with bolt on features that no one’s asking for and telemetry to improve your file recall experience. Or logging into my fs because I need customization. For instance any modern web app is built with overkill tech that adds complexity because in certain scale uses that is necessary.

Give me trees of utf-8 encoded flat files any day. Not nested object relational models of stuff that ages faster than milk.

fmap · on May 24, 2023

I don't think many people are arguing against having arbitrary byte arrays for storage and using application specific serialization formats. The real problem with file systems, imo, is that they present a leaky abstraction over something that's internally very complex. Any single operation might look simple, but as soon as you start combining operations you're going to have a bad time with edge cases.

For example, let's say you need to ensure that your writes actually ends up on disk: https://stackoverflow.com/questions/12990180/what-does-it-ta...

The typical file abstraction introduces buffering on top of this, which adds additional edge cases: https://stackoverflow.com/questions/42434872/writing-program...

If you want ordered writes, then you have to handle this through some kind of journalling or at the application level with something like this: https://pages.cs.wisc.edu/~vijayc/papers/UWCS-TR-2012-1709.p...

And that's only something that's visible at the application level. There are plenty of similar edge cases below the surface.

Even if all of this works correctly, you still have to remember that very few systems give any guarantees about what data ends up on disk. So you often end up using checksums and/or error correcting codes in your files.

Finally, all this is really only talking about single files and write operations. As soon as you need to coordinate operations between multiple files (e.g., because you're using directories) things become more complicated. If you then want to abuse the file system to do something other than read and write arrays of bytes you will have to deal with operations that are even more broken, e.g., file locking: http://0pointer.de/blog/projects/locking.html

---

It's not an accident that you are using other services to store your files. For example, S3 handles a lot of the complexity around durable storage for you, but at a massive cost in latency compared to what the underlying hardware is capable of.

Similarly, application programmers often end up using embedded databases, for exactly the same reason and with exactly the same problem.

This is a shame, because your file system has to solve many of the same problems internally anyway. Metadata is usually guaranteed to be consistent and this is implemented through some sort of journaling system. It's just that the file system abstraction does not expose any of this and necessitates a lot of duplicate complexity at the application level.

---

Edit: After re-reading the grandparent comment, it sounds like they are arguing against the "array of bytes" model. I agree that this is usually not what you want at the application level, but it's less clear how to build a different abstraction that can be introduced incrementally. Without incremental adoption such a solution just won't work.

ok_computer · on May 25, 2023

Those are all good points. I will read the rest of the links!

My question is can those uncertainties be fixed with a less performant, ordered, and safe file system for typical application use. Then bleeding-edge with plenty of sharp edge cases for high performance compute work that application programmers can handle at app level? Because it is nuts how fast hardware and inexpensive RAM are and I think if you add +30% time on file write IO that will not greatly impact the user experience vs all the other causes of lag that burden us like network and bloat.

Then in the HPC word if a new byte cloud where all context is in some database with a magic index naturally comes to be we can move to that. I won't rule out needing to change the underlying file system because that's pretty over my head and there are good ideas I don't understand.

My point is to push against the proprietary format vendor lock-in file system abstractions like I get in nested objects in microsoft powerpoint or word or apple garage band where the app is merely wrapping files and hiding your actual data that you can pick up and move to another app. I don't want to need to adopt a way of thinking about pretty simple objects to use every different program.

I like wavs > flac, plain text > binary, constant bit rate > variable bit rate, sqlite > cloud company db (not really fair but just saying sqlite one-file db is amazing). Storage is inexpensive and adding in layers to decode the data runs a risk of breaking it and I like interoperability. Once you lose the file data and just have content clouds there might be compression running on the data changing the quality, e.g. youtube as a video store with successive compression algorithms aging old videos.

It drives me nuts when needing to attach things I'm faced with a huge context list where I'd rather go find a directory. Abstractions are just that, mental models to avoid the low level stuff. I'm still cool thinking of my information as file trees I think that's an OK level. But you're right complex operations with a file system has issues. I've messed up logging and file IO not thinking it through and it made me think about needing to fix my mistaken code.

jcranmer · on May 24, 2023

There are roughly three ways you can look at files.

The first is the traditional way: a file is a bag of bytes. Operating systems could do a better job of handling bags of bytes (really, they should default to making sure that the bags of bytes are updated atomically--you either see only the old bag of bytes or the new bag of bytes, never a weird mixture of both), but this is the fundamental view that most APIs tend to expose.

The second is a file is a collection of fixed-sized blocks, stored at not-necessarily-contiguous offsets. This is where something like mmap comes into play, or sparse storage files. A lot of higher-level formats actually tend to built on this model, and this tends to be how underlying storage thinks of files.

The third is that a file is a collection of data structures. It's tempting to think that the OS should expose this view of files natively in its API, but this turns out to be a really bad idea. If you limit it to well-supportable primitives, it's too simple for many applications, so they need to build their own serialization logic anyways. Cast too wide a net, and now applications have to worry about representing things they can't support. Or you take a third option and have a full serialization/deserialization framework that allows custom pluggable things, which is a ticking time bomb for security.

userbinator · on May 24, 2023

The "stream of bytes" model is what lead to easy data interchange and interoperability. There were plenty of proprietary "structured file" schemes invented in the past, but (fortunately) none of them seem to have become widespread.

rektide · on May 24, 2023

I agree that where are now is bad, but I also think files could be an answer too.

What we saw in 9p was a file orientation as well, but files were much smaller grained structures. We can see various kernel interfaces like /proc and /sys where we have file structures representing bigger objects too.

Rather than use the file system structure, apps have been creating their own structures within files. This obstructs homogenous user access to the data!

If we could start to access finer grained data, start to have objects as file-system-trees, I think a lot of progress could be made in computing, especially vis-a-vie the rifts of human-computer-interaction. It would give us leverage to see & work the data, broadly. Rather than facing endless different opaque streams of bytes.

RcouF1uZ4gsC · on May 24, 2023

I think the closest thing to what you are looking for is SQLite.

It is basically designed to be an fopen replacement. It is designed to be robust. The relational model is very flexible. It provides great interoperability and backwards compatibility.

cmiller1 · on May 24, 2023

Are you proposing something similar to the Apple Newton Soup? https://en.wikipedia.org/wiki/Soup_(Apple)

drpixie · on May 24, 2023

I'm thinking something where the system maintains "objects" of arbitrary types, presumably they could include multiple other objects. You just access objects and the system makes them available - the object could be a document, a game, a purchase order, etc. The system would also handle moving them around - what we know as "networking".

(A little like Smalltalk.)

In that case, I see no need for lower level data. No need to read text into memory, no need to serialise anything, no need to open/read/write/close. So all the work we do to handle low level stuff becomes obsolete :) (Or at least provided once by OS programmers.)

abecedarius · on May 24, 2023

Reminds me of KeyKOS and presumably other capability-oriented OSes (none of which I've ever gotten familiar with, I'm afraid).

drpixie · on May 25, 2023

Not familiar with that one. Thanks for the pointer.

arsome · on May 24, 2023

Or perhaps WinFS?

https://en.m.wikipedia.org/wiki/WinFS

throwawaylinux · on May 24, 2023

Most technology is able to do useful things by building layers of simpler things.

Files are not sequences of bytes in day to day computing. They are videos, or databases, or applications. Actually a lot of the time you'll be doing your day to day computing, thousands of files are being accessed and you wouldn't even know it.

pjc50 · on May 24, 2023

This has attracted a lot of flack, but you can see from actual usage that "S3 blob" is a not-quite-filesystem API that people actually use. Given all the latency and mutability tradeoffs, it might be useful to have something that sits on the PCIe bus and speaks Blob.

pjmlp · on May 24, 2023

In some mainframes and micros, the filesystem is based on database model, there are no files.

drpixie · on May 24, 2023

I'm familiar with OS400 and Pick. Pick provided the DB table as its lowest level of storage.

kragen · on May 24, 2023

this is the same thinking that gave us the 'advanced intelligent network'

current ip networks are impressive - flexible, robust, close to line speed. but i'm disappointed that we are still using such low level models for our day to day computing. tcp/ip = everything is a sequence of packets and every computer has to interpret and manage those packets, 'manually', individually, and slightly differently than other computers do!

it's understandable to use 'packets' when running retro apps, but it's way past time that a high-level model rendered the concept of packets obsolete

that's not a quote from a pre-stupid-network bellhead 25 years ago but it could have been

or the intel iapx432

current cpu architectures are impressive - flexible, robust, with impressive performance. but i'm disappointed that we are still using such low level models for our day to day computing. 8086 = everything is a sequence of computations on 16-bit integers and every program/library has to interpret and manage those 16-bit integers, 'manually', individually, and slightly differently than other programs do!

it's understandable to use '16-bit words' when running retro apps, but it's way past time that a high-level model rendered the concept of untyped words obsolete

in fact file storage forms the same sort of nexus as the rest of the posix system call interface, the 8086 instruction set, ip packets, bytes, and dollars: many things can store files fairly efficiently, and many things can use them for many different purposes, and the nexus permits those things to evolve independently with minimal coupling to one another

(there are many ways the posix concept of files could be improved, which is also true of 8086)

if we want to replace files with a better storage interface, it should probably be something dumber rather than something smarter

'it's done in the os so it's simple' is the same kind of cognitive error as 'it's done in the hardware so it's cheap' https://yosefk.com/blog/its-done-in-hardware-so-its-cheap.ht... (though see https://blog.cr.yp.to/20190430-vectorize.html for some 02019 updates on the relative costs of things like dispatching and floating point)

actual good systems design amounts to more than 'move the problem somewhere where i don't understand what's involved in solving it anymore'

kragen · on May 26, 2023

https://www.oilshell.org/blog/2022/02/diagrams.html is maybe a better explanation of this idea

drpixie · on May 24, 2023

I'm still hoping that the OS will make our lives (programmers and users) easier, rather than us having to know about the low-level stuff, even if (these days) we usually handle it by invoking yet another library.

kragen · on May 24, 2023

despite the needlessly hostile tone of my initial message, i'm curious what you see as the pros and cons of "yet another library" knowing about the low-level stuff rather than the kernel knowing about it

drpixie · on May 25, 2023

The cons that come to mind:

Size: The more externals you drag in, the bigger and less efficient the apps will get. An extreme version of this are the deployment packages like flatpak and snap - you drag it a substantial copy of an OS! Bigger needs more storage, slower startup, etc.

Difference: Every library does things a little differently - uses different init schemes, different serialisation, different naming conventions, etc. Makes for steeper learning curves (for users and programmers), incompatabilties, etc.

Knowledge: over a few projects, you'll need fairly detailed understanding of many libraries. Makes sense to consolidate this into one place (the OS). It will be less than perfect for some apps, but realistically, so are the libraries.

Bugs: Really a by-product of the above - but bigger systems (more libraries and/or more hand-cut low level code) invariably have more bugs.

I'm afraid the only pros that comes to mind:

Potential performance: low level stuff gives you the option to fine tune. But: #1 for almost all apps, a better design gives performance more easily than low-level coding, #2 the industry default is to accept poor performance and buy a faster machine ;)

Familiarity: using a new paradigm involves changing thinking and skills - uncomfortable and time consuming.

kragen · on May 25, 2023

these all seem like arguments for standardizing on a particular implementation of any given functionality, rather than having many different implementations

they don't seem to bear on the question of whether to put that functionality in userspace or in the kernel

you are presumably aware that at times these advantages of standardization on a single implementation are outweighed by its disadvantages; see http://akkartik.name/freewheeling for some interesting exploration of the opposite extreme from the kind of high modernism that seems to be your ideal

drpixie · on May 24, 2023

added...

> (I can be hopeful but I hold not outlook for such better models. Too many backwards compatible apps and too much depends on our existing code.)

I see it that we (you, me, almost all programmers) are so practiced at the "file" way of thinking, that we genuinely struggle to look far beyond that paradigm. We see the advantages of "files" but have no experience with much else, so we struggle to make comparisons.