It was a long, long time ago that we were only using NFS; it ran on a Solaris machine running ZFS. It did its job at the very beginning, but you don't build up hundreds of petabytes of data on an NFS server.
We did try various solutions in between NFS and developing TernFS, both open source and proprietary. However, we didn't name these specifically in the blog post because there's little point in bad-mouthing what didn't work out for us.
These limits aren't quite as strict as they first seem.
Our median file size is 2MB, which means 50% of our files are <2MB. Realistically if you've got an exabyte of data with an average file size of a few kilobytes then this is the wrong tool for the job (you need something more like a database), but otherwise it should be just fine. We actually have a nice little optimisation where very small files are stored inline in the metadata.
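To illustrate the inline idea (this is my own sketch with invented field names and threshold, not TernFS's actual metadata schema): a file under some small cutoff simply carries its bytes in its metadata record, so reading it never has to touch a block server.

    // Illustrative sketch only; the threshold and field names are invented.
    package sketch

    const inlineThreshold = 256 // bytes, hypothetical cutoff

    type fileMeta struct {
        size     int64
        inline   []byte   // set when size <= inlineThreshold
        blockIDs []uint64 // otherwise, references to blocks on block servers
    }

    // readInline returns the contents if they live in the metadata itself.
    func readInline(m *fileMeta) ([]byte, bool) {
        if m.inline != nil {
            return m.inline, true // served straight from the metadata shard
        }
        return nil, false // caller has to fetch the blocks instead
    }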
It works out of the box with "normal" tools like rsync, python, etc. despite the immutability. The reality is that most things don't actually modify files, even text editors tend to save a new version and rename over the top. We had to update relatively little of our massive code base when switching over to this. For us that was a big win; moving to an S3-like interface would have required updating a lot of code.
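For anyone wondering what "rename over the top" looks like in practice, here's a minimal sketch of the pattern (my own illustration, not TernFS code): the old file is never modified in place, a new file is written and then renamed onto the old path.

    // Sketch of the save-then-rename pattern most tools use; not TernFS code.
    package sketch

    import "os"

    func atomicSave(path string, data []byte) error {
        tmp := path + ".tmp" // hypothetical temp name; real tools pick unique names
        if err := os.WriteFile(tmp, data, 0o644); err != nil {
            return err
        }
        // The rename is the only change the filesystem sees: the old file is
        // never written to, it is simply replaced by an entirely new one.
        return os.Rename(tmp, path)
    }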
Directory creation/deletion is "slow", currently limited to about 10,000 operations per second. We don't currently need to create more than 10,000 directories per second, so we just haven't prioritised improving that. There is an issue open, #28, which would get this up to 100,000 per second. This is the sort of thing that, like access control, I would love to have had in an initial open source release, but we prioritised open sourcing what we have over getting it perfect.
> The reality is that most things don't actually modify files, even text editors tend to save a new version and rename over the top.
It is essentially copy-on-write exposed at the user level. The only issue is that this breaks hard links, so tools that rely on them are going to break. But yes, custom code should be easy to adapt.
Yes, hard links aren't supported in TernFS. They would actually be really difficult to make work in this kind of sharded metadata design, as they would need to be reference counted and all the operations would need to go via the CDC. It wouldn't really have matched the design philosophy of simple and predictable performance.
Well, that's at least consistent. If hard links aren't even supported, you can't break hard links by replacing a file with a new one through renaming either.
We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it’s just a transport layer change), but so far we haven’t needed to. We’ve had to make some difficult prioritisation decisions here.
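To sketch what "sendfile under the hood" means in Go (an illustration, not the actual TernFS block server): when io.Copy is handed a *net.TCPConn and an *os.File, the standard library uses sendfile(2) on Linux, so the block bytes go from the page cache to the NIC without ever being copied into user space.

    // Illustrative only, not the real block server code.
    package sketch

    import (
        "io"
        "net"
        "os"
    )

    func serveBlock(conn *net.TCPConn, path string) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        // io.Copy detects the *net.TCPConn / *os.File pair and calls
        // sendfile(2) on Linux instead of copying through a userspace buffer.
        _, err = io.Copy(conn, f)
        return err
    }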
PRs welcome!
> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
We’re used to building things like this; trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated: right now there is no automatic failover enabled. Failures are rare and we will only enable that post-Jepsen.
If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.
> This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).
Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding.
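A toy sketch of what a duplicate request cache buys you (my own illustration, not NFS or TernFS code): the server keys each non-idempotent operation by a client-chosen request ID and replays the cached reply on a retransmission instead of executing the operation twice.

    // Toy duplicate request cache; real ones also bound memory and expire entries.
    package sketch

    import "sync"

    type reply struct {
        status int
        body   []byte
    }

    type drc struct {
        mu   sync.Mutex
        seen map[uint64]reply // keyed by client-chosen request ID
    }

    func (c *drc) handle(reqID uint64, exec func() reply) reply {
        c.mu.Lock()
        defer c.mu.Unlock() // held across exec for simplicity in this toy
        if r, ok := c.seen[reqID]; ok {
            return r // retransmission: replay the cached reply, don't re-execute
        }
        r := exec()
        c.seen[reqID] = r
        return r
    }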
The seamless realtime intercontinental replication is a key feature for us, maybe the most important single feature, and AFAIK you can’t do that with Ceph (even if Ceph could scale to our original 10 exabyte target in one instance).
It's just not optimised for tiny files. It absolutely would work: you could use it to store 100 billion 1kB files with zero problems (and that is 100 terabytes of data, probably on flash, so no joke). However you can't use it to store 1 exabyte of 1 kilobyte files (at least not yet).
That’s not correct. The frequency provides an indication of how much power is being drawn. If the frequency is low then generators supply more power to bring it back to 60 Hz.
If there were not a well-known fixed frequency, it would be impossible to evenly distribute load over power stations. All generators have a %load vs. frequency delta curve built into them which is precisely calibrated.
The frequency does not provide an indication of load. The frequency can be 60.00 Hz with 20,000 MW load in Ontario or with 10,000 MW.
Changes in frequency provide a measure of changes in the balance between generation and load.
The generator’s prime mover’s governor has a droop function set so that typically a 5% change in frequency will result in a 100% change in output. This is how most generators on the grid arrest changes in frequency, but they would not restore the frequency to 60Hz. The droop allows for a steady state frequency error.
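As a worked example of that droop relation (illustrative numbers only): with 5% droop, a unit's output change is the per-unit frequency error divided by the droop, so a 0.3 Hz dip on a 60 Hz system asks a 100 MW machine for an extra 10 MW, and only a full 3 Hz dip would call for 100% of its rating.

    // Worked sketch of governor droop; the numbers are illustrative.
    package main

    import "fmt"

    // droopResponse returns the change in output (in the units of pRated) for a
    // frequency deviation deltaF, using deltaP = -(deltaF/fNominal)/droop * pRated.
    func droopResponse(deltaF, fNominal, droop, pRated float64) float64 {
        return -(deltaF / fNominal) / droop * pRated
    }

    func main() {
        // A 0.3 Hz dip on a 60 Hz system with 5% droop: +10 MW from a 100 MW unit.
        fmt.Println(droopResponse(-0.3, 60, 0.05, 100)) // 10
    }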
A handful of special generators are used to restore the frequency to 60Hz or balance the generation and load in an area.
The precise frequency does not matter: if one generator thinks the frequency is 59.99 and another thinks it is 60.01, their outputs will only be a little higher and lower than their load setpoint. It does not matter if they share changes in load perfectly evenly, so long as generators on the system in bulk respond according to their capabilities.
Yes, they use the frequency as a common signal to respond to changes in load, that is correct.
With GPS-synchronized clocks and high speed waveform measurement we can see the propagation delay in the frequency across the country when there is a big event. Pretty neat!
Ah, I didn't know that they operate a derivatives exchange there too.
In any case, they are the ones that operate the electronic NYSE exchange in New York, and that market is definitely located in New Jersey; neither is in Atlanta.