It was a long, long time ago that we were only using NFS; it ran on a Solaris machine running ZFS. It did its job at the very beginning, but you don't build up hundreds of petabytes of data on an NFS server.
We did try various solutions in between NFS and developing TernFS, both open source and proprietary. However, we didn't name these specifically in the blog post because there's little point in bad-mouthing what didn't work out for us.
These limits aren't quite as strict as they first seem.
Our median file size is 2MB, which means 50% of our files are <2MB. Realistically if you've got an exabyte of data with an average file size of a few kilobytes then this is the wrong tool for the job (you need something more like a database), but otherwise it should be just fine. We actually have a nice little optimisation where very small files are stored inline in the metadata.
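To illustrate the inline idea (this is my own sketch with invented field names and threshold, not TernFS's actual metadata schema): a file under some small cutoff simply carries its bytes in its metadata record, so reading it never has to touch a block server.

    // Illustrative sketch only; the threshold and field names are invented.
    package sketch

    const inlineThreshold = 256 // bytes, hypothetical cutoff

    type fileMeta struct {
        size     int64
        inline   []byte   // set when size <= inlineThreshold
        blockIDs []uint64 // otherwise, references to blocks on block servers
    }

    // readInline returns the contents if they live in the metadata itself.
    func readInline(m *fileMeta) ([]byte, bool) {
        if m.inline != nil {
            return m.inline, true // served straight from the metadata shard
        }
        return nil, false // caller has to fetch the blocks instead
    }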
It works out of the box with "normal" tools like rsync, python, etc. despite the immutability. The reality is that most things don't actually modify files, even text editors tend to save a new version and rename over the top. We had to update relatively little of our massive code base when switching over to this. For us that was a big win; moving to an S3-like interface would have required updating a lot of code.
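For anyone wondering what "rename over the top" looks like in practice, here's a minimal sketch of the pattern (my own illustration, not TernFS code): the old file is never modified in place, a new file is written and then renamed onto the old path.

    // Sketch of the save-then-rename pattern most tools use; not TernFS code.
    package sketch

    import "os"

    func atomicSave(path string, data []byte) error {
        tmp := path + ".tmp" // hypothetical temp name; real tools pick unique names
        if err := os.WriteFile(tmp, data, 0o644); err != nil {
            return err
        }
        // The rename is the only change the filesystem sees: the old file is
        // never written to, it is simply replaced by an entirely new one.
        return os.Rename(tmp, path)
    }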
Directory creation/deletion is "slow", currently limited to about 10,000 operations per second. We don't currently need to create more than 10,000 directories per second, so we just haven't prioritised improving that. There is an issue open, #28, which would get this up to 100,000 per second. This is the sort of thing that, like access control, I would love to have had in an initial open source release, but we prioritised open sourcing what we have over getting it perfect.
> The reality is that most things don't actually modify files, even text editors tend to save a new version and rename over the top.
It is essentially copy-on-write exposed at the user level. The only issue is that this breaks hard links, so tools that rely on them are going to break. But yes, custom code should be easy to adapt.
Yes, hard links aren't supported in TernFS. They would actually be really difficult to make work in this kind of sharded metadata design, as they would need to be reference counted and all the operations would need to go via the CDC. It wouldn't really have matched the design philosophy of simple and predictable performance.
Well, that's at least consistent. If hard links aren't even supported, you can't break hard links by replacing a file with a new one through renaming either.
We can saturate the network interfaces of our flash boxes with our very simple Go block server, because it uses sendfile under the hood. It would be easy to switch to RDMA (it’s just a transport layer change), but so far we haven’t needed to. We’ve had to make some difficult prioritisation decisions here.
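To sketch what "sendfile under the hood" means in Go (an illustration, not the actual TernFS block server): when io.Copy is handed a *net.TCPConn and an *os.File, the standard library uses sendfile(2) on Linux, so the block bytes go from the page cache to the NIC without ever being copied into user space.

    // Illustrative only, not the real block server code.
    package sketch

    import (
        "io"
        "net"
        "os"
    )

    func serveBlock(conn *net.TCPConn, path string) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        // io.Copy detects the *net.TCPConn / *os.File pair and calls
        // sendfile(2) on Linux instead of copying through a userspace buffer.
        _, err = io.Copy(conn, f)
        return err
    }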
PRs welcome!
> Implementing distributed consensus correctly from scratch is very hard - why not use some battle-tested implementations?
We’re used to building things like this; trading systems are giant distributed systems with shared state operating at millions of updates per second. We also cheated: right now there is no automatic failover enabled. Failures are rare and we will only enable that post-Jepsen.
If we used somebody else’s implementation we would never be able to do the multi-master stuff that we need to equalise latency for non-primary regions.
> This is not true for NFSv3 and older, it tends to be stateless (no notion of open file).
Even NFSv3 needs a duplicate request cache because requests are not idempotent. Idempotency of all requests is hard to achieve but rewarding.
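A toy sketch of what a duplicate request cache buys you (my own illustration, not NFS or TernFS code): the server keys each non-idempotent operation by a client-chosen request ID and replays the cached reply on a retransmission instead of executing the operation twice.

    // Toy duplicate request cache; real ones also bound memory and expire entries.
    package sketch

    import "sync"

    type reply struct {
        status int
        body   []byte
    }

    type drc struct {
        mu   sync.Mutex
        seen map[uint64]reply // keyed by client-chosen request ID
    }

    func (c *drc) handle(reqID uint64, exec func() reply) reply {
        c.mu.Lock()
        defer c.mu.Unlock() // held across exec for simplicity in this toy
        if r, ok := c.seen[reqID]; ok {
            return r // retransmission: replay the cached reply, don't re-execute
        }
        r := exec()
        c.seen[reqID] = r
        return r
    }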
The seamless realtime intercontinental replication is a key feature for us, maybe the most important single feature, and AFAIK you can’t do that with Ceph (even if Ceph could scale to our original 10 exabyte target in one instance).
It's just not optimised for tiny files. It absolutely would work: you could use it to store 100 billion 1kB files with zero problems (and that is 100 terabytes of data, probably on flash, so no joke). However you can't use it to store 1 exabyte of 1 kilobyte files (at least not yet).
That’s not correct. The frequency provides an indication of how much power is being drawn. If the frequency is low then generators supply more power to bring it back to 60 Hz.
If there were not a well-known fixed frequency, it would be impossible to evenly distribute load over power stations. All generators have a %load vs. frequency delta curve built into them which is precisely calibrated.
The frequency does not provide an indication of load. The frequency can be 60.00 Hz with 20,000 MW load in Ontario or with 10,000 MW.
Changes in frequency provide a measure of changes in the balance between generation and load.
The generator’s prime mover’s governor has a droop function set so that typically a 5% change in frequency will result in a 100% change in output. This is how most generators on the grid arrest changes in frequency, but they would not restore the frequency to 60Hz. The droop allows for a steady state frequency error.
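As a worked example of that droop relation (illustrative numbers only): with 5% droop, a unit's output change is the per-unit frequency error divided by the droop, so a 0.3 Hz dip on a 60 Hz system asks a 100 MW machine for an extra 10 MW, and only a full 3 Hz dip would call for 100% of its rating.

    // Worked sketch of governor droop; the numbers are illustrative.
    package main

    import "fmt"

    // droopResponse returns the change in output (in the units of pRated) for a
    // frequency deviation deltaF, using deltaP = -(deltaF/fNominal)/droop * pRated.
    func droopResponse(deltaF, fNominal, droop, pRated float64) float64 {
        return -(deltaF / fNominal) / droop * pRated
    }

    func main() {
        // A 0.3 Hz dip on a 60 Hz system with 5% droop: +10 MW from a 100 MW unit.
        fmt.Println(droopResponse(-0.3, 60, 0.05, 100)) // 10
    }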
A handful of special generators are used to restore the frequency to 60Hz or balance the generation and load in an area.
The precise frequency does not matter: if one generator thinks the frequency is 59.99 and another thinks it is 60.01, their outputs will only be a little higher and lower than their load setpoint. It does not matter if they share changes in load perfectly evenly, so long as generators on the system in bulk respond according to their capabilities.
Yes, they use the frequency as a common signal to respond to changes in load, that is correct.
With GPS-synchronized clocks and high speed waveform measurement we can see the propagation delay in the frequency across the country when there is a big event. Pretty neat!
Ah, I didn't know that they operate a derivatives exchange there too.
In any case, they are the ones that operate the electronic NYSE exchange in New York, and that market is definitely located in New Jersey; neither is in Atlanta.