Fun bit of trivia: The name "Magic Pocket" comes from the very first Dropbox demo video, back when the site was still on getdropbox.com. I believe the video was on the homepage when Dropbox was launched on HN. Here's a copy of it: https://www.youtube.com/watch?v=xy9nSnalvPc
Magic Pocket has been around for a while [0]! I'm curious if there is anything especially new that prompted the QCon talk and this blog post or if it was just a good time?
Author here. We do try to post updates as much as we are able to on our blog: https://dropbox.tech/tag-results.magic-pocket. While the talk did go through details of the system we've covered in the past, the purpose of the talk was to convey my personal learnings from managing such a system at this scale. See the key takeaways here: https://qconsf.com/presentation/oct2022/magic-pocket-dropbox.... Sustaining a high rate of growth while maintaining high availability, durability, and efficiency at this scale is very difficult to do.
The system can run on any HDDs, but primarily runs on Shingled Magnetic Recording disks.
SMR is just one of the latest HDD technologies. Wiki tells me that it is about 10 years old now. I cannot believe the implementation of the storage hardware is going to have any effect upon how the software service runs. What am I missing? This sounds like a bit of "tech mumbo jumbo" on the lead-in. To be clear: I am not doubting the impressiveness of this system!
The conclusion is also pithy. I like it.
Managing Magic Pocket, four key lessons have helped us maintain the system:
Protect and verify
Okay to move slow at scale
Keep things simple
Prepare for the worst
The higher density of SMR drives, combined with their random-read capability, fills a niche between sequential-access tape storage and random-access conventional hard drive storage. They are suited to storing data that is unlikely to be modified but needs to be read from any point efficiently. One example of the use case is Dropbox's Magic Pocket system, which manages its on-disk extents in an append-only way. Device-managed SMR disks have also been marketed as "Archive HDDs" due to this property.
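For anyone who wants the append-only idea made concrete, here is a minimal sketch in Python (hypothetical names, not Dropbox's actual on-disk format): writes only ever go to the tail of an open extent, and a full extent is sealed and becomes read-only, which maps naturally onto SMR's sequential-write constraint while leaving random reads cheap.

    # Hypothetical illustration of an append-only extent -- just the access
    # pattern described above, not Dropbox's real format.

    class Extent:
        """A fixed-size, append-only region. Writes go to the tail; reads may
        start at any offset (SMR drives are still fine at random reads)."""

        def __init__(self, size: int):
            self.size = size
            self.buf = bytearray()
            self.sealed = False

        def append(self, blob: bytes) -> int:
            """Append a blob and return its offset, or seal the extent if full."""
            if self.sealed:
                raise IOError("extent is sealed (read-only)")
            if len(self.buf) + len(blob) > self.size:
                self.sealed = True          # caller opens a fresh extent
                raise IOError("extent full")
            offset = len(self.buf)
            self.buf += blob
            return offset

        def read(self, offset: int, length: int) -> bytes:
            """Random reads are allowed at any time, even after sealing."""
            return bytes(self.buf[offset:offset + length])

    ext = Extent(size=1 << 20)                # 1 MiB toy extent
    off = ext.append(b"hello, magic pocket")  # sequential append only
    print(ext.read(off, 5))                   # b'hello' -- random read works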
Cool! I was probably wrong in my previous post: The hardware implementation may truly affect the software.
It's not mumbo jumbo at all. Please read the rest of that wiki article. This is not a regular HDD. Buy a few, build a RAID out of them, then try to rebuild a drive while the front-end is online. You'll probably watch rebuild times blow up, drives time out and drop from the array, and everything will fall over.
It's an HDD technology that has to be combined with a very specific type of logical volume management and data-protection workload, and it is very cheap in exchange for these limitations.
Building a large, fault-tolerant storage system on top of this "just another HDD technology," as you call it, is not something you can do at home with your normal tools. They did it, and they saved a lot of cost because of that.
Do you ever walk into a meeting with an engineer SME talking about a topic which you had to look up on wiki, and because you don't understand what he's saying you dismiss it as "tech mumbo jumbo?" You will be very successful as middle-management at a large corporation.
> and is very cheap at the tradeoff of these limitations
Where can I get SMR drives that are a good deal? And what percent price improvement are you seeing?
When I look at the drives I can easily buy, the bigger models are all CMR and the smaller models keep having SMR snuck in without any notable price reduction.
I suspect they only bother selling large capacity SMR drives to hyperscalers, because at high capacities (e.g. 8TB+), the workload for non-hyperscaler customers is usually something like RAID5 or ZFS, which is terrible on SMR drives. Hyperscalers don't have this problem, because they have custom software that can use zoned-storage interfaces (ZBC/ZAC) to make SMR drives bearable, and on the lower end, there are enough "normies" (for lack of a better term) who are just using their drives to copy files back and forth, and probably won't hammer the drives with RAID5 rebuilds or ZFS resilvers.
Are RAID5 rebuilds a problem? I thought the issue with ZFS was how scattered the writes are (with work happening on improving that), and RAID should just be a linear sweep.
No idea where you or I can get them, but for low-tier enterprise archival storage, vendors either put them into their own archival-tier arrays, or buy directly from the HDD vendors.
I think of SMR like Optane. A bunch of consumer-level tech guys w/o a real job got all lit up and started arguing about it online, but its target market was not the "NewEgg website" market, it was the "million dollar storage array from IBM" market.
Remember, there are many workloads that are "write this on disk and keep it forever" - even historical data in a data warehouse showing soda sales by season.
I honestly haven't bought any storage or compute in 25 years - my personal crap is old "enterprise/business line" junk from work.
Where you, as a consumer, can get them - don't. I don't think they're meant to be used by consumers. You're gonna have a bad time unless you write software specific to this weird type of drive. What you might find out there is some kind of a storage 'consumer NAS in a box' product that has tiered storage inside, with these drives and other ones. But stay away from drives on their own w/o that special software in front.
Though as a consumer it was easy to buy an Optane card.
Not the DIMMs so much, but those had a horrible price per gig, and SMR is supposedly a way to save money.
> Where you, as a consumer, can get them - don't. I don't think they're meant to be used by consumers. You're gonna have a bad time unless you write software specific to this weird type of drive. What you might find out there is some kind of a storage 'consumer NAS in a box' product that has tiered storage inside, with these drives and other ones. But stay away from drives on their own w/o that special software in front.
That would be a reasonable statement if they weren't selling SMR drives to customers without even labeling them as such. Just raw drives that sometimes have the performance go to hell. If you wanted fast you should have gotten an SSD, I guess.
It's not that I can't easily get an SMR drive, it's that I can't get the bigger models and I can't get the price savings.
Author here. We recently published a post about our last 4 years of SMR usage here: https://dropbox.tech/infrastructure/four-years-of-smr-storag.... Note that just 5 years ago SMR technology was rather nascent and a lot of the software support was not great. Using SMR is possible for us without penalty only because of our sequential write workloads.
We use a custom disk format along with libzbc, but libzbd now provides many advantages, which we are looking to adopt. I did want the QCon talk to have some super straight-to-the-point conclusions, and these, I believe, are the lessons that have saved us the most since I have been on the team, largely due to the sheer scale of managing such a system.
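For readers who haven't touched host-managed SMR: each zone exposes a write pointer and only accepts writes at exactly that pointer, which is why strictly sequential workloads fit and in-place updates don't. Here's a toy Python model of that constraint; it assumes nothing about the real libzbc/libzbd APIs and is only meant to show the rule.

    # Toy model of a host-managed SMR zone's write-pointer rule. Illustrative
    # only; real code speaks ZBC/ZAC through libzbc or libzbd.

    class Zone:
        def __init__(self, capacity_blocks: int):
            self.capacity = capacity_blocks
            self.write_pointer = 0                 # next writable block
            self.blocks = [None] * capacity_blocks

        def write(self, lba: int, data) -> None:
            """Writes must land exactly at the write pointer (sequential only)."""
            if self.write_pointer >= self.capacity:
                raise IOError("zone full; reset it or open a new zone")
            if lba != self.write_pointer:
                raise IOError(f"unaligned write: lba={lba}, wp={self.write_pointer}")
            self.blocks[lba] = data
            self.write_pointer += 1

        def reset(self) -> None:
            """The only way to 'rewrite' is to reset the whole zone and refill it."""
            self.write_pointer = 0
            self.blocks = [None] * self.capacity

    z = Zone(capacity_blocks=4)
    z.write(0, "a")
    z.write(1, "b")
    try:
        z.write(0, "a2")        # in-place update -> rejected
    except IOError as e:
        print(e)                # unaligned write: lba=0, wp=2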
SMR has high latency when you need to rewrite data on a shared "shingle" - so my guess is they designed the storage system to avoid those rewrites for better performance?
Is the diagram in the Zone section wrong? It looks like the Cell diagram below has been used by mistake.
Super interested to understand this better but having a hard time following the article. Is the hash index replicated across regions/zones, or are clients responsible for routing their requests to the correct zone?
We designed and built the system almost completely ground-up, all the way down to the disk scheduler, but yeah we probably took the term OSD from Ceph.
> The Lustre® file system is an open-source, parallel file system that supports many requirements of leadership class HPC simulation environments. Whether you’re a member of our diverse development community or considering the Lustre file system as a parallel file system solution, these pages offer a wealth of resources and support to meet your needs.
We had a lot of requests for open sourcing MP but I don't think it would have been very useful. The code itself is interesting to look at but the system is designed for very large data sets, has a lot of moving parts to manage and configure, and would be very difficult to operate at anything below at least double-digit petabytes.
MP was designed for multiple exabytes across many data centers and geographic regions. A different system design (or just using S3) would be more appropriate for use at smaller scales.
This is an impressive product, and I apologize, but I'm gonna go on a bit of a rant about the PR language.
I hate the phrase "Our system has over twelve 9s of durability." Amazon was the first motherfucker to claim this; the other cloud storage folks are also culpable, but at least they mostly had the modesty to add some weasel words like "designed for" and didn't just straight up claim there was less than a 1 in a trillion chance of a durability failure.
You don't have twelve 9s of durability. Your collection of copies of the data on the hard drives does, assuming they exist in a vacuum and nothing bad happens to them except the normal sorts of things that cause nice, completely independent hard drive failures. But that completely ignores all other sources of problems, and those are so many orders of magnitude more common that you might as well claim "God-given, perfect durability" because it'd be just as accurate.
"It’s fairly easy to design a system with astronomically high durability numbers. 24 nines is a mean time to failure of 1,000,000,000,000,000,000,000,000 years. When your MTTF dwarfs the age of the universe then it might be time to reevaluate your priorities.
Should we trust these numbers though? Of course not, because the secret truth is that adherence to theoretical durability estimates is missing the point. They tell you how likely you are to lose data due to routine disk failure, but routine disk failure is easy to model for and protect against. If you lose data due to routine disk failure you’re probably doing something wrong."
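To make the quoted point concrete, here is the back-of-the-envelope arithmetic behind the naive independent-failure model being criticized (all numbers illustrative, in Python):

    # "Nines of durability" under the naive model: annual probability of losing
    # a given object is 10**(-nines), so the mean time to failure per object is
    # roughly the reciprocal, in years.

    def nines_to_mttf_years(nines: int) -> float:
        annual_loss_probability = 10 ** (-nines)
        return 1.0 / annual_loss_probability

    for n in (11, 12, 24):
        print(f"{n} nines -> ~{nines_to_mttf_years(n):.0e} years MTTF per object")

    # 24 nines -> ~1e+24 years, which dwarfs the ~1.4e10-year age of the
    # universe -- the quoted article's point is that the model stops meaning
    # anything long before you get there.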
Of course, these systems are always designed for that, just like every system is designed for a certain amount of availability. Also, even in a vacuum, things would degrade over time due to bit rot, etc. That's why the article mentions protections such as verifiers, as well as safeguards against other things such as accidental deletions due to potential bugs.
magic pocket's tech lead (disclaimer: my cofounder at convex) has a whole talk on this concept of "durability theater" [1]!
the tldr is that those numbers of 9s are just table stakes. no system should ever lose data due to routine disk failures. so then, as you mention, there's another whole art to mitigating those other sources of problems.
By comparison, the old US POTS phone system was required to keep downtime below five minutes per decade. I have no idea what rules, if any, apply to the modern wireless systems.
Note: The consent decree, the actual legal obligation of the Bell system, actually specified the percentage of times someone would pick up the phone and not get a dial tone or operator. It was surprisingly high -- IIRC something like 2%, which is why you can see people toggling the hook in old movies. The five minutes per decade constraint I know because some of my old customers made digital phone switches and they had to provide that SLA to their phone company customers, or else not get the order.
I'm having trouble finding any reference for the "five minutes per decade" downtime limit. That would be "6 9s" or 99.9999% uptime, which is just crazy in today's world. Even today VOIP providers only claim 99.999%, though it seems like in practice many fail to even get close to that. I think I found the 756 page consent decree between Bell and the US you mentioned but I can't find a reference there either, though it is quite a massive doc with so-so OCR: https://www.google.com/books/edition/Consent_Decree_Program_...
Stepping back a bit, it seems like the FCC would be in charge of establishing rules like this but that consent decree I found (which might be the wrong one), is with the house Antitrust committee. (After my own Googling failed I also asked GPT-4 and it isn't aware of this either)
Well as I noted in my comment the consent decree itself IIRC specified availability rather than uptime.
However the 5 min figure comes from my customers like DSC (R.I.P), Ericsson and Nokia building POTS switches. These guys were deadly serious, like the folks who made spacecraft and medical devices, plus the Ericsson and Nokia folks were nice too.
In DSC’s case they were so paranoid that they paid us a massive amount to maintain a special tool chain just for them. It was frozen in time (no upgrades), and when they reported a bug and we sent them an updated tool chain they diffed the binaries and made sure that every delta was due to the bug fix and nothing else (that the dev hadn’t snuck in some other patch for some reason)! They did some other headstands with their hardware and software, but in the end it didn’t save them.
It’s cool that the consent decree is online. That arrangement with the Bell system was very clever; though it led to a lot of weird anomalies and distortions, I think it did end up producing a better phone system than the PTT model. It’s also been a better model than what’s happened with the power and water utilities.
The FCC back then was a better regulator for phone customers than the (now obsolete) ICC had been.
We kept a data closet in Manhattan in a sister building to 33 Thomas. Customers would ask for a BRP. We had N+2 or better for data-level stuff, but some customers would get hung up on not having a second location. We took the stance that if the facility fails, we all have bigger issues.
We had one 18 minute issue in the ten years of use. It was the best facility we ever used.
I remember in the late 90s/early 00s a lot of cases when sites would go down (everybody self-hosted) due to backhoe events and the like.
These companies weren't idiots: they had replication for their databases and leased redundant transmission service because they knew this kind of thing could happen. The problem is you'd buy transmission from two providers...whose fiber turned out to be in the same conduit, or even both had rented bandwidth on the same fiber.
At least people are smarter these days, and with widespread cloud service there are fewer people who need to keep track of this stuff.
I think they got 5 nines mixed up with 6 nines. Five nines is about 5 minutes of downtime per year. I asked ChatGPT and it claims they had a target of 5 nines, which makes sense.
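The arithmetic is easy to sanity-check either way:

    # Quick sanity check of the nines-vs-downtime figures in this subthread.

    MINUTES_PER_YEAR = 365.25 * 24 * 60          # ~525,960

    def downtime_minutes(nines: int, years: float = 1.0) -> float:
        unavailability = 10 ** (-nines)
        return unavailability * MINUTES_PER_YEAR * years

    print(f"five nines: {downtime_minutes(5):.1f} min/year")       # ~5.3
    print(f"six nines : {downtime_minutes(6, 10):.1f} min/decade") # ~5.3

    # So "five minutes per year" is roughly five nines and "five minutes per
    # decade" is roughly six nines -- both readings of the POTS story are
    # internally consistent; the question is which one the requirement was.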
That's a fair point. It would be quite possible if the customers were willing to pay a lot more.
I worked on DigitalOcean's object storage for about a year not long ago. Makes sense that those of us who have been in this space would be interested in this article haha.
I imagine that as you add more nines, you start to hit more problems that are more out of your control. Like, how many nines do your backup diesel generators have?
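That intuition can be made concrete: when a service depends serially on several components, its availability is roughly the product of theirs, so one weaker dependency (generators, network, DNS) caps the whole stack. A toy calculation with made-up numbers:

    # Toy serial-availability calculation with made-up component numbers, to
    # show how one weaker dependency caps the overall nines.

    import math

    def combined_availability(*components: float) -> float:
        return math.prod(components)

    storage = 0.999999   # six nines (illustrative)
    network = 0.99999    # five nines (illustrative)
    power   = 0.9999     # four nines, backup generators included (illustrative)

    overall = combined_availability(storage, network, power)
    print(f"overall availability: {overall:.6f}")               # ~0.999889
    print(f"downtime: ~{(1 - overall) * 525960:.0f} min/year")  # ~58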
So yeah, you need more storage-optimized server racks and all the associated manpower and maintenance; you also need them to be distributed across different datacenters and zones, which of course also impacts latency and your ability to provide some appearance of consistency; then you also need the same distribution for the stateless services serving the data.
On and on and on and you might be nearly doubling cost to get to an extra 9 that almost all of your customers won't care about.
It's a cost problem. You can absolutely get higher availability, but it requires more hardware, uses more networking, and it makes maintenance and feature development more expensive. Blob storage customers are very price sensitive, but they also want lots of availability. That's why you'll see various "tiers" of storage, to try and capture all the combinations of availability and storage cost that various kinds of customers might want. Pretty much nobody wants to pay twice the cost to turn an exabyte of 99.99% available data to 99.995% available data, so it's not a product.
When you build anything on it, you know it may be unavailable for up to about an hour a year and you architect around it. 99.99 isn't bad at all. Your ISP probably has worse SLAs.
> Availability has a cost. That cost is complexity.
It seems that Dropbox's system is complex enough, yet it still isn't designed to be as highly available (5x9s) as a well-architected system on AWS using S3.
Yet it is surprising that a public, billion-dollar company like Dropbox cannot go beyond this for their mission-critical customers (govt agencies, healthcare, military, etc.) willing to spend hundreds of millions.
Strawman. Who said that emergency systems don't ever experience downtime? Orgs using mission critical systems would trust a service that has less than 5x9s.
> Emergency response systems is 99.999% or “five nines” – or about five minutes and 15 seconds of downtime per year.
And AWS is there for the 5th 9, not Dropbox. Therefore, Dropbox is not suitable or reliable enough for these kinds of orgs, even with this custom system they built.
One can have a custom-built system, or an experienced IT team on AWS, and achieve the same or even better availability with AWS and other providers that care about critical availability, if architected correctly.
this is an overreaction. Dropbox is mostly storing files that are already stored on the customer's device, so customers usually won't notice an outage.
I would expect Dropbox, a file storage company that proudly invests heavily in tech and infrastructure, to achieve better availability than what they already had on AWS (99.99%).
In terms of availability the change is pretty much 0 and as a business / enterprise customer I might as well choose a different service with similar or higher 9s or (if my needs are complex) choose S3.
Can you provide an example of an alternative service which will give higher than 4 nines for availability that an enterprise customer would pick instead if that < 1hr of downtime per year was too high?
Guaranteed availability is a bet they're willing to make, a gamble they've been on top of so far, a risk that, should something fail, they will pay out on according to their SLA.
Seems to be working out for them (Ably) and this isn't a public billion dollar company like Dropbox.
I would have thought that, given Dropbox's engineering talent, they would have designed a system that accounts for 5x9s and even makes that guarantee for enterprise or mission-critical customers.
Can't even find their SLAs anywhere for these customers, so I presume that Dropbox doesn't care about them.
Guess I was wrong and this is just disappointing and made the move not worth it.
Most tech these days has improved to offer strong consistency; availability is also important, which is why it is a good thing that Dropbox still trusts AWS, and especially S3, more than their own on-prem solution.
Do you know of any companies that actually provide better reliability in their consumer product? The ones that lie or skew their uptime calculation don't count.