Magic Pocket: Dropbox’s exabyte-scale blob storage system (infoq.com)
126 points by rbanffy on May 17, 2023 | 85 comments


Fun bit of trivia: The name "Magic Pocket" comes from the very first Dropbox demo video. Back when the site was still on getdropbox.com. I believe the video was on the homepage when Dropbox was launched on HN. Here's a copy of it: https://www.youtube.com/watch?v=xy9nSnalvPc


Thought it might have come from the Bitmap Brothers' game: https://en.m.wikipedia.org/wiki/Magic_Pockets


The Bitmap Brothers! Xenon came bundled with my Atari 1040ST and Magic Pockets was the first game I paid for with my own money. Good times.


Triggering fond memories... Thanks!


Or Doraemon


Related:

Optimizing Magic Pocket for cold storage - https://news.ycombinator.com/item?id=19841887 - May 2019 (13 comments)

Dropbox Extending Magic Pocket with SMR Drive Deployment - https://news.ycombinator.com/item?id=17300661 - June 2018 (1 comment)

Inside the Magic Pocket - https://news.ycombinator.com/item?id=11645536 - May 2016 (29 comments)

Scaling to exabytes and beyond - https://news.ycombinator.com/item?id=11283064 - March 2016 (6 comments)

Dropbox’s Exodus from the Amazon Cloud - https://news.ycombinator.com/item?id=11282948 - March 2016 (240 comments)


Magic Pocket has been around for a while [0]! I'm curious if there is anything especially new that prompted the QCon talk and this blog post or if it was just a good time?

[0] https://dropbox.tech/infrastructure/inside-the-magic-pocket


Author here. We do try to post updates as much as we are able to in our blog: https://dropbox.tech/tag-results.magic-pocket. While the talk did go through details of the system we've covered in the past, the purpose of the talk was to convey my personal learnings from managing such a system at this scale. See key takeaways here: https://qconsf.com/presentation/oct2022/magic-pocket-dropbox.... Sustaining a high rate of growth while maintaining high availability, durability, and efficiency at this scale is very difficult to do.


I'm confused by this part:

    The system can run on any HDDs, but primarily runs on Shingled Magnetic Recording disks.
SMR is just one of the latest HDD technologies. Wikipedia tells me that it is about 10 years old now. I cannot believe the implementation of the storage hardware is going to have any effect on how the software service runs. What am I missing? This sounds like a bit of "tech mumbo jumbo" on the lead-in. To be clear: I am not doubting the impressiveness of this system!

The conclusion is also pithy. I like it.

    Managing Magic Pocket, four key lessons have helped us maintain the system:

        Protect and verify
        Okay to move slow at scale
        Keep things simple
        Prepare for the worst


Woah, I just found this on the SMR Wiki page: https://en.wikipedia.org/wiki/Shingled_magnetic_recording

    The higher density of SMR drives, combined with its random-read nature, fills a niche between the sequential-access tape storage and the random-access conventional hard drive storage. They are suited to storing data that are unlikely to be modified, but need to be read from any point efficiently. One example of the use case is Dropbox's Magic Storage system, which runs the on-disk extents in an append-only way. Device-managed SMR disks have also been marketed as "Archive HDDs" due to this property.
Cool! I was probably wrong in my previous post: The hardware implementation may truly affect the software.
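To make that concrete, here's a tiny sketch (purely illustrative, not Dropbox's actual on-disk format; all names and sizes are made up) of what an append-only extent looks like: blobs are only ever appended to an open extent, and once the extent fills up it is sealed and never rewritten in place, which is exactly the access pattern SMR likes.

    # Toy illustration of an append-only extent, the access pattern that makes
    # SMR practical: data is only ever appended, never rewritten in place.
    # This is NOT Dropbox's real format; names and sizes are invented.

    class Extent:
        def __init__(self, capacity=1 << 30):    # pretend extents are 1 GiB
            self.capacity = capacity
            self.buf = bytearray()
            self.sealed = False
            self.index = {}                       # blob_id -> (offset, length)

        def append(self, blob_id, data):
            if self.sealed or len(self.buf) + len(data) > self.capacity:
                self.sealed = True                # full extents are sealed, never modified
                return False
            self.index[blob_id] = (len(self.buf), len(data))
            self.buf += data                      # strictly sequential writes
            return True

        def read(self, blob_id):                  # random reads stay cheap on SMR
            offset, length = self.index[blob_id]
            return bytes(self.buf[offset:offset + length])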


It's not mumbo jumbo at all. Please read the rest of that wiki article. This is not a regular HDD. Buy a few, build a RAID out of them, then try to rebuild a drive while the front-end is online. You'll probably get data corruption and everything will crash.

It's an HDD technology that has to be combined with a very specific type of logical volume management and protection workload, and is very cheap at the tradeoff of these limitations.

Building a large, fault-tolerant storage system on top of this "just another HDD technology," as you call it, is not something you can do at home with your normal tools. They did it, and they saved a lot of cost because of that.

Do you ever walk into a meeting with an engineer SME talking about a topic which you had to look up on wiki, and because you don't understand what he's saying you dismiss it as "tech mumbo jumbo?" You will be very successful as middle-management at a large corporation.


> and is very cheap at the tradeoff of these limitations

Where can I get SMR drives that are a good deal? And what percent price improvement are you seeing?

When I look at the drives I can easily buy, the bigger models are all CMR and the smaller models keep having SMR snuck in without any notable price reduction.


I suspect they only bother selling large capacity SMR drives to hyperscalers, because at high capacities (e.g. 8TB+), the workload for non-hyperscaler customers is usually something like RAID5 or ZFS, which is terrible on SMR drives. Hyperscalers don't have this problem, because they have custom software that can use features like ZNS to make SMR drives bearable, and on the lower end, there are enough "normies" (for lack of a better term) who are just using their drives to copy files back and forth, and probably won't hammer the drives with RAID5 rebuilds or ZFS resilvers.


Are RAID5 rebuilds a problem? I thought the issue with ZFS was how scattered the writes are (with work happening on improving that), and RAID should just be a linear sweep.


No idea where you or I can get them, but for low-tier enterprise archival storage, vendors either put them into their own archival-tier arrays, or buy directly from the HDD vendors.

I think of SMR like Optane. A bunch of consumer-level tech guys w/o a real job got all lit up and started arguing about it online, but its target market was not the "NewEgg website" market, it was the "million dollar storage array from IBM" market.

Remember, there are many workloads that are "write this on disk and keep it forever" - even historical data in a data warehouse showing soda sales by season.

I honestly haven't bought any storage or compute in 25 years - my personal crap is old "enterprise/business line" junk from work.

Where you, as a consumer, can get them - don't. I don't think they're meant to be used by consumers. You're gonna have a bad time unless you write software specific to this weird type of drive. What you might find out there is some kind of a storage 'consumer NAS in a box' product that has tiered storage inside, with these drives and other ones. But stay away from drives on their own w/o that special software in front.


Though as a consumer it was easy to buy an optane card.

Not the DIMMs as much, but those had horrible prices per gig, and SMR is supposedly a way to save money.

> Where you, as a consumer, can get them - don't. I don't think they're meant to be used by consumers. You're gonna have a bad time unless you write software specific to this weird type of drive. What you might find out there is some kind of a storage 'consumer NAS in a box' product that has tiered storage inside, with these drives and other ones. But stay away from drives on their own w/o that special software in front.

That would be a reasonable statement if they weren't selling SMR drives to customers without even labeling them as such. Just raw drives that sometimes have the performance go to hell. If you wanted fast you should have gotten an SSD, I guess.

It's not that I can't easily get an SMR drive, it's that I can't get the bigger models and I can't get the price savings.


Author here. We recently published a post about our last 4 years of SMR usage here: https://dropbox.tech/infrastructure/four-years-of-smr-storag.... Note that just 5 years ago SMR technology was rather nascent and a lot of the software support was not great. Using SMR is possible for us without penalty only because of our sequential write workloads.

We use a custom disk format along with libzbc, but libzbd now provides many advantages, which we are looking to adopt. I did want the QCon talk to have some super straight-to-the-point conclusions, and these, I believe, are what have saved us the most since I have been on the team, largely due to the sheer scale of managing such a system.
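For anyone unfamiliar with host-managed SMR, here's a toy mental model of a zone (this is not the libzbc/libzbd API, just a sketch of the constraint those libraries expose): writes must land exactly at the zone's write pointer, and the only way to reclaim space is to reset the whole zone, which is why an append-only workload maps onto it so naturally.

    # Toy model of a host-managed SMR zone (not the libzbc/libzbd API, just the
    # constraint those libraries expose): writes land at the zone's write
    # pointer, and the only way to reclaim space is to reset the entire zone.
    # Anything on top must batch deletes and copy live data elsewhere first.

    class Zone:
        def __init__(self, size=256 << 20):       # pretend zones are 256 MiB
            self.size = size
            self.write_pointer = 0
            self.data = bytearray(size)

        def write(self, payload):
            if self.write_pointer + len(payload) > self.size:
                raise IOError("zone full; reset before reuse")
            start = self.write_pointer             # only sequential writes allowed
            self.data[start:start + len(payload)] = payload
            self.write_pointer += len(payload)
            return start                           # caller records the offset

        def reset(self):                           # discards everything in the zone
            self.write_pointer = 0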


SMR has high latency when you need to rewrite data on a shared "shingle" - so my guess is they designed the storage system to avoid those rewrites for better performance?


Is the diagram in the Zone section wrong? It looks like the Cell diagram below has been used by mistake.

Super interested to understand this better but having a hard time following the article. Is the hash index replicated across regions/zones, or are clients responsible for routing their requests to the correct zone?


You're right, it is. I was confused as well.


So is this thing using parts of Ceph? Haven’t really seen OSD term used anywhere else…


We designed and built the system almost completely ground-up, all the way down to the disk scheduler, but yeah we probably took the term OSD from Ceph.


I think that's right


(kmod was the founding engineer on the project and the guy who came up with the name!)


The project was originally codenamed "s3box" and then changed for obvious reasons haha. I think Drew picked the name "Magic Pocket"


Lustre used that nomenclature long before Ceph.


https://www.lustre.org/

> The Lustre® file system is an open-source, parallel file system that supports many requirements of leadership class HPC simulation environments. Whether you’re a member of our diverse development community or considering the Lustre file system as a parallel file system solution, these pages offer a wealth of resources and support to meet your needs.

Hadn't heard of it.


It's normally used for supercomputers


Sounds interesting, but I can't find the github link...


We had a lot of requests for open sourcing MP but I don't think it would have been very useful. The code itself is interesting to look at but the system is designed for very large data sets, has a lot of moving parts to manage and configure, and would be very difficult to operate at anything below at least double-digit petabytes.

MP was designed for multiple exabytes across many data centers and geographic regions. A different system design (or just using S3) would be more appropriate for use at smaller scales.


Where I work, we deal with double to perhaps eventually triple digit petabytes. I was looking for the GitHub link. :(


This is an impressive product, and I apologize, but I'm gonna go on a bit of a rant about the PR language.

I hate the phrase "Our system has over twelve 9s of durability." Amazon was the first motherfucker to claim this, and the other cloud storage folks are also culpable, but at least they mostly had the modesty to add some weasel words like "designed for" and didn't just straight up claim there was less than a 1 in a trillion chance of a durability failure.

You don't have twelve 9s of durability. Your collection of copies of data on the hard drives does, assuming they exist in a vacuum and nothing bad happens to them except the normal sorts of hard drive failures, which are nice and completely independent. But that completely ignores all other sources of problems, and those are so many orders of magnitude more common that you might as well claim "God-given, perfect durability" because it'd be just as accurate.


Yeah we have a few talks about this and a chapter about this very issue in https://www.oreilly.com/library/view/seeking-sre/97814919788.... Totally agree that in a well designed system the sources of data loss are certainly not disk failures.

As far as I know Magic Pocket has had 100% durability, but that's obviously beside the point.


"It’s fairly easy to design a system with astronomically high durability numbers. 24 nines is a mean time to failure of 1,000,000,000,000,000,000,000,000 years. When your MTTF dwarfs the age of the universe then it might be time to reevaluate your priorities.

Should we trust these numbers though? Of course not, because the secret truth is that adherence to theoretical durability estimates is missing the point. They tell you how likely you are to lose data due to routine disk failure, but routine disk failure is easy to model for and protect against. If you lose data due to routine disk failure you’re probably doing something wrong."

https://medium.com/@jamesacowling/how-many-nines-is-my-stora...
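As a quick back-of-envelope illustration of why those routine-failure durability numbers come out so astronomically high (the numbers below are made up, and the independence assumption is exactly the part being criticized):

    # Back-of-envelope durability model with made-up numbers: with 4-of-8 erasure
    # coding, a stripe is lost only if 5 or more of its 8 disks fail within the
    # same repair window. Treating failures as independent (the assumption the
    # article criticizes), the loss probability is a tiny binomial tail.

    from math import comb

    afr = 0.02                       # assumed 2% annual failure rate per disk
    repair_days = 1                  # assumed time to rebuild a failed disk
    p = afr * repair_days / 365      # chance a given disk dies inside one window
    n, k = 8, 4                      # 8 shards written, any 4 can rebuild the data

    lost = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n - k + 1, n + 1))
    annual = 1 - (1 - lost) ** 365   # ~365 independent repair windows per year
    print(f"annual stripe-loss probability ~ {annual:.1e}")   # absurdly small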


Of course, these systems are always designed for that. Just like every system is designed for a certain amount of availability. Also, even in a vacuum, things would degrade over time due to bit rot, etc. That's why the article mentions protections such as verifiers, and safeguards against other things such as accidental deletions due to potential bugs.


magic pocket's tech lead (disclaimer: my cofounder at convex) has a whole talk on this concept of "durability theater" [1]!

the tldr is that those numbers of 9s are just table stakes. no system should ever lose data due to routine disk failures. so then, as you mention, there's another whole art to mitigating those other sources of problems.

[1] https://www.facebook.com/atscaleevents/videos/17416916227706...


We should also clarify that Sujay and I are part of the old old team :) The current team have been doing an awesome job since then!

(a bunch of the early MP folks work at Convex now)


> 99.99% availability

That seems... really bad, for the Core Service that Everything Depends On.


By comparison the old US POTS phone system was required to maintain downtime of less than five minutes per decade. I have no idea what rules, if any, apply to the modern wireless systems.

Note: The consent decree, the actual legal obligation of the Bell system, actually specified the percentage of times someone would pick up the phone and not get a dial tone or operator. It was surprisingly high -- IIRC something like 2%, which is why you can see people toggling the hook in old movies. The five minutes per decade constraint I know because some of my old customers made digital phone switches and they had to provide that SLA to their phone company customers, or else not get the order.


I'm having trouble finding any reference for the "five minutes per decade" downtime limit. That would be "6 9s" or 99.9999% uptime, which is just crazy in today's world. Even today VOIP providers only claim 99.999%, though it seems like in practice many fail to even get close to that. I think I found the 756-page consent decree between Bell and the US you mentioned, but I can't find a reference there either, though it is quite a massive doc with so-so OCR: https://www.google.com/books/edition/Consent_Decree_Program_...

Stepping back a bit, it seems like the FCC would be in charge of establishing rules like this but that consent decree I found (which might be the wrong one), is with the house Antitrust committee. (After my own Googling failed I also asked GPT-4 and it isn't aware of this either)


Well as I noted in my comment the consent decree itself IIRC specified availability rather than uptime.

However the 5 min figure comes from my customers like DSC (R.I.P), Ericsson and Nokia building POTS switches. These guys were deadly serious, like the folks who made spacecraft and medical devices, plus the Ericsson and Nokia folks were nice too.

In DSC’s case they were so paranoid that they paid us a massive amount to maintain a special tool chain just for them. It was frozen in time (no upgrades) and when they reported a bug and we sent them an updated tool chain they diffed the binaries and made sure that every delta was due to the bug fix and nothing else (that the dev hadn’t snuck in some other patch for some reason)! They did some other headstands with their hardware and software, but in the end it didn’t save them.

It’s cool that the consent decree is online. That arrangement with the Bell system was very clever, though it led to a lot of weird anomalies and distortions but I think it did end up with a better phone system than the PTT model. It’s also been a better model than what’s happened with the power and water utilities.

The FCC back then was a better regulator for phone customers than the (now obsolete) ICC had been.


We kept a data closet in Manhattan in a sister building to 33 Thomas. Customers would ask for a BRP. We had N+2 for data-level stuff and higher, but some customers would get hung up on not having a second location. We took the stance that if the facility fails, we all have bigger issues.

We had one 18-minute issue in ten years of use. It was the best facility we ever used.


I remember in the late 90s/early 00s a lot of cases when sites would go down (everybody self-hosted) due to backhoe events and the like.

These companies weren't idiots: they had replication for their databases and leased redundant transmission service because they knew this kind of thing could happen. The problem is you'd buy transmission from two providers...whose fiber turned out to be in the same conduit, or even both had rented bandwidth on the same fiber.

At least people are smarter these days, and with widespread cloud service there are fewer people who need to keep track of this stuff.


They could be serious enough about it, and do everything right, and still have an outage. https://news.ycombinator.com/item?id=34665023


When I interned at a phone company I was surprised by the lack of hardware redundancy.

The equipment was trusted to not fail in its lifetime.


I think they got 5 nines mixed up with 6 nines. The 5 nines is 5 minutes per year. I asked ChatGPT and it claims they had a target of 5 nines which makes sense.


That's 5 minutes a year, not decade. (I worked in telecom for a very long time.)

Telco basically set the bar for "5 9s."


That was before we had to update every system every day :-)


How is that bad? 99.99% means the system is down for less than an hour each year. https://uptime.is/99.99
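For reference, here's the quick conversion (a rough sketch assuming "available" simply means the fraction of time the service answers, not how Dropbox actually measures it):

    # Quick conversion from availability percentage to allowed downtime.

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in (99.0, 99.9, 99.99, 99.999):
        downtime = (1 - nines / 100) * MINUTES_PER_YEAR
        print(f"{nines}% availability -> about {downtime:.0f} minutes of downtime per year")
    # 99.99% works out to roughly 53 minutes a year, i.e. under an hour.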


of course, it actually depends on how they define "up."


S3 standard storage class also says it's 99.99% availability so ... I wonder how hard it is to get beyond that?

https://aws.amazon.com/s3/faqs/


It's extremely hard to get availability higher than that. Possible but not something companies are willing to promise in an SLA.


I'd say it's not so much that it's really hard, it's that the compromises aren't worth it.

note, I worked on S3 2015-2017


That's a fair point. It would be quite possible if the customers were willing to pay a lot more.

I worked on DigitalOcean's object storage for about a year not long ago. Makes sense that those of us who have been in this space would be interested in this article haha.


So ... e.g. for 5 nines, the erasure code configuration would demand more writes in more locations, and the cost impact is too high?


I imagine that as you add more nines, you start to hit more problems that are more out of your control. Like, how many nines do your backup diesel generators have?


or replication if you're not erasure coding.

So yeah, you need more storage-optimized server racks and all the associated manpower and maintenance; you also need them to be distributed across different datacenters and zones, which of course also impacts latency and your ability to provide some appearance of consistency; then you also need the same distribution for the stateless services serving the data.

On and on and on and you might be nearly doubling cost to get to an extra 9 that almost all of your customers won't care about.
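As a rough sketch of the math behind that tradeoff (illustrative numbers only, not anyone's actual configuration):

    # Illustrative math only (numbers are made up): storage overhead of plain
    # replication vs erasure coding, and how read availability compounds when
    # any one of several independent zones can serve the data.

    def overhead_replication(copies):      # extra bytes stored per logical byte
        return copies - 1.0

    def overhead_erasure(n, k):            # n total shards, any k can reconstruct
        return n / k - 1.0

    def read_availability(zone_avail, zones):
        # a read succeeds if at least one zone is up, assuming independent failures
        return 1 - (1 - zone_avail) ** zones

    print(overhead_replication(3))         # 2.0  -> 200% overhead for 3x replication
    print(overhead_erasure(14, 10))        # 0.4  -> 40% overhead for a 10-of-14 code
    print(read_availability(0.999, 2))     # ~0.999999 -> two independent 99.9% zones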


It's a cost problem. You can absolutely get higher availability, but it requires more hardware, uses more networking, and it makes maintenance and feature development more expensive. Blob storage customers are very price sensitive, but they also want lots of availability. That's why you'll see various "tiers" of storage, to try and capture all the combinations of availability and storage cost that various kinds of customers might want. Pretty much nobody wants to pay twice the cost to turn an exabyte of 99.99% available data to 99.995% available data, so it's not a product.


Absent here is any statement about how "available" is defined in terms of being _meaningfully up_, like IOPS or bps.


When you build anything on it, you know it may not be available for a couple hours a year and architect around it. 99.99 isn't bad at all. Your ISP probably has worse SLAs.


Not really bad. It’s less than an hour per year. Most users would not notice.


Oof.

You're not wrong, this is really, really bad, especially for Dropbox: storage is their business, so I expected way better.

These stats are no different from S3 at all. All of this engineering and moving away from AWS, for so little gain in availability.

I was initially excited when they moved away from AWS and expected industry-leading availability, but I was wrong.

This is disappointing for Dropbox, whose main business is storing files without any minor hiccups or outages for years.


"... storage is their business so I expected way better."

Storage is our business and we target an even worse 99.95% availability[1].

Availability has a cost. That cost is complexity.

We would very much prefer to have boring outages more often than have fascinating outages very rarely.

[1] https://www.rsync.net/resources/notices/sla.html


> Availability has a cost. That cost is complexity.

It seems that Dropbox's system is complex enough, but still isn't designed to be as highly available (5x9s) as a well-architected system on AWS using S3.

Yet it is surprising that a public billion-dollar company like Dropbox cannot go beyond this for their mission-critical customers (govt agencies, healthcare, military, etc.) willing to spend hundreds of millions.

So yes, I expected way better.


> These stats are no different to S3 at all. All of this engineering and moving away from AWS and for so few gain in availability.

What makes you think a gain in availability matters or is necessarily a motivation for the project?

If they can achieve the same availability at far lower cost, it’s a win for them, which is why they would (and did) do it.


> What makes you think a gain in availability matters or is necessarily a motivation for the project?

This isn't a win for enterprise / business / mission critical customers. Governments and public services cannot use this at all.


The propaganda is wild to make people believe governments and public services have even 99% uptime, let alone 99.99%.


5x9s is more than possible and has been done. not propaganda, don't know what you've been reading.


Yes 5x9s is possible. No, governments and utilities services do not keep 5x9s.


They can, and do.


They use AWS rather than relying on the unreliability of Dropbox.


You think 9-1-1 doesn't experience downtime?


Strawman. Who said that emergency systems don't ever experience downtime? Orgs running mission-critical systems wouldn't trust a service that has less than 5x9s.

> Emergency response systems is 99.999% or “five nines” – or about five minutes and 15 seconds of downtime per year.

https://aws.amazon.com/blogs/publicsector/achieving-five-nin...

I see hospitals will be using AWS and other reliable hosts to use S3 rather than Dropbox.


Hospitals have budgets too, not all of them will spring for the 5th 9


And AWS is there for the 5th 9, not Dropbox. Therefore, Dropbox is not suitable or reliable for these kinds of orgs, even with this custom system they built.

One can have a custom-built system, or an experienced IT team can achieve the same or even better availability with AWS and other providers that care about critical availability, if architected correctly.

https://docs.aws.amazon.com/wellarchitected/latest/reliabili...


this is an overreaction. Dropbox is mostly storing files that are already stored on the customer's device, so customers usually won't notice an outage.


How is this an overreaction?

I would expect Dropbox, a file storage company that proudly invests heavily in tech and infrastructure, to achieve better availability than what they already had on AWS (99.99%).

In terms of availability the change is pretty much zero, and as a business / enterprise customer I might as well choose a different service with similar or higher 9s, or (if my needs are complex) choose S3.


Can you provide an example of an alternative service which will give higher than 4 nines for availability that an enterprise customer would pick instead if that < 1hr of downtime per year was too high?


AWS (Architected Correctly) which I am sure Dropbox has experience in.

https://aws.amazon.com/blogs/publicsector/achieving-five-nin...

Here is a service that has managed to achieve 5x9s of availability:

https://ably.com/


> achieve 5x9s of availability:

Guaranteed availability is a bet they're willing to make, a gamble they've been on top of so far, a risk that, should something fail, they will pay out on according to their SLA.


Seems to be working out for them (Ably) and this isn't a public billion dollar company like Dropbox.

I would have thought that, given Dropbox's engineering talent, they would have designed a system that accounts for 5x9s, even making that guarantee for enterprise or mission-critical customers.

Can't even find their SLAs anywhere for these customers, so I presume that Dropbox doesn't care about them.

Guess I was wrong and this is just disappointing and made the move not worth it.


I'd rather have maximum consistency and integrity than maximum availability.


Most tech these days has improved to offer maximum consistency. Availability is also important, which is why it is a good thing that Dropbox still trusts AWS, and especially S3, over their own on-prem solution.

https://aws.amazon.com/solutions/case-studies/dropbox-s3/


Do you know of any companies that actually provide better reliability in their consumer product? The ones that lie or skew their uptime calculation don't count.




