Magic Pocket: Dropbox’s exabyte-scale blob storage system (infoq.com)
126 points by rbanffy on May 17, 2023 | 85 comments


Fun bit of trivia: The name "Magic Pocket" comes from the very first Dropbox demo video. Back when the site was still on getdropbox.com. I believe the video was on the homepage when Dropbox was launched on HN. Here's a copy of it: https://www.youtube.com/watch?v=xy9nSnalvPc


Thought it might have come from the Bitmap Brothers' game: https://en.m.wikipedia.org/wiki/Magic_Pockets


The Bitmap Brothers! Xenon came bundled with my Atari 1040ST and Magic Pockets was the first game I paid for with my own money. Good times.


Triggering fond memories... Thanks!


Or Doraemon


Related:

Optimizing Magic Pocket for cold storage - https://news.ycombinator.com/item?id=19841887 - May 2019 (13 comments)

Dropbox Extending Magic Pocket with SMR Drive Deployment - https://news.ycombinator.com/item?id=17300661 - June 2018 (1 comment)

Inside the Magic Pocket - https://news.ycombinator.com/item?id=11645536 - May 2016 (29 comments)

Scaling to exabytes and beyond - https://news.ycombinator.com/item?id=11283064 - March 2016 (6 comments)

Dropbox’s Exodus from the Amazon Cloud - https://news.ycombinator.com/item?id=11282948 - March 2016 (240 comments)


Magic Pocket has been around for a while [0]! I'm curious if there is anything especially new that prompted the QCon talk and this blog post or if it was just a good time?

[0] https://dropbox.tech/infrastructure/inside-the-magic-pocket


Author here. We do try to post updates as much as we are able to in our blog: https://dropbox.tech/tag-results.magic-pocket. While the talk did go through details of the system we've covered in the past, the purpose of the talk was to convey my personal learnings from managing such a system at this scale. See key takeaways here: https://qconsf.com/presentation/oct2022/magic-pocket-dropbox.... Sustaining a high rate of growth while maintaining high availability, durability, and efficiency at this scale is very difficult to do.


I'm confused by this part:

    The system can run on any HDDs, but primarily runs on Shingled Magnetic Recording disks.
SMR is just one of the latest HDD technologies. Wikipedia tells me that it is about 10 years old now. I cannot believe the implementation of the storage hardware is going to have any effect on how the software service runs. What am I missing? This sounds like a bit of "tech mumbo jumbo" on the lead-in. To be clear: I am not doubting the impressiveness of this system!

The conclusion is also pithy. I like it.

    Managing Magic Pocket, four key lessons have helped us maintain the system:

        Protect and verify
        Okay to move slow at scale
        Keep things simple
        Prepare for the worst


Woah, I just found this on the SMR Wiki page: https://en.wikipedia.org/wiki/Shingled_magnetic_recording

    The higher density of SMR drives, combined with its random-read nature, fills a niche between the sequential-access tape storage and the random-access conventional hard drive storage. They are suited to storing data that are unlikely to be modified, but need to be read from any point efficiently. One example of the use case is Dropbox's Magic Storage system, which runs the on-disk extents in an append-only way. Device-managed SMR disks have also been marketed as "Archive HDDs" due to this property.
Cool! I was probably wrong in my previous post: The hardware implementation may truly affect the software.
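To make that concrete, here's a tiny sketch (purely illustrative, not Dropbox's actual on-disk format; all names and sizes are made up) of what an append-only extent looks like: blobs are only ever appended to an open extent, and once the extent fills up it is sealed and never rewritten in place, which is exactly the access pattern SMR likes.

    # Toy illustration of an append-only extent, the access pattern that makes
    # SMR practical: data is only ever appended, never rewritten in place.
    # This is NOT Dropbox's real format; names and sizes are invented.

    class Extent:
        def __init__(self, capacity=1 << 30):    # pretend extents are 1 GiB
            self.capacity = capacity
            self.buf = bytearray()
            self.sealed = False
            self.index = {}                       # blob_id -> (offset, length)

        def append(self, blob_id, data):
            if self.sealed or len(self.buf) + len(data) > self.capacity:
                self.sealed = True                # full extents are sealed, never modified
                return False
            self.index[blob_id] = (len(self.buf), len(data))
            self.buf += data                      # strictly sequential writes
            return True

        def read(self, blob_id):                  # random reads stay cheap on SMR
            offset, length = self.index[blob_id]
            return bytes(self.buf[offset:offset + length])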


It's not mumbo jumbo at all. Please read the rest of that wiki article. This is not a regular HDD. Buy a few, build a RAID out of them, then try to rebuild a drive while the front-end is online. You'll probably get data corruption and everything will crash.

It's an HDD technology that has to be combined with a very specific type of logical volume management and protection workload, and is very cheap at the tradeoff of these limitations.

Building a large, fault-tolerant storage system on top of this "just another HDD technology," as you call it, is not something you can do at home with your normal tools. They did it, and they saved a lot of cost because of that.

Do you ever walk into a meeting with an engineer SME talking about a topic which you had to look up on wiki, and because you don't understand what he's saying you dismiss it as "tech mumbo jumbo?" You will be very successful as middle-management at a large corporation.


> and is very cheap at the tradeoff of these limitations

Where can I get SMR drives that are a good deal? And what percent price improvement are you seeing?

When I look at the drives I can easily buy, the bigger models are all CMR and the smaller models keep having SMR snuck in without any notable price reduction.


I suspect they only bother selling large capacity SMR drives to hyperscalers, because at high capacities (e.g. 8TB+), the workload for non-hyperscaler customers is usually something like RAID5 or ZFS, which is terrible on SMR drives. Hyperscalers don't have this problem, because they have custom software that can use features like ZNS to make SMR drives bearable, and on the lower end, there are enough "normies" (for lack of a better term) who are just using their drives to copy files back and forth, and probably won't hammer the drives with RAID5 rebuilds or ZFS resilvers.


Are RAID5 rebuilds a problem? I thought the issue with ZFS was how scattered the writes are (with work happening on improving that), and RAID should just be a linear sweep.


No idea where you or I can get them, but for low-tier enterprise archival storage, vendors either put them into their own archival-tier arrays, or buy directly from the HDD vendors.

I think of SMR like Optane. A bunch of consumer-level tech guys w/o a real job got all lit up and started arguing about it online, but its target market was not the "NewEgg website" market, it was the "million dollar storage array from IBM" market.

Remember, there are many workloads that are "write this on disk and keep it forever" - even historical data in a data warehouse showing soda sales by season.

I honestly haven't bought any storage or compute in 25 years - my personal crap is old "enterprise/business line" junk from work.

Where you, as a consumer, can get them - don't. I don't think they're meant to be used by consumers. You're gonna have a bad time unless you write software specific to this weird type of drive. What you might find out there is some kind of a storage 'consumer NAS in a box' product that has tiered storage inside, with these drives and other ones. But stay away from drives on their own w/o that special software in front.


Though as a consumer it was easy to buy an optane card.

Not the DIMMs as much, but those had horrible prices per gig, and SMR is supposedly a way to save money.

> Where you, as a consumer, can get them - don't. I don't think they're meant to be used by consumers. You're gonna have a bad time unless you write software specific to this weird type of drive. What you might find out there is some kind of a storage 'consumer NAS in a box' product that has tiered storage inside, with these drives and other ones. But stay away from drives on their own w/o that special software in front.

That would be a reasonable statement if they weren't selling SMR drives to customers without even labeling them as such. Just raw drives that sometimes have the performance go to hell. If you wanted fast you should have gotten an SSD, I guess.

It's not that I can't easily get an SMR drive, it's that I can't get the bigger models and I can't get the price savings.


Author here. We recently published a post about our last 4 years of SMR usage here: https://dropbox.tech/infrastructure/four-years-of-smr-storag.... Note that just 5 years ago SMR technology was rather nascent and a lot of the software support was not great. Using SMR is possible for us without penalty only because of our sequential write workloads.

We use a custom disk format along with libzbc, but libzbd now provides many advantages, which we are looking to adopt. I did want the QCon talk to have some super straight-to-the-point conclusions, and these, I believe, are what have saved us the most since I have been on the team, largely due to the sheer scale of managing such a system.
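For anyone unfamiliar with host-managed SMR, here's a toy mental model of a zone (this is not the libzbc/libzbd API, just a sketch of the constraint those libraries expose): writes must land exactly at the zone's write pointer, and the only way to reclaim space is to reset the whole zone, which is why an append-only workload maps onto it so naturally.

    # Toy model of a host-managed SMR zone (not the libzbc/libzbd API, just the
    # constraint those libraries expose): writes land at the zone's write
    # pointer, and the only way to reclaim space is to reset the entire zone.
    # Anything on top must batch deletes and copy live data elsewhere first.

    class Zone:
        def __init__(self, size=256 << 20):       # pretend zones are 256 MiB
            self.size = size
            self.write_pointer = 0
            self.data = bytearray(size)

        def write(self, payload):
            if self.write_pointer + len(payload) > self.size:
                raise IOError("zone full; reset before reuse")
            start = self.write_pointer             # only sequential writes allowed
            self.data[start:start + len(payload)] = payload
            self.write_pointer += len(payload)
            return start                           # caller records the offset

        def reset(self):                           # discards everything in the zone
            self.write_pointer = 0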


SMR has high latency when you need to rewrite data on a shared "shingle" - so my guess is they designed the storage system to avoid those rewrites for better performance?


Is the diagram in the Zone section wrong? It looks like the Cell diagram below has been used by mistake.

Super interested to understand this better but having a hard time following the article. Is the hash index replicated across regions/zones, or are clients responsible for routing their requests to the correct zone?


You're right, it is. I was confused as well.


So is this thing using parts of Ceph? Haven’t really seen OSD term used anywhere else…


We designed and built the system almost completely ground-up, all the way down to the disk scheduler, but yeah we probably took the term OSD from Ceph.


I think that's right


(kmod was the founding engineer on the project and the guy who came up with the name!)


The project was originally codenamed "s3box" and then changed for obvious reasons haha. I think Drew picked the name "Magic Pocket"


Lustre used that nomenclature long before Ceph.


https://www.lustre.org/

> The Lustre® file system is an open-source, parallel file system that supports many requirements of leadership class HPC simulation environments. Whether you’re a member of our diverse development community or considering the Lustre file system as a parallel file system solution, these pages offer a wealth of resources and support to meet your needs.

Hadn't heard of it.


It's normally used for supercomputers


Sounds interesting, but I can't find the github link...


We had a lot of requests for open sourcing MP but I don't think it would have been very useful. The code itself is interesting to look at but the system is designed for very large data sets, has a lot of moving parts to manage and configure, and would be very difficult to operate at anything below at least double-digit petabytes.

MP was designed for multiple exabytes across many data centers and geographic regions. A different system design (or just using S3) would be more appropriate for use at smaller scales.


Where I work, we deal with double to perhaps eventually triple digit petabytes. I was looking for the GitHub link. :(


This is an impressive product, and I apologize, but I'm gonna go on a bit of a rant about the PR language.

I hate the phrase "Our system has over twelve 9s of durability." Amazon was the first motherfucker to claim this, and the other cloud storage folks are also culpable, but at least they mostly had the modesty to add some weasel words like "designed for" and didn't just straight up claim there was less than a 1 in a trillion chance of a durability failure.

You don't have twelve 9s of durability. Your collection of copies of data on the hard drives does, assuming they exist in a vacuum and nothing bad happens to them except the normal sorts of hard drive failures, which are nice and completely independent. But that completely ignores all other sources of problems, and those are so many orders of magnitude more common that you might as well claim "God-given, perfect durability" because it'd be just as accurate.


Yeah we have a few talks about this and a chapter about this very issue in https://www.oreilly.com/library/view/seeking-sre/97814919788.... Totally agree that in a well designed system the sources of data loss are certainly not disk failures.

As far as I know Magic Pocket has had 100% durability, but that's obviously beside the point.


"It’s fairly easy to design a system with astronomically high durability numbers. 24 nines is a mean time to failure of 1,000,000,000,000,000,000,000,000 years. When your MTTF dwarfs the age of the universe then it might be time to reevaluate your priorities.

Should we trust these numbers though? Of course not, because the secret truth is that adherence to theoretical durability estimates is missing the point. They tell you how likely you are to lose data due to routine disk failure, but routine disk failure is easy to model for and protect against. If you lose data due to routine disk failure you’re probably doing something wrong."

https://medium.com/@jamesacowling/how-many-nines-is-my-stora...
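As a quick back-of-envelope illustration of why those routine-failure durability numbers come out so astronomically high (the numbers below are made up, and the independence assumption is exactly the part being criticized):

    # Back-of-envelope durability model with made-up numbers: with 4-of-8 erasure
    # coding, a stripe is lost only if 5 or more of its 8 disks fail within the
    # same repair window. Treating failures as independent (the assumption the
    # article criticizes), the loss probability is a tiny binomial tail.

    from math import comb

    afr = 0.02                       # assumed 2% annual failure rate per disk
    repair_days = 1                  # assumed time to rebuild a failed disk
    p = afr * repair_days / 365      # chance a given disk dies inside one window
    n, k = 8, 4                      # 8 shards written, any 4 can rebuild the data

    lost = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n - k + 1, n + 1))
    annual = 1 - (1 - lost) ** 365   # ~365 independent repair windows per year
    print(f"annual stripe-loss probability ~ {annual:.1e}")   # absurdly small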


Of course, these systems are always designed for that. Just like every system is designed for a certain amount of availability. Also, even in a vacuum, things would degrade over time due to bit rot, etc. That's why the article mentions protections such as verifiers, and safeguards against other things such as accidental deletions due to potential bugs.


magic pocket's tech lead (disclaimer: my cofounder at convex) has a whole talk on this concept of "durability theater" [1]!

the tldr is that those numbers of 9s are just table stakes. no system should ever lose data due to routine disk failures. so then, as you mention, there's another whole art to mitigating those other sources of problems.

[1] https://www.facebook.com/atscaleevents/videos/17416916227706...


We should also clarify that Sujay and I are part of the old old team :) The current team have been doing an awesome job since then!

(a bunch of the early MP folks work at Convex now)


> 99.99% availability

That seems... really bad, for the Core Service that Everything Depends On.


By comparison the old US POTS phone system was required to maintain downtime of less than five minutes per decade. I have no idea what rules, if any, apply to the modern wireless systems.

Note: The consent decree, the actual legal obligation of the Bell system, actually specified the percentage of times someone would pick up the phone and not get a dial tone or operator. It was surprisingly high -- IIRC something like 2%, which is why you can see people toggling the hook in old movies. The five minutes per decade constraint I know because some of my old customers made digital phone switches and they had to provide that SLA to their phone company customers, or else not get the order.


I'm having trouble finding any reference for the "five minutes per decade" downtime limit. That would be "6 9s" or 99.9999% uptime, which is just crazy in today's world. Even today VOIP providers only claim 99.999%, though it seems like in practice many fail to even get close to that. I think I found the 756-page consent decree between Bell and the US you mentioned, but I can't find a reference there either, though it is quite a massive doc with so-so OCR: https://www.google.com/books/edition/Consent_Decree_Program_...

Stepping back a bit, it seems like the FCC would be in charge of establishing rules like this but that consent decree I found (which might be the wrong one), is with the house Antitrust committee. (After my own Googling failed I also asked GPT-4 and it isn't aware of this either)


Well as I noted in my comment the consent decree itself IIRC specified availability rather than uptime.

However the 5 min figure comes from my customers like DSC (R.I.P), Ericsson and Nokia building POTS switches. These guys were deadly serious, like the folks who made spacecraft and medical devices, plus the Ericsson and Nokia folks were nice too.

In DSC’s case they were so paranoid that they paid us a massive amount to maintain a special tool chain just for them. It was frozen in time (no upgrades) and when they reported a bug and we sent them an updated tool chain they diffed the binaries and made sure that every delta was due to the bug fix and nothing else (that the dev hadn’t snuck in some other patch for some reason)! They did some other headstands with their hardware and software, but in the end it didn’t save them.

It’s cool that the consent decree is online. That arrangement with the Bell system was very clever, though it led to a lot of weird anomalies and distortions but I think it did end up with a better phone system than the PTT model. It’s also been a better model than what’s happened with the power and water utilities.

The FCC back then was a better regulator for phone customers than the (now obsolete) ICC had been.


We kept a data closet in Manhattan in a sister building to 33 Thomas. Customers would ask for a BRP. We had N+2 for data-level stuff and higher, but some customers would get hung up on not having a second location. We took the stance that if the facility fails, we all have bigger issues.

We had one 18-minute issue in ten years of use. It was the best facility we ever used.


I remember in the late 90s/early 00s a lot of cases when sites would go down (everybody self-hosted) due to backhoe events and the like.

These companies weren't idiots: they had replication for their databases and leased redundant transmission service because they knew this kind of thing could happen. The problem is you'd buy transmission from two providers...whose fiber turned out to be in the same conduit, or even both had rented bandwidth on the same fiber.

At least people are smarter these days, and with widespread cloud service there are fewer people who need to keep track of this stuff.


They could be serious enough about it, and do everything right, and still have an outage. https://news.ycombinator.com/item?id=34665023


When I interned at a phone company I was surprised by the lack of hardware redundancy.

The equipment was trusted to not fail in its lifetime.


I think they got 5 nines mixed up with 6 nines. The 5 nines is 5 minutes per year. I asked ChatGPT and it claims they had a target of 5 nines which makes sense.


That's 5 minutes a year, not decade. (I worked in telecom for a very long time.)

Telco basically set the bar for "5 9s."


That was before we had to update every system every day :-)


How is that bad? 99.99% means the system is down for less than an hour each year. https://uptime.is/99.99
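For reference, here's the quick conversion (a rough sketch assuming "available" simply means the fraction of time the service answers, not how Dropbox actually measures it):

    # Quick conversion from availability percentage to allowed downtime.

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in (99.0, 99.9, 99.99, 99.999):
        downtime = (1 - nines / 100) * MINUTES_PER_YEAR
        print(f"{nines}% availability -> about {downtime:.0f} minutes of downtime per year")
    # 99.99% works out to roughly 53 minutes a year, i.e. under an hour.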


of course, it actually depends on how they define "up."


S3 standard storage class also says it's 99.99% availability so ... I wonder how hard it is to get beyond that?

https://aws.amazon.com/s3/faqs/


It's extremely hard to get availability higher than that. Possible but not something companies are willing to promise in an SLA.


I'd say it's not so much that it's really hard, it's that the compromises aren't worth it.

note, I worked on S3 2015-2017


That's a fair point. It would be quite possible if the customers were willing to pay a lot more.

I worked on DigitalOcean's object storage for about a year not long ago. Makes sense that those of us who have been in this space would be interested in this article haha.


So ... e.g. for 5 nines, the erasure code configuration would demand more writes in more locations, and the cost impact is too high?


I imagine that as you add more nines, you start to hit more problems that are more out of your control. Like, how many nines do your backup diesel generators have?


or replication if you're not erasure coding.

So yeah, you need more storage-optimized server racks and all the associated manpower and maintenance; you also need them to be distributed across different datacenters and zones, which of course also impacts latency and your ability to provide some appearance of consistency; then you also need the same distribution for the stateless services serving the data.

On and on and on and you might be nearly doubling cost to get to an extra 9 that almost all of your customers won't care about.
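As a rough sketch of the math behind that tradeoff (illustrative numbers only, not anyone's actual configuration):

    # Illustrative math only (numbers are made up): storage overhead of plain
    # replication vs erasure coding, and how read availability compounds when
    # any one of several independent zones can serve the data.

    def overhead_replication(copies):      # extra bytes stored per logical byte
        return copies - 1.0

    def overhead_erasure(n, k):            # n total shards, any k can reconstruct
        return n / k - 1.0

    def read_availability(zone_avail, zones):
        # a read succeeds if at least one zone is up, assuming independent failures
        return 1 - (1 - zone_avail) ** zones

    print(overhead_replication(3))         # 2.0  -> 200% overhead for 3x replication
    print(overhead_erasure(14, 10))        # 0.4  -> 40% overhead for a 10-of-14 code
    print(read_availability(0.999, 2))     # ~0.999999 -> two independent 99.9% zones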


It's a cost problem. You can absolutely get higher availability, but it requires more hardware, uses more networking, and it makes maintenance and feature development more expensive. Blob storage customers are very price sensitive, but they also want lots of availability. That's why you'll see various "tiers" of storage, to try and capture all the combinations of availability and storage cost that various kinds of customers might want. Pretty much nobody wants to pay twice the cost to turn an exabyte of 99.99% available data to 99.995% available data, so it's not a product.


Absent here is any statement about how "available" is defined in terms of being _meaningfully up_, like IOPS or bps.


When you build anything on it, you know it may not be available for a couple hours a year and architect around it. 99.99 isn't bad at all. Your ISP probably has worse SLAs.


Not really bad. It’s less than an hour per year. Most users would not notice.


Oof.

You're not wrong, this is really, really bad, especially for Dropbox: storage is their business, so I expected way better.

These stats are no different from S3 at all. All of this engineering and moving away from AWS, for so little gain in availability.

I was initially excited when they moved away from AWS and expected industry-leading availability, but I was wrong.

This is disappointing for Dropbox, whose main business is storing files without any minor hiccups or outages for years.


"... storage is their business so I expected way better."

Storage is our business and we target an even worse 99.95% availability[1].

Availability has a cost. That cost is complexity.

We would very much prefer to have boring outages more often than have fascinating outages very rarely.

[1] https://www.rsync.net/resources/notices/sla.html


> Availability has a cost. That cost is complexity.

It seems that Dropbox's system is complex enough, but still isn't designed to be as highly available (5x9s) as a well-architected system on AWS using S3.

Yet it is surprising that a public billion-dollar company like Dropbox cannot go beyond this for their mission-critical customers (govt agencies, healthcare, military, etc.) willing to spend hundreds of millions.

So yes, I expected way better.


> These stats are no different to S3 at all. All of this engineering and moving away from AWS and for so few gain in availability.

What makes you think a gain in availability matters or is necessarily a motivation for the project?

If they can achieve the same availability at far lower cost, it’s a win for them, which is why they would (and did) do it.


> What makes you think a gain in availability matters or is necessarily a motivation for the project?

This isn't a win for enterprise / business / mission critical customers. Governments and public services cannot use this at all.


The propaganda is wild to make people believe governments and public services have even 99% uptime, let alone 99.99%.


5x9s is more than possible and has been done. not propaganda, don't know what you've been reading.


Yes 5x9s is possible. No, governments and utilities services do not keep 5x9s.


They can, and do.


They use AWS rather than relying on the unreliability of Dropbox.


You think 9-1-1 doesn't experience downtime?


Strawman. Who said that emergency systems don't ever experience downtime? Orgs running mission-critical systems wouldn't trust a service that has less than 5x9s.

> Emergency response systems is 99.999% or “five nines” – or about five minutes and 15 seconds of downtime per year.

https://aws.amazon.com/blogs/publicsector/achieving-five-nin...

I see hospitals will be using AWS and other reliable hosts to use S3 rather than Dropbox.


Hospitals have budgets too, not all of them will spring for the 5th 9


And AWS is there for the 5th 9, not Dropbox. Therefore, Dropbox is not suitable or reliable for these kinds of orgs, even with this custom system they built.

One can have a custom-built system, or an experienced IT team can achieve the same or even better availability with AWS and other providers that care about critical availability, if architected correctly.

https://docs.aws.amazon.com/wellarchitected/latest/reliabili...


this is an overreaction. Dropbox is mostly storing files that are already stored on the customer's device, so customers usually won't notice an outage.


How is this an overreaction?

I would expect Dropbox, a file storage company that proudly invests heavily in tech and infrastructure, to achieve better availability than what they already had on AWS (99.99%).

In terms of availability the change is pretty much zero, and as a business / enterprise customer I might as well choose a different service with similar or higher 9s, or (if my needs are complex) choose S3.


Can you provide an example of an alternative service which will give higher than 4 nines for availability that an enterprise customer would pick instead if that < 1hr of downtime per year was too high?


AWS (Architected Correctly) which I am sure Dropbox has experience in.

https://aws.amazon.com/blogs/publicsector/achieving-five-nin...

Here is a service that has managed to achieve 5x9s of availability:

https://ably.com/


> achieve 5x9s of availability:

Guaranteed availability is a bet they're willing to make, a gamble they've been on top of so far, a risk that, should something fail, they will pay out on according to their SLA.


Seems to be working out for them (Ably) and this isn't a public billion dollar company like Dropbox.

I would have thought that, given Dropbox's engineering talent, they would have designed a system that accounts for 5x9s, even making that guarantee for enterprise or mission-critical customers.

Can't even find their SLAs anywhere for these customers, so I presume that Dropbox doesn't care about them.

Guess I was wrong and this is just disappointing and made the move not worth it.


I'd rather have maximum consistency and integrity than maximum availability.


Most tech these days has improved to offer maximum consistency. Availability is also important, which is why it is a good thing that Dropbox still trusts AWS, and especially S3, over their own on-prem solution.

https://aws.amazon.com/solutions/case-studies/dropbox-s3/


Do you know of any companies that actually provide better reliability in their consumer product? The ones that lie or skew their uptime calculation don't count.




