Ask HN: Cheap Bulk Storage?
49 points by e1ven on May 1, 2011 | 42 comments
2 years ago, BackBlaze released specs for their storage pod, which is basically a huge tract of drives wired together.

It's not fast, but it's cheap, and it's great for bulk-archiving documents.

Now it's 2011. Is there a better way of solving this problem? S3 is crazy expensive for massive (100+TB) archiving, and I'd rather go one step above building my own system using custom frames and components.

Are there any web-services which offer bulk storage at somewhat reasonable rates? Are there off-the-shelf cases which offer 50+ HDs for a non-insane price?



Supermicro has a 45-bay JBOD that should be under 10 cents/GB populated: http://www.supermicro.com/products/chassis/4U/?chs=847

HP has a 70-bay JBOD that works out to under 30 cents/GB: http://h10010.www1.hp.com/wwpc/us/en/sm/WF25a/12169-304616-3...


It's also worth looking at their 72 disk, 2.5" version

http://www.supermicro.com/products/chassis/4U/417/SC417E16-R...

Using that, and 72 1TB 2.5" drives, (http://www.newegg.com/Product/Product.aspx?Item=N82E16822136...)

(($109 * 72) / (72 * 1TB * .75) in dollars per gigabyte) = $.15/GB


It's worth noting at this time that most SSDs only come in a 2.5" version. (Not for cheap bulk storage, but because that's arguably Supermicro's intention for this chassis.)


No, it's not intended for SSDs. There is not nearly enough IO bandwidth for SSDs to make any sense. 2.5 inch hard drives are what you use if you want disks with spindles, not just big storage, e.g. for virtualised database servers.


Yeah, basically 2.5" gives you the same space but twice the spindles for twice the price. Probably not worth it for archival storage.


Agreed. It'd be faster, but that's not really what we're looking for. We want to get the cost per GB to be as low as humanly possible, while maintaining at least an illusion of reliability.


So store it all on one of those fake Chinese USB drives. That gives a nice illusion, and is incredibly cheap...

Alternatively we could consider what level of reliability you want and how that will be tested.


In my calculations I was doing elsewhere in the thread, I was assuming 1 drive out of 4 would be for redundancy. There's nothing on these drives that's irreplaceable, and we have system-level redundancy.


The Supermicro case is pretty awesome. It looks like it might be a great option. It's almost exactly what I was looking for: half a step above custom frames. Thanks!

It doesn't feel like 2 years' worth of progress, though. The numbers are pretty much still where they were 2 years ago ;)

Thanks Again.


You should have a look at Silicon Mechanics[0] and Aberdeen[1]. I know Aberdeen sells re-branded Supermicro chassis, but Silicon Mechanics sells similar products. I've been shopping around for a 36 bay NAS, and both companies seem to offer the best combination of price, quality, and support (Aberdeen has a 5 year warranty). Though if you want to ditch enterprise drives entirely to minimize cost, you should probably go with a Silicon Mechanics system and buy your own drives.

Some quick estimates: You can get a 36 bay server with an attached 45 bay JBOD for about 8.5k [2]. If you use 2TB consumer drives, they'll run about $140 a piece[3]. This comes out to $0.12 / GB, which is a damn good deal, IMO.

[0] http://www.siliconmechanics.com/

[1] http://www.aberdeeninc.com/

[2] http://www.siliconmechanics.com/i28693/4u-storage-server.php and http://www.siliconmechanics.com/i19897/4u-45-drive-jbod-sas-...

[3] WD Caviar Black drives ( http://www.newegg.com/Product/Product.aspx?Item=N82E16822136824 ). You can get WD Greens for $80 a piece, but I don't know how their variable RPM holds up in a RAID.


Agreed. I like Silicon Mechanics. I've also used Avadirect in the past, but they weren't as good.

I think the WD's are the way to go. Enterprise-grade drives aren't really much more reliable, and you need to plan capacity assuming they'll fail anyway. Better to get consumer drives that are at least mid-range (avoid Hitachi), and run with it.


That looks cool.

Where did you get the pricing from, to generate your cost estimates (ie. the 10/30c/gig)?


If you throw the model number into Google, you get a few resellers:

http://www.google.com/search?q=SC847E16-RJBOD1&ie=utf-8&...

2K - Case + Backplanes
1K - CPU, MB, RAM
3.3K - 45 1TB drives at 75 ea
--
6.3K - 45TB RAID0

If I want it to be even somewhat sane, I'd add every 4th drive as spare/parity, which drops to 33TB

$6300 / 33TB gives $.19/GB

That's higher than the .12 from the 2 year old BackBlaze box, but not much.

If instead we go with the $179 3TB WD drives, we can get slightly better than the pod pricing.

($179 * 45) / (45 * 3TB * .75) in dollars per gigabyte = $.08/GB
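The per-GB figures in this thread all follow the same arithmetic, so a small helper makes them easy to re-run with other prices. This is only a sketch: the function name and parameters are made up for illustration, 1TB is treated as 1000GB, and `usable_fraction=0.75` encodes the "every 4th drive as spare/parity" assumption used above.

```python
def cost_per_usable_gb(drive_price, drive_tb, drive_count,
                       fixed_cost=0.0, usable_fraction=0.75):
    """Dollars per usable GB for a box of identical drives.

    fixed_cost covers chassis, backplanes, CPU, etc. (0 for drives only).
    usable_fraction is the share of raw capacity left after parity/spares.
    1 TB is treated as 1000 GB, as in the thread's back-of-envelope math.
    """
    total_cost = fixed_cost + drive_price * drive_count
    usable_gb = drive_tb * 1000 * drive_count * usable_fraction
    return total_cost / usable_gb


# Drives only: 45 x 3TB at $179 each
print(round(cost_per_usable_gb(179, 3, 45), 2))   # 0.08
# Drives only: 72 x 1TB 2.5" at $109 each
print(round(cost_per_usable_gb(109, 1, 72), 2))   # 0.15
```

Passing a non-zero `fixed_cost` folds the chassis and motherboard into the same figure, which matters more for small drives than for large ones.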


It's a shame that nobody has a chassis like the Sun Fire X4500 anymore. The closest I could find is a 5U Thinkmate STX XA48-2510 which would work out at about 32 cents/GB.


How much would the associated electricity and cooling costs be?


SpiderOak (the backup and sync company) is on the verge of launching an S3-like service, tuned for archival class data. It's in beta now.

It's also entirely open source, so you could run it on your own hardware if you wanted. It uses parity instead of replication for storing data (but with arbitrary effective replication levels). So with R=3, you need a minimum of about 10 nodes to be efficient.
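For intuition on why parity can be cheaper than replication, here is a sketch under a generic erasure-coding assumption that is not taken from SpiderOak's documentation: data is spread across n nodes so that any R-1 node losses are survivable, matching the loss tolerance of R full copies. Under that assumption the raw-to-usable ratio is n / (n - (R - 1)). The function names are hypothetical.

```python
def replication_overhead(r):
    """Raw bytes stored per usable byte with r full copies."""
    return float(r)


def parity_overhead(n_nodes, r):
    """Raw bytes per usable byte when n_nodes shares tolerate r - 1 losses.

    Assumes an idealized erasure code: n_nodes - (r - 1) data shares
    plus r - 1 parity shares. Real systems may choose other parameters.
    """
    parity_shares = r - 1
    data_shares = n_nodes - parity_shares
    if data_shares < 1:
        raise ValueError("need more nodes than parity shares")
    return n_nodes / data_shares


for n in (3, 5, 10, 20):
    print(n, "nodes:", round(parity_overhead(n, 3), 2), "x raw vs",
          replication_overhead(3), "x for plain replication")
```

Under these assumptions 10 nodes give 1.25x overhead instead of 3x, which is at least consistent with the "about 10 nodes" remark; the actual scheme may differ.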

The beta uses AMQP internally, and the upcoming release version is Python/zeromq/gevent based.

https://spideroak.com/diy/


This seems to be a great offering

The pricing is not entirely out of line compared with what I can get in-house, and it's a fantastic solution for storing horribly bulky, low-reliability data.

$10 / 100GB comes out to $.10/GB/month. That means that you pay the total cost of an in-house solution every month you use their service.

That's 12x the price of doing it yourself, but you don't need to buy replacement drives, pay for electricity, or send someone to go replace drives.
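To make that trade-off concrete, here is a rough break-even sketch. It assumes the hosted price is $10 per 100GB per month and treats the in-house figure as a one-time hardware cost per GB; it deliberately ignores power, cooling, bandwidth, and replacement drives. The function name is hypothetical.

```python
def months_to_break_even(hosted_per_gb_month, inhouse_per_gb_once):
    """Months of hosted fees that add up to the one-time in-house hardware cost."""
    return inhouse_per_gb_once / hosted_per_gb_month


hosted = 10 / 100  # $10 per 100GB per month, in dollars per GB per month

# One-time per-GB hardware estimates quoted elsewhere in this thread.
for inhouse in (0.08, 0.12, 0.19):
    months = months_to_break_even(hosted, inhouse)
    print(f"${inhouse:.2f}/GB in-house is matched by {months:.1f} months of hosting")
```

Adding an estimated monthly running cost per GB to the in-house side would push the break-even point out, possibly by a lot.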

It's not that much cheaper than Amazon in the base pricing, but it looks like you can relax the number of copies for them to keep from 3 (default) to 1, and cut the price down to 1/3?

Is this your experience? Thanks for the link.


FYI, I'm one of the founders of SpiderOak, in case that was not clear (it's in my profile).

For the storage cost calculations: electricity, cooling, and bandwidth end up costing more than the drives, even once you include all additional hardware to keep them running.

Note of caution: I would recommend that shops build bulk storage hardware themselves only if someone on their team has an intensive storage background. There are a number of gotchas that are distinctly non-obvious; they are capital-intensive lessons to learn, and they endanger data integrity. You saw what Backblaze went through to design their storage pods: all of the self-hosted offsite backup companies do likewise. It's a core competency.

Typically it's much better to outsource or buy the more expensive business storage products with hardware RAID, enterprise class drives, cache batteries, and all that.

SpiderOak's DIY is designed for high reliability, but not high performance. In price comparisons to S3, there's no option for reduced-reliability storage, but also no charge for bandwidth in or out (up to a reasonable point).

Having said all that, if you are interested in building 100+ TB of storage hardware in house, feel free to send me a mail; I may be able to save you some difficulty. If you'd rather us host it, we do discounts for startups. :)


Oh, Man, that'll teach me to look more carefully ;)

Oops.

Thanks for the awesome service. I've used SO for backups on and off, and I appreciate the tech work you do. I particularly like that you do client-side encryption. I just wish your UI was a bit better ;)

I must have been mixing up your service with diomede; IIRC, their techcrunch article was about letting you choose the levels of redundancy you want to keep.

I'll grant you that the bandwidth is certainly a price-factor, but your page talks a LOT about how you're higher latency than S3, and not everyone needs that speed for that price, etc, etc, and then your price comes out to be about 80% of theirs. Honestly, after reading your product positioning, I was envisioning it coming out to like 10%.

You're absolutely right that building a huge in-house storage network is a huge ordeal, if you're worried about making it fast and scalable. What I want to do is to treat it as the equivalent of a huge tape drive: just throw stuff onto it, and hope for the best.

I've been toying around with the idea of dealing with it entirely in application logic. Rather than RAIDing the disks, mount each one separately, and keep a hashtable of my files in memory with their full path, including redundant copies. Then, if a file fails, I have application logic try the backup copy.
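A minimal sketch of that idea, assuming each disk is mounted at its own directory and file names are flat (no subdirectories). The class and method names are made up for illustration, the index lives only in memory as described, and nothing here handles silent corruption or rebuilding lost copies.

```python
import os
import shutil
import tempfile


class ReplicaIndex:
    """In-memory map from a logical file name to its copies on separate disks."""

    def __init__(self, mount_points, copies=2):
        if copies > len(mount_points):
            raise ValueError("need at least as many disks as copies")
        self.mount_points = list(mount_points)
        self.copies = copies
        self.index = {}  # logical name -> list of full paths, primary first

    def store(self, name, source_path):
        # Start at a disk chosen by hashing the name, then use the next
        # disks in order so every copy lands on a different mount.
        start = hash(name) % len(self.mount_points)
        paths = []
        for i in range(self.copies):
            mount = self.mount_points[(start + i) % len(self.mount_points)]
            dest = os.path.join(mount, name)
            shutil.copyfile(source_path, dest)
            paths.append(dest)
        self.index[name] = paths

    def read(self, name):
        # Try each recorded copy in order; move on to the next on I/O errors.
        for path in self.index[name]:
            try:
                with open(path, "rb") as f:
                    return f.read()
            except OSError:
                continue
        raise OSError(f"all copies of {name!r} are unreadable")


# Demo with temporary directories standing in for separately mounted disks.
disks = [tempfile.mkdtemp(prefix="disk") for _ in range(4)]
source = os.path.join(tempfile.mkdtemp(), "report.dat")
with open(source, "wb") as f:
    f.write(b"archived bytes")

index = ReplicaIndex(disks, copies=2)
index.store("report.dat", source)
os.remove(index.index["report.dat"][0])  # simulate losing the primary copy
print(index.read("report.dat"))          # served from the surviving copy
```

In practice the index would also need to be persisted somewhere, since losing the in-memory table means losing track of where every file lives.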

Anyway, you know a LOT more about this than I do, so I'll shut up before I embarrass myself further.

My point is just that there's a lot of people in the market who have huge datasets currently, for various reasons. Lots of science departments, analytics groups, etc. I'd love to be able to keep a nearline copy of work.

I don't mind if it takes a good 15-20 minutes to be able to access the first byte, I just want to know that I can if I need to, faster than waiting for Iron Mountain to ship me something.

Side note- I considered trying to justify it under BackBlaze's "$5, unlimited space" offer, but I didn't think they'd go for it. ;)


I remember being fairly skeptical of the Backblaze storage pods. What happens when you have a power loss situation? What happens when a drive dies? Seems like it would be pretty tough to maintain and service those custom pods with the unnecessarily expensive non-redundant custom power supplies. I'm going to second the Supermicro case recommendation made by wmf.


I'm surprised there aren't any services selling raw storage space. i.e. no RAID or other redundancy, no doubling as a static web-server, just a bunch of disks.

So people (myself included) buy physical HDDs for local backup -- simply because every online alternative jacks up the price by maintaining 3+ copies and keeping everything permanently online.


People would lose data and then blame the provider, creating nothing but bad PR.


"if we lose your data, we'll pay you $X compensation"

I'm pretty sure you can find an X which won't cost too much, and will get most of the complaints off your back


That's a curious idea. How would you access the raw storage? iSCSI? And then you run your own RAID on top? What's the advantage to you? It's ever so slightly cheaper?

Backblaze (no affiliation) is $5/month or $50/yr. That's pretty hard to beat.

Also, running a static web-server is fairly trivial to set up these days. What would a service not running a static web-server do/have?

Every online alternative has a vested interest in both A. saving money, and B. not losing your data. As much as you trust a business to perform B adequately, you implicitly trust it to perform A at least as well.


I need data backup on a personal level rather than for business reasons, and I love what Backblaze offer... except that for my data and with my ~50kbps upload speed, it would take about five years to upload it all to their servers.
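A quick sanity check on that estimate, assuming "kbps" means kilobits per second and the link is saturated around the clock (both are assumptions; the poster does not state the data size). The helper names are made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def upload_years(size_gb, kbps):
    """Years needed to upload size_gb at a sustained rate of kbps (kilobits/s)."""
    bytes_per_second = kbps * 1000 / 8
    return size_gb * 1e9 / bytes_per_second / SECONDS_PER_YEAR


def gb_uploaded(years, kbps):
    """GB transferred in the given number of years at kbps (kilobits/s)."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * years * SECONDS_PER_YEAR / 1e9


print(round(gb_uploaded(5, 50)))         # 986
print(round(upload_years(1000, 50), 1))  # 5.1
```

So "about five years" at that rate corresponds to roughly 1TB of data, before any protocol overhead or downtime.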


The closest equiv. would be Amazon's EBS, but their pricing on this is through the roof. Storing 70TB would be 8K/month, without ANY bandwidth fees.


> any web-services

You may want to take a look at http://www.wuala.com, although I'm not sure it will fit at your scale.

Besides other features, they offer two not-so-common things:

1) you can trade your own space for remote storage

2) they have local encryption (at your end)


Diomede might be worth a look. Their offline and nearline options are waaay less expensive than S3. Never used them though so let us know how it works out if you go with them.

http://www.diomedestorage.com/


Their page appears to have a single unlinked image on it, and nothing else..?


As far as I can tell, they're closed down ... their blog (http://diomedeblog.wordpress.com/) has been inactive for over a year, and as far as I can see there's nowhere for existing users to log in (they had open signups at one point last year...)

Edit: Actually, it looks like they closed on Mar 19, according to an email they sent to existing users. They say they might reopen "with a new approach" near the end of the year.


http://techcrunch.com/2009/02/27/diomede-offers-green-file-s... seems to offer some more details. Their site leaves an awful lot to be desired.


Ah, yes. I remember reading about them now. I'd be interested in looking over their offerings, if they ever release any.


You ask about online bulk storage at reasonable rates, but you've not mentioned what your acceptable rate of data loss is.

How much data are you willing to lose per day out of your 100TB?


I'd be willing to lose 100TB at once, assuming it wasn't every day.

If there were a system where it lost even 20TB/day, I'd be interested, depending on what the tradeoffs were! I'd expect that to be pretty seriously cheap to offset it.

That's why I'm asking what options people know about. If you know one that is randomly lossy but cheap, I'd be REALLY curious as to what their backend is!


I'd say most of the options people have proposed are in that range. There's no backup, no simultaneous mirroring, etc., which is what you are paying for with S3.


Hardly. Hard disk failure rates are about 1% per year. For many purposes (basically, anything where this isn't the only copy of valuable data), that's acceptable.

Amazon S3 costs around $1/GB/year. Let's say it has 0% chance of data loss. Suppose I can store something without any replication. Cost drops to $0.50/GB/year, with a 2% chance of losing all my data. If the value of my data is less than $25/GB (the point where a 2% expected loss equals the $0.50 saved), it makes financial sense to go for the unreliable option.

Much of my data is worth less than $25/GB: photos, video, old mail. Storing it on RAID is a waste of money.

Besides, the chance of me deleting everything by human error is often much higher than the chance of hardware failure.

And some decent portion of those failures come with advance warning, so that you can save at least some of the data.
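The break-even argument in this comment can be written out explicitly. This sketch uses the comment's own illustrative numbers ($1/GB/year replicated, $0.50/GB/year unreplicated, 2% yearly chance of total loss); the function name is hypothetical, and the model ignores partial loss and any cost of re-creating the data.

```python
def break_even_value_per_gb(reliable_cost, cheap_cost, loss_probability):
    """Data value per GB at which the cheap option's expected loss equals its savings.

    Costs are dollars per GB per year; loss_probability is the yearly chance
    of losing everything stored on the cheap option.
    """
    savings = reliable_cost - cheap_cost
    return savings / loss_probability


threshold = break_even_value_per_gb(1.00, 0.50, 0.02)
print(f"Unreplicated storage wins if the data is worth less than ${threshold:.2f}/GB")
```

Plugging in a lower failure rate raises the threshold, which makes the unreplicated option attractive for even more kinds of data.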


Are there instructions out there on how to build your own?



Would be interesting to install OpenStack Swift on top of Backblaze storage and see how it works.


I only looked at it briefly, but isn't Swift for multi-system installs, not multi-spindle?


www.elephantbackup.com is a great choice. Live drive at a way cheaper cost. Unlimited is $24 a year, $48 for the briefcase.


www.elephantbackup.com Unlimited for $24 a year. Can't beat it.



