New Amazon EC2 High Storage Instances

pella · on Dec 21, 2012

"The New EC2 High Storage Instance Family"

http://aws.typepad.com/aws/2012/12/the-new-ec2-high-storage-...

" The High Storage Eight Extra Large (hs1.8xlarge) instances are a great fit for applications that require high storage depth and high sequential I/O performance. Each instance includes 117 GiB of RAM, 16 virtual cores (providing 35 ECU of compute performance), and 48 TB of instance storage across 24 hard disk drives capable of delivering up to 2.4 GB per second of I/O performance.

This instance family is designed for data-intensive applications that require high storage density and high sequential I/O -- data warehousing, log processing, and seismic analysis (to name a few). We know that these applications can generate or consume tremendous amounts of data and that you want to be able to run them on EC2. The storage on this instance family is local, and has a lifetime equal to that of the instance. You should think of these instances as building blocks that you can use to build a complete storage system. You should build a degree of redundancy into your storage architecture (e.g. RAID 1, 5, or 6) and you should use a fault-tolerant file system like HDFS or Gluster. Of course, you should also back up your data to Amazon S3 for increased durability. "

alimoeeny · on Dec 21, 2012

Why is the link URL like this?

mbell · on Dec 21, 2012

Seriously is this a phishing site? Amazon page being served off a phx.corporate-ir.net domain?

jeffbarr · on Dec 21, 2012

Great question, here's the scoop!

The corporate-ir site is actually part of Thomson Reuters and our press releases end up there.

If a particular AWS release includes both a press release and a blog post, the press release goes out first.

After the release shows up in public I publish the blog post and submit it to HN.

Quantum fluctuations of the universe caused the press release to get more votes than the blog post and that's why it's on the front page.

MichaelApproved · on Dec 21, 2012

I thought it was industry norm to link a press release to the blog post. In that case, the blog post would go out first. What's the reason you decided to send out release first? I'm genuinely curious about the reasoning.

jeffbarr · on Dec 21, 2012

The press release is considered "definitive" for some reason. This is how our PR team explained it to me.

This is actually a very interesting conversation. It is interesting to see that the post on the AWS blog, hosted on TypePad, is seen as more official than the press release.

wmf · on Dec 21, 2012

This may be related to Reg FD. AFAIK press releases are considered fair to investors since they're guaranteed to hit everybody's Bloomberg terminal. Although Jonathan Schwartz convinced the SEC that blogging is compliant with Reg FD, I suspect many companies don't want to risk it.

apaprocki · on Dec 21, 2012

Yes, Reg FD -- professional IR services cater to this market. In case you're interested in the timeline from the Bloomberg Terminal point of view: http://imgur.com/wwzUp

tl;dr PR newswire (BUS) comes first @ 8:54, it hit Bloomberg News wire four seconds after release. Amazon blog wasn't until 9:17, followed by others such as TechCrunch @ 10:11.

notatoad · on Dec 21, 2012

I'm sure that if the press release were being served off a domain that was even a little bit recognizable, like 'thomsonreuters.com' or similar, everybody would be willing to accept that as official. corporate-ir.net just sounds sketchy.

Also, I doubt that TypePad would be as trusted if your blog wasn't as popular. If it were some company whose blog I've never read before, I wouldn't be inclined to trust announcements from TypePad. But everybody here knows the AWS blog.

mattsoldo · on Dec 21, 2012

Some public companies (mine included) have specific guidelines around how information that can reasonably be expected to impact share prices is disseminated. If you look at the NASDAQ website (http://www.nasdaqtrader.com/Trader.aspx?id=MarketWatch) they have their own regulations about how it is handled. I'd imagine there are SEC regulations around it as well.

corresation · on Dec 21, 2012

I had the same concerns; however, it is a press release service (apparently Thomson Reuters).

RyanZAG · on Dec 21, 2012

I'm going to need to mortgage my house to get my hands on some of these, aren't I?

petercooper · on Dec 21, 2012

On-demand pricing is $4.60 at the moment which isn't bad considering what you get ($110 per day; $3300ish per month).

I just ran some 'Reserved Instances' quotes and for 12 months I got $3968.00 upfront and then $2.24 per hour OR $9200.00 upfront and then $1.38 per hour. For 3 years you can go as far as $16924.00 upfront and then $0.76 per hour (for a long term effective rate of $1.404/hr).
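
A rough sketch of how the effective hourly rates for these quotes can be worked out, using only the prices listed above. It assumes the instance runs 24/7 for the full term and a 365-day year; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # 8760; leap days ignored


def effective_hourly_rate(upfront, hourly, years):
    """Spread the upfront fee over every hour of the term, then add the hourly fee."""
    return upfront / (years * HOURS_PER_YEAR) + hourly


on_demand = 4.60  # USD per hour, as quoted above
plans = {
    "1yr, $3968 upfront": (3968.00, 2.24, 1),
    "1yr, $9200 upfront": (9200.00, 1.38, 1),
    "3yr, $16924 upfront": (16924.00, 0.76, 3),
}

for name, (upfront, hourly, years) in plans.items():
    rate = effective_hourly_rate(upfront, hourly, years)
    print(f"{name}: ${rate:.3f}/hr ({rate / on_demand:.0%} of on-demand)")
```

The 3-year plan comes out at the $1.404/hr quoted above. If the instance is idle for part of the term, the effective rate per hour actually used is higher.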

moe · on Dec 21, 2012

> On-demand pricing is $4.60 at the moment which isn't bad considering what you get ($110 per day; $3300ish per month)

Well, depends on what you compare to.

The reserved price for 3 years is quite revealing in this case. So Amazon asks $16924 upfront and... wait, $17k upfront?

You can buy an equivalent Supermicro box with 24x2TB, 192GB RAM (not 128) for $10k. Thus if you rent the reserved EC2 variant for 3 years you end up paying at least a 4x markup versus housing a dedicated box.
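
For anyone who wants to check the comparison, here is a minimal sketch using only the numbers quoted in this thread. It assumes the reserved instance runs around the clock for three years and counts hardware price only on the dedicated side (no colo, power, bandwidth, or spares); the names are illustrative.

```python
HOURS_3Y = 3 * 365 * 24  # 26280 hours in a 3-year term, leap days ignored


def reserved_total(upfront, hourly, hours=HOURS_3Y):
    """Total paid for a reserved instance that runs for the whole term."""
    return upfront + hourly * hours


ec2_3y = reserved_total(16924.00, 0.76)  # 3-year reserved quote from the thread
box_price = 10_000.00                    # hardware-only price quoted above

print(f"EC2 3-year reserved total: ${ec2_3y:,.0f}")
print(f"Ratio to hardware price:   {ec2_3y / box_price:.1f}x")
```

With these inputs the three-year total is about $36,900, or roughly 3.7 times the hardware price, so the multiple depends heavily on what hosting costs are added to the dedicated side.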

ihsw · on Dec 21, 2012

You fail to mention rack rental fees and network IO fees, and maintenance (parts replacement, parts shipping/handling, cost of downtime, maintenance staff salaries).

moe · on Dec 21, 2012

We're talking about a difference of $10k USD per instance per year.

You'll want at least two for redundancy so that's $20k spare change if you go dedicated. That buys quite a lot of rack space, network IO and spare parts.

Staff salaries don't factor in because if you need storage of that scale you can't do without a competent admin either way (the $20k comfortably pays for a fully managed colo with remote hands).

Needless to say most deployments of that size will need more than two boxes, at which point the markup tips entirely into wtf-territory. Note my calculation was really generous here, too. In reality you get steep hardware discounts on top that make amazon look even worse, and you can buy boxes with higher storage density for an even better $/GB.

TillE · on Dec 21, 2012

My impression of EC2 and S3 has always been that they're appropriate for meeting short term needs (to handle extra load or to function as backups), but that's about it.

Their long-term pricing is terrible compared to other options, and they offer few benefits. At the low end, standard VPSes are much cheaper. At the high end, never mind colo, there are dedicated and even managed servers which offer far better value.

Spooky23 · on Dec 21, 2012

The Amazon pricing is similar to chargeback models in large enterprises. Actually it's cheaper, because big enterprises typically require that you use SAN at $5-15 per GB/mo.

powertower · on Dec 21, 2012

Okay. So instead of a 4x markup, it's now a 3x markup when you include the colo box space in a rack, power, b/w, a couple of spare parts on hand, and the extra couple of hundred to pay the staff at the datacenter (on call 24/7, 365) when those spare parts are put to use.

corresation · on Dec 21, 2012

Worth noting that it sounds like hs1.8xlarge is built on magnetic disks (24 2TB HDs - edit: originally put 1TB), each reading some 100MB/sec, yielding the theoretical max of 2.4GB/s in a RAID-0 configuration. No one actually uses disks in such a fashion, and gross throughput of magnetic drives has seldom been of much utility; hence the strong demand for SSDs, since random IO matches the vast majority of workloads more closely.
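
The arithmetic behind that ceiling, as a small sketch. The 100 MB/s per-disk figure is the estimate from this comment, not a published spec, and the model assumes perfect striping with no controller or bus bottleneck.

```python
def aggregate_throughput_gb_s(disks, mb_s_per_disk):
    """Best-case sequential throughput when every disk streams in parallel."""
    return disks * mb_s_per_disk / 1000  # MB/s -> GB/s (decimal units)


def raw_capacity_tb(disks, tb_per_disk):
    """Raw capacity with no redundancy overhead."""
    return disks * tb_per_disk


print(aggregate_throughput_gb_s(24, 100))  # theoretical RAID-0 ceiling in GB/s
print(raw_capacity_tb(24, 2))              # raw capacity in TB
```

Both results line up with the 2.4 GB/s and 48 TB figures in the announcement, which suggests the headline throughput is simply the sum over all 24 drives.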

Just caveats. This doesn't look like a terribly interesting option.

JoachimSchipper · on Dec 21, 2012

Good point, but their blog post does give a few possible uses: "Storage instances are ideal for data-intensive applications including Hadoop workloads, log processing and data warehousing, and parallel file systems to process and analyze large data sets in the AWS Cloud". If your code fits the pattern, a high-storage instance may fit well.

Of course, getting that much data into and out of the cloud is its own problem.

corresation · on Dec 21, 2012

Even getting the data to that machine presents a problem that seems to undermine the value of it entirely: As you mentioned, the real value in this is linear processing of large sets of data, but the storage is ephemeral, so your process has to be some variation of firing this instance up, copying TBs of data to the machine, and then doing linear processing. Given that you have to get the data there, the value of the high aggregated gross throughput seems secondary -- just stream process it, etc.

I'm having a tough time seeing where this type of instance fits.

edvinasbartkus · on Dec 21, 2012

That's why AWS Data Pipeline came along. http://aws.amazon.com/datapipeline/

wmf · on Dec 21, 2012

Data Pipeline looks like a fine orchestration service, but it's not going to ingest 48 TB of data any faster than you can do it yourself. Which is probably not that fast.
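
To put a number on "not that fast", here is a back-of-the-envelope sketch. The link speeds are assumptions picked for illustration (nothing in the announcement quoted here states the instance's network bandwidth), and the model assumes the link is fully saturated with no protocol overhead.

```python
def transfer_hours(terabytes, gbit_per_s):
    """Hours needed to move `terabytes` over a link sustaining `gbit_per_s`."""
    bits = terabytes * 1e12 * 8          # decimal terabytes -> bits
    seconds = bits / (gbit_per_s * 1e9)  # Gbit/s -> bit/s
    return seconds / 3600


for link in (1, 10):  # hypothetical sustained link speeds in Gbit/s
    hours = transfer_hours(48, link)
    print(f"48 TB at {link} Gbit/s: {hours:.1f} h ({hours / 24:.1f} days)")
```

Under these assumptions, filling the box takes on the order of days at 1 Gbit/s and most of a working day at 10 Gbit/s, and real transfers would be slower.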

throw_away · on Dec 21, 2012

maybe + http://aws.amazon.com/importexport/

alexpopescu · on Dec 21, 2012

> No one actually uses disks in such a fashion

Not sure what you mean... there are quite a few data storage & processing tools out there that use a log-structured on-disk storage that were designed to read/write sequentially. These could potentially take full advantage of these instances.

It is true that many other solutions require random IO, but I still think there are systems that could benefit.

corresation · on Dec 21, 2012

I mean that no one ever would configure 24 disks as RAID-0 -- the probability of a data-loss failure (which may not be a problem from the data-loss side, but is from the process continuity side) becomes incredibly high. Best case most people would arrange it as RAID-10, instantly dropping the performance by half.

zwily · on Dec 21, 2012

If the use case is to only use the instance for several hours or days for some intense processing, then RAID-0 over 24 disks is probably just fine.
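
The trade-off in the last two comments can be sketched with a simple reliability model. The 3% annual failure rate is a made-up illustrative figure, and the model assumes disks fail independently at a constant rate, which real drives only approximate.

```python
def raid0_failure_probability(disks, annual_failure_rate, years):
    """Chance that at least one disk fails, which is fatal for a RAID-0 stripe."""
    per_disk_survival = (1.0 - annual_failure_rate) ** years
    return 1.0 - per_disk_survival ** disks


AFR = 0.03  # hypothetical per-disk annual failure rate

for label, years in (("1 day", 1 / 365), ("1 week", 7 / 365), ("1 year", 1.0)):
    p = raid0_failure_probability(24, AFR, years)
    print(f"24-disk RAID-0 over {label}: {p:.2%} chance of losing the array")
```

Under these assumptions the risk over a year is around a coin flip, while over a day it is a fraction of a percent, which is consistent with both views: risky as long-lived storage, tolerable for a short batch job whose input can be reloaded.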