
"On the limited hardware I run for getsentry.com, that is, two servers that actually service requests (one database, one app), we’ve serviced around 25 million requests since August 1st, doing anywhere from 500k to 2 million in a single day. That isn’t that much traffic, but what’s important is it services those requests very quickly, and is using very little of the resources that are dedicated to it. In the end, this means that Sentry’s revenue will grow much more quickly than it’s monthly bill will."

There's nothing in this justification that doesn't also apply to Heroku. The cost structures just aren't significantly different at a two machine scale. However, as people keep pointing out, the roll-your-own-cloud approach requires that you build and maintain a bunch of infrastructure that Heroku has already built for you, or you forego redundancy and fault tolerance that Heroku has already built for you.

The best lines of code are the ones you don't have to write.



I didn't note it in this post (but did in others): before I switched, my Heroku bill was almost $700 (and I couldn't get it to perform well). The current bill is far less, even with growth.


Yeah, I read your original post. I've worked pretty extensively with Heroku and with large, custom, in-house infrastructure, and I don't share your experience.

There's an I/O penalty for working on AWS, but it's on the order of tens of percent, not hundreds. I suspect that your original problems were related to working set size relative to cache (since Ronin => Fugu bumps cache by over 2GB, and you said that Fugu was working well).
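A rough way to check the working-set-vs-cache theory, assuming a Postgres database and psycopg2 (the DSN here is just a placeholder):

    # Sketch: Postgres buffer-cache hit ratio since stats were last reset.
    # A ratio well below ~0.99 on a busy app usually means the working set
    # doesn't fit in cache and reads are going to disk.
    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@host/dbname")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("""
        SELECT sum(heap_blks_hit)::float
               / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0)
        FROM pg_statio_user_tables
    """)
    print("cache hit ratio:", cur.fetchone()[0])
    conn.close()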

Heroku's largest database has a 68GB cache at an (admittedly expensive) $6,400 a month. But even so, $6,400 is a small expense for a growing web application. A mediocre developer costs more than that. Trading off server cost for developer cost is an asymptotically bad bet.


Actually I think I/O was my primary bottleneck. Once I addressed that I started hitting CPU/memory constraints on Dynos.

The database definitely wasn't unreasonable at $400, but for a bootstrapped project (especially something that's a side project for me), that was a big consideration.

I probably would have toughed it out with Heroku if I could have gotten things to perform better. At one point I was running 20 dynos trying to get enough CPU for worker tasks to actually keep up, and I unfortunately couldn't solve the bottleneck to where the cost was reasonable.

The application isn't typical (what Sentry does pushes some boundaries of SQL storage, for starters), but it was costing me too much time struggling to optimize something that really shouldn't have needed that much effort.

I definitely like the redundancy provided, and the ability to add application servers with zero thought is a huge plus; I just couldn't justify the cost of the service on top of the frustration and time spent trying to scale it on Dynos.


"Actually I think I/O was my primary bottleneck. Once I addressed that I started hitting CPU/memory constraints on Dynos....At one point I was running 20 dynos trying to get enough CPU for worker tasks to actually keep up"

Something doesn't make sense. If you "addressed" your I/O problem, then your CPUs were all busy doing something in software that was much, much slower than a disk read/write (which would have to be both obvious and unbelievably horrible). If that's true, something pathological was going on in your code. I'm going to assume that you would have noticed it -- swapping, for example.

So let's go back to I/O: if your database was slow, you might observe something superficially similar to what you've described: throwing lots of extra CPUs at the problem would result in lots of blocked request threads, and it would appear that your dynos were all pegged. The exact symptoms would depend on your database connection code and your monitoring tools. But in no case would throwing more dynos at a slow database make sense, so I'm going to assume that you didn't do that on purpose (right?).
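One quick sanity check for that scenario is to compare wall-clock time against CPU time around a unit of work: if wall time dwarfs CPU time, the worker is waiting on something (the database), not computing, and more dynos won't help. A minimal sketch, with handle_request standing in for whatever the worker actually does:

    # Sketch: distinguish "busy" from "blocked". Large wall time with small
    # CPU time means the worker is waiting on I/O, not burning CPU.
    import time

    def handle_request():
        time.sleep(0.2)  # placeholder for real work: queries, serialization, etc.

    wall0, cpu0 = time.perf_counter(), time.process_time()
    handle_request()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    print(f"wall: {wall*1000:.0f} ms   cpu: {cpu*1000:.0f} ms   ({cpu/wall:.0%} actually busy)")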

Given the above, I still can't meet you at the conclusion that abandoning Heroku was the magic bullet for your problems. There's not enough information, and it doesn't add up. My money is on one or more of the following: DB cache misses (i.e. not enough cache); a heavy DB write load; frequent, small writes to an indexed table; or pathological memory usage on your web nodes. And if it turns out that the cause is due to I/O, you've only bought yourself a temporary respite by moving off Heroku. Eventually, you'll get big enough that the problem will re-emerge, even though your homebuilt servers are 10% faster (or whatever).
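The write-load suspects, at least, are cheap to rule in or out from Postgres's own statistics views (again a sketch, same placeholder DSN as above):

    # Sketch: find the tables taking the heaviest write traffic, which is
    # where frequent, small writes to indexed tables would show up.
    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@host/dbname")  # placeholder DSN
    cur = conn.cursor()
    cur.execute("""
        SELECT relname, n_tup_ins, n_tup_upd, n_tup_del,
               pg_relation_size(relid) AS table_bytes
        FROM pg_stat_user_tables
        ORDER BY n_tup_ins + n_tup_upd + n_tup_del DESC
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)
    conn.close()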

EDIT: Aha! Your comment in another thread actually explains your problem: you were swapping your web nodes by using more than 500MB RAM (http://news.ycombinator.com/item?id=4458657).
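That's also simple to confirm from inside the process itself by logging its peak resident set size (a sketch; ru_maxrss is kilobytes on Linux and bytes on macOS, and the ~500MB figure comes from that linked comment):

    # Sketch: print this process's peak resident set size. If it approaches
    # the dyno's ~500MB limit, the dyno starts swapping and everything slows down.
    import resource, sys

    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    rss_mb = rss / 1024 if sys.platform.startswith("linux") else rss / (1024 * 1024)
    print(f"peak RSS: {rss_mb:.1f} MB")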


It would take me more than one blog post to describe the architecture that powers Sentry and the various areas that have, or could have, bottlenecks (some more obvious than others).

More importantly, it's been a few months since I made the switch, and I don't remember the specifics of the order of events. I can assure you, though, that I know a little something about a little something, and I wasn't imagining problems.

(Replied to the wrong post originally, I fail at HN)

http://news.ycombinator.com/item?id=4458643


"I can assure you though that I know a little something about something, and I wasn't imagining problems."

Since you've made it clear in another thread that you were actually running out of RAM on your dynos, I imagine you were running into trouble. There's no need to be snide about it.

Bottom line: you hit an arbitrary limit in the platform. If Heroku had high-memory dynos, the calculus would be different. In the future, instead of arguing that your homebrew system is better than "the cloud", you could just present the actual justification for your choice.


There's an I/O penalty for working on AWS, but it's on the order of tens of percent, not hundreds.

That's rather optimistic.

The EC2 ephemeral disks normally clock in at 6-7ms latency; that's >13x slower than dedicated disks.

EBS clocks in at 70-200ms latency; that's >5x slower than a dedicated SAN.

And that is under optimal conditions. In reality the I/O performance on EC2 frequently degrades by orders of magnitude for long periods of time.
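Claims like these are easy to sanity-check on any box with a crude probe along these lines (a sketch only; a real tool like fio gives far better numbers):

    # Sketch: crude synchronous-write latency probe. Writes 4KB and fsyncs
    # it 100 times, then reports the median. Not a substitute for fio.
    import os, time, tempfile

    fd, path = tempfile.mkstemp(dir=".")
    block = b"\0" * 4096
    samples = []
    for _ in range(100):
        t0 = time.perf_counter()
        os.write(fd, block)
        os.fsync(fd)  # force the write all the way to the device
        samples.append((time.perf_counter() - t0) * 1000)
    os.close(fd)
    os.unlink(path)
    samples.sort()
    print(f"median fsync latency: {samples[len(samples) // 2]:.2f} ms")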


"The EC2 ephemeral disks normally clock in at 6-7ms latency, that's >13x slower than dedicated disks."

Um. Are you comparing hard drives to SSDs? Rotational latency for a 15k drive is a couple of milliseconds. Seek time for server drives varies from 3-10ms.

EC2 disks are slow, but there's no way they're 13-fold slower than your average server drive. And 6-7ms is just about on par with commodity hardware.
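For reference, the arithmetic behind those drive numbers (a quick worked check):

    # Average rotational latency is half a revolution; seek time adds on top.
    rpm = 15000
    avg_rotational_ms = (60.0 / rpm) * 1000 / 2   # ~2.0 ms for a 15k drive
    print(f"15k drive: ~{avg_rotational_ms:.1f} ms rotational + 3-10 ms seek")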


Are you comparing hard drives to SSDs?

No, but I mistyped the numbers. That was supposed to read: 60-70ms.



