Data is big $$$. Slap a couple of NoSQL databases and Spark on your resume and watch the money roll in. DBAs are disappearing with managed services, though.
Yeah, I don't know about "slap." We want you to have deep production experience with these systems. Designing them, deploying them at significant scale, predicting their pitfalls and avoiding them proactively. Diagnosing systemic problems and finding reasonable solutions.
If you can't magically put out production fires, on huge high-throughput systems, potentially in the dead of night, we are unlikely to pay you $300-400K.
The intent was to be a little hyperbolic and self-effacing. In terms of competent and capable developers, I think it's hard to get a better return on your skill set than adding "data" stuff. And honestly I think it's one of the most critical skill sets that is lacking across the board. So many big companies have great data engineering teams, but generally other dev teams are left to design their own databases, which is a shit show. And even then, it's amazing to me how difficult it is for data engineering teams to move from framework to framework without just mapping old solutions onto new technology.
My career has been primarily focused on something like "bringing modern data-driven solutions" to big companies. The one constant challenge is that most teams (and leadership) aren't prepared to handle the responsibilities of data engineering and stewardship in transactional, operational systems. I feel like a critical responsibility when I come on as a consultant is to impart knowledge about managing their data.
I like Scylla. I've been working with it for a while now, and it's a good alternative for transactional loads. It's a hell of a lot faster than Cassandra, and much, much cheaper than DynamoDB. Cassandra has always felt like improvements came in fits and starts. I work at a Fortune 50 company, and Amazon quoted us ~$2-3 million a year to run our load on Dynamo (we were paying $350k/year for Aurora). With Scylla, we're looking at under $100k for an order of magnitude better performance and a nice stable system that doesn't wake me up in the middle of the fucking night.
I've been following Scylla since the beginning (only because I like their mascot), and it's actually been sort of interesting to watch what's going on with the company. I've spent the past few years designing transactional systems backed by Cassandra. This is the first time I've been able to use Scylla on someone else's dime, though. The unpleasantly big company I'm at right now is looking to replace a bunch of infrastructure with ScyllaDB (Couchbase, Cassandra, Elasticsearch, DynamoDB). It's catching on for sure, but it still doesn't return any results when I search Dice. It looks like Discord is hiring, though...
I'd love to hear how Dynamo would end up being $2-3 million a year. They sure do a great job of convincing people that it's cheap, so I'm curious where the cost blows up.
If you are doing north of a million ops per second on DynamoDB, you can quickly run into the $2-3 million a year range.
In this 2018 benchmark, we were able to calculate that a sustained, provisioned load of only 160k write ops/sec and 80k read ops/sec on DynamoDB would cost >$500k per year.
That was a few years ago. These days, according to our most current pricing, you could do DynamoDB provisioned, 1-year reserved, for $38,658/month, which is "only" $463,896 annually (pop open the "Details" button and choose "vs. DynamoDB").
The same workload on Scylla Cloud would run $29,768/month reserved, or $357,216 per annum, which is about 77% of the DynamoDB price (roughly 23% cheaper).
Of course, all of this is just pure list price. Depending on volume you might be able to negotiate better pricing. However, you'd need a really steep discount for DynamoDB just to get back to Scylla Cloud's list price.
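If you want to sanity-check the math yourself, here's a back-of-envelope sketch in Python. The per-unit WCU/RCU rates are my assumptions based on published us-east-1 list prices, not part of anyone's quote, so verify them against the AWS pricing page:

```python
# Back-of-envelope check of the provisioned-capacity math above.
# The WCU/RCU rates below are assumed us-east-1 list prices; verify
# against the AWS pricing page before relying on any of this.
WCU_PER_HOUR = 0.00065   # $ per write capacity unit-hour (assumed)
RCU_PER_HOUR = 0.00013   # $ per read capacity unit-hour (assumed)
HOURS_PER_YEAR = 24 * 365

def dynamodb_provisioned_annual(write_ops: int, read_ops: int) -> float:
    """Annual list cost for sustained provisioned throughput, no reservations."""
    hourly = write_ops * WCU_PER_HOUR + read_ops * RCU_PER_HOUR
    return hourly * HOURS_PER_YEAR

# The 2018 benchmark workload: 160k writes/sec and 80k reads/sec sustained.
print(f"${dynamodb_provisioned_annual(160_000, 80_000):,.0f}/yr")  # ~$1.0M
# A 1-year reservation cuts that roughly in half, which is how the
# benchmark landed in the ">$500k per year" range.

# The current quotes, monthly reserved -> annual:
dynamo_annual = 38_658 * 12   # $463,896
scylla_annual = 29_768 * 12   # $357,216
print(f"Scylla Cloud is {scylla_annual / dynamo_annual:.0%} of the DynamoDB "
      f"price, i.e. about {1 - scylla_annual / dynamo_annual:.0%} cheaper")
```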
Let me know if you spot any math errors or omissions on my part.
Cassandra won't try to load a whole partition into memory. It doesn't work that way. The only way you'd get behavior like that is by tacking ALLOW FILTERING onto your queries. ALLOW FILTERING is a dedicated keyword for "I know I shouldn't do this but I'm going to anyway". If you're trying to run those types of queries, use a different database. If someone is making you use a transactional database without joins for analytical loads, get a different job, because that's a nightmare.
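For anyone who hasn't bumped into it, here's a minimal sketch of the difference using the DataStax Python driver; the cluster address and the orders table are made up for illustration:

```python
from cassandra.cluster import Cluster

# Hypothetical cluster and schema: orders is partitioned by customer_id.
session = Cluster(["127.0.0.1"]).connect("shop")

# Fine: the partition key is specified, so this touches exactly one partition.
session.execute("SELECT * FROM orders WHERE customer_id = %s", ("c42",))

# Not fine: no partition key, so this is a cluster-wide scan. Cassandra
# refuses to run it unless you bolt on the escape hatch:
session.execute("SELECT * FROM orders WHERE status = %s ALLOW FILTERING",
                ("shipped",))
```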
Also, your partitions should never get that large. If you're designing your tables in such a way that the partitions grow unbounded, there's an issue. There are lots of ways to ensure that the cardinality of partitions grows as the dataset grows, and you control this behavior by managing the partitioning. It's really easy to grok the distribution of data on disk if you think about how it's keyed.
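A concrete (hypothetical) example of what I mean, bounding partitions with a time bucket:

```python
from datetime import datetime, timezone

# Hypothetical schema: folding a day bucket into the partition key means
# each partition holds at most one sensor-day of data, no matter how long
# the system runs.
DDL = """
CREATE TABLE IF NOT EXISTS sensor_readings (
    sensor_id text,
    day       date,        -- the bucket: one partition per sensor per day
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
"""

def bucket_for(ts: datetime) -> str:
    """Which partition bucket a reading lands in."""
    return ts.astimezone(timezone.utc).date().isoformat()

# A sensor writing once a second tops out at ~86,400 rows per partition;
# as data grows you get more partitions, not bigger ones.
print(bucket_for(datetime.now(timezone.utc)))
```

The bucket granularity is the knob: pick it so your worst-case writer still produces partitions in the tens of megabytes, not gigabytes.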
You've basically listed a bunch of examples of what happens when you don't use a wide-column store correctly. If you're constantly fixing downed nodes, you're probably running the cluster on hardware from Goodwill.
A huge partition is often spread across multiple SSTables, and it often has tombstone issues if it consists of a large number of column keys or sees any regular update cycle, which is often the case for hot rows.
In that case, the overhead of collecting all the pieces of the data you need from across the different SSTables, and then processing the tombstones, can lead to a lot of memory pressure.
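One way to catch this before it wakes you up: Cassandra's node-local system.size_estimates table exposes approximate mean partition sizes. A rough sketch (the 100 MB threshold is my own rule of thumb, not an official limit):

```python
from cassandra.cluster import Cluster

# Assumes a node reachable on localhost; size_estimates is node-local
# and approximate, so treat the numbers as a smell test, not gospel.
session = Cluster(["127.0.0.1"]).connect()

rows = session.execute(
    "SELECT keyspace_name, table_name, mean_partition_size "
    "FROM system.size_estimates"
)
THRESHOLD = 100 * 1024 * 1024  # 100 MB, assumed red-flag cutoff

for r in rows:  # one row per token range, so tables can repeat
    if r.mean_partition_size > THRESHOLD:
        print(f"{r.keyspace_name}.{r.table_name}: "
              f"mean partition ~{r.mean_partition_size / 2**20:.0f} MB")
```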