Database Performance at Scale – A free book (scylladb.com)
353 points by asicsp on Oct 5, 2023 | 122 comments



I wish there were authoritative books or papers on how to build object stores like S3, or software-defined storage in general. It looks like object stores in public clouds have been so successful that few companies or research groups work on their own anymore. Yes, I'm aware that we have systems like MinIO and Ceph, but so many questions are left unanswered: how to achieve practically unlimited throughput like S3 does; what kind of hardware would be optimal for S3's workload; how to optimize for the large scans incurred by analytics workloads, which S3 is really good at; how to support strong consistency like S3 does without visibly impacting performance, even though S3 internally must have a metadata layer, a storage layer, and an index layer; how to shrink or expand clusters without impacting user experience; how to write an OSD that squeezes out every bit of hardware performance (Vitastor claims to, but there aren't many details); and the list goes on.


> how to achieve practically unlimited throughput like S3 does

> like what kind of hardware would be optimal for S3's workload

> how to support strong consistency like S3 does without impacting system performance

I think most of these questions are first and most importantly _hardware_, _money_ and _physics_ questions.

I have no expertise in this matter, but I think this would be a good proxy answer: you make a system _seem_ to have unlimited throughput by interconnecting _a lot_ of the fastest available storage with the fastest available network cards, and putting it as close as possible to where your users are going to be. All of this is extremely expensive, so you would need deep pockets, as Amazon has, to make it possible (or a lot of investment).

I suspect that with the right hardware and physical deployment locations you could fine-tune Ceph or MinIO or whatnot to performance similar to S3's. S3 is an object store, so the distributed-systems aspects of its implementation should be a lot easier than, say, distributed SQL (not saying either is an "easy" thing to accomplish).

If you are interested in which hardware to use for a SAN, I found these benchmarks that may be exactly what you are looking for :-) [1]

--

1: https://www.spec.org/storage2020/results/


> I think most of these questions are first and most importantly _hardware_, _money_ and _physics_ questions.

Actually, money (more accurately, cost) is a constraint rather than a resource. S3 is known for its low cost; S3 can easily dole out a 70% discount to its large customers and still make a profit. So an interesting question is how to build a low-cost object store.

> interconnecting _a lot_ of the fastest available storage

"A lot of" leads to real challenges. Sooner or later you'll find that managing metadata will become a tough challenge. Case in point, Open-source systems often use systems like Zookeeper or etcd or single-node name server with hot standby for metadata management, which certainly won't be able to handle the scale of S3.


About cost, see [1]. Also, S3 prices have been increasing and there have been a bunch of alternative object store offerings from other companies. I think people here on HN often comment about the increasing costs of AWS offerings.

Distributed systems and consensus are inherently hard problems, but there are a lot of implementations you can study (like etcd, which you mention, or NATS [2], which I've been playing with and which looks super cool so far :-p) if you want to understand the internals, on top of the many books and papers out there.

Again, I never said it was "easy" to build distributed systems, I just don't think there's any esoteric knowledge to what S3 provides.

--

1: https://en.wikipedia.org/wiki/Economies_of_scale

2: https://nats.io/


Uh, reference for s3 prices going up? I’ve only ever seen them heading down over time.


Sorry, this is just anecdotal from my recollection of reading random Hacker News threads; I think people talk about the bandwidth being expensive more than the storage itself [1].

--

1: https://hn.algolia.com/?dateEnd=1696464000&dateRange=custom&...


It's not documented in a single authoritative place, but AWS has documented much of the S3 architecture you are curious about via white papers and presentations at re:Invent and other conferences.

This talk explains how S3 achieves such high performance: https://youtu.be/sc3J4McebHE?si=-slHmjdQ4Z0EKQXs&t=1386


I haven't used it for anything that complicated yet, but this is the kind of stuff I have found GPT-4 really useful for, personally. Research, not refactoring.


“A free book” aka a book by a database vendor that wants to skew the ostensibly vendor-agnostic premise of "this is just a discussion on performance at scale" toward that very vendor. Nothing is free. It's all marketing.


I do like the premise of this one, though:

- here is how to optimize your database
- here is why that works from a theoretical perspective
- by the way, our product makes this process easier

That's fine by me.


Yes it’s marketing, but if you have a complex product or service offering, a short book can be a good way to get the thinking across.


Alex Debrie: " Damn, this is an excellent resource. I read the first few chapters closely and plan to return to the rest. Lots of interesting stuff + frank discussion of how to truly evaluate different database options."

https://x.com/alexbdebrie/status/1709960305122418898?s=20


Beautiful. I've been aiming to learn how to scale MySQL databases so I can run my apps on VMs without having to use managed DBs like Aurora or Azure Managed Database.


Check out RonDB (https://www.rondb.com/blog), a fork of MySQL Cluster with support for elasticity with no downtime, backups, etc. Disclaimer: I am involved with RonDB.


What are the tradeoffs? Guessing based on the stack it's eventual consistency


No. It's a distributed database. Transactions are committed in memory, and then (by default every second) committed as a group to disk. This gives you phenomenal write throughput and low latency, but the tradeoff is that if a whole cluster goes lights out, you could lose a second of data. It's kind of like turning off fsync for InnoDB.
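
For readers unfamiliar with the InnoDB analogy, the knob being referenced is the redo-log flush setting; a minimal MySQL sketch of the same durability-for-throughput trade (this is just the analogy, not a description of RonDB's internals):

  -- Durable default: the redo log is flushed (fsync'd) on every commit.
  SET GLOBAL innodb_flush_log_at_trx_commit = 1;

  -- Relaxed setting the analogy points at: write the log on commit but
  -- flush it to disk only about once per second, so an OS crash or power
  -- loss can cost up to ~1s of committed transactions in exchange for
  -- much higher write throughput.
  SET GLOBAL innodb_flush_log_at_trx_commit = 2;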


How big are your apps? Wouldn’t the most simple and reliable way be to get one beefy MySQL server and let all the app instances connect to that?


One large VM won't give you replication/sharding, failover, or backups.


With one large VM, for most use cases, you do not need anything but the backups.


Sure, if your workload is small and you have a SLA of "it will be fine."


Most businesses do not need an SLA on their webservice beyond ‘call us if it doesn’t work’.


Check out Vitess! It’s MySQL at scale.


Why don’t more companies / startups choose ScyllaDB rather than Postgres or MySQL?

ScyllaDB is a C++ rewrite of Cassandra, so its speed and scalability look like a clear advantage over the Java-based Cassandra, and Discord is using ScyllaDB at scale too.


Because Postgres and MySQL are more familiar, and by the time they're falling over you have raised $80M and can afford to hire folks like me, whose entire job is pestering developers to stop reading-after-writing, to stop forcing queries to the primary (just in case there's lag!), to stop doing "select id from table" -> "select data_I_want from other_table where ID IN (list-of-a-million-ids)", to beg AWS for early access to the fastest possible machines, and to systematically destroy the custom ORM you built during the first 5 years of the company, before you had customers.

Scylla/Cassandra is exactly the database you want when you're serving the entire planet.

Postgres is exactly the database you want when you're desperately searching for product-market fit and just need things to work and code to get written.

Turns out it's much better to start simple and maybe face scaling problems than to start complex and face scaling problems anyway.


Curious how you approach business units that ask for reports requiring multiple joins, that can "filter by any column" and "sort by any column", that also offer pagination, and who then complain about why it's so slow? This is MySQL, by the way.

The sorting by any column plus pagination is the real killer. You can't do cursor-based pagination, so you get LIMIT/OFFSET, which is terrible for performance, at least in this context. Indexes are often useless here as well, due to the filter-by-any-column requirement. We get lucky sometimes with a hard-coded WHERE clause as a baseline filter for the entire report that hits an index, but not always. Just for fun, add in plenty of LIKE '%X%', because limiting them to STARTS WITH is out of the question; it must be CONTAINS.

It's a constant source of frustration for the business and for the dev team.


Sounds like the job for a denormalized OLAP data structure (e.g. star schema) with a BI tool. This is the kind of use case they were made for. In other words, you've rediscovered the motivation for data warehouses.
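
A minimal sketch of what that could look like, with hypothetical table and column names (one wide fact table plus small dimension tables that a BI tool can filter and sort on without touching the OLTP schema):

  -- Hypothetical star schema.
  CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    name         VARCHAR(200),
    region       VARCHAR(50)
  );

  CREATE TABLE dim_date (
    date_key  INT PRIMARY KEY,  -- e.g. 20231005
    full_date DATE,
    year      INT,
    month     INT
  );

  CREATE TABLE fact_order_line (
    order_line_id BIGINT PRIMARY KEY,
    customer_key  INT REFERENCES dim_customer (customer_key),
    date_key      INT REFERENCES dim_date (date_key),
    quantity      INT,
    amount        DECIMAL(12,2)
  );

  -- Reports become straightforward aggregations over the fact table:
  SELECT d.year, d.month, c.region, SUM(f.amount) AS revenue
  FROM fact_order_line f
  JOIN dim_date d     ON d.date_key = f.date_key
  JOIN dim_customer c ON c.customer_key = f.customer_key
  GROUP BY d.year, d.month, c.region;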


Thank you for the feedback. I'll do some research along these lines. We have moved a report or two to Power BI, which has been much faster than similar reports built using dynamic queries. Keeping the data in sync and the Power BI learning curve have not been easy, though.


In our case we switched to holding that data in Elasticsearch. Now we have 2 problems: running Elasticsearch and keeping the data in Elasticsearch in sync with MySQL. We took an application-level approach to triggering record sync to ES but somewhere in our ORM it's doing some magic and doesn't sync in certain places. Database triggers are the next stop.
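
One application-agnostic way to catch the writes the ORM magic misses is to have the database itself record changes in an outbox table that a worker then pushes to Elasticsearch; a rough MySQL sketch with hypothetical table names:

  -- Hypothetical outbox table; a separate worker drains it into ES.
  CREATE TABLE es_sync_outbox (
    id         BIGINT AUTO_INCREMENT PRIMARY KEY,
    entity_id  BIGINT NOT NULL,
    changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
  );

  -- Queue a sync whenever a row in the (hypothetical) products table changes.
  CREATE TRIGGER products_after_update
  AFTER UPDATE ON products
  FOR EACH ROW
    INSERT INTO es_sync_outbox (entity_id) VALUES (NEW.id);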

If we could do away with ES and go back to MySQL for what are glorified data grids with 20+ column filtering, sorting, and no full text requirements, we would.


We use ElasticSearch in a very limited capacity for keyword searches against consumer product metadata. We haven't ventured into using it for BI reporting, but I'm interested. I'm sure there is plenty of knowledge I can google, but any early insights or traps to avoid when moving from SQL-based reporting to something in ES?


If you want to sort by any column and paginate on any column (so you can have your cursor), a BTREE (MySQL index) is sort of the wrong structure in the first place. You probably want a document store for something like this. Scylla!

More pragmatically:

- just force bookmarked scans (see the sketch below)
- limit the columns that can be sorted by
- use a replica with indexes that the writer doesn't have
- pay for fast disks
- use one of the new parallel query engines like Aurora or Vitess
- just tell the business people to wait a dang minute, the computer is thinking!
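
For the "bookmarked scans" point, a minimal keyset-pagination sketch (hypothetical orders table and columns; the row-value comparison syntax is valid in Postgres and recent MySQL, though how well each engine uses the index varies):

  -- OFFSET pagination: the database still walks and discards the first
  -- 100000 rows before returning the page.
  SELECT id, name, created_at
  FROM orders
  ORDER BY created_at, id
  LIMIT 50 OFFSET 100000;

  -- Keyset ("bookmarked") pagination: resume from the last row of the
  -- previous page; with an index on (created_at, id) this is a short
  -- range scan no matter how deep the page is.
  SELECT id, name, created_at
  FROM orders
  WHERE (created_at, id) > ('2023-10-01 12:34:56', 98765)
  ORDER BY created_at, id
  LIMIT 50;

It only works when the sort columns are fixed, which is exactly why "limit the columns that can be sorted by" is on the list.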


The lazy approach is to denormalise the data in advance and load the result into memory. If it’s small enough you can simply return the entire result set to a reporting tool and query it locally.


I think it depends on the problem. I've worked at a tiny telco (<100 people) where MySQL absolutely wouldn't have worked. We ended up using ScyllaDB. We had billions of rows in single tables.

If you're launching an e-commerce site that sells shoes or something, yeah, you probably aren't going to sell a billion shoes every year, so MySQL is probably fine.

Finally, I use this example a lot: GitHub thought they could scale out MySQL by writing their own operator system that sharded MySQL. They put thousands of engineering hours into it, and they still had global database outages that lost customer data.

You get that shit for free with Scylla/Cassandra (the ability to replicate across data centers while still having fine-grained control to do things like enforce local quorum so write speed isn't impacted, etc.).

They probably made the wrong choice.


Cassandra does not make you immune to database issues: https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on...


Yes. You still have to read the docs before randomly setting “auto_bootstrap” to false.

Or just use Astra and not worry about scaling your own cluster.


Did GP make such a claim? I didn’t see it, but maybe the comment was edited.


Are you referring to their use of Vitess?

https://github.blog/2021-09-27-partitioning-githubs-relation...

GitHub didn't write it, Google/YouTube did.


Or just use RDS Aurora for the same benefits out of the box at scale.


You know as I get closer and closer to official graybeard status I start having more of a hair trigger when someone says "just use __________" as if that solves everything and doesn't just create a list of a dozen or two other problems.

Everything is tradeoffs, you need to look at what you're trying to do and see which problems you are okay with. Some of these will be show-stoppers at larger scale. Some might be show-stoppers now.

But "just use __________" is never the right answer for any serious discussion.


Aurora won't work nearly as well for write-heavy/time-series type workloads. It's probably a good middle ground for a lot of applications, though.


> systematically destroy the custom ORM you built during the first 5 years of the company, before you had customers.

Ouch, haha, that hurt


What's wrong with "select id from table"? (presumably "where" skipped for brevity of the example)


What's wrong is replacing a single query with a join by multiple round trips that implement part of the join logic in application code, because you are using an RDBMS as if it were a fairly dumb repository of disconnected tables.

  SELECT id FROM table1
Is fine, if what your app wants is the ids from table1. It's not good if it is just a prelude to:

  SELECT * FROM table2
  WHERE table1_id in (...ids from previous query...)
instead of just doing:

  SELECT table2.*
  FROM table1 
  INNER JOIN table2 ON (
    table2.table1_id = table1.id
  )


Wow, it is hard to believe anyone would use the first option.


Most query optimisers would automatically convert such a subquery into a join.


No query optimizer will automatically convert separate queries, connected by application logic it can't see shuttling data between them, into a single join.


Not if the dev has removed the context of which table the id values come from.


Can you explain a bit further? If the subquery doesn't have the context of where the values come from, I would think the query would error out?


Lots of application developers don’t really understand how sql works. So instead of learning to do a join (or do a sub query or anything like that) they just issue multiple queries from their application backend.

The posters up thread are talking about novice Python / JavaScript / whatever code which issues queries in sequence: first a query to get the IDs. Then once that query comes back to the application, the IDs are passed back to the database for a single query of a different table.

The query optimizer can’t help here because it doesn’t know what the IDs are used for. It just sees a standalone query for a bunch of IDs. Then later, an unrelated query which happens to use those same IDs to query a different table.


What you're saying is that:

  SELECT * FROM table2
  WHERE table1_id in (SELECT ids FROM table1 WHERE ...)
will be optimized to a join by the query planner (which may or may not be true, depending on how confident you are in the stability of the implementation details of your RDBMS's query planner). But in most circumstances, there is no subquery, it's more like:

  SELECT * FROM table2
  WHERE table1_id in (452345, 529872, 120395, ...)
Where the list of IDs is fetched from an API call to some microservice, or from a poorly used/designed ORM library.


What's really bad is when they (or the ORM) generate a separate SELECT statement for each ID in a loop. Selecting thousands of rows one at a time from an unindexed table can be hilariously slow.


And this doesn't even get into bizarre edge cases, like how Postgres treats foo = ANY(ARRAY['bar', 'baz']) differently than foo IN (VALUES(('bar'), ('baz'))). PROTIP: the latter is often MUCH faster. I've learned this twice this week.
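
Spelled out, the two Postgres forms being contrasted (items is a hypothetical table; whether the second form wins depends on list size, statistics, and the planner):

  -- Array form: the planner sees the list as one opaque scalar array.
  SELECT * FROM items WHERE foo = ANY (ARRAY['bar', 'baz']);

  -- VALUES form: the list becomes a small relation the planner can
  -- estimate and join against, which can be dramatically faster for
  -- long lists.
  SELECT i.*
  FROM items i
  WHERE i.foo IN (VALUES ('bar'), ('baz'));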


It looks like what they're describing is implementing database joins in the application: 2 network round trips, first one for ids from a table and the second to get values from another table with foreign keys corresponding to the ids.


Isn't implementing joins in the application what microservices are all about?


No, microservices are all about senior management's ability to map applications to people to yell at.


I have supported a microservices environment where each service had its own database, and indeed "foreign keys" were mostly application-level concerns. A real pain, and I don't think any of us would have chosen that approach in hindsight.

I hope there are better ways to design microservices.


You're getting confused with NoSQL ;)


IDK if Postgres has this problem, but MSSQL does. If you do

  SELECT * FROM table WHERE ID IN (blah)

it ends up with some pretty terrible performance.

It's often a lot faster to do something like:

  INSERT INTO #tempTable (id) VALUES ...

  SELECT table.*
  FROM table
  JOIN #tempTable t ON t.id = table.id


I can't believe this is a thing in 2023! If this is faster, shouldn't the database do that transparently in the background?!


At least in our case, it comes down to expectations.

For the `IN` query the DB doesn't know how many elements are in the list. In MSSQL, it would default to assuming "well, probably a short list" which blows out performance when that's not the case.

When you first insert into the temp table, the DB can reasonably say "Oh, this table has n elements" and switch the query plan accordingly.

In addition, you can throw an index on the temp table, which can also improve performance. Assuming the table you are querying against is indexed on ID, when the other table of IDs is indexed too, the engine doesn't have to assume random access as it pulls out each ID. (Effectively, it just has to navigate the tree nodes in order rather than doing a full lookup into the tree for every ID.)
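
Concretely, the "index on the temp table" point might look like this (T-SQL, hypothetical names):

  -- Materializing the ids gives the optimizer a real row count, and the
  -- primary key doubles as an index for the join.
  CREATE TABLE #tempIds (id INT NOT NULL PRIMARY KEY);

  INSERT INTO #tempIds (id) VALUES (452345), (529872), (120395);

  SELECT t.*
  FROM dbo.BigTable AS t
  JOIN #tempIds AS ids ON ids.id = t.id;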


I read it as "`where` skipped at code time because the table is never going to be too large".


>> Because Postgres and MySQL are more familiar

If familiarity is the primary problem, it should be relatively easy to fix.


>choose ScyllaDB rather than Postgres or MySQL?

I've worked with a large Cassandra cluster at a previous job, was involved in a Scylla deployment, and have a lot of experience with the architecture and operations of both databases.

I wouldn't consider Cassandra/Scylla a replacement for Postgres/MySQL unless you have a very specific problem, namely you need a highly available architecture that must sustain a high write throughput. Plenty of companies have this problem and Scylla is a great product, but choosing it when you are just starting out, or if you are unsure you need it will hurt. You lose a lot of flexibility, and if you model your data incorrectly or you need it in a way you didn't foresee most of the advantages of Cassandra become disadvantages.


There are lots of tradeoffs when choosing Cassandra:

1. If you delete data, tombstones are a problem that nobody wants to have to think about.

2. Denormalizing data instead of joining isn't always practical or feasible.

3. If you're not dealing with large amounts of data, Postgres and MySQL are just easier. There's less headache and it's easier to build against them.

4. One of the big advertised strengths of Cassandra is handling high write volume. Many applications have relatively few writes and large amounts of reads.


Unless it's a commenting system ;)


Fast writes and slow reads was true in 2012. The project has been busy since.


Fast reads are commonplace. Having fast writes for a workload without an especially high volume of writes is not selling anyone on a database.


Because I (we) want rigid schemas with transaction-wrapped DDL migrations more than I want blazing speed at Discord's scale.


You can have a rigid schema with Cassandra/ScyllaDB. Transactions are a whole other thing though for sure.


Plus complex queries with rich capabilities


There are a lot of factors at play, but I think the biggest one is the lack of expertise in the community. For Postgres, it isn't that hard to hire a professional with 10 years of real-world experience.

Besides that, you need to keep in mind that Scylla isn't a silver bullet. There are some workloads it can't handle very well. Where I work, we tried to use it for a workload with high reads and writes and it had trouble keeping up.

In the end we switched back to postgres + redis because it was cheaper.


Can you be more specific? I am surprised Scylla got overwhelmed; it powers Discord.


Why would you use a database that is closely linked to a VC-backed startup when PostgreSQL works fine for you?


Probably the ecosystem and documentation. PaaS like Render, Fly, and even Heroku from back in the day offer addons/plugins/whatever that make the obvious choice out to be Postgres/MySQL. Then there's other tooling such as migration and various ORM adapters. Then there's a wealth of blogs, stack overflow q&a's, etc. And then Postgres/MySQL gets you quite far. Start with a single node -> scale the node -> add read replicas -> shard. Then, sharding becomes painful? Start looking at other data stores, remodeling your data, or eventing, etc.


Because SQL and relational data. They (MySQL and Pg) are a lot more familiar to set up and maintain, and there are a lot of managed service offerings for the usual SQL databases.

Cassandra/Scylla come with a lot of limitations compared to a typical SQL DB. They were developed for specific use cases and aren't directly comparable to SQL DBs; both are completely different beasts from standard SQL databases.


Transactions mean you can build faster as you don't have to write a bunch of code to achieve those features


Is there any benchmark data to show the difference between no-sql DBs written in different languages when it comes to a real distributed system rather than just a handful of machines?


There are a bunch of common workloads (including, I'm guessing, HN itself) where the trade-offs for distributed databases make them a lot harder to use well, or where distributed databases don't make sense.


Government agencies I've worked with usually build an app and put it in production with very few changes for 15+ years.

Data is the most important stuff for them and being able to read it is therefore very important.

A nosql DB structure can usually only be read by the code that goes with it and if nobody understands the code you're doomed.

In 2017, I had to extract data from an app from 1994. The app was built with a language that wasn't widely used at the time and has since disappeared. It disappeared before all the software companies put their docs on the internet. You can't run it on anything older than Windows 2000. There was only one guy in Montréal who knew how to code and run that thing.

As the DB engine was relational, with a tool like sqlplus I was able to extract all the data, and we rebuilt the app based on the structure we found in the database.

If you want to do that with NoSQL, you must maintain separate documentation that explains your structure... and we all know how devs like documentation.


ecosystem integration

Postgres is super well supported, basically every tool out there has a connector to postgres, and most people have familiarity

If I am honest, I never even bother researching alternatives, I always go with postgres out of habit. I know it, and it has never let me down


That speed thing is a nice bit of marketing from Scylla but not really true in the real world.


Haven't had time to look it over. For people who did: is this a generic 'database performance' book or a long-form pamphlet for ScyllaDB?


This is already addressed: "It’s not “about” ScyllaDB – but it’s highly relevant for anyone using or considering it"

It does contain a plethora of references to other databases, and it does exemplify concepts using ScyllaDB. IIRC they also made it clear in the front matter.


I'm referring to the content itself, not what is advertised about it. The article also states that the book is open source, when it's actually not, so I guess I have trust issues.

But I'll take the fact that you created an account just to write this comment as a positive signal that the book is generic and not only about ScyllaDB.


Other than making the PDF freely available what in your mind would need to happen to make this book open source? I'm not even sure what an "open source" book would be in the FOSS meaning of the phrase.

From the edition notice page of the PDF:

> Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

And the comment you're replying to says "It does contain a plethora of references to other databases" so I'm not sure the snarky "I'm referring to the content itself, not what is advertised about it." is necessary or particularly helpful.


Not entirely sure why everyone is so combative.

> what in your mind would need to happen to make this book open source

Gitbooks is a good example of that.

Open Source is easy to define. This is a PDF, so it's the artifact of some 'build' producing it. If a book is open source, I'd expect a .tex file, or whatever the input was which created the PDF, with the pictures attached or the code which draws the pictures/graphs/etc... If it's a word document, I guess that also works. I'm also not the only one to call it out: https://news.ycombinator.com/item?id=37779549

> I'm not sure the snarky "I'm referring to the content itself, not what is advertised about it." is necessary or particularly helpful.

There is no snark here. I asked a question about the content of the book. I got an answer from the article talking about the book, which I already read.


Perhaps "Open Access" as used on the edition notice is more technically accurate but I'm not seeing how getting any "build" assets to generate this PDF is any better or more free/open than having the PDF and being able to do whatever you want to it within the rules of attribution.


Is the book open source? No. End of discussion. What's better or worse or useful or whatever is orthogonal. This is not an open source book.

Edit:

Ok, I downloaded it. Check page 79. I want to change that diagram; I don't like the client icons. I want to open a PR against the book. How do I do it?


SQLite is undeniably open source yet the general public can't contribute code to it, so I'm not sure why that's meant to be the defining feature in the case of the book.

The earlier request for a .tex source file was on the mark, but accepting contributions has nothing to do with it.


Please don't strawman my argument. The idea was that I don't have access to the raw assets used to compile the book, not necessarily that I want to contribute to it. Say I want to fork it.


It's not a strawman just because it hurts your argument. :)

The point is that you can't reasonably say SQLite is not open source, and the limitations around SQLite and this book are comparable. So it's not an immediate "No. End of discussion." that the book isn't open source.

You're entitled to your opinion but you probably shouldn't act like everyone else is an idiot for disagreeing with you.


ScyllaDB is exceptionally good (and unique) as a company in that they virtually ban marketing for its own sake. I'd point you to their 'vendor agnostic' p99 conference as well. Among the most technically credible technology companies of all time.


I'm not sure how this qualifies as open source when the repo for the book [0] is essentially empty?

[0] https://github.com/Apress/db-performance-at-scale


I wondered about that, too, but in the body of the article they claim the license is a Creative Commons flavor so I interpreted it as a marketingdroid faux pas that the source is not available but that the content is open

I guess the distinction would be if someone wanted to upstream a change, not fork it, through what mechanism would that take place, but I'm going with "highly unlikely" on that one


Yes, seems like the GitHub repo was a bit of an afterthought:

Top right:

> About > SOURCE CODE for "Database Performance at Scale: A Practical Guide (Apress, 2023)," by Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov & Cynthia Dunlop

But readme.md:

> This repository ACCOMPANIES Database Performance at Scale:A Practical Guide by Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov & Cynthia Dunlop (Apress, 2023).

[ my emphasis ]


In the context of technical books "source code for" often means "source code for [readers of the book to reference while reading]" rather than "source code for [building the book]"


I disagree. Words have meaning. 'Open source' means 'open source' in all contexts.

For comparison, https://doc.rust-lang.org/book/ is an open source book. A PDF with a CC license without a repo of the publishing artifacts is not an open source book. It's just a free book.


The question is not whether it's open source, the question is what it is.


The first code example is on page 89.


Is it just me, or is the basic knowledge of database normalization and indexing kind of a lost art these days?

How many teams out there just "add cache and more hardware" when no one ran an EXPLAIN on that one core query in a giant table?

Perhaps we should do that before talking about anything "at scale".
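
For anyone who hasn't done it, the ask is small; a minimal Postgres-flavored sketch with a hypothetical query (MySQL has had EXPLAIN ANALYZE as well since 8.0.18):

  -- Run the real query and get timing and row counts from the actual plan.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT o.id, o.total
  FROM orders o
  WHERE o.customer_id = 42
    AND o.created_at >= now() - interval '30 days';

  -- A "Seq Scan on orders" over millions of rows in that output is the
  -- cue to fix the schema or indexing before reaching for caches and
  -- bigger hardware, e.g.:
  CREATE INDEX CONCURRENTLY idx_orders_customer_created
    ON orders (customer_id, created_at);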


I wrote a system that used the database as essentially a rules engine for "what work to execute next?" There were constraints about max throughput per key, waiting periods, etc.

That system broke me down and built me back up over time. Really deep explaining, understanding the internals of how the data is actually being processed, then finding ways to improve it. I had 1-2 other people who were really interested in postgres that would help work on debugging.

We would continuously get 100x query improvements every time we dove in. Including after we thought we maxed it out.

I hope everyone gets the chance to work on something like that. It was really fun.
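
Not claiming this is what they built, but for readers curious what "what work to execute next?" looks like in plain Postgres, the usual starting point is a locked, ordered pick from a queue table; a minimal sketch with a hypothetical jobs schema:

  -- Each worker atomically claims the next eligible job without blocking
  -- on rows other workers have already locked.
  UPDATE jobs
  SET status = 'running', started_at = now()
  WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = 'queued'
      AND run_after <= now()           -- e.g. enforced waiting periods
    ORDER BY priority DESC, run_after
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, payload;

Constraints like per-key throughput caps are where the real complexity (and the repeated 100x tuning wins the parent describes) comes in.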


If anyone starts a new side project, do yourself a favor - write your data access layer in straight SQL. With minimal wiring and safe variable substitution that most libraries provide, you will learn a lot.


> a rules engine for "what work to execute next?"

I'm afraid to inform you that you have built a cluster workload manager.

(Where a "cluster workload manager" is a type of job scheduler designed for — usually HPC — clusters, that prioritizes among ongoing abstract batch "jobs" that can each broken down into more granular atomic scheduleable work-units; and which accounts the often-unpredictable resource-consumption outcomes of those work-units, to the account of the job and its owning user+group+org, to then weight further executions of that job or other jobs owned by that user+group+org, so that ultimately, jobs can proceed at a QoS/rate that's 1. "fair", and 2. in proportion to the amount that a given customer is willing to pay for prioritization. "Business rules" in such a system, are expressed in terms of conditional coefficients or formulae to apply to the "costs" used to re-weight jobs.)

Examples of cluster workload management software:

• Slurm (https://en.wikipedia.org/wiki/Slurm_Workload_Manager)

• IBM Spectrum LSF (https://www.ibm.com/products/hpc-workload-management)

• technically, Kubernetes' kube-scheduler — though it's the most trivial possible implementation of one, without any of the ongoing-fairness-of-QoS logic.

The key thing to know about these, is that it's nearly impossible to build one that lives "quiesced" inside a database (as you've probably found.) Imagine taking an OS process scheduler and trying to keep the state for that canonically inside a database. Then imagine multiplying the write load by 1000x, while keeping the realtime read requirements the same. It's just a bad idea.

You can certainly persist the state of these systems into a DB — but what you really want is one or more active scheduler nodes, that keep a big in-memory ledger matrix, that can be mutated moment-to-moment, without needing to persist every single change to disk in some WAL/journal/etc. Like a regular OS scheduler. Or like, say, a game physics engine — very similar in how it's all about tiny little "integration" steps, many times per second.


I was generic in how I described it because I didn't want to talk about the system (it doesn't matter).

The system was for a very specific use case and the set of rules were all specific to that domain. The domain was not related to workload management or jobs, but you could generalize what it did in those terms.

edit: removed some snark. I did want to rewrite this into an in-memory data structure (Elixir-based). It generally worked but the postgres solution was actually really really good for the domain it was built for.


My point was that — for just your use-case as originally stated — you probably could have taken a FOSS cluster workload manager like SLURM and "narrowed" it to your domain + use-case; rather than starting from a DBMS and adding all the stuff necessary to make it understand how to be a scheduler and then fighting with it to scale its use-case-impedance-mismatched state-storage backend to your production workload. (Where I assume you were "fighting with it" because, as you said, you were "torn down and built back up" in the process of making this system scale.)

If, however, you had other database-related needs you didn't mention — e.g. being able to arbitrarily join the state data of this system as dimension data in OLAP queries on other, non-scheduler-y tables in that DB — then your constraints probably did indeed make using a DBMS the best choice here, despite the scaling challenges.

(Also, funny enough, I am myself in the middle of writing a cluster-workload-manager as a custom in-memory Elixir daemon! But that's with the unique constraint of needing lightweight, realtime prioritization of workloads, each backing synchronous HTTP RPC requests. Having millions of teeny-tiny jobs that are usually over in milliseconds [but which can sometimes run for multiple minutes], isn't something any HPC-targeting workload manager is really architected for.)


I don't think that would've been a good idea. I guess it's possible, but I'd rather take on known obstacles than unknown obstacles (which will definitely exist).

I said "torn down and built back up" in an endearing way. Sure it was difficult work at times, but it was super fun and we would spend 2 days at a time fixing it until something popped up 6 months later.

But the amount of stuff I learned from that was (I will wager) exponentially more useful than figuring out how to adapt and run a fairly major piece of FOSS.


I would slaver at the mouth over that opportunity.


I've worked on a couple of projects where the existing teams were converting some API calls to new tech and changing the UI to accommodate the 'new way' of getting data because "the current way is way too slow". And it was. Going to a common dashboard was taking... 45-50 seconds, and growing.

The core issue was that they were pulling all the data - 20k records - into a JS datatable.

"This is bad - we have to move to server side paging!"

They'd convinced their clients (I've seen variations of this process 5x in the past 6 years) to 'upgrade'.

In at least 2 cases, I rewrote a couple of the queries to pull just what was needed - id/name/email - instead of full ORM records with nested associations. One screen went from 45 seconds to 1 second. And this took... 10-15 minutes of digging and rework. Yes, if your real goal is to learn Hotwire or some other new tech, go for it, but it feels wrong to not even investigate why the code is slow in the first place.

Other project - similar - took a screen from 50 seconds to 2 seconds by reducing the query. Then got the whole thing under a second with a couple of indexes.

Now, yes, this isn't all specific to database optimization, but it feels related. This idea of "oh, compute is cheap, devs are expensive, just throw more hardware at it!" has to have some boundaries/limits, but they often aren't recognized until you're really... in deep.


As a DBA who provides professional services to solve Postgres performance issues, I've seen the same thing so fu*ing many times that it made my eyes bleed.


Database optimization isn't something you'll need while running a Flask app on your personal machine; you'll only have to learn it while trying to handle tons of traffic, which I think explains why many developers are late to the party.

That said, I've never worked on a team that didn't carefully consider their schema from both a performance and extendability perspective. Even if the average age of your team is 23, any half decent TL will know the basics. I imagine this is only a problem at startups founded by students.


> I imagine this is only a problem at startups founded by students.

I think you imagine wrong.

> I've never worked on a team that didn't carefully consider their schema from both a performance and extendability perspective.

Count yourself very lucky.


I dare to say the opposite.

Students still care and should have learned these principles in uni. They tend to overengineer and focus too much on the technical challenges

In bigger companies, most IT people tend to be people who got the job because they started coding back in the day when you could get by using just common sense and gut feeling. Today we know those often lead to dark patterns and bad database design, but those people have now grown into architecture and management positions and have no clue what they are talking about.

This is very prevalent in tech teams of non tech companies (e.g. manufacturing, banking, healthcare, CPGs,...)


Maybe you're the unlucky one? How does someone that doesn't know basic system design get put in charge of designing a system?


Shortest answer is they're put in charge by someone who doesn't know that the appointee doesn't know. Either they don't know they're supposed to know that, or (ime) more often they don't know how to judge. As someone else replied, passing leetcode-style tests is taken as a mark of technical skill well beyond what it's intended to indicate.

Glad to hear you've been part of well functioning competent teams of professionals. I know they exist - I've been there a couple times myself over the years, but it's not a given. I would probably say it's not even the norm if you include a wider range of places where tech/developer folks may be employed.

A 'team lead' role is filled by someone who is dating the department manager's family member? Seen that happen.

"Joe's been here longer, so he gets to make the final decision on things". OK, but Joe graduated high school last year, and he's cutting in to systems built by people with 20+ years of experience (both in software and in this domain), and Joe is replacing it with some cut/paste stuff he found on stackoverflow last week.

I was going to produce a larger litany of dumb scenarios I've had to wrestle with, in painful detail, but unsure that would do much to convince you of anything. (venting does help me occasionally, but won't ATM).


Because people with experience expect a certain salary level.


They murdered the Leetcode interview, so they must be good at everything else /s


I think most devs understand indexing at this point, if only because ORMs generally include it as a keyword. What they don't understand is why an index might fail to be used, which to be fair, is a massive subject on its own. MySQL is one thing, since it pretty much only has B+tree (plus FTS). It's also case-insensitive by default, so foot guns like "smith" != "Smith" go away. But of course, MySQL isn't cool anymore, and dear lord does Postgres have some gotchas.

Wrong collation? No index. Wrong case? No index. You didn't explicitly cast something and the result technically works, but isn't exactly correct? No index. Used a partial index with `WHERE foo` and then queried with `WHERE foo IS NOT FALSE`? No index. Table stats are wrong because you haven't adjusted autovacuum/autoanalyze daemons? Believe or not, no index.
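
A couple of those foot-guns in concrete form (hypothetical tables; always confirm against your own plans with EXPLAIN):

  -- Case gotcha: an expression index is only considered when the query
  -- repeats the exact expression.
  CREATE INDEX idx_users_email_lower ON users (lower(email));

  SELECT * FROM users WHERE lower(email) = 'smith@example.com';  -- can use the index
  SELECT * FROM users WHERE email = 'Smith@example.com';         -- cannot

  -- Partial-index gotcha from above: the query predicate has to imply the
  -- index predicate; "foo IS NOT FALSE" also matches NULLs, so it doesn't.
  CREATE INDEX idx_t_foo ON t (id) WHERE foo;

  SELECT * FROM t WHERE foo;               -- can use idx_t_foo
  SELECT * FROM t WHERE foo IS NOT FALSE;  -- cannot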

As for normalization, yeah, that's right out. Unless you had a unicorn design your schema, 0% chance dev teams know or care about it until it bites them, and then it's a massive undertaking to fix it.

Source: DBRE who is both very happy that my skillset is absurdly in demand, and also constantly screaming into the void about everything.


Running explain is easy. Understanding the explanation is the hard part, and for some reason resources are scarce.


Wait until you see DBs with nonsensical schemas, wrong data type usage (like storing dates as text), non-existent PK/FK/CHECK constraints, no use of VIEWs...

i.e., using an RDBMS as if it were NoSQL and complaining that RDBMSs are bad...


I've seen ints deliberately stored as text. Not because the id field changed at some point and they just kept it around as legacy (I mean, bad idea, but fair enough), but because they thought storing them as text would somehow be better.

As to NoSQL, yeah - devs love shoving everything into a JSON column, and then complain that it's slow. MongoDB is of course there for their web scale needs.


I appreciate that "Just the PDF" actually gives me the PDF and not 10 more popups to subscribe to someones newsletter.


> Noticing that a certain NoSQL database was recently trending on the front page of Hacker News, Patrick picked it for his backend stack

I feel attacked! :D



