I wish there were authoritative books or papers on how to build object stores like S3, or software-defined storage in general. It looks like object stores in public clouds have been so successful that few companies or research groups work on their own. Yes, I'm aware that we have systems like MinIO and Ceph, but so many questions are left unanswered: how to achieve practically unlimited throughput like S3 does; what kind of hardware would be optimal for S3's workload; how to optimize for the large scans incurred by analytics workloads, which S3 is really good at; how to support strong consistency like S3 does without visibly impacting performance, even though S3 internally must have a metadata layer, a storage layer, and an index layer; how to shrink or expand clusters without impacting user experience; how to write an OSD that squeezes out every bit of hardware performance (Vitastor claims to, but there aren't many details); and the list goes on.
> how to achieve practically unlimited throughput like S3 does
> like what kind of hardware would be optimal for S3's workload
> how to support strong consistency like S3 does without impacting system performance
I think most of these questions are first and most importantly _hardware_, _money_ and _physics_ questions.
I have no expertise in this matter, but I think this would be a good proxy answer: you make a system _seem_ to have unlimited throughput by interconnecting _a lot_ of the fastest available storage solutions with the fastest available network cards, putting it as close as possible to where your users are going to be. All of this is extremely expensive, so you would need to have deep pockets, as Amazon does, to make it possible (or a lot of investment).
I suspect that with the right hardware and physical deployment locations you could fine-tune Ceph or MinIO or whatnot to similar performance as S3. S3 is an object store, so the distributed-systems aspects of its implementation should definitely be a lot easier than, say, distributed SQL (not saying either is an "easy" thing to accomplish).
If you are interested in which hardware to use for a SAN, I found these benchmarks that may be exactly what you are looking for :-) [1]
> I think most of these questions are first and most importantly _hardware_, _money_ and _physics_ questions.
Actually, money (more accurately, cost) is a constraint instead of a resource. S3 is known for its low cost, and S3 can easily dole out a 70% discount to its large customers and still make a profit. So, an interesting question is how to build a low-cost object store.
> interconnecting _a lot_ of the fastest available storage
"A lot of" leads to real challenges. Sooner or later you'll find that managing metadata will become a tough challenge. Case in point, Open-source systems often use systems like Zookeeper or etcd or single-node name server with hot standby for metadata management, which certainly won't be able to handle the scale of S3.
About cost, see [1]. Also, S3 prices have been increasing and there's been a bunch of alternative object store offerings from other companies. I think people here (HN) often comment about the increasing costs of AWS offerings.
Distributed systems and consensus are inherently hard problems, but there are a lot of implementations you can study (like etcd, which you mention, or NATS [2], which I've been playing with and looks super cool so far :-p) if you want to understand the internals, on top of many books and papers released.
Again, I never said it was "easy" to build distributed systems, I just don't think there's any esoteric knowledge to what S3 provides.
Sorry, this is just anecdotal from my recollection of reading random Hacker News threads; I think people talk about the bandwidth being expensive more than the storage itself [1].
It's not documented in a single authoritative place, but AWS has documented much of the S3 architecture you are curious about via white papers and presentations at re:Invent and other conferences.
I haven't used it for anything that complicated yet, but this is the kind of stuff I have found GPT-4 really useful for, personally. Research, not refactoring.
“A free book” aka a book by a database vendor that wants to skew the premise of this-is-just-a-discussion-on-performance-at-scale-vendor-agnostic to that very vendor. Nothing is free. It’s all marketing.
Alex Debrie: " Damn, this is an excellent resource. I read the first few chapters closely and plan to return to the rest. Lots of interesting stuff + frank discussion of how to truly evaluate different database options."
Beautiful. Been aiming to learn how to scale MySQL databases so I can run my apps on VMs without having to use managed DBs like Aurora or Azure Managed Database.
Check out RonDB (https://www.rondb.com/blog) which is a fork of MySQL Cluster but has support for elasticity with no downtime, backups, etc.
Disclaimer: i am involved with RonDB.
No. It's a distributed database. Transactions are committed in memory, and then (by default every second) committed as a group to disk. This gives you phenomenal write throughput and low latency, but the tradeoff is that if a whole cluster goes lights out, you could lose a second of data. It's kind of like turning off fsync for InnoDB.
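If it helps map that analogy to something concrete, the InnoDB knob it alludes to is a rough sketch of the same trade-off (assuming MySQL with InnoDB):

    -- Don't flush the redo log on every commit. With value 0, the log is written
    -- and flushed to disk roughly once per second, so a crash can lose up to
    -- ~1 second of acknowledged transactions (similar trade-off to the above).
    SET GLOBAL innodb_flush_log_at_trx_commit = 0;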
Why don’t more companies / startups choose ScyllaDB rather than Postgres or MySQL?
ScyllaDB is a C++ version of Cassandra, so it looks like speed and scalability are a clear advantage over the Java-based Cassandra, and Discord is using ScyllaDB at scale too.
Because Postgres and MySQL are more familiar, and by the time they're falling over, you have raised 80M and can afford to hire folks like me whose entire job is pestering developers to stop reading-after-writing, to stop forcing queries to the primary (just in case there's lag!), to stop "select id from table" -> "select data_I_want from other_table where ID IN (list-of-a-million-ids)", to beg AWS for early access to the fastest possible machines, and to systematically destroy the custom ORM you built during the first 5 years of the company, before you had customers.
Scylla/Cassandra is exactly the database you want when you're serving the entire planet.
Postgres is exactly the database you want when you're desperately searching for product market-fit and just need things to work and code to get written.
Turns out it's much better to start simple and maybe face scaling problems than to start complex and face scaling problems anyway.
Curious how you approach business units asking for reports that require multiple joins and can "filter by any column" and "sort by any column" and also offer pagination and then complain about why it's so slow? This is MySQL by the way.
The sorting by any column and pagination is the real killer. We can't do cursor-based pagination, so we get LIMIT/OFFSET, which is terrible for performance, at least in this context (the sketch below shows the contrast). Indexes are often useless in this circumstance as well due to the filter-by-any-column requirement. We get lucky sometimes with a hard-coded WHERE clause as a baseline filter for the entire report that hits an index, but not always. Just for fun, add in plenty of LIKE '%X%', because limiting them to STARTS WITH is out of the question; it must be CONTAINS.
It's a constant source of frustration for the business and for the dev team.
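To make the pagination point concrete, here's a rough sketch with a hypothetical report_rows table (column names invented for illustration):

    -- OFFSET pagination: the server still has to walk and discard 100,000 rows.
    SELECT id, customer_name, total
    FROM report_rows
    ORDER BY created_at, id
    LIMIT 50 OFFSET 100000;

    -- Keyset ("cursor") pagination: remember the last row seen and seek past it.
    -- Only works when you can fix the sort order and index it, which is exactly
    -- what "sort by any column" forbids.
    SELECT id, customer_name, total
    FROM report_rows
    WHERE (created_at, id) > ('2023-06-01 12:34:56', 98765)
    ORDER BY created_at, id
    LIMIT 50;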
Sounds like the job for a denormalized OLAP data structure (e.g. star schema) with a BI tool. This is the kind of use case they were made for. In other words, you've rediscovered the motivation for data warehouses.
Thank you for the feedback. I'll do some research along these lines. We have moved a report or two to Power BI, which has been much faster than similar reports built using dynamic queries. Keeping the data in sync and the Power BI learning curve have not been easy.
In our case we switched to holding that data in Elasticsearch. Now we have 2 problems: running Elasticsearch and keeping the data in Elasticsearch in sync with MySQL. We took an application-level approach to triggering record sync to ES but somewhere in our ORM it's doing some magic and doesn't sync in certain places. Database triggers are the next stop.
If we could do away with ES and go back to MySQL for what are glorified data grids with 20+ column filtering, sorting, and no full text requirements, we would.
We use ElasticSearch in a very limited capacity for keyword searches against consumer product metadata. We haven't ventured into using it for BI reporting, but I'm interested. I'm sure there is plenty of knowledge I can google, but any early insights or traps to avoid when moving from SQL-based reporting to something in ES?
If you want to sort by any column and paginate on any column (so you can have your cursor), a BTREE (MySQL index) is sort of the wrong structure in the first place. You probably want a document store for something like this. Scylla!
More pragmatically, just force bookmarked scans - limit the columns that can be sorted by - use a replica with indexes that the writer doesn’t have - pay for fast disks - use one of the new parallel query engines like Aurora or Vitess - just tell the business people to wait a dang minute, computer is thinking!
The lazy approach is to denormalise the data in advance and load the result into memory. If it’s small enough you can simply return the entire result set to a reporting tool and query it locally.
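A minimal sketch of that "denormalise in advance" idea, assuming Postgres and made-up orders/customers tables:

    -- Precompute the joined, filterable result once (e.g. on a schedule),
    -- then let the reporting tool hammer this flat table instead of the OLTP schema.
    CREATE MATERIALIZED VIEW report_orders AS
    SELECT o.id, c.name AS customer_name, o.status, o.total, o.created_at
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;

    CREATE INDEX ON report_orders (created_at);

    -- Refresh as often as the business can tolerate staleness.
    REFRESH MATERIALIZED VIEW report_orders;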
I think it depends on the problem. I've worked at a tiny telco (<100 people) where MySql absolutely wouldn't have worked. We ended up using Scylla DB. We had billions of rows in single tables.
If you're launching an e-commerce site that sells shoes or something, yeah, you probably aren't going to sell a billion shoes every year so MySql is probably fine.
Finally, I use this example a lot: Github thought they could scale out MySql by writing their own Operator system that sharded MySql. They put thousands of engineering hours in it, and they still had global database outages that lost customer data.
You get that shit for free with Scylla/Cassandra (the ability to replicate across data centers while still having fine-grained control to do things like enforce local quorum so write speed isn't impacted, etc.).
You know as I get closer and closer to official graybeard status I start having more of a hair trigger when someone says "just use __________" as if that solves everything and doesn't just create a list of a dozen or two other problems.
Everything is tradeoffs, you need to look at what you're trying to do and see which problems you are okay with. Some of these will be show-stoppers at larger scale. Some might be show-stoppers now.
But "just use __________" is never the right answer for any serious discussion.
What's wrong is replacing a single query with a join by multiple round trips that implement part of the join logic in application code, because you are using an RDBMS as if it were a fairly dumb repository of disconnected tables.
    SELECT id FROM table1

is fine, if what your app wants is the IDs from table1. It's not good if it's just a prelude to:

    SELECT * FROM table2
    WHERE table1_id IN (...ids from previous query...)

instead of just doing:

    SELECT table2.*
    FROM table1
    INNER JOIN table2 ON (
        table2.table1_id = table1.id
    )
No query optimizer will automatically convert to separate queries connected by application logic that the optimizer can’t see shuttling data between them into a single join.
Lots of application developers don’t really understand how sql works. So instead of learning to do a join (or do a sub query or anything like that) they just issue multiple queries from their application backend.
The posters up thread are talking about novice Python / JavaScript / whatever code which issues queries in sequence: first a query to get the IDs. Then once that query comes back to the application, the IDs are passed back to the database for a single query of a different table.
The query optimizer can’t help here because it doesn’t know what the IDs are used for. It just sees a standalone query for a bunch of IDs. Then later, an unrelated query which happens to use those same IDs to query a different table.
    SELECT * FROM table2
    WHERE table1_id IN (SELECT id FROM table1 WHERE ...)

will be optimized into a join by the query planner (which may or may not be true, depending on how confident you are in the stability of your RDBMS's query planner's implementation details). But in most circumstances there is no subquery; it's more like:

    SELECT * FROM table2
    WHERE table1_id IN (452345, 529872, 120395, ...)
Where the list of IDs are fetched from an API call to some microservice, or from a poorly used/designed ORM library.
What’s really bad is when they (or the ORM) generate a separate SELECT statement for each ID in a loop. Selecting thousands of rows one row at a time from an unindexed table can be hilariously slow.
And this doesn’t even get into bizarre edge cases, like how Postgres treats foo = ANY(ARRAY['bar', 'baz']) differently than foo IN (VALUES(('bar'), ('baz'))). PROTIP: the latter is often MUCH faster. I’ve learned this twice this week.
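Spelled out (with a hypothetical table t and column foo), the two Postgres forms being compared are:

    -- Form 1: the planner treats the array as a scalar filter.
    SELECT * FROM t WHERE foo = ANY (ARRAY['bar', 'baz']);

    -- Form 2: VALUES becomes a small relation the planner can join against,
    -- which is sometimes much faster for large lists.
    SELECT * FROM t WHERE foo IN (VALUES ('bar'), ('baz'));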
It looks like what they're describing is implementing database joins in the application: 2 network round trips, first one for ids from a table and the second to get values from another table with foreign keys corresponding to the ids.
I have supported a microservices environment where each service had its own database, and indeed "foreign keys" were mostly application-level concerns. A real pain, and I don't think any of us would have chosen that approach in hindsight.
I hope there are better ways to design microservices.
At least in our case, it comes down to expectations.
For the `IN` query the DB doesn't know how many elements are in the list. In MSSQL, it would default to assuming "well, probably a short list" which blows out performance when that's not the case.
When you first insert into the temp table, the DB can reasonably say "Oh, this table has n elements" and switch the query plan accordingly.
In addition, you can throw an index on the temp table, which can also improve performance. Assuming the table you are querying against is indexed on ID, when the table of IDs you join against is also indexed, the engine doesn't have to assume random access as it pulls out each ID. (Effectively, it just has to navigate the tree nodes in order rather than doing a full lookup into the tree for each one.)
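A rough T-SQL sketch of that temp-table approach, reusing the hypothetical table2 from the examples above:

    -- Materialize the ID list so the optimizer sees a real row count and an index.
    CREATE TABLE #ids (id INT PRIMARY KEY);

    INSERT INTO #ids (id)
    VALUES (452345), (529872), (120395);

    -- Now the join can walk both indexes in order instead of guessing at the IN list.
    SELECT t2.*
    FROM table2 AS t2
    JOIN #ids AS i ON i.id = t2.table1_id;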
I've worked with a large Cassandra cluster at a previous job, was involved with a Scylla deployment, and have a lot of experience with the architecture and operations of both databases.
I wouldn't consider Cassandra/Scylla a replacement for Postgres/MySQL unless you have a very specific problem, namely you need a highly available architecture that must sustain a high write throughput. Plenty of companies have this problem and Scylla is a great product, but choosing it when you are just starting out, or when you are unsure you need it, will hurt. You lose a lot of flexibility, and if you model your data incorrectly or need to query it in a way you didn't foresee, most of the advantages of Cassandra become disadvantages.
There are lots of tradeoffs when choosing Cassandra:
1. If you delete data, tombstones are a problem that nobody wants to have to think about.
2. Denormalizing data instead of joining isn't always practical or feasible.
3. If you're not dealing with large amounts of data, Postgres and MySQL are just easier. There's less headache and it's easier to build against them.
4. One of the big advertised strengths of Cassandra is handling high write volume. Many applications have relatively few writes and large amounts of reads.
There are a lot of factors at play. But I think the biggest one is lack of expertise in the community. For Postgres it isn't that hard to hire a professional with 10 years of real-world experience.
Besides that, you need to keep in mind that Scylla isn't a silver bullet. There are some workloads it can't handle very well. Where I work, we tried to use it in a workload with high reads and writes and it had trouble keeping up.
In the end we switched back to postgres + redis because it was cheaper.
Probably the ecosystem and documentation. PaaS like Render, Fly, and even Heroku from back in the day offer addons/plugins/whatever that make the obvious choice out to be Postgres/MySQL. Then there's other tooling such as migration and various ORM adapters. Then there's a wealth of blogs, stack overflow q&a's, etc. And then Postgres/MySQL gets you quite far. Start with a single node -> scale the node -> add read replicas -> shard. Then, sharding becomes painful? Start looking at other data stores, remodeling your data, or eventing, etc.
Because SQL and relational data. They're (MySQL and Pg) a lot more familiar to set up and maintain. There are a lot of managed service offerings for the usual SQL databases.
Cassandra/Scylla come with a lot of limitations compared to a typical SQL DB. They were developed for specific use cases and aren't at all comparable to SQL DBs. Both are completely different beasts compared to standard SQL DBs.
Is there any benchmark data to show the difference between no-sql DBs written in different languages when it comes to a real distributed system rather than just a handful of machines?
There are a bunch of common workloads (including, I'm guessing, HN itself) where the trade-offs for distributed databases make them a lot harder to use well, or where distributed databases don't make sense at all.
Government agencies I've worked with usually build an app and put it in production with very few changes for 15+ years.
Data is the most important stuff for them and being able to read it is therefore very important.
A NoSQL DB structure can usually only be read by the code that goes with it, and if nobody understands the code you're doomed.
In 2017, I had to extract data from an app from 1994.
The app was built with a language that wasn't widely used at the time and has since disappeared.
It disappeared before all the software companies put their doc on the internet.
You can't run it on anything older than windows2k.
There was only one guy in Montréal that knew how to code and run that thing.
As the DB engine was relational with a tool like sqlplus, I was able to extract all the data and we rebuilt the app based on the structure we found in the database.
If you want to do that with NoSQL, you must maintain separate documentation that explains your structure... And we all know how devs like documentation.
This is already addressed: "It’s not “about” ScyllaDB – but it’s highly relevant for anyone using or considering it"
It does contain a plethora of references to other databases, and it does exemplify concepts using ScyllaDB. IIRC they also made it clear in the front matter.
I'm referring to the content itself, not what is advertised about it. The article also states that the book is open source, when it's actually not, so I guess I have trust issues.
But I'll take the fact that you created an account just to write this comment as a positive signal that the book is generic and not only about ScyllaDB.
Other than making the PDF freely available what in your mind would need to happen to make this book open source? I'm not even sure what an "open source" book would be in the FOSS meaning of the phrase.
From the edition notice page of the PDF:
> Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
And the comment you're replying to says "It does contain a plethora of references to other databases" so I'm not sure the snarky "I'm referring to the content itself, not what is advertised about it." is necessary or particularly helpful.
> what in your mind would need to happen to make this book open source
Gitbooks is a good example of that.
Open Source is easy to define. This is a PDF, so it's the artifact of some 'build' producing it. If a book is open source, I'd expect a .tex file, or whatever the input was which created the PDF, with the pictures attached or the code which draws the pictures/graphs/etc... If it's a word document, I guess that also works. I'm also not the only one to call it out: https://news.ycombinator.com/item?id=37779549
> I'm not sure the snarky "I'm referring to the content itself, not what is advertised about it." is necessary or particularly helpful.
There is no snark here. I asked a question about the content of the book. I got an answer from the article talking about the book, which I already read.
Perhaps "Open Access" as used on the edition notice is more technically accurate but I'm not seeing how getting any "build" assets to generate this PDF is any better or more free/open than having the PDF and being able to do whatever you want to it within the rules of attribution.
SQLite is undeniably open source yet the general public can't contribute code to it, so I'm not sure why that's meant to be the defining feature in the case of the book.
The earlier request for a .tex source file was on the mark, but accepting contributions has nothing to do with it.
Please don't strawman my argument. The idea was that I don't have access to the raw assets used to compile the book, not necessarily that I want to contribute to it. Say I want to fork it.
It's not a strawman just because it hurts your argument. :)
The point is that you can't reasonably say SQLite is not open source, and the limitations around SQLite and this book are comparable. So it's not an immediate "No. End of discussion." that the book isn't open source.
You're entitled to your opinion but you probably shouldn't act like everyone else is an idiot for disagreeing with you.
ScyllaDB is exceptionally good (and unique) as a company in that they virtually ban marketing for its own sake. I'd point you to their 'vendor agnostic' p99 conference as well. Among the most technically credible technology companies of all time.
I wondered about that, too, but in the body of the article they claim the license is a Creative Commons flavor so I interpreted it as a marketingdroid faux pas that the source is not available but that the content is open
I guess the distinction would be if someone wanted to upstream a change, not fork it, through what mechanism would that take place, but I'm going with "highly unlikely" on that one
Yes, seems like the GitHub repo was a bit of an afterthought:
Top right:
> About
> SOURCE CODE for "Database Performance at Scale: A Practical Guide (Apress, 2023)," by Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov & Cynthia Dunlop
But readme.md:
> This repository ACCOMPANIES Database Performance at Scale:A Practical Guide by Felipe Cardeneti Mendes, Piotr Sarna, Pavel Emelyanov & Cynthia Dunlop (Apress, 2023).
In the context of technical books "source code for" often means "source code for [readers of the book to reference while reading]" rather than "source code for [building the book]"
I disagree. Words have meaning. 'Open source' means 'open source' in all contexts.
For comparison, https://doc.rust-lang.org/book/ is an open source book. A PDF with a CC license without a repo of the publishing artifacts is not an open source book. It's just a free book.
I wrote a system that used the database as essentially a rules engine for "what work to execute next?" There were constraints about max throughput per key, waiting periods, etc.
That system broke me down and built me back up over time. Really deep explaining, understanding the internals of how the data is actually being processed, then finding ways to improve it. I had 1-2 other people who were really interested in postgres that would help work on debugging.
We would continuously get 100x query improvements every time we dove in. Including after we thought we maxed it out.
I hope everyone gets the chance to work on something like that. It was really fun.
If anyone starts a new side project, do yourself a favor - write your data access layer in straight SQL. With minimal wiring and safe variable substitution that most libraries provide, you will learn a lot.
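A tiny sketch of what "straight SQL with safe variable substitution" can look like, shown here with Postgres server-side prepared statements and a made-up orders table; most client libraries expose the same idea via placeholder parameters:

    -- Parameters are bound, never string-concatenated, so user input can't
    -- rewrite the query.
    PREPARE find_orders (int) AS
        SELECT id, status, total
        FROM orders
        WHERE customer_id = $1
        ORDER BY created_at DESC;

    EXECUTE find_orders (42);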
I'm afraid to inform you, that you have built a cluster workload manager.
(Where a "cluster workload manager" is a type of job scheduler designed for — usually HPC — clusters, that prioritizes among ongoing abstract batch "jobs" that can each broken down into more granular atomic scheduleable work-units; and which accounts the often-unpredictable resource-consumption outcomes of those work-units, to the account of the job and its owning user+group+org, to then weight further executions of that job or other jobs owned by that user+group+org, so that ultimately, jobs can proceed at a QoS/rate that's 1. "fair", and 2. in proportion to the amount that a given customer is willing to pay for prioritization. "Business rules" in such a system, are expressed in terms of conditional coefficients or formulae to apply to the "costs" used to re-weight jobs.)
• technically, Kubernetes' kube-scheduler — though it's the most trivial possible implementation of one, without any of the ongoing-fairness-of-QoS logic.
The key thing to know about these, is that it's nearly impossible to build one that lives "quiesced" inside a database (as you've probably found.) Imagine taking an OS process scheduler and trying to keep the state for that canonically inside a database. Then imagine multiplying the write load by 1000x, while keeping the realtime read requirements the same. It's just a bad idea.
You can certainly persist the state of these systems into a DB — but what you really want is one or more active scheduler nodes, that keep a big in-memory ledger matrix, that can be mutated moment-to-moment, without needing to persist every single change to disk in some WAL/journal/etc. Like a regular OS scheduler. Or like, say, a game physics engine — very similar in how it's all about tiny little "integration" steps, many times per second.
I was generic in how I described it because I didn't want to talk about the system (it doesn't matter).
The system was for a very specific use case and the set of rules were all specific to that domain. The domain was not related to workload management or jobs, but you could generalize what it did in those terms.
edit: removed some snark. I did want to rewrite this into an in-memory data structure (Elixir-based). It generally worked but the postgres solution was actually really really good for the domain it was built for.
My point was that — for just your use-case as originally stated — you probably could have taken a FOSS cluster workload manager like SLURM and "narrowed" it to your domain + use-case; rather than starting from a DBMS and adding all the stuff necessary to make it understand how to be a scheduler and then fighting with it to scale its use-case-impedance-mismatched state-storage backend to your production workload. (Where I assume you were "fighting with it" because, as you said, you were "torn down and built back up" in the process of making this system scale.)
If, however, you had other database-related needs you didn't mention — e.g. being able to arbitrarily join the state data of this system as dimension data in OLAP queries on other, non-scheduler-y tables in that DB — then your constraints probably did indeed make using a DBMS the best choice here, despite the scaling challenges.
(Also, funny enough, I am myself in the middle of writing a cluster-workload-manager as a custom in-memory Elixir daemon! But that's with the unique constraint of needing lightweight, realtime prioritization of workloads, each backing synchronous HTTP RPC requests. Having millions of teeny-tiny jobs that are usually over in milliseconds [but which can sometimes run for multiple minutes], isn't something any HPC-targeting workload manager is really architected for.)
I don't think that would've been a good idea. I guess it's possible, but I'd rather trade known obstacles versus unknown obstacles (that will definitely exist).
I said "torn down and built back up" in an endearing way. Sure it was difficult work at times, but it was super fun and we would spend 2 days at a time fixing it until something popped up 6 months later.
But the amount of stuff I learned from that was (I will wager) exponentially more useful than figuring out how to adapt/run a fairly major FOSS.
I've worked on a couple of projects where the existing teams were converting some API calls to new tech and changing the UI to accommodate the 'new way' of getting data because "the current way is way too slow". And it was. Going to a common dashboard was taking... 45-50 seconds, and growing.
The core issue was that they were pulling all the data - 20k records - into a JS datatable.
"This is bad - we have to move to server side paging!"
They'd convinced their clients (I've seen variations of this process 5x in the past 6 years) to 'upgrade'.
In at least 2 cases, I rewrote a couple of the queries to just pull what was needed - id/name/email - instead of full ORM records with nested associations. One screen went from 45 seconds to 1 second. And this took ... 10-15 minutes of digging and rework. Yes, if your real goal is to learn Hotwire or some other new tech... go for it, but it feels wrong to not even investigate why the code is slow in the first place.
Other project - similar - took a screen from 50 seconds to 2 seconds by reducing the query. Then got the whole thing under a second with a couple of indexes.
Now, yes, this isn't all specific to database optimization, but they feel related. This idea of "oh, compute is cheap, devs are expensive, just throw more hardware at it!" has to have some boundaries/limits, but they aren't often realized until you're really ... in deep.
Database optimization isn't something you'll need running a flask app on your personal machine, you'll only have to learn it while trying to handle tons of traffic, which I think explains why many developers are late to the party.
That said, I've never worked on a team that didn't carefully consider their schema from both a performance and extendability perspective. Even if the average age of your team is 23, any half decent TL will know the basics. I imagine this is only a problem at startups founded by students.
Students still care and should have learned these principles in uni.
They tend to overengineer and focus too much on the technical challenges
In bigger companies, most IT people tend to be people who got the job because they started coding back in the day when you could get by using just common sense and gut feeling. Today we know those often lead to dark patterns and bad database design, but those people have now grown into architecture and management positions and have no clue what they are talking about.
This is very prevalent in tech teams of non tech companies (e.g. manufacturing, banking, healthcare, CPGs,...)
Shortest answer is they're put in charge by someone who doesn't know that the appointee doesn't know. Either they don't know they're supposed to know that, or (ime) more often they don't know how to judge. As someone else replied, passing leetcode-style tests is taken as a mark of technical skill well beyond what it's intended to indicate.
Glad to hear you've been part of well functioning competent teams of professionals. I know they exist - I've been there a couple times myself over the years, but it's not a given. I would probably say it's not even the norm if you include a wider range of places where tech/developer folks may be employed.
A 'team lead' role is filled by someone who is dating the department manager's family member? Seen that happen.
"Joe's been here longer, so he gets to make the final decision on things". OK, but Joe graduated high school last year, and he's cutting in to systems built by people with 20+ years of experience (both in software and in this domain), and Joe is replacing it with some cut/paste stuff he found on stackoverflow last week.
I was going to produce a larger litany of dumb scenarios I've had to wrestle with, in painful detail, but unsure that would do much to convince you of anything. (venting does help me occasionally, but won't ATM).
I think most devs understand indexing at this point, if only because ORMs generally include it as a keyword. What they don't understand is why an index might fail to be used, which to be fair, is a massive subject on its own. MySQL is one thing, since it pretty much only has B+tree (plus FTS). It's also case-insensitive by default, so foot guns like "smith" != "Smith" go away. But of course, MySQL isn't cool anymore, and dear lord does Postgres have some gotchas.
Wrong collation? No index. Wrong case? No index. You didn't explicitly cast something and the result technically works, but isn't exactly correct? No index. Used a partial index with `WHERE foo` and then queried with `WHERE foo IS NOT FALSE`? No index. Table stats are wrong because you haven't adjusted the autovacuum/autoanalyze daemons? Believe it or not, no index.
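A couple of those foot guns spelled out, assuming Postgres and a made-up users table:

    CREATE TABLE users (id int PRIMARY KEY, email text, active boolean);
    CREATE INDEX users_active_email_idx ON users (email) WHERE active;

    -- Uses the partial index: the query predicate implies the index predicate.
    SELECT id FROM users WHERE active AND email = 'a@example.com';

    -- "Wrong case? No index": the index covers email, not lower(email).
    SELECT id FROM users WHERE active AND lower(email) = 'a@example.com';

    -- Partial-index trap: "active IS NOT FALSE" also matches NULLs, which the
    -- index doesn't contain, so the planner can't use it.
    SELECT id FROM users WHERE active IS NOT FALSE AND email = 'a@example.com';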
As for normalization, yeah, that's right out. Unless you had a unicorn design your schema, 0% chance dev teams know or care about it until it bites them, and then it's a massive undertaking to fix it.
Source: DBRE who is both very happy that my skillset is absurdly in demand, and also constantly screaming into the void about everything.
Wait until you see DBs with nonsensical schemas, wrong data type usage (like storing dates as text; see the sketch below), non-existent PK/FK/CHECK constraints, no use of VIEWs...
i.e.: using an RDBMS as if it were NoSQL and then complaining that RDBMSes are bad...
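For instance, with a hypothetical events table, storing dates as text quietly breaks both ordering and comparisons:

    CREATE TABLE events (id int, happened_at text);  -- should be date/timestamp
    INSERT INTO events VALUES (1, '02/01/2023'), (2, '12/31/2022');

    -- Lexical order, not chronological: '02/01/2023' sorts before '12/31/2022'.
    SELECT * FROM events ORDER BY happened_at;

    -- String comparison, not a date comparison: Feb 2023 is wrongly excluded
    -- from "later than June 2022".
    SELECT * FROM events WHERE happened_at > '06/15/2022';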
I've seen ints being deliberately stored as text. Not because the ID field changed at some point and they just kept it around as legacy (I mean, bad idea, but fair enough), but because they thought storing them as text would somehow be better.
As to NoSQL, yeah - devs love shoving everything into a JSON column, and then complain that it's slow. MongoDB is of course there for their web scale needs.