Sure, but that's not MySQL's fault. MySQL/Oracle has no control over what ill-conceived setting overrides are used by DBaaS providers or shared hosting providers.
There are similarly dumb misconfigurations that can theoretically be made in most hosted open source software.
Implementing silent string truncation in the first place (and by default!) is an excellent indication of how fast and loose the original MySQL developers were, and how out of touch with the database world. It is the worst database "feature" I have ever seen.
It made more sense at the time. Earlier days of the web, dynamic scripting languages used everywhere, etc. No emoji, barely any use of 4-byte characters at all. Large corps weren't using open source DBs, or sometimes open source anything. Simpler times!
And for the sake of comparison -- if I understand correctly, prior to just 2 months ago, SQLite didn't even have the option for strict type enforcement [1]. Do you judge SQLite as harshly, or think its creator is out of touch with the database world?
Relational databases have been used in production since the late 1970s. You would be hard pressed to find any that silently corrupted user data like that, ever, until MySQL came along.
SQLite is more of an embedded database, but if it silently truncated character data under some conditions, that would make it equally unusable. Better to have no type enforcement than defective type enforcement.
I think you misunderstood my point. Yes, relational databases have been around a long time. But in the early days of MySQL, most of the companies using MySQL differed quite a bit from the companies typically using DB2, Sybase, Oracle, etc. I've been using MySQL professionally since 2003 and am speaking from first-hand experience here: the web development world was largely separate from the traditional enterprise world at that time. It was considered a different use case; it's not like today, where the biggest companies in the world are tech companies built on open source software.
Regarding "silent": MySQL at least emits a warning when this happens. Still not great, and obviously wrong in a modern context, but for clarity's sake, it wasn't completely silent.
As for "defective type enforcement", that's a subjective opinion. In C, atoi("123hello456") returns 123 -- once you hit an invalid character, everything from that point is truncated. Is C atoi defective? Perhaps, but it's an opinion.
With strict mode disabled, four-byte characters worked the same way in MySQL utf8mb3: four-byte characters are invalid in this data type, and everything gets truncated at the first invalid character. Would simply stripping the invalid characters, but keeping valid chars after them, necessarily be more "correct"? No, it's an implementation decision, especially considering that emoji did not exist at the time and barely any non-BMP (4-byte) utf8 characters were used anywhere.
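As a sketch of what that looked like in practice (table and column names here are made up for illustration, and this assumes the connection charset can carry the emoji to the server):

```sql
-- With strict mode off, inserting a 4-byte character into a utf8mb3 column:
SET SESSION sql_mode = '';
CREATE TABLE demo (s VARCHAR(100) CHARACTER SET utf8mb3);
INSERT INTO demo VALUES ('hello 😀 world');
-- Warning 1366: Incorrect string value
SELECT s FROM demo;
-- stored value is just 'hello ': truncated at the first invalid character,
-- including the valid ' world' that followed it
```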
It is certainly true that most users of relational databases before MySQL used them for applications where throwing away data is a potential catastrophe, and most uses of MySQL were for toy applications where, if that happened, it was no big deal. It was about a decade before MySQL even supported transactions.
atoi is deficient by modern standards, but at least it clearly documents what the behavior is and always has. It doesn't truncate anything. For a database, truncating data is unforgivable. That is what errors are for, so your database doesn't act as a black hole where your data may silently disappear forever.
And yes, if you can't return an error for some reason, you should preserve as much data as possible. Throwing away hundreds or thousands of characters after an initial anomaly is insane. But a database should never do that anyway; it should either store what you give it, or return an error. Same with a filesystem or any other trustworthy means of data storage. "Sorry, we were too lazy to tell you we threw your data away" isn't acceptable, nor is relying on anyone to check for warnings that should be hard errors.
I would argue yes, that atoi is defective. Stripping characters is equally defective.
A QA walks into a bar and orders "123hello456" beers. The correct answer is not for the barkeep to give them 123 beers; it's to ask "I beg your pardon?" So the truncation in general is the problem. If the operation fails - too many bytes, wrong encoding, string too long - the operation should fail. Cleaning up invalid input is not the db's job; that is the business logic's job.
All competent software tokenizes an input string before calling atoi on an extracted piece of it. Otherwise you can't even tell the difference between "0" and an empty string.
"On Error Resume Next" was the idea of error resilience back then. MSSQL has a similar feature: when you run a script and an error happens, it just keeps running statements. If you want to interrupt it, you have to wrap the whole thing in a try/catch statement and also roll back the transaction, if any, or it will commit a partial result.
Well, the things it would have been nice for MySQL to have done would have been not implementing it this way in the first place (yes, that's probably not Oracle's fault), and moving the utf8 alias to utf8mb4 a bit more quickly (good to know that's finally happening).
Also, I gotta say, now I have to look into this strict mode and what else it breaks. In this particular case a "don't silently truncate strings" flag would be preferable and probably lower-impact, since "strict" vs. "security flaw mitigation" is not quite the same thing (does such a more granular flag exist?).
... unless that's all the strict mode does.
Hm, and since you brought this up elsewhere: is this "strict by default" in MariaDB, which is what most (Linux) people are (probably) thinking of?
> now I have to look into this strict mode and what else it breaks
It's a mode to ensure nothing happens silently: illegal data type conversions, attempting to insert invalid values for the data type (too large, too long, invalid characters, etc), attempting to do an INSERT which doesn't specify some columns that lack defaults, etc. With strict mode enabled, all of these things throw hard errors, like in most other relational databases. Generally, it covers the entire class of things that are typically the most common complaints about MySQL on HN.
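A sketch of the difference (table and column definitions here are made up for illustration):

```sql
CREATE TABLE t (n TINYINT, s VARCHAR(3));

-- Without strict mode: warnings only, mangled data is stored
SET SESSION sql_mode = '';
INSERT INTO t VALUES (999, 'abcdef');
-- stores (127, 'abc') -- clamped and truncated -- with two warnings

-- With strict mode: the same statement is a hard error
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO t VALUES (999, 'abcdef');
-- ERROR 1264: Out of range value for column 'n' at row 1
```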
Interesting wording re: "breaks" -- for practical purposes this is a real concern for existing applications, which may be doing bad things which previously didn't throw errors with strict mode off. Do you want your DB to suddenly fail loudly for these things (strict mode), or do you want your existing application to continue working in these cases? Can't have it both ways.
> Is this "strict by default" in MariaDB - which is what most (linux) people are (probably) thinking of.
MariaDB made it default in 10.2, ~5 years ago.
I don't agree that most Linux people conflate MySQL and MariaDB though. At large US tech companies that use MySQL, they're almost all literally using Oracle MySQL, or a patch-set like Percona Server which directly tracks MySQL, or an internal patch-set that does the same.
In recent years, MySQL and MariaDB have diverged a bit on feature set and focus; each has some interesting features that the other lacks. They are not drop-in replacements for each other.
So... while I'm broadly in agreement with you, and a huge fan of strict typing, I will note people use languages like JavaScript and are usually OK with things like string-to-integer coercion.
It seems to me that there's a world of difference between "silently truncate string" and "type coercion", even if I readily agree both are sources of security vulnerabilities.
Thanks for the replies though, even though I clearly annoyed at least 2 people due to the downvotes.
I enabled strict sql mode in MariaDB for a website just last year, after noticing the database was silently corrupting data. What made the transition much less painful was noticing that when strict sql mode is disabled, MariaDB produces a warning for the queries that corrupted the data, but you must run SHOW WARNINGS; to see the warnings after running such a query. Knowing that, I made the website run SHOW WARNINGS; after every database query and log whenever a warning occurred. This allowed me to deal with most of the warnings before they were promoted to errors by enabling strict sql mode.
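In SQL terms, the pattern looked something like this (the table, column, and values are invented for illustration):

```sql
-- Application query; 'title' is VARCHAR(255), so this overflows:
INSERT INTO articles (title) VALUES (REPEAT('x', 300));

-- Immediately afterward, on the same connection:
SHOW WARNINGS;
-- Level: Warning, Code: 1265,
-- Message: Data truncated for column 'title' at row 1
```

Note that SHOW WARNINGS only reports on the most recent statement of the same connection, which is why it had to run after every query rather than periodically.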
Pretty sure the name "utf8" isn't metaphysically wired to a defective physical representation. All they have to do is wait until the next minor release, add a compatibility name, make the generic name refer to a non-defective representation, and update the documentation and release notes accordingly.
If they were really conservative, they could wait until the next major release to repoint the name like that. Haven't there been four major releases since 2003?
This would be problematic for existing applications. Even if you leave existing binary data as-is (in existing tables utf8 becomes utf8mb3), changing the meaning of "utf8" in only new tables is still very problematic. Think about logical dumps, e.g. CREATE TABLE statements which use the "utf8" alias. For example, any self-hosted application which runs a .sql file in its install process. Or spinning up a dev environment for a new hire. Basically, anything that contains a CREATE TABLE with the utf8 alias now has a landmine:
In older versions of MySQL, InnoDB indexes were limited to 767 bytes. Meanwhile, VARCHAR(N) can store N characters. So with utf8mb3, that's a max of 3N bytes worst-case; but with utf8mb4 ("proper" utf8), it's now 4N. This meant that VARCHAR(255) CHARACTER SET utf8mb4 could not be indexed (in its entirety) in these older versions; the CREATE TABLE would fail if you try.
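Concretely, under the old 767-byte limit the arithmetic works out like this (sketch; exact error text varies by version):

```sql
-- 255 chars * 4 bytes/char = 1020 bytes > 767: rejected on older InnoDB
CREATE TABLE t (
  v VARCHAR(255) CHARACTER SET utf8mb4,
  KEY (v)
);
-- ERROR: Specified key was too long; max key length is 767 bytes

-- The common workaround of the era was a prefix index:
-- 191 chars * 4 bytes/char = 764 bytes <= 767
-- KEY (v(191))
```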
When you have a massive installed base of users, and those users have lots of dumps / install scripts containing "VARCHAR(255) CHARACTER SET utf8", you simply cannot change the meaning of "utf8" in a minor release.
I fully agree that MySQL could have made this transition faster/earlier than they did, but I disagree that it could be done trivially or terribly quickly.
Advertising the ability to store UTF-8 characters when you can do no such thing is a serious problem. Obviously you have to do whatever it takes to fix it, including fixing a deficient indexing implementation if necessary.
I believe the limitation you mention was fixed in MySQL 5.7, which would have been a reasonable opportunity to make "utf8" live up to its name.
Keep in mind that prior to 2010, MySQL and InnoDB were owned by different companies. Unilaterally deciding to increase the InnoDB index size limitation to permit utf8mb4 varchar(255) was not physically possible for the MySQL team prior to that point.
Anyway, I fully agree that the utf8 alias swap should have been done in MySQL 5.7, or perhaps even 5.6. I'm just disagreeing with your assertions that this would have been a trivial change.
Nor do I think it should have been a "stop the world" high-priority change prior to emoji existing in Unicode (~late 2010). Non-BMP characters were very rare before that point. Hard to justify a business need for a complex migration just to store archaic hieroglyphs. For another point of comparison, MS SQL Server only added support for non-BMP characters in 2012, and only added support for utf8 encoding in 2019!
It's less a question of fault and more a question of whether developers need to be aware and look out for those gotchas. A security problem doesn't stop being a problem because your hosting provider did something vs. MySQL doing it by default.