Sure, but that's not MySQL's fault. MySQL/Oracle has no control over what ill-conceived setting overrides are used by DBaaS providers or shared hosting providers.
There are similarly dumb misconfigurations that can theoretically be made in most hosted open source software.
Implementing silent string truncation in the first place (and by default!) is an excellent indication of how fast and loose the original MySQL developers were, and how out of touch with the database world. It is the worst database "feature" I have ever seen.
It made more sense at the time. Earlier days of the web, dynamic scripting languages used everywhere, etc. No emoji, barely any use of 4-byte characters at all. Large corps weren't using open source DBs, or sometimes open source anything. Simpler times!
And for the sake of comparison -- if I understand correctly, prior to just 2 months ago, SQLite didn't even have the option for strict type enforcement [1]. Do you judge SQLite as harshly, or think its creator is out of touch with the database world?
Relational databases have been used in production since the late 1970s. You would be hard pressed to find any that silently corrupted user data like that, ever, until MySQL came along.
SQLite is more of an embedded database, but if it silently truncated character data under some conditions, that would make it equally unusable. Better to have no type enforcement than defective type enforcement.
I think you misunderstood my point. Yes, relational databases have been around a long time. But in the early days of MySQL, most of the companies using MySQL differed quite a bit from the companies typically using DB2, Sybase, Oracle, etc. I've been using MySQL professionally since 2003 and am speaking from first-hand experience here: the web development world was largely separate from the traditional enterprise world at that time. It was considered a different use case; it's not like today, where the biggest companies in the world are tech companies built on open source software.
Regarding "silent": MySQL at least emits a warning when this happens. Still not great, and obviously wrong in a modern context, but for clarity's sake, it wasn't completely silent.
As for "defective type enforcement", that's a subjective opinion. In C, atoi("123hello456") returns 123 -- once you hit an invalid character, everything from that point is truncated. Is C atoi defective? Perhaps, but it's an opinion.
With strict mode disabled, four-byte characters worked the same way in MySQL utf8mb3: four-byte characters are invalid in this data type, and everything gets truncated at the first invalid character. Would simply stripping the invalid characters, but keeping valid chars after them, necessarily be more "correct"? No, it's an implementation decision, especially considering that emoji did not exist at the time and barely any non-BMP (4-byte) utf8 characters were used anywhere.
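As a sketch of what that looked like in practice (table and column names here are made up for illustration, and this assumes the connection charset can carry the emoji to the server):

```sql
-- With strict mode off, inserting a 4-byte character into a utf8mb3 column:
SET SESSION sql_mode = '';
CREATE TABLE demo (s VARCHAR(100) CHARACTER SET utf8mb3);
INSERT INTO demo VALUES ('hello 😀 world');
-- Warning 1366: Incorrect string value
SELECT s FROM demo;
-- stored value is just 'hello ': truncated at the first invalid character,
-- including the valid ' world' that followed it
```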
It is certainly true that most users of relational databases before MySQL used them for applications where throwing away data is a potential catastrophe, and most uses of MySQL were for toy applications where, if that happened, it was no big deal. It was about a decade before MySQL even supported transactions.
atoi is deficient by modern standards, but at least it clearly documents what the behavior is and always has. It doesn't truncate anything. For a database, truncating data is unforgivable. That is what errors are for, so your database doesn't act as a black hole where your data may silently disappear forever.
And yes, if you can't return an error for some reason, you should preserve as much data as possible. Throwing away hundreds or thousands of characters after an initial anomaly is insane. But a database should never do that anyway; it should either store what you give it, or return an error. Same with a filesystem or any other trustworthy means of data storage. "Sorry, we were too lazy to tell you we threw your data away" isn't acceptable, nor is relying on anyone to check for warnings that should be hard errors.
I would argue yes, that atoi is defective. Stripping characters is equally defective.
A QA walks into a bar and orders "123hello456" beers. The correct answer is not for the barkeep to give them 123 beers; it's to ask "I beg your pardon?" So the truncation in general is the problem. If the operation fails - too many bytes, wrong encoding, string too long - the operation should fail. Cleaning up invalid input is not the db's job; that is the business logic's job.
All competent software tokenizes an input string before calling atoi on an extracted piece of it. Otherwise you can't even tell the difference between "0" and an empty string.
"On Error Resume Next" was the idea of error resilience back then. MSSQL has a similar feature: when you run a script and an error happens, it just keeps running statements. If you want to interrupt it, you have to wrap the whole thing in a try/catch statement and also roll back the transaction, if any, or it will commit a partial result.
Well, the things it would have been nice for MySQL to have done would have been not implementing it this way in the first place (yes, that's probably not Oracle's fault), and moving the utf8 alias to utf8mb4 a bit more quickly (good to know that's finally happening).
Also, I gotta say, now I have to look into this strict mode and what else it breaks. In this particular case a "don't silently truncate strings" flag would be preferable and probably lower-impact, since "strict" vs. "security flaw mitigation" is not quite the same thing (does such a more granular flag exist?).
... unless that's all the strict mode does.
Hm, and since you brought this up elsewhere: is this "strict by default" in MariaDB, which is what most (Linux) people are (probably) thinking of?
> now I have to look into this strict mode and what else it breaks
It's a mode to ensure nothing happens silently: illegal data type conversions, attempting to insert invalid values for the data type (too large, too long, invalid characters, etc), attempting to do an INSERT which doesn't specify some columns that lack defaults, etc. With strict mode enabled, all of these things throw hard errors, like in most other relational databases. Generally, it covers the entire class of things that are typically the most common complaints about MySQL on HN.
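A sketch of the difference (table and column definitions here are made up for illustration):

```sql
CREATE TABLE t (n TINYINT, s VARCHAR(3));

-- Without strict mode: warnings only, mangled data is stored
SET SESSION sql_mode = '';
INSERT INTO t VALUES (999, 'abcdef');
-- stores (127, 'abc') -- clamped and truncated -- with two warnings

-- With strict mode: the same statement is a hard error
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
INSERT INTO t VALUES (999, 'abcdef');
-- ERROR 1264: Out of range value for column 'n' at row 1
```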
Interesting wording re: "breaks" -- for practical purposes this is a real concern for existing applications, which may be doing bad things which previously didn't throw errors with strict mode off. Do you want your DB to suddenly fail loudly for these things (strict mode), or do you want your existing application to continue working in these cases? Can't have it both ways.
> Is this "strict by default" in MariaDB - which is what most (linux) people are (probably) thinking of.
MariaDB made it default in 10.2, ~5 years ago.
I don't agree that most Linux people conflate MySQL and MariaDB though. At large US tech companies that use MySQL, they're almost all literally using Oracle MySQL, or a patch-set like Percona Server which directly tracks MySQL, or an internal patch-set that does the same.
In recent years, MySQL and MariaDB have diverged a bit on feature set and focus; each has some interesting features that the other lacks. They are not drop-in replacements for each other.
So... while I'm broadly in agreement with you, and a huge fan of strict typing, I will note people use languages like JavaScript and are usually OK with things like string-to-integer coercion.
It seems to me that there's a world of difference between "silently truncate string" and "type coercion", even if I readily agree both are sources of security vulnerabilities.
Thanks for the replies though, even though I clearly annoyed at least 2 people due to the downvotes.
I enabled strict sql mode in MariaDB for a website just last year, after noticing the database was silently corrupting data. What made the transition much less painful was noticing that when strict sql mode is disabled, MariaDB produces a warning for the queries that corrupted the data, but you must run SHOW WARNINGS; to see the warnings after running such a query. Knowing that, I made the website run SHOW WARNINGS; after every database query and log whenever a warning occurred. This allowed me to deal with most of the warnings before they were promoted to errors by enabling strict sql mode.
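In SQL terms, the pattern looked something like this (the table, column, and values are invented for illustration):

```sql
-- Application query; 'title' is VARCHAR(255), so this overflows:
INSERT INTO articles (title) VALUES (REPEAT('x', 300));

-- Immediately afterward, on the same connection:
SHOW WARNINGS;
-- Level: Warning, Code: 1265,
-- Message: Data truncated for column 'title' at row 1
```

Note that SHOW WARNINGS only reports on the most recent statement of the same connection, which is why it had to run after every query rather than periodically.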
Pretty sure the name "utf8" isn't metaphysically wired to a defective physical representation. All they have to do is wait until the next minor release, add a compatibility name, make the generic name refer to a non-defective representation, and update the documentation and release notes accordingly.
If they were really conservative, they could wait until the next major release to repoint the name like that. Haven't there been four major releases since 2003?
This would be problematic for existing applications. Even if you leave existing binary data as-is (in existing tables utf8 becomes utf8mb3), changing the meaning of "utf8" in only new tables is still very problematic. Think about logical dumps, e.g. CREATE TABLE statements which use the "utf8" alias. For example, any self-hosted application which runs a .sql file in its install process. Or spinning up a dev environment for a new hire. Basically, anything that contains a CREATE TABLE with the utf8 alias now has a landmine:
In older versions of MySQL, InnoDB indexes were limited to 767 bytes. Meanwhile, VARCHAR(N) can store N characters. So with utf8mb3, that's a max of 3N bytes worst-case; but with utf8mb4 ("proper" utf8), it's now 4N. This meant that VARCHAR(255) CHARACTER SET utf8mb4 could not be indexed (in its entirety) in these older versions; the CREATE TABLE would fail if you try.
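Concretely, under the old 767-byte limit the arithmetic works out like this (sketch; exact error text varies by version):

```sql
-- 255 chars * 4 bytes/char = 1020 bytes > 767: rejected on older InnoDB
CREATE TABLE t (
  v VARCHAR(255) CHARACTER SET utf8mb4,
  KEY (v)
);
-- ERROR: Specified key was too long; max key length is 767 bytes

-- The common workaround of the era was a prefix index:
-- 191 chars * 4 bytes/char = 764 bytes <= 767
-- KEY (v(191))
```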
When you have a massive installed base of users, and those users have lots of dumps / install scripts containing "VARCHAR(255) CHARACTER SET utf8", you simply cannot change the meaning of "utf8" in a minor release.
I fully agree that MySQL could have made this transition faster/earlier than they did, but I disagree that it could be done trivially or terribly quickly.
Advertising the ability to store UTF-8 characters when you can do no such thing is a serious problem. Obviously you have to do whatever it takes to fix it, including fixing a deficient indexing implementation if necessary.
I believe the limitation you mention was fixed in MySQL 5.7, which would have been a reasonable opportunity to make "utf8" live up to its name.
Keep in mind that prior to 2010, MySQL and InnoDB were owned by different companies. Unilaterally deciding to increase the InnoDB index size limitation to permit utf8mb4 varchar(255) was not physically possible for the MySQL team prior to that point.
Anyway, I fully agree that the utf8 alias swap should have been done in MySQL 5.7, or perhaps even 5.6. I'm just disagreeing with your assertions that this would have been a trivial change.
Nor do I think it should have been a "stop the world" high-priority change prior to emoji existing in Unicode (~late 2010). Non-BMP characters were very rare before that point. Hard to justify a business need for a complex migration just to store archaic hieroglyphs. For another point of comparison, MS SQL Server only added support for non-BMP characters in 2012, and only added support for utf8 encoding in 2019!
It's less a question of fault and more a question of whether developers need to be aware and look out for those gotchas. A security problem doesn't stop being a problem because your hosting provider did something vs. MySQL doing it by default.