You could also have CHECK constraints on those columns to ensure that nothing is written incorrectly, and/or a pre-write trigger that casts them to lowercase.
Or you could make the email column case-insensitive, since it’s generally accepted to be CI anyway.
That helps part of it, but there are still places for problems to pop up.
- you need to do an adhoc query to look something up so you have to type in the key in a where clause
- Or you used different sanitation methods in two different databases and you need to join things now
- you try to join the table that had the sanitized email key to a different table that just so happens to have email but it wasn’t sanitized because it was an optional field and not the key
And then you get the other stuff like one-off migrations that are done by someone unaware of the business constraints, or a contractor who completely misses the ORM method, etc.
These of course should be caught by checks and balances, and you can’t count on anyone knowing what they need to know ten years after an implementation is done
> you need to do an adhoc query to look something up so you have to type in the key in a where clause
Having a surrogate key here doesn’t help. A WHERE predicate can be rewritten as an INNER JOIN quite easily. Or you could use a subquery, or a CTE. Many options.
The other problems discussed are an engineering culture problem. Either you value correctness and consistency or you don’t.
I misunderstood your point; I thought you were saying having a surrogate key here would be better.
For the above, either ensure that all email addresses are stored in one case (and index / query on the same), or if that’s impossible for some reason, you can query with ILIKE, optionally creating a trigram index (for Postgres; this isn’t a problem in MySQL since by default it’s entirely case-insensitive) on the column. Another option is to index with a functional index (CREATE INDEX user_email_lowercase ON user (LOWER(email))), and then code the API doing these lookups to cast to lowercase. That way, it’ll be retrieved and displayed however the user entered it, but retrieved based only on the CI version.