> Now, technically this scheme could expand to 6-byte characters without getting confused with things like BOM/etc, however any code points larger than 2^21 wouldn't be representable in UTF-16, which has its own set of constraints. This means the unicode consortium has basically limited themselves to two million or so possible code points, which is why UTF-8 doesn't need to go more than 4 bytes. (I wonder if a future unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)
Unicode specifically limited itself to the range U+0000 to U+10FFFF, which is 1,114,112 code points, not two million.
Obviously nothing in the laws of nature forbids "a future Unicode version" from disavowing this limit, but we could say the same for whether "a future United States of America" could disavow the status of independent Indian Tribes it has previously recognised.
> (I wonder if a future unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)
On a thread a couple of years ago (https://news.ycombinator.com/item?id=20600873) it was mentioned that the UTF-8 encoding scheme can be cleanly extended to 36 bits, so even "utf8mb7" would be a possibility.
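To sanity-check the 36-bit figure, here is a rough sketch that counts the payload bits an n-byte sequence would carry if the same prefix pattern were simply continued. The function name `payload_bits` is made up for illustration, and sequences longer than 4 bytes are not valid UTF-8 today; this is only arithmetic on the bit layout.

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits in an n-byte sequence of the (hypothetically extended) scheme."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    if not 2 <= n_bytes <= 7:
        raise ValueError("this sketch only covers 1 to 7 bytes")
    # The lead byte spends n_bytes one-bits plus a zero-bit on the length prefix.
    lead_bits = 7 - n_bytes
    # Every continuation byte (10xxxxxx) carries 6 payload bits.
    return lead_bits + 6 * (n_bytes - 1)


for n in range(1, 8):
    print(n, payload_bits(n))
```

With a 7-byte sequence the lead byte would be all prefix (11111110), so the 36 bits come entirely from the six continuation bytes.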
Characters <128 are encoded with a single byte: 0xxxxxxx
Characters >=128 are encoded with multiple bytes.
A two-byte character looks like:
110xxxxx 10xxxxxx (11 useful bits, representing code points 128-2047)
A three-byte character looks like:
1110xxxx 10xxxxxx 10xxxxxx (16 useful bits, representing code points 2048-65535)
A four-byte character looks like:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 useful bits, enough for code points 65536-2097151, though Unicode stops at 1114111, i.e. U+10FFFF)
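The bit patterns above can be turned into a small encoder. This is a minimal sketch rather than production code: the name `encode_utf8` is an illustrative choice, and I'm assuming we want to reject surrogates and anything above U+10FFFF, as current UTF-8 does.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the 1- to 4-byte patterns."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Compare against Python's built-in encoder for a few sample code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
```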
Now, technically this scheme could expand to 6-byte characters without getting confused with things like BOM/etc. However, any code points above U+10FFFF (a bit over 2^20) wouldn't be representable in UTF-16, which has its own set of constraints. This means the Unicode Consortium has basically limited itself to 1.1 million or so possible code points, which is why UTF-8 doesn't need to go more than 4 bytes. (I wonder if a future Unicode version will require a larger limit and would thus create a new "utf8mb6" scheme, and drop UTF-16 altogether?)
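For what it's worth, the UTF-16 ceiling can be worked out from the surrogate pair arithmetic. A quick sketch, with `from_surrogates` as a made-up helper name: a high surrogate (0xD800-0xDBFF) and a low surrogate (0xDC00-0xDFFF) each contribute 10 bits on top of an offset of 0x10000.

```python
def from_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The largest possible pair gives the highest code point UTF-16 can reach.
max_cp = from_surrogates(0xDBFF, 0xDFFF)
print(hex(max_cp))   # 0x10ffff
print(max_cp + 1)    # 1114112 values in the range U+0000 to U+10FFFF
```

So the limit is about 1.1 million code points, which fits comfortably in the 21 bits of a 4-byte UTF-8 sequence.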