It's hard but needed to differentiate between UTF-16 and UChar byte array. UChar...

rspeer · on Feb 12, 2016

I'm talking about using UTF-8 as the string representation, not UChars. UChars are an artifact of UTF-16, and thus require converting all text on input and output, unless you work in a Windows API world where I/O is UTF-16.

Modern programming languages such as Rust gain efficiency by working with unmodified UTF-8. All you lose is constant-time arbitrary indexing, which is a bad idea in most cases anyway.

skystrife · on Feb 13, 2016

It's not super convenient, but one can operate on UTF-8 buffers with ICU via UText (see e.g. http://userguide.icu-project.org/strings/utext#TOC-Example:-...)

Not everything is doable this way, but quite a lot actually is.

jstimpfle · on Feb 13, 2016

Why is a bad idea? Because Unicode has too complicated semantics to split a Unicode string at arbitrary points?

imron · on Feb 13, 2016

Both utf8 and utf16 can contain multicharacter elements. If you split a string at an arbitrary point you risk splitting it inside a multicharacter element.

This will be very common in utf8 that contains non-ascii characters, and very rare with utf16 (only happens with characters outside the BMP).

Neither is something you want in your code, unless you think it's a good idea to corrupt your users' data.

Edit: It's not too difficult to handle these cases and make sure you only split at valid positions, but you do need to be careful and there are a number of edge cases you might not think through or even encounter unless you have the right sort of data to test with - which leads to lots of faulty implementations. e.g. for years MySQL couldn't handle utf8 characters outside the BMP.

jstimpfle · on Feb 13, 2016

My parent was speaking about indexing at the code points level, not at the encoding (byte / character) level.

I do know that Unicode has combining code points (confusingly called combining characters) and nasty things like rtl switching code points. I guess it's turtles all the way down.

vardump · on Feb 13, 2016

> My parent was speaking about indexing at the code points level, not at the encoding (byte / character) level.

You need UTF-32 for (random) indexing of code points. UTF-16 has 16-bit code units. Some UTF-16 code points are 32-bits, using a surrogate pair.

So it's the same trade-off as with UTF-8. Thus no reason not to just simply use UTF-8 in the first place and take advantage of the memory savings.

jstimpfle · on Feb 13, 2016

Again, my original parent's statement was not about encoding or memory savings. The statement was that it was a bad idea to index into an (abstract) unicode string (of unicode code points -- not compositions thereof whatsoever).

I didn't question that, but hoped to get some inspiration for sane usage of unicode handling (which I'm not sure is humanly possible except for treating it as a rather black box and make no promises).

imron · on Feb 13, 2016

Your original parent was all about encodings, and mentioned it was a bad idea to arbitrarily index in to utf8 strings, (no mention of abstract strings of unicode codepoints).

> languages such as Rust gain efficiency by working with unmodified UTF-8. All you lose is constant-time arbitrary indexing

So it's saying Rust mostly benefits from using utf8, but in doing so, it loses the ability to arbitrarily index a character in a string (in constant time).

If it was abstract strings of unicode codepoints then there is no problem - except you'd then be using 32bits per codepoint.

imron · on Feb 13, 2016

Actually, they are not combining code points. Take for example the character 𪚥 (4 dragons).

The codepoint is U+2A6A5, but in UTF16 it requires combining 2 utf16 characters (\uD869 and \uDEA5) in order to reference it.

The codepoint however is still exactly the same (U+2A6A5).

vardump · on Feb 13, 2016

> The codepoint is U+2A6A5, but in UTF16 it requires combining 2 utf16 characters (\uD869 and \uDEA5) in order to reference it.

No, you mean two UTF-16 code units. A character is one or more code points.