> Unicode makes extensive use of combining characters for European languages, for example to produce diacritics (ìǒ), or even for flag emoji.
But it doesn't, for example, say that a lowercase "b" is simply "a lowercase 'l' followed by an 'o' followed by an invisible joiner", because no native English speaker thinks of the character "b" as even remotely related to "lo" when reading and writing.
> It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.
I'm saying that Unicode only does it in English where it makes semantic sense to a native English speaker. It does it in Bengali even where it makes little or no semantic sense to a native Bengali speaker.
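To make that concrete, here's a quick check with Python's standard unicodedata module: the precomposed ì really is defined as a composition, while b has no decomposition at all:

    import unicodedata

    # Precomposed 'ì' (U+00EC) canonically decomposes to 'i' + combining grave
    print(unicodedata.decomposition("\u00ec"))                  # '0069 0300'
    print(unicodedata.normalize("NFC", "i\u0300") == "\u00ec")  # True

    # 'b' has no decomposition: Unicode never treats it as 'l' + 'o' + a joiner
    print(unicodedata.decomposition("b"))                       # '' (empty)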
> > It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.
> I'm saying that Unicode only does it in English where it makes semantic sense to a native English speaker.
Well, combining characters almost never come up in English. The best I can think of would be the use of cedillas, diaereses, and grave accents in words like façade, coördinate, and renownèd (I've been reading Tolkien's translation of Beowulf, and he used renownèd a lot).
Thinking about the Spanish I learned in high school, ch, ll, ñ, and rr are all considered separate letters (i.e., the Spanish alphabet has 30 letters; ch is between c and d, ll is between l and m, ñ is between n and o, and rr is between r and s; interestingly, accented vowels aren't separate letters). Unicode does not provide code points for ch, ll, or rr; and ñ has a code point more from historical accident than anything else (the decision to start with Latin-1). Then again, I don't think Spanish keyboards have separate keys for ch, ll, or rr.
Portuguese, on the other hand, doesn't officially include k or y in the alphabet. But it uses far more accents than Spanish. So, a, ã and á are all the same letter. In a perfect world, how would Unicode handle this? Either they accept the Spanish view of the world, or the Portuguese view. Or, perhaps, they make a big deal about not worrying about languages and instead worrying about alphabets ( http://www.unicode.org/faq/basic_q.html#4 ).
They haven't been perfect. And they've certainly changed their approach over time. And I suspect they're including emoji to appear more welcoming to Japanese teenagers than they were in the past. But (1) combining characters aren't second-class citizens, and (2) the standard is still open to revisions ( http://www.unicode.org/alloc/Pipeline.html ).
Spanish speaker here. "ch" and "ll" being separate letters has been discussed for a long time and finally the decision was that they weren't separate letters but a combination of two [1]. Meanwhile, "ñ" stands as a letter of its own.
Accented vowels aren't considered different letters in Spanish because they affect the word they are in rather than the letter, as they serve to indicate which syllable in a word is the "strong" (stressed) one. From a Spanish point of view, "a" and "á" are exactly the same letter.
I'm coming from a German background and I sympathize with the author.
German has 4 (7 if you consider cases) non-ASCII characters:
äüöß (and upper-case umlauts). All of these are unique, well-defined codepoints.
That's not related to composing on a keyboard. In fact, although I'm German, I'm using the US keyboard layout and HAD to compose these characters just now. But I wouldn't need to, and the result is a single codepoint again.
> German has 4 (7 if you consider cases) non-ASCII characters: äüöß (and upper-case umlauts). All of these are unique, well-defined codepoints.
German does not consider "ä", "ö" and "ü" letters. Our alphabet has 26 letters, none of which are the ones you mentioned. In fact, if you go back in history it becomes even clearer that those letters used to be ligatures in writing.
They are still collated as the basic letters they represent, even if they sound different. That we usually use the precomposed representation in Unicode is merely a historical artifact of ISO-8859-1 and others, not because it logically makes sense.
When you used an old typewriter, you usually did not have those keys either; you composed them.
I'm confused by your use of 'our' and 'we'. It seems you're trying to write from the general point of view of a German, answering .. a German?
Are umlauts letters? Yes. [1] [2] Maybe not the best source, but please provide a better one if you disagree so that I can actually understand where you're coming from.
I understand - I hope? - composition. And I tend to agree that it shouldn't matter much if the input just works. If I press a key labeled ü and that letter shows up on the screen, I shouldn't really care if that is one codepoint or a composition of two (or more).
I do think that the history you mention is an indicator that supports the author's argument. There IS a codepoint for ü (painful to type...). For 'legacy reasons', perhaps. And it feels to me that non-ASCII characters - for legacy reasons or whatever - have better support than the ones he is complaining about, if they originate in Western Europe/in my home country.
(Basically I searched for old typewriter models; 'Adler Schreibmaschinen' results in lots of hits like that. Note the separate umlaut keys. And these are typewriters from... the 60s? Maybe?)
I am not entirely sure whether Germans count umlauts as distinct characters or as modified versions of the base character. And maybe it is not so important; they still do deserve their own code points.
Note BTW that in e.g. the Swedish and German alphabets, there are some overlapping non-ASCII characters (ä, ö) and some that are distinct to each language (å, ü). It is important that the Swedish ä and the German ä map to the same code point and the same representation in files; this way I can use a computer localised for Swedish and type German text. Only when I need to type ü do I have to compose it from ¨ and u, while ä and ö are right on the keyboard.
The German alphabetical order supports the idea that umlauts are not so distinct from their bases: it is
AÄBCDEFGHIJKLMNOÖPQRSßTUÜVWXYZ
while the Swedish/Finnish one is
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
This has the obvious impacts on sorting order.
BTW, traditionally Swedish/Finnish did not distinguish between V and W in sorting, thus a correct sorting order would be
Vasa
Westerlund
Vinberg
Vårdö
- the W drops right in the middle, it's just an older way to write V. And Vå... is at the end of section V, while Va... is at the start.
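To see this in running code, here's a minimal sketch using Python's locale module; it assumes the de_DE.UTF-8 and sv_SE.UTF-8 locales are installed, and the exact order (including whether V and W are folded together) depends on the system's collation tables:

    import locale

    words = ["Vasa", "Westerlund", "Vinberg", "Vårdö"]

    # Collation depends on the active locale: German sorts umlauts with
    # their base letters, Swedish sorts å/ä/ö after z.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))

    locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
    print(sorted(words, key=locale.strxfrm))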
German has valid transcriptions to their base alphabet for those, e.g. "Schreoder" is a valid way to write "Schröder".
ß, however, is a separate character that is not listed in the German alphabet, especially because some subgroups don't use it (e.g. Swiss German doesn't have it).
1) To avoid confusing readers who don't know German or aren't used to umlauts: the correct transcription is base vowel + e (i.e. ö turns into oe - the example given is therefore wrong. Probably just a typo, but still).
2) These transcriptions are lossy. If you see 'oe' in a word, you cannot (always) pronounce it as an umlaut. The second e might just indicate that the o in 'oe' is long.
3) ß is a character in the alphabet, as far as I'm aware and as far as the mighty Wikipedia is concerned, as I pointed out above. If you have better sources that claim something else, please share those (I .. am a native speaker, but no language expert. So I'm genuinely curious why you'd think that this letter isn't part of the alphabet).
Fun fact: I once had to revise all the documentation for a project, because the (huge, state-owned) Swiss customer refused perfectly valid German, stating "We don't have that letter here, we don't use it: Remove it".
1) It's a typo, yes. Thanks!
2) Well, they are lossy in the sense that pronunciation is context-sensitive. The number of cases where you actually turn the word into another word is very small: http://snowball.tartarus.org/algorithms/german2/stemmer.html has a discussion.
3) You are right, I'm wrong. ß, ä, ö, ü are considered part of the alphabet. It's not taught in school, though (at least it wasn't in mine).
Thanks a lot for making the effort and fact-checking better than I did there :).
Yes, that transcription approach is familiar; here the German/Swedish/Finnish equivalence of "ä" sometimes gives not-so-good results.
For instance, in skiing competitions, the start lists are for some reason made with transcriptions to ASCII. It's quite okay that Schröder becomes Schroeder, but it is less desirable that Söderström becomes Soederstroem and quite infuriating that Hämäläinen becomes Haemaelaeinen. We'd like it to be Hamalainen, just drop the dots.
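"Just drop the dots" is straightforward to implement: decompose to NFD, then throw away the combining marks. A minimal sketch in Python (strip_marks is just an illustrative name):

    import unicodedata

    def strip_marks(s):
        # Decompose to NFD, then drop combining marks: 'just drop the dots'
        return "".join(ch for ch in unicodedata.normalize("NFD", s)
                       if not unicodedata.combining(ch))

    print(strip_marks("Hämäläinen"))   # Hamalainen
    print(strip_marks("Söderström"))   # Soderstrom

Of course, applied to German names this yields Schroder rather than the conventional Schroeder, which is exactly the mismatch described above.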
Well, they have codepoints, but not unique ones (since they can be written both using combining characters and using the precomposed form). Software libraries dealing with Unicode strings need to handle both versions, by applying Unicode normalization before doing comparisons.
The reason they have two representations is backwards compatibility with previous character encoding standards, but the Unicode standard is more complex because of this (it needs to specify more equivalences for normalization). I guess for languages which were not previously covered by any standard, the Unicode consortium tries to represent things "as uniquely as possible".
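For example, in Python the two forms compare unequal until you normalize them (using the standard unicodedata module):

    import unicodedata

    precomposed = "\u00f1"   # 'ñ' as a single codepoint
    combining   = "n\u0303"  # 'n' followed by COMBINING TILDE

    print(precomposed == combining)   # False: different codepoint sequences
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", combining))  # True: canonically equivalent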
> But I wouldn't need to, and the result is a single codepoint again.
Doesn't have to be, though; it'd be perfectly correct for an IME to generate multiple codepoints. IIRC, that's what you'd get if you typed those in a filename on OS X and then asked for the native file path, as HFS+ stores filenames in NFD. Meanwhile, Safari does (used to do?) the opposite: text is automatically NFC'd before sending. Things get interesting when you don't expect it and don't do Unicode-equivalent comparisons.
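A quick sketch of that round trip in Python (plain NFD here; HFS+ actually uses its own decomposition table, but the effect is the same):

    import unicodedata

    typed  = "\u00fc"                              # 'ü' as an IME might emit it
    stored = unicodedata.normalize("NFD", typed)   # roughly what HFS+ keeps

    print(len(typed), len(stored))                 # 1 2
    print(typed == stored)                         # False, codepoint-wise
    print(typed == unicodedata.normalize("NFC", stored))  # True after normalizing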
.NET seems to do the same thing, and JavaScript (according to jsFiddle) as well. So maybe this is more widespread than I thought (again - I have never seen that character in the wild)?
Java (as in Try Clojure) seems to do the 'expected' SS thing. Trying the Go playground, I get even worse:
    fmt.Println(strings.ToUpper("ßẞ"))
returns
    ßẞ
(yeah, unchanged?)
So, while I agree that you're technically correct (ẞ exists!), I'll stick to my ~7-letter list for now. It seems that's both realistic and usable.
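For comparison, CPython (recent 3.x, at least) applies the full Unicode case mappings here, which also shows why upper/lower can't round-trip:

    print("ß".upper())          # 'SS' -- the 'expected' expansion
    print("ẞ".lower())          # 'ß'
    print("ß".upper().lower())  # 'ss', not 'ß': the round trip is lossy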
I think this is more related to the fact that there aren't many sane libraries implementing Unicode and locales -- so you'll get either some C/C++ lib, system lib, or Java lib -- or an actual new implementation that has been done "seriously", as part of being able to say: "Yes, X does actually support Unicode strings."
Python 3 got a lot of flak for the decision to break away from byte sequences to a Unicode string type. But I think that was the right choice. I can still understand why people writing software that only cares about network, on-the-wire, pretend-to-be-text strings were unhappy about it, though.
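The str/bytes split in one small example (Python 3):

    s = "ßẞ"                       # str: a sequence of codepoints
    b = s.encode("utf-8")          # bytes: what actually goes on the wire
    print(len(s), len(b))          # 2 5 -- ß is 2 bytes in UTF-8, ẞ is 3
    print(b.decode("utf-8") == s)  # True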
Then again, based on some other comments here, apparently there are still some dark corners:
    Python 3.2.3 (default, Feb 20 2013, 14:44:27)
    [GCC 4.7.2] on linux2
    >>> s="Åßẞ"
    >>> s == s.upper().lower()
    False
    >>> s.lower()
    'åßß'
Thanks for pointing that out -- I was vaguely aware 3.2 wasn't good (but PyPy still isn't up to 3.4?) -- it's what's (still) in Debian stable as python3, though. Jessie (soon-ish to be released) will have 3.4, so at that point Python 3 should really start to be viable (to the extent that the differences actually matter...).
[ed: Also, wrt upper/lower being for display purposes -- I thought it was nice to point out that they are not symmetric, as one might expect them to be (although that expectation is probably wrong in the first place...).]
That's good to know. I learned Portuguese in '97-'99, so the information I had was incorrect at the time. We Americans always recited the alphabet with k and y, but our teacher said they weren't official (although he also said that Brazilians would recognize them).
rr not being its own letter has no bearing on whether you can pronounce carro correctly, just like saying church right has no bearing on whether c and h are two letters or ch is a single letter.
I'm afraid that I have to add my voice to the list of people raised in Spanish-speaking countries prior to the 90s who were VERY clearly taught that rr was a separate letter.
In your example... I wouldn't really care how it is stored, as long as it looks right on the display and I don't have to go through contortions to enter it on an input device... for example, I don't care that 'a' maps to \x61... it's a value behind the scenes; what matters is the interface to that value.
As long as the typeface/font used can display the character/combination reasonably, and I can input it reasonably, it doesn't matter so much how it's stored...
Now, having to type in 2-3 characters to get there, that's a different story, and one that should involve better input devices in most cases.