> Unicode makes extensive use of combining characters for European languages, for example to produce diacritics (ìǒ), or even for flag emoji.
But it doesn't, for example, say that a lowercase "b" is simply "a lowercase 'l' followed by an 'o' followed by an invisible joiner", because no native English speaker thinks of the character "b" as even remotely related to "lo" when reading and writing.
> It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.
I'm saying that Unicode only does it in English where it makes semantic sense to a native English speaker. It does it in Bengali even where it makes little or no semantic sense to a native Bengali speaker.
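To make that concrete, here's a quick check with Python's standard unicodedata module: the precomposed ì really is defined as a composition, while b has no decomposition at all:

    import unicodedata

    # Precomposed 'ì' (U+00EC) canonically decomposes to 'i' + combining grave
    print(unicodedata.decomposition("\u00ec"))                  # '0069 0300'
    print(unicodedata.normalize("NFC", "i\u0300") == "\u00ec")  # True

    # 'b' has no decomposition: Unicode never treats it as 'l' + 'o' + a joiner
    print(unicodedata.decomposition("b"))                       # '' (empty)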
> > It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.
> I'm saying that Unicode only does it in English where it makes semantic sense to a native English speaker.
Well, combining characters almost never come up in English. The best I can think of would be the use of cedillas, diaereses, and grave accents in words like façade, coördinate, and renownèd (I've been reading Tolkien's translation of Beowulf, and he used renownèd a lot).
Thinking about the Spanish I learned in high school, ch, ll, ñ, and rr are all considered separate letters (i.e., the Spanish alphabet has 30 letters; ch is between c and d, ll is between l and m, ñ is between n and o, and rr is between r and s; interestingly, accented vowels aren't separate letters). Unicode does not provide code points for ch, ll, or rr; and ñ has a code point more from historical accident than anything else (the decision to start with Latin-1). Then again, I don't think Spanish keyboards have separate keys for ch, ll, or rr.
Portuguese, on the other hand, doesn't officially include k or y in the alphabet. But it uses far more accents than Spanish. So, a, ã and á are all the same letter. In a perfect world, how would Unicode handle this? Either they accept the Spanish view of the world, or the Portuguese view. Or, perhaps, they make a big deal about not worrying about languages and instead worrying about alphabets ( http://www.unicode.org/faq/basic_q.html#4 ).
They haven't been perfect. And they've certainly changed their approach over time. And I suspect they're including emoji to appear more welcoming to Japanese teenagers than they were in the past. But (1) combining characters aren't second-class citizens, and (2) the standard is still open to revisions ( http://www.unicode.org/alloc/Pipeline.html ).
Spanish speaker here. "ch" and "ll" being separate letters has been discussed for a long time and finally the decision was that they weren't separate letters but a combination of two [1]. Meanwhile, "ñ" stands as a letter of its own.
Accented vowels aren't considered different letters in Spanish because they affect the word they are in rather than the letter, as they serve to indicate which syllable in a word is the "strong" (stressed) one. From a Spanish point of view, "a" and "á" are exactly the same letter.
I'm coming from a German background and I sympathize with the author.
German has 4 (7 if you consider cases) non-ASCII characters:
äüöß (and upper-case umlauts). All of these are unique, well-defined codepoints.
That's not related to composing on a keyboard. In fact, although I'm German, I'm using the US keyboard layout and HAD to compose these characters just now. But I wouldn't need to, and the result is a single codepoint again.
> German has 4 (7 if you consider cases) non-ASCII characters: äüöß (and upper-case umlauts). All of these are unique, well-defined codepoints.
German does not consider "ä", "ö" and "ü" letters. Our alphabet has 26 letters, none of which are the ones you mentioned. In fact, if you go back in history it becomes even clearer that those letters used to be ligatures in writing.
They are still collated as the basic letters they represent, even if they sound different. That we usually use the precomposed representation in Unicode is merely a historical artifact of ISO-8859-1 and others, not because it logically makes sense.
When you used an old typewriter, you usually did not have those keys either; you composed them.
I'm confused by your use of 'our' and 'we'. It seems you're trying to write from the general point of view of a German, answering .. a German?
Are umlauts letters? Yes. [1] [2] Maybe not the best source, but please provide a better one if you disagree so that I can actually understand where you're coming from.
I understand - I hope? - composition. And I tend to agree that it shouldn't matter much if the input just works. If I press a key labeled ü and that letter shows up on the screen, I shouldn't really care if that is one codepoint or a composition of two (or more).
I do think that the history you mention is an indicator that supports the author's argument. There IS a codepoint for ü (painful to type...). For 'legacy reasons', perhaps. And it feels to me that non-ASCII characters - for legacy reasons or whatever - have better support than the ones he is complaining about, if they originate in Western Europe/in my home country.
(Basically I searched for old typewriter models; 'Adler Schreibmaschinen' results in lots of hits like that. Note the separate umlaut keys. And these are typewriters from... the 60s? Maybe?)
I am not entirely sure whether Germans count umlauts as distinct characters or as modified versions of the base character. And maybe it is not so important; they still do deserve their own code points.
Note BTW that in e.g. the Swedish and German alphabets, there are some overlapping non-ASCII characters (ä, ö) and some that are distinct to each language (å, ü). It is important that the Swedish ä and the German ä map to the same code point and the same representation in files; this way I can use a computer localised for Swedish and type German text. Only when I need to type ü do I have to compose it from ¨ and u, while ä and ö are right on the keyboard.
The German alphabetical order supports the idea that umlauts are not so distinct from their bases: it is
AÄBCDEFGHIJKLMNOÖPQRSßTUÜVWXYZ
while the Swedish/Finnish one is
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
This has the obvious impacts on sorting order.
BTW, traditionally Swedish/Finnish did not distinguish between V and W in sorting, thus a correct sorting order would be
Vasa
Westerlund
Vinberg
Vårdö
- the W drops right in the middle, it's just an older way to write V. And Vå... is at the end of section V, while Va... is at the start.
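To see this in running code, here's a minimal sketch using Python's locale module; it assumes the de_DE.UTF-8 and sv_SE.UTF-8 locales are installed, and the exact order (including whether V and W are folded together) depends on the system's collation tables:

    import locale

    words = ["Vasa", "Westerlund", "Vinberg", "Vårdö"]

    # Collation depends on the active locale: German sorts umlauts with
    # their base letters, Swedish sorts å/ä/ö after z.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))

    locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")
    print(sorted(words, key=locale.strxfrm))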
German has valid transcriptions to their base alphabet for those, e.g. "Schreoder" is a valid way to write "Schröder".
ß, however, is a separate character that is not listed in the German alphabet, especially because some subgroups don't use it (e.g. Swiss German doesn't have it).
1) To avoid confusing readers who don't know German or aren't used to umlauts: the correct transcription is base vowel + e (i.e. ö turns into oe - the example given is therefore wrong. Probably just a typo, but still).
2) These transcriptions are lossy. If you see 'oe' in a word, you cannot (always) pronounce it as an umlaut. The second e might just indicate that the o in 'oe' is long.
3) ß is a character in the alphabet, as far as I'm aware and as far as the mighty Wikipedia is concerned, as I pointed out above. If you have better sources that claim something else, please share those (I .. am a native speaker, but no language expert. So I'm genuinely curious why you'd think that this letter isn't part of the alphabet).
Fun fact: I once had to revise all the documentation for a project, because the (huge, state-owned) Swiss customer refused perfectly valid German, stating "We don't have that letter here, we don't use it: Remove it".
1) It's a typo, yes. Thanks!
2) Well, they are lossy in the sense that pronunciation is context-sensitive. The number of cases where you actually turn the word into another word is very small: http://snowball.tartarus.org/algorithms/german2/stemmer.html has a discussion.
3) You are right, I'm wrong. ß, ä, ö, ü are considered part of the alphabet. It's not taught in school, though (at least it wasn't in mine).
Thanks a lot for making the effort and fact-checking better than I did there :).
Yes, that transcription approach is familiar; here the German/Swedish/Finnish equivalence of "ä" sometimes gives not-so-good results.
For instance, in skiing competitions, the start lists are for some reason made with transcriptions to ASCII. It's quite okay that Schröder becomes Schroeder, but it is less desirable that Söderström becomes Soederstroem and quite infuriating that Hämäläinen becomes Haemaelaeinen. We'd like it to be Hamalainen, just drop the dots.
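"Just drop the dots" is straightforward to implement: decompose to NFD, then throw away the combining marks. A minimal sketch in Python (strip_marks is just an illustrative name):

    import unicodedata

    def strip_marks(s):
        # Decompose to NFD, then drop combining marks: 'just drop the dots'
        return "".join(ch for ch in unicodedata.normalize("NFD", s)
                       if not unicodedata.combining(ch))

    print(strip_marks("Hämäläinen"))   # Hamalainen
    print(strip_marks("Söderström"))   # Soderstrom

Of course, applied to German names this yields Schroder rather than the conventional Schroeder, which is exactly the mismatch described above.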
Well, they have codepoints, but not unique ones (since they can be written both using combining characters and using the precomposed form). Software libraries dealing with Unicode strings need to handle both versions, by applying Unicode normalization before doing comparisons.
The reason they have two representations is backwards compatibility with previous character encoding standards, but the Unicode standard is more complex because of this (it needs to specify more equivalences for normalization). I guess for languages which were not previously covered by any standard, the Unicode consortium tries to represent things "as uniquely as possible".
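For example, in Python the two forms compare unequal until you normalize them (using the standard unicodedata module):

    import unicodedata

    precomposed = "\u00f1"   # 'ñ' as a single codepoint
    combining   = "n\u0303"  # 'n' followed by COMBINING TILDE

    print(precomposed == combining)   # False: different codepoint sequences
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", combining))  # True: canonically equivalent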
> But I wouldn't need to, and the result is a single codepoint again.
Doesn't have to be, though; it'd be perfectly correct for an IME to generate multiple codepoints. IIRC, that's what you'd get if you typed those in a filename on OS X and then asked for the native file path, as HFS+ stores filenames in NFD. Meanwhile, Safari does (used to do?) the opposite: text is automatically NFC'd before sending. Things get interesting when you don't expect it and don't do Unicode-equivalent comparisons.
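A quick sketch of that round trip in Python (plain NFD here; HFS+ actually uses its own decomposition table, but the effect is the same):

    import unicodedata

    typed  = "\u00fc"                              # 'ü' as an IME might emit it
    stored = unicodedata.normalize("NFD", typed)   # roughly what HFS+ keeps

    print(len(typed), len(stored))                 # 1 2
    print(typed == stored)                         # False, codepoint-wise
    print(typed == unicodedata.normalize("NFC", stored))  # True after normalizing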
.NET seems to do the same thing, and JavaScript (according to jsFiddle) as well. So maybe this is more widespread than I thought (again - I have never seen that character in the wild)?
Java (as in Try Clojure) seems to do the 'expected' SS thing. Trying the Go playground, I get even worse:
    fmt.Println(strings.ToUpper("ßẞ"))
returns
    ßẞ
(yeah, unchanged?)
So, while I agree that you're technically correct (ẞ exists!), I'll stick to my ~7-letter list for now. It seems that's both realistic and usable.
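For comparison, CPython (recent 3.x, at least) applies the full Unicode case mappings here, which also shows why upper/lower can't round-trip:

    print("ß".upper())          # 'SS' -- the 'expected' expansion
    print("ẞ".lower())          # 'ß'
    print("ß".upper().lower())  # 'ss', not 'ß': the round trip is lossy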
I think this is more related to the fact that there aren't many sane libraries implementing Unicode and locales -- so you'll get either some C/C++ lib, system lib, or Java lib -- or an actual new implementation that has been done "seriously", as part of being able to say: "Yes, X does actually support Unicode strings."
Python 3 got a lot of flak for the decision to break away from byte sequences to a Unicode string type. But I think that was the right choice. I can still understand why people writing software that only cares about network, on-the-wire, pretend-to-be-text strings were unhappy about it, though.
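The str/bytes split in one small example (Python 3):

    s = "ßẞ"                       # str: a sequence of codepoints
    b = s.encode("utf-8")          # bytes: what actually goes on the wire
    print(len(s), len(b))          # 2 5 -- ß is 2 bytes in UTF-8, ẞ is 3
    print(b.decode("utf-8") == s)  # True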
Then again, based on some other comments here, apparently there are still some dark corners:
    Python 3.2.3 (default, Feb 20 2013, 14:44:27)
    [GCC 4.7.2] on linux2
    >>> s="Åßẞ"
    >>> s == s.upper().lower()
    False
    >>> s.lower()
    'åßß'
Thanks for pointing that out -- I was vaguely aware 3.2 wasn't good (but PyPy still isn't up to 3.4?) -- it's what's (still) in Debian stable as python3, though. Jessie (soon-ish to be released) will have 3.4, so at that point Python 3 should really start to be viable (to the extent that the differences actually matter...).
[ed: Also, wrt upper/lower being for display purposes -- I thought it was nice to point out that they are not symmetric, as one might expect them to be (although that expectation is probably wrong in the first place...).]
That's good to know. I learned Portuguese in '97-'99, so the information I had was incorrect at the time. We Americans always recited the alphabet with k and y, but our teacher said they weren't official (although he also said that Brazilians would recognize them).
rr not being its own letter has no bearing on whether you can pronounce carro correctly, just like saying church right has no bearing on whether c and h are two letters or ch is a single letter.
I'm afraid that I have to add my voice to the list of people raised in Spanish-speaking countries prior to the 90s who were VERY clearly taught that rr was a separate letter.
In your example... I wouldn't really care how it is stored, as long as it looks right on the display and I don't have to go through contortions to enter it on an input device... for example, I don't care that 'a' maps to \x61... it's a value behind the scenes; what matters is the interface to that value.
As long as the typeface/font used can display the character/combination reasonably, and I can input it reasonably, it doesn't matter so much how it's stored...
Now, having to type in 2-3 characters to get there, that's a different story, and one that should involve better input devices in most cases.