Strings – Dive Into Python 3 (getpython3.com)
53 points by morphics on May 10, 2013 | 28 comments


Apparently, all I know about strings is correct.


"Thank you, Mark Pilgrim".

(Unless you already knew it ten years ago. Back when he wrote "Everything you thought you knew about strings is wrong," it was true of almost every programmer, and it was thanks to this piece and similar ones by Spolsky and others that the information got spread around.)

There must be some phrase for the opposite of the "self-fulfilling prophecy": the cautionary phrase that causes itself to become false in the future ;-)


As for me, I found Spolsky's article lacking, too. But lurking for years on the Unicode ML is probably not something most people do. You learn a lot there, though.


And, he didn't even touch on the horror that is EBCDIC. Once you've had to touch that, the idea of "code point" for a character is something you can't ignore, hoping that things just work "most of the time" -- ASCII A != EBCDIC A.
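
A quick sketch of the difference from Python, assuming the stock cp037 codec (EBCDIC US/Canada, just one of several variants) is close enough for illustration:

  >>> 'A'.encode('ascii')    # ASCII A is 0x41
  b'A'
  >>> 'A'.encode('cp037')    # EBCDIC A is 0xC1
  b'\xc1'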


EBCDIC A != EBCDIC A :-)

It's easier to read text on punchcards in EBCDIC, whatever the variation, though.


Are there still (many) people using EBCDIC?


Yes. High volume printing is often done using IBM's AFP/MODCA print language, which typically has the text in IBM's EBCDIC encoding.

(disclosure: I once worked at a company that made tools to port code and data from IBM minicomputers to Unix & MS platforms, and also at the largest [format,] print & mail shop in the US)


Or the 6 bit funky set that the old CDC Cyber mainframes used to use! ("What is this lower case 'a' you speak of???")


My thought exactly, haha


TL;DR: think of a string as a tuple of numbers. Some are bytes, some are integers. If you want to transform those numbers into a particular encoding (e.g. UTF-8, CP-1252), that's a different story.

EDIT: I know the article went all out about character abstraction; that's why I said "some are bytes, some are integers".


This is entirely the wrong take-away message from this article. The point is that strings are not sequences of numbers, but are, rather, sequences of characters. Characters are abstracted from the underlying byte representation, which is unimportant when dealing with strings.

For situations where a concrete byte representation is needed, you can get one by encoding the string.
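
For instance, a rough sketch of what that looks like in Python 3 (the string here is just an arbitrary example):

  >>> s = 'naïve'                      # a sequence of characters
  >>> s.encode('utf-8')                # concrete bytes, only when asked for
  b'na\xc3\xafve'
  >>> s.encode('latin-1')              # same string, different byte representation
  b'na\xefve'
  >>> b'na\xc3\xafve'.decode('utf-8')  # and back again
  'naïve'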


Even this definition can get hairy, though. What is a character? Is 'á' one character or two? Most human beings would say one, but in actuality I formed it with an 'a' (U+0061) and a combining acute accent (U+0301): Two separate code points. But you can also get the same result with 'á' (U+00E1); this is not true of all combining character combinations.
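
You can see both forms, and convert between them, with unicodedata (a sketch; NFC composes, NFD decomposes):

  >>> import unicodedata
  >>> len('a\u0301')      # base letter plus combining acute accent
  2
  >>> len('\u00e1')       # precomposed á
  1
  >>> unicodedata.normalize('NFC', 'a\u0301') == '\u00e1'
  True
  >>> unicodedata.normalize('NFD', '\u00e1') == 'a\u0301'
  True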

In the past, I've had to deal with horrible mashups of fixed-byte-length columns in flat text files with UTF-8 bolted onto it. In Java, no less. Trying to figure out how to deal with all the edge cases (how do you truncate a string when the boundary is between a "normal" character and a combining character?) was an endless parade of the bizarre. Strings are hard, fundamentally.


In that case you “only” have to know what you're actually after: either a grapheme (a character in the human sense) or a code point (a character in the Unicode sense) – well, and then there is the code unit, which can be a “character” in the programming-language sense, but it's best not to go there, lest you fall into plenty of traps.

As long as you only wander around one of those levels (grapheme, code point, code unit, byte) all is (fairly) easy, but once you deal with multiple levels mistakes almost invariably creep in and you start treating code points as graphemes or code units as code points, etc. Fun source of all kinds of bugs :-)
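
One string, four different counts, depending on which level you ask about (a sketch; the code-unit count is UTF-16's, and the grapheme count is by eye, since the standard library won't count graphemes for you):

  >>> s = 'e\u0301\U0001F600'          # é (two code points) followed by an emoji
  >>> len(s)                           # code points
  3
  >>> len(s.encode('utf-16-le')) // 2  # UTF-16 code units (the emoji needs a surrogate pair)
  4
  >>> len(s.encode('utf-8'))           # bytes in UTF-8
  7
  >>> # graphemes: 2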

So yes, text in general is hard. And Han, Hangul and the Japanese scripts are probably among the easiest scripts to support in software :-)


Yes. It's much better to think of a string as a sequence of Unicode code points.


As your parent already noted, thinking of it as a sequence of code points goes wrong when you need to truncate a string in between a base and a combining character.


Not true. Take this string: "d͊"

It is composed of two code points: U+0064 and U+034A. The second code point is a combining character. The two code points together form one glyph. The term "character" is confusing because people use different definitions for it, I avoid using it, but the term Unicode code point is very clear.

A Python 3 string is a sequence of code points. The above string is represented like this:

  >>> print("d\u034A")
  d͊
  >>> len("d\u034A")
  2
Truncating between the base and combining code points works as expected:

  >>> "d\u034a"[0]
  'd'
  >>> "d\u034a"[1]
  '͊'


Except it doesn't work as expected because users generally expect graphemes to stay as they are instead of losing random diacritics.


By users, do you mean Python 3 programmers?


Indeed. I think Python 3 is very explicit with that distinction as well. You can have either text, which is in Unicode, or you have data which are arbitrary bytes. Sure, those bytes can represent text by interpreting them with a specific encoding, but you have to convert between one and the other explicitly to make it work. A very nice thing after the debacle in Python 2 where bytestrings in UTF-8 locales on Unix-likes happen to almost work in many cases, just to break horribly in other environments.
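
A sketch of that explicitness (the exact TypeError message varies between Python 3 versions):

  >>> text = 'über'                  # str: Unicode text
  >>> data = text.encode('utf-8')    # bytes: an explicit encoding step
  >>> data
  b'\xc3\xbcber'
  >>> data.decode('utf-8')           # and an explicit decoding step to get text back
  'über'
  >>> text + data                    # no implicit coercion, unlike Python 2
  Traceback (most recent call last):
    ...
  TypeError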

That being said, there are a lot of inaccuracies and even wrong things in that article, which saddens me.


If you think of a string as a sequence of code points (integers) you get a correct but inefficient model (UTF-32).

Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those.

UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.
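
The size difference is easy to see (a sketch; utf-32-le is used to sidestep the byte-order mark that plain 'utf-32' prepends):

  >>> len('hello'.encode('utf-32-le'))    # four bytes per code point, always
  20
  >>> len('hello'.encode('utf-8'))        # one byte per code point here
  5
  >>> len('\U0001F600'.encode('utf-8'))   # at most four for the largest code points
  4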


In Python 3, there is a clear distinction between string objects and bytes objects. String objects are sequences of Unicode code points, and bytes objects are sequences of byte values (0-255).

> Character is a context in which we read integers and currently we don't use more than a couple hundred thousand of those.

I don't understand this sentence.

> UTF-8 is a data-compression technique taking advantage of the fact that smaller code points are used more often, and the largest ones (which require 5+ bytes, because any compression algorithm expands some inputs) are, currently, not standardized and effectively never used.

It's better to think of UTF-8 and UTF-32 as encodings: their role is to serialize strings into byte sequences and to deserialize byte sequences into strings. Some encodings are more efficient than others in different cases, but that doesn't change their role, and it's not necessary to understand this to avoid Unicode mistakes.
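
To illustrate the serialize/deserialize view (a sketch): different encodings produce different byte sequences, but a round trip gives back the same string.

  >>> s = 'Ünïcödé'
  >>> s.encode('utf-8').decode('utf-8') == s
  True
  >>> s.encode('utf-32').decode('utf-32') == s
  True
  >>> s.encode('utf-8') == s.encode('utf-32')   # the serialized forms differ
  False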


There never will be more than 4 bytes for UTF-8 because Unicode is restricted to 21 bits. Remember that all UTFs have to be able to represent all of Unicode and UTF-16 could not represent those “code points” where UTF-8 needs 5+ bytes.

Also, I wouldn't say that UTF-8 is a compression scheme. SCSU is one, but has its own share of problems. UTF-8 just happens to preserve ASCII compatibility, which is an important property for Unix-like systems. Nothing more and nothing less. That it also happens to be more space-efficient for text that consists mostly of ASCII characters is merely a side effect of that.
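
Both points are easy to check (a sketch):

  >>> len(chr(0x10FFFF).encode('utf-8'))   # the largest code point still fits in 4 bytes
  4
  >>> chr(0x110000)                        # beyond the 21-bit limit there are no code points
  Traceback (most recent call last):
    ...
  ValueError: chr() arg not in range(0x110000)
  >>> 'plain ASCII'.encode('utf-8') == b'plain ASCII'   # ASCII compatibility
  True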


From the standpoint of an English speaker, UTF-8 is effectively a (good) compression scheme for Unicode, as opposed to using 2 or more bytes for every character.

I guess if I were German or Spanish (to say nothing of Asian languages), it would be the opposite of compression :-)
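
A quick sketch of the byte counts (the Japanese string is just an arbitrary example):

  >>> len('hello'.encode('utf-8')), len('hello'.encode('utf-16-le'))
  (5, 10)
  >>> len('こんにちは'.encode('utf-8')), len('こんにちは'.encode('utf-16-le'))
  (15, 10)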


A Friday challenge: In Python when is u'ß'.upper() equal to u'SS'?

I discovered one case today, there may be others. Answer: https://twitter.com/moreati/status/332910618858364928
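
For context (the linked answer may be a different case): in Python 3, since 3.3 if memory serves, str.upper() applies the full Unicode case mapping, while Python 2's unicode.upper() uses the simple mapping and leaves ß unchanged. A sketch, from two separate interpreters:

  >>> 'ß'.upper()     # Python 3
  'SS'
  >>> u'ß'.upper()    # Python 2
  u'\xdf'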


Actually, this is a bug. Unicode codepoint U+1E9E is LATIN CAPITAL LETTER SHARP S and should be the result of u"ß".upper(). This is especially so because otherwise u"Maße".upper() (Maße means measures) returns "MASSE", which could be confused with u"Masse".upper() (Masse means mass). In such cases, where confusion is possible and no uppercase ß is available, the German dictionary Duden actually suggests using SZ instead. Therefore, u"Maße".upper() would have to return "MASZE". However, since the Python string processing routines can hardly carry a dictionary around just to check whether there is a similar word that would have the same uppercase spelling, this is obviously not feasible. U+1E9E would be the way to go.
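
The confusion is easy to reproduce (a sketch; the code point exists, but the default case mapping doesn't use it):

  >>> import unicodedata
  >>> unicodedata.name('\u1e9e')
  'LATIN CAPITAL LETTER SHARP S'
  >>> 'Maße'.upper() == 'Masse'.upper()   # measures vs. mass become indistinguishable
  True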


According to Unicode, ß gets converted to SS in uppercase. This is by definition and doesn't change (stability policies, as far as I recall). Even in German you'll never see ß capitalised as SZ (except when I do it, but I'm a very, very small minority – and now I'm more likely to use ẞ).


And will it equal 'ẞ' in a later update? http://opentype.info/blog/2013/04/22/capital-sharp-s-in-use/


The epigraph to that chapter is brilliant.



