
Hmm, strange. One of the stated design features of UTF-8 was that Boyer-Moore could still be directly applied (http://www.ietf.org/rfc/rfc2279.txt). Is grep doing something unnecessarily suboptimal here, or is that RFC's comment correct but not the whole story?
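
For what it's worth, the property the RFC relies on is that UTF-8 is self-synchronizing: no byte of a multibyte sequence can be mistaken for an ASCII byte. So a plain byte-oriented skip search works unmodified. A minimal Boyer-Moore-Horspool sketch over raw bytes (an illustration of the idea, not grep's actual code):

    #include <stddef.h>
    #include <string.h>

    /* Boyer-Moore-Horspool over raw bytes. Because no byte of a UTF-8
       multibyte sequence can equal an ASCII byte, an ASCII needle can
       only match at genuine character boundaries in valid UTF-8. */
    const unsigned char *bmh_search(const unsigned char *hay, size_t hlen,
                                    const unsigned char *needle, size_t nlen)
    {
        size_t skip[256];
        if (nlen == 0 || hlen < nlen)
            return NULL;
        for (size_t i = 0; i < 256; i++)
            skip[i] = nlen;
        for (size_t i = 0; i + 1 < nlen; i++)
            skip[needle[i]] = nlen - 1 - i;
        for (size_t pos = 0; pos + nlen <= hlen;
             pos += skip[hay[pos + nlen - 1]]) {
            if (memcmp(hay + pos, needle, nlen) == 0)
                return hay + pos;
        }
        return NULL;
    }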


The rumor is that GNU grep does a slow conversion of each input character into a wchar_t. I haven't personally read the code to verify it, but I consider the following source reliable enough to believe it.

http://9fans.net/archive/?q=%27gnu+grep%27+UTF-8+malloc&...
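
If that rumor is accurate, the hot loop would look something like this hypothetical sketch (mbrtowc() per character, plus an allocation for the wide buffer; none of these names come from grep's source):

    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>

    /* Hypothetical sketch: decode an entire buffer into wchar_t before
       matching. The per-character mbrtowc() calls plus the extra
       allocation are what make this so much slower than byte search. */
    wchar_t *widen(const char *buf, size_t len, size_t *out_len)
    {
        mbstate_t st;
        memset(&st, 0, sizeof st);
        wchar_t *wide = malloc((len + 1) * sizeof *wide);
        size_t n = 0;
        if (!wide)
            return NULL;
        while (len > 0) {
            size_t consumed = mbrtowc(&wide[n], buf, len, &st);
            if (consumed == (size_t)-1 || consumed == (size_t)-2)
                break;              /* invalid or truncated sequence */
            if (consumed == 0)
                consumed = 1;       /* embedded NUL */
            buf += consumed;
            len -= consumed;
            n++;
        }
        wide[n] = L'\0';
        *out_len = n;
        return wide;
    }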


The problem is probably that deciding whether two Unicode characters (where one character may be composed of multiple code points) are equivalent in the general case is MUCH more involved than a simple byte comparison.

http://en.wikipedia.org/wiki/Unicode_equivalence
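
For example, 'é' can be encoded either precomposed or as 'e' plus a combining accent; the two are canonically equivalent but compare unequal byte-for-byte (a small self-contained demonstration):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Canonically equivalent, but different byte sequences: */
        const char *precomposed = "\xC3\xA9";  /* U+00E9 e WITH ACUTE */
        const char *decomposed  = "e\xCC\x81"; /* e + U+0301 COMBINING
                                                  ACUTE ACCENT */

        /* Byte comparison says they differ, even though a Unicode-aware
           comparison under canonical equivalence says they are equal. */
        printf("%d\n", strcmp(precomposed, decomposed)); /* nonzero */
        return 0;
    }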


True, but in the special case where your search string is composed entirely of 7-bit ASCII characters (the characters representable as 1-byte in UTF-8), like the example, shouldn't the character-equivalence logic still be easy?


Not unless you are explicitly told that it is composed entirely of 7-bit ASCII characters.
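
Though checking whether a pattern is pure 7-bit ASCII is itself cheap, so a tool could at least establish that much up front (a trivial sketch):

    #include <stdbool.h>
    #include <stddef.h>

    /* True if every byte is 7-bit ASCII, i.e. the pattern contains no
       multibyte sequences under any UTF-8 locale. */
    bool is_7bit_ascii(const unsigned char *s, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (s[i] & 0x80)
                return false;
        return true;
    }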

The UTF-8 standard used to allow non-conformant representations of ASCII characters. As http://www.schneier.com/crypto-gram-0008.html notes, this led to security problems. Now the standard says that you can't allow non-conformant representations of ASCII characters. And if you look at http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf and scroll to page 94, you'll find that you can't be said to be conformant unless you explicitly reject non-conforming input.

Therefore UTF-8 decoders cannot be considered conformant unless they actually examine each and every byte to verify that there is nothing dodgy.
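
The classic case is the overlong encoding of '/' (0xC0 0xAF), which lenient decoders accepted and which was used in directory-traversal attacks. A conformant decoder has to carry out roughly these checks (a sketch of the boundary conditions, not an exhaustively tested validator):

    #include <stddef.h>

    /* Decode one UTF-8 sequence; return its length, or 0 on any
       non-conformant input (overlong forms, surrogates, > U+10FFFF,
       bad continuation bytes). */
    size_t utf8_decode(const unsigned char *s, size_t len, unsigned *cp)
    {
        if (len == 0) return 0;
        unsigned c = s[0];
        size_t need;
        unsigned min;
        if (c < 0x80)      { *cp = c; return 1; }
        else if (c < 0xC2) return 0;  /* continuation or overlong lead */
        else if (c < 0xE0) { need = 2; min = 0x80;    c &= 0x1F; }
        else if (c < 0xF0) { need = 3; min = 0x800;   c &= 0x0F; }
        else if (c < 0xF5) { need = 4; min = 0x10000; c &= 0x07; }
        else return 0;                /* would exceed U+10FFFF */
        if (len < need) return 0;
        for (size_t i = 1; i < need; i++) {
            if ((s[i] & 0xC0) != 0x80) return 0;  /* bad continuation */
            c = (c << 6) | (s[i] & 0x3F);
        }
        if (c < min) return 0;                    /* overlong encoding */
        if (c >= 0xD800 && c <= 0xDFFF) return 0; /* UTF-16 surrogate */
        if (c > 0x10FFFF) return 0;
        *cp = c;
        return need;
    }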


I do not think that an accented 'e' is a 7-bit ASCII character. Even if it is, the text being searched may contain the character encoded as a multi-byte sequence.


That was just in the filename; the search string in the example is 'asdf'. And I believe UTF-8 is purposely designed so that 7-bit ASCII characters can't appear anywhere in the stream except when representing themselves (not even as part of multibyte characters); every other byte in a UTF-8 stream has the high-order bit set.

My guess is that p9idf has it right, and grep is just converting everything to wchar_t first, rather than trying to do any sort of clever searching directly on the UTF-8 byte stream.
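
The high-bit property is easy to sanity-check: since every byte of a multibyte sequence is >= 0x80, even a naive strstr() on the raw bytes can't produce a false hit for an ASCII needle (a quick illustration):

    #include <assert.h>
    #include <string.h>

    int main(void)
    {
        /* "café asdf" in UTF-8: the é is the two bytes C3 A9, both with
           the high bit set, so searching for the ASCII needle "asdf"
           cannot match inside the multibyte sequence. */
        const char *hay = "caf\xC3\xA9 asdf";
        assert(strstr(hay, "asdf") == hay + 6);
        return 0;
    }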


There's nothing to prevent this from being implemented, other than that it'd be a huge, hacky mess. It would require sidestepping iconv (or however grep does it) when LC_ALL is one of a specific set of strings, activating some special cases, and then, building on those, scanning the input pattern to ensure it is 7-bit (and the pattern may not be just a literal string: how would this work with character classes/ranges?), all to achieve the desired speedup.

Or, if your input data is sufficiently ASCII-ish, and so is your search pattern, why not just force the process locale to C and avoid the whole mess to begin with?
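
(For a one-off search that's just LC_ALL=C grep asdf file.) In-process the same switch looks like this: under the C locale MB_CUR_MAX is 1, so no multibyte decoding ever needs to happen (a sketch; the UTF-8 locale name is an assumption and availability varies by system):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Under the "C" locale each byte is its own character, so no
           multibyte decoding (and none of its cost) ever happens. */
        setlocale(LC_ALL, "C");
        printf("MB_CUR_MAX = %zu\n", (size_t)MB_CUR_MAX);  /* 1 */

        /* Under a UTF-8 locale (if installed) decoding kicks in. The
           locale name here is an assumption; it varies by system. */
        if (setlocale(LC_ALL, "en_US.UTF-8"))
            printf("MB_CUR_MAX = %zu\n", (size_t)MB_CUR_MAX);
            /* typically 4 or 6 */
        return 0;
    }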

I'm suddenly left wondering how the "." regex syntax functions in the face of surrogates when handling UTF-8.


Oops. I sort of guessed at what the example was about, and did not read it.

However, if you find four bytes 'asdf' in the input, you still have to check whether a combining mark follows the 'f'. For this example, that is simple, but I guess things get hairy for many regexes found in real life, such as ones containing even a single period.
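
Concretely, after a byte-level hit you'd have to decode the next code point and ask whether it combines with the 'f'. A sketch of that check (only the most common combining block is tested here; a real implementation would consult the Unicode character database):

    #include <stdbool.h>
    #include <stddef.h>

    /* After a byte-level match, peek at the following code point and
       reject the hit if it is a combining mark. Sketch: only checks
       U+0300-U+036F (Combining Diacritical Marks); the real property
       covers many more ranges. */
    bool followed_by_combining_mark(const unsigned char *p, size_t len)
    {
        if (len >= 2 && (p[0] & 0xE0) == 0xC0) {   /* 2-byte sequence */
            unsigned cp = ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
            return cp >= 0x0300 && cp <= 0x036F;
        }
        return false;  /* ASCII or longer sequences: not in this block */
    }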



