Yes, absolutely. I figure that rather than endless repetition of content-free crap, I'd prefer the very occasional revisit of something that's got serious content. In part I look forward to seeing how this particular submission fares.
Nice to see you register to make a contribution - it's an important question to ask, and I appreciate that you've taken the effort to do so (and I've upvoted you).
Do you think something has changed that makes it worth revisiting this post? Or do you think it's just a good post to revisit regularly?
I had not read it for a while and enjoyed reading it again. There were definitely some things I had forgotten, but I am wondering whether something has changed/happened that should change my reading of the thread.
Hmm, strange. One of the stated design features of UTF-8 was that Boyer-Moore could still be directly applied (http://www.ietf.org/rfc/rfc2279.txt). Is grep doing something unnecessarily suboptimal here, or is that RFC's comment correct but not the whole story?
The rumor is that GNU grep does a slow conversion of each input character into a wchar_t. I haven't personally read the code and verified it, but I consider the following sources reliable enough to believe it.
The problem is probably that deciding whether two Unicode characters (where one character may be composed of multiple code points) are equivalent in the general case is MUCH more involved than doing a simple byte comparison.
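To make that concrete, here is a minimal C sketch (my own illustration, nothing from grep) of why byte comparison falls short: the precomposed and decomposed forms of 'é' are canonically equivalent but byte-for-byte different.

    /* U+00E9 (precomposed "é") and U+0065 U+0301 ("e" plus a combining
     * acute accent) are canonically equivalent, but their UTF-8 byte
     * sequences differ, so a byte comparison says they are unequal. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char nfc[] = "\xC3\xA9";    /* U+00E9, precomposed form     */
        const char nfd[] = "e\xCC\x81";   /* U+0065 + U+0301, decomposed  */

        printf("byte lengths: %zu vs %zu\n", strlen(nfc), strlen(nfd));
        printf("byte comparison says %s\n",
               strcmp(nfc, nfd) == 0 ? "equal" : "different");
        /* A correct equality test has to normalize both strings first
         * (e.g., with a library such as ICU), which is far more work
         * than comparing bytes. */
        return 0;
    }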
True, but in the special case where your search string is composed entirely of 7-bit ASCII characters (the characters representable as 1-byte in UTF-8), like the example, shouldn't the character-equivalence logic still be easy?
Not unless you are explicitly told that it is composed entirely of 7-bit ASCII characters.
The UTF-8 standard used to tolerate non-shortest-form ("overlong") representations of ASCII characters. As http://www.schneier.com/crypto-gram-0008.html notes, this led to security problems. Now the standard says that you can't accept such representations, and if you look at http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf and scroll to page 94 you'll find that you can't be said to be conformant unless you explicitly reject non-conforming input.
Therefore UTF-8 decoders cannot be considered conformant unless they actually examine each and every byte to verify that there is nothing dodgy.
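For a sense of what that per-byte checking looks like, here is a hedged sketch of a validator (my own illustration, not grep's or anyone else's actual code) that rejects overlong forms such as 0xC0 0xAF, the non-shortest encoding of '/', along with surrogates and out-of-range lead bytes.

    #include <stdbool.h>
    #include <stddef.h>

    static bool is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

    bool utf8_valid(const unsigned char *s, size_t n) {
        for (size_t i = 0; i < n; ) {
            unsigned char b = s[i];
            if (b < 0x80) { i += 1; continue; }                 /* ASCII */
            if (b >= 0xC2 && b <= 0xDF) {                       /* 2 bytes */
                if (i + 1 >= n || !is_cont(s[i+1])) return false;
                i += 2; continue;
            }
            if (b >= 0xE0 && b <= 0xEF) {                       /* 3 bytes */
                if (i + 2 >= n || !is_cont(s[i+1]) || !is_cont(s[i+2]))
                    return false;
                if (b == 0xE0 && s[i+1] < 0xA0) return false;   /* overlong  */
                if (b == 0xED && s[i+1] > 0x9F) return false;   /* surrogate */
                i += 3; continue;
            }
            if (b >= 0xF0 && b <= 0xF4) {                       /* 4 bytes */
                if (i + 3 >= n || !is_cont(s[i+1]) ||
                    !is_cont(s[i+2]) || !is_cont(s[i+3]))
                    return false;
                if (b == 0xF0 && s[i+1] < 0x90) return false;   /* overlong   */
                if (b == 0xF4 && s[i+1] > 0x8F) return false;   /* > U+10FFFF */
                i += 4; continue;
            }
            return false;  /* 0x80-0xC1 and 0xF5-0xFF are never valid lead bytes */
        }
        return true;
    }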
I do not think that accented e is a 7-bit ASCII character. Even if it is, the text being searched may contain the character coded as a multi-byte sequence.
That was just in the filename; the search string in the example is 'asdf'. And UTF-8, I believe, is purposely designed so that 7-bit ASCII bytes can't appear anywhere in the stream except when representing themselves (even as part of multibyte characters): every other byte in a UTF-8 stream must have the high-order bit set.
My guess is that p9idf has it right, and grep is just converting everything to wchar_t first, rather than trying to do any sort of clever searching directly on the UTF-8 byte stream.
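Here's a tiny illustration of the property described above (the names are mine): because every byte of a multibyte UTF-8 character has its high bit set, a plain byte-wise strstr() for a pure-ASCII needle can never match in the middle of a multibyte character.

    #include <stdio.h>
    #include <string.h>

    static const char *ascii_search(const char *haystack, const char *needle) {
        return strstr(haystack, needle);  /* byte-wise search is already safe */
    }

    int main(void) {
        /* "café asdf" in UTF-8: the 'é' is the two bytes 0xC3 0xA9,
         * both of which have the high bit set. */
        const char *text = "caf\xC3\xA9 asdf";
        const char *hit = ascii_search(text, "asdf");
        printf("match at byte offset %ld\n", hit ? (long)(hit - text) : -1L);
        return 0;
    }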
There's nothing to prevent this from being implemented, other than that it'd be a huge hacky mess. It would require sidestepping iconv (or however grep does it) when LC_ALL is one of a specific set of strings, activating some special cases, and then, building on those special cases, additionally scanning the input pattern (which may not just be a literal string; how would this work with character classes/ranges?) to ensure it is 7-bit, all in order to achieve the desired speedup.
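A trivial sketch of that last "ensure it is 7-bit" check (the function name is mine, and it only handles a literal pattern; character classes and ranges are exactly the hard part):

    #include <stdbool.h>
    #include <stddef.h>

    bool pattern_is_7bit(const unsigned char *pat, size_t len) {
        for (size_t i = 0; i < len; i++)
            if (pat[i] & 0x80)
                return false;  /* any high bit means a multibyte character */
        return true;
    }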
Or, if your input data is sufficiently ASCII-ish and so is your search pattern, why not just force the process locale to C and avoid the whole mess to begin with?
I'm suddenly left wondering how the "." regex syntax functions in the face of surrogates when handling UTF-8.
Oops. I sort-of guessed at what the example was about, and did not read it.
However, if you find the four bytes 'asdf' in the input, you still have to check whether a combining mark follows the 'f'. For this example that is simple, but I guess things get hairy for many regexes found in real life, such as ones containing even a single period.
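A hedged sketch of that post-check (mine, not grep's): after a fast byte-level hit, peek at the byte that follows the match and fall back to a slow, locale-aware path if it could begin a combining mark.

    #include <stdbool.h>
    #include <stddef.h>

    /* Returns true if the byte after the match might begin a combining
     * mark (e.g., U+0301 encodes as 0xCC 0x81), in which case a real
     * matcher would re-verify the hit with full Unicode logic. */
    bool hit_needs_slow_path(const char *text, size_t textlen,
                             size_t hit_off, size_t needle_len) {
        size_t after = hit_off + needle_len;
        if (after >= textlen)
            return false;                  /* match runs to end of input */
        return ((unsigned char)text[after] & 0x80) != 0;
    }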
What version of grep are you using? Not too long ago the grep in Debian unstable was awful with UTF-8 strings. pg135.txt is Project Gutenberg's Les Misérables. What were your times?
I no longer notice this behavior:
    dfc@motherjones:~$ grep --version
    grep (GNU grep) 2.9

    dfc@motherjones:~$ time LANG=C grep asdf < pg135.txt > /dev/null
    real    0m0.017s
    user    0m0.008s
    sys     0m0.004s

    dfc@motherjones:~$ time LANG=UTF8 grep asdf < pg135.txt > /dev/null
    real    0m0.017s
    user    0m0.012s
    sys     0m0.004s

    dfc@motherjones:~$ time LANG=en_us.UTF8 grep asdf < pg135.txt > /dev/null
    real    0m0.012s
    user    0m0.004s
    sys     0m0.004s
There is not a lot of info about this in Debian bug 604408.
    ; grep --version
    GNU grep 2.5.3

    ; time LANG=C grep asdf < lesms10.txt > /dev/null
    real    0m0.025s
    user    0m0.011s
    sys     0m0.014s

    ; time /usr/local/plan9/bin/grep asdf < lesms10.txt > /dev/null
    real    0m0.082s
    user    0m0.043s
    sys     0m0.013s

    ; time LANG=en_US.UTF-8 grep adsf < lesms10.txt > /dev/null
    real    0m1.209s
    user    0m0.818s
    sys     0m0.018s
Those are the only two grep implementations I have handy. GNU grep 2.6.3 takes the same amount of time searching for 'asdf' in both locales, but searching for '.' is still slow. Thanks for pointing that out.
This is vaguely relevant to one of the recent discussions on HN about null-terminated strings: Boyer-Moore doesn't gain you these advantages with null-terminated strings unless you also know the length, for the same reason this email gives for not searching for \n up front: you can't skip any bytes during the search if you're looking for a 1-byte match.
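To make the dependence on the length explicit, here is a minimal Boyer-Moore-Horspool sketch (a simplified cousin of full Boyer-Moore, and not the code the email describes): the skip table is seeded with the pattern length m, so without knowing m you get no skips, and with m == 1 the skip can never exceed one byte.

    #include <stddef.h>
    #include <string.h>

    const char *bmh_search(const char *text, size_t n,
                           const char *pat, size_t m) {
        size_t skip[256];
        if (m == 0 || m > n) return NULL;
        for (size_t i = 0; i < 256; i++)
            skip[i] = m;                        /* default skip: whole pattern */
        for (size_t i = 0; i + 1 < m; i++)
            skip[(unsigned char)pat[i]] = m - 1 - i;
        for (size_t i = 0; i + m <= n; ) {
            if (memcmp(text + i, pat, m) == 0)
                return text + i;
            i += skip[(unsigned char)text[i + m - 1]];   /* jump ahead */
        }
        return NULL;
    }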
While that has been true for 70 years, it is ceasing to be true now. Today you can often make your program faster by having it do more computation in order to avoid communication, or in order to tolerate communication latency (which may itself involve doing more communication), or occasionally in order to tolerate node failures, so that you can take advantage of more nodes.
In the first case, you are doing less (communication) to go faster.
In the second case, you are doing less (computation) to go faster.
The fewer expensive operations you have to do (whatever they may be), the faster your program is. That has been true for the past 70 years, and will continue to be true for as long as computing remains in the slightest bit recognizable to men of our field as it currently stands.
In the first case (doing more computation to do less communication), you are sort of correct; but the amount of extra computation can be truly enormous, and the version of your program that does the least communication and computation is necessarily single-threaded, and therefore, for a parallelizable problem, slower.
In the other two cases (doing extra work to tolerate communication latency and node failures) I have searched long and hard for a way that your statement could be correct, but I cannot imagine what it could be.
You are very likely correct that it continued to be true as long as computing remained recognizable to the men (and women) of earlier generations. But in a world of multiprocessor microcontrollers, it is no longer true.
Yes, but that definition you quote is for "man" (often written as the more archaic "mankind"), not "men". You're actually using definition a.1: "an individual human; especially : an adult male human", in the plural, to refer to adult male humans of our field. Admittedly this is a pedantic point; I wouldn't bother if it didn't contribute to a real-world problem.
Now take your self-righteous offtopicness and shove off. I contribute nothing to the problems of society by exercising freedom of vocabulary. You want to see real problems? Go look at pictures of the shelled remains of children in Syria. Your suggestion that I am contributing to a "real-world problem" is insulting.
Even if we want to consider the furthering of systemic sexism through benign use of vocabulary a "real-world problem", you are more to blame than I by trying to convert an innocent phrase into an example of hatred. Words only have the power that you allow them to. I accept no responsibility for your oversensitivity.
Meh, it's a smaller problem than most, but it's also far easier to solve than most. Why avoid making the world a tiny bit better if it costs you nothing? If making minor tweaks to our word choices could avoid violence to Syrian children, too, that'd be grand, but alas...
This is not exactly a response to you, BTW (since I can't imagine you'd be convinced) but for other readers.
I think that for the sake of backwards compatibility with the linguistic upbringing of most English speakers, we should use "men" to refer to both human genders.
If this means that male gendered individuals will have to be pluralized as "males" and never "men", then that's something we should live with.
Does the claim make sense on newer processors, which switch to a low-power C-state when halted? Waking them up can consume a lot of time (on the order of milliseconds).
To the observation that this has been submitted before: agreed. Contributions to those previous submissions are now closed, however, because they are sufficiently far in the past, and I think this subject is worth revisiting occasionally.
FWIW: I've upvoted the comments about earlier submissions - I'd like to reward such behavior. Not least, searching for previous discussions often unearths genuinely useful material and should be encouraged.
Yup, exactly so. To quote a slightly different bit of it:
    ... banal comments, old material being recycled,
    and general lack of interesting stuff.
In short, the vast majority of repeated material is drivel. Fluff. Content-free. I'm trying to restore the balance a little with more technical, challenging and deep material.
This is probably controversial, but it's my last-ditch effort to actually get some technical material back into Hacker News.
I honestly don't expect it will succeed, but then I can finally depart with a completely clear conscience that I tried my best and found that the so-called "Hacker News" no longer contains much I care about.
I hope I'm wrong, but I had to try the experiment.
random (drivel) hack idea ... create hybrid pages with iframes that combine a meaty technical article (e.g., a Russ Cox blog post) with a fluffy tech gossip article, and post these hybrid articles to HN. E.g., "Fast string matching with DFAs / new music social network for your pets launch party pics!"
My guess: modern CPUs have hardware support for pausing user-level code and switching to kernel mode to service a page fault, but the syscall interface requires that the transition to kernel mode be done "manually", as it were.
Also, if you're reading from a file with lseek() and read(), you have to marshal data into the arguments the kernel expects, make the syscall, and then the kernel copies the data into the user-space buffer you provide. If you're reading from a file with mmap(), the page cache just appears in your address space, with no copying necessary.
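A hedged sketch of that mmap() path (an illustration, not grep's source): map the whole file and run the search over the mapping directly, so nothing gets copied through an intermediate buffer. memmem() is a GNU extension, error handling is minimal, and the file is assumed to be non-empty.

    #define _GNU_SOURCE            /* for memmem() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s pattern file\n", argv[0]);
            return 2;
        }

        int fd = open(argv[2], O_RDONLY);
        if (fd < 0) { perror("open"); return 2; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 2; }

        /* The page cache simply appears in our address space. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 2; }

        const char *hit = memmem(data, st.st_size, argv[1], strlen(argv[1]));
        if (hit)
            printf("found at offset %ld\n", (long)(hit - data));
        else
            printf("not found\n");

        munmap(data, st.st_size);
        close(fd);
        return hit ? 0 : 1;
    }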
On the other hand, I was curious why the default had changed from "use mmap" to "don't use mmap". Turns out, the grep 2.6.3 documentation states:
--mmap: This option is ignored for backwards compatibility. It used to read input with the `mmap' system call, instead of the default `read' system call. On modern systems, `--mmap' rarely if ever yields better performance.
That could explain why using mmap is faster than read in general, but what about skipping bytes? It sounds to me like, if the seek distance is less than a page size, the kernel would still need to fill the entire page with data.
Then again, hard drives return minimum-sized blocks of data anyway, sized in kilobytes. So I'm guessing the "skipping" is a performance benefit only as far as user space is concerned.
One would really have to grep a long string to be able to skip more than a block/page!
Never to be discussed without bringing up The Treacherous Optimization - http://ridiculousfish.com/blog/archives/2006/05/30/old-age-a...
Edit: and TTO's discussion - http://news.ycombinator.com/item?id=1627367