Yes, absolutely. I figure that rather than endless repetition of content-free crap, I'd prefer the very occasional revisit of something that's got serious content. In part I look forward to seeing how this particular submission fares.
Nice to see you register to make a contribution - it's an important question to ask, and I appreciate that you've taken the effort to do so (and I've upvoted you).
Do you think something has changed that makes it worth revisiting this post? Or do you think it's just a good post to revisit regularly?
I had not read it for a while and enjoyed reading it again. There were definitely some things I had forgotten, but I am wondering whether something has changed/happened that should change my reading of the thread.
Hmm, strange. One of the stated design features of UTF-8 was that Boyer-Moore could still be directly applied (http://www.ietf.org/rfc/rfc2279.txt). Is grep doing something unnecessarily suboptimal here, or is that RFC's comment correct but not the whole story?
The rumor is that GNU grep does a slow conversion of each input character into a wchar_t. I haven't personally read the code and verified it, but I consider the following sources reliable enough to believe it.
The problem is probably that deciding whether two Unicode characters (where one character may be composed of multiple code points) are equivalent in the general case is MUCH more involved than doing a simple byte comparison.
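To make that concrete, here is a minimal C sketch (my own illustration, nothing from grep) of why byte comparison falls short: the precomposed and decomposed forms of 'é' are canonically equivalent but byte-for-byte different.

    /* U+00E9 (precomposed "é") and U+0065 U+0301 ("e" plus a combining
     * acute accent) are canonically equivalent, but their UTF-8 byte
     * sequences differ, so a byte comparison says they are unequal. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char nfc[] = "\xC3\xA9";    /* U+00E9, precomposed form     */
        const char nfd[] = "e\xCC\x81";   /* U+0065 + U+0301, decomposed  */

        printf("byte lengths: %zu vs %zu\n", strlen(nfc), strlen(nfd));
        printf("byte comparison says %s\n",
               strcmp(nfc, nfd) == 0 ? "equal" : "different");
        /* A correct equality test has to normalize both strings first
         * (e.g., with a library such as ICU), which is far more work
         * than comparing bytes. */
        return 0;
    }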
True, but in the special case where your search string is composed entirely of 7-bit ASCII characters (the characters representable as 1-byte in UTF-8), like the example, shouldn't the character-equivalence logic still be easy?
Not unless you are explicitly told that it is composed entirely of 7-bit ASCII characters.
The UTF-8 standard used to tolerate non-shortest-form ("overlong") representations of ASCII characters. As http://www.schneier.com/crypto-gram-0008.html notes, this led to security problems. Now the standard says that you can't accept such representations, and if you look at http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf and scroll to page 94 you'll find that you can't be said to be conformant unless you explicitly reject non-conforming input.
Therefore UTF-8 decoders cannot be considered conformant unless they actually examine each and every byte to verify that there is nothing dodgy.
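For a sense of what that per-byte checking looks like, here is a hedged sketch of a validator (my own illustration, not grep's or anyone else's actual code) that rejects overlong forms such as 0xC0 0xAF, the non-shortest encoding of '/', along with surrogates and out-of-range lead bytes.

    #include <stdbool.h>
    #include <stddef.h>

    static bool is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

    bool utf8_valid(const unsigned char *s, size_t n) {
        for (size_t i = 0; i < n; ) {
            unsigned char b = s[i];
            if (b < 0x80) { i += 1; continue; }                 /* ASCII */
            if (b >= 0xC2 && b <= 0xDF) {                       /* 2 bytes */
                if (i + 1 >= n || !is_cont(s[i+1])) return false;
                i += 2; continue;
            }
            if (b >= 0xE0 && b <= 0xEF) {                       /* 3 bytes */
                if (i + 2 >= n || !is_cont(s[i+1]) || !is_cont(s[i+2]))
                    return false;
                if (b == 0xE0 && s[i+1] < 0xA0) return false;   /* overlong  */
                if (b == 0xED && s[i+1] > 0x9F) return false;   /* surrogate */
                i += 3; continue;
            }
            if (b >= 0xF0 && b <= 0xF4) {                       /* 4 bytes */
                if (i + 3 >= n || !is_cont(s[i+1]) ||
                    !is_cont(s[i+2]) || !is_cont(s[i+3]))
                    return false;
                if (b == 0xF0 && s[i+1] < 0x90) return false;   /* overlong   */
                if (b == 0xF4 && s[i+1] > 0x8F) return false;   /* > U+10FFFF */
                i += 4; continue;
            }
            return false;  /* 0x80-0xC1 and 0xF5-0xFF are never valid lead bytes */
        }
        return true;
    }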
I do not think that accented e is a 7-bit ASCII character. Even if it is, the text being searched may contain the character coded as a multi-byte sequence.
That was just in the filename; the search string in the example is 'asdf'. And UTF-8, I believe, is purposely designed so that 7-bit ASCII bytes can't appear anywhere in the stream except when representing themselves (even as part of multibyte characters): every other byte in a UTF-8 stream must have the high-order bit set.
My guess is that p9idf has it right, and grep is just converting everything to wchar_t first, rather than trying to do any sort of clever searching directly on the UTF-8 byte stream.
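Here's a tiny illustration of the property described above (the names are mine): because every byte of a multibyte UTF-8 character has its high bit set, a plain byte-wise strstr() for a pure-ASCII needle can never match in the middle of a multibyte character.

    #include <stdio.h>
    #include <string.h>

    static const char *ascii_search(const char *haystack, const char *needle) {
        return strstr(haystack, needle);  /* byte-wise search is already safe */
    }

    int main(void) {
        /* "café asdf" in UTF-8: the 'é' is the two bytes 0xC3 0xA9,
         * both of which have the high bit set. */
        const char *text = "caf\xC3\xA9 asdf";
        const char *hit = ascii_search(text, "asdf");
        printf("match at byte offset %ld\n", hit ? (long)(hit - text) : -1L);
        return 0;
    }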
There's nothing to prevent this from being implemented, other than that it'd be a huge hacky mess. It would require sidestepping iconv (or however grep does it) when LC_ALL is one of a specific set of strings, activating some special cases, and then, building on those special cases, additionally scanning the input pattern (which may not just be a literal string; how would this work with character classes/ranges?) to ensure it is 7-bit, all in order to achieve the desired speedup.
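A trivial sketch of that last "ensure it is 7-bit" check (the function name is mine, and it only handles a literal pattern; character classes and ranges are exactly the hard part):

    #include <stdbool.h>
    #include <stddef.h>

    bool pattern_is_7bit(const unsigned char *pat, size_t len) {
        for (size_t i = 0; i < len; i++)
            if (pat[i] & 0x80)
                return false;  /* any high bit means a multibyte character */
        return true;
    }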
Or, if your input data is sufficiently ASCII-ish and so is your search pattern, why not just force the process locale to C and avoid the whole mess to begin with?
I'm suddenly left wondering how the "." regex syntax functions in the face of surrogates when handling UTF-8.
Oops. I sort-of guessed at what the example was about, and did not read it.
However, if you find the four bytes 'asdf' in the input, you still have to check whether a combining mark follows the 'f'. For this example that is simple, but I guess things get hairy for many regexes found in real life, such as ones containing even a single period.
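A hedged sketch of that post-check (mine, not grep's): after a fast byte-level hit, peek at the byte that follows the match and fall back to a slow, locale-aware path if it could begin a combining mark.

    #include <stdbool.h>
    #include <stddef.h>

    /* Returns true if the byte after the match might begin a combining
     * mark (e.g., U+0301 encodes as 0xCC 0x81), in which case a real
     * matcher would re-verify the hit with full Unicode logic. */
    bool hit_needs_slow_path(const char *text, size_t textlen,
                             size_t hit_off, size_t needle_len) {
        size_t after = hit_off + needle_len;
        if (after >= textlen)
            return false;                  /* match runs to end of input */
        return ((unsigned char)text[after] & 0x80) != 0;
    }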
What version of grep are you using? Not too long ago the grep in Debian unstable was awful with UTF-8 strings. pg135.txt is Project Gutenberg's Les Misérables. What were your times?
I no longer notice this behavior:
    dfc@motherjones:~$ grep --version
    grep (GNU grep) 2.9

    dfc@motherjones:~$ time LANG=C grep asdf < pg135.txt > /dev/null
    real    0m0.017s
    user    0m0.008s
    sys     0m0.004s

    dfc@motherjones:~$ time LANG=UTF8 grep asdf < pg135.txt > /dev/null
    real    0m0.017s
    user    0m0.012s
    sys     0m0.004s

    dfc@motherjones:~$ time LANG=en_us.UTF8 grep asdf < pg135.txt > /dev/null
    real    0m0.012s
    user    0m0.004s
    sys     0m0.004s
There is not a lot of info about this in Debian bug 604408.
    ; grep --version
    GNU grep 2.5.3

    ; time LANG=C grep asdf < lesms10.txt > /dev/null
    real    0m0.025s
    user    0m0.011s
    sys     0m0.014s

    ; time /usr/local/plan9/bin/grep asdf < lesms10.txt > /dev/null
    real    0m0.082s
    user    0m0.043s
    sys     0m0.013s

    ; time LANG=en_US.UTF-8 grep adsf < lesms10.txt > /dev/null
    real    0m1.209s
    user    0m0.818s
    sys     0m0.018s
Those are the only two grep implementations I have handy. GNU grep 2.6.3 takes the same amount of time searching for 'asdf' in both locales, but searching for '.' is still slow. Thanks for pointing that out.
This is vaguely relevant to one of the recent discussions on HN about null-terminated strings: Boyer-Moore doesn't gain you these advantages with null-terminated strings unless you also know the length, for the same reason this email gives for not searching for \n up front: you can't skip any bytes during the search if you're looking for a 1-byte match.
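To make the dependence on the length explicit, here is a minimal Boyer-Moore-Horspool sketch (a simplified cousin of full Boyer-Moore, and not the code the email describes): the skip table is seeded with the pattern length m, so without knowing m you get no skips, and with m == 1 the skip can never exceed one byte.

    #include <stddef.h>
    #include <string.h>

    const char *bmh_search(const char *text, size_t n,
                           const char *pat, size_t m) {
        size_t skip[256];
        if (m == 0 || m > n) return NULL;
        for (size_t i = 0; i < 256; i++)
            skip[i] = m;                        /* default skip: whole pattern */
        for (size_t i = 0; i + 1 < m; i++)
            skip[(unsigned char)pat[i]] = m - 1 - i;
        for (size_t i = 0; i + m <= n; ) {
            if (memcmp(text + i, pat, m) == 0)
                return text + i;
            i += skip[(unsigned char)text[i + m - 1]];   /* jump ahead */
        }
        return NULL;
    }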
While that has been true for 70 years, it is ceasing to be true now. Today you can often make your program faster by having it do more computation in order to avoid communication, or in order to tolerate communication latency (which may itself involve doing more communication), or occasionally in order to tolerate node failures, so that you can take advantage of more nodes.
In the first case, you are doing less (communication) to go faster.
In the second case, you are doing less (computation) to go faster.
The fewer expensive operations you have to do (whatever they may be), the faster your program is. That has been true for the past 70 years, and will continue to be true for as long as computing remains in the slightest bit recognizable to men of our field as it currently stands.
In the first case (doing more computation to do less communication), you are sort of correct; but the amount of extra computation can be truly enormous, and the version of your program that does the least communication and computation is necessarily single-threaded, and therefore, for a parallelizable problem, slower.
In the other two cases (doing extra work to tolerate communication latency and node failures) I have searched long and hard for a way that your statement could be correct, but I cannot imagine what it could be.
You are very likely correct that it continued to be true as long as computing remained recognizable to the men (and women) of earlier generations. But in a world of multiprocessor microcontrollers, it is no longer true.
Yes, but that definition you quote is for "man" (often written as the more archaic "mankind"), not "men". You're actually using definition a.1: "an individual human; especially : an adult male human", in the plural, to refer to adult male humans of our field. Admittedly this is a pedantic point; I wouldn't bother if it didn't contribute to a real-world problem.
Now take your self-righteous offtopicness and shove off. I contribute nothing to the problems of society by exercising freedom of vocabulary. You want to see real problems? Go look at pictures of the shelled remains of children in Syria. Your suggestion that I am contributing to a "real-world problem" is insulting.
Even if we want to consider the furthering of systemic sexism through benign use of vocabulary a "real-world problem", you are more to blame than I by trying to convert an innocent phrase into an example of hatred. Words only have the power that you allow them to. I accept no responsibility for your oversensitivity.
Meh, it's a smaller problem than most, but it's also far easier to solve than most. Why avoid making the world a tiny bit better if it costs you nothing? If making minor tweaks to our word choices could avoid violence to Syrian children, too, that'd be grand, but alas...
This is not exactly a response to you, BTW (since I can't imagine you'd be convinced) but for other readers.
I think that for the sake of backwards compatibility with the linguistic upbringing of most English speakers, we should use "men" to refer to both human genders.
If this means that male gendered individuals will have to be pluralized as "males" and never "men", then that's something we should live with.
Does the claim make sense on newer processors, which switch to a low-power C-state when halted? Waking them up can consume a lot of time (on the order of milliseconds).
To the observation that this has been submitted before: agreed. Contributions to those previous submissions are now closed, however, because they are sufficiently far in the past, and I think this subject is worth revisiting occasionally.
FWIW: I've upvoted the comments about earlier submissions - I'd like to reward such behavior. Not least, searching for previous discussions often unearths genuinely useful material and should be encouraged.
Yup, exactly so. To quote a slightly different bit of it:
    ... banal comments, old material being recycled,
    and general lack of interesting stuff.
In short, the vast majority of repeated material is drivel. Fluff. Content-free. I'm trying to restore the balance a little with more technical, challenging and deep material.
This is probably controversial, but it's my last-ditch effort to actually get some technical material back into Hacker News.
I honestly don't expect it will succeed, but then I can finally depart with a completely clear conscience that I tried my best and found that the so-called "Hacker News" no longer contains much I care about.
I hope I'm wrong, but I had to try the experiment.
random (drivel) hack idea ... create hybrid pages with iframes that combine a meaty technical article (e.g., a Russ Cox blog post) with a fluffy tech gossip article, and post these hybrid articles to HN. E.g., "Fast string matching with DFAs / new music social network for your pets launch party pics!"
My guess: modern CPUs have hardware support for pausing user-level code and switching to kernel mode to service a page fault, but the syscall interface requires that the transition to kernel mode be done "manually", as it were.
Also, if you're reading from a file with lseek() and read(), you have to marshal data into the arguments the kernel expects, make the syscall, and then the kernel copies the data into the user-space buffer you provide. If you're reading from a file with mmap(), the page cache just appears in your address space, with no copying necessary.
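A hedged sketch of that mmap() path (an illustration, not grep's source): map the whole file and run the search over the mapping directly, so nothing gets copied through an intermediate buffer. memmem() is a GNU extension, error handling is minimal, and the file is assumed to be non-empty.

    #define _GNU_SOURCE            /* for memmem() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s pattern file\n", argv[0]);
            return 2;
        }

        int fd = open(argv[2], O_RDONLY);
        if (fd < 0) { perror("open"); return 2; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 2; }

        /* The page cache simply appears in our address space. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 2; }

        const char *hit = memmem(data, st.st_size, argv[1], strlen(argv[1]));
        if (hit)
            printf("found at offset %ld\n", (long)(hit - data));
        else
            printf("not found\n");

        munmap(data, st.st_size);
        close(fd);
        return hit ? 0 : 1;
    }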
On the other hand, I was curious why the default had changed from "use mmap" to "don't use mmap". Turns out, the grep 2.6.3 documentation states:
--mmap: This option is ignored for backwards compatibility. It used to read input with the `mmap' system call, instead of the default `read' system call. On modern systems, `--mmap' rarely if ever yields better performance.
That could explain why using mmap is faster than read in general, but what about skipping bytes? It sounds to me like, if the seek distance is less than a page size, the kernel would still need to fill the entire page with data.
Then again, hard drives return minimum-sized blocks of data anyway, sized in kilobytes. So I'm guessing the "skipping" is a performance benefit only as far as user space is concerned.
One would really have to grep a long string to be able to skip more than a block/page!
Never to be discussed without bringing up The Treacherous Optimization - http://ridiculousfish.com/blog/archives/2006/05/30/old-age-a...
Edit: and TTO's discussion - http://news.ycombinator.com/item?id=1627367