
It sounds like someone read Russ Cox's work on regexp.

Russ Cox? You know that regular expressions have existed since Kleene's first proposals, and everyone with a CS degree knows that you can compile them to deterministic finite state automata. Right?



Sadly, not everybody is familiar with automata theory. Furthermore, I'm sure he knows; his link is about the implementation, which is non-trivial and non-obvious.

I'm struggling to understand your negativity here.


> I'm struggling to understand your negativity here.

Sorry, that was indeed a bit harsh. I wanted to point out that Bram Moolenaar is probably well aware of DFAs, given how well-known they are and how many implementations are available. I don't think Russ Cox has much to do with it; rather, it is driven by demand: linear-time matching is attractive for syntax highlighting.


You know that he didn't say Cox invented regular expressions, but that the Vim implementation might be influenced by Cox's advocacy on the matter. Right?


And that's why the RE2 implementation has existed equally long!

Oh, wait.


There are many implementations of the standard operations on finite state automata (determinization, minimization, intersection, concatenation, etc.) and nearly as many implementations of regular expressions[1]. Anyone in the trade (of text processing), including Bram Moolenaar, will be aware that regular languages can be represented as finite state automata.

However, the problem is that some regular expression 'dialects' (most notably Perl regexes) do not describe regular languages and cannot be represented as finite state automata. For instance, the introduction of back-references makes regular expressions non-regular.
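To make the back-reference point concrete, here is a small Python sketch (Python chosen purely for illustration, not anything from the thread): the pattern matches any word followed by an exact repeat of itself, a language no finite automaton can recognize.

```python
import re

# A back-reference: \1 must match exactly what group 1 captured.
# Recognizing "w w" for arbitrary words w requires remembering an
# unboundedly long captured string, so the language is not regular
# and cannot be compiled to a finite state automaton.
doubled = re.compile(r"^(\w+) \1$")

print(bool(doubled.match("hello hello")))  # True
print(bool(doubled.match("hello world")))  # False
```

This is exactly the feature class that DFA-based engines such as RE2 refuse to support.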

So, it's not so much being enlightened by Russ Cox as a trade-off between the convenience of some of the Perl extensions and the ability to match in linear time. Also, for complex expressions, constructing the automaton can be slow.
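The linear-time side of that trade-off is easy to demonstrate with a backtracking engine such as Python's `re` (an illustrative worst case, not specific to Vim or Perl):

```python
import re
import time

# Classic pathological case for a backtracking matcher: '(a+)+b' against
# a string containing only 'a's. The engine tries every way of splitting
# the run of 'a's between the inner and outer '+' before giving up, so
# the work roughly doubles with each extra character. A Thompson-style
# NFA/DFA engine rejects the same input in time linear in its length.
for n in range(16, 23, 2):
    start = time.perf_counter()
    assert re.search(r"(a+)+b", "a" * n) is None  # no 'b' present
    print(f"n={n}: {time.perf_counter() - start:.4f} s")
```

Watching the timings roughly double per step is the whole argument for linear-time matching in one loop.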

[1] A small selection:

http://www.brics.dk/automaton/

http://www.let.rug.nl/~vannoord/Fsa/

http://www.complang.org/ragel/

http://open.xerox.com/Services/fst-nlp-tools

Some of the packages can also be used to construct transducers, weighted automata, etc.


Cox's work builds on Thompson's work from the 60s. He says so himself.


The basic implementation methods are well-known and quasi-standard, but Cox did make some useful practical additions, in addition to writing up a clear and accessible exposition.

The classic DFA engines, like in Thompson's version of grep, provide a fairly minimal regex language (later standardized as POSIX regexes), not much fancier than the regular-language syntax used in textbooks. Perl-style regular expressions get such widespread usage even outside of Perl in part because they have a more full-featured syntax, which provides conveniences people find useful. Cox implemented a significant subset of this functionality as syntactic sugar on top of a classic automaton regex engine, so RE2, unlike POSIX regexes, supports named matches, named character classes like \d (plus Unicode character classes), non-greedy matches, etc. Basically everything that can be done without making the language nonregular, so excluding lookahead, lookbehind, and backreferences.

Since vim regexes (a custom dialect) support some of those supra-POSIX features, those implementation techniques were likely useful for them to look at. Indeed they link to Cox's article somewhere or other in the design docs.

Vim does have lookahead/lookbehind syntax, though; I wonder how they supported that. Fallback to the backtracking engine?


Larry Wall does not have a CS degree.


But his degree did involve computers: he got his undergraduate degree in natural and artificial languages.


I assume a degree with that title would lean towards the formal side of linguistics, which means you would certainly cover things like the Chomsky hierarchy in detail; after all, linguistics is where the Chomsky hierarchy (though not regular languages) originated.


In this interview[1] he states that his degree is one that he designed himself (which his college allowed), and that he used his programming classes and experience as the second of two foreign languages he needed for his linguistics degree (the first being Classical Greek). He calls this a "hack".

He talks about schooling after his undergrad, but doesn't go into what that was.

[1] http://www.youtube.com/watch?v=aNAtbYSxzuA


He was doing Linguistics at UC Berkeley, too.


I got mine in Linguistics, so I have no problem with him for his degree, just Perl's "regular" expressions.


I believe it's Ken Thompson who first implemented them using a DFA.



