
It sounds like someone read Russ Cox's work on regexp.

Russ Cox? You know that regular expressions have existed since Kleene's first proposals, and everyone with a CS degree knows that you can compile them to deterministic finite state automata. Right?



Sadly, not everybody is familiar with automata theory. Furthermore, I'm sure he knows; his link is about the implementation, which is non-trivial and non-obvious.

I'm struggling to understand your negativity here.


> I'm struggling to understand your negativity here.

Sorry, that was indeed a bit harsh. I wanted to point out that Bram Moolenaar is probably well aware of DFAs, given how well-known they are and how many implementations are available. I don't think Russ Cox has much to do with it; rather, it is driven by demand: linear-time matching is attractive for syntax highlighting.


You know that he didn't say Cox invented regular expressions, but that the Vim implementation might be influenced by Cox's advocacy on the matter. Right?


And that's why the RE2 implementation has existed equally long!

Oh, wait.


There are many implementations of the standard operations on finite state automata (determinization, minimization, intersection, concatenation, etc.) and nearly as many implementations of regular expressions[1]. Anyone in the trade (of text processing), including Bram Moolenaar, will be aware that regular languages can be represented as finite state automata.

However, the problem is that some regular expression 'dialects' (most notably Perl regexes) do not describe regular languages and cannot be represented as finite state automata. For instance, the introduction of back-references makes regular expressions non-regular.
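To make the back-reference point concrete, here is a small Python sketch (Python chosen purely for illustration, not anything from the thread): the pattern matches any word followed by an exact repeat of itself, a language no finite automaton can recognize.

```python
import re

# A back-reference: \1 must match exactly what group 1 captured.
# Recognizing "w w" for arbitrary words w requires remembering an
# unboundedly long captured string, so the language is not regular
# and cannot be compiled to a finite state automaton.
doubled = re.compile(r"^(\w+) \1$")

print(bool(doubled.match("hello hello")))  # True
print(bool(doubled.match("hello world")))  # False
```

This is exactly the feature class that DFA-based engines such as RE2 refuse to support.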

So, it's not so much being enlightened by Russ Cox as a trade-off between the convenience of some of the Perl extensions and the ability to match in linear time. Also, for complex expressions, constructing the automaton can be slow.
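The linear-time side of that trade-off is easy to demonstrate with a backtracking engine such as Python's `re` (an illustrative worst case, not specific to Vim or Perl):

```python
import re
import time

# Classic pathological case for a backtracking matcher: '(a+)+b' against
# a string containing only 'a's. The engine tries every way of splitting
# the run of 'a's between the inner and outer '+' before giving up, so
# the work roughly doubles with each extra character. A Thompson-style
# NFA/DFA engine rejects the same input in time linear in its length.
for n in range(16, 23, 2):
    start = time.perf_counter()
    assert re.search(r"(a+)+b", "a" * n) is None  # no 'b' present
    print(f"n={n}: {time.perf_counter() - start:.4f} s")
```

Watching the timings roughly double per step is the whole argument for linear-time matching in one loop.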

[1] A small selection:

http://www.brics.dk/automaton/

http://www.let.rug.nl/~vannoord/Fsa/

http://www.complang.org/ragel/

http://open.xerox.com/Services/fst-nlp-tools

Some of the packages can also be used to construct transducers, weighted automata, etc.


Cox's work builds on Thompson's work from the 60s. He says so himself.


The basic implementation methods are well-known and quasi-standard, but Cox did make some useful practical additions, in addition to writing up a clear and accessible exposition.

The classic DFA engines, like in Thompson's version of grep, provide a fairly minimal regex language (later standardized as POSIX regexes), not much fancier than the regular-language syntax used in textbooks. Perl-style regular expressions get such widespread usage even outside of Perl in part because they have a more full-featured syntax, which provides conveniences people find useful. Cox implemented a significant subset of this functionality as syntactic sugar on top of a classic automaton regex engine, so RE2, unlike POSIX regexes, supports named matches, named character classes like \d (plus Unicode character classes), non-greedy matches, etc. Basically everything that can be done without making the language nonregular, so excluding lookahead, lookbehind, and backreferences.

Since vim regexes (a custom dialect) support some of those supra-POSIX features, those implementation techniques were likely useful for them to look at. Indeed they link to Cox's article somewhere or other in the design docs.

Vim does have lookahead/lookbehind syntax, though; I wonder how they supported that. Fallback to the backtracking engine?


Larry Wall does not have a CS degree.


But his degree did involve computers: he got his undergraduate degree in natural and artificial languages.


I assume a degree with that title would lean towards the formal side of linguistics, which means you would certainly cover things like the Chomsky hierarchy in detail; after all, linguistics is where the Chomsky hierarchy (though not regular languages) originated.


In this interview[1] he states that his degree is one that he designed himself (which his college allowed), and that he used his programming classes and experience as the second of two foreign languages he needed for his linguistics degree (the first being Classical Greek). He calls this a "hack".

He talks about schooling after his undergrad, but doesn't go into what that was.

[1] http://www.youtube.com/watch?v=aNAtbYSxzuA


He was doing Linguistics at UC Berkeley, too.


I got mine in Linguistics, so I have no problem with him for his degree, just Perl's "regular" expressions.


I believe it's Ken Thompson who first implemented them using a DFA.



