Errr.. Did he just write an HTML parser (and then use it) to prove everyone tha...

JonnieCache · on July 8, 2011

Yeah, this is pretty much what seems to be going on.

I don't think anyone has ever said that regexes can't be used in the parsing of HTML, just that they can't be used for the parsing of HTML. It's like someone saying "You can't use bricks to keep warm!" and someone else countering with "Observe! I have built a house using, among other things, bricks, and it shelters me from the weather and keeps me warm!" That's deliberately missing the point in order to show off your housebuilding skills.

Still, it was an informative and well-written article. Article, rather than an answer to a question. Why do people write these thousands of words on Stack Overflow when they could publish them on their blogs? Are SO points from confused corporate employees really worth more than the adulation of the blogosphere? Actually, who cares. I almost lost the will to live just writing that sentence...

statictype · on July 8, 2011

I'd rather see it on Stack Overflow. It has much more visibility there than if each person wrote it on their own blog (assuming they have one), which may not be seen by more than a handful of friends.

iron_ball · on July 8, 2011

Look at all the people on HN saying they want to hire based on SO rep and a GitHub account.

Goladus · on July 8, 2011

Actually I'm more interested in the answer to this question:

Was he able to write an arbitrary HTML parser using regexes because HTML is a regular language, or because Perl's string matching syntax, which can match some languages that are not regular, is called "regular expressions" anyway?

I am pretty sure that, for example, an arbitrary number of balanced "<div></div>" pairs do not form a regular language; however, you can use modern PCRE to match them anyway.

amalcon · on July 8, 2011

If I'm understanding it correctly (which I'm not convinced of; this is much more advanced Perl than I'm used to seeing), neither is quite accurate.

The first option is easy to rule out: HTML is not a regular language. Evidence: S-Expressions are not a regular language. XML is isomorphic to S-expressions of the form (<tagname> ((<attr1> <value1>) (<attr2> <value2>) ...) <contents>). HTML is isomorphic to XHTML, which is XML.[1]
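To make the claimed correspondence concrete, here is a rough sketch of one direction of it, assuming nodes are written as Python tuples of the form (tagname, [(attr, value), ...], *contents). The function name `sexpr_to_xml` and the tuple layout are made up for illustration; this shows the mapping only and proves nothing about regularity.

```python
from xml.sax.saxutils import escape, quoteattr

def sexpr_to_xml(node):
    """Serialize a (tagname, attrs, *contents) tuple as XML text."""
    if isinstance(node, str):
        # Plain strings are text content; escape &, <, and >.
        return escape(node)
    tag, attrs, *contents = node
    attr_text = "".join(f" {name}={quoteattr(value)}" for name, value in attrs)
    body = "".join(sexpr_to_xml(child) for child in contents)
    return f"<{tag}{attr_text}>{body}</{tag}>"
```

For example, `sexpr_to_xml(("a", [("href", "x")], "hi", ("b", [], "there")))` walks the nested tuples recursively, one XML element per tuple.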

I can't (read: don't feel like trying to) rigorously disprove that Perl's "regular expression" facility could be an HTML parser all by itself, but it looks like what's going on here is closer to a regex-based tokenizer (certainly possible, and in fact this is a very common pattern). Then, regular Perl flow control constructs are used to interpret the tokens as tags and such.
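A minimal sketch of that tokenizer-plus-flow-control pattern, in Python rather than Perl and not taken from the linked answer. The `TOKEN` pattern and `parse` function are hypothetical, and the subset handled is tiny: bare opening tags, closing tags, and text, with no attributes, comments, or self-closing tags.

```python
import re

# Hypothetical token pattern: an opening tag, a closing tag, or a run of text.
TOKEN = re.compile(
    r"<(?P<close>/)?(?P<name>[a-zA-Z][a-zA-Z0-9]*)\s*>|(?P<text>[^<]+)"
)

def parse(markup):
    """Build a nested [name, children] tree from a tiny HTML-like subset."""
    root = ["#root", []]
    stack = [root]  # the stack, not the regex, keeps track of nesting
    pos = 0
    while pos < len(markup):
        m = TOKEN.match(markup, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group("text") is not None:
            stack[-1][1].append(m.group("text"))
        elif m.group("close"):
            # A closing tag must match the innermost open element.
            if len(stack) == 1 or stack[-1][0] != m.group("name"):
                raise ValueError(f"mismatched </{m.group('name')}>")
            stack.pop()
        else:
            node = [m.group("name"), []]
            stack[-1][1].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed <{stack[-1][0]}>")
    return root[1]
```

The regex only ever recognizes one flat token at a time; everything that depends on nesting lives in the ordinary loop and the explicit stack.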

[1] Edit: Looking at that, it's not as solid as it was in my head when I was writing it. HTML is still not a regular language, because regular languages cannot key on nesting with uncapped depth.
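A small illustration of the depth point, as a sketch using Python's `re` module (which has no recursion construct). Both function names are made up. A pattern built for a fixed maximum depth works up to that depth and fails one level deeper, while a plain counter handles any depth.

```python
import re

def nested_div_regex(max_depth):
    """Compile a regex for <div>...</div> nested at most max_depth deep.

    The pattern is built by textual expansion, so it grows with max_depth.
    """
    pattern = ""  # depth 0 matches only the empty string
    for _ in range(max_depth):
        pattern = f"(?:<div>{pattern}</div>)*"
    return re.compile(pattern)

def balanced_divs(markup):
    """Check balance at any depth using a counter instead of a regex."""
    if re.fullmatch(r"(?:</?div>)*", markup) is None:
        return False  # something other than div tags is present
    depth = 0
    for token in re.findall(r"</?div>", markup):
        depth += -1 if token.startswith("</") else 1
        if depth < 0:
            return False  # closing tag with nothing open
    return depth == 0
```

With `three_deep = "<div><div><div></div></div></div>"`, `nested_div_regex(2).fullmatch(three_deep)` finds no match, `nested_div_regex(3).fullmatch(three_deep)` does, and `balanced_divs(three_deep)` accepts it without knowing the depth in advance.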

poelie · on July 9, 2011

It goes deeper than that. He has used backreferences, so formally these aren't real regular expressions anymore. (Backreferences rewrite the grammar on the fly.)

Perl regular expressions are pretty powerful. I wrote a small Perl program (just for fun after reading the post) which can parse S-expressions. I wrote it quickly, but it shows it is possible:
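A quick way to see the effect of a backreference, sketched with Python's `re` module rather than Perl (the names `SAME_RUN` and `same_count_around_b` are invented here). The strings consisting of n a's, a b, and then the same n a's do not form a regular language, yet one backreference matches exactly that set.

```python
import re

# (a+) captures a run of a's; \1 then demands exactly the same run again.
SAME_RUN = re.compile(r"(a+)b\1")

def same_count_around_b(s):
    """True when s is n a's, one b, then the same n a's (n >= 1)."""
    return SAME_RUN.fullmatch(s) is not None
```

Here `same_count_around_b("aaabaaa")` holds while `same_count_around_b("aaabaa")` does not, which requires remembering an unbounded count that a finite automaton cannot store.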

http://pastebin.com/ndrHTZJB

I know it has bugs.

d0m · on July 10, 2011

It's interesting, I'll take a look at it more seriously. Thanks for sharing.