Errr... Did he just write an HTML parser (and then use it) to prove to everyone that you can use regex to solve the use-html-parser-instead-of-regex problem?!
It's as if I suggested that someone use Python instead of ASM to solve a simple problem, and then someone tried to prove me wrong by writing a Python interpreter in ASM and then USING it to solve the same problem!
Also, that being said, I feel like the post is more of a brag along the lines of "I'm the creator of a popular Perl book and Perl rocks your language, here's why, blah blah blah".
Yeah, this is pretty much what seems to be going on.
I don't think anyone has ever said that regexes can't be used in the parsing of HTML, just that they can't do the parsing of HTML on their own. It's like someone saying "You can't use bricks to keep warm!" and someone else countering with "Observe! I have built a house using, among other things, bricks, and it shelters me from the weather and keeps me warm!" That's deliberately missing the point in order to show off your housebuilding skills.
Still, it was an informative and well-written article. An article, though, rather than an answer to a question. Why do people write these thousands of words on Stack Overflow when they could publish them on their blogs? Are SO points from confused corporate employees really worth more than the adulation of the blogosphere? Actually, who cares. I almost lost the will to live just writing that sentence...
I'd rather see it on Stack Overflow. It has much more visibility there than if each person wrote it on their own blog (assuming they have one), which may not be seen by more than a handful of friends.
Actually I'm more interested in the answer to this question:
Was he able to write an arbitrary HTML parser using regexes because HTML is a regular language, or because Perl's string matching syntax, which can match some languages that are not regular, is called "regular expressions" anyway?
I am pretty sure that, for example, arbitrarily deeply nested balanced "<div></div>" pairs do not form a regular language; however, you can use modern PCRE to match them anyway.
If I'm understanding it correctly (which I'm not convinced of; this is much more advanced Perl than I'm used to seeing), neither is quite accurate.
The first option is easy to rule out: HTML is not a regular language. Evidence: S-expressions are not a regular language. XML is isomorphic to S-expressions of the form (<tagname> ((<attr1> <value1>) (<attr2> <value2>) ...) <contents>). HTML is isomorphic to XHTML, which is XML.[1]
I can't (read: don't feel like trying to) rigorously disprove that Perl's "regular expression" facility could be an HTML parser all by itself, but it looks like what's going on here is closer to a regex-based tokenizer (certainly possible, and in fact this is a very common pattern). Then, regular Perl flow control constructs are used to interpret the tokens as tags and such.
[1] Edit: Looking at that, it's not as solid as it was in my head when I was writing it. HTML is still not a regular language, because regular languages cannot track nesting of unbounded depth.
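For what it's worth, the tokenizer-plus-ordinary-control-flow pattern can be sketched in a few lines. This is a minimal Python sketch, not the code from the post: the TAG pattern and the is_balanced helper are hypothetical names, and it assumes bare tags only (no attributes, comments, or void elements). The regex only finds the tags; a plain list used as a stack tracks the nesting, which is the part a true regular expression cannot do.

```python
import re

# Hypothetical token pattern: matches only bare <name> and </name> tags.
# Real HTML (attributes, comments, void elements, scripts) needs far more.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\s*>")


def is_balanced(markup: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            # Opening tag: remember it until its partner shows up.
            stack.append(name)
        elif not stack or stack.pop() != name:
            # Closing tag with no opener, or closing the wrong element.
            return False
    # Anything left on the stack was opened but never closed.
    return not stack
```

The stack can grow without limit, so this handles any nesting depth, but that work is done by the loop and the list, not by the regex.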
It goes deeper than that. He has used back references, so formally these aren't real regular expressions anymore. (Back references rewrite the grammar on the fly.)
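A small illustration of the back reference point, as a Python sketch rather than anything taken from the post (SAME_RUN and same_on_both_sides are made-up names). The strings consisting of n a's, one b, and then the same n a's do not form a regular language, yet a single back reference matches exactly that set:

```python
import re

# (a+) captures a run of a's; \1 then demands the identical run after the b.
# No finite automaton can do this, since it would have to count the a's.
SAME_RUN = re.compile(r"(a+)b\1")


def same_on_both_sides(text: str) -> bool:
    """Return True if text is n a's, one b, then the same n a's (n >= 1)."""
    return SAME_RUN.fullmatch(text) is not None
```

Here same_on_both_sides("aaabaaa") holds while same_on_both_sides("aaabaa") does not, because the two runs have to be the same length.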
Perl regular expressions are pretty powerful. I wrote a small Perl program (just for fun after reading the post) that can parse S-expressions. I wrote it quickly, but it shows it is possible: