
Last time I used Lpeg for a big project I ran into difficulties with error detection and recovery --- writing the parser for correct input was beautifully easy, but writing a parser that would gracefully handle incorrect input was very hard. (I'm reminded of the apocryphal Prolog compiler which, if you gave it a program with a syntax error, would just reply 'No.'.)

When I asked about the best way to handle errors it was suggested to me that I should add alternatives to my rules that would catch errors and return an error token from there; but of course, that short circuits any alternatives further up the grammar tree, so it wasn't very satisfactory. A quick look at your grammar doesn't show anything like this --- how are you dealing with errors?



> how are you dealing with errors?

The example looks for string literals outside of what it considers comments and preprocessor directives. If you feed it random data it will just end up in various states of the graph. It matches everything and uses captures for substitution when it's in the right state.

When I need to break on invalid grammar, I use match-time captures (Cmt). If you have problems with short-circuiting the parse, I think you might have the match-time capture in the wrong place, but I might be wrong. Do you have a short example?

Trivial example (maybe too trivial?) of error handling:

  lpeg = require "lpeg"
  P, Cmt, V, S = lpeg.P, lpeg.Cmt, lpeg.V, lpeg.S
  function errfunc(match)
    error("invalid token: "..tostring(match))
  end
  P{
    "tokens";
    space   = S" \t",
    invalid = (1-V"space")^1/errfunc,
    token   = P"foo" + P"bar" + P"baz" + V"invalid",
    tokens  = (V"token" * V"space"^0)^0 * -1
  }:match("foo bar baz woops foo")

EDIT: or, to get the position:

    lpeg = require "lpeg"
    P, C, Cmt, V, S = lpeg.P, lpeg.C, lpeg.Cmt, lpeg.V, lpeg.S
    -- Cmt calls errfunc with the whole subject, the position after
    -- the match, and the captured token
    function errfunc(subject, pos, cap)
      error(string.format("invalid token at position %d: %s",
        pos-#cap, cap))
    end
    p = P{
      "tokens";
      space   = S" \t",
      invalid = Cmt(C((1-V"space")^1), errfunc),
      token   = P"foo" + P"bar" + P"baz" + V"invalid",
      tokens  = (V"token" * V"space"^0)^0 * -1
    }:match("foo bar baz woops foo")


Hi, you might want to check out LPegLabel[1]. It's an extension of LPeg that allows error detection and recovery by throwing labels (done manually, but it can be semi-automated). To put it simply, throwing a label is like throwing an exception: it bubbles up until it is caught. For error recovery, you can catch those labels, record them somewhere, and then continue parsing, possibly skipping ahead to a certain token (such as a semicolon) before doing so.
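To make the label idea concrete, here is a minimal sketch (assuming lpeglabel is installed; the grammar and the label name "ErrSemi" are invented for illustration):

```lua
-- Statements are "a;". If the semicolon is missing, throw the label
-- "ErrSemi" instead of silently failing and backtracking.
local m = require "lpeglabel"
local P, V, T = m.P, m.V, m.T

local g = P{
  "Prog";
  Prog = V"Stmt"^1 * -1,
  Stmt = P"a" * (P";" + T"ErrSemi"),
}

assert(g:match("a;a;"))          -- valid input parses
local r, lab = g:match("a;a")    -- missing final ';'
-- r is nil and lab identifies the thrown label
```

Without a recovery rule the label simply aborts the match; recording the label and position, then resuming at a synchronisation token, is what gives you recovery.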

I worked on modifying LPeg grammars to add error detection and recovery via LPegLabel for my Google Summer of Code project, so please ask me any questions you have about it. LPegLabel is still quite young, but it is fully usable, and since it is a project under LabLua (the same research lab that is in charge of Lua and LPeg), I'm hoping it will be integrated into LPeg one day.

[1] https://github.com/sqmedeiros/lpeglabel


>> (I'm reminded of the apocryphal Prolog compiler which, if you gave it a program with a syntax error, would just reply 'No.'.)

I don't believe you remember this correctly.

The Prolog interpreter will raise an error for syntax errors. If you misspell something without causing a syntax error then you may get an unexpected "no" (or "false"), but a syntax error causes compilation to fail, in Prolog as in any other language.

In any case, a "no" ("false") means the proof procedure failed to prove your query true (under Prolog's closed-world assumption, failure to prove is treated as falsehood). It's not an error and it's not a failure of the interpreter.


The expanded version of the saying, which I think I got from my Prolog lecturer at university, is:

Writing an optimising compiler in Prolog (note, in Prolog, not for Prolog) which will accept valid programs is trivially easy. Writing an optimising compiler in Prolog which will say anything other than 'No.' for invalid programs is intractably hard.


Ah, I see: you meant a compiler written in Prolog, not a Prolog compiler.

Who was your lecturer? I haven't heard of that problem with optimising compilers in Prolog. I don't know why it should be harder to do in Prolog than in any other language.


Hey, while you're trying out parser combinators, what do you think of this one? http://github.com/meric/leftry It's very new and I would like some comments on how to make it easier to use. You can also check out http://github.com/meric/l2l, a hybrid Lisp and Lua programming language that relies on the parser combinator library. (See lua.lua in that project, where the Lua grammar is implemented using Leftry.)


Neat.

I did find the readme confusing; I had to jump up and down the page to make sense of it. 'any' is mentioned but not documented, and the first few examples made absolutely zero sense until I'd read quite a bit further, despite my being familiar with parsers and grammars. It's not a very gentle introduction.

OK, but your actual question was about usability. I don't understand why 'factor' takes a generator function as an argument instead of something much less verbose, like an array/table; that is probably my biggest complaint. Is it an implementation detail, is it to allow self-reference, or is it to allow arbitrary code inside a rule? If the latter, could providing a function be made optional in favour of a simpler syntax? There ought to be alternatives for allowing self-reference too.

It's cute that a parser can be used as an iterator. How useful is it in practice to match concatenated valid sentences like that? I've never found a use for it in my own tasks.


Thanks for taking a look at it. I could really use your feedback to improve the documentation, which is definitely an area I need to work on. The anonymous function is there so that rule resolution waits until all the nonterminals are defined.
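For readers wondering why the anonymous function is needed at all, here is a generic sketch (not Leftry's actual API; every name below is invented) of the underlying problem: recursive rules refer to nonterminals that don't exist yet at definition time, so the lookup has to be deferred inside a closure:

```lua
local rules = {}

-- Defer the lookup of a nonterminal until parse time; an eager
-- reference to rules.expr would be nil at definition time.
local function ref(name)
  return function(input, pos)
    return rules[name](input, pos)
  end
end

-- A rule is a function (input, pos) -> newpos or nil.
local function lit(s)
  return function(input, pos)
    if input:sub(pos, pos + #s - 1) == s then return pos + #s end
  end
end

local function seq(a, b)
  return function(input, pos)
    local p = a(input, pos)
    if p then return b(input, p) end
  end
end

local function opt(a)
  return function(input, pos)
    return a(input, pos) or pos
  end
end

-- expr <- "x" ("+" expr)?   -- recursive via ref"expr"
rules.expr = seq(lit"x", opt(seq(lit"+", ref"expr")))

print(rules.expr("x+x+x", 1))  -- 6 (consumed the whole string)
```

An eager table-based syntax could work too, if the library resolved rule names lazily by key, which is roughly what LPeg's `V"name"` does inside a grammar table.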

The iterator feature is cute, which was reason enough for me to do it. :)

Much appreciated.


I've used PEG parsers for two projects: one written with peg/leg (which leafo mentioned), and one with a significantly modified fork of pyPEG [2]. My method for handling incorrect input is equivalent to what you described: I extended the pyPEG grammar to add 'checkpoints'. (Earlier, when using peg/leg, I used alternative code blocks that threw errors.) Once a checkpoint is passed, the parser is not allowed to backtrack past it again. This avoids the problem of the parser backtracking all the way to the very beginning of the input and producing an error equivalent to 'No'.

As a bonus, adding checkpoints speeds up parsing and massively reduces memory usage, because as soon as you hit one you can delete the state needed for backtracking. Yes, it does mean that you have to think about whether adding a checkpoint somewhere cuts off valid parses; the checkpoints really ought to be added automatically.

Ideally you would organise your rules in such a way that all the alternative things that could appear in some context are in a rule called Foo, and put a checkpoint at the beginning, so that if none of them match you get an 'expected Foo' error. And if you see an opening quote or bracket or anything like that, immediately add a checkpoint, because it has to be closed. This worked very well for me, but I imagine it would not scale to something as nightmarish as Perl.
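The same checkpoint idea can be approximated in plain LPeg with an ordered choice whose fallback raises a Lua error from a match-time capture, so a failure after the checkpoint aborts the parse instead of backtracking (a sketch; the helper name `expected` is my own):

```lua
local lpeg = require "lpeg"
local P, S, Cmt = lpeg.P, lpeg.S, lpeg.Cmt

-- expected(patt, msg): try patt; if it fails, abort the whole parse
-- with an error instead of letting LPeg backtrack past this point.
local function expected(patt, msg)
  return patt + Cmt(P(0), function(subject, pos)
    error(string.format("%s expected at position %d", msg, pos), 0)
  end)
end

local space = S" \t"^0
local expr  = P"x" + P"y"                -- toy expression
-- after a comma, an expression must follow (the checkpoint)
local list  = expr * (space * P"," * space * expected(expr, "expression"))^0

assert(list:match("x, y"))               -- fine
local ok, err = pcall(function() return list:match("x, ,") end)
-- ok is false; err reports "expression expected" with a position
```

This is essentially the cut described above: once the comma is consumed, there is no way back.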

Real example:

   def expressionList():       return QUES, (expression, STAR, (PLUS, ",", CHECKPNT, expression))
(Note: pyPEG is funky; the ?, * and + operators (QUES, STAR, PLUS) precede the thing they act on. I really should change that.) Basically, in this language, if you see a comma in an expressionList, something must follow it. The error produced by this rule is:

   On line 79 in rbtest.rbas:
   Syntax error: while parsing expressionList, expected expression
        dim x as integer = y(a, b,)
                                  ^
This is a parser for a DSL extension of FreeBASIC (which I called ReloadBasic). In addition to the DSL stuff, it parses just enough of FB to determine the types of all the variables and arguments and the beginnings and ends of functions/subs. For the small fragment it handles, it often produces better errors than the handwritten recursive-descent FB compiler. (It's also about 1% of the speed...) Source is here: [1]

[1] https://bitbucket.org/rbv/ohrrpgce/src/6ba83910c564aa474220e...

[2] https://bitbucket.org/rbv/ohrrpgce/src/6ba83910c564aa474220e...



