Your experience with the usefulness of error messages from top-down (recursive descent) vs bottom-up (table-driven) parsers is opposite to common experience. The trouble with bottom-up parsers is that they inherently (by way of being bottom-up) have no top-down context, so all they can generate by default is a bottom-up "symbol XXX expected" type of error. A top-down parser knows what high-level construct it's trying to parse, so it is in a position to generate a more meaningful error.
Of course in either case the grammar could be augmented with thoughtful error constructs designed to generate meaningful errors when parsed, but that doesn't seem to be common practice given how awful compiler error messages tend to be. Try writing "std::cout" instead of "std::endl" and see what gibberish error message g++ generates. clang is generally better than g++ in this regard.
I can't speak about common experience, only my own, but that tells me that an LR(1) parser can tell which high-level construct it is trying to parse -- this is encoded in the states deeper down in the parse stack, and therefore depends on information which the parser/generator already has.
While this may be ambiguous to some extent, because it depends on tokens further to the right, that is inherent in the "L" (left-to-right) part of LR and applies to LL as well; only raising the lookahead length can fix it.
Now I did cheat a bit in what I said above: parsing further terminal symbols "removes" the high-level information from the state, and then it cannot be used for error messages anymore. This has two notable aspects: First, one can simply keep such information as metadata, either encoded in the state (but this blows up the table size because states cannot be merged as easily anymore) or as additional information collected while parsing (probably better -- good idea, I'll have to look at this).
Second, there is the question of whether this information is really useful in producing better error messages. An input fragment like "(5 + )" is an incomplete expression, and without further metadata you know that (1) another sub-expression has to appear before the closing parenthesis, (2) you are parsing an expression, and (3) you are parsing a high-level construct that expects an expression (this knowledge is inherent in LR(1)) -- but you don't* know anymore which high-level construct you are parsing. For the error message, though, that is probably irrelevant.
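To make the "(5 + )" case concrete, here is a toy shift/reduce recognizer, a sketch only: it keeps grammar symbols on the stack instead of LR states, and the grammar is just E -> num | ( E ) | E + E. At the failure point it can report that a sub-expression is missing without knowing anything about an enclosing construct:

```python
# Toy shift/reduce recognizer for E -> num | "(" E ")" | E "+" E.
# Sketch only: symbols on the stack instead of LR states.

def parse_expr(tokens):
    tokens = tokens + ["$"]          # end-of-input marker
    stack, pos = [], 0
    while True:
        # greedy REDUCE steps
        if stack and stack[-1] == "num":
            stack[-1] = "E"                       # E -> num
        elif stack[-3:] == ["(", "E", ")"]:
            stack[-3:] = ["E"]                    # E -> ( E )
        elif stack[-3:] == ["E", "+", "E"]:
            stack[-3:] = ["E"]                    # E -> E + E
        elif tokens[pos] == "$":
            if stack == ["E"]:
                return "ok"
            raise SyntaxError("sub-expression expected at end of input")
        else:
            tok = tokens[pos]
            # after "(" or "+" only the start of a sub-expression may follow;
            # note the message needs no knowledge of the enclosing construct
            if stack and stack[-1] in ("(", "+") and tok not in ("num", "("):
                raise SyntaxError(f"sub-expression expected before {tok!r}")
            stack.append(tok)                     # SHIFT
            pos += 1

parse_expr(["(", "num", "+", "num", ")"])   # returns "ok"
# parse_expr(["(", "num", "+", ")"]) raises: sub-expression expected before ')'
```

The error message falls out of purely local information (top of stack plus lookahead), which is the point: the enclosing construct was never needed.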
Obviously, that last part depends, case by case.
(*) re-reading that, I'll have to admit that this incorporates top-down aspects into the parser :)
If you're using "LR" to mean bottom-up generated parsers (as OP does, rather than meaning LR(1) grammars, parsed by whatever means), then there is no higher level construct info on the stack - these operate in strictly bottom-up fashion.
e.g. first an identifier might be matched (which could occur in dozens of different high-level contexts - but the parser has no idea at this point), then depending on what comes next it might turn out to be part of an expression (vs the RHS of an assignment, or a parameter, or a type definition, etc), so after a few REDUCE actions the parser will have realized it's an expression, which again could occur in many contexts (LHS of an assignment, subject of an if-statement, for-statement, etc), and so it goes ...
At any point in time a bottom-up parser knows nothing about the higher-level context that it hasn't yet got to - only the tiny program fragment it is progressively recognizing in bottom-up fashion. This is one of the fundamental differences between a bottom-up and a top-down (recursive descent) parser.
I was referring specifically to an LR(1) bottom-up parser, but I was confused about what the stack looks like (all this was some time ago).
The parser might indeed encounter an identifier without knowing whether it is part of an expression. However, it does know whether an expression is valid at that point. This isn't buried deep in the stack; it is encoded in the top-of-stack state (this is the part that I was confused about).
For example, suppose that the input is now at a point where a statement starts in a Java-like language. An identifier may be part of many different kinds of statements, and the parser will not know which of them an identifier starts, nor whether that identifier is part of an expression. It does, however, know that a statement is valid at this point, because this information can be found in the transition table used after reduction. It will also know which high-level constructs are valid from that table, including statement and expression, and from the token transition table it will know which tokens are valid.
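What such a state "knows" can be illustrated directly from the tables. The state numbers and table entries below are invented for illustration (a real generator would derive them from the grammar), but the shape is the standard one: the ACTION row lists the valid next tokens, and the GOTO row lists the nonterminals (high-level constructs) valid at this point:

```python
# Invented ACTION/GOTO rows for a state at the start of a statement in a
# Java-like grammar. State numbers and entries are made up for illustration.

ACTION = {  # state 7: tokens that may come next, and what to do on each
    7: {"if": ("shift", 12), "while": ("shift", 13),
        "return": ("shift", 14), "identifier": ("shift", 15)},
}
GOTO = {    # state 7: nonterminals valid here, i.e. where to go after a
    7: {"Statement": 20, "Expression": 21}}   # reduction to one of them

state = 7
print("valid tokens:", sorted(ACTION[state]))      # for "token XYZ expected"
print("valid constructs:", sorted(GOTO[state]))    # for "statement expected"
```

An error reporter can read both rows off the current state without inspecting anything deeper in the stack.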
What you might be referring to is that while the transition table will show that an expression is allowed, it does not easily show whether such an expression must be part of a statement or can start another high-level construct. My experience with using such a parser in the context of an IDE is that it is okay to show both, because most syntax errors made by a user are far more trivial (such as the wrong number of closing parentheses).
Note that an LL(1) top-down (recursive descent) parser would not know either if the first identifier is part of an expression because that information cannot be derived from the left-side context. Only a larger lookahead will fix that. Or backtracking, which is kind of similar to a larger lookahead.
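The lookahead point can be sketched with a two-token peek (the token lists and the helper are invented for illustration): with only the identifier in hand the decision is impossible, and the second token settles it.

```python
# With one token of lookahead ("x"), an assignment and a call look identical;
# the *second* token decides. Toy token lists, invented for illustration.

def classify(tokens):
    if tokens and tokens[0].isidentifier() and len(tokens) > 1:
        if tokens[1] == "=":
            return "assignment"
        if tokens[1] == "(":
            return "call"
    return "unknown"

print(classify(["x", "=", "5", ";"]))   # assignment
print(classify(["x", "(", ")", ";"]))   # call
```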
> The parser might indeed encounter an identifier without knowing whether it is part of an expression. However, it does know whether an expression is valid at that point. This isn't buried deep in the stack; it is encoded in the top-of-stack state (this is the part that I was confused about).
No, it really isn't!
Table-driven parsers basically have two actions encoded in their tables - for any (state, current input symbol) pair (i.e. table indices) they can either SHIFT or REDUCE. A SHIFT action means to push the symbol onto the stack, since the parser doesn't yet know what it's part of. A REDUCE action means to pop the top N symbols (terminals and non-terminals) off the stack - they match the RHS of a grammar rule it has now recognized - and replace them with that rule's LHS.
e.g.
If there was only one grammar rule:
A -> x y z
Then when the parser saw "x" it would SHIFT, then "y" => SHIFT, then "z" => SHIFT, then (with next symbol end-of-file) REDUCE - popping (x, y, z) off the stack and replacing them with A. This REDUCE action is the first time it knew that those x, y, z were part of an A. In a real grammar with more than one rule there may be multiple possible REDUCE actions (with different RHSs) depending on what followed x [y [z]].
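The trace above can be written out as a minimal shift/reduce loop. This is a sketch only: it keeps symbols on the stack instead of LR states, and the "tables" collapse to the one rule:

```python
# Minimal shift/reduce loop for the single-rule grammar A -> x y z.
# Sketch only: symbols instead of LR states on the stack.

LHS, RHS = "A", ["x", "y", "z"]

def parse(tokens):
    stack, pos = [], 0
    tokens = tokens + ["$"]              # end-of-input marker
    while True:
        if stack == RHS and tokens[pos] == "$":
            return [LHS]                 # REDUCE: only now is A recognized
        expected = RHS[len(stack)] if len(stack) < len(RHS) else None
        if tokens[pos] == expected:
            stack.append(tokens[pos])    # SHIFT: don't yet know it's an A
            pos += 1
        else:
            # bottom-up error: all we know is which terminal could come next
            raise SyntaxError(f"{expected!r} expected")

parse(["x", "y", "z"])         # returns ['A']
# parse(["x", "y", "w"]) raises: 'z' expected
```

Note the error branch: at that point the loop can only say which terminal would have allowed a SHIFT, which is exactly the "z expected" style of message.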
In a top-down (recursive descent) parser you know what you're parsing at any point, and then go looking for what you hope to be there. e.g. if you see "if" then you will call ParseIfStatement which will call ParseExpression etc.
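That top-down shape can be sketched in a few lines: each parse function knows which construct it is inside, so its messages can name that context. The grammar is a toy (a condition is just a single identifier) and all names are invented:

```python
# Recursive-descent sketch: ParseIfStatement-style functions carry their
# context, so errors can mention it. Toy grammar, invented names.

class RDParser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else "$"

    def expect(self, tok, context):
        if self.peek() != tok:
            raise SyntaxError(f"{tok!r} expected in {context}, got {self.peek()!r}")
        self.pos += 1

    def parse_if_statement(self):
        self.expect("if", "if-statement")
        self.expect("(", "if-statement")
        self.parse_expression("if-condition")   # context travels top-down
        self.expect(")", "if-statement")

    def parse_expression(self, context):
        # toy rule: an expression is a single identifier
        if not self.peek().isidentifier():
            raise SyntaxError(f"expression expected in {context}, got {self.peek()!r}")
        self.pos += 1

RDParser(["if", "(", "x", ")"]).parse_if_statement()   # parses fine
# RDParser(["if", "(", ")"]) raises: expression expected in if-condition, got ')'
```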
In a bottom-up parser you're discovering what you're parsing by SHIFTING these bottom level (then progressively higher) symbols until finally you've seen enough to recognize what the construct is and can do a REDUCE action using that rule.
So, using the above example, if after "x" and "y" you see something unexpected like "w" rather than "z" you can't generate a top-down error like "Invalid A", because you don't yet know this is part of an "A" (and indeed in a more complex grammar there may be multiple things that "x y" could be part of). So, in the bottom up parser all you can generate as an error is something like "z expected" given that you know "z" would be a valid next symbol (leading to a SHIFT action in this case).
The stack contains states, not symbols. A shifted state encodes the shifted terminal PLUS information about which nonterminals that terminal can be part of (in LR(1), each of those nonterminals is augmented with its accepted follow-set).
Really the only place where the above breaks down is in the middle of a rule: after shifting a terminal (or "shifting" a nonterminal after reduction -- the literature calls this a goto, but it really behaves like shifting a nonterminal), all that context information is dropped under the assumption that it is not needed for the current rule; it is still available deeper on the stack and will resurface after reducing the current rule. At that point, an error message would be missing this information.