> And you don't need just any parser, you need very robust ones that can deal with malformed files well.
I very much agree. I feel there has been a trend recently where people (re)discovered how cool and useful ASTs are and now expect everything to be using them. I suspect old-school computer scientists might be secretly laughing at this while programming in some Lisp-like languages they invented for themselves.
Jokes aside, I do wonder how modern IDEs manage to parse broken source code into usable ASTs --- is this trivial (CS theory-wise) or is there a lot of engineering secret sauce involved in making it work?
With only basic knowledge of the domain, I would assume it is hard and ugly. If the file is malformed, there is almost certainly an infinite number of possible edits to make the file adhere to the grammar, hence there cannot be any algorithm that just provides the one and only correct syntax tree. This in turn means that you have to come up with heuristics that identify reasonable changes which fix the file, and that is probably not easy. Also, if you do this online in an IDE, the problem probably becomes easier [1] - if you have a valid file and then make it invalid by deleting an operator in the middle of some expression, you can still essentially use the syntax tree from just before the deletion. If, on the other hand, you are handed an already malformed file, you might have a harder time.
[1] And also harder, because if you want to parse the file after each keystroke, you have to be fast. This probably also makes incremental updates to the syntax tree the preferred solution, and that might align well with using the prior result for error recovery.
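For what it's worth, here is a minimal sketch of the "reuse the prior tree" idea using the tree-sitter Node bindings (assuming the tree-sitter and tree-sitter-javascript npm packages; the exact shape of the recovered tree is grammar-dependent):

    // Sketch only: assumes `tree-sitter` and `tree-sitter-javascript` are
    // installed and TypeScript has esModuleInterop enabled.
    import Parser from "tree-sitter";
    import JavaScript from "tree-sitter-javascript";

    const parser = new Parser();
    parser.setLanguage(JavaScript);

    // Parse a valid file once.
    const oldSource = "var foo = bar * baz;";
    const oldTree = parser.parse(oldSource);

    // Delete the operator in the middle of the expression ("* " at bytes
    // 14-15) - the exact scenario from the comment above.
    const newSource = "var foo = bar baz;";
    oldTree.edit({
      startIndex: 14,
      oldEndIndex: 16,
      newEndIndex: 14,
      startPosition: { row: 0, column: 14 },
      oldEndPosition: { row: 0, column: 16 },
      newEndPosition: { row: 0, column: 14 },
    });

    // Handing the edited old tree to parse() lets the parser reuse the
    // unchanged subtrees; the damage is typically confined to an ERROR
    // node around the broken expression instead of the whole parse failing.
    const newTree = parser.parse(newSource, oldTree);
    console.log(newTree.rootNode.toString());

This is roughly how IDE-oriented parsers combine the two ideas: error recovery keeps the tree usable, and incremental reparsing keeps each keystroke cheap.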
"If the file is malformed, there is almost certainly an infinite number of possible edits to make the file adhere to the grammar, hence there can not be any algorithm that just provides the one and only correct syntax tree. This in turn means that you have to come up with heuristics that identify reasonable changes which fix the file and that is probably not easy."
I don't understand that question. Given the following source file that does not parse

    var foo = bar baz

there are many ways to change it and make it parse, including the following reasonable ones
    var foo = barbaz
    var foo = "bar baz"
    var foo = { bar, baz }
    var foo = bar // baz
    var foo = bar
    //var foo = bar baz
    var foo = bar * baz
    var foo = bar + baz
    var foo = bar.baz
    var foo = bar(baz)
but also unreasonable ones like

    var abc = 123

and therefore a parser that can handle malformed inputs has to make educated guesses about what the input was actually supposed to look like. And don't be fooled by this simple example: imagine a long source file with deeply nested code in a language with curly braces, and randomly delete some of the braces. Now try to figure out where classes, methods, if or try statements begin and end in order to produce a [partial] syntax tree that is better than just giving up at the position of the first error.
My point was that test suites should give you a heuristic for which corrections are good and which are bad. A source code change that turns a failing test into a passing one should be considered an improvement.
I am still lost. A test suite for what? We have a parser - binary, source code, and maybe a test suite, if the parser developers decided to write tests - and a random text file that we throw at the parser and for which the parser hopefully generates a useful syntax tree, provided the content is a well-formed or not too badly malformed program in a language the parser understands.
So I can only use the diff tool to compare two non-compiling versions of a source file if I provide a test suite for that file to the diff tool? And how would you make use of the test suite? Before you can run the test suite, the source file must already parse and compile, which is already more than a diff tool based on a syntax tree requires - it must be able to parse the source code, but it does not have to compile it. Passing the test suite requires even more: not only parsing and compiling, but also exhibiting the correct behavior, which the diff tool does not care about.
And you actually jumped over the hard part that requires the heuristics: how to modify the input in order to make it parse. Take a 10 kB source file and delete 10 random characters - how will you figure out which characters to put back where? With 100 possible characters, 10,000 positions to insert a character, and having to insert 10 characters, you are looking at something like 10^60 possible modifications - each single insertion already has 100 × 10,000 = 10^6 options, and (10^6)^10 = 10^60. You are certainly not going to try them one after another, each time checking whether the modified source file parses, compiles, and passes the test suite.
> So I can only use the diff tool to compare two non-compiling versions of a source file if I provide a test suite for that file to the diff tool?
Not sure what this whole straw man is about; I definitely didn't suggest anything like that. Of course you can only compare two compiling versions of a source file using a test-suite-based heuristic. I thought this whole thing was about the "heuristics that identify reasonable changes which fix the file" mentioned above? Reasonable changes that DON'T fix the file are clearly recognizable by NOT passing the test suite, just as a human attempting those changes would discover that the change didn't in fact yield the desired result after running the test suite.
> With 100 possible characters, 10,000 positions to insert a character, and having to insert 10 characters, you are looking at something like 10^60 possible modifications.
If you're working with an AST, you're almost certainly not working with characters - that would be immensely wasteful. In fact, working with an AST is pretty much the only way the set of candidate changes is reduced enough for almost any change to NOT be rejected outright. With character-level modifications, you face the problem that almost every edit will be rejected as early as the parsing stage.
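To put a rough number on the difference, here is a purely illustrative sketch (all types invented) that enumerates repairs at the tree level for the "var foo = bar baz" example from upthread:

    // Illustrative only: "identifier followed by identifier" admits just a
    // handful of structural repairs - joining the two names with a binary
    // operator, a member access, or a call.
    type Expr =
      | { kind: "Ident"; name: string }
      | { kind: "Binary"; op: string; left: Expr; right: Expr }
      | { kind: "Member"; object: Expr; property: string }
      | { kind: "Call"; callee: Expr; args: Expr[] };

    function candidateRepairs(left: string, right: string): Expr[] {
      const l: Expr = { kind: "Ident", name: left };
      const r: Expr = { kind: "Ident", name: right };
      return [
        ...["+", "-", "*", "/"].map(
          (op): Expr => ({ kind: "Binary", op, left: l, right: r })
        ),
        { kind: "Member", object: l, property: right },
        { kind: "Call", callee: l, args: [r] },
      ];
    }

    // 6 structural candidates instead of something like 10^60 character edits.
    console.log(candidateRepairs("bar", "baz").length);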
We have obviously been talking past each other. My point was that a parser for a syntax-tree-based diff tool should probably be able to deal well with files containing syntax errors, i.e. it must be able to recover from them. And by fixing syntax errors I did not mean actually fixing the file, but being able to construct a reasonable syntax tree even if some subtrees do not adhere to the grammar. Given an input like
    class foo
    {
        function bar() {
        function baz() { }
    }
it should be able to parse the file as if bar() were not missing its closing curly brace. If the parser just gave up, or inserted the closing curly brace at the end,
    class foo
    {
        function bar() {
        function baz() { }
    }
    }
thereby making baz() a nested function inside of bar(), the result would be worse than using a character-based diff algorithm. But I never intended to say anything about making code functionally correct - that is none of the business of a parser or a diff algorithm.
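Incidentally, indentation is one cheap signal a recovering parser can use to prefer the sibling reading over the nested one. Here is a toy sketch of that heuristic (entirely hypothetical, not taken from any real parser, though production compilers reportedly use similar indentation hints in their mismatched-brace diagnostics):

    // Toy heuristic: a block's contents should be indented deeper than the
    // line that opened it. The first line that falls back to (or below) the
    // opener's indentation without being a closing brace marks a good spot
    // for the missing "}".
    function guessMissingBraceLine(lines: string[]): number | null {
      const indent = (s: string) => s.length - s.trimStart().length;
      const openers: number[] = []; // indentation of each unclosed "{" line
      for (let i = 0; i < lines.length; i++) {
        const t = lines[i].trim();
        if (t === "") continue;
        if (
          openers.length > 0 &&
          !t.startsWith("}") &&
          indent(lines[i]) <= openers[openers.length - 1]
        ) {
          return i; // insert the missing "}" before this line
        }
        for (const ch of t) {
          if (ch === "{") openers.push(indent(lines[i]));
          else if (ch === "}") openers.pop();
        }
      }
      return openers.length > 0 ? lines.length : null; // unclosed at EOF
    }

On the class foo example above, this points at the "function baz" line, i.e. bar() gets closed before baz() starts - siblings, not nested - which is exactly the reading a character-based diff would miss.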
What you do is produce an AST where some nodes indicate syntax errors. This works best in languages where it is easy to resynchronize after an error, of course.
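And here is a minimal sketch of that error-node-plus-resynchronization idea (the toy grammar and all names are invented for illustration):

    // Toy grammar: a program is a list of `var <name> = <identifier> ;`
    // statements. On a parse error we record an Error node covering the bad
    // tokens, skip ahead to the next ";" (the resynchronization point), and
    // keep parsing the remaining siblings.
    type AstNode =
      | { kind: "VarDecl"; name: string; value: string }
      | { kind: "Error"; skipped: string[] };

    function parseProgram(tokens: string[]): AstNode[] {
      const nodes: AstNode[] = [];
      let pos = 0;
      while (pos < tokens.length) {
        if (
          tokens[pos] === "var" &&
          tokens[pos + 2] === "=" &&
          tokens[pos + 4] === ";"
        ) {
          nodes.push({ kind: "VarDecl", name: tokens[pos + 1], value: tokens[pos + 3] });
          pos += 5;
        } else {
          // Panic mode: swallow tokens into an Error node until a likely
          // statement boundary, then resume normally with the next sibling.
          const skipped: string[] = [];
          while (pos < tokens.length && tokens[pos] !== ";") {
            skipped.push(tokens[pos]);
            pos++;
          }
          if (pos < tokens.length) pos++; // consume the ";"
          nodes.push({ kind: "Error", skipped });
        }
      }
      return nodes;
    }

    // The broken statement becomes an Error node; its sibling still parses:
    console.log(parseProgram(["var", "foo", "=", "bar", "baz", ";", "var", "ok", "=", "qux", ";"]));

Statement terminators like ";" and "}" are what make a language easy to resynchronize: they give the parser natural points at which to start over.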