> And you don't need just any parser, you need very robust ones that can deal with malformed files well.
I very much agree. I feel there has been a trend recently where people (re)discovered how cool and useful ASTs are and now expect everything to be using them. I suspect old-school computer scientists might be secretly laughing at this while programming in some Lisp-like languages they invented for themselves.
Jokes aside, I do wonder how modern IDEs manage to parse broken source code into usable ASTs --- is this trivial (CS theory-wise) or is there a lot of engineering secret sauce involved in making it work?
With only basic knowledge of the domain, I would assume it is hard and ugly. If the file is malformed, there is almost certainly an infinite number of possible edits to make the file adhere to the grammar, hence there cannot be any algorithm that just provides the one and only correct syntax tree. This in turn means that you have to come up with heuristics that identify reasonable changes which fix the file, and that is probably not easy. Also, if you do this online in an IDE, the problem probably becomes easier [1] - if you have a valid file and then make it invalid by deleting an operator in the middle of some expression, you can still essentially use the syntax tree from just before the deletion. If, on the other hand, you are handed an already malformed file, you might have a harder time.
[1] And also harder, because if you want to parse the file after each keystroke, you have to be fast. This probably also makes incremental updates to the syntax tree the preferred solution, and that might align well with using the prior result for error recovery.
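For what it's worth, here is a minimal sketch of the "reuse the prior tree" idea using the tree-sitter Node bindings (assuming the tree-sitter and tree-sitter-javascript npm packages; the exact shape of the recovered tree is grammar-dependent):

    // Sketch only: assumes `tree-sitter` and `tree-sitter-javascript` are
    // installed and TypeScript has esModuleInterop enabled.
    import Parser from "tree-sitter";
    import JavaScript from "tree-sitter-javascript";

    const parser = new Parser();
    parser.setLanguage(JavaScript);

    // Parse a valid file once.
    const oldSource = "var foo = bar * baz;";
    const oldTree = parser.parse(oldSource);

    // Delete the operator in the middle of the expression ("* " at bytes
    // 14-15) - the exact scenario from the comment above.
    const newSource = "var foo = bar baz;";
    oldTree.edit({
      startIndex: 14,
      oldEndIndex: 16,
      newEndIndex: 14,
      startPosition: { row: 0, column: 14 },
      oldEndPosition: { row: 0, column: 16 },
      newEndPosition: { row: 0, column: 14 },
    });

    // Handing the edited old tree to parse() lets the parser reuse the
    // unchanged subtrees; the damage is typically confined to an ERROR
    // node around the broken expression instead of the whole parse failing.
    const newTree = parser.parse(newSource, oldTree);
    console.log(newTree.rootNode.toString());

This is roughly how IDE-oriented parsers combine the two ideas: error recovery keeps the tree usable, and incremental reparsing keeps each keystroke cheap.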
"If the file is malformed, there is almost certainly an infinite number of possible edits to make the file adhere to the grammar, hence there can not be any algorithm that just provides the one and only correct syntax tree. This in turn means that you have to come up with heuristics that identify reasonable changes which fix the file and that is probably not easy."
I don't understand that question. Given the following source file that does not parse

    var foo = bar baz

there are many ways to change it and make it parse, including the following reasonable ones
    var foo = barbaz
    var foo = "bar baz"
    var foo = { bar, baz }
    var foo = bar // baz
    var foo = bar
    //var foo = bar baz
    var foo = bar * baz
    var foo = bar + baz
    var foo = bar.baz
    var foo = bar(baz)
but also unreasonable ones like

    var abc = 123

and therefore a parser that can handle malformed inputs has to make educated guesses about what the input was actually supposed to look like. And don't be fooled by this simple example: imagine a long source file with deeply nested code in a language with curly braces, and randomly delete some of the braces. Now try to figure out where classes, methods, if or try statements begin and end in order to produce a [partial] syntax tree that is better than just giving up at the position of the first error.
My point was that test suites should give you a heuristic for which corrections are good and which are bad. A source code change that turns a failing test into a passing one should be considered an improvement.
I am still lost. A test suite for what? We have a parser - binary, source code, and maybe a test suite, if the parser developers decided to write tests - and a random text file that we throw at the parser and for which the parser hopefully generates a useful syntax tree, provided the content is a well-formed or not too badly malformed program in a language the parser understands.
So I can only use the diff tool to compare two non-compiling versions of a source file if I provide a test suite for that file to the diff tool? And how would you make use of the test suite? Before you can run the test suite, the source file must already parse and compile, which is already more than a diff tool based on a syntax tree requires - it must be able to parse the source code, but it does not have to compile it. Passing the test suite requires even more: not only parsing and compiling, but also exhibiting the correct behavior, which the diff tool does not care about.
And you actually jumped over the hard part that requires the heuristics: how to modify the input in order to make it parse. Take a 10 kB source file and delete 10 random characters - how will you figure out which characters to put back where? With 100 possible characters, 10,000 positions to insert a character, and having to insert 10 characters, you are looking at something like 10^60 possible modifications - each single insertion already has 100 × 10,000 = 10^6 options, and (10^6)^10 = 10^60. You are certainly not going to try them one after another, each time checking whether the modified source file parses, compiles, and passes the test suite.
> So I can only use the diff tool to compare two non-compiling versions of a source file if I provide a test suite for that file to the diff tool?
Not sure what this whole straw man is about; I definitely didn't suggest anything like that. Of course you can only compare two compiling versions of a source file using a test-suite-based heuristic. I thought this whole thing was about the "heuristics that identify reasonable changes which fix the file" mentioned above? Reasonable changes that DON'T fix the file are clearly recognizable by NOT passing the test suite, just as a human attempting those changes would discover that the change didn't in fact yield the desired result after running the test suite.
> With 100 possible characters, 10,000 positions to insert a character, and having to insert 10 characters, you are looking at something like 10^60 possible modifications.
If you're working with an AST, you're almost certainly not working with characters - that would be immensely wasteful. In fact, working with an AST is pretty much the only way the set of candidate changes is reduced enough for almost any change to NOT be rejected outright. With character-level modifications, you face the problem that almost every edit will be rejected as early as the parsing stage.
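To put a rough number on the difference, here is a purely illustrative sketch (all types invented) that enumerates repairs at the tree level for the "var foo = bar baz" example from upthread:

    // Illustrative only: "identifier followed by identifier" admits just a
    // handful of structural repairs - joining the two names with a binary
    // operator, a member access, or a call.
    type Expr =
      | { kind: "Ident"; name: string }
      | { kind: "Binary"; op: string; left: Expr; right: Expr }
      | { kind: "Member"; object: Expr; property: string }
      | { kind: "Call"; callee: Expr; args: Expr[] };

    function candidateRepairs(left: string, right: string): Expr[] {
      const l: Expr = { kind: "Ident", name: left };
      const r: Expr = { kind: "Ident", name: right };
      return [
        ...["+", "-", "*", "/"].map(
          (op): Expr => ({ kind: "Binary", op, left: l, right: r })
        ),
        { kind: "Member", object: l, property: right },
        { kind: "Call", callee: l, args: [r] },
      ];
    }

    // 6 structural candidates instead of something like 10^60 character edits.
    console.log(candidateRepairs("bar", "baz").length);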
We have obviously been talking past each other. My point was that a parser for a syntax-tree-based diff tool should probably be able to deal well with files containing syntax errors, i.e. it must be able to recover from them. And by fixing syntax errors I did not mean actually fixing the file, but being able to construct a reasonable syntax tree even if some subtrees do not adhere to the grammar. Given an input like
    class foo
    {
        function bar() {
        function baz() { }
    }
it should be able to parse the file as if bar() were not missing its closing curly brace. If the parser just gave up, or inserted the closing curly brace at the end,
    class foo
    {
        function bar() {
        function baz() { }
    }
    }
thereby making baz() a nested function inside of bar(), the result would be worse than using a character-based diff algorithm. But I never intended to say anything about making code functionally correct - that is none of the business of a parser or a diff algorithm.
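Incidentally, indentation is one cheap signal a recovering parser can use to prefer the sibling reading over the nested one. Here is a toy sketch of that heuristic (entirely hypothetical, not taken from any real parser, though production compilers reportedly use similar indentation hints in their mismatched-brace diagnostics):

    // Toy heuristic: a block's contents should be indented deeper than the
    // line that opened it. The first line that falls back to (or below) the
    // opener's indentation without being a closing brace marks a good spot
    // for the missing "}".
    function guessMissingBraceLine(lines: string[]): number | null {
      const indent = (s: string) => s.length - s.trimStart().length;
      const openers: number[] = []; // indentation of each unclosed "{" line
      for (let i = 0; i < lines.length; i++) {
        const t = lines[i].trim();
        if (t === "") continue;
        if (
          openers.length > 0 &&
          !t.startsWith("}") &&
          indent(lines[i]) <= openers[openers.length - 1]
        ) {
          return i; // insert the missing "}" before this line
        }
        for (const ch of t) {
          if (ch === "{") openers.push(indent(lines[i]));
          else if (ch === "}") openers.pop();
        }
      }
      return openers.length > 0 ? lines.length : null; // unclosed at EOF
    }

On the class foo example above, this points at the "function baz" line, i.e. bar() gets closed before baz() starts - siblings, not nested - which is exactly the reading a character-based diff would miss.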
What you do is produce an AST where some nodes indicate syntax errors. This works best in languages where it is easy to resynchronize after an error, of course.
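And here is a minimal sketch of that error-node-plus-resynchronization idea (the toy grammar and all names are invented for illustration):

    // Toy grammar: a program is a list of `var <name> = <identifier> ;`
    // statements. On a parse error we record an Error node covering the bad
    // tokens, skip ahead to the next ";" (the resynchronization point), and
    // keep parsing the remaining siblings.
    type AstNode =
      | { kind: "VarDecl"; name: string; value: string }
      | { kind: "Error"; skipped: string[] };

    function parseProgram(tokens: string[]): AstNode[] {
      const nodes: AstNode[] = [];
      let pos = 0;
      while (pos < tokens.length) {
        if (
          tokens[pos] === "var" &&
          tokens[pos + 2] === "=" &&
          tokens[pos + 4] === ";"
        ) {
          nodes.push({ kind: "VarDecl", name: tokens[pos + 1], value: tokens[pos + 3] });
          pos += 5;
        } else {
          // Panic mode: swallow tokens into an Error node until a likely
          // statement boundary, then resume normally with the next sibling.
          const skipped: string[] = [];
          while (pos < tokens.length && tokens[pos] !== ";") {
            skipped.push(tokens[pos]);
            pos++;
          }
          if (pos < tokens.length) pos++; // consume the ";"
          nodes.push({ kind: "Error", skipped });
        }
      }
      return nodes;
    }

    // The broken statement becomes an Error node; its sibling still parses:
    console.log(parseProgram(["var", "foo", "=", "bar", "baz", ";", "var", "ok", "=", "qux", ";"]));

Statement terminators like ";" and "}" are what make a language easy to resynchronize: they give the parser natural points at which to start over.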