On the Speed of Light, Innovation, and the Future of Parsing (codekana.com)
71 points by jng on April 2, 2009 | 23 comments


And here's something I'd like to see in the future: A persistent OS-provided facility for explaining the origin of configuration decisions.

Linux has most configuration in text files, which is already better than binary files, but sometimes you have to do a lot of digging around in documentation to determine what statements of what files determine what configuration options.

In many ways, the links between configuration files are as opaque as if they were dynamic (try determining the semantic links between disparate config files using standard Unix tools). Yet most configuration is in fact static, so it would be great to have easy access to this data (such as the dependencies between files) to aid automatic dependency management and debugging.

It would be great if in most cases, when the actual configuration differs from what you expect, you could ask the system why and it would tell you all the specific statements of config files that affected the configuration of the particular thing you're asking about (at various levels of granularity and abstraction). You could then easily see what you need to change to have the configuration you want.
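To make it concrete, here's a toy sketch (the file format, function names, and tool are all made up for illustration) of what such a "why" query could look like: a loader that records, for every key, which file and line set it, and which earlier settings it overrode.

    # Toy sketch (hypothetical names): record where each config value came from,
    # so "why is key X set to value Y?" can be answered with file/line provenance.
    import re

    def load_with_provenance(paths):
        values, history = {}, {}           # key -> value, key -> [(file, line, value), ...]
        for path in paths:                 # later files override earlier ones
            with open(path) as f:
                for line_no, line in enumerate(f, 1):
                    m = re.match(r'\s*(\w+)\s*=\s*(.+?)\s*$', line)
                    if m:
                        key, value = m.groups()
                        values[key] = value
                        history.setdefault(key, []).append((path, line_no, value))
        return values, history

    def why(key, values, history):
        if key not in values:
            return f"{key} is not set by any of the loaded files"
        *overridden, (path, line_no, value) = history[key]
        report = [f"{key} = {value}   (set at {path}:{line_no})"]
        report += [f"  overrides {v!r} from {p}:{n}" for p, n, v in overridden]
        return "\n".join(report)

Real configuration is layered across formats and include mechanisms, so an actual facility would have to hook the parsers themselves, but the principle is the same: keep the provenance next to the value.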


On the idea that problems which seem solved to some people are in fact far from solved: indeed, of course.

I suspect, however, that the idea that there has ever been some kind of consensus that all problems in software were solved, or that all problems in physics were solved, is BS. Fun BS, because it is always fun to ascribe a single personality to a large group, and then point out the foolish inconsistencies or limitations in what "they" say, as though "they" were a single person.


This was a nice read and I learned a thing or two, thanks.

Hopefully progress continues to be made in areas like better parsing for more helpful coding environments, because while it feels incredibly "impure", there are times when I almost feel that having language rules (such as "End If", though hopefully not as ugly) solely for the purpose of making my editor more helpful would be a worthwhile tradeoff. Emphasis on "almost", but still.

Just as intelligent programming language design saves time, intelligent syntax highlighting and other editor-provided assistance does too (and by a non-negligible amount, at that).


The big new idea that's said to be uncharted territory reminds me of research from the 90s, actually:

"We also exploit the tight integration between versioning and incremental analysis to provide a novel history-sensitive approach to error handling. Our error recovery mechanism reports problems in terms of the user's own changes in a language-independent, non-correcting, automated, and fully incremental manner."

http://techreports.lib.berkeley.edu/accessPages/CSD-97-946?a...

(Excellent thesis, BTW.)


I'm the author of the article. I was aware of the research you link to. I enjoyed it, but research is too often focused on the formal side and too little on the user-experience side. I even mention it succinctly in the article (not by name):

"I had a look at what was available on incremental parsing, and found a few research initiatives, and even mentions of a few products. But it turns out none of them was focused on the kind of problem I was trying to solve. They were either trying to improve the performance [...] or they were trying to study the formal mathematics of incremental parsing"

I mean to include other research initiatives in the above description.

Their technique is based on not accepting text edits until they parse properly, a technique which I describe in the article and which I dismissed as a hack that is too basic for my goals.

My system is more sophisticated and powerful: I actually incorporate partially-wrong stuff into the parse tree! My tool highlights an incomplete 'if' block with the correct if-color, because it knows what it's dealing with!
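Just to give a flavor of the idea (this is not my actual implementation; the structures and names below are invented for illustration): an error-tolerant tree can create a node for a block as soon as its opener is seen, with a flag saying whether it was ever closed, so the highlighter can still give it the right color.

    # Illustrative sketch only, not codekana's real data structures: a block node
    # that exists (and can be highlighted) even before its closing token is typed.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Block:
        kind: str                     # e.g. "if", "while", "paren"
        start: int                    # offset of the opening token
        end: Optional[int] = None     # None while the block is still unterminated
        children: list = field(default_factory=list)

        @property
        def complete(self):
            return self.end is not None

    def highlight(block, depth=0):
        status = "complete" if block.complete else "unterminated"
        print("  " * depth + f"{block.kind} [{status}] -> use the '{block.kind}' color")
        for child in block.children:
            highlight(child, depth + 1)

    # An 'if' whose closing brace hasn't been typed yet still gets the if-color:
    tree = Block("function", 0, 120, [Block("if", 10)])
    highlight(tree)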

I can also apply heuristics to choose the preferred parsing, which they don't do.

There's plenty of work pending in incremental parsing, and that thesis just skims the surface (even if the academic-looking presentation may seem like it's going very deep).


I enjoyed reading your article about parsing. Any thoughts on open sourcing your parsing work?

When you mention "lifetimes of research pending to be done in parsing", are you extending the reach of parsing beyond text: to speech, human conversational analysis, a baby acquiring language, or maybe even more abstract communication, like automatically parsing visual patterns?


Thanks for the kind comment. Nowadays, my products are my source of funding, so no chance to open source anything. In the future, I'm sure some things will make their way into more open forms (not totally sure about what form).

Every computation can be seen as an exercise in parsing: the cases you mention, and many others.

I'm working on a new computation model, a new programming language, and a new type of VM. But I don't want to talk much about it until the work is more advanced.


I found myself thinking I could see the answer as you were describing the problem. That's good writing!

My guess would be that you'd need a new data structure for the text (or one pointing into the plain text) that tracked the order of editing, plus a corresponding set of new lexemes and reductions "sitting on top" of the existing parse tree, until the new edits plus the intervening text/parse tree reduce into something acceptable for the node of the parse tree in which the new text lives.
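As a toy rendering of that guess (all names invented, and surely nothing like the real thing): keep the pending edits, in the order they were typed, attached to the enclosing node, and only fold them in once the result is acceptable for that node.

    # Toy illustration of the guess above (names invented): pending edits are
    # kept in the order they were typed, attached to the enclosing node, and
    # only folded into the node once the edited text "reduces" acceptably.
    from dataclasses import dataclass, field

    @dataclass
    class Edit:
        offset: int
        inserted: str

    @dataclass
    class Node:
        kind: str
        text: str
        pending: list = field(default_factory=list)   # edits not yet folded in

        def apply_pending(self, reduces_ok):
            candidate = self.text
            for e in self.pending:                     # in order of editing
                candidate = candidate[:e.offset] + e.inserted + candidate[e.offset:]
            if reduces_ok(candidate):                  # e.g. reparses as this node's kind?
                self.text, self.pending = candidate, []
            return not self.pending

    # The 'if' node only accepts the edit once the braces balance again:
    node = Node("if", "if (x) { }")
    node.pending.append(Edit(8, "y = 1; "))
    print(node.apply_pending(lambda s: s.count("{") == s.count("}")))   # True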

Fun stuff.


Thanks for the compliment! Actually, I'd think it's probably due to writing the article "after the fact", rather than to good writing.

The proper approach to a problem seems so obvious once you know it...


I'd be interested in learning more about how you implemented this.

I recommended something similar to this for use as the parser for the VB.NET IDE, back around the start of the VS 2008 development cycle. It ended up being kind of expensive, because it required pushing changes all the way down to the text representation. All the IDE schedule time for Orcas went into supporting the LINQ features, so nobody was really interested in hearing about expensive IDE changes (by the way, I'm not disputing the choice; from a product perspective, it was the right one).

I was focused on making the parser faster and on minimizing rebinding of declarations, rather than on improving intermediate states, but I can see how it might be possible to make your approach work.

My idea was based on several things:

1) Incremental tokenization, using a specialized editor text representation

2) Knowledge of the amount of lexical and syntactic lookahead needed by the VB grammar

3) An index from tokens to parse tree nodes

When an edit happened, the incremental scanner would move backwards a constant number of tokens (based on some grammar properties) and then scan forward, patching up the token stream, until the scanner produced a duplicate token at the appropriate offset position.
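In stripped-down form (toy lexer and invented names, nothing like the real code), that rescanning loop would be something like:

    # Rough sketch of the rescanning idea described above (toy lexer, made-up names).
    import re

    TOKEN = re.compile(r'\s*([A-Za-z_]\w*|\d+|.)')   # identifiers, numbers, single chars

    def scan(text, start=0):
        """Yield (offset, lexeme) pairs from 'start' to the end of the text."""
        pos = start
        while pos < len(text):
            m = TOKEN.match(text, pos)
            if not m:
                break
            yield m.start(1), m.group(1)
            pos = m.end()

    def rescan(old_tokens, new_text, edit_offset, length_delta, lookback=2):
        """Rescan after an edit; stop when a new token equals an old token at the
        same shifted offset (the 'duplicate token' termination condition)."""
        # Back up a constant number of tokens before the edit.
        before = [t for t in old_tokens if t[0] < edit_offset]
        start = before[-lookback][0] if len(before) >= lookback else 0
        shifted = {(off + length_delta, lex) for off, lex in old_tokens if off >= edit_offset}
        patched = []
        for off, lex in scan(new_text, start):
            if (off, lex) in shifted:          # token stream has re-synchronized
                break
            patched.append((off, lex))
        return start, patched                  # the region whose tokens changed

    old_text = "if (foo) { bar(); }"
    old_tokens = list(scan(old_text))
    new_text = "if (foo) { baz(); }"           # edit: 'bar' -> 'baz' at offset 11
    print(rescan(old_tokens, new_text, edit_offset=11, length_delta=0))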

The parse tree index would then be used to figure out where to restart parsing, again based on some properties of the grammar.

My focus was on producing the same trees as the existing parser, and on minimizing changes to declarations (because of all the rebinding that needs to be done). So my incremental termination condition was based on entering a method (in the top-down, hand-written parser) that would generate a node of the same type as the one at the "current position" (the offset based on the edit) in the existing parse tree, with the same parent and the same starting token (based on object identity). This all worked because the scanner was incremental, which required the editor to structure its data in a certain way.
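As a made-up sketch of that termination check (again, nothing like the real VB.NET parser internals): an index from starting tokens to the nodes they start, plus a predicate saying "the node I'm about to build already exists with the same kind, parent, and starting token, so I can stop here".

    # Sketch of the termination condition described above (made-up structures,
    # not the real VB.NET parser internals).
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass(eq=False)
    class Token:
        text: str
        offset: int

    @dataclass(eq=False)
    class Node:
        kind: str                              # e.g. "MethodBlock", "IfStatement"
        start_token: Token
        parent: Optional["Node"] = None
        children: list = field(default_factory=list)

    def build_index(root):
        """Index from starting token (by object identity) to the node it starts."""
        index, stack = {}, [root]
        while stack:
            node = stack.pop()
            index[id(node.start_token)] = node
            stack.extend(node.children)
        return index

    def can_stop(index, kind, parent, start_token):
        """True if the node about to be parsed already exists in the old tree with
        the same kind, the same parent, and the same starting token object."""
        old = index.get(id(start_token))
        return old is not None and old.kind == kind and old.parent is parent

    # Usage: before descending into a construct, check the old tree.
    tok = Token("If", 42)
    method = Node("MethodBlock", Token("Sub", 10))
    old_if = Node("IfStatement", tok, parent=method)
    method.children.append(old_if)
    index = build_index(method)
    print(can_stop(index, "IfStatement", method, tok))   # True: safe to stop reparsing here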

I would imagine that you probably used something similar to this, but with different termination conditions.

I would really like to see how similar (or not) what you did is to the scheme I came up with.


I'm kind of working on something similar.

One thing that struck me about C++ is that although template parsing, etc., is pretty hard, you can get a long way by segmenting the source into delimited regions. This might also speed up your semantic analysis, by allowing you to write code that operates on a known kind of region.

For instance, after pre-processing, it should be easy to find nested "{", "[", "(", "\"", "'", and "<". Segmenting the source this way first might make further processing easier.
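As a toy illustration (ignoring preprocessing details, strings, comments, and the ambiguity of "<" in real C++), a single scan that records the nested regions could be as simple as:

    # Toy segmentation pass over already-preprocessed source. Real C++ would
    # need string, comment and template handling on top of this.
    PAIRS = {'{': '}', '[': ']', '(': ')'}

    def segment(source):
        """Return (open_char, start, end) regions, innermost ones included."""
        regions, stack = [], []
        for i, ch in enumerate(source):
            if ch in PAIRS:
                stack.append((ch, i))
            elif ch in PAIRS.values():
                if stack and PAIRS[stack[-1][0]] == ch:
                    opener, start = stack.pop()
                    regions.append((opener, start, i))
                else:
                    regions.append(('?', i, i))      # stray closer
        regions.extend((opener, start, len(source) - 1) for opener, start in stack)
        return regions                               # unclosed openers run to EOF

    src = "int f(int x) { if (x) { return g(x); } }"
    for opener, start, end in segment(src):
        print(f"{opener} region: {src[start:end + 1]!r}")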

It seems that the standard parsing methods we learn in CS are optimized for single-pass speed, but not for maintainability or development efficiency. Computers have gotten fast. Using multiple passes of simpler operators seems like a promising approach.

I see the process of parsing as some kind of folding, where you start with a linear sequence of chars and gradually fold it over and over into a "lumpy" tree.


My actual scheme is similar in some ways to what you describe. More complex, because it involves many different types of structures.

Also, my approach is more explicit: I don't "restart a parse" in a general way, since I handle each possible change explicitly. This is prohibitive to do by hand for a full grammar, but I'm not doing full C/C++/C# analysis, so it's doable. It also makes the dynamic parser fully incremental.


This paper is from 1998. That's pretty recent if you consider that most really fundamental CS papers are much older, and that it takes a while for academic results to have a practical impact on actual implementations.


A simpler thing that can be exploited is indentation:

    while(1) {
      if(foo) {
         if (bar) {
           ... I'm editing here!
      }
    }
It is clear from the indentation that the unmatched bracket is the one opened by if(bar). Of course, in order to get the level of the open bracket, you need to find the index of the first non-blank character of the same line. It's a trick, but in practice it may work well.

I would like GCC to be able to give me a hint about this instead of pointing to the end of the file... that is, if an unmatched bracket is detected and there is something odd about the indentation levels of the brackets, it could show:

Warning: non closed bracket may be around line ...

It is even simple to tell from the source whether it's a program where indentation is usually respected or not.
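The heuristic is easy to prototype. As a sketch (not something GCC actually does, and it ignores brackets inside strings and comments): track each open bracket together with the indentation of its line, and when a closer appears at a shallower indentation than the innermost open bracket, flag that bracket as the likely culprit.

    # Sketch of the indentation hint (toy: not actual GCC behavior, and it
    # ignores braces inside strings and comments).
    def indent_of(line):
        return len(line) - len(line.lstrip())

    def suspect_unclosed_braces(source):
        suspects, stack = [], []              # stack of (line_no, indent of opener's line)
        for line_no, line in enumerate(source.splitlines(), 1):
            indent = indent_of(line)
            for ch in line:
                if ch == '{':
                    stack.append((line_no, indent))
                elif ch == '}':
                    # Openers indented deeper than this closer probably lost
                    # their own '}' somewhere above this point.
                    while stack and stack[-1][1] > indent:
                        suspects.append(stack.pop())
                    if stack:
                        stack.pop()           # this opener matches the closer
        suspects.extend(stack)                # openers never closed at all
        return suspects

    code = """while(1) {
      if(foo) {
         if (bar) {
           ... I'm editing here!
      }
    }"""
    for line_no, _ in suspect_unclosed_braces(code):
        print(f"Warning: non-closed bracket may be around line {line_no}")

Run on the snippet above, it flags line 3, the if(bar) line.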


I plan to add indentation support to my system in the future... it's a very interesting area: gathering information from all available sources and using that to help the user.

I do believe using the history of edits, as the article describes, has a much, much greater potential.


In the next century, I can only hope, people will look back and laugh about how we used to store source code in textual form, and marvel at the idea of a "syntax error."


I love this guy's ViEmu and definitely will pay for his new toy codekana.


Btw, every time I read "we are just at the beginning, and we already think we know all about it", I get a strange feeling. Basically, nobody can tell. The contrary has happened plenty of times: think about AI, which basically stopped producing significant results for decades. The only way to know is to invest more work in the field.

And if we want to remember the advice in "You and Your Research", what really matters is to recognize, in time, the fields where it's worth spending time and investigating further.


Maybe it's obvious just to me? There will be full AI. We will create artificial brains. We will fully understand how the parallel system in the brain works. We will create brains more powerful than ours.

I don't think it's that far, either.


The only claim I strongly disagree with is that "we will fully understand how the parallel system in the brain works". How can you be sure? If understanding is made of abstraction, what if the brain is computationally irreducible? Our understanding will increase, of course, but at some point the "explanation" of some component will be equivalent to describing all the neurons in that component and their connections.


Obviously we're discussing faith and intuition here, as none of us has the facts. I don't think there is anything computationally irreducible in the brain.

About some component not being expressible as the result of simpler computations: that has to happen at some point, by definition. But that point is probably within the reach of our understanding.

All in all, I don't think there is much magic, just lack of knowledge on our part. Of course, it's just my intuition.


Yep, I mean, in the 60s it seemed like it was a matter of a few years.


I think the problem is that it was harder than it seemed, but that doesn't mean it's not doable.

Developers are always optimistic when estimating how long it will take to do something they (we) want to do.



