I have drafted a HF proposal for work on making GHC’s compiler pipeline more fault-tolerant, that is:
Capable of producing partial output if there is an error (rather than stopping immediately)
Capable of consuming partial output and still producing useful diagnostics
This would solve the problem where errors from earlier stages (such as parsing) mask errors from later stages (such as typechecking); as well as benefiting tools that use the compiler, which could then work on partially broken programs.
That seems a great idea (especially now with IA, we don’t take “computer says no” as easily)
User
fix foo :: Int → Int
foo =
bar :: Int → Int
bar = 1 + foo C@@@GPT
It looks like there might be a small error in your code. The foo function seems to be missing its implementation. Here’s a corrected version:
foo :: Int -> Int
foo x = x * 2 -- or any other implementation you desire
bar :: Int -> Int
bar x = 1 + foo x
Now, foo has a simple implementation where it doubles the input x, but you can replace it with any other implementation that suits your requirements. The bar function adds 1 to the result of foo.
I would be happy to help (but I don’t know anything about GHC). Would there be tasks open to GHC beginner ?
Very interesting, thank you for writing this down!
Just a couple thoughts for now:
Step 1: handle fully-recoverable parse errors
A “fully-recoverable” parse error is one where we can know (or make a very good guess) what the user intended. This usually means guessing a few missing literal tokens, such as a trailing bracket.
tup :: (Int, Int)
tup = (1, 2
The example you give is only ambiguous because of the type signature (which is often not even present).
Without such hints, it becomes a probabilistic sort of prediction, which is where corpus-based approaches (or language models as suggested by @maxigit) can help, but do we really want to 10x-50x the size of GHC to put CodeLLaMa or friends inside it?
Step 2: handle other parse errors by producing partial output
I think GLR parsers like tree-sitter do this and it would perhaps make sense to leverage them.
Step 3 etc.
Accepting distributions of tokens/types/ASTs is even harder: the work for each stage multiplies by the number of predictions, and the subsequent output in turn.
I am not suggested the use of a language model, but enterprise IDE will, so the tooling situation will look even worst compare to what people will get used to at work (where their companies will pay for Ai subscription).
So having a more tolerant GHC is a really good idea (until we ALL use language model in our daily life).
To be honest, I’m unsure at this point. I also don’t know enough about GHC to say how hard this stuff is (and I’m hoping to get feedback on that…). I expect that once the infrastructure is in place then it shouldn’t be too hard to contribute by making the various stages more fault-tolerant in one way or another.
The example you give is only ambiguous because of the type signature (which is often not even present).
Without such hints, it becomes a probabilistic sort of prediction
Error recovery by doing token insertion (or deletion) is not really new technology, it’s been around for a long time. So I’m not too worried about this problem, I think we can probably pick good heuristics in most cases.
I think GLR parsers like tree-sitter do this and it would perhaps make sense to leverage them.
I think using the ambiguity-handling of GLR parsers for error recovery might go badly with files that have many errors, since I think the number of alternatives would grow exponentially. Whereas for well-formed examples of more typical ambiguous grammars you hopefully have a smaller number of choice points. Better I think to do local error recovery and continue with a single parse?