Evaluating AI's Impact on Haskell Open Source Development

https://well-typed.com/blog/2025/04/ai-impact-open-source-haskell/

Well-Typed have recently contributed to a research study investigating how AI can help with realistic software development tasks. I have written a summary of my experience during the project, with some thoughts about where LLMs were useful and where they weren’t.

Has anyone else got any experience using LLMs on other large Haskell codebases? What have your experiences been?

19 Likes


I’m sure I’m not the only one who would appreciate some context for that image.

6 Likes

Ah, a nice review without hype that mentions real use cases, strengths, and weaknesses. Thank you!

You’ve mentioned that it works pretty well for repetitive refactoring tasks, like renaming fields or moving a definition to the top level. Isn’t that something already done pretty well by standard refactoring tools?
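
To make the question concrete, here is a hypothetical example of such a rename (all names invented), where the field and every use site must change together:

    -- After renaming; previously the field was called cfgPath, i.e.
    --   data Config = Config { cfgPath :: FilePath }
    --   loadConfig c = readFile (cfgPath c)
    data Config = Config { cfgInputPath :: FilePath }

    loadConfig :: Config -> IO String
    loadConfig c = readFile (cfgInputPath c)

    main :: IO ()
    main = loadConfig (Config "input.txt") >>= putStr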

For verification, how useful really was the classification that tests 1, 4 & 5 failed for the same reason, compared with fixing test 1 and seeing that tests 4 & 5 no longer fail?

Isn’t it possible that a change the AI classifies as a ‘minor output change’ is actually a symptom of a deeper failure, and that accepting that solution would only entrench the bug? FWIW, I’ve seen that quite often with junior devs & larger codebases.

Have you tried using AI to generate test cases that could be useful later as regression tests?

6 Likes

So, for extra context for people reading: this fuzz person is writing 99.9% of the Haskell code of that package using Claude. I took a browse through the code.

I think the two things that come to mind are:

  1. The Lisp Curse: Broadly speaking, the Lisp Curse is that syntactic abstraction and extension of the language are so powerful that Lisp programmers don’t bother to buy into broader abstractions, or to make compromises with the community in order to get things built into the language in the way that, say, Haskell programmers do. A lot of long discussions in the GHC proposals could often just be a macro in Lisp, for example. So in a way, every project in Lisp is reinventing many ideas without advancing the language, whereas, one could argue, many ideas that developed in Haskell have spread to other languages. It’s ironic, because when you think of, say, assembler or the C programming language, that’s the kind of thing they suffer from, and it’s not something you would imagine a really high-level language suffering from. So how a language interacts with the people that use it is quite important. (Parenthetically, I don’t worry too much about this for Elisp, because I consider Emacs to be entirely about being selfish and focusing only on your own needs. I’m actively annoyed if they update the language.)
  2. I am starting to wonder whether this practice of vibe coding, i.e. having 99% of the code generated by an LLM, may lead to a similar problem. If it’s so cheap for me to generate walls of code full of repetition, simply because I can drop them and re-create them on a whim, that reduces the value of abstraction, and it reduces another human’s interest in reading that code. (I can’t be the only one who has zero motivation to read AI-generated code.) In other words, you start losing interest in actually programming Haskell, and the language matters even less, so every project out there basically won’t need to share any abstractions or ideas, because you’re just going to use an AI to explain someone else’s project to you anyway. Then we’re kind of back to the Lisp Curse. The sketch after this list illustrates the kind of repetition I mean.
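
A contrived sketch of that repetition-versus-abstraction trade-off (invented names, not from any real project):

    -- Contrived: near-identical validators of the kind that are cheap to
    -- generate in bulk...
    validateName, validateEmail, validateCity :: String -> Either String String
    validateName  s = if null s then Left "name is empty"  else Right s
    validateEmail s = if null s then Left "email is empty" else Right s
    validateCity  s = if null s then Left "city is empty"  else Right s

    -- ...versus the abstraction a human maintainer would reach for.
    nonEmpty :: String -> String -> Either String String
    nonEmpty field s = if null s then Left (field ++ " is empty") else Right s

    main :: IO ()
    main = print (nonEmpty "name" "", validateName "Ada")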

Only half-baked thoughts but I wanted to capture the idea before it escaped me.

33 Likes

It happens a lot in GHC development that a simple change causes, say, 200 failing tests. The way I often approach that is to scan the failing tests to identify how many different kinds of failure there are. I found LLMs quite useful there. Of course, in the end I will fix one issue at a time and check the impact on the testsuite, but I find that the broader perspective can be quite valuable. Sometimes it allows you to see that the approach is fundamentally wrong, while in other cases you might figure out “it looks like I just missed two simple cases”.
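
As a rough illustration of what that scanning step amounts to, here is a minimal Haskell sketch (the log format, file name, and function names are all invented for this example): it buckets failures by the first line of their error message and counts each bucket.

    -- Hypothetical sketch: group testsuite failures by the first line of
    -- their error message, to count the distinct kinds of failure.
    -- Assumes a log where each failure starts with "FAIL <test-name>"
    -- followed by the error text (an invented format).
    import Data.List (isPrefixOf, sortOn)
    import qualified Data.Map.Strict as Map

    bucketFailures :: [String] -> Map.Map String [String]
    bucketFailures = go Map.empty
      where
        go acc (l : rest)
          | "FAIL " `isPrefixOf` l
          , (err : _) <- rest
          = go (Map.insertWith (++) err [drop 5 l] acc) rest
          | otherwise = go acc rest
        go acc [] = acc

    main :: IO ()
    main = do
      logLines <- lines <$> readFile "testsuite.log"  -- invented file name
      let buckets = sortOn (negate . length . snd) (Map.toList (bucketFailures logLines))
      mapM_ (\(err, ts) -> putStrLn (show (length ts) ++ " test(s): " ++ err)) buckets

With a couple of hundred failures, the output immediately shows whether you are looking at two root causes or twenty.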

To respond to your point about minor output changes: as you say, you can’t blindly trust the model’s answers and should treat them with suspicion. But I found that if I gave the model very clear context about:

  • the changes I made,
  • what kind of changes to the testsuite one might expect,
  • which sorts of changes it would make sense to accept,

then the model does quite well at categorising test failures in accordance with those instructions. Anyway, I thought it was helpful.
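
As an illustration of what that context can look like, here is a sketch written out as a Haskell value; the scenario and wording are invented, not what I actually used:

    -- Invented example of the kind of structured context described above.
    prompt :: String
    prompt = unlines
      [ "Change made: the pretty-printer now parenthesises more expressions."
      , "Expected testsuite impact: golden-output diffs that only add parentheses."
      , "Acceptable: diffs limited to added parentheses."
      , "Not expected: changed error messages, timeouts, or crashes."
      , "Task: group the failing tests below by cause, and flag anything"
      , "outside the expected categories."
      ]

    main :: IO ()
    main = putStr prompt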

5 Likes

I recently made some effort to do a bit of work with LLMs; here’s a short report: Experiments with writing Haskell with Aider

5 Likes