AI Haskell Generation

What it really, really sucks at is POSIX sh scripts.

I switched last week because of this thread and I can confirm: Claude is at least 10x better than OpenAI’s best GPT model. I have been using AIs extensively to teach me (and before that, Haskell). OpenAI was painful at best. Claude, in contrast, has given me far more correct answers than OpenAI. It’s not even close. Still, they’re not amazing… just better than getting yelled at on StackOverflow for posting a dupe.

I honestly wish I had a mentor instead though. It would be nice to bounce my ideas off of someone rather than a glorified auto-complete.

1 Like

Gemini CLI has been very usable for me for the last couple of days. It makes refactoring less of a time investment than it used to be.

I have Claude Code running in the background writing Haskell all day every day. It does a very good job most of the time, with the usual caveats (superfluous comments, the odd rabbit hole in the wrong direction, etc).

How does that work/look?

Do you just give it a system design document and leave it for days? How and when do you give it feedback?

I’ve also been impressed by Claude’s and GPT’s abilities, but they seem to be incredibly inconsistent, and when they do end up following the odd rabbit hole, it usually ends in disaster.

No, I still have to check in on it semi-regularly, I’m just not constantly watching it. Over time you get a sense of how to break up the work in ways that are likely to lead to better outcomes. Tactics like writing a plan to disk, telling Claude to update the tasks as it completes them (this sometimes takes coaxing to get it to work consistently), and making regular commits mean you can backtrack and pick up where you left off when it goes down rabbit holes. It also means you can clear the current context and inject its own description of the WIP at any time. The hallucinations definitely grow the closer the context gets to 100% full.

Using “Plan Mode” to come up with the task document and plan is a regular technique I use. Sometimes you can formulate a Claude Code command or skill to perform mechanical refactorings. For example, over the last three and a bit weeks I’ve had it converting all of our Dhall (which we were using for CloudFormation configuration, some 45,000 lines) to Haskell using the stratosphere library. This has worked fantastically well for a task that would have taken a human a very long time indeed, was causing us a fair bit of grief, but was difficult to schedule a whole project around (for commercial reasons). When you work this way, for the first couple of iterations you can end each session with “update the command with information I could have given you up front that would have made you more efficient in completing this task”, and it really sings after a few rounds of that.
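
For anyone curious what the target of such a conversion looks like, here’s a minimal sketch of the general shape: a config record as a plain Haskell value, rendered to CloudFormation JSON. The Queue type and field names are made up, and I’m emitting the JSON with aeson purely for illustration; the real project goes through the stratosphere library’s own API instead.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (Value, encode, object, (.=))
import qualified Data.ByteString.Lazy.Char8 as BL

-- Hypothetical config record: the Dhall record becomes an ordinary value.
data Queue = Queue
  { queueName         :: String
  , visibilityTimeout :: Int
  }

-- Render the record as a CloudFormation resource (aeson stands in for
-- the stratosphere calls here).
queueResource :: Queue -> Value
queueResource q = object
  [ "Type" .= ("AWS::SQS::Queue" :: String)
  , "Properties" .= object
      [ "QueueName"         .= queueName q
      , "VisibilityTimeout" .= visibilityTimeout q
      ]
  ]

main :: IO ()
main = BL.putStrLn (encode (queueResource (Queue "jobs" 300)))
```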

3 Likes

I pushed up the Claude Code command I used for that Dhall conversion here if you’re interested. The command itself was drafted by me, but by the time it got “mature” most of it was written by Claude. I was really pleased with how the structure let me clear the context and just run /convert-service-cloudformation at any time, and it worked beautifully.

7 Likes

Thanks! That’s really useful!

1 Like

I’m curious how much cost you rack up per day? Presumably not more than the cost of an FTE, but curious.

On average about $50 USD per day, but that’s just my monthly total divided by 30. The bill is largely dictated by how often I’m checking in and unblocking Claude Code, which happens far less often outside business hours, so call it roughly $75 USD per working day as an estimate.

1 Like

Shameless plug: even with the extra productivity, we’ve still got more projects in the works than we have people to do them, so look out for us hiring in the new year.

1 Like

To provide some real-world feedback (and since it is a bit relevant): I can say that currently, claude-code is capable of assisting in writing Haskell fairly well. Nearly all lines of code in my current project (GitHub - n3wm1nd/runix-project, a polysemy-effects-based task/LLM library, with GitHub - n3wm1nd/runix-code, a claude-code-like coding assistant; feedback welcome) were written by Claude (with heavy guidance), so it is getting very practical. Pure one-shot performance is not yet at the level of JS/Python, but if you factor in how well the generated code actually works in the end, it might be on par.
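
For readers who haven’t used polysemy: the “effects-based task/LLM library” idea roughly means modelling LLM calls as an effect you can swap interpreters for. A minimal sketch of that pattern (the Llm effect and names below are made up for illustration, not runix’s actual API):

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, GADTs, LambdaCase,
             PolyKinds, ScopedTypeVariables, TemplateHaskell,
             TypeOperators #-}

import Polysemy

-- A made-up effect for "ask the model something".
data Llm m a where
  Complete :: String -> Llm m String

makeSem ''Llm  -- generates: complete :: Member Llm r => String -> Sem r String

-- A pure interpreter, handy for tests; a real one would call an API instead.
runLlmStub :: Sem (Llm ': r) a -> Sem r a
runLlmStub = interpret $ \case
  Complete prompt -> pure ("stub answer for: " ++ prompt)

main :: IO ()
main = putStrLn (run (runLlmStub (complete "explain polysemy in one line")))
```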

1 Like

For Claude Code, are there any interesting Haskell-specific skills? I was thinking, for example, of a skill focused on generating lucid2 code, whose syntax is similar, but not identical, to plain HTML. Although perhaps Claude Code can do a good enough job on its own :thinking:
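
For context, the kind of thing such a skill would be teaching the model: lucid2’s DSL looks roughly like this (a minimal fragment, nothing project-specific):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Lucid  -- package: lucid2

-- Element names carry a trailing underscore; attributes go in a list.
page :: Html ()
page =
  div_ [class_ "card"] $ do
    h2_ "Hello"
    p_ $ do
      "Rendered with "
      a_ [href_ "https://hackage.haskell.org/package/lucid2"] "lucid2"

main :: IO ()
main = print (renderText page)
```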

1 Like

This thread is a great time capsule for how far LLMs have progressed in just a year wrt writing Haskell.

I’ve done a couple of experiments with the Google and Anthropic models recently and will post about them soon.

Edit: here it is [vibe coding] Text similarity search via normalized compression distance

My big takeaway from the experiment above is not the library per se, or the particulars of the programmer-LLM interaction, but that (1) modern LLMs can produce full Haskell libraries (provided some domain knowledge and programmer experience), and (2) we can now fill ecosystem gaps much faster.
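
For readers who haven’t met the metric: normalized compression distance compares two strings by how much better they compress together than apart. A minimal sketch of the idea (not the linked library’s API), using the zlib package:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Codec.Compression.Zlib as Z   -- package: zlib
import qualified Data.ByteString.Lazy as BL

-- Compressed size of a lazy ByteString, in bytes.
compressedSize :: BL.ByteString -> Int
compressedSize = fromIntegral . BL.length . Z.compress

-- Normalized compression distance:
--   NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd :: BL.ByteString -> BL.ByteString -> Double
ncd x y =
  let cx  = compressedSize x
      cy  = compressedSize y
      cxy = compressedSize (x <> y)
  in fromIntegral (cxy - min cx cy) / fromIntegral (max cx cy)

main :: IO ()
main = do
  print (ncd "the quick brown fox" "the quick brown fox jumps")  -- small
  print (ncd "the quick brown fox" "completely unrelated text")  -- larger
```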

4 Likes

My present angle is to look at formal methods, particularly ones which scale to larger systems. If writing code is the process of creating bugs, we’re possibly accelerating our bug production rate, and that also applies to the test suites, which in my experience people aren’t properly reading. Additionally, nobody wants to read generated code, so it somewhat sucks the fun out of coding.

I think of test suites, contracts and static types (refinements, dependent) as a kind of double-entry accounting; you state the same thing twice (object code and meta code), hoping that you won’t make the same mistake twice. Test suites don’t push constraints into the code, so they don’t scale, they don’t compose. Worse, nobody wants to write or read them. Separately, I’ve seen “spec-driven development”, wherein one generates rather large documentation about code and the LLM basically keeps the two in sync by blocking inconsistency and aiding change; I’m not sure anyone but the LLM wants to read that, though.
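
A toy example of the “say it twice” framing in QuickCheck terms (the function and properties are illustrative only):

```haskell
import Test.QuickCheck

-- Object code: a reimplementation of reverse.
rev :: [a] -> [a]
rev = foldl (flip (:)) []

-- Meta code: the same intent stated a second time, as properties.
prop_involutive :: [Int] -> Bool
prop_involutive xs = rev (rev xs) == xs

prop_matchesSpec :: [Int] -> Bool
prop_matchesSpec xs = rev xs == reverse xs

main :: IO ()
main = do
  quickCheck prop_involutive
  quickCheck prop_matchesSpec
```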

Types and proofs cost less to read and write than the code they talk about, so in an ecosystem where the cost of labour for producing code has gone down, but with no equivalent leg-up in terms of correctness or abstraction forging, formal methods might be the antidote. Especially if you care about staying in the driver’s seat in the software development world, rather than being replaced by someone with fewer scruples than you about ceding their cognition to move faster.

I’ve gained a renewed interest in Liquid Haskell, dependent types, and Ghost of Departed Proofs, and I’m looking at Lean differently. TLA+. CT. Anything that makes you a better thinker. I haven’t found another way to navigate how industry and open source are changing now that the genie is out of the bottle.
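
For concreteness, the kind of thing Liquid Haskell lets you say (a minimal sketch, assuming the liquidhaskell GHC plugin; the names are illustrative):

```haskell
{-# OPTIONS_GHC -fplugin=LiquidHaskell #-}
module Average where

-- The refinement is a second, machine-checked statement of intent:
-- the divisor must be strictly positive, so the division can't blow up.
{-@ type Pos = {v:Int | 0 < v} @-}

{-@ average :: [Int] -> Pos -> Int @-}
average :: [Int] -> Int -> Int
average xs n = sum xs `div` n
```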

5 Likes

I do agree: just increasing the volume of code produced (even in tests) does not actually solve any problems in a scalable way. Prompting an LLM with “and make tests for that” does not automagically fix that.

That does not mean that LLM generation of test code is entirely pointless, though: it works very well where, in other languages, you would litter a program with print statements to quickly drill down on an issue, since you just need to get the code in question to execute. Nobody really wants to write all the test code required to set up all the values and state for a function to be called, but that’s work an LLM can perform very well. As a bonus, you keep the test around as documentation that there was an issue there once, and as a guard against regression. But that’s not code you insist on keeping, or expect anyone to actually read with any kind of focus.
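
Concretely, the kind of throwaway-but-kept test I mean, here with hspec (the function under test is a made-up stand-in for whatever you were about to debug with print statements):

```haskell
import Test.Hspec

-- Hypothetical function under investigation.
parsePort :: String -> Maybe Int
parsePort s = case reads s of
  [(n, "")] | n > 0 && n < 65536 -> Just n
  _                              -> Nothing

main :: IO ()
main = hspec $
  describe "parsePort" $ do
    it "rejects the empty string" $
      parsePort "" `shouldBe` Nothing
    it "rejects trailing garbage" $
      parsePort "8080xyz" `shouldBe` Nothing
    it "accepts a plain port number" $
      parsePort "8080" `shouldBe` Just 8080
```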

The other category, though, is tests that are thought out, structured, and composable. Here too LLMs are not entirely useless: once a pattern is established, they are pretty capable of following it (see this test of LLM functionality for example; an LLM is more than capable of extending tests in that pattern all day long).

Spec-driven tests, too, are something LLMs are not that bad at generating, but here you’re basically only saving yourself from having to write down the code: you still have to be very specific about what you want tested and what properties you expect to hold true.

I do feel that when it comes to LLM-generated code, Haskell has its advantages: while not exactly in “formal verification” territory, its strong type system alone gives the AI a few anchor points to structure its code around, with the compiler making sure those types actually match up in the end. Especially for simpler functions, that works surprisingly well: generate the types you know you’ll need and the function signatures of some core parts of the problem, and then let the LLM fill in the gaps (a sketch of that scaffold below).
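
A sketch of that scaffold: the types and signatures are the part you write (or dictate) yourself, and the error stubs mark the gaps you hand to the model. All names here are illustrative.

```haskell
-- Hand-written scaffold: types and signatures pin down the shape of the
-- solution; the bodies are left as stubs for the LLM to fill in.

newtype DocId = DocId Int
  deriving (Eq, Ord, Show)

data Doc = Doc
  { docId   :: DocId
  , docBody :: String
  } deriving Show

type Index = [(String, [DocId])]

tokenise :: String -> [String]
tokenise = error "TODO: have the LLM fill this in"

buildIndex :: [Doc] -> Index
buildIndex = error "TODO: have the LLM fill this in"

search :: Index -> String -> [DocId]
search = error "TODO: have the LLM fill this in"
```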

1 Like

I hope nobody interpreted my remark above as meaning “send a PR with whatever crap Claude comes up with on first try”.

I agree that spending brain cycles thinking about invariants is the new frontier of the profession now that code is cheap to write, but the basic fact that no formal method can address is that natural language does not map 1:1 onto executable code.

“Semantic parsing” is an approximation, and in the same vein, producing formal specifications of a piece of intended software just moves the goalposts of correctness; it doesn’t make the problem disappear. This, by the way, has been observed many times in the past and is nothing new: Galois - Specifications Don’t Exist

In Haskell we have many fine tools for keeping the LLM honest, and deploying them all has basically become much cheaper now too: “enable -Wall -Werror and fix all resulting breakage”, “write property tests for this invariant”, and so on.
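
The first of those is literally one pragma per module (or a ghc-options stanza in the cabal file); the module below is just a placeholder to show where it goes:

```haskell
{-# OPTIONS_GHC -Wall -Werror #-}
-- Any warning in LLM-written code in this module now fails the build
-- instead of scrolling past unnoticed.
module Greeting (greet) where

greet :: String -> String
greet name = "Hello, " ++ name
```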

Edit: btw Happy new year all!

1 Like

Nice article link, thank you! I’ll keep that one as it echoes a lot of my thinking.