What it really, really sucks at is POSIX sh scripts.
I switched last week because of this thread and I can confirm: Claude is at least 10x better than OpenAI's best GPT model. I have been using AIs extensively for teaching me (and before that Haskell). OpenAI was painful at best. Claude, in contrast, has given me far more correct answers than OpenAI. It's not even close. Still, they're not amazing… just better than getting yelled at on StackOverflow for posting a dupe.
I honestly wish I had a mentor instead though. It would be nice to bounce my ideas off of someone rather than a glorified auto-complete.
gemini cli has been very usable for me for the last couple of days. makes refactoring less of a time investment than it used to be.
I have Claude Code running in the background writing Haskell all day every day. It does a very good job most of the time, with the usual caveats (superfluous comments, the odd rabbit hole in the wrong direction, etc).
How does that work/look?
Do you just give it a system design document and leave it for days? How and when do you give it feedback?
I've also been impressed by Claude's and GPT's abilities, but they seem to be incredibly inconsistent, and when they do end up following the odd rabbit hole, it usually ends in disaster.
No, I still have to check in on it semi-regularly, I'm just not constantly watching it. Over time you get a sense of how to break up the work in ways that are likely to lead to better outcomes. Tactics like writing a plan to disk, telling Claude to update the tasks as it completes them (this sometimes takes coaxing to get it to work consistently), and making regular commits mean you can backtrack and pick up where you left off when it goes down rabbit holes. It also means you can clear the current context and inject its own description of WIP at any time. The hallucinations definitely grow the closer the context gets to 100% full.
Using "Plan Mode" to come up with the task document and plan is a regular technique I use. Sometimes you can formulate a Claude Code command or skill to perform mechanical refactorings. For example, over the last three-and-a-bit weeks I've had it converting all of our Dhall (which we were using for CloudFormation configuration, some 45,000 lines) to Haskell using the stratosphere library. This has worked fantastically well for a task that would have taken a human a very long time indeed, was causing us a fair bit of grief, but was difficult to schedule a whole project around (for commercial reasons). When you work this way, for the first couple of iterations you can end each session with "update the command with information I could have given you up front that would have made you more efficient in completing this task", and it really sings after a few rounds of that.
I pushed up the Claude Code command I used for that Dhall conversion here if you're interested. The command itself was drafted by me, but by the time it got "mature" most of it was written by Claude. I was really pleased with how the structure let me clear the context and run /convert-service-cloudformation at any time, and it worked beautifully.
Thanks! That's really useful!
I'm curious how much cost you rack up per day? Presumably not more than the cost of an FTE, but curious.
On average about $50 USD per day, but that's me just taking my monthly total and dividing by 30. The bill is largely dictated by how often I'm checking in and unblocking Claude Code, which is far less often outside business hours, so the working-day figure is higher. Let's say $75 USD a day as a rough estimate.
Shameless plug: even with the extra productivity, we've still got more projects in the works than we have people to do them, so look out for us hiring in the new year.
to provide some real-world feedback (and since it is a bit relevant): i can say that currently, claude-code is capable of assisting in writing haskell fairly well. nearly all lines of code in my current project (GitHub - n3wm1nd/runix-project, a polysemy-effects-based task/llm library, with GitHub - n3wm1nd/runix-code, a claude-code-like coding assistant; feedback welcome) were written by claude (with heavy guidance), so it is getting very practical. pure one-shot performance is not yet at the level of js/python, but if you factor in how well the generated code actually works in the end, it might be on par.
For Claude Code, are there any interesting Haskell-specific skills? I was thinking for example of a skill focused on generating lucid2 code, with its syntax that is similar, but not identical, to plain HTML. Although perhaps Claude Code can do a good enough job on its own.
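In case it helps illustrate what such a skill would cover, here is a minimal lucid2 sketch (the page value and its contents are made up for illustration): tags become underscore-suffixed functions, attributes are a list, and nesting is plain function application, so it's close to HTML but not a one-to-one transcription.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Lucid

-- A small illustrative page: html_/head_/body_ mirror the HTML tags,
-- attributes like class_ go in a list before the children.
page :: Html ()
page =
  html_ $ do
    head_ $ title_ "Hello"
    body_ $ do
      h1_ [class_ "title"] "Hello from Lucid"
      p_ "Similar to HTML, but not identical."

main :: IO ()
main = print (renderText page)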
This thread is a great time capsule for how far LLMs have progressed in just a year wrt writing Haskell.
I've done a couple of experiments with the Google and Anthropic models recently and will post about them soon.
Edit: here it is [vibe coding] Text similarity search via normalized compression distance
My big takeaway from the experiment above is not the library per se, or the particulars of the programmer-LLM interaction, but that (1) modern LLMs can produce full Haskell libraries (provided some domain knowledge and programmer experience), and (2) we can now fill ecosystem gaps much faster.
My present angle is to look at formal methods, and at those that scale to larger systems. If writing code is the process of creating bugs, we're possibly accelerating our bug production rate, and the same applies to the test suites, which in my experience people aren't properly reading. Additionally, nobody wants to read generated code, so it somewhat sucks the fun out of coding.
I think of test suites, contracts and static types (refinements, dependent) as a kind of double-entry accounting; you state the same thing twice (object code and meta code), hoping that you won't make the same mistake twice. Test suites don't push constraints into the code, so they don't scale and they don't compose. Worse, nobody wants to write or read them. Separately, I've seen "spec-driven development", wherein one generates rather large documentation about the code and the LLM basically keeps the two in sync by blocking inconsistency and aiding change; I'm not sure anyone but the LLM wants to read that, though.
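A toy sketch of what I mean by pushing constraints into the code (all names here are purely illustrative): the same "non-empty input" invariant stated only in a test versus carried by the type, where it composes through every caller.

```haskell
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.List.NonEmpty as NE

-- Invariant stated outside the code: the signature still admits [],
-- so a test (or every caller) has to re-establish "non-empty" each time.
averageList :: [Double] -> Double
averageList xs = sum xs / fromIntegral (length xs)

-- Invariant pushed into the code: NonEmpty makes "at least one element"
-- part of the type, so the constraint composes through every caller.
averageNE :: NonEmpty Double -> Double
averageNE xs = sum xs / fromIntegral (NE.length xs)

main :: IO ()
main = print (averageList [1, 2, 3], averageNE (1 :| [2, 3]))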
Types and proofs cost less to read and write than the code they talk about, so in an ecosystem where the cost of labour for producing code has gone down, but with no equivalent leg-up in terms of correctness or abstraction forging, formal methods might be the antidote. Especially if you care about staying in the driver's seat in the software development world, rather than being replaced by someone with fewer scruples than you about ceding their cognition to move faster.
I've gained a renewed interest in Liquid Haskell, dependent types, Ghost of Departed Proofs, and looking at Lean differently. TLA+. CT. Anything that makes you a better thinker. I haven't found another way to navigate how industry and open source are changing now that the genie is out of the bottle.
i do agree: just increasing the volume of code produced (even in tests) does not actually solve any problems in a scalable way. prompting an LLM "and make tests for that" does not automagically fix that.
that does not mean that LLM generation of test code is entirely pointless, though: it works very well where you would litter a program with print statements in other languages to quickly drill down on an issue: just get the code in question to execute. nobody really wants to write all the test code required to set up all the values and state for a function to be called, but that's work an LLM can perform very well. as a bonus: you keep the test around as documentation that there was an issue there once, and as a guard against regression. but that's not code you insist on keeping, or expect anyone to actually read with any kind of focus.
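roughly the kind of set-up-and-call test i mean, sketched with hspec (splitCsvLine is a made-up stand-in for whatever function you'd otherwise be probing with print statements):

```haskell
import Test.Hspec

-- Stand-in for the code under investigation.
splitCsvLine :: String -> [String]
splitCsvLine = foldr step [[]]
  where
    step ',' acc        = [] : acc
    step c (cur : rest) = (c : cur) : rest
    step c []           = [[c]]

main :: IO ()
main = hspec $
  describe "splitCsvLine" $ do
    it "splits on commas" $
      splitCsvLine "a,b,c" `shouldBe` ["a", "b", "c"]
    it "keeps empty trailing fields (the issue we once chased)" $
      splitCsvLine "a,," `shouldBe` ["a", "", ""]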
the other category though are tests that are thought out, structured and composable. here too LLMs are not entirely useless: once a pattern is established, they are pretty capable of following it (see this test of llm functionality for example; an LLM is more than capable of extending tests in that pattern all day long).
spec-driven tests too are something LLMs are not that bad at generating, but here you're basically only saving yourself having to write down the code; you still have to be very specific about what you want tested, and what properties you expect to hold true.
i do feel that when it comes to LLM-generated code, haskell does have its advantages: while not exactly in "formal verification" territory, just by its strong type system it gives the AI a few anchor points to structure its code around, with the compiler making sure that those types actually match up in the end. especially for simpler functions that works surprisingly well: generate the types you know you'll need, and the function signatures of some core parts of the problem, and then let the LLM fill in the gaps.
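a rough sketch of that workflow, with every name invented for illustration: you write the domain types and the core signatures yourself, leave the bodies as holes, and let the LLM fill them in while the compiler checks that the pieces still line up.

```haskell
module Invoicing where

import Data.Time.Calendar (Day)

-- Domain types written up front; these are the anchor points.
data LineItem = LineItem
  { description :: String
  , quantity    :: Int
  , unitPrice   :: Rational
  }

data Invoice = Invoice
  { issuedOn :: Day
  , items    :: [LineItem]
  }

-- Signatures pin down the intent; the LLM fills in the definitions,
-- and GHC makes sure the types actually match up.
lineTotal :: LineItem -> Rational
lineTotal = undefined

invoiceTotal :: Invoice -> Rational
invoiceTotal = undefined

applyDiscount :: Rational -> Invoice -> Invoice
applyDiscount = undefined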
I hope nobody interpreted my remark above as meaning "send a PR with whatever crap Claude comes up with on first try".
I agree that spending brain cycles thinking about invariants is the new frontier of the profession now that code is cheap to write, but the basic fact that no formal method can address is that natural language does not map 1:1 onto executable code.
"Semantic parsing" is an approximation, and in the same vein producing formal specifications of a piece of intended software just moves the goalposts of correctness, but doesn't make the problem disappear. This, by the way, has been observed many times in the past and is not news: Galois - Specifications Don't Exist
In Haskell we have many fine tools for keeping the LLM honest, and deploying them all has become much cheaper now too: "enable -Wall -Werror and fix all resulting breakage", "write property tests for this invariant", and so on.
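For the "write property tests for this invariant" part, this is the shape I have in mind — a toy sketch where mySort stands in for whatever function the invariant is actually about:

```haskell
import Data.List (sort)
import Test.QuickCheck

-- Hypothetical stand-in for the code under test.
mySort :: [Int] -> [Int]
mySort = sort

-- The invariant stated as properties: the output is ordered and is a
-- permutation of the input. An LLM can happily draft these; the valuable
-- part remains deciding which invariants are worth stating.
prop_sorted :: [Int] -> Bool
prop_sorted xs = let ys = mySort xs in and (zipWith (<=) ys (drop 1 ys))

prop_permutation :: [Int] -> Bool
prop_permutation xs = sort (mySort xs) == sort xs

main :: IO ()
main = do
  quickCheck prop_sorted
  quickCheck prop_permutation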
Edit: btw Happy new year all!
Nice article link, thank you! I'll keep that one as it echoes a lot of my thinking.