Hi all,
tldr; I’ve vibecoded a new library, here’s how it went.
Background
A while back the NLP community rediscovered a 20-year-old clustering algorithm based on compression (see Cilibrasi & Vitanyi 2003), and it made quite a splash because now we don’t have to train neural networks to produce embeddings anymore!
Indeed, it works quite well, and you can find various implementations online. Many of them, however, use an all-to-all distance computation, which, as you can guess, is both slow and unnecessary.
So I riffed with Gemini on the idea of using tree indexing with the NCD distance metric.
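For reference, the NCD itself needs nothing more than an off-the-shelf compressor; here is a minimal sketch using gzip via the zlib package (my own illustration, not the library’s code — the names cSize and ncd are mine):
import qualified Data.ByteString.Lazy as BL
import qualified Codec.Compression.GZip as GZip  -- from the zlib package
-- Compressed size, in bytes, of a lazy ByteString.
cSize :: BL.ByteString -> Double
cSize = fromIntegral . BL.length . GZip.compress
-- NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd :: BL.ByteString -> BL.ByteString -> Double
ncd x y = (cSize (x <> y) - min cx cy) / max cx cy
  where
    cx = cSize x
    cy = cSize y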
Play-by-play
The LLM happily obliged and produced an implementation, together with a nice discussion on why vantage-point trees are a good fit when all you have is a distance metric (a, a) → R+.
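For orientation, the kind of structure we were after looks roughly like this (my own minimal sketch, not the generated code; the type and field names are mine):
-- A vantage-point tree: each internal node holds a vantage point and a radius;
-- points within the radius go into the inner subtree, the rest into the outer one.
-- At query time the triangle inequality lets you prune whole subtrees, which is
-- why a plain metric (a, a) → R+ is all you need.
data VPTree a
  = Leaf [a]
  | Node
      { vantage :: a
      , radius  :: Double
      , inner   :: VPTree a  -- points with d(vantage, x) <= radius
      , outer   :: VPTree a  -- points with d(vantage, x) >  radius
      }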
However, the generated code was full of off-by-one errors and inconsistencies between bind and let (the model doesn’t know which `vector` functions return in a monad and which don’t); it made up functions, got signatures wrong, and basically didn’t build or make complete sense. For example:
-- NB: MV.slice is a pure function, but here Gemini binds it as if it were monadic
sub <- MV.slice 1 (len - 1) vec
frozenSub <- V.freeze sub
-- NB: the callback to V.modify must return `m ()`, but the LLM passes a pure expression (and ignores `v` entirely)
let sorted = V.toList $ V.modify (\v -> sortBy (comparing getD) (V.toList frozenSub))
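For comparison, the version that actually typechecks presumably reads something like this (a sketch; `vec`, `len` and `getD` come from the surrounding generated code):
let sub = MV.slice 1 (len - 1) vec   -- slice is pure, so bind it with let
frozenSub <- V.freeze sub            -- freeze is the genuinely monadic step
let sorted = sortBy (comparing getD) (V.toList frozenSub)  -- plain pure sort; no V.modify needed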
Not immediately catching the build failure by eye, I went on and asked it to fix various surface infelicities, such as collecting search results in a heap rather than concatenating lists; all things it got right because they are well-known algorithms.
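The heap idea itself is simple: during the search, keep at most k (distance, document) candidates and evict the farthest one whenever the bound is exceeded. A minimal sketch with Data.Set standing in for a real heap (my illustration, not the library’s code; `push` is my name):
import qualified Data.Set as Set
-- Keep at most k (distance, doc) candidates, ordered by distance; evict the farthest when full.
push :: Ord doc => Int -> (Double, doc) -> Set.Set (Double, doc) -> Set.Set (Double, doc)
push k cand cs
  | Set.size cs' > k = Set.deleteMax cs'
  | otherwise        = cs'
  where
    cs' = Set.insert cand cs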
Cleaning up with property tests
Then I spun up a stack project, pasted in the code Gemini produced, fixed the basic type errors by hand, and asked Haiku 4.5 to “write property tests that demonstrate the core properties of the library” - this is where things started to measurably improve.
Tests ground the model, and compiler feedback is an effective signal.
A QuickCheck suite can also teach you what the correct invariants are: for example, the NCD is neither reflexive (d(x, x) /= 0) nor symmetric (d(x, y) /= d(y, x)). This makes sense once you refresh what gzip compression does to a string (exercise for the reader), and it lets us confidently remove some of the generated tests.
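Concretely, these are the kinds of generated properties that turn out to be false and get dropped (a sketch; it assumes the gzip-based `ncd` from the Background sketch is in scope, and the property names are mine):
import Test.QuickCheck
import qualified Data.ByteString.Lazy.Char8 as BLC
-- Fails: a real compressor has per-stream overhead, so d(x, x) is generally not 0.
prop_reflexive :: String -> Bool
prop_reflexive x = ncd (BLC.pack x) (BLC.pack x) == 0
-- Fails: compressing x<>y and y<>x generally gives slightly different sizes.
prop_symmetric :: String -> String -> Bool
prop_symmetric x y = ncd (BLC.pack x) (BLC.pack y) == ncd (BLC.pack y) (BLC.pack x)
main :: IO ()
main = do
  quickCheck (expectFailure prop_reflexive)
  quickCheck (expectFailure prop_symmetric)
Wrapping the dropped properties in expectFailure is one way to document the non-invariant instead of deleting the test outright.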
At this stage, you can expect this kind of stream from the agent:
Let me create a small test program instead:
Created TestDebug.hs
> stack runhaskell app/TestDebug.hs
TestDebug.hs+1-0
> stack runhaskell app/TestDebug.hs
Good! With 4 documents it works. Let me try a case that fails:
TestDebug.hs+7-4
> stack runhaskell app/TestDebug.hs
Interesting! That case works too because it's smaller than the leaf threshold. Let me try with more documents:
TestDebug.hs+4-7
> stack runhaskell app/TestDebug.hs
Hmm, that works too. Let me look at the actual test failure more closely. The issue might be with how documents are being compared. Let me check the set function and == for Document:
Read NCDTreeSpec.hs, lines 1 to 40
Now let me see the test that's failing:
Read NCDTreeSpec.hs, lines 85 to 100
I see the problem! The test uses set to remove duplicates before comparing.
The agent forms hypotheses, tests them, observes the results and repeats.
Aside: tools
Set up GHCup, ensure VSCode can use it, and use the VSCode Copilot chat interface, setting the interaction to “Agent” mode. You can also configure VSCode to auto-approve calls to `stack` so the agent can go off and fix things on its own.
[speculation] Why do LLMs fail at guessing types?
I suspect the initial inconsistencies come from the relationship between programs and their types not being present in the text itself, and from code corpora not containing program-compilation tasks (idea!).
An LLM relies on a learned text-association model, so it cannot memorize perfectly. In Haskell we can form programs like [print, print], and it’s up to the consumer to figure out whether [String → IO ()] makes sense or not; this means that in a code corpus we can expect to find monadic actions where other languages would only store pure values (lists). This also significantly increases the space of possible programs, though I’m unsure it explains why Gemini didn’t internalize the difference between a pure and a monadic function.
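Concretely, this is a perfectly legal Haskell value, and nothing runs until a consumer decides to apply and sequence it (the names printers and runAll are mine):
-- A list of actions-to-be: [print, print] :: Show a => [a -> IO ()]
printers :: [String -> IO ()]
printers = [print, print]
-- Only here does the consumer turn the pure list into effects.
runAll :: IO ()
runAll = mapM_ ($ "hello") printers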
It’s easy to train an LLM to produce empirically functional Python because the runtime “does something” no matter what you feed it, and it defers any errors to the point of use rather than the point of definition. So all the preceding code will at least run.
