[vibe coding] Text similarity search via normalized compression distance

Hi all,

tl;dr: I’ve vibe-coded a new library; here’s how it went.

Background

A while back the NLP community rediscovered a 20-year-old clustering algorithm based on compression (see Cilibrasi & Vitányi 2003), and it made quite a splash because now we don’t have to train neural networks to produce embeddings anymore!

Indeed, it works quite well and you can find various implementations online. Many of them, however, compute all-to-all distances, which is quadratic in the number of documents: slow, and unnecessary when all you want is the nearest neighbours of a query.
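For context, the distance itself is just arithmetic over compressed sizes: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(z) is the compressed length of z. A minimal sketch, using gzip from the `zlib` package as a stand-in compressor (an illustration, not the library’s actual API):

```haskell
import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString.Lazy   as BL

-- Compressed size of a document, with gzip standing in for the compressor.
cSize :: BL.ByteString -> Int
cSize = fromIntegral . BL.length . GZip.compress

-- NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd :: BL.ByteString -> BL.ByteString -> Double
ncd x y =
  let cx  = cSize x
      cy  = cSize y
      cxy = cSize (x <> y)
  in fromIntegral (cxy - min cx cy) / fromIntegral (max cx cy)
```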

So I riffed with Gemini on the idea of using tree indexing with the NCD distance metric.

Play-by-play

The LLM happily obliged and produced an implementation, together with a nice discussion of why vantage-point trees are a good fit when all you have is a distance metric (a → a → R+).

However, the code was riddled with off-by-one errors and inconsistencies between bind and let (it doesn’t know which `vector` functions return in a monad and which don’t), it got signatures wrong, and overall it didn’t build or make complete sense:

let getD other = ncd (docText p) (docSize p) (docText other) (docSize other) 

-- 3. Sort/Partition around median using the ST context 
-- For simplicity, we convert to list and back, but MV.sort is possible 

sub <- MV.slice 1 (len - 1) vec 
frozenSub <- V.freeze sub 

let sorted = V.toList $ V.modify (\v -> sortBy (comparing getD) (V.toList frozenSub)) -- Simplified 

let mid = (len - 1) `div` 2
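For contrast, here is roughly what that step looks like once the bind/let confusion is sorted out: `MV.slice` is pure, so there was nothing to bind in the first place, and for a sketch it is simpler to drop the ST machinery entirely and work on an immutable vector. The `Doc` and `VPTree` types and the surrounding recursion are hypothetical scaffolding, not the library’s actual code.

```haskell
import           Data.List   (sortBy)
import           Data.Ord    (comparing)
import qualified Data.Vector as V

data Doc = Doc { docText :: String, docSize :: Int }

-- Stand-in for the NCD over precomputed compressed sizes (see the sketch above).
ncd :: String -> Int -> String -> Int -> Double
ncd = undefined

data VPTree = Leaf | Node Doc Double VPTree VPTree

build :: V.Vector Doc -> VPTree
build docs
  | V.null docs = Leaf
  | otherwise =
      let p      = V.head docs                         -- vantage point
          rest   = V.tail docs                         -- pure slice, nothing to bind
          getD o = ncd (docText p) (docSize p) (docText o) (docSize o)
          sorted = sortBy (comparing getD) (V.toList rest)
          mid    = (length sorted - 1) `div` 2
          (inner, outer) = splitAt (mid + 1) sorted
          mu     = if null inner then 0 else getD (last inner)  -- median distance
      in Node p mu (build (V.fromList inner)) (build (V.fromList outer))
```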

Not immediately catching the build failure by eye, I went on and asked it to fix various surface infelicities, such as collecting search results in a heap rather than concatenating lists. These it got right, because they are well-known algorithms.
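For completeness, a hedged sketch of what that heap-based k-nearest search can look like, reusing the hypothetical `Doc` / `VPTree` / `ncd` definitions from the previous snippet. Here a `Data.Set` keyed on (distance, text) stands in for a bounded max-heap, and because NCD violates the metric axioms (more on that below), the pruning is only approximately exact.

```haskell
import qualified Data.Set as Set

-- k-nearest-neighbour search over the hypothetical VPTree above.
nearest :: Int -> Doc -> VPTree -> [(Double, String)]
nearest k q = Set.toAscList . go Set.empty
  where
    -- Insert a candidate, evicting the current worst once we hold k results.
    push d t acc
      | Set.size acc < k          = Set.insert (d, t) acc
      | d < fst (Set.findMax acc) = Set.insert (d, t) (Set.deleteMax acc)
      | otherwise                 = acc

    -- Current pruning radius: the worst kept distance, or infinity if under k.
    worst acc = if Set.size acc < k then 1 / 0 else fst (Set.findMax acc)

    go acc Leaf = acc
    go acc (Node p mu inner outer) =
      let d    = ncd (docText q) (docSize q) (docText p) (docSize p)
          acc' = push d (docText p) acc
          -- Visit the side the query falls into first, then the other side
          -- only if the ball around q still overlaps it.
          (near, far) = if d < mu then (inner, outer) else (outer, inner)
          acc''       = go acc' near
      in if abs (d - mu) <= worst acc'' then go acc'' far else acc''
```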

Cleaning up with property tests

Then I spun up a stack project, pasted in the code Gemini produced, fixed the basic type errors by hand, and asked Haiku 4.5 to “write property tests that demonstrate the core properties of the library”. This is where things started to measurably improve.

Tests ground the model, and compiler feedback is an effective signal.

A QuickCheck suite can also teach you which invariants actually hold: for example, NCD is neither reflexive (d(x, x) /= 0 in general) nor symmetric (d(x, y) /= d(y, x)).
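A minimal sketch of such a suite, written against the gzip-backed `ncd` from earlier (the property names and bounds are my assumptions, not the library’s actual tests). The instructive part is stating the metric axioms as properties and letting QuickCheck produce the counterexamples:

```haskell
import           Test.QuickCheck
import qualified Codec.Compression.GZip     as GZip
import qualified Data.ByteString.Lazy.Char8 as BL

-- Same gzip-backed NCD as in the earlier sketch, repeated so this block stands alone.
cSize :: BL.ByteString -> Int
cSize = fromIntegral . BL.length . GZip.compress

ncd :: BL.ByteString -> BL.ByteString -> Double
ncd x y =
  let cx = cSize x; cy = cSize y; cxy = cSize (x <> y)
  in fromIntegral (cxy - min cx cy) / fromIntegral (max cx cy)

-- NCD stays roughly within the unit interval (real compressors overshoot 1 a little).
prop_bounded :: String -> String -> Property
prop_bounded xs ys =
  not (null xs) && not (null ys) ==>
    let d = ncd (BL.pack xs) (BL.pack ys) in d >= 0 && d <= 1.1

-- Written as if NCD satisfied the metric axioms; QuickCheck finding
-- counterexamples is exactly the lesson.
prop_reflexive :: String -> Property
prop_reflexive xs = not (null xs) ==> ncd (BL.pack xs) (BL.pack xs) == 0

prop_symmetric :: String -> String -> Property
prop_symmetric xs ys =
  not (null xs) && not (null ys) ==>
    ncd (BL.pack xs) (BL.pack ys) == ncd (BL.pack ys) (BL.pack xs)

main :: IO ()
main = do
  quickCheck prop_bounded
  quickCheck (expectFailure prop_reflexive)  -- C(xx) > C(x), so d(x, x) > 0
  -- prop_symmetric also fails in general, but tiny random strings compress to the
  -- same size either way round, so exhibiting a counterexample needs a generator
  -- of longer, more compressible documents.
```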

Aside: tools

Set up GHCup, make sure VSCode can see it, and use the VSCode Copilot chat interface with the interaction set to “Agent” mode. You can also configure VSCode to auto-approve calls to `stack` so the agent can go off and fix things on its own.

[speculation] Why do LLMs fail at guessing types?

I suspect the initial inconsistencies arise because the relationship between programs and their types is mostly not spelled out in the training text, and because code corpora don’t contain program-compilation tasks (idea!).

An LLM relies on a learned text-association model, so it cannot memorize typing rules perfectly. In Haskell we can form programs like [print, print], and it’s up to the consumer to decide whether, say, [String → IO ()] makes sense for it. This significantly increases the space of possible programs, though I’m unsure it explains why Gemini didn’t internalize the difference between a pure and a monadic function.
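Concretely (my own toy example, not from the original chat):

```haskell
-- Both elements share the polymorphic type Show a => a -> IO (),
-- so the list itself is perfectly well typed...
handlers :: Show a => [a -> IO ()]
handlers = [print, print]

-- ...and only the consumer pins the element type down (here a ~ String).
main :: IO ()
main = mapM_ ($ "hello") handlers
```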

It’s easy to train an LLM to produce empirically functional Python because the runtime “does something” no matter what you feed to it, and defers any errors to the point of use rather than the point of definition. So all the preceding code will at least run.

References

R. Cilibrasi and P. M. B. Vitányi, “Clustering by Compression” (2003).


Hadn’t heard of this approach, so just the reference was a pretty informative read on its own. Also, great experience report! I’m working on some text tasks using under-resourced languages, so I might reuse a lot of what you’ve done.
