[vibe coding] Text similarity search via normalized compression distance

Hi all,

tl;dr: I’ve vibe-coded a new library; here’s how it went.

Background

A while back the NLP community rediscovered a 20-year-old clustering algorithm based on compression (see Cilibrasi & Vitányi 2003), and it made quite a splash because now we don’t have to train neural networks to produce embeddings anymore!

Indeed, it works quite well and you can find various implementations online. Many implementations, however, use an all-to-all (O(n²)) distance computation, which is both slow and unnecessary.
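
For reference, NCD is defined over compressed lengths: NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C is the size of the compressed input. A minimal sketch using gzip via the zlib package (not the library’s exact code):

import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString.Lazy as BL

-- Size in bytes of the gzip-compressed input
cLen :: BL.ByteString -> Int
cLen = fromIntegral . BL.length . GZip.compress

-- NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
ncd :: BL.ByteString -> BL.ByteString -> Double
ncd x y = fromIntegral (cxy - min cx cy) / fromIntegral (max cx cy)
  where
    cx  = cLen x
    cy  = cLen y
    cxy = cLen (x <> y)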

So I riffed with Gemini on the idea of using tree indexing with the NCD distance metric.

Play-by-play

The LLM happily obliged and produced an implementation, together with a nice discussion of why vantage-point trees are suitable when you have a distance metric (a, a) → R⁺.
However, it was full of off-by-one errors and inconsistencies between bind and let (it doesn’t know which `vector` functions return in a monad and which don’t); it made up functions, got signatures wrong, and the result basically didn’t build or make complete sense.

-- NB: slice is a pure function, but here Gemini uses it as a monadic one
sub <- MV.slice 1 (len - 1) vec
frozenSub <- V.freeze sub

-- the function passed to V.modify must be an action returning `m ()`,
-- but here the LLM passes a pure function (which ignores `v` entirely)
let sorted = V.toList $ V.modify (\v -> sortBy (comparing getD) (V.toList frozenSub))

Not immediately catching the build failure by eye, I went on and asked it to fix various surface infelicities, such as collecting search results in a heap rather than concatenating lists. It got all of those right, because they are well-known algorithms.
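
To make “well-known” concrete, the textbook pruning step for a VP-tree range query looks roughly like this (a sketch against the VPTree constructors, Document type and getD distance that show up later in the thread; the actual library collects k-NN candidates in a heap rather than a plain list):

-- All documents within radius r of the query q.
-- NB: the pruning relies on the triangle inequality, which a
-- compressor-based distance like NCD only approximately satisfies.
searchWithin :: Double -> Document -> VPTree -> [Document]
searchWithin r q = go
  where
    go VPEmpty     = []
    go (VPLeaf ds) = filter (\d -> getD q d <= r) ds
    go (VPNode vp thresh inside outside) =
      let d    = getD q vp
          here = [vp | d <= r]
          -- the query ball may overlap the inside region
          ins  = if d - r <= thresh then go inside  else []
          -- the query ball may reach past the pivot's radius
          outs = if d + r >= thresh then go outside else []
      in here ++ ins ++ outs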

Cleaning up with property tests

Then I spun up a stack project, pasted in the code Gemini produced, fixed the basic type errors by hand, and asked Haiku 4.5 to “write property tests that demonstrate the core properties of the library”. This is where things started to measurably improve.

Tests ground the model, and compiler feedback is an effective signal.

A QuickCheck suite can also teach you which invariants actually hold: for example, NCD is neither reflexive (d(x, x) /= 0) nor symmetric (d(x, y) /= d(y, x)). This makes sense after refreshing what gzip compression does to a string (exercise for the reader), and it lets us confidently remove some of the generated tests.
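
For illustration, here are the two plausible-looking laws that QuickCheck refutes, reusing the ncd sketch from above (the generated suite’s actual names differ; the Arbitrary instance for ByteString comes from the quickcheck-instances package):

import Test.QuickCheck
import Test.QuickCheck.Instances ()  -- orphan Arbitrary for ByteString
import qualified Data.ByteString.Lazy as BL

-- Fails: gzip cannot compress x <> x all the way back down to C(x)
prop_reflexive :: BL.ByteString -> Bool
prop_reflexive x = ncd x x == 0

-- Fails once inputs get long enough: compression is order-sensitive
prop_symmetric :: BL.ByteString -> BL.ByteString -> Bool
prop_symmetric x y = ncd x y == ncd y x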

At this stage, you can expect this kind of stream from the agent:


Let me create a small test program instead:

Created TestDebug.hs

> stack runhaskell app/TestDebug.hs

TestDebug.hs+1-0

> stack runhaskell app/TestDebug.hs

Good! With 4 documents it works. Let me try a case that fails:

TestDebug.hs+7-4

> stack runhaskell app/TestDebug.hs

Interesting! That case works too because it's smaller than the leaf threshold. Let me try with more documents:

TestDebug.hs+4-7

> stack runhaskell app/TestDebug.hs

Hmm, that works too. Let me look at the actual test failure more closely. The issue might be with how documents are being compared. Let me check the set function and == for Document:

Read NCDTreeSpec.hs, lines 1 to 40

Now let me see the test that's failing:

Read NCDTreeSpec.hs, lines 85 to 100

I see the problem! The test uses set to remove duplicates before comparing.

The agent forms hypotheses, tests them, observes the results and repeats.

Aside: tools

Set up GHCup, make sure VSCode can find it, and use the VSCode Copilot chat interface with the interaction set to “Agent” mode. You can also configure VSCode to auto-approve calls to `stack` so the agent can go off and fix things on its own.

[speculation] Why do LLMs fail at guessing types?

I suspect the initial inconsistencies arise because the relationship between programs and their types is not spelled out in the training text, and code corpora don’t contain program compilation tasks (idea!).

An LLM relies on a learned text-association model, so it cannot memorize type information perfectly; in Haskell we can form programs like [print, print], and it’s up to the consumer to figure out whether [String → IO ()] makes sense or not. This means that in a code corpus we can expect to find monadic actions where other languages would only store pure values (lists). It also significantly increases the space of possible programs, though I’m unsure it explains why Gemini didn’t internalize the difference between a pure and a monadic function.
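
To make the [print, print] point concrete (a toy example, not from the library):

-- A list of unapplied IO actions is an ordinary, well-typed value;
-- nothing at the definition site says whether it will ever run.
actions :: [String -> IO ()]
actions = [print, print]

-- Only the consumer decides what the actions mean:
runAll :: String -> IO ()
runAll s = mapM_ ($ s) actions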

It’s easy to train an LLM to produce empirically functional Python, because the runtime “does something” no matter what you feed it, deferring errors to the point of use rather than the point of definition. So all the preceding code will at least run.

References

Cilibrasi, R., & Vitányi, P. M. B. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523–1545. First circulated as arXiv:cs/0312044 (2003).

Hadn’t heard of this approach, so just the reference was a pretty informative read on its own. Also, great experience report! I’m working on some text tasks in under-resourced languages, so I might reuse a lot of what you’ve done.

Now ncd-tree is available as a Hackage release candidate too: ncd-tree: text similarity search using normalized compression distance and VP trees
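
For orientation, this is the tree type the functions below build, reconstructed from the constructors used in this thread (the field layout in the released package may differ):

data VPTree
  = VPEmpty
  | VPLeaf [Document]                    -- small buckets below the leaf threshold
  | VPNode Document Double VPTree VPTree -- pivot, radius, inside, outside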

-- Sort documents by their distance from the pivot
sortedWrt :: Document -> V.Vector Document -> V.Vector Document
sortedWrt p = V.modify $ MV.sortBy (comparing (getD p))


mkVPTree :: Int -> [Document] -> VPTree
mkVPTree leafThreshold docs = runST $ do
    mutVec <- V.thaw (V.fromList docs)
    build mutVec
  where
    build vec = do
      let len = MV.length vec
      if len == 0
        then return VPEmpty
        else if len < leafThreshold
          then do
            finalItems <- V.freeze vec
            return $ VPLeaf (V.toList finalItems)
          else do
            -- 1. Pick first element as pivot (vantage point)
            vp <- MV.read vec 0
            -- 2. Sort the remaining elements by distance to the pivot
            let sub = MV.slice 1 (len - 1) vec
            frozenSub <- V.freeze sub
            let sorted          = sortedWrt vp frozenSub
                lenWithoutPivot = V.length sorted
                -- 3. Split at the median distance
                mid             = max 1 (lenWithoutPivot `div` 2)
                (inV, outV)     = V.splitAt mid sorted
                thresh          = getD vp (V.last inV)
            -- 4. Recurse on both halves
            inVM  <- V.thaw inV
            outVM <- V.thaw outV
            leftTree  <- build inVM
            rightTree <- build outVM
            return $ VPNode vp thresh leftTree rightTree

This looks suboptimal: we keep thawing and freezing vectors again and again, and it’s not even unsafeThaw / unsafeFreeze. Could we thaw a vector once and keep modifying its slices?

(If anything, I think that plain lists would perform equally well here)

To continue the experiment, I’ve fed your comment verbatim to Haiku 4.5 (adding the full module as additional context)

and it produced this new implementation (after a couple of attempts, again mixing up pure and monadic signatures):

mkVPTree :: Int -> [Document] -> VPTree
mkVPTree leafThreshold docs = runST $ do
    mutVec <- V.thaw (V.fromList docs)
    build mutVec 0 (MV.length mutVec)
  where
    build vec start len
      | len == 0 = return VPEmpty
      | len < leafThreshold = do
          finalVec <- V.freeze (MV.slice start len vec)
          return $ VPLeaf (V.toList finalVec)
      | otherwise = do
          -- 1. Pick first element as pivot (vantage point)
          vp <- MV.read vec start
          -- 2. Sort the remaining elements by distance to vp
          let subStart = start + 1
              subLen   = len - 1
          sortSliceBy vec subStart subLen (comparing (getD vp))
          -- 3. Split at midpoint
          let mid           = max 1 (subLen `div` 2)
              insideStart   = subStart
              insideLen     = mid
              outsideStart  = subStart + mid
              outsideLen    = subLen - mid
              lastInsideIdx = insideStart + insideLen - 1
          -- 4. Find threshold as distance to last element in 'inside'
          lastInside <- MV.read vec lastInsideIdx
          let thresh = getD vp lastInside
          -- 5. Recurse on slices
          leftTree  <- build vec insideStart insideLen
          rightTree <- build vec outsideStart outsideLen
          return $ VPNode vp thresh leftTree rightTree

-- | Sort a mutable vector slice in-place by a comparison function on elements
sortSliceBy :: MV.MVector s Document 
            -> Int 
            -> Int 
            -> (Document -> Document -> Ordering) 
            -> ST s ()
sortSliceBy vec start len cmp = do
  let sliced = MV.slice start len vec
  MV.sortBy cmp sliced

Tests are still green, and the benchmark is slightly faster (I suspect we’d see a real difference only on very large inputs).
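
(For anyone reproducing the comparison, a minimal criterion harness along these lines is what “benchmark” refers to; docs1k and the leaf threshold are placeholders:)

import Criterion.Main

-- docs1k :: [Document] is assumed to be a fixed corpus loaded elsewhere
main :: IO ()
main = defaultMain
  [ bench "mkVPTree/leaf=16" $ whnf (mkVPTree 16) docs1k
    -- whnf forces the runST build itself; nf plus an NFData instance
    -- would also force the lazily stored thresholds
  ]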

Is there a reason we (safely) thaw a freshly created vector that is not used anywhere else? Same question about (safely) freezing a vector slice only to disassemble it immediately into a list. And why does build take three arguments instead of just the corresponding vector slice?

(Feel free to feed it into Haiku 🙂)

Sure enough, the magic 8-ball replies with:

You’re absolutely right! These are unnecessary conversions. Since we’re creating a fresh vector and have exclusive ownership, and we’re only working with slices within the ST monad, we can:

  1. Work directly with the mutable vector without thawing (it’s already mutable)

  2. Pass mutable vector slices directly instead of converting to/from immutable vectors

  3. Keep build working with (start, len) parameters on the same mutable vector

Actually, looking more carefully: we do need to thaw the immutable vector from V.fromList, but then we could avoid the freeze/thaw cycles. Let me reconsider the design:

The real issue is that MV.slice returns a mutable vector view, but then we’re freezing it to a list. Instead, we should either:

  • Keep everything in mutable space and only freeze at the very end (when extracting leaves)

  • Or pass around the mutable vector with indices

The current code actually does this reasonably well for the recursive case, but the leaf case unnecessarily converts to immutable just to convert back to a list. We can optimize that:

And in a separate request it optimized the signature of `build`. Results are up on the main branch.
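
The natural shape for that refactor, since MV.slice is O(1) and aliases the same buffer, is to let build take the slice itself. A sketch, assuming it sits in the same where-block as before so that leafThreshold and getD are in scope (the code on main may differ in detail):

-- build :: MV.MVector s Document -> ST s VPTree   (left inferred; a
-- literal local signature would need ScopedTypeVariables to bind s)
build vec
  | len == 0            = return VPEmpty
  | len < leafThreshold = VPLeaf . V.toList <$> V.freeze vec  -- the only copy
  | otherwise = do
      vp <- MV.read vec 0
      let rest = MV.slice 1 (len - 1) vec  -- O(1) view, no copying
      MV.sortBy (comparing (getD vp)) rest
      let mid = max 1 (MV.length rest `div` 2)
      lastInside <- MV.read rest (mid - 1)
      VPNode vp (getD vp lastInside)
        <$> build (MV.slice 0 mid rest)
        <*> build (MV.slice mid (MV.length rest - mid) rest)
  where
    len = MV.length vec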

I posted the LLM’s replies because they show it can reason about program semantics (at least once we give it a lead), which makes it valuable for exploring new features or improvements.