Benchmarking Haskell LLM proficiency with AOC puzzles

Hi there!

This December, I realized that state-of-the-art LLMs were quite capable of coding in Haskell, and even solving Advent of Code puzzles.

This was a surprise to me, as I had been using Claude Code quite intensively since July and hadn’t appreciated their “raw” capabilities, given I was using them on a complex (and messy) codebase.

That codebase wasn’t in Haskell btw, but in Elm, so close enough “in FP spirit”.

Anyhow, since then I’ve been wondering how I could evaluate different models myself, and not rely on advertised benchmarks, which could be more or less optimized for marketing purposes.

I’ve summarized in a blog post my strategy for doing so, and I feel pretty good about it for now.

It has a bit of a “homegrown” flavor to it, but it’s quite simple and adaptable too, so I thought I should share it in case it could be of any interest to others.
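To give a flavor of what such a “homegrown” harness could look like: a minimal sketch in Haskell, where a candidate solver (e.g. LLM-generated code wrapped as a `String -> String` function) is scored against puzzles with known answers. All names here (`Puzzle`, `score`, `sumSolver`) are illustrative assumptions, not taken from the actual blog post:

```haskell
import Data.List (foldl')

-- A puzzle pairs an input with its known-correct answer.
data Puzzle = Puzzle { input :: String, expected :: String }

-- Score a candidate solver against a puzzle set,
-- returning (correct answers, total puzzles).
score :: (String -> String) -> [Puzzle] -> (Int, Int)
score solver = foldl' step (0, 0)
  where
    step (ok, total) p
      | solver (input p) == expected p = (ok + 1, total + 1)
      | otherwise                      = (ok, total + 1)

main :: IO ()
main = do
  let puzzles   = [ Puzzle "1 2 3" "6", Puzzle "4 5" "9" ]
      -- A toy "solver": sum the whitespace-separated integers.
      sumSolver = show . sum . map read . words
  print (score sumSolver puzzles)  -- prints (2,2)
```

In practice one would compile each model’s submitted solution and run it as a subprocess against the real puzzle inputs, but the scoring logic stays this simple.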

I’d be glad to hear your thoughts and experiments on the subject. Also, I’m sure I could learn a thing or two by sharing my post. Here it is:

By the way, another thing I realized in December was that LLMs handle different languages/paradigms, even niche ones, just fine. In other words, the choice of a programming language seems to be less critical than I assumed. That was also a surprise to me, as I was convinced that:

  • popular languages had to be “better”
  • typed languages had to be “better”

By “better”, I mean better supported, and therefore producing better results.

Both of my assumptions appear to be contradicted by such experiments. As for myself, I’m still interested in wielding the strictest/most powerful language. So if the LLM can help with that (think assisting with teammate onboarding), then I’d tend to think that LLMs have the potential to make Haskell more attractive than it’s been historically.

And yeah, I know about “avoid success at all costs”, but still :wink:

Side note: I may make this post evolve by testing other languages, just to satisfy my own curiosity. Also, I’d be interested to see whether all models can complete all days, but I’d prefer to try that later, once I’ve solved them all myself.


Did you write this last year, or are you a time traveler? A number of dates on the blog post are 2025, when I’d expect e.g. 2026-02-24 for the post date.

Indeed, it appears I was :slight_smile:

I’ve fixed the typo in the post date: 2025-02-24 → 2026-02-24.

I ran the benchmark yesterday and wrote the post this morning (from the material produced by the benchmark).

One could assume that LLMs already have this past December in their training data by now (maybe; I’m unsure about that), but I remember doing similar tests in December and getting similar results, for what it’s worth.

Thank you! This is really interesting! Can you run any of the models you compare locally? I am interested in how far small models suitable for local execution lag behind the large ones.

I hadn’t explored local LLMs until now (apart from toying with them very briefly), because I just don’t have the proper hardware to do anything interesting…

But I still wrote up and explored the idea here:

Btw, I benchmarked a bunch of other languages and got interesting results…

In my mind, running locally could become a viable option in maybe 5-10 years? Unless hardware requirements are pushed even further, thus requiring even more computing to offload?

One thing that’s currently piquing my curiosity is having the ability to refine (aka “fine-tune”) one’s own LLM. From what I can gather, it’d require renting a server with a proper GPU - roughly 200€/month for an entry-level server.

So, either you pay “big bucks” to the big players (if producing tons and tons of code is a requirement, which isn’t my case btw), or you can play on your own, perhaps with smaller but more “focused” models, for an “equivalent” price.

That’s speculative, though; I’d love to learn more if some of you are exploring these ideas…
