Aside from token consumption, you could also answer more PLT-oriented questions by measuring, for example:
- task completion speed: does one programming language let the model solve problems faster than another?
- because it's easy to re-run each task, say, 30 times, you get more reliable results.
- human intervention (or lack thereof; fewer interventions are better).
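A rough sketch of how those measures could be aggregated over repeated runs of one task. The `RunResult` fields and the `summarize` helper are made up for illustration; how you actually collect the numbers depends on your agent setup.

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunResult:
    """One attempt at a task (hypothetical record, not from any real tool)."""
    seconds: float       # wall-clock time until the run finished or was abandoned
    tokens: int          # total tokens consumed during the run
    interventions: int   # number of times a human had to step in
    solved: bool         # whether the task was completed


def summarize(runs: list[RunResult]) -> dict:
    """Aggregate repeated runs of the same task into a few summary metrics."""
    if not runs:
        raise ValueError("need at least one run")
    solved = [r for r in runs if r.solved]
    return {
        "runs": len(runs),
        "solve_rate": len(solved) / len(runs),
        # Completion speed only makes sense for runs that actually finished.
        "median_seconds": statistics.median([r.seconds for r in solved]) if solved else None,
        "median_tokens": statistics.median([r.tokens for r in runs]),
        "mean_interventions": statistics.mean([r.interventions for r in runs]),
    }
```

Run the same task N times per language (30, say), call `summarize` on each batch, and compare the summaries side by side. Medians are used here because a single runaway run can skew a mean quite a lot.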
It gets interesting if you start designing a language with an LLM as the primary user in mind: you could decide to add or remove features based on how well the model empirically performs with them.
It needn’t be a new language. You could do grammar prompting, for example, if your problem is sufficiently constrained.
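To make that concrete, here is a very simplified sketch of the idea: put a small BNF grammar in the prompt and check the reply against it before accepting it. The toy arithmetic grammar and the names `build_prompt` and `conforms` are assumptions for illustration, the model call itself is left out, and grammar prompting as described in the literature can involve more than this.

```python
import re

# Toy grammar for a constrained problem (illustrative only).
GRAMMAR = """\
expr   ::= term (("+" | "-") term)*
term   ::= factor (("*" | "/") factor)*
factor ::= NUMBER | "(" expr ")"
"""


def build_prompt(task: str) -> str:
    """Embed the grammar in the prompt so the model knows the allowed syntax."""
    return (
        "Answer with a single expression that conforms to this grammar "
        "and nothing else.\n\n"
        f"{GRAMMAR}\nTask: {task}\nExpression:"
    )


def conforms(text: str) -> bool:
    """Check a reply against GRAMMAR with a small recursive-descent parser."""
    tokens = re.findall(r"\d+|[-+*/()]", text)
    # Reject replies that contain anything other than grammar tokens.
    if "".join(tokens) != re.sub(r"\s+", "", text):
        return False
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        pos += 1

    def factor():
        tok = peek()
        if tok is not None and tok.isdigit():
            eat()
            return True
        if tok == "(":
            eat()
            if not expr() or peek() != ")":
                return False
            eat()
            return True
        return False

    def term():
        if not factor():
            return False
        while peek() in ("*", "/"):
            eat()
            if not factor():
                return False
        return True

    def expr():
        if not term():
            return False
        while peek() in ("+", "-"):
            eat()
            if not term():
                return False
        return True

    # The whole reply must be consumed, not just a valid prefix.
    return expr() and pos == len(tokens)
```

You would send `build_prompt(task)` to the model and retry (or count it as a failure) whenever `conforms(reply)` is false, which also gives you a cheap per-run metric.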
I’ve been reading a fair bit about how the classical synthesis community has adapted to LLMs, and most of the literature is about correctness (rightfully so), with not much on “search” efficiency. The measures you suggest make sense. The hard part would be defining a sufficiently representative set of tasks for a benchmark like that.
What are the basic things it wants to look up for Haskell?
Also, I think you have to iterate on documenting your setup before it can be efficient.
When I asked it to make a game with my engine, it rummaged around a lot for examples and references.
Then I made it write docs for itself and cleaned that up.
The second run started with a large-ish upfront planning phase to nail the design, but after that the game was mostly one-shot, without any extra lookups or rummaging.