Golds-gym: Golden testing for benchmarks

I’m doing quite a lot of performance work at the moment, and needed an integration between `hspec` and `benchpress`.

The library saves benchmark results to golden files, arranged by architecture, and subsequent tests compare against the golden file to track regressions, improvements, etc.

It also has robust statistics capabilities to account for noisy benchmarking environments.
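
In a spec it looks roughly like this (a minimal sketch; `benchGolden` takes a test name and an `IO ()` action to time):

import Control.Monad (forM_)
import Data.IORef (modifyIORef', newIORef)
import Test.Hspec
import Test.Hspec.BenchGolden

main :: IO ()
main = hspec $
  describe "IORef" $
    -- the first run records a golden timing under this machine's architecture ID;
    -- later runs compare against it and flag regressions or improvements
    benchGolden "bump an IORef a million times" $ do
      ref <- newIORef (0 :: Int)
      forM_ [1 .. 1000000 :: Int] $ \_ -> modifyIORef' ref (+ 1)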

Let me know if you find it useful!

Now on Hackage: golds-gym: Golden testing framework for performance benchmarks

14 Likes

This seems awesome! I’m definitely going to try it out :smiley:

1 Like

I suspect that I’m arguing with design choices made by AI, but are coarse architecture identifiers like x86_64-linux-Intel_Core_i7 sufficient? I imagine there is huge variability between different Core i7 models, and on top of that different RAM sizes and bandwidths are likely to render such golden files impractical.

1 Like

Indeed, most of the typing was done by an AI.

I designed it, though, as a tool for benchmarking across project versions rather than across architectures, with the architecture ID there to distinguish a workstation from a CI runner.

Good catch nonetheless; do you have insight on how to include more architecture detail in the ID string?

Edit: @Bodigrim it’s possible, of course: Add RAM and CPU cores to architecture identifier by Copilot · Pull Request #2 · ocramz/golds-gym · GitHub. Going beyond that would mean parsing the output of `lshw` or `ioreg`, which is a bit of work. It would be great if you could articulate your use case, as I generally use at most one workstation per architecture and so don’t need all this detail.
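
For what it’s worth, the coarse part of the ID string comes straight from `System.Info` in base; everything finer needs the platform-specific sources above. A sketch (`coarseArchId` is just an illustrative name):

import System.Info (arch, os)

-- yields something like "x86_64-linux"; CPU model, core count and RAM
-- would have to come from /proc/cpuinfo, `lshw`, `ioreg`, etc.
coarseArchId :: String
coarseArchId = arch <> "-" <> os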

My view is that golden files for benchmarks are likely to be futile; I was surprised that you found them useful. Either they are coarse-grained, as in your original implementation, in which case other contributors to a project will get test failures. Or they are fine-grained, checking precise CPU model, RAM size, etc., in which case no other contributor will have the same configuration as you and will skip the golden tests.

Did you consider using tasty-bench to benchmark across project versions? I run it with `--csv FILE` to generate baseline benchmarks for the master branch, then switch to a feature branch and use `--baseline FILE --hide-successes --fail-if-slower 10`, which highlights nicely all regressed benchmarks.

Further on, I rarely want to test a property like “this benchmark takes 1 second”. Normally I want to test “this implementation of an algorithm is at least X times faster than a reference / naive implementation” or “this benchmark is at most Y times slower than a for-loop from 1 to 1000000”, which is what `bcompareWithin` from tasty-bench achieves. Expressing performance properties in relative, not absolute, terms makes them portable enough to work on different machines and runners.
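
For instance (a sketch; `mySort` stands in for the implementation under test):

import Data.List (sort)
import Test.Tasty.Bench

-- placeholder for the implementation under test
mySort :: [Int] -> [Int]
mySort = sort

main :: IO ()
main = defaultMain
  [ bgroup "sorting"
      [ bench "baseline" $ nf sort [1000 :: Int, 999 .. 1]
        -- fail unless mySort stays within 0x and 2x of the baseline time,
        -- i.e. it is at most twice as slow
      , bcompareWithin 0 2 "baseline" $
          bench "mySort" $ nf mySort [1000 :: Int, 999 .. 1]
      ]
  ]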

4 Likes

My current use case is indeed like what you outlined for tasty-bench, only in the form of an `hspec` assertion. It didn’t exist as far as I could tell, so here it is.

That’s a good point: `golds-gym` triggers on relative change, configurable in various ways: Test.Hspec.BenchGolden.Runner

I’ve added some lens-based combinators for better expectations. Now you can say, e.g.,

benchGoldenWithExpectation "median comparison" defaultBenchConfig
  [expect _statsMedian (MustImprove 10.0)]
  myAction

if you want the median timing to improve by at least 10%.
See golds-gym/CHANGELOG.md at main · ocramz/golds-gym · GitHub

An integration of tasty-bench (or, say, criterion) into hspec would be a welcome effort. The problem with AI is its hubris and DIY/NIH attitude. A lazy human would not attempt to reinvent the wheel and ignore all prior art, but AI will.

Say, I’m looking at the first function in the library:

Assuming `quicksort :: Ord a => [a] -> [a]`, this does not even compile, returning `IO [Integer]` instead of `IO ()`. But fair enough, we can fix it:

import Control.DeepSeq (deepseq)
import Data.List (sort)
import Test.Hspec
import Test.Hspec.BenchGolden

main :: IO ()
main = hspec $ do
  describe "Sorting" $ do
    benchGolden "quicksort 1000000 elements" $
      return $ sort [1000000, 999999..1] `deepseq` ()

Is it a reasonable way to benchmark a Haskell function? No, because (unless you insist on `-O0`) GHC will float out `sort [1000000, 999999..1]` into a top-level thunk, so that it’s computed only once and memoized; all subsequent iterations of the benchmark become a no-op and return immediately. That’s the reason why benchmarking frameworks have combinators like `nf`.

Running the example above with golds-gym-0.2.0.0 consistently reports that sorting a million elements took only 1 microsecond, which is hardly believable.

(What is even more surprising is that increasing to 1 billion elements does not slow anything down; it seems golds-gym somehow defeats my attempt to `deepseq` and does not sort the list even once.)
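
For reference, the same benchmark written with tasty-bench’s `nf`, which takes the function and its argument separately so the sorted result cannot be shared between iterations:

import Data.List (sort)
import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ bgroup "Sorting"
      -- nf re-applies the function on every iteration and forces the result
      -- to normal form, so the sorting work is actually redone each time
      [ bench "sort 1000000 elements" $ nf sort [1000000 :: Int, 999999 .. 1]
      ]
  ]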

6 Likes

@Bodigrim thank you for the corrections; it seems I had forgotten my fundamentals. The latest version now accounts for various depths of evaluation and gives more sensible timing results. golds-gym: Golden testing framework for performance benchmarks

(Another great example of Cunningham’s Law in action!)

@ocramz now that you vendored in huge chunks of tasty-bench, why wouldn’t you just reuse them instead? Your `newtype BenchAction` seems to be the same as `newtype Benchmarkable`, so you can re-export the latter and have all the combinators and documentation (mercilessly butchered by your AI) for free.
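
Concretely, something along these lines would do (a sketch; the module name is hypothetical):

-- re-export tasty-bench's type and combinators instead of vendored copies
module Test.Hspec.BenchGolden.Benchmarkable
  ( Benchmarkable
  , nf, whnf, nfIO, whnfIO
  ) where

import Test.Tasty.Bench (Benchmarkable, nf, whnf, nfIO, whnfIO)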

What your package suggests as a usage example of `whnf` was apparently taken from the tasty-bench documentation, where it serves to illustrate an anti-pattern.

1 Like

Personal remark: after 10+ years of Haskelling, I thought I had grown some scar tissue for the fiery one-upmanship of its hardcore online users, but I still cannot get past the skewering over a slightly less-orthogonal API. All for work that is freely given. No wonder everybody in OSS ends up burnt out.

1 Like

I’ve thought for a while that the bigger issue with Haskell, and maybe Open Source in general, is the lack of a minimal de facto group software development life cycle, including any review/recommendations for requirements analysis and design. (Documentation and testing can also be a problem in general.) Our current system is often that a pioneer announces a new package fully formed, and only then gets feedback. I guess this is a little like an academic model, as opposed to industry. It has advantages, but as many Haskell programmers have noted, it leads to problems such as making it hard to know what all the best libraries are for a given task, or what direction(s) experts recommend in various areas.

In my case, I’d like to work on a couple areas: pure mathematics in Haskell, and shared-memory parallelism in Haskell more generally. I have my own strong opinions, but when I have more time to work on these areas (not now unfortunately), I’d really like input from experts. I plan to post here and solicit feedback before finalizing decisions.

No one’s advocating design by committee, or requiring approvals before having fun trying new ideas. But there’s a lot of duplication: e.g. in my case, a lot of sparse matrix libraries that all use IntMap as a data structure, which I find inefficient in practice (and the duplication itself is wasteful). Mathematicians love to collaborate and write joint papers; it’s really fun. And it’d be much better to have several folks discuss possibilities for sparse matrices, instead of reinventing a similar (arguably inefficient) library. We could then even try to build more capabilities on a solid base, instead of stagnating.

Another example is the falsify library for testing. I think it has a lot of good ideas, but could have really benefited from group feedback on a number of points. I don’t mean to pick on it in particular, I think it’s just an example.

In summary, I think that if we continue with the relative anarchy of “develop whatever you want and then publish it”, we should acknowledge the strengths of that approach but also be prepared for a lot of duplication, objections, and criticism after the fact, unfortunately. I don’t want a rigid proposal, just a shift in community behaviour.

If others agree or even want to discuss this, I suppose this should really be split off into a separate thread. I’ll let the experts decide on that. :slight_smile:

2 Likes

@ocramz I’m sorry that my comments hurt you; that was never the intention. I got carried away and expressed myself more forcefully than I should have. I know and respect you as a valuable member of the Haskell community, who does great work and contributes to the ecosystem we love. Please accept my sincere apologies.

2 Likes

@Bodigrim Thank you and all good, no need to apologize. I got carried away as well.

In fact, I must acknowledge that the initial versions of the package I published were quite sloppy and improved rapidly thanks to your feedback. Your diagnosis of AI being eager to replicate code to the detriment of ecosystem integration, or (worse) to mangle comments, is very true, and a tricky aspect of a tool that’s so seamless.

@dbarton Very well put. This is a discussion worth having, perhaps in a dedicated thread as I’m sure many others have thoughts about this.

Incidentally, I am also guilty as charged regarding a sparse matrix package (though typed by hand). I recently revisited GitHub - ocramz/sparse-linear-algebra: Numerical computation in native Haskell, and the “forward” operations work fine, but the solvers need time and care to fix.