The sad state of property-based testing libraries

https://stevana.github.io/the_sad_state_of_property-based_testing_libraries.html

7 Likes

I still don’t know if I’m missing something great here or if this approach is too contrived to be useful (I don’t use QuickCheck or similar in my personal projects).

The promise of “maybe some day one of these tests will fail because it hit a jackpot corner case” sure sounds cool, but it gives me way less confidence than an exhaustive list of tests that is handwritten to hit things I actually care about.

1 Like

Try it and see!

In my experience the number of bugs picked up by QuickCheck is substantial, and this helps shed light on the blind spots that appear when we write tests “by hand”.
You don’t even have to go for very complex properties to reap most of the benefits of QuickCheck.

Writing QuickCheck tests is not difficult, provided your datatypes are Arbitrary-friendly. So yes, you need a bit of pondering at the design stage.
After that, I am confident the boon is worth ten times the hassle.
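For readers wondering what “Arbitrary-friendly” looks like in practice, here is a minimal sketch. The `Task`/`Priority` types and `sortTasks` are invented for illustration; only the QuickCheck API (`Arbitrary`, `arbitraryBoundedEnum`, `quickCheck`) is real:

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))
import Test.QuickCheck

-- Hypothetical domain types, kept simple so generators compose easily.
data Priority = Low | Normal | High
  deriving (Eq, Ord, Show, Enum, Bounded)

data Task = Task { title :: String, priority :: Priority }
  deriving (Eq, Show)

instance Arbitrary Priority where
  arbitrary = arbitraryBoundedEnum

instance Arbitrary Task where
  arbitrary = Task <$> arbitrary <*> arbitrary
  shrink (Task t p) = [Task t' p | t' <- shrink t]

-- Function under test: order tasks by descending priority.
sortTasks :: [Task] -> [Task]
sortTasks = sortOn (Down . priority)

-- A simple property: sorting is idempotent.
prop_sortIdempotent :: [Task] -> Bool
prop_sortIdempotent ts = sortTasks (sortTasks ts) == sortTasks ts

main :: IO ()
main = quickCheck prop_sortIdempotent
```

With the instances in place, `quickCheck` handles generation and shrinking; the design-stage “pondering” is mostly about keeping your types this easy to generate.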

5 Likes

This list of features is strange; IMHO the single most important feature is good (or at least usable) shrinking. So for Haskell that means using Hedgehog.

Oh yes, you do. Sadly, there aren’t many useful guides on how to use it for “real” code.
I know of The 'Property Based Testing' series | F# for fun and profit (especially the posts from part 3 onwards)
and, although less specific:
[Edit] Sorry, wrong video, I wanted to post this: https://www.youtube.com/watch?v=wHJZ0icwSkc
and his blog articles at Property-Based Testing in a Screencast Editor: Introduction and Time Travelling and Fixing Bugs with Property-Based Testing instead of
https://www.youtube.com/watch?v=4bpc8NpNHRc

4 Likes

And I would really suggest not using QuickCheck, but Hedgehog. QuickCheck was the first, and that really shows.

Oh, there is also a new property testing library, which I haven’t tested [oh, what a pun!] yet: Falsify
https://well-typed.com/blog/2023/04/falsify/

With “eventually hit corner cases” I meant more “while generating 10000 tests in a second or two” rather than “after months and millions of generated tests”.

How do you know that you haven’t forgotten to care about something? :-)

Depends on what you are testing, I suppose. I’m interested in testing stateful and concurrent systems rather than mere pure functions. Integrated shrinking à la Hedgehog/falsify won’t help much in that case: since we are generating lists of commands to execute against the real system and the model, we get shrinking of these lists for free from the library (making integrated shrinking less important).
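To make the command-list idea concrete, here is a hand-rolled miniature of the pattern against a toy counter. Everything below (`Cmd`, `step`, `exec`) is invented for illustration; real libraries such as quickcheck-state-machine generalize exactly this shape:

```haskell
import Data.IORef
import Test.QuickCheck
import Test.QuickCheck.Monadic (assert, monadicIO, run)

-- Toy commands; a real library generalizes this to arbitrary APIs.
data Cmd = Incr | Reset
  deriving Show

instance Arbitrary Cmd where
  arbitrary = elements [Incr, Reset]

-- Model: a pure interpretation of one command.
step :: Int -> Cmd -> Int
step n Incr  = n + 1
step _ Reset = 0

-- "Real" system: the same command run against an IORef.
exec :: IORef Int -> Cmd -> IO ()
exec ref Incr  = modifyIORef' ref (+ 1)
exec ref Reset = writeIORef ref 0

-- Property: after any sequence of commands, real and model agree.
-- Shrinking of [Cmd] comes for free from the list Arbitrary instance.
prop_faithful :: [Cmd] -> Property
prop_faithful cmds = monadicIO $ do
  ref  <- run (newIORef 0)
  run (mapM_ (exec ref) cmds)
  real <- run (readIORef ref)
  assert (real == foldl step 0 cmds)

main :: IO ()
main = quickCheck prop_faithful
```

When the property fails, QuickCheck shrinks the command list itself, which is why integrated shrinking of individual values matters less here.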

1 Like

Oh, that’s your blog. Could you please make the text bigger and the TOC a bit more contrasty? It is difficult to read.

Of course. That’s why I found your title to be (too) click-baity.

Well, if there is a chance that those tests will hit a jackpot, then there is a chance that your program will hit a jackpot during execution and fail. Also, there are cases where an exhaustive list of tests is just not realistic, like the de/serialization of arbitrary text.
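De/serialization is the canonical case: a round-trip property quantifies over all inputs, where a hand-written list of examples cannot. A minimal sketch with an invented toy codec (only the QuickCheck API is real):

```haskell
import Test.QuickCheck

-- A toy encoder that escapes newlines and backslashes;
-- decode is meant to be its exact inverse.
encode :: String -> String
encode = concatMap esc
  where
    esc '\n' = "\\n"
    esc '\\' = "\\\\"
    esc c    = [c]

decode :: String -> String
decode ('\\' : 'n'  : rest) = '\n' : decode rest
decode ('\\' : '\\' : rest) = '\\' : decode rest
decode (c : rest)           = c : decode rest
decode []                   = []

-- The property covers arbitrary text, including the tricky
-- escape characters a handwritten example list might miss.
prop_roundTrip :: String -> Bool
prop_roundTrip s = decode (encode s) == s

main :: IO ()
main = quickCheck prop_roundTrip
```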

But actually, I like this way of testing for a different reason: it switches your focus from example execution to things that will hold true for any possible execution. It is just a more general way of reasoning about code or a solution, and in my experience it often leads to a better design or understanding of the problem at hand.

How does falsify rank in the survey? I have my eye on that one. It doesn’t support monadic properties yet, but there’s a PR up for it.

I also like Hedgehog over QuickCheck, but can someone explain how you still get good coverage if Hedgehog forces you to specify a range? I would imagine it’s easier to miss edge cases like overflow/underflow.

1 Like

Support for monadic properties is a pre-requisite for stateful and parallel testing.

I don’t, but property tests don’t solve that issue: I still have to define properties that hit the wanted cases often enough. The pessimistic view here is that I focus on executing working properties a thousand times each while losing track of the edge cases (since the very definitions now run in terms of properties).

True, that’s why I tested radix trees in a similar way, handcrafting tree generators.

I think for parsers it’s more complex than that, because the tests break into two types: testing control flow in general (where property-based testing catches internal state alterations) and testing primitive parsers (where edge cases are easy to describe). That’s how binary did it (Action.hs for properties, QC.hs for equalities).

I guess my concern is that I consider property-based testing a specific tool for a job, so coming up with new libraries for shrinking and custom generator monad stacks feels overkill.

1 Like

Is this really a valid concern, though? Software testing is ultimately about economic judgment; you want the best bang for the buck, and as you point out, that varies case by case. If it is ever a valid option, then why not have a better library providing it?

Plus, as I was trying to say, but failed to put it right, property-based testing makes you come up with an actual specification of the thing under test (even if the tests themselves can’t really prove that spec). That makes it special, compared to most other example-based techniques.

1 Like

This happened to me twice.

First at work, where we tested a financial algorithm that would produce odd results when a loan started in 1860 or something. We never figured out why.

Second in the filepath library, where I used property tests to assert that the new version is semantically equivalent to the old version (after the large OsPath rewrite): filepath/tests/filepath-equivalent-tests/TestEquiv.hs at master · haskell/filepath · GitHub

Every 3 months the CI would fail and something new popped up, until I got tired of it. The problem with property tests is that you need really good generators. Windows filepaths are very complex (unlike Unix ones) and it’s almost impossible to hit interesting corner cases if you generate naive random ASCII strings as filepaths.

So what I did then was describe Windows filepaths as an ABNF, write an ADT for it, then a generator for that ADT, tweaking the distribution a little to ensure it doesn’t exhaust on lists but always “visits” all branches of the nested sum types… and finally a “Show” instance that turns it back into a real filepath.

The result is here: filepath/tests/filepath-equivalent-tests/Gen.hs at master · haskell/filepath · GitHub
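As a much-simplified illustration of that grammar → ADT → generator → render pipeline (nowhere near the real Gen.hs; all names and the tiny grammar fragment below are invented), the shape is roughly:

```haskell
import Data.List (intercalate)
import Test.QuickCheck

-- A tiny fragment of the idea: an ADT mirroring (part of) the grammar...
data Prefix
  = Drive Char          -- e.g. "C:\"
  | UNC String String   -- e.g. "\\server\share\"
  | Relative
  deriving Show

data WinPath = WinPath Prefix [String]
  deriving Show

-- ...rendered back into an actual filepath string.
render :: WinPath -> String
render (WinPath p cs) = pre p ++ intercalate "\\" cs
  where
    pre (Drive d)    = [d, ':', '\\']
    pre (UNC srv sh) = "\\\\" ++ srv ++ "\\" ++ sh ++ "\\"
    pre Relative     = ""

component :: Gen String
component = resize 8 (listOf1 (elements (['a'..'z'] ++ "._ ")))

-- Tweaked distribution: every branch of the sum type gets visited,
-- and lists are kept small so generation doesn't exhaust.
genWinPath :: Gen WinPath
genWinPath = WinPath
  <$> frequency
        [ (3, Drive <$> elements ['A'..'Z'])
        , (1, UNC <$> component <*> component)
        , (2, pure Relative)
        ]
  <*> resize 4 (listOf component)
```

The real generator additionally has to cover device paths, reserved names, separators, and so on, which is where most of the work went.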

My conclusion is:

  • property tests are rarely a substitute for unit tests (not all behaviors you want to test for are really “properties”)
  • they are not very useful unless you write really good generators

3 Likes

It’s hard to summarize how I feel about generators, but maybe this does the job: “generators are often complicated enough to themselves need tests”.

3 Likes

Thinking about “necessary features”: generators where you can define frequencies or probabilities for sub-generators are definitely a must-have. Which implies support for shrinking of these.

If you include the limits, that’s not a problem?
But I personally always use the explicit way, via frequency, something like this.
Important to know: Hedgehog shrinks frequency towards the first element of the list.

frequency [ (1, pure 0.0)
          , (1, pure (1.0/0.0))    -- +Infinity
          , (1, pure (-1.0/0.0))   -- -Infinity
          , (1, pure (0.0/0.0))    -- NaN
          , (30, float (linear SOME_FUNCTION_USING_FLOAT_RANGE))
          ]

The most obvious usage of these kind of generators are Unicode code points, as most of the people most of the time do not want all Unicode code points (or Unicode groups) with the same frequency.

frequency [(10, alphaNum), (5, latin1), (1, HGen.unicode)]

With numbers, and floats in particular, I also sometimes use different intervals with different frequencies; for floats this could include denormalized floats (floats which are not of the form 1.?e-N but 0.0...0?e-N).

Something like

frequency [ (1, pure 0)
          , (1, pure MAX_POS)
          , (1, pure MIN_NEG)
          , (10, int (linear MIN_NEG MAX_POS))
          , (30, int (linear (-50) (-7)))
          , (50, int (linear 123 12345678))
          ]

Yes, this is also something where all(?) tutorials, … fall short.

Yes, I actually often do that - at least “visually”, by looking at “enough” (whatever that means) generated values.

1 Like

When testing pure functions, I agree with this conclusion. When testing stateful systems (which the post is about), less so, as there the property is that the real system is faithful to the fake/model. This covers all unit tests about the real system while at the same time contract-testing the fake, thereby making it possible to use it with confidence as part of integration tests further down the road.

Generators for commands tend to be simple compared to the functionality they exercise (they are the inputs/API of your black box), and the library already takes care of shrinking lists of commands, so you only have to focus on generating individual commands.

One time that property-based testing can really give you a lot of bang for very little buck is when you have an existing good model of the behaviour you want, for example when writing an optimized version of a data structure or algorithm. Testing against the original model removes the need to come up with interesting properties, since the property is always “behaves the same”; you just need to come up with generators that exercise interesting behaviour (which is non-trivial, as people have pointed out).

I once wrote a very mutable circular buffer using lots of array indices and suchlike, and a single property test comparing it to an existing queue implementation caught dozens of bugs. Since the generator would also produce “simple” operations, I didn’t even need to write any basic smoke/unit tests. It was sweet.
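The shape of such a model-based test is easy to sketch. Here is an invented miniature in the same spirit: a two-list queue (the “optimized” structure) checked against a naive list model, with operations generated as a plain list so shrinking is free:

```haskell
import Test.QuickCheck

-- The "optimized" implementation under test: a two-list queue.
data Queue a = Queue [a] [a]

empty :: Queue a
empty = Queue [] []

push :: a -> Queue a -> Queue a
push x (Queue front back) = Queue front (x : back)

pop :: Queue a -> Maybe (a, Queue a)
pop (Queue [] [])      = Nothing
pop (Queue [] back)    = pop (Queue (reverse back) [])
pop (Queue (x : f) b)  = Just (x, Queue f b)

-- Operations, generated as a plain list so shrinking is free.
data Op = Push Int | Pop
  deriving Show

instance Arbitrary Op where
  arbitrary = frequency [(3, Push <$> arbitrary), (2, pure Pop)]

-- Run the ops against the queue, collecting everything popped.
runQueue :: [Op] -> [Int]
runQueue = go empty
  where
    go _ []              = []
    go q (Push x : ops)  = go (push x q) ops
    go q (Pop : ops)     = case pop q of
      Nothing      -> go q ops
      Just (x, q') -> x : go q' ops

-- The model: the same ops against a naive list.
runModel :: [Op] -> [Int]
runModel = go []
  where
    go _ []                 = []
    go m (Push x : ops)     = go (m ++ [x]) ops
    go [] (Pop : ops)       = go [] ops
    go (x : m) (Pop : ops)  = x : go m ops

-- The single property: "behaves the same" as the model.
prop_matchesModel :: [Op] -> Bool
prop_matchesModel ops = runQueue ops == runModel ops

main :: IO ()
main = quickCheck prop_matchesModel
```

Because short operation lists are generated too, the same property doubles as the smoke test.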

(EDIT: I see steven said something similar, the point that generating lists of commands/operations is easier is also definitely a good one)

3 Likes

I have a similar example in the post btw.