As of 2023, this library is largely obsolete: arbitrary test generators with shrinking such as falsify offer much better user experience.
I don’t understand why that is. As great as falsify is, I find it annoying to write generators again and again. Maybe I’m not using generics and deriving-via enough; I find them a bit hard to wrap my head around in that library.
But I find the principles of smallcheck so great: by construction, the smallest test case is tested first. No worries about shrinking. Series of test values can be implemented without boilerplate, using generics.
So why would someone claim that smallcheck has worse user experience? What’s wrong with it?
I’m surprised this happens so quickly. Could it be implementation-specific, because of MonadLogic? What if Series were implemented as a simple list that contains all the values in ascending size? After all, we don’t really need backtracking, so MonadLogic is maybe oversized.
It only needs base as a dependency and works with lists. It runs 10 million tests on a largeish record (8 fields, some of them recursive) in less than half a minute on my laptop.
Can you provide me with a test that explodes in smallcheck? I’ll have a look whether it’s viable in this testing framework.
@turion the problem is not the raw performance of inputs generation, the problem is that for a given depth their number grows exponentially. So with default depth 5 a record with 6 fields will generate roughly 5^6 test cases. And if your property takes two of such records, it’s 5^12, almost a quarter of a billion of them. So at this point you start to micromanage smallcheck tweaking depth here and there (which is already annoying) and end up setting depth of 2 for such test. It’s still at least 2^12 test cases, which is probably okayish, but you are barely testing anything with such depth, right?..
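The arithmetic above is easy to reproduce. Here is a minimal base-only sketch (the `recordsOf` helper is made up for illustration, not SmallCheck's actual API): if each of k record fields enumerates d candidate values, exhaustive testing means d^k combinations.

```haskell
import Control.Monad (replicateM)

-- All combinations of k field values, each field drawn from d candidates.
-- This is what "exhaustive up to depth d" amounts to for a flat record.
recordsOf :: Int -> Int -> [[Int]]
recordsOf d k = replicateM k [1 .. d]

main :: IO ()
main = do
  print (length (recordsOf 5 6))   -- one 6-field record at depth 5: 5^6 = 15625
  print (5 ^ (12 :: Integer))      -- a property over two such records: 5^12
```

At depth 2 the same record gives only 2^6 = 64 combinations, which shows how brutally the depth knob trades coverage for runtime.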
Imagine testing a property on strings. What would you like smallcheck to do? Should it limit itself to enumerating ['a'..] for small depth? In such case it would never generate even something very simple, like a string with a space. Or a digit. Or an upper case letter. quickcheck is so much better: in a matter of microseconds it will throw all kinds of inputs on your test property.
strings don’t seem like a reason to deprecate smallcheck in favor of QC tho lolol. A reason to use QC, yeah. But using sc with strings is just using it wrong..
and you’d get strings, strings separated by spaces, strings separated by spaces and newlines both very early on. If it’s important to have spaces and newlines, then you can simply merge a generator that produces them into the default one.
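Merging a custom generator into the default one could be as simple as a fair interleave of two lists, if series really were plain lists in ascending size. A base-only sketch (all names here are made up, not SmallCheck's API):

```haskell
import Control.Monad (replicateM)

-- Fair interleaving of two enumerations, so values from both show up early.
interleave :: [a] -> [a] -> [a]
interleave (x : xs) ys = x : interleave ys xs
interleave []       ys = ys

-- Default enumeration: lowercase strings of growing length.
defaultStrings :: [String]
defaultStrings = concatMap (\n -> replicateM n ['a' .. 'c']) [0 ..]

-- Extra generator that makes whitespace show up immediately.
withSpaces :: [String]
withSpaces = [" ", "\n", "a b", "a\nb"]

merged :: [String]
merged = interleave withSpaces defaultStrings

main :: IO ()
main = print (take 8 merged)
-- [" ","","\n","a","a b","b","a\nb","c"]
```

The merged enumeration is still deterministic and still roughly ascending in size, but spaces and newlines are among the very first test cases.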
Or if you need longer strings, you can do:
atLeast :: Int -> TestCases String
atLeast n = (<>) <$> replicateM n arbitrary <*> arbitrary
My point is, if you have requirements of some specific data showing up in the generator, then fix the generator. But I don’t see how there is a fundamental limitation to enumerating values.
Ok, here is maybe one thing: The concept of depth is maybe too restrictive. My approach here is to just enumerate in some order ad hoc, without keeping track precisely how deep we are going. If we need a few deeper values early on, just mix them into the generator. Keeping track of depth is not necessary, we can just restrict the number of test cases instead.
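To make that concrete, here is a sketch of "no depth, just a budget": an enumeration is an infinite list in roughly ascending size, and the runner takes the first n values and reports the first failure. Hypothetical names, base only:

```haskell
-- Check a property against the first n enumerated values; return the first
-- (and hence a smallest-first) counterexample, if any.
checkFirst :: Int -> (a -> Bool) -> [a] -> Maybe a
checkFirst n prop = foldr step Nothing . take n
  where
    step x acc = if prop x then acc else Just x

-- Ints in ascending magnitude: 0, 1, -1, 2, -2, ...
ints :: [Int]
ints = 0 : concatMap (\n -> [n, negate n]) [1 ..]

main :: IO ()
main = print (checkFirst 1000 (< 10) ints)
-- Just 10, which is also the minimal counterexample of (< 10)
```

No depth parameter anywhere: the only knob is the number of test cases, and "smallest counterexample first" falls out of the enumeration order.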
How would you know upfront whether it’s important or not? If I definitely know that something is important, I’ll write a unit test to nail it down, no need for quickcheck or smallcheck. The very point of property testing is to get confronted by the unknown. And QuickCheck will just throw at me spaces, newlines, digits, punctuation, upper case, lower case, Unicode and what not. While SmallCheck will require me to write a dozen newtypes with different generators and tweak depth here and there. And all of it for what?
I mean, one can definitely make SmallCheck work. I have fairly extensive test suites using SmallCheck. Guess what? I’ve never seen a case where it would spot a bug, which QuickCheck didn’t.
What would be a reason to use SmallCheck tho lolol?
Not surprising; that’s really not where the value of exhaustive—or rather, simplest first—testing lies. IMO the big win is getting a genuinely minimal counterexample to your property in cases where greedy shrinking just isn’t good enough.
When my domain is small enough? I’d expect most professional developers to be able to make that determination..
QC “works” sometimes, but for certain types of programs that kind of random testing is an easy way to turn master red and randomly annoy everyone you work with.
idk I just read “deprecated in favor of” as meaning the two things are actually fundamentally equivalent, not just preferable as a matter of opinion.
Realistically, shrink is already implemented for all basic types, and then genericShrink gets you 90% of the way there. There are use cases for writing shrink manually, but they are usually in areas which are far beyond SmallCheck’s reach anyway.
Could you give an example?
Perhaps you are seeing something which is not there?.. SmallCheck description does not contain your quote.
Greedy shrinking will naturally fail to shrink any part of a counterexample that must relate to some other part of the counterexample. In the common case it can still shrink around one or two simple interrelations (e.g. two numbers that must be equal) and produce something simple enough. When those interrelations become more complex and plentiful, however, a greedy shrinker stalls out all too soon.
In particular, I’ve seen this when shrinking things that are more code-like than data-like, e.g. a representation of an opaque type by an AST of its operations.
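A toy illustration of that stall (this is a deliberately naive greedy shrinker, not QuickCheck's real one): the property only fails when the two fields of a pair are equal, so every single-field shrink candidate stops failing and the shrinker never moves.

```haskell
-- Naive shrink candidates for an Int (sketch, not QuickCheck's shrink).
shrinkInt :: Int -> [Int]
shrinkInt n
  | n > 0     = [0, n `div` 2, n - 1]
  | otherwise = []

-- Greedy pair shrinking: change one component at a time.
shrinkPair :: (Int, Int) -> [(Int, Int)]
shrinkPair (a, b) =
  [(a', b) | a' <- shrinkInt a] ++ [(a, b') | b' <- shrinkInt b]

-- Repeatedly move to the first still-failing candidate; stop when none fail.
greedy :: ((Int, Int) -> Bool) -> (Int, Int) -> (Int, Int)
greedy fails p =
  case filter fails (shrinkPair p) of
    (q : _) -> greedy fails q
    []      -> p

-- Counterexamples are exactly the pairs (n, n) with n >= 3;
-- the minimal one is (3, 3).
failing :: (Int, Int) -> Bool
failing (a, b) = a == b && a >= 3

main :: IO ()
main = print (greedy failing (100, 100))
-- (100,100): every single-field shrink breaks a == b, so the shrinker stalls
```

A smallest-first enumeration of pairs would instead report (3, 3) directly, with no shrinking at all.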
To counter the previous post, let me just say briefly that I thank you for your maintenance to the library and your detailed responses here. I fully respect your decision to invest your time into projects that you find more rewarding.
The claim that “if a program fails to meet its specification in some cases, it almost always fails in some simple case”
is false. There are many practical cases where a minimal counterexample is not that small or simple. Hybrid algorithms provide many examples.
To be concrete, imagine testing a hybrid sorting algorithm, which starts with divide-and-conquer quicksort but switches to insertion sort once subarrays have fewer than N elements. On modern hardware typical values of N are [15..20] (say, it’s 18 in vector-algorithms). If you try to test it with SmallCheck you would never ever get into testing the quicksort part of it, which could happily remain undefined, because enumerating all arrays up to 18 elements will take too long. But QuickCheck will get there fairly quickly.
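The shape of such an algorithm, as a hedged list-based sketch (the cutoff of 18 follows the vector-algorithms figure mentioned above; the function names are made up): every input of 18 or fewer elements goes straight to insertion sort, so the quicksort branch is untouched by any enumeration that never exceeds that length.

```haskell
import Data.List (insert)

cutoff :: Int
cutoff = 18

insertionSort :: Ord a => [a] -> [a]
insertionSort = foldr insert []

-- Hybrid sort: quicksort-style divide and conquer, falling back to
-- insertion sort below the cutoff. Short inputs never reach the
-- recursive branch, so a bug there is invisible to small test cases.
hybridSort :: Ord a => [a] -> [a]
hybridSort xs
  | length xs <= cutoff = insertionSort xs
hybridSort (p : rest) =
  hybridSort [x | x <- rest, x < p] ++ [p] ++ hybridSort [x | x <- rest, x >= p]

main :: IO ()
main = print (hybridSort [20, 19 .. 1 :: Int])
```

Only the 20-element input above ever executes the quicksort branch; replace that branch with undefined and every test on lists of length <= 18 still passes.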