Quick Sort variants performance

Edit: I also posted the question on StackOverflow (since there is a large audience) with some edits/ameliorations on the question. Here is the link:

I still leave the non-edited question below for further references but you should go see the one on StackOverflow. Also, you can delete this post if it is better to only have the question in one place.

Hello,

I am currently reading about functional data structures and algorithms and I tried different quick sort implementations. I noticed, however, that there is a lot of variation in their performances.

Here are some selected ones that I want to discuss (all the programs have been compiled for test, and the list to be sorted was the (worst case already sorted (which implies that the time taken is in O(n^2))) [1 . . 20000] list):

  1. qs1 []             = []
    qs1 (pivot : rest) = qs1 lower ++ [pivot] ++ qs1 upper  where
      lower = [ x | x <- rest, x <= pivot ]
      upper = [ x | x <- rest, x >  pivot ]
    

    This is the classic sort we see when we learn the language. Here, the rest list is traversed twice to filter first the elements less or equal than the pivot, then the elements greater than the pivot. The time taken was 6.45 sec.

  2. qs2 []             = []
    qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper  where
      (lower, upper) = foldr
        (\x (l, u) -> if x <= pivot then (x : l, u) else (l, x : u))
        ([], [])
        rest
    

    I was surprised by this one. It is the same as above, except that it only traverses the rest list once. It did worse than the first with a total of 8.11 sec. I tried the partition function from Data.List library, but it only did worse (much worse), with a catastrophic 10.99 sec. I checked the implementation, and yet it is almost the same as my lambda function, although it relies on an utility function:

    partition               :: (a -> Bool) -> [a] -> ([a],[a])
    partition p xs = foldr (select p) ([],[]) xs
    
    select :: (a -> Bool) -> a -> ([a], [a]) -> ([a], [a])
    select p x ~(ts,fs) | p x       = (x:ts,fs)
                        | otherwise = (ts, x:fs)
    

    (Taken from https://hackage.haskell.org/package/base-4.14.0.0/docs/src/Data.OldList.html#partition).

  3. qs3 []             s = s
    qs3 [x           ] s = x : s
    qs3 (pivot : rest) s = qs3 lower (pivot : (qs3 upper s))  where
        lower = [ x | x <- rest, x <= pivot ]
        upper = [ x | x <- rest, x >  pivot ]
    

    In this one, there are 2 novelties. Fist, the removal of append ++ in favour of cons :. Second, it is a tail recursion function, so it should (in principle) be faster. However, it is merely better than the first one, with a time of 6.42 sec. In fact, due to variation from an execution to the other, it probably is the same.

  4. qs4 []             s = s
    qs4 [x           ] s = x : s
    qs4 (pivot : rest) s = qs4 lower (pivot : (qs4 upper s))  where
      (lower, upper) = foldr
        (\x (l, u) -> if x <= pivot then (x : l, u) else (l, x : u))
        ([], [])
        rest
    

    Again, this is the same as above, except that I replaced the 2 list traversal by a foldr, and again, the time taken is increased: 8.02 sec.

  5. split pivot [] lower upper s = qs5 lower (pivot : qs5 upper s)
    split pivot (h : t) lower upper s
         | h <= pivot = split pivot t (h : lower) upper s
         | otherwise  = split pivot t lower (h : upper) s
    
    qs5 []             s = s
    qs5 (pivot : rest) s = split pivot rest [] [] s
    

    This one is the fastest. I recorded a stunning time of 2.82 seconds, which is almost 4 times faster than the ā€œslowestā€. It bounces between 2 functions that calls each other. split is a recursive function that separates the elements of the rest list sent by the qs5 functions, which it returns to when it is done partitioning.


Conclusion: What the hell is going on here? I am puzzled by all the hidden subtleties done during the compilation that makes my expectations concerning the performance of the program go wrong. I humbly thank anyone who could help me untangle the pieces of this jigsaw by pointing out what is going on under the hood.

1 Like

One thing I notice here is the lack of a laziness annotation on the tuple in the fold, I think the performance should change if you do it like this:

qs2 []             = []
qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper  where
  ~(lower, upper) = foldr
    (\x ~(l, u) -> if x <= pivot then (x : l, u) else (l, x : u))
    ([], [])
    rest

Edit: I just tested this and it is slower, the most time is spent on garbage collectionā€¦

Also how are you evaluating each function? Are you printing the result or do you use deepseq or something similar? I always test these kind of functions with a custom seqList:

seqList :: [a] -> b -> b
seqList [] y = y
seqList (x:xs) y = seqList xs `seq` y

Also I recommend running with +RTS -s (or even better: criterion or gauge) to get more accurate time measurements than time.

On my machine (with -O2) qs3 is the fastest on [1..20000000] as input. It takes 1.9 seconds, qt5 takes 2.3 seconds.

Here are benchmark results for an input of [1..20000] (q2ā€™ is the lazy version I wrote above):

benchmarked qs1
time                 123.4 Ī¼s   (111.5 Ī¼s .. 136.2 Ī¼s)
                     0.966 RĀ²   (0.950 RĀ² .. 0.991 RĀ²)
mean                 103.8 Ī¼s   (101.7 Ī¼s .. 107.7 Ī¼s)
std dev              9.388 Ī¼s   (5.831 Ī¼s .. 16.84 Ī¼s)
variance introduced by outliers: 58% (severely inflated)

benchmarked qs2
time                 315.8 Ī¼s   (305.8 Ī¼s .. 324.2 Ī¼s)
                     0.985 RĀ²   (0.969 RĀ² .. 0.995 RĀ²)
mean                 349.6 Ī¼s   (340.2 Ī¼s .. 363.3 Ī¼s)
std dev              37.62 Ī¼s   (28.18 Ī¼s .. 49.01 Ī¼s)
variance introduced by outliers: 66% (severely inflated)

benchmarked qs2'
time                 1.135 ms   (1.125 ms .. 1.148 ms)
                     0.999 RĀ²   (0.998 RĀ² .. 1.000 RĀ²)
mean                 1.141 ms   (1.135 ms .. 1.149 ms)
std dev              24.44 Ī¼s   (18.39 Ī¼s .. 39.75 Ī¼s)

benchmarked qs3
time                 111.8 Ī¼s   (111.3 Ī¼s .. 112.4 Ī¼s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 111.4 Ī¼s   (111.2 Ī¼s .. 111.7 Ī¼s)
std dev              845.4 ns   (672.1 ns .. 1.152 Ī¼s)

benchmarked qs4
time                 328.7 Ī¼s   (327.5 Ī¼s .. 329.7 Ī¼s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 326.9 Ī¼s   (326.3 Ī¼s .. 327.6 Ī¼s)
std dev              2.279 Ī¼s   (1.878 Ī¼s .. 2.804 Ī¼s)

benchmarked qs5
time                 204.6 Ī¼s   (203.7 Ī¼s .. 205.4 Ī¼s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 205.2 Ī¼s   (204.8 Ī¼s .. 206.1 Ī¼s)
std dev              2.062 Ī¼s   (1.396 Ī¼s .. 3.525 Ī¼s)

This is because the way youā€™re folding is not very efficient, you can solve it by using foldl':

qs2 []             = []
qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper  where
  (lower, upper) = foldl'
    (\(l, u) x -> if x <= pivot then (x : l, u) else (l, x : u))
    ([], [])
    rest

Or, I like this more, use this trick:

qs2 []             = []
qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper  where
  (lower, upper) = foldr
    (\x xs (l, u) -> xs (if x <= pivot then (x : l, u) else (l, x : u)))
    id
    rest
    ([], [])

Both of these perform nearly identical in my tests:

benchmarked qs2-foldl
time                 188.9 Ī¼s   (187.3 Ī¼s .. 191.4 Ī¼s)
                     0.999 RĀ²   (0.998 RĀ² .. 1.000 RĀ²)
mean                 189.1 Ī¼s   (188.3 Ī¼s .. 190.2 Ī¼s)
std dev              2.971 Ī¼s   (1.963 Ī¼s .. 4.553 Ī¼s)

benchmarked qs2-trick
time                 196.5 Ī¼s   (190.3 Ī¼s .. 202.6 Ī¼s)
                     0.996 RĀ²   (0.993 RĀ² .. 0.999 RĀ²)
mean                 190.0 Ī¼s   (188.8 Ī¼s .. 192.1 Ī¼s)
std dev              4.919 Ī¼s   (3.061 Ī¼s .. 7.414 Ī¼s)
variance introduced by outliers: 11% (moderately inflated)

Hereā€™s my results using the partition function:

benchmarked qs2-partition
time                 1.179 ms   (1.158 ms .. 1.202 ms)
                     0.998 RĀ²   (0.997 RĀ² .. 0.999 RĀ²)
mean                 1.135 ms   (1.129 ms .. 1.145 ms)
std dev              25.00 Ī¼s   (18.57 Ī¼s .. 35.61 Ī¼s)

This has about the same time as my lazy qs2' solution. Iā€™m guessing the laziness causes more memory to be occupied for a longer time which decreases cache locality and might take more time to allocate and free. The advantage of the more lazy qs2' and partition is that they can work on infinite lists (of course sorting wonā€™t work on infinite lists, but partitions can work on infinite lists).

Looking at the core (compiling with -ddump-simpl -ddump-to-file -dsuppress-all -fforce-recomp) I can tell that GHC already did this optimization in the previous versions.

Sorry for the late response!

First of all, for the function evaluation, I had a main function main = print (qsX [1 .. 20000]). I tried, on your advice, to benchmark with criterion, but strangely, the times were slower. Here are the stats without optimization (I built with stack build <name>):

benchmarking qs/appendListComprehension
time                 9.255 s    (9.236 s .. 9.263 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 9.240 s    (9.234 s .. 9.245 s)
std dev              7.463 ms   (5.176 ms .. 9.073 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendPartition
time                 22.47 s    (20.79 s .. 24.38 s)
                     0.999 RĀ²   (0.997 RĀ² .. 1.000 RĀ²)
mean                 22.41 s    (22.11 s .. 22.60 s)
std dev              283.3 ms   (88.87 ms .. 351.9 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendFoldr
time                 5.314 s    (5.219 s .. 5.368 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 5.310 s    (5.297 s .. 5.328 s)
std dev              17.38 ms   (5.951 ms .. 23.75 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendFoldl'
time                 7.565 s    (6.971 s .. 7.910 s)
                     0.999 RĀ²   (0.999 RĀ² .. 1.000 RĀ²)
mean                 7.864 s    (7.678 s .. 8.203 s)
std dev              329.3 ms   (16.83 ms .. 399.8 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailListComprehension
time                 9.331 s    (NaN s .. 9.598 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 9.457 s    (9.391 s .. 9.502 s)
std dev              71.23 ms   (32.20 ms .. 91.37 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailFoldr
time                 5.243 s    (5.090 s .. 5.317 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 5.262 s    (5.244 s .. 5.280 s)
std dev              22.17 ms   (8.449 ms .. 28.50 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailFoldl'
time                 3.541 s    (3.532 s .. 3.557 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 3.538 s    (3.536 s .. 3.540 s)
std dev              2.275 ms   (732.7 Ī¼s .. 3.108 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/bounceRec
time                 3.609 s    (3.571 s .. 3.639 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 3.611 s    (3.605 s .. 3.618 s)
std dev              7.931 ms   (2.888 ms .. 9.921 ms)
variance introduced by outliers: 19% (moderately inflated)

and here with optimization on (I built with stack build --ghc-options="-O2" <name>):

benchmarking qs/appendListComprehension
time                 7.053 s    (6.746 s .. 7.246 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 7.146 s    (7.088 s .. 7.178 s)
std dev              55.64 ms   (21.03 ms .. 72.91 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendPartition
time                 22.02 s    (21.95 s .. 22.10 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 22.01 s    (22.00 s .. 22.02 s)
std dev              13.34 ms   (11.68 ms .. 14.18 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendFoldr
time                 4.055 s    (4.038 s .. 4.065 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 4.062 s    (4.057 s .. 4.071 s)
std dev              8.356 ms   (754.8 Ī¼s .. 10.38 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendFoldl'
time                 7.440 s    (6.979 s .. 8.094 s)
                     0.999 RĀ²   (0.999 RĀ² .. 1.000 RĀ²)
mean                 7.258 s    (7.148 s .. 7.353 s)
std dev              123.4 ms   (31.13 ms .. 161.7 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailListComprehension
time                 7.053 s    (6.865 s .. 7.305 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 7.009 s    (6.981 s .. 7.038 s)
std dev              36.40 ms   (15.83 ms .. 48.36 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailFoldr
time                 4.144 s    (4.081 s .. 4.277 s)
                     1.000 RĀ²   (1.000 RĀ² .. 1.000 RĀ²)
mean                 4.170 s    (4.146 s .. 4.202 s)
std dev              30.98 ms   (1.319 ms .. 38.19 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailFoldl'
time                 2.925 s    (2.852 s .. 3.088 s)
                     1.000 RĀ²   (NaN RĀ² .. 1.000 RĀ²)
mean                 2.925 s    (2.899 s .. 2.952 s)
std dev              32.59 ms   (16.25 ms .. 42.00 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/bounceRec
time                 2.967 s    (2.817 s .. 3.135 s)
                     1.000 RĀ²   (0.999 RĀ² .. 1.000 RĀ²)
mean                 3.020 s    (2.983 s .. 3.057 s)
std dev              46.09 ms   (24.79 ms .. 56.32 ms)
variance introduced by outliers: 19% (moderately inflated)

As for the ā€œtrickā€ function, I wasnā€™t aware of it. I must admit that I was perplex when I first saw it. The ā€œ3ā€ arguments function (\x xs (l, u) -> xs (if x <= pivot then (x : l, u) else (l, x : u))) is a bit counter-intuitive, I would have gone with this syntax to make it clearer: (\x fs -> \(l, u) -> fs (if x <= pivot then (x : l, u) else (l, x : u))). I replaced xs by fs since x is a list element and xs is not part of it, instead, it represents the accumulator containing the embedded functions. Anyway, it seems like foldl' does the same job, but it is a nice thing to know.

So while it it true that with the optimizations in place the functions behaviours seem more ā€œas expectedā€, I donā€™t get why criterion analysis is slower.