Quick Sort variants performance

Edit: I also posted the question on StackOverflow (since it has a larger audience) with some edits and improvements. Here is the link:


I still leave the non-edited question below for future reference, but you should go see the one on StackOverflow. Also, you can delete this post if it is better to have the question in only one place.

Hello,

I am currently reading about functional data structures and algorithms, and I tried different quick sort implementations. I noticed, however, that their performance varies a great deal.

Here are some selected ones that I want to discuss (all the programs were compiled for the test, and the list to be sorted was the already-sorted list [1 .. 20000], which is the worst case and implies O(n^2) running time):

  1. qs1 []             = []
    qs1 (pivot : rest) = qs1 lower ++ [pivot] ++ qs1 upper  where
      lower = [ x | x <- rest, x <= pivot ]
      upper = [ x | x <- rest, x >  pivot ]
    

    This is the classic sort we see when we learn the language. Here, the rest list is traversed twice: first to filter the elements less than or equal to the pivot, then the elements greater than the pivot. The time taken was 6.45 sec.

  2. qs2 []             = []
    qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper  where
      (lower, upper) = foldr
        (\x (l, u) -> if x <= pivot then (x : l, u) else (l, x : u))
        ([], [])
        rest
    

    I was surprised by this one. It is the same as above, except that it traverses the rest list only once. Yet it did worse than the first, with a total of 8.11 sec. I tried the partition function from the Data.List library, but it did even worse, with a catastrophic 10.99 sec. I checked the implementation, and it is almost the same as my lambda, although it relies on a utility function:

    partition               :: (a -> Bool) -> [a] -> ([a],[a])
    partition p xs = foldr (select p) ([],[]) xs
    
    select :: (a -> Bool) -> a -> ([a], [a]) -> ([a], [a])
    select p x ~(ts,fs) | p x       = (x:ts,fs)
                        | otherwise = (ts, x:fs)
    

    (Taken from https://hackage.haskell.org/package/base-4.14.0.0/docs/src/Data.OldList.html#partition).

  3. qs3 []             s = s
    qs3 [x           ] s = x : s
    qs3 (pivot : rest) s = qs3 lower (pivot : (qs3 upper s))  where
        lower = [ x | x <- rest, x <= pivot ]
        upper = [ x | x <- rest, x >  pivot ]
    

    In this one, there are two novelties. First, the append ++ is replaced by cons :. Second, the sorted suffix is threaded through as an accumulator, so the outer recursive call is a tail call and it should (in principle) be faster. However, it is barely better than the first one, with a time of 6.42 sec; given the variation from one run to the next, it is probably the same.

  4. qs4 []             s = s
    qs4 [x           ] s = x : s
    qs4 (pivot : rest) s = qs4 lower (pivot : (qs4 upper s))  where
      (lower, upper) = foldr
        (\x (l, u) -> if x <= pivot then (x : l, u) else (l, x : u))
        ([], [])
        rest
    

    Again, this is the same as above, except that I replaced the two list traversals by a foldr, and again the time increased: 8.02 sec.

  5. split pivot [] lower upper s = qs5 lower (pivot : qs5 upper s)
    split pivot (h : t) lower upper s
         | h <= pivot = split pivot t (h : lower) upper s
         | otherwise  = split pivot t lower (h : upper) s
    
    qs5 []             s = s
    qs5 (pivot : rest) s = split pivot rest [] [] s
    

    This one is the fastest. I recorded a stunning 2.82 seconds, almost 4 times faster than the “slowest”. It bounces between two functions that call each other: split recursively partitions the rest list passed by qs5, and hands control back to qs5 when it is done partitioning.
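
For completeness, here is how I run and check the accumulator-based variants (the wrapper name quickSort5 and the comparison against Data.List.sort are mine, just as a sanity check):

```haskell
import Data.List (sort)

-- Variant 5 from above: split partitions, qs5 recurses.
qs5 :: Ord a => [a] -> [a] -> [a]
qs5 []             s = s
qs5 (pivot : rest) s = split pivot rest [] [] s

split :: Ord a => a -> [a] -> [a] -> [a] -> [a] -> [a]
split pivot []      lower upper s = qs5 lower (pivot : qs5 upper s)
split pivot (h : t) lower upper s
  | h <= pivot = split pivot t (h : lower) upper s
  | otherwise  = split pivot t lower (h : upper) s

-- The accumulator versions take [] as the initial "sorted suffix".
quickSort5 :: Ord a => [a] -> [a]
quickSort5 xs = qs5 xs []

main :: IO ()
main = print (quickSort5 [3, 1, 2, 5, 4] == sort [3, 1, 2, 5, 4])
```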


Conclusion: What the hell is going on here? I am puzzled by all the hidden subtleties of compilation that make my expectations about the program’s performance go wrong. I humbly thank anyone who can help me untangle this jigsaw by pointing out what is going on under the hood.


One thing I notice here is the lack of a laziness annotation on the tuple in the fold, I think the performance should change if you do it like this:

qs2 []             = []
qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper  where
  ~(lower, upper) = foldr
    (\x ~(l, u) -> if x <= pivot then (x : l, u) else (l, x : u))
    ([], [])
    rest

Edit: I just tested this and it is slower; most of the time is spent on garbage collection…

Also, how are you evaluating each function? Are you printing the result, or do you use deepseq or something similar? I always test this kind of function with a custom seqList:

seqList :: [a] -> b -> b
seqList [] y = y
seqList (x:xs) y = x `seq` seqList xs y
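
For example, to time a sort I force the whole result inside the measured region (a sketch; forceList here is my own variant that forces the elements as well as the spine, and qs1 is the first version from the question):

```haskell
-- Forces the spine and every element of the list, then returns y.
forceList :: [a] -> b -> b
forceList []       y = y
forceList (x : xs) y = x `seq` forceList xs y

-- The classic quicksort from the question, for demonstration.
qs1 :: Ord a => [a] -> [a]
qs1 []             = []
qs1 (pivot : rest) =
  qs1 [x | x <- rest, x <= pivot] ++ [pivot] ++ qs1 [x | x <- rest, x > pivot]

main :: IO ()
main =
  -- Force the entire sorted list before printing "done", so the sort
  -- cannot be deferred past the end of the measured region.
  forceList (qs1 [1 .. 2000 :: Int]) (putStrLn "done")
```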

Also I recommend running with +RTS -s (or even better: criterion or gauge) to get more accurate time measurements than time.

On my machine (with -O2) qs3 is the fastest on [1..20000000] as input. It takes 1.9 seconds; qs5 takes 2.3 seconds.

Here are benchmark results for an input of [1..20000] (qs2’ is the lazy version I wrote above):

benchmarked qs1
time                 123.4 μs   (111.5 μs .. 136.2 μs)
                     0.966 R²   (0.950 R² .. 0.991 R²)
mean                 103.8 μs   (101.7 μs .. 107.7 μs)
std dev              9.388 μs   (5.831 μs .. 16.84 μs)
variance introduced by outliers: 58% (severely inflated)

benchmarked qs2
time                 315.8 μs   (305.8 μs .. 324.2 μs)
                     0.985 R²   (0.969 R² .. 0.995 R²)
mean                 349.6 μs   (340.2 μs .. 363.3 μs)
std dev              37.62 μs   (28.18 μs .. 49.01 μs)
variance introduced by outliers: 66% (severely inflated)

benchmarked qs2'
time                 1.135 ms   (1.125 ms .. 1.148 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 1.141 ms   (1.135 ms .. 1.149 ms)
std dev              24.44 μs   (18.39 μs .. 39.75 μs)

benchmarked qs3
time                 111.8 μs   (111.3 μs .. 112.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 111.4 μs   (111.2 μs .. 111.7 μs)
std dev              845.4 ns   (672.1 ns .. 1.152 μs)

benchmarked qs4
time                 328.7 μs   (327.5 μs .. 329.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 326.9 μs   (326.3 μs .. 327.6 μs)
std dev              2.279 μs   (1.878 μs .. 2.804 μs)

benchmarked qs5
time                 204.6 μs   (203.7 μs .. 205.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 205.2 μs   (204.8 μs .. 206.1 μs)
std dev              2.062 μs   (1.396 μs .. 3.525 μs)

This is because the way you’re folding is not very efficient; you can solve it by using foldl':

qs2 []             = []
qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper  where
  (lower, upper) = foldl'
    (\(l, u) x -> if x <= pivot then (x : l, u) else (l, x : u))
    ([], [])
    rest

Or, I like this more, use this trick:

qs2 []             = []
qs2 (pivot : rest) = qs2 lower ++ [pivot] ++ qs2 upper  where
  (lower, upper) = foldr
    (\x xs (l, u) -> xs (if x <= pivot then (x : l, u) else (l, x : u)))
    id
    rest
    ([], [])

Both of these perform nearly identically in my tests:

benchmarked qs2-foldl
time                 188.9 μs   (187.3 μs .. 191.4 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 189.1 μs   (188.3 μs .. 190.2 μs)
std dev              2.971 μs   (1.963 μs .. 4.553 μs)

benchmarked qs2-trick
time                 196.5 μs   (190.3 μs .. 202.6 μs)
                     0.996 R²   (0.993 R² .. 0.999 R²)
mean                 190.0 μs   (188.8 μs .. 192.1 μs)
std dev              4.919 μs   (3.061 μs .. 7.414 μs)
variance introduced by outliers: 11% (moderately inflated)

Here’s my results using the partition function:

benchmarked qs2-partition
time                 1.179 ms   (1.158 ms .. 1.202 ms)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 1.135 ms   (1.129 ms .. 1.145 ms)
std dev              25.00 μs   (18.57 μs .. 35.61 μs)

This has about the same time as my lazy qs2' solution. I’m guessing the laziness causes more memory to stay occupied for longer, which decreases cache locality and may take more time to allocate and free. The advantage of the lazier qs2' and partition is that they can work on infinite lists (of course sorting an infinite list won’t work, but partitioning one can).
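
To illustrate that last point, the lazy pattern in select is exactly what makes this terminate:

```haskell
import Data.List (partition)

main :: IO ()
main = do
  -- partition is lazy enough to split an infinite list:
  let (evens, odds) = partition even [1 ..] :: ([Int], [Int])
  print (take 5 evens)  -- [2,4,6,8,10]
  print (take 5 odds)   -- [1,3,5,7,9]
```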

Looking at the core (compiling with -ddump-simpl -ddump-to-file -dsuppress-all -fforce-recomp) I can tell that GHC already did this optimization in the previous versions.

Sorry for the late response!

First of all, to evaluate the functions I had a main function main = print (qsX [1 .. 20000]). On your advice, I tried benchmarking with criterion, but strangely the times were slower. Here are the stats without optimization (I built with stack build <name>):

benchmarking qs/appendListComprehension
time                 9.255 s    (9.236 s .. 9.263 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.240 s    (9.234 s .. 9.245 s)
std dev              7.463 ms   (5.176 ms .. 9.073 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendPartition
time                 22.47 s    (20.79 s .. 24.38 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 22.41 s    (22.11 s .. 22.60 s)
std dev              283.3 ms   (88.87 ms .. 351.9 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendFoldr
time                 5.314 s    (5.219 s .. 5.368 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.310 s    (5.297 s .. 5.328 s)
std dev              17.38 ms   (5.951 ms .. 23.75 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendFoldl'
time                 7.565 s    (6.971 s .. 7.910 s)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 7.864 s    (7.678 s .. 8.203 s)
std dev              329.3 ms   (16.83 ms .. 399.8 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailListComprehension
time                 9.331 s    (NaN s .. 9.598 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.457 s    (9.391 s .. 9.502 s)
std dev              71.23 ms   (32.20 ms .. 91.37 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailFoldr
time                 5.243 s    (5.090 s .. 5.317 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.262 s    (5.244 s .. 5.280 s)
std dev              22.17 ms   (8.449 ms .. 28.50 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailFoldl'
time                 3.541 s    (3.532 s .. 3.557 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.538 s    (3.536 s .. 3.540 s)
std dev              2.275 ms   (732.7 μs .. 3.108 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/bounceRec
time                 3.609 s    (3.571 s .. 3.639 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 3.611 s    (3.605 s .. 3.618 s)
std dev              7.931 ms   (2.888 ms .. 9.921 ms)
variance introduced by outliers: 19% (moderately inflated)

and here with optimization on (I built with stack build --ghc-options="-O2" <name>):

benchmarking qs/appendListComprehension
time                 7.053 s    (6.746 s .. 7.246 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 7.146 s    (7.088 s .. 7.178 s)
std dev              55.64 ms   (21.03 ms .. 72.91 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendPartition
time                 22.02 s    (21.95 s .. 22.10 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 22.01 s    (22.00 s .. 22.02 s)
std dev              13.34 ms   (11.68 ms .. 14.18 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendFoldr
time                 4.055 s    (4.038 s .. 4.065 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 4.062 s    (4.057 s .. 4.071 s)
std dev              8.356 ms   (754.8 μs .. 10.38 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/appendFoldl'
time                 7.440 s    (6.979 s .. 8.094 s)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 7.258 s    (7.148 s .. 7.353 s)
std dev              123.4 ms   (31.13 ms .. 161.7 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailListComprehension
time                 7.053 s    (6.865 s .. 7.305 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 7.009 s    (6.981 s .. 7.038 s)
std dev              36.40 ms   (15.83 ms .. 48.36 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailFoldr
time                 4.144 s    (4.081 s .. 4.277 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 4.170 s    (4.146 s .. 4.202 s)
std dev              30.98 ms   (1.319 ms .. 38.19 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/tailFoldl'
time                 2.925 s    (2.852 s .. 3.088 s)
                     1.000 R²   (NaN R² .. 1.000 R²)
mean                 2.925 s    (2.899 s .. 2.952 s)
std dev              32.59 ms   (16.25 ms .. 42.00 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking qs/bounceRec
time                 2.967 s    (2.817 s .. 3.135 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 3.020 s    (2.983 s .. 3.057 s)
std dev              46.09 ms   (24.79 ms .. 56.32 ms)
variance introduced by outliers: 19% (moderately inflated)

As for the “trick” function, I wasn’t aware of it. I must admit I was perplexed when I first saw it. The three-argument function (\x xs (l, u) -> xs (if x <= pivot then (x : l, u) else (l, x : u))) is a bit counter-intuitive; I would have written it as (\x fs -> \(l, u) -> fs (if x <= pivot then (x : l, u) else (l, x : u))) to make it clearer. I renamed xs to fs because x is a list element but xs is not the rest of that list: it is the accumulator containing the nested functions. Anyway, it seems foldl' does the same job, but it is a nice thing to know.
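
To convince myself that the trick really turns a foldr into a left-to-right accumulation, I reduced it to a tiny example (the names and the summing specialisation are mine; note that, unlike foldl', it is not strict in the accumulator unless a seq is added):

```haskell
import Data.List (foldl')

-- The foldr-with-function-accumulator trick, specialised to summing:
-- foldr builds a pipeline of functions that is finally applied to the
-- seed 0, so the additions happen in left-to-right order, like a left fold.
sumTrick :: [Int] -> Int
sumTrick xs = foldr (\x fs acc -> fs (acc + x)) id xs 0

sumLeft :: [Int] -> Int
sumLeft = foldl' (+) 0

main :: IO ()
main = print (sumTrick [1 .. 100] == sumLeft [1 .. 100])  -- prints True
```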

So while it is true that with the optimizations in place the functions’ behaviour seems more “as expected”, I still don’t get why the criterion analysis is slower.