I think one solution to get a lazy and efficient implementation is to use a balanced tree fold as described in: Balancing Folds - Donnacha Oisín Kidney
The idea is to switch from this nesting structure:
merge xs1 (merge xs2 (merge xs3 xs4))
To a balanced binary tree structure:
merge (merge xs1 xs2) (merge xs3 xs4)
My reasoning about the performance: each element of the resulting list has to propagate from the leaves of the structure to the root, and each move up one level costs 1 unit of time. So we can minimize the total time by minimizing the distance between the leaves and the root, which is exactly what a balanced tree structure does: each element of the result then costs only log n units of time, and since there are about n^2/2 elements we should get the optimal O(n^2 log n) asymptotic complexity.
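To make the shape concrete, here is a minimal sketch of what I mean by a balanced fold (only an illustration of the idea, not necessarily how the library I use below implements it): keep combining adjacent elements until a single value remains, so every input ends up only O(log n) combining steps away from the root.

-- illustration only: fold a list in a balanced tree shape by
-- repeatedly combining adjacent elements
balancedFold :: (a -> a -> a) -> a -> [a] -> a
balancedFold _ z [] = z
balancedFold f _ xs = go xs
  where
    go [x] = x
    go ys  = go (pairs ys)
    -- combine adjacent elements; a leftover odd element passes through
    pairs (a:b:rest) = f a b : pairs rest
    pairs rest       = rest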
Let me try it out.
Edit: it seems to work. Here’s my new version, using the treefold package ("Provides folds which try to combine elements in a balanced way") by the same author as the blog post:
{-# LANGUAGE TupleSections #-} -- for the (x,) section below; on by default with GHC2021
import Data.Function
import Data.TreeFold
import Data.List
-- assumes the input is sorted from large to small
-- e.g.
-- >>> f [5,4,3,2,1]
-- [(5,4),(5,3),(5,2),(4,3),(5,1),(4,2),(4,1),(3,2),(3,1),(2,1)]
f :: [Int] -> [(Int,Int)]
-- for each element x, pair it with every smaller element after it;
-- `init (tails xs)` drops the final empty tail
f xs = treeFold merge [] $ map (\(x:xs) -> map (x,) xs) $ init $ tails xs
  where
    merge [] ys = ys
    -- assumes `x` has a bigger sum than the first element of `ys`
    merge (x:xs) ys = x : go xs ys
      where
        go (x@(a,b):xs) (y@(c,d):ys)
          | a + b >= c + d = x : go xs (y:ys)
          | otherwise      = y : go (x:xs) ys
        go [] ys = ys
        go xs [] = xs
It has better memory usage in GHCi:
Old:
ghci> length $ f $ reverse [1..100]
4950
(0.04 secs, 46,661,584 bytes)
New:
ghci> length $ f $ reverse [1..100]
4950
(0.02 secs, 16,600,472 bytes)
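(These per-evaluation statistics are what GHCi prints with :set +s enabled.)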
And benchmarks, with tasty-bench ("Featherlight benchmark framework"):
f old: OK
721 μs ± 11 μs, 6.7 MB allocated, 313 KB copied, 7.0 MB peak memory
f new: OK
418 μs ± 27 μs, 3.3 MB allocated, 98 KB copied, 7.0 MB peak memory
That’s a nice speed-up, though still less than 2x.
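For anyone who wants to reproduce it, a tasty-bench harness for this looks roughly like the sketch below (fOld and fNew are placeholder names for the previous foldr-based version and the treeFold-based f above, and the input is only an example):

import Test.Tasty.Bench

-- fOld / fNew are placeholders for the two versions of `f`
main :: IO ()
main = defaultMain
  [ bench "f old" $ nf fOld (reverse [1 .. 100 :: Int])
  , bench "f new" $ nf fNew (reverse [1 .. 100 :: Int])
  ]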
And kudos to @Bodigrim, since his tasty-bench-fit ("Determine time complexity of a given function") predicts the complexity I expected:
old: 7.04e-10 * x ^ 3
new: 7.98e-9 * x ^ 2 * log x
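(If you want to try it yourself, tasty-bench-fit can be driven roughly like this; the size range is just an example I picked:)

import Test.Tasty.Bench.Fit (fit, mkFitConfig)

-- estimate the complexity of `f` on reverse [1..n] for growing n;
-- the (10, 1000) size range is only an example
main :: IO ()
main = print =<< fit (mkFitConfig (\n -> f (reverse [1 .. fromIntegral n])) (10, 1000))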
But sadly it is not lazy any more:
ghci> f (5:4:undefined)
*** Exception: Prelude.undefined
I guess I need to write my own lazy tree fold.
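One idea I might try, sketched below: keep the very first combine lazy, like foldr, and only balance the pairwise-combined tail. The tree is at most about twice as deep as a perfectly balanced one, so the O(n^2 log n) argument should still go through, and because merge above can produce output from its first argument without forcing the second, the head of the result only needs the head of the first sub-list.

-- sketch of a lazier tree fold (my own, not from the treefold package):
-- the result is `f x (fold of the pairwise-combined rest)`, so with a
-- combining function that is lazy in its second argument the input spine
-- is only forced as far as the output demands
lazyTreeFold :: (a -> a -> a) -> a -> [a] -> a
lazyTreeFold _ z []       = z
lazyTreeFold f z (x:rest) = f x (lazyTreeFold f z (pairs rest))
  where
    -- combine adjacent elements; a leftover odd element passes through
    pairs (a:b:more) = f a b : pairs more
    pairs more       = more

With lazyTreeFold merge [] in place of treeFold merge [] I'd expect take 1 (f (5:4:undefined)) to give [(5,4)] again, though I haven't checked whether the constant factors stay as good as with Data.TreeFold.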