Manual recursion performs better than any fold, why?

benjamin-thomas · August 16, 2024, 11:27am

Hi,

I’m currently reading the book “Get programming with Haskell”. As an exercise, we want to generate a list of prime numbers.

module Primes where

import Data.List (unfoldr)
import GHC.List (foldr')

divBy :: Int -> Int -> Bool
divBy a b = 0 /= mod b a

sieve1 :: [Int] -> [Int]
sieve1 [] = []
sieve1 (x : xs) =
  x : sieve1 (filter (divBy x) xs)

sieve2 :: [Int] -> [Int]
sieve2 = foldr (\x xs -> x : filter (divBy x) xs) []

sieve3 :: [Int] -> [Int]
sieve3 = foldr' (\x xs -> x : filter (divBy x) xs) []

sieve4 :: [Int] -> [Int]
sieve4 = unfoldr step
 where
  step [] = Nothing
  step (x : xs) = Just (x, filter (divBy x) xs)

{-

>>> take 5 (sieve1 [2..])
[2,3,5,7,11]

>>> take 5 (sieve4 [2..])
[2,3,5,7,11]

 -}

{-
>>> take 10 primes
[2,3,5,7,11,13,17,19,23,29]
 -}

primes :: [Int]
primes = sieve1 [2 ..]

sieve1 was my first (naive) implementation. So I was then surprised to find out that sieve1 performs much better than either sieve2 or sieve3. I finally found a way to replicate the performance of sieve1 (wanting to use a higher-order function) with sieve4.

Here’s a summary of the performance (ran with stack test):

benchmarking sieve1/sum [2..10000]
time                 6.596 ms

benchmarking sieve1/take 3 from [2..10000000]
time                 11.33 ns

---

benchmarking sieve2/sum [2..10000]
time                 408.8 ms

benchmarking sieve2/take 3 from [2..10000000]
time                 10.29 ns

---

benchmarking sieve3/sum [2..10000]
time                 447.6 ms

benchmarking sieve3/take 3 from [2..10000000]
time                 1.256 s

---

benchmarking sieve4/sum [2..10000]
time                 6.489 ms

benchmarking sieve4/take 3 from [2..10000000]
time                 14.99 ns

Here’s my test:

{-# LANGUAGE NumericUnderscores #-}

import Criterion.Main
import Primes

benchSieve :: String -> ([Int] -> [Int]) -> Benchmark
benchSieve name sieveFunc =
  bgroup
    name
    [ bench "sum [2..10000]" $ whnf (sum . sieveFunc) [2 .. 10_000]
    , bench "take 3 from [2..10000000]" $ whnf (take 3 . sieveFunc) [2 .. 10_000_000]
    ]

main :: IO ()
main =
  defaultMain
    [ benchSieve "sieve1" sieve1
    , benchSieve "sieve2" sieve2
    , benchSieve "sieve3" sieve3
    , benchSieve "sieve4" sieve4
    ]

I have a couple of questions.

Firstly, I encountered worse performance via GHCi. GHCi seems to run a bytecode interpreter. Is it possible to run compiled optimized code instead?

-- stack repl
ghci> :set +s
ghci> sum $ sieve2 [2..10000]
5736396
(3.93 secs, 5,636,113,056 bytes)

Secondly, I’d like to understand more about what’s happening under the hood exactly. Going from sieve1 to sieve4 wasn’t that obvious to me. I thought I understood that foldr was generally to be favored by default in Haskell but there’s obviously more subtle things going on here.

So in summary, I’d like to understand what the evaluated code looks like. And I’d be interested to hear about any tips or “general rules” one could follow when using folds (or manual recursion!)

Thanks

jaror · August 16, 2024, 11:39am

Note that sieve2 is equivalent to this manual recursive version:

sieve2 :: [Int] -> [Int]
sieve2 [] = []
sieve2 (x : xs) =
  x : filter (divBy x) (sieve2 xs)

This differs from sieve1 in that it filters after the recursive call. So, it will first filter multiples of larger primes before the multiples of smaller primes. That is much less efficient, because there are much more multiples of smaller primes.

There is no easy way to rewrite your sieve1 function using a fold, because it uses a complicated form of recursion.

You can make sieve4 perform the same as your manual version by eta-expanding:

sieve4 :: [Int] -> [Int]
sieve4 xs = unfoldr step xs
 where
  step [] = Nothing
  step (x : xs) = Just (x, filter (divBy x) xs)

The reason this changes the benchmark results is that GHC will only inline functions that are fully saturated. In the eta-reduced implementation, the function unfoldr is not fully saturated, so it is not inlined. That makes the take 3 from benchmark slower.

There used to be a similar issue with foldl': #19534: Foldl' doesn't inline when partially applied · Issues · Glasgow Haskell Compiler / GHC · GitLab. Perhaps we should use the same solution for unfoldr.

You can use -fobject-code to do that. With cabal this works for me:

$ cabal repl -fobject-code -O Sieve.hs
ghci> :set +s
ghci> sum $ sieve2 [2..10000]
5736396
(7.94 secs, 5,636,323,376 bytes)
ghci> sum $ sieve4 [2..10000]
5736396
(0.20 secs, 114,581,232 bytes)
ghci> sum $ sieve1 [2..10000]
5736396
(0.20 secs, 114,518,896 bytes)

benjamin-thomas · August 16, 2024, 5:31pm

jaror:

Note that sieve2 is equivalent to this manual recursive version:
sieve2 :: [Int] -> [Int]
sieve2 [] = []
sieve2 (x : xs) =
  x : filter (divBy x) (sieve2 xs)
This differs from sieve1 in that it filters after the recursive call. So, it will first filter multiples of larger primes before the multiples of smaller primes. That is much less efficient, because there are much more multiples of smaller primes.

There is no easy way to rewrite your sieve1 function using a fold, because it uses a complicated form of recursion.

That explains my confusion, many thanks!

Actually I was fine here, that function performed perfectly fine for me (on ghc-9.6.4).

That does it, awesome!

jaror · August 16, 2024, 5:35pm

I did notice a consistent difference in time for the take 3 benchmark. Your results also show a ~30% slowdown:

Letting unfoldr inline closes that gap.

But I admit it is pretty minor.

Topic		Replies	Views
Understanding lazy list traversal performance Learn	3	355	December 9, 2024
All my sieves are the same! Learn	15	834	June 29, 2021
How can foldl be fast? Learn	3	735	June 7, 2023
How does this nth Prime program terminate? Learn	11	837	October 5, 2022
Refactor Folds Tool? Learn	5	437	March 1, 2019

Manual recursion performs better than any fold, why?

Related topics