All my sieves are the same!

My sources are Melissa E. O’Neill, Jeremy Gibbons and Richard Bird.

There are many ways to get an infinite list of prime numbers. I selected 4 different implementations:

  • Unfaithful sieve, as Melissa says.
  • My take on the true sieve. (Work in progress!)
  • Trial division, as Melissa says.
  • True sieve, as Jeremy says following Richard.

I then wrote some property checks and bench marks. You may see the code here.

I was hoping to see that these implementations produce the same results at different speed. What I do see is that they produce the same results at the same speed.

Please help me understand this. Do you see a difference in performance on your machine? Do you see a stupid error somewhere in my code? What am I doing wrong?

1 Like

I think that the primes lists are shared and generated only once in your benchmarks, so most benchmarks only measure the time it takes to traverse the list. See https://github.com/effectfully-ou/sketches/tree/master/trouble-in-paradise-fibonacci#still-problematic

Instead of storing it in lists you can make them explicit recursive generating functions then they will not be shared.

I do not think I can, actually. I have no idea how that would look. Please give me an example.

You’re right, I just summarized the conclusion from that article, but in this case that does seem difficult.

Currently all sieves are memoized in top-level bindings, so essentially they are evaluated only once, and all you measure is rnf . take size. It is difficult to trick GHC to reevalute them: even if you write explicit lambdas

primeAlgos ∷  [(String, Int -> [Integer])]
primeAlgos =
  [ ("primesFromUnfaithfulSieve", primesFromUnfaithfulSieve)
  , ("primesFromTrueSieve", primesFromTrueSieve)
  , ("primesFromTrialDivision", primesFromTrialDivision)
  , ("primesFromBird", primesFromBird) ]

primesFromUnfaithfulSieve, primesFromTrueSieve, primesFromTrialDivision, primesFromBird ∷ Int -> [Integer]

primesFromUnfaithfulSieve n = take n $ unfaithfulSieve [2..] where unfaithfulSieve (p: xs) = p: unfaithfulSieve [x | x ← xs, x `mod` p > 0]

GHC will happily let-float the argument of take n into a separate top-level thunk.

The simplest option is to benchmark one iteration only. If you take a sufficiently new tasty-bench >= 0.3 and pass --stdev Infinity, you’ll obtain meaningful results:

  bench marks
    1024
      primesFromUnfaithfulSieve: OK
        24.4 ms
      primesFromTrueSieve:       OK
        13.1 ms
      primesFromTrialDivision:   OK
        1.91 ms
      primesFromBird:            OK
        4.24 ms
    2048
      primesFromUnfaithfulSieve: OK (0.05s)
        46.0 ms
      primesFromTrueSieve:       OK (0.03s)
        27.6 ms
      primesFromTrialDivision:   OK
        1.62 ms
      primesFromBird:            OK
        4.07 ms
    4096
      primesFromUnfaithfulSieve: OK (0.19s)
        186  ms
      primesFromTrueSieve:       OK (0.13s)
        127  ms
      primesFromTrialDivision:   OK
        4.03 ms
      primesFromBird:            OK (0.01s)
        10.2 ms

(FWIW AFAIK the fastest implementation of O’Neill sieve can be found at https://hackage.haskell.org/package/arithmoi-0.7.0.0/docs/Math-NumberTheory-Primes-Heap.html)

1 Like

Alright, turns out that trial division is the fastest for all practical problem sizes. So much for theory…

If you just care about practical sieve sizes then a mutable sieve on an array might be much faster, e.g. https://gist.github.com/noughtmare/48b5755768ddec86ad50a3908e721410.

Otherwise you might even be able to use a fast primality test instead of trial division such as Miller or Baille-PSW primality tests. I recently wrote this Miller test which is the fastest primality test for 64 bit words in Haskell that I know of:

{-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-}

import GHC.Exts
import GHC.IO
import System.IO.Unsafe
import GHC.Word

{-# INLINE mulMod #-}
mulMod :: Word# -> Word# -> Word# -> Word#
-- mulMod _ _ _ | trace "mulMod" False = undefined
mulMod a b mod =
  case timesWord2# a b of { (# h, l #) ->
  case quotRemWord2# h l mod of { (# _, r #) -> r }}

{-# INLINE powMod #-}
powMod :: Word# -> Word# -> Word# -> Word#
-- powMod _ _ _ | trace "powMod" False = undefined
powMod a b m = go a b 1## where
  go _ 0## r = r
  go a b r =
    case case b `and#` 1## of 0## -> r; _ -> mulMod r a m of { r -> 
    case b `uncheckedShiftRL#` 1# of { b ->
    case case b of 0## -> a; _ -> mulMod a a m of { a ->
    go a b r }}}

sprp :: Word# -> Word# -> Int#
-- sprp _ _ | trace "sprp" False = undefined
sprp n a = go (n `minusWord#` 1##) 0## where
  go :: Word# -> Word# -> Int#
  go d s =
    case d `and#` 0xff## of { 0## -> go (d `uncheckedShiftRL#` 8#) (s `plusWord#` 8##); _ -> 
    case case d `and#` 0xf## of 0## -> (# d `uncheckedShiftRL#` 4#, s `plusWord#` 4## #); _ -> (# d, s #) of { (# d, s #) -> 
    case case d `and#` 0x3## of 0## -> (# d `uncheckedShiftRL#` 2#, s `plusWord#` 2## #); _ -> (# d, s #) of { (# d, s #) ->
    case case d `and#` 0x1## of 0## -> (# d `uncheckedShiftRL#` 1#, s `plusWord#` 1## #); _ -> (# d, s #) of { (# d, s #) ->
    case powMod a d n of { b ->
    case (b `eqWord#` 1##) `orI#` (b `eqWord#` (n `minusWord#` 1##)) of { 1# -> 1#; _ ->
    go' b 1## s }}}}}}

  go' :: Word# -> Word# -> Word# -> Int#
  go' b r s = case r `ltWord#` s of
    1# -> case mulMod b b n of { b ->
          case b `leWord#` 1## of { 1# -> 0#; _ -> 
          case b `eqWord#` (n `minusWord#` 1##) of { 1# -> 1#; _ ->
          go' b (r `plusWord#` 1##) s }}}
    _ -> 0#

{-# NOINLINE isPrime #-}
{-# SCC isPrime #-}
isPrime :: Word# -> Word#
-- isPrime _ | trace "isPrime" False = undefined
isPrime n =
  case n `ltWord#` 2## of { 1# -> 0##; _ ->
  case n `ltWord#` 4## of { 1# -> n; _ ->
  case n `and#` 1## of { 0## -> 0##; _ ->
  go 0# }}}
  where
    ByteArray arr = smallPrimes
    go i = case i <# 256# of
      1# -> 
        case indexWord16Array# arr i of { p ->
        case n `eqWord#` p of { 1# -> n; _ ->
        case n `remWord#` p of 0## -> p; _ -> go (i +# 1#) }}
      _ -> 
        case n `ltWord#` 2647129## of { 1# -> n; _ ->
        -- the smallest prime we haven't checked yet is 1627
        -- so the smallest composite that passes is its square
        case sprp n 2## of { 0# -> 0##; _ ->
        case sprp n 3## of { 0# -> 0##; _ ->
        case sprp n 5## of { 0# -> 0##; _ ->
        case n `ltWord#` 25326001## of { 1# -> n; _ ->
        -- This bound was found by:
        -- C. Pomerance, J. L. Selfridge and Wagstaff, Jr., S. S.,
        -- "The pseudoprimes to 25 · 109,"
        -- Math. Comp., 35:151 (1980) 1003-1026.  MR 82g:10030
        case sprp n 7## of { 0# -> 0##; _ ->
        -- 3215031751 already caught by trial division
        case n `ltWord#` 118670087467## of { 1# -> n; _ ->
        case sprp n 11## of { 0# -> 0##; _ ->
        case n `ltWord#` 2152302898747## of { 1# -> n; _ ->
        case sprp n 13## of { 0# -> 0##; _ ->
        -- I chose to ignore this bound because it is too close to the previous:
        -- case n `ltWord#` 3474749660383## of { 1# -> n; _ ->
        case sprp n 17## of { 0# -> 0##; _ ->
        case n `ltWord#` 341550071728321## of { 1# -> n; _ ->
        -- for bases 7 up to 17 these bounds were found by:
        -- G. Jaeschke, "On strong pseudoprimes to several bases,"
        -- Math. Comp., 61 (1993) 915-926.  MR 94d:11004
        case sprp n 19## of { 0# -> 0##; _ ->
        case sprp n 23## of { 0# -> 0##; _ ->
        case n `ltWord#` 3825123056546413051## of { 1# -> n; _ ->
        -- This bound was conjectured by:
        -- Z. Zhang and M. Tang,
        -- "Finding strong pseudoprimes to several bases. II"
        -- Math. Comp., 72 (2003) 2085-2097.
        case sprp n 29## of { 0# -> 0##; _ ->
        case sprp n 31## of { 0# -> 0##; _ ->
        case sprp n 37## of { 0# -> 0##; _ ->
        n }}}}}}}}}}}}}}}}}}

data ByteArray = ByteArray !ByteArray#

-- all primes under 2^16 except 2
{-# NOINLINE smallPrimes #-}
smallPrimes :: ByteArray
smallPrimes = unsafePerformIO $ IO $ \s -> 
  case newAlignedPinnedByteArray# (len *# 2#) 64# s of { (# s, arr #) ->
  go arr 0# 3# (go0 arr 0# s) }
  where
    len = 65536#
    
    go0 :: MutableByteArray# a -> Int# -> State# a -> State# a
    go0 !arr !i !s
      | isTrue# (i >=# len `quotInt#` 4#) = s
      | otherwise = go0 arr (i +# 1#) (writeInt64Array# arr i 0# s)

    go :: MutableByteArray# a -> Int# -> Int# -> State# a -> (# State# a, ByteArray #)
    go !arr !i !j !s
      | isTrue# (j ==# len) =
        case shrinkMutableByteArray# arr (i *# 2#) s of { s ->
        case unsafeFreezeByteArray# arr s of (# s, arr #) -> (# s, {- trace (show (I# i)) $ -} ByteArray arr #) }
      | otherwise = 
        case readInt16Array# arr j s of { (# s, isComposite #) ->
        case isComposite of
          0# {- | trace ("write" ++ show (I# i, I# j)) True -} ->
            case writeWord16Array# arr i (int2Word# j) s of { s ->
            case go' arr (j +# j) j s of { s ->
            go arr (i +# 1#) (j +# 1#) s }}
          _ -> go arr i (j +# 1#) s }

    go' :: MutableByteArray# a -> Int# -> Int# -> State# a -> State# a
    go' !arr !i !j !s
      | isTrue# (i >=# len) = s
      | otherwise =
        case writeInt16Array# arr i 1# s of s -> go' arr (i +# j) j s

(this is a very low level version; I am planning to make a higher level version)

1 Like

Thanks, but I do not really need those prime numbers by themselves. What I really wanted is to test a theory — turns out it is not practically verifiable. I cannot tell by experiment if the true sieve is actually better than trial division. For all I know, the whole story might be false due to some innocent mistake or assumption. Like, maybe they forgot to count the deconstruction and reconstruction of lists. Whatever the reason, clearly the theory we have so far is not good enough to wield in industrial practice…

1 Like

I had a rant about the very same topic a year ago: https://www.reddit.com/r/haskell/comments/hk15ub/do_not_recommend_the_genuine_sieve_of/

Here is a way to use streams to implement three of your sieves. Using this stream data types ensures that it is never retained in memory, so they can be run repeatedly in a benchmark:

{-# LANGUAGE ExistentialQuantification #-}

data Stream a = forall s. Stream !s !(s -> Step a s)
data Step a s = Skip !s | Yield a !s

instance Functor Stream where
  fmap f (Stream s0 step) = Stream s0 step' where
    step' s = case step s of
      Skip s' -> Skip s'
      Yield x s' -> Yield (f x) s'

foldS :: (a -> b -> b) -> Stream a -> b
foldS f (Stream s0 step) = go s0 where
  go s = case step s of
    Skip s' -> go s'
    Yield x s' -> f x (go s')

toList :: Stream a -> [a]
toList = foldS (:)

enumFromS :: Int -> Stream Int
enumFromS i = Stream i (\x -> Yield x (x + 1))

uncons :: Stream a -> Step a (Stream a)
uncons (Stream s step) = case step s of 
  Skip s' -> Skip (Stream s' step)
  Yield x s' -> Yield x (Stream s' step)

filterS :: (a -> Bool) -> Stream a -> Stream a
filterS f (Stream s step) = Stream s step' where
  step' s = case step s of
    Skip s' -> Skip s'
    Yield x s' | f x -> Yield x s'
               | otherwise -> Skip s'

-- | Strict pairs, good for use in the state of a stream
data Pair a b = Pair !a !b

unfaithful, truesieve, trialDivision :: Stream Int

unfaithful = Stream (enumFromS 2) step where
  step s = case uncons s of
    Skip s' -> Skip s'
    Yield p s' -> Yield p (filterS (\x -> x `mod` p > 0) s')

(\\) :: Ord a => Stream a -> Stream a -> Stream a
Stream xs0 xstep \\ Stream ys0 ystep = Stream (Pair xs0 ys0) step where
  step (Pair xs ys) = case xstep xs of
    Skip xs' -> Skip (Pair xs' ys)
    Yield x xs' -> case ystep ys of
      Skip ys' -> Skip (Pair xs ys') -- revert xs
      Yield y ys' -> case compare x y of
        LT -> Yield x (Pair xs' ys) -- yield x
        EQ -> Skip (Pair xs' ys') -- drop both
        GT -> Skip (Pair xs ys') -- drop y

truesieve = Stream (enumFromS 2) step where
  step s = case uncons s of
    Skip s' -> Skip s'
    Yield p s' -> Yield p (s' \\ fmap (p *) (enumFromS p))

trialDivision = filterS isPrime (enumFromS 2) where
  isPrime 2 = True
  isPrime n = foldS cons trialDivision where
    cons p rest = p * p > n || n `mod` p /= 0 && rest

This unfaithful and true sieve cheat a little (w.r.t memory usage) because they build up larger and larger step functions. This trial division sieve is really constant memory, but it is probably very slow, because it just recomputes the list of primes each time.

Feel like making a pull request?

I’ve tried it out and they are all way slower than your implementations (even with the --stdev Infinity option). I guess the streams-inside-streams are not optimized well. The trial division was actually the fastest again, but it is still obviously slow due to the recomputation. So using streams like this does not really get you the same computational behavior, but at least you can benchmark them without --stdenv Infinity.

By the way, you can also use the -fno-full-laziness flag to prevent let-floating. I have submitted a pull request for that: https://github.com/kindaro/benchmarking-sieves/pull/1.

2 Likes

I would merge your curious co-recursive experiments if only for their curiousity.

-fno-full-laziness is a cool move, I never would have thought of this flag.

1 Like

-fno-full-laziness is a cool move, I never would have thought of this flag.

It has a long and storied history.

I just found this GHC wiki that also links to a blog post which discuss this problem. Maybe with GHC 9.0.1 we can use linearity to avoid this problem. That blog post also links through to this GHC issue where a few methods are proposed for mitigating these space leaks.