Why does -N4 create a slowdown vs -N1?

EDIT: MISREAD THE BENCHMARK.

zsh’s time builtin prints its fields horizontally on macOS, so the computation actually takes 0.46 seconds on my machine, not 0.002 seconds. The 0.007 seconds reflects the overhead of the runtime running with -N4, most likely the multi-threaded garbage collector.

I’m currently trying to optimize the pidigits problem from the Debian Benchmarks Game in Haskell (this is just for fun; I no longer expect to make major improvements).

I’m basing my improvements on this program:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/pidigits-ghc-6.html

My version:

-- The Computer Language Benchmarks Game
-- https://salsa.debian.org/benchmarksgame-team/benchmarksgame/
-- contributed by Bryan O'Sullivan
-- modified by Eugene Kirpichov: pidgits only generates
-- the result string instead of printing it. For some
-- reason, this gives a speedup.
-- modified by W. Gordon Goodsman to do two divisions
-- as per the new task requirements.
-- further modified by W. Gordon Goodsman to reduce the memory leak
-- caused by excess laziness in computing the output string before output.
-- liam's version with foldr-for refactor.
{- cabal:
default-language: GHC2021
ghc-options: -O2 -fllvm -threaded
build-depends: base ^>= 4.20.0.0
-}

import System.Environment

pidgits :: Integer -> IO ()
pidgits n = foldr go undefined (0 # (1,0,1)) 1 9 where
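 -- go prints one digit per step: i is the running digit count and k counts
 -- down within each group of ten to decide when to emit the "\t:<count>" tag.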
 go d cont i k = if i <= n
    then do
      putStr (show d ++ if k <= 0 then "\t:" ++ show i ++ "\n" else "")
      cont (i + 1) (if k <= 0 then 9 else k-1)
    else if k<9
      then putStrLn (replicate k ' ' ++ "\t:" ++ show n)
      else putStr ""
 j # s
    | n>a || q/=r = k # t
    | True = q : k # (n*10,(a-(q*d))*10,d) -- inline eliminateDigit
  where k = j+1
        t@(n,a,d)=k&s
        q=3$t
        r=4$t -- two calls to extractDigit
        c$(n,a,d) = (c*n+a)`quot`d -- extractDigit
        j&(n,a,d) = (n*j,(a+n*2)*y,d*y) where y=(j*2+1) -- nextDigit

main :: IO ()
main = pidgits.read.head =<< getArgs

On an Apple MacBook Air 15 (M3), I get 0.007 s of user time with -N4 on, but only 0.002 s with -N1. Any ideas why that might be?

If you don’t explicitly use parallelism or concurrency, then you only pay for the extra overhead and don’t actually get any benefit. GHC Haskell has no automatic parallelism (yet).
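For instance, nothing in the program above ever sparks work onto another capability, so with -N4 you only add scheduler and GC coordination cost. Parallelism has to be asked for explicitly. Here is a minimal, unrelated sketch (my own example, not part of the benchmark program) using par and pseq from GHC.Conc in base, which only pays off when compiled with -threaded and run with multiple capabilities:

import GHC.Conc (par, pseq)

-- Two independent sums, each evaluated as its own spark. Only code that
-- creates sparks (or forks threads) can benefit from extra capabilities;
-- the pidigits program above is purely sequential.
parSums :: Int -> Int
parSums n = a `par` (b `pseq` (a + b))
  where
    a = sum [1 .. n]
    b = sum [n + 1 .. 2 * n]

main :: IO ()
main = print (parSums 50000000)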

Does it have to be decimal digits?
See here.

Debian wants a specific algorithm used. I asked on IRC, and apparently the benchmark is somewhat broken: the numbers I’m seeing show that over 90% of the time is spent in the kernel (i.e., writing the output), and the Rust record-holder “cheats” by optimizing the IO time.
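If most of the wall time really is spent writing output, one thing I can try on the Haskell side is making sure stdout is block-buffered, so each putStr doesn’t turn into its own small write. GHC already block-buffers stdout when it isn’t a terminal, so this is just a sketch of the idea against the main above, not a measured improvement:

import System.Environment
import System.IO

main :: IO ()
main = do
  -- Accumulate output in a large buffer and flush it in big chunks
  -- instead of issuing a small write for every putStr call.
  hSetBuffering stdout (BlockBuffering Nothing)
  pidgits . read . head =<< getArgs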

I actually misread the benchmark because of zsh time’s horizontal output (the 0.46 seconds is userspace time, not kernel time). At least this means I can try a bunch of optimizations of Mr. Goodsman’s code.