I am running a parallel program on an AMD Ryzen 5 3600, which has 6 cores and 12 threads. 12 is also the number of capabilities determined by GHC.
The speed-up is linear only up to 2 parallel jobs. The third job still adds a decent speed-up, but it is way below my expectation. At 5 jobs I hit the wall, and from there performance decreases in absolute terms.
My parallel code is quite straightforward, in the sense that the computations really are independent:

    nj <- getNumCapabilities
    putStr $ "\nRunning " <> show nj <> " jobs.\n\n"
    hFlush stdout

    -- `ls :: [Text]` is a list of 2 million hyphenated words
    mvarLs <- newMVar ls

    let
      loop rs = do
        -- pop the next word off the shared list, if any is left
        mJob <-
          modifyMVar mvarLs $ \ls' ->
            pure $ case ls' of
              [] -> ([], Nothing)
              (j : js) -> (js, Just j)
        case mJob of
          Just hyph -> do
            -- `parseWord` is the runtime-intensive function that is being
            -- run in parallel on 2 million words
            p <- parseWord hyph
            p `deepseq` loop (p : rs)
          Nothing -> pure rs

    -- parallelism as straightforward as it could be: run `nj` concurrent jobs
    -- to tackle the list of 2 million words and accumulate the result
    lsStenos <- mconcat <$> replicateConcurrently nj (loop [])

I tried a couple of things in order to find a better speed-up, but my results were identical to a scary degree:
- compiling with and w/o -threaded. I don’t know why, but -threaded is entirely optional, contrary to the documentation. EDIT: w/o -threaded, ghc does indeed complain. +RTS -N should only have an effect with -threaded, which is not true.
- limiting the gc threads to 1 with -qn1
- limiting the gc threads to half the number of jobs
- running with +RTS -qa: “Use the OS’s affinity facilities to try to pin OS threads to CPU cores.”
- varying the allocation area size of the gc, e.g. +RTS -A64M (cf. documentation)
- switching between GHC 8.8.4 and GHC 8.10.7
- activating --nonmoving-gc (requires GHC 8.10)
- have each job take 100 words at once instead of 1, to avoid waiting for the lock
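
The batching variant from the last bullet could look roughly like the following. This is a minimal sketch using only `base`, not the code from the program above: the names `takeBatch` and `processBatched` are hypothetical, the shared list lives in an `IORef` instead of an `MVar`, workers are started with plain `forkIO` rather than `replicateConcurrently`, and results are only forced to weak head normal form with `evaluate` instead of `deepseq`.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Control.Monad (forM)
import Data.IORef (IORef, atomicModifyIORef', newIORef)

-- | Atomically take up to @n@ items off the shared queue (hypothetical helper).
takeBatch :: Int -> IORef [a] -> IO [a]
takeBatch n ref = atomicModifyIORef' ref $ \xs ->
  let (batch, rest) = splitAt n xs in (rest, batch)

-- | Run @nj@ workers; each one repeatedly grabs a batch of @batchSize@
-- items and applies @f@ to every item. The order of the results is not
-- preserved. Assumption: @f@ does not throw; there is no exception
-- handling here, so a failing worker would leave the caller blocked.
processBatched :: Int -> Int -> (a -> IO b) -> [a] -> IO [b]
processBatched nj batchSize f xs = do
  queue <- newIORef xs
  dones <- forM [1 .. nj] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO (worker queue [] >>= putMVar done)
    pure done
  -- wait for every worker and concatenate the per-worker results
  concat <$> mapM takeMVar dones
  where
    worker queue acc = do
      batch <- takeBatch batchSize queue
      case batch of
        [] -> pure acc
        _ -> do
          -- force each result to WHNF inside the worker thread
          rs <- mapM (\x -> f x >>= evaluate) batch
          worker queue (reverse rs ++ acc)
```

With this shape, `processBatched nj 100 parseWord ls` would stand in for the `replicateConcurrently nj (loop [])` line; the shared queue is then touched once per 100 words instead of once per word.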
My expectation is a speed-up all the way to nj = 12, sublinear because of gc overhead. A more-or-less linear speed-up up to nj = 6 would be logical if the number of physical cores matters more than the extra hardware threads provided by AMD’s simultaneous multithreading (SMT).
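
To put numbers on these expectations, speed-up and parallel efficiency can be computed from wall-clock timings per job count. The following is a minimal sketch using only `base`; `timeIt`, `speedup`, `efficiency` and `sweep` are hypothetical names, and the `workload` argument stands for whatever runs the parsing with a given number of jobs. Changing the capability count via `setNumCapabilities` assumes the program is linked with -threaded.

```haskell
import Control.Concurrent (setNumCapabilities)
import Control.Monad (forM)
import GHC.Clock (getMonotonicTime)

-- | Wall-clock seconds taken by an action.
timeIt :: IO a -> IO Double
timeIt act = do
  t0 <- getMonotonicTime
  _ <- act
  t1 <- getMonotonicTime
  pure (t1 - t0)

-- | Speed-up of a run that took @tn@ seconds relative to the
-- single-job run that took @t1@ seconds.
speedup :: Double -> Double -> Double
speedup t1 tn = t1 / tn

-- | Parallel efficiency: speed-up divided by the number of jobs.
-- A value near 1 means a linear speed-up.
efficiency :: Int -> Double -> Double -> Double
efficiency nj t1 tn = speedup t1 tn / fromIntegral nj

-- | Time the (hypothetical) workload once for each job count.
-- Assumption: the program is linked with -threaded, otherwise
-- the capability count cannot be raised above 1.
sweep :: [Int] -> (Int -> IO a) -> IO [(Int, Double)]
sweep njs workload = forM njs $ \nj -> do
  setNumCapabilities nj
  t <- timeIt (workload nj)
  pure (nj, t)
```

A linear speed-up up to nj = 6 would show up as `efficiency` staying close to 1 for nj = 1..6, while hitting the wall at 5 jobs means the measured time stops shrinking from there on.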
Is there an important measure I have overlooked? Are my expectations wrong maybe? Is my concurrent implementation suboptimal?
If someone wants to actually try out the code to play around with the parameters and run it on their machine, just give me a hint whether you use stack or nix or whatnot and I’ll pass over a package.