What speed-up to expect for a parallel program on 6 cores?

[Figure: speed-up vs. number of parallel jobs]

I am running a parallel program on an AMD Ryzen 5 3600, which has 6 cores and 12 threads. 12 is also the number of capabilities determined by GHC.

The graph shows a linear speed-up only up to 2 parallel jobs. The third job still adds a decent speed-up, but one well below my expectation. At 5 jobs I hit a wall, and from there performance decreases in absolute terms.

My parallel code is actually quite straightforward, in the sense that the computations are genuinely independent:

nj <- getNumCapabilities
putStr $ "\nRunning " <> show nj <> " jobs.\n\n"
hFlush stdout

-- `ls :: [Text]` is a list of 2 million hyphenated words
mvarLs <- newMVar ls

let
    loop rs = do
        mJob <-
            modifyMVar mvarLs $ \ls' ->
                pure $ case ls' of
                    [] -> ([], Nothing)
                    (j : js) -> (js, Just j)
        case mJob of
            Just hyph -> do
                -- `parseWord` is the runtime-intensive function that is being
                -- run in parallel on 2 million words
                p <- parseWord hyph
                p `deepseq` loop (p : rs)
            Nothing -> pure rs

-- parallelism as straightforward as it could be: run `nj` concurrent jobs
-- to tackle the list of 2 million words and accumulate the results
lsStenos <- mconcat <$> replicateConcurrently nj (loop [])

I tried a couple of things in order to find a better speed-up, but my results were identical to a scary degree:

  • compiling with and without -threaded. I don’t know why, but -threaded seemed entirely optional, contrary to the documentation: +RTS -N should only have an effect with -threaded, which didn’t appear to be true. EDIT: without -threaded, GHC does indeed complain (resolved below)
  • limiting the GC threads to one with +RTS -qn1
  • limiting the GC threads to half the number of jobs
  • running with +RTS -qa: “Use the OS’s affinity facilities to try to pin OS threads to CPU cores.”
  • varying the allocation area size of the GC, e.g. +RTS -A64M (cf. the documentation)
  • switching between GHC 8.8.4 and GHC 8.10.7
  • activating --nonmoving-gc (requires GHC 8.10)
  • having each job take 100 words at once instead of one, to avoid waiting for the lock (see the sketch after this list)
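
For the last item in that list, the batched variant looked roughly like this (a sketch: loopBatch replaces the loop above, and the batch size of 100 is arbitrary):

let
    loopBatch rs = do
        -- take up to 100 words per lock acquisition instead of one
        mBatch <-
            modifyMVar mvarLs $ \ls' ->
                let (batch, rest) = splitAt 100 ls'
                in  pure (rest, if null batch then Nothing else Just batch)
        case mBatch of
            Just batch -> do
                ps <- traverse parseWord batch
                ps `deepseq` loopBatch (ps <> rs)
            Nothing -> pure rs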

My expectation was a speed-up all the way to nj = 12, sublinear because of GC overhead. A more-or-less linear speed-up up to nj = 6 would also be plausible, if the number of physical cores matters more than AMD’s simultaneous multithreading.

Is there an important measure I have overlooked? Are my expectations wrong maybe? Is my concurrent implementation suboptimal?


If someone wants to actually try out the code, play around with the parameters, and run it on their machine: just give me a hint whether you use stack or nix or whatnot, and I’ll put together a package.


This sounds strange. Do you compile directly with ghc, or do you pass this as an option via cabal or stack? I definitely remember seeing errors when I forgot to use -threaded with +RTS -N.

I use cabal build and cabal run within a nix shell to run my parallel code.

Do you do cabal run <my-exe> +RTS -N or cabal run <my-exe> -- +RTS -N? The first one applies the RTS options to cabal itself; only the second one applies them to <my-exe>.

My mistake: -threaded is indeed required. It turns out that when I added it, I had actually added it twice.


Do you force the list of words ls to be evaluated completely before starting? Otherwise there might be a significant amount of computation happening while modifying mvarLs, which could cause contention.

I tried with and w/o the deepseq here:

ls <- Text.lines <$> Text.readFile fileInput
mvarLs <- ls `deepseq` newMVar ls

and it doesn’t make a difference. I guess Text.readFile is sufficiently strict, in combination with the fact that I get the length of ls and print it, all before starting the concurrent computation.

One of the last things I can still think of is that this might be caused by GC synchronization pauses. Those could happen because of unsafe FFI calls or tight non-allocating loops. I think you can see those by using the +RTS -s option and looking at the productivity and other GC statistics. Otherwise you could try ThreadScope to see if there are lots of GC pauses.


Running

cabal run palantype-ops -- stenoDict --file-input hyphenated-h100000.txt +RTS -N12 -A64M -s

Optimizing steno chords ... done.                 
 532,923,175,248 bytes allocated in the heap
     426,278,096 bytes copied during GC
      55,978,208 bytes maximum residency (5 sample(s))
       2,516,624 bytes maximum slop
             895 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       687 colls,   687 par    4.664s   0.379s     0.0006s    0.0246s
  Gen  1         5 colls,     4 par    0.590s   0.053s     0.0107s    0.0177s

  Parallel GC work balance: 86.68% (serial 0%, perfect 100%)

  TASKS: 38 (1 bound, 37 peak workers (37 total), using -N12)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.006s  (  0.005s elapsed)
  MUT     time  729.195s  ( 61.790s elapsed)
  GC      time    5.254s  (  0.433s elapsed)
  EXIT    time    0.003s  (  0.003s elapsed)
  Total   time  734.458s  ( 62.230s elapsed)

  Alloc rate    730,837,952 bytes per MUT second

  Productivity  99.3% of total user, 99.3% of total elapsed

This doesn’t look to me like the GC is causing trouble. It could be that there are issues with my hardware, e.g. the CPU has little cache and RAM access is slower than it could be. I couldn’t find other programmers’ experiences regarding speed-up with this same CPU.

The ThreadScope graph looks equally smooth for 2, 6, and 12 jobs running in parallel. GC activity goes up with 12 jobs, but it doesn’t look excessive to me.

I am curious what this means:

 TASKS: 38 (1 bound, 37 peak workers (37 total), using -N12)

What are those tasks? Even at -N1 there are 5 of them. Maybe it’s a coincidence, but they scale with my -N value, and the speed-up flattens once the number of tasks reaches 12.

There is no speed-up from single-threaded (no -threaded option) to -threaded with -N1 (with 5 tasks), so this number of tasks doesn’t seem to be relevant.

Does parseWord do IO?
If it doesn’t, I’m curious whether you get the same results using Control.Parallel.Strategies and parListChunk or a similar strategy, like the sketch below.
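
As a rough sketch, assuming a pure variant of parseWord (here called parseWordPure, hypothetical) and an arbitrary chunk size of 1000:

import Control.Parallel.Strategies (parListChunk, rdeepseq, withStrategy)

-- one spark per 1000-element chunk, forcing each result with deepseq
lsStenos = withStrategy (parListChunk 1000 rdeepseq) (map parseWordPure ls)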

GC only takes 5s compared to 730s of MUT time, so it’s clearly not caused by GC.

There could be MVar lock contention, but 1-2k locks per second is not that much. I would still suggest removing the MVar: it looks like too much locking to me, and it makes the code more complicated. Use something like this instead:

-- mapConcurrently is from Control.Concurrent.Async, chunksOf from Data.List.Split
concat <$> mapConcurrently (mapM parseWord)
    (chunksOf (length ls `div` nj + 1) ls)

I think the issue is in the parseWord function itself. Does it contain any locking? Maybe it uses FFI calls that take a lock underneath or interact with the runtime in weird ways?


I agree, GC doesn’t seem to be the issue. I was just going to suggest this:

-- transpose interleaves the words, so each of the nj chunks gets a similar mix
concat <$> mapConcurrently (mapM parseWord) (transpose (chunksOf nj ls))

But remember that the OP already said they tried:

  • having each job take 100 words at once instead of one, to avoid waiting for the lock

So, it seems that this might not be the problem.

If that doesn’t work, you could go even further by splitting the input file into 12 files, building a single-threaded executable, and running it concurrently on those 12 files. If that still doesn’t work, then there is little Haskell/GHC can do about it.


The chunksOf approach is where I started. I switched to the MVar to avoid threads idling towards the end: the runtime of parseWord depends on word length, and there are always chunks whose words are, on average, longer or shorter.

I hadn’t tested that thoroughly back then, so I now tested my current code with chunksOf again: the MVar approach is consistently quicker (by less than 10% on average). More importantly, the bad speed-up behaviour is the same. So no, the MVar is not to blame.

I also went through the code of parseWord again. There is some locking going on, because error output is written to a single file, but not all that much. Removing that output didn’t improve the speed-up either.

In order to get to the bottom of this, I will indeed try the single-threaded approach. Are you suggesting simply starting my (single-threaded) program N times, on N different files? Or is there a more intelligent way to go about this?

I could even write the results to N different output files and do the concatenation of the results completely independently.

I think the recommended way to deal with this is just to increase the number of chunks to perhaps 5x to 10x the number of physical cores/hyperthreads that you have, so something like transpose (chunksOf (5 * nj) ls). Generally, I believe spawning threads is very lightweight in Haskell, so it shouldn’t be a problem to have 100-1000 of them even if your computer only has a dozen physical threads.

Edit: Simon Marlow has a benchmark in his book Parallel and Concurrent Programming in Haskell (chapter 3, “Evaluation Strategies”). He finds that about 50-100 threads is ideal for his particular problem.
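
Combined with the mapConcurrently suggestion above, that would look roughly like this (a sketch, reusing nj and ls from the original post; the factor 5 is just a starting point):

import Control.Concurrent.Async (mapConcurrently)
import Data.List (transpose)
import Data.List.Split (chunksOf)

-- 5 * nj interleaved chunks: a job that finishes early frees its
-- capability for the next chunk instead of idling
lsStenos <- concat <$> mapConcurrently (mapM parseWord)
    (transpose (chunksOf (5 * nj) ls))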

Yes, I would just do the simple thing and perhaps write a small bash script which spawns the program 12 times and then uses cat to combine the results at the end (or just discard the results). If you want to make it fancier you could try to do something with the process library to keep everything in Haskell (sketched below), but I don’t think that is necessary.
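
A minimal sketch of that process-library variant, assuming the input has already been split into hypothetical files part-1.txt … part-12.txt and reusing the executable and flags from the earlier cabal run invocation:

import Control.Concurrent.Async (forConcurrently_)
import System.Process (callProcess)

-- run 12 independent OS processes, one per pre-split input file
main :: IO ()
main = forConcurrently_ [1 .. 12 :: Int] $ \i ->
    callProcess "palantype-ops"
        ["stenoDict", "--file-input", "part-" <> show i <> ".txt"]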


I am seeing the speed-up happening already. I will post an updated diagram once I have all the results.
But how weird is that? I am naively running single-threaded programs in the background using & and wait, and that scales well, in contrast to Haskell’s light-weight threads.

If using external processes does speed things up, then I think it is worth escalating this to the GHC developers. You can do that by opening an issue on the GHC issue tracker. It would be ideal if you could include a way to reproduce the problem.


Interesting: is Haskell/GHC not as good at parallelism as advertised, given that it cannot attain the usual speed-up here? Kind of sad.
Perhaps green threads are rudimentary in practice.