What speed-up to expect for a parallel program on 6 cores?

This is my weird result, all in one picture.

Yes, I am going to create some sort of minimal example, and I will check whether I can reproduce this on my laptop (4 cores, 8 threads).

2 Likes

Could you show the parseWord code? Maybe as a gist on GitHub.

And what will happen if you replace parseWord with some dummy code (like length or sum . map length . inits)?
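For instance (assuming parseWord takes a Text and runs in IO, which I’m guessing at here; the result type just needs an NFData instance), the swap could look like:

import Data.List (inits)
import Data.Text (Text)
import qualified Data.Text as Text

-- stand-in doing roughly quadratic work per word, no steno logic at all
dummyParseWord :: Text -> IO Int
dummyParseWord = pure . sum . map length . inits . Text.unpack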

1 Like

This is the actual code for parseWord; unfortunately, it’s quite long. The actual runtime is spent inside optimizeStenoSeries, a recursive function that translates natural German words into steno codes by means of fitting primitives.

A primitive is one or more letters for which a partial steno code is defined. optimizeStenoSeries looks up letters in a trie that contains the mapping from natural word parts to partial steno codes and then finds the optimum based on an efficiency rating.
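To give a rough idea of the shape of that search, here is a heavily simplified sketch. It is not the actual optimizeStenoSeries: it uses a plain Map instead of the trie, made-up efficiency scores, and no memoisation.

import           Data.List       (maximumBy, stripPrefix)
import qualified Data.Map.Strict as Map
import           Data.Ord        (comparing)

-- word part -> (partial steno code, efficiency rating)
type Primitives = Map.Map String (String, Int)

-- Decompose a word into primitives and keep the highest-scoring decomposition.
bestSeries :: Primitives -> String -> Maybe ([String], Int)
bestSeries _     ""   = Just ([], 0)
bestSeries prims word =
    case candidates of
      [] -> Nothing                               -- no primitive fits: dead end
      _  -> Just (maximumBy (comparing snd) candidates)
  where
    candidates =
      [ (steno : rest, eff + score)
      | (part, (steno, eff)) <- Map.toList prims
      , Just remainder       <- [stripPrefix part word]
      , Just (rest, score)   <- [bestSeries prims remainder]
      ]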

1 Like

I will experiment with this to narrow down the problem before someone has to read my whole program.

Most of the code in parseWord is pure and does not make any FFI calls, so there shouldn’t be any locking.

I have another suspicion — does deepseq really work? Maybe some NFData instance is not complete, and some data isn’t evaluated until the sequential part of the program is running.

Could you measure the time the lsStenos <- mconcat ... line takes and compare it to the total time? Maybe most of the time is spent in the sequential part of the program.

1 Like

So it looks like I made a mistake regarding the flag “+RTS -A64M”, the allocation area size of the GC. I mistakenly concluded that, in general, -A64M results in shorter runtime. Now I can’t reproduce this anymore, neither for parallel code nor for sequential code (I did quite a bit of refactoring in the meantime, but none of my changes provides an obvious explanation).

At some point I tweaked the -A parameter to that value and left it there in the false belief that it helped.

The runtime statistics (+RTS -s) indicate “99.0% productivity” with -A64M vs. a mere “74.5% productivity” without setting -A. There is considerably less GC activity in the former case (0.9 s elapsed vs. 9.4 s in the latter), but the overall run time is still halved when I omit -A64M.

When I just omit the -A flag, the speed-up looks way more favorable. It remains odd to me that the OS-process approach consistently outperforms the GHC threads, but the latter at least show a consistent speed-up all the way up to 11 jobs.

1 Like
-- Relevant imports in the full program (my reconstruction; the usual modules
-- from async, deepseq, split, clock and formatting):
-- import Control.Concurrent.Async (mapConcurrently)
-- import Control.DeepSeq          (NFData, deepseq)
-- import Data.List                (transpose)
-- import Data.List.Split          (chunksOf)
-- import Formatting               (fprint, (%))
-- import Formatting.Clock         (timeSpecs)
-- import System.Clock             (Clock (Monotonic), getTime)

let
    -- Like traverse, but forces each result to normal form before continuing.
    traverseDeep :: NFData b => (a -> IO b) -> [a] -> IO [b]
    traverseDeep _ [] = pure []
    traverseDeep f (x : xs) = do
      y <- f x
      ys <- y `deepseq` traverseDeep f xs
      pure $ y : ys

lsStenos <- if nj == 1
  then traverseDeep parseWord ls
  else do
      -- split ls into 10 * nj interleaved sub-lists and parse them concurrently
      lls <- mapConcurrently (traverseDeep parseWord)
                             (transpose $ chunksOf (10 * nj) ls)
      stopLls <- lls `deepseq` getTime Monotonic
      putStr "Runtime till lls: "
      fprint (timeSpecs % "\n") start stopLls
      let lsFlat = mconcat lls

      stopLsFlat <- lsFlat `deepseq` getTime Monotonic
      putStr "Runtime till lsFlat: "
      fprint (timeSpecs % "\n") start stopLsFlat
      pure lsFlat

This is how I measure timing. The sequential part does not play an important role: between stopLls and stopLsFlat, only 0.02 s pass. But I am unsure about how to go about time measurement properly.

Furthermore, the value that I deepseq has type

(Text, [(RawSteno, (PatternGroup key, Greediness))]),

where RawSteno is a newtype around Text, PatternGroup key is a flat sum type, and Greediness is a type synonym for Int. So the NFData instances are basic as far as I understand.
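For completeness, this is roughly the shape of those instances (a sketch with made-up constructor names, not the literal code from my repo):

{-# LANGUAGE DeriveGeneric              #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.DeepSeq (NFData)
import Data.Text       (Text)
import GHC.Generics    (Generic)

newtype RawSteno = RawSteno Text
  deriving (NFData)                 -- newtype deriving: reuses the NFData Text instance

data PatternGroup key = PatternOne | PatternTwo | PatternThree
  deriving (Generic)

instance NFData (PatternGroup key)  -- Generic default; nullary constructors leave nothing to force

type Greediness = Int               -- NFData Int is provided by deepseq

-- the value handed to deepseq:
type ParseResult key = (Text, [(RawSteno, (PatternGroup key, Greediness))])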

I can suggest giving this package a try: GitHub - lehins/scheduler: A work stealing scheduler

It was designed specifically for such a use case!
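Basic usage could look something like this (a sketch, assuming parseWord runs in IO; forcing in the worker mirrors what traverseDeep above does):

import Control.DeepSeq   (force)
import Control.Exception (evaluate)
import Control.Monad     ((<=<))
import Control.Scheduler            -- traverseConcurrently, Comp (Par, ParN, ...)

-- use all capabilities (Par), or cap the number of workers with ParN:
lsStenos <- traverseConcurrently Par (evaluate . force <=< parseWord) ls
-- lsStenos <- traverseConcurrently (ParN (fromIntegral nj)) (evaluate . force <=< parseWord) ls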

2 Likes

This is my result. scheduler’s traverseConcurrently performs about on par with async’s mapConcurrently, with a slight edge at nj >= 10. It does follow the general trend, though, and is still outperformed by the OS processes (with single-threaded GHC).

Do you maybe have an idea of what parameters to tweak? I determined the optimum of the GC allocation area size (the -A parameter) to be 2 MB, close to the default value. Other parameters haven’t made a difference yet. I still don’t find this speed-up satisfying.

Do you have this benchmark somewhere in a public repo?

In particular, are you benchmarking this?

traverseConcurrently Par (fmap force . parseWord) ls

Everything I do, you can find on GitHub.

If you want to run the program, I can surely assist with that.

This repo won’t be the easiest possible way to reproduce the speed-up, though. I am working on a minimal example that (most likely) won’t use my actual code but merely some dummy functions with a similar memory and runtime profile.

You should measure the parallel part alone, without the additional deepseq that runs in the main thread:

      beginParallel <- getTime Monotonic
      lls <- mapConcurrently (traverseDeep parseWord)
                             (transpose $ chunksOf (10 * nj) ls)
      endParallel <- getTime Monotonic
      putStr "Parallel runtime is: "
      fprint (timeSpecs % "\n") beginParallel endParallel

It won’t make a big difference, since it’s just another evaluation of the already evaluated list, but it’s still the more correct way to measure.

traverseDeep could be simplified a bit:

traverseDeep f = mapM (evaluate . force <=< f)
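Spelled out with the imports it needs (standard modules from base and deepseq), that is:

import Control.DeepSeq   (NFData, force)
import Control.Exception (evaluate)
import Control.Monad     ((<=<))

-- Run the action, then force the result to normal form right here,
-- before moving on to the next element.
traverseDeep :: NFData b => (a -> IO b) -> [a] -> IO [b]
traverseDeep f = mapM (evaluate . force <=< f)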

But the main issue is that -A64m somehow dampens parallelism. It’s really surprising. It would be great if you could come up with a minimal example that reproduces this behavior and report it to the GHC developers.

BTW, what compiler version do you use? Does the behavior change on other GHC versions?

1 Like

So the issue with -A64m seems relatively clear to me. The default value of -A is 1 megabyte (-A1m), and I checked the cache sizes of my CPU, an AMD Ryzen 5 3600:

Cache L1: 64K (per core)
Cache L2: 512K (per core)
Cache L3: 32MB (shared)

I guess the default in the GHC settings is oriented towards common hardware. If I increase the allocation area size to much more than 4 MB, main memory will be used instead of the cache. The allocation area size is per job, i.e. “+RTS -N10 -A4m” implies 40 MB in total.
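The arithmetic, spelled out (cache figures as quoted above):

-- total nursery size = number of capabilities * allocation area size
totalNurseryMB :: Int -> Int -> Int
totalNurseryMB nCapabilities allocAreaMB = nCapabilities * allocAreaMB

-- totalNurseryMB 10  1 ==  10  -- +RTS -N10 -A1m : fits in the 32 MB shared L3
-- totalNurseryMB 10  4 ==  40  -- +RTS -N10 -A4m : already slightly above L3
-- totalNurseryMB 10 64 == 640  -- +RTS -N10 -A64m: far outside the cache, hits main memory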

At around -A64m, garbage collection efficiency is highest, but it all takes place in main memory. With the small default allocation area, GC efficiency is relatively low, but that is apparently outweighed by much faster memory access.

The same behavior can be observed on my laptop with an Intel CPU.

I will present the minimal example here shortly, so that others will be able to reproduce the issue quickly.

2 Likes

I was thinking something similar, but that doesn’t explain why it works fine if you use system processes.

1 Like

No, sorry, there is some confusion that I caused. I treated the -A64m parameter as a minor tweak when it really is a major misconfiguration. I will update the above diagrams with a comment on that. Basically, the difference between system processes and GHC threads melts away when I omit the -A parameter altogether (or set it to the actual optimum of -A2m).

1 Like

This is all that’s left of the weirdness as soon as I treat the -A parameter consistently: system processes slightly outperforming GHC threads.

1 Like

Just to clear up my initial benchmarking mistake: -A64m ruins the performance of the single-threaded execution of my program, too.

And -A64m equally ruins the speed-up for both system processes and GHC threads.

1 Like

It seems the advice from the GHC user’s guide:

In general settings >= 4MB can reduce performance in some cases, in particular for single threaded operation. However in a parallel setting increasing the allocation area to 16MB, or even 64MB can increase gc throughput significantly.

is not very helpful in this case. They could at least add a note about memory contention and CPU caches.

2 Likes
traverseDeep f = mapM (evaluate . force <=< f)

Indeed, more elegant. And using IO execution order of course makes sense here.

Doing the time measurement the right way now: the program spends 95% of the run time in the concurrent part (it was even more before I refactored the code), as one would sanely expect.

I now uploaded the minimal example including benchmarks here:

I would highly appreciate anyone going through the effort of reproducing the result on their machine. I added documentation and scripts to facilitate this.

The minimal example contains a dummy function that mimics the runtime and memory profile of my original code. The speed-up behavior of the minimal example is different, though. It does show system processes outperforming GHC threads, which, I guess, is the part that matters.

3 Likes