So it looks like I made a mistake regarding the flag “+RTS -A64M”, which sets the GC's allocation area size. I mistakenly concluded that, in general, -A64M results in shorter runtime. Now I can’t reproduce this anymore, neither for parallel code nor for sequential code (I did quite a bit of refactoring in the meantime, but none of my changes provides an obvious explanation).
At some point I tweaked the -A parameter to that value and left it there in the false belief that it helped.
The runtime statistics (+RTS -s) indicate “99.0% productivity” with -A64M vs. a mere “74.5% productivity” without setting -A. There is considerably less GC activity in the former case: 0.9s elapsed GC time vs. 9.4s in the latter, yet the overall run-time is halved when I omit -A64M.
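To see why high productivity and a longer run are not contradictory, here is a rough back-of-the-envelope sketch using the numbers above. It assumes that productivity is mutator time divided by total time, that everything outside the mutator is GC (INIT and EXIT ignored), and that the quoted percentages and GC times refer to the same clock. The helper names `totalFromGc` and `mutatorFromGc` are made up for illustration.

```haskell
import Text.Printf (printf)

-- | Estimated total run time in seconds.
-- Assumption: productivity = mutator / total, and total = mutator + GC,
-- so GC = total * (1 - productivity).
totalFromGc :: Double -> Double -> Double
totalFromGc gcSeconds productivity = gcSeconds / (1 - productivity)

-- | Estimated mutator (non-GC) time in seconds under the same assumption.
mutatorFromGc :: Double -> Double -> Double
mutatorFromGc gcSeconds productivity =
  totalFromGc gcSeconds productivity * productivity

main :: IO ()
main = do
  report "-A64M"      0.9 0.990  -- figures quoted from +RTS -s above
  report "default -A" 9.4 0.745
  where
    report :: String -> Double -> Double -> IO ()
    report label gc prod =
      printf "%-12s total ~%.0f s, mutator ~%.0f s\n"
        label (totalFromGc gc prod) (mutatorFromGc gc prod)
```

Under these assumptions the estimate comes out at roughly 90 s total with -A64M against roughly 37 s without it, which is in line with the halved run-time I measured. The 99.0% figure is rounded, so the first estimate is quite sensitive to that last digit. The point is that the mutator itself appears to run much slower with the large allocation area; the GC savings are small by comparison.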
With the -A flag simply omitted, the speed-up looks far more favorable. It remains odd to me that the OS-process approach consistently outperforms the GHC threads, but the threads, at least, show consistent speed-up all the way to 11 jobs.