Hi!
While re-learning async (great stuff btw!) I just noticed that the -N option, which controls the number of physical threads used by the RTS, is not listed in the manual:
After a little printf debugging I saw I was simply feeding it too few tasks at a time. The work is completely parallel: no data dependencies (not even MVars or transactional variables), no FFI, just rejection sampling a lot of random structures under different conditions and streaming to disk.
Now it’s burning rubber on all 64 cores, happy times.
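
For anyone hitting the same thing, here is a minimal sketch of the idea (not the actual program): hand the RTS several times more independent jobs than it has capabilities, so no core sits idle. `sampleJob` and the factor of 8 are made up for illustration; `mapConcurrently` comes from the async package and `getNumCapabilities` from base. It assumes a build with -threaded and -rtsopts, since otherwise +RTS -N is not accepted.

```haskell
-- Build: ghc -O2 -threaded -rtsopts Main.hs   (needs the async package)
-- Run:   ./Main +RTS -N -RTS                  (-N with no number uses all processors)
module Main (main) where

import Control.Concurrent (getNumCapabilities)
import Control.Concurrent.Async (mapConcurrently)
import Control.Exception (evaluate)
import Data.List (foldl')

-- Hypothetical stand-in for one independent sampling job:
-- pure, CPU-bound, no shared state.
sampleJob :: Int -> Int
sampleJob seed = foldl' step seed [1 .. 200000]
  where
    step acc i = (acc * 1103515245 + 12345 + i) `mod` 2147483648

-- Run every job in its own lightweight thread. `evaluate` forces the
-- result inside that thread, so the work is not left behind as a thunk.
runJobs :: [Int] -> IO [Int]
runJobs = mapConcurrently (evaluate . sampleJob)

main :: IO ()
main = do
  caps <- getNumCapabilities
  -- Assumption for illustration: 8 jobs per capability keeps all cores fed.
  let jobs = [1 .. caps * 8]
  results <- runJobs jobs
  putStrLn ("capabilities: " ++ show caps
            ++ ", jobs finished: " ++ show (length results))
```

If the reported number of capabilities is 1, the program was either not linked with -threaded or not started with +RTS -N.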