Are there libraries providing parallel variants of normal functions for when parallelism in pure code makes sense?
The easiest way to explain monoids and why they matter is, well, ummm, parFoldMap and parFold.
Can you elaborate on your argument? What do monoids have to do with parallelism? I’d say that’s Applicative’s domain.
Monoids are the right abstraction (if your code doesn’t have side effects); see this talk by Guy Steele:
Fun quote (from around 49:35):
My Haskell-fan friends asked me: “Why didn’t you just start with Haskell?” and I tell them: “If I had known seven years ago what I know now, I would have started with Haskell and pushed it a tenth of the way toward Fortran, instead of starting with Fortran and pushing it nine tenths of the way towards Haskell.”
Concretely, you could use functions like mapConcurrently from the unliftio package. But that wouldn’t give you a parallel reduce.
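For what that looks like in practice, here is a small usage sketch, assuming the UnliftIO.Async module from unliftio (fetchAll is an illustrative name, not an existing API):

import UnliftIO.Async (mapConcurrently)

-- Illustrative only: run an effectful action over each input concurrently.
-- Note this is concurrency with effects, not a pure parallel reduce.
fetchAll :: (Int -> IO String) -> [Int] -> IO [String]
fetchAll fetch = mapConcurrently fetch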
Because of the associativity law, you know that it’s always safe to reparenthesise a monoidal computation, like the “fold” step of a foldMap. This would, in theory, allow you to give each thread 1/N of the total <> steps, so long as each thread’s elements were adjacent, and then consolidate the results pairwise.
I have previously joked with people that MapReduce-style frameworks are “foldMap as a service”.
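To make the reparenthesisation concrete, here is a minimal divide-and-conquer sketch, assuming only par and pseq from Control.Parallel (treeFold is an illustrative name; the repeated length/splitAt adds traversal overhead, so this is for exposition, not performance):

import Control.Parallel (par, pseq)

-- Sketch: spark the left half while evaluating the right half, then
-- combine. Associativity of (<>) is what makes this regrouping safe.
treeFold :: Monoid m => [m] -> m
treeFold []  = mempty
treeFold [x] = x
treeFold xs  = l `par` r `pseq` (l <> r)
  where
    (ls, rs) = splitAt (length xs `div` 2) xs
    l = treeFold ls
    r = treeFold rs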
In the context of Haskell, shouldn’t all parallel computations be effect-free? Once you add effects, you’re doing concurrency instead, so monoids can be considered an essence of parallelism (which is probably why Rustaceans claim that language-level parallelism is a solved problem).
Sure, but @Swordlash’s point is that this generalizes to Applicative.
Insofar as applicatives are monoidal, but it’s easier to push pure monoids into parallelism than applicatives.
More interestingly, where exactly are the parallelism libraries in Haskell? I know of various parallel streaming libraries and Simon Marlow’s parallel, but shouldn’t there be more ready-baked (pure) parallelism libraries, even if parallelism in Haskell is unstable?
There are some libraries for array parallelism. Most notably massiv and accelerate. Both implement some form of parallel folding.
There are some plans to add pure parallel primitives to Bluefin. Doing so is quite low on my priority list, but it could be bumped higher by people explaining why it’s important for them and describing their use cases.
I’m aware both exist, but it’s more that there are no casual parallelism libraries. Honestly, what I really want is a foldMapPar and a foldPar / mconcatPar in base.
I agree casual parallelism in base would be nice. The first step would be to actually bring parts of the parallel library into base.
Sadly, I think the pseq function, on which the whole system is built, is basically a hack that is not guaranteed to do the right thing and can mess up the strictness analyzer. Consider this program:
go :: Int -> Int -> Int
go 0 x = x
go n x = go (n - 1) (x + 1)
GHC’s strictness analysis will determine that x is used strictly, because it is always used at the end of the recursion and it depends on the previous value of x at every recursive step. GHC is able to unbox all the integers and avoid all memory allocation:
$wgo :: Int# -> Int# -> Int#
$wgo
  = \ (ww_sP8 :: Int#) (ww1_sPc :: Int#) ->
      case ww_sP8 of ds_X2 {
        __DEFAULT -> $wgo (-# ds_X2 1#) (+# ww1_sPc 1#);
        0# -> ww1_sPc
      }
However, if we insert a pseq to explicitly evaluate n before doing the recursive call, then the strictness analyzer can no longer see this.
import GHC.Conc

go :: Int -> Int -> Int
go 0 x = x
go n x = n `pseq` go (n - 1) (x + 1)
GHC now does not unbox x and does memory allocation on every recursive call:
$wgo :: Int# -> Int -> Int
$wgo
  = \ (ww_sU4 :: Int#) (x_sU6 :: Int) ->
      case ww_sU4 of ds_X2 {
        __DEFAULT ->
          lazy
            ($wgo
               (-# ds_X2 1#) (case x_sU6 of { I# x1_aTT -> I# (+# x1_aTT 1#) }));
        0# -> x_sU6
      }
So, pseq does not just impose an ordering on its arguments; it also makes its second argument completely lazy.
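One plausible workaround, sketched under the assumption that an explicit bang pattern restores the strictness information pseq hides (worth verifying in the Core for your GHC version):

{-# LANGUAGE BangPatterns #-}

import GHC.Conc (pseq)

-- Force the accumulator ourselves, so its strictness no longer depends
-- on the analyzer seeing through pseq's lazy second argument.
go :: Int -> Int -> Int
go 0 !x = x
go n !x = n `pseq` go (n - 1) (x + 1)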
TBH I was thinking about casually implementing those functions, but I ended up with an implementation hundreds of times slower than the default fold.
I assumed it was an issue with not understanding the rewrite rules underlying the fold, but if it’s this, it’s annoying.
Of course, I could always pull out unsafePerformIO, but the MVar version I have leaks. It’s probably just an obvious idea that isn’t done for reasons that are now rather obvious.
Perhaps a bigger issue is that you could be spawning too many parallel computations, at which point the overhead becomes bigger than the gains. You should aim for roughly 100 times as many parallel computations as your machine has cores.
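As a rough sketch of that heuristic (targetSparks and chunkSizeFor are illustrative names, and 100 is just the rule of thumb above):

import GHC.Conc (numCapabilities)

-- Oversubscribe: ~100 sparks per core keeps all cores busy even when
-- individual chunks take uneven amounts of time.
targetSparks :: Int
targetSparks = 100 * numCapabilities

chunkSizeFor :: Int -> Int
chunkSizeFor totalLen = max 1 (totalLen `div` targetSparks)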
Can you share your attempt?
I’ve had good success with parMapReduceRange from the monad-par-extras package, if you don’t mind using the Par monad:

parMapReduceRange :: (NFData a, ParFuture iv p) => InclusiveRange -> (Int -> p a) -> (a -> a -> p a) -> a -> p a
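For reference, a quick usage sketch, assuming runPar from Control.Monad.Par and InclusiveRange from Control.Monad.Par.Combinator (sumSquares is an illustrative name):

import Control.Monad.Par (runPar)
import Control.Monad.Par.Combinator (InclusiveRange (..), parMapReduceRange)

-- Parallel sum of squares over 1..1000: map each index, then reduce
-- with an associative combine and an identity element.
sumSquares :: Int
sumSquares = runPar $
  parMapReduceRange (InclusiveRange 1 1000)
                    (\i -> return (i * i))
                    (\a b -> return (a + b))
                    0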
Here is the test code I’m using; I suppose I should be using criterion, but the runs tend to be long right now, so Data.Time is simplest.
{-# LANGUAGE NumericUnderscores #-}

import Data.Foldable (fold)
import Data.Monoid (Sum (..))
import Data.Time (diffUTCTime, getCurrentTime)

-- foldPar is the parallel fold attempt shown below.
main :: IO ()
main = do
  let u () = Sum <$> [1 .. 1_000_000_000 :: Int]
  initTime <- getCurrentTime
  print $ foldPar (u ())
  startTime <- getCurrentTime
  print $ diffUTCTime startTime initTime
  print $ foldPar (u ())
  contTime <- getCurrentTime
  print $ diffUTCTime contTime startTime
  print $ fold (u ())
  getCurrentTime >>= print . flip diffUTCTime contTime
Interestingly enough, u () isn’t working as a way to defeat caching, because the repeated call is very fast. I guess it’s a side effect of -O2?
As for actual code, this is roughly the current best attempt, which is about 50-120x slower than just calling mconcat.
import Control.Parallel (par, pseq)
import Data.List (foldl')
import GHC.Conc (numCapabilities)

{-# INLINABLE foldPar #-}
foldPar :: Monoid m => [m] -> m
foldPar xs = foldl' go' mempty $ chunkBy initialChunkSize xs
  where
    initialChunkSize = numCapabilities
    chunkBy size ls = case cont of
        [] -> [exposed]
        _  -> exposed : chunkBy size cont
      where
        (exposed, cont) = splitAt size ls
    go' acc new = acc <> go new initialChunkSize
    go [] _ = mempty
    go [x] _ = x
    go [x, y] _ = x <> y
    go [x, y, z, a] _ = mconcat [x, y, z, a]
    go ys chunkSize = case cont of
        [] -> processedFocus
        _  -> par processedFocus processedCont
                `pseq` (processedFocus <> processedCont)
      where
        (focus, cont) = splitAt chunkSize ys
        processedFocus = go focus (div chunkSize 2 + 1)
        processedCont  = go cont (div chunkSize 2 + 1)
Running with +RTS -s, it’s obvious that most of the sparks are fizzling (over 99%, tbh), and I think this has to do with the extreme speed of the (<>) computation and the fact that it’s easy to accidentally write a foldr’ with pseq.
The alternative is to call length, as in an example of pseq / par usage I found online, but if I’m working with a 1-billion-element list, that’s around O(n^2), no?
It’s sort of embarrassing, because I’ve gotten 80-95% of the sparks to convert before, for 75% of linear speed-up, but that was with a set and a long map calculation.
Maybe this is just a problem with lists, but the nice thing about lists is that you can call toList on any other Foldable and get a cheap implementation for FoldablePar.
I suspect the truth is, you’re probably best off specifying the chunk size in the interface, parFold :: (FoldablePar t, Monoid m) => t m -> Int -> m, then just using mconcat underneath.
If you want to see a speedup from parallelisation when folding a list (as opposed to a tree with fast root-splitting) you need to actually be doing work that’s slow relative to traversing that list, or you’re stuck in the land of ~sequential computation.
Adding Ints is very fast, so it’s a poor choice here. For the sake of testing, conjure (and map) something like:
expensive :: Integer -> Integer
expensive = go . go
  where
    -- Slow nonsense that doesn't blow up.
    go p = p ^ p `mod` 11927
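Then the per-element work dominates the list traversal. A hypothetical driver, assuming the foldPar under discussion and Sum from Data.Monoid:

-- Fold something that actually costs time per element:
result :: Integer
result = getSum (foldPar (map (Sum . expensive) [1 .. 100000]))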
Also, your foldPar looks, at a glance, unnecessarily complex and noisy. How about this?
{-# LANGUAGE DeriveFunctor, DerivingVia #-}

import Control.Parallel (par, pseq)
import Data.Monoid (Ap (..))

newtype Par a = Par { runPar :: a }
  deriving (Show, Read, Functor)
  deriving (Semigroup, Monoid) via Ap Par a

instance Applicative Par where
  pure = Par
  liftA2 f (Par x) (Par y) = Par (x `par` y `pseq` f x y)

foldPar :: (Foldable f, Monoid a) => f a -> a
foldPar = runPar . foldMap Par
As for chunking, you can always just compose that on the input side.
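For instance, a sketch of composing chunking on the input side, using the foldPar above (foldParChunked and chunksOf' are illustrative names):

-- Fold each chunk sequentially, then combine the chunk results through
-- the sparking Par applicative, so each spark owns n elements of work.
foldParChunked :: Monoid a => Int -> [a] -> a
foldParChunked n = foldPar . map mconcat . chunksOf' n
  where
    chunksOf' _ [] = []
    chunksOf' k ys = let (a, b) = splitAt k ys in a : chunksOf' k b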
Thanks for sharing this talk, it was one of the best talks I have ever listened to!
It’s also interesting that one can generalize from “Monoids are the essence of parallelism” (i.e., we can parallelize a computation involving an associative accumulation function) to Applicative (i.e., every Monoid m is also an Applicative (Const m), but there are other Applicative computations too).
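To see the embedding concretely, foldMap can be recovered from traverse through Const, since Const m’s Applicative just accumulates the hidden monoid (a standard trick, sketched here with an illustrative name):

import Data.Functor.Const (Const (..))

-- Const m's Applicative combines the hidden monoid values, so an
-- applicative traversal degenerates into a monoidal fold.
foldMapViaConst :: (Traversable t, Monoid m) => (a -> m) -> t a -> m
foldMapViaConst f = getConst . traverse (Const . f)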
I think those functions are not in base, but they are easily accessible using parallel. Did you have a look at parMap or parListChunk?
This is actually wrong. Using Const m is not a generalization. How can we actually generalize this to Applicative?
I think I found it: the Eval Applicative. For example,
- Strategies can now be defined using a convenient Monad/Applicative type, Eval. E.g. parList s = traverse (Par . (`using` s))
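A small usage sketch of that Eval-based strategy (parSum is an illustrative name):

import Control.Parallel.Strategies (parList, rdeepseq, using)

-- Evaluate every list element in parallel via the Eval applicative,
-- then fold the results sequentially.
parSum :: [Int] -> Int
parSum xs = sum (xs `using` parList rdeepseq)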
Doesn’t your code chunk more coarsely, and lack the ability to break up the processing order? Then again, the version I had was doing the same thing due to foldl’…
x `par` y `pseq` f x y means that you spawn a lot of sparks for f x y (since that result is the x of the next step), but x has to be fully evaluated before the next f x y can finish.
That said, what is the difference between Applicative (lax monoidal functor) and monoidal parallelism? Applicative has the greater power insofar as the contained terms can have different types, whereas monoidal parallelism requires the contained terms to all be of the same type. But AFAIK monoidal parallelism has the power that, for a list [x0, x1, …], you can parallelize as f x0 x1, f x2 x3, then parallelize f x0’ x1’, f x2’ x3’, and so on.
Then again, you have the same number of function applications; on the plus side, a slow or failed parallel subcomputation (returning mempty) won’t lock up the entire computation chain.