Are there libraries providing parallel variants of normal functions for when parallelism in pure code makes sense?
The easiest way to explain monoids and why they matter is, well, ummm, parFoldMap and parFold.
Can you elaborate on your argument? What do monoids have to do with parallelism? I’d say that’s Applicative’s domain.
Monoids are the right abstraction (if your code doesn’t have side effects); see this talk by Guy Steele:
Fun quote (from around 49:35):
My Haskell-fan friends asked me: “Why didn’t you just start with Haskell?” and I tell them: “If I had known seven years ago what I know now, I would have started with Haskell and pushed it a tenth of the way toward Fortran, instead of starting with Fortran and pushing it nine tenths of the way towards Haskell.”
Concretely, you could use functions like mapConcurrently from the unliftio package. But that wouldn’t give you a parallel reduce.
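For what that looks like in practice, here is a small usage sketch, assuming the UnliftIO.Async module from unliftio (fetchAll is an illustrative name, not an existing API):

import UnliftIO.Async (mapConcurrently)

-- Illustrative only: run an effectful action over each input concurrently.
-- Note this is concurrency with effects, not a pure parallel reduce.
fetchAll :: (Int -> IO String) -> [Int] -> IO [String]
fetchAll fetch = mapConcurrently fetch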
Because of the associativity law, you know that it’s always safe to reparenthesise a monoidal computation, like the “fold” step of a foldMap. This would, in theory, allow you to give each thread 1/N of the total <> steps, so long as each thread’s elements were adjacent, and then consolidate the results pairwise.
I have previously joked with people that MapReduce-style frameworks are “foldMap as a service”.
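To make the reparenthesisation concrete, here is a minimal divide-and-conquer sketch, assuming only par and pseq from Control.Parallel (treeFold is an illustrative name; the repeated length/splitAt adds traversal overhead, so this is for exposition, not performance):

import Control.Parallel (par, pseq)

-- Sketch: spark the left half while evaluating the right half, then
-- combine. Associativity of (<>) is what makes this regrouping safe.
treeFold :: Monoid m => [m] -> m
treeFold []  = mempty
treeFold [x] = x
treeFold xs  = l `par` r `pseq` (l <> r)
  where
    (ls, rs) = splitAt (length xs `div` 2) xs
    l = treeFold ls
    r = treeFold rs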
In the context of Haskell, shouldn’t all parallel computations be effect-free? Once you add effects, you’re doing concurrency instead, so monoids can be considered an essence of parallelism (which is probably why Rustaceans claim that language-level parallelism is a solved problem).
Sure, but @Swordlash’s point is that this generalizes to Applicative.
Insofar as applicatives are monoidal, but it’s easier to push pure monoids into parallelism than applicatives.
More interestingly, where exactly are the parallelism libraries in Haskell? I know of various parallel streaming libraries and Simon Marlow’s parallel, but shouldn’t there be more ready-baked (pure) parallelism libraries, even if parallelism in Haskell is unstable?
There are some libraries for array parallelism. Most notably massiv and accelerate. Both implement some form of parallel folding.
There are some plans to add pure parallel primitives to Bluefin. Doing so is quite low on my priority list, but it could be bumped higher by people explaining why it’s important for them and describing their use cases.
I’m aware both exist, but it’s more that there are no casual parallelism libraries. Honestly, what I really want is a foldMapPar and a foldPar / mconcatPar in base.
I agree casual parallelism in base would be nice. The first step would be to actually bring parts of the parallel library into base.
Sadly, I think the pseq function, on which the whole system is built, is basically a hack that is not guaranteed to do the right thing and can mess up the strictness analyzer. Consider this program:
go :: Int -> Int -> Int
go 0 x = x
go n x = go (n - 1) (x + 1)
GHC’s strictness analysis will determine that x is used strictly, because it is always used at the end of the recursion and it depends on the previous value of x at every recursive step. GHC is able to unbox all the integers and avoid all memory allocation:
$wgo :: Int# -> Int# -> Int#
$wgo
  = \ (ww_sP8 :: Int#) (ww1_sPc :: Int#) ->
      case ww_sP8 of ds_X2 {
        __DEFAULT -> $wgo (-# ds_X2 1#) (+# ww1_sPc 1#);
        0# -> ww1_sPc
      }
However, if we insert a pseq to explicitly evaluate n before doing the recursive call, then the strictness analyzer can no longer see this.
import GHC.Conc

go :: Int -> Int -> Int
go 0 x = x
go n x = n `pseq` go (n - 1) (x + 1)
GHC now does not unbox x and does memory allocation on every recursive call:
$wgo :: Int# -> Int -> Int
$wgo
  = \ (ww_sU4 :: Int#) (x_sU6 :: Int) ->
      case ww_sU4 of ds_X2 {
        __DEFAULT ->
          lazy
            ($wgo
               (-# ds_X2 1#) (case x_sU6 of { I# x1_aTT -> I# (+# x1_aTT 1#) }));
        0# -> x_sU6
      }
So, pseq does not just impose an ordering on its arguments; it also makes its second argument completely lazy.
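One plausible workaround, sketched under the assumption that an explicit bang pattern restores the strictness information pseq hides (worth verifying in the Core for your GHC version):

{-# LANGUAGE BangPatterns #-}

import GHC.Conc (pseq)

-- Force the accumulator ourselves, so its strictness no longer depends
-- on the analyzer seeing through pseq's lazy second argument.
go :: Int -> Int -> Int
go 0 !x = x
go n !x = n `pseq` go (n - 1) (x + 1)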
TBH I was thinking about casually implementing those functions, but I ended up with an implementation hundreds of times slower than the default fold.
I assumed it was an issue with not understanding the rewrite rules underlying the fold, but if it’s this, it’s annoying.
Of course, I could always pull out unsafePerformIO, but the MVar version I have leaks. It’s probably just an obvious idea that isn’t done for reasons that are now rather obvious.
Perhaps a bigger issue is that you could be spawning too many parallel computations, at which point the overhead becomes bigger than the gains. You should aim for roughly 100 times as many parallel computations as your machine has cores.
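As a rough sketch of that heuristic (targetSparks and chunkSizeFor are illustrative names, and 100 is just the rule of thumb above):

import GHC.Conc (numCapabilities)

-- Oversubscribe: ~100 sparks per core keeps all cores busy even when
-- individual chunks take uneven amounts of time.
targetSparks :: Int
targetSparks = 100 * numCapabilities

chunkSizeFor :: Int -> Int
chunkSizeFor totalLen = max 1 (totalLen `div` targetSparks)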
Can you share your attempt?
I’ve had good success with parMapReduceRange from the monad-par-extras package, if you don’t mind using the Par monad:

parMapReduceRange :: (NFData a, ParFuture iv p) => InclusiveRange -> (Int -> p a) -> (a -> a -> p a) -> a -> p a
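For reference, a quick usage sketch, assuming runPar from Control.Monad.Par and InclusiveRange from Control.Monad.Par.Combinator (sumSquares is an illustrative name):

import Control.Monad.Par (runPar)
import Control.Monad.Par.Combinator (InclusiveRange (..), parMapReduceRange)

-- Parallel sum of squares over 1..1000: map each index, then reduce
-- with an associative combine and an identity element.
sumSquares :: Int
sumSquares = runPar $
  parMapReduceRange (InclusiveRange 1 1000)
                    (\i -> return (i * i))
                    (\a b -> return (a + b))
                    0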
Here is the test code I’m using; I suppose I should be using criterion, but the runs tend to be long right now, so Data.Time is simplest.
{-# LANGUAGE NumericUnderscores #-}

import Data.Foldable (fold)
import Data.Monoid (Sum (..))
import Data.Time (diffUTCTime, getCurrentTime)

-- foldPar is the parallel fold attempt shown below.
main :: IO ()
main = do
  let u () = Sum <$> [1 .. 1_000_000_000 :: Int]
  initTime <- getCurrentTime
  print $ foldPar (u ())
  startTime <- getCurrentTime
  print $ diffUTCTime startTime initTime
  print $ foldPar (u ())
  contTime <- getCurrentTime
  print $ diffUTCTime contTime startTime
  print $ fold (u ())
  getCurrentTime >>= print . flip diffUTCTime contTime
Interestingly enough, u () isn’t working as a way to defeat caching, because the repeated call is very fast. I guess it’s a side effect of -O2?
As for actual code, this is roughly the current best attempt, which is about 50-120x slower than just calling mconcat.
import Control.Parallel (par, pseq)
import Data.List (foldl')
import GHC.Conc (numCapabilities)

{-# INLINABLE foldPar #-}
foldPar :: Monoid m => [m] -> m
foldPar xs = foldl' go' mempty $ chunkBy initialChunkSize xs
  where
    initialChunkSize = numCapabilities
    chunkBy size ls = case cont of
        [] -> [exposed]
        _  -> exposed : chunkBy size cont
      where
        (exposed, cont) = splitAt size ls
    go' acc new = acc <> go new initialChunkSize
    go [] _ = mempty
    go [x] _ = x
    go [x, y] _ = x <> y
    go [x, y, z, a] _ = mconcat [x, y, z, a]
    go ys chunkSize = case cont of
        [] -> processedFocus
        _  -> par processedFocus processedCont
                `pseq` (processedFocus <> processedCont)
      where
        (focus, cont) = splitAt chunkSize ys
        processedFocus = go focus (div chunkSize 2 + 1)
        processedCont  = go cont (div chunkSize 2 + 1)
Running with +RTS -s, it’s obvious that most of the sparks are fizzling (over 99%, tbh), and I think this has to do with the extreme speed of the (<>) computation and the fact that it’s easy to accidentally write a foldr’ with pseq.
The alternative is to call length, as in an example of pseq / par usage I found online, but if I’m working with a 1-billion-element list, that’s around O(n^2), no?
It’s sort of embarrassing, because I’ve gotten 80-95% of the sparks to convert before, for 75% of linear speed-up, but that was with a set and a long map calculation.
Maybe this is just a problem with lists, but the nice thing about lists is that you can call toList on any other Foldable and get a cheap implementation for FoldablePar.
I suspect the truth is, you’re probably best off specifying the chunk size in the interface, parFold :: (FoldablePar t, Monoid m) => t m -> Int -> m, then just using mconcat underneath.
If you want to see a speedup from parallelisation when folding a list (as opposed to a tree with fast root-splitting) you need to actually be doing work that’s slow relative to traversing that list, or you’re stuck in the land of ~sequential computation.
Adding Ints is very fast, so it’s a poor choice here. For the sake of testing, conjure (and map) something like:
expensive :: Integer -> Integer
expensive = go . go
  where
    -- Slow nonsense that doesn't blow up.
    go p = p ^ p `mod` 11927
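Then the per-element work dominates the list traversal. A hypothetical driver, assuming the foldPar under discussion and Sum from Data.Monoid:

-- Fold something that actually costs time per element:
result :: Integer
result = getSum (foldPar (map (Sum . expensive) [1 .. 100000]))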
Also, your foldPar looks, at a glance, unnecessarily complex and noisy. How about this?
{-# LANGUAGE DeriveFunctor, DerivingVia #-}

import Control.Parallel (par, pseq)
import Data.Monoid (Ap (..))

newtype Par a = Par { runPar :: a }
  deriving (Show, Read, Functor)
  deriving (Semigroup, Monoid) via Ap Par a

instance Applicative Par where
  pure = Par
  liftA2 f (Par x) (Par y) = Par (x `par` y `pseq` f x y)

foldPar :: (Foldable f, Monoid a) => f a -> a
foldPar = runPar . foldMap Par
As for chunking, you can always just compose that on the input side.
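For instance, a sketch of composing chunking on the input side, using the foldPar above (foldParChunked and chunksOf' are illustrative names):

-- Fold each chunk sequentially, then combine the chunk results through
-- the sparking Par applicative, so each spark owns n elements of work.
foldParChunked :: Monoid a => Int -> [a] -> a
foldParChunked n = foldPar . map mconcat . chunksOf' n
  where
    chunksOf' _ [] = []
    chunksOf' k ys = let (a, b) = splitAt k ys in a : chunksOf' k b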
Thanks for sharing this talk, it was one of the best talks I have ever listened to!
It’s also interesting that one can generalize from “Monoids are the essence of parallelism” (i.e., we can parallelize a computation involving an associative accumulation function) to Applicative (i.e., every Monoid m is also an Applicative (Const m), but there are other Applicative computations too).
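To see the embedding concretely, foldMap can be recovered from traverse through Const, since Const m’s Applicative just accumulates the hidden monoid (a standard trick, sketched here with an illustrative name):

import Data.Functor.Const (Const (..))

-- Const m's Applicative combines the hidden monoid values, so an
-- applicative traversal degenerates into a monoidal fold.
foldMapViaConst :: (Traversable t, Monoid m) => (a -> m) -> t a -> m
foldMapViaConst f = getConst . traverse (Const . f)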
I think those functions are not in base, but they are easily accessible using parallel. Did you have a look at parMap or parListChunk?
This is actually wrong. Using Const m is not a generalization. How can we actually generalize this to Applicative?
I think I found it: the Eval Applicative. For example,
- Strategies can now be defined using a convenient Monad/Applicative type, Eval. E.g. parList s = traverse (Par . (`using` s))
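A small usage sketch of that Eval-based strategy (parSum is an illustrative name):

import Control.Parallel.Strategies (parList, rdeepseq, using)

-- Evaluate every list element in parallel via the Eval applicative,
-- then fold the results sequentially.
parSum :: [Int] -> Int
parSum xs = sum (xs `using` parList rdeepseq)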
Doesn’t your code chunk more coarsely, and lack the ability to break up the processing order? Then again, the version I had was doing the same thing due to foldl’…
x `par` y `pseq` f x y means that you spawn a lot of sparks for f x y (since that result is the x of the next step), but x has to be fully evaluated before the next f x y can finish.
That said, what is the difference between Applicative (lax monoidal functor) and monoidal parallelism? Applicative has the greater power insofar as the contained terms can have different types, whereas monoidal parallelism requires the contained terms to all be of the same type. But AFAIK monoidal parallelism has the power that, for a list [x0, x1, …], you can parallelize as f x0 x1, f x2 x3, then parallelize f x0’ x1’, f x2’ x3’, and so on.
Then again, you have the same number of function applications; on the plus side, a slow or failed parallel subcomputation (returning mempty) won’t lock up the entire computation chain.