Help with concurrency: more cores makes code slower instead of faster

I have a problem with parallelizing my Haskell program. To make it easier for you to see and understand, I’ve put everything in a very minimal GitHub project, along with a README that describes the problem:

I’ll repeat myself a little bit though: I have a program that reads a binary file and goes through its contents p times, gathering indices of where we have byte values of 0, 1, 2, etc. in the file.

I’m already using Data.Vector.Unboxed, which is supposed to be efficient in both storage and performance. Going through the file is done with parList rseq from Control.Parallel.Strategies, and I see sparks getting created and converted just fine. All the numbers in the performance report add up, and even the ThreadScope report looks good to my layman’s eye. But the program gets significantly slower with more cores, which is bizarre to me.

Can anyone offer assistance? Is this something trivial? The code is really small, I encourage you to take a look here.
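For reference, here is a minimal sketch of the pattern described above (the names are hypothetical, not the actual project code): one spark per byte value, each collecting the indices where that value occurs.

```haskell
import Control.Parallel.Strategies (parList, rseq, using)
import qualified Data.Vector.Unboxed as UV
import Data.Word (Word8)

-- One spark per byte value: for each possible Word8, find every index in
-- the input where that value occurs. rseq only evaluates each result to
-- WHNF, but reaching WHNF of an unboxed vector means it is fully built.
indicesPerValue :: UV.Vector Word8 -> [UV.Vector Int]
indicesPerValue v =
  map (\b -> UV.findIndices (== b) v) [0 .. 255] `using` parList rseq
```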

4 Likes

If I disable profiling, then I do see a performance increase when given more capabilities.

Also, you should always use an export list on the Main module (i.e., module Main (main) where). That gives me a further 30% speedup (though it makes the difference between -N1 and -N6 smaller).

1 Like

A couple of things:

  1. Memory contention from every thread reading the same vector. It might be better to hand off copies to the threads.

  2. Poor cache locality (I think), because different chunks are read all over the place. Parallelising by chunks could help here, but that’s a pretty involved rewrite.

  3. Evaluate the whole thing (deepseq) to avoid thunk build-up. (Incidentally, you don’t need the intermediate list; you can just use UV.generate and index the bytestring.)
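As a sketch of points 1 and 3 (countEach is a made-up example, not the project’s code): hand each task its own copy of the vector via UV.force, and evaluate the results deeply with rdeepseq.

```haskell
import Control.Parallel.Strategies (parList, rdeepseq, using)
import qualified Data.Vector.Unboxed as UV

-- Each task filters its own freshly allocated copy (UV.force) of the
-- input, and parList rdeepseq fully evaluates every count in parallel.
countEach :: UV.Vector Int -> [Int] -> [Int]
countEach v targets =
  map (\t -> UV.length (UV.filter (== t) (UV.force v))) targets
    `using` parList rdeepseq
```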

4 Likes

Oh wow, you are completely right about both of these things. I wouldn’t have thought profiling would have that much of an impact.

Regarding the export list for main… that is also true. Can you explain why?

Cache locality is definitely an issue here, but that’s something I cannot solve.

Regarding memory contention…how is that a problem with read-only memory? Is there still some sort of locking going on?

I think so.

The following code is slightly faster in parallel. It replicates the vector and parallelizes over the copies. On my machine this speeds up the code in both configurations, and it also widens the gap between single-core and parallel runs. The difference is more noticeable when you: 1) increase the number of partitions to 255, and 2) force deep evaluation and print the lengths for good measure (I’m not sure why the force + evaluate combination doesn’t already do this).

My guess is/was that it’s a memory bandwidth issue not from lock contention but from memory and cache bus issues (the example is slightly different since there are writes involved but the answers hold).

{-# LANGUAGE ImportQualifiedPost #-}

module Main where

import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Control.Parallel.Strategies (parList, rdeepseq, using)
import Data.ByteString qualified as BS
import Data.Vector.Unboxed qualified as UV
import Data.Word (Word8)

type PartitionIndex = Word8

type PartitionVector = UV.Vector PartitionIndex

readPartitionImage :: FilePath -> IO PartitionVector
readPartitionImage fp = do
  contents <- BS.readFile fp
  pure (UV.generate (BS.length contents) (BS.index contents))

main :: IO ()
main = do
  putStrLn "start"
  pixels <- readPartitionImage "partitions.bin"
  let ps = [(i, UV.force pixels) | i <- [0 .. 255]] -- UV.force hands each task a fresh copy
  let findPartitionPixels (i, p) = {-# SCC "findIndices" #-} UV.findIndices (== i) p
      partitionPixelLists :: [UV.Vector Int]
      partitionPixelLists = {-# SCC "partitionPixelLists" #-} findPartitionPixels <$> ps `using` parList rdeepseq
  pixelLists <- evaluate (force partitionPixelLists)
  mapM_ (print . UV.length) pixelLists
  putStrLn "done"

3 Likes

I think the first thing to think about in shared-memory parallelism is whether you’re limited by CPU instructions or by memory (caches). What are your cache sizes? On Linux, for instance, run lscpu; perf stat -d <command> is also useful. In general, if you just have a lot of cores fighting over your LLC (last-level cache), then more cores won’t help and may hurt. I suggest you try your experiment on smaller input file sizes; I think this will be quick and enlightening. Hope this helps.

P.S. Another way to think about it is long ago, you’d be worried about RAM size and swapping. If you run a lot of jobs simultaneously and they all need your whole RAM, they’ll be slower than sequentially because of all the swapping.

As @mchav suggests, you can often speed things up by working in chunks, so more of your working set fits in the LLC.
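A minimal illustration of that chunking idea with parListChunk (mapChunked is a hypothetical name): each spark evaluates a block of 32 results rather than a single one, so a spark’s working set stays contiguous.

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- parListChunk 32 sparks one task per 32-element block of the result
-- list, instead of one task per element as parList would.
mapChunked :: (Int -> Int) -> [Int] -> [Int]
mapChunked f xs = map f xs `using` parListChunk 32 rdeepseq
```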

3 Likes

Another way to improve performance is to use bitstrings. Instead of copying the whole input data, we just make a tightly packed bitstring; each bit indicating if the corresponding byte in the input matches the given partition number. Here’s an implementation that uses bitstrings:

{-# LANGUAGE ImportQualifiedPost, MagicHash, BangPatterns #-}

module Main (main) where

import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Data.ByteString qualified as BS
import Data.ByteString.Internal qualified as BSI
import Data.Vector.Storable qualified as V
import Data.Word
import Data.Bits
import GHC.Exts (eqWord8#, int2Word#, wordToWord64#)
import GHC.Word (Word8 (W8#), Word64(W64#))
import GHC.Base (build)

(.>.) :: Bits a => a -> Int -> a
(.>.) = unsafeShiftR

(.<.) :: Bits a => a -> Int -> a
(.<.) = unsafeShiftL

type PartitionIndex = Word8

type PartitionVector = V.Vector Word64 -- 8 PartitionIndex bytes packed per Word64

readPartitionImage :: FilePath -> IO PartitionVector
readPartitionImage fp = do
  contents <- BS.readFile fp
  pure (byteStringAsVector64 contents)

-- stolen from https://hackage.haskell.org/package/hw-prim-0.6.3.2/docs/src/HaskellWorks.Data.Vector.AsVector64.html
byteStringAsVector64 :: BS.ByteString -> V.Vector Word64
byteStringAsVector64 bs = if vLen * 8 == BS.length bs
    then case BSI.toForeignPtr bs of
      (fptr, off, len) -> V.unsafeCast (V.unsafeFromForeignPtr fptr off len)
    else case BSI.toForeignPtr (bs <> BS.replicate (vLen * 8 - BS.length bs) 0) of
      (fptr, off, len) -> V.unsafeCast (V.unsafeFromForeignPtr fptr off len)
    where vLen = (BS.length bs + 7) `div` 8

-- branchless comparison
(===) :: Word8 -> Word8 -> Word64
(W8# x) === (W8# y) = W64# (wordToWord64# (int2Word# (eqWord8# x y)))

elemBits :: Word8 -> V.Vector Word64 -> V.Vector Word64
elemBits x v = V.generate ((V.length v + 7) .>. 3) $ \i ->
  let
    go :: Int -> Word64 -> Word64
    go j s
      -- | traceShow (i, j) False = undefined
      | j < 8 && i * 8 + j < V.length v =
        let
          y = v V.! (i * 8 + j)
          y0 = fromIntegral y
          y1 = fromIntegral (y .>.  8)
          y2 = fromIntegral (y .>. 16)
          y3 = fromIntegral (y .>. 24)
          y4 = fromIntegral (y .>. 32)
          y5 = fromIntegral (y .>. 40)
          y6 = fromIntegral (y .>. 48)
          y7 = fromIntegral (y .>. 56)
        in
          go (j + 1) $ (s .>. 8)
            .|. ((y0 === x) .<. 56)
            .|. ((y1 === x) .<. 57)
            .|. ((y2 === x) .<. 58)
            .|. ((y3 === x) .<. 59)
            .|. ((y4 === x) .<. 60)
            .|. ((y5 === x) .<. 61)
            .|. ((y6 === x) .<. 62)
            .|. ((y7 === x) .<. 63)
      | i * 8 + j == V.length v = s .>. ((8 - j) * 8)
      | otherwise = s
  in
    go 0 0

indicesWord64 :: Int -> Word64 -> [Int]
indicesWord64 s0 x0 = build $ \cons nil ->
  let go _ 0 = nil
      go s x
        -- we need this case because (.>. 64) is undefined
        | z == 63 = cons (s + z) nil
        | otherwise = cons (s + z) (go (s + z + 1) (x .>. (z + 1)))
        where
          z = countTrailingZeros x
  in go s0 x0
{-# INLINE indicesWord64 #-}

imap :: (Int -> a -> b) -> [a] -> [b]
imap f xs = build $ \cons nil -> foldr (\x go i -> f i x `cons` (go $! i + 1)) (\_ -> nil) xs 0
{-# INLINE imap #-}

indicesVectorWord64 :: V.Vector Word64 -> V.Vector Int
-- this all gets fused away to a single pass:
indicesVectorWord64 = V.fromList . concat . imap (\i -> indicesWord64 (i * 64)) . V.toList

main :: IO ()
main = do
  putStrLn "start"
  pixels <- readPartitionImage "partitions.bin"
  let findPartitionPixels partitionNumber = indicesVectorWord64 (elemBits partitionNumber pixels)
      partitionIndices = [1 .. 18]
      partitionPixelLists :: [V.Vector Int]
      partitionPixelLists = findPartitionPixels <$> partitionIndices
  _ <- evaluate (force partitionPixelLists)
  putStrLn "done"

This only takes 100 ms on my machine.

You could make this even faster by using SIMD, but this already seems to be pretty fast.

1 Like