Help with optimization/profiling

I have a Haskell program that generates an .svg file with an embedded (base64-encoded) png in it.

For a 5000x5000 image it takes about 2s to generate the svg. I’m not sure if I can get it any faster, so I created a minimal test case and ran it with profiling on.

speedscope gave the following results:

So clearly my imageBytes function is the main performance culprit. The left half of it is just JuicyPixels’ png generation at work. The right half is time spent in the function itself (I think). The function itself is very simple though:

imageBytes :: SizedImage -> (Float -> Pixel8) -> BSL.ByteString
imageBytes img convertPixel = do
    let image =
            generateImage
                (\x y -> convertPixel $ img.entries SV.! ((img.height - y - 1) * img.width + x))
                img.width
                img.height
    case encodePalettedPng viridisPalette image of
        Right v -> v
        Left e -> error e

(convertPixel takes an entry from img.entries :: Vector Float and normalizes it using the min/max pixel value in the image).

Now I am stuck a bit. How do I go deeper into profiling? In other words, what to do after speedscope exhausted its usefulness?

6 Likes

I’d start looking at Core. Clearly the generateImage and encodePallettedPng functions are inlined. Perhaps they do more work than necessary.

I’d also check if the allocations make sense to you. Perhaps some intermediate structures are allocated which can be eliminated.

1 Like

Okay, I don’t know what that means. Where do I start looking and learning?

1 Like

Are you compiling with late cost centres ? That may help.

2 Likes

Interesting! This results in this (pretty unhelpful, at last to me) flame graph:

The replicateM part is loading the input data, not the actual performance.

Not sure what to make of it.

1 Like

Could you share the minimal example? I’d be curious to take a look.

In speedscope you can also look at allocations (top center of the screen, change to allocations). What does that look like? Does that function allocate a lot?

1 Like

I’ve linked the somewhat minimal example in the first post. Sorry, it might have been a bit too subtle. :slightly_smiling_face:

1 Like

Can you provide the source for encodePalettedPng? I’m assuming that’s where the replicateM is; replicateM on Either might do a lot of allocations continually checking Left

To look at core I like to put this at the top of the file:

{-# OPTIONS_GHC -ddump-simpl -dsuppress-all -dno-suppress-type-signatures -dno-typeable-binds -ddump-to-file #-}

Afterwards I do a find . -name "*.dump-simpl" which shows that on my machine it generated the dump at this location:

% find . -name "*.dump-simpl"
./dist-newstyle/build/aarch64-osx/ghc-9.10.3/hs-plugui-1.0.1/x/main/build/main/main-tmp/app/Main.dump-simpl

That file contains the Core dump which is the intermediate language that GHC has applied most of its optimizations to.

I do notice that it doesn’t give much extra information in this case. There is quite some aggressive inlining which you can work around by adding {-# NOINLINE #-} to imageBytes, generateImage and encodePalettedPng, which doesn’t really change the running time.

In this case I think there’s not much to win from looking at imageBytes because it is basically just calling two library functions.

So optimizing this further would basically mean optimizing the underlying libraries.

2 Likes

You can speed it up a lot by loading the data more directly without using binary:

{-# INLINE unconsW8 #-}
unconsW8 :: BSL.ByteString -> Maybe (Word8, BSL.ByteString)
unconsW8 = BSL.uncons

{-# INLINE unconsW16 #-}
unconsW16 :: BSL.ByteString -> Maybe (Word16, BSL.ByteString)
unconsW16 bs = do
  (x, bs') <- unconsW8 bs
  (y, bs'') <- unconsW8 bs'
  pure (fromIntegral x .|. (fromIntegral y `shiftL` 8), bs'')

{-# INLINE unconsW32 #-}
unconsW32 :: BSL.ByteString -> Maybe (Word32, BSL.ByteString)
unconsW32 bs = do
  (x, bs') <- unconsW16 bs
  (y, bs'') <- unconsW16 bs'
  pure (fromIntegral x .|. (fromIntegral y `shiftL` 16), bs'')

{-# INLINE unconsFloat #-}
unconsFloat :: BSL.ByteString -> Maybe (Float, BSL.ByteString)
unconsFloat bs = case unconsW32 bs of
    Just (x, xs) -> Just (castWord32ToFloat x, xs)
    Nothing -> Nothing

main = do
    pixelsRaw <- BSL.readFile "matrix.bin"
    let pixels = SV.unfoldrExactN (5000*5000) (\s -> fromMaybe (0, s) (unconsFloat s)) pixelsRaw
    BSL.writeFile "/tmp/twotime-bsl.svg" (pageSvg pixels)

That also removes all GC overhead:

before:
% cabal run exes -- +RTS -s
   4,371,906,064 bytes allocated in the heap
   5,643,847,856 bytes copied during GC
   1,000,154,664 bytes maximum residency (12 sample(s))
     170,273,240 bytes maximum slop
            2360 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       882 colls,     0 par    0.884s   0.889s     0.0010s    0.0026s
  Gen  1        12 colls,     0 par    0.716s   0.891s     0.0743s    0.3514s

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time    1.266s  (  1.120s elapsed)
  GC      time    1.600s  (  1.780s elapsed)
  EXIT    time    0.032s  (  0.010s elapsed)
  Total   time    2.899s  (  2.911s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,453,990,751 bytes per MUT second

  Productivity  43.7% of total user, 38.5% of total elapsed
after:
% cabal -O2 run exes -- +RTS -s
   3,001,347,728 bytes allocated in the heap
         688,096 bytes copied during GC
     100,053,152 bytes maximum residency (4 sample(s))
         667,488 bytes maximum slop
             237 MiB total memory in use (1 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       602 colls,     0 par    0.001s   0.002s     0.0000s    0.0000s
  Gen  1         4 colls,     0 par    0.001s   0.004s     0.0009s    0.0023s

  INIT    time    0.002s  (  0.002s elapsed)
  MUT     time    0.835s  (  0.832s elapsed)
  GC      time    0.002s  (  0.005s elapsed)
  EXIT    time    0.001s  (  0.006s elapsed)
  Total   time    0.839s  (  0.845s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,595,707,368 bytes per MUT second

  Productivity  99.5% of total user, 98.5% of total elapsed

Edit: SV.unfoldrExactN is much better than SV.replicateM

1 Like

It’s always binary’s fault it seems:

Sorry again for creating a bad benchmark or at the very least, not describing it properly. The input decoding is not something I am concerned with. It merely sets the stage for the svg creation. So using binary for that is fine.

No worries, I did see that mentioned in the initial post. I do like that we have a bit cleaner baseline now.

I’ve now found a way to speed up imageBytes a bit. A problem with the current formulation is your use of a function argument. Your imageBytes (and some parent functions) takes a function as argument which tells it how to normalize floats to pixel values. This function gets called for every pixel, but it only takes in a Float and produces a Pixel8. There is quite some overhead for allocating boxes for these values, so unboxing them would be beneficial. You can do that manually like this:

{-# NOINLINE imageBytes #-}
imageBytes :: SizedImage -> (Float# -> Pixel8#) -> BSL.ByteString
imageBytes img convertPixel# = do
    let convertPixel (F# x) = W8# (convertPixel# x)
        image =
            generateImage
                (\x y -> convertPixel $ img.entries SV.! ((img.height - y - 1) * img.width + x))
                img.width
                img.height
    case encodePalettedPng viridisPalette image of
        Right v -> v
        Left e -> error e

I also modified some of the functions that call imageBytes, but only in a quite straightforward way. This cuts down the total number of allocations quite a bit and the running time is cut by about 100ms:

% cabal -O2 run exes -- +RTS -s 
     601,347,672 bytes allocated in the heap
         607,824 bytes copied during GC
     100,037,912 bytes maximum residency (4 sample(s))
         670,440 bytes maximum slop
             237 MiB total memory in use (1 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        44 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         4 colls,     0 par    0.001s   0.004s     0.0009s    0.0022s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.705s  (  0.703s elapsed)
  GC      time    0.001s  (  0.004s elapsed)
  EXIT    time    0.001s  (  0.009s elapsed)
  Total   time    0.709s  (  0.718s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    852,594,474 bytes per MUT second

  Productivity  99.5% of total user, 98.0% of total elapsed

Another way to do this is to inline all the functions that take the function argument until the point where you provide the function argument. That is what JuicyPixels does; generateImage and related functions all have INLINE pragmas. In your case that means you would need to add INLINE pragmas for imageBytes and generateImageSvg. But the latter is quite a large function, so it is a bit silly to inline it just for this purpose.

4 Likes

Ahhh interesting, so I get to learn about unboxed types as well, very instructive. Thank you!

2 Likes