I have a Haskell program that generates an .svg file with an embedded (base64-encoded) png in it.
For a 5000x5000 image it takes about 2s to generate the svg. I’m not sure if I can get it any faster, so I created a minimal test case and ran it with profiling on.
So clearly my imageBytes function is the main performance culprit. The left half of it is just JuicyPixels’ png generation at work. The right half is time spent in the function itself (I think). The function itself is very simple though:
imageBytes :: SizedImage -> (Float -> Pixel8) -> BSL.ByteString
imageBytes img convertPixel = do
let image =
generateImage
(\x y -> convertPixel $ img.entries SV.! ((img.height - y - 1) * img.width + x))
img.width
img.height
case encodePalettedPng viridisPalette image of
Right v -> v
Left e -> error e
(convertPixel takes an entry from img.entries :: Vector Float and normalizes it using the min/max pixel value in the image).
Now I am stuck a bit. How do I go deeper into profiling? In other words, what to do after speedscope exhausted its usefulness?
Could you share the minimal example? I’d be curious to take a look.
In speedscope you can also look at allocations (top center of the screen, change to allocations). What does that look like? Does that function allocate a lot?
Can you provide the source for encodePalettedPng? I’m assuming that’s where the replicateM is; replicateM on Either might do a lot of allocations continually checking Left
That file contains the Core dump which is the intermediate language that GHC has applied most of its optimizations to.
I do notice that it doesn’t give much extra information in this case. There is quite some aggressive inlining which you can work around by adding {-# NOINLINE #-} to imageBytes, generateImage and encodePalettedPng, which doesn’t really change the running time.
In this case I think there’s not much to win from looking at imageBytes because it is basically just calling two library functions.
So optimizing this further would basically mean optimizing the underlying libraries.
You can speed it up a lot by loading the data more directly without using binary:
{-# INLINE unconsW8 #-}
unconsW8 :: BSL.ByteString -> Maybe (Word8, BSL.ByteString)
unconsW8 = BSL.uncons
{-# INLINE unconsW16 #-}
unconsW16 :: BSL.ByteString -> Maybe (Word16, BSL.ByteString)
unconsW16 bs = do
(x, bs') <- unconsW8 bs
(y, bs'') <- unconsW8 bs'
pure (fromIntegral x .|. (fromIntegral y `shiftL` 8), bs'')
{-# INLINE unconsW32 #-}
unconsW32 :: BSL.ByteString -> Maybe (Word32, BSL.ByteString)
unconsW32 bs = do
(x, bs') <- unconsW16 bs
(y, bs'') <- unconsW16 bs'
pure (fromIntegral x .|. (fromIntegral y `shiftL` 16), bs'')
{-# INLINE unconsFloat #-}
unconsFloat :: BSL.ByteString -> Maybe (Float, BSL.ByteString)
unconsFloat bs = case unconsW32 bs of
Just (x, xs) -> Just (castWord32ToFloat x, xs)
Nothing -> Nothing
main = do
pixelsRaw <- BSL.readFile "matrix.bin"
let pixels = SV.unfoldrExactN (5000*5000) (\s -> fromMaybe (0, s) (unconsFloat s)) pixelsRaw
BSL.writeFile "/tmp/twotime-bsl.svg" (pageSvg pixels)
That also removes all GC overhead:
before:
% cabal run exes -- +RTS -s
4,371,906,064 bytes allocated in the heap
5,643,847,856 bytes copied during GC
1,000,154,664 bytes maximum residency (12 sample(s))
170,273,240 bytes maximum slop
2360 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 882 colls, 0 par 0.884s 0.889s 0.0010s 0.0026s
Gen 1 12 colls, 0 par 0.716s 0.891s 0.0743s 0.3514s
INIT time 0.002s ( 0.002s elapsed)
MUT time 1.266s ( 1.120s elapsed)
GC time 1.600s ( 1.780s elapsed)
EXIT time 0.032s ( 0.010s elapsed)
Total time 2.899s ( 2.911s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,453,990,751 bytes per MUT second
Productivity 43.7% of total user, 38.5% of total elapsed
after:
% cabal -O2 run exes -- +RTS -s
3,001,347,728 bytes allocated in the heap
688,096 bytes copied during GC
100,053,152 bytes maximum residency (4 sample(s))
667,488 bytes maximum slop
237 MiB total memory in use (1 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 602 colls, 0 par 0.001s 0.002s 0.0000s 0.0000s
Gen 1 4 colls, 0 par 0.001s 0.004s 0.0009s 0.0023s
INIT time 0.002s ( 0.002s elapsed)
MUT time 0.835s ( 0.832s elapsed)
GC time 0.002s ( 0.005s elapsed)
EXIT time 0.001s ( 0.006s elapsed)
Total time 0.839s ( 0.845s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,595,707,368 bytes per MUT second
Productivity 99.5% of total user, 98.5% of total elapsed
Edit: SV.unfoldrExactN is much better than SV.replicateM
Sorry again for creating a bad benchmark or at the very least, not describing it properly. The input decoding is not something I am concerned with. It merely sets the stage for the svg creation. So using binary for that is fine.
No worries, I did see that mentioned in the initial post. I do like that we have a bit cleaner baseline now.
I’ve now found a way to speed up imageBytes a bit. A problem with the current formulation is your use of a function argument. Your imageBytes (and some parent functions) takes a function as argument which tells it how to normalize floats to pixel values. This function gets called for every pixel, but it only takes in a Float and produces a Pixel8. There is quite some overhead for allocating boxes for these values, so unboxing them would be beneficial. You can do that manually like this:
{-# NOINLINE imageBytes #-}
imageBytes :: SizedImage -> (Float# -> Pixel8#) -> BSL.ByteString
imageBytes img convertPixel# = do
let convertPixel (F# x) = W8# (convertPixel# x)
image =
generateImage
(\x y -> convertPixel $ img.entries SV.! ((img.height - y - 1) * img.width + x))
img.width
img.height
case encodePalettedPng viridisPalette image of
Right v -> v
Left e -> error e
I also modified some of the functions that call imageBytes, but only in a quite straightforward way. This cuts down the total number of allocations quite a bit and the running time is cut by about 100ms:
% cabal -O2 run exes -- +RTS -s
601,347,672 bytes allocated in the heap
607,824 bytes copied during GC
100,037,912 bytes maximum residency (4 sample(s))
670,440 bytes maximum slop
237 MiB total memory in use (1 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 44 colls, 0 par 0.000s 0.000s 0.0000s 0.0000s
Gen 1 4 colls, 0 par 0.001s 0.004s 0.0009s 0.0022s
INIT time 0.001s ( 0.001s elapsed)
MUT time 0.705s ( 0.703s elapsed)
GC time 0.001s ( 0.004s elapsed)
EXIT time 0.001s ( 0.009s elapsed)
Total time 0.709s ( 0.718s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 852,594,474 bytes per MUT second
Productivity 99.5% of total user, 98.0% of total elapsed
Another way to do this is to inline all the functions that take the function argument until the point where you provide the function argument. That is what JuicyPixels does; generateImage and related functions all have INLINE pragmas. In your case that means you would need to add INLINE pragmas for imageBytes and generateImageSvg. But the latter is quite a large function, so it is a bit silly to inline it just for this purpose.