Optimiser performance problems

I would try adding at least these bang patterns:

    adaMax' !th !t !m !u
            m'  = V.map (\(!m,!g)->beta1 * m + (1-beta1) * g) (V.zip m g)
            u'  = V.map (\(!u,!g)->max (beta2*u) (abs g)) (V.zip u g)
            th' = V.map (\(!th,!m,!u)->th - (alpha / (1-beta1^(t+1))) * m/u) $ V.zip3 th m' u'

(Edit: Actually I’m not sure how much good those bang patterns would do. It seems like the all (\x-> abs x <1e-8) g check should force all the vectors each iteration.)

I don’t think GHC is smart enough to recognize that this means it doesn’t need to allocate thunks, but it should at least prevent the memory leak by forcing the thunks each iteration.

You could also try switching to vectors from the massiv library, those are strict in their elements.

And, by the way, you can also use V.zipWith and V.zipWith3 instead of a separate V.map and V.zip. Although I don’t know if that will improve performance, because the vector library should do a good job at fusion operations like this.

Small change, but instead of map f . zip a v, you can imap (\i x -> ...) v and use the index i to select into the other vector, because it seems to me that zip copies both vectors (and zip3 all three) so this might be a performance boost?

1 Like

Could you share a minimal reproducer? I can’t reproduce a space leak with the code you shared at Optimiser performance problems.

@tomjaguarpaw , it isn’t the grad function. Here is some evidence. Both test1 and test2 exhibit the “memory leak”. So its something the adaMax code is doing, but I totally can’t see what the problem is. @jaror I’ve implemented all the zipWith things too, much more elegant, but unfortunately make no difference to performance or memory usage.

import Numeric.AD(grad)

-- minimal adaMax function
adaMax f' th t m u
  | all (\x->abs x<1e-7) g = th'
  | otherwise = adaMax f' th' (t+1) m' u'
  where g   = f' th
        m'  = zipWith (\m g->beta1 * m + (1-beta1) * g) m g
        u'  = zipWith (\u g->max (beta2*u) (abs g)) u g
        th' = zipWith3 (\th m u->th - (alpha / (1-beta1^(t+1))) * m/u) th m' u'
        (alpha,beta1,beta2) = (0.002, 0.9, 0.999)

fTest  = grad (\[x,y,z] -> (x-1)^2 + (y-2)^2 + (z-3)^2)

fTest2 [x,y,z] = [2*(x-1),2*(y-2),2*(z-3)] 

-- uses grad
test1 = adaMax fTest [100000,100000,100000] 0 [0,0,0] [0,0,0]

-- uses hardcoded gradient.
test2 = adaMax fTest2 [100000,100000,100000] 0 [0,0,0] [0,0,0]

You can observe the leak by running top in linux and watching the ghci process consume ever more memory. Same problem when compiled. Shebangs on everything didn’t make a difference either.

I can confirm the memory leak with that list based version. Replacing all lists with vectors fixes the leak for me:

{- cabal:
build-depends: base, vector, ad
import Numeric.AD(grad)
import qualified Data.Vector as V
import Data.Vector (Vector)

-- minimal adaMax function
adaMax :: (Vector Double -> Vector Double) -> Vector Double -> Int -> Vector Double -> Vector Double -> Vector Double
adaMax f' th t m u
  | V.all (\x->abs x<1e-7) g = th'
  | otherwise = adaMax f' th' (t+1) m' u'
  where g   = f' th
        m'  = V.zipWith (\m g->beta1 * m + (1-beta1) * g) m g
        u'  = V.zipWith (\u g->max (beta2*u) (abs g)) u g
        th' = V.zipWith3 (\th m u->th - (alpha / (1-beta1^(t+1))) * m/u) th m' u'
        (alpha,beta1,beta2) = (0.002, 0.9, 0.999)

fTest :: Vector Double -> Vector Double
fTest  = grad (V.sum . V.imap (\i x -> (x-fromIntegral i-1)^2))

fTest2 :: Vector Double -> Vector Double
fTest2 = V.imap (\i x -> 2*(x-fromIntegral i-1))

-- uses grad
test1 :: Vector Double
test1 = adaMax fTest (V.fromList [100000,100000,100000]) 0 (V.fromList [0,0,0]) (V.fromList [0,0,0])

-- uses hardcoded gradient.
test2 :: Vector Double
test2 = adaMax fTest2 (V.fromList [100000,100000,100000]) 0 (V.fromList [0,0,0]) (V.fromList [0,0,0])

main :: IO ()
main = print test1

Yeah, you’re building a tower of thunks there (in emiruz’s post, not jaror’s version). Each iteration creates a list that doesn’t get forced all the way through, so it holds on to the lists of the previous iteration, and so on down the line.

You might think that your lists are being fully evaluated because of the all condition, but all will short circuit; if your x-coordinate is too big, the y- and z- coordinates won’t be checked, and so they remain as thunks.

I bet if you bang up x, y, and z in fTest2, your memory problems go away. For fTest I’m not so sure; you might have to compose that with force.

shebangs on x,y,z dont make a difference unfortunately.

Thank you for this analysis. Try this:

fTest3 v = let x = v V.! 0
               y = v V.! 1
               z = v V.! 2 in
             V.fromList [2*(x-1),2*(y-2),2*(z-3)] 

This brings back the memory leak in your version, and I don’t understand why.

This is what my real cost function looks like:

cost v = sqrt $ sum $ map (\(i,j,p)-> (auto p-(v V.! i)*(v V.! (m+j)))^2) ps'

Its then presumably this indexing into the vector that causes the problem (since the real adaMax is identical to yours), but I totally can’t see why. Do you see some way I can modify the cost function to fit the non-leaky format?

You can fix that leak by adding bangs:

f3 :: Vector Double -> Vector Double
f3 v = let !x = v V.! 0
           !y = v V.! 1
           !z = v V.! 2 in
       V.fromList [2*(x-1),2*(y-2),2*(z-3)] 
1 Like

That fixes the toy example :slight_smile: but not the real code, even after turning ps into a vector and using V.map, V.sum, and adding even more bangs. Is there some way to more systematically determine what is leaking other than eye-balling?

Did you try forcing your thunks? Bangs only take care of the top level. Or you could try unboxed vectors, but force is the make-sure-it-works option.

An eye-watering number of methods :slight_smile:

Haskell Optimization Handbook is still a work in progress, but already has lots of good info.


Ok I’ve “fixed” it. If I add the following anywhere it results in th' getting evaluated on each recursion and the memory not growing.

(trace $ show $ V.maximum th')

Now is there some way to do that @jaror without causing work?

<tap> <tap> Is this thing on?

force. force is the way to do that.

1 Like

Hmm…I’m guessing this one:

force :: Vector a → Vector a

O(n) Yield the argument, but force it not to retain any extra memory, possibly by copying it. [...]

…but be careful when using functions like these more generally - if ⊥ is lurking anywhere inside the given argument, it usually stops the program e.g. no output, being unresponsive to new input, a runtime error, etc:


Thank you, yes others had mentioned force. I’m just surprised that I have to do this sort of thing given the simplicity of the code. I had hoped there would be a more out-of-the-box tweak.

PS: force from Data.Vector does not work – I’ll try the deepseq one.

EDIT: deepseq worked as advertised. Thank you all.

OK, I now understand what’s going on.

Right, and the reason the problem doesn’t appear on your original example but it does in a higher-dimensional version is because of @rhendric’s observation:

The space leak occurs if the original example is changed from V.fromList [-2000] to V.fromList [-2000, -2000] (so the second coordinate can leak), plus abs x < 1e-8 to abs x < 0 (so that it runs long enough to see the leak).

Yes, make invalid laziness unrepresentable. I disagree with others who are saying that there are too many methods, that you should sprinkle bang patterns, or that you should use force. There is no need to read a book, there is no need to apply the blunt hammer of deepseq, you just need to choose the correct data structure. Here is your original code, fixed by simply choosing the correct data structure:

{-# LANGUAGE GHC2021 #-}

import Numeric.AD (grad)
import qualified Data.Vector as V
import qualified Data.Vector.Generic as G
import qualified Data.Strict.Vector as SV

-- With lazy V.Vector it leaks, with strict SV.Vector it does not
-- leak.
type AdaVector = SV.Vector
-- type AdaVector = V.Vector

adaMax f' start alpha beta1 beta2 = adaMax' start 0 zeros zeros
    zeros = G.fromList (replicate n 0)
    n = length start
    adaMax' th t m u
      | all (\x-> abs x < 0) g = th'
      | otherwise    = adaMax' th' (t+1) m' u'
      where g   = f' th
            m'  = G.map (\(m,g)->beta1 * m + (1-beta1) * g) (G.zip m g)
            u'  = G.map (\(u,g)->max (beta2*u) (abs g)) (G.zip u g)
            th' = G.map (\(th,m,u)->th - (alpha / (1-beta1^(t+1))) * m/u) $ G.zip3 th m' u'

fTest  = grad (\xs -> (G.head xs)^2 - (G.head xs))

test1 = adaMax fTest (G.fromList @AdaVector [-2000, -2000]) 0.002 0.9 0.999

main = test1 `seq` pure ()

All I have done is generalized your code to the Data.Vector.Generic interface so we can compare what happens when we use Data.Vector to what happens when we use Data.Strict.Vector (from strict-containers). With Data.Vector.Strict there is no space leak. No need for !, seq, force, or any form of careful analysis. That is the benefit of making invalid states unrepresentable through a powerful type system!

In practice you probably wouldn’t even bother with this generalization, you’d just write your code against Data.Strict.Vector directly.

Why is this the correct way to write the code? Because in your particular use case there is no purpose to being able to store unevaluated thunks as vector elements. Therefore, let’s statically ensure that the vector cannot possibly leak space by choosing the correct vector data type. (I would actually guess that in 99% of practical use cases there’s no need to store unevaluated thunks either.)


Thank you – this is exactly what I had hoped for: a “no, use X data structure instead” solution. I was feeling quite bad about having to use a function called “force”. Sort of like saying, no it doesn’t work so just bash it into shape.


Add strict vector by Shimuuar · Pull Request #488 · haskell/vector · GitHub :partying_face: the future has fewer footguns (you won’t have to know already that strict-containers exists, just notice from the module list of vector that there is one named Strict)


Is there some way to more systematically determine what is leaking other than eye-balling?

Yes, perhaps take a look at 8. Profiling — Glasgow Haskell Compiler 9.8.2 User's Guide, the memory profiling section.

Here’s a writeup about memory profiling A First Look at Info Table Profiling - Well-Typed: The Haskell Consultants, focusing on one of the profiling methods (info table).

1 Like