Optimiser performance problems

emiruz · February 27, 2024, 4:04pm

I’ve written the infinity norm version of the Adam gradient descent optimiser below. It should work with most multivariate convex problems. It seems to work ok except it is epically slow. Optimising x^2 - x with a bad starting point (e.g. -2000) as shown below in the test1 function takes about 9 seconds.

Is there something I can do to make this orders of magnitude faster? I don’t see any straightforward, SIMD options, but I could be wrong? Perhaps a more Haskell friendly way to restate the computations?

import Numeric.AD(grad)
import qualified Data.Vector as V

adaMax f' start alpha beta1 beta2 = adaMax' start 0 zeros zeros
  where
    zeros = V.fromList (replicate n 0)
    n = length start
    adaMax' th t m u
      | all (\x-> abs x < 1e-8) g = th'
      | otherwise    = adaMax' th' (t+1) m' u'
      where g   = f' th
            m'  = V.map (\(m,g)->beta1 * m + (1-beta1) * g) (V.zip m g)
            u'  = V.map (\(u,g)->max (beta2*u) (abs g)) (V.zip u g)
            th' = V.map (\(th,m,u)->th - (alpha / (1-beta1^(t+1))) * m/u) $ V.zip3 th m' u'

fTest  = grad (\xs -> (V.head xs)^2 - (V.head xs))

test1 = adaMax fTest (V.fromList [-2000]) 0.002 0.9 0.999

I’d note as well that using the explicitly calculated gradient (2x - 1) is about 50% faster, which is a bit surprising given how tiny the function is.

tomjaguarpaw · February 27, 2024, 4:30pm

My paper, “Provably Correct, Asymptotically Efficient, Higher-Order Reverse-Mode Automatic Differentiation” explains why this is slow, in section 8.3. Basically, the vector storing the gradients has no hope of being efficiently packed, let alone vectorised. There is some work on improving that over at GitHub - Mikolaj/horde-ad: Higher Order Reverse Derivatives Efficiently - Automatic Differentiation library based on the paper "Provably correct, asymptotically efficient, higher-order reverse-mode automatic differentiation", but I don’t know what their state of progress is.

emiruz · February 27, 2024, 5:22pm

In my 50k parameter problem, that code is barely managing 3 evaluations a second. I had hoped that I had made some schoolboy error somewhere setting it up.

emiruz · February 27, 2024, 5:49pm

@tomjaguarpaw, otherwise does adamax (above) look correctly coded to you? Is there anything I could do to make it go faster despite the slow gradient calculations

tomjaguarpaw · February 27, 2024, 6:11pm

Unfortunately there is a problem inherent to the implementation of ad.

I’d recommend profiling to see where the problem really is, but I suspect it is indeed in the gradient calculation. You could try using an unboxed vector in adaMax, but unfortunately you’d have to convert to a boxed one to pass to each call to fTest, and also convert the result back, so it might not end up being a win.

chrisdone · February 27, 2024, 6:15pm

I’m not in this field but your paper appears to describe an imperative algorithm consisting of O(1) updates. Yours is an algorithm that repeatedly allocates new vectors, and might be O(n) or worse in the wrong places. Just sanity checking, in case it’s helpful.

sclv · February 27, 2024, 6:54pm

you could experiment with different modes in the ad package like Kahn to see if that improves performance as well…

emiruz · February 27, 2024, 7:20pm

Everything in the algorithm is a vector, and the whole of each vector gets updated every time: they abuse notation a bit so its not clear from the code snippet.

Other than slow gradient calculation, the issue is that there is no SIMD, so all those map statements are very slow compared to a SIMD implementation. Alas, I don’t see any simple route to SIMD in Haskell.

Shimuuar · February 27, 2024, 7:25pm

SIMD is red herring here. Data.Vector.Vector is boxed vector this alone gives you 3× allocation and pointer chases. But AFAIR ad can’t work with unboxed vectors

emiruz · February 27, 2024, 8:08pm

There’s another weird thing, for my more complex cost with 50k+ parameters, the adaMax function looks as if it leaks memory. Even though its doing the same update on every iteration and not intentionally accumulating anything, it just keeps eating memory until eventually its all gone.

Shimuuar · February 27, 2024, 8:25pm

That shouldn’t be very surprising either. Boxed vector is not only boxed it’s lazy in elements so one could easily get space leak.

emiruz · February 27, 2024, 9:59pm

I guess the bang annotations ought to help with that?

jaror · February 27, 2024, 10:39pm

I would try adding at least these bang patterns:

    adaMax' !th !t !m !u
     ...
            m'  = V.map (\(!m,!g)->beta1 * m + (1-beta1) * g) (V.zip m g)
            u'  = V.map (\(!u,!g)->max (beta2*u) (abs g)) (V.zip u g)
            th' = V.map (\(!th,!m,!u)->th - (alpha / (1-beta1^(t+1))) * m/u) $ V.zip3 th m' u'

(Edit: Actually I’m not sure how much good those bang patterns would do. It seems like the all (\x-> abs x <1e-8) g check should force all the vectors each iteration.)

I don’t think GHC is smart enough to recognize that this means it doesn’t need to allocate thunks, but it should at least prevent the memory leak by forcing the thunks each iteration.

You could also try switching to vectors from the massiv library, those are strict in their elements.

And, by the way, you can also use V.zipWith and V.zipWith3 instead of a separate V.map and V.zip. Although I don’t know if that will improve performance, because the vector library should do a good job at fusion operations like this.

Vlix · February 28, 2024, 1:32am

Small change, but instead of map f . zip a v, you can imap (\i x -> ...) v and use the index i to select into the other vector, because it seems to me that zip copies both vectors (and zip3 all three) so this might be a performance boost?

tomjaguarpaw · February 28, 2024, 7:57am

Could you share a minimal reproducer? I can’t reproduce a space leak with the code you shared at Optimiser performance problems.

emiruz · February 28, 2024, 9:11am

@tomjaguarpaw , it isn’t the grad function. Here is some evidence. Both test1 and test2 exhibit the “memory leak”. So its something the adaMax code is doing, but I totally can’t see what the problem is. @jaror I’ve implemented all the zipWith things too, much more elegant, but unfortunately make no difference to performance or memory usage.

import Numeric.AD(grad)

-- minimal adaMax function
adaMax f' th t m u
  | all (\x->abs x<1e-7) g = th'
  | otherwise = adaMax f' th' (t+1) m' u'
  where g   = f' th
        m'  = zipWith (\m g->beta1 * m + (1-beta1) * g) m g
        u'  = zipWith (\u g->max (beta2*u) (abs g)) u g
        th' = zipWith3 (\th m u->th - (alpha / (1-beta1^(t+1))) * m/u) th m' u'
        (alpha,beta1,beta2) = (0.002, 0.9, 0.999)

fTest  = grad (\[x,y,z] -> (x-1)^2 + (y-2)^2 + (z-3)^2)

fTest2 [x,y,z] = [2*(x-1),2*(y-2),2*(z-3)] 

-- uses grad
test1 = adaMax fTest [100000,100000,100000] 0 [0,0,0] [0,0,0]

-- uses hardcoded gradient.
test2 = adaMax fTest2 [100000,100000,100000] 0 [0,0,0] [0,0,0]

You can observe the leak by running top in linux and watching the ghci process consume ever more memory. Same problem when compiled. Shebangs on everything didn’t make a difference either.

jaror · February 28, 2024, 9:34am

I can confirm the memory leak with that list based version. Replacing all lists with vectors fixes the leak for me:

{- cabal:
build-depends: base, vector, ad
-}
import Numeric.AD(grad)
import qualified Data.Vector as V
import Data.Vector (Vector)

-- minimal adaMax function
adaMax :: (Vector Double -> Vector Double) -> Vector Double -> Int -> Vector Double -> Vector Double -> Vector Double
adaMax f' th t m u
  | V.all (\x->abs x<1e-7) g = th'
  | otherwise = adaMax f' th' (t+1) m' u'
  where g   = f' th
        m'  = V.zipWith (\m g->beta1 * m + (1-beta1) * g) m g
        u'  = V.zipWith (\u g->max (beta2*u) (abs g)) u g
        th' = V.zipWith3 (\th m u->th - (alpha / (1-beta1^(t+1))) * m/u) th m' u'
        (alpha,beta1,beta2) = (0.002, 0.9, 0.999)

fTest :: Vector Double -> Vector Double
fTest  = grad (V.sum . V.imap (\i x -> (x-fromIntegral i-1)^2))

fTest2 :: Vector Double -> Vector Double
fTest2 = V.imap (\i x -> 2*(x-fromIntegral i-1))

-- uses grad
test1 :: Vector Double
test1 = adaMax fTest (V.fromList [100000,100000,100000]) 0 (V.fromList [0,0,0]) (V.fromList [0,0,0])

-- uses hardcoded gradient.
test2 :: Vector Double
test2 = adaMax fTest2 (V.fromList [100000,100000,100000]) 0 (V.fromList [0,0,0]) (V.fromList [0,0,0])

main :: IO ()
main = print test1

rhendric · February 28, 2024, 9:36am

Yeah, you’re building a tower of thunks there (in emiruz’s post, not jaror’s version). Each iteration creates a list that doesn’t get forced all the way through, so it holds on to the lists of the previous iteration, and so on down the line.

You might think that your lists are being fully evaluated because of the all condition, but all will short circuit; if your x-coordinate is too big, the y- and z- coordinates won’t be checked, and so they remain as thunks.

I bet if you bang up x, y, and z in fTest2, your memory problems go away. For fTest I’m not so sure; you might have to compose that with force.

emiruz · February 28, 2024, 9:43am

shebangs on x,y,z dont make a difference unfortunately.

emiruz · February 28, 2024, 9:50am

Thank you for this analysis. Try this:

fTest3 v = let x = v V.! 0
               y = v V.! 1
               z = v V.! 2 in
             V.fromList [2*(x-1),2*(y-2),2*(z-3)]

This brings back the memory leak in your version, and I don’t understand why.

This is what my real cost function looks like:

cost v = sqrt $ sum $ map (\(i,j,p)-> (auto p-(v V.! i)*(v V.! (m+j)))^2) ps'

Its then presumably this indexing into the vector that causes the problem (since the real adaMax is identical to yours), but I totally can’t see why. Do you see some way I can modify the cost function to fit the non-leaky format?