I’ve written the infinity-norm version of the Adam gradient descent optimiser (AdaMax) below. It should work with most multivariate convex problems. It seems to work OK, except that it is epically slow: optimising x^2 - x from a bad starting point (e.g. -2000), as shown in the test1 function below, takes about 9 seconds.

Is there something I can do to make this orders of magnitude faster? I don’t see any straightforward SIMD options, but I could be wrong. Perhaps there is a more Haskell-friendly way to restate the computations?
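For reference, the per-step update the code computes is just the AdaMax rule, restated here with the names used in the code below:

    m'  = beta1 * m + (1 - beta1) * g
    u'  = max (beta2 * u) (abs g)
    th' = th - (alpha / (1 - beta1^(t+1))) * m' / u'

applied componentwise, iterating until every component of the gradient g is below 1e-8 in absolute value.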
import Numeric.AD (grad)
import qualified Data.Vector as V

-- AdaMax: the infinity-norm variant of Adam.
adaMax f' start alpha beta1 beta2 = adaMax' start 0 zeros zeros
  where
    zeros = V.fromList (replicate n 0)
    n     = length start
    adaMax' th t m u
      | all (\x -> abs x < 1e-8) g = th'              -- stop once every gradient component is ~0
      | otherwise                  = adaMax' th' (t+1) m' u'
      where
        g   = f' th                                                     -- gradient at the current point
        m'  = V.map (\(m,g) -> beta1 * m + (1-beta1) * g) (V.zip m g)   -- first-moment estimate
        u'  = V.map (\(u,g) -> max (beta2*u) (abs g)) (V.zip u g)       -- infinity-norm estimate
        th' = V.map (\(th,m,u) -> th - (alpha / (1-beta1^(t+1))) * m/u) $ V.zip3 th m' u'

fTest = grad (\xs -> (V.head xs)^2 - (V.head xs))

test1 = adaMax fTest (V.fromList [-2000]) 0.002 0.9 0.999
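
For anyone reproducing the timing: the ~9 seconds is wall-clock time for evaluating test1, and a minimal harness along these lines is what I have in mind (just a sketch using getCurrentTime from the time package; the import goes at the top of the file, and a criterion benchmark would of course be more rigorous):

import Data.Time.Clock (getCurrentTime, diffUTCTime)

main :: IO ()
main = do
  t0 <- getCurrentTime
  print test1                 -- printing forces the whole result vector
  t1 <- getCurrentTime
  print (diffUTCTime t1 t0)   -- elapsed wall-clock time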
I’d note as well that using the explicitly calculated gradient (2x - 1) is about 50% faster, which is a bit surprising given how tiny the function is.
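
To be concrete, the hand-written comparison is just the derivative 2x - 1 packaged in the same Vector-in/Vector-out shape as fTest; the names fTest' and test1' below are only for illustration.

-- Explicit gradient of x^2 - x, i.e. 2x - 1, in the same shape as fTest.
fTest' :: V.Vector Double -> V.Vector Double
fTest' xs = V.singleton (2 * V.head xs - 1)

test1' :: V.Vector Double
test1' = adaMax fTest' (V.fromList [-2000]) 0.002 0.9 0.999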