@tomjaguarpaw, it isn’t the grad function. Here is some evidence: both test1 and test2 exhibit the “memory leak”, so it’s something the adaMax code is doing, but I still can’t see what the problem is. @jaror, I’ve implemented all the zipWith suggestions too; they’re much more elegant, but unfortunately they make no difference to performance or memory usage.
```haskell
import Numeric.AD (grad)

-- minimal adaMax function
adaMax f' th t m u
  | all (\x -> abs x < 1e-7) g = th'
  | otherwise = adaMax f' th' (t+1) m' u'
  where
    g   = f' th
    m'  = zipWith  (\m g -> beta1 * m + (1 - beta1) * g) m g
    u'  = zipWith  (\u g -> max (beta2 * u) (abs g)) u g
    th' = zipWith3 (\th m u -> th - (alpha / (1 - beta1 ^ (t+1))) * m / u) th m' u'
    (alpha, beta1, beta2) = (0.002, 0.9, 0.999)

fTest  = grad (\[x,y,z] -> (x-1)^2 + (y-2)^2 + (z-3)^2)
fTest2 [x,y,z] = [2*(x-1), 2*(y-2), 2*(z-3)]

-- uses grad
test1 = adaMax fTest  [100000,100000,100000] 0 [0,0,0] [0,0,0]

-- uses hardcoded gradient.
test2 = adaMax fTest2 [100000,100000,100000] 0 [0,0,0] [0,0,0]
```
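
For the compiled case, a driver along these lines reproduces it (a minimal sketch, not the exact code I ran; the file name is arbitrary, and +RTS -s just prints GHC’s allocation and residency summary):

```haskell
-- Sketch: assumes the definitions above are saved as AdaMax.hs.
-- Build and run with:
--   ghc -O2 -rtsopts AdaMax.hs
--   ./AdaMax +RTS -s
-- The -s summary shows total allocation and maximum residency, so the
-- growth is visible without watching top.
main :: IO ()
main = do
  print test1
  print test2
```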
You can observe the leak by running top on Linux and watching the ghci process consume ever more memory. The same thing happens when compiled. Bang patterns on everything didn’t make a difference either.
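
One thing bang patterns can’t do is evaluate inside the lists (they only force to WHNF), so here is a sketch of a fully strict variant using force from the deepseq package, in case the growth is just a chain of unevaluated zipWith thunks (that’s a guess, not something I’ve confirmed):

```haskell
import Control.DeepSeq (force)  -- deepseq ships with GHC

-- Same algorithm, but every intermediate list is deeply evaluated before
-- the recursive call, so thunks cannot pile up between iterations.
adaMaxStrict f' th t m u
  | all (\x -> abs x < 1e-7) g = th'
  | otherwise = th' `seq` adaMaxStrict f' th' (t + 1) m' u'
  where
    g   = force (f' th)
    m'  = force (zipWith  (\mi gi -> beta1 * mi + (1 - beta1) * gi) m g)
    u'  = force (zipWith  (\ui gi -> max (beta2 * ui) (abs gi)) u g)
    th' = force (zipWith3 (\thi mi ui -> thi - (alpha / (1 - beta1 ^ (t + 1))) * mi / ui) th m' u')
    (alpha, beta1, beta2) = (0.002, 0.9, 0.999)
```

If adaMaxStrict shows the same growth in top, then the retention isn’t thunk build-up in these lists and the cause must be elsewhere.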