Coordinate descent: optimize along one coordinate (dim) at a time, holding the others fixed
Gradient descent (GD)
Line search: rather than a fixed learning rate, keep moving along the (negative) gradient direction until the objective is (approximately) minimized along that line; sketch below
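A minimal sketch of this (my toy example, not from any library): plain GD on a quadratic, where each step scans candidate step sizes along -g and keeps the one with the lowest loss.

import torch

def f(w):                                   # toy objective: ill-conditioned quadratic
    return w[0]**2 + 10*w[1]**2

w = torch.tensor([3.0, 2.0], requires_grad=True)
for _ in range(20):
    f(w).backward()
    with torch.no_grad():
        g = w.grad
        # crude line search: try step sizes along -g, keep the one minimizing f
        best_t = min(torch.logspace(-4, 0, 30), key=lambda t: f(w - t*g).item())
        w -= best_t * g
    w.grad.zero_()
print(w)                                    # approaches the minimum at (0, 0)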
Stochastic gradient descent (SGD)
SGD with momentum: first moment (velocity)
AdaGrad
RMSprop (Hinton)
ADAM: first and second moments (Kingma, Ba)
Adam ~ RMSProp with momentum, but the original paper notes some differences:
"There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly estimated using a running average of first and second moment of the gradient."
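To make the quoted difference concrete, a rough per-step sketch (my notation; RMSProp-with-momentum written in the common PyTorch-style formulation): one applies momentum to the already-rescaled gradient, the other smooths g and g^2 separately and then rescales.

import torch

def rmsprop_momentum_step(w, g, state, lr=1e-3, momentum=0.9, alpha=0.99, eps=1e-8):
    # EWMA of g^2, then momentum on the *rescaled* gradient
    state['sq'] = alpha*state['sq'] + (1-alpha)*g**2
    state['buf'] = momentum*state['buf'] + g / (state['sq'].sqrt() + eps)
    return w - lr*state['buf']

def adam_step(w, g, state, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # running averages of g (first moment) and g^2 (second moment), bias-corrected
    state['m'] = beta1*state['m'] + (1-beta1)*g
    state['v'] = beta2*state['v'] + (1-beta2)*g**2
    m_hat = state['m'] / (1 - beta1**t)
    v_hat = state['v'] / (1 - beta2**t)
    return w - lr*m_hat / (v_hat.sqrt() + eps)

(state holds zero-initialized tensors shaped like w; t is the 1-based step count.)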
Intuition
Momentum: updates w by v, not directly by g; v is a decaying accumulation of g.
v = beta*v + lr*g
w -= v
RMSprop: adapts a per-parameter LR, shrinking it where recent g^2 is large.
v = EWMA of g^2
w -= (lr / sqrt(v)) * g    # effective per-param LR = lr/sqrt(v); add eps in practice
ADAM
m = EWMA of g = like "v" of momentum
v = EWMA of g^2 = like "v" of RMSprop
w -= (lr / sqrt(v)) * m    # with bias-corrected m, v and an eps, as in the code below
$m$ is first moment (mean)
$v$ is second moment (uncentered variance)
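One thing the intuition above skips but the code below relies on: m and v start at 0, so early EWMAs are biased toward 0; dividing by (1 - beta^t) corrects this. A tiny worked check (my numbers):

beta1 = 0.9
g = 2.0                        # pretend the gradient is constant
m = beta1*0.0 + (1-beta1)*g    # after step 1 (m started at 0): m = 0.2, far below g
m_hat = m / (1 - beta1**1)     # bias-corrected: 0.2 / 0.1 = 2.0, recovers g
print(m, m_hat)                # ~0.2, 2.0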
Code
# Adam from these notes (fastai-style). Assumes an SGD base class, defined elsewhere,
# that stores self.params / self.lr / self.wd, keeps a 0-based step counter self.i,
# and calls opt_step(p) on each parameter once per step.
import torch

class Adam(SGD):
    def __init__(self, params, lr, wd=0., beta1=0.9, beta2=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.beta1,self.beta2,self.eps = beta1,beta2,eps  # paper defaults: beta2=0.999, eps=1e-8

    def opt_step(self, p):
        # lazily create per-parameter state for the two moment estimates
        if not hasattr(p, 'avg'):     p.avg     = torch.zeros_like(p.grad.data)
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = torch.zeros_like(p.grad.data)
        # first moment: EWMA of g, bias-corrected for the zero init
        p.avg = self.beta1*p.avg + (1-self.beta1)*p.grad
        unbias_avg = p.avg / (1 - (self.beta1**(self.i+1)))
        # second moment: EWMA of g^2, same bias correction
        p.sqr_avg = self.beta2*p.sqr_avg + (1-self.beta2)*(p.grad**2)
        unbias_sqr_avg = p.sqr_avg / (1 - (self.beta2**(self.i+1)))
        # update; note eps goes inside the sqrt here (the paper adds it after the sqrt)
        p -= self.lr * unbias_avg / (unbias_sqr_avg + self.eps).sqrt()
Adam is "invariant to diagonal rescaling of the gradients”, i.e. Adam is invariant to multiplying the gradient by a diagonal matrix with only positive factors
Another paper says:
Second, while the magnitudes of Adam parameter updates are invariant to rescaling of the gradient, the effect of the updates on the same overall network function still varies with the magnitudes of parameters.
So presumably grad norm clipping has little effect? Caveat: the invariance is for a rescaling that is constant across steps, whereas clipping rescales by a different scalar on each clipped step, so it does change the relative weighting inside the moment EWMAs (plus whatever it does for numerics). See the check below.
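A quick numerical check of both points (a sketch using plain Adam math with small eps; not tied to any particular library):

import torch

def adam_update(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    # returns the final bias-corrected update direction m_hat / (sqrt(v_hat) + eps)
    m = torch.zeros_like(grads[0]); v = torch.zeros_like(grads[0])
    for t, g in enumerate(grads, start=1):
        m = beta1*m + (1-beta1)*g
        v = beta2*v + (1-beta2)*g**2
    return (m/(1-beta1**t)) / ((v/(1-beta2**t)).sqrt() + eps)

def clip(g, max_norm=0.5):                      # per-step norm clipping
    return g * min(1.0, max_norm / (g.norm().item() + 1e-12))

grads = [torch.randn(4) for _ in range(10)]
same_rescale = adam_update([100*g for g in grads])       # constant rescale every step
per_step_clip = adam_update([clip(g) for g in grads])    # scale factor varies per step
print(torch.allclose(adam_update(grads), same_rescale, atol=1e-4))   # True: invariant (up to eps)
print(torch.allclose(adam_update(grads), per_step_clip, atol=1e-4))  # usually False: clipping is not a no-op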
With LR=0 we are still updating the moments (and grad norm clipping still runs); only the weight update is zeroed.
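Quick check with torch.optim.Adam (state keys as named in current PyTorch):

import torch

p = torch.nn.Parameter(torch.ones(3))
opt = torch.optim.Adam([p], lr=0.0)
(p**2).sum().backward()
opt.step()
print(p.data)                      # unchanged: lr=0 zeroes the weight update
print(opt.state[p]['exp_avg'])     # nonzero: first moment still updated
print(opt.state[p]['exp_avg_sq'])  # nonzero: second moment still updated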