Refs, mostly TODO
Gradient descent
Line search: rather than a fixed learning rate, pick the step size by searching along the descent direction until the objective is (approximately) minimized along that line
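A minimal sketch of the idea, using a backtracking (Armijo) line search along the negative gradient; f, grad_f, and the constants here are illustrative assumptions, not from any particular source:
import numpy as np

def gd_with_line_search(f, grad_f, x, steps=100, c=1e-4, shrink=0.5):
    # Gradient descent where each step size is chosen by backtracking line search.
    for _ in range(steps):
        g = grad_f(x)
        t = 1.0  # start with a large step, shrink until sufficient decrease
        # Armijo condition: f(x - t*g) must beat f(x) by a margin proportional to t*||g||^2
        while f(x - t * g) > f(x) - c * t * np.dot(g, g):
            t *= shrink
        x = x - t * g
    return x

# Example: minimize f(x) = ||x||^2 starting from [3, -4]
x_min = gd_with_line_search(lambda x: x @ x, lambda x: 2 * x, np.array([3.0, -4.0]))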
Second-order optimization algorithms
BFGS/L-BFGS (source): quasi-Newton methods; approximate the inverse Hessian from gradient history instead of computing it exactly. TODO
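In practice these are usually used via a library; a minimal sketch with SciPy's L-BFGS-B on its built-in Rosenbrock test function (the choice of function and start point is just illustrative):
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS keeps only a few recent gradient differences to approximate curvature,
# instead of storing a full Hessian approximation like BFGS.
x0 = np.array([-1.2, 1.0])
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)  # should approach the minimizer [1., 1.]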
Line search (source)
Shampoo TODO
Muon: record holder for the nanoGPT speedrun (source) TODO
Relation to shampoo: (source)
A connection between Muon and Shampoo (https://arxiv.org/abs/1802.09568) is that in the momentum=0 case, running Shampoo with the preconditioner updated every step and with no accumulation also yields the nearest orthogonal matrix as its update, although this has a slower runtime.
The nearest orthogonal matrix is also equivalent to:
- The semi-unitary factor of the update's polar decomposition.
- UV^T where USV^T is the update's SVD.
Using a rescaled version of the second form (UV^T) was proposed in prior work (source)
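A quick sketch of this orthogonalization in PyTorch (illustrative only; Muon itself approximates it with a Newton-Schulz iteration rather than an explicit SVD):
import torch

def nearest_orthogonal(update: torch.Tensor) -> torch.Tensor:
    # U V^T from the SVD, equivalently the semi-unitary factor of the polar decomposition.
    U, S, Vh = torch.linalg.svd(update, full_matrices=False)
    return U @ Vh

G = torch.randn(256, 128)
O = nearest_orthogonal(G)
print(torch.allclose(O.T @ O, torch.eye(128), atol=1e-4))  # columns are orthonormal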
Duality
First-order optimization algorithms
Coordinate descent: optimize along one coordinate (dimension) at a time, holding the others fixed
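A minimal sketch on a quadratic, with exact minimization along each coordinate (the objective and matrices here are just an illustrative assumption):
import numpy as np

def coordinate_descent(A, b, x, sweeps=50):
    # Minimize 0.5 x^T A x - b^T x (A symmetric positive definite), one coordinate at a time.
    for _ in range(sweeps):
        for i in range(len(x)):
            # Exact 1-D minimizer along coordinate i: set the i-th partial derivative (Ax - b)_i to zero
            x[i] += (b[i] - A[i] @ x) / A[i, i]
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = coordinate_descent(A, b, np.zeros(2))  # approaches the solution of A x = b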
GD
SGD
SGD with momentum: first moment (velocity)
AdaGrad
RMSprop (Hinton)
ADAM: first and second moments (Kingma, Ba)
Adam ~ RMSProp with momentum, but from the original paper:
"There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly estimated using a running average of first and second moment of the gradient."
Intuition
Momentum: updates w by a velocity v, not directly by g; g updates v.
v = mom*v + g
w -= lr * v
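A sketch in the same style as the Adam code below, assuming the same fastai-style SGD base class (with self.lr, self.wd, and opt_step(p) called per parameter; torch imported as in the Code section):
class Momentum(SGD):
    def __init__(self, params, lr, wd=0., mom=0.9):
        super().__init__(params, lr=lr, wd=wd)
        self.mom = mom
    def opt_step(self, p):
        if not hasattr(p, 'grad_avg'): p.grad_avg = torch.zeros_like(p.grad.data)
        p.grad_avg = self.mom*p.grad_avg + p.grad  # v = mom*v + g
        p -= self.lr * p.grad_avg                  # w -= lr*v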
RMSprop: updates w with a per-parameter LR scaled by 1/sqrt(EWMA of g^2), so steps shrink where gradients have been large.
v = EWMA of g^2
w -= (lr / sqrt(v + eps)) * g
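Likewise, a sketch of RMSprop against the same assumed SGD base class:
class RMSProp(SGD):
    def __init__(self, params, lr, wd=0., sqr_mom=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.sqr_mom,self.eps = sqr_mom,eps
    def opt_step(self, p):
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = torch.zeros_like(p.grad.data)
        p.sqr_avg = self.sqr_mom*p.sqr_avg + (1-self.sqr_mom)*(p.grad**2)  # v = EWMA of g^2
        p -= self.lr * p.grad / (p.sqr_avg + self.eps).sqrt()              # per-parameter scaled step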
ADAM
m = EWMA of g (like the "v" of momentum)
v = EWMA of g^2 (like the "v" of RMSprop)
w -= (lr / sqrt(v + eps)) * m
$m$ is first moment (mean)
$v$ is second moment (uncentered variance)
Code
import torch

# Assumes a fastai-style SGD base class that stores self.lr, self.wd, a step
# counter self.i, and calls opt_step(p) for each parameter tensor p.
class Adam(SGD):
    def __init__(self, params, lr, wd=0., beta1=0.9, beta2=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.beta1,self.beta2,self.eps = beta1,beta2,eps
    def opt_step(self, p):
        if not hasattr(p, 'avg'): p.avg = torch.zeros_like(p.grad.data)
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = torch.zeros_like(p.grad.data)
        # EWMA of the gradient (first moment), bias-corrected for the zero init
        p.avg = self.beta1*p.avg + (1-self.beta1)*p.grad
        unbias_avg = p.avg / (1 - (self.beta1**(self.i+1)))
        # EWMA of the squared gradient (second moment), bias-corrected
        p.sqr_avg = self.beta2*p.sqr_avg + (1-self.beta2)*(p.grad**2)
        unbias_sqr_avg = p.sqr_avg / (1 - (self.beta2**(self.i+1)))
        p -= self.lr * unbias_avg / (unbias_sqr_avg + self.eps).sqrt()
Adam is "invariant to diagonal rescaling of the gradients", i.e. invariant to multiplying the gradients by a diagonal matrix with positive entries
Another paper says:
"Second, while the magnitudes of Adam parameter updates are invariant to rescaling of the gradient, the effect of the updates on the same overall network function still varies with the magnitudes of parameters."
So presumably grad norm clipping has little effect under Adam? (Though clipping applies its scalar factor only on steps where the norm exceeds the threshold, so it is not a fixed rescaling across time; it may also help numerics.)
With LR=0, we still update the moments (and still apply grad norm clipping), just not the weights.
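A quick numerical check of the invariance for a fixed positive diagonal rescaling applied to every gradient (eps omitted; the data here is made up):
import torch

def adam_update(grads, beta1=0.9, beta2=0.999):
    # Bias-corrected m / sqrt(v) after a sequence of gradients (eps omitted for the check).
    m = torch.zeros_like(grads[0]); v = torch.zeros_like(grads[0])
    for t, g in enumerate(grads, start=1):
        m = beta1*m + (1-beta1)*g
        v = beta2*v + (1-beta2)*g**2
    return (m / (1 - beta1**t)) / (v / (1 - beta2**t)).sqrt()

gs = [torch.randn(5) for _ in range(10)]
d = torch.rand(5) + 0.1  # positive diagonal rescaling
print(torch.allclose(adam_update(gs), adam_update([d*g for g in gs]), atol=1e-5))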
Weight decay
loss = ... + 0.5 * wd * weight**2
or $L_{new}\left(w\right) = L_{original}\left(w\right) + \frac{\lambda}{2} w^{T}w$
new_weight = weight - lr*weight.grad - lr*wd*weight
# or
weight.grad += wd*weight
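A quick autograd check that the 0.5*wd*w^2 penalty contributes exactly wd*w to the gradient, so the two forms above agree for plain SGD (values are arbitrary):
import torch

wd = 0.1
w = torch.randn(3, requires_grad=True)
(0.5 * wd * (w**2).sum()).backward()            # form 1: penalty inside the loss
print(torch.allclose(w.grad, wd * w.detach()))  # form 2: add wd*w to the gradient directly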
Grad norm clipping
In practice
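A typical placement in a PyTorch training loop (the model, data, and max_norm=1.0 here are just placeholders):
import torch
from torch import nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(10):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    # rescale all gradients so their combined L2 norm is at most 1.0, then step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()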
Learning rate schedules
https://twitter.com/__kolesnikov__/status/1687911223096971264
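One common shape, linear warmup then cosine decay (constants are illustrative and not taken from the linked thread):
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=100_000):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))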
Hyperparameter optimization
TODO An Empirical Model of Large-Batch Training — Sam McCandlish