Refs, mostly TODO
Gradient descent
Line search: rather than a fixed learning rate, pick the step size by searching along the descent direction until the objective is (approximately) minimized along that line
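A minimal sketch of the idea, using a backtracking (Armijo) line search along the negative gradient; f, grad_f, and the constants here are illustrative assumptions, not from any particular source:
import numpy as np

def gd_with_line_search(f, grad_f, x, steps=100, c=1e-4, shrink=0.5):
    # Gradient descent where each step size is chosen by backtracking line search.
    for _ in range(steps):
        g = grad_f(x)
        t = 1.0  # start with a large step, shrink until sufficient decrease
        # Armijo condition: f(x - t*g) must beat f(x) by a margin proportional to t*||g||^2
        while f(x - t * g) > f(x) - c * t * np.dot(g, g):
            t *= shrink
        x = x - t * g
    return x

# Example: minimize f(x) = ||x||^2 starting from [3, -4]
x_min = gd_with_line_search(lambda x: x @ x, lambda x: 2 * x, np.array([3.0, -4.0]))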
Second-order optimization algorithms
BFGS/L-BFGS (source): quasi-Newton methods; approximate the inverse Hessian from gradient history instead of computing it exactly. TODO
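In practice these are usually used via a library; a minimal sketch with SciPy's L-BFGS-B on its built-in Rosenbrock test function (the choice of function and start point is just illustrative):
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS keeps only a few recent gradient differences to approximate curvature,
# instead of storing a full Hessian approximation like BFGS.
x0 = np.array([-1.2, 1.0])
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)  # should approach the minimizer [1., 1.]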
Line search (source)
Shampoo TODO
Muon: record holder for the nanoGPT speedrun (source) TODO
Relation to shampoo: (source)
A connection between Muon and Shampoo (https://arxiv.org/abs/1802.09568) is that in the momentum=0 case, running Shampoo with the preconditioner updated every step and with no accumulation also yields the nearest orthogonal matrix as its update, although this has a slower runtime.
The nearest orthogonal matrix is also equivalent to:
- The semi-unitary factor of the update's polar decomposition.
- UV^T where USV^T is the update's SVD.
Using a rescaled version of the second form (UV^T) was proposed in prior work (source)
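A quick sketch of this orthogonalization in PyTorch (illustrative only; Muon itself approximates it with a Newton-Schulz iteration rather than an explicit SVD):
import torch

def nearest_orthogonal(update: torch.Tensor) -> torch.Tensor:
    # U V^T from the SVD, equivalently the semi-unitary factor of the polar decomposition.
    U, S, Vh = torch.linalg.svd(update, full_matrices=False)
    return U @ Vh

G = torch.randn(256, 128)
O = nearest_orthogonal(G)
print(torch.allclose(O.T @ O, torch.eye(128), atol=1e-4))  # columns are orthonormal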
Duality
First-order optimization algorithms
Coordinate descent: optimize along one coordinate (dimension) at a time, holding the others fixed
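A minimal sketch on a quadratic, with exact minimization along each coordinate (the objective and matrices here are just an illustrative assumption):
import numpy as np

def coordinate_descent(A, b, x, sweeps=50):
    # Minimize 0.5 x^T A x - b^T x (A symmetric positive definite), one coordinate at a time.
    for _ in range(sweeps):
        for i in range(len(x)):
            # Exact 1-D minimizer along coordinate i: set the i-th partial derivative (Ax - b)_i to zero
            x[i] += (b[i] - A[i] @ x) / A[i, i]
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = coordinate_descent(A, b, np.zeros(2))  # approaches the solution of A x = b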
GD
SGD
SGD with momentum: first moment (velocity)
AdaGrad
RMSprop (Hinton)
ADAM: first and second moments (Kingma, Ba)
Adam ~ RMSProp with momentum, but from the original paper:
"There are a few important differences between RMSProp with momentum and Adam: RMSProp with momentum generates its parameter updates using a momentum on the rescaled gradient, whereas Adam updates are directly estimated using a running average of first and second moment of the gradient."
Intuition
Momentum: updates w by a velocity v, not directly by g; g updates v.
v = mom*v + g
w -= lr * v
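A sketch in the same style as the Adam code below, assuming the same fastai-style SGD base class (with self.lr, self.wd, and opt_step(p) called per parameter; torch imported as in the Code section):
class Momentum(SGD):
    def __init__(self, params, lr, wd=0., mom=0.9):
        super().__init__(params, lr=lr, wd=wd)
        self.mom = mom
    def opt_step(self, p):
        if not hasattr(p, 'grad_avg'): p.grad_avg = torch.zeros_like(p.grad.data)
        p.grad_avg = self.mom*p.grad_avg + p.grad  # v = mom*v + g
        p -= self.lr * p.grad_avg                  # w -= lr*v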
RMSprop: updates w with a per-parameter LR scaled by 1/sqrt(EWMA of g^2), so steps shrink where gradients have been large.
v = EWMA of g^2
w -= (lr / sqrt(v + eps)) * g
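Likewise, a sketch of RMSprop against the same assumed SGD base class:
class RMSProp(SGD):
    def __init__(self, params, lr, wd=0., sqr_mom=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.sqr_mom,self.eps = sqr_mom,eps
    def opt_step(self, p):
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = torch.zeros_like(p.grad.data)
        p.sqr_avg = self.sqr_mom*p.sqr_avg + (1-self.sqr_mom)*(p.grad**2)  # v = EWMA of g^2
        p -= self.lr * p.grad / (p.sqr_avg + self.eps).sqrt()              # per-parameter scaled step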
ADAM
m = EWMA of g (like the "v" of momentum)
v = EWMA of g^2 (like the "v" of RMSprop)
w -= (lr / sqrt(v + eps)) * m
$m$ is first moment (mean)
$v$ is second moment (uncentered variance)
Code
import torch

# Assumes a fastai-style SGD base class that stores self.lr, self.wd, a step
# counter self.i, and calls opt_step(p) for each parameter tensor p.
class Adam(SGD):
    def __init__(self, params, lr, wd=0., beta1=0.9, beta2=0.99, eps=1e-5):
        super().__init__(params, lr=lr, wd=wd)
        self.beta1,self.beta2,self.eps = beta1,beta2,eps
    def opt_step(self, p):
        if not hasattr(p, 'avg'): p.avg = torch.zeros_like(p.grad.data)
        if not hasattr(p, 'sqr_avg'): p.sqr_avg = torch.zeros_like(p.grad.data)
        # EWMA of the gradient (first moment), bias-corrected for the zero init
        p.avg = self.beta1*p.avg + (1-self.beta1)*p.grad
        unbias_avg = p.avg / (1 - (self.beta1**(self.i+1)))
        # EWMA of the squared gradient (second moment), bias-corrected
        p.sqr_avg = self.beta2*p.sqr_avg + (1-self.beta2)*(p.grad**2)
        unbias_sqr_avg = p.sqr_avg / (1 - (self.beta2**(self.i+1)))
        p -= self.lr * unbias_avg / (unbias_sqr_avg + self.eps).sqrt()
Adam is "invariant to diagonal rescaling of the gradients", i.e. invariant to multiplying the gradients by a diagonal matrix with positive entries
Another paper says:
"Second, while the magnitudes of Adam parameter updates are invariant to rescaling of the gradient, the effect of the updates on the same overall network function still varies with the magnitudes of parameters."
So presumably grad norm clipping has little effect under Adam? (Though clipping applies its scalar factor only on steps where the norm exceeds the threshold, so it is not a fixed rescaling across time; it may also help numerics.)
With LR=0, we still update the moments (and still apply grad norm clipping), just not the weights.
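A quick numerical check of the invariance for a fixed positive diagonal rescaling applied to every gradient (eps omitted; the data here is made up):
import torch

def adam_update(grads, beta1=0.9, beta2=0.999):
    # Bias-corrected m / sqrt(v) after a sequence of gradients (eps omitted for the check).
    m = torch.zeros_like(grads[0]); v = torch.zeros_like(grads[0])
    for t, g in enumerate(grads, start=1):
        m = beta1*m + (1-beta1)*g
        v = beta2*v + (1-beta2)*g**2
    return (m / (1 - beta1**t)) / (v / (1 - beta2**t)).sqrt()

gs = [torch.randn(5) for _ in range(10)]
d = torch.rand(5) + 0.1  # positive diagonal rescaling
print(torch.allclose(adam_update(gs), adam_update([d*g for g in gs]), atol=1e-5))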
Weight decay
loss = ... + 0.5 * wd * weight**2
or $L_{new}\left(w\right) = L_{original}\left(w\right) + \frac{\lambda}{2} w^{T}w$
new_weight = weight - lr*weight.grad - lr*wd*weight
# or
weight.grad += wd*weight
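A quick autograd check that the 0.5*wd*w^2 penalty contributes exactly wd*w to the gradient, so the two forms above agree for plain SGD (values are arbitrary):
import torch

wd = 0.1
w = torch.randn(3, requires_grad=True)
(0.5 * wd * (w**2).sum()).backward()            # form 1: penalty inside the loss
print(torch.allclose(w.grad, wd * w.detach()))  # form 2: add wd*w to the gradient directly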
Grad norm clipping
In practice
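A typical placement in a PyTorch training loop (the model, data, and max_norm=1.0 here are just placeholders):
import torch
from torch import nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(10):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    # rescale all gradients so their combined L2 norm is at most 1.0, then step
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()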
Learning rate schedules
https://twitter.com/__kolesnikov__/status/1687911223096971264
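One common shape, linear warmup then cosine decay (constants are illustrative and not taken from the linked thread):
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=100_000):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))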
Hyperparameter optimization
TODO An Empirical Model of Large-Batch Training — Sam McCandlish