-
Coordinate descent: optimize along one dimension (coordinate) at a time

- Used in variational inference
- Useful when the per-coordinate (conditional) subproblems are easier to optimize
- No step size needed. Choose dimensions round-robin or randomly (see the sketch below).
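A minimal NumPy sketch on a toy quadratic f(x) = 0.5 x^T A x - b^T x (A and b are made up for illustration); each update exactly minimizes f along one coordinate, so no step size appears:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # symmetric positive definite (toy)
b = np.array([1.0, -1.0])
x = np.zeros(2)

for it in range(50):
    i = it % 2                              # round-robin; use a random index instead if preferred
    # exact 1-D minimizer along coordinate i: set the partial derivative w.r.t. x_i to zero
    x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

print(x, np.linalg.solve(A, b))             # matches the true minimizer A^{-1} b
```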
-
Gradient descent (GD)
- Invariant to rigid transformations (not scale)
- With a small enough step size, the loss should strictly decrease every step (each step is a full-batch pass over the data); sketch below
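A minimal full-batch GD sketch on a least-squares loss (toy data; the learning rate 0.1 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

w, lr = np.zeros(3), 0.1
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(y)   # exact gradient over the entire dataset
    w -= lr * grad

print(w)                                # close to the true coefficients [1, -2, 0.5]
```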
-
Line search: rather than using a fixed learning rate, keep moving along the (negative) gradient direction until the objective is minimized along that line
- Rarely used in practice since we are usually in a stochastic setting (a noisy minibatch gradient isn't worth minimizing along exactly); a backtracking variant is sketched below
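Exact line search is rarely written out; below is a minimal backtracking (Armijo) variant on a toy objective, as an illustration of the idea (f, the start point, and the constants are all made up):

```python
import numpy as np

def f(x):      return 0.5 * x @ x + np.sin(x).sum()
def grad_f(x): return x + np.cos(x)

x = np.array([3.0, -2.0])
for _ in range(50):
    g = grad_f(x)
    t = 1.0                                       # start with an optimistic step
    while f(x - t * g) > f(x) - 0.5 * t * (g @ g):
        t *= 0.5                                  # shrink until sufficient decrease holds
    x -= t * g

print(x, f(x))                                    # both coordinates approach the stationary point
```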
-
Stochastic gradient descent (SGD)
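A minibatch SGD sketch on the same kind of least-squares loss (toy data; batch size and learning rate are arbitrary). Each step uses a noisy gradient from a random minibatch, so individual steps can increase the loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

w, lr, batch = np.zeros(3), 0.05, 32
for _ in range(500):
    idx = rng.integers(0, len(y), size=batch)   # sample a random minibatch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch         # noisy estimate of the full gradient
    w -= lr * grad

print(w)
```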

-
SGD with momentum: first moment (velocity)

- Nesterov momentum: apply the velocity step first, then evaluate the gradient at that look-ahead point (both variants sketched below)
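A sketch of both update rules on a toy quadratic (the hyperparameters lr and mu are arbitrary). Classical momentum accumulates a velocity from past gradients; the Nesterov variant takes the gradient at the look-ahead position x + mu*v:

```python
import numpy as np

def grad_f(x):                        # gradient of the toy objective f(x) = 0.5 * ||x||^2
    return x

lr, mu = 0.1, 0.9

# classical momentum
x, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    v = mu * v - lr * grad_f(x)       # velocity: decayed running sum of gradients
    x = x + v

# Nesterov momentum: gradient taken at the look-ahead position
x, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    v = mu * v - lr * grad_f(x + mu * v)
    x = x + v

print(x)
```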

-
AdaGrad
- Scale each parameter's update individually, based on how large its gradients have been so far in training.
- Intuition: if a component has accumulated a lot of change (measured by the sum of its squared gradients), it has presumably made a lot of progress toward the target, so slow down its effective learning rate with a larger denominator.


- This helps when the loss surface is less like a bowl and more like a valley: you prioritize moving along the valley toward the goal instead of first dropping straight to the valley floor and only then moving toward the goal (sketch below)
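A minimal AdaGrad sketch on a toy "valley" objective f(x) = 0.5 * (100*x0^2 + x1^2) (lr and eps are arbitrary). The steep coordinate accumulates a large denominator, so both coordinates end up taking comparably sized steps:

```python
import numpy as np

def grad_f(x):                        # toy valley: much steeper in x0 than in x1
    return np.array([100.0 * x[0], x[1]])

lr, eps = 1.0, 1e-8
x = np.array([1.0, 1.0])
s = np.zeros_like(x)                  # running sum of squared gradients, per coordinate

for _ in range(500):
    g = grad_f(x)
    s += g * g
    x -= lr * g / (np.sqrt(s) + eps)  # per-coordinate effective learning rate lr / sqrt(s)

print(x)                              # both coordinates reach ~0 despite very different curvature
```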
-
RMSprop (Hinton)
- Problem with AdaGrad: the effective learning rate only ever decreases (the accumulated sum never shrinks)
- An EWMA of the squared gradient lets the effective LR increase as well as decrease, i.e., it forgets the distant past (sketch below)
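The same sketch with AdaGrad's running sum replaced by an EWMA (the decay rho is arbitrary); the accumulator can shrink again, so the effective LR can recover:

```python
import numpy as np

def grad_f(x):                        # same toy valley as above
    return np.array([100.0 * x[0], x[1]])

lr, rho, eps = 0.01, 0.9, 1e-8
x = np.array([1.0, 1.0])
s = np.zeros_like(x)                  # EWMA of squared gradients, per coordinate

for _ in range(1000):
    g = grad_f(x)
    s = rho * s + (1 - rho) * g * g   # old gradients are gradually forgotten
    x -= lr * g / (np.sqrt(s) + eps)

print(x)                              # hovers near the minimum; in practice lr would be decayed
```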
