
Coordinate descent: optimize along one dimension (coordinate) at a time, holding the rest fixed
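A minimal sketch of the idea on a toy quadratic, where each 1D minimization has a closed form (A, b, and the sweep count are made-up for illustration):

import numpy as np

# coordinate descent on f(w) = 0.5*w@A@w - b@w
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

w = np.zeros(2)
for sweep in range(50):
    for i in range(len(w)):
        # minimize f along coordinate i with the others fixed:
        # df/dw_i = A[i] @ w - b[i] = 0, solved for w_i in closed form
        w[i] = (b[i] - A[i] @ w + A[i, i] * w[i]) / A[i, i]

print(w, np.linalg.solve(A, b))   # the sweeps approach the exact minimizer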

Gradient descent (GD)
Line search: rather than a fixed learning rate, choose each step size by moving along the (negative) gradient direction until the objective is minimized along that line
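A sketch of that on a toy quadratic, where the 1D minimization along the gradient direction has a closed form; for a general objective you would backtrack or bracket instead (A, b, and the tolerance are made-up):

import numpy as np

# gradient descent with exact line search on f(w) = 0.5*w@A@w - b@w
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

w = np.zeros(2)
for step in range(100):
    g = A @ w - b                  # gradient of the quadratic
    if np.linalg.norm(g) < 1e-10:
        break
    # exact line search: minimize f(w - t*g) over t; for a quadratic, t = (g@g)/(g@A@g)
    t = (g @ g) / (g @ A @ g)
    w = w - t * g

print(w, np.linalg.solve(A, b))
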
Stochastic gradient descent (SGD)

SGD with (Polyak) momentum: first moment (velocity)
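A minimal sketch of the update (heavy-ball form, full-batch gradient on a toy quadratic; A, b, and the μ/η values are made-up):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
mu, eta = 0.9, 0.1

w = np.zeros(2)
v = np.zeros(2)
for step in range(200):
    g = A @ w - b        # with SGD this would be a minibatch gradient
    v = mu * v + g       # velocity: decayed running sum of gradients (the first moment)
    w = w - eta * v      # step along the velocity instead of the raw gradient

print(w, np.linalg.solve(A, b))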

Nesterov momentum: update velocity first

Intuition:


Tiny tweak to momentum. In step 4, you compute the gradient at w, then let momentum carry you forward. Nesterov says: you already know momentum is about to carry you forward by roughly μ·v — so compute the gradient there, at the place you're about to land, instead of where you are now.
Naively:
w_look = w + (something like μ·v)   # peek ahead
g = ∇L(w_look)                      # gradient at the lookahead point
v ← μ·v + g
w ← w − η·v
Why bother: if the velocity is about to overshoot, the gradient at the lookahead point already "sees" the overshoot and points back, so you correct one step earlier. Slightly better behaved; on convex problems it's provably faster.
The actual nuisance is where that one gradient is taken. Stored weights are w_t, but the gradient is at w_t + ρv_t (switching notation here: ρ is the momentum coefficient and v now absorbs the learning rate, so it's the literal parameter step). Implemented literally:
w += ρ*v # shift params to the lookahead point
g = fwd_bwd(w) # the one gradient eval
w -= ρ*v # shift params back
v = ρ*v − η*g # normal momentum update, using the lookahead gradient
w += v        # take the step from the true stored w_t
So it's not extra flops — it's two extra full-parameter writes per step (shift out, shift back), plus the bookkeeping risk of "my stored params are temporarily not the real w_t." On a big model that memory traffic isn't free, and it's ugly, but it's not a doubled forward pass.
The trick is purely bookkeeping. You're free to choose which point you call "w" and store in memory.
Instead of storing the base point and peeking ahead each time, just store the lookahead point itself. Call the thing you keep in memory w̃. Then every step:
g = ∇L(w̃)            # gradient at the thing in memory — normal fwd/bwd, nothing extra
v ← μ·v + g
w̃ ← w̃ − η·g − η·μ·v  # the update written in terms of w̃ (v here is the freshly updated velocity)
There are a few lines of algebra (sketched just below) showing this is exactly equivalent to the naive version — but the point is: you only ever evaluate the gradient at the single weight vector you're storing. The "lookahead" is absorbed into a change of variables, not an extra computation.
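For reference, a sketch of that algebra (in the μ/η notation above, where the velocity excludes the learning rate, so the lookahead point is w − η·μ·v rather than literally w + μ·v):
naive step:  g = ∇L(w_t − η·μ·v_t),   v_{t+1} = μ·v_t + g,   w_{t+1} = w_t − η·v_{t+1}
define:      w̃_t = w_t − η·μ·v_t   (the lookahead point, i.e. the thing we choose to store)
then:        w̃_{t+1} = w_{t+1} − η·μ·v_{t+1} = w̃_t + η·μ·v_t − η·(1+μ)·v_{t+1}
substituting v_{t+1} = μ·v_t + g, the η·μ·v_t terms cancel, leaving
             w̃_{t+1} = w̃_t − η·g − η·μ·v_{t+1},   with g = ∇L(w̃_t), which is exactly the update above.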
In the ρ-notation from the literal implementation above, the same trick removes the shift-out/shift-back. If you store w̃ = w + ρv, the thing in memory is already where you want the gradient, so the loop is just:
g = fwd_bwd(w̃)           # the one gradient eval, at the stored (lookahead) point
v_prev = v                # remember the old velocity
v = ρ*v − η*g
w̃ += v + ρ*(v − v_prev)  # single parameter write
One gradient at the stored point, one param write. Same flops as naive, fewer weight writes, cleaner code.
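A runnable sanity check of that equivalence, sketched on a toy quadratic with full-batch gradients (A, b, ρ, η, and the step count are made-up; grad() stands in for fwd_bwd):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
rho, eta = 0.9, 0.1

def grad(w):
    return A @ w - b   # gradient of L(w) = 0.5*w@A@w - b@w

# literal version: stored params are the true w_t, shift out to the lookahead and back
w = np.zeros(2)
v = np.zeros(2)
for _ in range(100):
    w = w + rho * v            # shift params to the lookahead point
    g = grad(w)                # the one gradient eval
    w = w - rho * v            # shift params back
    v = rho * v - eta * g
    w = w + v

# change-of-variables version: stored params are the lookahead point w̃ itself
w_tilde = np.zeros(2)
v2 = np.zeros(2)
for _ in range(100):
    g = grad(w_tilde)          # gradient at the stored point, nothing extra
    v_prev = v2
    v2 = rho * v2 - eta * g
    w_tilde = w_tilde + v2 + rho * (v2 - v_prev)

# w̃ equals w + ρ·v after every step, so the two loops trace the same trajectory
print(np.allclose(w_tilde, w + rho * v))   # True
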
AdaGrad
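For reference, a sketch of the standard AdaGrad update: keep a per-parameter running sum of squared gradients and divide the step by its square root (the toy objective and all constants are made-up):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
eta, eps = 0.5, 1e-8

w = np.zeros(2)
G = np.zeros(2)                        # per-parameter sum of squared gradients
for _ in range(500):
    g = A @ w - b                      # gradient of 0.5*w@A@w - b@w
    G += g * g                         # never decays, so the effective step only shrinks
    w -= eta * g / (np.sqrt(G) + eps)  # bigger steps for weakly-updated dims, smaller otherwise

print(w, np.linalg.solve(A, b))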


RMSprop (Hinton)
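Same shape as AdaGrad, but the running sum of squared gradients is replaced by an exponential moving average, so the effective step can grow back after a stretch of small gradients (a sketch; constants and the toy objective are made-up):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
eta, beta, eps = 0.01, 0.9, 1e-8

w = np.zeros(2)
s = np.zeros(2)                          # EMA of squared gradients
for _ in range(1000):
    g = A @ w - b
    s = beta * s + (1 - beta) * g * g    # decaying average instead of AdaGrad's running sum
    w -= eta * g / (np.sqrt(s) + eps)

print(w, np.linalg.solve(A, b))          # ends up near the minimizer (may hover at the eta scale)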
