Optimizers
Coordinate descent: optimize along one coordinate (dimension) at a time, holding the others fixed; cycle through coordinates until convergence.
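A minimal sketch on a toy convex quadratic (the matrix `A`, vector `b`, and sweep count are made up for illustration); each inner step minimizes the objective exactly along one coordinate:

```python
import numpy as np

# Coordinate descent on f(x) = 0.5 x^T A x - b^T x, with A symmetric
# positive definite so each 1-D subproblem has a closed-form minimizer.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)

for sweep in range(50):
    for i in range(len(x)):
        # Set df/dx_i = A[i] @ x - b[i] = 0 and solve for x[i],
        # keeping all other coordinates fixed.
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

print(x, np.linalg.solve(A, b))  # converges to the exact solution
```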
SGD vs GD: mini-batch vs full-batch gradients. GD is slow per step (every step costs a full pass over the data) but each gradient is exact; SGD steps are cheap but noisy, so any single step makes less reliable progress.
With a small enough learning rate, full-batch GD decreases the loss on every step (each step is effectively a full-batch epoch); SGD does not, because of mini-batch noise.
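A toy comparison on least squares, assuming synthetic data and arbitrary learning rates and batch size; the only difference between the two loops is which indices feed the gradient:

```python
import numpy as np

# GD vs SGD on least squares: loss(w) = 0.5 * ||X @ w - y||^2 / n.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, idx):
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

# Full-batch GD: one expensive, exact gradient per step.
w_gd = np.zeros(d)
for step in range(200):
    w_gd -= 0.1 * grad(w_gd, np.arange(n))

# Mini-batch SGD: cheap, noisy gradients (batch size 32 is arbitrary).
w_sgd = np.zeros(d)
for step in range(200):
    idx = rng.integers(0, n, size=32)
    w_sgd -= 0.1 * grad(w_sgd, idx)
```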
momentum: accumulate an exponentially decaying sum of past gradients and step along it, v ← βv + g, θ ← θ − αv; damps oscillations and speeds up movement along consistent directions
RMSprop: divide the step by a running average of squared gradients, s ← ρs + (1 − ρ)g², θ ← θ − αg/√(s + ε); gives each parameter its own effective learning rate
Adam: momentum + RMSprop combined, with bias correction on both running averages (see the sketch after this list)
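Minimal NumPy sketches of the three update rules, using the commonly cited default hyperparameters (illustrative, not prescriptions from this note):

```python
import numpy as np

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    v = beta * v + g                      # decaying sum of past gradients
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=0.001, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g**2        # running average of squared gradients
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # t is the step count, starting at 1.
    m = b1 * m + (1 - b1) * g             # first moment (momentum part)
    v = b2 * v + (1 - b2) * g**2          # second moment (RMSprop part)
    m_hat = m / (1 - b1**t)               # bias correction: the zero-initialized
    v_hat = v / (1 - b2**t)               # averages understate early steps
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```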
In practice: Adam (or AdamW) is the usual default; well-tuned SGD + momentum remains competitive. The learning rate is the hyperparameter to tune first.
Learning rate schedules: vary the learning rate over training rather than keeping it fixed; linear warmup followed by cosine decay is a common recipe.
https://twitter.com/__kolesnikov__/status/1687911223096971264
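A sketch of warmup + cosine decay, with placeholder values for the peak learning rate and warmup length:

```python
import math

def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=500):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to 0
```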
Hyperparameter optimization: random search usually beats grid search (a grid wastes trials on unimportant dimensions); sample scale-type hyperparameters like the learning rate log-uniformly.
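A random-search sketch; the objective here is a toy stand-in for a real validation-loss measurement:

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(lr):
    # Toy proxy for validation loss; pretends the best lr is ~1e-3.
    return (np.log10(lr) + 3) ** 2 + 0.1 * rng.normal()

best_lr, best_loss = None, float("inf")
for trial in range(20):
    lr = 10 ** rng.uniform(-5, -1)        # log-uniform sample in [1e-5, 1e-1]
    loss = objective(lr)
    if loss < best_loss:
        best_lr, best_loss = lr, loss
```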