Record holder for nanoGPT (source) TODO
Relation to Shampoo: (source)
A connection between Muon and Shampoo (https://arxiv.org/abs/1802.09568) is that in the momentum=0 case, running Shampoo with the preconditioners updated every step and with no accumulation also yields the nearest orthogonal matrix as its update: with L = GG^T and R = G^TG, the Shampoo update L^{-1/4} G R^{-1/4} simplifies to UV^T for G = USV^T. Computing the update this way has a slower runtime, however.
The nearest orthogonal matrix is also equivalent to:
- The semi-unitary factor of the update's polar decomposition.
- UV^T where USV^T is the update's SVD.
Using a rescaled version of the second form (UV^T) was proposed in prior work (source).
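These equivalences are easy to check numerically. A minimal NumPy/SciPy sketch (not from the source; inv_fourth_root is an illustrative helper, and the matrix is a random stand-in for an update):

import numpy as np
from scipy.linalg import polar

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))          # random square "update" matrix

# SVD route: nearest orthogonal matrix is U V^T
U, S, Vt = np.linalg.svd(G)
O_svd = U @ Vt

# Polar route: orthogonal factor of the polar decomposition G = O P
O_polar, _ = polar(G)

# Shampoo route with no accumulation: L^{-1/4} G R^{-1/4}, L = GG^T, R = G^TG
def inv_fourth_root(M):
    # M^(-1/4) for a symmetric positive-definite matrix, via eigendecomposition
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.25) @ Q.T

O_shampoo = inv_fourth_root(G @ G.T) @ G @ inv_fourth_root(G.T @ G)

print(np.allclose(O_svd, O_polar), np.allclose(O_svd, O_shampoo))  # True True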
Algorithm
Infrequent Operations (every f steps)
# Update preconditioner matrices (as in Shampoo)
L = beta2 * L + (1 - beta2) * (G @ G.T)
R = beta2 * R + (1 - beta2) * (G.T @ G)
# Find rotation matrices (eigenvectors of the symmetric statistics)
QL = np.linalg.eigh(L)[1]   # find_eigenvectors(L)
QR = np.linalg.eigh(R)[1]   # find_eigenvectors(R)
Every Step Operations:
# Rotate the gradient into the preconditioner's eigenbasis
G_rotated = QL.T @ G @ QR
# Run regular Adam steps in the rotated space
m_rotated = beta1 * m_rotated + (1 - beta1) * G_rotated       # momentum (first moment)
v_rotated = beta2 * v_rotated + (1 - beta2) * G_rotated**2    # second moment
update = m_rotated / (np.sqrt(v_rotated) + eps)               # eps guards against division by zero
# Rotate the update back to the original space
final_update = QL @ update @ QR.T
# Final weight update
weight -= lr * final_update
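Putting the two pieces together, a minimal self-contained NumPy sketch of this update (the function names, state layout, and hyperparameter defaults are illustrative assumptions, not taken from the source; bias correction is omitted as in the pseudocode above):

import numpy as np

def init_state(weight):
    # Illustrative optimizer state (names are assumptions)
    m, n = weight.shape
    return {
        "step": 0,
        "L": np.zeros((m, m)), "R": np.zeros((n, n)),   # preconditioner statistics
        "QL": np.eye(m), "QR": np.eye(n),               # rotation matrices
        "m": np.zeros((m, n)), "v": np.zeros((m, n)),   # rotated Adam moments
    }

def rotated_adam_step(weight, G, state, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, f=10):
    state["step"] += 1

    # Infrequent operations: refresh statistics and rotations every f steps
    if state["step"] % f == 1:
        state["L"] = beta2 * state["L"] + (1 - beta2) * (G @ G.T)
        state["R"] = beta2 * state["R"] + (1 - beta2) * (G.T @ G)
        state["QL"] = np.linalg.eigh(state["L"])[1]
        state["QR"] = np.linalg.eigh(state["R"])[1]

    # Every-step operations: Adam in the rotated coordinate system
    G_rotated = state["QL"].T @ G @ state["QR"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * G_rotated
    state["v"] = beta2 * state["v"] + (1 - beta2) * G_rotated**2
    update = state["m"] / (np.sqrt(state["v"]) + eps)

    # Rotate the update back and apply it
    return weight - lr * (state["QL"] @ update @ state["QR"].T)

# Toy usage on a random weight matrix with stand-in gradients
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((8, 4))
state = init_state(W)
for _ in range(3):
    G = rng.standard_normal((8, 4))   # in practice, a real gradient
    W = rotated_adam_step(W, G, state)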
Vs Shampoo
Overview:
The significance of this work is that it combines the benefits of first-order methods (Adam) and second-order methods (Shampoo) in a principled way, resulting in an optimizer that is both more efficient and easier to tune than existing approaches. The empirical results on large language model training demonstrate substantial practical benefits.