Record holder for nanoGPT (source) TODO
Relation to Shampoo: (source)
A connection between Muon and Shampoo (https://arxiv.org/abs/1802.09568) is that in the momentum=0 case, running Shampoo with the preconditioners updated every step and with no accumulation also yields the nearest orthogonal matrix as its update: with L = GG^T and R = G^TG, the Shampoo update L^{-1/4} G R^{-1/4} simplifies to UV^T for G = USV^T. Computing the update this way has a slower runtime, however.
The nearest orthogonal matrix is also equivalent to:
- The semi-unitary factor of the update's polar decomposition.
- UV^T where USV^T is the update's SVD.
Using a rescaled version of the second form (UV^T) was proposed in prior work (source).
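These equivalences are easy to check numerically. A minimal NumPy/SciPy sketch (not from the source; inv_fourth_root is an illustrative helper, and the matrix is a random stand-in for an update):

import numpy as np
from scipy.linalg import polar

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))          # random square "update" matrix

# SVD route: nearest orthogonal matrix is U V^T
U, S, Vt = np.linalg.svd(G)
O_svd = U @ Vt

# Polar route: orthogonal factor of the polar decomposition G = O P
O_polar, _ = polar(G)

# Shampoo route with no accumulation: L^{-1/4} G R^{-1/4}, L = GG^T, R = G^TG
def inv_fourth_root(M):
    # M^(-1/4) for a symmetric positive-definite matrix, via eigendecomposition
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.25) @ Q.T

O_shampoo = inv_fourth_root(G @ G.T) @ G @ inv_fourth_root(G.T @ G)

print(np.allclose(O_svd, O_polar), np.allclose(O_svd, O_shampoo))  # True True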
Algorithm
Infrequent Operations (every f steps)
# Update preconditioner matrices (as in Shampoo)
L = beta2 * L + (1 - beta2) * (G @ G.T)
R = beta2 * R + (1 - beta2) * (G.T @ G)
# Find rotation matrices (eigenvectors of the symmetric statistics)
QL = np.linalg.eigh(L)[1]   # find_eigenvectors(L)
QR = np.linalg.eigh(R)[1]   # find_eigenvectors(R)
Every Step Operations:
# Rotate the gradient into the preconditioner's eigenbasis
G_rotated = QL.T @ G @ QR
# Run regular Adam steps in the rotated space
m_rotated = beta1 * m_rotated + (1 - beta1) * G_rotated       # momentum (first moment)
v_rotated = beta2 * v_rotated + (1 - beta2) * G_rotated**2    # second moment
update = m_rotated / (np.sqrt(v_rotated) + eps)               # eps guards against division by zero
# Rotate the update back to the original space
final_update = QL @ update @ QR.T
# Final weight update
weight -= lr * final_update
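Putting the two pieces together, a minimal self-contained NumPy sketch of this update (the function names, state layout, and hyperparameter defaults are illustrative assumptions, not taken from the source; bias correction is omitted as in the pseudocode above):

import numpy as np

def init_state(weight):
    # Illustrative optimizer state (names are assumptions)
    m, n = weight.shape
    return {
        "step": 0,
        "L": np.zeros((m, m)), "R": np.zeros((n, n)),   # preconditioner statistics
        "QL": np.eye(m), "QR": np.eye(n),               # rotation matrices
        "m": np.zeros((m, n)), "v": np.zeros((m, n)),   # rotated Adam moments
    }

def rotated_adam_step(weight, G, state, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8, f=10):
    state["step"] += 1

    # Infrequent operations: refresh statistics and rotations every f steps
    if state["step"] % f == 1:
        state["L"] = beta2 * state["L"] + (1 - beta2) * (G @ G.T)
        state["R"] = beta2 * state["R"] + (1 - beta2) * (G.T @ G)
        state["QL"] = np.linalg.eigh(state["L"])[1]
        state["QR"] = np.linalg.eigh(state["R"])[1]

    # Every-step operations: Adam in the rotated coordinate system
    G_rotated = state["QL"].T @ G @ state["QR"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * G_rotated
    state["v"] = beta2 * state["v"] + (1 - beta2) * G_rotated**2
    update = state["m"] / (np.sqrt(state["v"]) + eps)

    # Rotate the update back and apply it
    return weight - lr * (state["QL"] @ update @ state["QR"].T)

# Toy usage on a random weight matrix with stand-in gradients
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((8, 4))
state = init_state(W)
for _ in range(3):
    G = rng.standard_normal((8, 4))   # in practice, a real gradient
    W = rotated_adam_step(W, G, state)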
Vs Shampoo
Overview:
The significance of this work is that it combines the benefits of first-order methods (Adam) and second-order methods (Shampoo) in a principled way, resulting in an optimizer that is both more efficient and easier to tune than existing approaches. The empirical results on large language model training demonstrate substantial practical benefits.