Shampoo is a preconditioned gradient descent algorithm in the same family as Adam, but it preconditions parameters in their natural tensor shape instead of treating them as one flat vector of independent entries. The key insight is that parameters in deep learning usually have meaningful structure (matrices for linear layers, 4D tensors for conv layers), and Shampoo exploits that structure when building its preconditioners.
Key Components:
- A left preconditioner L (m×m) that accumulates G @ G.T, the row statistics of the gradients.
- A right preconditioner R (n×n) that accumulates G.T @ G, the column statistics of the gradients.
- A parameter update that multiplies the gradient by L^(-1/4) on the left and R^(-1/4) on the right, scaled by the learning rate eta.
Let me show a simplified version for the matrix case:
import numpy as np
from scipy.linalg import fractional_matrix_power

# Initialize (m, n, epsilon, learning_rate, num_steps, compute_gradient assumed defined)
W = np.zeros((m, n))        # Parameters, kept in their natural matrix shape
L = epsilon * np.eye(m)     # Left preconditioner statistics (m x m)
R = epsilon * np.eye(n)     # Right preconditioner statistics (n x n)
eta = learning_rate

for t in range(num_steps):
    # Get gradient (same shape as W)
    G = compute_gradient(W)

    # Update preconditioners by accumulating second-moment statistics
    L = L + G @ G.T         # Left statistics: row correlations of the gradient
    R = R + G.T @ G         # Right statistics: column correlations of the gradient

    # Update parameters with the preconditioned gradient
    W = W - eta * fractional_matrix_power(L, -0.25) @ G @ fractional_matrix_power(R, -0.25)
Simple Example: Let's say you have a 2×2 matrix parameter W. At each step:
1. Compute the 2×2 gradient G.
2. Add G @ G.T to the 2×2 left preconditioner L and G.T @ G to the 2×2 right preconditioner R.
3. Update W by subtracting eta * L^(-1/4) @ G @ R^(-1/4).
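Here is a minimal runnable sketch of one such step. The parameter values, the gradient, and the constants eta and epsilon are all made up purely to show the shapes and operations involved:

import numpy as np
from scipy.linalg import fractional_matrix_power

eta, epsilon = 0.1, 1e-4                    # hypothetical step size and damping
W = np.array([[1.0, 2.0], [3.0, 4.0]])      # 2x2 parameter matrix (made-up values)
L = epsilon * np.eye(2)                     # left preconditioner
R = epsilon * np.eye(2)                     # right preconditioner

G = np.array([[0.5, -1.0], [0.2, 0.3]])     # pretend this is the gradient at W

L = L + G @ G.T                             # accumulate left statistics (2x2)
R = R + G.T @ G                             # accumulate right statistics (2x2)

# Preconditioned update: L^(-1/4) on the left, R^(-1/4) on the right
W = W - eta * fractional_matrix_power(L, -0.25) @ G @ fractional_matrix_power(R, -0.25)
print(W)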
The key advantage over Adam is that Shampoo captures correlations between parameters that Adam's per-entry (diagonal) statistics ignore, while the Kronecker structure keeps this tractable. For a matrix W ∈ R^(m×n), a full second-order preconditioner over the flattened parameters would be mn×mn, whereas Shampoo only stores an m×m left and an n×n right preconditioner that model correlations across rows and across columns of W.
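To make that storage comparison concrete, here is a small back-of-the-envelope calculation; the 4096×1024 shape is just an arbitrary example of a transformer-style weight matrix, not something from the text above:

m, n = 4096, 1024                      # example weight-matrix shape (arbitrary)

full_preconditioner = (m * n) ** 2     # a full second-order preconditioner over the flattened W
adam_state = 2 * m * n                 # Adam: per-entry first and second moments (diagonal)
shampoo_state = m * m + n * n          # Shampoo: m x m left and n x n right preconditioners

print(f"full matrix preconditioner: {full_preconditioner:,} entries")
print(f"Adam state:                 {adam_state:,} entries")
print(f"Shampoo preconditioners:    {shampoo_state:,} entries")

Shampoo's state is somewhat larger than Adam's here, but it is many orders of magnitude smaller than a full preconditioner while still modeling row and column correlations.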
The -1/4 power on each preconditioner is what makes this the matrix analog of Adam: applying L^(-1/4) on the left and R^(-1/4) on the right amounts to a combined -1/2 power of a single Kronecker-structured preconditioner, mirroring the inverse-square-root scaling that AdaGrad and Adam apply to their accumulated squared gradients.
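That equivalence can be checked numerically: flattening the two-sided update row-major gives the same vector as applying (L^(1/2) ⊗ R^(1/2))^(-1/2) to the flattened gradient. The following is a sketch with random matrices, not code from the original explanation:

import numpy as np
from scipy.linalg import fractional_matrix_power

rng = np.random.default_rng(0)
m, n = 3, 4
G = rng.standard_normal((m, n))
L = G @ G.T + 1e-4 * np.eye(m)        # left statistics (symmetric positive definite)
R = G.T @ G + 1e-4 * np.eye(n)        # right statistics (symmetric positive definite)

# Shampoo's two-sided update direction
two_sided = fractional_matrix_power(L, -0.25) @ G @ fractional_matrix_power(R, -0.25)

# Single Kronecker-structured preconditioner raised to the -1/2 power
H = np.kron(fractional_matrix_power(L, 0.5), fractional_matrix_power(R, 0.5))
flat = fractional_matrix_power(H, -0.5) @ G.reshape(-1)

print(np.allclose(two_sided.reshape(-1), flat))   # True: the combined exponent is -1/2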