
Elaboration

BASIC IDEA OF PRECONDITIONING: When optimizing, some directions in parameter space can be "stretched" or "squeezed" relative to others (the loss surface is ill-conditioned), which slows gradient descent down. Preconditioning multiplies the gradient by a matrix P that "reshapes" the space so the curvature looks more uniform.

Simple 2D Example:

# Imagine a loss surface that's very stretched in one direction:
import numpy as np

A = np.array([[100, 0],
              [0, 1]])
x = np.array([1, 1])

# Without preconditioning:
grad = A @ x  # [100, 1]
# Problem: very different scales in different directions

# With preconditioning:
P = np.array([[1/10, 0],
              [0, 1]])
precond_grad = P @ A @ x  # [10, 1]
# Much better balanced!
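
To see the effect on convergence, here is a minimal runnable sketch on the same quadratic loss f(x) = 0.5 * x.T @ A @ x. The step sizes and the choice P = A^(-1) are illustrative assumptions, not part of the example above:

import numpy as np

# Quadratic loss f(x) = 0.5 * x.T @ A @ x; its gradient is A @ x.
A = np.array([[100.0, 0.0],
              [0.0, 1.0]])

def gradient_descent(P, lr, steps=100):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * (P @ (A @ x))
    return x

# Plain GD: the step size is limited by the steepest direction
# (curvature 100), so the flat direction (curvature 1) crawls.
print(gradient_descent(P=np.eye(2), lr=0.01))      # ~[0, 0.37]

# Preconditioned GD with P = A^(-1): curvature looks uniform,
# so a large step size is stable and both directions converge fast.
print(gradient_descent(P=np.linalg.inv(A), lr=1.0))  # [0, 0]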

TYPES OF PRECONDITIONING:

  1. Diagonal Preconditioning
# Just scales each dimension independently
P = np.diag([1/scale1, 1/scale2, ...])

  2. Full Matrix Preconditioning (like Newton's method)
# Uses the inverse Hessian
P = H^(-1)
# Very expensive for large problems

  3. Shampoo's Approach (Kronecker-factored)
# For a matrix parameter W, instead of one big preconditioner,
# use a left preconditioner L^(-1/4) and a right preconditioner R^(-1/4)
# (see the runnable sketch after this list):
W_new = W - η * (L^(-1/4) @ G @ R^(-1/4))
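
Here is a minimal NumPy sketch of that Shampoo-style update for a single matrix parameter W, accumulating the side statistics L and R from the gradient G. The damping term eps, the learning rate, and the toy random gradients are illustrative assumptions, not from the original:

import numpy as np

def matrix_power(M, p, eps=1e-6):
    # Fractional power of a symmetric PSD matrix via eigendecomposition,
    # with a small damping term for numerical stability.
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag((vals + eps) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.01):
    # Accumulate second-moment statistics on each side of the gradient.
    L += G @ G.T   # left statistics,  shape (m, m)
    R += G.T @ G   # right statistics, shape (n, n)
    # Precondition the gradient from both sides with the -1/4 powers.
    update = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W - lr * update, L, R

# Usage on a toy 4x3 parameter matrix:
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
L, R = np.zeros((4, 4)), np.zeros((3, 3))
for _ in range(3):
    G = rng.standard_normal((4, 3))   # stand-in gradient
    W, L, R = shampoo_step(W, G, L, R)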

COMMON PRECONDITIONERS:

  1. Newton's Method: P = H^(-1)
  2. BFGS/L-BFGS: P ≈ H^(-1), built up iteratively from gradient history instead of computed exactly
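
L-BFGS is available off the shelf in SciPy; here is a minimal usage example on the same stretched quadratic from above (the starting point is an illustrative assumption):

import numpy as np
from scipy.optimize import minimize

A = np.array([[100.0, 0.0],
              [0.0, 1.0]])

f = lambda x: 0.5 * x @ A @ x   # the stretched quadratic from above
grad = lambda x: A @ x

# L-BFGS builds its inverse-Hessian approximation internally,
# so the ill-conditioning is handled without an explicit P.
result = minimize(f, x0=np.array([1.0, 1.0]), jac=grad, method='L-BFGS-B')
print(result.x)  # close to the optimum at [0, 0]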