This might be the contour plot for the following quadratic function (recall quadratic forms $x^T A x$ with a positive definite matrix $A$):
import numpy as np

def f(x, y):
    A = np.array([[4, 2],   # This matrix determines the shape
                  [2, 3]])  # Off-diagonal 2's cause the tilt
    point = np.array([x, y])
    return point @ A @ point
But we want:
def f_preconditioned(x, y):
    A = np.array([[1, 0],   # Identity matrix!
                  [0, 1]])  # Makes circular contours
    point = np.array([x, y])
    return point @ A @ point
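How do we get from the first surface to the second? One standard route is a change of variables: substituting $x = A^{-1/2} z$ turns the quadratic form $x^T A x$ into $z^T z$, whose contours are circles. Below is a minimal sketch of that substitution; computing the inverse matrix square root through an eigendecomposition is my choice here, not something prescribed above.

import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# Whitening transform P = A^(-1/2), built from the eigendecomposition of A
eigvals, eigvecs = np.linalg.eigh(A)
P = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def f_in_new_coords(u, v):
    # Evaluate the original quadratic at the preconditioned point x = P @ [u, v]
    point = P @ np.array([u, v])
    return point @ A @ point

# In the new coordinates the form is just u**2 + v**2 (circular contours):
print(f_in_new_coords(1.0, 0.0))  # ~1.0
print(f_in_new_coords(0.0, 1.0))  # ~1.0
print(f_in_new_coords(1.0, 1.0))  # ~2.0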
Types of preconditioning
# 1. Diagonal: just scales each dimension independently
P = np.diag([1/scale1, 1/scale2, ...])
# 2. Full (Newton-style): uses the inverse Hessian
P = np.linalg.inv(H)  # Very expensive for large problems
# 3. Kronecker-factored (Shampoo-style): for a matrix parameter W, instead of one
#    big preconditioner, use a left factor L^(-1/4) and a right factor R^(-1/4)
W_new = W - η * (L^(-1/4) @ G @ R^(-1/4))  # pseudocode: fractional matrix powers
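The fractional matrix powers above are not literal NumPy. A runnable sketch of a single Shampoo-style step might look like the following; the exponential moving average for the statistics L and R and all names here are illustrative simplifications (the published algorithm accumulates running sums instead).

import numpy as np

def matrix_power_sym(M, p, eps=1e-6):
    # Fractional power of a symmetric PSD matrix via its eigendecomposition
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag((vals + eps) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.1, beta=0.9):
    # Accumulate left/right gradient statistics (outer products of G)
    L = beta * L + (1 - beta) * (G @ G.T)
    R = beta * R + (1 - beta) * (G.T @ G)
    # Precondition the gradient from both sides with the -1/4 powers
    G_precond = matrix_power_sym(L, -0.25) @ G @ matrix_power_sym(R, -0.25)
    return W - lr * G_precond, L, R

# Toy usage on a 3x2 weight matrix
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
L, R = 1e-6 * np.eye(3), 1e-6 * np.eye(2)
G = rng.normal(size=(3, 2))
W, L, R = shampoo_step(W, G, L, R)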
Common preconditioners
# Original problem space (stretched):
H = np.array([[100, 0],
              [0,   1]])

# Newton's method (perfect conditioning):
P_newton = np.linalg.inv(H)  # P_newton @ H gives the identity

# Diagonal preconditioning (just scales each axis):
P_diag = np.diag([1/np.sqrt(100), 1])
# How we get this: in Adam, for each parameter, we track
v = beta2 * v + (1 - beta2) * g**2  # Second moment estimate (EMA of squared gradients)
v_corrected = v / (1 - beta2**t)    # Bias correction
# The diagonal preconditioner is effectively:
D = np.diag(1 / np.sqrt(v_corrected + epsilon))
# So for parameters [w1, w2, w3], if their gradients have typically been:
# w1: mostly large gradients around 1.0  -> v1 ≈ 1.0**2 = 1.0
# w2: mostly small gradients around 0.1  -> v2 ≈ 0.1**2 = 0.01
# w3: medium gradients around 0.5        -> v3 ≈ 0.5**2 = 0.25
# the diagonal matrix would look like:
D = np.array([[1/np.sqrt(1.0), 0,               0              ],
              [0,              1/np.sqrt(0.01), 0              ],
              [0,              0,               1/np.sqrt(0.25)]])

# When applied to a gradient:
g = np.array([1.0, 0.1, 0.5])
print(D @ g)  # ≈ [1.0, 1.0, 1.0] -- roughly equalizes the scales
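Putting the pieces together, here is a small self-contained sketch of that diagonal preconditioner being built from noisy gradients with the three typical magnitudes above; the hyperparameter values and the simulated gradient noise are illustrative.

import numpy as np

rng = np.random.default_rng(0)
beta2, epsilon = 0.999, 1e-8
v = np.zeros(3)

# Simulate gradients whose typical magnitudes are roughly 1.0, 0.1 and 0.5
for t in range(1, 1001):
    g = np.array([1.0, 0.1, 0.5]) * (1 + 0.1 * rng.normal(size=3))
    v = beta2 * v + (1 - beta2) * g**2
v_corrected = v / (1 - beta2**t)

D = np.diag(1 / np.sqrt(v_corrected + epsilon))
g = np.array([1.0, 0.1, 0.5])
print(D @ g)  # roughly [1, 1, 1]: the preconditioned gradient has uniform scale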
# Shampoo-like (one factor per side of a matrix parameter; can capture some interaction):
P_left = np.array([[1/10, 0],
                   [0,    1]])
P_right = np.array([[1/np.sqrt(10), 0],
                    [0,             1]])
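As a quick sanity check on how much each option helps, one can compare the condition number of the original curvature matrix with its preconditioned versions; this comparison is an added illustration, not part of the original notes.

import numpy as np

H = np.array([[100.0, 0.0],
              [0.0,   1.0]])
P_newton = np.linalg.inv(H)
P_diag = np.diag([1 / np.sqrt(100), 1])

print(np.linalg.cond(H))             # 100.0 -- badly conditioned
print(np.linalg.cond(P_newton @ H))  # 1.0   -- Newton: perfectly conditioned
print(np.linalg.cond(P_diag @ H))    # 10.0  -- diagonal: much better, not perfect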
Preconditioner shapes
BASIC IDEA OF PRECONDITIONING: When optimizing, some directions in parameter space might be "stretched" or "squeezed", making optimization harder. Preconditioning tries to "reshape" the space to make it more uniform.
Simple 2D Example:
# Imagine a loss surface that's very stretched in one direction:
A = np.array([[100, 0],
              [0,   1]])
x = np.array([1, 1])

# Without preconditioning:
grad = A @ x  # [100, 1] -- very different scales in different directions

# With preconditioning:
P = np.array([[1/10, 0],
              [0,    1]])
precond_grad = P @ A @ x  # [10, 1] -- much better balanced!
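To make the payoff concrete, here is a short added sketch comparing plain and preconditioned gradient descent on the loss 0.5 * x^T A x; the step sizes and iteration count are arbitrary illustrative choices, and the preconditioner used here is the ideal one, A^(-1).

import numpy as np

A = np.array([[100.0, 0.0],
              [0.0,   1.0]])

def run(P, lr, steps=50):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * (P @ (A @ x))  # gradient of 0.5 * x^T A x is A @ x
    return 0.5 * x @ A @ x          # remaining loss

# Plain GD: the step size is capped near 2/100 by the stiff direction,
# so the flat direction barely makes progress.
print(run(np.eye(2), lr=0.01))        # ~0.18

# Preconditioned GD: P @ A = I, so a step size near 1 is stable and
# both directions converge at the same fast rate.
print(run(np.linalg.inv(A), lr=0.9))  # essentially zero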
COMMON PRECONDITIONERS: