This might be the contour plot for the following quadratic function (recall quadratic forms $x^T A x$ with a positive definite matrix $A$):
import numpy as np

def f(x, y):
    A = np.array([[4, 2],   # This matrix determines the shape
                  [2, 3]])  # Off-diagonal 2's cause the tilt
    point = np.array([x, y])
    return point @ A @ point
But we want:
def f_preconditioned(x, y):
    A = np.array([[1, 0],   # Identity matrix!
                  [0, 1]])  # Makes circular contours
    point = np.array([x, y])
    return point @ A @ point
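How do we get from the first surface to the second? One standard route is a change of variables: substituting $x = A^{-1/2} z$ turns the quadratic form $x^T A x$ into $z^T z$, whose contours are circles. Below is a minimal sketch of that substitution; computing the inverse matrix square root through an eigendecomposition is my choice here, not something prescribed above.

import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# Whitening transform P = A^(-1/2), built from the eigendecomposition of A
eigvals, eigvecs = np.linalg.eigh(A)
P = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

def f_in_new_coords(u, v):
    # Evaluate the original quadratic at the preconditioned point x = P @ [u, v]
    point = P @ np.array([u, v])
    return point @ A @ point

# In the new coordinates the form is just u**2 + v**2 (circular contours):
print(f_in_new_coords(1.0, 0.0))  # ~1.0
print(f_in_new_coords(0.0, 1.0))  # ~1.0
print(f_in_new_coords(1.0, 1.0))  # ~2.0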
Types of preconditioning
# 1. Diagonal: just scales each dimension independently
P = np.diag([1/scale1, 1/scale2, ...])
# 2. Full (Newton-style): uses the inverse Hessian
P = np.linalg.inv(H)  # Very expensive for large problems
# 3. Kronecker-factored (Shampoo-style): for a matrix parameter W, instead of one
#    big preconditioner, use a left factor L^(-1/4) and a right factor R^(-1/4)
W_new = W - η * (L^(-1/4) @ G @ R^(-1/4))  # pseudocode: fractional matrix powers
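The fractional matrix powers above are not literal NumPy. A runnable sketch of a single Shampoo-style step might look like the following; the exponential moving average for the statistics L and R and all names here are illustrative simplifications (the published algorithm accumulates running sums instead).

import numpy as np

def matrix_power_sym(M, p, eps=1e-6):
    # Fractional power of a symmetric PSD matrix via its eigendecomposition
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag((vals + eps) ** p) @ vecs.T

def shampoo_step(W, G, L, R, lr=0.1, beta=0.9):
    # Accumulate left/right gradient statistics (outer products of G)
    L = beta * L + (1 - beta) * (G @ G.T)
    R = beta * R + (1 - beta) * (G.T @ G)
    # Precondition the gradient from both sides with the -1/4 powers
    G_precond = matrix_power_sym(L, -0.25) @ G @ matrix_power_sym(R, -0.25)
    return W - lr * G_precond, L, R

# Toy usage on a 3x2 weight matrix
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
L, R = 1e-6 * np.eye(3), 1e-6 * np.eye(2)
G = rng.normal(size=(3, 2))
W, L, R = shampoo_step(W, G, L, R)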
Common preconditioners
# Original problem space (stretched):
H = np.array([[100, 0],
              [0,   1]])

# Newton's method (perfect conditioning):
P_newton = np.linalg.inv(H)  # P_newton @ H gives the identity

# Diagonal preconditioning (just scales each axis):
P_diag = np.diag([1/np.sqrt(100), 1])
# How we get this: in Adam, for each parameter, we track
v = beta2 * v + (1 - beta2) * g**2  # Second moment estimate (EMA of squared gradients)
v_corrected = v / (1 - beta2**t)    # Bias correction
# The diagonal preconditioner is effectively:
D = np.diag(1 / np.sqrt(v_corrected + epsilon))
# So for parameters [w1, w2, w3], if their gradients have typically been:
# w1: mostly large gradients around 1.0  -> v1 ≈ 1.0**2 = 1.0
# w2: mostly small gradients around 0.1  -> v2 ≈ 0.1**2 = 0.01
# w3: medium gradients around 0.5        -> v3 ≈ 0.5**2 = 0.25
# the diagonal matrix would look like:
D = np.array([[1/np.sqrt(1.0), 0,               0              ],
              [0,              1/np.sqrt(0.01), 0              ],
              [0,              0,               1/np.sqrt(0.25)]])

# When applied to a gradient:
g = np.array([1.0, 0.1, 0.5])
print(D @ g)  # ≈ [1.0, 1.0, 1.0] -- roughly equalizes the scales
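Putting the pieces together, here is a small self-contained sketch of that diagonal preconditioner being built from noisy gradients with the three typical magnitudes above; the hyperparameter values and the simulated gradient noise are illustrative.

import numpy as np

rng = np.random.default_rng(0)
beta2, epsilon = 0.999, 1e-8
v = np.zeros(3)

# Simulate gradients whose typical magnitudes are roughly 1.0, 0.1 and 0.5
for t in range(1, 1001):
    g = np.array([1.0, 0.1, 0.5]) * (1 + 0.1 * rng.normal(size=3))
    v = beta2 * v + (1 - beta2) * g**2
v_corrected = v / (1 - beta2**t)

D = np.diag(1 / np.sqrt(v_corrected + epsilon))
g = np.array([1.0, 0.1, 0.5])
print(D @ g)  # roughly [1, 1, 1]: the preconditioned gradient has uniform scale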
# Shampoo-like (one factor per side of a matrix parameter; can capture some interaction):
P_left = np.array([[1/10, 0],
                   [0,    1]])
P_right = np.array([[1/np.sqrt(10), 0],
                    [0,             1]])
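As a quick sanity check on how much each option helps, one can compare the condition number of the original curvature matrix with its preconditioned versions; this comparison is an added illustration, not part of the original notes.

import numpy as np

H = np.array([[100.0, 0.0],
              [0.0,   1.0]])
P_newton = np.linalg.inv(H)
P_diag = np.diag([1 / np.sqrt(100), 1])

print(np.linalg.cond(H))             # 100.0 -- badly conditioned
print(np.linalg.cond(P_newton @ H))  # 1.0   -- Newton: perfectly conditioned
print(np.linalg.cond(P_diag @ H))    # 10.0  -- diagonal: much better, not perfect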
Preconditioner shapes
BASIC IDEA OF PRECONDITIONING: When optimizing, some directions in parameter space might be "stretched" or "squeezed", making optimization harder. Preconditioning tries to "reshape" the space to make it more uniform.
Simple 2D Example:
# Imagine a loss surface that's very stretched in one direction:
A = np.array([[100, 0],
              [0,   1]])
x = np.array([1, 1])

# Without preconditioning:
grad = A @ x  # [100, 1] -- very different scales in different directions

# With preconditioning:
P = np.array([[1/10, 0],
              [0,    1]])
precond_grad = P @ A @ x  # [10, 1] -- much better balanced!
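To make the payoff concrete, here is a short added sketch comparing plain and preconditioned gradient descent on the loss 0.5 * x^T A x; the step sizes and iteration count are arbitrary illustrative choices, and the preconditioner used here is the ideal one, A^(-1).

import numpy as np

A = np.array([[100.0, 0.0],
              [0.0,   1.0]])

def run(P, lr, steps=50):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - lr * (P @ (A @ x))  # gradient of 0.5 * x^T A x is A @ x
    return 0.5 * x @ A @ x          # remaining loss

# Plain GD: the step size is capped near 2/100 by the stiff direction,
# so the flat direction barely makes progress.
print(run(np.eye(2), lr=0.01))        # ~0.18

# Preconditioned GD: P @ A = I, so a step size near 1 is stable and
# both directions converge at the same fast rate.
print(run(np.linalg.inv(A), lr=0.9))  # essentially zero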
COMMON PRECONDITIONERS: