This is about the nuts and bolts of training ML models.
Training/optimization advice/wisdom
Coding mistakes
Common data processing tasks
| | Fixes high bias | Fixes high var |
|---|---|---|
| Data | | More |
| Features | More | Fewer |
| Params | More | Fewer |
| Regularization | Less | More |
Overfitting
ML project structure
Training issues and optimizations
Dealing with vanishing/exploding gradients
Debugging activations
For a simple NN like this:
layers = [
  Linear(n_embd * block_size, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, vocab_size, bias=False), BatchNorm1d(vocab_size),
]
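The diagnostics below read each layer's output from a .out attribute, so they assume makemore-style hand-rolled layer classes rather than torch.nn modules. A minimal sketch of one such layer, as an assumption about the setup:
import torch

class Tanh:
  def __call__(self, x):
    self.out = torch.tanh(x) # stash the activation so the histograms below can read it
    return self.out
  def parameters(self):
    return [] # tanh has no trainable parameters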
Expect these patterns in the activations, grads, and weights: the tanh saturation should stay small but not go all the way to 0 (zero saturation means the activations are shrinking), and all layers should look roughly the same rather than vanishing or exploding:
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i, layer in enumerate(layers[:-1]): # note: exclude the output layer
  if isinstance(layer, Tanh):
    t = layer.out
    print('layer %d (%10s): mean %+.2f, std %.2f, saturated: %.2f%%' % (i, layer.__class__.__name__, t.mean(), t.std(), (t.abs() > 0.97).float().mean()*100))
    hy, hx = torch.histogram(t, density=True)
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'layer {i} ({layer.__class__.__name__})')
plt.legend(legends);
plt.title('activation distribution')
# visualize histograms of the gradients flowing into each tanh's output
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i, layer in enumerate(layers[:-1]): # note: exclude the output layer
  if isinstance(layer, Tanh):
    t = layer.out.grad
    print('layer %d (%10s): mean %+f, std %e' % (i, layer.__class__.__name__, t.mean(), t.std()))
    hy, hx = torch.histogram(t, density=True)
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'layer {i} ({layer.__class__.__name__})')
plt.legend(legends);
plt.title('gradient distribution')
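One prerequisite for the gradient histograms above: layer.out is a non-leaf tensor, so PyTorch discards its .grad unless you ask it to keep it before calling backward. A hedged sketch of that part of the step (x, Yb, layers, and parameters stand in for whatever the surrounding training code uses):
import torch.nn.functional as F

# forward pass through the stack (x is the flattened input batch, Yb the targets)
for layer in layers:
  x = layer(x)
loss = F.cross_entropy(x, Yb)

# keep gradients on the intermediate outputs so the histograms can read them
for layer in layers:
  layer.out.retain_grad()
for p in parameters:
  p.grad = None
loss.backward()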
# visualize histograms of the gradients on the weight matrices
plt.figure(figsize=(20, 4)) # width and height of the plot
legends = []
for i, p in enumerate(parameters):
  t = p.grad
  if p.ndim == 2: # only the weight matrices; skip biases / batchnorm gains
    print('weight %10s | mean %+f | std %e | grad:data ratio %e' % (tuple(p.shape), t.mean(), t.std(), t.std() / p.std()))
    hy, hx = torch.histogram(t, density=True)
    plt.plot(hx[:-1].detach(), hy.detach())
    legends.append(f'{i} {tuple(p.shape)}')
plt.legend(legends)
plt.title('weights gradient distribution');
Visualize log10( SD of the update step (lr * grad) / SD of the weights ) for the linear layers, over training steps. Inside the training loop, right after the parameter update, collect the ratios:
ud.append([((lr*p.grad).std() / p.data.std()).log10().item() for p in parameters])
...
plt.figure(figsize=(20, 4))
legends = []
for i, p in enumerate(parameters):
  if p.ndim == 2:
    plt.plot([ud[j][i] for j in range(len(ud))])
    legends.append('param %d' % i)
plt.plot([0, len(ud)], [-3, -3], 'k') # these ratios should be ~1e-3; indicate that on the plot
plt.legend(legends);
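For context, a rough sketch of where the ud ("update-to-data ratio") collection sits in the training loop; the step count, learning-rate schedule, and plain-SGD update are placeholders for whatever the actual loop uses:
ud = []
for step in range(max_steps): # max_steps is a placeholder
  # ... forward pass, loss, loss.backward() as above ...
  lr = 0.1 if step < 100000 else 0.01 # illustrative step-decay schedule
  with torch.no_grad():
    for p in parameters:
      p += -lr * p.grad # plain SGD update
    # log10 of (size of this update) / (size of the data it updates), per parameter tensor
    ud.append([((lr * p.grad).std() / p.data.std()).log10().item() for p in parameters])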
Initialization
Usually Gaussian
If X and W are unit normal, then XW will have SD $\sqrt{d} > 1$, but we want SD = 1, so scale the initial W by $\sqrt{1/d}$, where d is the fan-in dimension
But there’s also the activation nonlinearity, and you want the activation to have SD 1 too.
Parameterized Kaiming init is probably the most common init now (and is what PyTorch uses by default). The gain depends on the nonlinearity: e.g. for ReLU, Kaiming He's init uses SD $\sqrt{2/d}$ (because ReLU discards half the distribution). See the sketch after this list.
GPT-NeoX uses “small init” $\sqrt{\frac{2}{5d}}$ from Transformers without Tears
Biases are usually fine to initialize to zero
Residual branches: initialize the last layer of each residual branch to zero (or scale it down), so each block starts close to the identity
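A minimal sketch of the fan-in scaling above (the tanh gain 5/3 and ReLU gain sqrt(2) come from PyTorch's calculate_gain; the layer sizes are placeholders):
import torch

fan_in, fan_out = 200, 200 # placeholder layer sizes
gain = torch.nn.init.calculate_gain('tanh') # 5/3 for tanh, sqrt(2) for relu
W = torch.randn(fan_in, fan_out) * gain / fan_in**0.5 # SD ~ gain / sqrt(fan_in)

# or, with nn.Linear and the built-in initializer:
lin = torch.nn.Linear(fan_in, fan_out, bias=False)
torch.nn.init.kaiming_normal_(lin.weight, nonlinearity='relu') # SD = sqrt(2 / fan_in)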
Distributed training
Pretraining
Approaches
Triplet/contrastive objective: require $\lVert \text{anchor} - \text{pos} \rVert - \lVert \text{anchor} - \text{neg} \rVert < -\alpha$, i.e. the anchor must be closer to the positive than to the negative by at least the margin $\alpha$
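A hedged sketch of the corresponding hinge loss in PyTorch (the embedding tensors and margin value are illustrative):
import torch
import torch.nn.functional as F

def triplet_loss(anchor, pos, neg, margin=0.2):
  d_pos = (anchor - pos).norm(dim=-1) # distance to the positive
  d_neg = (anchor - neg).norm(dim=-1) # distance to the negative
  return F.relu(d_pos - d_neg + margin).mean() # zero once d_pos - d_neg < -margin

# PyTorch also ships this directly: torch.nn.TripletMarginLoss(margin=0.2)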
LR schedules
Visualizations (source)
Even though Adam effectively gives each parameter its own dynamic learning rate, it can still make sense to use a scheduler on what is effectively the global cap on those learning rates (source); see the sketch below
From https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/guide2/Research_Projects.html
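A minimal sketch of that combination, assuming a placeholder model and a cosine schedule (any of the torch.optim.lr_scheduler classes would do):
import torch

model = torch.nn.Linear(128, 10) # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4) # per-parameter adaptive steps
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000) # decays the global base lr

for step in range(1000):
  loss = model(torch.randn(32, 128)).pow(2).mean() # dummy objective
  opt.zero_grad()
  loss.backward()
  opt.step()
  sched.step() # Adam adapts within the (now decaying) base lr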
Regularization techniques
Instruction fine tuning
Techniques for scaling to models larger than available memory
https://huggingface.co/docs/transformers/v4.18.0/en/performance
Activation checkpointing: rather than materializing all activations for backprop, recompute some of them during the backward pass
Gradient accumulation: accumulate gradients over several micro-batches before each optimizer step, so the effective batch size is larger than what fits in memory (see the sketch after this list)
Gradient checkpointing (another name for activation checkpointing): train models ~10x larger than your available memory
Mixed precision training: compute the forward and backward passes in fp16/bf16 rather than fp32
Low-memory optimizers: Adafactor, 8-bit Adam
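A rough sketch combining gradient accumulation with bf16 mixed precision via autocast; the model, batch shapes, and accumulation factor are placeholders, a CUDA device is assumed, and fp16 would additionally want a GradScaler:
import torch

model = torch.nn.Linear(512, 512).cuda() # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8 # effective batch = micro-batch size * 8

for i, micro_batch in enumerate(torch.randn(64, 16, 512).cuda()):
  with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    loss = model(micro_batch).pow(2).mean() # forward runs in bf16 where safe
  (loss / accum_steps).backward() # gradients accumulate in the fp32 parameters' .grad
  if (i + 1) % accum_steps == 0:
    opt.step()
    opt.zero_grad(set_to_none=True) # free gradient memory between updates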
Floating point types