• References

    • https://github.com/jacobhilton/deep_learning_curriculum from OpenAI’s Jacob Hilton
    • Simplified and annotated pseudocode: various sketches of architectures/algorithms
    • A Hackers' Guide to Language Models - YouTube: high level applied overview
    • Neural Networks: Zero To Hero: Quick trip from DNN through transformers. Manual backprop.
    • fast.ai – Making neural nets uncool again: a bit too much time on teaching Python, but good, concrete notebook-style lessons through Stable Diffusion. Doesn't cover language or multimodal models.
    • karpathy/minGPT: A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training: deprecated in favor of nanoGPT, which is still pretty simple and clean but more practical
    • CS25 I Stanford Seminar - Transformers United 2023: Introduction to Transformers w/ Andrej Karpathy - YouTube
  • Neural network anatomy

    • Head: top of the network, often a dense network for causal LMs
    • Bottom: where the data comes in
    • Multi-headed: in the DAG of layers, multiple heads, maybe for prediction, or maybe as internal nodes (multi-headed attention)
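    • A toy sketch of this anatomy (made-up sizes and names): data comes in at the bottom, flows through a shared backbone, and out through two heads at the top
      from torch import nn

      class TwoHeadedNet(nn.Module):
          def __init__(self, d_in=32, d_hidden=64, n_classes=10):
              super().__init__()
              self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())  # "bottom"
              self.cls_head = nn.Linear(d_hidden, n_classes)  # head 1: classification
              self.reg_head = nn.Linear(d_hidden, 1)          # head 2: regression

          def forward(self, x):
              h = self.backbone(x)
              return self.cls_head(h), self.reg_head(h)
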
  • Seminal models

    • LeNet
    • AlexNet
    • UNet
  • Deep generative modeling [note: old]


  • Neural network techniques

    • Skip connections: effectively remove some of the depth of the network, especially early in training, when the residual block's pathway is initialized to (roughly) zero so it has no effect at first

      • Introduced in Deep Residual Learning for Image Recognition
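      • A minimal sketch of the zero-init idea (not the exact block from the paper, which also uses batch norm)
        import torch
        from torch import nn

        class ResBlock(nn.Module):
            def __init__(self, nf):
                super().__init__()
                self.conv1 = nn.Conv2d(nf, nf, 3, padding=1)
                self.conv2 = nn.Conv2d(nf, nf, 3, padding=1)
                nn.init.zeros_(self.conv2.weight)  # zero-init the last layer so the
                nn.init.zeros_(self.conv2.bias)    # block path has no effect at first
                self.act = nn.ReLU()

            def forward(self, x):
                return x + self.conv2(self.act(self.conv1(x)))  # the skip connection

        x = torch.randn(2, 16, 8, 8)
        assert torch.allclose(ResBlock(16)(x), x)  # identity at initialization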


  • Convolutions

    • 3D vs (2+1)D convolutions: often used for video, where the extra dimension is time
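    • A sketch of the difference (channel sizes are made up; real (2+1)D blocks also tune the intermediate width)
      import torch
      from torch import nn

      # full 3D conv: one kernel mixes time and space together
      conv3d = nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1)

      # (2+1)D factorization: 2D spatial conv per frame, then 1D conv across time
      conv2plus1d = nn.Sequential(
          nn.Conv3d(16, 32, kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # spatial only
          nn.ReLU(),                                                    # extra nonlinearity between the two
          nn.Conv3d(32, 32, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # temporal only
      )

      x = torch.randn(2, 16, 8, 32, 32)  # (N, C, T, H, W) video batch
      print(conv3d(x).shape, conv2plus1d(x).shape)  # both (2, 32, 8, 32, 32)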


  • Normalization

    • Placement
      • Commonly placed in between linear/weight nodes and the activation nonlinearity node
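      • A sketch of that ordering (bias=False on the first linear since, per the note below, the norm's shift makes a preceding bias redundant)
        from torch import nn

        block = nn.Sequential(
            nn.Linear(128, 256, bias=False),  # linear/weight node
            nn.BatchNorm1d(256),              # normalization in between
            nn.ReLU(),                        # activation nonlinearity
            nn.Linear(256, 10),
        )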


    • Batch normalization: normalize over the batch to unit Gaussian, then scale/shift by learnable params (so not forced to stick to unit Gaussian).

      • The learned scale/shift allows it to move around. The point is normalizing the initialization, not locking it down and preventing the NN from learning.
      • Karpathy says best to avoid it.
      gain * (x - lerp(x.mean([0,2,3]))) / lerp(x.std([0,2,3])) + bias # NCHW; lerp = running average
      
      import torch
      from torch import nn

      class BatchNorm(nn.Module):
          def __init__(self, nf, mom=0.1, eps=1e-5):
              super().__init__()
              # NB: pytorch bn mom is opposite of what you'd expect
              self.mom,self.eps = mom,eps
              # learnable scale/shift, so the layer isn't forced to stay unit Gaussian
              self.mults = nn.Parameter(torch.ones (nf,1,1))
              self.adds  = nn.Parameter(torch.zeros(nf,1,1))
              # running stats (lerp'd during training, used as-is at inference)
              self.register_buffer('vars',  torch.ones(1,nf,1,1))
              self.register_buffer('means', torch.zeros(1,nf,1,1))
      
          def update_stats(self, x):
              m = x.mean((0,2,3), keepdim=True)
              v = x.var ((0,2,3), keepdim=True)
              self.means.lerp_(m, self.mom)
              self.vars.lerp_ (v, self.mom)
              return m,v
              
          def forward(self, x):
              if self.training:
                  with torch.no_grad(): m,v = self.update_stats(x)
              else: m,v = self.means,self.vars
              x = (x-m) / (v+self.eps).sqrt()
              return x*self.mults + self.adds
      
      


      • The jitter that depends on other examples in your batch is ugly but is also a form of regularization/noise.
      • Can use lerp (an exponential moving average) to smooth the stats over multiple batches
      • At inference, can rely on the lerp'd mean/SD, or can compute the mean/SD over the whole dataset once and use that as a constant
      • Its bias is redundant with any bias in layer before it
    • Layer normalization: normalize each example (row) independently, across all of its features: x - x.mean([1,2,3]) # NCHW, then scale and shift by learned single params x*mul+add.

      • Like BN, allow the NN to shift/scale the distribution anywhere it wants, starting from a normalized one—the point is normalizing the initialization, not locking it down and preventing the NN from learning.
      import torch
      from torch import nn

      class LayerNorm(nn.Module):
          def __init__(self, dummy, eps=1e-5):  # dummy just keeps the signature parallel to BatchNorm
              super().__init__()
              self.eps = eps
              # single learned scale/shift (scalars, not per-channel)
              self.mult = nn.Parameter(torch.tensor(1.))
              self.add  = nn.Parameter(torch.tensor(0.))
      
          def forward(self, x):
              m = x.mean((1,2,3), keepdim=True)
              v = x.var ((1,2,3), keepdim=True)
              x = (x-m) / ((v+self.eps).sqrt())
              return x*self.mult + self.add
      


  • Dropout: randomly masks out some nodes on each forward pass, so you effectively train an (overlapping) ensemble of subnetworks
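    • A minimal sketch of inverted dropout (which is what nn.Dropout implements): drop with prob p at train time and rescale, so nothing changes at inference
      import torch
      from torch import nn

      class Dropout(nn.Module):
          def __init__(self, p=0.1):
              super().__init__()
              self.p = p

          def forward(self, x):
              if not self.training or self.p == 0.:
                  return x
              mask = (torch.rand_like(x) > self.p).float()  # zero out ~p of the activations
              return x * mask / (1 - self.p)  # rescale so E[output] == E[input]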


  • Dropout, batchnorm best practices—lots of conflicting wisdom!

    • Consider not using batch norm when dropout is present (or not using it at all)

    • My intuition: Input → BN → dropout → Linear → Activation → BN → dropout → Linear → Activation → BN → [no dropout] → output

    • In various transformers:

      • layernorm before attention and MLP, and at end before linear head
      • dropout before blocks, before wei@v, end of attn, and end of MLP
      inp
      emb
      **dropout**
      block
      	**norm**
      	attn
      		...
      		wei=...
      		**dropout**
      		wei@v
      		linear
      		**dropout**
      	+x
      	**norm**
      	mlp
      		linear
      		relu
      		linear
      		**dropout**
      	+x
      **norm**
      linear head
      
    • IIRC, original transformer paper applied dropout after attention, but now more common before?


  • Loss functions

    • According to a test of time talk by the word2vec authors, they were able to get more scale by choosing smarter loss functions and models. From slowest to fastest: softmax → hierarchical softmax → noise contrastive estimation → negative sampling (Ilya)
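    • A sketch of the cheapest of these, negative sampling as in skip-gram word2vec (sizes made up): instead of a softmax over the whole vocabulary, each step is one positive plus a few sampled-noise binary classifications
      import torch
      import torch.nn.functional as F

      vocab, dim, B, K = 10_000, 100, 32, 5      # made-up sizes; K = negatives per example
      in_emb  = torch.nn.Embedding(vocab, dim)   # center-word vectors
      out_emb = torch.nn.Embedding(vocab, dim)   # context-word vectors

      center  = torch.randint(vocab, (B,))       # center word ids
      context = torch.randint(vocab, (B,))       # true context word ids
      noise   = torch.randint(vocab, (B, K))     # sampled negative ids

      v, u_pos, u_neg = in_emb(center), out_emb(context), out_emb(noise)
      pos = (v * u_pos).sum(-1)                              # (B,)
      neg = torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)    # (B, K)
      # maximize log sigmoid(u_o . v_c) + sum_k log sigmoid(-u_k . v_c)
      loss = -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()
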
  • Examples

    • GPT2 outline ($n_e$ is n_embed)
      • dropout(token embedding + pos embedding)
      • blocks:
        • x + multihead-self-attention(layernorm(x))
        • x + mlp(layernorm(x))
        • where attention is:
          • concat $h$ attention heads; each takes the full input but projects down to a fraction of $n_e$ ($n_e/h$), and does:
            • q = linear(x) $n_e$ → $n_e/h$ (B,T,C/h)

            • k = linear(x) $n_e$ → $n_e/h$

            • v = linear(x) $n_e$ → $n_e/h$

            • wei = q @ k.transpose(-2,-1) * (C/h)**-0.5 (B,T,T), scaled by 1/sqrt(head size) so the softmax isn't too peaked at init

            • the tril + softmax masking trick: fill future positions with -inf before the softmax so they get zero weight, which lets us keep wei as arbitrary affinities instead of hand-building rows of 0s

              wei = wei.masked_fill(self.tril[..., :T, :T] == 0, float('-inf')) # (B, T, T)
              wei = F.softmax(wei, dim=-1) # (B, T, T)
              
            • dropout wei

            • wei @ v (B,T,C/h)

          • linear ($n_e$ → $n_e$): output projection applied after concatenating the heads
          • dropout
        • where MLP is:
          • linear ($n_e$ → $4n_e$)
          • relu
          • linear ($4n_e$ → $n_e$)
          • dropout
      • layernorm
      • linear head
      • loss excludes last token
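      • Putting the attention part of the outline into code, a sketch of a single head in the style of Karpathy's nanoGPT lecture (names and defaults assumed)
        import torch
        import torch.nn.functional as F
        from torch import nn

        class Head(nn.Module):
            # one causal self-attention head: B=batch, T=time, C=n_embed, hs=C/h
            def __init__(self, n_embed, head_size, block_size, dropout=0.1):
                super().__init__()
                self.key   = nn.Linear(n_embed, head_size, bias=False)
                self.query = nn.Linear(n_embed, head_size, bias=False)
                self.value = nn.Linear(n_embed, head_size, bias=False)
                self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
                self.dropout = nn.Dropout(dropout)

            def forward(self, x):
                B, T, C = x.shape
                q, k, v = self.query(x), self.key(x), self.value(x)          # (B, T, hs)
                wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5          # (B, T, T) scaled affinities
                wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # causal mask
                wei = F.softmax(wei, dim=-1)
                wei = self.dropout(wei)
                return wei @ v                                               # (B, T, hs)
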
  • Architectures

    • Transformers = positional encoding + attention + self-attention

      • Used by BERT, GPT, T5, etc.
      • Originally for translation
      • Overall architecture


      • Has cross-attention connecting the encoder and decoder (see the sketch after this list)
      • Mechanics
      • Resources
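      • A minimal sketch of that cross-attention (single head, no output projection): queries come from the decoder, keys/values from the encoder output
        import torch.nn.functional as F
        from torch import nn

        class CrossAttention(nn.Module):
            def __init__(self, d_model):
                super().__init__()
                self.q = nn.Linear(d_model, d_model, bias=False)  # from decoder state
                self.k = nn.Linear(d_model, d_model, bias=False)  # from encoder output
                self.v = nn.Linear(d_model, d_model, bias=False)  # from encoder output

            def forward(self, dec_x, enc_out):
                q, k, v = self.q(dec_x), self.k(enc_out), self.v(enc_out)
                wei = F.softmax(q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5, dim=-1)  # (B, T_dec, T_enc)
                return wei @ v  # (B, T_dec, d_model)
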
    • Encoder-decoder

    • Autoencoder

    • GAN

    • RNNs (in the seq2seq/encoder-decoder setup) have to compress the entire input into a fixed-size state before producing output, so they are limited by how much they can memorize

  • Building blocks

  • Activation functions

  • TODO Definitions

  • Token embeddings

  • Tokenization

  • Transformers

  • Relationships

  • Wisdom