• References

    • https://github.com/jacobhilton/deep_learning_curriculum from OpenAI’s Jacob Hilton
    • Simplified and annotated pseudocode: various sketches of architectures/algorithms
    • A Hackers' Guide to Language Models - YouTube: high level applied overview
    • Neural Networks: Zero To Hero: Quick trip from DNN through transformers. Manual backprop.
    • fast.ai – Making neural nets uncool again: a bit too much time on teaching Python, but good, concrete notebook-style lessons through Stable Diffusion. Doesn't cover language or multimodal models.
    • karpathy/minGPT: A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training: deprecated in favor of nanoGPT, which is still pretty simple and clean but more practical
    • CS25 I Stanford Seminar - Transformers United 2023: Introduction to Transformers w/ Andrej Karpathy - YouTube
  • Neural network anatomy

    • Head: top of the network, often a dense network for causal LMs
    • Bottom: where the data comes in
    • Multi-headed: in the DAG of layers, multiple heads, maybe for prediction, or maybe as internal nodes (multi-headed attention)
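    • A toy sketch of this anatomy (made-up sizes and names): data comes in at the bottom, flows through a shared backbone, and out through two heads at the top
      from torch import nn

      class TwoHeadedNet(nn.Module):
          def __init__(self, d_in=32, d_hidden=64, n_classes=10):
              super().__init__()
              self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())  # "bottom"
              self.cls_head = nn.Linear(d_hidden, n_classes)  # head 1: classification
              self.reg_head = nn.Linear(d_hidden, 1)          # head 2: regression

          def forward(self, x):
              h = self.backbone(x)
              return self.cls_head(h), self.reg_head(h)
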
  • Seminal models

    • LeNet
    • AlexNet
    • UNet
  • Deep generative modeling [note: old]


  • Neural network techniques

    • Skip connections: effectively remove some of the depth of the network, especially early in training, when the residual block's pathway is initialized to (roughly) zero so it has no effect at first

      • Introduced in Deep Residual Learning for Image Recognition
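      • A minimal sketch of the zero-init idea (not the exact block from the paper, which also uses batch norm)
        import torch
        from torch import nn

        class ResBlock(nn.Module):
            def __init__(self, nf):
                super().__init__()
                self.conv1 = nn.Conv2d(nf, nf, 3, padding=1)
                self.conv2 = nn.Conv2d(nf, nf, 3, padding=1)
                nn.init.zeros_(self.conv2.weight)  # zero-init the last layer so the
                nn.init.zeros_(self.conv2.bias)    # block path has no effect at first
                self.act = nn.ReLU()

            def forward(self, x):
                return x + self.conv2(self.act(self.conv1(x)))  # the skip connection

        x = torch.randn(2, 16, 8, 8)
        assert torch.allclose(ResBlock(16)(x), x)  # identity at initialization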


  • Convolutions

    • 3D vs (2+1)D convolutions: often used for video, where the extra dimension is time
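    • A sketch of the difference (channel sizes are made up; real (2+1)D blocks also tune the intermediate width)
      import torch
      from torch import nn

      # full 3D conv: one kernel mixes time and space together
      conv3d = nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1)

      # (2+1)D factorization: 2D spatial conv per frame, then 1D conv across time
      conv2plus1d = nn.Sequential(
          nn.Conv3d(16, 32, kernel_size=(1, 3, 3), padding=(0, 1, 1)),  # spatial only
          nn.ReLU(),                                                    # extra nonlinearity between the two
          nn.Conv3d(32, 32, kernel_size=(3, 1, 1), padding=(1, 0, 0)),  # temporal only
      )

      x = torch.randn(2, 16, 8, 32, 32)  # (N, C, T, H, W) video batch
      print(conv3d(x).shape, conv2plus1d(x).shape)  # both (2, 32, 8, 32, 32)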


  • Normalization

    • Placement
      • Commonly placed in between linear/weight nodes and the activation nonlinearity node
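      • A sketch of that ordering (bias=False on the first linear since, per the note below, the norm's shift makes a preceding bias redundant)
        from torch import nn

        block = nn.Sequential(
            nn.Linear(128, 256, bias=False),  # linear/weight node
            nn.BatchNorm1d(256),              # normalization in between
            nn.ReLU(),                        # activation nonlinearity
            nn.Linear(256, 10),
        )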


    • Batch normalization: normalize over the batch to unit Gaussian, then scale/shift by learnable params (so not forced to stick to unit Gaussian).

      • The learned scale/shift allows it to move around. The point is normalizing the initialization, not locking it down and preventing the NN from learning.
      • Karpathy says best to avoid it.
      gain * (x - lerp(x.mean([0,2,3]))) / lerp(x.std([0,2,3])) + bias # NCHW; lerp = running average
      
      import torch
      from torch import nn

      class BatchNorm(nn.Module):
          def __init__(self, nf, mom=0.1, eps=1e-5):
              super().__init__()
              # NB: pytorch bn mom is opposite of what you'd expect
              self.mom,self.eps = mom,eps
              # learnable scale/shift, so the layer isn't forced to stay unit Gaussian
              self.mults = nn.Parameter(torch.ones (nf,1,1))
              self.adds  = nn.Parameter(torch.zeros(nf,1,1))
              # running stats (lerp'd during training, used as-is at inference)
              self.register_buffer('vars',  torch.ones(1,nf,1,1))
              self.register_buffer('means', torch.zeros(1,nf,1,1))
      
          def update_stats(self, x):
              m = x.mean((0,2,3), keepdim=True)
              v = x.var ((0,2,3), keepdim=True)
              self.means.lerp_(m, self.mom)
              self.vars.lerp_ (v, self.mom)
              return m,v
              
          def forward(self, x):
              if self.training:
                  with torch.no_grad(): m,v = self.update_stats(x)
              else: m,v = self.means,self.vars
              x = (x-m) / (v+self.eps).sqrt()
              return x*self.mults + self.adds
      
      


      • The jitter that depends on other examples in your batch is ugly but is also a form of regularization/noise.
      • Can use lerp (an exponential moving average) to smooth the stats over multiple batches
      • At inference, can rely on the lerp'd mean/SD, or can compute the mean/SD over the whole dataset once and use that as a constant
      • Its bias is redundant with any bias in layer before it
    • Layer normalization: normalize each example (row) independently, across all of its features: x - x.mean([1,2,3]) # NCHW, then scale and shift by learned single params x*mul+add.

      • Like BN, allow the NN to shift/scale the distribution anywhere it wants, starting from a normalized one—the point is normalizing the initialization, not locking it down and preventing the NN from learning.
      import torch
      from torch import nn

      class LayerNorm(nn.Module):
          def __init__(self, dummy, eps=1e-5):  # dummy just keeps the signature parallel to BatchNorm
              super().__init__()
              self.eps = eps
              # single learned scale/shift (scalars, not per-channel)
              self.mult = nn.Parameter(torch.tensor(1.))
              self.add  = nn.Parameter(torch.tensor(0.))
      
          def forward(self, x):
              m = x.mean((1,2,3), keepdim=True)
              v = x.var ((1,2,3), keepdim=True)
              x = (x-m) / ((v+self.eps).sqrt())
              return x*self.mult + self.add
      


  • Dropout: randomly masks out some nodes on each forward pass, so you effectively train an (overlapping) ensemble of subnetworks
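    • A minimal sketch of inverted dropout (which is what nn.Dropout implements): drop with prob p at train time and rescale, so nothing changes at inference
      import torch
      from torch import nn

      class Dropout(nn.Module):
          def __init__(self, p=0.1):
              super().__init__()
              self.p = p

          def forward(self, x):
              if not self.training or self.p == 0.:
                  return x
              mask = (torch.rand_like(x) > self.p).float()  # zero out ~p of the activations
              return x * mask / (1 - self.p)  # rescale so E[output] == E[input]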


  • Dropout, batchnorm best practices—lots of conflicting wisdom!

    • Consider not using batch norm when dropout is present (or not using it at all)

    • My intuition: Input → BN → dropout → Linear → Activation → BN → dropout → Linear → Activation → BN → [no dropout] → output

    • In various transformers:

      • layernorm before attention and MLP, and at end before linear head
      • dropout before blocks, before wei@v, end of attn, and end of MLP
      inp
      emb
      **dropout**
      block
      	**norm**
      	attn
      		...
      		wei=...
      		**dropout**
      		wei@v
      		linear
      		**dropout**
      	+x
      	**norm**
      	mlp
      		linear
      		relu
      		linear
      		**dropout**
      	+x
      **norm**
      linear head
      
    • IIRC, original transformer paper applied dropout after attention, but now more common before?


  • Loss functions

    • According to a test of time talk by the word2vec authors, they were able to get more scale by choosing smarter loss functions and models. From slowest to fastest: softmax → hierarchical softmax → noise contrastive estimation → negative sampling (Ilya)
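    • A sketch of the cheapest of these, negative sampling as in skip-gram word2vec (sizes made up): instead of a softmax over the whole vocabulary, each step is one positive plus a few sampled-noise binary classifications
      import torch
      import torch.nn.functional as F

      vocab, dim, B, K = 10_000, 100, 32, 5      # made-up sizes; K = negatives per example
      in_emb  = torch.nn.Embedding(vocab, dim)   # center-word vectors
      out_emb = torch.nn.Embedding(vocab, dim)   # context-word vectors

      center  = torch.randint(vocab, (B,))       # center word ids
      context = torch.randint(vocab, (B,))       # true context word ids
      noise   = torch.randint(vocab, (B, K))     # sampled negative ids

      v, u_pos, u_neg = in_emb(center), out_emb(context), out_emb(noise)
      pos = (v * u_pos).sum(-1)                              # (B,)
      neg = torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)    # (B, K)
      # maximize log sigmoid(u_o . v_c) + sum_k log sigmoid(-u_k . v_c)
      loss = -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()
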
  • Examples

    • GPT2 outline ($n_e$ is n_embed)
      • dropout(token embedding + pos embedding)
      • blocks:
        • x + multihead-self-attention(layernorm(x))
        • x + mlp(layernorm(x))
        • where attention is:
          • concat $h$ attention heads; each takes the full input but projects down to a fraction of $n_e$ ($n_e/h$), and does:
            • q = linear(x) $n_e$ → $n_e/h$ (B,T,C/h)

            • k = linear(x) $n_e$ → $n_e/h$

            • v = linear(x) $n_e$ → $n_e/h$

            • wei = q @ k.transpose(-2,-1) * (C/h)**-0.5 (B,T,T), scaled by 1/sqrt(head size) so the softmax isn't too peaked at init

            • the tril + softmax masking trick: fill future positions with -inf before the softmax so they get zero weight, which lets us keep wei as arbitrary affinities instead of hand-building rows of 0s

              wei = wei.masked_fill(self.tril[..., :T, :T] == 0, float('-inf')) # (B, T, T)
              wei = F.softmax(wei, dim=-1) # (B, T, T)
              
            • dropout wei

            • wei @ v (B,T,C/h)

          • linear ($n_e$ → $n_e$): output projection applied after concatenating the heads
          • dropout
        • where MLP is:
          • linear ($n_e$ → $4n_e$)
          • relu
          • linear ($4n_e$ → $n_e$)
          • dropout
      • layernorm
      • linear head
      • loss excludes last token
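      • Putting the attention part of the outline into code, a sketch of a single head in the style of Karpathy's nanoGPT lecture (names and defaults assumed)
        import torch
        import torch.nn.functional as F
        from torch import nn

        class Head(nn.Module):
            # one causal self-attention head: B=batch, T=time, C=n_embed, hs=C/h
            def __init__(self, n_embed, head_size, block_size, dropout=0.1):
                super().__init__()
                self.key   = nn.Linear(n_embed, head_size, bias=False)
                self.query = nn.Linear(n_embed, head_size, bias=False)
                self.value = nn.Linear(n_embed, head_size, bias=False)
                self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
                self.dropout = nn.Dropout(dropout)

            def forward(self, x):
                B, T, C = x.shape
                q, k, v = self.query(x), self.key(x), self.value(x)          # (B, T, hs)
                wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5          # (B, T, T) scaled affinities
                wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # causal mask
                wei = F.softmax(wei, dim=-1)
                wei = self.dropout(wei)
                return wei @ v                                               # (B, T, hs)
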
  • Architectures

    • Transformers = positional encoding + attention + self-attention

      • Used by BERT, GPT, T5, etc.
      • Originally for translation
      • Overall architecture


      • Has cross-attention connecting the encoder and decoder (see the sketch after this list)
      • Mechanics
      • Resources
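      • A minimal sketch of that cross-attention (single head, no output projection): queries come from the decoder, keys/values from the encoder output
        import torch.nn.functional as F
        from torch import nn

        class CrossAttention(nn.Module):
            def __init__(self, d_model):
                super().__init__()
                self.q = nn.Linear(d_model, d_model, bias=False)  # from decoder state
                self.k = nn.Linear(d_model, d_model, bias=False)  # from encoder output
                self.v = nn.Linear(d_model, d_model, bias=False)  # from encoder output

            def forward(self, dec_x, enc_out):
                q, k, v = self.q(dec_x), self.k(enc_out), self.v(enc_out)
                wei = F.softmax(q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5, dim=-1)  # (B, T_dec, T_enc)
                return wei @ v  # (B, T_dec, d_model)
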
    • Encoder-decoder

    • Autoencoder

    • GAN

    • RNNs (in the seq2seq/encoder-decoder setup) have to compress the entire input into a fixed-size state before producing output, so they are limited by how much they can memorize

  • Building blocks

  • Activation functions

  • TODO Definitions

  • Token embeddings

  • Tokenization

  • Transformers

  • Relationships

  • Wisdom