References
Neural network anatomy
Seminal models
Deep generative modeling [note: old]
Neural network techniques
Skip connections: effectively remove some of the depth of the network, especially early in training, when the residual block pathways are initialized near zero so they have (almost) no effect and each block starts out close to the identity (see the sketch below).
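A minimal sketch of the idea (my own example; ResBlock, n_ch, and the zero-init trick are illustrative, not from a particular paper):

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """y = x + block(x): if block(x) starts near zero, the block is initially a no-op."""
    def __init__(self, n_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(n_ch, n_ch, 3, padding=1),
            nn.BatchNorm2d(n_ch),
            nn.ReLU(),
            nn.Conv2d(n_ch, n_ch, 3, padding=1),
            nn.BatchNorm2d(n_ch),
        )
        # optional trick: zero-init the last BN gain so block(x) == 0 at init
        nn.init.zeros_(self.block[-1].weight)

    def forward(self, x):
        return x + self.block(x)  # skip connection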
Convolutions
Normalization
Batch normalization: normalize each channel over the batch to a unit Gaussian, then scale/shift by learnable params (so the network isn't forced to stick to a unit Gaussian).
gain * (x - lerp(x.mean([0,2,3]))) / lerp(x.std([0,2,3])) + bias # NCHW; lerp = running average (EMA) of the batch stats
import torch
from torch import nn

class BatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        # NB: pytorch bn mom is opposite of what you'd expect
        self.mom,self.eps = mom,eps
        # learnable per-channel scale/shift
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds  = nn.Parameter(torch.zeros(nf,1,1))
        # running stats (buffers, not learned), used at inference time
        self.register_buffer('vars',  torch.ones(1,nf,1,1))
        self.register_buffer('means', torch.zeros(1,nf,1,1))

    def update_stats(self, x):
        m = x.mean((0,2,3), keepdim=True)  # per-channel stats over batch and spatial dims
        v = x.var ((0,2,3), keepdim=True)
        self.means.lerp_(m, self.mom)      # EMA update of the running stats
        self.vars.lerp_ (v, self.mom)
        return m,v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m,v = self.update_stats(x)
        else: m,v = self.means,self.vars
        x = (x-m) / (v+self.eps).sqrt()
        return x*self.mults + self.adds
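Quick usage sketch (my own example; PyTorch's built-in nn.BatchNorm2d is the production equivalent):

bn = BatchNorm(nf=8)
x = torch.randn(16, 8, 4, 4)   # NCHW
y = bn(x)                      # training mode: normalizes with batch stats, updates buffers
bn.eval(); y2 = bn(x)          # eval mode: uses the stored running means/vars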
Layer normalization: normalize each sample independently, across all of its features: x - x.mean([1,2,3]) # NCHW, then scale and shift by single learned params: x*mult + add.
class LayerNorm(nn.Module):
    def __init__(self, dummy, eps=1e-5):
        super().__init__()
        # dummy: unused, keeps the constructor signature interchangeable with BatchNorm(nf)
        self.eps = eps
        # single learned scale and shift (scalars, not per-channel)
        self.mult = nn.Parameter(torch.tensor(1.))
        self.add  = nn.Parameter(torch.tensor(0.))

    def forward(self, x):
        m = x.mean((1,2,3), keepdim=True)  # stats per sample, over C,H,W
        v = x.var ((1,2,3), keepdim=True)
        x = (x-m) / ((v+self.eps).sqrt())
        return x*self.mult + self.add
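Note: this version normalizes over all of (C,H,W) per sample and uses a single scalar mult/add; PyTorch's built-in nn.LayerNorm instead takes a normalized_shape and learns elementwise affine params over those dims.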
Dropout: randomly zeroes out some activations on each training pass, so you end up training an (overlapping) ensemble of subnetworks; see the sketch below.
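A minimal sketch of inverted dropout, which is what nn.Dropout implements (function name and defaults are mine):

import torch

def dropout(x, p=0.5, training=True):
    # zero each element with prob p, scale survivors by 1/(1-p)
    # so the expected activation is unchanged and nothing special is needed at eval time
    if not training or p == 0.: return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1 - p)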
Dropout, batchnorm best practices—lots of conflicting wisdom!
Consider not using batch norm when dropout is present (or skipping batch norm altogether)
My intuition: Input → BN → dropout → Linear → Activation → BN → dropout → Linear → Activation → BN → [no dropout] → output
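That ordering, translated into a throwaway MLP sketch (sizes and dropout rate are made up):

import torch.nn as nn

n_in, n_h, n_out, p = 64, 128, 10, 0.1   # made-up sizes
model = nn.Sequential(
    nn.BatchNorm1d(n_in), nn.Dropout(p), nn.Linear(n_in, n_h), nn.ReLU(),
    nn.BatchNorm1d(n_h),  nn.Dropout(p), nn.Linear(n_h, n_h),  nn.ReLU(),
    nn.BatchNorm1d(n_h),  nn.Linear(n_h, n_out),   # no dropout right before the output
)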
In various transformers:
inp
emb
**dropout**
block
    **norm**
    attn
        ...
        wei = ...
        **dropout**
        wei @ v
        linear
        **dropout**
    +x
    **norm**
    mlp
        linear
        relu
        linear
        **dropout**
    +x
**norm**
linear head
IIRC, original transformer paper applied dropout after attention, but now more common before?
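A compact sketch of that pre-norm block layout (GPT/nanoGPT-style; names, sizes, and the use of nn.MultiheadAttention are my choices, and causal masking is omitted for brevity):

import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embed, n_head, dropout=0.1):
        super().__init__()
        self.ln1  = nn.LayerNorm(n_embed)   # **norm** before attn
        self.attn = nn.MultiheadAttention(n_embed, n_head, dropout=dropout, batch_first=True)
        self.ln2  = nn.LayerNorm(n_embed)   # **norm** before mlp
        self.mlp  = nn.Sequential(
            nn.Linear(n_embed, 4*n_embed), nn.ReLU(),
            nn.Linear(4*n_embed, n_embed), nn.Dropout(dropout),  # **dropout** at end of mlp
        )
        self.drop = nn.Dropout(dropout)     # **dropout** on attn output before the residual add

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)  # attn weight dropout happens inside
        x = x + self.drop(a)            # +x (residual)
        x = x + self.mlp(self.ln2(x))   # +x (residual)
        return x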
Loss functions
Examples
Single self-attention head (embedding size $n_e$ = n_embed, $h$ heads):
q = linear(x)   # $n_e$ → $n_e/h$, (B,T,C/h)
k = linear(x)   # $n_e$ → $n_e/h$
v = linear(x)   # $n_e$ → $n_e/h$
wei = q @ k.transpose(-2,-1)   # (B,T,T)
the tril trick via softmax: mask future positions with -inf, then softmax, so wei acts as a learned affinity between positions rather than hard-coded 0s everywhere:
wei = wei.masked_fill(self.tril[..., :T, :T] == 0, float('-inf')) # (B, T, T)
wei = F.softmax(wei, dim=-1) # (B, T, T)
dropout on wei
wei @ v   # (B,T,C/h)
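The steps above as a runnable module (a sketch in the style of Karpathy's nanoGPT single head; class/argument names and the 1/sqrt(head_size) scaling are my additions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    def __init__(self, n_embed, head_size, block_size, dropout=0.1):
        super().__init__()
        self.query = nn.Linear(n_embed, head_size, bias=False)   # n_e -> n_e/h
        self.key   = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                     # x: (B, T, n_embed)
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)   # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5      # (B, T, T) scaled affinities
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)                           # rows sum to 1; -inf -> 0
        wei = self.dropout(wei)
        return wei @ v                                         # (B, T, head_size) = (B,T,C/h)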
Architectures
Transformers = positional encoding + attention + self-attention
Encoder-decoder
Autoencoder
GAN
RNNs (in seq2seq setups) have to consume the whole input before producing output, so they are limited by how much the hidden state can memorize
Building blocks
Activation functions
TODO Definitions
Token embeddings
Tokenization
Transformers
Relationships
Wisdom