References
Neural network anatomy
Deep generative modeling [note: old]
Neural network techniques
Skip connections: effectively remove some of the depth of the network, especially early in training, when the residual block pathways are initialized to (near) zero so the block has no effect; see the sketch below
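A sketch of that idea, assuming a basic 2-conv residual block (not any specific architecture): zeroing the final norm's scale makes the block output start at 0, so the network initially behaves like a shallower one.
import torch
from torch import nn

class ResBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(nf, nf, 3, padding=1), nn.BatchNorm2d(nf), nn.ReLU(),
            nn.Conv2d(nf, nf, 3, padding=1), nn.BatchNorm2d(nf))
        nn.init.zeros_(self.convs[-1].weight)   # zero the last BN's gain -> block output starts at 0

    def forward(self, x):
        return torch.relu(x + self.convs(x))    # skip connection: identity + (initially zero) block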
Convolutions
Normalization
Batch normalization: normalize each channel over the batch (and spatial dims) to zero mean/unit variance, then scale/shift by learnable params (so the network isn't forced to stick to the normalized distribution).
gain * (x - lerp(x.mean([0,2,3]))) / lerp(x.std([0,2,3])) + bias # NCHW; lerp = running-stat EMA, see update_stats below
import torch
from torch import nn

class BatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        # NB: pytorch bn mom is opposite of what you'd expect
        self.mom,self.eps = mom,eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds  = nn.Parameter(torch.zeros(nf,1,1))
        self.register_buffer('vars',  torch.ones (1,nf,1,1))
        self.register_buffer('means', torch.zeros(1,nf,1,1))

    def update_stats(self, x):
        # batch stats per channel, over N,H,W (NCHW input)
        m = x.mean((0,2,3), keepdim=True)
        v = x.var ((0,2,3), keepdim=True)
        # lerp_ keeps an exponential moving average for use at inference
        self.means.lerp_(m, self.mom)
        self.vars .lerp_(v, self.mom)
        return m,v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m,v = self.update_stats(x)
        else: m,v = self.means,self.vars
        x = (x-m) / (v+self.eps).sqrt()
        return x*self.mults + self.adds
Layer normalization: normalize each row independently, across all features (x - x.mean([1,2,3]) # NCHW), then scale and shift by learned single params (x*mult + add).
class LayerNorm(nn.Module):
    def __init__(self, dummy, eps=1e-5):
        super().__init__()
        self.eps = eps
        # single learned scale/shift scalars (dummy is unused; kept to match the norm-layer constructor signature)
        self.mult = nn.Parameter(torch.tensor(1.))
        self.add  = nn.Parameter(torch.tensor(0.))

    def forward(self, x):
        # stats per sample, over all of C,H,W (NCHW input)
        m = x.mean((1,2,3), keepdim=True)
        v = x.var ((1,2,3), keepdim=True)
        x = (x-m) / ((v+self.eps).sqrt())
        return x*self.mult + self.add
Dropout: randomly zeroes some nodes on each pass, so you end up training an ensemble of overlapping subnetworks; sketch below
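A minimal sketch of (inverted) dropout, assuming the usual convention of rescaling at train time so eval is a no-op:
import torch
from torch import nn

class Dropout(nn.Module):
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.: return x
        mask = (torch.rand_like(x) > self.p).float()   # keep each activation with prob 1-p
        return x * mask / (1 - self.p)                 # rescale so the expected value matches eval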
Dropout, batchnorm best practices—lots of conflicting wisdom!
Consider not using batch norm when dropout is present (or skipping batch norm altogether)
My intuition: Input → BN → dropout → Linear → Activation → BN → dropout → Linear → Activation → BN → [no dropout] → output
In various transformers:
inp
emb
**dropout**
block
    **norm**
    attn
        ...
        wei = ...
        **dropout**
        wei @ v
        linear
        **dropout**
    +x
    **norm**
    mlp
        linear
        relu
        linear
        **dropout**
    +x
**norm**
linear head
IIRC, original transformer paper applied dropout after attention, but now more common before?
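A sketch of one such (pre-norm) block with those dropout placements, using nn.MultiheadAttention for brevity; n_embed, n_head, p are assumed hyperparameter names, not from any specific model:
import torch
from torch import nn

class Block(nn.Module):
    # pre-norm: norm -> attn -> dropout -> +x, then norm -> mlp -> dropout -> +x
    def __init__(self, n_embed, n_head, p=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embed)
        self.attn = nn.MultiheadAttention(n_embed, n_head, dropout=p, batch_first=True)
        self.drop1 = nn.Dropout(p)
        self.ln2 = nn.LayerNorm(n_embed)
        self.mlp = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed), nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed), nn.Dropout(p))

    def forward(self, x):                    # x: (B, T, n_embed)
        T = x.size(1)
        # boolean causal mask: True above the diagonal = not allowed to attend
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + self.drop1(a)                # dropout on the attention output, then residual
        x = x + self.mlp(self.ln2(x))        # mlp already ends in dropout, then residual
        return x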
Loss functions
Examples
One attention head (embedding size n_embed, h heads):
    q = linear(x): $n_e$ → $n_e/h$, giving (B,T,C/h)
    k = linear(x): $n_e$ → $n_e/h$
    v = linear(x): $n_e$ → $n_e/h$
    wei = q @ k.transpose(-2,-1)  # (B,T,T)
    the tril trick with softmax: masking to -inf and softmaxing lets wei be an arbitrary measure of affinity between tokens, rather than hard-coded 0s everywhere (uniform averaging)
        wei = wei.masked_fill(self.tril[..., :T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
    dropout on wei
    wei @ v  # (B,T,C/h)
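The same steps wrapped into a single-head module (a sketch; Head, n_embed, head_size, block_size are assumed names, and the standard 1/sqrt(head_size) scaling is included even though the steps above omit it):
import torch
from torch import nn
import torch.nn.functional as F

class Head(nn.Module):
    # one self-attention head: projects n_embed -> head_size (= n_embed // n_heads)
    def __init__(self, n_embed, head_size, block_size, p=0.1):
        super().__init__()
        self.key   = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(p)

    def forward(self, x):                                      # x: (B, T, n_embed)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)    # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5      # (B, T, T) scaled affinities
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v                                         # (B, T, head_size)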
Architectures
Transformers = positional encoding + attention + self-attention
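A sketch of how the input side fits together, using learned position embeddings (the original paper used sinusoidal encodings); vocab_size, n_embed, block_size are made-up values:
import torch
from torch import nn

vocab_size, n_embed, block_size = 65, 64, 32     # made-up sizes
tok_emb = nn.Embedding(vocab_size, n_embed)      # token identity -> vector
pos_emb = nn.Embedding(block_size, n_embed)      # position in the context -> vector

idx = torch.randint(vocab_size, (4, block_size))          # (B, T) token ids
x = tok_emb(idx) + pos_emb(torch.arange(block_size))      # (B, T, n_embed), fed into the attention blocks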
Attention
Self-attention: when keys, queries, values all come from the same source—just tokens looking at each other
Cross-attention: for instance in encoder-decoder transformers, queries are produced from x while keys/values come from an external source, often the encoder blocks we want to condition on
Multi-headed attention: run several attention mechanisms in parallel, then concatenate. Each head has size d_model/n_heads, but each takes the full d_model as input (hidden dimensions are not partitioned); the head outputs are ultimately concatenated back into d_model. More alternatives: grouped-query attention (Llama 2, Mistral?), which interpolates between multi-head and multi-query attention.
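A sketch of multi-head attention in that spirit, assuming a causal decoder-style setup; the fused qkv projection is equivalent to giving every head the full d_model as input, with only the output dims split per head:
import torch
from torch import nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    # n_heads parallel heads of size d_model // n_heads, concatenated back to d_model
    def __init__(self, d_model, n_heads, block_size, p=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_size = n_heads, d_model // n_heads
        self.qkv  = nn.Linear(d_model, 3 * d_model, bias=False)   # fused q,k,v projection
        self.proj = nn.Linear(d_model, d_model)
        self.dropout = nn.Dropout(p)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                                            # x: (B, T, d_model)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split d_model into (n_heads, head_size) and move heads next to batch
        q, k, v = [t.reshape(B, T, self.n_heads, self.head_size).transpose(1, 2) for t in (q, k, v)]
        wei = q @ k.transpose(-2, -1) * self.head_size**-0.5         # (B, nh, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = self.dropout(F.softmax(wei, dim=-1))
        out = (wei @ v).transpose(1, 2).reshape(B, T, C)             # concat heads back to d_model
        return self.proj(out)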
Mechanics
# version 3: use Softmax
import torch
import torch.nn.functional as F
from torch import nn

T = 3  # toy sequence length (matches the 3x3 output below)
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
wei
"""
tensor([[1.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000],
[0.3333, 0.3333, 0.3333]])
"""
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)
# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x) # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
wei
"""
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
[0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
[0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
[0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
grad_fn=<SelectBackward0>)
"""
v = value(x)
out = wei @ v
out.shape
"""
torch.Size([4, 8, 16])
"""
Outside of transformers
Resources
General attention
Dense linear vs convolution vs attention vs sliding window attention
Encoder-decoder
Autoencoder
GAN
RNNs have to compress the entire input into a fixed-size hidden state before producing output, so they are limited by how much that state can memorize
Seminal models
Building blocks
Activation functions
TODO Definitions
Token embeddings
Position embeddings
Tokenization
Transformers
Relationships