TODO import old notes including lyx files
Theory
Backpropagation solves circuit search (source)
Vapnik-Chervonenkis Dimension (VC Dimension): measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a binary classification model or function.
Largest number of points $n$ such that there exists some set of $n$ points that can be shattered: for every possible labeling of those points, the model can separate them. The set only needs to exist for some placement of the points, not all placements, like this one (source):
SVM with Gaussian kernel has infinite VC dimension.
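A brute-force sanity check of the definition (a minimal sketch; the point sets and the linearly_separable LP helper are my own illustration, not from the notes): linear classifiers in 2D can shatter a well-chosen set of 3 points but not the 4-point XOR configuration, consistent with their VC dimension of 3.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    # feasibility LP: find w, b with y_i * (w . x_i + b) >= 1 for all i
    X, y = np.asarray(X, float), np.asarray(y, float)
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (X.shape[1] + 1))
    return res.success

three = [(0, 0), (1, 0), (0, 1)]  # 3 points in general position: every labeling is separable
assert all(linearly_separable(three, labels) for labels in product([-1, 1], repeat=3))

xor = [(0, 0), (1, 1), (0, 1), (1, 0)]  # XOR labeling of 4 points: not separable
assert not linearly_separable(xor, [-1, -1, 1, 1])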
Various wisdom
no free lunch theorem: many hypotheses can fit any finite dataset, so you are always assuming something. In practice this matters less than it sounds, because most ML methods work well on most real data distributions.
curse of dimensionality: volume grows exponentially with the number of dimensions, so everything is far away from everything else (see the sketch after this list).
bitter lesson:
The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
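A quick numerical illustration of the curse of dimensionality (a sketch; the sample counts and dimensions are arbitrary): as the dimension grows, the nearest and farthest neighbors of a point become almost equally far away, so distances stop being informative.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point to the rest
    print(f"d={d:4d}  min/max distance ratio = {dists.min() / dists.max():.2f}")
# the ratio climbs toward 1 as d grows: everything is far away by roughly the same amount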
Misc vocab
❤️ MLE can be viewed as minimizing the KL divergence between the empirical distribution of the data and the model distribution
$$ D_{KL}(P||Q) = \sum_x P(x) \log(\frac{P(x)}{Q(x)}) \\ D_{KL}(P||Q) = \int P(x) \log(\frac{P(x)}{Q(x)}) dx $$
$$ \log L(\theta; X) = \sum_{i=1}^n \log Q(x_i; \theta) $$
$$ E_P[\log L(\theta; X)] = n \sum_x P(x) \log Q(x; \theta) $$
$$ \begin{align} E_P[\log L(\theta; X)] &= n \sum_x P(x) \log Q(x; \theta) \\ &= n \sum_x P(x) \log P(x) - n \sum_x P(x) \log \frac{P(x)}{Q(x; \theta)} \\ &= -n H(P) - n D_{KL}(P||Q) \end{align} $$
Since $H(P)$ doesn't depend on $\theta$, maximizing the expected log-likelihood is the same as minimizing $D_{KL}(P||Q)$.
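A numerical sanity check of the identity (a sketch; the categorical distributions P and Q are made up): the expected log-likelihood of $n$ samples equals $-n(H(P) + D_{KL}(P||Q))$, so pushing the likelihood up pushes the KL term down.
import numpy as np

P = np.array([0.5, 0.3, 0.2])   # true data distribution
Q = np.array([0.4, 0.4, 0.2])   # model distribution
n = 1000

H_P = -np.sum(P * np.log(P))
KL_PQ = np.sum(P * np.log(P / Q))
expected_loglik = n * np.sum(P * np.log(Q))

assert np.isclose(expected_loglik, -n * (H_P + KL_PQ))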
Loss functions
Dissecting loss functions
BCE is cross entropy loss specialized to two classes: BCE on a sigmoid output is the same as 2-class cross entropy with one logit pinned at 0.
Cross entropy loss: best and worst cases (upper and lower bounds) for inaccurate vs accurate predictions, where accurate means the max-probability class is the correct one. With $C$ classes, an accurate prediction has loss between 0 and $\log C$; an inaccurate one has loss above $\log 2$ and is unbounded above.
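A quick numerical look at those cases (a sketch; the logits are made up, assuming $C=4$ classes):
import math
import torch
import torch.nn.functional as F

C = 4
target = torch.tensor([0])   # the correct class is index 0
cases = {
    "confident & correct": torch.tensor([[10.0, 0.0, 0.0, 0.0]]),
    "uniform (accurate/inaccurate boundary)": torch.zeros(1, C),
    "confident & wrong": torch.tensor([[-10.0, 10.0, 0.0, 0.0]]),
}
for name, logits in cases.items():
    print(f"{name:40s} CE = {F.cross_entropy(logits, target).item():.3f}")
print("log(C) =", round(math.log(C), 3))  # the worst case while the prediction is still accurate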
Logit vs probit
Probit: the CDF of a standard normal. The logit (logistic) link is a bit simpler analytically (though both have closed-form derivatives). The logistic has fatter tails; the probit approaches 0 and 1 more quickly.
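Comparing the two links numerically (a sketch that ignores the scale difference between the two curves): the logistic tail decays roughly like $e^{-x}$, while the normal tail decays much faster.
import numpy as np
from scipy.stats import norm

for x in (1.0, 2.0, 4.0, 6.0):
    logistic_tail = 1 - 1 / (1 + np.exp(-x))   # 1 - logistic(x)
    probit_tail = 1 - norm.cdf(x)              # 1 - Phi(x)
    print(f"x={x:3.1f}  logistic tail={logistic_tail:.1e}  probit tail={probit_tail:.1e}")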
Sigmoid/logistic and softmax
$\sigma(x) = \frac{1}{1+e^{-x}} = \frac{e^x}{1+e^x}$ or $\frac{e^x}{e^0+e^x}$ or $1 - \frac{1}{1+e^x}$. So it's as if $y=1$ has logit $x$ and $y=0$ always has logit 0.
softmax generalizes sigmoid to any number of classes. But softmax typically gives every class its own logit, rather than pinning one class's logit at 0 and inferring its probability as 1 - (the others), which is what sigmoid implicitly does (see the checks below).
Binary cross entropy: $-[y \log p + (1-y)\log(1-p)]$. Since we use sigmoid, the predicted probability for the $y=0$ class is always $1-p$, so here's a visual.
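A couple of quick checks of the two points above (a sketch; the tensors are random and just for illustration): sigmoid really is a 2-class softmax with one logit pinned at 0, and the manual BCE formula matches torch's built-in BCE and 2-class cross entropy.
import torch
import torch.nn.functional as F

x = torch.randn(8)                        # logits for the y=1 class
y = torch.randint(0, 2, (8,)).float()

# sigmoid == softmax over the logits [0, x]
two_class = torch.stack([torch.zeros_like(x), x], dim=1)  # class-0 logit pinned at 0
assert torch.allclose(torch.sigmoid(x), F.softmax(two_class, dim=1)[:, 1])

# manual BCE == torch BCE == 2-class cross entropy
p = torch.sigmoid(x)
manual = -(y * p.log() + (1 - y) * (1 - p).log()).mean()
assert torch.allclose(manual, F.binary_cross_entropy(p, y))
assert torch.allclose(manual, F.cross_entropy(two_class, y.long()))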
Triplet loss: more robust than pairwise contrastive loss.
“Given an anchor (Joe face 1), a positive match (Joe face 2), and a negative match (Jane face), the negative should be at least $m$ farther from the anchor than the positive is”
$L = \max(0, d_{ap} - d_{an} + m) = \max(0, ||f_a - f_p||^2 - || f_a - f_n ||^2 + m)$
Need $m$ or else we just learn $f_a=f_n=f_p=0$
Quadruplet or quintuplet variants may be even better, but triplet loss is generally good enough.
Choosing triplets: prioritize the difficult cases with high loss (hard triplet mining); see the sketch below.
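A minimal PyTorch version of the formula above (a sketch; the triplet_loss helper is my own, and the batch size, embedding size, and margin are arbitrary):
import torch

def triplet_loss(f_a, f_p, f_n, m=0.2):
    d_ap = (f_a - f_p).pow(2).sum(dim=1)  # squared anchor-positive distance
    d_an = (f_a - f_n).pow(2).sum(dim=1)  # squared anchor-negative distance
    return torch.clamp(d_ap - d_an + m, min=0).mean()

f_a, f_p, f_n = (torch.randn(16, 64) for _ in range(3))  # batch of 16, 64-dim embeddings
loss = triplet_loss(f_a, f_p, f_n)
PyTorch's built-in nn.TripletMarginLoss is the same idea but uses (non-squared) Euclidean distance by default.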
Common ones:
MSE: pay attention to the sign: the delta term in the update is truth - output, i.e. the negative of $dL/d(\text{output}) = \text{output} - \text{truth}$ for $L = \frac{1}{2}(\text{output} - \text{truth})^2$.
Sigmoid: $\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))$
Softmax: $\frac{\partial s_i}{\partial z_j} = s_i(\delta_{ij} - s_j)$
Cross entropy, applying the softmax: the combined gradient w.r.t. the logits is $\text{softmax}(z) - \text{onehot}(y)$ (divided by the batch size for a mean reduction); see the autograd check below.
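A quick autograd check of those gradients (a sketch; the shapes and targets are arbitrary):
import torch
import torch.nn.functional as F

x = torch.randn(5, requires_grad=True)
s = torch.sigmoid(x)
s.sum().backward()
assert torch.allclose(x.grad, s * (1 - s))  # dsigma/dx = sigma * (1 - sigma)

z = torch.randn(4, 3, requires_grad=True)
y = torch.tensor([0, 2, 1, 2])
F.cross_entropy(z, y).backward()
expected = (F.softmax(z, dim=1) - F.one_hot(y, num_classes=3)) / len(y)  # (p - onehot) / n
assert torch.allclose(z.grad, expected)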
Common patterns:
# matmul: think about shapes
A = X @ W # ac = ab @ bc
dW = X.T @ dA # bc = ba @ ac
dX = dA @ W.T # ab = ac @ cb
# sums -> scatter the grad to all summed inputs
# BT = 1 * BTC (sum over C)
Y = c * X.sum(dim=2)
dX = torch.ones_like(X) * (dY * c)[:, :, None] # broadcast dY back over the summed C dim
# scatter -> sum the grads
# BTC = BTC + C
A = X@W + B
dB = dA.sum(dim=(0,1))
# multiple uses are same as scatters
Y = c * X
Z = d * X
dX = dY * c + dZ * d
# cross entropy / softmax
a2 = F.softmax(z2, dim=1)
# samesies: helper that checks all its arguments match and returns that value
nll = samesies(
    (a2[range(n), y.squeeze()].unsqueeze(dim=1)).log(),
    torch.gather(a2, 1, y).log(),
)
loss = samesies(
    -1 / n * nll.sum(),
    F.cross_entropy(z2, y.squeeze()),
    F.nll_loss(a2.log(), y.squeeze()),
)
# softmax jacobian: da_i/dz_j = a_i (delta_ij - a_j)
# dL/da2 = -1/(n a_y) for the correct class y, 0 for the other classes
# dL/dz_j = sum_i dL/da_i * da_i/dz_j = -1/(n a_y) * a_y (delta_jy - a_j) = (a_j - delta_jy)/n
# i.e. dz2 = (softmax(z2) - onehot(y)) / n
dz2 = torch.ones_like(z2) * a2
dz2[range(n), y.squeeze()] -= 1
dz2 /= n
samesies(dz2, z2.grad)
Classification metrics
Precision: TP/(TP+FP)
Recall: TP/(TP+FN)
F1: 2TP/(2TP+FP+FN)
TPR (recall, sensitivity): TP/P = TP/(TP+FN) = 1 - FNR
FPR: FP/N = FP/(FP+TN) = 1 - TNR (i.e. 1 - specificity)
ROC: TPR (y) vs FPR (x)
ECE: expected calibration error; bin predictions by confidence and take the bin-size-weighted average of |accuracy - confidence| over the bins.
AUC: area under the ROC curve; equals the probability that a randomly chosen positive is ranked above a randomly chosen negative. Good measure of ranking separation, fairly robust to class imbalance.
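Computing the metrics above in code (a sketch; the confusion-matrix counts and scores are made up):
import numpy as np
from sklearn.metrics import roc_auc_score

tp, fp, fn, tn = 80, 10, 20, 90
precision = tp / (tp + fp)
recall = tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
f1 = 2 * tp / (2 * tp + fp + fn)
print(precision, recall, fpr, f1)

# AUC needs scores (a ranking), not hard predictions
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc_score(y_true, y_score))  # prob. a random positive outranks a random negative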
Ranking metrics
Precision@K: precision in top k
Recall@K: recall in top k
Average precision AP@n: $\frac{1}{N_{rel}} \sum_{k=1}^n P@k \times rel@k$ where P@k is precision@k, rel@k is the relevance function (1 if the document at rank k is relevant, 0 otherwise), and $N_{rel}$ is the number of relevant documents.
Unlike MRR, it accounts for the positions of all relevant items, not just the first.
Mean average precision mAP@k is mean AP@k over all queries/samples
Mean reciprocal rank MRR is 1/(rank of the first relevant item), averaged over all queries/samples.
Normalized discounted cumulative gain NDCG: $DCG@k = \sum_{i=1}^k \frac{rel_i}{\log_2(i+1)}$ (graded relevance, discounted by rank), normalized by the ideal DCG (the DCG of the best possible ordering), so NDCG $\in [0, 1]$.
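Minimal single-query implementations of these metrics (a sketch; the helper names are my own, rels is a ranked list of relevance labels, and $N_{rel}$ is taken as the number of relevant items within the list):
import numpy as np

def average_precision(rels, n=None):
    rels = np.asarray(rels[:n], dtype=float)
    if rels.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rels) / (np.arange(len(rels)) + 1)
    return float((precision_at_k * rels).sum() / rels.sum())

def reciprocal_rank(rels):
    hits = np.nonzero(np.asarray(rels))[0]
    return 0.0 if len(hits) == 0 else 1.0 / (hits[0] + 1)

def ndcg(rels, k=None):
    rels = np.asarray(rels[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(len(rels)) + 2)  # rank i gets 1/log2(i+1), ranks start at 1
    dcg = float((rels * discounts).sum())
    ideal = float((np.sort(rels)[::-1] * discounts).sum())
    return 0.0 if ideal == 0 else dcg / ideal

rels = [0, 1, 1, 0, 1]   # relevance of the ranked results for one query
print(average_precision(rels), reciprocal_rank(rels), ndcg(rels))
mAP, mean MRR, and mean NDCG just average these per-query values over all queries.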