TODO: import old notes, including LyX files
Theory
Backpropagation solves circuit search (source)
Vapnik-Chervonenkis Dimension (VC Dimension): measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a binary classification model or function.
The largest number of points n such that there exists some set of n points that the model can shatter, i.e. separate under every possible labeling. The set only has to exist for some positioning of the points, not all positionings, e.g. not this one (source):
SVM with Gaussian kernel has infinite VC dimension.
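A toy illustration of the shattering idea, assuming scikit-learn (the point count, gamma, and C below are arbitrary): with a large gamma the RBF kernel matrix is nearly the identity, so a hard-margin SVM fits every labeling of a small random point set.
# Toy check (assumes scikit-learn): every labeling of 6 random points is fit perfectly.
import numpy as np
from itertools import product
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                      # 6 random points in 2D
perfect = all(
    SVC(kernel="rbf", gamma=1e3, C=1e6).fit(X, np.array(lab)).score(X, np.array(lab)) == 1.0
    for lab in product([0, 1], repeat=6)
    if 0 < sum(lab) < 6                          # skip single-class labelings
)
print("all labelings fit perfectly:", perfect)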
Various wisdom
no free lunch theorem: many hypotheses can fit any finite dataset, so every learner is implicitly assuming something. In practice this matters little because most ML methods work well on most real data distributions.
curse of dimensionality: volume grows exponentially with dimension, so data becomes sparse and everything is far away (see the sketch after this list).
bitter lesson:
The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
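A small numpy sketch of the curse-of-dimensionality point above (sample sizes and dimensions are arbitrary): as dimension grows, the nearest and farthest neighbors of a random query become almost equally far away.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))      # 1000 random points in the unit cube
    q = rng.uniform(size=d)              # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    print(f"d={d:4d}  nearest/farthest distance ratio = {dist.min() / dist.max():.3f}")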
Misc vocab
❤️ MLE can be viewed as minimizing the KL divergence between the empirical distribution of the data and the model distribution
$$ D_{KL}(P||Q) = \sum_x P(x) \log(\frac{P(x)}{Q(x)}) \\ D_{KL}(P||Q) = \int P(x) \log(\frac{P(x)}{Q(x)}) dx $$
$$ \log L(\theta; X) = \sum_{i=1}^n \log Q(x_i; \theta) $$
$$ E_P[\log L(\theta; X)] = n \sum_x P(x) \log Q(x; \theta) $$
$$ \begin{align} E_P[\log L(\theta; X)] &= n \sum_x P(x) \log Q(x; \theta) \\ &= n \sum_x P(x) \log P(x) - n \sum_x P(x) \log \frac{P(x)}{Q(x; \theta)} \\ &= -n H(P) - n D_{KL}(P||Q) \end{align} $$
So maximizing the expected log-likelihood over $\theta$ is the same as minimizing $D_{KL}(P||Q)$, since $H(P)$ doesn't depend on $\theta$.
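A quick numeric check of this equivalence, as a numpy sketch (the toy counts and the grid of candidate models are made up): over a grid of categorical models Q, the maximum-likelihood choice and the minimum-KL choice are the same point, near the empirical frequencies.
import numpy as np

data = np.array([0]*70 + [1]*20 + [2]*10)            # toy samples from 3 classes
p_emp = np.bincount(data, minlength=3) / len(data)    # empirical distribution P

def log_lik(q):   # sum_i log Q(x_i)
    return np.sum(np.log(q[data]))

def kl(p, q):     # D_KL(P || Q)
    return np.sum(p * np.log(p / q))

best_ll = best_kl = None
for a in np.linspace(0.05, 0.9, 40):
    for b in np.linspace(0.05, 0.9, 40):
        if a + b > 0.95:                 # keep the third probability >= 0.05
            continue
        q = np.array([a, b, 1 - a - b])
        if best_ll is None or log_lik(q) > log_lik(best_ll):
            best_ll = q
        if best_kl is None or kl(p_emp, q) < kl(p_emp, best_kl):
            best_kl = q

print("argmax log-likelihood:", best_ll)
print("argmin KL:            ", best_kl)
print("empirical frequencies:", p_emp)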
Loss functions
Dissecting loss functions
BCE (binary cross entropy) is just cross entropy loss specialized to two classes.
Cross entropy loss: best and worst cases (upper and lower bounds) for inaccurate vs accurate predictions (accurate meaning the highest-probability class is the correct class)
Logit vs probit
Probit: CDF of the standard normal. Logit is a bit simpler analytically (though both have closed-form derivatives). Logit has fatter tails; probit approaches 1 more quickly.
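A quick tail comparison, assuming scipy (comparing the raw, unrescaled link functions): the logistic tail decays like $e^{-x}$, the normal tail like $e^{-x^2/2}$, so probit gets close to 1 much faster.
from scipy.special import expit          # logistic sigmoid
from scipy.stats import norm

for x in (2.0, 3.0, 4.0, 5.0):
    print(f"x={x}:  1 - logistic = {1 - expit(x):.2e}   1 - probit = {1 - norm.cdf(x):.2e}")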
Sigmoid/logistic and softmax
$\sigma(x) = \frac{e^x}{1+e^x}$ or $\frac{e^x}{e^0+e^x}$ or $1 - \frac{1}{1+e^x}$. So it's as if $y=1$ has logit $x$ and $y=0$ always has logit 0.
softmax generalizes sigmoid to any number of classes. But typically softmax gives every class its own logit (and probability), rather than inferring the last class's probability as 1 minus the others. This is the difference from sigmoid, which effectively holds one class's logit fixed at 0.
Binary cross entropy: since we use sigmoid, the $y=0$ class probability is always $1-p$, so here's a visual.
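A quick torch check of the softmax-vs-sigmoid claim above: a 2-class softmax with the second logit pinned at 0 reproduces the sigmoid of the first logit exactly.
import torch

x = torch.linspace(-5, 5, steps=11)
logits = torch.stack([x, torch.zeros_like(x)], dim=1)   # class 1 has logit x, class 0 has logit 0
softmax_p1 = torch.softmax(logits, dim=1)[:, 0]
print(torch.allclose(softmax_p1, torch.sigmoid(x)))     # True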
Backprop
Common derivatives:
MSE: pay attention to the sign: the error is truth - output, but the gradient w.r.t. the output is output - truth (the negative of the error).
Sigmoid: $\sigma'(x) = \sigma(x)(1-\sigma(x))$
Softmax: $\frac{\partial s_i}{\partial z_j} = s_i(\delta_{ij} - s_j)$
Cross entropy, applying the softmax: $\frac{\partial L}{\partial z_i} = p_i - y_i$
Common patterns:
# matmul: think about shapes
A = X @ W          # ac = ab @ bc
dW = X.T @ dA      # bc = ba @ ac
dX = dA @ W.T      # ab = ac @ cb
# sums -> scatter the grad
# sum scatters grad to all inputs
# forward: BTC -> BT (sum over C); backward: grad broadcasts back over C
Y = c * X.sum(dim=2)
dX = torch.ones_like(X) * (dY * c)[:, :, None]
# scatter -> sum the grads
# forward: BTC = BTC + C (bias broadcasts); backward: sum grads over the broadcast dims
A = X @ W + B
dB = dA.sum(dim=(0, 1))
# multiple uses are the same as scatters: add the grads from each use
Y = c * X
Z = d * X
dX = dY * c + dZ * d
# cross entropy / softmax
# samesies: (author's helper) asserts its arguments are equal and returns the value
a2 = F.softmax(z2, dim=1)
nll = samesies(
    a2[range(n), y.squeeze()].unsqueeze(dim=1).log(),
    torch.gather(a2, 1, y).log(),
)
loss = samesies(
    -1 / n * nll.sum(),
    F.cross_entropy(z2, y.squeeze()),
    F.nll_loss(a2.log(), y.squeeze()),
)
# combined softmax + CE gradient: da2_i/dz2_j = a_i (delta_ij - a_j) and
# dL/da2_i = -1/(n a_i) for the correct class (0 otherwise), so
# dL/dz2_i = (a_i - y_onehot_i)/n, i.e. (a-1)/n for the correct class, a/n otherwise
dz2 = torch.ones_like(z2) * a2
dz2[range(n), y.squeeze()] -= 1
dz2 /= n
samesies(dz2, z2.grad)
Metrics
Classification metrics
Precision: TP/(TP+FP)
Recall: TP/(TP+FN)
F1: 2TP/(2TP+FP+FN)
TPR: TP/P = TP/(TP+FN) = recall = 1 - FNR
FPR: FP/N = FP/(FP+TN) = 1 - TNR (1 - specificity)
ROC: TPR (y) vs FPR (x)
ECE: expected calibration error (see Calibration below)
AUC: area under ROC. Good measure of ranked separation, robust to imbalance.
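A sanity check of the classification formulas above against scikit-learn, on made-up toy labels and scores:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.7, 0.6])   # toy scores
y_pred  = (y_score >= 0.5).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
print(precision_score(y_true, y_pred), tp / (tp + fp))
print(recall_score(y_true, y_pred),    tp / (tp + fn))
print(f1_score(y_true, y_pred),        2 * tp / (2 * tp + fp + fn))
print(roc_auc_score(y_true, y_score))  # area under the TPR-vs-FPR curve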
Ranking metrics
Precision@K: precision in top k
Recall@K: recall in top k
Average precision AP@n: $\frac{1}{P} \sum_{k=1}^n P@k \times rel@k$ where P is the number of relevant documents, P@k is precision@k, and rel@k is the relevance indicator (1 if the document at rank k is relevant, 0 otherwise)
Rewards ranking all relevant items highly: it considers the position of every relevant item, not just the first.
Mean average precision mAP@k is mean AP@k over all queries/samples
Mean reciprocal rank MRR is 1/(rank of the first relevant item), meaned over all queries/samples.
Normalized discounted cumulative gain NDCG@k: $DCG@k = \sum_{i=1}^k \frac{rel_i}{\log_2(i+1)}$ (or $2^{rel_i}-1$ in the numerator for graded relevance), normalized by the DCG of the ideal ordering: $NDCG@k = DCG@k / IDCG@k$.
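Toy single-query implementations of the ranking metrics above (a plain numpy sketch; AP here normalizes by the number of relevant items in the top n, one common convention; mAP/MRR would then average these per-query values over queries):
import numpy as np

def precision_at_k(rels, k):                 # rels[i] = 1 if the item at rank i+1 is relevant
    return np.mean(rels[:k])

def average_precision(rels, n):
    n_rel = np.sum(rels[:n])
    if n_rel == 0:
        return 0.0
    return sum(precision_at_k(rels, k + 1) * rels[k] for k in range(n)) / n_rel

def reciprocal_rank(rels):
    hits = np.flatnonzero(rels)
    return 1.0 / (hits[0] + 1) if len(hits) else 0.0

def ndcg(rels, k):
    discounts = np.log2(np.arange(2, k + 2))            # log2(i+1) for ranks 1..k
    dcg = np.sum(rels[:k] / discounts)
    idcg = np.sum(np.sort(rels)[::-1][:k] / discounts)   # DCG of the ideal ordering
    return dcg / idcg if idcg > 0 else 0.0

rels = np.array([0, 1, 1, 0, 1])             # relevance by rank for one query
print(average_precision(rels, 5), reciprocal_rank(rels), ndcg(rels, 5))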
SVMs as infinite-dimensional features
Explainable machine learning
Partial dependence plots (PDP): plot y = average model response vs x = the feature of interest. (The other features keep their observed values and are averaged over, not perturbed.)
Definition
$$ \begin{split}pd_{X_S}(x_S) &\overset{def}{=} \mathbb{E}_{X_C}\left[ f(x_S, X_C) \right]\\ &= \int f(x_S, x_C) p(x_C) dx_C,\end{split} $$
1D vs 2D
Individual conditional expectation (ICE): like PDP but plot multiple y=responses (one per example)
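Both PDP and ICE can be drawn in one call with scikit-learn's PartialDependenceDisplay (assumes scikit-learn >= 1.0 and matplotlib; the dataset and estimator below are just placeholders):
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_friedman1(n_samples=500, random_state=0)
est = GradientBoostingRegressor().fit(X, y)

# kind="average" -> PDP, kind="individual" -> ICE, kind="both" -> overlay
PartialDependenceDisplay.from_estimator(est, X, features=[0, 1], kind="both")
plt.show()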
Collaborative filtering
Matrix factorization
import torch

class MatrixFactorization(torch.nn.Module):
    def __init__(self, n_users, n_items, n_factors=20):
        super().__init__()
        self.user_factors = torch.nn.Embedding(n_users, n_factors, sparse=True)
        self.item_factors = torch.nn.Embedding(n_items, n_factors, sparse=True)

    def forward(self, user, item):
        # predicted rating = dot product of user and item embeddings
        return (self.user_factors(user) * self.item_factors(item)).sum(1)

loss_func = torch.nn.MSELoss()
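A minimal training-loop sketch for the model above on made-up interaction data (the sizes, learning rate, and step count are arbitrary; SparseAdam pairs with the sparse=True embeddings):
n_users, n_items = 100, 50
model = MatrixFactorization(n_users, n_items, n_factors=20)
opt = torch.optim.SparseAdam(model.parameters(), lr=1e-2)

users   = torch.randint(0, n_users, (1024,))      # toy (user, item, rating) triples
items   = torch.randint(0, n_items, (1024,))
ratings = torch.rand(1024) * 5

for _ in range(100):
    opt.zero_grad()
    loss = loss_func(model(users, items), ratings)
    loss.backward()
    opt.step()
print(loss.item())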
CollabNN: just user/item embeddings into NN
# fastai-style: Module, Embedding, and sigmoid_range come from fastai
# (fastai's Module doesn't require calling super().__init__())
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        x = self.layers(torch.cat(embs, dim=1))
        # squash the output into the rating range
        return sigmoid_range(x, *self.y_range)
Lies in machine learning
Trees vs neural networks, why XGBoost is better than DNNs: [this is not a great explanation]
(source)
Deep Learning is the state of the art for images, video, audio and natural language processing. In all those tasks the features for each observation are homogeneous in the sense that they have the same scale and unit of measure: pixels in an image, frames in a video, words in text, etc.
When your data is made of heterogeneous columns such as age, weight, number of times the client called, average time of a call, etc then Xgboost is usually better than Deep Learning.
Calibration: how well do classifier probabilities match actual accuracy rates?
Expected calibration error (ECE): each bin's |accuracy - confidence| gap is weighted by the number of samples in the bin, so the most populated bins (often the outermost, high-confidence ones) dominate.
Average calibration error (ACE): unweighted by samples per bin; every bin contributes equally regardless of how many samples fall in it.
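A minimal numpy sketch of both quantities for binary confidences, using equal-width bins over [0, 1] (the bin count and toy data are arbitrary):
import numpy as np

def calibration_errors(conf, correct, n_bins=10):
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)   # equal-width bins
    ece, ace, used = 0.0, 0.0, 0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - conf[mask].mean())   # |accuracy - confidence|
        ece += mask.mean() * gap        # weight by fraction of samples in the bin
        ace += gap                      # every non-empty bin weighted equally
        used += 1
    return ece, ace / used

conf    = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55])
correct = np.array([1, 1, 0, 1, 0, 1])
print(calibration_errors(conf, correct))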
Ways to make non-differentiable things differentiable
Consider exploring