TODO: import old notes, including LyX files
Theory
Backpropagation solves circuit search (source)
Vapnik-Chervonenkis Dimension (VC Dimension): measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a binary classification model or function.
The largest number of points n such that there exists some set of n points that the model can shatter, i.e. separate under every possible labeling. The set only has to exist for some positioning of the points, not all positionings, e.g. not this one (source):
SVM with Gaussian kernel has infinite VC dimension.
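A toy illustration of the shattering idea, assuming scikit-learn (the point count, gamma, and C below are arbitrary): with a large gamma the RBF kernel matrix is nearly the identity, so a hard-margin SVM fits every labeling of a small random point set.
# Toy check (assumes scikit-learn): every labeling of 6 random points is fit perfectly.
import numpy as np
from itertools import product
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                      # 6 random points in 2D
perfect = all(
    SVC(kernel="rbf", gamma=1e3, C=1e6).fit(X, np.array(lab)).score(X, np.array(lab)) == 1.0
    for lab in product([0, 1], repeat=6)
    if 0 < sum(lab) < 6                          # skip single-class labelings
)
print("all labelings fit perfectly:", perfect)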
Various wisdom
no free lunch theorem: many hypotheses can fit any finite dataset, so every learner is implicitly assuming something. In practice this matters little because most ML methods work well on most real data distributions.
curse of dimensionality: volume grows exponentially with dimension, so data becomes sparse and everything is far away (see the sketch after this list).
bitter lesson:
The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.
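A small numpy sketch of the curse-of-dimensionality point above (sample sizes and dimensions are arbitrary): as dimension grows, the nearest and farthest neighbors of a random query become almost equally far away.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))      # 1000 random points in the unit cube
    q = rng.uniform(size=d)              # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    print(f"d={d:4d}  nearest/farthest distance ratio = {dist.min() / dist.max():.3f}")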
Misc vocab
❤️ MLE can be viewed as minimizing the KL divergence between the empirical distribution of the data and the model distribution
$$ D_{KL}(P||Q) = \sum_x P(x) \log(\frac{P(x)}{Q(x)}) \\ D_{KL}(P||Q) = \int P(x) \log(\frac{P(x)}{Q(x)}) dx $$
$$ \log L(\theta; X) = \sum_{i=1}^n \log Q(x_i; \theta) $$
$$ E_P[\log L(\theta; X)] = n \sum_x P(x) \log Q(x; \theta) $$
$$ \begin{align} E_P[\log L(\theta; X)] &= n \sum_x P(x) \log Q(x; \theta) \\ &= n \sum_x P(x) \log P(x) - n \sum_x P(x) \log \frac{P(x)}{Q(x; \theta)} \\ &= -n H(P) - n D_{KL}(P||Q) \end{align} $$
So maximizing the expected log-likelihood over $\theta$ is the same as minimizing $D_{KL}(P||Q)$, since $H(P)$ doesn't depend on $\theta$.
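A quick numeric check of this equivalence, as a numpy sketch (the toy counts and the grid of candidate models are made up): over a grid of categorical models Q, the maximum-likelihood choice and the minimum-KL choice are the same point, near the empirical frequencies.
import numpy as np

data = np.array([0]*70 + [1]*20 + [2]*10)            # toy samples from 3 classes
p_emp = np.bincount(data, minlength=3) / len(data)    # empirical distribution P

def log_lik(q):   # sum_i log Q(x_i)
    return np.sum(np.log(q[data]))

def kl(p, q):     # D_KL(P || Q)
    return np.sum(p * np.log(p / q))

best_ll = best_kl = None
for a in np.linspace(0.05, 0.9, 40):
    for b in np.linspace(0.05, 0.9, 40):
        if a + b > 0.95:                 # keep the third probability >= 0.05
            continue
        q = np.array([a, b, 1 - a - b])
        if best_ll is None or log_lik(q) > log_lik(best_ll):
            best_ll = q
        if best_kl is None or kl(p_emp, q) < kl(p_emp, best_kl):
            best_kl = q

print("argmax log-likelihood:", best_ll)
print("argmin KL:            ", best_kl)
print("empirical frequencies:", p_emp)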
Loss functions
Dissecting loss functions
BCE (binary cross entropy) is just cross entropy loss specialized to two classes.
Cross entropy loss: best and worst cases (upper and lower bounds) for inaccurate vs accurate predictions (accurate meaning the highest-probability class is the correct class)
Logit vs probit
Probit: CDF of the standard normal. Logit is a bit simpler analytically (though both have closed-form derivatives). Logit has fatter tails; probit approaches 1 more quickly.
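A quick tail comparison, assuming scipy (comparing the raw, unrescaled link functions): the logistic tail decays like $e^{-x}$, the normal tail like $e^{-x^2/2}$, so probit gets close to 1 much faster.
from scipy.special import expit          # logistic sigmoid
from scipy.stats import norm

for x in (2.0, 3.0, 4.0, 5.0):
    print(f"x={x}:  1 - logistic = {1 - expit(x):.2e}   1 - probit = {1 - norm.cdf(x):.2e}")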
Sigmoid/logistic and softmax
$\sigma(x) = \frac{e^x}{1+e^x}$ or $\frac{e^x}{e^0+e^x}$ or $1 - \frac{1}{1+e^x}$. So it's as if $y=1$ has logit $x$ and $y=0$ always has logit 0.
softmax generalizes sigmoid to any number of classes. But typically softmax gives every class its own logit (and probability), rather than inferring the last class's probability as 1 minus the others. This is the difference from sigmoid, which effectively holds one class's logit fixed at 0.
Binary cross entropy: since we use sigmoid, the $y=0$ class probability is always $1-p$, so here's a visual.
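A quick torch check of the softmax-vs-sigmoid claim above: a 2-class softmax with the second logit pinned at 0 reproduces the sigmoid of the first logit exactly.
import torch

x = torch.linspace(-5, 5, steps=11)
logits = torch.stack([x, torch.zeros_like(x)], dim=1)   # class 1 has logit x, class 0 has logit 0
softmax_p1 = torch.softmax(logits, dim=1)[:, 0]
print(torch.allclose(softmax_p1, torch.sigmoid(x)))     # True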
Backprop
Common derivatives:
MSE: pay attention to the sign: the error is truth - output, but the gradient w.r.t. the output is output - truth (the negative of the error).
Sigmoid: $\sigma'(x) = \sigma(x)(1-\sigma(x))$
Softmax: $\frac{\partial s_i}{\partial z_j} = s_i(\delta_{ij} - s_j)$
Cross entropy, applying the softmax: $\frac{\partial L}{\partial z_i} = p_i - y_i$
Common patterns:
# matmul: think about shapes
A = X @ W          # ac = ab @ bc
dW = X.T @ dA      # bc = ba @ ac
dX = dA @ W.T      # ab = ac @ cb
# sums -> scatter the grad
# sum scatters grad to all inputs
# forward: BTC -> BT (sum over C); backward: grad broadcasts back over C
Y = c * X.sum(dim=2)
dX = torch.ones_like(X) * (dY * c)[:, :, None]
# scatter -> sum the grads
# forward: BTC = BTC + C (bias broadcasts); backward: sum grads over the broadcast dims
A = X @ W + B
dB = dA.sum(dim=(0, 1))
# multiple uses are the same as scatters: add the grads from each use
Y = c * X
Z = d * X
dX = dY * c + dZ * d
# cross entropy / softmax
# samesies: (author's helper) asserts its arguments are equal and returns the value
a2 = F.softmax(z2, dim=1)
nll = samesies(
    a2[range(n), y.squeeze()].unsqueeze(dim=1).log(),
    torch.gather(a2, 1, y).log(),
)
loss = samesies(
    -1 / n * nll.sum(),
    F.cross_entropy(z2, y.squeeze()),
    F.nll_loss(a2.log(), y.squeeze()),
)
# combined softmax + CE gradient: da2_i/dz2_j = a_i (delta_ij - a_j) and
# dL/da2_i = -1/(n a_i) for the correct class (0 otherwise), so
# dL/dz2_i = (a_i - y_onehot_i)/n, i.e. (a-1)/n for the correct class, a/n otherwise
dz2 = torch.ones_like(z2) * a2
dz2[range(n), y.squeeze()] -= 1
dz2 /= n
samesies(dz2, z2.grad)
Metrics
Classification metrics
Precision: TP/(TP+FP)
Recall: TP/(TP+FN)
F1: 2TP/(2TP+FP+FN)
TPR: TP/P = TP/(TP+FN) = recall = 1 - FNR
FPR: FP/N = FP/(FP+TN) = 1 - TNR (1 - specificity)
ROC: TPR (y) vs FPR (x)
ECE: expected calibration error (see Calibration below)
AUC: area under ROC. Good measure of ranked separation, robust to imbalance.
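A sanity check of the classification formulas above against scikit-learn, on made-up toy labels and scores:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_score = np.array([0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.7, 0.6])   # toy scores
y_pred  = (y_score >= 0.5).astype(int)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
print(precision_score(y_true, y_pred), tp / (tp + fp))
print(recall_score(y_true, y_pred),    tp / (tp + fn))
print(f1_score(y_true, y_pred),        2 * tp / (2 * tp + fp + fn))
print(roc_auc_score(y_true, y_score))  # area under the TPR-vs-FPR curve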
Ranking metrics
Precision@K: precision in top k
Recall@K: recall in top k
Average precision AP@n: $\frac{1}{P} \sum_{k=1}^n P@k \times rel@k$ where P is the number of relevant documents, P@k is precision@k, and rel@k is the relevance indicator (1 if the document at rank k is relevant, 0 otherwise)
Rewards ranking all relevant items highly: it considers the position of every relevant item, not just the first.
Mean average precision mAP@k is mean AP@k over all queries/samples
Mean reciprocal rank MRR is 1/(rank of the first relevant item), meaned over all queries/samples.
Normalized discounted cumulative gain NDCG@k: $DCG@k = \sum_{i=1}^k \frac{rel_i}{\log_2(i+1)}$ (or $2^{rel_i}-1$ in the numerator for graded relevance), normalized by the DCG of the ideal ordering: $NDCG@k = DCG@k / IDCG@k$.
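Toy single-query implementations of the ranking metrics above (a plain numpy sketch; AP here normalizes by the number of relevant items in the top n, one common convention; mAP/MRR would then average these per-query values over queries):
import numpy as np

def precision_at_k(rels, k):                 # rels[i] = 1 if the item at rank i+1 is relevant
    return np.mean(rels[:k])

def average_precision(rels, n):
    n_rel = np.sum(rels[:n])
    if n_rel == 0:
        return 0.0
    return sum(precision_at_k(rels, k + 1) * rels[k] for k in range(n)) / n_rel

def reciprocal_rank(rels):
    hits = np.flatnonzero(rels)
    return 1.0 / (hits[0] + 1) if len(hits) else 0.0

def ndcg(rels, k):
    discounts = np.log2(np.arange(2, k + 2))            # log2(i+1) for ranks 1..k
    dcg = np.sum(rels[:k] / discounts)
    idcg = np.sum(np.sort(rels)[::-1][:k] / discounts)   # DCG of the ideal ordering
    return dcg / idcg if idcg > 0 else 0.0

rels = np.array([0, 1, 1, 0, 1])             # relevance by rank for one query
print(average_precision(rels, 5), reciprocal_rank(rels), ndcg(rels, 5))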
SVMs as infinite-dimensional features
Explainable machine learning
Partial dependence plots (PDP): plot y = average model response vs x = the feature of interest. (The other features keep their observed values and are averaged over, not perturbed.)
Definition
$$ \begin{split}pd_{X_S}(x_S) &\overset{def}{=} \mathbb{E}_{X_C}\left[ f(x_S, X_C) \right]\\ &= \int f(x_S, x_C) p(x_C) dx_C,\end{split} $$
1D vs 2D
Individual conditional expectation (ICE): like PDP but plot multiple y=responses (one per example)
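Both PDP and ICE can be drawn in one call with scikit-learn's PartialDependenceDisplay (assumes scikit-learn >= 1.0 and matplotlib; the dataset and estimator below are just placeholders):
import matplotlib.pyplot as plt
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_friedman1(n_samples=500, random_state=0)
est = GradientBoostingRegressor().fit(X, y)

# kind="average" -> PDP, kind="individual" -> ICE, kind="both" -> overlay
PartialDependenceDisplay.from_estimator(est, X, features=[0, 1], kind="both")
plt.show()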
Collaborative filtering
Matrix factorization
import torch

class MatrixFactorization(torch.nn.Module):
    def __init__(self, n_users, n_items, n_factors=20):
        super().__init__()
        self.user_factors = torch.nn.Embedding(n_users, n_factors, sparse=True)
        self.item_factors = torch.nn.Embedding(n_items, n_factors, sparse=True)

    def forward(self, user, item):
        # predicted rating = dot product of user and item embeddings
        return (self.user_factors(user) * self.item_factors(item)).sum(1)

loss_func = torch.nn.MSELoss()
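A minimal training-loop sketch for the model above on made-up interaction data (the sizes, learning rate, and step count are arbitrary; SparseAdam pairs with the sparse=True embeddings):
n_users, n_items = 100, 50
model = MatrixFactorization(n_users, n_items, n_factors=20)
opt = torch.optim.SparseAdam(model.parameters(), lr=1e-2)

users   = torch.randint(0, n_users, (1024,))      # toy (user, item, rating) triples
items   = torch.randint(0, n_items, (1024,))
ratings = torch.rand(1024) * 5

for _ in range(100):
    opt.zero_grad()
    loss = loss_func(model(users, items), ratings)
    loss.backward()
    opt.step()
print(loss.item())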
CollabNN: just user/item embeddings into NN
# fastai-style: Module, Embedding, and sigmoid_range come from fastai
# (fastai's Module doesn't require calling super().__init__())
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        x = self.layers(torch.cat(embs, dim=1))
        # squash the output into the rating range
        return sigmoid_range(x, *self.y_range)
Lies in machine learning
Trees vs neural networks, why XGBoost is better than DNNs: [this is not a great explanation]
(source)
Deep Learning is the state of the art for images, video, audio and natural language processing. In all those tasks the features for each observation are homogeneous in the sense that they have the same scale and unit of measure: pixels in an image, frames in a video, words in text, etc.
When your data is made of heterogeneous columns such as age, weight, number of times the client called, average time of a call, etc then Xgboost is usually better than Deep Learning.
Calibration: how well do classifier probabilities match actual accuracy rates?
Expected calibration error (ECE): each bin's |accuracy - confidence| gap is weighted by the number of samples in the bin, so the most populated bins (often the outermost, high-confidence ones) dominate.
Average calibration error (ACE): unweighted by samples per bin; every bin contributes equally regardless of how many samples fall in it.
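A minimal numpy sketch of both quantities for binary confidences, using equal-width bins over [0, 1] (the bin count and toy data are arbitrary):
import numpy as np

def calibration_errors(conf, correct, n_bins=10):
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)   # equal-width bins
    ece, ace, used = 0.0, 0.0, 0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - conf[mask].mean())   # |accuracy - confidence|
        ece += mask.mean() * gap        # weight by fraction of samples in the bin
        ace += gap                      # every non-empty bin weighted equally
        used += 1
    return ece, ace / used

conf    = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55])
correct = np.array([1, 1, 0, 1, 0, 1])
print(calibration_errors(conf, correct))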
Ways to make non-differentiable things differentiable
Consider exploring