• Variational inference

  • Graphical models

    • bayesian network aka belief network aka directed graphical model
      • local markov prop: each node is conditionally indep of its non-descendants given its parents' values
      • edges usu repr causality but not required (either dir)
      • plates: for repeating sub vars
      • eg naive bayes, neural nets
      • markov blanket: a node's parents, children, and children's other parents
        • every node is conditionally indep of all other nodes in network given blanket
      • dynamic bayesian network: for sequences of vars, eg time series
        • eg HMMs, kalman filter, SMC
    • markov network aka markov random fields aka undirected graphical model
      • reprs some things bayes nets can't (cyclic deps) & vice versa (induced deps)
      • eg log-linear models, gaussian markov random field TODO
      • must satisfy pairwise, local, global markov props TODO
      • conditional random fields TODO
    • factor graph
      • bipartite graph of variables and factors, representing the factorization of a function TODO
    • http://cacm.acm.org/magazines/2010/12/102122-bayesian-networks/fulltext
  • Numerical sampling algorithms

    • Inverse transform sampling


      • Intuition: given ability to generate random numbers in [0,1] i.e. sample uniform $U(0,1)$, if you know the target distribution’s CDF $F$, go backward by mapping an output point on the range (which, since it’s a probability CDF, is always in [0,1]) to the input domain.

        • The “steep” parts are where there’s greatest density in the PDF.
      • Derivation

        • Say $X$ is target distribution random variable, $U$ is uniform random variable, $F_A$ is CDF of any RV $A$, and $T$ is the transformation we want on $U$ to turn it into $X$ (assume $T$ is monotone increasing, so applying $T^{-1}$ inside the probability preserves the inequality). Note $P[U<a]=a$ for any $a\in[0,1]$. Then:

        $$ F_X(x)=P[X<x]=P[T(U)<x]=P[U<T^{-1}(x)]=T^{-1}(x) $$

        • So $F_X(x)=T^{-1}(x)$ i.e. $T(x)=F_X^{-1}(x)$! Thus, the transform of uniform into $X$ is the inverse of the CDF.
      • But you need the CDF in closed form, which many real-world problems don’t have, and then you need to be able to invert it too.
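
      • A minimal sketch, assuming an Exponential($\lambda$) target, a case where the CDF and its inverse are both closed-form:

        ```python
        import numpy as np

        # Inverse transform sampling for Exponential(lam):
        # CDF F(x) = 1 - exp(-lam*x), so T(u) = F^{-1}(u) = -ln(1 - u) / lam.
        rng = np.random.default_rng(0)
        lam = 2.0

        u = rng.random(100_000)        # sample U(0, 1)
        x = -np.log(1 - u) / lam       # apply T = F^{-1} to turn U into X

        print(x.mean())                # ≈ 1/lam = 0.5
        ```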

  • Monte Carlo sampling methods

    • Accept-reject sampling aka rejection sampling: simple; sample the same domain with an easy-to-sample surrogate distribution (scaled up by M), and accept each sample with probability p_target(x) / (M · p_surrogate(x)).


      • Don’t know target $p(s)$ but know it’s $f(s)/N_C$ and don’t know $N_C$ the normalizing constant (requires integrating everywhere).
      • First choose an easy-to-sample $g(s)$. Doesn’t need to be similar to $f(s)$, but it helps. Scale it up by $M$ so that $M g(s) \ge f(s)$ everywhere.
      • Repeatedly sample from $g(s)$, accepting with prob $\frac{f(s)}{M g(s)}$.
      • But inefficient: often $M$ must be huge, so most samples are rejected.
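      • A minimal sketch, assuming a hypothetical unnormalized target $f(s)=s(1-s)^4$ (a Beta(2,5) shape) and a Uniform(0,1) surrogate $g$:

        ```python
        import numpy as np

        # Rejection sampling: f is unnormalized (no N_C needed); g(s) = 1 on [0, 1].
        rng = np.random.default_rng(0)

        f = lambda s: s * (1 - s) ** 4   # unnormalized target; max value ~0.082
        M = 0.1                          # chosen so M * g(s) >= f(s) everywhere

        def rejection_sample(n_accepted=10_000):
            samples = []
            while len(samples) < n_accepted:
                s = rng.random()                     # sample from g = Uniform(0, 1)
                if rng.random() < f(s) / (M * 1.0):  # accept with prob f(s) / (M g(s))
                    samples.append(s)
            return np.array(samples)

        print(rejection_sample().mean())   # Beta(2, 5) mean is 2/7 ≈ 0.286
        ```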
    • Importance sampling
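
      • A minimal sketch of the standard idea: to estimate $E_p[f(X)]$ without sampling from $p$, sample from an easy proposal $q$ and weight each sample by $p(x)/q(x)$ (the target and proposal below are assumptions for illustration):

        ```python
        import numpy as np

        # Importance sampling: estimate E_p[X^2] for p = N(0, 1) using
        # samples from a wider proposal q = N(0, 2), reweighted by p/q.
        rng = np.random.default_rng(0)

        def normal_pdf(x, mu, sigma):
            return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

        xs = rng.normal(0, 2, size=100_000)               # samples from q
        w = normal_pdf(xs, 0, 1) / normal_pdf(xs, 0, 2)   # importance weights p(x)/q(x)
        print(np.mean(xs ** 2 * w))                       # ≈ E_p[X^2] = 1
        ```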

    • markov chain monte carlo (MCMC): general family of algorithms

      • Intuition: accept-reject sampling treats samples independently, which is inefficient. If we successfully sampled recently, then the next sample should be close by, because we’re probably in a high-density region.
      • Idea: the chain’s stationary distribution should be the target distribution. We’ll get there and settle there eventually. Throw away the initial samples (“burn-in” samples); keep only the later ones, since those are what actually represent $p(x)$.
      • How the chain evolves is up to the more specific algorithms.
      • TODO taxonomy: http://brenocon.com/blog/2013/04/monte-carlo-algorithm-inputs/
      • http://www.ams.org/journals/bull/2009-46-02/S0273-0979-08-01238-X/S0273-0979-08-01238-X.pdf
      • variational bayes TODO
        • deterministic
      • TODO BUGS, JAGS, Metropolis-Hastings
    • Gibbs sampling: MCMC; sample from multidimensional distribution by sampling from conditionals along each dimension

      • Special case of Metropolis-Hastings

      • E.g. 2D normal: pick some initial $x,y$, then sample $p(x\mid y)$ then vice versa. Assumes sampling $p(x \mid y)$ easier than $p(x,y)$.

      • Doesn’t work when can’t move along any single dimension to arrive at other density islands


      • Inefficient when the space is mostly low density: hard to arrive at high-density bubbles, and once in one, the chain won’t leave it

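      • A minimal sketch for the 2D normal example above, assuming unit variances and correlation $\rho$, so each conditional is a 1D Gaussian:

        ```python
        import numpy as np

        # Gibbs sampling for a bivariate normal with correlation rho:
        # p(x | y) = N(rho * y, 1 - rho^2), and symmetrically for p(y | x).
        rng = np.random.default_rng(0)
        rho = 0.8

        def gibbs(n_samples=10_000):
            x, y = 0.0, 0.0                # arbitrary initial point
            samples = []
            for _ in range(n_samples):
                x = rng.normal(rho * y, np.sqrt(1 - rho**2))   # sample p(x | y)
                y = rng.normal(rho * x, np.sqrt(1 - rho**2))   # sample p(y | x)
                samples.append((x, y))
            return np.array(samples)

        samples = gibbs()
        print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])   # ≈ rho
        ```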

    • Metropolis-Hastings Monte Carlo (MHMC): most popular MCMC algorithm


      • Intuition: like accept-reject but with MCMC statefulness. Target dist is $f$. Sample from some distribution $g(X_{t+1} \mid X_t)$, say, the normal centered at last point $\mathcal{N}(X_t, \sigma^2)$. (Move nearby.) Then accept with some prob $A = \min(1, \frac{f(b)}{f(a)} \frac{g(a\mid b)}{g(b\mid a)})$, where $a$ is the current state and $b$ the proposal.
      • Intuition: always accept the next state if it’s higher prob. If it’s lower prob, accept it only with probability $f(b)/f(a)$, i.e. proportionally to how much of a drop it is.
      • Metropolis: more specifically for when $g$ is symmetric
      • aka random walk
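      • A minimal random-walk Metropolis sketch, assuming a standard-normal target $f(x)=e^{-x^2/2}$ (unnormalized) and a symmetric Gaussian proposal, so the $g$ ratio cancels:

        ```python
        import numpy as np

        # Random-walk Metropolis: propose nearby, accept with min(1, f(b)/f(a)).
        rng = np.random.default_rng(0)

        def metropolis(f, n_samples=10_000, sigma=1.0, x0=0.0):
            x = x0
            samples = []
            for _ in range(n_samples):
                x_new = rng.normal(x, sigma)                 # move nearby
                if rng.random() < min(1.0, f(x_new) / f(x)):
                    x = x_new                                # accept; else stay put
                samples.append(x)
            return np.array(samples)

        samples = metropolis(lambda x: np.exp(-0.5 * x**2))
        kept = samples[1000:]                                # discard burn-in samples
        print(kept.mean(), kept.std())                       # ≈ 0, ≈ 1
        ```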
    • Hamiltonian Monte Carlo (HMC): MCMC; intuition is, flip the pdf, simulate a frictionless object falling down the slopes, occasionally randomly kick it out of minima (source)
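
      • A minimal sketch for a 1D standard-normal target, where the “flipped pdf” is the potential $U(q) = -\log p(q) = q^2/2$ and the random “kicks” are momentum resamples each iteration:

        ```python
        import numpy as np

        # Hamiltonian Monte Carlo: leapfrog-simulate a particle on U(q) = q^2 / 2.
        rng = np.random.default_rng(0)
        U = lambda q: 0.5 * q**2
        grad_U = lambda q: q

        def leapfrog(q, p, eps, n_steps):
            p = p - 0.5 * eps * grad_U(q)           # half step for momentum
            for _ in range(n_steps - 1):
                q = q + eps * p                     # full step for position
                p = p - eps * grad_U(q)             # full step for momentum
            q = q + eps * p
            p = p - 0.5 * eps * grad_U(q)           # final half step for momentum
            return q, p

        def hmc(n_samples=5_000, eps=0.1, n_steps=20):
            q, samples = 0.0, []
            for _ in range(n_samples):
                p0 = rng.normal()                   # random momentum "kick"
                q_new, p_new = leapfrog(q, p0, eps, n_steps)
                # Metropolis correction on total energy (potential + kinetic)
                dH = (U(q) + 0.5 * p0**2) - (U(q_new) + 0.5 * p_new**2)
                if rng.random() < np.exp(dH):
                    q = q_new
                samples.append(q)
            return np.array(samples)

        samples = hmc()
        print(samples.mean(), samples.std())        # ≈ 0, ≈ 1
        ```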

    • No U-Turn Sampler (NUTS): an HMC method; used in STAN, Pyro

  • Sampling from a Bayes net

    • Pieter Abbeel: https://www.youtube.com/watch?v=ol0l6aTfb_g
    • Forward sampling aka prior sampling: just sample in topological order
    • Rejection sampling: stop your forward sampling pass as soon as you sample an evidence variable that is inconsistent with the evidence of the query
    • Likelihood weighting: rather than sample evidence variables, force them to be consistent with query, then reweight samples to be consistent
    • queries w evidence: est $P(Y=y\mid E=e)$
      • rejection sampling: fwd sampling, discard where $E\ne e$
        • $P(E=e)$ usu prohibitively small; too many samples required
      • fwd sampling generally infeasible for markov networks; no "roots"
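    • A minimal likelihood-weighting sketch on a hypothetical two-node net Rain → WetGrass (the probabilities below are made up), estimating $P(Rain \mid WetGrass = true)$:

      ```python
      import numpy as np

      # Likelihood weighting: sample non-evidence vars in topological order,
      # force evidence vars to their observed values, weight by their likelihood.
      rng = np.random.default_rng(0)
      P_rain = 0.2
      P_wet_given_rain = {True: 0.9, False: 0.3}

      def estimate_p_rain_given_wet(n_samples=100_000):
          w_rain, w_total = 0.0, 0.0
          for _ in range(n_samples):
              rain = bool(rng.random() < P_rain)   # sample Rain from its prior
              w = P_wet_given_rain[rain]           # WetGrass forced to true; weight by P(wet | rain)
              w_total += w
              if rain:
                  w_rain += w
          return w_rain / w_total

      print(estimate_p_rain_given_wet())   # exact: 0.18 / (0.18 + 0.24) ≈ 0.429
      ```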
  • Control variates

    • Reduce variance in MC estimation

    Suppose you want to estimate E[X] where X is some random variable. The naive way would be to sample X many times and take the average. But if you can find another random variable Y where:

    1. You know E[Y] exactly
    2. Y is correlated with X

    Then you can use the estimator: X - α(Y - E[Y])

    This is guaranteed to be unbiased for any α because: E[X - α(Y - E[Y])] = E[X] - α(E[Y] - E[Y]) = E[X]

    Example: estimate E[sin(X)] where X follows a standard normal distribution. The clever part is:

    1. We know that E[X] = 0 for a standard normal
    2. sin(X) is correlated with X (when X is small, sin(X) ≈ X)
    3. So X makes a good control variate!

    The key steps in applying control variates are:

    1. Find a correlated quantity Y with known expectation
    2. Find the optimal scaling factor α (usually through Cov(X,Y)/Var(Y))
    3. Subtract α(Y - E[Y]) from your original estimator
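
    A minimal sketch of the sin(X) example above (here the quantity being estimated is E[sin(X)] and the control variate Y is X itself):

    ```python
    import numpy as np

    # Control variate for estimating E[sin(X)], X ~ N(0, 1), using Y = X
    # as the control variate (E[Y] = 0 is known exactly).
    rng = np.random.default_rng(0)
    X = rng.normal(size=100_000)
    fX = np.sin(X)

    alpha = np.cov(fX, X)[0, 1] / np.var(X)    # optimal alpha = Cov(sin(X), X) / Var(X)
    cv = fX - alpha * (X - 0.0)                # the estimator sin(X) - alpha * (Y - E[Y])

    print("naive:           mean %.4f  var %.4f" % (fX.mean(), fX.var()))
    print("control variate: mean %.4f  var %.4f" % (cv.mean(), cv.var()))
    ```

    Both estimators have the same mean (unbiased), but the control-variate estimator's variance is several times smaller.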
  • Latent dirichlet allocation (LDA)

    • learns from a body of entities (e.g. a corpus of documents), inferring which attributes (e.g. words) belong to which latent topics
    • to generate a document:
      • num words ~ Poisson
      • topic mix ~ Dirichlet (over fixed set of topics)
      • to generate a word:
        • topic ~ Multinomial(topic mix)
        • word ~ Multinomial(that topic's word distribution)
    • learn with MCMC (collapsed Gibbs sampling)
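    • A minimal sketch of the generative process above (K, V, and the hyperparameters are made-up values for illustration):

      ```python
      import numpy as np

      # LDA generative process: topics are distributions over words,
      # each document has its own topic mix, each word picks a topic first.
      rng = np.random.default_rng(0)
      K, V = 3, 50                                   # num topics, vocabulary size
      alpha, beta = 0.5, 0.1                         # Dirichlet hyperparameters

      phi = rng.dirichlet([beta] * V, size=K)        # per-topic word distributions (K x V)

      def generate_document(mean_length=40):
          n_words = rng.poisson(mean_length)         # num words ~ Poisson
          theta = rng.dirichlet([alpha] * K)         # topic mix ~ Dirichlet
          words = []
          for _ in range(n_words):
              z = rng.choice(K, p=theta)             # topic ~ Multinomial(topic mix)
              words.append(rng.choice(V, p=phi[z]))  # word ~ Multinomial(topic's word dist)
          return words

      print(generate_document()[:10])                # first 10 word ids of one document
      ```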