Refs

In much of the following, I use Θ and $z$ interchangeably.

Overview

tldr:

In Bayesian statistics, we're often interested in finding the posterior distribution p(Θ|X), the probability of our model parameters Θ given our observed data X. Two common settings are:

  1. Traditional Bayesian inference: Finding posterior distributions over model parameters (Θ)
  2. Modern applications like VAEs: Finding posterior distributions over latent variables (z), like some autoencoded representation of an image

The problem is that calculating this posterior exactly is often mathematically impossible or computationally infeasible for complex models. This is because it requires computing a difficult integral (the evidence or marginal likelihood).
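To make the bottleneck explicit, Bayes' rule expresses the posterior as

$$p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)}, \qquad p(X) = \int p(X \mid \Theta)\, p(\Theta)\, d\Theta,$$

and it is the evidence $p(X)$ in the denominator, an integral over all possible values of Θ, that is usually intractable.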

Two approaches are commonly used to approximate p(Θ|X): (i) simulation (Markov chain Monte Carlo, MCMC) or (ii) optimization (variational inference, VI). Sampling-based methods have important shortcomings: they can be slow to converge, convergence is hard to diagnose, and they scale poorly to large datasets and high-dimensional models.

Idea: We look for a distribution q(Θ) that is a stand-in (a surrogate) for p(Θ|X). We then try to make q(Θ|Φ(X)) look similar to p(Θ|X) (i.e., minimize the KL divergence between them) by changing the values of the variational parameters Φ (Fig. 2). (Example: a Gaussian q with mean and variance collected in Φ.) This is done by maximising the evidence lower bound (ELBO):

$$\mathrm{ELBO}(\Phi) = \mathbb{E}\big[\ln p(X, \Theta) - \ln q(\Theta \mid \Phi)\big],$$

where the expectation E[·] is taken over q(Θ|Φ). (Note that Φ implicitly depends on the dataset X, but for notational convenience we'll drop the explicit dependence.)
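The link between minimising the KL divergence and maximising the ELBO is the standard decomposition

$$\ln p(X) = \mathrm{ELBO}(\Phi) + \mathrm{KL}\big(q(\Theta \mid \Phi)\,\|\,p(\Theta \mid X)\big).$$

Since $\ln p(X)$ does not depend on Φ, increasing the ELBO is equivalent to decreasing the KL divergence; and because the KL term is non-negative, the ELBO really is a lower bound on the (log) evidence, hence the name.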

We turn Bayesian inference into an optimization problem.
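As a concrete illustration, here is a minimal sketch of this optimization in PyTorch (the toy model, data, and variable names are made up for this example, not taken from the text above). It fits a Gaussian q(Θ|Φ) to the posterior over the mean of a Normal likelihood with a Normal prior, by maximising a Monte Carlo estimate of the ELBO with the reparameterization trick; the conjugate setup means the exact posterior is known, so we can check the fit.

```python
# A minimal sketch of variational inference, assuming PyTorch is available.
# Toy model (hypothetical, chosen only for illustration):
#   prior:       Theta ~ Normal(0, 1)
#   likelihood:  X_i | Theta ~ Normal(Theta, 1), i = 1..n
# Surrogate:     q(Theta | Phi) = Normal(mu, sigma), with Phi = (mu, log_sigma).
# We maximise a Monte Carlo estimate of ELBO(Phi) = E_q[ln p(X, Theta) - ln q(Theta | Phi)].
import torch

torch.manual_seed(0)
X = torch.randn(50) + 2.0                       # toy data centred around 2.0

mu = torch.zeros(1, requires_grad=True)         # variational mean
log_sigma = torch.zeros(1, requires_grad=True)  # log std keeps sigma > 0
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)
prior = torch.distributions.Normal(0.0, 1.0)

for step in range(2000):
    opt.zero_grad()
    q = torch.distributions.Normal(mu, log_sigma.exp())
    theta = q.rsample((64,))                    # reparameterized draws, shape (64, 1)
    # ln p(X, Theta) = sum_i ln p(X_i | Theta) + ln p(Theta), evaluated per sample
    log_lik = torch.distributions.Normal(theta, 1.0).log_prob(X).sum(dim=-1)
    log_joint = log_lik + prior.log_prob(theta).squeeze(-1)
    log_q = q.log_prob(theta).squeeze(-1)
    elbo = (log_joint - log_q).mean()           # Monte Carlo ELBO estimate
    (-elbo).backward()                          # maximise ELBO = minimise -ELBO
    opt.step()

# Conjugacy check: the exact posterior is Normal(sum(X)/(n+1), 1/sqrt(n+1)).
n = X.numel()
print("q mean:", mu.item(), "vs exact:", (X.sum() / (n + 1)).item())
print("q std :", log_sigma.exp().item(), "vs exact:", (1.0 / (n + 1)) ** 0.5)
```

Parameterising the standard deviation as exp(log_sigma) is just a convenient way to keep it positive during unconstrained gradient steps; any other positivity transform would do.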