In much of the following, I use Θ and z interchangeably.

Overview

tldr:

In Bayesian statistics, we're often interested in finding the posterior distribution p(Θ|X) - the distribution of our model parameters Θ given our observed data X. This shows up in two common settings:

  1. Traditional Bayesian inference: Finding posterior distributions over model parameters (Θ)
  2. Modern applications like VAEs: Finding posterior distributions over latent variables (z)

The problem is that calculating this posterior exactly is often analytically intractable or computationally infeasible for complex models. This is because it requires computing a difficult, high-dimensional integral (the evidence, or marginal likelihood).
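Concretely, Bayes' rule gives

p(Θ|X) = p(X|Θ) p(Θ) / p(X),   where   p(X) = ∫ p(X|Θ) p(Θ) dΘ,

and it is the evidence p(X), an integral over every possible Θ, that usually has no closed form.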

Two approaches are commonly used to approximate p(Θ|X): (i) simulation (Markov chain Monte Carlo, MCMC) or (ii) optimization (variational inference, VI). Sampling-based methods scale poorly to large datasets and high-dimensional models, and it is hard to tell when the chains have converged, which motivates the optimization route.

Idea: We look for a distribution q(Θ) that is a stand-in (a surrogate) for p(Θ|X). We give q free parameters Φ and try to make q(Θ|Φ) look similar to p(Θ|X) by changing the values of Φ (Fig. 2). This is done by maximising the evidence lower bound (ELBO):

ELBO(Φ) = E[ln p(X, Θ) − ln q(Θ|Φ)],

where the expectation E[·] is taken over q(Θ|Φ). (Note that Φ implicitly depends on the dataset X, but for notational convenience we'll drop the explicit dependence.)
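Why does maximising the ELBO make q(Θ|Φ) similar to p(Θ|X)? The log evidence decomposes as

ln p(X) = ELBO(Φ) + KL[q(Θ|Φ) || p(Θ|X)],

and since ln p(X) does not depend on Φ, pushing the ELBO up is the same as pushing the KL divergence between q and the true posterior down.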

We turn Bayesian inference into an optimization problem.
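To make the optimization concrete, here is a minimal sketch (assuming PyTorch and a toy conjugate Gaussian model; none of the names or settings below come from the text above) that fits a Gaussian q(Θ|Φ) by gradient ascent on a Monte Carlo estimate of the ELBO:

```python
# A minimal sketch (assumes PyTorch; the toy model is illustrative, not from the text):
# variational inference for a conjugate Gaussian model, maximizing a Monte Carlo ELBO estimate.
import torch

torch.manual_seed(0)

# Toy model: prior Θ ~ N(0, 1), likelihood X_i | Θ ~ N(Θ, 1).
X = torch.randn(50) + 2.0                      # "observed" data centred near 2

# Variational family q(Θ|Φ) = N(mu, sigma^2), with Φ = (mu, log_sigma).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(2000):
    opt.zero_grad()
    q = torch.distributions.Normal(mu, log_sigma.exp())
    theta = q.rsample((64,))                   # reparameterized samples from q(Θ|Φ)

    log_prior = torch.distributions.Normal(0.0, 1.0).log_prob(theta).squeeze(-1)
    log_lik = torch.distributions.Normal(theta, 1.0).log_prob(X).sum(-1)   # ln p(X|Θ)
    log_q = q.log_prob(theta).squeeze(-1)

    # ELBO(Φ) = E_q[ln p(X, Θ) − ln q(Θ|Φ)], estimated by averaging over the samples
    elbo = (log_lik + log_prior - log_q).mean()
    (-elbo).backward()                         # maximize the ELBO by minimizing its negative
    opt.step()

print(mu.item(), log_sigma.exp().item())       # ≈ posterior mean and std of the conjugate model
```

For this conjugate model the exact posterior is available in closed form (N(n·x̄/(n+1), 1/(n+1))), so it is easy to check that the fitted mu and sigma land close to the true posterior mean and standard deviation.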