In much of the following, I use Θ and $z$ interchangeably.
tldr:
In Bayesian statistics, we're often interested in the posterior distribution p(Θ|X) - the distribution of our model parameters Θ given the observed data X.
The problem is that calculating this posterior exactly is often analytically intractable or computationally infeasible for complex models, because it requires computing a difficult integral over the parameters (the evidence, or marginal likelihood).
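Concretely, Bayes' rule gives

p(Θ|X) = p(X|Θ) p(Θ) / p(X),   where   p(X) = ∫ p(X, Θ) dΘ,

and it is this normalising integral p(X) over all possible parameter values that usually has no closed form.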
Two broad approaches are used to approximate p(Θ|X): (i) simulation (MCMC) or (ii) optimization (variational inference, VI). Sampling-based methods can be slow to converge and scale poorly to large datasets.
Idea: We look for a distribution q(Θ) that acts as a stand-in (a surrogate) for p(Θ|X). We then make q[Θ|Φ(X)] as similar as possible to p(Θ|X) by adjusting the variational parameters Φ (Fig. 2). This is done by maximising the evidence lower bound (ELBO):
ℒ(Φ) = E[ln p(X, Θ) − ln q(Θ|Φ)],
where the expectation E[·] is taken over q(Θ|Φ). (Note that Φ implicitly depends on the dataset X, but for notational convenience we'll drop the explicit dependence.)
We turn Bayesian inference into an optimization problem.
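To make this concrete, here's a minimal sketch in Python (NumPy/SciPy), assuming a toy conjugate model - Θ ~ N(0, 1), x_i | Θ ~ N(Θ, 1) - and a Gaussian variational family q(Θ|Φ) = N(μ, σ²) with Φ = (μ, log σ). The ELBO is estimated by Monte Carlo using fixed reparameterized samples Θ_s = μ + σ·ε_s, so the objective is deterministic and can be handed to a standard optimizer. All names and numbers below are illustrative choices, not from the original post.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy data from an assumed model: Θ ~ N(0, 1) prior, x_i | Θ ~ N(Θ, 1).
X = rng.normal(loc=1.5, scale=1.0, size=50)

# Fixed base samples ε_s for the reparameterization Θ_s = μ + σ·ε_s.
# Reusing the same ε_s makes the Monte Carlo ELBO a deterministic function of Φ.
eps = rng.standard_normal(200)

def neg_elbo(phi):
    mu, log_sigma = phi
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps                              # samples from q(Θ|Φ)
    log_prior = norm.logpdf(theta, 0.0, 1.0)              # ln p(Θ)
    log_lik = norm.logpdf(X[:, None], theta, 1.0).sum(0)  # ln p(X|Θ)
    log_q = norm.logpdf(theta, mu, sigma)                 # ln q(Θ|Φ)
    elbo = np.mean(log_prior + log_lik - log_q)           # E_q[ln p(X,Θ) − ln q(Θ|Φ)]
    return -elbo

# Maximise the ELBO by minimising its negative over Φ = (μ, log σ).
phi_opt = minimize(neg_elbo, x0=np.array([0.0, 0.0]), method="Nelder-Mead").x
mu_hat, sigma_hat = phi_opt[0], np.exp(phi_opt[1])

# For this conjugate model the exact posterior is N(Σx/(n+1), 1/(n+1)),
# so we can check how close the optimized q(Θ|Φ) gets.
n = len(X)
print(f"VI:    mu={mu_hat:.3f}, sigma={sigma_hat:.3f}")
print(f"Exact: mu={X.sum()/(n+1):.3f}, sigma={np.sqrt(1/(n+1)):.3f}")
```

In this toy case the optimized q(Θ|Φ) essentially recovers the exact Gaussian posterior; for genuinely intractable models the same recipe applies, just with a q that can only approximate p(Θ|X).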