Overview
tldr:
- optimize the parameters of a simpler distribution (drawn from the variational family) to best approximate the target posterior distribution
- turns Bayesian inference into an optimization problem, so it scales to bigger data
- variational family: the set of candidate distributions used to approximate the posterior; their parameters are what we optimize
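To make the "variational family" bullet concrete, here is a minimal sketch in pure Python, assuming a mean-field Gaussian family (one independent Gaussian per latent parameter). The class name and interface are illustrative, not from any particular library:

```python
import math
import random

class MeanFieldGaussian:
    """Toy variational family: independent Gaussians, one per latent parameter.
    phi = (means, log_sigmas) are the free parameters we would optimize."""
    def __init__(self, means, log_sigmas):
        self.means = list(means)
        self.log_sigmas = list(log_sigmas)  # log-scale keeps sigma positive

    def sample(self):
        # draw one theta vector from q(theta | phi)
        return [m + math.exp(ls) * random.gauss(0.0, 1.0)
                for m, ls in zip(self.means, self.log_sigmas)]

    def log_prob(self, thetas):
        # ln q(theta | phi), summed over the independent dimensions
        total = 0.0
        for t, m, ls in zip(thetas, self.means, self.log_sigmas):
            s = math.exp(ls)
            total += -0.5 * math.log(2 * math.pi * s * s) - (t - m) ** 2 / (2 * s * s)
        return total

random.seed(1)
q = MeanFieldGaussian(means=[0.0, 2.0], log_sigmas=[0.0, -1.0])
theta = q.sample()
lp = q.log_prob(theta)
```

The point is only that a family member is fully described by a small parameter vector Φ, and supports the two operations VI needs: sampling and evaluating ln q.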
Two main approaches are used to approximate p(Θ|X): (i) simulation (MCMC) and (ii) optimization (VI). Sampling-based methods have several important shortcomings.
- Although they are guaranteed to find a globally optimal solution given enough time, in practice, with only finite time, it is difficult to tell how close a chain is to a good solution.
- In order to quickly reach a good solution, MCMC methods require choosing an appropriate sampling technique (e.g., a good proposal in Metropolis-Hastings). Choosing this technique can be an art in itself.
Idea: We look for a distribution q(Θ) that is a stand-in (a surrogate) for p(Θ|X). We then try to make q(Θ|Φ) look similar to p(Θ|X) by changing the values of Φ (Fig. 2). This is done by maximising the evidence lower bound (ELBO):
ℒ(Φ) = E[ln p(X,Θ) − ln q(Θ|Φ)],
where the expectation E[·] is taken over q(Θ|Φ). (Note that Φ implicitly depends on the dataset X, but for notational convenience we'll drop the explicit dependence.)
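Why maximising ℒ(Φ) gives a good posterior approximation: for any q, the log evidence decomposes as (a standard identity, written out here in LaTeX for reference):

```latex
\ln p(X)
  = \underbrace{\mathbb{E}_{q}\!\left[\ln p(X,\Theta) - \ln q(\Theta \mid \Phi)\right]}_{\mathcal{L}(\Phi)}
  + \underbrace{\mathrm{KL}\!\left(q(\Theta \mid \Phi)\,\middle\|\,p(\Theta \mid X)\right)}_{\ge 0}
```

Since ln p(X) does not depend on Φ, maximising ℒ(Φ) is the same as minimising the KL divergence from q to the true posterior; and because the KL term is nonnegative, ℒ(Φ) ≤ ln p(X), hence "evidence lower bound".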
We turn Bayesian inference into an optimization problem.
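A toy sketch of that optimization, assuming a conjugate Gaussian model (so the exact posterior is known and the answer can be checked). A grid search over Φ = (μ, σ) stands in for the gradient-based optimizers used in practice; the data values and grid are illustrative:

```python
import math
import random

# toy conjugate model: x_i ~ N(theta, 1) with prior theta ~ N(0, 1),
# so the exact posterior is N(sum(X)/(n+1), 1/(n+1))
X = [1.2, 0.8, 1.0, 1.5, 0.5]

def log_norm(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_joint(theta):
    # ln p(X, theta) = ln p(theta) + sum_i ln p(x_i | theta)
    return log_norm(theta, 0.0, 1.0) + sum(log_norm(x, theta, 1.0) for x in X)

def elbo(mu, sigma, zs):
    # Monte Carlo estimate of L(phi) = E_q[ln p(X, Theta) - ln q(Theta | phi)],
    # via the reparameterization theta = mu + sigma * z with z ~ N(0, 1);
    # sharing the z's across candidates (common random numbers) stabilizes the search
    total = 0.0
    for z in zs:
        theta = mu + sigma * z
        total += log_joint(theta) - log_norm(theta, mu, sigma)
    return total / len(zs)

random.seed(0)
zs = [random.gauss(0.0, 1.0) for _ in range(1000)]

# crude "optimizer": grid search over the variational parameters phi = (mu, sigma)
candidates = [(m / 10, s) for m in range(16) for s in (0.2, 0.3, 0.4, 0.6, 0.9)]
best_mu, best_sigma = max(candidates, key=lambda p: elbo(p[0], p[1], zs))

exact_mean = sum(X) / (len(X) + 1)    # ~0.83
exact_sd = 1 / math.sqrt(len(X) + 1)  # ~0.41
print(best_mu, best_sigma, exact_mean, exact_sd)
```

The ELBO-maximising (μ, σ) lands close to the exact posterior mean and standard deviation, illustrating the "inference as optimization" idea on a problem small enough to verify by hand.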
The main differences between sampling and variational techniques are that:
- Unlike sampling-based methods, variational approaches will almost never find the globally optimal solution: q can only be as good as the best member of the chosen variational family.
- Unlike MCMC, variational inference is deterministic and often converges faster. However, it may not capture the true variability of the posterior as accurately as MCMC/HMC, because it uses a simpler approximating distribution.
- On the other hand, we can always tell whether the optimization has converged. In some cases, we will even have bounds on the accuracy of the approximation.