1 Gaussian Mixture Model

For each observation \(\mathbf{x}_{n} \in \mathbb{R}^{D}\) we have a corresponding latent variable \(\mathbf{z}_{n}\) comprising a 1-of-K binary vector with elements \(z_{nk}\) for \(k = 1, ..., K\). The conditional distribution of \(\mathbf{Z}\), given the mixing coefficients \(\pmb{\pi}\), is given by \[ p(\mathbf{Z} | \pmb{\pi}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\pi_{k}^{z_{nk}} \] The conditional distribution of the observed data \(\mathbf{X}\), given the latent variables \(\mathbf{Z}\) and the component parameters \(\pmb{\mu}, \pmb{\Lambda}\), is \[ p(\mathbf{X} | \mathbf{Z}, \pmb{\mu}, \pmb{\Lambda}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mathcal{N}(\mathbf{x}_{n} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1})^{z_{nk}} \] where \(\pmb{\mu} = \{\pmb{\mu}_{k}\}\) and \(\pmb{\Lambda} = \{\pmb{\Lambda}_{k}\}\). Next we introduce priors over the parameters. We choose a Dirichlet distribution over the mixing proportions \(\pmb{\pi}\) \[ p(\pmb{\pi}) = \mathcal{D}ir( \pmb{\pi}| \pmb{\alpha}_{0}) = C(\pmb{\alpha}_{0})\prod_{k=1}^{K}\pi_{k}^{\alpha_{0} - 1} \] where \(C(\pmb{\alpha}_{0})\) is the normalization constant for the Dirichlet distribution and we have chosen the same parameter \(\alpha_{0}\) for each of the mixture components, giving a symmetric Dirichlet distribution.
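To make the generative model concrete, here is a minimal R sketch of sampling from it, using illustrative values for \(K\), \(\alpha_{0}\) and the component parameters (these values are not part of the model specification above; they are chosen purely for the example): draw \(\pmb{\pi}\) from the symmetric Dirichlet, each \(\mathbf{z}_{n}\) from a categorical distribution with probabilities \(\pmb{\pi}\), and each \(\mathbf{x}_{n}\) from the corresponding Gaussian component.

# Generative sketch of the GMM with a Dirichlet prior on the mixing proportions
# (illustrative parameter values only)
library(mvtnorm)
set.seed(1)
K <- 3; D <- 2; N <- 500
alpha_0 <- 1                                        # symmetric Dirichlet parameter
g       <- rgamma(K, shape = alpha_0)               # Dirichlet draw via normalized Gammas
pi_k    <- g / sum(g)                               # mixing proportions pi
mu      <- list(c(-3, 0), c(0, 3), c(3, 0))         # component means (arbitrary)
Lambda  <- replicate(K, diag(D), simplify = FALSE)  # component precisions (arbitrary)
z <- sample(1:K, size = N, replace = TRUE, prob = pi_k)  # 1-of-K latent assignments
X <- t(sapply(1:N, function(n)
  rmvnorm(1, mean = mu[[z[n]]], sigma = solve(Lambda[[z[n]]]))))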

We introduce an independent Gaussian-Wishart prior governing the mean and precision of each Gaussian component, given by \[ \begin{aligned} p(\pmb{\mu}, \pmb{\Lambda}) & = p(\pmb{\mu} | \pmb{\Lambda}) p(\pmb{\Lambda}) = \prod_{k=1}^{K}\mathcal{N}\big(\pmb{\mu}_{k} | \mathbf{m}_{0}, (\beta_{0} \pmb{\Lambda}_{k})^{-1}\big) \mathcal{W}(\pmb{\Lambda}_{k} | \mathbf{W}_{0}, \nu_{0}) \end{aligned} \] because this represents the conjugate prior distribution when both the mean and precision are unknown. Typically we would choose \(\mathbf{m}_{0} = \mathbf{0}\) by symmetry. The probabilistic graphical model of the Bayesian GMM corresponds to Fig. 10.5 in the Bishop book.
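For intuition, below is a short sketch of drawing one component's parameters from this Gaussian-Wishart prior, with illustrative hyperparameter values; the base R function stats::rWishart is parameterized by a scale matrix and degrees of freedom, matching \(\mathcal{W}(\pmb{\Lambda}_{k} | \mathbf{W}_{0}, \nu_{0})\).

# Sketch: sample (mu_k, Lambda_k) from the Gaussian-Wishart prior (illustrative hyperparameters)
library(mvtnorm)
set.seed(1)
D      <- 2
m_0    <- rep(0, D); beta_0 <- 1
nu_0   <- D + 2; W_0 <- diag(D)
Lambda_k <- rWishart(1, df = nu_0, Sigma = W_0)[, , 1]    # Lambda_k ~ W(W_0, nu_0)
mu_k     <- c(rmvnorm(1, mean = m_0,
                      sigma = solve(beta_0 * Lambda_k)))  # mu_k | Lambda_k ~ N(m_0, (beta_0 Lambda_k)^-1)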

Aside: This example provides a nice illustration of the distinction between the latent variables and parameters. Variables such as \(\mathbf{z}_{n}\) that appear inside the plate are regarded as latent variables because the number of such variables grows with the size of the data set. By contrast, variables such as \(\pmb{\mu}\) that are outside of the plate in the graphical model are fixed in number independently of the size of the data set, and so are regarded as parameters.

2 Variational GMM

In order to formulate a variational treatment of this model, we need to write down the joint distribution of all the random variables, which is given by \[ \begin{aligned} p(\mathbf{X}, \mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}) & = p(\mathbf{X} | \mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}) p(\mathbf{Z} | \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}) p(\pmb{\pi} | \pmb{\mu}, \pmb{\Lambda}) p(\pmb{\mu} | \pmb{\Lambda}) p(\pmb{\Lambda}) \\ & = p(\mathbf{X} | \mathbf{Z}, \pmb{\mu}, \pmb{\Lambda}) p(\mathbf{Z} | \pmb{\pi}) p(\pmb{\pi}) p(\pmb{\mu} | \pmb{\Lambda}) p(\pmb{\Lambda}) \end{aligned} \] where the decomposition corresponds to the probabilistic graphical model shown above. We now consider a variational distribution which factorizes between the latent variables and the parameters so that \[ q(\mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}) = q(\mathbf{Z})q(\pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}) \] It should be noted that this is the only assumption that we need to make in order to obtain a tractable practical solution to our Bayesian mixture model. In particular, the functional form of the factors \(q(\mathbf{Z})\) and \(q(\pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\) will be determined automatically by optimization of the variational distribution.

2.1 Update factor \(q(\mathbf{Z})\)

Let us consider the derivation of the update equation for the factor \(q(\mathbf{Z})\). The log of the optimized factor is given by \[ \text{ln}\;q^{*}(\mathbf{Z}) = \mathbb{E}_{\pmb{\pi},\pmb{\mu},\pmb{\Lambda}}\Big[\text{ln}\;p(\mathbf{X}, \mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\Big] + \text{const} \] Note that we are only interested in the functional dependence of the right-hand side on the variable \(\mathbf{Z}\), thus any terms that do not depend on \(\mathbf{Z}\) can be absorbed into the additive normalization constant. Also, by making use of the decomposition above, we have \[ \text{ln}\;q^{*}(\mathbf{Z}) = \mathbb{E}_{\pmb{\pi}}\Big[\text{ln}\;p(\mathbf{Z} | \pmb{\pi})\Big] + \mathbb{E}_{\pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\;p(\mathbf{X} | \mathbf{Z}, \pmb{\mu}, \pmb{\Lambda})\Big] + \text{const} \] where \[ \begin{align} \mathbb{E}_{\pmb{\pi}}\Big[\text{ln}\;p(\mathbf{Z} | \pmb{\pi})\Big] & = \mathbb{E}_{\pmb{\pi}}\Big[\text{ln}\;\prod_{n=1}^{N}\prod_{k=1}^{K}\pi_{k}^{z_{nk}} \Big] = \mathbb{E}_{\pmb{\pi}}\Big[\sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\;\text{ln}\;\pi_{k} \Big] = \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\;\mathbb{E}_{\pi_{k}}\Big[\text{ln}\;\pi_{k} \Big] \\ \mathbb{E}_{\pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\;p(\mathbf{X} | \mathbf{Z}, \pmb{\mu}, \pmb{\Lambda})\Big] & = \mathbb{E}_{\pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\;\prod_{n=1}^{N}\prod_{k=1}^{K}\mathcal{N}(\mathbf{x}_{n} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1})^{z_{nk}}\Big] \\ & = \mathbb{E}_{\pmb{\mu}, \pmb{\Lambda}}\Big[\sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\;\text{ln}\;\mathcal{N}(\mathbf{x}_{n} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1})\Big] \\ & = \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\;\mathbb{E}_{\pmb{\mu}_{k}, \pmb{\Lambda}_{k}}\Big[\text{ln}\;\mathcal{N}(\mathbf{x}_{n} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1})\Big] \\ & = \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\Bigg(-\frac{D}{2}\text{ln}\;(2\pi) + \frac{1}{2} \mathbb{E}_{\pmb{\Lambda}_{k}}\Big[\text{ln}\;|\pmb{\Lambda}_{k}|\Big] -\frac{1}{2}\mathbb{E}_{\pmb{\mu}_{k}, \pmb{\Lambda}_{k}}\Big[(\mathbf{x}_{n} - \pmb{\mu}_{k})^{T}\pmb{\Lambda}_{k}(\mathbf{x}_{n} - \pmb{\mu}_{k})\Big]\Bigg) \end{align} \] Substituting for the two conditional distributions on the right-hand side, and absorbing any terms that are independent of \(\mathbf{Z}\) into the additive constant, we have \[ \text{ln}\;q^{*}(\mathbf{Z}) = \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\;\text{ln}\;\rho_{nk} + \text{const} \] where we have defined \[ \text{ln}\;\rho_{nk} = \mathbb{E}_{\pi_{k}}\Big[\text{ln}\;\pi_{k} \Big] -\frac{D}{2}\text{ln}\;(2\pi) + \frac{1}{2} \mathbb{E}_{\pmb{\Lambda}_{k}}\Big[\text{ln}\;|\pmb{\Lambda}_{k}|\Big] -\frac{1}{2}\mathbb{E}_{\pmb{\mu}_{k}, \pmb{\Lambda}_{k}}\Big[(\mathbf{x}_{n} - \pmb{\mu}_{k})^{T}\pmb{\Lambda}_{k}(\mathbf{x}_{n} - \pmb{\mu}_{k})\Big] \] where \(D\) is the dimensionality of the data variable \(\mathbf{x}\). Taking the exponential on both sides we obtain \[ q^{*}(\mathbf{Z}) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\;\rho_{nk}^{z_{nk}} \] Requiring that this distribution be normalized, and noting that for each value of \(n\) the quantities \(z_{nk}\) are binary and sum to \(1\) over all values of \(k\), we obtain \[ q^{*}(\mathbf{Z}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\;r_{nk}^{z_{nk}} \quad\quad\quad \text{where} \quad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K}\rho_{nj}} \] We note that the factor \(q(\mathbf{Z})\) takes the same functional form as the prior \(p(\mathbf{Z}|\pmb{\pi})\).
For the discrete distribution \(q^{*}(\mathbf{Z})\) we have the standard result for the expected value of a multinomial variable \[ \mathbb{E}_{z_{nk}}\big[z_{nk}\big] = r_{nk} \] from which we see that the quantities \(r_{nk}\) are playing the role of responsibilities. Note that the optimal solution for \(q^{*}(\mathbf{Z})\) depends on moments evaluated w.r.t. the distributions of other variables, and so again the variational update equations are coupled and must be solved iteratively.
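In practice \(\rho_{nk}\) is computed in log space, and each row is normalized with the log-sum-exp trick to obtain the responsibilities (the same approach is used in the implementation below). A toy sketch with an arbitrary matrix of \(\text{ln}\;\rho_{nk}\) values:

# Toy sketch: normalize log rho_nk row-wise to obtain responsibilities r_nk
log_rho <- matrix(c(-1.2, -0.3, -2.0,
                    -0.5, -0.5, -3.1), nrow = 2, byrow = TRUE)
log_norm <- apply(log_rho, 1, function(x) { m <- max(x); m + log(sum(exp(x - m))) })
r_nk <- exp(log_rho - log_norm)   # subtracting a length-N vector recycles over rows
rowSums(r_nk)                     # each row sums to 1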

At this point, it is convenient to define the statistics of the observed data set evaluated w.r.t. the responsibilities, and note that they are analogous to the quantities evaluated in the maximum likelihood EM algorithm for the Gaussian mixture model. \[ \begin{align} N_{k} & = \sum_{n=1}^{N}r_{nk} \quad\quad \quad\quad \bar{\mathbf{x}}_{k} = \frac{1}{N_{k}}\sum_{n=1}^{N}r_{nk}\mathbf{x}_{n} \quad\quad\quad\quad \mathbf{S}_{k} = \frac{1}{N_{k}}\sum_{n=1}^{N}r_{nk}(\mathbf{x}_{n} - \bar{\mathbf{x}}_{k})(\mathbf{x}_{n} - \bar{\mathbf{x}}_{k})^{T} \end{align} \]
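These statistics translate directly into a few lines of R. Below is a sketch (compute_stats is a hypothetical helper, assuming X is an \(N \times D\) matrix and r_nk an \(N \times K\) responsibility matrix); the vb_gmm function later in this document computes the same quantities inside its M-step loop.

# Sketch: weighted statistics N_k, x_bar_k and S_k from the data and responsibilities
compute_stats <- function(X, r_nk) {
  D <- NCOL(X); K <- NCOL(r_nk)
  N_k     <- colSums(r_nk)                 # effective number of points per component
  x_bar_k <- t(r_nk) %*% X / N_k           # K x D matrix of weighted means
  S_k     <- array(0, c(D, D, K))          # weighted covariance matrices
  for (k in 1:K) {
    x_cen      <- sweep(X, MARGIN = 2, STATS = x_bar_k[k, ], FUN = "-")
    S_k[, , k] <- t(x_cen) %*% (x_cen * r_nk[, k]) / N_k[k]
  }
  return(list(N_k = N_k, x_bar_k = x_bar_k, S_k = S_k))
}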

2.2 Update factor \(q(\pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\)

Now let us consider the factor \(q(\pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\) in the variational posterior distribution. Using again the general expression for the optimized factor we have \[ \begin{align} \text{ln}\;q^{*}(\pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}) & = \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\mathbf{X}, \mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\Big] + \text{const} \\ & = \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\mathbf{X} | \mathbf{Z}, \pmb{\mu}, \pmb{\Lambda})\Big] + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\mathbf{Z} | \pmb{\pi})\Big] + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\pmb{\pi})\Big] + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\pmb{\mu}, \pmb{\Lambda})\Big] + \text{const} \\ & = \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;\prod_{n=1}^{N}\prod_{k=1}^{K}\mathcal{N}(\mathbf{x}_{n} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1})^{z_{nk}}\Big] + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\mathbf{Z} | \pmb{\pi})\Big] + \text{ln}\;p(\pmb{\pi}) + \sum_{k=1}^{K}\text{ln}\;p(\pmb{\mu}_{k}, \pmb{\Lambda}_{k}) + \text{const} \\ & = \mathbb{E}_{\mathbf{Z}}\Big[\sum_{n=1}^{N}\sum_{k=1}^{K}\text{ln}\;\mathcal{N}(\mathbf{x}_{n} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1})^{z_{nk}}\Big] + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\mathbf{Z} | \pmb{\pi})\Big] + \text{ln}\;p(\pmb{\pi}) + \sum_{k=1}^{K}\text{ln}\;p(\pmb{\mu}_{k}, \pmb{\Lambda}_{k}) + \text{const} \\ & = \mathbb{E}_{\mathbf{Z}}\Big[\sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\;\text{ln}\;\mathcal{N}(\mathbf{x}_{n} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1})\Big] + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\mathbf{Z} | \pmb{\pi})\Big] + \text{ln}\;p(\pmb{\pi}) + \sum_{k=1}^{K}\text{ln}\;p(\pmb{\mu}_{k}, \pmb{\Lambda}_{k}) + \text{const} \\ & = \sum_{n=1}^{N}\sum_{k=1}^{K}\mathbb{E}_{z_{nk}}\Big[z_{nk}\Big]\;\text{ln}\;\mathcal{N}(\mathbf{x}_{n} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1}) + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\mathbf{Z} | \pmb{\pi})\Big] + \text{ln}\;p(\pmb{\pi}) + \sum_{k=1}^{K}\text{ln}\;p(\pmb{\mu}_{k}, \pmb{\Lambda}_{k}) + \text{const} \end{align} \] We observe that the right-hand side of this expression decomposes into a sum of terms involving only \(\pmb{\pi}\) together with terms only involving \(\pmb{\mu}\) and \(\pmb{\Lambda}\), which implies that the variational posterior \(q(\pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\) factorizes to give \(q(\pmb{\pi})q( \pmb{\mu}, \pmb{\Lambda})\). Furthermore, the terms involving \(\pmb{\mu}\) and \(\pmb{\Lambda}\) themselves comprise a sum over \(k\) of terms involving \(\pmb{\mu}_{k}\) and \(\pmb{\Lambda}_{k}\), leading to the further factorization \[ q(\pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}) = q(\pmb{\pi})\prod_{k=1}^{K} q(\pmb{\mu}_{k}, \pmb{\Lambda}_{k}) \] We refer to these additional factorizations as induced factorizations because they arise from an interaction between the factorization assumed in the variational posterior distribution and the conditional independence properties of the true joint distribution.

First we work out the distribution over \(\pmb{\pi}\) \[ \begin{align} \text{ln}\;q^{*}(\pmb{\pi}) & = \text{ln}\;p(\pmb{\pi}) + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;p(\mathbf{Z} | \pmb{\pi})\Big] + \text{const} \\ & = \text{ln}\;\prod_{k=1}^{K}\pi_{k}^{\alpha_{0} - 1} + \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\;\prod_{k=1}^{K}\prod_{n=1}^{N}\pi_{k}^{z_{nk}}\Big] + \text{const} \\ & = \sum_{k=1}^{K}\text{ln}\;\pi_{k}^{\alpha_{0} - 1} + \mathbb{E}_{\mathbf{Z}}\Big[\sum_{k=1}^{K}\sum_{n=1}^{N}\text{ln}\;\pi_{k}^{z_{nk}}\Big] + \text{const} \\ & = \sum_{k=1}^{K}(\alpha_{0} - 1)\text{ln}\;\pi_{k} + \mathbb{E}_{\mathbf{Z}}\Big[\sum_{k=1}^{K}\sum_{n=1}^{N}z_{nk}\;\text{ln}\;\pi_{k}\Big] + \text{const} \\ & = (\alpha_{0} - 1)\sum_{k=1}^{K}\text{ln}\;\pi_{k} + \sum_{k=1}^{K}\sum_{n=1}^{N}\mathbb{E}_{z_{nk}}\Big[z_{nk}\Big]\text{ln}\;\pi_{k} + \text{const} \\ & = (\alpha_{0} - 1)\sum_{k=1}^{K}\text{ln}\;\pi_{k} + \sum_{k=1}^{K}\sum_{n=1}^{N}r_{nk}\;\text{ln}\;\pi_{k} + \text{const} \end{align} \] Taking the exponential on both sides, the sum of logarithms becomes a product and we observe that \(q^{*}(\pmb{\pi})\) is a Dirichlet distribution \[ \begin{align} q^{*}(\pmb{\pi}) & \propto \prod_{k=1}^{K}\pi_{k}^{\alpha_{0} - 1} \prod_{k=1}^{K}\prod_{n=1}^{N}\pi_{k}^{r_{nk}} \\ & = \prod_{k=1}^{K}\pi_{k}^{\alpha_{0} - 1} \prod_{k=1}^{K}\pi_{k}^{\sum_{n=1}^{N}r_{nk}} \\ & = \prod_{k=1}^{K}\pi_{k}^{\alpha_{0} - 1} \prod_{k=1}^{K}\pi_{k}^{N_{k}} \quad\quad \text{(using the statistics defined above)} \\ & = \prod_{k=1}^{K}\pi_{k}^{(\alpha_{0} + N_{k} - 1)} \end{align} \] which, after normalization, gives \[ q^{*}(\pmb{\pi}) = \mathcal{D}ir(\pmb{\pi} | \pmb{\alpha}) \] where \(\pmb{\alpha}\) has components \(\alpha_{k}\) given by \(\alpha_{k} = \alpha_{0} + N_{k}\)
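This update is a one-liner in R. The sketch below (with made-up values for \(\alpha_{0}\) and \(N_{k}\)) also evaluates \(\mathbb{E}[\text{ln}\;\pi_{k}] = \psi(\alpha_{k}) - \psi(\hat{\alpha})\), which will be needed when computing the responsibilities.

# Sketch: Dirichlet update alpha_k = alpha_0 + N_k and E[ln pi_k] (illustrative values)
alpha_0 <- 1e-3                                  # prior concentration
N_k     <- c(120.4, 95.1, 56.5)                  # effective counts from the E-step
alpha   <- alpha_0 + N_k                         # q*(pi) = Dir(pi | alpha)
log_pi  <- digamma(alpha) - digamma(sum(alpha))  # E[ln pi_k]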

Finally, the variational posterior distribution \(q^{*}(\pmb{\mu}_{k}, \pmb{\Lambda}_{k})\), using the product rule, can be written as \[ q^{*}(\pmb{\mu}_{k}, \pmb{\Lambda}_{k}) = q^{*}(\pmb{\mu}_{k} | \pmb{\Lambda}_{k}) q^{*}(\pmb{\Lambda}_{k}) \] where the factors can be found by keeping only the terms that involve these parameters in the above equations. As expected (we omit the derivation here), the resulting optimized factor is a Gaussian-Wishart distribution and is given by \[ q^{*}(\pmb{\mu}_{k}, \pmb{\Lambda}_{k}) = \mathcal{N}\big(\pmb{\mu}_{k} | \mathbf{m}_{k}, (\beta_{k} \pmb{\Lambda}_{k})^{-1}\big) \mathcal{W}(\pmb{\Lambda}_{k} | \mathbf{W}_{k}, \nu_{k}) \] where we have defined \[ \begin{align} \beta_{k} & = \beta_{0} + N_{k}\\ \mathbf{m}_{k} & = \frac{1}{\beta_{k}}(\beta_{0}\mathbf{m}_{0} + N_{k}\mathbf{\bar{x}}_{k}) \\ \mathbf{W}_{k}^{-1} & = \mathbf{W}_{0}^{-1} + N_{k}\mathbf{S}_{k} + \frac{\beta_{0}N_{k}}{\beta_{0} + N_{k}}(\mathbf{\bar{x}}_{k} - \mathbf{m}_{0})(\mathbf{\bar{x}}_{k} - \mathbf{m}_{0})^{T} \\ \nu_{k} & = \nu_{0} + N_{k} + 1 \end{align} \] These updates are analogous to the M-step equations of the EM algorithm for the maximum likelihood solution of the GMM. We see that the computations needed to update the variational posterior over the model parameters involve evaluating the same sums over the data set that arose in the maximum likelihood treatment.
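A compact sketch of these four updates for a single component \(k\) is shown below (update_gauss_wishart is a hypothetical helper; it assumes the statistics \(N_{k}\), \(\bar{\mathbf{x}}_{k}\), \(\mathbf{S}_{k}\) and the prior hyperparameters defined earlier, and keeps the \(\nu_{k} = \nu_{0} + N_{k} + 1\) convention used throughout this document).

# Sketch: Gaussian-Wishart variational M-step for one component k
update_gauss_wishart <- function(N_k, x_bar_k, S_k, m_0, beta_0, W_0_inv, nu_0) {
  beta_k  <- beta_0 + N_k
  m_k     <- (beta_0 * m_0 + N_k * x_bar_k) / beta_k
  W_k_inv <- W_0_inv + N_k * S_k +
             (beta_0 * N_k / (beta_0 + N_k)) * tcrossprod(x_bar_k - m_0)
  nu_k    <- nu_0 + N_k + 1
  return(list(beta = beta_k, m = m_k, W = solve(W_k_inv), nu = nu_k))
}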

In order to perform the variational M-step, we need the expectations \(\mathbb{E}\big[z_{nk}\big] = r_{nk}\) representing the responsibilities, and these are obtained by normalizing the \(\rho_{nk}\) (see equation above). This expression involves expectations w.r.t. the variational distributions of the parameters, and these are easily evaluated to give the following results \[ \begin{align} \text{ln}\;\tilde{\pi}_{k} \equiv \mathbb{E}\Big[\text{ln}\;\pi_{k}\Big] & = \psi(\alpha_{k}) - \psi(\hat{\alpha}) \quad\quad \text{where} \quad \hat{\alpha} = \sum_{k}\alpha_{k} \quad\quad\text{(see Eq. B.21)} \\ \text{ln}\;\tilde{\Lambda}_{k} \equiv \mathbb{E}\Big[\text{ln}\;|\pmb{\Lambda}_{k}|\Big] & = \sum_{d = 1}^{D}\psi\Big(\frac{\nu_{k} + 1 - d}{2}\Big) + D\;\text{ln}\;2 + \text{ln}\;|\mathbf{W}_{k}| \quad\quad\text{(see Eq. B.81)} \\ \mathbb{E}_{\pmb{\mu}_{k}, \pmb{\Lambda}_{k}}\Big[(\mathbf{x}_{n} - \pmb{\mu}_{k})^{T}\pmb{\Lambda}_{k}(\mathbf{x}_{n} - \pmb{\mu}_{k})\Big] & = D\beta_{k}^{-1} + \nu_{k}(\mathbf{x}_{n} - \mathbf{m}_{k})^{T}\mathbf{W}_{k}(\mathbf{x}_{n} - \mathbf{m}_{k}) \end{align} \] Substituting these expressions into the definition of \(\text{ln}\;\rho_{nk}\) and exponentiating, we obtain a formula for the responsibilities \[ \begin{align} r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K}\rho_{nj}} & \propto \text{exp}\bigg\{ \text{ln}\;\tilde{\pi}_{k} + \frac{1}{2}\text{ln}\;\tilde{\Lambda}_{k} -\frac{1}{2}\Big(D\beta_{k}^{-1} + \nu_{k}(\mathbf{x}_{n} - \mathbf{m}_{k})^{T}\mathbf{W}_{k}(\mathbf{x}_{n} - \mathbf{m}_{k})\Big)\bigg\} \\ & = \text{exp}\bigg\{ \text{ln}\;\Big[\tilde{\pi}_{k}\, \tilde{\Lambda}_{k}^{1/2}\, \text{exp}\Big\{-\frac{D}{2\beta_{k}} - \frac{\nu_{k}}{2}(\mathbf{x}_{n} - \mathbf{m}_{k})^{T}\mathbf{W}_{k}(\mathbf{x}_{n} - \mathbf{m}_{k})\Big\}\Big]\bigg\} \\ & = \tilde{\pi}_{k}\, \tilde{\Lambda}_{k}^{1/2}\, \text{exp}\Big\{-\frac{D}{2\beta_{k}} - \frac{\nu_{k}}{2}(\mathbf{x}_{n} - \mathbf{m}_{k})^{T}\mathbf{W}_{k}(\mathbf{x}_{n} - \mathbf{m}_{k})\Big\} \end{align} \] Optimization of the variational posterior resembles the EM algorithm. In the variational equivalent of the E-step, we use the current distributions over the model parameters to evaluate the moments and compute the responsibilities \(\mathbb{E}[z_{nk}] = r_{nk}\). Then in the variational M-step, we keep these responsibilities fixed and use them to re-compute the variational distribution over the parameters \(q^{*}(\pmb{\pi})\) and \(q^{*}(\pmb{\mu}, \pmb{\Lambda})\).
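Putting the three expectations together, the variational E-step for one component reduces to a few lines of R. Below is a sketch (e_step_log_rho is a hypothetical helper; the full loop over all components appears in the vb_gmm function later).

# Sketch: log rho_nk for all observations and a single component k
e_step_log_rho <- function(X, alpha, k, beta_k, m_k, W_k, nu_k) {
  D       <- NCOL(X)
  log_pi  <- digamma(alpha[k]) - digamma(sum(alpha))             # E[ln pi_k]
  log_Lam <- sum(digamma((nu_k + 1 - 1:D) / 2)) + D * log(2) +
             log(det(W_k))                                       # E[ln |Lambda_k|]
  diff    <- sweep(X, MARGIN = 2, STATS = m_k, FUN = "-")
  quad    <- D / beta_k + nu_k * rowSums((diff %*% W_k) * diff)  # E[(x-mu)' Lambda (x-mu)]
  return(log_pi + 0.5 * log_Lam - 0.5 * D * log(2 * pi) - 0.5 * quad)
}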

2.3 Variational lower bound

For the variational mixture of Gaussians, the lower bound \(\mathcal{L}(q)\) is given by \[ \begin{aligned} \mathcal{L}(q) & = \sum_{\mathbf{Z}}\int\int\int q(\mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\text{ln}\bigg\{\frac{p(\mathbf{X}, \mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})}{q(\mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})}\bigg\}\;d\pmb{\pi}\;d\pmb{\mu}\;d\pmb{\Lambda} \\ & = \;\mathbb{E}_{\mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\; p(\mathbf{X}, \mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\Big] - \mathbb{E}_{\mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\; q(\mathbf{Z}, \pmb{\pi}, \pmb{\mu}, \pmb{\Lambda})\Big] \\ & = \;\mathbb{E}_{\mathbf{Z}, \pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\; p(\mathbf{X} | \mathbf{Z}, \pmb{\mu}, \pmb{\Lambda})\Big] + \mathbb{E}_{\mathbf{Z}, \pmb{\pi}}\Big[\text{ln}\; p(\mathbf{Z} | \pmb{\pi})\Big] + \mathbb{E}_{\pmb{\pi}}\Big[\text{ln}\; p(\pmb{\pi})\Big] + \mathbb{E}_{\pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\; p(\pmb{\mu}, \pmb{\Lambda})\Big] \\ & \quad \quad- \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\; q(\mathbf{Z})\Big] - \mathbb{E}_{\pmb{\pi}}\Big[\text{ln}\; q(\pmb{\pi})\Big] - \mathbb{E}_{\pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\; q(\pmb{\mu}, \pmb{\Lambda})\Big] \end{aligned} \] The various terms in the bound are easily evaluated to give \[ \begin{aligned} \mathbb{E}_{\mathbf{Z}, \pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\; p(\mathbf{X} | \mathbf{Z}, \pmb{\mu}, \pmb{\Lambda})\Big] & = \frac{1}{2}\sum_{k = 1}^{K}N_{k}\Big\{\text{ln}\;\tilde{\Lambda}_{k} - D\beta_{k}^{-1} -\nu_{k}\;\text{tr}(\mathbf{S}_{k}\mathbf{W}_{k}) \\ & \quad\quad\quad\qquad - \nu_{k}(\bar{\mathbf{x}}_{k} - \mathbf{m}_{k})^{T}\mathbf{W}_{k}(\bar{\mathbf{x}}_{k} - \mathbf{m}_{k}) - D\;\text{ln}(2\pi)\Big\} \\ \mathbb{E}_{\mathbf{Z}, \pmb{\pi}}\Big[\text{ln}\; p(\mathbf{Z} | \pmb{\pi})\Big] & = \sum_{n = 1}^{N}\sum_{k=1}^{K}r_{nk}\;\text{ln}\;\tilde{\pi}_{k}\\ \mathbb{E}_{\pmb{\pi}}\Big[\text{ln}\; p(\pmb{\pi})\Big] & = \text{ln}\; C(\pmb{\alpha}_{0}) + (\alpha_{0} - 1)\sum_{k=1}^{K}\;\text{ln}\;\tilde{\pi}_{k} \\ \mathbb{E}_{\pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\; p(\pmb{\mu}, \pmb{\Lambda})\Big] & = \frac{1}{2}\sum_{k=1}^{K}\Big\{D\;\text{ln}(\beta_{0}/2\pi) + \text{ln}\;\tilde{\Lambda}_{k} - \frac{D\beta_{0}}{\beta_{k}} - \beta_{0}\nu_{k}(\mathbf{m}_{k} - \mathbf{m}_{0})^{T}\mathbf{W}_{k}(\mathbf{m}_{k} - \mathbf{m}_{0})\Big\} \\ & \quad\quad + K\;\text{ln}\;B(\mathbf{W}_{0}, \nu_{0}) + \frac{(\nu_{0} - D - 1)}{2}\sum_{k=1}^{K}\text{ln}\;\tilde{\Lambda}_{k} - \frac{1}{2}\sum_{k = 1}^{K}\nu_{k}\text{tr}(\mathbf{W}_{0}^{-1}\mathbf{W}_{k}) \\ \mathbb{E}_{\mathbf{Z}}\Big[\text{ln}\; q(\mathbf{Z})\Big] & = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\;\text{ln}\;r_{nk} \\ \mathbb{E}_{\pmb{\pi}}\Big[\text{ln}\; q(\pmb{\pi})\Big] & = \text{ln}\; C(\pmb{\alpha}) + \sum_{k=1}^{K}(\alpha_{k} - 1)\;\text{ln}\;\tilde{\pi}_{k} \\ \mathbb{E}_{\pmb{\mu}, \pmb{\Lambda}}\Big[\text{ln}\; q(\pmb{\mu}, \pmb{\Lambda})\Big] & = \sum_{k=1}^{K}\Big\{\frac{1}{2}\text{ln}\;\tilde{\Lambda}_{k} + \frac{D}{2}\text{ln}\Big(\frac{\beta_{k}}{2\pi}\Big) - \frac{D}{2} - H\big[q(\pmb{\Lambda}_{k})\big]\Big\} \end{aligned} \] where \(D\) is the dimensionality of \(\mathbf{x}\), \(H\big[q(\pmb{\Lambda}_{k})\big]\) is the entropy of the Wishart distribution given by (B.82), and the coefficients \(C(\pmb{\alpha})\) and \(B(\mathbf{W}, \nu)\) are defined by (B.23) and (B.79), respectively, in the Bishop book.
Note that the terms involving expectations of the logs of the \(q\) distributions simply represent the negative entropies of those distributions.
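The two normalization constants appearing in these terms are straightforward to code. Below is a sketch of \(\text{ln}\;C(\pmb{\alpha})\) (Eq. B.23) and \(\text{ln}\;B(\mathbf{W}, \nu)\) (Eq. B.79); log_C and log_B are sketch helpers, the latter mirroring the logB function defined inside vb_gmm further down.

# Sketch: log-normalizers of the Dirichlet (B.23) and Wishart (B.79) distributions
log_C <- function(alpha) lgamma(sum(alpha)) - sum(lgamma(alpha))
log_B <- function(W, nu) {
  D <- NCOL(W)
  -0.5 * nu * log(det(W)) -
    (0.5 * nu * D * log(2) + 0.25 * D * (D - 1) * log(pi) +
       sum(lgamma(0.5 * (nu + 1 - 1:D))))
}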

2.4 Predictive density

The predictive density of a new observation \(\mathbf{x}_{*}\), which will be associated with a latent variable \(\mathbf{z}_{*}\), is given by

\[ p(\mathbf{x}_{*} | \mathbf{X}) = \sum_{\mathbf{z}_{*}}\int\int\int p(\mathbf{x}_{*} | \mathbf{z}_{*}, \pmb{\mu}, \pmb{\Lambda})p(\mathbf{z}_{*} | \pmb{\pi})p(\pmb{\pi}, \pmb{\mu}, \pmb{\Lambda} | \mathbf{X}) \;d\pmb{\pi}\;d\pmb{\mu}\;d\pmb{\Lambda} \] We approximate the true posterior density of the parameters with its variational approximation to give \[ p(\mathbf{x}_{*} | \mathbf{X}) \approx \sum_{k=1}^{K}\int\int\int \pi_{k}\mathcal{N}(\mathbf{x}_{*} | \pmb{\mu}_{k}, \pmb{\Lambda}_{k}^{-1})q(\pmb{\pi})q(\pmb{\mu}_{k}, \pmb{\Lambda}_{k}) \;d\pmb{\pi}\;d\pmb{\mu}_{k}\;d\pmb{\Lambda}_{k} \] These integrations can now be evaluated analytically, giving a mixture of Student's t-distributions \[ p(\mathbf{x}_{*} | \mathbf{X}) \approx \frac{1}{\hat{\alpha}}\sum_{k=1}^{K}\alpha_{k}\mathcal{St}(\mathbf{x}_{*} | \mathbf{m}_{k}, \mathbf{L}_{k}, \nu_{k} + 1 - D) \] where \(\hat{\alpha} = \sum_{k}\alpha_{k}\), the \(k^{th}\) component has mean \(\mathbf{m}_{k}\), and the precision is given by \[ \mathbf{L}_{k} = \frac{(\nu_{k} + 1 - D)\beta_{k}}{(1 + \beta_{k})}\mathbf{W}_{k} \]
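As a sketch, a single component of this predictive mixture can be evaluated with mvtnorm::dmvt, which expects the scale matrix (the inverse of the precision \(\mathbf{L}_{k}\)) and the degrees of freedom; predictive_component is a hypothetical helper, and the mixture_pdf_t function below applies the same idea across all components, weighting them by \(\alpha_{k}/\hat{\alpha}\).

# Sketch: Student-t density of a single predictive component at a new point x_star
library(mvtnorm)
predictive_component <- function(x_star, m_k, W_k, beta_k, nu_k) {
  D   <- length(m_k)
  L_k <- ((nu_k + 1 - D) * beta_k / (1 + beta_k)) * W_k   # Student-t precision matrix
  dmvt(x_star, delta = m_k, sigma = solve(L_k),
       df = nu_k + 1 - D, log = FALSE, type = "shifted")
}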

3 Code implementation in R

Helper plotting functions

suppressPackageStartupMessages(require(matrixcalc))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(purrr))
suppressPackageStartupMessages(library(mvtnorm))
suppressPackageStartupMessages(library(Matrix))

# Define ggplot2 theme
gg_theme <- function(){
  p <- theme(
      plot.title = element_text(size = 20,face = 'bold',
                                margin = margin(0,0,3,0), hjust = 0.5),
      axis.text = element_text(size = rel(1.05), color = 'black'),
      axis.title = element_text(size = rel(1.45), color = 'black'),
      axis.title.y = element_text(margin = margin(0,10,0,0)),
      axis.title.x = element_text(margin = margin(10,0,0,0)),
      axis.ticks.x = element_line(colour = "black", size = rel(0.8)),
      axis.ticks.y = element_blank(),
      legend.position = "right",
      legend.key.size = unit(1.4, 'lines'),
      legend.title = element_text(size = 12, face = 'bold'),
      legend.text = element_text(size = 12),
      panel.border = element_blank(),
      panel.grid.major = element_line(colour = "gainsboro"),
      panel.background = element_blank()
    )
  return(p)
}

# Mixture density using the approximate predictive Gaussian distribution
mixture_pdf_gaussian <- function(model, data){
  mixture <- vector(mode = "numeric", length = NROW(data))
  for (k in 1:length(model$nu)) {
    tau_k <- model$W[,,k] * model$nu[k] # expected precision E[Lambda_k] = nu_k * W_k
    mu_k  <- model$m[, k]
    mixture <- mixture + model$pi_k[k] * 
        dmvnorm(cbind(data$x,data$y), mean = mu_k, sigma = solve(tau_k))
  }
  return(mixture)
}
# Mixture density using predictive t-distribution
mixture_pdf_t <- function(model, data){
  mixture <- vector(mode = "numeric", length = NROW(data))
  for (k in 1:length(model$nu)) {
    L_k <- solve((((model$nu[k] + 1 - NROW(model$m)) * model$beta[k]) / 
                    (1 + model$beta[k])) * model$W[,,k])
    mixture <- mixture + (model$alpha[k]/sum(model$alpha)) * 
                dmvt(x = cbind(data$x,data$y), delta = model$m[, k], 
                     sigma = L_k, df = model$nu[k] + 1 - NROW(model$m), 
                     log = FALSE, type = "shifted")
  }
  return(mixture)
}
# Use the log-sum-exp trick for numerical stability
log_sum_exp <- function(x) {
  # Computes log(sum(exp(x)))
  offset <- max(x)
  s <- log(sum(exp(x - offset))) + offset
  i <- which(!is.finite(s))
  if (length(i) > 0) { s[i] <- offset }
  return(s)
}

Function for fitting the variational Bayesian Gaussian mixture model.

Note that the lower bound seems to decrease slightly when we arrive near the optimal solution. This is most likely a bug in evaluating the lower bound rather than a problem in the update equations, since we consistently reach the optimal solution. Thanks to Carl Mueller from the Cairo lab at the University of Colorado for pointing out a small bug in computing the lb_qml component of the lower bound.

# Fit VB-GMM model
vb_gmm <- function(X, K = 3, alpha_0 = 1/K, m_0 = c(colMeans(X)), 
                   beta_0 = 1, nu_0 = NCOL(X) + 50, 
                   W_0 = diag(100, NCOL(X)), max_iter = 500, 
                   epsilon_conv = 1e-4, is_animation = FALSE, 
                   is_verbose = FALSE){
  # Compute logB function
  logB <- function(W, nu){
    D <- NCOL(W)
    return(-0.5*nu*log(det(W)) - (0.5*nu*D*log(2) + 0.25*D*(D - 1) * 
        log(pi) + sum(lgamma(0.5 * (nu + 1 - 1:NCOL(W)))) ))  #log of B.79
  }
  X <- as.matrix(X)
  D <- NCOL(X)              # Number of features
  N <- NROW(X)              # Number of observations
  W_0_inv <- solve(W_0)     # Compute W^{-1}
  L <- rep(-Inf, max_iter)  # Store the lower bounds
  r_nk = log_r_nk = log_rho_nk <- matrix(0, nrow = N, ncol = K)
  x_bar_k <- matrix(0, nrow = D, ncol = K)
  S_k = W_k <- array(0, c(D, D, K ) )
  log_pi = log_Lambda <- rep(0, K)
  
  if (is_animation) {
    # Variables needed for plotting
    dt <- data.table(expand.grid(x = seq(from = min(X[,1]) - 2, 
                            to = max(X[,1]) + 2, length.out = 80), 
                                 y = seq(from = min(X[,2]) - 8, 
                            to = max(X[,2]) + 2, length.out = 80)))
  }
  dt_all <- data.table(x = numeric(), y = numeric(), z = numeric(), 
                       iter = numeric())
    
  m_k     <- t(kmeans(X, K, nstart = 25)$centers)  # Mean of Gaussian
  beta_k  <- rep(beta_0, K)                # Scale of precision matrix
  nu_k    <- rep(nu_0, K)                  # Degrees of freedom
  alpha   <- rep(alpha_0, K)               # Dirichlet parameter
  log_pi  <- digamma(alpha) - digamma(sum(alpha))
  for (k in 1:K) {
    W_k[,,k] <-  W_0  # Scale matrix for Wishart
    log_Lambda[k] <- sum(digamma((nu_k[k] + 1 - c(1:D))/2)) + 
        D*log(2) + log(det(W_k[,,k]))
  }
  
  if (is_animation) { # Create animation for initial assignments
    my_z = mixture_pdf_t(model = list(m = m_k, W = W_k, beta = beta_k, 
                         nu = nu_k, alpha = rep(1/K, K)), data = dt)
    dt_all <- rbind(dt_all, dt[, z := my_z] %>% .[, iter := 0])
  }
  
  # Iterate to find optimal parameters
  for (i in 2:max_iter) {
    ##-------------------------------
    # Variational E-Step
    ##-------------------------------
    for (k in 1:K) {
      diff <- sweep(X, MARGIN = 2, STATS = m_k[, k], FUN = "-")
      log_rho_nk[, k] <- log_pi[k] + 0.5*log_Lambda[k] - 
          0.5*(D/beta_k[k]) - 0.5*nu_k[k] * diag(diff %*% 
                                    W_k[,,k] %*% t(diff)) # log of 10.67
    }
    # Responsibilities using the logSumExp for numerical stability
    Z        <- apply(log_rho_nk, 1, log_sum_exp)
    log_r_nk <- log_rho_nk - Z              # log of 10.49
    r_nk     <- apply(log_r_nk, 2, exp)     # 10.49
    
    ##-------------------------------
    # Variational M-Step
    ##-------------------------------
    N_k <- colSums(r_nk) + 1e-10  # 10.51
    for (k in 1:K) {
      x_bar_k[, k] <- (r_nk[ ,k] %*% X) / N_k[k]   # 10.52
      x_cen        <- sweep(X,MARGIN = 2,STATS = x_bar_k[, k],FUN = "-")
      S_k[, , k]   <- t(x_cen) %*% (x_cen * r_nk[, k]) / N_k[k]  # 10.53
    }
    # Update Dirichlet parameter
    alpha <- alpha_0 + N_k  # 10.58
    # Compute expected value of mixing proportions
    pi_k <- (alpha_0 + N_k) / (K * alpha_0 + N)
    # Update parameters for Gaussian-Wishart distribution
    beta_k <- beta_0 + N_k    # 10.60
    nu_k   <- nu_0 + N_k + 1  # 10.63
    for (k in 1:K) {
      # 10.61
      m_k[, k]   <- (1/beta_k[k]) * (beta_0*m_0 + N_k[k]*x_bar_k[, k])  
      # 10.62
      W_k[, , k] <- W_0_inv + N_k[k] * S_k[,,k] + 
          ((beta_0*N_k[k])/(beta_0 + N_k[k])) * 
          tcrossprod((x_bar_k[, k] - m_0))    
      W_k[, , k] <- solve(W_k[, , k])
    }
    # Update expectations over \pi and \Lambda
    # 10.66
    log_pi <- digamma(alpha) - digamma(sum(alpha))                      
    for (k in 1:K) { # 10.65                                              
      log_Lambda[k] <- sum(digamma((nu_k[k] + 1 - 1:D)/2)) + 
          D*log(2) + log(det(W_k[,,k])) 
    }
    
    ##-------------------------------
    # Variational lower bound
    ##-------------------------------
    lb_px = lb_pml = lb_pml2 = lb_qml <- 0
    for (k in 1:K) {
      # 10.71
      lb_px <- lb_px + N_k[k] * (log_Lambda[k] - D/beta_k[k] - nu_k[k] * 
          matrix.trace(S_k[,,k] %*% W_k[,,k]) - nu_k[k]*t(x_bar_k[,k] - 
          m_k[,k]) %*% W_k[,,k] %*% (x_bar_k[,k] - m_k[,k]) - D*log(2*pi) ) 
      # 10.74
      lb_pml <- lb_pml + D*log(beta_0/(2*pi)) + log_Lambda[k] - 
          (D*beta_0)/beta_k[k] - beta_0*nu_k[k]*t(m_k[,k] - m_0) %*% 
          W_k[,,k] %*% (m_k[,k] - m_0)    
      # 10.74
      lb_pml2 <- lb_pml2 + nu_k[k] * matrix.trace(W_0_inv %*% W_k[,,k]) 
      # 10.77
      lb_qml <- lb_qml + 0.5*log_Lambda[k] + 0.5*D*log(beta_k[k]/(2*pi)) - 
          0.5*D - (-logB(W = W_k[,,k], nu = nu_k[k]) - 
          0.5*(nu_k[k] - D - 1)*log_Lambda[k] + 0.5*nu_k[k]*D)
    }
    lb_px  <- 0.5 * lb_px             # 10.71
    lb_pml <- 0.5*lb_pml + K*logB(W = W_0,nu = nu_0) + 0.5*(nu_0 - D - 1) * 
        sum(log_Lambda) - 0.5*lb_pml2 # 10.74
    lb_pz  <- sum(r_nk %*% log_pi)    # 10.72
    lb_qz  <- sum(r_nk * log_r_nk)    # 10.75
    lb_pp  <- sum((alpha_0 - 1)*log_pi) + lgamma(sum(K*alpha_0)) -
        K*sum(lgamma(alpha_0))        # 10.73
    lb_qp  <- sum((alpha - 1)*log_pi) + lgamma(sum(alpha)) - 
        sum(lgamma(alpha)) # 10.76
    # Sum all parts to compute lower bound
    L[i] <- lb_px + lb_pz + lb_pp + lb_pml - lb_qz - lb_qp - lb_qml
    
    ##-------------------------------
    # Evaluate mixture density for plotting
    ##-------------------------------
    if (is_animation) {
      if ( (i - 1) %% 5 == 0 | i < 10) {
        my_z = mixture_pdf_t(model = list(m = m_k, W = W_k, beta = beta_k, 
                nu = nu_k, alpha = alpha), data = dt)
        dt_all <- rbind(dt_all, dt[, z := my_z] %>% .[, iter := i - 1])
      }
    }
    # Show VB difference
    if (is_verbose) { cat("It:\t",i,"\tLB:\t",L[i],
                          "\tLB_diff:\t",L[i] - L[i - 1],"\n")}
    # Check if lower bound decreases
    if (L[i] < L[i - 1]) { message("Warning: Lower bound decreases!\n"); }
    # Check for convergence
    if (abs(L[i] - L[i - 1]) < epsilon_conv) { break }
    # Check if VB converged in the given maximum iterations
    if (i == max_iter) {warning("VB did not converge!\n")}
  }
  obj <- structure(list(X = X, K = K, N = N, D = D, pi_k = pi_k, 
                        alpha = alpha, r_nk = r_nk,  m = m_k, W = W_k, 
                        beta = beta_k, nu = nu_k, L = L[2:i], 
                        dt_all = dt_all), class = "vb_gmm")
  return(obj)
}

3.1 Cluster the Old Faithful data

set.seed(1)  # For reproducibility
X <- as.matrix(faithful)
K <- 5      # Number of clusters
# Run the vb-gmm model
vb_gmm_model <- vb_gmm(X = X, K = K, alpha_0 = 1e-5, max_iter = 1001, 
                       is_animation = TRUE, is_verbose = FALSE)

Generate contour plot of the posterior predictive mixture density.

data.grid <- expand.grid(x = seq(from = min(X[,1]) - 2, 
                                 to = max(X[,1]) + 2, length.out = 100), 
                         y = seq(from = min(X[,2]) - 8, 
                                 to = max(X[,2]) + 2, length.out = 100))
q.samp <- cbind(data.grid, z = mixture_pdf_t(vb_gmm_model,data.grid))
ggplot() + 
  geom_point(data = data.frame(X), mapping = aes(eruptions, waiting)) + 
  geom_contour(data = q.samp, mapping = aes(x = x,y = y, z = z, 
               colour = ..level..), binwidth = 0.001) + 
  gg_theme()

Below is an animation of how the predictive density of the mixture model changes during the variational inference iterations. We observe that after convergence there seem to be around two components that contribute to the posterior density. This shows that there is an automatic trade-off in a Bayesian model between fitting the data and the complexity of the model. (Note: for the animations below you need to install some additional packages.)

suppressPackageStartupMessages(library(gganimate))
dt <- vb_gmm_model$dt_all %>% .[, iter := as.factor(iter)]
p <- ggplot() + geom_point(data = data.frame(X), aes(eruptions,waiting)) + 
  geom_contour(data = dt, mapping = aes(x = x, y = y, z = z, 
               colour = ..level..), binwidth = 0.001) + 
  transition_manual(iter) +
  gg_theme()
p

# animate(p, interval = .45, "figures/vb_gmm/vbgmm.gif", 
#            ani.width = 750, ani.height = 470)
# animate(p, renderer = ffmpeg_renderer())

4 Conclusion

Most of this tutorial-like document was adapted from Chapter 10 in the Pattern Recognition and Machine Learning book from Christopher Bishop. If you found this document useful, check my homepage at the University of Edinburgh (http://homepages.inf.ed.ac.uk/s1263191/) for links to other tutorials.