1 Variational Inference

Suppose we have a fully Bayesian probabilistic model in which we denote all observed variables by \(\mathbf{X}\) and all latent variables by \(\mathbf{Z}\). Note that the latent random variables \(\mathbf{Z}\) include both parameters that might govern all the data, as found in Bayesian models, and latent variables that are “local” to individual data points. Our probabilistic model specifies the joint density \(p(\mathbf{X}, \mathbf{Z})\) and our goal is to compute the posterior distribution \(p(\mathbf{Z} | \mathbf{X})\). This conditional can be used to produce point or interval estimates of the latent variables and to form predictive densities of new data, and using Bayes’ rule we can write it as \[ p(\mathbf{Z} | \mathbf{X}) = \frac{p(\mathbf{X}, \mathbf{Z})}{p(\mathbf{X})} = \frac{p(\mathbf{X} | \mathbf{Z}) p(\mathbf{Z})}{p(\mathbf{X})} \] The denominator contains the marginal likelihood (or model evidence), obtained by marginalizing out the latent variables from the joint density \[ p(\mathbf{X}) = \int p(\mathbf{X}, \mathbf{Z}) d\mathbf{Z} \] For many models, this evidence integral is unavailable in closed form or requires exponential time to compute. The evidence is exactly what we need to turn the joint density into the conditional; this is why inference in such models is hard.

1.1 The evidence lower bound (ELBO)

In variational inference, we specify a family \(\mathcal{Q}\) of densities over the latent variables. Each \(q(\mathbf{Z}) \in \mathcal{Q}\) is a candidate approximation to the exact posterior density. Our goal is to find the best candidate, i.e. the one that minimizes the Kullback-Leibler (\(\mathcal{KL}\)) divergence to the exact posterior. Hence, inference now amounts to solving the following optimization problem \[ \begin{aligned} q^{*}(\mathbf{Z}) & = \underset{q(\mathbf{Z}) \in \mathcal{Q}} {\mathrm{argmin}}\; \mathcal{KL}(q(\mathbf{Z})\; || \; p(\mathbf{Z} | \mathbf{X})) \end{aligned} \] Note that the complexity of the family \(\mathcal{Q}\) determines the complexity of this optimization. Let us expand the \(\mathcal{KL}\) divergence between these two distributions \[ \begin{aligned} \mathcal{KL}(q(\mathbf{Z})\; || \; p(\mathbf{Z} | \mathbf{X})) & = - \int q(\mathbf{Z}) \ln\frac{p(\mathbf{Z} | \mathbf{X})}{q(\mathbf{Z})} d\mathbf{Z} \\ & = \int q(\mathbf{Z})\ln q(\mathbf{Z}) d\mathbf{Z} - \int q(\mathbf{Z}) \ln p(\mathbf{Z} | \mathbf{X})d\mathbf{Z} \\ & = \mathbb{E}_{q(\mathbf{Z})}\Big[ \ln q(\mathbf{Z})\Big] - \int q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z})}{p(\mathbf{X})} d\mathbf{Z} \\ & = \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] + \int q(\mathbf{Z}) \ln p(\mathbf{X})d\mathbf{Z} - \int q(\mathbf{Z}) \ln p(\mathbf{X}, \mathbf{Z})d\mathbf{Z}\\ & = \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] + \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{X})\Big] - \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{X}, \mathbf{Z})\Big] \\ & = \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] - \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{X}, \mathbf{Z})\Big] + \ln p(\mathbf{X}) \\ \end{aligned} \] This quantity is intractable to compute since it depends on the model evidence \(\ln p(\mathbf{X})\); the evidence, however, does not depend on the choice of the variational distribution \(q(\mathbf{Z})\). Instead, we optimize an alternative objective function that equals the negative \(\mathcal{KL}\) divergence up to an added constant (i.e. the model evidence) \[ \begin{aligned} \mathcal{L}(q) & = \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{X}, \mathbf{Z})\Big] - \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] \end{aligned} \] This function is called the evidence lower bound (ELBO), and maximizing the ELBO is equivalent to minimizing the \(\mathcal{KL}\) divergence defined above, since the ELBO is the negative \(\mathcal{KL}\) plus the model evidence \(\ln p(\mathbf{X})\).
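To make the identity \(\mathcal{L}(q) = \ln p(\mathbf{X}) - \mathcal{KL}(q \,||\, p(\mathbf{Z}|\mathbf{X}))\) concrete, here is a tiny numerical check (not part of the derivation above) using a discrete latent variable with three states and an arbitrary candidate density \(q\); the joint probabilities below are made up purely for illustration.

p_joint <- c(0.10, 0.25, 0.05)           # p(X, Z = 1..3) for the fixed observation X
p_X     <- sum(p_joint)                  # model evidence p(X)
p_post  <- p_joint / p_X                 # exact posterior p(Z | X)
q       <- c(0.2, 0.5, 0.3)              # a candidate variational density q(Z)
elbo    <- sum(q * log(p_joint)) - sum(q * log(q))  # E_q[ln p(X,Z)] - E_q[ln q(Z)]
kl      <- sum(q * log(q / p_post))                 # KL(q || p(Z|X))
all.equal(elbo, log(p_X) - kl)                      # TRUE: ELBO = ln p(X) - KL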

Examining the ELBO gives intuitions about the optimal variational density. We rewrite the ELBO as the expected log likelihood of the data minus the \(\mathcal{KL}\) divergence between \(q(\mathbf{Z})\) and the prior \(p(\mathbf{Z})\) \[ \begin{aligned} \mathcal{L}(q) & = \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{X}, \mathbf{Z})\Big] - \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] \\ & = \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{X} | \mathbf{Z})\Big] + \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{Z})\Big] - \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] \\ & = \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{X} | \mathbf{Z})\Big] - \mathcal{KL} (q(\mathbf{Z}) \; ||\; p(\mathbf{Z}))\\ \end{aligned} \] The first term is an expected log likelihood; it encourages densities that place their mass on configurations of the latent variables that explain the observed data. The second term is the negative divergence between the variational density and the prior; it encourages densities close to the prior. Thus, the variational objective mirrors the usual balance between likelihood and prior.

Another property of the ELBO is that it lower-bounds the log evidence, \(\ln p(\mathbf{X}) \geq \mathcal{L}(q)\) for any \(q(\mathbf{Z})\). To see this, notice that we can write the log evidence as \[ \ln p(\mathbf{X}) = \mathcal{KL}(q(\mathbf{Z})\; || \; p(\mathbf{Z} | \mathbf{X})) + \mathcal{L}(q) \] The bound then follows from the fact that \(\mathcal{KL}(q \;||\;p) \geq 0\), and \(\mathcal{L}(q)\) attains the bound only when the \(\mathcal{KL}\) divergence vanishes, that is \(\mathcal{KL}\big[q(\mathbf{Z})\;||\; p(\mathbf{Z} | \mathbf{X}) \big] = 0\). But \(\mathcal{KL}(q \;||\;p)\) is zero if and only if \(q = p\), which would require using the exact posterior itself as our variational distribution. We can derive the same result through Jensen’s inequality (Jordan et al., 1999).

1.2 The mean-field variational family

We now describe a variational family \(\mathcal{Q}\), to complete the specification of the optimization problem. The complexity of the family determines the complexity of the optimization; it is more difficult to optimize over a complex family than a simple family. Here we focus on the mean-field variational family, where the latent variables are mutually independent and each is governed by a distinct factor in the variational density. A generic member of the mean-field variational family is \[ q(\mathbf{Z}) = \prod_{m=1}^{M} q_{m}(Z_{m}) \] Each latent variable \(Z_{m}\) is governed by its own variational factor, the density \(q_{m}(Z_{m})\). We emphasize that the variational family is not a model of the observed data; indeed, the data \(\mathbf{X}\) do not appear in the above equation. Instead, it is the ELBO, and the corresponding \(\mathcal{KL}\) minimization problem, that connects the fitted variational density to the data and model. Also, notice we have not specified the parametric form of the individual variational factors. In principle, each can take on any parametric form appropriate to the corresponding random variable.

1.3 Coordinate ascent mean-field variational inference

Using the ELBO and the mean-field family, we have cast approximate conditional inference as an optimization problem. One of the most commonly used algorithms for solving this optimization problem is coordinate ascent variational inference (CAVI) (Bishop, 2006). CAVI iteratively optimizes each factor of the mean-field variational density, while holding the others fixed. It climbs the ELBO to a local optimum.

Consider the \(m^{th}\) latent variable \(Z_{m}\). The complete conditional of \(Z_{m}\) is its conditional density given all of the other latent variables in the model and the observations, i.e. \(p(Z_{m} | \mathbf{Z}_{-m}, \mathbf{X})\). Fixing the other variational factors \(q_{j}(Z_{j})\), \(j \neq m\), the log of the optimal solution for factor \(q_{m}(Z_{m})\) is \[ \ln q_{m}^{*}(Z_{m}) = \mathbb{E}_{j \neq m}\Big[\ln p(\mathbf{X}, \mathbf{Z})\Big] + \text{const} \] That is, the log of the optimal solution for factor \(q_{m}\) is obtained simply by considering the log of the joint distribution over all hidden and observed variables and then taking the expectation w.r.t. all of the other factors \(\{q_{j}\}\) for \(j \neq m\).

To obtain the form of \(q^{*}_{m}(Z_{m})\) we exponentiate on both sides and normalize to obtain \[ q_{m}^{*}(Z_{m}) = \frac{\exp\left\{ \mathbb{E}_{j \neq m}\Big[\ln p(\mathbf{X}, \mathbf{Z})\Big] \right\} }{\int\exp\left\{ \mathbb{E}_{j \neq m}\Big[\ln p(\mathbf{X}, \mathbf{Z})\Big] \right\}d Z_{m}} \]
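As a concrete (and standard) illustration of CAVI, the following sketch applies these coordinate updates to a toy problem that is not part of this post: approximating a correlated bivariate Gaussian \(p(z_{1}, z_{2}) = \mathcal{N}(\pmb{\mu}, \pmb{\Lambda}^{-1})\) by a factorized \(q(z_{1})q(z_{2})\) (see Bishop, 2006, Section 10.1.2). The optimal factors are Gaussians whose means satisfy the coupled updates implemented below; all numbers are arbitrary.

mu <- c(0, 0)                                   # true posterior mean
Lambda <- matrix(c(1, 0.8, 0.8, 1), 2, 2)       # true posterior precision matrix
m1 <- -5; m2 <- 5                               # arbitrary initialization of E_q[z1], E_q[z2]
for (iter in 1:50) {
  m1 <- mu[1] - (Lambda[1, 2] / Lambda[1, 1]) * (m2 - mu[2])  # update factor q(z1)
  m2 <- mu[2] - (Lambda[2, 1] / Lambda[2, 2]) * (m1 - mu[1])  # update factor q(z2)
}
c(m1, m2)                                       # converges to the true mean (0, 0)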

2 Probit Regression Model

Suppose we have \(I\) independent binary random variables \(y_{i} \in \{0, 1\}\), where \[ y_{i} \sim \mathcal{B}ernoulli(y_{i} | \gamma_{i}) \] We assume that the success probability \(\gamma_{i}\) is related to a vector of covariates \(\mathbf{x}_{i} \in \mathbb{R}^{D}\). Then, we define the probit regression model as \[ \begin{aligned} \eta_{i} & = \mathbf{w}^{T}\mathbf{x}_{i} \\ \mathbb{E}[y_{i}] & = \gamma_{i} = g^{-1}(\eta_{i}) \end{aligned} \] where \(\mathbf{w}\) represents a \(D \times 1\) column vector of regression coefficients, and \(g(\cdot)\) is the link function, which allows us to move from natural parameters \(\eta_{i}\) to mean parameters \(\gamma_{i}\). The probit model is obtained if we define \(g^{-1}(\cdot) = \Phi(\cdot)\), where \(\Phi(\cdot)\) denotes the cumulative distribution function (cdf) of the standard normal distribution. The probabilistic graphical model is shown below

3 Bayesian Probit Regression

In the Bayesian approach we define a prior distribution over the parameters \(\mathbf{w}\). Then the Bayesian probit regression model becomes \[ \begin{aligned} y_{i} & \sim \mathcal{B}ernoulli(y_{i} | \gamma_{i}) \\ \gamma_{i} & = \Phi(\eta_{i})\\ \eta_{i} & = \mathbf{w}^{T}\mathbf{x}_{i} \\ \mathbf{w} & \sim p(\mathbf{w} | \mathcal{H}) \end{aligned} \] where \(\mathcal{H}\) are (hyper)-parameters of the prior. Collecting \(I\) response variables \(\mathbf{y} \in \mathbb{R}^{I}\) and covariates \(\mathbf{X}\in \mathbb{R}^{I\times D}\), the posterior distribution over the parameters is given by \[ \begin{aligned} p(\mathbf{w} | \mathbf{y}, \mathbf{X}) & \propto p(\mathbf{w} | \mathcal{H}) p(\mathbf{y} | \mathbf{X}, \mathbf{w}) = p(\mathbf{w} | \mathcal{H}) \prod\limits_{i=1}^{I} p(y_{i} | \mathbf{x}_{i}, \mathbf{w}) \\ & = p(\mathbf{w} | \mathcal{H}) \prod\limits_{i=1}^{I} \Phi\Big(\mathbf{w}^{T}\mathbf{x}_{i}\Big)^{y_{i}} \Big(1 - \Phi(\mathbf{w}^{T}\mathbf{x}_{i})\Big)^{(1-y_{i})} \end{aligned} \]
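To make the likelihood term \(\Phi(\mathbf{w}^{T}\mathbf{x}_{i})^{y_{i}}(1 - \Phi(\mathbf{w}^{T}\mathbf{x}_{i}))^{(1-y_{i})}\) concrete, here is a minimal sketch that simulates binary responses from the probit model; the covariates, the value of \(\mathbf{w}\) and the sample size are made up purely for illustration, and pnorm plays the role of \(\Phi(\cdot)\).

set.seed(1)
I <- 200                                  # number of observations (arbitrary)
X <- cbind(1, rnorm(I))                   # design matrix: intercept + one covariate
w <- c(-0.5, 1.2)                         # regression coefficients (arbitrary)
eta <- drop(X %*% w)                      # natural parameters eta_i = w'x_i
gamma_i <- pnorm(eta)                     # success probabilities via the probit link
y <- rbinom(I, size = 1, prob = gamma_i)  # binary responses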

3.1 Augmented Model

Performing inference for this model in the Bayesian framework is complicated by the fact that no conjugate prior \(p(\mathbf{w} | \mathcal{H})\) exists for the parameters of the probit regression model. To overcome this problem, Albert and Chib (1993) augmented the original model with an additional auxiliary latent variable that renders the conditional distributions of the model parameters equivalent to those under a Bayesian normal linear regression model with Gaussian noise.

Let us introduce \(I\) independent latent variables \(z_{i}\), where each \(z_{i}\) follows a Gaussian distribution centered at \(\mathbf{w}^{T}\mathbf{x}_{i}\), i.e. \(z_{i} \sim \mathcal{N}(\mathbf{w}^{T}\mathbf{x}_{i}, 1)\). Then the augmented probit model has the following hierarchical structure

where \[ \begin{aligned} \tau & \sim Gamma(\tau | \alpha_{0}, \beta_{0}) \\ \mathbf{w} | \tau & \sim \mathcal{N}(\mathbf{w} | \mathbf{m}_{0}, \tau^{-1}\mathbf{I}) \\ z_{i} | \mathbf{w} & \sim \mathcal{N}(z_{i} | \mathbf{w}^{T}\mathbf{x}_{i}, 1) \\ y_{i} | z_{i} & = \begin{cases} 1 & \; \text{if } z_{i} > 0\\ 0 & \; \text{if } z_{i} \leq 0\\ \end{cases} \nonumber \\ \end{aligned} \] where \(y_{i}\) is now deterministic conditional on the sign of the latent variable \(z_{i}\). Hence, we formulate the original problem as a missing data problem where we have a normal regression model on the latent variables \(z_{i}\) and the observed responses \(y_{i}\) are incomplete in that we only observe whether \(z_{i} > 0\) or \(z_{i} \leq 0\). Note also that we can now place a conjugate Gaussian prior over the parameters \(\mathbf{w}\). A typical choice for the conjugate prior mean would be \(\mathbf{m}_{0} = \mathbf{0}\). Regarding the precision parameter \(\tau\), we can either fix it to a small value, so that the prior over \(\mathbf{w}\) is broad (uninformative), or introduce a Gamma prior as shown above.

Thus, the joint distribution over all the variables is given by \[ \begin{aligned} p(\mathbf{y}, \mathbf{z}, \mathbf{w}, \tau | \mathbf{X}) & = p(\mathbf{y}| \mathbf{z}) p(\mathbf{z}|\mathbf{w}, \mathbf{X}) p(\mathbf{w} | \tau) p(\tau)\\ & = \left\{ \prod\limits_{i=1}^{I} p(y_{i} | z_{i}) p(z_{i}| \mathbf{x}_{i}, \mathbf{w}) \right\} p(\mathbf{w} | \tau)p(\tau) \end{aligned} \] where \[ p(y_{i}| z_{i}) = \mathbf{1}(z_{i} > 0)^{y_{i}} \mathbf{1}(z_{i} \leq 0)^{(1 - y_{i})} \] where \(\boldsymbol{1}(\cdot)\) is the indicator function, equal to \(1\) when the condition inside the parentheses holds, and \(0\) otherwise. One way to approximate the posterior is Gibbs sampling; see my previous post http://rpubs.com/cakapourani/bayesian-binary-probit-model. Here we will approximate the posterior using variational inference.

4 Bayesian Probit Regression Mixture Model

Imagine that each of our observations corresponds to a different probit regression model and our goal is to cluster together regression models with similar patterns; e.g. we have observed time series data of gene expression levels for different genes (that lie in the (0, 1) interval), and we want to group genes that have similar patterns with respect to their expression levels over time.

Assume that each of our observations is generated from a corresponding latent variable \(\mathbf{c}_{n}\) comprising a 1-of-K binary vector with elements \(c_{nk}\) for \(k = 1, ..., K\). The conditional distribution of \(\mathbf{C}\), given the mixing coefficients \(\pmb{\pi}\), is given by \[ p(\mathbf{C} | \pmb{\pi}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\pi_{k}^{c_{nk}} \] The latent variables \(\mathbf{C}\) will generate our latent observations \(\mathbf{Z} \in \mathbb{R}^{N\times I_{n}}\), which in turn will generate our binary observations \(\mathbf{Y} \in \mathbb{R}^{N\times I_{n}}\) depending on the sign of \(\mathbf{Z}\) as explained above. The conditional distribution of the data (\(\mathbf{Z}, \mathbf{Y}\)), given the latent variables \(\mathbf{C}\) and the component parameters \(\mathbf{w}\), is

\[ \begin{align} p(\mathbf{Y}, \mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X}) & = p(\mathbf{Y} | \mathbf{Z}) p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X}) \\ & = \prod_{n=1}^{N}\prod_{k=1}^{K}\Big\{p(\mathbf{y}_{n} | \mathbf{z}_{n}) \;p(\mathbf{z}_{n} | \mathbf{w}_{k}, \mathbf{X}_{n})\Big\}^{c_{nk}} \\ & = \prod_{n=1}^{N}\prod_{k=1}^{K}\Big\{\mathbf{1}(\mathbf{z}_{n} > 0)^{\mathbf{y}_{n}} \mathbf{1}(\mathbf{z}_{n} \leq 0)^{(\pmb{1} - \mathbf{y}_{n})} \mathcal{N}(\mathbf{z}_{n} | \mathbf{X}_{n} \mathbf{w}_{k}, \mathbf{I}_{n})\Big\}^{c_{nk}} \\ \end{align} \] Next we introduce priors over the parameters. We choose a Dirichlet distribution over the mixing proportions \(\pmb{\pi}\) \[ p(\pmb{\pi}) = \mathcal{D}ir( \pmb{\pi}| \pmb{\delta}_{0}) = C(\pmb{\delta}_{0})\prod_{k=1}^{K}\pi_{k}^{\delta_{0_{k}} - 1} \] where \(C(\pmb{\delta}_{0})\) is the normalization constant for the Dirichlet distribution and we have chosen the same parameter \(\delta_{0_{k}}\) for each of the mixture components to have a symmetrical Dirichlet distribution. We introduce an independent Gaussian prior over the coefficients \(\mathbf{w}_{k}\) as we did in the Bayesian probit regression, i.e. \[ \begin{aligned} p(\mathbf{w} | \pmb{\tau}) & = \prod_{k=1}^{K} \mathcal{N}(\mathbf{w}_{k} | \mathbf{0}, \tau_{k}^{-1} \mathbf{I}) \end{aligned} \]

We also introduce a prior distribution for the (hyper)-parameter \(\pmb{\tau}\), which is the precision of the Gaussian prior over the weights and we assume that each cluster has its own precision parameter \(\tau_{k}\), that is \[ p(\pmb{\tau}) = \prod_{k=1}^{K} \mathcal{G}amma(\tau_{k} | \alpha_{0}, \beta_{0}) = \prod_{k=1}^{K} \frac{1}{\Gamma(\alpha_{0})}\beta_{0}^{\alpha_{0}}\tau_{k}^{\alpha_{0}-1}e^{-\beta_{0}\tau_{k}} \] The joint distribution over all the variables is given by the following factorization. \[ \begin{aligned} p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau} | \mathbf{X}) & = p(\mathbf{Y} | \mathbf{Z}) \; p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X}) \;p(\mathbf{C} | \pmb{\pi}) \;p(\pmb{\pi}) \;p(\mathbf{w} | \pmb{\tau}) \;p(\pmb{\tau}) \end{aligned} \] where the decomposition corresponds to the probabilistic graphical model shown below

5 Variational Bayesian Probit Regression Mixture Model

To apply the variational inference machinery, we will divide our parameters into four blocks: the latent variables \(\mathbf{C}\), the latent observations \(\mathbf{Z}\), the parameters \((\mathbf{w}, \pmb{\pi})\) and the parameter \(\pmb{\tau}\). Hence, the variational posterior distribution is given by the factorized expression \[ q(\mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau}) = q(\mathbf{Z})\; q(\mathbf{C}) \;q(\pmb{\pi},\mathbf{w}) \;q(\pmb{\tau}) \] It should be noted that this is the only assumption we need to make in order to obtain a tractable, practical solution to our Bayesian mixture model. In particular, the functional form of the factors \(q(\mathbf{Z})\), \(q(\mathbf{C})\), \(q(\pmb{\pi},\mathbf{w})\) and \(q(\pmb{\tau})\) will be determined automatically by optimization of the variational distribution.

5.1 Update factor \(q(\mathbf{C})\)

Let us consider the derivation of the update equation for the factor \(q(\mathbf{C})\). The log of the optimized factor is given by \[ \begin{align} \ln q^{*}(\mathbf{C}) & = \mathbb{E}_{q(\mathbf{Z},\pmb{\pi},\mathbf{w},\pmb{\tau})}\Big[\ln p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau} | \mathbf{X}) \Big] + \text{const} \\ & = \mathbb{E}_{q(\mathbf{Z},\pmb{\pi},\mathbf{w},\pmb{\tau})}\left[\ln \Big\{ \underbrace{p(\mathbf{Y} | \mathbf{Z})}_{\text{const}} \; p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X}) \;p(\mathbf{C} | \pmb{\pi}) \; \underbrace{p(\pmb{\pi}) \;p(\mathbf{w} | \pmb{\tau}) \;p(\pmb{\tau})}_{\text{const}} \Big\}\right] + \text{const} \\ & = \mathbb{E}_{q(\mathbf{Z}, \mathbf{w})}\Big[\ln\prod_{n=1}^{N}\prod_{k=1}^{K}\mathcal{N}(\mathbf{z}_{n} | \mathbf{X}_{n} \mathbf{w}_{k}, \mathbf{I}_{n})^{c_{nk}}\Big] + \mathbb{E}_{q(\pmb{\pi})}\Big[\ln \prod_{n=1}^{N}\prod_{k=1}^{K}\pi_{k}^{c_{nk}} \Big] + \text{const} \\ & = \sum_{n=1}^{N}\sum_{k=1}^{K} c_{nk}\;\mathbb{E}_{q(\mathbf{z_{n}}, \mathbf{w}_{k})}\Big[\ln\mathcal{N}(\mathbf{z}_{n} | \mathbf{X}_{n} \mathbf{w}_{k}, \mathbf{I}_{n})\Big] + \sum_{n=1}^{N}\sum_{k=1}^{K}c_{nk}\;\mathbb{E}_{q(\pi_{k})}\Big[\ln\pi_{k} \Big] + \text{const} \\ & = \sum_{n=1}^{N}\sum_{k=1}^{K} c_{nk}\;\left\{\mathbb{E}_{q(\mathbf{z}_{n},\mathbf{w}_{k})}\left[-\frac{1}{2}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big)^{T}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big) \right] + \mathbb{E}_{q(\pi_{k})}\Big[\ln\pi_{k} \Big] \right\} + \text{const} \\ & = \sum_{n=1}^{N}\sum_{k=1}^{K}c_{nk}\;\ln\rho_{nk} + \text{const} \end{align} \] where we have defined \[ \ln\rho_{nk} = \mathbb{E}_{q(\mathbf{z}_{n}, \mathbf{w}_{k})}\left[-\frac{1}{2}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big)^{T}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big) \right] + \mathbb{E}_{q(\pi_{k})}\Big[\ln\pi_{k} \Big] \] Note that we are only interested in the functional dependence on the variable \(\mathbf{C}\), thus any terms that do not depend on \(\mathbf{C}\) can be absorbed into the additive normalization constant. Taking the exponential on both sides we obtain \[ q^{*}(\mathbf{C}) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\;\rho_{nk}^{c_{nk}} \] Requiring that this distribution be normalized, and noting that for each value of \(n\) the quantities \(c_{nk}\) are binary and sum to \(1\) over all values of \(k\), we obtain \[ q^{*}(\mathbf{C}) = \prod_{n=1}^{N}\prod_{k=1}^{K}\;r_{nk}^{c_{nk}} \quad\quad\quad \text{where} \quad r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K}\rho_{nj}} \] We note that the factor \(q(\mathbf{C})\) takes the same functional form as the prior \(p(\mathbf{C}|\pmb{\pi})\). For the discrete distribution \(q^{*}(\mathbf{C})\) we have the standard result, which is the expected value of a multinomial variable \[ \mathbb{E}_{q(c_{nk})}\big[c_{nk}\big] = r_{nk} \] for which we see that the quantities \(r_{nk}\) are playing the role of responsibilities. Note that the optimal solution for \(q^{*}(\mathbf{C})\) depends on moments evaluated w.r.t. the distributions of other variables, and so again the variational update equations are coupled and must be solved iteratively.
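In practice the \(\rho_{nk}\) are computed on the log scale, so a numerically stable way to obtain the responsibilities is the log-sum-exp trick; below is a minimal sketch (mirroring the log_sum_exp helper used in the R code further down), where log_rho is an \(N \times K\) matrix holding the \(\ln\rho_{nk}\) values.

# Stable normalization of log rho_nk into responsibilities r_nk
log_sum_exp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }
responsibilities <- function(log_rho) {
  log_r <- log_rho - apply(log_rho, 1, log_sum_exp)  # subtract row-wise log normalizer
  exp(log_r)                                         # each row now sums to one
}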

5.2 Update factor \(q(\pmb{\tau})\)

The log of the optimized factor is given by \[ \begin{align} \ln q^{*}(\pmb{\tau}) & = \mathbb{E}_{q(\mathbf{Z}, \mathbf{C},\pmb{\pi},\mathbf{w})}\Big[\ln p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau} | \mathbf{X}) \Big] + \text{const} \\ & = \mathbb{E}_{q(\mathbf{Z},\mathbf{C},\pmb{\pi},\mathbf{w})}\left[\ln \Big\{ \underbrace{p(\mathbf{Y} | \mathbf{Z}) \; p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X}) \;p(\mathbf{C} | \pmb{\pi}) \; p(\pmb{\pi})}_{\text{const}} \;p(\mathbf{w} | \pmb{\tau}) \;p(\pmb{\tau}) \Big\}\right] + \text{const} \\ & = \mathbb{E}_{q(\mathbf{w})}\Big[\ln p(\mathbf{w} | \pmb{\tau})\Big] + \ln p(\pmb{\tau}) + \text{const} \\ & = \sum_{k=1}^{K}\mathbb{E}_{q(\mathbf{w}_{k})}\Big[\ln p(\mathbf{w}_{k} | \tau_{k})\Big] + \sum_{k = 1}^{K}\ln p(\tau_{k}) + \text{const} \end{align} \] Hence we observe that the right-hand side comprises a sum over \(k\), i.e. the \(\tau_{k}\) are independent of each other, leading to the further factorization \[ q(\pmb{\tau}) = \prod_{k=1}^{K}q(\tau_{k}) \] We refer to these additional factorizations as induced factorizations because they arise from an intersection between the factorization assumed in the variational posterior distribution and the conditional independence properties of the true joint distribution.

Hence, \[ \begin{aligned} \ln q^{*}(\tau_{k}) & = \mathbb{E}_{q(\mathbf{w}_{k})}\Big[\ln p(\mathbf{w}_{k} | \tau_{k})\Big] + \ln p(\tau_{k}) + \text{const} \\ & = \underbrace{\frac{D}{2}\ln\tau_{k} - \frac{\tau_{k}}{2}\mathbb{E}_{q(\mathbf{w}_{k})}[\mathbf{w}_{k}^{T}\mathbf{w}_{k}]}_{\text{Gaussian PDF}} + \underbrace{(\alpha_{0} - 1)\ln\tau_{k} - \beta_{0}\tau_{k}}_{\text{Gamma PDF}}\\ & = \underbrace{(\alpha_{0} + \frac{D}{2} - 1)}_{\alpha_{k}\;\text{parameter}}\ln\tau_{k} - \Big(\underbrace{\beta_{0} + \frac{1}{2}\mathbb{E}_{q(\mathbf{w}_{k})}[\mathbf{w}_{k}^{T}\mathbf{w}_{k}]}_{\beta_{k}\;\text{parameter}}\Big)\tau_{k} \end{aligned} \] which is the log of the (unnormalized) Gamma distribution, leading to \[ \begin{aligned} q^{*}(\tau_{k}) & = \mathcal{G}\text{amma}(\tau_{k} | \alpha_{k}, \beta_{k}) \\ \alpha_{k} & = \alpha_{0} + \frac{D}{2} \\ \beta_{k} & = \beta_{0} + \frac{1}{2}\mathbb{E}_{q(\mathbf{w}_{k})}[\mathbf{w}_{k}^{T}\mathbf{w}_{k}] \end{aligned} \]
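A minimal sketch of these Gamma updates in R, assuming hyper-parameter values and a placeholder value of \(\mathbb{E}_{q(\mathbf{w}_{k})}[\mathbf{w}_{k}^{T}\mathbf{w}_{k}]\) per cluster (all numbers below are made up):

alpha_0 <- 0.1; beta_0 <- 0.1              # Gamma hyper-parameters (arbitrary)
D <- 4; K <- 3                             # number of features and clusters (arbitrary)
E_ww <- c(2.1, 1.7, 3.4)                   # hypothetical E[w_k' w_k] values
alpha_k <- rep(alpha_0 + D / 2, K)         # shape: identical for every cluster
beta_k  <- beta_0 + 0.5 * E_ww             # rate: depends on E[w_k' w_k]
E_tau   <- alpha_k / beta_k                # E[tau_k], needed later in the q(w_k) update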

5.3 Update factor \(q(\pmb{\pi}, \mathbf{w})\)

Now let us consider the factor \(q(\pmb{\pi}, \mathbf{w})\) in the variational posterior distribution. Using again the general expression for the optimized factor we have \[ \begin{align} \ln q^{*}(\pmb{\pi}, \mathbf{w}) & = \mathbb{E}_{q(\mathbf{Z}, \mathbf{C},\pmb{\tau})}\Big[\ln p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau} | \mathbf{X}) \Big] + \text{const} \\ & = \mathbb{E}_{q(\mathbf{Z},\mathbf{C},\pmb{\tau})}\left[\ln \Big\{ \underbrace{p(\mathbf{Y} | \mathbf{Z})}_{\text{const}} \; p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X}) \;p(\mathbf{C} | \pmb{\pi}) \; p(\pmb{\pi}) \;p(\mathbf{w} | \pmb{\tau}) \; \underbrace{p(\pmb{\tau})}_{\text{const}} \Big\}\right] + \text{const} \\ & = \underbrace{\mathbb{E}_{q(\mathbf{Z}, \mathbf{C})}\Big[\ln p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X})\Big] + \mathbb{E}_{q(\pmb{\tau})}\Big[\ln p(\mathbf{w} | \pmb{\tau}) \Big]}_{\ln q(\mathbf{w})} + \underbrace{\mathbb{E}_{q(\mathbf{C})} \Big[\ln p(\mathbf{C} | \pmb{\pi})\Big] + \ln p(\pmb{\pi})}_{\ln q(\pmb{\pi})} + \text{const} \\ \end{align} \] We observe that the right-hand side of this expression decomposes into a sum of terms involving only \(\pmb{\pi}\) together with terms only involving \(\mathbf{w}\), which implies that the variational posterior \(q(\pmb{\pi}, \mathbf{w})\) factorizes to give \(q(\pmb{\pi})q(\mathbf{w})\). Furthermore, the terms involving \(\mathbf{w}\) themselves comprise a sum over \(k\) (i.e. are independent between clusters), leading to the further factorization \[ q(\pmb{\pi}, \mathbf{w}) = q(\pmb{\pi})\prod_{k=1}^{K} q(\mathbf{w}_{k}) \]

5.3.1 Update factor \(q(\pmb{\pi})\)

Since this factor depends only on the latent variables \(\mathbf{C}\) and parameters \(\pmb{\pi}\), the optimal variational factor will be the same for any mixture model, irrespective of the observation model. Hence, we have \[ \begin{align} \ln q^{*}(\pmb{\pi}) & = \ln p(\pmb{\pi}) + \mathbb{E}_{q(\mathbf{C})}\Big[\ln p(\mathbf{C} | \pmb{\pi})\Big] + \text{const} \\ & = \underbrace{\ln C(\pmb{\delta}_{0})}_{\text{const}} + \sum_{k=1}^{K}\ln\pi_{k}^{\delta_{0_{k}}-1} + \sum_{k=1}^{K}\sum_{n=1}^{N}\underbrace{\mathbb{E}_{q(c_{nk})}\big[c_{nk}\big]}_{r_{nk}}\ln\pi_{k} + \text{const} \\ & = \sum_{k=1}^{K}\ln\pi_{k}^{\delta_{0_{k}}-1} + \sum_{k=1}^{K}\sum_{n=1}^{N}{r_{nk}}\ln\pi_{k} + \text{const} \\ \end{align} \] Taking the exponential on both sides we observe that \(q^{*}(\pmb{\pi})\) is a Dirichlet distribution \[ \begin{align} q^{*}(\pmb{\pi}) & \propto \prod_{k=1}^{K}\pi_{k}^{\delta_{0_{k}} - 1} \prod_{k=1}^{K}\prod_{n=1}^{N}\pi_{k}^{r_{nk}} \\ & = \prod_{k=1}^{K}\pi_{k}^{\delta_{0_{k}} - 1} \prod_{k=1}^{K}\pi_{k}^{\sum_{n=1}^{N}r_{nk}} \\ & = \prod_{k=1}^{K}\pi_{k}^{(\delta_{0_{k}} + \sum_{n=1}^{N}r_{nk} - 1)} \\ & \propto \mathcal{D}ir(\pmb{\pi} | \pmb{\delta}) \end{align} \] where \(\pmb{\delta}\) has components \(\delta_{k}\) given by \(\delta_{k} = \delta_{0_{k}} + \sum_{n=1}^{N}r_{nk}\).

5.3.2 Update factor \(q(\mathbf{w})\)

Let us consider the derivation of the update equation for the factor \(q(\mathbf{w}_{k})\). The log of the optimized factor is given by \[ \begin{align} \ln q^{*}(\mathbf{w}_{k}) & = \mathbb{E}_{q(\mathbf{Z},\mathbf{C})}\Big[\ln p(\mathbf{Z} | \mathbf{C}, \mathbf{w}_{k}, \mathbf{X})\Big] + \mathbb{E}_{q(\tau_{k})}\Big[\ln p(\mathbf{w}_{k} | \tau_{k})\Big] + \text{const} \\ & = \sum_{n=1}^{N} \mathbb{E}_{q(c_{nk})}\big[c_{nk}\big]\; \mathbb{E}_{q(\mathbf{z}_{n})}\Big[\ln\mathcal{N}(\mathbf{z}_{n} | \mathbf{X}_{n} \mathbf{w}_{k}, \mathbf{I}_{n})\Big] + \mathbb{E}_{q(\tau_{k})}\Big[\ln p(\mathbf{w}_{k} | \tau_{k})\Big] + \text{const} \\ & = \sum_{n=1}^{N} r_{nk} \;\mathbb{E}_{q(\mathbf{z}_{n})}\left[-\frac{1}{2}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big)^{T}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big) \right] - \frac{1}{2} \mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big] \mathbf{w}_{k}^{T}\mathbf{w}_{k} + \text{const} \\ & = \sum_{n=1}^{N} r_{nk} \;\mathbb{E}_{q(\mathbf{z}_{n})} \left[-\frac{1}{2}\Big(\underbrace{\mathbf{z}_{n}^{T}\mathbf{z}_{n}}_{\text{const}} - 2\mathbf{w}_{k}^{T} \mathbf{X}_{n}^{T}\mathbf{z}_{n} + \mathbf{w}_{k}^{T} \mathbf{X}_{n}^{T} \mathbf{X}_{n} \mathbf{w}_{k} \Big)\right] - \frac{1}{2} \mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big] \mathbf{w}_{k}^{T}\mathbf{w}_{k} + \text{const} \\ & = \sum_{n=1}^{N} r_{nk} \left\{ \mathbf{w}_{k}^{T} \mathbf{X}_{n}^{T}\;\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] -\frac{1}{2} \mathbf{w}_{k}^{T} \mathbf{X}_{n}^{T} \mathbf{X}_{n} \mathbf{w}_{k} \right\} - \frac{1}{2} \mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big] \mathbf{w}_{k}^{T}\mathbf{w}_{k} + \text{const} \\ & = \mathbf{w}_{k}^{T}\sum_{n=1}^{N} r_{nk} \mathbf{X}_{n}^{T}\;\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] - \frac{1}{2} \mathbf{w}_{k}^{T}\sum_{n=1}^{N} \Big\{r_{nk} \mathbf{X}_{n}^{T} \mathbf{X}_{n} \Big\} \mathbf{w}_{k} - \frac{1}{2} \mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big] \mathbf{w}_{k}^{T}\mathbf{w}_{k} + \text{const} \\ & = \mathbf{w}_{k}^{T}\sum_{n=1}^{N} r_{nk} \mathbf{X}_{n}^{T}\;\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] - \frac{1}{2} \mathbf{w}_{k}^{T}\left\{ \mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big]\mathbf{I} + \sum_{n=1}^{N} r_{nk} \mathbf{X}_{n}^{T} \mathbf{X}_{n} \right\}\mathbf{w}_{k} + \text{const} \\ \end{align} \] Because this is a quadratic form, the distribution \(q^{*}(\mathbf{w}_{k})\) is a Gaussian distribution and we can complete the square to identify the mean and covariance \[ \begin{aligned} q^{*}(\mathbf{w}_{k}) & = \mathcal{N}(\mathbf{w}_{k} | \mathbf{m}_{k}, \mathbf{S}_{k}) \\ \mathbf{m}_{k} & = \mathbf{S}_{k}\sum_{n=1}^{N}r_{nk}\mathbf{X}_{n}^{T}\;\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] \\ \mathbf{S}_{k} & = \left(\mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big]\mathbf{I} + \sum_{n=1}^{N}r_{nk} \mathbf{X}_{n}^{T}\mathbf{X}_{n}\right)^{-1} \end{aligned} \] To complete the square we make use of the fact that the exponent in a general Gaussian distribution \(\mathcal{N}(\mathbf{x} | \pmb{\mu}, \pmb{\Sigma})\) can be written \[ -\frac{1}{2}(\mathbf{x}-\pmb{\mu})^{T}\pmb{\Sigma}^{-1}(\mathbf{x}-\pmb{\mu}) = -\frac{1}{2}\mathbf{x}^{T}\pmb{\Sigma}^{-1}\mathbf{x} + \mathbf{x}^{T}\pmb{\Sigma}^{-1}\pmb{\mu} + \text{const} \] where const denotes terms that are independent of \(\mathbf{x}\), and we have made use of the symmetry of \(\pmb{\Sigma}\).
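A minimal sketch of this \(q(\mathbf{w}_{k})\) update for a single cluster \(k\), assuming the responsibilities r_k (the vector of \(r_{nk}\) over \(n\)), the design matrices X (a list of the \(\mathbf{X}_{n}\)), the expectations E_z (a list of the \(\mathbb{E}_{q(\mathbf{z}_{n})}[\mathbf{z}_{n}]\) vectors) and the scalar E_tau_k \(= \mathbb{E}_{q(\tau_{k})}[\tau_{k}]\) have already been computed; the function name is hypothetical and the same computation appears inside the fitting code further down.

update_qw_k <- function(X, E_z, r_k, E_tau_k) {
  D <- ncol(X[[1]])
  # S_k^{-1} = E[tau_k] I + sum_n r_nk X_n' X_n
  S_inv <- diag(E_tau_k, D) +
    Reduce("+", lapply(seq_along(X), function(n) r_k[n] * crossprod(X[[n]])))
  S_k <- solve(S_inv)
  # m_k = S_k sum_n r_nk X_n' E[z_n]
  m_k <- S_k %*% Reduce("+", lapply(seq_along(X), function(n)
    r_k[n] * crossprod(X[[n]], E_z[[n]])))
  list(m = m_k, S = S_k)
}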

5.4 Update factor \(q(\mathbf{Z})\)

Finally, let us consider the derivation of the update equation for the factor \(q(\mathbf{Z})\). From the graphical model we observe that the observations (i.e. different probit regressions) are independent of each other, which factorizes the variational factor as \[ q(\mathbf{Z}) = \prod_{n=1}^{N}\prod_{i=1}^{I_{n}}q(z_{ni}) \] The log of the optimized factor assuming that the corresponding \(y_{ni} = 1\) is given by \[ \begin{align} \ln q^{*}(z_{ni}) & = \mathbb{E}_{q(\mathbf{C},\pmb{\pi},\mathbf{w},\pmb{\tau})}\Big[\ln p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau} | \mathbf{X}) \Big] + \text{const} \\ & = \mathbb{E}_{q(\mathbf{C},\pmb{\pi},\mathbf{w},\pmb{\tau})}\left[\ln \Big\{ p(\mathbf{Y} | \mathbf{Z}) \; p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X}) \underbrace{\;p(\mathbf{C} | \pmb{\pi}) \; p(\pmb{\pi}) \;p(\mathbf{w} | \pmb{\tau}) \; p(\pmb{\tau})}_{\text{const}} \Big\}\right] + \text{const} \\ & = \ln p(y_{ni} | z_{ni}) + \mathbb{E}_{q(\mathbf{c}_{n},\mathbf{w})} \left[\ln \prod_{k=1}^{K} \mathcal{N}(z_{ni} | \mathbf{w}_{k}^{T}\mathbf{x}_{ni}, 1)^{c_{nk}}\right] + \text{const} \\ & = y_{ni}\ln\mathbf{1}(z_{ni} > 0) + \underbrace{(1 - y_{ni})\ln\mathbf{1}(z_{ni} \leq 0)}_{0,\; \text{since}\;y_{ni}=1} + \sum_{k=1}^{K} \mathbb{E}_{q(\mathbf{c}_{nk})}\big[c_{nk}\big]\;\mathbb{E}_{q(\mathbf{w}_{k})} \Big[\ln \mathcal{N}(z_{ni} | \mathbf{w}_{k}^{T}\mathbf{x}_{ni}, 1)\Big] + \text{const} \\ & = y_{ni}\ln\mathbf{1}(z_{ni} > 0) + \sum_{k=1}^{K} r_{nk} \; \mathbb{E}_{q(\mathbf{w}_{k})} \Big[-\frac{1}{2} (z_{ni} - \mathbf{w}_{k}^{T} \mathbf{x}_{ni})^{2}\Big] + \text{const} \\ & = \ln\mathbf{1}(z_{ni} > 0) + \sum_{k=1}^{K}r_{nk} \left\{-\frac{1}{2} z_{ni}^{2} + z_{ni}\mathbb{E}_{q(\mathbf{w}_{k})} \Big[\mathbf{w}_{k}^{T}\Big]\mathbf{x}_{ni} - \underbrace{\frac{1}{2}\mathbb{E}_{q(\mathbf{w_{k}})} \Big[\mathbf{w}_{k}^{T} \mathbf{x}_{ni} \mathbf{x}_{ni}^{T}\mathbf{w}_{k}\Big]}_{\text{const}}\right\} + \text{const} \\ & = \ln\mathbf{1}(z_{ni} > 0) -\frac{1}{2} z_{ni}^{2}\underbrace{\sum_{k=1}^{K}r_{nk}}_{\text{1}} + z_{ni}\sum_{k=1}^{K}r_{nk} \; \mathbb{E}_{q(\mathbf{w}_{k})} \Big[\mathbf{w}_{k}^{T}\Big]\mathbf{x}_{ni} + \text{const} \end{align} \] Exponentiating this quantity and setting \(\mu_{ni} = \sum_{k=1}^{K} r_{nk} \mathbb{E}_{q(\mathbf{w}_{k})}\big[\mathbf{w}_{k}^{T}\big]\mathbf{x}_{ni}\) we obtain \[ q^{*}(z_{ni}) \propto \mathbf{1}(z_{ni} > 0) \exp\left(-\frac{1}{2} z_{ni}^{2} + z_{ni}\mu_{ni} \right) \] We observe that the optimized factor \(q^{*}(z_{ni})\) is an un-normalized Truncated Normal distribution, so we can complete the square to identify its mean and variance \[ q^{*}(z_{ni}) = \begin{cases} \mathcal{TN}_{+}\left(z_{ni} | \mu_{ni}, 1 \right) & \; \text{if } y_{ni} = 1\\ \mathcal{TN}_{-}\left(z_{ni} | \mu_{ni}, 1 \right) & \; \text{if } y_{ni} = 0\\ \end{cases} \] where \(\mathcal{TN}_{+}(\cdot)\) denotes the normal distribution truncated from below at zero, so that it contains only positive values, and \(\mathcal{TN}_{-}(\cdot)\) denotes the normal distribution truncated from above at zero, so that it contains only non-positive values.

5.5 Computing expectations

The derivations of the optimized variational factors above involve expectations with respect to the remaining variational distributions. These expectations are computed as follows

Term \(\mathbb{E}_{q(\mathbf{w}_{k})}\big[\mathbf{w}_{k}^{T}\big]\): The factor \(q(\mathbf{w}_{k})\) is a Gaussian distribution \(\mathcal{N}(\mathbf{w}_{k} | \mathbf{m}_{k}, \mathbf{S}_{k})\), hence its expected value is \[ \begin{align} \mathbb{E}_{q(\mathbf{w}_{k})}\big[\mathbf{w}_{k}^{T}\big] = \mathbf{m}_{k}^{T} \end{align} \]

Term \(\mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big]\): The factor \(q(\tau_{k})\) is a Gamma distribution \(\mathcal{G}amma(\tau_{k} | \alpha_{k}, \beta_{k})\), hence its expected value is \[ \mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big] = \frac{\alpha_{k}}{\beta_{k}} \]

Term \(\mathbb{E}_{q(\mathbf{w}_{k})}\big[\mathbf{w}_{k}^{T}\mathbf{w}_{k}\big]\): The factor \(q(\mathbf{w}_{k})\) is a Gaussian distribution \(\mathcal{N}(\mathbf{w}_{k} | \mathbf{m}_{k}, \mathbf{S}_{k})\) hence we have \[ \begin{align} \mathbb{E}_{q(\mathbf{w}_{k})}\big[\mathbf{w}_{k}^{T}\mathbf{w}_{k}\big] = \text{tr}\Big(\mathbb{E}_{q(\mathbf{w}_{k})} \big[\mathbf{w}_{k}\mathbf{w}_{k}^{T} \big]\Big) = \text{tr}\Big(\mathbf{m}_{k}\mathbf{m}_{k}^{T} + \mathbf{S}_{k}\Big) = \text{tr}\left(\mathbf{m}_{k}\mathbf{m}_{k}^{T}\right) + \text{tr}(\mathbf{S}_{k}) = \mathbf{m}_{k}^{T}\mathbf{m}_{k} + \text{tr}(\mathbf{S}_{k}) \\ \end{align} \]

Term \(\mathbb{E}_{q(\pi_{k})}\big[\ln\pi_{k}\big]\): The factor \(q(\pmb{\pi})\) is a Dirichlet distribution \(\mathcal{D}ir(\pmb{\pi} | \pmb{\delta})\) and from standard results (see Bishop book, Eq. B.21) we obtain \[ \mathbb{E}_{q(\pi_{k})}\big[\ln\pi_{k}\big] = \psi(\delta_{k}) - \psi(\hat{\delta}) \quad\quad \left(\text{where} \;\; \hat{\delta} = \sum_{k=1}^{K}\delta_{k} \right) \] where \(\psi(\delta) \equiv \frac{d}{d\delta}\ln\Gamma(\delta)\) is the digamma function.

Term \(\mathbb{E}_{q(c_{nk})}\big[c_{nk}\big]\): The factor \(q(\mathbf{c}_{n})\) is a multinomial (1-of-K) distribution with parameters \(r_{nk}\), hence the expected value is \[ \mathbb{E}_{q(c_{nk})}\big[c_{nk}\big] = r_{nk} \]

Term \(\mathbb{E}_{q(z_{ni})}\big[z_{ni}\big]\): The factor \(q(z_{ni})\) is a Truncated Gaussian distribution as defined above, hence from standard results its expected value is \[ \mathbb{E}_{q(z_{ni})}\big[z_{ni}\big] = \begin{cases} \mu_{ni} + \phi_{ni}/(1 - \Phi_{ni}) & \; \text{if } y_{ni} = 1\\ \mu_{ni} - \phi_{ni}/\Phi_{ni} & \; \text{if } y_{ni} = 0\\ \end{cases} \] where \(\phi_{ni} \equiv \phi(-\mu_{ni})\) is the standard normal density and \(\Phi_{ni} \equiv \Phi(-\mu_{ni})\) is the cumulative distribution function (cdf) of the standard normal distribution, both evaluated at \(-\mu_{ni}\).

Term \(\mathbb{E}_{q(\mathbf{z}_{n},\mathbf{w}_{k})}\left[-\frac{1}{2}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big)^{T}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big) \right]\): The factor \(q(\mathbf{w}_{k})\) is a Gaussian distribution \(\mathcal{N}(\mathbf{w}_{k} | \mathbf{m}_{k}, \mathbf{S}_{k})\), hence we have \[ \begin{align} \mathbb{E}_{q(\mathbf{z}_{n},\mathbf{w}_{k})}\left[-\frac{1}{2}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big)^{T}\Big(\mathbf{z}_{n} - \mathbf{X}_{n}\mathbf{w}_{k}\Big) \right] & = \mathbb{E}_{q(\mathbf{z}_{n}, \mathbf{w}_{k})}\left[ -\frac{1}{2}\Big(\mathbf{z}_{n}^{T}\mathbf{z}_{n}- 2\mathbf{w}_{k}^{T} \mathbf{X}_{n}^{T}\mathbf{z}_{n} + \mathbf{w}_{k}^{T} \mathbf{X}_{n}^{T} \mathbf{X}_{n} \mathbf{w}_{k} \Big)\right] \\ & = -\frac{1}{2}\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}^{T}\mathbf{z}_{n}\big] + \mathbb{E}_{q(\mathbf{z}_{n})} \Big[\mathbb{E}_{q(\mathbf{w}_{k})} \left[\mathbf{w}_{k}^{T}\right] \mathbf{X}_{n}^{T}\mathbf{z}_{n}\Big] -\frac{1}{2} \mathbb{E}_{q(\mathbf{w}_{k})}\left[ \mathbf{w}_{k}^{T} \mathbf{X}_{n}^{T} \mathbf{X}_{n} \mathbf{w}_{k}\right] \\ & = -\frac{1}{2}\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}^{T}\mathbf{z}_{n}\big] + \mathbf{m}_{k}^{T} \mathbf{X}_{n}^{T}\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] -\frac{1}{2} \text{tr}\left( \mathbf{X}_{n}^{T} \mathbf{X}_{n} \mathbb{E}_{q(\mathbf{w}_{k})}\left[\mathbf{w}_{k}\mathbf{w}_{k}^{T}\right] \right) \\ & = -\frac{1}{2}\sum_{i=1}^{I_{n}}\mathbb{E}_{q(z_{ni})}\big[z_{ni}^{2}\big] + \sum_{i=1}^{I_{n}}\mathbb{E}_{q(z_{ni})}\big [z_{ni}\big]\mathbf{m}_{k}^{T} \mathbf{x}_{ni} -\frac{1}{2} \text{tr}\left( \mathbf{X}_{n}^{T} \mathbf{X}_{n} \big(\mathbf{m}_{k}\mathbf{m}_{k}^{T} + S_{k} \big)\right) \\ \end{align} \] where \[ \begin{align} \mathbb{E}_{q(z_{ni})}\big[z_{ni}^{2}\big] & = \mathbb{E}_{q(z_{ni})}\big[z_{ni}\big]^{2} + \mathbb{Var}_{q(z_{ni})}\big[z_{ni}\big] \\ & = \begin{cases} 1 + \mu_{ni} \Big(\mu_{ni} + \phi_{ni}/(1 - \Phi_{ni}) \Big) & \; \text{if } y_{ni} = 1\\ 1 + \mu_{ni} \Big(\mu_{ni} - \phi_{ni}/\Phi_{ni}\Big) & \; \text{if } y_{ni} = 0\\ \end{cases} \\ & = 1 + \mu_{ni} \;\mathbb{E}_{q(z_{ni})}\big[z_{ni}\big] \end{align} \]
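The two truncated-normal moments above are exactly what the fitting algorithm needs at every iteration; a minimal sketch, where mu is the vector of \(\mu_{ni}\) values for one observation and y the matching binary responses (the function names are hypothetical, the same computation appears as update_Ez in the code further down):

# E[z_ni] for a truncated normal with unit variance, truncated at zero
E_z_truncnorm <- function(mu, y) {
  E_z <- numeric(length(mu))
  i1 <- which(y == 1); i0 <- which(y == 0)
  E_z[i1] <- mu[i1] + dnorm(-mu[i1]) / (1 - pnorm(-mu[i1]))  # truncated to z > 0
  E_z[i0] <- mu[i0] - dnorm(-mu[i0]) / pnorm(-mu[i0])        # truncated to z <= 0
  E_z
}
# E[z_ni^2] = 1 + mu_ni * E[z_ni], as derived above
E_zz_truncnorm <- function(mu, E_z) 1 + mu * E_z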

5.6 Variational lower bound

The variational lower bound \(\mathcal{L}(q)\) (i.e. evidence lower bound (ELBO) ) is given by \[ \begin{aligned} \mathcal{L}(q) & = \sum_{\mathbf{C}}\int\int\int\int q(\mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau}) \ln\left\{\frac{p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau} |\mathbf{X})}{q(\mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau})}\right\}d\mathbf{Z}\; d \pmb{\pi}\;d\mathbf{w}\;d\pmb{\tau} \\ & = \;\mathbb{E}_{q(\mathbf{Z},\mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau})}\Big[\ln p(\mathbf{Y}, \mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau}|\mathbf{X})\Big] - \mathbb{E}_{q(\mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau})}\Big[\ln q(\mathbf{Z}, \mathbf{C}, \pmb{\pi}, \mathbf{w}, \pmb{\tau})\Big] \\ & = \; \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{Y} | \mathbf{Z})\Big] + \mathbb{E}_{q(\mathbf{Z}, \mathbf{C}, \mathbf{w})}\Big[\ln p(\mathbf{Z} | \mathbf{C}, \mathbf{w},\mathbf{X})\Big] + \mathbb{E}_{q(\mathbf{C}, \pmb{\pi})}\Big[\ln p(\mathbf{C} | \pmb{\pi})\Big] + \mathbb{E}_{q(\pmb{\pi})}\Big[\ln p(\pmb{\pi})\Big] + \mathbb{E}_{q(\mathbf{w}, \pmb{\tau})} \Big[\ln p(\mathbf{w} | \pmb{\tau})\Big] \\ & \quad \qquad + \mathbb{E}_{q(\pmb{\tau}) }\Big[\ln p(\pmb{\tau})\Big] - \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] - \mathbb{E}_{q(\mathbf{C})}\Big[\ln q(\mathbf{C})\Big] - \mathbb{E}_{q(\pmb{\pi})}\Big[\ln q(\pmb{\pi})\Big] - \mathbb{E}_{q(\mathbf{w})}\Big[\ln q(\mathbf{w})\Big] - \mathbb{E}_{q(\pmb{\tau})} \Big[\ln q(\pmb{\tau})\Big] \end{aligned} \]

Note that the terms involving expectations of \(\ln q(\cdot)\) distributions simply represent the negative information entropies \(\mathbb{H}(\cdot)\) of those distributions. The various terms in the ELBO are evaluated to give \[ \begin{align} \mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{Y} | \mathbf{Z})\Big] & = \sum_{n=1}^{N}\sum_{i}^{I_{n}} \mathbb{E}_{q(z_{ni})}\Big[y_{ni}\ln \mathbf{1}(z_{ni} > 0) + (1 - y_{ni}) \ln \mathbf{1}(z_{ni} \leq 0) \Big] \\ & = \sum_{n=1}^{N}\sum_{i}^{I_{n}} \int_{-\infty}^{\infty} q(z_{ni})\Big\{y_{ni}\ln \mathbf{1}(z_{ni} > 0) + (1 - y_{ni}) \ln \mathbf{1}(z_{ni} \leq 0) \Big\}\;d z_{ni} \\ & = \sum_{n=1}^{N}\sum_{i}^{I_{n}} \begin{cases} \int_{-\infty}^{\infty} q(z_{ni}) \ln \mathbf{1}(z_{ni} > 0)\; d z_{ni} & \; \text{if } y_{ni} = 1 \\ \int_{-\infty}^{\infty} q(z_{ni}) \ln \mathbf{1}(z_{ni} \leq 0)\; d z_{ni} & \; \text{if } y_{ni} = 0 \\ \end{cases} \\ & = \sum_{n=1}^{N}\sum_{i}^{I_{n}} \begin{cases} \int_{-\infty}^{\infty} 0\; d z_{ni} & \; \text{if } y_{ni} = 1 \\ \int_{-\infty}^{\infty} 0\; d z_{ni} & \; \text{if } y_{ni} = 0 \\ \end{cases} \\ & = 0 \\ \end{align} \] \[ \begin{align} \mathbb{E}_{q(\mathbf{Z}, \mathbf{C}, \mathbf{w})}\Big[\ln p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X})\Big] & = \sum_{n=1}^{N}\sum_{k=1}^{K} \mathbb{E}_{q(c_{nk})} \big[c_{nk}\big]\mathbb{E}_{q(\mathbf{z}_{n},\mathbf{w}_{k})}\big[\ln\mathcal{N}(\mathbf{z}_{n}|\mathbf{X}_{n} \mathbf{w}_{k}, \mathbf{I}_{n}) \big] \\ & = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk}\left\{-\frac{I_{n}}{2}\ln \left(2\pi\right) - \frac{1}{2} \mathbb{E}_{q(\mathbf{z}_{n},\mathbf{w}_{k})} \left[ \left( \mathbf{z}_{n} - \mathbf{X}_{n} \mathbf{w}_{k}\right)^{T}\left(\mathbf{z}_{n} - \mathbf{X}_{n} \mathbf{w}_{k} \right) \right] \right\}\\ & = \sum_{n=1}^{N}\Bigg\{ -\frac{I_{n}}{2}\ln \left(2\pi\right) -\frac{1}{2}\mathbb{E}_{q(\mathbf{z})}\big[\mathbf{z}_{n}^{T}\mathbf{z}_{n}\big] + \sum_{k=1}^{K} r_{nk}\; \mathbf{m}_{k}^{T} \mathbf{X}_{n}^{T} \mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] \\ & \qquad\qquad\qquad\qquad\qquad\qquad -\frac{1}{2} \sum_{k=1}^{K} r_{nk}\; \text{tr}\Big( \mathbf{X}_{n}^{T} \mathbf{X}_{n} \big(\mathbf{m}_{k}\mathbf{m}_{k}^{T} + S_{k} \big)\Big)\Bigg\} \\ \mathbb{E}_{q(\mathbf{C}, \pmb{\pi})}\Big[\ln p(\mathbf{C} | \pmb{\pi})\Big] & = \sum_{n = 1}^{N}\sum_{k=1}^{K} \mathbb{E}_{q(c_{nk})}\big [c_{nk}\big ]\mathbb{E}_{q(\pi_{k})}\big[\ln\pi_{k}\big] \\ & = \sum_{n = 1}^{N}\sum_{k=1}^{K} r_{nk}\left\{\psi(\delta_{k}) - \psi(\hat{\delta})\right\} \\ \mathbb{E}_{q(\pmb{\pi})}\Big[\ln p(\pmb{\pi})\Big] & = \ln C(\pmb{\delta}_{0}) + \sum_{k=1}^{K} (\delta_{0_{k}} - 1)\left\{\psi(\delta_{k}) - \psi(\hat{\delta})\right\} \\ \mathbb{E}_{q(\mathbf{w}, \pmb{\tau})}\Big[\ln p(\mathbf{w} | \pmb{\tau})\Big] & = \sum_{k=1}^{K}\Bigg\{ \mathbb{E}_{q(\mathbf{w}_{k}, \tau_{k})} \left[ -\frac{D}{2}\ln 2\pi - \frac{D}{2}\ln \tau_{k}^{-1} - \frac{1}{2\tau_{k}^{-1}} \mathbf{w}_{k}^{T}\mathbf{w}_{k}\right]\Bigg\} \\ & = \sum_{k=1}^{K}\Bigg\{ -\frac{D}{2}\ln 2\pi + \frac{D}{2}\mathbb{E}_{q(\tau_{k})} \big[\ln \tau_{k}\big] - \frac{1}{2}\mathbb{E}_{q(\tau_{k})}\big[\tau_{k}\big] \mathbb{E}_{q(\mathbf{w}_{k})} \big[\mathbf{w}_{k}^{T}\mathbf{w}_{k}\big]\Bigg\} \\ & = \sum_{k=1}^{K}\Bigg\{ -\frac{D}{2}\ln 2\pi + \frac{D}{2}\Big(\psi(\alpha_{k}) - \ln \beta_{k}\Big) - \frac{\alpha_{k}}{2\beta_{k}} \Big(\mathbf{m}_{k}^{T}\mathbf{m}_{k} + \text{tr}(\mathbf{S}_{k})\Big)\Bigg\} \\ \mathbb{E}_{q(\pmb{\tau})}\Big[\ln p(\pmb{\tau})\Big] & = \sum_{k=1}^{K}\Bigg\{ \alpha_{0}\ln \beta_{0} + (\alpha_{0} - 
1)\Big(\psi(\alpha_{k}) - \ln \beta_{k}\Big) - \beta_{0}\frac{\alpha_{k}}{\beta_{k}} - \ln\Gamma(\alpha_{0})\Bigg\} \\ \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] & = \sum_{n=1}^{N}\sum_{i=1}^{I_{n}} \mathbb{E}_{q(z_{ni})}\left[ \ln \left(\mathcal{TN}_{+}(z_{ni} | \mu_{ni}, 1)^{y_{ni}} \mathcal{TN}_{-}(z_{ni} | \mu_{ni}, 1)^{(1-y_{ni})}\right)\right] \\ & = \sum_{n=1}^{N}\sum_{i=1}^{I_{n}} \mathbb{E}_{q(z_{ni})}\left[ \ln \left\{\mathcal{N}(z_{ni} | \mu_{ni}, 1) \left(\frac{1}{1 - \Phi_{ni}}\right)^{y_{ni}} \left(\frac{1}{\Phi_{ni}}\right)^{(1-y_{ni})}\right\}\right] \\ & = \sum_{n=1}^{N}\Bigg\{\mathbb{E}_{q(\mathbf{z}_{n})}\Big[ \ln\mathcal{N}(\mathbf{z}_{n} | \pmb{\mu}_{n}, \mathbf{I}_{n})\Big] - \sum_{i=1}^{I_{n}}\bigg\{ y_{ni}\ln \left(1-\Phi_{ni}\right) + (1-y_{ni})\ln\left(\Phi_{ni}\right)\bigg\}\Bigg\} \\ & = \sum_{n=1}^{N}\Bigg\{-\frac{I_{n}}{2}\ln2\pi -\frac{1}{2}\mathbb{E}_{q(\mathbf{z}_{n})} \big[\mathbf{z}_{n}^{T}\mathbf{z}_{n}\big] + \pmb{\mu}_{n}^{T} \mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] -\frac{1}{2}\pmb{\mu}_{n}^{T}\pmb{\mu}_{n} \\ & \qquad\qquad\qquad\qquad\qquad - \sum_{i=1}^{I_{n}} \bigg\{ y_{ni}\ln(1-\Phi_{ni}) + (1-y_{ni})\ln(\Phi_{ni}) \bigg\}\Bigg\} \\ \mathbb{E}_{q(\mathbf{C})}\Big[\ln q(\mathbf{C})\Big] & = \sum_{n=1}^{N}\sum_{k=1}^{K}r_{nk}\ln r_{nk} \\ \mathbb{E}_{q(\pmb{\pi})}\Big[\ln q(\pmb{\pi})\Big] & = \ln C(\pmb{\delta}) + \sum_{k=1}^{K}(\delta_{k} - 1)\left\{\psi(\delta_{k}) - \psi(\hat{\delta})\right\} \\ \mathbb{E}_{q(\mathbf{w})}\Big[\ln q(\mathbf{w})\Big] & = \sum_{k=1}^{K}\Big\{-\frac{1}{2}\ln |\mathbf{S}_{k}| - \frac{D}{2}(1 + \ln 2\pi)\Big\} \\ \mathbb{E}_{q(\pmb{\tau})}\Big[\ln q(\pmb{\tau})\Big] & = \sum_{k=1}^{K}\Big\{-\ln\Gamma(\alpha_{k}) + (\alpha_{k} - 1)\psi(\alpha_{k}) + \ln \beta_{k} - \alpha_{k}\Big\} \end{align} \] where \(\Phi_{ni} \equiv \Phi(-\mu_{ni})\), and \(\Phi(\cdot)\) is the cumulative distribution function (cdf) of the standard normal distribution. The quantity \(\mathbb{E}_{q(\mathbf{Z})}\Big[\ln p(\mathbf{Y} | \mathbf{Z})\Big]\) will always be zero under the variational factors, hence it does not contribute to the ELBO. Also, by inspecting the expectations \(\mathbb{E}_{q(\mathbf{Z},\mathbf{C}, \mathbf{w})}\Big[\ln p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X})\Big]\) and \(\mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big]\) some of the elements cancel out, thus we have \[ \begin{align} \mathbb{E}_{q(\mathbf{Z}, \mathbf{C}, \mathbf{w})}\Big[\ln p(\mathbf{Z} | \mathbf{C}, \mathbf{w}, \mathbf{X})\Big] - \mathbb{E}_{q(\mathbf{Z})}\Big[\ln q(\mathbf{Z})\Big] & = \sum_{n=1}^{N}\Bigg\{ -\frac{1}{2}\sum_{k=1}^{K}r_{nk}\; \text{tr}\left(\mathbf{X}_{n}^{T}\mathbf{X}_{n} \Big(\mathbf{m}_{k}\mathbf{m}_{k}^{T} + \mathbf{S}_{k}\Big)\right) + \frac{1}{2} \pmb{\mu}_{n}^{T}\pmb{\mu}_{n} \\ & \qquad\qquad\qquad + \sum_{i=1}^{I_{n}}\bigg\{ y_{ni}\ln \left(1-\Phi_{ni}\right) + (1-y_{ni})\ln\left(\Phi_{ni}\right)\bigg\}\Bigg\} \\ \end{align} \]

5.7 Predictive density

The predictive density of a new observation \(\mathbf{y}_{*}\) which will be associated with a latent variable \(\mathbf{c}_{*}\), latent observation \(\mathbf{z}_{*}\) and covariates \(\mathbf{X}_{*}\) is given by \[ \begin{align} p(\mathbf{y}_{*} | \mathbf{X}_{*}, \mathbf{Y}, \mathbf{X}) & = \sum_{c}\int_{-\infty}^{\infty}\int\int\int p(\mathbf{y}_{*}, \mathbf{c}_{*}, \mathbf{z}_{*}, \pmb{\pi}, \mathbf{w}, \pmb{\tau} | \mathbf{X}_{*}, \mathbf{Y}, \mathbf{X}) \;d\pmb{\pi}\;d\pmb{\tau}\;d\mathbf{w}\;d\mathbf{z}_{*} \\ & = \sum_{c}\int_{-\infty}^{\infty}\int\int\int p(\mathbf{y}_{*} | \mathbf{z}_{*}) p(\mathbf{z}_{*} | \mathbf{c}_{*}, \mathbf{w}, \mathbf{X}_{*}) p(\mathbf{c}_{*} | \pmb{\pi}) p(\pmb{\pi}, \mathbf{w}, \pmb{\tau} | \mathbf{Y}, \mathbf{X}) \;d\pmb{\pi}\;d\pmb{\tau}\;d\mathbf{w}\;d\mathbf{z}_{*} \\ & = \sum_{k=1}^{K}\int_{-\infty}^{\infty}\int\int\int p(\mathbf{y}_{*} | \mathbf{z}_{*}) p(\mathbf{z}_{*} | \mathbf{w}_{k}, \mathbf{X}_{*}) \pi_{k}\; p(\pmb{\pi}, \mathbf{w}, \pmb{\tau} | \mathbf{Y}, \mathbf{X}) \;d\pmb{\pi}\;d\pmb{\tau}\;d\mathbf{w}\;d\mathbf{z}_{*} \qquad\;\;(\text{Summation over}\; c) \\ & = \sum_{k=1}^{K}\int_{-\infty}^{\infty}\int\int\int p(\mathbf{y}_{*} | \mathbf{z}_{*}) p(\mathbf{z}_{*} | \mathbf{w}_{k}, \mathbf{X}_{*}) \pi_{k}\; q(\pmb{\pi}) q(\mathbf{w}_{k}) q(\tau_{k}) \;d\pmb{\pi}\;d\tau_{k}\;d\mathbf{w}_{k}\;d\mathbf{z}_{*} \qquad(\text{Variational approximation}) \\ & = \sum_{k=1}^{K}\frac{\delta_{k}}{\hat{\delta}}\int_{-\infty}^{\infty}\int p(\mathbf{y}_{*} | \mathbf{z}_{*}) \mathcal{N}(\mathbf{z}_{*} | \mathbf{X}_{*}\mathbf{w}_{k}, \mathbf{I}_{n}) \mathcal{N}(\mathbf{w}_{k} | \mathbf{m}_{k}, \mathbf{S}_{k}) \;d\mathbf{w}_{k}\;d\mathbf{z}_{*} \qquad\qquad\qquad(\text{Integrate}\; \pmb{\pi}\;\text{and}\; \tau_{k})\\ & = \sum_{k=1}^{K}\frac{\delta_{k}}{\hat{\delta}}\int_{-\infty}^{\infty} p(\mathbf{y}_{*} | \mathbf{z}_{*}) \mathcal{N}\left(\mathbf{z}_{*} | \mathbf{X}_{*}\mathbf{m}_{k},\;\mathbf{I}_{n} + \text{diag}\big(\mathbf{X}_{*} \mathbf{S}_{k} \mathbf{X}_{*}^{T}\big)\right) \;d\mathbf{z}_{*} \quad\qquad\qquad\quad\;(\text{Integrate}\; \mathbf{w}_{k}) \\ & = \sum_{k=1}^{K}\frac{\delta_{k}}{\hat{\delta}}\begin{cases} \int_{0}^{\infty} \mathcal{N}\left(\mathbf{z}_{*} | \mathbf{X}_{*}\mathbf{m}_{k},\;\mathbf{I}_{n} + \text{diag}\big(\mathbf{X}_{*} \mathbf{S}_{k} \mathbf{X}_{*}^{T}\big)\right)\;dz & \; \text{where } \mathbf{y}_{*} = 1\\ \int_{-\infty}^{0} \mathcal{N}\left(\mathbf{z}_{*} | \mathbf{X}_{*}\mathbf{m}_{k},\;\mathbf{I}_{n} + \text{diag}\big(\mathbf{X}_{*} \mathbf{S}_{k} \mathbf{X}_{*}^{T}\big)\right)\;dz & \; \text{where }\mathbf{y}_{*} = 0\\ \end{cases} \\ & = \sum_{k=1}^{K}\frac{\delta_{k}}{\hat{\delta}}\Phi \left(\frac{\mathbf{X}_{*}\mathbf{m}_{k}}{\left(\mathbf{I}_{n} + \text{diag}\big(\mathbf{X}_{*} \mathbf{S}_{k} \mathbf{X}_{*}^{T}\big)\right)^{1/2}} \right)^{y_{*}}\left(1 - \Phi \left(\frac{\mathbf{X}_{*}\mathbf{m}_{k}}{\left(\mathbf{I}_{n} + \text{diag}\big(\mathbf{X}_{*} \mathbf{S}_{k} \mathbf{X}_{*}^{T}\big)\right)^{1/2}}\right)\right)^{(1-y_{*})} \\ & = \sum_{k=1}^{K}\frac{\delta_{k}}{\hat{\delta}}\mathcal{B}ernoulli\left(y_{*} \Big{|} \Phi \left(\frac{\mathbf{X}_{*}\mathbf{m}_{k}}{\left(\mathbf{I}_{n} + \text{diag}\big(\mathbf{X}_{*} \mathbf{S}_{k} \mathbf{X}_{*}^{T}\big)\right)^{1/2}}\right)\right) \end{align} \]
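A minimal sketch of this predictive computation for a single new covariate row, assuming the fitted variational parameters delta (Dirichlet parameters), m (a \(D \times K\) matrix of means \(\mathbf{m}_{k}\)) and S (a \(D \times D \times K\) array of covariances \(\mathbf{S}_{k}\)) are available; the function name is hypothetical and it mirrors the vb_prmm_predictive function further down.

predictive_prob <- function(x_star, delta, m, S) {
  K <- length(delta)
  pi_k <- delta / sum(delta)                              # E[pi_k] under q(pi)
  probs <- sapply(1:K, function(k) {
    mu_k <- sum(x_star * m[, k])                          # x_star' m_k
    s_k  <- sqrt(1 + c(x_star %*% S[, , k] %*% x_star))   # predictive standard deviation
    pnorm(mu_k / s_k)                                     # Phi(x_star' m_k / s_k)
  })
  sum(pi_k * probs)                                       # P(y_star = 1 | x_star)
}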

5.8 Update equations summary

Below we summarize the update equations we obtain for the variational Bayes algorithm, which we split into variational E and M steps to be consistent with the Expectation-Maximization algorithm.

5.8.1 Variational E-step

Compute responsibilities \[ r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K}\rho_{nj}} \] where \(\rho_{nk}\) is given by \[ \rho_{nk} \propto \exp\left\{\psi(\delta_{k}) - \psi(\hat{\delta}) - \frac{1}{2}\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}^{T}\mathbf{z}_{n}\big] + \mathbf{m}_{k}^{T} \mathbf{X}_{n}^{T}\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] -\frac{1}{2} \text{tr}\left( \mathbf{X}_{n}^{T} \mathbf{X}_{n} \big(\mathbf{m}_{k}\mathbf{m}_{k}^{T} + \mathbf{S}_{k} \big)\right) \right\} \] where \[ \begin{align} \mathbb{E}_{q(z_{ni})}\big[z_{ni}\big] & = \begin{cases} \mu_{ni} + \phi_{ni}/(1 - \Phi_{ni}) & \; \text{if } y_{ni} = 1\\ \mu_{ni} - \phi_{ni}/\Phi_{ni} & \; \text{if } y_{ni} = 0\\ \end{cases} \\ \mathbb{E}_{q(z_{ni})}\big[z_{ni}^{2}\big] & = 1 + \mu_{ni} \;\mathbb{E}_{q(z_{ni})}\big[z_{ni}\big] \end{align} \]

5.8.2 Variational M-step

Compute variational distribution parameters \[ \begin{align} \pmb{\mu}_{n} & = \mathbf{X}_{n}\sum_{k=1}^{K} r_{nk} \;\mathbf{m}_{k} \\ \delta_{k} & = \delta_{0_{k}} + \sum_{n=1}^{N}r_{nk} \\ \mathbf{m}_{k} & = \mathbf{S}_{k}\sum_{n=1}^{N}r_{nk}\mathbf{X}_{n}^{T}\mathbb{E}_{q(\mathbf{z}_{n})}\big[\mathbf{z}_{n}\big] \\ \mathbf{S}_{k} & = \left(\frac{\alpha_{k}}{\beta_{k}}\mathbf{I} + \sum_{n=1}^{N}r_{nk} \mathbf{X}_{n}^{T}\mathbf{X}_{n}\right)^{-1} \\ \alpha_{k} & = \alpha_{0} + \frac{D}{2} \\ \beta_{k} & = \beta_{0} + \frac{1}{2}\left(\mathbf{m}_{k}^{T}\mathbf{m}_{k} + \text{tr}(\mathbf{S}_{k})\right)\\ \end{align} \]

6 Code implementation in R

Helper functions for plotting, simulating probit regression data, and general utilities

suppressPackageStartupMessages(require(BPRMeth))
suppressPackageStartupMessages(require(matrixcalc))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(purrr))
suppressPackageStartupMessages(library(mvtnorm))
suppressPackageStartupMessages(library(Matrix))
suppressPackageStartupMessages(require(gganimate))

# Define ggplot2 theme
gg_theme <- function(){
  p <- theme(
      plot.title = element_text(size = 20,face = 'bold',
                                margin = margin(0,0,3,0), hjust = 0.5),
      axis.text = element_text(size = rel(1.05), color = 'black'),
      axis.title = element_text(size = rel(1.45), color = 'black'),
      axis.title.y = element_text(margin = margin(0,10,0,0)),
      axis.title.x = element_text(margin = margin(10,0,0,0)),
      axis.ticks.x = element_line(colour = "black", size = rel(0.8)),
      axis.ticks.y = element_blank(),
      legend.position = "right",
      legend.key.size = unit(1.4, 'lines'),
      legend.title = element_text(size = 12, face = 'bold'),
      legend.text = element_text(size = 12),
      panel.border = element_blank(),
      panel.grid.major = element_line(colour = "gainsboro"),
      panel.background = element_blank()
    )
  return(p)
}

# Plot the predictive distribution
draw_predictive_anim <- function(X, title="", ...){
  p <- ggplot(X, aes(x = xs, y = ys, color = Cluster, frame = iter)) +
    geom_line(size = 2) +
    geom_ribbon(aes(ymin = ys_low, ymax = ys_high, fill = Cluster), 
                alpha = 0.23, size = 0.1) +
    scale_x_continuous(limits = c(-1, 1), 
                       labels = c("-5kb", "", "TSS", "", "+5kb")) + 
    scale_color_brewer(palette = "Dark2") +
    scale_fill_brewer(palette = "Dark2") +
    labs(title = title, x = "x", y = "y") + gg_theme()
  return(p)
}

# Use the log sum exp trick for having numeric stability
log_sum_exp <- function(x) {
  # Computes log(sum(exp(x))) in a numerically stable way
  offset <- max(x)
  s <- log(sum(exp(x - offset))) + offset
  i <- which(!is.finite(s))
  if (length(i) > 0) { s[i] <- offset }
  return(s)
}

#' Generating probit regression data for a given region
generate_pr_data <- function(basis, w, max_cov = 40, xmin = -500, 
                             xmax = 500){
  # L is the number of observations found in the ith region
  L <- rbinom(n = 1, size = max_cov, prob = .8)
  x <- matrix(0, nrow = L, ncol = 2)
  # Randomly sample locations for the observations
  obs <- sort(sample(xmin:xmax, L))
  # Scale to (-1,1)
  x[, 1] <- BPRMeth:::.minmax_scaling(data = obs, xmin = xmin, 
                                xmax = xmax, fmin = -1, fmax = 1)
  H      <- design_matrix(basis, x[, 1])$H
  p_suc  <- pnorm(H %*% w) + rnorm(NROW(H), mean = 0, sd = 0.05)
  p_suc[which(p_suc > (1 - 1e-10))] <- 1 - 1e-10
  p_suc[which(p_suc < 1e-10)] <- 1e-10
  x[, 2] <- rbinom(NROW(H), 1, p_suc)
  return(x)
}

#' Generating probit regression data
generate_data <- function(N = 300, K = 3, pi_k = c(0.45, 0.35, 0.2),
                          w = NULL, basis = create_rbf_object(M = 3),
                          max_cov = 50, xmin = -500, xmax = 500){
  require(BPRMeth)
  x <- vector("list", length = N)
  if (is.null(w)) {
    w <- list(w1 = c(-1, -1, 0.9, 3),
              w2 = c(0.1, -2.4, 3, -2),
              w3 = c(0.4, 0.7, .7, -2.8))
  }
  # Choose cluster to generate the data
  C_true <- t(rmultinom(N, 1, pi_k))
  C_true <- unlist(apply(C_true, 1, function(x) 
      which(x == max(x, na.rm = TRUE))[1]))
  for (i in 1:N) {
    x[[i]] <- generate_pr_data(basis = basis, w = w[[C_true[i]]],
                     max_cov = max_cov, xmin = xmin, xmax = xmax)
  }
  return(list(x = x, C_true = C_true))
}

## A general-purpose adder:
add_func <- function(x) Reduce("+", x)

Function for fitting the variational Bayesian probit regression mixture model.

# Compute predictive distribution of VB_PRMM model
vb_prmm_predictive <- function(model, X_test){
  Phi = mu_pred = s_pred <- matrix(0,ncol = model$K,nrow = NROW(X_test))
  for (k in 1:model$K) {
    # Predictive mean
    mu_pred[,k] <- c(X_test %*% model$m[,k])
    # Predictive variance
    s_pred[,k] <- sqrt(1 + diag(X_test %*% model$S[,,k] %*% t(X_test)))
    Phi[,k] <- pnorm(mu_pred[,k] / s_pred[,k])
  }
  pi_k <- model$delta / sum(model$delta)
  return(list(Phi = Phi,mu_pred = mu_pred,s_pred = s_pred,pi_k = pi_k))
}

# Fit VB_PRMM model
vb_prmm <- function(x, K = 3, basis, delta_0 = rep(1/K, K), 
                    alpha_0 = 1e-1, beta_0 = 1e-1,
                    max_iter = 500, epsilon_conv = 1e-4, 
                    is_animation = FALSE, is_verbose = FALSE){
  assertthat::assert_that(is.list(x))
  # Compute E[z]
  update_Ez <- function(E_z, mu, y_1, y_0){
    E_z[y_1] <- mu[y_1] + dnorm(-mu[y_1]) / (1 - pnorm(-mu[y_1]))
    E_z[y_0] <- mu[y_0] - dnorm(-mu[y_0]) / pnorm(-mu[y_0])
    return(E_z)
  }
  # Animation plot
  dt_frame <- NULL
  if (is_animation) {
    xs <- seq(-1,1,length.out = 100)
    x_test <- design_matrix(obj = basis, obs = xs)$H
    dt_frame <- data.table(xs = numeric(), ys = numeric(), 
        ys_low = numeric(), ys_high = numeric(), Cluster = numeric(), 
        iter = numeric())
  }
  
  N <- length(x)            # Number of observations
  D <- basis$M + 1          # Number of features
  L <- rep(-Inf, max_iter)  # Store the lower bounds
  r_nk = log_r_nk = log_rho_nk <- matrix(0,nrow = N,ncol = K)
  E_ww <- vector("numeric", length = K)
    
  #------------------------------------
  # Initialize variables
  y <- lapply(X = x, FUN = function(x) x[,2])   # Extract responses y_{n}
  # Create design matrix X 
  X <- lapply(X = x, FUN = function(x) 
      design_matrix(obj = basis, obs = x[,1])$H) 
  # Compute X_{n}'X_{n}
  XX <- lapply(X = X, FUN = function(x) crossprod(x))         
  y_1 <- lapply(1:N, function(n) which(y[[n]] == 1))
  y_0 <- lapply(1:N, function(n) which(y[[n]] == 0))
  E_z = E_zz <- y
  # Compute shape parameter of Gamma
  alpha_k <- rep(alpha_0 + D/2, K)                        
  W_opt <- infer_profiles_mle(X = x, model = "bernoulli", basis = basis,
                             H = X, lambda = .5, opt_itnmax = 50)$W
  # Use Kmeans with random starts
  cl  <- stats::kmeans(W_opt, K, nstart = 25)              
  # Create a 1-of-K (one-hot) encoding
  C_n <- as.matrix(Matrix::sparseMatrix(1:N, cl$cluster, x = 1))
  # Mean for each cluster
  m_k <- t(cl$centers)                                     
  S_k <- array(0, dim = c(D,D,K))
  # Covariance of each cluster
  for (k in 1:K) { S_k[,,k] <- solve(diag(2, D))}   
  # Scale of precision matrix
  beta_k   <- rep(beta_0, K)    
  # Dirichlet parameter
  delta_k  <- delta_0                  
  # Expectation of log Dirichlet  
  e_log_pi <- digamma(delta_k) - digamma(sum(delta_k))     
  mk_Sk    <- lapply(X = 1:K, function(k) tcrossprod(m_k[, k]) + S_k[,,k])
  # Update \mu, E[z] and E[z^2]
  mu <- lapply(1:N, function(n) c(X[[n]] %*% rowMeans(m_k)))
  # mu <- lapply(1:N, function(n) c(X[[n]] %*% rowSums(sapply(X = 1:K,
  # function(k) C_n[n,k]*m_k[,k]))))
  E_z <- lapply(1:N, function(n) update_Ez(E_z = E_z[[n]], mu = mu[[n]], 
                                    y_0 = y_0[[n]], y_1 = y_1[[n]])) 
  E_zz <- lapply(1:N, function(n) 1 + mu[[n]] * E_z[[n]])

  # Iterate to find optimal parameters
  for (i in 2:max_iter) {
    ##-------------------------------
    # Variational E-Step
    ##-------------------------------
    for (k in 1:K) {
      log_rho_nk[,k] <- e_log_pi[k] + sapply(1:N, function(n) 
        -0.5*crossprod(E_zz[[n]]) + m_k[,k] %*% t(X[[n]]) %*% E_z[[n]] - 
            0.5*matrix.trace(XX[[n]] %*% mk_Sk[[k]]))
    }
    # Responsibilities using the logSumExp for numerical stability
    log_r_nk <- log_rho_nk - apply(log_rho_nk, 1, log_sum_exp)
    r_nk     <- exp(log_r_nk)
    
    ##-------------------------------
    # Variational M-Step
    ##-------------------------------
    # Update Dirichlet parameter
    delta_k <- delta_0 + colSums(r_nk)  
    for (k in 1:K) {
      # Update covariance for Gaussian
      w_XX <- lapply(X = 1:N, function(n) XX[[n]]*r_nk[n,k])
      S_k[,,k] <- solve(diag(alpha_k[k]/beta_k[k], D) + 
                            add_func(w_XX))
      # Update mean for Gaussian
      w_Xz <- lapply(X = 1:N, function(n) t(X[[n]]) %*% 
                         E_z[[n]]*r_nk[n,k])
      m_k[,k] <- S_k[,,k] %*% add_func(w_Xz)
      # Update \beta_k parameter for Gamma
      E_ww[k] <- crossprod(m_k[,k]) + matrix.trace(S_k[,,k])
      beta_k[k]  <- beta_0 + 0.5*E_ww[k]
    }
    # Update \mu, E[z] and E[z^2]
    mu <- lapply(1:N, function(n) c(X[[n]] %*% rowSums(sapply(X = 1:K,
                        function(k) r_nk[n,k]*m_k[,k]))))
    E_z <- lapply(1:N, function(n) 
        update_Ez(E_z = E_z[[n]], mu = mu[[n]], 
                  y_0 = y_0[[n]], y_1 = y_1[[n]])) 
    E_zz <- lapply(1:N, function(n) 1 + mu[[n]] * E_z[[n]])
    # Update expectations over \ln\pi
    e_log_pi  <- digamma(delta_k) - digamma(sum(delta_k))
    # Compute expectation of E[a]
    E_alpha <- alpha_k / beta_k  
    # Perform model selection
    # e_log_pi <- colMeans(r_nk)
    
    ##-------------------------------
    # Variational lower bound
    ##-------------------------------
    mk_Sk    <- lapply(X = 1:K, function(k) tcrossprod(m_k[, k]) + 
                           S_k[,,k])
    lb_pz_qz <- sum(sapply(1:N, function(n) 0.5*crossprod(mu[[n]]) + 
      sum(y[[n]] * log(1 - pnorm(-mu[[n]])) + (1 - y[[n]]) * 
      log(pnorm(-mu[[n]]))) - 0.5*sum(sapply(1:K, function(k) 
      r_nk[n,k] * matrix.trace(XX[[n]] %*% mk_Sk[[k]]) )) )) 
    lb_p_w   <- sum(-0.5*D*log(2*pi) + 0.5*D*(digamma(alpha_k) - 
      log(beta_k)) - 0.5*E_alpha*E_ww)
    lb_p_c   <- sum(r_nk %*% e_log_pi)   
    lb_p_pi  <- sum((delta_0 - 1)*e_log_pi) + lgamma(sum(delta_0)) - 
      sum(lgamma(delta_0))
    lb_p_tau <- sum(alpha_0*log(beta_0) + (alpha_0 - 1)*(digamma(alpha_k) - 
      log(beta_k)) - beta_0*E_alpha - lgamma(alpha_0))
    lb_q_c   <- sum(r_nk*log_r_nk)
    lb_q_pi  <- sum((delta_k - 1)*e_log_pi) + lgamma(sum(delta_k)) - 
      sum(lgamma(delta_k))
    lb_q_w   <- sum(-0.5*log(sapply(X = 1:K,function(k) det(S_k[,,k]))) - 
      0.5*D*(1 + log(2*pi)))
    lb_q_tau <- sum(-lgamma(alpha_k) + (alpha_k - 1)*digamma(alpha_k) + 
      log(beta_k) - alpha_k)
    # Sum all parts to compute lower bound
    L[i] <- lb_pz_qz + lb_p_c + lb_p_pi + lb_p_w + lb_p_tau - lb_q_c - 
        lb_q_pi - lb_q_w - lb_q_tau
    
    if (is_animation) {
      if ( (i - 1) %% 5 == 0 | i < 10) {
        ys <- vb_prmm_predictive(model = list(m = m_k, S = S_k, K = K, 
                      delta = delta_k), X_test = x_test)$Phi
        for (k in 1:K) {
          dt_frame <- rbind(dt_frame,data.table(xs = xs, ys = ys[,k], 
                ys_low = ys[,k] - ys[,k]*(1 - ys[,k]), 
                ys_high = ys[,k] + ys[,k]*(1 - ys[,k]),
                Cluster = k, iter = i - 1))
        }
      }
    }
    # Show VB difference
    if (is_verbose) { 
      cat("It:\t",i,"\tLB:\t",L[i],"\tLB_diff:\t",L[i] - L[i - 1],"\n")
      cat("Z: ",lb_pz_qz,"\tC: ",lb_p_c - lb_q_c,
          "\tW: ",lb_p_w - lb_q_w,
          "\tPi: ",lb_p_pi - lb_q_pi,
          "\tTau: ",lb_p_tau - lb_q_tau,"\n")
    }
    # Check if lower bound decreases
    if (L[i] < L[i - 1]) { message("Warning: Lower bound decreases!\n") }
    # Check for convergence
    if (abs(L[i] - L[i - 1]) < epsilon_conv) { break }
    # Check if VB converged in the given maximum iterations
    if (i == max_iter) {warning("VB did not converge!\n")}
  }
  if (is_animation) {
    dt_frame <- dt_frame %>% .[, iter := as.factor(iter)] %>% 
        .[, Cluster := as.factor(Cluster)]
  }
  obj <- structure(list(m = m_k,S = S_k,r_nk = r_nk,delta = delta_k,
            beta = beta_k, alpha = alpha_k,K = K,N = N,D = D,
            L = L[2:i],dt_frame = dt_frame), class = "vb_gmm")
  return(obj)
}

6.1 Cluster synthetic probit regression data

set.seed(2)  # For reproducibility
x <- generate_data(N = 300)$x  # Observations
K <- 5       # Number of clusters
basis <- create_rbf_object(M = 5) # Basis function object
# Run the variational Bayes probit regression mixture model (vb_prmm)
vb_prmm_model <- vb_prmm(x = x, K = K, delta_0 = rep(1e-10,K), 
  basis = basis, max_iter = 101, is_animation = TRUE, is_verbose = FALSE)

Let’s examine the posterior predictive distribution as the variational algorithm iterates.

p <- draw_predictive_anim(X = vb_prmm_model$dt_frame, 
                          title = "Probit regr mixture model, Iter:")
gganimate(p, interval = .55, "figures/vb_bpr/vbprmm.gif", 
          ani.width = 750, ani.height = 470)
Figure: Variational Bayes probit regression mixture model.

6.2 Perform model selection using the lower bound

We keep the data generated in the previous example and fit a sequence of models of increasing complexity, i.e. we increase the number of clusters \(K\) and observe how the lower bound \(\mathcal{L}(q)\) changes. Plotting the lower bound \(\mathcal{L}(q)\) against the model complexity \(\mathcal{M}\) (i.e. the number of clusters), where the value of the bound approximates the log model evidence \(\ln p(\mathbf{y} | \mathcal{M})\), we see that the bound peaks at \(\mathcal{M} = 3\) (i.e. three clusters), corresponding to the true model that generated the data.
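
A minimal sketch of this comparison is given below; it reuses the x and basis objects from the previous example, and the candidate range of clusters and the reduced number of iterations are illustrative choices rather than settings from the original analysis.

# Sketch: fit models with an increasing number of clusters and compare
# the lower bound at convergence (candidate range is an assumption)
K_candidates <- 1:6
lb_K <- sapply(K_candidates, function(k) {
  fit <- vb_prmm(x = x, K = k, delta_0 = rep(1e-10, k), basis = basis,
                 max_iter = 101)
  tail(fit$L, n = 1)  # Final value of the lower bound
})
plot(K_candidates, lb_K, type = "b", xlab = "Number of clusters K",
     ylab = "Lower bound L(q)")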

We can use the same approach to test how many basis functions \(M\) are needed to model the data, keeping the number of clusters fixed at \(K = 3\).
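
A corresponding sketch, where the candidate range of basis functions is again an illustrative assumption:

# Sketch: vary the number of RBF basis functions with K fixed at 3
M_candidates <- 2:7
lb_M <- sapply(M_candidates, function(m) {
  basis_m <- create_rbf_object(M = m)
  fit <- vb_prmm(x = x, K = 3, delta_0 = rep(1e-10, 3), basis = basis_m,
                 max_iter = 101)
  tail(fit$L, n = 1)
})
plot(M_candidates, lb_M, type = "b", xlab = "Number of basis functions M",
     ylab = "Lower bound L(q)")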

Finally, the following animation shows how the Bayesian approach automatically performs model selection (i.e. selects the effective number of clusters). An interesting observation is that some profiles collapse onto one another, so they all explain the same cluster equally well, and a post-processing step is required to merge these clusters. Due to this collapsing, the model is effectively penalized for using more clusters than required, since the redundant components do not contribute to increasing the lower bound.
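
As a rough illustration of such a post-processing step (not part of the model above), we could flag pairs of components whose posterior mean weight vectors are nearly identical; the helper find_collapsed and its distance threshold tol are hypothetical choices for illustration only.

# Hypothetical sketch: detect collapsed components by comparing the
# posterior mean weights m_k of a fitted model (threshold is arbitrary)
find_collapsed <- function(model, tol = 0.1) {
  d <- as.matrix(dist(t(model$m)))  # Pairwise distances between cluster means
  which(d < tol & upper.tri(d), arr.ind = TRUE)  # Candidate pairs to merge
}
find_collapsed(vb_prmm_model)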

Figure: Variational Bayes probit regression mixture model: Model Selection.

7 Conclusions

This tutorial-like document showed how we can perform variational Bayes inference for mixtures of Bayesian probit regression models.

If you found this document useful, check my homepage at the University of Edinburgh for links to other tutorials.