If you found this document useful, check my homepage at the University of Edinburgh for links to other tutorials.

1 Introduction

Suppose we have \(N\) independent observations \(y_{i} = (m_{i}, N_{i})\) that follow a Binomial distribution,

\[ m_{i} \sim \mathcal{B}inomial(N_{i}, p_{i}) \]

where \(N_{i}\) is the total number of trials for the \(i^{th}\) observation and \(m_{i}\) denotes the number of successes. Here \(p_{i}\) is the probability of success and, as in the binary probit model (see the tutorial on performing Bayesian binary probit regression), takes the form \(p_{i} = \Phi(\boldsymbol{x}_{i}\boldsymbol{\theta})\), where \(\boldsymbol{x}_{i} = (x_{i,1}, ..., x_{i,D})\) is a vector of covariates, \(\boldsymbol{\theta}\) represents a (\(D \times 1\)) column vector of regression coefficients, and \(\Phi(\cdot)\) denotes the cumulative distribution function (cdf) of the standard normal distribution.

1.1 Generate synthetic data

As a first step, we create some synthetic data that will be used to fit our model. In this tutorial we will use the BPRMeth package, which implements the Binomial Probit Regression (BPR) model under a maximum likelihood approach. Also, to obtain a much more useful class of functions, we will take linear combinations of a fixed set of nonlinear functions of the input variable \(x\), known as basis functions. Here we will consider Radial Basis Functions (RBFs); however, other basis functions, such as polynomial or Fourier, could also be used. For a single input variable \(x\), the RBF takes the form \(h_{j}(x) = \exp(-\gamma || x - \mu_{j} ||^2)\), where \(\mu_{j}\) denotes the location of the \(j^{th}\) basis function in the input space and \(\gamma\) controls the spatial scale.
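To make the basis expansion concrete, below is a minimal hand-rolled sketch of an RBF design matrix for a vector of inputs, assuming three evenly spaced centres \(\mu_{j}\) and a fixed \(\gamma = 10\) (these are illustrative choices; the BPRMeth package selects the centres and scale internally, so its design matrix will not match these numbers exactly).

# Illustrative RBF design matrix (NOT BPRMeth's internal implementation;
# the centres mu and scale gamma are arbitrary choices for this sketch)
rbf_design <- function(x, mu, gamma = 10) {
  # First column of ones corresponds to the offset coefficient theta_0
  H <- cbind(1, sapply(mu, function(m) exp(-gamma * (x - m)^2)))
  colnames(H) <- c("offset", paste0("rbf_", seq_along(mu)))
  return(H)
}
# Example: evaluate 3 RBFs with centres spread over (-1, 1)
H_example <- rbf_design(x = seq(-1, 1, length.out = 5), mu = c(-0.5, 0, 0.5))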

# Install the BPRMeth package

## From Bioconductor
# source("https://bioconductor.org/biocLite.R")
# biocLite("BPRMeth")

## Or from Github
# install.packages("devtools")
# devtools::install_github("andreaskapou/BPRMeth")

require(BPRMeth)

# Set seed for reproducible results
set.seed(1234)

# Create 400 covariate data x which lie between (-1, 1)
N <- 400
x <- seq(-1, 1, length.out = N)

# Number of basis functions
D <- 4
# Create RBF basis object (we set D-1, since theta_0 is the offset parameter)
basis <- create_rbf_object(M = D - 1)
# Create design matrix H (design_matrix is provided by the BPRMeth package)
H <- design_matrix(basis, x)$H

# True values of regression coefficients theta
true_theta <- c(-0.6, 0.75, 1.2, -0.8)
# Obtain the vector with probabilities of success p using the probit link
p <- pnorm(H %*% true_theta)

# Generate total number of trials N_i
N_i <- rbinom(N, 40, 0.8)
# Generate number of successes m_i
m_i <- rbinom(N, N_i, p)
# Combine observation data to matrix y
y_data <- cbind(N_i, m_i)

2 MLE for Binomial Probit Regression

We can directly compute the maximum likelihood estimate of the model parameters using the infer_profiles_mle() function from the BPRMeth package. The likelihood for the BPR model is the following:

\[ p(\boldsymbol{y} | \boldsymbol{\theta}) = \prod\limits_{i = 1}^{N} \bigg\lbrace \mathcal{B}inom\Big(m_{i} | N_{i}, \Phi\big(g(x_{i};\boldsymbol{\theta})\big)\Big) \bigg\rbrace \]

where the function \(g(x)\) is given as a linear combination of \(M\) fixed non-linear basis functions \(h_{j}(\cdot)\) (in our case RBFs) of the input space \(x\), of the form:

\[ g(x) = \sum\limits_{j = 0}^{M-1} \theta_{j} h_{j}(x) = \boldsymbol{h}(x)\boldsymbol{\theta} \]

As shown below, the ML estimates of the model parameters are quite close to the true model parameters.

# Initial theta parameters
theta_init <- rep(0, D)
# Create object required for the BPRMeth package
obj <- cbind(x, y_data)

# Fit the model to the data
theta_optim <- infer_profiles_mle(X = obj, model = "binomial", 
                                  w = theta_init, basis = basis)

# ------------------------------------------
# MLE estimates of the RBF coefficients
# ------------------------------------------
#  theta_0     theta_1    theta_2    theta_3
#  -0.6355983  0.7555961  1.1903444  -0.7664941
mle_theta <- c(theta_optim$W)

# Plot covariates x versus the ratio of observations m_i/N_i
plot(x, m_i/N_i, main = "Synthetic data", ylab = "p")
# Show the fitted function using the true and MLE estimates
lines(x = x, y = pnorm(H %*% true_theta), col = "coral1", lwd = 2)
lines(x = x, y = pnorm(H %*% mle_theta), col = "cornflowerblue", lwd = 2)
legend("topright", legend = c("True", "MLE"), 
       col = c("coral1", "cornflowerblue"), 
       bty = 'n', lwd = 2, inset = c(0.015, 0.03), lty = 1, cex = 0.9)
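As a sanity check, one could also maximise the same Binomial log-likelihood directly with base R's optim(), using the design matrix H computed above. This is a minimal sketch; the package may use a different optimiser and settings, so the estimates need not match infer_profiles_mle() exactly.

# Negative Binomial-probit log-likelihood as a function of theta
neg_log_lik <- function(theta, H, N_i, m_i) {
  p_hat <- pnorm(H %*% theta)
  # Clip probabilities away from 0 and 1 to avoid -Inf in the log-likelihood
  p_hat <- pmin(pmax(p_hat, 1e-12), 1 - 1e-12)
  return(-sum(dbinom(x = m_i, size = N_i, prob = p_hat, log = TRUE)))
}
# Minimise the negative log-likelihood with BFGS
optim_fit <- optim(par = rep(0, D), fn = neg_log_lik, H = H, N_i = N_i,
                   m_i = m_i, method = "BFGS")
optim_fit$par  # should be close to true_theta and mle_theta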

3 Bayesian Binomial Probit Regression

Instead of computing just the MLE estimate of the model parameters, we can go further and compute their posterior distribution by taking a Bayesian approach. We can complete the Bayesian model by taking a prior distribution \(\pi(\boldsymbol{\theta})\) over the model parameters. Then the Bayesian BPR model becomes,

\[ \begin{aligned}\label{probit-regression-eq} m_{i} & \sim \mathcal{B}inomial\Big(N_{i}, \Phi(\eta_{i})\Big) \\ \eta_{i} & = \boldsymbol{h}_{i}\boldsymbol{\theta} \\ \boldsymbol{\theta} & \sim \pi(\boldsymbol{\theta}) \end{aligned} \]

where we defined \(\boldsymbol{h}(x_{i}) = \boldsymbol{h}_{i}\) to avoid cluttering the notation.

Then the posterior distribution for the Bayesian BPR model is, \[ \begin{aligned} p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{H}) & \propto \pi(\boldsymbol{\theta}) p(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{H}) = \pi(\boldsymbol{\theta}) \prod\limits_{i=1}^{N} p(y_{i} | \boldsymbol{\theta}, \boldsymbol{h}_{i}) \\ & \propto \pi(\boldsymbol{\theta}) \prod\limits_{i=1}^{N} \Phi\Big(\boldsymbol{h}_{i} \boldsymbol{\theta}\Big)^{m_{i}} \Big(1 - \Phi(\boldsymbol{h}_{i} \boldsymbol{\theta})\Big)^{(N_{i} - m_{i})} \end{aligned} \] where instead of using the data \(\boldsymbol{X}\) in the input space, we use the design matrix \(\boldsymbol{H}\) which is computed in the RBF space.

Performing inference for this model in the Bayesian framework is complicated by the fact that no conjugate prior \(\pi(\boldsymbol{\theta})\) exists for the parameters of the probit regression model. To overcome this problem, Albert and Chib (1993) augmented the original model with an additional auxiliary variable that renders the conditional distributions of the model parameters equivalent to those under a Bayesian normal linear regression model with Gaussian noise, and they derived an efficient Gibbs sampling scheme for computing the posterior statistics.

4 Augmented model

In the BPR model, we can think of each \(m_{i}\) as the total number of successes of \(N_{i}\) independent Bernoulli experiments with outcomes \(y_{i,1}^{*}, ..., y_{i,N_{i}}^{*}\), where now each \(y_{i,n}^{*}\) follows a binary probit regression model, \[ p(y_{i,n}^{*} = 1 | x_{i}, \boldsymbol{\theta}) = \Phi(\boldsymbol{h}(x_{i})\boldsymbol{\theta}) = \Phi(\boldsymbol{h}_{i}\boldsymbol{\theta}) \]

It is simple to reconstruct the binary outcomes \(y_{i,n}^{*}\) from the Binomial observations \((m_{i}, N_{i})\) using the following formula, \[ y_{i,n}^{*} = \begin{cases} 1 & \; \text{if } 1 \leq n \leq m_{i}\\ 0 & \; \text{if } m_{i} < n \leq N_{i}\\ \end{cases} \]
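For example, a single observation with \(N_{i} = 5\) trials and \(m_{i} = 2\) successes is expanded into the binary vector \((1, 1, 0, 0, 0)\); in R (using illustrative values) this is simply:

# Expand one Binomial observation (m successes out of n trials)
# into its implied vector of binary outcomes
m_single <- 2; n_single <- 5
y_star <- c(rep(1, m_single), rep(0, n_single - m_single))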

Hence, we can apply the Gibbs sampling scheme for each of the binary outcomes \(y_{i,1}^{*}, ..., y_{i,N_{i}}^{*}\) using the augmented binary probit regression model proposed by Albert and Chib (1993) (check this post for an implementation of this model), in order to perform inference on the original model.

Let us introduce for each binary observation \(y_{i,n}^{*}\) a latent variable \(z_{i,n}\) that follows a normal distribution. Then the augmented BPR model has the following hierarchical structure,

\[ \begin{aligned} y_{i,n}^{*} & = \begin{cases} 1 & \; \text{if } z_{i,n} > 0 \\ 0 & \; \text{otherwise } \\ \end{cases} \nonumber \\ z_{i,n} & = \boldsymbol{h}_{i} \boldsymbol{\theta} + \epsilon_{i,n} \\ \epsilon_{i,n} & \sim \mathcal{N}(0,1) \nonumber \\ \boldsymbol{\theta} & \sim \pi(\boldsymbol{\theta}) \nonumber \end{aligned} \]

Under independence of \(\epsilon_{i,n}\), the marginal likelihood of the augmented model is the same as the likelihood of the original model,

\[ \begin{aligned} Pr(y_{i,n}^{*} = 1 | x_{i}, \boldsymbol{\theta}) & = Pr(z_{i,n} > 0) \nonumber\\ & = Pr(\boldsymbol{h}_{i} \boldsymbol{\theta} + \epsilon_{i,n} > 0) \nonumber\\ & = Pr(\epsilon_{i,n} > -\boldsymbol{h}_{i} \boldsymbol{\theta}) \\ & = Pr(\epsilon_{i,n} < \boldsymbol{h}_{i} \boldsymbol{\theta}) \quad \quad (\text{by symmetry of the normal distribution}) \nonumber\\ & = \Phi(\boldsymbol{h}_{i} \boldsymbol{\theta}) \nonumber \end{aligned} \]

We are interested in computing the joint posterior distribution of the latent variables \(\boldsymbol{z}\) and the model parameters \(\boldsymbol{\theta}\) given the data \(\boldsymbol{y}\) and \(\boldsymbol{H}\),

\[ \begin{aligned} p(\boldsymbol{z}, \boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{H}) & \propto \pi(\boldsymbol{\theta}) p(\boldsymbol{z}|\boldsymbol{\theta}, \boldsymbol{H}) p(\boldsymbol{y}| \boldsymbol{z}) = \pi(\boldsymbol{\theta}) \prod\limits_{i=1}^{N} \prod\limits_{n=1}^{N_{i}} p(z_{i,n}| \boldsymbol{\theta}, \boldsymbol{h}_{i}) p(y_{i,n}^{*}| z_{i,n}) \end{aligned} \]

where we have that,

\[ \begin{aligned} p(z_{i,n}| \boldsymbol{\theta}, \boldsymbol{h}_{i}) & = \mathcal{N}(z_{i,n} | \boldsymbol{h}_{i} \boldsymbol{\theta}, 1) \nonumber\\ p(y_{i,n}^{*}| z_{i,n}) & = \boldsymbol{1}(y_{i,n}^{*} = 1)\boldsymbol{1}(z_{i,n} > 0) + \boldsymbol{1}(y_{i,n}^{*} = 0)\boldsymbol{1}(z_{i,n} \leq 0) \end{aligned} \] where \(\boldsymbol{1}(\cdot)\) is the indicator function, equal to \(1\) if the condition inside it holds and \(0\) otherwise. Now \(y_{i,n}^{*}\) is deterministic conditional on the sign of the stochastic auxiliary variable \(z_{i,n}\). Therefore, we formulate the original problem as a missing data problem where we have a normal regression model on the latent data \(z_{i,n}\) and the observed responses \(y_{i,n}^{*}\) are incomplete in that we only observe whether \(z_{i,n} > 0\) or \(z_{i,n} \leq 0\).

4.0.1 Gibbs sampling scheme

This joint posterior is difficult to normalize and sample from directly. However, computing the marginal posteriors of \(\boldsymbol{\theta}\) and \(\boldsymbol{z}\) using Gibbs sampling only requires the full conditionals \(p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{H})\) and \(p(\boldsymbol{z} | \boldsymbol{\theta}, \boldsymbol{y}, \boldsymbol{H})\), and these distributions have standard forms.

First, the full conditional of \(\boldsymbol{\theta}\) is given by,

\[ p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{H}) \propto \pi(\boldsymbol{\theta}) \prod\limits_{i=1}^{N} \prod\limits_{n=1}^{N_{i}} \mathcal{N}(z_{i,n} | \boldsymbol{h}_{i} \boldsymbol{\theta}, 1) \]

We can simplify this equation by observing that each \(\boldsymbol{h}_{i}\) is the same for all \(\boldsymbol{z}_{i,\cdot}\)’s. Thus, we can stack the \(z_{i,n}\) into a single long vector \(z_{j}\), with \(j = 1, ..., J\), where \(J = \sum_{i=1}^{N} N_{i}\). In a similar fashion, we stack the \(y_{i,n}^{*}\) into a longer vector \(y_{j}^{*}\). Also, we replicate each row \(\boldsymbol{h}_{i}\) of the design matrix \(N_{i}\) times, leading to an extended design matrix \(\boldsymbol{H}\) of dimension \((J \times D)\).

Thus, we can write the previous quantity as follows, \[ p(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{H}) \propto \pi(\boldsymbol{\theta}) \prod\limits_{j=1}^{J} \mathcal{N}(z_{j} | \boldsymbol{h}_{j} \boldsymbol{\theta}, 1) \]

This quantity is the posterior density for the normal linear regression model, and if we assign the proper conjugate prior on the parameters \(\boldsymbol{\theta}\), i.e. \(\pi(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}_{0}, \boldsymbol{Q}_{0})\), then, \[ \begin{aligned} \boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{H} & \sim \mathcal{N}(\boldsymbol{M}, \boldsymbol{V}) \\ \boldsymbol{M} & = \boldsymbol{V} (\boldsymbol{Q}_{0}^{-1}\boldsymbol{\theta}_{0} + \boldsymbol{H}^{T}\boldsymbol{z}) \\ \boldsymbol{V} & = (\boldsymbol{Q}_{0}^{-1} + \boldsymbol{H}^{T}\boldsymbol{H})^{-1} \end{aligned} \]

Now suppose we knew \(\boldsymbol{\theta}\). Then we could easily draw the latent variables \(z_{j}\) from their distribution conditional on \(\boldsymbol{\theta}\) and \(\boldsymbol{h}_{j}\), which is \(\mathcal{N}(\boldsymbol{h}_{j}\boldsymbol{\theta},1)\). However, if we also condition on the \(y_{j}^{*}\), we need to take into consideration this additional source of information. Heuristically, observing \(y_{j}^{*} = 1\) tells us that \(z_{j} > 0\) and observing \(y_{j}^{*} = 0\) tells that \(z_{j} \leq 0\). Thus, the full conditional posterior of \(z_{j} | \boldsymbol{\theta}, y_{j}^{*}, \boldsymbol{h}_{j}\) is a truncated normal distribution, \[ z_{j} | \boldsymbol{\theta}, y_{j}^{*}, \boldsymbol{h}_{j} \sim \begin{cases} \mathcal{TN}(\boldsymbol{h}_{j}\boldsymbol{\theta}, 1, 0, \infty) & \; \text{if } y_{j}^{*} = 1\\ \mathcal{TN}(\boldsymbol{h}_{j}\boldsymbol{\theta}, 1, -\infty, 0) & \; \text{if } y_{j}^{*} = 0\\ \end{cases} \]

4.1 Gibbs sampling implementation

In each simulation step of the Gibbs algorithm we have to sample \(\boldsymbol{\theta} | \boldsymbol{z}, \boldsymbol{H}\) and \(\boldsymbol{z} | \boldsymbol{\theta}, \boldsymbol{y}, \boldsymbol{H}\) in turn. The following code implements the Gibbs sampling algorithm for the Bayesian BPR model.

# Library for sampling from Multivariate Normal distribution
require(mvtnorm)
# Library for sampling from Truncated Normal distribution
require(truncnorm)

# Number of simulations
N_sim <- 10000
# Burn in period
burn_in <- 5000

# Sum of total trials for each observation i
J <- sum(N_i)

# Create extended vector y of length (J x 1)
y <- vector(mode = "integer")
for (i in 1:N) {
  y <- c(y, rep(1, m_i[i]), rep(0, N_i[i] - m_i[i]))
}

# Variables that we will need later
N1  <- sum(y)  # Number of successes
N0  <- J - N1  # Number of failures

# Create extended design matrix H_ext of dimension (J x D)
H_ext <- H[rep(1:nrow(H), N_i), ]

# Conjugate prior on the coefficients \theta ~ N(theta_0, Q_0)
theta_0 <- rep(0, D)
Q_0 <- diag(10, D)

# Initialize theta
theta <- rep(0, D)
# Initialize latent variable Z, from truncated normal
z <- rep(0, J)
z[y == 0] <- rtruncnorm(N0, mean = 0, sd = 1, a = -Inf, b = 0)
z[y == 1] <- rtruncnorm(N1, mean = 0, sd = 1, a = 0, b = Inf)

# Matrix storing samples of the \theta parameter
gibbs_theta_chain <- matrix(0, nrow = N_sim, ncol = D)
gibbs_theta_chain[1, ] <- theta

# ---------------------------------
# Gibbs sampling algorithm
# ---------------------------------

# Compute posterior variance of theta
prec_0 <- solve(Q_0)
V <- solve(prec_0 + crossprod(H_ext, H_ext))

for (t in 2:N_sim) {
  # Compute posterior mean of theta
  M <- V %*% (prec_0 %*% theta_0 + crossprod(H_ext, z))
  # Draw variable \theta from its full conditional: \theta | z, H
  theta <- c(rmvnorm(1, M, V))
  
  # Update Mean of z
  mu_z <- H_ext %*% theta
  # Draw latent variable z from its full conditional: z | \theta, y, H
  z[y == 0] <- rtruncnorm(N0, mean = mu_z[y == 0], sd = 1, a = -Inf, b = 0)
  z[y == 1] <- rtruncnorm(N1, mean = mu_z[y == 1], sd = 1, a = 0, b = Inf)
  
  # Store the \theta draws
  gibbs_theta_chain[t, ] <- theta
}

# ------------------------------------------
# Gibbs posterior mean estimates of the RBF coefficients
# ------------------------------------------
#  theta_0     theta_1    theta_2    theta_3
#  -0.6189819  0.7308269  1.2051232  -0.7920864
post_theta <- colMeans(gibbs_theta_chain[-(1:burn_in), ])

Having stored the information of each MCMC iteration in the gibbs_theta_chain matrix, we can create some useful plots to observe how the chain evolved over the iterations and what the marginal density of each parameter looks like. In the figure below, the top panel depicts the marginal histogram of the sampled values over all the iterations after the burn-in period, and the bottom panel shows the trace plot of each parameter, i.e. the values the parameters took during the runtime of the MCMC simulation. As we observe, after only a few hundred iterations the MCMC chain converged to the true values of the parameters. The \(red\) line corresponds to the true parameter values, the \(gold\) line to the posterior mean of the Gibbs chain, and the \(green\) line to the MLE estimate.

par(mfrow = c(2,4))
# First parameter
hist(gibbs_theta_chain[-(1:burn_in), 1], breaks = 30, 
     main = "Posterior of theta_1", 
     xlab = "True value = red line", col = "cornflowerblue")
abline(v = post_theta[1], col = "goldenrod2", lwd = 3)
abline(v = true_theta[1], col = "red2", lwd = 3)
abline(v = mle_theta[1], col = "darkgreen", lwd = 3)
# Second parameter
hist(gibbs_theta_chain[-(1:burn_in), 2], breaks = 30, 
     main = "Posterior of theta_2", 
     xlab = "True value = red line", col = "cornflowerblue")
abline(v = post_theta[2], col = "goldenrod2", lwd = 3)
abline(v = true_theta[2], col = "red2", lwd = 3)
abline(v = mle_theta[2], col = "darkgreen", lwd = 3)
# Third parameter
hist(gibbs_theta_chain[-(1:burn_in), 3], breaks = 30, 
     main = "Posterior of theta_3", 
     xlab = "True value = red line", col = "cornflowerblue")
abline(v = post_theta[3], col = "goldenrod2", lwd = 3)
abline(v = true_theta[3], col = "red2", lwd = 3)
abline(v = mle_theta[3], col = "darkgreen", lwd = 3)
# Fourth parameter
hist(gibbs_theta_chain[-(1:burn_in), 4], breaks = 30, 
     main = "Posterior of theta_4", 
     xlab = "True value = red line", col = "cornflowerblue")
abline(v = post_theta[4], col = "goldenrod2", lwd = 3)
abline(v = true_theta[4], col = "red2", lwd = 3)
abline(v = mle_theta[4], col = "darkgreen", lwd = 3)
# Add Legend
legend("topright", c("True", "MLE", "Gibbs"), lty = 1, lwd = 2,
       col = c("red2", "darkgreen", "goldenrod2"), bty = 'n', cex = .95)

# Create trace plots
# First parameter
plot(gibbs_theta_chain[, 1], type = "l", xlab = "True value = red line", 
     main = "Chain values of theta_1")
abline(h = true_theta[1], col = "red2", lwd = 2)
lines(cumsum(gibbs_theta_chain[, 1]) / (1:N_sim), col = "goldenrod2", lwd = 2)
# Second parameter
plot(gibbs_theta_chain[, 2], type = "l", xlab = "True value = red line", 
     main = "Chain values of theta_2")
abline(h = true_theta[2], col = "red2", lwd = 2)
lines(cumsum(gibbs_theta_chain[, 2]) / (1:N_sim), col = "goldenrod2", lwd = 2)
# Third parameter
plot(gibbs_theta_chain[, 3], type = "l", xlab = "True value = red line", 
     main = "Chain values of theta_3")
abline(h = true_theta[3], col = "red2", lwd = 2)
lines(cumsum(gibbs_theta_chain[, 3]) / (1:N_sim), col = "goldenrod2", lwd = 2)
# Fourth parameter
plot(gibbs_theta_chain[, 4], type = "l", xlab = "True value = red line", 
     main = "Chain values of theta_4")
abline(h = true_theta[4], col = "red2", lwd = 2)
lines(cumsum(gibbs_theta_chain[, 4]) / (1:N_sim), col = "goldenrod2", lwd = 2)

5 Compute posterior distribution of the original model

The advantage of the data augmentation approach is that the dependencies in the posterior are transferred to a multivariate Gaussian distribution, where they are easy to handle. However, a potential bottleneck when performing inference using the augmented BPR model is the introduction of \(\sum_{i = 1}^{N} N_{i}\) latent variables \(z_{i,n}\). In other words, the number of latent variables scales with the total number of Bernoulli observations (one Binomial observation corresponds to \(N_{i}\) Bernoulli observations); in our synthetic example this already amounts to roughly \(12{,}800\) latent variables for only \(400\) Binomial observations. Hence, the computational time becomes prohibitive for analysing large datasets, such as High-Throughput Sequencing (HTS) data coming from biological experiments.

We can take another direction and compute the posterior distribution of the parameters \(\boldsymbol{\theta}\) using the original formulation of the BPR model by applying the Metropolis-Hastings (MH) algorithm.

The likelihood for the BPR model is the following, \[ p(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{H}) = \prod\limits_{i = 1}^{N} \mathcal{B}inom\Big(m_{i} | N_{i}, \Phi\big(g(x_{i};\boldsymbol{\theta})\big)\Big) \]

We also specify a prior distribution for each parameter \(\theta_{d}\), where \(d = 1, ..., D\) indexes the RBF coefficients, \[ \pi(\theta_{d}) = \mathcal{N}(\theta_{d} | \mu_{\theta_{0}}, \sigma_{\theta_{0}}) \] where \(\mu_{\theta_{0}}\) and \(\sigma_{\theta_{0}}\) are constants.

Hence, the target posterior distribution that the MH algorithm will eventually sample from becomes, \[ \begin{aligned} p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{H}) & \propto p(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{H}) \pi(\boldsymbol{\theta}) \\ & = \prod\limits_{i = 1}^{N} \mathcal{B}inom\Big(m_{i} | N_{i}, \Phi\big(g(x_{i};\boldsymbol{\theta})\big)\Big) \prod\limits_{d=1}^{D} \mathcal{N}(\theta_{d} | \mu_{\theta_{0}}, \sigma_{\theta_{0}}) \end{aligned} \]

To build the MH algorithm that simulates samples from the target distribution \(p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{H})\), the proposal distribution \(q(\boldsymbol{\theta}' | \boldsymbol{\theta})\) will proceed by jointly updating the parameters \(\boldsymbol{\theta}\). The probability density that we choose for the proposal distribution is a multivariate normal distribution centred on the current value of the parameter vector \(\boldsymbol{\theta}\),

\[ \boldsymbol{\theta}' \sim \mathcal{N}(\boldsymbol{\theta}, \boldsymbol{\Sigma}_{0}) \] where \(\boldsymbol{\Sigma}_{0}\) is a fixed covariance matrix, which can be tuned in order to change the shape of the proposal distribution, and hence change the acceptance rate of the MH algorithm.

An MH step with target distribution \(p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{H})\) and proposal distribution \(q(\boldsymbol{\theta}' | \boldsymbol{\theta})\) involves sampling a candidate value \(\boldsymbol{\theta}'\) given the current value \(\boldsymbol{\theta}\) according to \(q(\boldsymbol{\theta}' | \boldsymbol{\theta})\). The Markov chain then moves to \(\boldsymbol{\theta}'\) with acceptance probability, \[ A(\boldsymbol{\theta}, \boldsymbol{\theta}') = min\Bigg\lbrace 1, \frac{p(\boldsymbol{\theta}' | \boldsymbol{y}, \boldsymbol{H}) }{p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{H})} \frac{q(\boldsymbol{\theta} | \boldsymbol{\theta}')}{q(\boldsymbol{\theta}' | \boldsymbol{\theta})}\Bigg\rbrace \] otherwise it remains at the current state \(\boldsymbol{\theta}\).

Since the Normal distribution is symmetric, the acceptance probability is simplified into, \[ A(\boldsymbol{\theta}, \boldsymbol{\theta}') = min\Bigg\lbrace 1, \frac{p(\boldsymbol{\theta}' | \boldsymbol{y}, \boldsymbol{H}) }{p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{H})} \Bigg\rbrace \] which is the \(Metropolis\) algorithm.

It should be noted that the normalising constant of the target distribution is not required, and we only need to know the target distribution up to a constant of proportionality since in the acceptance probability we compute the ratio of the target distribution (check this paper for an introduction to MCMC for Machine Learning).

5.1 Metropolis-Hastings implementation

The initial step is to write down the posterior distribution in order to evaluate it in each simulation step of the MH algorithm. When computations involve the product of many small probabilities, it is advisable to work with \(logarithms\) to avoid numerical underflow. Thus, the unnormalized log posterior is computed as \(\log(posterior) = \log(likelihood) + \log(prior)\), up to an additive constant.

# BPR log likelihood function
likelihood <- function(theta, H, N_i, m_i){
  # Evaluate probit transformed function
  Phi <- pnorm(H %*% theta)
  # In extreme cases where probit is 0 or 1, subtract a tiny number due to
  # possible numerical issues
  Phi[which(Phi > (1 - 1e-289))] <- 1 - 1e-289
  Phi[which(Phi < 1e-289)] <- 1e-289
  
  # Compute the actual log likelihood and take the sum
  res <- sum(dbinom(x = m_i, size = N_i, prob = Phi, log = TRUE)) 
  return(res)
}

# Compute prior distribution
prior <- function(theta, mu_theta_0 = 0, s_theta_0 = 5){
  # Compute prior for each \theta_i and take the sum
  return(sum(dnorm(theta, mean = mu_theta_0, sd = s_theta_0, log = T)))
}

# Compute unnormalized posterior distribution
posterior <- function(theta, H, N_i, m_i, mu_theta_0 = 0, s_theta_0 = 5){
  return(likelihood(theta, H, N_i, m_i) + prior(theta, mu_theta_0, s_theta_0))
}

So now we can proceed and implement the actual Metropolis-Hastings algorithm. The steps for the MH algorithm are simple,

  1. Start at a random parameter value \(\boldsymbol{\theta}\), with the constraint that the target distribution has some probability mass around that region.
  2. Choose a new parameter value \(\boldsymbol{\theta}'\) close to the old value by sampling from the proposal distribution \(q(\boldsymbol{\theta}' | \boldsymbol{\theta})\), which in our example is a \(\mathcal{N}(\boldsymbol{\theta}, \boldsymbol{\Sigma}_{0})\).
  3. Jump to this new point with acceptance probability \(min\Big\lbrace 1, \frac{p(\boldsymbol{\theta}' | \boldsymbol{y}, \boldsymbol{H}) }{p(\boldsymbol{\theta} | \boldsymbol{y}, \boldsymbol{H})}\Big\rbrace\), where \(p(\cdot)\) is the target function, i.e. the unnormalized posterior distribution; otherwise stay at the current value \(\boldsymbol{\theta}\).
# Sample from the proposal distribution, i.e. N(theta, S_0)
proposal <- function(theta, Sigma_0){
  return(rmvnorm(1, mean = theta, sigma = Sigma_0))
}

# Metropolis algorithm implementation
metropolis <- function(theta = NULL, H, N_i, m_i, Sigma_0 = NULL, 
                       mu_theta_0 = 0, s_theta_0 = 3, N_sim = 10000){
  
  # Number of parameters
  D <- NCOL(H)
  # Initialize regression coefficients
  if (is.null(theta)) { theta <- rep(0, D) }
  if (is.null(Sigma_0)) { Sigma_0 <- solve(diag(10, D)) }
  
  # Matrix storing samples of the \theta parameter
  theta_chain <- matrix(0, nrow = N_sim, ncol = D)
  theta_chain[1, ] <- theta
  
  # Start MH algorithm
  for (t in 2:N_sim) {
    # Draw a new point from the proposal distribution
    theta_chain[t, ] <- proposal(theta_chain[t - 1, ], Sigma_0)
    
    # Compute the acceptance probability in log space
    log_acc_prob <- posterior(theta_chain[t, ], H, N_i, m_i, 
                              mu_theta_0, s_theta_0) - 
                    posterior(theta_chain[t - 1, ], H, N_i, m_i, 
                              mu_theta_0, s_theta_0)
    
    # Sample from uniform and accept or reject the proposed point
    if (log(runif(1)) > log_acc_prob) {
      theta_chain[t, ] <- theta_chain[t - 1, ]
    }
  }
  return(theta_chain)
}

We can now run the MH algorithm to obtain the posterior distribution of the \(\boldsymbol{\theta}\) parameters.

# Initial parameters \theta
theta <- rep(0, NCOL(H))
# Covariance matrix Sigma_0
Sigma_0 <- solve(diag(20, NCOL(H)) + crossprod(H, H))

# Run Metropolis algorithm
mh_theta_chain <- metropolis(theta = theta, H = H, N_i = N_i, m_i = m_i, 
                             Sigma_0 = Sigma_0, N_sim = N_sim)

# ------------------------------------------
# MH posterior mean estimates of the RBF coefficients
# ------------------------------------------
#  theta_0     theta_1    theta_2    theta_3
#  -0.6204764  0.7300933  1.2066415 -0.7922801
post_mh_theta <- colMeans(mh_theta_chain[-(1:burn_in), ])

Having stored the information of each MCMC iteration in the mh_theta_chain matrix, we can create the same plots as in the previous section to observe how the chain evolved over the iterations and what the marginal density of each parameter looks like. Again, the \(red\) line corresponds to the true parameter values, the \(gold\) line to the posterior mean of the Metropolis-Hastings chain, and the \(green\) line to the MLE estimate.
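For brevity the full plotting code is not repeated; a minimal sketch producing the corresponding trace plots for the MH chain (same colour conventions as above) is:

# Trace plots of the MH chain, analogous to the Gibbs trace plots above
par(mfrow = c(2, 2))
for (d in 1:D) {
  plot(mh_theta_chain[, d], type = "l", xlab = "True value = red line",
       main = paste0("Chain values of theta_", d))
  abline(h = true_theta[d], col = "red2", lwd = 2)
  # Running mean of the chain (gold line)
  lines(cumsum(mh_theta_chain[, d]) / (1:N_sim), col = "goldenrod2", lwd = 2)
}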

6 MCMC chain mixing

If we compare the trace plots from the augmentation approach using Gibbs sampling and the MH algorithm on the original model, we observe that the augmented model gives much better mixing of the chain, hence we can estimate our target distribution much more accurately and with fewer samples. However, the draws from the MH algorithm have a low acceptance rate (i.e. we did not accept many samples from the proposal distribution), resulting in rather poor performance of the MH algorithm in computing the posterior distribution. Thus, we need to tune the proposal distribution to achieve moderate acceptance rates during the MH algorithm; or we could work with more complex MCMC methods, such as adaptive MCMC, Hamiltonian Monte Carlo (HMC), etc., in order to explore the parameter space of the posterior distribution adequately.
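A quick way to quantify this is to compute the empirical acceptance rate of the MH chain, counting an iteration as accepted whenever the chain moved (a rough sketch; with a continuous proposal, identical consecutive draws essentially always correspond to rejections).

# Fraction of iterations in which the MH chain moved to a new point
acc_rate <- mean(rowSums(abs(diff(mh_theta_chain))) > 0)
acc_rate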

Let us compute the Effective Sample Size (\(ESS\)) of the chains for the augmentation approach using Gibbs sampling and for the MH algorithm on the original model. To do this we will use the coda package. A slightly more detailed explanation of the ESS, and of convergence diagnostics in MCMC in general, can be found in the last section of my previous post about the Bayesian binary probit model.

# Coda library
require(coda)

# ---------------------------
# Get ESS for augmentation approach
# ---------------------------
#  theta_1   theta_2   theta_3   theta_4
#  2006.911  1986.837  1965.318  2005.140 
ess_gibbs <- effectiveSize(mcmc(gibbs_theta_chain[-(1:burn_in), ]))

# ---------------------------
# Get ESS for MH algorithm on original model
# ---------------------------
#  theta_1   theta_2   theta_3   theta_4
#  38.33483  35.39863  57.49000  34.12458 
ess_mh <- effectiveSize(mcmc(mh_theta_chain[-(1:burn_in), ]))

7 Conclusions

This tutorial-like document showed how we can perform Bayesian Binomial Probit Regression (BPR) using the data augmentation approach and using the Metropolis-Hastings algorithm with a multivariate Gaussian proposal distribution.

If you found this document useful, check my homepage at the University of Edinburgh for links to other tutorials.