1 Linear Regression

Linear regression assumes \[ y_{i} = \mathbf{w}^{T}\mathbf{x}_{i} + \epsilon_{i} \] where the response \(y_{i}\) is a linear function of the covariate \(\mathbf{x}_{i} \in \mathbb{R}^{D}\) and is also linear in the parameters \(\mathbf{w}\). Collecting the \(I\) response variables into \(\mathbf{y} \in \mathbb{R}^{I}\) we have \[ \mathbf{y} = \mathbf{X}\mathbf{w} + \pmb{\epsilon} \] where \(\mathbf{X}\in \mathbb{R}^{I\times D}\) is referred to as the design matrix. Given a training set \(\mathcal{D} = \{\mathbf{X}, \mathbf{y}\}\), the goal is to estimate \(\mathbf{w}\) so that the response \(y_{*}\) to a new data point \(\mathbf{x}_{*}\) can be predicted, i.e., \(\mathbb{E}[y_{*} | \mathbf{x}_{*}] = \mathbf{w}^{T}\mathbf{x}_{*}\).
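
As a minimal illustration of the model in matrix form and of the point prediction \(\mathbb{E}[y_{*} | \mathbf{x}_{*}] = \mathbf{w}^{T}\mathbf{x}_{*}\), the sketch below writes it directly in R; all values are arbitrary, and the full worked examples follow in Section 4.

# Minimal sketch of y = Xw + eps with arbitrary illustrative values
set.seed(1)
I <- 5                            # Number of observations
X <- cbind(1, rnorm(I))           # Design matrix with an intercept column
w <- c(-0.5, 2)                   # "True" weights, chosen arbitrarily
y <- X %*% w + rnorm(I)           # Responses with additive Gaussian noise
x_star <- c(1, 0.3)               # A new data point
c(crossprod(w, x_star))           # Point prediction E[y* | x*] = w'x*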

2 Bayesian Linear Regression

The likelihood function for \(\mathbf{w}\) and the prior over \(\mathbf{w}\) are given by \[ \begin{aligned} p(\mathbf{y} | \mathbf{X}, \mathbf{w}) & = \prod_{i=1}^{I} \mathcal{N}(y_{i} | \mathbf{w}^{T}\mathbf{x}_{i}, \lambda^{-1}) \\ p(\mathbf{w} | \tau) &= \mathcal{N}(\mathbf{w} | \mathbf{m}_{0}, \tau^{-1} \mathbf{I}) \end{aligned} \] where \(\lambda\) is the noise precision parameter, assumed to be known for simplicity, although the framework is easily extended to include a distribution over \(\lambda\). A typical choice for the conjugate prior mean parameter is \(\mathbf{m}_{0} = \mathbf{0}\). We also introduce a prior distribution for the (hyper)-parameter \(\tau\), which is the precision of the Gaussian prior over the weights \[ p(\tau) = \mathcal{G}\text{amma}(\tau | \alpha_{0}, \beta_{0}) \] Thus, the joint distribution over all the variables factorizes as \[ p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X}) = p(\mathbf{y} | \mathbf{X}, \mathbf{w})\;p(\mathbf{w} | \tau)\;p(\tau) \] The probabilistic graphical model of the Bayesian linear regression model is shown below.
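
To make this factorization concrete, the short sketch below draws a single dataset from the joint distribution by ancestral sampling, first \(\tau\), then \(\mathbf{w}\), then \(\mathbf{y}\). This is purely illustrative; the hyper-parameter values and design matrix are arbitrary and are not used in the examples later in the document.

# Ancestral sampling from p(y, w, tau | X) = p(y | X, w) p(w | tau) p(tau)
set.seed(1)
I <- 20; D <- 2                  # Number of observations and weights
alpha_0 <- 2; beta_0 <- 1        # Gamma hyper-parameters (arbitrary)
lambda  <- 4                     # Known noise precision (arbitrary)
X   <- cbind(1, rnorm(I))                              # Design matrix
tau <- rgamma(1, shape = alpha_0, rate = beta_0)       # tau ~ Gamma(alpha_0, beta_0)
w   <- rnorm(D, mean = 0, sd = sqrt(1 / tau))          # w ~ N(0, tau^{-1} I)
y   <- rnorm(I, mean = X %*% w, sd = sqrt(1 / lambda)) # y_i ~ N(w'x_i, lambda^{-1})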

2.1 Bayesian model selection

  • Avoids the overfitting associated with maximum likelihood (ML) estimation by marginalizing over the model parameters instead of making point estimates of their values.
  • Models can be compared directly on the training data, without the need for a validation set.
  • Avoids the multiple training runs for each model associated with cross-validation.

Suppose that we wish to compare a set of \(N\) models, \(\{\mathcal{M}_{n}\}\) for \(n = 1, \dots, N\). A model here refers to a probability distribution over the observed data \(\mathcal{D}\). Given a training set \(\mathcal{D}\), we then wish to evaluate the posterior distribution \[ p(\mathcal{M}_{n} | \mathcal{D}) \propto \underbrace{p(\mathcal{M}_{n})}_{\text{prior}}\;\underbrace{p(\mathcal{D} | \mathcal{M}_{n})}_{\text{evidence}} \] where \(p(\mathcal{M}_{n})\) is a prior probability distribution expressing our uncertainty: the data are generated from one of these models but we are uncertain which one. The model evidence \(p(\mathcal{D} | \mathcal{M}_{n})\) expresses the preference shown by the data for different models. It is also known as the marginal likelihood, since it can be viewed as a likelihood function over the space of models, in which the parameters have been marginalized out.

The Bayes factor is the ratio of model evidences for two different models \(\mathcal{M}_{n}\) and \(\mathcal{M}_{j}\) \[ BF = \frac{p(\mathcal{D} | \mathcal{M}_{n})}{p(\mathcal{D} | \mathcal{M}_{j})} \] When we perform model selection, we choose the single most probable model. For a model governed by a set of parameters \(\mathbf{w}\), the model evidence is given by \[ p(\mathcal{D} | \mathcal{M}_{n}) = \int \underbrace{p(\mathcal{D} | \mathbf{w}, \mathcal{M}_{n})}_{\text{likelihood}} \;\underbrace{p(\mathbf{w} | \mathcal{M}_{n})}_{\text{prior}}\;d\mathbf{w} \]
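
As a toy numerical example, given log evidences for a set of candidate models we can compute Bayes factors and posterior model probabilities; the values below are made up purely for illustration, a uniform prior \(p(\mathcal{M}_{n}) = 1/N\) is assumed, and the log-sum-exp trick is used for numerical stability.

# Hypothetical log evidences ln p(D | M_n) for three candidate models
log_evidence <- c(-105.2, -101.7, -103.4)
# Bayes factor comparing model 2 against model 1
exp(log_evidence[2] - log_evidence[1])
# Posterior model probabilities under a uniform prior, via log-sum-exp
lse <- max(log_evidence) + log(sum(exp(log_evidence - max(log_evidence))))
round(exp(log_evidence - lse), 3)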

3 Variational Linear Regression

Our first goal is to find an approximation to the posterior distribution \(p(\mathbf{w}, \tau | \mathbf{X}, \mathbf{y})\). To do this, we use a variational posterior distribution given by the factorised expression \[ q(\mathbf{w}, \tau) = q(\mathbf{w}) q(\tau) \approx p(\mathbf{w}, \tau | \mathbf{X}, \mathbf{y}) \] We compute the optimal variational distributions \(q^{*}(\mathbf{w})\) and \(q^{*}(\tau)\) using the following general expression, which is the basis of mean-field variational inference \[ \text{ln}\;q_{j}^{*}(z_{j}) = \mathbb{E}_{i \neq j}[\text{ln}\;p(\mathbf{X}, \mathbf{z})] + \text{const} \] That is, the log of the optimal solution for factor \(q_{j}\) is obtained simply by considering the log of the joint distribution over all hidden and observed variables and then taking the expectation w.r.t. all of the other factors \(\{q_{i}\}\) for \(i \neq j\). In our example we have \[ \begin{aligned} \text{ln}\;q^{*}(\mathbf{w}) & = \mathbb{E}_{\tau}[\text{ln}\;p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})] + \text{const} \\ \text{ln}\;q^{*}(\tau) & = \mathbb{E}_{\mathbf{w}}[\text{ln}\;p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})] + \text{const} \end{aligned} \] We consider first the distribution over \(\tau\) \[ \begin{aligned} \text{ln}\;q^{*}(\tau) & = \mathbb{E}_{\mathbf{w}}[\text{ln}\;p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})] + \text{const} \\ & = \mathbb{E}_{\mathbf{w}}[\text{ln}\;p(\mathbf{y} | \mathbf{X}, \mathbf{w}) p(\mathbf{w} | \tau) p(\tau)] + \text{const} \\ & = \mathbb{E}_{\mathbf{w}}[\text{ln}\;p(\mathbf{y} | \mathbf{X}, \mathbf{w})] + \mathbb{E}_{\mathbf{w}}[\text{ln}\;p(\mathbf{w} | \tau)] + \mathbb{E}_{\mathbf{w}}[\text{ln}\;p(\tau)] + \text{const} \\ & = \text{ln}\;p(\tau) + \mathbb{E}_{\mathbf{w}}[\text{ln}\;p(\mathbf{w} | \tau)] + \text{const} \\ & = \underbrace{(\alpha_{0} - 1)\text{ln}\;\tau - \beta_{0}\tau}_{\text{Gamma PDF}} + \underbrace{\frac{D}{2}\text{ln}\;\tau - \frac{\tau}{2}\mathbb{E}_{\mathbf{w}}[\mathbf{w}^{T}\mathbf{w}]}_{\text{Gaussian PDF}} \\ & = \underbrace{(\alpha_{0} + \frac{D}{2} - 1)}_{\alpha\;\text{parameter}}\text{ln}\;\tau - \Big(\underbrace{\beta_{0} + \frac{1}{2}\mathbb{E}_{\mathbf{w}}[\mathbf{w}^{T}\mathbf{w}]}_{\beta\;\text{parameter}}\Big)\tau \end{aligned} \] which is the log of a Gamma distribution, leading to \[ \begin{aligned} q^{*}(\tau) & = \mathcal{G}\text{amma}(\tau | \alpha, \beta) \\ \alpha & = \alpha_{0} + \frac{D}{2} \\ \beta & = \beta_{0} + \frac{1}{2}\mathbb{E}_{\mathbf{w}}[\mathbf{w}^{T}\mathbf{w}] \end{aligned} \] Similarly, we can find the variational re-estimation equation for the posterior distribution over \(\mathbf{w}\) as follows \[ \begin{aligned} \text{ln}\;q^{*}(\mathbf{w}) & = \mathbb{E}_{\tau}[\text{ln}\;p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})] + \text{const} \\ & = \mathbb{E}_{\tau}[\text{ln}\;p(\mathbf{y} | \mathbf{X}, \mathbf{w}) p(\mathbf{w} | \tau) p(\tau)] + \text{const} \\ & = \mathbb{E}_{\tau}[\text{ln}\;p(\mathbf{y} | \mathbf{X}, \mathbf{w})] + \mathbb{E}_{\tau}[\text{ln}\;p(\mathbf{w} | \tau)] + \mathbb{E}_{\tau}[\text{ln}\;p(\tau)] + \text{const} \\ & = \text{ln}\;p(\mathbf{y} | \mathbf{X}, \mathbf{w}) + \mathbb{E}_{\tau}[\text{ln}\;p(\mathbf{w} | \tau)] + \text{const} \\ & = -\frac{\lambda}{2}\sum_{i = 1}^{I}(y_{i} - \mathbf{w}^{T}\mathbf{x}_{i})^{2} - \frac{1}{2}\mathbb{E}_{\tau}[\tau]\mathbf{w}^{T}\mathbf{w} + \text{const} \\ & = -\frac{\lambda}{2}(\mathbf{y} - \mathbf{X}\mathbf{w})^{T}(\mathbf{y} - \mathbf{X}\mathbf{w}) - \frac{1}{2}\mathbb{E}_{\tau}[\tau]\mathbf{w}^{T}\mathbf{w} + \text{const} \\ & = -\frac{1}{2}\mathbf{w}^{T}(\mathbb{E}_{\tau}[\tau]\mathbf{I} + \lambda \mathbf{X}^{T}\mathbf{X})\mathbf{w} + \lambda\mathbf{w}^{T}\mathbf{X}^{T}\mathbf{y} + \text{const} \\ \end{aligned} \] Because this is a quadratic form in \(\mathbf{w}\), the distribution \(q^{*}(\mathbf{w})\) is Gaussian, and so we can complete the square in the usual way to identify the mean and covariance, giving \[ \begin{aligned} q^{*}(\mathbf{w}) & = \mathcal{N}(\mathbf{w} | \mathbf{m}, \mathbf{S}) \\ \mathbf{m} & = \lambda\mathbf{S}\mathbf{X}^{T}\mathbf{y} \\ \mathbf{S} & = \big(\mathbb{E}_{\tau}[\tau]\mathbf{I} + \lambda \mathbf{X}^{T}\mathbf{X}\big)^{-1} \end{aligned} \] Since \(\tau\) follows a Gamma distribution, its expected value is given by \[ \mathbb{E}_{\tau}[\tau] = \frac{\alpha}{\beta} \] and from standard results (see Appendix B of Bishop's book) we get \[ \mathbb{E}_{\mathbf{w}}[\mathbf{w}^{T}\mathbf{w}] = \mathbf{m}^{T}\mathbf{m} + tr(\mathbf{S}) \] The evaluation of the variational posterior distribution proceeds by initializing the parameters of one of the distributions \(q(\mathbf{w})\) or \(q(\tau)\), and then alternately re-estimating these factors in turn until a suitable convergence criterion is satisfied, e.g. a negligible change in the lower bound.
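
As a quick sanity check of the two expectations used in these updates, the following Monte Carlo sketch compares \(\mathbb{E}_{\tau}[\tau]\) and \(\mathbb{E}_{\mathbf{w}}[\mathbf{w}^{T}\mathbf{w}]\) against their analytic values; the parameter values are arbitrary and the MASS package is assumed to be available for multivariate Gaussian sampling.

# Monte Carlo check of E[tau] = alpha/beta and E[w'w] = m'm + tr(S)
set.seed(1)
alpha <- 3; beta <- 2                             # Arbitrary Gamma parameters
mean(rgamma(1e5, shape = alpha, rate = beta))     # Monte Carlo estimate of E[tau]
alpha / beta                                      # Analytic value
m <- c(1, -2)                                     # Arbitrary Gaussian mean
S <- matrix(c(0.5, 0.2, 0.2, 0.3), nrow = 2)      # Arbitrary covariance matrix
w_s <- MASS::mvrnorm(n = 1e5, mu = m, Sigma = S)  # Draws w ~ N(m, S)
mean(rowSums(w_s^2))                              # Monte Carlo estimate of E[w'w]
c(crossprod(m)) + sum(diag(S))                    # Analytic m'm + tr(S)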

3.1 Predictive distribution

The predictive distribution over \(y_{*}\), given a new input \(\mathbf{x}_{*}\), is evaluated using the Gaussian variational posterior for the parameters \[ \begin{aligned} p(y_{*} | \mathbf{x}_{*}, \mathbf{X}, \mathbf{y}) & = \int p(y_{*}, \mathbf{w} | \mathbf{x}_{*}, \mathbf{X}, \mathbf{y})\;d\mathbf{w} \\ & = \int p(y_{*} | \mathbf{w} , \mathbf{x}_{*}) p(\mathbf{w} | \mathbf{X}, \mathbf{y})\;d\mathbf{w} \\ & \approx \int p(y_{*} | \mathbf{w} , \mathbf{x}_{*}) q(\mathbf{w})\;d\mathbf{w} \\ & = \int \mathcal{N}(y_{*} | \mathbf{w}^{T}\mathbf{x}_{*}, \lambda^{-1}) \mathcal{N}(\mathbf{w} | \mathbf{m}, \mathbf{S})\;d\mathbf{w} \\ & = \mathcal{N}(y_{*} | \mathbf{m}^{T}\mathbf{x}_{*}, \lambda^{-1} + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*}) \\ \end{aligned} \] The last equality was derived using the following Gaussian identity. Given a marginal Gaussian distribution for \(\mathbf{x}\) and a conditional Gaussian distribution for \(\mathbf{y}\) given \(\mathbf{x}\) of the form \[ \begin{aligned} p(\mathbf{x}) & = \mathcal{N}(\mathbf{x} | \pmb{\mu}, \pmb{\Lambda}^{-1}) \\ p(\mathbf{y} | \mathbf{x}) & = \mathcal{N}(\mathbf{y} | \mathbf{Ax} + \mathbf{b}, \mathbf{L}^{-1}) \end{aligned} \] the marginal distribution of \(\mathbf{y}\) and the conditional distribution of \(\mathbf{x}\) given \(\mathbf{y}\) are given by \[ \begin{aligned} p(\mathbf{y}) & = \mathcal{N}\Big(\mathbf{y} | \mathbf{A}\pmb{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A} \pmb{\Lambda}^{-1}\mathbf{A}^{T}\Big) \\ p(\mathbf{x} | \mathbf{y}) & = \mathcal{N}\Big(\mathbf{x} | \pmb{\Sigma}\big(\mathbf{A}^{T}\mathbf{L}(\mathbf{y} - \mathbf{b}) + \pmb{\Lambda\mu}\big), \pmb{\Sigma}\Big) \\ \pmb{\Sigma} & = (\pmb{\Lambda} + \mathbf{A}^{T}\mathbf{LA})^{-1} \end{aligned} \]
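
The closed-form predictive moments can also be verified by simulation. The sketch below samples \(\mathbf{w} \sim q(\mathbf{w})\) and then \(y_{*} \sim \mathcal{N}(\mathbf{w}^{T}\mathbf{x}_{*}, \lambda^{-1})\), and compares the resulting mean and variance with \(\mathbf{m}^{T}\mathbf{x}_{*}\) and \(\lambda^{-1} + \mathbf{x}_{*}^{T}\mathbf{S}\mathbf{x}_{*}\); all values are arbitrary and MASS is again assumed to be available.

# Monte Carlo check of the predictive mean and variance
set.seed(1)
lambda <- 2                                       # Noise precision
m <- c(0.5, 1.5)                                  # Variational posterior mean
S <- matrix(c(0.2, 0.05, 0.05, 0.1), nrow = 2)    # Variational posterior covariance
x_star <- c(1, -0.7)                              # New input
w_s <- MASS::mvrnorm(n = 1e5, mu = m, Sigma = S)  # w ~ q(w) = N(m, S)
y_s <- rnorm(1e5, mean = w_s %*% x_star, sd = sqrt(1 / lambda))
c(mean(y_s), var(y_s))                            # Monte Carlo moments
c(sum(m * x_star), 1 / lambda + c(x_star %*% S %*% x_star))  # Closed-form moments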

3.2 Lower bound

Another quantity of interest is the lower bound \(\mathcal{L}\). Consider \[ \begin{aligned} \text{ln}\;p(\mathbf{y} | \mathbf{X}) & = \text{ln}\int \int p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})\;d\mathbf{w}\;d\tau \\ & = \text{ln}\int \int q(\mathbf{w}, \tau) \frac{p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})}{q(\mathbf{w}, \tau)}\;d\mathbf{w}\;d\tau \\ & \geq \int \int q(\mathbf{w}, \tau) \text{ln}\Bigg[ \frac{p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})}{q(\mathbf{w}, \tau)}\Bigg]\;d\mathbf{w}\;d\tau \\ & = \mathcal{L}(q) \end{aligned} \] where the inequality follows from Jensen's inequality. The lower bound \(\mathcal{L}(q)\) is then given by \[ \begin{aligned} \mathcal{L}(q) = & \int \int q(\mathbf{w}, \tau) \; \text{ln}\; p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})\;d\mathbf{w}\;d\tau - \int \int q(\mathbf{w}, \tau) \; \text{ln}\; q(\mathbf{w}, \tau)\;d\mathbf{w}\;d\tau \\ = & \;\mathbb{E}_{\mathbf{w}, \tau}\Big[\text{ln}\; p(\mathbf{y}, \mathbf{w}, \tau | \mathbf{X})\Big] - \mathbb{E}_{\mathbf{w}, \tau}\Big[\text{ln}\; q(\mathbf{w}, \tau)\Big] \\ = & \; \mathbb{E}_{\mathbf{w}}\Big[\text{ln}\; p(\mathbf{y} | \mathbf{X}, \mathbf{w})\Big] + \mathbb{E}_{\mathbf{w}, \tau}\Big[\text{ln}\; p(\mathbf{w} | \tau)\Big] + \mathbb{E}_{\tau}\Big[\text{ln}\; p(\tau)\Big] - \mathbb{E}_{\mathbf{w}}\Big[\text{ln}\; q(\mathbf{w})\Big] - \mathbb{E}_{\tau}\Big[\text{ln}\; q(\tau)\Big]\\ \end{aligned} \] Using standard results, the evaluation of each term is straightforward and gives \[ \begin{aligned} \mathbb{E}_{\mathbf{w}}\Big[\text{ln}\; p(\mathbf{y} | \mathbf{X}, \mathbf{w})\Big] & = \frac{I}{2} \text{ln}\;\Big(\frac{\lambda}{2\pi}\Big) - \frac{\lambda}{2}\mathbf{y}^{T}\mathbf{y} + \lambda\mathbf{m}^{T}\mathbf{X}^{T}\mathbf{y} - \frac{\lambda}{2}tr\Big(\mathbf{X}^{T}\mathbf{X}(\mathbf{m}\mathbf{m}^{T} + \mathbf{S})\Big) \\ \mathbb{E}_{\mathbf{w}, \tau}\Big[\text{ln}\; p(\mathbf{w} | \tau)\Big] & = -\frac{D}{2}\text{ln}\;2\pi + \frac{D}{2}\Big(\psi(\alpha) - \text{ln}\;\beta\Big) - \frac{\alpha}{2\beta}\Big(\mathbf{m}^{T}\mathbf{m} + tr(\mathbf{S})\Big) \\ \mathbb{E}_{\tau}\Big[\text{ln}\; p(\tau)\Big] & = \alpha_{0}\;\text{ln}\;\beta_{0} + (\alpha_{0} - 1)\Big(\psi(\alpha) - \text{ln}\;\beta\Big) - \beta_{0}\frac{\alpha}{\beta} - \text{ln}\;\Gamma(\alpha_{0}) \\ \mathbb{E}_{\mathbf{w}}\Big[\text{ln}\; q(\mathbf{w})\Big] & = -\frac{1}{2}\text{ln}|\mathbf{S}| - \frac{D}{2}(1 + \text{ln}\;2\pi)\\ \mathbb{E}_{\tau}\Big[\text{ln}\; q(\tau)\Big] & = - \text{ln}\;\Gamma(\alpha) + (\alpha - 1)\psi(\alpha) + \text{ln}\;\beta - \alpha \end{aligned} \] where \(\Gamma(\cdot)\) and \(\psi(\cdot)\) are the Gamma and digamma functions, respectively. In the implementation of Section 4, these five terms correspond to the variables lb_py, lb_pw, lb_pa, lb_qw and lb_qa.

4 Code implementation in R

Helper plotting functions

# devtools::install_github("andreaskapou/BPRMeth-devel")
suppressPackageStartupMessages(require(BPRMeth))
suppressPackageStartupMessages(require(matrixcalc))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(purrr))

# Define ggplot2 theme
gg_theme <- function(){
  p <- theme(
      plot.title = element_text(size = 20,face = 'bold',
                                margin = margin(0,0,3,0), hjust = 0.5),
      axis.text = element_text(size = rel(1.05), color = 'black'),
      axis.title = element_text(size = rel(1.45), color = 'black'),
      axis.title.y = element_text(margin = margin(0,10,0,0)),
      axis.title.x = element_text(margin = margin(10,0,0,0)),
      axis.ticks.x = element_line(colour = "black", size = rel(0.8)),
      axis.ticks.y = element_blank(),
      legend.position = "right",
      legend.key.size = unit(1.4, 'lines'),
      legend.title = element_text(size = 12, face = 'bold'),
      legend.text = element_text(size = 12),
      panel.border = element_blank(),
      panel.grid.major = element_line(colour = "gainsboro"),
      panel.background = element_blank()
    )
  return(p)
}

# Plot the predictive distribution
draw_predictive <- function(X, obs, title="", ...){
  p <- ggplot(data = data.frame(X), aes(x = xs, y = ys)) +
    geom_line(aes(x = xs, y = ys), size = 1.5, col = "darkblue") +
    geom_ribbon(aes(ymin = ys_l, ymax = ys_h), alpha = 0.4, 
                size = 0.1, fill = "cornflowerblue") +
    geom_point(data = obs, mapping = aes(x = x, y = y), shape = 1, 
               color = "red", size = 3) +
    labs(title = title, x = "x", y = "y") + gg_theme()
  return(p)
}

Functions for fitting the variational Bayesian linear regression (VBLR) model and for computing the posterior predictive distribution.

# Compute predictive distribution of VBLR model
vblr_predictive <- function(model, X_test){
  # Predictive mean
  mu_pred <- c(X_test %*% model$m)
  # Predictive standard deviation
  s_pred <- sqrt(1/model$lambda + diag(X_test %*% model$S %*% t(X_test)))
  return(list(mu_pred = mu_pred, s_pred = s_pred))
}

# Fit VBLR model
vblr_fit <- function(X, y, lambda = 0.5, a_0 = 1e-3, b_0 = 1e-3, 
                     max_iter = 500, epsilon_conv = 1e-5, 
                     is_verbose = FALSE){
  X <- as.matrix(X)
  D <- NCOL(X)              # Number of covariates
  I <- NROW(X)              # Number of observations
  XX <- crossprod(X, X)     # Compute X'X
  Xy <- crossprod(X, y)     # Compute X'y
  yy <- c(crossprod(y))     # Compute y'y
  L <- rep(-Inf, max_iter)  # Store the lower bounds
  
  a <- a_0 + D / 2          # Compute \alpha parameter of Gamma
  E_a <- a_0 / b_0          # Initial expectation of the precision parameter tau
  # Iterate to find optimal parameters
  for (i in 2:max_iter) {
    # Update covariance matrix of Gaussian factor
    S <- solve(diag(E_a, D) + lambda * XX)
    # Update mean of Gaussian factor
    m <- lambda * S %*% Xy
    # Update the expectation E[w'w] = m'm + tr(S)
    E_ww <- as.vector(crossprod(m) + matrix.trace(S))
    # Update the beta parameter of the Gamma factor
    b <- b_0 + 0.5 * E_ww
    # Update the expectation E[tau] = a / b
    E_a <- a / b
    
    # Compute lower bound
    lb_py <- 0.5 * I * log(lambda / (2 * pi)) - 0.5 * lambda * yy + 
        lambda * crossprod(m, Xy) - 0.5 * lambda * 
        matrix.trace(XX %*% (tcrossprod(m, m) + S)) 
    lb_pw <- -0.5 * D * log(2*pi) + 0.5 * D * (digamma(a) - log(b)) - 
        0.5 * E_a * E_ww 
    lb_pa <- a_0*log(b_0) + (a_0 - 1) * (digamma(a) - log(b)) - b_0*E_a - 
        lgamma(a_0)
    lb_qw <- -0.5 * log(det(S)) - 0.5 * D * (1 + log(2*pi)) 
    lb_qa <- -lgamma(a) + (a - 1)*digamma(a) + log(b) - a
    
    L[i] <- lb_py + lb_pw + lb_pa - lb_qw - lb_qa
    
    # Show VB difference
    if (is_verbose) { cat("It:\t",i,"\tLB:\t", L[i],"\tLB_diff:\t", 
                          L[i] - L[i - 1],"\n")}
    # Check if lower bound decreases
    if (L[i] < L[i - 1]) { message("Lower bound decreases!\n")}
    # Check for convergence
    if (L[i] - L[i - 1] < epsilon_conv) { break }
    # Check if VB converged in the given maximum iterations
    if (i == max_iter) {warning("VB did not converge!\n")}
  }
  obj <- structure(list(m = m, S = S, a = a, b = b, lambda = lambda, 
                        X = X, I = I, D = D, L = L[2:i]), 
                   class = "vblr")
  return(obj)
}

4.1 Learn simple linear regression model

Here we generate synthetic data and fit the VBLR model.

set.seed(12345)       # For reproducibility
I     <- 30           # Number of observations
D     <- 1            # For simplicity we have only one covariate
coefs <- c(-0.5, 2)   # Generate y with these coefficients
lambda  <- 1/2        # Precision parameter

X <- cbind(1, replicate(D, rnorm(I)))             # Generate X data
y <- X %*% coefs  + rnorm(I, sd = sqrt(1/lambda)) # Generate y data
# Run VBLR model
vblr_model <- vblr_fit(X = X, y = y, lambda = lambda, is_verbose = TRUE)
## It:   2  LB:  -72.22305  LB_diff:     Inf 
## It:   3  LB:  -72.18343  LB_diff:     0.03962364 
## It:   4  LB:  -72.18322  LB_diff:     0.0002118777 
## It:   5  LB:  -72.18322  LB_diff:     1.020172e-06

Let’s examine the posterior predictive distribution.

xs <- seq(-3, 3, len = 100) # create test X values
# Estimate predictive distribution
pred <- vblr_predictive(model = vblr_model, X_test = cbind(1, xs))
# Store test data in dt
dt <- data.table(xs = xs, ys = pred$mu_pred, ys_l = 0, ys_h = 0)
dt <- dt %>% .[,c("ys_l","ys_h") := list(ys - 2*pred$s_pred, 
                                         ys + 2*pred$s_pred)]
# Store training data
obs <- as.data.table(cbind(X[, 2], y)) %>% setnames(c("x", "y"))
# Create plot
p <- draw_predictive(X = dt, obs = obs, 
                     title = "Simple linear regression example")
print(p)

4.2 Learn RBF basis function regression model

Now we work with linear basis function regression, i.e., we transform the inputs to an RBF basis space before applying linear regression. This is a simple way to perform non-linear regression on the input space \(x\) while remaining linear in the coefficients \(\mathbf{w}\). The BPRMeth package will be used to create the basis functions and the design matrix.

set.seed(12345)                 # For reproducibility
I      <- 30                    # Number of observations
lambda <- 10                    # Precision parameter
D      <- 4                     # Number of basis functions
X      <- sort(runif(I, -1, 1)) # Generate X data
basis  <- BPRMeth::create_rbf_object(M = D - 1)# Create RBF basis object
H      <- BPRMeth::design_matrix(basis, X)$H   # Create design matrix
coefs <- c(-0.6, 1.75, -1.2, -4.8)             # Regression coefficients
y <- rnorm(I, mean = H %*% coefs, sd = sqrt(1/lambda)) # Generate y data
# Run VBLR model
vblr_model <- vblr_fit(X = H, y = y, lambda = lambda, is_verbose = TRUE)
## It:   2  LB:  -32.2247   LB_diff:     Inf 
## It:   3  LB:  -31.68046  LB_diff:     0.5442432 
## It:   4  LB:  -31.67986  LB_diff:     0.0005927667 
## It:   5  LB:  -31.67986  LB_diff:     8.680225e-07

Let’s examine the posterior predictive distribution.

xs <- seq(-2, 2, len = 100) # create test X values
Hs <- BPRMeth::design_matrix(basis, xs)$H
# Estimate predictive distribution
pred <- vblr_predictive(model = vblr_model, X_test = Hs)
# Store test data in dt
dt <- data.table(xs = xs, ys = pred$mu_pred, ys_l = 0, ys_h = 0)
dt <- dt %>% .[,c("ys_l","ys_h") := list(ys - 2*pred$s_pred, 
                                         ys + 2*pred$s_pred)]
# Store training data
obs <- as.data.table(cbind(X, y)) %>% setnames(c("x", "y"))
# Create plot
p <- draw_predictive(X = dt, obs = obs, 
                     title = "Linear basis function regression example")
print(p)

4.3 Perform model selection using the lower bound

We keep the data generated in the previous example and iterate over models of increasing complexity, i.e., we increase the number of RBF basis functions \(M\) and observe how the lower bound \(\mathcal{L}(q)\) changes.

set.seed(1234)
M <- 8
L <- vector(mode = "numeric", length = M)
for (m in 1:M) {
  basis <- BPRMeth::create_rbf_object(M = m)    # Create RBF basis object
  H     <- BPRMeth::design_matrix(basis, X)$H   # Create design matrix
  # Run VBLR model
  vblr_model <- vblr_fit(X = H, y = y, lambda = lambda, is_verbose = FALSE)
  # Store lower bound for each model
  L[m] <- tail(vblr_model$L, n = 1)
}
dt <- data.table("Model" = 1:M, "LB" = L)
ggplot(data = dt, aes(x = Model, y = LB, group = 1)) + geom_line() + 
    geom_point() + scale_x_continuous(breaks = 1:M) + gg_theme() 

Plot of the lower bound \(\mathcal{L}(q)\) versus the model complexity, i.e. the number of RBF basis functions \(M\). The lower bound approximates the log marginal likelihood \(\text{ln}\;p(\mathbf{y} | \mathcal{M})\), and we see that the bound peaks at \(M = 3\), corresponding to the true model that generated the data.

5 Conclusion

Most of this tutorial-like document was adapted from Chapter 10 of the Pattern Recognition and Machine Learning book by Christopher Bishop and the following slides: http://mlg.postech.ac.kr/~seungjin/courses/pgm/handouts/handout13.pdf. If you found this document useful, check my homepage http://homepages.inf.ed.ac.uk/s1263191/ at the University of Edinburgh for links to other tutorials.