If you found this document useful, check my homepage at the University of Edinburgh for links to other tutorials.

1 Introduction

Single parameter distributions, such as Poisson and Binomial, imply that the variance is determined by the mean. However in many cases and especially in analysis of biological data, this mean-variance relationship fails mainly due to presence of overdispresion, where the data have a higher variance than anticipated under the simple model (See Cox(1983), Hinde & Demetrio (1998)). In biology, overdispresion occurs since there is more than one source of variation: technical variation due to error measurements coming from the experiment design and biological variation between the subjects of interest, e.g. different cells on different conditions. Check Wang’s blog and this Quora question for more details.

This issue is well studied in analysis of RNA-Seq data where the Negative Binomial is used to model the output counts coming from RNA-Seq experiments, see edgeR chapter and RNA-Seq lecture. Here we focus on modelling overdispersion for proportions data. More specifically we are interesting in modelling the methylation rate of genomic regions and quantifying methylation variability across different cells, where the methylation rate \(\mu \in [0, 1]\). For example, one might want to test which genes are tightly controlled in terms of methylation or if there is more variability in methylation across different developmental stages for the same promoter region. The Binomial variance understimates the true methylation variability, since it assumes that cells are independent of each other and the probabilities of success do not vary between cells; hence, a Beta Binomial model should be used to account for overdispersion.

2 Beta Binomial model

Assume that we have a predefined set of features, e.g. enhancers, and we are interested in modelling the methylation rate and variance on those regions. We assume that the features are independent, hence different Beta-Binomial models can be fit for each region. To not clutter notation we will focus on modelling a single region.

Assume that a given study consists of \(I\) sequenced cells, and for each cell \(i (i = 1, ..., I)\) we have observed \(n_{i}\) CpGs for the region of interest. The number of CpGs \(n_{i}\) are treated as fixed constants and are allowed to vary between cells, which is the case for BS-Seq experiments. Let \(m_{i}\) denote the number of methylated CpGs in the \(i^{th}\) cell. Each cell \(i\) has an underlying probability \(p_{i}\) of being methylated, and conditional on \(p_{i}\), the \(m_{i}\) are independent binomial random variables

\[ m_{i} | p_{i} \sim Binomial(m_{i} | n_{i}, p_{i}) = \binom{n_{i}}{m_{i}} p_{i}^{m_{i}}(1-p_{i})^{n_{i}-m_{i}}\]

We also assume that the random variables \(p_{i}\) are independent and follow a Beta distribution

\[ p_{i} | \alpha, \beta \sim Beta(p_{i} | \alpha, \beta) = \frac{p_{i}^{\alpha-1}(1-p_{i})^{\beta -1}}{B(\alpha, \beta)} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} p_{i}^{\alpha-1}(1-p_{i})^{\beta -1}\]

where \(\alpha > 0\) and \(\beta > 0\) are unknown constants that govern how the underlying probabilities of success \(p_{i}\) are distributed. We will estimate \(\alpha\) and \(\beta\) using a maximum likelihood approach (See Wilcox (1979), Tripathi et. al. (1994) and Ennis & Bi (1997)).

We can marginalise out the unknown parameter \(p_{i}\), and the likelihood of probability of having \(m_{i}\) methylated CpGs given \(n_{i}\), \(\alpha\) and \(\beta\) becomes

\[ \begin{aligned} Pr(m_{i} | n_{i}, \alpha, \beta) & = \int_{0}^{1} Binomial(m_{i} | n_{i}, p_{i}) \cdot Beta(p_{i} | \alpha, \beta) \; dp_{i} \\ & = \int_{0}^{1} \binom{n_{i}}{m_{i}} \frac{1}{B(\alpha, \beta)} \; p_{i}^{\alpha + m_{i} - 1}(1-p_{i})^{\beta + n_{i} - m_{i} - 1} \; dp_{i} \\ & = \binom{n_{i}}{m_{i}} \frac{B(\alpha + m_{i}, \beta + n_{i} - m_{i})}{B(\alpha, \beta)} \end{aligned} \] This compound distribution is the Beta-Binomial distribution, which can account for overdispersion when the probabilities of success are thought to vary from cell to cell.

We can reparametrize the BB model with strictly positive parameters \(\mu\) and \(\gamma\),

\[ \begin{aligned} \mu & = \alpha (\alpha + \beta)^{-1} \\ \gamma & = (\alpha + \beta + 1)^{-1} \end{aligned} \] Then the mean and variance of \(m_{i}\) can be expressed by

\[ \begin{aligned} \mathbb{E}[m_{i} | n_{i}, \mu, \gamma] & = n_{i} \mu \\ Var[m_{i} | n_{i}, \mu, \gamma] & = n_{i} \mu (1 - \mu)(1 + (n_{i} - 1)\gamma) \end{aligned} \] The parameter \(\mu\), where \(0 < \mu < 1\) can be interpreted as the mean of the marginal probabilities of success \(p_{i}\) and \(\rho\), where \(0 < \gamma < 1\) is called the dispersion parameter. Note that when \(\gamma\) is equal to 0 (i.e. no overdispersion), the BB model reduces to having the same variance as the Binomial distribution.

2.1 What do you mean by mean-variance relationship?

For the Binomial model, we can compute the variance of the random variable by just knowing the probability of success (e.g. methylation level) since the variance for a Binomial variable is given by

\[ Var[m_{i} | n_{i}, p] = n_{i} p(1 - p) \]

Assume for simplicity that \(n_{i} = 100\) for all observations. Then, we can directly obtain the mean and variance of the successes for the Binomial model

set.seed(4)
# Total number of trials
n <- rep(100, 20)
# Probability of success
p <- seq(0, 1, length.out = 20)
# Mean of successes
mu <- n*p
# Variance of successes
succ_var <- n*p*(1 - p)
dt <- data.table(x = p, y = mu, y_low = mu - 2*sqrt(succ_var), 
                 y_high = mu + 2*sqrt(succ_var))
p_bin <- plot_model(X = dt, title = "Binomial model")
print(p_bin)

Note that by only knowing the mean we can have only a specific value for the variance (in the above plot we show mean \(\pm2\) standard deviations), which is maximised when \(p = 0.5\). Assume that in our model we have \(p = 0.4\), i.e. the methylation level is 40%, so in general we would expect \(m = 40\) methylated reads out of the total \(n = 100\) reads, and the variance would be \(\text{var}(m) = 100 * 0.4 * (1 - 0.4) = 24\). Let us sample \(r = 5\) points from the Binomial and show them in our plot, e.g. \(r\) can be considered as the number of replicates.

r <- 5  # Total number of replicates
# Probability of success
bin_samples <- rbinom(n = r, size = 100, prob = 0.4)
bin_obs <- data.table(x = rep(0.4, r), y = bin_samples)
p_bin <- plot_model(X = dt, obs = bin_obs, title = "Binomial model")
print(p_bin)

However, when obtaining replicates from real data, we have higher variability (i.e. overdispersion) than the Binomial model would expect. This means that the Binomial model is the wrong model to try and explain the data. E.g. imagine we obtained the following observations, shown in green.

# Probability of success
set.seed(12)
bb_samples <- VGAM::rbetabinom(r, 100, prob = 0.4, rho = 0.1)
betabin_obs <- data.table(x = rep(0.4, r), y = bb_samples)
p_betabin <- p_bin + 
    geom_point(data = betabin_obs, mapping = aes(x = x, y = y), 
               shape = 20, color = "darkgreen", size = 3)
print(p_betabin)

Obviously, the Binomial model cannot capture this variability of the data, and when someone would try to perform differential analysis might erroneously consider them as differentially methtylated when considering a Binomial test. So let us see what would be the mean-variance relationship for the Beta-Binomial model. Note that having the additional dispersion parameter \(\gamma\) (which we learn from the data) we can change the mean-variance relationship; hence depending on the data we can obtain a better estimate of the variability that we can use for performing differential analysis, something that we cannot do using the Binomial model. Below, we show the mean-variance relationship for two different dispersion parameter values.

set.seed(4)
# Dispersion parameters
gamma <- c(0.05, 0.1)
# Total number of trials
n <- rep(100, 20)
# Probability of success
p <- seq(0, 1, length.out = 20)
# Mean of successes
mu <- n*p
# Variance of successes
succ_var_1 <- n*p*(1 - p)*(1 + (n - 1) * gamma[1])
dt_1 <- data.table(x = p, y = mu, y_low = mu - 2*sqrt(succ_var_1), 
                   y_high = mu + 2*sqrt(succ_var_1))
succ_var_2 <- n*p*(1 - p)*(1 + (n - 1) * gamma[2])
dt_2 <- data.table(x = p, y = mu, y_low = mu - 2*sqrt(succ_var_2), 
                   y_high = mu + 2*sqrt(succ_var_2))
p_betabin1 <- plot_model(X = dt_1, title = "BB, disp = 0.05")
p_betabin2 <- plot_model(X = dt_2, title = "BB, disp = 0.1")
p_betabin_all <- cowplot::plot_grid(p_betabin1, p_betabin2, ncol = 2)
print(p_betabin_all)

2.1.1 Differential analysis

Let’s show how the previous observations would fit to the Beta Binomial model with \(\gamma = 0.1\), i.e. the actual distribution that generated the data.

p_betabin <- p_betabin2 + 
    geom_point(data = bin_obs, mapping = aes(x = x, y = y), 
               shape = 20, color = "red", size = 3) +
    geom_point(data = betabin_obs, mapping = aes(x = x, y = y), 
               shape = 20, color = "darkgreen", size = 3)
print(p_betabin)

This clearly captures better the variability of the data and will take it into account when computing the statistic that will be used for identifying differentially methylated data.

If we were to use the Wald test proposed in the DSS package, the statistic takes the following form (See Eq. 1.2 in their paper )

\[ t = \frac{\mu_{1} - \mu_{2}}{\sqrt{\text{var}_{1} + \text{var}_{2}}} \] where \[ \begin{aligned} \mu & = \frac{\sum_{r} m_{r}}{\sum_{r} n_{r}} \\ \text{var} & = \text{var}\left(\frac{\sum_{r} m_{r}}{\sum_{r} n_{r}}\right) = \left(\frac{1}{\sum_{r} n_{r}}\right)^{2}\text{var}\left(\sum_{r} m_{r}\right) \end{aligned} \]

and \(r\) are the replicates for each experiment. Note that for Binomial assumption we will underestimate the variance term, hence the statistic \(t\) will give us larger values which then will lead to lower p-values!!

However, when using the Beta-Binomial, we will have a better estimate of the variance, which in general will have larger values compared to the Binomial, hence the statistic \(t\) will give us more approriate estimates, thus we take into account the variability when computing p-values. This can be thought as the Beta-Binomial being more conservative in identifying differential methylated loci, however it is the right model for capturing overdispersion.

Note In the example given above, both models might give that the difference is statistically significant, since the majority of the green points have lower mean methylation level, however the confidence of the models in making this statement is really differnent.

3 MLE for Beta Binomial model

We can estimate the optimal parameters \(\alpha\) and \(\beta\) using maximum likelihood. Since we have \(I\) cells the likelihood function is the following

\[L(m | \alpha, \beta) = \prod_{i = 1}^{I} \binom{n_{i}}{m_{i}} \frac{B(\alpha + m_{i}, \beta + n_{i} - m_{i})}{B(\alpha, \beta)}\] and the log likelihood becomes \[ \begin{aligned} l(m | \alpha, \beta) & = \text{log}\; L(m | \alpha, \beta) \\ & = \sum_{i = 1}^{I} \text{log}\;\binom{n_{i}}{m_{i}} + \text{log}\;B(\alpha + m_{i}, \beta + n_{i} - m_{i}) - \text{log}\;B(\alpha, \beta) \\ & = \sum_{i = 1}^{I} \text{log}\;\binom{n_{i}}{m_{i}} + \gamma(\alpha + m_{i}) + \gamma(\beta + n_{i} - m_{i}) - \\ & \qquad \qquad\gamma(\alpha + \beta + n_{i}) - \gamma(\alpha) - \gamma(\beta) + \gamma(\alpha + \beta) \end{aligned} \] where: \(\gamma(x) = \text{log}\;\Gamma(x)\).

We can then ecompute the 1st partial derivatives and the 2nd partial derivatives (i.e. the Hessian matrix) of the log likelihood wrt parameters \(\alpha\) and \(\beta\) (equations not shown). Then we can estimate the optimal parameter values using the Newton-Raphson iterative method.

The approximate covariance matrix and the standard errors for the parameters \(\hat{\alpha}\) and \(\hat{\beta}\) can be obtained by inverting the Hessian matrix at the ML solution,

\[ Cov(\hat{\alpha}, \hat{\beta}) = \begin{bmatrix} \hat{\sigma}_{\alpha}^{2} & \gamma \hat{\sigma}_{\alpha} \hat{\sigma}_{\beta} \\ \gamma \hat{\sigma}_{\alpha} \hat{\sigma}_{\beta} & \hat{\sigma}_{\beta}^{2} \end{bmatrix} \]

3.1 Simulation data and MLE fit

Below we simulate data from a Beta-Binomial distribution with \(\mu = 0.8\) and \(\rho = 0.1\) using the VGAM package and learn the optimal parameters using maximum likelihood.

set.seed(4)
# Total number of cells
C <- rbinom(1, 50, runif(1, min = 0.4, max = 0.7))
# Total number of CpGs for each cell
n <- rbinom(C, 50, runif(1, min = 0.3, max = 0.8))
# Generate synthetic data with \mu = 0.8 and \rho = 0.1
m <- VGAM::rbetabinom(length(n), n, prob = 0.8, rho = 0.1)
# MLE fit
fit <- Coef(VGAM::vglm(cbind(m, n-m) ~ 1, betabinomial, trace = FALSE))

print(paste0("mu_hat: ", round(fit[1], 3), "     rho_hat: ", round(fit[2], 3)))
## [1] "mu_hat: 0.784     rho_hat: 0.125"

4 Test for overdispersion

Before adopting the Beta-Binomial model for analysis, we must first examine whether the data are overdispersed to the extent that the BB model would be a better fit than the simple Binomial model. Since we are able to estimate \(\rho\), we can directly test whether \(\rho\) is significantly greater than zero. However, this test is less sensitive in detecting departure from Binomial because boundary problems arise as we test whether a positive-valued parameter is greater than 0 (recall that \(0 < \rho < 1\)), see Xu & Chan (2008).

4.1 Likelihood ratio test

An alternative approach is to perform the likelihood ratio test (LRT). The null hypothesis is that the underlying distribution is Binomial while the alternative hypothesis is that the distribution is a Beta-Binomial. The likelihood ratio test statistic is given by

\[ \chi_{1}^{2} = - 2 \; (L_{B} - L_{BB}) \] where \(L_{B}\) is the log likelihood value of the Binomial distribution at the MLE, and \(L_{BB}\) is the log likelihood value of the BB model at the MLE. This statistic follows a \(\chi^{2}\) distribution with 1 degree of freedom, which is the difference on the number of parameters for each distribution. It should be noted that the same boundary problem applies also for this test.

4.2 Tarone’s Z statistic

To avoid the boundary problem we can use an alternative statistic called Tarone’s Z statistic Tarone (1979). This statistic can be used as goodness of fit test of the Binomial distribution against the BB distribution,

\[ Z = \frac{(S - \sum_{i} n_{i})}{\sqrt{2 \sum_{i}n_{i}(n_{i} - 1)}} \] where, \[ S = \sum_{i} \frac{(m_{i} - n_{i} \hat{p})^{2}} {\hat{p}(1 - \hat{p})} \] \[ \hat{p} = \sum_{i} m_{i} \Big/ \sum_{i} n_{i} \] The statistic Z has an asymptotic standard normal distribution under the null hypothesis of a Binomial distribution.

Let’s see an example of the Tarone’s Z test in the following code, where we simulate 1000 samples from Binomial and Beta-Binomial distributions and calculate the Z-score.

set.seed(5)
M <- 1000
alt_hyp = null_hyp <- vector("numeric", length = M)
for (i in 1:M) {
  # Total number of cells
  C <- rbinom(1, 70, runif(1, min = 0.4, max = 0.7))
  # Total number of CpGs for each cell
  n <- rbinom(C, 80, runif(1, min = 0.3, max = 0.8))
  
  ##---
  # Compute Tarone's Z statistic for overdispersion Beta Binomial
  ##---
  # Generate synthetic data with \mu = 0.8 and \rho = 0.05
  m_alt <- VGAM::rbetabinom(length(n), n, prob = 0.8, rho = 0.05)
  p_hat = sum(m_alt) / sum(n)
  S = sum( (m_alt - n * p_hat)^2 / (p_hat * (1 - p_hat)) )
  Z_score = (S - sum(n)) / sqrt(2 * sum(n * (n - 1)))
  alt_hyp[i] <- Z_score 
  
  ##---
  # Compute Tarone's Z statistic for Binomial
  ##---
  # Generate synthetic data with \mu = 0.8 and \rho = 0
  m_null <- rbetabinom(length(n), n, prob = 0.8, rho = 0)
  # Compute Tarone's Z statistic
  p_hat = sum(m_null) / sum(n)
  S = sum( (m_null - n * p_hat)^2 / (p_hat * (1 - p_hat)) )
  Z_score = (S - sum(n)) / sqrt(2 * sum(n * (n - 1)))
  null_hyp[i] <- Z_score
}
# Melt object for ggplot2
dt <- list(Null = null_hyp, Alternative = alt_hyp) %>% 
    melt %>% as.data.table %>% 
    setnames(c("value", "L1"), c("value", "Hypothesis"))
# Create histogram plot
p <- ggplot(dt, aes(x = value, color = Hypothesis, fill = Hypothesis)) +
  geom_histogram(aes(y = ..density..), alpha = 0.3, 
                 position = "dodge", bins = 100) +
  geom_vline(xintercept = 0.2, linetype = "dashed", size = 0.7) + 
  stat_function(fun = dnorm, colour = "cornflowerblue", size = 1, 
                arg = list(mean = 0, sd = 1)) + 
  scale_x_continuous(limits = c(-3,11.45)) +
  labs(title = "Tarone's Z score distribution",x = "Z score",y = "Density") +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  histogram_theme()
print(p)

The underlying blue curve shows the theoretical null distribution which is the standard normal distribution, i.e. \(N(0,1)\), the light orange histogram is the empirical distribution under the null model (1000 simulations) and the green coloured histogram is the distribution of the Z-scores under a Beta Binomial distribution with dispersion \(\rho = 0.05\). We can see that the empirical distribution is really close to the theoretical, which makes the Tarone’s Z-score a sensible statistic to test for overdispersion against the BB model.

5 Remarks

  1. Here we account for overdispersion due to difference in methylation rate across cells for the same region. However, CpGs on the same cell are also highly spatially correlated. Can the Beta Binomial model both of these associations or we have more overdispersion than the Beta-Binomial can account?
  2. Instead of the BB model one could use the Correlated Binomial (CB) model. This model does not take into account the cell-to-cell variability but the correlations between CpGs in each cell. This derivation also allows for negative association, i.e. underdispersion, which however is really rare on real datasets, see Paul 1982, Prentice (1986) and Engel and te Brake (1993).

6 Conclusions

This tutorial-like document showed how we can account for overdispersion in biological data by using the Beta-Binomial model and test for overdispersion using Tarone’s Z-statistic.

If you found this document useful, check my homepage at the University of Edinburgh for links to other tutorials.