Lecture 8: Gibbs Sampling and MCMC Diagnostics

1 Gibbs Sampler (1984)

In the normal model we were able to derive the full joint posterior distribution \[ p(\theta,\sigma^2\mid y_1,\ldots,y_n) \]

In most models, the joint posterior cannot be written in a closed form.

But often we are able to derive the full conditional posterior distributions:

\[ p(\theta\mid \sigma^2,y_1,\ldots,y_n)\qquad \text{ and }\qquad p(\sigma^2\mid \theta,y_1,\ldots,y_n) \]

This is a normal model with semi-conjugate prior. More on this later. But for now let’s see how we can use the full conditional distributions to sample from the joint posterior distribution. This is the idea behind the Gibbs sampler, which is a special case of Markov Chain Monte Carlo (MCMC) methods. Other MCMC methods include Metropolis, Metropolis-Hastings and Hamiltonian (or Hybrid) Monte Carlo, which we will discuss in future lectures.

1.1 General setup of the Gibbs sampler

Suppose we have a vector of parameters \(\boldsymbol{\phi}\) of dimension \(p\): \[ \boldsymbol{\phi}=\{\phi_1,\phi_2,\ldots,\phi_p\}. \]

Goal: Generate samples from the joint distribution \(p(\boldsymbol{\phi})\) having access only to the full conditional distributions \(p(\phi_j\mid \boldsymbol{\phi}_{-j})\). In this setup, \(p(\boldsymbol{\phi})\) is the target distribution, and in Bayesian inference, it is the posterior distribution of interest.

The Gibbs sampling algorithm proceeds iteratively as follows:

Given starting values \(\boldsymbol{\phi}^{(0)}=\{\phi_1^{(0)},\phi_2^{(0)},\ldots,\phi_p^{(0)}\}\)

sample \(\phi_1^{(i)}\sim p(\phi_1\mid \phi_2^{(i-1)},\phi_3^{(i-1)}\ldots,\phi_p^{(i-1)})\)
sample \(\phi_2^{(i)}\sim p(\phi_2\mid \phi_1^{(i)},\phi_3^{(i-1)}\ldots,\phi_p^{(i-1)})\) \[ \vdots \]

sample \(\phi_p^{(i)}\sim p(\phi_p\mid \phi_1^{(i)},\phi_2^{(i)}\ldots,\phi_{p-1}^{(i)})\)

Q: What will be the 3rd step of the Gibbs sampler?

Answer

The 3rd step of the Gibbs sampler will be to sample \(\phi_3^{(i)}\) from its full conditional distribution given the most recent values of the other parameters. Specifically, it will be: \[ \phi_3^{(i)}\sim p(\phi_3\mid \phi_1^{(i)},\phi_2^{(i)},\phi_4^{(i-1)}\ldots,\phi_p^{(i-1)}) \]

The updating cycle 1 through p is called a Gibbs scan. We get a sequence of \(S\) Gibbs’ scans: \[ \begin{aligned} \boldsymbol{\phi}^{(1)}&=\{\phi_1^{(1)},\phi_2^{(1)},\ldots,\phi_p^{(1)}\}\\ \boldsymbol{\phi}^{(2)}&=\{\phi_1^{(2)},\phi_2^{(2)},\ldots,\phi_p^{(2)}\}\\ &\vdots\\ \boldsymbol{\phi}^{(S)}&=\{\phi_1^{(S)},\phi_2^{(S)},\ldots,\phi_p^{(S)}\}\\ \end{aligned} \]

The above \(S\) samples are not independent, but they form a Markov chain.

Fact:

Under certain conditions, the samples \(\{\boldsymbol{\phi}^{(i)}\}_{i=1}^S\) are from the \(p(\boldsymbol{\phi})\) distribution and

\[ \frac{1}{S}\sum_{i=1}^S g(\boldsymbol{\phi}^{(i)})\quad\to\quad E[g(\boldsymbol{\phi})] \]

\[ \frac{\#\{\boldsymbol{\phi}^{(i)}\in A\}}{S}\quad\to\quad P(\boldsymbol{\phi}\in A) \]

In other words, the Gibbs sampler generates samples from the target distribution \(p(\boldsymbol{\phi})\) and allows us to estimate expectations and probabilities with respect to this distribution. This is a key property that makes the Gibbs sampler a powerful tool for Bayesian inference, especially when the joint distribution is complex and cannot be sampled from directly.

1.2 Monte Carlo vs Gibbs sampling?

Q: What is the conceptual difference between the Monte Carlo sampling from the joint posterior and the Gibbs sampling?

Answer

In Monte Carlo sampling, we assume that we can directly sample independently from the joint posterior distribution \(p(\boldsymbol{\phi})\). In contrast, the Gibbs sampler generates a sequence of samples \(\{\boldsymbol{\phi}^{(i)}\}_{i=1}^S\) where each sample is generated conditionally on the previous sample. This means that the samples from the Gibbs sampler are not independent, but rather form a Markov chain. The Gibbs sampler relies on the full conditional distributions to iteratively update each component of \(\boldsymbol{\phi}\), while Monte Carlo sampling assumes direct access to the joint distribution for sampling.

Recall, Markov chain is defined the following way:

Definition: Sequence \(\{\phi^{(i)}\}_{i=1}^S\) is a Markov chain if \[ \Pr(\phi^{(i)}\mid \phi^{(i-1)},\phi^{(i-2)},\ldots,\phi^{(1)})=\Pr(\phi^{(i)}\mid \phi^{(i-1)}) \]

We will discuss the properties of Markov chains later in the course. For now, we will focus on the practical aspects of using Gibbs sampling and diagnosing its convergence.

2 MCMC diagnostics

How do we know if the samples generated by the Gibbs sampler are from the target distribution \(p(\boldsymbol{\phi})\)? How many samples \(S\) do we need to generate to get a good approximation of the target distribution? MCMC diagnostics are tools that help us assess the convergence of the Markov chain to the target distribution and the quality of the samples generated.

2.1 A Toy Example: Mixture Distribution

Consider a mixture distribution for a parameter \(\theta\): \[ p(\theta)= \sum_{\delta=1}^3 p(\theta\mid\delta)\, p(\delta)=\sum_{\delta=1}^3 N(\mu_\delta, \sigma_\delta^2)\, w_\delta \]

where \[ \delta\in\{1,2,3\},\qquad Pr(\delta=1)=Pr(\delta=3)=0.45,\quad Pr(\delta=2)=0.1 \]

and \[ \theta\mid \delta=1\sim N(-3,1/3),\quad \theta\mid \delta=2\sim N(0,1/3),\quad \theta\mid \delta=3\sim N(3,1/3) \]

Above is the plot of the mixture distribution (blue line) and a histogram of 2,000 samples drawn from this distribution using Monte Carlo sampling. You may think of it as some posterior distribution that we are trying to approximate using MCMC methods.

Q: How do we approximate this distribution using Monte Carlo?

Answer

The pseudo-code for Monte Carlo sampling from the mixture distribution is as follows:

For \(i\) in 1 to \(S\):
1. Sample \(\delta^{(i)}\) from the categorical distribution with probabilities \(w_i\). I.e. \(\delta^{(i)}\sim p(\delta)\).
2. Sample \(\theta^{(i)}\) from the normal distribution corresponding to \(\delta^{(i)}\), i.e. \(\theta^{(i)}\sim p(\theta\mid \delta^{(i)})= N(\mu_{\delta^{(i)}}, \sigma_{\delta^{(i)}}^2)\).
The samples \(\{\theta^{(i)}\}_{i=1}^S\) are then samples from the marginal \(p(\theta)\).

Let’s construct a Gibbs sampler for \(\phi=\{\delta, \theta\}\). The pseudo-code for the Gibbs sampler is as follows:

Initialize \(\theta^{(0)},\delta^{(0)}\) (e.g., \(\theta^{(0)}=0, \delta^{(0)}=1\)).
For \(i\) in 1 to \(S\):
1. Sample \(\delta^{(i)}\) from the full conditional distribution \(p(\delta\mid \theta^{(i-1)})\). This is a categorical distribution with probabilities proportional to \(w_j \times \text{dnorm}(\theta^{(i-1)}, \mu_j, \sqrt{\sigma^2_j})\) for \(j=1,2,3\).
2. Sample \(\theta^{(i)}\) from the full conditional distribution \(p(\theta\mid \delta^{(i)})\), which is a normal distribution \(N(\mu_{\delta^{(i)}}, \sigma^2_{\delta^{(i)}})\).

To derive the full conditional distributions, we can use Bayes’ theorem. The full conditional for \(\delta\) given \(\theta\) is: \[ \begin{aligned} p(\delta=j\mid \theta) &\propto p(\theta\mid \delta=j)\times p(\delta=j) \\ &= w_j \cdot \text{dnorm}(\theta; \mu_j, \sigma_j) \end{aligned} \] The full conditional for \(\theta\) given \(\delta\) is: \[ p(\theta\mid \delta=j) = N(\mu_j, \sigma^2_j) \]

We make \(S=10,000\) Gibbs scans, and plot the first 1,000 samples of \(\theta\) in a histogram and overlay the true mixture distribution.:

Does this look like a good approximation of the true mixture distribution? Probably not. The histogram of the first 1,000 samples from the Gibbs sampler does not closely match the true mixture distribution (blue line). This is due to poor mixing of the Markov chain, which can occur when the chain gets stuck in one mode of the distribution and does not explore the other modes effectively. In this case, the chain may be spending too much time in one of the components of the mixture and not adequately sampling from the others, leading to a biased approximation of the target distribution.

Diagnostic Tool: trace plot

Trace plots show the values of the samples generated by the Gibbs sampler across iterations. They can help us visually assess the mixing and convergence of the Markov chain.

Q: What do you observe? Is there an issue?

Answer

The trace plot shows that the samples of \(\theta\) are not mixing well across the iterations. The chain appears to be getting stuck in certain regions of the parameter space, which is indicative of poor mixing. This can lead to a biased approximation of the target distribution, as the chain may not be exploring all modes of the mixture distribution effectively. The trace plot suggests that we may need to run the Gibbs sampler for more iterations or consider alternative sampling strategies to improve mixing and convergence.

Now, running the chain for 10,000 iterations shows much better mixing and a histogram that closely matches the true mixture distribution. The trace plot also shows that the samples are exploring the parameter space more effectively, indicating improved convergence of the Markov chain. Here we have the benefit of knowing the true distribution, but in practice, we would not have this information and would need to rely on more diagnostic tools to assess convergence.

One way to check if the chain explored the full parameter space is to examine the trace plots of multiple chains. If the chains are exploring different regions of the parameter space and mixing well, it suggests that the chain has adequately explored the target distribution.

The next code snippet runs three independent chains of the Gibbs sampler and plots their trace plots to visually assess convergence across multiple chains. It also calculates the Gelman-Rubin diagnostic to quantitatively assess convergence.

library(ggplot2)
library(dplyr)
library(coda)

# Number of chains and iterations
S <- 1000
n_chains <- 3
chains <- list()
# Initialize multiple chains
set.seed(632)
for (i in 1:n_chains) {
  th <- 0
  chain <- numeric()
  for (s in 1:S) {
    d <- sample(1:3, 1, prob = w * dnorm(th, mu, sqrt(s2)))
    th <- rnorm(1, mu[d], sqrt(s2[d]))
    chain <- c(chain, th)
  }
  chains[[i]] <- chain
}

# Combine chains into a data frame for plotting
chains_df <- data.frame(
  Iteration = rep(1:S, n_chains),
  Theta = unlist(chains),
  Chain = factor(rep(1:n_chains, each = S))
)

# Plot the three chains
ggplot(chains_df, aes(x = Iteration, y = Theta, color = Chain)) +
  geom_line(alpha = 0.7) +
  labs(title = "Trace Plot for 3 MCMC Chains",
       x = "Iteration",
       y = expression(theta),
       color = "Chain") +
  theme_minimal()

# Convert to mcmc.list
mcmc_chains <- mcmc.list(lapply(chains, mcmc))

Gelman-Rubin Diagnostic

Another way to assess convergence is to run multiple chains with different starting values and compare their behavior. The Gelman-Rubin diagnostic compares the variance between multiple chains to the variance within each chain. If chains are exploring different parts of the parameter space, the between-chain variance will be larger than the within-chain variance, indicating that the chains have not yet converged to the same distribution.

The potential scale reduction factor (PSRF or \(\hat{R}\)) is calculated as: \[ \hat{R} = \sqrt{\frac{\hat{V}}{W}} \] where:

\(\hat{V}\) is the estimated variance of the target distribution (combining between-chain and within-chain variances).
\(W\) is the average within-chain variance.

Rule Of Thumb:

If \(\hat{R} \approx 1\), the chains have likely converged.
If \(\hat{R} > 1.1\), it suggests that the chains have not yet converged.
More stringent thresholds (e.g., \(\hat{R} < 1.05\)) may be used for more rigorous convergence assessment.
More chains provide a more reliable estimate of convergence at the expense of increased computational cost.

# Gelman-Rubin diagnostic
gelman.diag(mcmc_chains)

Potential scale reduction factors:

     Point est. Upper C.I.
[1,]       1.24       1.82

In the 3 chains with \(S=1000\) we can clearly see that they are not mixing well and the Gelman-Rubin diagnostic indicates that the chains have not yet converged. However, when we run the chains for \(S=10,000\) iterations, we see much better mixing and the Gelman-Rubin diagnostic indicates that the chains have likely converged to the same distribution.

2.2 Three chains with \(S=10,000\):

Potential scale reduction factors:

     Point est. Upper C.I.
[1,]       1.01       1.05

Diagnostic Tool: Rank Trace Plots

Rank trace plots (I’ve seen the term trank plots but this seems to be not so common) visualize the distribution of ranks of samples from multiple chains. If the chains have converged, the ranks should be uniformly distributed across chains. In other words, the rank trace plot should show that samples from different chains are intermingled and not separated by chain.

While the Rank trace plot for the first 1,000 iterations may show some separation between chains, indicating that they have not yet converged, the rank trace plot for all iterations should show that the ranks are much better mixed across chains, suggesting that the chains are much closer to being converged to the same distribution.

2.3 Autocorrelation and Effective Sample Size

The point estimate \(E[\phi|y]\) is approximated by \[ \overline{\phi}=\frac{1}{S}\sum_{i=1}^S \phi^{(i)} \]

Q: In the Monte Carlo case, what is the variance of this point estimate?

Answer

\[ Var(\overline{\phi})=\frac{Var(\phi)}{S} \]

In the MCMC case, what is the variance of this point estimate?

\[ \begin{aligned} Var(\overline{\phi})&=\frac{1}{S}\sum_{i=1}^S Var(\phi^{(i)}) + \sum_{i<j} Cov(\phi^{(i)}, \phi^{(j)})\\ &=\frac{Var(\phi)}{S} + \sum_{i<j} Cov(\phi^{(i)}, \phi^{(j)})\\ &\approx \frac{Var(\phi)}{S}\left(1 + 2\sum_{k=1}^\infty \rho_k\right)\\ &=\frac{Var(\phi)}{S_{eff}} \end{aligned} \] The main difference with the Monte Carlo case is that in the MCMC case, the samples are not independent, and there is autocorrelation between the samples. This autocorrelation increases the variance (usually) of the point estimate compared to the Monte Carlo case, where the samples are independent. The effective sample size \(S_{eff}\) accounts for this autocorrelation and provides a measure of how many independent samples would be needed to achieve the same variance as the correlated MCMC samples.

Above, \(\rho_k\) is the autocorrelation at lag \(k\), and \(S_{eff}\) is the effective sample size, which accounts for the autocorrelation in the MCMC samples. The effective sample size can be thought of as the equivalent number of independent samples that would provide the same variance of the point estimate as the correlated MCMC samples.

\[ S_{eff} = \frac{S}{1 + 2\sum_{k=1}^\infty \rho_k} \]

The autocorrelation terms are typically positive, so the effective sample size is smaller than the total number of samples.

Q: Is it possible for the effective sample size to be larger than the total number of samples? Does it make sense?

Answer

In theory, the effective sample size can be larger than the total number of samples if the autocorrelation is negative at certain lags, which can lead to a situation where the variance of the point estimate is actually smaller than what would be expected from independent samples. Thus, this chain provides more information than an independent sample of the same size. This happens when the chain exhibits negative autocorrelation, or a mean-reverting behavior, which can reduce the variance of the estimator. However, in practice, it is more common for the effective sample size to be smaller than the total number of samples due to positive autocorrelation in MCMC chains. Therefore, while it is possible for the effective sample size to be larger than the total number of samples, it is not a common occurrence and may indicate that the chain has some unusual properties.

2.4 MCMC diagnostics for the semiconjugate normal analysis

Let’s go back to the Midge dataset, i.e. the normal model with semi-conjugate prior and run the Gibbs sampler to generate samples from the posterior distribution of \(\theta\) and \(\sigma^2\). We will then use trace plots, autocorrelation plots, and effective sample size calculations to assess the convergence and quality of our MCMC samples.

Trace plots:

Let’s display the trace plots for \(\theta^{(i)}\) and \(\sigma^{2(i)}\) from our Gibbs sampler:

# Hyperparameters
mu0 <- 1.9; t20 <- 0.95^2; s20 <- 0.01; nu0 <- 1

# Data
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)
n <- length(y); mean.y <- mean(y);var.y <- var(y)

## starting values
set.seed(632)
S <- 5000
PHI<-matrix(nrow=S,ncol=2)
PHI[1,]<-phi<-c( mean.y, 1/var.y)

## Gibbs sampling algorithm
for(s in 2:S) {

  # generate a new theta value from its full conditional
  mun<-  ( mu0/t20 + n*mean.y*phi[2] ) / ( 1/t20 + n*phi[2] )
  t2n<- 1/( 1/t20 + n*phi[2] )
  phi[1]<-rnorm(1, mun, sqrt(t2n) )

  # generate a new sigma^2 value from its full conditional
  nun<- nu0+n
  s2n<- (nu0*s20 + (n-1)*var.y + n*(mean.y-phi[1])^2 ) /nun
  phi[2]<- rgamma(1, nun/2, nun*s2n/2)

  PHI[s,]<-phi
  }
# plotting
plot(PHI[,1],xlab="iteration",ylab=expression(theta), type="l", lwd=0.25)

plot(1/PHI[,2],xlab="iteration",ylab=expression(sigma^2), type="l", lwd=0.25)

ACF plots:

Since our Gibbs sampler constructs a Markov chain in the (mean,precision) space, it makes sense to plot the ACF of the \(1/\phi_2^{(i)}\), which corresponds to \(\sigma^{2(i)}\), instead of \(\phi_2^{(i)}\) itself, which corresponds to the precision. The ACF plots will show us how much autocorrelation is present in our MCMC samples for both \(\theta\) and \(\sigma^2\).

acf(PHI[,1])

acf(1/PHI[,2])

Q: What do you observe from the ACF plots? Are the chains mixing well? Is there a lot of autocorrelation?

Answer

For the ACF interpretation, we would look for how quickly the autocorrelation drops off as the lag increases. If the autocorrelation drops off quickly (e.g., within a few lags), it suggests that the chain is mixing well and that the samples are relatively independent. However, if the autocorrelation remains high for many lags, it indicates that the chain is not mixing well and that there is a lot of dependence between samples, which can lead to a smaller effective sample size and less reliable estimates. Also, the blue dashed lines in the ACF plot represent the 95% confidence intervals for the autocorrelation. If the autocorrelation values are below these lines, it suggests that the autocorrelation is not significantly different from zero, which is a good sign for mixing. If many autocorrelation values are above these lines, it indicates significant autocorrelation and potential issues with mixing.

For our plots, chain for \(\theta\) has almost no autocorrelation, which is a good sign for mixing. However, the chain for \(\sigma^2\) shows some autocorrelation, especially at lower lags, which suggests that the samples for \(\sigma^2\) are not mixing as well as those for \(\theta\). This could lead to a smaller effective sample size for \(\sigma^2\) and may require more iterations or alternative sampling strategies to improve mixing.

Effective sample size

#install.packages("coda")
library(coda)
effectiveSize( PHI[,1] )

var1 
5000

effectiveSize(1/PHI[,2] )

    var1 
3818.773

As we suspected from the ACF plots, the effective sample size for \(\theta\) is the same as the total number of samples, which indicates that the samples for \(\theta\) are effectively independent. However, the effective sample size for \(\sigma^2\) is smaller than the total number of samples, which confirms that there is significant autocorrelation in the samples for \(\sigma^2\) and that they are not mixing well.

This means that we have fewer effective samples for \(\sigma^2\) than we do for \(\theta\), which can lead to less reliable estimates for \(\sigma^2\). Given that the \(S_{eff}\) for \(\sigma^2\) is around 4,000, we are estimating the posterior distribution of \(\sigma^2\) quite well.

In the next lecture, we will cover the semi-conjugate multivariate normal model and run the Gibbs sampler for that model, which will involve sampling from multivariate normal distributions and inverse-Wishart distributions. We will also discuss how to assess convergence and mixing in that more complex setting.