# Simple 3-state Markov chain
P <- matrix(c(0.7, 0.2, 0.1,
0.3, 0.4, 0.3,
0.2, 0.3, 0.5), nrow=3, byrow=TRUE)
# Simulate chain
n_steps <- 100
states <- numeric(n_steps)
states[1] <- 1
for(t in 2:n_steps) {
states[t] <- sample(1:3, 1, prob = P[states[t-1],])
}
df <- data.frame(t = 1:n_steps, state = states)
ggplot(df, aes(x = t, y = state)) +
geom_line(alpha = 0.3) +
geom_point(aes(color = factor(state)), size = 1.5) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen"),
name = "State") +
labs(x = "Time step t", y = "State",
title = "Sample path of a 3-state Markov chain") +
theme_minimal()Lecture 19: MCMC: Why Does It Work?
1 Introduction
What We Will Learn
In this lecture we will learn:
- Why MCMC produces valid samples from posterior distributions
- The mathematical foundations that guarantee convergence
- Detailed balance and why it ensures the correct stationary distribution
- Ergodic theorems that justify using sample averages for posterior expectations
- The connection between Metropolis-Hastings, Gibbs sampling, and general Markov chain theory
The Fundamental Problem
In Bayesian inference, we repeatedly face the same computational challenge: given a posterior distribution \[p(\theta | y) = \frac{p(y|\theta)p(\theta)}{\int p(y|\theta)p(\theta)d\theta}\]
we need to:
\[ \\[4cm] \]
The problem: The normalizing constant \(\int p(y|\theta)p(\theta)d\theta\) is often intractable, and direct sampling is impossible for complex posteriors.
What MCMC Promises
MCMC methods construct a sequence of random variables \(\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots\) such that:
- We only need to evaluate \(p(\theta|y)\) up to a normalizing constant
- The samples eventually “look like” draws from \(p(\theta|y)\)
- Sample averages converge to true expectations
Today’s Goal: Understand why these promises are kept.
Notation: Throughout this lecture, we will denote our target distribution by \(p_0(\theta)\). In Bayesian applications, \(p_0(\theta) = p(\theta|y)\), but the theory applies to any target distribution.
2 Markov Chain Fundamentals
2.1 Definition: Markov Chain
A sequence of random variables \(\{\theta^{(t)}\}_{t=0}^{\infty}\) is a Markov chain if:
\[ P(\theta^{(t+1)} = \theta | \theta^{(t)}, \theta^{(t-1)}, \ldots, \theta^{(0)}) = \underline{\hspace{6cm}} \]
\[ \\[5cm] \]
The dynamics of the chain are governed by the transition kernel \(K(\theta, \theta')\) which gives the probability (density) of moving from \(\theta\) to \(\theta'\): \[K(\theta, A) = P(\theta^{(t+1)} \in A | \theta^{(t)} = \theta)\]
For continuous state spaces, think of \(K(\theta,\theta')\) as the density of \(\theta^{(t+1)}\) given \(\theta^{(t)} = \theta\).
\[ \\[4cm] \]
2.2 Visualizing Markov Dynamics
\[ \\[4cm] \]
Key Properties We Need
For MCMC to work, our Markov chains must have these:
- Irreducibility: The chain can reach any state from any other state
- Aperiodicity: The chain doesn’t get stuck in deterministic cycles
- Stationarity: There exists a distribution \(p_0\) that is preserved by the chain
3 Stationary Distributions
3.1 Definition: Stationary Distribution
A distribution \(p_0\) is stationary (or invariant) for a Markov chain with kernel \(K\) if:
\[p_0(\theta') = \underline{\hspace{6cm}}\]
\[ \\[4cm] \]
# Compute stationary distribution of our 3-state chain
eigen_result <- eigen(t(P))
stationary <- eigen_result$vectors[,1]
stationary <- stationary / sum(stationary)
stationary <- Re(stationary)
# Show convergence to stationary distribution
n_chains <- 1000
n_steps <- 50
final_states <- matrix(0, nrow = n_chains, ncol = n_steps)
for(chain in 1:n_chains) {
state <- sample(1:3, 1) # Random start
for(t in 1:n_steps) {
final_states[chain, t] <- state
state <- sample(1:3, 1, prob = P[state,])
}
}
# Compute empirical distribution at each time
empirical <- data.frame()
for(t in 1:n_steps) {
props <- table(factor(final_states[,t], levels=1:3)) / n_chains
empirical <- rbind(empirical,
data.frame(t = t, state = 1:3, proportion = as.numeric(props)))
}
ggplot(empirical, aes(x = t, y = proportion, color = factor(state))) +
geom_line(linewidth = 1) +
geom_hline(aes(yintercept = stationary[1]), linetype = "dashed", alpha = 0.5) +
geom_hline(aes(yintercept = stationary[2]), linetype = "dashed", alpha = 0.5) +
geom_hline(aes(yintercept = stationary[3]), linetype = "dashed", alpha = 0.5) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen"),
name = "State") +
labs(x = "Time step", y = "Proportion of chains in state",
title = "Convergence to stationary distribution",
subtitle = "Dashed lines show stationary probabilities") +
theme_minimal()3.2 Detailed Balance: A Sufficient Condition
How do we ensure our chain has \(p_0\) as its stationary distribution?
If a transition kernel \(K\) satisfies detailed balance with respect to \(p_0\):
\[\underline{\hspace{4cm}} = \underline{\hspace{4cm}} \quad \text{for all } \theta, \theta'\]
then \(p_0\) is a stationary distribution for \(K\).
Interpretation: Detailed balance means the “probability flow” from \(\theta\) to \(\theta'\) equals the flow from \(\theta'\) to \(\theta\) under \(p_0\).
Proof of Detailed Balance Theorem
Proof:
\[ \\[6cm] \]
4 The Metropolis-Hastings Algorithm
4.1 Algorithm Construction
Given a target distribution \(p_0(\theta)\) (known up to normalizing constant) and a proposal distribution \(J(\theta'|\theta)\):
- Initialize \(\theta^{(0)}\)
- For \(t = 0, 1, 2, \ldots\):
- Propose \(\theta^* \sim J(\cdot|\theta^{(t)})\)
- Compute the ratio:
\[r = \frac{p_0(\theta^*)J(\theta^{(t)}|\theta^*)}{p_0(\theta^{(t)})J(\theta^*|\theta^{(t)})}\]
- Compute acceptance probability: \(\alpha = \min(1, r)\)
- With probability \(\alpha\), set \(\theta^{(t+1)} = \theta^*\); otherwise \(\theta^{(t+1)} = \theta^{(t)}\)
Key insight: If \(p_0(\theta) = p(\theta|y) \propto p(y|\theta)p(\theta)\), the normalizing constant cancels in \(r\)!
Why M-H Works: Detailed Balance Verification
Claim: The M-H kernel satisfies detailed balance with respect to \(p_0\).
Verification of Detailed Balance for M-H
\[ \\[8cm] \]
# Target: mixture of two normals
target <- function(x) {
0.3 * dnorm(x, -2, 0.8) + 0.7 * dnorm(x, 2, 1.2)
}
# M-H with normal proposal
mh_sample <- function(n_iter, proposal_sd = 1) {
samples <- numeric(n_iter)
samples[1] <- 0
accepted <- 0
for(t in 2:n_iter) {
current <- samples[t-1]
proposal <- rnorm(1, current, proposal_sd)
# Ratio r (symmetric proposal, so J cancels)
r <- target(proposal) / target(current)
alpha <- min(1, r)
if(runif(1) < alpha) {
samples[t] <- proposal
accepted <- accepted + 1
} else {
samples[t] <- current
}
}
list(samples = samples, acceptance_rate = accepted / (n_iter - 1))
}
result <- mh_sample(10000, proposal_sd = 2)
samples <- result$samples
# Plot
x_seq <- seq(-6, 6, length.out = 200)
df_target <- data.frame(x = x_seq, density = target(x_seq))
df_samples <- data.frame(x = samples[1001:10000]) # Discard burn-in
ggplot() +
geom_histogram(data = df_samples, aes(x = x, y = after_stat(density)),
bins = 50, fill = "steelblue", alpha = 0.6) +
geom_line(data = df_target, aes(x = x, y = density),
color = "coral", linewidth = 1.5) +
labs(x = expression(theta), y = "Density",
title = "Metropolis-Hastings samples vs. target distribution",
subtitle = paste("Acceptance rate:", round(result$acceptance_rate, 3))) +
theme_minimal()5 Ergodic Theorems: From Samples to Estimates
5.1 The Law of Large Numbers for Markov Chains
If \(\{\theta^{(t)}\}\) is an irreducible, aperiodic Markov chain with stationary distribution \(p_0\), then for any function \(g\) with \(E_{p_0}[|g(\theta)|] < \infty\):
\[\frac{1}{n}\sum_{t=1}^{n} g(\theta^{(t)}) \xrightarrow{a.s.} \underline{\hspace{5cm}}\]
regardless of the starting distribution.
This is the fundamental result that makes MCMC useful for Bayesian inference!
When \(p_0(\theta) = p(\theta|y)\), we have: \[\frac{1}{n}\sum_{t=1}^{n} g(\theta^{(t)}) \xrightarrow{a.s.} E[g(\theta)|y]\]
Visualizing Ergodicity
# Use samples from previous M-H run
# True mean of target
true_mean <- 0.3 * (-2) + 0.7 * 2 # = 0.8
# Compute running averages
n <- length(samples)
running_avg <- cumsum(samples) / (1:n)
df_ergodic <- data.frame(
t = 1:n,
running_avg = running_avg
)
ggplot(df_ergodic, aes(x = t, y = running_avg)) +
geom_line(color = "steelblue", alpha = 0.7) +
geom_hline(yintercept = true_mean, color = "coral", linewidth = 1, linetype = "dashed") +
annotate("text", x = 8000, y = true_mean + 0.15,
label = paste("True mean =", true_mean), color = "coral", size = 4) +
labs(x = "Number of iterations", y = "Running average",
title = "Ergodic theorem in action",
subtitle = expression(paste(bar(theta)[n], " converges to ", E[p[0]](theta)))) +
theme_minimal()Why Initial Values Don’t Matter (Eventually)
# Run multiple chains from different starting points
starting_points <- c(-10, -5, 0, 5, 10)
n_iter <- 2000
all_chains <- data.frame()
for(start in starting_points) {
samples_chain <- numeric(n_iter)
samples_chain[1] <- start
for(t in 2:n_iter) {
current <- samples_chain[t-1]
proposal <- rnorm(1, current, 2)
r <- target(proposal) / target(current)
alpha <- min(1, r)
samples_chain[t] <- ifelse(runif(1) < alpha, proposal, current)
}
all_chains <- rbind(all_chains,
data.frame(t = 1:n_iter, value = samples_chain,
start = factor(start)))
}
ggplot(all_chains, aes(x = t, y = value, color = start)) +
geom_line(alpha = 0.7) +
scale_color_viridis_d(name = "Starting\nvalue") +
labs(x = "Iteration", y = expression(theta),
title = "Convergence from different starting points",
subtitle = "All chains eventually explore the same distribution") +
theme_minimal()6 Irreducibility and Aperiodicity
Definition: Irreducibility
A Markov chain is irreducible if for any two states \(\theta\) and \(\theta'\), there exists \(n > 0\) such that
\[ \\[3cm] \]
For M-H: If the proposal \(J(\theta'|\theta) > 0\) for all \(\theta, \theta'\) (e.g., Gaussian proposals), and the target \(p_0(\theta) > 0\) everywhere on its support, then the chain is irreducible.
Definition: Aperiodicity
The period of state \(\theta\) is:
\[ d_\theta = \underline{\hspace{5cm}} \]
\[ \\[3cm] \]
For M-H: The rejection mechanism guarantees aperiodicity! If \(\theta^*\) is proposed and rejected, we stay at \(\theta^{(t)}\), so \(K(\theta, \theta) > 0\) for states where proposals might be rejected.
# Example: deterministic cycle (periodic) vs. random (aperiodic)
n <- 30
# Periodic chain (period 3)
periodic <- numeric(n)
periodic[1] <- 1
for(t in 2:n) {
periodic[t] <- (periodic[t-1] %% 3) + 1
}
# Aperiodic chain
P_aperiodic <- matrix(c(0.5, 0.3, 0.2,
0.2, 0.5, 0.3,
0.3, 0.2, 0.5), nrow=3, byrow=TRUE)
aperiodic <- numeric(n)
aperiodic[1] <- 1
for(t in 2:n) {
aperiodic[t] <- sample(1:3, 1, prob = P_aperiodic[aperiodic[t-1],])
}
df_periodic <- data.frame(t = 1:n, state = periodic, type = "Periodic (d=3)")
df_aperiodic <- data.frame(t = 1:n, state = aperiodic, type = "Aperiodic (d=1)")
df_both <- rbind(df_periodic, df_aperiodic)
ggplot(df_both, aes(x = t, y = state)) +
geom_line(alpha = 0.5) +
geom_point(aes(color = factor(state)), size = 2) +
facet_wrap(~type) +
scale_color_manual(values = c("steelblue", "coral", "forestgreen")) +
labs(x = "Time", y = "State", color = "State",
title = "Periodic vs. Aperiodic Markov Chains") +
theme_minimal() +
theme(legend.position = "none")7 Convergence Theory
7.1 Total Variation Distance
The total variation distance between distributions \(\mu\) and \(\nu\) is:
\[\|\mu - \nu\|_{TV} = \underline{\hspace{6cm}}\] \[ \\[4cm] \]
7.2 Main Convergence Theorem
If a Markov chain is irreducible and aperiodic with stationary distribution \(p_0\), then for any starting distribution \(\mu\): \[\|K^n(\theta, \cdot) - p_0\|_{TV} \to 0 \text{ as } n \to \infty\]
If the chain is also positive recurrent (guaranteed for finite state spaces or under mild conditions), then \(p_0\) is unique.
\[ \\[3cm] \]
7.3 Geometric Ergodicity
For practical MCMC, we often need stronger convergence guarantees:
A chain is geometrically ergodic if there exist \(M < \infty\) and \(\rho < 1\) such that: \[\|K^n(\theta, \cdot) - p_0\|_{TV} \leq M(\theta)\rho^n\]
This ensures exponentially fast convergence and is needed for:
- Central Limit Theorems for MCMC
- Valid standard error estimates
# Illustrate convergence using our discrete chain
P_matrix <- matrix(c(0.7, 0.2, 0.1,
0.3, 0.4, 0.3,
0.2, 0.3, 0.5), nrow=3, byrow=TRUE)
# Stationary distribution
eigen_result <- eigen(t(P_matrix))
p0_stat <- Re(eigen_result$vectors[,1])
p0_stat <- p0_stat / sum(p0_stat)
# Compute TV distance over iterations
n_iter <- 30
tv_distances <- data.frame()
for(start in 1:3) {
mu <- rep(0, 3)
mu[start] <- 1 # Start in state 'start'
for(n in 1:n_iter) {
mu <- mu %*% P_matrix
tv_dist <- 0.5 * sum(abs(mu - p0_stat))
tv_distances <- rbind(tv_distances,
data.frame(n = n, tv = tv_dist, start = factor(start)))
}
}
ggplot(tv_distances, aes(x = n, y = tv, color = start)) +
geom_line(linewidth = 1) +
geom_hline(yintercept = 0, linetype = "dashed") +
scale_y_log10() +
scale_color_manual(values = c("steelblue", "coral", "forestgreen"),
name = "Starting\nstate") +
labs(x = "Number of transitions n", y = "Total variation distance (log scale)",
title = "Convergence to stationary distribution",
subtitle = expression(paste("||", K^n, "(θ,·) - p"[0], "||"[TV], " → 0 exponentially"))) +
theme_minimal()Warning in scale_y_log10(): log-10 transformation introduced infinite values.
8 The Gibbs Sampler as a Special Case
Gibbs Sampling
For \(\theta = (\theta_1, \ldots, \theta_p)\), the Gibbs sampler cycles through: \[ \theta_j^{(t+1)} \sim p\left(\theta_j \,\Big|\, \theta_{<j}^{(t+1)}, \theta_{>j}^{(t)}\right) \]
Why Gibbs is a Special Case of M-H
The Gibbs sampler is M-H with a special proposal: propose \(\theta_j^*\) from the full conditional \(p(\theta_j | \theta_{-j})\).
HW Q: the acceptance probability is always 1.
Conclusion: Every proposal is accepted in Gibbs sampling because \(r = 1\)
# Target: bivariate normal with correlation
rho <- 0.8
# Gibbs sampler - storing intermediate steps to show axis-aligned moves
gibbs_bivariate_with_path <- function(n_iter, rho) {
# Store both the samples and the full path (including intermediate steps)
samples <- matrix(0, n_iter, 2)
samples[1,] <- c(0, 0)
# Path will have 2 points per iteration (after theta1 update, after theta2 update)
path <- matrix(0, n_iter * 2, 2)
path[1,] <- samples[1,]
path[2,] <- samples[1,]
for(t in 2:n_iter) {
# Current position
current <- samples[t-1,]
# Step 1: Sample theta1 | theta2 (horizontal move)
new_theta1 <- rnorm(1, rho * current[2], sqrt(1 - rho^2))
intermediate <- c(new_theta1, current[2])
# Step 2: Sample theta2 | theta1 (vertical move)
new_theta2 <- rnorm(1, rho * new_theta1, sqrt(1 - rho^2))
final <- c(new_theta1, new_theta2)
samples[t,] <- final
path[(t-1)*2 + 1,] <- intermediate # After horizontal move
path[(t-1)*2 + 2,] <- final # After vertical move
}
list(samples = samples, path = path)
}
result_gibbs <- gibbs_bivariate_with_path(2000, rho)
samples_gibbs <- result_gibbs$samples
path_gibbs <- result_gibbs$path
df_gibbs <- data.frame(x = samples_gibbs[,1], y = samples_gibbs[,2])
# Show first 30 iterations as path (60 path points = 30 complete iterations)
n_show <- 30
df_path <- data.frame(
x = path_gibbs[1:(n_show*2), 1],
y = path_gibbs[1:(n_show*2), 2],
t = 1:(n_show*2)
)
# Also mark the iteration endpoints
df_points <- data.frame(
x = samples_gibbs[1:n_show, 1],
y = samples_gibbs[1:n_show, 2],
t = 1:n_show
)
ggplot() +
stat_density_2d(data = df_gibbs, aes(x = x, y = y, fill = after_stat(level)),
geom = "polygon", alpha = 0.5) +
scale_fill_viridis_c() +
geom_path(data = df_path, aes(x = x, y = y),
color = "coral", alpha = 0.8, linewidth = 0.5) +
geom_point(data = df_points, aes(x = x, y = y),
color = "coral", size = 1.5) +
geom_point(data = df_points[1,], aes(x = x, y = y),
color = "red", size = 3, shape = 17) +
labs(x = expression(theta[1]), y = expression(theta[2]),
title = "Gibbs sampler: bivariate normal",
subtitle = paste("ρ =", rho, "— Note the axis-aligned (horizontal/vertical) moves")) +
theme_minimal() +
theme(legend.position = "none") +
coord_fixed()9 Practical Diagnostics
Checking Convergence
Since we can’t compute \(\|K^n - p_0\|_{TV}\) directly, we use heuristic diagnostics:
Trace Plots
# Multiple chains for the mixture target
n_chains <- 4
n_iter <- 2000
all_traces <- data.frame()
for(chain in 1:n_chains) {
samples_tr <- numeric(n_iter)
samples_tr[1] <- runif(1, -5, 5)
for(t in 2:n_iter) {
proposal <- rnorm(1, samples_tr[t-1], 2)
r <- target(proposal) / target(samples_tr[t-1])
alpha <- min(1, r)
samples_tr[t] <- ifelse(runif(1) < alpha, proposal, samples_tr[t-1])
}
all_traces <- rbind(all_traces,
data.frame(t = 1:n_iter, value = samples_tr,
chain = factor(chain)))
}
ggplot(all_traces, aes(x = t, y = value, color = chain)) +
geom_line(alpha = 0.7) +
scale_color_viridis_d() +
labs(x = "Iteration", y = expression(theta), color = "Chain",
title = "Trace plots: 4 chains from random starts") +
theme_minimal()Gelman-Rubin Diagnostic (\(\hat{R}\))
Compares within-chain and between-chain variance. Want \(\hat{R} < 1.1\).
# Simple R-hat calculation
compute_rhat <- function(chains) {
n <- nrow(chains)
m <- ncol(chains)
chain_means <- colMeans(chains)
overall_mean <- mean(chains)
# Between-chain variance
B <- n * var(chain_means)
# Within-chain variance
W <- mean(apply(chains, 2, var))
# Pooled variance estimate
var_hat <- (n-1)/n * W + B/n
# R-hat
sqrt(var_hat / W)
}
# Convert to matrix
chain_matrix <- matrix(all_traces$value, ncol = n_chains)
rhat <- compute_rhat(chain_matrix[1001:2000,]) # After burn-in
cat("R-hat:", round(rhat, 3), "\n")R-hat: 1.002
Effective Sample Size
# ACF for one chain
one_chain <- all_traces$value[all_traces$chain == 1][1001:2000]
acf_values <- acf(one_chain, lag.max = 50, plot = FALSE)
df_acf <- data.frame(lag = acf_values$lag, acf = acf_values$acf)
ggplot(df_acf, aes(x = lag, y = acf)) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_segment(aes(xend = lag, yend = 0), color = "steelblue", linewidth = 1) +
geom_point(color = "steelblue", size = 2) +
geom_hline(yintercept = c(-1.96/sqrt(1000), 1.96/sqrt(1000)),
linetype = "dotted", color = "red") +
labs(x = "Lag", y = "Autocorrelation",
title = "Autocorrelation function of MCMC samples",
subtitle = "High autocorrelation → lower effective sample size") +
theme_minimal()10 Conclusions and Main Takeaways
The Logical Chain: Why MCMC Works
Design: Metropolis-Hastings constructs a kernel satisfying detailed balance with target \(p_0\) \[p_0(\theta)K(\theta,\theta') = p_0(\theta')K(\theta',\theta)\]
Stationarity: Detailed balance implies \(p_0\) is the stationary distribution \[\int K(\theta,\theta')p_0(\theta)d\theta = p_0(\theta')\]
Regularity: Sensible proposals ensure irreducibility and aperiodicity
Convergence: Fundamental theorem guarantees \(\theta^{(t)} \xrightarrow{d} p_0\)
Utility: Ergodic theorem justifies using sample averages for posterior expectations \[\frac{1}{n}\sum_{t=1}^{n} g(\theta^{(t)}) \xrightarrow{a.s.} E_{p_0}[g(\theta)]\]
Key Takeaways
We don’t need the normalizing constant because it cancels in the ratio \(r\)
Starting values don’t matter (asymptotically) due to the convergence theorem
Sample averages work due to the ergodic theorem
Finite samples require care: burn-in, thinning, diagnostics
The theory guarantees eventual correctness, but practical performance depends on:
- Proposal tuning
- Chain length
- Monitoring convergence
Key Definitions Summary
| Concept | Definition |
|---|---|
| Markov Chain | \(P(\theta^{(t+1)}|\theta^{(t)}, \ldots, \theta^{(0)}) = P(\theta^{(t+1)}|\theta^{(t)})\) |
| Stationary Distribution | \(p_0(\theta') = \int K(\theta,\theta')p_0(\theta)d\theta\) |
| Detailed Balance | \(p_0(\theta)K(\theta,\theta') = p_0(\theta')K(\theta',\theta)\) |
| Irreducibility | All states reachable from all other states |
| Aperiodicity | No deterministic cycles (period = 1) |
| M-H Ratio | \(r = \frac{p_0(\theta^*)J(\theta|\theta^*)}{p_0(\theta)J(\theta^*|\theta)}\) |