Lecture 19: MCMC: Why Does It Work?

1 Introduction

What We Will Learn

In this lecture we will learn:

  1. Why MCMC produces valid samples from posterior distributions
  2. The mathematical foundations that guarantee convergence
  3. Detailed balance and why it ensures the correct stationary distribution
  4. Ergodic theorems that justify using sample averages for posterior expectations
  5. The connection between Metropolis-Hastings, Gibbs sampling, and general Markov chain theory

The Fundamental Problem

In Bayesian inference, we repeatedly face the same computational challenge: given a posterior distribution \[p(\theta | y) = \frac{p(y|\theta)p(\theta)}{\int p(y|\theta)p(\theta)d\theta}\]

we need to:

\[ \\[4cm] \]

The problem: The normalizing constant \(\int p(y|\theta)p(\theta)d\theta\) is often intractable, and direct sampling is impossible for complex posteriors.

What MCMC Promises

MCMC methods construct a sequence of random variables \(\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \ldots\) such that:

  1. We only need to evaluate \(p(\theta|y)\) up to a normalizing constant
  2. The samples eventually “look like” draws from \(p(\theta|y)\)
  3. Sample averages converge to true expectations

Today’s Goal: Understand why these promises are kept.

Notation: Throughout this lecture, we will denote our target distribution by \(p_0(\theta)\). In Bayesian applications, \(p_0(\theta) = p(\theta|y)\), but the theory applies to any target distribution.

2 Markov Chain Fundamentals

2.1 Definition: Markov Chain

Definition

A sequence of random variables \(\{\theta^{(t)}\}_{t=0}^{\infty}\) is a Markov chain if:

\[ P(\theta^{(t+1)} = \theta | \theta^{(t)}, \theta^{(t-1)}, \ldots, \theta^{(0)}) = \underline{\hspace{6cm}} \]

\[ \\[5cm] \]

The dynamics of the chain are governed by the transition kernel \(K(\theta, \theta')\) which gives the probability (density) of moving from \(\theta\) to \(\theta'\): \[K(\theta, A) = P(\theta^{(t+1)} \in A | \theta^{(t)} = \theta)\]

For continuous state spaces, think of \(K(\theta,\theta')\) as the density of \(\theta^{(t+1)}\) given \(\theta^{(t)} = \theta\).

\[ \\[4cm] \]

2.2 Visualizing Markov Dynamics

# Simple 3-state Markov chain
P <- matrix(c(0.7, 0.2, 0.1,
              0.3, 0.4, 0.3,
              0.2, 0.3, 0.5), nrow=3, byrow=TRUE)

# Simulate chain
n_steps <- 100
states <- numeric(n_steps)
states[1] <- 1

for(t in 2:n_steps) {
  states[t] <- sample(1:3, 1, prob = P[states[t-1],])
}

df <- data.frame(t = 1:n_steps, state = states)

ggplot(df, aes(x = t, y = state)) +
  geom_line(alpha = 0.3) +
  geom_point(aes(color = factor(state)), size = 1.5) +
  scale_color_manual(values = c("steelblue", "coral", "forestgreen"),
                     name = "State") +
  labs(x = "Time step t", y = "State", 
       title = "Sample path of a 3-state Markov chain") +
  theme_minimal()

\[ \\[4cm] \]

Key Properties We Need

For MCMC to work, our Markov chains must have these:

Three Essential Properties
  1. Irreducibility: The chain can reach any state from any other state
  2. Aperiodicity: The chain doesn’t get stuck in deterministic cycles
  3. Stationarity: There exists a distribution \(p_0\) that is preserved by the chain

3 Stationary Distributions

3.1 Definition: Stationary Distribution

Definition

A distribution \(p_0\) is stationary (or invariant) for a Markov chain with kernel \(K\) if:

\[p_0(\theta') = \underline{\hspace{6cm}}\]

\[ \\[4cm] \]

# Compute stationary distribution of our 3-state chain
eigen_result <- eigen(t(P))
stationary <- eigen_result$vectors[,1]
stationary <- stationary / sum(stationary)
stationary <- Re(stationary)

# Show convergence to stationary distribution
n_chains <- 1000
n_steps <- 50
final_states <- matrix(0, nrow = n_chains, ncol = n_steps)

for(chain in 1:n_chains) {
  state <- sample(1:3, 1)  # Random start
  for(t in 1:n_steps) {
    final_states[chain, t] <- state
    state <- sample(1:3, 1, prob = P[state,])
  }
}

# Compute empirical distribution at each time
empirical <- data.frame()
for(t in 1:n_steps) {
  props <- table(factor(final_states[,t], levels=1:3)) / n_chains
  empirical <- rbind(empirical, 
                     data.frame(t = t, state = 1:3, proportion = as.numeric(props)))
}

ggplot(empirical, aes(x = t, y = proportion, color = factor(state))) +
  geom_line(linewidth = 1) +
  geom_hline(aes(yintercept = stationary[1]), linetype = "dashed", alpha = 0.5) +
  geom_hline(aes(yintercept = stationary[2]), linetype = "dashed", alpha = 0.5) +
  geom_hline(aes(yintercept = stationary[3]), linetype = "dashed", alpha = 0.5) +
  scale_color_manual(values = c("steelblue", "coral", "forestgreen"),
                     name = "State") +
  labs(x = "Time step", y = "Proportion of chains in state",
       title = "Convergence to stationary distribution",
       subtitle = "Dashed lines show stationary probabilities") +
  theme_minimal()

Stationary distribution: starting from p₀ preserves p₀

3.2 Detailed Balance: A Sufficient Condition

How do we ensure our chain has \(p_0\) as its stationary distribution?

Theorem: Detailed Balance (Reversibility)

If a transition kernel \(K\) satisfies detailed balance with respect to \(p_0\):

\[\underline{\hspace{4cm}} = \underline{\hspace{4cm}} \quad \text{for all } \theta, \theta'\]

then \(p_0\) is a stationary distribution for \(K\).

Interpretation: Detailed balance means the “probability flow” from \(\theta\) to \(\theta'\) equals the flow from \(\theta'\) to \(\theta\) under \(p_0\).

Proof of Detailed Balance Theorem

Proof:

\[ \\[6cm] \]

Detailed balance: probability flows balance at equilibrium

4 The Metropolis-Hastings Algorithm

4.1 Algorithm Construction

Given a target distribution \(p_0(\theta)\) (known up to normalizing constant) and a proposal distribution \(J(\theta'|\theta)\):

Metropolis-Hastings Algorithm
  1. Initialize \(\theta^{(0)}\)
  2. For \(t = 0, 1, 2, \ldots\):
    1. Propose \(\theta^* \sim J(\cdot|\theta^{(t)})\)
    2. Compute the ratio:

\[r = \frac{p_0(\theta^*)J(\theta^{(t)}|\theta^*)}{p_0(\theta^{(t)})J(\theta^*|\theta^{(t)})}\]

  1. Compute acceptance probability: \(\alpha = \min(1, r)\)
  2. With probability \(\alpha\), set \(\theta^{(t+1)} = \theta^*\); otherwise \(\theta^{(t+1)} = \theta^{(t)}\)

Key insight: If \(p_0(\theta) = p(\theta|y) \propto p(y|\theta)p(\theta)\), the normalizing constant cancels in \(r\)!

Why M-H Works: Detailed Balance Verification

Important

Claim: The M-H kernel satisfies detailed balance with respect to \(p_0\).

Verification of Detailed Balance for M-H

\[ \\[8cm] \]

# Target: mixture of two normals
target <- function(x) {
  0.3 * dnorm(x, -2, 0.8) + 0.7 * dnorm(x, 2, 1.2)
}

# M-H with normal proposal
mh_sample <- function(n_iter, proposal_sd = 1) {
  samples <- numeric(n_iter)
  samples[1] <- 0
  accepted <- 0
  
  for(t in 2:n_iter) {
    current <- samples[t-1]
    proposal <- rnorm(1, current, proposal_sd)
    
    # Ratio r (symmetric proposal, so J cancels)
    r <- target(proposal) / target(current)
    alpha <- min(1, r)
    
    if(runif(1) < alpha) {
      samples[t] <- proposal
      accepted <- accepted + 1
    } else {
      samples[t] <- current
    }
  }
  
  list(samples = samples, acceptance_rate = accepted / (n_iter - 1))
}

result <- mh_sample(10000, proposal_sd = 2)
samples <- result$samples

# Plot
x_seq <- seq(-6, 6, length.out = 200)
df_target <- data.frame(x = x_seq, density = target(x_seq))
df_samples <- data.frame(x = samples[1001:10000])  # Discard burn-in

ggplot() +
  geom_histogram(data = df_samples, aes(x = x, y = after_stat(density)), 
                 bins = 50, fill = "steelblue", alpha = 0.6) +
  geom_line(data = df_target, aes(x = x, y = density), 
            color = "coral", linewidth = 1.5) +
  labs(x = expression(theta), y = "Density",
       title = "Metropolis-Hastings samples vs. target distribution",
       subtitle = paste("Acceptance rate:", round(result$acceptance_rate, 3))) +
  theme_minimal()

Metropolis-Hastings sampling from a mixture of normals

5 Ergodic Theorems: From Samples to Estimates

5.1 The Law of Large Numbers for Markov Chains

Ergodic Theorem

If \(\{\theta^{(t)}\}\) is an irreducible, aperiodic Markov chain with stationary distribution \(p_0\), then for any function \(g\) with \(E_{p_0}[|g(\theta)|] < \infty\):

\[\frac{1}{n}\sum_{t=1}^{n} g(\theta^{(t)}) \xrightarrow{a.s.} \underline{\hspace{5cm}}\]

regardless of the starting distribution.

This is the fundamental result that makes MCMC useful for Bayesian inference!

When \(p_0(\theta) = p(\theta|y)\), we have: \[\frac{1}{n}\sum_{t=1}^{n} g(\theta^{(t)}) \xrightarrow{a.s.} E[g(\theta)|y]\]

Visualizing Ergodicity

# Use samples from previous M-H run
# True mean of target
true_mean <- 0.3 * (-2) + 0.7 * 2  # = 0.8

# Compute running averages
n <- length(samples)
running_avg <- cumsum(samples) / (1:n)

df_ergodic <- data.frame(
  t = 1:n,
  running_avg = running_avg
)

ggplot(df_ergodic, aes(x = t, y = running_avg)) +
  geom_line(color = "steelblue", alpha = 0.7) +
  geom_hline(yintercept = true_mean, color = "coral", linewidth = 1, linetype = "dashed") +
  annotate("text", x = 8000, y = true_mean + 0.15, 
           label = paste("True mean =", true_mean), color = "coral", size = 4) +
  labs(x = "Number of iterations", y = "Running average",
       title = "Ergodic theorem in action",
       subtitle = expression(paste(bar(theta)[n], " converges to ", E[p[0]](theta)))) +
  theme_minimal()

Ergodic averages converging to true expectation

Why Initial Values Don’t Matter (Eventually)

# Run multiple chains from different starting points
starting_points <- c(-10, -5, 0, 5, 10)
n_iter <- 2000
all_chains <- data.frame()

for(start in starting_points) {
  samples_chain <- numeric(n_iter)
  samples_chain[1] <- start
  
  for(t in 2:n_iter) {
    current <- samples_chain[t-1]
    proposal <- rnorm(1, current, 2)
    r <- target(proposal) / target(current)
    alpha <- min(1, r)
    samples_chain[t] <- ifelse(runif(1) < alpha, proposal, current)
  }
  
  all_chains <- rbind(all_chains, 
                      data.frame(t = 1:n_iter, value = samples_chain, 
                                 start = factor(start)))
}

ggplot(all_chains, aes(x = t, y = value, color = start)) +
  geom_line(alpha = 0.7) +
  scale_color_viridis_d(name = "Starting\nvalue") +
  labs(x = "Iteration", y = expression(theta),
       title = "Convergence from different starting points",
       subtitle = "All chains eventually explore the same distribution") +
  theme_minimal()

Chains from different starting points converge to the same distribution

6 Irreducibility and Aperiodicity

Definition: Irreducibility

Definition

A Markov chain is irreducible if for any two states \(\theta\) and \(\theta'\), there exists \(n > 0\) such that

\[ \\[3cm] \]

For M-H: If the proposal \(J(\theta'|\theta) > 0\) for all \(\theta, \theta'\) (e.g., Gaussian proposals), and the target \(p_0(\theta) > 0\) everywhere on its support, then the chain is irreducible.

Definition: Aperiodicity

Definition

The period of state \(\theta\) is:

\[ d_\theta = \underline{\hspace{5cm}} \]

\[ \\[3cm] \]

For M-H: The rejection mechanism guarantees aperiodicity! If \(\theta^*\) is proposed and rejected, we stay at \(\theta^{(t)}\), so \(K(\theta, \theta) > 0\) for states where proposals might be rejected.

# Example: deterministic cycle (periodic) vs. random (aperiodic)
n <- 30

# Periodic chain (period 3)
periodic <- numeric(n)
periodic[1] <- 1
for(t in 2:n) {
  periodic[t] <- (periodic[t-1] %% 3) + 1
}

# Aperiodic chain 
P_aperiodic <- matrix(c(0.5, 0.3, 0.2,
                        0.2, 0.5, 0.3,
                        0.3, 0.2, 0.5), nrow=3, byrow=TRUE)
aperiodic <- numeric(n)
aperiodic[1] <- 1
for(t in 2:n) {
  aperiodic[t] <- sample(1:3, 1, prob = P_aperiodic[aperiodic[t-1],])
}

df_periodic <- data.frame(t = 1:n, state = periodic, type = "Periodic (d=3)")
df_aperiodic <- data.frame(t = 1:n, state = aperiodic, type = "Aperiodic (d=1)")
df_both <- rbind(df_periodic, df_aperiodic)

ggplot(df_both, aes(x = t, y = state)) +
  geom_line(alpha = 0.5) +
  geom_point(aes(color = factor(state)), size = 2) +
  facet_wrap(~type) +
  scale_color_manual(values = c("steelblue", "coral", "forestgreen")) +
  labs(x = "Time", y = "State", color = "State",
       title = "Periodic vs. Aperiodic Markov Chains") +
  theme_minimal() +
  theme(legend.position = "none")

7 Convergence Theory

7.1 Total Variation Distance

Definition

The total variation distance between distributions \(\mu\) and \(\nu\) is:

\[\|\mu - \nu\|_{TV} = \underline{\hspace{6cm}}\] \[ \\[4cm] \]

7.2 Main Convergence Theorem

Fundamental Theorem of Markov Chains

If a Markov chain is irreducible and aperiodic with stationary distribution \(p_0\), then for any starting distribution \(\mu\): \[\|K^n(\theta, \cdot) - p_0\|_{TV} \to 0 \text{ as } n \to \infty\]

If the chain is also positive recurrent (guaranteed for finite state spaces or under mild conditions), then \(p_0\) is unique.

\[ \\[3cm] \]

7.3 Geometric Ergodicity

For practical MCMC, we often need stronger convergence guarantees:

Definition: Geometric Ergodicity

A chain is geometrically ergodic if there exist \(M < \infty\) and \(\rho < 1\) such that: \[\|K^n(\theta, \cdot) - p_0\|_{TV} \leq M(\theta)\rho^n\]

This ensures exponentially fast convergence and is needed for:

  • Central Limit Theorems for MCMC
  • Valid standard error estimates
# Illustrate convergence using our discrete chain
P_matrix <- matrix(c(0.7, 0.2, 0.1,
                     0.3, 0.4, 0.3,
                     0.2, 0.3, 0.5), nrow=3, byrow=TRUE)

# Stationary distribution
eigen_result <- eigen(t(P_matrix))
p0_stat <- Re(eigen_result$vectors[,1])
p0_stat <- p0_stat / sum(p0_stat)

# Compute TV distance over iterations
n_iter <- 30
tv_distances <- data.frame()

for(start in 1:3) {
  mu <- rep(0, 3)
  mu[start] <- 1  # Start in state 'start'
  
  for(n in 1:n_iter) {
    mu <- mu %*% P_matrix
    tv_dist <- 0.5 * sum(abs(mu - p0_stat))
    tv_distances <- rbind(tv_distances,
                          data.frame(n = n, tv = tv_dist, start = factor(start)))
  }
}

ggplot(tv_distances, aes(x = n, y = tv, color = start)) +
  geom_line(linewidth = 1) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_y_log10() +
  scale_color_manual(values = c("steelblue", "coral", "forestgreen"),
                     name = "Starting\nstate") +
  labs(x = "Number of transitions n", y = "Total variation distance (log scale)",
       title = "Convergence to stationary distribution",
       subtitle = expression(paste("||", K^n, "(θ,·) - p"[0], "||"[TV], " → 0 exponentially"))) +
  theme_minimal()
Warning in scale_y_log10(): log-10 transformation introduced infinite values.

Convergence in total variation distance (illustration)

8 The Gibbs Sampler as a Special Case

Gibbs Sampling

For \(\theta = (\theta_1, \ldots, \theta_p)\), the Gibbs sampler cycles through: \[ \theta_j^{(t+1)} \sim p\left(\theta_j \,\Big|\, \theta_{<j}^{(t+1)}, \theta_{>j}^{(t)}\right) \]

Why Gibbs is a Special Case of M-H

The Gibbs sampler is M-H with a special proposal: propose \(\theta_j^*\) from the full conditional \(p(\theta_j | \theta_{-j})\).

HW Q: the acceptance probability is always 1.

Conclusion: Every proposal is accepted in Gibbs sampling because \(r = 1\)

# Target: bivariate normal with correlation
rho <- 0.8

# Gibbs sampler - storing intermediate steps to show axis-aligned moves
gibbs_bivariate_with_path <- function(n_iter, rho) {
  # Store both the samples and the full path (including intermediate steps)
  samples <- matrix(0, n_iter, 2)
  samples[1,] <- c(0, 0)
  
  # Path will have 2 points per iteration (after theta1 update, after theta2 update)
  path <- matrix(0, n_iter * 2, 2)
  path[1,] <- samples[1,]
  path[2,] <- samples[1,]
  
  for(t in 2:n_iter) {
    # Current position
    current <- samples[t-1,]
    
    # Step 1: Sample theta1 | theta2 (horizontal move)
    new_theta1 <- rnorm(1, rho * current[2], sqrt(1 - rho^2))
    intermediate <- c(new_theta1, current[2])
    
    # Step 2: Sample theta2 | theta1 (vertical move)
    new_theta2 <- rnorm(1, rho * new_theta1, sqrt(1 - rho^2))
    final <- c(new_theta1, new_theta2)
    
    samples[t,] <- final
    path[(t-1)*2 + 1,] <- intermediate  # After horizontal move
    path[(t-1)*2 + 2,] <- final         # After vertical move
  }
  
  list(samples = samples, path = path)
}

result_gibbs <- gibbs_bivariate_with_path(2000, rho)
samples_gibbs <- result_gibbs$samples
path_gibbs <- result_gibbs$path

df_gibbs <- data.frame(x = samples_gibbs[,1], y = samples_gibbs[,2])

# Show first 30 iterations as path (60 path points = 30 complete iterations)
n_show <- 30
df_path <- data.frame(
  x = path_gibbs[1:(n_show*2), 1], 
  y = path_gibbs[1:(n_show*2), 2],
  t = 1:(n_show*2)
)

# Also mark the iteration endpoints
df_points <- data.frame(
  x = samples_gibbs[1:n_show, 1],
  y = samples_gibbs[1:n_show, 2],
  t = 1:n_show
)

ggplot() +
  stat_density_2d(data = df_gibbs, aes(x = x, y = y, fill = after_stat(level)),
                  geom = "polygon", alpha = 0.5) +
  scale_fill_viridis_c() +
  geom_path(data = df_path, aes(x = x, y = y), 
            color = "coral", alpha = 0.8, linewidth = 0.5) +
  geom_point(data = df_points, aes(x = x, y = y), 
             color = "coral", size = 1.5) +
  geom_point(data = df_points[1,], aes(x = x, y = y),
             color = "red", size = 3, shape = 17) +
  labs(x = expression(theta[1]), y = expression(theta[2]),
       title = "Gibbs sampler: bivariate normal",
       subtitle = paste("ρ =", rho, "— Note the axis-aligned (horizontal/vertical) moves")) +
  theme_minimal() +
  theme(legend.position = "none") +
  coord_fixed()

Gibbs sampler for a bivariate normal

9 Practical Diagnostics

Checking Convergence

Since we can’t compute \(\|K^n - p_0\|_{TV}\) directly, we use heuristic diagnostics:

Trace Plots

# Multiple chains for the mixture target
n_chains <- 4
n_iter <- 2000

all_traces <- data.frame()
for(chain in 1:n_chains) {
  samples_tr <- numeric(n_iter)
  samples_tr[1] <- runif(1, -5, 5)
  
  for(t in 2:n_iter) {
    proposal <- rnorm(1, samples_tr[t-1], 2)
    r <- target(proposal) / target(samples_tr[t-1])
    alpha <- min(1, r)
    samples_tr[t] <- ifelse(runif(1) < alpha, proposal, samples_tr[t-1])
  }
  
  all_traces <- rbind(all_traces,
                      data.frame(t = 1:n_iter, value = samples_tr, 
                                 chain = factor(chain)))
}

ggplot(all_traces, aes(x = t, y = value, color = chain)) +
  geom_line(alpha = 0.7) +
  scale_color_viridis_d() +
  labs(x = "Iteration", y = expression(theta), color = "Chain",
       title = "Trace plots: 4 chains from random starts") +
  theme_minimal()

Trace plots for assessing convergence

Gelman-Rubin Diagnostic (\(\hat{R}\))

Compares within-chain and between-chain variance. Want \(\hat{R} < 1.1\).

# Simple R-hat calculation
compute_rhat <- function(chains) {
  n <- nrow(chains)
  m <- ncol(chains)
  
  chain_means <- colMeans(chains)
  overall_mean <- mean(chains)
  
  # Between-chain variance
  B <- n * var(chain_means)
  
  # Within-chain variance
  W <- mean(apply(chains, 2, var))
  
  # Pooled variance estimate
  var_hat <- (n-1)/n * W + B/n
  
  # R-hat
  sqrt(var_hat / W)
}

# Convert to matrix
chain_matrix <- matrix(all_traces$value, ncol = n_chains)
rhat <- compute_rhat(chain_matrix[1001:2000,])  # After burn-in
cat("R-hat:", round(rhat, 3), "\n")
R-hat: 1.002 

Effective Sample Size

# ACF for one chain
one_chain <- all_traces$value[all_traces$chain == 1][1001:2000]

acf_values <- acf(one_chain, lag.max = 50, plot = FALSE)
df_acf <- data.frame(lag = acf_values$lag, acf = acf_values$acf)

ggplot(df_acf, aes(x = lag, y = acf)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_segment(aes(xend = lag, yend = 0), color = "steelblue", linewidth = 1) +
  geom_point(color = "steelblue", size = 2) +
  geom_hline(yintercept = c(-1.96/sqrt(1000), 1.96/sqrt(1000)), 
             linetype = "dotted", color = "red") +
  labs(x = "Lag", y = "Autocorrelation",
       title = "Autocorrelation function of MCMC samples",
       subtitle = "High autocorrelation → lower effective sample size") +
  theme_minimal()

Autocorrelation function - key to understanding effective sample size

10 Conclusions and Main Takeaways

The Logical Chain: Why MCMC Works

The Complete Picture
  1. Design: Metropolis-Hastings constructs a kernel satisfying detailed balance with target \(p_0\) \[p_0(\theta)K(\theta,\theta') = p_0(\theta')K(\theta',\theta)\]

  2. Stationarity: Detailed balance implies \(p_0\) is the stationary distribution \[\int K(\theta,\theta')p_0(\theta)d\theta = p_0(\theta')\]

  3. Regularity: Sensible proposals ensure irreducibility and aperiodicity

  4. Convergence: Fundamental theorem guarantees \(\theta^{(t)} \xrightarrow{d} p_0\)

  5. Utility: Ergodic theorem justifies using sample averages for posterior expectations \[\frac{1}{n}\sum_{t=1}^{n} g(\theta^{(t)}) \xrightarrow{a.s.} E_{p_0}[g(\theta)]\]

Key Takeaways

Summary
  1. We don’t need the normalizing constant because it cancels in the ratio \(r\)

  2. Starting values don’t matter (asymptotically) due to the convergence theorem

  3. Sample averages work due to the ergodic theorem

  4. Finite samples require care: burn-in, thinning, diagnostics

  5. The theory guarantees eventual correctness, but practical performance depends on:

    • Proposal tuning
    • Chain length
    • Monitoring convergence

Key Definitions Summary

Concept Definition
Markov Chain \(P(\theta^{(t+1)}|\theta^{(t)}, \ldots, \theta^{(0)}) = P(\theta^{(t+1)}|\theta^{(t)})\)
Stationary Distribution \(p_0(\theta') = \int K(\theta,\theta')p_0(\theta)d\theta\)
Detailed Balance \(p_0(\theta)K(\theta,\theta') = p_0(\theta')K(\theta',\theta)\)
Irreducibility All states reachable from all other states
Aperiodicity No deterministic cycles (period = 1)
M-H Ratio \(r = \frac{p_0(\theta^*)J(\theta|\theta^*)}{p_0(\theta)J(\theta^*|\theta)}\)