Lecture 7: The Normal Model and Intro to Gibbs Sampling

1 Normal Model

1.1 Recall: Conjugate prior

In the previous lecture we considered the normal model with conjugate prior:

\[ \begin{aligned} Y_1,\ldots,Y_n \mid \theta,\sigma^2 &\sim \text{ i.i.d. } N(\theta,\sigma^2)\\ \theta \mid \sigma^2 &\sim N\left(\mu_0,\frac{\sigma^2}{\kappa_0}\right)\\ \sigma^2 &\sim \operatorname{Inv-Gamma}\left(\frac{\nu_0}{2},\frac{\nu_0\sigma^2_0}{2}\right)\\ (\theta,\sigma^2) &\sim N\left(\mu_0,\frac{\sigma^2}{\kappa_0}\right)\times \operatorname{Inv-Gamma}\left(\frac{\nu_0}{2},\frac{\nu_0\sigma^2_0}{2}\right)\\ \end{aligned} \]

The above prior is often abbreviated as \(\theta,\sigma^2 \sim \text{NIG}(\mu_0,\kappa_0,\nu_0,\sigma^2_0)\). Often you also see notation \(\text{NIX}()\), which stands for normal-inverse-chi-squared distribution.

The connection is the following: if \(X\sim \text{Gamma}(\alpha/2,\beta/2)\), then \(\beta X\sim \chi^2_\alpha\).

1.2 Posterior

The posterior distribution is also a Normal-Inverse-Gamma distribution:

\[ \begin{aligned} (\theta,\sigma^2)\mid y_1,\ldots,y_n &\sim N\left(\mu_n,\frac{\sigma^2}{\kappa_n}\right)\times \operatorname{Inv-Gamma}\left(\frac{\nu_n}{2},\frac{\nu_n\sigma^2_n}{2}\right)\\ \kappa_n&=\kappa_0+n,\\ \mu_n&=\frac{k_0\mu_0+n\bar{y}}{k_0+n},\\ \nu_n&=\nu_0+n,\\ \nu_n\sigma_n^2&=\nu_0\sigma^2_0+(n-1)s^2+\frac{k_0n}{k_0+n}(\bar{y}-\mu_0)^2. \end{aligned} \] Or, in NIG notation: \[ \theta,\sigma^2\mid y_1,\ldots, y_n \sim \text{NIG}\left(\mu_n,\kappa_n,\nu_n,\sigma^2_n\right) \]

Q: Think of the interpretation of the hyperparameters \(\kappa_0,\nu_0\) in terms of the data. They are prior sample sizes, but why do we have two of them?

Answer

\(\kappa_0\) is the prior sample size for the mean \(\theta\), while \(\nu_0\) is the prior sample size for the variance \(\sigma^2\). They represent how much information we have about the mean and variance before seeing the data. A larger \(\kappa_0\) means we have more confidence in our prior belief about the mean, while a larger \(\nu_0\) means we have more confidence in our prior belief about the variance.

2 Probabilistic Graphical Models: Bayesian Networks for Normal Models

As you can see, the normal model with conjugate prior has a very nice, albeit complicated, structure. It is very common to represent such models using probabilistic graphical models, which are a powerful tool for visualizing and understanding the dependencies between variables in a probabilistic model. Probabilistic graphical models can be directed (Bayesian networks) or undirected (Markov random fields). In the case of the normal model with conjugate prior, we can represent it as a Bayesian network, where the nodes represent the random variables and the edges represent the dependencies between them.

Figure 1: Bayesian network for the Normal Model with Conjugate Prior

The hyperparameters \(\mu_0,\kappa_0,\nu_0,\sigma^2_0\) are represented as square nodes, which indicates that they are fixed parameters (not random variables). The latent parameters \(\theta\) and \(\sigma^2\) are represented as circular nodes, which indicates that they are random variables. The observed variables \(Y_1,\ldots,Y_n\) are represented as shaded circular nodes, which indicates that they are observed data. The edges represent the dependencies between the variables: for example, \(\theta\) depends on \(\mu_0\) and \(\kappa_0\), while \(Y_i\) depends on both \(\theta\) and \(\sigma^2\).

It is common to use plate notation to represent the repeated structure of the observations. In this case, we can draw a plate around the \(Y_i\) nodes to indicate that they are i.i.d. samples from the same distribution.

Figure 2: Bayesian network for the Normal Model with Plate Notation

It turns out, it is more natural to model the prior on mean and variance independent from each other. This will lead to the so-called semi-conjugate prior. So the Bayesian network for this will become:

Figure 3: Bayesian network for the Normal Model with the semi-conjugate prior

3 Normal Model: Conjugate vs Semiconjugate prior

Note, that conjugate prior is a very special case: it couples the mean and variance parameters in the prior together. In other words, if we are uncertain about the variance \(\sigma^2\), then we are also uncertain about the mean \(\theta\). This is a very strong assumption, and it may not always be appropriate. This is an algebraic consequence of the conjugate prior, but it is not a necessary assumption. We can relax this assumption by assuming that priors on mean and variance are independent: \[ p(\theta,\sigma^2)=p(\theta)p(\sigma^2) \]

This makes the model more flexible, but it also makes the posterior distribution more complicated. In particular, the posterior distribution is no longer a Normal-Inverse-Gamma distribution, and we cannot write down a closed-form expression for it. This is an example of a semi-conjugate prior, where the prior is not conjugate to the likelihood, but it still allows for some analytical tractability (more on this in a bit).

\[ \begin{aligned} p(\theta,\sigma^2)&=p(\theta\mid\sigma^2)\times p(\sigma^2)\\ &= N(\mu_0,\tau_0^2=\frac{\sigma^2}{\kappa_0})\times \text{Inv-Gamma}\left(\frac{\nu_0}{2},\frac{\nu_0}{2}\sigma^2_0\right) \end{aligned} \]

In the semi-conjugate case, the priors: \[ \theta\sim N(\mu_0,\tau_0^2),\qquad \sigma^2\sim \text{Inv-Gamma}\left(\frac{\nu_0}{2},\frac{\nu_0}{2}\sigma^2_0\right) \] We can only write down the full conditional posterior distributions, but not the joint posterior distribution. In particular, we can write down the full conditional for \(\theta\) and \(\sigma^2\), but we cannot write down the joint posterior distribution \(p(\theta,\sigma^2\mid y_1,\ldots,y_n)\) in closed form: \[ \theta\mid\sigma^2,y_1,\ldots,y_n\sim N(\mu_n,\tau_n^2(\sigma^2)),\qquad \sigma^2\mid\theta,y_1,\ldots,y_n\sim \text{Inv-Gamma}\left(\frac{\nu_n}{2},\frac{\nu_n}{2}\sigma^2_n(\theta)\right) \]

3.1 Discrete approximation

Since there is no closed-form expression for the joint posterior distribution in the semi-conjugate case, we can approximate it numerically on a grid. We can create a grid of values for \(\theta\) and \(\sigma^2\), and then compute the unnormalized posterior density at each point on the grid. Finally, we can normalize the grid to get an approximation to the joint posterior distribution. Consider a grid of values \(\{\theta_k\}_{k=1}^K\) and \(\{\widetilde{\sigma}^2_l\}_{l=1}^L\). Here we are using the (mean, precision) parametrization. We can compute the unnormalized posterior density at each point on the grid as follows:

\[ p(\theta_k, \widetilde{\sigma}^2_l\mid y_1,\ldots,y_n) = \text{ grid approximation of } p(\theta, \widetilde{\sigma}^2\mid y_1,\ldots,y_n) \]

\[ \begin{aligned} p(\theta,\widetilde{\sigma}^2, y_1,\ldots,y_n) &=p(\theta,\widetilde{\sigma}^2)\times p(y_1,\ldots,y_n\mid\theta,\widetilde{\sigma}^2)\\ &= N(\mu_0,\tau_0^2)\times \text{Gamma}\left(\frac{\nu_0}{2},\frac{\nu_0}{2}\sigma^2_0\right)\times \prod_i N\left(\theta,1/\widetilde{\sigma}^2\right) \end{aligned} \]

Let’s compute it on a 100-by-100 grid \[ \begin{aligned} p(\theta_k,\widetilde{\sigma}^2_l\mid y_1,\ldots,y_n)&\propto p(\theta_k,\widetilde{\sigma}^2_l,y_1,\ldots,y_n)\\ &= \text{dnorm}(\theta_k;\mu_0,\tau_0^2)\times \text{dgamma}\left(\widetilde\sigma^2_l;\frac{\nu_0}{2},\frac{\nu_0}{2}\sigma^2_0\right)\times \prod_i \text{dnorm}\left(y_i;\theta_k,1/\widetilde{\sigma}^2_l\right) \end{aligned} \]

3.2 Midge Wing Length dataset

For simplicity, presume the prior parameters are the following: \[ \mu_0=1.9,\tau_0=0.95,\qquad \nu_0=1,\sigma^2_0=0.01. \]

We obtain a joint posterior density plot, as well as the marginal posterior distributions for the mean and precision parameters. Note, how in the prior plot the mean and precision are independent, while in the posterior plot they are dependent. This is a consequence of the data coupling the parameters together in the likelihood, even though they were independent in the prior.

The posterior plot looks similar to the one we would get in the conjugate case, but it is not exactly the same. We had to use grid approximation to get the joint posterior.

Q: What could be an issue with this approach?

Answer

We did have to use the grid approximation to get the joint posterior distribution, which is computationally expensive. We had to evaluate the posterior density at 40,000 (\(K=L=200\)) points on the grid, which is not very efficient. If the number of parameters increases, the grid approximation becomes infeasible due to the curse of dimensionality (\(200^p\), where \(p=\) is the number of paramaters). This is a major drawback of the grid approximation method, and it motivates the need for more efficient sampling methods, such as Markov Chain Monte Carlo (MCMC) methods, which we will discuss next.

3.3 Intro to Gibbs Sampling

A better way to do it: a Gibbs’ Sampler!

Instead of approximating the full joint posterior, let’s sample iteratively from full conditional posteriors. We can write down the full conditional posterior distributions for \(\theta\) and \(\sigma^2\) as follows:

\[ \begin{aligned} p(\widetilde{\sigma}^2\mid \theta,y_1,\ldots,y_n) &\propto p(\widetilde{\sigma}^2,\theta,y_1,\ldots,y_n)\\ &= p(y_1,\ldots,y_n\mid\theta, \widetilde{\sigma}^2) p(\theta\mid\widetilde{\sigma}^2)p(\widetilde{\sigma}^2)\\ &\propto p(y_1,\ldots,y_n\mid \theta, \widetilde{\sigma}^2) p(\widetilde{\sigma}^2)\\ &=\mathrm{Gamma}(\nu_n/2,\nu_n\sigma^2(\theta)/2) \end{aligned} \] where \[ \nu_n=\nu_0+n,\qquad \sigma^2(\theta)=\frac{1}{\nu_n}[\nu_0\sigma^2_0+n\,s^2_n(\theta)]=\frac{1}{\nu_n}[\nu_0\sigma^2_0+\sum_i(y_i-\theta)^2] \]

Exercise: Verify the formula above for the full conditional posterior distribution of \(\widetilde{\sigma}^2\).

And \(p(\theta\mid\widetilde{\sigma}^2,y_1,\ldots,y_n)\) we know from before: \[ \theta\mid\widetilde{\sigma}^2,y_1,\ldots,y_n\sim N(\mu_n,\tau_n^2) \] where \(\mu_n\) and \(\tau_n^2\) are the same as in the conjugate case.

The algorithm: Gibbs’ sampler for the normal model with semi-conjugate prior

Start with initial guess: \[ \underline{\phi}^{(0)}=\{\theta^{(0)},\widetilde{\sigma}^{2(0)}\} \]

Sample \(\theta^{(i)}\sim p(\theta\mid \widetilde{\sigma}^{2(i-1)},y_1,\ldots,y_n)\)

Sample \(\widetilde{\sigma}^{2(i)}\sim p(\widetilde{\sigma}^{2}\mid {\theta}^{(i)},y_1,\ldots,y_n)\)

Let \(\boldsymbol{\phi}^{(i)}=\{\theta^{(i)},\widetilde{\sigma}^{2(i)}\}\)

One pass of the above steps is called one scan of the Gibbs sampler.

As the parameters are updated iteratively, we can visualize the trajectory of the Gibbs sampler in the parameter space. The plot above shows the path taken by the Gibbs sampler for the first 5, 15, 50, and 100 iterations. Each point represents the sampled values of \(\theta\) and \(\widetilde{\sigma}^2\) at a given iteration, and the lines connect the points in the order they were sampled. This visualization helps us understand how the Gibbs sampler explores the parameter space and converges towards the high-density regions of the posterior distribution.

Note, how the Gibbs sampler samples are concentrated in the high-density regions of the posterior distribution, which is a desirable property. This is in contrast to the grid approximation, where we had to evaluate the posterior density at many points in low-density regions, which was inefficient.

Justification of the Gibbs sampler:

Samples \(\boldsymbol{\phi}^{(i)}=\{\theta^{(i)},\widetilde{\sigma}^{2(i)}\}\) form a Markov chain with stationary distribution equal to the target posterior distribution \(p(\theta,\widetilde{\sigma}^2\mid y_1,\ldots,y_n)\). In the limit, irreducible and aperiodic Markov chain samples converge to the target posterior distribution.

Recall, the Markov chain is a sequence of random variables \(\{\boldsymbol{\phi}^{(i)}\}_{i=1}^\infty\) such that \[ \Pr(\underline{\Phi}^{(k)}=\underline{\phi}^{(k)}\mid \underline{\Phi}^{(k-1)}=\underline{\phi}^{(k-1)},\ldots,\underline{\Phi}^{(1)}=\underline{\phi}^{(1)})=\Pr(\underline{\Phi}^{(k)}=\underline{\phi}^{(k)}\mid \underline{\Phi}^{(k-1)}=\underline{\phi}^{(k-1)}) \]

In other words, the distribution of the next state \(\underline{\Phi}^{(k)}\) depends only on the current state \(\underline{\Phi}^{(k-1)}\), and not on the past states.

We will introduce two more MCMC algorithms later in the semester: the Metropolis and the Metropolis-Hastings algorithms. We will justify the convergence of these algorithms later in the semester, but for now, we will just take it as given that the Gibbs sampler converges to the target posterior distribution.

3.4 Marginal posterior distributions from the Gibbs sampler vs Grid approximation

Above, the dashed lines represent the Gibbs approximation to the marginal posterior distributions, while the solid lines represent the grid approximation. The Gibbs approximation is close to the grid approximation, and one can see that the Gibbs approximation captures the shape of the posterior distribution quite well. This is a good indication that the Gibbs sampler is working correctly and is providing a good approximation to the posterior distribution.

The main question is: how many samples do we need to get a good approximation to the posterior distribution? This is a difficult question, and it depends on the specific problem at hand. In practice, one can use diagnostic tools such as trace plots, autocorrelation plots, and effective sample size calculations to assess the convergence of the Gibbs sampler and determine how many samples are needed.

4 Bird eye view of the Bayesian Data Analysis

Let’s take a step back and look at the big picture of Bayesian data analysis. The main goal of Bayesian data analysis is to use the observed data to update our beliefs about the parameters of interest, which are represented by the posterior distribution. The process of Bayesian data analysis can be broken down into three main steps:

Sampling Model specification:

Choose a sampling model \(p(y\mid \phi)\)

Depending on the data generating process, we can choose an appropriate sampling model. For example, if we are modeling continuous data, we might choose a normal distribution (or exponential, or gamma, or Cauchy, etc). If we are modeling count data, we might choose a Poisson distribution (or Negative binomial, or geometric etc). The choice of the sampling model is crucial, as it determines the likelihood function and the form of the posterior distribution.

Prior specification:

Choose a prior \(p(\phi)\)

Choose a prior that represents your beliefs about \(\phi\in\Phi\) before seeing the data

We can choose subjective or objective priors, depending on the context and the amount of prior information available.

Main Result of Bayesian Analysis: the full posterior distribution \(p(\phi\mid y)\). This is the main outcome of the Bayesian analysis, and it contains all the information about the parameters of interest after observing the data.

We can extract posterior summary statistics from the posterior distribution to make inferences about the parameters of interest. For example:

Point estimates: posterior mean, median, mode, variance, etc.
Interval estimates: quantile-, HPD- intervals.
Functions of parameters: posterior predictive distribution, posterior distribution of a function of parameters, etc.

It is important to draw conceptual distinction between Estimation and Approximation.

Estimation: the process of using the posterior distribution to make inferences about the parameters of interest. This can be done by extracting summary statistics from the posterior distribution, such as the mean, median, mode, or credible intervals.
Approximation: the process of using numerical methods to approximate the posterior distribution when it cannot be computed analytically. This can be done using methods such as grid approximation, Markov Chain Monte Carlo (MCMC) methods, variational inference, etc.

In the next lecture, we will discuss the Gibbs sampler in more detail, and we will talk about the convergence diagnostics for MCMC methods.