Lecture 6: The Normal Model

1 Normal Models

We start investigating the normal model, which is one of the most widely used models in statistics. It is also the first model with more than one parameter that we will study. We will start with inference for the mean \(\theta\) conditional on the variance \(\sigma^2\) and then move to joint inference for \((\theta,\sigma^2)\).

We will be working with the conditional prior \(p(\theta\mid\sigma^2)\) and conditional posterior \(p(\theta\mid y_1,\ldots,y_n,\sigma^2)\) for \(\theta\) given \(\sigma^2\). We will relax the assumption that \(\sigma^2\) is known later in the lecture and move to joint inference for \((\theta,\sigma^2)\).

1.1 Inference for the mean conditional on the variance

Recall, we assumed that the variance \(\sigma^2\) is known. We will relax this assumption later in the lecture.

Sampling model

\[ Y_i\mid \theta,\sigma^2\sim\text{ i.i.d. } N(\theta,\sigma^2) \]

\[ p(y | \theta,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{1}{2}\left(\frac{y-\theta}{\sigma}\right)^2\right] \]

Joint sampling density: \[ \begin{aligned} p(y_1,\ldots,y_n\mid\theta,\sigma^2)\ &\propto\ \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\theta)^2\right]\\ &=\ \exp\left[-\frac{1}{2\sigma^2}\left(\sum_i y_i^2-2\,\theta\sum_iy_i+n\,\theta^2\right)\right]\\ &\propto\ \exp\left[a_1\theta^2+a_2\theta\right]\\ \end{aligned} \]

Conjugate prior and posterior (conditional on \(\sigma^2\))

From the kernel of the joint sampling density, we can see that the conjugate prior for \(\theta\) has to be a normal distribution. Specifically, we have:

\[ \theta\mid\sigma^2\sim N(\mu_0,\tau_0^2) \]

The conditional posterior distribution of \(\theta\) given the data and \(\sigma^2\) is also normal:

\[ \theta\mid y_1,\ldots,y_n,\sigma^2\sim N(\mu_n,\tau_n^2) \]

with the parameters \(\mu_n\) and \(\tau_n^2\) given by:

\[ \mu_n=\dfrac{\frac{1}{\tau_0^2}\mu_0+\frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}} \] \[ \frac{1}{\tau_n^2}=\frac{1}{\tau_0^2}+\frac{n}{\sigma^2} \]

Interpretation: it is more convenient to express the prior and posterior in terms of precision (inverse variance) rather than variance.

  • \(\tilde\sigma^2=1/\sigma^2=\) sampling precision, - \(\tilde\tau_0^2=1/\tau_0^2=\) prior precision, - \(\tilde\tau_n^2=1/\tau_n^2=\) posterior precision,

then we can rewrite the above as: \[ \widetilde\tau_n^2 = \widetilde\tau_0^2 + n\widetilde\sigma^2\quad\Leftrightarrow\quad \text{posterior precision} = \text{prior precision} + \text{data precision} \]

\[ \begin{aligned} \mu_n &= \frac{\widetilde\tau_0^2} {\widetilde\tau_0^2 + n\widetilde\sigma^2} \mu_0 + \frac{n\widetilde\sigma^2}{\widetilde\tau_0^2 + n\widetilde\sigma^2}\bar{y}\\ \text{posterior mean} &= \frac{\text{prior precision}}{\text{posterior precision}} \times \text{prior mean} + \frac{\text{data precision}}{\text{posterior precision}} \times \text{data mean} \end{aligned} \]

One useful way to reparametrize the prior parameter \(\tau_0^2\) is to express it in terms of the number of prior observations \(k_0\):

\[ \tau_0^2 = \frac{\sigma^2}{k_0}\quad\Rightarrow\quad \theta\mid\sigma^2\sim N\left(\mu_0,\frac{\sigma^2}{k_0}\right) \]

One way to think about \(k_0\) is as the number of “pseudo-observations” from the sampling model with the same variance \(\sigma^2\) as the data and mean \(\mu_0\). With this reparametrization, the posterior parameters can be expressed as: \[ \mu_n=\frac{k_0}{k_0+n}\mu_0+\frac{n}{k_0+n}\bar{y} \]

1.2 Recap: Conditional Expectation and Variance

We will use the tools of conditional mean and variance soon. Let us recall the main facts briefly. Consider 3 random variables \(U,V,W\).

\[ \operatorname{E}[U] = \text{ mean of }\ U \]

\[ \operatorname{E}[U\mid V=v] = \text{ mean of $U$ given that $V$ takes value of $v$.} \]

The above two quantities are fixed numbers. The first one is the mean of \(U\) and the second one is the mean of \(U\) when we know that \(V=v\).

The next quantity is a random variable: \[ \operatorname{E}[U\mid V] = \text{ mean of $U$ given that $V$ takes value of $V$.} \]

We have the Law of Total Expectation:

\[ \operatorname{E}[\operatorname{E}[U\mid V]] = \operatorname{E}[U] \] The above says that if we take the conditional expectation of \(U\) given \(V\) and then take the expectation of that, we get back the original mean of \(U\).

A variation of the above is:

\[ \operatorname{E}[\operatorname{E}[U\mid V, W]\mid W] = \operatorname{E}[U\mid W] \] The above says that if we take the conditional expectation of \(U\) given \(V\) and \(W\) and then take the expectation of that given \(W\), we get back the conditional expectation of \(U\) given \(W\).

The next tool is the Law of Total Variance:

\[ \operatorname{Var}(U)=\operatorname{E}[\operatorname{Var}(U\mid V)] + \operatorname{Var}[\operatorname{E}(U\mid V)] \] The above says that the variance of \(U\) can be decomposed into two parts: the expected value of the conditional variance of \(U\) given \(V\) and the variance of the conditional expectation of \(U\) given \(V\).

Q: Write the above decomposition for \(\operatorname{Var}(U\mid W)\).

Answer \[ \operatorname{Var}(U\mid W)=\operatorname{E}[\operatorname{Var}(U\mid V, W)\mid W] + \operatorname{Var}[\operatorname{E}(U\mid V, W)\mid W] \]

1.3 Predictive posterior distribution

Now, suppose we want to predict a new observation \(\widetilde{Y}\) from the same sampling model. The predictive posterior distribution of \(\widetilde{Y}\) given the data and \(\sigma^2\) is given by:

\[ \begin{aligned} p(\widetilde{y}\mid y_1,\ldots,y_n,\sigma^2)&=\int p(\widetilde{y},\theta\mid y_1,\ldots,y_n,\sigma^2)d\theta\\ &=\int p(\widetilde{y}\mid y_1,\ldots,y_n,\theta,\sigma^2)p(\theta\mid y_1,\ldots,y_n,\sigma^2)d\theta\\ &=\int p(\widetilde{y}\mid \theta,\sigma^2)p(\theta\mid y_1,\ldots,y_n,\sigma^2)d\theta\\ &=\int \text{Normal}\times \text{Normal}\ d\theta\\ \end{aligned} \] One can show that the above integral evaluates to a normal distribution with mean \(\mu_{pred}\) and variance \(\sigma^2_{pred}\). To find these values we use the tools of conditional expectation and variance that we recalled earlier. We have:

\[ \begin{aligned} \mu_{pred} &= \operatorname{E}[\widetilde{Y}\mid y_1,\ldots,y_n,\sigma^2]\\ &=\operatorname{E}[\operatorname{E}[\widetilde{Y}\mid \theta, y_1,\ldots,y_n,\sigma^2]\mid y_1,\ldots,y_n,\sigma^2]\\ &=\operatorname{E}[\operatorname{E}[\widetilde{Y}\mid \theta,\sigma^2]\mid y_1,\ldots,y_n,\sigma^2]\\ &=\operatorname{E}[\theta\mid y_1,\ldots,y_n,\sigma^2]\\ &=\mu_n \end{aligned} \] and \[ \begin{aligned} \sigma^2_{pred} &= \operatorname{Var}(\widetilde{Y}\mid y_1,\ldots,y_n,\sigma^2)\\ &=\operatorname{E}[\operatorname{Var}(\widetilde{Y}\mid \theta, y_1,\ldots,y_n,\sigma^2)\mid y_1,\ldots,y_n,\sigma^2] + \operatorname{Var}[\operatorname{E}(\widetilde{Y}\mid \theta, y_1,\ldots,y_n,\sigma^2)\mid y_1,\ldots,y_n,\sigma^2]\\ &=\operatorname{E}[\operatorname{Var}(\widetilde{Y}\mid \theta,\sigma^2)\mid y_1,\ldots,y_n,\sigma^2] + \operatorname{Var}[\operatorname{E}(\widetilde{Y}\mid \theta,\sigma^2)\mid y_1,\ldots,y_n,\sigma^2]\\ &=\operatorname{E}[\sigma^2\mid y_1,\ldots,y_n,\sigma^2] + \operatorname{Var}[\theta\mid y_1,\ldots,y_n,\sigma^2]\\ &=\sigma^2 + \tau_n^2 \end{aligned} \] Thus, the predictive posterior distribution of \(\widetilde{Y}\) given the data and \(\sigma^2\) is: \[ \widetilde{Y}\mid y_1,\ldots,y_n,\sigma^2\sim N(\mu_n,\sigma^2+\tau_n^2) \]

Intuitively, the predictive posterior distribution is centered at the posterior mean of \(\theta\) and has variance that is larger than the sampling variance \(\sigma^2\) by the amount of uncertainty we have about \(\theta\) (which is \(\tau_n^2\)).

1.4 R example: Midge Wing Length

This is a dataset of wing lengths of midges (small flies) collected in the field. We will use this dataset to illustrate the concepts we have discussed so far.

Here is a Midge
y <- midge[midge[,1]=="Af",3]
head(midge)
  Species Ant.Length Wing.Length
1      Af       1.38        1.64
2      Af       1.40        1.70
3      Af       1.24        1.72
4      Af       1.36        1.74
5      Af       1.38        1.82
6      Af       1.48        1.82
print(y)
[1] 1.64 1.70 1.72 1.74 1.82 1.82 1.82 1.90 2.08
hist(y, breaks = 10, main = "Midge wing lengths", xlab = "Wing length (mm)")
abline(v = mean(y),col = "red",lwd = 2)

mean(y)
[1] 1.804444

It is a small dataset with only 9 observations. The sample mean is 1.8044444 mm and the sample variance is 0.0168778 mm\(^2\). We will use this dataset to illustrate the concepts we have discussed so far.

Prior parameter choice

Prior \[ \theta\mid \sigma^2\sim N(\mu_0,\tau_0^2) \]

Expert opinion: Experts say that the mean wing length is about 1.9 mm.

Q: What should the parameters \(\mu_0\) and \(\tau_0^2\) be?

Answer

We can set \(\mu_0=1.9\) mm. To choose \(\tau_0^2\), we can think that most (lets say, 95%) of the prior mass should be above zero. So, 2 standard deviations should be about 1.9 mm. This gives us \(\tau_0^2=(1.9/2)^2=0.95^2=0.9025\) mm\(^2\).

      Mean           Type
1 1.804444      Data mean
2 1.804643 Posterior mean
3 1.900000     Prior mean

Posterior

\[ \mu_n=\dfrac{\frac{1}{\tau_0^2}\mu_0+\frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}} \]

Exercise: Verify that the posterior mean evaluates to 1.8046426 mm.

Here we assumed \(\sigma^2\) is known. Often, this is not the case. We need to incorporate uncertainty about \(\sigma^2\) into our inference about \(\theta\). This leads to joint inference for \((\theta,\sigma^2)\).

1.5 Joint inference for the mean and variance

Conjugate prior

We will express the joint prior for \((\theta,\sigma^2)\) as a product of a conditional prior for \(\theta\) given \(\sigma^2\) and a marginal prior for \(\sigma^2\):

\[ p(\theta,\sigma^2\mid y_1,\ldots,y_n)\propto p(y_1,\ldots,y_n\mid\theta,\sigma^2)\times p(\theta\mid\sigma^2)p(\sigma^2) \]

What should the form of \(p(\theta\mid\sigma^2)\) and \(p(\sigma^2)\) be to get a conjugate prior?

We aim to represent the joint sampling model as a product \(F(\theta\mid\sigma^2)G(\sigma^2)\), such that when paired with the corresponding prior, the posterior is in the same family as the prior.

\[ \begin{aligned} p(\underline{y}\mid\theta)&\propto (\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\theta)^2\right]\\ &\propto (\sigma^2)^{-n/2}\exp\left[-\frac{\sum_{i=1}^n(y_i-\bar{y})^2}{2\sigma^2} - \frac{n(\theta-\bar{y})^2}{2\sigma^2}\right]\\ &=\exp\left[-\frac{n(\theta-\bar{y})^2}{2\sigma^2}\right]\times (\sigma^2)^{-n/2}\exp\left[- \frac{\sum_{i=1}^n(y_i-\bar{y})^2}{2\sigma^2}\right]\\ &=F(\theta\mid\sigma^2)\times G(\sigma^2) \end{aligned} \]

Continuing from the above: \[ p(\theta\mid \underline{y},\sigma^2)p(\sigma^2\mid \underline{y})\propto F(\theta\mid\sigma^2)G(\sigma^2)\times p(\theta\mid\sigma^2)p(\sigma^2) \]

Turns out, to get a conjugate prior, we need to set \(p(\theta\mid\sigma^2)\) to be a normal distribution and \(p(\sigma^2)\) to be an inverse-gamma distribution. Specifically, we have: \[ \begin{aligned} \theta\mid\sigma^2 &\sim N(\mu_0,\sigma^2/k_0)\\ \sigma^2 &\sim \text{Inverse-Gamma}(a_0=\nu_0/2,b_0= \nu_0\sigma^2_0/2) \end{aligned} \]

We chose the parameters of the inverse-gamma distribution for the ease of interpretation later on.

The prior is then given by: \[ p(\theta,\sigma^2) =\text{Normal}(\mu_0,\sigma^2/k_0)\times \text{Inverse-Gamma}(\nu_0/2,\nu_0\sigma^2_0/2)=NIG(\mu_0,k_0,\nu_0,\sigma^2_0) \]

NIG stands for Normal-Inverse-Gamma distribution. It is also sometimes called the NIX prior (Normal-Inverse-Chi-Squared) because the inverse-gamma distribution can be reparametrized in terms of an inverse-chi-squared distribution.

Joint Posterior

\[ \begin{aligned} 1/\sigma^2 &\sim \text{Gamma}\left(\frac{\nu_0}{2},\frac{\nu_0\sigma^2_0}{2}\right)\\ \theta \mid \sigma^2 &\sim N\left(\mu_0,\frac{\sigma^2}{\kappa_0}\right)\\ Y_1,\ldots,Y_n \mid \theta,\sigma^2 &\sim \text{ i.i.d. } N(\theta,\sigma^2)\\ \end{aligned} \]

\[ \begin{aligned} p(\theta,\sigma^2\mid y_1,\ldots,y_n)&=p(\theta\mid\sigma^2,y_1,\ldots,y_n)p(\sigma^2\mid y_1,\ldots,y_n)\\ &=p(y_1,\ldots,y_n\mid\theta,\sigma^2)p(\theta\mid\sigma^2)p(\sigma^2)\\ &=F(\theta\mid\sigma^2)\times G(\sigma^2)\times \text{Normal}(\mu_0,\sigma^2/k_0)\times \text{Inverse-Gamma}(\nu_0/2,\nu_0\sigma^2_0/2)\\ \end{aligned} \]

Now, our expressions for \(F(\theta\mid\sigma^2)\) and \(G(\sigma^2)\) are exactly the kernels of the normal and inverse-gamma distributions, respectively. Thus, we can read off the parameters of the posterior distributions for \(\theta\) given \(\sigma^2\) and for \(\sigma^2\) directly from the above expression. We have: \[ \begin{aligned} p(\theta,\sigma^2\mid y_1,\ldots,y_n)&\propto F(\theta\mid\sigma^2)\times \text{Normal}(\mu_0,\sigma^2/k_0)\times G(\sigma^2)\times \text{Inverse-Gamma}(\nu_0/2,\nu_0\sigma^2_0/2)\\ &=\exp\left[-\frac{n(\theta-\bar{y})^2}{2\sigma^2}\right]\times \text{N}(\mu_0,\sigma^2/k_0)\times (\sigma^2)^{-n/2}\exp\left[- \frac{\sum (y_i-\bar{y})^2}{2\sigma^2}\right]\times \text{IG}(\nu_0/2,\nu_0\sigma^2_0/2)\\ \end{aligned} \]

The first two terms give us the posterior distribution of \(\theta\) given \(\sigma^2\): \(p(\theta\mid y_1,\ldots,y_n,\sigma^2)\).

And the last two terms give us the posterior distribution of \(\sigma^2\): \(p(\sigma^2\mid y_1,\ldots,y_n)\). Specifically, we have:

\[ \begin{aligned} p(\theta\mid y_1,\ldots,y_n,\sigma^2)&=N\left(\mu_n,\frac{\sigma^2}{k_n}\right), \end{aligned} \] where \[ \begin{aligned} \mu_n&=\frac{k_0}{k_0+n}\mu_0+\frac{n}{k_0+n}\bar{y}\\ k_n&=k_0+n \end{aligned} \]

And for the posterior distribution of \(\sigma^2\), we have:

\[ \begin{aligned} p(\sigma^2\mid y_1,\ldots,y_n)&\propto (\sigma^2)^{-(n+\nu_0)/2}\exp\left[- \frac{\nu_0\sigma^2_0 + \sum (y_i-\bar{y})^2 + k_0n(\bar{y}-\mu_0)^2}{2\sigma^2}\right]\\ &=(\sigma^2)^{-\nu_n}\exp\left[- \frac{\nu_n\sigma^2_n}{2\sigma^2}\right]\\ &=\text{Inverse-Gamma}\left(\frac{\nu_n}{2},\frac{\nu_n\sigma^2_n}{2}\right), \end{aligned} \]

where \[ \begin{aligned} \nu_n&=\nu_0+n\\ \nu_n\sigma^2_n&=\nu_0\sigma^2_0 + \sum (y_i-\bar{y})^2 + k_0n(\bar{y}-\mu_0)^2 \end{aligned} \]

In shorthand, we can say \[ \theta,\sigma^2\sim NIG(\mu_0,k_0,\nu_0,\sigma^2_0)\quad\Rightarrow\quad \theta,\sigma^2\mid y_1,\ldots,y_n\sim NIG(\mu_n,k_n,\nu_n,\sigma^2_n). \]

Let’s interpret the posterior parameters. The posterior mean \(\mu_n\) is a weighted average of the prior mean \(\mu_0\) and the sample mean \(\bar{y}\), where the weights are given by the prior precision sample size \(k_0\) and the data sample size \(n\). The posterior sample size \(\nu_n\) is the sum of the prior sample size and the data sample size. The posterior scale parameter \(\nu_n\sigma^2_n\) is a posterior sum of squares. It is a combination of the prior sum of squares, the sample sum of squares, and a term that captures how far the sample mean is from the prior mean.

Q: Why there are two prior sample size parameters \(k_0\) and \(\nu_0\)?

Answer The parameter \(k_0\) captures the amount of information we have about the mean \(\theta\) given \(\sigma^2\). The parameter \(\nu_0\) captures the amount of information we have about the variance \(\sigma^2\). They are separate because they capture different aspects of our prior knowledge.

Example: Back to Midges

## prior
mu0<-1.9  ; k0<-1
s20<-.01 ; nu0<-1

## data
y<-c(1.64,1.70,1.72,1.74,1.82,1.82,1.82,1.90,2.08)
n<-length(y) ; ybar<-mean(y) ; s2<-var(y)

## posterior inference
kn<-k0+n ; nun<-nu0+n
mun<- (k0*mu0 + n*ybar)/kn  
s2n<- (nu0*s20 +(n-1)*s2 +k0*n*(ybar-mu0)^2/(kn))/(nun)
mun
[1] 1.814
s2n
[1] 0.015324
sqrt(s2n)
[1] 0.1237901

Prior and posterior joint densities in (mean,precision) space





Sequential updating of the joint posterior distribution

Joint posterior density plot in (mean,variance) space


Q: Think why the posterior mean of the variance is not a convex combination of prior variance and sample variance?

1.6 Marginal posterior: Monte Carlo sampling

We are often interested in \[ E[\theta\mid y_1,\ldots,y_n] \quad\text{and}\quad E[\sigma^2\mid y_1,\ldots,y_n] \]

I.e. the marginal posterior means of the mean and variance parameters. We can compute these by numerical integration, but it is often easier to sample from the joint posterior and then compute sample averages.

Q: How do we sample from the joint posterior \(p(\theta, \sigma^2 \mid y_1, \ldots, y_n)\) in this model?

A:

set.seed(632)
S <- 10000
s2.postsample <- 1/rgamma(S,  (nu0+n)/2, s2n*(nu0+n)/2 )
theta.postsample <- rnorm(S, mun, sqrt(s2.postsample/(k0+n)))

WWaBD: 95% Bayesian quantile CI for \(\theta\)

quantile( theta.postsample,c(.025,.975))
    2.5%    97.5% 
1.727771 1.898717 

WWaFD: 95% Classical CI for \(\theta\)

ybar+qt( c(.025,.975), n-1) *sqrt(s2/n)
[1] 1.704583 1.904306

Connection to the Frequentism:

In frequentist inference, we have the pivotal quantity \[ \frac{\bar{y}-\theta}{s/\sqrt{n}}\;\Big|\; \theta \qquad\sim\qquad t_{n-1} \]

The t-distribution arises because we are estimating the variance from the data: the denominator \(s/\sqrt{n}\) is a random variable that depends on the data, and its contribution makes the statistic more heavy-tailed than a normal distribution.

In Bayesian inference, we have the pivotal quantity

\[ \dfrac{\theta-\mu_n}{\sigma_n/\sqrt{k_n}}\;\Big|\; y_1,\ldots,y_n \qquad\sim\qquad t_{\nu_0+n} \]

Here, the t-distribution arises because we are integrating out the uncertainty in \(\sigma^2\): the variance of the posterior distribution of \(\theta\) depends on \(\sigma^2\), which is a random variable with an inverse-gamma posterior distribution. This integration leads to a t-distribution for the marginal posterior of \(\theta\).

1.7 Improper Priors

Q: What is the smallest \(\nu_0\) and \(k_0\) can we set to still have a proper posterior?

\[ \begin{aligned} \mu_n&=\frac{k_0\mu_0+n\bar{y}}{k_0+n},\\ \sigma_n^2&=\frac{1}{\nu_0+n}\left[\nu_0\sigma^2_0+(n-1)s^2+\frac{k_0n}{k_0+n}(\bar{y}-\mu)^2\right]. \end{aligned} \] Let \(k_0,\nu_0\rightarrow 0\)

Priors will look like

\[ p(\theta,\sigma^2) = p(\theta\mid\sigma^2)\times p(\sigma^2) = N(\mu_0,\infty)\times \text{Inv-Gamma}(0,0) \] This corresponds to \(NIG(\mu_0, 0, 0, 0)\).

Exercise: Show that this prior is \(p(\theta,\sigma^2)\propto 1/\sigma^2\). This is also known as Haar’s prior for the normal model.

The prior will be improper. But posterior will be proper if \(n\geq 2\) and \(s^2>0\):

\[ \begin{aligned} \theta\mid\sigma^2,y_1,\ldots,y_n&\sim N(\bar{y},\sigma^2/n),\\ \sigma^2\mid y_1,\ldots,y_n&\sim \text{Inv-Gamma}\left(\frac{n-1}{2},\frac{(n-1)s^2}{2}\right). \end{aligned} \]