Conjugate prior
We will express the joint prior for \((\theta,\sigma^2)\) as a product of a conditional prior for \(\theta\) given \(\sigma^2\) and a marginal prior for \(\sigma^2\):
\[
p(\theta,\sigma^2\mid y_1,\ldots,y_n)\propto p(y_1,\ldots,y_n\mid\theta,\sigma^2)\times p(\theta\mid\sigma^2)p(\sigma^2)
\]
What should the form of \(p(\theta\mid\sigma^2)\) and \(p(\sigma^2)\) be to get a conjugate prior?
We aim to represent the joint sampling model as a product \(F(\theta\mid\sigma^2)G(\sigma^2)\), such that when paired with the corresponding prior, the posterior is in the same family as the prior.
\[
\begin{aligned}
p(\underline{y}\mid\theta)&\propto (\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\theta)^2\right]\\
&\propto (\sigma^2)^{-n/2}\exp\left[-\frac{\sum_{i=1}^n(y_i-\bar{y})^2}{2\sigma^2} - \frac{n(\theta-\bar{y})^2}{2\sigma^2}\right]\\
&=\exp\left[-\frac{n(\theta-\bar{y})^2}{2\sigma^2}\right]\times (\sigma^2)^{-n/2}\exp\left[- \frac{\sum_{i=1}^n(y_i-\bar{y})^2}{2\sigma^2}\right]\\
&=F(\theta\mid\sigma^2)\times G(\sigma^2)
\end{aligned}
\]
Continuing from the above: \[
p(\theta\mid \underline{y},\sigma^2)p(\sigma^2\mid \underline{y})\propto F(\theta\mid\sigma^2)G(\sigma^2)\times p(\theta\mid\sigma^2)p(\sigma^2)
\]
Turns out, to get a conjugate prior, we need to set \(p(\theta\mid\sigma^2)\) to be a normal distribution and \(p(\sigma^2)\) to be an inverse-gamma distribution. Specifically, we have: \[
\begin{aligned}
\theta\mid\sigma^2 &\sim N(\mu_0,\sigma^2/k_0)\\
\sigma^2 &\sim \text{Inverse-Gamma}(a_0=\nu_0/2,b_0= \nu_0\sigma^2_0/2)
\end{aligned}
\]
We chose the parameters of the inverse-gamma distribution for the ease of interpretation later on.
The prior is then given by: \[
p(\theta,\sigma^2) =\text{Normal}(\mu_0,\sigma^2/k_0)\times \text{Inverse-Gamma}(\nu_0/2,\nu_0\sigma^2_0/2)=NIG(\mu_0,k_0,\nu_0,\sigma^2_0)
\]
NIG stands for Normal-Inverse-Gamma distribution. It is also sometimes called the NIX prior (Normal-Inverse-Chi-Squared) because the inverse-gamma distribution can be reparametrized in terms of an inverse-chi-squared distribution.
Joint Posterior
\[
\begin{aligned}
1/\sigma^2 &\sim \text{Gamma}\left(\frac{\nu_0}{2},\frac{\nu_0\sigma^2_0}{2}\right)\\
\theta \mid \sigma^2 &\sim N\left(\mu_0,\frac{\sigma^2}{\kappa_0}\right)\\
Y_1,\ldots,Y_n \mid \theta,\sigma^2 &\sim \text{ i.i.d. } N(\theta,\sigma^2)\\
\end{aligned}
\]
\[
\begin{aligned}
p(\theta,\sigma^2\mid y_1,\ldots,y_n)&=p(\theta\mid\sigma^2,y_1,\ldots,y_n)p(\sigma^2\mid y_1,\ldots,y_n)\\
&=p(y_1,\ldots,y_n\mid\theta,\sigma^2)p(\theta\mid\sigma^2)p(\sigma^2)\\
&=F(\theta\mid\sigma^2)\times G(\sigma^2)\times \text{Normal}(\mu_0,\sigma^2/k_0)\times \text{Inverse-Gamma}(\nu_0/2,\nu_0\sigma^2_0/2)\\
\end{aligned}
\]
Now, our expressions for \(F(\theta\mid\sigma^2)\) and \(G(\sigma^2)\) are exactly the kernels of the normal and inverse-gamma distributions, respectively. Thus, we can read off the parameters of the posterior distributions for \(\theta\) given \(\sigma^2\) and for \(\sigma^2\) directly from the above expression. We have: \[
\begin{aligned}
p(\theta,\sigma^2\mid y_1,\ldots,y_n)&\propto F(\theta\mid\sigma^2)\times \text{Normal}(\mu_0,\sigma^2/k_0)\times G(\sigma^2)\times \text{Inverse-Gamma}(\nu_0/2,\nu_0\sigma^2_0/2)\\
&=\exp\left[-\frac{n(\theta-\bar{y})^2}{2\sigma^2}\right]\times \text{N}(\mu_0,\sigma^2/k_0)\times (\sigma^2)^{-n/2}\exp\left[- \frac{\sum (y_i-\bar{y})^2}{2\sigma^2}\right]\times \text{IG}(\nu_0/2,\nu_0\sigma^2_0/2)\\
\end{aligned}
\]
The first two terms give us the posterior distribution of \(\theta\) given \(\sigma^2\): \(p(\theta\mid y_1,\ldots,y_n,\sigma^2)\).
And the last two terms give us the posterior distribution of \(\sigma^2\): \(p(\sigma^2\mid y_1,\ldots,y_n)\). Specifically, we have:
\[
\begin{aligned}
p(\theta\mid y_1,\ldots,y_n,\sigma^2)&=N\left(\mu_n,\frac{\sigma^2}{k_n}\right),
\end{aligned}
\] where \[
\begin{aligned}
\mu_n&=\frac{k_0}{k_0+n}\mu_0+\frac{n}{k_0+n}\bar{y}\\
k_n&=k_0+n
\end{aligned}
\]
And for the posterior distribution of \(\sigma^2\), we have:
\[
\begin{aligned}
p(\sigma^2\mid y_1,\ldots,y_n)&\propto (\sigma^2)^{-(n+\nu_0)/2}\exp\left[- \frac{\nu_0\sigma^2_0 + \sum (y_i-\bar{y})^2 + k_0n(\bar{y}-\mu_0)^2}{2\sigma^2}\right]\\
&=(\sigma^2)^{-\nu_n}\exp\left[- \frac{\nu_n\sigma^2_n}{2\sigma^2}\right]\\
&=\text{Inverse-Gamma}\left(\frac{\nu_n}{2},\frac{\nu_n\sigma^2_n}{2}\right),
\end{aligned}
\]
where \[
\begin{aligned}
\nu_n&=\nu_0+n\\
\nu_n\sigma^2_n&=\nu_0\sigma^2_0 + \sum (y_i-\bar{y})^2 + k_0n(\bar{y}-\mu_0)^2
\end{aligned}
\]
In shorthand, we can say \[
\theta,\sigma^2\sim NIG(\mu_0,k_0,\nu_0,\sigma^2_0)\quad\Rightarrow\quad \theta,\sigma^2\mid y_1,\ldots,y_n\sim NIG(\mu_n,k_n,\nu_n,\sigma^2_n).
\]
Let’s interpret the posterior parameters. The posterior mean \(\mu_n\) is a weighted average of the prior mean \(\mu_0\) and the sample mean \(\bar{y}\), where the weights are given by the prior precision sample size \(k_0\) and the data sample size \(n\). The posterior sample size \(\nu_n\) is the sum of the prior sample size and the data sample size. The posterior scale parameter \(\nu_n\sigma^2_n\) is a posterior sum of squares. It is a combination of the prior sum of squares, the sample sum of squares, and a term that captures how far the sample mean is from the prior mean.
Q: Why there are two prior sample size parameters \(k_0\) and \(\nu_0\)?
Answer
The parameter \(k_0\) captures the amount of information we have about the mean \(\theta\) given \(\sigma^2\). The parameter \(\nu_0\) captures the amount of information we have about the variance \(\sigma^2\). They are separate because they capture different aspects of our prior knowledge.
Joint posterior density plot in (mean,variance) space
Q: Think why the posterior mean of the variance is not a convex combination of prior variance and sample variance?