Lecture 5: The Normal Model and Jeffreys’ Prior
1 Choice of a prior
Developing prior distributions is undoubtedly the most controversial aspect of any Bayesian analysis. — Lindley (1983)
The fact that one needs to specify a prior distribution is often cited as the main criticism of Bayesian methods. But the choice of a prior is not as subjective as it may seem. In fact, there are many ways to elicit priors, and there are also objective methods for choosing priors.
1.1 Types of priors
There are two main types of priors:
- Subjective priors (informative): based on expert knowledge, previous studies, etc.
- Objective priors (Non-informative): designed to have minimal influence on the posterior distribution.
Subjective priors
- For a single event: relative probability of \(A\) vs \(A^c\).
For example, if you think that \(\theta\) is more likely to be less than 0.5 than greater than 0.5, then you can assign \(p(\theta<0.5)=0.7\) and \(p(\theta>0.5)=0.3\). So, the prior odds are 7:3 in favor of \(\theta<0.5\). After observing the data, you can update your beliefs using Bayes’ theorem.
- For continuous parameters: divide range into intervals, assign probabilities, and then fit a smooth pdf.
Example:
\(\theta\sim N(\mu,\sigma^2)\), educated guess of 25, and the likely range between 10 and 40. Then we assume that \(\Pr(10\leq \theta \leq 40)=0.95\). This implies that \(\mu=25\) and \(4\sigma=40-10=30\), so \(\sigma=7.5\). And the prior is \(N(25, 7.5^2)\).
\(\theta\sim \text{Beta}(a,b)\), \(mean=0.3\), \(sd=0.1\). Then using formulas for the mean and variance of the Beta distribution (\(E\theta=\frac{a}{a+b}\) and \(Var\theta=\frac{ab}{(a+b)^2(a+b+1)}\)), we can solve for \(a\) and \(b\) to get \(a=6\) and \(b=14\). So, the prior is \(\text{Beta}(6, 14)\).
1.2 Objective priors
The idea of objective priors is to design priors that have minimal influence on the posterior distribution.
Two main types of objective priors:
- diffuse priors (or weakly informative / vague priors)
- reference priors (e.g. Jeffreys’ prior)
Diffuse priors
A flat prior, e.g. \(p(\theta)=const\) may appear appealing.
- The good:
- all values of \(\theta\) are equally likely.
Posterior is proportional to the Likelihood/ Sampling model
\[ p(\theta\mid y_{obs})\propto p(y_{obs}\mid\theta)\times p(\theta)\propto p(y_{obs}\mid\theta) \]
In this case, the maximum a posteriori (MAP) estimate is the same as the maximum likelihood estimate (MLE). I.e.,
\[ \arg\max_{\theta} p(\theta\mid y_{obs})=\arg\max_{\theta} p(y_{obs}\mid\theta) \]
Easy to work with
The bad
- It could be improper
- it assigns disproportionate weight to the large values of \(\theta\).
E.g., values of \(\theta\) between 100 and 1000 are 10 times more likely than values between 0 and 100, even though the latter may be more plausible.
- Flatness is not preserved under reparametrizations \(\tau=g(\theta)\).
Recall, when we plotted the pdf of log-odds \(\log[\theta/(1-\theta)]\) for the Binomial model, for the uniform prior on \(\theta\). It was not uniform, but had a significant mass around 0, with most of the values of log-odds being between -5 and 5. So, the uniform prior on \(\theta\) is not uniform on the log-odds scale. Similarly, the uniform prior on \(\theta\) is not uniform on the odds scale \(\tau=\theta/(1-\theta)\).
Example: Non-informative priors for the Binomial model
For example, for the Binomial model \(Y\mid\theta\sim Bin(n,\theta)\)
\(\theta \sim \text{Beta}(1, 1)\) is uniform for \(\theta\in[0,1]\).
\(\theta \sim \text{Beta}(0, 0)\) uniform for log-odds \(\tau=\log[\theta/(1-\theta)]\).
\(\theta \sim \text{Beta}(1, -1)\) uniform for odds \(\tau=\theta/(1-\theta)\).
Exercise: Prove the last two statements.
Note, that last two priors are improper, but often they lead to proper posteriors.
Distributions \(\theta\sim \text{Beta}(0,0)\) is understood as \(p(\theta)\propto \theta^{-1}(1-\theta)^{-1}\), and \(\theta\sim \text{Beta}(1,-1)\) is understood as \(p(\theta)\propto (1-\theta)^{-2}\).
Remedy: Jeffreys’ prior
Jeffreys’ prior \(p_J(\theta)\) is defined as
\[ p_J(\theta)\propto\sqrt{I(\theta)},\qquad I(\theta)=-E\left[\frac{\partial^2\log p(y\mid\theta)}{\partial\theta^2}\mid \theta\right]= -E\left[\frac{\partial^2\log L}{\partial\theta^2}\mid \theta\right] \]
Function \(I(\theta)\) is called the Fisher information, and it measures the amount of information that an observable random variable \(Y\) carries about an unknown parameter \(\theta\) upon which the likelihood depends. The expectation is taken with respect to the sampling distribution of \(Y\) given \(\theta\), \(p(y\mid\theta)\).
For the vector parameter \(\underline{\theta}=(\theta_1,\ldots,\theta_k)\), Jeffreys’ prior is defined as \[ p_J(\theta)\propto\sqrt{det(I(\theta))},\quad I(\theta)=[I_{ij}(\theta)],\quad I_{ij}(\theta)=-E\left[\frac{\partial^2\log L}{\partial\theta_i\partial\theta_j}\mid \theta\right] \] Here, \(I(\theta)\) is the Fisher information matrix, and \(det(I(\theta))\) is its determinant. Fisher information matrix is essentially a negative of the expected Hessian of the log-likelihood function, and it captures the curvature of the likelihood surface around the true parameter value. The determinant of the Fisher information matrix is a measure of the overall amount of information that the data provides about the parameters.
Property: transformation invariant (i.e. ``uninformed with respect to transformations’’) and depends only on the sampling model.
Proof:
First we prove \[ I(\theta)=-E\left[\frac{\partial^2\log L}{\partial\theta^2}\mid\theta\right] =E\left[\left(\frac{\partial \log L}{\partial\theta}\right)^2\mid\theta \right] \] \[ \begin{aligned} E_{p(y\mid\theta)}\left[\left(\frac{\partial \log L}{\partial\theta}\right)^2\mid\theta \right]&= \int \frac{\partial \log L}{\partial\theta} \frac{\partial \log L}{\partial\theta} p(y\mid\theta) dy\\ &=\int \frac{\partial \log L}{\partial\theta} \frac{L'}{L}L dy\qquad\text{(using the fact, that $p(y\mid\theta)=L(\theta)$)}\\ &=\int (\log L)' L' dy\\ &=\int [((\log L)'L)' - (\log L)''L]\,dy\qquad \text{(using the product rule) }\\ &=\int L'' dy - \int (\log L)''L dy\\ &=\frac{\partial^2}{\partial\theta^2}\left(\int L dy\right) +I(\theta)\qquad \text{(using the definition of Fisher information)}\\ &=I(\theta)\qquad \text{(using the fact that $\int L dy=1$ for all $\theta$)} \end{aligned} \]
Thus, we have shown that the Fisher information can be expressed as the expected value of the square of the score function, which is the first derivative of the log-likelihood. This is a fundamental result in statistical theory, and it provides a connection between the curvature of the likelihood function and the variability of the score function. \[ I(\theta)=E\left[\left(\frac{\partial \log L}{\partial\theta}\right)^2\mid\theta \right] \]
Second, we prove that if we construct Jeffreys’ prior using \(\theta\) and then reparametrize to \(\phi=g(\theta)\), we get the same prior as if we had constructed Jeffreys’ prior directly using \(\phi\).
\[ p_J(\theta)\propto\sqrt{I(\theta)},\quad \theta\rightarrow\phi \qquad\Longrightarrow\qquad p_J(\phi)\propto\sqrt{I(\phi)} \]
We will use the change of variable formula for probability densities: \[ p_J(\phi)=p_J(\theta)\left|\frac{\partial \theta}{\partial \phi}\right| \] \[ \begin{aligned} p_J(\phi)&=p_J(\theta)\left|\frac{\partial \theta}{\partial \phi}\right|\\ &\propto \sqrt{I(\theta)}\left|\frac{\partial \theta}{\partial \phi}\right|\\ &=\sqrt{E\left[\left(\frac{\partial \log L}{\partial\theta}\right)^2\left(\frac{\partial \theta}{\partial \phi}\right)^2\mid\theta \right]}\\ &=\sqrt{E\left[\left(\frac{\partial \log L}{\partial\theta}\frac{\partial \theta}{\partial \phi}\right)^2\mid\theta \right]}\\ &=\sqrt{E\left[\left(\frac{\partial \log L}{\partial\phi}\right)^2\mid\theta \right]}\\ &=\sqrt{E\left[\left(\frac{\partial \log L}{\partial\phi}\right)^2\mid\phi \right]}\\ &=\sqrt{I(\phi)} \end{aligned} \]
The conditioning on \(\theta\) vs \(\phi\) does not matter, because \(\theta\) and \(\phi\) are in one-to-one correspondence, so conditioning on one is the same as conditioning on the other. Thus, we have shown that Jeffreys’ prior is invariant under reparametrization, which means that it does not depend on the particular parameterization of the model. This is a desirable property for an objective prior, as it ensures that the prior does not introduce any unintended bias based on the choice of parameterization.
Example: Jeffreys’ prior for the Binomial model
Jeffreys prior do not guarantee to be proper, conjugate, or to lead to a proper posterior. For the Binomial model \(Y\mid\theta\sim Bin(n,\theta)\), we have \[ p_J(\theta)\propto \theta^{-1/2}(1-\theta)^{-1/2}\,=\,\text{Beta}(1/2,1/2) \]
It is a conjugate prior, it is proper and it gives proper posterior distribution. The posterior distribution is
\[ \theta\mid y_1,\ldots,y_n\sim \text{Beta}(y+1/2,n-y+1/2) \]
The prior looks like this:
Example: Jeffreys’ prior for the Poisson model,
\[ p_J(\theta)\propto \theta^{-1/2}=\lim_{b\rightarrow 0}\,\text{Gamma}(1/2,b) \]
We get a conjugate prior, but it is improper. However, it leads to a proper posterior distribution as long as \(y>0\). The posterior distribution is
\[ \theta\mid y_1,\ldots,y_n\sim \text{Gamma}(1/2+\sum_i y_i,0+n) \]
Exercise: derive the two Jeffreys’ priors above using the definition of Jeffreys’ prior and the formula for the Fisher information.
Information-theoretic justification of Jeffreys’ prior
Maximizing the expected Kullback-Leibler divergence between the prior and posterior distributions leads to Jeffreys’ prior. More precisely:
Jeffreys’ prior is defined as the prior that maximizes the asymptotic expected Kullback–Leibler divergence from prior to posterior: \[ p_J(\theta) = \arg\max_{p(\theta)} \lim_{n\to\infty} \mathbb{E}_{y\mid\theta} \!\left[ D_{KL}\!\left( p(\theta\mid y)\,\|\,p(\theta) \right) \right]. \] For regular one-parameter models, this yields \[ p_J(\theta)\propto \sqrt{I(\theta)}, \] where \(I(\theta)\) is the Fisher information.
Intuitively, Jeffreys’ prior is the prior that allows the data to have the most influence on the posterior distribution, as measured by the Kullback-Leibler divergence. This means that Jeffreys’ prior is the least informative prior in terms of how much it allows the data to update our beliefs about \(\theta\).
2 Two-parameter models: Normal Models
After considering the one-parameter models, we now turn to the two-parameter models. The most common example of a two-parameter model is the Normal model, where both the mean and the variance are unknown.
\[ p(y | \theta,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,exp\left[-\frac{1}{2}\left(\frac{y-\theta}{\sigma}\right)^2\right] \]
2.1 Properties:
- mean=median=mode=\(\theta\)
- symmetric around \(\theta\)
- \(Pr(Y\in[\theta\pm 1.96\sigma])\approx0.95\)
- If \(X\sim N(\mu, \tau^2)\), \(Y\sim N(\theta, \sigma^2)\) and \(X,Y\) are independent then} \[ aX+bY\sim N(a\mu+b\theta, a^2\tau^2+b^2\sigma^2) \]
\[ \\[2cm] \]
- WARNING!!!: R commands
\(\texttt{rnorm(a,b)}\) generates random samples from \(N(a,b^2)\), NOT \(N(a,b)\)!
I strongly suggest everyone to use the following code to generate random samples from \(N(a,b^2)\): \(\texttt{rnorm(mean=a, sd=b)}\).
2.2 Inference for the mean conditional on the variance
First, assume that \(\sigma^2\) is known, and we want to make inference about \(\theta\).
The sampling model
\[ Y_i\mid \theta,\sigma^2\sim\text{ i.i.d. } N(\theta,\sigma^2) \] Joint sampling density: \[ \begin{aligned} p(y_1,\ldots,y_n\mid\theta,\sigma^2)&=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{1}{2}\left(\frac{y_i-\theta}{\sigma}\right)^2\right]\\ &=(2\pi\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}(\sum_i y_i^2-2\theta\sum_i y_i+n\theta^2)\right]\\ &\propto \exp\left[c_1(\theta-c_2)^2+c_3\right]\\ \end{aligned} \]
Since we are interested in inference about \(\theta\), we can ignore multiplicative constants that do not depend on \(\theta\).
Q: What is the sufficient statistic for the inference in this model?
A: If we are only interested in \(\theta\), then the sufficient statistic is the \(\sum_i y_i\).
If we are interested in both \(\theta\) and \(\sigma^2\), then the sufficient statistic is a pair \((\sum_i y_i, \sum_i y_i^2)\). The same information is contained in \((\bar{y}, s^2)\), where \(\bar{y}=\frac{1}{n}\sum_i y_i\) is the sample mean, and \(s^2=\frac{1}{n-1}\sum_i (y_i-\bar{y})^2\) is the sample variance.
Conjugate prior (conditional)
Question: what do you think should be a conjugate prior for \(\theta\)?
\[ \begin{aligned} p(\theta\mid\sigma^2,y_1,\ldots,y_n)&\propto p(y_1,\ldots,y_n\mid\theta,\sigma^2)\times p(\theta\mid\sigma^2)\\ &\propto \exp\left[c_1(\theta-c_2)^2+c_3\right]\times p(\theta\mid\sigma^2) \end{aligned} \]
Assume \(\sigma^2\) is known. For the prior and posterior to be of the same functional form for \(\theta\), the prior should be \(\propto\exp\left[d_1(\theta-d_2)^2+d_3\right]\). In other words, the conjugate prior for \(\theta\) is Normal:
\[ p(\theta\mid\sigma^2)\propto\exp\left[-\frac{1}{2}\left(\frac{\theta-\mu_0}{\tau_0}\right)^2\right] \]
The parameter \(\mu_0\) is the prior mean, and \(\tau_0^2\) is the prior variance. \[ \theta\mid\sigma^2\sim N(\mu_0,\tau_0^2) \] \[ \begin{aligned} p(\theta\mid\sigma^2)&\propto \exp\left[-\frac{1}{2}\left(\frac{\theta-\mu_0}{\tau_0}\right)^2\right]\\ &=\exp\left[-\frac{1}{2}\left(\frac{\theta^2}{\tau_0^2}-\frac{2\mu_0\theta}{\tau_0^2}+\frac{\mu_0^2}{\tau_0^2}\right)\right]\\ &=\exp\left[-\frac{1}{2}\left(a\theta^2-2b\theta+c\right)\right]\\ \end{aligned} \]
We are going to develop a shortcut, that will allow us to identify the mean and variance of the normal distribution, as long as we identify the coefficients \(a\), and \(b\) in front of the quadratic and linear terms in \(\theta\).
Q: What is the mean and variance of a normal distribution with pdf proportional to \(\exp[-\frac{1}{2}(a\theta^2-2b\theta+c)]\)?
Click to see the answer
A: The mean is \(\frac{b}{a}\) and the variance is \(\frac{1}{a}\).Posterior distribution (conditional posterior):
\[ \begin{aligned} p(\theta\mid y_1,\ldots,y_n,\sigma^2)&\propto p(y_1,\ldots,y_n\mid\theta,\sigma^2)\times p(\theta\mid\sigma^2)\\ &\propto \exp\left[-\frac{1}{2\sigma^2}(\sum_i y_i^2-2\theta\sum_i y_i+n\theta^2)\right]\\ &\qquad\qquad\times \exp\left[-\frac{1}{2}\left(\frac{\theta^2}{\tau_0^2}-\frac{2\mu_0\theta}{\tau_0^2}+\frac{\mu_0^2}{\tau_0^2}\right)\right]\\ &\propto \exp\left[-\frac{1}{2}\left(A\theta^2-2B\theta+C\right)\right]. \end{aligned} \]
Q: Determine the coefficients \(A\) and \(B\) in the expression above, and then use the shortcut to find the mean and variance of the posterior distribution.
We will get the following posterior distribution (conditional): \[ \theta\mid y_1,\ldots,y_n,\sigma^2\sim N(\mu_n,\tau_n^2), \] where \[ \mu_n=\frac{\mu_0/\tau_0^2+n\bar{y}/\sigma^2}{1/\tau_0^2+n/\sigma^2},\qquad \frac{1}{\tau_n^2}=\frac{1}{\tau_0^2}+\frac{n}{\sigma^2} \]
Interpretation: the posterior mean is a weighted average of the prior mean and the sample mean, where the weights are inversely proportional to the variances of the prior and the sampling distribution.
The inverse of the posterior variance is the sum of the inverses of the prior variance and the sampling variance divided by \(n\). This means that as we collect more data (i.e., as \(n\) increases), the posterior variance decreases, reflecting increased certainty about \(\theta\).
A natural way to think about the bayesian updating in the normal model case is via inverse of the variance, which is also known as precision.
Using the precision, we can rewrite the posterior mean and variance as follows: \[ \mu_n=\frac{\widetilde{\tau}^2_0\mu_0+n\widetilde{\sigma}^{2}\bar{y}}{\widetilde{\tau}_0^{2}+n\widetilde{\sigma}^{2}},\qquad \widetilde\tau_n^2=\widetilde{\tau}_0^{2}+n\widetilde{\sigma}^{2}, \] where \(\widetilde{\tau}^2_0\) is the prior precision, and \(\widetilde{\sigma}^{2}\) is the sampling precision of a single observation.
We have \[ \text{posterior mean}=\frac{\text{prior precision}}{\text{posterior precision}}\times (\text{prior mean}) + \frac{\text{data precision}}{\text{posterior precision}}\times (\text{sample mean}) \] \[ \text{posterior precision}=\text{prior precision} + \text{data precision} \]
For interpretability, we can express \(\tau_0^2\) as \(\tau_0^2=\sigma^2/k_0\), where \(k_0\) is the prior strength in terms of the number of observations. We can do this, because \(\sigma^2\) is a constant, so instead of thinking about the prior strength in terms of the inverse of the prior variance, we can think about it in terms of the number of observations that the prior is worth. In this case, we can rewrite the posterior mean and variance as follows: \[ \mu_n=\frac{k_0\mu_0+n\bar{y}}{k_0+n},\qquad \widetilde\tau_n^2=(k_0+n)\widetilde{\sigma}^{2}. \]