Lecture 13: Regression Model (continued)

1 Recall: Exercise Program Data

Recall, in the previous lecture, we considered the linear relationship between the change in maximal oxygen uptake with age and exercise program.

x2 <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
x3 <- c(23, 22, 22, 25, 27, 20, 31, 23, 27, 28, 22, 24)
y <- c(-0.87, -10.74, -3.27, -1.97, 7.50, -7.25, 17.05, 4.96, 10.40, 11.05, 0.26, 2.51)

We will consider the following regression model:

\[ \begin{aligned} Y_i &= \beta_1 + \beta_2\times \text{Regimen}_i + \beta_3 \times \text{Age}_i + \beta_4\, (\text{Age}_i \times \text{Regimen}_i) + \varepsilon_i,\\ &\phantom{=}\\ \varepsilon_i &\stackrel{iid}{\sim} \operatorname{normal}(0, \sigma^2) \end{aligned} \]

\(Regimen\) is a binary variable (or categorical with 2 levels) that indicates whether the subject was in the running group (\(Regimen=0\)) or the aerobics group (\(Regimen=1\)).
\(Age\) is a continuous variable that represents the age of the subject.
The interaction term \(Age \times Regimen\) allows the effect of age on the change in maximal oxygen uptake to differ between the two exercise regimens.

We can rewrite the model in matrix form as follows: \[ \begin{aligned} \underline{Y} &= \underline{X} \underline{\beta} + \underline{\varepsilon},\\ &\phantom{=}\\ \underline{\varepsilon} &\stackrel{iid}{\sim} \operatorname{MVN}(\underline{0}, \sigma^2 \underline{I}) \end{aligned} \]

Above we denoted \[ \underline{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad \underline{X} = \begin{bmatrix} 1 & x_{21} & x_{31} & x_{21}x_{31} \\ 1 & x_{22} & x_{32} & x_{22}x_{32} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{2n} & x_{3n} & x_{2n}x_{3n} \end{bmatrix} = \begin{bmatrix} -& \underline{x}_1^T &-\\ -& \underline{x}_2^T &-\\ &\vdots& \\ -& \underline{x}_n^T &- \end{bmatrix}, \quad \underline{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{bmatrix}, \quad \underline{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]

2 Bayesian Estimation

2.1 Joint sampling model

The joint sampling model (from the previous lecture) is given by \[ p(\underline{y} \mid \underline{X}, \underline{\beta}, \sigma^2) \propto (\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2} SSR(\underline{\beta})\right) \]

Recall, the key to deriving the full conditioal distributions is to recognize that when we treat \(\underline{\beta}\) as a variable and \(\sigma^2\) as a constant, the sampling model is a kernel of a multivariate normal distribution. Similarly, when we treat \(\sigma^2\) as a variable and \(\underline{\beta}\) as a constant, the sampling model is a kernel of an inverse-gamma distribution.

2.2 Semiconjugate prior

With the above in mind, we set up a prior for \(\underline{\beta}\) and \(\sigma^2\) such that the full conditional distributions are of known form.

Prior for \(\underline{\beta}\)

\(\underline{\beta} \sim \operatorname{MVN}\left(\underline{\beta}_{0}, \underline{\Sigma}_{0}\right)\), then

Full conditional for \(\underline{\beta}\): \[ \underline{\beta} \mid \underline{y}, \underline{X}, \sigma^{2} \sim MVN(\underline{m},\underline{V}) \] \[ \underline{V}=\left(\underline{\Sigma}_{0}^{-1}+\frac{1}{\sigma^{2}} \underline{X}^{T} \underline{X}\right)^{-1}, \quad \underline{m}=\underline{V}\left(\underline{\Sigma}_{0}^{-1} \underline{\beta}_{0}+\frac{1}{\sigma^{2}} \underline{X}^{T} \underline{y}\right) \]

Exercise: Derive the full conditional distribution for \(\sigma^2\).

Interpretation:

We can rewrite the variance via precision as follows: \[ \underline{V}^{-1}=\underline{\Sigma}_{0}^{-1}+[\sigma^{2} (\underline{X}^{T} \underline{X})^{-1}]^{-1}=\text{Prior precision} + \text{Data precision} \] The data precision, is the precision of an OLS estimator of \(\underline{\beta}\). Recall, \(Var(\hat{\underline{\beta}}^{OLS}) = \sigma^2 (\underline{X}^T \underline{X})^{-1}\).

Interpretation of the mean: \[ \underline{m}=\underline{V}\left(\underline{\Sigma}_{0}^{-1} \underline{\beta}_{0}+\frac{1}{\sigma^{2}} \underline{X}^{T} \underline{y}\right)=\underline{V}\left(\text{Prior precision}\times \text{Prior mean} + \text{Data precision}\times \text{OLS estimate}\right) \]

Note that the data precision is \(\frac{1}{\sigma^2} \underline{X}^T \underline{X}\), and the OLS estimate is \(\hat{\underline{\beta}}^{OLS} = (\underline{X}^T \underline{X})^{-1} \underline{X}^T \underline{y}\), so \(\text{data precision}\times\text{OLS estimate}\): \[ \frac{1}{\sigma^2} \underline{X}^T \underline{X} \times (\underline{X}^T \underline{X})^{-1} \underline{X}^T \underline{y} = \frac{1}{\sigma^2} \underline{X}^T \underline{y} \]

Prior for \(\sigma^{2}\)

The semiconjugate prior distribution for \(\sigma^{2}\) is an inverse-gamma distribution. Parametrizing it in terms of the precision \(\gamma = 1/\sigma^2\), we have

\[ \gamma=1 / \sigma^{2},\qquad \gamma \sim \operatorname{Gamma}\left(\nu_{0} / 2, \nu_{0} \sigma_{0}^{2} / 2\right) \]

Full conditional for \(\sigma^{2}\): \[ \gamma \mid \underline{y}, \underline{X}, \underline{\beta}\sim \operatorname{Gamma}\left(\left[\nu_{0}+n\right] / 2,\left[\nu_{0} \sigma_{0}^{2}+\operatorname{SSR}(\underline{\beta})\right] / 2\right) \]

Interpretation:

The posterior distribution for \(\sigma^2\) is an inverse-gamma distribution with updated parameters. - The shape parameter is updated by adding the sample size \(n\) to the prior sample size \(\nu_0\).

The scale parameter is updated by adding the sum of squared residuals (SSR) from the regression model (data sum of squares) to the prior sum of squares \(\nu_0 \sigma_0^2\).

2.3 Connection to the Frequentist Approach

Consider a flat independent prior on \((\beta,\log\sigma^2)\), i.e. \(p(\underline{\beta}, \log \sigma^2) \propto 1\).

Side note: Often in Bayesian statistics we take logarithm of the scale parameter \(\sigma^2\) and put a flat prior on \(\log \sigma^2\) instead of \(\sigma^2\) itself. This is because \(\sigma^2\) is a positive parameter, and putting a flat prior on \(\log \sigma^2\) allows us to express our ignorance about the scale of \(\sigma^2\) without favoring any particular value.

Exercise: Show that flat prior on \(\log \sigma^2\) corresponds to an improper prior on \(\sigma^2\) of the form \(p(\sigma^2) \propto 1/\sigma^2\).

Under the flat prior above, the posterior distribution for \(\underline{\beta}\) and \(\sigma^2\) is given by \[ \underline{\beta}\mid \underline{y}, \underline{X}, \sigma^{2} \sim MVN(\hat{\underline{\beta}}^{OLS}, \sigma^2 (\underline{X}^T \underline{X})^{-1}) \]

\[ \sigma^2 \mid \underline{y}, \underline{X}, \underline{\beta} \sim \operatorname{Inv-Gamma}\left(\frac{n-p}{2}, \frac{SSR(\hat{\underline{\beta}}^{OLS})}{2}\right) \]

Q: What method we should use to approximate the posterior distribution of \(\underline{\beta}\) and \(\sigma^2\)?

Answer

We just need to use an independent Monte Carlo sampler, i.e. we can draw samples \(\sigma^2^{(i)}\) from its inverse-gamma posterior distribution, and then draw samples of \(\underline{\beta}\) from its multivariate normal posterior distribution conditional on the sampled \(\sigma^2^{(i)}\).

3 Gibbs Sampler

Going back to our original semiconjugate prior, the full conditional distributions for \(\underline{\beta}\) and \(\sigma^2\) are of known form. Therefore, we can use a Gibbs sampler to generate samples from the joint posterior distribution of \(\underline{\beta}\) and \(\sigma^2\).

The pseudocode for the Gibbs sampler is as follows:

Note

Given current values \(\left\{\underline{\beta}^{(s)}, \sigma^{2(s)}\right\}\), new values can be generated by

updating \(\underline{\beta}\):
1. compute \(\underline{V}=\operatorname{Var}\left[\underline{\beta} \mid \underline{y}, \underline{X}, \sigma^{2(s)}\right]\) and \(\underline{m}=\mathrm{E}\left[\underline{\beta} \mid \underline{y}, \underline{X}, \sigma^{2(s)}\right]\)
2. sample \(\underline{\beta}^{(s+1)} \sim \operatorname{MVN}(\underline{m}, \underline{V})\)
updating \(\sigma^{2}\):
1. compute \(\operatorname{SSR}\left(\underline{\beta}^{(s+1)}\right)\)
2. sample \(\sigma^{2(s+1)} \sim \operatorname{Inv-Gamma}\left(\left[\nu_{0}+n\right] / 2,\left[\nu_{0} \sigma_{0}^{2}+\operatorname{SSR}\left(\underline{\beta}^{(s+1)}\right)\right] / 2\right)\).

3.1 Prior Parameter Selection

What are the possible priors for \(\underline{\beta}\)?

We will consider two options:

Unit information prior
Zellner’s \(g\)-prior

UIP: Unit information prior

Zellner’s \(g\)-prior

3.2 Results

Coefficient	OLS	g-prior shrinkage
Intercept	-51.294	-47.348
x2	13.107	12.099
x3	2.095	1.934
x2:x3	-0.318	-0.294

\[ \\[5cm] \]

Posterior distributions for \(\beta_2\) and \(\beta_4\)

95% confidence intervals for the difference in expected change scores between aerobics subjects and running subjects.

\[ \\[8cm] \]