Lecture 12: Regression Model

1 Bayesian Linear Regression

In this lecture, we will discuss the linear regression model, which is a fundamental tool for modeling the relationship between a response variable, \(Y\), and one or more predictor variables, \(\underline{x}\). We will start by introducing the dataset we will be working with, which involves studying the effects of different exercise regimens on oxygen uptake in healthy adults. We will then formulate the linear regression model for this dataset, and discuss how to estimate the model parameters using both least squares and Bayesian methods.

1.1 Dataset: Oxygen Uptake

\(n=12\) healthy men who did not exercise regularly were recruited to take part in a study of the effects of two different exercise regimen on oxygen uptake.

6 men: 12-week flat-terrain running program
other 6: 12-week step aerobics program

Research Question: how does a subject’s change in maximal oxygen uptake depend on which program they were assigned to, as well as other factors such as age?

1.2 Change in maximal oxygen uptake as a function of age and exercise program.

Here is the data:

#### Figure 9.1
x2 <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
x3 <- c(23, 22, 22, 25, 27, 20, 31, 23, 27, 28, 22, 24)
y <- c(-0.87, -10.74, -3.27, -1.97, 7.50, -7.25, 17.05, 4.96, 10.40, 11.05, 0.26, 2.51)

We coded the exercise program as a binary variable \(x_2\), where \(x_2=0\) corresponds to the running program and \(x_2=1\) corresponds to the aerobic program. The age of the subjects is represented by the continuous variable \(x_3\), and the change in maximal oxygen uptake is represented by the response variable \(y\). The variable \(x_1\) is the intercept term, which is equal to 1 for all subjects.

1.3 The Linear Regression Model

We assume that the change in maximal oxygen uptake, \(Y_i\), for subject \(i\) can be modeled as a linear function of the predictors, which include an intercept term, the exercise program, age, and an interaction term between exercise program and age. The model can be written as:

\[ \begin{aligned} Y_{i} & =\beta_{1} x_{i, 1}+\beta_{2} x_{i, 2}+\beta_{3} x_{i, 3}+\beta_{4} x_{i, 4}+\epsilon_{i}, \text { where } \\ &\\ x_{i, 1} & =1 \text { for each subject } i\qquad \text{ (the intercept)}, \\ &\\ x_{i, 2} & =0 \text { if subject } i \text { is on the running program, }\\ &= 1 \text { if on aerobic } \qquad \text{(exercise program, categorical predictor)},\\ &\\ x_{i, 3} & =\text { age of subject } i\qquad \text{(age, continuous predictor)}, \\ &\\ x_{i, 4} & =x_{i, 2} \times x_{i, 3} \qquad \text{(interaction between exercise program and age)}, \\ &\\ \varepsilon_{i} & \sim \text { i.i.d. with mean } 0 \text { and variance } \sigma^{2} \qquad \text{(error term)} \end{aligned} \]

The error term \(\varepsilon_i\) captures the random variability in the response variable that is not explained by the predictors. We introduced an interaction term \(x_{i,4}\) to allow for the possibility that the effect of age on the change in maximal oxygen uptake may differ between the two exercise programs. Specifically:

\[ \begin{aligned} \text { if Running, then } x_{2}=0:\qquad& \mathrm{E}[Y \mid \underline{x}]= \beta_1 + \beta_3 x_{i,3} \\ \text{ if Aerobic, then } x_{2}=1:\qquad& \mathrm{E}[Y \mid \underline{x}]= \beta_1 + \beta_2 + (\beta_3 + \beta_4) x_{i,3} \end{aligned} \]

For the Aerobic program, the effect of age on the change in maximal oxygen uptake is given by \(\beta_3 + \beta_4\), which allows for a different slope compared to the Running program, where the effect of age is given by \(\beta_3\). This means that the relationship between age and the change in maximal oxygen uptake can differ between the two exercise programs, and the interaction term \(x_{i,4}\) allows us to capture this potential difference in the model.

Normal Linear regression model

Normal linear regression model assumes that the error terms \(\varepsilon_i\) are independent and identically distributed (i.i.d.) normal random variables with mean 0 and variance \(\sigma^2\). This leads to the following model for the response variable \(Y_i\):

\[ \begin{aligned} \varepsilon_{1}, \ldots, \varepsilon_{n} & \sim \text { i.i.d. normal }\left(0, \sigma^{2}\right) \\ Y_{i} & =\underline{\beta}^{T} \underline{x}_{i}+\varepsilon_{i} \end{aligned} \]

Therefore, the conditional distribution of \(Y_i\) given the predictors \(\underline{x}_i\) and the parameters \(\underline{\beta}\) and \(\sigma^2\) is:

\[ \left\{Y_{i} \mid \underline{x}_{i}, \underline{\beta}, \sigma^{2}\right\} \sim \text {N}\left(\underline{\beta}^{T} \underline{x}_{i}, \sigma^{2}\right) \]

Here the mean of the distribution is given by the linear predictor \(\underline{\beta}^{T} \underline{x}_{i}\), which captures the systematic part of the model, while the variance \(\sigma^2\) captures the random variability in the response variable that is not explained by the predictors.

Above we used the notation \(\underline{\beta}\) to represent the vector of regression coefficients, and \(\underline{x}_i\) to represent the vector of predictors for subject \(i\). Specifically, we have: \[ \underline{x}_i = \begin{bmatrix}1 \\ x_{i,2} \\ x_{i,3} \\ x_{i,2}\times x_{i,3}\end{bmatrix}, \qquad \underline{\beta} = \begin{bmatrix}\beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4\end{bmatrix}. \]

Joint sampling model

Let us write the joint sampling model for the response vector \(\underline{y} = (y_1, \ldots, y_n)^T\) given the design matrix \(\underline{X}\), the regression coefficients \(\underline{\beta}\), and the variance \(\sigma^2\). The design matrix \(\underline{X}\) is an \(n \times p\) matrix where each row corresponds to a subject and each column corresponds to a predictor (including the intercept). The response vector \(\underline{y}\) is an \(n\)-dimensional vector of the observed responses for each subject. The joint sampling model can be derived from the conditional distribution of each \(Y_i\) given \(\underline{x}_i\), \(\underline{\beta}\), and \(\sigma^2\) by assuming that the \(Y_i\) are independent given the predictors and parameters. This leads to the following expression for the joint sampling model:

\[ \begin{aligned} p\left(y_{1}, \ldots, y_{n} \mid \underline{x}_{1}, \ldots \underline{x}_{n}, \underline{\beta}, \sigma^{2}\right)&=\prod_{i=1}^n p\left(y_i \mid \underline{x}_i, \underline{\beta}, \sigma^2\right) \\ &=\prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \underline{\beta}^T \underline{x}_i)^2}{2\sigma^2}\right) \\ &=(2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \underline{\beta}^T \underline{x}_i)^2\right) \\ &=(2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2} (\underline{y} - \underline{X} \underline{\beta})^T \Sigma^{-1} (\underline{y} - \underline{X} \underline{\beta})\right) \end{aligned} \]

In other words, the joint sampling model for the response vector \(\underline{y}\) given the design matrix \(\underline{X}\), the regression coefficients \(\underline{\beta}\), and the variance \(\sigma^2\) is a multivariate normal distribution with mean vector \(\underline{X} \underline{\beta}\) and covariance matrix \(\sigma^2 \underline{I}\), where \(\underline{I}\) is the identity matrix of appropriate dimension:

\[ \left\{\underline{y} \mid \underline{X}, \underline{\beta}, \sigma^{2}\right\} \sim \text { multivariate normal }\left(\underline{X} \underline{\beta}, \sigma^{2} \underline{I}\right) \]

Here the design matrix \(\underline{X}\) is an \(n \times p\) matrix where each row corresponds to a subject and each column corresponds to a predictor (including the intercept), and the response vector \(\underline{y}\) is an \(n\)-dimensional vector of the observed responses for each subject.

\[ \underline{X} = \begin{bmatrix} \underline{x}_1^T \\ \underline{x}_2^T \\ \vdots \\ \underline{x}_n^T \end{bmatrix} = \begin{bmatrix}1 & x_{1,2} & x_{1,3} & x_{1,2}x_{1,3} \\ 1 & x_{2,2} & x_{2,3} & x_{2,2}x_{2,3} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n,2} & x_{n,3} & x_{n,2}x_{n,3} \end{bmatrix}, \qquad \underline{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}. \]

1.4 WWaFD: Least squares estimation

Goal: find \(\widehat{\underline{\beta}}\) that minimizes the sum of squared residuals (SSR):

\[ SSR(\underline{\beta})=\sum_i (y_i - \underline{\beta}^T \underline{x}_i)^2 \]

Note: Sometime SSR is also called the residual sum of squares (RSS) or the error sum of squares (SSE).

\[ \underline{\widehat{\beta}}^{LS} = \arg\min_{\underline{\beta}} SSR(\underline{\beta}) \]

Least squares estimate is sometimes called the ordinary least squares (OLS) estimate to distinguish it from other types of least squares estimates that we will discuss later in the course.

Using vector notation, we can write the SSR as: \[ \begin{aligned} SSR(\underline{\beta}) &= \|\underline{y} - \underline{X} \underline{\beta}\|^2 \\ &= (\underline{y} - \underline{X} \underline{\beta})^T (\underline{y} - \underline{X} \underline{\beta})\\ &= \underline{y}^T \underline{y} - 2 \underline{\beta}^T \underline{X} \underline{y} + \underline{\beta}^T \underline{X}^T \underline{X} \underline{\beta} \end{aligned} \]

Taking the derivative of the SSR with respect to \(\underline{\beta}\) and setting it to zero gives us the normal equations for least squares estimation: \[ \frac{\partial\, SSR(\underline{\beta})}{\partial \underline{\beta}} = -2 \underline{X}^T \underline{y} + 2 \underline{X}^T \underline{X} \underline{\beta}= \underline{0} \]

Assuming that \(\underline{X}^T \underline{X}\) is invertible, we can solve for the least squares estimator \(\underline{\widehat{\beta}}\):

\[ \underline{\widehat{\beta}}_{OLS} = (\underline{X}^T \underline{X})^{-1} \underline{X}^T \underline{y} \]

1.5 WWaFD: Visualizing the fitted regression lines

Q: Derive expressions for the models M1, M2, M3, and M4 in terms of the regression coefficients \(\beta_1\), \(\beta_2\), \(\beta_3\), and \(\beta_4\) and the predictor \(x_3\), Age.

Answer

M1: \(Y \sim \text{Regimen}\) \[ \begin{aligned} \text{Running program: }\qquad EY = \beta_1 \\ \text{Aerobic program: }\qquad EY = \beta_1 + \beta_2 \end{aligned} \]
M2: \(Y \sim \text{Age}\) \[ \begin{aligned} \text{Running and Aerobic programs: }\qquad EY = \beta_1 + \beta_3 x_3 \end{aligned} \]
M3: \(Y \sim \text{Regimen} + \text{Age}\) \[ \begin{aligned} \text{Running program: }\qquad EY = \beta_1 + \beta_3 x_3 \\ \text{Aerobic program: }\qquad EY = \beta_1 + \beta_2 + \beta_3 x_3 \end{aligned} \]
M4: \(Y \sim \text{Regimen} + \text{Age} + \text{Regimen}\times\text{Age}\) \[ \begin{aligned} \text{Running program: }\qquad EY = \beta_1 + \beta_3 x_3 \\ \text{Aerobic program: }\qquad EY = \beta_1 + \beta_2 + (\beta_3 + \beta_4) x_3 \end{aligned} \]

From the fitted regression model M4 (the full model), we can derive the estimated regression lines for the Running and Aerobic programs by plugging in the estimated coefficients \(\widehat{\beta}_1\), \(\widehat{\beta}_2\), \(\widehat{\beta}_3\), and \(\widehat{\beta}_4\) into the expressions for the expected value of \(Y\) given the predictors. Specifically, we have:

\[ \text{Running program: }\qquad \widehat{EY} \approx -50 + 2 \times \text{age} \]

\[ \text{Aerobic program: }\qquad \widehat{EY} \approx -37 + 1.77 \times \text{age} \]

\[ \\[2cm] \]

Coefficient	Estimate	Std. Error
\(\widehat{\beta}_1\)	-51.29	12.25
\(\widehat{\beta}_2\)	13.11	15.76
\(\widehat{\beta}_3\)	2.09	0.53
\(\widehat{\beta}_4\)	-0.32	0.65

Q: Which predictors are statistically significant?

Answer

The predictor “age” (x3) is statistically significant, as its estimated coefficient \(\widehat{\beta}_3\) has a standard error of 0.53, which is relatively small compared to the magnitude of the coefficient (2.09). This suggests that age has a significant effect on the change in maximal oxygen uptake.
On the other hand, the predictor “exercise program” (x2) and the interaction term (x4) are not statistically significant, as their estimated coefficients \(\widehat{\beta}_2\) and \(\widehat{\beta}_4\) have relatively large standard errors (15.76 and 0.65, respectively) compared to the magnitude of the coefficients (13.11 and -0.32, respectively). This suggests that there is no significant difference in the change in maximal oxygen uptake between the two exercise programs, and that the effect of age on the change in maximal oxygen uptake does not differ significantly between the two exercise programs.

Looks like age is the only significant predictor in the model, and there is no significant difference between the two exercise programs in terms of their effect on the change in maximal oxygen uptake. We will explore this further in this lecture and in the next lecture when we discuss model comparison and selection.

1.6 WWaBD: Bayesian Estimation

Bayesian estimation of the linear regression model involves specifying a prior distribution for the parameters \(\underline{\beta}\) and \(\sigma^2\), and then using Bayes’ theorem to derive the posterior distribution of these parameters given the observed data.

Graphical model for the linear regression model:

Figure 1: Bayesian network for Linear Regression with Conjugate Prior

Semiconjugate prior

From our extensive experience with conjugate priors in this course, we know that the likelihood function for the linear regression model is a multivariate normal distribution, and that the natural semi-conjugate prior for the regression coefficients \(\underline{\beta}\) is also a multivariate normal distribution.

Prior: \(\underline{\beta} \sim \operatorname{MVN}\left(\underline{\beta}_{0}, \underline{\Sigma_{0}}\right)\)

Let us verify this by deriving the conditional posterior.

Posterior (full conditional):

\[ \begin{aligned} p\left(\underline{\beta} \mid \underline{y}, \underline{X}, \sigma^{2}\right) &\propto p\left(\underline{y} \mid \underline{X}, \underline{\beta}, \sigma^{2}\right) p\left(\underline{\beta}\mid \underline{X}, \sigma^{2}\right) \\ &\propto p\left(\underline{y} \mid \underline{X}, \underline{\beta}, \sigma^{2}\right) p\left(\underline{\beta}\right) \\ &\propto \exp\left(-\frac{1}{2\sigma^2} (\underline{y} - \underline{X} \underline{\beta})^T (\underline{y} - \underline{X} \underline{\beta})\right) \exp\left(-\frac{1}{2} (\underline{\beta} - \underline{\beta}_0)^T \Sigma_0^{-1} (\underline{\beta} - \underline{\beta}_0)\right) \\ &=\exp\left(-\frac{1}{2\sigma^2} \left[ \underline{\beta}^T \underline{X}^T \underline{X} \underline{\beta} - 2 \underline{\beta}^T \underline{X}^T \underline{y} + \underline{y}^T \underline{y}\right]\right)\times \\ &\qquad \exp\left(-\frac{1}{2} \left[ \underline{\beta}^T \Sigma_0^{-1} \underline{\beta} - 2 \underline{\beta}^T \Sigma_0^{-1} \underline{\beta}_0 + \underline{\beta}_0^T \Sigma_0^{-1} \underline{\beta}_0\right]\right) \\ &\propto\exp\left(-\frac{1}{2} \left[ \underline{\beta}^T \left(\frac{1}{\sigma^2} \underline{X}^T \underline{X} + \Sigma_0^{-1}\right) \underline{\beta} - 2 \underline{\beta}^T \left(\frac{1}{\sigma^2} \underline{X}^T \underline{y} + \Sigma_0^{-1} \underline{\beta}_0\right)\right]\right) \end{aligned} \]

We recognize this as the kernel of a multivariate normal distribution, which means that the posterior distribution of \(\underline{\beta}\) given \(\underline{y}\), \(\underline{X}\), and \(\sigma^2\) is also a multivariate normal distribution: \[ \underline{\beta} \mid \underline{y}, \underline{X}, \sigma^{2} \sim \operatorname{MVN}\left(\underline{m}, \underline{V}\right), \] where the posterior mean \(\underline{m}\) and the posterior covariance matrix \(\underline{V}\) are given by: \[ \underline{V} = \left(\frac{1}{\sigma^2} \underline{X}^T \underline{X} + \Sigma_0^{-1}\right)^{-1}, \qquad \underline{m} = \underline{V} \left(\frac{1}{\sigma^2} \underline{X}^T \underline{y} + \Sigma_0^{-1} \underline{\beta}_0\right). \]

Q: Why the precision of the data has the form \(\frac{1}{\sigma^2} \underline{X}^T \underline{X}\)?

Answer

This is exactly the inverse of the variance of the \(\underline{\widehat\beta}_{ols}\) estimator in the frequentist linear regression model. In the frequentist setting, the variance of the least squares estimator \(\underline{\widehat{\beta}}\) is given by \(\sigma^2 (\underline{X}^T \underline{X})^{-1}\). Therefore, the precision (which is the inverse of the variance) of the data in the Bayesian setting is \(\frac{1}{\sigma^2} \underline{X}^T \underline{X}\).

Let’s check our intuition about the \(\underline{m}\) and \(\underline{V}\) by considering two extreme cases:

If \(\sigma^2\to\infty\), then \(\underline{V} \to \Sigma_0\) and \(\underline{m} \to \underline{\beta}_0\). In this case, the data is very noisy and provides no information about the parameters, so the posterior distribution of \(\underline{\beta}\) is the same as the prior distribution.
If \(\Sigma_0 \to \infty\), then \(\underline{V} \to \sigma^2 (\underline{X}^T \underline{X})^{-1}=Var(\underline{\widehat{\beta}}_{OLS})\) and \(\underline{m} \to \underline{\widehat{\beta}}_{OLS}\). In this case, the prior is very vague and provides no information about the parameters, so the posterior distribution of \(\underline{\beta}\) is the same as the sampling distribution of the least squares estimator.

Full conditional for \(\sigma^2\)

Similarly to a lot of computations we have done in this course, we can derive the full conditional distribution for \(\sigma^2\) by treating \(\underline{\beta}\) as known and using the likelihood function for \(\underline{y}\) given \(\underline{X}\), \(\underline{\beta}\), and \(\sigma^2\). We will get

\[ p(\sigma^2 \mid \underline{y}, \underline{X}, \underline{\beta}) \propto (\sigma^2)^{-a} \exp\left(-\frac{b}{\sigma^2}\right) \]

If we choose a prior for \(\sigma^2\) \[ \sigma^2 \sim \text{Inverse-Gamma}\left(\frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\right), \]

then the posterior distribution for \(\sigma^2\) will also be an inverse-gamma distribution: \[ \sigma^2 \mid \underline{y}, \underline{X}, \underline{\beta} \sim \text{Inverse-Gamma}\left(\frac{\nu_0 + n}{2}, \frac{\nu_0 \sigma_0^2 + SSR(\underline{\beta})}{2}\right), \]

Interpretation:

\(SSR(\underline{\beta}) = \sum_{i=1}^n (y_i - \underline{\beta}^T \underline{x}_i)^2\) is the sum of squared residuals evaluated at the given \(\underline{\beta}\), i.e. the data sum of squares for the regression model with parameters \(\underline{\beta}\).
\(\nu_0\) is the prior degrees of freedom, which can be thought of as the equivalent sample size of the prior information about \(\sigma^2\).
\(\nu_0 \sigma_0^2\) is the prior sum of squares, which can be thought of as the equivalent sum of squared residuals from the prior information about \(\sigma^2\).

Choice of the parameters of the prior distribution

The choice of the parameters \(\underline{\beta}_0\) and \(\Sigma_0\) for the prior distribution of \(\underline{\beta}\) can be guided by prior knowledge or beliefs about the parameters, as well as by considerations of computational convenience.

Usually, we can choose \(\underline{\beta}_0\) to be a vector of zeros, which corresponds to a prior belief that the predictors have no effect on the response variable. Or, \(\underline{\beta}_0\) can be set to the least squares estimates \(\underline{\widehat{\beta}}_{OLS}\), which corresponds to a prior belief that the parameters are likely to be close to the least squares estimates.

Note: Setting \(\underline{\beta}_0\) to the least squares estimates is against the Bayesian principle of not using the data to inform the prior distribution. However, it can be a reasonable choice in practice when we want to use a weakly informative prior that is centered around a reasonable estimate of the parameters.

The choice of \(\Sigma_0\) can be guided by the desired level of informativeness of the prior. A common choice is to set \(\Sigma_0\) to a diagonal matrix with large values on the diagonal, which corresponds to a weakly informative prior that allows for a wide range of possible values for the parameters.

Alternatively, for the sake of computational convenience there are two common choices for \(\Sigma_0\):

Unit information prior (UIP)
Zellner’s \(g\)-prior

Unit information prior (UIP)

Idea: choose \(\Sigma_0\) such that the prior distribution of \(\underline{\beta}\) has the same amount of information as one data point.

The data precision is given by \(\frac{1}{\sigma^2} \underline{X}^T \underline{X}\), which is the inverse of the variance of the least squares estimator. The information from one data point is just 1/n of the total information from the data, so we want the prior precision to be \(\frac{1}{n} \frac{1}{\sigma^2} \underline{X}^T \underline{X}\). Therefore, we set \(\Sigma_0\) to be the inverse of this prior precision:

\[ \Sigma_0 = n \sigma^2 (\underline{X}^T \underline{X})^{-1}. \]

We will consider Zellner’s \(g\)-prior in the next lecture, which is a generalization of the unit information prior that allows for a tunable parameter \(g\) to control the informativeness of the prior. It has an additional advantage that it leads to a closed-form expression for the marginal likelihood \(p(\underline{y} \mid \underline{X})\), which can be useful for model comparison and selection.