Lecture 9: The Multivariate Normal Model

1 Multivariate normal data

1.1 Dataset

A sample of twenty-two children (\(n=22\)) are given reading comprehension tests before and after receiving a particular instructional method. Each student \(i\) will then have two scores, \(Y_{i,1}\) and \(Y_{i,2}\) denoting the pre- and post-instructional scores respectively. Thus, we have 22 observations of the form \(\underline{Y}_i = (Y_{i,1}, Y_{i,2})^T\) for \(i=1,\ldots,22\), i.e. vector observations.

     pretest posttest
[1,]      59       77
[2,]      43       39
[3,]      34       46
[4,]      32       26
[5,]      42       38
[6,]      38       43

Q: why can’t we fit two separate univariate normal models to the pretest and posttest scores?

Answer

Because the pretest and posttest scores are likely correlated. Fitting separate univariate normal models would ignore this correlation and could lead to incorrect inferences about the relationship between pretest and posttest scores.

1.2 Sampling model

We model each student’s scores as a draw from a multivariate normal (MVN) distribution with mean vector \(\underline{\theta}\) and covariance matrix \(\underline{\Sigma}\):

\[ \underline{Y}_i\mid \underline{\theta}, \underline{\Sigma} \sim iid\ \text{MVN}(\underline{\theta}, \underline{\Sigma}), \quad i=1,\ldots,n \]

Above, \(\underline{Y}_i\) is a \(p\)-dimensional vector (in our case, \(p=2\) for pretest and posttest scores), \(\underline{\theta}\) is the mean vector of the MVN (also \(p\)-dimensional), and \(\underline{\Sigma}\) is the covariance matrix (or variance-covariance matrix, or variance matrix) that captures the variability and correlation between the components of \(\underline{Y}_i\).

The pdf of the MVN is given by:

\[ p(\underline{y} | \underline{\theta}, \underline{\Sigma}) = (2\pi)^{-p/2}|\underline{\Sigma}|^{-1/2} \exp\left\{-\frac{1}{2}(\underline{y} - \underline{\theta})^T \underline{\Sigma}^{-1} (\underline{y} - \underline{\theta})\right\} \]

Where:

\[ \underline{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix}, \quad \underline{\theta} = \begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{bmatrix}, \quad \underline{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{1,2} & \cdots & \sigma_{1,p} \\ \sigma_{1,2} & \sigma_2^2 & \cdots & \sigma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1,p} & \sigma_{2,p} & \cdots & \sigma_p^2 \end{bmatrix} \]

And the \(|\underline{\Sigma}|\) denotes the determinant of the covariance matrix, and \(\underline{\Sigma}^{-1}\) is the inverse of the covariance matrix.

Interpretation of parameters:

\(\theta_j\): the mean of the \(j\)-th component of the multivariate normal distribution (e.g., mean pretest score, mean posttest score).
\(\sigma_j^2\): the variance of the \(j\)-th component (e.g., variance of pretest scores, variance of posttest scores).
\(\sigma_{j,k}\): the covariance between the \(j\)-th and \(k\)-th components (e.g., covariance between pretest and posttest scores).

Compare the MVN pdf to the pdf of the Normal\((\theta,\sigma^2)\):

\[ p({y} | {\theta}, {\sigma}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(y-\theta)^2\right\} \]

Notice, they both have the similar structure of a normalizing constant (involving \(\sigma^2\) or \(|\underline{\Sigma}|\)) and an exponential term that depends on the squared distance between the observation and the mean, scaled by the variance or covariance.

Above, we see contour plots of the MVN density for different values of the correlation \(\rho\) between the two components. The points represent samples drawn from the corresponding MVN distribution. As \(\rho\) increases, the contours become more elongated along the line \(y_1 = y_2\), indicating a stronger positive correlation between the two variables.

Q: What does the parameter \(\rho\) represent in the covariance matrix \(\underline{\Sigma}\)?

Answer

The parameter \(\rho\) represents the correlation between the two components of the multivariate normal distribution. It is defined as \(\rho = \frac{\sigma_{1,2}}{\sqrt{\sigma_1^2 \sigma_2^2}}\), where \(\sigma_{1,2}\) is the covariance between the two components, and \(\sigma_1^2\) and \(\sigma_2^2\) are the variances of the first and second components, respectively. The value of \(\rho\) ranges from -1 to 1, where \(\rho = 0\) indicates no correlation, \(\rho > 0\) indicates a positive correlation, and \(\rho < 0\) indicates a negative correlation. Note, the correlation measures the strength and direction of the linear relationship between the two variables, while the covariance \(\sigma_{1,2}\) measures the joint variability of the two variables without standardizing by their variances.

1.3 Semiconjugate prior

We assume the priors of the two parameters \(\underline{\theta}\) and \(\underline{\Sigma}\) are independent, i.e. the joint prior factorizes as: \[ p(\underline{\theta}, \underline{\Sigma}) = p(\underline{\theta}) p(\underline{\Sigma}) \]

This will allow us to derive the full conditional distributions for \(\underline{\theta}\) and \(\underline{\Sigma}\) separately, which is useful for Gibbs sampling.

\[ p(\underline{\theta}) = \text{MVN}(\underline{\mu}_0, \underline{\Lambda}_0) \]

\[ \begin{aligned} p(\underline{\theta}) &= (2\pi)^{-p/2}|\underline{\Lambda}_0|^{-1/2} \exp\left\{-\frac{1}{2}(\underline{\theta} - \underline{\mu}_0)^T \underline{\Lambda}_0^{-1} (\underline{\theta} - \underline{\mu}_0)\right\}\\ &\propto \exp\left\{-\frac{1}{2}\underline{\theta}^T \underline{\Lambda}_0^{-1} \underline{\theta} + \underline{\theta}^T \underline{\Lambda}_0^{-1} \underline{\mu}_0\right\}\\ &=\exp\left\{-\frac12\underline{\theta}^T \underline{A}\underline{\theta} + \underline{\theta}^T \underline{b}\right\} \end{aligned} \]

Recall our shortcut: the mean of the MVN is \(\underline{A}^{-1}\underline{b}\) and the covariance matrix is \(\underline{A}^{-1}\).

1.4 Joint sampling model

\[ \begin{aligned} p(\underline{y}_1, \ldots, \underline{y}_n | \underline{\theta}, \underline{\Sigma}) &= \prod_{i=1}^{n} p(\underline{y}_i | \underline{\theta}, \underline{\Sigma}) \\ &= (2\pi)^{-np/2}|\underline{\Sigma}|^{-n/2} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}(\underline{y}_i - \underline{\theta})^T \underline{\Sigma}^{-1} (\underline{y}_i - \underline{\theta})\right\}\\ \end{aligned} \]

Now, we will write it as a function of \(\underline{\theta}\) only (for the full conditional computation):

\[ \begin{aligned} p(\underline{y}_1, \ldots, \underline{y}_n | \underline{\theta}, \underline{\Sigma}) &\propto \exp\left[-\frac12 n\underline\theta^T\underline\Sigma^{-1}\underline{\theta}+ \sum_{i=1}^n\underline\theta^T\underline\Sigma^{-1}\underline{y}_i\right]\\ &=\exp\left[-\frac12 \underline\theta^T\underline{A}_n\underline{\theta}+\underline{\theta}^T\underline{b}_n\right], \end{aligned} \] where \(\underline{A}_n = n\underline{\Sigma}^{-1}\) and \(\underline{b}_n = \underline{\Sigma}^{-1} \sum_{i=1}^n \underline{y}_i=n\underline{\Sigma}^{-1}\underline{\bar{y}}\).

1.5 Conditional Posterior aka Full Conditional for the Mean

\[ \begin{aligned} p(\underline{\theta}|\underline{y}_1, \ldots, \underline{y}_n, \underline{\Sigma})&\propto p(\underline{y}_1,\ldots,\underline{y}_n\mid \underline{\theta},\underline{\Sigma})\times p(\underline{\theta}\mid\underline{\Sigma})\\ &\propto \exp\left[-\frac12 \underline\theta^T\underline{A}_n\underline{\theta}+\underline{\theta}^T\underline{b}_n\right]\times \exp\left[-\frac12\underline{\theta}^T \underline{A}_0 \underline{\theta} + \underline{\theta}^T \underline{b}_0\right]\\ &=\exp\left[-\frac12\underline{\theta}^T (\underline{A}_n+\underline{A}_0)\underline{\theta} + \underline{\theta}^T (\underline{b}_n+\underline{b}_0)\right]\\ &=\exp\left[-\frac12\underline{\theta}^T \underline{A}_{post}\underline{\theta} + \underline{\theta}^T \underline{b}_{post}\right]\\ \end{aligned} \]

Thus, we derived the full conditional distribution of \(\underline{\theta}\) given the data and \(\underline{\Sigma}\), which is also a multivariate normal distribution with parameters that depend on both the prior parameters and the data. The mean of this full conditional distribution can be computed as \(\underline{A}_{post}^{-1} \underline{b}_{post}\) and the covariance matrix can be computed as \(\underline{A}_{post}^{-1}\).
\[ \underline{\theta} | \underline{y}_1, \ldots, \underline{y}_n, \underline{\Sigma} \sim \text{MVN}(\underline{A}_{post}^{-1} \underline{b}_{post}, \underline{A}_{post}^{-1}) \] In other words, \[ Var(\underline{\theta} | \underline{y}_1, \ldots, \underline{y}_n, \underline{\Sigma}) = (\underline{A}_n + \underline{A}_0)^{-1}= (n\underline{\Sigma}^{-1} + \underline{\Lambda}_0^{-1})^{-1} \]

\[ E(\underline{\theta}\mid \underline{y}_1, \ldots, \underline{y}_n, \underline{\Sigma})=(n\underline{\Sigma}^{-1} + \underline{\Lambda}_0^{-1})^{-1}(n\underline{\Sigma}^{-1}\underline{\bar{y}} + \underline{\Lambda}_0^{-1}\underline{\mu}_0) \]

Interpretation of the full conditional parameters:

We have: \[ \text{Posterior Precision}=\text{Data Precision} + \text{Prior Precision} \]

\[ \text{Posterior Mean} = \frac{\text{Data Precision} \times \text{Data Mean} + \text{Prior Precision} \times \text{Prior Mean}}{\text{Posterior Precision}} \]

Now, the question is: how do we derive the full conditional distribution for \(\underline{\Sigma}\)? And how do we set the prior \(p(\underline{\Sigma})\)? This is a prior on the space of \(p\times p\) positive semi-definite matrices! This calls for introduction of a new distribution, the Wishart (and inverse-Wishart) distribution, which is a distribution over positive semi-definite matrices and is the multivariate generalization of the Gamma (and inverse-gamma, correspondingly) distributions used in the univariate normal model for the variance parameter.

2 Wishart and the Inverse-Wishart Distributions

2.1 Empirical Covariance matrix

Recall, that in 1D case, if \(Z_i \sim iid\ \text{N}(0,1)\), then \[ \sum_{i=1}^{n} Z_i^2 \quad \sim \quad \chi^2_{df=n} \sim \text{Gamma}\left(\frac{n}{2}, \frac{1}{2}\right) \]

If we didn’t have access to the \(\texttt{rgamma}\) function, how would we sample from this distribution? Easy!

Sample \(Z_i \sim iid\ \text{N}(0,1)\), for \(i=1,\ldots,n\)

Compute \(\sum_{i=1}^{n} Z_i^2\)

The sum above gives us a single sample from the \(\chi^2\) distribution with \(n\) degrees of freedom, which is equivalent to a sample from the Gamma distribution with shape \(n/2\) and rate \(1/2\).

Now, for the \(p\)-dimensional case:

Sample \(\underline{Z}_i \sim iid\ \text{MVN}(\underline{0}, \underline{\Phi}_0)\), for \(i=1,\ldots,\nu\).

Compute \(\sum_{i=1}^{\nu} \underline{Z}_i \underline{Z}_i^T\).

The above sum, by definition, gives us a single sample from the Wishart distribution with \(\nu\) degrees of freedom and scale matrix \(\underline{\Phi}_0\), i.e. a sample from \(Wishart(\nu,\underline{\Phi}_0)\). Warning: the notation is not standard, sometimes the degrees of freedom \(\nu\) is \(\nu+p\),

The Wishart distribution is a multivariate generalization of the chi-squared distribution, and it is used in specifying the prior for the covariance matrix \(\underline{\Sigma}\) in the multivariate normal model. Note, \(\underline{Z}_i \underline{Z}_i^T\) will be a \(p \times p\) matrix, and the sum \(\sum_{i=1}^{\nu} \underline{Z}_i \underline{Z}_i^T\) will also be a \(p \times p\) matrix that captures the variability and correlation structure of the multivariate data.

The matrix \(\sum_{i=1}^{\nu} \underline{Z}_i \underline{Z}_i^T\) is also known as \(\nu\times\text{empirical covariance matrix}\), or empirical sum of squares. Let’s try to understand the structure of this sum.

If \(\underline{Z}_i=(Z_{i1}, Z_{i2},\ldots, Z_{ip})^T\), then \(\underline{Z}_i \underline{Z}_i^T\) is a \(p \times p\) matrix with \((j,k)\)-th entry equal to \(Z_{ij} Z_{ik}\). Thus, the \((j,k)\)-th entry of the sum \(\sum_{i=1}^{\nu} \underline{Z}_i \underline{Z}_i^T\) is given by \(\sum_{i=1}^{\nu} Z_{ij} Z_{ik}\), which captures the covariance between the \(j\)-th and \(k\)-th components of the multivariate data. In the matrix form this looks like

\[ \sum_{i=1}^{\nu} \underline{Z}_i \underline{Z}_i^T = \begin{bmatrix} \sum_{i=1}^{\nu} Z_{i1}^2 & \sum_{i=1}^{\nu} Z_{i1} Z_{i2} & \cdots & \sum_{i=1}^{\nu} Z_{i1} Z_{ip} \\ \sum_{i=1}^{\nu} Z_{i2} Z_{i1} & \sum_{i=1}^{\nu} Z_{i2}^2 & \cdots & \sum_{i=1}^{\nu} Z_{i2} Z_{ip} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{i=1}^{\nu} Z_{ip} Z_{i1} & \sum_{i=1}^{\nu} Z_{ip} Z_{i2} & \cdots & \sum_{i=1}^{\nu} Z_{ip}^2 \end{bmatrix} \]

If we divide by \(\nu\) the above matrix, we will get a sample covariance matrix.

Q: why we do not subtract the mean? and why we do not divide by \(\nu-1\) in this case?

Answer

In this case, the mean is known to us: we sample from a mean 0 normal. Thus we do not need to subtract it (or rather a 0 is implicitly subtracted already), and we do not “lose” a degree of freedom by estimating the mean from the data, so we divide by \(\nu\) instead of \(\nu-1\).

Fun Fact

\[ \sum_{i=1}^{\nu_0} \underline{Z}_i\underline{Z}_i^T=\underline{Z}^T\underline{Z}= \begin{bmatrix} \mid & \mid & \cdots & \mid \\ \underline{Z}_1 & \underline{Z}_2 & \cdots & \underline{Z}_{\nu} \\ \mid & \mid & \cdots & \mid \end{bmatrix} \begin{bmatrix} - & \underline{Z}_1^T & -\\ - & \underline{Z}_2^T & - \\ \cdots & \cdots & \cdots \\ - & \underline{Z}_{\nu}^T & -\\ \end{bmatrix} \]

Exercise: Prove it!

Now, after we defined a pdf on the space of psd matrices, we can use it (or rather its modification) as a prior for \(\underline{\Sigma}\) in our multivariate normal model. However, the Wishart distribution is not conjugate to the MVN likelihood, but the inverse-Wishart distribution is. Thus, we will use the inverse-Wishart distribution as a prior for \(\underline{\Sigma}\).

In the Bayesian world

Recall, in the univariate case, \(\sigma^2\) is the variance parameter, and we often use an inverse-gamma distribution as a prior for \(\sigma^2\) because it is conjugate (or semi-conjugate) to the normal likelihood.

If \(\sigma^2 \sim \text{Inv-Gamma}(\alpha_0, \beta_0)\), then \(1/\sigma^2 \sim \text{Gamma}(\alpha_0, \beta_0)\)

In the \(p\)-dimensional case:

If \(\underline{\Sigma} \sim \text{Inv-Wishart}(\nu_0, \underline{S}_0^{-1})\), then \(\underline{\Sigma}^{-1} \sim \text{Wishart}(\nu_0, \underline{S}_0^{-1})\).

This prior will give us a closed-form full conditional distribution for \(\underline{\Sigma}\), which is also an inverse-Wishart distribution with updated parameters that depend on the data and the prior parameters. The inverse-Wishart distribution is a multivariate generalization of the inverse-gamma distribution, and it is used to model the uncertainty about the covariance matrix in the multivariate normal models.

2.2 Properties of the Wishart distribution

If \(\underline{\Sigma} \sim \text{IW}(\nu_0, \underline{S}_0^{-1})\), then \(E[\underline{\Sigma}] = \frac{1}{\nu_0-p-1} \underline{S}_0\)
If \(\underline{\Sigma} \sim \text{W}(\nu_0, \underline{S}_0^{-1})\), then \(E[\underline{\Sigma}] = \nu_0 \underline{S}_0^{-1}\)

Exercise: Prove the second statement using the definition of the Wishart distribution.

Just for Fun: The pdf of Inv-Wishart\((\nu_0,\underline{S}_0^{-1})\)

\[ \begin{aligned} p(\Sigma)= & {\left[2^{\nu_{0} p / 2} \pi^{\left(\begin{array}{c} p \\ 2 \end{array}\right) / 2}\left|\underline{S}_{0}\right|^{-\nu_{0} / 2} \prod_{j=1}^{p} \Gamma\left(\left[\nu_{0}+1-j\right] / 2\right)\right]^{-1} \times } \\ & |\Sigma|^{-\left(\nu_{0}+p+1\right) / 2} \times \exp \left\{-\operatorname{tr}\left(\underline{S}_{0} \Sigma^{-1}\right) / 2\right\} . \\ &\propto |\Sigma|^{-\left(\nu_{0}+p+1\right) / 2} \times \exp \left\{-\operatorname{tr}\left(\underline{S}_{0} \Sigma^{-1}\right) / 2\right\} \\ &=|\Sigma|^{-a} \times \exp \left\{-\frac{1}{2} \operatorname{tr}\left(\underline{A} \Sigma^{-1}\right)\right\} \end{aligned} \]

The last line shows the kernel of the Inverse-Wishart. This is the structure we will be looking for in the full conditional distribution of \(\underline{\Sigma}\), which will allow us to identify it as an inverse-Wishart distribution and read off the parameters.

3 Visualizing Wishart

It is a strange distribution, so let’s try to visualize it. We will look at the 2D case, i.e. \(p=2\), and we will look at the distribution of the covariance matrix \(\underline{\Sigma}\) (or equivalently, the precision matrix \(\underline{\Sigma}^{-1}\)) under different settings of the prior parameters.

Weak prior. Precision \(\underline\Sigma^{-1}\sim \operatorname{Wishart}(\nu_0,\underline{S}_0^{-1})\). Variance \(\underline\Sigma\sim \operatorname{Inv-Wishart}(\nu_0,\underline{S}_0^{-1})\). I am choosing a small \(\nu_0\) for the weak prior, which means that the prior is more diffuse and less informative about the covariance structure. The lowest possible value for \(\nu_0\) is \(p+2=4\) for \(p=2\), which corresponds to a very weak prior on the covariance matrix. As \(\nu_0\) increases, the prior becomes stronger and more informative about the covariance structure. We set \(\underline{S}_0\) to be a scaled identity matrix, which means that we are assuming that the variances of the two components are equal and that there is no prior correlation between them. The scale of \(\underline{S}_0\) will affect the expected value of \(\underline{\Sigma}\) under the prior, as \(E[\underline{\Sigma}] = \frac{1}{\nu_0-p-1} \underline{S}_0\). Thus, by choosing \(\underline{S}_0\) we can control the expected covariance structure under the prior.

The entries of the precision matrix \(\underline{\Sigma}^{-1}\) have a distribution that is loosely centered around the identity matrix. We observed the similar behavior for the samples of the covariance matrix \(\underline{\Sigma}\) from the Inverse-Wishart. In the top right corner you can see the histogram of the correlations betweem the first and second component. You can see that our samples have a wide range of correlations, between -1 and 1.

It may be more informative to visualize the samples from the covariance matrix using ellipses. Each draw of a covariance matrix defines a multivariate normal pdf, and we plot a 95% confidence ellipse for each draw from the Inverse-Wishart. The orange dashed line represents the mean covariance matrix \(S_0\) that we used to set the scale of the prior. We can see that the ellipses are quite spread out, which reflects the weak prior and the wide range of covariance structures that are plausible under this prior, while the mean covariance matrix is represented by a circle (since it is a scaled identity matrix) that is somewhere in the middle of the cloud of ellipses.

Strong prior. Precision \(\underline\Sigma^{-1}\sim \operatorname{Wishart}(\nu_0,\underline{S}_0^{-1})\). Variance \(\underline\Sigma\sim \operatorname{Inv-Wishart}(\nu_0,\underline{S}_0^{-1})\).

We choose \(\nu_0\) to be large, which means that the prior is more concentrated and informative about the covariance structure. The larger \(\nu_0\) is, the stronger the prior belief about the covariance structure, and the more it will influence the posterior distribution of \(\underline{\Sigma}\). We keep \(\underline{S}_0\) the same as in the weak prior case, which means that we are still assuming that the expected covariance structure under the prior is the same, but now we have a stronger belief in that structure due to the larger \(\nu_0\). Parameter \(\nu_0\) is set to \(52\).

You can see that the correlation is more tightly concentrated around 0, and the variances are more tightly concentrated around 1, which reflects the stronger prior belief about the covariance structure. The ellipses will also be more tightly clustered around the mean covariance matrix, which is represented by the orange dashed circle: