Lecture 10: The Multivariate Normal Model

1 Recall: Bayesian Network for the MVN Model

Refer to figures in the lectures please

(diagram to be posted to the website later)

2 Recall: Wishart distribution

Recall, we introduced Wishart distribution as a distribution of sample unscaled covariance matrices (or sample sum of squares), and we realized that Inverse-Wishart is a convenient semi-conjugate prior for the covariance matrix.

To sample a covariance matrix \(\underline\Sigma\) from Inv-Wishart

Sample \(\underline{z}_{1}, \ldots, \underline{z}_{\nu_{0}} \sim\) i.i.d. Multivariate Normal \(\left(\underline{0}, \underline{S}_{0}^{-1}\right)\);
Calculate \(\underline{Z}^{T} \underline{Z}=\sum_{i=1}^{\nu_{0}} \underline{z}_{i} \underline{z}_{i}^{T}\) (this gives us a sample from Wishart(\(\nu_0,S_0^{-1}\)) )
Set \(\Sigma=\left(\underline{Z}^{T} \underline{Z}\right)^{-1}\).

The last line produces \(\Sigma\sim Inv-Wishart(\nu_0,S_0^{-1})\)

Here, the precision matrix \(\underline\Sigma^{-1}\sim\ \text{Wishart} \left(\nu_{0}, \underline{S}_{0}^{-1}\right)\) and the variance-covariance matrix \(\underline\Sigma\sim \text{Inv-Wishart} \left(\nu_{0}, \underline{S}_{0}^{-1}\right)\).

\[ \begin{aligned} \mathrm{E}\left[\underline\Sigma^{-1}\right] & =\nu_{0}\ \underline{S}_{0}^{-1} \\ \mathrm{E}[\underline\Sigma] & =\frac{1}{\nu_{0}-p-1}\left(\underline{S}_{0}^{-1}\right)^{-1}=\frac{1}{\nu_{0}-p-1} \underline{S}_{0} . \end{aligned} \]

Exercise: Prove the first statement using the definition of the Wishart distribution.

2.1 How to choose prior paramaters?

The parameters of the Inv-Wishart distribution govern the mean and the variance of the variance-covariance matrix \(\underline{\Sigma}\). The degree of freedom, \(\nu\) governs the spread of our prior around the mean, the parameter \(\underline{S}_0\) controls the mean of the variance-covariance matrix.

Assume we know the prior expectation of the variance-covariance matrix is \(\underline\Sigma_{0}\). Then:

For the strong prior: choose \(\nu_{0}\) large, choose \(\underline{S}_{0}=\left(\nu_{0}-p-1\right) \underline\Sigma_{0}\). This choice makes \(\Sigma\) “tightly” centered around \(\underline\Sigma_{0}\).

For the weak prior: choose \(\nu_{0}=p+2\) and \(\underline{S}_{0}=\underline\Sigma_{0}\). This choice makes \(\Sigma\) only loosely centered around \(\underline\Sigma_{0}\).

Note: The lowest possible value for \(\nu_{0}\), such that the Wishart (and Inv-Wishart) yields a positive definite matrix is \(\nu_0=p+2\).

3 Full Conditional distribution of \(\Sigma\)

Last time we derived the full conditional of the mean \(\underline\theta\), i.e. \(p(\underline\theta\mid data, \underline\Sigma)\). We employed the fact that the joint sampling model, as a function of \(\underline\theta\), is exponential to the power of the quadratic form in \(\underline\theta\). This allowed us to recognize the kernel of a multivariate normal distribution, and derive the full conditional distribution of \(\underline\theta\). Let us perform a similar trick for the full conditional of \(\underline\Sigma\).

The joint sampling model: \[ p\left(\underline{y}_{1}, \ldots, \underline{y}_{n} \mid \underline{\theta}, \Sigma\right)=(2 \pi)^{-n p / 2}|\Sigma|^{-n / 2} \exp \left\{-\sum_{i=1}^{n}\left(\underline{y}_{i}-\underline{\theta}\right)^{T} \Sigma^{-1}\left(\underline{y}_{i}-\underline{\theta}\right) / 2\right\} . \]

We will need to rework the above in the form that will allow us to recognize the kernel of an inverse-Wishart distribution. To do this, we will need the following fact about traces:

Linear Algebra Fact: \[ \operatorname{tr}\left( \underline{B A}\underline{B}^{T}\right)= \operatorname{tr}\left(\underline{B}^{T} \underline{B A}\right)=\sum_{k=1}^{K} \underline{b}_{k}^{T} \underline{A} \underline{b}_{k}, \] where the matrix \(\underline{B}\) is \[ \underline{B}= \begin{bmatrix} - & \underline{b}_1^T & -\\ - & \underline{b}_2^T & -\\ \vdots & \vdots & \vdots\\ - & \underline{b}_K^T & - \end{bmatrix} \]

Using this fact we have:

\[ \begin{aligned} \sum_{i=1}^{n}\left(\underline{y}_{i}-\underline{\theta}\right)^{T} \underline\Sigma^{-1}\left(\underline{y}_{i}-\underline{\theta}\right) &= tr\left(\underline{Y}_\theta^T\underline{Y}_\theta\,\underline\Sigma^{-1}\right)\\ &= tr\left(\underline{S}_\theta \underline\Sigma^{-1}\right), \end{aligned} \]

where \(\underline{S}_\theta\) is the sample sum of squares matrix for the data centered at \(\underline\theta\): \[ \underline{S}_\theta=\underline{Y}_\theta^T\underline{Y}_\theta=\sum_{i=1}^{n}\left(\underline{y}_{i}-\underline{\theta}\right)\left(\underline{y}_{i}-\underline{\theta}\right)^{T}, \]

and \[ \underline{Y}_\theta= \begin{bmatrix} -&(\underline{y}_1-\underline{\theta})^T&-\\ -&(\underline{y}_2-\underline{\theta})^T&-\\ \vdots&\vdots&\vdots\\ -&(\underline{y}_n-\underline{\theta})^T&- \end{bmatrix} \]

Thus, the full conditional for \(\underline\Sigma\) is straightforward!

\[ \begin{aligned} p(\underline\Sigma \mid \underline{y}_{1}, \ldots, \underline{y}_{n}, \underline{\theta}) &\propto p(\underline{y}_1,\ldots,\underline{y}_n\mid \underline{\theta},\underline\Sigma)\, p(\underline\Sigma\mid\underline\theta) \\ &\propto p(\underline{y}_1,\ldots,\underline{y}_n\mid \underline{\theta},\underline\Sigma)\, p(\underline\Sigma) \\ &\propto|\underline\Sigma|^{-n/2}\exp\left[-\frac12 tr(\underline{S}_\theta\underline\Sigma^{-1})\right]\times |\underline\Sigma|^{-(\nu_0+p+1)/2}\exp\left[-\frac12 tr(\underline{S}_0\underline\Sigma^{-1})\right] \\ &=|\underline{\Sigma}|^{-(\nu_0+n+p+1)/2}\exp\left[-\frac12 tr((\underline{S}_0+\underline{S}_\theta)\underline\Sigma^{-1})\right] \\ &\propto \text{Inv-Wishart}\left(\nu_0+n,(\underline{S}_0+\underline{S}_\theta)^{-1}\right) \end{aligned} \]

In the derivation above we use the fact that \[ tr(AC)+tr(BC)=tr((A+B)C) \]

Exercise: Prove it!

4 Gibbs sampling of the mean and covariance

Thus we have derived the two full conditionals, which we will use for the Gibbs sampler.

\[ \begin{aligned} & \left\{\underline{\theta} \mid \underline{y}_{1}, \ldots, \underline{y}_{n}, \underline\Sigma\right\} \sim \operatorname{MVN}\left(\underline{\mu}_{n}, \Lambda_{n}\right)\qquad (FC1) \\ & \left\{\underline\Sigma \mid \underline{y}_{1}, \ldots, \underline{y}_{n}, \underline{\theta}\right\} \sim \text {Inv-Wishart}\left(\nu_{n}, \underline{S}_{n}^{-1}\right),\qquad (FC2) \end{aligned} \]

The parameters in (FC1) are \[ \begin{aligned} \underline{\Lambda}_{n} & =\left(\underline{\Lambda}_{0}^{-1}+n \underline\Sigma^{-1}\right)^{-1} \\ \underline{\mu}_{n} & =\underline{\Lambda}_{n}\left(\underline{\Lambda}_{0}^{-1} \underline{\mu}_{0}+n \underline\Sigma^{-1} \overline{\underline{y}}\right) \end{aligned} \]

Note, \(\overline{\underline{y}}\) is the sample mean vector of the vector observations \(\underline{y}_1,\ldots,\underline{y}_n\).

The parameters in (FC2) are \[ \begin{aligned} \nu_n&=\nu_0+n,\\ \underline{S}_n&=\underline{S}_0+\underline{S}_\theta. \end{aligned} \]

4.1 The Gibbs Sampler

Step 1.

\[ \underline{\theta}^{(s+1)}\sim p(\underline{\theta} \mid \underline{y}_{1}, \ldots, \underline{y}_{n}, \underline\Sigma^{(s)}) \]

Compute \(\underline{\mu}_{n}\) and \(\underline\Lambda_{n}\) from \(\underline{y}_{1}, \ldots, \underline{y}_{n}\) and \(\underline\Sigma^{(s)}\);
Sample \(\underline{\theta}^{(s+1)} \sim \operatorname{MVN} \left(\underline{\mu}_{n}, \underline\Lambda_{n}\right)\).

Step 2.

\[ \underline\Sigma^{(s+1)}\sim p(\underline\Sigma \mid \underline{y}_{1}, \ldots, \underline{y}_{n}, \underline{\theta}^{(s+1)}) \]

Compute \(\underline{S}_{n}\) from \(\underline{y}_{1}, \ldots, \underline{y}_{n}\) and \(\underline{\theta}^{(s+1)}\);
Sample \(\underline\Sigma^{(s+1)} \sim\) Inv-Wishart \(\left(\nu_{0}+n, \underline{S}_{n}^{-1}\right)\).

As \(S\rightarrow\infty\), the distribution of \(\{\underline{\theta}^{(s)},\underline\Sigma^{(s)}\}\) converges to the joint posterior distribution \(p(\underline{\theta},\underline\Sigma \mid \underline{y}_1,\ldots,\underline{y}_n)\).

Q: How many dimensions is the posterior distribution \(p(\underline{\theta},\underline\Sigma \mid \underline{y}_1,\ldots,\underline{y}_n)\)?

Answer:

\({\theta}\) is \(p\)-dimensional, \(\Sigma\) is a \(p\times p\) symmetric matrix, so it has \(p(p+1)/2\) free parameters. Thus, the posterior distribution is \(p + p(p+1)/2 = p(p+3)/2\)-dimensional. In the case of our dataset, \(p=2\), so we have a 5-dimensional posterior.

5 Back to Reading comprehension

     pretest posttest
[1,]      59       77
[2,]      43       39
[3,]      34       46
[4,]      32       26
[5,]      42       38
[6,]      38       43

We could model the data as bivariate normal \(\underline{Y}_{i}\overset{i.i.d}{\sim}MVN_{2}(\underline{\theta},\underline{\Sigma})\).
Let us specify semi-conjugate priors \[\underline{\theta}\sim MVN_{2}(\underline{\mu}_{0},\underline{\Lambda}_{0})\] \[\underline{\Sigma}\sim \text{InvWishart}(\nu_{0},\underline{S}_{0}^{-1})\]

5.1 Choosing prior parameters: \(\underline{\mu}_0, \underline\Lambda_0 ,\nu_0, \underline{S}_0\)

We are provided the following information about the test scores:

The test is designed to have a mean of 50.
The true mean is constrained to be between 0 and 100.
We assume a prior correlation of 0.5:

From (1) we set \(\underline{\mu}_0 = (50,50)^{T}\). The line (2) gives information about the spread of the mean, thus it describes the variances:
\[ \mu_{j}\pm 2\sigma_{j} = (0,100), \] which implies \(\sigma_{j}^{2} = (50/2)^{2} = 625=\left(\underline{\Lambda}_0\right)_{11}=\left(\underline{\Lambda}_0\right)_{22}\).

Without any information on the correlations we may set \(\left(\underline{\Lambda}_0\right)_{12}=\left(\underline{\Lambda}_0\right)_{21}=0\).

The (2) and (3) give us information about the prior mean of the variance as well. Recall that the mean of the Inv-Wishart is \(\frac{1}{\nu-p-1}\underline{S}_0\), so setting \(\nu_{0}=p+2=4\) leads to \(E[\underline{\Sigma}] = \underline{S}_{0}\), and the \(\underline{S}_{0}\) is thus

\[ \underline{S}_{0}= \begin{bmatrix} 625&0.5\cdot 625\\ 0.5\cdot 625&625 \end{bmatrix} \]

5.2 Implementation

The below chunk simulates samples from Wishart and Inverse-Wshart distributions:

#### Simulate from the Wishart distribution
rwish<-function(n,nu0,S0)
{
  sS0 <- chol(S0)
  S<-array( dim=c( dim(S0),n ) )
  for(i in 1:n)
  {
     Z <- matrix(rnorm(nu0 * dim(S0)[1]), nu0, dim(S0)[1]) %*% sS0
     S[,,i]<- t(Z)%*%Z
  }
  S[,,1:n]
}

#### Simulate inverse-Wishart matrix
rinvwish<-function(n,nu0,iS0) 
{
  sL0 <- chol(iS0) 
  S<-array( dim=c( dim(L0),n ) )
  for(i in 1:n) 
  {
     Z <- matrix(rnorm(nu0 * dim(L0)[1]), nu0, dim(iS0)[1]) %*% sL0  
     S[,,i]<- solve(t(Z)%*%Z)
  }     
  S[,,1:n]
}

#### Reading comprehension
load("../data/reading.RData")
Y<-reading

mu0<-c(50,50)
#L0<-matrix( c(625,312.5,312.5,625),nrow=2,ncol=2)
L0<-matrix( c(625,0,0,625),nrow=2,ncol=2) #two choices for Lambda0 will yield similar results

nu0<-4
S0<-matrix( c(625,312.5,312.5,625),nrow=2,ncol=2)

n<-dim(Y)[1] ; ybar<-apply(Y,2,mean)
Sigma<-cov(Y) ; THETA<-SIGMA<-NULL
YS<-NULL #empty array for the predictive distribution
set.seed(1)

for(s in 1:5000) 
{
 
  ###update theta via FC1
  Ln<-solve( solve(L0) + n*solve(Sigma) ); #solve computes the inverse
  mun<-Ln%*%( solve(L0)%*%mu0 + n*solve(Sigma)%*%ybar );
  theta<-rmvnorm(1,mun,Ln);  
  ### 
   
  ###update Sigma via FC2
  Sn<- S0 + ( t(Y)-c(theta) )%*%t( t(Y)-c(theta) ); 
#  Sigma<-rinvwish(1,nu0+n,solve(Sn))
  Sigma<-solve( rwish(1, nu0+n, solve(Sn)) );
  ###

  ### generating samples from the predictive distribution
  YS<-rbind(YS,rmvnorm(1,theta,Sigma)); 
  ###

  ### save results 
  THETA<-rbind(THETA,theta); 
  SIGMA<-rbind(SIGMA,c(Sigma));
  ###
  cat(s,round(theta,2),round(c(Sigma),2),"\n");
}

Joint posterior distribution.

Below are the HPD regions for the prior \[ p(\underline{\theta})\sim MVN\left( \begin{bmatrix} 50\\ 50 \end{bmatrix}, \begin{bmatrix} 625&0\\ 0&625 \end{bmatrix}, \right) \]

and the posterior (marginal posterior)

\[ p(\underline{\theta}\mid \underline{y}_1, \ldots, \underline{y}_n) = p(\theta_1,\theta_2\mid \underline{y}_1, \ldots, \underline{y}_n) \]

Showing HPD regions are shown for \(\{97.5\%,75\%,50\%,25\%,2.5\%\}\)

Below are the priors and the posteriors of the variances \(\underline\Sigma_{11},\underline\Sigma_{22}\) and the correlation \(\rho_{12}=\frac{\underline\Sigma_{12}}{\sqrt{\underline\Sigma_{11}\cdot\underline\Sigma_{22}}}\).

We can visualize the marginal posterior of the 2-by-2 covariance matrix via ellipses. Each ellipse is the 95% contour of a Gaussian with that sampled covariance matrix. The panel with orange ellipses shows the prior draws, the panel with blue ellipses shows the posterior draws. The dashed lines are the ellipses corresponding to the prior and posterior means, respectively.

Joint predictive posterior.

Below is the HPD region for

\[ \widetilde{\underline{y} } \mid \underline{y}_1, \ldots, \underline{y}_n = \widetilde{y}_{1},\widetilde{y}_{2}\mid \underline{y}_1, \ldots, \underline{y}_n \]

Showing HPD for \(\{97.5\%,75\%,50\%,25\%,2.5\%\}\)

Note, how the regions are shaped similarly for the means \(\theta_j\) as well as predictive values \(\tilde{y}_j\), but the predictive distribution values are more spreadout, as expected.

From the posterior and predictive distributions: \[ Pr(\theta_{1}>\theta_{2}\mid \underline{y}_1, \ldots, \underline{y}_n) \approx 0.99 \qquad Pr(\widetilde{y}_{1}>\widetilde{y}_{2}\mid \underline{y}_1, \ldots, \underline{y}_n) \approx 0.71 \]

Q: What do these probabilities mean in context of the reading comprehension data?

Answer:

The quantity \(Pr(\theta_{1}>\theta_{2}\mid \underline{y}_1, \ldots, \underline{y}_n)\) tells us how the average pretest score compares to the average posttest score. The value of 0.99 means that we are very confident that the average pretest score is higher than the average posttest score, which suggests that there was a decrease in scores from pretest to posttest. While the quantity \(Pr(\widetilde{y}_{1}>\widetilde{y}_{2}\mid \underline{y}_1, \ldots, \underline{y}_n)\) tells us how a new pretest score (of a randomly selected student) compares to a new posttest (of another randomly selected student) score. The value of 0.71 means that there is some uncertainty about whether a new pretest score will be higher than a new posttest score, but it is more likely than not that the new pretest score will be higher than the new posttest score.

Exercise: Plot diagnostics to investigate convergence of the Gibbs sampler for \(\{\underline{\theta}^{(i)},\underline{\Sigma}^{(i)}\}\):

Trace plots of \(\theta_{1},\theta_{2},\sigma_{11},\sigma_{12},\sigma_{22}\).
Autocorrelation plots of \(\theta_{1},\theta_{2},\sigma_{11},\sigma_{12},\sigma_{22}\).
Compute effective sample size for \(\theta_{1},\theta_{2},\sigma_{11},\sigma_{12},\sigma_{22}\).