Lecture 4: Exponential Family and Monte Carlo Method

1 Exponential families and conjugate priors

We covered two one-parameter models so far: Binomial and Poisson. And found corresponding conjugate prior distributions. Do these ideas generalize? The answer is yes, for a large class of models called exponential family models.

Definition. A one-parameter exponential family model is any model whose densities \(p(y\mid\phi)\) can be expressed as

\[ p(y\mid\phi) = h(y)\,c(\phi)\,\exp(\phi\,t(y)), \] where \(\phi\) is a parameter, and \(t(y)\) is a sufficient statistic.

Q: Can the Bernoulli pdf be written in the form above?

Click to see the answer

Yes! For \(Y\sim Bernoulli(\theta)\) we have

\[ p(y\mid\theta) = \theta^y\,(1-\theta)^{1-y} = (1-\theta)\,\exp\left(\log\left(\frac{\theta}{1-\theta}\right)\,y\right). \] So here \(h(y)=1\), \(c(\theta)=1-\theta\), \(\phi=\log\left(\frac{\theta}{1-\theta}\right)\), and \(t(y)=y\).

A joint sampling model for the exponential family is \[ p(y_1,\ldots,y_n\mid\phi) = \left[\prod_{i=1}^n h(y_i)\right]\,c(\phi)^n\,\exp\left(\phi\sum_{i=1}^n t(y_i)\right). \]

1.1 A Conjugate Prior distribution:

For the sampling model above, the conjugate prior distribution for \(\phi\) is given by \[ p(\phi\mid n_0,t_0) = k(n_0,t_0)\,c(\phi)^{n_0}\,\exp( n_0t_0\phi), \] where \(n_0>0\) and \(t_0\) are hyperparameters, and \(k(n_0,t_0)\) is a normalizing constant.

We can interpret the hyperparameters as follows:

\(n_0\) is the prior sample size (how much information the prior contains).
\(t_0\) is the prior mean of the sufficient statistic \(t(Y)\):

1.2 Posterior distribution:

\[ \begin{aligned} p(\phi\mid y_1,\ldots,y_n)&\propto p(y_1,\ldots,y_n\mid\phi)\,p(\phi\mid n_0,t_0)\\ &\propto c(\phi)^{n+n_0}\,\exp\left(\phi\left(n_0t_0+\sum_{i=1}^n t(y_i)\right)\right)\\ &=c(\phi)^{n+n_0}\,\exp\left((n_0+n)\frac{n_0t_0+\sum_{i=1}^n t(y_i)}{n_0+n}\phi\right)\\ \end{aligned} \] Compare this to the prior distribution form: \[ p(\phi\mid n_0,t_0) \propto\,c(\phi)^{n_0}\,\exp( n_0t_0\phi), \]

1.3 Fundamental Averaging Principle of Bayesian Statistics

Define the mean parameter: \[ \mu = E(t(Y)\mid \phi) = -\frac{d}{d\phi}\log c(\phi) \] Note, that \(\mu=\mu(\phi)\) will be a random variable since \(\phi\) is random.

Then one can show that for the prior and the posterior distributions we have:

\[ E(\mu) =E[E(t(Y)\mid \phi)]= t_0 \] Here the expectation is taken with respect to the prior distribution of \(\phi\). So \(t_0\) is the prior mean of the sufficient statistic \(t(Y)\).

Posterior mean of \(\mu\) will be equal to the updated sufficient statistic:

\[ \begin{aligned} E(\mu\mid y_1,\ldots,y_n) &= \frac{n_0t_0+\sum_{i=1}^n t(y_i)}{n_0+n}\\ &= \frac{n_0}{n_0+n}t_0 + \frac{n}{n_0+n}\left(\frac{\sum_{i=1}^n t(y_i)}{n}\right)\\ &= w_{\text{prior}} \cdot \text{prior mean} + w_{\text{data}} \cdot \text{data mean}. \end{aligned} \]

\[ \\[4cm] \]

2 The Monte Carlo method

What can MC do for me?

Approximate posterior expectations \(\operatorname{E}[\theta\mid y_1,\ldots,y_n]\), variances, quantiles.
Approximate posterior expectations of arbitrary functions \(\operatorname{E}[g(\theta)\mid y_1,\ldots,y_n]\).
Generate samples from the posterior predictive distribution \(p(\widetilde{y}\mid y_1,\ldots,y_n)\).
Approximate posterior predictive model checks and model diagnostics.

2.1 Posterior inference on \(\theta\)

Recall, in our Poisson model example (the birthrate data from 2 groups) we have the following posterior distributions for the two groups: \[ \begin{aligned} p(\theta_1\mid y_{i,1})&=\operatorname{dgamma}(\theta_1,219,112)\\ p(\theta_2\mid y_{i,2})&=\operatorname{dgamma}(\theta_2,68,45) \end{aligned} \]

And we have stated the the posterior probability that \(\theta_1>\theta_2\) is equal to \(0.97\): \[ Pr(\theta_1>\theta_2\mid y_{i,j}) = 0.97 \]

How would one find this value? \[ \begin{aligned} Pr(\theta_1>\theta_2\mid y_{i,j}) &= \int\int_{\theta_1>\theta_2} p(\theta_1,\theta_2\mid y_{i,j})\,d\theta_1\,d\theta_2\\ &= \int_0^{+\infty}\int_{\theta_2}^{+\infty} p(\theta_1\mid y_{i,1})\,p(\theta_2\mid y_{i,2})\,d\theta_1\,d\theta_2\\ &= \int_0^{+\infty}\int_{\theta_2}^{+\infty} dgamma(\theta_1,219,112)\,dgamma(\theta_2,68,45)\,d\theta_1\,d\theta_2\\ &=\ldots\\ &=0.97 \end{aligned} \]

This is a tedious integral to compute analytically. But we can easily approximate it using the Monte Carlo method!

2.2 Monte Carlo method

We use the power of LLN (Law of Large Numbers) to find that probability!

Assume we are able to obtain independent samples \(i=1,\ldots,S\) from the posterior distribution of \(\theta\):

\[ \theta^{(1)},\ldots, \theta^{(S)}\sim\text{iid}\ p(\theta\mid y_1,\ldots,y_n) \]

Then by LLN: \[ \frac{1}{S}\sum_{i=1}^S\theta^{(i)}\quad \longrightarrow \quad E(\theta\mid y_1,\ldots,y_n) \]

And also for any function \(g(\cdot)\): \[ \frac{1}{S}\sum_{i=1}^Sg(\theta^{(i)})\quad \longrightarrow\quad E[g(\theta)\mid y_1,\ldots,y_n] \]

For example, to find the variance of \(\theta\) we can use the following approximation:

\[ \frac{1}{S}\sum_{i=1}^S\left(\theta^{(i)}-\frac{1}{S}\sum_{j=1}^S\theta^{(j)}\right)^2\longrightarrow \operatorname{Var}(\theta\mid y_1,\ldots,y_n) \]

To find the probability that \(\theta< C\) for some constant \(C\), we can use the following approximation:

\[ \frac{1}{S}\sum_{i=1}^S\mathbb{I}(\theta^{(i)}<C)\quad \longrightarrow\quad Pr(\theta<C\mid y_1,\ldots,y_n), \] where \(\mathbb{I}(\cdot)\) is the indicator function.

Similarly, we will have:

\(\text{median}\{\theta^{(1)},\ldots,\theta^{(S)}\}\longrightarrow \text{median}(\theta\mid y_1,\ldots,y_n)\)
\(\text{Histogram}\{\theta^{(1)},\ldots,\theta^{(S)}\}\longrightarrow \text{density}(\theta\mid y_1,\ldots,y_n)\)

2.3 Monte Carlo approximation to \(Gamma(68,45)\)

Below we show how the Monte Carlo approximation to the posterior distribution of \(\theta_2\) improves as we increase the number of samples \(S\). In the first 4 plots you see histograms of the samples \(\theta_2^{(1)},\ldots,\theta_2^{(S)}\) for \(S=10,100,1000,10000\) and the true posterior density (blue curve). In the next 4 plots you see the same data, but instead of histograms we show kernel density estimates of the posterior density based on the samples. The true posterior density is shown in blue.

Computing the posterior mean \(E(\theta\mid y_1,\ldots,y_n)\)

Recall, for the Poisson model with Gamma conjugate prior we have the following sampling model and prior distribution: \[ Y_1,\ldots,Y_n\mid\theta\sim Pois(\theta),\quad p(\theta)=Gamma(a,b) \]

Then the posterior: \[ \theta\mid y_1,\ldots,y_n \sim \text{Gamma}(a+\sum_{i=1}^n y_i, b+n) \]

The mean of this distribution is given by \[ E(\theta\mid y_1,\ldots,y_n) = \frac{a+\sum_{i=1}^n y_i}{b+n} \]

Let’s see how we can approximate this value using the Monte Carlo method! We will generate samples from the posterior distribution of \(\theta\) and then compute the sample mean of these samples to approximate the posterior mean for various values of \(S\) (number of samples). We will see that as \(S\) increases, the Monte Carlo approximation gets closer to the true posterior mean.

set.seed(632)
a<-2  ; b<-1
sy<-66; n<-44

theta.sim10 <- rgamma(10,a+sy,b+n)
theta.sim100 <- rgamma(100,a+sy,b+n)
theta.sim1000 <- rgamma(1000,a+sy,b+n)
theta.sim10000 <- rgamma(10000,a+sy,b+n)

theta.true.mean <- (a+sy)/(b+n) 
theta.true.mean

[1] 1.511111

theta.mc.mean <- c(mean(theta.sim10),mean(theta.sim100),mean(theta.sim1000),mean(theta.sim10000))
theta.mc.mean

[1] 1.554040 1.545443 1.519201 1.510038

Computing the posterior probability

We can find the true value of the posterior probability that \(\theta<1.75\) using the cumulative distribution function of the Gamma distribution (\(\texttt{pgamma()}\) in R): \[ \Pr(\theta<1.75\mid y_1,\ldots,y_n) \]

true.prob <- pgamma(1.75,a+sy,b+n)
true.prob

[1] 0.8998286

prob.mc <- c(mean( theta.sim10<1.75),mean( theta.sim100<1.75),mean( theta.sim1000<1.75),mean( theta.sim10000<1.75))
prob.mc

[1] 0.9000 0.8700 0.8860 0.9005

Above, the blue line shows the running mean of the samples \(\theta^{(1)},\ldots,\theta^{(S)}\) as \(S\) increases. The dashed navy line shows the true posterior mean. The dotted skyblue lines show the Monte Carlo error bands, which are based on the standard error of the running mean. We can see that as \(S\) increases, the running mean converges to the true posterior mean and stays within the Monte Carlo error bands, as they shrink with increasing \(S\).

2.4 Monte Carlo standard errors

How do we find the Monte Carlo standard error of our estimates? For example, how do we find the standard error of the posterior mean estimate \(\frac{1}{S}\sum_{i=1}^S\theta^{(i)}\)?

Using the CLT (the Central Limit Theorem), \(\theta\) will lie in the following interval with \(95\%\) probability \[ \begin{aligned} \widehat{\theta}\pm 2\, SE(\widehat{\theta})&=\widehat{\theta}\pm 2\sqrt{\frac{\widehat{Var}(\theta\mid y_1,\ldots,y_n)}{{S}}}\\ &=\widehat{\theta}\pm 2{\frac{\widehat{\sigma}}{\sqrt{S}}} \end{aligned} \]

2.5 Posterior inference for arbitrary functions \(g(\theta)\)

Let us define the log-odds for the Binomial model: \[ g(\theta)=\text{log-odds}(\theta)=\log\frac{\theta}{1-\theta}=\gamma \]

Instead of looking for distribution of \(\theta\mid y\) we want to find a distribution of \(g(\theta)\mid y\), for some function \(g\).

\[ \theta \mid y_1,\ldots,y_n \qquad \rightarrow\qquad \gamma \mid y_1,\ldots,y_n \]

The new distribution is very easily obtained by applying the function \(g\) to the samples \(\theta^{(1)},\ldots,\theta^{(S)}\): \[ \text{Histogram}\{g(\theta^{(1)}),\ldots,g(\theta^{(S)})\}\approx p(g(\theta)\mid y_1,\ldots,y_n) \]

By transforming the samples \(\theta^{(i)}\) from prior and posterior distributions we can find the prior and posterior distributions of \(\gamma\):

Food for thought: Note, we used a uniform prior distribution for \(\theta\) in the Binomial model. It reflected our prior belief that \(\theta\) is equally likely to be anywhere in the interval \([0,1]\). But what about \(\gamma\)? What does the uniform prior on \(\theta\) imply for the prior distribution of \(\gamma\)? Is it a uniform distribution on \((-\infty,+\infty)\)? No! It is a logistic distribution, which is shown in the left plot above. So a uniform prior on \(\theta\) implies that we believe that \(\gamma\) is more likely to be around \(0\) than around \(-5\) or \(5\). This is an example of how the choice of a prior distribution can have implications for the inference on transformed parameters. We will address this issue in more detail in the next lecture when we talk about non-informative priors and Jeffreys priors.

2.6 Functions of two parameters

These are posterior distributions of \(\theta_1\) and \(\theta_2\) for the two groups in the birthrate data: \[ \begin{aligned} p(\theta_1\mid y_{i,1})&=dgamma(\theta_1,219,112)\\ p(\theta_2\mid y_{i,2})&=dgamma(\theta_2,68,45) \end{aligned} \]

How to estimate \(\Pr(\theta_1>\theta_2\mid y_{i,j})\)

set.seed(632)
a<-2 ; b<-1 ; sy1<-217 ;  n1<-111 ; sy2<-66  ;  n2<-44

theta1.mc<-rgamma(10000,a+sy1, b+n1)
theta2.mc<-rgamma(10000,a+sy2, b+n2)

mean(theta1.mc>theta2.mc) #probability that theta1 > theta2

[1] 0.9731

It’s simple: we just generate samples from the posterior distributions of \(\theta_1\) and \(\theta_2\) and then compute the proportion of samples for which \(\theta_1^{(i)}>\theta_2^{(i)}\). This gives us a Monte Carlo approximation to the posterior probability that \(\theta_1>\theta_2\).

2.7 Sampling from predictive distributions

The most interesting application of the Monte Carlo method is to sample from the posterior predictive distribution. And we can do it without even knowing the form of the posterior predictive distribution! That’s crazy, right? Let’s see how it works.

For the Poisson model \(Y_i\mid\theta\sim \text{Poisson}(\theta)\): \[ \widetilde{Y}\mid y_1,\ldots,y_n\sim \text{NegBinomial} \]

Step 1 \[ \begin{aligned} \theta_1^{(i)}&\sim p(\theta_1\mid y_{1..n_1,1})\\ \theta_2^{(i)}&\sim p(\theta_2\mid y_{1..n_2,2}) \end{aligned} \]

Step 2 \[ \begin{aligned} \widetilde{Y}_1^{(i)}&\sim \text{Poisson}(\theta_1^{(i)})\\ \widetilde{Y}_2^{(i)}&\sim \text{Poisson}(\theta_2^{(i)}) \end{aligned} \] Note, we sample exactly one value \(\widetilde{Y}_1^{(i)}\) for each sample \(\theta_1^{(i)}\) from the posterior distribution of \(\theta_1\). We do the same for \(\theta_2\).

Then, \((\theta_1^{(i)},\widetilde{Y}_1^{(i)})\) are samples from the joint posterior distribution of \(\theta_1\) and \(\widetilde{Y}_1\): \(p(\theta_1,\widetilde{Y}_1\mid y_{1..n_1,1})\). Similarly, \((\theta_2^{(i)},\widetilde{Y}_2^{(i)})\) are samples from the joint posterior distribution of \(\theta_2\) and \(\widetilde{Y}_2\): \(p(\theta_2,\widetilde{Y}_2\mid y_{1..n_2,2})\).

Step 3

To get the posterior predictive distribution of \(\widetilde{Y}_1\) we just drop \(\theta_1\) from the joint distribution and keep only \(\widetilde{Y}_1\) (this way we are performing the marginalization step):

\(\{\theta_1^{(i)},\widetilde{Y}_1^{(1)}\}_{i=1}^S \qquad \longrightarrow\qquad \{\widetilde{Y}_1^{(i)}\}_{i=1}^S\quad \approx\quad p(\widetilde{Y}_1\mid y_{1..n_1,1})\)

The same is done for \(\widetilde{Y}_2\).

set.seed(632)
a<-2 ; b<-1 ; sy1<-217 ;  n1<-111 ; sy2<-66  ;  n2<-44

theta1.mc<-rgamma(10000,a+sy1, b+n1)
theta2.mc<-rgamma(10000,a+sy2, b+n2)

y1.mc<-rpois(10000,theta1.mc)
y2.mc<-rpois(10000,theta2.mc)

mean(y1.mc>y2.mc)

[1] 0.4867

We just found \(\Pr(\widetilde{Y}_1>\widetilde{Y}_2\mid y_{1..n_1,1},y_{1..n_2,2})\), the posterior predictive probability that a new observation from group 1 is greater than a new observation from group 2. We can also find the posterior predictive distribution of the difference \(\widetilde{Y}_1-\widetilde{Y}_2\) by just computing the difference of the samples \(\widetilde{Y}_1^{(i)}-\widetilde{Y}_2^{(i)}\) for \(i=1,\ldots,S\).

As you can see, there is a considerable overlap between the two posterior predictive distributions of \(\widetilde{Y}_1\) and \(\widetilde{Y}_2\). The posterior predictive distribution of the difference \(D=\widetilde{Y}_1-\widetilde{Y}_2\) is centered around \(0\), which means that there is a good chance that a new observation from group 1 will be greater than a new observation from group 2, but there is also a good chance that it will be smaller.

2.8 Diagnostic Tool of Bayesian Analysis: Posterior predictive model checking

Posterior Predictive model checking is a powerful diagnostic tool in Bayesian analysis. It allows us to assess the fit of our model to the observed data by comparing the observed data to data simulated from the posterior predictive distribution.

Idea: sample from the posterior predictive distribution and compare to the observed data. They should look similar if the model is a good fit to the data. If they look very different, then we have evidence that our model is not a good fit to the data.

Evaluation of Model Fit for the Birthrate Data

Q: Does the observed data look like it could have come from the posterior predictive distribution?

Ans: Data looking ‘similar’ is a vague concept. It really depends on what exactly the analysis is trying to achieve. If we are interested in the mean of the distribution, then we can compare the means of the observed data and the posterior predictive distribution. If we are interested in the variance, then we can compare the variances. If we are interested in the shape of the distribution, then we can compare the histograms or density plots.

Let’s say we are interested in correctly assessing number of females that have 2 children vs 1 child. Looking at the two histograms above, we can see that the posterior predictive distribution looks very different. In our observed data

\[ \#\{\text{females with 2 children}\} \approx 2\times \#\{\text{females with 1 child}\} \]

And in the posterior predictive distribution: \[ \#\{\text{females with 2 children}\} \approx \#\{\text{females with 1 child}\} \]

Let’s define the following test statistic: \[ g(y) = \frac{\#\{\text{females with 2 children}\}}{\#\{\text{females with 1 child}\}} \]

Because we just had one simulation of the predictive distribution of size \(n_1\) we may have been unlucky and got a value of \(g\) that is very different from the observed value of \(g\). To get a better sense of how the predictive distribution of \(g\) looks like, we can repeat the posterior predictive simulation many times and get many samples from the predictive distribution of \(g\). Then we can compare the distribution of \(g\) to the observed value of \(g\).

Posterior Predictive Check Algorithm:

For each \(i \in \{1,\ldots,S\}\):

Sample \[ \theta^{(i)} \sim p(\theta \mid Y = y_{\text{obs}}). \]

These \(\theta^{(i)}\) are samples from the posterior distribution of \(\theta\), they will be used to generate samples from the posterior predictive distribution.

Sample \[ \underline{\widetilde{Y}}^{(i)} = (\widetilde{y}^{(i)}_1,\ldots,\widetilde{y}^{(i)}_n) \;\stackrel{\text{i.i.d.}}{\sim}\; p(y \mid \theta^{(i)}). \]

This way we are obtaining a sample of size \(n\) from a predictive distribution.

Compute \[ g^{(i)} = g\!\left(\underline{\widetilde{Y}}^{(i)}\right). \] For each sample \(\underline{\widetilde{Y}}^{(i)}\) from the predictive distribution we compute one value of the test statistic \(g\). And we do this many times to get many values of \(g\) from the predictive distribution.
Compare the distribution of \(g^{(i)}\) to the observed value \(g_{\text{obs}} = g(y_{\text{obs}})\).

Note, the observed value of \(g\) is around \(2.1\), while the values of \(g\) from the predictive distribution are mostly between 0.5 and 1.5. This suggests that our model is not a good fit to the data, at least in terms of capturing the ratio of females with 2 children to females with 1 child.

This is an example of how posterior predictive checks can help us identify specific ways in which our model may be failing to capture important features of the data. Remedy for this issue could be to use a more flexible model that can capture the observed ratio better, such as a Negative Binomial model instead of a Poisson model, or perhaps a Multinomial model.

Note: if we want to make inferences on the mean or variance of the distribution, then even the wrong model may give us good estimates of the mean and variance, so we may not be able to detect the model misspecification using posterior predictive checks. This is why it is important to choose test statistics that are sensitive to the aspects of the data that we care about.