Lecture 3: One-parameter models: Poisson

1 Source code and data from Hoff’s book

Just a reminder that all R code and data sets used in Hoff’s book are available at:

Data and code to replicate figures and numerical results https://www2.stat.duke.edu/~pdh10/FCBS/Replication/

2 Bayesian Analysis Tools

We continue introducing some basic Bayesian analysis tools which will be useful in more complex models.

2.1 Confidence regions

Frequentist coverage
Bayesian coverage
- Highest Posterior Density region (HPD)
- Quantile based region

Frequentist coverage

\((1-\alpha)100\%\) CI is defined as a random interval \((l(Y),u(Y))\), where \[ Pr(l(Y)<\theta<u(Y)\mid \theta)=1-\alpha \]

Frequentist CIs are referred to as having pre-experimental coverage. Frequentist CIs are constructed so that before observing data, the random interval \((l(Y),u(Y))\) will contain the fixed parameter \(\theta\) with probability \(1-\alpha\). The probability here pertains to the method, not to the specific interval computed from the observed data.

Bayesian coverage

\((l(Y),u(Y))\) is a \((1-\alpha)100\%\) confidence region if

\[ Pr(l(Y)<\theta<u(Y)\mid Y=y)=1-\alpha \]

This is referred to as post-experimental coverage. Bayesian CIs have post-experimental coverage by construction: we find the fixed interval \((l(y),u(y))\) such that the posterior probability of \(\theta\) being in this interval is \(1-\alpha\).

There are two main ways to construct such intervals: HPDs and quantile based intervals.

Highest Posterior Density interval (HPD)

Definition. \((1-\alpha)100\%\) Highest Posterior Density region is a confidence region \(S(y)\) such that

\(Pr(\theta\in S(y)\mid Y=y)=1-\alpha\)
\(\forall\theta_a\in S(y)\) and \(\forall\theta_b \not\in S(y)\)

\[ Pr(\theta_a\mid Y=y) > Pr(\theta_b\mid Y=y) \]

It is easier to see this graphically on a plot below.

Quantile based interval

Definition. \((1-\alpha)100\%\) quantile based interval is an interval of the form \[ (\theta_{\alpha/2},\theta_{1-\alpha/2}), \qquad \qquad \text{ where } \] \[ Pr(\theta \leq \theta_{\alpha/2}\mid Y=y)=\alpha/2,\quad Pr(\theta \leq \theta_{1-\alpha/2}\mid Y=y)=1-\alpha/2 \]

Quantile based intervals are easier to compute, but they do not guarantee the shortest length interval for a given posterior probability. Also, they sometimes include low density regions while excluding high density regions.

Q: find values of \(\theta\) on the plot outside of the quantile based interval, which have higher posterior density than some values of \(\theta\) inside the interval.

Which should we use? HPDs are more “optimal” in terms of length, but quantile based intervals are easier to compute, especially in multi-parameter models. In practice, the two intervals are often very similar. They coincide exactly for symmetric unimodal posterior distributions.

3 The Poisson model

In what situations does the Poisson model arise? Some examples: - Number of emails received in an hour - Number of phone calls at a call center in a day - Number of decay events from a radioactive source in a given time period - Number of births in a given time period/gien region

Connection to Binomial:

In other words, Poisson distribution models the number of “successes” in an infinite sequence of Bernoulli trials with very small probability of success.

We have the following connection between Binomial and Poisson distributions: If \(Y_n \sim \mathrm{Binomial}(n,\theta_n)\), and \(n\to\infty\), \(\theta_n\to 0\) such that \(n\theta_n \to \lambda\), then

\[ Y_n \xrightarrow{d} \mathrm{Poisson}(\lambda) \]

3.1 Sampling model

Recall the pdf of the Poisson distribution: \[ Pr(Y = y|\theta) = \frac{e^{-\theta} \theta^y}{y!}, \quad y = 0, 1, 2, \ldots \]

The Poisson distribution has mean and variance equal to \(\theta\), it is equidispersed. \[ E[Y] = \theta, \quad Var(Y) = \theta. \]

The joint sampling distribution for \(n\) independent observations \(Y_i\mid\theta\sim Poisson(\theta)\) is \[ \begin{aligned} p(y_1,\ldots,y_n\mid \theta) &= \prod_{i=1}^n \frac{e^{-\theta} \theta^{y_i}}{y_i!} \\ &= e^{-n\theta} \theta^{\sum_{i=1}^n y_i} \prod_{i=1}^n \frac{1}{y_i!}\\ &\propto e^{-n\theta} \theta^{\sum_{i=1}^n y_i} \end{aligned} \]

Q: What is the sufficient statistic for \(\theta\)?

Answer

The sufficient statistic is \(\sum_{i=1}^n y_i\).

3.2 Finding a conjugate prior

By Bayes Theorem \[ \begin{aligned} p(\theta\mid y_1,\ldots,y_n)&\propto p(y_1,\ldots,y_n\mid \theta)p(\theta) \\ &\propto e^{-n\theta} \theta^{\sum_{i=1}^n y_i} p(\theta) \end{aligned} \]

What should be the functional form of \(p(\theta)\) so that the posterior \(p(\theta\mid y_1,\ldots,y_n)\) has the same functional form as the prior?

Answer: The prior should be of the form \[ p(\theta) \propto e^{C_1\theta} \theta^{C_2} \] for some constants \(C_1\) and \(C_2\).

This is a Gamma distribution!

Gamma distribution

Recall, the pdf of the Gamma distribution with parameters \(a>0\) and \(b>0\) is

\[ p(\theta)=\operatorname{dgamma}(\theta;a,b)=\frac{b^a}{\Gamma(a)} \theta^{a-1} e^{-b \theta} \]

Pay attention: in this parameterization, \(a\) is the shape parameter, and \(b\) is the rate parameter (not scale). The mean and variance are \[ E[\theta] = \frac{a}{b}, \quad Var(\theta) = \frac{a}{b^2}. \]

3.3 Posterior

We can easily find the posterior distribution now: \[ \theta\sim \mathrm{Gamma}(a,b) \]

\[ \begin{aligned} p(\theta\mid y_1,\ldots,y_n)&\propto p(y_1,\ldots,y_n\mid \theta)p(\theta) \\ &\propto e^{-n\theta} \theta^{\sum_{i=1}^n y_i} \cdot e^{-b \theta} \theta^{a-1} \\ &= e^{-(b+n)\theta} \theta^{a-1+\sum_{i=1}^n y_i} \\ \end{aligned} \]

The above functional form uniquely identifies a Gamma distribution with updated parameters: \[ \theta\mid y_1,\ldots,y_n \sim \mathrm{Gamma}\left(a+\sum_{i=1}^n y_i, b+n\right) \]

Based on how the parameters are updated, we can interpret them as follows: - Prior parameters: \(a\) is the prior number of events observed - \(b\) is the prior sample size.

3.4 Posterior Expectation

\[ \begin{aligned} E[\theta\mid y_1,\ldots,y_n] &= \frac{a+\sum_{i=1}^n y_i}{b+n} \\ &= \frac{b}{b+n} \cdot \frac{a}{b} + \frac{n}{b+n} \cdot \frac{\sum_{i=1}^n y_i}{n} \\ &= \text{weighted average of prior mean and sample mean} \end{aligned} \]

If \(b\) is small compared to \(n\), the posterior mean is close to the sample mean.
If \(b\) is large compared to \(n\), the posterior mean is close to the prior mean.

Let’s see an example.

3.5 Data analysis: Birthrates

The following data are from the General Social Survey (GSS), a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The data set contains information on number of children (CHILDS) of Females aged 40 in different years, along with their education level (DEG: less than bachelor’s vs. bachelor’s or higher).

load("../data/gss.RData")
y1<-gss$CHILDS[gss$FEMALE==1 & gss$YEAR>=1990  & 
               gss$AGE==40 & gss$DEG<3 ]
y2<-gss$CHILDS[gss$FEMALE==1 &  gss$YEAR>=1990  &
               gss$AGE==40 & gss$DEG>=3 ]

Data summary

Group 1: less than bachelor’s \[ n_1 = 111,\quad\sum_{i=1}^{n_1} Y_{i,1} = 217,\quad \overline{Y}_1 = 1.95 \]
Group 2: Bachelor’s or higher

\[ n_2 = 44,\quad \sum_{i=1}^{n_2} Y_{i,2} = 66,\quad \overline{Y}_2 = 1.50 \]

Sampling Model

We have two distinct groups, so we will have two parameters:

\[ Y_{i,1}\mid \theta_1 \stackrel{iid}{\sim} \mathrm{Poisson}(\theta_1), \quad i=1,\ldots,n_1 \] \[ Y_{i,2}\mid \theta_2 \stackrel{iid}{\sim} \mathrm{Poisson}(\theta_2), \quad i=1,\ldots,n_2 \]

Prior

For the prior, we will use Gamma distribution for both parameters and an expert knowledge to set prior parameters.

Expert knowledge: Average birthrate is about 2 births per woman.

Since \(E[\theta] = \frac{a}{b}\), we can set \(a=2\) and \(b=1\) to reflect this knowledge and to make the prior as weak as possible (prior sample size is \(b=1\)).

Posterior

Using the conjugacy property, we can write down the posterior distributions for both parameters: \[ \theta_1\mid y_{1,1},\ldots,y_{n_1,1} \sim \mathrm{Gamma}(a + \sum_{i=1}^{n_1} y_{i,1}, b + n_1) = \mathrm{Gamma}(2 + 217, 1 + 111) \] \[ \theta_2\mid y_{1,2},\ldots,y_{n_2,2} \sim \mathrm{Gamma}(a + \sum_{i=1}^{n_2} y_{i,2}, b + n_2) = \mathrm{Gamma}(2 + 66, 1 + 44) \]

We can see the prior and the two posterior distributions on the plot below:

You can see that both posterior distributions are shifted to the left compared to the prior, indicating that the data suggests lower birth rates than the prior mean of 2. The posterior for the group with Bachelor’s or higher is shifted further left, indicating a lower birth rate compared to the group with less than a Bachelor’s degree. The posterior distributions are more concentrated than the prior, reflecting the information gained from the data.

Group 1 (< Bachelor’s) has a more concentrated posterior distribution compared to Group 2 (Bachelor’s or higher), indicating more certainty about the birth rate in Group 1 based on the data, since it has a larger sample size.

We can compute posterior summaries:

\[ Pr(\theta_1 > \theta_2 \mid \text{ data }) = 0.97 \]

\[ Pr(\widetilde{Y}_1 > \widetilde{Y}_2 \mid \text{ data }) = 0.47 \]

Q: what’s the difference between these two probability statements?

Answer

The first probability statement, \(Pr(\theta_1 > \theta_2 \mid \text{ data }) = 0.97\), refers to the probability that the true mean birth rate for Group 1 (less than Bachelor’s) is greater than that for Group 2 (Bachelor’s or higher) given the observed data. This is a statement about the parameters of the model. The second probability statement, \(Pr(\widetilde{Y}_1 > \widetilde{Y}_2 \mid \text{ data }) = 0.48\), refers to the probability that a new, randomly selected individual from Group 1 will have a higher birth count than a new, randomly selected individual from Group 2, given the observed data. This is a statement about future observations. Notice, that this is due to the fact that the posterior predictive distributions for both groups have considerable overlap.

3.6 Posterior predictive distribution

Recall, that the posterior predictive distribution is defined as \[ p(y_{n+1}\mid y_1,\ldots,y_n) = \int p(y_{n+1}\mid \theta)p(\theta\mid y_1,\ldots,y_n)d\theta = E_{\theta\mid y_1,\ldots,y_n}[p(y_{n+1}\mid \theta)] \]

In the other words, we average the sampling distribution for the new observation over the posterior distribution of the parameter. So we get a sampling distribution which incorporates uncertainty about the parameter, after learning from the data.

In the HW you will show, that the posterior predictive distribution for the Poisson sampling model with the Gamma prior will be a negative binomial distribution:

You can see that the posterior predictive distribution for the group with Bachelor’s or higher is shifted to the left compared to the group with less than a Bachelor’s degree, indicating a lower predicted birth rate for the former group. Also, there is a considerable overlap between the two distributions, suggesting that while there is a difference in predicted birth rates, there is also uncertainty in these predictions.