Lecture 2: Binomial Model

1 Looking ahead

In the next couple of lectures we will cover one-parameter models:

  • Binomial model
  • Poisson model

We will develop the following Bayesian analysis tools:

  • Conjugate priors
  • Predictive distributions
  • Confidence Regions
    • Frequentist coverage
    • Bayesian coverage
      • Highest Posterior Density region (HPD)
      • Quantile based region

2 One parameter models: Binomial

This is based on Chapter 3 of Hoff’s book. We use the General Social Survey (GSS) data on happiness.

Example: GSS Happiness data, \(n=129\) \[ Y_i=\begin{cases} 1, & \text{if the $i$th person is happy},\\ 0, & \text{if the $i$th person is unhappy}. \end{cases} \]

\[ \theta=\text{proportion of happy people in the population}=\frac{1}{N}\sum_{i=1}^N Y_i, \]

where \(N\) is the population size. We observe only a sample of size \(n=129\).

Q: Do you think \(Y_i\)’s are independent?

A: In the Bayesian framework, they are dependent unless we condition on \(\theta\). Intuitively, if we see many happy people in the sample, then it is more likely that \(\theta\) is large, which in turn makes it more likely that the next person is happy. In the frequentist framework, by contrast, the \(Y_i\)’s are considered independent, since \(\theta\) is fixed but unknown.

Exercise: If \(Y_1, Y_2\mid\theta\sim Bernoulli(\theta)\), conditionally independent, prove that \[ Cov(Y_1,Y_2)=Var(\theta). \]
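Before attempting the proof, the identity can be checked numerically. The sketch below is only a sanity check, not a solution: it assumes the uniform prior \(\theta\sim Unif[0,1]\) (so \(Var(\theta)=1/12\)), and the function name `simulate_cov`, the seed, and the number of draws are illustrative choices.

```python
import random

def simulate_cov(num_draws=200_000, seed=0):
    """Monte Carlo estimate of Cov(Y1, Y2), assuming theta ~ Unif[0,1]
    and Y1, Y2 | theta conditionally independent Bernoulli(theta)."""
    rng = random.Random(seed)
    sum_y1 = sum_y2 = sum_y1y2 = 0
    for _ in range(num_draws):
        theta = rng.random()                     # draw theta from the assumed prior
        y1 = 1 if rng.random() < theta else 0    # Y1 | theta
        y2 = 1 if rng.random() < theta else 0    # Y2 | theta, independent given theta
        sum_y1 += y1
        sum_y2 += y2
        sum_y1y2 += y1 * y2
    mean_y1 = sum_y1 / num_draws
    mean_y2 = sum_y2 / num_draws
    # Cov(Y1, Y2) = E[Y1 * Y2] - E[Y1] * E[Y2]
    return sum_y1y2 / num_draws - mean_y1 * mean_y2

cov_estimate = simulate_cov()
var_theta = 1 / 12  # variance of Unif[0,1]
print(cov_estimate, var_theta)
```

The estimate should land close to \(1/12\), which is clearly positive: marginally, \(Y_1\) and \(Y_2\) are not independent.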

Sampling model:

Individually, \(Y_i\mid\theta\sim Bernoulli(\theta)\), so \(p(y_i=1\mid\theta)=\theta\) and \(p(y_i=0\mid\theta)=1-\theta\).

Since the \(Y_i\)’s are conditionally independent given \(\theta\), we can write the joint sampling model as: \[ \begin{aligned} p(y_1,\ldots,y_{129}\mid \theta)&=p(y_1\mid \theta)p(y_2\mid \theta)\cdots p(y_{129}\mid \theta)\\ &=\theta^{\sum_{i=1}^{129} y_i}(1-\theta)^{129-\sum_{i=1}^{129} y_i} \end{aligned} \]

Prior:

We have no expert knowledge about \(\theta\), so we use a uniform prior on \([0,1]\): \[ p(\theta)=\begin{cases} 1, & 0 \leq \theta \leq 1 \\ 0, & \text{otherwise} \end{cases} \]

In other words, \(\theta\sim Unif[0,1]\).

Posterior:

We observed \(118\) happy people and \(11\) unhappy people. In other words, \(\sum_{i=1}^{129} y_i=118\) and \(n-\sum_{i=1}^{129} y_i=11\).

Using Bayes’ theorem, we have \[ \begin{aligned} p(\theta\mid y_1,\ldots,y_n)&=\frac{p(y_1,\ldots,y_n\mid \theta)p(\theta)}{p(y_1,\ldots,y_n)}\\ &\propto p(y_1,\ldots,y_n\mid \theta)\underbrace{p(\theta)}_{=1}\qquad\text{(the denominator is constant wrt }\theta\text{)}\\ &=p(y_1,\ldots,y_n\mid \theta)\\ &=\theta^{118}(1-\theta)^{11} \end{aligned} \]

Above we did a few manipulations which are common in Bayesian statistics. The key point is that the posterior distribution is proportional to the likelihood times the prior. The denominator \(p(y_1,\ldots,y_n)\) is a normalizing constant that does not depend on \(\theta\). We are looking for a function of \(\theta\) that has the same shape as the posterior distribution, so we can ignore any multiplicative constants that do not depend on \(\theta\).

In other words, we just derived that \[ p(\theta\mid y_1,\ldots,y_n)\propto \theta^{118}(1-\theta)^{11} \quad\Leftrightarrow\quad p(\theta\mid y_1,\ldots,y_n)=C\theta^{118}(1-\theta)^{11} \]

Q: What kind of distribution is this?

Common wrong answer: This looks like a Binomial distribution! But it is not, because the Binomial is a distribution over integer counts, whereas we are looking for a distribution as a function of \(\theta\) supported on \([0,1]\).

Correct answer:

This is a Beta distribution! Recall that if \(\theta\sim Beta(a,b)\), then \[ p(\theta)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1},\quad 0\leq \theta \leq 1, \] where \(\Gamma(a)=\int_0^\infty t^{a-1}e^{-t}dt\) is the Gamma function. For a positive integer \(a\), \(\Gamma(a)=(a-1)!\).

Normalization Trick:

This is a common trick to identify the normalizing constant \(C\). We use the fact that every probability density must integrate to 1, together with the known form of the Beta density:

\[ \theta\sim Beta(a,b),\qquad p(\theta)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} \]

Since this density integrates to 1 over \([0,1]\), we get

\[ \int_0^1 \theta^{a-1}(1-\theta)^{b-1}d\theta=\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} \]

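For our example, the posterior kernel \(\theta^{118}(1-\theta)^{11}\) has this form with \(a-1=118\) and \(b-1=11\), that is, \(a=119\) and \(b=12\), so \[ C=\frac{\Gamma(119+12)}{\Gamma(119)\Gamma(12)} \]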
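As a sanity check on the normalization trick, the following sketch compares \(\log C\) computed from the Gamma-function formula with the value obtained by numerically integrating the kernel \(\theta^{118}(1-\theta)^{11}\). The helper names (`log_beta_function`, `integrate_kernel`) and the use of Simpson's rule are illustrative choices; only the Python standard library is used.

```python
import math

def log_beta_function(a, b):
    """log of Gamma(a) * Gamma(b) / Gamma(a + b), computed on the log scale."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def integrate_kernel(a, b, num_intervals=20_000):
    """Composite Simpson's rule for the integral of
    theta^(a-1) * (1-theta)^(b-1) over [0, 1] (num_intervals must be even)."""
    h = 1.0 / num_intervals

    def kernel(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)

    total = kernel(0.0) + kernel(1.0)
    for i in range(1, num_intervals):
        total += (4 if i % 2 else 2) * kernel(i * h)
    return total * h / 3

a, b = 119, 12
log_C_formula = -log_beta_function(a, b)          # log C from the Gamma formula
numeric_integral = integrate_kernel(a, b)         # integral of the unnormalized kernel
log_C_numeric = -math.log(numeric_integral)       # C = 1 / integral
print(log_C_formula, log_C_numeric)
```

Working on the log scale avoids overflow: \(\Gamma(131)=130!\) is far too large to handle comfortably as a floating-point number.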

Posterior (contd):

Recall the prior was uniform on \([0,1]\), which is equivalent to \(Beta(1,1)\). \[ \theta\sim Unif[0,1] \equiv Beta(1,1) \]

We derived that the posterior is \[ \theta\mid y_1,\ldots,y_n \sim Beta(119,12) \]

Our posterior is from the same family as the prior! This is an example of a conjugate prior. The same argument works for a general \(Beta(a,b)\) prior.

Notice that we did not need the specific sequence of 0’s and 1’s to derive the posterior, only the total count of 1’s, \(\sum_i y_i\). This quantity is called a sufficient statistic for \(\theta\) in the Binomial model.

So we denote it as \(Y=\sum_{i=1}^n Y_i\) and write the sampling model as \[ Y\mid \theta \sim Bin(n,\theta) \]
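To illustrate sufficiency, here is a minimal sketch showing that two different sequences with the same number of 1's give the same likelihood at every \(\theta\). The two sequences and the function name `bernoulli_log_likelihood` are made up for illustration; they are not GSS data.

```python
import math

def bernoulli_log_likelihood(ys, theta):
    """log p(y_1, ..., y_n | theta) for conditionally independent Bernoulli(theta) draws."""
    return sum(math.log(theta) if y == 1 else math.log(1 - theta) for y in ys)

# Two hypothetical sequences of length 5, both with sum(y) = 2.
seq_a = [1, 1, 0, 0, 0]
seq_b = [0, 0, 1, 0, 1]

thetas = (0.2, 0.5, 0.9)
log_liks = []
for theta in thetas:
    ll_a = bernoulli_log_likelihood(seq_a, theta)
    ll_b = bernoulli_log_likelihood(seq_b, theta)
    log_liks.append((ll_a, ll_b))
    print(theta, ll_a, ll_b)
```

Because the likelihood depends on the data only through \(\sum_i y_i\), so does the posterior.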

2.1 Conjugate priors

A class \(\mathcal{P}\) of prior distributions for \(\theta\) is called conjugate for a sampling model \(p(y|\theta)\) if \[ p(\theta) \in \mathcal{P}\qquad \Longrightarrow \qquad p(\theta|y) \in \mathcal{P} \]

2.2 Recall: Binomial model

\[ Y=\sum_{i=1}^n Y_i\mid \theta \sim Bin(n,\theta) \]

\[ Pr(Y=y\mid \theta)=\binom{n}{y}\theta^y(1-\theta)^{n-y},\quad y=0,1,\ldots,n \]

We are choosing a prior from the Beta family: \[ \theta\sim Beta(a,b),\quad a>0,b>0 \]

Posterior: \[ \begin{aligned} p(\theta\mid y)&=\frac{p(y\mid\theta)p(\theta)}{p(y)}\\ &=\frac{\binom{n}{y}\theta^y(1-\theta)^{n-y} \cdot \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}}{p(y)}\\ &\propto \theta^{a+y-1}(1-\theta)^{b+n-y-1}\\ \end{aligned} \]

We ignored all terms that do not depend on \(\theta\). Therefore, the posterior is \(Beta(a+y,b+n-y)\).

From now on, for any conjugate prior analysis, we will directly write the posterior with the corresponding updated parameters, skipping the full derivation. That’s the power of the conjugate prior approach! We just need to do simple algebra on the parameters!!
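The "simple algebra on the parameters" can be written as a one-line function. This is a minimal sketch; the name `beta_binomial_update` and the input checks are illustrative, not part of the lecture.

```python
def beta_binomial_update(a, b, y, n):
    """Map a Beta(a, b) prior and y successes in n trials
    to the Beta(a + y, b + n - y) posterior parameters."""
    if not (a > 0 and b > 0):
        raise ValueError("Beta parameters must be positive")
    if not (0 <= y <= n):
        raise ValueError("y must satisfy 0 <= y <= n")
    return a + y, b + n - y

# GSS happiness example: uniform prior Beta(1, 1), y = 118 happy out of n = 129.
post_a, post_b = beta_binomial_update(a=1, b=1, y=118, n=129)
print(post_a, post_b)  # 119 12
```

With the uniform prior this reproduces the \(Beta(119,12)\) posterior derived earlier.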

2.3 Interpreting prior parameters

\[ \begin{aligned} \theta&\sim Beta(a,b)\\ \theta\mid Y=y&\sim Beta(a+y,b+n-y)\\ \end{aligned} \]

By seeing how the prior parameters \(a,b\) update to \(a+y,b+n-y\), we can interpret them as follows:

  • \(a\): prior number of “1”s,
  • \(b\): prior number of “0”s,
  • \(a+b\): prior sample size.

Posterior Expectation:

\[ \begin{aligned} E[\theta \mid Y=y]=\frac{a+y}{a+b+n}&=\frac{a+b}{a+b+n}\cdot \frac{a}{a+b} + \frac{n}{a+b+n}\cdot \frac{y}{n}\\ &=w_{prior}\cdot E[\theta] + w_{data}\cdot \frac{y}{n} \end{aligned} \]

where \[ w_{prior}=\frac{a+b}{a+b+n},\quad w_{data}=\frac{n}{a+b+n} \]

Interpretation: The posterior mean is a weighted average of the prior mean and the sample proportion, with weights proportional to the prior sample size and the actual sample size.

If \(n\gg a+b\), then more weight is given to the data, and the posterior mean is close to the sample proportion \(\frac{y}{n}\). If \(a+b\gg n\), then more weight is given to the prior, and the posterior mean is close to the prior mean \(E[\theta]\).

We will see this effect in a variety of models in this course.
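The weighted-average form of the posterior mean can be verified with exact arithmetic. The sketch below applies it to the GSS example with the uniform \(Beta(1,1)\) prior; the function name `posterior_mean_decomposition` is an illustrative choice.

```python
from fractions import Fraction

def posterior_mean_decomposition(a, b, y, n):
    """Return (w_prior, w_data, weighted_average, direct_formula) as exact fractions."""
    a, b, y, n = (Fraction(v) for v in (a, b, y, n))
    w_prior = (a + b) / (a + b + n)
    w_data = n / (a + b + n)
    prior_mean = a / (a + b)
    sample_proportion = y / n
    weighted = w_prior * prior_mean + w_data * sample_proportion
    direct = (a + y) / (a + b + n)
    return w_prior, w_data, weighted, direct

# GSS example: Beta(1, 1) prior, y = 118, n = 129.
w_prior, w_data, weighted, direct = posterior_mean_decomposition(1, 1, 118, 129)
print(w_prior, w_data, float(direct))
```

Here \(w_{prior}=2/131\) and \(w_{data}=129/131\), so the posterior mean \(119/131\approx 0.908\) sits very close to the sample proportion \(118/129\approx 0.915\).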

2.4 Prior influence example

Consider a small sample, \(n=5\) with \(y=1\) (20% of “1”s), analyzed under a weak prior, \(Beta(1,1)\), and a stronger prior, \(Beta(3,2)\).

Figure: Small sample size, \(n=5\), 20% of “1”s.

Notice that with a small sample size, the stronger prior, \(Beta(3,2)\), has more influence on the posterior than the weak prior, \(Beta(1,1)\).

For the second case we take a larger sample: \(n=100\) and \(y=20\) (again 20% of “1”s), with the same weak and stronger priors, \(Beta(1,1)\) and \(Beta(3,2)\).

Figure: Large sample size, \(n=100\), 20% of “1”s.

Here, with large sample size, the prior has very little influence on the posterior. The data dominates the prior.
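The two scenarios can be compared numerically through their posterior means. This is a small sketch using the conjugate update from above; the dictionary labels and the helper `posterior_mean` are illustrative names.

```python
from fractions import Fraction

def posterior_mean(a, b, y, n):
    """Mean of the Beta(a + y, b + n - y) posterior."""
    return Fraction(a + y, a + b + n)

scenarios = {"small sample": (1, 5), "large sample": (20, 100)}   # (y, n)
priors = {"Beta(1,1)": (1, 1), "Beta(3,2)": (3, 2)}               # (a, b)

results = {}
for scenario_name, (y, n) in scenarios.items():
    for prior_name, (a, b) in priors.items():
        mean = posterior_mean(a, b, y, n)
        results[(scenario_name, prior_name)] = mean
        print(f"{scenario_name:12s} {prior_name}: posterior mean = {float(mean):.4f}")
```

With \(n=5\) the two posterior means are \(2/7\approx 0.286\) and \(4/10=0.4\), a gap of about 0.11. With \(n=100\) they are \(21/102\approx 0.206\) and \(23/105\approx 0.219\), a gap of about 0.013.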

2.5 Predictive Distribution

Idea: Having observed \(Y_1,\ldots, Y_n\), what can we say about the next, unseen observation \(Y_{n+1}\)?

Bayesian statistics provides a natural way to answer this question via the posterior predictive distribution.

\[ \widetilde{Y}\mid Y_1,\ldots,Y_n \sim Bernoulli(Pr(\widetilde{Y}=1\mid Y_1,\ldots,Y_n)) \]

Since the next observation is either 0 or 1, we just need to find the probability of success.

\[ \begin{aligned} Pr(\widetilde{Y}=1\mid Y_1,\ldots,Y_n) &= \int_0^1 Pr(\widetilde{Y}=1,\theta\mid Y_1,\ldots,Y_n)d\theta\qquad\text{ (marginalizing)}\\ &= \int_0^1 Pr(\widetilde{Y}=1\mid \theta,Y_1,\ldots,Y_n)p(\theta\mid Y_1,\ldots,Y_n)d\theta\qquad\text{ (conditional probability)}\\ &= \int_0^1 Pr(\widetilde{Y}=1\mid \theta)p(\theta\mid Y_1,\ldots,Y_n)d\theta\qquad\text{ (conditional independence)}\\ &= \int_0^1 \theta \cdot p(\theta\mid Y_1,\ldots,Y_n)d\theta\qquad\text{ (sampling model)}\\ &= E[\theta\mid Y_1,\ldots,Y_n] \end{aligned} \]

So we get the posterior expectation! This makes sense intuitively: the best guess for the probability that the next observation is a 1 is the current best guess for the parameter \(\theta\), which is the posterior mean \(\frac{a+y}{a+b+n}\).

In other words we have \[ \widetilde{Y}\mid y_1,\ldots,y_n \sim Bernoulli\left(\frac{a+y}{a+b+n}\right)= Bernoulli\left(\frac{a+\sum_{i=1}^n y_i}{a+b+n}\right) \]
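The marginalization above can be mimicked by simulation: draw \(\theta\) from the posterior, then draw \(\widetilde{Y}\mid\theta\), and count how often \(\widetilde{Y}=1\). The sketch below does this for the GSS example with the \(Beta(1,1)\) prior and compares it with the exact value \(\frac{a+y}{a+b+n}\); the function names, seed, and number of draws are illustrative choices.

```python
import random

def predictive_prob_exact(a, b, y, n):
    """Pr(Y_tilde = 1 | data) = posterior mean of theta."""
    return (a + y) / (a + b + n)

def predictive_prob_simulated(a, b, y, n, num_draws=100_000, seed=0):
    """Monte Carlo version: theta ~ Beta(a + y, b + n - y), then Y_tilde | theta ~ Bernoulli(theta)."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(num_draws):
        theta = rng.betavariate(a + y, b + n - y)   # draw from the posterior
        if rng.random() < theta:                    # draw the next observation
            successes += 1
    return successes / num_draws

# GSS example: Beta(1, 1) prior, y = 118, n = 129.
exact = predictive_prob_exact(1, 1, 118, 129)
simulated = predictive_prob_simulated(1, 1, 118, 129)
print(exact, simulated)
```

The exact value is \(119/131\approx 0.908\), and the simulated frequency should agree up to Monte Carlo error.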