Lecture 1: Intro and Probability Review

1 Syllabus




2 Foundations of Statistics: Frequentist and Bayesian

Frequentist and Bayesian statistics are two different approaches to statistical inference. In this class, we will focus on the Bayesian approach, but it is important to understand the differences between the two. We start with some quotes:

“Statistics is the science of information gathering, especially when
the information arrives in little pieces instead of big ones.”

— Bradley Efron

“Probability theory is nothing but common sense reduced to calculation.”

— Pierre-Simon Laplace

“Probability does not exist.”

— Bruno de Finetti

Statistics is often taught as a toolbox: confidence intervals, hypothesis tests, p-values. Bayesian statistics starts somewhere else.

We start by asking: what do we know, how uncertain are we, and how should that uncertainty change when we see data?

This course is about learning how to answer those questions using probability theory.

2.1 Estimation

Frequentist and Bayesian approaches to inference differ in the interpretation of probability, and hence in the interpretation of interval estimates (confidence intervals versus credible intervals).

  • Frequentist: I have 95% confidence that the population mean is between 12.7 and 14.5.

  • Meaning: If I repeated this experiment many times, 95% of the intervals I compute would contain the true population mean.

  • Bayesian: There is a 95% probability that the population mean is in the interval 12.7 to 14.5.

  • Meaning: Given the data I observed, there is a 95% probability that the population mean is between 12.7 and 14.5.

The Bayesian approach allows us to make direct probability statements about parameters, which is often what we want to know in practice. Frequentist CIs, on the other hand, do not allow this, because parameters are treated as fixed but unknown quantities.
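The repeated-sampling reading of a frequentist interval can be checked by simulation. The sketch below is only an illustration: the true mean (13.6), the known standard deviation (2.0), the sample size (25) and the seed are all made-up values, and the interval is the textbook z-interval with known variance.

```python
import random
import statistics

def coverage_of_z_interval(true_mean=13.6, sigma=2.0, n=25, reps=10_000, seed=1):
    """Fraction of 95% z-intervals (known sigma) that contain true_mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5   # 1.96 ~ 97.5% quantile of N(0, 1)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        # The parameter is fixed; the interval is what changes between repetitions.
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / reps

coverage = coverage_of_z_interval()
print(f"empirical coverage: {coverage:.3f}")  # close to 0.95; exact value depends on the seed
```

The 95% here describes the procedure over many repetitions, not any single computed interval.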

2.2 Hypothesis Testing

  • Frequentist: If \(H_0\) is true, we would get a result at least as extreme as the data we saw only 3.2% of the time. Since that is smaller than 5%, we reject \(H_0\) at the 5% level. These data provide significant evidence for the alternative hypothesis.

  • Meaning: If we repeated this experiment many times, we would reject \(H_0\) in 5% of the cases when \(H_0\) is true.

  • Bayesian: The odds in favor of \(H_0\) against \(H_A\) are 1 to 3.

  • Meaning: Given the data I observed, the probability that \(H_0\) is true is 25%, and the probability that \(H_A\) is true is 75%. Equivalently, given the data, \(H_A\) is three times as probable as \(H_0\): \[ \frac{Pr(H_0 \mid \text{data})}{Pr(H_A \mid \text{data})} = \frac{1}{3}. \]
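The conversion between posterior odds and posterior probabilities is simple arithmetic. A minimal sketch follows; the helper name `odds_to_probability` is hypothetical, and it assumes \(H_0\) and \(H_A\) are the only two hypotheses.

```python
from fractions import Fraction

def odds_to_probability(odds_for, odds_against):
    """Convert odds 'odds_for to odds_against' into a probability."""
    return Fraction(odds_for, odds_for + odds_against)

p_h0 = odds_to_probability(1, 3)   # odds of 1 to 3 in favor of H_0
p_ha = 1 - p_h0                    # H_0 and H_A are assumed exhaustive
print(p_h0, p_ha, p_h0 / p_ha)     # 1/4 3/4 1/3
```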

2.3 What do Bayesian methods provide?

  • Well-understood statistical behavior in both small and large samples
  • Coherent summaries of evidence under an explicit probability model
  • Predictive distributions that account for all sources of uncertainty
  • A unified framework for estimation, comparison, prediction, and checking

2.4 Setting up the Bayesian framework

  • \(\mathcal{Y}=\) sample space: the set of all possible datasets. E.g. for 3 coin flips, the sample space is \[ \mathcal{Y}=\{HHH, HHT, HTH, THH, HTT, THT, TTH, TTT\}. \]

  • \(y=\) a single observed dataset: i.e. \(y\in \mathcal{Y}\)

For example, if we observed heads, heads, tails (in that order), then \(y=HHT\in\mathcal{Y}\).

  • \(\Theta=\) parameter space: set of all possible values of the parameter(s) of interest. E.g. for a coin flip, the parameter space is \[ \Theta=[0,1], \]

where \(\theta\in \Theta\) is the probability of Heads.

Main Ingredients of the Bayesian approach

  • Prior distribution: \(p(\theta)\) or \(\pi(\theta)\)

It describes our beliefs about the unknown parameter(s) \(\theta\in\Theta\) that characterize the population (or data-generating process) prior to observing the data.

  • Sampling Model (Likelihood): \(p(y|\theta)\)

It specifies the probabilistic model of how the observed data \(y\in\mathcal{Y}\) are generated given the parameter(s) \(\theta\in\Theta\).

  • Posterior distribution: \(p(\theta|y)\)

It describes our updated beliefs about the unknown parameter(s) \(\theta\in\Theta\) after observing the data \(y\in\mathcal{Y}\).

Together, these components are related via Bayes’ Theorem:

\[ p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)} =\frac{p(y|\theta)p(\theta)}{\int_{\Theta} p(y|\theta)p(\theta)d\theta} \propto p(y|\theta)p(\theta). \]

Using the notation introduced above, we can summarize the Bayesian approach as follows: \[ \underbrace{p(\theta \mid y)}_{\text{posterior}} \quad\propto\quad \underbrace{p(y \mid \theta)}_{\text{sampling model}} \;\underbrace{p(\theta)}_{\text{prior}}, \qquad \underbrace{p(y)}_{\text{evidence}} \text{ ensures normalization.} \]
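To see "posterior \(\propto\) sampling model \(\times\) prior" in action, here is a minimal grid-approximation sketch for the 3-coin-flip setup with \(y=HHT\). The flat prior on \([0,1]\) and the grid size are assumptions made only for this illustration, and the evidence \(p(y)\) is approximated by a Riemann sum.

```python
def grid_posterior(heads, tails, grid_size=1001):
    """Approximate p(theta | y) on an evenly spaced grid over [0, 1]."""
    step = 1 / (grid_size - 1)
    thetas = [i * step for i in range(grid_size)]
    prior = [1.0 for _ in thetas]                          # flat prior density (assumption)
    likelihood = [t ** heads * (1 - t) ** tails for t in thetas]
    unnormalized = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(unnormalized) * step                    # Riemann-sum estimate of p(y)
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior

thetas, posterior = grid_posterior(heads=2, tails=1)       # y = HHT
step = thetas[1] - thetas[0]
posterior_mean = sum(t * p for t, p in zip(thetas, posterior)) * step
print(f"posterior mean of theta: {posterior_mean:.3f}")
```

With a flat prior the exact posterior is Beta(3, 2), whose mean is 0.6, so the grid answer should be very close to that value.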

2.5 Example 1: Estimating the probability of a rare event

Q: What is the prevalence of a rare disease in a population? A small sample of \(n=20\) from the city will be tested for infection.

  • \(y=\) number of infected people in the sample.
  • \(\theta=\) proportion of infected people in the population.
  • \(\mathcal{Y}=\{0,1,2,\ldots,20\}\).
  • \(\Theta=[0,1]\).

Sampling model \(p(y|\theta)\)

For a given infection rate \(\theta\), the number of infected people in a sample of size \(n=20\) follows a Binomial distribution:

\[ Y \mid \theta \sim \operatorname{Binomial}(n=20, \theta). \]

Note that we cannot say that \(Y\) follows a Binomial distribution without conditioning on \(\theta\) (more on this later in the class/homework).

Here is the pmf of \(Y\) for different values of \(\theta\) (i.e. various sampling models):
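The sampling models can also be tabulated directly. The sketch below evaluates the Binomial pmf for a few values of \(\theta\); the three values 0.05, 0.10 and 0.20 are chosen here only for illustration.

```python
from math import comb

def binomial_pmf(y, n, theta):
    """Pr(Y = y | theta) for Y | theta ~ Binomial(n, theta)."""
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

n = 20
for theta in (0.05, 0.10, 0.20):   # illustrative infection rates
    pmf = [binomial_pmf(y, n, theta) for y in range(n + 1)]
    most_likely = pmf.index(max(pmf))
    print(f"theta={theta:.2f}  Pr(Y=0)={pmf[0]:.3f}  most likely y={most_likely}")
```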

Prior distribution \(p(\theta)\)

Our prior beliefs often come from expert knowledge.

Expert knowledge: infection rate in comparable cities ranges from about 0.05 to 0.20, with an average prevalence of 0.10.

For convenience, we will use a Beta distribution as a prior for \(\theta\) (there are deeper reasons for this choice, more on this later):

\[ \theta \sim \operatorname{Beta}(a,b), \] where \(a,b>0\) are the shape parameters of the Beta distribution. Recall that the Beta distribution is supported on \([0,1]\), which makes it a natural choice for modeling probabilities. Its mean and mode are

\[ \operatorname{E}[\theta] = \frac{a}{a+b}, \qquad \operatorname{Mode}(\theta) = \frac{a-1}{a+b-2} \quad \text{if } a>1, b>1. \]

Let us choose \(a=2, b=20\) to reflect our expert knowledge. Under these prior parameters, the prior mean is \(2/22 \approx 0.09\) and \(Pr(0.05 < \theta < 0.20) \approx 0.66.\)
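These prior summaries can be reproduced numerically. The sketch below uses only the standard library: the Beta density is written out from its definition and the interval probability is approximated with a midpoint rule. The helper names and the number of integration steps are choices made for this illustration.

```python
from math import exp, lgamma, log

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at 0 < theta < 1, computed on the log scale."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(theta) + (b - 1) * log(1 - theta))

def beta_prob(lo, hi, a, b, steps=2_000):
    """Pr(lo < theta < hi) under Beta(a, b), approximated by the midpoint rule."""
    width = (hi - lo) / steps
    return sum(beta_pdf(lo + (i + 0.5) * width, a, b) for i in range(steps)) * width

a, b = 2, 20
prior_mean = a / (a + b)
prior_mode = (a - 1) / (a + b - 2)           # valid because a > 1 and b > 1
prior_mass = beta_prob(0.05, 0.20, a, b)
print(f"mean={prior_mean:.3f}  mode={prior_mode:.3f}  Pr(0.05<theta<0.20)={prior_mass:.3f}")
```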

Posterior distribution \(p(\theta|y)\)

Using Bayes’ Theorem, we can combine the sampling model and the prior to obtain the posterior distribution of \(\theta\) given the observed data \(y\). In the next lecture we will derive the posterior distribution in detail, but for now, we can state the result: \[ \theta \mid Y=y \sim \operatorname{Beta}(a + y, b + n - y). \]

If we do not observe any infections in our sample, i.e. \(y=0\), then the posterior distribution is \[ \theta \mid Y=0 \sim \operatorname{Beta}(2 + 0, 20 + 20 - 0) = \operatorname{Beta}(2, 40). \]

Here is what the prior and posterior distributions look like:

Notice how the posterior distribution is more concentrated around smaller values of \(\theta\) compared to the prior distribution. This reflects our updated beliefs about the infection rate after observing no infections in the sample.

We can also compute the posterior summaries of interest:

  • Posterior mean: \[ \operatorname{E}[\theta \mid Y=0] = \frac{a + y}{a + b + n} = \frac{2 + 0}{2 + 20 + 20} = \frac{2}{42} \approx 0.048. \]

  • Posterior probability: \[ Pr(\theta < 0.10 \mid Y=0) \approx 0.93, \]

which indicates a high probability that the infection rate is below 10% given the observed data.
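The conjugate update and the two posterior summaries can be sketched in a few lines. The function names are hypothetical. The CDF helper works only for positive integer shape parameters, where the identity \(Pr(\theta < x) = Pr(\operatorname{Binomial}(a+b-1, x) \geq a)\) applies; a general-purpose implementation would use a statistics library instead.

```python
from math import comb

def beta_binomial_update(a, b, y, n):
    """Posterior Beta shape parameters after y infections in n tests."""
    return a + y, b + n - y

def beta_cdf_integer(x, a, b):
    """Pr(theta < x) for Beta(a, b) with positive INTEGER a and b.

    Uses the identity Pr(theta < x) = Pr(Binomial(a + b - 1, x) >= a).
    """
    m = a + b - 1
    return sum(comb(m, k) * x ** k * (1 - x) ** (m - k) for k in range(a, m + 1))

a_post, b_post = beta_binomial_update(a=2, b=20, y=0, n=20)
post_mean = a_post / (a_post + b_post)
post_prob = beta_cdf_integer(0.10, a_post, b_post)
print(f"Beta({a_post}, {b_post}): mean={post_mean:.3f}, Pr(theta<0.10)={post_prob:.3f}")
```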

2.6 Sensitivity Analysis: Impact of the choice of Prior

Why did we choose a Beta(2,20) prior? What if we had different prior beliefs? Sensitivity analysis helps us understand how the choice of prior affects the posterior distribution and its summaries, and it is an essential part of Bayesian analysis.

We compare two posterior summaries:

\[ \mathbb{E}[\theta \mid Y=0], \] \[ Pr(\theta < 0.10 \mid Y=0). \]

Below are contour plots of these two posterior summaries as a function of prior parameters:

  • \(\theta_0 = a/(a+b)\) (prior mean)
  • \(w = a + b\) (prior concentration/strength of belief)

We can interpret the posterior mean as a weighted average of the prior mean and the sample proportion: \[ \begin{aligned} \operatorname{E}[\theta \mid Y=y] &= \frac{a + y}{a + y + b + n - y}\\ & = \frac{a + b}{a + b + n} \cdot \frac{a}{a + b} + \frac{n}{a + b + n} \cdot \frac{y}{n}\\ & = \frac{w}{w + n} \cdot \theta_0 + \frac{n}{w + n} \cdot \frac{y}{n}, \end{aligned} \] where \(\theta_0 = a/(a+b)\) is the prior mean and \(w = a + b\) is the prior concentration. So we have the following \[ \text{Posterior mean} = \underbrace{\text{weight on prior} \times \text{prior mean}}_{\text{influence of prior}} + \underbrace{\text{weight on data} \times \text{data mean}}_{\text{influence of data}}. \]

This effect appears throughout Bayesian modeling: posterior summaries are a compromise between prior beliefs and observed data. It is often referred to as Bayesian smoothing or Bayesian regularization.

As \(n\rightarrow \infty\), the weight on data goes to 1, and the weight on prior goes to 0, so the posterior mean converges to the sample proportion \(y/n\) regardless of the prior choice.

As \(w\rightarrow \infty\), the weight on prior goes to 1, and the weight on data goes to 0, so the posterior mean converges to the prior mean \(\theta_0\) regardless of the data.
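The weighted-average form is easy to check numerically. The sketch below recovers \(a=\theta_0 w\) and \(b=(1-\theta_0)w\) from the prior mean and concentration, then compares the weighted average with the direct formula \((a+y)/(a+b+n)\) for \(y=0\), \(n=20\). The grid of \(\theta_0\) and \(w\) values is arbitrary.

```python
def posterior_mean_weighted(theta0, w, y, n):
    """Posterior mean as (weight on prior) * theta0 + (weight on data) * (y / n)."""
    weight_prior = w / (w + n)
    weight_data = n / (w + n)
    return weight_prior * theta0 + weight_data * (y / n)

y, n = 0, 20
for theta0 in (0.05, 0.10, 0.20):           # illustrative prior means
    for w in (2, 22, 200):                  # illustrative prior concentrations
        a, b = theta0 * w, (1 - theta0) * w
        direct = (a + y) / (a + b + n)      # mean of Beta(a + y, b + n - y)
        weighted = posterior_mean_weighted(theta0, w, y, n)
        assert abs(direct - weighted) < 1e-12
        print(f"theta0={theta0:.2f}  w={w:>3}  posterior mean={weighted:.4f}")
```

For fixed data, increasing \(w\) pulls the posterior mean toward \(\theta_0\), while small \(w\) leaves it near \(y/n\).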

2.7 Example 2: Bayesian analysis in regression

Bayesian methods can also be applied to regression problems. Here, we illustrate the impact of the prior on the posterior inclusion probabilities of predictors in a Bayesian linear regression model.

Orange lines represent the prior inclusion probabilities (uniform prior with probability 0.5 for each predictor), while blue lines represent the posterior inclusion probabilities after observing the data.

We will cover Bayesian regression models in detail later in the course.

3 Probability Review

Ross, A First Course in Probability, is a good reference for reviewing these topics (or use your EN.553.420/620 lecture notes).

Definition: (Partition).

Let \(\mathcal{H}\) be the set of all possible truths (a sample space). A collection of events \(\{H_1, H_2, \dots, H_K\}\) is said to be a partition of \(\mathcal{H}\) if

  1. \(H_i \cap H_j = \emptyset\) for all \(i \neq j\) (mutually exclusive)
  2. \(\bigcup_{k=1}^K H_k = \mathcal{H}\) (collectively exhaustive)

Partition identity: if \(\{H_1, \dots, H_K\}\) is a partition of \(\mathcal{H}\), then \[ \sum_{k=1}^K \Pr(H_k) = \Pr(\mathcal{H}) = 1. \]

Law of Total Probability/Marginalization:

\[ \Pr(A) = \sum_{k=1}^K \Pr(A\cap H_k)=\sum_{k=1}^K \Pr(A\mid H_k)\Pr(H_k) \]

Here we used the definition of conditional probability: for \(Pr(B)>0\), \[ Pr(A\mid B) = \frac{Pr(A\cap B)}{Pr(B)}. \]

Bayes’ Rule / Bayes’ Theorem

\[ \Pr(H_i\mid A) = \frac{\Pr(A\mid H_i)\Pr(H_i)}{\Pr(A)}=\frac{\Pr(A\mid H_i)\Pr(H_i)}{\sum_{k=1}^K \Pr(A\mid H_k)\Pr(H_k)} \]




3.1 Example: General Social Survey

Education level and income of males over 30 years old. The events \(H_1,\dots,H_4\) are the income quartiles:

\[ \mathcal{H} = \{H_1,H_2,H_3,H_4\} = \{ \text{lowest 25\%}, \text{second 25\%}, \text{third 25\%}, \text{highest 25\%} \} \]

Let event \(A\) be \[ A=\{\text{a randomly chosen individual has a college degree}\} \]

Prior

Before observing \(A\), we have the following prior beliefs about income levels: \[Pr(H_1)=0.25,\quad Pr(H_2)=0.25,\quad Pr(H_3)=0.25,\quad Pr(H_4)=0.25. \]

Sampling model

From the data, we can estimate the following sampling model: \[ Pr(A\mid H_1)=0.11,\quad Pr(A\mid H_2)=0.19,\quad Pr(A\mid H_3)=0.31,\quad Pr(A\mid H_4)=0.53. \]

Posterior

\[ \begin{aligned} Pr(H_i\mid A) &= \frac{Pr(A\mid H_i)Pr(H_i)}{\sum_{k=1}^4 Pr(A\mid H_k)Pr(H_k)} \\ &= \frac{Pr(A\mid H_i)Pr(H_i)}{0.25\times(0.11+0.19+0.31+0.53)}. \end{aligned} \]

E.g. for \(i=1\) we will get \(Pr(H_1\mid A) = 0.11/1.14 \approx 0.096\).

Notice how our beliefs about income levels changed after observing that the individual has a college degree: before seeing anything, we thought that each income level was equally likely (25%), but after observing the degree we updated our beliefs and now think that the probability of being in the lowest income quartile is only about 9.6%.
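The full posterior over the four income quartiles follows from the same three steps: multiply the prior by the sampling model, sum to get \(Pr(A)\), then normalize. A minimal sketch using the numbers from this example:

```python
prior = [0.25, 0.25, 0.25, 0.25]          # Pr(H_k), income quartiles
likelihood = [0.11, 0.19, 0.31, 0.53]     # Pr(A | H_k), college degree given quartile

joint = [l * p for l, p in zip(likelihood, prior)]   # Pr(A and H_k)
evidence = sum(joint)                                # Pr(A), law of total probability
posterior = [j / evidence for j in joint]            # Pr(H_k | A), Bayes' rule

for k, post in enumerate(posterior, start=1):
    print(f"Pr(H_{k} | A) = {post:.3f}")
```

Because the prior is uniform, the posterior is just the sampling-model values rescaled to sum to one, so it increases with income quartile.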

3.2 Independence. Conditional Independence.

Two events \(A,B\) are independent if \[ Pr(A\cap B) = \Pr(A)\Pr(B) \quad\Leftrightarrow\quad \Pr(A\mid B) = \Pr(A) \quad\Leftrightarrow\quad \Pr(B\mid A) = \Pr(B). \]




Two events \(A,B\) are conditionally independent given event \(C\) if \[ Pr(A\cap B\mid C) = Pr(A\mid C)Pr(B\mid C) \quad\Leftrightarrow\quad Pr(A\mid B,C) = Pr(A\mid C) \quad\Leftrightarrow\quad Pr(B\mid A,C) = Pr(B\mid C). \]

Conditional independence is the most important definition for Bayesian modeling! Let us reiterate: the probability of \(A\) given both \(B\) and \(C\) is the same as the probability of \(A\) given \(C\) alone. In other words, once we know \(C\), knowing \(B\) does not provide any additional information about \(A\).
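Conditional independence does not imply ordinary independence. The toy setup below is hypothetical and not from the lecture: a coin is picked at random (fair, or biased with probability 0.9 of heads) and flipped twice. Given the coin \(C\), the events \(A\) = "first flip is heads" and \(B\) = "second flip is heads" are independent, but marginally they are not.

```python
coins = {"fair": 0.5, "biased": 0.9}      # Pr(heads | coin), hypothetical values
pr_coin = {"fair": 0.5, "biased": 0.5}    # Pr(C = coin)

# Given the coin C, the two flips are independent with the same heads probability.
pr_a_given_c = coins
pr_b_given_c = coins
pr_ab_given_c = {c: pr_a_given_c[c] * pr_b_given_c[c] for c in coins}

# Marginalize over the coin with the law of total probability.
pr_a = sum(pr_a_given_c[c] * pr_coin[c] for c in coins)
pr_b = sum(pr_b_given_c[c] * pr_coin[c] for c in coins)
pr_ab = sum(pr_ab_given_c[c] * pr_coin[c] for c in coins)

print(f"Pr(A and B) = {pr_ab:.2f},  Pr(A) * Pr(B) = {pr_a * pr_b:.2f}")
```

Seeing a head on the first flip makes the biased coin more plausible, which in turn raises the probability of a head on the second flip. That is why the two printed numbers differ.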

Exercise: derive the equivalence of the three statements above, using the definition of conditional probability.

3.3 Random Variables

Discrete RVs

Discrete random variable \(Y\) takes values in a finite or countable set \(\mathcal{Y}=\{y_1,y_2,\dots\}\).

  • Probability mass function (pmf):

\[ p(y) = Pr(Y=y) \]

For simplicity we will often refer to a pmf as a pdf (following P. Hoff), although technically this is not correct.

  • Examples:
    • Bernoulli, Binomial, Poisson, Multinomial, Geometric, Negative Binomial, Hypergeometric, Discrete Uniform, etc.
  • Properties:

\[ 0\leq p(y)\leq 1,\qquad \sum_{y\in \mathcal{Y}} p(y) = 1,\qquad \Pr(Y\in A) = \sum_{y\in A} p(y). \]

Continuous RVs

Continuous random variables \(Y\) are defined via a CDF (cumulative distribution function) \(F(a)\).

\[ F(a):= \Pr(Y\leq a)\quad \Leftrightarrow\quad 1-F(a)=\Pr(Y>a). \]

\[ \Pr(a<Y\leq b) = F(b)-F(a). \]

From CDF we can get the pdf (probability density function) as follows: \[ p(y) = \frac{d}{dy}F(y). \]

From pdf we can get back the CDF: \[ F(a) = \int_{y\leq a} p(y) dy. \]

  • Properties:

\[ 0\leq p(y),\qquad \int_{y\in \mathcal{Y}} p(y)dy = 1,\qquad \Pr(Y\in A) = \int_{y\in A} p(y)dy. \]

Note that the pdf can be larger than 1 for some values of \(y\) (unlike the pmf of a discrete RV).
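A quick illustration of a density exceeding 1: the Uniform(0, 0.5) distribution (an example chosen here, not from the lecture) has density 2 on its support, yet it still integrates to 1. The integral is approximated with a midpoint rule.

```python
def uniform_pdf(y, lo, hi):
    """Density of Uniform(lo, hi)."""
    return 1.0 / (hi - lo) if lo <= y <= hi else 0.0

lo, hi = 0.0, 0.5
steps = 1000
width = (hi - lo) / steps
total = sum(uniform_pdf(lo + (i + 0.5) * width, lo, hi) for i in range(steps)) * width
print(uniform_pdf(0.25, lo, hi), round(total, 6))   # density 2.0, total mass 1.0
```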

  • Examples:
    • Uniform, Normal, Exponential, Gamma, Beta, Weibull, Log-Normal, Pareto, Cauchy, \(t\)-distribution, etc.

Joint distributions of RVs

We can think of it as a distribution of a vector RV \(\mathbf{Y}=(Y_1,Y_2,\dots,Y_n)\).

  • Joint pdf:

\[ p_{Y_1,Y_2}(y_1,y_2) = Pr(Y_1=y_1, Y_2=y_2) \quad\text{(discrete RVs)} \]

  • Marginal pdf:

\[ p_{Y_1}(y_1) = \sum_{y_2\in \mathcal{Y}_2} p_{Y_1,Y_2}(y_1,y_2) \quad\text{(discrete RVs)} \] \[ p_{Y_1}(y_1) = \int_{\mathcal{Y}_2} p_{Y_1,Y_2}(y_1,y_2) dy_2 \quad\text{(continuous RVs)} \]

\[ p_{Y_2}(y_2) = \sum_{y_1\in \mathcal{Y}_1} p_{Y_1,Y_2}(y_1,y_2) \quad\text{(discrete RVs)} \]

\[ p_{Y_2}(y_2) = \int_{\mathcal{Y}_1} p_{Y_1,Y_2}(y_1,y_2) dy_1 \quad\text{(continuous RVs)} \]

For simplicity, we often drop the subscripts and write \(p(y_1,y_2)\), \(p(y_1)\), \(p(y_2)\).

  • Conditional pdf:

\[ p_{Y_1\mid Y_2}(y_1\mid y_2) = \frac{p_{Y_1,Y_2}(y_1,y_2)}{p_{Y_2}(y_2)} \quad\text{(discrete and continuous RVs)} \]
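For discrete RVs, all three objects can be read off a joint table. The sketch below uses a made-up joint pmf for two binary variables; the numbers and helper names are hypothetical.

```python
# Hypothetical joint pmf p(y1, y2) for two binary random variables.
joint = {(0, 0): 0.40, (0, 1): 0.10,
         (1, 0): 0.20, (1, 1): 0.30}

def marginal(joint, axis):
    """Marginal pmf of coordinate `axis`: sum the joint over the other coordinate."""
    out = {}
    for pair, prob in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + prob
    return out

def conditional_y1_given_y2(joint, y2):
    """p(y1 | y2) = p(y1, y2) / p(y2)."""
    p_y2 = marginal(joint, axis=1)[y2]
    return {y1: prob / p_y2 for (y1, v2), prob in joint.items() if v2 == y2}

p_y1 = marginal(joint, axis=0)
p_y2 = marginal(joint, axis=1)
cond = conditional_y1_given_y2(joint, y2=1)
print(p_y1, p_y2, cond)
```

The conditional pmf is a slice of the joint table renormalized by the matching marginal, so it sums to one over \(y_1\).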

Mixed continuous and discrete RVs

We will often encounter situations where some RVs are discrete and some are continuous. The definitions above still hold, but we need to be careful with the marginalization step. E.g. in the infected population example, we had a discrete RV \(Y\) (number of infected people in the sample) and a continuous RV \(\theta\) (infection rate in the population), and we were constructing the joint distribution \(p(y,\theta)\).

\[ Pr(Y_1\in A, Y_2\in B) = \int_{y_2\in B} \sum_{y_1\in A} p(y_1,y_2) dy_2 \quad\text{(discrete } Y_1, \text{ continuous } Y_2) \]

Marginalization: \[ p(y_1)= \int_{\mathcal{Y}_2} p(y_1,y_2) dy_2 \quad\text{(discrete } Y_1, \text{ continuous } Y_2) \]

\[ p(y_2)= \sum_{y_1\in \mathcal{Y}_1} p(y_1,y_2) \quad\text{(discrete } Y_1, \text{ continuous } Y_2) \]
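For the infected-population example, integrating \(\theta\) out of the joint \(p(y,\theta)=p(y\mid\theta)p(\theta)\) gives the marginal pmf of \(Y\). The sketch below does this with a midpoint rule under the Beta(2, 20) prior and compares it with the known closed form \(\binom{n}{y}B(a+y,\,b+n-y)/B(a,b)\), where \(B\) is the Beta function. Helper names and the step count are assumptions for illustration.

```python
from math import comb, exp, lgamma, log

def log_beta_fn(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def joint_density(y, theta, n, a, b):
    """p(y, theta) = Binomial(y | n, theta) * Beta(theta | a, b)."""
    binom = comb(n, y) * theta ** y * (1 - theta) ** (n - y)
    beta = exp((a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta_fn(a, b))
    return binom * beta

def marginal_y(y, n, a, b, steps=5_000):
    """p(y): integrate theta out of the joint with the midpoint rule."""
    width = 1.0 / steps
    return sum(joint_density(y, (i + 0.5) * width, n, a, b) for i in range(steps)) * width

def marginal_y_closed_form(y, n, a, b):
    """Closed form of the same integral: C(n, y) * B(a + y, b + n - y) / B(a, b)."""
    return comb(n, y) * exp(log_beta_fn(a + y, b + n - y) - log_beta_fn(a, b))

n, a, b = 20, 2, 20
p_y = [marginal_y(y, n, a, b) for y in range(n + 1)]
print(f"sum over y of p(y) = {sum(p_y):.4f},  p(0) = {p_y[0]:.4f}")
```

The resulting marginal is not a Binomial pmf, which is one way to see why the Binomial statement about \(Y\) only holds conditionally on \(\theta\).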

Conditionally independent RVs

The most important definition for Bayesian modeling!

\(Y_1,Y_2,\ldots,Y_n\) are conditionally independent given \(\theta\) if

\[ p(y_1,y_2,\ldots,y_n\mid \theta) = \prod_{i=1}^n p(y_i\mid \theta) \]

which implies, in particular, \[ p(y_i\mid y_j, \theta) = p(y_i\mid \theta) \quad \text{for all } i\neq j. \]

In other words, once we know \(\theta\), knowing \(Y_j\) does not provide any additional information about \(Y_i\).
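In practice, conditional independence is what lets us write a likelihood as a product. A minimal sketch for coin flips coded as 0/1: the Bernoulli model and the value \(\theta=0.6\) are assumptions for illustration. The product of the individual pmfs collapses to \(\theta^{s}(1-\theta)^{n-s}\), where \(s\) is the number of heads.

```python
from math import prod

def bernoulli_pmf(y, theta):
    """p(y | theta) for a single 0/1 observation."""
    return theta if y == 1 else 1 - theta

def joint_likelihood(ys, theta):
    """p(y_1, ..., y_n | theta) under conditional independence given theta."""
    return prod(bernoulli_pmf(y, theta) for y in ys)

ys = [1, 1, 0]            # H, H, T coded as 1, 1, 0
theta = 0.6               # hypothetical value of theta
s, n = sum(ys), len(ys)
product_form = joint_likelihood(ys, theta)
closed_form = theta ** s * (1 - theta) ** (n - s)
print(product_form, closed_form)
```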