Lecture 15: Generalized Linear Models and Metropolis Algorithm

1 Generalized Linear Models

In this lecture, we will discuss how to extend linear regression to handle non-normal data, such as count data or binary outcomes. We will introduce the concept of Generalized Linear Models (GLMs) and explore how to perform Bayesian analysis for Poisson regression using both grid approximation and Markov Chain Monte Carlo (MCMC) methods.

1.1 Dataset: Sparrows

We use the dataset on the number of offsprings (or ‘fledged’) produced by sparrows of different ages. The data consists of pairs \((X_i, Y_i)\) where:

  • \(Y_i =\) # of eggs laid by bird \(i\);

  • \(X_i=\) age (in years) of bird \(i\).

What is the relationship between \(Y_i\) and \(X_i\)? Let’s visualize the data to get an initial understanding.

Note how the average number of offsprings increases with age until age 2, and then decreases for older birds. We will model the relationship between age and the number of offsprings using a Poisson random variable, which is a common choice for count data.

Q: Why don’t we use a binomial model here?
Answer The binomial model is typically used for binary outcomes (e.g., success/failure) or for modeling the number of successes in a fixed number of trials. In this case, we are modeling the count of offsprings, which can take on any non-negative integer value, and there is no fixed number of trials. The Poisson model is more appropriate for this type of count data.

So, our presumed model is: \[ Y_i\mid \theta \sim \text{Poisson}(\theta(Age)), \quad \text{where} \quad Age\ \in \{1,2,3,4,5,6\} \]

In other words, the mean number of offsprings \(\theta\) is a function of the age of the bird.

One approach would be to fit 6 different Poisson distributions, one for each age group. However, this would not allow us to understand the overall relationship between age and offspring count. Additionally, we won’t be able to fit a model to \(age=6\), since there is only one observation \(y=0\) for that age group.

Instead, we presume a simple relationship between age and the mean number of offsprings, such as a quadratic relationship: \[ \theta(Age) = \exp(\beta_1 + \beta_2 \cdot Age + \beta_3 \cdot Age^2)\quad\Longleftrightarrow\quad \log(\theta(Age)) = \beta_1 + \beta_2 \cdot Age + \beta_3 \cdot Age^2 \] We will use \(x\) instead of \(Age\) for notational convenience, and we will write \(\theta(x) = \exp(\beta_1 + \beta_2 x + \beta_3 x^2)\).

Q: Why do we use the exponential function here?
Answer

The exponential function is used to ensure that the mean \(\theta(x)\) is always positive, which is a requirement for the Poisson distribution. By modeling \(\log(\theta(x))\) as a linear function of the predictors, we can use the properties of the exponential function to maintain the positivity constraint on the mean.

Moreover, in the context of Generalized Linear Models, the exponential function serves as the canonical link function that connects the linear predictor (in this case, \(\beta_1 + \beta_2 x + \beta_3 x^2\)) to the mean of the distribution (in this case, \(\theta(x)\)). This allows us to model a wide range of relationships between the predictors and the response variable while ensuring that the model’s assumptions are satisfied. You can find out more about link functions in GLMs in the 414/614 Applied Statistics and Data Analysis course.

1.2 Linear Models vs Generalized Linear Models

Let us do a side by side comparison of linear models (LMs) and generalized linear models (GLMs):

Aspect LMs GLMs
Response Variable Continuous (e.g., \(Y_i \in \mathbb{R}\)) Can be continuous, binary, count, etc. (e.g., \(Y_i \in \{0,1\}\) for binary, \(Y_i \in \mathbb{N}_0\) for count)
Distribution Normal distribution (Gaussian) Exponential family distributions (e.g., Poisson for count data, Binomial for binary data)
Link Function Identity link (i.e., \(E[Y_i] = \underline{\beta}^T \underline{x}_i\)) Non-identity link functions (e.g., log link for Poisson, logit link for Binomial)
Model Form \(Y_i = \underline{\beta}^T \underline{x}_i + \varepsilon_i\) where \(\varepsilon_i \sim N(0, \sigma^2)\) \(g(E[Y_i]) = \underline{\beta}^T \underline{x}_i\) where \(g(\cdot)\) is the link function

1.3 Bayesian Analysis for Poisson Regression

Just like in linear regression, we can perform Bayesian analysis for Poisson regression by specifying a prior distribution for the parameters \(\underline{\beta}\) and using the sampling model derived from the Poisson distribution.

Prior for \(\underline\beta\)

As in linear regression, we can use a multivariate normal prior for \(\underline{\beta}\): \[ \underline{\beta} \sim MVN(\underline{0}, \underline{\Sigma}) \]

Sampling model

\[ Y_i\mid \underline{\beta}, \underline{x}_i \sim \text{Poisson}(\,\theta(x_i)=\exp(\underline{\beta}^T \underline{x}_i)\,) \]

Recall, \(\underline{x}_i=(1, x_i, x_i^2)^T=(1, age_i, age_i^2)^T\) is the vector of predictors for observation \(i\).

Posterior

\[ \begin{aligned} p(\underline{\beta}\mid \underline{y}, \underline{X})&\propto p(\underline{y}\mid \underline{\beta}, \underline{X})p(\underline{\beta})\\ &= \left(\prod_{i=1}^n p(y_i\mid \underline{\beta}, \underline{x}_i)\right) p(\underline{\beta})\\ &\propto \left(\prod_{i=1}^n \frac{\exp(-\theta(x_i))\theta(x_i)^{y_i}}{y_i!}\right) \cdot \exp\left(-\frac{1}{2}\underline{\beta}^T \Sigma^{-1} \underline{\beta}\right)\\ &\propto\prod_i \exp(-\exp(\underline{\beta}^T \underline{x}_i))\cdot \prod_i \exp(y_i \underline{\beta}^T \underline{x}_i) \cdot \exp\left(-\frac{1}{2}\underline{\beta}^T \Sigma^{-1} \underline{\beta}\right) \end{aligned} \]

The kernel of the posterior distribution is given by \[ p(\underline{\beta}\mid \underline{y}, \underline{X})\propto \exp\left(-\sum_i \exp(\underline{\beta}^T \underline{x}_i) + \sum_i y_i \underline{\beta}^T \underline{x}_i -\frac{1}{2}\underline{\beta}^T \Sigma^{-1} \underline{\beta}\right) \]

This kernel does not correspond to any known distribution, so we cannot directly sample from the posterior distribution. We can’t even use the Gibbs sampler here, since we cannot derive the full conditional distributions in closed form. Instead, we will use two different approaches to approximate the posterior distribution:

  • grid approximation
  • Metropolis algorithm (a type of Markov Chain Monte Carlo method)

2 Grid approximation

It’s quite straightforward to implement a grid approximation for the posterior distribution. We will create a grid of values for \(\beta_1\), \(\beta_2\), and \(\beta_3\), compute the log-posterior at each point in the grid, and then normalize to get the posterior probabilities.

Above are the 1-dimensional and 2-dimensional marginal posterior distributions for \(\beta_1\), \(\beta_2\), and \(\beta_3\) based on the grid approximation. The 1D plots show the marginal posterior densities for each parameter, while the 2D plots show the joint posterior distributions for pairs of parameters.

We can also visualize the pairwise relationships between the parameters using a pairs plot (command ggpairs), which shows the marginal distributions along the diagonal and the joint distributions in the off-diagonal panels.

What is the problem with this grid approximation approach? Can we use it for higher-dimensional problems?
Answer

The main problem with the grid approximation approach is that it becomes computationally infeasible as the numberof parameters increases. The number of grid points grows exponentially with the number of parameters (the “curse of dimensionality”), making it impractical for models with more than a few parameters. Additionally, the grid approximation may not capture the true shape of the posterior distribution accurately if the grid is too coarse or if the posterior has complex features (e.g., multimodality, strong correlations between parameters). Also note that the grid approximation is not a sampling method, so it does not provide a way to generate samples from the posterior distribution, which can be useful for making predictions or performing inference.

For higher-dimensional problems, we typically need to use more sophisticated methods like Markov Chain Monte Carlo (MCMC) to approximate the posterior distribution.

3 Metropolis Algorithm (MCMC)

We need to approximate \[ p(\theta\mid y)=\frac{p(\theta)p(y\mid \theta)}{\int p(y\mid \tilde\theta)p(\tilde\theta)d\tilde\theta} \]

We have three components in the above expression:

  • Prior \(p(\theta)\) (easy to evaluate)

  • Sampling model \(p(y\mid \theta)\) (easy to evaluate))

  • Denominator \(\int p(y\mid \tilde\theta)p(\tilde\theta)d\tilde\theta\) (hard to evaluate)

Idea: We have already generated samples \(\theta^{(1)}, \theta^{(2)},\ldots, \theta^{(S)}\) from the posterior distribution \(p(\theta\mid y)\). What should the next sample \(\theta^{(S+1)}\) be?

  • Propose a potential next sample \(\theta^*\) near the current sample \(\theta^{(S)}\).
  • Make a decision: is it a good sample or not?
    • If \(p(\theta^*\mid y) > p(\theta^{(S)}\mid y)\), then accept \(\theta^*\) as the next sample, i.e., set \(\theta^{(S+1)} = \theta^*\).
    • If \(p(\theta^*\mid y) < p(\theta^{(S)}\mid y)\), then maybe accept \(\theta^*\) with some probability, or reject it and keep \(\theta^{(S)}\) as the next sample.

Problem: How do we compare \(p(\theta^*\mid y)\) and \(p(\theta^{(S)}\mid y)\) if we cannot evaluate the denominator \(\int p(y\mid \tilde\theta)p(\tilde\theta)d\tilde\theta\)?

Solution: We can compare the ratios \(\frac{p(\theta^*\mid y)}{p(\theta^{(S)}\mid y)}\)!! The denominator cancels out in the ratio, so we can evaluate the ratio using only the prior and the likelihood, which are easy to compute.

Define the ratio: \[ \frac{p(\theta^*\mid y)}{p(\theta^{(S)}\mid y)} = r \]

Then we can make the decision as follows:

  • If \(r > 1\), then accept \(\theta^*\) as the next sample, i.e., set \(\theta^{(S+1)} = \theta^*\).
  • If \(r < 1\), then accept \(\theta^*\) with probability \(r\), i.e., set \(\theta^{(S+1)} = \theta^*\) with probability \(r\), and set \(\theta^{(S+1)} = \theta^{(S)}\) with probability \(1-r\).

Idea is simple: if the proposed sample is in the higher probability region of the posterior, we always accept it. If it is in the lower probability region, we sometimes accept it, which allows us to explore the posterior distribution and avoid getting stuck in local modes.

Metropolis Algorithm Illustration

Figure 1: Metropolis proposal step: from the current value \(\theta^{(s)}\), propose \(\theta^\star \sim g(\theta^\star \mid \theta^{(s)})\) and accept or reject using the posterior density up to proportionality.

3.1 Metropolis Algorithm Examples

We will show how tuning the proposal distribution affects the behavior of the Metropolis algorithm. We will use a bivariate normal distribution as our target distribution (i.e. this is the distribution from which we need to generate samples ) and implement a random walk Metropolis algorithm to sample from it. We will visualize the samples and the acceptance/rejection decisions for different proposal standard deviations.

The black points represent the accepted samples, while the red points represent the rejected proposals. The orange circles indicate the region around each accepted sample where proposals are likely to be made (i.e., the proposal distribution). The contour lines show the shape of the target distribution.

Q: What do you observe about the behavior of the chain with a very small proposal standard deviation? Note two key features.
Answer

With a very small proposal standard deviation (SD), the chain exhibits two key features:

  • High acceptance rate: The chain accepts almost all proposals because they are very close to the current state, which means they are likely to be in a high-probability region of the target distribution. This is actually not ideal, as it indicates that the chain is not exploring the parameter space effectively.

  • Poor mixing: The chain moves very slowly through the parameter space, resulting in a highly autocorrelated sequence of samples. This means that the samples are not independent and may not represent the target distribution well, especially if the chain gets stuck in a local mode for a long time.

Here we use the large standard deviation for the proposal distribution.

Q: What do you observe?

Answer

With a very large proposal standard deviation (SD), the chain exhibits the following features:

  • Low acceptance rate: The chain rejects most proposals because they are often far from the current state and may fall in low-probability regions of the target distribution. This leads to a high number of rejected proposals (red points).

  • Poor mixing: The chain does not move much from the initial state (it keeps rejecting the proposals!), resulting in a sequence of samples that are highly autocorrelated and do not explore the parameter space effectively. The chain may get stuck in a local mode or fail to converge to the target distribution within a reasonable number of iterations.

With a medium proposal standard deviation (SD), the chain exhibits a more balanced behavior:

  • Moderate acceptance rate: The chain accepts a reasonable proportion of proposals, allowing it to explore the parameter space more effectively.
  • Better mixing: The chain moves more freely through the parameter space, resulting in a sequence of samples that are less autocorrelated and better represent the target distribution. The samples are more spread out and cover the high-probability regions of the target distribution more effectively.

### How to choose the proposal standard deviation?

The rule of thumb for tuning the proposal standard deviation is to aim for an acceptance rate of around 25-45%. For high-dimensional problems, an acceptance rate of around 23% is often recommended. For univariate problems, an acceptance rate of around 44% is often recommended. However, these are just guidelines, and the optimal acceptance rate can depend on the specific problem and target distribution. It’s important to experiment with different values of the proposal standard deviation and assess the convergence and mixing of the chain to find a good balance.