MCMC-sampling

BADT

Author

Zheng Zhou

Learning goals

  1. describe what a Markov chain Monte Carlo (MCMC) method is

  2. describe the purpose and applicability range of MCMC

  3. compare the strengths and weaknesses of MCMC with other posterior estimation methods

  4. develop a simple MCMC (complete exercise 3)

  5. describe MCMC diagnostics

Beta-binomial conjugacy

Binomial likelihood:

n Bernoulli trials with y successes and unknown probability of success \(\theta\).

\[ y \sim \text{Binomial}(n,\theta) \]

The prior is a beta distribution:

\[ Beta(\alpha, \beta) \]

Conjugacy: the posterior distribution is also a beta distribution:

\[ Beta(\alpha + y, \beta + n - y) \]
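The conjugate update can be checked numerically in R; the data (n = 20, y = 14) and the Beta(2, 2) prior below are illustrative choices:

```r
# Conjugate update: Beta(alpha, beta) prior + binomial data
# -> Beta(alpha + y, beta + n - y) posterior
n <- 20; y <- 14        # illustrative data: 14 successes in 20 trials
alpha <- 2; beta <- 2   # illustrative Beta(2, 2) prior

alpha_post <- alpha + y      # 16
beta_post  <- beta + n - y   # 8

# Posterior mean of theta
alpha_post / (alpha_post + beta_post)  # 16/24 = 0.667
```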

Posterior computation

  1. Analytical solution: requires conjugacy, which is often unavailable

  2. Grid approximation: possible, but efficiency drops exponentially with dimension

  3. Laplace approximation: the assumption of multivariate normality is often not applicable

  4. Markov Chain Monte Carlo (MCMC): flexible, efficient
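Option 2 is easy to sketch in one dimension, where it works well; the data and prior below are illustrative, and the result can be checked against the conjugate answer:

```r
# Grid approximation of the beta-binomial posterior (one-dimensional case)
n <- 20; y <- 14                              # illustrative data
theta_grid <- seq(0, 1, length.out = 1000)    # 1000 grid points
prior <- dbeta(theta_grid, 2, 2)              # illustrative Beta(2, 2) prior
likelihood <- dbinom(y, n, theta_grid)
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)       # normalize over the grid

sum(theta_grid * posterior)   # posterior mean, close to 16/24 from conjugacy
# With d parameters and k grid points each, the grid needs k^d evaluations,
# which is why efficiency drops exponentially with dimension
```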

What is MCMC

  • Markov chain
    • Markov process: the next state depends only on the current state, \(f(t+1)=g(f(t),\text{noise})\)
    • Chain: a sequence of such states
  • Monte Carlo: a large enough number of samples can approximate the distribution
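A minimal illustration of the Monte Carlo idea, using draws from a known normal distribution (the mean, sd, and sample size are illustrative):

```r
# Monte Carlo: statistics computed from many random draws approximate
# the true properties of the distribution
set.seed(42)
draws <- rnorm(1e5, mean = 2, sd = 1)
mean(draws)                        # close to the true mean, 2
quantile(draws, c(0.025, 0.975))   # close to the true 95% interval (0.04, 3.96)
```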

MCMC transforms the question:

FROM:

Getting the posterior distribution

TO

Getting enough representative samples from the posterior distribution

Example: Island hopping

  1. Make a proposal.

  2. Acceptance ratio \(p_{prop}=\frac{proposal}{current}\).

  3. If \(p_{prop} \geq 1\), move.

  4. If \(p_{prop}\) < 1, draw a random value \(k\) from Uniform(0,1). If \(k<p_{prop}\), move; otherwise stay.

  5. Repeat steps 1-4.

Simulate the hopping process

In reality, the number of islands (possible parameter values) would be unknown.

Let’s first compute the posterior manually.

Prior: no knowledge of the islands

num_island <- 5  # number of islands (parameter values)

prior_function <- function(island) {
  return(1 / num_island)  # Uniform prior
}

Likelihood: proportional to relative population of each island

likelihood_function <- function(island) {
  return(island / sum(1:num_island))  # Proportional to island size
}

Therefore, posterior = prior * likelihood / marginal likelihood

true_posterior <- sapply(1:num_island, function(i) {
  prior_function(i) * likelihood_function(i)
})
true_posterior <- true_posterior / sum(true_posterior)  # marginalize
names(true_posterior) <- paste0("Island ",1:num_island)
true_posterior
  Island 1   Island 2   Island 3   Island 4   Island 5 
0.06666667 0.13333333 0.20000000 0.26666667 0.33333333 

Now compute the posterior with MCMC

num_iterations <- 1e3
starting_island <- 3

# store the positions
chain <- numeric(num_iterations)
prior_values <- numeric(num_iterations)
likelihood_values <- numeric(num_iterations)
posterior_values <- numeric(num_iterations)

# Calculate values for initial position
chain[1] <- starting_island
prior_values[1] <- prior_function(chain[1])
likelihood_values[1] <- likelihood_function(chain[1])
posterior_values[1] <- prior_values[1] * likelihood_values[1]  # Unnormalized posterior

# Store jump decisions for analysis
accepted <- logical(num_iterations - 1)
acceptance_ratios <- numeric(num_iterations - 1)

for(i in 2:num_iterations){
    current <- chain[i-1]
    
    # Propose a move: randomly jump to an adjacent island
    jump <- sample(c(-1,1), 1)  
    proposed <- current + jump
    # Handle boundaries with wrap-around
    if (proposed < 1) proposed <- num_island
    if (proposed > num_island) proposed <- 1
    # random walk proposal
    # proposed <- current + rnorm(1,proposal_mean,proposal_sd)
    
    # compute acceptance ratio
    # current position
    prior_current <- prior_function(current)
    likelihood_current <- likelihood_function(current)
    posterior_current <- prior_current * likelihood_current  # Unnormalized

    # proposed position    
    prior_proposed <- prior_function(proposed)
    likelihood_proposed <- likelihood_function(proposed)
    posterior_proposed <- prior_proposed * likelihood_proposed  # Unnormalized

    acceptance_ratio <- posterior_proposed / posterior_current
    acceptance_prob <- min(1, acceptance_ratio)

    # Store the computed values
    prior_values[i-1] <- prior_current
    likelihood_values[i-1] <- likelihood_current
    posterior_values[i-1] <- posterior_current
    acceptance_ratios[i-1] <- acceptance_prob

    # decide the jump
    if (runif(1) < acceptance_prob) {
      chain[i] <- proposed
      accepted[i-1] <- TRUE
    } else {
      chain[i] <- current
      accepted[i-1] <- FALSE
    }
}

Now summarize the results from MCMC

prior_values[num_iterations] <- prior_function(chain[num_iterations])
likelihood_values[num_iterations] <- likelihood_function(chain[num_iterations])
posterior_values[num_iterations] <- prior_values[num_iterations] * likelihood_values[num_iterations]

# Calculate empirical frequencies from MCMC samples
mcmc_freq <- table(chain) / num_iterations
mcmc_df <- data.frame(
  island = 1:num_island,
  mcmc_prob = as.numeric(mcmc_freq),
  true_posterior = true_posterior
)
mcmc_df
         island mcmc_prob true_posterior
Island 1      1     0.082     0.06666667
Island 2      2     0.138     0.13333333
Island 3      3     0.176     0.20000000
Island 4      4     0.268     0.26666667
Island 5      5     0.336     0.33333333

Trace the move

library(ggplot2)

# Trace plot with acceptance indicators
trace_df <- data.frame(
  iteration = 1:num_iterations,
  island = chain,
  accepted = c(NA, accepted),
  posterior = posterior_values
)
p_trace <- ggplot(trace_df, aes(x = iteration, y = island)) +
  geom_line(alpha = 0.5) +
  geom_point(aes( size = posterior), alpha = 0.7) +
  # scale_color_manual(values = c("FALSE" = "red", "TRUE" = "green"), 
  #                    na.value = "black", name = "Accepted") +
  scale_size_continuous(name = "Posterior", range = c(1, 3)) +
  scale_y_continuous(breaks = 1:num_island, limits = c(1, num_island)) +
  theme_classic() +
  labs(title = "Trace Plot",
       subtitle = "Point size proportional to posterior probability",
       x = "Iteration",
       y = "Island Number")
p_trace

Summary of hopping

When the process is repeated enough times, the frequency of visits to each island matches the relative population of that island.

Three critical things:

  1. know your position and adjacent options

  2. making a proposal and knowing the population of the proposal

  3. knowing the population of the current, to calculate the acceptance ratio \(p_{prop}=\frac{proposal}{current}\).

Metropolis algorithm

Island hopping is a special case; the Metropolis algorithm can handle:

  • continuous positions

  • more than one dimension

  • complex proposals

\[ y=\beta_0 + \beta_1 x + \epsilon \]

Model:

\[ \begin{equation} \begin{split} \epsilon \sim N(0,\sigma^2) \\ y \sim N(\beta_0 + \beta_1 x, \sigma^2) \end{split} \end{equation} \]

\(\beta_0\), \(\beta_1\), and \(\sigma\) are the parameters

Likelihood:

\[ p(Y|\beta_0,\beta_1,\sigma,x)=\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(y_i-(\beta_0+\beta_1 x_i))^2}{2\sigma^2}\right) \]

Priors:

\(\beta_0 \sim N(0,5)\); priors are likewise needed for \(\beta_1\) and \(\sigma\).
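In R, the building block for Metropolis is the unnormalized log-posterior. A minimal sketch for this regression model, where the priors on \(\beta_1\) and \(\sigma\) are illustrative assumptions (only the \(\beta_0\) prior is given above):

```r
# Unnormalized log-posterior for y = beta0 + beta1 * x + eps, eps ~ N(0, sigma^2)
# The priors on beta1 and sigma below are illustrative assumptions
log_posterior <- function(beta0, beta1, sigma, x, y) {
  if (sigma <= 0) return(-Inf)                        # sigma must be positive
  log_lik <- sum(dnorm(y, beta0 + beta1 * x, sigma, log = TRUE))
  log_prior <- dnorm(beta0, 0, 5, log = TRUE) +       # beta0 ~ N(0, 5)
               dnorm(beta1, 0, 5, log = TRUE) +       # assumed: beta1 ~ N(0, 5)
               dexp(sigma, rate = 1, log = TRUE)      # assumed: sigma ~ Exp(1)
  log_lik + log_prior
}
```

Working on the log scale avoids numerical underflow when many likelihood terms are multiplied.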

How does the island hopping analogy work here?

Metropolis algorithm
  1. Start with an arbitrary initial value of the parameter, and denote it \(\theta_{current}\)

  2. Randomly generate a proposed jump, \(\Delta\theta \sim N(0,\sigma^2)\) and derive the proposed parameter as \(\theta_{proposed} = \theta_{current}+\Delta\theta\).

  3. Compute the probability of moving to the proposed value as \[ p_{move}=\min\left(1,\frac{f(\theta_{proposed}|y)}{f(\theta_{current}|y)}\right) = \min\left(1,\frac{l(y|\theta_{proposed})f(\theta_{proposed})}{l(y|\theta_{current})f(\theta_{current})}\right) \]

  4. Accept the proposed parameter value if a random value sampled from a [0,1] uniform distribution is less than \(p_{move}\)

Repeat steps 2 to 4 until a sufficiently representative sample of the posterior has been generated.

Throw away a burn-in period of the sampling and keep the part where the algorithm has converged.
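The steps above can be sketched for a continuous parameter. Here the sampler targets the beta-binomial posterior so the answer can be checked against conjugacy; the data (n = 20, y = 14), the Beta(2, 2) prior, the proposal sd, and the chain length are all illustrative choices:

```r
set.seed(1)
n <- 20; y <- 14                       # illustrative data
log_post <- function(theta) {          # unnormalized log-posterior
  if (theta <= 0 || theta >= 1) return(-Inf)
  dbeta(theta, 2, 2, log = TRUE) + dbinom(y, n, theta, log = TRUE)
}

num_iterations <- 5e3
chain <- numeric(num_iterations)
chain[1] <- 0.5                        # step 1: arbitrary starting value
for (i in 2:num_iterations) {
  proposed <- chain[i - 1] + rnorm(1, 0, 0.1)               # step 2: random-walk proposal
  log_ratio <- log_post(proposed) - log_post(chain[i - 1])  # step 3 (on log scale)
  chain[i] <- if (log(runif(1)) < log_ratio) proposed else chain[i - 1]  # step 4
}

samples <- chain[-(1:500)]             # discard burn-in
mean(samples)                          # close to the conjugate posterior mean 16/24
```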

The Metropolis algorithm can be combined with smarter ways of generating proposals to make sampling more efficient, including Gibbs sampling and Hamiltonian Monte Carlo (HMC).

MCMC diagnostics

Before accepting MCMC results, how do we know how good they are?

  • Trace plot: is the chain biased by the arbitrary starting value? Does it fully explore the posterior range?

  • Rhat / Gelman-Rubin statistic: convergence of multiple chains, also visible in the trace plot

  • Effective sample size: measures chain accuracy, i.e., how many independent samples the chain is equivalent to when approximating the posterior

  • Autocorrelation: also reflects chain accuracy; strong correlation between samples a few steps apart means each draw adds little independent information
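These diagnostics can be sketched with base R on any numeric chain; the AR(1) series below is an illustrative stand-in for an autocorrelated MCMC chain, and the effective-sample-size formula is a rough version of the usual estimator:

```r
# Illustrative autocorrelated "chain" (an AR(1) series stands in for MCMC output)
set.seed(1)
samples <- as.numeric(arima.sim(list(ar = 0.6), n = 2000))

plot(samples, type = "l")   # trace plot: should look stationary, with no trend
acf(samples)                # autocorrelation: slow decay = little new info per draw

# Rough effective sample size: N / (1 + 2 * sum of positive autocorrelations)
rho <- acf(samples, plot = FALSE)$acf[-1]
n_eff <- length(samples) / (1 + 2 * sum(rho[rho > 0]))
n_eff   # far fewer than the 2000 raw draws
```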

Read the Bayes Rules! book, chapter 6.3, and DBDA, chapter 7.5, for technical details and examples.