02: ABC-SMC Inference of Mutation-Selection Parameters

Why Bayesian Inference for Mutation-Selection Dynamics?

The Basener-Sanford extension of Fisher's Fundamental Theorem tells us that mean fitness changes as:

d(m̄)/dt ≈ Var(m) + μ · Eg[s] · b̄

The first term (fitness variance driving selection) competes with the second term (mutational drag). Whether a population adapts or undergoes mutational meltdown depends on the precise relationship between five key parameters: the mutation rate μ, the shape and scale of the distribution of fitness effects (DFE), the fraction of beneficial mutations, and the environmental noise level.
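As a rough worked example of how the two terms compare, the sketch below plugs in the synthetic true values used later in this notebook (μ = 0.1, gamma_shape = 0.5, gamma_scale = 0.003, p_beneficial = 0.005). It rests on two assumptions that the text does not confirm: that beneficial and deleterious mutations share the same gamma-distributed effect magnitude and differ only in sign, and that b̄ = 1. Under those assumptions the mutational term is about -1.5e-4, so mean fitness would rise only if Var(m) exceeds roughly that amount.

```python
# Synthetic "true" values quoted in this notebook.
mu = 0.1
gamma_shape = 0.5
gamma_scale = 0.003
p_beneficial = 0.005

# ASSUMPTION (illustrative, not from the source): b_bar = 1, and beneficial and
# deleterious effects share one Gamma(shape, scale) magnitude with opposite signs.
b_bar = 1.0

mean_magnitude = gamma_shape * gamma_scale  # mean of a gamma distribution
mean_effect = (2.0 * p_beneficial - 1.0) * mean_magnitude  # E[s] under the assumption
mutational_term = mu * mean_effect * b_bar  # second term of the equation
critical_variance = -mutational_term  # Var(m) at which the two terms cancel


def fitness_change_rate(var_m):
    """Right-hand side of the equation: Var(m) + mu * E[s] * b_bar."""
    return var_m + mutational_term


print(f"E[s] = {mean_effect:.6f}")
print(f"mutational term = {mutational_term:.3e}")
print(f"critical Var(m) = {critical_variance:.3e}")
```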

A simple point estimate ("best fit") for these parameters would not tell us whether the population is definitively in one regime or ambiguously near the boundary. Bayesian inference gives us a full probability distribution over parameter values — a posterior distribution — that quantifies our uncertainty. This lets us ask not just "what are the best-fit parameters?" but "given the data, what is the probability that this population is above or below the meltdown threshold?"
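The regime question can be answered directly from posterior draws by counting how many fall on each side of the threshold. The sketch below is a minimal illustration: the function names, the four hand-written draws, the fixed Var(m), and b̄ = 1 are all hypothetical placeholders rather than results from this notebook, and treating a negative rate of change as a proxy for the meltdown side is itself a simplification.

```python
def regime_probability(posterior_samples, rate_fn):
    """Fraction of posterior draws for which rate_fn(sample) is negative."""
    if not posterior_samples:
        raise ValueError("need at least one posterior sample")
    declining = sum(1 for sample in posterior_samples if rate_fn(sample) < 0.0)
    return declining / len(posterior_samples)


def net_rate(sample, var_m=1e-4, b_bar=1.0):
    """Var(m) + mu * E[s] * b_bar, with var_m and b_bar as ASSUMED constants."""
    return var_m + sample["mu"] * sample["mean_effect"] * b_bar


# HYPOTHETICAL draws for illustration only; real draws would come from the sampler.
hypothetical_draws = [
    {"mu": 0.10, "mean_effect": -0.0015},  # drag outweighs the assumed variance
    {"mu": 0.12, "mean_effect": -0.0012},  # drag outweighs the assumed variance
    {"mu": 0.05, "mean_effect": -0.0015},  # variance outweighs drag
    {"mu": 0.08, "mean_effect": -0.0010},  # variance outweighs drag
]

p_decline = regime_probability(hypothetical_draws, net_rate)
print(f"P(mean fitness declining | draws) = {p_decline:.2f}")
```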

What is ABC-SMC? (Explained for Biologists)

For the Basener-Sanford simulator, there is no formula for the probability of observing particular data given particular parameter values (the "likelihood"). The simulator is a complex stochastic process — you can run it forward, but you cannot write down a closed-form probability for any specific outcome.

Approximate Bayesian Computation (ABC) sidesteps this problem by using simulation as a substitute for the likelihood:

  1. Propose parameter values from a prior distribution (expressing our initial uncertainty)
  2. Simulate the population forward using those parameters
  3. Compress each simulation into 16 summary statistics (fitness trends, variance patterns, population dynamics)
  4. Accept parameters whose simulated summary statistics are "close enough" to the observed data
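
The four steps above can be sketched as a basic rejection sampler. This is a toy illustration, not the notebook's implementation: a one-parameter Gaussian simulator stands in for the Basener-Sanford simulator, two summary statistics stand in for the 16 used here, and the prior range, threshold, and function names are all assumptions chosen to keep the example small.

```python
import math
import random
import statistics


def toy_simulator(theta, rng, n=50):
    """Stand-in for the forward simulator: n noisy observations around theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summarize(data):
    """Compress a simulation into summary statistics (here only mean and sd)."""
    return (statistics.fmean(data), statistics.stdev(data))


def abc_rejection(observed_summary, n_accept, epsilon, rng):
    """Keep prior draws whose simulated summaries lie within epsilon of the data."""
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.uniform(-5.0, 5.0)                    # step 1: propose from prior
        simulated = toy_simulator(theta, rng)             # step 2: simulate forward
        summary = summarize(simulated)                    # step 3: summarize
        if math.dist(summary, observed_summary) < epsilon:  # step 4: accept if close
            accepted.append(theta)
    return accepted


rng = random.Random(0)
true_theta = 1.5  # known value used to generate the synthetic "observed" data
observed_summary = summarize(toy_simulator(true_theta, rng))
posterior = abc_rejection(observed_summary, n_accept=200, epsilon=0.3, rng=rng)
posterior_mean = statistics.fmean(posterior)
print(f"posterior mean of theta: {posterior_mean:.2f}")
```

A smaller epsilon gives a tighter approximation but rejects more proposals, which is the cost that the sequential scheme is designed to reduce.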

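The Sequential Monte Carlo (SMC) refinement runs this in rounds, progressively tightening the acceptance threshold so that the accepted parameters increasingly approximate the posterior distribution. The result remains an approximation: it is conditioned on the summary statistics rather than the full data, and on a final threshold that is small but nonzero.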
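
A minimal sketch of the round-by-round idea follows, written in plain Python rather than with PyMC so that each step is visible. It uses one common variant (population Monte Carlo with a Gaussian perturbation kernel) on a toy one-parameter simulator; the threshold schedule, kernel width, prior range, and function names are assumptions for illustration and may differ from what the notebook's sampler does internally.

```python
import math
import random
import statistics


def simulate_mean(theta, rng, n=50):
    """Toy simulator reduced to a single summary statistic (the sample mean)."""
    return statistics.fmean([rng.gauss(theta, 1.0) for _ in range(n)])


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def abc_smc(observed, epsilons, n_particles, rng, lo=-5.0, hi=5.0):
    """ABC-SMC with a Uniform(lo, hi) prior and decreasing thresholds."""
    particles, weights, kernel_sd = [], [], None
    for round_index, eps in enumerate(epsilons):
        if round_index > 0:
            # Kernel width: twice the weighted variance of the previous round.
            mean_prev = sum(w * p for w, p in zip(weights, particles))
            var_prev = sum(w * (p - mean_prev) ** 2 for w, p in zip(weights, particles))
            kernel_sd = math.sqrt(2.0 * var_prev)
        new_particles, new_weights = [], []
        while len(new_particles) < n_particles:
            if round_index == 0:
                theta = rng.uniform(lo, hi)  # first round: draw from the prior
            else:
                parent = rng.choices(particles, weights=weights)[0]
                theta = rng.gauss(parent, kernel_sd)  # perturb a previous particle
                if not lo <= theta <= hi:
                    continue  # zero prior density outside the support
            if abs(simulate_mean(theta, rng) - observed) >= eps:
                continue  # simulated summary too far from the observed one
            if round_index == 0:
                weight = 1.0
            else:
                # Importance weight: prior / proposal density. The uniform prior
                # is constant on its support and cancels on normalization.
                proposal = sum(w * normal_pdf(theta, p, kernel_sd)
                               for w, p in zip(weights, particles))
                weight = 1.0 / proposal
            new_particles.append(theta)
            new_weights.append(weight)
        total = sum(new_weights)
        particles, weights = new_particles, [w / total for w in new_weights]
    return particles, weights


rng = random.Random(0)
observed = simulate_mean(1.5, rng)  # synthetic data with known theta = 1.5
particles, weights = abc_smc(observed, epsilons=[1.0, 0.5, 0.2], n_particles=200, rng=rng)
weighted_mean = sum(w * p for w, p in zip(weights, particles))
print(f"weighted posterior mean: {weighted_mean:.2f}")
```

Proposing from the previous round's accepted particles, rather than from the prior each time, is what keeps the acceptance rate workable as the threshold shrinks.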

Analogy: Imagine trying to figure out a cake recipe by repeatedly baking cakes with different ingredient amounts and keeping the recipes that produce cakes most similar to the target. Each round, you become pickier about what counts as "similar enough," until you converge on the range of recipes that could plausibly produce that cake.

This notebook validates the method on synthetic data with known true parameters (μ = 0.1, gamma_shape = 0.5, gamma_scale = 0.003, p_beneficial = 0.005, sigma_env_ind = 0.008), corresponding to a mutational meltdown regime. If ABC-SMC recovers these known parameters, that supports (but does not guarantee) its reliability on real data. The implementation uses PyMC.

1. Posterior Distributions

ABC-SMC posterior distributions
Figure 1: Marginal posterior distributions for each inferred parameter. Red dashed lines indicate the true parameter values used to generate the synthetic observed data. Each panel shows the posterior distribution for one of the five parameters that govern the mutation-selection dynamics.

Reading the plots: Narrow, peaked posteriors near the red line indicate that the data are highly informative and that the method successfully recovers that parameter. Wide, flat posteriors mean the data cannot strongly constrain that parameter, which is a genuine finding about what the observed dynamics can and cannot tell us. Posteriors shifted away from the true value would indicate systematic bias in the inference.
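
The same reading can be made numerically by summarizing each marginal with a credible interval and checking whether it contains the true value. The sketch below is a generic illustration; the helper names are hypothetical and the sample values are an evenly spaced placeholder grid for μ, not draws produced by this notebook.

```python
import math
import statistics


def credible_interval(samples, level=0.95):
    """Equal-tailed interval from sorted samples (conservative index rounding)."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    lower_index = math.floor(tail * (len(ordered) - 1))
    upper_index = math.ceil((1.0 - tail) * (len(ordered) - 1))
    return ordered[lower_index], ordered[upper_index]


def summarize_marginal(samples, true_value, level=0.95):
    """Posterior mean, interval, interval width, and whether the truth is covered."""
    lower, upper = credible_interval(samples, level)
    return {
        "mean": statistics.fmean(samples),
        "interval": (lower, upper),
        "width": upper - lower,          # narrow = informative, wide = weakly constrained
        "covers_truth": lower <= true_value <= upper,
    }


# PLACEHOLDER samples for mu (0.08 to 0.12), standing in for real posterior draws.
placeholder_mu_samples = [0.08 + 0.0004 * i for i in range(101)]
summary = summarize_marginal(placeholder_mu_samples, true_value=0.1)
print(summary)
```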

2. Pairwise Posterior Correlations

Pairwise posterior correlations
Figure 2: Pairplot showing correlations between posterior parameter samples. Off-diagonal scatter plots reveal parameter trade-offs that reflect genuine biological identifiability issues in the mutation-selection model. Diagonal panels show marginal histograms. Understanding these trade-offs is essential for honest interpretation: a well-constrained marginal posterior may hide a strong joint correlation that makes the overall inference less certain than univariate plots suggest.
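
To show how a joint trade-off can hide behind wide marginals, the sketch below builds a deliberately artificial set of samples in which gamma_shape and gamma_scale vary widely while their product (the mean of a gamma distribution) stays fixed. The values are constructed for illustration and are not output from this notebook; whether the real posterior shows this particular trade-off is something only the pairplot can establish.

```python
import math
import statistics


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sample lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# ARTIFICIAL samples: shape ranges over 0.3 to 0.7, scale is chosen so that
# shape * scale is always 0.0015 (the product of the true values 0.5 and 0.003).
shapes = [0.3 + 0.01 * i for i in range(41)]
scales = [0.0015 / shape for shape in shapes]
products = [shape * scale for shape, scale in zip(shapes, scales)]

correlation = pearson_correlation(shapes, scales)
print(f"corr(shape, scale) = {correlation:.2f}")
print(f"shape range:   {min(shapes):.2f} to {max(shapes):.2f}")
print(f"product range: {min(products):.6f} to {max(products):.6f}")
```

In this constructed case each parameter looks poorly constrained on its own, yet the combination that matters is pinned down almost exactly, which is the pattern a strongly tilted off-diagonal panel would signal.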

3. Posterior Predictive Check

Posterior predictive check
Figure 3: Posterior predictive simulations overlaid on observed data. This is the single most important validation of the inference. Twenty parameter vectors are sampled from the posterior and each is used to run a full forward simulation of the Basener-Sanford model. The resulting fitness trajectories (colored lines) should bracket the observed data (black line).

This check answers the key question: does the Basener-Sanford model, with the inferred parameters, actually reproduce the observed population dynamics? If so, the theoretical framework captures the essential biology. If not, either the model is missing important processes, or the ABC-SMC method lacks the power to find the right parameters, which motivates the more powerful BSL and SBI methods explored in subsequent notebooks.
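
The bracketing criterion can also be checked numerically by counting how many observed points fall inside the envelope of the simulated trajectories. The sketch below uses a toy linear-decline simulator in place of the Basener-Sanford model; the simulator, the decline rate, the placeholder posterior, and the function names are all assumptions made for illustration.

```python
import random


def toy_fitness_trajectory(decline_rate, noise_sd, rng, n_generations=100):
    """Stand-in simulator: mean fitness declines linearly with independent noise."""
    return [1.0 - decline_rate * t + rng.gauss(0.0, noise_sd)
            for t in range(n_generations)]


def envelope_coverage(observed, simulated):
    """Fraction of observed points lying within the pointwise min-max envelope."""
    inside = 0
    for t, value in enumerate(observed):
        at_t = [trajectory[t] for trajectory in simulated]
        if min(at_t) <= value <= max(at_t):
            inside += 1
    return inside / len(observed)


rng = random.Random(1)
observed = toy_fitness_trajectory(0.001, 0.008, rng)  # synthetic "observed" data

# PLACEHOLDER posterior: decline rates within 10% of the generating value.
placeholder_posterior = [(0.001 * rng.uniform(0.9, 1.1), 0.008) for _ in range(500)]

draws = rng.sample(placeholder_posterior, 20)  # twenty parameter vectors
simulated = [toy_fitness_trajectory(rate, sd, rng) for rate, sd in draws]
coverage = envelope_coverage(observed, simulated)
print(f"fraction of observed points inside the envelope: {coverage:.2f}")
```

Even a correct model will not give full coverage with this measure: among 21 exchangeable trajectories, the observed one is the pointwise minimum or maximum about 2 times in 21, so coverage near 90% is what agreement looks like here.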

Key Biological Insights from ABC-SMC Inference