02: ABC-SMC Inference — Mutation-Selection Model

Why Bayesian Inference for Mutation-Selection Dynamics?

The Basener-Sanford extension of Fisher's Fundamental Theorem expresses the rate of change of mean fitness through two competing terms: a selection term driven by the variance in fitness, and a mutational drag term, μ · E_g[s] · b̄. Whether a population adapts or undergoes mutational meltdown depends on the precise relationship between five key parameters: the mutation rate μ, the shape and scale of the distribution of fitness effects (DFE), the fraction of beneficial mutations, and the environmental noise level.

A simple point estimate ("best fit") for these parameters would not tell us whether the population is definitively in one regime or ambiguously near the boundary. Bayesian inference gives us a full probability distribution over parameter values — a posterior distribution — that quantifies our uncertainty. This lets us ask not just "what are the best-fit parameters?" but "given the data, what is the probability that this population is above or below the meltdown threshold?"

What is ABC-SMC? (Explained for Biologists)

For the Basener-Sanford simulator, there is no formula for the probability of observing particular data given particular parameter values (the "likelihood"). The simulator is a complex stochastic process — you can run it forward, but you cannot write down a closed-form probability for any specific outcome.

Approximate Bayesian Computation (ABC) sidesteps this problem by using simulation as a substitute for the likelihood:

Propose parameter values from a prior distribution (expressing our initial uncertainty)
Simulate the population forward using those parameters
Compress each simulation into 16 summary statistics (fitness trends, variance patterns, population dynamics)
Accept parameters whose simulated summary statistics are "close enough" to the observed data
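
The four steps above can be sketched as a plain rejection sampler. This is a minimal illustration under stated assumptions, not the notebook's PyMC implementation: the simulator is a toy stand-in (a noisy linear fitness decline with one parameter), it uses two summary statistics instead of 16, and the names simulate, summarize, and abc_rejection are hypothetical.

```python
import math
import random
import statistics


def simulate(decline_rate, rng, generations=50, noise_sd=0.01):
    """Toy stand-in simulator: mean fitness declines linearly, plus noise."""
    return [1.0 - decline_rate * t + rng.gauss(0.0, noise_sd)
            for t in range(generations)]


def summarize(trajectory):
    """Compress a trajectory into two summaries: early and late mean fitness."""
    half = len(trajectory) // 2
    return (statistics.fmean(trajectory[:half]),
            statistics.fmean(trajectory[half:]))


def abc_rejection(observed_summary, n_proposals, epsilon, rng):
    """Keep every proposed parameter whose summaries land within epsilon."""
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(0.0, 0.01)             # 1. propose from the prior
        trajectory = simulate(theta, rng)          # 2. simulate forward
        summary = summarize(trajectory)            # 3. compress to summaries
        if math.dist(summary, observed_summary) < epsilon:
            accepted.append(theta)                 # 4. accept if close enough
    return accepted


rng = random.Random(0)
true_rate = 0.003                                  # known truth for the toy data
observed_summary = summarize(simulate(true_rate, rng))
posterior = abc_rejection(observed_summary, n_proposals=5000, epsilon=0.01, rng=rng)

print("accepted:", len(posterior))
print("posterior mean:", statistics.fmean(posterior))
```

The accepted values form an approximate posterior sample. Shrinking epsilon sharpens the approximation but lowers the acceptance rate, which is the cost the sequential scheme is designed to reduce.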

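The Sequential Monte Carlo (SMC) refinement runs this in rounds. Each round proposes new parameters by perturbing those accepted in the previous round and applies a tighter acceptance threshold, so that the accepted parameters approximate the true posterior distribution progressively better.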
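
A minimal sketch of the round-by-round idea follows, written as a population Monte Carlo variant of ABC with importance weights. It is an assumption-laden illustration rather than the PyMC sampler used in this notebook: the one-parameter toy simulator, the uniform prior bounds, the tolerance schedule, and the function names are all hypothetical.

```python
import math
import random
import statistics

PRIOR_LOW, PRIOR_HIGH = 0.0, 0.01   # hypothetical uniform prior on the toy parameter


def simulate_summary(decline_rate, rng, generations=50, noise_sd=0.01):
    """Toy simulator and summary in one: late-half mean of a noisy linear decline."""
    late = [1.0 - decline_rate * t + rng.gauss(0.0, noise_sd)
            for t in range(generations // 2, generations)]
    return statistics.fmean(late)


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def abc_smc(observed, epsilons, n_particles, rng):
    particles, weights, kernel_sd = [], [], None
    for round_index, epsilon in enumerate(epsilons):
        if round_index > 0:
            # Perturbation kernel: Gaussian with twice the weighted variance
            # of the previous population.
            mean = sum(w * p for w, p in zip(weights, particles))
            var = sum(w * (p - mean) ** 2 for w, p in zip(weights, particles))
            kernel_sd = math.sqrt(2.0 * var)
        new_particles, new_weights = [], []
        while len(new_particles) < n_particles:
            if round_index == 0:
                theta = rng.uniform(PRIOR_LOW, PRIOR_HIGH)    # first round: prior
            else:
                base = rng.choices(particles, weights=weights)[0]
                theta = rng.gauss(base, kernel_sd)            # perturb a survivor
                if not PRIOR_LOW <= theta <= PRIOR_HIGH:
                    continue                                  # zero prior density
            if abs(simulate_summary(theta, rng) - observed) >= epsilon:
                continue                                      # not close enough
            if round_index == 0:
                weight = 1.0
            else:
                # Importance weight = prior density / proposal density.
                # The uniform prior is constant and cancels on normalization.
                proposal = sum(w * normal_pdf(theta, p, kernel_sd)
                               for w, p in zip(weights, particles))
                weight = 1.0 / proposal
            new_particles.append(theta)
            new_weights.append(weight)
        total = sum(new_weights)
        particles = new_particles
        weights = [w / total for w in new_weights]
    return particles, weights


rng = random.Random(0)
true_rate = 0.003
observed = simulate_summary(true_rate, rng)
particles, weights = abc_smc(observed, epsilons=[0.1, 0.03, 0.01, 0.005],
                             n_particles=200, rng=rng)
posterior_mean = sum(w * p for w, p in zip(weights, particles))
print("weighted posterior mean:", posterior_mean)
```

Because later rounds propose near parameters that already passed a looser threshold, far fewer simulations are wasted than if every round drew from the prior.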

Analogy: Imagine trying to figure out a cake recipe by repeatedly baking cakes with different ingredient amounts and keeping the recipes that produce cakes most similar to the target. Each round, you become pickier about what counts as "similar enough," until you converge on the range of recipes that could plausibly produce that cake.

This notebook validates the method on synthetic data with known true parameters (μ = 0.1, gamma_shape = 0.5, gamma_scale = 0.003, p_beneficial = 0.005, sigma_env_ind = 0.008), corresponding to a mutational meltdown regime. If ABC-SMC recovers these known parameters, that supports trusting it on real data. The implementation uses PyMC.

1. Posterior Distributions

Figure 1: Marginal posterior distributions for each inferred parameter. Red dashed lines indicate the true parameter values used to generate the synthetic observed data. Each panel shows the posterior distribution for one of the five parameters that govern the mutation-selection dynamics:

μ (mutation rate) — directly controls the mutational drag term μ · E_g[s] · b̄ in the Basener-Sanford equation. A posterior concentrated at low values suggests the population has a low mutation rate (favorable for maintaining fitness); concentration at high values suggests substantial mutational load that may push the population toward meltdown.
gamma_shape (DFE shape) — controls whether mutations have mostly tiny effects with rare severe ones (shape < 1, the empirically supported L-shaped distribution) or more uniform effect sizes (shape > 1). The true value of 0.5 represents the L-shaped DFE seen in empirical mutation accumulation experiments.
gamma_scale (DFE scale) — controls the magnitude of fitness effects. The mean deleterious effect is shape × scale, which is E_g[s] in the theorem. This parameter, together with μ, determines the total mutational drag on the population.
p_beneficial (beneficial fraction) — the fraction of mutations that improve fitness. Fisher's original theorem implicitly assumed roughly 50%, but empirical data consistently shows 0.1–1%. The true value of 0.5% reflects this empirical reality.
sigma_env_ind (environmental noise) — noise in individual fitness measurements that is not genetic. Higher noise makes selection less efficient at purging deleterious mutations, effectively shifting the balance toward meltdown.
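
As a worked check of the DFE arithmetic above, the sketch below computes the mean deleterious effect (shape × scale) for the true values used in this notebook and compares it with a Monte Carlo estimate. It covers only the gamma-distributed deleterious effects. The beneficial fraction and the b̄ factor of the drag term are left out, and the variable names are illustrative.

```python
import random
import statistics

mu = 0.1
gamma_shape = 0.5
gamma_scale = 0.003

# Mean of a gamma distribution is shape * scale.
analytic_mean = gamma_shape * gamma_scale      # 0.5 * 0.003 = 0.0015
mu_times_mean = mu * analytic_mean             # 0.1 * 0.0015 = 0.00015

# random.gammavariate(alpha, beta) takes the shape as alpha and the scale as beta.
rng = random.Random(1)
effects = [rng.gammavariate(gamma_shape, gamma_scale) for _ in range(100_000)]
empirical_mean = statistics.fmean(effects)

# With shape < 1 the distribution is L-shaped: most draws fall below the mean.
fraction_below_mean = sum(e < analytic_mean for e in effects) / len(effects)

print("analytic mean effect:", analytic_mean)
print("empirical mean effect:", empirical_mean)
print("fraction of effects below the mean:", fraction_below_mean)
```

The fraction below the mean comes out well above one half, which is the "mostly tiny effects with rare severe ones" pattern described for shape < 1.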

Reading the plots: Narrow, peaked posteriors near the red line indicate the data is highly informative and the method successfully recovers that parameter. Wide, flat posteriors mean the data cannot strongly constrain that parameter — a genuine finding about what the observed dynamics can and cannot tell us. Posteriors shifted away from the true value would indicate systematic bias in the inference.
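
The three reading patterns can also be expressed numerically. The sketch below summarizes a marginal posterior by its mean and a central 95% credible interval, then checks whether the interval covers the true value. The three sample sets are synthetic stand-ins constructed to show each pattern, not output from this notebook, and summarize_marginal is a hypothetical helper.

```python
import random
import statistics


def summarize_marginal(samples, true_value):
    """Mean, central 95% interval, interval width, and truth coverage."""
    cuts = statistics.quantiles(samples, n=40)   # cut points every 2.5%
    lower, upper = cuts[0], cuts[-1]             # 2.5% and 97.5% quantiles
    return {
        "mean": statistics.fmean(samples),
        "lower": lower,
        "upper": upper,
        "width": upper - lower,
        "covers_truth": lower <= true_value <= upper,
    }


rng = random.Random(3)
true_mu = 0.1
narrow = [rng.gauss(0.10, 0.005) for _ in range(2000)]   # informative, unbiased
wide = [rng.uniform(0.0, 0.3) for _ in range(2000)]      # weakly constrained
biased = [rng.gauss(0.15, 0.005) for _ in range(2000)]   # peaked but shifted

for label, samples in [("narrow", narrow), ("wide", wide), ("biased", biased)]:
    print(label, summarize_marginal(samples, true_mu))
```

A narrow interval that covers the truth corresponds to successful recovery, a wide interval to weak identifiability, and a narrow interval that misses the truth to systematic bias.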

2. Pairwise Posterior Correlations

Figure 2: Pairplot showing correlations between posterior parameter samples. Off-diagonal scatter plots reveal parameter trade-offs that reflect genuine biological identifiability issues in the mutation-selection model. Diagonal panels show marginal histograms.

μ vs. gamma_scale (the critical trade-off): The mutational drag depends on the product μ · E_g[s], so the observed fitness decline could be explained equally well by a moderate mutation rate with strong per-mutation effects or by a high mutation rate with weak per-mutation effects. A negative correlation here is the visual signature of this fundamental identifiability issue: the data constrain the product μ · E_g[s] better than the individual factors.
gamma_shape vs. gamma_scale: These jointly determine the DFE. A negative correlation means the data constrains the mean deleterious effect (shape × scale) better than it constrains the individual shape and scale parameters.
p_beneficial vs. μ: A positive correlation would mean higher mutation rates can be partially compensated by a larger fraction of beneficial mutations, keeping the net mutational effect similar.
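
To make the product trade-off concrete, the sketch below builds synthetic samples in which the product μ × gamma_scale is tightly constrained while μ alone is not, then measures the resulting correlation. The samples are constructed for illustration (they are not posterior draws from this notebook), the spreads of 0.5 and 0.05 on the log scale are arbitrary assumptions, and pearson is a hand-written helper.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(2)
n_samples = 2000

# Assumption: mu is loosely constrained, the product mu * scale tightly.
log_mu = [rng.gauss(math.log(0.1), 0.5) for _ in range(n_samples)]
log_product = [rng.gauss(math.log(0.1 * 0.003), 0.05) for _ in range(n_samples)]
log_scale = [p - m for p, m in zip(log_product, log_mu)]

correlation = pearson(log_mu, log_scale)
spread_mu = statistics.stdev(log_mu)
spread_scale = statistics.stdev(log_scale)
spread_product = statistics.stdev(log_product)

print("corr(log mu, log scale):", correlation)
print("spread of log mu:", spread_mu)
print("spread of log scale:", spread_scale)
print("spread of log product:", spread_product)
```

The strongly negative correlation with a much smaller spread in the product than in either factor is the pattern to look for in the pairplot.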

Understanding these trade-offs is essential for honest interpretation: a well-constrained marginal posterior may hide a strong joint correlation that makes the overall inference less certain than univariate plots suggest.

3. Posterior Predictive Check

Figure 3: Posterior predictive simulations overlaid on observed data. This is the single most important validation of the inference. Twenty parameter vectors are sampled from the posterior and each is used to run a full forward simulation of the Basener-Sanford model. The resulting fitness trajectories (colored lines) should bracket the observed data (black line).

Trajectory type match: If the observed data shows declining mean fitness (meltdown regime), the posterior predictive simulations should also show decline. A mismatch — e.g., predictions showing adaptation when data shows decline — would indicate a fundamental inference failure.
Quantitative agreement: Beyond matching the direction, predictions should match the rate of fitness change, the magnitude of stochastic fluctuations, and the final fitness level at generation 200.
Spread of predictions: The width of the trajectory envelope reflects posterior uncertainty. Wider spread means more parameter uncertainty; very tight envelopes mean the parameters are well-constrained.
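
The three checks above can be quantified as follows. This is a sketch under stated assumptions: the simulator is a toy linear-decline stand-in for the Basener-Sanford model, the 20 posterior draws are synthetic values centred on the toy truth, and the metric names are hypothetical.

```python
import random


def simulate(decline_rate, rng, generations=200, noise_sd=0.01):
    """Toy stand-in simulator: noisy linear decline in mean fitness."""
    return [1.0 - decline_rate * t + rng.gauss(0.0, noise_sd)
            for t in range(generations)]


rng = random.Random(4)
observed = simulate(0.003, rng)

# Stand-in for 20 parameter vectors sampled from the posterior.
posterior_draws = [rng.gauss(0.003, 0.0003) for _ in range(20)]
predictions = [simulate(theta, rng) for theta in posterior_draws]

# Per-generation envelope spanned by the predictive trajectories.
lower = [min(values) for values in zip(*predictions)]
upper = [max(values) for values in zip(*predictions)]

# Quantitative agreement: share of observed points inside the envelope.
coverage = sum(lo <= obs <= hi
               for lo, obs, hi in zip(lower, observed, upper)) / len(observed)

# Trajectory type match: share of predictions with the observed direction.
observed_declines = observed[-1] < observed[0]
direction_match = sum((traj[-1] < traj[0]) == observed_declines
                      for traj in predictions) / len(predictions)

# Spread of predictions: envelope width at the final generation.
final_width = upper[-1] - lower[-1]

print("coverage:", coverage)
print("direction match:", direction_match)
print("final envelope width:", final_width)
```

Low coverage or a direction mismatch would flag an inference failure, while the final envelope width gives a single number for how much posterior uncertainty propagates into the predicted outcome.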

This check answers the key question: does the Basener-Sanford model, with the inferred parameters, actually reproduce the observed population dynamics? If so, the theoretical framework captures the essential biology. If not, either the model is missing important processes, or the ABC-SMC method lacks the power to find the right parameters — motivating the more powerful BSL and SBI methods explored in subsequent notebooks.

Key Biological Insights from ABC-SMC Inference

Parameter recovery validates the approach: The ability (or inability) of ABC-SMC to recover the five true parameters from synthetic data tells us whether population fitness trajectories contain enough information to distinguish between mutation-selection regimes.
The μ × E_g[s] identifiability problem is fundamental: The mutational drag term depends on the product of mutation rate and mean effect size. Disentangling these requires additional data beyond mean fitness trajectories (e.g., direct DFE measurements or mutation rate estimates).
Posterior uncertainty maps to regime uncertainty: When the posterior distributions for μ and gamma_scale are wide, the data are consistent with parameter combinations spanning both the adaptation and meltdown regimes. This is the most biologically important finding, because it tells us whether the data are sufficient to determine the population's evolutionary fate.
ABC-SMC provides a baseline: As a likelihood-free method, ABC-SMC is conceptually simple but computationally expensive, because every proposed parameter vector requires a full simulation whether or not it is accepted. The BSL and SBI methods in notebooks 03 and 04 offer potentially more efficient alternatives, and notebook 05 compares their performance.
