From Bayesian Foundations to Neural Density Estimation
School of Earth & Atmospheric Sciences — Georgia Institute of Technology
2026-03-02

Felix J. Herrmann
School of Computational Science and Engineering — Georgia Institute of Technology
Slides build on Louppe (2023a), Deistler et al. (2025), and Arruda et al. (2025)

SBI provides amortized prior-to-posterior mappings that naturally integrate into sequential Bayesian data assimilation workflows:
Tip
SBI enables scalable, amortized Bayesian updates for high-dimensional, nonlinear dynamical systems where traditional filtering methods (EnKF, particle filters) struggle.
Consider a simple simulator: a ball is thrown at angle \(\btheta\) and lands at distance \(\bx\) (Deistler et al. 2025).
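A minimal version of this simulator can be sketched in NumPy. The launch speed, gravity constant, and 5% multiplicative noise level below are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, rng):
    """Ball-throw simulator: launch angle theta (radians) -> landing distance x.

    Projectile range with a fixed launch speed plus multiplicative observation
    noise. Speed, gravity, and the 5% noise level are illustrative assumptions.
    """
    v0, g = 10.0, 9.81
    distance = v0**2 * np.sin(2.0 * theta) / g
    return distance * (1.0 + 0.05 * rng.standard_normal())

theta = rng.uniform(0.0, np.pi / 2)   # prior draw: theta ~ U(0, 90 degrees)
x = simulator(theta, rng)             # x ~ p(x | theta), one stochastic run
```

Running the simulator many times with the same \(\btheta\) traces out the implicit likelihood; running it with fresh prior draws traces out the joint.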

Why quantify uncertainty?
Two fundamental types:
In the SBI context:
Note
A well-calibrated posterior should be neither overconfident nor underconfident—this is exactly what SBI diagnostics will verify.
The ball-throw example illustrates the core SBI idea (Deistler et al. 2025):
Important
What the network learns: Unlike a point estimator that predicts a single \(\hat{\btheta}\), the inference network learns to output a full probability distribution over \(\btheta\) conditioned on each input \(\bx\). It captures multi-modality, correlations, and uncertainty.
Amortization: Once trained, the same network can be applied to any new observation without retraining or additional simulations. This is fundamentally different from classical methods (MCMC, ABC) that must restart inference for each new \(\xo\).
A simulator is a (possibly stochastic) program that generates synthetic data:
\[\btheta \sim p(\btheta), \qquad \bx \sim p(\bx \mid \btheta) \quad\text{or equivalently}\quad \bx = \texttt{Sim}(\btheta, \texttt{rng})\]
Running the simulator with fixed \(\btheta\) and varying noise \(\bzeta\) is equivalent to sampling from an implicit likelihood (Radev et al. 2023; Cranmer, Brehmer, and Louppe 2020):
\[\bx \sim p(\bx \mid \btheta) \;\Longleftrightarrow\; \bx = G(\btheta, \bzeta) \quad\text{with}\quad \bzeta \sim p(\bzeta)\]
In principle, this likelihood can be obtained by marginalizing the joint \(p(\bzeta, \bx \mid \btheta)\) over all possible execution trajectories of the simulation program, but this is typically intractable (Radev et al. 2023). The posterior follows from the joint:
\[p(\btheta \mid \bx) \propto p(\btheta)\!\int_{\Xi} p(\bzeta, \bx \mid \btheta)\,d\bzeta\]
which is singly intractable (one integral over the noise space \(\Xi\)).
The prior predictive distribution (marginal likelihood):
\[p(\bx) = \int_{\Theta}\!\int_{\Xi} p(\btheta)\,p(\bzeta, \bx \mid \btheta)\,d\bzeta\,d\btheta\]
is doubly intractable—both integrals are difficult to approximate (Radev et al. 2023).
Important
The likelihood \(p(\bx \mid \btheta)\) is intractable—we can sample from it but cannot evaluate it.
| | Likelihood-Based | Simulation-Based |
|---|---|---|
| Joint model \(p(\btheta, \bx)\) | \(\btheta \sim p(\btheta),\; \bx \sim p(\bx \mid \btheta)\) | \(\btheta \sim p(\btheta),\; \bx = \texttt{Sim}(\btheta, \texttt{rng})\) |
| Prior \(p(\btheta)\) | Can be sampled and evaluated | Can be sampled, optionally evaluated |
| Data model \(p(\bx \mid \btheta)\) | Can be sampled and evaluated | Can be sampled but not evaluated |
| Posterior \(p(\btheta \mid \bx)\) | (Usually) intractable | Doubly intractable |
“Doubly intractable” = neither the likelihood nor the normalizing constant is available.
Bayes’ rule:
\[p(\btheta \mid \xo) = \frac{p(\xo \mid \btheta)\, p(\btheta)}{p(\xo)}\]
The posterior provides a complete description of parameter uncertainty: it captures not just the best estimate, but all plausible parameter configurations, their dependencies, and multi-modality.
Evidence integral: Computing \(p(\xo) = \int p(\xo \mid \btheta)\,p(\btheta)\,d\btheta\) requires integrating over the entire parameter space—intractable in high dimensions.
Intractable likelihood: For simulation-based models, even evaluating \(p(\xo \mid \btheta)\) for a single \(\btheta\) is impossible.
Classical methods fail: Conjugate priors, standard variational inference, or MCMC with known likelihoods do not apply.
Note
This motivates likelihood-free / simulation-based inference.
Algorithm:
Issues:

MCMC for simulators:
Warning
For expensive, nonlinear forward models with high-dim parameters/data, ABC and MCMC become impractical.
Problem archetypes (Arruda et al. 2025, Table 2)
| Budget | \(\dim(\btheta)\) | \(\dim(\bx)\) | Example |
|---|---|---|---|
| High | Low | Low | SBI benchmarks |
| Low | Low | Low | \(N\)-body sims |
| High | Low | High | Single-molecule |
| Low | Low | High | Galaxy sims |
| High | High | Low | 2D Darcy flow |
| Low | High | Low | Groundwater |
| High | High | High | Bayesian NNs |
| Low | High | High | Ultrasound |
Deistler et al. (2025), Fig. 3
Five-step workflow (Deistler et al. 2025):
All simulations done offline & in parallel; training decoupled from simulation; inference (after training) is fast.
The simulator:
The prior \(p(\btheta)\):
Warning
Critical pitfall: Model misspecification—when \(\xo\) falls outside the range of what the model (simulator + prior) can generate. SBI can produce spurious posteriors in this case.
Data representation:
Three core inference methods:
| Method | Target | Output | Inference | i.i.d. data |
|---|---|---|---|---|
| NPE | Posterior \(p(\btheta \mid \bx)\) | Direct samples | Fast (ms) | Difficult |
| NLE | Likelihood \(p(\bx \mid \btheta)\) | Requires MCMC/VI | Slow (sec–min) | Efficient |
| NRE | Ratio \(p(\bx \mid \btheta)/p(\bx)\) | Requires MCMC/VI | Slow (sec–min) | Efficient |
Network choices: normalizing flows (Papamakarios et al. 2021), diffusion models (Song et al. 2021), flow matching (Lipman et al. 2023; Wildberger et al. 2023). Details in next Lecture.
Goal: Learn \(q_\phi(\btheta \mid \bx) \approx p(\btheta \mid \bx)\)
Minimize the expected forward KL divergence:
\[\begin{aligned} &\min_{q_\phi} \; \E_{p(\bx)}\Big[\KL\big(p(\btheta \mid \bx)\,\|\,q_\phi(\btheta \mid \bx)\big)\Big] \\[4pt] &= \min_{q_\phi}\; \E_{p(\bx)}\,\E_{p(\btheta \mid \bx)}\left[\log \frac{p(\btheta \mid \bx)}{q_\phi(\btheta \mid \bx)}\right] \\[4pt] &= \max_{q_\phi}\; \E_{p(\bx,\btheta)}\big[\log q_\phi(\btheta \mid \bx)\big] + \text{const.} \end{aligned}\]
Training loss:
\[\boxed{\mathcal{L}(\phi) = \E_{(\btheta,\bx)\sim p(\btheta,\bx)}\big[-\log q_\phi(\btheta \mid \bx)\big]}\]
Tip
This is maximum likelihood under the joint \(p(\bx,\btheta)\)! We only need samples from the joint—exactly what the simulator provides.
Generating simulations (Step 3 of the workflow):
\[\text{For } i = 1,\ldots,N:\quad \btheta_i \sim p(\btheta), \quad \bx_i \sim p(\bx \mid \btheta_i)\]
\[\text{Dataset: } \{(\btheta_i, \bx_i)\}_{i=1}^N \sim p(\btheta, \bx)\]
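The two sampling lines above can be written as a toy loop; the uniform prior and Gaussian simulator here are placeholder assumptions standing in for a real model:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

def prior_sample(rng):
    # theta ~ p(theta); a uniform prior chosen purely for illustration
    return rng.uniform(-1.0, 1.0)

def simulator(theta, rng):
    # x ~ p(x | theta); a Gaussian toy simulator standing in for Sim(theta, rng)
    return theta + 0.1 * rng.standard_normal()

# Step 3 of the workflow: N independent draws from the joint p(theta, x).
# Each pair is independent, so this loop is embarrassingly parallel.
thetas = np.array([prior_sample(rng) for _ in range(N)])
xs = np.array([simulator(t, rng) for t in thetas])
dataset = list(zip(thetas, xs))
```

Because the pairs are i.i.d., the simulation budget can be spent offline on a cluster before any training begins.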
Training: Minimize \(\mathcal{L}(\phi)\) with standard deep learning (SGD, early stopping, checkpointing).
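To see why minimizing \(\mathcal{L}(\phi)\) recovers the posterior, consider a linear-Gaussian toy joint whose posterior is known in closed form, with \(q_\phi\) restricted to a Gaussian with linear mean so the NPE loss reduces to least squares. All model choices here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy joint (illustrative): theta ~ N(0, 1), x = theta + 0.5 * noise.
# The exact posterior is Gaussian: p(theta | x) = N(0.8 x, 0.2).
N = 50_000
theta = rng.standard_normal(N)
x = theta + 0.5 * rng.standard_normal(N)

# Simplest possible "inference network": q_phi(theta | x) = N(a*x + b, s2).
# Minimizing L(phi) = E[-log q_phi(theta | x)] over (a, b, s2) on joint
# samples reduces to ordinary least squares plus a residual-variance fit.
A = np.column_stack([x, np.ones(N)])
(a, b), *_ = np.linalg.lstsq(A, theta, rcond=None)
s2 = np.mean((theta - (a * x + b)) ** 2)
# With enough samples, (a, b, s2) -> (0.8, 0, 0.2): the forward-KL loss
# recovers the true posterior without ever evaluating the likelihood.
```

A normalizing flow plays the same role as this Gaussian family, but with enough capacity to capture multi-modality and correlations.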
At inference time (for NPE):
Amortization: Train once on the full prior, apply to any observation without retraining.
Cost structure:
Compare to non-amortized methods:
Applications enabled:
Non-amortized (observation-specific) inference optimizes \(q_\phi\) for a single \(\xo\) using the reverse KL:
\[\min_{q_\phi}\;\KL\big(q_\phi(\btheta)\,\|\,p(\btheta \mid \xo)\big) = \min_{q_\phi}\;\E_{q_\phi(\btheta)}\!\left[\log \frac{q_\phi(\btheta)}{p(\btheta \mid \xo)}\right]\]
Amortized (forward KL):
Non-amortized (reverse KL):
Tip
The amortization gap is the price paid for generalization: an amortized posterior \(q_\phi(\btheta \mid \bx)\) that works well across all \(\bx\) may be slightly worse for any particular \(\xo\) than a posterior optimized specifically for that \(\xo\).
How do we know whether a learned posterior \(q_\phi(\btheta \mid \bx)\) is trustworthy?
Coverage diagnostics test whether the posterior’s stated uncertainty matches reality:
Why expensive: Each test requires a full posterior evaluation. Reliable calibration checks need \(10^3\)–\(10^4\) test pairs. Without amortization, each test pair demands a separate MCMC run—making diagnostics prohibitively costly. With amortized NPE, all \(10^4\) posteriors are obtained in seconds.
Note
Coverage diagnostics are treated in detail later in this lecture. The key point: amortization is not just convenient—it is essential for validation.
Simulators often produce high-dimensional outputs \(\bx \in \R^d\) (\(d \gg 1\)), e.g., large images, long time series, or volumetric fields. Dimensionality reduction via summary statistics \(\mathcal{S}: \R^d \to \R^r\) (\(r \ll d\)) can help:
Goal: The summary statistic should preserve information about \(\btheta\):
\[p(\btheta \mid \mathcal{S}(\bx)) \approx p(\btheta \mid \bx)\]
A sufficient statistic achieves equality, but for complex simulators sufficient statistics are generally unknown.
Approaches: hand-crafted domain features (e.g., autocorrelation, power spectra), physics-based summaries [e.g., migrated images in geophysics; Orozco et al. (2024)], or learned summaries via embedding networks.
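As a concrete illustration, a hand-crafted summary for a 1-D time series might stack the mean, standard deviation, and lag-1 autocorrelation. These particular features are an assumption for the sketch, not a recommendation:

```python
import numpy as np

def summary(x):
    """Hand-crafted summary S: R^d -> R^3 for a 1-D time series.

    Stacks mean, standard deviation, and lag-1 autocorrelation. These
    features are illustrative; they are generally not sufficient statistics.
    """
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    ac1 = (xc[:-1] @ xc[1:]) / (xc @ xc)   # lag-1 autocorrelation
    return np.array([x.mean(), x.std(), ac1])

rng = np.random.default_rng(3)
series = np.sin(np.linspace(0.0, 10.0, 500)) + 0.1 * rng.standard_normal(500)
s = summary(series)   # 500-dim series compressed to 3 numbers
```

Whether three numbers preserve enough information about \(\btheta\) is exactly the question the approximation \(p(\btheta \mid \mathcal{S}(\bx)) \approx p(\btheta \mid \bx)\) formalizes.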
The parameter-to-observation map can be complex and depends on acquisition geometry.
Physics-based summary statistics compress high-dimensional data while preserving information about model parameters (Orozco, Louboutin, and Herrmann 2023)
Observations: \(\by\)

Summary: \(\bar{\by}\)

Adjoint-based summary statistics leverage the physics of the forward model to compute informative data representations at low cost (Orozco, Louboutin, and Herrmann 2023).
Embedding networks are feed-forward neural networks that learn to compress high-dimensional data into informative low-dimensional summaries. Unlike invertible or generative networks, they are standard multi-layer architectures optimized purely for compression.

Warning
Training large embedding networks requires many simulations. Powerful architectures (e.g., Transformers) may be impractical with limited simulation budgets.
Normalizing flows define \(q_\phi(\btheta \mid \bx)\) through a chain of \(K\) invertible transformations:
Training objective (change of variables): \[\log q_\phi(\btheta \mid \bx) = \log p_0(\bz_0) - \sum_{k=1}^K \log\left|\det \frac{\partial f_k}{\partial \bz_{k-1}}\right|\]
End-to-end with learned summaries: Replace raw data \(\bx\) with \(\mathcal{S}_\lambda(\bx)\): \[\mathcal{L}(\phi, \lambda) = \mathbb{E}_{(\btheta,\bx)\sim p(\btheta,\bx)}\big[-\log q_\phi(\btheta \mid \mathcal{S}_\lambda(\bx))\big]\]
The embedding network \(\mathcal{S}_\lambda\) (feed-forward) and the normalizing flow \(q_\phi\) (invertible) are trained jointly.
Sampling (fast, feed-forward): compute \(\mathbf{s} = \mathcal{S}_\lambda(\xo)\), draw \(\bz_0 \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\), apply \(\btheta = f_K \circ \cdots \circ f_1(\bz_0;\, \mathbf{s})\). No MCMC needed.
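The change-of-variables formula can be verified numerically for a single scalar affine layer \(f(z) = az + b\) with a standard-normal base, i.e., a deliberately trivial one-layer flow:

```python
import numpy as np

# One scalar affine "flow" layer f(z) = a*z + b with base p0 = N(0, 1),
# i.e. K = 1 in the change-of-variables formula. Then q = N(b, a^2).
a, b = 2.0, 0.5

def log_q(theta):
    z0 = (theta - b) / a                              # invert f
    log_p0 = -0.5 * (z0**2 + np.log(2.0 * np.pi))     # base log-density
    log_det = np.log(abs(a))                          # log |det df/dz|
    return log_p0 - log_det

# Cross-check against the closed-form N(b, a^2) log-density
theta = 1.3
ref = -0.5 * ((theta - b) / a) ** 2 - np.log(a) - 0.5 * np.log(2.0 * np.pi)
```

A real flow stacks many such layers, each conditioned on the summary \(\mathbf{s}\), but the log-determinant bookkeeping is identical.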

Definition: A model is well-specified if \(\xo\) could have been generated by the simulator:
\[\exists\;\btheta^* \text{ such that } p(\btheta^*, \xo) > 0\]
i.e., \(\xo\) falls within the support of simulated data.
Consequences of misspecification:
Detection strategies:
Both checks can be performed before training any inference network.

Procedure:
Good: Posterior predictives resemble \(\xo\)
Bad: Systematic differences \(\Rightarrow\) problems
Limitations:
Challenges for large problems:
Goal: Verify that uncertainty estimates are statistically calibrated.
Two families of tests (Hermans et al. 2022; Talts et al. 2018):
All coverage methods require a calibration set of \(N\) pairs \((\btheta^*, \bx) \sim p(\btheta, \bx)\) plus inference for each pair.
Failure modes: Below diagonal \(\to\) overconfident; Above \(\to\) underconfident; Curved \(\to\) biased.
SBC (Talts et al. 2018) checks whether the posterior marginals are correctly calibrated:
Algorithm:
Interpreting rank histograms:
Computational cost: Requires \(N \times L\) posterior samples. With amortized NPE this takes seconds; with NLE/NRE each of the \(N\) test cases needs a separate MCMC run.
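A compact SBC sketch, using a linear-Gaussian toy model whose exact posterior stands in for an amortized \(q_\phi\) (all distributions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# SBC sketch on a linear-Gaussian toy: theta ~ N(0,1), x = theta + 0.5*eps.
# The exact posterior N(0.8 x, 0.2) plays the role of an amortized q_phi,
# so the rank histogram should come out uniform.
N, L = 1_000, 50
ranks = np.empty(N, dtype=int)
for i in range(N):
    theta_star = rng.standard_normal()                 # theta* ~ p(theta)
    x = theta_star + 0.5 * rng.standard_normal()       # x ~ p(x | theta*)
    post = 0.8 * x + np.sqrt(0.2) * rng.standard_normal(L)  # L posterior draws
    ranks[i] = int(np.sum(post < theta_star))          # rank of theta* in draws

hist, _ = np.histogram(ranks, bins=L + 1, range=(-0.5, L + 0.5))
# Uniform ranks over {0, ..., L} indicate calibrated marginals
```

Replacing the exact posterior with a miscalibrated one (e.g., halving the posterior std) produces the U-shaped histogram characteristic of overconfidence.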
Expected Coverage [HPD = Highest Posterior Density; Hermans et al. (2022)]:
TARP [Tests of Accuracy with Random Points; Lemos et al. (2023)]:
Note
All global diagnostics require \(10^3\)–\(10^4\) calibration simulations. Amortization is essential—without it, each calibration pair requires a full MCMC run.
Coverage diagnostics (SBC, Expected Coverage, TARP) and uncertainty-vs-error analysis test different aspects of calibration:
Coverage plots are marginal diagnostics
Uncertainty vs. error is a conditional diagnostic
Formally: Conditional calibration requires \[\text{Var}_q[\btheta \mid \bx] = \mathbb{E}\!\left[(\btheta - \mathbb{E}_q[\btheta \mid \bx])^2 \,\big|\, \bx\right] \quad \forall\; \bx\]
Marginalizing over \(\bx\) gives correct coverage, but correct coverage does not guarantee this holds pointwise.
Hierarchy of calibration
| Diagnostic | Type | What it tests |
|---|---|---|
| SBC rank histograms | Marginal | Per-parameter marginals |
| Expected Coverage / TARP | Marginal | Joint posterior (global) |
| Uncertainty vs. error | Conditional | Local calibration per \(\bx\) |
Conditional calibration \(\Rightarrow\) marginal calibration, but not vice versa.
Example of marginal-but-not-conditional calibration: A posterior \(q(\btheta \mid \bx)\) that is too narrow for noisy observations and too wide for clean observations can still lie on the diagonal on average.
Practical recommendations
Tip
Coverage diagnostics tell you whether your posterior is globally trustworthy. Uncertainty-vs-error analysis tells you whether it is locally trustworthy. Use both.
For high-dimensional inference problems (e.g., imaging), standard SBC/TARP diagnostics are impractical because parameter spaces have millions of dimensions. Alternative UQ metrics exist that assess uncertainty pixelwise (Orozco et al. 2024):
1. \(z\)-score (error vs. uncertainty warning):
\[z\text{-score} = \frac{|\btheta^* - \btheta_{\text{mean}}|}{\btheta_{\text{std}}}\]
Pixels where error \(> 2\times\) uncertainty are flagged. A low percentage of flagged pixels indicates good UQ.
2. Uncertainty Calibration Error (UCE): Bins pixels by error magnitude and checks whether uncertainty tracks error. Lower UCE = better calibrated.
3. Posterior coverage percentage: Fraction of pixels where the ground truth falls within the \([1\%, 99\%]\) percentile range of posterior samples.
4. Data residual check: Pass posterior samples through the forward operator and check fit to observed data—does not require ground truth.
Warning
Metrics 1–3 require access to ground truth, limiting them to synthetic validation. Metric 4 (data residual) works on field data but is necessary, not sufficient.
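Metric 1 can be sketched pixelwise in NumPy; the synthetic "ground truth", posterior mean, and posterior std below are placeholders, not outputs of a real inference run:

```python
import numpy as np

rng = np.random.default_rng(5)

# Pixelwise z-score check (metric 1) on synthetic stand-ins: a "ground truth"
# model, a posterior mean with small errors, and a stated posterior std.
truth = rng.standard_normal((64, 64))
post_mean = truth + 0.1 * rng.standard_normal((64, 64))  # actual error ~ N(0, 0.1)
post_std = np.full((64, 64), 0.1)                        # claimed uncertainty

z = np.abs(truth - post_mean) / post_std
flagged = float(np.mean(z > 2.0))  # fraction of pixels with error > 2x uncertainty
# For well-calibrated UQ this fraction stays small (about 5% for Gaussian errors)
```

Inflating the errors while keeping the claimed std fixed drives the flagged fraction up, which is precisely the overconfidence warning the metric is designed to raise.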
Orozco et al. (2024) apply SBI with conditional diffusion networks to seismic velocity model building:


Calibration (UCE): the error-uncertainty correlation should follow the diagonal. CIGs improve calibration over RTMs.

Posterior coverage traces: posterior samples (colored) should bracket the ground truth (black). Coverage with CIGs: 90.6% vs. RTMs: 82.1%.

Visualization:
Marginal moments (cheap via Monte Carlo):
\[\E_{p(\btheta \mid \xo)}[f(\btheta)] = \int p(\btheta \mid \xo)\,f(\btheta)\,d\btheta \;\approx\; \frac{1}{N}\sum_{i=1}^N f(\btheta_i)\]
Conditional moments—reveal parameter co-tuning:
\[\E_{p(\theta_i \mid \btheta_{/i},\xo)}[f(\theta_i)] = \int p(\theta_i \mid \btheta_{/i},\xo)\,f(\theta_i)\,d\theta_i\]
MAP estimation: \(\hat{\btheta}_{\text{MAP}} = \arg\max_{\btheta} \log q_\phi(\btheta \mid \xo)\) via gradient optimization.
Bayesian decision theory—choose action \(a\) minimizing expected cost:
\[\ell(\xo, a) = \int p(\btheta \mid \xo)\,c(\btheta, a)\,d\btheta\]
Sensitivity analysis: Which data features constrain which parameters?
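The marginal-moment estimator above can be sketched directly on amortized posterior samples; the Gaussian draws below stand in for \(\btheta_i \sim q_\phi(\btheta \mid \xo)\), with mean 2.0 and std 0.5 as arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(6)

# Monte Carlo moments from amortized posterior samples. The Gaussian draws
# stand in for theta_i ~ q_phi(theta | x_o) from a trained network.
samples = 2.0 + 0.5 * rng.standard_normal(100_000)

post_mean = samples.mean()                   # f(theta) = theta
post_var = samples.var()                     # spread of plausible parameters
prob_above = float(np.mean(samples > 2.5))   # f = indicator, P(theta > 2.5 | x_o)
```

Any expectation under the posterior, including the decision-theoretic cost \(\ell(\xo, a)\), reduces to the same sample average.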
SBI strengths:
Challenges:
Recommended starting point: NPE + normalizing flows + full diagnostic pipeline
Software:
sbi Python toolbox (Boelts et al. 2024): https://github.com/sbi-dev/sbi
Practical guide code (reproduces all examples from Deistler et al. (2025)):
SBI applications database (interactive explorer of real-world applications):
Today we treated \(q_\phi\) as a black box. Next time we open the box.
Topics for Lecture 2:
This lecture was prepared with the assistance of Claude (Anthropic) and validated by Felix J. Herrmann.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
