Conformal prediction¶

What it is¶

The 90% confidence interval reported on every receipt is calibrated by conformal prediction against our quarterly tier-3 calibration fleet, following Angelopoulos and Bates (2021). This guarantees marginal coverage under the calibration distribution: on the calibration workload, the truth has historically fallen inside the interval at least 90% of the time.

Why conformal¶

Parametric uncertainty propagation (the Monte Carlo step) produces an interval, but its honesty depends on the parametric assumptions being correct. If our log-normal priors are slightly off, or if our hierarchical Bayesian posteriors are mis-calibrated, the parametric 90% interval could cover at 85% or 95% rather than 90%.

Conformal prediction addresses this. Because the interval is calibrated on held-out tier-3 data, it is guaranteed to cover at least at the stated level, marginally, under a mild exchangeability assumption, regardless of whether the underlying parametric model is correct.
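A small simulation can make this concrete. The sketch below is illustrative only and does not use our calibration data: it assumes a hypothetical parametric model that reports median 0 and sigma 1 with Gaussian errors, while the true errors are heavy-tailed (Student-t with 3 degrees of freedom). The function name `simulate_coverage` and all parameter values are assumptions made for the illustration; the conformal multiplier is computed with the split-conformal recipe described in the procedure section.

```python
import math
import numpy as np

ALPHA = 0.10    # target miscoverage (90% interval)
Z_90 = 1.645    # Gaussian multiplier for a two-sided 90% interval

def simulate_coverage(n_cal=2000, n_test=20000, seed=0):
    """Compare parametric and conformal coverage under a misspecified model."""
    rng = np.random.default_rng(seed)
    # Normalised absolute residuals |truth - median| / sigma.
    # Assumption: the model claims Gaussian errors, the truth is Student-t(3).
    cal_scores = np.abs(rng.standard_t(df=3, size=n_cal))
    test_scores = np.abs(rng.standard_t(df=3, size=n_test))

    # Coverage of the parametric interval median ± 1.645 * sigma.
    parametric_cov = float(np.mean(test_scores <= Z_90))

    # Conformal multiplier: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = math.ceil((n_cal + 1) * (1 - ALPHA))
    q = float(np.sort(cal_scores)[k - 1])

    # Coverage of the conformal interval median ± q * sigma.
    conformal_cov = float(np.mean(test_scores <= q))
    return parametric_cov, q, conformal_cov

parametric_cov, q, conformal_cov = simulate_coverage()
print(f"parametric: {parametric_cov:.3f}  q: {q:.2f}  conformal: {conformal_cov:.3f}")
```

In this setup the parametric interval under-covers (roughly 80% rather than 90%), while the conformal interval lands close to 90% because its multiplier is read off the held-out residuals rather than off the assumed Gaussian.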

Procedure (split conformal)¶

Calibration set: quarterly tier-3 measurements on a representative workload (model × region × prompt-length × batch-size grid)
Conformity score: for each calibration point, compute |truth − parametric_median| / parametric_sigma
Quantile: take q as the ⌈(1−α)(n+1)⌉/n empirical quantile of the conformity scores (that is, the ⌈(1−α)(n+1)⌉-th smallest score), where α = 0.10 and n is the calibration set size
Apply: for a new query, the 90% conformal interval is parametric_median ± q × parametric_sigma
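The four steps above can be sketched as follows. This is a minimal illustration, not the contents of methodology/uncertainty/conformal.py: the function names (`conformal_quantile`, `calibrate`, `conformal_interval`) are hypothetical, and the quantile uses the standard finite-sample rank ⌈(n+1)(1−α)⌉ from Angelopoulos and Bates (2021).

```python
import math
import numpy as np

def conformal_quantile(scores, alpha=0.10):
    """Return the ceil((n + 1)(1 - alpha))-th smallest conformity score."""
    scores = np.sort(np.asarray(scores, dtype=float))
    n = scores.size
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite multiplier
        # can carry the guarantee, so the interval is unbounded.
        return math.inf
    return float(scores[k - 1])

def calibrate(truth, median, sigma, alpha=0.10):
    """Compute normalised absolute residuals and their conformal quantile."""
    truth, median, sigma = (np.asarray(a, dtype=float) for a in (truth, median, sigma))
    scores = np.abs(truth - median) / sigma
    return conformal_quantile(scores, alpha)

def conformal_interval(median, sigma, q):
    """Conformal interval for a new query: median ± q * sigma."""
    return median - q * sigma, median + q * sigma

# Usage with made-up numbers (three calibration points, alpha = 0.5).
q = calibrate(truth=[10.0, 12.0, 9.0], median=[10.0, 10.0, 10.0],
              sigma=[1.0, 1.0, 2.0], alpha=0.5)
low, high = conformal_interval(median=100.0, sigma=4.0, q=q)
```

Scaling the residual by parametric_sigma means the interval width still adapts per query through the parametric model; conformal calibration only replaces the multiplier.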

Practical example¶

Suppose for the Mistral Medium 3 family on Scaleway PAR-1:

Calibration set size: n = 240 (one year of weekly tier-3 runs across the workload grid)
Empirical 90% quantile of normalised residuals: q = 1.42 (rather than the parametric 1.645 for a Gaussian)

Because q = 1.42 is smaller than the Gaussian 1.645, the parametric interval in this illustration would have over-covered slightly, and the conformal correction narrows it by a factor of 1.42/1.645 ≈ 0.86. This illustration is not typical of our data: historically, calibration has shown the conformal q to be slightly larger than the Gaussian value, meaning the parametric interval was too narrow and the conformal correction widens it rather than narrows it.
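The arithmetic behind this example can be checked with a short sketch. It assumes the usual finite-sample rank ⌈(n+1)(1−α)⌉ for the quantile; the receipt values `median` and `sigma` are hypothetical numbers in arbitrary units, chosen only to show the resulting widths.

```python
import math

n, alpha = 240, 0.10
k = math.ceil((n + 1) * (1 - alpha))   # rank of the score used as q: 217
level = k / n                          # effective empirical quantile level, about 0.904

q, z_90 = 1.42, 1.645
ratio = q / z_90                       # about 0.863: values below 1 narrow the interval

# Hypothetical receipt (not real data): median and sigma in arbitrary units.
median, sigma = 100.0, 5.0
conformal = (median - q * sigma, median + q * sigma)
parametric = (median - z_90 * sigma, median + z_90 * sigma)
print(k, round(level, 4), round(ratio, 3), conformal, parametric)
```

A ratio above 1 would indicate the opposite case, where the parametric interval was too narrow and the conformal interval is wider.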

The actual reported intervals on receipts are produced by this calibration end-to-end.

What conformal does and does not guarantee¶

Guarantees:

Marginal coverage at the stated level (90%) under exchangeability of calibration and test data
Distribution-free: no assumption about the underlying parametric model being correct
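Per-query in the marginal sense: the interval on each individual receipt covers the truth with probability at least 90%, where the probability is taken over the random draw of both the calibration set and the query. This is not a guarantee conditional on the particular query (see below)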

Does not guarantee:

Conditional coverage at the stated level for arbitrary subsets of the data: the conformal interval covers 90% on average, but a subset (e.g., very long prompts) could cover at only 85%. We mitigate this with group-conditional (Mondrian) conformal prediction stratified by prompt-length tertile and batch-size tertile, which restores the guarantee within each stratum but not for subsets defined in other ways.
Coverage on data drawn from a substantially different distribution than the calibration set. If a customer runs an unusual workload (very long context, unusual prompt language, atypical concurrency pattern), the calibration may not transfer cleanly. We monitor this via Sobol sensitivity analysis and flag drift.
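The stratified calibration mentioned above can be sketched as follows. This is an assumption-laden illustration rather than our production code: the function names (`tertile_edges`, `tertile_index`, `fit_stratified_quantiles`) are hypothetical, tertile cut points are taken from the calibration set itself, and one conformal quantile is fitted per (prompt-length tertile, batch-size tertile) cell.

```python
import math
from collections import defaultdict
import numpy as np

def tertile_edges(values):
    """Return the two cut points that split the values into tertiles."""
    return np.quantile(np.asarray(values, dtype=float), [1 / 3, 2 / 3])

def tertile_index(value, edges):
    """Map a value to tertile 0, 1, or 2."""
    return int(np.searchsorted(edges, value, side="right"))

def fit_stratified_quantiles(scores, prompt_len, batch_size, alpha=0.10):
    """Fit one conformal quantile per (prompt tertile, batch tertile) cell."""
    p_edges = tertile_edges(prompt_len)
    b_edges = tertile_edges(batch_size)
    cells = defaultdict(list)
    for s, p, b in zip(scores, prompt_len, batch_size):
        cells[(tertile_index(p, p_edges), tertile_index(b, b_edges))].append(s)
    q_by_cell = {}
    for cell, vals in cells.items():
        vals = sorted(vals)
        k = math.ceil((len(vals) + 1) * (1 - alpha))
        # A cell with too few points cannot support a finite quantile.
        q_by_cell[cell] = vals[k - 1] if k <= len(vals) else math.inf
    return p_edges, b_edges, q_by_cell

def lookup_q(prompt_len, batch_size, p_edges, b_edges, q_by_cell):
    """Pick the multiplier for a new query from its cell (inf if cell unseen)."""
    cell = (tertile_index(prompt_len, p_edges), tertile_index(batch_size, b_edges))
    return q_by_cell.get(cell, math.inf)
```

The trade-off is sample size: with nine cells, a calibration set of n = 240 leaves about 27 points per cell if the cells are evenly populated, so each per-cell quantile is noisier than the pooled one.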

Re-calibration cadence¶

The conformal quantile is re-fit quarterly. The fit is part of the quarterly methodology changelog and is reviewed by the Methodology Council.

Where this is implemented¶

methodology/uncertainty/conformal.py

Citations¶

Angelopoulos, A. N., & Bates, S. (2021). A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification. arXiv:2107.07511.
Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.