Model evaluation & calibration

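Model: Dixon-Coles with the dependence parameter ρ fitted by maximum likelihood (MLE), goal rates λ derived from Elo/strength ratings, and temperature-calibrated output probabilities.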
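As a rough illustration of the model family named above, the sketch below builds a Dixon-Coles scoreline distribution from two Poisson rates and the low-score parameter ρ, then sums it into home/draw/away probabilities. The rates, the ρ value, and the function names are illustrative assumptions, not the fitted values or the code behind this report.

```python
import math
from typing import Tuple


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals under a Poisson rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def dc_tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
    """Dixon-Coles adjustment; only the four low-score cells differ from 1."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def outcome_probs(lam: float, mu: float, rho: float,
                  max_goals: int = 10) -> Tuple[float, float, float]:
    """Return (home win, draw, away win) for home rate lam and away rate mu."""
    home = draw = away = 0.0
    for x in range(max_goals + 1):        # home goals
        for y in range(max_goals + 1):    # away goals
            p = (dc_tau(x, y, lam, mu, rho)
                 * poisson_pmf(x, lam) * poisson_pmf(y, mu))
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away            # renormalise after truncating the grid
    return home / total, draw / total, away / total


# Illustrative rates only; these are not fitted values from the report.
print(outcome_probs(lam=1.4, mu=1.1, rho=-0.1))
```

A negative ρ shifts probability toward 0-0 and 1-1, so the draw probability rises relative to independent Poisson scoring (ρ = 0).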

Leakage-free as-of backtest over 439 recent international matches (2025–2026). Every prediction uses only data available before that match.
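The as-of constraint can be sketched as a walk-forward loop: matches are sorted by date and each one is predicted from a history that contains strictly earlier matches only. The `Match` record, the `fit_and_predict` callback, the `warmup` argument, and the base-rate predictor are hypothetical stand-ins, not the interface of backtest_calibration.py.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Sequence, Tuple

Probs = Tuple[float, float, float]  # (home win, draw, away win)


@dataclass(frozen=True)
class Match:
    played_on: date
    home: str
    away: str
    outcome: int  # 0 = home win, 1 = draw, 2 = away win


def as_of_backtest(
    matches: Sequence[Match],
    fit_and_predict: Callable[[List[Match], Match], Probs],
    warmup: int = 0,
) -> List[Tuple[Match, Probs]]:
    """Predict each match from strictly earlier matches only."""
    ordered = sorted(matches, key=lambda m: m.played_on)
    scored = []
    for match in ordered:
        # Strict "<" keeps same-day results out of the history.
        history = [m for m in ordered if m.played_on < match.played_on]
        if len(history) < warmup:
            continue  # too little history: treat as warm-up, do not score
        scored.append((match, fit_and_predict(history, match)))
    return scored


def base_rate(history: List[Match], match: Match) -> Probs:
    """Climatology: outcome frequencies seen so far (uniform if none)."""
    if not history:
        return (1 / 3, 1 / 3, 1 / 3)
    counts = [0, 0, 0]
    for m in history:
        counts[m.outcome] += 1
    n = len(history)
    return (counts[0] / n, counts[1] / n, counts[2] / n)
```

Rebuilding the history for every match is quadratic in the number of matches, which is acceptable for a few hundred fixtures and keeps the no-leakage rule easy to audit.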

Reliability — before vs after calibration

[Figure: reliability diagram. Series: Raw (overconfident); Calibrated (T=1.718); Perfect calibration (diagonal).]

The curves use the full corpus (n=439); the metrics table below uses the leakage-free held-out test (n=102). Dot size is proportional to the bin sample count.
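Temperature scaling divides the log-probabilities by a single scalar T and renormalises; T > 1 flattens an overconfident forecast, which is the direction of the T=1.718 quoted for the calibrated series. The following is a minimal sketch of applying a temperature and of fitting one by grid search on log-loss. The helper names and the grid bounds are assumptions, and the report's actual fitting procedure may differ.

```python
import math
from typing import List, Sequence


def apply_temperature(probs: Sequence[float], T: float) -> List[float]:
    """Rescale a probability vector: softmax(log(p) / T)."""
    logits = [math.log(max(p, 1e-12)) / T for p in probs]
    peak = max(logits)                       # subtract max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_log_loss(forecasts: Sequence[Sequence[float]],
                  outcomes: Sequence[int], T: float) -> float:
    """Average negative log-probability of the observed outcome at T."""
    total = 0.0
    for probs, k in zip(forecasts, outcomes):
        total -= math.log(apply_temperature(probs, T)[k])
    return total / len(outcomes)


def fit_temperature(forecasts: Sequence[Sequence[float]],
                    outcomes: Sequence[int],
                    lo: float = 0.5, hi: float = 3.0,
                    steps: int = 251) -> float:
    """Pick the grid value of T with the lowest mean log-loss."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda T: mean_log_loss(forecasts, outcomes, T))


# A 70% favourite is pulled toward the middle at T = 1.718.
print(apply_temperature([0.70, 0.20, 0.10], 1.718))
```

With T = 1.718 the 0.70 favourite in this made-up forecast drops to roughly 0.55 while the ordering of the three outcomes is unchanged. To stay leakage-free, T would be fitted on data that precedes the matches it is scored on.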

Held-out test metrics (n=102)

Metric	Raw	Calibrated	Base-rate
RPS ↓	0.2177	0.2023	0.2150
Brier ↓	0.6406	0.6012	0.6246
Log-loss ↓	1.0824	1.0022	1.0389
ECE ↓	0.1030	0.0265	0.0000*

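Lower is better. On the held-out test the calibrated model beats the base-rate (climatology) forecast on every proper score (RPS, Brier, log-loss), whereas the raw model does not beat it on any of them. *Base-rate ECE is 0 by construction (a constant prediction is trivially calibrated but has no discrimination).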
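For reference, the three proper scores in the table can be written per match as below and then averaged over the test set. This is a sketch under assumptions: outcomes are ordered home win, draw, away win; RPS is normalised by K-1; and Brier is the summed multi-class form, which appears consistent with the magnitudes in the table. The function names are illustrative.

```python
import math
from typing import Callable, Sequence


def rps(probs: Sequence[float], outcome: int) -> float:
    """Ranked probability score over ordered outcomes, divided by K-1."""
    k = len(probs)
    cum_p = cum_o = total = 0.0
    for i in range(k - 1):
        cum_p += probs[i]
        cum_o += 1.0 if i == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (k - 1)


def brier(probs: Sequence[float], outcome: int) -> float:
    """Multi-class Brier score: squared error summed over all classes."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))


def log_loss(probs: Sequence[float], outcome: int) -> float:
    """Negative log-probability assigned to the observed outcome."""
    return -math.log(max(probs[outcome], 1e-12))


def mean_score(score: Callable[[Sequence[float], int], float],
               forecasts: Sequence[Sequence[float]],
               outcomes: Sequence[int]) -> float:
    """Average a per-match score over a set of forecasts."""
    return sum(score(p, o) for p, o in zip(forecasts, outcomes)) / len(outcomes)


uniform = (1 / 3, 1 / 3, 1 / 3)
print(rps(uniform, 1), brier(uniform, 1), log_loss(uniform, 1))
```

RPS respects the ordering of outcomes, so a forecast that leans home when the match is drawn is penalised less than one that leans home when the away side wins; Brier and log-loss treat the three outcomes as unordered.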

Calibration error (ECE): 0.1030 → 0.0265 (held-out test)
RPS vs base-rate: 0.202 vs 0.215 (beats climatology)
Backtest size: 439 (leakage-free as-of)
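The ECE figure summarises the reliability diagram in one number: forecasts are grouped into confidence bins and the gap between mean confidence and hit rate is averaged, weighted by bin size. The sketch below uses the common top-label variant with equal-width bins; the variant, the bin count, and the function names are assumptions, and the report may compute ECE differently.

```python
from typing import List, Sequence, Tuple


def reliability_bins(forecasts: Sequence[Sequence[float]],
                     outcomes: Sequence[int],
                     n_bins: int = 10) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, accuracy, count) for each non-empty bin."""
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins
    for probs, outcome in zip(forecasts, outcomes):
        conf = max(probs)                       # confidence in the top pick
        pick = list(probs).index(conf)
        b = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        counts[b] += 1
        conf_sums[b] += conf
        hits[b] += 1 if pick == outcome else 0
    return [(conf_sums[b] / counts[b], hits[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b]]


def expected_calibration_error(forecasts: Sequence[Sequence[float]],
                               outcomes: Sequence[int],
                               n_bins: int = 10) -> float:
    """Bin-size-weighted mean of |accuracy - confidence|."""
    bins = reliability_bins(forecasts, outcomes, n_bins)
    n = sum(count for _, _, count in bins)
    return sum(count / n * abs(acc - conf) for conf, acc, count in bins)
```

Each tuple from `reliability_bins` corresponds to one dot in a reliability diagram, with the count giving the dot size. Sparse bins contribute noisy gaps, which matters at n=102.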

Limitations (read honestly)

Corpus mix. Friendlies make up 41.7% of the full evaluation set and 66.7% of the held-out test, so results lean toward friendly-match behaviour.
No Nations League in eval. UEFA Nations League fixtures fell in the warm-up window, so they are not scored here.
Small test set. Held-out test is n=102; high-confidence reliability bins are sparse, so ECE there carries noise.
Residual away-favourite overconfidence. After applying a single temperature, away picks at ≥60% confidence (n=7) still averaged 66% predicted vs 43% observed (gap +23pp). This is a signal that richer calibration may help, but it was not acted on because the sample is too small.
Engine scope. This validates the core engine only. The live-only stages (player_xG_adj, situation_mult) are not covered by this fit.
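To see why seven matches are too few to act on, a confidence interval on the observed hit rate helps. The sketch below computes a 95% Wilson score interval for the away-favourite subgroup. The count of 3 hits in 7 is inferred from the rounded 43% quoted above and is an assumption, as is the choice of the Wilson interval.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = hits / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumed count: 3 of 7 away picks at >=60% were correct (about 43%).
low, high = wilson_interval(hits=3, n=7)
print(f"observed {3 / 7:.2f}, 95% interval ({low:.2f}, {high:.2f})")
```

Under these assumptions the interval runs from roughly 16% to 75%, so the 66% mean prediction is not excluded by the data, even though the point estimate sits 23pp below it.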

Live forward track

Predictions are locked before kickoff (saved up to two hours ahead, never revised) and scored prospectively as matches complete. Forward tracking of the calibrated model begins at the cutover (2026-06-17); results will populate as matches finish. The live sample is currently minimal, so no headline accuracy figure is claimed until the sample is meaningful.
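One way to enforce the lock rule described above is a write-once store that accepts a prediction only inside the pre-kickoff window and rejects any later revision. This is a hypothetical sketch: the class, its method names, and the reading of "up to two hours ahead" as a window that opens two hours before kickoff are assumptions, not the live system's code.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Tuple

Probs = Tuple[float, float, float]  # (home win, draw, away win)


class LockedPredictions:
    """Write-once prediction store keyed by a match identifier."""

    def __init__(self, window: timedelta = timedelta(hours=2)) -> None:
        self._window = window
        self._store: Dict[str, Tuple[datetime, Probs]] = {}

    def save(self, match_id: str, kickoff: datetime,
             probs: Probs, now: datetime) -> None:
        """Store a prediction once, and only inside the pre-kickoff window."""
        if match_id in self._store:
            raise ValueError("prediction already locked; revisions are refused")
        if not (kickoff - self._window <= now < kickoff):
            raise ValueError("outside the pre-kickoff save window")
        self._store[match_id] = (now, probs)

    def get(self, match_id: str) -> Probs:
        """Return the locked probabilities for scoring after the match."""
        return self._store[match_id][1]


kickoff = datetime(2026, 6, 20, 18, 0, tzinfo=timezone.utc)
store = LockedPredictions()
store.save("example-match", kickoff, (0.5, 0.3, 0.2),
           now=kickoff - timedelta(hours=1))
print(store.get("example-match"))
```

Keeping the save timestamp alongside the probabilities lets a later audit confirm that every scored prediction predates its kickoff.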

Generated from backtest at ddac4a8, 2026-06-17 · data: github.com/martj42/international_results (CC0).
Reproduce: INTL_RESULTS_CSV=<results.csv> python backtest_calibration.py --dump