Leakage-free
As-of backtest over 439 recent international matches (2025–2026). Every prediction uses only data available before that match.
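As a minimal sketch of what "as-of" means in practice, the loop below predicts each match from strictly earlier matches only. The `Match` record, the `as_of_backtest` helper and the toy `base_rate_model` are hypothetical names for illustration, not the engine's actual code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Sequence, Tuple

Probs = Tuple[float, float, float]  # (home win, draw, away win)


@dataclass(frozen=True)
class Match:
    played_on: date
    home: str
    away: str
    outcome: int  # 0 = home win, 1 = draw, 2 = away win


def as_of_backtest(
    matches: Sequence[Match],
    predict: Callable[[List[Match], str, str], Probs],
) -> List[Tuple[Match, Probs]]:
    """Predict every match from matches played strictly before its date."""
    ordered = sorted(matches, key=lambda m: m.played_on)
    scored = []
    for target in ordered:
        # Strict "<" keeps same-day results out of the history as well.
        history = [m for m in ordered if m.played_on < target.played_on]
        # The model sees only the fixture (team names), never the outcome.
        scored.append((target, predict(history, target.home, target.away)))
    return scored


def base_rate_model(history: List[Match], home: str, away: str) -> Probs:
    """Toy stand-in model: past outcome frequencies with add-one smoothing."""
    counts = [1, 1, 1]
    for m in history:
        counts[m.outcome] += 1
    total = sum(counts)
    return (counts[0] / total, counts[1] / total, counts[2] / total)
```

Passing the model only the fixture, not the full `Match`, makes it harder to leak the result by accident.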
Figure: Reliability, before vs after calibration. Series: Raw (overconfident), Calibrated (T=1.718), Perfect calibration.
The curve uses the full corpus (n=439). The metrics table below uses the leakage-free held-out test (n=102). Dot size ∝ bin sample count.
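To show what a single temperature does to a forecast, here is a minimal sketch of temperature scaling applied to a three-way probability vector. It assumes the engine exposes probabilities rather than logits, so it divides log-probabilities by T and renormalises. The function name `temperature_scale` is illustrative only.

```python
import math
from typing import List, Sequence


def temperature_scale(probs: Sequence[float], temperature: float) -> List[float]:
    """Soften (T > 1) or sharpen (T < 1) a probability vector."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    eps = 1e-12  # guards against log(0)
    logits = [math.log(max(p, eps)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


raw = [0.70, 0.20, 0.10]                    # (home, draw, away), made-up example
calibrated = temperature_scale(raw, 1.718)  # T taken from the figure legend
```

With T=1.718 the 70% favourite in this made-up example drops to roughly 55%. The ranking of outcomes is unchanged; only the confidence shrinks.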
Held-out test metrics (n=102)

Metric        Raw      Calibrated   Base-rate
RPS ↓         0.2177   0.2023       0.2150
Brier ↓       0.6406   0.6012       0.6246
Log-loss ↓    1.0824   1.0022       1.0389
ECE ↓         0.1030   0.0265       0.0000*
Lower is better. Calibrated beats the base-rate (climatology) forecast on every proper score (RPS, Brier, log-loss). *Base-rate ECE is 0 by construction: a constant prediction is trivially calibrated but has no discrimination.
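For readers who want to reproduce the proper scores, the sketch below computes them for one match under common conventions. These are assumptions, not confirmed details of this report: outcomes are ordered home, draw, away; Brier is the un-normalised sum over the three classes (consistent with the magnitudes in the table); RPS is divided by the number of categories minus one. Table values would be means over matches.

```python
import math
from typing import Sequence


def brier(probs: Sequence[float], outcome: int) -> float:
    """Multi-class Brier score: squared error summed over all classes."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))


def log_loss(probs: Sequence[float], outcome: int, eps: float = 1e-12) -> float:
    """Negative log of the probability assigned to what happened."""
    return -math.log(max(probs[outcome], eps))


def rps(probs: Sequence[float], outcome: int) -> float:
    """Ranked probability score over ordered outcomes (home, draw, away)."""
    cum_pred = 0.0
    cum_obs = 0.0
    total = 0.0
    for i in range(len(probs) - 1):  # the last cumulative term is always 0
        cum_pred += probs[i]
        cum_obs += 1.0 if i == outcome else 0.0
        total += (cum_pred - cum_obs) ** 2
    return total / (len(probs) - 1)


uniform = [1 / 3, 1 / 3, 1 / 3]
print(brier(uniform, 0), log_loss(uniform, 0), rps(uniform, 0))
```

A uniform forecast gives Brier 2/3 and log-loss ln 3 ≈ 1.0986 under these conventions, which is a useful sanity anchor when reading the table.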
Calibration error (ECE): 0.1030 → 0.0265 (held-out test)
RPS vs base-rate: 0.202 / 0.215 (beats climatology)
Backtest size: 439 (leakage-free, as-of)
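The calibration error quoted above can be sketched as follows. This assumes a top-label ECE with ten equal-width confidence bins; the report does not state its binning scheme, so both the definition and the name `expected_calibration_error` are illustrative.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[Sequence[float]],
    outcomes: Sequence[int],
    n_bins: int = 10,
) -> float:
    """Top-label ECE: count-weighted mean |accuracy - confidence| over bins."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and aligned")
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, outcomes):
        pick = max(range(len(p)), key=lambda i: p[i])  # predicted outcome
        conf = p[pick]
        b = min(int(conf * n_bins), n_bins - 1)        # conf == 1.0 goes in the last bin
        conf_sum[b] += conf
        hit_sum[b] += 1.0 if pick == y else 0.0
        counts[b] += 1
    n = len(outcomes)
    ece = 0.0
    for b in range(n_bins):
        if counts[b]:
            accuracy = hit_sum[b] / counts[b]
            confidence = conf_sum[b] / counts[b]
            ece += (counts[b] / n) * abs(accuracy - confidence)
    return ece


outcomes = [0, 0, 1, 2]                    # toy sample: home wins half the time
constant = [[0.5, 0.25, 0.25]] * 4         # constant forecast equal to those frequencies
print(expected_calibration_error(constant, outcomes))
```

The toy sample illustrates the footnote on the base-rate column: a constant forecast that matches the observed frequencies scores an ECE of 0 even though it cannot tell matches apart.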
Limitations (read honestly)
Corpus mix. Friendlies are 41.7% of the full eval set and 66.7% of the held-out test, so results lean toward friendly-match behaviour.
No Nations League in eval. UEFA Nations League fixtures fell in the warm-up window, so they are not scored here.
Small test set. The held-out test is n=102; high-confidence reliability bins are sparse, so ECE there carries noise.
Residual away-favourite overconfidence. After a single temperature, away picks at ≥60% confidence (n=7) still averaged 66% predicted vs 43% observed (gap +23pp). This is a signal for richer calibration, but it was not acted on because the sample is too small.
Engine scope. This validates the core engine only. The live-only stages
(player_xG_adj, situation_mult) are not covered by this fit.
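On the away-favourite gap noted above, a quick interval estimate shows why n=7 is too small to act on. The sketch assumes that 43% observed corresponds to 3 correct picks out of 7 (an inference from the rounded figure, not a number stated on this page) and uses a standard 95% Wilson score interval.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Assumed counts: 3 of 7 away picks at >=60% confidence came in.
low, high = wilson_interval(3, 7)
print(f"observed hit rate plausibly between {low:.2f} and {high:.2f}")
```

Under that assumption the interval runs from about 0.16 to about 0.75, which still contains the 66% predicted rate, so the gap is suggestive rather than conclusive.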
Live forward track
Predictions are locked before kickoff (saved up to two hours ahead, never revised) and scored prospectively as matches complete.
Forward tracking of the calibrated model begins at the cutover (2026-06-17); results will populate as matches finish.
The live sample is currently minimal and is reported as such: no headline accuracy figure is claimed until the sample is meaningful.
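The locking rule can be sketched as a write-once store. This assumes "up to two hours ahead" means a prediction may be saved only within the two hours before kickoff. The `PredictionLog` class and `LOCK_WINDOW` constant are hypothetical and do not describe the live system's code.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

Probs = Tuple[float, float, float]  # (home win, draw, away win)
LOCK_WINDOW = timedelta(hours=2)    # assumed reading of "up to two hours ahead"


class PredictionLog:
    """Write-once store: a prediction is saved before kickoff and never revised."""

    def __init__(self) -> None:
        self._locked: Dict[str, Tuple[datetime, Probs]] = {}

    def lock(self, match_id: str, probs: Probs,
             now: datetime, kickoff: datetime) -> bool:
        """Save the prediction; return False if the save is refused."""
        if match_id in self._locked:
            return False  # already locked, so no revisions
        if not (kickoff - LOCK_WINDOW <= now < kickoff):
            return False  # too early, or at/after kickoff
        self._locked[match_id] = (now, probs)
        return True

    def get(self, match_id: str) -> Optional[Tuple[datetime, Probs]]:
        """Return (saved_at, probs) for a locked match, or None."""
        return self._locked.get(match_id)
```

Refusing a second write, rather than overwriting, is what keeps the forward track prospective: the scored forecast is always the one that existed before the result was known.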