Model evaluation & calibration

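Model: Dixon-Coles with the low-score dependence parameter ρ fitted by maximum likelihood (MLE), scoring rates λ derived from ELO/strength ratings, and temperature-calibrated outcome probabilities.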
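As an illustration of the model family named above, here is a minimal sketch of how a Dixon-Coles scoreline distribution can be collapsed into home/draw/away probabilities. The function names, the `max_goals` truncation, and the example rates are assumptions for illustration. How the scoring rates are actually derived from ELO/strength ratings is not stated on this page, so the rates are simply taken as inputs.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Poisson probability of exactly k goals at the given scoring rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def dc_tau(x: int, y: int, lam_home: float, lam_away: float, rho: float) -> float:
    """Dixon-Coles correction factor; only the four lowest scorelines are adjusted."""
    if x == 0 and y == 0:
        return 1.0 - lam_home * lam_away * rho
    if x == 0 and y == 1:
        return 1.0 + lam_home * rho
    if x == 1 and y == 0:
        return 1.0 + lam_away * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def outcome_probs(
    lam_home: float, lam_away: float, rho: float, max_goals: int = 10
) -> tuple[float, float, float]:
    """Return (home win, draw, away win) probabilities.

    Scorelines above max_goals are truncated (an assumption of this sketch),
    so the result is renormalised to sum to 1.
    """
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = (
                dc_tau(x, y, lam_home, lam_away, rho)
                * poisson_pmf(x, lam_home)
                * poisson_pmf(y, lam_away)
            )
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away
    return home / total, draw / total, away / total
```

With a negative ρ, the 0-0 and 1-1 scorelines gain probability at the expense of 1-0 and 0-1, so the draw probability rises relative to independent Poisson scoring.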

Leakage-free as-of backtest over 439 recent international matches (2025–2026): every prediction uses only data available before that match.
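The as-of constraint can be sketched as a loop that only ever hands the predictor matches dated strictly before the one being predicted. The `Match` record, the `as_of_backtest` helper, and the toy `base_rate_predictor` are hypothetical names introduced for this illustration. The real pipeline and its data schema are not shown on this page.

```python
from datetime import date
from typing import Callable, NamedTuple, Sequence

Probs = tuple[float, ...]


class Match(NamedTuple):
    played_on: date
    home: str
    away: str
    outcome: int  # 0 = home win, 1 = draw, 2 = away win


def as_of_backtest(
    matches: Sequence[Match],
    fit_predict: Callable[[list[Match], Match], Probs],
) -> list[tuple[Match, Probs]]:
    """Predict each match using only matches played strictly before it."""
    ordered = sorted(matches, key=lambda m: m.played_on)
    results = []
    for match in ordered:
        # Strict inequality: same-day results are excluded from the history.
        # Quadratic in the number of matches, which is fine for a sketch.
        history = [h for h in ordered if h.played_on < match.played_on]
        results.append((match, fit_predict(history, match)))
    return results


def base_rate_predictor(history: list[Match], match: Match) -> Probs:
    """Toy stand-in model: outcome frequencies observed so far."""
    if not history:
        return (1 / 3, 1 / 3, 1 / 3)
    counts = [0, 0, 0]
    for h in history:
        counts[h.outcome] += 1
    return tuple(c / len(history) for c in counts)
```

Any model can be plugged in as `fit_predict`, as long as it fits only on the `history` it is given.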

Reliability — before vs after calibration

[Figure: reliability diagram. x-axis: predicted probability, y-axis: observed frequency, both from 0.00 to 1.00.]
  • Raw (overconfident)
  • Calibrated (T=1.718)
  • Perfect calibration

Curve = full corpus (n=439). Metrics table below = leakage-free held-out test (n=102). Dot size ∝ bin sample count.
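For readers unfamiliar with temperature calibration, the following is a minimal sketch under one common convention: log-probabilities are divided by T and renormalised, and T is chosen to minimise mean log-loss on a fitting set. Whether this page's model applies T in exactly this way is an assumption. `apply_temperature`, `mean_log_loss`, `fit_temperature`, and the search bounds are illustrative names and values.

```python
import math


def apply_temperature(probs: list[float], temperature: float) -> list[float]:
    """Rescale a probability vector: p_k ** (1/T), renormalised.

    T > 1 flattens the distribution (less confident); T = 1 leaves it unchanged.
    """
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def mean_log_loss(preds: list[list[float]], outcomes: list[int], temperature: float) -> float:
    """Average negative log-likelihood of the observed outcomes after scaling."""
    loss = 0.0
    for probs, y in zip(preds, outcomes):
        loss -= math.log(apply_temperature(probs, temperature)[y])
    return loss / len(preds)


def fit_temperature(
    preds: list[list[float]],
    outcomes: list[int],
    lo: float = 0.25,
    hi: float = 5.0,
    iters: int = 60,
) -> float:
    """Golden-section search for the T in [lo, hi] that minimises log-loss.

    The loss is convex in 1/T, hence unimodal in T, so the search is valid.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if mean_log_loss(preds, outcomes, c) < mean_log_loss(preds, outcomes, d):
            b = d
        else:
            a = c
    return (a + b) / 2
```

A fitted T above 1, such as the 1.718 shown in the legend, means the raw probabilities were too extreme and are pulled toward uniform.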

Held-out test metrics (n=102)

Metric     Raw      Calibrated  Base-rate
RPS        0.2177   0.2023      0.2150
Brier      0.6406   0.6012      0.6246
Log-loss   1.0824   1.0022      1.0389
ECE        0.1030   0.0265      0.0000*

Lower is better. Calibrated beats the base-rate (climatology) on every proper score. *Base-rate ECE is 0 by construction (a constant prediction is trivially calibrated but has no discrimination).
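The four metrics can be sketched per match as follows, with outcomes ordered home win, draw, away win. The conventions used here are assumptions: RPS normalised by K-1, Brier as the unnormalised sum over classes, natural-log loss, and a top-label ECE with ten equal-width bins. The magnitudes in the table are consistent with these conventions for the first three, but the page does not state them, and the ECE binning in particular may differ.

```python
import math


def rps(probs, outcome: int) -> float:
    """Ranked probability score for ordered outcomes, normalised by K - 1."""
    cum_p = cum_o = total = 0.0
    for k in range(len(probs) - 1):
        cum_p += probs[k]
        cum_o += 1.0 if k == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)


def brier(probs, outcome: int) -> float:
    """Multi-class Brier score: squared error summed over all classes."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2 for k, p in enumerate(probs))


def log_loss(probs, outcome: int) -> float:
    """Negative natural log of the probability given to what happened."""
    return -math.log(max(probs[outcome], 1e-15))


def ece(preds, outcomes, n_bins: int = 10) -> float:
    """Top-label expected calibration error with equal-width confidence bins."""
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    for probs, y in zip(preds, outcomes):
        conf = max(probs)
        predicted = list(probs).index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        conf_sum[b] += conf
        hit_sum[b] += 1.0 if predicted == y else 0.0
    # Sum of |accuracy - confidence| per bin, weighted by bin size.
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / len(preds)
```

The footnote's point about the base-rate can be checked directly: a constant forecast equal to the observed outcome frequencies lands in a single bin whose confidence matches its hit rate, so `ece` returns 0 even though the forecast never distinguishes one match from another.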

  • Calibration error (ECE): 0.1030 raw → 0.0265 calibrated (held-out test)
  • RPS vs base-rate: 0.202 / 0.215 (beats climatology)
  • Backtest size: 439 (leakage-free as-of)

Limitations (read honestly)

Live forward track

Predictions are locked before kickoff (saved up to two hours ahead, never revised) and scored prospectively as matches complete. Forward tracking of the calibrated model begins at the cutover (2026-06-17), and results will populate from that date. The live sample is currently minimal and is reported as such: no headline accuracy figure is claimed until the sample is meaningful.
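One way to enforce a lock of this kind is a write-once store that rejects late, early, or repeated saves. The sketch below is hypothetical: `LockedPredictionStore` and its methods are invented for illustration, and reading "up to two hours ahead" as "no earlier than two hours before kickoff" is an assumption about the rule, not a description of the site's implementation.

```python
from datetime import datetime, timedelta, timezone


class LockedPredictionStore:
    """Write-once store: each match gets one prediction, saved before kickoff."""

    def __init__(self, max_lead: timedelta = timedelta(hours=2)):
        # max_lead is the assumed earliest point before kickoff at which
        # a prediction may be saved.
        self._max_lead = max_lead
        self._records: dict[str, tuple[tuple[float, ...], datetime]] = {}

    def save(self, match_id: str, probs, kickoff: datetime, now: datetime) -> None:
        if match_id in self._records:
            raise ValueError(f"{match_id}: prediction already locked")
        if now >= kickoff:
            raise ValueError(f"{match_id}: kickoff has passed")
        if kickoff - now > self._max_lead:
            raise ValueError(f"{match_id}: too early to lock")
        # Store an immutable copy together with the time it was locked.
        self._records[match_id] = (tuple(probs), now)

    def get(self, match_id: str) -> tuple[tuple[float, ...], datetime]:
        return self._records[match_id]


kickoff = datetime(2026, 6, 18, 19, 0, tzinfo=timezone.utc)
store = LockedPredictionStore()
store.save("example-match", [0.5, 0.3, 0.2], kickoff, now=kickoff - timedelta(hours=1))
```

Keeping the lock timestamp alongside the probabilities makes it possible to audit later that every scored prediction predates its kickoff.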