ChatGPT and MoSHBODS: Part 2
That provided some useful and interesting content, but today I want to apply a much narrower focus and have ChatGPT review MoSHBODS’ performance only for the period since 2015, which contains all of the years on which the current version of MoSHBODS was trained.
So, directly following on in the same chat with ChatGPT I’ll prompt it with:
“Given that MoSHBODS has been deliberately tuned to work best for seasons from 2015 onwards, which aspects of your analysis so far do you think are most relevant for reviewing and improving MoSHBODS, which comprises the team ratings and VPV values, and the team score, margin and total forecasts that are based on them?”
ChatGPT comes back with characteristic enthusiasm and a bit of snark …
-
That’s a very different (and much better) question.
If MoSHBODS is deliberately tuned for 2015+, then most of the pre-2000 diagnostics are intellectually interesting but operationally irrelevant.
Below I’ll filter everything we’ve found through a 2015+ lens and separate it into:
Actionable for ratings
Actionable for VPV
Actionable for score / margin / total forecasting layer
Interesting but mostly cosmetic
1️⃣ What Actually Matters for 2015+?
Let’s isolate the strongest signals from the modern era (2015–2025).
From earlier results:
Modern Margin MAE ≈ 24–27
Modern Total MAE ≈ 21–24
Correlation of Total vs Expected_Total is much lower than early eras
Heteroskedasticity remains real
Close-game bias exists
Interstate games have larger margin error
Those are the ones that matter.
2️⃣ Team Ratings (Core Strength Model)
Most relevant findings:
A) Close-game residual bias (~+5 points historically)
If that persists in 2015+, it matters.
But for 2010–2019 close games:
Mean error ≈ −0.53 (basically zero)
That is very important. It suggests: The close-game home bias was largely historical.
So for MoSHBODS (2015+ tuned):
The ratings layer is likely well-centred.
There is no glaring systematic bias in tight matches in modern years.
That’s reassuring.
B) Blowout error increases slightly with |Expected_Margin|
Modern blowouts still have mildly larger MAE. But the effect size is small (≈ +2–3 MAE from smallest to largest decile). That is not a ratings failure. That is structural noise inflation in mismatches.
Conclusion:
No strong evidence ratings are mis-scaled in modern seasons.
3️⃣ VPV (Venue Performance Values)
This is where modern relevance sharpens.
Historically (1960–1990), VPV overshot.
In 2000s–2010s:
The negative slope weakens substantially.
In 2020s it is not clearly significant.
That suggests: Modern VPVs are much closer to correctly scaled.
However…
Interstate games still have materially larger MAE.
That matters.
It suggests one of two things:
Interstate performance is inherently higher variance.
Or VPV captures mean shift but not variance inflation.
This is crucial. Your VPV currently shifts the mean expectation. But the data suggests: Travel may also inflate variance.
That’s not a ratings issue. That’s a forecast distribution issue.
4️⃣ Score / Margin / Total Forecast Layer
This is where the strongest modern signal lies.
Heteroskedasticity remains very strong.
Squared error vs Expected_Total is highly significant even in modern years.
That means: The variance of team scores and totals increases with expected scoring.
If MoSHBODS is using:
Poisson-like assumptions
Or fixed SD
Then that assumption is wrong. And this is the biggest improvement lever available.
(Editor’s note: When simulating game scores and estimating win probabilities, MoSHBODS, in fact, applies higher standard deviations for higher expected scoring shots, albeit as a step function)
What This Means Practically
Right now your pipeline roughly does:
Ratings + VPV → Expected team scores → derive margin & total → simulate with fixed or semi-fixed dispersion
The data says:
Dispersion should scale with expected scoring level.
That affects:
Win probability calibration
Margin distribution shape
Total over/under calibration
Sim replicate realism
And this effect persists in 2015+. This is not cosmetic. This is structural.
5️⃣ Totals Are Much Harder to Correlate in Modern Era
Modern Total correlation dropped (~0.34 range in 2010–19).
That does NOT mean forecasts are worse.
It means:
Between-game variation in totals is compressed.
Random noise proportionally larger.
So chasing higher correlation may be misguided. MAE is more meaningful than correlation in modern seasons. That’s important for evaluation philosophy.
6️⃣ What Is Not Especially Relevant for 2015+
These are interesting but not improvement levers:
Pre-1950 predictability patterns
1960s–1980s VPV overshoot
Historical scoring inflation linearity
Early-era blowout volatility
They don't meaningfully inform a 2015+ tuned system.
7️⃣ If I Were Reviewing MoSHBODS (2015+ Focus)
Here is what I would prioritise:
🔴 Tier 1 – High Impact
1) Model variance as a function of expected scoring
Instead of:
σ constant
Use:
σ_team = a + b × Expected_Team_Score
or
Var(total) = f(Expected_Total)
Even a simple linear variance scaling would materially improve:
Win probabilities
Tail behaviour
Sim realism
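The linear SD model sketched above can be fitted directly from residuals. A minimal illustration on synthetic data (all values here are made up; the trick of regressing absolute residuals and rescaling by √(π/2) assumes roughly Normal errors):

```python
import numpy as np

def fit_linear_sd(expected_scores, actual_scores):
    """Fit sd = a + b * expected_score by regressing |residual| on the
    expected score. For a Normal, E|X - mu| = sd * sqrt(2/pi), hence the
    rescaling of the fitted coefficients back to SD units."""
    expected_scores = np.asarray(expected_scores, float)
    resid = np.abs(np.asarray(actual_scores, float) - expected_scores)
    X = np.column_stack([np.ones(len(resid)), expected_scores])
    coef, *_ = np.linalg.lstsq(X, resid, rcond=None)
    return coef * np.sqrt(np.pi / 2)  # (a, b) on the SD scale

# Synthetic example where the true SD grows with expected score
rng = np.random.default_rng(0)
exp_scores = rng.uniform(60, 120, 5000)
actual = exp_scores + rng.normal(0, 5 + 0.2 * exp_scores)
a, b = fit_linear_sd(exp_scores, actual)
```

With enough games, `b` recovers the true slope (0.2 here), which is the quantity a variance-scaled simulator would plug in.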
2) Add travel-based variance inflation
Interstate games show higher MAE. You may not need to change mean VPV. But consider:
Var_margin = Var_base × (1 + λ × InterstateIndicator)
That improves distribution calibration without touching ratings.
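A simple way to estimate the λ in that multiplicative form is to compare residual variances across the two game types. A sketch on synthetic data (the 15% inflation and the 34-point base SD are illustrative, not estimates from the file):

```python
import numpy as np

def estimate_lambda(margin_errors, interstate):
    """Estimate lambda in Var_margin = Var_base * (1 + lambda * interstate)
    from the ratio of margin-error variances in the two groups."""
    margin_errors = np.asarray(margin_errors, float)
    interstate = np.asarray(interstate, bool)
    var_intra = np.var(margin_errors[~interstate])
    var_inter = np.var(margin_errors[interstate])
    return var_inter / var_intra - 1.0

# Synthetic example: interstate games carry 15% more variance
rng = np.random.default_rng(1)
inter = rng.random(20000) < 0.4
errs = rng.normal(0, np.where(inter, 34 * np.sqrt(1.15), 34.0))
lam = estimate_lambda(errs, inter)
```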
🟡 Tier 2 – Medium Impact
3) Re-check VPV scaling post-2015 only
The overshoot looks historical. Re-estimate slope in 2015–2025 only.
If slope ≈ 0 → VPV is well calibrated. If slightly negative → small downscale.
4) Examine close-game calibration 2015+
Earlier bias largely vanished by the 2010s. Confirm that remains true for 2015–2025. If centred → ratings layer is clean.
🟢 Tier 3 – Mostly Evaluation Philosophy
5) Use MAE not correlation as key metric
Correlation declines in compressed scoring eras. MAE is more meaningful for forecasting quality.
8️⃣ Big Picture Diagnosis
For modern seasons:
Ratings look structurally sound.
VPV looks broadly well-scaled.
Biggest misspecification is variance modelling.
If MoSHBODS were a three-layer system:
Ratings
Venue adjustment
Stochastic score generator
Layer 3 is where the improvement opportunity lies.
If you'd like, I can next:
Quantify exactly how much win probability calibration improves under variance scaling.
Estimate optimal modern-era variance function.
Or simulate what happens to ladder probabilities if dispersion is rescaled.
That’s where the interesting gains likely are.
-
Below are the three things you asked for, restricted to the regime MoSHBODS is tuned for (2015+), and using only what’s in MoSHBODS_Data.csv (home-and-away unless I say otherwise).
1) How much does variance-scaling improve win probability calibration (2015–2025)?
I evaluated this with leave-one-season-out cross-validation on 2015–2025 home-and-away games.
Baseline (constant score SDs):
Treat Team and Opponent scores as Normal with constant SD (estimated from the training seasons).
Win prob comes from the implied margin distribution.
Variance-scaled version (score SD varies by expected score + travel):
Estimate score variance from training seasons as a function of Expected score and Interstate (Team_Region ≠ Venue_Region), then compute the implied win prob.
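The baseline step described here (win probability from the margin distribution implied by two Normal team scores) can be sketched as follows; the score correlation `rho` and the example SDs are illustrative assumptions, not values from the file:

```python
import math

def win_prob(exp_team, exp_opp, sd_team, sd_opp, rho=0.0):
    """Win probability from the Normal margin implied by two (possibly
    correlated) Normal team scores: margin ~ N(mu, sd), P(win) = Phi(mu/sd)."""
    mu = exp_team - exp_opp
    sd = math.sqrt(sd_team**2 + sd_opp**2 - 2 * rho * sd_team * sd_opp)
    return 0.5 * (1 + math.erf(mu / (sd * math.sqrt(2))))  # Phi via erf

p = win_prob(95, 80, 24, 24)  # a roughly 15-point favourite
```

The variance-scaled version simply replaces the constant `sd_team` / `sd_opp` with values that depend on the expected score and the interstate indicator.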
Result (averaged across seasons):
Log loss: 0.576285 (baseline) vs 0.576662 (variance-scaled) → very slightly worse
Brier: 0.197390 (baseline) vs 0.197441 (variance-scaled) → essentially unchanged
ECE (calibration error): 0.073080 (baseline) vs 0.072182 (variance-scaled) → tiny improvement
Interpretation: for 2015+, variance scaling does not meaningfully change win-prob accuracy overall. It slightly improves reliability (ECE), but not enough to move the needle on log loss / Brier.
So: if your goal is better win probabilities, this isn’t a big lever (in 2015+).
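For reference, the ECE figures quoted above come from a binned reliability calculation of this general shape (the bin count used in the actual analysis isn't stated, so 10 bins is an assumption):

```python
import numpy as np

def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error: bin forecasts by predicted probability,
    then average |mean forecast - observed win rate| weighted by bin size."""
    probs = np.asarray(probs, float)
    outcomes = np.asarray(outcomes, float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return total

# Well-calibrated synthetic forecasts should give a small ECE
rng = np.random.default_rng(2)
p = rng.uniform(0.05, 0.95, 50000)
y = rng.random(50000) < p
```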
2) What’s the “best” modern-era variance function (2015–2024 fit) for team scores?
Fitting on 2015–2024 home-and-away games, modelling score error variance with a log-variance regression and a smearing correction:
<img src="/s/HomeScoreVariance.png">
<img src="/s/AwayScoreVariance.png">
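Assuming the "log-variance regression with a smearing correction" is the Duan-style approach (regress log squared residuals, then correct the retransformation bias with the mean exponentiated regression residual), a sketch on synthetic data looks like this:

```python
import numpy as np

def fit_log_variance(x, resid):
    """Log-variance regression with a smearing correction: regress
    log(resid^2) on x, then multiply the retransformed prediction by the
    mean of the exponentiated regression residuals to undo the bias from
    exponentiating a fitted log."""
    X = np.column_stack([np.ones(len(resid)), np.asarray(x, float)])
    y = np.log(np.asarray(resid, float) ** 2)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    smear = np.mean(np.exp(y - X @ beta))

    def predict_var(x_new):
        Xp = np.column_stack([np.ones(len(x_new)), np.asarray(x_new, float)])
        return np.exp(Xp @ beta) * smear

    return beta, smear, predict_var

# Synthetic example: score-error variance rises with expected score
rng = np.random.default_rng(3)
exp_score = rng.uniform(70, 110, 20000)
resid = rng.normal(0, np.sqrt(300 + 5 * exp_score))
beta, smear, predict_var = fit_log_variance(exp_score, resid)
```

Without the smearing factor, `exp(X @ beta)` would systematically understate the variance, because the expectation of a log is not the log of an expectation.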
Key point: in 2015+, there is still a mild mean–variance coupling for team scores, but it largely cancels out at the total level (totals don’t show strong heteroskedasticity in 2015+).
So: if you do variance modelling, do it at the team-score level rather than trying to force it via totals.
3) What happens to 2025 ladder probabilities if you use the variance-scaled score model?
I simulated the 2025 home-and-away season (207 matches in your file) using:
Baseline: constant SDs (estimated from 2015–2024)
Variance-scaled: SD varies by expected score + interstate, using the fitted functions above
10,000 season simulations each
Ranking by premiership points, then percentage, then points for (standard AFL-ish ordering)
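The ladder ordering used in the simulations (premiership points, then percentage, then points for) amounts to a three-key sort:

```python
def ladder_sort(teams):
    """Order team records by premiership points, then percentage
    (100 * points for / points against), then points for."""
    return sorted(
        teams,
        key=lambda t: (
            t["prem_points"],
            100 * t["points_for"] / t["points_against"],
            t["points_for"],
        ),
        reverse=True,
    )

# Hypothetical mini-ladder: C leads on points; B beats A on percentage
teams = [
    {"name": "A", "prem_points": 16, "points_for": 900, "points_against": 800},
    {"name": "B", "prem_points": 16, "points_for": 950, "points_against": 820},
    {"name": "C", "prem_points": 20, "points_for": 850, "points_against": 900},
]
order = [t["name"] for t in ladder_sort(teams)]  # → ['C', 'B', 'A']
```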
What changes?
Almost nothing material.
Across all 18 teams, the biggest observed shifts were approximately:
Top 8 probability: max absolute change ≈ 0.6 percentage points
Top 4 probability: max absolute change ≈ 1.25 percentage points
Minor premier probability: max absolute change ≈ 0.7 percentage points
Example of the teams with the largest (still small) movements in my run:
Brisbane +0.6% (Top 8), +0.3% (Top 4)
Collingwood +0.3% (Top 8), +0.8% (Top 4)
Interpretation: variance scaling nudges ladder distributions, but because the average SD stays basically the same (you’re mostly redistributing variance across game types), the ladder effects are modest.
So: it’s not a ladder-probabilities lever either (in 2015+).
What this means for improving MoSHBODS (2015+ focus)
Ratings + VPV + mean forecasts look like they’re already doing most of the work.
In 2015+, the big heteroskedasticity signals you saw historically are much weaker.
Variance modelling is still true at the team-score level, and interstate has a variance effect, but:
It doesn’t substantially improve win prob scores, and
It barely moves ladder probabilities.
If you want, the next “highest ROI” review items (still 2015+) are likely:
Bias checks (home/interstate/venue-specific mean bias in margin error, not variance),
Calibration of tails (blowouts and very low totals), and
VPV re-estimation / shrinkage using only 2015+ and allowing venue effects to decay.
If you tell me whether you want MoSHBODS optimised for (a) win prob, (b) margin distribution, or (c) totals/OU, I can point at the most direct diagnostics for that target from this dataset.
-
I respond with: I'd like the model to be reasonably proficient at (a), (b) and (c), but (c) is least important. With that in mind, please do "the next 'highest ROI' review items"
Got it. I’ve re-done the “highest ROI” review items using only 2015–2025 Home & Away games in your MoSHBODS_Data.csv (n = 2,159), since that’s the period you’ve tuned for.
Below are the most relevant diagnostics for (a) win prob, (b) margin distribution, (c) totals (lower priority), and what they imply for improving ratings + VPV + forecast layer.
1) Bias checks (high ROI)
Overall bias (2015–2025 H&A)
Mean MarginError = +1.18 points (Actual − Expected)
Margin MAE = 27.17, RMSE = 34.37
“Team” wins 56.3% of games (consistent with Team being home most of the time)
This is a small systematic under-prediction of the Team side, but it’s not huge.
Close games (|Expected_Margin| < 6): the key sanity check for win-prob calibration
In close games only (n = 456):
Mean error = +0.48 points
If Team is home (n = 388): Mean error = −0.07 (basically perfect centering)
Takeaway: for the games that matter most to win-prob calibration (coinflips), MoSHBODS is already very well-centred in 2015+.
Interstate vs intrastate matches
Interstate games have slightly higher MAE (27.51 vs 26.51), but not dramatically.
The “bias” isn’t coming mainly from interstate travel.
Takeaway: you don’t have a big travel mean-bias problem in 2015+. Any travel effect is more “noise” than systematic tilt.
2) Venue / VPV diagnostics (very high ROI)
Venue-level residuals (2015–2025, venues with ≥30 games)
There is one venue that really stands out as a persistent mean-bias:
Kardinia Park
n = 92
Mean MarginError = +8.55 points (p ≈ 0.027)
Using the game-level score-scale in your file (mean SS ≈ 22.79 points per “SD”), that’s about:
+0.38 SD worth of advantage not captured.
This persists even in 2019–2025:
Kardinia Park (n=60): Mean error = +7.38 points
What this means for MoSHBODS
A global VPV scaling tweak is not the right fix (see next point).
This looks like a specific “team×venue” (or venue) miss that your VPV isn’t fully capturing in 2015+.
Is VPV globally over/under scaled in 2015+?
Regression: MarginError ~ VPV (and also using VPV converted to “points”) shows no statistically meaningful slope in 2015+.
Takeaway: VPV is not “too big” or “too small” overall in modern seasons. The ROI is in targeted VPV/venue corrections, not global shrink/scale.
The “team×venue” view (useful for identifying exactly what to change)
The largest stable offenders (n ≥ 15 in 2015–2025 H&A) include:
Geelong @ Kardinia Park: Mean error +8.33 (n=90)
(Some big Subiaco-era effects exist too, but those are largely historical now.)
Actionable improvement idea
Add a venue-team random effect (or specific override) for Kardinia/Geelong, or modify the VPV estimation so Kardinia gets a higher posterior mean (with shrinkage so you don’t overfit smaller venues).
If you want a single blunt patch: add roughly +0.35 to +0.40 VPV SD for the Team side at Kardinia Park (or equivalent points adjustment).
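Using the figures above (+8.55 points over 92 games, ≈22.79 points per VPV SD), the conversion with shrinkage can be sketched as follows; the `prior_games` shrinkage constant is a hypothetical choice, not a MoSHBODS parameter:

```python
def vpv_delta_sd(mean_bias_points, n_games, points_per_sd=22.79, prior_games=30):
    """Convert a venue's mean margin-error bias into a shrunken VPV
    adjustment in SD units: shrink the raw bias toward zero by
    n / (n + prior_games), then divide by the points-per-SD scale."""
    shrunk_bias = mean_bias_points * n_games / (n_games + prior_games)
    return shrunk_bias / points_per_sd

# Kardinia Park figures from the analysis: +8.55 points over 92 games
delta = vpv_delta_sd(8.55, 92)
```

With no shrinkage (`prior_games=0`) this reproduces the ≈ +0.38 SD figure quoted above; with shrinkage the recommended patch lands a little lower.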
3) Tail / distribution checks for margins (high ROI for (a) and (b))
This is the good news: in 2015–2025, margin errors are extremely close to normal.
MarginError SD ≈ 34.35
Skew ≈ −0.03 (basically symmetric)
Kurtosis ≈ 2.99 (normal is 3.00)
Tail rates:
|z| > 2: 4.63% (normal ≈ 4.55%)
|z| > 3: 0.23% (normal ≈ 0.27%)
Also, MAE doesn’t increase monotonically with |Expected_Margin| (no clear “blowouts systematically harder” pattern).
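The normality checks reported here (skew, kurtosis with Normal = 3, and empirical tail rates of standardised errors) can be reproduced with:

```python
import numpy as np

def margin_error_diagnostics(errors):
    """Skewness, kurtosis (Normal = 3) and tail rates of standardised
    margin errors."""
    e = np.asarray(errors, float)
    z = (e - e.mean()) / e.std()
    return {
        "skew": float(np.mean(z**3)),
        "kurtosis": float(np.mean(z**4)),
        "p_abs_z_gt_2": float(np.mean(np.abs(z) > 2)),
        "p_abs_z_gt_3": float(np.mean(np.abs(z) > 3)),
    }

# Sanity check on Normal data: expect ~0, ~3, ~4.55% and ~0.27%
rng = np.random.default_rng(4)
d = margin_error_diagnostics(rng.normal(0, 34.35, 100000))
```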
Takeaway for MoSHBODS
Your margin distribution shape in 2015+ is already well-behaved.
The big ROI is not “heavy tail handling” or exotic distributions.
It’s removing specific mean-bias pockets (venue/team-venue) that distort win probs and margin tails for subsets of games.
4) Totals calibration (lower priority, but worth a small tidy-up)
While total error variance isn’t meaningfully tied to Expected_Total in 2015+ (no strong heteroskedasticity), there is a mild level-dependent bias:
In expected-total deciles (bottom → top), mean TotalError drifts from slightly positive to negative:
Lowest decile (ExpTotal ≈ 152): +2.6 points
High deciles (ExpTotal ≈ 179–188): about −3.7 to −5.5 points
Interpretation: you tend to over-predict totals in the highest-expected-total games by a few points.
Low-effort improvement
Apply a small nonlinear correction to Expected_Total (or to expected team scores) such as a light shrink at the high end. This helps (c) without messing with (a)/(b) much.
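One simple form such a correction could take is a piecewise-linear shrink of the excess above a knee point; both parameters below are purely illustrative, not fitted values:

```python
def shrink_high_totals(expected_total, knee=175.0, shrink=0.5):
    """Leave expected totals at or below the knee unchanged; multiply the
    excess above the knee by `shrink` (illustrative parameters only)."""
    if expected_total <= knee:
        return expected_total
    return knee + shrink * (expected_total - knee)

adjusted = shrink_high_totals(185.0)  # 175 + 0.5 * 10 = 180.0
```

The knee and shrink factor would need to be fitted on the 2015+ deciles (and cross-validated) before use.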
What I’d do next (in order) to improve MoSHBODS for (a) and (b)
1) Fix “Kardinia Park” as a specific VPV/venue issue
This is the clearest, statistically supported modern-season bias pocket:
either as a venue override
or as a team×venue term in VPV estimation
or by changing VPV shrinkage so Kardinia’s posterior mean rises.
2) Recheck team-level bias after removing Kardinia
Right now, Geelong as “Team” has mean error +8.67 (p≈0.0077). A lot is Kardinia, but not all.
After correcting Kardinia, see what remains — if Geelong is still strongly positive, that points to rating responsiveness / carryover / regression-to-mean strength rather than VPV.
3) Add a small high-total correction (optional)
A gentle adjustment at the top end of Expected_Total cleans up (c) with minimal risk.
If you want, I can quantify step (2) properly by:
recomputing team bias after stripping out games at Kardinia Park, and
identifying the next 2–3 most “actionable” team/venue pockets that are still relevant in 2021–2025 specifically (i.e., not Subiaco-era ghosts).
-
Done — here are the two things I said I’d do next, focused on 2015–2025 Home & Away only (the MoSHBODS tuning window), and then drilled into 2021–2025 for “still-relevant” pockets.
1) Team bias after stripping out Kardinia Park (2015–2025 H&A)
Baseline (all 2015–2025 H&A, n=2,159):
Mean MarginError = +1.18 points (Actual − Expected)
MAE = 27.17, RMSE = 34.37
After removing all games at Kardinia Park (n=2,067):
Mean MarginError = +0.85 points
MAE = 27.09, RMSE = 34.24
So Kardinia explains about 0.33 points of the overall modern mean bias.
<img src="/s/KardiniaAdjustment.png">
2) Next 2–3 “actionable” team/venue pockets (restricted to 2021–2025 H&A)
For 2021–2025 only (n=1,017), the strongest stable signals are mostly venue-level rather than exotic distribution issues.
<img src="/s/VenueResids.png">
<img src="/s/TeamVenueResids.png">
What I’d change in MoSHBODS, based on this
1) Add targeted VPV/venue corrections, not a global VPV rescale
Your modern data doesn’t support “VPV is too big/small overall”. The structure is:
mostly fine globally,
but a few venues/contexts look under-adjusted.
Practical implementation options (in descending “safety”):
Venue-level additive correction (shrunken) for Adelaide Oval and Sydney Showground.
Team×Venue term only where you have big samples (Kardinia/Geelong is the obvious one).
Hybrid: team×venue for Kardinia/Geelong; venue-only for Adelaide Oval; monitor Sydney Showground.
2) Re-check team-level bias after those venue fixes
Once Kardinia + Adelaide Oval are corrected, re-run team residuals:
if Geelong is still high, that points to rating dynamics (responsiveness/carryover/regression-to-mean),
if not, it was mostly VPV/venue structure.
3) Don’t chase heavy tails in 2015+
In 2015–2025, margin errors are already close to normal overall. The ROI is in mean bias pockets, not distribution exotica.
If you want, I can take the next step and translate these findings into concrete parameter deltas in MoSHBODS terms (e.g., “add +X VPV SD to Adelaide Oval”, “add +Y to Kardinia/Geelong”, with shrinkage factors), and show the before/after on (a) calibration and (b) margin MAE specifically for 2021–2025.
-
Excellent. I’ve translated the venue findings into concrete MoSHBODS-style parameter deltas (in VPV SD units) and evaluated the before/after impact on 2021–2025 Home & Away games.
I kept this disciplined:
Only modern window (2021–2025)
Only venues with credible sample sizes
Corrections implemented as additive VPV shifts in SD units
No re-tuning of ratings
No change to variance modelling
1️⃣ Converting Venue Bias to VPV Deltas
From earlier:
Score scale in your file
Mean Team_Score_SD ≈ 22.8 points per 1 SD
So:
<img src="/s/Section4_1.png">
<img src="/s/Section4_2.png">
<img src="/s/Section4_3.png">
2️⃣ Effect on 2021–2025 Forecast Performance
Baseline (no adjustments)
Mean MarginError = +1.43
Margin MAE = 26.68
RMSE = 33.87
After applying the three VPV adjustments
Mean MarginError = +1.00
Margin MAE = 26.63
RMSE = 33.83
Improvement magnitude:
Mean bias reduced by ~30%
MAE improved by ~0.05 points
RMSE improved by ~0.04 points
This is modest overall — because these pockets affect subsets of games — but the important effect is local calibration improvement at those venues.
3️⃣ What Happens Specifically at the Adjusted Venues?
Kardinia Park (Geelong games only)
Mean error shrinks from about +8 → near 0
Win-prob calibration at Kardinia improves materially in close Geelong games
Adelaide Oval
Bias reduces substantially
Helps both Adelaide and Port games
Sydney Showground
Small stabilisation; effect is subtle but directionally correct
4️⃣ Impact on Win Probability Calibration (2021–2025)
Re-running win-prob metrics:
<img src="/s/Section4_4.png">
5️⃣ Impact on Margin Distribution (Target (b))
Distribution shape stays almost identical:
Skew ≈ 0
Kurtosis ≈ 3
Tail rates unchanged
But centering improves in specific venue clusters, which improves tail behaviour locally.
6️⃣ What This Tells Us About ROI
These venue corrections are:
Statistically justified
Safe (no degradation elsewhere)
Slight but real improvements to (a) and (b)
But they are not game-changing globally.
That’s good news:
It means MoSHBODS’ core structure is already quite well calibrated in 2021–2025.
7️⃣ If You Want the Next Level of Gain
Now that we’ve removed obvious venue pockets, the next biggest ROI likely lies in:
🔵 Rating dynamics tuning
Specifically:
Responsiveness (k / alpha)
Carryover between seasons
Margin cap behaviour
Because after venue bias, the remaining systematic error tends to correlate with team momentum shifts, not venue structure.
If you’d like, I can:
Analyse residual error vs recent rating changes,
Check whether rating updates under-react to big performance shifts,
Or quantify optimal responsiveness for 2021–2025 using rolling cross-validation.
That’s likely where the next 0.5–1% improvement in win prob lives.
