On First Birthday, Hokie Analytics Unveils New Score Prediction Model
Details reveal vastly improved score prediction model for 2023 season
Hokie Analytics blew out one candle on the cake this week, and in the spirit of “it’s better to give than to receive,” has a new toy to share with readers.
Coverage of the 2023 season will feature a new-from-the-ground-up score prediction model.
Whereas the 2022 in-season model was a fairly simple, albeit extremely clunky and difficult to manage, linear regression setup, this year’s version is an Extreme Gradient Boosting (XGBoost) model.
In this article, I will open the hood on this latest model, highlight what it does (and does not) do well, and share examples from the test results.
Background on XGBoost models
XGBoost is a machine learning algorithm that is specifically designed for supervised learning tasks, where one has a labeled dataset and is aiming to make predictions or classifications.
The algorithm was developed by Tianqi Chen, an Assistant Professor of Computer Science in the Machine Learning Department at Carnegie Mellon University. It builds upon the concept of gradient boosting, which is an ensemble learning technique that combines the predictions of multiple weak models (typically decision trees) to create a strong predictive model.
XGBoost improves upon the traditional gradient boosting algorithm by incorporating various enhancements and optimizations, making it extremely efficient and effective.
Advantages of XGBoost models include the following (see the code sketch after this list):
High Performance: XGBoost is designed to be highly efficient and can handle large datasets with high dimensionality.
Regularization: It includes regularization techniques that help prevent overfitting, making the model more generalizable.
Flexibility: XGBoost supports both regression and classification tasks, making it versatile for various predictive modeling scenarios.
Feature Importance: It provides insights into feature importance, helping users understand which features have the most impact on predictions.
Handling Missing Values: XGBoost can handle missing values in the data during training and prediction.
Parallel and Distributed Computing: It can be run in parallel on multi-core CPUs and distributed environments, which speeds up training.
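To make that concrete, below is a minimal sketch of an XGBoost regression setup using the Python package’s scikit-learn-style API. The feature names and values are hypothetical placeholders for illustration, not the score model’s actual inputs.

```python
# Minimal, hypothetical XGBoost regression sketch. Features and values
# are placeholders for illustration, not the score model's real inputs.
import numpy as np
from xgboost import XGBRegressor

# Toy training data: rows are games, columns are numeric features.
# np.nan is allowed -- XGBoost routes missing values during tree splits.
X_train = np.array([
    [0.62, 31.5, np.nan],
    [0.48, 24.0, 5.1],
    [0.55, 28.2, 3.7],
    [0.70, 35.1, 6.4],
])
y_train = np.array([31, 17, 24, 38])  # points scored

model = XGBRegressor(
    n_estimators=200,   # number of boosted trees
    max_depth=4,        # depth of each weak learner
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
    reg_lambda=1.0,     # L2 regularization, which helps curb overfitting
)
model.fit(X_train, y_train)

print(model.predict(X_train))      # predicted points
print(model.feature_importances_)  # which features drove the predictions
```

Note how the regularization, missing-value handling, and feature-importance advantages from the list above all surface directly in the API.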
Last year’s model was assembled with the computing equivalent of duct tape and a Sawzall. It was built to predict Virginia Tech games only, and some of the variables, notably QB Health, were subjective and prone to bias.
Key goals for this year’s model were:
Applicability to all FBS games
Ease of Use - no more manual data entry or subjective variables
Flexibility in handling null value fields
Accuracy
The 2023 XGBoost model earns high marks in each area, and by eliminating subjective inputs, it removes that source of bias entirely.
Accuracy
The model was trained on regular season games from 2013 through 2020 and tested on all games from the 2021 and 2022 seasons. The testing results are as follows:
Of course, some games are more difficult to predict than others. For example, there tend to be more foreseeable blowouts early in the season, before conference play heats up.
While performance on the test data varied by week, every week’s results stayed within an acceptable range.
And in the toughest games to predict, those with a spread ranging from -2 to 2, the model got more picks right than it did wrong.
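For readers who want to replicate that kind of slice, here is a hedged sketch of bucketing test games by spread and scoring pick accuracy. The column names and numbers are hypothetical placeholders, not my actual test data.

```python
# Hypothetical sketch of checking pick accuracy in close-spread games.
# Column names and values are placeholders, not the real test results.
import pandas as pd

results = pd.DataFrame({
    "spread":      [-1.5,  1.0,  -7.0,  2.0, -0.5],  # closing line
    "pred_margin": [ 3.2, -2.1,  -9.8,  4.4,  1.0],  # predicted home margin
    "real_margin": [ 7.0, -3.0, -14.0, -1.0,  3.0],  # actual home margin
})

close = results[results["spread"].between(-2, 2)]          # toughest games
correct = close["pred_margin"] * close["real_margin"] > 0  # same sign = right pick
print(f"Pick accuracy in close games: {correct.mean():.0%}")
```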
Those results are pretty solid, especially considering the detailed mechanics of the model. Technically, the model makes two predictions - predicted home points and predicted away points.
The model is not actually predicting which team will win or lose, or by what margin. However, it is not predicting points in a vacuum either: away-team points are an input to the home-team points prediction, and vice versa.
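The article does not spell out exactly how the two predictions feed each other, so the sketch below is purely an assumed illustration, not the model’s documented pipeline: paired home and away regressors, each taking the opponent’s points as a feature, alternating until the pair of scores stabilizes.

```python
# Assumed illustration of paired home/away predictions -- the iteration
# scheme here is a guess at one workable approach, not the model's
# documented pipeline. Training data below is synthetic.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n = 200
team_stats = rng.normal(size=(n, 3))      # placeholder team features
opp_points = rng.uniform(10, 45, size=n)  # opponent points as a feature
X = np.column_stack([team_stats, opp_points])
y = 20 + 0.3 * opp_points + rng.normal(size=n)  # toy points target

home_model = XGBRegressor(n_estimators=100).fit(X, y)
away_model = XGBRegressor(n_estimators=100).fit(X, y)

def predict_game(home_feats, away_feats, n_iters=10, tol=0.1):
    """Alternate predictions, feeding each side's estimate to the other."""
    home_pts = away_pts = 28.0  # seed with a league-average-ish score
    for _ in range(n_iters):
        new_home = home_model.predict(
            np.append(home_feats, away_pts).reshape(1, -1))[0]
        new_away = away_model.predict(
            np.append(away_feats, home_pts).reshape(1, -1))[0]
        done = abs(new_home - home_pts) < tol and abs(new_away - away_pts) < tol
        home_pts, away_pts = float(new_home), float(new_away)
        if done:
            break
    return home_pts, away_pts

print(predict_game(rng.normal(size=3), rng.normal(size=3)))
```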
One major note of caution: the score prediction model should not be used to bet the over/under or against the spread. It is not good at either task. At some point I might create models to predict one or both of those.
Results from the testing data
Understanding predictions from the testing data, and how they compared to the final results, will help contextualize the 2023 predictions. Most predictions aligned nicely with the actual results. Below are some instructive deviations from the norm.
Near miss - 2021 Notre Dame at Virginia Tech
The model predicted a 29.3 to 28.4 Virginia Tech win. The final result was a gut-wrenching 32-29 Tech loss. How close was the model? The Hokies led 29-21 late in the fourth quarter. Notre Dame drove down the field and scored a touchdown. Had the Irish failed on the two-point conversion attempt and the Hokies gotten a first down, they could have taken a few kneel-downs and secured a big 29-27 win. As it was, the Irish converted the two-point attempt to tie the game, the Hokies went nowhere on their ensuing possession and punted the ball back to the Irish, who drove far enough into Tech territory to set up a game-winning field goal.
Right result, wrong score - 2022 Boston College at Virginia Tech
The BC game was the only real bright spot in Brent Pry’s first season as head coach. The Hokies won the game 27-10, and it was never in doubt. In comparison, the model predicted a 29.9 to 26.4 Hokie win. I’d like to meet the person who has created a model that accurately predicted just how heinously last year’s BC offensive line would play.
Didn’t see that one coming - 2022 Virginia Tech at Old Dominion
Tech lost 20-17, but the predicted outcome, 32.2 to 23.7 in favor of Tech, made sense in a world in which the Hokies do not commit 15 penalties and turn the ball over 5 times. That’s why they play the games!
Key injury - 2021 Virginia Tech at Boston College
This game was the straw that broke the camel’s back for Justin Fuente’s tenure as head coach. On paper, the game was an even matchup. Phil Jurkovec was a game-time decision after missing multiple games due to injury. The model predicted a 29.2 to 27.4 Virginia Tech win. In reality, Braxton Burmeister got injured early and missed the rest of the game. Knox Kadum stepped in and struggled mightily. The Hokies could not move the ball, so once BC put up a few points, the Eagles just sat on their lead, cruising to a 17-3 win.
Overall, the model’s performance in predicting the scores of Virginia Tech games was reflective of all the test data. And even when the prediction did not match the actual result, the prediction made logical sense, with clear narratives explaining the deviation from the predicted scores.
Coming up!
Next week is finally game week!
I am finalizing new game matchup visualizations and planning a few new bells and whistles.
While I have not decided on a publishing cadence for the season, I am leaning toward a midweek, quantitative, data-visualization-heavy look at the week’s VT game, as well as others involving ACC teams.
I will likely keep publishing columns each Friday, with a tweak here or there due to Thursday night games.
The season is upon us. Finally!
Happy first birthday!