Predictive Multiple Regression Analysis For Sports Betting

QuadDeuces

Hey all, I'm new to the By The Numbers forum, but as an avid sports bettor with a strong background in math, I've always been intrigued by the applications of statistical analysis for sports betting. My idea is to create a detailed multiple regression analysis for game-by-game outcomes using data from the last 3 seasons.

The aim of the analysis is to estimate each team's chance of victory. The design I have in mind is an output variable giving the home team's percent chance of victory, based on a series of independent (predictor) variables.
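One way to get an output that stays between 0% and 100% is a logistic regression rather than an ordinary linear regression. That is a swapped-in choice for illustration, not something settled in the post. The sketch below fits one by plain gradient descent in pure Python. The two features (`win_pct_diff`, `rest_diff`) and every number in the data are made up, and the learning rate and epoch count are arbitrary assumptions.

```python
from math import exp

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_prob(w, b, x):
    """Home-team win probability for one feature row x."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on log loss."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = predict_prob(w, b, xi) - yi  # gradient of log loss w.r.t. the logit
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical rows: [win_pct_diff (home minus away), rest_diff (home minus away days)]
X = [[0.10, 1], [-0.05, 0], [0.20, -1], [-0.15, 0], [0.05, 1],
     [-0.10, -1], [0.15, 0], [-0.20, 1], [0.10, 1], [-0.05, 0]]
y = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]  # 1 = home win (made-up outcomes)

w, b = fit_logistic(X, y)
print("weights:", w, "intercept:", b)
print("P(home win) for first row:", predict_prob(w, b, X[0]))
```

With real data the same shape applies: one row per game, one column per variable, and the fitted probability is the number to compare against the betting line.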

Thus far, the variables I would like to include:

- Season-to-date win%
- Rolling win% over the last 10 games
- Season-to-date starting goalie SVPCT
- Rolling save % over the last 10 games for said goalie
- Historical goalie SVPCT vs opponent
- Team days of rest
- Goaltender days of rest
- Travel distance from prior destination
- Rolling 10 game head to head win%
- Season-to-date Corsi or Fenwick
- Rolling 10 game Corsi or Fenwick
- Season-to-date team shooting %
- Rolling 10 game shooting %

I'd be open to adding other variables or removing any of the current ones. The 10-game window is currently an arbitrary number. I'm certain it can be tuned to produce a better fit, but I think 10 is a good starting point, at least as arbitrary numbers go.
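For the rolling variables, one detail matters: the value attached to a game should use only the games before it, otherwise the feature contains the result it is supposed to predict. A minimal sketch in pure Python, with `window` as the tunable parameter and a made-up results list:

```python
from collections import deque

def rolling_win_pct(results, window=10):
    """Win% over the previous `window` games, for each game in order.

    `results` is a chronological list of 1 (win) / 0 (loss) for one team.
    The value for game i uses only games before i. It is None for the
    first game, where no history exists yet.
    """
    recent = deque(maxlen=window)  # holds at most `window` prior results
    out = []
    for r in results:
        out.append(sum(recent) / len(recent) if recent else None)
        recent.append(r)  # add the current result only after recording the feature
    return out

# Hypothetical results for one team, oldest game first.
results = [1, 0, 1, 1, 0, 1]
print(rolling_win_pct(results, window=3))
```

The same function works for any per-game 0/1 or numeric series, so trying windows of 5, 10, or 20 is just a change to `window`.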

The objective is to run a multiple regression to find which variables best predict a victory, then test other variables to determine which combination produces the strongest r and r^2.

My questions are as follows:

- Is there an easier way to obtain this data than to do it manually? I know I can get the season results from hockey-reference, but then I would have to go into each game manually to get the goalie #s. I have no idea whatsoever how to get the Corsi #s.
- Are there any additional variables that you would suggest as useful?

Thanks in advance, folks. As I said, I'm relatively new here, so I apologize if I'm missing things that are obvious to the rest of you.

Edited to add: Special teams #s would also likely be useful. I'm also aware that this will not be a perfect system. The idea is to find statistically significant discrepancies between the Vegas odds and the win probabilities from the regression analysis. The simplest example would be a team at a standard -110 line that the regression model gives a 75% chance of winning, with a strong r. In this case (depending on the exact r), it seems that we would have a positive ROI, which can be calculated on a per-game basis pretty easily.
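To put numbers on the -110 example: a -110 line implies a break-even win probability of 110/210, roughly 52.4% (this figure includes the bookmaker's margin). If the model's 75% were accurate, the expected return would be about 0.43 units per unit staked. A small sketch of that arithmetic, with function names chosen here for illustration:

```python
def implied_prob(american_odds):
    """Break-even win probability implied by an American moneyline."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)

def profit_per_unit(american_odds):
    """Profit on a winning bet, per 1 unit staked."""
    if american_odds < 0:
        return 100 / -american_odds
    return american_odds / 100

def expected_roi(model_prob, american_odds):
    """Expected profit per unit staked, given the model's win probability."""
    return model_prob * profit_per_unit(american_odds) - (1 - model_prob)

line = -110
model_prob = 0.75  # the hypothetical model output from the example
print(f"implied probability: {implied_prob(line):.4f}")
print(f"expected ROI per unit: {expected_roi(model_prob, line):.4f}")
```

The ROI is only as good as the probability fed into it, so the calibration of the model matters more than the formula.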
 

eperry

You can download the full Play-By-Play by season from Corsica.Hockey in RData form.

My first thought, and I'm guessing this is something you're well aware of, is the multicollinearity between many of these variables. I would try to group them based on similarity (L10GP Corsi and Fenwick, for example) and perform variable selection from each group. Rest days and goalie talent are sneaky in that way because the largest effect of back-to-back games is the decrease in talent from starter to backup. Travel distance strikes me as a nothing input, and I suspect your trials will show that.
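A quick first pass at the grouping idea is to compute pairwise correlations and flag pairs that move together. The sketch below does that in pure Python. The feature names, the values, and the 0.8 cutoff are all assumptions for illustration, and pairwise correlation will not catch collinearity that involves three or more variables at once.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(features, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(features.items(), 2):
        r = pearson(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Made-up values for six games; real inputs would come from the play-by-play data.
features = {
    "corsi_l10":   [48.0, 51.5, 53.0, 49.5, 55.0, 46.5],
    "fenwick_l10": [47.5, 51.0, 53.5, 50.0, 54.5, 47.0],
    "travel_km":   [300, 2500, 800, 1200, 400, 2000],
}
for pair in correlated_pairs(features):
    print(pair)
```

Each flagged pair is a candidate group from which to keep one variable.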

Secondly, I'd lean towards using more data whenever possible. Even if you're cross-validating, you risk overfitting to 3 years of data.

Lastly, if you're testing a logistic model, you don't want to use R^2 as a measure of success. Use logarithmic loss or AUC.
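Both metrics are short enough to write out by hand, which makes it clearer what they measure. Log loss penalizes confident wrong probabilities, and AUC is the share of (win, loss) pairs in which the win received the higher predicted probability. A pure-Python sketch on made-up predictions follows. scikit-learn's `log_loss` and `roc_auc_score` compute the same quantities if you would rather not hand-roll them.

```python
from math import log

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the outcomes under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)

def auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    score = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos for n in neg)
    return score / (len(pos) * len(neg))

# Hypothetical outcomes (1 = home win) and model probabilities.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.70, 0.40, 0.55, 0.60, 0.80]
print("log loss:", log_loss(y_true, y_prob))
print("AUC:", auc(y_true, y_prob))
```

A useful baseline: always predicting 50% gives a log loss of ln 2 (about 0.693) and an AUC of 0.5, so a model has to beat both to be worth anything.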
 
