Time for a lunch break post.
Yesterday I was thinking about the various ranking systems out there, such as KenPom, Sagarin, RPI, "BPI", etc. These are attempts to rank teams based on some sort of index, heuristically determined from wins/losses, scoring margins, strength of schedule, and so on. They all generally keep their special formula secret, but they are all trying to do the same thing. I never liked them. They all have flaws in some way, and I always have the sneaking suspicion that problems and biases are built into the formulas somehow or another, even if unintentionally.
But suppose you have a ratings index that goes from 1-100, and you give a default 5 point home game advantage to any team. There exists, unequivocally, a "best" way to give each NCAA team a score such that the rankings match actual outcomes as accurately as possible. However, it is extremely difficult to find in practice. Even if you only allow integer values for the index, that gives you 100 possibilities for each team (1-100), for a total of 100^350 combinations to explore. You can limit the possibilities somewhat, but it doesn't really help. For example, you can constrain Kansas to be above 75, say. But even if you could feasibly constrain every team to a 25-point range, that is still 25^350 possibilities! Even if you already knew, a priori, each team's 'rank' to within +/- 1, you'd still have to compute and search 3^350 possible combinations to ensure you got the "best" one! God help you if you want to compute down to a quarter of a point or something. You will be computing away until the heat death of the universe, even with reasonable constraints put in. So the simple phenomenological formulas that are out there make sense in that context.
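To give a sense of just how hopeless brute force is, here's a quick back-of-the-envelope check of the numbers above (Python, purely for illustration):

```python
import math

# Number of decimal digits in each brute-force search-space size mentioned above
for base in (100, 25, 3):
    digits = math.floor(350 * math.log10(base)) + 1
    print(f"{base}^350 is a {digits}-digit number of combinations")
```

Even the "nearly solved" 3^350 case is a 167-digit number, so constraining the ranges barely dents the problem.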
So my idea was to use a modern optimization scheme to find the right minimum in the data set without computing all 100^350 possibilities. I use surrogate management framework (kriging functions), coupled with a mesh-adaptive-direct-search (MADS) scheme to efficiently and reliably determine a ranking set that gives the minimal error. It is 100% guaranteed to find a minimum in the data set. It makes no promises on how fast it does this, and you could theoretically compute away until the heat death of the universe anyways, but it has been very reliably used for a number of other complex applications. For reference, the processes I used are outlined pretty nicely here:
http://epubs.siam.org/doi/abs/10.1137/040603371
http://ntrs.nasa.gov/search.jsp?R=19980046640
I have some experience with these methods myself, and I've used them successfully in the past:
http://link.springer.com/article/10.1007/s00466-013-0967-z
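For the curious, the core "poll" step of a MADS-type search can be sketched in a few lines. This is a heavily simplified, hypothetical version (plain coordinate polling with mesh halving, no surrogate/kriging model), not my actual code:

```python
def mads_minimize(error_fn, x0, lower, upper,
                  initial_step=5.0, min_step=0.0625, max_evals=600):
    """Simplified mesh-adaptive direct search sketch: poll each coordinate
    +/- the current mesh size, accept improvements, and refine the mesh
    (halve the step) whenever no polled point beats the incumbent."""
    x = list(x0)
    best = error_fn(x)
    step = initial_step
    evals = 1
    while step >= min_step and evals < max_evals:
        improved = False
        for i in range(len(x)):
            for delta in (+step, -step):
                trial = list(x)
                # clamp the polled point to the allowed bounds
                trial[i] = min(upper[i], max(lower[i], trial[i] + delta))
                val = error_fn(trial)
                evals += 1
                if val < best:
                    x, best, improved = trial, val, True
        if not improved:
            step /= 2.0  # mesh refinement: the step my buggy run kept skipping
    return x, best
```

A real MADS implementation polls richer direction sets and uses the surrogate model to decide which points are worth evaluating, but the refine-on-failure loop above is the essential mechanism.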
My wife was out of town last night, so when I got home from work I decided to test it out. I created a database of all AAC games this season. There were 99 of them. I stored the winner, the loser, the score differential, and the home team. I used a scoring system from 0-50, 50 being the best. I also assumed each team has its own unique home court advantage, which I allowed to range from -5 to 15. This gave me 22 variables for 11 teams. I allowed each variable to increment by as little as 0.0625, giving me 800^11*320^11 possible combinations to test. To give you an idea of the scale of the problem, even our little 11-team conference would require an enormous effort on a massively parallel supercomputer to find the correct values by brute force in my lifetime.
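As a sanity check on those numbers, the grid sizes work out like this (Python, illustrative only):

```python
# Grid points per variable at a 0.0625 increment, and the total search space
rating_levels = int((50 - 0) / 0.0625)       # 800 possible power ratings per team
homeadv_levels = int((15 - (-5)) / 0.0625)   # 320 possible home advantages per team
total = rating_levels**11 * homeadv_levels**11
print(rating_levels, homeadv_levels, len(str(total)))
```

The total is a 60-digit number of combinations, just for one 11-team conference.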
I defined the 'error' based on two parameters. The first is simply the percentage of games predicted incorrectly. The second is more complicated, but is based on the predicted score differential versus actual score differential, as shown:
err2 = sqrt( sum( (predicted differential - actual differential)^2 ) / number of games )
This is different from simply using the average error in point differential: it is a root-mean-square error, which skews the value towards the worst misses. Actually, in retrospect, this is probably a mistake, as the worst errors are likely outliers, and we should not punish our rankings for failing to predict outliers. I will probably change this to a simple average, or maybe even a median value. The total error is then just the sum of the two errors described above. The optimization scheme searches through combinations of different rankings and stops when it finds the lowest error value it can.
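In code, the combined error looks roughly like this (a sketch; the tuple fields and names are illustrative, not the exact ones from my script):

```python
import math

def total_error(games, power, home_adv):
    """Combined error for a candidate rating set.
    `games` is a list of (winner, loser, margin, home_team) tuples;
    `power` and `home_adv` map team name -> rating / home-court bonus."""
    wrong = 0
    sq_sum = 0.0
    for winner, loser, margin, home in games:
        # predicted margin from the winner's perspective, with the
        # home team's own court advantage added to its side
        pred = power[winner] - power[loser]
        if home == winner:
            pred += home_adv[winner]
        elif home == loser:
            pred -= home_adv[loser]
        if pred <= 0:
            wrong += 1           # model picked the wrong winner
        sq_sum += (pred - margin) ** 2
    err1 = wrong / len(games)                 # fraction of games missed
    err2 = math.sqrt(sq_sum / len(games))     # RMS margin error
    return err1 + err2
```

The optimizer just treats this as a black box: feed in 22 numbers, get one error value back.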
I set up the problem and ran it for a few hours last night, using ONLY AAC data. It did not finish converging, but it got close. I found a bug with the mesh refinement, where for some reason it only occasionally refined the search space correctly, which is why some numbers are whole integers. So Temple has a power rank of 40 and not, say, 37.9, because the algorithm settled early on the idea that Temple was closer to 40 than 35 (initial spacing was in increments of 5), and then it never refined the search space for that team, which is probably the only reason they are #1. If I ran it again, Tulsa, UCONN, and Temple would get bounced around more, and the whole top 5 could shake up quite a bit on a second run once the kinks are worked out. It found this near-minimum in only 600 computations, which is pretty good compared to 800^11*320^11...
     TEAM    POW    H. ADV
#1   Temple  40     -1
#2   SMU     39.06   5
#3   Cincy   38.75   0.88
#4   Tulsa   35      1.13
#5   UCONN   35      5
#6   Memph   30      5
#7   ECU     25.31   5
#8   Hou     21.25  -2
#9   Tulane  20.63   1.25
#10  UCF     20      5
#11  USF     16.25  10.88
The way to read it: if Temple (40) plays Memphis (30) on a neutral court, the model predicts a 10 point Temple victory. If they play at Memphis, it predicts a 5 point Temple win after accounting for the Memphis home court advantage. At Temple, it predicts a 9 point Temple win, since Temple apparently gets no real advantage from playing at home.
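That reading rule is simple enough to write down directly (a hypothetical helper, just to make the arithmetic explicit):

```python
def predict_margin(pow1, pow2, home_adv=0.0, home="neutral"):
    """Predicted margin for team 1 over team 2; positive means team 1 wins.
    `home` is 'team1', 'team2', or 'neutral'; `home_adv` is the host's own bonus."""
    margin = pow1 - pow2
    if home == "team1":
        margin += home_adv
    elif home == "team2":
        margin -= home_adv
    return margin

print(predict_margin(40, 30))                            # neutral court: Temple by 10
print(predict_margin(40, 30, home_adv=5, home="team2"))  # at Memphis: Temple by 5
print(predict_margin(40, 30, home_adv=-1, home="team1")) # at Temple: Temple by 9
```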
It predicted the winner correctly in 86 of 99 AAC games, but it missed both of our Temple victories. As mentioned, the optimization obsesses over minimizing the point-differential error in the biggest outliers rather than being satisfied with a very low score-differential error across a large number of games. Changing this approach might also change the results some.
I'd like to patch it up and run it again tonight and see what it does. But the wife will return this evening, so we will see if I have time. It might be a weekend project.
If this is something that people find interesting and respond to, I might try to do a weekly computation for the full NCAA next season and post it online somewhere. If this post gets ignored and/or I get dismissed as just some crazy guy, then I probably won't bother. The above ranking has its flaws (obviously Temple should not be #1...), but I'm actually pretty excited about the results, given that this is a half-complete first attempt that did not converge and had a notable bug affecting the results. And yet, it still predicts the winner 87% of the time! Which is really pretty good as far as these things go. Once it has converged and is working properly, it should, by its very nature, outperform every heuristic method that exists, which is an exciting idea.
So now you all know what a bored mathematician does with a free evening. I might have had a beer or two, too, which could have led to that d*mned bug that screwed up the results. Oh well, and thanks for reading this far. No lunch, back to work.