See also Mark Hopkins' comments.
Warning: I am not a neutral observer. My purpose is to tout my Performance Rating system. Be sure to see Mike Zenor's response and the subsequent e-mail at the end.
Mike Zenor has two rating systems, one that explains point spreads (BOMB) and one that explains win-lose-tie records (Just-Win-Baby). For predicting point spreads, Darryl Marsee has a rival system. (Marsee's system has a major difference in its objective, however: Marsee tries to predict the future rather than explain the past.) For explaining win-lose-tie records, the Performance Rating system is a rival.
When Mike Zenor writes "this proves that the vector R* now contains the absolute, without argument, dead-certain-best power ratings for explaining all game outcomes," he is mistaken.
For explaining point spreads, one could reduce the error by including more factors than just the teams' strength. Possible factors include home field, artificial turf, and the opponent's style of offense. One could use
r(a) - r(b) + h(a)*H(a,b) - h(b)*H(b,a) + at(a)*AT(a,b)
     - at(b)*AT(a,b) + rsh(a)*RSH(b) - rsh(b)*RSH(a)
to explain the result, where:
r(x)      team x's rating
h(x)      number of extra points for team x if at home
H(x,y)    1 if team x is at home for the game between
          teams x and y; 0 otherwise
at(x)     number of extra points for team x on artificial
          turf (may be negative)
AT(a,b)   1 if game between a and b on artificial turf;
          0 otherwise
rsh(x)    number of extra points for team x if opponent
          is primarily a rushing team (may be negative)
RSH(x)    ratio of rushing yards to total yards for
          team x
Thus, each team would be characterized by four numbers: r(x), the power rating; h(x), the home field advantage adjustment; at(x), the artificial turf adjustment; and rsh(x), the rushing opponent adjustment.
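As a rough sketch of how the expanded formula would be evaluated for a single game, consider the following. The `TeamParams` container, the function name, and all of the sample numbers are hypothetical and chosen only for illustration; they are not fitted values from any rating system.

```python
from dataclasses import dataclass


@dataclass
class TeamParams:
    r: float           # r(x): power rating
    h: float           # h(x): extra points when at home
    at: float          # at(x): extra points on artificial turf (may be negative)
    rsh: float         # rsh(x): extra points against a rushing opponent
    rush_ratio: float  # RSH(x): rushing yards / total yards


def estimated_margin(a, b, a_home, b_home, on_turf):
    """Estimated point spread (team a minus team b) under the four-number model."""
    H_ab = 1.0 if a_home else 0.0
    H_ba = 1.0 if b_home else 0.0
    AT = 1.0 if on_turf else 0.0
    return (a.r - b.r
            + a.h * H_ab - b.h * H_ba
            + a.at * AT - b.at * AT
            + a.rsh * b.rush_ratio - b.rsh * a.rush_ratio)


# Made-up parameters for two teams, for illustration only.
team_a = TeamParams(r=30.0, h=3.0, at=1.0, rsh=2.0, rush_ratio=0.6)
team_b = TeamParams(r=20.0, h=4.0, at=-1.0, rsh=-2.0, rush_ratio=0.5)
print(estimated_margin(team_a, team_b, a_home=True, b_home=False, on_turf=True))
```

Swapping the two teams (and their home flags) flips the sign of the estimate, which is what one would expect of a point-spread model.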
Mike Zenor does say that he prefers a system that looks solely at the teams' win-lose-tie record. Even with this restriction, the Performance Rating system explains more of the variation in the game results than his Just-Win-Baby system because the Performance Rating system uses a non-linear estimation formula. For the Just-Win-Baby system, the error in estimating a game is calculated using:
m(a,b) = r(a) - r(b) + error(a,b)
where team "a" either won or tied, r(x) is team x's rating and m(a,b) is 1 if team "a" won or 0 for a tie. For the Performance Rating system, the corresponding formulas are:
1 = minimum( 1, r(a) - r(b) ) + error(a,b)
if team "a" won, or
0 = r(a) - r(b) + error(a,b)
if the two teams tied, where r(x) is the Performance Rating divided by 100 for team x. This recognizes the fact that if the teams are badly mismatched with r(a) more than 1 above r(b), team "a" can still do no more than win the game.
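The difference between the two error formulas is easiest to see on a mismatch. The sketch below uses hypothetical function names and made-up ratings, with both ratings assumed to be on the same scale (the Performance Rating already divided by 100).

```python
def jwb_error(m, r_a, r_b):
    """Error under the linear rule: m = r(a) - r(b) + error."""
    return m - (r_a - r_b)


def prs_error(m, r_a, r_b):
    """Error under the Performance Rating rule.

    m is 1 if team a won and 0 if the teams tied. For a win the
    estimate is capped at 1; for a tie it is the raw difference.
    """
    diff = r_a - r_b
    estimate = min(1.0, diff) if m == 1 else diff
    return m - estimate


# Hypothetical mismatch: team a is rated 1.5 above team b and wins.
print(jwb_error(1, 2.0, 0.5))  # -0.5: the linear rule charges an error for the "extra" 0.5
print(prs_error(1, 2.0, 0.5))  # 0.0: the capped estimate matches the win exactly
```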
For comparing the two systems, let's use yet another formula:
m(a,b) = minimum(1, maximum(-1, r(a)-r(b))) + error(a,b)
This modifies out-of-range estimates for both systems to be in the [-1,1] range. Because the two systems treat Non-Division I-A opponents differently (each Non-Division I-A team gets a separate Performance Rating; the Just-Win-Baby system treats Non-Division I-A teams as if they were a single team; both systems consider only games with Division I-A opponents), let's use only the games where both teams are in Division I-A. The sums of the errors squared for the games through Oct. 28, 1995 were 140.7 for the Performance Ratings and 177.6 for the Just-Win-Baby Ratings.
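A sum of errors squared under this comparison formula could be computed along the following lines. This is only a sketch: the function names, the three team labels, and the ratings are hypothetical, and it assumes m is 1 if team "a" won, 0 for a tie, and -1 if team "b" won.

```python
def clamped_estimate(r_a, r_b):
    """Rating difference limited to the [-1, 1] range."""
    return min(1.0, max(-1.0, r_a - r_b))


def sum_squared_error(games, ratings):
    """games: list of (team_a, team_b, m) tuples; ratings: dict of team -> rating."""
    total = 0.0
    for a, b, m in games:
        err = m - clamped_estimate(ratings[a], ratings[b])
        total += err * err
    return total


# Made-up ratings and results, for illustration only.
ratings = {"A": 1.2, "B": 0.0, "C": -0.5}
games = [("A", "B", 1),   # estimate clamped to 1, error 0
         ("B", "C", 0),   # tie, estimate 0.5, error -0.5
         ("C", "A", 1)]   # upset, estimate clamped to -1, error 2
print(sum_squared_error(games, ratings))
```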
1. "For explaining point spreads, one could reduce the error by including more factors that just the scores."
I assume here that by "scores" David means the teams playing. [Yes--this has now been fixed.] This point is valid. One could certainly reduce error by including information on who was the home team, weather conditions, cheerleader hemlines, etc. Such information can easily be included in the JWB or BOMB ratings as covariates. However, if a rating system truly took these factors into account, a team's rating would necessarily be conditional on the other factors. In other words, instead of a single rating for Nebraska, there would have to be separate ratings for Nebraska at home, away, on grass, against teams with green helmets, etc. Note that all the ratings on this page, including the PRS, provide a single rating point for each team irrespective of these factors.
2. "The sums of the errors squared for the games through Oct. 28, 1995 were 140.7 for the Performance Ratings and 177.6 for the Just-Win-Baby Ratings."
The sum of errors reported here is incorrect. I have computed the variance accounted for in games through November 11, using the correct procedure from Stuart & Ord, Kendall's Advanced Theory of Statistics. For those interested, I have attached a detailed description of the analysis, and will provide the data upon request.
            Margin    Win/Loss
BOMB        67.7%     44.2%
JWB         56.5%     53.0%
Wilson      49.9%     43.0%
Marsee      66.0%     44.5%
Lightner    47.3%     45.7%
The BOMB ratings explain 67.7% of the variation in victory margins, exceeding all the others posted on this web page. Likewise, the JWB ratings explain the highest percent of the variation in pure win-loss. As I show in the mathematical proof, no set of unconditional ratings (i.e., where there is a single rating number for each team) can exceed them.
3. "Michael Zenor has two [ratings], one that explains point spreads (BOMB) and one that explains win-lose-tie records (Just-Win-Baby). For predicting point spreads, Darryl Marsee has a rival system. (Marsee's system has a major difference in its objective, however: Marsee tries to predict the future rather than explain the past.) For explaining win-lose-tie records, the Performance Rating system is a rival."
Like Marsee's, the BOMB and JWB can be used to predict the future (see the related predictions file). The principal difference here is that the JWB/BOMB ratings are statistical estimators that minimize a known loss function (squared error). The PRS is a non-statistical scoring method.
4. "Consider what would have happened if Florida State had gone unbeaten and untied, beating Florida in the last game of the season and Nebraska in a bowl game. (Florida State lost to Virginia on Nov. 2.) Florida State had no chance at all of ending up on top in the Just-Win-Baby system. That's because all those games against weak opponents would have continued to be counted in the ratings." [This was in the original critique but has been removed because, as shown below, it was incorrect.]
This is incorrect. In the current JWB ratings, FSU is #19. I recalculated the JWB, adding the following hypothetical games suggested in David Wilson's comments: (1) FSU beats, rather than loses to, UVa; (2) FSU beats Florida; (3) FSU beats Nebraska; (4) Ohio St. loses to Michigan; (5) Northwestern loses to USC in the hypothetical Rose Bowl. Guess what? FSU goes to #1 in the JWB ratings.
Below are the results of the first four games of the season. Appended to the right are two columns. A is the observed victory margin. B ignores any victory margin above 1 point (i.e., all wins are treated as one-point wins). Mathematically, column B = min(column A, 1). Thus, column B is pure win-loss information.
                                      A     B
Ohio State 38, Boston College 6      32     1
Michigan 18, Virginia 17              1     1
Iowa State 36, Ohio 21               15     1
Nebraska 64, Oklahoma State 21       43     1
&etc.
Since we are analyzing differences, each game contains two symmetric pieces of information: the positive victory margin and its corresponding negative loss margin. This is called the "image" or complement vector. See Stuart and Ord, Kendall's Advanced Theory of Statistics, for a lengthy discussion of correlating differences and the related topic of regression through the origin. Below is the original data augmented with the image vectors.
                                      A     B
Ohio State 38, Boston College 6      32     1
Ohio State 38, Boston College 6     -32    -1
Michigan 18, Virginia 17              1     1
Michigan 18, Virginia 17             -1    -1
Iowa State 36, Ohio 21               15     1
Iowa State 36, Ohio 21              -15    -1
Nebraska 64, Oklahoma State 21       43     1
Nebraska 64, Oklahoma State 21      -43    -1
&etc.
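The augmentation step can be sketched in a few lines. The function name and row layout here are my own choices for illustration; the four games and margins are the ones listed above.

```python
def with_images(games):
    """Expand (winner, loser, margin) games into rows that include the image vectors.

    Each output row is (label, A, B), where A is the victory margin and
    B = min(A, 1) is the pure win-loss value; the image row negates both.
    """
    rows = []
    for winner, loser, margin in games:
        a = margin
        b = min(margin, 1)
        label = f"{winner} vs. {loser}"
        rows.append((label, a, b))
        rows.append((label, -a, -b))
    return rows


games = [("Ohio State", "Boston College", 32),
         ("Michigan", "Virginia", 1),
         ("Iowa State", "Ohio", 15),
         ("Nebraska", "Oklahoma State", 43)]

for label, a, b in with_images(games):
    print(f"{label:32s} {a:4d} {b:3d}")
```

Because every row is paired with its negative, both columns sum to zero, which is what makes this equivalent to a regression through the origin.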
The question now is how well the rating scale differences correlate with these actual margins. To illustrate, the table below shows the various scale differences, and their images, for the Ohio State-Boston College game.
                      BOMB      JWB    WILSON   MARSEE   LIGHTNER
Ohio State           33.04     1.38       819      949      98.36
Boston College        0.82    -0.12       447      534      17.81
Scale Difference:    32.22     1.51       372      415      80.55
Image:              -32.22    -1.51      -372     -415     -80.55
These scale differences can now be appended to the original data set:
                                      A     B     BOMB      JWB   WILSON   MARSEE   LIGHTNER
Ohio State 38, Boston College 6      32     1    32.22     1.51      372      415      80.55
Ohio State 38, Boston College 6     -32    -1   -32.22    -1.51     -372     -415     -80.55
Michigan 18, Virginia 17              1     1     3.65     0.54       86       53      29.82
Michigan 18, Virginia 17             -1    -1    -3.65    -0.54      -86      -53     -29.82
Iowa State 36, Ohio 21               15     1    18.78     0.74      114      182       5.08
Iowa State 36, Ohio 21              -15    -1   -18.78    -0.74     -114     -182      -5.08
Nebraska 64, Oklahoma State 21       43     1    45.65     1.41      596      562      79.84
Nebraska 64, Oklahoma State 21      -43    -1   -45.65    -1.41     -596     -562     -79.84
        :                             :     :      :        :         :        :         :
&etc.
As of November 11, there had been 505 games between Division I-A teams, providing 1010 observations for the correlation analysis. I will provide the dataset upon request. The table below shows the correlation between the 5 different rating methods and actual game outcomes - either the margin or the Win/Loss.
            Margin    Win/Loss
BOMB        0.823     0.665
JWB         0.759     0.728
Wilson      0.706     0.655
Marsee      0.813     0.667
Lightner    0.688     0.676
Squaring these correlation coefficients gives the variance accounted for:
            Margin    Win/Loss
BOMB        67.7%     44.2%
JWB         56.5%     53.0%
Wilson      49.9%     43.0%
Marsee      66.0%     44.5%
Lightner    47.3%     45.7%
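For readers who want to reproduce the mechanics, here is a minimal sketch of the correlation step, using only the four sample games and BOMB scale differences shown earlier in this analysis. The helper name is hypothetical, and because this is four games rather than the full 505-game data set, the number it prints is not comparable to the table above.

```python
import math


def pearson(xs, ys):
    """Ordinary Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Actual margins and BOMB scale differences for the four sample games.
margins = [32, 1, 15, 43]
bomb_diffs = [32.22, 3.65, 18.78, 45.65]

# Append the image vectors so every game is seen from both sides.
ys = margins + [-m for m in margins]
xs = bomb_diffs + [-d for d in bomb_diffs]

r = pearson(xs, ys)
print(f"r = {r:.3f}, variance accounted for = {100 * r * r:.1f}%")
```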
While Darryl Marsee was close, the BOMB ratings provide the best accounting for actual margins and the JWB ratings provide the best accounting for pure win-loss. Neither result is surprising: in the description of the ratings, I provide a proof that for explaining victory margin, the BOMB index minimizes squared error - thereby maximizing explained variance. Likewise the JWB minimizes squared error in Win-Loss.
In conclusion, my original point stands. It is mathematically impossible for
any unconditional team rating index (one rating number per team) to have a
higher correlation with game outcomes.
Subsequent E-mail
y-hat(a,b) = minimum(1, maximum(-1, (rating(a)-rating(b)) / 100))
where the observed y(a,b) is 1 if team "a" won, 0 if the two teams tied, and -1 if team "b" won. For the case where the higher rated team (say, team "a") won, the system makes explicit use of the "minimum" function in the estimation formula to, in effect, exclude "wins that would lower the rating and losses that would raise the rating." Please recalculate the percentage of variance explained using the above formula for the estimated game result.
I was wrong about Florida State and will remove that paragraph from the critique.
To illustrate, I computed the correlation between the PRS and actual outcomes, using your "y hat" formula. It was 0.785, or a "variation accounted for" of 61.7%. This is indeed higher than the JWB (53%). But, fair is fair - if you can choose your own discontinuous non-linear function for constructing predicted y(a,b), so can I. I'll choose this for JWB:
y-hat(a,b) = exp(JWB(a)-JWB(b)) if JWB(a)>JWB(b)
y-hat(a,b) = -exp(JWB(b)-JWB(a)) if JWB(b)>JWB(a)
If I do this, the correlation between the rescaled JWB difference and the same criterion is 0.887, or "variation accounted for" of 78.7%. To summarize:
        VAF      "VAF" using preferred prediction rule
        =====    =====
JWB     53.0%    78.7%
PRS     43.0%    61.6%
These results illustrate why the original analysis is correct. Note that I can arbitrarily raise "VAF" from 53% to 79% with the same exact set of JWB ratings. Quite clearly, I did not suddenly improve the quality of the JWB ratings, for they remain the same. The increase had nothing to do with the ratings themselves, and everything to do with my (arbitrary) non-linear prediction rule. The real question is how much variation is explained by the scale itself; this is column one.
There are, to be sure, a number of useful non-linear models that could be applied to this problem. In particular, binomial LOGIT and PROBIT models (see G.S. Maddala, Analysis of Limited Dependent Variables) could be used to derive ratings for teams. Rather than minimize squared error, these models maximize a likelihood or entropy function.
My y-hat function is not arbitrary since it's what I have in mind when calculating the ratings. It's also not discontinuous although its first derivative is discontinuous.
Since the y-hat function has to approach 1 as the ratings difference goes to positive infinity, -1 as the ratings difference goes to negative infinity, and be 0 when the ratings difference is 0, I'm hoping to be able to use a y-hat function of the form:
y-hat(a,b) = (2/pi) * arctan(v*(rating(a)-rating(b)))
where v is some scaling factor. Whether or not that will pan out I don't know yet.
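For what it's worth, the arctan form is easy to try out. In this sketch the function name is hypothetical and the value of v is an arbitrary placeholder, not a fitted value.

```python
import math


def y_hat_arctan(rating_a, rating_b, v):
    """Smooth estimate in (-1, 1): (2/pi) * arctan(v * (rating(a) - rating(b)))."""
    return (2.0 / math.pi) * math.atan(v * (rating_a - rating_b))


v = 0.01  # placeholder scaling factor, chosen only for illustration
for diff in (-1000, -100, 0, 100, 1000):
    print(diff, round(y_hat_arctan(diff, 0.0, v), 3))
```

The function is 0 at a ratings difference of 0, is antisymmetric in the two teams, and approaches 1 and -1 at the extremes without ever reaching them, so unlike the minimum/maximum form it has a continuous first derivative.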
On your BOMB rating system, the restriction that each team's rating be a single number leaves only one possible improvement I can think of: introducing a universal home field advantage value.
m(a,b) = r(a) - r(b) + q*h(a,b) + error(a,b)
where h(a,b) is 1 if a is at home, -1 if b is at home, and 0 if neither is at home, and q the universal home field advantage value. To actually work with this, you'd use:
m(a,b) - q*h(a,b) = r(a) - r(b) + error(a,b)
and repeatedly solve this, trying various values for q. To avoid having the q value change each week, you could use the '93 and '94 games to determine the best q value and stick with that.
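The search over q could be sketched as follows. Everything here is an assumption made for illustration: the function names, the use of Gauss-Seidel sweeps to solve the least-squares problem, the candidate q values, and the toy schedule, which was constructed from made-up ratings with a home field value of 3 so that the right answer is known in advance.

```python
def fit_ratings(games, q, sweeps=200):
    """Least-squares ratings for a fixed q.

    games: list of (a, b, margin, h) with h = 1 if a is at home,
    -1 if b is at home, 0 if neither. Each team's rating is repeatedly
    set to the average of (adjusted margin + opponent rating), which
    is the least-squares condition for that team.
    """
    teams = sorted({team for a, b, _, _ in games for team in (a, b)})
    ratings = {team: 0.0 for team in teams}
    for _ in range(sweeps):
        for team in teams:
            total, count = 0.0, 0
            for a, b, margin, h in games:
                adjusted = margin - q * h
                if a == team:
                    total += adjusted + ratings[b]
                    count += 1
                elif b == team:
                    total += ratings[a] - adjusted
                    count += 1
            ratings[team] = total / count
        # Ratings are only defined up to a constant; keep the mean at zero.
        mean = sum(ratings.values()) / len(ratings)
        for team in ratings:
            ratings[team] -= mean
    return ratings


def sum_squared_error(games, ratings, q):
    return sum((m - q * h - (ratings[a] - ratings[b])) ** 2
               for a, b, m, h in games)


def best_q(games, candidates):
    """Try each candidate q and keep the one with the smallest squared error."""
    return min(candidates,
               key=lambda q: sum_squared_error(games, fit_ratings(games, q), q))


# Toy schedule built from ratings A=10, B=0, C=-5 and a home field value of 3.
games = [("A", "B", 13, 1), ("A", "B", 7, -1),
         ("B", "C", 8, 1), ("B", "C", 2, -1),
         ("A", "C", 15, 0)]
q = best_q(games, [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(q, fit_ratings(games, q))
```

With real data one would presumably use a finer grid of q values and a proper linear solver, but the structure of the search would be the same.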
For predictions, your system can definitely be improved early in the season. For example, some kind of weighted average of last year's final ratings with the current ratings could produce new ratings that would do a better job of predicting the next week's games even though these new ratings would do a worse job of explaining the results of games to date.
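One simple form such a weighted average could take is shown below. The function name, the single fixed weight, and the sample ratings are all hypothetical; how the weight should change as the season progresses is left open.

```python
def blend_ratings(last_year, current, weight_current):
    """Weighted average of this year's ratings with last year's final ratings.

    weight_current is between 0 and 1. A team with no rating from last
    year simply keeps its current rating.
    """
    blended = {}
    for team, rating in current.items():
        previous = last_year.get(team, rating)
        blended[team] = weight_current * rating + (1.0 - weight_current) * previous
    return blended


# Made-up ratings early in the season, trusting last year's numbers more.
last_year = {"A": 80.0}
current = {"A": 40.0, "B": 10.0}
print(blend_ratings(last_year, current, weight_current=0.25))
```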
y-hat(a,b) = (2/pi) * arctan(1.80*x + 4.55*x^3)
where * is multiplication, x^3 is x cubed, and
x = (rating(a)-rating(b)) / 100
By the way, you can get my ratings to 5 decimal places by linking to the description of my system from the RSFC homepage, then to "Programming Notes" and then to "byname.txt". I also saved the numbers for Nov. 11.
y-hat = sign(x)*exp(|x|)
x = JWB(a)-JWB(b)
where sign(x) = 1 if x>0; 0 if x=0; -1 if x<0
must be missing some scaling factors since exp(|x|) is always outside the interval (-1,1) except at x=0. You'd do better with just sign(x). I'll introduce a couple of parameters and do a fit before using the function to calculate sum of error squared:
y-hat = a*sign(x)*exp(b*|x|)
y-hat(a,b) = - y-hat(b,a)
so that only one estimation is produced for a game? If not, you should enter all the games into the analysis twice--once from the point of view of each team. Otherwise, the analysis will be trying to explain just the variation between games that were tied and games that weren't. If the analysis is producing y-hat functions that are symmetric about the origin, entering each game twice should make no difference at all in the results.
                                   Sum error          % variation
                                   squared            explained
y-hat formula                      JWB      PRS       JWB     PRS
------------------------------     ------   -------   -----   -------
x                                     467      1679      53       -69
min(1,max(-1,x))                      445       381      55        62
(2/pi)*atan(1.80*x + 4.55*x^3)        425       374      57        62
sign(x)*exp(|x|)                     2648   4012688    -166   -403591
sign(x)                               616       544      38        45
0.435*sign(x)*exp(0.694*|x|)          455     18770      54     -1788
You had:
        VAF      "VAF" using preferred prediction rule
        =====    =====
JWB     53.0%    78.7%
PRS     43.0%    61.6%
So I match your 53% and 62% figures. For the PRS figure for the y-hat formula of just x, you get 43% and I get -69% because your analysis is doing an extra fit before calculating the % variation explained and my analysis does no additional fit. x is either the JWB number from compare.txt or the Wilson number divided by 100.
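The distinction between the two calculations can be shown on a toy example. This is only a sketch with hypothetical function names and made-up rating differences: the squared correlation implicitly allows a best linear rescaling of the estimates, while 1 - SSE/SST takes the estimates exactly as given and can therefore go negative.

```python
def squared_correlation(xs, ys):
    """r squared: unchanged if xs is multiplied by any positive constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def explained_without_fit(estimates, ys):
    """1 - SSE/SST, using the estimates as given with no additional fit."""
    n = len(ys)
    my = sum(ys) / n
    sst = sum((y - my) ** 2 for y in ys)
    sse = sum((y - e) ** 2 for e, y in zip(estimates, ys))
    return 1.0 - sse / sst


# Three made-up games seen from the winner's side (the third is an upset),
# followed by the same games seen from the loser's side.
diffs = [0.5, 1.5, -0.2]
xs = diffs + [-d for d in diffs]
ys = [1, 1, 1, -1, -1, -1]

stretched = [3 * x for x in xs]  # same ratings on a wider scale
print(squared_correlation(xs, ys), squared_correlation(stretched, ys))
print(explained_without_fit(xs, ys), explained_without_fit(stretched, ys))
```

Stretching the scale leaves the first figure alone but drives the second one below zero, which is the same effect as using the Wilson numbers without dividing by a suitable constant.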
In the 0.435*sign(x)*exp(0.694*|x|) formula, the 0.435 and 0.694 numbers come from a non-linear least squares fitting program. You must have a better fit. Please send it to me.
From: Edward Kambour <ekambour@prosx.com>
Subject: Least squares ratings
Date: Wed, 4 Nov 1998 11:39:06 -0600
Just a quick note on the "Zenor" method. Least squares of this type results in Best Linear Unbiased Estimates (BLUEs). However, such estimators are inadmissible (that is, there are estimators that are closer in terms of mean-squared error regardless of the true parameter values). In fact Empirical Bayes and shrinkage estimators are better everywhere (but they're not unbiased, and not necessarily linear).
David Wilson / dwilson@cae.wisc.edu