Critique of Mike Zenor's Rating Systems

See also Mark Hopkins' comments.

Warning: I am not a neutral observer. My purpose is to tout my Performance Rating system. Be sure to see Mike Zenor's response and the subsequent e-mail at the end.

Mike Zenor has two rating systems, one that explains point spreads (BOMB) and one that explains win-lose-tie records (Just-Win-Baby). For predicting point spreads, Darryl Marsee has a rival system. (Marsee's system has a major difference in its objective, however: Marsee tries to predict the future rather than explain the past.) For explaining win-lose-tie records, the Performance Rating system is a rival.

When Mike Zenor writes "this proves that the vector R* now contains the absolute, without argument, dead-certain-best power ratings for explaining all game outcomes," he is mistaken.

The BOMB Rating System

For explaining point spreads, one could reduce the error by including more factors than just the teams' strength. Possible factors include home field advantage, artificial turf, and whether the opponent is primarily a rushing team.

For example, one could use the formula:
   r(a) - r(b) + h(a)*H(a,b) - h(b)*H(b,a) + at(a)*AT(a,b)
    - at(b)*AT(a,b) + rsh(a)*RSH(b) - rsh(b)*RSH(a)

to explain the result, where:

   r(x)    team x's rating
   h(x)    number of extra points for team x if at home
   H(x,y)  1 if team x is at home for the game between
                teams x and y; 0 otherwise
   at(x)   number of extra points for team x on artificial
                turf (may be negative)
   AT(a,b) 1 if game between a and b on artificial turf;
                0 otherwise
   rsh(x)  number of extra points for team x if opponent
                is primarily a rushing team (may be neg.)
   RSH(x)  ratio of rushing yards to total yards for
                team x.

Thus, each team would be characterized by four numbers: r(x), the power rating; h(x), the home field advantage adjustment; at(x), the artificial turf adjustment; and rsh(x), the rushing opponent adjustment.
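
A minimal sketch of how such an extended model could be fit by ordinary least squares. The team names, games, and covariate values below are made up for illustration, and numpy's least-squares routine stands in for whatever solver one would actually use; this is not Zenor's procedure.

    import numpy as np

    # Sketch: least-squares fit of the extended model above (hypothetical data).
    teams = ["AAA", "BBB", "CCC"]
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)

    # (team_a, team_b, margin, a_at_home, on_turf, RSH(a), RSH(b)) -- made-up records;
    # team_a is listed first and is the home team whenever a_at_home is 1.
    games = [("AAA", "BBB", 10, 1, 0, 0.55, 0.40),
             ("BBB", "CCC",  3, 1, 1, 0.40, 0.60),
             ("CCC", "AAA", -7, 1, 0, 0.60, 0.55)]

    # Unknowns: [r | h | at | rsh], four numbers per team.
    rows, y = [], []
    for a, b, margin, home, turf, rsh_a, rsh_b in games:
        x = np.zeros(4 * n)
        x[idx[a]] += 1.0;          x[idx[b]] -= 1.0           # r(a) - r(b)
        x[n + idx[a]] += home                                 # h(a)*H(a,b); H(b,a) is 0 here
        x[2*n + idx[a]] += turf;   x[2*n + idx[b]] -= turf    # at(a)*AT(a,b) - at(b)*AT(a,b)
        x[3*n + idx[a]] += rsh_b;  x[3*n + idx[b]] -= rsh_a   # rsh(a)*RSH(b) - rsh(b)*RSH(a)
        rows.append(x)
        y.append(margin)

    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(y, dtype=float), rcond=None)
    print("r:",   coef[:n])
    print("h:",   coef[n:2*n])
    print("at:",  coef[2*n:3*n])
    print("rsh:", coef[3*n:])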

The Just-Win-Baby Rating System

Mike Zenor does say that he prefers a system that looks solely at the teams' win-lose-tie record. Even with this restriction, the Performance Rating system explains more of the variation in the game results than his Just-Win-Baby system because the Performance Rating system uses a non-linear estimation formula. For the Just-Win-Baby system, the error in estimating a game is calculated using:

   m(a,b) = r(a) - r(b) +  error(a,b)

where team "a" either won or tied, r(x) is team x's rating and m(a,b) is 1 if team "a" won or 0 for a tie. For the Performance Rating system, the corresponding formulas are:

   1 = minimum( 1, r(a) - r(b) ) +  error(a,b)

if team "a" won, or

   0 =  r(a) - r(b) +  error(a,b)

if the two teams tied, where r(x) is the Performance Rating divided by 100 for team x. This recognizes the fact that if the teams are badly mismatched with r(a) more than 1 above r(b), team "a" can still do no more than win the game.

For comparing the two systems, let's use yet another formula:

   m(a,b) = minimum(1, maximum(-1, r(a)-r(b))) + error(a,b)

This modifies out-of-range estimates for both systems to be in the [-1,1] range. Because the two systems treat Non-Division I-A opponents differently (each Non-Division I-A team gets a separate Performance Rating; the Just-Win-Baby system treats Non-Division I-A teams as if they were a single team; both systems consider only games with Division I-A opponents), let's use only the games where both teams are in Division I-A. The sums of the errors squared for the games through Oct. 28, 1995 were 140.7 for the Performance Ratings and 177.6 for the Just-Win-Baby Ratings.
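
A minimal sketch of this comparison, assuming the ratings are kept in dictionaries keyed by team name. The rating values and games below are illustrative placeholders, not the actual 1995 numbers.

    # Sketch: sum of squared errors under the common clipped estimate
    #   m_hat(a,b) = min(1, max(-1, r(a) - r(b)))
    # where m(a,b) is 1 if team a won and 0 for a tie.
    def clipped_sse(ratings, games, scale=1.0):
        """games: list of (team_a, team_b, outcome) with outcome = 1 (a won) or 0 (tie)."""
        sse = 0.0
        for a, b, outcome in games:
            diff = (ratings[a] - ratings[b]) / scale
            est = min(1.0, max(-1.0, diff))
            sse += (outcome - est) ** 2
        return sse

    # Illustrative values only.
    jwb = {"Nebraska": 1.41, "Florida": 1.30, "Ohio State": 1.38}
    prs = {"Nebraska": 131.0, "Florida": 128.0, "Ohio State": 119.0}   # divided by 100 below
    games = [("Nebraska", "Florida", 1), ("Ohio State", "Florida", 0)]

    print("JWB sum error squared:", clipped_sse(jwb, games))
    print("PRS sum error squared:", clipped_sse(prs, games, scale=100.0))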

Mike Zenor's Response

In his critique of the BOMB & JWB ratings, David Wilson makes some interesting observations, but none disprove my original thesis. Allow me to address each in turn:

1. "For explaining point spreads, one could reduce the error by including more factors that just the scores."

I assume here that by "scores" David means the teams playing. [Yes--this has now been fixed.] This point is valid. One could certainly reduce error by including information on who was the home team, weather conditions, cheerleader hemlines, etc. Such information can easily be included in the JWB or BOMB ratings as covariates. However, if a rating system truly took these factors into account, a team's rating would necessarily be conditional on the other factors. In other words, instead of a single rating for Nebraska, there would have to be separate ratings for Nebraska at home, away, on grass, against teams with green helmets, etc. Note that all the ratings on this page, including the PRS, provide a single rating for each team irrespective of these factors.

2. "The sums of the errors squared for the games through Oct. 28, 1995 were 140.7 for the Performance Ratings and 177.6 for the Just-Win-Baby Ratings."

The sum of errors reported here is incorrect. I have computed the variance accounted for in games through November 11, using the correct procedure from Stuart & Ord, Kendall's Advanced Theory of Statistics. For those interested, I have attached a detailed description of the analysis, and will provide the data upon request.

              Margin  Win/Loss
BOMB           67.7%   44.2%
JWB            56.5%   53.0%
Wilson         49.9%   43.0%
Marsee         66.0%   44.5%
Lightner       47.3%   45.7%

The BOMB ratings explain 67.7% of the variation in victory margins, exceeding all the others posted on this web page. Likewise, the JWB ratings explain the highest percent of the variation in pure win-loss. As I show in the mathematical proof, no set of unconditional ratings (i.e., where there is a single rating number for each team) can exceed them.

3. "Michael Zenor has two [ratings], one that explains point spreads (BOMB) and one that explains win-lose-tie records (Just-Win-Baby). For predicting point spreads, Darryl Marsee has a rival system. (Marsee's system has a major difference inits objective, however: Marsee trys to predict the future rather than explain the past.) For explaining win-lose-tie records, the Performance Rating system is a rival."

Like Marsee's, the BOMB and JWB can be used to predict the future (see the related predictions file). The principal difference here is that the JWB/BOMB ratings are statistical estimators that minimize a known loss function (squared error). The PRS is a non-statistical scoring method.

4. "Consider what would have happened if Florida State had gone unbeatened and untied, beating Florida in the last game of the season and Nebraska in a bowl game. (Florida State lost to Virginia on Nov. 2.) Florida State had no chance at all of ending up on top in the Just-Win-Baby system. That's because all those games against weak opponents would have continued to be counted in the ratings." [This was in the original critique but has been removed because, as show below, it was incorrect.]

This is incorrect. In the current JWB ratings, FSU is #19. I recalculated the JWB, adding the following hypothetical games suggested in David Wilson's comments: (1) FSU beats, rather than loses to, UVa; (2) FSU beats Florida; (3) FSU beats Nebraska; (4) Ohio St. loses to Michigan; (5) Northwestern loses to USC in the hypothetical Rose Bowl. Guess what? FSU goes to #1 in the JWB ratings.

COMPARING METHODS

The real question at hand is how well any scale accounts for game outcomes, in terms of "variance accounted for". Below is a description of the correct way to compute this for a difference variable.

Below are the results of the first four games of the season. Appended to the right are two columns. A is the observed victory margin. B ignores any victory margin above 1 point (i.e., all wins are treated as one-point wins). Mathematically, column B = min(column A, 1). Thus, column B is pure win-loss information.

                                    A  B
Ohio State 38, Boston College 6    32  1
Michigan 18, Virginia 17            1  1
Iowa State 36, Ohio 21             15  1
Nebraska 64, Oklahoma State 21     43  1
&etc.

Since we are analyzing differences, each game contains two symmetric pieces of information: the positive victory margin and its corresponding negative loss margin. This is called the "image" or complement vector. See Stuart and Ord, Kendall's Advanced Theory of Statistics, for a lengthy discussion of correlating differences and the related topic of regression through the origin. Below is the original data augmented with the image vectors.

                                    A  B
Ohio State 38, Boston College 6    32  1
Ohio State 38, Boston College 6   -32 -1 
Michigan 18, Virginia 17            1  1
Michigan 18, Virginia 17           -1 -1
Iowa State 36, Ohio 21             15  1
Iowa State 36, Ohio 21            -15 -1
Nebraska 64, Oklahoma State 21     43  1
Nebraska 64, Oklahoma State 21    -43 -1
&etc.
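
A minimal sketch of how this doubled data set could be built, assuming a simple list of (winner, loser, margin) records like the ones above.

    # Sketch: augment each game with its image (complement) row.
    games = [("Ohio State", "Boston College", 32),
             ("Michigan",   "Virginia",        1),
             ("Iowa State", "Ohio",           15),
             ("Nebraska",   "Oklahoma State", 43)]

    rows = []
    for winner, loser, margin in games:
        win_loss = min(margin, 1)                          # column B: pure win-loss information
        rows.append((winner, loser,  margin,  win_loss))
        rows.append((winner, loser, -margin, -win_loss))   # the image (complement) row

    for r in rows:
        print(r)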

The question now is how well the rating scale differences correlate with these actual margins. To illustrate, the table below shows the various scale differences, and their images, for the Ohio State-Boston College game.


                     BOMB    JWB  WILSON  MARSEE LIGHTNER 
Ohio State          33.04   1.38     819     949    98.36
Boston College       0.82  -0.12     447     534    17.81

Scale Difference:   32.22   1.51     372     415    80.55
Image:             -32.22  -1.51    -372    -415   -80.55

These scale differences can now be appended to the original data set:

Ohio State 38, Boston College 6  32  1  32.22  1.51  372  415  80.55
Ohio State 38, Boston College 6 -32 -1 -32.22 -1.51 -372 -415 -80.55
Michigan 18, Virginia 17          1  1   3.65  0.54   86   53  29.82
Michigan 18, Virginia 17         -1 -1  -3.65 -0.54  -86  -53 -29.82
Iowa State 36, Ohio 21           15  1  18.78  0.74  114  182   5.08
Iowa State 36, Ohio 21          -15 -1 -18.78 -0.74 -114 -182  -5.08
Nebraska 64, Oklahoma State 21   43  1  45.65  1.41  596  562  79.84
Nebraska 64, Oklahoma State 21  -43 -1 -45.65 -1.41 -596 -562 -79.84
:                :                :  :    :     :     :    :     :
:                :                :  :    :     :     :    :     :
& etc.

RESULTS

"Variance accounted for" is defined as the squared correlation between the criterion and the predictor. In this case, there are two criterion variables (margin and win-loss), and five predictors (BOMB, JWB, and the ratings of David Wilson, Darryl Marsee and Jim Lightner).

As of November 11, there had been 505 games between Division I-A teams, providing 1010 observations for the correlation analysis. I will provide the dataset upon request. The table below shows the correlation between the 5 different rating methods and actual game outcomes - either the margin or the Win/Loss.

              Margin  Win/Loss
BOMB           0.823   0.665
JWB            0.759   0.728
Wilson         0.706   0.655
Marsee         0.813   0.667
Lightner       0.688   0.676

Squaring these correlation coefficients gives the variance accounted for:

              Margin  Win/Loss
BOMB           67.7%   44.2%
JWB            56.5%   53.0%
Wilson         49.9%   43.0%
Marsee         66.0%   44.5%
Lightner       47.3%   45.7%
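
A minimal sketch of this squared-correlation ("variance accounted for") computation on the doubled data. The short arrays below stand in for the full vectors of 1010 observations.

    import numpy as np

    # Sketch: squared correlation between a criterion and a predictor, both doubled with images.
    def variance_accounted_for(criterion, predictor):
        """Square of the Pearson correlation between the two doubled vectors."""
        r = np.corrcoef(criterion, predictor)[0, 1]
        return r * r

    # Stand-ins for the full doubled vectors (margins and BOMB scale differences, with images).
    margin    = np.array([32, -32, 1, -1, 15, -15, 43, -43], dtype=float)
    bomb_diff = np.array([32.22, -32.22, 3.65, -3.65, 18.78, -18.78, 45.65, -45.65])

    print("VAF:", variance_accounted_for(margin, bomb_diff))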

While Darryl Marsee was close, the BOMB ratings provide the best accounting for actual margins and the JWB ratings provide the best accounting for pure win-loss. Neither result is surprising: in the description of the ratings, I provide a proof that for explaining victory margin, the BOMB index minimizes squared error - thereby maximizing explained variance. Likewise, the JWB minimizes squared error in Win-Loss.

In conclusion, my original point stands. It is mathematically impossible for any unconditional team rating index (one rating number per team) to have a higher correlation with game outcomes.

Subsequent E-mail

Dave Wilson to Mike Zenor, 11/16/95

My performance rating system is non-linear and cannot be properly analyzed using the tools of linear algebra and correlation. When calculating the percentage of variance of game results explained, you must use the right formula for the estimated game result:
    ^                               rating(a)-rating(b)
    y(a,b) = minimum(1, maximum(-1, ------------------- ))
                                            100

where the observed y(a,b) is 1 if team "a" won, 0 if the two teams tied, and -1 if team "b" won. For the case where the higher rated team (say, team "a") won, the system makes explicit use of the "minimum" function in the estimation formula to, in effect, exclude "wins that would lower the rating and losses that would raise the rating." Please recalculate the percentage of variance explained using the above formula for the estimated game result.

I was wrong about Florida State and will remove that paragraph from the critique.

Mike Zenor to Dave Wilson, 11/17/95

If your predictions are computed by that formula, you are indeed correct in noting that "My performance rating system is non-linear and cannot be properly analyzed using the tools of linear algebra and correlation". However, this also means that explained variance (a linear statistic) itself cannot properly be computed for these predictions.

To illustrate, I computed the correlation between the PRS and actual outcomes, using your "y hat" formula. It was 0.785, or a "variation accounted for" of 61.7%. This is indeed higher than the JWB (53%). But, fair is fair - if you can choose your own discontinuous non-linear function for constructing predicted y(a,b), so can I. I'll choose this for JWB:

    ^                       
    y(a,b) =  exp(JWB(a)-JWB(b)) if JWB(a)>JWB(b)

    ^                       
    y(a,b) = -exp(JWB(b)-JWB(a)) if JWB(b)>JWB(a)

If I do this, the correlation between the rescaled JWB difference and the same criterion is 0.887, or "variation accounted for" of 78.7%. To summarize:

                   VAF     "VAF" Using preferred prediction rule
                  =================
        JWB       53.0%    78.7%
        PRS       43.0%    61.6%

These results illustrate why the original analysis is correct. Note that I can arbitrarily raise "VAF" from 53% to 79% with the same exact set of JWB ratings. Quite clearly, I did not suddenly improve the quality of the JWB ratings, for they remain the same. The increase had nothing to do with the ratings themselves, and everything to do with my (arbitrary) non-linear prediction rule. The real question is how much variation is explained by the scale itself; this is column one.

There are, to be sure, a number of useful non-linear models that could be applied to this problem. In particular, binomial LOGIT and PROBIT models (see G.S. Maddala, Analysis of Limited Dependent Variables) could be used to derive ratings for teams. Rather than minimize squared error, these models maximize a likelihood or entropy function.
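
A minimal sketch of the logit idea, fitting per-team ratings by maximum likelihood. It uses scikit-learn's LogisticRegression as one convenient implementation and made-up game results; this is neither the JWB nor the PRS method.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Sketch: a Bradley-Terry-style logit fit for team ratings (hypothetical slate of teams).
    teams = ["Nebraska", "Florida", "Ohio State", "Northwestern"]
    idx = {t: i for i, t in enumerate(teams)}

    # (winner, loser) pairs -- made-up results.
    results = [("Nebraska", "Florida"), ("Florida", "Ohio State"),
               ("Ohio State", "Northwestern"), ("Nebraska", "Ohio State")]

    X, y = [], []
    for w, l in results:
        row = np.zeros(len(teams)); row[idx[w]] = 1.0; row[idx[l]] = -1.0
        X.append(row);  y.append(1)          # the game from the winner's side
        X.append(-row); y.append(0)          # the same game from the loser's side

    # No intercept: only rating differences matter.  The ridge penalty (controlled by C)
    # also keeps the otherwise underdetermined ratings finite.
    model = LogisticRegression(fit_intercept=False, C=1.0).fit(np.array(X), np.array(y))
    for t in teams:
        print(t, round(model.coef_[0][idx[t]], 3))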

Dave Wilson to Mike Zenor, 11/17/95

Gee, I hadn't thought of optimizing the y-hat function afterwards. I'll make a plot of the rating differences vs. average result to see what can be done for the performance rating system. If I'm correct in thinking that it's important to drop out the games of mismatched opponents to produce the best quality ratings (once those games are dropped my system does become linear), then I should be able to find an optimized y-hat function that produces a better result than any optimized y-hat function for your system. I'll give it a try and see what I can come up with.

My y-hat function is not arbitrary since it's what I have in mind when calculating the ratings. It's also not discontinuous, although its first derivative is discontinuous.

Since the y-hat function has to approach 1 as the ratings difference goes to positive infinity, -1 as the ratings difference goes to negative infinity, and be 0 when the ratings difference is 0, I'm hoping to be able to use a y-hat function of the form:

    ^         2
    y(a,b) = -- arctan(v*(rating(a)-rating(b)))
             pi 

where v is some scaling factor. Whether or not that will pan out I don't know yet.
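
A minimal sketch of that candidate y-hat. The value of v below is an arbitrary placeholder, not a fitted scale factor; the ratings fed in are Performance Ratings, divided by 100 as in the text.

    import math

    # Sketch: the proposed arctan y-hat, bounded in (-1, 1) and 0 for equal ratings.
    def y_hat(rating_a, rating_b, v=2.0):      # v = 2.0 is a made-up scale factor
        x = (rating_a - rating_b) / 100.0
        return (2.0 / math.pi) * math.atan(v * x)

    print(y_hat(819, 447))    # large favorite -> close to 1
    print(y_hat(500, 500))    # even matchup   -> 0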

On your BOMB rating system, the restriction that each team's rating be a single number leaves only one possible improvement I can think of: introducing a universal home field advantage value.

   m(a,b) = r(a) - r(b) + q*h(a,b) + error(a,b)

where h(a,b) is 1 if a is at home, -1 if b is at home, and 0 if neither is at home, and q is the universal home field advantage value. To actually work with this, you'd use:

   m(a,b) - q*h(a,b) = r(a) - r(b) + error(a,b)

and repeatedly solve this, trying various values of q. To avoid having the q value change each week, you could use the '93 and '94 games to determine the best q value and stick with that.
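
A minimal sketch of that search over q. The teams, games, and home flags are made up, and numpy's least-squares routine stands in for the BOMB solver.

    import numpy as np

    # Sketch: try a range of q values, refit the ratings on m(a,b) - q*h(a,b),
    # and keep the q that leaves the smallest sum of squared errors.
    teams = ["AAA", "BBB", "CCC"]
    idx = {t: i for i, t in enumerate(teams)}

    # (team_a, team_b, margin, h) with h = 1 if a at home, -1 if b at home, 0 if neutral.
    games = [("AAA", "BBB", 10,  1),
             ("BBB", "CCC",  3,  1),
             ("CCC", "AAA", -7, -1),
             ("AAA", "CCC",  6,  0)]

    def sum_sq_error(q):
        X = np.zeros((len(games), len(teams)))
        y = np.zeros(len(games))
        for i, (a, b, m, h) in enumerate(games):
            X[i, idx[a]], X[i, idx[b]] = 1.0, -1.0
            y[i] = m - q * h                   # m(a,b) - q*h(a,b) = r(a) - r(b) + error
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        e = y - X @ coef
        return float(e @ e)

    best_q = min(np.arange(0.0, 6.0, 0.5), key=sum_sq_error)
    print("best q:", best_q, " error:", sum_sq_error(best_q))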

For predictions, your system can definitely be improved early in the season. For example, some kind of weighted average of last year's final ratings with the current ratings could produce new ratings that would do a better job of predicting the next week's games even though these new ratings would do a worse job of explaining the results of games to date.

From Dave Wilson to Mike Zenor, 11/21/95

I have just finished fitting my y-hat function using data from '93, '94, and '95.
    ^         2
    y(a,b) = -- arctan(1.80*x + 4.55*x^3)
             pi

where * is multiplication, x^3 is x cubed, and

           rating(a)-rating(b)
       x = -------------------
                   100

By the way, you can get my ratings to 5 decimal places by linking to the description of my system from the RSFC homepage, then to "Programming Notes" and then to "byname.txt". I also saved the numbers for Nov. 11.

From Dave Wilson to Mike Zenor, 11/23/95

I just calculated the improvement in sum error squared for the arctan function over my original function and the result is unimpressive. Next, I'm going to try to verify your previous results. Since the average game result, when viewed from both sides, is zero, the sum of the variation squared should be the number of games less the number of ties. Your y-hat function
   y-hat = sign(x)*exp(|x|)        x=JWB(a)-JWB(b)

where sign(x) = 1 if x>0; 0 if x=0; -1 if x<0

must be missing some scaling factors since exp(|x|) is always outside the interval (-1,1) except at x=0. You'd do better with just sign(x). I'll introduce a couple of parameters and do a fit before using the function to calculate sum of error squared:

   y-hat = a*sign(x)*exp(b*|x|)
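
A minimal sketch of that two-parameter fit, using scipy's curve_fit on illustrative rating differences (not the actual compare.txt data).

    import numpy as np
    from scipy.optimize import curve_fit

    # Sketch: fit y_hat = a*sign(x)*exp(b*|x|) to observed outcomes by non-linear least squares.
    def y_hat(x, a, b):
        return a * np.sign(x) * np.exp(b * np.abs(x))

    # Illustrative JWB rating differences (doubled with images) and observed outcomes.
    x = np.array([ 1.51, -1.51,  0.54, -0.54,  0.74, -0.74,  1.41, -1.41])
    y = np.array([ 1.0,  -1.0,   1.0,  -1.0,   1.0,  -1.0,   1.0,  -1.0])

    (a, b), _ = curve_fit(y_hat, x, y, p0=[0.5, 0.5])
    print("a =", round(a, 3), " b =", round(b, 3))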

From Dave Wilson to Mike Zenor, 11/24/95

The analysis you're using apparently does some kind of fit of its own of the power ratings to the game results. Are the y-hat functions it produces constrained such that:
    y-hat(a,b) = - y-hat(b,a)

so that only one estimate is produced for a game? If not, you should enter all the games into the analysis twice--once from the point of view of each team. Otherwise, the analysis will be trying to explain just the variation between games that were tied and games that weren't. If the analysis is producing y-hat functions that are symmetric about the origin, entering each game twice should make no difference at all in the results.

From Dave Wilson to Mike Zenor, 12/1/95

I'm unable to verify your calculation of 79% of variation accounted for using a y-hat of the form a*sign(x)*exp(b*|x|). Please send me the complete y-hat formula so all I have to do is run through the data and calculate the sum of errors squared. Here's my result right now, using your compare.txt data set:
                                   Sum Error       % variation
           ^                        squared         explained
           y formula              JWB     PRS      JWB     PRS
------------------------------ ------ -------    ----- -------
                             x    467    1679       53     -69
              min(1,max(-1,x))    445     381       55      62
(2/pi)*atan(1.80*x + 4.55*x^3)    425     374       57      62
              sign(x)*exp(|x|)   2648 4012688     -166 -403591
                       sign(x)    616     544       38      45
  0.435*sign(x)*exp(0.694*|x|)    455   18770       54   -1788

You had:

                   VAF     "VAF" Using preferred prediction rule
                  =================
        JWB       53.0%    78.7%
        PRS       43.0%    61.6%

So I match your 53% and 62% figures. For the PRS figure for the y-hat formula of just x, you get 43% and I get -69% because your analysis is doing an extra fit before calculating the % variation explained and my analysis does no additional fit. x is either the JWB number from compare.txt or the Wilson number divided by 100.

In the 0.435*sign(x)*exp(0.694*|x|) formula, the 0.435 and 0.694 numbers come from a non-linear least squares fitting program. You must have a better fit. Please send it to me.
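
For reference, a minimal sketch of the kind of tabulation used in the 12/1 note above: for each candidate y-hat rule, sum the squared errors, then convert to percent of variation explained using the total variation of the (zero-mean) doubled outcomes. The rating differences below are illustrative stand-ins for compare.txt.

    import numpy as np

    # Sketch: %var = 100 * (1 - sum(error^2) / sum(y^2)), since the doubled outcomes average to zero.
    def pct_variation_explained(y, x, rule):
        y_hat = np.array([rule(v) for v in x])
        sse = np.sum((y - y_hat) ** 2)
        return 100.0 * (1.0 - sse / np.sum(y ** 2)), sse

    # Illustrative doubled data: observed outcomes y and rating differences x.
    y = np.array([ 1, -1,  1, -1,  1, -1], dtype=float)
    x = np.array([ 1.51, -1.51, 0.54, -0.54, 1.41, -1.41])

    rules = {
        "x":                lambda v: v,
        "min(1,max(-1,x))": lambda v: min(1.0, max(-1.0, v)),
        "sign(x)":          lambda v: float(np.sign(v)),
    }
    for name, rule in rules.items():
        pct, sse = pct_variation_explained(y, x, rule)
        print(f"{name:20s} SSE={sse:7.3f}  %var={pct:6.1f}")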


From: Edward Kambour <ekambour@prosx.com>
Subject: Least squares ratings
Date: Wed, 4 Nov 1998 11:39:06 -0600

Just a quick note on the "Zenor" method. Least squares of this type results in Best Linear Unbiased Estimates (BLUEs). However, such estimators are inadmissible (that is, there are estimators that are closer in terms of mean-squared error regardless of the true parameter values). In fact, Empirical Bayes and shrinkage estimators are better everywhere (but they're not unbiased, and not necessarily linear).
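
A minimal sketch of the shrinkage idea: pull each least-squares rating toward the overall mean. The weight below is a made-up illustration; a true Empirical Bayes procedure would estimate it from the data.

    import numpy as np

    # Sketch: shrink least-squares ratings toward their grand mean.
    def shrink(ratings, weight=0.8):
        """weight < 1 pulls every rating toward the mean (illustrative choice, not estimated)."""
        ratings = np.asarray(ratings, dtype=float)
        return ratings.mean() + weight * (ratings - ratings.mean())

    print(shrink([33.04, 0.82, 45.65, 18.78]))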


David Wilson / dwilson@cae.wisc.edu