Apologies for necroing this thread. In my defense, I haven't posted anything here in over 6 months.
Tees-Exe Line wrote: As for people whose arguments actually deserve discussion, if the thread Jeff linked above is what established the consensus in favor of a PPG tiebreaker and against head-to-head, that consensus ought to be revisited. As I understand the analysis, the sample was taken from tournaments featuring repeated match-ups. Each of the candidate tiebreakers was calculated before the second (or third...) match between heretofore-tied teams, and the "prediction" about the outcome of that match was tested against the actual outcome, yielding statements like "PPG predicted the outcome of the repeat match 70% of the time." If I've misunderstood the procedure, please correct me.
This is largely correct, except that statistics were not calculated across the tournament, but rather only in games against common opponents and each other. Thus, a team did not benefit by playing a game against a poor team its opponent did not play.
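To make that filtering concrete, here is a minimal sketch of computing PPG restricted to common opponents plus the head-to-head games. This is my own illustration, not the original analysis code, and the flat `(team, opponent, points)` game format is an assumption:

```python
def common_opponent_ppg(games, team_a, team_b):
    """games: list of (team, opponent, points_scored) rows, one per
    team per game. Keep only games against opponents that BOTH teams
    played, plus the head-to-head games between the two teams, so
    neither team benefits from a blowout its rival never got to play."""
    def opponents(t):
        return {opp for (team, opp, _) in games if team == t}

    common = (opponents(team_a) & opponents(team_b)) | {team_a, team_b}

    def ppg(t):
        pts = [p for (team, opp, p) in games if team == t and opp in common]
        return sum(pts) / len(pts)

    return ppg(team_a), ppg(team_b)

# Hypothetical mini-season: A and B share opponent X; Y and Z are not shared,
# so those games are excluded from the comparison.
games = [("A", "X", 300), ("A", "Y", 400), ("A", "B", 250),
         ("B", "X", 200), ("B", "Z", 500), ("B", "A", 240)]
a_ppg, b_ppg = common_opponent_ppg(games, "A", "B")  # uses X games + H2H only
```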
Tees-Exe Line wrote: That procedure has a number of flaws, but probably the most important one to address and resolve is one of interpretation about the nature of a tiebreaker. First, let me propose that the best way to break a tie is to play a complete packet between the two tied teams and award the tiebreaker to the winner of that match.
You will note that this has been a commonly accepted proposition since well before this thread (in fact, it was an axiom of the thread under review). I'm not going to rebut Andrew Hart's critique point by point, since I think that criticism is somewhat right (for instance, there may have been factors such as question quality and relative team strength that biased the results one way or another; for another, H2H was transformed into "H2H differential" when the teams had played multiple times, so that there would still be an H2H-based tiebreaker even if H2H was split evenly between the teams). I just don't feel that it's right "enough" to totally invalidate the collected data. If I understand your twist on the critique, however, you are arguing that a "regular" repeat match between two teams of the exact same composition as in the original match is subjectively different from a "tiebreaker" repeat match, which in turn is subjectively different from a "finals" repeat match. I'm not sure what the subjective difference is (we used all three types, IIRC). Question quality/difficulty? Should be the same. Distribution? I'd hope that whether the third arts tossup is opera or architecture would be unbiased as between regular and tiebreaker matches. Pressure? I haven't seen any objective or subjective evidence that teams play differently in tiebreaker matches than in other matches.
Also, I don't understand what you mean by PPG being hegemonic. That shouldn't have any bearing on your (or anyone else's) ability to find a sufficiently large set of games that fits your preconceived notions of what a valid sample would be for answering the question, "what statistical tiebreaker best predicts the result of a hypothetical tiebreaker game?" That thread tested two common-sense hypotheses - the "Andrew Hart common-sense hypothesis" (PPB) and the "JR Barry common-sense hypothesis" (H2H) - and found that the data didn't support JR Barry's common sense.
Tees-Exe Line wrote: Second, the outcome in this case is a binary prediction: was the tiebreaker in question "right" or "wrong" about the outcome of the repeat match? If, as I suspect, a linear probability model was used, then in some sense it is required to be incorrect by construction. At the very least we'd need corrected standard errors around the percentage-correct estimates. Since the inputs have a lot of variance empirically, it would be far better to use probit or logit. (Were the PPG and PPB tiebreakers specified as binary or continuous variables in that procedure, i.e., "Team A has higher PPG" or "Team A's PPG - Team B's PPG" or some variation thereof?)
Marshall, I've already copped to using the wrong statistical test in this thread and re-analyzed the data using what I believe to be the correct test. The second test used a 2x2 matrix of outcomes (e.g., H2H-better won/PPG-better won, H2H-better lost/PPG-better won, H2H-better won/PPG-better lost, H2H-better lost/PPG-better lost). I did not make any corrections for the number of tests that were run, but a one-tailed McNemar's test between PPB and H2H is still significant at the 5% level even using a Bonferroni correction.
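For anyone who wants to replicate that calculation: the exact one-tailed McNemar's test reduces to a binomial sign test on the discordant games only. A minimal sketch, with counts that are invented for illustration (they are not the actual data from the thread):

```python
from math import comb

def mcnemar_one_tailed(b: int, c: int) -> float:
    """Exact one-tailed McNemar p-value via the binomial sign test.

    b: games where tiebreaker 1 predicted the rematch correctly and
       tiebreaker 2 did not; c: the reverse. Concordant games (both
       tiebreakers right, or both wrong) carry no information and drop out.
    """
    n = b + c
    # Under the null, each discordant game is a fair coin: P(X >= b),
    # X ~ Binomial(n, 0.5).
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n

# Hypothetical discordant counts: PPB right/H2H wrong 14 times, reverse 4.
p = mcnemar_one_tailed(14, 4)  # ≈ 0.0154
# A Bonferroni correction for, say, three pairwise comparisons just
# multiplies p by 3 (equivalently, compares it to 0.05 / 3).
p_bonferroni = min(1.0, 3 * p)
```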
Finally, I'll note that in both the original (wrong) and updated versions, all four statistical tiebreakers performed statistically worse than "who won the match" at predicting "who won the match," and that the only conclusion drawn from the whole thing was that H2H was significantly worse than each of the other three tiebreakers at predicting "who won the match." In fact, H2H (or H2H differential, where H2H was something like 1-1) was statistically indistinguishable from a coin flip. I strongly suspect that if you were to run your own statistical analysis on a set of data you collected yourself, such that you felt assured no one else could accurately point out flaws in your data collection, you would see a similar result (at least with respect to H2H). The reason is that H2H predicts the result of a future match based on one game (or, best-case scenario, three) - and, at that, a game whose outcome is highly variable (there was a tournament a few years back where Chicago and Minnesota both beat each other by 400+ points).
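That variance argument can be illustrated with a toy Monte Carlo. All parameters here are invented for illustration: a true 20-point edge between the "tied" teams and a 150-point standard deviation per game margin (not unrealistic given blowout swings like the Chicago-Minnesota example), with the PPG-style comparison crudely modeled as the average margin over ten common-opponent games:

```python
import random

random.seed(0)
TRIALS = 200_000
TRUE_EDGE, GAME_SD, N_GAMES = 20.0, 150.0, 10

h2h_right = ppg_right = 0
for _ in range(TRIALS):
    # One prior head-to-head margin, one rematch margin, and the average
    # margin over N_GAMES games (a crude proxy for a PPG comparison).
    prior = random.gauss(TRUE_EDGE, GAME_SD)
    rematch = random.gauss(TRUE_EDGE, GAME_SD)
    season = sum(random.gauss(TRUE_EDGE, GAME_SD)
                 for _ in range(N_GAMES)) / N_GAMES
    h2h_right += (prior > 0) == (rematch > 0)
    ppg_right += (season > 0) == (rematch > 0)

h2h_acc = h2h_right / TRIALS  # hovers very near 0.50: a coin flip
ppg_acc = ppg_right / TRIALS  # a bit better, because averaging cuts variance
```

With one game carrying all the noise, the head-to-head "prediction" is barely distinguishable from chance, while the averaged statistic squeezes out a small but real edge.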
Tees-Exe Line wrote: Finally, I should say that it's quite possible to use a function of several tiebreakers. That seems to be the philosophy behind the NAQT "D-value," and I don't see why it can't be done here. I imagine it wouldn't be hard to program such a function into the tournament statistics software, and we can discuss what the function would be.
As the person who largely combined the suggestions of Andrew Hart and others into the D-Value, I sought to accurately rank teams who had played wildly different sets of opponents on potentially different packet sets. The D-Value roughly translates into "how many points we would expect you to score against a hypothetical nationally-average team on the appropriate set of questions," which to me (and apparently to NAQT) seems as good a way as any of ranking every college team that played SCT without making the calculations too arcane for the average quizbowler (there are some adjustments to the raw rankings to ensure that order of finish within a given tournament takes precedence over statistical measures). Of course, a D-Value or equivalent future statistic can be calculated for any individual tournament. However, not knowing your background, I suspect the second sentence here may stem from inexperience TDing anything larger than ~10-15 teams. Reconstructing the results of an entire bracket, by hand, in a small time window, from scoresheets with borderline-unintelligible writing is something that happens all the time at moderately-sized high school tournaments. The best statistical tiebreakers are ones that can be computed relatively easily by hand when one of the thousand things that can go wrong with stats entry does go wrong. I'm not saying it can't be done, just that it would often be wholly impractical.
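For those who still want to experiment despite the practicality concerns, here is a sketch of the general shape such a composite function might take. To be clear, the weights and the z-score normalization below are my invention for illustration; this is not NAQT's actual D-Value formula:

```python
from statistics import mean, pstdev

def composite_score(tied_teams, weights):
    """tied_teams: {team: {metric: value}}, with metrics computed against
    common opponents. Z-score each metric across the tied teams so that
    PPG (hundreds of points) and PPB (tens of points) land on the same
    scale, then take a weighted sum. Higher composite = better tiebreak."""
    values = {m: [tied_teams[t][m] for t in tied_teams] for m in weights}

    def z(metric, v):
        sd = pstdev(values[metric])
        return 0.0 if sd == 0 else (v - mean(values[metric])) / sd

    return {t: sum(w * z(m, tied_teams[t][m]) for m, w in weights.items())
            for t in tied_teams}

# Hypothetical tied teams and hypothetical weights favoring PPB slightly:
teams = {"A": {"ppg": 310.0, "ppb": 18.5},
         "B": {"ppg": 295.0, "ppb": 19.5}}
scores = composite_score(teams, {"ppg": 0.4, "ppb": 0.6})
```

Note that even this toy version needs a mean and standard deviation per metric, which is exactly the sort of arithmetic that is unpleasant to redo by hand from raw scoresheets.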
Tees-Exe Line wrote: Let me conclude by saying that IF you agree with me that the point of a tiebreaker is to choose which of two teams has done better at a given tournament, you should not be satisfied with the status quo. To my mind, by rewarding teams for doing well in unimportant matches, it overvalues a lot of uninformative data.
No one is arguing that we should be satisfied with the status quo, or that a 4.5-year-old thread is somehow the final word on tiebreakers. However, the objective analysis done in 2008 (and corrected in 2011 due to being wrong) was broadly consistent with subjective quizbowl experience that:
(1) Ties should be broken by actually playing tiebreaker games when possible
(2) Statistical tiebreakers are not a valid substitute for actually playing the game, but are used due to expediency
(3) When teams play identical schedules excepting each other and have identical win-loss records, head-to-head is about as effective at predicting the result of the rematch as a coin flip.
(And if you really want an argument against H2H, consider which three teams advance from a bracket in which, entering the final round, A is 7-1 with a loss to B, B and C are 7-1 and playing each other, and D is 6-2 with a win over B and losses to A and C. If you use H2H, you'll find that the result depends entirely on whether B beats C - D advances or does not advance based solely on the result of a game it did not play. Furthermore, if B wins, the only reason you're breaking a two-way tie instead of a three-way circle of death is that B beat A - a match that involved neither of the teams in the current two-way tie.)
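That bracket example can be worked through mechanically. A sketch, assuming each team finishes with nine games played and that A and D both win their final-round games (those outcomes aren't specified above, so they're my assumption):

```python
def top_three(wins, h2h):
    """wins: {team: final win count}; h2h: set of (winner, loser) pairs.
    Advance the top three, breaking a two-way tie straddling the cut
    line by head-to-head. (A sketch: circles of death not handled.)"""
    order = sorted(wins, key=wins.get, reverse=True)
    if wins[order[2]] == wins[order[3]]:  # tie between 3rd and 4th place
        a, b = order[2], order[3]
        order[2] = a if (a, b) in h2h else b
    return order[:3]

# Known results entering the final round: B beat A, D beat B, A beat D,
# C beat D. Assume A and D win their finales; only the B-C game varies.
h2h = {("B", "A"), ("D", "B"), ("A", "D"), ("C", "D")}
if_b_wins = top_three({"A": 8, "B": 8, "C": 7, "D": 7}, h2h | {("B", "C")})
if_c_wins = top_three({"A": 8, "C": 8, "B": 7, "D": 7}, h2h | {("C", "B")})
# D's fate turns entirely on the B-C game, which D did not play.
```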
Let me conclude by saying that while Matt is largely correct about the insufficiency of data in most quizbowl-related statistical measures (including most of mine), I think he's completely incorrect here. We don't care about who buzzed where or whether someone vultured a neg, which is what we have to approximate using "arithmetic wizardry" in order to obtain most of the statistics he doesn't like. Here, all we care about is (1) whether a team won a game against a team with an identical won-loss record against common opponents excepting each other and (2) how many points (or ppb) the two teams scored against that slate of opponents. These data are available for literally EVERY tournament that has full statistics. So this study can be replicated. Perhaps, given the number of people who have said "there were a lot of problems with that study" (most of whom never say what those problems are; Marshall, you are an exception here, and I hope I've addressed your posted concerns), it should be replicated to account for whatever the problems are. Because the problem isn't that there isn't enough data.