Wed Dec 03, 2008 5:18 pm

While I was making this post, I got to thinking: we should be able to measure which is the best statistical tiebreaker. All we need are situations in which teams (preferably with the same record, but not necessarily) play each other more than once: for some fraction of these, the results for head-to-head, PPG, PPB and whatever other tiebreakers will differ. We can then easily see which is the strongest correlate to winning a given match (so, like, if PPG differential predicts the winner 87% of the time but head-to-head only 62%, we can quantifiably say that PPG differential is a better tiebreaker.)

As such matchups happen not infrequently at tournaments, we should be able to assemble some data fairly quickly if some of you are willing to look over old stats. What do people think of this idea? If anyone wants to, I welcome them to find a tournament with such a matchup (a team playing another more than once) and see how often each tiebreaker correctly predicts the results of the actually played match.
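The bookkeeping here is simple enough that a rough sketch might help anyone digging through old stats. This is just one way to do the count (field names and data below are made up for illustration): for each repeat matchup, record which team each tiebreaker favored going into the rematch and who actually won; a tiebreaker's score is the fraction of rematches it called correctly.

```python
# Sketch of the proposed measurement (names and scores are hypothetical).
from collections import Counter

TIEBREAKERS = ("h2h", "ppg", "ppb")

def tiebreaker_accuracy(rematches):
    """rematches: dicts like {"winner": "A", "h2h": "A", "ppg": "B", ...};
    a value of None means that tiebreaker was a wash for that matchup."""
    correct, total = Counter(), Counter()
    for game in rematches:
        for tb in TIEBREAKERS:
            if game.get(tb) is None:   # tiebreaker was a wash; skip it
                continue
            total[tb] += 1
            if game[tb] == game["winner"]:
                correct[tb] += 1
    return {tb: correct[tb] / total[tb] for tb in TIEBREAKERS if total[tb]}

# Two made-up rematches:
games = [
    {"winner": "A", "h2h": "B", "ppg": "A", "ppb": "A"},
    {"winner": "B", "h2h": "B", "ppg": "A", "ppb": None},
]
print(tiebreaker_accuracy(games))  # {'h2h': 0.5, 'ppg': 0.5, 'ppb': 1.0}
```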

MaS

PS: The first thing I came across was the finals of this year's IO, but this provides no useful data since the head-to-head, PPG, and PPB tiebreakers all predict the same result (unless there's a tiebreaker that's broken that I don't know about.)

Wed Dec 03, 2008 6:49 pm

I was actually thinking of doing this a couple of months ago, but I never got around to it. I think, with the database of tournaments I was using to calculate my individual computer rankings, I should be able to write a script to do a bunch of these comparisons. I might try working on this over the weekend.

Wed Dec 03, 2008 9:15 pm

Philosophically, I wonder if this is the best way to go.

Let's say you had two teams play common opponents. One of them narrowly wins all of its matches, going 5-0 with an average margin of victory of 50 PPG. The other one blows out all of its opponents but goes on a negfest against that other team, going 4-1 with an average margin of victory of 250 PPG. If I had to predict which team would win a rematch, I would pick the 4-1 team. However, if I had to select one team to go into the Championship Bracket, I would pick the 5-0 team. In other words, the best team and the team most deserving of advancement are not necessarily the same team.

My example is a bit extreme, but there have been plenty of cases similar to it. At IHSA Sectionals, four teams play a Round Robin with one team advancing. If the team generally considered the best loses to the team generally considered the second best, then that generally decides who advances even if it is a very narrow defeat as long as those two teams win their other matches by whatever scores they rack up.

The fact that you are talking about a tiebreaker somewhat alleviates this, but there is still an issue of whether the team with a higher PPG has earned the right to advance because answering more questions is an accomplishment as opposed to finding a more complex metric that may better predict success at the next level.

Wed Dec 03, 2008 10:30 pm

One problem I can foresee is that there is a lot of interaction between the different tiebreakers. For instance, teams with high bonus conversions generally have high points per game.

I do not have the base of tournaments necessary to run this, but it strikes me that a more prudent approach might be to record each game as a six-dimensional vector where:

1 means winner of game was higher in stat

0 means winner of game was equal or lower in stat

stats being W-L record, head-to-head, ppg, ppb, h2h differential, overall point differential

We then put this into a 2x2x2x2x2x2 matrix, where each entry is the number of games with that particular vector.

For each cell, moving the equivalent of down/right would be the equivalent of changing a 1 to a 0, so the cells "above" a given cell are those with one of its 0s flipped to a 1. We can then get a ranking of what's most important by comparing that cell with all cells above it. So if there were 45 games in cell 110110 but only 24 in cell 100111 and only 12 in cell 101110, then for cell 100110, "flipping statistic 2" is more likely to explain the winner than "flipping statistic 6", which is more likely to explain the winner than "flipping statistic 3". For each cell, then, we would have a "ranking" of which 0s flipping to 1s are most likely to explain the winner, given that the 1s stay the same.

Among the 64 cells, we have:

1 ranking of 0 stats

6 rankings of 1 stat only

15 different rankings of 2 stats

20 different rankings of 3 stats

15 different rankings of 4 stats

6 different rankings of 5 stats

1 ranking of all 6 stats

We can then use any method we like to interpret these rankings. ("Play 30 games" pitting statistic A vs statistic B in the 16 cells they are both ranked in, where the one ranked ahead in more cells wins the game and the best W-L-T record wins overall, seems to me to be the best strategy; one could also use the 6-5-4-3-2-1 system with 438 total points to determine order, look for interesting trends in the data, etc.)
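A sketch of this tally in Python (stat names are my shorthand for the six stats listed above, and I'm assuming the per-cell ranking orders flips by descending count of games in the cell "above"; toy counts mirror the 45/24/12 example):

```python
# Sketch of the 2x2x2x2x2x2 tally. Each game is encoded as a 6-bit tuple:
# 1 if the winner of the game was higher in that stat, else 0.
from collections import Counter

STATS = ["record", "h2h", "ppg", "ppb", "h2h_diff", "point_diff"]

def tally(games):
    """games: iterable of 6-tuples of 0/1 -> Counter over the 64 cells."""
    return Counter(tuple(g) for g in games)

def cell_ranking(cell, counts):
    """Rank a cell's 0-stats by the number of games in the cell 'above'
    obtained by flipping that single 0 to a 1 (more games above = that
    flip is more likely to explain the winner)."""
    flips = []
    for i, bit in enumerate(cell):
        if bit == 0:
            above = cell[:i] + (1,) + cell[i + 1:]
            flips.append((counts[above], STATS[i]))
    return [stat for _, stat in sorted(flips, reverse=True)]

# 45 games in cell 110110, 24 in cell 100111, 12 in cell 101110:
counts = tally([(1, 1, 0, 1, 1, 0)] * 45 +
               [(1, 0, 0, 1, 1, 1)] * 24 +
               [(1, 0, 1, 1, 1, 0)] * 12)
print(cell_ranking((1, 0, 0, 1, 1, 0), counts))  # ['h2h', 'point_diff', 'ppg']
```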

Wed Dec 03, 2008 11:38 pm

Shcool wrote:Philosophically, I wonder if this is the best way to go.

...The fact that you are talking about a tiebreaker somewhat alleviates this...

Yeah, I think you're misunderstanding me. We're talking about situations in which we need a tiebreaker. I'm just proposing a measurement to determine which popular tiebreaker is actually the most valid (best correlate to winning.) Obviously the team with the best record should win regardless of whatever tiebreakers another team may hold against them.

Dwight: I think you're misunderstanding the nature of what I'm proposing to do here. We don't want to compare W-L, because that isn't a tiebreaker; only W-L against the same team is. We can easily determine how predictive, for example, PPG differential is of the outcome of any game, but that isn't very useful: we can't make the same comparison for head-to-head except in the case of a repeat matchup, which means we can't isolate the other factors (so no direct comparison can be made.) Only in the case of a repeat matchup can we isolate all the factors. Also, the fact that the tiebreakers are correlated isn't important; the proposed measurement measures only the differences between them.

MaS

Thu Dec 04, 2008 1:15 pm

It's an accepted principle in sports analysis (and yes, I will break my virulent opposition to sports analogies here because the sabermetricians are much more advanced with their data and mathematical thought than we are) that a loss or victory by a small margin means nothing, but long-term trends in scoring mean everything. Who wins a 280-245 quizbowl game comes down to the luck of the draw in terms of whether that third arts tossup was better for one team's opera specialist or the other team's architecture player; who consistently scores 35 PPG more over the course of the tournament reliably indicates more knowledge or at least more ability to play quizbowl.

I don't like any appeals to "you're making the head-to-head result meaningless" because:

1) the head-to-head result IS meaningless, essentially, when we're talking about a tie situation--the teams must be very close in ability if they are tied, especially if the one who won the head-to-head game then went and lost to someone who the opponent beat, which mathematically must happen in the "two-way tie at the top of the standings" scenario. If the head-to-head result was a 300-point blowout and then the winning team went and lost to someone who the losing team also beat by 300, then something is wrong with the questions. In the more usual scenario, if the head-to-head result is a very close game, then it has very little value in determining who the better team would be in a longer series of games.

2) the head-to-head result is taken into account to create the tie; without it, someone is 1 game ahead. That game has all the value in the world when we're talking about the difference between "you are 1 game ahead and you have won the tournament/earned the advantage in the final" and "well, I guess we have a tie now, let's find some way to break it." That's value enough for any one game without artificially adding any more.

Thu Dec 04, 2008 1:21 pm

The two long-term data trends that emerge from quizbowl games are PPG and PPB. I am in favor of using PPG (because it incorporates the entirety of quizbowl activity) when the teams have played common opponents. When teams haven't played common opponents, I think the only fair thing to do is to use bonus conversion, since that is much less affected by the opponents one plays than PPG is. Ideally PPB is context-neutral, but depending on how variation in packets and opponents lines up, it might not be.

Thu Dec 04, 2008 2:13 pm

Well, look; if you guys believe these sports analogies, they should be reflected in the measurement I'm proposing to make, so you have nothing to lose and everything to gain. More importantly, if you believe in reason, you can't advocate uncritically using one tiebreaker based on those arguments; rather, you are compelled to acknowledge that relying on a priori arguments when a posteriori evidence is available is the very pinnacle of unreasonable, unscientific thinking.

The case remains this: all else equal, long-term trends like PPG/PPB have lower fluctuations due to (massively) larger sample size, but are less predictive per datum, while the outcomes of previous games between tied teams are more predictive per datum, but potentially contain very large fluctuations. Therefore, until we can quantify things (which is exactly what I'm proposing to do,) doubt must remain regarding which is the better tie-breaker.

In short, both of you are compelled to advocate this comparison as the justification of your beliefs or abandon reason (and, concomitantly, your arguments), in which case you must either form a new argument or not argue against this measurement. So far, we have one datum indicating unit correlation to winning (and to one another) for head-to-head, PPG, and PPB tiebreakers. I know we can do better than that.

MaS

Thu Dec 04, 2008 3:31 pm

Matt:

Are you talking about 35 PPG over the course of an entire tournament that is a true round robin, or one that has several divisions of teams? If it is the former kind of tournament, then state so explicitly; if it is the latter, then your argument holds little weight, since common opponents must be factored into any metric attempting to break a tie between 2 teams with identical records. A 35 PPG differential means little, if anything, if the only common opponent between Team A and Team B is the other one. Team A may have played some of their 6 Divisional games against Middle School teams while Team B played games against Dorman B, Charter C, RM D, among others. 35 PPG more for Team A means very little if they finish with the same record as Team B, but lost to them head-to-head. If I am missing some piece of your argument, please clarify your post.

Thu Dec 04, 2008 4:40 pm

elrountree wrote:Matt:

Are you talking about 35 PPG over the course of an entire tournament that is true round-robin, or one that has several divisions of teams? If it is the former kind of tournament, then state so explicitly; if it is the latter one, then your argument holds little weight since common opponents must be factored into any metric attempting to break a tie between 2 teams with identical records. A 35 PPG differential means little, if anything, if the only common opponent between Team A and Team B is the other one. Team A may have played some of their 6 Divisional games against Middle School teams while Team B played games against Dorman B, Charter C, RM D, among others? 35 PPG more for Team A means very little if they finish with the same record as Team B, but lost to them head-to-head. If I am missing some piece of your argument, please clarify your post.

I think in general people don't support PPG comparisons unless they're made against teams with all common opponents; if you have to compare across brackets, you always prefer PPB to PPG. The only rare circumstance in which this fails is if Team A wins all its games 600-0, getting twenty tossups per game and 20 PPB, and team B wins all its games 80-0, getting two tossups per game and 30 PPB--or some more realistic corner case, I suppose. But this relies on absolutely atrocious bracket balance. Getting at least decent bracket balance means that the teams that only get two tossups per game--the teams for which PPB means little due to a relatively small sample--will also lose a whole lot.
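A quick arithmetic check of that corner case (assuming 10 points per tossup and ignoring powers and negs, so a team's per-game score is roughly tossups converted times 10-plus-PPB):

```python
# Arithmetic behind the corner case above, under the stated assumptions:
# per-game score ~= tossups_converted * (10 + PPB).
def game_score(tossups_per_game, ppb):
    return tossups_per_game * (10 + ppb)

print(game_score(20, 20))  # 600: Team A, twenty tossups at 20 PPB
print(game_score(2, 30))   # 80:  Team B, two tossups at 30 PPB
```

So the higher-PPB team really can be the one scoring 80 a game, but only under the atrocious bracket balance described.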

Thu Dec 04, 2008 4:46 pm

I mean PPG within the round-robin that produced the tie, of course.

Thu Dec 04, 2008 5:13 pm

You're not correct, Mike.

To continue with sports analogies, there is overwhelming evidence that the Patriots were the best team in the NFL last year. However, that does not mean that they should be considered the NFL Champions. Titles and playoff berths go to teams that earn them through criteria decided ahead of time, not to teams that prove themselves the greatest statistically.

If somebody knowledgeable with statistics goes through a large amount of data, they could produce a complex formula to determine which teams are better than which other teams. They will not find that PPG is always the best predictor--they will find that PPG correlates to a certain extent with being better, PPB correlates to a certain extent, team record correlates, etc. There very well could be correlations with the number of negs and, in NAQT tournaments, with the number of powers. If somebody wants to, as best as possible, determine which team is better, then they will need a formula that takes all available correlating statistics into account. Is your goal to use such a formula to break ties?
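Purely for illustration, one shape such a combined formula could take is a weighted sum of stat differentials between the two teams; the weights and field names below are invented placeholders, not values fit from real data, and fitting them would require exactly the kind of large-scale regression described above.

```python
# Illustrative combined tiebreaker: a weighted sum of stat differentials.
# WEIGHTS are made-up placeholders, not fitted values.
WEIGHTS = {"ppg_diff": 0.4, "ppb_diff": 0.3, "h2h": 0.2, "record_diff": 0.1}

def combined_score(stat_diffs):
    """stat_diffs: (team A minus team B) differentials, each pre-scaled
    to a comparable range. Positive result -> formula favors team A."""
    return sum(WEIGHTS[k] * stat_diffs.get(k, 0.0) for k in WEIGHTS)

# Team A ahead on ppb and record, behind on ppg and h2h:
score = combined_score({"ppg_diff": -1.0, "ppb_diff": 2.0,
                        "h2h": -1.0, "record_diff": 1.0})
print(round(score, 6))  # 0.1 -> narrowly favors team A
```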

Thu Dec 04, 2008 5:43 pm

To continue with sports analogies, there is overwhelming evidence that the Patriots were the best team in the NFL last year. However, that does not mean that they should be considered the NFL Champions. Titles and playoff berths go to teams that earn them through criteria decided ahead of time, not to teams that prove themselves the greatest statistically.

This part of your post makes no sense to me. The whole point of this thread is to decide "ahead of time" the criteria to use to break ties in tournaments. The goal is not to retroactively change the outcome of tournaments, as calling the Patriots the NFL Champions would be, but to find a fair way to do it in the future.

Thu Dec 04, 2008 6:05 pm

I think that it is actually useful to determine how predictive any given stat is in the outcome of any given game (in order to better quantify "upsets", for instance). However, seeing what you are actually trying to do now, this seems to be a project reserved for a later time.

Captain Scipio wrote:Dwight: I think you're misunderstanding the nature of what I'm proposing to do here. We don't want to compare W-L because that isn't a tiebreaker; only W-L against the same team. We can easily determine how predictive, for example, PPG differential is in the outcome of any game, but that isn't very useful because we can't make the same comparison to head-to-head unless in the case of a repeat matchup, which means we can't isolate the other factors (so no direct comparison can be made.) Only in the case of a repeat matchup can we isolate all the factors. Also, the fact that the tie-breakers are correlated isn't important; the proposed measurement measures only the differences between them.

Are you looking for repeat matchups in general, or just repeat matchups between teams with the same record? If the latter, here are some data points. Statistics are calculated at the instantaneous point in time that the match began, not from the entire tournament. If you want entire-tournament data, you can calculate that yourself, but it's mostly similar.

2007 TWAIN, Round 11: UCLA B vs. UCI A. Both teams 8-2. UCLA B held head-to-head advantage; UCI A held all other tiebreakers. UCI A def UCLA B 260-110.

At least one other example probably exists from 2007 TWAIN, but due to UCLA's policy of not counting rounds I don't know which one(s) it is.

2007 Aztlan Cup, Round 11: UCI 1 vs. UCLA 1. Both teams 9-1*. UCI 1 held all tiebreakers except h2h, which was split, and won finals match 440-115.

2006 ACF Fall, Round 10: Caltech vs Stanford B. Both teams 6-3*. Caltech held ppb tiebreaker, point differential tiebreaker; Stanford B held h2h tiebreaker and h2h differential tiebreaker; ppg tiebreaker was negligible (Caltech 303 to Stanford B 300). Stanford B def Caltech 340-275.

2006 ACF Fall, Round 9: UCLA vs Stanford B. Both teams 6-2. UCLA held all tiebreakers and won rematch 450-205.

2006 ACF Fall, Round 9: Caltech vs Stanford A. Both teams 5-3. Caltech held head-to-head and h2h differential tiebreaker; Stanford A held ppg, point differential, ppb tiebreaker. Caltech def Stanford A 385-210.

2006 ACF Fall, Round 9: UCI vs Berkeley. Both teams 1-7. UCI held h2h and h2h differential tiebreaker, Berkeley held ppg, point differential, ppb tiebreaker. Berkeley def UCI 265-155.

2006 Aztlan Cup, Round 10?: USC vs UCSD. Both teams 7-1*. USC held PPG tiebreaker, h2h, h2h differential tiebreaker, point differential tiebreaker; UCSD held PPB tiebreaker. USC def UCSD by unknown score.

2006 ACF Regionals, Round 5: UCLA vs Berkeley. Both teams 3-1. UCLA held h2h, h2h differential, ppb tiebreaker. Berkeley held ppg, point differential tiebreaker. UCLA def Berkeley 290-220.

2006 SCT West D1, Round 9: UCLA vs Stanford. Both teams 6-2*. UCLA held ppth advantage, point differential advantage, head-to-head differential advantage; Stanford held bonus conversion advantage; head-to-head was split 1-1. UCLA def Stanford 450-320.

2006 SCT West D1, Round 15: UCLA vs Stanford. Both teams 11-3*. UCLA held ppth advantage, point differential advantage, head-to-head differential advantage; head to head was split 1-1 and bonus conversion was negligible (18.71 for UCLA to 18.67 for Stanford). UCLA def Stanford 470-185.

2005 ACF Regionals, Round 14: Berkeley A vs Berkeley B. Both teams 10-1*. Berkeley B held h2h and h2h differential tiebreaker, ppg tiebreaker. Berkeley A narrowly held ppb tiebreaker (difference of about .2 ppb). Berkeley A def Berkeley B 250-210.

2004 ACF Fall, Round 11: Berkeley A vs Berkeley C. Both teams 7-3*. Berkeley A held h2h differential tiebreaker. Berkeley C held ppg, ppb, point differential tiebreaker. h2h was split 1-1. Berkeley C def Berkeley A 475-225.

2004 SCT West D2, Round 13: Berkeley Well vs Stanford Incoln. Both teams 6-6. Berkeley Well held h2h differential, ppg, ppb, narrowly held point differential tiebreaker (by about 2 ppg), head to head was split 1-1. Berkeley Well def Stanford Incoln 215-160.

2004 SCT West D2, Round 12: Caltech vs Stanford Incoln. Both teams 6-5. Caltech held h2h differential, ppg, point differential, narrowly held ppb (about .2 difference), h2h was split 1-1. Caltech def Stanford Incoln 280-230.

2004 Cardinal Classic, Round Finals: Berkeley Jeff vs Berkeley David. Both teams 11-1. Berkeley Jeff held ppg, ppb, point differential tiebreakers. Berkeley David held h2h, h2h differential tiebreaker. Berkeley Jeff def Berkeley David 350-225.

2003 Cardinal Junior Bird, Round 5: Berkeley STP vs Stanford A. Both teams 1-3. Berkeley STP held all tiebreakers. Berkeley STP def Stanford A 395-220.

2003 ACF Fall, Round 14: Berkeley Untitled vs UCLA. Both teams 5-8*. Berkeley Untitled held ppg, point differential, h2h, h2h differential tiebreakers. UCLA held ppb tiebreaker. UCLA def Berkeley Untitled 330-150.

2003 ACF Fall, Round 11: Stanford Old vs Berkeley Kids. Both teams 8-2. Stanford Old held h2h, h2h differential, ppb tiebreakers. Berkeley Kids held ppg and point differential tiebreakers. Berkeley Kids def Stanford Old 350-280.

2003 ACF Fall, Round 11: Berkeley Untitled vs Berkeley Discovery. Both teams 3-7. Berkeley Untitled held h2h and h2h differential. Berkeley Discovery held ppg, point differential, ppb tiebreakers. Berkeley Untitled def Berkeley Discovery 250-210.

2003 ACF Fall, Round 9: Berkeley Nominalists vs UCLA. Both teams 3-5. Berkeley Nominalists held ppg, ppb, point differential tiebreakers. UCLA held h2h and h2h differential tiebreakers. Berkeley Nominalists def UCLA 310-290.

2003 ACF Fall, Round 8: Berkeley Discovery vs UCLA. Both teams 2-5*. Berkeley Discovery held all tiebreakers. UCLA def Berkeley Discovery 280-230.

2003 Buzzerfest Mirror at Stanford, Round 10: Stanford vs Berkeley C. Both teams 7-2*. Stanford held h2h tiebreaker. Berkeley C held ppg, ppb, point differential, narrowly held h2h differential tiebreaker (+20 over 3 games). Stanford def Berkeley C 300-245.

*teams had played exactly the same opponents except for each other.

More coming.
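For concreteness, here is how three of the entries above could be encoded and tallied in the spirit of the proposed measurement (only a subset is transcribed; the rest of the list, and the differential tiebreakers, can be added the same way):

```python
# Three of the matchups listed above, encoded by which team each of the
# three common tiebreakers favored going into the rematch, and who won.
from collections import Counter

rematches = [
    # 2007 TWAIN R11: UCLA B held h2h; UCI A held the rest; UCI A won.
    {"winner": "UCI A", "h2h": "UCLA B", "ppg": "UCI A", "ppb": "UCI A"},
    # 2006 ACF Fall R9: UCLA held all tiebreakers and won the rematch.
    {"winner": "UCLA", "h2h": "UCLA", "ppg": "UCLA", "ppb": "UCLA"},
    # 2006 ACF Fall R9: Caltech held h2h; Stanford A held ppg and ppb;
    # Caltech won.
    {"winner": "Caltech", "h2h": "Caltech", "ppg": "Stanford A",
     "ppb": "Stanford A"},
]

correct = Counter()
for game in rematches:
    for tb in ("h2h", "ppg", "ppb"):
        if game[tb] == game["winner"]:
            correct[tb] += 1

print({tb: f"{correct[tb]}/{len(rematches)}" for tb in ("h2h", "ppg", "ppb")})
# {'h2h': '2/3', 'ppg': '2/3', 'ppb': '2/3'}
```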

Thu Dec 04, 2008 6:08 pm

Not only am I not wrong, but you've said almost nothing even germane to what I'm saying. I'm begging you and everyone to stop arguing with sports analogies (or really any analogies whatsoever): you're only confusing yourselves. Please look at the actual situation at hand.

Nobody is talking about supplanting winning and losing games to determine tournament winners. That has nothing to do with anything. This thread is about measuring which is the best tiebreaker*. Of course, it would be easy to regress any number of formulae onto winning percentage as you say, but that's not of interest here.

So, again, what we want to do here is to practically compare commonly used (or usable) tiebreakers. I've devised a method that seems to isolate other factors and allows us to draw an immediate conclusion regarding which is the best (most predictive) among the three common tiebreakers (PPG differential over common opponents, PPB differential, head-to-head.) If you or anyone else has an easily computable tiebreaker formula that you'd like to see in use, I invite you to publish it here: any such should be comparable by the method I've outlined. Of course, given enough data, we could use regression to determine a statistically best tiebreaker, but let's worry about that later.

MaS

*Maybe people are confused on this point. A tiebreaker is used to choose the best among several teams with equal records to determine, for example, seeding or sometimes other things. The impetus for this thread was a dispute in a previous thread regarding a tiebreaker to award a tournament championship, so it is a positive fact that things like that are happening.

Thu Dec 04, 2008 6:09 pm

Dwight, yeah, that's massively awesome. Is there an easy way to get those data?

MaS

Thu Dec 04, 2008 7:11 pm

2007 WIT, Round 10: Chicago A vs Stanford A. Both teams 8-1*. Chicago A held all tiebreakers. Chicago A def Stanford A 330-260.

2007 WIT, Round 10: Berkeley B vs Stanford B. Both teams 4-5*. Berkeley B held h2h, h2h differential, point differential. Stanford B held ppb and narrowly held ppg (by about 3 ppg). Stanford B def Berkeley B 290-205.

2007 SCT West, Round 14: Berkeley 1 vs Stanford A1. Both teams 11-2*. Stanford A1 held h2h differential. Berkeley 1 held ppg, ppb, point differential. h2h was split. Berkeley 1 def Stanford A1 340-265.

2007 SCT West, Round 13: Stanford B1 vs USC 2. Both teams 3-9*. Stanford B1 held h2h, h2h differential, narrowly held ppg (by 0.06 ppth). USC 2 held ppb, point differential. USC 2 def Stanford B1 160-85.

2007 SCT West, Round 12: USC 1 vs Caltech. Both teams 6-5. USC held h2h, h2h differential. Caltech held ppg, ppb, point differential. Caltech def USC 1 335-250.

2006 WIT, Round 10: Berkeley B vs Chicago B. Both teams 3-5*. Berkeley B held ppg, ppb. Chicago B held h2h, h2h differential, narrowly held point differential (by about 8 ppg). Chicago B def Berkeley B 220-150.

2005 WIT, Round 8: UCLA vs Stanford B. Both teams 2-5*. Stanford B held all tiebreakers. Stanford B def UCLA 235-225.

2005 TRASH Regionals, Round 9: Mich Alums vs UCLA. Both teams 4-4. UCLA held h2h and h2h differential; Mich Alums held ppg, ppb, point differential. UCLA def Mich Alums 205-165.

2005 TRASH Regionals, Round 8: Mich Alums vs. Berkeley. Both teams 4-3. Berkeley held all tiebreakers and won 270-135.

2005 ACF Fall, Round 9: UCLA A vs Stanford C. Both teams 7-1. UCLA A held ppb tiebreaker; Stanford C held point differential, h2h, h2h differential, narrowly held ppg (by about 5 ppg). Stanford C def UCLA A 340-305.

2005 ACF Fall, Round 9: UCLA B vs Stanford B. Both teams 2-6. UCLA B held h2h, h2h differential, ppb, narrowly held ppg (by about 1 ppg). Stanford B held point differential. Stanford B def UCLA B 195-125.

2005 ACF Fall, Round 8: Berkeley A vs Stanford C. Both teams 6-1*. Berkeley A held h2h, h2h differential, ppb, narrowly held ppg (by about 5 ppg). Stanford C narrowly held point differential (by about 2 ppg). Stanford C def Berkeley A 425-190.

2005 BLaST, Round 15: Berkeley D vs Stanford. Both teams 8-6*. Berkeley D held point differential, narrowly held ppg (by about 7 ppg). Stanford held h2h, h2h differential, ppb. Berkeley D def Stanford 215-135.

2005 BLaST, Round 12: Chicago A vs Berkeley B. Both teams 9-2. Chicago A held ppg, point differential. Berkeley B held h2h, h2h differential, ppb. Berkeley B def Chicago A 360-270.

2004 ACF Regionals, Round 10: Berkeley Jerry vs Stanford. Both teams 3-6*. Berkeley Jerry held h2h, h2h differential. Stanford held ppg, ppb, point differential. Berkeley Jerry def Stanford 210-160.

*teams played exact same opponents except for each other.

Thu Dec 04, 2008 7:43 pm

Amusingly, the first batch of isolated data (the *'d data from Dwight's first post) indicate that every tiebreaker has a success rate of 0.5. Further reports as processing proceeds.

MaS

Thu Dec 04, 2008 8:03 pm

Mike, I take it that you're only looking at the *'d ones, so that's what I'll keep looking for. Unfortunately, because a lot of tournaments don't have round numbers attached, and because you have to manually add and subtract things (SQBS won't necessarily give you a tournament snapshot after, e.g., Round 9 of a 12 round tournament), I'm not sure there's an easier way to do this kind of thing. That said, if people want to put in the time and scour stats pages, this is what you should look for:

A tournament small enough to run a full round robin (usually <15 teams). Anything more and you get bracketed round robins, which skews the data. In these tournaments, the first or last game in a playoff bracket, or a finals game, is guaranteed to be between teams that have faced the exact same opponents (except for themselves). It's just then a matter of manually sorting through that subset of games to find ones between teams of the same record.

If you prefer the end-of-tournament overall data to the instantaneous-point-in-time data, then it's easier to just read numbers off the page; I think that the instantaneous-point-in-time data is more correct to use, but it's also more time-consuming to get.

Thu Dec 04, 2008 8:30 pm

Hi Dwight,

Well, thanks for your effort, then! The point-in-time data are indeed what we want here, though even the whole-tournament data have some validity. The unstarred data will be included later, but I consider them to be less predictive (since there are more non-isolated factors; the starred data isolate everything possible.)

MaS

Thu Dec 04, 2008 8:35 pm

With 19 games (all the *'d data): total points difference, 0.6 > PPB difference, 0.55 > H-H, 0.53 > H-H point difference, 0.5 = PPG difference, 0.5. I'll now include the non-starred data. If anyone else can get me more, I've found a method to enter them pretty quickly. I may just post the spreadsheet on Google Docs to let people enter them by themselves.

MaS

Thu Dec 04, 2008 9:14 pm

Okay, with all the data entered (came to 38 games) we've got:

Common-opponent data: Point difference, 60%; Bonus points per bonus heard difference, 55%; Head-to-Head Result, 53.33%; Points per game/points per tossup heard difference, 50%; Head-to-head points difference, 50%.

All data: Point difference, 65.79%; Bonus points per bonus heard difference, 60.53%; Points per game/points per tossup heard difference, 57.89%; Head-to-head points difference, 57.89%; Head-to-Head result, 54.84%.

So, at this point, I'll conclude three things:

1. Point difference is the best tiebreaker in these data by a fair margin.

2. No normal tiebreaker significantly outperforms any other; they're all in the 50-70% range at predicting the right winner of an actual game.

3. Relatedly, no standard tiebreaker is very good, so meaningful ties should absolutely be played off if a tournament wants to find a fair winner.

I'd further suggest, as I have little faith that point 3 will carry the weight it ought, that we take up Coach Reinstein's suggestion and consider a better, composite tiebreaker. I'm open to suggestions in this area and will gladly test any. If we can find enough data, I will try a regression study.

I'll add the caveat that I'm currently confused about one thing in these data: how can a team hold points per game but not point difference if they've played the same number of games? Perhaps I've misunderstood what Dwight meant by point differential; I took that to mean difference in total points scored. Dwight, please let me know what's up; I can update this easily to reflect whatever changes.

MaS

PS: Perhaps point differential means, like, the difference between the teams' mean point difference per match. That might explain the discrepancy.

Thu Dec 04, 2008 10:22 pm

Captain Scipio wrote:I'll add the caveat that I'm currently confused about one thing in these data: how can a team hold points per game but not point difference if they've played the same number of games? Perhaps I've misunderstood what Dwight meant by point differential; I took that to mean difference in total points scored. Dwight, please let me know what's up; I can update this easily to reflect whatever changes.

If Team A has 350 PPG and 275 PPGA while Team B has 300 PPG and 200 PPGA, then Team A has higher PPG while Team B has higher point differential. Usually that means that Team B is better at answering tossups (hence less chance for the opponent to score) but worse at bonuses (hence lower PPG).
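
To make the arithmetic in that example concrete (the numbers are from the post; the variable names are just for illustration):

```python
# Team A: 350 points per game scored, 275 points per game allowed (PPGA).
# Team B: 300 scored, 200 allowed.
a_scored, a_allowed = 350, 275
b_scored, b_allowed = 300, 200

a_diff = a_scored - a_allowed  # 75
b_diff = b_scored - b_allowed  # 100

# Team A holds the PPG tiebreaker; Team B holds point differential.
assert a_scored > b_scored and b_diff > a_diff
```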

Also, while the data crunching is neat and all, I think the margin of error is way too great for what we have right now. But then again, I'm just eyeballing these numbers.

Thu Dec 04, 2008 10:49 pm

hwhite wrote:If Team A has 350 PPG and 275 PPGA while Team B has 300 PPG and 200 PPGA, then Team A has higher PPG while Team B has higher point differential. Usually that means that Team B is better at answering tossups (hence less chance for the opponent to score) but worse at bonuses (hence lower PPG).

This is exactly what I meant, and exactly what I think that statistic means (which is why it would be useful as a tiebreaker).

Mike, since I've given the exact scores for something like 37 of those games, would it be possible to run a regression involving not just who wins, but by how much (e.g. if team A holds PPG tiebreaker, but team B holds head-to-head, and team A beats team B 230-180, then it would be +50 for the PPG tiebreaker and -50 for the h2h tiebreaker).

Harry, can you elaborate about the margin of error? I think Mike is saying exactly that when he claims that no statistic significantly outperforms any other, though he hasn't quantified that significance/error.

I'll see if I can scrounge up some more data for all-else-equal matches.

Thu Dec 04, 2008 11:13 pm

cvdwightw wrote:Harry, can you elaborate about the margin of error? I think Mike is saying exactly that when he claims that no statistic significantly outperforms any other, though he hasn't quantified that significance/error.

(N.b. I don't claim to be a statistician, nor have I taken a statistics course, so I could be wrong)

If you remember the presidential election polls, it works in the same way. Long story short, if you want 80% confidence (which is rather low, but then again, tiebreaking is not perfect to begin with), then with the current sample size of 38, you have a 10% margin of error, which means that no tiebreaker is statistically significantly better than the other (H-H could be 10% higher than reported, and PPG difference could be 10% lower than reported). If you increase your sample size to 100 games, you'll be down to a 6% margin of error, which may start to allow you to confidently (statistically-wise) rule out options.
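
Harry's figures can be reproduced with the usual normal approximation for a sample proportion (z ≈ 1.28 at 80% confidence, worst case p = 0.5). This is just a sketch of that back-of-envelope calculation, not a full power analysis:

```python
import math

def margin_of_error(n, z=1.2816, p=0.5):
    """Normal-approximation margin of error for a sample proportion;
    p = 0.5 gives the widest (worst-case) interval."""
    return z * math.sqrt(p * (1 - p) / n)

print(round(margin_of_error(38), 3))   # roughly 0.10 at n = 38
print(round(margin_of_error(100), 3))  # roughly 0.06 at n = 100
```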

Thu Dec 04, 2008 11:34 pm

Dwight, yeah, I don't see why not. Perhaps for future work.

MaS

Fri Dec 05, 2008 4:37 am

This looks pretty cool.

Regarding what the margin of error on these numbers is, I'm pretty sure you can just use a binomial distribution. In that case, the standard deviation on the number of successes is just sqrt(n*p*(1-p)). So, for example, Mike said point difference had a success rate of 65.79 percent, out of 38 games. In other words, there were 25 successes in 38 games. The error on that is sqrt(38*0.6579*(1-0.6579)) = 2.92. So, we have (25 +/- 2.92)/38 = 0.6579 +/- 0.0768. The errors on the other numbers will be similar. So, I agree with Harry that the errors on these numbers are too big to say definitively which tiebreaker is the best. We probably need to lower the error from the current 7.7 percent to about 3 percent or less to say with much confidence which tiebreaker is the best. Since error scales like 1/sqrt(n), this means we might need 6 times more data than we currently have. Whether that's feasible or not I don't know.
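
The computation above, written out with the same numbers (25 successes in 38 games):

```python
import math

n = 38
successes = 25
p = successes / n  # about 0.6579

# Binomial standard deviation on the number of successes.
sd_successes = math.sqrt(n * p * (1 - p))  # about 2.92
sd_rate = sd_successes / n                 # about 0.0768

# Error scales like 1/sqrt(n): cutting 7.7% down to 3% requires growing
# the sample by roughly (7.7 / 3)**2, i.e. about 6.6x as many games.
growth = (sd_rate / 0.03) ** 2
print(round(sd_successes, 2), round(sd_rate, 4), round(growth, 1))
```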

Fri Dec 05, 2008 4:55 am

Schweizerkas wrote:Since error scales like 1/sqrt(n), this means we might need 6 times more data than we currently have. Whether that's feasible or not I don't know.

Considering that this is just from one small, isolated circuit that doesn't run a lot of tournaments (as compared to, say, the Midwest), we should be able to find (hopefully) a near-equivalent amount of data from the Midwest, Northeast, Mid-Atlantic, and Southeast circuits. Plus, there's an entire high school circuit, if we can find small enough tournaments that run double RR or single RR + playoff brackets. I'd say it's feasible to get a sample size of ~200-250 games if we work at it and include anything between teams of the same record (not ideal, but hey, it's the best we can do if we're looking at 250 games).

Fri Dec 05, 2008 5:34 am

I don't think we can work with the assumption that data trends in past quizbowl matches necessarily predict game results of future matches. I'm unconvinced that the body of quizbowl match results as a whole to this point represents the expected outcomes of matches to come, and I certainly reject outright the idea that a data set that mixes non-common and common-opponent schedules, is heavily skewed towards sketchily edited west coast sets, TRASH regionals, and IS set tournaments, and has a whole bevy of other problems has any useful extrapolative value whatsoever. These data stem from activities whose commonality barely extends past the use of questions and buzzers. Who says that stirring up all of these (or any other concoction) yields something that will be predictive for future quizbowl as a whole, or more importantly, any individual tournament?

I hold that there is a hefty burden that resides with those who advocate using past data to remake the tiebreaker system, and that that burden is to show that there is a predictive relationship between what has happened in the past and what will happen in the future. Unless someone can show that one tiebreaker stands above the rest regardless of the type of questions, level of competition, a team's slate of opponents, and a plethora of other variables, I don't think we can safely use this kind of data at all.

I still believe that it's best to set a reasonable, intuitive goalpost as the tiebreaker and stick to that. As we see above, points per game and points per bonus correlate similarly to other methods; even if you claim that the above data are valid, you are still forced to admit that the traditional tiebreakers of PPG and PPB appear to be about as useful as any other proposed method.

Moreover, they have the benefit of being both intuitive and positive. It makes a lot of sense that the better team will score more points against common opponents, or score more points per bonus on a differing schedule. Furthermore, it's a positive tiebreaker; you start from zero and go up, there is a goalpost out there, and once you pass it and another team doesn't, you win the tiebreaker. Which is more appealing, that a team should strive to score as many total points and as many points per bonus as possible, or that a team should hope that their margin of victory in one game (or some amalgamation of all of the proposed tiebreakers that historically boosts correlation by X%) was good enough that results from 1994 Wahoo Wars combined with data from Tartan Tussle XX will indicate that they have a 2.5% better chance of winning a follow-up game?

In sum, I hold that Mike's argument that we must reject theory (which amounts to intuition and reason coupled with practice) because there are data out there is ludicrous. There is no reason at all to take at face value these data as useful.

Fri Dec 05, 2008 6:09 am

theMoMA wrote:Moreover, they have the benefit of being both intuitive and positive. It makes a lot of sense that the better team will score more points against common opponents, or score more points per bonus on a differing schedule. Furthermore, it's a positive tiebreaker; you start from zero and go up, there is a goalpost out there, and once you pass it and another team doesn't, you win the tiebreaker. Which is more appealing, that a team should strive to score as many total points and as many points per bonus as possible, or that a team should hope that their margin of victory in one game (or some amalgamation of all of the proposed tiebreakers that historically boosts correlation by X%) was good enough that results from 1994 Wahoo Wars combined with data from Tartan Tussle XX will indicate that they have a 2.5% better chance of winning a follow-up game?

What does this even mean? All the proposed tiebreakers and combinations of tiebreakers hold the following: it is better to win a game than not, it is better to answer tossups than not, it is better to answer bonus parts than not. We're using West Coast data because I know where those stats are and no one else has volunteered data.

Data is useful because it confirms intuition. Since there are good arguments to be made for various tiebreakers, it follows that we must go to whatever data is available, or collect new data, in order to verify one or more of these arguments. After all, in Georgia, they consider head to head to be "intuitive", a view with which you appear to disagree - therefore there is not a consensus on what is "intuitive". If you have a better set of data on immaculate questions with perfectly opponent-controlled matches, I'd love to see it, because it would be the best data set out there. But I don't think using the data that we do have is somehow invalid.

We're going back to the instantaneous point in a tournament at which the rematch occurs, and predicting which team will win given results of that tournament up to that point. We already know the result, so we're testing how often our predictor is right. 50% means it's a bad predictor, <50% means it's predicting that the team with the better stat will lose the game more often than it will win it. Can we extrapolate this to the future? I don't see why not. We're already pretty certain it can't replace tiebreaker matches, and as more tournaments happen we can feed more data into the machine and come up with the best "approximation" of a tiebreaker match for tournaments that don't have the luxury of that extra packet. I argue that doing this is independent of question quality and independent of strength of schedule; heck, I'm back with my "let's use W/L and predict outcomes of every match" suggestion.

Fri Dec 05, 2008 10:00 am

The people pointing out that the data above is inconclusive are correct. If anything, they are understating how inconclusive it is. If two people each toss a coin 38 times, the expected value for the difference in the number of heads each one gets is about 3.5 heads, or about 9%.
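
The ~3.5-head figure can be checked either analytically (for two Binomial(38, 0.5) counts, the difference has standard deviation sqrt(2·38·0.25) ≈ 4.36, and the mean absolute value of a centered, roughly normal variable is that times sqrt(2/π) ≈ 3.48) or by a quick simulation. A sketch of both:

```python
import math
import random

random.seed(0)  # fixed seed so the simulation is reproducible

def flip_heads(n=38):
    """Number of heads in n fair coin flips."""
    return sum(random.random() < 0.5 for _ in range(n))

trials = 20_000
mean_abs_diff = sum(abs(flip_heads() - flip_heads())
                    for _ in range(trials)) / trials

# Normal approximation: sd of the difference times sqrt(2/pi).
analytic = math.sqrt(2 * 38 * 0.25) * math.sqrt(2 / math.pi)  # about 3.48

print(round(mean_abs_diff, 2), round(analytic, 2))  # both close to 3.5
```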

Fri Dec 05, 2008 11:43 am

theMoMA wrote:I don't think we can work with the assumption that data trends in past quizbowl matches necessarily predict game results of future matches. I'm unconvinced that the body of quizbowl match results as a whole to this point represents the expected outcomes of matches to come, and I certainly reject outright the idea that a data set that mixes non-common and common-opponent schedules, is heavily skewed towards sketchily edited west coast sets, TRASH regionals, and IS set tournaments, and has a whole bevy of other problems has any useful extrapolative value whatsoever. These data stem from activities whose commonality barely extends past the use of questions and buzzers. Who says that stirring up all of these (or any other concoction) yields something that will be predictive for future quizbowl as a whole, or more importantly, any individual tournament?

First of all, if you don't like these data, get me some more that are more to your liking. I've addressed your concerns by publishing means (now with error bounds; thanks, Brian! I was about to get on that myself...) for both isolated tiebreaking and non-isolated tiebreaking. If we get more data, I can address them further by publishing data for different kinds of situations: competition level, type of questions, etc. There really shouldn't be anything systematic that we can't deconvolve given enough data. However, the fact is, even introducing more random errors* by considering the non-star data (or considering "skewed data," though your criticism of skew here is well wide of the mark: please reconsider what sets these data are from,) we should (must) converge to the correct mean with enough data; that's just statistics. This also addresses your claim that these data are useless: no data are useless, we just have to carefully consider the nature of the error we introduce and consider propagating fluctuations.

Secondly, your arguments are massively unscientific. You're just arguing from untested dogmas and saying things that, again, are not counter to what we're examining here. Again, if long-term trends are the best tiebreakers, that will (must) be borne out by the data; if it's not, then it's your dogmas that are wrong. This is what is known as science.

theMoMA wrote:I hold that there is a hefty burden that resides with those who advocate using past data to remake the tiebreaker system, and that that burden is to show that there is a predictive relationship between what has happened in the past and what will happen in the future. Unless someone can show that one tiebreaker stands above the rest regardless of the type of questions, level of competition, a team's slate of opponents, and a plethora of other variables, I don't think we can safely use this kind of data at all.

Okay. I turn that burden back on you: justify uncritically retaining the traditional tiebreaker system without an appeal to tradition itself or to unverified dogmas like "long-term trends are always best." The simple fact is you can't: all untested dogmas are of the same standing and, as you are apparently opposed to looking at actual data and/or don't have any (or are holding out on me...), that's all you can possibly bring me.

Consider, for example, that the whole impetus for this is another person's appeal to "reason" and tradition in favor of the straight head-to-head tiebreaker. Consider, further, that that same tiebreaker for those same reasons was widely considered "the correct one" very recently in the college game and, further, that there's no reason it can't become so again. Evidently, you vehemently disagree with that person and with the practitioners of the college game of years past, but their arguments are just as sound as yours in the absence of data and analysis: you've all brought only your dicks to a sword fight.

MaS

*This is somewhat begging the question: Andrew evidently means to assert that competition level/question type may introduce systematic, rather than random, drifts. I don't know if I buy that, but, at the same time, it's not something I can safely dismiss out of hand. The answer is (you guessed it!) more data.

PS: Also, your argument is contradictory at least in this: You're arguing that different situations (types of questions, level of competition) may have different results for the most predictive tiebreaker. You're then arguing that everyone is therefore compelled to use the same tiebreaker in the name of reason. That does not follow.

Fri Dec 05, 2008 12:13 pm

Report with Random Errors*:

Starred data (sample size 20): Margin difference, 60% (10.95%); Bonus conversion difference, 55% (11.12%); Head-to-head result, 53.33% (11.15%); Head-to-head margin, 50% (11.18%); Point conversion difference, 50% (11.18%).

All data (sample size 38): Margin difference, 65.79% (7.70%); Bonus conversion difference, 60.53% (7.93%); Head-to-head result, 54.84% (8.07%); Head-to-head margin, 57.89% (8.01%); Point conversion difference, 57.89% (8.01%).

*Binomial random errors in parentheses. These should be considered lower error bounds: there are other drifts unaccounted for.

MaS

Fri Dec 05, 2008 4:22 pm

You have not addressed my concerns, and your statement about error bounds reflects a fundamental misunderstanding of what I'm saying. Your error bounds are useless outside of the data themselves. You've yet to show that these data have any value outside of themselves (ie, some kind of extraordinary power to predict future action), and until you do so, I will continue to reject what you're doing. I do hold that your data are useless, just like golf ball trajectory data are useless in determining who should win quizbowl tiebreakers. Until you show that the data are applicable to the situation at hand, I hold that we have no reason to assume that the data are valuable. When Dwight says "I argue that [feeding a bunch of data from past tournaments into a machine and coming up with a statistical tiebreaker] is independent of question quality and independent of strength of schedule," why on earth should we take him at face value? This is the major contention in using past data; you can't simply argue it away by putting "I argue" in front of an opinion.

Moreover, why would the burden be on me to get you data "to my liking"? I am the one making objections here; either find a way to counter them, find new data, or abandon your argument. Don't tell me that I have to counter my own argument for you. And stop mischaracterizing my argument. I am not opposed to looking at data, I am opposed to assuming that the data are useful in describing the situation at hand, which I find a hefty precondition to looking at the data.

I merely offer PPG and PPB as reasonable, intuitive, and positive. I am by no means saying that these are the only reasonable, intuitive, and positive tiebreakers that exist. The fact that some people see head-to-head as a legitimate tiebreaker doesn't do anything to my argument; those people can show up and convincingly justify their beliefs as such, which would only show that there can be more than one legitimate tiebreaker. Or they can be wrong. Neither of these possibilities undermines what I'm saying. I see no reason to accept the "other people believe differently and appeal to some of the same things you do, abandon your argument" argument.

It may very well be that the current mode of tiebreaking is an untested dogma, but you've got a responsibility to show that your test is actually the correct one. You haven't done anything to shift the burden back to me. Show that your data are meaningful, or be forced to submit to bottom-up instead of top-down tiebreakers.


Fri Dec 05, 2008 5:45 pm

theMoMA wrote:I am not opposed to looking at data, I am opposed to assuming that the data are useful in describing the situation at hand, which I find a hefty precondition to looking at the data.

Andrew, unless I'm horribly mischaracterizing your argument, you appear to be stating that we cannot use the data that we have because it is not at all useful. Do you agree with the following method?

Hypothesis: A is a better predictor of B than C is.

Testing Hypothesis: We define two Bernoulli random variables corresponding to A -> B and C -> B. We find a bunch of situations in which A occurs, and a bunch of situations in which C occurs. In each situation, either B occurs (a 1) or B does not occur (a 0). From this, we can estimate the mean of each Bernoulli variable, i.e., the true probability that A -> B and that C -> B.

Data Analysis: We can run a one-sided z-test with H0: The true probability that B occurs given A and the true probability that B occurs given C are the same, and HA: The true probability that B occurs given A is greater than the true probability that B occurs given C.

Conclusion: If we get a p-value of less than our significance level, say 5%, then we reject H0 and claim that the true probability that B occurs given A is greater than the true probability B occurs given C. This necessarily implies that A is a better predictor of B than C is. If we get a p-value greater than our significance level, then we cannot reject H0 and we're back to "intuition" in deciding whether A or C is better.
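The setup above can be sketched in a few lines of Python (an illustrative sketch with made-up counts; the function and variable names are mine):

```python
import math

def one_sided_two_prop_z(successes_a, n_a, successes_c, n_c):
    """One-sided two-proportion z-test.
    H0: P(B|A) = P(B|C); HA: P(B|A) > P(B|C).
    Returns (z, p_value)."""
    p_a, p_c = successes_a / n_a, successes_c / n_c
    pooled = (successes_a + successes_c) / (n_a + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_c))
    z = (p_a - p_c) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)
    return z, p_value

# hypothetical example: A predicted B in 40 of 50 cases, C in 30 of 50
z, p = one_sided_two_prop_z(40, 50, 30, 50)
# reject H0 at the 5% significance level iff p < 0.05
```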

If you do not agree, tell me where there is a problem with this setup. If you do agree, tell me where I can find data that might be more "useful," or prove to me that no such data exists. Unless I'm terribly mischaracterizing your argument (and I think I am), you seem to be implying that the only useful data is future data, i.e., data that we don't have (and once we do have it it'll be invalid because it's now past data).

As Mike said, there may be some systemic drift between different types of questions, or between different records, and he's entirely right when he says that we can check this if there's enough data (using the method outlined above).

The only argument that I think you can really be making is that the data have not been randomly selected. I will agree with you there, because we don't have data from other circuits. From our small sample of data, we are making a generalization about the population of (rematches between teams of the same record on the same packet set). If there is a systemic reason why we should not include "old" or "poorly edited" tournaments in our sample, outside of that it might skew the data one way or another (which, as Mike said, we can deconvolve with enough data), then you need to explain to me what it is, because you haven't done that yet.

Performance on 1994 Wahoo Wars is probably not well predictive of performance at 2007 ACF Regionals, but we are comparing data from within tournaments (and their fields), not between tournaments (and their fields). That is, we are not taking data from 1994 Wahoo Wars and extrapolating to 2007 ACF Regionals. We are taking data from Wahoo Wars and comparing it to other data from Wahoo Wars, and doing the same thing with ACF Regionals.

As long as the match passes our exclusionary criteria (e.g., we need rematches so the teams are theoretically at the same level at which they played the last time, although this assumption does not always hold; furthermore, we need matches between teams of the same record because W-L record is probably the best predictor of who will win a given match), it should be included in the data set. You appear to be arguing that we need additional exclusionary criteria: please elucidate what exactly these criteria should be.

If there had been some procedural change (for instance, if the halftime Whack-a-Mole game had been played until 2002, then discontinued), then tiebreakers affected by that change would no longer be valid (we can't use Whack-a-Mole to predict tiebreakers because the probability that a team will win a tiebreaker given a Whack-a-Mole win is 0, since there is no chance a team will actually win the Whack-a-Mole game). The only "changes" that have really occurred in the past decade are that questions are almost uniformly longer and relatively easier. Neither of these is a systemic change that prevents us from computing a meaningful statistic such as bonus conversion.

Fri Dec 05, 2008 6:34 pm

Okay, Andrew. I say that I have, in fact, understood and addressed your concerns. I will now try to do so a second time. If you see any objection of yours that isn't addressed, I invite you to point out what isn't addressed and how.

I hold as an axiom that past results are the best (and only reasonable) predictor of future results available. This is the farthest thing possible from "extraordinary power." I seek only to use the very basic predictive power of statistics. If you can see a better predictor than past performance, you're welcome to disclose what it is, but the claim I'm making here is hardly odd or extraordinary.

However, the fact that you continue to denigrate even the principle that future results can be predicted by past data leads me to believe that it is you who lack understanding in this case. Therefore, let me take your argument to its logical conclusion: if past results have negligible predictive power, one cannot justly have resort to any tiebreaker whatsoever because, as they're all based on past results of some kind, they're all inherently unfair and baseless fiat judgments that a TD foists on their field. Now, I don't believe that, and your argument about the reasonableness of traditional tiebreakers leads me to conclude that you don't believe that, either. This is, in fact, a major contradiction in what you're saying and gives you every reason to abandon your argument.

Now, then, I've said already and say again that you make a valid criticism by saying that differing types of questions or levels of competition may introduce systematic drifts in the data that we have no good way of compensating for. For a second time, I accept that that may be, though I have my doubts. That means that I can only publish what you will see as lower bounds on the error, for now.

However, I addressed that and address that again by saying that, with further data, we can observe what these drifts are by deconvolving whatever trends you like. Therefore, your criticism (to the extent that it is valid) is one of the data, not of the method per se. Given sufficient data, this method will observe which is the best tiebreaker in any situation that occurs frequently enough. But nobody, me least of all, has ever said that this method will work well with a paucity of data or with only certain kinds of data: in fact, I am saying and have always said the exact opposite of that.

However, if you cleave to this criticism and want to convince me of it, it is incumbent on you to demonstrate it. Find data for a situation of import (well-edited sets or top-flight teams or whatever) and show me that my results are badly different from your results for those. If you can't or won't do this, your criticism is in the realm of conjecture and my (or anyone else's) counter-conjecture is equally valid.

Now, I'll note for a second time that your argument that different tiebreakers may be more predictive in different situations directly contradicts your contention that we should just use PPG or PPB in all cases. Your argument, in fact, dictates that, if we would be fair, we must use the correct tiebreaker for the situation. That is a second major contradiction in what you're saying and, again, gives you every reason to abandon this argument.

Now, if you understand what I've said, you understand that I'm not assuming that every datum is equally valid in every situation. In fact, I'm saying quite the opposite of that: I'm saying that we are compelled to examine different situations to determine if the most predictive tiebreaker may be different in different cases. So, if you're not opposed to examining the data and drawing conclusions, you have no further issue with what we're doing here. However, you claim to understand what I'm saying and yet continue to oppose it. That is a third major contradiction in what you're saying and, again, gives you every reason to abandon this argument.

You say that other tiebreakers may be just as good as the ones you propose, even by your own standards. Then, I ask you: on what basis do you propose the ones you do and not others? The fact that the exact same argument you're making can be used to justify different conclusions (by your own admission!) formally indicates that your conclusion does not follow from your argument. That is a fourth major contradiction in what you're saying and, again, gives you every reason to abandon this argument.

In closing, I'll note that you're right that the responsibility is on me to show that my test is valid. Fair enough: I take as an axiom that, if we're fair, we are compelled to select the tiebreakers that would best predict the outcome of an actual match, since we would presumably play the match to break the tie if we could. However, I say that what is above shows precisely that, given enough data, this test will indicate which stats are most predictive of winning in any situations that you like.

However, nothing substantive above is new; it is rather what I've been saying all along. Therefore, I claim that, if you have not heretofore understood that my proposed test is valid (or, indeed, if you don't understand that now), it is not because of my failure to demonstrate that it's so, but rather your failure to understand the principles of my arguments. I invite you to demonstrate that this is not so if you can.

MaS


Fri Dec 05, 2008 6:54 pm

While I'm at it, I'll continue to disregard Andrew's remarks and post some more useless data:

2008 ACF Fall North, Round 14: Minnesota C vs Eden Prairie. Both teams 4-8*. Minnesota C held h2h, h2h differential, ppb. EPHS held ppg, point differential. EPHS def Minnesota C 325-140.

2008 ACF Fall North, Round 15: Minnesota A vs Chicago C. Both teams 12-1*. Minnesota A held all tiebreakers except h2h, which was split. Minnesota A def Chicago C 350-220.

2008 VCU Novice, Round 12: Broccoli Forest vs Jonathan Hoag. Both teams 5-5*. Jonathan Hoag held all tiebreakers (ppb by only about 0.15 ppb) and won 405-205.

2008 VCU Novice, Round 12: Lampoon vs Streetcar. Both teams 7-3*. Lampoon held all tiebreakers. Streetcar won 310-220.

2008 FEUERBACH South, Round 11: VCU vs Clemson. Both teams 7-3*. Clemson held ppb and h2h differential, narrowly held point differential (by 4.5 ppg). VCU narrowly held ppg (by 8 ppg). h2h was split. Clemson won 195-125.

2008 FEUERBACH South, Round 9: VCU vs Clemson. Both teams 6-2*. Clemson held ppb and h2h differential, narrowly held point differential (by about 2 ppg). VCU held ppg (this time by about 12 ppg). h2h was split. VCU won 190-170.

EDIT: Three more

2008 MUT, Round 13: Drake vs Illinois A. Both teams 10-1*. Drake held h2h differential, ppb. Illinois A held point differential, narrowly held ppg (by about 5 ppg). Drake def Illinois A 430-60.

2008 MUT, Round 9: Minnesota A vs Armageddon. Both teams 4-3. Minnesota A held h2h, h2h differential. Armageddon held ppg, ppb, point differential. Armageddon won 395-130.

2008 MCMNT, Round 10: Lawrence vs Chicago Police Cops. Both teams 6-2*. Lawrence held all tiebreakers and won 310-155.


Thu Dec 11, 2008 3:55 am

I've made a bit of progress on this.

I've written a script that scans SQBS "*_games.html" files and finds cases where two teams face each other multiple times. It then finds all the opponents those two teams have in common, and keeps track of both teams' stats for the games they play against common opponents (as well as for their head-to-head matchups). So, for each of the two teams, I keep stats for N games (N-2 games versus common opponents, and 2 head-to-head matchups). I require that the two teams have identical records in their first N-1 games, and I require N>5, so that the teams have at least a reasonable number of common opponents. If all these requirements are met, I calculate the teams' stats for those N-1 games (ppg, ppb, point differential, head-to-head) and see how well those stats predict which team wins their second head-to-head match.

This is essentially equivalent to Dwight's *'d data points, except I'm loosening the requirement on what rounds the teams play their opponents. For example, let's say Teams A and B play 8 rounds, with the following opponents:

Team A plays [B,C,D,E,F,G,B,H]

Team B plays [A,E,C,H,D,F,A,K]

In this case, instead of using the first 6 rounds for comparison (where the teams don't play all common opponents), I can look at rounds [1,2,3,4,5,8] for Team A, and rounds [1,2,3,4,5,6] for team B. In those rounds, both A and B play each other once, as well as play C,D,E,F, and H. Assuming A and B both have identical records in those 6 games, we can look at their stats in those games, and see how they predict who wins their second head-to-head matchup (in round 7).
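The round-selection step might look something like this (a simplified sketch of the idea, not Brian's actual script; it picks each team's first meeting with the rival plus its first meeting with each common opponent):

```python
def comparable_rounds(opponents, rival, common):
    """Return the 1-indexed rounds to use for comparison: the first
    meeting with the rival plus the first meeting with each common
    opponent."""
    rounds = [opponents.index(rival) + 1]
    rounds += [opponents.index(opp) + 1 for opp in common]
    return sorted(rounds)

# the example schedules from above
team_a = ["B", "C", "D", "E", "F", "G", "B", "H"]
team_b = ["A", "E", "C", "H", "D", "F", "A", "K"]
common = (set(team_a) & set(team_b)) - {"A", "B"}  # {C, D, E, F, H}

print(comparable_rounds(team_a, "B", common))  # [1, 2, 3, 4, 5, 8]
print(comparable_rounds(team_b, "A", common))  # [1, 2, 3, 4, 5, 6]
```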

I've applied this script to about a year's worth of tournaments, using all the results I could find. This gave 54 data points. Here are the results:

PPG: 0.7222 +/- 0.0610

PPG Differential: 0.6852 +/- 0.0632

Bonus Conversion: 0.7593 +/- 0.0582

Head to Head: 0.5000 +/- 0.0680

Here we're starting to see fairly significant differences between head-to-head and the other stats. We'll need a lot more data to distinguish between PPG, PPG differential, and bonus conversion, but in any case, it looks unlikely that there's a large difference among those three statistics.
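For what it's worth, the quoted uncertainties are consistent with the usual binomial standard error, sqrt(p(1-p)/n). A quick check (my own sketch; the success counts 39, 37, 41, and 27 out of 54 are inferred from the proportions above):

```python
import math

def binomial_se(p, n):
    """Standard error of an observed proportion p over n trials."""
    return math.sqrt(p * (1 - p) / n)

n = 54
for name, wins in [("PPG", 39), ("PPG Differential", 37),
                   ("Bonus Conversion", 41), ("Head to Head", 27)]:
    p = wins / n
    print(f"{name}: {p:.4f} +/- {binomial_se(p, n):.4f}")
```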


Thu Dec 11, 2008 4:16 am

Awesome! I'm glad there are people so much better at mining these data than I am. So, it seems what we need now are more SQBS files, then? How hard would it be to make splits for, like, record or tournament type using your script?

MaS


Thu Dec 11, 2008 12:57 pm

Schweizerkas wrote:I've made a bit of progress on this.

Wow, excellent. Well done.

Thu Dec 11, 2008 4:59 pm

Brian,

That looks awesome. If you need more data, you can try NAQT's database. You'd probably have to rework your script and make sure you don't get repeats, but I'm guessing that NAQT has some statistics that aren't elsewhere. Of course, you could also try going back to 2006-07 or 2005-06.

I've taken your data and really quickly run it through the 2-PropZTest function on my trusty TI-83+. I get the following p-values (I've defined the following: H0: p1 = p2; Ha: p1 > p2):

p1 = BC, p2 = PPG: 0.330

p1 = BC, p2 = PPGDiff: 0.195

p1 = BC, p2 = H2H: 0.002**

p1 = PPG, p2 = PPGDiff: 0.337

p1 = PPG, p2 = H2H: 0.009**

p1 = PPGDiff, p2 = H2H: 0.025*

*Significant at the 5% significance level

**Significant at the 1% significance level

Given these data, I think we can safely conclude that head-to-head is the weakest of the four tiebreakers considered.

For those of you who haven't taken a statistics class, or don't remember anything from it, I define a null hypothesis that the true percentage of games accurately predicted by one tiebreaker is the same as the true percentage of games accurately predicted by a different tiebreaker, and an alternative hypothesis that the true percentage of games accurately predicted by one tiebreaker is greater than the true percentage of games accurately predicted by the other. I plug the data into a fancy mathematical formula to get a z-score, which I can turn into a p-value. If my p-value is less than my significance level, I reject my null hypothesis (and am forced to accept my alternative hypothesis, assuming I've defined my hypotheses correctly); otherwise I cannot reject the null hypothesis (and thus I must continue to assume that one tiebreaker is not better than the other).
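For anyone without a TI-83+ handy, the same one-sided two-proportion z-test can be reproduced in a few lines of Python. (This is my own sketch, not Dwight's calculator session; the success counts 41, 39, 37, and 27 out of 54 are inferred from Brian's proportions above.)

```python
import math

def p_value(x1, x2, n=54):
    """One-sided two-proportion z-test. H0: p1 = p2; Ha: p1 > p2,
    where p1 = x1/n and p2 = x2/n."""
    p1, p2, pooled = x1 / n, x2 / n, (x1 + x2) / (2 * n)
    z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.5 * math.erfc(z / math.sqrt(2))  # 1 - Phi(z)

bc, ppg, ppg_diff, h2h = 41, 39, 37, 27  # correct predictions out of 54
print(round(p_value(bc, ppg), 3))        # BC vs PPG
print(round(p_value(ppg_diff, h2h), 3))  # PPGDiff vs H2H
```

One nit: BC vs. H2H comes out to about 0.0026 with this formula, which rounds to 0.003 rather than 0.002; the calculator value was presumably truncated rather than rounded.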


Fri Dec 12, 2008 12:41 am

Dwight, that looks really nice. I didn't realize TI calculators had that type of capability built in. The p-value is exactly the thing we want to be looking at.

Captain Scipio wrote: How hard would it be to make splits for, like, record or tournament type using your script?

Splitting up by tournament type is just a matter of sorting the SQBS files by hand into different categories (most of the tournaments I have are college ACF-style, but there are some NAQT and trash tournaments as well). So, that shouldn't be very hard. What exactly do you mean by "splitting by record"? I think it should be relatively easy to make splits based on any category you can imagine. One idea I had was to look at how predictive the stats are as a function of the stat difference between the two teams. So, for example, instead of just asking, "how often does the team with the higher PPB win the second H2H matchup?", we can look at, "how often does a team with 1 (or 2, or 3, etc.) higher PPB win the second H2H matchup?" This way, we can find out, is a 1 PPB advantage more or less significant than (e.g.) a 20 PPG advantage?

I'll see if I can make a Google Docs spreadsheet available with all the numbers for the 54 datapoints, so you can play around with the numbers yourself.
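The bucketing idea described above (win rate as a function of the size of the stat advantage) could be sketched like this. This is a hypothetical illustration, not part of the actual script; the function name and the sample data are invented:

```python
from collections import defaultdict

def win_rate_by_advantage(matchups, bucket_size):
    """matchups: list of (advantage, won) pairs, where advantage is the
    leader's edge in some stat (e.g. PPB) going into a rematch, and won
    is True if that leader won the second head-to-head game.
    Returns {bucket lower bound: fraction of rematches won}."""
    buckets = defaultdict(lambda: [0, 0])  # bucket index -> [wins, games]
    for adv, won in matchups:
        b = int(adv // bucket_size)
        buckets[b][1] += 1
        if won:
            buckets[b][0] += 1
    return {b * bucket_size: wins / games
            for b, (wins, games) in sorted(buckets.items())}

# e.g. bucketing some made-up PPB-advantage data into 1-PPB-wide bins:
sample = [(0.4, False), (1.2, True), (1.7, True), (2.5, True), (3.1, True)]
print(win_rate_by_advantage(sample, 1.0))
```

With enough data points per bucket, this would let you compare, say, a 1-PPB advantage against a 20-PPG advantage directly.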

Fri Dec 12, 2008 3:07 am

Okay, I have a spreadsheet with all the datapoints available here.

Also, I noticed that one of my datapoints accidentally appeared twice, because I had two copies of the same tournament in my directory. I removed the extra datapoint, and here are the new numbers (based on 53 points):

PPG: 0.7170 +/- 0.0619

PPG Differential: 0.6792 +/- 0.0641

Bonus Conversion: 0.7547 +/- 0.0591

Head to Head: 0.5094 +/- 0.0687
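The +/- figures above appear to be one standard error of a sample proportion, sqrt(p(1-p)/n). Assuming that's the formula, a quick check (the 38/53 count is inferred from the 0.7170 PPG proportion, not stated in the thread):

```python
from math import sqrt

def prop_with_se(successes, n):
    """Sample proportion and its standard error, sqrt(p(1-p)/n)."""
    p = successes / n
    return p, sqrt(p * (1 - p) / n)

# 38/53 correct picks reproduces the PPG line above
p, se = prop_with_se(38, 53)
print(f"{p:.4f} +/- {se:.4f}")  # 0.7170 +/- 0.0619
```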

Fri Dec 12, 2008 9:22 am

Dwight, not to be a stats nitpicker, but you probably should be using a t-test here.

Fri Dec 12, 2008 9:30 am

If two people each flip a coin 53 times, the expected value for the difference in their number of heads is a little over 4, which is approximately the difference in the number of successful picks between PPG, PPG Differential, and Bonus Conversion. p-values around 0.3 should not be used to draw any conclusions other than that more research is necessary. (I'm not contradicting anybody; I'm just making the statistical uncertainties more explicit in case anybody reading this thread thinks it's a good idea to draw conclusions at this point.)
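The coin-flip figure above checks out by simulation (analytically, the difference in head counts has variance 2 * 53 * 0.25 = 26.5, so the expected absolute difference is about sqrt(2 * 26.5 / pi), roughly 4.1). A quick Monte Carlo sketch:

```python
import random

def mean_abs_diff(flips=53, trials=100_000, seed=0):
    """Two people each flip a fair coin `flips` times; estimate the
    expected absolute difference in their head counts."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = sum(rng.random() < 0.5 for _ in range(flips))
        b = sum(rng.random() < 0.5 for _ in range(flips))
        total += abs(a - b)
    return total / trials

print(round(mean_abs_diff(), 2))  # close to 4.1
```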

Fri Dec 12, 2008 2:20 pm

cdcarter wrote: Dwight, not to be a stats nitpicker, but you probably should be using a t-test here.

T-test is used for sample means. Z-test is used for sample proportions. We're comparing proportions, not means. Really, the only criticisms that you can make are:

1. The samples were not selected randomly or independently

2. There is a hidden variable that is causing the difference in data, and so the null and alternate hypotheses are invalid

BTW, for the "new" data set:

p1 = BC, p2 = PPG: 0.330

p1 = BC, p2 = PPGDiff: 0.194

p1 = BC, p2 = H2H: 0.004**

p1 = PPG, p2 = PPGDiff: 0.336

p1 = PPG, p2 = H2H: 0.014*

p1 = PPGDiff, p2 = H2H: 0.037*

*significant at the 5% significance level

**significant at the 1% significance level

Really, not a huge difference, except that the PPG vs H2H is no longer significant at the 1% level.

Fri Dec 12, 2008 8:46 pm

cdcarter wrote: Dwight, not to be a stats nitpicker, but you probably should be using a t-test here.

cvdwightw wrote: T-test is used for sample means. Z-test is used for sample proportions. We're comparing proportions, not means.

Oh, these totally are proportions... I was thinking you were doing means with a z-test, which can be done but is, like, bad. I should read.

Sat Dec 13, 2008 11:12 pm

Added some more tournament data. I now have all college tournaments from 2008 and the second half of 2007. The new numbers, with 106 data points:

Bonus Conversion: 0.6981 +/- 0.0446

PPG: 0.6792 +/- 0.0453

PPG Differential: 0.6415 +/- 0.0466

Head to Head: 0.4906 +/- 0.0486
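Re-running the pooled two-proportion z-test on these 106-point numbers is straightforward; the 74/106 and 52/106 counts below are inferred from the 0.6981 and 0.4906 proportions above, not stated in the thread:

```python
from math import sqrt, erfc

# BC picked the winner ~74/106 times, H2H ~52/106 (counts inferred
# from the proportions in the post above)
x1, n1, x2, n2 = 74, 106, 52, 106
p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 0.5 * erfc(z / sqrt(2))  # one-sided upper tail
print(round(z, 2), round(p_value, 4))  # roughly z = 3.1, p = 0.001
```

So with the larger sample, the BC-vs-H2H gap becomes even more clearly significant.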