Introducing BPA, a new evaluation metric using detailed stats

Posted: Wed Oct 17, 2018 3:52 pm
by ryanrosenberg

What is BPA?

BPA stands for Buzz Point AUC (area under the curve). It is the total area under the curve of [% of tossups gotten successfully] against [% of question elapsed].

The theoretical maximum is 100 (i.e., if all tossups were gotten near-instantly); however, top players will generally get somewhere between 10 and 15 at regular difficulty, which corresponds to preventing about 15% of the total words in the tournament's tossups from being read by getting the question. Top teams will generally get around 20-25 at regular difficulty. As an illustration, below is Chris Ray's buzz point graph from 2018 ACF Regionals.

[Image: Chris Ray's buzz point graph from 2018 ACF Regionals]

BPA can be calculated for any tournament that records buzz points.

How do I calculate BPA?

BPA is fairly easy to calculate, especially for an individual player. The screenshot below shows an example of calculating conversion percent at each buzz point ("max gets" is the number of tossups the player could have heard, i.e. games played times 20). BPA is simply the sum of column F over all buzz points from 0.01 to 1.

[Image: spreadsheet screenshot of the conversion percent calculation]
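
For anyone who would rather script this than use a spreadsheet, here is a minimal Python sketch of the calculation described above. It is one reading of the spreadsheet, not the exact code behind the posted numbers: it assumes buzz points are recorded as fractions of the question elapsed (0 to 1), that only correct buzzes are passed in, and that "column F" is the cumulative conversion fraction at each 0.01 step. The function name `bpa` and its parameters are made up for illustration.

```python
def bpa(buzz_points, max_gets, steps=100):
    """Approximate Buzz Point AUC on a 0-100 scale.

    buzz_points: fraction of the question elapsed (0.0 to 1.0) at each
                 correct buzz. Assumed input format, not from the thread.
    max_gets:    number of tossups the player could have heard
                 (games played times 20).
    steps:       number of buzz-point bins (0.01, 0.02, ..., 1.00).
    """
    total = 0.0
    for k in range(1, steps + 1):
        x = k / steps
        # Cumulative gets: correct buzzes at or before this point.
        gets = sum(1 for b in buzz_points if b <= x)
        # Conversion fraction at this buzz point (the assumed "column F").
        total += gets / max_gets
    # Rescale so that instant buzzes on every tossup give 100.
    return total * (100 / steps)


# One game (20 tossups), with made-up buzzes at 40%, 70% and 95%.
print(bpa([0.40, 0.70, 0.95], max_gets=20))
```

Under these assumptions each correct buzz at point b adds roughly (1 - b) / max_gets to the total, which is why BPA can be read as the share of tossup text that never had to be read.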

What are the advantages of BPA over other quizbowl stats?

BPA is the first metric to take advantage of buzz point tracking and provide a more detailed view into how early people are getting questions. This reveals player skill that may be masked by traditional stats.

For example, let's look at the top two scorers from the Minnesota site of CMST: Shan Kothari and Auroni Gupta. Shan outscored Auroni by about a tossup per game, and recorded seven powers to Auroni's four. However, Auroni had a 6.9 BPA, while Shan came in at 6.47, since Auroni was buzzing earlier on a higher percentage of tossups, particularly in the late-middle clues, before Shan overtook him during giveaways. BPA ranking Auroni over Shan is in line with subjective appraisals of the two players (the player poll had Auroni as a top-5 player in grad school, and Shan in the 10-15 range), but neither of the traditional stats (PPG and powers) captures this difference in skill.

[Image: buzz point graphs for Shan Kothari and Auroni Gupta at CMST]

What are BPA's shortcomings?

BPA is still, like PPG, a heavily context-dependent stat, and is not exactly comparable across fields of different strength (or even across different schedules in the same field). Teammate effects are also fairly strong; BPA does not incorporate the PATH adjustment for shadow effect since I believe that introduces more false positives than the false negatives it corrects.

Who does BPA say is good at quizbowl?

The top 10 players at CMST were Jordan Brownstein (18.04), Jacob Reed (11.36), Stephen Liu (10.51), Neil Gurram (10.21), Eric Mukherjee (9.05), John Lawrence (8.37), Rafael Krichevsky (7.99), Matt Bollinger (7.95), Will Alston (7.36), and Auroni Gupta (6.9).
The top 5 teams were Brownstein et al. (23.86), Yale (23.13), BHSU A (20.45), Bloq Mayus (18.95), and Chicago A (18.13).

The top 10 players at 2018 Regionals were Eric Mukherjee (17.15), Jakob Myers (15.68), Aseem Keyal (14.33), Evan Lynch (12.89), Rafael Krichevsky (12.84), Eric Wolfsberg (12.72), Adam Silverman (12.56), Chris Ray (12.18), John Lawrence (11.82), and Derek So (11.64).
The top 5 teams were Penn A (28.32), Berkeley A (27.6), Chicago A (25.35), Columbia A (25.05), and Maryland A (25.03).

There's also category-specific BPA! Here are overall and category-specific rankings for 2018 Regionals and CMST.

Re: Introducing BPA, a new evaluation metric using detailed stats

Posted: Wed Oct 17, 2018 4:09 pm
by vinteuil
I think this might be the most precise (and intuitively useful) non-PATH-like stat we've ever had—thanks to Ryan for the computations and visualizations!

Re: Introducing BPA, a new evaluation metric using detailed stats

Posted: Wed Oct 17, 2018 4:17 pm
by Periplus of the Erythraean Sea
Auroni Gupta (6.9)
nice

Re: Introducing BPA, a new evaluation metric using detailed stats

Posted: Wed Oct 17, 2018 6:47 pm
by t-bar
This is awesome! Thanks for putting the work into coming up with this.

This is also an interesting statistic to look at on a game-by-game basis, though you have to take the results with a grain of salt. Here are the top 10 games from 2018 ACF Regionals by total BPA:

Winner		Loser		Score		Winner BPA	Loser BPA	Total BPA
Berkeley A	UC San Diego B	500-80		37.915		6.7		44.615
Cambridge B	Oxford B	315-290		24.515		19.83		44.345
Penn A		Villanova	490-50		37.735		6.595		44.33
Penn A		Johns Hopkins A	375-240		31.585		12.31		43.895
Northwestern A	MSU A		320-285		15.855		26.835		42.69
Columbia A	Amherst		355-200		26.67		15.785		42.455
McGill A	McGill B	315-175		26.845		15.42		42.265
Penn A		Delaware	490-115		31.745		10.04		41.785
Northwestern A	Ohio State A	385-215		22.425		19.155		41.58
Columbia A	Harvard A	375-170		26.98		14.565		41.545

Note that in the fifth game, Northwestern A beat MSU A despite having a significantly lower BPA. This is partly because Northwestern waited until the end on all three of MSU's negs, while not negging at all themselves. However, even on the 7 live tossups they converted, Northwestern had an average buzz location of 0.547, substantially later than MSU's average of 0.463.

Here are the five games with the closest margin of BPA, selected from among games with a total BPA of at least 30:

Winner		Loser		Score		Winner BPA	Loser BPA	Total BPA
Ohio State A	Chicago B	325-240		15.1		15.26		30.36
Harvard A	Yale A		310-245		15.9		14.855		30.755
McGill A	Toronto A	270-240		14.95		16.84		31.79
MSU A		Chicago A	310-260		17.33		19.585		36.915
Berkeley B	Stanford	305-230		15.215		17.56		32.775

In all but one of these games, the winner had the lower BPA. However, only some of them can be chalked up to a negstorm by the losing team. For example, in the McGill-Toronto game, McGill went 9/4 to Toronto's 10/2 and won on the strength of their bonus conversion.

Interesting questions for future BPA analysis: what fraction of games are won by the team with the lower BPA? In these situations, can we discriminate between occurrences of (a) one team waiting to the end on a bunch of negs, (b) one team out-bonusing the other, (c) one team having a large advantage in certain categories and being able to sit on those questions, (d) something else? Perhaps it's fruitful to only consider tossups that were not negged, in order to restrict the analysis to situations in which both teams are playing each tossup live. This requires a bit more careful work to determine the number of tossups heard, but it's certainly possible with the data we have.
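
As a starting point for the first question, here is a small Python sketch that counts games won by the lower-BPA team. The data are just the five close games from the table above, so the resulting fraction describes that hand-picked sample only, not Regionals as a whole. The tuple layout and the name `lower_bpa_wins` are illustrative choices.

```python
# (winner, loser, winner BPA, loser BPA), copied from the close-games table.
games = [
    ("Ohio State A", "Chicago B", 15.1, 15.26),
    ("Harvard A", "Yale A", 15.9, 14.855),
    ("McGill A", "Toronto A", 14.95, 16.84),
    ("MSU A", "Chicago A", 17.33, 19.585),
    ("Berkeley B", "Stanford", 15.215, 17.56),
]


def lower_bpa_wins(games):
    """Return the games in which the winner had the lower BPA."""
    return [g for g in games if g[2] < g[3]]


upsets = lower_bpa_wins(games)
fraction = len(upsets) / len(games)

for winner, loser, w_bpa, l_bpa in upsets:
    print(f"{winner} beat {loser} with {w_bpa} BPA to {l_bpa}")
print(f"{len(upsets)} of {len(games)} games won by the lower-BPA team")
```

Running the same check over every game in a tournament would only need the full list of per-game team BPAs in the same shape.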

Re: Introducing BPA, a new evaluation metric using detailed stats

Posted: Wed Oct 17, 2018 11:58 pm
by AGoodMan
This is super cool! Is there any chance we can see similar metrics for EFT?

Re: Introducing BPA, a new evaluation metric using detailed stats

Posted: Thu Oct 18, 2018 8:07 am
by ryanrosenberg
AGoodMan wrote:
Wed Oct 17, 2018 11:58 pm
This is super cool! Is there any chance we can see similar metrics for EFT?

Yes, I'll post EFT BPA later today.

Re: Introducing BPA, a new evaluation metric using detailed stats

Posted: Mon Oct 22, 2018 10:02 pm
by ryanrosenberg
Here's a public link to code used to generate overall BPA for last year's Regionals.

Re: Introducing BPA, a new evaluation metric using detailed stats

Posted: Wed Oct 24, 2018 1:40 pm
by ProfessorIanDuncan
Does this metric factor in negs? Would that be a useful feature? It seems that adding a negative value, namely the difference between the minimum of question length and correct answer buzz point and the neg point, could offer some insight into how negs affect how much of the tournament is heard. I suppose that this would fail to take into account teams waiting until the end of the question to convert, so maybe it's not that useful of an addition.
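
To make the proposed adjustment concrete, here is a rough Python sketch of one way to read it. It assumes buzz positions are normalized to the fraction of the question elapsed, so the question length is 1.0, and it treats a tossup that goes dead after the neg as being read to the end. The function name and parameters are hypothetical, not part of the existing BPA code.

```python
def neg_penalty(neg_point, correct_point=None, question_length=1.0):
    """Negative adjustment for a neg, per the formula described above.

    neg_point:     where the incorrect buzz happened (0.0 to 1.0).
    correct_point: where the other team answered correctly afterwards,
                   or None if the tossup went dead (assumed convention).
    """
    if correct_point is None:
        end = question_length
    else:
        end = min(question_length, correct_point)
    # Extra question text heard after the neg, counted against the negger.
    return -(end - neg_point)


# Neg at 40% of the question, other team converts at 90%.
print(neg_penalty(0.4, 0.9))
# Neg at 40% of the question, tossup goes dead.
print(neg_penalty(0.4))
```

As noted above, the first case looks identical whether the other team needed until 90% or was simply waiting until the end, which is the main weakness of this adjustment.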