
Re: Packets, recordings, and detailed stats warm-up survey

Posted: Fri Feb 02, 2018 3:40 am
by Fado Alexandrino
Something on my to-do list is to look at reaction time. In the aforementioned tossup on Islam, where a third of the buzzes occurred after a particular clue, those buzzes all came within three words of the offending clue. Note that the offending clue ended a sentence.

Personally, I tend to reaction-buzz on clues I've heard before, but wait on earlier clues while I'm trying to connect the dots in my brain, especially two-step processes like 2017 Nobel -> CryoEM -> getting frozen.

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Fri Feb 02, 2018 12:45 pm
by Auroni
Victor Prieto wrote:I think that eliminating the first line or two in tossups could automatically correct this curve without sacrificing ability to differentiate between lower-tier teams. I urge writers to consider shortening their questions in the future.
I'd just like to point out that shortening questions to 6 lines makes it a lot harder to do the following (significantly reducing player empathy):
Aaron Manby (ironmaster) wrote: Personally, I tend to reaction-buzz on clues I've heard before, but wait on earlier clues while I'm trying to connect the dots in my brain, especially two-step processes like 2017 Nobel -> CryoEM -> getting frozen.

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Fri Feb 02, 2018 2:20 pm
by ryanrosenberg
I made the above graphs for each major category and put them all in an album here. They're large images, so you might need to right-click and open the image in a new tab to really get up close.

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Fri Feb 02, 2018 4:00 pm
by Fado Alexandrino
Buzzpoints by team, split by player

EDIT: The area under a player's curve is equal to the number of tossups answered correctly. The y-axis is thus in arbitrary units of points*probability.

Someone should design an interactive website with custom results and graphs, but I unfortunately don't have the skills to be that person.
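
For the curious, the normalization is just a rescaled kernel density estimate. Here's a minimal sketch in Python (with made-up buzz positions, assumed to be recorded as fractions of the question read; this isn't the actual plotting code):

Code: Select all

# Buzzpoint curve whose area equals the number of tossups answered correctly.
# The buzz positions below are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

buzzes = np.array([0.35, 0.42, 0.58, 0.61, 0.70, 0.74, 0.88])  # one player's correct buzzes

xs = np.linspace(0, 1, 200)
density = gaussian_kde(buzzes)(xs)  # a KDE integrates to 1 by construction...
curve = density * len(buzzes)       # ...so rescaling makes the area equal the buzz count

plt.plot(xs, curve, label="Player A")
plt.xlabel("Fraction of question read")
plt.ylabel("Correct buzzes (density units)")
plt.legend()
plt.show()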

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Fri Feb 02, 2018 10:05 pm
by Good Goblin Housekeeping
Aaron Manby (ironmaster) wrote: [...] Someone should design an interactive website with custom results and graphs, but I unfortunately don't have the skills to be that person.
:w-hat: happened to matt lehmann

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Fri Feb 02, 2018 10:11 pm
by Mahavishnu
Borel hierarchy wrote: [...] :w-hat: happened to matt lehmann
And on that note, any teams from the UCF site?

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Sat Feb 03, 2018 10:58 am
by Corry
I just want to say that from a writer's perspective, these are like, god-tier stats. I've always personally preferred to write for NAQT because they were traditionally the only ones to provide after-the-fact conversion data - and therefore, the only ones who could offer empirical data to address familiarity bias and difficulty confirmation bias among writers. But this system really takes things to the next level. Thank you, Ophir!

As a total aside:
Periplus of the Erythraean Sea wrote:A 15-20% power rate is, to my understanding, what NAQT aims for in its tournaments - not sure what median SCT power rate is, but the median ICT and HSNCT power rates usually fall within that range.
This isn’t exactly accurate. While I’ve periodically heard of NAQT “theoretically” aiming for a 15-20% power rate, in practice the median HSNCT and SCT power rates tend to cluster around 20-25%. So purely on a powers basis, this set would probably count as marginally harder than SCT. (ICT is more along the lines of a 15-20% power rate, although that also fluctuates.)

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Sat Feb 03, 2018 7:30 pm
by naan/steak-holding toll
Corry wrote: This isn’t exactly accurate. While I’ve periodically heard of NAQT “theoretically” aiming for a 15-20% power rate, in practice the median HSNCT and SCT power rates tend to cluster around 20-25%. So purely on a powers basis, this set would probably count as marginally harder than SCT. (ICT is more along the lines of a 15-20% power rate, although that also fluctuates.)
I guess so. However, judging from the numbers coming in, this year's SCT looks like it was pretty hard to power.

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Sun Feb 04, 2018 12:36 pm
by thebluehawk1
I am interested in looking at the different ways in which writers ask lit questions, to see the differences in how they are converted. For example, a common format for lit is the "this author" tossup, which typically has fewer deep clues about individual works and more basic clues about obscure works. I think these questions will, on average, be converted earlier than the next type, because it is easier to read a brief Wikipedia summary of several works than to lock down all the clues in a full work. The next type is the "this work" tossup. It is harder to lock down deep clues for a work you haven't read, and there are a lot of works that are tossup-able at Regionals level, so I think these questions would generally be converted later by the field. But because you can usually get a good buzz on a work you have read, and you are more likely to have read a work that is tossup-able (because it is more famous), these questions should have a higher percentage of first buzzes. I don't really have much of a prediction for how common links will play, but I would be interested to look at that as well.

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Sun Feb 04, 2018 4:35 pm
by ThisIsMyUsername
thebluehawk1 wrote:I am interested in looking at the different ways in which writers ask lit questions, to see the differences in how they are converted. For example, a common format for lit is the "this author" tossup, which typically has fewer deep clues about individual works and more basic clues about obscure works. I think these questions will, on average, be converted earlier than the next type, because it is easier to read a brief Wikipedia summary of several works than to lock down all the clues in a full work. The next type is the "this work" tossup. It is harder to lock down deep clues for a work you haven't read, and there are a lot of works that are tossup-able at Regionals level, so I think these questions would generally be converted later by the field. But because you can usually get a good buzz on a work you have read, and you are more likely to have read a work that is tossup-able (because it is more famous), these questions should have a higher percentage of first buzzes. I don't really have much of a prediction for how common links will play, but I would be interested to look at that as well.
Whether this is "typically" true or not really depends on the writer/editor. Some prefer to include a large proportion of author tossups that interleave clues from works that are themselves tossupable at the same difficulty level; and some prefer to write author tossups that clue from works that are not. Likewise, a tossup on a work could begin by using mainly secondary-source clues, which are sometimes drawn from the same Wikipedia/Google-type sources that you say mostly populate author tossups.

What would be more pertinent (but far more labor-intensive) would be to tag questions according to what type of early clue they use (rather than their answer-line type), and to see how that affects buzzing. I think you may be right that one or the other might have typically earlier buzzpoints. But I think, above all, one would also find that some individual players are better at one type and some at the other (depending on their balance between reading and studying).
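
If someone were to do that hand-tagging, joining the tags back onto the buzz data would be the easy part. A rough sketch, with entirely hypothetical file and column names:

Code: Select all

# Hand-tag each question's early-clue type in a CSV, join it to the per-buzz
# data, and compare buzz positions across clue types. All names are hypothetical.
import pandas as pd

buzzes = pd.read_csv("buzzes.csv")    # hypothetical export: one row per correct buzz
tags = pd.read_csv("clue_tags.csv")   # hand-made file: question_id, clue_type
merged = buzzes.merge(tags, on="question_id")
merged["frac"] = merged["buzz_word"] / merged["question_length"]
print(merged.groupby("clue_type")["frac"].agg(["mean", "median", "count"]))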

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Sun Feb 04, 2018 4:52 pm
by khannate
I spent some time playing with the stats from the UIUC site, and more specifically trying to find some meaningful way to compare and rank teams. What I ended up doing was constructing an estimate of the distribution of buzz points for each team within each category, by starting with Will's proposed ideal distribution of buzz points and Bayesian updating based on the actual buzzes the team got within the category. This ends up looking like a weighted average of the ideal distribution and the empirical distribution of a team's buzzes, weighted by a tuning parameter and the number of buzzes the team got in the category.

Based on these distributions, you can simulate two teams playing a tossup in a category by drawing a buzz from each team's distribution in that category, determining what would have happened, and giving the team that got the tossup its PPB as bonus points. By doing this the right number of times for each category, you can simulate a full game. I did this 100 times for each pair of teams at the UIUC site and plotted the results in the graph attached. The entry at (row, column) is the fraction of simulated games in which the row team beat the column team.
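
In code, the core of the construction looks roughly like this (the bin count, flat prior, and all names below are illustrative placeholders rather than my actual code or Will's actual distribution; it also collapses the per-category treatment into a single distribution and ignores negs and dead tossups):

Code: Select all

# Posterior buzz-point distribution per team, plus simulated tossups.
# Everything here is a simplified sketch with made-up numbers.
import numpy as np

BINS = 10                      # buzz position discretized into tenths of a question
prior = np.ones(BINS) / BINS   # placeholder standing in for the ideal distribution
ALPHA = 5.0                    # tuning parameter: how much weight the prior gets

def team_distribution(buzz_bins):
    """Posterior mean over buzz-point bins: a weighted average of the prior and
    the empirical distribution, weighted by ALPHA and the number of buzzes."""
    counts = np.bincount(buzz_bins, minlength=BINS)
    return (ALPHA * prior + counts) / (ALPHA + counts.sum())

def play_tossup(dist_a, dist_b, ppb_a, ppb_b, rng):
    """Draw a buzz from each team's distribution; the earlier buzz wins the
    tossup, scoring 10 plus that team's PPB (ties broken at random)."""
    a, b = rng.choice(BINS, p=dist_a), rng.choice(BINS, p=dist_b)
    a_wins = a < b or (a == b and rng.random() < 0.5)
    return (10 + ppb_a, 0) if a_wins else (0, 10 + ppb_b)

rng = np.random.default_rng(0)
dist_a = team_distribution(np.array([2, 3, 3, 5]))  # made-up buzz bins for team A
dist_b = team_distribution(np.array([6, 7]))        # made-up buzz bins for team B
a_gets = sum(play_tossup(dist_a, dist_b, 18.0, 12.0, rng)[0] > 0 for _ in range(100))
print(a_gets, "of 100 simulated tossups went to team A")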

I think this sort of model can be useful for thinking about the outcomes of tournaments and how surprising or unsurprising they are. For example (at least at Chicago), there's a perception that Chicago teams upset each other an unusual amount, but the graph suggests this isn't actually the case: at a tournament where all of Chicago A, B, and C play each other, the probability of at least one team losing to a lower-lettered team is about 44%.
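
(For anyone following along, that kind of figure is just a complement rule over the pairwise win probabilities, assuming the three results are independent; the numbers below are made up rather than the actual UIUC estimates:)

Code: Select all

# P(at least one upset) = 1 - P(no upsets), with made-up win probabilities
# for the lower-lettered (nominally weaker) team in each pairing.
p_b_beats_a = 0.20
p_c_beats_a = 0.10
p_c_beats_b = 0.25

p_no_upset = (1 - p_b_beats_a) * (1 - p_c_beats_a) * (1 - p_c_beats_b)
print(1 - p_no_upset)  # 0.46 with these made-up numbers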

This could also be used to do forecasting for Nats by simulating all the matches on the Nats schedule, running through the tournament 1000 times, and seeing the distribution of each team's placing.

If people are interested in seeing this sort of thing for other sites, or for, say, the top 25 teams by PPB, just let me know and I'd be happy to do it.

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Sun Feb 04, 2018 5:05 pm
by wcheng
khannate wrote:I spent some time playing with the stats from the UIUC site, and more specifically trying to find some meaningful way to compare and rank teams. [...] If people are interested in seeing this sort of thing for other sites, or for, say, the top 25 teams by PPB, just let me know and I'd be happy to do it.
I think it'd be really interesting to see how the top teams by A-Value stack up against each other by this metric!

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Sun Feb 04, 2018 5:23 pm
by A Dim-Witted Saboteur
khannate wrote:I spent some time playing with the stats from the UIUC site, and more specifically trying to find some meaningful way to compare and rank teams. [...] I think this sort of model can be useful for thinking about the outcomes of tournaments and how surprising or unsurprising they are. [...]
Another interesting thing to look at would be what percentage of games that actually took place at Regionals were upsets (the actual winner lost the majority of simulated games) or strong upsets (the actual winner lost more than 65? 70? 75? % of simulated games).

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Tue Feb 06, 2018 5:20 pm
by Maxwell Sniffingwell
Stupid question, but are the stats actually posted anywhere?

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Tue Feb 06, 2018 5:49 pm
by naan/steak-holding toll
cornfused wrote:Stupid question, but are the stats actually posted anywhere?
We have not released them publicly yet - we were hoping to hear from numerous voices within the community, at many levels of skill, voicing their opinions and asking questions. In the interest of fostering this sort of discussion, we've withheld the stats for a bit so that people don't retreat to their own silos / group chats, noodle around with things themselves, answer their own questions, and never talk about things in public channels.

Perhaps it's ironic that withholding some information should be necessary to foster public discourse, but I doubt this thread would have taken off if we had simply released the stats immediately, since there was very little discussion of them for this year's EFT or This Tournament is a Crime.

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Tue Feb 06, 2018 8:40 pm
by a bird
Aaron Manby (ironmaster) wrote:Buzzpoints by team, split by player [...]
These are a very nice way of looking at the buzzpoint data for a given team. Thanks for making them! This (along with the plots Ryan made) got me wondering how the difficulty of different categories could affect these curves, both in general and in the specific case of this tournament. For example, say (hypothetically) the lit had easier early clues than the history. If player A buzzed mostly on lit while their teammate player B buzzed mostly on history, the different shapes of the A and B buzzpoint curves would be influenced both by the players' knowledge of their respective categories and by the difficulty of those categories. A and B could have buzzed on clues of comparable difficulty but ended up with different buzzpoint curves, due primarily to the cluing in the set.

Do people think this had a substantial effect on the buzzpoint curves, or was it negligible? Of course, most players buzz on multiple categories anyway, so the effect I'm describing might be hard to detect in most cases, even if it did happen. It might also be interesting to make buzzpoint distribution plots for specific categories of interest, or plots that somehow incorporate average performance on a per-subject, or even per-question, basis.

Has anyone analyzed which categories had the most early buzzes? Subjectively, I didn't find any category harder by a large amount, but I wonder what the data say about the difficulty of early clues in different categories.
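
Something like the following would answer the per-category question, given a per-buzz export (all file and column names here are guesses rather than the actual format):

Code: Select all

# Per-category buzzpoint curves, scaled so each area equals that category's
# buzz count (the same trick as the team plots above). Names are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

buzzes = pd.read_csv("buzzes.csv")  # hypothetical export: one row per correct buzz
buzzes["frac"] = buzzes["buzz_word"] / buzzes["question_length"]

xs = np.linspace(0, 1, 200)
for cat, grp in buzzes.groupby("category"):
    plt.plot(xs, gaussian_kde(grp["frac"])(xs) * len(grp), label=cat)
plt.xlabel("Fraction of question read")
plt.ylabel("Correct buzzes (density units)")
plt.legend()
plt.show()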

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Fri Feb 09, 2018 11:05 pm
by Tejas
I've gone through the stats and tried testing some of the hypotheses people have posted in this thread that have not yet been answered.
CPiGuy wrote:"Bad" teams (let's say <12PPB) will have a higher percentage of 30'd bonuses than buzzes in the first two lines, and "good" teams (>18PPB) will have a higher percentage of buzzes in the first two lines than 30'd bonuses.
I wasn't sure exactly what was meant by buzzes in the first two lines, so I took it to mean correct buzzes in the first two lines as a percentage of all of that team's buzzes. Based on that definition, I found that in total, "bad" teams 30'd 1.2% of bonuses and buzzed in the first 30% of a tossup (using percentage of question length as a proxy for lines) 0.8% of the time. On the other hand, "good" teams 30'd 23.0% of bonuses and buzzed in the first 25% of the tossup 4.1% of the time.

You probably underestimated how easily top teams can 30 regular-difficulty bonuses in particular, especially compared to getting early buzzes on a set with somewhat tougher lead-ins.
nsb2 wrote:I would predict that a large majority of music buzzes (maybe up to 90%) were after the third line or so, even more than for other categories.
You were correct that more than 90% of music buzzes came after the third line (I used 40% of the tossup as a proxy). However, this was not especially high compared to other categories.

Code: Select all

Subcategory                                         Buzzes After 3rd Line
Philosophy                                          0.984
Miscellaneous Lit                                   0.982
Drama                                               0.972
Biology                                             0.970
Non-Epic Poetry                                     0.969
British, Canadian, Australian, New Zealand History  0.963
Chemistry                                           0.957
Social Science                                      0.956
Music                                               0.949
Other Academic                                      0.949
Other Science                                       0.946
Other Art                                           0.939
Other History                                       0.939
US History                                          0.937
Long Fiction                                        0.925
Short Fiction                                       0.925
Physics                                             0.920
Painting/Sculpture                                  0.915
Continental European History (post-600 CE)         0.910
Continental or Near Eastern History (pre-600 CE)    0.905
Geography                                           0.901
Religion                                            0.890
Historiography and Archaeology                      0.885
Current Events                                      0.853
Mythology                                           0.846
Trash                                               0.766
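
For reference, something like the following would produce a table like the one above (this is a toy sketch, not my actual script; file and column names are hypothetical, and 40% of the question again stands in for "three lines"):

Code: Select all

# Fraction of correct buzzes coming after 40% of the question, by subcategory.
# File and column names are hypothetical.
import pandas as pd

buzzes = pd.read_csv("buzzes.csv")  # hypothetical export: one row per correct buzz
frac = buzzes["buzz_word"] / buzzes["question_length"]
late = (frac > 0.40).groupby(buzzes["subcategory"]).mean()
print(late.sort_values(ascending=False))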
cwest123 wrote:At the Southeast (Georgia Tech) site, the majority of music buzzes were VERY late in the question, generally close to or in the last line. Percentage-wise, I'll guess that the average buzz was beyond the 75% point.
This is correct: of all the sites, I found that the Georgia Tech site had the latest mean and median buzzpoints. This considers only correct buzzes and ignores vulches.

Code: Select all

Site            Average Buzz %  Median Buzz %
Minnesota       0.684           0.720
UCSD            0.700           0.723
Kansas State    0.713           0.664
Oxford Brookes  0.715           0.754
Penn State      0.723           0.723
UIUC            0.765           0.813
Connecticut     0.771           0.791
Toronto         0.783           0.876
UCF             0.843           0.887
Virginia        0.863           0.940
Rice            0.899           0.924
Georgia Tech    0.936           0.996
Sima Guang Hater wrote: -Hard parts were significantly harder than middle parts, leading to a "wall effect" around 20 ppb
-Science bonuses were, on average, harder than literature bonuses
I don't know if there's a good way to measure a "wall effect"; overall, middle bonus parts were converted around 50% of the time and hard parts about 15%. This seems about what's expected, and any measure of "significantly harder" would probably require a comparison to other regular-difficulty tournaments. You were correct that science bonuses were harder, although I was surprised to see that science easy parts were actually converted pretty well, while the science middle and hard parts were converted the least of any category. I've attached a plot below showing conversion by category and difficulty.

[Attached plot: bonus conversion by category and part difficulty]
geremy wrote:I predict that out of all the science subcategories, physics has the latest average buzz point and biology the earliest, but the average PPB will be pretty close.
Physics did not have the latest buzz point, but it did have the lowest PPB.

Code: Select all

Subcategory    PPB    Average Buzz %  Median Buzz %
Biology        14.68  0.771           0.814
Chemistry      14.59  0.755           0.770
Other Science  14.28  0.727           0.746
Physics        13.08  0.761           0.814
I'll get back to some of the other ones; PM me if you catch any errors I made here.

EDIT: fixed incorrect statement

Re: Packets, recordings, and detailed stats warm-up survey

Posted: Sat Feb 10, 2018 1:48 am
by ErikC
It's interesting that the Toronto site's median and average buzz points are quite different.

I'm not surprised science hard parts were the hardest. I think science easy parts were converted well because they often ask about a concept almost everyone is familiar with (like gravity), even if they don't understand it fundamentally (like Rein Otsason).