## The D-value SOS calculation is broken

ryanrosenberg
Auron
Posts: 1188
Joined: Thu May 05, 2011 5:48 pm
Location: Chicago, Illinois

### The D-value SOS calculation is broken

Last Saturday, DePaul played the Missouri SCT site, a combined D1/D2 field using the D2 set. Of DePaul's opponents, only WUSTL A and C averaged 20 PPB on the D2 set, with over half coming in under 15 PPB.

And what did this leave DePaul with? The second-highest strength of schedule (SOS), not in the Missouri field, but of every team that played SCT. This is due to a confluence of three serious flaws with the SOS calculation, which I'll lay out below.

To review, SOS is calculated as tossup points per tossup heard (TUPPTH) of the teams you played in their other games, divided by the tossup points per tossup heard over all SCT sites.
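To make that definition concrete, here is a minimal sketch of the calculation in Python. The function names and the numbers are illustrative placeholders, not NAQT's actual code or data:

```python
# Hedged sketch of the SOS formula described above; the opponent data and
# the national average are made-up illustrative numbers, not NAQT's.

def tuppth(tossup_points, tossups_heard):
    """Tossup points per tossup heard."""
    return tossup_points / tossups_heard

def strength_of_schedule(opponent_games, national_tuppth):
    """Average of each opponent's TUPPTH in its *other* games, divided by
    the TUPPTH over all SCT sites."""
    avg = sum(tuppth(p, h) for p, h in opponent_games) / len(opponent_games)
    return avg / national_tuppth

# Three opponents' (tossup points, tossups heard) in games not against us,
# measured against a hypothetical all-sites TUPPTH of 4.5:
print(strength_of_schedule([(160, 40), (100, 40), (60, 40)], 4.5))  # ~0.593
```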

1. Tossup points per tossup heard is a heavily field-dependent measure of team strength. The current SOS calculation cannot differentiate reasonably strong teams in good fields from weak teams in weak fields. Looking at the current D-values list, Chicago D (15.69 PPB) has a worse TUPPTH than Ohio State C, Vanderbilt A, and Colorado A (10.61, 10.00, and 8.41 PPB, respectively). So playing Chicago D, a respectable opponent by any standard, lowers the SOS of Chicago D's opponents more than if they had hypothetically been able to play OSU C or Vandy A. That double-penalizes teams at the Northwestern SCT -- not only do they have to face a fairly strong team as their weakest opponent, but their SOS takes a huge hit for it.

2. The effect of point 1 is exacerbated in round-robin scheduling. Let's take the example of two four-team round robin tournaments. The first has four strong teams in the 17-19 PPB range. The second has two of those teams, and then two teams in the 10-12 PPB range. In the first tournament, each team will get roughly half the tossups per round, so each team's opponent's TUPPTH will be ~5.00 (assuming as many powers as negs). In the second tournament, the two strong teams will get about 80% of the tossups against the two weak teams and 50% against each other. The weak teams will get 20% against strong teams and 50% against each other. So a strong team will have played one strong team (~7.72 TUPPTH in other games) and two weak teams (~3.64 TUPPTH in other games), for a SOS of ~5.00. So a strong team gets the benefit of beating up weak teams without any hit to SOS! This effect would be further exacerbated by a final between the two strong teams, which will boost their SOS without dramatically reducing their TUPPTH.

More generally, in a round robin, the number of tossup points your opponents score in their other games, per tossup heard, is equal to (10 x tossup conversion rate) + (5 x power rate) - (5 x neg rate). This measure doesn't really vary much from site to site, since even in games between relatively weak teams, almost all tossups are still converted, and power rates aren't significantly lower than in games between good teams.

3. There is no D2 conversion for opponents' TUPPTH in the SOS calculation. Following from the last sentence of point 2, the tossup conversion rate in combined fields is artificially raised, since a D1 team's opponents are being measured on their ability to convert D2 tossups rather than D1 tossups. It seems fairly clear that if you had forced the Missouri site to play on D1 questions, many more tossups would have gone dead, and the SOS of all teams would be much lower. However, teams at all non-combined sites are being judged on their ability to convert D1 questions, so comparing those two measures seems illogical.
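Stepping back to the round-robin arithmetic in point 2, the effect can be checked numerically under simplifying assumptions: powers exactly cancel negs and no tossups go dead, so a team's TUPPTH is just 10 times its share of tossups converted. The per-opponent figures come out slightly different from the ~7.72/~3.64 above, which assume particular power/neg rates, but the conclusion is the same:

```python
# Sketch of the round-robin example from point 2, under a simplified model
# (powers cancel negs, no dead tossups, TUPPTH = 10 x conversion share).
# Per-opponent numbers differ a bit from the post's, but both schedules
# produce an opponents' TUPPTH of ~5.00.

def opp_tuppth(shares):
    """A team's TUPPTH over its other games, from its conversion shares."""
    return 10 * sum(shares) / len(shares)

# Tournament 1: four evenly matched teams; every game splits 50/50.
even = [opp_tuppth([0.5, 0.5]) for _ in range(3)]

# Tournament 2: strong team S1's opponents are S2 (strong) and W1, W2 (weak).
mixed = [opp_tuppth([0.8, 0.8]),   # S2's other games: beats both weak teams
         opp_tuppth([0.2, 0.5]),   # W1: loses to S2, splits with W2
         opp_tuppth([0.2, 0.5])]   # W2 likewise

print(sum(even) / 3, sum(mixed) / 3)  # both 5.0

# The general round-robin identity from point 2: opponents' tossup points
# per tossup heard = 10*conversion + 5*power_rate - 5*neg_rate
# (15 per power, 10 per regular get, -5 per neg).
def field_tupptuh(conversion, power_rate, neg_rate):
    return 10 * conversion + 5 * power_rate - 5 * neg_rate

print(field_tupptuh(0.95, 0.15, 0.10))  # 9.75
```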

How should NAQT fix the strength of schedule calculation for future years? Use points per tossup heard rather than tossup points per tossup heard, which will incorporate a non-competitive measure of team strength (bonus conversion) and adjust for the strength of good teams in very competitive fields. Additionally, NAQT should apply a D2 conversion factor to the SOS of combined fields to avoid comparing field strengths across two very different sets.
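A sketch of what that proposed fix might look like. The conversion factor and every name here are hypothetical placeholders, not calibrated values:

```python
# Hypothetical sketch of the proposed fix: opponents' *total* points
# (tossups + bonuses) per tossup heard, with a D2 conversion factor for
# combined fields. 0.85 is a placeholder, not a calibrated value.

D2_FACTOR = 0.85  # hypothetical discount for fields playing the D2 set

def proposed_sos(opponent_pptuh, national_pptuh, combined_field=False):
    """Average opponents' points per tossup heard, normalized; scaled
    down when the field played the (easier) D2 set."""
    sos = (sum(opponent_pptuh) / len(opponent_pptuh)) / national_pptuh
    return sos * D2_FACTOR if combined_field else sos

print(proposed_sos([8.0, 6.0, 4.0], 6.0))                       # 1.0
print(proposed_sos([8.0, 6.0, 4.0], 6.0, combined_field=True))  # 0.85
```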
Ryan Rosenberg
North Carolina '16 | Ardsley '12
PACE | ACF

theMoMA
Posts: 5575
Joined: Mon Oct 23, 2006 2:00 am

### Re: The D-value SOS calculation is broken

It might be useful to share a bit of data from our statistical survey of D values and ICT performance. Using the new D value calculation, the historical r^2 between D value and ICT performance for teams maintaining substantial roster continuity between the tournaments is about 0.75 (roughly speaking, this means that about 75% of a team's ICT performance is predicted by the team's D value). When the SoS is removed from D value, the r^2 drops to about 0.62, so that just over 62% of ICT performance is predicted by the team's D value. SoS improves the predictiveness of D value by similar amounts regardless of whether you use old or new D value, or whether you look at ICT prelims or overall scoring.
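For readers unfamiliar with the statistic: r^2 here is the squared Pearson correlation, i.e. the share of the variance in ICT scoring explained by D value. A quick sketch of how it's computed, on made-up placeholder numbers rather than NAQT's data:

```python
# r^2 as the squared Pearson correlation between two series. The D values
# and ICT scores below are hypothetical placeholders for illustration only.

def r_squared(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

d_values = [420, 390, 350, 310, 280]   # hypothetical
ict_pp20 = [380, 365, 330, 290, 300]   # hypothetical
print(round(r_squared(d_values, ict_pp20), 2))  # 0.92
```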

I certainly credit the idea that the SoS measure could stand to improve (and note that, if D value suffers these issues, then it's something ACF will have to look at for A value as well, as the two are almost exactly equivalent as to how they calculate strength of schedule; both have used tossup points per tossup heard from the beginning). The numbers do show, however, that SoS in its current form has a large positive impact on the ability of D value to predict ICT performance.

The DII SoS certainly could be calculated in the way that Ryan suggests, but this would mean that the translation coefficients for tossup and bonus points would have to be raised, because they are currently calibrated for untranslated SoS. This may prove to be an even more accurate way to generate D values from DI teams playing on the DII set, and I'm glad that Ryan suggested it so we can look into those numbers, but I do want to point out that the DII translations are indeed calibrated based on past ICT performances, and so teams playing on the DII set do not have an unfair leg up. In other words, if the DII SoS measure were calculated the way that Ryan suggests, the resulting necessary increases to the translation coefficients for tossup and bonus points would result, on the whole, in similar final D values for DI teams that played the DII set.

As for the larger picture, it certainly appears that the current calculation can produce unexpected and likely inaccurate SoS numbers for a few individual teams for the reasons that Ryan suggested, and as a result, I would like to investigate the effect of changing SoS to be based on a holistic team performance instead of just tossup performance. (I hope that our changes this year, which were largely intended to improve D value's performance for a small subset of teams that play DII or at very weak fields, demonstrate our commitment to making D value as accurate as it can be, even if the number of affected teams is very small.) But I'd also like to point out that the SoS is, on the whole, not "broken" in its present form; it is a large net positive for the accuracy of D value, improving D value's predictive ability by nearly 15%. To be totally clear, this is no reason to keep things exactly the way they are, and it's very possible that D value's predictive ability would be even better with a tweaked SoS calculation. But it is a reason to be confident that, in its present form, D value (and A value) is doing a very good job inviting the correct teams.
Andrew Hart
Minnesota alum

Fucitol
Rikku
Posts: 266
Joined: Sat May 05, 2007 10:02 pm
Location: the North Atlantic seaweed Fucus vesiculosus

### Re: The D-value SOS calculation is broken

theMoMA wrote:It might be useful to share a bit of data from our statistical survey of D values and ICT performance. Using the new D value calculation, the historical r^2 between D value and ICT performance for teams maintaining substantial roster continuity between the tournaments is about 0.75 (roughly speaking, this means that about 75% of a team's ICT performance is predicted by the team's D value). When the SoS is removed from D value, the r^2 drops to about 0.62, so that just over 62% of ICT performance is predicted by the team's D value. SoS improves the predictiveness of D value by similar amounts regardless of whether you use old or new D value, or whether you look at ICT prelims or overall scoring.

Can you check the correlation with ACF Nationals performance? I know the formats aren't all that similar, but teams that are good at one tend to be good at the other, and ACF Nationals has not recently suffered from cutting out top-25 teams (according to polls) from attending its tournament, which would be a large confounding variable in that r^2 value.
James L.
Kellenberg '10
UPenn '14

Benin Rebirth Party
Tidus
Posts: 742
Joined: Sat Jun 12, 2010 8:46 pm
Location: Farhaven, Ontario

### Re: The D-value SOS calculation is broken

What about looking at SCT ppb only? I've strongly suspected this for years now while looking at recently underranked teams like Chicago B.
Joe Su
Lisgar 2012, McGill 2015, McGill 20--

FINALIST -- 2017 ILQBM MEME OF THE YEAR

Fucitol
Rikku
Posts: 266
Joined: Sat May 05, 2007 10:02 pm
Location: the North Atlantic seaweed Fucus vesiculosus

### Re: The D-value SOS calculation is broken

Aaron Manby (ironmaster) wrote:What about looking at SCT ppb only? I've strongly suspected this for years now while looking at recently underranked teams like Chicago B.
PPB should definitely be more strongly included in the D-Value as it is a direct measure of team strength (albeit scattered by how strong the bonus rollercoaster is). However, converting tossups is also a skill.

One idea that I liked is to base SOS on field power numbers and PPB weighted in some empirically valid way.
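One way that weighting idea could be made concrete (the 0.4/0.6 weights and all names here are unfit placeholders, to be determined empirically, not validated values):

```python
# Sketch of a weighted SOS combining opponents' power rate and PPB, each
# normalized to the all-sites average. Weights are unfit placeholders.

def weighted_sos(opp_power_rate, opp_ppb, natl_power_rate, natl_ppb,
                 w_power=0.4, w_ppb=0.6):
    return (w_power * opp_power_rate / natl_power_rate
            + w_ppb * opp_ppb / natl_ppb)

# A schedule exactly at the national averages scores 1.0:
print(weighted_sos(0.15, 18.0, 0.15, 18.0))  # 1.0
```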

EDIT: Accidentally a few words
James L.
Kellenberg '10
UPenn '14

Periplus of the Erythraean Sea
Auron
Posts: 1739
Joined: Mon Feb 28, 2011 11:53 pm
Location: Falls Church, VA

### Re: The D-value SOS calculation is broken

Aaron Manby (ironmaster) wrote:What about looking at SCT ppb only? I've strongly suspected this for years now while looking at recently underranked teams like Chicago B.
PPB is a good but not perfect measure because teams are often able to outperform their PPB by nailing a few categories and playing strategically elsewhere to get to the crucial tossup plurality / majority. Examples of this might include the Ike-led Illinois team circa 2012 (and arguably 2013 as well) and the Myers-led MSU team's penchant for pulling off upsets this year.

Also, SCT PPB can get pretty heavily distorted by how well you cover categories less common in mACF tournaments. This is why I think most teams' PPB on SCT tracks pretty well with their mACF PPB, except for upper-level teams (who tend to have outsized skill at arts and other categories more emphasized in mACF) and those reliant on a single generalist, with CE/geo master Jakob Myers again being an exception.
Will Alston
Bethesda Chevy Chase HS '12, Dartmouth '16, Columbia '21
"...should be treated as the non-stakeholding troll he is" -Matt Weiner

heterodyne
Rikku
Posts: 365
Joined: Tue Jun 26, 2012 9:47 am

### Re: The D-value SOS calculation is broken

As James gestured towards, any correlational argument does seem to fail to account for teams that got prevented from qualifying by this problem - and given the relative difficulty of qualifying via D-value in previous years, I can't imagine the effect is insignificant.
Alston [Montgomery] Boyd
Bloomington High School '15
UChicago '19
he/him/his or they/them/their

theMoMA
Posts: 5575
Joined: Mon Oct 23, 2006 2:00 am

### Re: The D-value SOS calculation is broken

heterodyne wrote:As James gestured towards, any correlational argument does seem to fail to account for teams that got prevented from qualifying by this problem - and given the relative difficulty of qualifying via D-value in previous years, I can't imagine the effect is insignificant.
For teams that "comfortably" qualified (teams with new D values of 350 or greater--no teams of this strength are going to be left out of the ICT sample by virtue of failing to qualify), adding the SoS factor doubles the predictiveness of D value with respect to either ICT overall pp20tuh (from an r^2 of 0.29 to 0.58) or ICT prelims pp20tuh (0.32 to 0.71). (Note that r^2 for smaller samples, such as high-D-value teams, tends to be lower than the r^2 of the overall series.)
Andrew Hart
Minnesota alum

theMoMA
Posts: 5575
Joined: Mon Oct 23, 2006 2:00 am

### Re: The D-value SOS calculation is broken

Disclaimer: this is a post containing my personal reflections as someone who has been involved in the development of D value and A value and not an official statement of either NAQT or ACF.

As a community, we've developed the D/A value model and, over time, incrementally tweaked it to better pick out the most qualified teams for ICT and ACF Nationals. NAQT used to invite teams based on S value, which was more opaque than D value and had various other flaws, and took up D value after asking for community input on a new, "open-source" method for inviting teams to ICT. (Another disclaimer: D value, whose name honors Dwight Wynne, was based on his substantial improvements to a basic framework I devised.) Later, ACF needed a qualification procedure when it moved to an invitation-based Nationals, and adopted the framework of D value with a couple of tweaks to the SoS to alleviate weak-field issues; this became A value (a name that, sadly, does not honor yours truly), which to my knowledge ACF hasn't changed since adopting. This year, NAQT changed D value to alleviate weak-field issues and better assess the performances of teams playing on the DII set. That brings us to the present state of these statistics, which are now almost entirely identical. While the organizations' goals and requirements have not always been the same, the resulting statistics have.

There are two ways to go from here. The first is, as NAQT has recently done, to analyze the data we have to make incremental changes to D/A value that will improve their predictiveness. For instance, now that Ryan has identified a SoS issue with D/A value that appears to have both empirical and conceptual validity, at least with respect to a few teams, we can do a statistical survey to see whether tweaks to remedy that issue would have a positive impact on the predictive ability of D/A value. This is what I'd like to do with D value/ICT data before next year's SCTs.

The second path is to look for a radically different model for comparing teams that can accommodate the necessary factors of team performance, field strength, and, for NAQT's purposes, combined field translations. Joe's points-per-bonus suggestion above is an example of this approach (though I suspect that, because tossup performance is the key skill of quizbowl--previous work I've done suggests that tossup performance is about 13 times more predictive of a team's chance of winning than bonus performance--this would not be a particularly fruitful path to go down).

To put it in metaphorical terms, we've built a prediction engine out of various moving parts: tossup performance, bonus performance, strength of schedule, and (for D value but not A value) DII translations. We can either decide to tool up various parts of the current engine, or to build a new one from scratch.

Either way, the data that you'd need to analyze (historical SCT/ICT and Regionals/Nationals results) is entirely public. I enjoy working on these projects, but I want to be totally clear about my own conception of the goal in doing so: I think we should improve the current model rather than devise a new one. To the extent I've worked with the data, they suggest that D value does a very good job at picking out the most qualified teams for ICT. I also think that, with minor changes, it could possibly do an even better job for the subset of teams affected by possible SoS deficiencies that Ryan pointed out, and possibly after other tweaks we haven't yet envisioned, but I'm not comfortable saying so definitively until I can look at the data. I haven't worked on similar projects with A value (I would be happy to do so if ACF were interested in looking into improvements to A value in the future), but my unsupported intuition is that A value works for the same reason D value works, and that it generally does a very good job (that could perhaps be even better with SoS tweaks or other improvements we haven't foreseen).

I say all this to make this point: just because my sense is that D/A value don't need a major overhaul, and just because my work in this area is focused on optimizing the current model rather than creating a new one, doesn't mean that others can't or shouldn't look for new ways to tackle the same problem. It also doesn't mean that I should be the only person looking for ways to improve the current model (though, like I said, I enjoy doing so, and am happy to follow up on suggestions that people have, so this is definitely not a message that "you fix it yourself or nothing gets done"). The data are out there for anyone to use, and I'm interested to see what people can do with it.
Andrew Hart
Minnesota alum

jonah
Auron
Posts: 2262
Joined: Thu Jul 20, 2006 5:51 pm
Location: Chicago

### Re: The D-value SOS calculation is broken

I'll add that anyone on Andrew's quest (or something similar) who wants data from NAQT's public website in a more convenient format is welcome to contact me (jonah@naqt.com) and I'll make reasonable efforts to provide it. (It might have to wait until the summer, but I'll do what I can.)
Jonah Greenthal

Benin Rebirth Party
Tidus
Posts: 742
Joined: Sat Jun 12, 2010 8:46 pm
Location: Farhaven, Ontario

### Re: The D-value SOS calculation is broken

I was exaggerating when I said ppb meant everything (not looking at you Fred), but it definitely should be the starting point for future improvements of the system.

I did a quick Excel calculation with 2017 and these teams that had identical enough rosters between SCT and ICT: Stanford, Berkeley A, Northwestern, Toronto, Duke, Berkeley B, McGill, Chicago B, NYU, MIT, Amherst, UCSD, Kenyon, Louisville, Missouri. I took their SCT PPB, their D-value order of finish, and their ICT order of finish. The Spearman rank correlation was 0.82 for SCT PPB and ICT finish, while it was 0.75 for D-value order of finish and ICT finish. Part of the 2017 wonkiness probably has to do with Duke's low finish, McGill's high finish, and Missouri's high finish. Deleting just Duke gets you 0.92 for PPB and 0.89 for D-value. For 2016, this correlation increases to 0.93 for PPB and 0.87 for D-value.
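For anyone who wants to reproduce this kind of check, Spearman's rho on tie-free orders of finish can be computed directly. The two orderings below are made up for illustration, not the actual 2017 results:

```python
# Spearman rank correlation for tie-free rank lists, as used in the
# comparison above: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)).
# Both orderings are hypothetical, not real SCT/ICT data.

def spearman_rho(rank_a, rank_b):
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

sct_ppb_order = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical
ict_order     = [1, 3, 2, 4, 6, 5, 7, 8]   # hypothetical
print(round(spearman_rho(sct_ppb_order, ict_order), 3))  # 0.952
```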

I think an easy fix for D/A value would be to optimize weights for all the parameters -- all parameters in the SOS calculation, PPB, PPTUH, etc. Two other things that could be helpful are average opponent PPB (since, as Ryan noted, the SOS is not perfect) and power rate. Completely anecdotally, at my D2 ICT, Columbia came 9th after being placed in a circle of death with us and Harvard, despite being 6th in bonus conversion. They had an anomalously high SCT power percentage for their D-value rank.
Joe Su
Lisgar 2012, McGill 2015, McGill 20--

FINALIST -- 2017 ILQBM MEME OF THE YEAR

ryanrosenberg
Auron
Posts: 1188
Joined: Thu May 05, 2011 5:48 pm
Location: Chicago, Illinois

### Re: The D-value SOS calculation is broken

Is NAQT planning to make changes to the D-value calculation for this year's SCT?
Ryan Rosenberg
North Carolina '16 | Ardsley '12
PACE | ACF