State of QB Databases, 2017

The scariest thing of all is Protobowl
ezubaric
Rikku
Posts: 337
Joined: Mon Feb 09, 2004 8:02 pm
Location: College Park, MD

Post by ezubaric » Thu Apr 20, 2017 10:32 am

We're cleaning up some of the cruft that we have in our datasets for our quiz bowl playing robot, and we also want to ingest new data. However, we haven't really kept current on the state of the art in QB database land. We were wondering if people would care to share their opinions on:
1) What is the most complete source of central source questions in machine-readable format (e.g., the largest number of questions, regardless of quality)?
2) What is the cleanest source of questions in machine-readable format (e.g., fewest Unicode issues / formatting errors)?
3) What is the richest source of questions in machine-readable format (e.g., best metadata, annotation, etc.)?

Would be great if a single database were the answer to all three, but I'm assuming not.

Thanks in advance!
Jordan Boyd-Graber
UMD (College Park, MD), Faculty Advisor 2018-present
UC Boulder, Founder / Faculty Advisor 2014-2017
UMD (College Park, MD), Faculty Advisor 2010-2014
Princeton, Player 2004-2009
Caltech (Pasadena, CA), Player / President 2000-2004
Ark Math & Science (Hot Springs, AR), Player 1998-2000
Monticello High School, Player 1997-1998

Human-Computer Question Answering:
http://qanta.org/

grapesmoker
Sin
Posts: 6360
Joined: Sat Oct 25, 2003 5:23 pm
Location: Pittsburgh, PA

Re: State of QB Databases, 2017

Post by grapesmoker » Thu Apr 20, 2017 11:29 am

I've recently updated QBDB; it's even got an API, which you can scrape if you want (email me for details). I'm not going to claim completeness, but I have ~10k tossups parsed into question/answer format and ~25k bonus parts, likewise parsed. I'm going to be updating this on a regular basis once Nationals is done, working my way back in time. My goal is to eventually get to complete coverage for college events going back to around the start of the modern QB era (circa 2005 or so). The parser for QBDB was rewritten in Python 3, so I no longer have to deal with Unicode issues, and as far as formatting goes, the text of the questions is available both with the original markup (required answer content, powermarking) and as bare text. I'm not doing anything with regard to annotation or metadata, however, besides trivial stuff like year and tournament.
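For readers wondering what "original markup" versus "bare text" might look like in practice, here is a minimal sketch of deriving one from the other. The post does not describe QBDB's actual markup, so everything here is an assumption: the sketch supposes HTML-style `<b>`/`<u>`/`<i>` tags for required answer content and titles, and a `(*)` token as the power mark. The function name `to_bare_text` is hypothetical.

```python
import re

# ASSUMPTION: markup uses HTML-style tags and "(*)" as the power mark.
# QBDB's real format is not described in the thread and may differ.
TAG_RE = re.compile(r"</?(?:b|u|i|em|strong)>", re.IGNORECASE)
POWER_RE = re.compile(r"\s*\(\*\)\s*")


def to_bare_text(marked_up: str) -> str:
    """Return the question text with tags and the power mark removed."""
    text = TAG_RE.sub("", marked_up)      # drop bold/underline/italic tags
    text = POWER_RE.sub(" ", text)        # drop the power mark, keep one space
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace
```

Keeping both forms side by side, as the post describes, means a consumer that only needs plain text never has to know the markup rules at all.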
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
code ape, loud voice, general nuisance

kevink
Lulu
Posts: 73
Joined: Tue Aug 28, 2012 5:18 pm

Re: State of QB Databases, 2017

Post by kevink » Thu Apr 20, 2017 12:09 pm

Protobowl's latest question dump probably has the most tossup questions (105,923), but it doesn't have any bonuses. It is composed of questions scraped from a number of databases, along with output from a custom parser, and it has been tweaked over the years through the Protobowl question reporting interface. Questions and answers are stored in plain text, so they often lack proper rich-text information (italics/bolding). I don't have any quantitative information about the incidence of formatting errors, but they do exist.
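One rough way to put a number on the incidence of formatting errors is a heuristic scan of the dump. The sketch below is only that: it assumes the dump is one JSON object per line with `question` and `answer` fields (the thread does not specify the dump's layout), and it flags a record as suspect if it fails to parse, contains the Unicode replacement character, contains common UTF-8-read-as-Latin-1 mojibake, or contains stray control characters. The names `looks_garbled` and `error_rate` are hypothetical.

```python
import json
import unicodedata

# Heuristic markers: the replacement character plus two prefixes that
# typically appear when UTF-8 text has been decoded as Latin-1/CP1252.
SUSPECT = ("\ufffd", "\u00c3", "\u00e2\u20ac")


def looks_garbled(text):
    """Return True if the text shows a likely encoding or formatting problem."""
    if any(marker in text for marker in SUSPECT):
        return True
    # Control characters other than newline and tab rarely belong in a question.
    return any(
        unicodedata.category(ch) == "Cc" and ch not in "\n\t" for ch in text
    )


def error_rate(lines, fields=("question", "answer")):
    """Fraction of non-empty lines that are unparseable or look garbled.

    ASSUMPTION: each line is a JSON object with the given text fields.
    """
    total = bad = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            bad += 1
            continue
        if not isinstance(record, dict):
            bad += 1
            continue
        if any(looks_garbled(str(record.get(f, ""))) for f in fields):
            bad += 1
    return bad / total if total else 0.0
```

Passing an open file handle (opened with `encoding="utf-8"`) as `lines` streams the dump without loading it all into memory. The markers will produce some false positives, since a legitimate "Ã" does occur in some names, so the result is an upper-bound estimate rather than an exact count.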
Kevin Kwok
Annandale High School 2013
MIT 2017

ezubaric
Rikku
Posts: 337
Joined: Mon Feb 09, 2004 8:02 pm
Location: College Park, MD

Re: State of QB Databases, 2017

Post by ezubaric » Thu Apr 20, 2017 2:17 pm

Very helpful, thank you! QANTA is still tossup-only, and we're data hungry, so Protobowl is probably our best choice for now.

We're moving to bonuses in the not-too-distant future, so QBDB is probably our next stop.

Would love to know if there are any other games in town!
Jordan Boyd-Graber
