Please upload your packets as DOCs too

UlyssesInvictus
Tidus
Posts: 717
Joined: Thu Feb 10, 2011 7:38 pm

Please upload your packets as DOCs too

Post by UlyssesInvictus » Thu Oct 05, 2017 1:29 pm

^title

I've been doing a lot of work on getting a standard machine parser working for QuizDB, and it's surprisingly difficult to consistently convert PDFs into machine-readable text. (It actually shouldn't be that surprising, given that your browser/OS still have such trouble copy-pasting from PDFs to regular text--and thus all the encoding errors on QuizDB.)

On the other hand, it's super easy to get nicely readable files from Word DOCs and DOCXs. I've been able to consistently get really good parsed questions from these packet formats (assuming they adhere closely to a standard format, but that's addressed in this other post).
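To illustrate why Word files are comparatively easy to parse: a .docx is a zip archive whose main text lives in word/document.xml, so plain paragraphs can be pulled out with the Python standard library alone. The following is a minimal sketch, not QuizDB's actual parser. The function names are hypothetical, and the pairing step assumes each answer line starts with "ANSWER:" and directly follows its question.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML, the format inside .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path_or_file):
    """Return the non-empty paragraphs of a .docx file as plain strings."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join every <w:t> node.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text.strip())
    return paragraphs


def pair_questions(paragraphs):
    """Pair each 'ANSWER:' paragraph with the paragraph just before it.

    Assumption: answer lines begin with 'ANSWER:' and follow their question.
    """
    pairs = []
    for previous, current in zip(paragraphs, paragraphs[1:]):
        if current.upper().startswith("ANSWER:"):
            answer = current[len("ANSWER:"):].strip()
            pairs.append((previous, answer))
    return pairs
```

Calling pair_questions(docx_paragraphs("packet.docx")) on a packet in that layout would give a list of (question, answer) tuples. Legacy .doc files use a different binary format and would need converting first.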

So please additionally upload your packets as Word files to the Packets Archive.

I know it's a little extra work, but if you zip it, it's like five extra clicks; and if you do, it makes your questions much more permanent for posterity. Isn't that a good thing?

(Plus, I can hardly think of any cases where you would write your packets in a format that couldn't also be easily converted to Word, whether you started off writing in Word, are using Google Docs, or are exporting from QEMS.)

Hopefully everyone sees this post and updates their Archive entries of their own volition, but I'll probably start emailing people soon (and orgs--this applies to ACF, too) and nicely asking them for some extra files.
Last edited by UlyssesInvictus on Fri Oct 06, 2017 11:57 am, edited 1 time in total.
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM

jonah
Auron
Posts: 2260
Joined: Thu Jul 20, 2006 5:51 pm
Location: Chicago

Re: Please upload your packets as DOCs too

Post by jonah » Thu Oct 05, 2017 8:03 pm

For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
Jonah Greenthal
National Academic Quiz Tournaments

AKKOLADE
Sin
Posts: 15016
Joined: Thu Apr 24, 2003 8:08 am

Re: Please upload your packets as DOCs too

Post by AKKOLADE » Thu Oct 05, 2017 10:47 pm

jonah wrote:For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
don't use latex
Fred Morlan
PACE President, 2018-19
International Quiz Bowl Tournaments, co-owner
University of Kentucky CoP, 2017
hsqbrank manager, NAQT writer (former subject editor), former hsqb Administrator/Chief Administrator, 2012 NASAT TD

UlyssesInvictus
Tidus
Posts: 717
Joined: Thu Feb 10, 2011 7:38 pm

Re: Please upload your packets as DOCs too

Post by UlyssesInvictus » Fri Oct 06, 2017 12:02 am

Wow, .tex, didn't expect that, but I should have.

TeX is a standardized markup language, though, right? That's probably as good as DOC, since Pandoc will probably be able to parse it fine as well.

(Although, at that point it really just becomes something only I'd use, rather than the QB audience at large.)

EDIT: confirmed that Pandoc does this just fine. Will happily take LaTeX sources!
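For anyone who wants to try the same conversion, here is a minimal sketch that shells out to Pandoc from Python. It assumes the pandoc executable is installed and on the PATH; the helper names and file names are hypothetical, and packets that rely on custom macros may not convert cleanly.

```python
import subprocess
from pathlib import Path


def pandoc_command(tex_path, out_path):
    """Build the Pandoc command line for a LaTeX-to-plain-text conversion."""
    # -f/-t select the input and output formats; -o names the output file.
    return ["pandoc", "-f", "latex", "-t", "plain", "-o", str(out_path), str(tex_path)]


def tex_to_text(tex_path, out_path=None):
    """Convert one .tex file to plain text and return the output path.

    Raises subprocess.CalledProcessError if Pandoc reports a failure.
    """
    tex_path = Path(tex_path)
    out_path = Path(out_path) if out_path else tex_path.with_suffix(".txt")
    subprocess.run(pandoc_command(tex_path, out_path), check=True)
    return out_path
```

For example, tex_to_text("round1.tex") would write round1.txt next to the source file.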
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM

UlyssesInvictus
Tidus
Posts: 717
Joined: Thu Feb 10, 2011 7:38 pm

Re: Please upload your packets as DOCs too

Post by UlyssesInvictus » Fri Oct 06, 2017 12:06 am

Although, re-reading your post, I'll note that Scobol and Masonic wildly diverge from the standard ACF-ish format (the one I'm proposing in my still-to-be-written post), and so I'd have larger issues past actually converting the packets into a machine-readable (TBH this really just means my machine) format.
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM

jonah
Auron
Posts: 2260
Joined: Thu Jul 20, 2006 5:51 pm
Location: Chicago

Re: Please upload your packets as DOCs too

Post by jonah » Fri Oct 06, 2017 12:14 am

UlyssesInvictus wrote:Although, re-reading your post, I'll note that Scobol and Masonic wildly diverge from the standard ACF-ish format (the one I'm proposing in my still-to-be-written post), and so I'd have larger issues past actually converting the packets into a machine-readable (TBH this really just means my machine) format.
Scobol Solo is tossups only, so I would have thought that would be pretty usable. Masonic consists of tossups and questions that are isomorphic to bonuses, but they're arranged in a weird fashion, so I understand not wanting to mess with that.

Anyway, if you want any of these things, shoot me an email specifying which ones you want.
Jonah Greenthal
National Academic Quiz Tournaments

CPiGuy
Rikku
Posts: 451
Joined: Wed Nov 16, 2016 8:19 pm
Location: Ann Arbor, MI

Re: Please upload your packets as DOCs too

Post by CPiGuy » Fri Oct 06, 2017 12:19 am

jonah wrote:For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
Math Monstrosity also used LaTeX, and there's a .zip file with the .tex sources, which would probably be better for you to pull from than the PDFs (which I assume is what you already pulled from).
Conor Thompson
Bangor HS (Maine) '16
Michigan '20

UlyssesInvictus
Tidus
Posts: 717
Joined: Thu Feb 10, 2011 7:38 pm

Re: Please upload your packets as DOCs too

Post by UlyssesInvictus » Fri Oct 06, 2017 12:31 am

re: Jonah: I was mainly concerned with the interactive arrows and interim text like "Check the score," but we'll see how problematic it is after I parse the files. Thanks! (And, yeah, I'm just going to stay away from the Masonic box-stuff for now.)

re: Conor: Awesome, I'll reupload those at some point using the TeX-parsed data instead. The PDF stuff was just what the earlier uploaders used, so it's got the pros of human error checking but also the cons of human error introduction.

I'm actually still concerned with getting the final piece of the parser working--the categorizer (yay, logistic regression!)--so while that happens, the most helpful thing for me is for people to just upload these machine-readable files on their own to the archive. I'll track people down eventually if I have to, but it'd be lovely to just go back online and find them waiting there for me.
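As an illustration of what a logistic-regression categorizer does (not QuizDB's actual implementation, whose features and training data aren't described here), the sketch below trains a two-category bag-of-words model in pure Python. The function names, hyperparameters, and the choice of two categories labeled 1 and 0 are all assumptions made for the example.

```python
import math
from collections import Counter


def features(text):
    """Bag-of-words features: lowercase word counts."""
    return Counter(text.lower().split())


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def probability(weights, bias, text):
    """Probability that `text` belongs to the category labeled 1."""
    z = bias + sum(weights.get(word, 0.0) * n for word, n in features(text).items())
    return sigmoid(z)


def train(examples, epochs=100, learning_rate=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    `examples` is a list of (text, label) pairs with label 0 or 1.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            error = label - probability(weights, bias, text)
            bias += learning_rate * error
            for word, count in features(text).items():
                weights[word] = weights.get(word, 0.0) + learning_rate * error * count
    return weights, bias
```

With labeled questions as training data (say, 1 for fine arts and 0 for literature), probability(weights, bias, question_text) above 0.5 would put a new question in the first category. A real categorizer would need many more categories and far more training data.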

(Though I do want to stress that I believe this is something benefiting the community as a whole, not just QuizDB :D)
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM

Dominator
Rikku
Posts: 487
Joined: Sun Mar 14, 2010 9:16 pm

Re: Please upload your packets as DOCs too

Post by Dominator » Fri Oct 06, 2017 3:05 pm

jonah wrote:For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
You cannot be serious.
  • You know that I used LaTeX to format all IMSANITY packets, and you asked me in 2011 to see my LaTeX code.
  • You know from when you interrogated me as a writer applicant for NAQT that the Packetizor system I built for NHBB uses LaTeX for its final packets.
  • You also know that that system has expanded to other corners of the Madden-verse (like USABB) and SCOP.
Are you implying that those organizations are not quizbowl? If so, will they be removed from the "major competitor to NAQT" list?
AKKOLADE wrote:
jonah wrote:For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
don't use latex
This is the wrong response to the problem.
UlyssesInvictus wrote:Wow, .tex, didn't expect that, but I should have.

TeX is a standardized markup language, though, right? That's probably as good as DOC, since Pandoc will probably be able to parse it fine as well.
Packetizor can be made to output the questions in whatever format is best for QuizDB. If you'd like to find a better solution than parsing .pdf or .tex files for uploading NHBB and SCOP content, feel free to email me.
Dr. Noah Prince

Normal Community High School (2002)
University of Illinois at Urbana-Champaign (2004)
University of Illinois at Urbana-Champaign (2007)
University of Illinois at Urbana-Champaign (2008)

Illinois Mathematics and Science Academy - Scholastic Bowl coach (2009-2014), assistant coach (2014-2015), well wisher (2015-2016)
guy in San Diego (2016-present)

Mike Bentley
Auron
Posts: 5579
Joined: Fri Mar 31, 2006 11:03 pm
Location: Bellevue, WA

Re: Please upload your packets as DOCs too

Post by Mike Bentley » Sun Oct 08, 2017 2:03 pm

Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.
Mike Bentley
Treasurer, Partnership for Academic Competition Excellence
Adviser, Quizbowl Team at University of Washington
University of Maryland, Class of 2008

UlyssesInvictus
Tidus
Posts: 717
Joined: Thu Feb 10, 2011 7:38 pm

Re: Please upload your packets as DOCs too

Post by UlyssesInvictus » Sun Oct 08, 2017 2:34 pm

Mike Bentley wrote:Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.
That's an interesting idea, but more of one to consider solely for fun--or in the case where the original authors are no longer contactable--since I'd much rather just have the original, encoded source as far as data preservation goes.
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM

ezubaric
Rikku
Posts: 361
Joined: Mon Feb 09, 2004 8:02 pm
Location: College Park, MD

Re: Please upload your packets as DOCs too

Post by ezubaric » Mon Oct 16, 2017 4:08 pm

UlyssesInvictus wrote:
Mike Bentley wrote:Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.
That's an interesting idea, but more of one to consider solely for fun--or in the case where the original authors are no longer contactable--since I'd much rather just have the original, encoded source as far as data preservation goes.
I don't think you'd actually want to use OCR; I think that, given examples of correct parsing, you could easily read the original bytestream and train an LSTM to convert it.

For things that only exist on paper, OCR would be a good option; I prefer the Google OCR toolkit (which we used for comic books).
Jordan Boyd-Graber
UMD (College Park, MD), Faculty Advisor 2018-present
UC Boulder, Founder / Faculty Advisor 2014-2017
UMD (College Park, MD), Faculty Advisor 2010-2014
Princeton, Player 2004-2009
Caltech (Pasadena, CA), Player / President 2000-2004
Ark Math & Science (Hot Springs, AR), Player 1998-2000
Monticello High School, Player 1997-1998

Human-Computer Question Answering:
http://qanta.org/
