Please upload your packets as DOCs too
- UlyssesInvictus
- Yuna
- Posts: 845
- Joined: Thu Feb 10, 2011 7:38 pm
Please upload your packets as DOCs too
^title
I've been doing a lot of work on getting a standard machine parser working for QuizDB, and it's surprisingly difficult to consistently convert PDFs into machine-readable text. (It actually shouldn't be that surprising, given that your browser/OS still has such trouble copy-pasting from PDFs to regular text--hence all the encoding errors on QuizDB.)
On the other hand, it's super easy to get nicely readable files from Word DOCs and DOCXs. I've been able to consistently get really good parsed questions from these packet formats (assuming they adhere closely to a standard format, but that's addressed in this other post).
So please additionally upload your packets as Word files to the Packets Archive.
I know it's a little extra work, but if you zip it, it's like five extra clicks; and if you do, it makes your questions much more permanent for posterity. Isn't that a good thing?
(Plus, I can hardly think of any cases where you would write your packets in a format that couldn't also be easily converted to Word, whether you started off writing in Word, are using Google Docs, or are exporting from QEMS.)
Hopefully everyone sees this post and updates their Archive entries of their own volition, but I'll probably start emailing people soon (and orgs--this applies to ACF, too) and nicely asking them for some extra files.
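For illustration, here is a minimal sketch of why DOCX is comparatively easy to read: a .docx file is a zip archive whose main text lives in word/document.xml, so paragraphs can be pulled out with the Python standard library alone. This is not QuizDB's actual parser; the function name docx_paragraphs is hypothetical, and the sketch assumes a modern .docx (not the legacy binary .doc) and ignores formatting such as bold or underlining.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each non-empty paragraph in a .docx file.

    `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

Each returned string is one paragraph of the packet, which a downstream question parser could then split into tossups, bonuses, and answer lines.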
Last edited by UlyssesInvictus on Fri Oct 06, 2017 11:57 am, edited 1 time in total.
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM
Re: Please upload your packets as DOCs too
For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
Jonah Greenthal
National Academic Quiz Tournaments
Re: Please upload your packets as DOCs too
jonah wrote: For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
don't use latex
Fred Morlan
University of Kentucky CoP, 2017
International Quiz Bowl Tournaments, CEO, co-owner
former PACE member, president, etc.
former hsqbrank manager, former NAQT writer & subject editor, former hsqb Administrator/Chief Administrator
- UlyssesInvictus
- Yuna
- Posts: 845
- Joined: Thu Feb 10, 2011 7:38 pm
Re: Please upload your packets as DOCs too
Wow, .tex, didn't expect that, but I should have.
TeX is a standardized markup language, though, right? That's probably as good as DOC, since Pandoc will probably be able to parse it fine as well.
(Although, at that point it really just becomes something only I'd use, rather than the QB audience at large.)
EDIT: confirmed that Pandoc does this just fine. Will happily take LaTeX sources!
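As a rough sketch of what that conversion could look like when scripted, the snippet below shells out to the pandoc command-line tool to turn a .tex packet into plain text. The function names and file names are hypothetical, and it assumes pandoc is installed and on the PATH; packets that rely on custom macros may not convert cleanly.

```python
import shutil
import subprocess


def pandoc_command(source, target):
    """Build the pandoc invocation for LaTeX -> plain text."""
    return [
        "pandoc", source,
        "--from", "latex",
        "--to", "plain",
        "--wrap=none",      # keep each paragraph on one line
        "--output", target,
    ]


def convert_tex(source, target):
    """Run pandoc on `source`, writing plain text to `target`."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_command(source, target), check=True)
```

For example, convert_tex("packet01.tex", "packet01.txt") would write the plain-text version next to the source, where packet01.tex is a made-up file name.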
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM
- UlyssesInvictus
- Yuna
- Posts: 845
- Joined: Thu Feb 10, 2011 7:38 pm
Re: Please upload your packets as DOCs too
Although, re-reading your post, I'll note that Scobol and Masonic wildly diverge from the standard ACF-ish format (the one I'm proposing in my still-to-be-written post), and so I'd have larger issues beyond actually converting the packets into a machine-readable format (TBH, this really just means readable by my machine).
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM
Re: Please upload your packets as DOCs too
UlyssesInvictus wrote: Although, re-reading your post, I'll note that Scobol and Masonic wildly diverge from the standard ACF-ish format (the one I'm proposing in my still-to-be-written post), and so I'd have larger issues beyond actually converting the packets into a machine-readable format (TBH, this really just means readable by my machine).
Scobol Solo is tossups only, so I would have thought that would be pretty usable. Masonic consists of tossups and questions that are isomorphic to bonuses, but they're arranged in a weird fashion, so I understand not wanting to mess with that.
Anyway, if you want any of these things, shoot me an email specifying which ones you want.
Jonah Greenthal
National Academic Quiz Tournaments
Re: Please upload your packets as DOCs too
jonah wrote: For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
Math Monstrosity also used LaTeX, and there's a .zip file with the .tex sources, which would probably be better for you to pull from than the PDFs (which I assume are what you already pulled from).
Conor Thompson (he/it)
Bangor High School '16
University of Michigan '20
Iowa State University '25
Tournament Format Database
- UlyssesInvictus
- Yuna
- Posts: 845
- Joined: Thu Feb 10, 2011 7:38 pm
Re: Please upload your packets as DOCs too
re: Jonah: I was mainly concerned with the interactive arrows and interim text like "Check the score," but we'll see how problematic it is after I parse the files. Thanks! (And, yeah, I'm just going to stay away from the Masonic box-stuff for now.)
re: Conor: Awesome, I'll reupload those at some point using the TeX-parsed data instead. The PDF stuff was just what the earlier uploaders used, so it's got the pros of human error checking but also the cons of human error introduction.
I'm actually still concerned with getting the final build of the parser working--the categorizer (yay, logistic regression!)--so while that happens, the most helpful thing for me is for people to just upload these machine-readable files on their own to the archive. I'll track people down eventually if I have to, but it'd be lovely to just go back online and find them waiting there for me.
(Though I do want to stress that I believe this is something benefiting the community as a whole, not just QuizDB :D)
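To make the "categorizer" idea concrete, here is a toy sketch of logistic regression over bag-of-words features, written with only the standard library. It is not the QuizDB categorizer: the two categories (1 for science, 0 for literature), the four training sentences, and every function name are made up for illustration, and a real categorizer would need many more categories and far more data.

```python
import math
import re
from collections import Counter


def features(text):
    """Bag-of-words counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probability(counts, weights, bias):
    """P(label == 1) for a bag of word counts."""
    score = bias + sum(weights.get(w, 0.0) * c for w, c in counts.items())
    return sigmoid(score)


def train(examples, epochs=200, rate=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            counts = features(text)
            error = label - probability(counts, weights, bias)
            for word, count in counts.items():
                weights[word] = weights.get(word, 0.0) + rate * error * count
            bias += rate * error
    return weights, bias


def classify(text, model):
    """Return the model's probability that `text` is science (label 1)."""
    weights, bias = model
    return probability(features(text), weights, bias)


# Hypothetical training data: 1 = science, 0 = literature.
EXAMPLES = [
    ("This element has atomic number 79", 1),
    ("This enzyme catalyzes the hydrolysis of proteins", 1),
    ("This author wrote the novel Ulysses", 0),
    ("This poet wrote The Waste Land", 0),
]

model = train(EXAMPLES)
print(classify("The element with atomic number 6", model))
```

A probability above 0.5 means the toy model leans toward science for that question text, and below 0.5 toward literature.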
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM
Re: Please upload your packets as DOCs too
jonah wrote: For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
You cannot be serious.
- You know that I used LaTeX to format all IMSANITY packets, and you asked me in 2011 to see my LaTeX code.
- You know from when you interrogated me as a writer applicant for NAQT that the Packetizor system I built for NHBB uses LaTeX for its final packets.
- You also know that that system has expanded to other corners of the Madden-verse (like USABB) and SCOP.
AKKOLADE wrote: [jonah wrote: For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?] don't use latex
This is the wrong response to the problem.
UlyssesInvictus wrote: Wow, .tex, didn't expect that, but I should have. TeX is a standardized markup language, though, right? That's probably as good as DOC, since Pandoc will probably be able to parse it fine as well.
Packetizor can be made to output the questions in whatever format is best for QuizDB. If you'd like to find a better solution than parsing .pdf or .tex files for uploading NHBB and SCOP content, feel free to email me.
Dr. Noah Prince
Normal Community High School (2002)
University of Illinois at Urbana-Champaign (2004, 2007, 2008)
Illinois Mathematics and Science Academy - Scholastic Bowl coach (2009-2014), assistant coach (2014-2015), well wisher (2015-2016)
guy in San Diego (2016-present)
President of Qblitz (2018-present)
- Mike Bentley
- Sin
- Posts: 6465
- Joined: Fri Mar 31, 2006 11:03 pm
- Location: Bellevue, WA
Re: Please upload your packets as DOCs too
Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.
Mike Bentley
Treasurer, Partnership for Academic Competition Excellence
Adviser, Quizbowl Team at University of Washington
University of Maryland, Class of 2008
- UlyssesInvictus
- Yuna
- Posts: 845
- Joined: Thu Feb 10, 2011 7:38 pm
Re: Please upload your packets as DOCs too
Mike Bentley wrote: Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.
That's an interesting idea, but more of one to consider solely for fun--or in the case where the original authors are no longer contactable--since I'd much rather just have the original, encoded source as far as data preservation goes.
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM
Re: Please upload your packets as DOCs too
UlyssesInvictus wrote: [Mike Bentley wrote: Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.] That's an interesting idea, but more of one to consider solely for fun--or in the case where the original authors are no longer contactable--since I'd much rather just have the original, encoded source as far as data preservation goes.
I don't think you'd actually want to use OCR; I think that, given examples of correct parsing, you could easily read the original bytestream and create an LSTM to convert it.
For things that only exist on paper, OCR would be a good option; I prefer the Google OCR toolkit (which we used for comic books).
Jordan Boyd-Graber
UMD (College Park, MD), Faculty Advisor 2018-present
UC Boulder, Founder / Faculty Advisor 2014-2017
UMD (College Park, MD), Faculty Advisor 2010-2014
Princeton, Player 2004-2009
Caltech (Pasadena, CA), Player / President 2000-2004
Ark Math & Science (Hot Springs, AR), Player 1998-2000
Monticello High School, Player 1997-1998
Human-Computer Question Answering:
http://qanta.org/