
Please upload your packets as DOCs too

Posted: Thu Oct 05, 2017 1:29 pm
by UlyssesInvictus
^title

I've been doing a lot of work on getting a standard machine parser working for QuizDB and it's surprisingly difficult to consistently convert PDFs into machine-readable text. (It actually shouldn't be that surprising, given that your browser/OS still have such trouble copy-pasting from PDFs to regular text--and thus all the encoding errors on QuizDB.)

On the other hand, it's super easy to get nicely readable files from Word DOCs and DOCXs. I've been able to consistently get really good parsed questions from these packet formats (assuming they adhere closely to a standard format, but that's addressed in this other post).

So please additionally upload your packets as Word files to the Packets Archive.

I know it's a little extra work, but if you zip it, it's like five extra clicks; and if you do, it makes your questions much more permanent for posterity. Isn't that a good thing?

(Plus, I can hardly think of any cases where you would write your packets in a format that couldn't also be easily converted to Word, whether you started off writing in Word, are using Google Docs, or are exporting from QEMS.)

Hopefully everyone sees this post and updates their Archive entries of their own volition, but I'll probably start emailing people soon (and orgs--this applies to ACF, too) and nicely asking them for some extra files.

Re: Please upload your packets as DOCs too

Posted: Thu Oct 05, 2017 8:03 pm
by jonah
For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?

Re: Please upload your packets as DOCs too

Posted: Thu Oct 05, 2017 10:47 pm
by AKKOLADE
jonah wrote:For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
don't use latex

Re: Please upload your packets as DOCs too

Posted: Fri Oct 06, 2017 12:02 am
by UlyssesInvictus
Wow, .tex, didn't expect that, but I should have.

TeX is a standardized markup language, though, right? That's probably as good as DOC, since Pandoc will probably be able to parse it fine as well.

(Although, at that point it really just becomes something only I'd use, rather than the QB audience at large.)

EDIT: confirmed that Pandoc does this just fine. Will happily take LaTeX sources!
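
For anyone curious what that conversion step might look like, here is a minimal sketch, assuming Pandoc is installed and on the PATH. The helper names (`build_pandoc_command`, `convert_packet`) and the extension table are hypothetical illustrations, not what QuizDB actually runs. Note that Pandoc reads .docx and LaTeX but not the older binary .doc format, so a .doc packet would need to be re-saved as .docx first.

```python
import shutil
import subprocess
from pathlib import Path

# Map file extensions to Pandoc input format names (assumed subset).
PANDOC_FORMATS = {".docx": "docx", ".tex": "latex"}


def build_pandoc_command(source: Path, target: Path) -> list[str]:
    """Build a Pandoc command line that converts `source` to plain text."""
    suffix = source.suffix.lower()
    if suffix not in PANDOC_FORMATS:
        raise ValueError(f"unsupported packet format: {suffix}")
    return [
        "pandoc",
        "--from", PANDOC_FORMATS[suffix],
        "--to", "plain",
        "--wrap=none",  # keep each paragraph (question) on one line
        "--output", str(target),
        str(source),
    ]


def convert_packet(source: Path) -> Path:
    """Convert one packet to a .txt file next to it and return the new path."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    target = source.with_suffix(".txt")
    subprocess.run(build_pandoc_command(source, target), check=True)
    return target
```

Calling `convert_packet(Path("packet01.tex"))` (a made-up file name) would write `packet01.txt` alongside the source. Splitting that text into individual tossups and bonuses is a separate step that depends on the packet format.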

Re: Please upload your packets as DOCs too

Posted: Fri Oct 06, 2017 12:06 am
by UlyssesInvictus
Although, re-reading your post, I'll note that Scobol and Masonic wildly diverge from the standard ACF-ish format (the one I'm proposing in my still to-be-written post), and so I'd have larger issues past actually converting the packets into machine readable (TBH this really just means my machine) format.

Re: Please upload your packets as DOCs too

Posted: Fri Oct 06, 2017 12:14 am
by jonah
UlyssesInvictus wrote:Although, re-reading your post, I'll note that Scobol and Masonic wildly diverge from the standard ACF-ish format (the one I'm proposing in my still to-be-written post), and so I'd have larger issues past actually converting the packets into machine readable (TBH this really just means my machine) format.
Scobol Solo is tossups only, so I would have thought that would be pretty usable. Masonic consists of tossups and questions that are isomorphic to bonuses, but they're arranged in a weird fashion, so I understand not wanting to mess with that.

Anyway, if you want any of these things, shoot me an email specifying which ones you want.

Re: Please upload your packets as DOCs too

Posted: Fri Oct 06, 2017 12:19 am
by CPiGuy
jonah wrote:For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
Math Monstrosity also used LaTeX, and there's a .zip file with the .tex sources, which would probably be better for you to pull from than the PDFs (which I assume you already pulled from).

Re: Please upload your packets as DOCs too

Posted: Fri Oct 06, 2017 12:31 am
by UlyssesInvictus
re: Jonah: I was mainly concerned with the interactive arrows and interim text like "Check the score," but we'll see how problematic it is after I parse the files. Thanks! (And, yeah, I'm just going to stay away from the Masonic box-stuff for now.)

re: Conor: Awesome, I'll reupload those at some time using the TeX parsed data instead. The PDF stuff was just what the earlier uploaders used, so it's got the pros of human error checking but also the cons of human error introduction.

I'm actually still concerned with getting the final build of the parser working--the categorizer (yay, logistic regression!)--so while that happens, the most helpful thing for me is for people to just upload these machine readable files on their own to the archive. I'll track people down eventually if I have to, but it'd be lovely to just go back online and find them waiting there for me.

(Though I do want to stress that I believe this is something benefiting the community as a whole, not just QuizDB :D)
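
For a sense of what a logistic-regression categorizer involves, here is a toy sketch in plain Python. Everything in it is an assumption made for illustration: the training sentences, the two categories, and the function names are invented, and it only handles a binary split. The actual QuizDB categorizer is not shown in this thread and would need many categories and far more data.

```python
import math
from collections import Counter


def featurize(text):
    """Turn a question into bag-of-words counts (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def predict_proba(weights, bias, features):
    """Probability that the question belongs to the positive class."""
    score = bias + sum(weights.get(w, 0.0) * c for w, c in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=200, lr=0.1):
    """Fit weights by stochastic gradient descent on the log loss.

    `examples` is a list of (text, label) pairs with label 1 or 0.
    """
    weights, bias = {}, 0.0
    data = [(featurize(text), label) for text, label in examples]
    for _ in range(epochs):
        for features, label in data:
            # Gradient of the log loss w.r.t. the score is (p - label).
            error = predict_proba(weights, bias, features) - label
            bias -= lr * error
            for word, count in features.items():
                weights[word] = weights.get(word, 0.0) - lr * error * count
    return weights, bias


# Made-up training data: 1 = science, 0 = literature.
EXAMPLES = [
    ("this enzyme catalyzes the reaction in the cell", 1),
    ("this particle decays into an electron and a neutrino", 1),
    ("this novel by the author follows a family", 0),
    ("this poem by the poet describes a nightingale", 0),
]

weights, bias = train(EXAMPLES)
```

After training, `predict_proba(weights, bias, featurize(question))` gives the estimated probability that a question is science; words seen only in science examples end up with positive weights and words seen only in literature examples with negative ones.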

Re: Please upload your packets as DOCs too

Posted: Fri Oct 06, 2017 3:05 pm
by Dominator
jonah wrote:For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
You cannot be serious.
  • You know that I used LaTeX to format all IMSANITY packets, and you asked me in 2011 to see my LaTeX code.
  • You know from when you interrogated me as a writer applicant for NAQT that the Packetizor system I built for NHBB uses LaTeX for its final packets.
  • You also know that that system has expanded to other corners of the Madden-verse (like USABB) and SCOP.
Are you implying that those organizations are not quizbowl? If so, will they be removed from the "major competitor to NAQT" list?
AKKOLADE wrote:
jonah wrote:For writers who use LaTeX (Scobol Solo and Masonic being the only current examples I know of), do you want the .tex source, or what?
don't use latex
This is the wrong response to the problem.
UlyssesInvictus wrote:Wow, .tex, didn't expect that, but I should have.

TeX is a standardized markup language, though, right? That's probably as good as DOC, since Pandoc will probably be able to parse it fine as well.
Packetizor can be made to output the questions in whatever format is best for QuizDB. If you'd like to find a better solution than parsing .pdf or .tex files for uploading NHBB and SCOP content, feel free to email me.

Re: Please upload your packets as DOCs too

Posted: Sun Oct 08, 2017 2:03 pm
by Mike Bentley
Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.

Re: Please upload your packets as DOCs too

Posted: Sun Oct 08, 2017 2:34 pm
by UlyssesInvictus
Mike Bentley wrote:Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.
That's an interesting idea, but more of one to consider solely for fun--or in the case where the original authors are no longer contactable--since I'd much rather just have the original, encoded source as far as data preservation goes.

Re: Please upload your packets as DOCs too

Posted: Mon Oct 16, 2017 4:08 pm
by ezubaric
UlyssesInvictus wrote:
Mike Bentley wrote:Getting back to the original problem of parsing PDFs, has anyone looked at using OCR for this? For instance, something like the Azure Computer Vision API. It's not necessarily the most efficient (and costs a small amount of money), but might get more reliable results assuming you're able to isolate the content of the document.
That's an interesting idea, but more of one to consider solely for fun--or in the case where the original authors are no longer contactable--since I'd much rather just have the original, encoded source as far as data preservation goes.
I don't think you'd actually want to use OCR; I think that, given examples of correct parsing, you could easily read the original bytestream and train an LSTM to convert it.

For things that only exist on paper, OCR would be a good option; I prefer the Google OCR toolkit (which we used for comic books).