Packet archive project...what you can do to help
This is double-posted in the Misc HSQB and Collegiate Discussion forums for maximum circulation.
As you may or may not have heard, Jerry has been working on a new packet archive that will have all sorts of awesome features.
Anyway, it would help us to have a lot of the older packets (1989-2000) converted into readable MSWord or .rtf files. Since there are dozens of tournaments that need lots of work, I'm hoping that I can solicit board members to put in some work to get these older questions formatted for current consumption.
Please send me an email at [email protected] if you are willing to help out. I'll send you 3-5 tournament sets that need to be formatted.
Your duties include:
-Converting all files that are not already .rtf or .doc into .rtf or .doc format.
-Naming all packets using this format: [YEAR - TOURNAMENT - PACKET AUTHOR - PACKET NUMBER (if applicable)]. So this would look something like "2001 - ACF Fall - Michigan A" or, if there are multiple packets by the same group/person, "2001 - ACF Fall - Editors - 1".
-If a single packet's tossups and bonuses are in two different files, combining those separate tossup and bonus files into a single packet file.
-If there are wacky line breaks, you must delete them.
For example, if a tossup is formatted:
FTP, name this group that, in 1967
released the album Sgt. Pepper's Lonely Hearts Club Band.
you must reformat it to be on one line. Tossups should all be one paragraph with no extraneous line breaks. Bonuses should have no extraneous line breaks either.
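For anyone who would rather script the line-break cleanup than do it by hand, here is a minimal sketch in Python. It assumes the packet has already been saved as plain text and that questions are separated by blank lines; neither assumption comes from the archive tooling itself, and `unwrap_paragraphs` is just a name made up for this example. The second paragraph of the sample is placeholder text.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join the lines inside each paragraph into a single line.

    Assumption: paragraphs (tossups, bonus parts) are separated by one
    or more blank lines, so any single newline is an extraneous break.
    """
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # " ".join(p.split()) collapses every run of whitespace, including
    # stray newlines, into a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


sample = (
    "FTP, name this group that, in 1967\n"
    "released the album Sgt. Pepper's Lonely Hearts Club Band.\n"
    "\n"
    "A second question (placeholder text)\n"
    "that was also broken across lines.\n"
)
print(unwrap_paragraphs(sample))
```

If a packet does not put blank lines between questions, this would merge everything into one paragraph, so check the output by eye before saving over the original.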
Basically the idea is to make these packets look like they're decently formatted.
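The naming format from the duties list is also easy to generate rather than type. This is only a sketch: `packet_filename` is a hypothetical helper, and appending the file extension to the name is an assumption, since the examples above show names without one.

```python
from typing import Optional


def packet_filename(year: int, tournament: str, author: str,
                    number: Optional[int] = None, ext: str = ".rtf") -> str:
    """Build 'YEAR - TOURNAMENT - PACKET AUTHOR - PACKET NUMBER' plus ext.

    The packet number is left out when it does not apply. Adding the
    extension here is an assumption made for this example.
    """
    parts = [str(year), tournament.strip(), author.strip()]
    if number is not None:
        parts.append(str(number))
    return " - ".join(parts) + ext


print(packet_filename(2001, "ACF Fall", "Michigan A"))
print(packet_filename(2001, "ACF Fall", "Editors", 1, ext=".doc"))
```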
Jerry, if you have anything to add to these criteria, post it below.
Anyway, send me an email guys ([email protected]) because I could really use your help.
BTW, if you need to mass-convert files into .rtf or .doc, there is a cool OpenOffice-based program that does that. It's called Danny's Convertor, and it works well with most formats.
Gautam - ACF
Currently tending to the 'quizbowl hobo' persuasion.
- grapesmoker
- Sin
- Posts: 6345
- Joined: Sat Oct 25, 2003 5:23 pm
- Location: NYC
- Contact:
Hey guys, I could still use volunteers for this. I have a good 85 packet sets that need work, and the archive can't be complete until all of them are formatted and scanned. I already have formatted over 85 sets, and I have another 65 that I am responsible for getting formatted. Any help anyone's willing to give would be greatly appreciated. Thanks.
I know I'm late to the party and not offering any help, but this never stopped anyone on the list before.
Given all of the talk of late about QBML, if people are undertaking so much manual effort in formatting the packets, why not go the extra step and get it into something less opaque than rtf/doc?
Still better than nothing, though.
Cheers,
Jordan
Jordan Boyd-Graber
UMD (College Park, MD), Faculty Advisor 2018-present
UC Boulder, Founder / Faculty Advisor 2014-2017
UMD (College Park, MD), Faculty Advisor 2010-2014
Princeton, Player 2004-2009
Caltech (Pasadena, CA), Player / President 2000-2004
Ark Math & Science (Hot Springs, AR), Player 1998-2000
Monticello High School, Player 1997-1998
Human-Computer Question Answering:
http://qanta.org/
theMoMA wrote: I don't understand the clear formatting circlejerk. If Jerry's program can parse .docs or .rtfs into searchable html, what's the point of having LaTeX or whatever?
LaTeX is great and all, and has at least a publicly available definition that those working with it can refer to (rather than just having to guess, as with RTF and DOC ... especially the latter). XML or even HTML would be better, in my humble opinion.
It's great that Jerry has no problem reading RTFs, but the whole point of open formats is that unexpected things happen when you're able to work across different applications and protocols. For instance, if I wanted to write a Python script to count the frequency of the various words that "titular" precedes, I can do that nearly instantly if the format is XML, HTML, etc. It's much more difficult if the format is DOC or RTF. This is just one example off the top of my head; I'm sure there's quite a bit of data mining that can be done.
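To make the "titular" example concrete, here is a rough sketch of what such a script might look like using only the Python standard library. The sample HTML is invented for illustration, and `words_after` and `TextExtractor` are hypothetical names; how the archive would actually mark up questions is not something I know.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words_after(html: str, target: str = "titular") -> Counter:
    """Count the words that immediately follow `target` in the HTML text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    pattern = re.compile(r"\b" + re.escape(target.lower()) + r"\s+([a-z'-]+)")
    return Counter(pattern.findall(text))


# Made-up sample questions, not taken from any real packet.
sample = (
    "<p>This author created the titular character of Candide.</p>"
    "<p>Name this play whose <b>titular</b> character kills Polonius.</p>"
    "<p>The titular object in this poem is a Grecian urn.</p>"
)
print(words_after(sample).most_common())
```

Doing the same against DOC or RTF would first require a converter or a parser for those formats, which is the extra friction I mean.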
"And yeah, I'm not so sure that the whole telling people they should do more work when you're not doing any of it thing is a great idea."
It's not necessarily more work, as it's just the format you choose for publication. And if you don't do it, it's possible that others will do it. My point was that if all this work is being done, why not unlock it from opaque, difficult to read Microsoft formats?
And, even though the vagaries of teh Intarweb might make unreasoned opinions offered from relatively disinterested parties seem like commands, it was certainly only the former and not the latter.
It's also possible that I've missed a key step in all of this, and that RTF is just a common format that everything is being turned into as an intermediate step, and everything will be in a downloadable SQL dump à la Wikipedia later.
Jordan Boyd-Graber
UMD (College Park, MD), Faculty Advisor 2018-present
UC Boulder, Founder / Faculty Advisor 2014-2017
UMD (College Park, MD), Faculty Advisor 2010-2014
Princeton, Player 2004-2009
Caltech (Pasadena, CA), Player / President 2000-2004
Ark Math & Science (Hot Springs, AR), Player 1998-2000
Monticello High School, Player 1997-1998
Human-Computer Question Answering:
http://qanta.org/
Really, you'd have to talk to Jerry about his data mining plans. The point of making the files into docs or rtfs is that that's what Jerry's program uses, that's what most of the files are already saved as, and that's what people use to read documents.
It would be a ton more work to make the files into LaTeX style files because many tournament sets that are already docs or rtfs and need just a few minutes of work or none at all would take a lot longer to do if conversion to LaTeX were required.
theMoMA wrote: It would be a ton more work to make the files into LaTeX style files
But even something like a version all in HTML would be great, as it would be greppable.
Jordan Boyd-Graber
UMD (College Park, MD), Faculty Advisor 2018-present
UC Boulder, Founder / Faculty Advisor 2014-2017
UMD (College Park, MD), Faculty Advisor 2010-2014
Princeton, Player 2004-2009
Caltech (Pasadena, CA), Player / President 2000-2004
Ark Math & Science (Hot Springs, AR), Player 1998-2000
Monticello High School, Player 1997-1998
Human-Computer Question Answering:
http://qanta.org/
- grapesmoker
- Sin
- Posts: 6345
- Joined: Sat Oct 25, 2003 5:23 pm
- Location: NYC
- Contact:
Jordan, sorry for not addressing your questions earlier.
Basically, here is the thing: it is hard to make people adhere to a single style as it is. If I had asked those who are working on this project with me to format everything as XML/HTML/LaTeX by hand, I would never have gotten anything. So my approach is to ask folks to edit the original Word/RTF files, which I can process fairly easily into XML or LaTeX or whatever you like. I hope that once you see the results, you will understand the rationale for the distribution of work.
I'm sorry that it's not moving as fast as I'd like; Andrew and whoever has been helping him have done a stellar job so far, but I've been traveling and also moving apartments, which gave me no time to work on stuff over the last two weeks. I hope to have something up soon, with at least a year or two of tournaments available for browsing.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
- naturalistic phallacy
- Auron
- Posts: 1490
- Joined: Tue May 01, 2007 12:03 am
- Location: Minneapolis, MN
- Contact:
-
- Rikku
- Posts: 475
- Joined: Thu Apr 14, 2005 8:03 pm
I do not intend to sound impatient or demanding in this post, and I realize Jerry is probably busy and has more important things to do, but I am wondering if there is an estimate for when this will be online.
From the information I have gathered, it appears that the archive will be really cool and very helpful, and I am looking forward to it.
Also, if there are any more packets that need to be formatted, I would be willing to help with that.
Brendan Byrne
Drake University, 2006-2008
University of Minnesota, 2008-2010
- Mike Bentley
- Sin
- Posts: 6461
- Joined: Fri Mar 31, 2006 11:03 pm
- Location: Bellevue, WA
- Contact:
If someone can get me a list of packets that are on this archive, I can potentially scan ones that we have in paper form and the archive does not. They'd be in annoying pdf documents, but it's better than not having the questions (plus, there are probably tools for extracting text from scanned images that work to some degree).
Mike Bentley
Treasurer, Partnership for Academic Competition Excellence
Adviser, Quizbowl Team at University of Washington
University of Maryland, Class of 2008