Packet archive project...what you can do to help

theMoMA
Forums Staff: Administrator
Posts: 5652
Joined: Mon Oct 23, 2006 2:00 am

Post by theMoMA » Sat Aug 18, 2007 7:00 pm

This is double-posted in the Misc HSQB and Collegiate Discussion forums for maximum circulation.

As you may or may not have heard, Jerry has been working on a new packet archive that will have all sorts of awesome features.

Anyway, it would help us to have a lot of the older packets (1989-2000) converted into readable MSWord or .rtf files. Since there are dozens of tournaments that need lots of work, I'm hoping that I can solicit board members to put in some work to get these older questions formatted for current consumption.

Please send me an email at limozeen@gmail.com if you are willing to help out. I'll send you 3-5 tournament sets that need to be formatted.

Your duties include:
-Converting all non-.rtf or .doc files into .rtfs or .docs.
-Naming all packets using this format [YEAR - TOURNAMENT - PACKET AUTHOR - PACKET NUMBER (if applicable)]. So this would look something like "2001 - ACF Fall - Michigan A" or if there are multiple packets by the same group/person, "2001 - ACF Fall - Editors - 1"
-If a single packet's tossups and bonuses are in two different files, combining those separate tossup and bonus files into a single packet file.
-If there are wacky line breaks, you must delete them.

For example, if a tossup is formatted:

FTP, name this group that, in 1967
released the album Sgt. Pepper's Lonely Hearts Club Band.

you must reformat it to be on one line. Tossups should all be one paragraph with no extraneous line breaks. Bonuses should have no extraneous line breaks either.
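
For anyone who would rather script the line-break cleanup than do it by hand, here is a minimal sketch. It assumes the packet has already been saved as plain text and that a blank line separates one question from the next; neither assumption comes from the instructions above, and the function name `unwrap_paragraphs` is made up for illustration. Bonus parts that sit on consecutive lines with no blank line between them would be merged too, so check the result by eye.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join lines broken mid-paragraph; blank lines still separate questions."""
    # Split on one or more blank lines so each question is one chunk.
    chunks = re.split(r"\n\s*\n", text.strip())
    # Inside a chunk, collapse every run of whitespace (newlines included) to one space.
    return "\n\n".join(" ".join(chunk.split()) for chunk in chunks)


broken = (
    "FTP, name this group that, in 1967\n"
    "released the album Sgt. Pepper's Lonely Hearts Club Band.\n"
)
print(unwrap_paragraphs(broken))
```

On the example tossup this prints the whole question on a single line.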

Basically the idea is to make these packets look like they're decently formatted.
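
The naming format from the duties list above can also be generated rather than typed, which helps keep the spacing and hyphens consistent across a whole set. This is only a sketch: the helper `packet_name` and its parameter names are hypothetical, and it does not add a file extension.

```python
from typing import Optional


def packet_name(year: int, tournament: str, author: str,
                number: Optional[int] = None) -> str:
    """Build 'YEAR - TOURNAMENT - PACKET AUTHOR[ - PACKET NUMBER]'."""
    parts = [str(year), tournament, author]
    # The packet number is only used when one group or person wrote several packets.
    if number is not None:
        parts.append(str(number))
    return " - ".join(parts)


print(packet_name(2001, "ACF Fall", "Michigan A"))
print(packet_name(2001, "ACF Fall", "Editors", 1))
```

The two calls reproduce the two example names given in the duties list.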

Jerry, if you have anything to add to these criteria, post it below.

Anyway, send me an email, guys (limozeen@gmail.com), because I could really use your help.

Rothlover
Yuna
Posts: 816
Joined: Wed Feb 25, 2004 8:41 pm

Post by Rothlover » Sun Aug 19, 2007 12:14 am

Does this need any webspace to work in? If it does, you know where to hit me up.

Gautam
Auron
Posts: 1413
Joined: Sun Feb 11, 2007 7:28 pm
Location: Zone of Avoidance

Post by Gautam » Sun Aug 19, 2007 12:43 am

BTW, if you need to mass-convert files into .rtf or .doc, there is a cool OpenOffice-based program that does that. It's called Danny's Convertor and it works well with most formats.
Gautam - ACF
Currently tending to the 'quizbowl hobo' persuasion.

grapesmoker
Sin
Posts: 6364
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC

Post by grapesmoker » Sun Aug 19, 2007 1:33 pm

Andrew has covered all the bases. I actually have a lot of packets already and hopefully I will be able to make the site public sometime this week.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
code ape, loud voice, general nuissance

theMoMA
Forums Staff: Administrator
Posts: 5652
Joined: Mon Oct 23, 2006 2:00 am

Post by theMoMA » Fri Aug 24, 2007 11:20 pm

Hey guys, I could still use volunteers for this. I have a good 85 packet sets that need work, and the archive can't be complete until all of them are formatted and scanned. I have already formatted over 85 sets, and I have another 65 that I am responsible for getting formatted. Any help anyone's willing to give would be greatly appreciated. Thanks.

ezubaric
Rikku
Posts: 366
Joined: Mon Feb 09, 2004 8:02 pm
Location: College Park, MD

Post by ezubaric » Fri Sep 07, 2007 2:30 pm

I know I'm late to the party and not offering any help, but this never stopped anyone on the list before.

Given all of the talk of late about QBML, if people are undertaking so much manual effort in formatting the packets, why not go the extra step and get them into something less opaque than rtf/doc?

Still better than nothing, though.

Cheers,

Jordan
Jordan Boyd-Graber
UMD (College Park, MD), Faculty Advisor 2018-present
UC Boulder, Founder / Faculty Advisor 2014-2017
UMD (College Park, MD), Faculty Advisor 2010-2014
Princeton, Player 2004-2009
Caltech (Pasadena, CA), Player / President 2000-2004
Ark Math & Science (Hot Springs, AR), Player 1998-2000
Monticello High School, Player 1997-1998

Human-Computer Question Answering:
http://qanta.org/

theMoMA
Forums Staff: Administrator
Posts: 5652
Joined: Mon Oct 23, 2006 2:00 am

Post by theMoMA » Fri Sep 07, 2007 3:47 pm

I don't understand the clear formatting circlejerk. If Jerry's program can parse .docs or .rtfs into searchable html, what's the point of having LaTeX or whatever?

And yeah, I'm not so sure that the whole telling people they should do more work when you're not doing any of it thing is a great idea.

ezubaric
Rikku
Posts: 366
Joined: Mon Feb 09, 2004 8:02 pm
Location: College Park, MD

Post by ezubaric » Sat Sep 08, 2007 9:11 pm

theMoMA wrote: I don't understand the clear formatting circlejerk. If Jerry's program can parse .docs or .rtfs into searchable html, what's the point of having LaTeX or whatever?
LaTeX is great and all, and has at least a publicly available definition that those working with it can refer to (rather than just having to guess, as with RTF and DOC ... especially the latter). XML or even HTML would be better, in my humble opinion.

It's great that Jerry has no problem reading RTFs, but the whole point of open formats is that unexpected things happen when you're able to work across different applications and protocols. For instance, if I wanted to write a Python script to count the frequency of the various words that "titular" precedes, I can do that nearly instantly if the format is XML, HTML, etc. It's much more difficult if the format is DOC or RTF. This is just one example off the top of my head; I'm sure there's quite a bit of data mining that can be done.
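
A minimal sketch of the kind of script described above, assuming the packets are available as HTML strings. It uses only the Python standard library; the names `TextExtractor` and `words_after_titular` and the sample text are invented for illustration and are not part of any existing archive tool.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)


def words_after_titular(html: str) -> Counter:
    """Count the words that immediately follow 'titular' in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps text from adjacent tags from running together.
    text = " ".join(parser.pieces)
    following = re.findall(r"\btitular\s+([A-Za-z'-]+)", text, flags=re.IGNORECASE)
    return Counter(word.lower() for word in following)


sample = (
    "<p>FTP, name this novel about the titular whale.</p>"
    "<p>Its titular <b>whale</b> is white; the titular captain is not.</p>"
)
print(words_after_titular(sample).most_common())
```

Doing the same against DOC files would first need a parser for the binary format, which is the extra difficulty the paragraph above is pointing at.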
theMoMA wrote: And yeah, I'm not so sure that the whole telling people they should do more work when you're not doing any of it thing is a great idea.
It's not necessarily more work, as it's just the format you choose for publication. And if you don't do it, it's possible that others will. My point was that if all this work is being done, why not unlock it from opaque, difficult-to-read Microsoft formats?

And, even though the vagaries of teh Intarweb might make unreasoned opinions offered from relatively disinterested parties seem like commands, it was certainly only the former and not the latter.

It's also possible that I've missed a key step in all of this, and that RTF is just a common format that everything is being turned into as an intermediate step, and everything will be in a downloadable SQL dump à la Wikipedia later.

theMoMA
Forums Staff: Administrator
Posts: 5652
Joined: Mon Oct 23, 2006 2:00 am

Post by theMoMA » Sun Sep 09, 2007 2:45 am

Really, you'd have to talk to Jerry about his data mining plans. The point of making the files into docs or rtfs is that that's what Jerry's program uses, that's what most of the files are already saved as, and that's what people use to read documents.

It would be a ton more work to make the files into LaTeX style files because many tournament sets that are already docs or rtfs and need just a few minutes of work or none at all would take a lot longer to do if conversion to LaTeX were required.

ezubaric
Rikku
Posts: 366
Joined: Mon Feb 09, 2004 8:02 pm
Location: College Park, MD

Post by ezubaric » Mon Sep 10, 2007 1:39 pm

theMoMA wrote: It would be a ton more work to make the files into LaTeX style files
But even something like a version all in HTML would be great, as it would be greppable.

grapesmoker
Sin
Posts: 6364
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC

Post by grapesmoker » Mon Sep 10, 2007 2:38 pm

Jordan, sorry for not addressing your questions earlier.

Basically, here is the thing: it is hard to make people adhere to a single style as it is. If I had asked those who are working on this project with me to format everything as XML/HTML/LaTeX by hand, I would never have gotten anything. So my approach is to ask folks to edit the original Word/RTF files, which I can process fairly easily into XML or LaTeX or whatever you like. I hope that once you see the results, you will understand the rationale for the distribution of work.
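
To give a rough idea of what that last processing step could look like, here is a sketch that wraps cleaned-up question text in XML. It is not the archive's actual code: it assumes the Word/RTF file has already been exported to plain text with one question per blank-line-separated paragraph, and the `packet` and `tossup` element names are a made-up schema.

```python
import xml.etree.ElementTree as ET


def tossups_to_xml(plain_text: str, packet_label: str) -> str:
    """Wrap each blank-line-separated paragraph in a <tossup> element."""
    packet = ET.Element("packet", {"name": packet_label})
    for paragraph in plain_text.strip().split("\n\n"):
        # Collapse stray line breaks so each question is a single line of text.
        question = " ".join(paragraph.split())
        if question:
            ET.SubElement(packet, "tossup").text = question
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(packet, encoding="unicode")


text = (
    "FTP, name this group that, in 1967\n"
    "released the album Sgt. Pepper's Lonely Hearts Club Band.\n"
    "\n"
    "A second tossup would go here.\n"
)
print(tossups_to_xml(text, "2001 - ACF Fall - Michigan A"))
```

The point of the split is that volunteers only need to get the paragraphs right in Word or RTF; a script like this can add whatever markup is wanted afterwards.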

I'm sorry that it's not moving as fast as I'd like; Andrew and whoever has been helping him have done a stellar job so far, but I've been traveling and also moving apartments, which gave me no time to work on stuff over the last two weeks. I hope to have something up soon, with at least a year or two of tournaments available for browsing.

naturalistic phallacy
Auron
Posts: 1343
Joined: Tue May 01, 2007 12:03 am
Location: Minneapolis, MN

Post by naturalistic phallacy » Thu Sep 13, 2007 2:52 am

If there are still packets to be done, I can help.

Something to distract from the monotony of stats would be nice.
Bernadette Spencer
University of Minnesota
MCTC
Event Manager, PACE

Father, among these many souls / Is there not one / Whom thou shalt pluck for love out of the coals?

Strongside
Rikku
Posts: 475
Joined: Thu Apr 14, 2005 8:03 pm

Post by Strongside » Mon Oct 22, 2007 2:40 pm

I do not intend to sound impatient or demanding in this post, and I realize Jerry is probably busy and has more important things to do, but I am wondering if there is an estimate for when this will be online.

From the information I have gathered, it appears that the archive will be really cool and very helpful, and I am looking forward to it.

Also, if there are any more packets that need to be formatted, I would be willing to help with that.
Brendan Byrne

Drake University, 2006-2008
University of Minnesota, 2008-2010

Mike Bentley
Auron
Posts: 5713
Joined: Fri Mar 31, 2006 11:03 pm
Location: Bellevue, WA

Post by Mike Bentley » Mon Oct 22, 2007 4:06 pm

If someone can get me a list of packets that are on this archive, I can potentially scan ones that we have in paper form and the archive does not. They'd be in annoying pdf documents, but it's better than not having the questions (plus, there are probably tools for extracting text from scanned images that work to some degree).
Mike Bentley
Treasurer, Partnership for Academic Competition Excellence
Adviser, Quizbowl Team at University of Washington
University of Maryland, Class of 2008
