[Teammetrics-discuss] Findings from NNTPStat and Web Archive Parser

Fri Dec 9 14:03:07 UTC 2011

On Fri, Dec 09, 2011 at 01:56:19PM +0530, Sukhbir Singh wrote:
> I didn't check the permissions, but given that I have them, I will do
> it myself in future :)

Just be bold and remove stuff which is in your way.  Everything can be
recreated somehow.

> Here is what I do:
> 
> 1. Read a page for a given message.
> 2. Get the required fields from it, such as Name, Subject, Date,
> Message-ID. These fields are defined in the FIELDS tuple on line 19 of
> archiveparser.py.
> 3. Populate the database on the fly, i.e., we don't save anything to
> the disk because there is no need.
> 4. Note that we are *not* saving the message body as of now. I will do
> it later once we fix other things.

Ahh, OK.  Thanks for the explanation.
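
The steps quoted above could be sketched roughly as follows.  This is only
an illustration, not the actual archiveparser.py code: the field names come
from the description, while the "Field: value" page format, the SQLite
in-memory database, the table layout and the helper names parse_fields,
make_db and store_message are all assumptions made for the example.

```python
import re
import sqlite3

# Field names taken from the description above; the real FIELDS tuple in
# archiveparser.py may look different.
FIELDS = ('Name', 'Subject', 'Date', 'Message-ID')


def parse_fields(page_text):
    """Extract FIELDS from text with 'Field: value' lines (assumed format)."""
    found = {}
    for field in FIELDS:
        pattern = r'^%s:[ \t]*(.*)$' % re.escape(field)
        match = re.search(pattern, page_text, re.MULTILINE)
        found[field] = match.group(1).strip() if match else None
    return found


def make_db():
    """Create an in-memory database (hypothetical schema, nothing on disk)."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE messages '
                 '(name TEXT, subject TEXT, date TEXT, message_id TEXT)')
    return conn


def store_message(conn, fields):
    """Insert one parsed message; the message body is deliberately not stored."""
    conn.execute('INSERT INTO messages (name, subject, date, message_id) '
                 'VALUES (?, ?, ?, ?)',
                 tuple(fields[f] for f in FIELDS))
    conn.commit()
```

Each page is parsed and written to the database immediately, so no
intermediate file is needed, which matches step 3.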

> > Thinking twice about it:  What if you *exactly* reimplement the
> > mbox structure (I mean the directory layout and file naming
> > scheme) of the official (hidden) mailing list archive?  This has two
> > really great advantages:
> 
> Like I pointed out above, we don't save anything, but yes this is a good idea.
> 
> Based on what I mentioned about our approach with the current archive
> parser, do you want me to implement the mbox creation from the web
> archive? It should not be much work because most of the code is
> already ready in nntpstat.py.

While it is not technically necessary, IMHO this would be a really cool
step strategically.  We could become independent of the listmasters and
might help others as well.  I'd really love this side effect.  So yes,
please recreate the mbox archive and parse it afterwards.

Kind regards and have a nice weekend (I'll be offline)

     Andreas.

-- 
http://fam-tille.de