[Teammetrics-discuss] Converter for mboxes (Was: Debian mailing lists archives as mbox)

Andreas Tille andreas at an3as.eu
Tue Aug 16 14:34:19 UTC 2011


Hi,

as requested by listmaster in this longish thread, we wrote a
converter which strips certain tags from mboxes of lists.debian.org.
The code is in the attached tgz.  You can also find it in

  git://git.debian.org/git/teammetrics/teammetrics.git

in directory mbox-tools.  The actual filter is mboxfilter.py.  It takes
an mbox (uncompressed for the moment - feel free to ask for gzip
support) and writes an mbox with the extension '.converted' - not a
very cool name, but you gave no specification.  It is easy to adapt to
your needs (better name / stdout / whatever).

For the moment it takes a single file specifying the Message-IDs
which should be deleted.  This file is called messageid and contains
*only* the Message-IDs (not the prefix Skip-Spam-Message-Id: as written
below).  It is not clear to us whether this prefix is always the same -
that seems unlikely, because it would then be redundant.  If the
exclusion files do contain those prefixes, can we safely assume that
we get the Message-ID with the following regexp:

    ^Skip-.*-Message-Id: (.*)$

?  If not, please give more detail or tell me where I can find those
exclusion files on master.
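
To make the question concrete, here is a minimal sketch of how skip-file
lines could be parsed with the regexp above.  The function name
read_skip_ids is hypothetical and not taken from mboxfilter.py, and
whether the pattern covers every prefix used in the real exclusion
files is exactly the open question.

```python
import re

# Regexp from the question above.  It is an assumption that all skip
# files use a prefix of the form "Skip-<something>-Message-Id: ".
SKIP_LINE = re.compile(r'^Skip-.*-Message-Id: (.*)$')


def read_skip_ids(lines):
    """Collect the Message-IDs from an iterable of skip-file lines.

    Lines which do not match the pattern are silently ignored.
    """
    ids = set()
    for line in lines:
        match = SKIP_LINE.match(line.rstrip('\n'))
        if match:
            ids.add(match.group(1).strip())
    return ids
```

Since the function accepts any iterable of lines, several skip files
could be handled by chaining their lines into one call.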

Moreover you were speaking about more than one exclusion file.  Do you
mean *several* exclusion files per mbox or just one per mbox which has a
defined naming scheme?

Regarding the fields which are carried over into the converted mbox: at
the beginning of mboxfilter.py you find a list HEADERS which specifies
the headers that are kept.  I also added a list possible_HEADERS which
contains fields which might make sense to keep for certain reasons.
This is just for documentation currently.
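
The idea of a header whitelist combined with Message-ID exclusion can
be sketched roughly as follows.  This is not the code from
mboxfilter.py: the contents of HEADERS and the names filter_message and
filter_messages are made up for illustration, and the sketch only
handles simple (non-multipart) messages.

```python
from email.message import Message

# Hypothetical whitelist for illustration only; the real list is
# defined at the beginning of mboxfilter.py.
HEADERS = ['From', 'To', 'Subject', 'Date', 'Message-ID']


def filter_message(msg, headers=HEADERS):
    """Return a new message carrying only the whitelisted headers.

    Assumption: msg is not multipart.  Dropping Content-Type from a
    multipart message would lose its MIME boundary.
    """
    out = Message()
    for name in headers:
        for value in msg.get_all(name, []):
            out[name] = value
    out.set_payload(msg.get_payload())
    return out


def filter_messages(messages, skip_ids, headers=HEADERS):
    """Yield filtered copies of all messages not listed in skip_ids."""
    for msg in messages:
        msgid = (msg.get('Message-ID') or '').strip()
        if msgid in skip_ids:
            continue  # excluded message, drop it completely
        yield filter_message(msg, headers)
```

The messages could come from iterating over a mailbox.mbox object from
the Python standard library, and the results could be added to a second
mbox for the '.converted' output.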

I tested the filter with random mboxes (from different lists, different
times, different sizes):

	debian-accessibility.200406
	debian-announce.200902
	debian-devel.199808
	debian-devel.200704
	debian-devel.201106
	debian-jr.200609
	debian-med.200609
	debian-ocaml-maint.200408

using the messageid file in the attached tarball and found that it
works for these.  This messageid file was created using the script
mbox-potential-spam-ids (just to have some input), and I checked the
result with mbox-diff-check to detect potential problems.
My tests did not reveal anything unexpected.

Please tell us how to proceed from here.

Kind regards

         Andreas.

On Thu, Aug 04, 2011 at 11:32:42AM +0200, Alexander Wirt wrote:
> Sukhbir Singh schrieb am Thursday, den 04. August 2011:
> 
> > Hi Alex,
> > 
> > Can we have some prototype/ format of the Message-IDs that you want us
> > to strip? It would be beneficial for both sides because then we can
> > show you what we will be handling and you can tell if something else
> > needs to be taken care of.
> Sure. We have several files with entries like:
> Skip-Spam-Message-Id: <4610e762.1f8f12a6.0218.7af1 at mx.google.com>
> Skip-Spam-Message-Id: <8600e4c3dd4c62fb51f343ac020608e3 at gmail.com>
> Skip-Spam-Message-Id: <CA287EE3.7684.AC15C2D5 at localhost>
> 
> it would be best if the converter accepts a message box and several skip
> files. I'll write a wrapper that does the dirty details on the filesystem.
> (Explaining everything in detail would take more time than writing a script).
> 
> Alex
> 
> 
> > 
> > Thanks for the help,
> > 
> > -- 
> > Sukhbir
> > 
> 
> 
> -- 
> To UNSUBSCRIBE, email to debian-devel-REQUEST at lists.debian.org
> with a subject of "unsubscribe". Trouble? Contact listmaster at lists.debian.org
> Archive: http://lists.debian.org/20110804093242.GM3348@smithers.snow-crash.org
> 
> 

-- 
http://fam-tille.de
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mboxfilter.tgz
Type: application/x-gtar
Size: 3273 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/teammetrics-discuss/attachments/20110816/1acff3e2/attachment.tgz>

