[Teammetrics-discuss] A basic (and broken!) mbox filter.

Andreas Tille andreas at an3as.eu
Thu Aug 11 09:21:49 UTC 2011


On Thu, Aug 11, 2011 at 02:06:00PM +0530, Sukhbir Singh wrote:
>     git pull

Same here - I added 'In-reply-to' and 'References' to the whitelist.
 
> You will see a file called mboxfiltersimple.py . To test it out, take
> a mbox archive (unzipped) and pass it to the script as an argument.

I'd prefer either a backup of the original mbox or changing the name of
the stripped mbox.
 
> There is only one problem left: you will notice that the lines between
> the 'From' headers are not removed. The mbox module in Python's stdlib
> provides no way to manipulate headers so we have to do it manually.
>
> I think I have to write a regex to remove the lines in between the
> From headers or if you have a better idea, please share :)

IMHO there is a difference between
   From   ....
and
   From:
(including the ':') ... at least if I understood your question correctly.
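
To make the distinction concrete, here is a minimal sketch (not taken from
mboxfiltersimple.py; the function name classify_line and the sample
addresses are made up for illustration).  It only looks at how a line
starts: "From " followed by a space is the mbox separator that begins a
new message, while "From:" is an ordinary header field.

```python
def classify_line(line):
    """Tell an mbox separator line from an RFC 822 'From:' header."""
    if line.startswith("From "):
        return "separator"  # envelope line that starts a new message
    if line.lower().startswith("from:"):
        return "header"     # ordinary header field inside a message
    return "other"


# Hypothetical sample lines, only for illustration.
sample = [
    "From andreas@example.org Thu Aug 11 09:21:49 2011",
    "From: Andreas Tille <andreas@example.org>",
    "Subject: test",
]
kinds = [classify_line(line) for line in sample]
print(kinds)
```

A filter that treats both kinds of line the same way will either drop
the separator or keep the envelope cruft.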

> Let me know your results after testing this script and which approach
> you want me to take.

I attached a stripped-down mbox containing a single mail that shows a
problem with your approach:  some fields have content spanning more than
one line, and the continuation lines simply remain.  See the cruft after
the first "From" (without the ':') as well as the three lines after
'Date:', which originate from X-Spam-Status:.
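
As a rough sketch of what handling multi-line fields would involve if you
stay with manual filtering (this is my illustration, not your script; the
name strip_headers and the sample header values are invented):  a header
line that starts with a space or tab continues the previous field, so it
has to be kept or dropped together with that field.  The sketch assumes
it is given header lines only, without the "From " separator or the body.

```python
def strip_headers(header_lines, whitelist):
    """Keep only whitelisted header fields, including their continuation lines."""
    kept = []
    keep_current = False
    for line in header_lines:
        if line[:1] in (" ", "\t"):
            # Folded line: it belongs to the field that came before it.
            if keep_current:
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        keep_current = name in whitelist
        if keep_current:
            kept.append(line)
    return kept


# Hypothetical header block with a folded X-Spam-Status field.
headers = [
    "Date: Thu, 11 Aug 2011 09:21:49 +0000",
    "X-Spam-Status: No, score=-1.0 required=5.0",
    "\ttests=BAYES_00,SPF_PASS",
    "\tautolearn=ham",
    "Subject: test",
]
result = strip_headers(headers, {"date", "subject"})
print(result)
```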

IMHO, your "simple" way is too simple.  I do not know the mbox parsing
algorithm, but I think it is wrong not to use it if you want to be sure
to get real mboxes.

I personally have not dealt with mboxes before, but in principle these
are RFC 822 files which could be parsed like I did, for instance, in the
old code to generate the tasks pages:

   http://anonscm.debian.org/viewvc/blends/blends/branches/webtools_based_on_packages_files/blendstasktools.py?revision=1635&view=markup

line 1027 (works even transparently with gz/bz2 files).  Those stanzas
read via deb822.Sources.iter_paragraphs contain the *whole* content
belonging to a key, which is separated from its value by ':'.  The plan
to use the RFC 822 parser might be spoiled by the fact that the first
From does not feature the ':' delimiter.  However, I'm no expert in
this, so you need to experiment.

However, I absolutely do not understand why the mailbox module of Python
should not work.  IMHO this is exactly what should be used.  Can you
please be more verbose about the problems with it?  You have proven to
be able to read mboxes and you have also proven to be able to write
mboxes (in nntp).  So what exactly is the problem?
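
To illustrate what I have in mind, here is a minimal, untested-on-your-data
sketch using only the stdlib mailbox module.  The function name
filter_mbox, the whitelist contents and the ".stripped" suffix are my
assumptions, not anything from the repository.  The messages you get
from mailbox.mbox behave like email.message.Message objects, so headers
can be deleted with "del msg[name]", which also takes care of multi-line
fields.  Writing to a second file leaves the original mbox untouched.

```python
import mailbox

# Assumed whitelist; adjust to whatever the real script keeps.
WHITELIST = {"from", "date", "subject", "message-id",
             "in-reply-to", "references"}


def filter_mbox(src_path, dst_path, whitelist=WHITELIST):
    """Copy src_path to dst_path, keeping only whitelisted headers."""
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    try:
        for msg in src:
            # keys() may list a header name more than once; del removes
            # every occurrence, so iterate over the distinct names.
            for name in set(msg.keys()):
                if name.lower() not in whitelist:
                    del msg[name]
            dst.add(msg)  # the "From " separator is written by mailbox
        dst.flush()
    finally:
        dst.close()
        src.close()


# Usage (hypothetical file names):
#     filter_mbox("archive.mbox", "archive.mbox.stripped")
```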

Kind regards

      Andreas.

-- 
http://fam-tille.de
-------------- next part --------------
A non-text attachment was scrubbed...
Name: example_mbox.tgz
Type: application/x-gtar
Size: 1635 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/teammetrics-discuss/attachments/20110811/9e5aa1d3/attachment.tgz>

