[Teammetrics-discuss] Web Archive Parser ready for your testing.

Sukhbir Singh sukhbir.in at gmail.com
Wed Jan 4 06:35:32 UTC 2012


Hi Andreas,

> The only problem I have is the location.  This file is actually not a
> configuration file (it is not intended to be edited manually to
> influence the web archive parser directly).  So I'd vote for something
> like
>
>    /var/cache/teammetrics/archiveparser.status
>
> or something like this - feel free to find a better name.

Ok, good idea. Fixed to:

    CONFIG_FILE = '/var/cache/teammetrics/archiveparser.status'

> Seems to work fine.  I'm just running the parser and it fetches a lot of
> mails in a short time.  However, I do not see any records in the listspam
> table.  Is this the intended behaviour?

> because this list was renamed.  The logfile looks perfectly normal
> (however also here no sign of SPAM handling).

Yes, that is expected and I will explain why.

When running the script, I noticed that many genuine messages also
match the checks we meant to use for identifying spam. I decided it is
better to have some spam in our metrics *rather* than miss genuine
messages. Here is how the spam checks we discussed turned out:

    Invalid date:
        A large number of messages had this problem, but *more* of
        them were NOT spam than were spam. So this check is not
        usable for us.

    Missing message-IDs:
        Again, the same problem as above.

    Other filters in spamfilter.py (follows ahead)
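
For reference, the two header checks above can be sketched roughly as
follows. This is only an illustration using Python's standard `email`
module; the function name `spam_indicators` and the returned labels are
made up for this example and are not the code in spamfilter.py.

```python
from email import message_from_string
from email.utils import parsedate_tz


def spam_indicators(raw_message):
    """Return the names of the header checks that flag this message.

    Hypothetical helper: it only reports which checks fired and does
    not decide on its own whether the message is spam.
    """
    msg = message_from_string(raw_message)
    flagged = []

    # Invalid date: the Date header is missing or cannot be parsed.
    date_header = msg.get('Date')
    if date_header is None or parsedate_tz(date_header) is None:
        flagged.append('invalid-date')

    # Missing Message-ID: the header is absent or empty.
    if not (msg.get('Message-ID') or '').strip():
        flagged.append('missing-message-id')

    return flagged
```

As noted above, a message that trips these checks is not necessarily
spam, so the result is best treated as a hint rather than a verdict.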

So here is how I want to handle this, and I need your opinion:

    Populate `listspam` based on all the above filters, BUT let's not
    skip a message just because it is flagged as spam; populate it in
    `listarchives` as well.
    So: if it was spam, it will be at the bottom of the list. If it
    was not, it will add to the metrics.

This way both our spam-fighting efforts and our metrics will be satisfied.
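
A minimal sketch of that flow, to make the proposal concrete. It uses
an in-memory SQLite database as a stand-in for our real database, and
the column names (`message_id`, `project`) and the function
`store_message` are assumptions for illustration only, not the actual
schema or parser code.

```python
import sqlite3


def store_message(conn, message_id, project, is_spam):
    """Always archive the message; also record it in listspam if flagged."""
    cur = conn.cursor()
    # Every message goes into listarchives, so no genuine mail is lost.
    cur.execute(
        "INSERT INTO listarchives (message_id, project) VALUES (?, ?)",
        (message_id, project),
    )
    # Flagged messages are additionally recorded in listspam.
    if is_spam:
        cur.execute(
            "INSERT INTO listspam (message_id, project) VALUES (?, ?)",
            (message_id, project),
        )
    conn.commit()


# Stand-in tables with hypothetical columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listarchives (message_id TEXT, project TEXT)")
conn.execute("CREATE TABLE listspam (message_id TEXT, project TEXT)")

store_message(conn, "<1@example.org>", "demo-list", is_spam=False)
store_message(conn, "<2@example.org>", "demo-list", is_spam=True)
```

After these two calls, `listarchives` holds both messages and
`listspam` holds only the flagged one.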
