[Teammetrics-discuss] Spam filters and encoding handlers in place

Sukhbir Singh sukhbir.in at gmail.com
Wed Jun 22 20:28:55 UTC 2011


Hi,

repository.update()

I know the tool still has to be run as root, but this should be the
last time, as the next phase is creating a deb package! For now,
please run it as root if you want to try it out for yourself, or
wait a day or two while I remove this hurdle (read further).

Changes:

+ A working spam filter is in place. It is handled by spamfilter.py.

You can have a look at the source code to see what is being handled.
From what I have seen, the filters in place cut a significant amount
of spam.

+ I have tried to handle all the encoding errors, but a few (very few,
I guess) still remain. Weirdly, all the remaining encoding errors are
in the Subject field *only* and not in the Name field. I will find out
what is causing the problem soon.
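To illustrate where Subject decoding can go wrong, here is a minimal
sketch in Python 3 using the standard library's email.header module.
It is not the code from the repository: the function name
decode_mime_header and the fallback choices (replacing undecodable
bytes, falling back to latin-1 for unknown charsets) are my
assumptions.

```python
from email.header import decode_header


def decode_mime_header(raw):
    """Decode an RFC 2047 header value into text (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            try:
                # Undecodable bytes are replaced instead of raising.
                parts.append(chunk.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # The declared charset is unknown to Python; latin-1 maps
                # every byte, so this fallback never fails.
                parts.append(chunk.decode("latin-1"))
        else:
            # Headers without encoded words come back as plain str.
            parts.append(chunk)
    return "".join(parts)
```

A header with a bogus or missing charset label is one plausible source
of the remaining errors, which is why the sketch never raises on them.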

+ There is a new table called listspam, which saves the reason why a
message was considered spam. This will help us identify how well our
filter is working (as requested by Andreas).
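As a rough sketch of how such a table can be used, the snippet below
stores one flagged message and counts messages per reason. It uses an
in-memory SQLite database as a stand-in for the real database, and the
schema (name, subject, reason as text columns) is only my assumption
based on the columns in the sample output; record_spam is a
hypothetical helper.

```python
import sqlite3


def record_spam(conn, name, subject, reason):
    """Store why a message was flagged (hypothetical helper)."""
    conn.execute(
        "INSERT INTO listspam (name, subject, reason) VALUES (?, ?, ?)",
        (name, subject, reason),
    )
    conn.commit()


# In-memory stand-in for the real database; the schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listspam (name TEXT, subject TEXT, reason TEXT)")

record_spam(conn, "EVRIKA GROUP", "[med-svn] LA MULTI ANI !",
            "Name is in upper case")

# Counting per reason shows which filter rule catches the most messages.
rows = conn.execute(
    "SELECT reason, COUNT(*) FROM listspam GROUP BY reason"
).fetchall()
```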

+ Here is what some sample output looks like from listspam (from the
list: debian-med-commit):

          name           |                     subject                     |        reason
-------------------------+-------------------------------------------------+-----------------------
 CORNEL                  | [med-svn] util                                  | Name is in upper case
 CURSURI GRATUITE ONLINE | [med-svn] invitatie la cursuri gratuite online  | Name is in upper case
 EVRIKA GROUP            | [med-svn] LA MULTI ANI !                        | Name is in upper case
 LINO TECH               | [med-svn] Fw: PARDOSELI PVC TRAFIC INTENS       | Name is in upper case
 EVRIKA GROUP            | [med-svn] invitatie la cursuri de perfectionare | Name is in upper case
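The "Name is in upper case" reason in the sample output could come
from a check along these lines. This is only a sketch of the idea, not
the code in spamfilter.py; the function name spam_reason and the
choice to ignore non-letter characters are my assumptions.

```python
def spam_reason(name):
    """Return a reason string if the sender name looks like spam, else None."""
    # Look only at letters so spaces, digits and punctuation do not matter.
    letters = [c for c in name if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return "Name is in upper case"
    return None
```

With this sketch, spam_reason("EVRIKA GROUP") returns the reason
string, while spam_reason("Charles Plessy") returns None.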


So, overall, pretty slick!

SELECT name, COUNT(name) FROM listarchives
WHERE project='debian-med-commit'
GROUP BY name ORDER BY count DESC LIMIT 10;
                name                | count
------------------------------------+-------
 Charles Plessy                     |  1352
 Andreas Tille                      |  1261
 tille at alioth.debian.org         |   755
 hanska-guest at alioth.debian.org  |   509
 Mathieu Malaterre                  |   498
 plessy at alioth.debian.org        |   389
 Steffen Möller                     |   346
 smoe-guest at alioth.debian.org    |   344
 charles-guest at alioth.debian.org |   342
 olivier sallou                     |   169
(10 rows)

So let me know your thoughts on this.

The next phases in order:

+ deb package.
+ encoding errors.

That's all for tonight!


