[Teammetrics-discuss] Next phase: Handling spam

Andreas Tille andreas at an3as.eu
Thu Jun 9 11:43:44 UTC 2011


On Thu, Jun 09, 2011 at 03:34:27PM +0530, Sukhbir Singh wrote:
> 
> So we can scan the archives for subjects and this can be easily done.
> But what exactly are we going to filter on in the subject? Just the
> above two metrics should do?

Well, it helped me to bring the *visible* influence of SPAM down to
zero.  This is not a complete but a pragmatic approach.
 
> > The next thing is that I tried to put a limit on non-ASCII UTF-8
> > characters which helped a lot against some Chinese SPAM.  However,
> > this has to be handled with some caution on mailing lists with
> > languages with a lot of such letters (Russian, Chinese, Japanese
> > etc.)  I handled this via
> 
> Ok, nice idea but I was thinking something. If we have to have
> exceptional cases, this no longer makes the process automated. This is
> a great metric indeed but don't you think having to manually specify
> this somehow limits us?

I do not think so.  Most team lists are in English (all teams in Debian
are international) so the subjects should not contain non-ASCII
characters (or at most a few limited exceptions).  The only problem is
the i18n user-oriented lists.  For our GSoC project we could even ignore
them.  I was just asked whether I could do the graphing as well.  So you
might end up with some more SPAM in the database for those lists but I
have never seen any influence of this in those high-volume user lists.

This means that the approach I tried more or less does the job I wanted
it to do.  As I said, I did not intend to write yet another spam filter
but rather to create a graph which is free from the influence of spammers.
 
> > because *currently* the only relevant list which had a lot non-ASCII
> > UTF-8 characters was debian-russian.  However, to make it general you
> > need another configurable list which contains all lists which should
> > allow a lot of non-ASCII characters in the subject.
> 
> >   if ( $author =~ /^[-&#x\d;\sA-F\?:,]+$/ || $countstrangechars > 7 || $numspamauthors > 0 ) {
> 
> Hmm, I see. So we are setting a limit of a maximum of seven 'strange'
> characters.

Yes.  Something in this range worked for me.
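
The quoted Perl condition could be sketched in Python roughly as
follows.  This is only an illustration under assumptions: the names
NON_ASCII_LISTS, count_strange_chars and looks_like_spam are
hypothetical, "strange" is taken to mean "outside 7-bit ASCII", and the
per-list exception is one possible way to realise the configurable list
mentioned above, not necessarily how the original script did it.

```python
import re

# Hypothetical allow-list: lists whose subjects legitimately contain many
# non-ASCII characters (debian-russian is the only one named in this thread).
NON_ASCII_LISTS = {"debian-russian"}

# Same character class as the quoted Perl check: an author field made up
# solely of HTML numeric entities such as "&#x4E2D;" plus separators.
ENTITY_ONLY_AUTHOR = re.compile(r"^[-&#x\d;\sA-F?:,]+$")

# Threshold taken from the quoted condition ($countstrangechars > 7).
MAX_STRANGE_CHARS = 7


def count_strange_chars(subject):
    """Count characters outside 7-bit ASCII (assumed meaning of 'strange')."""
    return sum(1 for ch in subject if ord(ch) > 127)


def looks_like_spam(list_name, author, subject, num_spam_authors=0):
    """Return True if the message should be kept out of the statistics."""
    if ENTITY_ONLY_AUTHOR.match(author) or num_spam_authors > 0:
        return True
    if list_name in NON_ASCII_LISTS:
        # Skip the character limit for lists that are expected to use
        # non-ASCII subjects.
        return False
    return count_strange_chars(subject) > MAX_STRANGE_CHARS
```

With this sketch a Russian subject on debian-russian passes, while the
same number of non-ASCII characters on an English team list is flagged.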
 
> Can you please point me to some mailing lists with spam messages that
> you came across so I can get a better sense of this? That way maybe I
> can also add something to this after seeing that. After that, we will
> proceed.

Most of this stuff was based on lists.d.o mailing lists which we
(currently) can not parse. :-(

I'd suggest you print the subjects of the mails in the logfiles as
well; then you will get a sense of the problem.  Just try to find some
better means - but do not spend time on fetching them all.  This is too
time consuming.  If you see a specific pattern you will get an idea of
how to avoid this pattern in an algorithm.
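
Printing the subjects could look roughly like this, assuming the archive
is available as a local mbox file.  The function name log_subjects and
the logger name are hypothetical; only the standard mailbox and logging
modules are used.

```python
import logging
import mailbox


def log_subjects(mbox_path, logger=None):
    """Log the Subject header of every message in an mbox file.

    Returns the subjects as a list so they can be inspected afterwards.
    """
    logger = logger or logging.getLogger("teammetrics.subjects")
    subjects = []
    # create=False: fail instead of silently creating an empty mbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            subject = str(message.get("Subject", ""))
            logger.info("subject: %s", subject)
            subjects.append(subject)
    finally:
        box.close()
    return subjects
```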

It might also be helpful to start filling the database and to search it
for certain patterns using SQL expressions.  Perhaps you should
concentrate on this before you work on the SPAM reduction.
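
Such a pattern search might look like the sketch below.  The table name
messages and its columns (list_name, author, subject) are hypothetical,
and sqlite3 is used only to keep the example self-contained; the real
database schema and engine may differ.

```python
import sqlite3


def find_subjects_like(conn, pattern):
    """Return (list_name, author, subject) rows whose subject matches
    the given SQL LIKE pattern, e.g. '%viagra%'."""
    cur = conn.execute(
        "SELECT list_name, author, subject FROM messages "
        "WHERE subject LIKE ? "
        "ORDER BY list_name, subject",
        (pattern,),
    )
    return cur.fetchall()
```

Running a handful of such queries against a partly filled database
should show quickly which patterns are worth turning into a filter rule.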
 
> Reminder: We still have to get a reply from lists.debian.org . As this
> phase should be complete by this weekend hopefully, we have to start
> work on that next week.

We should ping again.  I just asked Alexander Wirt for his reasons for
not liking the idea of mboxes, but he has not yet responded.

Kind regards

      Andreas.

-- 
http://fam-tille.de


