[Teammetrics-discuss] Next phase: Handling spam

Sukhbir Singh sukhbir.in at gmail.com
Thu Jun 9 10:04:27 UTC 2011


Hi!

Now this is what we call a *very* detailed reply :-)

> There are probably many more but these "authors" in some low traffic
> lists made it into the top X ranking.  I would suggest putting those
> "authors" in a config file (say /etc/teammetrics/spam-handling.conf or
> something like this).  So you can easily add strings you definitely do
> not want to see in the statistics.

Ok.
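
As a rough sketch of how such a config file could be read (Python here;
the [spam] section, the "authors" key and the placeholder names are made
up for illustration, nothing about the format is agreed yet, and exact
case-insensitive matching is just one possible choice):

```python
import configparser

# Hypothetical contents of spam-handling.conf. The section name, key name
# and the two placeholder authors are assumptions for illustration only.
SAMPLE_CONF = """
[spam]
authors =
    Example Spammer
    Another Placeholder
"""


def load_spam_authors(text):
    """Return the set of lower-cased author strings listed under [spam]."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get("spam", "authors", fallback="")
    # One author per line; skip empty lines and surrounding whitespace.
    return {line.strip().lower() for line in raw.splitlines() if line.strip()}


def is_spam_author(author, spam_authors):
    """Exact, case-insensitive match against the configured strings."""
    return author.strip().lower() in spam_authors
```

Adding a new string would then only mean adding one line to the file.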

> The next thing which I noted was that certain strings in the subject
> are a clear sign of SPAM which is just not relevant for our teammetrics:
>
>  'File blocked - ScanMail for Lotus Notes',
>  '^u?n?subscribe\s+.?$'
>
> Same here: The list is far from complete but helped me sort out
> a certain amount of useless subjects.  I would add this list as
> SPAMSUBJECTS in the same config file.

So we can scan the archives for subjects, and this can be done easily.
But what exactly are we going to filter on in the subject? Will the two
patterns above be enough to start with?
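
To make the question concrete, here is a minimal sketch of a subject
check using only the two quoted patterns. The name SPAM_SUBJECTS and the
case-insensitive matching are my assumptions, not something decided:

```python
import re

# The two patterns quoted above, copied verbatim. Matching them
# case-insensitively is an assumption.
SPAM_SUBJECTS = [
    r"File blocked - ScanMail for Lotus Notes",
    r"^u?n?subscribe\s+.?$",
]

COMPILED = [re.compile(p, re.IGNORECASE) for p in SPAM_SUBJECTS]


def is_spam_subject(subject, patterns=COMPILED):
    """Return True if any configured pattern is found in the subject."""
    return any(p.search(subject) for p in patterns)
```

One thing I noticed: as written, the second pattern needs at least one
whitespace character after "subscribe", so a subject that has already
been stripped to plain "unsubscribe" will not match. If that is not
intended, \s* instead of \s+ might be what we want.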

> The next thing is that I tried to put a limit on non-ASCII UTF-8
> characters which helped a lot against some Chinese SPAM.  However,
> this has to be handled with some caution on mailing lists with
> languages with a lot of such letters (Russian, Chinese, Japanese
> etc.)  I handled this via

Ok, nice idea, but I have one concern. If we have to handle exceptional
cases, the process is no longer fully automated. This is a great metric
indeed, but don't you think that having to specify the exceptions
manually limits us somewhat?

> because *currently* the only relevant list which had a lot of non-ASCII
> UTF-8 characters was debian-russian.  However, to make it general you
> need another configurable list which contains all lists which should
> allow a lot of non-ASCII characters in the subject.

>   if ( $author =~ /^[-&#x\d;\sA-F\?:,]+$/ || $countstrangechars > 7 || $numspamauthors > 0 ) {

Hmm, I see. So we are setting a limit of a maximum of seven 'strange'
characters.
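
If I understand it correctly, the check could look roughly like this in
Python. How $countstrangechars is actually computed is not shown in your
mail, so counting every character above code point 127 is my guess, and
NON_ASCII_LISTS is a hypothetical name for the configurable list of
exceptions:

```python
# Threshold taken from the quoted condition ($countstrangechars > 7).
MAX_STRANGE_CHARS = 7

# Hypothetical configurable exceptions: lists where many non-ASCII
# characters in the subject are normal.
NON_ASCII_LISTS = {"debian-russian"}


def count_strange_chars(subject):
    """Count characters outside the ASCII range (assumed definition)."""
    return sum(1 for ch in subject if ord(ch) > 127)


def too_many_strange_chars(subject, list_name):
    """Apply the limit unless the list is configured as an exception."""
    if list_name in NON_ASCII_LISTS:
        return False
    return count_strange_chars(subject) > MAX_STRANGE_CHARS
```

With this, a Russian subject on debian-russian passes, while the same
subject on another list is flagged once it has more than seven such
characters.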

Can you please point me to some mailing lists with spam messages that
you came across so I can get a better sense of this? After seeing them
I may be able to add some checks of my own, and then we can proceed.

Reminder: We still have to get a reply from lists.debian.org. As this
phase should hopefully be complete by this weekend, we have to start
work on that next week.

--
Sukhbir.


