[Teammetrics-discuss] Detecting possible SPAM patterns.

Andreas Tille andreas at an3as.eu
Wed Jun 22 06:55:47 UTC 2011


On Sun, Jun 19, 2011 at 12:32:08AM +0530, Sukhbir Singh wrote:
> 1. Names that start with '=':
> There are many names that start with '='. In fact, this is one of the most
> common patterns I have seen. Like:
> 
>     =?windows-1251?B?bWVwcm14eWU=?=
>     =?UTF-8?B?U3RlZmZlbiBNw7ZsbGVy?=
> 
> So if the name starts with an equals sign, we discard that name.

This is something like my "non-ASCII characters" criterion.
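
For illustration: names of this form are RFC 2047 encoded words, i.e.
a non-ASCII name wrapped in ASCII for transport.  A minimal sketch of
how such a name could be decoded before any rule is applied, using
Python's standard email.header module (the helper name decode_name is
hypothetical, not part of any existing code):

```python
from email.header import decode_header


def decode_name(raw):
    """Decode RFC 2047 encoded words in a header value to plain text."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Unlabelled byte chunks are treated as ASCII (an assumption).
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # Headers without encoded words come back as str unchanged.
            parts.append(chunk)
    return "".join(parts)


print(decode_name("=?windows-1251?B?bWVwcm14eWU=?="))       # meprmxye
print(decode_name("=?UTF-8?B?U3RlZmZlbiBNw7ZsbGVy?="))      # Steffen Möller
```

The two quoted examples decode to a random-looking string and to an
ordinary name with an umlaut, so a leading '=' by itself mostly signals
"non-ASCII" rather than SPAM.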
 
> 2. Names in upper case:
> Names in capital letters are a clear indication of spam.

Possibly, but not always.  French and Japanese people in particular
often follow the convention of writing their family name in capital
letters, so that others can tell the given name from the family name.
So this rule should be taken with a grain of salt, but you are right:
a name written entirely in capital letters is very probably SPAM.
 
> 3. Names with the words 'lottery', 'promotion' and 'loan' in them.

Yep.
 
> 4. Names that start with either of 'Mr', 'Mrs' or 'Dr'.
> 
> 5. Names that have '.com' and '.net' in them. I am aware that there
> can be other TLDs that constitute spam, but for our purpose, this
> should be enough.

You got the idea.
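
The five name rules above could be sketched roughly as follows.  This
is only an illustration: the criterion ids, the constant names and the
function name_criterion are all hypothetical, and matching keywords on
word boundaries is an assumption made here so that a name like "Sloan"
does not trip the 'loan' rule.

```python
import re

# Hypothetical criterion ids, one per rule in the list above.
ENCODED, UPPERCASE, KEYWORD, TITLE, DOMAIN = range(1, 6)

KEYWORD_RE = re.compile(r"\b(lottery|promotion|loan)\b", re.IGNORECASE)
TITLE_RE = re.compile(r"^(Mr|Mrs|Dr)\b", re.IGNORECASE)


def name_criterion(name):
    """Return the id of the first rule the name trips, or None."""
    name = name.strip()
    # Rule 1: name starts with '='.
    if name.startswith("="):
        return ENCODED
    # Rule 2: every letter is upper case.  A name such as
    # "Andreas TILLE" (capitalised family name only) is not flagged.
    letters = [c for c in name if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        return UPPERCASE
    # Rule 3: one of the keywords appears as a whole word.
    if KEYWORD_RE.search(name):
        return KEYWORD
    # Rule 4: name starts with a title.
    if TITLE_RE.match(name):
        return TITLE
    # Rule 5: name contains '.com' or '.net'.
    lowered = name.lower()
    if ".com" in lowered or ".net" in lowered:
        return DOMAIN
    return None
```

Returning the id of the rule rather than a plain yes/no makes it
possible to see later which rule produced a given hit.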

> -
> Coming to the Subject field:
> 
> 1. Subjects that start with '='.
> 
> 2. Subjects in upper case.
> 
> I need your thoughts on all points.

This is basically what I tried to implement.  Please keep a logfile of
all mails you classified as SPAM.  Or perhaps it would be even better
to have a separate table for the spam, with an extra id for the
criterion that triggered the SPAM detection.  That way we can more
easily look for false positives (I had some in the past and needed to
adjust the algorithm).  From this table we can later create the file
spamurls.txt (or you can write this file in parallel, whichever works
better for you).
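
A minimal sketch of such a table, assuming SQLite via Python's sqlite3
module purely as a stand-in for whatever database is actually used.
The table name spam_log, its columns and all function names are
hypothetical:

```python
import sqlite3


def open_spam_log(path=":memory:"):
    """Open (or create) the hypothetical spam_log table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS spam_log (
               message_id TEXT,
               name       TEXT,
               subject    TEXT,
               criterion  INTEGER NOT NULL
           )"""
    )
    return conn


def log_spam(conn, message_id, name, subject, criterion):
    """Record one mail regarded as SPAM and the rule that flagged it."""
    conn.execute(
        "INSERT INTO spam_log VALUES (?, ?, ?, ?)",
        (message_id, name, subject, criterion),
    )
    conn.commit()


def hits_per_criterion(conn):
    """Count flagged mails per criterion id, to spot overeager rules."""
    rows = conn.execute(
        "SELECT criterion, COUNT(*) FROM spam_log GROUP BY criterion"
    )
    return dict(rows)
```

A criterion with an unexpectedly high count is then a natural place to
start looking for false positives.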

Kind regards

     Andreas.
 

-- 
http://fam-tille.de


