[Teammetrics-discuss] Phase I: Updates

Andreas Tille andreas at an3as.eu
Sat Jun 4 19:39:45 UTC 2011


On Sat, Jun 04, 2011 at 02:26:11AM +0530, Sukhbir Singh wrote:
> Aah yes, I thought about it after sending the email. What I did mean
> to say was that it is easy to implement it in the code. But now that I
> think about it, it's indeed going to be a challenge doing this
> algorithmically and keeping the error count (if any) minimal.

You will never achieve complete SPAM detection (even if you implement a
full-fledged SPAM detection algorithm, which would definitely be far too
much overkill for this project).  My idea of focusing on the top ten (or
top X) posters was exactly to exclude the influence of SPAM on the
evaluation, because spammers are not amongst the top posters (they use
many different addresses, so no single one accumulates a high count).
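
As a rough sketch of the top-X idea: count messages per sender address and
keep only the X most frequent ones.  The function name and the sample
addresses are made up for illustration and are not taken from the
teammetrics code.

```python
from collections import Counter


def top_posters(senders, n=10):
    """Return the n most frequent sender addresses with their message counts.

    `senders` is any iterable of sender address strings (hypothetical input;
    how the addresses are extracted from the archive is not shown here).
    """
    # Normalise case and whitespace so "A@x.org " and "a@x.org" count as one.
    counts = Counter(s.strip().lower() for s in senders if s and s.strip())
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["a@x.org"] * 3 + ["b@x.org"] * 2 + ["spam1@junk", "spam2@junk"]
    for address, count in top_posters(sample, n=2):
        print(address, count)
```

Addresses that appear only once, which is the typical case for SPAM sent
from throwaway addresses, simply never reach the top X.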

My attempt at some slight(!) SPAM detection was just to avoid obvious
cruft inside the database.  You might like to have a look at

   svn://svn.debian.org/svn/blends/blends/trunk/team_analysis_tools/get-archive-pages

to see what means I used.  I also did some logging of the messages which
were considered SPAM.  It turned out to be an additional means for
list admins to mark SPAM candidates for deletion from the mailing list
archive.  So if we continue to log those messages which show a
clear SPAM pattern (without too much effort), this might have some extra
benefit.
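
A minimal sketch of such "slight" detection with logging could look like
the following.  The patterns, the logger name and the message tuple layout
are assumptions chosen only for illustration; they are not what
get-archive-pages actually does.

```python
import logging
import re

# Hypothetical patterns for obvious cruft; a real list would need tuning.
SPAM_PATTERNS = [
    re.compile(r"\bviagra\b", re.IGNORECASE),
    re.compile(r"\bcheap\s+(?:watches|pills|loans)\b", re.IGNORECASE),
]

# Hypothetical logger name; list admins could review this log afterwards.
log = logging.getLogger("spam_candidates")


def is_obvious_spam(subject):
    """Return True if the subject matches one of the obvious SPAM patterns."""
    return any(pattern.search(subject) for pattern in SPAM_PATTERNS)


def filter_messages(messages):
    """Drop obvious SPAM and log each dropped message as a candidate.

    `messages` is an iterable of (message_id, sender, subject) tuples.
    """
    kept = []
    for msg_id, sender, subject in messages:
        if is_obvious_spam(subject):
            log.warning("SPAM candidate: %s from %s: %s", msg_id, sender, subject)
        else:
            kept.append((msg_id, sender, subject))
    return kept
```

Only messages matching a clear pattern are excluded from the database;
everything else passes through untouched, so false positives stay rare at
the price of missing less obvious SPAM.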
 
> Anyway, I have decided to put this on hold and focus on this once we
> are done with everything else, because once we do find and decide on a
> way of doing this, integrating this within the code is a maximum of five
> to ten LoC.

That's perfectly OK.

Kind regards

      Andreas.

-- 
http://fam-tille.de


