[Teammetrics-discuss] Next phase: Handling spam

Andreas Tille andreas at an3as.eu
Thu Jun 9 20:30:50 UTC 2011


On Fri, Jun 10, 2011 at 01:36:12AM +0530, Sukhbir Singh wrote:
> The query:
> 
>     INSERT INTO listarchive (project, yearmonth, author, subject, url, ts)
>     VALUES (?, ?, ?, ?, ?, '$today')
> 
> is of main importance to us. So let's work on this.
> 
> * project - the name of the mailing list.
> * yearmonth - Ok.
> * author -
> 
> We are going to insert names here, right? So by parsing 'From' of a
> mbox archive, we will get this (an example):

To give you an idea, here are a few arbitrary rows from the database on
blends.debian.net:

# SELECT * from listarchive limit 5 ;
   project    | yearmonth  |         author          |       subject        |                                url                                |     ts     
--------------+------------+-------------------------+----------------------+-------------------------------------------------------------------+------------
 user-spanish | 2000-12-01 | Pablo Dorronsoro        | x free 4             | http://lists.debian.org/debian-user-spanish/2000/12/msg00388.html | 2011-06-01
 user-spanish | 2000-12-01 | Carles Pina i Estany    | pgp4pine             | http://lists.debian.org/debian-user-spanish/2000/12/msg00365.html | 2011-06-01
 user-spanish | 2000-12-01 | Miguel Angel Vilela     | Cambio de e-mail     | http://lists.debian.org/debian-user-spanish/2000/12/msg00367.html | 2011-06-01
 user-spanish | 2000-12-01 | Ramiro Alba             | Una de modems PCMCIA | http://lists.debian.org/debian-user-spanish/2000/12/msg00370.html | 2011-06-01
 user-spanish | 2000-12-01 | Alfonso Cepeda Caballos | pseudo-image-kit     | http://lists.debian.org/debian-user-spanish/2000/12/msg00373.html | 2011-06-01

>     tille at debian dot org (Andreas Tille)

So far I have kept only the names, without the e-mail address.  Thinking
about it, I am no longer sure that throwing away the e-mail address is a
good idea.  We could store it in addition.  That is not normalised at
all, but I do not think database normalisation is a real issue here.
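
A rough sketch of keeping both pieces, in Python purely for
illustration.  The helper name parse_from is made up, and this assumes
the raw mbox header carries a real address rather than the obfuscated
"at"/"dot" form the web archive shows:

```python
from email.utils import parseaddr


def parse_from(header_value):
    """Split a From header value into (name, email).

    parseaddr accepts both the old 'addr (Name)' style and the
    'Name <addr>' style, so no manual splitting is needed.
    """
    name, address = parseaddr(header_value)
    # Lower-casing the address is an assumption made here so that the
    # same mailbox always compares equal; drop it if case matters.
    return name.strip(), address.strip().lower()


if __name__ == "__main__":
    print(parse_from("tille@debian.org (Andreas Tille)"))
    print(parse_from("Andreas Tille <tille@debian.org>"))
```

Both values could then go into separate columns; the name of an
additional e-mail column would still have to be decided.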
 
> For the guest account problem you mentioned:
> 
>     tille-guest at debian dot org (Andreas Tille)
> 
> So I was thinking we do a split on '-' and then push the names?

No.  Splitting does not work.  There are a lot of cases where it
will fail completely:

  'charles-debian-nospam', 'plessy', 'charles-guest'

all have to resolve to

  'Charles Plessy'

So the only chance we have is another lookup list.  This should
probably live in the database itself rather than in a config file.
With that strategy the names can be changed using an SQL UPDATE
query.
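
A minimal sketch of such a lookup table, again in Python and using an
in-memory SQLite database as a stand-in for the real one.  The table
name author_alias, its columns, the cut-down listarchive table and the
sample rows are all assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Cut-down version of listarchive, only the columns needed here.
    CREATE TABLE listarchive (project TEXT, author TEXT);
    -- Hypothetical lookup table: one row per known variant of a name.
    CREATE TABLE author_alias (alias TEXT PRIMARY KEY, name TEXT NOT NULL);
""")
conn.executemany("INSERT INTO listarchive VALUES (?, ?)", [
    ("example-list", "charles-debian-nospam"),
    ("example-list", "plessy"),
    ("example-list", "charles-guest"),
    ("example-list", "someone-else"),
])
conn.executemany("INSERT INTO author_alias VALUES (?, ?)", [
    ("charles-debian-nospam", "Charles Plessy"),
    ("plessy", "Charles Plessy"),
    ("charles-guest", "Charles Plessy"),
])

# Rewrite every author that has an entry in the lookup table;
# authors without an entry are left untouched.
conn.execute("""
    UPDATE listarchive
       SET author = (SELECT name FROM author_alias
                      WHERE alias = listarchive.author)
     WHERE author IN (SELECT alias FROM author_alias)
""")

authors = [row[0] for row in conn.execute(
    "SELECT author FROM listarchive ORDER BY rowid")]
print(authors)
```

Correcting a name later is then a matter of changing one row in the
lookup table and running the UPDATE again, instead of editing a script.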

> the above two addresses are there in the mbox, they are treated the same
> for the user Andreas. Is this approach the one you talked about?
> 
> * subject - Do we need to save this in the DB? If yes, why?

Because it's there. :-)
Well, I have not actively used it.  However, it was somewhat useful for
detecting some spam patterns.  I do not feel strongly about it for the
moment, but keeping it does no harm.

> * URL - Ok.
> * TS - Ok.
> 
> So the author issue needs to be sorted out. And I remember you
> mentioning something about multiple IDs so that is why I brought this
> up as this is important.

Yes it is.  Look at the

   $query = $query . "UPDATE listarchive SET author = 

statements filling up the get-archive-pages script.
 
Kind regards

     Andreas. 

-- 
http://fam-tille.de


