[Teammetrics-discuss] Results from Web Archive Parser

Andreas Tille andreas at an3as.eu
Wed Dec 21 15:30:39 UTC 2011


On Wed, Dec 21, 2011 at 07:54:22PM +0530, Sukhbir Singh wrote:
> The web archive parser has finally finished parsing all the 55 lists.
> Yayay! It started on 17th December at 19:40 and completed on 21st
> December at 09:49. A lot of time but look at the # of messages:
> 
>   count
> ---------
>  2751087
> (1 row)
> 
> Cool!

Yep.  And we need to make sure that the parsing is continued from what
is in the database once the parser is called next time. :-)
 
> Anyways, the good news is that it seems like it is working perfectly
> now. I have checked this with the graphs even and compared randomly. I
> will keep on checking more as required.

I will try to have a look soon.

> The only issue is that of the invalid dates. Like we discussed, we
> implemented a log output whenever an invalid date was encountered
> because it *could* be spam. Seems like it won't work out.
> 
> Consider this:
> 
>     $ grep -c skipping teammetrics/liststat.log
>     1070
> 
> Very few of these messages are _actually_ spam. The others are those
> from which the Date parsing just won't work! We have dates like:
> 
>     Date: Tue Aug 29 11:19:01 2006
>     Date: Date: Fri, 15 Jun 101 10:33:41 -0400 (EDT)
> 
> No regex can parse all such dates. And because the dates can't be
> parsed, we can't extract the year or the month. And when that doesn't
> match, we end up logging a 'Date' mismatch error. I hope I explained
> it clearly.

I think so.  However, I do not remember that we decided to treat a
broken date string as a sign of a spam message.  As far as I remember,
we agreed on the following logic:  we are unable to parse the date
correctly, but we know in which month the mail arrived on the mailing
list.  So let's fix the date at something like

  - 1.<month>.<year> / 15.<month>.<year>
  - <random%30>.<month>.<year>
  - date >= date of previous mail && date <= date of next mail

Any of these assumptions will work reasonably well for our purpose.  We
only consider the month in which a mail reached the mailing list, and we
know that month with reasonable certainty.
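
To make the first option concrete, here is a minimal sketch in Python.
It is not the liststat code: the function name resolve_date and its
parameters are hypothetical, and it assumes that a parsed date whose
year or month disagrees with the archive month should be replaced as
well (that is my reading of the mismatch case above, not something we
have agreed on).  It only uses email.utils.parsedate_tz from the
standard library.

```python
from datetime import datetime
from email.utils import parsedate_tz


def resolve_date(raw_header, archive_year, archive_month, fallback_day=15):
    """Return a naive datetime for a message filed under a known month.

    raw_header    -- value of the Date header (may be None or garbage)
    archive_year  -- year of the archive page the message came from
    archive_month -- month (1-12) of that archive page
    fallback_day  -- day used when the header cannot be trusted
    """
    parsed = parsedate_tz(raw_header) if raw_header else None
    if parsed is not None:
        try:
            # Only the first six fields (Y, M, D, h, m, s) are used; the
            # timezone offset is ignored because we only need the month.
            candidate = datetime(*parsed[:6])
        except (TypeError, ValueError):
            candidate = None
        # Assumption: trust the header only if it agrees with the
        # archive month we already know.
        if (candidate is not None
                and candidate.year == archive_year
                and candidate.month == archive_month):
            return candidate
    # Header missing, unparseable or implausible: pin the date inside
    # the known archive month instead of skipping the message.
    return datetime(archive_year, archive_month, fallback_day)
```

With this, a header like "Fri, 15 Jun 101 10:33:41 -0400 (EDT)" found
in the June 2001 archive would simply be counted as 15 June 2001, and
nothing needs to be logged as skipped.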

> So what I feel is that we should remove this and perhaps throw the
> messages to the spam thing we wrote instead, for filtering spam.
> Because parsing the Date field just won't work, there seems to be no
> way to filter such valid messages. This is possible:
> 
> if(date_can_be_parsed)
>    compare
> else
>    don't compare
> 
> But then it defeats our purpose. So IMHO, I see no way.

I'm afraid I do not understand this paragraph.
 
> Other than that, I will now add support so that the same message is
> not fetched again and make improvements so we can finalize this at the
> earliest.

Sounds good.

Kind regards

        Andreas.

-- 
http://fam-tille.de


