[Teammetrics-discuss] Results from Web Archive Parser

Sukhbir Singh sukhbir.in at gmail.com
Wed Dec 21 14:24:22 UTC 2011


Hi Andreas,

PS: Please feel free to reply later as Christmas is near.

The web archive parser has finally finished parsing all 55 lists.
Yay! It started on 17th December at 19:40 and completed on 21st
December at 09:49. That is a long time, but look at the number of
messages:

  count
---------
 2751087
(1 row)

Cool!
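
For a rough sense of throughput, the figures above can be plugged into
a few lines of Python. This is only a sketch: it assumes both
timestamps are in the same timezone, that the year is 2011, and that
the count covers only this run.

```python
from datetime import datetime

# Timestamps from the run; the year and a shared timezone are assumptions.
started = datetime(2011, 12, 17, 19, 40)
finished = datetime(2011, 12, 21, 9, 49)
messages = 2751087  # row count reported above

elapsed = finished - started
rate = messages / elapsed.total_seconds()

print("elapsed:", elapsed)
print("messages per second: %.2f" % rate)
```

Under those assumptions the run took a little over 86 hours, which
works out to roughly 9 messages per second.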

Anyway, the good news is that it seems to be working perfectly now. I
have even checked the numbers against the graphs and compared random
samples. I will keep checking more as required.

The only issue is that of the invalid dates. As we discussed, we
implemented a log output whenever an invalid date was encountered
because it *could* be spam. It seems that approach won't work out.

Consider this:

    $ grep -c skipping teammetrics/liststat.log
    1070

Very few of these messages are _actually_ spam. The rest are messages
whose Date header just can't be parsed! We have dates like:

    Date: Tue Aug 29 11:19:01 2006
    Date: Date: Fri, 15 Jun 101 10:33:41 -0400 (EDT)

No regex can parse all such dates. And because the dates can't be
parsed, we can't extract the year or the month. When that extraction
fails, the comparison fails too, and we end up logging a 'Date'
mismatch error. I hope I explained it clearly.
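
To make the failure concrete, here is a minimal sketch of a strict
pattern rejecting both sample dates. The pattern and the year_month()
helper are hypothetical, written only for illustration; they are not
the code in liststat. The first sample is a made-up well-formed date
for contrast, and the other two are the header values shown above.

```python
import re

# Hypothetical strict pattern for dates like "Fri, 15 Jun 2001 10:33:41".
# This is an assumption for illustration, not the pattern liststat uses.
STRICT_DATE = re.compile(
    r'^(?:[A-Za-z]{3},\s+)?'        # optional weekday followed by a comma
    r'(?P<day>\d{1,2})\s+'          # day of month
    r'(?P<month>[A-Za-z]{3})\s+'    # abbreviated month name
    r'(?P<year>\d{4})\s+'           # four-digit year
    r'\d{2}:\d{2}(?::\d{2})?'       # time, seconds optional
)


def year_month(date_value):
    """Return (year, month_name), or None when the value does not match."""
    match = STRICT_DATE.match(date_value.strip())
    if match is None:
        return None
    return int(match.group('year')), match.group('month')


samples = [
    'Fri, 15 Jun 2001 10:33:41 -0400 (EDT)',       # well formed (made up)
    'Tue Aug 29 11:19:01 2006',                    # no comma, year last
    'Date: Fri, 15 Jun 101 10:33:41 -0400 (EDT)',  # stray prefix, odd year
]

for value in samples:
    print(repr(value), '->', year_month(value))
```

Only the first value yields a year and a month. Loosening the pattern
to accept the second form is possible, but each new variant needs its
own special case, which is the problem described above.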

So I feel we should remove this check and perhaps pass these messages
to the spam filter we wrote instead. Because parsing the Date field
just won't work, there seems to be no way to tell such valid messages
apart from spam by their date. This is possible:

    if date_can_be_parsed:
        compare
    else:
        don't compare

But then it defeats our purpose, since the messages with unparsable
dates are exactly the ones we wanted to flag. So IMHO, I see no way.
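
For reference, the if/else above could look roughly like this in
Python. The function name, the return labels and the idea of passing
in an already parsed (year, month) tuple are all made up for this
sketch; it is not the code in liststat.

```python
def check_date(parsed, archive_year, archive_month):
    """Classify a message by its Date header.

    parsed is a (year, month) tuple, or None when the Date header
    could not be parsed. All names here are hypothetical.
    """
    if parsed is None:
        # Date could not be parsed: skip the comparison entirely.
        return 'unparsed'
    if parsed == (archive_year, archive_month):
        return 'ok'
    # Parsed fine but does not belong to this archive month.
    return 'mismatch'


print(check_date((2006, 8), 2006, 8))
print(check_date((2001, 6), 2006, 8))
print(check_date(None, 2006, 8))
```

The 'unparsed' branch is where the check stops being useful: every
message with a broken Date header skips the comparison, so something
else (such as the spam filter) would have to decide about it.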

Other than that, I will now add support so that the same message is
not fetched twice, and make improvements so we can finalize this as
soon as possible.
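
One common way to avoid fetching the same message twice is to remember
the Message-ID of everything already stored. The sketch below assumes
messages are dict-like with a 'Message-ID' key and that the IDs are
unique; fetch_new() and seen_ids are hypothetical names, and the real
implementation may well work differently.

```python
def fetch_new(messages, seen_ids):
    """Yield only messages whose Message-ID is not in seen_ids.

    messages: iterable of dict-like objects (an assumption).
    seen_ids: a set of Message-IDs, updated in place.
    """
    for message in messages:
        message_id = message.get('Message-ID')
        if message_id is None:
            # Cannot be deduplicated without an ID, so pass it through.
            yield message
            continue
        if message_id in seen_ids:
            continue
        seen_ids.add(message_id)
        yield message


seen = set()
batch = [
    {'Message-ID': '<a@example.org>'},
    {'Message-ID': '<b@example.org>'},
    {'Message-ID': '<a@example.org>'},  # duplicate of the first
]
first_run = list(fetch_new(batch, seen))
second_run = list(fetch_new(batch, seen))
print(len(first_run), len(second_run))
```

In practice the set of seen IDs would have to be persisted between
runs, for example by querying the database for existing IDs first.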

--
Sukhbir


