[Teammetrics-discuss] Commitstat using key authentication?

Sukhbir Singh sukhbir.in at gmail.com
Wed Sep 7 14:59:15 UTC 2011


Hi Andreas,

Changes in liststat.py:

    - In the earlier code, if an encoding was not resolved, we used to
      set it to 'ascii'. That was an error-prone approach, as tests on
      non-English lists showed. Instead, we now use the `chardet` module
      to detect the encoding and then attempt to call unicode() with it.
      This has resulted in proper handling of encoding-related messages.

Most messages with encoding errors were spam. However, after parsing
debian-user-german from lists.d.o, I noticed a large number of
messages that had encoding errors but should not have, and that were
not spam. That was not good news.

So I set out to fix this. I noticed that I was doing this:

    subject = u" ".join([unicode(text, charset or 'ascii')

I was defaulting to 'ascii', but that was not cool! Sure, it works
for the English lists in almost all cases, but not for i18n-ized lists.
So I used the 'chardet' module, which helps with encoding detection,
and now we do this:

    subject = u" ".join([unicode(text, charset or
                                 chardet.detect(text)['encoding'])

And this has fixed almost 99% of the encoding errors and I am happy :)
The remaining messages are either spam or cases where the encoding
can't be detected (chardet reports None). We will confirm this when we
do a proper test run.
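
For reference, here is a minimal sketch of the fallback chain described
above. It is not the actual liststat.py code: it is written for Python 3
(so bytes.decode() stands in for unicode()), the names guess_encoding,
decode_fragment and decode_subject are made up for illustration, and the
final 'replace' step for the undetected (None) case is an assumption
about one possible way to keep such messages.

```python
from email.header import decode_header

try:
    import chardet  # third-party; treated as optional in this sketch
except ImportError:
    chardet = None


def guess_encoding(raw):
    """Return chardet's guess for the raw bytes, or None if unavailable."""
    if chardet is None:
        return None
    return chardet.detect(raw)['encoding']


def decode_fragment(raw, charset=None, guess=guess_encoding):
    """Decode one header fragment (hypothetical helper).

    Order of preference: declared charset, detected encoding, 'ascii'.
    """
    encoding = charset or guess(raw) or 'ascii'
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown charset: keep the text, mark the bad bytes.
        return raw.decode('ascii', errors='replace')


def decode_subject(header, guess=guess_encoding):
    """Decode a raw Subject header into a single text string."""
    parts = []
    for text, charset in decode_header(header):
        if isinstance(text, str):
            # Headers without encoded words come back as str already.
            parts.append(text)
        else:
            parts.append(decode_fragment(text, charset, guess))
    return " ".join(parts)
```

Passing the detector in as the guess argument is only there so the
fallback can be exercised without chardet installed.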

    - For messages that had invalid dates, we now use the date of the
      previous message. This helps us to avoid skipping the message
      entirely, as was being done earlier.

Fixed as discussed.
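
A rough sketch of the idea, again not the real liststat.py code: it
assumes Python 3 and the standard library's parsedate_to_datetime, and
resolve_dates is a made-up name. What happens when the very first
message has a bad date is not covered in the changelog; the sketch just
leaves it as None.

```python
from email.utils import parsedate_to_datetime


def resolve_dates(raw_dates):
    """Parse Date headers in order, reusing the previous valid date on failure."""
    resolved = []
    previous = None
    for raw in raw_dates:
        try:
            current = parsedate_to_datetime(raw)
        except (TypeError, ValueError):
            # Invalid date: fall back to the previous message's date.
            # This stays None if no earlier message had a valid date.
            current = previous
        resolved.append(current)
        if current is not None:
            previous = current
    return resolved
```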

    - Messages with invalid payloads are not skipped; rather, they go
      through the spam filter.


