[Teammetrics-discuss] Updates.

Sukhbir Singh sukhbir.in at gmail.com
Thu Jun 30 19:08:14 UTC 2011


Hi,

repository.update()

As you must have noticed, it is July 1st (here, as of now), so we can
now 'officially' parse the June archive of the teammetrics mailing
list ;)

I have added the signature metric, so here are the results:

     name      | frequency | rawlen | quotelen | blanklen | siglen
---------------+-----------+--------+----------+----------+--------
 Sukhbir Singh |        77 |  58673 |      998 |     1248 |   1248
 Andreas Tille |        46 |  66462 |      946 |     1590 |    854
 Scott Howard  |         4 |   4318 |       48 |       91 |     91

Note that 'siglen == blanklen' for Scott: he doesn't have a
signature, just `~Scott`, while Andreas and I do have one. That
explains the difference in the `siglen` column and perhaps why the
metric is important. I feel these metrics together are pretty
conclusive for a mailing list. The rest you can see for yourself.
Here is a summary once again:

    rawlen   -- total number of characters in the message body
    blanklen -- total number of lines in the body, excluding blank lines
    quotelen -- total number of lines, excluding blank lines AND lines
                starting with '>'
    siglen   -- total number of lines, excluding blank lines AND lines
                starting with '>' AND counting only up to the '-- '
                signature delimiter

So 'siglen' is the _complete_ metric.
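
To make the definitions concrete, here is a minimal sketch of how the
four counts could be computed for a single message body. It is only my
reading of the summary above, not the code in the repository: the
function name `body_metrics` is made up for illustration, and I am
assuming a quoted line is one that starts with '>' in the first column
and that the signature delimiter is a line that is exactly '-- '.

```python
def body_metrics(body):
    """Return rawlen, blanklen, quotelen and siglen for one message body.

    Illustrative sketch only: each later count filters the lines kept by
    the previous one, following the summary in this mail.
    """
    lines = body.splitlines()

    # blanklen: lines that are not empty or whitespace-only.
    non_blank = [ln for ln in lines if ln.strip()]

    # quotelen: additionally drop lines starting with '>'.
    unquoted = [ln for ln in non_blank if not ln.startswith(">")]

    # siglen: additionally stop at the first '-- ' delimiter line.
    # Assumption: the delimiter keeps its trailing space.
    before_sig = []
    for ln in unquoted:
        if ln == "-- ":
            break
        before_sig.append(ln)

    return {
        "rawlen": len(body),          # characters, not lines
        "blanklen": len(non_blank),
        "quotelen": len(unquoted),
        "siglen": len(before_sig),
    }
```

With a body that has no quoted text and no '-- ' line, such as a
message ending in just `~Scott`, this sketch gives the same value for
blanklen, quotelen and siglen.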

For lists.debian.org, I investigated using the NNTP interface. That
works perfectly: we get exactly what we want, it's fast, and it
doesn't strain the Gmane server (40,000 Subject/From fields in ~10
seconds). The only drawback is the obfuscation of mail addresses, and
I saw that in only one of the six lists I checked. I didn't note
which list it was (sorry).

So what I suggest now is that we go with NNTP access only. I think
obfuscation is rare enough that we should go ahead with this. For
starters, you can point me to the mailing lists that you want to
parse first, so I can check them for obfuscation. Then at DebConf, we
can take up how to parse these lists or whether to request mbox
archives.
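
The obfuscation check itself could look roughly like the sketch below,
run over the From fields of a list once they have been fetched. This
is a guess at a workable heuristic, not something I have in the
repository: the names `is_obfuscated` and `obfuscated_fraction` are
hypothetical, and the domain in `OBFUSCATION_DOMAINS` is only an
assumed example of a rewritten address that would need adjusting to
what a given list actually shows.

```python
from email.utils import parseaddr

# Assumption: domains that indicate a rewritten (obfuscated) address.
# This default is only an example; adjust it per archive.
OBFUSCATION_DOMAINS = {"public.gmane.org"}


def is_obfuscated(from_header, domains=OBFUSCATION_DOMAINS):
    """Guess whether a From field carries an obfuscated address."""
    _, addr = parseaddr(from_header)
    # No usable address at all counts as obfuscated.
    if "@" not in addr:
        return True
    domain = addr.rsplit("@", 1)[1].lower()
    return domain in domains


def obfuscated_fraction(from_headers):
    """Return the share of From fields that look obfuscated (0.0 to 1.0)."""
    headers = list(from_headers)
    if not headers:
        return 0.0
    return sum(is_obfuscated(h) for h in headers) / len(headers)
```

A list where the fraction is well above zero would be a candidate for
the mbox route instead of NNTP.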

I will be investigating the CGI thing tomorrow.

-- 
Sukhbir.


