[Teammetrics-discuss] Comparison between the old code and the new code.

Andreas Tille andreas at an3as.eu
Thu Sep 8 17:00:54 UTC 2011


Hi,

On Sun, Aug 28, 2011 at 03:06:54PM +0530, Sukhbir Singh wrote:
> As we discussed, I compared the results of the new code with the old
> code and I am happy to report that our new code is working
> wonderfully.

I used my offline time while traveling to compare the graphs as well, but
I found severe differences in the results.  Since I assume that the parsing
is correct in principle, my theory is that the gmane information is
incomplete, which would explain the differences.  Here are some observations:

debian-accessibility:
	too few in 2003
debian-amd64:
	small difference between Javier K and Frederic S (detected
	more for Frederic S which is good)
debian-arm:
	missing 1998-2002 -> we need mboxes!!
debian-blends:
	Vagrant C: old=162 / new=92  MISSING mails!
	Jonas S:   old=124 / new=119 MISSING mails!
	... same for others
debian-boot:
	MISSING mails!
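
A minimal sketch of how such per-author differences could be listed
systematically instead of by eye.  The function name `compare_counts` and
the dictionary layout are assumptions made for illustration; they are not
part of the old or new liststats code.  The numbers are the debian-blends
figures quoted above.

```python
def compare_counts(old, new):
    """Return (author, old_count, new_count, difference) for every author
    whose mail count differs between the two result sets.

    `old` and `new` are assumed to be plain dicts mapping an author name
    to a mail count; an author missing from one side counts as 0 there.
    """
    rows = []
    for author in sorted(set(old) | set(new)):
        old_count = old.get(author, 0)
        new_count = new.get(author, 0)
        if old_count != new_count:
            rows.append((author, old_count, new_count, new_count - old_count))
    return rows


# debian-blends figures from the observations above
old_blends = {"Vagrant C": 162, "Jonas S": 124}
new_blends = {"Vagrant C": 92, "Jonas S": 119}

for author, old_count, new_count, diff in compare_counts(old_blends, new_blends):
    print(f"{author}: old={old_count} / new={new_count} ({diff:+d})")
```

A negative difference means the new code found fewer mails than the old
code, which is the case that hints at missing mails.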

 
> I didn't compare all lists, but here the ones I did (only lists.d.o):

After finding those differences, I tried to verify your observations.

> +  debian-multimedia
> http://blends.debian.net/liststats/authorstat_multimedia.png
> 
> I manually checked the mailing list archives and found that the new
> rating is the correct one. The old rating does not make any mention of
> 'Adrian Knoth', his name seems to have been completely skipped in the
> graphs.

I checked the old stats and realised that "Adrian K" is actually there,
but at position 12, so not in the graph.
 
> Otherwise, it looks good and comparison is exact.

Not really: Guenter G has 68 mails in the old stats but only 50 in the new ones.
 
> +  debian-python
> http://blends.debian.net/liststats/authorstat_python.png
> 
> The name 'Scott Kitterman' seems to be missing from the old ratings
> and again I find that his name is there in the archives. So our new
> rating is good.

I checked the old stats and realised that "Scott K" is actually there,
but at position 11, so not in the graph.
 
> +  debian-laptop
> http://blends.debian.net/liststats/authorstat_laptop.png
> 
> The name 'Bob Proulx' is missing from the old ratings but it is there
> in the archives and our new ratings.

I checked the old stats and realised that "Bob P" is actually there,
but at position 12, so not in the graph.

> Also in 2003, the names 'Matej
> Cepl', 'Micha Feigin', 'Mattia Dongili', have significant counts,
> while they have no mention in the graphs.

These also rank below the top 10, so they are not in the graph.

> +  debian-legal
> http://blends.debian.net/liststats/authorstat_legal.png
> 
> The name 'Steve Langasek' is missing from the old ratings, while his
> name is there in the archives and the new ratings.

I checked the old stats and realised that "Steve L" is actually there,
but at position 11, so not in the graph.
 
> Otherwise, the rest looks good.
> 
> +  debian-blends
> http://blends.debian.net/liststats/authorstat_blends.png
> 
> This looks good :)

See above: the numbers differ, and thus some rankings differ as well.
 
> +  debian-boot
> http://blends.debian.net/liststats/authorstat_boot.png
> 
> 'Frank Carmickle' who has 857 posts in 2004 is missing in the old code.

That's an interesting case because I cannot find Frank C among the
first 30 posters (and I have no access to the raw database, just the
text files featuring the first 30 posters).
 
> Summary:
> 
> Overall, the new ratings have included many authors that were missing
> in the old ratings, so this is good news for us.

I found only one author who might be missing from the old stats; the others
listed in your mail are present but hidden from the top 10 because of
different ranking numbers.
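
The distinction between "missing from the stats" and "present but below
the graph cut-off" can be checked mechanically.  The following is only a
sketch: the helper names `rank_of` and `is_in_graph`, the dictionary
layout, and the sample counts are hypothetical; only the top-10 cut-off
is taken from the discussion above.

```python
def rank_of(counts, author):
    """Return the 1-based position of `author` when all authors are sorted
    by mail count (highest first), or None if the author is absent.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    for position, (name, _count) in enumerate(ranked, start=1):
        if name == author:
            return position
    return None


def is_in_graph(counts, author, top_n=10):
    """True if the author ranks within the first `top_n` positions."""
    position = rank_of(counts, author)
    return position is not None and position <= top_n


# Hypothetical counts: eleven posters ahead of "Adrian K".
sample = {f"Poster {i:02d}": 100 - i for i in range(1, 12)}
sample["Adrian K"] = 40

position = rank_of(sample, "Adrian K")
if position is None:
    print("really missing from the stats")
elif not is_in_graph(sample, "Adrian K"):
    print(f"present at position {position}, but below the top 10")
else:
    print(f"in the graph at position {position}")
```

An author for whom `rank_of` returns None is a real gap in the data; a
position above 10 only explains why the name does not show up in the plot.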

> Also, NNTPStat worked
> very well this time, fetching 156899 records without breaking down
> even once :) The mbox archives have been saved locally with no errors
> and can be parsed using the localmboxparser when required. IMHO, I see no
> problem with NNTPStat now and I think it works the way we wanted it
> to.

I somehow have the feeling that NNTPStat is missing some mails, which
might be due to a bug in the gmane mail fetching algorithm or somewhere else.

The situation is much better where we have real mailboxes from alioth.  While
my offline data from the old liststats code lacks the information from August,
I can observe that the new code has either the same number of mails or more
(and the surplus roughly fits what I would expect for one month).  So I think
the mbox parsing code is perfectly fine, and my hope is that we will finally
increase the quality of our observations once we get direct access to
the mboxes.
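
For cross-checking a single list by hand once an mbox is available, a
per-author count can be obtained with Python's standard `mailbox` module.
This is a generic sketch and not the project's localmboxparser; the
function name `count_authors` and the choice to key on the display name
(falling back to the address) are assumptions.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr


def count_authors(mbox_path):
    """Count messages per author in an mbox file.

    The author key is the display name from the From: header, or the
    bare address when no display name is given.
    """
    counts = Counter()
    # create=False: fail instead of silently creating an empty mailbox
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            name, address = parseaddr(str(message.get("From", "")))
            counts[name or address or "(unknown)"] += 1
    finally:
        box.close()
    return counts
```

Calling `count_authors("some-list.mbox").most_common(30)` (the file name
is a placeholder) would give a list comparable to the text files with the
first 30 posters, provided names are normalised the same way on both sides.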

Kind regards

        Andreas.

-- 
http://fam-tille.de


