[Teammetrics-discuss] How does filling up the database work?

Andreas Tille andreas at an3as.eu
Tue Aug 9 13:19:26 UTC 2011


On Tue, Aug 09, 2011 at 05:18:32PM +0530, Sukhbir Singh wrote:
> For a given mailing list on Alioth:
> 
> 1. Download the mbox,
> 2. Parse it,
> 3. After parsing, create an entry in
> '/var/cache/teammetrics/lists.hash' so that the mbox is NOT downloaded
> and therefore NOT parsed again.
> 4. Populate the database.

Hmmm, this algorithm does not really need an MD5 sum - just the name of
the parsed mbox would be sufficient, right?  You simply skip the
download of any mbox that is listed in lists.hash.  I assumed that you
would download all mboxes in *any* case and only parse one when its
MD5 sum does not match.  So it is something like

  1. Download mbox if not mentioned in lists.hash
  2. Parse it
  3. Add name of mbox to lists.hash
  4. Populate database

BTW, steps 3. and 4. should be exchanged.  If 4. fails for some
reason, you should not set the "not for download" flag in lists.hash.
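
A minimal sketch of the exchanged order, in Python.  Everything in it
is hypothetical: the function names, the one-name-per-line layout of
lists.hash, and the download/parse/populate callables are assumptions
for illustration only, not the actual teammetrics code.

```python
import os


def load_done(hash_path):
    """Return the set of mbox names already recorded in lists.hash."""
    if not os.path.exists(hash_path):
        return set()
    with open(hash_path) as f:
        return {line.strip() for line in f if line.strip()}


def process_mbox(name, hash_path, download, parse, populate):
    """Process one mbox; return True if handled, False if skipped.

    download, parse and populate are caller-supplied callables
    (hypothetical stand-ins for the real steps).
    """
    if name in load_done(hash_path):
        return False              # already listed: no download, no parse
    raw = download(name)          # 1. download
    messages = parse(raw)         # 2. parse
    populate(messages)            # 3. populate the database first
    # 4. record the name only after the database step succeeded; if
    # populate() raised, this is never reached and the mbox is retried
    # on the next run.
    with open(hash_path, "a") as f:
        f.write(name + "\n")
    return True
```

With this order a failure while populating leaves lists.hash
untouched, so the next run picks the same mbox up again.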

We could go with this algorithm in principle, and then I also see no
reason to keep

  5. Delete database

The only potential flaw I see is a time zone issue: if we start the
script "very early" on 1st September, something in X-2011-August.txt
might still change because Alioth might "remain" in August.  So we
should remember to run the script on the 2nd of every month to be safe.
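
Instead of relying on remembering the date, the script could check it
itself.  A rough sketch, assuming a hypothetical helper
archive_is_final() and a one-day grace period (both are my invention,
nothing like this exists in the code yet):

```python
import datetime


def archive_is_final(year, month, today=None, grace_days=1):
    """True once the monthly archive for (year, month) can be treated
    as complete, i.e. from the 2nd of the following month on when
    grace_days is 1."""
    if today is None:
        today = datetime.date.today()
    # First day of the month following the archive's month.
    if month == 12:
        next_month_start = datetime.date(year + 1, 1, 1)
    else:
        next_month_start = datetime.date(year, month + 1, 1)
    return today >= next_month_start + datetime.timedelta(days=grace_days)
```

An mbox for which this returns False would simply be left out of
lists.hash and fetched again on the next run.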
 
Question: If (for whatever reason) I wanted to re-read all mboxes (for
instance after learning about a massive SPAM removal), I would need to
delete the corresponding entries in lists.hash, right?  With your
algorithm this also requires removing the entries of this project from
the database.  To make sure that this is not forgotten, we should set
a primary key (project, message_id) to prevent adding a message twice.
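
To illustrate what I mean by the primary key, here is a small sketch
using Python's sqlite3 module as a stand-in for the real database.
The table layout, the column names and the two helper functions are
assumptions made up for this example; the project names used in a
test would be arbitrary as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        project    TEXT NOT NULL,
        message_id TEXT NOT NULL,
        author     TEXT,
        PRIMARY KEY (project, message_id)
    )
    """
)


def add_message(conn, project, message_id, author):
    """Insert one message; return False if it is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO messages VALUES (?, ?, ?)",
                (project, message_id, author),
            )
        return True
    except sqlite3.IntegrityError:
        # The (project, message_id) pair already exists.
        return False


def purge_project(conn, project):
    """Remove all rows of one project before re-reading its mboxes."""
    with conn:
        conn.execute("DELETE FROM messages WHERE project = ?", (project,))
```

With such a key, re-reading an mbox without cleaning up first can not
silently double the counts; the duplicate inserts are just rejected.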

In short:
  Downloaded mboxes can be deleted
  We need a safe way to reread everything from scratch

Kind regards

       Andreas.

-- 
http://fam-tille.de


