[Teammetrics-discuss] How does filling up the database work?

Andreas Tille andreas at an3as.eu
Tue Aug 9 14:29:17 UTC 2011


On Tue, Aug 09, 2011 at 07:26:45PM +0530, Sukhbir Singh wrote:
> This is the finest example of procrastination :D I had to change the
> code not to calculate the SHA-1 and just save the list name. This will
> be done.

OK.
 
> > BTW, steps 3. and 4. should be exchanged.  If 4. might fail for some
> > reason you should not set the "not for download" flag in lists.hash.
> 
> lists.hash saves the entire mbox, not individual message so the only
> time this can fail is when the entire mbox is corrupted. I don't think
> this can happen at all, so... Plus there is no way to handle this
> later and the complexity is not worth it.

Well, in your algorithm you are marking the mbox as done before it is
in the database (so even if not all steps have actually completed).  I do
not think that updating lists.hash at the end, after the database import,
would really be extra effort.  Without having looked at the code, it is
just a different design of the loops you are using.  You could even work
on a per-mbox basis:
   1. download mbox
   2. import data of this mbox
   3. store the record in lists.hash
   4. delete mbox
IMHO no real problem.
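
A rough sketch of the per-mbox loop I have in mind (not taken from the
actual code; `download` and `import_into_db` are hypothetical callables,
and a plain set of mbox names stands in for lists.hash):

```python
import os


def process_mboxes(mbox_names, done, download, import_into_db):
    """Handle one mbox at a time; record it only after a successful import.

    `done` is a set of mbox names standing in for lists.hash (assumption).
    `download(name)` returns a local file path, `import_into_db(path)`
    raises on failure; both are hypothetical helpers.
    Returns the names that failed and will be retried on the next run.
    """
    failed = []
    for name in mbox_names:
        if name in done:
            continue                    # imported in an earlier run
        path = download(name)           # 1. download mbox
        try:
            import_into_db(path)        # 2. import data of this mbox
        except Exception:
            failed.append(name)         # not recorded, so it is retried later
        else:
            done.add(name)              # 3. store the record in lists.hash
        finally:
            if os.path.exists(path):
                os.remove(path)         # 4. delete mbox
    return failed
```

The point is only the ordering: the record is written after the import,
so a failed import never leaves an mbox marked as done.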

> > So we should remember to run the script on 2nd of every month to be
> > safe.
> 
> Should I add a check for this? if (script_run_date) == 1st day of Month, quit.

Perhaps there is a better check:  Is there an mbox for the *current* month
available?  If yes, then we know that last month is finished.  However,
if such an mbox is only created after the first message drops in, this
might lead to extra waiting if it is a low-traffic list.  So the check
should be

  run if
   (script_run_date) != 1st day of Month    OR
   mbox_for_this_month (which is ignored) exists
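
The condition above could be written roughly like this (just a sketch;
the function name and the boolean argument are my assumptions, and how
you find out whether this month's mbox exists is left to the caller):

```python
import datetime


def should_run(today, mbox_for_this_month_exists):
    """Return True if last month's mbox can be considered complete.

    `today` is a datetime.date; `mbox_for_this_month_exists` is a
    hypothetical flag the caller determines from the list archive.
    """
    return today.day != 1 or mbox_for_this_month_exists
```

On the 1st of a month the script then only runs if the archive already
has an mbox for the new month; on any other day it always runs.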

> > Question: If (for whatever reason) I would re-read all mboxes (for
> > instance after getting the information about massive SPAM removal) I
> > need to delete the corresponding entries in lists.hash, right?
> > According to your algorithm it also requires to clean up the database
> 
> Yes, that's right. We have to delete everything in /var/cache/teammetrics

Well, once the mboxes are deleted as we agreed upon, that boils down to
the stored mbox names in lists.hash, or did I miss something?
 
> > from the entries of this project.  To make sure that this will not be
> > forgotten we should set a primary key (project,message_id) to prevent
> > adding a message twice.
> 
> ... or we can write a shell script that does it easily for us without
> having to bother with the primary key :)

Nooooooooooooooooooo!  :-)
A primary key is a primary key, right?  We will probably do this with a
simple script - but this is no reason not to implement proper database
logic, which is to set proper constraints where these make sense.
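
To illustrate what the constraint buys us, here is a small sketch using
Python's sqlite3 module purely as a stand-in for our real database.  The
table name `messages` and the `author` column are made up for the
example; only the (project, message_id) key is what I am proposing:

```python
import sqlite3

# In-memory database as a stand-in; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        project    TEXT NOT NULL,
        message_id TEXT NOT NULL,
        author     TEXT,
        PRIMARY KEY (project, message_id)
    )
    """
)


def add_message(conn, project, message_id, author):
    """Insert one message; return False if it is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO messages VALUES (?, ?, ?)",
                (project, message_id, author),
            )
    except sqlite3.IntegrityError:
        return False  # the primary key rejected a duplicate
    return True
```

With the key in place, re-reading an mbox without cleaning up first
cannot add a message twice: the second insert of the same
(project, message_id) pair is simply refused by the database.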

Kind regards

     Andreas. 

-- 
http://fam-tille.de


