[Teammetrics-discuss] Phase I: The final parts.

Andreas Tille andreas at an3as.eu
Mon Jun 6 20:37:03 UTC 2011


On Mon, Jun 06, 2011 at 02:48:26AM +0530, Sukhbir Singh wrote:
> Sorry, I used the wrong terminology.
> 
> What I meant was that suppose _X_ mailing list is parsed. We generate
> a checksum for the mbox archives downloaded from _X_ and store their
> hashes in a file. So for example for _foo-bar_ mailing list with _foo_
> and _bar_ mbox archives, we store the hashes of _foo_ and _bar_. We
> can't save the hash of _X_ itself as a whole because we are not
> parsing the current month in a list. If we store the hash of _X_
> itself, we miss the current month.

I admit I'm not yet fully sure whether I understand your plan correctly.
What I would propose is the following.  Assume a mailing list foo which
starts, say in May 2005.  I would create a file with the name, say
foo.hashes (any better name is fine) which contains the following:

  <mbox for 5.2005>: <md5 or sha1 for this mbox file>
  <mbox for 6.2005>: <md5 or sha1 for this mbox file>
  ...
  <mbox for 5.2011>: <md5 or sha1 for this mbox file>

If you download an mbox and calculate its md5 or sha1 sum (whatever
you prefer - both are fine for this purpose), comparing it with the
value stored in this file tells you whether processing / parsing is
needed or not.
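
A minimal sketch of this bookkeeping in Python, assuming one
"<mbox name>: <digest>" line per archive as laid out above.  The
function names and the choice of sha1 as the default are illustrative
assumptions, not part of the existing teammetrics code:

```python
import hashlib
from pathlib import Path


def mbox_digest(path, algorithm="sha1"):
    """Return the hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_hashes(hash_file):
    """Read '<mbox name>: <digest>' lines into a dict (empty if no file yet)."""
    hashes = {}
    path = Path(hash_file)
    if not path.exists():
        return hashes
    for line in path.read_text().splitlines():
        # Split on the last ': ' so the digest is always the final field.
        name, _, digest = line.rpartition(": ")
        if name:
            hashes[name] = digest.strip()
    return hashes


def save_hashes(hash_file, hashes):
    """Write the dict back, one '<mbox name>: <digest>' line per archive."""
    lines = [f"{name}: {digest}" for name, digest in sorted(hashes.items())]
    Path(hash_file).write_text("\n".join(lines) + "\n")


def needs_parsing(mbox_path, hashes):
    """Return (changed, digest); changed is True for new or modified mboxes."""
    digest = mbox_digest(mbox_path)
    return hashes.get(Path(mbox_path).name) != digest, digest
```

After each download one would call needs_parsing(), parse only when it
reports a change, and store the returned digest via save_hashes().  An
mbox for the current month would simply show up as changed on every
run until the month is over.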

I'm afraid you need to download the mbox in any case because it might
change later on (for instance because of removed SPAM or whatever).  If
you have an idea how to safely avoid downloading the mboxes at all if
not needed that would be OK as well.
 
> Another related question on this topic, once when we are done with the
> parsing, should we remove both the archives and the mbox files? As of
> now, I am removing the archives only.

I have no strong feeling about this.  The information is easy to
recreate by downloading again - so we cannot lose much if the mbox
files are removed as well.
 
Kind regards

        Andreas.

-- 
http://fam-tille.de


