Revamping PET (maybe)

Tue Nov 17 08:54:34 UTC 2009

Hi,

I was trying to continue Ryan's work on making PET support multiple
source repositories and, as a side effect, allow using Git, but so far
I have failed. I looked and looked and could not come up with small
incremental changes. Whenever I tried an approach, it resulted in
massive changes all over.

So, I started writing a design document, currently a wish list for 
what PET could be like.

It is in doc/architecture.mdwn in the ryan52-multirepo branch, but I am
copying it here for discussion.

This is what PET looks like in my dreams. As with any dream, there is
a slight possibility of slipping away from reality. :)

----------------8<----------------

General architecture
====================

 * PET works with a RepositorySet
 * RepositorySet contains Repositories
 * RepositorySet holds a Cache.
   * The Cache is opened R/W (exclusive lock) or R/O (shared lock) depending
     on initialisation.
 * Repositories contain Packages.
   * Each Repository knows its containing RepositorySet.
   * Repositories can access files, directories, branches and tags.
 * Packages contain Files and Directories.
   * Each Package knows which Repository contains it.
   * Packages are populated with data using Collectors.
 * Collectors can retrieve data from the cache or other sources (Repository,
   BTS, Archive).
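
To make the containment and back-references above concrete, here is a minimal
sketch. It is written in Python only for brevity and is not PET code; the
class and method names (`add_repository`, `add_package`, the `data` dict) are
assumptions made up for illustration, and the Cache is left as an opaque
placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Package:
    name: str
    # Back-reference: each Package knows which Repository contains it.
    repository: "Repository" = field(repr=False, compare=False)
    # Filled in by Collectors (hypothetical storage format).
    data: Dict[str, object] = field(default_factory=dict)


@dataclass
class Repository:
    name: str
    # Back-reference: each Repository knows its containing RepositorySet.
    repository_set: "RepositorySet" = field(repr=False, compare=False)
    packages: Dict[str, Package] = field(default_factory=dict)

    def add_package(self, name: str) -> Package:
        pkg = Package(name=name, repository=self)
        self.packages[name] = pkg
        return pkg


@dataclass
class RepositorySet:
    # Placeholder for the Cache object held by the set.
    cache: Optional[object] = None
    repositories: Dict[str, Repository] = field(default_factory=dict)

    def add_repository(self, name: str) -> Repository:
        repo = Repository(name=name, repository_set=self)
        self.repositories[name] = repo
        return repo
```

With this shape, any Package can reach the shared Cache through
`pkg.repository.repository_set.cache`, which is the main point of keeping the
back-references.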

Use cases
=========

Displaying data (pet.cgi, cache is R/O)
---------------------------------------

 * Works only with data from the Cache
 * listing all packages; for each package, the following data is needed:
   * name
   * versions in repository, Debian (several releases, NEW), upstream
   * tags
 * packages are shown in groups depending on their classification

Ajax stuff
----------

 * pet_chlog.cgi
   * Only retrieves the changelog entry (released or unreleased) of a single
     package
 * pkginfo.cgi
   * Retrieves all the info about a single package

Updating the data from repository(ies) (fetch data, cache is R/W)
-----------------------------------------------------------------

Initial data population
-----------------------

 * for each repository
   * list all packages
     * collect information about the package

Subsequent data updates (post-commit hook)
------------------------------------------

 * for each change set
   * detect affected package(s)
     * update package data (only changes)
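
As an illustration of the "detect affected package(s)" step, here is a small
sketch in Python. It assumes a repository layout where each package lives
under `trunk/<package>/...`; that layout, the `prefix` parameter and the
function name are assumptions for the example, and a real Repository class
would have to supply its own mapping from paths to packages.

```python
from typing import Iterable, Set


def affected_packages(changed_paths: Iterable[str], prefix: str = "trunk/") -> Set[str]:
    """Map the paths touched by one change set to the set of package names.

    Assumes (hypothetically) that packages live in <prefix><package>/...
    """
    affected = set()
    for path in changed_paths:
        if not path.startswith(prefix):
            continue  # outside the package area, e.g. branches or tags
        # The first path component after the prefix is taken as the package name.
        name = path[len(prefix):].split("/", 1)[0]
        if name:
            affected.add(name)
    return affected
```

Only the packages returned here would then be handed to the collectors, so a
commit touching one package does not trigger a full rescan.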

Collectors
----------

Each collector is responsible for collecting a certain class of information
about the package.

Collected information:
 * Repository stuff:
   * watch file: URL and upstream version
   * Changelog:
     * the last released stanza (text and version)
       * signature identity
     * the UNRELEASED stanza (if any) (text and version)
       * possible NOTES and other pseudo-tags
   * tags
 * Debian archive
   * versions in different suites
   * NEW
 * Bug tracking system
 * Classification
   * uses the collected data to put the package in one of several classes
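
A rough sketch of how collectors could be chained, with classification running
last on the data gathered by the others. This is Python for illustration only;
the key names (`repo_version`, `archive_version`, `unreleased_version`) and the
three class labels are invented for the example and are not PET's actual
classes. Versions are compared for plain equality, not with Debian version
ordering.

```python
from typing import Callable, Dict, List

PackageData = Dict[str, object]
# A collector takes a package name and adds what it knows to the data dict.
Collector = Callable[[str, PackageData], None]


def table_collector(key: str, table: Dict[str, object]) -> Collector:
    """Build a collector that copies one value per package from a lookup table.

    Stands in for a real source such as the repository, the archive or the BTS.
    """
    def collect(name: str, data: PackageData) -> None:
        if name in table:
            data[key] = table[name]
    return collect


def classify(name: str, data: PackageData) -> None:
    """Put the package in one of several (hypothetical) classes."""
    if data.get("unreleased_version"):
        data["class"] = "work in progress"
    elif data.get("repo_version") != data.get("archive_version"):
        data["class"] = "needs upload"
    else:
        data["class"] = "up to date"


def run_collectors(name: str, collectors: List[Collector]) -> PackageData:
    """Run every collector in order; classification should come last."""
    data: PackageData = {}
    for collector in collectors:
        collector(name, data)
    return data
```

Keeping classification as just another collector means it sees exactly the
same data the frontend will later read from the cache.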

Cache
-----

The Cache stores information for later reuse, so that potentially
time-consuming operations do not have to be repeated.

Cache Interface
---------------

 * TODO
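
Until the interface is written down, here is one possible shape, purely as a
starting point for discussion. It is a Python sketch, not a proposal for the
actual API: the method names (`get`, `put`, `packages`) and the in-memory
backend are assumptions, and file locking (exclusive for R/W, shared for R/O)
is deliberately left out.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Optional


class Cache(ABC):
    """Hypothetical backend-independent cache interface."""

    def __init__(self, writable: bool = False) -> None:
        # R/W or R/O is decided once, at initialisation.
        self.writable = writable

    @abstractmethod
    def get(self, package: str) -> Optional[dict]:
        """Return the stored data for one package, or None."""

    @abstractmethod
    def packages(self) -> Iterator[str]:
        """Iterate over the names of all cached packages."""

    @abstractmethod
    def _store(self, package: str, data: dict) -> None:
        """Backend-specific write of one package's data."""

    def put(self, package: str, data: dict) -> None:
        """Store data for one package; refused on a read-only cache."""
        if not self.writable:
            raise PermissionError("cache was opened read-only")
        self._store(package, data)


class MemoryCache(Cache):
    """Trivial in-memory backend, only to show the interface in use."""

    def __init__(self, writable: bool = False) -> None:
        super().__init__(writable)
        self._data: Dict[str, dict] = {}

    def get(self, package: str) -> Optional[dict]:
        return self._data.get(package)

    def packages(self) -> Iterator[str]:
        return iter(sorted(self._data))

    def _store(self, package: str, data: dict) -> None:
        self._data[package] = dict(data)
```

The frontend would only ever call `get` and `packages` on a read-only
instance, while the update scripts would open the cache writable and call
`put` per package.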

Possible implementations
------------------------

Currently we use one big hash streamed with Storable. This is very handy when
operations are to be done on all packages, like in the web frontend.

OTOH, this approach causes the whole file to be rewritten when there is an
update in a single package (post-commit).

    5.7M 2009-11-12 09:16 archive
    421K 2009-11-12 10:15 bts
    2.5M 2009-11-12 10:15 consolidated
    2.4M 2009-11-12 07:16 cpan_dists
    3.7M 2009-11-12 07:16 cpan_index
    4.1M 2009-11-12 04:43 svn
    424K 2009-11-12 10:15 watch

Maybe an SQL(ite) database can be used instead? It would also allow processing
to be done package by package, without reading the whole thing into memory.

The design doesn't care which way of caching is used, as long as it conforms to
the interface.
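
To show what the SQLite idea could look like, here is a minimal sketch using
Python's standard `sqlite3` module. The table name, the column names and the
choice of one JSON payload per (package, collector) pair are all assumptions
made for the example, not a schema proposal.

```python
import json
import sqlite3


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (and if needed create) the hypothetical cache database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS package_data (
            package   TEXT NOT NULL,
            collector TEXT NOT NULL,
            payload   TEXT NOT NULL,
            PRIMARY KEY (package, collector)
        )
        """
    )
    return conn


def store(conn: sqlite3.Connection, package: str, collector: str, data: dict) -> None:
    """Write one collector's data for one package; other rows are untouched."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO package_data (package, collector, payload) "
            "VALUES (?, ?, ?)",
            (package, collector, json.dumps(data)),
        )


def load(conn: sqlite3.Connection, package: str) -> dict:
    """Read everything known about a single package."""
    rows = conn.execute(
        "SELECT collector, payload FROM package_data WHERE package = ?",
        (package,),
    )
    return {collector: json.loads(payload) for collector, payload in rows}
```

A post-commit update then rewrites a single row instead of a multi-megabyte
file, and pkginfo.cgi-style lookups only read the rows for one package.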

----------------8<----------------

Please tell me if you see flaws in this approach. Note that nothing is
written yet, so if you think I am wasting my time, there is not much
really lost.

-- 
dam