[Soc-coordination] Final report: Semantic Package Review Interface for mentors.debian.net

Clément Schreiner clemux at mux.me
Sun Aug 19 22:29:06 UTC 2012


1 Short summary: 
-----------------

My project aimed to gather metadata about packages submitted to
mentors.d.n by new contributors, and to recommend sponsors who can help
them get their packages into Debian. To achieve my goals, I had to
deeply refactor the package importing procedure and metadata storage.
This allowed for integration of debtags heuristics and matching with
similar packages, which can now be used for finding potential
sponsors.


2 Recent work 
--------------

2.1 Plugin API 
===============

I further improved the plugin API. Maybe I should have finished the
semantic metadata work instead, but I wanted to be sure I could store
data from semantic plugins properly, so that I would not have to
rewrite them later. Moreover, having a good way to import metadata
from a package and store it in the database for easy retrieval later
was a key requirement for my project.

2.1.1 Various changes to make the plugins' code less verbose 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - the PluginResult subclasses now guess their 'entity name', used in
   the inheritance scheme to associate a SQL table with the right
   model

 - To define a PluginResult model as the result of a QA test, we use
   the new decorator 'test_result'. It sets an attribute on the class,
   which will be checked by the plugin when loading the model.

   I should explain what I mean by 'test result': QA plugins typically
   determine whether a package passes or fails some test. For example:
   the package is lintian clean / has lintian warnings; the bugs in
   the changes file's 'Closed-Bugs' section really belong to the
   package or not, etc.

   If needed, the test results' models can return data from other
   models (for example, the lintian plugin defines two models:
   LintianTest, the test's result, and LintianWarning, for
   representing a tag as reported by the ``lintian`` program).

 - I wrote another decorator, ``importercmd``, which decorates plugin
   methods to make the importer (or, later, a controller) call them
   when importing data from a package
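
A minimal sketch of how such a marker decorator could work. Only the
name ``importercmd`` comes from the work described above; the attribute
name, the toy plugin and the ``run_importer_commands`` helper are
assumptions made for illustration, not debexpo's actual code. A class
decorator such as 'test_result' could follow the same pattern, setting
the attribute on the class instead of on a method.

```python
# Hypothetical sketch: the decorator name comes from the report, but its body
# and the attribute name are assumptions, not debexpo's actual code.

def importercmd(func):
    """Flag a plugin method so that the importer knows it must call it."""
    func.is_importer_command = True
    return func


class NativePlugin(object):
    """Toy plugin; the real plugins inspect an uploaded package."""

    def __init__(self, version):
        self.version = version

    @importercmd
    def check_native(self):
        # A native package has no Debian revision, hence no '-' in its version.
        return {'native': 'false' if '-' in self.version else 'true'}

    def helper(self):
        # Not decorated, so the importer ignores it.
        return None


def run_importer_commands(plugin):
    """Call every flagged method of a plugin and collect the results by name."""
    results = {}
    for name in dir(plugin):
        if name.startswith('_'):
            continue
        attr = getattr(plugin, name)
        if callable(attr) and getattr(attr, 'is_importer_command', False):
            results[name] = attr()
    return results
```

With this approach the importer does not need a hard-coded list of
methods: it only looks for the flag, so adding a new command to a plugin
is a matter of decorating one more method.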


2.1.2 'Property factories' 
~~~~~~~~~~~~~~~~~~~~~~~~~~~

PluginResult models can now declare ``fields``, using
automatically generated properties. For example, the function
``bool_field`` will return a property for reading/writing a field as a
boolean, instead of explicitly using the underlying string. Currently,
bool_field, string_field and int_field are available.

I call these functions 'property factories', but I need to find a
better name for them.

Let's see a very simple and stripped down example: the model for the
``native`` plugin (which determines whether the package is native or
not).



  @test_result
  class NativeTest(PluginResult):
      # 'native' is the name of the underlying string field
      is_native = bool_field('native')

      def __str__(self):
          return 'Package is %snative' % ('' if self.is_native else 'not ')


The ``bool_field`` function, defined in debexpo.plugins.api, is
roughly equivalent to this property:



  def fget(instance):
      # the field is stored as the string 'true' or 'false'
      return instance.get('native', 'false') == 'true'

  def fset(instance, value):
      instance['native'] = 'true' if value else 'false'

  is_native = property(fget, fset)


Previously, plugin writers had to write an 'is_native' method
decorated with @property, and explicitly coerce the string into a
boolean. This was especially cumbersome if they also wanted a setter
for coercing a boolean back into a string.
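
To make the 'factory' part concrete, here is a minimal sketch of how
such functions could be written with closures. The names bool_field,
int_field and string_field are the ones mentioned above, but the
bodies, the default values and the dict-based ``PluginResult`` stand-in
are assumptions; the real code in debexpo.plugins.api may well differ.

```python
# Hypothetical sketch of 'property factories'. The function names come from
# the report; the bodies and defaults are assumptions.

def bool_field(name):
    """Return a property exposing the string stored under `name` as a bool."""
    def fget(instance):
        return instance.get(name, 'false') == 'true'

    def fset(instance, value):
        instance[name] = 'true' if value else 'false'

    return property(fget, fset)


def int_field(name):
    """Return a property exposing the string stored under `name` as an int."""
    def fget(instance):
        return int(instance.get(name, '0'))

    def fset(instance, value):
        instance[name] = str(int(value))

    return property(fget, fset)


def string_field(name):
    """Return a property exposing the string stored under `name` unchanged."""
    def fget(instance):
        return instance.get(name, '')

    def fset(instance, value):
        instance[name] = str(value)

    return property(fget, fset)


class PluginResult(dict):
    """Stand-in for the real model: here, simply a dict of strings."""


class DemoResult(PluginResult):
    """Made-up model used only to exercise the three factories."""
    clean = bool_field('clean')
    warnings = int_field('warnings')
    severity = string_field('severity')


result = DemoResult()
result.clean = True      # stored as the string 'true'
result.warnings = 3      # stored as the string '3'
```

Each factory closes over the field name, so the model only has to state
which field it wants and under which type.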

2.1.3 Port existing plugins to the new API 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This took longer than I expected, but I don't think it was a waste of
time. The results from some of these plugins will have to be taken
into account when recommending a sponsor to an uploader, and with my
changes the data is now easy to retrieve.


2.1.4 Almost done: trivial to finish 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - remove the plugin configuration switches and make
   'debexpo.plugins' a package with several modules, for example:
    + ``qa`` for QA tests
    + ``post_upload`` for various actions done before any data is
      imported from the package (this is the current name, but I think
      we need to find a less ambiguous one)
    + ``semantic`` for semantic metadata extraction
 - action plugins that can be run before the package has been imported
   (getorigtarball should be one of those). They would replace some of
   the current plugin types with ambiguous names like ``post_upload``,
   ``post_upload_to_debian``, ``post_successful_upload``.

 - allow plugins to be run outside the importer, for refreshing data

   I'm not sure when. Maybe with a crontab (either a cronjobs system
   as used by the debexpo worker, or small scripts that could be
   installed in the system's crontab)? Or maybe through a specific
   controller, called after certain user actions.

   e.g.: after the user has edited a package's tags, the sponsor
   recommendation plugin should be called again


2.2 Sponsor recommendation 
===========================

 - on the package's page, after the results from the QA tests,
   potential sponsors can be displayed in a table. This is not very
   useful in its current state, though.
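
To illustrate the general idea of going from tags and similar packages
to potential sponsors, here is a small sketch. It is not the algorithm
used in debexpo (which relies on debtags heuristics and
apt-xapian-index): the Jaccard measure, the data layout and the function
names are all assumptions chosen to keep the example self-contained.

```python
# Hypothetical sketch: ranks sponsors by the debtags overlap between an
# uploaded package and packages they already sponsored. The data layout and
# the Jaccard measure are assumptions, not debexpo's algorithm.

def jaccard(tags_a, tags_b):
    """Similarity of two tag sets: size of intersection over size of union."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return len(tags_a & tags_b) / float(len(union))


def recommend_sponsors(package_tags, sponsored, limit=3):
    """Return up to `limit` (sponsor, score) pairs, best match first.

    `sponsored` maps a sponsor name to the tag sets of packages they uploaded.
    """
    scores = []
    for sponsor, tag_sets in sponsored.items():
        best = max([jaccard(package_tags, tags) for tags in tag_sets] or [0.0])
        if best > 0:
            scores.append((sponsor, best))
    # Sort by descending score, then by name to keep the order deterministic.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:limit]


# Made-up data for illustration.
uploaded = {'implemented-in::python', 'role::program', 'web::application'}
sponsored = {
    'alice': [{'implemented-in::python', 'role::program'}],
    'bob': [{'implemented-in::c', 'role::shared-lib'}],
    'carol': [{'implemented-in::python', 'role::program', 'web::application'}],
}
ranking = recommend_sponsors(uploaded, sponsored)
```

The resulting list of (sponsor, score) pairs maps directly onto the rows
of a table on the package's page.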

3 Final assessment 
-------------------

I have not managed to implement all I had intended to. Here is a summary:

3.1 Successful 
===============
 
 - New plugin system: This API makes it possible to store data 'in an
   almost declarative way' [I need a better qualifier for that] for
   the results of plugins, and to make it accessible outside the
   plugin. With a little more magic code, some plugins won't need to
   have their own templates anymore.

 - Debtags plugin: using debtags heuristics, finds tags associated
   with the package.

 - Similar packages plugin: making use of apt-xapian-index and
   debtags, matches a package with similar ones already in Debian.
   Also usable for finding sponsors.

 - Small but non-negligible detail, my work's documentation: I have
   written and kept up-to-date comprehensive docstrings for all new
   objects and methods (and some existing ones). This will not
   generate perfect documentation, but improving it should be easy
   and will mostly be a matter of formatting.


3.2 Unsuccessful, or not finished / needs polishing 
====================================================


 - Debtags: I had planned to write new heuristics to gather a richer
   set of metadata for uploaded packages, but I did not have the
   time.

 - Sponsor recommendation: this was the ultimate goal of the
   project, and it will not be ready on the final deadline (not
   sure it's really a failure, though, because the new plugin
   architecture should make it easy to improve my proof-of-concept
   code).

 - Semantic metadata querying: I have not designed a nice UI for
   browsing through the packages' metadata.

 - Documentation: Most of the code has good docstrings, but they
   probably are not formatted correctly for sphinx and they could be
   improved so that the arguments and return types are explicitly
   stated. Also, I wanted to write a few HOWTOs (writing new plugins,
   adding a new model to debexpo's database, ...)

3.3 What I gained thanks to the Summer of Code 
===============================================

My work has been useful to debexpo/mentors.d.n and Debian in general
(or at least, I hope it has!), but it was also very positive for me:

First of all, I've learnt a lot about Python development, particularly
about Python's object layer (inheritance, magic methods, attribute
access, among others). I also discovered nice techniques for
abstracting pieces of code while keeping them readable (using
dictionaries, [namedtuples], iterators, first-class functions, etc.).

This project introduced me to the [Pylons] framework and to the
wonderful [sqlalchemy] toolkit, and more generally to web development
and relational databases.

I am now more familiar with Debian and its packaging system, and I am
motivated to fix bugs in packages or create new packages when I miss
something, instead of waiting for someone to do it for me or
installing software outside APT.



[namedtuples]: http://docs.python.org/library/collections.html#collections.namedtuple
[Pylons]: http://www.pylonsproject.org/projects/pylons-framework/about
[sqlalchemy]: http://www.sqlalchemy.org/

4 This summer of code is over, now what? 
-----------------------------------------

I will continue working on debexpo, and probably other (related) parts
of debian during the next months (and perhaps permanently? I like this
project). 

My priority is of course to finish what I've started this summer:

4.1 GnuPG wrapper 
==================

   
This was not really part of this summer of code project, but there is
not much work left and it has to be shipped to mentors.d.n soon:

In April I started rewriting debexpo's GnuPG wrapper (see the
[git branch]) and I used it to add a 'Debian Machine Usage Policy'
agreement form to user profiles. I need to polish it, document it and
write tests. Then I will migrate debexpo's codebase to the new API.

Since I have learnt a lot about Python since I wrote that wrapper, I
will be able to make it look nicer than it currently does.


[git branch]: http://anonscm.debian.org/gitweb/?p=debexpo/debexpo.git;a=blob;f=debexpo/lib/gnupg2.py;hb=refs/heads/gpg-rewrite

4.2 Plugin API 
===============

 - default template for very simple QA plugins

 - New type of plugins, with their own controller, for viewing/editing
   semantic metadata:
    + debtags: the user should be able to verify and correct the
      results from debtags heuristics
    + similar packages: the maintainer (or any reviewer?) should be
      able to remove a package from the similar list, and that should
      be taken into account by the sponsor-recommendation plugin
    + feedback for sponsor recommendation: "I'm not interested in
      sponsoring that package, remove me from the list"


4.3 Semantic metadata, debtags 
===============================

 - work with Enrico Zini to make debtags' heuristics easier to use
   outside debtagsd, and release them as a new library

 - write a lot more debtags heuristics

 - manage packaging teams, and associate each with a set of debtags,
   for easily matching a package with potential teams

4.4 Sponsor preferences 
========================

 - extend the plugin system to allow writing small 'metadata plugins'
   that can easily be used by sponsors to define their 'Sponsoring
   preferences'. 

 - go through the [Sponsor Checklist] on the Debian Wiki and the
   preferences linked from there. Then write plugins to standardize
   all of those, and make it easy to determine whether a package
   meets a registered sponsor's preferences. This shall be done in one
   or more 'metadata plugins'.

 - using the sponsor preference plugin, new maintainers will get
   personalized advice for making their package ready for inclusion
   in Debian
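
One possible shape for such preference checks, as a minimal sketch.
None of this exists in debexpo yet: the function names, the package
fields (``lintian_warnings``, ``closed_bugs``) and the two example
preferences are assumptions made for illustration. Each preference is a
function that returns either None (preference met) or a piece of advice
for the maintainer.

```python
# Hypothetical sketch: every name and data field below is an assumption made
# for illustration; none of it is debexpo's API.

def wants_lintian_clean(package):
    """Preference: the package must have no lintian warnings."""
    if package.get('lintian_warnings', 0) == 0:
        return None
    return 'fix the %d lintian warning(s)' % package['lintian_warnings']


def wants_closed_bugs(package):
    """Preference: the upload must close at least one bug (e.g. an ITP)."""
    if package.get('closed_bugs'):
        return None
    return 'close a bug in the changelog'


def advice_for(package, preferences):
    """Return what is left to fix before the preferences are all met."""
    advice = []
    for check in preferences:
        message = check(package)
        if message is not None:
            advice.append(message)
    return advice


def matching_sponsors(package, sponsors):
    """Return, sorted by name, the sponsors whose preferences are all met."""
    return sorted(name for name, preferences in sponsors.items()
                  if not advice_for(package, preferences))


# Made-up sponsors and package metadata.
sponsors = {
    'alice': [wants_lintian_clean],
    'bob': [wants_lintian_clean, wants_closed_bugs],
}
package = {'lintian_warnings': 0, 'closed_bugs': []}
ready_for = matching_sponsors(package, sponsors)
todo_for_bob = advice_for(package, sponsors['bob'])
```

The same check functions serve both purposes listed above: filtering
sponsors whose preferences are met, and telling the maintainer what is
still missing for the others.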


   [Sponsor Checklist]: http://wiki.debian.org/SponsorChecklist

4.5 Sponsor recommendation 
===========================
  
The current sponsor recommendation is more a proof-of-concept than a
complete new feature and probably will not be very useful to new
maintainers. I need to improve the UI and the underlying algorithms.

4.6 Next months 
================

During the summer, I got ideas for improving debexpo in other areas
than semantic metadata and sponsor recommendation. Some of them will
not benefit the project but others might. I will discuss them with the
rest of the team and implement them accordingly.

I probably will contribute to [debtags] too.



[debtags]: http://debtags.debian.net/

5 Conclusion 
-------------

Thanks for reading, and many thanks to Google, Debian, my mentors and Debian's
GSoC admins for making this great experience possible.



