[Soc-coordination] First report: Semantic Package Review Interface for mentors.debian.net

Mon Jun 4 18:14:51 UTC 2012

Hi,

this is the first bi-weekly report on my Summer of Code project
'Semantic Package Review Interface for mentors.debian.net'.

My project aims to extract metadata from packages submitted to
mentors.d.n[1], and use this data to match a package with a
potential sponsor. Since a lot of packages get stuck in the
mentoring process because their maintainers have difficulty finding
a sponsor, this should ease their entering the Debian process.

The initial plan was, very roughly, as follows:

 - automatically or semi-automatically assign debtags to new packages
 - match new packages with potential sponsors using the debtags in the
   latters' uploading histories
 - glue it all together in a nice web UI

Before my proposal got accepted, I started working on a small
patch for debexpo, asking the maintainers to accept the Debian Machine
Usage Policies before they can upload packages [2].

To do this properly, I needed to add features to debexpo's GnuPG
wrapper. I suggested using a third party gpg library, but after some
discussion with Nicolas, Arno and others on #debexpo, we realized none
was satisfactory: either very old python code, buggy, or as low level
as the underlying C libraries...

Thus, during the community bonding period, I started working on a new
wrapper. While this wasn't in the scope of my actual Summer of Code
project, it allowed me to familiarize myself with debexpo's
codebase. I set up an alioth account and pushed a first version into a
new branch[3], but I did not have the time to finish it, because of
end-of-term projects and upcoming exams. I'll polish it and integrate it
with debexpo sometime later, once I have made some progress with the
actual GSoC project.

Now to the actual project. My initial plan was to start extracting
tags from new packages. One of my ideas was to use a Bayesian or
statistical classifier which would learn from packages in Debian's
archive and predict tags for a new package. I knew from the beginning
that it might be too difficult or too big a project, and might be out
of the scope of the SoC, but my other ideas seemed very dumb and I did
not think they might get anywhere.

So I started working on a classifier, hoping that I would at least
gather some data that would be useful later, even if I had to give up
the classifier plan entirely. Also, a simpler classifier might be
a viable idea for the next step of my project (matching packages with
sponsors), so finding out how to use a machine learning API would not
be a waste of time. I decided to first work with package descriptions.

Before I started, I had already picked the Python libraries NLTK
(Natural Language Toolkit)[4] and scikit-learn[5] as the most
interesting candidates. I experimented a bit with both, and quickly
saw that NLTK was way too powerful for my needs, while scikit-learn's
text processing features were sufficient. I put NLTK aside and got to
work.

The first step was to gather data on packages in Debian's
archive. After some playing around and frustration with Debian Data
Export[6], I finally realized that I already had all the information
I wanted on my Debian system; all I needed was to access it with
python-apt (for descriptions in apt's cache) and python-debian
(for debtags).
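
For illustration only, here is a minimal sketch of turning a debtags-style
tag database into a per-package mapping. It does not use python-debian and
is not the code from my branch; it assumes the plain-text layout
'pkg1, pkg2: tag1, tag2' (one entry per line), and the package names in the
sample are made up.

```python
from collections import defaultdict


def parse_tag_db(lines):
    """Parse 'pkg1, pkg2: tag1, tag2' lines into {package: set_of_tags}.

    The line layout is an assumption about the tag database format.
    """
    tags_by_package = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        # Tags contain '::', so split only on the first colon.
        pkgs, tags = line.split(":", 1)
        for pkg in pkgs.split(","):
            pkg = pkg.strip()
            for tag in tags.split(","):
                tag = tag.strip()
                if pkg and tag:
                    tags_by_package[pkg].add(tag)
    return dict(tags_by_package)


# Hypothetical sample data, not taken from the real archive.
SAMPLE = """\
examplemail: interface::text-mode, role::program, works-with::mail
examplevcs, examplevcs-doc: devel::rcs, role::program
"""

tag_db = parse_tag_db(SAMPLE.splitlines())
```

On a real system the lines would come from the local tag database file
rather than from a string; the descriptions would still have to be read
separately from apt's cache.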

The second, and hardest, step was to figure out how to process these
package descriptions and debtags to make them usable in
scikit-learn. This took some googling, reading documentation, going
through Stack Overflow's archives, and hundreds of tests in IPython.

With scikit-learn's feature extraction and text pre-processing tools,
I built a vector space model[7] from the words of the descriptions
(with tf*idf weights[8]) and binarized the tags for use with a
multi-label classifier.
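
To make these two transformations concrete, here is a small pure-Python
sketch of a tf*idf vector space model and of tag binarization. It uses
the textbook weight tf * log(N / df); scikit-learn's own weighting
differs in details such as smoothing and normalization, so this is an
illustration of the idea rather than a reproduction of what the library
computes. The descriptions and tags are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return (vocabulary, one tf*idf vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many descriptions each word occurs.
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors


def binarize_tags(tag_sets):
    """Turn a list of tag sets into a 0/1 matrix, one column per tag."""
    labels = sorted(set().union(*tag_sets))
    matrix = [[1 if t in tags else 0 for t in labels] for tags in tag_sets]
    return labels, matrix


# Hypothetical descriptions and tags, for illustration only.
DESCRIPTIONS = [
    "mail client for the terminal",
    "fast version control system",
    "graphical mail client",
]
TAG_SETS = [
    {"works-with::mail", "interface::text-mode"},
    {"devel::rcs"},
    {"works-with::mail", "interface::x11"},
]

vocab, vectors = tfidf_vectors(DESCRIPTIONS)
labels, tag_matrix = binarize_tags(TAG_SETS)
```

Each row of 'vectors' lines up with the same row of 'tag_matrix', which
is the shape a multi-label classifier expects: one feature vector and
one 0/1 label vector per package.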

Eventually, I got to the point where I could feed a Naive Bayes
classifier with package descriptions and tags. The results were,
let's say... weird. A few packages in my test set would get accurate
tags, and most of them none at all. I managed to tweak it a bit to get
more results: tags were assigned to 2% of the packages, this time with
a very low accuracy (except for a few that got exactly the tags they
were supposed to have).
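
For readers unfamiliar with the approach, the sketch below shows one
common way to do multi-label tagging with Naive Bayes: train one binary
model per tag ("has this tag" versus "does not") and assign every tag
whose positive class scores higher. It is a hand-written multinomial
Naive Bayes with Laplace smoothing, not scikit-learn's implementation
and not the code from my branch; the function names and the training
data are made up.

```python
import math
from collections import Counter


def train_tag_model(docs, tag_sets, tag):
    """Collect word and document counts for one tag (binary problem)."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = {True: 0, False: 0}
    for words, tags in zip(docs, tag_sets):
        has_tag = tag in tags
        word_counts[has_tag].update(words)
        doc_counts[has_tag] += 1
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def log_score(model, words, has_tag):
    """Log of prior times likelihood, with Laplace (add-one) smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = doc_counts[True] + doc_counts[False]
    score = math.log((doc_counts[has_tag] + 1) / (total_docs + 2))
    denom = sum(word_counts[has_tag].values()) + len(vocab)
    for w in words:
        if w in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[has_tag][w] + 1) / denom)
    return score


def predict_tags(models, words):
    """Assign every tag whose 'has tag' class strictly wins."""
    return {
        tag
        for tag, model in models.items()
        if log_score(model, words, True) > log_score(model, words, False)
    }


# Hypothetical training set: tokenized descriptions and their tags.
DOCS = [
    ["mail", "client", "terminal"],
    ["mail", "reader", "imap"],
    ["version", "control", "system"],
    ["distributed", "version", "control"],
]
TAGS = [
    {"works-with::mail"},
    {"works-with::mail"},
    {"devel::rcs"},
    {"devel::rcs"},
]

MODELS = {t: train_tag_model(DOCS, TAGS, t) for t in sorted(set().union(*TAGS))}
```

In this toy setup, predict_tags(MODELS, ["imap", "mail", "client"])
yields the mail tag only, while a description made of words the models
have never seen yields no tag at all, because neither class wins.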

I didn't bother writing a real performance evaluator for this
classifier: it seems clear enough that developing a completely automatic
classifier for debtags is too big a task for this project. I might try
again once the Summer of Code is over.

For the record, I committed this code to a branch 'metadata-extract',
but I don't think it will be of much future use. This is not much in
terms of lines of code; I spent a lot of time researching stuff, and
still had a few exams (which are now over).

At my mentors Arno and Nicolas' suggestion, I discussed my problem
with Enrico Zini[9]. He was very helpful and gave me a few hints toward a
much simpler strategy that 'might just work'. He also advised me
to forget about real classifiers, and told me that someone else had
tried to develop one in a previous GSoC and got nowhere.

I will use debtags' existing heuristics[10] to suggest a first set of
tags for a new package, and ask the maintainer to check and complete
it. Then, I can construct a Xapian[11] query with these tags and tokens
extracted from the description to find similar packages, keeping only
the packages whose maintainers are available sponsors.
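
The matching step can be pictured with the pure-Python stand-in below.
It does not call Xapian: it imitates an OR query by counting the terms a
candidate package shares with the new one, then drops packages whose
maintainer is not a known sponsor. The 'XT' prefix on tag terms, the
stopword list, the package names and the maintainer names are all
assumptions made for this example.

```python
import re

# Tiny hypothetical stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "the", "for", "and", "of", "to", "with"}


def query_terms(description, tags):
    """Combine description tokens and prefixed tags into one term set."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    words = {w for w in tokens if w not in STOPWORDS}
    # 'XT' marks tag terms so they cannot collide with plain words.
    return words | {"XT" + t for t in tags}


def similar_sponsored_packages(terms, index, sponsors):
    """Rank indexed packages by shared terms, keeping sponsors' packages.

    index maps package name -> (maintainer, set of terms).
    Returns a list of (score, package, maintainer), best match first.
    """
    hits = []
    for pkg, (maintainer, pkg_terms) in index.items():
        score = len(terms & pkg_terms)
        if score and maintainer in sponsors:
            hits.append((score, pkg, maintainer))
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits


# Hypothetical index and sponsor list.
INDEX = {
    "pkg-mail-reader": ("alice", query_terms(
        "text based mail reader", {"works-with::mail"})),
    "pkg-mail-tui": ("bob", query_terms(
        "mail client for the terminal",
        {"works-with::mail", "interface::text-mode"})),
    "pkg-vcs": ("carol", query_terms(
        "distributed version control system", {"devel::rcs"})),
}
SPONSORS = {"bob", "carol"}

new_package_terms = query_terms("A small terminal mail client",
                                {"works-with::mail"})
matches = similar_sponsored_packages(new_package_terms, INDEX, SPONSORS)
```

Here only 'pkg-mail-tui' is returned: 'pkg-mail-reader' also shares
terms with the new package, but its maintainer is not in the sponsor
list, and 'pkg-vcs' shares no terms at all.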

Later this summer, I will contribute some debtags heuristics, which
should benefit debtags itself as well as debexpo.

Thanks to Enrico, I now have a more realistic plan for the next few
weeks. I should even have a working prototype integrated into debexpo's
current UI before the next report.

To help myself stay focused and avoid losing time with too much theory
or over-complicated ideas, I divided my near-future work into small
tasks:

 - apply debtags' heuristics to a package
 - tokenize a package's description and build a Xapian query from
   the resulting tokens and the tags above
 - make the above work with packages uploaded to mentors.d.n
 - ask the maintainer to check/complete the tags assigned to the package
 - present the result of the query in debexpo's web UI

That's it for today, and I'll keep in mind this valuable lesson: I
should have talked more with my mentors :)

Footnotes:

  [1] [http://mentors.debian.net/]

  [2] [http://wiki.debian.org/Debexpo/Development#Open_tasks-1]

  [3] [http://anonscm.debian.org/gitweb/?p=debexpo/debexpo.git;a=blob;f=debexpo/lib/gnupg2.py;h=4add6f2a810f2892b99411729f449ecc60be12b1;hb=refs/heads/gpg-rewrite]

  [4] [http://nltk.org/]

  [5] [http://scikit-learn.org/stable/]

  [6] [http://dde.debian.net/dde/]

  [7] [http://en.wikipedia.org/wiki/Vector_space_model]

  [8] [http://en.wikipedia.org/wiki/Tf*idf]

  [9] [http://enricozini.org/]

  [10] [http://anonscm.debian.org/gitweb/?p=debtags/debtagsd.git;a=tree;f=debdata;hb=master]

  [11] [http://www.enricozini.org/2007/debtags/apt-xapian-index/]