[Soc-coordination] applying for Aptitude search ranking and presentation

Mon Mar 23 23:30:45 UTC 2009

On Mon, Mar 23, 2009 at 11:41:10PM +0200, KUTLU EMRE YILMAZ <keylmz at gmail.com> was heard to say:
> my thesis topic is comparison of turkish information retrieval performance
> of lemur and terrier toolkits regarding their different retrieval algos.
> 
> i have also used lucene in my ir course so im familiar with ir and believe i
> can do my best for this project.

  Remember that we are developing software for use in the real world.
This means that we need to stick to tools and libraries that are readily
available; otherwise no-one will be able to use our fine software. :-)

  In this case, what that means is that the libraries we use must be
available in Debian, and they must be C++-accessible (because aptitude
is written in C++).  Xapian is in Debian and is natively a C++ library.
"Lucene" I'm not familiar with and thus can't comment on, except that
it looks like it's a Java library with a C++ imitator.  I don't know
how good the C++ version is.  The other two you mention, lemur and
terrier, are not available in Debian.  If we wanted to use them, they
would have to be packaged.  Also, Terrier appears to be written in
Java, so it's not an option.

  I don't have time to maintain a new library package myself, but there
are other people in Debian who work on IR and might be interested.  If
you can convince me that there's a truly huge improvement in search
quality from using lemur over what I'm using now, I could consider
switching.

  aptitude uses Xapian, but I am not heavily invested in that decision.
apt-xapian-index is mainly useful in that I didn't have to write the
code to build the index; other than that we use Xapian fairly directly.
I'm happy to consider switching tools as long as the proposed
alternative is practical (in the sense I described above) and has some
obvious benefit.  Having indexed searches for substrings would be one
example of an "obvious benefit" (users find it very confusing that
Xapian doesn't do this).
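
  To make "indexed substring search" concrete, here is a rough sketch of
one common approach, a trigram index over package names.  It is in
Python only to keep it short (aptitude itself is C++).  The function
names are made up for illustration, the package list is just sample
data, and none of this is Xapian's API.

```python
from collections import defaultdict

def trigrams(text):
    """Return the set of 3-character substrings of text (lowercased)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(names):
    """Map every trigram to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        for gram in trigrams(name):
            index[gram].add(name)
    return index

def substring_search(index, names, query):
    """Find names containing query, using the index to narrow candidates."""
    query = query.lower()
    grams = trigrams(query)
    if not grams:
        # Queries shorter than 3 characters have no trigrams, so this
        # sketch falls back to a linear scan.
        return sorted(n for n in names if query in n.lower())
    # A name can only match if it contains every trigram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Trigram overlap is necessary but not sufficient, so verify each one.
    return sorted(n for n in candidates if query in n.lower())

# Sample data only, chosen for illustration.
SAMPLE_NAMES = ["libxapian-dev", "python-xapian", "aptitude", "apt-xapian-index"]
SAMPLE_INDEX = build_index(SAMPLE_NAMES)

if __name__ == "__main__":
    print(substring_search(SAMPLE_INDEX, SAMPLE_NAMES, "xapi"))
```

  The trade-off is index size: every name contributes one entry per
distinct trigram, which is what buys the ability to match in the middle
of a word.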

  Another thing to consider is that I am not an expert in information
retrieval, although both aptitude and my paid job have some aspects of
it.  So you may have to explain things to me a little more thoroughly
than you explain them to your professor.

> when it comes to what i can add to this project i see that xapian has okapi
> algorithm i can try to improve ranking of results by
> 
> try all the possible things that affect ir performance tokenization stemming
> may be mistypings (python - ptyhon) AND OR specific boolean queries or
> differently weighted queries.

  As long as you can explain to me in small words what you're doing. ;-)
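
  As an illustration of the mistyping case (python / ptyhon), here is a
minimal sketch of an edit distance that counts a swap of two adjacent
letters as a single error (the "optimal string alignment" variant of
Damerau-Levenshtein).  The names osa_distance and suggest are
hypothetical, the vocabulary is sample data, and this is not a claim
about how Xapian or aptitude handle spelling.

```python
def osa_distance(a, b):
    """Edit distance with insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Two adjacent characters swapped count as one edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def suggest(term, vocabulary, max_distance=1):
    """Return vocabulary words within max_distance of term, closest first."""
    scored = sorted((osa_distance(term, word), word) for word in vocabulary)
    return [word for dist, word in scored if dist <= max_distance]

if __name__ == "__main__":
    # Sample vocabulary, for illustration only.
    print(suggest("ptyhon", ["python", "perl", "ruby"]))
```

  Comparing the query against every term like this is quadratic per
pair and linear in the vocabulary, so a real implementation would want
some index to limit the candidates first.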

> also i got good results in my experiments with lemur tf-idf model weighted
> with a modified okapi weighting function i can try different weighting
> algorithms.
> 
> before implementing a new ranking heuristic , first i wish to try the above
> i mentioned but you are the professionals and i would be glad to implement
> some different unigram language models for xapian

  I don't understand what you're proposing.  Are you saying you tried
the Xapian relevance function and it didn't work well?

> i believe that unigrams can give better results for small queries like
> package names and they arent so many fluctuating like in natural language
> words form different meanings.
> 
> "java sdk" -->  the probability the word "sdk" coming after java will be
> higher than java rails may be i can do this by modelling collection , here
> as my collection  filenames in the repository their explanations.

  I don't really understand these two paragraphs.  Could you explain
what you mean a little more clearly?  I see that you want to exploit
positional correlations between words, but not what you're going to do
with that information.
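
  For reference, one possible reading of the "java sdk" example is a
bigram estimate: count how often one word directly follows another in
the package descriptions and turn that into a smoothed conditional
probability.  The sketch below is only my guess at what is meant; the
function names and the toy descriptions are invented, and it says
nothing about how such a number would feed into ranking.

```python
from collections import Counter

def train_bigrams(texts):
    """Count single words and adjacent word pairs across all texts."""
    unigrams = Counter()
    bigrams = Counter()
    for text in texts:
        words = text.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def cond_prob(unigrams, bigrams, prev, nxt):
    """Estimate P(nxt | prev) with add-one (Laplace) smoothing.

    The count of prev as a single word stands in for the number of
    bigrams starting with prev, a common simplification.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, nxt)] + 1) / (unigrams[prev] + vocab_size)

# Toy descriptions, invented for illustration.
TOY_TEXTS = [
    "java sdk tools",
    "java sdk documentation",
    "java runtime",
    "ruby rails framework",
]

if __name__ == "__main__":
    uni, bi = train_bigrams(TOY_TEXTS)
    print(cond_prob(uni, bi, "java", "sdk"))
    print(cond_prob(uni, bi, "java", "rails"))
```

  With this toy data "sdk" comes out as more likely than "rails" after
"java", which matches the example in the quoted text; what the search
should then do with that number is the open question.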

> may be we can create two fields one for filename and other for explanations
> of the package and its job , then can combine these two fields in search.

  Unless there are really impressive benefits, changing the fields that
a package contains is going to be politically infeasible.  In fact,
there are a bunch of people over on -devel campaigning to remove what
little information about a package we currently give users, on the
grounds that it's a waste of disk space and bandwidth.  That doesn't
mean it's not possible to do this, but we'd have to start distributing
our own files like Enrico Zini does, and keep it up until it becomes so
obviously useful that the ftpmasters are willing to ship it.  I expect
this would take three to four years (assuming it really is that good).

  Speaking of Enrico, you should probably talk with him, and maybe with
Erich Schubert as well; they both seem to have an interest in IR and
they almost certainly know more about it than I do.

  I'm impressed by how much you know; if you can set a concrete,
achievable goal you should be able to do well.  Please remember that
we're supposed to conduct SoC as a software engineering exercise, not
a research project; that means we can experiment a little, but you
should plan ahead and have an alternative that can be implemented by
the deadline in case your experiment doesn't work out.  Also, I'll be
more inclined to go along with your experiment if I can understand what
it is and why it'll make things better. ;-)

  One other thing you should consider while planning your proposal /
project is how you're going to get an overall picture of the effect your
changes are having on search results.  Without this, you will end up
just making changes that you have a "hunch" will make things better,
when you could actually just be swimming in place or even moving
backwards.
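
  One simple way to get that overall picture, sketched below, is to fix
a set of test queries with hand-picked "right answers" and compute a
single score, such as mean reciprocal rank, before and after each
change.  Everything here is an assumption for illustration: the
function names, the queries, the judgments and both result lists are
made up, and the two "search" functions are stand-ins for real runs.

```python
def reciprocal_rank(results, relevant):
    """Return 1/rank of the first relevant result, or 0.0 if none appears."""
    for rank, name in enumerate(results, start=1):
        if name in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(search, judgments):
    """Average reciprocal rank of search(query) over all judged queries."""
    total = sum(reciprocal_rank(search(query), relevant)
                for query, relevant in judgments.items())
    return total / len(judgments)

# Invented judgments: query -> packages a person marked as good answers.
JUDGMENTS = {
    "web browser": {"firefox", "epiphany"},
    "mail client": {"mutt"},
}

# Invented result lists standing in for two versions of the ranking.
BASELINE = {
    "web browser": ["w3m-el", "firefox"],
    "mail client": ["mailutils", "procmail", "mutt"],
}
TWEAKED = {
    "web browser": ["firefox", "w3m-el"],
    "mail client": ["procmail", "mutt"],
}

if __name__ == "__main__":
    print("baseline:", mean_reciprocal_rank(lambda q: BASELINE[q], JUDGMENTS))
    print("tweaked: ", mean_reciprocal_rank(lambda q: TWEAKED[q], JUDGMENTS))
```

  The particular metric matters less than having the same fixed query
set for every run, so that a change in the score reflects a change in
the ranking and not a change in the test.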

  Daniel