[Aptitude-devel] Resolver status.

Daniel Burrows dburrows at debian.org
Mon Apr 20 15:09:32 UTC 2009


  So, here are the two pending optimizations for the aptitude resolver.

  The first one is a very standard optimization in constraint
satisfaction.  Currently, at every step, aptitude examines each
dependency that it still has to solve, computes the versions that will
solve that dependency, and examines each one to see if it's a legal
solution and not known to be a dead-end.  This is expensive, and it
gets more expensive the larger the search gets (larger searches
generally have more broken dependencies at any given step).

  The solution to this is to precompute and cache the list of possible
solutions to each dependency.  In fact, you can go a step further:
precompute and cache the list of *possible versions of each package*.
This requires generating more search nodes, since you have to account
for the fact that a dependency could be solved by package A or by
package B.  But the benefit is that if two dependencies require
non-overlapping versions of a package, you can detect the conflict
much earlier and throw those inconsistent branches out of the search.

  This is a somewhat intricate optimization (it essentially requires
rewiring the whole resolver to be trigger-based rather than based on
continually recomputing everything), but it should pay off hugely for
every single search that's performed.
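
  To make the trigger idea concrete, here's a minimal sketch (in C++,
since that's what aptitude is written in; all the names here are made
up for illustration, not aptitude's real types).  Each dependency
caches the set of versions that can still solve it, and a reverse
index maps each version to the dependencies whose caches mention it,
so ruling out a version triggers updates to exactly those
dependencies instead of a full rescan:

  // Hypothetical types; aptitude's real resolver structures are richer.
  #include <iostream>
  #include <map>
  #include <set>
  #include <vector>

  typedef int version_id;  // unique ID of one package version
  typedef int dep_id;      // unique ID of one dependency

  struct candidate_cache
  {
    // dependency -> versions that can still solve it
    std::map<dep_id, std::set<version_id> > candidates;
    // version -> dependencies whose cached list mentions it (triggers)
    std::map<version_id, std::set<dep_id> > triggers;

    void add_dependency(dep_id d, const std::vector<version_id> &solvers)
    {
      for(std::size_t i = 0; i < solvers.size(); ++i)
        {
          candidates[d].insert(solvers[i]);
          triggers[solvers[i]].insert(d);
        }
    }

    // Rule out one version (e.g., it was shown to lead to a dead-end).
    // Only the dependencies triggered by that version are touched; no
    // other dependency's cached list needs to be recomputed.
    void discard_version(version_id v)
    {
      const std::set<dep_id> &deps = triggers[v];
      for(std::set<dep_id>::const_iterator it = deps.begin();
          it != deps.end(); ++it)
        {
          candidates[*it].erase(v);
          if(candidates[*it].empty())
            std::cout << "dependency " << *it
                      << " has no remaining solvers\n";
        }
      triggers.erase(v);
    }
  };

  int main()
  {
    candidate_cache cache;
    std::vector<version_id> solvers;
    solvers.push_back(1);
    solvers.push_back(2);
    cache.add_dependency(100, solvers); // dep 100: solved by version 1 or 2
    cache.discard_version(1);
    cache.discard_version(2);           // prints that dep 100 is dead
    return 0;
  }

  The payoff is that a newly-unsolvable dependency is noticed at the
moment its cached list empties, rather than on the next full scan.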


  The second optimization is a bit more speculative.  I've observed
that when aptitude has several repositories to draw from, it needs to
work a lot harder to detect dead-ends.  The basic problem is that the
mechanism aptitude uses to record the dead-ends it has found stores
everything in disjunctive normal form (an OR of ANDs), so the global
knowledge it has is:

  Conflict = (p1 AND p25) OR (p4 AND p25) OR (p68 AND p25) OR ...

  Notice that we could instead factor this as:

  Conflict = (p25 AND (p1 OR p4 OR p68)) OR ...

  In fact, there are several places where aptitude could easily store
information about a dead-end as an OR rather than an AND.  For
instance, a dependency can't be solved if the installed version of its
target package is any version other than one that solves it.
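
  For example (hypothetical versions): if a dependency d can only be
solved by version b2 of package B, then the moment d's source is
installed, every *other* version of B is a dead-end, which is
naturally an OR-shaped fact:

  Conflict = (source of d) AND (b1 OR b3 OR b4 OR ...)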

  The problem is that representing and using this information is a
little tricky.  I have an idea for minimizing the amount of
computation these richer dead-ends impose, using a trigger-based
system (built on the same foundation as the first optimization,
actually).  But I'm a bit worried about redundant information getting
into the list of dead-ends and promotions.  With simple ANDs, each one
is a set, and it's fairly easy to write code that keeps only the most
general set when several overlap.  With AND/OR trees, even checking
two formulas for equivalence, never mind implication, is coNP-hard.
Oops.

  I plan to attack this problem with a two-pronged approach:

    (1) I think that I can implement a simple checker that quickly
        eliminates *identical* or almost-identical formulas, using the
        same sort of "sideways hashing" I do on solutions.  In this
        case I'd attach a "hash value" to the input variables (i.e.,
        choices / versions) by using their unique IDs, then apply a
        good generic hasher on integers at each level of the tree (see
        the first sketch after this list).

        Hopefully that will eliminate most of the redundancy.  If not,
        I have a second idea:

    (2) Be smarter about which promotions to store.  Right now every
        promotion we ever see is stored for all time, but it doesn't
        need to be: promotions can be ejected at any time without
        affecting correctness (they're just an optimization).  I want
        to start keeping track of which promotions are the most
        "productive", which will probably just be a count of how many
        solutions have actually had their tiers increased by each
        promotion.  When the set of promotions gets too large, the
        oldest and least productive ones will be thrown out to make
        room for new ones (see the second sketch after this list).

        This change can be seen as an *empirical* way of finding out
        which promotions are most generic.  If A implies B, then B is
        more general than A (since it might be true even when A is
        false), and this is reflected by the fact that B's hit count
        will be no lower than A's.  Or you could just look at it as a
        smart caching strategy.  Either way, I think it'll help. :-)
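
  Here's a minimal sketch of the hashing in (1) (again C++ with
made-up types; I'm assuming a toy AND/OR tree, and a simple
multiplicative hash stands in for "a good generic hasher on
integers").  Leaves hash by their unique choice ID; inner nodes
combine child hashes commutatively, so reorderings of the same
children still collide:

  #include <cstddef>
  #include <vector>

  enum node_type { LEAF, AND_NODE, OR_NODE };

  // Hypothetical formula representation, not aptitude's.
  struct formula
  {
    node_type type;
    int choice_id;                    // meaningful only for LEAF
    std::vector<formula *> children;  // meaningful only for AND/OR
  };

  std::size_t hash_formula(const formula *f)
  {
    if(f->type == LEAF)
      // Leaves hash by the unique ID of the choice/version they name.
      return static_cast<std::size_t>(f->choice_id) * 2654435761u;

    // Seed AND and OR differently so (a AND b) and (a OR b) differ.
    std::size_t h = (f->type == AND_NODE) ? 0x9e3779b9u : 0x7f4a7c15u;

    // Summing the (multiplied) child hashes makes the combine
    // order-independent, so commutative rearrangements of the same
    // formula hash identically.
    for(std::size_t i = 0; i < f->children.size(); ++i)
      h += hash_formula(f->children[i]) * 0x85ebca6bu;

    return h;
  }

  Two formulas with equal hashes still have to be compared
structurally, of course; the hash just makes the common case (no
match) cheap.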
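
  And a sketch of the bookkeeping in (2) (hypothetical names again;
the real promotion type would carry the actual formula and tier
rather than a string key).  Each stored promotion counts its hits;
when the pool is full, the least productive entry is ejected,
preferring the oldest on ties:

  #include <cstddef>
  #include <map>
  #include <string>

  struct promotion_pool
  {
    struct entry
    {
      int hits;          // solutions whose tier this promotion raised
      long inserted_at;  // insertion order, as a cheap notion of age
    };

    std::map<std::string, entry> pool;  // promotion key -> stats
    std::size_t capacity;
    long clock;

    explicit promotion_pool(std::size_t cap) : capacity(cap), clock(0) {}

    void insert(const std::string &key)
    {
      if(pool.size() >= capacity && pool.find(key) == pool.end())
        evict_one();
      entry e;
      e.hits = 0;
      e.inserted_at = clock++;
      pool[key] = e;
    }

    // Called whenever this promotion actually raises a solution's tier.
    void record_hit(const std::string &key)
    {
      std::map<std::string, entry>::iterator it = pool.find(key);
      if(it != pool.end())
        ++it->second.hits;
    }

    // Eject the least productive promotion; on a tie, the oldest.
    // This never affects correctness, only how much pruning we get.
    void evict_one()
    {
      if(pool.empty())
        return;
      std::map<std::string, entry>::iterator victim = pool.begin();
      for(std::map<std::string, entry>::iterator it = pool.begin();
          it != pool.end(); ++it)
        if(it->second.hits < victim->second.hits ||
           (it->second.hits == victim->second.hits &&
            it->second.inserted_at < victim->second.inserted_at))
          victim = it;
      pool.erase(victim);
    }
  };

  In cache terms this is just least-frequently-used eviction with an
oldest-first tie-break, which matches the "oldest and least
productive" rule above.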

  Daniel


