[Babel-users] Fixes to route selection

Tue Mar 11 18:01:39 UTC 2008

Dear all,

Lilian has found a case in which Babel 0.9 would delay reconvergence
by 30 seconds.  I've now fixed this case, and reworked the route
selection algorithm somewhat; convergence speed is much improved.

Since I'd like to release 0.10 pretty soon, I'm very much interested
in any data about the behaviour of the current Darcs tree that you may
have.

Since I'm in a chatty mood, I'll give a few details about the issue.
(By the way, Lilian, my analysis of the problem on IRC was completely
wrong.)

A (proactive) routing protocol has three tasks:

 - neighbour sensing: measuring link quality between neighbours;
 - route discovery: discovering the set of available routes;
 - route selection: choosing a particular route to every destination.

In many routing protocols, the three tasks are mixed up.  For example,
both RIP and B.A.T.M.A.N. mix up neighbour sensing and route discovery,
which is why in RIP you cannot tune the hello interval separately from
the updte interval.

Babel was designed as a modular protocol: the three stages of the
protocol are, to the extent possible, independent.  Of particular
interest is the independence of route selection and route discovery:
the protocol remains correct if nodes do not always choose the route
with smallest metric, but choose ``worse'' routes under some
circumstances (as is usually the case with distance-vector protocols,
but not with link-state protocols).  This makes it possible to
implement route selection policies that choose routes according to
criteria richer than just link quality.

Now, when I started working on Babel, I had just experimented with
OLSR, and found that it has serious problems with route stability:
under some circumstances, it flaps between two parallel routes to
a given destination.  Route flapping is ungood: it can cause packet
reordering, and it usually causes jitter (for example due to an extra
ARP exchange) and sometimes packet loss.

Because of that, Babel's route selection policy tries to promote route
stability.  The main stability mechanism is a simple form of
hysteresis: we don't switch routes unless the new route is
significantly better than the old route.  (The exact amount of
hysteresis used depends on whether we also switch router-id, and is
something that needs more experimentation.)

The second route stability mechanism is much simpler: a route is
regarded with great suspicion unless it has been around for 30
seconds.  Unfortunately, this mechanism turned out to be buggy, and to
interact badly with some other features of the protocol (notably the
feasibility condition), which caused perfectly good routes to be
disregarded for 30 seconds in Lilian's network topology.

After some thought, I've simply removed the 30 second condition; the
only route stability mechanism remaining is hysteresis, which appears
to do The Right Thing in practice.

In case you want to check the code, all of this happens in route.c, in
the functions find_best_route (which finds the candidate route for
a given destination) and consider_route (which decides whether to
switch to a new, potentially ``better'' route).

While I was at it, I reworked some of the constants in the protocol,
notably by making it send multi-hop requests and triggered updates
more aggressively; this makes the protocol somewhat more chatty, but
very slightly so.  All together, you should see much faster convergence
with 0.10 than with 0.9.

                                        Juliusz