[Babel-users] [BUG] Route "deadlocks" under load due to non-atomic kernel route updates

Kirill Smelkov kirr at nexedi.com
Sun Jun 12 17:27:06 UTC 2016


( +iv, Nicolas's address corrected )

Dear Juliusz, Dave, thanks for your replies.

First of all I'd like to say I'm new to routing & friends, but I'll try
to provide feedback:

On Fri, Jun 10, 2016 at 08:47:34PM +0200, Juliusz Chroboczek wrote:
> Dear Kirill,
> 
> Thank you very much for the detailed analysis.

You are welcome.

> If I read you correctly, this looks like a kernel bug: incorrect
> invalidation of the route cache.  While we have seen some similar bugs in
> earlier kernel versions, they were not triggered by something that
> simple -- you needed to do some non-trivial rule manipulation in order to
> trigger them.

Initially I too thought this was incorrect invalidation of the kernel
route cache - i.e. some cloned routes were created, and on new route
addition the route-add procedure somehow logically missed a clone, e.g.
because it sits in some other subtree or something like that.

What we have here is of a different kind - an inherent race condition
inside the kernel: after a route lookup a cloned route is born while the
table lock is not held, and then the kernel tries to insert the clone into
the FIB. Yes, there is a check: if some other clone with the same /128
address is already in the FIB (potentially added in between, while the
table lock was not held), the whole lookup is retried.

But the absence of another same-address /128 clone does not mean the
corresponding real route could not have changed while the table lock was
not held.
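
Schematically the bad interleaving is roughly this (a simplified sketch;
the precise version is the timing diagram in my original email):

    ---- 8< ----
    thread A (route lookup)             thread B (route change)

    lock table
    lookup dst
      -> finds covering unreachable
         /48 route
    unlock table
                                        lock table
                                        add more-specific /64 route
                                          covering dst
                                        unlock table
    lock table
    no other /128 clone for dst in FIB
      -> no retry
    insert /128 clone of the (now
      stale) unreachable route
    unlock table
    ---- 8< ----

From then on the stale unreachable /128 cache entry shadows the new /64
route for that destination.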

( To me it looks computationally expensive, at least with a straightforward
  implementation, to check whether a newly installed cloned route should be
  invalidated by some other route installation - as the table has to be
  scanned for other routes that would match. Imho the best way to deal with
  this is not to have a route cache at all - like Linux already does for
  IPv4, and like it is now in 95% of the cases with the Facebook patches
  for IPv6 (kernel >= 4.2). )


> What is more -- I believe that babeld is using the same procedure as
> Quagga and Bird.  Do you understand why Quagga and Bird are not seeing the
> same issues ?

On Sat, Jun 11, 2016 at 11:26:48AM -0700, Dave Taht also wrote:
> Quagga, at least, switched to atomic updates some time ago, I think.
> 
> http://patchwork.quagga.net/patch/1234/


First of all, I tend to think that in Re6stnet links change more
frequently than under usual conditions, and the probability of hitting the
race grows with the rate of route changes and with traffic. I cannot say
we have really high traffic on lab.nexedi.com, but the site is constantly
being pulled by our bots requesting raw blob contents from repositories,
so let's say we have 15-30-50 requests/second all the time as background,
plus traffic when humans use the site.

Then, if there were no network-wide unreachable route (unreachable
2001:67c:1254::/48 in my original mail), the unreachable cache entry would
_not_ be created, as cache entries are created only if the route lookup
finds some entry in the FIB, not upon "entry not found". I tend to think
many setups may not have a network-wide unreachable route, but I'm not
sure about this.

Regarding Quagga and Bird: I have not used them at all, but after a quick
glance I can see:

Quagga (like Dave already said) has used atomic route updates since 2016:

  http://git.savannah.gnu.org/cgit/quagga.git/tree/zebra/rt_netlink.c?h=quagga-1.0.20160315-12-g5f67888#n1688
  http://git.savannah.gnu.org/cgit/quagga.git/tree/zebra/rt_netlink.c?h=quagga-1.0.20160315-12-g5f67888#n1870
  http://git.savannah.gnu.org/cgit/quagga.git/commit/?id=0abf6796


Regarding Bird:

  it started using NLM_F_REPLACE long ago
  
    https://gitlab.labs.nic.cz/labs/bird/commit/2253c9e2
  
  but stopped doing so in 2009.
  
    https://gitlab.labs.nic.cz/labs/bird/commit/51f4469f
  
  I have not yet looked into the details of how NLM_F_REPLACE works, but
  regarding Bird this email might clarify things a bit:

    http://bird.network.cz/pipermail/bird-users/2015-August/009854.html

  Once again, I do not yet know how NLM_F_REPLACE works, but I hope it
  can be clarified and used correctly.
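
For illustration only - as far as I understand, iproute2's `ip route
replace` issues a single RTM_NEWROUTE with NLM_F_REPLACE set instead of a
del + add pair, so there is no window during which the route is absent:

    ---- 8< ----
    # non-atomic: two netlink operations; the race window is in between
    ip -6 route del 2222:3333:4444:5555::/64 dev dum0
    ip -6 route add 2222:3333:4444:5555::/64 dev dum0

    # atomic: one RTM_NEWROUTE with NLM_F_REPLACE
    ip -6 route replace 2222:3333:4444:5555::/64 dev dum0
    ---- 8< ----

If that understanding is correct, the del + add pair in my reproducer
below could be collapsed the same way.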


> While I have no objection to switching to a different API for manipulating
> routes, I'd like to first make sure that we understand what's going on here.

On Sat, Jun 11, 2016 at 11:26:48AM -0700, Dave Taht also wrote:
> I strongly approve of atomic updates and fixing what, if anything,
> that breaks...
> 
> I have seen oddities in unreachable p2p routes for years now. I've
> suspected a variety of causes - notably getting a icmp route
> unreachable before babel could make the switch, but have never tracked
> it down. Some of the work I'm doing now could be leveraged to try and
> make it happen more often, but a few more pieces on top of this
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg114172.html
> 
> need to land before I can propagate all the right pieces to the testbed.

Regarding making sure we understand what is going on here: yes. And I
think I've described it quite precisely - there is a race between IPv6
route lookups and route changes: a cloned route can be created from a
route table state that existed some time ago and is potentially different
from the current one.

In my original email I've tried to show this precisely with a timing
diagram for two threads doing a route change and a route lookup.

Please also see below for a program which demonstrates this bug reliably
with just one local host.

> Oh -- and are you running a stock kernel, or one locally patched?  Can you
> reproduce the issue on a pristine, recent kernel?

We are running pristine latest Debian stable kernels in production. In
particular, the issue shows itself with e.g. 3.16.7-ckt25-2 (2016-04-08).

I've run a locally patched kernel only on my notebook, where I tried to
understand the issue better with tracing.

I've prepared a program

    https://lab.nexedi.com/kirr/iproute2/blob/bd480e66/t/rtcache-torture
    (also attached to this email)

which reproduces the problem within several minutes on just one computer.
I have retested it locally: I can reliably reproduce the issue on pristine
Debian 3.16.7-ckt25-2 (on both Atom and Core2 notebooks) and on pristine
3.16.35 on Atom (compiled by me, since the Debian kernel team has not yet
uploaded 3.16.35 to Jessie).

It is always the same: the issue reproduces reliably within several
minutes, and the failure looks like e.g.

     ---- 8< ----
     root at mini:/home/kirr/src/tools/net/iproute2/t# time ./rtcache-torture 
     PING 2222:3333:4444:5555::1(2222:3333:4444:5555::1) 56 data bytes
     E.E.E.....E......E..E............E...E..
     <more output from ping>
     
     BUG: Linux mini 3.16.35-mini64 #14 SMP PREEMPT Sun Jun 12 19:41:09 MSK 2016 x86_64 GNU/Linux
     BUG: Got unexpected unreachable route for 2222:3333:4444:5555::1:
     unreachable 2222:3333:4444:5555::1 from :: dev lo  src 2001:67c:1254:20::1  metric 0 \    cache  error -101
     
     route table for root 2222:3333:4444::/48
     ---- 8< ----
     unicast 2222:3333:4444:5555::/64 dev dum0  proto boot  scope global  metric 1024 
     unreachable 2222:3333:4444::/48 dev lo  proto boot  scope global  metric 1024  error -101
     ---- 8< ----
     
     route for 2222:3333:4444:5555::1 (once again)
     unreachable 2222:3333:4444:5555::1 from :: dev lo  src 2001:67c:1254:20::1  metric 0 \    cache  error -101 users 1 used 4
     
     real    0m49.938s
     user    0m4.488s
     sys     0m5.872s
     ---- 8< ----

The issue should not show itself with kernels >= 4.2, because there the
lookup procedure does not take the table lock twice, and /128 cache
entries are not routinely created (they are created only upon a PMTU
exception).

I'm running Debian testing on my development machine. Currently it has
4.5.5-1 (2016-05-29). I can confirm that /128 route cache entries are
not created there just because a route was looked up.
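
( A quick way to check this, assuming `ip -6 route show cache` lists
  cloned entries on your iproute2 version:

    ---- 8< ----
    ip -6 route flush cache
    ip -6 route get <dst>       # performs a route lookup for <dst>
    ip -6 route show cache      # on >= 4.2 no /128 entry appears for <dst>
    ---- 8< ----
)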

Kirill


---- 8< ---- (rtcache-torture)
#!/bin/sh -e
# torture test for the IPv6 RT cache, trying to hit the race between lookup/cache-add & route add
# http://lists.alioth.debian.org/pipermail/babel-users/2016-June/002547.html


tprefix=2222:3333:4444      # "whole-network" prefix for tests  /48
tsubnet=$tprefix:5555       # subnetwork for which "to" route will be changed   /64
taddr=$tsubnet::1           # test address on $tsubnet

# setup for tests:

# dum0 dummy device
ip link del dev dum0 2>/dev/null || :
ip link add dum0 type dummy
ip link set up dev dum0

# clean route table for tprefix with only unreachable whole-network route
ip -6 route flush root $tprefix::/48
ip -6 route add unreachable $tprefix::/48
ip -6 route flush cache

ip -6 route add $tsubnet::/64 dev dum0


# put a lot of requests to rt/rtcache getting route to $taddr
trap 'kill $(jobs -p)' EXIT
rtgetter() {
    # NOTE we cannot do this with `ip route get ...` in a loop, as `ip route
    # get` first takes the RTNL lock, and thus would be completely serialized
    # with e.g. route add and del.
    #
    # Ping, like other usual connect/tx activity, works without RTNL held.
    exec ping6 -n -f $taddr
}
rtgetter &

# do route del/add in a busy loop;
# after each route add: check that the route for $taddr is not unreachable
while true; do
    ip -6 route del $tsubnet::/64 dev dum0
    ip -6 route add $tsubnet::/64 dev dum0
    r=`ip -6 -d -o route get $taddr`
    if echo "$r" | grep -q unreachable ; then
        echo
        echo
        echo BUG: `uname -a`
        echo BUG: Got unexpected unreachable route for $taddr:
        echo "$r"
        echo
        echo "route table for root $tprefix::/48"
        echo "---- 8< ----"
        ip -6 -d -o route show root $tprefix::/48
        echo "---- 8< ----"
        echo
        echo "route for $taddr (once again)"
        ip -6 -d -o -s -s -s route get $taddr
        exit 1
    fi
done


