[Babel-users] [BUG] Route "deadlocks" under load due to non-atomic kernel route updates

Thu Jun 16 20:47:36 UTC 2016

On Thu, Jun 16, 2016 at 1:40 PM, Kirill Smelkov <kirr at nexedi.com> wrote:
> On Thu, Jun 16, 2016 at 08:38:49AM -0700, Dave Taht wrote:
>> On Thu, Jun 16, 2016 at 4:17 AM, Kirill Smelkov <kirr at nexedi.com> wrote:
>> > On Wed, Jun 15, 2016 at 12:56:34PM +0200, Juliusz Chroboczek wrote:
>> >> >> If I read you correctly, this looks like a kernel bug: incorrect
>> >> >> invalidation of the route cache.
>> >>
>> >> [...]
>> >>
>> >> > What we have here is of another kind - it is inherent race condition
>> >> > inside kernel
>> >>
>> >> Perhaps I'm confused, but it still looks like a kernel bug to me.
>> >
>> > Yes, it is a kernel bug. But in a sense it is so old and so widespread
>> > that it has to be cared about in userspace - as with atomic route
>> > updates we do not hit it.
>> >
>> > Also: atomic route updates are needed not only for avoiding this bug.
>> > Another reason is: with a routedel & routeadd pair, even if the state
>> > of the cache is correct after routeadd, a packet destined to that
>> > route that reaches the node in the time between del & add hits the
>> > 'unreachable' route case.
>> >
>> > For usual packets it is only "packet lost" and TCP probably retransmits.
>> > But for SYN packets, i.e. when a connection is about to be established,
>> > an ICMP error is returned, which results in a "host unreachable" error
>> > on the originator side.
>>
>> Yes this variant of the bug is still there, essentially, and it bugs me.
>>
>> (btw the facebook page you pointed to, about the fixes they did, was
>> fascinating - they have "interesting problems" - like dealing with
>> 1+m routes in their route table)
>>
>> one day a year, for several years now, I get sufficiently irked about
>> the atomic update problem in babel to refresh my knowledge of netlink,
>> hack babel all to hell, and have nothing work. I left myself a bunch
>> more breadcrumbs last night in my hacked up babel version, as to what
>> I tried and what it did wrong... (because I'm actually also chasing
>> another bug which I'll put up in another message)....
>>
>> But:
>>
>> Why doing the equivalent of this (and understanding how it does it)
>>
>> ip -6 route add fd99::33/128 via fe80::120d:7fff:fe64:c992 dev eno1
>> ip -6 route replace fd99::33/128 via fe80::120d:7fff:fe64:c991 dev wlp2s0
>>
>> is so hard for me to figure out - that I don't understand. But it
>> seems to require completely tracing through the ip route code, and
>> writing a decoder for the netlink packets created, to figure out why
>> what I thought would be an equivalent for babel does not work, and
>> taking the week or more to do it...
>>
>> -- look! Squirrel!
>
> Dave, maybe this might help you: Wireshark (not tcpdump) has a decoder
> for netlink route packets:
>
> https://code.wireshark.org/review/gitweb?p=wireshark.git;a=blob;f=epan/dissectors/packet-netlink-route.c;hb=v2.1.1rc0-170-gc269684

Groovy. Thank you. I did not know.

In discussing this with shemminger this morning, he pointed out there
was a semantic difference between how routes can be replaced in ipv6
and ipv4.

At *one point* last night I thought I'd successfully got ipv6 to
atomic replace, but it had failed on ipv4 - so I will revisit the work
soon, brain cells and time willing.
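
For reference, a minimal shell sketch of the two ways of moving a route
to a new next hop: one "replace" request versus a del/add pair. The
helper names (run, swap_nexthop, swap_nexthop_racy) and the DRY_RUN
variable are made up for illustration; only the ip commands themselves
come from the discussion above. It runs as root unless DRY_RUN=1 is set,
and it says nothing about the ipv4 vs ipv6 difference, which I have not
verified.

```shell
#!/bin/sh
# Sketch only: swap a route's next hop with a single "replace" request
# instead of a del/add pair. Needs root unless DRY_RUN=1.

# run CMD...: execute a command, or just print it when DRY_RUN=1
# (hypothetical helper, not part of iproute2).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# swap_nexthop PREFIX GATEWAY DEV  (hypothetical helper)
# Issues one `ip -6 route replace`: the route is created if it is
# missing and replaced if it is already there.
swap_nexthop() {
    run ip -6 route replace "$1" via "$2" dev "$3"
}

# swap_nexthop_racy PREFIX GATEWAY DEV  (hypothetical helper)
# The del/add pair, for contrast: between the two commands the prefix
# has no route of its own, so lookups fall through to whatever covers it.
swap_nexthop_racy() {
    run ip -6 route del "$1" || :
    run ip -6 route add "$1" via "$2" dev "$3"
}
```

With DRY_RUN=1, `swap_nexthop fd99::33/128 fe80::120d:7fff:fe64:c991 wlp2s0`
only prints the `ip -6 route replace ...` command it would have run,
which makes it easy to compare against what shows up on nlmon.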

> so you can create a virtual netlink monitor interface - something along
> the lines of
>
> modprobe nlmon
> ip link add type nlmon
> ip link set nlmon0 up
>
> ( see more details in e.g. https://patchwork.ozlabs.org/patch/259444/ )
>
> and see the actual packets exchanged between iproute and kernel.
>
> Also: there is pyroute2 (https://github.com/svinota/pyroute2) which has a debug
> decoder for netlink packets, but out of the box you have to specify packet type
> explicitly:
>
> https://github.com/svinota/pyroute2/blob/master/docs/debug.rst
>
> Maybe you already know all this, but I decided to provide info anyway to make
> sure it is not missed, because you mentioned it is hard for you to understand
> what is going on underneath `ip -6 ...`
>
> Hope this might help,
> Kirill
>
>
>> >> Perhaps it would make sense to speak to netdev about that?
>> >
>> > Yes, makes sense. Though as this particular case is not present on 4.2+
>> > kernels, people on netdev will probably have less interest in looking
>> > into it.
>> >
>> > I will see what can be done.
>> >
>> >> > Quagga, at least, switched to atomic updates some time ago, I think.
>> >> >
>> >> > http://patchwork.quagga.net/patch/1234/
>> >>
>> >> I see.  I'm busy right now, but I'll be grateful for a patch.
>> >
>> > I will see about this. Thanks for feedback.
>> >
>> >
>> > On Wed, Jun 15, 2016 at 07:35:05PM -0700, Dave Taht wrote:
>> >> >     https://lab.nexedi.com/kirr/iproute2/blob/bd480e66/t/rtcache-torture
>> >> >     (also attached to this email)
>> >> >
>> >> > which reproduces the problem in several minutes just on one computer and
>> >> > retested it locally: I can reliably reproduce the issue on pristine
>> >> > Debian 3.16.7-ckt25-2 (on both Atom and Core2 notebooks) and on pristine
>> >> > 3.16.35 on Atom (compiled by me, since Debian kernel team has not yet
>> >> > uploaded 3.16.35 to Jessie).
>> >>
>> >> I have been running this script on four different machines for hours
>> >> now without reproducing your bug on the 4.4 or later kernels. It does
>> >> trigger on a 3.14 kernel. (it helps to do a killall fping6 before
>> >> exiting!)
>> >>
>> >> It does not seem to be happening on 4.4 or later. At one level, I'm
>> >> relieved - one last babel bug to worry about in openwrt (now 4.4
>> >> based), although one of the platforms I work on is still stuck at
>> >> 3.18, as is the 3.14 c2 (for now).
>> >>
>> >> At another level I still really, really, really wanted atomic updates
>> >> in babel, and was clearing the decks to make a run at the right
>> >> netlink stuff when I'd decided to confirm your bug existed or not in
>> >> my kernels. :(. Weirdly demotivating.
>> >>
>> >>
>> >> d@dancer:~/bin$ ssh root@pi3 uname -a
>> >> Linux pi3 4.4.12-v7+ #892 SMP Thu Jun 2 15:41:19 BST 2016 armv7l GNU/Linux
>> >> d@dancer:~/bin$ ssh root@pi2 uname -a
>> >> Linux pi2 4.4.12-v7+ #892 SMP Thu Jun 2 15:41:19 BST 2016 armv7l GNU/Linux
>> >> d@dancer:~/bin$ uname -a
>> >> Linux dancer 4.5.0-rc7-fqfi #1 SMP PREEMPT Mon Mar 7 16:04:17 PST 2016 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >> ...
>> >>
>> >> The odroid C2 has the bug.
>> >>
>> >> d@dancer:~/bin$ ssh root@c2 uname -a
>> >> Linux c2 3.14.29-56 #1 SMP PREEMPT Wed Apr 20 12:15:54 BRT 2016 aarch64 aarch64 aarch64 GNU/Linux
>> >>
>> >> BUG: Got unexpected unreachable route for 2226:3333:4444:5555::1: # I'd changed the number
>> >> unreachable 2226:3333:4444:5555::1 from :: dev lo  src fd99::2  metric 0 \    cache  error -101
>> >>
>> >> route table for root 2226:3333:4444::/48
>> >> ---- 8< ----
>> >> unicast 2226:3333:4444:5555::/64 dev dum0  proto boot  scope global  metric 1024
>> >> unreachable 2226:3333:4444::/48 dev lo  proto boot  scope global  metric 1024  error -101
>> >> ---- 8< ----
>> >>
>> >> route for 2226:3333:4444:5555::1 (once again)
>> >> unreachable 2226:3333:4444:5555::1 from :: dev lo  src fd99::2  metric 0 \    cache  error -101 users 1 used 3
>> >
>> > Dave, thanks for confirming and for feedback about this.
>> >
>> > Yes, 4.2+ kernels should not have this _particular_ bug, because
>> > https://git.kernel.org/linus/45e4fd26 reworks ip6_pol_route() for above
>> > tested case to not lock the route table twice and not to create /128
>> > cache entries on lookup when there is a gateway.
>> >
>> > BUT
>> >
>> > The route cache for IPv6 is still there in new kernels, and sometimes
>> > cache entries are created. E.g. this happens on PMTU exception, but also
>> > for lookups without gateway when the associated flow has
>> > FLOWI_FLAG_KNOWN_NH set (I don't know yet what it is, but still):
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/ipv6/route.c?id=v4.7-rc3-55-gd325ea8#n1089
>> >
>> > etc.
>> >
>> > So _related_ problems should be there. They are probably just less
>> > easily reproducible and happen less often. I have not looked into
>> > further details though...
>> >
>> > And also: as shown above it is better to have atomic route updates even
>> > without cache issues, so that SYNs are not occasionally rejected during
>> > a route update.
>> >
>> > So Dave, please keep up your motivation for fixing this if you were
>> > going to eventually do so.
>> >
>> > Thanks,
>> > Kirill
>> >
>> > P.S.
>> >
>> >> (it helps to do a killall fping6 before exiting!)
>> >
>> > There is
>> >
>> >     trap 'kill $(jobs -p)' EXIT
>> >
>> > it does not work?
>> >
>> >
>> >> > It is always the same: the issue reproduces reliably in several minutes.
>> >> > And it looks like e.g.
>> >> >
>> >> >      ----- 8< ----
>> >> >      root@mini:/home/kirr/src/tools/net/iproute2/t# time ./rtcache-torture
>> >> >      PING 2222:3333:4444:5555::1(2222:3333:4444:5555::1) 56 data bytes
>> >> >      E.E.E.....E......E..E............E...E..
>> >> >      <more output from ping>
>> >> >
>> >> >      BUG: Linux mini 3.16.35-mini64 #14 SMP PREEMPT Sun Jun 12 19:41:09 MSK 2016 x86_64 GNU/Linux
>> >> >      BUG: Got unexpected unreachable route for 2222:3333:4444:5555::1:
>> >> >      unreachable 2222:3333:4444:5555::1 from :: dev lo  src 2001:67c:1254:20::1  metric 0 \    cache  error -101
>> >> >
>> >> >      route table for root 2222:3333:4444::/48
>> >> >      ---- 8< ----
>> >> >      unicast 2222:3333:4444:5555::/64 dev dum0  proto boot  scope global  metric 1024
>> >> >      unreachable 2222:3333:4444::/48 dev lo  proto boot  scope global  metric 1024  error -101
>> >> >      ---- 8< ----
>> >> >
>> >> >      route for 2222:3333:4444:5555::1 (once again)
>> >> >      unreachable 2222:3333:4444:5555::1 from :: dev lo  src 2001:67c:1254:20::1  metric 0 \    cache  error -101 users 1 used 4
>> >> >
>> >> >      real    0m49.938s
>> >> >      user    0m4.488s
>> >> >      sys     0m5.872s
>> >> >      ---- 8< ----
>> >> >
>> >> > The issue should not show itself with kernels >= 4.2, because there the
>> >> > lookup procedure does not take table lock twice, and /128 cache entries
>> >> > are not routinely created (they are created only upon PMTU exception).
>> >> >
>> >> > I'm running Debian testing on my development machine. Currently it has
>> >> > 4.5.5-1 (2016-05-29). I can confirm that /128 route cache entries are
>> >> > not created there just because a route was looked up.
>> >> >
>> >> > Kirill
>> >> >
>> >> >
>> >> > ---- 8< ---- (rtcache-torture)
>> >> > #!/bin/sh -e
>> >> > # torture for IPv6 RT cache, trying to hit the race between lookup,cache-add & route add
>> >> > # http://lists.alioth.debian.org/pipermail/babel-users/2016-June/002547.html
>> >> >
>> >> >
>> >> > tprefix=2222:3333:4444      # "whole-network" prefix for tests  /48
>> >> > tsubnet=$tprefix:5555       # subnetwork for which "to" route will be changed   /64
>> >> > taddr=$tsubnet::1           # test address on $tsubnet
>> >> >
>> >> > # setup for tests:
>> >> >
>> >> > # dum0 dummy device
>> >> > ip link del dev dum0 2>/dev/null || :
>> >> > ip link add dum0 type dummy
>> >> > ip link set up dev dum0
>> >> >
>> >> > # clean route table for tprefix with only unreachable whole-network route
>> >> > ip -6 route flush root $tprefix::/48
>> >> > ip -6 route add unreachable $tprefix::/48
>> >> > ip -6 route flush cache
>> >> >
>> >> > ip -6 route add $tsubnet::/64 dev dum0
>> >> >
>> >> >
>> >> > # put a lot of requests to rt/rtcache getting route to $taddr
>> >> > trap 'kill $(jobs -p)' EXIT
>> >> > rtgetter() {
>> >> >     # NOTE we cannot do this with `ip route get ...` in a loop, as `ip route
>> >> >     # get` first takes RTNL lock, and thus will be completely serialized with
>> >> >     # e.g. route add and del.
>> >> >     #
>> >> >     # Ping, like other usual connect/tx activity, works without RTNL held.
>> >> >     exec ping6 -n -f $taddr
>> >> > }
>> >> > rtgetter &
>> >> >
>> >> > # do route del/route add in a busy loop;
>> >> > # after route add: check that route get $taddr is not unreachable
>> >> > while true; do
>> >> >     ip -6 route del $tsubnet::/64 dev dum0
>> >> >     ip -6 route add $tsubnet::/64 dev dum0
>> >> >     r=`ip -6 -d -o route get $taddr`
>> >> >     if echo "$r" | grep -q unreachable ; then
>> >> >         echo
>> >> >         echo
>> >> >         echo BUG: `uname -a`
>> >> >         echo BUG: Got unexpected unreachable route for $taddr:
>> >> >         echo "$r"
>> >> >         echo
>> >> >         echo "route table for root $tprefix::/48"
>> >> >         echo "---- 8< ----"
>> >> >         ip -6 -d -o route show root $tprefix::/48
>> >> >         echo "---- 8< ----"
>> >> >         echo
>> >> >         echo "route for $taddr (once again)"
>> >> >         ip -6 -d -o -s -s -s route get $taddr
>> >> >         exit 1
>> >> >     fi
>> >> > done

-- 
Dave Täht
Let's go make home routers and wifi faster! With better software!
http://blog.cerowrt.org