Bug#396937: [pkg-ntp-maintainers] Bug#396937: Backgrounded ntpdate from ifup races with hwclock

Sat Nov 4 15:03:13 CET 2006

Re Kurt et al,

On Sat, Nov 04, 2006 at 12:47:53PM +0100, Kurt Roeckx wrote:
> On Sat, Nov 04, 2006 at 10:32:13AM +0100, Andre Beck wrote:
> > > 
> > > ntpdate should never adjust the clock wrong by an hour, it should set it
> > > correct.
> > 
> > Yep, but it does. I've had the proper loop in rc equipped with a date(1)
> > call after every single init script, which revealed that time was wrong
> > (by misinterpretation of the CMOS clock as UTC) in the whole boot process
> > until S50hwclock.sh fixed it (which up to this is expected behavior). Both
> > the output from that script (I even let it run -xv) and the date(1)
> > immediately following it showed correct time.
> 
> I think you might be right.  ntpdate tries to find an offset between the
> clock and what it thinks is the correct time, and then either steps or
> adjusts it depending on how big the difference is.
> 
> So it finds an offset, gets the current clock, adds the offset, and sets
> the new time.  If the clock is adjusted by something else between the
> time it received the packets on which it based the offset and the time
> it tries to set it, we have a problem.

Yep, IMO that happened. I haven't read the sources of ntpdate, but I assume
there would be a way to protect it from doing something like this. When it
is talking to the NTP servers, it will probably gettimeofday(2) for every
packet sent and received, and if it observes any anomaly in these timestamps
(either nonmonotonic changes or forward jumps larger than some threshold,
which should be chosen less than or equal to the step limit obeyed by ntpd),
it should either just bail out with an error message or start over the
entire sequence. It will then do a read-modify-write, and should check the
value it read again for plausibility. Only then would it apply the offset
and write back to the kernel. The time interval for the remaining race
condition would be very small and cannot be removed unless there were an
atomic get-and-set-time system call, which doesn't exist.
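
To make the idea a bit more concrete, here is a rough sketch in Python
(not ntpdate's C, and not based on its sources). Instead of inspecting
individual gettimeofday(2) timestamps, it compares the wall clock against
a monotonic clock, which is a substitute technique I'm assuming would do
the same job. ClockWatch, apply_offset and STEP_LIMIT are made-up names,
and the 0.128 s value is only my assumption for a sensible threshold
(ntpd's default step threshold).

```python
import time

# Assumption: mirror ntpd's default step threshold of 128 ms.
STEP_LIMIT = 0.128


class ClockWatch:
    """Notices when somebody else steps the wall clock (hypothetical helper)."""

    def __init__(self, wall=time.time, mono=time.monotonic):
        self._wall, self._mono = wall, mono
        # Remember both clocks at the start of the measurement.
        self._wall0, self._mono0 = wall(), mono()

    def drift(self):
        """Wall-clock change that elapsed monotonic time does not explain."""
        return (self._wall() - self._wall0) - (self._mono() - self._mono0)

    def stepped(self, limit=STEP_LIMIT):
        return abs(self.drift()) > limit


def apply_offset(offset, watch, read_clock, set_clock):
    """Read-modify-write with a plausibility check right before the write."""
    if watch.stepped():
        # The clock was changed under us: bail out (or start over).
        return False
    # A tiny race window remains between the check and the write.
    set_clock(read_clock() + offset)
    return True
```

The watch would be created before the first packet is sent. If hwclock
steps the clock while the servers are still being polled, stepped() turns
true and the stale offset is never applied.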

> > You may probably force this
> > behavior easily by having one or two unreachable servers in the sequence
> > first.
> 
> I think having unreachable servers at the end of the list is more likely
> to cause problems.

I haven't analyzed this further, I just happened to find out that in my
configuration, there were three servers, the first of which was
unreachable. I *thought* ntpdate would just try every server given to it
until it finds one that answers and is synchronized, but I may be wrong
here.

> > > I think your problem is that hwclock is started after ntpdate.
> > 
> > At least this is way too late for hwclock as we all agree - and running
> > hwclock at a more proper time would likely fix it. What remains is the
> > knowledge that ntpdate does something silly, though - when it runs over
> > a macroscopic timescale due to unreachable servers or similar delays
> > and something else changes the kernel clock during this time, it might
> > end up offsetting the time *again*. Obviously it thinks it is the only
> > tool that controls the clock, and everything works perfectly when it is.
> > But now that it runs backgrounded, other tools might interfere. There is
> > probably not only hwclock, but other time correction tools that use
> > various sources might collide with it as well. IMO this should be fixed
> > upstream, even when there cannot be a perfect fix (a small chance for
> > a race condition will remain).
> 
> I think we should just make sure that nothing else can run at the same
> time as ntpdate that wants to change the clock.

We know about the hwclock issue, but maybe this isn't the only one that
lurks here. There might be silly DCF77 or GPS software that just forcibly
adjusts the clock, you might set time from ISDN, users might call netdate
or other antique stuff. This might interfere with background-running
ntpdates that were called due to ifup events for, let's say, Ethernet
or WLAN NICs as in use on notebook computers. This may not be much of
a problem once the time is "mostly correct" already, but it still is
a potential race condition.

> I think this should mean that it needs to run before us in the boot
> process, it shouldn't run in the background.

This will hopefully be ensured for hwclock, but a general bad feeling
remains as long as ntpdate is so easy to fool.

> Since it's started when an interface is brought up, I don't see how we
> can run in the background and have some other script wait until we're
> done.

It seems I've stirred up a hornet's nest here anyway. The next issues
I've found are related:

1) If the WLAN association takes a bit longer, the boot process will
   continue and ntpd will have started already when the interface finally
   comes up. The ntpdate called from ifup will then just fail due to the
   already bound socket. (I'm currently testing the roaming mode of
   wpa_supplicant, which associates in the background. Previously I had
   a similar but more static hack of my own that achieved the same thing
   but simply blocked.)

2) In the same situation as seen in 1), ntpd will start, but it will not
   work as expected. Even though it actually talks to the configured
   servers (verified that with a sniffer), all of them stay on stratum 16
   in .INIT. state. I assume ntpd cannot deal with the fact that it now
   talks to the world via an interface that didn't exist when it started
   up and ignores the answers it receives from the servers because it
   doesn't recognize their destination address.
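
The failure in 1) is the ordinary "address already in use" condition. A
small Python sketch of it follows. It uses an arbitrary unprivileged UDP
port on loopback instead of the NTP port 123 (which would need root), and
second_bind_errno is a made-up name, so take it as an illustration only.

```python
import errno
import socket


def second_bind_errno():
    """Bind one UDP socket, then try to bind a second one to the same port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the kernel pick a free port; this plays the role of ntpd.
        first.bind(("127.0.0.1", 0))
        port = first.getsockname()[1]
        try:
            # This plays the role of the late ntpdate started from ifup.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print(second_bind_errno() == errno.EADDRINUSE)
```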

> Maybe we should ask some advice to someone else who knows more about
> this, I'm just not sure who to ask.

Me neither. However, the issues of ntpdate needing some more plausibility
checks and ntpd failing in a situation where connectivity to the servers
may be via changing interfaces/addresses should probably be relayed to the
upstream maintainers. I see that there is /etc/network/if-up.d/ntp, which
can be edited to force an ntpd restart on every ifup. This looks like a
workaround rather than a solution, though a feasible one for notebooks
that roam between DHCP-driven networks and don't need a long-running ntpd
for clock stability.

But back on topic: I've moved my hwclock.sh to S11 and the clock is
back to expected behavior again. Let's see what util-linux will finally
come up with.

Thanks,
Andre.
-- 
                  The _S_anta _C_laus _O_peration
  or "how to turn a complete illusion into a neverending money source"

-> Andre Beck    +++ ABP-RIPE +++    IBH Prof. Dr. Horn GmbH, Dresden <-