[Pkg-iscsi-maintainers] Bug#775778: open-iscsi: Boot with systemd hangs (ordering of init script w.r.t. remote filesystems)

Christian Seiler christian at iwakd.de
Tue Jan 20 15:37:44 UTC 2015


Hello Ritesh,

>> the system boot will hang for 90s because of systemd's default
>> timeout when devices are not available.
>
> Actually, from what I know so far, systemd aggressively backgrounds
> any processes that is taking time. And only processes that depend on
> it, are put on hold, again in the background.

Well, yes, in principle, but with the way dependencies are expressed
(both by default and in the current Debian packaging of systemd), you
can still have serialization of things. See below.

>> The reason behind this is that open-iscsi contains the following LSB
>> headers:
>>       Required-Start:    $network $remote_fs
>>       Required-Stop:     $network $remote_fs sendsigs
>> Here, $network maps to network-online.target in systemd, that's
>> fine, but $remote_fs maps to remote-fs.target in systemd, that is
>> the problem.
>> This is because
>>
>>  a) systemd treats file systems that couldn't be mounted as hard
>>     failures.
>> and
>>  b) systemd's logic of mounting all remote filesystems is to mount
>>     all filesystems in /etc/fstab that are marked _netdev (and not
>>     marked noauto)
>>
>> Therefore, systemd waits for the iSCSI device to appear for 90s
>> before timing out and proceeding with boot. Only then remote-fs.target
>> is reached and systemd starts the open-iscsi init script.
>
> I think you may be missing something here. I believe devices marked
> _netdev are always backgrounded. At least in sysvinit. And not having
> them do so in systemd is highly unlikely.

No, in both cases that is not true.

First, look at sysvinit with LSB dependency-based boot (Squeeze,
Wheezy, Jessie w/ sysvinit-core). Debian does use startpar(8) to
parallelize some aspects of sysvinit boot, but there are a couple of
synchronization points. They are defined in /etc/insserv.conf and the
relevant ones are:

  $local_fs
  $remote_fs

If you look at the configuration, you will see that $remote_fs is
$local_fs and the mountnfs init script.

Also, there's the fact that all rcS scripts will have completed before
any rc[2-5] scripts are run (the way inittab + rc are set up), so that's
an additional synchronization point.

So if you have an init script with Required-Start: $local_fs, it will
be ordered after all scripts (primarily mountall) that appear for
$local_fs in /etc/insserv.conf, but (according to insserv logic) as
early as otherwise possible.

Same with Required-Start: $remote_fs: it will be ordered after
$local_fs (i.e. after mountall) and also after mountnfs.

So you have the following boot ordering

  1. anything in rcS that doesn't require $local_fs
  2. $local_fs stuff (i.e. mainly mountall)
  3. anything else in rcS that doesn't require $remote_fs
  4. $remote_fs stuff (i.e. mainly mountnfs)
  5. anything else in rcS
  6. anything in rc[2-5]

So if you have Required-Start: $remote_fs in the open-iscsi init
script, you have the following situation:

  - early boot services (1) are started
  - local file systems are mounted (2)
  - some other services started (3)
  - tries to mount remote file systems (4)
       /etc/init.d/mountnfs calls /etc/network/if-up.d/mountnfs
        (or waits until networking has called that dynamically once
         the network is up, depending on your configuration)
       /etc/network/if-up.d/mountnfs effectively does
            mount -a -O _netdev
       At this point, open-iscsi is NOT started. So mount will fail for
       all mount points on iSCSI devices. However, since mountnfs
       doesn't check the exit code of the mount command, it will
       happily continue on and pretend everything is fine.
  - services ordered after $remote_fs are started, including open-iscsi
       open-iscsi calls mount -a -O _netdev itself, which will try to
       mount the remaining filesystems again, then succeeding

So nothing is really 'backgrounded'. You are just relying on the fact
that mountnfs doesn't really check any exit codes (and that sysvinit
doesn't care whether the init scripts that your init script depends on
were successful), and you tape over that fact by running mount again.
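
The sysvinit sequence above can be played through in a small toy model.
This is only a sketch of the behaviour described here, not real init
code: the device names, mount points and function names are made up for
illustration.

```python
# Toy model of the sysvinit sequence described above; not real init code.
# Device names, mount points and function names are hypothetical.

def mount_all_netdev(fstab, available_devices, mounted):
    """Mimic `mount -a -O _netdev`: try every _netdev entry not yet mounted.

    Returns the number of entries that could not be mounted, which is
    what a non-zero exit status of mount would signal.
    """
    failures = 0
    for device, mountpoint in fstab:
        if mountpoint in mounted:
            continue
        if device in available_devices:
            mounted.add(mountpoint)
        else:
            failures += 1
    return failures


def sysvinit_boot(fstab):
    available = {"nfs-server:/export"}  # network is up, NFS is reachable
    mounted = set()
    log = []

    # Step 4: mountnfs runs before open-iscsi; the failure count is
    # recorded here, but nothing in the boot sequence acts on it.
    log.append(("mountnfs", mount_all_netdev(fstab, available, mounted)))

    # Later: open-iscsi logs in, the iSCSI disk appears, and the init
    # script runs the same mount command a second time.
    available.add("/dev/disk/by-path/iscsi-example")
    log.append(("open-iscsi", mount_all_netdev(fstab, available, mounted)))
    return mounted, log


fstab = [
    ("nfs-server:/export", "/srv/nfs"),
    ("/dev/disk/by-path/iscsi-example", "/srv/iscsi"),
]
mounted, log = sysvinit_boot(fstab)
print(log)             # [('mountnfs', 1), ('open-iscsi', 0)]
print(sorted(mounted)) # ['/srv/iscsi', '/srv/nfs']
```

The first pass reports one failure that nobody looks at; the second
pass is what actually gets the iSCSI filesystem mounted.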

This in turn means that with sysvinit you have kind of exempted
$remote_fs from being a true synchronization point. This doesn't
really matter that much for sysvinit, because there's a different
synchronization point directly after that (end of rcS execution, start
of rc[2-5] execution), but for systemd that's a different story (see
below). (But note that this COULD break for an early boot service
ordered after $remote_fs that needs the filesystems, it's just that
Jessie by default doesn't ship one.)


Now let's take systemd. systemd has so-called 'targets' which are also
used as synchronization points at boot. The two sysvinit sync points
are mapped as follows:

  $local_fs    -> local-fs.target
  $remote_fs   -> remote-fs.target
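
As a minimal sketch, the mapping can be written down as a lookup table
that turns a Required-Start header into an After= line. The table only
contains the three facilities named in this mail, and the function name
is made up; this is not how systemd implements it.

```python
# Facility-to-target mapping as stated in this mail ($network is taken
# from the quoted original report). Hypothetical helper, for illustration.
LSB_TO_SYSTEMD = {
    "$local_fs": "local-fs.target",
    "$remote_fs": "remote-fs.target",
    "$network": "network-online.target",
}


def after_line(required_start):
    """Translate the value of an LSB Required-Start header into After=."""
    targets = [
        LSB_TO_SYSTEMD[facility]
        for facility in required_start.split()
        if facility in LSB_TO_SYSTEMD
    ]
    return "After=" + " ".join(targets)


print(after_line("$network $remote_fs"))
# After=network-online.target remote-fs.target
```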

Additionally, systemd knows a couple more sync points, namely

  local-fs-pre.target
  remote-fs-pre.target

However, systemd doesn't really have a sync point for early-boot vs.
runlevel services.

The boot sequence with systemd is then as follows (only depicting a
part of it):

        early boot services (e.g. udev)
        ordered before local-fs-pre.target
                   |
                   v
          local-fs-pre.target
                   |
                   v
         mount local file systems
                   |
                   v
             local-fs.target
                   |
                   v
        early boot services ordered after local-fs.target
        but before remote-fs-pre.target
                   |
                   v
           remote-fs-pre.target
                   |
                   v
        mount remote file systems
                   |
                   v
            remote-fs.target
                   |
                   v
               the rest

Within each block, everything is of course parallel (barring other
ordering constraints) - even the filesystems are mounted in parallel.

And obviously, if something isn't ordered against any of the targets
shown here, it will be started immediately (before or in parallel to
local-fs.target) and the targets in the middle won't wait for its
completion.
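
The chain in the diagram is just an ordering graph, so the effect of a
unit's After=/Before= settings can be checked with a topological sort.
The sketch below uses Python's graphlib (3.9 or later) on a simplified
copy of the diagram; the node names "local mounts" and "remote mounts"
and the function name are made up, and real systemd tracks far more
than this.

```python
from graphlib import TopologicalSorter


def boot_order(open_iscsi_after, open_iscsi_before):
    """Return one valid start order for a simplified copy of the diagram.

    preds[x] is the set of nodes that must come before x.
    """
    preds = {
        "local-fs-pre.target": set(),
        "local mounts": {"local-fs-pre.target"},
        "local-fs.target": {"local mounts"},
        "remote-fs-pre.target": {"local-fs.target"},
        "remote mounts": {"remote-fs-pre.target"},
        "remote-fs.target": {"remote mounts"},
        "open-iscsi": set(open_iscsi_after),
    }
    # Before=X on open-iscsi means: X has open-iscsi as a predecessor.
    for unit in open_iscsi_before:
        preds.setdefault(unit, set()).add("open-iscsi")
    return list(TopologicalSorter(preds).static_order())


# Ordering derived from the LSB header: After=remote-fs.target
generated = boot_order({"network-online.target", "remote-fs.target"}, set())
# Ordering with Before=remote-fs-pre.target instead
patched = boot_order({"network-online.target"}, {"remote-fs-pre.target"})

print(generated.index("open-iscsi") > generated.index("remote mounts"))  # True
print(patched.index("open-iscsi") < patched.index("remote mounts"))      # True
```

Only the relative positions are meaningful here; nodes without an
ordering between them may come out in any order.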

On shutdown, the whole thing is done in reverse, with one important
caveat: systemd tracks the state of the system, so it looks at the
dependencies of stuff that's running. So if you start a service
manually without having it enabled at boot, its dependencies will still
work properly. (sysvinit/LSB tries to do that partially by always
creating stop links, even if the service is not enabled.)




Now you have two problems in this setup:

   - same thing as with sysvinit: open-iscsi is ordered after
     remote-fs.target, so it won't get started until remote-fs.target is
     reached

   - however, the crucial difference here is that systemd cares whether
     stuff has actually worked or not. It doesn't just call
     mount -a -O _netdev and hope for the best, it tries to wait for
     the required devices to appear (because they might not appear
     synchronously)

        -> unfortunately, since open-iscsi won't start before
           remote-fs.target, those devices will never appear while
           systemd is waiting for them

        -> systemd has a default timeout of 90s for devices showing up
           so it will wait for 90s for these devices to show up and then
           fail

        -> only then will systemd consider remote-fs.target reached
           (btw. local-fs.target has a setting
           OnFailure=emergency.target, so that when it can't mount a
           local file system, the boot doesn't even continue, see
           Debian bug #743265 for a discussion on this; fortunately
           remote-fs.target doesn't have this setting, so boot does
           continue in this case)

        -> only then will systemd start open-iscsi

        -> that will then mount the filesystems again
           (which is actually unnecessary with systemd, because as soon
           as the devices appear, it will mount the stuff anyway)

        -> hence the 90s delay for waiting on devices that will only
           show up later

     You can actually try this easily (if you have an iSCSI target lying
     around ;-)): set up a Jessie box, install open-iscsi, configure it
     to automatically log in to your target, put an iSCSI filesystem as
     _netdev into /etc/fstab and reboot - voilà: 90s delay. It's very
     simple to reproduce, and it ALWAYS happens in that constellation.
     With rootfs on iSCSI it should also happen if you log in to
     additional targets. (Otherwise, rootfs on iSCSI is not affected.)

   - on shutdown, things are also messy, since systemd tries to shut
     down stuff much more in parallel than sysvinit does

        - open-iscsi is an early-boot ("runlevel S") service, i.e. with
          sysvinit those always get stopped after all services of the
          current runlevel (e.g. 2) are stopped

        - with systemd, it just cares about explicit dependencies, so
          it will try to stop open-iscsi as early as possible (since
          by default nothing is ordered after it)

        -> this has the consequence that stuff that's using remote
           filesystems might still be running while open-iscsi is
           terminating and it can't unmount them

        -> the open-iscsi service will then (try to) logout of the
           sessions even though stuff is still active.

                -> very, very bad
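
To put rough numbers on the boot-time half of the problem, here is a
toy timeline of the two orderings. The 90s timeout is the default
mentioned above; the 2s login time and the function name are
assumptions picked purely for illustration.

```python
def boot_timeline(iscsi_before_remote_fs_pre, login_seconds=2, timeout=90):
    """Return (seconds, event) pairs for the remote-filesystem phase.

    login_seconds is a made-up figure; timeout is the default device
    timeout discussed above.
    """
    t = 0
    events = []
    if iscsi_before_remote_fs_pre:
        # open-iscsi is ordered before remote-fs-pre.target
        t += login_seconds
        events.append((t, "open-iscsi logged in, device present"))
        events.append((t, "remote-fs-pre.target reached"))
        events.append((t, "iSCSI filesystem mounted"))
        events.append((t, "remote-fs.target reached"))
    else:
        # open-iscsi is ordered after remote-fs.target
        events.append((t, "remote-fs-pre.target reached"))
        t += timeout
        events.append((t, "device wait timed out"))
        events.append((t, "remote-fs.target reached"))
        t += login_seconds
        events.append((t, "open-iscsi logged in, device present"))
        events.append((t, "iSCSI filesystem mounted"))
    return events


for label, flag in (("generated unit", False), ("modified unit", True)):
    end, _ = boot_timeline(flag)[-1]
    print(f"{label}: filesystem available after {end}s")
# generated unit: filesystem available after 92s
# modified unit: filesystem available after 2s
```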

As I said in the original report, on the test system I've used so far
for Jessie I haven't actually seen this race condition (i.e. shutdown
always worked anyway), since nothing was really using the remote
filesystems on my test box, and it might be the case that it doesn't
always occur, but it will at least sometimes.

>> That in turn will then make the devices appear. The init script will
>> then call a "mount -a -O _netdev" and "swapon -a -e" in its start()
>> routine, that will then cause the mount points to be activated.
>>
>> So in the end, the boot is kind-of successful in the sense that
>> everything kind of works at the end of boot, with the following two
>> caveats:
>>
>>  - there is this needless 90s delay (or whatever other delay the
>>    admin has configured) in waiting on the iSCSI targets
>
> Have you had luck root causing in why there is the 90 sec delay ?

I hope this reply makes it a bit clearer where the problem lies and
why my diagnosis is correct.

Note that I have spent probably 10-12 hours on this problem, first
trying to figure out what the problem was and then trying to come up
with a solution that changes as little as possible (because of the
freeze) and testing that against a lot of different scenarios:

  - I only noticed that I needed to move #DEBHELPER# around because of
    testing partial upgrades

  - I don't use rootfs on iSCSI myself, so I set up a test system to
    check that nothing broke (which the first version I wanted to send
    did, so I fixed that before reporting this)

  - I rebooted test boxes quite a lot to see if there was any trouble.

>> Therefore, I suggest that you provide a unit file specifically for
>> systemd. In order to be as minimally invasive as possible
>> (especially this late in the freeze), the unit file should ideally
>> call the original init script.
>
> I am willing to accept a systemd unit. But it is too late for Jessie
> right now. If you have the unit ready and tested, for now, we can put
> it into experimental.
>
> I would not want to ship something for Jessie now. Ideally, systemd's
> logic on handling init scripts should take care of it. It has worked
> for other sysvinit scripts so far.
>
> And introducing the systemd unit now in Jessie is late. Because it
> wouldn't have had enough test cycles.

systemd's logic of handling it won't take care of it, because it's
already kind-of broken on sysvinit, but a lot of specific details in
sysvinit that systemd doesn't emulate quite that way mitigate that.

The changes required to make systemd support this in the same way as
sysvinit would be far more invasive to the current systemd code base
than fixing a couple of dependencies here.



I'm going to explain how systemd currently handles init scripts,
because then it becomes clear why the unit file I have provided is not
really experimental at all.


systemd does not support init scripts directly from PID1 anymore (this
was different in very old versions). systemd's PID1 only understands
systemd unit files. Instead, systemd now has a concept called
'generators', which are small programs (sometimes even scripts) that
are run

  - at boot
  - every time systemd re-reads its configuration

The job of a generator is to read some aspect of the system
configuration (init scripts, /etc/fstab, /etc/crypttab, ...) and
generate native systemd units from that.

If you boot a systemd Jessie system and look in /run/systemd/generator
and /run/systemd/generator.late, you will see the units that were
generated by these generators. Each line in /etc/fstab becomes a .mount
unit, each sysvinit script becomes a .service file.

Of course, the generator responsible for init scripts doesn't magically
convert a sysvinit file completely into a service file (that's not
really possible to do automatically in the general case), but the
service file it generates just contains the necessary metadata.
Additionally, it sets ExecStart=/etc/init.d/$SCRIPT start and
ExecStop=/etc/init.d/$SCRIPT stop in the service file, so that the
original init script is actually called.

For example, if I take /etc/init.d/kbd, the systemd-sysv-generator will
produce the following service file in
/run/systemd/generator.late/kbd.service:

-----------------------------------------------------------
# Automatically generated by systemd-sysv-generator

[Unit]
SourcePath=/etc/init.d/kbd
Description=LSB: Prepare console
DefaultDependencies=no
Before=sysinit.target
After=remote-fs.target

[Service]
Type=forking
Restart=no
TimeoutSec=0
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
SysVStartPriority=18
ExecStart=/etc/init.d/kbd start
ExecStop=/etc/init.d/kbd stop
-----------------------------------------------------------
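
To make the idea of a generator concrete, here is a heavily simplified
sketch of the translation step. It is NOT the real
systemd-sysv-generator: the init script header, the function names and
the reduced set of emitted settings are all assumptions for
illustration.

```python
import re

# Facility mapping as given in this mail; everything else is hypothetical.
LSB_TO_SYSTEMD = {
    "$local_fs": "local-fs.target",
    "$remote_fs": "remote-fs.target",
    "$network": "network-online.target",
}

# Made-up init script header, not taken from any real package.
EXAMPLE_SCRIPT = """\
#!/bin/sh
### BEGIN INIT INFO
# Provides:          example
# Required-Start:    $network $remote_fs
# Required-Stop:     $network $remote_fs
# Default-Start:     S
# Default-Stop:      0 1 6
# Short-Description: Example service
### END INIT INFO
"""


def parse_lsb_header(script_text):
    """Collect the 'Key: value' fields between the INIT INFO markers."""
    fields = {}
    in_header = False
    for line in script_text.splitlines():
        if line.startswith("### BEGIN INIT INFO"):
            in_header = True
        elif line.startswith("### END INIT INFO"):
            break
        elif in_header:
            match = re.match(r"#\s*([A-Za-z-]+):\s*(.*)", line)
            if match:
                fields[match.group(1)] = match.group(2).strip()
    return fields


def wrapper_unit(name, script_text):
    """Emit a minimal wrapper unit that calls the original init script."""
    fields = parse_lsb_header(script_text)
    after = [
        LSB_TO_SYSTEMD[facility]
        for facility in fields.get("Required-Start", "").split()
        if facility in LSB_TO_SYSTEMD
    ]
    lines = [
        "[Unit]",
        f"SourcePath=/etc/init.d/{name}",
        f"Description=LSB: {fields.get('Short-Description', name)}",
    ]
    if after:
        lines.append("After=" + " ".join(after))
    lines += [
        "",
        "[Service]",
        "Type=forking",
        "RemainAfterExit=yes",
        f"ExecStart=/etc/init.d/{name} start",
        f"ExecStop=/etc/init.d/{name} stop",
    ]
    return "\n".join(lines) + "\n"


unit = wrapper_unit("example", EXAMPLE_SCRIPT)
print(unit)
```

The point is only that the ordering in the resulting unit is derived
mechanically from the LSB header, while the actual work is still done
by the init script.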

So what did I do in order to produce the service file I've attached in
my original report?

  - I took the generated service file for the open-iscsi init script
  - I removed the comment about automatic generation
  - I removed SourcePath (that's mainly for documentation purposes if
    you run systemctl status)
  - I adjusted the After= and Before= dependencies
  - I added an [Install] section to make it possible to enable this unit

Here's a diff for comparison (old is generated, new is my modified
version):

-----------------------------------------------------------
diff -u open-iscsi.service /lib/systemd/system/open-iscsi.service
--- open-iscsi.service  2015-01-18 21:12:16.325286854 +0100
+++ /lib/systemd/system/open-iscsi.service      2015-01-19 19:14:53.000000000 +0100
@@ -1,11 +1,8 @@
-# Automatically generated by systemd-sysv-generator
-
  [Unit]
-SourcePath=/etc/init.d/open-iscsi
-Description=LSB: Starts and stops the iSCSI initiator services and logs in to default targets
+Description=iSCSI initiator
  DefaultDependencies=no
-Before=sysinit.target shutdown.target
-After=network-online.target remote-fs.target
+Before=sysinit.target shutdown.target remote-fs-pre.target
+After=network-online.target
  Wants=network-online.target
  Conflicts=shutdown.target

@@ -20,3 +17,6 @@
  SysVStartPriority=20
  ExecStart=/etc/init.d/open-iscsi start
  ExecStop=/etc/init.d/open-iscsi stop
+
+[Install]
+WantedBy=multi-user.target
-----------------------------------------------------------

So it's not like this is really that untested, it's basically the way
systemd handles sysv scripts but just with modified dependencies, to
make sure the unit is started before remote-fs-pre.target and not after
remote-fs.target.
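
For illustration, the same edits can be expressed as a small text
transformation. This is only a sketch: the input contains just the
lines visible in the diff above (the remaining [Service] settings are
left out because the diff does not show them), and the function name is
made up.

```python
# Lines of the generated unit that are visible in the diff above; the
# [Service] settings not shown in the diff are deliberately omitted.
GENERATED = """\
# Automatically generated by systemd-sysv-generator

[Unit]
SourcePath=/etc/init.d/open-iscsi
Description=LSB: Starts and stops the iSCSI initiator services and logs in to default targets
DefaultDependencies=no
Before=sysinit.target shutdown.target
After=network-online.target remote-fs.target
Wants=network-online.target
Conflicts=shutdown.target

[Service]
ExecStart=/etc/init.d/open-iscsi start
ExecStop=/etc/init.d/open-iscsi stop
"""


def adapt(generated):
    """Apply the changes from the diff to the generated unit text."""
    out = []
    for line in generated.splitlines():
        if line.startswith("# Automatically generated"):
            continue  # drop the generator comment
        if line.startswith("SourcePath="):
            continue  # drop SourcePath
        if line.startswith("Description="):
            line = "Description=iSCSI initiator"
        elif line.startswith("Before="):
            line += " remote-fs-pre.target"
        elif line.startswith("After="):
            units = line[len("After="):].split()
            units = [u for u in units if u != "remote-fs.target"]
            line = "After=" + " ".join(units)
        out.append(line)
    # Remove blank lines left at the top by the dropped comment.
    while out and not out[0].strip():
        out.pop(0)
    out += ["", "[Install]", "WantedBy=multi-user.target"]
    return "\n".join(out) + "\n"


adapted = adapt(GENERATED)
print(adapted)
```

Apart from the description and the [Install] section, the only
functional change is where the unit sits relative to the remote-fs
targets.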

>>  - irrespective of systemd, while looking at it I noticed that
>>    umountiscsi.sh's logic is incomplete, it doesn't try to umount
>>    filesystems on LVM on top of iSCSI, unless they were marked with
>>    _netdev (it only detects direct devices).
>
> Can you please elaborate more here ? Or perhaps just file a separate
> bug report. The current init scripts are designed to support LVM + iSCSI.

I'll file a separate bug report for this. I don't think it's very
critical, especially since it doesn't do anything wrong if everything
is in /etc/fstab (or you manually mounted with -o _netdev).

>>    OTOH, this has been the case since at least Squeeze, so it can't
>>    be that critical.
>>
>>  - the current design of using umountiscsi.sh doesn't integrate well
>>    with systemd's dependency logic. I don't think this is a huge
>>    issue, as far as I can see, stuff works as well under systemd with
>>    my patch as under sysvinit (except for the /usr-NFS thing), but I
>>    do think that you could make the whole thing a lot more robust if
>>    this is redesigned a bit - but I don't think that is something
>>    that should go to Jessie.
>
> I agree. We need to switch to systemd. But I haven't had the time to
> do it, and right now, your patch is too late. :-(

I don't think it is: it doesn't change much, I spent a LOT of time
making it as minimally invasive as possible. And while open-iscsi is
not completely unusable with systemd, there are enough problems with
the way the current package interacts with systemd due to subtle
differences in the handling of dependencies and failures that I think
this should really be fixed in Jessie.

As I said in the original report:

>> Btw. I selected severity 'important' because I don't think this bug
>> is 'grave', but I do think that it could be categorized as 'serious',
>> since in my eyes it is unwritten policy that packages should properly
>> support the default init system unless there's a really good reason
>> against it.
>> Unfortunately for me, current policy doesn't mention multiple init
>> systems at all, therefore the severity 'important', because I can't
>> point to a specific part of the text. Nevertheless, I think this bug
>> would qualify as RC.

Regards,
Christian


