[Pkg-xen-devel] Bug#912975: xen-hypervisor-4.8-amd64: Dom0 crashes randomly without logs on Debian Stretch with Xen 4.8.4

Hans van Kranenburg hans at knorrie.org
Wed Nov 7 13:30:50 GMT 2018


Hi,

On 11/7/18 12:48 PM, Roalt Zijlstra | webpower wrote:
> 
> Op di 6 nov. 2018 om 18:54 schreef Hans van Kranenburg <hans at knorrie.org
> <mailto:hans at knorrie.org>>:
> 
>     Hi,
> 
>     On 11/5/18 12:37 PM, Roalt Zijlstra wrote:
>     > Package: src:xen
>     > Version: 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
>     > Severity: important
>     >
>     > Updating Xen to the latest 4.8 version from the security repo
>     makes servers unstable.
> 
>     Can you confirm that this is the only change that you made between the
>     before/after scenario? I mean, if you downgrade the packages, or you
>     drop the old hypervisor xen-x.y-amd64.gz in /boot again, it's stable
>     again?
> 
> 
> We have several servers running the previous versions and those are
> still stable. The servers that we upgraded using 'apt-get update;
> apt-get upgrade'  were rock solid before the upgrade.

Yes, that's why I was asking. Did that apt-get upgrade also upgrade your
dom0 kernel? You can look in /var/log/dpkg.log* to see exactly which
packages were upgraded and when. This is very relevant information.
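
A minimal sketch of how that lookup could be scripted, assuming the usual
dpkg.log layout for upgrade entries (date, time, "upgrade", package:arch,
old version, new version). The function names and the package-name
prefixes are made up for illustration; adjust them to taste.

```python
import glob
import gzip


def find_upgrades(lines, prefixes=("xen-", "linux-image-")):
    """Return (timestamp, package, old_version, new_version) tuples
    for every 'upgrade' entry whose package name starts with a prefix."""
    found = []
    for line in lines:
        fields = line.split()
        # Assumed layout: date time "upgrade" pkg[:arch] old new
        if len(fields) != 6 or fields[2] != "upgrade":
            continue
        package = fields[3].split(":")[0]  # drop the ":amd64" suffix
        if package.startswith(prefixes):
            timestamp = fields[0] + " " + fields[1]
            found.append((timestamp, package, fields[4], fields[5]))
    return found


def read_dpkg_logs(pattern="/var/log/dpkg.log*"):
    """Yield lines from the current and rotated (possibly gzipped) logs."""
    for path in sorted(glob.glob(pattern)):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            yield from handle


# Usage on the dom0 (prints nothing if no matching upgrade is logged):
#   for entry in find_upgrades(read_dpkg_logs()):
#       print(*entry)
```

Comparing the timestamps of the Xen and kernel upgrades against the date
of the first crash should show whether both changed in the same run.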

> I did prepare a downgrade script if needed, but atm. the crash interval
> in days seems to be higher than before. We did have servers crashing
> every 2 days or even one crashing twice a day.

>     > The servers randomly reset without any logs.
> 
>     Do you have the noreboot option set on the Xen hypervisor command line?
> 
>  
> For now one busy server runs an older 4.9.0-4-amd64 kernel with a 3.16
> kernel DomU with MySQL server on it. The second busy server runs all
> domUs with 4.9 (backport) kernels on the latest 4.9.0-8-amd64 kernel
> for the Dom0. Currently we are awaiting any crash. 

In Debian, 4.9.0-8-amd64 is in the name of the package, but the real
kernel version is in the version of that package.

So, if you have linux-image-4.9.0-8-amd64, you should always also
mention the real version, which is now e.g. 4.9.110-3+deb9u6. This means
it's based on 4.9.110 upstream.
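
To make the distinction concrete, here is a small sketch that splits a
Debian package version into its parts, assuming the standard
[epoch:]upstream[-revision] layout. The function name is hypothetical;
the version string itself would come from something like
dpkg-query -W -f='${Version}' linux-image-4.9.0-8-amd64.

```python
def split_debian_version(version):
    """Split '[epoch:]upstream[-revision]' into (epoch, upstream, revision).

    The epoch ends at the first colon and the Debian revision starts
    after the last hyphen; both parts are optional.
    """
    epoch, sep, rest = version.partition(":")
    if not sep:
        # No colon found, so there is no epoch.
        epoch, rest = "", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        # No hyphen found, so there is no Debian revision.
        upstream, revision = rest, ""
    return epoch, upstream, revision


epoch, upstream, revision = split_debian_version("4.9.110-3+deb9u6")
print(upstream)
```

For the example above, the upstream part is 4.9.110 and the Debian
revision is 3+deb9u6, which is the part that changes with each security
update even though the package name stays the same.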

The kernel team follows the 4.9 LTS releases, but they only bump the
number in the package name when a change breaks the ABI. In that case
custom modules have to be rebuilt, and the new name triggers that process.

> The last mentioned server was rebooted with the noreboot option, so we
> could eventually check the console for errors once it crashes. 
> The remaining two servers are our fall-back servers and are not that busy.
> We have seen them crash too, but we noticed that the less busy servers
> did not crash that often. But once they were busy they crashed as
> quickly as the master servers.

Ok, that's interesting extra data.

>     Are you able to configure and capture output from serial console?
> 
>  
> Oh wow..  Using old technology for debugging :-) I will need to see how
> that configuration is done. We could connect up physical serial cables
> between different machines.

Well... old... It's the best way to capture text after everything
crashes. On a vga display it scrolls away and you can't copy paste.

If you're using recent Dell hardware, then I guess your DRAC provides an
extra emulated serial console. I use HP hardware, where it's the iLO
virtual serial port.

>     First interesting thing to know is if it's the Dom0 that crashes, or if
>     it's the hypervisor itself, and the logging will tell you that.
> 
>     > We have several Debian Stretch servers running Xen 4.8 and only
>     the ones updated to the 4.8.4+xsa273+shim4.10.1+xsa273-1+deb9u10
>     > version tend to crash ranging from 'twice a day' to 'once every
>     two weeks'. We have already ruled out if hardware was an
>     > issue, since we have 4 individual servers which are different in
>     hardware setup and also were bought at different times.
>     > And these servers ran stable with the previous version
>     4.8.3+xsa267+shim4.10.1+xsa267-1+deb9u9.
>     > These servers are acting exactly the same. Everything works as it
>     should, but without any logs it crashes and resets at
>     > a certain point.
>     >
>     > It looks like it could have something to do with DomUs running
>     older (3.16) Linux kernels. As a test we applied 4.9 kernels to
>     > all Jessie DomU servers and so far it runs for 13 days (but this
>     server did crash twice on a day).
>     > We have seen this behaviour with Xen on CentOS6 and 7 too, but the
>     trouble seems to be fixed after some more updates.
> 
>     It can be frustrating that there's not much response on the mailing
>     lists. But, these kinds of problems can be really hard to debug and
>     solve. Unless there's a clear reproduction scenario and debug output,
>     there's often no one who can help you remotely.
> 
>  
> Well we have been having the issues since February this year with
> unstable Xen servers crashing once a month or so. The first issues
> were on fresh Cent OS 7 servers, but then we also got them with updated
> Cent OS 6 servers. We then decided to use Debian Stretch and the first
> tests were pretty stable. We did install a new R740 with it (Xen
> 4.8.4-pre) and that ran for 110 days pretty well.

I know this feeling. I've been debugging similar kinds of issues this
year that appeared "every few weeks".

>     > As said.. I cannot provide logs since it simply resets without notice.
> 
>     It's still the best starting point...
> 
> 
> Well hopefully the 'noreboot' provided server crashes soon for some
> logs. I will check if we can do any serial console tricks.

Yes.

Hans


