[Pkg-xen-devel] Bug#880554: xen domu freezes with kernel linux-image-4.9.0-4-amd64

Hans van Kranenburg hans at knorrie.org
Sun Jan 7 18:36:40 UTC 2018


On 01/07/2018 10:05 AM, Valentin Vidic wrote:
> On Sat, Jan 06, 2018 at 11:17:00PM +0100, Hans van Kranenburg wrote:
>> I agree that the upstream default, 32, is quite low. This is indeed a
>> configuration issue. I myself ran into this years ago with a growing
>> number of domUs and network interfaces in use. We have been using
>> gnttab_max_nr_frames=128 for a long time already instead.
>>
>> I was tempted to reassign src:xen, but in the meantime, this option
>> has already been removed again, so this bug no longer applies to
>> unstable (well, as soon as we get something new in there), as far as
>> I can quickly see now.
>>
>> https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=18b1be5e324bcbe2f10898b116db641d404b3d30
> 
> It does not seem to be removed; rather, the default was increased
> from 32 to 64?

Ehm, yes, you are correct. I was misreading and mixing things up. Let's
try again...

The referenced commit is about removing the obsolete
gnttab_max_nr_frames option from the documentation, so it is not
related.

>> Including a better default for gnttab_max_nr_frames in the grub config
>> in the debian xen package in stable sounds reasonable from a best
>> practices point of view.

So, that's gnttab_max_frames, not gnttab_max_nr_frames... I was reading
out loud from my old Jessie dom0 grub config.
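
For reference, the correct spelling ends up on the hypervisor command
line via something like this (128 is just the value we picked, not a
recommendation):

  GRUB_CMDLINE_XEN_DEFAULT="gnttab_max_frames=128"

After running update-grub and rebooting, "xl info" should show it in
the xen_commandline field.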

>> But, I would be interested in learning more about the relation with
>> block mq, though. Does using newer Linux kernels (like from
>> stretch-backports) for the domU always put a bigger strain on this?
>> Or is it just related to the overall number of network devices and
>> block devices you are adding to your domUs in your own specific
>> situation, and did you just trip over the default limit?
> 
> After upgrading the domU and dom0 from jessie to stretch on a big
> postgresql database server (50 VCPUs, 200GB RAM), it started freezing
> very soon after boot, as posted here:
> 
>   https://lists.xen.org/archives/html/xen-users/2017-07/msg00057.html
> 
> It did not have these problems while running the jessie versions of
> the hypervisor and the kernels.  The problem seems to be related to
> the number of CPUs used, as smaller domUs with a few VCPUs did not
> hang like this.  Could it be that a large number of VCPUs -> more
> queues in the Xen mq driver -> faster exhaustion of the allocated
> pages?

That seems to be exactly the case, yes. Maybe this is also one of the
reasons the default max was increased in Xen.

"xen/blkback: make pool of persistent grants and free pages per-queue"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4bf0065b7251afb723a29b2fd58f7c38f8ce297
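
To get a feeling for the numbers: a back-of-the-envelope calculation
for a 50-VCPU domU, assuming the drivers negotiate one queue per vcpu
(the actual defaults depend on kernel version and module parameters),
v1 grant entries of 8 bytes (512 per 4k frame), blkfront ring pages
holding 32 requests of up to 11 segments, and netfront rings of 256
entries:

  blk:  50 queues * (32 * 11 + 1 ring page)  = 17650 grants (one disk)
  net:  50 queues * (256 tx + 256 rx)        = 25600 grants (one nic)
  sum:  43250 grants / 512 per frame         = ~84 frames

So with a single disk and a single network interface we would already
be well over the old default of 32 frames (16384 entries).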

Recently, a tool was added to "dump guest grant table info". Could you
see if it compiles against the 4.8 source and whether it works? It
would be interesting to get some idea of how high or low these numbers
actually are in different scenarios. I mean, I'm using 128, you 256,
and we don't even know if the actual value needed is maybe just above
32? :]

https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=df36d82e3fc91bee2ff1681fd438c815fa324b6a
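
I haven't tried it myself yet, but going by the commit, querying the
current usage should look something like this (tool name and subcommand
taken from my reading of the commit, so verify against the source):

  xen-diag gnttab_query_size <domid>

Alternatively, the hypervisor 'g' debug key should dump grant table
usage on existing versions:

  xl debug-keys g; xl dmesg | tail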

If this is something users are going to run into without doing anything
more unusual than having dozens of vcpus or network interfaces, then
changing the default could save them hours of frustration and
debugging.

The least invasive option is to add the option to the documentation of
GRUB_CMDLINE_XEN_DEFAULT in /etc/default/grub.d/xen.cfg, along the
lines of "If you have more than xyz disks or network interfaces in a
domU, use this, blah blah."

Actually setting the option there is not a good idea, because people
can still have GRUB_CMDLINE_XEN_DEFAULT set in e.g. /etc/default/grub,
which would override it and break things.

The other option is to add a patch that bumps the default in the
upstream code from 32 to 64, including documentation etc.
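
If I remember the upstream change correctly, the code part of such a
patch would be close to a one-liner (macro name from my memory of the
source, so verify before use):

  -#define DEFAULT_MAX_NR_GRANT_FRAMES 32
  +#define DEFAULT_MAX_NR_GRANT_FRAMES 64

plus the matching update to the gnttab_max_frames entry in
docs/misc/xen-command-line.markdown.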

Sorry for the earlier confusion,
Hans


