[Pkg-libvirt-maintainers] Bug#719675: Bug#719675: Live migration of KVM guests fails if it takes more than 30 seconds (large memory guests)

Thu Aug 15 06:16:02 UTC 2013

On Thu, Aug 15, 2013 at 09:35:09AM +0900, Christian Balzer wrote:
> On Wed, 14 Aug 2013 21:50:22 +0200 Guido Günther wrote:
> 
> > On Wed, Aug 14, 2013 at 04:49:42PM +0900, Christian Balzer wrote:
> > > 
> > > Package: libvirt0
> > > Version: 0.9.12-11+deb7u1
> > > Severity: important
> > > 
> > > Hello,
> > > 
> > > when doing a live migration using Pacemaker (the OCF VirtualDomain RA)
> > > on a cluster with DRBD (active/active) backing storage everything
> > > works fine with recently started (small memory footprint of about
> > > 200MB at most) KVM guests. 
> > > 
> > > After inflating one guest to 2GB memory usage (memtester comes in handy
> > > for that) the migration failed after 30 seconds, having managed to
> > > migrate about 400MB in that time over the direct, dedicated GbE link
> > > between my test cluster host nodes. 
> > > 
> > > libvirtd.log on the migration target node, migration start time is
> > > 07:24:51 :
> > > ---
> > > 2013-08-13 07:24:51.807+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > 2013-08-13 07:24:51.886+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > 2013-08-13 07:24:51.888+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > 2013-08-13 07:24:51.948+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > 2013-08-13 07:24:51.948+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > 2013-08-13 07:25:21.217+0000: 31950: warning : virKeepAliveTimer:182 : No response from client 0x1948280 after 5 keepalive messages in 30 seconds
> > > 2013-08-13 07:25:31.224+0000: 31950: warning : qemuProcessKill:3813 : Timed out waiting after SIGTERM to process 15926, sending SIGKILL
> > 
> > This looks more like you're not replying via the keepalive protocol.
> > What are you using to migrate VMs?
> >  -- Guido
> > 
> As I said up there, the Pacemaker (heartbeat, OCF really) resource agent,
> with SSH as the transport (and only) option.

This is not telling me how this is done within Pacemaker. RHCS used to
do this with virsh internally. I'll check the sources once I get around
to it.
 -- Guido
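
For illustration only, a virsh-based live migration over SSH could be
invoked roughly as sketched below. This is an assumption about what a
wrapper might do, not a statement of what the Pacemaker resource agent
actually runs; the helper name and the domain name "guest1" are
hypothetical, and the destination URI follows the form quoted in this
thread.

```python
def build_migrate_command(domain, target_host):
    """Return the argv list for a live migration over SSH via virsh.

    Assumption: the destination is addressed with a qemu+ssh URI of the
    form quoted in this thread (qemu+ssh://targethost/system).
    """
    dest_uri = "qemu+ssh://{}/system".format(target_host)
    # "virsh migrate --live DOMAIN DESTURI" is the basic live-migration form.
    return ["virsh", "migrate", "--live", domain, dest_uri]


if __name__ == "__main__":
    # "guest1" is a hypothetical domain name; the command is only printed,
    # not executed.
    print(" ".join(build_migrate_command("guest1", "targethost")))
```

The command is built as a list so it could be handed to subprocess
without shell quoting issues; nothing is executed here.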

> So the resulting migration URI should be something like:
> 
> qemu+ssh://targethost/system
> 
> Of course with properly distributed authorized_keys, again this works just
> fine with a small enough guest.
> 
> If there wasn't a proper two-way communication going on, shouldn't the
> migration fail from the start?
> 
> [snip]
> 
> Regards,
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/
>
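
The numbers in the report can be put side by side with a rough sketch.
It assumes the keepalive settings were at their libvirt defaults
(interval 5 s, count 5), which would match the "5 keepalive messages in
30 seconds" in the log, and it assumes the observed rate of about 400MB
in 30 seconds stays constant. Pages dirtied during migration are
ignored, so the transfer estimate is a lower bound. The function names
are illustrative.

```python
def keepalive_timeout(interval_s, count):
    """Seconds of silence before the connection is considered dead.

    Assumption: the timeout is interval * (count + 1), i.e. one idle
    interval before the first probe plus one interval per unanswered probe.
    """
    return interval_s * (count + 1)


def estimated_transfer_time(guest_mb, observed_mb, observed_s):
    """Seconds needed to copy guest_mb at the observed transfer rate."""
    rate_mb_per_s = observed_mb / observed_s
    return guest_mb / rate_mb_per_s


if __name__ == "__main__":
    timeout = keepalive_timeout(5, 5)               # assumed defaults
    needed = estimated_transfer_time(2048, 400, 30)  # figures from the report
    print("keepalive timeout: %d s" % timeout)
    print("estimated transfer time: %.1f s" % needed)
    print("migration outlasts keepalive window: %s" % (needed > timeout))
```

Under these assumptions a 2GB guest needs well over the keepalive
window to migrate, which is consistent with small guests succeeding and
large ones failing at the 30 second mark.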