[Debian-ha-maintainers] Build LVM2 against corosync 2.3.4

Dhionel Díaz ddiaz at cenditel.gob.ve
Tue Dec 15 22:14:19 UTC 2015


On 21/10/15 at 05:08, Ferenc Wagner wrote:
> Dhionel Díaz <ddiaz at cenditel.gob.ve> writes:
> 
>> On 02/09/15 at 04:24, Ferenc Wagner wrote:
>>
>>> Dhionel Díaz <ddiaz at cenditel.gob.ve> writes:
>>>
>>>> dlm_controld was compiled from the sources published in
>>>> https://git.fedorahosted.org/git/dlm.git.
>>>
>>> Do you feel like testing fresh packages of DLM 4.0.2, Corosync 2.3.5 and
>>> latest LVM?  What's your architecture?
>>
>> I'd be glad to help. The cluster currently has four amd64 physical
>> servers, and another four will be added soon. Just let me know where I
>> can download the source packages and what test protocol to follow.
> 
> Looks like I forgot to give you the location of my testing repository:
> 
> deb http://apt.niif.hu/debian jessie main
> 
> A sid suite and sources are also available at the same place.  This is
> not official, occasionally I replace packages without version bumps.
> And it contains a newish upstream of lvm2, compiled with Corosync
> support only and cmirrord included.  And /etc/init.d/clvm removed;
> I'm experimenting with the following unit file (not included):
> 
> # /etc/systemd/system/lvm2-clvmd.service
> [Unit]
> Description=clustered LVM daemon
> Documentation=man:clvmd(8)
> Requires=dlm.service corosync.service
> After=dlm.service corosync.service
> 
> [Service]
> Type=notify
> ExecStart=/usr/sbin/clvmd -f
> 
> [Install]
> WantedBy=multi-user.target
> 
> This does not handle VG activation, I've got a separate customization
> unit for that (hooked into Pacemaker activation):
> 
> # /etc/systemd/system/pc-vgs.service
> [Unit]
> Description=volume groups for Private Cloud virtual machines
> Wants=lvm2-clvmd.service
> After=lvm2-clvmd.service
> 
> [Service]
> Type=oneshot
> RemainAfterExit=true
> Environment="VGS=ssdtest_vhblade"
> ExecStart=/sbin/vgchange -aly $VGS
> ExecStop=/sbin/vgchange -aln $VGS
> 
> I'm yet to see if the upstream service files (as shipped in the package)
> can be bent to my purposes.  But that's a usage question anyway.
> 
I've finally found some time to test these packages, with the following
results so far:

 1. The installation was completed successfully.

 2. Under normal conditions, LVM operations seem to work correctly.

 3. If a node crashes, it is sometimes fenced twice. In any case, dlm and
corosync on the node that executes the fence action are killed, leaving
an uncontrolled lockspace, and some time later that node gets fenced as
well. On the remaining node -- the cluster was reduced to three nodes for
these tests -- dlm_controld keeps emitting "clvmd wait for fencing"
messages after both fenced nodes complete their reboot; some time later
clvmd blocks and the node has to be restarted (see the commands right
after this list). In summary, when a node crashes the cluster doesn't
return to a normal state until all nodes have rebooted, and that process
may require manual intervention.

 4. If dlm and clvmd are deactivated (roughly as sketched below, after
the list), a crashed node is fenced just once, and when its reboot
completes the cluster returns to a normal state, as expected.
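
Regarding point 3, the uncontrolled lockspace is visible on the surviving
node with something like the following (illustrative commands rather than
a literal transcript of my sessions):

  dlm_tool ls     # lists lockspaces; the stuck one shows a fencing wait condition
  dlm_tool dump   # recent dlm_controld debug output, same messages as in syslog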
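
For completeness, the deactivation mentioned in point 4 was done roughly
like this (a sketch; the clvmd unit name follows the file quoted above,
adjust to the actual unit names in use):

  systemctl stop lvm2-clvmd.service dlm.service
  systemctl disable lvm2-clvmd.service dlm.service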

Perhaps an unexpected interaction between corosync and dlm is happening;
the following syslog extracts could be related to the issue:

===========================================
Syslog extract from node yyyy2:
===========================================

Dec 11 15:39:20 yyyy2 stonithd[2117]:   notice: remote_op_done:
Operation reboot of yyyy1 by xyyy1 for stonith-api.5468 at xyyy1.35b0ba29: OK
Dec 11 15:39:20 yyyy2 crmd[2121]:   notice: tengine_stonith_notify: Peer
yyyy1 was terminated (reboot) by xyyy1 for xyyy1: OK
(ref=35b0ba29-43aa-4b9f-8d02-349aec2bd9e1) by client stonith-api.5468
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 clvmd wait for fencing
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 dlm:controld conf 1 0 1
memb yyyy2 join left xyyy1
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 dlm:controld left reason
nodedown 0 procdown 0 leave 1
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 daemon remove xyyy1 leave
need_fencing 0 low 0
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 fence request yyyy1 pid
2323 nodedown time 1449864534 fence_all dlm_stonith
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 get_fence_actor for yyyy1
low actor xyyy1 is gone
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 fence request yyyy1 pos 0
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 fence request yyyy1 pid
2323 nodedown time 1449864534 fence_all dlm_stonith
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 fence wait yyyy1 pid 2323
running
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 fence wait yyyy1 pid 2323
running
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 clvmd wait for fencing
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 dlm:ls:clvmd conf 1 0 1
memb yyyy2 join left xyyy1
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 clvmd add_change cg 3
remove nodeid xyyy1 reason procdown
Dec 11 15:39:20 yyyy2 dlm_stonith: stonith_api_time: Found 1 entries for
yyyy1/(null): 0 in progress, 1 completed
Dec 11 15:39:20 yyyy2 dlm_stonith: stonith_api_time: Node yyyy1/(null)
last kicked at: 1449864560
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 tell corosync to remove
nodeid xyyy1 from cluster
Dec 11 15:39:20 yyyy2 dlm_controld[2097]: 1832 tell corosync to remove
nodeid xyyy1 from cluster


===========================================
Syslog extract from node xyyy1:
===========================================

Dec 11 15:39:18 xyyy1 dlm_controld[4148]: 82671 fence wait yyyy1 pid
5468 running
Dec 11 15:39:18 xyyy1 dlm_controld[4148]: 82671 clvmd wait for fencing
Dec 11 15:39:19 xyyy1 dlm_controld[4148]: 82672 fence wait yyyy1 pid
5468 running
Dec 11 15:39:19 xyyy1 dlm_controld[4148]: 82672 clvmd wait for fencing
Dec 11 15:39:20 xyyy1 stonithd[4184]:   notice: log_operation: Operation
'reboot' [5469] (call 2 from stonith-api.5468) for host 'yyyy1' with
device 'fence_yyyy1' returned: 0 (OK)
Dec 11 15:39:20 xyyy1 stonithd[4184]:  warning: get_xpath_object: No
match for //@st_delegate in /st-reply
Dec 11 15:39:20 xyyy1 stonithd[4184]:   notice: remote_op_done:
Operation reboot of yyyy1 by xyyy1 for stonith-api.5468 at xyyy1.35b0ba29: OK
Dec 11 15:39:20 xyyy1 stonith-api[5468]: stonith_api_kick: Node
yyyy1/(null) kicked: reboot
Dec 11 15:39:20 xyyy1 crmd[4188]:   notice: tengine_stonith_notify: Peer
yyyy1 was terminated (reboot) by xyyy1 for xyyy1: OK
(ref=35b0ba29-43aa-4b9f-8d02-349aec2bd9e1) by client stonith-api.5468
Dec 11 15:39:20 xyyy1 stonithd[4184]:   notice: remote_op_done:
Operation reboot of yyyy1 by xyyy1 for crmd.4188 at xyyy1.efa6e77d: OK
Dec 11 15:39:20 xyyy1 crmd[4188]:   notice: abort_transition_graph:
Transition aborted: External Fencing Operation
(source=tengine_stonith_notify:248, 0)
Dec 11 15:39:20 xyyy1 crmd[4188]:   notice: tengine_stonith_callback:
Stonith operation 3/15:92:0:00323f97-6173-4f9b-96dd-0dd62d2f8644: OK (0)
Dec 11 15:39:20 xyyy1 crmd[4188]:   notice: tengine_stonith_notify: Peer
yyyy1 was terminated (reboot) by xyyy1 for xyyy1: OK
(ref=efa6e77d-5da3-4aaf-a3a2-32df43febd12) by client crmd.4188
Dec 11 15:39:20 xyyy1 stonith-api[5468]: stonith_api_time: Found 2
entries for yyyy1/(null): 0 in progress, 2 completed
Dec 11 15:39:20 xyyy1 stonith-api[5468]: stonith_api_time: Node
yyyy1/(null) last kicked at: 1449864560
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 shutdown
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 cpg_leave dlm:controld ...
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 clear_configfs_nodes
rmdir "/sys/kernel/config/dlm/cluster/comms/yyyy2"
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 clear_configfs_nodes
rmdir "/sys/kernel/config/dlm/cluster/comms/xyyy1"
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 dir_member yyyy2
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 dir_member yyyy1
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 dir_member xyyy1
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673
clear_configfs_space_nodes rmdir
"/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/yyyy2"
Dec 11 15:39:20 xyyy1 kernel: [82673.053207] dlm: closing connection to
node yyyy2
Dec 11 15:39:20 xyyy1 kernel: [82673.053618] dlm: closing connection to
node xyyy1
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673
clear_configfs_space_nodes rmdir
"/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/yyyy1"
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673
clear_configfs_space_nodes rmdir
"/sys/kernel/config/dlm/cluster/spaces/clvmd/nodes/xyyy1"
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 clear_configfs_spaces
rmdir "/sys/kernel/config/dlm/cluster/spaces/clvmd"
Dec 11 15:39:20 xyyy1 crmd[4188]:   notice: run_graph: Transition 92
(Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-warn-62.bz2): Stopped
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 abandoned lockspace clvmd
Dec 11 15:39:20 xyyy1 dlm_controld[4148]: 82673 abandoned lockspace clvmd
Dec 11 15:39:20 xyyy1 kernel: [82673.063123] dlm: dlm user daemon left 1
lockspaces
Dec 11 15:39:20 xyyy1 systemd[1]: dlm.service: main process exited,
code=exited, status=1/FAILURE
Dec 11 15:39:20 xyyy1 systemd[1]: Unit dlm.service entered failed state.
Dec 11 15:39:20 xyyy1 corosync[4134]:   [CFG   ] Killed by node yyyy2:
dlm_controld

===========================================
===========================================

At the time covered by these log lines, the crashed node was rebooting
and had not yet reached the point where Linux is loaded.

The -D, -K and -P options for dlm_controld and the -d1 option for clvmd
were active, and both daemons were managed by systemd (a sketch of the
corresponding overrides is below). If further tests would be useful, just
let me know.
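
For reference, the debug options above were set through systemd drop-in
overrides along these lines (an approximation from memory, assuming the
packaged dlm.service has a single ExecStart line; paths and option
spelling may differ slightly from my actual files):

# /etc/systemd/system/dlm.service.d/debug.conf
[Service]
ExecStart=
ExecStart=/usr/sbin/dlm_controld --foreground -D -K -P

# /etc/systemd/system/lvm2-clvmd.service.d/debug.conf
[Service]
ExecStart=
ExecStart=/usr/sbin/clvmd -f -d1

followed by "systemctl daemon-reload" and a restart of both units.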

Regards,

-- 
Dhionel Díaz
Centro Nacional de Desarrollo e Investigación en Tecnologías Libres
Ministerio del Poder Popular para
Educación Universitaria, Ciencia y Tecnología
