<div dir="ltr"><div><div><div><div><div><div><div><div><div><div><div><div>Hi,<br><br></div>we are a bit further in debugging this. We installed a Dell PowerEdge R620 (the same hardware as in our DRBD cluster, where this problem happens). As no one in this thread brought DRBD into play, I didn't expect any interaction with it related to this bug. However, we were not able to reproduce it with LVM2 alone (e.g. configure an LV, do I/O on the LV, remove the LV, observe the hang).<br>
<br></div>So we installed a second machine and put DRBD on top of the LVs. And voilà: as soon as we create a snapshot of the LV that DRBD sits on and then remove that snapshot, the removal fails roughly one time in three.<br><br></div>Some facts:<br>
<br>root@drbd-primary:~# lvremove --force /dev/vg0/lv0-snap<br> Unable to deactivate open vg0-lv0--snap-cow (254:3)<br> Failed to resume lv0-snap.<br> libdevmapper exiting with 1 device(s) still suspended.<br><br></div>
After this, "dmsetup info" gives the following output:<br><br></div><<< snip >>><br><br>Name: vg0-lv0--snap<br>State: ACTIVE<br>Read Ahead: 256<br>Tables present: LIVE<br>
Open count: 0<br>Event number: 0<br>Major, minor: 254, 1<br>Number of targets: 1<br>UUID: LVM-M0Z897O16CAiYbSivOzgSn0M9Ae9TdoYy4WFhwy43CZA1g7zKFGF915pLAOIPvFZ<br><br>Name: vg0-lv0-real<br>State: ACTIVE<br>
Read Ahead: 0<br>Tables present: LIVE<br>Open count: 1<br>Event number: 0<br>Major, minor: 254, 2<br>Number of targets: 1<br>UUID: LVM-M0Z897O16CAiYbSivOzgSn0M9Ae9TdoYC3ppjt1CZ3AcZR2hNz1VT5CHdM4RR32j-real<br>
<br>Name: vg0-lv0<br>State: SUSPENDED<br>Read Ahead: 256<br>Tables present: LIVE & INACTIVE<br>Open count: 2<br>Event number: 0<br>Major, minor: 254, 0<br>Number of targets: 1<br>
UUID: LVM-M0Z897O16CAiYbSivOzgSn0M9Ae9TdoYC3ppjt1CZ3AcZR2hNz1VT5CHdM4RR32j<br><br>Name: vg0-lv0--snap-cow<br>State: ACTIVE<br>Read Ahead: 0<br>Tables present: LIVE<br>Open count: 0<br>
Event number: 0<br>Major, minor: 254, 3<br>Number of targets: 1<br>UUID: LVM-M0Z897O16CAiYbSivOzgSn0M9Ae9TdoYy4WFhwy43CZA1g7zKFGF915pLAOIPvFZ-cow<br><br></div><<< snap >>><br><br></div>As you can see, the origin LV vg0-lv0, which DRBD sits on top of, is now in state SUSPENDED. This renders the cluster non-functional: I/O stalls on both the primary and the secondary node until someone runs "dmsetup resume /dev/vg0/lv0".<br>
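As a work-around helper (my own sketch, not something from our setup), the stuck devices can be picked out of the "dmsetup info" output mechanically; the function below only prints the matching "dmsetup resume" command for each SUSPENDED device rather than running it:<br>

```shell
# Sketch: scan `dmsetup info` output (read from stdin) for devices left
# in SUSPENDED state and print the matching `dmsetup resume` command.
# It only prints the command, so it is safe to run anywhere.
list_suspended() {
    awk '/^Name:/                       { name = $2 }
         /^State:/ && $2 == "SUSPENDED" { print "dmsetup resume " name }'
}

# usage: dmsetup info | list_suspended
```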
<br></div>Another interesting observation: after running "dmsetup resume /dev/vg0/lv0", lv0-snap no longer appears to be a snapshot; according to lvs, it has lost its origin:<br><br>
LV VG Attr LSize Pool Origin Data% Move Log Copy% Convert<br> lv0 vg0 -wi-ao-- 200.00g <br> lv0-snap vg0 -wi-a--- 40.00g <br>
<br><br></div><div>Some miscellaneous notes:<br></div><div>* It _seems_ to happen only when the snapshot is at least roughly 50-60% full.<br>* We can trigger something similar even without DRBD; in that case, however, the LV never ends up in the SUSPENDED state and a second lvremove always succeeds.<br><br></div>That's all we have so far. I have already discussed this privately with <a href="mailto:waldi@debian.org">waldi@debian.org</a>, and we will (probably) give him remote access to this system as soon as the setup is reachable from the outside.<br>
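For completeness, a minimal sketch of the reproduction sequence described above. The VG/LV names match the dmsetup output earlier in this mail; the snapshot size, the fill amount, and the dry-run wrapper are my own assumptions, and by default the commands are only printed, not executed:<br>

```shell
#!/bin/sh
# Sketch of the reproduction sequence.  DRY_RUN=1 (the default) only
# prints each command so the sketch can be read safely; set DRY_RUN=0
# on a scratch machine with vg0/lv0 and DRBD on top to actually run it.
DRY_RUN=${DRY_RUN:-1}

run() {
    echo "+ $*"
    [ "$DRY_RUN" = 1 ] || "$@"
}

# 1. snapshot the LV that DRBD sits on top of
run lvcreate -s -L 40g -n lv0-snap /dev/vg0/lv0
# 2. write to the origin so the snapshot COW fills past ~50-60%
#    (roughly where the failure starts to appear for us)
run dd if=/dev/zero of=/dev/vg0/lv0 bs=1M count=25000
# 3. remove the snapshot -- this fails about one time in three for us
run lvremove --force /dev/vg0/lv0-snap
```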
<br></div></div>Please let me know if I can provide any more information to help get this fixed. I have put drbd-dev in CC; maybe someone over there has an idea about this?<br><br></div><div>@drbd-dev: the system is Debian wheezy, with DRBD 8.3.11 and lvm2 2.02.95.<br>
<br></div><div>Thanks,<br>Frank<br></div></div>