Bug#563602: mdadm: autocheck running checkarray fails and degrades array when weekly cron jobs tick off
Mike Bilow
mike at bilow.com
Mon Jan 4 00:26:21 UTC 2010
Package: mdadm
Version: 2.6.7.2-3
Severity: grave
Justification: causes non-serious data loss
Because of the size of the media being checked, the "checkarray"
instance that was invoked at 0057 had not yet completed when the usual
weekly cron jobs triggered at 0626. It is common for those cron jobs to
send signals to various daemons in order to allow log rotation.
I am fairly confident that there is nothing actually wrong with the
disks. Although I have no hard proof that the cron jobs caused
"checkarray" to fail in such a way as to degrade array "md1" and mark
"sda2" as faulty, the coincidence is too strong to ignore. I am not
sufficiently familiar with the weekly cron jobs installed by default,
but experience teaches that problems occurring exactly at 0626 are
extremely likely to be associated with the signals sent to daemons.
Jan 3 00:57:25 virtual1 mdadm[3471]: RebuildStarted event detected on
md device /dev/md1
Jan 3 02:22:27 virtual1 mdadm[3471]: Rebuild20 event detected on md
device /dev/md1
Jan 3 03:50:29 virtual1 mdadm[3471]: Rebuild40 event detected on md
device /dev/md1
Jan 3 05:21:32 virtual1 mdadm[3471]: Rebuild60 event detected on md
device /dev/md1
Jan 3 06:26:14 virtual1 mdadm[3471]: Fail event detected on md device
/dev/md1, component device /dev/sda2
Jan 3 06:26:14 virtual1 mdadm[3471]: RebuildFinished event detected on
md device /dev/md1, component device mismatches found: 128
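The timing of the check can be worked out from the log excerpt above. The following is a minimal sketch, assuming the wrapped log lines are rejoined, that the year is 2010 (syslog omits it), and that month names parse in an English/C locale; the function names parse_events and elapsed_since_start are hypothetical and not part of mdadm.

```python
import re
from datetime import datetime

# Log lines copied from the report, with the wrapped lines rejoined.
LOG = """\
Jan 3 00:57:25 virtual1 mdadm[3471]: RebuildStarted event detected on md device /dev/md1
Jan 3 02:22:27 virtual1 mdadm[3471]: Rebuild20 event detected on md device /dev/md1
Jan 3 03:50:29 virtual1 mdadm[3471]: Rebuild40 event detected on md device /dev/md1
Jan 3 05:21:32 virtual1 mdadm[3471]: Rebuild60 event detected on md device /dev/md1
Jan 3 06:26:14 virtual1 mdadm[3471]: Fail event detected on md device /dev/md1, component device /dev/sda2
"""

# Timestamp, hostname, "mdadm[pid]:", then the event name.
EVENT_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ mdadm\[\d+\]: (\w+) event detected"
)


def parse_events(text, year=2010):
    """Return (datetime, event_name) pairs for each mdadm monitor line."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m:
            # The year is an assumption; syslog timestamps do not carry one.
            when = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((when, m.group(2)))
    return events


def elapsed_since_start(events):
    """Return (event_name, timedelta since the first event) pairs."""
    start = events[0][0]
    return [(name, when - start) for when, name in events]


if __name__ == "__main__":
    for name, delta in elapsed_since_start(parse_events(LOG)):
        print(f"{name:15s} +{delta}")
```

Run against these lines, the sketch places the Fail event 5 hours 28 minutes 49 seconds after RebuildStarted, past the Rebuild60 mark but before any Rebuild80 event was logged.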
I was able to start resynchronizing the array (so far 96% done) by
removing and re-adding the component ("mdadm --remove /dev/md1 /dev/sda2"
followed by "mdadm --add /dev/md1 /dev/sda2") without even rebooting
the system. However, the array was left degraded for several hours until
I received and was able to act on the e-mail message from the monitoring
daemon, and the resynchronization is taking about 10 hours. The array
would have stayed degraded until manual intervention occurred.
While writing this bug report, the resynchronization completed
successfully and promoted the array out of degraded mode:
Jan 3 10:01:31 virtual1 mdadm[3471]: RebuildStarted event detected on
md device /dev/md1
Jan 3 11:35:31 virtual1 mdadm[3471]: Rebuild20 event detected on md
device /dev/md1
Jan 3 13:23:32 virtual1 mdadm[3471]: Rebuild40 event detected on md
device /dev/md1
Jan 3 15:23:33 virtual1 mdadm[3471]: Rebuild60 event detected on md
device /dev/md1
Jan 3 17:22:33 virtual1 mdadm[3471]: Rebuild80 event detected on md
device /dev/md1
Jan 3 19:21:20 virtual1 mdadm[3471]: RebuildFinished event detected on
md device /dev/md1, component device mismatches found: 128
Jan 3 19:21:20 virtual1 mdadm[3471]: SpareActive event detected on md
device /dev/md1, component device /dev/sda2
-- Package-specific info:
--- mount output
/dev/md0 on / type ext3 (rw,errors=remount-ro)
tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
procbususb on /proc/bus/usb type usbfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620)
/dev/mapper/virtual1-colossus--archive on /mnt/colossus-archive type ext3 (rw)
/dev/mapper/virtual1-images--archive on /mnt/images-archive type ext3 (rw)
--- mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#
# by default, scan all partitions (/proc/partitions) for MD superblocks.
# alternatively, specify devices to scan, using wildcards if desired.
DEVICE partitions
# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR root
# definitions of existing MD arrays
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=520af0cf:5d9358cf:0bfe99a0:ead009c9
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=0fe83522:3afebffc:663419d7:7a1d4550
# This file was auto-generated on Thu, 30 Oct 2008 17:06:34 +0000
# by mkconf $Id: mkconf 261 2006-11-09 13:32:35Z madduck $
--- /proc/mdstat:
Personalities : [raid1]
md1 : active raid1 sda2[2] sdb2[1]
974808064 blocks [2/1] [_U]
[===================>.] recovery = 96.2% (938482880/974808064) finish=21.8min speed=27728K/sec
md0 : active raid1 sda1[0] sdb1[1]
1951744 blocks [2/2] [UU]
unused devices: <none>
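The degraded state and the remaining recovery time can be read out of the /proc/mdstat text above. The following is a minimal sketch, assuming the layout shown in this report (a "[n/m] [U_]" status line and a progress line with "(done/total)" and "speed=...K/sec"); parse_mdstat is a hypothetical helper, not an mdadm interface, and other kernels may format these lines differently.

```python
import re

# /proc/mdstat contents copied from the report.
MDSTAT = """\
Personalities : [raid1]
md1 : active raid1 sda2[2] sdb2[1]
974808064 blocks [2/1] [_U]
[===================>.] recovery = 96.2% (938482880/974808064) finish=21.8min speed=27728K/sec
md0 : active raid1 sda1[0] sdb1[1]
1951744 blocks [2/2] [UU]
unused devices: <none>
"""


def parse_mdstat(text):
    """Return a dict keyed by array name with status fields."""
    arrays = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"^(md\d+) : (\w+) (\w+)", line)
        if m:
            current = {"state": m.group(2), "level": m.group(3), "degraded": False}
            arrays[m.group(1)] = current
            continue
        if current is None:
            continue
        # "[2/1] [_U]": 2 devices expected, 1 active; "_" marks a missing member.
        m = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if m:
            current["expected"] = int(m.group(1))
            current["active"] = int(m.group(2))
            current["degraded"] = "_" in m.group(3)
        # "(done/total) ... speed=NK/sec": blocks and speed are both in KiB units.
        m = re.search(r"\((\d+)/(\d+)\).*speed=(\d+)K/sec", line)
        if m:
            done, total, speed = (int(g) for g in m.groups())
            current["percent"] = 100.0 * done / total
            current["minutes_left"] = (total - done) / speed / 60.0
    return arrays


if __name__ == "__main__":
    for name, info in sorted(parse_mdstat(MDSTAT).items()):
        print(name, info)
```

For md1 the remaining 36325184 blocks at 27728K/sec work out to roughly 21.8 minutes, which agrees with the finish= value the kernel printed.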
--- /proc/partitions:
major minor #blocks name
8 0 976762584 sda
8 1 1951866 sda1
8 2 974808135 sda2
8 16 976762584 sdb
8 17 1951866 sdb1
8 18 974808135 sdb2
9 0 1951744 md0
9 1 974808064 md1
253 0 15728640 dm-0
253 1 41943040 dm-1
253 2 2097152 dm-2
253 3 2097152 dm-3
253 4 2097152 dm-4
253 5 2097152 dm-5
253 6 209715200 dm-6
253 7 20971520 dm-7
253 8 258048 dm-8
--- initrd.img-2.6.26-2-xen-686:
33375 blocks
ed87a4a20991312e12f397abd288fd54 ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/dm-log.ko
61a6adc3a4dffac9ee5ad96f5196b590 ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/raid0.ko
b66b54b318430347504889d45bf16ba2 ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/multipath.ko
07c46476b799567a5b551f4c0ad71482 ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/raid1.ko
599daaac429638114e9acd150124145e ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/linear.ko
75ac8c783adaafe1e68aee05fd5fdccd ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/raid10.ko
5d62f4f02384a20b6410ada2724dc14c ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/dm-snapshot.ko
e3e027bf1ef37e06b6d976dedf2faf22 ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/dm-mirror.ko
c91ab1a51ed03662c06c924b64e19f9e ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/dm-mod.ko
49b779483baf3655fb819c1ebc0835f3 ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/md-mod.ko
443504a91e24c3a87d36feccc5019bb8 ./lib/modules/2.6.26-2-xen-686/kernel/drivers/md/raid456.ko
e1e2d0e985196fecaf41fb42e9968af2 ./scripts/local-top/mdadm
845f04e5ccb4e42938e7779d06a304b3 ./etc/mdadm/mdadm.conf
ea9abd44166c288560f8c9789cb3949d ./sbin/mdadm
--- /proc/modules:
dm_mirror 16320 0 - Live 0xee157000
dm_log 9412 1 dm_mirror, Live 0xee15c000
dm_snapshot 15108 0 - Live 0xee079000
dm_mod 47336 22 dm_mirror,dm_log,dm_snapshot, Live 0xee0c9000
raid1 19200 2 - Live 0xee084000
md_mod 69212 3 raid1, Live 0xee1a1000
--- /var/log/syslog:
--- volume detail:
/dev/hdc is not recognised by mdadm.
/dev/sda is not recognised by mdadm.
/dev/sda1:
Magic : a92b4efc
Version : 00.90.00
UUID : 520af0cf:5d9358cf:0bfe99a0:ead009c9
Creation Time : Thu Oct 30 12:58:53 2008
Raid Level : raid1
Used Dev Size : 1951744 (1906.32 MiB 1998.59 MB)
Array Size : 1951744 (1906.32 MiB 1998.59 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Update Time : Sun Jan 3 18:57:29 2010
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Checksum : e4011b7c - correct
Events : 31
Number Major Minor RaidDevice State
this 0 8 1 0 active sync /dev/sda1
0 0 8 1 0 active sync /dev/sda1
1 1 8 17 1 active sync /dev/sdb1
--
/dev/sda2:
Magic : a92b4efc
Version : 00.90.00
UUID : 0fe83522:3afebffc:663419d7:7a1d4550
Creation Time : Thu Oct 30 13:00:18 2008
Raid Level : raid1
Used Dev Size : 974808064 (929.65 GiB 998.20 GB)
Array Size : 974808064 (929.65 GiB 998.20 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Update Time : Sun Jan 3 18:57:29 2010
State : active
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Checksum : a2c9838d - correct
Events : 25749
Number Major Minor RaidDevice State
this 2 8 2 2 spare /dev/sda2
0 0 0 0 0 removed
1 1 8 18 1 active sync /dev/sdb2
2 2 8 2 2 spare /dev/sda2
--
/dev/sdb is not recognised by mdadm.
/dev/sdb1:
Magic : a92b4efc
Version : 00.90.00
UUID : 520af0cf:5d9358cf:0bfe99a0:ead009c9
Creation Time : Thu Oct 30 12:58:53 2008
Raid Level : raid1
Used Dev Size : 1951744 (1906.32 MiB 1998.59 MB)
Array Size : 1951744 (1906.32 MiB 1998.59 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Update Time : Sun Jan 3 18:57:29 2010
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Checksum : e4011bac - correct
Events : 30
Number Major Minor RaidDevice State
this 1 8 17 1 active sync /dev/sdb1
0 0 8 1 0 active sync /dev/sda1
1 1 8 17 1 active sync /dev/sdb1
--
/dev/sdb2:
Magic : a92b4efc
Version : 00.90.00
UUID : 0fe83522:3afebffc:663419d7:7a1d4550
Creation Time : Thu Oct 30 13:00:18 2008
Raid Level : raid1
Used Dev Size : 974808064 (929.65 GiB 998.20 GB)
Array Size : 974808064 (929.65 GiB 998.20 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Update Time : Sun Jan 3 18:57:29 2010
State : clean
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Checksum : a2c9e839 - correct
Events : 25750
Number Major Minor RaidDevice State
this 1 8 18 1 active sync /dev/sdb2
0 0 0 0 0 removed
1 1 8 18 1 active sync /dev/sdb2
2 2 8 2 2 spare /dev/sda2
--
--- /proc/cmdline
root=/dev/md0 ro console=tty0
--- grub legacy:
module /boot/vmlinuz-2.6.26-2-xen-686 root=/dev/md0 ro console=tty0
module /boot/vmlinuz-2.6.18-6-xen-vserver-686 root=/dev/md0 ro console=tty0
module /boot/vmlinuz-2.6.18-6-xen-686 root=/dev/md0 ro console=tty0
kernel /boot/vmlinuz-2.6.26-2-xen-686 root=/dev/md0 ro
kernel /boot/vmlinuz-2.6.26-2-xen-686 root=/dev/md0 ro single
kernel /boot/vmlinuz-2.6.26-2-686 root=/dev/md0 ro
kernel /boot/vmlinuz-2.6.26-2-686 root=/dev/md0 ro single
kernel /boot/vmlinuz-2.6.18-6-xen-vserver-686 root=/dev/md0 ro
kernel /boot/vmlinuz-2.6.18-6-xen-vserver-686 root=/dev/md0 ro single
kernel /boot/vmlinuz-2.6.18-6-xen-686 root=/dev/md0 ro
kernel /boot/vmlinuz-2.6.18-6-xen-686 root=/dev/md0 ro single
kernel /boot/vmlinuz-2.6.18-6-686 root=/dev/md0 ro
kernel /boot/vmlinuz-2.6.18-6-686 root=/dev/md0 ro single
--- udev:
ii udev 0.125-7+lenny3 /dev/ and hotplug management daemon
cd6f5683974ea65603f04ec699b3cff2 /etc/udev/rules.d/65_mdadm.vol_id.rules
--- /dev:
brw-rw---- 1 root disk 9, 0 2009-12-04 17:33 /dev/md0
brw-rw---- 1 root disk 9, 1 2009-12-04 17:33 /dev/md1
/dev/disk/by-id:
total 0
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 ata-WDC_WD10EACS-00D6B0_WD-WCAU40812326 -> ../../sda
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 ata-WDC_WD10EACS-00D6B0_WD-WCAU40812326-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 ata-WDC_WD10EACS-00D6B0_WD-WCAU40812326-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 ata-WDC_WD10EACS-00D6B0_WD-WCAU41049570 -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 ata-WDC_WD10EACS-00D6B0_WD-WCAU41049570-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 ata-WDC_WD10EACS-00D6B0_WD-WCAU41049570-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 md-uuid-0fe83522:3afebffc:663419d7:7a1d4550 -> ../../md1
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 md-uuid-520af0cf:5d9358cf:0bfe99a0:ead009c9 -> ../../md0
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 scsi-SATA_WDC_WD10EACS-00_WD-WCAU40812326 -> ../../sda
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 scsi-SATA_WDC_WD10EACS-00_WD-WCAU40812326-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 scsi-SATA_WDC_WD10EACS-00_WD-WCAU40812326-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 scsi-SATA_WDC_WD10EACS-00_WD-WCAU41049570 -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 scsi-SATA_WDC_WD10EACS-00_WD-WCAU41049570-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 scsi-SATA_WDC_WD10EACS-00_WD-WCAU41049570-part2 -> ../../sdb2
/dev/disk/by-label:
total 0
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 Xen\x20dom0 -> ../../md0
/dev/disk/by-path:
total 0
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 pci-0000:00:1f.1-ide-1:0 -> ../../hdc
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 pci-0000:00:1f.2-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 pci-0000:00:1f.2-scsi-0:0:0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 pci-0000:00:1f.2-scsi-0:0:0:0-part2 -> ../../sda2
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 pci-0000:00:1f.2-scsi-1:0:0:0 -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 pci-0000:00:1f.2-scsi-1:0:0:0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-12-04 17:33 pci-0000:00:1f.2-scsi-1:0:0:0-part2 -> ../../sdb2
/dev/disk/by-uuid:
total 0
lrwxrwxrwx 1 root root 9 2009-12-04 17:33 8d4b7934-f2ca-4219-8525-33876d05b12a -> ../../md0
/dev/md:
total 0
-- System Information:
Debian Release: 5.0.3
APT prefers stable
APT policy: (500, 'stable')
Architecture: i386 (i686)
Kernel: Linux 2.6.26-2-xen-686 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash
Versions of packages mdadm depends on:
ii debconf 1.5.24 Debian configuration management sy
ii libc6 2.7-18 GNU C Library: Shared libraries
ii lsb-base 3.2-20 Linux Standard Base 3.2 init scrip
ii makedev 2.3.1-88 creates device files in /dev
ii udev 0.125-7+lenny3 /dev/ and hotplug management daemo
Versions of packages mdadm recommends:
ii exim4 4.69-9 metapackage to ease Exim MTA (v4)
ii exim4-daemon-light [mail-tran 4.69-9 lightweight Exim MTA (v4) daemon
ii module-init-tools 3.4-1 tools for managing Linux kernel mo
mdadm suggests no packages.
-- debconf information:
mdadm/autostart: true
mdadm/mail_to: root
mdadm/initrdstart_msg_errmd:
* mdadm/initrdstart: all
mdadm/initrdstart_msg_errconf:
mdadm/initrdstart_notinconf: false
mdadm/initrdstart_msg_errexist:
mdadm/initrdstart_msg_intro:
mdadm/autocheck: true
mdadm/initrdstart_msg_errblock:
mdadm/start_daemon: true