Bug#780207: default read error timeouts: drives dropped regularly + data loss on array re-build

Chris email.bug at arcor.de
Tue Mar 10 14:59:01 UTC 2015


Package: mdadm
Severity: serious

In theory, the md kernel module can cope with common read
errors by automatically rewriting the affected block using the
redundancy of the other drives.

In practice, however, wrong default error timeouts often let the
controller reset a drive (so that the drive is dropped from the array, or
an array rebuild fails) before the drive properly signals a read error.

This applies to all drives whose error recovery timeout
defaults to "Disabled" (most devices except special RAID drives), or
that do not support an error recovery timeout at all.

To keep users from running into loss of redundancy and data, the mdadm
package should ship udev rules that call proper timeout
adjustment scripts for md devices.
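
To illustrate what such a timeout adjustment could look like, here is a
minimal sketch. It is not the script attached to #780162. The function
names (plan, apply, exit_status) and the values of 7.0 seconds for the
drive and 180 seconds for the kernel are assumptions chosen for
illustration. It uses "smartctl -l scterc" to set the drive's error
recovery limit and writes the kernel command timeout to
/sys/block/<disk>/device/timeout. Treating a non-zero smartctl exit
status as "could not be set" is also an assumption.

```python
#!/usr/bin/env python3
"""Sketch: align drive-side and kernel-side read-error timeouts for one disk."""
import subprocess
from pathlib import Path

# All three values are assumptions for illustration, not taken from #780162.
ERC_DECISECONDS = 70          # drive gives up on a bad sector after 7.0 s
KERNEL_TIMEOUT_WITH_ERC = 30  # seconds; the usual Linux SCSI command timeout
KERNEL_TIMEOUT_NO_ERC = 180   # seconds; leaves room for slow internal retries


def plan(supports_erc):
    """Return the settings to apply, depending on the drive's capability."""
    if supports_erc:
        return {"scterc": ERC_DECISECONDS,
                "kernel_timeout": KERNEL_TIMEOUT_WITH_ERC}
    return {"scterc": None, "kernel_timeout": KERNEL_TIMEOUT_NO_ERC}


def exit_status(cmd):
    """Run a command quietly and return its exit status."""
    return subprocess.run(cmd, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode


def apply(disk, sysfs_root=Path("/sys/block"), run=exit_status):
    """Configure one disk (kernel name such as 'sda'); needs root privileges."""
    erc = ERC_DECISECONDS
    # Assumption: a non-zero exit status means the limit could not be set.
    ok = run(["smartctl", "-l", f"scterc,{erc},{erc}", f"/dev/{disk}"]) == 0
    settings = plan(ok)
    # The drive must report the error before the kernel gives up on it, so
    # the kernel timeout is raised when the drive-side limit is unavailable.
    timeout_file = sysfs_root / disk / "device" / "timeout"
    timeout_file.write_text(f"{settings['kernel_timeout']}\n")
    return settings
```

A udev rule could call something like apply("sda") for each member disk
of an md array. The point of the design is only that the drive-side
limit stays below the kernel-side timeout, so that md sees a read error
instead of a reset drive.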


(In the hope of inclusion, the proposed scripts have been posted upstream,
but without response. Thus, see also https://bugs.debian.org/780162
"default timeouts causing data loss", because it is still a distro
responsibility to ensure that installations work with safe
defaults.)


Note: The udev rules to be shipped with mdadm are
available as part of the .zip file attached to bug #780162.




PS: If the smartmontools package does not ship the scripts, then all
redundancy-providing packages such as mdadm, lvm, btrfs, etc. would have
to ship their own copies.


