Bug#826907: mdadm: please configure either component device timeout or scterc to guard against scsi layer timeouts

Fri Jun 10 01:31:49 UTC 2016

Package: mdadm
Version: 3.3.2-5+deb8u1
Severity: normal

mdadm waits forever for component devices to complete operations, but
the kernel scsi layer doesn't and may offline the device, causing md to
kick it off the array.

This is actually a very long-standing "stack integration" issue and not
an mdadm bug by itself.  I'd say it is an "integration deficiency" which
creates a low-probability but high-impact risk (grave damage) on
systems not using enterprise-class hardware.  It can be fixed in several
ways, but mdadm would be the best place to deploy sanity for this issue
from a usability PoV, IMO.

Basically, most SATA HDDs will block on read errors for up to two
minutes (during which they will retry, retry with reposition, sometimes
retry with lower spindle speed, and on the more insane firmware, even do
sync sector reallocation instead of punting it to a background task),
which is four times longer than the 30s the Linux scsi layer is willing
to wait by default.

Enterprise HDDs are usually the only exception: even NAS-class HDDs will
cause trouble, as nearly all of them default to SCTERC (aka TLER, CCTL)
disabled at power-on (let's hope it won't get disabled again by a device
reset during EH; that would require a kernel-level fix to address).

One must enable SCTERC (e.g. with smartctl -l scterc,70,70) before
starting the array (initramfs included).  Fortunately, support for SCTERC
can be detected, and it can be queried, so one would only mess with it
when unset.  A longer write timeout might help ensure the HDD has time
to relocate the sector (there's never a good reason for an HDD with
spare sectors still available to return a write error other than a
SCTERC write timeout, or the spare tracks going bad/full).
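
To illustrate the "query first, only set when unset" idea, here is a
rough Python sketch.  It is not anything mdadm does today.  The function
names are made up for this example, the 70/70 defaults (7 seconds, in
smartctl's 100 ms units) are just one choice that stays under the scsi
layer's 30s, and the output wording matched by the regex is what I
believe current smartmontools prints, so check it against your version.

```python
import re
import subprocess

# Assumed wording of `smartctl -l scterc` output ("Read: 70 (7.0 seconds)"
# or "Read: Disabled"); verify against your smartmontools version.
_VALUE = re.compile(r"^\s*(Read|Write):\s+(Disabled|\d+)", re.MULTILINE)


def parse_scterc(output):
    """Return None if SCTERC looks unsupported, else {'read': n, 'write': n}.

    Values are in units of 100 ms; 0 stands for 'Disabled'.
    """
    found = {}
    for name, value in _VALUE.findall(output):
        found[name.lower()] = 0 if value == "Disabled" else int(value)
    if "read" not in found or "write" not in found:
        return None
    return found


def needs_scterc(state):
    """True only when SCTERC is supported and at least one limit is unset."""
    return state is not None and (state["read"] == 0 or state["write"] == 0)


def scterc_command(device, read_ds=70, write_ds=70):
    """Build the smartctl call that sets both limits (units of 100 ms)."""
    return ["smartctl", "-l", "scterc,%d,%d" % (read_ds, write_ds), device]


def ensure_scterc(device):
    """Query the drive and enable SCTERC only if it is currently unset.

    Needs root and a real device; returns True if a setting was applied.
    """
    query = subprocess.run(["smartctl", "-l", "scterc", device],
                           capture_output=True, text=True)
    if needs_scterc(parse_scterc(query.stdout)):
        subprocess.run(scterc_command(device), check=True)
        return True
    return False
```

Something like ensure_scterc("/dev/sda") would have to run for every
component device before the array is assembled, initramfs included.  A
drive that does not support SCTERC is left alone and would need the
longer scsi timeout instead.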

Alternatively, mdadm could increase the timeout of sat/ata component
devices in the scsi layer, from 30s to something like 120s through
/sys/block/###/device/timeout.  This avoids worse data-loss in many
cases, but md will hang for far longer when the component device really
has gone to lalalala land...
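
A minimal sketch of the timeout variant, again in Python and again only
an illustration: the helper names and the sysfs_root parameter (there so
the code can be tried against a scratch directory) are hypothetical, and
120 is simply the figure suggested above.  It only ever raises the
timeout, so a value an admin already set higher is left untouched.

```python
import os


def timeout_path(disk, sysfs_root="/sys"):
    """Path of the scsi command timeout for a whole-disk name such as 'sda'."""
    return os.path.join(sysfs_root, "block", disk, "device", "timeout")


def raise_timeout(disk, seconds=120, sysfs_root="/sys"):
    """Raise the timeout to `seconds` if it is lower; never shorten it.

    Returns the value in effect afterwards.  Writing the real sysfs
    attribute needs root.
    """
    path = timeout_path(disk, sysfs_root)
    with open(path) as attr:
        current = int(attr.read().strip())
    if current >= seconds:
        return current
    with open(path, "w") as attr:
        attr.write("%d\n" % seconds)
    return seconds
```

The setting does not survive a reboot or a device re-plug, so it would
have to be reapplied whenever a component device shows up.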

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh