[Pkg-iscsi-maintainers] Bug#629442: Bug#629442: iscsitarget: ietd gives "iscsi_trgt: Abort Task" errors on high disk load and iscsi connections are dropped
Massimiliano Ferrero
m.ferrero at midhgard.it
Tue Jun 14 11:42:09 UTC 2011
> Next time, when you try to test/re-create the bug, capture dstat output.
> The default dstat output is good enough to report us on the system state
> was during starvation.
Hello, yesterday and tonight I performed some more tests; these are the
results:
1) it seems I am not able to reproduce the bug on a test system
the test system (san01) has the same processor (E5220) and the same
amount of RAM (12 GB), but a smaller I/O system: an 8-channel 3ware
controller with an 8-disk RAID 5 array
the system that presents the problem (san00) has a 24-channel controller
and a 23-disk RAID 6 array (+ 1 hot spare)
both systems are connected through the same gigabit switches
there is another hw difference between the two environments: the nodes
connected to san00 are high end hw, their network card is able to
generate nearly 1 Gb/s of iscsi traffic
the nodes connected to san01 are low end hw and their network card does
not exceed 300 Mb/s
so the system that presents the problem has both an I/O subsystem with
higher performance and the machine that is doing iscsi traffic is able
to generate more than 3 times as many I/O operations
at the moment I am not able to tell which of these aspects, or their
combination, creates the conditions for the problem: I suspect that it's
a mix of all of them
unfortunately at the moment I do not have hw similar to the one in
production to perform a test in the same conditions.
2) san00 presents the problem even with the deadline scheduler active on
all logical volumes exported through iscsi or used by the heavy load
operation (dd)
3) on san00 I was able to reproduce the problem in a simpler condition
than the one I described in the first mail: just one node connected
through iscsi, the other node was restarting, no virtual machines
running on the node, the node was performing one i/o intensive operation
on one of the lv exported by iscsi/lvm (an fsck on one file system)
during this operation I launched a dd on san00 and the iscsi connection
was dropped after a few seconds
I am attaching 3 files: dstat output during the test and an extract of
/var/log/messages and /var/log/syslog
I have only filtered out entries for irrelevant services (nagios,
dhcp, snmp, postfix, etc.), both for readability and confidentiality
ietd was running with the following command line
/usr/sbin/ietd --debug=255
so in the log we have debug information
the problem can be seen in syslog at Jun 14 01:28:53
at Jun 14 01:34:06 I turned off the node for reboot and in the log there
are some records regarding the termination of iscsi sessions
I do not see anything relevant in ietd debug log, just a restart of the
connections
in dstat output the dd operation was started around line 197 and was
terminated at line 208 (I interrupted the operation as soon as I saw the
problem)
what I see in dstat output is the following: for some seconds (about
10) dd does not generate a lot of reads and writes
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  7   2  56  35   0   0|  12M   14M|  22k   35k|   0     0 |4415   12k
12M read and 14M write, and this could be from the dd operation or the
fsck performed through iscsi
then there is a burst of writes, I guess using the full I/O capacity of
the controller and of the disks
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 35   7  35  22   0   1|8180k  325M|  38k   25k|   0     0 |6860   11k
  2   3  59  36   0   1|3072B  541M|  20k   26k|   0     0 |5380  2747
  3   4  64  30   0   0|5120B  473M|  21k   30k|   0     0 |4752   16k
write 325M, 541M, 473M
and this is exactly the moment when the problem arises
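
To locate such bursts in a longer dstat capture without reading it by eye, something like the following Python sketch could be used. It assumes the default dstat column layout shown above (groups separated by '|', disk read/write in the second group) and that the B/k/M/G suffixes are multiples of 1024; the helper names and the 300M threshold are only illustrative choices, not part of dstat.

```python
import re

# Assumption: dstat size suffixes are powers of 1024 (B, k, M, G).
_UNITS = {"B": 1, "k": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(token):
    """Convert a dstat size token such as '325M' or '3072B' to bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([BkMG]?)", token)
    if match is None:
        raise ValueError(f"unrecognised dstat size: {token!r}")
    number, unit = match.groups()
    # A bare number (no suffix) is treated as a plain byte count.
    return int(float(number) * _UNITS.get(unit, 1))


def disk_read_write(line):
    """Return (read, write) in bytes from one default-layout dstat row."""
    # Groups are separated by '|': cpu | disk | net | paging | system.
    groups = line.split("|")
    read, write = groups[1].split()
    return parse_size(read), parse_size(write)


def write_bursts(lines, threshold=300 * 1024 ** 2):
    """Yield (index, write_bytes) for rows writing more than threshold."""
    for index, line in enumerate(lines):
        _, write = disk_read_write(line)
        if write > threshold:
            yield index, write
```

Fed the four data rows quoted above (header lines must be skipped first), write_bursts() would flag the 325M, 541M and 473M rows and skip the 14M one.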
could it be that the I/O operations are cached in memory and the problem
appears when they are flushed to disk?
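
One way to check this hypothesis on the next run would be to sample the Dirty and Writeback fields of /proc/meminfo once per second alongside dstat. The following is only a sketch of that idea in Python: the function names are made up for illustration, and it does nothing more than parse the text of /proc/meminfo.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value.

    Memory fields are reported in kB; a few fields (for example the
    HugePages counters) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def pending_writeback_kb(text):
    """Return (Dirty, Writeback) in kB, or 0 for a field that is absent."""
    info = parse_meminfo(text)
    return info.get("Dirty", 0), info.get("Writeback", 0)


def sample(path="/proc/meminfo"):
    """Read the file once and return the (Dirty, Writeback) pair in kB."""
    with open(path) as handle:
        return pending_writeback_kb(handle.read())
```

Calling sample() in a loop with a one-second sleep while dd runs would show whether Dirty grows during the quiet ten seconds and then drops as the write burst starts, which is what the caching explanation would predict.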
If the logs do not yield any pointer to a potential solution,
the only other test I can think of is upgrading to a newer kernel, but I
see this as a last resort for several reasons:
- as I see it, putting a test kernel directly on a production system is
not a wise move: I could run into several other unknown bugs (and in
the past I already have)
- all our other systems are running on a standard lenny or squeeze kernel
- I would lose support for kernel security updates from debian
Best regards
Massimiliano
--
Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis
c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375
fax +39-0117768576
e-mail: m.ferrero at midhgard.it
website: http://www.midhgard.it
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 20110614_dstat
URL: <http://lists.alioth.debian.org/pipermail/pkg-iscsi-maintainers/attachments/20110614/ccc216a8/attachment-0003.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 20110614_messages
URL: <http://lists.alioth.debian.org/pipermail/pkg-iscsi-maintainers/attachments/20110614/ccc216a8/attachment-0004.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 20110614_syslog
URL: <http://lists.alioth.debian.org/pipermail/pkg-iscsi-maintainers/attachments/20110614/ccc216a8/attachment-0005.ksh>