[Pkg-iscsi-maintainers] Bug#629442: Bug#629442: iscsitarget: ietd gives "iscsi_trgt: Abort Task" errors on high disk load and iscsi connections are dropped

Massimiliano Ferrero m.ferrero at midhgard.it
Tue Jun 14 11:42:09 UTC 2011


> Next time, when you try to test/re-create the bug, capture dstat output.
> The default dstat output is good enough to show us what the system state
> was during the starvation.
Hello, yesterday and last night I performed some more tests; these 
are the results:

1) It seems I am not able to reproduce the bug on a test system.
The test system (san01) has the same processor (E5220) and the same 
amount of RAM (12 GB), but a smaller I/O subsystem: an 8-channel 3ware 
controller with an 8-disk RAID 5 array.
The system that shows the problem (san00) has a 24-channel controller 
and a 23-disk RAID 6 array (+ 1 hot spare).
Both systems are connected through the same gigabit switches.

There is another hardware difference between the two environments: the 
nodes connected to san00 are high-end hardware, and their network 
cards can generate nearly 1 Gb/s of iSCSI traffic, while the nodes 
connected to san01 are low-end hardware and their network cards do not 
exceed 300 Mb/s.
So the system that shows the problem has both a higher-performance I/O 
subsystem and initiators that can generate more than 3 times the I/O 
load.

At the moment I am not able to tell which of these aspects, or their 
combination, creates the conditions for the problem; I suspect it is a 
mix of all of them.
Unfortunately I do not currently have hardware similar to the 
production system to run a test under the same conditions.

2) san00 still shows the problem even with the deadline scheduler 
active on all the logical volumes exported through iSCSI or used by 
the heavy-load operation (dd).
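
For completeness, this is how the scheduler can be checked and set; 
the device name below is only an example, and the setting applies to 
the block devices backing the exported volumes:

cat /sys/block/sda/queue/scheduler              # show the active scheduler
echo deadline > /sys/block/sda/queue/scheduler  # sda is an example device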

3) On san00 I was able to reproduce the problem under simpler 
conditions than the ones I described in the first mail: just one node 
connected through iSCSI (the other node was rebooting), no virtual 
machines running on that node, and the node performing a single 
I/O-intensive operation on one of the LVs exported via iSCSI/LVM (an 
fsck on one file system).
During this operation I launched a dd on san00 and the iSCSI 
connection was dropped after a few seconds.
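
The dd was a plain sequential transfer run locally on san00, along 
these lines (target, block size and count here are only an example):

dd if=/dev/zero of=/data/ddtest bs=1M count=10000   # illustrative target and size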

I am attaching 3 files: the dstat output captured during the test and 
extracts of /var/log/messages and /var/log/syslog.
I have only filtered out information from non-relevant services 
(nagios, dhcp, snmp, postfix, etc.), both for readability and for 
confidentiality.
ietd was running with the following command line:
/usr/sbin/ietd --debug=255
so the log contains debug information.
The problem can be seen in syslog at Jun 14 01:28:53.
At Jun 14 01:34:06 I turned off the node for a reboot, and in the log 
there are some records regarding the termination of the iSCSI 
sessions.
I do not see anything relevant in the ietd debug log, just a restart 
of the connections.
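
The dstat output was captured as requested, with an invocation along 
these lines (the exact options may have differed slightly):

dstat --nocolor 1 > 20110614_dstat   # default columns, 1 second interval, plain text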

In the dstat output the dd operation was started around line 197 and 
terminated at line 208 (I interrupted it as soon as I saw the 
problem).

What I see in the dstat output is the following: for some seconds 
(about 10) dd does not generate a lot of reads and writes:

usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
   7   2  56  35   0   0|  12M   14M|  22k   35k|   0     0 |4415    12k

12M read and 14M written; this could come from the dd operation or 
from the fsck performed through iSCSI.

Then there is a burst of writes, I guess using the full I/O capacity 
of the controller and of the disks:

usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 35   7  35  22   0   1|8180k  325M|  38k   25k|   0     0 |6860    11k
   2   3  59  36   0   1|3072B  541M|  20k   26k|   0     0 |5380  2747
   3   4  64  30   0   0|5120B  473M|  21k   30k|   0     0 |4752    16k

writes of 325M, 541M and 473M, and this is exactly the moment when 
the problem arises.

Could it be that the I/O operations are cached in memory and the 
problem shows up when they are flushed to disk?
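
If that is the case, one test I could run (the values below are only 
an example, not a recommendation) is to watch the amount of dirty 
memory during the dd and to lower the writeback thresholds, so that 
the cached data is flushed in smaller chunks:

grep -E 'Dirty|Writeback' /proc/meminfo           # data waiting to be written out
sysctl vm.dirty_background_ratio vm.dirty_ratio   # current thresholds
sysctl -w vm.dirty_background_ratio=2             # example values only
sysctl -w vm.dirty_ratio=5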


If nothing in the logs points to a potential solution, the only other 
test I can think of is upgrading to a newer kernel, but I see this as 
a last resort for several reasons:
- putting a test kernel directly on a production system is not a wise 
move; I could run into (and in the past already have run into) several 
other unknown bugs
- all our other systems are running a standard lenny or squeeze kernel
- I would lose Debian's kernel security updates

Best regards
Massimiliano

-- 

Massimiliano Ferrero
Midhgard s.r.l.
C/so Svizzera 185 bis
c/o centro Piero della Francesca
10149 - Torino
tel. +39-0117575375
fax  +39-0117768576
e-mail: m.ferrero at midhgard.it
sito web: http://www.midhgard.it

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 20110614_dstat
URL: <http://lists.alioth.debian.org/pipermail/pkg-iscsi-maintainers/attachments/20110614/ccc216a8/attachment-0003.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 20110614_messages
URL: <http://lists.alioth.debian.org/pipermail/pkg-iscsi-maintainers/attachments/20110614/ccc216a8/attachment-0004.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 20110614_syslog
URL: <http://lists.alioth.debian.org/pipermail/pkg-iscsi-maintainers/attachments/20110614/ccc216a8/attachment-0005.ksh>

