[sane-devel] Mandrake 9.1 and ServeRAID 5i

Sun, 14 Sep 2003 18:06:31 +0200

Let me drop this in your lap. It's an extremely serious problem for 
users of a specific RAID system (you know, the people who are paranoid 
that anything should go wrong with their data or the availability of 
their server): it crashes the server and messes up that data. Evidence 
suggests (also look at that bug report mentioned below!) that SANE may 
be the guilty party. I will just reproduce the last message I sent to 
some parties involved (with [...] used to omit some irrelevant text and 
to avoid divulging some identities), which may be rather verbose, but 
you never *really* know what's exactly relevant and/or convincing enough 
and/or interesting.

Raf Schietekat wrote:

> Note: ***Urgent***: If you (Mandrake and maybe IBM) would like to have 
> me perform specific tests on my system, perhaps with Mandrake 9.2 RC1, 
> it will almost have to be this week, because next week I'd like to bring 
> the server into production. Please use this opportunity!

No reaction, BTW, and now it's too late (I probably should have come 
here sooner, but I did not know what SANE was, and then my message was 
blocked for a while before I sent it again). Any further testing would 
have to support a very targeted, convincing, and quick intervention, 
because it would cost me a complete reinstallation, which I am obviously 
reluctant to risk. My workaround will be a cron(-like?) task that 
disables anything related to scanners every minute or so (the frequency 
of the existing msec security check), to protect against accidental 
updates that reinstate the code.
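
A minimal sketch of such a periodic task, assuming that clearing the 
execute bits on the scanner-probing programs is enough to keep them from 
running. The paths are assumptions (only /usr/sbin/scannerdrake appears 
in this thread; the sane-find-scanner location is a guess), and the 
function name is hypothetical:

```python
import os
import stat

# Assumed locations; adjust for the actual installation.
SCANNER_TOOLS = [
    "/usr/sbin/scannerdrake",
    "/usr/bin/sane-find-scanner",
]

EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def disable_tools(paths=SCANNER_TOOLS):
    """Clear the execute bits on each existing path; return the paths changed."""
    changed = []
    for path in paths:
        try:
            mode = stat.S_IMODE(os.stat(path).st_mode)
        except FileNotFoundError:
            continue  # not installed, nothing to disable
        if mode & EXEC_BITS:
            os.chmod(path, mode & ~EXEC_BITS)
            changed.append(path)
    return changed
```

Called from a small script that cron runs as root, this would undo an 
update that reinstalls the tools with execute permission. It does not 
touch anything else that might trigger a probe.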

> 
> Brief description: Mandrake 9.1 crashes systems with ServeRAID. 
> Extensive report below, including a reference to a previous bug report, 
> currently marked as needing further information (well, here is the info).
> 
> Raf Schietekat wrote:
>[...]
>> For [...], whom I've 
>> included in cc, a summary, in case you want to step in: I've been 
>> test-running an IBM xSeries 235 with ServeRAID 5i for several weeks, 
>> with Mandrake 9.1 (probably still the most recent version). Yesterday, 
>> I inserted two 3COM NICs in bus B, which also carries the ServeRAID 5i 
>> card. To test that the latter was still independently running at full 
>> 100 MHz speed as in the documentation and not dragged down to the 
>> NICs' 33 MHz, I did "time tar cf - / | wc -l", which showed about 7.5 
>> MB/s throughput as before (unless it was more like 10 MB/s before, I'm 
>> not exactly sure). I then used drakconf to see whether the NICs were 
>> identified correctly. I did this from a remote ssh -X session, which 
>> froze up. I could not open another ssh connection. On the console 
>> itself, the mouse pointer was still moving, but I could not type 
>> anything into the logon screen. The bottom two drives were spinning 
>> continuously, while the top one wasn't doing anything, this for a RAID 
>> 5 setting involving all three drives. Since nothing seemed to work, I 
>> did a reset (small button, I hope I shouldn't have used, e.g., the 
>> power button instead). During reboot, the file system proved to be 
>> corrupted, and could not be repaired (I will have to find out how to 
>> do that, or reinstall everything).
>>
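
For reference, wc -l counts newline characters rather than bytes, so a 
figure in MB/s needs a byte count (as from wc -c) divided by the elapsed 
time. A minimal sketch of that arithmetic for any byte stream; the 
function names and the 1 MB = 10**6 bytes convention are assumptions, not 
something stated in the report:

```python
import time


def measure_throughput(stream, chunk_size=1 << 20):
    """Read stream to EOF; return (total_bytes, elapsed_seconds)."""
    total = 0
    start = time.monotonic()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total, time.monotonic() - start


def mb_per_second(total_bytes, seconds):
    """Convert a byte count and a duration to MB/s (1 MB = 10**6 bytes)."""
    return total_bytes / 1e6 / seconds
```

With the tar output piped into such a script, sys.stdin.buffer would be 
the stream to pass in. As a check on the arithmetic, 450,000,000 bytes in 
60 seconds gives 7.5 MB/s.
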
>> After some further research using www.google.com for ["Mandrake 9.1" 
>> ServeRAID], which at first didn't seem necessary because I had 
>> repeatedly and successfully done all these steps before and the only 
>> new things were the two NICs on bus B (the same bus that carries the 
>> ServeRAID 5i card), it appears that I may have been bitten by what's 
>> dealt with on the following page:
>>
>> http://qa.mandrakesoft.com/show_bug.cgi?id=3421
>> (this is where I saw Thierry Vignaud's address; I've found [...]'s 
>> address in /usr/sbin/scannerdrake on a Mandrake 9.0 installation)
> 
> 
> I've got my system up and running again. This involved the following:
> - /dev/sda8 had disappeared, although its neighbours were still there,
> - I tried MAKEDEV, but this uses /usr/bin/perl, on /dev/sda7, which was 
> not yet mounted,
> - I did "mount /usr",
> - I did ./MAKEDEV,
> - I rebooted, and things seemed fine.
> Then I wanted to try a few things to see whether I could pinpoint the 
> problem. Here is a complete account of what I did, probably erring on 
> the side of giving too much information, but in the hope that it will be 
> helpful for you to fix Mandrake's configuration managers etc. (I suggest 
> that a probe for ServeRAID precedes and disables a probe for a scanner, 
> perhaps with user input, unless the scanner probe can be changed so that 
> it does no damage to the ServeRAID controller card configuration).
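
The suggested ordering (look for ServeRAID first, and skip the scanner 
probe if it is present) can be sketched as a check on the kernel's SCSI 
device listing. This is only an illustration under stated assumptions: it 
assumes the /proc/scsi/scsi text layout, the sample listing is invented 
around the two device names quoted in this report (including the guess at 
how "IBM 32P0042a S320 1" splits into vendor, model, and revision), and 
the function names are hypothetical, not scannerdrake's or SANE's actual 
logic:

```python
import re

# Invented sample in the /proc/scsi/scsi layout; IDs and revisions are made up.
SAMPLE = """\
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: IBM      Model: SERVERAID        Rev: 1.00
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 15 Lun: 00
  Vendor: IBM      Model: 32P0042a S320    Rev: 1
  Type:   Processor                        ANSI SCSI revision: 02
"""

ENTRY = re.compile(
    r"Vendor:\s*(?P<vendor>.*?)\s+Model:\s*(?P<model>.*?)\s+Rev:\s*(?P<rev>.*?)\s*\n"
    r"\s*Type:\s*(?P<type>.*?)\s+ANSI"
)


def parse_devices(text):
    """Return one dict (vendor, model, rev, type) per listed device."""
    return [m.groupdict() for m in ENTRY.finditer(text)]


def has_serveraid(devices):
    """True if any listed device identifies itself as an IBM ServeRAID."""
    return any(
        d["vendor"] == "IBM" and "SERVERAID" in d["model"].upper()
        for d in devices
    )


def scanner_probe_allowed(text):
    """Allow the scanner probe only when no ServeRAID device is listed."""
    return not has_serveraid(parse_devices(text))
```

On a live system the text would come from reading /proc/scsi/scsi. A 
real tool would presumably ask the user before skipping the probe rather 
than silently refusing.
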
> The system now only has a(n extra) NIC on bus A, which is separate from 
> bus B which also carries the ServeRAID controller card. If I do "# 
> scannerdrake" from a remote ssh -X session (I like to work from my 
> laptop; the server is in a little server room), the system wants to 
> install some packages, but I refuse to cooperate. It then says that it 
> is scanning, or something (gone too fast for me to be able to read), and 
> then it says "IBM SERVERAID is not in the scanner database, configure it 
> manually?" (an obvious sign that something is going wrong with the 
> scanner probe). I respond No. It then says "IBM 32P0042a S320 1 is not 
> in the scanner database, configure it manually?". Don't even know what 
> that is. I respond No. Then it does the same for "IBM SERVERAID" again, 
> I respond No. And the same again for the other one, I respond No. Then I 
> get the following panel:
> - title: Scannerdrake
> - text: There are no scanners found which are available on your system.
> - button: Search for new scanners
> - button: Add a scanner manually
> - button: Scanner sharing
> - button: Quit
> I persevered, and clicked "Search for new scanners", well, that's the 
> same as before, from just after the scanning. No crash yet. I did Quit.
> Then I did vi `which harddrake2`, and I tried to add the line that 
> [...] suggested (next if $Ident =~ "SCANNER";), but then vi froze 
> (perhaps some of the file was still in memory from a previous vi 
> session, but then it wanted to access the disk?). The other ssh sessions 
> continued to work, unlike during the previous failure; I tried man perl 
> in another one to try and see an explanation for double quotes ([...]) 
> vs. slashes (harddrake2), but I got the error "-bash: /usr/bin/man: 
> Input/output error", repeatedly. I can still open other ssh sessions, 
> and the console itself works, but I see that all 3 drives have an amber 
> status light (not the green activity light, and if I remember correctly 
> the status light is normally off), and that the "System-error LED" is 
> lit on the "Operator information panel" (only other lit signs are 
> "Power-on LED" and "POST-complete LED"), with also one LED lit in the 
> "diagnostic LED panel" inside the computer, next to a symbol of a disk 
> and the letters "DASD". When I look next, the console has gone from 
> graphics to text mode, and is filling with messages about "EXT3-fs 
> error", "ext3_reserve_inode_write: IO failure", "ext3_get_inode_lock: 
> unable to get inode block". Meanwhile, the remote ssh sessions are still 
> responsive. I don't try anything on the console, and use a remote ssh 
> session to try "# shutdown -h now" as root, but obviously the command 
> cannot be read from disk (error message "bash: shutdown: command not 
> found"). ctl-alt-del on the console's keyboard: same thing (this causes 
> init to (try to) invoke shutdown). I then did a reset (actually a power 
> cycle; just a reset would have been better). The three drives were still 
> marked defunct (status lights on). I used the ServeRAID support CD to 
> boot, and could set two of the physical drives online, but the last one 
> did not have that right-click menu option (I even set the second one 
> defunct again, was able to bring the third online, but then the option 
> was missing on the second one). So then I briefly removed the second 
> drive from its hot-swap bay, and when I inserted it again it started 
> getting rebuilt from the other drives, and (according to the log) 
> completed a little over an hour later (for 30+ GB disk capacity, of 
> which maybe less than 1 GB in use, if that matters). I tell ServeRAID 
> Manager (?) to reboot, and then I'm stuck with a garbled Mandrake splash 
> screen and a succession of:
> Boot: linux-secure
> Loading linux-secure
> Error 0x00
> and then a succession of just:
> Boot: linux-securere
> ctl-alt-del works (but brings no salvation).
> Was data lost during the reset/power cycle (hopefully not during the 
> rebuild, because that would defeat the purpose of having a RAID), or as 
> early as the corruption of the ServeRAID controller card that 
> (ultimately?) set the drives to defunct state? Apparently the boot 
> doesn't even get to the stage where it would decide about clean state of 
> the file systems, so this is not something we can afford on a system in 
> production (evidence that recovery is not a simple matter and may 
> involve data recovery from backup, unless *perhaps* if a boot floppy 
> takes the system past this stage, after which ext2/ext3 gets a chance to 
> repair itself, but I have no boot floppy... (will make one now, though, 
> next chance I get)).
> I reboot into diagnostics (PC DOCTOR 2.0, apparently a specific feature 
> of the IBM server), and the SCSI/RAID Controller test category passed.
> Next I will proceed to reinstall the whole system from scratch.
> 
>>
>> I'm not sure yet, though (why hasn't this happened before, and has a 
>> conclusion been reached?), that's why I've also cc'ed [...]. 
>> It seems strange however, if this is indeed the 
>> problem, that a hardware adapter card should prove so vulnerable to a 
>> probing method used for a different device (a scanner), but then again 
>> I have no close knowledge of these issues.
>>
>> BTW, the machine is not yet in production (I was going to do that, but 
>> I guess I can now wait a few days), and available for tests.
>>
>> I still think it's really unfortunate that there is no list of known 
>> *in*compatibilities, because who would suspect, with ServeRAID 
>> support, or drivers anyway, available for SuSE, TurboLinux, Caldera 
>> (SCO Group, the enemy!), and RedHat, that Mandrake would pose a 
>> problem? The same goes for Mandrake's site, of course (all of IBM is 
>> just "known hardware", and xSeries 235 and ServeRAID 5i are just absent).
>>
>> http://www-1.ibm.com/servers/enable/site/xinfo/linux/servraid
>> (this is where I saw the address ipslinux@us.ibm.com)
>>
>> http://www.mandrakelinux.com/en/hardware.php3
>>
>> [...]

Raf Schietekat <Raf_Schietekat@ieee.org>