For some reason our server (Dell 1750, Xeons, 2 GB RAM, RAID 5 with 3x146 GB drives, RHEL 3.0 ES) went down 24 hours ago. Before that it had worked for 14 months without a single problem.
First problem: when we tried to boot the server, we got the message "No Logical Drive Detected" - the RAID / HDDs failed to start. We called DELL, and after some time spent in the RAID configuration utility (Ctrl+M) we got access to the RHEL "Maintenance Console".
Second problem: "e2fsck IO manager magic bad" - it cannot reopen the /home directory (a huge 200 GB partition).
DELL says it's a software issue and they don't know what to do. Our server management company was able to repair the server using the "fsck" utility from the RHEL "Maintenance Console".
Right now the server is online again, but the question is why this happened and how to prevent it.
Is anybody familiar with DELL 1650/1750 RAID issues that can cause this problem?
Should we continue to operate this server or move customers to other servers? Even DELL cannot answer this question - they just suggest that we RE-INSTALL everything and, as they said, "It should be OK".
However, I'm curious about the error message "No Logical Drive Detected". If that came up in the BIOS during boot for the PERC 4 RAID controller, that's hardware related: your RAID controller is not detecting your configured RAID 5 logical drive.
If Dell had you go through the RAID menu, they must have had you rebuild the array so it could be detected.
There's really no way to tell whether that's a failing part, like bad RAM or a bad battery, or whether your colo facility lost or unplugged power to your server or one of your RAID drives... (Well, unless you're colo'd at SM or EV1, in which case there's probably a 95% chance someone pulled the wrong cable.)
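If the system log survived the crash, one thing that might narrow it down is looking for disk or controller complaints logged shortly before the server went down. Here is a minimal sketch of that idea. The pattern list and the /var/log/messages path are assumptions, not anything confirmed for this box, and the function names are made up for illustration.

```python
import re

# Assumed patterns: substrings that often appear in kernel messages about
# disk, filesystem, or RAID controller trouble. Adjust for your system.
# Note that "megaraid" also matches normal driver start-up lines.
SUSPECT_PATTERNS = [
    r"I/O error",
    r"SCSI error",
    r"EXT3-fs error",
    r"megaraid",
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS), re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching any suspect pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if SUSPECT_RE.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path="/var/log/messages"):
    """Scan a log file (the default path is an assumption) for suspect lines."""
    with open(path, "r", errors="replace") as f:
        return find_suspect_lines(f)
```

An empty result doesn't prove the hardware is healthy (a controller that drops its logical drive may never get a chance to log anything), but hits clustered just before the outage would point away from a pure software problem.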
1) There were NO power outages in the datacenter - for some reason the server just went down in the middle of the night. The datacenter confirmed that this is MY hardware issue.
2) We got several "warnings" when the server booted up the next morning:
HA-0 (Bus 4 Dev 3) Perc 4/Di standard FW 412W DRAM=128MB (SDRAM)
Battery module present on adapter
1 Logical Drive(s) found on the host adapter
1 Logical Drive(s) failed
1 Logical Drive(s) handled by BIOS
Press ctrl M to run configuration utility or any other key to continue.
This message waits for a response for a minute and then the startup continues.
Everything looks OK until the message:
PXE-E61: media test failure, check cable
3) I asked the guys from the datacenter to help me with this issue and put them on the phone with DELL support. They did "something", and within 1 hour I was able to log in to the "maintenance console". At this point there were NO more errors when the server booted up - everything looks fine from the HARDWARE side. It looks like they changed the drives around.
4) When the server tried to start RHEL, I got the error "Unable to re-open /home" and was then dropped to the "Maintenance console".
5) My admin ran "fsck" and rebooted the server.
6) Then we got the server online, but some customers complained about "messed-up" data. I asked them to re-upload the missing/damaged files for now.
7) I called DELL, told them the whole story and asked for advice: should I continue to use this server or move customers away? They said their support and my datacenter guys had already done everything, and that I should just REINSTALL RHEL (which I don't want to do, since it's going to create chaos and additional downtime, AND since the Linux/CP files are not damaged).
What should you do in a situation like this one? Is it a good idea to continue using this server (I have 2 backups of the data on tape)?
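One way to size up the damage before deciding would be to restore a tape backup to a scratch location and compare it against the live /home. Below is a rough sketch of that comparison, assuming the backup has been restored to some directory. The function names and directory layout are hypothetical.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(backup_root, live_root):
    """Compare every file under backup_root with its counterpart in live_root.

    Returns (missing, changed): relative paths absent from the live tree,
    and relative paths whose contents differ from the backup copy.
    """
    missing, changed = [], []
    for dirpath, _dirnames, filenames in os.walk(backup_root):
        for name in filenames:
            backup_path = os.path.join(dirpath, name)
            rel = os.path.relpath(backup_path, backup_root)
            live_path = os.path.join(live_root, rel)
            if not os.path.isfile(live_path):
                missing.append(rel)
            elif file_digest(backup_path) != file_digest(live_path):
                changed.append(rel)
    return sorted(missing), sorted(changed)
```

Files that customers legitimately edited after the backup was taken will also show up as "changed", so the list is a starting point for review, not a verdict. It's also worth looking in the lost+found directory on /home, since fsck puts recovered orphan files there.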