I usually have to replace RAID cards every quarter.
A RAID card can die, but it's not common... I would keep a spare available, plus either backups or a spare SAN kept in sync.
EasyDCIM.com - DataCenter Infrastructure Management - Reach Me: email@example.com Bandwidth Billing | Inventory & Asset Management | Server Control
Order Forms | Reboots | IPMI Control | IP Management | Reverse&Forward DNS | Rack Management
I've got a mix of 3ware, Adaptec, Intel, and LSI (non-3ware) cards. I've had quite a few failures across the board, but Adaptec ZCR (the Supermicro kind) has had the worst issues. Failures on the others have usually been recoverable. I've had 2 non-recoverable failures on 3ware, but those were partially drive faults, or so I thought at the time. A couple of weeks later 3ware released a firmware update to fix errors on certain drives, which may have been what caused my problems (the controller reported SMART/drive errors that no other tool saw and refused to use the drive, among other things, from my recollection of it; been a while).
I'd say overall: make sure you have good backups and spare cards. One day something will fail, and RAID is NOT a backup.
We've always used a mix of LSI and 3ware controllers, and while I've seen a small percentage come in DOA, I've never had to replace one.
However, as Stephen said, RAID is not a backup. I have been through an array failure (due to too many disks failing), and it does ruin the day. Several of us spent the next 20 hours restoring backups.
Ideally, if you're going to be storing critical systems, you should really build two systems, each with its own array. And then mirror them.
Or, for not a whole lot more, you can pick up something like the HP MSA2000: an entry-level SAN that takes two controllers with automatic failover. (And then get a second unit, and mirror.) Uptime is a good thing.
Chris Rogers - firstname.lastname@example.org Inerail - Servers, Colocation, IP Transit
Performance, Reliability, Security
New York • Philadelphia • London • Salt Lake City
Very rarely do we have a RAID card failure, but it is the type of failure where you don't want to be scrambling around looking for a replacement - you may find the card is no longer available, or that no one has any stock in your country... and so on, so try to make sure you have an identical spare just in case.
Darren Lingham - UK Webhosting Ltd. - (0800) 024 2931 Tsohost.co.uk - Quality UK Windows and Linux hosting since 2003 UK WordPress Hosting - Fast, easy, cloud based WordPress Hosting
Sadly, RAID controllers are one of the things we see die more than any other component.
We have easily seen over 200 controllers die. By "die" I mean they become intermittent: the controller stops responding (rejecting I/O to an offline device) and the machine has to be power-cycled, sometimes with filesystem corruption, and/or the array gets eaten.
We have seen a huge number of bad controllers from 3ware, LSI, and Areca.
The Areca controllers ran flawlessly for 1.5 years until they just started randomly failing. In all cases, replacing the controller fixed the issue. It took months of working with Areca to finally verify the issue and get them to fix our cards; of course, by that time we had bought in excess of 100 extra cards that we didn't need =(.
The Areca problem is specific to *some* ARC-1222/1212 cards, so if you don't use those models you probably won't see it. Also, I think we see so many issues because these are all shared servers running really heavy disk I/O 24/7.
Areca controllers have been the best at not eating the arrays, though, and also the easiest and fastest to recover. Considering they were flawless (in reliability) for 1.5 years, I still think they are my favorite brand. Both LSI and 3ware would sometimes eat arrays when the controller crapped out, or even when it didn't.
Also, 3ware performance is so horrible they aren't even worth looking at (at least their older-generation cards, before LSI bought them).
We had about a 20-25% failure rate on LSI cards too, and half of them failed out of the box. Basically, when setting up a new server they would crap out while doing a mkfs. At least this never happened with Areca. All the controllers were pretty similar in cost, too.
LSI - We have RMA'd a number of LSI cards for "fatal errors" - basically, the RAID controller has an onboard kernel panic and the system falls over. Quite nice when you are running an 8-16 drive RAID10 system that's a hypervisor for a bunch of virtualized guests.
Adaptec - We have had a number of RAID failures on Adaptec cards, but the cards themselves seemed fine afterwards.
Areca - We had to RMA a good number of their 2-port cards because of bad checksum issues. Areca isn't a very responsive company to deal with, either.
Jay Sudowski // Handy Networks LLC // Co-Founder & CTO AS30475 - Level(3), HE, Telia, XO and Cogent. Noction optimized network. Offering Dedicated Server and Colocation Hosting from our SSAE 16 SOC 2, Type 2 Certified Data Center. Current specials here. Check them out.
I'm a 3ware kind of guy, and in the 3 years that I have been using their cards, only one has failed. Like the other folks posting here, I'd also recommend having at least one spare available per server with an active array.
Also, 90% of RAID failures involve failing disks. Be sure to check your drives once in a while for any evidence of trouble; smartctl or even the tw_cli utility can show early signs of failure.
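A minimal sketch of the kind of check I mean, assuming smartmontools is installed. The attribute names come from standard `smartctl -A` output; here the output is fed from a sample capture, since live output needs real hardware, and the sample raw values are made up for illustration:

```shell
#!/bin/sh
# Flag drives whose SMART counters suggest trouble.
# Reallocated and pending sectors are the attributes that most often
# precede outright drive failure.

# Sample capture of two lines from `smartctl -A /dev/sdX` (values invented):
sample_output='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3'

check_smart() {
  # Column 2 is the attribute name, column 10 the raw value;
  # print any watched attribute with a nonzero raw count.
  echo "$1" | awk '$2 ~ /Reallocated_Sector_Ct|Current_Pending_Sector/ && $10 > 0 {
    print $2 "=" $10
  }'
}

check_smart "$sample_output"
```

On a live box you would run `smartctl -A /dev/sdX` directly; for drives hidden behind a 3ware controller, smartmontools can address individual ports with `smartctl -a -d 3ware,N /dev/twa0`, or you can use 3ware's own `tw_cli /c0 show` to list unit and drive status. Device names here are placeholders for your own setup.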