  1. #1
    Join Date
    Nov 2011
    Location
    Harrisburg, PA
    Posts
    2,073

    Rebuilding software raid

    So a drive in a RAID 10 array just failed. It's going to be a few days until we can get up to the DC to swap it out, so I'd like to try re-integrating the failed drive just in case it was a fluke (these are 18-month-old RE4s - it had BETTER be just a fluke!).

    Since this is a live server (Xen node), is it safe to do mdadm -a /dev/md0 /dev/sde1?
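
    (A rough sketch of that re-add attempt, using the md0/sde1 names from this thread -- check the array state first; note that a plain --add on a member whose superblock is no longer trusted will kick off a full resync:)

    cat /proc/mdstat                     # confirm which member is flagged (F)
    mdadm --detail /dev/md0              # full view of the degraded array
    mdadm /dev/md0 --remove /dev/sde1    # clear the failed slot
    mdadm /dev/md0 --add /dev/sde1       # re-add it; md resyncs onto the drive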

  2. #2
    Join Date
    Aug 2004
    Location
    Dallas, TX
    Posts
    3,507
    Mark it as failed and run a full SMART test on the drive.
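
    (A sketch of that sequence, assuming smartmontools is installed and the md0/sde names from this thread; the long test runs inside the drive's firmware and takes a few hours:)

    mdadm /dev/md0 --fail /dev/sde1      # mark the member faulty (if md hasn't already)
    mdadm /dev/md0 --remove /dev/sde1    # pull it out of the array
    smartctl -t long /dev/sde            # start the extended self-test
    smartctl -l selftest /dev/sde        # results, once the test has finished
    smartctl -a /dev/sde                 # full attribute and error-log dump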

  3. #3
    Join Date
    Nov 2011
    Location
    Harrisburg, PA
    Posts
    2,073
    That's the interesting thing -- SMART reports no problems. It's already marked as failed, and the only thing indicating a problem (besides /proc/mdstat and the warning mail) is tons of entries like this in the messages log:

    Dec 13 23:07:00 (hostname) smartd[4188]: Device: /dev/sde [SAT], 2 Currently unreadable (pending) sectors
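
    (For reference, those counters can be read directly with smartctl; a sketch assuming the standard attribute names:)

    smartctl -A /dev/sde | grep -Ei 'pending|realloc|uncorrect'
    # 197 Current_Pending_Sector  - sectors waiting to be remapped on the next write
    #   5 Reallocated_Sector_Ct   - sectors already remapped to spares
    # 198 Offline_Uncorrectable   - sectors the offline scan could not read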

  4. If SMART looks fine, do a low-level format, redo the partitions, and put it back in the md array. It looks like 2 sectors failed, though.
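
    (Modern SATA drives can't really be low-level formatted; the practical equivalent is a full destructive overwrite, which forces the firmware to remap those pending sectors. A sketch, assuming /dev/sde is already out of the array and its data is expendable:)

    badblocks -wsv /dev/sde                          # destructive four-pattern write test, or:
    dd if=/dev/zero of=/dev/sde bs=1M oflag=direct   # single zero-fill pass
    smartctl -A /dev/sde | grep -i pending           # pending count should drop to 0 afterwards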

  5. #5
    Join Date
    May 2013
    Location
    India
    Posts
    748
    Quote Originally Posted by [email protected] View Post
    If SMART looks fine, do a low-level format, redo the partitions, and put it back in the md array. It looks like 2 sectors failed, though.
    Yup, I think that's the right approach too. Mark it as failed, remove it from the RAID, do a low-level format, repartition, then add it back to the array and let it sync. That way you can try your luck.

  6. #6
    Join Date
    Apr 2011
    Location
    Core Files
    Posts
    7,795
    Quote Originally Posted by FRH Lisa View Post
    So a drive in a RAID 10 array just failed. It's going to be a few days until we can get up to the DC to swap it out, so I'd like to try re-integrating the failed drive just in case it was a fluke (these are 18-month-old RE4s - it had BETTER be just a fluke!).

    Since this is a live server (Xen node), is it safe to do mdadm -a /dev/md0 /dev/sde1?

    Any chance you have a new drive at the DC and the ability to hire someone from there to install it? It would save those days for you....unless going to the DC is easier.

  7. #7
    Join Date
    Nov 2011
    Location
    Harrisburg, PA
    Posts
    2,073
    Quote Originally Posted by nixtree View Post
    Yup, I think that's the right approach too. Mark it as failed, remove it from the RAID, do a low-level format, repartition, then add it back to the array and let it sync. That way you can try your luck.
    Quote Originally Posted by [email protected] View Post
    If SMART looks fine, do a low-level format, redo the partitions, and put it back in the md array. It looks like 2 sectors failed, though.
    Thanks - I'll give that a shot. I'm pulling backups right now, so once the server settles down I'll give it a go.

    I'm not looking forward to replacing the drive. This was the first node we built, and in a stroke of genius it never occurred to us to note which UUID went to which physical drive. Finding the physical MB port for /dev/sde is going to be some tricky fishing.
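
    (One way to take the guesswork out of that: read the serial number off /dev/sde and match it against the label printed on the drive itself -- a sketch, assuming the drive still answers identify commands:)

    smartctl -i /dev/sde | grep -i serial    # serial number as the firmware reports it
    hdparm -I /dev/sde | grep -i serial      # same info via hdparm
    ls -l /dev/disk/by-id/ | grep sde        # by-id symlinks embed model + serial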

    Someone suggested it might be a cable failure. Personally I've never seen one, and it seems unlikely that a server that hasn't been moved in months would spontaneously develop a bad cable. Anything's possible, I guess. If that's the case, what are the odds of a bad cable blowing out the entire bus or taking down the whole array?

  8. #8
    Join Date
    Nov 2011
    Location
    Harrisburg, PA
    Posts
    2,073
    This will be our first time rebuilding a software RAID, but it looks pretty straightforward. We're going to drop in a spare and send the failed drive back in for warranty. Since the drive is already failed, is it just a simple matter of:

    1) Power down the server (it's an old build in a tower case; it has to come out of the rack anyway)
    2) Physically remove the failed drive
    3) Physically insert the fresh drive
    4) Reboot, let the array come online, let Xen settle
    5) mdadm /dev/md0 -a /dev/sde1
    6) Array rebuilds itself

    I'm not clear on whether or not we need to manually rebuild the partitions. I've found a few posts on WHT saying that we do, but I thought that was inherent in the RAID rebuild process.

    On a related note, I just finished reading a few posts claiming that Linux reliably maps /dev/sda to physical port 0, /dev/sdb to physical port 1, and so on. How reliable is that? I've never trusted that logic, but then again, I never really cared what port a drive was connected to -- only its mount point or device name. I forgot hdparm existed.
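
    (For what it's worth, sdX names are handed out in probe order, so that mapping isn't guaranteed; the udev symlinks are the stable handles. A sketch of checking the current mapping:)

    ls -l /dev/disk/by-path/    # e.g. pci-0000:00:1f.2-ata-3 -> ../../sde (third port on that controller)
    ls -l /dev/disk/by-id/      # model + serial based names, stable across reboots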
    Last edited by FRH Lisa; 12-16-2013 at 10:51 PM.

  9. #9
    Join Date
    Jun 2004
    Location
    Europe
    Posts
    3,523
    Quote Originally Posted by FRH Lisa View Post
    This will be our first time rebuilding a software RAID
    Don't tell me you used WD RAID Edition drives in a software RAID array. It seems you did.
    When the people above said "mark it as failed," they assumed you were using a RAID controller, since RAID Edition drives are really meant to be used with one. Marking a drive as failed turns on the LED on its hot-swap tray and lets the on-site engineer identify and swap the right disk.
    Among the perks of a hardware RAID controller are hot-swapping drives and flagging faulty ones by lighting the LED; software RAID users don't have that option (afaik).
    Now, let's be specific here.
    RAID Edition Western Digital drives are called RAID Edition because they play well with hardware RAID controllers: their TLER (Time Limited Error Recovery) firmware only tries to recover from an error for a limited time before handing the problem back to the controller. Since you're using software RAID, there is no controller, and that whole mechanism falls apart.
    Why TLER? While a drive is busy with error recovery it stops responding to the controller, and if that takes too long, any RAID controller will kick the disk out of the array and report it as faulty. That is why RAID Edition drives should be used with RAID controllers.
    Last edited by swiftnoc; 12-16-2013 at 11:10 PM.

  10. #10
    Join Date
    Aug 2004
    Location
    Dallas, TX
    Posts
    3,507
    Quote Originally Posted by FRH Lisa View Post
    I'm not clear on whether or not we need to manually rebuild the partitions. I've found a few posts on WHT saying that we do, but I thought that was inherent in the RAID rebuild process.
    If you have different partitions for different RAIDs, e.g. md0 for /boot and md1 for the VG, then simply copy the partition table from one disk to the other and add the partitions back into their arrays. I'd suggest you practice a few times on a test machine; it's a simple process, but better done when you're confident in your skill set.
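
    (A sketch of that copy, with a GPT alternative; sda here stands for the healthy disk and sdb for the replacement, not necessarily the names on this box:)

    sfdisk -d /dev/sda | sfdisk /dev/sdb    # MBR: dump the good disk's table, write it to the new one
    sgdisk -R /dev/sdb /dev/sda             # GPT: replicate sda's table onto sdb (target comes first)
    sgdisk -G /dev/sdb                      # GPT: randomize GUIDs so the two disks don't clash
    mdadm /dev/md0 --add /dev/sdb1          # then add the new partition back into its array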

  11. #11
    Join Date
    Nov 2011
    Location
    Harrisburg, PA
    Posts
    2,073
    Quote Originally Posted by swiftnoc View Post
    Don't tell me you used WD RAID Edition drives in a software RAID array. It seems you did.
    Thank you for your thoughts on the matter. Wait here while I jump in my DeLorean and go back in time 18 months to when that particular server was built.

    Quote Originally Posted by gordonrp View Post
    If you have different partitions for different RAIDs, e.g. md0 for /boot and md1 for the VG, then simply copy the partition table from one disk to the other and add the partitions back into their arrays. I'd suggest you practice a few times on a test machine; it's a simple process, but better done when you're confident in your skill set.
    All the RAID10 drives are strictly for customer data (the OS resides on a separate RAID1 array, which is sitting pretty), so they're all just a single partition each -- which is why I was puzzled by the discussion elsewhere about copying / not copying the partition table. The more I read, the more it sounds like our course of action should be to create the single, full-disk SW RAID partition on the new drive, then add that drive to the array with:

    mdadm --manage /dev/md0 --remove /dev/sde1 (drop old drive from the array)
    (power down, replace physical drive, power up, let Xen resume)
    sfdisk -d /dev/sdd | sfdisk /dev/sde (copy the partition table over from a good drive)
    mdadm --manage /dev/md0 --add /dev/sde1 (add fresh drive to array)
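
    (And to keep an eye on the rebuild afterwards -- a sketch; on RAID10 only the new member's mirror portion gets copied, but it will still take a while under Xen I/O:)

    cat /proc/mdstat                                    # shows recovery progress and the member map
    mdadm --detail /dev/md0 | grep -Ei 'state|rebuild'  # "Rebuild Status : NN% complete" while syncing
    watch -n 30 cat /proc/mdstat                        # poll every 30 seconds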

  12. #12
    Join Date
    Mar 2005
    Location
    Ten1/0/2
    Posts
    2,509
    Yep, you got it....

    Fail the drive,
    replace the drive,
    copy the partition table from a good disk,
    add the new drive back into the array.

    And as Gordon said, practice on a dev/test machine FIRST!
