  1. #1
    Join Date
    Nov 2007
    Smyrna, DE

Event ID 129: Reset to device, \Device\RaidPort1, MegaSAS (RAID 84016E)

For a couple of weeks now, I've been chasing PCIe resets sent to my LSI 84016E RAID controller that I cannot seem to solve. For several weeks the SAN performed great, and then it began throwing frequent but erratic Event ID 129 "Reset to device" errors.

Any time the RAID is under load of any kind (it serves media to my XBMC HTPCs in the house, among other things), drive activity to the RAID volume will lock up for 60 seconds and then resume, seemingly at random. Its only current function is serving media files to HTPCs around the house via Windows File Sharing; prior to this, it served as an iSCSI mount for VMware ESXi nodes. Total free space is around 40%.

    The event log always shows an Event ID 129 with the message "Reset to device, \Device\RaidPort1, was issued" with the provider as megasas. The RAID card logs show NO ERRORS when this happens.

    This is a custom built SAN with the following specs:

• CPU: FX-6100
    • Motherboard: ASUS M5A97 (current), MSI 970A-G43 (prior)
    • RAM: 32GB DDR3-1600
    • RAID Card: LSI 84016E in PCIe x16 slot
    • Power Supply: Corsair Professional Series HX 750
    • OS Drive: 128GB Crucial M4 SSD
    • RAID Drives: 16 x 2TB Hitachi Ultrastar (14 drives in RAID6, 2 drives in RAID1)
    • OS: Win7 Ultimate (current), Server2008R2 (prior)

    History and Troubleshooting:

    • RAM Tests come back clean
    • Drives unhooked from RAID and connected directly to motherboard and all SMART tests come back clean
    • Cables swapped on RAID card with new cables
    • Motherboard replaced
    • RAID card replaced with identical model
    • RAID card Firmware updated (both cards)
    • Fan attached to heatsink on RAID card for better temperature regulation
    • OS Changed from Server2008R2 to Win7 Ultimate
• Power supply tested via a tester and multimeter. All rails holding steady voltage, even under drive load
    • Can replicate error/reset by using CrystalDiskMark3. Lockup/reset SEEMS to happen on the write cycle
    • Cannot replicate error/reset using HDTunePro or IOMeter, even allowing them to run 1 hour+
    • IOMeter does not cause the error even on write cycles (see the CrystalDiskMark3 entry above)
    • Have tried DirectIO and Cached IO on the RAID card
• Have tried NCQ on and off
    • Errors happen to both the RAID1 and RAID6 virtual drives, suggesting it's not limited to a single virtual drive or set of physical drives
    • RAID card consistency check comes back clean
    • RAID card Read Patrol comes back clean
    • Chkdsk on both virtual drives comes back clean
    • sfc /scannow comes back clean (See above: OS replaced)
    • Virus checks come back clean (See above: OS replaced)
    • No errors in RAID card log
    • RAID card log shows no correctable errors, or other errors or alarms
    • MegaCLI shows no errors or SMART errors

    Full text, including the details tab from the Windows Event Viewer:

    Reset to device, \Device\RaidPort1, was issued.
    - <Event xmlns="">
    - <System>
      <Provider Name="megasas" /> 
      <EventID Qualifiers="32772">129</EventID> 
      <TimeCreated SystemTime="2013-10-22T17:32:26.936828400Z" /> 
      <Security /> 
    - <EventData>
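To see whether the resets cluster around load, the System log can be exported from Event Viewer ("Save All Events As..." in XML form) and the Event 129 entries bucketed by hour. A quick sketch, assuming the standard Event Viewer XML export where each `<Event>` carries a `<System>` block like the one quoted above:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Default namespace used by Windows Event Log XML exports
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

def count_resets_per_hour(xml_text):
    """Count Event ID 129 entries per hour in an exported event list.

    Returns a Counter keyed by "YYYY-MM-DDTHH" so spikes line up with
    whatever was hitting the array at the time.
    """
    root = ET.fromstring(xml_text)
    hours = Counter()
    for ev in root.findall(".//e:Event", NS):
        if ev.findtext("e:System/e:EventID", namespaces=NS) != "129":
            continue
        created = ev.find("e:System/e:TimeCreated", NS)
        stamp = created.get("SystemTime") if created is not None else None
        if stamp:
            hours[stamp[:13]] += 1   # truncate to the hour
    return hours
```

If the spikes only appear during sustained writes (media copies, the CrystalDiskMark runs), that corroborates the write-cycle pattern noted above.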
So at this point, I'm out of ideas. The only things I haven't replaced are the CPU, the RAM (though it comes back clean on a memory test), the drives, and the power supply. I'm loath to keep replacing parts indiscriminately. Google isn't much help either.

    Any ideas that I may have missed?

  2. #2
    Join Date
    Oct 2002
    When you unhooked the drives to run a SMART test, did you also run a block check on each disk? Since it's part of a RAID set you'd need your drive vendor's software, or SeaTools. I wonder if there's a bad block or two on a single drive that's triggering your issue.

    I had a NetApp appliance at work where a single drive in the array would hang for about the same amount of time. Exchange Servers do not like that, so the databases hosted would shift themselves to the other server. Finally got lucky and saw the bad blocks on a debug, force-failed the drive and all was well again.

  3. #3
    Join Date
    Nov 2007
    Smyrna, DE
    This is a great suggestion, and no, I did not run a full block check on each drive. Although this had been kind of nagging at the back of my mind, I assumed that the read patrol from the RAID card would have picked up any block errors, but it may not have.

    Brought down the SAN tonight, and am running WD's Data LifeGuard Diagnostic extended tests (basically a block level check) five drives at a time (looks like about 4 hours per set of 5, so I should be finished tomorrow evening at the latest). Have had good luck with WD's tool, so we'll see what turns up.

    Appreciate the suggestion. Sometimes, you just need another set of eyes.

    Will post the results, either way.

  4. #4
    Join Date
    Nov 2007
    Smyrna, DE
    So it took a bit longer than I originally anticipated, but all of the drives have passed a low level scan without any errors whatsoever.

    The only thing I can think of to do at this point is move the data off, destroy and recreate the RAID, and move it back.

  5. #5
    Join Date
    Nov 2007
    Smyrna, DE
    An update for those that may run into this issue in the future.

After extensive and frustrating testing, this appears to have been a RAID card and chipset incompatibility. I had exhausted every single thing except the CPU itself.

After backing the system up, I tried a fresh OS install without the chipset drivers, but this didn't help. Of course, any Windows OS would have installed base chipset drivers anyway, and Device Manager showed no unrecognized devices, which bore that out.

Finally, just to test a theory, I moved the RAID card and RAID volume to an Intel i3-2100 system I had on hand, and all the problems immediately disappeared. Moving off the AM3+ socket resolved it, so it appears to be a problem between the AMD 970 chipset and the LSI 84016E.

    As a final note, I swapped in a 9260-8i with the 970 chipset, and it seemed to work fine.

