10-22-2013, 05:09 PM #1
Event ID 129: Reset to device, \Device\RaidPort1, MegaSAS (RAID 84016E)
For a couple of weeks now, I've been chasing some PCIe resets sent to my LSI 84016E RAID controller that I cannot seem to resolve. The SAN performed great for several weeks, and then began logging Event 129 "Reset to device" errors at erratic intervals.
Anytime the RAID is under load of any kind, drive activity to the RAID volume will lock up for 60 seconds and then resume, seemingly at random. Its only current function is to serve media files to the XBMC HTPCs around the house via Windows File Sharing (among other small things); prior to this, it served as an iSCSI mount for VMware ESXi nodes. Total free space is around 40%.
The event log always shows an Event ID 129 with the message "Reset to device, \Device\RaidPort1, was issued" with the provider as megasas. The RAID card logs show NO ERRORS when this happens.
This is a custom built SAN with the following specs:
- CPU: FX-6100
- Motherboard: ASUS M5A97 (current), MSI 970A-G43 (prior)
- RAM: 32GB DDR3-1600
- RAID Card: LSI 84016E in PCIe x16 slot
- Power Supply: Corsair Professional Series HX 750
- OS Drive: 128GB Crucial M4 SSD
- RAID Drives: 16 x 2TB Hitachi Ultrastar (14 drives in RAID6, 2 drives in RAID1)
- OS: Win7 Ultimate (current), Server2008R2 (prior)
History and Troubleshooting:
- RAM Tests come back clean
- Drives unhooked from RAID and connected directly to motherboard and all SMART tests come back clean
- Cables swapped on RAID card with new cables
- Motherboard replaced
- RAID card replaced with identical model
- RAID card Firmware updated (both cards)
- Fan attached to heatsink on RAID card for better temperature regulation
- OS Changed from Server2008R2 to Win7 Ultimate
- Power supply tested via a tester and multimeter. All rails holding steady voltage, even under drive load
- Can replicate error/reset by using CrystalDiskMark3. Lockup/reset SEEMS to happen on the write cycle
- Cannot replicate error/reset using HDTunePro or IOMeter, even allowing them to run 1 hour+
- IOMeter does not cause the error even on write cycles (see the CrystalDiskMark3 entry above)
- Have tried DirectIO and Cached IO on the RAID card
- Have tried NQC on and off
- Errors happen to both the RAID1 and RAID6 virtual drives, suggesting it's not limited to a single virtual drive or set of physical drives
- RAID card consistency check comes back clean
- RAID card Read Patrol comes back clean
- Chkdsk on both virtual drives comes back clean
- sfc /scannow comes back clean (See above: OS replaced)
- Virus checks come back clean (See above: OS replaced)
- No errors in RAID card log
- RAID card log shows no correctable errors, or other errors or alarms
- MegaCLI shows no errors or SMART errors
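For anyone trying to reproduce the lockups without a benchmark tool, a small write-latency probe is one option. The following is only a minimal sketch: the function name `probe_write_stalls`, its parameters, and the 5-second threshold are hypothetical choices and were not part of the original testing. It writes blocks to a temporary file on the target volume, forces each one to disk, and records any write that takes longer than the threshold.

```python
import os
import tempfile
import time


def probe_write_stalls(directory, blocks=256, block_size=1024 * 1024, threshold_s=5.0):
    """Write `blocks` chunks to a temp file in `directory`, syncing each one.

    Returns a list of (block_index, seconds) for every write that took at
    least `threshold_s` seconds. All names and defaults are illustrative.
    """
    stalls = []
    payload = os.urandom(block_size)
    fd, path = tempfile.mkstemp(dir=directory, suffix=".probe")
    try:
        with os.fdopen(fd, "wb") as handle:
            for index in range(blocks):
                start = time.perf_counter()
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # push the write past the OS cache
                elapsed = time.perf_counter() - start
                if elapsed >= threshold_s:
                    stalls.append((index, elapsed))
    finally:
        os.remove(path)  # always clean up the probe file
    return stalls
```

Pointing `directory` at a folder on the RAID volume and comparing the timestamps of any reported stalls with the Event ID 129 entries would show whether the two line up. Whether this access pattern triggers the reset the way CrystalDiskMark3 does is not something the sketch can guarantee.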
Full text, including the details tab from the Windows Event Viewer:
Reset to device, \Device\RaidPort1, was issued.
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="megasas" />
    <EventID Qualifiers="32772">129</EventID>
    <Level>3</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2013-10-22T17:32:26.936828400Z" />
    <EventRecordID>21077</EventRecordID>
    <Channel>System</Channel>
    <Computer>SAN.xxxxxxxx.local</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\RaidPort1</Data>
    <Binary>0F001800010000000000000081000480040000000000000000000000000000000000000000000000000000000000000001000000810004800000000000000000</Binary>
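If several of these events need to be compared, the exported XML can be summarized with a short script. This is a rough sketch under stated assumptions: `SAMPLE` is a trimmed copy of the event above with closing tags added so that it parses, its `Binary` field holds only the first 16 bytes of the payload, and `summarize_event` is a hypothetical helper name. It also checks one observation: combining the Qualifiers value (32772, or 0x8004) with the event ID (129, or 0x81) gives 0x80040081, and those bytes appear in little-endian order inside the binary payload.

```python
import xml.etree.ElementTree as ET

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed, well-formed copy of the event above (closing tags added as an
# assumption; Binary truncated to its first 16 bytes for brevity).
SAMPLE = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="megasas" />
    <EventID Qualifiers="32772">129</EventID>
    <TimeCreated SystemTime="2013-10-22T17:32:26.936828400Z" />
  </System>
  <EventData>
    <Data>\Device\RaidPort1</Data>
    <Binary>0F001800010000000000000081000480</Binary>
  </EventData>
</Event>"""


def summarize_event(xml_text):
    """Pull the provider, event ID, device, and timestamp out of one event."""
    root = ET.fromstring(xml_text)
    event_id_node = root.find("e:System/e:EventID", NS)
    event_id = int(event_id_node.text)
    qualifiers = int(event_id_node.get("Qualifiers"))
    # High 16 bits from Qualifiers, low 16 bits from the event ID.
    full_code = (qualifiers << 16) | event_id
    raw = bytes.fromhex(root.find("e:EventData/e:Binary", NS).text.strip())
    return {
        "provider": root.find("e:System/e:Provider", NS).get("Name"),
        "event_id": event_id,
        "device": root.find("e:EventData/e:Data", NS).text,
        "time": root.find("e:System/e:TimeCreated", NS).get("SystemTime"),
        "full_code": f"0x{full_code:08X}",
        # True when the little-endian bytes of full_code occur in the payload.
        "code_in_binary": full_code.to_bytes(4, "little") in raw,
    }
```

The payload therefore seems to repeat the event code rather than carry extra detail from the RAID card, which fits with the card's own logs showing nothing.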
Any ideas that I may have missed?
10-22-2013, 07:42 PM #2 Retired Moderator
When you unhooked the drives to run a SMART test, did you also run a block check on each disk? Since it's part of a RAID set you'd need your drive vendor's software, or SeaTools. I wonder if there's a bad block or two on a single drive that's triggering your issue.
I had a NetApp appliance at work where a single drive in the array would hang for about the same amount of time. Exchange Servers do not like that, so the databases hosted would shift themselves to the other server. Finally got lucky and saw the bad blocks on a debug, force-failed the drive and all was well again.
10-23-2013, 12:01 AM #3
This is a great suggestion, and no, I did not run a full block check on each drive. Although this had been kind of nagging at the back of my mind, I assumed that the read patrol from the RAID card would have picked up any block errors, but it may not have.
Brought down the SAN tonight, and am running WD's Data LifeGuard Diagnostic extended tests (basically a block level check) five drives at a time (looks like about 4 hours per set of 5, so I should be finished tomorrow evening at the latest). Have had good luck with WD's tool, so we'll see what turns up.
Appreciate the suggestion. Sometimes, you just need another set of eyes.
Will post the results, either way.
10-25-2013, 03:33 PM #4
So it took a bit longer than I originally anticipated, but all of the drives have passed a low-level scan without any errors whatsoever.
The only thing I can think of to do at this point is move the data off, destroy and recreate the RAID, and move it back.
11-12-2013, 06:10 PM #5
An update for those that may run into this issue in the future.
After extensive and frustrating testing, this appears to have been a RAID card and chipset incompatibility. I exhausted every single thing except the CPU itself.
After backing the system up, I tried a fresh OS install without the chipset drivers, but this didn't help. Of course, any Windows OS would have installed the base chipset drivers anyway, and Device Manager showed no unrecognized devices, so that test proved little.
Finally, just to test a theory, I moved the RAID card and RAID volume to an Intel i3-2100 system that I had on hand, and all the problems immediately disappeared. Moving from the AM3+ socket to an F1 socket also helped. So it appears to be a problem between the AMD 970 chipset and the LSI 84016E.
As a final note, I swapped in a 9260-8i with the 970 chipset, and it seemed to work fine.