I'm having a strange issue that I have yet to figure out the cause of.
First off, here's my current server specs:
Single Xeon X3440 (4 cores / 8 threads)
2 TB SATA2 HDD
1 Gbps network connectivity
I've recently moved to this server and I've had this bizarre issue.
Occasionally, when my network activity climbs to around 150-200 Mbps (roughly 15-20 MB/s of data being written to the drive), my IO wait % rises sharply: it jumps from 0.x% to 20%, sometimes even 40%. The network activity spike is not sustained; it generally lasts 15-20 minutes, very rarely 30 minutes or more.
The IO wait % during this time frame pushes my load average above 1.x, sometimes past 3, 4, or even 5.x.
The problem is that, at random, the IO wait % keeps rising and does not drop back down, no matter what. The load average climbs enormously (100, 200+) and eventually the system becomes unresponsive; the only way out is a remote reboot. My longest uptime without one ended earlier today, after 2 days, so the longest this system has stayed up consecutively is 2 days. Otherwise it goes down at least once, often more, within a 24-hour window. The spikes are random, and I have no idea what is causing them or why.
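Since the box locks up before anything can be inspected, one thing I could do is leave a lightweight logger running so the lead-up survives the reboot. A minimal sketch (the device name "sda" and the log path are assumptions; adjust for your system):

```shell
#!/bin/sh
# Sample the load average and raw disk counters to a file every few
# seconds, so the minutes before a lockup can be reviewed afterwards.
LOG=/tmp/iowait-trace.log
i=0
while [ $i -lt 3 ]; do          # in practice: loop forever under screen/cron
    date >> "$LOG"
    cat /proc/loadavg >> "$LOG"
    grep ' sda ' /proc/diskstats >> "$LOG" || true  # device name is an assumption
    sleep 1
    i=$((i + 1))
done
```

Correlating the timestamps with the traffic graphs should show whether the disk counters stall before or after the load average runs away.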
Now, I've checked literally everything. The hard drive appears to be just fine: on an idle system (with httpd etc. disabled) it reads and writes at an average of 110 MB/s +/- 10. The CPU tests fine too, and the same is true of the memory; nothing wrong there.
What could it be? And would a CPU upgrade help at all? An interesting fact: I have another system with exactly the same specs, and it sees network spikes of ~100-200 Mbps at times too. Its IO wait % spikes as well, but it has never become unresponsive. It's been up ever since it was deployed (alongside this system) 14 days ago.
Further, I've only recently moved to this DC/network. At my previous DC I ran both systems, the only difference being that they were Core i7 950s rather than X3440s. I never had downtime of this sort on them, nor did I ever notice spikes of elevated IO wait %.
File sizes are variable, from as low as 50 MB to as much as 300+ MB; on some occasions around 1 GB.
Write. I've noticed that the major IO wait is when data is being written to the drives.
I've never had a problem with Apache, and it integrates the best with my current web application. Like I said, I have an identical system running that has never had this issue. It's been up for 14 straight days without a problem.
Nope, not a filesharing site. A similar concept, yes, but not a dedicated filesharing site. I've been running it perfectly for a few months now, like I mentioned. It's just this one server, out of the 5, that has this issue, and it only started once I moved over to it. Previously, on the i7, it was perfectly fine.
As for memory, yes, I'm aware. At the moment, monitoring via htop, my system is using about 1.1 GB of the 7.9 GB available to it. My vm.dirty_ratio is set to let dirty pages use 20% of the 8 GB (1.6 GB) before the data is flushed to the hard drive, so that shouldn't be an issue either.
True, but there's a vast difference between a USB drive's IO speed and that of an onboard SATA2 drive. The drive is capable of reading/writing up to 100 MB/s, yet performance degrades at only ~15-20 MB/s. That's definitely a genuine cause for concern.
I don't serve thousands of downloads a second, nope. The maximum number of disk write processes I have going at any moment is anywhere between 3 and 6. Rarely does it climb to 8 or 10. Never any more than that.
I don't have a lot of disk reads per second, either. Average reads you'd expect from any normal operation. Basically, it has always been fine, except this one server. Like I said, an identical server processes about the same (and/or slightly more) data/load and does not have any problems, whatsoever.
In Linux, the writeback cache is used to queue disk writes in memory until the pdflush task decides it's time to flush them to disk. Writeback is essentially a chunk of physical memory reserved for this purpose, and it expands as needed.
So you can inspect /proc/meminfo (watch -n 1 cat /proc/meminfo) to find out whether "Writeback" grows until it exhausts physical memory on your system, which in turn can cause swapping. That would indicate a lack of RAM. There's also the case where you're hitting the IOPS limit of the disk itself, so writes queue up and everything slows down.
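Rather than eyeballing the whole of /proc/meminfo, you can pull out just the counters that matter here; wrap the line in `watch -n 1` to see them move while a transfer is running:

```shell
# Print only the dirty/writeback/free-memory counters from /proc/meminfo.
# If "Writeback" keeps growing while "MemFree" shrinks, the disk is not
# keeping up with the incoming writes.
awk '/^(Dirty|Writeback|MemFree|SwapFree):/' /proc/meminfo
```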
You could increase /proc/sys/vm/dirty_writeback_centisecs to 800 (the pdflush wakeup interval), but it's not clear that will help at this point. Generally Linux is good at managing writeback for you when RAM is available; conversely, if you have only 50 MB of RAM free and there's a 100 MB file that needs to be written, writeback won't be able to do its job.
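For reference, a sketch of how those tunables are set; the values are illustrative, not a recommendation for this workload:

```shell
# Needs root. Raises the pdflush wakeup interval to 8 seconds and
# lowers the dirty-memory ceiling to 10% of RAM (values illustrative).
sysctl -w vm.dirty_writeback_centisecs=800
sysctl -w vm.dirty_ratio=10
# Add the same keys to /etc/sysctl.conf to persist across reboots.
```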
I do remember reading about this and the part the vm.dirty_ratio setting plays in it. It was initially set to 40% of RAM (3.2 GB) before the data would be flushed to the hard drive; I reduced it to 20% (1.6 GB).
I haven't seen any swapping, so that's not the issue. There's also plenty of free RAM on the system, as the output I posted above shows, so I don't suppose RAM is the problem here.
I'm not quite sure how to figure out the cache size, could you give me the required command?
Yep, write cache is enabled. I don't have KVM access, but I'll ask my DC for it so I can get in and check. Although, one of the times it became unresponsive I was logged in over SSH with 'top' running; the load average was around 800.0 and executing any command was next to impossible. Text I typed echoed to the terminal many seconds later, so I suppose KVM wouldn't help much with a system that sluggish. I might be wrong, though, since I haven't really used KVM a lot.
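On the cache-size question above, hdparm can report both the drive's on-board cache and the write-cache state. A hedged example (needs root, and /dev/sda is an assumption; substitute your actual device):

```shell
# The drive's identify data includes its buffer/cache size and features:
hdparm -I /dev/sda | grep -i -E 'cache|buffer'
# Just the write-cache on/off state:
hdparm -W /dev/sda
```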
It will usually be more responsive at the console (KVM) than through an SSH session. Remember that the SSH process itself needs CPU time to run.
What os and kernel are you running?
Steven Ciaburri | Industry's Best Server Management- Rack911.com
Looks like a drive/driver issue to me. Run a full smart test on the drive.
What is the output of hdparm -tT /dev/your_disk?
Also, upgrade your kernel to the latest, preferably 2.6.37 to get the latest device drivers.
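The checks suggested above can be run like so, assuming the smartmontools package is installed and the disk is /dev/sda (an assumption, adjust as needed):

```shell
# Start a long (full-surface) SMART self-test in the drive firmware:
smartctl -t long /dev/sda
# Once it completes, review health, the error log, and self-test results:
smartctl -a /dev/sda
# The buffered/cached read timings asked about above:
hdparm -tT /dev/sda
```

Note the long test runs inside the drive and can take hours on a 2 TB disk; the drive stays usable in the meantime.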