
Thread: High IO wait

  1. #1
    Join Date
    Jul 2007
    Posts
    351

    High IO wait

    I'm having a strange issue, that I have yet to figure out the cause of.

    First off, here's my current server specs:
    Single Xeon X3440 (4 cores / 8 threads)
    8GB RAM
    2TB SATA2 HDD
    1gbps network connectivity

    I've recently moved to this server and I've had this bizarre issue.

    Occasionally, when my network activity goes up to around 150-200mbps (that would be 15-20MB/s of data being written to the drive) my IO Wait % starts to go up a lot. It jumps from 0.x% to 20% and sometimes even 40%. The network activity jump is not sustained, and the maximum it stays for (generally) is 15-20 minutes. Very rarely 30 minutes or over.

    The IO wait % during this time frame pushes my load average to over 1.x, sometimes over 3, 4, or 5.x too.

    The problem is that, at random, my IO wait % continues to rise and does not drop down, no matter what. This drives the load up massively (100/200+) and eventually leads to an unresponsive system; the only way out of that state is a remote reboot. My longest uptime without needing a remote reboot ended earlier today at 2 days, so the longest my system has been consecutively up is 2 days. Otherwise, it goes down at least once, sometimes more, within a 24-hour time frame. The spikes are random, and I have no idea what is causing them or why.

    Now, I've checked literally everything. I've checked the hard drive and it appears to be just fine: on an idle system (with httpd etc. disabled) it writes and reads at an average of 110 MB/s +/- 10. The CPU seems to be fine too; I tested that. Same with the memory, nothing wrong there.
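
    For reference, the sequential test can be reproduced with something like this (just a sketch; the test file path is a placeholder, and the direct flags bypass the page cache so the numbers reflect the disk itself):
    Code:
    # sequential write test
    dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct
    # sequential read test of the same file
    dd if=/tmp/ddtest of=/dev/null bs=1M iflag=direct
    rm -f /tmp/ddtest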

    What could it be? Or, for that matter, would a CPU upgrade help the issue at all? An interesting fact is that I have another system with exactly the same specs, and it has network activity spikes of ~100-200 mbps at times too. While the IO wait % on it spikes as well, I haven't had it become unresponsive at all. It's been up ever since it was deployed (along with this system), which is 14 days.

    Further, I've only recently moved to this DC/network. With my previous DC, I had both systems, the only difference being that they were Core i7 950s and not X3440s. I never had any downtime of this sort on them, nor did I ever notice spikes of elevated IO wait %.

    Any help/advice is greatly appreciated.

  2. #2
    Join Date
    Dec 2004
    Posts
    209
    What sizes are your files?

    Upload, download, or both?

    You might consider using nginx instead of Apache (you mention httpd).
    Busy, busy, busy

  3. #3
    Join Date
    Jul 2007
    Posts
    351
    File sizes are variable, from as low as 50 MB to as much as 300+ MB. On some occasions it's around 1 GB too.

    Write. I've noticed that the major IO wait is when data is being written to the drives.

    I've never had a problem with Apache, and it integrates the best with my current web application. Like I said, I have an identical system running that has never had this issue. It's been up for 14 straight days without a problem.

  4. #4
    Join Date
    Dec 2004
    Posts
    209
    My first guess is a filesharing site.

    With a single drive, you are doomed to iowait problems.

    Using nginx will make the system use less memory, so the system has more memory to keep the most-used files cached, saving you HD IO.

    Many factors here: what script? What filesystem, XFS?
    Busy, busy, busy

  5. #5
    Join Date
    Jul 2007
    Posts
    351
    Nope, not a filesharing site. A similar concept, yes, but not a dedicated filesharing site. I've been running it perfectly for a few months now, like I mentioned. It's just this one server, out of the five, that has this issue, and it only started once I moved over to it. Previously, on the i7, it was perfectly fine.

    As for memory, yes, I'm aware. At the moment, monitoring via htop, my system is using about 1.1 GB out of the 7.9 GB available to it. My vm.dirty_ratio is set to use 20% of the 8 GB (1.6 GB) before it dumps the dirty data to the hard drive, so that shouldn't be an issue either.
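
    For reference, a quick sketch of how that threshold can be inspected and set via sysctl (assuming the setting in question is vm.dirty_ratio):
    Code:
    # show the current writeback threshold (% of RAM that may be dirty before flushing)
    sysctl vm.dirty_ratio
    # set it to 20% for the running system
    sysctl -w vm.dirty_ratio=20
    # make it persistent across reboots
    echo "vm.dirty_ratio = 20" >> /etc/sysctl.conf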

  6. #6
    Join Date
    Dec 2004
    Posts
    209
    Not sure how Apache uses memory for caching big files, but you should research that.

    From what I can see, you've got a steady load of outgoing traffic, and then suddenly an upload comes in at high speed. That will disrupt the flow.

    Another example: a USB hard drive connected to a computer. You copy 200 GB of data from the USB hard drive to your computer, then you copy 200 GB of data to the USB hard drive; what happens?
    Busy, busy, busy

  7. #7
    Join Date
    Jul 2007
    Posts
    351
    True, but there's a vast difference between a USB drive's IO speed and an onboard SATA2 drive's IO speed. The drive is capable of reading/writing up to 100 MB/s, yet performance degrades at only ~15-20 MB/s. That's definitely a genuine cause for concern.

  8. #8
    Join Date
    Dec 2004
    Posts
    209
    The more IO operations you issue, the fewer MB/s you get. If I remember correctly, a standard disk does around 150 IO/sec.

    So the question then, how many downloads do you have per sec?
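
    If iostat (from the sysstat package) is available, it will show roughly how many IO operations per second the disk is actually doing; something like this during a spike would tell you how close you are to that ceiling:
    Code:
    # tps column = IO requests per second handled by the device
    iostat -d 5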

    Where is funkywizard? He usually has some nice pointers here.

    But my guess is that you need to use the memory better, and from experience nginx does a much better job there.
    Busy, busy, busy

  9. #9
    Join Date
    Jul 2007
    Posts
    351
    I don't serve thousands of downloads a second, nope. The maximum number of disk write processes I have going at any moment is anywhere between 3 and 6. Rarely does it climb to 8 or 10. Never any more than that.

    I don't have a lot of disk reads per second either, just the average reads you'd expect from normal operation. Basically, it has always been fine, except on this one server. Like I said, an identical server processes about the same (or slightly more) data/load and does not have any problems whatsoever.

    Here's my output from 'free -m'
    Code:
    [root@ ~]# free -m
                 total       used       free     shared    buffers     cached
    Mem:          7975       7917         57          0         14       6927
    -/+ buffers/cache:        975       6999
    Swap:         9977          0       9977
    Trust me, it's most definitely not a memory issue.

  10. #10
    Join Date
    Feb 2008
    Location
    Houston, Texas, USA
    Posts
    2,955
    Hello,

    In Linux, the writeback cache is used to queue disk writes in memory until the pdflush task decides it's time to flush them to disk. The writeback cache is essentially a chunk of physical memory reserved for this purpose, and it expands as needed.

    So you can inspect /proc/meminfo (watch -n 1 cat /proc/meminfo) to find out if the "Writeback" gets full and exhausts physical memory on your system, which in turn can cause swapping. That would indicate a lack of RAM. There's also the case where you're hitting an IOPS limit on the disk itself and so writes get queued up and everything slows down.

    You could increase /proc/sys/vm/dirty_writeback_centisecs to 800 (the pdflush wake-up interval), but it's not clear that this will help at this point. Generally, Linux is good at managing writeback for you when RAM is available. But if you have 50 MB of RAM free and there's a 100 MB file that needs to be written, writeback won't be able to do its job.
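
    A rough sketch of what to watch while a spike is happening, plus the pdflush tweak mentioned above:
    Code:
    # watch the dirty/writeback pools fill and drain in real time
    watch -n 1 'grep -E "Dirty|Writeback" /proc/meminfo'
    # bump the pdflush wake-up interval to 800 centiseconds (the tweak above)
    sysctl -w vm.dirty_writeback_centisecs=800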

    What's the cache size on the disk?

    Regards
    Joe / UNIXY
    UNIXy - Fully Managed Servers and Clusters - Established in 2006
    [ cPanel Varnish Nginx Plugin ] - Enhance LiteSpeed and Apache Performance
    www.unixy.net - Los Angeles | Houston | Atlanta | Rotterdam
    Love to help pro bono (time permitting). joe > unixy.net

  11. #11
    Join Date
    Jul 2007
    Posts
    351
    I do remember reading about this and the role the 'vm.dirty_ratio' setting plays in it. It was initially set to 40% of RAM (3.2 GB) before it would dump the data to the hard drive; I've reduced it to 20% (1.6 GB).

    I haven't seen any swapping, so that's not the issue. I also see a lot of free RAM on the system, as in the output I posted above, so I don't think RAM is the problem here.

    I'm not quite sure how to figure out the cache size, could you give me the required command?

  12. #12
    Join Date
    Feb 2008
    Location
    Houston, Texas, USA
    Posts
    2,955
    Quote Originally Posted by lifetalk View Post
    I'm not quite sure how to figure out the cache size, could you give me the required command?
    hdparm -I /dev/sdx|grep Model

    You can then lookup the model online and figure out the cache size.

    Regards
    UNIXy - Fully Managed Servers and Clusters - Established in 2006
    [ cPanel Varnish Nginx Plugin ] - Enhance LiteSpeed and Apache Performance
    www.unixy.net - Los Angeles | Houston | Atlanta | Rotterdam
    Love to help pro bono (time permitting). joe > unixy.net

  13. #13
    Join Date
    Jul 2007
    Posts
    351
    64 MB, as listed on Seagate's website. The model, just in case you're interested, is ST32000644NS.

  14. #14
    Join Date
    Feb 2008
    Location
    Houston, Texas, USA
    Posts
    2,955
    Can you confirm that write cache is enabled: hdparm -I /dev/sdx|grep "Write cache"

    Based on your description, it's most likely you're saturating your disk IOPS. Can you not KVM into the server when it becomes unresponsive?
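
    If you can catch it before the box locks up completely, extended iostat output (sysstat package) will show whether the disk itself is the bottleneck; this is just a sketch:
    Code:
    # look at the row for your data disk:
    # await = average ms each request spends waiting, %util near 100 = disk saturated
    iostat -x 5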

    Regards
    UNIXy - Fully Managed Servers and Clusters - Established in 2006
    [ cPanel Varnish Nginx Plugin ] - Enhance LiteSpeed and Apache Performance
    www.unixy.net - Los Angeles | Houston | Atlanta | Rotterdam
    Love to help pro bono (time permitting). joe > unixy.net

  15. #15
    Join Date
    Jul 2007
    Posts
    351
    Yep, write cache is enabled. I don't have KVM access, but I'll ask my DC for it so I can get in and check. Although, one of the times when it became unresponsive, I was logged in over SSH with 'top' running. The load average then was around 800 and executing any command was next to impossible. Text I typed echoed back after many seconds, so I suppose KVM wouldn't really help much with a system that's so sluggish. I might be wrong, though, since I haven't really used KVM a lot.

  16. #16
    Join Date
    Mar 2003
    Location
    California USA
    Posts
    13,294
    It will usually be more responsive at the console (KVM) than it will be through an SSH session. Remember that the SSH process needs CPU power to run.


    What OS and kernel are you running?
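
    If it's not handy, something like this will show it:
    Code:
    uname -r                  # running kernel
    cat /etc/redhat-release   # distribution/release (path assumes a RHEL/CentOS box)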
    Steven Ciaburri | Industry's Best Server Management - Rack911.com
    Software Auditing - 400+ Vulnerabilities Found - Quote @ https://www.RACK911Labs.com
    Fully Managed Dedicated Servers (Las Vegas, New York City, & Amsterdam) (AS62710)
    FreeBSD & Linux Server Management, Security Auditing, Server Optimization, PCI Compliance

  17. #17
    Join Date
    Jul 2007
    Posts
    351
    CentOS 64 bit
    2.6.18-194.32.1.el5

  18. #18
    Join Date
    Apr 2009
    Location
    whitehouse
    Posts
    656
    Looks like a drive/driver issue to me. Run a full SMART test on the drive.
    What is the output of hdparm -tT /dev/your_disk?
    Also, upgrade your kernel to the latest, preferably 2.6.37, to get the latest device drivers.
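
    Something along these lines, assuming smartmontools is installed (the device name is a placeholder):
    Code:
    # start a long self-test, then review the results once it finishes
    smartctl -t long /dev/sda
    smartctl -a /dev/sda        # check the self-test log and error log sections
    # raw sequential read benchmark
    hdparm -tT /dev/sda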
    James B
    EzeeloginSetup your Secure Linux SSH Gateway.
    |Manage & Administer Multiple Linux Servers Quickly & Securely.

  19. #19
    Join Date
    Jul 2007
    Posts
    351
    Tried a SMART test; it reports no errors.

    I happen to have this exact issue on another server. That one has the same SATA2 drive, although 1 TB, and dual E5620 processors.

    Results for hdparm are more or less this:
    Code:
    Timing cached reads:   24444 MB in  2.00 seconds = 12248.58 MB/sec
    Timing buffered disk reads:  328 MB in  3.01 seconds = 108.81 MB/sec

    But as usual, I see insane IO wait when there's ~200 mbps of data write activity. And now I see high IO wait regularly even when it's just disk reads (data being streamed to the end user).

    Are these disks just not good enough?

  20. #20
    Join Date
    Mar 2010
    Location
    Germany
    Posts
    681
    A starting point might be to keep the server from going totally unresponsive (Apache will still be affected, but not the general OS).
    Can you launch your disk IO processes using ionice?

    The following example would put them in the least-privileged class:
    (ionice -c 3 command_to_start)

    If that works, you should see the site & logins still work fine and only the up/downloads might still be in trouble. At least it should allow you to do more debugging.
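
    For example, something like this (the command and PID are placeholders):
    Code:
    # start the IO-heavy job in the idle class (only effective with the CFQ scheduler)
    ionice -c 3 /path/to/io_heavy_command
    # or re-class a process that is already running
    ionice -c 3 -p 12345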
    Check out my SSD guides for Samsung, HGST (Hitachi Global Storage) and Intel!

  21. #21
    Join Date
    Jul 2007
    Posts
    351
    Thanks for the recommendation, wartungsfenster. ionice sounds like something I could switch to while I find a permanent solution to this.

    Just looked at its --help output and I see these classes:
    1: realtime, 2: best-effort, 3: idle

    Could you tell me what the default scheduling class is? Realtime?
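
    For reference, ionice can also report the class of an existing process, which I'm guessing would show whatever the default is ($$ here is just the current shell's PID):
    Code:
    # print the IO scheduling class and priority of the current shell
    ionice -p $$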

