  1. #1

    CFQ I/O Scheduler - Performance for all hosts - Please Read

    My name is Matt Heaton with Bluehost. As many of you know, I try to work at a very low level in the kernel to address what I see as fundamental I/O concerns in the Linux kernel. Lately, we have started employing very, VERY smart kernel developers, both locally and around the world, to investigate performance issues in the kernel, create patches to solve these issues, and do the political work involved to get those patches into the mainline Linux kernel available at kernel.org.

    While about half of these patches are for us only, when we see a problem that is so widespread that everyone can benefit we do like to share this information. This is one of those times.

    In almost every modern linux distribution CFQ is set as the default I/O scheduler for all your block devices. Here is a brief description of different I/O schedulers, what they are primarily used for etc. http://www.redhat.com/magazine/008ju...es/schedulers/

    Anyway, we have found HUGE problems with CFQ in many different scenarios and on many different hardware setups. If it were only an issue with our configuration I would have foregone posting this message and simply informed the kernel developers responsible for the fix.

    Two scenarios where CFQ has a severe problem - The first is when you are running a single block device (one drive, or a RAID 1 array). Under certain circumstances where heavy sustained writes are occurring, the CFQ scheduler will behave very strangely: it will give nearly all access to reads and throttle writes to the point of allowing only 0-2 write operations per second versus 100-180 read operations per second. This condition persists indefinitely until the sustained write process completes. This is VERY bad for a shared environment where you need both reads and writes to complete regardless of increased load on either side. This behavior goes beyond what CFQ says it is supposed to do in this situation - meaning this is a bug, and a serious one at that. We can reproduce this EVERY TIME.

    The second scenario occurs when you have two or more block devices, either single drives or any type of RAID array, including RAID 0, 1, 0+1, 1+0, 5 and 6. (We never tested 3 or 4 - who uses RAID 3 or 4 anymore anyway?!) This case is almost exactly the opposite of what happens with only one block device: if one or more of the drives is blocked with heavy writes for a sustained period of time, CFQ will block reads from the other devices, or severely limit them, until the writes have completed. We can also reproduce this behavior, 100% consistently, with test software we have written.

    This is VERY bad.
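
    For anyone who wants to see this for themselves, a rough way to observe the starvation (this is only a generic sketch with example file paths, NOT our actual test harness) is to start a large sustained write and watch per-device completion rates with iostat from the sysstat package:

    # Start a long sustained write on one device (the path is only an example)
    dd if=/dev/zero of=/mnt/disk1/bigfile bs=1M count=20000 &

    # In another shell, generate sustained reads from the same or another device
    dd if=/mnt/disk2/somelargefile of=/dev/null bs=1M &

    # Watch reads/sec (r/s) and writes/sec (w/s) per device, refreshed every second
    iostat -x 1

    If you are hitting the behavior described above, one side of the workload will collapse to nearly zero operations per second under CFQ and recover as soon as you switch the device to deadline.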

    I have written several times about dirty page cache and how write-outs of dirty pages can and will starve reads on other block devices. That is still the case, so stacking the dirty cache issue on top of CFQ is a performance nightmare.

    We have tested this on kernels from 2.6.22 through 2.6.27-rc8.

    If you think this doesn't affect you, think again! This is a HUGE problem. So, what can you do about it? Well, we have tried adjusting the CFQ tunables to get the proper behavior, but it's clearly busted deep in the code. My suggestion, and what we have done, is to switch all our block devices to the deadline scheduler. My preference would be to use CFQ IF it worked as it is documented to, but it doesn't. In all our tests, when everything is blocked, switching to deadline in the middle of the slowdown almost immediately relieves the problem and keeps it solved going forward. I HIGHLY suggest web hosts consider this until CFQ can be properly fixed. We will post our tests and how to duplicate the problem to the LKML (Linux kernel mailing list) on Monday to speed along a fix to CFQ in this area.

    To check which I/O scheduler you are currently running, type:

    cat /sys/block/sdX/queue/scheduler - Replace sdX with your device, such as sda, sdb, sdc, and so forth.

    The output should look like this:

    noop anticipatory deadline [cfq]
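
    If you have several drives, a quick shell one-liner (nothing special, just a convenience) prints the current scheduler for every SCSI/SATA block device at once:

    for f in /sys/block/sd*/queue/scheduler; do echo "$f: $(cat $f)"; done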

    If you would like to change from CFQ to deadline, issue this command:

    echo deadline > /sys/block/sdX/queue/scheduler


    After you issue this command, the output should then look like this:

    noop anticipatory [deadline] cfq

    This method of changing it won't survive a reboot, but it lets you see instantly how the change will or won't affect your block device performance. To make it permanent, add these changes to a boot-up script for all your block devices, or pass elevator=deadline on the kernel command line.
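
    As an example of the boot-up script approach (device names are assumed here - adjust for your own hardware), something like this in /etc/rc.local or another init script works on most distributions:

    #!/bin/sh
    # Set the deadline scheduler on every SCSI/SATA block device at boot
    for dev in /sys/block/sd*/queue/scheduler; do
        echo deadline > "$dev"
    done

    The equivalent boot loader entry would look something like this (kernel image and root device are placeholders):

    kernel /vmlinuz-2.6.27 ro root=/dev/sda1 elevator=deadline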

    There is one caveat to this information. If you are using SSDs (solid state drives) for any block devices, you should instead use the noop (no operation) I/O scheduler (for reasons I don't want to get into here - you can Google them if you want to understand the very important reasoning behind this).

    The command to do this is

    echo noop > /sys/block/sdX/queue/scheduler

    Please remember that noop is ONLY for SSDs and other flash devices. It will cause regular platter-based hard drives to run incredibly slowly!

    Hope this helps until the CFQ guys can get this resolved. I am actually quite astounded that such a severe bug has been in CFQ all this time, and that it is the default scheduler for virtually all enterprise Linux distributions. I will post on my blog (Mattheaton.com) when a new CFQ patch that resolves the issue is available. If you have ever wondered why your file systems sometimes "hiccup" for 1-2 seconds and then seem to get back on track, this is usually the problem (at least as far as we have been able to test). Interestingly, we have seen much better throughput for both large and small MySQL DBs using deadline. I doubt I will switch back to CFQ for my databases even after a solid patch has been written.
    Matt Heaton / President Bluehost.com - Hostmonster.com

  2. #2
    Join Date
    Sep 2008
    Location
    UK!
    Posts
    37
    Matt,

    Thank you for such an informative and well-thought-out post, I'll be taking a look at this in depth.

    Thanks.
    Exia Studios
    Competent systems administration, development and hosting; premium service at a great price. Advice available.

    Blog: Coming soon. | Email: [email protected] | PIN: 2557B0C2 | Twitter: http://twitter.com/exiastudios

  3. #3
    Join Date
    Mar 2003
    Location
    California USA
    Posts
    13,386
    Matt,

    Have you noticed any issues with database corruption on databases that get hundreds of updates a minute when using deadline? I've been using deadline for years, and in specific cases I have seen it cause database corruptions that didn't happen under CFQ.
    Steven Ciaburri | Industry's Best Server Management - Rack911.com
    Software Auditing - 400+ Vulnerabilities Found - Quote @ https://www.RACK911Labs.com
    Fully Managed Dedicated Servers (Las Vegas, New York City, & Amsterdam) (AS62710)
    FreeBSD & Linux Server Management, Security Auditing, Server Optimization, PCI Compliance

  4. #4

    Response

    Quote Originally Posted by Steven View Post
    Matt,

    Have you noticed any issues with database corruption on databases that get hundreds of updates a minute when using deadline?
    I haven't noticed any corruption at all. Are you:

    1) Using a RAID controller? If so, do you have write cache enabled? And do you have a battery backup unit on the card (NOT a UPS!)?
    2) Using MySQL mostly, or another DB? Primarily MyISAM or InnoDB as the storage engine?
    3) What file system are you using for your DB? Ext3, Ext2, JFS, XFS, ReiserFS?
    4) Any special mount options for your DB partition? Writeback journaling or the default? XFS and others only journal the metadata, but I would still like to know.
    5) How much memory do you have on your server? And what is the size of your /var/lib/mysql directory (not the whole partition)? You can cd into /var/lib/mysql and run "du -sh" to add it all up if you want.
    6) What kernel version are you running? Run "uname -a" and post the output back.

    What specific types of corruption are you getting? Any other info you could provide would be great.
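
    For reference, most of these answers can be pulled with stock commands (the MySQL data directory, mount point, and database name below are placeholders/defaults - adjust for your layout):

    uname -a                              # 6) kernel version
    free -m                               # 5) total memory in MB
    du -sh /var/lib/mysql                 # 5) size of the MySQL data directory
    grep /var /proc/mounts                # 3/4) filesystem type and mount options for the DB partition
    mysql -e "SHOW TABLE STATUS" yourdb   # 2) storage engine per table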
    Matt Heaton / President Bluehost.com - Hostmonster.com

  5. #5
    Join Date
    Mar 2003
    Location
    California USA
    Posts
    13,386
    Quote Originally Posted by mheaton View Post
    I haven't noticed any corruption at all. Are you:

    1) Using a RAID controller? If so, do you have write cache enabled? And do you have a battery backup unit on the card (NOT a UPS!)?
    2) Using MySQL mostly, or another DB? Primarily MyISAM or InnoDB as the storage engine?
    3) What file system are you using for your DB? Ext3, Ext2, JFS, XFS, ReiserFS?
    4) Any special mount options for your DB partition? Writeback journaling or the default? XFS and others only journal the metadata, but I would still like to know.
    5) How much memory do you have on your server? And what is the size of your /var/lib/mysql directory (not the whole partition)? You can cd into /var/lib/mysql and run "du -sh" to add it all up if you want.
    6) What kernel version are you running? Run "uname -a" and post the output back.

    What specific types of corruption are you getting? Any other info you could provide would be great.

    The specifics are kind of pointless because it has happened across maybe 30 servers over the past couple of years, all with various configurations and kernels.

    The issue happened only on MyISAM tables.

    The table(s) would go corrupt and require a myisamchk -o to repair; -r would not repair them.

    Most of them had RAID. The ones with a RAID battery backup were writeback cached; I do not run writeback cache on servers without a RAID battery backup.

    Most were RAID 10, none were RAID 5, and very few were RAID 1.

    All filesystems were ext3.

    I stopped using deadline after spending months trying to find the cause and waking up to randomly failed databases.

    The /var/lib/mysql directory generally was a minimum of 5 GB,

    and the servers have a minimum of 2 GB of RAM.

    Most are SCSI, some were SATA, none with IDE.
    Steven Ciaburri | Industry's Best Server Management - Rack911.com
    Software Auditing - 400+ Vulnerabilities Found - Quote @ https://www.RACK911Labs.com
    Fully Managed Dedicated Servers (Las Vegas, New York City, & Amsterdam) (AS62710)
    FreeBSD & Linux Server Management, Security Auditing, Server Optimization, PCI Compliance

  6. #6
    Quote Originally Posted by Steven View Post
    The specifics are kind of pointless because it has happened across maybe 30 servers over the past couple of years, all with various configurations and kernels.

    The issue happened only on MyISAM tables.

    The table(s) would go corrupt and require a myisamchk -o to repair; -r would not repair them.

    Most of them had RAID. The ones with a RAID battery backup were writeback cached; I do not run writeback cache on servers without a RAID battery backup.

    Most were RAID 10, none were RAID 5, and very few were RAID 1.

    All filesystems were ext3.

    I stopped using deadline after spending months trying to find the cause and waking up to randomly failed databases.

    The /var/lib/mysql directory generally was a minimum of 5 GB,

    and the servers have a minimum of 2 GB of RAM.

    Most are SCSI, some were SATA, none with IDE.

    Well, here is my opinion; take it for what it's worth. Most RAID controllers have an option for writeback cache and writeback journaling, and very few people understand the difference. I don't know your experience here, so I will spit out what I think could be your issue.

    The RAID manufacturers are TERRIBLE at explaining the difference, and most RAID cards call these options by different names. Let's take 3ware for example.

    3ware has 3 modes to run in.

    Safe, Balanced, and Performance mode. And they have a battery backup option.

    What they DON'T TELL YOU is that safe and balanced will actually use the battery, but performance won't use it at all even if you have it. I know that doesn't make sense, but I am 100% correct on this. Write cache IS enabled in all three modes, but writeback journaling is only enabled in performance mode; write-through is what is used in safe and balanced modes.

    3ware doesn't even give you any other options, and they don't mention writeback or write-through in any of the settings I was talking about.

    75% of the time I look at people's RAID arrays, the battery isn't doing anything and they are losing data because of the journaling. 3ware is the WORST offender. They don't want to promote it because if the card is set to use the battery properly and not lose data, the performance numbers go down quite a lot.

    LSI controllers do the same thing.

    So if you do have a battery backup and you happen to be using 3ware, try "balanced" mode for your array with write cache turned on. You shouldn't lose any data at all even if you rip the plug out of the wall and boot up over and over.

    So please respond back with which brand of RAID controller you have. I would really like to know.
    Matt Heaton / President Bluehost.com - Hostmonster.com

  7. #7
    Join Date
    Mar 2003
    Location
    California USA
    Posts
    13,386
    SATA were 3ware, SCSI were LSI.
    Steven Ciaburri | Industry's Best Server Management - Rack911.com
    Software Auditing - 400+ Vulnerabilities Found - Quote @ https://www.RACK911Labs.com
    Fully Managed Dedicated Servers (Las Vegas, New York City, & Amsterdam) (AS62710)
    FreeBSD & Linux Server Management, Security Auditing, Server Optimization, PCI Compliance

  8. #8

    Well....

    Quote Originally Posted by Steven View Post
    SATA were 3ware, SCSI were LSI.
    Well, you may want to look at the 3ware ones and make sure they are set to either "safe" or "balanced", as performance mode is SURELY the culprit for data loss, even with a battery. On the SCSI side with LSI I would look at the same thing, although I can't remember the exact setting.

    I would be interested to know whether you had "performance" mode set on your 3ware cards. Could you check 3DM or the tw_cli command line interface and let me know?
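
    From memory (the exact controller/unit numbers and syntax vary by card and firmware generation, so treat this as a rough pointer rather than the definitive commands), on the 9000-series cards tw_cli exposes a StorSave policy that corresponds to those safe/balanced/performance modes:

    tw_cli /c0/u0 show all              # look for the "Storsave Policy" line
    tw_cli /c0/u0 set storsave=balance  # policies are protect, balance, and perform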

    Also, I just wrote up a big gripe blog post about 3ware and RAID controllers in general. There are many things about 3ware that you ought to know that aren't common knowledge. You can read it at mattheaton.com if you are interested.
    Matt Heaton / President Bluehost.com - Hostmonster.com

  9. #9
    Join Date
    Oct 2004
    Posts
    294
    Hi

    We have checked our servers, and only on CentOS 5.2 with kernel 2.6.18-92.1.13.el5PAE and 2x750 GB SATA RAID-1 drives did we find:

    noop anticipatory deadline [cfq]

    We use this server as a webhosting box with 300 accounts. Do you think we should consider running:

    echo deadline > /sys/block/sdX/queue/scheduler

    to help server I/O performance? This is a cPanel box running 300 websites, most of which use MySQL (we do not allow InnoDB). Could running this command (echo deadline > /sys/block/sdX/queue/scheduler) on a server with RAID-1 on a 3ware 8006-2LP card be dangerous? Do we have to stop all services before running it, or can we run the command while the server is under normal use by our clients (mail, websites, etc.)?
    Thanks for advice!

  10. #10
    Quote Originally Posted by mheaton View Post
    We have tested this on kernels from 2.6.22 through 2.6.27-rc8.

    We will post our tests and how to duplicate the problem to the LKML (Linux kernel mailing list) on Monday to speed along a fix to CFQ in this area.
    When you post, I think it may help to be more specific about the case and avoid using the uppercase "VERY BAD" all the time. I'm sure Jens Axboe will have a look if you provide a good case, and especially if someone else can reproduce your problems.

  11. #11
    Quote Originally Posted by mheaton View Post
    We have tested this on kernels from 2.6.22 through 2.6.27-rc8.

    If you think this doesn't affect you, think again! This is a HUGE problem.
    Does anyone happen to know off-hand which kernel version first started using the CFQ I/O scheduler?

    Virtuozzo 3.x uses a patched up 2.6.9 kernel and none of our Virtuozzo 3.x boxes have the /sys path mentioned above (and, yes, I did substitute our drive's device name). I took a gander at various paths and values in that section of /sys and didn't find anything that specified a scheduler.

    We've been avoiding Virtuozzo 4.x due to some reported instabilities that didn't happen with the 3.x version. I don't have a Virtuozzo 4.x box sitting around to check the kernel version to see if it is affected by this. Does anyone reading have a Virtuozzo 4.x box that they can take a look at the output of 'uname -a' and post it here?
    Sincerely,
    Andrew Kinney
    CTO, Advantagecom Networks
    http://www.SimplyWebHosting.com

  12. #12
    Join Date
    May 2006
    Location
    San Francisco
    Posts
    7,207
    Quote Originally Posted by advantagecom View Post
    Does anyone reading have a Virtuozzo 4.x box that they can take a look at the output of 'uname -a' and post it here?
    2.6.18-028stab057.10

  13. #13
    So, it looks like Virtuozzo 4.x is affected by this.

    If I am incorrect, please post what disk I/O queuing method is used.
    Sincerely,
    Andrew Kinney
    CTO, Advantagecom Networks
    http://www.SimplyWebHosting.com
