  1. #1
    Join Date
    Jan 2008
    Location
    Montreal, Canada
    Posts
    133

    R1SOFT v3 brings down servers (I/O issue)

    Hi,

    Has anyone else had an issue with R1Soft v3 Enterprise? Each week it brings a server down because of an I/O issue. The servers are on RAID10, and we have to reboot them to bring them back online (we see many "CDP I/O" entries in the process manager...).

    Thank you WHT,

  2. #2
    You might have full block scan turned on for the backups. You should only be doing a full block scan for the initial backup, or, potentially, for a backup that occurs where the r1soft server isn't sure it has a good record of all the file deltas (like if you had to reinstall the cdp agent after a kernel upgrade). Full block scan is an option you can turn on / off in the backup policy, so you want to make sure it's off. It's also worthwhile to schedule the r1soft backup for an off peak time of day. For all of these reasons (and more) we started doing daily backups instead of hourly, which also helps here.
    Phoenix Dedicated Servers -- IOFLOOD.com
    Email: sales [at] ioflood.com
    Skype: funkywizard
    Backup Storage VPS -- 1TBVPS.com

  3. #3
    Join Date
    Jan 2008
    Location
    Montreal, Canada
    Posts
    133
    Hi,

    Thank you for your answer. Full block scan is not active, so I'll contact R1Soft directly.

    If anyone else has had the same issue, please let me know.

    Best Regards,

  4. #4
    Join Date
    Mar 2003
    Posts
    12,923
    Moved > Specialty Hosting and Markets.

  5. #5
    Join Date
    Jun 2005
    Posts
    3,133
    It's normal. They will crash servers frequently, and sometimes cause file corruption on the drives as well, which you need to fix manually. Welcome to the world of CDP.
    PingHosters - Expert community for hosters - NEW: Post Reviews

  6. #6
    Join Date
    May 2008
    Location
    Citrus Heights, CA
    Posts
    1,693
    Originally posted by nibb:
    It's normal. They will crash servers frequently, and sometimes cause file corruption on the drives as well, which you need to fix manually. Welcome to the world of CDP.
    nicely done.
    Best Regards,

    Mark

  7. #7
    Join Date
    Jun 2002
    Location
    PA, USA
    Posts
    5,130
    We rarely have issues with CDP. If we do, then my admins have not told me about them.

    What kind of drives and how many of them do you have on your RAID10? How many servers are you backing up?

  8. #8
    Join Date
    Mar 2003
    Location
    Kansas City, Missouri
    Posts
    462
    * Upgrade your version to the latest available version
    * Build a new kernel module (r1soft-setup --get-module) and then restart your CDP agent (/etc/init.d/cdp-agent restart)

    There were lots of older versions of their kernel module that created IO issues. Please verify you are on the latest greatest versions. We back up quite a few systems without issues.
    =>Admo.net Managed Hosting
    => Managed Hosting • Dedicated Servers • Colocation
    => Dark Fiber Access to 1102 Grand, Multiple Public Providers
    => Over •Sixteen• Years of Service

  9. #9
    Join Date
    Mar 2003
    Location
    California USA
    Posts
    13,258
    Are you using mdadm RAID and CloudLinux?
    Steven Ciaburri | Proactive Linux Server Management - Rack911.com
    System Administration Extraordinaire | Follow us on twitter:@Rack911Labs
    Managed Servers (AS62710), Server Management, and Security Auditing.
    www.HostingSecList.com - Security notices for the hosting community.

  10. #10
    Join Date
    Jul 2006
    Posts
    91
    I've been having this issue since January with no end in sight. Did 4.0 fix it? Nope.

    The related error is this:

    An exception occurred during the request. Unable to stop snapshot for device '/dev/xvda#' with id 1: Operation not permitted

    Then, when the next scheduled backup starts, it can't tell that something is still running, which causes the CPU to surge and kills the server.

    They keep saying it will be fixed, but nothing so far.

    Also, if you catch it after the first bad backup, every attempt I've tried to stop the CDP process fails, so you STILL have to reboot to clear the issue (although at least you can turn off the backup so your server won't hang, and then reboot at a convenient time).
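
    The "next scheduled run starts on top of a stuck one" problem is the kind of thing a lock file normally guards against. The sketch below is only an illustration of that idea for a job you schedule yourself (cron and the like); it is not something the R1Soft agent exposes, and the lock path is a made-up name. It also would not unstick the first run, it only keeps a second one from piling on.

```python
import fcntl
import os
import tempfile


def try_acquire_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def release_lock(handle):
    """Unlock and close the file returned by try_acquire_lock."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


# Hypothetical lock path, used only for this demonstration.
LOCK_PATH = os.path.join(tempfile.gettempdir(), f"backup-demo-{os.getpid()}.lock")

first = try_acquire_lock(LOCK_PATH)    # stands in for the backup that never finished
second = try_acquire_lock(LOCK_PATH)   # stands in for the next scheduled run

if second is None:
    print("previous backup still holds the lock, skipping this run")
```

    The lock is released automatically if the holding process dies, so a crash does not leave it stuck the way a plain PID file can.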

    Yep, I'm using CloudLinux.

  11. #11
    Join Date
    Nov 2007
    Location
    Chennai, India
    Posts
    2,371
    Originally posted by ethical:
    Yep, I'm using CloudLinux.
    Are you using CloudLinux 6?
    Check out this thread; people there seem to be reporting performance issues:
    http://www.webhostingtalk.com/showth...1155043&page=2

  12. #12
    Originally posted by nibb:
    It's normal. They will crash servers frequently, and sometimes cause file corruption on the drives as well, which you need to fix manually. Welcome to the world of CDP.
    In our CDP world we use Box Backup and have never had a single stability issue with it. We have also rolled our own CDP-like backup system using some custom server-side scripts called by Bacula.

  13. #13
    Join Date
    Jul 2006
    Posts
    91
    Originally posted by chennaihomie:
    Are you using CloudLinux 6?
    Check out this thread; people there seem to be reporting performance issues:
    http://www.webhostingtalk.com/showth...1155043&page=2
    Nope, I'm using CL5, but I will read through that link. Thanks.

  14. #14
    Join Date
    Jun 2005
    Posts
    3,133
    The issue is back like never before in version 5. I have had 2 crashes in 1 week since I upgraded to Idera Server Backup version 5. In both cases R1Soft was running a backup, and it not only crashed the VM, as was normal in v3, but crashed the whole dom0 node! The whole hardware went crazy because of high I/O load.

    When I rebooted the node, the agent was still doing the backup; the task never failed, even while the hardware was being rebooted. Five minutes after the node came back online, it hung again because the R1Soft server was still hitting it. Cancelling the backup task immediately made the node respond again. This is not bad. This is AWFUL!

    XenServer 6 will give all types of errors under load, like input/output errors, without letting you enter any command at all. Stopping the backup task solves the problem.

  15. #15
    Join Date
    Jul 2006
    Posts
    91
    I've found mine to still be pretty stable so far with 5. What did support say about it? I don't want to see this happening again.

    Thanks

  16. #16
    Join Date
    Jun 2005
    Posts
    3,133
    Originally posted by ethical:
    I've found mine to still be pretty stable so far with 5. What did support say about it? I don't want to see this happening again.

    Thanks
    I don't contact support anymore. They never found a solution years back, so why would that have changed today?

    The issues are extremely hard to get diagnosed, as you need to report them while they are happening, and I don't know about you, but I cannot have a server down for days.

    Usually I reboot the machine immediately, and R1Soft causes all kinds of file system corruption because it is still running.

    This seems to happen when I/O is already high on a server. Last time, I rebooted a machine and it went down again almost 4 minutes later, while on the CDP v5 server the backup was still running. It never detected the server reboot either; the task kept running as if nothing had happened, and the server went crazy, so I cancelled the running backup task and the server started to respond again.

    There is something you can probably replicate on your own servers. While a backup is running on a server, the I/O load will increase, slowly but steadily. For example, if it is at 0.90, it will increase to 0.91 after one or two minutes, then to 0.92, and so on.
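
    If you want to check for that creep on your own servers, something along these lines could do it. This is a rough sketch that assumes the figures above are load averages (the 0.90, 0.91, 0.92 kind of number) and that you sample once a minute or so during a backup; the function names and the 0.01 threshold are made up for the example.

```python
import os


def read_load_1min():
    """Return the 1-minute load average (Unix only)."""
    return os.getloadavg()[0]


def is_steadily_rising(samples, min_rise=0.01):
    """True if no sample is lower than the one before it and the total rise is at least min_rise."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_rise


# Samples like the ones described in the post, taken a minute or two apart.
during_backup = [0.90, 0.91, 0.92]
print(is_steadily_rising(during_backup))
```

    A load that only ever goes up while the backup task is running is the pattern to look for; a load that bounces up and down is ordinary.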

    So you had better hope that backups do not take too long to complete; otherwise you have a potential problem.

    Now, if the server is under load, this is a problem, in particular when for some strange reason the task freezes and just keeps running forever. Then the I/O load will climb without limit, because it increases while the agent is running, and since the task never completes and never stops either, after some hours your server will crash, and in a very violent way, causing all types of file system corruption. I had this years back, and I had it last year with v4 as well.

    R1Soft has caused me huge downtime because of this, as the file systems go into read-only mode after such a crash, and you need to take the machine offline to repair them, which can take hours and hours for huge drives.

  17. #17
    Join Date
    Apr 2003
    Location
    Los Angeles, CA
    Posts
    791
    All those horror stories about CDP make me wonder why people use that solution.

    Have you looked into ZFS snapshots + zfs send / zfs recv? We've had really good luck with it. Snapshots take a second to make and a few seconds to destroy, even on datasets hundreds of GB in size. zfs send/recv pretty much saturates the gigabit link between hosts, so moving 10 GB of incremental differences takes only a couple of minutes without any load issues.
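
    A rough sketch of what that workflow looks like, written as a small Python helper that only builds the command lines and does not run them. The dataset, snapshot, and host names are made up for the example, and piping through ssh is an assumption about the transport; use whatever you have between your hosts.

```python
import shlex


def zfs_incremental_commands(dataset, prev_snap, new_snap, remote_host, remote_dataset):
    """Build the snapshot command and the incremental send/recv pipeline as strings."""
    # Take a new snapshot of the dataset.
    snapshot = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    # Send only the blocks that changed between the previous and the new snapshot.
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # Receive the stream on the backup host (ssh transport is an assumption).
    recv = ["ssh", remote_host, "zfs", "recv", remote_dataset]
    pipeline = f"{shlex.join(send)} | {shlex.join(recv)}"
    return shlex.join(snapshot), pipeline


# Hypothetical names, for illustration only.
snapshot_cmd, pipeline_cmd = zfs_incremental_commands(
    "tank/www", "monday", "tuesday", "backup1", "backup/www"
)
print(snapshot_cmd)
print(pipeline_cmd)
```

    The previous snapshot has to exist on both sides for the incremental send to apply, so the very first transfer is a full zfs send without -i.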

    Potential drawbacks include that ZFS is a COW file system so writes are fast, but reads can get slower over time due to fragmentation (hasn't really been an issue for us after ~1 year of use) and that the Linux port is still 0.6.x.

    Pick your poison.
    Pings <1 ms, Unlimited Transfer, Lowest Price: http://localhost/

  18. #18
    Originally posted by luki:
    Have you looked into ZFS snapshots + zfs send / zfs recv? We've had really good luck with it. Snapshots take a second to make and a few seconds to destroy, even on datasets hundreds of GB in size. zfs send/recv pretty much saturates the gigabit link between hosts, so moving 10 GB of incremental differences takes only a couple of minutes without any load issues.
    You can use LVM snapshots too, which work really well.
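
    For reference, the usual LVM sequence is: create a snapshot, mount it read-only, copy from it, then unmount and remove it. The sketch below just builds those commands as lists (it does not execute anything); the volume group, volume names, snapshot size, and mount point are made up for the example, and the copy step itself is left out.

```python
def lvm_snapshot_plan(vg, lv, snap_name, snap_size, mount_point):
    """Return the ordered commands for a snapshot-based backup, as argument lists."""
    origin = f"/dev/{vg}/{lv}"
    snap = f"/dev/{vg}/{snap_name}"
    return [
        # Create a copy-on-write snapshot of the origin volume.
        ["lvcreate", "--snapshot", "--size", snap_size, "--name", snap_name, origin],
        # Mount the snapshot read-only so the backup sees a frozen view.
        ["mount", "-o", "ro", snap, mount_point],
        # (copy files from mount_point to backup storage at this point)
        ["umount", mount_point],
        # Remove the snapshot so it stops consuming copy-on-write space.
        ["lvremove", "-f", snap],
    ]


# Hypothetical names and sizes, for illustration only.
plan = lvm_snapshot_plan("vg0", "data", "backup_snap", "5G", "/mnt/snap")
for command in plan:
    print(" ".join(command))
```

    The snapshot size only needs to cover the writes that land on the origin while the backup runs, but if it fills up the snapshot becomes invalid, so leave some headroom.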

  19. #19
    Join Date
    Jun 2005
    Posts
    3,133
    Well, usually it works well, but then one or two times a year something strange crashes a server out of the blue, and it's almost always traced to exactly the time a backup was running. Coincidence? You make the guess.

    One simple solution would be the ability to set limits on the agent itself. For example, if the server is at XX I/O, the agent would not start backups, or would abort them immediately, instead of you having to manually log into the server and stop the agent...

    Or, if a backup is taking more than XX minutes to finish, abort the task.

    If these 2 things could be configured on the agent side, it would solve a lot of issues.
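
    To make the idea concrete, the decision logic could be as small as the sketch below. This is purely hypothetical: as far as this thread goes, the agent has no such settings. The thresholds stand in for the "XX" values above, and the 1-minute load average is used as a stand-in for "I/O" because it is easy to read.

```python
def backup_action(load_1min, elapsed_minutes, running, max_load=4.0, max_minutes=120):
    """Decide what an agent-side watchdog would do: 'start', 'skip', 'continue', or 'abort'."""
    if not running:
        # Do not start a backup on a server that is already struggling.
        return "skip" if load_1min >= max_load else "start"
    # Abort a running backup if the server is overloaded or the task is taking too long.
    if load_1min >= max_load or elapsed_minutes >= max_minutes:
        return "abort"
    return "continue"


# Hypothetical situations, using the default thresholds.
print(backup_action(load_1min=0.5, elapsed_minutes=0, running=False))    # quiet server, no backup running
print(backup_action(load_1min=6.0, elapsed_minutes=30, running=True))    # overloaded during a backup
print(backup_action(load_1min=0.9, elapsed_minutes=300, running=True))   # backup running far too long
```

    Because the check only needs local information, it keeps working even when the agent cannot reach the backup server.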

    You should be able to set these options on the CDP server for centralized management, and the server would send the configuration update to the agent. The settings should then be saved and enforced in the agent, not in the CDP server.

    This would avoid having to configure agents manually but would still leave the agent to enforce these limits in case it lost communication with the CDP server.

    These 2 settings on the agent side are very basic but could solve some issues people have had in the past, in particular when a new version is released that is unstable; they could stop the agent from just going on a killing spree on a server.

    Almost all the issues I had, and that users here reported, were with the server running the agent. This means the agent has too much priority on the server, and even when it's causing huge loads and huge disk reads and writes, it will not stop. It will keep running as if nothing were wrong, and it will just blow up the server's drives. The agent just wants to finish its backup once it has started, and this is wrong.

    I have also noticed that from version to version, from 2 to 3, from 3 to 4, and now 5, the agent is hungrier for resources. You keep upgrading hardware, but the agent keeps wanting more and more with each version. It now requires plenty of RAM, and while it's doing a backup it's very CPU-intensive, even if it just has to replicate a few hundred megabytes.

    On every single load graph I have, there are spikes, and each one is exactly when a backup is running. So if your server with 4 or 8 cores is normally at a load of 0.20 to 0.50, when a backup is running it will easily stay at a load of 2 until the backup finishes.
