  1. #1
    Join Date
    Dec 2000
    Location
    Montreal
    Posts
    539

    Clustering/ fault tolerance

    There is so much discussion about load balancers on the boards, but no one mentions true redundancy, where a server that goes down is replaced by another one. I'm not referring to pure mirroring, because that's too expensive. What I'm referring to is something more like RAID 5 for servers, where one mirrored or redundant server covers many others. I'm also interested to hear about other solutions out there. At the end of the day, the ultimate goal is uptime.

  2. #2
    Join Date
    Jan 2005
    Location
    Richmond, VA
    Posts
    3,119
    Netfirms has this with their new Enterprise service. I'm planning to upgrade one of my sites to it very soon, but I'm holding out mainly to hear how others who are using it are doing. So far, it's been very hard to find folks who are. But anyway, that's just one example.
    Daniel B., CEO - Bezoka.com and Ungigs.com
    Hosting Solutions Optimized for: WordPress • Joomla • OpenCart • Moodle
    Data Centers in: Chicago (US), London (UK), Sydney (AU), Sofia (BG), Pori (FI)
    Email Daniel directly: ceo [at] bezoka.com

  3. #3
    Join Date
    Jun 2004
    Location
    Tampa Florida
    Posts
    428
    sharkman,
    Load balancing, with quality equipment, allows any server to die and the extra requests to be picked up by others in the pool.
    Prior to reliable load balancers many used the hot spare method, but load balancing allows you to utilize that spare hardware during the long periods where there are no outages.

    Basically, it saves money over the hot spare method and provides the same level of fault tolerance.

    As a side note, most good load balancers support the hot spare method of failover in case the load balancer itself dies, similar to VRRP or HSRP in routers.
    Rock solid hosting and dedicated servers since 1998!
    StabilityHosting Where stability and uptime are king!

  4. #4
    There is so much discussion about load balancers on the boards, but no one mention anything about true redundancy, where one server goes down it has being replaced by another one, and I'm not referring to pure mirroring, because that's too expensive. What I'm referring to some more like the raid 5 of servers. Where there is one mirrored or redundant server to many others. also interested to hear other solutions out there. At the end of the day, the ultimate goal is uptime.
    Excellent questions/points... and I love the comparison to RAID, because that is actually how it works. Brilliant analogy, thanks!

    As Sharkman pointed out, this is all about uptime, and uptime is achieved by identifying and removing single points of failure...

    A single server can be more reliable, with fewer single points of failure, than a load balanced array, and that is what is often missed.

    for example:

    Situation 1:
    Dual processor, redundant everything (NICs, HDDs in RAID, fans, power supplies), running at 100% of capacity in peak periods

    Situation 2:
    2 x single processor machines, load balanced, each with a single fan, single power supply, single hard drive and single NIC; the LB array is running at 100% of capacity at peak periods

    Situation 1 will always be more reliable than situation 2, as situation 2 has a lot more single points of failure (the array is already at 100% of capacity, so if either node fails in situation 2, the array will fail). Now, this is an extreme example to illustrate a point.
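
    To put rough numbers on the two situations, here is a minimal sketch. It assumes components fail independently, and the availability figures are made up for illustration (none of them come from this thread). The helper names `series` and `redundant` are hypothetical.

```python
from math import prod

def series(*availabilities):
    """Every part must be up, so the availabilities multiply."""
    return prod(availabilities)

def redundant(availability, copies=2):
    """Any one of `copies` identical parts is enough to stay up."""
    return 1 - (1 - availability) ** copies

# Hypothetical per-component availabilities, for illustration only.
PSU, DISK, NIC, FAN = 0.99, 0.98, 0.995, 0.995

# Situation 1: one server with every component duplicated.
situation_1 = series(*(redundant(a) for a in (PSU, DISK, NIC, FAN)))

# One node from situation 2: no redundancy inside the box.
node = series(PSU, DISK, NIC, FAN)

# Situation 2 at 100% of capacity: both nodes are required, so they are in series.
situation_2 = series(node, node)

# The same two nodes with headroom: either node alone can carry the peak.
situation_2_with_headroom = redundant(node)

print(f"situation 1:               {situation_1:.5f}")
print(f"situation 2 (at capacity): {situation_2:.5f}")
print(f"situation 2 (headroom):    {situation_2_with_headroom:.5f}")
```

    Under these assumptions the pair at full capacity comes out less available than even a single one of its nodes, which is the point of the extreme example. Give the pair enough headroom and it becomes a redundant system instead.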

    loadbalancing, with quality equipment, allows any server to die and the extra requests to be picked up by others in the pool.
    Vantage, obviously I completely agree. Ideally this is how it would work, and the resulting benefits would be higher uptime from removing single points of failure, not to mention greater performance through streamlined utilization of resources. This is of course assuming that it is done correctly, with redundant storage, load balancers, etc... The problem is load. The problem always has been load and always will be load. If you are asking an array of servers to handle more load than they are capable of, you are creating multiple points of failure (more so than if you are asking a single server to handle loads beyond its capabilities), because if any of those nodes goes down, the array may crash and/or performance will be dramatically affected. Evidence of this can be seen everywhere, as many of these HA solutions have been brought down by this very thing...

    I guess my convoluted point is: load balancing is wonderful and can improve uptime and performance, but it needs to be done properly, and enough resources need to be thrown at the array to accommodate x number of units failing (depending on how redundant you want to be).

    This is very much like RAID arrays and disk drives (to get back to Sharkman's original point), where RAID 1 allows for 1 hard drive to fail and some higher RAID levels, such as RAID 6, allow for more hard drives to fail. In an LB array, you need to identify how many nodes can fail without affecting service, and in order to accomplish this, your array needs to be running at an overall capacity which can accommodate 1 or more nodes failing during peak periods without affecting service levels. Otherwise, you are just creating unnecessary single points of failure...
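
    The "how many nodes can fail" question is simple arithmetic once you know your peak load. A minimal sketch, assuming identical nodes and load measured in whole units such as requests per second (the function name and the figures are hypothetical):

```python
def tolerable_failures(nodes, per_node_capacity, peak_load):
    """Return how many nodes can fail while the rest still carry peak_load.

    All three arguments are integers in the same unit (e.g. requests/second).
    """
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    # Smallest number of nodes able to serve the peak (ceiling division).
    needed = -(-peak_load // per_node_capacity)
    return max(nodes - needed, 0)

# Two nodes running flat out at peak: nothing may fail.
print(tolerable_failures(2, 100, 200))  # 0
# Three nodes sized for the load of two: one may fail.
print(tolerable_failures(3, 100, 200))  # 1
```

    A result of 0 means every node in the array is a single point of failure during peak periods.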
    www.cartika.com
    www.clusterlogics.com - You simply cannot run a hosting company without this software. Backups, Disaster Recovery, Big Data, Virtualization. 20 years of building software that solves your problems

  5. #5
    Join Date
    Dec 2000
    Location
    Montreal
    Posts
    539
    Lol, Cartika my friend, that was very technical. At the end of the day, do you have a recommendation for a solution that duplicates RAID 5 in a server format, or do you recommend just a good load balancer?

  6. #6
    LOL sharkman, made me smile

    Sure, I have a recommendation: redundant load balancers, 3 servers to handle the load of 2 servers (4 to handle the load of 3, and so on, so if 1 server goes down your array will not be affected), and a storage array.

    The higher your budget, the better the quality of everything you can get and the fewer single points of failure...

    I like NetApp for storage because the solution has zero single points of failure (if you buy the right model; I think the 900 series or higher), but they are quite expensive.

    Another solution for storage is to buy 2 less expensive NAS devices and then load balance/mirror between them.

    Or, since NAS devices are pretty reliable, you can just ensure you have RAID protection in the NAS and plan for a potential NAS failure with a speedy mean time to recovery. This obviously means an additional single point of failure you need to manage, but nonetheless it is an option...

  7. #7
    Join Date
    Nov 2006
    Location
    College Station, TX
    Posts
    185
    I'm not a huge fan of NAS... for a lot of applications, esp. database, you have to buy an immense amount of NAS with a big storage buffer to handle the 50 storage requests that are coming in at once. I've seen I/O wait go through the roof even with three or four servers hitting the NAS at once... and depending on how you handle mounting that network share, NFS isn't exactly the most stable beast on the block, and SMB's so resource-intensive that you may as well have the content local for all the work the processor's having to do to get to the content.

    Fibre channel is a whole 'nother deal, of course... but we have our 'control panel' for our four or five sites hosted on one server, and the load balancers route requests for the subdomain it's on to one server, which then pushes content to the other servers in the cluster. That content server can also take a content server out of the load balancing pool if it can't reach it.

    If you haven't already, read Brad Fitzpatrick's description of how Livejournal does their clustering, and the scaling issues they ran into and ultimately solved.

  8. #8
    I'm not a huge fan of NAS... for a lot of applications, esp. database, you have to buy an immense amount of NAS with a big storage buffer to handle the 50 storage requests that are coming in at once. I've seen I/O wait go through the roof even with three or four servers hitting the NAS at once... and depending on how you handle mounting that network share, NFS isn't exactly the most stable beast on the block, and SMB's so resource-intensive that you may as well have the content local for all the work the processor's having to do to get to the content.
    Completely agree. When we first started testing, redundant NAS seemed like the way to go: cheap, and NFS is supposed to be painless. However, every test we ran showed the same results: HUGE I/O wait issues and NFS permission problems (amongst other NFS issues; they can be overcome, but it certainly isn't ideal). The end result is we went with NetApp (fiber), a much larger investment but definitely worth it.

    The last hurdle for us is how we handle SSL. It works now, but not "good enough" for our liking...

    Having said this, for a small company just starting out and wanting to implement load balancing with shared storage, or for companies selling HA solutions to SMB, it's a decent solution (not ideal, but workable, and still, in my humble opinion, superior to "HA" achieved through rsyncing servers and such...)

    karlkatzke, I would be interested in your opinion on this - do you believe a basic 2 server LB situation with a NAS shared storage is a workable, affordable HA solution for the SMB space? (it is something we are thinking about offering, but are still undecided - would love your input)

    If you haven't already, read Brad Fitzpatrick's description of how Livejournal does their clustering, and the scaling issues they ran into and ultimately solved.
    Great read. I don't think you can research HA, shared storage, etc. without running into it. Great read though, and I am sure a lot of people will see good value in it.

  9. #9
    Join Date
    Jun 2004
    Location
    Tampa Florida
    Posts
    428
    I renewed my Sun certifications a while ago, and there was a guy in my group who worked on huge DB clusters (by huge I mean 15 or 20 Sun Fire 25K servers using Oracle RAC clustering), and they had a storage solution that sounded really interesting. I couldn't believe that it was fast enough, though. But that was totally my biased opinion based on my NFS history...

    They ran 4 large EMC SANs.
    Mirrored.
    Into 8 (2 each) Sun Fire 280Rs (dual 1.34GHz UltraSPARC IIIs with 16GB of RAM).
    Then they had 10Gb Ethernet cards from the 280Rs into the 25Ks. Using NFS...

    They were using Sun's IP address redundancy on everything. He claimed that they could lose 3 of anything behind the 25Ks and the DB servers wouldn't notice it at all.
    I mentioned that I didn't think NFS was a great way to deal with this, and he said that with all the RAM in the 280Rs the buffer took care of any speed issues..... I am still rather dubious about this setup. But it was an impressive cluster, and if you dump that much money into something I would think it would be well thought out and would work pretty well.

  10. #10
    I mentioned that I didn't think NFS was a great way to deal with this and he said that with all the ram in the 280Rs the buffer took care of any speed issues.....

    and if you dump that much money into something I would think it would be well thought out and would work pretty good.
    LOL, I believe that, but I think it's more cost effective to go with NetApp (never thought I would say that)

  11. #11
    Join Date
    Jun 2004
    Location
    Tampa Florida
    Posts
    428
    Just thinking that basically a NetApp is an NFS box... and I may try using a few T-1000s and an EMC I have not been able to find a use for to set something like this up (only MUCH MUCH smaller). NFSv4 has proven to be MUCH faster than NFSv3, and with it natively integrated into the ZFS filesystem on Solaris 10 it may respond that much more quickly.

    I am currently using said hardware to play around with iSCSI... but have been unimpressed by the transfer speeds.

  12. #12
    Join Date
    Nov 2006
    Location
    College Station, TX
    Posts
    185
    Quote Originally Posted by CartikaHosting
    karlkatzke, I would be interested in your opinion on this - do you believe a basic 2 server LB situation with a NAS shared storage is a workable, affordable HA solution for the SMB space? (it is something we are thinking about offering, but are still undecided - would love your input)
    If by SMB you mean Samba, I don't think I understand the question. (I think you're using SMB as 'Small to Medium Business' ... which is a marketing question that I'm not qualified to answer, as I don't have a lot of experience in the hosting world -- just some small business stuff I've done on my own. I'll take a crack at it, though.)

    Conceptually, the technical implementation is right for business hosting, as long as you have the details correct. You'll want a lot of RAM in the storage device to buffer reads, and you'll want logfiles and page caches written locally. Most upper-end SMB sites (as in, the kind of company that's actually going to deal with this much traffic, and for whom their website is mission-critical) are dynamic these days, so you want to have a lot of the code that they're executing cached. You'll want dual/dual everything in the NAS. You'll want good battery backups. You'll want to have the database either running a cluster edition, or in a redundant mirrored config, or behind its own L4 load balancer with mirroring on the backplane.

    I'd probably turn to something like Red Hat Cluster Suite if I were going to pursue this end of the business. They're supposed to release a new version in March that'll really be the cat's pajamas.

    Quite honestly, what *I* would find valuable as a small business would be geographically balanced, affordable-for-small-businesses hosting. Geographical balancing is possible with a lot of newer load balancers, and while yes, you'd have to do mirroring, most sites aren't changing frequently enough for that to become a huge issue as long as you provide a 'live mirror' button. With the majority of businesses being in disaster-prone areas, I'd love it if my website could survive a tsunami on the west coast, an early blizzard in the midwest, thunderstorms that spawn tornadoes in the south/southeast, and a tornado or nor'easter making its way up the Atlantic coast... simultaneously. Seems we're entering an active weather cycle that's going to put disaster recovery plans to the test. I'm not sure that the small business market is ready for it, but there are a lot of medium-sized e-commerce businesses who are on VPSs or dedicated servers, and it's technically feasible as long as you figure out some way to geographically balance and integrate the *database*...
    Last edited by karlkatzke; 12-30-2006 at 01:39 PM.

  13. #13
    You do get better uptime, but you can't get true 100% uptime, because when your mail server or Apache needs troubleshooting, by the time you realize it your server has already been down several minutes, even if it only takes several more minutes to "unplug" the bad server, reload it and plug it back in.
    Josh Lieber

    iTechPath | Fully managed servers with 24/7/365 support.
    PHP 5, MySQL 5, RHEL, cPanel & rvskins, and much more...

  14. #14
    Load Balancers have checks they perform to confirm that the server is still active and able to accept new connections. Once that server stops responding, it's taken out of the mix and no longer given traffic until it starts to respond again.
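
    A minimal sketch of that behaviour, for anyone who wants to see the moving parts. The `Pool` class, the server names and the `check` callable are all hypothetical; this is not the API of any particular load balancer, and a real check would be something like a TCP connect or an HTTP request.

```python
from itertools import cycle

class Pool:
    """Round-robin pool that skips servers whose last health check failed."""

    def __init__(self, servers, check):
        self.servers = list(servers)
        self.check = check                 # callable: server name -> bool
        self.healthy = set(self.servers)   # assume healthy until checked
        self._rotation = cycle(self.servers)

    def run_checks(self):
        """Re-test every server; failed ones leave the mix, recovered ones rejoin."""
        self.healthy = {s for s in self.servers if self.check(s)}

    def pick(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            server = next(self._rotation)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers in the pool")

# Simulated server state stands in for a real network probe.
state = {"web1": True, "web2": True, "web3": True}
pool = Pool(state, check=lambda s: state[s])

state["web2"] = False   # web2 dies
pool.run_checks()       # next check notices it
print([pool.pick() for _ in range(4)])   # web2 no longer appears
```

    In practice `run_checks` runs on a timer, so the check interval bounds how long a dead server can keep receiving traffic before it is pulled.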
    Datums Internet Solutions, LLC
    Systems Engineering & Managed Hosting Services
    Complex Hosting Consultants

  15. #15
    Hello

    I think you're using SMB as 'Small to Medium Business'
    Sorry for the confusion. By SMB I meant Small and Medium Business, not Samba (my apologies for the acronyms)

    I'd probably turn to something like Redhat Cluster Suite if I was going to pursue this end of the business.
    Exactly

    The technical specs you outlined pretty much agree with what we are looking at. Our only concern with this sort of "HA" configuration is the single points of failure in the NAS, and we are wondering out loud whether the default "services" cluster configuration that we sell for high end application hosting won't, in fact, be more reliable than what was described above, though obviously an LB array of web servers could probably handle more load than our traditional services cluster.

    Quite honestly, what *I* would find valuable as a small business would be geographically balanced, affordable-for-small-businesses hosting. Geographical balancing is possible with a lot of newer load balancers, and while yes you'd have to do mirroring, most sites aren't changing frequently enough for that to become a huge issue as long as you provide a 'live mirror' button. With the majority of businesses being in disaster-prone areas, I'd love if my website could survive a tsunami on the west coast, an early blizzard in the midwest, thunderstorms that spawn tornados in the south/southeast, and a tornado or nor'easter making it's way up the atlantic coast... simultaneously. Seems we're entering an active weather cycle that's going to put disaster recovery plans to the test. I'm not sure that the small business market is ready for it, but there are a lot of medium-sized e-Commerce businesses who are on VPS's or dedicated servers, and it's technically feasible as long as you figure out some way to geographically balance and integrate the *database*...
    Very interesting concept. It doesn't really fit our model as of now, but I think you are onto something there.

    You do get better uptime, but you can't get true 100% uptime because when your mail server or Apache needs to be troubleshooted, by the time you realize it, your server has been down several minutes, even if it only takes several more minutes to "unplug" the bad server, to reload it and to plug it back.
    As Datums has indicated, this really isn't an issue. Most load balancers will not only direct traffic to a server that is available, but most of them will also direct traffic to the server under the lightest load. The only time you would have an issue is with sticky sessions: if the server that a particular session was assigned to went down, those users would lose their session. However, this, in my opinion, isn't a big deal; it just needs to be understood. (Also, we are playing with ways to force terminate a session when that happens and/or force a user on a sticky session to another machine when the one they are tied to stops responding. Obviously they would still get logged out of the site, and if they were working on something like a CMS, editing pages, etc., they would lose that data, but that's not too bad an option in a worst-case scenario.)
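
    The sticky-session case can be sketched as follows. This is an illustration under assumptions, not how any specific load balancer implements it: `route`, the `sticky` table and the server names are hypothetical, and the caller is assumed to keep the `active` connection counts up to date.

```python
def route(session_id, sticky, active, healthy):
    """Pick a server for a request.

    sticky:  dict mapping session id -> server the session is pinned to
    active:  dict mapping server -> current connection count
    healthy: set of servers currently passing health checks
    Returns (server, session_lost). session_lost is True when the session
    had to be moved off a dead server, so its state on that server is gone.
    """
    pinned = sticky.get(session_id)
    if pinned in healthy:
        return pinned, False
    if not healthy:
        raise RuntimeError("no healthy servers")
    # Lightest-loaded healthy server; sorting first makes ties deterministic.
    target = min(sorted(healthy), key=lambda s: active.get(s, 0))
    sticky[session_id] = target
    return target, pinned is not None

sticky = {}
active = {"web1": 5, "web2": 2}
print(route("alice", sticky, active, {"web1", "web2"}))  # ('web2', False)
print(route("alice", sticky, active, {"web1"}))          # ('web1', True)
```

    The second call is the scenario described above: the user is moved to a live machine, but whatever session state lived on the dead one is lost.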

  16. #16
    Join Date
    Nov 2006
    Location
    College Station, TX
    Posts
    185
    Yeah, RHCS is nice because of GFS, the Global File System. It's also available for Fedora Core and CentOS, of course, without the absurd licensing fees (and without the support, unfortunately).

    Honestly, I'd like to see a 'mesh filesystem' developed for clustering, where you have a disk partition on each server that every other server mirrors automatically... if the partition goes bad on one server, that server can still use the partitions on the other servers while its own array rebuilds... kind of like a scaled-up RAID 5.

    Simple to describe, but I'm sure it's frighteningly complex to integrate.

  17. #17
    Hello,

    If you want to do it more cheaply, host your site with 2 different hosting companies. That way you can also keep your site's guaranteed uptime.

    Thank you.

    Regards,

  18. #18
    Join Date
    Mar 2006
    Location
    Reston, VA
    Posts
    3,131
    Along the lines of fault tolerant dedicated machines, www.openqrm.org looks promising. We haven't tested it yet, but in theory, if you're running a dedicated machine and it goes down for whatever reason, your OS image would then be booted on a server in standby mode. Downside? You need a nice little SAN or a NetApp on fiber to keep speeds good.

    openQRM looks to do clustering on the fly by allocating standby servers with X operating image; swap is stored on the host computer and all the filesystem etc. is stored on the SAN.

    Anyone else dealt with qrm yet?

    But as far as an HA dedicated server goes, anyone can mirror a server to another machine, set the min active hosts to 1 server, and set the priority to 10 for the "leading server" and 1 for the failover server. The lead box goes down, the failover takes its place. The only drawback would be MySQL replication. But hey, that's where the new MySQL 5.1 clustering comes into play, whenever that becomes stable, if it hasn't already.
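
    The priority scheme in the last paragraph boils down to "the live server with the highest priority handles traffic". A minimal sketch, using the priorities 10 and 1 from above; the function and server names are hypothetical and not taken from any particular load balancer:

```python
def active_server(priorities, alive):
    """Return the live server with the highest priority, or None if all are down."""
    candidates = [s for s in priorities if s in alive]
    return max(candidates, key=priorities.get, default=None)

priorities = {"lead": 10, "failover": 1}

print(active_server(priorities, {"lead", "failover"}))  # lead
print(active_server(priorities, {"failover"}))          # failover
print(active_server(priorities, set()))                 # None
```

    Note that this only decides who takes traffic. Keeping the failover box's data current (the MySQL replication drawback) is a separate problem.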
    Yellow Fiber Networks
    http://www.yellowfiber.net : Managed Solutions - Colocation - Network Services IPv4/IPv6
    Ashburn/Denver/NYC/Dallas/Chicago Markets Served zak@yellowfiber.net

  19. #19
    Join Date
    Nov 2006
    Location
    College Station, TX
    Posts
    185
    The problem with MySQL clustering is that it's all RAM-based ... so if your database size exceeds what your RAM/kernel/hardware can deal with, you're screwed.

    Haven't dealt with qrm, but I'm leery of anything that has to *boot* an image. I've dealt with LTSP a lot and ... well, the booting is the biggest struggle.

  20. #20
    Join Date
    Oct 2006
    Posts
    68
    It seems the only point of failure that sooner or later causes half an hour of downtime is a degraded RAID 1 or RAID 5 array: you need to stop your machine to change the drive and rebuild the array.

    One way to avoid it would be a RAID 1+1 array with the 2 pairs of disks located on 2 different machines. But then we wouldn't be able to rebuild the RAID array without stopping disk I/O...

    Is there a storage solution that allows 100% uptime, where you can change failed disks on the fly without having to stop I/O on the live disks? In other words, a solution that doesn't force you to pull the plug to change defective hard drives.

  21. #21
    Join Date
    Oct 2006
    Posts
    68
    Does GFS provide 100% uptime? What happens when the HDD that hosts your heavily accessed database dies? Does everything have to stop while you change the drive?

  22. #22
    Join Date
    Dec 2006
    Location
    /dev/null
    Posts
    41
    Quote Originally Posted by karlkatzke
    The problem with MySQL clustering is that it's all RAM-based ... so if your database size exceeds what your RAM/kernel/hardware can deal with, you're screwed.

    Haven't dealth with qrm, but I'm leery of anything that has to *boot* an image. I've dealth with LTSP a lot and ... well, the booting is the biggest struggle.
    Don't forget the space needed for buffers and cache. On hosts with large databases with lots of indexes and access, this could easily be another 512MB of RAM.
    Caro.Net: Support is everything
    Offering High Quality Dedicated Servers.

  23. #23
    Join Date
    Dec 2006
    Location
    /dev/null
    Posts
    41
    Quote Originally Posted by TigerHosting
    It seems the only point of failure that early or late generates half an hour of downtime is when your Raid-1 or Raid-5 array is decayed, you need to stop your machine to change the drive and to rebuild the array.

    With a controller that supports hot-swap, you shouldn't have to down the box at all. You should be able to replace the drive and rebuild on the fly. Other than the disk I/O performance going down a bit, there shouldn't be any negative results from doing a hot swap and rebuild.

  24. #24
    Join Date
    Oct 2006
    Posts
    68
    Won't there be data inconsistency? The users are writing data to the live HDD while the other HDD is recovering data at the same time...

  25. #25
    Join Date
    Dec 2006
    Location
    /dev/null
    Posts
    41
    The RAID controller should take care of that. An example of this in action would be the use of a hot spare.
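
    To illustrate why a mirror can stay consistent during a rebuild, here is a deliberately simplified model. It assumes the controller sends every new write to both disks while it copies blocks across in the background; real controllers do this in firmware and the details vary, so treat the function and its arguments as hypothetical.

```python
def rebuild_with_live_writes(good, writes_during_rebuild):
    """Rebuild a blank replacement disk from `good` while writes keep arriving.

    good: list of block contents on the surviving disk (modified in place)
    writes_during_rebuild: dict mapping a rebuild step (the index of the block
        about to be copied) to a list of (block, value) writes that arrive
        just before that copy happens
    Returns the contents of the replacement disk after the rebuild.
    """
    replacement = [None] * len(good)
    for step in range(len(good)):
        for block, value in writes_during_rebuild.get(step, []):
            # New writes go to both members of the mirror.
            good[block] = value
            replacement[block] = value
        # Background copy of the next block from the surviving disk.
        replacement[step] = good[step]
    return replacement

good = ["a", "b", "c", "d"]
# Writes hit an already-copied block (0) and not-yet-copied blocks (3, 2).
writes = {1: [(0, "A"), (3, "D")], 3: [(2, "C")]}
new_disk = rebuild_with_live_writes(good, writes)
print(new_disk == good)  # True
```

    Writes to blocks already copied land on both disks, and writes to blocks not yet copied get picked up when the copy reaches them, so in this model the two disks match at the end.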
