Thread: Clustering/ fault tolerance
-
12-29-2006, 01:12 PM #1 Web Hosting Evangelist
- Join Date
- Dec 2000
- Location
- Montreal
- Posts
- 539
Clustering/ fault tolerance
There is so much discussion about load balancers on the boards, but no one mentions anything about true redundancy, where a server that goes down is replaced by another one. I'm not referring to pure mirroring, because that's too expensive. What I'm referring to is something more like the RAID 5 of servers, where there is one mirrored or redundant server for many others. I'm also interested to hear about other solutions out there. At the end of the day, the ultimate goal is uptime.
-
12-29-2006, 01:19 PM #2 Web Hosting Master
- Join Date
- Jan 2005
- Location
- Richmond, VA
- Posts
- 3,119
Netfirms has this with their new Enterprise service. I'm planning to upgrade one of my sites to it very soon, but I'm holding out mainly to hear how others who are using it are doing. So far, it's been very hard to find folks who are. But anyway, that's just one example.
Daniel B., CEO - Bezoka.com and Ungigs.com
Hosting Solutions Optimized for: WordPress • Joomla • OpenCart • Moodle
Data Centers in: Chicago (US), London (UK), Sydney (AU), Sofia (BG), Pori (FI)
Email Daniel directly: ceo [at] bezoka.com
-
12-29-2006, 01:40 PM #3 Aspiring Evangelist
- Join Date
- Jun 2004
- Location
- Tampa Florida
- Posts
- 428
sharkman,
Load balancing, with quality equipment, allows any server to die and the extra requests to be picked up by the others in the pool.
Prior to reliable load balancers many used the hot spare method, but load balancing allows you to utilize that spare hardware during the long periods where there are no outages.
Basically, it saves money over the hot spare method and provides the same level of fault tolerance.
As a side note, most good load balancers support the hot spare method of failover in case the load balancer itself dies, similar to VRRP or HSRP in routers.
Rock solid hosting and dedicated servers since 1998!
StabilityHosting Where stability and uptime are king!
-
12-29-2006, 02:16 PM #4 Location = SoapBox
- Join Date
- Oct 2003
- Posts
- 6,564
As sharkman pointed out, this is all about uptime, and uptime is achieved by identifying and removing single points of failure...
A single server can be more reliable, with fewer single points of failure, than a load balanced array, and that is what is often missed.
For example:
Situation 1:
Dual processor, redundant everything (NICs, HDDs in RAID, fans, power supplies), running at 100% of capacity in peak periods
Situation 2:
2 * single processor machines, load balanced, each with single fans, a single power supply, a single hard drive and a single NIC; the LB array is running at 100% of capacity at peak periods
Situation 1 will always be more reliable than situation 2, as situation 2 has a lot more single points of failure than situation 1 (because if either node fails in situation 2, the array will fail). Now, this is an extreme example to illustrate a point.
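To put rough numbers on the two situations above, here is a minimal sketch. It assumes component failures are independent, and the 0.99 availability figure and the four-component list (fan, PSU, disk, NIC) are made up for illustration, not taken from anyone's hardware.

```python
from math import prod

def series(*parts):
    """Availability when every part must work (each one is a single point of failure)."""
    return prod(parts)

def redundant(part, copies=2):
    """Availability when any one of `copies` identical parts is enough."""
    return 1 - (1 - part) ** copies

A = 0.99  # hypothetical availability of one fan, PSU, disk or NIC

# Situation 1: one box, each of the four components duplicated.
situation_1 = series(*[redundant(A) for _ in range(4)])

# Situation 2: two boxes with single components; at 100% load both must stay up.
one_cheap_box = series(A, A, A, A)
situation_2 = series(one_cheap_box, one_cheap_box)

print(f"situation 1: {situation_1:.4f}")
print(f"situation 2: {situation_2:.4f}")
```

With both cheap boxes required at peak, every single component in either box sits in the series chain, which is why the load balanced pair comes out worse in this toy model.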
I guess my convoluted point is this: load balancing is wonderful and can improve uptime and performance, but it needs to be done properly, and enough resources need to be thrown at the array to accommodate x number of units failing (depending on how redundant you want to be).
This is very much like RAID arrays and disk drives (to get back to sharkman's original point), where RAID 1 allows for 1 hard drive to fail and some higher levels of RAID (RAID 6, for example) allow for more hard drives to fail. In an LB array, you need to identify how many nodes can fail without affecting service. To accomplish this, your array needs to be running at an overall capacity which can accommodate 1 or more nodes failing during peak periods without affecting service levels. Otherwise, you are just creating unnecessary single points of failure...
www.cartika.com
www.clusterlogics.com - You simply cannot run a hosting company without this software. Backups, Disaster Recovery, Big Data, Virtualization. 20 years of building software that solves your problems
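The headroom argument in this post can be sketched as a small sizing calculation. This is only an illustration: the function names are made up, load is measured in "servers' worth" of capacity, and it assumes load spreads evenly across the surviving nodes.

```python
import math

def nodes_required(peak_load, node_capacity, tolerated_failures=1):
    """Nodes to deploy so peak load still fits after `tolerated_failures` nodes die."""
    needed_for_load = math.ceil(peak_load / node_capacity)
    return needed_for_load + tolerated_failures

def max_safe_utilisation(total_nodes, tolerated_failures=1):
    """Highest average peak utilisation that still survives the tolerated failures."""
    return (total_nodes - tolerated_failures) / total_nodes

# A peak load worth two servers, tolerating one failure.
print(nodes_required(peak_load=2.0, node_capacity=1.0))
print(round(max_safe_utilisation(3), 2))
```

An array that peaks above the utilisation returned by `max_safe_utilisation` is in the position described above: any single node failure takes the service below what it needs.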
-
12-29-2006, 02:32 PM #5 Web Hosting Evangelist
- Join Date
- Dec 2000
- Location
- Montreal
- Posts
- 539
Lol, Cartika my friend, that was very technical. At the end of the day, do you have a recommendation for a solution that duplicates RAID 5 in a server format, or do you recommend just a good load balancer?
-
12-29-2006, 02:39 PM #6 Location = SoapBox
- Join Date
- Oct 2003
- Posts
- 6,564
LOL sharkman, made me smile
Sure I have a recommendation: redundant load balancers, 3 servers to handle the load of 2 servers (4 to handle the load of 3, and so on), so that if 1 server goes down your array will not be affected, and a storage array.
The higher your budget, the better quality everything you can get and the fewer single points of failure...
I like NetApp as storage because the solution has zero points of failure (if you buy the right model - I think 900 series or higher) - but they are quite expensive.
Another solution for storage is to buy 2 less expensive NAS devices and then load balance/mirror between them.
Or, since NAS devices are pretty reliable, you can just ensure you have RAID protection in the NAS and plan for a potential failure in the NAS with a speedy mean time to recovery. This obviously means an additional single point of failure you need to manage, but nonetheless it is an option...
-
12-29-2006, 11:31 PM #7 Junior Guru
- Join Date
- Nov 2006
- Location
- College Station, TX
- Posts
- 185
I'm not a huge fan of NAS... for a lot of applications, esp. database, you have to buy an immense amount of NAS with a big storage buffer to handle the 50 storage requests that are coming in at once. I've seen I/O wait go through the roof even with three or four servers hitting the NAS at once... and depending on how you handle mounting that network share, NFS isn't exactly the most stable beast on the block, and SMB's so resource-intensive that you may as well have the content local for all the work the processor's having to do to get to the content.
Fibre channel is a whole 'nother deal, of course... but we have our 'control panel' for our four or five sites hosted on one server, and the load balancers route requests for the subdomain it's on to one server, which then pushes content to the other servers in the cluster. That content server can also take a content server out of the load balancing pool if it can't reach it.
If you haven't already, read Brad Fitzpatrick's description of how Livejournal does their clustering, and the scaling issues they ran into and ultimately solved.
-
12-29-2006, 11:48 PM #8 Location = SoapBox
- Join Date
- Oct 2003
- Posts
- 6,564
The last hurdle for us is how we handle SSL - it works now, but, not "good enough" for our liking...
Having said this, for a small company just starting out and wanting to implement load balancing with shared storage, or for companies selling HA solutions to SMB, it's a decent solution (not ideal, but workable - and still, in my humble opinion, superior to "HA" achieved through rsyncing servers and such....)
karlkatzke, I would be interested in your opinion on this - do you believe a basic 2 server LB situation with a NAS shared storage is a workable, affordable HA solution for the SMB space? (it is something we are thinking about offering, but are still undecided - would love your input)
-
12-30-2006, 12:02 AM #9 Aspiring Evangelist
- Join Date
- Jun 2004
- Location
- Tampa Florida
- Posts
- 428
I renewed my Sun certifications a while ago and there was a guy in my group who worked on huge DB clusters (by huge I mean 15 or 20 Sun Fire 25K servers using Oracle RAC clustering), and they had a storage solution that sounded really interesting. I couldn't believe that it was fast enough, though. But that was totally my biased opinion based on my NFS history...
They ran 4 large EMC SANs.
Mirrored.
Into 8 (2 each) Sun Fire 280Rs (dual 1.34GHz sparcs3s with 16GB of RAM)
Then they had 10Gb Ethernet cards from the 280Rs into the 25Ks. Using NFS...
They were using Sun's IP address redundancy on everything. He claimed that they could lose 3 of anything behind the 25Ks and the DB servers wouldn't notice it at all.
I mentioned that I didn't think NFS was a great way to deal with this and he said that with all the ram in the 280Rs the buffer took care of any speed issues..... I am still rather dubious about this setup. But it was an impressive cluster and if you dump that much money into something I would think it would be well thought out and would work pretty good.
-
12-30-2006, 12:08 AM #10 Location = SoapBox
- Join Date
- Oct 2003
- Posts
- 6,564
I mentioned that I didn't think NFS was a great way to deal with this and he said that with all the ram in the 280Rs the buffer took care of any speed issues.....
and if you dump that much money into something I would think it would be well thought out and would work pretty good.
-
12-30-2006, 12:19 AM #11 Aspiring Evangelist
- Join Date
- Jun 2004
- Location
- Tampa Florida
- Posts
- 428
Just thinking that basically a NetApp is an NFS box... and I may try using a few T-1000s and an EMC I have not been able to find a use for to set something like this up (only MUCH MUCH smaller). NFS4 has proven to be MUCH faster than 3, and with it natively integrated into the ZFS filesystem on Solaris 10 it may respond that much more quickly.
I am currently using said hardware to play around with iSCSI... but have been unimpressed by the transfer speeds.
-
12-30-2006, 01:33 PM #12 Junior Guru
- Join Date
- Nov 2006
- Location
- College Station, TX
- Posts
- 185
Originally Posted by CartikaHosting
Conceptually, the technical implementation is right for business hosting, as long as you have the details correct. You'll want a lot of RAM in the storage device to buffer reads, and you'll want logfiles and page caches to be written locally. Most upper-end SMB sites (as in, the kind of company that's actually going to deal with this much traffic, and for whom their website is mission-critical) are dynamic these days, so you want to have a lot of the code that they're executing cached. You'll want dual/dual everything in the NAS. You'll want good battery backups. You'll want to have the database either running a cluster edition, or in a redundant mirrored config, or behind its own L4 load balancer with mirroring on the backplane.
I'd probably turn to something like Redhat Cluster Suite if I was going to pursue this end of the business. They're supposed to release a new version in March that'll really be the cat's pajamas.
Quite honestly, what *I* would find valuable as a small business would be geographically balanced, affordable-for-small-businesses hosting. Geographical balancing is possible with a lot of newer load balancers, and while yes you'd have to do mirroring, most sites aren't changing frequently enough for that to become a huge issue as long as you provide a 'live mirror' button. With the majority of businesses being in disaster-prone areas, I'd love if my website could survive a tsunami on the west coast, an early blizzard in the midwest, thunderstorms that spawn tornados in the south/southeast, and a tornado or nor'easter making its way up the atlantic coast... simultaneously. Seems we're entering an active weather cycle that's going to put disaster recovery plans to the test. I'm not sure that the small business market is ready for it, but there are a lot of medium-sized e-Commerce businesses who are on VPSes or dedicated servers, and it's technically feasible as long as you figure out some way to geographically balance and integrate the *database*...
Last edited by karlkatzke; 12-30-2006 at 01:39 PM.
-
12-30-2006, 01:52 PM #13 WHT Addict
- Join Date
- Dec 2006
- Posts
- 107
You do get better uptime, but you can't get true 100% uptime, because when your mail server or Apache needs troubleshooting, your server has already been down several minutes by the time you realize it, even if it only takes several more minutes to "unplug" the bad server, reload it and plug it back in.
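For a sense of scale, here is a quick sketch of how detection and swap time eat into an uptime figure. The incident count and the minutes are made-up numbers for illustration, not measurements from anyone in this thread.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

def availability(incidents_per_year, detect_minutes, repair_minutes):
    """Fraction of the year the service is up, given per-incident detection and repair time."""
    downtime = incidents_per_year * (detect_minutes + repair_minutes)
    return 1 - downtime / MINUTES_PER_YEAR

def downtime_budget(target):
    """Minutes of downtime per year that an availability target allows."""
    return (1 - target) * MINUTES_PER_YEAR

# Hypothetical: 6 incidents a year, 5 minutes to notice, 5 more to swap the server.
print(f"{availability(6, 5, 5):.5%}")
print(f"{downtime_budget(0.9999):.1f} minutes/year at 99.99%")
```

At a 99.99% target the whole year's budget is under an hour, so a handful of incidents that each take minutes to notice uses it up quickly.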
Josh Lieber
█ iTechPath | Fully managed servers with 24/7/365 support.
█ PHP 5, MySQL 5, RHEL, cPanel & rvskins, and much more...
-
12-30-2006, 02:00 PM #14 Web Hosting Master
- Join Date
- May 2003
- Posts
- 1,151
Load Balancers have checks they perform to confirm that the server is still active and able to accept new connections. Once that server stops responding, it's taken out of the mix and no longer given traffic until it starts to respond again.
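The check-and-remove behaviour described here can be sketched in a few lines. This is a toy model, not how any particular load balancer is implemented: the `Pool` class and its method names are hypothetical, and the check is a plain TCP connect, where real products usually offer richer probes.

```python
import socket

def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class Pool:
    def __init__(self, servers, check=tcp_check):
        self.servers = list(servers)      # (host, port) pairs
        self.check = check                # pluggable so it can be faked in tests
        self.healthy = set(self.servers)
        self._next = 0

    def run_checks(self):
        """Probe every server; failed ones leave rotation, recovered ones rejoin."""
        self.healthy = {s for s in self.servers if self.check(*s)}

    def pick(self):
        """Round-robin over the servers that passed the last check."""
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers in the pool")
        server = candidates[self._next % len(candidates)]
        self._next += 1
        return server
```

In practice `run_checks` would run on a timer every few seconds, so the window in which a dead server still receives traffic is bounded by the check interval plus the timeout.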
Datums Internet Solutions, LLC
Systems Engineering & Managed Hosting Services
Complex Hosting Consultants
-
12-30-2006, 03:09 PM #15 Location = SoapBox
- Join Date
- Oct 2003
- Posts
- 6,564
Hello
I think you're using SMB as 'Small to Medium Business'
I'd probably turn to something like Redhat Cluster Suite if I was going to pursue this end of the business.
The technical specs you outlined pretty much agree with what we are looking at. Our only concern with this sort of "HA" configuration is the single points of failure in the NAS, and we are wondering out loud whether the default "services" cluster configuration that we sell for high end application hosting won't, in fact, be more reliable than what was described above, though obviously an LB array of web servers could probably handle more load than our traditional services cluster.
-
12-31-2006, 08:13 AM #16 Junior Guru
- Join Date
- Nov 2006
- Location
- College Station, TX
- Posts
- 185
Yeah, RHCS is nice because of GFS ... the Global File System. Also available for Fedora Core and CentOS, of course, without the absurd licensing fees (and without the support, unfortunately).
Honestly, I'd like to see a 'mesh filesystem' developed for clustering, where you have a disk partition on each server that every other server mirrors automatically... if the partition goes bad on one server, the server can still use the other partitions on the other servers while its own array rebuilds... kind of like a scaled-up RAID 5.
Simple to describe, I'm sure it's frighteningly complex to integrate.
-
01-01-2007, 04:24 AM #17 Disabled
- Join Date
- Jun 2005
- Posts
- 588
Hello,
If you want to do it more cheaply ... then host your site with 2 different hosting companies. That way you can also keep the guaranteed uptime of your site.
Thank you.
Regards,
-
01-01-2007, 01:49 PM #18 Master of the Truth
- Join Date
- Mar 2006
- Location
- Reston, VA
- Posts
- 3,131
Along the lines of fault-tolerant dedicated machines, www.openqrm.org looks promising. We haven't tested it yet, but in theory, if you're running a dedicated machine and it goes down for whatever reason, your OS image would then be booted on a server in standby mode. Downfall? You need a nice little SAN or a NetApp on fiber to keep speeds good.
openQRM looks to do clustering on the fly by allocating standby servers with X operating image; swap is stored on the host computer and all the filesystem etc. is stored on the SAN.
Anyone else deal with qrm yet?
But as far as an HA dedicated server goes, anyone can mirror a server to another machine, set the min active hosts to 1 server, and set the priority to 10 for the "leading server" and 1 for the failover server. The lead box goes down, and the failover takes its place. The only drawback would be MySQL replication. But hey, that's where the new MySQL 5.1 clustering comes into play, whenever that becomes stable, if it hasn't already.
Yellow Fiber Networks
http://www.yellowfiber.net : Managed Solutions - Colocation - Network Services IPv4/IPv6
Ashburn/Denver/NYC/Dallas/Chicago Markets Served zak@yellowfiber.net
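The priority-based failover described in this post can be sketched as a simple election: the live node with the highest priority takes the traffic. This is only an illustration with hypothetical names (`Node`, `active_node`), using the priorities 10 and 1 and the minimum of 1 active host from the post; it does not model how the nodes detect each other's failure.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    priority: int
    alive: bool = True

def active_node(nodes, min_active=1):
    """Return the live node with the highest priority."""
    alive = [n for n in nodes if n.alive]
    if len(alive) < min_active:
        raise RuntimeError("fewer live nodes than the required minimum")
    return max(alive, key=lambda n: n.priority)

lead = Node("lead", priority=10)
standby = Node("failover", priority=1)

print(active_node([lead, standby]).name)   # lead box is up, so it serves
lead.alive = False                         # lead box goes down
print(active_node([lead, standby]).name)   # failover takes its place
```

Note that the election only moves traffic; the data on the failover box is as fresh as whatever mirroring or replication feeds it, which is the drawback the post points at.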
-
01-01-2007, 03:46 PM #19 Junior Guru
- Join Date
- Nov 2006
- Location
- College Station, TX
- Posts
- 185
The problem with MySQL clustering is that it's all RAM-based ... so if your database size exceeds what your RAM/kernel/hardware can deal with, you're screwed.
Haven't dealt with qrm, but I'm leery of anything that has to *boot* an image. I've dealt with LTSP a lot and ... well, the booting is the biggest struggle.
-
01-02-2007, 09:46 PM #20 Junior Guru Wannabe
- Join Date
- Oct 2006
- Posts
- 68
It seems the only point of failure that sooner or later generates half an hour of downtime is a degraded RAID 1 or RAID 5 array: you need to stop your machine to change the drive and rebuild the array.
One possibility to avoid this would be a RAID 1+1 array with the 2 pairs of disks located on 2 different machines. But then we wouldn't be able to rebuild the RAID array without stopping disk I/O...
Is there a storage solution that gives you 100% uptime, where you can change failed disks on the fly without having to stop I/O on the live disks? In other words, a solution that doesn't force you to pull the plug to change defective hard drives.
-
01-02-2007, 10:28 PM #21 Junior Guru Wannabe
- Join Date
- Oct 2006
- Posts
- 68
Does GFS provide 100% uptime? What happens when the HDD that hosts your heavily accessed database dies? Do you have to stop everything while you change the drive?
-
01-03-2007, 09:58 AM #22 Junior Guru Wannabe
- Join Date
- Dec 2006
- Location
- /dev/null
- Posts
- 41
Originally Posted by karlkatzke
Caro.Net: Support is everything
Offering High Quality Dedicated Servers.
-
01-03-2007, 10:04 AM #23 Junior Guru Wannabe
- Join Date
- Dec 2006
- Location
- /dev/null
- Posts
- 41
Originally Posted by TigerHosting
With a controller that supports hot-swap, you shouldn't have to down the box at all. You should be able to replace the drive and rebuild on the fly. Other than the disk I/O performance going down a bit, there shouldn't be any negative results of doing a hot swap and rebuild.
-
01-03-2007, 10:16 AM #24 Junior Guru Wannabe
- Join Date
- Oct 2006
- Posts
- 68
Won't there be data inconsistency? As the users are writing data to the alive HDD while the other HDD is recovering data at the same time...
-
01-03-2007, 10:24 AM #25 Junior Guru Wannabe
- Join Date
- Dec 2006
- Location
- /dev/null
- Posts
- 41
The raid controller should take care of that. An example of this in action would be the use of a hot spare.
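One way to picture why writes during a rebuild need not cause inconsistency is the toy mirror below. It is an assumption-laden model, not any real controller's firmware: the `Mirror` class is hypothetical, disks are plain lists of blocks, and the only rule it shows is that every write goes to both disks while the rebuild copies the remaining blocks in order.

```python
class Mirror:
    """Toy two-disk mirror; each disk is a list of blocks."""

    def __init__(self, blocks):
        self.good = list(blocks)           # surviving disk
        self.new = [None] * len(blocks)    # replacement disk, initially blank
        self.synced_upto = 0               # blocks [0, synced_upto) are already copied

    def write(self, index, value):
        # Writes always go to both disks, so a block copied earlier never
        # goes stale and a block not yet copied is simply overwritten later
        # with the same value.
        self.good[index] = value
        self.new[index] = value

    def rebuild_step(self):
        """Copy one more block to the replacement; return False once nothing is left."""
        if self.synced_upto >= len(self.good):
            return False
        self.new[self.synced_upto] = self.good[self.synced_upto]
        self.synced_upto += 1
        return True
```

Interleaving `write` calls with `rebuild_step` calls in any order leaves both lists identical once `rebuild_step` returns False, which is the property the question above is asking about.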