
06-22-2006, 11:15 AM
|
|
Newbie
|
|
Join Date: Jun 2006
Location: Montreal, Canada
Posts: 5
|
|
Report on viability of DNS failover solution
I run a site with about 1,000,000 unique visitors per month, and recent server failures made me decide to get a failover server to minimize downtime. My goal wasn't to get 99.999% uptime but to be back on track within a "reasonable" amount of time after a failure. After evaluating several solutions, I decided to go with DNS failover. Here's how the setup works:
1) mydomain.com points to main server with a very low TTL (time to live)
2) failover server replicates data from main server
3) when main server goes down, mydomain.com is changed to point to failover server
The drawback is the DNS propagation time, since some DNS servers don't honor the TTL and there is some caching happening on the user's machine and in the browser. I looked for empirical data to gauge the extent of the problem but couldn't find any, so I decided to set up my own experiment.
The Experiment
==============
I start with mydomain.com pointing to the main server with a TTL of 1800 seconds (1/2 hour). I then change it to point to the failover server which simply port forwards to the main server. On the main server, I periodically compute the percentage of requests coming from the failover server which gives me the percentage of people for which the DNS change has propagated.
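A minimal sketch of how such a percentage could be computed, assuming the access log has been reduced to (timestamp, client IP) pairs and that forwarded requests reach the main server with the failover server's address as the client IP. The address 192.0.2.10, the sample records, and the function name are hypothetical; the post does not say how the log was actually processed.

```python
from collections import defaultdict
from datetime import datetime

FAILOVER_IP = "192.0.2.10"  # hypothetical address of the port-forwarding failover server


def propagated_percent(records, bucket_minutes=5):
    """Return {bucket_start: percent of requests that arrived via the failover server}."""
    totals = defaultdict(int)
    via_failover = defaultdict(int)
    for timestamp, client_ip in records:
        # Round the timestamp down to the start of its bucket.
        minute = timestamp.minute - timestamp.minute % bucket_minutes
        bucket = timestamp.replace(minute=minute, second=0, microsecond=0)
        totals[bucket] += 1
        if client_ip == FAILOVER_IP:
            via_failover[bucket] += 1
    return {b: round(100 * via_failover[b] / totals[b]) for b in sorted(totals)}


# Made-up sample records, for illustration only.
sample = [
    (datetime(2006, 6, 21, 16, 6), "198.51.100.7"),
    (datetime(2006, 6, 21, 16, 7), FAILOVER_IP),
    (datetime(2006, 6, 21, 16, 8), "198.51.100.9"),
    (datetime(2006, 6, 21, 16, 9), "203.0.113.4"),
    (datetime(2006, 6, 21, 16, 11), FAILOVER_IP),
    (datetime(2006, 6, 21, 16, 12), "198.51.100.7"),
]
result = propagated_percent(sample)
for bucket, pct in result.items():
    print(bucket.strftime("%m/%d/%y %H:%M"), pct, "%")
```

Counting requests rather than distinct visitors keeps the sketch short; counting distinct users would need a session or cookie identifier, which is not described here.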
I made the DNS change at exactly 16:04 on 06/21/06, and here is the percentage of users for whom the change had propagated at each point in time:
06/21/06 16:00 0 %
06/21/06 16:05 3 %
06/21/06 16:10 20 %
06/21/06 16:15 37 %
06/21/06 16:20 59 %
06/21/06 16:25 69 %
06/21/06 16:30 76 %
06/21/06 16:35 80 %
06/21/06 16:40 86 %
06/21/06 16:45 90 %
06/21/06 16:50 91 %
06/21/06 16:55 92 %
06/21/06 17:00 93 %
06/21/06 17:05 94 %
06/21/06 17:10 94 %
06/21/06 17:15 95 %
06/21/06 17:35 95 %
06/21/06 17:40 96 %
06/21/06 17:45 97 %
...
06/22/06 10:40 99 %
So even after 18 hours, there is still a certain percentage of users going to the old server so DNS failover is obviously not a 99.999% uptime solution. However, since more than 90% of the users are propagated in the first hour, the solution works well enough for me.
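The table can also be reduced to time-to-threshold figures. The sketch below uses only the measurements listed above; the function name is illustrative, and because the samples are 5 minutes apart the results are accurate only to that granularity.

```python
from datetime import datetime

CHANGE_TIME = datetime(2006, 6, 21, 16, 4)  # time of the DNS change

# (clock time on 06/21/06, percent propagated) copied from the table above
MEASUREMENTS = [
    ("16:00", 0), ("16:05", 3), ("16:10", 20), ("16:15", 37), ("16:20", 59),
    ("16:25", 69), ("16:30", 76), ("16:35", 80), ("16:40", 86), ("16:45", 90),
    ("16:50", 91), ("16:55", 92), ("17:00", 93), ("17:05", 94), ("17:10", 94),
    ("17:15", 95), ("17:35", 95), ("17:40", 96), ("17:45", 97),
]


def minutes_to_reach(threshold):
    """Minutes after the DNS change at which the table first shows >= threshold percent."""
    for clock, percent in MEASUREMENTS:
        if percent >= threshold:
            hour, minute = map(int, clock.split(":"))
            sample_time = CHANGE_TIME.replace(hour=hour, minute=minute)
            return int((sample_time - CHANGE_TIME).total_seconds() // 60)
    return None  # threshold not reached within the listed samples


for threshold in (50, 80, 90, 95):
    print(threshold, "% reached after", minutes_to_reach(threshold), "minutes")
```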
Regards
Jean-Philippe Bouchard
|

06-22-2006, 11:27 AM
|
|
Junior Guru Wannabe
|
|
Join Date: Jun 2006
Posts: 67
|
|
You are hosting your own DNS servers?
|

06-22-2006, 11:39 AM
|
|
Newbie
|
|
Join Date: Jun 2006
Location: Montreal, Canada
Posts: 5
|
|
Quote:
|
Originally Posted by siliconcowboy73
You are hosting your own DNS servers?
|
No, I'm using DNS Made Easy. Also, my main server is hosted at Layered Technologies and my failover server is at 1&1.
|

06-22-2006, 11:58 AM
|
|
Junior Guru Wannabe
|
|
Join Date: Jun 2006
Posts: 67
|
|
Quote:
|
Originally Posted by jeanphil
No, I'm using DNS Made Easy. Also, my main server is hosted at Layered Technologies and my failover server is at 1&1.
|
OK. What if you were to just set up phonydomain.com and host it with a 99.999% uptime company like Network Solutions or Register.com? Then have phonydomain.com redirect to your IP.
The world should cache phonydomain.com. Since it will "never" go down, just have phonydomain.com be the traffic redirector. You could log in to Network Solutions/Register.com and toggle it between your two servers. Would that work?
|

06-22-2006, 12:06 PM
|
|
Newbie
|
|
Join Date: Jun 2006
Location: Montreal, Canada
Posts: 5
|
|
Quote:
|
Originally Posted by siliconcowboy73
OK. What if you were to just set up phonydomain.com and host it with a 99.999% uptime company like Network Solutions or Register.com? Then have phonydomain.com redirect to your IP.
The world should cache phonydomain.com. Since it will "never" go down, just have phonydomain.com be the traffic redirector. You could log in to Network Solutions/Register.com and toggle it between your two servers. Would that work?
|
That's a good point. Actually, when I was doing my research, that's the first thing I looked for: a proxy provider that would allow me to point my domain to one of their IPs and control where that IP goes. A "hosted" load balancer/router service, if you will. However, I wasn't able to find any company offering that service, so I decided to go with DNS failover.
|

06-22-2006, 12:21 PM
|
|
Junior Guru Wannabe
|
|
Join Date: Jun 2006
Posts: 67
|
|
Quote:
|
Originally Posted by jeanphil
That's a good point. Actually, when I was doing my research, that's the first thing I looked for: a proxy provider that would allow me to point my domain to one of their IPs and control where that IP goes. A "hosted" load balancer/router service, if you will. However, I wasn't able to find any company offering that service, so I decided to go with DNS failover.
|
Try Akamai. I think they do that caching and redirect hosting stuff.
|

06-22-2006, 04:26 PM
|
|
Newbie
|
|
Join Date: Jun 2006
Posts: 17
|
|
Yeah, assuming BGP routing plus BGP anycast. Not 100%, but very, VERY close to it, I guess.
|

06-23-2006, 06:04 AM
|
|
Web Hosting Guru
|
|
Join Date: May 2004
Posts: 300
|
|
Jean-Philippe, many thanks for your report. Your research is much appreciated and addresses one of the questions I had about the DNSMadeEasy service. 93% of users finding your backup server within an hour sounds great.
I've been investigating this subject recently as well, and hope to duplicate your success.
The missing piece for me is learning how to create the mirror server and keep the data up to date.
I'd be grateful for any remarks you may care to share about your procedure there. Thanks again.
|

06-23-2006, 02:17 PM
|
|
Newbie
|
|
Join Date: Jun 2006
Location: Montreal, Canada
Posts: 5
|
|
Quote:
|
Originally Posted by squirreldog
Jean-Philippe, many thanks for your report. Your research is much appreciated and addresses one of the questions I had about the DNSMadeEasy service. 93% of users finding your backup server within an hour sounds great.
I've been investigating this subject recently as well, and hope to duplicate your success.
The missing piece for me is learning how to create the mirror server and keep the data up to date.
I'd be grateful for any remarks you may care to share about your procedure there. Thanks again.
|
You're welcome!
The 2 servers are running Debian 3.1 with MySQL 4.1 as the backend.
MySQL is replicated as described in this article (http://www.onlamp.com/pub/a/onlamp/2005/06/16/MySQLian.html). Moreover, the replication takes place over a virtual private network. I use OpenVPN (http://openvpn.net/), configured with a static key as described here (http://openvpn.net/static.html).
The web data is replicated using rsync (http://samba.anu.edu.au/rsync/) with ssh as the transport.
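As a rough illustration of the rsync-over-ssh step, here is a minimal sketch that builds and runs such a command from Python. The paths, the host name failover.example.com, and the choice of options are assumptions; the post does not state which options or schedule were used.

```python
import shlex
import subprocess


def build_rsync_command(source_dir, remote_host, dest_dir):
    """Build an rsync command that mirrors source_dir to remote_host over ssh."""
    return [
        "rsync",
        "-az",        # archive mode (permissions, times, symlinks) with compression
        "--delete",   # remove files on the mirror that no longer exist on the source
        "-e", "ssh",  # use ssh as the transport
        source_dir.rstrip("/") + "/",  # trailing slash: copy the directory's contents
        f"{remote_host}:{dest_dir}",
    ]


def mirror(source_dir, remote_host, dest_dir):
    """Run the sync; raises CalledProcessError if rsync exits non-zero."""
    subprocess.run(build_rsync_command(source_dir, remote_host, dest_dir), check=True)


# Hypothetical paths and host, for illustration only.
command = build_rsync_command("/var/www", "failover.example.com", "/var/www")
print(shlex.join(command))
```

Running something like mirror() periodically (from cron, for example) would keep the failover copy close to the main server, with the lag bounded by the interval between runs.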
I don't use the automatic failover feature of DNS Made Easy, but I do use their server monitoring feature. When the main server is down, I receive an SMS message and manually set the IP to the failover server.
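For anyone who prefers to run their own check instead of a hosted monitoring service, a minimal sketch of the "is the main server down?" decision might look like the following. The function names and the threshold of 3 consecutive failures are arbitrary assumptions, and the DNS update itself is left out because the provider's interface is not described here.

```python
import socket


def is_reachable(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_fail_over(check_results, required_failures=3):
    """Return True once the most recent required_failures checks all failed."""
    recent = check_results[-required_failures:]
    return len(recent) == required_failures and not any(recent)


# Hypothetical history of periodic is_reachable() results, oldest first.
history = [True, True, False, False, False]
print(should_fail_over(history))
```

Requiring several consecutive failures avoids switching on a single dropped probe, at the cost of a slightly slower reaction.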
If you need more details, let me know.
|

06-23-2006, 04:20 PM
|
|
WHT Addict
|
|
Join Date: Aug 2005
Posts: 126
|
|
I'd be very curious how the response is with much shorter TTL values. Your data seems to show that 80% of your users got routed even though they arrived within the TTL of 30 minutes. That's pretty good. It seems to imply that 80% of users had not visited recently, so their first hit was not cached. The other 20% may have been on the site and then "lost you" and had to wait some time to get the right IP again. For those users it looks like downtime, so it's not so good.
But what if you set it up more like dynamic DNS, where the DNS record gets updated with the IP whenever it changes and the TTL is shorter? I wonder how much you could reduce that downtime for users who were on the site at the time of failure.
|

06-23-2006, 04:42 PM
|
|
Newbie
|
|
Join Date: Jun 2006
Location: Montreal, Canada
Posts: 5
|
|
Quote:
|
Originally Posted by csavery
I'd be very curious how the response is with much shorter TTL values.
|
From what I read, some DNS servers will not honor a very short TTL and will fall back to a default value. 1/2 hour seems to be the "optimal" value.
Also, a very short TTL results in more hits on the DNS servers, which means more bandwidth cost or, in my case, a more expensive DNS Made Easy package (they charge on a per-request basis).
However, I may do another test with a TTL of 5 minutes, just to see if that theory of too short a TTL holds.
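To put a rough number on the extra DNS load, here is a back-of-the-envelope sketch. It assumes each caching resolver honors the TTL and is asked for the name continuously, so it re-queries once per TTL at most; real query counts depend on traffic patterns and will be lower for rarely used resolvers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_queries_per_resolver_per_day(ttl_seconds):
    """Upper bound on daily queries from one resolver that caches for the full TTL."""
    return SECONDS_PER_DAY // ttl_seconds


half_hour = max_queries_per_resolver_per_day(1800)
five_minutes = max_queries_per_resolver_per_day(300)
print(half_hour, five_minutes, five_minutes / half_hour)
```

Under these assumptions, going from a 30-minute to a 5-minute TTL raises the per-resolver ceiling from 48 to 288 queries per day, a factor of 6.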
Quote:
|
Originally Posted by csavery
Your data seems to show that 80% of your users got routed even though they arrived within the TTL of 30 minutes. That's pretty good. It seems to imply that 80% of users had not visited recently, so their first hit was not cached.
|
Not necessarily. When I switched the IP, DNS servers and/or users had the old IP in their caches with different expiration times. Some were due to refresh it in 30 minutes, some in 15 minutes, some in 1 minute, etc. Also, based on the usage pattern of the site (length of visits, etc.), I doubt that 80% of the visitors at any given time have been on the site for less than 30 minutes.
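The staggered-expiration argument can be made concrete with an idealized model: if every cache honors the 30-minute TTL and the remaining lifetimes of cached entries are spread uniformly between 0 and 30 minutes at the moment of the change, the propagated fraction grows linearly and reaches 100% at 30 minutes. The sketch below compares that model with a few rows of the table from the first post; the uniform-spread assumption and the function name are illustrative only.

```python
def expected_propagated_fraction(minutes_since_change, ttl_minutes=30):
    """Fraction of caches refreshed, if remaining cache lifetimes are uniform on [0, TTL]."""
    if minutes_since_change <= 0:
        return 0.0
    return min(minutes_since_change / ttl_minutes, 1.0)


# (minutes after the 16:04 change, observed percent) from the table in the first post
observed = [(6, 20), (16, 59), (26, 76), (31, 80), (56, 93)]
for minutes, percent in observed:
    model = 100 * expected_propagated_fraction(minutes)
    print(f"{minutes:>2} min: model {model:.0f} %, observed {percent} %")
```

The observed figures are close to the model early on and fall behind it from roughly the 26-minute sample onward, which would be consistent with some caches holding the record longer than the TTL.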
Quote:
|
Originally Posted by csavery
The other 20% may have been on the site and then "lost you" and had to wait some time to get the right IP again. For those users it looks like downtime, so it's not so good.
|
That's correct. It's a drawback of the solution. However, I'm willing to live with it since I have a major server crash once every few years.
Quote:
|
Originally Posted by csavery
But what if you set it up more like dynamic DNS, where the DNS record gets updated with the IP whenever it changes and the TTL is shorter? I wonder how much you could reduce that downtime for users who were on the site at the time of failure.
|
If I decide to run the experiment again with a shorter TTL, I'll definitely publish the results here.
|

06-15-2008, 11:07 AM
|
|
New Member
|
|
Join Date: May 2008
Posts: 2
|
|
Nice topic. I have been thinking about this too, although there is still considerable downtime.
Looking forward to hearing from you when you have completed the test with the TTL set to 5 minutes.
|

06-15-2008, 08:39 PM
|
|
Web Monkey
|
|
Join Date: Dec 2005
Location: Finland
Posts: 1,180
|
|
Thanks for posting. This is interesting. The results are much better than I expected.
|

06-16-2008, 02:34 PM
|
|
Away
|
|
Join Date: Jun 2002
Posts: 5,278
|
|
Depending on what you are hosting, why not just use a CDN that proxies your website?
|

08-20-2008, 07:17 AM
|
|
******* Unleaded
|
|
Join Date: Feb 2004
Posts: 3,788
|
|
I first ran into this data last year on jeanphil's blog and found it useful as a reference.
Earlier today, I ran into some interesting data to add. Quite by accident.
A zone that was running about 100K queries per day was pointed at another NS. The TTLs have always been very short, less than 300 seconds.
For the next 30 odd days, until the zone was repointed at the original NS, approximately 50+ queries were arriving at the NS. This caused no problems because the NS still had the zone data.
In comparison to the usual traffic, clearly 50+ is vanishingly small, approaching zero. But it is a number that showed no sign of declining in the 30 days.
So, while clearly there are some small number of clients or caches out there that are not respecting the TTL, it is so small as to be insignificant. However, those few are very stubborn about sticking to the original NS for reasons unknown.
Trivia? Maybe, but at least you have a clear set of numbers to draw your own conclusions from.
|