  1. #1
    Join Date
    Jun 2006
    Location
    Montreal, Canada
    Posts
    5

    Report on viability of DNS failover solution

    I run a site with about 1,000,000 unique visitors per month, and recent server failures made me decide to get a failover server to minimize downtime. My goal wasn't to get 99.999% uptime but to be back on track after a failure in a "reasonable" amount of time. After evaluating several solutions, I decided to go with DNS failover. Here's how the setup works:

    1) mydomain.com points to main server with a very low TTL (time to live)
    2) failover server replicates data from main server
    3) when main server goes down, mydomain.com is changed to point to failover server

    The drawback is the DNS propagation time, since some DNS servers don't honor the TTL and there is some caching happening on the user's machine and in the browser. I looked for empirical data to gauge the extent of the problem but couldn't find any, so I decided to set up my own experiment.

    The Experiment
    ==============

    I start with mydomain.com pointing to the main server with a TTL of 1800 seconds (1/2 hour). I then change it to point to the failover server, which simply port forwards to the main server. On the main server, I periodically compute the percentage of requests coming from the failover server, which gives me the percentage of users for whom the DNS change has propagated.
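    The measurement described above could be sketched roughly as follows. This is only an illustration, not the script actually used: the failover address (`FAILOVER_IP`), the 5-minute bucket size, and the `(timestamp, source_ip)` input format are all assumptions. It relies on the fact that port-forwarded requests reach the main server with the failover server as their source.

```python
from collections import defaultdict
from datetime import datetime

FAILOVER_IP = "203.0.113.10"  # hypothetical address of the failover server


def propagation_by_bucket(requests, failover_ip=FAILOVER_IP, bucket_minutes=5):
    """Percent of requests per time bucket that arrived via the failover server.

    requests: iterable of (datetime, source_ip) pairs as seen by the main server.
    Returns a dict mapping each bucket start time to a rounded percentage.
    """
    totals = defaultdict(int)
    via_failover = defaultdict(int)
    for ts, source_ip in requests:
        # Round the timestamp down to the start of its bucket.
        bucket = ts.replace(
            minute=ts.minute - ts.minute % bucket_minutes, second=0, microsecond=0
        )
        totals[bucket] += 1
        if source_ip == failover_ip:
            via_failover[bucket] += 1
    return {b: round(100 * via_failover[b] / totals[b]) for b in sorted(totals)}


if __name__ == "__main__":
    sample = [
        (datetime(2006, 6, 21, 16, 11), "203.0.113.10"),
        (datetime(2006, 6, 21, 16, 12), "198.51.100.7"),
    ]
    for bucket, pct in propagation_by_bucket(sample).items():
        print(bucket.strftime("%m/%d/%y %H:%M"), pct, "%")
```

    In practice the pairs would come from parsing the web server's access log, which is left out here.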

    I made the DNS change at exactly 16:04 on 06/21/06. Here is the percentage of users for whom the change had propagated at each time:

    06/21/06 16:00 0 %
    06/21/06 16:05 3 %
    06/21/06 16:10 20 %
    06/21/06 16:15 37 %
    06/21/06 16:20 59 %
    06/21/06 16:25 69 %
    06/21/06 16:30 76 %
    06/21/06 16:35 80 %
    06/21/06 16:40 86 %
    06/21/06 16:45 90 %
    06/21/06 16:50 91 %
    06/21/06 16:55 92 %
    06/21/06 17:00 93 %
    06/21/06 17:05 94 %
    06/21/06 17:10 94 %
    06/21/06 17:15 95 %
    06/21/06 17:35 95 %
    06/21/06 17:40 96 %
    06/21/06 17:45 97 %
    ...
    06/22/06 10:40 99 %

    So even after 18 hours, there is still a certain percentage of users going to the old server so DNS failover is obviously not a 99.999% uptime solution. However, since more than 90% of the users are propagated in the first hour, the solution works well enough for me.
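    To make the "how long until X% have switched" reading of the table explicit, here is a small sketch that computes the minutes elapsed between the 16:04 change and the first sample at or above a given percentage. The samples are copied from the table above; the function name and structure are just for illustration.

```python
from datetime import datetime

FMT = "%m/%d/%y %H:%M"
CHANGE_TIME = datetime.strptime("06/21/06 16:04", FMT)

# (timestamp, percent propagated) pairs copied from the table in this post.
SAMPLES = [
    ("06/21/06 16:00", 0), ("06/21/06 16:05", 3), ("06/21/06 16:10", 20),
    ("06/21/06 16:15", 37), ("06/21/06 16:20", 59), ("06/21/06 16:25", 69),
    ("06/21/06 16:30", 76), ("06/21/06 16:35", 80), ("06/21/06 16:40", 86),
    ("06/21/06 16:45", 90), ("06/21/06 16:50", 91), ("06/21/06 16:55", 92),
    ("06/21/06 17:00", 93), ("06/21/06 17:05", 94), ("06/21/06 17:10", 94),
    ("06/21/06 17:15", 95), ("06/21/06 17:35", 95), ("06/21/06 17:40", 96),
    ("06/21/06 17:45", 97), ("06/22/06 10:40", 99),
]


def minutes_to_reach(threshold):
    """Minutes from the DNS change to the first sample >= threshold, or None."""
    for stamp, percent in SAMPLES:
        if percent >= threshold:
            elapsed = datetime.strptime(stamp, FMT) - CHANGE_TIME
            return int(elapsed.total_seconds() // 60)
    return None


if __name__ == "__main__":
    for threshold in (50, 90, 95, 99):
        print(threshold, "% reached after", minutes_to_reach(threshold), "minutes")
```

    Because the samples are 5 minutes apart (with a gap before the last one), these figures are upper bounds on when each level was actually reached.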

    Regards
    Jean-Philippe Bouchard

  2. #2
    You are hosting your own DNS servers?
    My shameless plug here:
    http://www.FishingConnection.net
    Let's go FISHING!

  3. #3
    Join Date
    Jun 2006
    Location
    Montreal, Canada
    Posts
    5
    Quote Originally Posted by siliconcowboy73
    You are hosting your own DNS servers?
    No, I'm using DNS Made Easy. Also, my main server is hosted at Layered Technologies and my failover server is at 1&1.

  4. #4
    Quote Originally Posted by jeanphil
    No, I'm using DNS Made Easy. Also, my main server is hosted at Layered Technologies and my failover server is at 1&1.

    Ok. What if you were to just set up phonydomain.com and host it with a 99.999% uptime company like Network Solutions or Register.com? Then have phonydomain.com redirect to your IP.

    The world should cache phonydomain.com. Since it will "never" go down just have phonydomain.com be the traffic redirector. You could login to Network Solutions/Register.com and toggle it between your two servers. Would that work?

  5. #5
    Join Date
    Jun 2006
    Location
    Montreal, Canada
    Posts
    5
    Quote Originally Posted by siliconcowboy73
    Ok. What if you were to just set up phonydomain.com and host it with a 99.999% uptime company like Network Solutions or Register.com? Then have phonydomain.com redirect to your IP.

    The world should cache phonydomain.com. Since it will "never" go down just have phonydomain.com be the traffic redirector. You could login to Network Solutions/Register.com and toggle it between your two servers. Would that work?
    That's a good point. Actually, when I was doing my research, that's the first thing I looked for: a proxy provider that would allow me to point my domain to one of their IPs and control where that IP goes. A "hosted" load balancer/router service, if you will. However, I wasn't able to find any company offering that service, so I decided to go with DNS failover.

  6. #6
    Quote Originally Posted by jeanphil
    That's a good point. Actually, when I was doing my research, that's the first thing I looked for: a proxy provider that would allow me to point my domain to one of their IPs and control where that IP goes. A "hosted" load balancer/router service, if you will. However, I wasn't able to find any company offering that service, so I decided to go with DNS failover.
    Try Akamai. I think they do that caching and redirect hosting stuff.

  7. #7
    ..yeah... assuming bgp routing + bgp anycasting.... not 100% but very, VERY close to it I guess

  8. #8
    Join Date
    May 2004
    Posts
    300
    Jean-Philippe, many thanks for your report. Your research is much appreciated and addresses one of the questions I had about the DNSMadeEasy service. 93% of users found your backup server within an hour, sounds great.

    I've been investigating this subject recently as well, and hope to duplicate your success.

    The missing piece for me is learning how to create the mirror server and keep the data up to date.

    I'd be grateful for any remarks you may care to share about your procedure there. Thanks again.

  9. #9
    Join Date
    Jun 2006
    Location
    Montreal, Canada
    Posts
    5
    Quote Originally Posted by squirreldog
    Jean-Philippe, many thanks for your report. Your research is much appreciated and addresses one of the questions I had about the DNSMadeEasy service. 93% of users found your backup server within an hour, sounds great.

    I've been investigating this subject recently as well, and hope to duplicate your success.

    The missing piece for me is learning how to create the mirror server and keep the data up to date.

    I'd be grateful for any remarks you may care to share about your procedure there. Thanks again.
    You're welcome!

    The 2 servers are running Debian 3.1 with MySQL 4.1 as the backend.

    MySQL is replicated as described in this article (http://www.onlamp.com/pub/a/onlamp/2005/06/16/MySQLian.html). Moreover, the replication takes place over a virtual private network. I use OpenVPN (http://openvpn.net/), configured with a static key as described here (http://openvpn.net/static.html).

    The web data is replicated using rsync (http://samba.anu.edu.au/rsync/) using ssh as the transport.

    I don't use the automatic failover feature of DNS Made Easy, but I do use their server monitoring feature. When the main server is down, I receive an SMS message and set the IP to the failover server.

    If you need more details, let me know.
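    For anyone who wants to run their own check alongside a hosted monitor, here is a minimal sketch of the "is the main server down?" decision. It is not what DNS Made Easy does internally; the TCP probe, the port, and the rule of three consecutive failures are all assumptions chosen so that a single dropped connection does not trigger a switch.

```python
import socket


def is_up(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def should_fail_over(check_results, failures_needed=3):
    """True once the most recent `failures_needed` checks have all failed.

    check_results: sequence of booleans, oldest first (True means the check passed).
    """
    recent = list(check_results)[-failures_needed:]
    return len(recent) == failures_needed and not any(recent)


if __name__ == "__main__":
    history = [True, True, False, False, False]  # hypothetical probe history
    if should_fail_over(history):
        print("Main server looks down: send the alert and switch the DNS record")
```

    A scheduler (cron, for example) would append the result of `is_up(...)` to the history once a minute; sending the alert and changing the DNS record are left out because they depend on the provider.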

  10. #10
    I'd be very curious how the response is with much shorter TTL values. Your data seems to show that 80% of your users got routed even though they arrived within the TTL of 30 minutes. That's pretty good. It seems to imply that 80% of users had not visited within recent time and so their first hit was not cached. The other 20% may have been on and then "lost you" and had to wait some time to get the right ip again. For those users it looks like downtime, so it's not so good.

    But what if you set it up more like a dynamic dns where the dns gets updated with the ip whenever it changes and in those cases TTL is shorter. I wonder how much you could reduce that downtime for users who were on at the time of failure?

  11. #11
    Join Date
    Jun 2006
    Location
    Montreal, Canada
    Posts
    5
    Quote Originally Posted by csavery
    I'd be very curious how the response is with much shorter TTL values.
    From what I read, some DNS servers will not honor a very short TTL and will fall back to a default value. 1/2 hour seems to be the "optimal" value.

    Also, a very short TTL results in more hits on the DNS servers, which means more bandwidth cost or, in my case, a more expensive DNS Made Easy package (they charge on a per-request basis).

    However, I may do another test with a TTL of 5 minutes, just to see if that theory of too short a TTL holds.

    Quote Originally Posted by csavery
    Your data seems to show that 80% of your users got routed even though they arrived within the TTL of 30 minutes. That's pretty good. It seems to imply that 80% of users had not visited within recent time and so their first hit was not cached.
    Not necessarily. When I switched the IP, DNS servers and/or users had the old IP in their caches with different expiration times. Some were due to refresh it 30 minutes later, some 15 minutes later, some 1 minute later, etc. Also, based on the usage pattern of the site (length of visits, etc.), I doubt that 80% of the visitors at any given time have been on the site for less than 30 minutes.
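    The staggered-expiration argument can be turned into a simple idealized model. Assuming every cache honors the 30-minute TTL and that the cached entries were fetched at times spread evenly over the previous 30 minutes (both are assumptions, not measurements), the share of caches that have picked up the new IP grows linearly and reaches 100% at the TTL. The sketch below compares that model with a few rows from the table in the first post.

```python
TTL_MINUTES = 30


def expected_propagated(minutes_since_change, ttl=TTL_MINUTES):
    """Idealized percent of caches refreshed, assuming evenly staggered expiries."""
    if minutes_since_change <= 0:
        return 0.0
    return min(100.0 * minutes_since_change / ttl, 100.0)


# Minutes after the 16:04 change -> observed percent, taken from the table
# (samples at 16:10, 16:20, 16:30 and 16:45).
OBSERVED = {6: 20, 16: 59, 26: 76, 41: 90}

if __name__ == "__main__":
    for minutes, observed in OBSERVED.items():
        model = expected_propagated(minutes)
        print(f"{minutes:>3} min: model {model:5.1f} %, observed {observed} %")
```

    Under these assumptions the early samples sit close to the linear ramp, while the samples near and after the 30-minute mark fall below it. That shortfall is the part attributable to caches that keep the record longer than the TTL, or to caching on the user's machine and browser.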

    Quote Originally Posted by csavery
    The other 20% may have been on and then "lost you" and had to wait some time to get the right ip again. For those users it looks like downtime, so it's not so good.
    That's correct. It's a drawback of the solution. However, I'm willing to live with it since I have a major server crash once every few years.

    Quote Originally Posted by csavery
    But what if you set it up more like a dynamic dns where the dns gets updated with the ip whenever it changes and in those cases TTL is shorter. I wonder how much you could reduce that downtime for users who were on at the time of failure?
    If I decide to run the experiment again with a shorter TTL, I'll definitely publish the results here.

  12. #12
    Nice topic. I have been thinking about this too, although there is still considerable downtime.
    Looking forward to hearing from you when you have completed the test with the TTL set to 5 mins.

  13. #13
    Join Date
    Dec 2005
    Location
    Finland
    Posts
    1,466
    Thanks for posting. This is interesting. The results are much better than I expected.

  14. #14
    Depending on what you are hosting why not just use a CDN that proxies your website.....

  15. #15
    I first ran into this data last year on jeanphil's blog and found it useful as a reference.

    Earlier today, I ran into some interesting data to add. Quite by accident.

    A zone that was running about 100K queries per day was pointed at another NS. The TTLs have always been very short, less than 300 seconds.

    For the next 30 odd days, until the zone was repointed at the original NS, approximately 50+ queries were arriving at the NS. This caused no problems because the NS still had the zone data.

    In comparison to the usual traffic, clearly 50+ is infinitely small, approaching zero. But, it is a number that showed no sign of declining in the 30 days.

    So, while clearly there are some small number of clients or caches out there that are not respecting the TTL, it is so small as to be insignificant. However, those few are very stubborn about sticking to the original NS for reasons unknown.

    Trivia? Maybe, but at least you have a clear set of numbers to draw your own conclusions from.
    edgedirector.com
    managed dns global failover and load balance (gslb)
    exactstate.com
    uptime report for webhostingtalk.com
