Web Hosting Talk







Redundant Pipes


Circa3000
01-18-2002, 01:21 AM
Hi all,

Can a routing failure (at a Qwest facility) knock out access to your network, even if you have redundant pipes?

Our upstream provider procures bandwidth for us from Qwest and others. Today, a network failure outside of our company disrupted service for over two hours. The disruption was not total -- some visitors did get through, while those who reported failures could not get through at all, despite several retries.

The inevitable question was asked by one of my customers, "but don't you guys have redundant pipes?" I replied, "yes, of course" but then couldn't explain why he would not be rerouted around the ailing Qwest facility (in Burbank).

According to our upstream provider, a router (which is normally responsible for switching traffic from a slow or broken pipe to a good one) can itself fail, blocking access through all of the available pipes.

I'll buy that, but I still can't explain why my customer in Indiana would be routed through Burbank, CA and then hang up there on a failed router -- rather than be immediately rerouted through, say, Sacramento -- our nearest mega-POP.

Do these explanations add up? Or am I paying for redundancy that I'm not getting?

Any input is greatly appreciated.

astanley
01-18-2002, 08:57 AM
Well, a downed router is not the same as a routing error. Border Gateway Protocol, better known as BGP, is the protocol of choice for most companies running routers now, and the biggest reason is its support for classless addressing (assigning partial IP blocks instead of full class A/B/C networks). At a set interval, all routers running BGP check the status of their neighboring routers, and if one is unresponsive, BGP stops using the routes through it and sends all traffic through the next-priority link. In the event of a routing ERROR, however, the bordering router does not become unresponsive: the router never goes down, it has merely been misconfigured. A routing error would show up in a traceroute as what is known as a "routing loop". If your users had tried to traceroute your machine during the downtime, they would've seen something similar to:

7 icix.att.net (165.117.69.10) 11.16 ms 12.948 ms 10.619 ms
8 gbr3-p50.wswdc.ip.att.net (12.123.9.50) 10.525 ms 10.753 ms 10.469 ms
9 gbr3-p40.sl9mo.ip.att.net (12.122.2.82) 43.565 ms 43.883 ms 41.358 ms
10 gbr3-p20.sffca.ip.att.net (12.122.2.74) 75.868 ms 77.882 ms 78.434 ms
11 gbr3-p40.sl9mo.ip.att.net (12.122.2.82) 43.565 ms 43.883 ms 41.358 ms
12 gbr3-p20.sffca.ip.att.net (12.122.2.74) 75.868 ms 77.882 ms 78.434 ms
13 gbr3-p40.sl9mo.ip.att.net (12.122.2.82) 43.565 ms 43.883 ms 41.358 ms
14 gbr3-p20.sffca.ip.att.net (12.122.2.74) 75.868 ms 77.882 ms 78.434 ms

Notice that hops 9-14 simply bounce the packet back and forth between two routers; this is most likely the error your customers would've seen. As for why a customer was routed halfway across the internet to get to your site: that's just what happens. Routers are fairly intelligent and constantly poll all neighboring connections in order to build a reliability and bandwidth table. Each neighboring router is rated based on this table, and packets are sent to the highest-rated neighbor that has an entry in the routing table for that destination. I hope that's not too confusing, and I hope it better explains the problem you had.
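
If it helps, here is a rough Python sketch of how you could spot that kind of loop automatically in saved traceroute output. It is only an illustration: the function names (parse_hops, find_routing_loop) are made up for this example, the parser assumes the "hop hostname (IP) times" layout shown above, and the sample data is trimmed from that same trace. It flags the first address that shows up at two different hops, which is the telltale sign of a loop.

```python
import re

# Matches a hop line such as:
# "9 gbr3-p40.sl9mo.ip.att.net (12.122.2.82) 43.565 ms 43.883 ms 41.358 ms"
# Assumption: hop number, hostname, then the IP address in parentheses.
HOP_RE = re.compile(r"^\s*(\d+)\s+\S+\s+\((\d{1,3}(?:\.\d{1,3}){3})\)")


def parse_hops(trace_text):
    """Return a list of (hop_number, ip_address) tuples from traceroute text."""
    hops = []
    for line in trace_text.splitlines():
        match = HOP_RE.match(line)
        if match:  # lines that do not look like hops are skipped
            hops.append((int(match.group(1)), match.group(2)))
    return hops


def find_routing_loop(hops):
    """Return (first_hop, repeat_hop, ip) for the first address seen twice, else None."""
    first_seen = {}
    for hop_number, ip in hops:
        if ip in first_seen:
            return (first_seen[ip], hop_number, ip)
        first_seen[ip] = hop_number
    return None


# Sample input trimmed from the trace above.
TRACE = """\
7 icix.att.net (165.117.69.10) 11.16 ms 12.948 ms 10.619 ms
8 gbr3-p50.wswdc.ip.att.net (12.123.9.50) 10.525 ms 10.753 ms 10.469 ms
9 gbr3-p40.sl9mo.ip.att.net (12.122.2.82) 43.565 ms 43.883 ms 41.358 ms
10 gbr3-p20.sffca.ip.att.net (12.122.2.74) 75.868 ms 77.882 ms 78.434 ms
11 gbr3-p40.sl9mo.ip.att.net (12.122.2.82) 43.565 ms 43.883 ms 41.358 ms
12 gbr3-p20.sffca.ip.att.net (12.122.2.74) 75.868 ms 77.882 ms 78.434 ms
"""

if __name__ == "__main__":
    print(find_routing_loop(parse_hops(TRACE)))
```

On the sample above it reports that 12.122.2.82 first appears at hop 9 and again at hop 11. A clean trace returns None. Keep in mind that a dead router usually looks different (timeouts shown as asterisks rather than repeated addresses), so this only catches the loop case.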

-Adam