Web Hosting Talk

View Full Version : What good is BGP4 when one router going down kills you?


pmak0
12-23-2001, 09:36 AM
Right now, about half the servers hosted at RackShack.net are unreachable. According to the techs, this is due to a backbone problem with Verio.

All of RackShack.net's machines are in the same facility, AFAIK. And that facility has multiple backbone connections other than Verio.

So why are those machines unreachable, when other backbones are working? Isn't BGP4 supposed to be able to automatically route around this sort of failure?

cperciva
12-23-2001, 09:41 AM
BGP4 is designed to route around network problems. Unfortunately it makes no provision for routing around PEBKAC.

bitserve
12-23-2001, 02:08 PM
Originally posted by cperciva
BGP4 is designed to route around network problems. Unfortunately it makes no provision for routing around PEBKAC.

LOL! :)

pmak0
12-23-2001, 05:21 PM
> BGP4 is designed to route around network problems.
> Unfortunately it makes no provision for routing around
> PEBKAC.

I don't understand... what is PEBKAC?

cperciva
12-23-2001, 05:37 PM
Originally posted by pmak0
I don't understand... what is PEBKAC?

Problem
Exists
Between
Keyboard
And
Chair

pmak0
12-23-2001, 05:42 PM
> Problem Exists Between Keyboard And Chair

Ahh, so you're saying that given that RackShack's data center has redundant backbone connections etc., the fact that many machines went down due to one router failing means that their network is not configured properly for redundancy (but presumably could be)?

One other peculiar thing that I noticed during the outage: I logged into a RaQ that was still reachable, and tried to ping a RaQ in the same building that was unreachable---I couldn't. (Someone surmised that RackShack's network was set up such that internal traffic had to pass through external routers.)

cperciva
12-23-2001, 05:57 PM
Originally posted by pmak0
Ahh, so you're saying that given that RackShack's data center has redundant backbone connections etc., the fact that many machines went down due to one router failing means that their network is not configured properly for redundancy (but presumably could be)?

That's more or less it. The case of a router failing inside their datacenter should be solved by having internal redundancy, while the case of an external router failure should be solved by routing packets along other lines.

But to be fair, it is easy to set things up wrong, and almost impossible to detect a misconfiguration before it causes problems.
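
To make the "routing packets along other lines" idea concrete, here is a toy sketch of failover between upstreams. It is not a BGP implementation: real BGP4 best-path selection weighs many more attributes. The upstream names come from this thread, but the AS-path lengths and the `best_upstream` helper are invented for illustration.

```python
# Toy model of multihomed failover; not a real BGP implementation.
# Upstream names come from the thread; AS-path lengths are made up.

def best_upstream(routes, down=frozenset()):
    """Return the usable upstream with the shortest AS path, or None."""
    usable = {name: length for name, length in routes.items() if name not in down}
    if not usable:
        return None
    # Shortest path wins; ties are broken alphabetically so the result is stable.
    return min(usable, key=lambda name: (usable[name], name))

routes = {"Verio": 2, "Savvis": 3, "Time Warner": 3}  # hypothetical lengths

print(best_upstream(routes))                  # Verio
print(best_upstream(routes, down={"Verio"}))  # Savvis
```

If every block of addresses is reachable through every upstream, losing one upstream only changes which path is chosen. The failure mode discussed here would correspond to some entries having no alternative at all.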

alchiba
12-23-2001, 06:12 PM
Originally posted by cperciva
it is easy to set things up wrong, and almost impossible to detect a misconfiguration before it causes problems.

Actually, it's very simple. You test it.

pmak0
12-23-2001, 06:15 PM
> Actually, it's very simple. You test it.

That's what I was thinking too. For example, couldn't RackShack have tested what happens if they suddenly switch off their router to Verio (or their router to Savvis, or to Time Warner, etc.)?
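
The same test can be run on paper before anyone unplugs a router. Below is a minimal sketch, assuming you can list which upstreams carry each block of addresses. The block names and assignments are hypothetical (nobody in this thread knows RackShack's actual configuration), and `single_points_of_failure` is an illustrative helper, not part of any tool.

```python
# Hypothetical announcement table: which upstreams carry each address block.
# Block names and assignments are invented for illustration.
announcements = {
    "block-a": {"Verio", "Savvis", "Time Warner"},
    "block-b": {"Verio"},  # single-homed by mistake
}

def single_points_of_failure(announcements):
    """Map each upstream to the blocks that become unreachable if it alone fails."""
    upstreams = set().union(*announcements.values())
    report = {}
    for failed in sorted(upstreams):
        # A block is lost when no upstream remains after removing the failed one.
        lost = sorted(b for b, via in announcements.items() if not (via - {failed}))
        if lost:
            report[failed] = lost
    return report

print(single_points_of_failure(announcements))  # {'Verio': ['block-b']}
```

An empty report means no single upstream failure strands any block, at least as far as this simplified model can tell.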

alchiba
12-23-2001, 06:17 PM
Originally posted by pmak0
couldn't RackShack have tested what happens if they suddenly switch off their router to Verio (or their router to Savvis, or to Time Warner, etc.)?

They're stupid as rocks if they don't or didn't.

cperciva
12-23-2001, 06:19 PM
Originally posted by alchiba
Actually, it's very simple. You test it.

Well, yes, but people like to have excuses when their network goes down. "AT&T screwed up" is a much better excuse than "we decided to start randomly turning routers off", don't you think?

alchiba
12-23-2001, 07:29 PM
Originally posted by cperciva
Well, yes, but people like to have excuses when their network goes down. "AT&T screwed up" is a much better excuse than "we decided to start randomly turning routers off", don't you think?

This is not best practice, and I certainly hope anyone serious about their business would not approach installing new equipment in such a cavalier manner. If it were done properly, there would be no need for such transparent excuses.

cperciva
12-23-2001, 09:40 PM
Originally posted by alchiba
I certainly hope anyone serious about their business would not approach installing new equipment in such a cavalier manner.

I entirely agree. I'd also hope that anyone serious about creating an operating system wouldn't move to an entirely new, untested, virtual memory subsystem in the middle of a "stable" kernel series.

Unfortunately, because companies almost never publish accurate details of their internal operations, it's hard for an outside observer to tell if people apply best practices or if they've just been lucky so far... until Bad Things happen.

You can't build a poorly engineered bridge, because the government will come along, inspect your plans, and refuse to give you a permit. Unfortunately, there are no regulations in place concerning poorly engineered networks (or operating systems!).

cbaker17
12-24-2001, 12:24 AM
Unless RackShack's core router went down, traffic should always route around Verio. Sounds like there's something they're not telling you.

sigma
12-24-2001, 08:24 AM
Originally posted by pmak0
One other peculiar thing that I noticed during the outage: I logged into a RaQ that was still reachable, and tried to ping a RaQ in the same building that was unreachable---I couldn't. (Someone surmised that RackShack's network was set up such that internal traffic had to pass through external routers.)

It's pretty easy to arrange things that way using a "Layer 3 Switch" such as the NetIron or BigIron. Then you have internal traffic moving through the "switching side", and external traffic through the "routing side" of the same device. There's more work for the box to do, and more problems if it goes down, but hey, you save money, and that's what it's all about, right?

Disclaimer: I have no knowledge of anyone's network configuration other than my own, and I'm sure they are all excellent designs with multiple redundancies for internal, external, hardware, and software issues. Any problems are likely caused by irritable holiday elves rather than design problems.

Kevin

pmak0
12-26-2001, 09:14 AM
I think some routing tables are screwed up. I can reach
forum.rackshack.net, but my own site (lina.aaanime.net)
hosted on the same network is unreachable.

============================================================
=== VisualRoute (tm) 4.2a report on 26-Dec-01 8:04:50 AM ===
============================================================

Real-time report for lina.aaanime.net [216.40.250.27] (70% done)

Analysis: IP packets are not moving from network "Time Warner Telecom" to network "Time Warner Telecom Interfaces" at hops 12-13. Connections to HTTP port 80 are being rejected.

---------------------------------------------------------------------------------------------------------------------------------------------------------
| Hop | Err | IP Address | Node Name | Location | ms | Graph | Network |
---------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | | 64.3.42.49 | LINA-CHAN | * | | | Concentric Network Corporation |
| 1 | 6 | | | | | | |
| 2 | | 64.3.41.113 | - | ?San Jose, CA 95126-3429 | 178 | -x-- | Concentric Network Corporation |
| 3 | | 64.220.2.226 | ge10-0.dist1.was-dc.us.xo.net | ?San Jose, CA 95126-3429 | 181 | -x--- | Concentric Network Corporation |
| 4 | | 64.220.0.213 | ge1-0.edge1.was-dc.us.xo.net | ?San Jose, CA 95126-3429 | 155 | x-- | Concentric Network Corporation |
| 5 | | 207.88.56.38 | - | ?San Jose, CA 95126-3429 | 138 | x | Concentric Network Corporation |
| 6 | | 198.32.187.48 | mae-east.twtelecom.com | Vienna, VA, USA | 136 | x | Exchange Point Blocks |
| 7 | | 168.215.53.129 | jr-02-so-1-1-0-155m.chrl.twtelecom.net | ?Brookfield, WI 53045 | 146 | x | Time Warner Telecom |
| 8 | | 168.215.53.146 | ip146.53.215.168.in-addr.arpa | ?Brookfield, WI 53045 | 177 | x- | Time Warner Telecom |
| 9 | | 168.215.55.229 | jr-01-ge-2-3-0-1000m.atln.twtelecom.net | ?Brookfield, WI 53045 | 173 | x- | Time Warner Telecom |
| 10 | | 168.215.53.65 | jr-01-so-2-2-0-622m.dlfw.twtelecom.net | ?Brookfield, WI 53045 | 180 | x- | Time Warner Telecom |
| 11 | | 168.215.53.62 | jr-01-so-0-0-0.2488m.hsto.twtelecom.net | ?Brookfield, WI 53045 | 177 | x- | Time Warner Telecom |
| 12 | | 168.215.172.42 | jr-03-ge-0-3-0-1000m.hsto.twtelecom.net | ?Brookfield, WI 53045 | 177 | x- | Time Warner Telecom |
| 13 | 5 | ?64.132.191.14 | ip14.191.132.64.in-addr.arpa | | | | Time Warner Telecom Interfaces |
| 14 | 5 | ?207.218.223.33 | tayhou-223-33.ev1.net | | | | Everyones Internet, Inc. |
| 15 | 5 | ?216.40.250.27 | lina.aaanime.net | | | | Everyones Internet, Inc. |
---------------------------------------------------------------------------------------------------------------------------------------------------------
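
For anyone who wants to find the break point in a trace like this without VisualRoute, here is a small sketch. The hop numbers and round-trip times are transcribed from the table above (hops 0 and 1 are left out because the report lists no time for them); the `last_responding_hop` helper is an illustrative name, not part of any tool.

```python
# (hop number, round-trip ms or None) transcribed from the report above.
hops = [(2, 178), (3, 181), (4, 155), (5, 138), (6, 136), (7, 146), (8, 177),
        (9, 173), (10, 180), (11, 177), (12, 177),
        (13, None), (14, None), (15, None)]

def last_responding_hop(hops):
    """Return the highest hop number that answered, or None if none did."""
    answered = [number for number, ms in hops if ms is not None]
    return max(answered) if answered else None

last = last_responding_hop(hops)
print(f"Last reply at hop {last}; path breaks between hops {last} and {last + 1}")
```

This agrees with the report's own analysis that packets stop between hops 12 and 13. It only shows where replies stop, not why.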