Web Hosting Talk

Redundancy Is not Enough


web_res
12-28-2001, 04:12 PM
I don't really get some hosts...

Example, Rackshack: Currently their network is down, but they have, what, like 7 providers? It's not enough to just have a lot of providers... I really think they need to do something about their network. Maybe get some really "capable" people to do their networking. Even when they are working, their connection speeds are still disappointing. I am surprised that they are actually utilizing BGP4.

Normally I am able to download at over 80 KBytes/s with my connection to decent sites. But with Rackshack it's more like 20-30 KBytes/s, and with pretty bad latency.

This doesn't seem too uncommon either...

web_res
12-28-2001, 04:32 PM
Hmm, in the last 25 minutes it came back up and now it's down once again...

Never mind, it was apparently a power outage. Must have been two of them.

Now about those redundant power supply things they have...

headsurfer
12-28-2001, 09:37 PM
We did in fact lose power for about 3 minutes twice today. The primary failure resulted from an electrician dropping a tool and arcing two 480 streams, which blew through a primary breaker and into the primary UPS controller. That primary controller then, shall we say, exploded. There are pics in our forum at our site.

The electrician is lucky to be alive.

The primary failure did what it was supposed to do and bypassed to street power, but because the primary building breaker also tripped, it went to genset power. If the UPS controller had not been fried, it would have been on battery.

Then, when the main building breaker was reset, it flipped back from genset power to street power, but because the primary UPS controller had failed earlier, there was no battery in the middle to keep everything alive during the switchover.

The absolute worst kind of failure and we get it right before the new year.

Since the bypass is intact, once the UPS controllers and such are repaired and the batteries load tested again, we will switch back to operating mode on the UPS. Then we will be able to run on generator while the main breaker is replaced.

It is a complex situation, and we have engineers as well as techs from the electricians and UPS companies working on the issue.

In conclusion, there were two outages of about 2 to 5 minutes each and power is currently restored.

Our forum is quite slow due to the enormous load and downloads of the UPS pics. I beg the moderators, should this cross any line, please leave it intact so that our customers who frequent this forum can have a good update.

Robert Marsh
Head Surfer Rackshack.net

web_res
12-28-2001, 09:43 PM
Whoa... I didn't realize it was that serious...

I hope everyone is physically okay...

The Prohacker
12-28-2001, 10:30 PM
Originally posted by headsurfer
The electrician is lucky to be alive.



From some of those pictures, you're absolutely right. The electrician should lose his license for that screw-up. When dealing with what looks like industrial 440V, you must be much, much more careful, and shut down the circuit before working on it, or anywhere near it...

ADEhost
12-29-2001, 01:17 AM
Originally posted by The Prohacker

From some of those pictures, you're absolutely right. The electrician should lose his license for that screw-up. When dealing with what looks like industrial 440V, you must be much, much more careful, and shut down the circuit before working on it, or anywhere near it...

Did we not all learn that killing the power is the first thing we do? The sad thing is that he might have done that, but when the screwdriver fell it hit the UPS voltage line (that's a guess; in no way am I an electrician).

mike

richy
12-29-2001, 10:27 AM
The RS accident was a fluke of monumental proportions. They have excellent redundancy, but that didn't help them here. They need proper industrial (read: power plant, chemical plant) measures in place to monitor contractors. All contractors need to be supervised directly by a member of staff who knows the systems, they need to be competent and briefed, and their tools need to be attached to them at all times, preferably with wire ;) so they can't drop anything twice.

Mike the newbie
12-29-2001, 11:00 AM
They may have excellent redundancy in some areas, but there was still a single point of failure in at least one critical area.

Accidents happen, especially when humans are around. A properly designed redundant system should accommodate that.
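
To put rough numbers on the single-point-of-failure argument, here is a minimal Python sketch. The availability figures are invented for illustration and are not Rackshack's. It also assumes components fail independently, which an accident that takes out a breaker and a UPS controller at the same time clearly violates.

```python
from math import prod

def series(*avail):
    """Every component must work, so availabilities multiply."""
    return prod(avail)

def parallel(*avail):
    """At least one component must work (independent failures assumed)."""
    return 1 - prod(1 - a for a in avail)

# Hypothetical figures, for illustration only.
feed = 0.999        # one power path (street or generator)
controller = 0.999  # a single shared UPS controller

redundant_feeds = parallel(feed, feed)
with_spof = series(redundant_feeds, controller)
fully_redundant = series(redundant_feeds, parallel(controller, controller))

print(f"redundant feeds only:       {redundant_feeds:.6f}")
print(f"feeds + single controller:  {with_spof:.6f}")
print(f"feeds + redundant control:  {fully_redundant:.6f}")
```

Under these assumptions the chain with one shared controller can never be better than the controller itself, no matter how many feeds sit in front of it.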

richy
12-29-2001, 11:22 AM
Humans? Who mentioned them? Contractors and electricians maybe, lol. They need a serious think about changing both their physical systems and their operational procedures to prevent this happening again. A look at their network graphs shows how many servers didn't come back up immediately. Looks like their traffic halved after the outage.

Mike the newbie
12-30-2001, 11:42 AM
Originally posted by richy
... they need a serious think about changing both their physical systems and their operational procedures to prevent this happening again. ...



Yup. It is easy to put the words "state of the art carrier class data facility" on their webpage. But to install the systems and institute the procedures that earn that classification is a daunting task, requiring a whole different mindset for running the datacenter.

It will, in my experience, take at least a couple of years before we find out whether they really have the carrier grade data center that they promise. A carrier grade facility has such low downtime numbers that only after a couple of years will the downtime numbers become statistically valid.
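
A rough back-of-the-envelope sketch in Python of why it takes that long. Two assumptions of mine, not anything Rackshack has stated: "carrier grade" is taken to mean the commonly quoted 99.999% availability, and outages are modeled as a Poisson process, so that after observing zero outages the upper bound on the yearly outage rate is -ln(1 - confidence) / years.

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_budget_minutes(availability):
    """Downtime allowed per year for a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

def outage_rate_upper_bound(years_observed, confidence=0.95):
    """Upper bound on outages per year after seeing zero outages,
    assuming outages follow a Poisson process."""
    return -math.log(1 - confidence) / years_observed

# Assumed target: 99.999% ("five nines").
five_nines = downtime_budget_minutes(0.99999)

# From the thread: two outages of about 3 minutes each.
reported = 2 * 3

print(f"five-nines budget: {five_nines:.2f} min/year, reported: {reported} min")
for years in (0.5, 1, 2, 5):
    bound = outage_rate_upper_bound(years)
    print(f"{years} outage-free years -> fewer than {bound:.2f} outages/year")
```

On those assumptions, the two short outages reported in this thread already exceed a full year's five-nines budget, and even two completely clean years would only support a claim of fewer than about 1.5 outages per year at 95% confidence.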