something i don't understand
we've been at colo4 since december 2008, so i assume we are in the 'old' facility.
we've got 11 servers in the rack, and at about 12:20 this afternoon, nine were back online.
two are still down.
when i ssh'd into our main shared server, i noticed that the last boot date had not changed.
this implies that those nine machines did NOT lose power.
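for anyone who wants to run the same check on their own boxes, here is a rough python sketch of it. it assumes a linux machine with /proc/uptime; the function names and the idea of comparing against an outage start time are mine, purely for illustration.

```python
from datetime import datetime, timedelta, timezone


def boot_time(uptime_text, now):
    """Work out the boot time from the contents of /proc/uptime.

    The first field of /proc/uptime is the number of seconds since boot.
    """
    uptime_seconds = float(uptime_text.split()[0])
    return now - timedelta(seconds=uptime_seconds)


def rebooted_since(uptime_text, now, outage_start):
    """True if the machine booted after the outage began (so it lost power or was restarted)."""
    return boot_time(uptime_text, now) > outage_start


def local_boot_time():
    """Read /proc/uptime on this machine (Linux only) and return its boot time in UTC."""
    with open("/proc/uptime") as f:
        return boot_time(f.read(), datetime.now(timezone.utc))
```

if `rebooted_since(...)` comes back false for a box, its uptime counter ran straight through the outage, which is what i saw on our main shared server.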
we have two separate circuits into the rack, which i do NOT believe are A/B redundant. at least that's not what we ordered, nor are we using them that way. however, i SUSPECT one circuit is on the A feed and the other on the B feed, but i don't know for sure. nor do i remember whether both down boxes are on the same circuit (DUH!).
is the problem that ***I'M*** seeing a routing issue, since i can't get a traceroute to either down box? even though all the machines in the rack are on the same class c net?
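one way to narrow it down, sketched in python below: confirm that the down boxes really share a /24 with a box that is up. if they do, and they sit on the same network segment, traffic between them normally does not pass through a router, so a ping from a working box in the rack should tell a dead machine apart from a routing problem further out. the addresses in the usage note are made-up documentation addresses, not ours, and the /24 mask is an assumption based on "class c".

```python
import ipaddress


def same_subnet(addr_a, addr_b, prefix_len=24):
    """Return True if addr_b falls inside the subnet that contains addr_a.

    prefix_len defaults to 24, i.e. a "class C" sized network (assumed here).
    """
    # strict=False lets us pass a host address and get its enclosing network.
    net_a = ipaddress.ip_network(f"{addr_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(addr_b) in net_a
```

for example, `same_subnet("192.0.2.10", "192.0.2.200")` is true, while `same_subnet("192.0.2.10", "198.51.100.10")` is false.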
perhaps someone smarter than me can enlighten me.
Good information... finally
now that is some good news. give us the news hard, give us the news fast.
i especially appreciate the fact that colo4 is accepting fault for several engineering issues, and i am sure that in the next several days a team will discover more small problems that can be fixed in order to prevent a similar occurrence.
one suggestion would be to host an emergency response site elsewhere... and to tweet every 15 or 30 minutes in order to take the pressure off those of us who are at the mercy of our client bases.
keep at it guys, don't panic and make things worse.