Riggs
11-07-2000, 02:15 AM
I'm not going to name the host (yet) because they still seem to be making some attempt to resolve things, but here's the situation so far:
- We've been with this certain host for 8 months. Our web site is huge and is the combined work of many talented people. Unfortunately this site also happens to be on a Cobalt RaQ (ugh)
- Anyway, one day we start getting 100s of emails saying that the site has been down. I'm the technical person for the site, and I'm the only one who telnets in and installs stuff/makes changes to the server. So I go to check it out... no web, no telnet... last time I even telnetted in was over 3 weeks ago, so this is definitely odd.
- I contact our host and ask them if there's a problem with our machine in particular, or if the entire block of servers we're on is temporarily offline. Support says that it's just our server, and for some reason it's not rebooting. Hmmm... again, definitely weird since I'm the only person who touches it for that kind of stuff, and I haven't done anything for weeks with it.
- After a few hours, the support dude tells me that a lot of the stuff in the /etc mount looks trashed, so he would try copying some files from a good RaQ over onto ours and that should fix the boot. That's fine, but I still wanted to know HOW this happened... only thing I could think of is someone hacked us, but all of the stuff in /home (where 99% of the server resides) was intact according to him. That sure was a relief, because even though we keep nightly backups, they are all in /home/backup.
- Anyway, I don't hear from him for an entire day. After a few more emails bugging him for a status report, he said that he still hasn't gotten the machine to reboot. He said he still had a few ideas though, and that he didn't want to resort to replacing the entire drive.
- Another day passes. We've been down for over 48 hours now. Support now tells me that they are just going to try loading a fresh hard drive and mounting the old one so I could copy the web site and all of our old files over. Not exactly the ideal situation, but better than nothing. I just wanted to get back up and running ASAP.
- Another day passes and I have no f***ing clue what they are doing. They said they were going to load a new drive, but that was it... didn't hear from them for over a day after that (despite numerous angry emails). I tried to telnet to our web server to see if the new hard drive was loaded, but no such luck =(
- Finally I start emailing a bunch of people at the hosting company to find out what's going on. FINALLY I noticed that telnet was answering again... only problem is that it's a brand new drive, and I couldn't find a mount to our old drive anywhere. It's absolutely critical that we get to our old drive... ALL of our stuff is on there.
- More emails... I get a reply here and there to the tune of "we're working on it" and stuff like that. I'm trying to be as cooperative with these guys as I can, but we've been down for 4 days at this point and losing a ton of money.
- Tonight I get an email from their main tech guy stating that "he hasn't been able to mount the old hard drive yet cause it's a Cobalt partition and acting weird, and he's running out of ideas". He also said that "we normally don't handle support for this sort of stuff, but he will continue working on it until he runs out of ideas". That's just f***ing great... real uplifting. At this point I'm panicking because this potentially means losing months and months of hard work that people have put into the site, all because of some accident that I still don't understand the cause of.
If things get worse, we will be ditching these guys and making sure that all of their major clients are aware that something like this could easily happen to THEM. I'm not going to give out who they are yet, because it's still possible the situation could be fixed.
Hopefully someone can offer advice, learn from this story, etc... I just needed to vent. I know their main tech guy reads/posts on this board too, so hopefully he will realize just how concerned we are about these problems... me emailing/calling him every day sure doesn't seem to be doing much good =(
