Web Hosting Talk

Tonight on


Riggs
11-07-2000, 02:15 AM
I'm not going to name the host (yet) because they seem like they are still making some attempt to resolve the situation, but here's the situation so far:

- We've been with this certain host for 8 months. Our web site is huge and is the combined work of many talented people. Unfortunately this site also happens to be on a Cobalt RaQ (ugh)

- Anyway, one day we start getting 100s of emails saying that the site has been down. I'm the technical person for the site, and I'm the only one who telnets in and installs stuff/makes changes to the server. So I go to check it out... no web, no telnet... the last time I even telnetted in was over 3 weeks ago, so this is definitely odd.

- I contact our host and ask them if there's a problem with our machine in particular, or if the entire block of servers we're on is temporarily offline. Support says that it's just our server, and for some reason it's not rebooting. Hmmm... again, definitely weird since I'm the only person who touches it for that kind of stuff, and I haven't done anything for weeks with it.

- After a few hours, the support dude tells me that a lot of the stuff in the /etc mount looks trashed, so he would try copying some files from a good RaQ over onto ours and that should fix the boot. That's fine, but I still wanted to know HOW this happened... only thing I could think of is someone hacked us, but all of the stuff in /home (where 99% of the server resides) was intact according to him. That sure was a relief, because even though we keep nightly backups, they are all in /home/backup.

- Anyway, I don't hear from him for an entire day. After a few more emails bugging him for a status report, he said that he still hasn't gotten the machine to reboot. He said he still had a few ideas though, and that he didn't want to resort to replacing the entire drive.

- Another day passes. We've been down for over 48 hours now. Support now tells me that they are just going to try loading a fresh hard drive and mounting the old one so I could copy the web site and all of our old files over. Not exactly the ideal situation, but better than nothing. I just wanted to get back up and running ASAP.

- Another day passes and I have no f***ing clue what they are doing. They said they were going to load a new drive, but that was it... didn't hear from them for over a day after that (despite numerous angry emails). I tried telnet to our web server to see if the new hard drive was loaded, with no such luck =(

- Finally I start emailing a bunch of people at the hosting company to find out what's going on. FINALLY I noticed that a telnet session popped up... only problem is that it's a brand new drive, and I couldn't find a mount to our old drive anywhere. It's absolutely critical that we get to our old drive... ALL of our stuff is on there

- More emails... I get a reply here and there to the tune of "we're working on it" and stuff like that. I'm trying to be as cooperative with these guys as I can, but we've been down for 4 days at this point and losing a ton of money.

- Tonight I get an email from their main tech guy stating that "he hasn't been able to mount the old hard drive yet cause it's a Cobalt partition and acting weird, and he's running out of ideas". He also said that "we normally don't handle support for this sort of stuff, but he will continue working on it until he runs out of ideas". That's just f***ing great... real uplifting. At this point I'm panicking, because this potentially means losing months and months of hard work that people have put into the site, all because of some accident that I don't even understand how it happened in the first place.

If things get worse, we will be ditching these guys and making sure that all of their major clients are aware that something like this could easily happen to THEM. I'm not going to give out who they are yet, because it's still possible the situation could be fixed.

Hopefully someone can offer advice, learn from this story, etc... I just needed to vent. I know their main tech guy reads/posts on this board too, so hopefully he will realize just how concerned we are about these problems... me emailing/calling him every day sure doesn't seem to be doing much good =(

Félix C.Courtemanche
11-07-2000, 11:54 AM
The advice is pray.

I know for a fact that the Cobalts DO act weird ALL the damn time. If something goes wrong, you're nowhere near the end of it. I don't personally believe that the tech is at fault here, and neither are you. Cobalt is, along with its nice old RaQs (the newer ones are a bit better).

The real advice in this case would have been... backup! Often, frequently, incremental and full.

A server is a production environment, if it crashes for some hardware reason (HD in your case) you can't rely on some magic... you need a backup and re-install ASAP.
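To make the "incremental and full" advice concrete, here is a minimal sketch in Python, not anything the host or the poster actually ran. The function names and the idea of picking changed files by modification time are assumptions for illustration only. The one point that matters for this thread is that the archive path must sit outside the directory being backed up, and preferably on another machine, unlike nightly backups kept in /home/backup on the same drive.

```python
import os
import tarfile


def full_backup(source_dir, archive_path):
    """Write everything under source_dir into a gzip-compressed tar archive.

    archive_path is assumed to be OUTSIDE source_dir (ideally off the box),
    otherwise the archive would try to include itself.
    """
    top_name = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top_name)
    return archive_path


def incremental_backup(source_dir, archive_path, since_epoch):
    """Archive only files modified after since_epoch (seconds since 1970).

    Returns the relative paths that were added, so the caller can log them.
    """
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                if os.path.getmtime(path) > since_epoch:
                    rel = os.path.relpath(path, source_dir)
                    tar.add(path, arcname=rel)
                    added.append(rel)
    return added
```

A typical schedule would be one full archive per week plus a nightly incremental using the time of the previous run as since_epoch, with both copied to a second machine.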

Félix C.Courtemanche
11-07-2000, 11:57 AM
Oh... by the way, threatening them will not help your case. If I had to face a client saying that he will 'ditch' us or publicise his problem worldwide, I would:

- ask him where he is posting it, so I can set the record straight everywhere. It's not their fault, it's a server problem.
- most likely stop trying to resolve the situation, if I was doing it as a favor and didn't 'have' to do it.

Basically, forcing someone will simply generate the opposite effect and you will be left with nothing.

Of course, that is how I would react, but I'm pretty peaceful... some other people are much MUCH more aggressive and would... shut down the server, cancel the contract (did you read the fine print?), etc.

Riggs
11-07-2000, 12:18 PM
Thanks for the reply.

The fact that the RaQ crashed is not why I am angry. If I don't have off-site backups, then that is my fault. Only reason I posted technical details is in case someone had similar experiences and knew how we could salvage the site from the old disk.

What I don't like is how this company handles their support... they've kept me guessing ever since day 1 of this problem. They always make it seem like they are very close to "fixing" the problem. At first when we just thought it was a small problem, they said that they would just restore the /etc directory and then it would be fixed. Well, I sat around twiddling my thumbs, the server never popped back up, and the people helping me disappeared. Then they made it sound like the old drive was perfectly fine and all they had to do was copy the old data onto a new one, but now they're sounding like that's impossible to do. No matter what, I never get straight-up answers (and I have to constantly email them just to get an answer, which I DO hate doing)

You're right though... it could come off the wrong way in a public forum

Félix C.Courtemanche
11-07-2000, 12:57 PM
OK, for a solution, you will need to re-mount the old drive (read-only is better). Forget about any partition you don't need and only remount /home, since that is where your files are.

Then you should be able to transfer everything if it was not corrupted as well. I doubt there are such things as 'impossible to mount because it's weird' but most likely 'I don't know how to do it on a raq'

Well, since they can't do that, ask them to mount it on a crap server somewhere so that you can access the files, transfer them, then have everything fixed. If they can't do that......... ask them to mail the HD to you so you can do it yourself or find someone to do it. It must be done on Linux, btw.

I hope this helps... even though I know it's not easy to do.
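As a rough illustration of the read-only remount described above, the sketch below only builds the commands a tech might run on a spare Linux box; it does not execute anything. The device name /dev/hdb4, the mount point /mnt/olddrive, the destination /home/recovered and the ext2 filesystem type are all assumptions, since the thread never says which partition holds /home or how the drive is formatted.

```python
import shlex


def readonly_mount_command(device, mount_point, fs_type="ext2"):
    """Build a mount command that attaches the old partition read-only.

    Mounting with -o ro keeps the rescue attempt from writing to a disk
    that may already be damaged.
    """
    return ["mount", "-t", fs_type, "-o", "ro", device, mount_point]


def copy_command(mount_point, destination):
    """Build a cp command that copies the mounted tree, preserving attributes."""
    return ["cp", "-a", mount_point.rstrip("/") + "/.", destination]


def as_shell(argv):
    """Render an argument list as a single, safely quoted shell line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical device and paths; check the real partition table first.
    print(as_shell(readonly_mount_command("/dev/hdb4", "/mnt/olddrive")))
    print(as_shell(copy_command("/mnt/olddrive", "/home/recovered")))
```

Printing the commands first, rather than running them, leaves room to confirm the right partition before anything touches the old disk.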