Quote:
Originally posted by smarq
Actually the old disk is online and accessible. Shortly
after posting this thread yesterday I was able to
zip up all our data on the old drive and copy it to
a different host. But maybe I got lucky and the part
of the disk where my data is was not damaged.
The damage reports are actually in /var, which implied that a simple forward swap would quickly resolve the problem. We've done dozens and dozens of these in the past. Unfortunately, the drive is responding *extremely* slowly to all accesses.
Quote:
But nevertheless, the disk is mounted and online
and is not in as bad a shape as you described.
A simple copy command could restore most
of the data. I noticed on Friday that someone
was doing this account by account. But it stopped
or has gotten very slow since Friday afternoon. You
should try to get all the unaffected accounts transferred
before working on the problem ones. If you
had done it in that order, more unaffected accounts
like mine would be online now.
Actually, the best procedure would normally be to symlink to all of the old accounts and restore them one by one. With the drive running so slowly, we instead copied accounts over before bringing them online, in order to reduce drive load. Today, I've discovered that drive load makes little difference, so as of a few minutes ago, all accounts are online again, via symlinks. The resulting increase in disk load on the old drive does not seem to be slowing down the copying of accounts to the new drive. A new and improved procedure is being used as well, which should complete more quickly. Nonetheless, since this server has over 20GB of user data on it, it will still take until at least sometime tonight to complete the transfer.
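For anyone curious what "symlink everything, then restore one by one" looks like in practice, here is a minimal sketch in Python. It is only an illustration of the general idea, not our actual tooling: the function names, the directory layout, and the `.copying` staging suffix are all made up for the example.

```python
import shutil
from pathlib import Path


def link_accounts(old_root: Path, new_root: Path) -> list[str]:
    """Bring every account online by symlinking it from the old drive."""
    linked = []
    for account in sorted(p for p in old_root.iterdir() if p.is_dir()):
        target = new_root / account.name
        # Skip anything already present on the new drive (linked or restored).
        if not target.exists() and not target.is_symlink():
            target.symlink_to(account, target_is_directory=True)
            linked.append(account.name)
    return linked


def migrate_account(old_root: Path, new_root: Path, name: str) -> bool:
    """Copy one account to the new drive, then swap it in for its symlink."""
    link = new_root / name
    if not link.is_symlink():
        return False  # already restored, or never linked
    staging = new_root / f".{name}.copying"  # hypothetical staging name
    if staging.exists():
        shutil.rmtree(staging)  # discard a partial copy from an earlier run
    shutil.copytree(old_root / name, staging, symlinks=True)
    link.unlink()         # remove the symlink ...
    staging.rename(link)  # ... and put the real copy in its place
    return True
```

The account stays reachable through the symlink for the whole copy, and is only missing for the instant between the unlink and the rename. Calling `migrate_account` again on a restored account is a no-op, so an interrupted run can simply be restarted.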
Quote:
1. Will you be giving the users any kind of credit
for this downtime, which is into its 3rd day now?
Credits will be addressed on a case-by-case basis, or possibly as part of a general post-mortem analysis. Please watch our System Notices pages for details.
Quote:
2. It would be better if the home pages of the
sites that are affected mapped to a page that
says the site is down due to technical problems
rather than going to the home page of pair.com.
That way it does not confuse the site visitors
and so there will be less support mail for the
site webmaster.
What you've probably been seeing is the default page from http://www10.pair.com/ or a 404 page. It's not intended to steal user traffic, and with accounts online again now, it should no longer be an issue. Otherwise, agreed and noted for future reference.
Quote:
3. Your phone support needs to be re-evaluated.
It is way under par.
Our phone support, which is barely over a year old now, is an enhancement which for many years we did not provide. We do of course have 24x7 coverage and prompt response from urgent@pair.com - something we've offered for 6.5 years. A fully staffed call center - staffed with experienced techs, I might add - is a cost which has to be carefully managed.
I might add that urgent@pair.com can be reached from anywhere in the world, even better than an online form. Of course, we do use a custom ticketing system, as you can see from the Account Control Center.
Quote:
4. You should set up your servers with a 2nd backup
disk that is a mirror of the main disk so that
such a long downtime does not occur in the future.
Everyone knows crap happens, but that is
why we pay you to take preventive measures (like
proactively swapping out old disks before they have
problems) and keep hot backups so that when
crap happens the downtime is minimized.
As I mentioned before in this thread, we would love to have something like this running, but there are serious complications which I can't discuss at length right now. It's something we've investigated repeatedly, both in hardware and software, and although we will work on it again with FreeBSD 4.6-STABLE, right now our focus will be on improving restoration times (which covers many more failure modes).
Quote:
A serious problem like a disk failure provides a
real test of a hosting company. A hosting company
should well expect that a disk on one of the servers
is going to get blown away sooner or later and
be totally prepared to restore it from a hot backup.
As you can see from the years and years of System Notices we've posted, we have gone through dozens and dozens of drive failures - and documented every one of them publicly. This particular failure mode has happened perhaps three times, twice recently, and we are adapting to that.
Quote:
Especially on a shared server where so many sites
are affected. 3 days (so far) for a restore is totally
unacceptable (for a good hosting company). Aside
from the messages posted on the System Notice
board I have not seen any actual progress since
Friday whenever I log in and look at the new disk to
see how many accounts have been restored.
I certainly agree that it is unacceptable. I think you will now find that an improved procedure is underway, that customer sites are back online, and that progress is being made at a more measurable pace (we're working in /usr/www presently if you happen to check).
Quote:
Good thing my account is not an ecommerce site
where each hour of downtime could mean loss
of sales.
Of course, no matter how great your host is or isn't, you should always have your own backups and your own disaster plan. We strive to maintain customer sites and data, and we still advise that.
Again, please watch System Notices for further updates. I'll gladly discuss this here, but I don't want to turn WHT into the "pair support forum". It wouldn't be fair to crowd out all the discussion of Cyberwings and Rackshack, would it?
Thanks,
Kevin