
pair.com horror story
smarq 08-03-2002, 07:52 PM On Friday at 7 am I noticed that my shared web hosting
account at pair.com looked
like it was down. I was able to telnet into the server
but my home directory was gone. I sent an email
to support@pair.com and waited for 2 hours; no
reply. I wanted to call them but could not find
their phone number on the web site. I looked
them up through switchboard.com and called them
even though it was not a toll-free number. When
I finally got through to a support person she said they
were swapping the disk and it should be done
in about 10 minutes. Waited for another
couple of hours and the site was still down.
The home page of our site was now showing
the home page of pair.com; really nice way
to advertise themselves huh.
I clicked on the Support -> System Notices link
from the home page and found some postings
that pi.pair.com had some disk warnings and
they were transferring the data to another disk.
So I left them alone and figured it would be
done in a few hours at worst. By 4pm Friday
it was still not done.
On Saturday morning when I saw that the site
was still down and the same message from
Friday posted there I called again. But I got
a message that phone support is provided
only M-F 9am to 5pm. How convenient.
Now it's Saturday evening and my site along
with 100's of others is still down. They posted
another System Notice a little while ago that
the disk is still not restored.
I can't imagine that fixing the disk is taking
2 days. Why not just restore from backup;
unless you didn't keep a backup.
I guess all I can do now is wait and hope
that they don't lose any of my data. They
don't have any refund policy for down time
so I won't even get any credit for the
site being down all this time. You are
probably wondering why I'm paying $30
per month for hosting at such a great
hosting company. Well I got the account
a long time back and have not had the
time to get it moved. But I'm definitely
going to move it now.
Watch the system notices from the pair.com
support page and see how this unfolds.
What would you do in this case?
Scott
avara 08-03-2002, 08:08 PM Ouch. I really don't have any advice, but if I were you I would expect them to credit you at least a month of free hosting, even if they don't have an official refund policy.
This is actually the first horror story I've heard about pair.com.
hostfreak123 08-03-2002, 08:14 PM They seem to be a pretty good company tho...
UmBillyCord 08-03-2002, 08:28 PM Originally posted by hostfreak123
They seem to be a pretty good company tho...
Pair. Not Pear.
Pair is excellent. Very odd that it takes two days for a restore. Must have had some serious issue.
hostfreak123 08-03-2002, 08:34 PM Yes, UmBillyCord is right. They have been in business for the last 7 years... Hope everything works out fine and your website is back online.
mwatkins 08-03-2002, 08:36 PM Having been a client of Pair.com, and a happy one at that, my first reaction is "crap happens", and at least with Pair.com you know that there is a capable organization there to deal with "crap".
After all, how many web hosts have *all* their system outages, planned and unplanned, and outcomes, posted in near real time and available to the public?
http://pair.com/pair/support/notices/
If you review the histories available you can see that the vast majority of issues are resolved in minutes.
As for your site's server, it does appear that they were working on the issue throughout this time period.
My final thought is that the worst can happen even with the best of providers. On the positive side, at least with Pair this event doesn't mean they go out of business (and with that your site and data). Some webhosters carrying on business would be destroyed by such an event, and you can find evidence of this right here on WHT.
[Aug 3, 2002, 12:58 PM] pi Status Update
The drive swap on pi is still in progress. Unfortunately, due to the condition of the original drive, file restoration is progressing slower than expected. Because of this, we are not able to provide an accurate estimate of when the drive swap will be completed. Web services will be restored as the files are restored. Mail sent to any accounts on pi will remain in the queue until the mailboxes and rules have been properly restored. We thank you for your continuing patience and understanding in this matter.
[Aug 2, 2002, 6:25 PM] pi Update
The swap on pi continues, slowly. Some accounts have only been partially restored, which is not cause for alarm. All accounts will be fully restored once work has been completed. We are not sure how long this swap could take, it may yet be several hours. However, our staff will be monitoring its progress until completion.
[Aug 2, 2002, 1:53 PM] pi Status
The drive swap is continuing on pi. As this is an older server with many users with extensive file systems, this has been taking longer than usual. At this stage, over half of user files have been recovered, and the process is accelerating. We hope to have all files recovered within the next couple of hours.
[Aug 2, 2002, 6:01 AM] pi Emergency Maintenance
pi has web services starting to be run again, some domains might still be calling up the pair default pages while the directories are still being copied over. If you find your account empty still, it'll be repaired soon.
[Aug 2, 2002, 2:00 AM] pi Emergency Maintenance
The main drive for pi was so badly damaged, it was unable to perform any routine diagnostics/fixing; we're still able to get at most of the data, but we have to do this manually with the old drive on secondary; so pi will effectively be down for a while yet.
[Aug 2, 2002, 1:47 AM] pi Emergency Maintenance
We have discovered a warning condition on the hard drive of pi (www10). We will be starting a drive swap at this time. There should be one additional outage to complete the swap, which should be no longer than 10 minutes.
Gem Hexen 08-03-2002, 09:16 PM Well, at least if they do as they promise you won't lose any data!
FloHost 08-03-2002, 11:59 PM I would stay with them, look how often they are updating people on this matter. They seem to be doing all they can.
VoxKeysGtr 08-04-2002, 12:50 AM I was with pair when I first started, and they were great. Their support and reliability were superb. I didn't stay with them because they charged extra for hosting additional domains, but I was completely satisfied with my experience with them. I'm sure they'll get it all worked out. Good companies do, and pair is one of the good ones. :)
sigma 08-04-2002, 07:40 AM I think our System Notices try to make it plain that this is a lousy situation. One of our oldest servers has a badly dying drive; some users are being restored from backups, and others are being slowly, painfully copied from the dying drive. Neither is a fast operation, because older servers have very large accounts on them. Some people have as much as 1GB of data in each account - copying that from a drive with errors, or restoring it from backup archives, is not fast :(
Having said that, I believe that nearly everything has been restored so far. Also, I am working right now on improvements to our custom backup system which could reduce the impact of these kinds of problems. As you can see from our System Notices, often a drive failure leads to a live swap, during which nearly all data remains online. Unfortunately, there are less pleasant failure modes, and this one (drive refuses to boot, spends a lot of time giving I/O errors, which are slow) is just slightly better than a complete meltdown which would require a time-consuming full restore.
As always, we invite discussion on our news server news.pair.com, and my mailbox is always open to discuss problems and improvements. We sincerely apologize for this problem, and thank you for your business, even though unfortunately it sounds like we won't be keeping it.
Thanks,
Kevin
smarq 08-04-2002, 09:57 AM Hi Kevin,
Thanks for the reply.
Actually the old disk is online and accessible. Shortly
after posting this thread yesterday I was able to
zip up all our data on the old drive and copy it to
a different host. But maybe I got lucky and the part
of the disk where my data is was not damaged.
But nevertheless, the disk is mounted and online
and is not in as bad a shape as you described.
A simple copy command could restore most
of the data. I noticed on Friday that someone
was doing this account by account. But it stopped
or has gotten very slow since Friday afternoon. You
should try to get all the unaffected accounts transferred
before working on the problem ones. If you
had done it in that order more unaffected accounts
like mine would be online now.
You are posting nice messages but I don't see
any progress when I log in and check how many
of the accounts have been restored. I can't
imagine you would be hung up for days.
I have a couple of questions/requests:
1. Will you be giving the users any kind of credit
for this downtime, which is into its 3rd day now?
2. It would be better if the home pages of the
sites that are affected mapped to a page that
says the site is down due to technical problems
rather than going to the home page of pair.com.
That way it does not confuse the site visitors
and so there will be less support mail for the
site webmaster.
3. Your phone support needs to be re-evaluated.
It is way under par.
a. You don't provide a toll-free number. Almost all
other major hosting companies do.
b. When I did call your non-toll-free number
it said that phone support is only provided
M-F during business hours. A lot of the good
hosting companies now provide 24x7 phone
support.
My recommendation would be to use an online
trouble ticket system (avoids email problems when
the domains are down like now) that is manned 24x7
along with an emergency pager number.
4. You should set up your servers with a 2nd backup
disk that is a mirror of the main disk so that
such a long down time does not occur in the future.
Everyone knows crap happens, but that is
why we pay you to take preventive measures (like
proactively swapping out old disks before they have
problems) and keep hot backups so that when
crap happens the down time is minimized.
A serious problem like a disk failure provides a
real test of a hosting company. A hosting company
should well expect that a disk on one of the servers
is going to get blown away sooner or later and
be totally prepared to restore it from a hot backup.
Especially on a shared server where so many sites
are affected. 3 days (so far) for a restore is totally
unacceptable (for a good hosting company). Aside
from the messages posted on the System Notice
board I have not seen any actual progress since
Friday whenever I log in and look at the new disk to
see how many accounts have been restored.
Good thing my account is not an ecommerce site
where each hour of down time could mean loss
of sales.
Scott
sigma 08-04-2002, 10:13 AM Originally posted by smarq
Actually the old disk is online and accessible. Shortly
after posting this thread yesterday I was able to
zip up all our data on the old drive and copy it to
a different host. But maybe I got lucky and the part
of the disk where my data is was not damaged.
The damage reports are actually in /var, which implied that a simple forward swap would quickly resolve the problem. We've done dozens and dozens of these in the past. Unfortunately, the drive is responding *extremely* slowly to all accesses.
But nevertheless, the disk is mounted and online
and is not in as bad a shape as you described.
A simple copy command could restore most
of the data. I noticed on Friday that someone
was doing this account by account. But it stopped
or has gotten very slow since Friday afternoon. You
should try to get all the unaffected accounts transferred
before working on the problem ones. If you
had done it in that order more unaffected accounts
like mine would be online now.
Actually, the best procedure would normally be to symlink to all of the old accounts, and restore them one-by-one. With the drive running so slowly, we took this approach instead in order to reduce drive load. Today, I've discovered that drive load makes little difference, so as of a few minutes ago, all accounts are online again, via symlinks. The resulting increase in disk load on the old drive does not seem to be slowing down the procedure of copying accounts to the new drive. A new and improved procedure is being used as well, which should complete more quickly. Nonetheless, since this server has over 20GB of user data on it, it will still take until at least sometime tonight to complete the transfer.
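As a rough illustration of the symlink-then-copy procedure described above (a sketch only, not pair's actual tooling): every account is first made reachable on the new drive through a symlink to the old drive, and then accounts are copied over one at a time, each symlink being swapped for the real copy once it is complete. The function names `link_accounts` and `migrate_account`, the `.incoming` staging suffix, and the one-directory-per-account layout are all assumptions made for the example.

```python
import os
import shutil


def link_accounts(old_root, new_root):
    """Bring every account online at once: new_root/<name> -> old_root/<name>."""
    linked = []
    for name in sorted(os.listdir(old_root)):
        src = os.path.join(old_root, name)
        dst = os.path.join(new_root, name)
        # Only link real account directories that are not already present.
        if os.path.isdir(src) and not os.path.lexists(dst):
            os.symlink(src, dst)
            linked.append(name)
    return linked


def migrate_account(old_root, new_root, name):
    """Copy one account to the new drive, then replace its symlink with the copy.

    Returns False if the account is not currently a symlink (for example,
    because it has already been migrated), True after a successful migration.
    """
    dst = os.path.join(new_root, name)
    if not os.path.islink(dst):
        return False
    staging = dst + ".incoming"  # hypothetical staging name for the copy in progress
    shutil.copytree(os.path.join(old_root, name), staging, symlinks=True)
    os.unlink(dst)            # drop the symlink to the old drive...
    os.rename(staging, dst)   # ...and put the finished copy in its place
    return True
```

Until `migrate_account` finishes for a given account, the site keeps being served from the old drive through the symlink, so the slow copy does not keep it offline. There is a very short window between the `unlink` and the `rename` where the account path does not exist; a real procedure would need to decide whether that matters.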
1. Will you be giving the users any kind of credit
for this downtime, which is into its 3rd day now?
These will be addressed on a case-by-case basis or possibly a general post-mortem analysis. Please watch our System Notices pages for details.
2. It would be better if the home pages of the
sites that are affected mapped to a page that
says the site is down due to technical problems
rather than going to the home page of pair.com.
That way it does not confuse the site visitors
and so there will be less support mail for the
site webmaster.
What you've probably been seeing is the default page from http://www10.pair.com/ or a 404 page. It's not intended to steal user traffic, and with accounts online again now, it should no longer be an issue. Otherwise, agreed and noted for future reference.
3. Your phone support needs to be re-evaluated.
It is way under par.
Our phone support, which is barely over a year old now, is an enhancement which for many years we did not provide. We do of course have 24x7 coverage and prompt response from urgent@pair.com - something we've offered for 6.5 years. A fully-staffed callcenter - staffed with experienced techs, I might add - is a cost which has to be carefully managed.
I might add that urgent@pair.com can be reached from anywhere in the world, even better than an online form. Of course, we do use a custom ticketing system, as you can see from the Account Control Center.
4. You should set up your servers with a 2nd backup
disk that is a mirror of the main disk so that
such a long down time does not occur in the future.
Everyone knows crap happens, but that is
why we pay you to take preventive measures (like
proactively swapping out old disks before they have
problems) and keep hot backups so that when
crap happens the down time is minimized.
As I mentioned before in this thread, we would love to have something like this running, but there are serious complications which I can't discuss at length right now. It's something we've investigated repeatedly, both in hardware and software, and although we will work on it again with FreeBSD 4.6-STABLE, right now our focus will be on improving restoration times (which covers many more failure modes).
A serious problem like a disk failure provides a
real test of a hosting company. A hosting company
should well expect that a disk on one of the servers
is going to get blown away sooner or later and
be totally prepared to restore it from a hot backup.
As you can see from the years and years of System Notices we've posted, we have gone through dozens and dozens of drive failures - and documented every one of them publicly. This particular failure mode has happened perhaps three times, twice recently, and we are adapting to that.
Especially on a shared server where so many sites
are affected. 3 days (so far) for a restore is totally
unacceptable (for a good hosting company). Aside
from the messages posted on the System Notice
board I have not seen any actual progress since
Friday whenever I log in and look at the new disk to
see how many accounts have been restored.
I certainly agree that it is unacceptable. I think you will now find that an improved procedure is underway, that customer sites are back online, and that progress is being made at a more measurable pace (we're working in /usr/www presently if you happen to check).
Good thing my account is not an ecommerce site
where each hour of down time could mean loss
of sales.
Of course, no matter how great your host is or isn't, you should always have your own backups and your own disaster plan. We strive to maintain customer sites and data, and we still advise that.
Again, please watch System Notices for further updates. I'll gladly discuss here, but I don't want to turn WHT into the "pair support forum". It wouldn't be fair to crowd out all the discussion of Cyberwings and Rackshack, would it?
Thanks,
Kevin
the-muse 08-04-2002, 12:21 PM Originally posted by sigma
Again, please watch System Notices for further updates. I'll gladly discuss here, but I don't want to turn WHT into the "pair support forum". It wouldn't be fair to crowd out all the discussion of Cyberwings and Rackshack, would it?
That kind of humor under your present circumstances earns my respect.
Not that everyone can afford this option, but I have all my own personal accounts, and those of my hosting clients - I am a reseller - backed up by a second hosting company. If one goes down, the sites point to the nameservers at the other one. It digs a little bit more into my profits, but does wonders for my peace of mind. :dunce:
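For anyone who wants the cheaper version of "keep your own copy somewhere else" that comes up in this thread, here is a minimal sketch of archiving a site directory so it can be copied to a second host. It assumes the site lives in a single directory; the names `archive_site` and `list_archive` and the label format are made up for the example, and getting the archive onto the other host (scp, ftp, whatever you use) is left out.

```python
import os
import tarfile


def archive_site(site_dir, dest_dir, label):
    """Write dest_dir/<label>.tar.gz containing site_dir and return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    archive_path = os.path.join(dest_dir, label + ".tar.gz")
    # Store paths relative to the site directory's own name, not the full path.
    top_level = os.path.basename(os.path.normpath(site_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(site_dir, arcname=top_level)
    return archive_path


def list_archive(archive_path):
    """Return the sorted names of regular files in the archive as a sanity check."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Listing the archive after writing it is only a quick check that the files made it in; it does not replace actually trying a restore on the other host once in a while.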
Annette 08-04-2002, 06:31 PM Originally posted by sigma
Again, please watch System Notices for further updates. I'll gladly discuss here, but I don't want to turn WHT into the "pair support forum". It wouldn't be fair to crowd out all the discussion of Cyberwings and Rackshack, would it?
Thanks,
Kevin
Now *that* is funny. Good on you, Kevin.