  1. #1

    Amazon EC2, Bitter Experience

    Hi,

    One of our clients has two Amazon instances, both Windows Server 2003, running in the EU region. Recently they had a hardware failure and the site was down for almost 5 days. Amazon won't update you, even when it is a hardware failure on Amazon's side, and to get a response from them at all we would need to purchase Premium Support, which is frankly absurd. Ticket response times were very poor. They offer 99.95% uptime, but per the SLA, hardware failures on individual instances are not covered. Has anybody had a similar experience?
    Ideamine Technologies
    http://www.ideaminetech.com, sales(a)ideaminetech.com
    Server Management|Outsourced Support|Web Development|Mobile Applications
    Skype: servernix|GTalk: servernix|AIM: servernix
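A quick way to put the 99.95% figure in perspective is to turn it into a downtime budget. This is a minimal sketch that assumes the percentage is measured over a plain 365-day year (or, for comparison, a 30-day month); it does not model how Amazon's SLA actually defines the measurement period or what counts as "unavailable", which should be checked against the SLA text itself.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime permitted over `period_hours` at a given uptime %."""
    return period_hours * (1 - uptime_percent / 100)


if __name__ == "__main__":
    yearly = allowed_downtime_hours(99.95)
    monthly = allowed_downtime_hours(99.95, period_hours=30 * 24)
    print(f"99.95% over a year:  {yearly:.2f} h")
    print(f"99.95% over 30 days: {monthly * 60:.1f} min")
    outage_hours = 5 * 24  # the ~5-day outage described in this post
    print(f"A {outage_hours} h outage is {outage_hours / yearly:.0f}x the yearly budget")
```

At 99.95% that works out to about 4.4 hours a year, or roughly 22 minutes in a 30-day month, so a 5-day outage is well over 20 times the yearly budget. That gap is why the SLA's scope (what counts as "down") matters more than the headline percentage.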

  2. #2
    Join Date
    Apr 2002
    Location
    Seattle, WA
    Posts
    947
    I have had issues with EBS, and of course there was the big failure earlier this year. I'm not a fan; there are much better alternatives out there. Amazon has grown too big for its britches, so to speak.
    I <3 Linux Clusters

  3. #3
    Hi Brandon,

    And this was Amazon's reply, after 6 hours:

    "I have taken a look at i-88f75bff and see that the underlying hardware is in a degraded state. As this instance is not an EBS-backed instance, you will need to launch a replacement instance from your most recent snapshot and terminate this instance. Unfortunately this can happen in an AWS environment, just as unforeseen hardware issues can occur in any physical environment. While we cannot give too many details about our architecture, degraded hardware essentially means that the underlying hardware, aka the host, may have some type of hardware issue, or other pending issue that is causing issues with your instance running on this host. Just as in any physical environment, such as a traditional datacenter, hardware failure is expected. The benefit to utilizing AWS is that we provide a scalable architecture that allows you to build redundant, fault tolerant systems. That said, while we provide the infrastructure, it is still up to our customers to architect their environment with these best practices in mind. "

  4. #4
    Join Date
    Jan 2011
    Location
    Canada
    Posts
    934
    As this instance is not an EBS-backed instance...

    That pretty much sums it up. Understand that when running in AWS without EBS backing, you accept that the VM can, and likely will, end up on a different host whenever the instance stops for any reason, whether that is a node failure like the one you encountered or even just hitting stop.

    This also means you lose access to ALL the data that was on the old local disk.

    I wouldn't say they're doing it wrong, but they are doing it differently from other providers, so care and planning do need to be involved when placing stuff on EC2 instances.

    If it is beyond your technical expertise to set up fault tolerance within AWS, you might want to consider Rackspace or a SAN-backed cloud provider. Those behave most like a traditional server.
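The practical question raised in this post is which of your instances are instance-store-backed, i.e. lose their local disk when the host fails. A minimal sketch, assuming you already have the response of the boto3 EC2 client's `describe_instances()` call as a dictionary (each instance carries an `InstanceId` and a `RootDeviceType` of `ebs` or `instance-store`). The sample data reuses the instance ID from Amazon's reply above; the second instance is hypothetical.

```python
from typing import Any, Dict, List


def instance_store_instances(describe_response: Dict[str, Any]) -> List[str]:
    """Return the IDs of instances whose root device is instance store.

    `describe_response` is assumed to be shaped like the output of boto3's
    EC2 `describe_instances()`: Reservations -> Instances, where each
    instance has an 'InstanceId' and a 'RootDeviceType'.
    """
    exposed = []
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # 'instance-store' roots live on the host's local disk and do not
            # survive a host failure; 'ebs' roots persist independently.
            if instance.get("RootDeviceType") == "instance-store":
                exposed.append(instance["InstanceId"])
    return exposed


if __name__ == "__main__":
    sample = {
        "Reservations": [
            {"Instances": [
                {"InstanceId": "i-88f75bff", "RootDeviceType": "instance-store"},
                # Hypothetical second instance for comparison.
                {"InstanceId": "i-00000001", "RootDeviceType": "ebs"},
            ]}
        ]
    }
    print(instance_store_instances(sample))
```

Note that an `ebs` root device does not mean every volume is EBS: instance-store volumes attached to an EBS-backed instance are still lost along with the host.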

  5. #5
    We have recently won several clients who had long outages with Amazon. In a way I guess we should be grateful, but the bigger picture is that they are the biggest player in the market, yet they are starting to give cloud a bad name. The whole reason organisations are moving to the cloud is to take advantage of the increased reliability, so it's important for the industry that things are done properly.

    If you want reliability, you need to look for cloud providers that publish the specifications and configuration of their platform and the service levels they have achieved. Unfortunately, most providers don't publish this level of information.
    ██ Enterprise Class Cloud Hosting And Disaster Recovery. SAN Replication.
    ██ VMware Hosting on HP Blades With NetApp or EqualLogic SAN Storage. 100% Guaranteed Uptime.
    ██ Build Your Own Virtual DataCentre In The Cloud. Fully Integrated With vCenter.
    ██ StratoGen Are An Authorised VMware Partner | StratoGen.net

  6. #6
    Join Date
    Mar 2010
    Location
    Germany
    Posts
    653
    Quote Originally Posted by Stratogen
    We have recently won several clients who had long outages with Amazon. In a way I guess we should be grateful, but the bigger picture is that they are the biggest player in the market, yet they are starting to give cloud a bad name. The whole reason organisations are moving to the cloud is to take advantage of the increased reliability, so it's important for the industry that things are done properly.

    If you want reliability, you need to look for cloud providers that publish the specifications and configuration of their platform and the service levels they have achieved. Unfortunately, most providers don't publish this level of information.
    I don't think they're giving cloud a bad name. Coming from an HA environment where not a single data loss could be accepted in 20 years, I can say that cloud had *exactly* this bad reputation with the people there, even before the failures happened. We discussed what would have to change to move cloud environments forward, and so far none of the big providers are tackling those issues.
    For example, the "SAN" storage most cloud providers are so proud of using is what we used for LAB systems!

    And in general, when you need good IT hosting and aren't prepared to run truly *instanced* applications, you should go to someone specialized in HA hosting, e.g. the guy I'm replying to, hehe.


    Amazon is just an online store! They have done a very good job on their cloud platform, but it's not intended or fit for running mission-critical apps. (And no, reddit and the other sites that failed are not "mission critical", hehe.)

    If someone has the budget to build their app around the cloud, and it's designed well enough to run on multiple different EC2-compatible clouds, then I'm quite sure that site will exceed anything normal IT shops can do. But not by putting a few Windows servers on plain Amazon EC2 instances (not S3, not EBS, just local disk... oh well).
    Check out my SSD guides for Samsung, HGST (Hitachi Global Storage) and Intel!

  7. #7
    Quote Originally Posted by wartungsfenster
    I don't think they're giving cloud a bad name. Coming from an HA environment where not a single data loss could be accepted in 20 years, I can say that cloud had *exactly* this bad reputation with the people there, even before the failures happened.
    Do you really think cloud has a bad name? The cloud is dominating hosting in 2011, so clearly a lot of people have bought into the idea, and there are real business benefits.

    I guess the issue here is not about cloud per se, but the fact that a hosting provider like Amazon can have a 5-day outage. Any company with that amount of downtime is going to find it hard to win new customers.

  8. #8
    Join Date
    Dec 2001
    Location
    Atlanta
    Posts
    4,419
    Quote Originally Posted by Stratogen
    Do you really think cloud has a bad name? The cloud is dominating hosting in 2011, so clearly a lot of people have bought into the idea, and there are real business benefits.

    I guess the issue here is not about cloud per se, but the fact that a hosting provider like Amazon can have a 5-day outage. Any company with that amount of downtime is going to find it hard to win new customers.
    I would agree with this. But I would add that clients need to understand what they need and what they are getting from each service out there. Offerings like yours and mine are not comparable to Amazon's, but maybe not everyone needs them either. As long as people get what they need, they should be happy.


    If you need something more stable and secure for your business, then you may not want to look at Amazon unless you can tolerate the occasional issue. If it's just cheap, hourly usage and you can afford to be down from time to time, then it's probably OK.
    Dedicated Servers
    WWW.NETDEPOT.COM
    Since 2000

  9. #9
    Join Date
    Jan 2011
    Location
    Canada
    Posts
    934
    Quote Originally Posted by Stratogen
    I guess the issue here is not about cloud per se, but the fact that a hosting provider like Amazon can have a 5-day outage. Any company with that amount of downtime is going to find it hard to win new customers.
    For the record, the outage affected only a single region (US-East). Specifically, only one availability zone (AZ) in that region was down, while the other AZ suffered overloading/capacity problems as the majority of the affected users tried to spool up instances in the remaining AZ in that region. US-West was still running fine the whole time.

    I figured adding some color was appropriate, since I keep seeing the same regurgitated headline version of the event, with 'outage' used as if the entire service was down or unavailable. In reality, it's more complicated, and it was analogous to a DC going down in the traditional sense.

    I'm sure that, as an HA expert, you set up systems that span multiple DCs and don't base your SLA on just one geographical area.


    I agree with wartungsfenster: reddit is not a mission-critical app. To further this discussion, here's a quick link to High Scalability's big list of articles related to the incident. You'll find that quite a few people survived the 'outage' just fine.

    http://highscalability.com/blog/2011...on-outage.html

  10. #10
    Quote Originally Posted by tchen
    For the record, the outage affected only a single region (US-East). Specifically, only one availability zone (AZ) in that region was down, while the other AZ suffered overloading/capacity problems as the majority of the affected users tried to spool up instances in the remaining AZ in that region. US-West was still running fine the whole time.
    But surely it's not about which platform went down; it's the 5 days it took to fix it that will be of concern.

  11. #11
    Join Date
    Jan 2011
    Location
    Canada
    Posts
    934
    Quote Originally Posted by Stratogen
    But surely it's not about which platform went down; it's the 5 days it took to fix it that will be of concern.
    Actually, the platform matters a great deal. If I were on a dual-DC service, then yes, 5 days of exposure is way too much. On a multi-DC service where each region is set up to be fully independent of the others (and within each region, each AZ is designed to be independent of the others), the 5 days shouldn't even be a concern.
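The reasoning here, that several independent zones make any single zone's outage tolerable, can be made concrete with a back-of-the-envelope calculation. This sketch assumes zone failures are statistically independent and that the service stays up as long as any one zone is up; the 99.9% per-zone figure is purely illustrative, not an Amazon number.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` zones is up.

    Assumes each zone fails independently with the same availability.
    Correlated failures (a shared control plane, a region-wide storage
    problem) violate this assumption and make the real figure lower.
    """
    if not 0.0 <= zone_availability <= 1.0 or zones < 1:
        raise ValueError("availability must be in [0, 1] and zones >= 1")
    p_all_down = (1.0 - zone_availability) ** zones
    return 1.0 - p_all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.999, n):.9f}")
```

Under the independence assumption, two zones at 99.9% each give about 99.9999%. The April incident discussed later in this thread is a reminder that AZs within one region did not fail independently, which is why the argument rests on fully independent regions.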

    And about the 5 days....

    http://aws.amazon.com/message/65648/

    * April 21 - outage occurs
    * April 22 - 97.8% of EBS volumes restored in 9 hrs, 13% of which were still stuck due to backplane capacity issues.
    * April 23 - expanded backplane installed for EBS, unsticking EBS vols.
    * April 24 - manual restoration from S3 backups for the remaining 2.2% of volumes.

    Ultimately, 0.07% of the volumes could not be restored in a consistent state. There's the timeline. You can compare it against other outages and decide for yourself. Of course, the '5-day outage' does make for a better cover story.

  12. #12
    The biggest concern is the attitude towards this type of hardware issue. Amazon is selling servers just like mobile phones or TVs, and support is like purchasing an extended warranty. They should realize that web hosting is more than that, and that the business is very critical for customers.

  13. #13
    Join Date
    Nov 2009
    Posts
    544
    Quote Originally Posted by ideamine
    The biggest concern is the attitude towards this type of hardware issue. Amazon is selling servers just like mobile phones or TVs, and support is like purchasing an extended warranty. They should realize that web hosting is more than that, and that the business is very critical for customers.
    It is very disheartening to see a response like this from someone advertising themselves as a server manager for hire, and one who lists Amazon EC2 as an area of expertise at that...

    Why would you require a response from support to tell you that the server is down? Simple monitoring should have told you this. Hopefully you did not wait 6 hours to spin up another instance.

    It is probably not Amazon's attitude that should be in question here... You should realize what it takes to keep your customer's business online.

  14. #14
    Hi,

    Are you from Amazon?

    We are monitoring our servers and found that it was down within 5 minutes. But if the server is entirely down due to a hardware issue on the host's side, what should we do? How long do we have to wait for a response?

  15. #15
    Join Date
    Apr 2002
    Location
    Auckland - New Zealand
    Posts
    1,572
    Amazon provide a good platform, with lots of functionality, bells and whistles, and they are perfect for those who know how to build clustered and failover applications.

    If you need support, then you can purchase it. For the most part, if you have something mission-critical, it can live quite happily on EC2, but you need to build in your HA yourself. They give you the keys; you are driving.

  16. #16
    I second StevenG: Amazon rocks.

  17. #17
    Join Date
    Feb 2008
    Location
    Philadelphia, PA
    Posts
    1,076
    Quote Originally Posted by Stratogen
    Do you really think cloud has a bad name? The cloud is dominating hosting in 2011, so clearly a lot of people have bought into the idea, and there are real business benefits.

    I guess the issue here is not about cloud per se, but the fact that a hosting provider like Amazon can have a 5-day outage. Any company with that amount of downtime is going to find it hard to win new customers.
    I think 90% of providers offer "cloud" hosting that isn't what most people think they're buying. When I think of cloud, I think of a COMPLETE fault-tolerant system. Most "cloud" hosts are offering these little VPS containers that can expand but have next to no fault tolerance, and they're calling them cloud. There is nothing cloud-like about that IMO.
    Last edited by Encrypted; 07-02-2011 at 12:20 PM.

  18. #18
    Join Date
    Dec 2001
    Location
    Atlanta
    Posts
    4,419
    Quote Originally Posted by Encrypted
    I think 90% of providers offer "cloud" hosting that isn't what most people think they're buying. When I think of cloud, I think of a COMPLETE fault-tolerant system. Most "cloud" hosts are offering these little VPS containers that can expand but have next to no fault tolerance, and they're calling them cloud. There is nothing cloud-like about that IMO.

    Unfortunately you are correct.

  19. #19
    Join Date
    Jan 2011
    Location
    Canada
    Posts
    934
    Quote Originally Posted by ideamine
    Hi,

    Are you from Amazon?

    We are monitoring our servers and found that it was down within 5 minutes. But if the server is entirely down due to a hardware issue on the host's side, what should we do? How long do we have to wait for a response?
    srfreeman, StevenG and I are all from Amazon. No, not really.

    Look, there are two ways this thread can continue.

    A) You keep up the attitude displayed above and ignore anyone who doesn't share your opinion that Amazon sucks. Or...

    B) You recognize that your knowledge of how EC2 works is currently lacking. Realize that an EC2 instance is ephemeral and can go away at any time for any number of reasons. This is widely documented in third-party blogs, and it is also clearly laid out in Amazon's own best practices.

    You need to build your own HA/failover, as StevenG suggested. Considering you're using Windows instances, there are likely some legacy support issues preventing you from making the required changes to your application, so your best bet is to use EBS.

    Remember to snapshot your EBS volumes regularly to another AZ (preferably another region too), as the SLA for EBS isn't 100% either. It's high, but there is a very slim chance that an EBS volume becomes unmountable. If you have a backup, then it's a moot point.
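Taking the snapshot-and-copy advice literally, a backup job might look like the following. This is a sketch written against today's boto3 EC2 client API (`create_snapshot`, the `snapshot_completed` waiter and `copy_snapshot`), which is an assumption about tooling rather than anything stated in this thread. The function only receives client objects as arguments, so it can be exercised with stubs. EBS snapshots are already usable from any AZ in their own region, so the cross-region copy is the step that protects against a whole-region problem.

```python
from typing import Any, Dict


def snapshot_and_copy(src_ec2: Any, dst_ec2: Any, volume_id: str,
                      src_region: str,
                      description: str = "scheduled backup") -> Dict[str, str]:
    """Snapshot an EBS volume and copy the snapshot to a second region.

    src_ec2 / dst_ec2 are assumed to be boto3 EC2 clients for the source
    and destination regions; any object with the same methods will do.
    """
    # 1. Snapshot the volume in its own region.
    snap = src_ec2.create_snapshot(VolumeId=volume_id, Description=description)
    snap_id = snap["SnapshotId"]

    # 2. Wait for the snapshot to finish before copying it.
    src_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap_id])

    # 3. copy_snapshot is called on the *destination* region's client.
    copy = dst_ec2.copy_snapshot(SourceRegion=src_region,
                                 SourceSnapshotId=snap_id,
                                 Description=f"copy of {snap_id}")
    return {"source": snap_id, "copy": copy["SnapshotId"]}
```

With real clients this would be called as `snapshot_and_copy(boto3.client("ec2", region_name="eu-west-1"), boto3.client("ec2", region_name="us-east-1"), "vol-...", "eu-west-1")`, where the region pair is only an example and the volume ID is a placeholder. Run on a schedule, old snapshots also need pruning, since each one is billed storage.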

  20. #20
    Join Date
    Dec 2001
    Location
    Atlanta
    Posts
    4,419
    Quote Originally Posted by tchen
    srfreeman, StevenG and I are all from Amazon. No, not really.

    Look, there are two ways this thread can continue.

    A) You keep up the attitude displayed above and ignore anyone who doesn't share your opinion that Amazon sucks. Or...

    B) You recognize that your knowledge of how EC2 works is currently lacking. Realize that an EC2 instance is ephemeral and can go away at any time for any number of reasons. This is widely documented in third-party blogs, and it is also clearly laid out in Amazon's own best practices.

    You need to build your own HA/failover, as StevenG suggested. Considering you're using Windows instances, there are likely some legacy support issues preventing you from making the required changes to your application, so your best bet is to use EBS.

    Remember to snapshot your EBS volumes regularly to another AZ (preferably another region too), as the SLA for EBS isn't 100% either. It's high, but there is a very slim chance that an EBS volume becomes unmountable. If you have a backup, then it's a moot point.
    Don't run mission-critical or revenue-producing business apps that you count on in EC2, and I think you will be fine with what you are getting for what you are paying.

    These points above are all good advice.

  21. #21
    Join Date
    Nov 2009
    Posts
    544
    Quote Originally Posted by ideamine
    Hi,

    Are you from Amazon?

    We are monitoring our servers and found that it was down within 5 minutes. But if the server is entirely down due to a hardware issue on the host's side, what should we do? How long do we have to wait for a response?
    Really, come now. A server manager...? If you find that a server is down, that is the time to start up your backup server. This would be the case with any host, not just Amazon (though Amazon makes this easy to do).

    After your customer is back online, if you are curious about why the original went down, check with support. The support response time is of no consequence.

    As the server manager, your response time in getting your customer back up is the only one that matters.

  22. #22
    As a paying customer, we have the right to know what happened to our servers. You say "The support response time is of no consequence", but support response time matters to a customer who is paying money for support. We have many client servers running on Rackspace Cloud, and it is not like that there.

  23. #23
    And do you know why major websites like Foursquare, Reddit, Quora, Heroku and Hootsuite were offline for more than 24 hours? Was it a lack of knowledge on the part of their respective server managers?

  24. #24
    Join Date
    Jan 2011
    Location
    Canada
    Posts
    934
    Quote Originally Posted by ideamine
    And do you know why major websites like Foursquare, Reddit, Quora, Heroku and Hootsuite were offline for more than 24 hours? Was it a lack of knowledge on the part of their respective server managers?
    Yes.

    From Heroku's own incident report at http://status.heroku.com/incident/151

    In their (Heroku's) own words

    Failures at the IaaS layer will happen. It's Heroku's responsibility to shield our customers from this; part of our value proposition is to abstract away these concerns. We failed at this in a big way this weekend, and our engineers are even as we speak hard at work on architectural changes that will allow us to handle infrastructure outages of this magnitude with less or no disruption to our customers in the future.

    There are three major lessons about IaaS we've learned from this experience:

    1. Spreading across multiple availability zones in single region does not provide as much partitioning as we thought.

    2. Block storage is not a cloud-friendly technology.

    3. Continuous database backups for all.
    From Quora's own engineers...

    Our main database and slave were still both operating in the broken AZ (a mistake that is very clear to me now)
    - emphasis mine.


    To be fair, the only ones who could arguably be absolved of responsibility are those that did span two AZs, but sadly only within the same US-East region. It DID take a lot of people by surprise (including Amazon) that EBS wasn't as isolated between those two AZs as designed.

    As for the rest of the companies you mentioned above, do a Google search. It's easy to find their respective postmortems (or use the High Scalability link I posted). The professionals are quite forthcoming in acknowledging where they failed and where they need to bolster their knowledge.
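Quora's admission quoted above (main database and slave both in the broken AZ) describes a placement mistake that is easy to check for mechanically. A minimal sketch, assuming you keep a simple hypothetical inventory mapping each redundant group (a database pair, a web pool) to the AZ of each member; the group names and zones below are made up for illustration.

```python
from typing import Dict, List


def single_az_groups(inventory: Dict[str, Dict[str, str]]) -> List[str]:
    """Return groups whose members all sit in one availability zone.

    `inventory` maps group name -> {member name: availability zone}.
    A group flagged here loses every member if that one AZ fails.
    """
    flagged = []
    for group, members in inventory.items():
        # One distinct AZ (including a single-member group) means no AZ redundancy.
        if members and len(set(members.values())) == 1:
            flagged.append(group)
    return flagged


if __name__ == "__main__":
    inventory = {
        # Hypothetical layout mirroring the Quora case.
        "main-db": {"primary": "us-east-1a", "replica": "us-east-1a"},
        "web": {"web1": "us-east-1a", "web2": "us-east-1b"},
    }
    print(single_az_groups(inventory))
```

As the post notes, spanning two AZs in the same region was still not enough in April, so the same check could be repeated with regions in place of zones.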

  25. #25
    Join Date
    Nov 2009
    Posts
    544
    Quote Originally Posted by ideamine
    As a paying customer, we have the right to know what happened to our servers. You say "The support response time is of no consequence", but support response time matters to a customer who is paying money for support. We have many client servers running on Rackspace Cloud, and it is not like that there.
    OK, so you feel you have a right to know something. By the same logic, your customers have a right to have their servers managed.

    It is really just a simple two-step process:
    1. Notice server is down.
    2. Start backup server.

    Automating this process is something you as a professional should strive for.
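Those two steps are simple enough to automate. A minimal sketch of a watchdog loop, assuming an HTTP health endpoint and a hypothetical `start_backup` callable standing in for whatever "start backup server" means in your setup (launching a replacement instance, moving an IP, promoting a replica); the thresholds are illustrative, not recommendations.

```python
import time
import urllib.request
from typing import Callable, Optional


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Treat any 2xx/3xx response within the timeout as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError):
        # URLError/HTTPError (4xx/5xx) and socket timeouts are OSError
        # subclasses; ValueError covers malformed URLs.
        return False


def watch_and_failover(is_healthy: Callable[[], bool],
                       start_backup: Callable[[], None],
                       failures_before_failover: int = 3,
                       interval_s: float = 60.0,
                       max_checks: Optional[int] = None,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Step 1: notice the server is down. Step 2: start the backup.

    Fails over only after several consecutive failed checks, so a single
    dropped probe does not trigger it. Returns True if failover happened,
    False if `max_checks` ran out first (None means watch forever).
    """
    consecutive_failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_before_failover:
                start_backup()
                return True
        sleep(interval_s)
    return False
```

Wired up, `watch_and_failover(lambda: http_ok("http://example.com/health"), promote_standby)` would poll once a minute, where `promote_standby` is a hypothetical function for your own environment. The monitor itself should run somewhere outside the infrastructure it is watching.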

    Calling support (Rackspace, Amazon or any leasing host) to hear what you already know ("Yep, server is down, we will get it back up as soon as possible.") is not any form of management. It is a curiosity at best. You should not really be asking these types of questions or paying for these types of answers; that is another type of management...

    Why do you feel that "Support response time is important..." for a server down issue?
