Originally Posted by Stratogen
I guess the issue here is not about cloud per se, but the fact a hosting provider like Amazon can have a 5 day outage. Any company with that amount of downtime is going to find it hard to win new customers.
For the record, the outage affected only a single region (US-East). Specifically, only one availability zone (AZ) in that region was down, while the other AZs there suffered overloading and capacity problems as the majority of affected users tried to spin up instances in them. US-West kept running fine the whole time.
I figured adding some color was appropriate, since I keep seeing the same regurgitated headline version of the event, with 'outage' used as if the entire service was down or unavailable. In reality, it's more complicated: the event was analogous to a single DC going down in the traditional sense.
I'm sure that, as an HA expert, you set up systems that span multiple DCs and don't base your SLA on just one geographical area.
I agree with wartungsfenster - reddit is not a mission-critical app. To further this discussion, here's a quick link to High Scalability's big list of articles related to the incident. You'll find that quite a few people survived the 'outage' just fine.