Does anyone have a matrix of what is needed in the way of infrastructure to reach a certain SLA or uptime level? Several companies claim to guarantee 99.9% uptime or similar, but I have yet to find a single one that actually explains how they arrived at that particular percentage.
I am looking to offer clients different SLA levels, but what I really need are hard numbers on what each design will yield in terms of percentages.
E.g., a 5-node RHEL5 AS cluster with a failover Apache daemon, connected via bonded Ethernet interfaces to two different routers over an MPLS internet connection, seems really redundant to me. But what uptime percentage would that achieve compared to a single RHEL5 ES server running an Apache daemon, connected with a single Ethernet controller to the public internet?
I know I might be asking a bit much here, but I would like to take my expertise in designing HA systems and actually be able to put a label on it.
99.9% uptime is not really that impressive: that's about 8.8 hours of downtime per year. I'm from the old-school telecom world, where five nines is the norm; four nines (99.99%) is a good goal, roughly 4.4 minutes of downtime per month.
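The arithmetic behind those figures is easy to sanity-check. A minimal sketch, using an average year of 8766 hours:

```python
# Quick check on what each "nines" figure actually allows.
HOURS_PER_YEAR = 365.25 * 24  # average year, leap years included

def downtime_per_year(availability_pct):
    """Allowed downtime in hours per year for a given availability %."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR

for pct in (99.9, 99.99, 99.999):
    hours = downtime_per_year(pct)
    print(f"{pct}%: {hours:.2f} h/year ({hours * 60 / 12:.1f} min/month)")
```

99.9% works out to about 8.8 hours per year, 99.99% to about 4.4 minutes per month, and 99.999% to a little over 5 minutes per year.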
If I were going to design an HA DC setup, I would move to DC power with a redundant A/B feed for all servers, switches, and routers. Cleaner power implementation, IMO.
I would probably stay away from any Linux distro and use FreeBSD for more stability and fewer kernel issues. BSD also gives you the ability to use CARP for redundancy on servers that are not clustered.
Of course, use multiple upstreams/peers with BGP: diverse transport entry, east and west fiber entries, and dual edge routers (in separate racks).
Just FYI: I've had servers that exceeded 451 days of uptime.
I do agree that the goal should always be five nines.
But if you factor in economics, you would probably have to create a matrix and estimate the availability gain from each "upgrade" that you apply.
You would want to credit some gain for a redundant power supply, and another for a redundant network path, keeping in mind that these contributions don't simply add as flat percentage points. That said, there are several "upgrades" you can make to improve the solution, and if money were no object, a 99.999% system wouldn't be that hard to design and install.
What kind of percentage is added for:
1) Two redundant PSUs?
2) Two redundant network interfaces?
3) Bonded interfaces bond0 eth0:0/eth1:1 & eth0:1/eth1:0?
4) Two redundant network switches for the bonded interfaces?
5) Redundant internet connections on each network switch/router?
6) A three (or more) node cluster running RHEL5 AS?
7) A mirrored set of HDDs (RAID 1) for the operating system?
8) A striped set of mirrored HDDs (RAID 1+0) for the data?
9) A SAN storage LUN for the data?
10) A DAS storage area for the data?
11) Tape backup?
12) SAN snapshot backup?
13) A DR hosting site?
14) BGP Multihomed Internet connections?
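One pitfall with a matrix like the one above: availability contributions don't add as flat percentage points. Components in series multiply, and a redundant pair fails only when both copies fail. A minimal sketch of how that composes; the per-component availability numbers here are illustrative assumptions, not vendor figures:

```python
# Sketch of how component availabilities combine in series and in parallel.
# All per-component numbers below are assumptions for illustration only.

def series(*avail):
    """All components must be up: availabilities multiply."""
    result = 1.0
    for a in avail:
        result *= a
    return result

def parallel(a, n=2):
    """n redundant copies; the set fails only if all n fail."""
    return 1 - (1 - a) ** n

# Assumed single-component availabilities (illustrative):
psu, nic, switch, uplink, server = 0.999, 0.999, 0.9995, 0.995, 0.99

single_path = series(psu, nic, switch, uplink, server)
redundant_path = series(parallel(psu), parallel(nic), parallel(switch),
                        parallel(uplink), parallel(server, n=3))
print(f"single path:    {single_path:.5f}")
print(f"redundant path: {redundant_path:.5f}")
```

The takeaway is that redundancy attacks the weakest term: duplicating a 99% server lifts that term to 99.99%, while the series product of many merely-good components drags the whole chain down.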
Of course, the uptime or SLAs would always have to reflect the response time on hardware replacements and on-call response times.
But the fact that I can't find a single infrastructure matrix backing the SLA figures that hosting companies promise makes me wonder where they actually got their numbers.
Since I'm an IT architect, I want to be able to say what's needed to reach a specified percentage. The industry has flooded our customers with uptime SLA guarantees to the point that it's about the only thing they care about.
I don't think it's possible to generalize a "this architecture -> this availability" mapping. There are simply too many architectural options to choose from and too many application-specific variables to take into consideration, not to mention potentially significant real-world availability differences between infrastructure providers.
That said, there are common patterns that are used to target specific availability needs, and if you're targeting a few generic classes of applications you should be able to design a number of different architectures, analyze each for potential points of failure and make a risk assessment that leads to an estimated availability. In practice you'd likely still need to specialize this for a particular application but it can give you guidelines to start with.
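That risk-assessment step can be made concrete with the classic steady-state formula, availability = MTBF / (MTBF + MTTR), which also shows why failover matters as much as component quality. The MTBF and repair-time figures below are illustrative assumptions, not measured data:

```python
# Sketch: steady-state availability from mean time between failures (MTBF)
# and mean time to repair (MTTR). All figures are assumptions.

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# e.g. a server failing roughly every 2 years with a 4-hour
# hardware-replacement SLA:
a_single = availability(mtbf_hours=17520, mttr_hours=4)

# the same server behind a clustered failover that masks the
# outage in about 2 minutes:
a_cluster = availability(mtbf_hours=17520, mttr_hours=2 / 60)

print(f"single server: {a_single:.5f}")
print(f"with failover: {a_cluster:.6f}")
```

Note that failover doesn't change how often the hardware breaks; it shrinks the effective MTTR, which is where most of the extra nines come from.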
You might want to take a look at HighScalability, which profiles the architectures (and sometimes the failure modes) of quite a few big sites. It can give you a starting point for building some estimates.