Cloud providers such as Amazon Web Services, Azure, Google and most others publish a Service Level Agreement, or SLA, for each individual service they provide. Architects, platform engineers and developers are then responsible for combining these services into an architecture that hosts an application.
Taken in isolation, these services usually provide something in the range of three to four nines of availability:
However, when services are combined in an architecture, any one component could suffer an outage, so the overall availability is not equal to that of the component services.
In this example there are three possible failure modes:
Therefore the overall availability of this "system" must be lower than 99.95%. My rationale for thinking this is that if the SLA for both services was:
The service will be available 23 hours out of 24
Then:
Both component parts are within their SLA but the total system was unavailable for 2 hours out of 24.
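To put numbers on the 23-out-of-24 example above, here is a minimal sketch of two ways to combine components that must all be up at once. The function names are my own, and the first calculation assumes the components fail independently, which an SLA does not promise. The second is the worst case described above, where the outages never overlap and the downtime simply adds up.

```python
from math import prod


def serial_independent(availabilities):
    """Availability when every component must be up, assuming
    failures are statistically independent (an assumption)."""
    return prod(availabilities)


def serial_worst_case(availabilities):
    """Lower bound: outages never overlap, so the unavailability
    of each component adds to the total."""
    total_unavailability = sum(1.0 - a for a in availabilities)
    return max(0.0, 1.0 - total_unavailability)


# Hypothetical SLA from the example: available 23 hours out of 24.
sla = 23 / 24
components = [sla, sla]  # App Service and Database

print(f"independent failures: {serial_independent(components):.4%}")
print(f"worst case (no overlap): {serial_worst_case(components):.4%}")
```

The worst case comes out at 22 hours out of 24, matching the two hours of downtime in the example. The independent case is slightly better, because some of the outages are expected to coincide.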
In this architecture there are a large number of failure modes, but principally:
Because Traffic Manager acts as a circuit breaker, it can detect an outage in either region and route traffic to the working region. However, Traffic Manager itself is still a single point of failure, so the total availability of the "system" cannot be higher than 99.99%.
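One way to estimate the two-region design is to treat each region as App Service and Database in series, the two regions as redundant (parallel), and Traffic Manager in series in front of both. The sketch below uses the 99.95% and 99.99% figures mentioned in this post. It assumes independent failures and instant, perfect failover, neither of which is guaranteed in practice, and the helper names are hypothetical.

```python
from math import prod


def in_series(*availabilities):
    """All components must be up (independent failures assumed)."""
    return prod(availabilities)


def in_parallel(*availabilities):
    """At least one component must be up (independent failures
    and perfect failover assumed)."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# SLA figures used in this post, treated as actual availability.
TRAFFIC_MANAGER = 0.9999
APP_SERVICE = 0.9995
SQL_DATABASE = 0.9995

region = in_series(APP_SERVICE, SQL_DATABASE)
both_regions = in_parallel(region, region)
system = in_series(TRAFFIC_MANAGER, both_regions)

print(f"single region:   {region:.6%}")
print(f"two regions:     {both_regions:.6%}")
print(f"with front door: {system:.6%}")
```

Under these assumptions the redundant pair is far more available than a single region, but the result is still capped just below the Traffic Manager figure, which is consistent with the single point of failure argument above.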
How can the compound availability of the two systems above be calculated and documented for the business, bearing in mind that rearchitecting may be required if the business wants a higher service level than the architecture can provide?
If you want to annotate the diagrams, I built them in Lucid Chart and created a multi-use link. Bear in mind that anyone can edit this, so you might want to create a copy of the pages to annotate.
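For documenting this for the business, it may help to express an availability figure as a downtime budget. A rough sketch follows; the function name and the 30-day month are my own assumptions, and a provider's SLA may define its measurement period differently.

```python
def allowed_downtime_minutes(availability, period_hours):
    """Minutes of downtime permitted in a period at a given availability."""
    return (1.0 - availability) * period_hours * 60


HOURS_PER_30_DAY_MONTH = 30 * 24  # assumed billing period

for label, availability in [("99.95%", 0.9995), ("99.99%", 0.9999)]:
    minutes = allowed_downtime_minutes(availability, HOURS_PER_30_DAY_MONTH)
    print(f"{label}: about {minutes:.2f} minutes per 30-day month")
```

A compound availability calculated for the whole architecture can be passed through the same function to show the business what the figure means in practice.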
Lowest SLA from SPOF, assuming your app is able to cope with the session breaking?

@Tensibai - I don't think it can be, based upon my first example: *if* the SLA for both services was "it will be available 23 hours out of 24", then the App Service could be out between 0100 and 0200 and the Database out between 0500 and 0600. Both component parts are within their SLA but the total system was unavailable for 2 hours out of 24. Make sense?

Yep, makes sense, but in this case the result should be the product of all, no? I mean app 99.95 x sql 99.95 should be the overall availability of the group.

Keep in mind also that you can build a system that's more reliable than its components, through retries or failovers or degradation instead of full failure.