High Availability on AWS (part 1): What is HA? and other key considerations

Thursday, 25 May, 2017   /   Peter Joseph


What is HA anyway?

Before you set out to build your “Highly Available” service in the cloud, it’s essential to understand exactly what this means.

No Official Definition Appears to Exist

Unfortunately, to the best of my knowledge and after some additional research, there is no official definition of what level of availability a service must actually provide to be considered “Highly Available”.

The nearest thing to a definition appears to come from the Harvard Research Group’s 'Availability Environment Classifications' paper.

Here are the two highest classifications:

  • AE4 - Business functions that demand continuous computing and where any failure is transparent to the user. This means no interruption of work; no transactions lost; no degradation in performance; and continuous 24x7 operation.
  • AE3 - Business functions that require uninterrupted computing services, either during essential time periods, or during most hours of the day and most days of the week throughout the year. This means that the user stays on-line. However, the current transaction may need restarting and users may experience some performance degradation.

Many are likely to consider the AE4 and AE3 classifications as having ‘High Availability’ characteristics.

In order to define anything as “High Availability” we need to first have an appreciation of what ‘Availability’ really means.

What Do We Mean by Availability?

Traditionally, we have spoken of availability in terms of system availability, as measured against an agreed metric. Availability is often used synonymously with 'up-time'. There is no problem referring to availability as up-time as long as the meaning is kept consistent.

In reality there is usually a significant difference between the technical meaning of up-time of a system and the availability of the system or service itself. Simply stated, a system can be ‘up’, but at the same time be unavailable, as critical services and functions are impaired.

How many ‘nines’?

In practice, availability is commonly quantified in ‘nines’: the percentage of time for which a system, judged against its availability criteria, is actually available. Below is a table of the availability numbers in ‘nines’ that are commonly discussed.

| Availability % | Downtime per year | Downtime per month | Downtime per day |
|---|---|---|---|
| 99% (“two nines”) | 3.65 days | 7.20 hours | 14.4 minutes |
| 99.5% (“two and a half nines”) | 1.83 days | 3.60 hours | 7.2 minutes |
| 99.9% (“three nines”) | 8.76 hours | 43.8 minutes | 1.44 minutes |
| 99.95% (“three and a half nines”) | 4.38 hours | 21.9 minutes | 43.2 seconds |
| 99.99% (“four nines”) | 52.56 minutes | 4.38 minutes | 8.64 seconds |
| 99.999% (“five nines”) | 5.26 minutes | 25.9 seconds | 864 milliseconds |
| 99.9999% (“six nines”) | 31.5 seconds | 2.59 seconds | 86.4 milliseconds |
| 99.99999% (“seven nines”) | 3.15 seconds | 262.97 milliseconds | 8.64 milliseconds |
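The downtime figures in the table follow directly from the availability percentage. As a minimal sketch, the arithmetic might look like the following. The function name is hypothetical and a 365-day year is assumed; per-month figures depend on the month length you choose, so the period length is passed in explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_days: float) -> float:
    """Return the downtime budget, in seconds, over a period of `period_days` days.

    Hypothetical helper for illustration: downtime = (1 - availability) * period.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumption: a year is 365 days.
    for percent in (99.0, 99.9, 99.99, 99.999):
        per_year_hours = allowed_downtime_seconds(percent, 365) / 3600
        per_day_seconds = allowed_downtime_seconds(percent, 1)
        print(f"{percent}%: {per_year_hours:.2f} hours/year, {per_day_seconds:.2f} seconds/day")
```

Passing 30 for `period_days` gives a per-month budget for a 30-day month; a different month length will shift those numbers slightly.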

 

Availability criteria: what does ‘available’ actually mean?

For a system or service to be considered available at all, a set of availability criteria must be defined which, if met, will cause the system to be deemed ‘available’.

These availability criteria must be measurable to be useful. The measurable criteria become the ‘availability metrics’.

  • Qualitative/Subjective/Un-measurable Criteria
    • “The system must always be responsive to users’ requests”
    • “The system should never be slow”
    • “The system should always be fast”
    • “Users shouldn’t be waiting too long for things to happen”.
  • Quantitative “Good” Metrics
    • “User requests completed successfully in less than 3 seconds”
    • “API Response time to complete requests below 25ms”
    • “HTTP Page Not Found Errors (404)s below 0.1% of all website hits”.

In practice, availability status is not usually based on a single criterion; all the availability metrics are combined to produce a single availability status.
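As a rough sketch of how several metrics might be combined into one status, the example below uses the three quantitative thresholds listed above. The class, field and function names are hypothetical, and the rule that every metric must pass in a measurement window is an assumption; your own criteria may weight or combine metrics differently.

```python
from dataclasses import dataclass


@dataclass
class MetricSample:
    """Measurements for one monitoring window (all field names are hypothetical)."""
    request_seconds: float   # time taken to complete user requests
    api_response_ms: float   # API response time
    not_found_ratio: float   # fraction of website hits returning HTTP 404


# Thresholds taken from the example metrics in the list above.
MAX_REQUEST_SECONDS = 3.0
MAX_API_RESPONSE_MS = 25.0
MAX_NOT_FOUND_RATIO = 0.001  # 0.1% of all hits


def is_available(sample: MetricSample) -> bool:
    """Assumption: the service counts as available only if every metric passes."""
    return (
        sample.request_seconds < MAX_REQUEST_SECONDS
        and sample.api_response_ms < MAX_API_RESPONSE_MS
        and sample.not_found_ratio < MAX_NOT_FOUND_RATIO
    )


def measured_availability(samples: list[MetricSample]) -> float:
    """Percentage of monitoring windows in which the service was available."""
    if not samples:
        raise ValueError("at least one sample is required")
    good_windows = sum(1 for sample in samples if is_available(sample))
    return 100.0 * good_windows / len(samples)
```

The percentage returned by `measured_availability` is the number that would then be compared against the agreed ‘nines’ target.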

Build the Availability You Need, Not The Availability You Want

Once a definition of availability is determined for a desired system, an actual availability number can be agreed on. How much availability do you want? This is the wrong question. The real question must be:

How much availability do you need?

The availability needs of a system must be driven by the needs of the users and ultimately the business that depends on it. A back office system that is used by a small number of employees twice a week may not even need 99% availability (~14 minutes of downtime per day). An ecommerce platform supporting a million users an hour is likely to require at least ‘five nines’ (99.999%) availability.

Why Not Just Make Everything ‘six nines’?

The first answer usually given is cost. Cost is certainly a major element, but the complexity of a highly available system increases too. It is this complexity that often drives up the cost and, ironically, can also make the design availability target harder to achieve. The more elements a system has, the more reliable each element has to be to achieve a defined high availability target.
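To illustrate that last point, here is a minimal sketch of the usual series model, in which the service is only available when every element is available. It assumes the elements fail independently, which real systems often do not; the function names are hypothetical.

```python
import math


def series_availability(component_availabilities: list[float]) -> float:
    """Availability (0..1) of a chain in which every component must be up.

    Assumption: component failures are independent, so availabilities multiply.
    """
    return math.prod(component_availabilities)


def required_component_availability(target: float, component_count: int) -> float:
    """Per-component availability needed for identical components in series."""
    return target ** (1 / component_count)


if __name__ == "__main__":
    chain = [0.999] * 5  # five components, each at 'three nines'
    print(f"Overall: {series_availability(chain):.5f}")
    print(f"Needed per component for 99.9% overall: "
          f"{required_component_availability(0.999, 5):.5f}")
```

Under these assumptions, five ‘three nines’ elements in series give only about 99.5% overall, and each element would need roughly 99.98% to reach 99.9% end to end.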

 

This blog is based on a presentation from Peter Joseph at a recent Soltius AWS Meetup event. Click here to watch the full presentation replay.

 
