With the recent outage on AWS last week, there have been a lot of discussions about various architectures and how effective they would've been at combating the downtime. Since these vary a lot in cost, complexity, and tradeoffs, I want to go through a couple of them at a high level, then dive into one that was unfortunately missing from much of the conversation.
First came comments about the value of going multi-cloud. The idea of this is to run your application in… multiple clouds. By spreading the load across multiple providers, you are insulated in case one of them goes down. In theory, it sounds great! Surely two cloud vendors won't go down at the same time! In practice, this is horrible at the application level for a multitude of reasons:
- The infrastructure is different for each cloud
- Deployments gain an order of magnitude of complexity
- Bandwidth charges to go between the two are outrageous
Because of this, a multi-cloud architecture would not be a feasible option for high availability (outside a select few edge cases).
Next, there were talks about going multi-region. An AWS Region is a group of multiple Availability Zones (AZs), with each AZ being one or more discrete data centers with independent power, networking, and connectivity. Operating in a single region across multiple AZs provides high availability, but doesn't provide disaster recovery (DR). For that, you need multiple regions. Even a very simplified multi-region setup fixes a couple of the issues of going multi-cloud:
- Your application will stay in the same cloud, so the infrastructure will stay the same
- Regions are completely separated, so you get the same availability benefits!
- Region-to-region bandwidth charges are much lower than cloud-to-cloud fees!
Unfortunately, most of the comments were around Active-Active multi-region. That is, distributing the load across multiple regions at the same time. This adds a lot of complexity around keeping the persistence layer in sync, complicates deployments, and leaves many places for things to go wrong. It can lead to more self-inflicted downtime than AWS has ever caused.
An Active-Passive multi-region setup is the one that has been largely overlooked in recent days. The idea is that only a single region is active at a time, and a secondary region is capable of taking over in the event of a disaster (hence DR). This shares the benefits listed above, while largely mitigating the complexities of a full Active-Active setup. Under this setup, the secondary region doesn't need to be fully built out - only persistent data needs to be replicated.
But wait, won't it take a while to deploy the full application stack in the event of a disaster? Yes… yes it will. And this is okay! The high availability provided by multiple AZs is sufficient for most common outages. If an entire region has issues, like we saw last week, spending <1 hour standing up a new stack from backups is still preferable to a >8 hour outage. This process can be streamlined via automation, but even if it's a manual (but practiced) operation, the fact that you actually have options is important.
So let's dive into this a little more to explore what it looks like:
- Application is deployed as usual in the primary region
- Using AWS managed services, backups and replication for persistent data are generally a configuration setting or two away:
- Add a read replica to RDS in a different region
- Create a DynamoDB global table
- Enable S3 bucket replication
- In the event of a fail-over, deploy the application in the other region (hopefully using Infrastructure-as-Code) and update DNS settings
- This process should be regularly tested
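As noted above, the replication pieces are mostly a configuration setting or two away. As a rough sketch (resource names, regions, account IDs, and ARNs below are placeholders, not real values), the equivalent AWS CLI commands look something like:

```shell
# Cross-region RDS read replica: the source is referenced by its ARN,
# and the command runs against the *destination* region.
aws rds create-db-instance-read-replica \
  --region us-west-2 \
  --db-instance-identifier app-db-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:app-db

# Add a replica region to an existing DynamoDB table (global tables).
aws dynamodb update-table \
  --region us-east-1 \
  --table-name app-table \
  --replica-updates '[{"Create": {"RegionName": "us-west-2"}}]'

# Enable S3 replication to a bucket in another region. The rules and the
# replication IAM role live in the referenced JSON file, and both buckets
# must have versioning enabled.
aws s3api put-bucket-replication \
  --bucket app-bucket-us-east-1 \
  --replication-configuration file://replication.json
```

In a real setup these would live in your Infrastructure-as-Code alongside the rest of the stack, but the point stands: each one is a single managed-service setting, not a new system to operate.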
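The fail-over step itself can also be scripted ahead of time. A minimal sketch of the DNS cutover (hosted zone ID, record name, and target endpoint are placeholders) using Route 53:

```shell
# Repoint the application's DNS record at the standby region's endpoint.
# UPSERT creates or updates the record; a low TTL keeps the cutover fast.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1EXAMPLE \
  --change-batch '{
    "Comment": "DR failover to us-west-2",
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "app-lb.us-west-2.elb.amazonaws.com"}]
      }
    }]
  }'
```

This is exactly the kind of command that belongs in the regularly tested runbook - if the first time you run it is during an outage, it will not go smoothly.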
Is this a silver bullet? Absolutely not. It won't work for every workload, and definitely won't work for every type of outage. However, it is a relatively simple solution that can also be cost-effective. And last week, it could've allowed clients to be back up and functioning long before services in us-east-1 were fully restored.
In conclusion, outages happen. This does not diminish the value of AWS in any way, but it does make clear the importance of good architecture and planning. There are very expensive and elaborate systems that can be designed to mitigate these outages, but they are overkill and impractical for most clients. Fortunately, there are other options that may offer an "effective enough" solution with reasonable tradeoffs, and those should become "best practice" when working in AWS.