Achieving Disaster Recovery Compliance for AWS Workloads

This article was written in coordination with Arpio. Arpio provides teams with a solution for automating backup and disaster recovery of cloud resources and data across their AWS environments. You can check out their solution at https://arpio.io

Nearly every compliance regime will require organizations to consider and mitigate the business continuity risks associated with all forms of IT disaster. However, because these regimes are designed to be general purpose, they are intentionally vague on the details of what is actually required. They leave it to organizations and their auditors to envision an appropriate set of procedures and protections for the workload being assessed.

For applications that run in Amazon Web Services, though, the classes of disasters, and therefore their mitigations, can be clearly identified. In this article, we’ll detail what you need to know to implement a compliant disaster recovery process for your AWS environment.

What Is a Disaster?

Mitigating the risks of a disaster starts with understanding those risks, which come in the form of an outage, data loss, or sometimes both.

In AWS, disasters come in three forms. Here are some examples.

Availability Zone Outage

On the morning of August 31, 2019, power was lost at one of the datacenters in AWS’ Northern Virginia region. Backup generators immediately kicked in, but 90 minutes later they too began to fail. Within minutes, 7.5% of the virtual servers in the affected availability zone were no longer running.

Amazon was able to restore power less than 2 hours later, but recovering the impacted virtual servers took several hours longer. Due to the nature of these failures, some of the hardware in the datacenter was permanently damaged, and a number of EBS volumes could not be restored. Amazon worked for 4 days to recover the lost data, but in the end, some data was never recovered.

Regional Outage

On February 28, 2017, an AWS employee who was performing a routine maintenance operation mistyped a command on the command line. In doing so, he brought down a massive number of servers that support Amazon’s Simple Storage Service (S3).

S3 is a foundational service for AWS, which means this one service outage brought down functionality throughout the Northern Virginia region of AWS. The outage was so bad that even the AWS service status page was inaccessible. Amazon had to rely on Twitter to communicate updates on the outage.

It took Amazon 5 hours to perform a complete reboot of S3 and restore service. But since this region is the largest in AWS, the overall impact was immense. The economic damage was assessed at $250 million. Perhaps more importantly, though, the outage highlighted that AWS is so central to the workings of our economy that a multi-day outage from a natural disaster or some form of black swan event could be economically catastrophic.

Catastrophic Data Loss

Code Spaces was once an up-and-coming competitor to services like Microsoft’s GitHub and Atlassian’s Bitbucket. They differentiated their business by providing a boutique service, and frequently touted their focus on data security and disaster recovery as an advantage.

One morning, they woke up to find that their AWS account had been hacked, and the attackers were demanding a ransom. Rather than paying the ransom, Code Spaces decided to play hardball, and they attempted to revoke the attackers’ access.

Unbeknownst to Code Spaces, though, the attackers had installed a back door. When the attackers realized that Code Spaces had blocked their access, they used the back door to regain access to the account. They then proceeded to delete all of Code Spaces’ data, including all of their backups.

Having lost all of their customers’ critical data, Code Spaces was not able to stay in business. They closed up shop the next week.

What Does Compliance Require?

Different compliance regimes make different demands with respect to disaster recovery, but they all focus on maintaining sound data backup practices. To achieve compliance, organizations must document their recovery point objectives (RPO) and recovery time objectives (RTO), maintain offsite backups, and periodically test their ability to recover service from those backups.
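To make the documentation requirement concrete, here is a minimal sketch of how a team might record a workload's recovery point objective and recovery time objective and check them against observed values. The class name, function names, and the 1-hour and 4-hour targets are illustrative assumptions, not values prescribed by any compliance regime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RecoveryObjectives:
    """Documented targets for one workload (hypothetical structure)."""
    rpo: timedelta  # maximum tolerable window of data loss
    rto: timedelta  # maximum tolerable time to restore service


def meets_rpo(latest_backup: datetime, now: datetime,
              objectives: RecoveryObjectives) -> bool:
    """Return True if the newest backup is no older than the RPO."""
    return now - latest_backup <= objectives.rpo


def meets_rto(measured_recovery: timedelta,
              objectives: RecoveryObjectives) -> bool:
    """Return True if a timed recovery test finished within the RTO."""
    return measured_recovery <= objectives.rto


# Illustrative targets: lose at most 1 hour of data, recover within 4 hours.
objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
```

A check like this could run on a schedule against the timestamp of the most recent backup, and again after each recovery test, so that the results double as audit evidence.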

HIPAA requires that organizations implement backup and disaster recovery plans. This means that teams must have specific, documented procedures in place for both backup and disaster recovery.

164.308(a)(7)(ii)(A) – Data Backup Plan

164.308(a)(7)(ii)(B) – Disaster Recovery Plan

The SOC 2 Trust Services Criteria (TSC) require organizations to create backups in a remote location, create a business continuity plan, and test recovery of backups under the Additional Criteria for Availability category.

A1.2 – The entity authorizes, designs, develops or acquires, implements, operates, approves, maintains, and monitors environmental protections, software, data backup processes, and recovery infrastructure to meet its objectives.

A1.3 – The entity tests recovery plan procedures supporting system recovery to meet its objectives.

The HITRUST CSF requires organizations to create backups in a remote location, conduct testing of backups and the restoration process, as well as implement encryption and automate backups for higher-level implementations.

9.05 Information Back-Up

09.l Back-up

Implementing Disaster Recovery in AWS

Implementing a disaster recovery solution for an AWS workload involves mitigating the aforementioned risks of availability zone outage, region outage, and catastrophic data loss.

Mitigating an Availability Zone Outage

Applications that are built for AWS generally consider availability zone (AZ) resilience in their design. By architecting “clustered” components, and deploying those across multiple AZs, the workload is automatically resilient to an AZ outage. This is a high-availability solution; there is minimal downtime associated with a failure, and the application self-heals when the AZ outage occurs.
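As a rough illustration of what AZ resilience means for a clustered component, the sketch below checks whether a set of instances would still have capacity after losing any single availability zone. The function name and the `min_remaining` parameter are hypothetical, and the AZ names would come from your own inventory.

```python
from collections import Counter
from typing import Iterable


def survives_single_az_outage(instance_azs: Iterable[str],
                              min_remaining: int = 1) -> bool:
    """Return True if losing any one AZ leaves at least min_remaining instances.

    instance_azs holds one AZ name per running instance,
    for example ["us-east-1a", "us-east-1b"].
    """
    counts = Counter(instance_azs)
    if not counts:
        return False  # nothing deployed, so nothing survives
    total = sum(counts.values())
    # Simulate the loss of each AZ in turn and check what is left.
    return all(total - lost >= min_remaining for lost in counts.values())
```

For example, three instances spread across two zones pass the default check, while two instances in the same zone do not.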

Some workloads, though, are not built to deploy across multiple availability zones. For these workloads, it is possible to rely on traditional backup/restore to recreate the workload in an alternate availability zone should the primary AZ fail. However, most organizations are better off focusing on mitigating a region outage, as that mitigation covers AZ outages as well.

Mitigating a Region Outage

To mitigate a region outage, it is necessary to be able to operate an application in an alternate region. Some applications are architected so that they innately span multiple regions. These multi-region-active applications are rare, though, and the majority of applications focus on failing over to an alternate region in the event of a region outage.

Cross-region failover relies on the ability to re-create the application’s deployment, including all necessary AWS infrastructure, and to restore a recent backup of the production data into that new deployment. This obviously requires that a copy of the production data be stored in the alternate region before the outage occurs in the production region. These copies can be achieved via replication, snapshot copy, or traditional backup/restore depending on which AWS service stores the data.
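For data held in EBS, the snapshot copy approach might look like the sketch below. It assumes a boto3 EC2 client created for the destination region (for example `boto3.client("ec2", region_name="us-west-2")`), whose `copy_snapshot` call accepts `SourceRegion` and `SourceSnapshotId` and returns the new `SnapshotId`. The helper name and the `FakeEc2Client` stand-in are hypothetical; the fake only exists so the logic can be exercised without AWS credentials. Other services (RDS, S3, DynamoDB, and so on) use different mechanisms and are not covered here.

```python
from typing import Any, Dict, Iterable


def copy_snapshots_to_region(dest_client: Any, source_region: str,
                             snapshot_ids: Iterable[str]) -> Dict[str, str]:
    """Start a copy of each EBS snapshot into the destination client's region.

    dest_client is assumed to behave like a boto3 EC2 client bound to the
    alternate region. Returns a mapping of source ID to new snapshot ID.
    """
    copies: Dict[str, str] = {}
    for snapshot_id in snapshot_ids:
        response = dest_client.copy_snapshot(
            SourceRegion=source_region,
            SourceSnapshotId=snapshot_id,
            Description=f"DR copy of {snapshot_id} from {source_region}",
        )
        copies[snapshot_id] = response["SnapshotId"]
    return copies


class FakeEc2Client:
    """Offline stand-in that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls = []

    def copy_snapshot(self, **kwargs: str) -> dict:
        self.calls.append(kwargs)
        return {"SnapshotId": f"snap-copy-{len(self.calls)}"}
```

The copy only starts the transfer; a real procedure would also wait for each copy to complete and handle encrypted snapshots, which typically need a KMS key in the destination region.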

Mitigating a Catastrophic Data Loss

Catastrophic data loss may be accidental, or it could be malicious. To mitigate both scenarios, it’s necessary to maintain data backups, and to store those backups in a locked-down location. If data backups are stored in the same security realm as the data, a bad actor has the ability to delete (or hold for ransom) both the data and the backups.

In AWS, maintaining locked-down backups almost always entails copying backups into an alternate AWS account. This “bunker account” should have minimal access, and therefore not be vulnerable to the same attack vectors as the production account. When the data needs to be recovered, it can be restored in the bunker account, or shared with another account and restored there.

Pulling It All Together

We’ve covered the compliance requirements for disaster recovery, as well as the types of disasters that can happen in AWS. We’ve also covered the high-level mitigations for each disaster risk. But what is the quickest path to achieving compliance?

Most organizations opt for a single disaster recovery solution that covers all three types of disasters. By frequently replicating backups of critical data to an alternate region and storing them in a locked-down AWS account, they ensure that a recent copy of their data will be available in the event of an AZ outage, a region outage, or a catastrophic data loss. By then practicing or automating the recovery of their AWS infrastructure, and restoring the data backups into that infrastructure, they establish a compliant procedure that satisfies the most stringent audit requirements.
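One simple way to verify that a backup really lives outside production is to inspect its ARN, which follows the format `arn:partition:service:region:account-id:resource`. The sketch below is a minimal check under that assumption; the function names, account IDs, and example ARNs are hypothetical. Some resource types omit the region or account field from their ARNs, so the check treats those as unverified rather than isolated.

```python
from typing import NamedTuple


class ArnLocation(NamedTuple):
    region: str
    account_id: str


def parse_arn_location(arn: str) -> ArnLocation:
    """Extract the region and account ID fields from an ARN."""
    # Split into at most 6 parts so colons inside the resource are preserved.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return ArnLocation(region=parts[3], account_id=parts[4])


def is_isolated_backup(backup_arn: str, prod_region: str,
                       prod_account_id: str) -> bool:
    """Return True only if the backup is in another region AND another account."""
    location = parse_arn_location(backup_arn)
    if not location.region or not location.account_id:
        return False  # missing fields: isolation cannot be confirmed
    return (location.region != prod_region
            and location.account_id != prod_account_id)
```

Running a check like this over every backup copy gives a quick signal that both the cross-region and cross-account conditions hold. It does not verify that access to the alternate account is actually restricted.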

About Arpio

Arpio provides disaster recovery as a service for applications that run in AWS so that you don’t have to build it yourself. In 15 minutes, you can configure Arpio to fully automate cross-region and cross-account disaster recovery for your entire AWS account. Best of all, Arpio makes testing easy, so you can quickly satisfy all of your compliance requirements for DR.