ENG-883 Add ECR support proposal. #78
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,58 @@ | ||
| # Immutable and Air-Gapped ECR Backup Solution | ||
|
|
||
| This document outlines a robust solution for creating immutable backups of Amazon ECR container images. The approach leverages a combination of custom scripting, Amazon S3, and AWS Backup to provide an air-gapped, cross-account disaster recovery strategy that is resilient to account compromise. | ||
|
|
||
| ## Why an Immutable ECR Backup? | ||
| While Amazon ECR provides image replication, it lacks an immutable, long-term backup solution in a separate security boundary. In a disaster recovery (DR) scenario where a primary AWS account is compromised, standard replication is not sufficient. This solution addresses that by creating an "air-gapped" backup protected by an AWS Backup Vault Lock, which provides a Write-Once-Read-Many (WORM) model. | ||
|
|
||
| ## Solution Architecture | ||
|
|
||
| The solution consists of three main stages: | ||
|
|
||
| * Stage 1: ECR-to-S3 Backup: A scheduled process backs up container images from ECR to a source S3 bucket. | ||
| * Stage 2: Cross-Account Backup: AWS Backup automates the process of copying the S3 backups to a separate, dedicated "backup account." | ||
| * Stage 3: Immutable Vault Lock: An AWS Backup Vault Lock is applied to the destination vault, making the backups immutable for a defined period. | ||
|
|
||
| ## Step-by-Step Implementation | ||
|
|
||
| ### Stage 1: ECR to S3 Backup | ||
|
Contributor
Just for reference, on MESH, when we push an image to ECR, we also push the tarball to S3 - the build will fail if both aren't completed. |
||
|
|
||
| This stage involves creating a scheduled Lambda function that pulls images from ECR and pushes them to an S3 bucket. The steps are as follows: | ||
| 1. A scheduled event is triggered (e.g., daily) by Amazon EventBridge. | ||
| 2. The schedule event triggers an AWS Lambda function. | ||
| 3. The Lambda function lists all repositories in the ECR registry. | ||
| 4. For each repository, the function lists all image tags. | ||
| 5. For each image tag (or a subset of tags), the function pulls the image using the Docker CLI. | ||
|
Contributor
How would an efficient "incremental" backup cohort be identified? Would the backup complete within 15 minutes? I don't know how large some people's images are, but there's a 10 GB ephemeral storage limit in Lambda so, for some, that might be exceeded, or at least there would need to be some housekeeping.
Contributor
If we use skopeo this shouldn't be a problem. It'll do the copy in layers (which is the unit of increment we have available).
Contributor
How do you do this in MESH? |
||
| 6. The function then pushes the image to a designated S3 bucket in the source account, organizing images by repository and tag for easy retrieval. | ||
|
Comment on lines +23 to +26
Contributor
Rather than hand-crank this, we'd be better off trying to leverage https://github.com/containers/skopeo?tab=readme-ov-file#syncing-registries or similar. If we can get an s3fs-fuse mount into a container Lambda, it should Just Work. We might want to get a spike in to prove it out, but I'd much rather not have to build this bit ourselves. |
||
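As a rough illustration of steps 3 to 6, and of how an "incremental" cohort might be chosen, the sketch below enumerates tagged images and skips any whose S3 key already exists. It is only a sketch under stated assumptions: the key layout (`<repository>/<tag>/<digest>.tar`) and the helper names `ImageRef`, `s3_key_for`, `list_tagged_images` and `images_to_back_up` are hypothetical and not part of the proposal. The `ecr` argument is assumed to behave like a boto3 ECR client (`describe_repositories` and `list_images` paginators).

```python
from __future__ import annotations

from typing import Any, Iterator, NamedTuple


class ImageRef(NamedTuple):
    repository: str
    tag: str
    digest: str


def s3_key_for(image: ImageRef) -> str:
    """Hypothetical key layout: <repository>/<tag>/<digest>.tar."""
    # ECR digests look like "sha256:abc..."; swap the colon for a dash in the key.
    return f"{image.repository}/{image.tag}/{image.digest.replace(':', '-')}.tar"


def list_tagged_images(ecr: Any) -> Iterator[ImageRef]:
    """Yield every tagged image in the registry (steps 3 and 4)."""
    for repo_page in ecr.get_paginator("describe_repositories").paginate():
        for repo in repo_page["repositories"]:
            name = repo["repositoryName"]
            image_pages = ecr.get_paginator("list_images").paginate(
                repositoryName=name, filter={"tagStatus": "TAGGED"}
            )
            for image_page in image_pages:
                for image_id in image_page["imageIds"]:
                    yield ImageRef(name, image_id["imageTag"], image_id["imageDigest"])


def images_to_back_up(ecr: Any, existing_keys: set[str]) -> list[ImageRef]:
    """Assumed incremental rule: skip images whose key is already in the bucket."""
    return [img for img in list_tagged_images(ecr) if s3_key_for(img) not in existing_keys]
```

In a Lambda handler, `ecr` would typically be `boto3.client("ecr")` and `existing_keys` would come from listing the source bucket. Because the digest is part of the key, a tag that has been re-pointed to a new image is picked up as a new backup rather than silently skipped.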
|
|
||
|
|
||
| ### Stage 2: Cross-Account Backup with AWS Backup | ||
|
|
||
| This functionality already exists in the existing blueprint solution. | ||
|
|
||
| ### Stage 3: Enable Immutability with Vault Lock | ||
|
|
||
| This functionality already exists in the existing blueprint solution. | ||
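The blueprint's own implementation is not shown in this proposal. Purely for orientation, the sketch below builds the arguments that a compliance-mode Vault Lock would take if it were applied through the AWS Backup `PutBackupVaultLockConfiguration` API. The function name `vault_lock_params`, the vault name and the retention values are hypothetical.

```python
from typing import Dict, Union


def vault_lock_params(
    vault_name: str,
    min_retention_days: int,
    max_retention_days: int,
    cooling_off_days: int = 3,
) -> Dict[str, Union[str, int]]:
    """Build keyword arguments for put_backup_vault_lock_configuration.

    Supplying ChangeableForDays is what selects compliance mode: once the
    cooling-off period has passed, the lock can no longer be changed or removed.
    """
    if cooling_off_days < 3:
        # AWS Backup requires a cooling-off period of at least 3 days.
        raise ValueError("cooling_off_days must be 3 or greater")
    if min_retention_days > max_retention_days:
        raise ValueError("min_retention_days cannot exceed max_retention_days")
    return {
        "BackupVaultName": vault_name,
        "MinRetentionDays": min_retention_days,
        "MaxRetentionDays": max_retention_days,
        "ChangeableForDays": cooling_off_days,
    }
```

With boto3 this would be passed as `boto3.client("backup").put_backup_vault_lock_configuration(**vault_lock_params("ecr-backup-vault", 30, 365))`, where the vault name and day counts are placeholders.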
|
|
||
| ## Summary of Benefits | ||
|
|
||
| * Immutability: The AWS Backup Vault Lock in Compliance mode ensures your backups cannot be tampered with. | ||
| * Air-Gapped Security: Backups are stored in a separate AWS account, isolating them from any compromise of your production environment. | ||
| * Centralized Management: AWS Backup handles the scheduling, retention, and lifecycle management of your S3 backups. | ||
| * Cost-Effective: Only the objects in the S3 bucket are backed up, and AWS Backup lifecycle rules expire older recovery points automatically. Transition to a cold storage tier is only available for resource types that support it, so this should be confirmed for S3 backups. | ||
|
|
||
|
|
||
| ## Alternative Solution | ||
|
|
||
| ECR does provide cross-account replication, which can be used as a simpler alternative to this solution. However, it does not provide the same level of immutability and air-gapped security as the proposed solution. If you choose to use ECR replication, ensure that you have appropriate lifecycle policies and access controls in place to protect your images. | ||
|
|
||
| ### Steps to set up ECR replication: | ||
|
|
||
| 1. In the source account, create a replication rule to publish to the vault account. | ||
| 2. In the destination account, create a repository to receive the replicated images. | ||
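| 3. Ensure that the IAM roles and the destination registry's permissions policy are correctly configured to allow replication between accounts (for example, granting the source account `ecr:ReplicateImage`). | ||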
| 4. Implement lifecycle policies to manage the retention and deletion of images in the destination account. | ||
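The replication steps above could be expressed roughly as follows. This is a sketch, not a tested configuration: the helper names, account IDs, region and prefix are hypothetical, and the structures are intended to match the shapes accepted by the ECR `PutReplicationConfiguration` (source account) and `PutRegistryPolicy` (destination account) APIs.

```python
import json
from typing import Any, Dict, List


def replication_configuration(
    dest_account_id: str, dest_region: str, repo_prefixes: List[str]
) -> Dict[str, Any]:
    """Source-account replication rule pointing at the vault account's registry."""
    rule: Dict[str, Any] = {
        "destinations": [{"region": dest_region, "registryId": dest_account_id}]
    }
    if repo_prefixes:
        # Optional: replicate only repositories whose names start with these prefixes.
        rule["repositoryFilters"] = [
            {"filter": prefix, "filterType": "PREFIX_MATCH"} for prefix in repo_prefixes
        ]
    return {"rules": [rule]}


def destination_registry_policy(
    source_account_id: str, dest_account_id: str, dest_region: str
) -> str:
    """Destination-account registry policy allowing the source account to replicate."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReplicationFromSource",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{source_account_id}:root"},
                "Action": ["ecr:CreateRepository", "ecr:ReplicateImage"],
                "Resource": f"arn:aws:ecr:{dest_region}:{dest_account_id}:repository/*",
            }
        ],
    }
    return json.dumps(policy)
```

With boto3, the first result would be passed as `replicationConfiguration=` to `put_replication_configuration` in the source account, and the second as `policyText=` to `put_registry_policy` in the destination account.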
|
|
||
| ### Considerations | ||
|
|
||
| * ECR replication does not provide immutability; images can be deleted or overwritten. | ||
|
Contributor
There's https://docs.aws.amazon.com/AmazonECR/latest/userguide/image-tag-mutability.html - but yes, a bad actor with admin access could delete images. |
||
What about restoration? For reference, on MESH, the tarballs of the images are in the S3 remote immutable backup. For restoration, we fetch the tarball, docker load it, then push to ECR.
There is an additional proposal for a restoration feature to be built into the blueprint as well.
The proposal doesn't include ECR at the moment but gives the framework for how restoration will happen through the blueprint.
https://github.com/NHSDigital/terraform-aws-backup/pull/79/files
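For the MESH-style restoration mentioned in this thread (fetch the tarball, `docker load` it, push to ECR), a minimal sketch of the Docker steps might look like this. It assumes the tarball has already been downloaded from the restored S3 backup to a local path. The function names and example references are hypothetical, and this is not the restoration design from the linked proposal.

```python
import subprocess
from typing import Callable, List


def restore_commands(tarball_path: str, loaded_ref: str, ecr_ref: str) -> List[List[str]]:
    """Docker CLI invocations to load a saved image and push it back to ECR."""
    return [
        ["docker", "load", "--input", tarball_path],  # import the saved image tarball
        ["docker", "tag", loaded_ref, ecr_ref],       # point the image at the target ECR repo
        ["docker", "push", ecr_ref],                  # upload to ECR (requires a prior docker login)
    ]


def run_all(commands: List[List[str]], runner: Callable[..., object] = subprocess.run) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        runner(command, check=True)
```

Here `loaded_ref` is the name the image had when it was saved, and `ecr_ref` is the full registry path of the target repository, for example `<account>.dkr.ecr.<region>.amazonaws.com/team/app:v1`.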